Serving Hardware Sizing and Capacity Planning
Modern AI systems rarely fail because a model is unavailable. They fail because capacity is misread: tokens are cheaper than expected until a spike arrives, latency looks fine until the tail collapses, an innocuous feature doubles average context length, or a queue forms and never drains. Serving is not training-with-smaller-batches. It is a live production workload with demand uncertainty, strict latency targets, and an economic shape that can swing by orders of magnitude when a single variable moves.
A practical sizing discipline treats serving as a flow problem. Requests arrive with a distribution of prompt lengths, output lengths, and tool calls. The system converts that flow into GPU work and memory pressure. Capacity planning is the act of turning those distributions into hardware requirements with explicit safety margins, then verifying the plan under realistic traffic.
The serving workload has three resource bottlenecks
Serving consumes three resources that behave differently.
- Compute throughput: the matrix multiplications and attention operations that create tokens.
- Memory bandwidth and movement: the cost of reading weights and activations and moving them through the memory hierarchy.
- Stateful memory footprint: the weight memory plus the per-request KV cache and other per-session state.
Serving rarely saturates all three at once. A configuration that is compute-limited at short contexts can become memory-limited at longer contexts. A configuration that looks stable at average load can fall apart because KV cache growth pushes the system into eviction and recomputation.
A reliable sizing practice begins with explicit identification of the dominant bottleneck for the target traffic, then validates that the bottleneck remains dominant across expected variation.
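As a rough first pass, the dominant bottleneck for a dense decode step can be identified with a roofline-style comparison of the workload's arithmetic intensity against the hardware's compute-to-bandwidth ratio. A minimal sketch, where the peak-throughput and bandwidth figures are illustrative assumptions, not a spec:

```python
# Sketch: classify a dense decode step as compute- or bandwidth-bound
# with a simple roofline argument. All hardware numbers are illustrative.

def decode_bottleneck(batch_size: int,
                      peak_tflops: float,
                      mem_bw_tbps: float,
                      bytes_per_param: int = 2) -> str:
    """Return 'compute' or 'bandwidth' for a dense decode step.

    Each generated token does roughly 2 FLOPs per parameter, and each
    decode step reads every parameter once (bytes_per_param bytes),
    amortized across the batch. Arithmetic intensity therefore grows
    linearly with batch size.
    """
    intensity = 2.0 * batch_size / bytes_per_param       # FLOPs per byte moved
    machine_balance = (peak_tflops * 1e12) / (mem_bw_tbps * 1e12)
    return "compute" if intensity > machine_balance else "bandwidth"

# Illustrative H100-like numbers: ~990 TFLOPs FP16, ~3.35 TB/s HBM.
print(decode_bottleneck(batch_size=1, peak_tflops=990, mem_bw_tbps=3.35))    # → bandwidth
print(decode_bottleneck(batch_size=512, peak_tflops=990, mem_bw_tbps=3.35))  # → compute
```

This is why small-batch decode is usually bandwidth-bound and why batching shifts the bottleneck toward compute, which the batching section below revisits from the latency side.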
The core quantities that determine serving demand
Serving demand can be represented with a small set of quantities that map directly to scaling behavior.
| Quantity | Meaning | Why it matters |
|---|---|---|
| Prompt tokens | Tokens in the input context | Drives prefill cost and KV cache size |
| Output tokens | Tokens generated | Drives decode cost and total GPU time |
| Context length distribution | How long prompts actually are in production | Determines tail behavior and worst-case memory |
| Concurrency | Number of in-flight requests | Converts per-request cost into sustained throughput |
| Target latency | SLO or SLA target, often with p95 or p99 requirements | Limits queuing and forces headroom |
| Model size | Parameter count and architecture | Determines weight memory and base compute |
| Precision | FP16, BF16, FP8, INT8, etc. | Changes speed, memory footprint, and accuracy |
| Serving policy | batching, streaming, caching, routing | Controls utilization and tail latency |
These variables interact. Increasing batch size raises throughput but increases per-request waiting time. Increasing context length increases KV cache, which can lower maximum safe concurrency even when compute is available. Switching precision can shift the bottleneck from memory to compute or the reverse.
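The first of these interactions can be made concrete with a toy model: a larger batch produces more tokens per second, but the first request in a batch waits for the batch window to fill. The step-time constants and arrival rate below are illustrative assumptions, not measurements:

```python
# Sketch: batch size raises throughput but also per-request waiting time.
# Constants are illustrative; fit them to benchmarks in practice.

def batch_tradeoff(batch_size: int, arrival_rate_rps: float,
                   step_ms_base: float = 20.0,
                   step_ms_per_req: float = 1.0) -> tuple:
    """Return (throughput_rps, avg_fill_wait_ms) for a simple model."""
    # Time for one batched step grows mildly with batch size.
    step_ms = step_ms_base + step_ms_per_req * batch_size
    throughput = batch_size / (step_ms / 1000.0)
    # Average wait for the batch to fill: half the total fill time.
    fill_wait_ms = (batch_size - 1) / arrival_rate_rps * 1000.0 / 2.0
    return throughput, fill_wait_ms

# At 100 RPS, batch 32 yields far higher throughput than batch 1,
# at the cost of measurable added waiting time.
print(batch_tradeoff(1, 100.0))
print(batch_tradeoff(32, 100.0))
```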
Prefill vs decode: the two phases that behave differently
Most transformer serving splits into two phases.
- Prefill: processing the prompt to build the initial hidden state and KV cache. Prefill is more parallel and can benefit strongly from batching because the prompt tokens can be processed as a block.
- Decode: generating tokens one at a time (or in small blocks), updating KV cache each step. Decode can become latency-sensitive because each token depends on the previous token.
Capacity planning needs separate estimates for these phases because their utilization properties differ.
A common failure pattern is sizing from average throughput in prefill-heavy benchmarks, then discovering that decode-heavy production traffic sustains much lower throughput at the same latency target.
A sizing model that is honest about uncertainty
A simple model still needs guardrails. The goal is not a perfect analytical prediction, but a transparent calculation that identifies which assumptions drive the answer.
Define these operational measurements on the target deployment stack.
- Prefill throughput: prompt tokens per second per GPU for representative prompt lengths.
- Decode throughput: generated tokens per second per GPU at representative concurrency.
- KV cache per request: bytes per token stored per layer, multiplied by context length and model architecture factors.
These are best measured with the actual runtime and kernels used in production, because compilation choices, attention kernels, and memory layout matter.
Then represent demand with these traffic measurements.
- Requests per second (RPS) by endpoint or product feature.
- Distribution of prompt tokens per request.
- Distribution of output tokens per request.
- Burst factor over time windows relevant to autoscaling and queue formation.
From these, compute a conservative capacity estimate.
- Required GPUs for prefill: (RPS × average prompt tokens) ÷ (prefill tokens/sec per GPU) × headroom
- Required GPUs for decode: (RPS × average output tokens) ÷ (decode tokens/sec per GPU) × headroom
Compute and memory constraints must both be satisfied, so the required GPU count is the maximum of the compute-based requirement and the memory-based requirement.
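The two formulas above can be sketched directly. The traffic and throughput numbers below are illustrative assumptions; whether prefill and decode share GPUs depends on the serving architecture, and summing the two requirements is a conservative choice for a shared fleet:

```python
# Sketch of the conservative capacity estimate above. In practice the
# throughput figures come from benchmarks on the production runtime and
# the traffic figures from logs; everything here is illustrative.
import math

def required_gpus(rps: float,
                  avg_prompt_tokens: float,
                  avg_output_tokens: float,
                  prefill_tps_per_gpu: float,
                  decode_tps_per_gpu: float,
                  memory_limited_gpus: int,
                  headroom: float = 1.3) -> int:
    """GPU count satisfying both compute- and memory-based requirements."""
    prefill_gpus = rps * avg_prompt_tokens / prefill_tps_per_gpu * headroom
    decode_gpus = rps * avg_output_tokens / decode_tps_per_gpu * headroom
    compute_gpus = math.ceil(prefill_gpus + decode_gpus)
    # Both constraints must hold, so take the maximum.
    return max(compute_gpus, memory_limited_gpus)

# Illustrative: 20 RPS, 1,500 prompt tokens, 300 output tokens,
# 12,000 prefill tok/s/GPU, 1,200 decode tok/s/GPU, 30% headroom.
print(required_gpus(20, 1500, 300, 12000, 1200, memory_limited_gpus=4))  # → 10
```

Note how decode dominates here despite being a fifth of the token volume: its per-GPU throughput is an order of magnitude lower.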
KV cache is the real concurrency limiter for many systems
The KV cache stores key and value vectors per layer for each token in the context for each active sequence. This state enables fast attention without recomputing the entire history each step. It is also the reason that serving capacity can collapse when context length rises.
A useful planning heuristic is that every concurrent request reserves a slice of GPU memory that grows with context length.
| Driver | Effect on KV cache | Operational consequence |
|---|---|---|
| Longer prompts | Larger cache at start | Lower safe concurrency from first token |
| Long output generation | Cache grows during decode | Concurrency shrinks over time in streaming workloads |
| Multi-turn chats | Cache persists across turns | Session stickiness increases memory pressure |
| Tool calls | Idle gaps while state stays resident | Memory held without token production |
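For a dense transformer, the per-request footprint behind the table above can be estimated from the architecture. The layer, head, and dimension values below are illustrative assumptions in the rough shape of a 7B-class model:

```python
# Sketch: per-request KV cache footprint and the concurrency ceiling it
# implies. Architecture numbers are illustrative, not a model spec.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # One K vector and one V vector per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_safe_concurrency(free_hbm_gb: float, context_len: int,
                         layers: int, kv_heads: int, head_dim: int) -> int:
    per_request = kv_bytes_per_token(layers, kv_heads, head_dim) * context_len
    return int(free_hbm_gb * 1e9 // per_request)

# 32 layers, 32 KV heads, head_dim 128, FP16: ~0.5 MB per token,
# so each 4k-context request reserves ~2 GiB of KV cache.
print(max_safe_concurrency(20, 4096, 32, 32, 128))   # 4k context  → 9
print(max_safe_concurrency(20, 32768, 32, 32, 128))  # 32k context → 1
```

The 4k-to-32k comparison is the collapse described above: an 8× longer context cuts safe concurrency by roughly 8× with no change in compute capacity.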
KV cache pressure creates secondary effects.
- Paging and eviction: if a runtime offloads KV cache to CPU memory, latency can spike because PCIe or interconnect bandwidth becomes part of the critical path.
- Fragmentation: memory allocators can fragment under variable sequence lengths, reducing usable capacity.
- Latency blowups: when the system hits a memory ceiling, it can degrade abruptly instead of gradually.
Serving capacity planning should treat KV cache as a first-class dimension, not a footnote.
Batching and queues: the throughput and latency tradeoff
Batching increases utilization by amortizing overhead and improving matrix multiplication efficiency. Queues form naturally when batching is used, because requests wait for a batch window to fill.
The design question is not whether to batch, but how.
- Static batching: a fixed batch size or fixed window. Simple and predictable, but can waste capacity during low load or violate latency during high load.
- Dynamic batching: batch within a time budget and shape constraints. Better utilization, but more complex and can create tail behavior if not bounded.
- Continuous batching: merge requests into a rolling schedule. Often used at the decode step, enabling higher throughput at moderate latency.
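A dynamic batcher with a time budget, the middle option above, can be sketched in a few lines. The queue interface and limits are illustrative; a production batcher would also enforce shape constraints:

```python
# Sketch of a dynamic batcher: collect requests until either the batch
# is full or the latency budget expires. Limits are illustrative.
import queue
import time

def collect_batch(q: "queue.Queue",
                  max_batch: int = 16,
                  budget_ms: float = 10.0) -> list:
    """Return up to max_batch requests, waiting at most budget_ms
    after the first request arrives."""
    batch = [q.get()]  # block for the first request; a batch of one is legal
    deadline = time.monotonic() + budget_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The budget is the bound that prevents the "tail behavior if not bounded" failure: no request waits longer than `budget_ms` for batch formation, whatever the load.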
Queueing discipline matters as much as batching choice.
- Separate prefill and decode queues can prevent decode latency from being dominated by prefill bursts.
- Priority classes can protect interactive traffic from bulk jobs.
- Admission control can preserve quality by rejecting or deferring work rather than letting the tail collapse.
A capacity plan that ignores queues is a plan that only holds at low utilization.
Tail latency: why averages mislead operators
User experience is governed by the slowest requests, not the average request. Tail latency is shaped by multiple mechanisms that compound each other.
- Long contexts that force larger KV cache and slower attention.
- Variability in output length: some prompts yield short completions while others produce long generations.
- Tool calls and retries that extend session duration.
- GPU scheduling effects, especially when sharing devices among models or tenants.
- Background maintenance and logging overhead that aligns with spikes.
A practical way to reason about tail latency is to track not only token throughput but queue waiting time distribution. If queue waiting becomes a material fraction of end-to-end latency, the system is operating too close to capacity for interactive traffic.
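That check can be automated. The sketch below flags when tail queue wait becomes a material share of tail latency; the 20% share threshold is an illustrative assumption to be tuned against the product's latency budget:

```python
# Sketch: alert when p99 queue wait is a material fraction of p99
# end-to-end latency. The share threshold is illustrative.

def percentile(values: list, q: float) -> float:
    """Empirical percentile via the nearest-rank method."""
    xs = sorted(values)
    idx = min(len(xs) - 1, int(q * len(xs)))
    return xs[idx]

def queue_wait_alert(wait_ms: list, e2e_ms: list,
                     q: float = 0.99, max_share: float = 0.2) -> bool:
    """True if the p99 queue wait exceeds max_share of p99 latency."""
    return percentile(wait_ms, q) > max_share * percentile(e2e_ms, q)
```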
Capacity planning as a cycle of measurement, modeling, and verification
Capacity planning becomes robust when treated as a repeating cycle.
- Measure: benchmark prefill throughput, decode throughput, and memory headroom on the actual serving stack.
- Model: translate traffic distributions into compute and memory requirements with explicit headroom.
- Verify: run load tests that match production distributions, including burst patterns, and compare observed queues and latency tails to the model.
- Correct: update assumptions and add safeguards such as admission control, routing, or cache policy.
This cycle prevents the common error of treating a single benchmark run as a forecast.
Load testing that resembles production
Load tests often fail because they do not resemble production behavior. A realistic test includes these characteristics.
- Mixed prompt lengths, including long-tail prompts that occur rarely but dominate worst-case behavior.
- Mixed output lengths, including generation-heavy flows.
- Concurrency patterns that mimic user activity: peaks, troughs, and correlated bursts.
- Stateful sessions when the product is conversational, because session memory alters concurrency and cache.
- Tool calls and retrieval, because external calls can extend session lifetimes and hold memory.
A test that uses uniform prompts and uniform outputs can dramatically overestimate capacity.
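One way to avoid that trap is to drive the load generator from long-tailed distributions rather than fixed sizes. The lognormal parameters below are illustrative assumptions; in practice they would be fitted to production token logs:

```python
# Sketch: sample synthetic requests with long-tailed prompt and output
# lengths instead of uniform sizes. Distribution parameters are
# illustrative; fit them to production logs.
import random

def sample_request(rng: random.Random) -> dict:
    """Draw one synthetic request with long-tailed token counts."""
    prompt = int(rng.lognormvariate(6.5, 1.0))   # median ~665 tokens, heavy tail
    output = int(rng.lognormvariate(5.0, 0.8))   # median ~148 tokens
    return {"prompt_tokens": prompt, "output_tokens": output}

def sample_workload(n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded for reproducible load tests
    return [sample_request(rng) for _ in range(n)]
```

Seeding makes the test repeatable, so a capacity regression can be attributed to the system rather than to a different draw of traffic.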
Hardware sizing is never only about GPUs
GPUs are the visible line item, but serving capacity depends on the surrounding system.
- CPU: tokenization, request routing, compression, and postprocessing can become bottlenecks at high RPS.
- RAM: holds caches, routing tables, and sometimes offloaded KV cache. Memory pressure can create latency spikes.
- Storage: model weights and artifacts must load fast enough to support rolling updates.
- Networking: for multi-GPU or multi-node serving, interconnect latency and bandwidth can affect synchronization, cache traffic, and cross-node routing.
- Power and thermal envelope: sustained serving loads can behave differently from training loads and can trigger throttling if cooling is insufficient.
A complete plan includes these resources because they determine whether GPU capacity can be converted into end-to-end throughput.
Risk management: the margins that keep systems honest
A sizing number without a margin is a promise the system cannot keep. The right margin depends on the product.
Common margin drivers include:
- Feature drift: product changes that increase context length or generation length.
- Model iteration: moving from one model to another with different compute characteristics.
- Traffic uncertainty: marketing events, integrations, or seasonal peaks.
- Runtime changes: kernel updates, compiler shifts, and driver changes that affect throughput.
Margins can be implemented in more than one way.
- Pure headroom: provision more GPUs than the model requires.
- Policy margins: enforce maximum context, maximum output, or stricter routing when load rises.
- Tiered service: degrade gracefully by switching to cheaper models for lower priority traffic.
- Queue limits: cap queue depth to prevent the system from amplifying an incident.
The key is to make the margin explicit and test that it works in practice.
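Queue limits and tiered service combine naturally in a small admission-control policy. A sketch, with illustrative thresholds and priority labels chosen for the example:

```python
# Sketch: admission control that caps queue depth and defers bulk work
# instead of letting the tail collapse. Policy numbers are illustrative.

def admit(queue_depth: int, priority: str,
          soft_limit: int = 50, hard_limit: int = 100) -> str:
    """Return 'accept', 'defer', or 'reject' for an incoming request."""
    if queue_depth >= hard_limit:
        return "reject"    # cap depth so an incident cannot amplify itself
    if queue_depth >= soft_limit and priority != "interactive":
        return "defer"     # bulk work waits; interactive traffic is protected
    return "accept"
```

Because the policy is explicit, it can be load-tested like any other capacity assumption rather than discovered during an incident.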
The infrastructure consequences: why serving sizing is a strategic capability
Accurate capacity planning affects more than reliability.
- It determines cost per request and cost per token.
- It affects release velocity because canary rollouts require spare capacity.
- It influences product design choices, such as whether longer contexts are a default experience.
- It shapes competitive advantage because stable low latency at scale is a differentiator.
Serving hardware sizing is not a one-time procurement decision. It is a recurring operational capability that links product ambition to infrastructure reality.