Serving Hardware Sizing and Capacity Planning

Modern AI systems rarely fail because a model is unavailable. They fail because capacity is misread: tokens are cheaper than expected until a spike arrives, latency looks fine until the tail collapses, an innocuous feature doubles average context length, or a queue forms and never drains. Serving is not training-with-smaller-batches. It is a live production workload with demand uncertainty, strict latency targets, and an economic shape that can swing by orders of magnitude when a single variable moves.

A practical sizing discipline treats serving as a flow problem. Requests arrive with a distribution of prompt lengths, output lengths, and tool calls. The system converts that flow into GPU work and memory pressure. Capacity planning is the act of turning those distributions into hardware requirements with explicit safety margins, then verifying the plan under realistic traffic.

The serving workload has three resource bottlenecks

Serving consumes three resources that behave differently.

  • Compute throughput: the matrix multiplications and attention operations that create tokens.
  • Memory bandwidth and movement: the cost of reading weights and activations and moving them through the memory hierarchy.
  • Stateful memory footprint: the weight memory plus the per-request KV cache and other per-session state.

Serving rarely saturates all three at once. A configuration that is compute-limited at short contexts can become memory-limited at longer contexts. A configuration that looks stable at average load can fall apart because KV cache growth pushes the system into eviction and recomputation.

A reliable sizing practice begins with explicit identification of the dominant bottleneck for the target traffic, then validates that the bottleneck remains dominant across expected variation.

The core quantities that determine serving demand

Serving demand can be represented with a small set of quantities that map directly to scaling behavior.

| Quantity | Meaning | Why it matters |
| --- | --- | --- |
| Prompt tokens | Tokens in the input context | Drives prefill cost and KV cache size |
| Output tokens | Tokens generated | Drives decode cost and total GPU time |
| Context length distribution | How long prompts actually are in production | Determines tail behavior and worst-case memory |
| Concurrency | Number of in-flight requests | Converts per-request cost into sustained throughput |
| Target latency | SLO or SLA target, often with p95 or p99 requirements | Limits queuing and forces headroom |
| Model size | Parameter count and architecture | Determines weight memory and base compute |
| Precision | FP16, BF16, FP8, INT8, etc. | Changes speed, memory footprint, and accuracy |
| Serving policy | Batching, streaming, caching, routing | Controls utilization and tail latency |

These variables interact. Increasing batch size raises throughput but increases per-request waiting time. Increasing context length increases KV cache, which can lower maximum safe concurrency even when compute is available. Switching precision can shift the bottleneck from memory to compute or the reverse.

Prefill vs decode: the two phases that behave differently

Most transformer serving splits into two phases.

  • Prefill: processing the prompt to build the initial hidden state and KV cache. Prefill is more parallel and can benefit strongly from batching because the prompt tokens can be processed as a block.
  • Decode: generating tokens one at a time (or in small blocks), updating KV cache each step. Decode can become latency-sensitive because each token depends on the previous token.

Capacity planning needs separate estimates for these phases because their utilization properties differ.

A common failure pattern is sizing from average throughput in prefill-heavy benchmarks, then discovering that decode-heavy traffic sustains far lower throughput at the same latency target.

A sizing model that is honest about uncertainty

A simple model still needs guardrails. The goal is not a perfect analytical prediction, but a transparent calculation that identifies which assumptions drive the answer.

Define these operational measurements on the target deployment stack.

  • Prefill throughput: prompt tokens per second per GPU for representative prompt lengths.
  • Decode throughput: generated tokens per second per GPU at representative concurrency.
  • KV cache per request: bytes per token stored per layer, multiplied by context length and model architecture factors.

These are best measured with the actual runtime and kernels used in production, because compilation choices, attention kernels, and memory layout matter.
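The third measurement can be approximated from architecture alone before any benchmarking. A minimal sketch, assuming the common layout of one K and one V tensor per layer (grouped-query models shrink `n_kv_heads` accordingly); the example model dimensions are hypothetical, not a specific product:

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    Assumes the common layout: 2 tensors (K and V) per layer, each of
    shape [context_len, n_kv_heads, head_dim]; bytes_per_elem=2 for FP16/BF16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

# Example: a hypothetical 7B-class model (32 layers, 32 KV heads,
# head_dim 128) at an 8k-token context in FP16.
size = kv_cache_bytes(context_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.1f} GiB per request")  # 4.0 GiB per request
```

The per-token figure this produces is the input to the memory side of the capacity estimate; the measured runtime number will be somewhat higher because of allocator overhead and paging granularity.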

Then represent demand with these traffic measurements.

  • Requests per second (RPS) by endpoint or product feature.
  • Distribution of prompt tokens per request.
  • Distribution of output tokens per request.
  • Burst factor over time windows relevant to autoscaling and queue formation.

From these, compute a conservative capacity estimate.

  • Required GPUs for prefill: (RPS × average prompt tokens) ÷ (prefill tokens/sec per GPU) × headroom
  • Required GPUs for decode: (RPS × average output tokens) ÷ (decode tokens/sec per GPU) × headroom

Compute and memory constraints must both be satisfied, so the required GPU count is the maximum of the compute-based requirement and the memory-based requirement.
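The two formulas plus the memory constraint can be combined into a small sketch. All function names and the traffic numbers in the example are illustrative; the throughput inputs should be measured values from the production runtime, not datasheet peaks. The sketch assumes a single shared fleet serves both phases, so the compute requirement sums the prefill and decode terms:

```python
import math

def required_gpus(rps, avg_prompt_tokens, avg_output_tokens,
                  prefill_tok_per_s_per_gpu, decode_tok_per_s_per_gpu,
                  kv_bytes_per_request, avg_concurrency,
                  gpu_free_mem_bytes, headroom=1.5):
    # Compute-based requirement: prefill and decode GPU-time, assuming
    # one shared fleet serves both phases.
    prefill_gpus = rps * avg_prompt_tokens / prefill_tok_per_s_per_gpu * headroom
    decode_gpus = rps * avg_output_tokens / decode_tok_per_s_per_gpu * headroom
    compute_gpus = prefill_gpus + decode_gpus
    # Memory-based requirement: concurrent KV cache must fit in free memory.
    memory_gpus = avg_concurrency * kv_bytes_per_request / gpu_free_mem_bytes * headroom
    # Both constraints must hold, so take the max and round up.
    return math.ceil(max(compute_gpus, memory_gpus))

# Illustrative traffic: 20 RPS, 1.5k-token prompts, 400-token outputs,
# measured 20k prefill tok/s and 1.5k decode tok/s per GPU, 4 GiB of KV
# per request, 64 concurrent requests, 60 GiB free per GPU.
print(required_gpus(20, 1500, 400, 20_000, 1500,
                    4 * 2**30, 64, 60 * 2**30))  # 11
```

Disaggregated prefill/decode deployments would size the two compute terms as separate pools instead of summing them.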

KV cache is the real concurrency limiter for many systems

The KV cache stores key and value vectors per layer for each token in the context for each active sequence. This state enables fast attention without recomputing the entire history each step. It is also the reason that serving capacity can collapse when context length rises.

A useful planning heuristic is that every concurrent request reserves a slice of memory that grows with context length.

| Driver | Effect on KV cache | Operational consequence |
| --- | --- | --- |
| Longer prompts | Larger cache at start | Lower safe concurrency from the first token |
| Long output generation | Cache grows during decode | Concurrency shrinks over time in streaming workloads |
| Multi-turn chats | Cache persists across turns | Session stickiness increases memory pressure |
| Tool calls | Idle gaps while state stays resident | Memory held without token production |

KV cache pressure creates secondary effects.

  • Paging and eviction: if a runtime offloads KV cache to CPU memory, latency can spike because PCIe or interconnect bandwidth becomes part of the critical path.
  • Fragmentation: memory allocators can fragment under variable sequence lengths, reducing usable capacity.
  • Latency blowups: when the system hits a memory ceiling, it can degrade abruptly instead of gradually.

Serving capacity planning should treat KV cache as a first-class dimension, not a footnote.
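Treating KV cache as first-class usually starts with a worst-case concurrency bound: how many sequences fit after weights and runtime overhead are subtracted. A minimal sketch, with all the memory figures below as illustrative assumptions:

```python
def max_safe_concurrency(gpu_mem_bytes, weight_bytes, runtime_overhead_bytes,
                         kv_bytes_per_token, max_context_tokens):
    """Worst-case concurrent sequences that fit without eviction.

    Sizing against max_context_tokens (not the average) is what keeps
    the system out of the abrupt-degradation regime described above.
    """
    free = gpu_mem_bytes - weight_bytes - runtime_overhead_bytes
    per_request = kv_bytes_per_token * max_context_tokens
    return max(free // per_request, 0)

# Hypothetical 80 GB GPU, 14 GB of FP16 weights, 6 GB runtime overhead,
# 512 KiB of KV per token, 16k-token worst-case context.
print(max_safe_concurrency(80 * 2**30, 14 * 2**30, 6 * 2**30,
                           512 * 1024, 16384))  # 7
```

If this bound comes out lower than the concurrency the compute plan assumes, the memory constraint, not compute, sets the fleet size.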

Batching and queues: the throughput and latency tradeoff

Batching increases utilization by amortizing overhead and improving matrix multiplication efficiency. Queues form naturally when batching is used, because requests wait for a batch window to fill.

The design question is not whether to batch, but how.

  • Static batching: a fixed batch size or fixed window. Simple and predictable, but can waste capacity during low load or violate latency during high load.
  • Dynamic batching: batch within a time budget and shape constraints. Better utilization, but more complex and can create tail behavior if not bounded.
  • Continuous batching: merge requests into a rolling schedule. Often used for decoder steps, enabling higher throughput at moderate latency.
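As an illustration of the middle option, a minimal dynamic batcher bounds both the batch size and the batching window. This is a sketch around Python's standard `queue.Queue`, not any particular serving framework's scheduler:

```python
import time
import queue

def collect_batch(request_queue, max_batch, time_budget_s):
    """Dynamic batching: wait up to time_budget_s for the first request,
    then fill the batch from whatever else arrives inside the window.

    Bounding both the batch size and the window is what keeps the
    tail-latency contribution of batching predictable."""
    batch = []
    try:
        batch.append(request_queue.get(timeout=time_budget_s))
    except queue.Empty:
        return batch  # no traffic this window
    deadline = time.monotonic() + time_budget_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

Under low load this returns small batches quickly; under high load it returns full batches, which is exactly the utilization/latency tradeoff the bullet describes.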

Queueing discipline matters as much as batching choice.

  • Separate prefill and decode queues can prevent decode latency from being dominated by prefill bursts.
  • Priority classes can protect interactive traffic from bulk jobs.
  • Admission control can preserve quality by rejecting or deferring work rather than letting the tail collapse.

A capacity plan that ignores queues is a plan that only holds at low utilization.

Tail latency: why averages mislead operators

User experience is governed by the slowest requests, not the average request. Tail latency is shaped by multiple mechanisms that compound each other.

  • Long contexts that force larger KV cache and slower attention.
  • Variability in output length. Some prompts cause short completions while others produce long outputs.
  • Tool calls and retries that extend session duration.
  • GPU scheduling effects, especially when sharing devices among models or tenants.
  • Background maintenance and logging overhead that aligns with spikes.

A practical way to reason about tail latency is to track not only token throughput but queue waiting time distribution. If queue waiting becomes a material fraction of end-to-end latency, the system is operating too close to capacity for interactive traffic.
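One way to operationalize that check is to compare tail queue wait against tail end-to-end latency. The percentile helper below uses a simple nearest-rank definition; what counts as a "material fraction" is a per-product judgment call:

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a list of samples."""
    ordered = sorted(samples)
    rank = max(int(round(p / 100 * len(ordered))) - 1, 0)
    return ordered[rank]

def queue_wait_fraction(queue_waits_ms, e2e_latencies_ms, p=95):
    """Share of tail end-to-end latency spent waiting in queue.

    A rising fraction is an early signal that the system is operating
    too close to capacity for interactive traffic."""
    return percentile(queue_waits_ms, p) / percentile(e2e_latencies_ms, p)
```

Tracking this fraction over time is more actionable than tracking p99 latency alone, because it separates "the model got slower" from "the queue got deeper."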

Capacity planning as a cycle of measurement, modeling, and verification

Capacity planning becomes robust when treated as a repeating cycle.

  • Measure: benchmark prefill throughput, decode throughput, and memory headroom on the actual serving stack.
  • Model: translate traffic distributions into compute and memory requirements with explicit headroom.
  • Verify: run load tests that match production distributions, including burst patterns, and compare observed queues and latency tails to the model.
  • Correct: update assumptions and add safeguards such as admission control, routing, or cache policy.

This cycle prevents the common error of treating a single benchmark run as a forecast.

Load testing that resembles production

Load tests often fail because they do not resemble production behavior. A realistic test includes these characteristics.

  • Mixed prompt lengths, including long-tail prompts that occur rarely but dominate worst-case behavior.
  • Mixed output lengths, including generation-heavy flows.
  • Concurrency patterns that mimic user activity: peaks, troughs, and correlated bursts.
  • Stateful sessions when the product is conversational, because session memory alters concurrency and cache.
  • Tool calls and retrieval, because external calls can extend session lifetimes and hold memory.

A test that uses uniform prompts and uniform outputs can dramatically overestimate capacity.
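A load generator with long-tailed token counts is a small step up from uniform prompts. The lognormal parameters below are illustrative placeholders; in practice they should be fitted to real traffic logs, since the tail is what breaks capacity plans:

```python
import random

def sample_request(rng):
    """One synthetic request with long-tailed prompt and output lengths.

    mu/sigma values are illustrative assumptions, not fitted parameters;
    the caps stand in for product-level context and output limits.
    """
    prompt_tokens = min(int(rng.lognormvariate(mu=6.5, sigma=1.0)), 32768)
    output_tokens = min(int(rng.lognormvariate(mu=5.0, sigma=0.9)), 4096)
    return prompt_tokens, output_tokens

rng = random.Random(42)
requests = [sample_request(rng) for _ in range(10_000)]
prompts = sorted(p for p, _ in requests)
print("median prompt:", prompts[len(prompts) // 2])
print("p99 prompt:   ", prompts[int(len(prompts) * 0.99)])
```

Printing the median alongside the p99 makes the gap between average-case and tail-case sizing visible before any GPU is touched.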

Hardware sizing is never only about GPUs

GPUs are the visible line item, but serving capacity depends on the surrounding system.

  • CPU: tokenization, request routing, compression, and postprocessing can become bottlenecks at high RPS.
  • RAM: hosts caches, routing tables, and sometimes offloaded KV cache. Memory pressure can create latency spikes.
  • Storage: model weights and artifacts must load fast enough to support rolling updates.
  • Networking: for multi-GPU or multi-node serving, interconnect latency and bandwidth can affect synchronization, cache traffic, and cross-node routing.
  • Power and thermal envelope: sustained serving loads can behave differently from training loads and can trigger throttling if cooling is insufficient.

A complete plan includes these resources because they determine whether GPU capacity can be converted into end-to-end throughput.

Risk management: the margins that keep systems honest

A sizing number without a margin is a promise the system cannot keep. The right margin depends on the product.

Common margin drivers include:

  • Feature drift: product changes that increase context length or generation length.
  • Model iteration: moving from one model to another with different compute characteristics.
  • Traffic uncertainty: marketing events, integrations, or seasonal peaks.
  • Runtime changes: kernel updates, compiler shifts, and driver changes that affect throughput.

Margins can be implemented in more than one way.

  • Pure headroom: provision more GPUs than the model requires.
  • Policy margins: enforce maximum context, maximum output, or stricter routing when load rises.
  • Tiered service: degrade gracefully by switching to cheaper models for lower priority traffic.
  • Queue limits: cap queue depth to prevent the system from amplifying an incident.

The key is to make the margin explicit and test that it works in practice.
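The queue-limit margin from the list above can be sketched as a small admission controller. The threshold here is illustrative; in a real deployment it would be derived from the queue-wait budget implied by the latency SLO:

```python
class AdmissionController:
    """Cap queue depth so overload sheds load instead of amplifying it.

    The depth limit is an illustrative knob, not a recommended value."""

    def __init__(self, max_queue_depth):
        self.max_queue_depth = max_queue_depth
        self.queue_depth = 0

    def try_admit(self):
        if self.queue_depth >= self.max_queue_depth:
            return False  # reject or defer; do not let the tail collapse
        self.queue_depth += 1
        return True

    def on_complete(self):
        self.queue_depth -= 1

ctl = AdmissionController(max_queue_depth=2)
print([ctl.try_admit() for _ in range(3)])  # [True, True, False]
```

Rejecting the third request early is the explicit, testable form of the margin: the system fails a single request rather than degrading every request behind it.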

The infrastructure consequences: why serving sizing is a strategic capability

Accurate capacity planning affects more than reliability.

  • It determines cost per request and cost per token.
  • It affects release velocity because canary rollouts require spare capacity.
  • It influences product design choices, such as whether longer contexts are a default experience.
  • It shapes competitive advantage because stable low latency at scale is a differentiator.

Serving hardware sizing is not a one-time procurement decision. It is a recurring operational capability that links product ambition to infrastructure reality.
