Capacity Planning and Load Testing for AI Services: Tokens, Concurrency, and Queues
| Field | Value |
|---|---|
| Category | MLOps, Observability, and Reliability |
| Primary Lens | AI innovation with infrastructure consequences |
| Suggested Formats | Research Essay, Deep Dive, Field Guide |
| Suggested Series | Deployment Playbooks, Infrastructure Shift Briefs |
Capacity Planning Starts With the Real Unit of Work
Traditional web services often plan around requests per second, CPU, memory, and database IO. AI services add a more elastic unit: tokens. A “request” can be tiny or enormous depending on prompt length, retrieved context, tool traces, and output size. Two requests with the same HTTP shape can have wildly different compute costs and latencies.
Capacity planning for AI therefore starts with a few basic disciplines:
- Track and model the distribution of token counts, not only the average.
- Separate prompt tokens (prefill) from output tokens (decode).
- Treat tool calls and retrieval as additional service stages, not as incidental overhead.
When these are modeled, scaling becomes less mysterious. When they are ignored, teams alternate between overspending and firefighting.
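As a concrete starting point, the sketch below computes those distributions from request logs. The record fields (`prompt_tokens`, `output_tokens`, `tool_calls`) are hypothetical names; adapt them to whatever your logging pipeline actually records.
```python
from statistics import quantiles

def token_percentiles(counts: list[int]) -> dict:
    """Return p50/p95/p99 for a list of per-request token counts."""
    cuts = quantiles(counts, n=100)            # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical request-log records: prefill and decode sizes plus tool usage.
requests = [
    {"prompt_tokens": 800, "output_tokens": 300, "tool_calls": 0},
    {"prompt_tokens": 2_600, "output_tokens": 1_100, "tool_calls": 2},
    # ... one record per served request, sampled from production logs
]

prompt_stats = token_percentiles([r["prompt_tokens"] for r in requests])
output_stats = token_percentiles([r["output_tokens"] for r in requests])
tool_call_rate = sum(r["tool_calls"] > 0 for r in requests) / len(requests)
```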
The Latency Anatomy of an AI Request
Most AI inference pipelines have several distinct phases:
- Admission and queueing: waiting for an available worker or GPU slot
- Prefill: ingesting the prompt and building the key-value cache
- Decode: generating output tokens, often the longest phase
- Tool and retrieval stages: external calls that can dominate p95 latency
- Post-processing: formatting, safety checks, logging, and caching
Each stage has its own failure modes and scaling levers. Capacity planning is the art of finding which stage is binding under current workloads, then adding the right constraint or resource.
A common mistake is to treat end-to-end latency as one number. The more useful breakdown is:
- Queue time
- Time to first token
- Tokens per second during decode
- Tool latency and error rate
- Total completion time
This breakdown exposes whether the problem is “not enough compute,” “too much variability,” or “a dependency bottleneck.”
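A minimal sketch of that decomposition, assuming each request records a few timestamps and counters (the field names here are illustrative, not a specific tracing schema):
```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    # Hypothetical per-request timestamps (seconds) and counters.
    enqueued_at: float
    started_at: float        # a worker picked up the request
    first_token_at: float    # first output token emitted
    finished_at: float
    output_tokens: int
    tool_seconds: float      # total time spent waiting on tool/retrieval calls

def breakdown(t: RequestTrace) -> dict:
    """Decompose end-to-end latency into the stages worth alerting on."""
    decode_seconds = t.finished_at - t.first_token_at
    return {
        "queue_time": t.started_at - t.enqueued_at,
        "time_to_first_token": t.first_token_at - t.enqueued_at,
        "decode_tokens_per_sec": t.output_tokens / decode_seconds if decode_seconds > 0 else 0.0,
        "tool_time": t.tool_seconds,
        "total": t.finished_at - t.enqueued_at,
    }
```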
Concurrency, Queues, and the Reality of Bursty Traffic
AI products often face bursty demand: launches, news cycles, school deadlines, and enterprise batch jobs. Queues are the shock absorbers of the system. If queues are not designed, they design themselves.
Two simple ideas guide most sizing work:
- Concurrency is limited by the resources that must be held while a request is running.
- Queueing delay grows rapidly when utilization approaches saturation.
In AI inference, the held resources can include GPU memory for the key-value cache, CPU threads for tokenization, and network slots for tool calls. When concurrency is mis-sized, latency spikes can appear suddenly even when average utilization looks safe.
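A toy single-server queueing formula makes the second point vivid. Real inference servers batch requests and run many workers, so the numbers below are not predictive, but the shape of the curve carries over: waiting time explodes as utilization approaches 1.0.
```python
def mm1_queue_wait(arrival_rate: float, service_rate: float) -> float:
    """Average queueing delay for a single-server M/M/1 queue (a toy model)."""
    utilization = arrival_rate / service_rate
    if utilization >= 1.0:
        return float("inf")                     # unstable: the queue grows without bound
    return utilization / (service_rate * (1.0 - utilization))

# Same service rate, rising load: note how the last step dwarfs the others.
for arrivals in (5.0, 7.0, 9.0, 9.8):
    print(arrivals, round(mm1_queue_wait(arrivals, service_rate=10.0), 2))
```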
A Practical Workload Model for AI Services
A usable model does not require perfect mathematics. It requires a handful of measurable quantities.
Useful workload descriptors:
- Prompt token distribution: p50, p95, p99
- Output token distribution: p50, p95, p99
- Tool call rate: fraction of requests that invoke tools and how many calls
- Retrieval expansion: average retrieved tokens appended to prompts
- Target SLOs: p95 end-to-end latency, time-to-first-token, and success rate
- Demand shape: steady rate plus burst amplitude and duration
From these, a team can estimate the “token work per second” required and compare it to observed throughput under realistic conditions.
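A rough sketch of that estimate, assuming a single weighting factor for the cost of a prompt token relative to an output token. The `prefill_weight` default and the observed throughput below are assumptions; measure both on your own stack.
```python
def required_token_throughput(req_per_sec: float, prompt_p95: int, output_p95: int,
                              prefill_weight: float = 0.2) -> float:
    """Rough 'token work per second' needed to serve demand at the p95 request shape."""
    work_per_request = prompt_p95 * prefill_weight + output_p95
    return req_per_sec * work_per_request

demand = required_token_throughput(req_per_sec=12, prompt_p95=2_800, output_p95=1_200)
observed = 25_000   # tokens/sec sustained in a production-like load test (measured, not peak)
headroom = 1.0 - demand / observed
print(f"required: {demand:.0f} tok/s, headroom: {headroom:.0%}")
```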
A healthy system keeps headroom. Headroom is not waste. It is the price of low tail latency during bursts and failure conditions.
Load Testing That Resembles Reality
Load tests that use a single synthetic prompt shape produce misleading confidence. AI workloads are heavy-tailed. The worst latencies come from the long prompts, the multi-step tool flows, and the occasional massive output.
A realistic load test includes:
- A mix of prompt sizes that matches production distributions
- A mix of tool and retrieval usage rates, including worst-case paths
- Realistic output lengths and stop conditions
- Warm and cold cache scenarios
- Failure injection for tool timeouts and retry storms
- Concurrency ramping that reveals queueing behavior
The goal is not to produce a pretty throughput number. The goal is to learn the system’s breaking points.
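A skeleton of such a test is sketched below. The `send_request` coroutine is a stand-in that simulates scaled-down service times so the sketch runs quickly; in a real test it would call your inference endpoint, and the rising concurrency would surface server-side queueing in the measured latencies. The mix percentages and shapes are illustrative.
```python
import asyncio
import random
import time

async def send_request(prompt_tokens: int, output_tokens: int) -> float:
    """Stand-in for a call to a live inference endpoint; replace the sleep with a real request."""
    start = time.monotonic()
    await asyncio.sleep(prompt_tokens / 100_000 + output_tokens / 2_000)
    return time.monotonic() - start

def sample_shape() -> tuple[int, int]:
    """Sample a request shape from a heavy-tailed mix, not a single average."""
    if random.random() < 0.1:                       # ~10% long-context, tool-heavy path
        return 2_800, 1_200
    return max(int(random.gauss(900, 200)), 1), max(int(random.gauss(350, 100)), 1)

async def ramp(concurrency_levels=(4, 8, 16, 32), requests_per_level=50) -> None:
    """Ramp concurrency level by level and report approximate p95 latency."""
    for level in concurrency_levels:
        sem = asyncio.Semaphore(level)

        async def one() -> float:
            async with sem:
                prompt, output = sample_shape()
                return await send_request(prompt, output)

        latencies = sorted(await asyncio.gather(*(one() for _ in range(requests_per_level))))
        p95 = latencies[int(0.95 * len(latencies)) - 1]
        print(f"concurrency={level:>3}  p95={p95:.2f}s")

asyncio.run(ramp())
```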
Synthetic monitoring with golden prompts complements load testing. Load tests find scaling limits. Golden prompts detect regressions and shifts in behavior over time.
Token Budgets, Output Caps, and Degradation Strategies
Capacity planning is inseparable from product constraints. If a system permits unbounded output, it permits unbounded latency and cost.
Effective constraint tools include:
- Output token caps tied to user tiers and task types
- Retrieval caps that limit appended context size
- Tool budgets that cap the number of external calls per request
- Timeouts with graceful partial results rather than silent failure
- SLO-aware routing that uses cheaper or faster modes when under load
Degradation should be designed, not improvised. A planned “lower fidelity” mode is better than an accidental collapse.
A subtle point: degradation strategies should preserve trust. Cutting corners in ways that reduce grounding or increase speculation can harm the product more than it helps. Under load, it may be better to shorten outputs and require citations than to answer quickly with less support.
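One way to wire these ideas together is a small budget table consulted at admission time, with a planned degradation step that shortens outputs before it touches grounding. The tier names, limits, and queue-depth threshold in this sketch are all assumptions.
```python
# Hypothetical per-tier budgets; the numbers and tier names are illustrative.
BUDGETS = {
    "free":       {"max_output_tokens": 400,   "max_retrieved_tokens": 1_500, "max_tool_calls": 1},
    "pro":        {"max_output_tokens": 1_200, "max_retrieved_tokens": 4_000, "max_tool_calls": 3},
    "enterprise": {"max_output_tokens": 2_000, "max_retrieved_tokens": 8_000, "max_tool_calls": 5},
}

def effective_budget(tier: str, queue_depth: int, queue_alarm: int = 200) -> dict:
    """Shrink budgets under load instead of letting latency degrade for everyone.

    The degraded mode shortens outputs and trims retrieval, but never drops
    retrieval entirely: preserving grounding is part of the design.
    """
    budget = dict(BUDGETS[tier])
    if queue_depth > queue_alarm:
        budget["max_output_tokens"] = min(budget["max_output_tokens"], 400)
        budget["max_retrieved_tokens"] = min(budget["max_retrieved_tokens"], 2_000)
    return budget
```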
Batching, Caching, and the Compute-IO Trade Space
Modern inference stacks use several techniques to increase throughput:
- Batching: grouping multiple requests to improve GPU utilization
- Continuous batching: adding requests to a running batch as tokens are produced
- Prompt caching: reusing prefill results for repeated prefixes
- Retrieval caching: reusing top-k results for stable queries
- Response caching: serving identical answers for identical inputs where appropriate
These techniques create new tradeoffs:
- Batching increases throughput but can increase time-to-first-token for small requests.
- Caching reduces cost but introduces freshness concerns and invalidation complexity.
- Aggressive caching can leak behavior across tenants if isolation is not enforced.
Capacity planning should treat batching and caching as first-class design choices rather than as afterthought optimizations.
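The tenant-isolation point is easy to get wrong. Below is a toy response cache that bakes the tenant into the cache key, under the assumption that identical prompts to the same model may legitimately be served from cache; it is a sketch of the isolation idea, not a production cache.
```python
import hashlib

class ResponseCache:
    """Toy response cache; keys include the tenant so entries never cross tenants."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def _key(self, tenant_id: str, model: str, prompt: str) -> str:
        digest = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
        return f"{tenant_id}:{digest}"          # tenant prefix enforces isolation

    def get(self, tenant_id: str, model: str, prompt: str):
        return self._store.get(self._key(tenant_id, model, prompt))

    def put(self, tenant_id: str, model: str, prompt: str, response: str) -> None:
        self._store[self._key(tenant_id, model, prompt)] = response
```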
Multi-Tenancy: Fairness Is a Capacity Problem
In shared systems, one customer can consume disproportionate resources and degrade everyone’s tail latency. Multi-tenancy controls are therefore part of capacity planning:
- Per-tenant rate limits and token budgets
- Priority queues for interactive traffic versus batch jobs
- Isolation of high-risk tool workflows
- Admission control that rejects work early rather than timing out late
- Fair scheduling that prevents a single long request from blocking many short ones
Fairness is not only ethical. It is operationally necessary. Without it, the system’s capacity becomes unpredictable because demand spikes from one segment spill over into others.
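A minimal sketch of one fairness mechanism: per-tenant queues drained round-robin, so a burst from one tenant lengthens only its own queue. Production schedulers add weights, priorities, and token accounting on top of this idea.
```python
from collections import defaultdict, deque

class FairDispatcher:
    """Round-robin across per-tenant queues; a sketch of the fairness idea only."""

    def __init__(self) -> None:
        self._queues: dict[str, deque] = defaultdict(deque)
        self._order: deque = deque()            # tenants that currently have queued work

    def submit(self, tenant_id: str, request) -> None:
        if not self._queues[tenant_id]:
            self._order.append(tenant_id)
        self._queues[tenant_id].append(request)

    def next_request(self):
        """Pop the next request, cycling tenants so no single tenant monopolizes workers."""
        for _ in range(len(self._order)):
            tenant = self._order.popleft()
            if self._queues[tenant]:
                request = self._queues[tenant].popleft()
                if self._queues[tenant]:
                    self._order.append(tenant)  # tenant still has work queued
                return tenant, request
        return None

d = FairDispatcher()
d.submit("tenant_a", "req1"); d.submit("tenant_a", "req2"); d.submit("tenant_b", "req3")
print([d.next_request() for _ in range(3)])     # alternates: a, b, a
```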
The Hardware Reality: Memory, Not Only FLOPs
AI throughput is often bounded by memory and bandwidth rather than by raw compute. Key constraints include:
- GPU memory limits that cap concurrency due to key-value cache growth
- Bandwidth limits that slow prefill and retrieval-heavy prompts
- CPU bottlenecks in tokenization and logging pipelines
- Network bottlenecks during tool-heavy workloads
- Storage bottlenecks during index reads and retrieval expansion
Hardware benchmarking should mimic real request mixes. “Peak tokens per second” on a microbenchmark rarely predicts p95 latency under production-like workloads.
When capacity planning includes hardware-aware constraints, scaling decisions become more rational: add GPUs when decode is binding, add memory or reduce context when KV cache is binding, improve networking when tool calls dominate.
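A back-of-the-envelope estimate of KV-cache pressure makes the memory point concrete. The model dimensions below are illustrative, not any specific product's configuration; plug in the values for the model you actually serve.
```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size for one sequence: K and V tensors per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative 8B-class model with grouped-query attention and an fp16 cache.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
per_request = kv_cache_bytes(2_800 + 1_200, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{per_token} bytes/token, {per_request / 2**20:.0f} MiB for a 4,000-token request")
```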
Capacity Planning as a Continuous Practice
AI systems change frequently: models, prompts, corpora, tools, and user behavior all shift. Capacity planning is therefore not a one-time spreadsheet. It is an operational loop:
- Measure the workload distribution regularly.
- Re-run load tests after major model or policy changes.
- Watch tail latency and queue time as leading indicators of saturation.
- Track cost per successful task, not only cost per request.
- Update degradation strategies as the product matures.
The strongest organizations treat capacity as a product property. They plan for predictable behavior, even when demand and tools change.
References and Further Reading
- Queueing intuition for services: why tail latency rises near saturation
- SRE methods: SLOs, error budgets, and load testing discipline
- GPU inference optimization: batching, caching, and KV memory constraints
A Worked Sizing Sketch Without Pretending to Be Exact
A simple sizing sketch helps turn vague concern into a concrete plan. The numbers below are illustrative, but the method is reusable.
Assume an interactive assistant with these measured properties in production-like tests:
| Metric | Typical | High tail |
|---|---|---|
| Prompt tokens (including retrieval) | 900 | 2,800 |
| Output tokens | 350 | 1,200 |
| Time to first token | 0.6s | 1.8s |
| Decode rate (tokens/sec) | 120 | 70 |
| Tool calls per request | 0.4 | 2.0 |
From this, two observations usually appear quickly:
- The long prompts dominate prefill time and memory pressure even if they are a minority of traffic.
- Tool-heavy paths dominate p95 end-to-end latency even when the model decode is fast.
A practical capacity plan follows:
- Size concurrency so the high-tail prompt fits without exhausting GPU memory for the key-value cache.
- Add a queue budget so interactive users do not wait behind batch work.
- Add budgets for tool calls and strict timeouts so a tool dependency cannot create a retry storm.
- Use routing that distinguishes “chatty long-form” from “short answer” tasks, because they are different workloads.
Even when the numbers shift, this style of sketch keeps planning anchored to the real unit of work: tokens, tool stages, and tail behavior.
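Expressed as code, the sketch might look like the following. Every constant is either taken from the table above or an explicit assumption (mean tool latency, KV bytes per token, cache memory budget).
```python
# Request shapes from the table above; all numbers are illustrative.
TYPICAL   = {"prompt": 900,   "output": 350,   "ttft": 0.6, "decode_tps": 120, "tool_calls": 0.4}
HIGH_TAIL = {"prompt": 2_800, "output": 1_200, "ttft": 1.8, "decode_tps": 70,  "tool_calls": 2.0}
TOOL_CALL_SECONDS = 1.5                     # assumed mean tool latency; measure your own

def end_to_end_estimate(shape: dict) -> float:
    """Rough end-to-end latency: first token + decode + serialized tool time."""
    decode = shape["output"] / shape["decode_tps"]
    tools = shape["tool_calls"] * TOOL_CALL_SECONDS
    return shape["ttft"] + decode + tools

for name, shape in (("typical", TYPICAL), ("high tail", HIGH_TAIL)):
    print(f"{name}: ~{end_to_end_estimate(shape):.1f}s end to end")

# KV-cache sizing: how many high-tail requests fit in the memory reserved for cache?
KV_BYTES_PER_TOKEN = 128 * 1024             # from the estimate in the hardware section
CACHE_BUDGET_BYTES = 40 * 2**30             # e.g. 40 GiB of GPU memory set aside for KV cache
tokens_per_request = HIGH_TAIL["prompt"] + HIGH_TAIL["output"]
max_concurrency = CACHE_BUDGET_BYTES // (tokens_per_request * KV_BYTES_PER_TOKEN)
print(f"high-tail concurrency ceiling: ~{max_concurrency} requests in flight")
```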
Admission Control and Backpressure: Reject Early, Recover Faster
When a system is overloaded, the worst outcome is to accept everything and fail slowly. Timeouts waste compute and frustrate users. Admission control makes overload survivable:
- Cap in-flight requests per worker based on GPU memory and expected token counts.
- Prefer fast failure with a clear message over long hanging requests.
- Use priority queues so interactive traffic is not crowded out by bulk jobs.
- Apply per-tenant budgets so a single tenant cannot consume shared headroom.
Backpressure is not only about protecting infrastructure. It protects user trust by keeping the system responsive even under stress.
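A minimal sketch of token-aware admission control follows; a real system would layer priorities and per-tenant budgets on top, and the token estimate itself would come from the workload model above.
```python
import asyncio

class AdmissionController:
    """Reject new work early when estimated in-flight token load exceeds a cap,
    instead of accepting it and timing out later."""

    def __init__(self, max_inflight_tokens: int) -> None:
        self.max_inflight_tokens = max_inflight_tokens
        self.inflight_tokens = 0
        self._lock = asyncio.Lock()

    async def try_admit(self, estimated_tokens: int) -> bool:
        async with self._lock:
            if self.inflight_tokens + estimated_tokens > self.max_inflight_tokens:
                return False                    # fast, explicit rejection
            self.inflight_tokens += estimated_tokens
            return True

    async def release(self, estimated_tokens: int) -> None:
        async with self._lock:
            self.inflight_tokens -= estimated_tokens
```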