Serving Hardware Sizing and Capacity Planning
Modern AI systems rarely fail because a model is unavailable. They fail because capacity is misread: tokens are cheaper than expected until a spike arrives, latency looks fine until the tail collapses, an innocuous feature doubles average context length, or a queue forms and never drains. Serving is not training-with-smaller-batches. It is a live production workload with demand uncertainty, strict latency targets, and an economic shape that can swing by orders of magnitude when a single variable moves.
A practical sizing discipline treats serving as a flow problem. Requests arrive with a distribution of prompt lengths, output lengths, and tool calls. The system converts that flow into GPU work and memory pressure. Capacity planning is the act of turning those distributions into hardware requirements with explicit safety margins, then verifying the plan under realistic traffic.
The serving workload has three resource bottlenecks
Serving consumes three resources that behave differently.
- Compute throughput: the matrix multiplications and attention operations that create tokens.
- Memory bandwidth and movement: the cost of reading weights and activations and moving them through the memory hierarchy.
- Stateful memory footprint: the weight memory plus the per-request KV cache and other per-session state.
Serving rarely saturates all three at once. A configuration that is compute-limited at short contexts can become memory-limited at longer contexts. A configuration that looks stable at average load can fall apart because KV cache growth pushes the system into eviction and recomputation.
A reliable sizing practice begins with explicit identification of the dominant bottleneck for the target traffic, then validates that the bottleneck remains dominant across expected variation.
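As a rough first pass, the dominant bottleneck for a dense decode step can be identified with a roofline-style comparison of the workload's arithmetic intensity against the hardware's compute-to-bandwidth ratio. A minimal sketch, where the peak-throughput and bandwidth figures are illustrative assumptions, not a spec:

```python
# Sketch: classify a dense decode step as compute- or bandwidth-bound
# with a simple roofline argument. All hardware numbers are illustrative.

def decode_bottleneck(batch_size: int,
                      peak_tflops: float,
                      mem_bw_tbps: float,
                      bytes_per_param: int = 2) -> str:
    """Return 'compute' or 'bandwidth' for a dense decode step.

    Each generated token does roughly 2 FLOPs per parameter, and each
    decode step reads every parameter once (bytes_per_param bytes),
    amortized across the batch. Arithmetic intensity therefore grows
    linearly with batch size.
    """
    intensity = 2.0 * batch_size / bytes_per_param       # FLOPs per byte moved
    machine_balance = (peak_tflops * 1e12) / (mem_bw_tbps * 1e12)
    return "compute" if intensity > machine_balance else "bandwidth"

# Illustrative H100-like numbers: ~990 TFLOPs FP16, ~3.35 TB/s HBM.
print(decode_bottleneck(batch_size=1, peak_tflops=990, mem_bw_tbps=3.35))    # → bandwidth
print(decode_bottleneck(batch_size=512, peak_tflops=990, mem_bw_tbps=3.35))  # → compute
```

This is why small-batch decode is usually bandwidth-bound and why batching shifts the bottleneck toward compute, which the batching section below revisits from the latency side.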
The core quantities that determine serving demand
Serving demand can be represented with a small set of quantities that map directly to scaling behavior.
| Quantity | Meaning | Why it matters |
|---|---|---|
| Prompt tokens | Tokens in the input context | Drives prefill cost and KV cache size |
| Output tokens | Tokens generated | Drives decode cost and total GPU time |
| Context length distribution | How long prompts actually are in production | Determines tail behavior and worst-case memory |
| Concurrency | Number of in-flight requests | Converts per-request cost into sustained throughput |
| Target latency | SLO or SLA target, often with p95 or p99 requirements | Limits queuing and forces headroom |
| Model size | Parameter count and architecture | Determines weight memory and base compute |
| Precision | FP16, BF16, FP8, INT8, etc. | Changes speed, memory footprint, and accuracy |
| Serving policy | batching, streaming, caching, routing | Controls utilization and tail latency |
These variables interact. Increasing batch size raises throughput but increases per-request waiting time. Increasing context length increases KV cache, which can lower maximum safe concurrency even when compute is available. Switching precision can shift the bottleneck from memory to compute or the reverse.
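The first of these interactions can be made concrete with a toy model: a larger batch produces more tokens per second, but the first request in a batch waits for the batch window to fill. The step-time constants and arrival rate below are illustrative assumptions, not measurements:

```python
# Sketch: batch size raises throughput but also per-request waiting time.
# Constants are illustrative; fit them to benchmarks in practice.

def batch_tradeoff(batch_size: int, arrival_rate_rps: float,
                   step_ms_base: float = 20.0,
                   step_ms_per_req: float = 1.0) -> tuple:
    """Return (throughput_rps, avg_fill_wait_ms) for a simple model."""
    # Time for one batched step grows mildly with batch size.
    step_ms = step_ms_base + step_ms_per_req * batch_size
    throughput = batch_size / (step_ms / 1000.0)
    # Average wait for the batch to fill: half the total fill time.
    fill_wait_ms = (batch_size - 1) / arrival_rate_rps * 1000.0 / 2.0
    return throughput, fill_wait_ms

# At 100 RPS, batch 32 yields far higher throughput than batch 1,
# at the cost of measurable added waiting time.
print(batch_tradeoff(1, 100.0))
print(batch_tradeoff(32, 100.0))
```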
Prefill vs decode: the two phases that behave differently
Most transformer serving splits into two phases.
- Prefill: processing the prompt to build the initial hidden state and KV cache. Prefill is more parallel and can benefit strongly from batching because the prompt tokens can be processed as a block.
- Decode: generating tokens one at a time (or in small blocks), updating KV cache each step. Decode can become latency-sensitive because each token depends on the previous token.
Capacity planning needs separate estimates for these phases because their utilization properties differ.
A common failure pattern is sizing from average throughput in prefill-heavy benchmarks, then discovering that decode-heavy production traffic sustains much lower throughput at the same latency target.
A sizing model that is honest about uncertainty
A simple model still needs guardrails. The goal is not a perfect analytical prediction, but a transparent calculation that identifies which assumptions drive the answer.
Define these operational measurements on the target deployment stack.
- Prefill throughput: prompt tokens per second per GPU for representative prompt lengths.
- Decode throughput: generated tokens per second per GPU at representative concurrency.
- KV cache per request: bytes per token stored per layer, multiplied by context length and model architecture factors.
These are best measured with the actual runtime and kernels used in production, because compilation choices, attention kernels, and memory layout matter.
Then represent demand with these traffic measurements.
- Requests per second (RPS) by endpoint or product feature.
- Distribution of prompt tokens per request.
- Distribution of output tokens per request.
- Burst factor over time windows relevant to autoscaling and queue formation.
From these, compute a conservative capacity estimate.
- Required GPUs for prefill: (RPS × average prompt tokens) ÷ (prefill tokens/sec per GPU) × headroom
- Required GPUs for decode: (RPS × average output tokens) ÷ (decode tokens/sec per GPU) × headroom
Compute and memory constraints must both be satisfied, so the required GPU count is the maximum of the compute-based requirement and the memory-based requirement.
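The two formulas above can be sketched directly. The traffic and throughput numbers below are illustrative assumptions; whether prefill and decode share GPUs depends on the serving architecture, and summing the two requirements is a conservative choice for a shared fleet:

```python
# Sketch of the conservative capacity estimate above. In practice the
# throughput figures come from benchmarks on the production runtime and
# the traffic figures from logs; everything here is illustrative.
import math

def required_gpus(rps: float,
                  avg_prompt_tokens: float,
                  avg_output_tokens: float,
                  prefill_tps_per_gpu: float,
                  decode_tps_per_gpu: float,
                  memory_limited_gpus: int,
                  headroom: float = 1.3) -> int:
    """GPU count satisfying both compute- and memory-based requirements."""
    prefill_gpus = rps * avg_prompt_tokens / prefill_tps_per_gpu * headroom
    decode_gpus = rps * avg_output_tokens / decode_tps_per_gpu * headroom
    compute_gpus = math.ceil(prefill_gpus + decode_gpus)
    # Both constraints must hold, so take the maximum.
    return max(compute_gpus, memory_limited_gpus)

# Illustrative: 20 RPS, 1,500 prompt tokens, 300 output tokens,
# 12,000 prefill tok/s/GPU, 1,200 decode tok/s/GPU, 30% headroom.
print(required_gpus(20, 1500, 300, 12000, 1200, memory_limited_gpus=4))  # → 10
```

Note how decode dominates here despite being a fifth of the token volume: its per-GPU throughput is an order of magnitude lower.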
KV cache is the real concurrency limiter for many systems
The KV cache stores key and value vectors per layer for each token in the context for each active sequence. This state enables fast attention without recomputing the entire history each step. It is also the reason that serving capacity can collapse when context length rises.
A useful planning heuristic is that every concurrent request reserves a slice of GPU memory that grows with context length.
| Driver | Effect on KV cache | Operational consequence |
|---|---|---|
| Longer prompts | Larger cache at start | Lower safe concurrency from first token |
| Long output generation | Cache grows during decode | Concurrency shrinks over time in streaming workloads |
| Multi-turn chats | Cache persists across turns | Session stickiness increases memory pressure |
| Tool calls | Idle gaps while state stays resident | Memory held without token production |
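For a dense transformer, the per-request footprint behind the table above can be estimated from the architecture. The layer, head, and dimension values below are illustrative assumptions in the rough shape of a 7B-class model:

```python
# Sketch: per-request KV cache footprint and the concurrency ceiling it
# implies. Architecture numbers are illustrative, not a model spec.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # One K vector and one V vector per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_safe_concurrency(free_hbm_gb: float, context_len: int,
                         layers: int, kv_heads: int, head_dim: int) -> int:
    per_request = kv_bytes_per_token(layers, kv_heads, head_dim) * context_len
    return int(free_hbm_gb * 1e9 // per_request)

# 32 layers, 32 KV heads, head_dim 128, FP16: ~0.5 MB per token,
# so each 4k-context request reserves ~2 GiB of KV cache.
print(max_safe_concurrency(20, 4096, 32, 32, 128))   # 4k context  → 9
print(max_safe_concurrency(20, 32768, 32, 32, 128))  # 32k context → 1
```

The 4k-to-32k comparison is the collapse described above: an 8× longer context cuts safe concurrency by roughly 8× with no change in compute capacity.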
KV cache pressure creates secondary effects.
- Paging and eviction: if a runtime offloads KV cache to CPU memory, latency can spike because PCIe or interconnect bandwidth becomes part of the critical path.
- Fragmentation: memory allocators can fragment under variable sequence lengths, reducing usable capacity.
- Latency blowups: when the system hits a memory ceiling, it can degrade abruptly instead of gradually.
Serving capacity planning should treat KV cache as a first-class dimension, not a footnote.
Batching and queues: the throughput and latency tradeoff
Batching increases utilization by amortizing overhead and improving matrix multiplication efficiency. Queues form naturally when batching is used, because requests wait for a batch window to fill.
The design question is not whether to batch, but how.
- Static batching: a fixed batch size or fixed window. Simple and predictable, but can waste capacity during low load or violate latency during high load.
- Dynamic batching: batch within a time budget and shape constraints. Better utilization, but more complex and can create tail behavior if not bounded.
- Continuous batching: merge requests into a rolling schedule. Often used at the decode step, enabling higher throughput at moderate latency.
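A dynamic batcher with a time budget, the middle option above, can be sketched in a few lines. The queue interface and limits are illustrative; a production batcher would also enforce shape constraints:

```python
# Sketch of a dynamic batcher: collect requests until either the batch
# is full or the latency budget expires. Limits are illustrative.
import queue
import time

def collect_batch(q: "queue.Queue",
                  max_batch: int = 16,
                  budget_ms: float = 10.0) -> list:
    """Return up to max_batch requests, waiting at most budget_ms
    after the first request arrives."""
    batch = [q.get()]  # block for the first request; a batch of one is legal
    deadline = time.monotonic() + budget_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The budget is the bound that prevents the "tail behavior if not bounded" failure: no request waits longer than `budget_ms` for batch formation, whatever the load.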
Queueing discipline matters as much as batching choice.
- Separate prefill and decode queues can prevent decode latency from being dominated by prefill bursts.
- Priority classes can protect interactive traffic from bulk jobs.
- Admission control can preserve quality by rejecting or deferring work rather than letting the tail collapse.
A capacity plan that ignores queues is a plan that only holds at low utilization.
Tail latency: why averages mislead operators
User experience is governed by the slowest requests, not the average request. Tail latency is shaped by multiple mechanisms that compound each other.
- Long contexts that force larger KV cache and slower attention.
- Variability in output length: some prompts yield short completions while others produce long generations.
- Tool calls and retries that extend session duration.
- GPU scheduling effects, especially when sharing devices among models or tenants.
- Background maintenance and logging overhead that aligns with spikes.
A practical way to reason about tail latency is to track not only token throughput but queue waiting time distribution. If queue waiting becomes a material fraction of end-to-end latency, the system is operating too close to capacity for interactive traffic.
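That check can be automated. The sketch below flags when tail queue wait becomes a material share of tail latency; the 20% share threshold is an illustrative assumption to be tuned against the product's latency budget:

```python
# Sketch: alert when p99 queue wait is a material fraction of p99
# end-to-end latency. The share threshold is illustrative.

def percentile(values: list, q: float) -> float:
    """Empirical percentile via the nearest-rank method."""
    xs = sorted(values)
    idx = min(len(xs) - 1, int(q * len(xs)))
    return xs[idx]

def queue_wait_alert(wait_ms: list, e2e_ms: list,
                     q: float = 0.99, max_share: float = 0.2) -> bool:
    """True if the p99 queue wait exceeds max_share of p99 latency."""
    return percentile(wait_ms, q) > max_share * percentile(e2e_ms, q)
```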
Capacity planning as a cycle of measurement, modeling, and verification
Capacity planning becomes robust when treated as a repeating cycle.
- Measure: benchmark prefill throughput, decode throughput, and memory headroom on the actual serving stack.
- Model: translate traffic distributions into compute and memory requirements with explicit headroom.
- Verify: run load tests that match production distributions, including burst patterns, and compare observed queues and latency tails to the model.
- Correct: update assumptions and add safeguards such as admission control, routing, or cache policy.
This cycle prevents the common error of treating a single benchmark run as a forecast.
Load testing that resembles production
Load tests often fail because they do not resemble production behavior. A realistic test includes these characteristics.
- Mixed prompt lengths, including long-tail prompts that occur rarely but dominate worst-case behavior.
- Mixed output lengths, including generation-heavy flows.
- Concurrency patterns that mimic user activity: peaks, troughs, and correlated bursts.
- Stateful sessions when the product is conversational, because session memory alters concurrency and cache.
- Tool calls and retrieval, because external calls can extend session lifetimes and hold memory.
A test that uses uniform prompts and uniform outputs can dramatically overestimate capacity.
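One way to avoid that trap is to drive the load generator from long-tailed distributions rather than fixed sizes. The lognormal parameters below are illustrative assumptions; in practice they would be fitted to production token logs:

```python
# Sketch: sample synthetic requests with long-tailed prompt and output
# lengths instead of uniform sizes. Distribution parameters are
# illustrative; fit them to production logs.
import random

def sample_request(rng: random.Random) -> dict:
    """Draw one synthetic request with long-tailed token counts."""
    prompt = int(rng.lognormvariate(6.5, 1.0))   # median ~665 tokens, heavy tail
    output = int(rng.lognormvariate(5.0, 0.8))   # median ~148 tokens
    return {"prompt_tokens": prompt, "output_tokens": output}

def sample_workload(n: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded for reproducible load tests
    return [sample_request(rng) for _ in range(n)]
```

Seeding makes the test repeatable, so a capacity regression can be attributed to the system rather than to a different draw of traffic.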
Hardware sizing is never only about GPUs
GPUs are the visible line item, but serving capacity depends on the surrounding system.
- CPU: tokenization, request routing, compression, and postprocessing can become bottlenecks at high RPS.
- RAM: holds caches, routing tables, and sometimes offloaded KV cache. Memory pressure can create latency spikes.
- Storage: model weights and artifacts must load fast enough to support rolling updates.
- Networking: for multi-GPU or multi-node serving, interconnect latency and bandwidth can affect synchronization, cache traffic, and cross-node routing.
- Power and thermal envelope: sustained serving loads can behave differently from training loads and can trigger throttling if cooling is insufficient.
A complete plan includes these resources because they determine whether GPU capacity can be converted into end-to-end throughput.
Risk management: the margins that keep systems honest
A sizing number without a margin is a promise the system cannot keep. The right margin depends on the product.
Common margin drivers include:
- Feature drift: product changes that increase context length or generation length.
- Model iteration: moving from one model to another with different compute characteristics.
- Traffic uncertainty: marketing events, integrations, or seasonal peaks.
- Runtime changes: kernel updates, compiler shifts, and driver changes that affect throughput.
Margins can be implemented in more than one way.
- Pure headroom: provision more GPUs than the model requires.
- Policy margins: enforce maximum context, maximum output, or stricter routing when load rises.
- Tiered service: degrade gracefully by switching to cheaper models for lower priority traffic.
- Queue limits: cap queue depth to prevent the system from amplifying an incident.
The key is to make the margin explicit and test that it works in practice.
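Queue limits and tiered service combine naturally in a small admission-control policy. A sketch, with illustrative thresholds and priority labels chosen for the example:

```python
# Sketch: admission control that caps queue depth and defers bulk work
# instead of letting the tail collapse. Policy numbers are illustrative.

def admit(queue_depth: int, priority: str,
          soft_limit: int = 50, hard_limit: int = 100) -> str:
    """Return 'accept', 'defer', or 'reject' for an incoming request."""
    if queue_depth >= hard_limit:
        return "reject"    # cap depth so an incident cannot amplify itself
    if queue_depth >= soft_limit and priority != "interactive":
        return "defer"     # bulk work waits; interactive traffic is protected
    return "accept"
```

Because the policy is explicit, it can be load-tested like any other capacity assumption rather than discovered during an incident.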
The infrastructure consequences: why serving sizing is a strategic capability
Accurate capacity planning affects more than reliability.
- It determines cost per request and cost per token.
- It affects release velocity because canary rollouts require spare capacity.
- It influences product design choices, such as whether longer contexts are a default experience.
- It shapes competitive advantage because stable low latency at scale is a differentiator.
Serving hardware sizing is not a one-time procurement decision. It is a recurring operational capability that links product ambition to infrastructure reality.