Memory Hierarchy: HBM, VRAM, RAM, Storage

Memory hierarchy is the quiet governor of AI performance. Compute can scale fast, but data still has to arrive on time, in the right format, and in the right place. When memory movement is cheap, models feel effortless. When it is expensive, the same model becomes a slow, unstable cost center: GPUs idle, latency spikes, and teams argue about “utilization” without agreeing on what is being limited.

The practical view is simple: every level of memory buys speed by sacrificing capacity and price. The hierarchy works because most workloads reuse small portions of data repeatedly. AI workloads are demanding because they often mix large working sets with reuse patterns that change across training, evaluation, and serving. Building reliable systems means learning to spot when the hierarchy is helping and when it is being forced to do something it cannot do.


A Mental Model That Predicts What Breaks

A helpful starting point is to separate two questions that often get blended:

  • How much computation is required for a step or a token.
  • How much data must move to make that computation possible.

If the compute requirement dominates, faster chips and better kernels win. If the data movement dominates, most “speedups” move the bottleneck around rather than removing it. Memory hierarchy problems show up as the same symptoms across many stacks: high GPU power draw with low achieved throughput, smooth averages with terrible tail latency, and “OOM” failures that appear inconsistent because fragmentation and allocation behavior change over time.
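The compute-versus-movement split can be checked with a back-of-envelope roofline comparison: arithmetic intensity (FLOPs per byte moved) against machine balance (peak FLOP/s per byte/s of bandwidth). A minimal sketch, where the peak figures are illustrative assumptions rather than measurements of any specific device:

```python
def limiting_factor(flops, bytes_moved, peak_flops, peak_bw):
    """Classify a kernel as compute-bound or memory-bound by comparing
    its arithmetic intensity (FLOPs per byte) against the machine
    balance (peak FLOP/s per byte/s of memory bandwidth)."""
    intensity = flops / bytes_moved
    balance = peak_flops / peak_bw
    return "compute-bound" if intensity >= balance else "memory-bound"

# Illustrative peaks only (assumed): ~100 TFLOP/s compute, ~2 TB/s HBM bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 2e12

# A large matmul reuses each loaded byte many times -> compute-bound.
print(limiting_factor(1e12, 1e9, PEAK_FLOPS, PEAK_BW))
# An elementwise op touches each byte roughly once -> memory-bound.
print(limiting_factor(1e9, 1e9, PEAK_FLOPS, PEAK_BW))
```

When a kernel lands on the memory-bound side of this line, faster chips mostly raise the unreachable ceiling; only moving fewer bytes, or reusing them more, changes the outcome.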

Another useful split is “resident” versus “streamed” data:

  • Resident data stays close to the compute unit for long enough that repeated reuse amortizes the cost of loading it.
  • Streamed data is read once or a few times, and the system must keep feeding the compute unit continuously.

AI training tries to make weights and activations resident as much as possible. AI inference tries to make weights resident and to keep attention cache behavior predictable as context grows. Data pipelines often behave like pure streaming, which makes storage and IO decisions as important as GPU choice.
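The resident-versus-streamed distinction is really about amortization: a load is cheap per access only if it is reused enough times. A hypothetical cost model with assumed latencies makes this concrete:

```python
def effective_cost_per_use(load_cost_us, fast_access_cost_us, reuse_count):
    """Average cost of one access when a block is loaded once into a
    fast layer and then reused `reuse_count` times before eviction."""
    return load_cost_us / reuse_count + fast_access_cost_us

# Illustrative numbers (assumed): loading costs 100 us, a fast-layer access 0.1 us.
resident = effective_cost_per_use(100.0, 0.1, 1000)  # load amortized: ~0.2 us/access
streamed = effective_cost_per_use(100.0, 0.1, 1)     # full load paid every time
```

Resident data pays the load once and the fast path thereafter; streamed data pays the full transfer on every use, which is why streaming workloads stand or fall on sustained bandwidth rather than latency.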

The Layers of the Hierarchy Without Marketing Names

Different platforms slice the hierarchy differently, but the functional roles are consistent. The table uses relative language because exact numbers vary by generation, configuration, and workload.

| Layer | Role | Typical Strength | Typical Failure Mode |
| --- | --- | --- | --- |
| On-chip registers and small caches | Keep the hottest values next to compute | Extremely low latency | Thrash when reuse is low or access is irregular |
| Shared memory and mid-level caches | Stage data for reuse across many threads | High bandwidth with predictable access | Bank conflicts, cache misses, stalled warps |
| High-bandwidth device memory (HBM or VRAM) | Hold model weights, activations, and caches on the accelerator | Massive bandwidth | Capacity pressure, fragmentation, paging |
| Host system RAM | Spillover, staging buffers, dataset preprocessing | Large capacity | Transfer bottlenecks across the host-device link |
| Local fast storage (NVMe SSD) | Checkpoints, sharded datasets, spill files | Good sequential throughput | Random IO slowdown, queue depth sensitivity |
| Networked storage and object stores | Shared datasets, long-term artifacts | Scale and durability | Latency variability, throttling, contention |

Two links tie the whole hierarchy together:

  • The host-device link (often PCIe, sometimes combined with higher-speed interconnect inside a node).
  • The network between nodes (used for distributed training, remote data, and service-to-service calls).

When those links become the limiting factor, “more GPU” often increases spending faster than it increases throughput.
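A quick way to see why the link matters is to compare how long the same bytes take to move across the host-device link versus out of device memory. The bandwidth figures and model size below are illustrative assumptions:

```python
def transfer_seconds(num_bytes, bw_bytes_per_s):
    """Lower-bound transfer time at a sustained bandwidth (ignores latency)."""
    return num_bytes / bw_bytes_per_s

weights_bytes = 14e9   # e.g. a 7B-parameter model at 2 bytes/param (assumed)
link_bw = 32e9         # ~32 GB/s, a PCIe-class host-device link (assumed)
hbm_bw = 2e12          # ~2 TB/s of on-device bandwidth (assumed)

over_link = transfer_seconds(weights_bytes, link_bw)  # ~0.44 s
from_hbm = transfer_seconds(weights_bytes, hbm_bw)    # ~0.007 s
```

A roughly 60x gap between the two paths is why non-resident weights turn an inference server into a transfer service: every step that crosses the link pays the slow path.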

Bandwidth, Latency, and Working Sets

Bandwidth and latency are different kinds of pain.

  • Bandwidth limits show up as a ceiling: throughput stops increasing after a point, and adding parallelism does not help.
  • Latency limits show up as jitter: p99 or p999 latency grows, services feel inconsistent, and retries multiply the load.

Working set size is the bridge between the two. A working set is the portion of data the computation needs close by during a phase. When the working set fits in a fast layer, reuse is cheap and predictable. When it does not, the system pays transfer costs repeatedly.

Three working sets matter most in modern AI:

  • Weights working set: model parameters that must be read for each token or batch.
  • Activation working set: intermediate values needed for backpropagation during training.
  • Attention cache working set: the key-value cache that grows with context length in many transformer-style decoders.

The attention cache is why serving can behave beautifully for short prompts and then fall apart for long contexts. Capacity is not the only issue: the access pattern changes, and memory traffic grows even when the model weights stay constant.
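The cache's growth can be estimated from the model shape. A rough sketch, using an assumed decoder configuration for illustration (32 layers, 8 KV heads, head dimension 128, 2-byte values):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Rough KV cache size: keys and values (the factor of 2) stored
    per layer, per KV head, per token, per sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative shape (assumed): at 32k context and batch 8, the cache
# alone is ~34 GB, rivaling the weights it sits next to.
size = kv_cache_bytes(32, 8, 128, seq_len=32_768, batch=8, dtype_bytes=2)
print(size / 1e9)  # ~34.4 GB
```

The linear terms for sequence length and batch size are the trap: doubling either doubles the cache, with no change to the model itself.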

Training: The Memory Shape Is Bigger Than the Model

Training workloads are memory-intensive in ways that surprise teams who only think in “parameter count.” The parameter count is only the beginning. Training also carries:

  • Gradients, often stored at higher precision than weights.
  • Optimizer state, which can be multiple copies of weights depending on the optimizer.
  • Activations for backprop, which can dominate memory on large sequences or deep networks.
  • Communication buffers for distributed training.

This is why two models with similar parameter counts can have very different training footprints. Sequence length, microbatch size, layer types, and checkpointing strategy can shift the activation working set by multiples.
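As a rough accounting sketch (a common rule of thumb for mixed-precision training with an Adam-style optimizer, not an exact figure for any framework), persistent state alone is often around 16 bytes per parameter before activations are counted:

```python
def training_state_bytes(n_params,
                         weight_bytes=2,   # fp16/bf16 working weights
                         grad_bytes=2,     # fp16/bf16 gradients
                         master_bytes=4,   # fp32 master copy of weights
                         optim_bytes=8):   # two fp32 Adam-style moments
    """Persistent per-parameter training state, excluding activations
    and communication buffers (the byte counts are typical assumptions)."""
    per_param = weight_bytes + grad_bytes + master_bytes + optim_bytes
    return n_params * per_param

# A 7e9-parameter model: ~112 GB of state before a single activation is stored.
print(training_state_bytes(7e9) / 1e9)
```

Activations then add a term that scales with sequence length and microbatch size, which is why the techniques below exist at all.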

Several practical techniques exist to keep training inside the fastest layers:

  • Mixed precision reduces weight and activation storage while keeping selected accumulations stable.
  • Gradient accumulation trades time for memory by splitting large batches into multiple microbatches.
  • Activation checkpointing recomputes portions of the forward pass during backprop to reduce stored activations.
  • Sharding strategies split weights, gradients, and optimizer state across devices so no single GPU holds everything.

Each technique changes the memory profile, but each also changes the throughput profile. Saving memory by recomputing increases compute. Sharding increases communication. A stable configuration is one where the memory savings do not introduce a larger bottleneck elsewhere, especially in interconnect or network bandwidth.
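Gradient accumulation, for example, can be sketched framework-free. For a mean loss over equally sized microbatches, averaging per-microbatch gradients reproduces the full-batch gradient, so the large batch never has to be resident at once; the scalar model below is purely illustrative:

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for a toy scalar model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro=2):
    """Average per-microbatch gradients instead of holding one big batch.
    Matches the full-batch gradient when microbatches are equal-sized."""
    grads = [grad_mse(w, xs[i:i + micro], ys[i:i + micro])
             for i in range(0, len(xs), micro)]
    return sum(grads) / len(grads)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
# Full-batch and accumulated gradients agree; only peak memory differs.
print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys))
```

The cost is time, not correctness: the same arithmetic runs in smaller resident chunks, which is exactly the trade the surrounding paragraph describes.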

Inference: Weights, KV Cache, and the Cost of Context

Serving shifts the problem. Training is often throughput-limited over long runs. Serving is frequently tail-latency-limited under bursty traffic. Memory hierarchy influences both.

Serving has a simple first-order goal: keep the model weights resident on the accelerator and keep per-request overhead low. The moment weights are not resident, performance collapses, and the system becomes a transfer service rather than an inference service.

The second-order goal is attention cache discipline. Many decoders store key and value tensors for each layer and token. The cache grows roughly with:

  • number of layers
  • hidden size and attention heads
  • sequence length
  • batch size

That growth creates two coupled constraints: capacity and bandwidth. Even if capacity is sufficient, bandwidth can become the limiter because each new token requires reading and writing cache segments. That is where long-context requests can degrade the entire service if not isolated.
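The bandwidth side of the constraint yields a useful upper bound: each decoded token must at minimum read the weights once plus the live KV cache once, so peak memory bandwidth caps tokens per second regardless of compute. The figures below are illustrative assumptions:

```python
def decode_tokens_per_s_bound(weight_bytes, kv_bytes, peak_bw):
    """Upper bound on single-stream decode throughput when every token
    reads all weights plus the current KV cache from device memory."""
    return peak_bw / (weight_bytes + kv_bytes)

hbm_bw = 2e12     # ~2 TB/s device bandwidth (assumed)
weights = 14e9    # 7B params at 2 bytes/param (assumed)

short_ctx = decode_tokens_per_s_bound(weights, 1e9, hbm_bw)   # ~133 tok/s
long_ctx = decode_tokens_per_s_bound(weights, 30e9, hbm_bw)   # ~45 tok/s
```

The weights term is identical in both cases; only the cache grew, and the bound still fell by roughly 3x. That is the mechanism behind long-context requests dragging down a whole service.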

Practical serving stacks use techniques such as:

  • Prefill and decode separation, because prefill is heavier and has different parallelism.
  • Paged attention or segmented KV cache layouts to reduce fragmentation and improve locality.
  • Quantization formats that shrink weights and sometimes cache, trading accuracy and kernel complexity for capacity and bandwidth relief.
  • Prefix caching for repeated prompts in workflows, which reduces both compute and memory traffic for common prefixes.

The common trap is to treat the cache as an internal detail. In production it becomes a first-class resource that needs admission control, isolation, and explicit budgeting.
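Treating the cache as a first-class resource can start as simply as an admission check against an explicit byte budget. The helper below is a hypothetical sketch of that idea, not any particular serving stack's API:

```python
class KVBudget:
    """Tracks a fixed KV-cache byte budget and admits requests only
    when their worst-case cache growth still fits (hypothetical sketch)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0

    def try_admit(self, max_tokens, bytes_per_token):
        """Reserve worst-case cache space up front; reject if it won't fit."""
        need = max_tokens * bytes_per_token
        if self.used + need > self.capacity:
            return False  # queue or reject instead of risking a mid-stream OOM
        self.used += need
        return True

    def release(self, max_tokens, bytes_per_token):
        """Return a finished request's reservation to the pool."""
        self.used -= max_tokens * bytes_per_token
```

Budgeting on the worst case (`max_tokens`) is deliberately conservative; the point is that a long-context request gets rejected or queued at admission time, before it can evict or starve everyone else's cache.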

Storage and IO: Where GPU Time Gets Wasted Off-Device

Storage choices decide whether GPUs spend time computing or waiting.

A strong data pipeline has three properties:

  • It can deliver batches steadily at the throughput the accelerators need.
  • It can handle burst and recovery without repeated full restarts.
  • It is observable enough that “training is slow” can be traced to a specific stage.

Local NVMe is often a hidden hero. It provides a staging layer that reduces dependence on networked storage during training. Checkpoints and dataset shards can be read and written with predictable throughput. When that layer is missing, training jobs can compete on shared network links and shared storage backends, causing periodic slowdowns that look like “random” training instability.

IO bottlenecks also show up in evaluation and batch inference. The compute path might be fast, but data decoding, tokenization, feature extraction, or serialization can dominate. The memory hierarchy lens helps because it reframes the question: how often is data being decoded and copied between layers, and how many times are the same bytes being transformed?

Practical moves that consistently help:

  • Use columnar or chunk-friendly formats when scanning large datasets.
  • Reduce repeated parsing by caching pre-tokenized or pre-processed artifacts when it is safe.
  • Prefer streaming pipelines that keep data in a form close to what the model consumes.
  • Track queue depth and per-stage latency to detect backpressure early.

High utilization is not a badge. It is a signal. The goal is stable throughput at predictable cost, not a single metric at 100 percent while everything else burns.

Diagnostics: Finding the True Limiter

Memory problems are often misdiagnosed as “GPU issues” or “network issues” because symptoms overlap. A simple diagnostic posture separates the system into stages and asks where time is spent.

Common indicators that memory hierarchy is the limiter:

  • GPU utilization looks high in short bursts but low on average, with frequent stalls.
  • Achieved memory bandwidth is near peak while compute units are underused.
  • Host-to-device transfers spike during phases that should be compute-heavy.
  • Tail latency increases with longer contexts even when batch sizes are small.
  • OOM errors appear after long runs due to fragmentation or leaked allocations.

Useful measurements come from multiple layers:

  • Device counters for memory throughput and cache behavior.
  • Host metrics for page faults, swap activity, and disk throughput.
  • Service metrics for queue time, batching behavior, and tail latency.
  • Pipeline metrics for per-stage batch readiness and backpressure.

The discipline is to treat “memory” as a multi-layer system rather than a single capacity number. Once the limiting layer is identified, fixes become concrete: move data closer, reduce the working set, increase reuse, or change the parallelism scheme so the link is not saturated.

Design Rules That Hold Up Under Real Load

A few rules stay useful across hardware generations:

  • Keep the dominant working sets resident in the fastest layer available.
  • When something must be streamed, make it sequential and predictable.
  • Budget memory not only for steady state but for bursts, retries, and long contexts.
  • Treat cache growth as a resource that needs governance, not as a hidden detail.
  • Make transfers visible in metrics, not only in profiler screenshots.
  • Prefer designs that degrade gracefully when memory pressure increases.

The memory hierarchy is not an academic concept. It is the difference between a system that behaves like a product and a system that behaves like a demo. When the hierarchy is respected, scaling becomes a controlled engineering problem. When it is ignored, costs rise while reliability falls.
