NUMA and PCIe Topology: Device Placement for GPU Workloads
AI workloads move huge volumes of data through a machine that was not built as a single, uniform pool of resources. Modern servers are mosaics: multiple CPU sockets, multiple memory controllers, multiple PCIe root complexes, and often multiple layers of switches between devices. Two GPUs in the same chassis can be separated by a topology that behaves like a long hallway with narrow doors. If you place work without caring about that hallway, you pay with latency, wasted CPU cycles, and underutilized accelerators.
Topology-aware placement is the discipline of aligning compute, memory, and IO paths so that the expensive parts of the system spend time doing useful work rather than waiting on cross-socket transfers and congested links.
The mental model: locality is a performance budget
In a single-socket desktop, “memory” often feels like one thing. In a multi-socket server, memory is local to a socket first. Accessing remote memory can be significantly slower and can consume inter-socket bandwidth that other threads also depend on.
The same idea applies to PCIe. Devices hang off root complexes and switches. A GPU may be “near” one CPU socket in the sense that DMA paths and interrupts are serviced most efficiently on that socket. Another GPU in the same box may be “near” the other socket. If your process runs on socket A but feeds a GPU attached to socket B, the machine spends time moving bytes across internal links before the GPU ever sees them.
The simplest rule is that every long path becomes visible at scale.
- Tokenization and preprocessing that crosses sockets increases per-batch overhead.
- Host-to-device copies that cross sockets consume memory bandwidth twice.
- Network traffic that lands on a NIC far from the GPU adds latency and burns CPU.
- Peer-to-peer GPU traffic can be fast or slow depending on whether it traverses a favorable fabric path.
Placement is not about perfection. It is about avoiding the largest self-inflicted penalties.
NUMA basics that show up in AI systems
NUMA describes the reality that memory access time depends on where a thread runs relative to where memory is allocated.
AI workloads hit NUMA pain in predictable ways.
- Data loading and preprocessing
- CPU threads allocate and touch buffers that later get copied to GPUs.
- If those buffers are allocated on the wrong socket, copies become remote memory traffic.
- Communication stacks
- Network and interprocess communication can spend significant CPU time in hot loops.
- If these loops run on a remote socket relative to the NIC or GPU, overhead increases.
- Scheduler churn
- If processes are moved between sockets, caches are cold and memory locality breaks.
- Multi-process training
- One process per GPU can accidentally produce cross-socket behavior if affinity is not controlled.
NUMA is not a problem only for massive clusters. It can be the difference between stable throughput and constant jitter on a single high-end server.
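To reason about any of this, you first need to know which CPUs belong to which node. On Linux, the kernel exposes the layout under /sys/devices/system/node; each node directory contains a cpulist file in a compact range format. A minimal sketch of a parser for that format (the helper name is ours, and the sysfs path in the comment assumes a Linux host):

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-15,32-47" into a set of CPU ids.

    This is the format used by files like
    /sys/devices/system/node/node0/cpulist, which lists the CPUs
    local to NUMA node 0."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# On a Linux host you would read the real sysfs file; a literal stands in here:
print(sorted(parse_cpulist("0-3,16-19")))  # [0, 1, 2, 3, 16, 17, 18, 19]
```

With the node-to-CPU map in hand, every affinity decision below reduces to picking the right set of CPU ids.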
PCIe topology: where GPUs and NICs really live
PCIe is a fabric with lanes and switches. Its performance is shaped by:
- Lane count and link generation
- Switch topology and oversubscription
- Root complex placement relative to CPU sockets
- Peer-to-peer capabilities between devices
- Shared links that become congested when multiple devices talk at once
Two common placement failures show up repeatedly.
- Cross-socket feeding
- A process runs on socket A, but the GPU is attached to socket B. DMA and interrupts bounce across sockets.
- Shared uplink contention
- Multiple GPUs share a switch uplink that becomes a bottleneck during heavy transfers or collective operations.
These failures can hide behind superficially healthy metrics. GPU utilization may look fine until traffic spikes, then p99 latency worsens or training step time becomes unstable. Topology is often the missing explanation.
Placement goals differ for training and inference
Training often cares about synchronized throughput. Inference often cares about tail latency and predictability.
- Training placement goals
- Keep each GPU fed consistently.
- Keep collective communication efficient by grouping GPUs with strong peer-to-peer paths.
- Keep data pipeline threads on the socket nearest to the GPUs they feed.
- Inference placement goals
- Keep request handling threads near the NICs and GPUs to reduce overhead.
- Avoid cross-socket paths whose added microseconds compound into worse p99 latency.
- Keep memory allocations stable to avoid jitter from remote memory traffic.
Both benefit from locality, but they measure success differently.
CPU affinity and pinning as baseline hygiene
If you do nothing, the operating system will try to be fair. Fairness can destroy locality.
The baseline hygiene is to make the placement explicit.
- Pin CPU threads that feed a GPU to the socket closest to that GPU.
- Pin network processing threads near the NIC used for that workload.
- Avoid frequent process migration between sockets.
- Keep noisy background work away from the cores that serve latency-critical paths.
The objective is not to squeeze out a tiny gain. The objective is to remove variance and prevent the worst-case path from becoming normal.
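On Linux, making placement explicit from inside a process is one call. A minimal sketch (os.sched_setaffinity is Linux-only; production setups often reach for numactl or taskset instead, and the no-op example below only shows the call shape):

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process (pid 0 = the process itself) to the given
    CPU ids. Linux-only: os.sched_setaffinity wraps sched_setaffinity(2)."""
    os.sched_setaffinity(0, cpus)

if hasattr(os, "sched_getaffinity"):
    # Re-pin to the CPUs we already have: a no-op that demonstrates the call.
    # In practice you would pass the CPU set of the socket nearest the GPU
    # this process feeds (e.g. derived from sysfs as shown earlier).
    current = os.sched_getaffinity(0)
    pin_to_cpus(current)
    assert os.sched_getaffinity(0) == current
```

The same call made once at worker startup, with the right CPU set per GPU, removes most of the migration and cross-socket variance described above.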
Memory affinity: where buffers are born matters
A common topology tax happens during host-to-device transfer. The CPU creates a batch buffer, then the GPU reads it via DMA. If the buffer is allocated on the wrong socket, the data must move across sockets before it can be transferred, and then the GPU reads it. The same bytes cross internal links multiple times.
A locality-friendly pattern is:
- Allocate and touch buffers on the same socket that will perform the transfer.
- Use thread placement so that “first touch” aligns with the intended socket.
- Keep allocation churn low to reduce fragmentation and remote allocation fallback.
- When using pinned memory, be deliberate about where pinned pages are located.
Pinned memory improves DMA behavior but can make locality mistakes more expensive, because pinned pages cannot be moved as easily by the OS. That is why pinned allocations should be controlled and measured rather than sprayed across a process.
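The first-touch idea can be sketched in plain Python: the thread that will perform host-to-device staging is the one that allocates the buffer and writes to each page, so Linux's default first-touch policy faults the pages onto that thread's node. The page-size constant and helper are illustrative; real code would query os.sysconf("SC_PAGE_SIZE") and pin the staging thread before allocating:

```python
import threading

PAGE = 4096  # illustrative; real code would use os.sysconf("SC_PAGE_SIZE")

def alloc_and_touch(nbytes):
    """Allocate a buffer and write one byte per page so every page is faulted
    in by the calling thread. Under Linux's default first-touch policy, the
    pages then live on the NUMA node where that thread is running."""
    buf = bytearray(nbytes)
    for off in range(0, nbytes, PAGE):
        buf[off] = 0
    return buf

# The staging thread (already pinned to the GPU's socket in real code) both
# allocates and touches, so the buffer ends up local to its node.
staged = []
t = threading.Thread(target=lambda: staged.append(alloc_and_touch(1 << 20)))
t.start()
t.join()
assert len(staged[0]) == 1 << 20
```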
Multi-GPU topology: grouping that matches the fabric
Not all multi-GPU sets behave the same. Some pairs have strong peer-to-peer connectivity. Some pairs traverse slower paths.
A topology-aware placement strategy often includes:
- Group GPUs that share the best peer-to-peer paths for synchronized work.
- Prefer placing tightly coupled shards on GPUs that share the strongest interconnect.
- Avoid splitting a single tightly synchronized job across distant topology islands when you can keep it within one island.
- If a split is unavoidable, adjust the partition strategy so that the highest-traffic paths remain local and only lower-traffic coordination crosses the weaker links.
The same idea applies to multi-tenant allocation. If you give a tenant a set of GPUs that spans topology islands, you have quietly lowered their effective performance, even though they received the “right number” of devices.
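The grouping rule above can be expressed as a small allocation policy. This sketch assumes a hypothetical precomputed partition of GPU ids into "islands" (for example, one PCIe switch or one NVLink clique per island) and prefers the tightest island that still fits, so larger islands stay free for larger jobs:

```python
def pick_island(islands, ngpus):
    """Pick a set of ngpus GPU ids that stays inside one topology island.

    islands: list of sets of GPU ids, each set sharing the strongest
    interconnect. Returns None when no single island fits, in which case
    the caller must split across islands and adapt its sharding so that
    only low-traffic coordination crosses the weaker links."""
    candidates = [i for i in islands if len(i) >= ngpus]
    if not candidates:
        return None
    best = min(candidates, key=len)  # tightest fit preserves big islands
    return set(sorted(best)[:ngpus])

islands = [{0, 1, 2, 3}, {4, 5}]
print(pick_island(islands, 2))  # {4, 5}: fits in the smaller island
print(pick_island(islands, 3))  # {0, 1, 2}
print(pick_island(islands, 5))  # None: no single island is big enough
```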
NIC placement and data movement
Many AI workloads depend on fast networking. Where the NIC sits relative to GPUs and CPU cores matters.
A locality-aware pattern is:
- Keep request handling threads near the NIC to minimize interrupt and kernel overhead.
- Keep GPU-facing copy and staging threads near the GPU’s socket.
- Avoid NIC-to-GPU paths that cross sockets when lower-latency paths exist.
- For workloads that use kernel-bypass networking, align the user-space networking threads with the NIC locality and the GPU locality when possible.
This is where topology and networking become one system. Link speed is not the only metric. The path inside the box matters.
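Matching NICs to GPUs falls out of the same device-to-node map. A sketch, assuming hypothetical dictionaries from device to the NUMA node it reports (for example via the sysfs lookup shown earlier):

```python
def colocated_pairs(nic_nodes, gpu_nodes):
    """Pair each NIC with the GPUs on its own NUMA node, so NIC-to-GPU
    staging never has to cross sockets.

    nic_nodes: {nic_name: numa_node}, gpu_nodes: {gpu_id: numa_node}."""
    pairs = []
    for nic, nic_node in nic_nodes.items():
        for gpu, gpu_node in gpu_nodes.items():
            if nic_node == gpu_node:
                pairs.append((nic, gpu))
    return pairs

print(colocated_pairs({"eth0": 0, "eth1": 1}, {0: 0, 1: 0, 2: 1, 3: 1}))
# [('eth0', 0), ('eth0', 1), ('eth1', 2), ('eth1', 3)]
```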
Topology-aware scheduling: making placement a platform capability
Manual pinning is workable for a single team. At scale, placement must become platform policy.
A topology-aware scheduler should be able to:
- Discover hardware topology and expose it as resources
- Place jobs so that GPU sets are topology-consistent
- Bind CPU and memory resources to match GPU placement
- Reserve NIC locality where relevant
- Enforce fairness without creating hidden topology penalties
This is not an academic feature. It directly affects cost. A job that runs at lower throughput because of topology is a job that consumes more device time for the same output.
For the orchestration layer, see Cluster Scheduling and Job Orchestration and Multi-Tenancy Isolation and Resource Fairness.
Diagnosing topology problems without guesswork
Topology problems have signatures.
- GPU utilization dips when transfers spike
- Step time variance increases while average utilization looks acceptable
- CPU utilization increases on one socket while the other is underused
- Remote memory access counters rise during heavy pipeline stages
- PCIe throughput appears capped below expected levels
- Latency tail worsens without a clear software change
The fastest way to validate topology suspicion is to compare two placements.
- Same workload pinned locally
- Same workload pinned cross-socket
If performance changes materially with placement alone, the topology tax is real. That gives you a clear direction: make placement explicit and enforce it in the scheduler.
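The A/B comparison only needs a small summary over the two runs' step times. A sketch of that summary (the helper names and the tail index are ours; any percentile convention works as long as both runs use the same one):

```python
import statistics

def placement_delta(local_ms, remote_ms):
    """Compare step times from two runs of the same workload: one pinned to
    the GPU-local socket, one pinned cross-socket. Reports the median
    slowdown of the remote placement plus the tail of each run."""
    def p99(xs):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(0.99 * len(xs)))]
    return {
        "median_slowdown": statistics.median(remote_ms) / statistics.median(local_ms),
        "local_p99": p99(local_ms),
        "remote_p99": p99(remote_ms),
    }

r = placement_delta([10.0, 10.2, 10.1], [12.0, 12.4, 15.0])
assert r["median_slowdown"] > 1.1  # a material gap: the topology tax is real
```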
Containers, virtualization, and the illusion of uniform resources
Containers and virtualization can hide hardware, but they cannot remove topology. If the platform abstracts devices without providing topology-aware placement, tenants will see unpredictable behavior.
A stable platform provides:
- Clear device assignment
- CPU and memory affinity aligned with the device
- Limits that prevent noisy neighbors from stealing host resources
- Monitoring that reveals when placement is suboptimal
See Virtualization and Containers for AI Workloads for the operational layer that often determines whether topology work holds under real multi-tenant pressure.
What good looks like
Topology-aware placement is “good” when it produces throughput and latency that are stable across time, not only on a good day.
- Jobs are placed on topology-consistent GPU sets.
- CPU and memory affinity match device locality.
- NIC placement is aligned with the critical data paths.
- Monitoring shows low remote memory traffic for GPU-feeding paths.
- Performance is predictable enough that capacity planning can use real measurements.
When AI becomes infrastructure, the machine is not a black box. It is a topology. Placement is how you respect it.
- Hardware, Compute, and Systems Overview
- Nearby topics in this pillar
- Memory Hierarchy: HBM, VRAM, RAM, Storage
- Interconnects and Networking: Cluster Fabrics
- Virtualization and Containers for AI Workloads
- Cluster Scheduling and Job Orchestration
- Cross-category connections
- Capacity Planning and Load Testing for AI Services: Tokens, Concurrency, and Queues
- Monitoring: Latency, Cost, Quality, Safety Metrics
- Series and navigation
- Infrastructure Shift Briefs
- Tool Stack Spotlights
- AI Topics Index
- Glossary
