NUMA and PCIe Topology: Device Placement for GPU Workloads
AI workloads move huge volumes of data through a machine that was not built as a single, uniform pool of resources. Modern servers are mosaics: multiple CPU sockets, multiple memory controllers, multiple PCIe root complexes, and often multiple layers of switches between devices. Two GPUs in the same chassis can be separated by a topology that behaves like a long hallway with narrow doors. If you place work without caring about that hallway, you pay with latency, wasted CPU cycles, and underutilized accelerators.
Topology-aware placement is the discipline of aligning compute, memory, and IO paths so that the expensive parts of the system spend time doing useful work rather than waiting on cross-socket transfers and congested links.
The mental model: locality is a performance budget
In a single-socket desktop, “memory” often feels like one thing. In a multi-socket server, memory is local to a socket first. Accessing remote memory can be significantly slower and can consume inter-socket bandwidth that other threads also depend on.
The same idea applies to PCIe. Devices hang off root complexes and switches. A GPU may be “near” one CPU socket in the sense that DMA paths and interrupts are serviced most efficiently on that socket. Another GPU in the same box may be “near” the other socket. If your process runs on socket A but feeds a GPU attached to socket B, the machine spends time moving bytes across internal links before the GPU ever sees them.
The simplest rule is that every long path becomes visible at scale.
- Tokenization and preprocessing that crosses sockets increases per-batch overhead.
- Host-to-device copies that cross sockets consume memory bandwidth twice.
- Network traffic that lands on a NIC far from the GPU adds latency and burns CPU.
- Peer-to-peer GPU traffic can be fast or slow depending on whether it traverses a favorable fabric path.
Placement is not about perfection. It is about avoiding the largest self-inflicted penalties.
NUMA basics that show up in AI systems
NUMA describes the reality that memory access time depends on where a thread runs relative to where memory is allocated.
AI workloads hit NUMA pain in predictable ways.
- Data loading and preprocessing
- CPU threads allocate and touch buffers that later get copied to GPUs.
- If those buffers are allocated on the wrong socket, copies become remote memory traffic.
- Communication stacks
- Network and interprocess communication can spend significant CPU time in hot loops.
- If these loops run on a remote socket relative to the NIC or GPU, overhead increases.
- Scheduler churn
- If processes are moved between sockets, caches are cold and memory locality breaks.
- Multi-process training
- One process per GPU can accidentally produce cross-socket behavior if affinity is not controlled.
NUMA is not a problem only for massive clusters. It can be the difference between stable throughput and constant jitter on a single high-end server.
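To reason about any of this, you first need to know which CPUs belong to which node. On Linux, the kernel exposes the layout under /sys/devices/system/node; each node directory contains a cpulist file in a compact range format. A minimal sketch of a parser for that format (the helper name is ours, and the sysfs path in the comment assumes a Linux host):

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-15,32-47" into a set of CPU ids.

    This is the format used by files like
    /sys/devices/system/node/node0/cpulist, which lists the CPUs
    local to NUMA node 0."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# On a Linux host you would read the real sysfs file; a literal stands in here:
print(sorted(parse_cpulist("0-3,16-19")))  # [0, 1, 2, 3, 16, 17, 18, 19]
```

With the node-to-CPU map in hand, every affinity decision below reduces to picking the right set of CPU ids.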
PCIe topology: where GPUs and NICs really live
PCIe is a fabric with lanes and switches. Its performance is shaped by:
- Lane count and link generation
- Switch topology and oversubscription
- Root complex placement relative to CPU sockets
- Peer-to-peer capabilities between devices
- Shared links that become congested when multiple devices talk at once
Two common placement failures show up repeatedly.
- Cross-socket feeding
- A process runs on socket A, but the GPU is attached to socket B. DMA and interrupts bounce across sockets.
- Shared uplink contention
- Multiple GPUs share a switch uplink that becomes a bottleneck during heavy transfers or collective operations.
These failures can hide behind superficially healthy metrics. GPU utilization may look fine until traffic spikes, then p99 latency worsens or training step time becomes unstable. Topology is often the missing explanation.
Placement goals differ for training and inference
Training often cares about synchronized throughput. Inference often cares about tail latency and predictability.
- Training placement goals
- Keep each GPU fed consistently.
- Keep collective communication efficient by grouping GPUs with strong peer-to-peer paths.
- Keep data pipeline threads on the socket nearest to the GPUs they feed.
- Inference placement goals
- Keep request handling threads near the NICs and GPUs to reduce overhead.
- Avoid cross-socket paths whose added microseconds compound into worse p99 latency.
- Keep memory allocations stable to avoid jitter from remote memory traffic.
Both benefit from locality, but they measure success differently.
CPU affinity and pinning as baseline hygiene
If you do nothing, the operating system will try to be fair. Fairness can destroy locality.
The baseline hygiene is to make the placement explicit.
- Pin CPU threads that feed a GPU to the socket closest to that GPU.
- Pin network processing threads near the NIC used for that workload.
- Avoid frequent process migration between sockets.
- Keep noisy background work away from the cores that serve latency-critical paths.
The objective is not to squeeze out a tiny gain. The objective is to remove variance and prevent the worst-case path from becoming normal.
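On Linux, making placement explicit from inside a process is one call. A minimal sketch (os.sched_setaffinity is Linux-only; production setups often reach for numactl or taskset instead, and the no-op example below only shows the call shape):

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process (pid 0 = the process itself) to the given
    CPU ids. Linux-only: os.sched_setaffinity wraps sched_setaffinity(2)."""
    os.sched_setaffinity(0, cpus)

if hasattr(os, "sched_getaffinity"):
    # Re-pin to the CPUs we already have: a no-op that demonstrates the call.
    # In practice you would pass the CPU set of the socket nearest the GPU
    # this process feeds (e.g. derived from sysfs as shown earlier).
    current = os.sched_getaffinity(0)
    pin_to_cpus(current)
    assert os.sched_getaffinity(0) == current
```

The same call made once at worker startup, with the right CPU set per GPU, removes most of the migration and cross-socket variance described above.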
Memory affinity: where buffers are born matters
A common topology tax happens during host-to-device transfer. The CPU creates a batch buffer, then the GPU reads it via DMA. If the buffer is allocated on the wrong socket, the data must move across sockets before it can be transferred, and then the GPU reads it. The same bytes cross internal links multiple times.
A locality-friendly pattern is:
- Allocate and touch buffers on the same socket that will perform the transfer.
- Use thread placement so that “first touch” aligns with the intended socket.
- Keep allocation churn low to reduce fragmentation and remote allocation fallback.
- When using pinned memory, be deliberate about where pinned pages are located.
Pinned memory improves DMA behavior but can make locality mistakes more expensive, because pinned pages cannot be moved as easily by the OS. That is why pinned allocations should be controlled and measured rather than sprayed across a process.
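The first-touch idea can be sketched in plain Python: the thread that will perform host-to-device staging is the one that allocates the buffer and writes to each page, so Linux's default first-touch policy faults the pages onto that thread's node. The page-size constant and helper are illustrative; real code would query os.sysconf("SC_PAGE_SIZE") and pin the staging thread before allocating:

```python
import threading

PAGE = 4096  # illustrative; real code would use os.sysconf("SC_PAGE_SIZE")

def alloc_and_touch(nbytes):
    """Allocate a buffer and write one byte per page so every page is faulted
    in by the calling thread. Under Linux's default first-touch policy, the
    pages then live on the NUMA node where that thread is running."""
    buf = bytearray(nbytes)
    for off in range(0, nbytes, PAGE):
        buf[off] = 0
    return buf

# The staging thread (already pinned to the GPU's socket in real code) both
# allocates and touches, so the buffer ends up local to its node.
staged = []
t = threading.Thread(target=lambda: staged.append(alloc_and_touch(1 << 20)))
t.start()
t.join()
assert len(staged[0]) == 1 << 20
```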
Multi-GPU topology: grouping that matches the fabric
Not all multi-GPU sets behave the same. Some pairs have strong peer-to-peer connectivity. Some pairs traverse slower paths.
A topology-aware placement strategy often includes:
- Group GPUs that share the best peer-to-peer paths for synchronized work.
- Prefer placing tightly coupled shards on GPUs that share the strongest interconnect.
- Avoid splitting a single tightly synchronized job across distant topology islands when you can keep it within one island.
- If a split is unavoidable, adjust the partition strategy so that the highest-traffic paths remain local and only lower-traffic coordination crosses the weaker links.
The same idea applies to multi-tenant allocation. If you give a tenant a set of GPUs that spans topology islands, you have quietly lowered their effective performance, even though they received the “right number” of devices.
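The grouping rule above can be expressed as a small allocation policy. This sketch assumes a hypothetical precomputed partition of GPU ids into "islands" (for example, one PCIe switch or one NVLink clique per island) and prefers the tightest island that still fits, so larger islands stay free for larger jobs:

```python
def pick_island(islands, ngpus):
    """Pick a set of ngpus GPU ids that stays inside one topology island.

    islands: list of sets of GPU ids, each set sharing the strongest
    interconnect. Returns None when no single island fits, in which case
    the caller must split across islands and adapt its sharding so that
    only low-traffic coordination crosses the weaker links."""
    candidates = [i for i in islands if len(i) >= ngpus]
    if not candidates:
        return None
    best = min(candidates, key=len)  # tightest fit preserves big islands
    return set(sorted(best)[:ngpus])

islands = [{0, 1, 2, 3}, {4, 5}]
print(pick_island(islands, 2))  # {4, 5}: fits in the smaller island
print(pick_island(islands, 3))  # {0, 1, 2}
print(pick_island(islands, 5))  # None: no single island is big enough
```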
NIC placement and data movement
Many AI workloads depend on fast networking. Where the NIC sits relative to GPUs and CPU cores matters.
A locality-aware pattern is:
- Keep request handling threads near the NIC to minimize interrupt and kernel overhead.
- Keep GPU-facing copy and staging threads near the GPU’s socket.
- Avoid NIC-to-GPU paths that cross sockets when lower-latency paths exist.
- For workloads that use kernel-bypass networking, align the user-space networking threads with the NIC locality and the GPU locality when possible.
This is where topology and networking become one system. Link speed is not the only metric. The path inside the box matters.
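Matching NICs to GPUs falls out of the same device-to-node map. A sketch, assuming hypothetical dictionaries from device to the NUMA node it reports (for example via the sysfs lookup shown earlier):

```python
def colocated_pairs(nic_nodes, gpu_nodes):
    """Pair each NIC with the GPUs on its own NUMA node, so NIC-to-GPU
    staging never has to cross sockets.

    nic_nodes: {nic_name: numa_node}, gpu_nodes: {gpu_id: numa_node}."""
    pairs = []
    for nic, nic_node in nic_nodes.items():
        for gpu, gpu_node in gpu_nodes.items():
            if nic_node == gpu_node:
                pairs.append((nic, gpu))
    return pairs

print(colocated_pairs({"eth0": 0, "eth1": 1}, {0: 0, 1: 0, 2: 1, 3: 1}))
# [('eth0', 0), ('eth0', 1), ('eth1', 2), ('eth1', 3)]
```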
Topology-aware scheduling: making placement a platform capability
Manual pinning is workable for a single team. At scale, placement must become platform policy.
A topology-aware scheduler should be able to:
- Discover hardware topology and expose it as resources
- Place jobs so that GPU sets are topology-consistent
- Bind CPU and memory resources to match GPU placement
- Reserve NIC locality where relevant
- Enforce fairness without creating hidden topology penalties
This is not an academic feature. It directly affects cost. A job that runs at lower throughput because of topology is a job that consumes more device time for the same output.
For the orchestration layer, see Cluster Scheduling and Job Orchestration and Multi-Tenancy Isolation and Resource Fairness.
Diagnosing topology problems without guesswork
Topology problems have signatures.
- GPU utilization dips when transfers spike
- Step time variance increases while average utilization looks acceptable
- CPU utilization increases on one socket while the other is underused
- Remote memory access counters rise during heavy pipeline stages
- PCIe throughput appears capped below expected levels
- Latency tail worsens without a clear software change
The fastest way to validate topology suspicion is to compare two placements.
- Same workload pinned locally
- Same workload pinned cross-socket
If performance changes materially with placement alone, the topology tax is real. That gives you a clear direction: make placement explicit and enforce it in the scheduler.
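The A/B comparison only needs a small summary over the two runs' step times. A sketch of that summary (the helper names and the tail index are ours; any percentile convention works as long as both runs use the same one):

```python
import statistics

def placement_delta(local_ms, remote_ms):
    """Compare step times from two runs of the same workload: one pinned to
    the GPU-local socket, one pinned cross-socket. Reports the median
    slowdown of the remote placement plus the tail of each run."""
    def p99(xs):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(0.99 * len(xs)))]
    return {
        "median_slowdown": statistics.median(remote_ms) / statistics.median(local_ms),
        "local_p99": p99(local_ms),
        "remote_p99": p99(remote_ms),
    }

r = placement_delta([10.0, 10.2, 10.1], [12.0, 12.4, 15.0])
assert r["median_slowdown"] > 1.1  # a material gap: the topology tax is real
```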
Containers, virtualization, and the illusion of uniform resources
Containers and virtualization can hide hardware, but they cannot remove topology. If the platform abstracts devices without providing topology-aware placement, tenants will see unpredictable behavior.
A stable platform provides:
- Clear device assignment
- CPU and memory affinity aligned with the device
- Limits that prevent noisy neighbors from stealing host resources
- Monitoring that reveals when placement is suboptimal
See Virtualization and Containers for AI Workloads for the operational layer that often determines whether topology work holds under real multi-tenant pressure.
What good looks like
Topology-aware placement is “good” when it produces throughput and latency that are stable across time, not only on a good day.
- Jobs are placed on topology-consistent GPU sets.
- CPU and memory affinity match device locality.
- NIC placement is aligned with the critical data paths.
- Monitoring shows low remote memory traffic for GPU-feeding paths.
- Performance is predictable enough that capacity planning can use real measurements.
When AI becomes infrastructure, the machine is not a black box. It is a topology. Placement is how you respect it.
- Hardware, Compute, and Systems Overview
- Nearby topics in this pillar
- Memory Hierarchy: HBM, VRAM, RAM, Storage
- Interconnects and Networking: Cluster Fabrics
- Virtualization and Containers for AI Workloads
- Cluster Scheduling and Job Orchestration
- Cross-category connections
- Capacity Planning and Load Testing for AI Services: Tokens, Concurrency, and Queues
- Monitoring: Latency, Cost, Quality, Safety Metrics
- Series and navigation
- Infrastructure Shift Briefs
- Tool Stack Spotlights
- AI Topics Index
- Glossary
