Inference Architecture and Scaling

Definition

The hardware architectures, system designs, and networking topologies being built to scale AI inference — including GPU-LPU disaggregation, rack-scale systems, and the shift from training-dominated to inference-dominated compute.

Key Points

Nvidia's Attention-FFN Disaggregation (AFD): GPUs handle stateful attention (with dynamic KV cache), LPUs handle stateless FFN — using ping-pong pipeline parallelism (nvidia inference kingdom expands)
Groq LPU: SRAM-based, deterministic execution, very fast but limited memory capacity; complements GPU throughput strengths (nvidia inference kingdom expands)
LPX rack: 256 LPUs with FPGAs for protocol conversion, up to 256GB DDR5 per FPGA for KV cache (nvidia inference kingdom expands)
CPO roadmap: copper intra-rack through Rubin Ultra; CPO only for inter-rack starting NVL576 (nvidia inference kingdom expands)
Kyber rack: 144 GPUs, 72 NVLink 7 switches, 14.4 Tbit/s per GPU scale-up (nvidia inference kingdom expands)
CMX: NVMe-based KV cache tier for long-context and agentic workloads (nvidia inference kingdom expands)
Hopper vs Blackwell: 20x performance gap for inference at 100 tok/s for DeepSeek/Kimi K2.5 — far exceeding the FLOPS difference (dwarkesh dylan patel interview)
Scale-up domains: Nvidia 72 GPUs all-to-all; Google thousands of TPUs in torus; Amazon hybrid (dwarkesh dylan patel interview)
Vera Rubin: 5x inference throughput of Blackwell, 10x token cost reduction (clouded judgement per token pricing)

Open Questions

Will AFD (GPU+LPU) become the standard inference architecture, or is it Nvidia-specific?
How will KV cache management evolve as context lengths keep growing?
When does CPO actually reach volume production?
Will TPU/Trainium architectures converge toward similar disaggregation patterns?
Will liquid cooling become the default over air cooling for AI workloads?
How fast will modular/prefabricated data center deployment accelerate?
At what point does the energy footprint of AI create meaningful political backlash?

Physical Infrastructure

Ball identifies 11 essential data center components, each being reinvented for AI workloads (fourth industrial revolution)
Real estate-power proximity is a critical bottleneck: power lost in transmission favors co-location (fourth industrial revolution)
Power is solvable through diverse generation: 16+ gas vendor types, behind-the-meter installations, fuel cells, solar+battery (dwarkesh dylan patel interview)
Modularization is key to scaling: ship integrated blocks (power, cooling, racks) from factories to reduce on-site labor (dwarkesh dylan patel interview)
Nvidia's Vera ETL256: 256 CPUs in a single liquid-cooled rack; STX: reference storage rack with BlueField-4 (nvidia inference kingdom expands)
Data centers at 3-4% of US grid today, targeting 10% by 2028 (dwarkesh dylan patel interview)
Space data centers not viable this decade: chip supply, not power, is the binding constraint (dwarkesh dylan patel interview)

Related Concepts

gpu and compute economics
semiconductor supply chain bottlenecks
token economics and pricing
ai scaling limits and research paradigm — if pre-training gains plateau, inference becomes the primary compute workload and monetization surface