Inference Architecture and Scaling

Definition

The hardware architectures, system designs, and networking topologies being built to scale AI inference — including GPU-LPU disaggregation, rack-scale systems, and the shift from training-dominated to inference-dominated compute.
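
As a sketch of what that disaggregation means in practice: transformer inference splits into a compute-bound prefill phase (processing the prompt and building the KV cache) and a memory-bandwidth-bound decode phase (generating tokens one at a time against that cache), and disaggregated designs put each phase on hardware sized for it. The Python below is a toy illustration of the handoff, assuming that standard two-phase view; all names are hypothetical and no vendor API is implied.

    from dataclasses import dataclass, field

    @dataclass
    class KVCache:
        # One entry per token; a real cache holds per-layer K/V tensors in HBM.
        tokens: list = field(default_factory=list)

    def prefill(prompt_tokens):
        # Compute-bound phase: the whole prompt is processed in one batch and
        # every token's keys/values are written into the cache.
        cache = KVCache()
        cache.tokens.extend(prompt_tokens)   # stand-in for attention K/V tensors
        return cache

    def decode_step(cache, step):
        # Memory-bandwidth-bound phase: attend over the whole cache, emit one
        # token, and append its entry so the next step can see it.
        new_token = f"tok{step}"             # stand-in for a sampled token
        cache.tokens.append(new_token)
        return new_token

    # Prefill runs on one device pool, decode on another; only the KV cache
    # (not the model weights) has to cross between them. Real systems move it
    # over NVLink or RDMA, not in-process.
    cache = prefill(["The", "quick", "brown"])
    output = [decode_step(cache, i) for i in range(4)]
    print(output, len(cache.tokens))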

Open Questions

  • Will AFD-style disaggregation (pairing GPUs with LPUs) become the standard inference architecture, or will it remain Nvidia-specific?
  • How will KV cache management evolve as context lengths keep growing? (See the sizing sketch after this list.)
  • When will co-packaged optics (CPO) actually reach volume production?
  • Will TPU/Trainium architectures converge toward similar disaggregation patterns?
  • Will liquid cooling become the default over air cooling for AI workloads?
  • How quickly will modular/prefabricated data center deployment scale?
  • At what point does the energy footprint of AI create meaningful political backlash?
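
On the KV cache question above, the pressure is simple arithmetic: cache size grows linearly with context length, so long contexts exhaust accelerator memory before compute becomes the constraint. A back-of-envelope sizing in Python, using illustrative dimensions (roughly a 70B-class model with grouped-query attention and an fp16 cache; assumed numbers, not any vendor's published spec):

    def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                       head_dim=128, bytes_per_elem=2):
        # 2x because both a key and a value vector are stored per token,
        # per layer, per KV head; defaults assume fp16/bf16 storage.
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

    for ctx in (8_192, 131_072, 1_048_576):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"{ctx:>9,} tokens -> {gib:6.1f} GiB per sequence")

    # 8K context: ~2.5 GiB. 128K: ~40 GiB. 1M: ~320 GiB -- beyond a single
    # GPU's HBM, which is what pushes designs toward cache paging, offload,
    # quantization, and disaggregated serving.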
