Nvidia – The Inference Kingdom Expands

last_validated: 2026-04-06 decay_rate: slow

Summary

SemiAnalysis's deep-dive into Nvidia's GTC 2026 announcements. Nvidia "acquired" Groq for $20B (structured as IP license + team hire to avoid antitrust review) and is integrating the LPU into its inference stack. The LP30 chip pairs with GPUs via Attention-FFN Disaggregation (AFD) — GPUs handle attention, LPUs handle stateless FFN computation. New systems include LPX rack (256 LPUs), Vera ETL256 (256 CPUs), and STX storage racks. The CPO roadmap shows copper within racks through Rubin Ultra, with CPO only for inter-rack scale-up starting with NVL576. Feynman NVL1152 targets 8x Kyber racks with all-CPO between racks.
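
The AFD "ping-pong" idea can be sketched as a toy scheduler: two microbatches alternate so that while the GPU runs attention for one, the LPU runs the FFN of the other. This is a minimal illustrative model, not the actual scheduler; microbatch count, layer count, and the warm-up/drain handling are assumptions.

```python
def pingpong_schedule(num_layers=3):
    """Toy AFD ping-pong schedule for two microbatches.

    At each timestep the GPU runs attention (stateful) for one
    microbatch while the LPU runs the FFN (stateless) of the work
    the GPU finished on the previous step.
    """
    # Order of attention work on the GPU: alternate microbatches per layer.
    attn_order = [(mb, layer) for layer in range(num_layers) for mb in (0, 1)]
    timeline = []
    for t in range(len(attn_order) + 1):
        gpu = (f"attn(mb{attn_order[t][0]},L{attn_order[t][1]})"
               if t < len(attn_order) else "idle")       # drain step at the end
        lpu = (f"ffn(mb{attn_order[t-1][0]},L{attn_order[t-1][1]})"
               if t > 0 else "idle")                     # warm-up step at the start
        timeline.append((gpu, lpu))
    return timeline
```

Outside the one-step warm-up and drain, every timestep keeps both devices busy, which is the point of pairing a stateful attention engine with a stateless FFN engine.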

Key Claims

  • Nvidia paid $20B for Groq, structured to avoid regulatory approval — instant IP/team access
  • LP30: 500MB on-chip SRAM, 1.2 PFLOPs FP8, built on Samsung SF4 (not supply-constrained like TSMC N3)
  • AFD separates attention (GPU, stateful) from FFN (LPU, stateless) with ping-pong pipeline parallelism
  • No CPO for intra-rack scale-up in Rubin Ultra — copper only; CPO begins at NVL576 for inter-rack
  • Kyber rack: 144 GPUs, 72 NVLink 7 switches, each GPU with 14.4 Tbit/s unidirectional scale-up bandwidth
  • CMX addresses KV cache bottleneck for long-context/agentic workloads via NVMe tier between DRAM and shared storage
  • Vera ETL256: 256 CPUs in single liquid-cooled rack for RL simulation/preprocessing at scale
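
The CMX claim above amounts to a tiered KV cache: hot KV blocks in DRAM, warm blocks demoted to NVMe, cold blocks spilled to shared storage. A minimal sketch, assuming an LRU demotion policy and promote-on-access; the class name, tier capacities, and policy are illustrative, not from the article.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy three-tier KV cache: DRAM -> NVMe -> shared storage."""

    def __init__(self, dram_blocks=2, nvme_blocks=4):
        self.dram = OrderedDict()   # hot tier: fastest, smallest
        self.nvme = OrderedDict()   # warm tier: larger, higher latency
        self.cold = {}              # shared storage: effectively unbounded
        self.dram_cap, self.nvme_cap = dram_blocks, nvme_blocks

    def put(self, seq_id, kv_block):
        # New or re-accessed blocks always land in the hot tier.
        self.dram[seq_id] = kv_block
        self.dram.move_to_end(seq_id)
        self._demote()

    def get(self, seq_id):
        # Promote on access: wherever the block lives, move it back to DRAM.
        for tier in (self.dram, self.nvme, self.cold):
            if seq_id in tier:
                block = tier.pop(seq_id)
                self.put(seq_id, block)
                return block
        return None

    def _demote(self):
        # Overflowing blocks cascade down: DRAM -> NVMe -> shared storage.
        while len(self.dram) > self.dram_cap:
            sid, blk = self.dram.popitem(last=False)  # evict least-recently used
            self.nvme[sid] = blk
        while len(self.nvme) > self.nvme_cap:
            sid, blk = self.nvme.popitem(last=False)
            self.cold[sid] = blk
```

For long-context or agentic sessions, the NVMe tier absorbs KV blocks that would otherwise force either DRAM eviction (recompute) or a round trip to shared storage on every resume.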

Tags

#nvidia #gtc-2026 #groq #inference #lpu #afd #cpo #networking #rubin #feynman #kvcache

Related