Google announced the eighth generation of its Tensor Processing Unit at Cloud Next on 22 April, splitting the TPU line into two purpose-built variants for the first time. The TPU 8t is optimised for training and delivers up to 12.6 petaFLOPS of 4-bit floating-point (FP4) compute per chip, with 216 GB of high-bandwidth memory (HBM) at 6.5 TB/s, 128 MB of on-chip SRAM, and 19.2 Tbps of chip-to-chip bandwidth. A single superpod scales to 9,600 accelerators connected via optical-circuit switches, with about 2 petabytes of shared HBM. Multiple pods can be linked through Google's Virgo Network, enabling up to 134,000 TPUs per data centre and 1 million across sites. Google claims the TPU 8t is 2.8 times faster at training than the previous Ironwood generation and achieves 97 per cent 'goodput', the share of wall-clock time spent on productive training rather than lost to downtime.
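The superpod totals follow from the per-chip figures by simple multiplication. The sketch below is a back-of-the-envelope check, not anything published by Google: it assumes decimal units (1 PB = 1,000,000 GB), treats peak FP4 throughput as perfectly additive across chips, and uses a hypothetical 30-day training run to show what 97 per cent goodput means in hours. All function names are illustrative.

```python
# Back-of-the-envelope superpod arithmetic from the quoted per-chip figures.
# Assumptions: decimal units, peak throughput adds linearly across chips,
# and goodput is the fraction of wall-clock time spent on useful training.

CHIPS_PER_SUPERPOD = 9_600
HBM_PER_CHIP_GB = 216
FP4_PER_CHIP_PFLOPS = 12.6


def superpod_hbm_pb(chips: int = CHIPS_PER_SUPERPOD,
                    hbm_gb: float = HBM_PER_CHIP_GB) -> float:
    """Total HBM across a superpod, in petabytes."""
    return chips * hbm_gb / 1_000_000


def superpod_fp4_exaflops(chips: int = CHIPS_PER_SUPERPOD,
                          pflops: float = FP4_PER_CHIP_PFLOPS) -> float:
    """Theoretical peak FP4 compute across a superpod, in exaFLOPS."""
    return chips * pflops / 1_000


def hours_lost(goodput: float, wall_clock_hours: float) -> float:
    """Wall-clock hours not spent on productive training."""
    return (1.0 - goodput) * wall_clock_hours


if __name__ == "__main__":
    print(f"Shared HBM: {superpod_hbm_pb():.2f} PB")
    print(f"Peak FP4:   {superpod_fp4_exaflops():.1f} exaFLOPS")
    # Hypothetical 30-day run (720 hours) at the claimed 97 per cent goodput.
    print(f"Lost time:  {hours_lost(0.97, 720):.1f} hours")
```

Under these assumptions, 9,600 chips with 216 GB each come to just over 2 PB of HBM, which matches the quoted figure, and a 30-day run at 97 per cent goodput loses a little under one day to downtime. The aggregate FP4 number is a theoretical ceiling derived here, not a figure from the announcement.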
The TPU 8i is built specifically for inference workloads, prioritising low latency for real-time agentic AI applications. It delivers 10.1 petaFLOPS of FP4 compute with 288 GB of HBM at 8.6 TB/s bandwidth and 384 MB of on-chip SRAM, triple the SRAM of the training variant, which allows a model's active working set to reside entirely on-chip. The standout feature is a new Collective Acceleration Engine that cuts synchronisation latency five-fold, which is critical for multi-agent systems where dozens of models need to coordinate responses in real time. Google claims 80 per cent higher performance per dollar for LLM inference compared to Ironwood, meaning enterprises can serve roughly 1.8 times as many users at the same cost. The company is also replacing x86 CPUs with Arm-based Axion processors as TPU hosts.
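The performance-per-dollar claim can be read two ways: more users for the same spend, or the same users for less spend. The sketch below works through both, along with the SRAM ratio between the two chips. It assumes that serving capacity scales linearly with performance per dollar and that every user consumes the same amount of compute; real deployments will deviate from both. Function names are illustrative.

```python
# What "80 per cent higher performance per dollar" implies, under the
# simplifying assumption that capacity scales linearly with perf per dollar.

TPU_8T_SRAM_MB = 128
TPU_8I_SRAM_MB = 384


def users_at_same_cost(perf_per_dollar_gain: float) -> float:
    """Multiplier on users served for an unchanged budget."""
    return 1.0 + perf_per_dollar_gain


def cost_at_same_users(perf_per_dollar_gain: float) -> float:
    """Fraction of the original budget needed to serve the same users."""
    return 1.0 / (1.0 + perf_per_dollar_gain)


def sram_ratio(inference_mb: float = TPU_8I_SRAM_MB,
               training_mb: float = TPU_8T_SRAM_MB) -> float:
    """On-chip SRAM of the inference chip relative to the training chip."""
    return inference_mb / training_mb


if __name__ == "__main__":
    gain = 0.80  # Google's claimed improvement over Ironwood
    print(f"Users at same cost: {users_at_same_cost(gain):.2f}x")
    print(f"Cost at same users: {cost_at_same_users(gain):.1%} of original")
    print(f"SRAM ratio (8i / 8t): {sram_ratio():.1f}x")
```

On these assumptions an 80 per cent gain means about 1.8 times the users for the same budget, or about 56 per cent of the original budget for the same users.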
For context engineers, the architectural decision to split training and inference into separate silicon is the most significant detail. Google's statement, 'forget one chip to rule them all', acknowledges that training frontier models and serving them at scale have fundamentally different compute profiles and require different hardware. Training demands massive floating-point throughput and interconnect bandwidth across thousands of chips; inference demands low latency, ample on-chip SRAM to hold a model's active working set, and efficient synchronisation for agentic workflows where multiple models collaborate. This bifurcation mirrors what developers already know from software architecture: the same system rarely optimises well for both batch processing and real-time serving. Both chips reach general availability later in 2026, and they will underpin Google's entire Gemini infrastructure, including the Gemini Enterprise Agent Platform announced at the same conference.