What it is
Inference chips are processors purpose-built for the runtime side of AI workloads. Where training a frontier model requires massive parallel compute over weeks (NVIDIA H100 and B200 GPUs dominate this), serving the trained model to users in production has very different requirements: low latency per request, high throughput on small-batch decoding, predictable cost per token, and flexibility across the long tail of model architectures that have already been trained. Vendors in this space include Groq (LPU — language processing unit, built around static dataflow for high tokens-per-second on transformer decode), Cerebras (wafer-scale chips that fit large models entirely on-die), SambaNova, AWS Inferentia/Trainium, Google TPU v5e and v6, Etched (building a transformer-specific ASIC), and a growing roster of startups. Some vendors target both training and inference; the "inference chips" framing specifically highlights the runtime-economics use case.
Why it matters
Inference cost is the dominant ongoing operating cost of a deployed AI product. A frontier model trained once might cost $100M, but the serving infrastructure that runs it for a million users a day can exceed that within twelve months. Inference chips matter because they shift the unit economics: a 5–10× improvement in tokens-per-dollar at inference compounds across every customer interaction for years. The category also matters strategically — if specialized inference silicon delivers materially better economics on the long tail of agentic workloads (many short calls, often interrupted by tool use), the cloud-vendor matrix that dominated 2023–2025 (NVIDIA GPUs on AWS, Azure, and GCP) starts to splinter. For agent operations leaders, inference chip choice flows directly into the model selection and routing decisions the capability registry makes — a model that's 4× cheaper on Groq than on a generic GPU may be the right pick even if it's a half-step behind on quality.
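To make that compounding concrete, here is a back-of-the-envelope sketch. The traffic profile and the $2-per-million-token baseline are assumptions chosen for illustration, not vendor pricing:

```python
# Back-of-the-envelope serving economics. All figures below are
# illustrative assumptions, not vendor pricing.

DAILY_REQUESTS = 1_000_000    # requests per day
TOKENS_PER_REQUEST = 1_500    # prompt + completion tokens
BASELINE_PRICE = 2.00         # $ per million tokens on a generic GPU

def annual_serving_cost(price_per_m_tokens: float) -> float:
    """Annual spend for a fixed traffic profile at a given token price."""
    daily_tokens = DAILY_REQUESTS * TOKENS_PER_REQUEST
    return daily_tokens / 1_000_000 * price_per_m_tokens * 365

baseline = annual_serving_cost(BASELINE_PRICE)
for multiple in (5, 10):
    improved = annual_serving_cost(BASELINE_PRICE / multiple)
    print(f"{multiple}x tokens-per-dollar: ${improved:,.0f}/yr "
          f"vs ${baseline:,.0f} baseline "
          f"(${baseline - improved:,.0f} saved)")
```

Even at this modest traffic level, a 5× tokens-per-dollar improvement cuts roughly $876K from a ~$1.1M annual bill, and the gap scales linearly with traffic.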
Key components
- GPUs vs purpose-built — general-purpose NVIDIA dominant in training; inference-specific silicon increasingly competitive at runtime
- Tokens per second — the headline metric for inference chip benchmarks (Groq famously demonstrated 500+ tokens/sec on Llama 70B)
- Tokens per dollar — the unit-economics metric that matters for production deployments
- Vendor landscape — Groq, Cerebras, SambaNova, AWS Inferentia, Google TPU, Etched, and others
- Interaction with routing — inference chip choice affects which models are economical for which workloads in a vendor-neutral stack (see the sketch after this list)
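A minimal sketch of that routing interaction, treating quality as a constraint and cost as the objective. The offerings table, model names, prices, and quality scores are hypothetical placeholders, not published benchmarks or real vendor pricing:

```python
# Cost-aware model routing sketch. Every entry in OFFERINGS is a
# hypothetical placeholder, not a real price or benchmark score.

from dataclasses import dataclass

@dataclass
class Offering:
    model: str
    provider: str              # e.g. "groq", "generic-gpu"
    cost_per_m_tokens: float   # $ per million tokens
    quality: float             # 0-1 score from your own evals

OFFERINGS = [
    Offering("llama-70b", "groq", 0.60, 0.78),
    Offering("llama-70b", "generic-gpu", 2.40, 0.78),
    Offering("frontier-model", "generic-gpu", 10.00, 0.85),
]

def route(min_quality: float) -> Offering:
    """Pick the cheapest offering that clears the quality bar."""
    eligible = [o for o in OFFERINGS if o.quality >= min_quality]
    if not eligible:
        raise ValueError(f"no offering meets quality >= {min_quality}")
    return min(eligible, key=lambda o: o.cost_per_m_tokens)

print(route(min_quality=0.75))  # picks llama-70b on groq: 4x cheaper
```

The design choice worth noting: because quality is a floor rather than the objective, a 4× price gap between chips can outweigh a half-step quality deficit, which is exactly the trade-off described above.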
Related terms
LLM (Large Language Model)
The AI technology behind ChatGPT, Claude, and the intelligence in Agentforce. Trained on massive amounts of text to understand and generate human language.
Agent Operations
The discipline of running AI agents in production — capturing what they do, attributing what it costs, evaluating what they produce, and intervening when something goes wrong. The operational layer above agent observability and orchestration.
LLM Gateway
A unified proxy in front of multiple LLM providers that captures every call, enforces policy, and lets a single application talk to Anthropic, OpenAI, xAI, Gemini, and local models through one interface.
LLM Cost Attribution
The practice of tying every LLM call back to the task, agent, process, or skill that triggered it — across every vendor — so AI spend can be measured against outcomes, not just tokens.
Capability Registry
A structured catalog that maps AI capabilities (reasoning, structured output, tool use, vision, long context) to the models that can serve them — the substrate that makes skills portable across LLM vendors.