Groq vs Tenstorrent

Comparison

Groq and Tenstorrent represent two radically different bets on the future of AI silicon beyond NVIDIA's GPU monopoly. Groq built a deterministic Language Processing Unit (LPU) optimized purely for inference speed—a bet so compelling that NVIDIA acquired the company for $20 billion in December 2025. Tenstorrent, led by legendary chip architect Jim Keller, is building an open RISC-V-based architecture that spans training and inference while offering IP licensing and chiplet-level customization. One prioritizes raw token throughput; the other prioritizes architectural openness and cost efficiency. Together, they illustrate the branching paths of post-GPU AI compute.

Feature Comparison

| Dimension | Groq | Tenstorrent |
| --- | --- | --- |
| Architecture | Language Processing Unit (LPU) with deterministic static scheduling and SRAM-only memory | RISC-V mesh-based Tensix cores with conditional execution and GDDR6 memory |
| Primary Focus | Inference-only, optimized for maximum token throughput | Training and inference, with emphasis on flexibility and openness |
| Current Flagship | Groq 3 LPU (shipping Q3 2026 under NVIDIA), 150 TB/s bandwidth, 1,500 tokens/sec target | Blackhole (6nm TSMC), 120 Tensix++ cores, 774 TFLOPS FP8 |
| Memory Architecture | 230MB on-die SRAM, no HBM; delivers up to 80 TB/s on-die bandwidth | GDDR6 DRAM (no HBM), distributed software architecture for memory management |
| Ownership | Acquired by NVIDIA (Dec 2025) for ~$20B; now NVIDIA's Real-Time Inference division | Independent startup; $1.18B total funding, ~$2.6B+ valuation |
| Leadership | Co-founder Jonathan Ross now leads NVIDIA inference division | CEO Jim Keller (former AMD Zen, Apple A-series, Tesla HW3 architect) |
| Business Model | Cloud inference API (pre-acquisition); NVIDIA hardware product (post-acquisition) | IP licensing, chiplet sales, dev kits ($12K+), complete systems |
| Open vs Proprietary | Proprietary architecture, now within NVIDIA's closed ecosystem | Open RISC-V ISA, open-source software stack (TT-Metalium) |
| Manufacturing | Groq 3 LPU fabricated via NVIDIA's supply chain | Samsung Foundry SF4X (cost-optimized), TSMC 6nm for Blackhole |
| Cost Strategy | Premium performance-per-watt: 35x throughput/MW vs Blackwell NVL72 | Deep cost undercut: no HBM, cheap process nodes, targeting 60%+ gross margins |
| Software Ecosystem | Integrated into NVIDIA CUDA/TensorRT ecosystem post-acquisition | TT-Metalium open-source SDK, growing community but early-stage tooling |
| Key Metric | 276–1,665 tokens/sec on Llama 70B (standard to speculative decoding) | 774 TFLOPS FP8 raw compute; emphasis on cost-per-inference over raw speed |

Detailed Analysis

Architectural Philosophy: Determinism vs Flexibility

Groq's LPU feeds tokens through a single wide pipeline of functional units executing in lock-step—no kernel switching, no cache misses, every clock cycle doing useful work. This deterministic approach eliminates the scheduling overhead that plagues GPU and TPU architectures, delivering predictable latency that matters enormously for agentic AI applications where multiple LLM calls chain within a single interaction. Tenstorrent's Tensix cores take the opposite approach: a mesh-based architecture with conditional execution that can skip unnecessary computation dynamically. This flexibility lets the same hardware handle training and inference workloads, and the RISC-V instruction set means the architecture can be extended and customized by licensees without proprietary lock-in.
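The contrast between a lock-step schedule and conditional execution can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions: the function names, the tile representation, and the cycle costs are all hypothetical, and neither function models real Groq or Tenstorrent hardware.

```python
# Toy model only: neither function reflects real Groq or Tenstorrent hardware.

def static_schedule_cycles(num_tiles: int, cycles_per_tile: int) -> int:
    """Lock-step schedule: every tile is processed, so the cycle count
    is known ahead of time and does not depend on the data."""
    return num_tiles * cycles_per_tile


def conditional_cycles(tiles: list[list[float]], cycles_per_tile: int,
                       skip_check_cycles: int = 1) -> int:
    """Conditional execution: all-zero tiles are skipped after a cheap check,
    so the cycle count depends on the data."""
    total = 0
    for tile in tiles:
        total += skip_check_cycles            # pay for inspecting the tile
        if any(value != 0.0 for value in tile):
            total += cycles_per_tile          # do the real work only when needed
    return total


# Hypothetical workload: 8 tiles, half of them entirely zero.
tiles = [[1.0, 2.0], [0.0, 0.0]] * 4
static = static_schedule_cycles(len(tiles), cycles_per_tile=10)
dynamic = conditional_cycles(tiles, cycles_per_tile=10)
print(static, dynamic)  # prints: 80 48
```

In this toy model the static schedule always costs the same 80 cycles, which is what makes its latency predictable, while the conditional version saves work on sparse data but would cost slightly more than the static one on fully dense data because of the per-tile check.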

The NVIDIA Factor

NVIDIA's $20 billion acquisition of Groq in December 2025 fundamentally changed the competitive landscape. The Groq 3 LPU, unveiled at GTC 2026, claims 35x higher throughput per megawatt than NVIDIA's own Blackwell NVL72 for trillion-parameter models—NVIDIA essentially acquired the technology that would obsolete its own inference approach. Jonathan Ross and Groq's senior engineering team now lead NVIDIA's Real-Time Inference division. For Tenstorrent, this consolidation is a double-edged sword: it removes an independent competitor but also validates the thesis that specialized inference hardware is critical. Jim Keller's strategy of building the anti-NVIDIA—open architecture, no HBM dependency, IP licensing—becomes more differentiated as NVIDIA absorbs Groq's technology into its walled garden.

Memory and Cost Architecture

Both companies reject HBM (High Bandwidth Memory), but for different reasons. Groq uses only on-die SRAM to achieve extreme bandwidth (150 TB/s on Groq 3) at the cost of limited capacity—the architecture works brilliantly for inference but cannot scale to training-sized models without multi-chip configurations. Tenstorrent uses commodity GDDR6 DRAM paired with a distributed software memory architecture, sacrificing peak bandwidth for dramatically lower bill-of-materials cost. Manufacturing on Samsung's SF4X process rather than TSMC's cutting-edge nodes further reduces Tenstorrent's cost basis. If Groq 3 represents the performance ceiling of AI inference silicon, Tenstorrent's Blackhole represents the cost floor—and in an inference economy where margins matter, both positions are viable.
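The capacity limit of an SRAM-only design can be made concrete with a rough calculation. The sketch below uses the 230MB on-die SRAM figure quoted earlier and a 70B-parameter model; the bytes-per-weight values and the reading of MB as 10^6 bytes are assumptions, and the estimate ignores activations, KV cache, and any replication, so it is a lower bound rather than a real deployment size.

```python
def chips_needed(num_params: int, bytes_per_param: int,
                 sram_bytes_per_chip: int) -> int:
    """Minimum number of chips whose combined SRAM can hold the weights."""
    weight_bytes = num_params * bytes_per_param
    # Ceiling division using integers only.
    return -(-weight_bytes // sram_bytes_per_chip)


SRAM_BYTES = 230 * 10**6   # 230MB per chip, assuming MB = 10^6 bytes
PARAMS_70B = 70 * 10**9    # a 70B-parameter model such as Llama 70B

fp8_chips = chips_needed(PARAMS_70B, 1, SRAM_BYTES)    # assumption: 1 byte per weight
fp16_chips = chips_needed(PARAMS_70B, 2, SRAM_BYTES)   # assumption: 2 bytes per weight
print(fp8_chips, fp16_chips)  # prints: 305 609
```

Even under these generous assumptions, holding the weights alone takes hundreds of chips, which is why the SRAM-only approach depends on multi-chip configurations for large models.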

Software Ecosystem Maturity

Post-acquisition, Groq benefits from NVIDIA's CUDA ecosystem—the largest and most mature GPU software stack in the world. Developers already using TensorRT for inference can potentially target Groq 3 LPUs with minimal code changes. Tenstorrent's TT-Metalium is open-source and growing, but remains early-stage compared to CUDA's decade-plus head start. Jim Keller has acknowledged this gap, noting that Tenstorrent needs 18–36 months of software stack maturation to achieve meaningful market traction. The open-source approach could ultimately be an advantage—much as Linux eventually displaced proprietary Unix—but the near-term developer experience favors NVIDIA/Groq.

Target Markets and Deployment Models

Groq (now under NVIDIA) targets hyperscale cloud providers and enterprises that need the fastest available inference for real-time agentic applications; the 1,500 tokens-per-second target is intended to let multi-agent systems communicate in real time. Tenstorrent targets a broader market that includes edge deployment, sovereign AI initiatives, and organizations that want to customize their AI silicon without NVIDIA dependency. Tenstorrent's developer workstations, starting at $12,000, and its IP licensing model enable a long tail of hardware customization that Groq/NVIDIA's vertically integrated approach cannot serve. Tenstorrent's Taiwan office expansion signals its ambition to embed itself in the global semiconductor supply chain.
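To see why token rate matters for chained agent calls, a back-of-the-envelope estimate helps. The sketch below assumes the quoted rates apply to a single request stream and counts only generation time; the number of calls and tokens per call are hypothetical, and prompt processing, network, and tool-call time are ignored.

```python
def chain_generation_seconds(num_calls: int, tokens_per_call: int,
                             tokens_per_sec: float) -> float:
    """Generation time for sequential LLM calls, ignoring prompt
    processing, network, and tool-call time."""
    return num_calls * tokens_per_call / tokens_per_sec


NUM_CALLS = 5           # hypothetical number of chained agent steps
TOKENS_PER_CALL = 300   # hypothetical output length per step

# 276 tokens/sec is the standard-decoding Llama 70B figure quoted above;
# 1,500 tokens/sec is the stated Groq 3 target.
for rate in (276, 1500):
    seconds = chain_generation_seconds(NUM_CALLS, TOKENS_PER_CALL, rate)
    print(f"{rate} tokens/sec -> {seconds:.2f} s")
```

Under these assumptions the same five-step chain takes about 5.4 seconds at 276 tokens/sec and 1 second at 1,500 tokens/sec, which is the difference between a noticeable pause and a near-interactive response.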

The Open Architecture Bet

Tenstorrent's use of RISC-V aligns with a broader industry shift toward open-source foundations in AI infrastructure. Just as open-source models from Meta and Mistral challenged proprietary LLMs, open-source hardware architectures challenge NVIDIA's proprietary GPU stack. Tenstorrent's licensing model—selling IP and chiplets that others can integrate—mirrors ARM's approach to mobile computing. If RISC-V AI accelerators achieve even a fraction of ARM's mobile success, Tenstorrent's early positioning becomes enormously valuable. Groq's absorption into NVIDIA forecloses this path entirely: its technology is now proprietary NVIDIA IP, accessible only through NVIDIA's product stack and pricing.

Best For

Real-Time Multi-Agent Systems

Groq (NVIDIA)

Groq 3's 1,500 tokens/sec target and deterministic latency make it unmatched for applications where multiple AI agents must communicate in real time. The lock-step execution eliminates latency spikes that break agentic workflows.

Cost-Optimized Inference at Scale

Tenstorrent

Tenstorrent's no-HBM GDDR6 architecture and cheap manufacturing process enable dramatically lower cost-per-inference. For high-volume workloads where cost matters more than peak latency, Blackhole offers compelling economics.
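Cost-per-inference comparisons usually reduce to amortized hardware cost plus electricity, divided by tokens served. The sketch below shows that arithmetic; the function name and every input except the $12,000 entry price mentioned in this article are placeholders chosen for illustration, not measured figures for any Tenstorrent or Groq system.

```python
HOURS_PER_YEAR = 365 * 24


def cost_per_million_tokens(hardware_usd: float, lifetime_years: float,
                            power_watts: float, usd_per_kwh: float,
                            tokens_per_sec: float,
                            utilization: float = 1.0) -> float:
    """Amortized hardware cost plus electricity, per million tokens served."""
    hours = lifetime_years * HOURS_PER_YEAR
    # Simplification: the system draws full power even when idle.
    energy_usd = (power_watts / 1000.0) * hours * usd_per_kwh
    tokens = tokens_per_sec * utilization * hours * 3600
    return (hardware_usd + energy_usd) / tokens * 1_000_000


# Placeholder inputs, except the $12,000 entry price quoted in this article.
example = cost_per_million_tokens(
    hardware_usd=12_000, lifetime_years=3, power_watts=1_000,
    usd_per_kwh=0.10, tokens_per_sec=1_000, utilization=0.5)
print(f"${example:.3f} per million tokens")
```

The formula makes the trade-off explicit: a cheaper system with lower throughput can still win on cost per token as long as its price and power fall faster than its token rate.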

Edge and On-Premise AI Deployment

Tenstorrent

Tenstorrent's developer workstations, chiplet sales model, and customizable RISC-V architecture serve edge deployment scenarios where NVIDIA's hyperscale-focused Groq 3 LPU is impractical or unavailable.

Enterprise Cloud Inference

Groq (NVIDIA)

NVIDIA's ecosystem integration means Groq 3 will be available through major cloud providers with mature tooling, support contracts, and CUDA compatibility—exactly what enterprise procurement requires.

Sovereign AI and Export-Restricted Markets

Tenstorrent

Nations building domestic AI capability without NVIDIA dependency can license Tenstorrent's RISC-V IP and manufacture locally. The open architecture is positioned as a way to reduce exposure to the US export control chokepoints that affect NVIDIA products.

Custom AI Silicon (IP Licensing)

Tenstorrent

Companies wanting to build custom AI chips can license Tenstorrent's Tensix cores and RISC-V IP. Groq's technology is now locked inside NVIDIA with no licensing path for third parties.

Latency-Sensitive Consumer AI Products

Groq (NVIDIA)

Products like AI coding assistants, real-time translation, and conversational interfaces benefit most from Groq's sub-second response times and consistent latency profile.

AI Training Workloads

Tenstorrent

Groq is inference-only by design. Tenstorrent's architecture handles both training and inference—a 192-chip Blackhole training cluster is already operational, with larger clusters planned.

The Bottom Line

Groq and Tenstorrent no longer compete directly—they represent different futures for AI silicon. Groq, now inside NVIDIA, will define the performance ceiling for inference: the Groq 3 LPU shipping Q3 2026 promises to be the fastest inference chip ever built, backed by NVIDIA's manufacturing scale and software ecosystem. Tenstorrent, independent under Jim Keller, is building the open alternative: cheaper to manufacture, customizable via IP licensing, and free from NVIDIA's proprietary ecosystem. If you need maximum inference speed and operate within NVIDIA's ecosystem, Groq 3 is the clear choice. If you need cost efficiency, architectural customization, sovereignty from NVIDIA, or the ability to handle both training and inference, Tenstorrent offers a credible and increasingly mature path. The AI compute market is large enough for both approaches—and the industry is healthier for having them.