Autoregressive Generation

Autoregressive generation is the process by which language models produce text: predicting one token at a time, where each new token is conditioned on all previously generated tokens. This sequential, left-to-right process is the fundamental mechanism behind every LLM conversation, code completion, and AI-generated document.

The process works by computing a probability distribution over the entire vocabulary at each step. Given the sequence so far ("The capital of France is"), the model assigns probabilities to every possible next token ("Paris": 95%, "Lyon": 2%, etc.) and samples from this distribution. The chosen token is appended to the sequence, and the process repeats. The model never "plans ahead": it can only look backward at what it has already generated, guided by the patterns it learned during training.
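The loop described above can be sketched in a few lines. This is a minimal illustration, not a real inference engine: `model` here is a hypothetical callable that returns a probability for each candidate next token (a real LLM would produce logits over a vocabulary of tens of thousands of tokens), and the toy model below is invented for the example.

```python
import random

def generate(model, prompt_tokens, max_new_tokens, eos_token=None):
    """Minimal autoregressive loop: one forward pass per new token.

    `model(tokens)` is a hypothetical interface returning a dict that maps
    each candidate next token to its probability.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)                       # full forward pass over the sequence so far
        choices, weights = zip(*probs.items())
        next_token = random.choices(choices, weights=weights, k=1)[0]
        tokens.append(next_token)                   # next step conditions on this choice
        if next_token == eos_token:                 # stop token ends generation early
            break
    return tokens

# Toy "model" for illustration: a fixed distribution regardless of context.
toy = lambda toks: {"Paris": 0.95, "Lyon": 0.02, "Nice": 0.03}
```

Note that the prompt tokens are never modified; each iteration can only append, which is exactly the left-to-right constraint the text describes.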

This one-token-at-a-time constraint has profound implications. Generation speed is fundamentally sequential: each new token requires a full forward pass through the model, which is why LLMs can read (process input) much faster than they can write (generate output). The model cannot revise earlier tokens based on later reasoning (though reasoning models work around this by thinking before answering). And the quality of generation depends heavily on sampling strategies: temperature (how random to be), top-k (consider only the k most likely tokens), and nucleus sampling (consider the smallest set of top tokens whose cumulative probability exceeds a threshold p) all shape the tradeoff between creativity and coherence.
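The three sampling strategies can each be written as a small transformation of the model's output distribution. A minimal sketch, using plain Python lists in place of real logit tensors; the function names are ours, not from any particular library:

```python
import math

def softmax(logits):
    # Convert raw logits to probabilities (stabilized by subtracting the max).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def apply_temperature(logits, temperature):
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    return [l / temperature for l in logits]

def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize.
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Nucleus sampling: keep the smallest set of most-probable tokens whose
    # cumulative probability reaches p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        cum += probs[i]
        if cum >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]
```

In practice these are composed: logits are divided by the temperature, converted to probabilities, filtered by top-k and/or top-p, and only then sampled from.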

Autoregressive generation contrasts with other paradigms. Diffusion models generate images by iteratively denoising—working on all pixels simultaneously. GANs generate outputs in a single forward pass. Recursive Language Models aim to combine autoregressive generation with hierarchical structure. Understanding autoregressive generation—its power, its constraints, and the engineering innovations (KV caching, speculative decoding, quantization) that make it practical—is understanding the engine that powers the agentic web.
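Of the engineering innovations mentioned, KV caching is the most direct consequence of the autoregressive structure: since the prefix never changes, the attention keys and values computed for it can be stored and reused instead of recomputed at every step. A minimal single-head sketch under that assumption (the class and function names here are illustrative, not from any real framework):

```python
import math

class KVCache:
    """Stores the key and value vectors for every token generated so far,
    so each new step only computes K/V for the one new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def attend(query, cache):
    # Attention over the cached prefix: softmax(q . k) weighted sum of values.
    scores = [sum(qi * ki for qi, ki in zip(query, k)) for k in cache.keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(cache.values[0])
    return [sum(w * v[d] for w, v in zip(weights, cache.values))
            for d in range(dim)]
```

Without the cache, step n would recompute keys and values for all n prefix tokens, making generation quadratic overall; with it, each step does work proportional to the current sequence length only for the attention itself.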