Context Engineering

Context engineering is the discipline of designing what an AI model or agent sees inside its context window at every step of a task — which instructions, retrieved knowledge, tool results, prior actions, and persistent memory get assembled into the prompt, and which get withheld, summarized, or offloaded. Where prompt engineering asked "how should we phrase the instruction?", context engineering asks "what should the model be looking at right now, and why?"

The shift in framing is consequential. As AI agents moved from one-shot chat completions to long-horizon tasks involving dozens or hundreds of tool calls, the limiting factor stopped being how the user phrased a single prompt and became how the agent's working memory was managed across an entire run. Bigger context windows did not solve the problem; in many ways they made it worse, since attention degrades over long sequences, inference costs balloon, and irrelevant tokens crowd out the signal the model actually needs. Anthropic captured the emerging consensus in late 2025: context is a finite resource with diminishing marginal returns, and engineering it well is now the central task of building reliable agents.

The Four Moves

Practitioners have converged on four basic operations for shaping context. Offloading moves information out of the prompt and into external systems — the filesystem, a vector store, a scratchpad, or a structured todo.md the agent edits as it works — so the context window holds pointers and summaries rather than raw content. Retrieval pulls information back in dynamically when needed, rather than front-loading everything the agent might possibly want; this is where retrieval-augmented generation sits inside the broader discipline.
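A minimal sketch of offloading in Python. The helper name, the artifacts directory, and the 2,000-character inline budget are all illustrative assumptions, not any particular framework's API:

```python
import hashlib
import pathlib

ARTIFACT_DIR = pathlib.Path("artifacts")
MAX_INLINE_CHARS = 2_000  # illustrative budget; tune per model and task

def offload_if_large(tool_name: str, result: str) -> str:
    """Return the text that actually enters the context window.

    Small results pass through unchanged; large ones are written to
    disk and replaced by a short preview plus a pointer the agent
    can retrieve from later.
    """
    if len(result) <= MAX_INLINE_CHARS:
        return result
    ARTIFACT_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256(result.encode()).hexdigest()[:12]
    path = ARTIFACT_DIR / f"{tool_name}-{digest}.txt"
    path.write_text(result)
    # The context holds a pointer and a preview, not the raw content;
    # a later file-read or search tool call is the retrieval half of
    # the loop, pulling the artifact back in only when it is needed.
    return (
        f"[offloaded {len(result)} chars to {path}; preview follows]\n"
        + result[:300]
    )
```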

Isolation partitions context across sub-agents or sub-tasks so that a research subroutine's intermediate scratch work does not pollute the main agent's reasoning — a pattern that has driven the rise of multi-agent and sub-agent harness architectures. Reduction, also called compaction, compresses or deletes history the agent no longer needs while preserving what it will need later. The hard part is deciding which is which.
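Reduction can be as simple as collapsing older turns into a model-written summary while keeping the recent tail verbatim. A sketch under stated assumptions: `summarize` is a hypothetical callable (in practice a call back into the model), and the eight-message tail is arbitrary:

```python
KEEP_RECENT = 8  # how many recent messages survive verbatim; arbitrary

def compact(history: list[dict], summarize) -> list[dict]:
    """Collapse everything before the recent tail into one summary turn.

    `history` is a list of {"role": ..., "content": ...} messages;
    `summarize` maps a long transcript to a short digest. Deciding
    what the digest must preserve (open tasks, decisions, file paths)
    is the hard, task-specific part the paragraph above describes.
    """
    if len(history) <= KEEP_RECENT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary_turn = {
        "role": "system",
        "content": "Summary of earlier work:\n" + summarize(transcript),
    }
    return [summary_turn, *recent]
```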

Working Techniques

A handful of concrete techniques have emerged as load-bearing in production. Task recitation, popularized by Manus, has the agent maintain and continuously rewrite a plan or todo file inside the prompt, pushing the global goal back into recent attention so it is not lost across long tool sequences averaging fifty or more calls. Error preservation deliberately keeps failed actions and their error messages in context rather than cleaning them up — the model uses them to update its priors and avoid repeating the same mistake.
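A sketch of recitation at the prompt-assembly level. The function name and layout are invented for illustration, not Manus's actual implementation:

```python
def build_turn_prompt(system_prompt: str,
                      history: list[str],
                      todo_md: str) -> str:
    """Assemble one turn with the plan recited at the very end.

    The todo file is rewritten by the agent as work progresses
    (items checked off, new items added) and re-appended each turn,
    so the global goal sits in the most recent attention span even
    after dozens of tool calls. History keeps failed actions and
    their error messages verbatim: that is the error-preservation
    half of the pattern.
    """
    return "\n\n".join([
        system_prompt,                  # stable prefix, never mutated
        *history,                       # actions, observations, errors
        "## Current plan\n" + todo_md,  # recited fresh every turn
    ])
```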

Structured variation defends against an underappreciated failure mode: language models are excellent mimics, and a context full of similar past action-observation pairs will pull them into copying that pattern even when it is no longer optimal. Small amounts of variation in templates, phrasing, and ordering break the loop. Tool masking instead of removal hides irrelevant tools from the model's view without destroying the KV-cache, preserving inference performance while narrowing the action space. And cache-aware prompt design treats the prefix of the context as a stable hot path — small edits at the top of a long prompt invalidate enormous amounts of cached work, so production systems pin their system prompts and append rather than mutate.
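A sketch of masking over removal. The tool names and state labels are invented, and the decode-time enforcement is only gestured at in the comments:

```python
ALL_TOOLS = ["browser_open", "browser_click", "shell_exec",
             "file_read", "file_write", "task_done"]
# Tool definitions are serialized once into the prompt prefix and
# never removed, so the KV-cache built over them stays valid.

def allowed_tools(state: str) -> set[str]:
    """Narrow the action space per state without touching the prompt."""
    if state == "awaiting_user_reply":
        return {"task_done"}  # respond to the user; take no other action
    if state == "browsing":
        return {t for t in ALL_TOOLS if t.startswith("browser_")}
    return set(ALL_TOOLS)

# Enforcement happens at decode time, for example by banning the
# token prefixes of disallowed tool names via constrained decoding,
# rather than by editing the cached tool definitions.
```

The same cache logic motivates append-only prompt design: because a change at position N invalidates every cached token after it, production systems pin the system prompt and tool schemas at the top and push all per-turn changes to the tail.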

Where It Sits

Context engineering sits between two adjacent disciplines. Prompt engineering (2022–2023) was the craft of composing the right single instruction for a stateless model. Harness engineering (2026–) is the practice of building the surrounding infrastructure — sandboxing, approvals, sub-agent lifecycles, observability — that keeps autonomous agents safe and recoverable. Context engineering is the layer in between: the runtime discipline of curating what the model sees on every turn. In Gartner's mid-2025 framing, context engineering is "in" and prompt engineering is "out"; in practice the older discipline has not vanished, but it has become a sub-skill inside the broader work of context management.

The empirical character of the field is striking. There is no closed-form solution for composing a good context, and teams refer to the work — only half-jokingly — as stochastic graduate descent: iterate on context shape, measure agent behavior, repeat. Manus rebuilt its agent framework four times before settling on a context architecture that scaled. LangChain improved a coding agent's Terminal-Bench 2.0 score from 52.8% to 66.5% by changing nothing about the model — only what the model was being shown.

Implications

If foundation models are commoditizing, then much of what differentiates a good agent product from a bad one lives in the context layer. The teams that win the next round of agent quality will be the ones with the most disciplined practice of measuring what their agent is looking at, removing what does not earn its place, and routing the rest through memory, retrieval, and isolation systems built for that purpose. Context engineering is the interface between raw model capability and useful agent behavior, and treating it as a first-class engineering practice — with budgets, tests, and observability — is becoming table stakes for any team shipping agents into production.