The Architecture of Forgetting

Language models have no memory. What looks like recollection is a reconstruction at every call. A comparative analysis of how six agentic CLIs solve the problem of a scarce working memory architecturally, with precise thresholds and verifiable citations from the respective codebases.

FC
Frank Csehan
May 17, 2026 · 20 min read

Language models appear to their users like collegial conversation partners. They reply in the first person, refer back to earlier statements, seem to understand code and grasp intentions. This effect is a carefully constructed illusion. Technically speaking, a language model has no memory. It has a context, which its surrounding application retransmits in full at every call. What looks like continuity within a session is in fact a complete retransmission of the entire conversation at every single step.

This architecture has a consequence that only emerges in productive use. Once a language model moves out of the editor chat into a command line and starts processing tasks for hours on end, the context window becomes the system’s scarcest resource. The agentic command-line tools of the major vendors, namely Claude Code by Anthropic, Codex CLI by OpenAI, Google’s Gemini CLI, Alibaba’s Qwen Code, OpenCode from the open-source community and GitHub’s Copilot CLI, all answer the same technical question: how do you manage a scarce, transient working memory in such a way that a longer session remains productive? A systematic reading of the six codebases, which in five of six cases is openly accessible thanks to open-source licences, reveals a remarkable convergence in solution patterns and at the same time five clearly distinguishable axes along which design decisions diverge. This article places both findings side by side.

Three structural properties of the context window

Anyone who wants to understand the comparison of these tools must first know the constraints under which all six vendors operate. They follow from the architecture of the underlying transformer models and are extensively documented in the research literature.

First property: the context window is hard-limited. One million tokens, the equivalent of around 750,000 English words, has become the new standard in the current top-tier models. Anthropic made the million-token window generally available for Claude Opus 4.6 and Sonnet 4.6 in March 2026 and ships it as default for Claude Code users on Max, Team and Enterprise plans without any configuration; OpenAI offers the million in GPT-5.5, capped at 400,000 tokens inside the Codex CLI; Google delivers it in Gemini 2.5 Pro; Alibaba in Qwen 3.5-Plus. Older and smaller models such as Claude Sonnet 4.5 remain at 200,000 tokens, which had been the industry standard up to 2025. Even with these enlarged windows, the attention operation of the transformer architecture scales quadratically with sequence length: twice the context means four times the compute. Sparse attention and state-space approaches dampen this effect but do not eliminate it. A single token beyond the hard limit and the API call fails. There is no runtime extension.

Second property: the context window is transient and stateless. At every single model call the entire previous history is retransmitted, consisting of system prompt, all messages, all tool calls and all tool results. Vendor-side prompt caching, which stores stable prefixes of the conversation, reduces the cost of repeated transmissions substantially: Anthropic charges cache reads at one tenth of the regular input price, a ninety-percent reduction; OpenAI automatically grants a fifty-percent discount on reused input tokens, newer models in some cases more. None of this changes the underlying mechanism. What looks like memory is organised repetition.

Third property: the context window ages. Long before the hard limit, response quality declines. The term context rot was coined in 2025 in a technical report by Chroma Research for precisely this phenomenon; the authors tested 18 frontier models, including GPT-4.1, Claude Opus 4 and Gemini 2.5, and observed the effect in every single one. Anthropic has since adopted the term in its own documentation and quantifies in the Best Practices for Claude Code the onset range for its million-token models: between 300,000 and 400,000 tokens, around thirty to forty percent fill. Empirical evaluations on the LoCoMo benchmark by Snap Research, which covers conversations averaging 9,000 tokens across up to 35 sessions, show across model families that even long-context LLMs trail human performance significantly once the session grows long. The argument of this article is therefore not that a larger window achieves nothing; it is that a larger window shifts the problem in time rather than solving it.

The model’s attention spreads across more tokens, old ballast pulls the model away from the current focus, and the probability rises that relevant instructions are ignored. Research calls this instruction omission: the silent dropping of instructions without the model flagging it. A 2025 paper by Distyl AI shows that even the best frontier models reach only around 68 percent accuracy at 500 simultaneous instructions. The instructions fit easily into the window. The model sees them. It does not follow them all.

From these three properties follows a technical necessity. An agentic CLI must actively manage its context: compress older content, discard irrelevant tool outputs, offload important findings into durable files, reconstruct exactly what is relevant before each new task. How precisely the six tools do this is the actual subject of this comparison.

Four convergent base patterns

A systematic reading of the six codebases reveals a surprising agreement. Despite different programming languages, different vendors and different design philosophies, the tools share four base patterns. This convergence is no accident. It is the result of the same constraints, which force all vendors to explore the same solution space.

The first pattern is selection over completeness. None of the tools examined copies files indiscriminately into the context. Each makes decisions about which instruction files are loaded, which tool results are kept, which memory layers are active. Selection is the first line of defence against context rot. Gemini CLI, for instance, classifies read-in files into four tiers labelled internally as FULL, PARTIAL, SUMMARY and EXCLUDED, and decides per file how much of it lands in the context. Claude Code, during its micro-compaction, deliberately clears old tool results without touching the conversation itself.

The second pattern is hierarchical discovery of instruction files. All six CLIs search for their project rules, whether they are called CLAUDE.md, AGENTS.md, GEMINI.md, QWEN.md or copilot-instructions.md, on the path from the project root down to the current working directory. More specific rules in submodules override the more general ones. Anyone working in a monorepo automatically gets the appropriate rule set. This is an industry consensus, not the invention of any single vendor. The rationale is obvious: a flat, project-wide rules file would be too coarse-grained for large codebases.

The third pattern is the prioritisation of user content over tool output. When context gets tight, all tools discard old tool results first, not the user’s messages. The logic is clear. What a user has written expresses intent. What a tool has produced can be regenerated by invoking the tool again. A user message, by contrast, is not reproducible; it is an authentic part of the session.

The fourth pattern is structured re-anchoring after compaction. As soon as a compression has occurred, important state data is explicitly appended to the new context: file attachments, the active plan, enabled tools, connections to external services, hook outputs from the session start. The new context is thus not a loose summary but a reconstructed working picture. Claude Code is particularly pronounced here, reattaching file attachments, plan-mode state, active skills, tool and agent deltas after every compaction.

Together, these four patterns form an informal pattern language for context management. The terms trim, summarise, persist and prune capture their operations. Anyone who can name them can assess new tools in minutes instead of weeks. They are to context management what design patterns were to object orientation: not an algorithm, but a vocabulary.

Five axes of architectural distinction

As clear as the convergence is, the differences are just as distinct. Five axes are particularly consequential for tool choice. They will be set out below with the precise values supported by reading the source code.

Axis 1: aggressiveness of compaction

This axis is expressed through the threshold at which the tool triggers a compaction, measured as a fill rate of the context window. The spread is considerable. Gemini CLI starts early, at fifty percent fill. Qwen Code, which is built on a fork of the Gemini codebase, waits until seventy percent. Copilot CLI splits the task across two thresholds: at eighty percent an asynchronous compaction runs in the background, at ninety-five percent a blocking emergency brake holds the call until the compression is complete. Codex waits until ninety percent of the configured context window. The internal mechanics are noteworthy: Codex does allow the automatic token limit to be configured explicitly per model, but internally clamps the value to a maximum of ninety percent of the context window. Claude Code operates without a fixed percentage trigger; instead, several overlapping layers run, whose depth of intervention ranges from surgical to complete.

Every threshold is a compromise. Early compaction costs additional LLM calls, since the summarisation is itself a model invocation. Late compaction lives longer in the context-rot zone and risks quality loss shortly before intervention. The choice between fifty and ninety-five percent is a choice between token economy and response stability.

Axis 2: depth of the memory system

The spectrum spans a wide range. OpenCode dispenses with persistent memory entirely. Whatever a session has produced lives on only in the files the user explicitly creates. Claude Code, Codex, Gemini and Qwen each use a flat Markdown file at a defined path: Claude Code under ~/.claude/.../memory/, Codex in a directory called memories, Gemini and Qwen in a project- or user-wide GEMINI.md or QWEN.md. Copilot CLI, by contrast, runs two SQLite databases in the background, combines them with vector embeddings and full-text search, and makes persistent memory a dedicated session capability with structured permission requests. Every memory write there passes through a permission gate that requires three fields: subject, fact and citations. Memory at Copilot is therefore not a free-form Markdown blob but a record separated by topic, fact and provenance.

This range, from the absence of a memory system to a fully fledged retrieval system with provenance fields, is the clearest difference in the comparison. Anyone who only needs memory as a note in the repository has five tools to choose from. Anyone who needs memory as a queryable database across many sessions has exactly one.

Axis 3: sub-agent infrastructure

Sub-agents serve to isolate context. A sub-task is delegated to a child agent that receives its own window, which expires once the task is complete. This protects the main context from bloating, because the detailed knowledge of the sub-task does not flow back into the main conversation; only the result returns. Claude Code provides a simple delegation tool here. Copilot CLI runs a fleet dispatcher that orchestrates parallel sub-agents, together with a dedicated smaller sub-agent model and explicit depth limits that prevent a task from recursing indefinitely. Codex uses sub-agents among other things for memory consolidation in a dedicated phase after session end. OpenCode and Qwen are sparsely equipped on this axis.

Axis 4: observability

One of the biggest differences sits here. Copilot CLI emits structured events such as session.compaction_start and session.compaction_complete, supports OpenTelemetry and offers a preCompact hook for external observers. Every compaction can thus be measured, logged and audited. Token counts before and after the compression, the number of removed messages, the generated summary, the GitHub request trace, all of this is part of the event structure. Anyone integrating a tool into productive pipelines or having to maintain audit trails depends on this kind of infrastructure. The other tools are significantly less instrumented on this axis.

Axis 5: extension system

Hook points such as sessionStart, preToolUse, subagentStart or preCompact turn a closed tool into a platform. They are the modern form of extension points and let teams hang their own governance, validation and logic into the agent loop. Claude Code and Copilot CLI are the richest equipped here. Codex offers specific but less orchestrable extension points. Gemini, Qwen and OpenCode remain more sparing; they leave extensions largely to the configuration mechanism rather than to an event-based model.

Six architectures in detail

On the basis of the four patterns and five axes, each individual CLI can now be characterised.

Claude Code treats compaction as a gradient with four layers that differ in their depth of intervention. At the lowest layer, a micro-compaction replaces individual older tool results with a placeholder text. It is surgical, leaves the conversation untouched and runs continuously during the session. A second layer cuts individual older message blocks when needed, before the micro-compaction operates. A third layer is session-memory compaction, which runs as a post-sampling hook after every model response, extracting memory entries from the session and storing them in a dedicated file. The fourth layer is the classic, complete compaction with subsequent rehydration of all state data. This architecture is the most elastic of the six and simultaneously the one with the highest internal complexity. It follows the idea that compaction should not be a single cut but a family of staggered interventions.

Codex treats the session as a cleanly defined state space. Automatic compaction kicks in at ninety percent of the context window; the user window, the most recent portion of the conversation, is preserved with 20,000 tokens at the end so that the current working context stays protected. The memory system is built as a separate subsystem. Memory entries are routed through a dedicated trace-summarisation endpoint to a sub-agent which, in a dedicated phase after the session ends, consolidates them into durable memory files. Reproducibility is the clearly visible design goal: initial context, compaction boundary and memory operations are all explicit and traceable.

Gemini CLI distributes the task across four specialised services that work together and affords itself a dedicated utility model for generating summaries. This model internally goes by the name chat-compression-2.5-flash-lite. It is a smaller, cheaper variant of the main model and exclusively handles compression. With this, Gemini comes closest to the separation of memory operations from the main model recommended by research. Compaction begins aggressively at fifty percent fill, which generates extra LLM calls but keeps the main model well below the context-rot zone.

Qwen Code is a fork of Gemini CLI by Alibaba. The codebase inherits large parts of the Gemini logic. The compaction threshold is, however, set higher at seventy percent, and the compressor sits at a different point in the pipeline. A closer look reveals that the compression bypasses the memory path that Gemini routinely traverses. The result is a modern architecture with an internal break in integration: what Gemini cleanly separates is brought together again in Qwen at one point.

OpenCode understands itself as a janitor tool. It runs strong overflow management with a 20,000-token buffer that triggers compaction precisely when the context is about to overflow, rather than orienting itself by a fixed percentage threshold. Instruction files are loaded lazily, the code is flat and hierarchically organised, the overflow logic is transparent and easy to follow. A persistent memory system is entirely absent. Anyone who prioritises simplicity and traceability finds the lowest complexity here.

Copilot CLI is of all six tools the closest to what research papers from 2025 and 2026 demand. Compaction runs two-stage, asynchronously at eighty percent and blocking at ninety-five percent. Persistent memory is managed across multiple sessions in two SQLite databases with vector embeddings and full-text search. Memory writes are permission-controlled and carry their own provenance field, documenting the sources of the stored fact. Sub-agents are dispatched in parallel through a fleet dispatcher. Compaction emits structured events with extensive metrics. A hook system allows external logic to be injected before and after every compaction. The price of this richness: the core is closed source. Only the SDK is publicly accessible; the central algorithms can however be inferred from the freely available SDK code and the documented telemetry, since the default thresholds as well as the event structure of the compaction are published there as a typed interface.

Memory subsystems: from Markdown to database

A separate look is warranted for the memory subsystem, because here the architectures diverge most strongly and the largest research gap lies open.

Four of the six tools store memory as a Markdown file. Claude Code, Codex, Gemini and Qwen follow this path. The advantage is obvious: Markdown is human-readable, versionable, and can be placed in the repository. A developer can read memory entries in a normal text editor, track them in git, review them in a pull request. The disadvantage is just as obvious: a flat file is not a queryable system. Whoever wants to know what was decided about the authentication module two weeks ago has to either read the file themselves or throw its entire contents back into the agent’s context.

OpenCode dispenses with any persistent memory system. This is a deliberate design decision, not an oversight. Anyone needing memory creates their own Markdown files in the repository and refers to them from the prompt. The consequence is that every session starts with whatever is documented in the repository, not with whatever the tool has stored internally. This clarity has its price: there is no automatically built-up learning curve across sessions.

Copilot CLI takes the opposite route and implements a full memory subsystem. Two SQLite databases store content and embeddings separately, full-text search allows classic keyword queries, vector embeddings allow semantic similarity searches. Memory is written or voted on through structured permission requests; at every write the three fields subject, fact and citations must be present. This is closer to the demands of the research literature, particularly to works such as MemInsight and CDMem, than any of the other tools examined.

That said, even this system is not the state of the art. Cross-session memory is, according to the documentation, explicitly experimental. Intent-driven retrieval, that is, the selection of memory entries on the basis of the current task intent, is missing; retrieval runs over paths, namespaces and semantic similarity, not over goal states. Causal and temporal structures between memory entries are limited to timestamps; there is no explicit event graph.

Sub-agents as context isolation

Sub-agents are the second large field of architectural difference. The basic idea is simple: instead of handling a sub-task in the same context and thus burdening the main window, the main agent hands the task to a child agent that starts with an empty window, completes the task and returns only the result. The intermediate dialogue, the tool calls, the detailed research are spared from the main context.

Claude Code provides a simple delegation tool for this. The main agent can hand a task to a sub-agent that operates in the same model with its own context. This is conceptually clean and practically useful, especially for research tasks in large codebases.

Copilot CLI has the most extensive build-out. A fleet dispatcher can start sub-agents in parallel, a dedicated smaller model handles the sub-agent work at lower cost, explicit depth limits prevent endless recursions. This brings the architecture close to the patterns discussed in research as hierarchical multi-agent systems.

Codex uses sub-agents primarily for memory consolidation in a dedicated phase after session end. This separation between session sub-agents and consolidation sub-agents is architecturally interesting: memory operations are taken off the hot path and handled asynchronously. This lowers session latency but shifts the memory update in time.

Gemini, Qwen and OpenCode offer less along this axis. Sub-agents exist, but they are not the carrying design principle as in Copilot, nor a deeply integrated tool as in Claude Code.

State of research and the frontier gap

As far as the six tools have come, they all trail what academic research has put forward in concepts over the past eighteen months. Works such as MemoryOS (Oral at EMNLP 2025), HiAgent (ACL 2025), LightMem (ICLR 2026) and Mnemis (accepted at ACL 2026, from Microsoft Research) sketch a next generation of memory architectures that differ from today’s CLI implementations in four respects.

First, the papers call for intent-sensitive retrieval: memory is selected on the basis of the current task intent, not by path or timestamp. Second, they demand explicit causal-temporal structures, that is, relations between facts that are not hidden in prose but addressable as a directed graph. Third, they describe learning policies for inserting and deleting memory entries that learn from the success of later steps instead of using fixed heuristics. Fourth, they call for security governance for persistent storage with access control, audit trails and explicit handling of personally identifiable information.

Today’s CLIs implement fragments of this. Tool outputs are sacrificed before user prose because they are reproducible: that is an efficiency feature the research praises. Memory is written selectively. Copilot stores provenance fields, the others do not. Truly intent-driven retrieval logic does not exist in any of the tools. Causal structures are entirely absent. What the CLIs master today is the economic efficiency of context management. What they have not yet mastered is its depth.

The survey „Memory in the Age of AI Agents", published on arXiv in December 2025 with 47 authors, reaches a similar conclusion. Traditional taxonomies such as the distinction between long-term and short-term memory, it argues, are insufficient for modern agent memory systems. Multi-session consistency across hours and days is considered largely unsolved. Selective forgetting is described as the hardest of the open challenges. Even Mnemis, one of the currently best-placed systems, achieves 93.9 out of 100 points on the LoCoMo benchmark with GPT-4.1-mini and 91.6 on LongMemEval-S, while explicitly making no claim to a comprehensive solution and instead transparently documenting its limits.

Consequences for tool choice

Four conclusions follow from the analysis that hold for practice.

First, the question „which CLI is the best?" is the wrong question. There is no universal answer, only suitable tools for suitable requirements. Anyone working in a CI/CD pipeline will favour Copilot for its observability or Codex for its reproducibility. Anyone developing interactively and with heavy configuration will find the greatest flexibility in Claude Code. Anyone needing memory as a queryable database across many projects has exactly one choice. Anyone who values simplicity above all is well served by OpenCode.

Second, the pattern language carries. Anyone who has internalised trim, summarise, persist and prune, along with hierarchical discovery, sub-agent isolation, post-compact rehydration and hook-based extension, no longer assesses the next tool on the market through demo videos but by reading the relevant code paths. That is the real return of an architectural comparison: a vocabulary that outlives the half-life of individual implementations.

Third, and this is the most uncomfortable insight for vendors and users alike, even the best memory system does not replace structured planning of the work itself. The tools deliver mechanics. The plan, what is to be built, in which order, and by what criteria for acceptance, is still delivered by the human. The context window is a resource. What goes into it remains a question of method.

Fourth and finally, the gap between research and implementation is an opportunity. The next generation of CLIs will not be decided by the main model, but by which vendors first bring intent-sensitive retrieval, causal-temporal memory structures and learning policies into productive form. Anyone choosing today should factor in that this field will look different in two years, without the pattern language presented here becoming obsolete.

The problem of a scarce, transient working memory cannot be solved by larger models. It is structural and follows from the architecture of the transformer. Anyone working productively with agentic tools will have to treat context management as a discipline of its own, comparable to memory management in classical systems programming. The six CLIs examined here show that the industry has discovered this discipline. They also show how many paths remain open for mastering it.

AI Agentic Coding Software Engineering Architecture Claude Code Copilot