Two years ago, I started working seriously with coding agents. Not as a gimmick, not for demos, but as a primary tool in my daily work. Since then, I’ve followed every generation of these tools: from the early Copilot versions through Cursor to Claude Code, OpenAI Codex, and the GitHub Copilot Coding Agent. Every new release, every new model, every new architecture.
One insight cost me the most time and frustration until I accepted it: The model is rarely the problem. It’s the context.
What “Context” Really Means
In the world of language models, “context” is often equated with the “context window”—the number of tokens a model can process simultaneously. 128k, 200k, soon a million. The numbers keep growing. The problem remains.
Because context isn’t just what the model can see. Context is what it’s supposed to pay attention to. And those are two completely different things.
A recent paper by Distyl AI (“How Many Instructions Can LLMs Follow at Once?”) investigated exactly that. The researchers gave 20 different models tasks with an increasing number of concurrent instructions: from 10 to 500. The results finally provided me with an explanation for something I’ve been observing for two years but could never quite put into words:
Even the best-performing model tested (Gemini 2.5 Pro) achieved only 69% accuracy with 500 instructions. The instructions fit easily into the context window. The model sees them all. It just doesn’t follow them all.
Of course, the benchmark is simplified. Incorporating keywords into a text is different from adhering to architectural rules in a codebase. But the principle applies: If you give an agent an instruction file (CLAUDE.md, AGENTS.md, or a comparable format) containing 30 project rules, along with 15 files as context and a complex task description, you are effectively sending hundreds of implicit constraints into an interaction. The question isn’t whether the model can read them all. The question is whether it takes them all into account at the same time.
The context window is memory space. Following instructions is attention. More memory doesn’t help if attention is limited. A context window of one million tokens is like a desk that’s three meters long. That doesn’t mean you can read all the documents on it at the same time.
Three Ways to Forget
The paper identifies three patterns of how models degrade under increasing instruction load. Anyone who works regularly with agents will recognize them all.
Threshold Decay: The model works almost perfectly up to a certain threshold, then performance drops off abruptly. Reasoning models like o3 or Gemini 2.5 Pro exhibit this pattern. In practice, this was the most frustrating scenario for me: The agent performs excellently through the first 15 tasks. On task 16, it suddenly forgets the architectural specifications from the instruction file. It feels like losing a reliable colleague who suddenly starts doing things you didn’t discuss.
Linear Decay: Steady, gradual deterioration with each additional instruction. Claude 3.7 Sonnet and GPT-4.1 exhibit this pattern. For a long time, this was the most insidious pattern for me because I didn’t notice it. The quality doesn’t drop abruptly. It erodes. Like a conversation that slowly loses its focus without anyone being able to pinpoint the moment it happened.
Exponential Decay: Rapid collapse even at moderate instruction density. Smaller models like Llama 4 Scout or Claude 3.5 Haiku exhibit this pattern. They are useful for simple, focused tasks, but not for complex projects.
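The three shapes are easier to compare side by side. The following is a minimal sketch that models each pattern as a simple function of the instruction count. The functional forms and every parameter (the threshold of 150, the slopes, the decay rate) are illustrative assumptions chosen only to show the shapes; they are not fitted to the paper's data.

```python
import math

def threshold_decay(n: int, threshold: int = 150, slope: float = 0.001) -> float:
    """Perfect score up to `threshold`, then a linear drop (hypothetical parameters)."""
    if n <= threshold:
        return 1.0
    return max(0.0, 1.0 - slope * (n - threshold))

def linear_decay(n: int, slope: float = 0.001) -> float:
    """Score shrinks by the same amount with every added instruction."""
    return max(0.0, 1.0 - slope * n)

def exponential_decay(n: int, rate: float = 0.01) -> float:
    """Score shrinks by the same factor with every added instruction."""
    return math.exp(-rate * n)

# Print the three illustrative curves at a few instruction counts.
for n in (10, 50, 100, 250, 500):
    print(
        f"{n:>4}  threshold={threshold_decay(n):.2f}  "
        f"linear={linear_decay(n):.2f}  exponential={exponential_decay(n):.2f}"
    )
```

With these made-up parameters, the threshold curve is still at its maximum at 100 instructions, while the exponential curve has already lost most of its value. That gap is why the same instruction file can feel fine with one model and unusable with another.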
What has preoccupied me the most: The dominant error type at high instruction density is omission. The model simply leaves instructions out. It doesn’t carry them out incorrectly; it ignores them. Without a word. Without a hint. Anyone who’s experienced this knows how it feels: You review the code, everything looks clean, and only on the third look do you notice that the error handling you explicitly asked for is missing.
Two Years of Evolution
To understand why the current generation is so much better, it’s worth looking back. Not as a history lesson, but because it’s only through contrast that you see which problems have been solved and which have merely been postponed.
2024: The Era of Manual Context Management
Two years ago, a typical workflow looked like this: You opened a chat, fed the model code snippets, explained the context in prose, formulated the task, and hoped for the best. With every new message, you had to drag the context along with you, or risk the model forgetting it.
Cursor was an early breakthrough because it automatically incorporated the editor context. You no longer had to explain to the model which file you were in. But the architectural level (why the code is structured that way, which patterns apply, what the project conventions are) had to be re-established with every session.
The result: You became a context manager. Half of your working time wasn’t spent on the actual task, but on maintaining the context. Prompt engineering was, at its core, context management. And it was exhausting.
2025–26: Context as an Architectural Feature
The current generation has understood that context management isn’t a user problem, but an architectural problem. And every tool solves it in its own way.
Common to the current agents is the concept of a persistent instruction file in the project root: CLAUDE.md in Claude Code, AGENTS.md in OpenAI Codex and Copilot CLI, and .cursorrules in Cursor (newer Cursor versions favor rule files under .cursor/rules). The file is automatically loaded with every interaction and contains project rules, conventions, patterns, and technical decisions. Everything the agent needs to know about the project.
Claude Code takes it a step further with a Memory system: The agent can remember things that apply across sessions. “Always use bun instead of npm.” “The project uses Vitest, not Jest.” “API authentication runs via JWT with RS256.” Once said, permanently stored.
Also worth mentioning in Claude Code is Plan Mode. Before the agent writes code, it creates a plan that the human must approve. That sounds trivial. In practice, it has changed my workflow: It forces the agent to make its context explicit. You can see what it has understood and what it hasn’t. And you can correct it before a single line of code is written. Saying “Wait a minute, I meant something different” before implementation instead of after not only saves time, it also prevents the frustration of having to tear out finished code.
Other tools are experimenting with an approach I find particularly promising: structured task management directly within the agent. Instead of keeping tasks as plain text in the chat, they are stored in a local database, along with status, dependencies, and progress. The agent can check what it has already completed, what is still pending, and what is blocked. This isn’t a chat with a memory. It’s an agent with a backlog.
Regardless of the specific implementation, this approach solves a problem that has constantly bothered me in practice: the loss of work progress. In the chat-based world, a context switch (a compiler error, a lint issue, a failing test) was often the moment when the agent forgot the original plan. It fixed the error but could no longer reliably return to the original task. Persistent task management makes the work progress independent of the conversation flow, whether it’s based on SQLite, Markdown files, or another format.
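To make the idea of an agent with a backlog concrete, here is a minimal sketch of such a store using Python's built-in `sqlite3` module. The table names, column names, status values, and helper functions are all hypothetical; they do not describe how any particular tool implements its task management.

```python
import sqlite3

def open_backlog(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) schema: tasks plus their dependencies."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS tasks (
            id     TEXT PRIMARY KEY,
            title  TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending'
                   CHECK (status IN ('pending', 'in_progress', 'done'))
        );
        CREATE TABLE IF NOT EXISTS dependencies (
            task_id    TEXT NOT NULL REFERENCES tasks(id),
            depends_on TEXT NOT NULL REFERENCES tasks(id),
            PRIMARY KEY (task_id, depends_on)
        );
    """)
    return conn

def add_task(conn, task_id, title, depends_on=()):
    conn.execute("INSERT INTO tasks (id, title) VALUES (?, ?)", (task_id, title))
    conn.executemany(
        "INSERT INTO dependencies (task_id, depends_on) VALUES (?, ?)",
        [(task_id, dep) for dep in depends_on],
    )

def complete(conn, task_id):
    conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))

def ready_tasks(conn) -> list[str]:
    """Pending tasks whose dependencies are all done, i.e. not blocked."""
    rows = conn.execute("""
        SELECT t.id FROM tasks t
        WHERE t.status = 'pending'
          AND NOT EXISTS (
              SELECT 1 FROM dependencies d
              JOIN tasks dep ON dep.id = d.depends_on
              WHERE d.task_id = t.id AND dep.status != 'done'
          )
        ORDER BY t.id
    """).fetchall()
    return [row[0] for row in rows]

conn = open_backlog()
add_task(conn, "T1", "Extend role enum")
add_task(conn, "T2", "Database migration")
add_task(conn, "T3", "Add permission check", depends_on=("T1", "T2"))
print(ready_tasks(conn))  # T1 and T2 are ready; T3 is blocked
complete(conn, "T1")
complete(conn, "T2")
print(ready_tasks(conn))  # T3 is now unblocked
```

The point of the design is that "what is next?" becomes a query against stored state rather than something the model has to reconstruct from the conversation after an interruption.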
The Real Problem: Large Existing Applications
Greenfield is comparatively simple.
You start from scratch, define the architecture, and build step by step. The context is manageable because there isn’t much context to begin with. Agents shine here. That’s why most demos look so impressive.
The reality of professional software development looks different, and anyone who knows it understands: You work within existing systems. Systems with 500,000 lines of code, evolved structures, implicit conventions, historical decisions that no one ever documented, but which had their reasons at the time. Systems that are running in production and whose behavior must not be disrupted.
Here, context becomes the bottleneck. Not because the agent is stupid, but because the relevant context is so vast that even a human needs days to get up to speed.
I was tasked with implementing a new module in an existing enterprise application. The module had to work with the existing authentication system, use the existing role and permission model, fit into the existing CI/CD process, and use the established patterns for database access, logging, and error handling.
Simply put: “Add feature X.”
In reality: A context problem with 47 implicit constraints, none of which the agent knows about unless you brief it.
My first attempt (to be honest) was to throw the entire folder at the agent and describe the task. “Here’s the code, here’s what I want—go ahead.” The result was working code that ignored every single project convention. A custom logging framework instead of the existing one. Custom error handling. Custom database abstraction. Technically correct, architecturally a disaster. Anyone who’s experienced this knows the feeling: The code does the right thing, but it doesn’t belong.
That is exactly the omission problem from the paper: The agent had all the information in the context window. It just didn’t take it all into account.
My Path Back to Structure
In a previous post, I described why AI-assisted development needs more planning, not less. There, I also recommended doing the initial architecture work without the agent. When working on large existing codebases, the opposite has since proven better for me: planning together with the agent, because the agent can systematically explore a codebase that even I don’t fully know. It can scan directory structures, identify patterns across hundreds of files, and uncover dependencies that I would have missed.
What I formulated there as a principle has since crystallized into a concrete workflow that I use for every major project.
At first glance, it seems old-fashioned. At second glance, it is a direct consequence of what we know about context degradation. And it’s surprisingly fun once you get into it.
Phase 1: The Planning Session
Before a single line of code is written, I conduct a detailed planning session with the agent. Not as a monologue, but as a dialogue. And that’s the part I enjoy the most.
The session begins with the agent examining the existing codebase. Not superficially, but systematically: directory structure, architectural patterns, data models, dependencies, test structure, CI configuration. I instruct it to ask questions. Good agents do this on their own. Great agents ask the right questions.
Then we discuss. I contribute my architectural knowledge; the agent contributes its pattern knowledge. We compare the existing code with best practices. We identify where the code deviates from established patterns, and whether these are deliberate decisions or accumulated inconsistencies. We try out ideas, discard them, and refine them.
It feels like pair programming with a colleague who has read all the documentation in the world but has never worked in this specific building. You are the one who knows the building. Together, you arrive at something neither of you would have reached alone.
The end result is a joint, coordinated plan. Not vague. Concrete. With clear architectural decisions, defined interfaces, and identified risks.
This phase sometimes takes an hour. Sometimes two. It took me a long time to stop seeing this time as a delay. Now, for me, it’s the most valuable part of the entire process.
Phase 2: Epics and Tickets
Here comes the step that made the biggest difference for me.
I break down the agreed-upon plan (again together with the agent) into epics. Each epic represents a self-contained, coherent block of work: “Extend database schema,” “Implement API endpoints,” “Build frontend components,” “Write integration tests.”
Each epic is further broken down into tickets. Each ticket describes a single, focused task with:
- Context: Which files are affected? What existing patterns apply?
- Task: What exactly needs to be done?
- Acceptance criteria: How do we know when it’s done?
- Dependencies: Which tickets must be completed first?
I store these tickets in a local directory as Markdown files, organized by epic. The agent can read them; humans can read them. They are the shared contract.
Immediately after creating the epics, I update the agent’s instruction file (the CLAUDE.md, AGENTS.md, or the respective equivalent). All architectural decisions, conventions, and patterns from the planning session are now incorporated into the file before the first line of code is written. This way, the agent starts the implementation not only with a plan, but with an updated set of rules that keeps its work consistent across all epics.
An example of a ticket:
# EPIC-02/TICKET-03: Extend User Service with role verification
## Context
- Existing UserService in `src/services/user.service.ts`
- Role enum in `src/types/roles.ts`
- Existing pattern: Guards use `@UseGuards(RoleGuard)`
- Existing tests in `src/services/__tests__/user.service.spec.ts`
## Task
- Add method `hasPermission(userId, permission)` to UserService
- Respect existing role hierarchy from `roles.ts`
- Cache permissions per request (not globally)
## Acceptance Criteria
- [ ] Method exists and is typed
- [ ] Unit tests for all role combinations
- [ ] Existing tests continue to pass
- [ ] No new logging pattern introduced
## Dependencies
- EPIC-02/TICKET-01 (Role enum extended)
- EPIC-02/TICKET-02 (Database migration)
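Because a ticket in this shape is regular, it can also be read by a script, for example to validate tickets before handing an epic to the agent. The following is a minimal sketch that assumes exactly the layout shown above: one `# ID: title` line, `## ` section headings, and `- ` list items. The function names and the returned dictionary layout are hypothetical.

```python
import re

def parse_ticket(text: str) -> dict:
    """Split a Markdown ticket into its ID, title, and '## ' sections of list items."""
    ticket = {"id": None, "title": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# "):
            ticket_id, _, title = line[2:].partition(":")
            ticket["id"], ticket["title"] = ticket_id.strip(), title.strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            ticket["sections"][current] = []
        elif line.startswith("- ") and current is not None:
            # Drop a leading checkbox such as "[ ] " or "[x] ".
            item = re.sub(r"^\[[ xX]\]\s*", "", line[2:])
            ticket["sections"][current].append(item)
    return ticket

def dependency_ids(ticket: dict) -> list[str]:
    """Extract IDs like 'EPIC-02/TICKET-01' from the Dependencies section."""
    pattern = re.compile(r"EPIC-\d+/TICKET-\d+")
    return [
        m.group(0)
        for item in ticket["sections"].get("Dependencies", [])
        if (m := pattern.search(item))
    ]

sample = """\
# EPIC-02/TICKET-03: Extend User Service with role verification
## Task
- Add method `hasPermission(userId, permission)` to UserService
## Acceptance Criteria
- [ ] Method exists and is typed
## Dependencies
- EPIC-02/TICKET-01 (Role enum extended)
- EPIC-02/TICKET-02 (Database migration)
"""
ticket = parse_ticket(sample)
print(ticket["id"], dependency_ids(ticket))
```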
Phase 3: Epic by Epic
I assign the agent a complete epic, not individual tickets. The epic provides context and structure: which part of the system is affected, which files are relevant, and what the final goal should be. The tickets within the epic give the agent the individual tasks within this structure, in the correct order, with clear dependencies.
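Putting tickets "in the correct order" is a topological sort over their dependencies. Here is a minimal sketch using `graphlib` from the Python standard library (Python 3.9+). The `execution_order` helper and the shape of the `epic_02` mapping (ticket ID to the IDs it depends on) are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Return ticket IDs so that every ticket comes after its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"Circular ticket dependencies: {err.args[1]}") from err

# Hypothetical mapping: ticket -> tickets that must be completed first.
epic_02 = {
    "TICKET-01": [],
    "TICKET-02": [],
    "TICKET-03": ["TICKET-01", "TICKET-02"],
}
print(execution_order(epic_02))  # TICKET-03 always comes last
```

A circular dependency raises an error up front, which is cheaper to discover during planning than halfway through an agent session.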
The feeling when an agent works through a cleanly structured epic ticket by ticket and everything passes through the pipeline successfully in the end is fantastic. It’s like building with Lego instead of untangling spaghetti.
Why does this work? Because the epic provides the context and the tickets keep the density of instructions per step low. The agent doesn’t have to keep the entire project in mind; it only needs to understand this epic. And within the epic, each ticket has 5–10 clear constraints. Not 50. Not 500. A number that’s well below the instruction counts at which the paper observed noticeable degradation.
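The 5–10 range can be checked mechanically. The following is a minimal sketch that counts the list items in a ticket and flags tickets over a budget. Treating only the Task and Acceptance Criteria sections as constraints, and using 10 as the default budget, are assumptions made for this example.

```python
def count_constraints(ticket_markdown: str,
                      sections=("Task", "Acceptance Criteria")) -> int:
    """Count '- ' list items under the given '## ' sections of a ticket."""
    count, active = 0, False
    for raw in ticket_markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # Only items below one of the selected headings are counted.
            active = line[3:].strip() in sections
        elif active and line.startswith("- "):
            count += 1
    return count

def over_budget(ticket_markdown: str, budget: int = 10) -> bool:
    """True if the ticket carries more constraints than the assumed budget."""
    return count_constraints(ticket_markdown) > budget

sample = """\
## Context
- Existing UserService in `src/services/user.service.ts`
## Task
- Add method `hasPermission(userId, permission)` to UserService
- Respect existing role hierarchy from `roles.ts`
- Cache permissions per request (not globally)
## Acceptance Criteria
- [ ] Method exists and is typed
- [ ] Unit tests for all role combinations
- [ ] Existing tests continue to pass
- [ ] No new logging pattern introduced
"""
print(count_constraints(sample), over_budget(sample))  # 7 False
```

A ticket that exceeds the budget is usually a sign that it should be split into two.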
Context switches (compiler errors, lint issues, failing tests) remain local within the epic. The agent fixes the bug and returns to the next ticket because the ticket list is the anchor. It’s there in black and white, independent of the conversation’s flow. No drift, no creeping forgetfulness.
And if a ticket within the epic doesn’t work? Then the bug is isolated. You don’t debug the entire feature, but a single, clearly defined task within a known context. This is a different way of working than “the agent broke something somewhere in 2,000 lines and I get to search for it.”
Phase 4: Continuously Consolidate Knowledge
The instruction file isn’t written just once and then forgotten. New insights constantly emerge during implementation: quirks of the test framework, implicit API conventions, patterns that only become visible when working in the code.
After every completed epic (and sometimes after individual tickets, if something unexpected has come up), I instruct the agent to update the instruction file and, in the case of Claude Code, the memory as well.
This is the moment when temporary knowledge becomes persistent knowledge. Has the agent discovered that a certain test framework has a quirk? That belongs in the instruction file. Did it encounter an undocumented API convention? Document it. Did it find a pattern that is consistently followed in the existing code but wasn’t documented anywhere? It is now.
Without this ongoing step, knowledge erodes with every new session. With it, an increasingly rich contextual foundation builds up over time, one that noticeably speeds up the onboarding for every new session.
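Whether the agent or a script performs the update, the operation itself is "append if new." Below is a minimal sketch of that step. It assumes the learnings live in a section at the end of the instruction file; the `## Learnings` heading, the dated bullet format, and the `record_learning` helper are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path

def record_learning(instruction_file: Path, learning: str,
                    heading: str = "## Learnings") -> bool:
    """Append a dated bullet to the end of the file, adding `heading` first
    if it is missing. Returns False if the learning is already recorded."""
    text = instruction_file.read_text(encoding="utf-8") if instruction_file.exists() else ""
    if learning in text:
        return False
    if heading not in text:
        separator = "\n\n" if text.strip() else ""
        text = text.rstrip("\n") + separator + heading + "\n"
    text = text.rstrip("\n") + f"\n- {date.today().isoformat()}: {learning}\n"
    instruction_file.write_text(text, encoding="utf-8")
    return True

# Usage with a throwaway file; the rule and the learning are made-up examples.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "AGENTS.md"
    path.write_text("# Project rules\n- Use the existing logger\n", encoding="utf-8")
    record_learning(path, "Integration tests need the database container running")
    record_learning(path, "Integration tests need the database container running")  # no-op
    print(path.read_text(encoding="utf-8"))
```

The duplicate check matters more than it looks: an instruction file that grows with repeated entries adds instruction load without adding knowledge.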
The ticket system is not bureaucracy. It is a pragmatic response to an empirically observed problem: models degrade under the weight of instructions. Fewer instructions per interaction, but more structure around them. For me, this has proven to be the most stable architecture for keeping frustration levels low and the quality of results high.
What Is Shifting
In classic software development, the individual was simultaneously architect, planner, and implementer. Agile attempted to merge these roles: everyone plans, everyone codes, everyone tests. That worked as long as the individual wrote everything themselves and learned in the process.
With coding agents, I observe that these roles are separating again. Not a return to the old waterfall model, but a new configuration:
People are shifting toward architecture and planning. They understand the system, make decisions, define tasks, and review results. They write less code in the sense of characters in a file. But they think more code than before, because designing the plan requires a deeper understanding than mere typing ever did.
The agent takes on more of the implementation. It writes the code, debugs the errors, and iterates against the pipeline. It is fast, precise, and tireless, as long as its context is right.
Tickets become the interface between the two. Not verbally, not in a chat history, not in the fleeting context of a prompt. In writing, structured, persistent.
I’ve used this workflow with Claude Code, with OpenAI Codex, and with GitHub Copilot CLI. It works with every tool. Not because the tools are similar (they aren’t), but because the workflow addresses the underlying problem they all share: the limited ability of language models to reliably follow many instructions simultaneously.
What I’ve Learned from This
When I distill my two years of experience with coding agents in existing codebases, five things made the difference:
1. Keep the density of instructions per interaction low. One ticket, one task, clear context. Not “implement the entire feature.” Less is more reliable. This isn’t just an opinion; it matches what the Distyl AI paper measured.
2. Make context persistent, not ephemeral. Instruction files (CLAUDE.md, AGENTS.md), memory files, project documentation. Everything the agent needs to know across sessions belongs in files, not in chat histories.
3. Plan before implementation, together with the agent. The planning session is not a waste of time. It is the phase in which humans and machines build a shared understanding. Without this understanding, every implementation is a guessing game.
4. Use tickets as a contract. Written, structured, with context and acceptance criteria. The ticket is the anchor to which the agent returns when a context switch distracts it.
5. Consolidate knowledge after every milestone. What the agent has learned must be captured in persistent files. Otherwise, every new session is a cold start.
Craftsmanship Means Mastering the Context
In my first post, I wrote that software craftsmanship is evident in the last 10%. In my second post, I asked who can still recognize this 10%. In my third post, I argued that AI-assisted development requires more planning, not less.
This post is the practical consequence: Craftsmanship in the era of coding agents means mastering the context. Not mastering the model—that gets better every few months. Rather, it’s the ability to structure one’s own knowledge in such a way that a machine can reliably implement it.
This is a new skill. One that didn’t exist three years ago. And one that’s harder than it looks.
For a long time, I hoped that the next generation of models would solve the context problem. Now I believe: It’s not worth waiting. The need for context management isn’t a bug in today’s models. It’s a feature of probabilistic systems. Even if the context window grows to ten million tokens and the models improve significantly once again: Following an arbitrarily long list of instructions simultaneously and without error is not something Transformer-based models have shown they can do. The solution lies more in the process than in the model.
This is less glamorous than “10x developer through AI.” But it’s what works in practice. At least in mine.
Even the best tool is useless if you give it the wrong drawing. And a drawing that shows everything at once isn’t a drawing; it’s noise. My path through the jungle was: small drawings, one sheet at a time. That’s the map I found.