The Agentic Engineering Pipeline for the SDLC

Agentic tools speed the work up. In brownfield they often produce errors that surface only late. This opening piece of a series introduces the agentic engineering pipeline for the SDLC: fixed roles, exploratory phases, state in versioned artifacts, a quality gate at every handover, and a second model opinion for important decisions.

Frank Csehan
May 30, 2026 · 26 min read

Most software projects start in a system that has been around for a while. It was designed years ago and has filled up with data whose origin nobody remembers exactly. On top of that come assumptions that are rarely written down the way they actually hold. Such grown systems, brownfield in industry jargon, are the real workplace of the vast majority of developers; hardly anyone gets to start on a greenfield site. This is where the biggest expectations are placed on agentic tools, and where they most often deliver plausible but wrong results.

Over the past months I have worked out an agentic engineering pipeline for the SDLC and tried it in workshops. It is an end-to-end way of working through the software development lifecycle with agents, tailored to exactly this setting. It describes how a team works with agents when it wants to bring a new feature into an existing system without damaging the brownfield. This text is the opening of a series and takes on the pipeline as a whole: why it looks the way it does, and which principles hold it up. The later pieces then go into the individual roles: business analysis, UX, development, testing, and operations.

Why brownfield is a problem of its own

In April 2026, Google Cloud’s DORA team published a report on the return on investment of AI-assisted software development, which I discussed in detail elsewhere (The ROI Depends on the Foundation). One figure from it matters most for this topic. In greenfield, in new systems without legacy, the report puts the productivity effect of AI-assisted development tools at 35 to 40 percent. In brownfield, in a system that has grown over time, it is around ten percent. That puts the factor between the two worlds at roughly three and a half to four. For organizations that earn their money maintaining existing software, that is the central business case.

A language model works only with the context it is given at the moment of the request, and that turns out differently in the two worlds. In greenfield it is small and manageable. In a legacy system it is large, scattered across many places, and a good part of it is written down nowhere. Part of this knowledge sits in a database whose column names come from an outdated domain vocabulary. A control flow encodes some long-forgotten regulatory requirement without noting it anywhere. The rest exists only in the head of the colleague who built the module years ago. An agent that does not have this context produces plausible code that is wrong exactly where no one can see it from the outside.

This economic finding is joined by a methodological one. Christopher Potts and Moritz Sudhof of Stanford University analyzed nearly 27,000 real AI dialogues in their paper “A paradox of AI fluency” (arXiv:2604.25905). Their most notable result concerns the visibility of errors. With practiced users, 59 percent of errors become visible in the dialogue itself; with inexperienced users, only 12 percent. For inexperienced users, then, 88 percent of errors stay invisible in the dialogue. The study does not measure what happens to those errors afterwards. For brownfield work, this finding alone is enough: where the dialogue triggers no correction, an error more easily travels into a production system. Letting agentic coding loose on a grown system without steering goes faster in the short term. The damage it does surfaces only weeks or months later.

The pipeline starts here. Its purpose is to make the invisible errors visible before they reach the grown system; that the agent types faster matters less.

The pipeline in one picture

A feature moves through six phases, and at the handovers between roles there is a checkpoint each time.

First, the context is made known. In a legacy system that means pulling out of the existing code, the tickets, the old documentation, and the data what is known about the affected area. This phase is read-only: nothing is changed.

Then the team explores together with the agent. Before it thinks about a solution, it asks questions and looks for contradictions.

The results are recorded as Markdown files in the repository: what was found, what stayed open, and where sources contradicted each other.

Out of this understanding a structure emerges: an epic that describes the goal, and tickets that each cover one to three hours of work, testable and traceable.

Before any code, there is an implementation plan that fixes the order, the test strategy, the risks, and the affected files.

Finally, the agent works through the plan, ticket by ticket, and documents its progress in the plan, the epic, and the tickets themselves.

The flow is familiar. Anyone who knows a classic development process will recognize the stations. What is new lies in the rules that hold between the phases.

Brownfield has to be made readable first

In greenfield you can start with phase two, because there is nothing you would have to read first. In brownfield the first phase is the most important and at the same time the one most often skipped. It is called context extraction, and it answers the question of what the system actually does today at the place where we want to change it.

The answer sits in several places, and a good pass reads them all. The code shows what happens, but not why. The why of a decision is in the version history. The tickets reveal which requirement once stood behind it, often in language that has since gone stale. And the data itself records what the theory keeps quiet: that a field which according to the documentation should always be filled is empty in a substantial share of the legacy records, because over the years the staff made do with free-text notes. A new mandatory field introduced without this finding leaves a large part of the data set failing the rule.
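The data finding described above can be checked with very little code before a mandatory field is proposed. The following is a minimal sketch, assuming the legacy records are available as dictionaries; the field name `cause` and the claim identifiers are hypothetical placeholders, not part of any real schema.

```python
from typing import Iterable, Mapping, Optional


def empty_share(records: Iterable[Mapping[str, Optional[str]]], field: str) -> float:
    """Return the fraction of records whose `field` is missing, None, or blank."""
    total = 0
    empty = 0
    for record in records:
        total += 1
        value = record.get(field)
        if value is None or not str(value).strip():
            empty += 1
    return empty / total if total else 0.0


# Hypothetical legacy records; identifiers and field names are placeholders.
legacy_records = [
    {"claim_id": "C-001", "cause": "water damage"},
    {"claim_id": "C-002", "cause": ""},
    {"claim_id": "C-003", "cause": None},
    {"claim_id": "C-004"},  # field absent altogether
]

share = empty_share(legacy_records, "cause")
print(f"{share:.0%} of the legacy records have no 'cause'")
```

A number like this belongs in the context artifact with the date of the query, because it describes the data set on that day only.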

One special case of context extraction occurs in a legacy system almost every time and is almost never documented: terms that live in more than one system. A term like “partner” or “case” can mean something different in the system we are changing than in the neighboring system it exchanges data with. As long as this relationship is named nowhere, it is a risk for every migration and every coexistence of two systems. The pipeline therefore requires that a term which exists across system boundaries gets its counterpart in the other system recorded in the ticket’s glossary, even when the current feature has no cross-system effect at all. Where no counterpart is known, there is a dash and a note as an open question. That turns an invisible assumption into a clarified fact or at least a visible gap. Most legacy systems, by contrast, live with the quiet ambiguity.
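One way to make the glossary rule concrete is a small record per term. This is a sketch under my own assumptions: the class name, the row layout, and the example terms are illustrative and not a prescribed format.

```python
from dataclasses import dataclass
from typing import List, Optional

UNKNOWN = "-"  # the dash used when no counterpart is known


@dataclass(frozen=True)
class GlossaryEntry:
    term: str
    meaning_here: str
    neighbor_system: str
    counterpart: Optional[str] = None  # None means: not yet clarified

    def row(self) -> str:
        """Render one glossary line; an unknown counterpart becomes a visible gap."""
        counterpart = self.counterpart or UNKNOWN
        note = "" if self.counterpart else " (open question: counterpart unknown)"
        return f"{self.term} | {self.meaning_here} | {self.neighbor_system}: {counterpart}{note}"


def open_questions(entries: List[GlossaryEntry]) -> List[str]:
    """Terms whose counterpart in the neighboring system is still unknown."""
    return [entry.term for entry in entries if not entry.counterpart]


# Hypothetical terms and systems, for illustration only.
entries = [
    GlossaryEntry("partner", "contracting party of a policy", "CRM", "account"),
    GlossaryEntry("case", "one claim file", "CRM"),
]

for entry in entries:
    print(entry.row())
```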

This first phase produces no solution yet, and it is not supposed to. Its artifact is a picture of the current state, evidenced and dated, with clearly marked gaps. Only on this picture can one responsibly decide what the new feature should do.

First principle: people stay in roles

A common story promises that the agent replaces the whole team: one person, one language model, and everything from requirement to deployment runs through the same chat. It sounds tempting because it is cheap, yet it does not hold up to a closer look.

Each role stands for its own view of the same feature. A business analyst judges differently than a developer, a tester looks at the same feature differently than a UX designer, and operations asks questions that no one in development asks. Through these differences a team finds the errors that a single perspective misses. Bundle all roles into one person and equip that person with an agent that can imitate every role, and you get only the imitation of those perspectives. The real friction between people who judge differently out of different experience does not arise that way.

In a grown system a second reason comes in. The constraints of a brownfield system are often experiential knowledge that is nowhere in the code. An experienced architect recognizes at first glance that a spot is delicate, because she has already seen, on a similar system, what can go wrong there. An agent does not have that experience. It works with training patterns, and those are an average of the world in which the specifics of this one system disappear. The overview of the grown whole, the feel for which change in one place sets off something three places further along, stays the human’s domain.

Beyond quality, there is the law. The EU AI Act, Regulation (EU) 2024/1689, provides for human oversight in particular for high-risk AI systems. For an agentic pipeline this does not imply a blanket legal duty to approve every single output. As a governance rule it still makes sense: every meaningful output needs human approval before it is accepted. That way, in the end, a person with a name stands behind a release. A chat log that no one can reconstruct does not.

The pipeline therefore has fixed roles, each with its own brief. Business analysis clarifies what is to be built and on what factual basis. UX describes how the feature feels for its users and which states it has to know. Development translates the domain structure into code, testing checks whether the delivered behavior matches the promise, and operations decides whether and how a change may take effect on the real infrastructure. All of them use the same agent, only differently, with their own questions and their own boundaries.

How differently the roles look at the same feature shows in a concrete case. A mandatory entry is to be introduced, and each role puts its own question to it.

Each of these questions knows a different way the feature can fail, and none is superfluous. The series treats each role on its own. The basic decision stays the same in every phase: each role belongs to a human, and the agent only strengthens it.

Responsibility hangs on the role. An agent speeds up the work of a role, but it bears no liability and does not know the specific system from its own observation. When a team cuts a role, the place where a human would otherwise have pushed back falls away with it.

Second principle: every phase explores before it commits

The biggest temptation in agentic work is the early jump to the solution. You type a task, the agent answers immediately with code, and it feels like progress. In brownfield this jump is dangerous, because the solution rests on an understanding that no one has checked.

The pipeline reverses the order, so that every phase begins with exploration. In business analysis, that means building a source inventory: which tickets, Confluence pages, architecture decision records, runbooks, and old documents say something about the affected area? Each source gets a status with a concrete date on which the statement held. Vague labels like “current” or “latest version” do not suffice, because a Confluence page that no one has touched for three years looks just as fresh in the system as one from yesterday, and in a grown system that is a risk.
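The rule that a status needs a concrete date can be enforced when the inventory is written. A minimal sketch, assuming dates are recorded as ISO strings; the `Source` fields and the example entry are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Source:
    name: str
    kind: str          # e.g. "ticket", "Confluence page", "ADR", "runbook"
    valid_as_of: date  # the day on which the statement was known to hold


def parse_valid_as_of(raw: str) -> date:
    """Accept an ISO date such as 2023-03-14; reject vague labels like 'current'."""
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        raise ValueError(f"status needs a concrete date, got {raw!r}") from None


# Hypothetical inventory entry.
page = Source("Claims intake process", "Confluence page", parse_valid_as_of("2023-03-14"))
print(f"{page.name} ({page.kind}), valid as of {page.valid_as_of.isoformat()}")
```

Calling `parse_valid_as_of("current")` raises a `ValueError`, which is the point: the vague label never reaches the artifact.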

After the inventory comes the actual analytical work. Each statement is labeled: as an evidenced fact, as a plausible but unevidenced assumption, as an open question, or as a contradiction between two sources that both seem to be right. These labels keep visible the risky places that a smooth summary would blur.
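The four labels map naturally onto an enumeration. The sketch below is one possible encoding, with names of my own choosing; it also flags a "fact" that names no source, since a fact without evidence is really an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Label(Enum):
    FACT = "evidenced fact"
    ASSUMPTION = "plausible but unevidenced assumption"
    OPEN_QUESTION = "open question"
    CONTRADICTION = "contradiction between sources"


@dataclass(frozen=True)
class Statement:
    text: str
    label: Label
    sources: Tuple[str, ...] = ()


def risky(statements: List[Statement]) -> List[Statement]:
    """Everything that is not an evidenced fact stays visible for review."""
    return [s for s in statements if s.label is not Label.FACT]


def unevidenced_facts(statements: List[Statement]) -> List[Statement]:
    """Statements labeled as fact that name no source at all."""
    return [s for s in statements if s.label is Label.FACT and not s.sources]


# Hypothetical findings.
findings = [
    Statement("Field 'cause' is optional in the database", Label.FACT, ("schema export",)),
    Statement("Caseworkers use free-text notes instead", Label.ASSUMPTION),
    Statement("Docs say mandatory, data says optional", Label.CONTRADICTION,
              ("old manual", "schema export")),
    Statement("The neighboring system ignores the field", Label.FACT),
]

for statement in risky(findings):
    print(f"[{statement.label.value}] {statement.text}")
```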

This stance holds not only in analysis. Development reads its way into the code before it writes; testing works through the acceptance criteria before it designs scenarios; operations forms a hypothesis about the cause before it proposes a change. The reading phase is anchored in the tooling: search and read operations are freely allowed, writing and acting operations need approval. The gradient is intended. Reading should be possible at any time, a changing intervention a deliberate, approved decision.
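How such a gradient is configured depends on the agent tooling in use. As a tool-independent sketch, the rule reduces to an allowlist for reading and an approval requirement for everything else; the operation names here are hypothetical.

```python
from typing import Optional

# Hypothetical operation names; a real agent tool defines its own.
READ_OPERATIONS = frozenset({"search", "read_file", "list_directory"})


def is_allowed(operation: str, approved_by: Optional[str] = None) -> bool:
    """Reading is always allowed; anything else needs a named human approval.

    Unknown operations are treated like writing operations (deny by default).
    """
    if operation in READ_OPERATIONS:
        return True
    return bool(approved_by)


print(is_allowed("read_file"))                       # reading needs no approval
print(is_allowed("write_file"))                      # blocked without approval
print(is_allowed("write_file", approved_by="DEV-01"))
```

Treating unknown operations as writing is a deliberate choice in this sketch: a new tool capability should start out gated, not open.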

Exploration is also iterative. A phase rarely delivers a reliable picture on the first pass. You find a contradiction, go back into the sources, clear it up, and find a new open question along the way. This back and forth is part of it; that is how understanding emerges in brownfield in the first place. The pipeline allows these loops because it records their results in artifacts, where they remain available after a session ends.

Third principle: state belongs in artifacts

A language model has no memory. It works only with the context it is given in this one moment. What stood higher up in the chat log is still effective only because the client sends it along again on every call, and that log can lose effect as it grows longer. The effect is discussed as context rot and can set in before the technical limit of the context window is reached. If the project state is kept in the chat alone, it is easily lost again.

The pipeline therefore keeps the state outside the chat, in versioned artifacts. I call this rule the artifact contract: what arises during an agent session and is not transferred into a durable artifact counts as not having happened. The artifacts are Markdown files in the repository for context, findings, plans, and test output; tickets for scoping, status, and accountability; approved specification pages for decisions; and pull requests for diff, tests, and review decision.

This has three consequences that together make the value. A new session can build on the artifacts without knowing the earlier discussion, which matters more in brownfield than in a new build, because brownfield work rarely finishes in a single day and several people often share it. An epic, a plan, a finding can be read and assessed by a human before any code comes of it. And the progress log documents what was decided when and why; that trail supports the evidence and accountability obligations in regulated contexts.

One artifact arises in a legacy system at no extra cost and stays reliable: the version history. What stands in the repository binds more strongly than any memory. A commit message records the why of a change, the provenance of a line can be traced, and the state before a commit can be compared with the one after. This information arises while working, on its own, and does not go stale, unlike a separately maintained notes file that no one updates when the code changes. An agent that reads the history of the affected file before a change stands on firmer ground than one that guesses. The pipeline therefore requires Git to be used as the binding source of truth.
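Reading the history of a file before changing it takes one Git call. The following sketch wraps `git log --follow`, which traces a single file across renames; the helper names and the output format are my own choices.

```python
import subprocess
from typing import List


def history_command(path: str, limit: int = 10) -> List[str]:
    """Build a `git log` call that follows one file across renames."""
    return [
        "git", "log", "--follow", f"--max-count={limit}",
        "--date=short", "--format=%h %ad %an: %s", "--", path,
    ]


def read_history(path: str, limit: int = 10, repo: str = ".") -> List[str]:
    """Return one line per commit: short hash, date, author, subject."""
    result = subprocess.run(
        history_command(path, limit),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Show the command that would run for a hypothetical file path.
print(" ".join(history_command("src/claims/validation.py", limit=5)))
```

For the provenance of individual lines, `git blame` on the same path is the natural companion. An agent can run both as pure read operations before it proposes any change.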

The compliance framework the pipeline orients itself by pushes this artifact duty further. Every AI-generated artifact is marked as AI-generated, with a traceable reference to the operation, the person, and the model that produced it. That is the EU AI Act’s transparency requirement in daily practice: a concrete note on each individual piece of generated content. Whoever later wants to know whether a specification was written by a person or proposed by a model and merely waved through finds the answer in the artifact.
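What such a note looks like is a team decision; the regulation does not prescribe a format. A minimal sketch of one possible marker, with a layout, role identifier, and tier name that are purely illustrative:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    operation: str      # what was asked of the model
    person: str         # who triggered and reviewed it (role ID, not a real name)
    model: str          # which model or tier produced the content
    generated_on: date

    def note(self) -> str:
        """Render a one-line marker to place at the top of the artifact."""
        return (
            f"AI-generated | operation: {self.operation} | person: {self.person} "
            f"| model: {self.model} | date: {self.generated_on.isoformat()}"
        )


# Hypothetical values throughout.
marker = Provenance("summarize source inventory", "BA-01", "tier-standard", date(2026, 5, 30))
print(marker.note())
```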

Fourth principle: every role handover has a quality gate with mandatory artifacts

The core of the pipeline lies in the handovers. Between two roles the work does not simply change owner. A checkpoint stands in between, a quality gate, and this gate has two parts: a short, answerable question and an artifact that evidences the answer. Without the artifact the question is not answerable, and without an affirmed answer the next phase does not begin.
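The two-part structure of a gate can be written down directly. This is a sketch, not a workflow engine: the class name, the file name, and the role identifier are assumptions for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Gate:
    question: str                      # short and answerable with yes or no
    artifact: Path                     # the file that evidences the answer
    affirmed_by: Optional[str] = None  # the human who answered yes

    def may_proceed(self) -> bool:
        """The next phase starts only with an existing artifact AND a human yes."""
        return self.artifact.is_file() and bool(self.affirmed_by)


with tempfile.TemporaryDirectory() as tmp:
    inventory = Path(tmp) / "source-inventory.md"
    gate = Gate("Are the sources solid, with gaps and contradictions visible?", inventory)

    without_artifact = gate.may_proceed()   # nothing written yet
    inventory.write_text("# Source inventory\n", encoding="utf-8")
    without_answer = gate.may_proceed()     # artifact exists, nobody affirmed
    gate.affirmed_by = "BA-01"
    with_both = gate.may_proceed()

print(without_artifact, without_answer, with_both)
```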

The reasoning behind this is the shift in error costs. An error found in analysis costs a correction in text. The same error that surfaces only in operations costs a correction in the running system, plus the search for the cause and the restoration. The riskier and more consequential a step is, the earlier it needs evidence, a plan, and a human decision; the gates are where that decision is made.

At the end of discovery stands the question of whether the sources are solid. Are gaps and contradictions made visible, or were they smoothed over? The mandatory artifact is the labeled source inventory. Only once a human confirms that the knowledge base is evidenced and honest does structuring begin.

Before a ticket goes into implementation, a Definition of Ready checks whether it is small and testable, with acceptance criteria and explicit non-goals. This check is especially strict in a legacy system. It requires, for example, that a ticket’s outcome be phrased from the user’s point of view and not the system’s. “The caseworker can save a claim even without complete details, but receives a business message” is a testable outcome. “The system validates the claim field” is not, because it cannot be tested from the user’s point of view and shifts responsibility for success from the business side to engineering. A second rule of the same gate: no technical specification in the requirements ticket. Architecture, data model, framework choice, column names do not appear in the business ticket, because a technical specification at this point makes the later feasibility check worthless and is the most common reason development hands a ticket back.
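The structural part of a Definition of Ready can be checked mechanically. The sketch below covers only that part, under my own field names; whether an outcome is phrased from the user's point of view and free of technical specification still needs a human reader.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ticket:
    outcome: str
    acceptance_criteria: List[str] = field(default_factory=list)
    non_goals: List[str] = field(default_factory=list)
    estimate_hours: float = 0.0


def readiness_problems(ticket: Ticket) -> List[str]:
    """Return the structural reasons why a ticket is not ready; empty means ready."""
    problems = []
    if not ticket.outcome.strip():
        problems.append("no outcome stated")
    if not ticket.acceptance_criteria:
        problems.append("no acceptance criteria")
    if not ticket.non_goals:
        problems.append("no explicit non-goals")
    if not 1 <= ticket.estimate_hours <= 3:
        problems.append("estimate outside one to three hours")
    return problems


ready = Ticket(
    outcome="The caseworker can save a claim even without complete details, "
            "but receives a business message",
    acceptance_criteria=["Saving succeeds with an empty cause", "A message is shown"],
    non_goals=["No migration of existing claims"],
    estimate_hours=2,
)
vague = Ticket(outcome="The system validates the claim field", estimate_hours=8)

print(readiness_problems(ready))
print(readiness_problems(vague))
```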

Before the first code there is a plan review. It checks the order of the work, the test strategy, the risks, and the affected files. Only after this review does the agent begin to write. That is extra effort. But a plan reviewed before implementation catches the errors an agent would otherwise build into the code over twenty tool calls, before anyone notices that the direction was wrong.

The test gate asks whether the tests cover the acceptance, whether a deliberate test gap is documented, and whether the pull request can be assessed with a test record. What counts as evidence here is a result: a browser test that runs the behavior, a trace, a screenshot. The agent claiming something is not enough in the gate; it has to be shown to work.

The strictest gate is the apply gate in operations. It separates diagnosis from effect. The agent may prepare a pipeline change, an infrastructure-as-code adjustment, or a monitoring rule as a pull request. But approving a pull request is not approving the apply. Rollback, time window, responsible person, observation, and release are decided before any effect, and the start on the real infrastructure stays a human action, even where the agent could technically perform it. The gate’s question is: is the effect reversible, is there a rollback and observation, and is the apply separated from the code review?
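The separation between review and apply can be kept explicit in the data that accompanies a change. A sketch with hypothetical field names, in which an approved pull request alone never satisfies the gate:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ApplyRequest:
    pull_request_approved: bool
    rollback_plan: str = ""
    time_window: str = ""
    responsible_person: str = ""
    observation_plan: str = ""
    apply_released_by: Optional[str] = None  # separate human release, not the PR review


def missing_for_apply(request: ApplyRequest) -> List[str]:
    """List what is still missing before a human may start the apply."""
    missing = []
    if not request.pull_request_approved:
        missing.append("pull request approval")
    for name in ("rollback_plan", "time_window", "responsible_person", "observation_plan"):
        if not getattr(request, name).strip():
            missing.append(name)
    if not request.apply_released_by:
        missing.append("apply release by a human")
    return missing


# An approved pull request with nothing else decided.
reviewed_only = ApplyRequest(pull_request_approved=True)
print(missing_for_apply(reviewed_only))
```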

Beneath these human gates lies a machine layer that no one has to go through by hand. In a mature setup, automatic checkpoints examine every pull request before a human even sees it: an intake check, a policy check, a static security analysis, a prompt-injection detection, a red-team run, a sandbox execution, and a final gate with a report. These checkpoints run as deterministic scripts, and they run deliberately before the expensive, non-deterministic AI stations. That way no model spends compute on obvious garbage, and the gating becomes provable, because a script delivers a traceable result while a model answer does not guarantee one. The human gates remain responsible for the questions that call for judgment. Everything that can be checked automatically is handled beforehand by the scripts, and only the two layers together make the control that is needed.
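The ordering idea, cheap deterministic checks first and a stop at the first failure, fits in a few lines. The three checks below are toy stand-ins that I made up for illustration; they do not replace a real policy engine or static security analysis.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]


def run_checkpoints(diff: str, checks: List[Tuple[str, Check]]) -> Tuple[bool, List[str]]:
    """Run deterministic checks in order and stop at the first failure.

    The report is reproducible: the same diff always yields the same lines.
    """
    report = []
    for name, check in checks:
        passed = check(diff)
        report.append(f"{name}: {'pass' if passed else 'fail'}")
        if not passed:
            return False, report
    return True, report


# Toy stand-ins for intake, policy, and security checks.
checks: List[Tuple[str, Check]] = [
    ("intake", lambda diff: bool(diff.strip())),
    ("policy: size limit", lambda diff: len(diff.splitlines()) <= 500),
    ("security: no key material", lambda diff: "BEGIN PRIVATE KEY" not in diff),
]

passed, report = run_checkpoints("+ cause = form.get('cause')\n", checks)
print(passed, report)
```

Only a diff that passes this layer would be handed to the model-based stations, so no model call is spent on input a script can already reject.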

A quality gate that asks for no mandatory artifact comes down to plain trust. With a checkable artifact, the handover becomes something a human can assess before the next phase builds on it. In brownfield that is what decides whether the added speed helps or does damage.

Fifth principle: no decision rests on a single model answer

A language model is a statistical apparatus. It reproduces what occurred often in its training data, and its tone stays just as confident whether or not that holds in the concrete case. A single model answer is therefore only a hypothesis at first. In brownfield, where the specific counts and the average of the training world is often beside the point, this distinction weighs heavily.

The pipeline meets this with a simple discipline: for every important decision, a second and, where it matters, a third opinion is obtained, and from a different model. Different models are trained on different data, with their own strengths and blind spots. When two independent models reach the same finding, reliability rises. When they diverge, the human knows the matter is not clear-cut and someone has to look more closely. That second case helps too.
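Comparing opinions can start as plain set arithmetic over the findings each model reports. This sketch assumes the findings have already been normalized to comparable strings, which in practice is the hard part and often needs a human; the model names and findings are placeholders.

```python
from typing import Dict, Set


def compare_opinions(findings_by_model: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split findings into those all models share and those only some report."""
    all_sets = list(findings_by_model.values())
    if not all_sets:
        return {"agreed": set(), "needs_human_review": set()}
    agreed = set.intersection(*all_sets)
    raised = set.union(*all_sets)
    return {"agreed": agreed, "needs_human_review": raised - agreed}


# Hypothetical findings from two different models on the same patch.
opinions = {
    "model_a": {"legacy records lack 'cause'", "missing null check in save"},
    "model_b": {"legacy records lack 'cause'", "date format differs between systems"},
}

result = compare_opinions(opinions)
print("agreed:", sorted(result["agreed"]))
print("look closer:", sorted(result["needs_human_review"]))
```

Agreement raises confidence but proves nothing on its own; divergence is the signal that routes the question to a person.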

In analysis, the same source can be summarized by several models and the results compared, to see where they differ in fidelity to the source, accuracy, and assumptions. In code review, a patch is checked not only by the model that wrote it, but also by a second one that does not know it and is therefore not in love with its own logic. In verification, a model can be set explicitly to attack and refute a finding; what survives a serious attempt at refutation stands on firmer ground afterwards.

Behind this stands model governance. In a mature setup the choice of model is not left to chance: models are addressed through abstract tiers, from cheap to top class, backed by budget profiles and routing rules; the brand names stay out of it. Whoever wants to use an expensive tier has to justify it, and for some decisions a human approval is added. That keeps the second or third opinion a targeted measure for the cases where the risk justifies it, rather than a silent cost driver. A trivial reformat does not need three models; an architectural decision in the brownfield context is a different matter.

The second look does not only raise accuracy; it also hardens the system against a specific weakness of agentic setups. When the context a model works with is manipulable, for example through a prompt injection in a file it has read, then a second model with different behavior is an additional hurdle for the attack. A defense that checks on several levels is harder to defeat than a single instance that you outwit once.

Sixth principle: compliance belongs in every phase

In many projects, compliance comes last. First the thing is built, then it is checked whether what was built meets the rules, and at the end the notes are added. In agentic work this order no longer holds, and that is because of the speed. When an agent produces in minutes what took a human days, a downstream check is overrun long before it can take hold. Compliance has to sit where the work passes through anyway, that is, in the phases and gates of the pipeline itself.

The regulatory frame for German and European teams covers at least the EU AI Act and the General Data Protection Regulation. The EU AI Act, Regulation (EU) 2024/1689, follows a risk-based approach. Depending on the use case and risk class, duties arise around transparency, technical documentation, logging, human oversight, and cybersecurity, in particular for high-risk AI systems and certain generative AI applications. The General Data Protection Regulation, Regulation (EU) 2016/679, governs the processing of personal data and requires, among other things, purpose limitation and data minimization. Beyond that, teams have to account for internal security requirements and applicable AI-security frameworks; for generative AI these typically concern threat modeling, data classification, model security, prompt protection, supply chain, and incident response.

Translated into the pipeline, this yields concrete rules anchored in the phases. At the root stands data minimization: in development, personal data does not belong in the artifacts. Sample data in mandatory artifacts uses only placeholders from a sample schema, so no real personal names, companies, or case numbers. Workshop material, training exports, and pull requests quickly end up in more public contexts, such as a shared page, a mirror on a code platform, or a presentation. Anonymizing from the start spares the later search, in which one hit usually slips through anyway.

Accessibility is easily forgotten in a legacy system. In a mature setup the pipeline orients itself by the Web Content Accessibility Guidelines at level AA and by the European standard EN 301 549, as far as they apply to the use case. A UX state that captures an error case also describes how that error becomes perceptible to a person using a screen reader. For certain products and services, such requirements have been binding in the EU since 28 June 2025, in Germany through the Barrierefreiheitsstärkungsgesetz, the Accessibility Strengthening Act. For public-sector bodies, accessibility requirements have applied for longer.

Part of traceability is that every statement in an artifact names its source with a date, every AI-generated piece its producer note, and that token consumption stays traceable. This trail supports the audit trail that regulation, accountability obligations, or internal governance may require depending on the use case. In the pipeline it arises as a by-product of the artifact contract that is required anyway.

For brownfield systems there is the protection of existing data against forced migration, which I consider one of the most important rules of all. A new feature does not automatically migrate or validate existing data. New obligations apply to records newly created or actively edited; the existing data stays readable and storable. Without this line, a data migration gets dragged into a feature ticket as a side condition. But a data migration is a project of its own, with its own stakeholders, risks, and time windows; treated as a side point, it distorts the scope and blocks the handover to development. Often it is exactly this rule that decides how long a feature takes in brownfield.
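In code, this line shows up as a condition in front of the new rule. A minimal sketch, assuming a claim record with a newly mandatory `cause` field; the names are hypothetical, and a real system would derive the two flags from its own change tracking.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    claim_id: str
    cause: Optional[str] = None


def validate_for_save(claim: Claim, newly_created: bool, actively_edited: bool) -> List[str]:
    """Apply the new mandatory-field rule only to new or actively edited records."""
    if not (newly_created or actively_edited):
        return []  # existing data stays readable and storable as it is
    if not (claim.cause or "").strip():
        return ["cause is required for new or edited claims"]
    return []


legacy = Claim("C-002")  # created before the rule existed, cause never filled

print(validate_for_save(legacy, newly_created=False, actively_edited=False))
print(validate_for_save(legacy, newly_created=False, actively_edited=True))
```

The first call returns no errors, so the old record passes untouched; the second reports the missing field once someone actively edits it.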

How the principles work together

The six principles are not a list from which you pick the comfortable ones. They mesh, and they do so in a particular order.

A handover between two people can only take place on an object both of them read, so the roles need the artifacts. An artifact without a checkpoint is just a file that no one has bindingly assessed, which is why the artifacts presuppose the gates. A gate, in turn, can only check what an honest preliminary investigation delivered in findings, gaps, and contradictions. The second and third opinion sharpens the human decision at the roles without replacing it. And without these five, compliance has no material to act on.

The result is a chain of artifacts in which each link depends on the previous one. From a labeled source inventory comes a solid epic, from the epic tickets that pass a Definition of Ready, from the tickets a reviewable plan, from the plan a pull request that survives the test, and only from that an apply decision that can be taken responsibly. A team that skips a link cannot honestly close the next one. You can pretend to, but the omission takes its revenge where the system is stressed most, and in a grown system that is almost always production.

A common objection: doesn’t this sound like a return to the waterfall, a heavyweight process, the thing the agile movement wanted to overcome? It is not that simple. The pipeline has phases and handovers, but it is not rigid. Within each phase the work is iterative, with loops and self-corrections. The tickets are small, one to three hours, and flow through implementation one at a time. What sets it apart from the classic waterfall is that a few checkable artifacts are binding at the right places; the volume of documents does not matter. I have treated elsewhere which agile assumptions still hold under agentic tools and which lose their steering function. Here it is enough that the method provides for just as much iteration as before, only better evidenced.

What this series treats next

This text has described the pipeline as a whole and its principles. It has deliberately stayed at the level of the approach and only touched on the individual roles. Each role offers enough substance for its own piece, and in practice the value lies in the detail.

The following pieces therefore go into depth. Business analysis shows how a source inventory becomes an evidenced knowledge base, how labeling statements works, and how you recognize that a discovery is ready for structuring. UX takes on how a feature is described in one’s own words, how a team template with acceptance criteria and states emerges, and why the description of the error cases often reveals more than that of the happy paths. Development takes apart the cascade of research, epic, tickets, plan, and the final implementation loop, and shows how an agent can work autonomously across many steps without losing control. Testing deals with how acceptance criteria become scenarios and how a browser-based piece of evidence replaces a claim. And operations treats the separation of diagnosis and effect, the apply gate, and the question of what an agent may even prepare on real infrastructure and where the line lies.

Whoever wants to try the pipeline before the detailed pieces appear can start with a single rule that pulls all the others along with it: keep the state in versioned artifacts. Write the source inventory, the epic, the plan, and the test output into the repository. Once these artifacts exist, the question of the gates almost answers itself, because an artifact asks to be checked before the next one builds on it. And once the gates exist, the question of the roles follows, because a check needs someone to do the checking.

Shortcuts do not work in brownfield; these systems are too old and too tightly entangled with everything else. The confidence of a single model in a fleeting chat is not something to rely on here. The roughly ten percent productivity gain that the DORA report names for brownfield work is no law of nature; it describes the order of magnitude of a calculation under particular assumptions. With the right pipeline that number can be moved. How far, I do not know, and I will refrain from naming a figure I cannot support. What I take away from the workshops is the shift of the work from fast typing toward the handovers, where a human checks what the agent has put forward. Whether that changes anything in your own system shows only when you try it on a real feature.
