Software development tools have changed fundamentally in less than two years. Whereas until 2024 a language model operated as autocomplete within the editor, today it sits as an autonomous agent in the command line. GitHub Copilot CLI, Claude Code, OpenAI Codex CLI and similar tools plan tasks, read codebases, write patches, run tests, create pull requests, process tickets and deploy containers. This shift is not incremental. It affects the job description of a developer, the relationship between the roles within a project team and the control architecture within which an organisation produces software.
The speed of this shift can be gauged from some data. Anthropic released Claude Code as a preview in February 2025, followed by general availability in May. OpenAI launched Codex CLI in April 2025 as an open-source terminal tool. GitHub announced Copilot CLI in September 2025 as a public preview, with full GitHub integration and plan mode. In February 2026, the tool incorporated the AGENTS.md configuration file on an equal footing with the older .github/copilot-instructions.md. Within a year, a category of experimental tools has become a category of standard tools in productive use. Medium and large IT organisations are currently rolling out the tools in waves, often faster than the methodology can mature within the teams.
Anyone in this position who purchases a tool and hands it over to their developers may find that individual tasks are completed more quickly, whilst at the same time the quality of project work declines. A study published this year has soberly documented that this finding is not pessimism, but a measurable effect. In their paper “A paradox of AI fluency”, Christopher Potts and Moritz Sudhof, both of Stanford University, analyse nearly 27,000 real AI dialogues from the WildChat-4.8M corpus, comprising 1,000 randomly selected English conversations per month from May 2023 to July 2025. They measure four dimensions: AI competence, interaction style, task complexity and error indicators. The most striking result concerns the visibility of errors. In dialogues with high AI competence, 64 per cent of error indicators appear, whereas in dialogues with low AI competence, the figure is only 24 per cent. Anyone reading this superficially might assume that the experienced users are the poorer performers. The second figure resolves this paradox. Among the experienced users, 59 per cent of errors are visible within the dialogue itself. Among the inexperienced users, the figure is 12 per cent. The remaining 88 per cent are incorporated into the result without anyone having initiated a correction loop.
The implication for development teams is clear. Those who use agentic coding without active control shift errors from ‘visible and correctable’ to ‘invisible and contained within the result’. Training that merely demonstrates how to use the tool reinforces the second approach. What teams need is a method that embeds the tool within a control architecture where errors become visible early on, a human checks every risky step, and the project status is not contained in chat history but in versioned artefacts.
It is precisely this method that I have developed over the past few weeks as a training programme. I call it the ‘Team Operating System for Agentic CLI’. It is structured into five areas of competence with learning units that build on one another and is supplemented by a validation framework designed to ensure repeatability for follow-up courses. My aim is not tool operation, but the development of methodology. What the method prescribes is a procedure that a development team must implement end-to-end if it wishes to survive in the age of agentic tools. The following sections describe the content of this method and situate it within the industry context.
Mental model as the first prerequisite
I begin my training with a clarification that remains unspoken in many organisations. Language models respond in natural language. They say “I think”, “I remember”, “I understand your code”. Those who work with them daily begin to treat them like colleagues with whom one coordinates, whom one trusts and on whose judgement one relies. This personification has consequences for how they are used. Users give vague instructions, hope the system will think for itself, and leave assumptions unspoken.
The corrected view is both technical and liberating. A language model is a statistical apparatus. It takes in a context, calculates probabilities for the next token, selects one, appends it, and starts again. It has no memory outside the context window. It understands nothing; it approximates patterns that were plausible during training. A query from the model sounds like collegial ignorance, but it is a probable continuation of the preceding text.
The consequence of this model is the central control parameter of its operation. The quality of the context determines the quality of the response. Anyone who treats the model as an approximator provides explicit, structured, complete context, because they know that no other control parameter is at work. Those who treat it as a colleague leave gaps in the context and receive answers that plausibly fill these gaps without closing them.
This clarification is not a philosophical preamble. It is the prerequisite for the following methodological elements to be effective at all.
Three pillars of a team method
On this basis, I have formulated three key elements. They are called the Artifact Contract, Review Gates and Tool Zones.
The Artifact Contract defines where the project state resides. My answer to this is clear. The chat is transit. The project state resides in versioned artefacts that people can review. Markdown files for context, decisions, plans, test outputs and runbooks. Jira for work breakdown, status and accountability. Confluence for approved specs and documentation. Pull Requests for diffs, tests and review decisions. Anything created during an agent session that is not transferred into one of these artefacts is considered not to have happened. This rule sounds restrictive. It is the condition ensuring that justifications, interim results and risks are not lost in the chat window.
The review gates add six verifiable transitions to the pipeline. Discovery concludes with the question of whether the sources are sound and whether gaps and contradictions are visible. The spec concludes with the question of whether the architectural and compliance decisions are sound. A ticket is not considered ready until someone has confirmed that it is small and testable, with acceptance criteria and non-goals. Before the code, a plan review checks the sequence, the test strategy, the risks and the affected files. A pull request must not be merged until the diff, tests and plan are comprehensible together. A production-related action remains blocked until the runbook, rollback and human approval are in place. Each of the six gates has a short, verifiable question and a clearly defined person responsible.
The tool zones organise MCP and CLI tools by risk. Read-only tools such as searching, reading, and status and log queries are freely available, with the obligation to document the result. Draft tools allow the agent to draft plans, tests, code drafts and PR texts, which a human reviews before implementation. Write tools such as writing in Jira, pushing a branch, or creating a PR require approval. Restricted tools include production deployment, secrets, data deletion and permission changes; these are exclusively human actions with their own gate, even if the agent could technically perform them.
Together, these three pillars shift the question a team asks about agentic coding. It is not about how fast the agent is. It is about how securely the team can work with it.
Five fields, one artefact chain
On this basis, I organise the learning path into five areas of competence: Operating, Configuring, Structuring, Implementing, and Managing. The order is not arbitrary. Each field reduces a different uncertainty, and each field concludes with an artefact that serves as input for the next field.
Operating is the entry point. Here, the team learns to run a CLI session, sanitise sessions, approve tools, and set initial guardrails. The resulting artefact is an operating model in the form of a file specifying the tool zones for their own team. Which MCP servers are integrated, which tools run without prompting, and which require human approval.
Documentation is the discipline of project research. Sources, glossaries, Confluence pages and code findings form a knowledge base in which gaps, contradictions and up-to-date status are visible. The agent here is a research tool, not a narrator. What it finds undergoes a source check. What it does not find is marked as a gap. The artefact is a documented inventory of sources that forms the basis for every subsequent decision.
Structuring translates knowledge into decisions. Findings become architectural decisions, regulations become a compliance matrix, and requirements become an epic with tickets that meet a definition of ready. The agent is free to design a great deal here. But every decision goes through a review gate before it is carried forward. The artefact is a project structure that is viable for implementation.
Implementation is the smallest phase and, at the same time, the most delicate. It ends with a pull request that fully implements a single, small ticket. Before the first line of code, there is a plan setting out the scope, non-goals, affected files, test strategy, risks and sequence. Only after the plan review does the agent begin writing. Tests are created before or alongside the code. The PR must be understandable without the chat history. The artefact is a proof of delivery where the plan, test output and review notes are all together.
Operations deals with what begins following delivery. Security as an operational rule, incident diagnosis, rollout decisions, apply gates, the team playbook. The final artefact is a document in which the team records its rules and makes them reproducible for the next team members. Anyone who takes shortcuts here loses the lessons learned from the preceding stages.
The result of this arrangement is a chain of artefacts. Without an operating model, there is no knowledge base. Without a knowledge base, no compliance matrix. Without a compliance matrix, no viable epic. Without an epic, no plan. Without a plan, no PR. Without a PR, no apply decision. A team that skips a field cannot honestly complete the next one.
Roles in the transformation
I deliberately do not separate the learning units by role, but by areas of expertise. Business architects, UX designers, developers, testers and infrastructure specialists sit in the same sessions. They use the same tool. They review each other’s work. The separation of responsibilities is defined by the artefacts, not by the tools.
The business architecture team focuses primarily on Discovery, Compliance and Epic integration. They review legacy documents, generate spec proposals, and create tickets in Jira. In the traditional model, this role was heavily document-driven and often the bottleneck of the project in terms of speed. With an agent capable of summarising Confluence content and proposing spec drafts, the effort shifts from text creation to evaluation. The role becomes that of an editor for the machine.
UX explores concepts, tests behavioural assumptions and documents decisions in Confluence. Here, the impact of the tools has so far been more limited. Visual concepts and prototypes continue to be created in specialised applications, but the accompanying documentation and the reasoning behind a decision can be accelerated by agents.
Development proceeds through exploration, planning, implementation and PR. I expressly emphasise the order. Before the code comes the plan, before the plan the question, before the question the code review. Anyone who allows the agent to write without a plan risks PRs that do not cover tests, break architectural patterns or expand the scope without notice. I have designed a lesson template for this phase in which precisely this plan is formalised as a reviewable Markdown artefact.
The tester generates test cases from the spec, has a browser test runner execute E2E tests in the browser via the Model Context Protocol, and writes the results back to Jira. This role is gaining in importance because agent-generated code, by definition, requires more tests, not fewer. The plan that precedes the code lists tests that verify the plan.
The infrastructure handles pipelines, deployments and monitoring. It provides a playbook that governs apply decisions, i.e. it specifies when an action may be approved and under what conditions it is blocked. The role thus approaches that of a platform engineer, who not only provides tools but also rules against which the agent can be checked.
Overall, the activities of each role are shifting. What remains is the responsibility for the artefacts at the end of the phase.
Toolchain and the Model Context Protocol
At the heart of the toolchain I describe in the method is the Agentic CLI with connected MCP servers. Five stations are grouped around it: Confluence for specs, Jira for tickets, Bitbucket for code and pull requests, Playwright for E2E tests, and the customer for feedback and acceptance. The connection between the agent and the stations runs via the Model Context Protocol, a specification proposed by Anthropic in 2024 that is now supported by several providers.
The significance of this image can be understood against the backdrop of the previous toolchain. Anyone who has set up continuous integration and continuous delivery over the past ten years is familiar with the friction between the tools. Requirements lived in one system, tickets in a second, code in a third, tests in a fourth, deployments in a fifth. A human acted as an intermediary between them. This mediation took time, information was lost, and it was never fully documented anywhere.
A pipeline in which an agent is connected via MCP servers to requirements, tickets, code, tests and deployment closes this gap for the first time at the tool level. A ticket from Jira becomes a task for the agent, the plan ends up as Markdown in the repository, the code goes into a PR, the tests run automatically, and the result flows back into the ticket. What was considered a broken toolchain can become an end-to-end process. But only under conditions that I explicitly set out in the training. MCP servers are not neutral. Write access gives the agent the ability to overwrite approved specs, close tickets, and merge PRs. The Tool Zones architecture is the answer to ensuring that these paths of action are not left open.
Repo Rules and Security
In the advanced section of the method, I address a question that is bigger than it seems. How does an agent adhere to the rules of a specific project? The answer is a file at the repository root, which has different names depending on the tool. AGENTS.md has established itself as the canonical choice. GitHub Copilot CLI reads this file on an equal footing with the older .github/copilot-instructions.md. In addition, there are fallbacks for CLAUDE.md, GEMINI.md, path-specific rules in .github/instructions/ and personal configurations under $HOME/.copilot/.
The principle is a ‘repo constitution’. Anything the agent must not guess – such as build commands, architectural decisions, local deviations, or tool limitations – belongs in the rules file. Anything better enforced by a linter or test does not belong there. Tutorials and marketing comparisons are left out.
One subtlety that I mention in the training is not present in every vendor’s documentation. If the same repository contains several parallel instruction files and they contradict each other, the resolution is not deterministic. The tool must choose one, and the decision may vary between two sessions. This leads to a hard rule: one truth per rule. The parallel configuration files must be kept consistent in terms of content.
The second critical issue is security. In the ‘Operation’ section, I treat tool agents as a risk when three factors coincide. Firstly, untrusted input, i.e. everything that flows into the agent from outside. Tickets, logs, web pages, MCP responses, skill files. Secondly, sensitive data, i.e. secrets, customer data, production logs, internal architectural details. Thirdly, effective tools, i.e. anything that can write, access the network or perform production-related actions.
The rule is a dividing line. Untrusted input is a finding, never a command. An instruction contained in a ticket is not automatically a task for the agent; it is input material. Sensitive data is subject to a need-to-know principle, meaning minimisation, masking or separate processing. Powerful tools require allowlists, a clear purpose and a human gatekeeper. I am explicitly stating this as an operational rule, not a security chapter. Anyone who pushes security to the end of the learning path has already lost it.
Industrial software production or a fast-paced workshop?
There is a claim circulating in the industry that I have deliberately included in the methodology section to put it up for discussion. With the end-to-end pipeline comprising Agentic CLI, MCP servers, Confluence, Jira, Bitbucket and Playwright, the claim goes, the industry is taking a giant step closer to the dream of industrial software production. The statement deserves scrutiny because it shapes the promise across the industry.
Industrial production thrives on two characteristics: repeatability under varying input conditions and the ability to verify every step. A screw coming off the production line can be traced back to its batch, material, machine and shift. If a batch is defective, it can be recalled. Software production has rarely come close to this model. It was once a craft, then a workflow involving continuous integration and code review. The chain from requirement through spec, code, test and deploy to customer feedback ran through tools that spoke different languages. People acted as interpreters between them.
The agent-based pipeline closes this translation gap at the tool level. This is a substantial change. Whether it leads to industrial-scale production depends on three caveats.
Firstly, verifiability. If the agent makes a decision in a chat that does not result in an artefact, the reasoning is lost. The pipeline is end-to-end; the audit is only so if the artefact contract is strictly adhered to. Secondly, repeatability. Language models are not deterministic. The same prompt with the same context does not necessarily generate the same plan. Industrial production thrives on precise repetition; agent-based production thrives on the range of acceptable variations. Thirdly, quality assurance. Where a machine inspects a screw, here a human inspects a plan or a diff. The inspection shifts; it does not disappear.
I address these three caveats structurally, not rhetorically. The Artifact Contract ensures verifiability. The review gates ensure the repeatability of the quality threshold, if not that of the outputs. The tool zones ensure that quality checking is not delegated to the tool. What emerges in the end is not industrial software production. It is reviewable software work with the speed of an agent and the auditability of a team. I consider this formulation to be more accurate and honest than the advertising slogan.
What happens if a team makes no changes
The situation is asymmetrical. Those who do not use tools such as Copilot CLI or Claude Code lose speed compared to competitors who do use them. Those who use them without establishing a method such as the one described lose auditability, a review culture and, in many regulated industries, the ability to comply. Both paths come at a cost.
My programme aims for the third path: using tools, establishing a methodology, requiring artefacts, and enforcing gates. I provide an approach that is reproducible within an organisation and that enables a team to document its process.
Three conditions underpin the success of this approach. Firstly, it requires managers who take plan reviews seriously and schedule time for them. A method that only exists during the training week is a waste of time. Secondly, tool configurations are needed that technically enforce the tool zones. A write connection to a production system without a human gate contradicts the method. Thirdly, iteration is needed. Models change, tools change, MCP servers are added. The team playbook is not a one-off document, but a living collection that must be revised at least quarterly.
Outlook
Over the next two years, agentic coding will shape the industry more fundamentally than agentic editor plugins did between 2022 and 2024. Providers such as GitHub, Anthropic, OpenAI and smaller specialist firms are competing over permission models, hooks, sub-agent systems and MCP extensions. The question of who has the best model is increasingly being replaced by the question of who builds the best operational model around the model. The study by Potts and Sudhof articulates the findings from the user’s perspective. With my programme, I articulate the implications from the team’s perspective.
Development teams that implement today what the five areas of expertise dictate create the conditions to grow alongside each new generation of tools without losing their review culture. Teams hoping that the tool will provide the methodology will find the 88 per cent figure for invisible errors in their own deliverables. The tools are there. The methodology is the prerequisite for turning them into project work.