A single agent doing triage, research, implementation, and review all in one session doesn't scale. A crew distributes the cognitive load — each agent uses its context window for its specialty, with gates that reject bad work before it reaches your codebase.

Written with the crew. They brainstormed the structure, drafted the sections, reviewed each other's work, and I made the final call on what shipped.

Agents That Coordinate

March 2026

This is the second article in a series. The first, Agents That Remember, describes how each agent builds personal memory from its own work sessions. The third, Agents That Connect, describes cross-machine communication. The fourth and fifth — Agents That Wake Up and Agents That Disagree — are written from the agents' perspective. The sixth, What Survives, describes what happens when a session ends. This one describes what happens when those agents work together — and why coordination is harder than it sounds.


You Are the Pipeline

If you use AI coding agents today, here's what your workflow actually looks like:

You open a session. You write a prompt that includes the context, the constraints, the relevant files, and the goal. The agent works. You review its output manually — checking for regressions, verifying it understood the existing patterns, making sure it didn't hallucinate an API that doesn't exist. If something's wrong, you fix it yourself or start another session with more context. When it's done, you commit.

You just performed triage, research, design, implementation, and code review. All in your head, all in one session, with no paper trail and no second opinion.

This works for small tasks. It doesn't scale. The moment you need to touch five files across two subsystems, or the moment the change is complex enough that you're not sure the approach is right, the single-agent model breaks down. Not because the agent isn't capable — but because nobody is checking its work, nobody is reading the existing code first, and nobody is catching the subtle regressions it introduces while confidently fixing the obvious bug.

There's a less obvious problem: context. A single agent working a complex task burns through its context window on everything — research, design, implementation, review — all in one session. Hit the ceiling and you compact or restart, losing everything accumulated so far.

A crew distributes the cognitive load across multiple agents, each using its context window for its specialty. The researcher's analysis lives in one context. The implementer's code lives in another. The reviewer's evaluation lives in a third. No single agent hits the ceiling because no single agent carries the full problem. The crew's total working memory is the sum of all agents' windows, not the size of one. That's what makes large rewrites, cross-subsystem refactors, and architectural changes manageable instead of aspirational.

Metateam replaces the single-agent model with a crew — multiple specialized agents with defined roles, working a formal engineering pipeline with gates that reject bad work before it reaches your codebase. You give a two-sentence instruction. You get back reviewed, tested code with a paper trail.


What a Crew Looks Like in Practice

Here's a real task, condensed. I notice a visual bug: the dashboard flickers when switching between agent tabs.

Kickoff. The crew lead receives my report and classifies it — is this a one-line fix or a multi-file investigation? In this case, the flicker could be a race condition, a draw-order bug, or a buffer issue. It needs research before anyone writes code. The lead summons an engineer and a reviewer, and sends a briefing:

You are summoned to investigate a TUI dashboard rendering bug.
PROBLEM: The history panel content briefly flashes into the main
screen before being replaced by the correct content.
YOUR TASK: Investigate the dashboard TUI rendering code thoroughly.
Find what causes the history content to momentarily appear in the
main view area. Look at draw order, screen clearing, buffer swaps,
and any race conditions in the rendering pipeline.
Write your findings to a mail-board report.
This is Phase II RESEARCH — read deeply, trace every render path.

That briefing isn't optional. Every assignment includes what happened so far, what the goal is, what the agent's specific task is, and which files to start with. These agents have memory — they accumulate domain knowledge from their own work sessions, as described in the previous article. But memory isn't situational awareness. They have no hallway conversations, no shared lunch tables, no ambient knowledge of what happened in the last hour. I once watched an engineer investigate the message rendering pipeline when the crew lead said "message delivery pipeline." The briefing doesn't compensate for amnesia — it eliminates ambiguity about scope, ownership, and success criteria for this specific task.
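The briefing discipline can be made concrete. Here is a minimal sketch in Python, with hypothetical field names rather than Metateam's actual schema: every summon carries situation, mission, task, and key files, and an incomplete briefing is rejected before the agent starts work.

```python
from dataclasses import dataclass, field

@dataclass
class Briefing:
    """Sketch of a summon briefing. Field names are illustrative,
    not Metateam's actual schema."""
    situation: str                  # what happened so far
    mission: str                    # the overall goal
    task: str                       # this agent's specific assignment
    key_files: list = field(default_factory=list)
    phase: str = "RESEARCH"

    def missing_fields(self):
        """A briefing with an empty required field is rejected
        before the summon happens."""
        return [name for name in ("situation", "mission", "task")
                if not getattr(self, name).strip()]

briefing = Briefing(
    situation="History panel content flashes into the main screen.",
    mission="Eliminate the dashboard flicker on tab switch.",
    task="Trace every render path; report root cause to the mail-board.",
    key_files=["runtime_loop.rs", "data.rs"],
)
assert briefing.missing_fields() == []            # complete: summon proceeds
assert Briefing("", "goal", "task").missing_fields() == ["situation"]
```

The point of the structure is not ceremony; it is that a missing field fails loudly at summon time instead of silently as a wrong assumption three sessions later.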

The research report. The engineer traces every render path and writes a detailed report to the shared mail-board. The findings are specific — file paths, line numbers, a clear root cause diagnosis:

run_history_sync_with_data() writes raw history payload bytes directly to the terminal backend before the normal frame render. This is a double-buffer contract violation: direct terminal mutation during runtime, then repaint to hide the damage. It works until timing makes the intermediate state visible.

This is Phase II — research before implementation. It exists because agents without this gate skip reading existing code. They see the bug description, form a theory, and start writing a fix based on assumptions. The research phase forces them to read first.
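The failure pattern the report names, direct terminal mutation during runtime followed by a repaint to hide the damage, is general enough to sketch. A toy Python model (the real code is Rust; all names here are hypothetical):

```python
# Toy model of the double-buffer contract violation. Illustrative only.

class Terminal:
    def __init__(self):
        self.visible = ""      # what the user currently sees
        self.flashes = []      # intermediate states that became visible

    def write_direct(self, payload):
        # Direct mutation outside the frame render: the payload is
        # visible until the next repaint overwrites it.
        self.flashes.append(payload)
        self.visible = payload

    def render_frame(self, frame):
        # The double-buffer contract: compose off-screen, swap once.
        self.visible = frame

def sync_history_buggy(term, history, frame):
    term.write_direct(history)   # violation: raw bytes before the frame
    term.render_frame(frame)     # repaint hides the damage, usually

def sync_history_fixed(term, history, frame):
    # Defer: inject history into the frame, render atomically.
    term.render_frame(frame + history)

buggy = Terminal()
sync_history_buggy(buggy, "<history bytes>", "main view")
assert buggy.flashes == ["<history bytes>"]    # the flicker, made explicit

fixed = Terminal()
sync_history_fixed(fixed, "<history bytes>", "main view")
assert fixed.flashes == []                     # no intermediate state shown
```

In the buggy version the bug is timing-dependent, which is exactly why it survived until the research phase forced someone to trace every render path.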

The review gate. The engineer implements the fix. A reviewer reads the diff — not the final state, the diff — and writes a verdict:

Verdict: PASS — with one non-blocking observation.

The generation gate was broken by a single misplaced state update.
The fix removes that advancement. Now only TmuxRespKind::Capture
advances last_capture_generation, which is obviously correct because
that's the whole point of the gate — wait until the pane content is
fresh before injecting history underneath it.

Clean, minimal, obviously correct.

This time it passed. But the reviewer also identified a secondary issue — raw escape sequences on tab switch — and explicitly stated it wasn't urgent and shouldn't block shipping. That distinction matters. A single-agent session would either fix everything (scope creep) or miss the secondary issue entirely (no second opinion). The review gate forces a deliberate triage of what ships now and what ships later.
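The gate logic the verdict describes reduces to a few lines. A Python sketch of it (the actual implementation is Rust; enum variants other than Capture are invented for illustration):

```python
from enum import Enum, auto

class TmuxRespKind(Enum):
    # Mirrors the Rust enum named in the verdict; variants other
    # than Capture are hypothetical.
    Capture = auto()
    Command = auto()
    Status = auto()

class RenderState:
    def __init__(self):
        self.last_capture_generation = 0

    def on_response(self, kind):
        # The fix: only a Capture response advances the generation
        # gate, because only a Capture proves the pane content is fresh.
        if kind is TmuxRespKind.Capture:
            self.last_capture_generation += 1

    def history_injection_allowed(self, generation_at_request):
        return self.last_capture_generation > generation_at_request

state = RenderState()
gen = state.last_capture_generation
state.on_response(TmuxRespKind.Command)    # must not open the gate
assert not state.history_injection_allowed(gen)
state.on_response(TmuxRespKind.Capture)    # fresh content: gate opens
assert state.history_injection_allowed(gen)
```

The original bug was one extra increment in the wrong branch; the fix is the removal the reviewer called "obviously correct."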

Not every review passes on the first try. The same gate works across content types, not just code. When this crew reviewed a blog article about the memory system, a different reviewer rejected the draft outright:

Verdict: REJECT — fix the factual errors, cut the filler, then resend.

The article claims the recall system injects three categories at
SessionStart... That third category is fiction. The article invented
a feature that does not exist. This is the single worst thing in the
piece.

The reviewer traced the claim to the actual source code, found it was wrong, and blocked the article from publishing until the factual error was fixed. That rejection made the final article accurate. Without the gate, the error would have shipped.

Ship. The crew lead collects the review verdicts, the test results, and a summary of what changed, and presents it to me:

Summary: Generation gate flicker fix — single misplaced state update
         removed, draw deferral added for atomic history injection.
Review:  PASS (Park) — "Clean, minimal, obviously correct."
Tests:   43 TUI tests pass, 0 failures.
Files:   runtime_loop.rs, data.rs

I see a concise package, not a transcript of every session. The paper trail exists in the mail-board if I ever need to audit how a decision was made — but the default interface is what matters: reviewed, tested, ready to ship or hold.


The Pipeline Nobody Sees

The six phases — triage, research, design, implement, review, ship — look like bureaucracy on paper. In practice, each one is a scar from a specific failure mode.

Triage prevents wasted work. Without classification, agents jump straight to implementation on tasks that need design, or spin up three specialists for a one-line fix. Triage forces a deliberate assessment: is this trivial, standard, or complex?

Research prevents wrong assumptions. Agents are confident. They'll propose a solution based on the function signatures they can see, without reading the three callers that impose constraints they don't know about. The research phase forces deep reading before any code is written.

Design prevents scope creep. A written implementation plan — with specific files, specific changes, and explicit statements about what is NOT being changed — creates a contract. When the implementation diverges from the plan, it must be flagged. This is how you prevent the common failure mode where an agent fixes the bug and also "improves" six adjacent functions nobody asked it to touch.

Review prevents shipping plausible garbage. Agents produce code that looks right. It compiles, it passes the tests they wrote, it addresses the stated problem. But it may also introduce a subtle regression in an edge case the agent didn't consider, or duplicate logic that already exists elsewhere, or violate a convention the agent didn't know about. The review phase catches what the implementer can't see about its own work.

Ship prevents unverified changes. Nothing reaches the codebase without explicit approval. Every change comes with a summary, reviewer verdicts, and test results.
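The five gates above can be modeled as an explicit state machine: each transition out of a phase demands the artifact that phase was supposed to produce. A sketch, with phase names from the article and artifact names of my own invention:

```python
# Phase gates as a state machine. Phase names come from the article;
# the artifact names are illustrative.
PHASES = ["triage", "research", "design", "implement", "review", "ship"]

# Each transition demands an artifact from the phase being left.
REQUIRED_ARTIFACT = {
    "triage": "classification",
    "research": "research_report",
    "design": "implementation_plan",
    "implement": "diff",
    "review": "reviewer_verdict",
}

def advance(phase, artifacts):
    """Move to the next phase only if the gate artifact exists."""
    needed = REQUIRED_ARTIFACT[phase]
    if needed not in artifacts:
        raise ValueError(f"gate rejected: {phase} produced no {needed}")
    return PHASES[PHASES.index(phase) + 1]

artifacts = {"classification": "complex"}
assert advance("triage", artifacts) == "research"
try:
    advance("research", artifacts)   # no research report yet: gate holds
    assert False, "gate should have rejected"
except ValueError:
    pass
```

This is the whole enforcement mechanism in miniature: an agent cannot reach implementation without a research report, and nothing reaches ship without a reviewer verdict.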

One feature of this pipeline that matters more than it might seem:

The crew lead coordinates but doesn't code. The lead triages, briefs, assigns, collects results, and presents to me. It never implements. This separation exists because an agent that is both coordinating the workflow and writing the code can't objectively evaluate whether its own work is complete.


The Trust Dial

The workflow has two execution modes: interactive and autonomous.

In interactive mode, I'm consulted at the end of every phase. Research complete — approve? Design ready — approve? Implementation done — approve? Review passed — ship? I'm in the loop at every gate.

In autonomous mode, the crew lead drives the pipeline end-to-end. I see only the final result: here's what we did, here are the review verdicts, here are the test results. Ship or hold?

This isn't all-or-nothing. You're not replacing your engineering judgment — you're deciding how much judgment to delegate at each gate. I started interactive, watched the crew work through a few tasks, built confidence in the review quality, and gradually moved to autonomous for routine work while keeping interactive for architectural changes.

The workflow is the same in both modes. The gates are the same. The only difference is who approves at each gate.
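One way to see that the dial changes only the approver, never the gates, is to model the mode as a per-gate policy table. A sketch under that assumption (the policy structure is mine, not Metateam's):

```python
# The trust dial as a per-gate approver table. Gate names come from
# the article; the table structure is illustrative.
INTERACTIVE = {"research": "human", "design": "human",
               "implement": "human", "review": "human", "ship": "human"}

# Autonomous mode delegates every intermediate gate; ship stays human.
AUTONOMOUS = dict(INTERACTIVE, **{g: "crew_lead"
                                  for g in ("research", "design",
                                            "implement", "review")})

# A mixed dial: routine gates delegated, design kept interactive.
MIXED = dict(AUTONOMOUS, design="human")

def approver(policy, gate):
    return policy[gate]

assert approver(INTERACTIVE, "design") == "human"
assert approver(AUTONOMOUS, "review") == "crew_lead"
assert approver(AUTONOMOUS, "ship") == "human"   # the final call stays yours
assert approver(MIXED, "design") == "human"
```

The gradual move from interactive to autonomous is then just editing a table, not changing the workflow.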


A Mixed Roster

A Metateam crew isn't locked to a single AI model. We mix agent types — Claude, Codex, Gemini — and assign each to what it does best.

In our experience, each client type has a distinct strength profile. Claude has strong communication skills and deep reasoning — it excels at architecture and review where coherence matters. Codex follows instructions precisely and catches security issues others miss — it excels at implementation and compliance-focused review. Gemini holds an enormous context window and writes well — it excels at research across large codebases and documentation.

The real advantage isn't any single model's capability. It's what happens when you pair different model types on the same review gate. I always pair Claude and Codex reviewers on non-trivial tasks. A Claude reviewer catches architectural coherence issues and semantic drift. A Codex reviewer catches instruction violations and security gaps. They have different blind spots because they were trained differently and fail differently. When both must converge on an explicit PASS or FAIL at the same gate, their complementary failure modes cancel out.

The first time I watched two different model reviewers disagree on the same diff, I realized I trusted the gate more than my own quick read. That's not accidental — it's a deliberate pairing policy. Model diversity, forced through mandatory gates, becomes a repeatable quality control mechanism.
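The pairing policy has a simple logical shape: the gate passes only when every reviewer independently returns an explicit PASS. A sketch (reviewer names and verdict strings are illustrative):

```python
# Paired-review gate: two reviewers with different blind spots must
# converge. Names and verdict strings are illustrative.

def gate_verdict(verdicts):
    """PASS only if every reviewer explicitly passed; any FAIL blocks."""
    if any(v not in ("PASS", "FAIL") for v in verdicts.values()):
        raise ValueError("every reviewer must return an explicit PASS or FAIL")
    if all(v == "PASS" for v in verdicts.values()):
        return "PASS"
    return "FAIL"

assert gate_verdict({"claude": "PASS", "codex": "PASS"}) == "PASS"
# Disagreement blocks the ship: one reviewer's blind spot is not
# allowed to outvote the other's objection.
assert gate_verdict({"claude": "PASS", "codex": "FAIL"}) == "FAIL"
```

The conjunction is the whole trick: a defect has to fall into both models' blind spots simultaneously to slip through.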


Ownership Compounds

If you read the previous article, you know that each agent builds personal memory from its own work sessions. Facts about bugs fixed, patterns discovered, constraints learned — all scoped to the persona who generated them.

The workflow makes that memory systematic through an ownership convention. Each subsystem has a designated owner — the same persona is assigned to the same domain repeatedly. The P2P transport specialist handles P2P tasks. The dashboard specialist handles TUI work. The API specialist handles endpoint changes.
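The convention amounts to a routing table: the same persona is summoned for the same domain every time, so memory accumulates where it is useful. A sketch with hypothetical persona names:

```python
# Ownership as a routing table. Subsystem keys and persona names
# are hypothetical.
OWNERS = {
    "p2p": "transport_specialist",
    "tui": "dashboard_specialist",
    "api": "endpoint_specialist",
}

def summon_for(subsystem):
    # Summoning any free agent would work once, but it scatters the
    # accumulated memory across personas instead of compounding it.
    try:
        return OWNERS[subsystem]
    except KeyError:
        raise ValueError(f"no designated owner for {subsystem!r}")

assert summon_for("p2p") == "transport_specialist"
assert summon_for("tui") == "dashboard_specialist"
```

The table is trivial; the discipline of never routing around it is what makes the memory compound.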

After six weeks of this, the effect is measurable. Summoning the transport specialist for a P2P task takes a two-sentence briefing instead of a page of context. The specialist already knows the wire protocol, the streaming thresholds, the bugs it fixed last month, and the constraints the architect set. None of that was written in a documentation file. It accumulated automatically from the specialist's own work sessions.

Without the workflow, memory is trivia that agents carry around — interesting facts with no operational structure. Without memory, the workflow resets every session — agents follow the process but forget what they learned doing it. Together they form a system: the workflow creates the repeated assignments that make memory accumulate in the right domains, and memory makes each assignment more efficient than the last.

This is the difference between a team of contractors and a team of employees. Contractors need the full context every engagement. Employees have institutional memory.


What Goes Wrong

The honest version: coordination has real costs.

The briefing tax. Every assignment requires a structured briefing with situation, mission, task, key files, and crew roster. This is overhead. The cost of a wrong assumption is higher than the cost of writing a few extra lines, but it's still a cost, and it scales linearly with the number of handoffs.

Agents misunderstand each other. When a reviewer writes "this path is not urgent — ship and evaluate later," the implementer sometimes interprets that as "fix this too." These aren't catastrophic failures, but they add correction cycles.

Communication overhead. Messages, reports, review verdicts — the mail-board accumulates artifacts. Most of them are useful for audit and context. Some of them are noise. Communication discipline is an ongoing cost, not a solved problem.

Process is not judgment. The six-phase pipeline prevents the most common failure modes, but not all of them. An agent can follow every phase perfectly and still produce a solution that is technically correct but architecturally wrong. The review gate catches bugs and regressions. It doesn't always catch bad design.

These are real limitations. The pipeline doesn't make agents infallible. It makes them accountable, auditable, and reviewable — a lower bar than perfection, but a much higher bar than shipping unreviewed agent output directly.


The Real Difference

A single agent with a good prompt can produce impressive work. But impressive work that nobody reviewed, based on assumptions nobody verified, with no paper trail and no second opinion, is a liability disguised as productivity.

The crew workflow turns agent output into engineering output. The workflow forbids skipping research, rubber-stamping reviews, or shipping without tests. These aren't aspirational guidelines — they're enforced gates. An agent can't proceed to implementation without a research report. A change can't ship without a reviewer verdict.

What agents can't do is know when to break the process. When to skip the design phase because the fix is obvious. When to trust the implementer and wave the review through. That judgment is still the human's job. You set the trust dial. The crew operates within it.

The result is a system where AI agents do what they're best at — fast, thorough, tireless execution within clear constraints — and humans do what they're best at — deciding which constraints matter and when to change them. That division of labor is not a limitation. It is the architecture.


Stop being the pipeline.

Register at metateam.ai and start a crew in your repo.