Let AI agents debate: Redesigning the development process with multi-agent collaboration

This is the official article for Tech-Verse 2026, LY Corporation's technology conference.

The hard part of AI coding is no longer getting code back. It is everything around it: turning intent into a precise spec, challenging assumptions, validating the implementation, and preparing a pull request (PR) reviewers can trust.

We redesigned that coordination layer as a phase-by-phase debate between Proposer and Challenger specialists. Each phase has its own artifact: a spec, an implementation, then a review-ready PR package. Proposers advance the artifact; Challengers test it differently in each phase: Socratic questions in Spec, evidence-backed objections in Build, and readiness verification in Delivery. The Orchestrator decides whether to revise, escalate, or move forward.

In our continuous integration (CI) flow, an issue can become a debated spec, a branch, a tested implementation, and a review-ready PR. Engineers still own the judgment, but they no longer have to manually coordinate every handoff.

This post shows how the pipeline works, where it breaks, and why the real leverage is not faster code generation. It is forcing AI to prove its work before humans spend attention on it.

The problem: Human coordination is the bottleneck

"10x Faster" will not come from squeezing more speed out of each manual step. It requires redesigning the process itself: how intent becomes a spec, how assumptions get challenged, how implementation is validated, and how reviewers receive enough evidence to make a decision.

Here is the uncomfortable truth: when building software with AI, the bottleneck is often not raw AI generation. It is the human coordination wrapped around it. We write a spec, ask AI to draft it, inspect the draft, clarify gaps, request code, run checks, paste failures back, review the diff, ask for a PR description, and decide whether the result is safe. The AI finishes each local task quickly — then waits for us to coordinate the next one.

This is AI-assist: the old process, plus AI bolted on. It speeds up individual steps, but leaves the expensive part untouched: the coordination layer between intent, implementation, validation, review, and delivery. The practical path is not to remove engineers from judgment; it is to move them out of repetitive coordination work in the messy middle.

In our team, we have been testing a different approach: treating development as structured collaboration between specialized AI roles, orchestrated as teams, not a sequence of prompts handed back to one assistant.

The insight: Let AI agents think together

But if humans are no longer coordinating every repetitive handoff, how does the system still catch bugs, challenge assumptions, and surface edge cases before final review?

You could add another manual checkpoint, but that gives the bottleneck a new name. You could add a single AI checker, but that often collapses into another single-perspective review.

The answer: organize AI roles into two opposing teams, Proposer and Challenger, each with different specialist responsibilities per phase, and let them debate through a structured protocol. This is not magic. It works for the same reason pair programming works, but at a larger scale: instead of two people, you have two teams of AI specialists continuously challenging each other.

The exact role names are not the point. The design principle is: keep specification, implementation, review, validation, and delivery as separate responsibilities instead of collapsing them into one generic assistant response. Separate the engineering responsibilities, make each side argue from evidence, and let an Orchestrator decide when the work is good enough to move forward.

AI-native development redesigns the process so these specialist roles collaborate around shared artifacts. Humans are freed from tedious micro-coordination handoffs in the middle, but not from ownership or judgment. In between, agents automatically explore, refine, validate, challenge, and package the work. By default, humans step in at the strategic gates: defining intent at the start, approving the outcome at the end, or adding guidance when the system explicitly escalates.

How it works: Three phases, one spec-driven pipeline

The workflow is built around three moving parts: a Spec → Build → Delivery phase sequence, two opposing specialist sides, and an Orchestrator that decides when to revise, escalate, or move forward. The phase order is intentional. The Spec phase is the most important because it defines the contract that every later phase must obey: objective, constraints, interpreted requirements, explicit assumptions, open questions, proposed approach, and Definition of Done.

In that sense, this is a form of Spec-Driven Development. The spec can be short or long depending on what the problem requires, but it must become the controlling artifact of the workflow. Build agents do not simply "do what seems reasonable"; they first derive tests and checks from the agreed spec, then implement against both the spec and that validation design. Delivery agents do not simply summarize the diff; they prove whether the implementation satisfies the spec. If the spec is weak, the whole pipeline is weak. If the spec is precise, the later phases have something concrete to optimize, challenge, and verify.

In each phase, specialized agent runs are assigned to one of two sides: Proposer or Challenger. Here, "agent" means a coding-agent run — the kind of agent interface used by tools like Claude Code, Codex, or OpenCode — assigned a specialist role by the Orchestrator. An "agent team" is a group of those role-specialized runs working on opposite sides of the same phase. It does not mean a swarm of heavyweight autonomous systems with a shared hivemind. In the current implementation, each run has a narrow prompt, its own context, and access to the same workspace artifacts. The team metaphor is useful because each run is forced to argue from one engineering perspective instead of blending every responsibility into one generic answer.

The Orchestrator sits between the Proposer and Challenger sides — steering when debate goes off-topic, breaking deadlocks, and rendering phase-level verdicts on whether to continue, revise, or move forward. Its role shifts per phase: mediator during Spec and Build (guiding convergence), jury during Delivery (deciding whether the evidence is sufficient to ship).

Each phase combines a different specialist focus with a different debate strategy — not by accident, but because each phase has a fundamentally different goal.

Phase	Proposer focus	Challenger focus	Debate strategy
Spec	Research the codebase and synthesize requirements	Surface ambiguity, hidden assumptions, and risk	Socratic Questioning: only questions, no assertions or solutions
Build	Design tests and checks from the spec, then implement and maintain the change	Challenge the validation design, coverage, security, performance, acceptance criteria, and maintainability	Adversarial Review: objections need concrete proof about tests, code, or execution evidence
Delivery	Prepare the PR summary, evidence, and review map	Verify readiness, evidence quality, and visualization accuracy	Jury Debate: neither side judges itself; Orchestrator decides

In practice, those focuses map to specialist roles such as requirements-synthesizer, security-analyst, test-coverage-reviewer, technical-writer, and evidence-verifier.

At a high level, the same debate pattern repeats across all three phases: the Proposer presents, the Challenger evaluates, and the Orchestrator decides.

The important detail is that each phase is grounded in evidence, not vibes. Spec context can come from the workspace and from external product sources such as Jira tickets, Confluence pages, design docs, or incident notes. The agents use that context alongside existing APIs, tests, dependencies, and conventions to ask better questions before implementation starts. Build is test-first: before changing production code, the Proposer translates the accepted spec into expected behavior, edge cases, tests to add or update, and commands to run. The Challenger can object to that validation design before or alongside the code, so a green check cannot hide missing acceptance coverage. Build review remains bidirectional: the Proposer can reject a challenge, but only with proof such as an execution path, compile/lint output, or a failing test. Delivery turns the result into a reviewer trust package: what changed, where to look first, which checks passed, what risks remain, and what the Challenger already tried to break.

What the protocol actually looks like

The exact payload changes by phase, but the protocol shape is consistent: each specialist run receives the phase context, inspects workspace evidence when needed, and returns strict JSON that the Orchestrator can parse, compare, and carry into the next round. Agents do not share one live context window. The durable shared state is the workspace, generated artifacts, and the transcript accumulated by the Orchestrator.

The Spec phase shows the pattern most clearly because it creates the contract for everything after it. A round is not just "review the spec." It is a small state machine: each turn has a defined role, a constrained response format, and a decision that either revises the spec, asks for another review, or blocks when the remaining uncertainty is unsafe to assume.

The output is still structured JSON, not a long essay. Each turn keeps the same envelope, so fields not used by that turn stay empty. For example, a Challenger question turn can look like this:

{
  "turn": "questions",
  "status": "needs-revision",
  "summary": "The spec needs one scope boundary before build.",
  "specialistInputs": [
    {
      "name": "ambiguity-reviewer",
      "stance": "caution",
      "opinion": "The named endpoint is clear, but neighboring search-like endpoints are not explicitly in or out of scope."
    }
  ],
  "decisionRationale": "Preserving the named scope avoids accidental broadening while letting the Proposer write a bounded assumption into the spec.",
  "payload": {
    "ready": false,
    "interpretation": {
      "requirements": [],
      "constraints": [],
      "definitionOfDone": [],
      "assumptions": [],
      "openQuestions": []
    },
    "questions": [
      {
        "id": "Q1",
        "severity": "major",
        "section": "constraints",
        "question": "Should the implementation be limited to the existing `/api/search` route, leaving other search-like endpoints unchanged unless explicitly named?",
        "whyItMatters": "Without this boundary, the build phase might change unrelated endpoints. Preserving the named scope is a safe default if documented in the spec.",
        "requiresUserInput": false,
        "assumptionRisk": "low",
        "confidence": 0.85
      }
    ],
    "answers": []
  }
}

This matters because the protocol distinguishes uncertainty from blockers. If an ambiguity is low-risk, reversible, and bounded by existing workspace evidence, the Proposer can write that assumption into the spec and proceed. If the missing answer would be unsafe, destructive, externally constrained, or materially irreversible, the system escalates instead of guessing.
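The escalation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the field names come from the example turn, while the `triage_questions` helper, the second question `Q2`, and the rule that only `"low"` risk may be assumed are hypothetical choices made for this sketch.

```python
import json


def triage_questions(turn_json: str) -> tuple[list[str], list[str]]:
    """Split Challenger questions into assumable ones and ones to escalate."""
    turn = json.loads(turn_json)
    assumable, escalate = [], []
    for q in turn["payload"]["questions"]:
        # Assumed rule: a question can be closed with a documented assumption
        # only if it needs no user input and its assumption risk is low.
        safe = not q["requiresUserInput"] and q["assumptionRisk"] == "low"
        (assumable if safe else escalate).append(q["id"])
    return assumable, escalate


# Trimmed-down turn; Q2 is an invented question used only to show escalation.
example_turn = json.dumps({
    "turn": "questions",
    "status": "needs-revision",
    "payload": {
        "questions": [
            {"id": "Q1", "requiresUserInput": False, "assumptionRisk": "low"},
            {"id": "Q2", "requiresUserInput": True, "assumptionRisk": "high"},
        ]
    },
})

assumable, escalate = triage_questions(example_turn)
print("write into spec as assumptions:", assumable)
print("escalate to a human:", escalate)
```

Keeping the decision as a pure function over the parsed turn makes it easy to test the escalation boundary separately from any model call.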

Build and Delivery use the same structured-debate idea with different payloads. Build carries the test-first validation design, execution evidence, and review issues, each marked accepted, rejected, or compromised with justification and evidence. Delivery packages the result into PR text, test evidence, and reviewer guidance. The common point is that agents do not just produce prose. They leave structured evidence the next phase can verify.
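As an illustration of what "structured evidence" could mean for a Build review issue, here is a small Python sketch. The `IssueResolution` record, the `validate_resolution` helper, and the sample evidence strings are hypothetical. The rule that a rejection needs proof follows the article; extending it to compromises is an assumption.

```python
from dataclasses import dataclass, field

RESOLUTIONS = {"accepted", "rejected", "compromised"}


@dataclass
class IssueResolution:
    issue_id: str
    resolution: str
    justification: str
    evidence: list[str] = field(default_factory=list)


def validate_resolution(res: IssueResolution) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if res.resolution not in RESOLUTIONS:
        problems.append(f"unknown resolution: {res.resolution}")
    if not res.justification.strip():
        problems.append("missing justification")
    # Assumed rule: pushing back on a challenge requires at least one
    # piece of evidence, such as test output or an execution path.
    if res.resolution in {"rejected", "compromised"} and not res.evidence:
        problems.append("rejection or compromise without evidence")
    return problems


# Illustrative records; the evidence text is invented for the example.
with_proof = IssueResolution(
    "I1", "rejected", "The flagged path is unreachable",
    ["unit test covering the path passes"],
)
without_proof = IssueResolution("I2", "rejected", "Not a real problem")

print(validate_resolution(with_proof))
print(validate_resolution(without_proof))
```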

Skills strengthen specialist roles

Specialist roles are not powered by role prompts alone. A "security analyst" is useful as a perspective, but it becomes much stronger when paired with reusable skills: threat-modeling changed endpoints, checking authentication boundaries, reviewing input validation, or producing concrete exploit scenarios. In short: role defines what the agent should care about; skill defines how it should work.

This keeps expertise modular. A test engineer can bring a boundary-test generation skill; a delivery writer can bring a PR-description skill that preserves decision evidence; a visualization specialist can bring a review-map skill that turns a diff into a diagram. When the team learns a better review pattern, it can become a reusable method instead of being buried inside one giant prompt.
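One way to picture the role/skill split is as two small data types that are composed into a prompt. The sketch below is an assumption about structure, not the team's implementation: `Skill`, `Role`, `build_prompt`, and the instruction text are all hypothetical names and wording.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Skill:
    name: str
    instructions: str  # how the agent should work


@dataclass
class Role:
    name: str
    focus: str  # what the agent should care about
    skills: list[Skill] = field(default_factory=list)

    def build_prompt(self) -> str:
        """Compose the role perspective and its reusable skills into one prompt."""
        lines = [f"Role: {self.name}", f"Focus: {self.focus}"]
        for skill in self.skills:
            lines.append(f"Skill {skill.name}: {skill.instructions}")
        return "\n".join(lines)


threat_modeling = Skill(
    "threat-modeling",
    "List each changed endpoint and the trust boundary it crosses.",
)
analyst = Role(
    "security-analyst",
    "Authentication boundaries and input validation",
    [threat_modeling],
)
print(analyst.build_prompt())
```

Because skills are separate values, the same `threat_modeling` skill could be attached to another role without copying prompt text.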

How debates converge

A natural concern: what stops agents from debating indefinitely?

Each round follows a strict structure: Proposer presents → Challenger evaluates → Orchestrator decides. The Orchestrator can continue, steer (redirect off-topic discussions), resolve (accept partial consensus), or terminate. Critically, resolved issues are filtered out of subsequent rounds — so each round narrows the debate scope. A 5-issue debate might resolve 3 issues in round 1, then focus only on the remaining 2 in round 2.
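The narrowing effect can be shown with a short Python sketch. It assumes a hypothetical `run_debate` loop with a `max_rounds` budget standing in for the Orchestrator's terminate decision, and a scripted `resolve_round` callback standing in for the real Proposer, Challenger, and Orchestrator exchange.

```python
from typing import Callable


def run_debate(
    issues: list[str],
    resolve_round: Callable[[int, list[str]], set[str]],
    max_rounds: int = 3,
) -> tuple[list[str], int]:
    """Run rounds until no issues remain or the round budget is spent.

    Returns the unresolved issues and the number of rounds used.
    """
    open_issues = list(issues)
    rounds = 0
    while open_issues and rounds < max_rounds:
        rounds += 1
        resolved = resolve_round(rounds, open_issues)
        # Resolved issues are filtered out, so the next round is narrower.
        open_issues = [i for i in open_issues if i not in resolved]
    return open_issues, rounds


# Scripted outcome matching the example: 3 issues close in round 1, 2 in round 2.
scripted = {1: {"I1", "I2", "I3"}, 2: {"I4", "I5"}}
remaining, used = run_debate(
    ["I1", "I2", "I3", "I4", "I5"],
    lambda round_no, _open: scripted.get(round_no, set()),
)
print(remaining, used)  # [] 2
```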

The important part is not that every task produces a dramatic catch. It is that each phase narrows uncertainty and leaves structured evidence behind, so reliability comes from repeatable controls rather than lucky review moments.

What makes it reliable

Three engineering decisions keep the pipeline from wasting rounds or getting stuck in unproductive loops:

Mechanism	Problem it solves	Key idea
Complexity routing	Not every task needs full debate	Classify complexity → scale debate intensity to risk
Oscillation detection	A→B→A fix loops waste tokens	Detect repeated failure signatures → escalate to redesign
Structured priors	Raw transcripts don't teach future runs	Compress outcomes into schema-backed advisory signals

Complexity routing. Not every task needs the same debate intensity. The spec refinement review can estimate task complexity after gathering available implementation context — source files and any external references it actually resolved. When debate is enabled, trivial tasks still get a lightweight specialist debate; tasks classified as critical, including security-sensitive changes, get an extra round. Cost stays proportional to risk.

In routing terms, a "light pass" is a minimal specialist review, and "one full round" is a complete Proposer/Challenger exchange that does not extend into multiple rounds.
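A routing table like the following Python sketch captures the idea. The article names only "trivial" and "critical" tasks; the "standard" tier, the exact round counts, and the `DebatePlan` and `route` names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebatePlan:
    mode: str         # "light pass" or "full"
    full_rounds: int  # complete Proposer/Challenger exchanges


# Hypothetical tiers and round counts, chosen only to show the shape.
ROUTES = {
    "trivial": DebatePlan("light pass", 0),
    "standard": DebatePlan("full", 1),
    "critical": DebatePlan("full", 2),  # one extra round
}


def route(complexity: str, security_sensitive: bool = False) -> DebatePlan:
    """Pick debate intensity from the estimated complexity."""
    if security_sensitive:
        # Security-sensitive changes are treated as critical.
        complexity = "critical"
    # Unknown labels fall back to the standard tier in this sketch.
    return ROUTES.get(complexity, ROUTES["standard"])


print(route("trivial"))
print(route("standard", security_sensitive=True))
```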

Oscillation detection. If you have used AI coding tools for more than a week, you know this loop: you ask one to fix a test, it breaks another test, you ask it to fix that one, and it reintroduces the original failure. Each individual fix looks like progress. You can see the loop from the outside — but the agent often cannot, because it is focused on the current error.

This is arguably the most frustrating failure mode of single-agent workflows: the agent is working hard, burning tokens, and going nowhere.

Our system tracks failure signatures across self-fix attempts. When the current failure matches two steps ago but differs from one step ago — the classic A→B→A oscillation — it stops the self-fix loop and routes the latest result to Challenger review instead of spending another iteration on the same fix cycle.

Step 1: error A
Step 2: fix A → error B
Step 3: fix B → error A  ← same as Step 1. Loop detected.
                         → stop iterating, escalate for redesign
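The A→B→A check itself is small. The Python sketch below assumes failure signatures are already available as comparable strings; how a signature is derived from a failing run is not described in the article, and `is_oscillating` is a hypothetical name.

```python
def is_oscillating(signatures: list[str]) -> bool:
    """True when the latest failure matches two steps ago but not one step ago."""
    if len(signatures) < 3:
        return False
    current, previous, two_back = signatures[-1], signatures[-2], signatures[-3]
    return current == two_back and current != previous


history: list[str] = []
for step, signature in enumerate(["error A", "error B", "error A"], start=1):
    history.append(signature)
    if is_oscillating(history):
        # Stop the self-fix loop instead of spending another iteration.
        print(f"step {step}: loop detected, routing to Challenger review")
        break
```

Requiring the current signature to differ from the previous one keeps a plain repeated failure (A→A→A) from being reported as an oscillation; that case would need its own retry limit.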

The multi-agent structure is what makes detection actionable. A single agent can detect "I've seen this error before" — but what does it do? Try harder? The debate structure gives it somewhere to escalate: a different team with a different perspective.

Structured priors. After each run, the system does not store the whole transcript as raw memory. It compresses reusable lessons into small schema-backed priors: what pattern appeared, how severe it was, what evidence category supported it, how often it has appeared, confidence, latest task, and suggested mitigation.

Product-wise, this is memory as a warning label, not memory as a judge. A prior might be formatted back into a prompt like this:

- repeated-failure-check validation:rate-limit-concurrency:major
  (seen 3x, confidence 0.65);
  mitigation: Add or update tests for changed behavior paths, then rerun relevant checks;
  root causes: test-gaps, trust-boundary

The important constraint: priors are advisory signals. They tell a specialist where to look first, but current evidence must still prove the claim. An old pattern can trigger extra scrutiny; it cannot auto-fail a new run. This lets the system learn from old patterns without turning memory into unchecked bias.
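To make the shape of a prior concrete, here is a hedged Python sketch that renders the example above from a small record. The `Prior` fields and `format_prior` are hypothetical, and reading `validation:rate-limit-concurrency:major` as evidence category, topic, and severity is an interpretation of the example, not a documented schema. The "latest task" field mentioned above is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prior:
    pattern: str
    evidence_category: str
    topic: str
    severity: str
    seen: int
    confidence: float
    mitigation: str
    root_causes: tuple[str, ...]


def format_prior(p: Prior) -> str:
    """Render a prior as an advisory line for a specialist prompt."""
    return (
        f"- {p.pattern} {p.evidence_category}:{p.topic}:{p.severity}\n"
        f"  (seen {p.seen}x, confidence {p.confidence:.2f});\n"
        f"  mitigation: {p.mitigation};\n"
        f"  root causes: {', '.join(p.root_causes)}"
    )


example_prior = Prior(
    pattern="repeated-failure-check",
    evidence_category="validation",
    topic="rate-limit-concurrency",
    severity="major",
    seen=3,
    confidence=0.65,
    mitigation="Add or update tests for changed behavior paths, then rerun relevant checks",
    root_causes=("test-gaps", "trust-boundary"),
)
# The rendered text only tells a specialist where to look first;
# it is never used on its own to fail a run.
print(format_prior(example_prior))
```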

When to use this

Use this pipeline when "correct" can be defined with evidence: clear intent, acceptance criteria, and validation through code, tests, logs, or diffs.

It is strongest for asynchronous, CI-driven work: an issue triggers Spec debate, Build, validation, and a review-ready PR. Humans stay at the judgment checkpoints: clarify intent, set risk boundaries, approve the result, or redirect scope.

Best fits:

CI-driven issue-to-PR automation.
Small and medium implementation tasks with testable acceptance criteria.
Bug fixes with reproducible failures, logs, or code paths.
Risk-sensitive changes where security, performance, test, and maintainability challengers add value.

Avoid it when the definition of "correct" is still being negotiated. Keep humans in the inner loop for exploratory product decisions, unclear requirements, or work that cannot be validated with concrete evidence.

The pipeline can also propose changes to its own source code, but only as proposal-only work: branch and PR required, no self-merge, no CI bypass, no new permissions, and no destructive operations without approval.
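These guardrails read naturally as a checklist that can be evaluated before a self-modifying change is allowed to proceed. The Python sketch below is illustrative only: `ProposedChange` and `guardrail_violations` are hypothetical names, and the article does not say how the real pipeline enforces these rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedChange:
    via_pull_request: bool
    self_merge: bool
    bypasses_ci: bool
    adds_permissions: bool
    destructive: bool
    human_approved: bool = False


def guardrail_violations(change: ProposedChange) -> list[str]:
    """Return every proposal-only rule the change would break."""
    violations = []
    if not change.via_pull_request:
        violations.append("branch and PR required")
    if change.self_merge:
        violations.append("no self-merge")
    if change.bypasses_ci:
        violations.append("no CI bypass")
    if change.adds_permissions:
        violations.append("no new permissions")
    if change.destructive and not change.human_approved:
        violations.append("destructive operation needs approval")
    return violations


safe_change = ProposedChange(True, False, False, False, False)
risky_change = ProposedChange(True, True, False, False, True)
print(guardrail_violations(safe_change))
print(guardrail_violations(risky_change))
```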

Trade-offs and limitations

Dimension	Reality	Mitigation
Cost	More tokens than a single-agent pass	Routing keeps trivial tasks to one lightweight round; lazy-loading avoids pasting every source file, diff, reference, and transcript into every prompt
Latency	Minutes per task	Fits async pipelines, not real-time coding
Large specs	Still difficult to solve smoothly in one run	Spec breakdown is in progress: split large problems into medium and small ones, while isolating decisions that require human confirmation
Blind spots	Both teams can miss the same thing	Model/provider diversity can reduce shared failure modes because different systems tend to have different blind spots
Ownership	Debate does not replace engineering accountability	Humans still define intent, approve boundaries, and own production outcomes

Is the extra cost worth it? For many production-facing changes, one missed bug can cost more than an extra debate round. For a version bump, debate is wasted effort. Routing and lazy-loading keep cost proportional to risk and evidence needs. Multi-agent debate is not a universal replacement for judgment; it is a way to apply more consistent judgment before a human is asked to approve.

Where we are

We use this system daily, including on CI. This is not a "hello world" demo; it is a workflow we use for small and medium changes, especially work with clear acceptance criteria. In our use, it has produced tighter specs and surfaced issues earlier than a single-pass assistant workflow.

It is not perfect. Debates sometimes converge on the wrong answer, large specs are still hard, exploratory tasks often need research before build, and cost overhead is real. Human-in-the-loop remains intentional: the goal is not to pretend the system is always right, but to reduce manual coordination while bringing humans in at the moments where judgment is actually needed.

Next on the roadmap are spec breakdown and confidence-based approval for low-risk changes that show fast debate convergence and clear validation. The takeaway is not "auto-merge everything." The value is reducing handoffs, unsticking review loops, and adding a consistent layer of technical challenge before humans spend attention on final judgment.

So the more useful question is not only "can AI write code?" It is "what process forces AI to prove its work, be challenged, and leave enough evidence for humans to decide faster?" We built this pipeline to make those opposing specialist perspectives repeatable and integrated into how we ship code every day.

Tech-Verse 2026 to Be Held on June 29

Explore cutting-edge challenges and real-world insights.

Be sure to watch the event live on YouTube LIVE.
https://tech-verse.lycorp.co.jp/2026/en/