29 specialized agents coordinated through an 882-line state schema. 5-signal tier classification. 10 verification gates. Independent Gemini validation of every agent run. NHI lifecycle tracking. Circuit breakers. Cost tracking. BRD-driven development with intent engineering.

A single AI agent handling a complex task — "build an authentication system with OAuth, MFA, and session management" — produces inconsistent results. It tries to architect, implement, test, and review simultaneously. Under context pressure, it skips steps, doesn't validate its own output, and produces code that works in isolation but fails integration.
Worse: agents routinely claim completion when work is partial, stubbed, or wrong. Self-reporting is unreliable. An agent that wrote the code should not be the one validating it — the same way a developer doesn't review their own pull request.
Human development teams solve this with specialization, quality gates, and independent review. The orchestration system brings the same structure to AI agents: 29 specialized agents working through tiered workflows with 10 verification gates, independent Gemini validation of every agent output, and persistent state tracking.
The key insight: not every task needs the full pipeline. A typo fix doesn't need architecture review. A greenfield system does. The 5-signal tier classification matches orchestration complexity to task complexity.
| Signal | Weight | Evaluates |
|---|---|---|
| Scope | 0.30 | File count, cross-domain impact, system breadth |
| Type | 0.25 | Bug fix → feature → refactor → architecture → greenfield |
| Risk | 0.25 | Blast radius, data sensitivity, reversibility |
| Ambiguity | 0.20 | Spec completeness, requirement clarity, unknowns |
| Intent Sensitivity | bonus | Compliance/security/financial tasks auto-escalate |
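The weighted scoring above can be sketched as a small classifier. Only the weights come from the table; the 1–4 signal scale, the tier thresholds, and the escalation floor for intent-sensitive tasks are illustrative assumptions.

```python
# Sketch of the 5-signal tier classifier. Weights are from the table;
# the signal scale (1.0-4.0), tier cutoffs, and the STANDARD floor for
# intent-sensitive tasks are assumptions for illustration.

WEIGHTS = {"scope": 0.30, "type": 0.25, "risk": 0.25, "ambiguity": 0.20}

def classify(signals: dict[str, float], intent_sensitive: bool = False) -> str:
    """signals: each scored 1.0 (trivial) .. 4.0 (extreme)."""
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    if intent_sensitive:
        # Compliance/security/financial tasks auto-escalate;
        # an at-least-STANDARD floor is an assumed mechanism.
        score = max(score, 3.0)
    if score < 1.8:
        return "TRIVIAL"
    if score < 2.5:
        return "MINOR"
    if score < 3.3:
        return "STANDARD"
    return "MAJOR"
```

A typo fix (all signals at 1) lands in TRIVIAL; a greenfield compliance system (high signals plus the sensitivity bonus) lands in MAJOR.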
Every agent dispatch follows the same accountability loop. The orchestrating agent (Claude) dispatches work, but a separate AI (Gemini CLI) independently validates the output. Claude does not grade its own homework.
| Gemini Verdict | Completion | Conductor Action |
|---|---|---|
| PASS | 100% | Proceed to next step |
| PARTIAL | ≥70% | Proceed with advisory warning logged |
| PARTIAL | <70% | Block — finding-level remediation loop |
| FAIL | any | Block — remediate or escalate to human |
| ERROR | n/a | Log, proceed (Gemini unavailability is non-blocking) |
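The verdict table reduces to a small decision function. The function name and return strings are illustrative, but the policy is exactly the table above.

```python
# Sketch of the conductor's verdict-handling policy. Return values are
# stand-ins for the real conductor actions.

def conductor_action(verdict: str, completion: float) -> str:
    if verdict == "PASS":
        return "proceed"
    if verdict == "PARTIAL":
        # >= 70% completion proceeds with an advisory warning logged;
        # below that, the finding-level remediation loop blocks progress.
        return "proceed_with_warning" if completion >= 0.70 else "remediate"
    if verdict == "FAIL":
        return "remediate_or_escalate"
    # ERROR: Gemini unavailability is non-blocking.
    return "log_and_proceed"
```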
When Gemini rejects an agent's output, the system doesn't just re-run the entire task. It extracts each specific finding and dispatches targeted fixes:
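A minimal sketch of that loop, assuming hypothetical `dispatch_fix` and `reverify` callables standing in for the real agent dispatch and Gemini CLI calls; the two-loop cap before human escalation follows the protocol.

```python
# Sketch of finding-level remediation: each Gemini finding becomes its own
# targeted fix-and-reverify cycle rather than a bulk re-run of the task.

MAX_LOOPS = 2  # per the protocol: two remediation loops, then escalate

def remediate(findings, dispatch_fix, reverify):
    trail = []  # paper trail: finding, fix evidence, per-finding verdict
    for attempt in range(1, MAX_LOOPS + 1):
        unresolved = []
        for finding in findings:
            evidence = dispatch_fix(finding)       # targeted fix with file:line evidence
            verdict = reverify(finding, evidence)  # RESOLVED/UNRESOLVED/REGRESSED
            trail.append({"finding": finding, "evidence": evidence,
                          "verdict": verdict, "attempt": attempt})
            if verdict != "RESOLVED":
                unresolved.append(finding)
        if not unresolved:
            return "PASS", trail
        findings = unresolved  # only re-work what Gemini rejected
    return "ESCALATE_TO_HUMAN", trail
```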
Every finding has a paper trail: what Gemini flagged, what Claude changed, whether Gemini accepted the fix.
The conductor maintains a comprehensive state object validated against a JSON schema after every Write/Edit operation. Key domains tracked:
- **Workflow:** project_name, tier signals, current phase/step, task queue, completed tasks with outcomes/deliverables
- **NHI Registry:** agent instances with IDs (`nhi_*`), spawn/terminate times, status, parent NHI, tools used, token usage
- **Checkpoints:** 9 trigger types (phase_transition, agent_handoff, pre_ciso_review, pre_auto_code, etc.), git SHA, BRD hash
- **Dead Letters:** blocked tasks (reason, since, by_step) and failed tasks (8 failure types, retry count, escalation status)
- **Cost:** budget_limit_usd, total tokens in/out, estimated_cost_usd, budget_exceeded flag
- **Circuit Breaker:** closed/open/half_open states, failure count, opens_at threshold, half_open cooldown
- **Intent:** objectives (goal, signals, priority, constraints), trade-offs, delegation boundaries, prohibited behaviors
- **Governance:** manifest_id/version/hash, trust_level 1-5, session_classification, audit_session_id, human_gate_required
- **Gemini Validations:** per-agent validation audit trail: verdict, completion %, deliverables checked/passed/failed, finding resolutions (RESOLVED/UNRESOLVED/REGRESSED), attempt counts, aggregate stats
- **Verification Gates (10):** post_ciso, post_extraction, post_architect, post_qa, post_implementation, post_documentation, post_pentest, post_supply_chain, pre_release, completeness
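The real system validates the full 882-line JSON schema after every Write/Edit. A stdlib-only stand-in that checks a few assumed top-level keys conveys the shape of that hook:

```python
# Sketch of the PostToolUse state check. The real system validates the
# complete JSON schema; this stand-in only checks that the state parses
# and contains a few top-level domains (key names are assumptions drawn
# from the domain list above).

import json

REQUIRED_DOMAINS = {"project_name", "nhi_registry", "checkpoints",
                    "cost", "circuit_breaker", "gemini_validations"}

def check_state(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the state passes."""
    try:
        state = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(state, dict):
        return ["state root must be an object"]
    missing = REQUIRED_DOMAINS - state.keys()
    return [f"missing domain: {d}" for d in sorted(missing)]
```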
A YAML capability matrix maps task types to agents based on capability match + trust level. The orchestrator selects best-fit agents for each phase. Fallback routing handles unavailable agents.
```
User Requirements → BRD (Research) → Intent Engineering (Section 3.6)
  → Technical Spec (Architect) → Implementation Plan (Builder)
  → Code (Builder) → Tests (QA) → Security (CISO) → Review → Merge
```
Every STANDARD/MAJOR workflow starts with a Business Requirements Document, not code. The BRD includes Section 3.6 (Intent Engineering) capturing objectives, constraints, trade-offs, and delegation boundaries. BRD-tracker.json maintains traceability from extraction through completion.
The state schema detects and tracks: has_api, has_ui, has_database, has_containers, has_kubernetes. Plus deployment_target (local/cloud/hybrid/on-premise), compliance_requirements (SOC2/GDPR/HIPAA/PCI-DSS/ISO27001/FedRAMP), security_classification, and PQC readiness assessment.
After every agent dispatch returns, the conductor runs independent validation via Gemini CLI before proceeding. This is mandatory for all agents that produce file artifacts.
**Full validation.** Used after the initial agent dispatch. Gemini receives the agent's task description, expected deliverables, actual files changed (via git diff), and the agent's self-reported output. Returns a structured PASS/FAIL/PARTIAL verdict with per-deliverable evidence.
**Targeted re-validation.** Used in the remediation loop after Claude has attempted fixes. Gemini receives only the original findings and Claude's resolution evidence. Returns per-finding verdicts: RESOLVED, UNRESOLVED, or REGRESSED. PASS requires all findings resolved.
Every validation is recorded in conductor-state.json with: validation ID, agent name, verdict, completion percentage, deliverables checked/passed/failed, specific issues found, attempt number, phase/step context, and the action taken by the conductor. Aggregate statistics track pass/fail/partial/error counts, re-dispatches triggered, and average completion percentage across the entire workflow.
If Gemini CLI is unavailable, the conductor logs a warning and proceeds. Gemini outage is non-blocking — it degrades the accountability guarantee but doesn't halt the workflow. The Gemini Validator itself is never recursively validated.
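That graceful-degradation shape can be sketched as a wrapper around the CLI call. The command and flags here are assumptions, not the real invocation; the point is that every failure mode collapses to the non-blocking ERROR verdict.

```python
# Sketch of a non-blocking Gemini CLI call. The command tuple is an
# assumed invocation, parameterized so the degradation path is testable.
# Any failure (binary missing, timeout, non-zero exit, bad JSON) maps to
# the ERROR verdict, which the conductor logs and proceeds past.

import json
import subprocess

def gemini_validate(prompt: str, cmd=("gemini", "-p"), timeout_s: int = 120) -> dict:
    try:
        proc = subprocess.run(
            [*cmd, prompt],
            capture_output=True, text=True, timeout=timeout_s, check=True,
        )
        return json.loads(proc.stdout)  # expected shape: {"verdict": ..., ...}
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError) as e:
        # Gemini outage is non-blocking: degrade instead of halting.
        return {"verdict": "ERROR", "reason": str(e)}
```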
```
Build a multi-agent orchestration system for Claude Code:
1. TIER CLASSIFICATION: 5-signal weighted scoring (scope 0.30, type 0.25,
   risk 0.25, ambiguity 0.20 + intent sensitivity bonus) → 1.0-4.0 score
   → TRIVIAL/MINOR/STANDARD/MAJOR tier workflows
2. STATE SCHEMA (882 lines): Validated JSON with:
   - Workflow: project, tier, phase, step, task queue, completed tasks
   - NHI Registry: agent instances (nhi_* IDs), spawn/terminate, parent lineage, tokens
   - Checkpoints: 9 trigger types, git SHA, BRD hash, phase/step snapshot
   - Dead Letters: blocked (reason, since) + failed (8 types, retry, escalation)
   - Cost: budget_limit_usd, token in/out, estimated_cost, exceeded flag
   - Circuit Breaker: closed/open/half_open, failure threshold, cooldown
   - Intent: objectives, trade-offs, delegation_boundaries, prohibited_behaviors
   - Governance: manifest_id/version/hash, trust 1-5, classification, audit_session_id
   - Gemini Validations: per-agent audit trail (verdict, completion %, deliverables,
     finding resolutions, attempt counts), aggregate stats
   - 10 verification gates with advisory findings and severity breakdowns
3. 29 AGENTS (markdown files):
   Core (15): orchestrator, critic, gemini-validator, checkpoint, builder,
   architect, QA, QA review, CISO, research, project setup, code reviewer,
   compliance, doc gen, completeness validator
   Specialized (14): frontend, devops, database, API design/docs, performance,
   observability, pentest, LLM security, n8n, analyze, bug find, refactor, advisor
4. GEMINI VALIDATION PROTOCOL:
   After every agent dispatch, validate output via Gemini CLI independently.
   Two modes: full validation (initial check) and targeted re-validation
   (per-finding check after remediation). Verdicts: PASS/FAIL/PARTIAL/ERROR.
   Finding-level remediation loop: each issue addressed individually with
   cited evidence (file:line changed, rationale). Per-finding re-validation
   returns RESOLVED/UNRESOLVED/REGRESSED. Max 2 remediation loops, then
   escalate to human. Gemini unavailability degrades gracefully (non-blocking).
5. BRD PIPELINE: Requirements → BRD (Section 3.6 Intent Engineering) →
   Technical Spec → Implementation Plan → Code → Tests → Security → Review
   Tracked via BRD-tracker.json with extraction-to-completion traceability
6. CAPABILITY ROUTING: YAML matrix mapping task types to agents by
   capability match + trust level. Fallback routing when specialist unavailable.
7. HOOKS: SessionStart (detect state, inject status), PostToolUse (validate
   state schema on Write/Edit). Wire PostToolUse in settings.json.
Build as a Claude Code plugin with 29 agent .md files, state schema,
capability matrix, workflow templates, and hook scripts.
```
Agents claiming "done" is unreliable — they skip steps, stub functions, and produce incomplete output under context pressure. Using a separate AI (Gemini CLI) as an independent validator eliminates self-grading bias. Finding-level remediation ensures each specific issue is addressed with evidence, not hand-waved away in a bulk re-run. The ~5-15 second overhead per agent dispatch is negligible compared to the cost of shipping incomplete work.
Strict JSON schema validation catches state corruption immediately. Every Write/Edit operation validates via PostToolUse hook. The schema documents every field, constraint, and relationship — it's both enforcement and documentation. Gemini validation results, finding resolutions, and aggregate stats all have explicit schema definitions.
Every agent instance gets a unique NHI ID, parent lineage, tool/token tracking, and explicit spawn/terminate lifecycle. This enables forensic tracing, cost attribution, and permission auditing at the individual agent level.
Max 2 retries, then escalate. The circuit breaker pattern (closed→open→half_open) prevents cascading failures. A failing agent doesn't burn the entire token budget — it trips the breaker and the orchestrator adapts.
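A minimal sketch of that breaker, with an assumed failure threshold and cooldown (the real values live in the state schema's circuit_breaker block):

```python
# Sketch of the closed → open → half_open breaker. Threshold and cooldown
# defaults are illustrative assumptions.

import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown_s: float = 300.0):
        self.state = "closed"
        self.failures = 0
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.opened_at = 0.0

    def allow(self) -> bool:
        """May the orchestrator dispatch to this agent right now?"""
        if self.state == "open" and time.monotonic() - self.opened_at >= self.cooldown_s:
            self.state = "half_open"  # cooldown elapsed: allow one probe dispatch
        return self.state != "open"

    def record(self, success: bool) -> None:
        if success:
            self.state, self.failures = "closed", 0
            return
        self.failures += 1
        # A half_open probe failure, or hitting the threshold, trips the breaker.
        if self.state == "half_open" or self.failures >= self.threshold:
            self.state, self.opened_at = "open", time.monotonic()
```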
Section 3.6 captures not just what to build but why — objectives with constraints, trade-offs with resolutions, delegation boundaries, and prohibited behaviors. This feeds the constitutional observer for drift detection during execution.
When remediation fixes specific findings, Gemini re-validates only those findings — not the entire output. This keeps the remediation loop tight and focused, avoids discovering new issues mid-fix, and provides clear per-finding accountability (RESOLVED/UNRESOLVED/REGRESSED).
The conductor is itself a plugin. 29 agents, 7 skills, hooks, and state management all use the plugin architecture. The two-layer hook system ensures state validation fires reliably on PostToolUse.
The state schema's governance block (manifest_id, trust_level, conductor_tier) feeds the governance policy engine's tier matrix. NHI lifecycle events emit to the audit bus. Human gates trigger at MAJOR tier + elevated tools.
Completed workflows generate trajectories and learnings stored in vector memory. Task specialization scores track agent performance. Gemini validation history informs future agent reliability assessments.
Multi-agent workflows consume context rapidly. The state schema's context management skill enforces a 60% budget rule. Context guard signals trigger checkpoint saves and phase pausing.
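The 60% rule reduces to a small guard; the early-warning threshold below is an assumption, and token counts are stand-ins for what the real skill reads from the state schema.

```python
# Sketch of the context budget guard. BUDGET_RATIO is from the 60% rule;
# the half-budget warning threshold is an illustrative assumption.

BUDGET_RATIO = 0.60

def context_guard(tokens_used: int, context_window: int) -> str:
    ratio = tokens_used / context_window
    if ratio >= BUDGET_RATIO:
        return "checkpoint_and_pause"  # save state, pause the phase
    if ratio >= 0.5 * BUDGET_RATIO:
        return "warn"                  # assumed early-warning signal
    return "ok"
```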
STANDARD and MAJOR tier workflows include a Code Hardener QA phase after implementation. Scan-fix-rescan cycles run until quality score reaches 1000, followed by adversarial dual-AI review (Claude + Gemini) with debate resolution for disputed findings.