The Agentic Harness: How to Ship Code with AI Agents Without Losing Your Mind

AI agents that just “vibe code” break in production. The failure mode is always the same: the agent loops, hallucinates a passing test, or picks up the wrong plan. The harness fixes this with a feedback loop that hard-stops at every failure, routes back to the right phase, and leaves a full audit trail.


The Loop

The harness is a continuous five-phase cycle: Plan -> Research -> Build -> Eval -> Report. The eval result determines what triggers next:

         +-------------------------------------------+
         v                                           |
      [ Plan ] -> [ Research ] -> [ Build ] -> [ Eval ] -> [ Report ]
                                                     |
                              +----------------------+------------------+
                              v                      v                  v
                           feature                  bug               next
                         (new scope)            (fix loop)        (iterate or
                                                                  exit loop)

There are three routing outcomes from Eval:

| Result  | Condition                                        | Next action                                                          |
|---------|--------------------------------------------------|----------------------------------------------------------------------|
| feature | Eval passes + new capability identified          | Report runs, then new Plan cycle with expanded scope                 |
| bug     | Eval fails (test failure, score below threshold) | Re-enter Plan with failure details as input; Report is skipped       |
| next    | Eval passes, no new scope                        | Iterate on current output - tighten evals, improve coverage, or ship |

The loop only exits when the orchestrator closes it or when next produces no further work.
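
The routing table above can be sketched as a small dispatch helper. This is illustrative only - `route_eval` and its return strings are sketch names, not harness commands:

```shell
# Illustrative routing helper - maps an eval result to what runs next.
# The function name and return strings are assumptions, not a harness CLI.
route_eval() {
  case "$1" in
    feature) echo "report-then-plan" ;;   # Report runs, then a new Plan cycle
    bug)     echo "plan" ;;               # skip Report, re-enter Plan with failure details
    next)    echo "iterate-or-exit" ;;    # tighten evals, improve coverage, or ship
    *)       echo "unknown result: $1" >&2; return 1 ;;
  esac
}
```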


The Plan File: One Source of Truth

The plan file is born in Phase 1 and grows through every phase. YAML frontmatter accumulates each phase’s status, timestamps, and outputs. Every downstream phase starts from this file - not from the conversation. Each phase gets a fresh 200k context window, seeded by the plan.

---
title: ""
date: YYYY-MM-DD
goal: ""
outcome: ""
definition_of_done: ""
prerequisites: [uv, gws, headless-obsidian]
sync_hash: ""
build_mode: harness    # harness (default) | gsd-headless-auto (fast-path)
 
# Phase 1 - written by Planner
plan_status: done
plan_completed_at: YYYY-MM-DD HH:MM
 
# Phase 2 - written by Research agents on completion
research_status: done | skipped
research_completed_at: YYYY-MM-DD HH:MM
research_summary: ""       # one-line TL;DR across all researchers
research_sources: []       # union of all source URLs
 
# Phase 3 - written by Builder on completion
build_status: done | failed
build_completed_at: YYYY-MM-DD HH:MM
build_pr_url: ""           # gh pr view --json url -q .url
build_failure_reason: ""   # rounds_exhausted | test_failure | lint_failure | env | other
 
# Phase 4 - written by Eval agent
eval_score: ~
eval_result: feature | bug | next
eval_completed_at: YYYY-MM-DD HH:MM
 
# Phase 5 - written by Reporter (skipped on bug routing)
report_status: done | skipped
report_completed_at: YYYY-MM-DD HH:MM
report_channels_sent: []   # e.g. [slack, github, obsidian]
---

Every phase must update its own status block in the frontmatter before exiting. Read, merge, write back - never rewrite the whole file.
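
A minimal sketch of that read-merge-write discipline, assuming each key sits on its own frontmatter line as in the template above (a real harness might prefer a YAML-aware tool); `stamp_key` is an illustrative name:

```shell
# Update a single frontmatter key in place - never rewrite the whole file.
# Assumes the key already exists on its own line, as in the plan template.
stamp_key() {  # stamp_key <plan-file> <key> <value>
  local file=$1 key=$2 value=$3
  sed -i.bak "s|^${key}:.*|${key}: ${value}|" "$file" && rm -f "${file}.bak"
}
```

A phase would call this once per field it owns, e.g. `stamp_key "$PLAN_FILE" build_status done`.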


DET/AGENT Interleaving - Shift Feedback Left

The key insight: deterministic (DET) steps bracket every agentic (AGENT) step. DET steps produce hard signals. AGENT steps are gsd -p invocations - single-shot, scriptable, model-selectable.

[DET]   claim sprite from pool
[DET]   sync repo to sprite (git clone/pull)
[AGENT] gsd -p "implement task: <goal>" --model openai/gpt-4.1
[DET]   run linter      <- ruff / biome / eslint (shift left: fast, <5s)
[AGENT] gsd -p "fix lint errors: $(cat .lint-output.txt)" --model openai/gpt-4.1-mini  (max 1 round)
[DET]   run tests        <- pytest / bun test (selective subset)
[AGENT] gsd -p "fix test failures: $(cat .test-output.txt)" --model openai/gpt-4.1     (max 2 rounds)
[DET]   push branch
[DET]   open PR
[DET]   update plan frontmatter: build_status, build_completed_at, build_pr_url
[DET]   release sprite (destroy + async pool replenishment)

Linting runs before tests. Tests run before CI. Issues caught locally cost nothing; CI failures cost tokens and time.

Hard round limits cap every AGENT step:

| Phase    | Max agent rounds |
|----------|------------------|
| Lint fix | 1                |
| Test fix | 2                |
| CI fix   | 2                |

If rounds are exhausted, route to bug - re-enter Plan with the failure as input. Never let an agent spin indefinitely.
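
The cap itself is a few lines of shell. In this sketch, `capped_fix` and the check/fix commands it wraps are placeholders for the real lint/test runners and `gsd -p` invocations:

```shell
# Bounded retry sketch: run the fix step at most <max_rounds> times, then
# give up and signal the bug route. All names here are placeholders.
capped_fix() {  # capped_fix <max_rounds> <check_cmd> <fix_cmd>
  local max=$1 check=$2 fix=$3 round=0
  while ! eval "$check"; do
    round=$((round + 1))
    if [ "$round" -gt "$max" ]; then
      echo "rounds_exhausted"   # route to bug: re-enter Plan with the failure
      return 1
    fi
    eval "$fix"                 # one agent round, e.g. a gsd -p fix call
  done
  echo "pass"
}
```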

The gsd -p calls support per-step model selection. Full model for implementation and test-fix; mini for lint-fix (mechanical, cheap):

# Implementation (full model)
gsd -p "Implement the following task. Read the plan at .agents/plans/<plan>.md first.
<task description from plan goal>" \
  --model openai/gpt-4.1
 
# Lint fix (cheap model - mechanical fix, no reasoning needed)
gsd -p "Fix the following lint errors. Read each file before editing.
$(cat .lint-output.txt)" \
  --model openai/gpt-4.1-mini
 
# Test fix (full model - may require understanding test intent)
gsd -p "Fix the following test failures. Read the test files and relevant source before editing.
$(cat .test-output.txt)" \
  --model openai/gpt-4.1

Each invocation is a fresh session - no memory carries across rounds. For multi-round debugging where prior context matters, pipe the previous output into the next prompt or use gsd --continue to resume the last session.


Sprite Isolation

Each task runs inside a sprite - a Firecracker microVM from sprites.dev. Sprites prevent host pollution, enable true parallelization, and are safe to destroy on failure.

# Claim a pre-warmed sprite (ready in <10s from pool)
SPRITE=$(~/.agents/skills/sprite/scripts/claim-sprite.sh my-task)
 
# Run build commands inside sprite
sprite exec -s $SPRITE git clone <repo> /workspace
sprite exec -s $SPRITE bash -c "cd /workspace && <build-command>"
 
# Release when done (destroys sprite, replenishes pool async)
~/.agents/skills/sprite/scripts/release-sprite.sh $SPRITE

Rules:

  • One sprite per task - never shared between concurrent agents
  • Always claim from pool first; sprite create only if pool is empty
  • Always release after task (pass or fail)
  • Never mount production credentials into a sprite
  • Pool size should match typical parallelism - 3 tasks in flight = pool size 3-5 (buffer for replenishment lag)

The pool means claim latency is under 10 seconds. Destruction on failure means a bad build can’t corrupt the next one.
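
One way to guarantee the release-on-pass-or-fail rule is an EXIT trap. In this sketch, `CLAIM_CMD` and `RELEASE_CMD` stand in for the claim-sprite.sh and release-sprite.sh scripts above:

```shell
# Release-on-exit sketch: the trap fires whether the task passes, fails,
# or is interrupted. CLAIM_CMD / RELEASE_CMD are placeholder variables,
# not part of the sprite tooling.
with_sprite() {  # with_sprite <task-name> <command...>
  local name=$1; shift
  (
    sprite=$($CLAIM_CMD "$name") || exit 1
    trap '$RELEASE_CMD "$sprite"' EXIT    # always release, pass or fail
    "$@"
  )
}
```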


Parallel Builds via cmux

Each task gets its own cmux workspace and its own sprite. Spawn N workspaces to run N tasks concurrently with zero shared state:

# Spawn parallel agent workspaces - each claims its own sprite
for task in auth-fix perf-refactor docs-update; do
  ~/.claude/skills/cmux/scripts/spawn-workspace.sh "agent-$task" \
    --prompt "claim sprite, implement: $task, release sprite when done"
done
 
# Check status across all workspaces
cmux list-workspaces



Scoped Context - MDC Rules

Rules live in .agents/rules/ using MDC frontmatter. They attach automatically as the agent traverses the filesystem - no manual context loading, no bloated system prompts.

{project}/
└── .agents/
    └── rules/
        ├── global.md          # applies everywhere
        ├── api.md             # globs: src/api/**
        ├── frontend.md        # globs: src/components/**, src/pages/**
        ├── tests.md           # globs: **/*.test.*, tests/**
        └── migrations.md      # globs: db/migrations/**

Rule file format:

---
globs: src/api/**
description: API layer conventions - auth middleware, error shapes, rate limiting
---
 
# API Rules
- All routes must validate with zod before handler
- Return `{ error: string }` on 4xx, never expose stack traces
- Rate limit headers required on all public endpoints

The agent reads only rules whose glob matches the files being accessed. On a large repo this prevents context explosion - a frontend agent doesn’t need to know the migration conventions, and vice versa.
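
The matching step can be approximated with plain shell patterns. This sketch assumes a single `globs:` line per rule file; `rules_for` is an illustrative name, not part of any loader:

```shell
# Print the rule files whose glob matches a given path. Assumes one
# "globs:" frontmatter line per rule; ** behaves like * in a case pattern.
rules_for() {  # rules_for <file-path> [rules-dir]
  local path=$1 dir=${2:-.agents/rules}
  for rule in "$dir"/*.md; do
    glob=$(sed -n 's/^globs: //p' "$rule" | head -1)
    [ -n "$glob" ] || continue      # no globs line: skip in this sketch
    case "$path" in
      $glob) echo "$rule" ;;        # shell pattern match against the glob
    esac
  done
}
```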


Skills: The Tool Shed

Skills use two-tier resolution: project-scoped first, then global.

~/.agents/skills/           <- global: CLI wrappers, tool skills, personas
{project}/.agents/skills/   <- project: release cycle, deploy, changelog
{project}/.claude/skills/   <- project (alt): Claude Code projects may use this path

The Library (~/.agents/library.yaml) is the tool shed index. Before acting, the agent queries it to discover available skills rather than loading everything upfront:

# Discover what skills exist for this task
qmd search "deploy cloudflare" --collection agents
# or read library.yaml directly to select relevant skills

Never dump all skills into context. Select 1-3 relevant skills per task. The library skill (~/.agents/skills/library/SKILL.md) is the meta-tool for this selection.

Required project-scoped skills:

{project}/
└── .agents/
    └── skills/
        ├── build/SKILL.md         # REQUIRED - image, build command, env vars
        ├── test/SKILL.md          # REQUIRED - test command, coverage flags
        ├── release/SKILL.md       # tag, changelog, gh release create
        ├── deploy/SKILL.md        # project-specific deploy steps
        └── changelog/SKILL.md     # conventional commits -> CHANGELOG.md

If build/SKILL.md or test/SKILL.md are missing, the agent creates them before Phase 3 begins.
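
The two-tier (plus alt-path) lookup is a first-match walk over the three directories above; `resolve_skill` is an illustrative name:

```shell
# Project-scoped skills win; global is the fallback. First match resolves.
resolve_skill() {  # resolve_skill <skill-name> [project-root]
  local name=$1 root=${2:-.}
  for dir in "$root/.agents/skills" "$root/.claude/skills" "$HOME/.agents/skills"; do
    if [ -f "$dir/$name/SKILL.md" ]; then
      echo "$dir/$name/SKILL.md"
      return 0
    fi
  done
  return 1   # not found in any tier
}
```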


Parallel Research Phase

The Research phase spawns multiple specialized agents in parallel. Each writes its own output file; a synthesis step merges findings and updates plan frontmatter.

| Agent                | Skill / Tool           | Optional |
|----------------------|------------------------|----------|
| Web researcher       | web_search, fetch_page | No       |
| File tree researcher | ft, search, grep       | No       |
| Email researcher     | gws gmail*             | Yes      |
| Slack researcher     | slackcli               | Yes      |
| Linear researcher    | linear-cli             | Yes      |
Each researcher is a gsd -p single-shot call:

PLAN="$(date +%F)-{title}"
RESEARCH_FILE=".agents/plans/${PLAN}-research.md"
 
# Web researcher (full model - needs reasoning over sources)
gsd -p "Research the following topic for an upcoming build task: <goal>.
Use web_search and fetch_page to gather sources.
Write a structured markdown summary with YAML frontmatter (summary, sources, eval_score)
conforming to the #Research-md contract to: ${RESEARCH_FILE}" \
  --model openai/gpt-4.1
 
# File tree researcher (mini model - grep + read, no reasoning needed)
gsd -p "Map the codebase areas relevant to: <goal>.
Use grep, find, and read to identify key files, entry points, and patterns.
Append findings as a '## Codebase' section to: ${RESEARCH_FILE}" \
  --model openai/gpt-4.1-mini

After all researchers complete, a synthesis step merges findings and stamps the plan:

gsd -p "Read ${RESEARCH_FILE}. Synthesize all sections into a one-line summary.
Update the plan file .agents/plans/${PLAN}-plan.md frontmatter:
set research_status: done, research_completed_at: $(date '+%Y-%m-%d %H:%M'),
research_summary: <one-line synthesis>, research_sources: <union of all sources>." \
  --model openai/gpt-4.1-mini

The research file is passed to the build phase to eliminate cold-start. The builder knows the codebase landscape before it writes a single line.
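
The fan-out/fan-in shape of the phase can be sketched with background jobs and `wait`; `run_researchers` is an illustrative wrapper, and each argument stands in for one of the `gsd -p` calls above:

```shell
# Launch each researcher in the background, then wait for all of them
# before the synthesis step runs. Any failed researcher fails the phase.
run_researchers() {  # run_researchers "<cmd>" "<cmd>" ...
  pids=""
  for cmd in "$@"; do
    eval "$cmd" &            # one researcher per background job
    pids="$pids $!"
  done
  rc=0
  for pid in $pids; do
    wait "$pid" || rc=1      # collect every exit status
  done
  return $rc
}
```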


Closing

The harness doesn’t replace the developer - it gives AI agents the same guardrails a senior engineer uses: spec first, test before ship, cap your retry loops, isolate your environments.

The result is a pipeline you can hand off without babysitting. The plan file tells you exactly where any run failed and why. The DET steps give you deterministic checkpoints that no hallucination can skip past. The sprites keep concurrent builds from stepping on each other. And the routing table means a failing build becomes an input to the next plan, not a prompt for the agent to make something up.

That’s the harness. Ship it, don’t babysit it.