Agentic SDLC Harness

A small set of rules and conventions inside .cursor/ that turns ad-hoc AI coding into a reviewable software pipeline: a request becomes a design, then test-driven implementation (tests shape the code), review, more tests where needed, and recorded evidence — each step handed off to a specialized agent, and each workflow gated by the human. Hard limits (protected files, generated code, API contract discipline) live in .cursor/rules/guardrails.mdc and apply on every run.

1 The Idea

One general-purpose agent doing everything tends to drift, forget context, and skip checks. The harness splits the work into small, specialized agents and uses written files as the bridge between them — with the human supervising every workflow boundary.

Request -> Design -> Code -> Review -> Tests + Test review -> Evidence

Each step is performed by a different agent. Each step writes a short Markdown file the next agent can read. No agent has to remember the previous chat.

Test-driven development (TDD) is central: behavior is expressed as tests early, and implementation is written to satisfy those tests. That keeps the result honest — you get working code with proof, not a plausible story — and it catches mistakes before review and merge.

Minimalism by default: every coder and reviewer prompt carries an explicit "no bloat, no over-engineering" directive. Coders write the smallest change that satisfies the tests; reviewers flag speculative abstractions, unused options, and scope creep as defects, not stylistic taste.

2 Who Does What

Orchestration is shared between the main agent and the human. The main agent runs the inside of a workflow; the human runs the gates between workflows.

Human (You)

The top-level orchestrator. Approves the design, mid-reviews each workflow's output, decides when to start the next workflow, and is the only one allowed to expand scope. Nothing ships without your sign-off.

Main agent

In-workflow orchestrator. Classifies the request, dispatches specialists in the right order, and writes handoff files. Stops at workflow boundaries and waits for the human.

Designers

Turn a fuzzy request into a short design with explicit scope, non-goals, and what counts as "done." A separate design-reviewer challenges the ADR before the human approves it.

Coders (TDD)

Test-driven flow: extend tests for the new behavior, then write the minimal code that makes them pass. No speculative abstractions, no unused config knobs.

Reviewers

Independent agents that check code or tests against the design — and against the minimalism rule. Bloat and over-engineering are flagged as defects, not preferences. Test reviewers in particular reject shallow tests and demand proof for each critical path.

Testers

Write integration and end-to-end tests that prove the new behavior actually works — including any extra smoke or staging checks your project defines in its verification plan.

Hooks

Two automated gates wired into .cursor/hooks.json: a lint hook (subagentStop) that runs the linter after the primary coder subagent completes, and a final evidence gate (stop) that refuses to end the job until the testing handoff records the outcomes your hooks require (typically targeted tests + full suite).

Guardrails

Non-negotiables in .cursor/rules/guardrails.mdc (always applied): do not touch protected paths without explicit human approval (in this repo: root *.yaml/*.yml, .gitignore, docs/); never delete unrelated *.go files; never hand-edit *.gen.go; and keep externally reachable HTTP contracts aligned with openapi.yaml, updating the spec before the router or generated client can drift from it. The rest of .cursor/rules/ holds workflow orchestration (workflow-*.mdc, maindev.mdc); .cursor/agents/ defines the subagents; .cursor/hooks/ holds the hook scripts.

.cursor/
|--- rules/      guardrails.mdc · maindev.mdc · workflow-*.mdc
|--- agents/     specialist prompts (tdd-coder, reviewers, …)
|--- hooks/      lint + evidence shell scripts
|--- hooks.json  when hooks fire
|--- handoffs/   RUN_ID baton files (written by main agent)
+--- changes/    active change folders + archive
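The guardrail rules listed above are written for agents to follow, but they could also be checked mechanically. The following is a minimal Go sketch of such a check, not something the harness is known to ship: the function name `violation`, its messages, and the sample paths are all hypothetical. Because "unrelated" cannot be decided from a path alone, the sketch simply flags every deleted *.go file for confirmation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// violation reports why a changed path trips a guardrail, or "" if it is allowed.
// The rules mirror the guardrail prose; the function itself is a hypothetical sketch.
func violation(p string, deleted bool) string {
	p = path.Clean(p)
	base := path.Base(p)
	inRoot := !strings.Contains(p, "/")

	switch {
	case strings.HasSuffix(base, ".gen.go"):
		return "generated file: regenerate it, never hand-edit"
	case inRoot && (strings.HasSuffix(base, ".yaml") || strings.HasSuffix(base, ".yml")):
		return "protected root YAML: needs explicit human approval"
	case p == ".gitignore":
		return "protected file: needs explicit human approval"
	case p == "docs" || strings.HasPrefix(p, "docs/"):
		return "protected docs/: needs explicit human approval"
	case deleted && strings.HasSuffix(base, ".go"):
		return "deleted .go file: confirm it is related to the change"
	}
	return ""
}

func main() {
	// Sample change set; every path here is made up for illustration.
	changes := []struct {
		path    string
		deleted bool
	}{
		{"openapi.yaml", false},
		{"internal/api/server.gen.go", false},
		{"internal/runtime/loop.go", false},
		{"internal/runtime/old.go", true},
	}
	for _, c := range changes {
		if v := violation(c.path, c.deleted); v != "" {
			fmt.Printf("BLOCK %s: %s\n", c.path, v)
		} else {
			fmt.Printf("ok    %s\n", c.path)
		}
	}
}
```

Only YAML files at the repository root are treated as protected here, matching the "root *.yaml/*.yml" wording; nested YAML files pass through.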

3 How A Change Flows

A typical feature passes through four workflows. Each workflow is a fixed choreography of subagents. Between workflows, the human is the gate.

W1 Design --> [HUMAN] --> W2 Implement --> [HUMAN] --> W3 Test --> [HUMAN] --> W4 UI --> Done

Quickfix skips W1. Backend-only changes skip W4. Design-only stops after W1. Cosmetic edits skip everything.
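The skip rules above amount to a small routing table. One way to sketch it in Go is shown below, assuming the main agent classifies each request into a set of flags. The `Request` type, its field names, and the idea that flags can be combined (for example a backend-only quickfix) are assumptions made for illustration and are not stated by the harness.

```go
package main

import (
	"fmt"
	"strings"
)

// Request is a hypothetical classification result produced by the main agent.
type Request struct {
	Quickfix    bool // skips W1
	BackendOnly bool // skips W4
	DesignOnly  bool // stops after W1
	Cosmetic    bool // skips everything
}

// plan returns the workflows a request passes through, in order.
func plan(r Request) []string {
	if r.Cosmetic {
		return nil
	}
	if r.DesignOnly {
		return []string{"W1"}
	}
	var out []string
	if !r.Quickfix {
		out = append(out, "W1")
	}
	out = append(out, "W2", "W3")
	if !r.BackendOnly {
		out = append(out, "W4")
	}
	return out
}

func main() {
	// The human gate sits between every pair of workflows.
	fmt.Println(strings.Join(plan(Request{}), " -> [HUMAN] -> "))
	fmt.Println(strings.Join(plan(Request{Quickfix: true, BackendOnly: true}), " -> [HUMAN] -> "))
}
```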

Workflow 1 — Design, ADR, Handoffs

proposal-writer   --> explore + 1-3 directions
      |
      v
[ HUMAN ]             pick a direction
      |
      v
proposal-writer   --> compact docs (proposal, specs, design, tasks, verification)
      |
      v
adr-writer        --> detailed ADR (interfaces, flows, contracts)
      |
      v
design-reviewer   --> challenge the ADR (CRITICAL findings)
      |
      v
[ HUMAN ]             approve final ADR
      |
      v
arch-docs-updater --> bump Architecture.md / changelog
      |
      v
handoffs          --> 1-implementor / 1-tester / 1-ui

Workflow 2 — TDD Implementation & Documentation

tdd-coder         --> failing tests, then MINIMAL code to pass
      |
      v
lint hook         --> golangci-lint (auto on subagentStop)
      |
      v
code-reviewer-go  --> CRITICAL only: spec drift, bloat, over-engineering
      |       ^
      v       |       loop on findings
tdd-coder ----+
      |
      v
domain-documenter --> module READMEs / DOMAIN.md
      |
      v
handoff 2-implement --> [ HUMAN ] mid-review

Workflow 3 — Testing

integration-tester-go      ||  e2e-tester          (parallel)
            \                    /
             \                  /     impl bug? tdd-coder fixes code, NEVER tests
              v                v
integration-test-reviewer  ||  e2e-test-reviewer   (parallel)
      |   reject shallow tests; map each critical path to a test
      v
make test --> full-suite evidence
      |
      v
handoff 3-testing --> [ HUMAN ] mid-review

Workflow 4 — UI Implementation

ui-designer       --> ASCII layout + htmx spec
      |
      v
ui-coder   ||  tdd-coder    ||  tdd-coder-js    (parallel, only if needed)
templates      Go handlers      custom JS
      |
      v
domain-documenter --> UI module READMEs
      |
      v
ui-validator      --> LLM-driven check using live browser automation
      |               +-- creates Playwright replays for regression re-runs
      |       ^
      v       |       bounce back on FAIL
ui-coder / tdd-coder --+
      |
      v
handoff 4-ui --> [ HUMAN ] final sign-off

Automated gates (hooks)

Two hooks in .cursor/hooks.json watch the pipeline and block progress when something is missing.

tdd-coder finishes (W2 or W3 fix loop)
      |
      v
subagentStop hook --> golangci-lint run ./...   (same scope as make lint)
      |
      +--> FAIL: orchestrator receives lint output, must re-dispatch tdd-coder
      +--> PASS: continue

[entire job ends]
      |
      v
stop hook --> evidence gate on testing handoff
      |   requires (shape is project-defined in hooks):
      |     - Targeted / contract tests: PASS
      |     - Full suite (make test): PASS / BLOCKED unrelated
      |
      +--> any line missing or wrong: orchestrator gets a follow-up, job not done
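The evidence gate is essentially a text check on the testing handoff. The harness backs its hooks with shell scripts; the Go sketch below only illustrates the logic and is not the project's script. The two labels and their accepted outcomes come from the diagram above, while the "<label>: <outcome>" line format, the function name `missingEvidence`, and the sample handoff text are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// required lists each evidence label with the outcomes the gate accepts.
// The "<label>: <outcome>" line format is an assumption of this sketch.
var required = []struct {
	label    string
	accepted []string
}{
	{"Targeted / contract tests", []string{"PASS"}},
	{"Full suite (make test)", []string{"PASS", "BLOCKED unrelated"}},
}

// missingEvidence returns one message per required line that is absent
// or records an outcome the gate does not accept.
func missingEvidence(handoff string) []string {
	found := map[string]string{}
	for _, line := range strings.Split(handoff, "\n") {
		// Drop surrounding whitespace and any leading list marker.
		line = strings.TrimLeft(strings.TrimSpace(line), "-* ")
		if label, outcome, ok := strings.Cut(line, ":"); ok {
			found[strings.TrimSpace(label)] = strings.TrimSpace(outcome)
		}
	}
	var problems []string
	for _, r := range required {
		got, ok := found[r.label]
		switch {
		case !ok:
			problems = append(problems, "missing: "+r.label)
		case !contains(r.accepted, got):
			problems = append(problems, fmt.Sprintf("%s: got %q", r.label, got))
		}
	}
	return problems
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical handoff excerpt in which the full suite failed.
	handoff := "## Evidence\n- Targeted / contract tests: PASS\n- Full suite (make test): FAIL\n"
	problems := missingEvidence(handoff)
	for _, p := range problems {
		fmt.Println("evidence gate:", p)
	}
	if len(problems) > 0 {
		fmt.Println("job not done: orchestrator gets a follow-up")
	}
}
```

A check like this verifies only that the lines exist and carry an accepted outcome. It cannot tell whether the tests behind them prove anything, which is why the test reviewers exist as a separate step.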

Reviewers never silently fix things. Code bugs go to the coder, weak tests go to the tester, unmet specs go to the designer. The main agent runs the inner loops; the human runs the gates between workflows; the hooks enforce lint and evidence around them.

4 How Agents Talk To Each Other

Agents don't share a chat window. They communicate through short Markdown files written by the main agent after each step. The same files are what the human reads to mid-review a workflow.

Subagent finishes
      |   returns: changed files, tests, what was implemented
      v
Main agent writes a handoff file (1-..., 2-..., 3-..., 4-...)
      |   includes: the task, the design, the changed files, what to check
      v
Next subagent reads the handoff and answers focused questions
      |
      v
Human reads the workflow's final handoff
      |   mid-reviews: did the workflow actually deliver? any drift?
      v
launches the next workflow (or sends the current one back)

Why this matters: the reviewer sees what the code was supposed to do, not just the diff — and so does the human at every workflow boundary. Missing requirements get caught, not just style issues.
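To make the baton concrete, a handoff can be pictured as a small record rendered to Markdown. The Go sketch below is illustrative only: the `Handoff` type, the section headings, and the exact filename layout are assumptions. The source states only that a handoff includes the task, the design, the changed files, and what to check, and that files live under .cursor/handoffs/ with a step prefix and a RUN_ID.

```go
package main

import (
	"fmt"
	"strings"
)

// Handoff is a hypothetical in-memory form of a baton file.
type Handoff struct {
	Step         string // e.g. "2-implement"
	RunID        string
	Task         string
	DesignRef    string // path to the design the work must satisfy
	ChangedFiles []string
	Checks       []string // focused questions for the next agent
}

// Filename assumes a "<step>-<RUN_ID>.md" layout under .cursor/handoffs/.
func (h Handoff) Filename() string {
	return fmt.Sprintf(".cursor/handoffs/%s-%s.md", h.Step, h.RunID)
}

// Markdown renders the handoff with assumed section headings.
func (h Handoff) Markdown() string {
	var b strings.Builder
	fmt.Fprintf(&b, "# Handoff %s (%s)\n\n", h.Step, h.RunID)
	fmt.Fprintf(&b, "## Task\n%s\n\n", h.Task)
	fmt.Fprintf(&b, "## Design\n%s\n\n", h.DesignRef)
	b.WriteString("## Changed files\n")
	for _, f := range h.ChangedFiles {
		fmt.Fprintf(&b, "- %s\n", f)
	}
	b.WriteString("\n## What to check\n")
	for _, c := range h.Checks {
		fmt.Fprintf(&b, "- %s\n", c)
	}
	return b.String()
}

func main() {
	// All values below are made up for illustration.
	h := Handoff{
		Step:         "2-implement",
		RunID:        "run-001",
		Task:         "Add pagination to the list endpoint",
		DesignRef:    "docs/adr/NNN-name.md",
		ChangedFiles: []string{"internal/api/list.go", "internal/api/list_test.go"},
		Checks:       []string{"Does the handler honor the limit from the spec?"},
	}
	fmt.Println(h.Filename())
	fmt.Print(h.Markdown())
}
```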

Human-in-the-loop, by design: the main agent will not auto-chain workflows. After W1 you approve the ADR. After W2 you read the implementation handoff. After W3 you read the test evidence. After W4 you confirm the ui-validator outcome (LLM-driven browser pass and Playwright replays for regression). Scope expansion (e.g. fixing unrelated failing suites) requires explicit human permission.

5 Context Building & Progressive Disclosure

Each subagent starts with a clean context. It builds working knowledge on demand by reading layered Markdown, narrowest layer first, drilling deeper only when needed. No agent ever loads the whole repo.

Subagent spawns
      |
      v
Tier 1: handoff file              .cursor/handoffs/N-...-<RUN_ID>.md
      |   what to do, what is in scope, which other files to read
      v
Tier 2: change folder (compact)   .cursor/changes/<name>/
      |   proposal.md  specs.md  design.md  tasks.md  verification.md
      v
Tier 3: ADR (detailed design)     docs/adr/NNN-name.md
      |   interfaces, flows, contracts, data structures
      v
Tier 4: architecture narrative    docs/Architecture-short.md · Architecture.md
      |   layer boundaries, system invariants, version history
      v
Tier 5: per-domain docs           internal/<area>/DOMAIN.md · internal/<pkg>/README.md
      |   short description + exposed contracts only
      v
Tier 6: source code               internal/<pkg>/*.go (last resort)

Each tier exists because the tier above is too short to fully describe the change. A subagent stops at the deepest tier it actually needs — no info is duplicated across tiers, only linked.
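The tier walk can be sketched as an ordered list that is read only as far as needed. In the Go sketch below, the tier names and path patterns are copied from the list above, while `readUntil` and its `answered` callback are hypothetical stand-ins for the subagent's own judgment that it has enough context.

```go
package main

import "fmt"

// Tier is one documentation layer, ordered narrowest first.
type Tier struct {
	Name string
	Path string // location pattern as given in the tier list
}

var tiers = []Tier{
	{"handoff file", ".cursor/handoffs/N-...-<RUN_ID>.md"},
	{"change folder", ".cursor/changes/<name>/"},
	{"ADR", "docs/adr/NNN-name.md"},
	{"architecture narrative", "docs/Architecture-short.md · Architecture.md"},
	{"per-domain docs", "internal/<area>/DOMAIN.md · internal/<pkg>/README.md"},
	{"source code", "internal/<pkg>/*.go"},
}

// readUntil walks the tiers in order and stops at the first tier after which
// answered reports true. It returns the tiers that were read.
func readUntil(answered func(read []Tier) bool) []Tier {
	for i := range tiers {
		read := tiers[:i+1]
		if answered(read) {
			return read
		}
	}
	return tiers // nothing sufficed: the agent ended up in source code
}

func main() {
	// Pretend the question is answered once the ADR has been read.
	read := readUntil(func(read []Tier) bool {
		return read[len(read)-1].Name == "ADR"
	})
	for i, t := range read {
		fmt.Printf("Tier %d: %-14s %s\n", i+1, t.Name, t.Path)
	}
}
```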

Per-domain docs are first-class. Every Go package under internal/ ships its own README.md with a short description, the public types/interfaces, and the contracts they expose (errors, invariants, layer rules). Larger boundary directories (internal/runtime, internal/ui, internal/models/*) ship a DOMAIN.md tying the packages together. Coders and reviewers must use the language and contracts from these files; domain-documenter updates them after every change.

Progressive disclosure ⇒ minimalism in docs too. Handoffs are short. Change-folder docs are short. READMEs describe public contracts only — implementation lives in code. If a detail belongs in code, runtime flows, or OpenAPI, it does not belong duplicated into Markdown.

6 Code Review and Test Review

Reviews are not "look at the diff and approve." They follow the same handoff pattern, and they explicitly check for bloat as well as correctness.

Code review

A reviewer agent gets the design, the changed files, and the requirement the code was meant to satisfy. It asks: does the code actually do what the design said? Are there missing tests or broken contracts? Does it add anything the spec did not require: speculative abstractions, dead options, premature generalization, or scope creep?

Test review

A separate reviewer gets the test files and the requirement they're supposed to prove. It hunts shallow tests — assertions that never fail, happy-only paths, mocks that swallow the real contract — and rejects coverage theater. It cross-checks specs.md / verification.md so every critical path for the change (success, failure, and the riskiest edge) is exercised by at least one test that would break if the behavior regressed.
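As an illustration of what "would break if the behavior regressed" means, the Go sketch below runs one table of cases (success, failure, and the riskiest edges) against both a correct and a regressed implementation. `ParseLimit`, its 1..100 range, and the case table are hypothetical and not taken from this repository. In a real package the table would live in a `_test.go` file and use the `testing` package; a plain `main` is used here so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// ErrLimit is returned for any limit outside 1..100 (a made-up contract).
var ErrLimit = errors.New("limit must be an integer in 1..100")

// ParseLimit is a hypothetical function under test.
func ParseLimit(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil || n < 1 || n > 100 {
		return 0, ErrLimit
	}
	return n, nil
}

type limitCase struct {
	name    string
	in      string
	want    int
	wantErr bool
}

// One case per critical path: success, failure, and the boundary edges.
var limitCases = []limitCase{
	{"success", "10", 10, false},
	{"failure: not a number", "abc", 0, true},
	{"edge: upper bound accepted", "100", 100, false},
	{"edge: just past upper bound", "101", 0, true},
	{"edge: zero rejected", "0", 0, true},
}

// failures runs every case against fn and returns the names of failing cases.
func failures(fn func(string) (int, error)) []string {
	var out []string
	for _, c := range limitCases {
		got, err := fn(c.in)
		if (err != nil) != c.wantErr || got != c.want {
			out = append(out, c.name)
		}
	}
	return out
}

func main() {
	fmt.Println("correct implementation fails:", failures(ParseLimit))

	// A regression that forgets the upper bound must be caught by the table.
	regressed := func(s string) (int, error) {
		n, err := strconv.Atoi(s)
		if err != nil || n < 1 {
			return 0, ErrLimit
		}
		return n, nil
	}
	fmt.Println("regressed implementation fails:", failures(regressed))
}
```

A table holding only the "success" row would pass against both implementations, which is the kind of happy-path-only test the reviewer is told to reject.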

Minimalism is a review criterion. Coders are told to write the smallest implementation that satisfies the tests, and reviewers are told to flag the opposite — unused config knobs, layers added "for future flexibility", duplicated abstractions, options nothing reads — as CRITICAL findings, not nitpicks.

Either reviewer can send the work back. Code problems go to the coder. Test problems go to the tester. The main agent handles the inner loop until both reviewers are happy; the human handles the outer loop between workflows.
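The routing rule and the inner loop can be sketched together. In the Go sketch below, the mapping of findings to coder, tester, and designer follows the text; the `Finding` type, the `maxRounds` cap, and the escalation of unknown kinds to the human are assumptions added so that the loop terminates and has a fallback.

```go
package main

import "fmt"

// Finding is a hypothetical reviewer result.
type Finding struct {
	Kind string // "code", "test", or "spec"
	Note string
}

// owner returns the role that must fix a finding; reviewers never fix it themselves.
func owner(f Finding) string {
	switch f.Kind {
	case "code":
		return "coder"
	case "test":
		return "tester"
	case "spec":
		return "designer"
	}
	return "human" // assumption: unknown kinds escalate to the human
}

// innerLoop re-runs review until it returns no findings or maxRounds is hit.
// It reports how many rounds ran and whether the work came back clean.
func innerLoop(review func(round int) []Finding, maxRounds int) (rounds int, clean bool) {
	for round := 1; round <= maxRounds; round++ {
		findings := review(round)
		if len(findings) == 0 {
			return round, true
		}
		for _, f := range findings {
			fmt.Printf("round %d: %q -> %s\n", round, f.Note, owner(f))
		}
	}
	return maxRounds, false
}

func main() {
	// Simulated reviewer: one code finding, then one test finding, then clean.
	review := func(round int) []Finding {
		switch round {
		case 1:
			return []Finding{{"code", "unused config knob"}}
		case 2:
			return []Finding{{"test", "assertion can never fail"}}
		}
		return nil
	}
	rounds, clean := innerLoop(review, 5)
	fmt.Printf("rounds=%d clean=%v\n", rounds, clean)
}
```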

7 Pros And Cons

Pros

Large repos with little tech debt (the main payoff): progressive disclosure, handoffs, and narrow specialist prompts mean the whole codebase is never loaded into one context. Each change is scoped, reviewed, and test-backed, so a huge tree stays shippable instead of accumulating silent debt from endless single-thread vibecoding.
TDD-first quality: tests lead implementation, so the shipped behavior is provable and regressions are harder to sneak in.
Minimalism is enforced: coders are told to write the least code that passes the tests; reviewers reject bloat and over-engineering as defects.
Traceable: every change leaves a paper trail of design, code, tests, and evidence.
Specialized agents: coding, reviewing, and testing are different jobs done by different prompts.
Different models per role: generation, code review, and test review can run on different models, so each model's blind spots are covered by another's strengths.
Human-in-the-loop: workflows do not auto-chain; the human gates every transition, so drift is caught early instead of compounding.
Less drift: agents read written files instead of relying on a long, fragile chat memory.
Built-in checks: auto-lint and evidence checks run without anyone remembering to ask.

Cons

~10× the cost of vibecoding: versus a single chat where you hack until it feels right, this stack burns far more tokens and wall-clock time — specialists, reviewers, testers, and validators each add their own model runs.
Heavy for tiny tasks: small fixes still need a quickfix path; without it, the full pipeline is overkill.
File noise: handoff and change-folder files pile up over time.
Slower: design approvals, review loops, and human gates add waiting time.
Requires an attentive human: mid-reviews and gate decisions can't be skipped without giving up most of the safety the harness provides.
Mostly textual checks: automated gates verify shape, not deep correctness.

8 What Could Be Improved

Active-task index: a single file showing what's currently in flight, so it's obvious which handoffs belong together.
Auto-archive: move completed handoffs out of the working directory once a change ships.
Schema checks for handoffs: validate that each file has the expected sections, not just lint and evidence.
Smarter evidence gate: a reviewer that confirms tests actually prove the listed requirements, not just that the file mentions them.

One-sentence summary: the harness is a relay race where the baton is a written file, the runners are specialized agents, the rules forbid any runner from faking the handoff or padding the work, and the human is the referee at every changeover; the payoff is large-repo work with contained tech debt instead of one overloaded chat.