Verification in Clawboo: Builder ≠ Judge Principle

Verification is the rule that makes done mean verified. When an agent completes file-mutating work, two independent checks must agree before that task can reach done: a deterministic gate (the task’s own build, test, or lint command, judged by its exit code) and, for risky or large changes, a read-only critic (an independent reviewer that cannot push changes). The principle is builder ≠ judge: the agent that did the work never certifies its own work. This matters because self-grading is a known failure mode: a code generator that also grades its own output is biased, intentionally or not, toward assessing itself favorably. Clawboo’s only signals into done are a machine truth (an exit code) and a structurally independent reviewer.

The verification gate is built into the board state machine, not an optional step. Every transition to done is checked against the task’s stored verification result — by any caller, through any path. The only escape is an explicit, audited humanOverride.

What you will see

When an agent completes work and transitions a task to in_review, verification runs automatically:

The build/test gate runs. The task’s configured verify command executes inside the isolated worktree. If it exits non-zero or times out, the task moves back to in_progress with a structured explanation of what failed.
The critic runs (for risky changes). If the build passes and the change is large, delegated, or explicitly flagged sensitive, an independent reviewer examines the diff from a detached checkout it cannot push to. Blocking findings (security issues, crashes, data loss, wrong algorithms, or unmet acceptance criteria) send the task back for fixes.
The task reaches done when both checks are satisfied.

If the fix loop exhausts its budget without resolving all issues, the task exits as completed_with_debt — see below.
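The routing described above can be sketched as a small pure function. This is only an illustration of the decision order (gate first, critic second), not Clawboo's implementation; `PassResult`, its fields, and `next_status` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class PassResult:
    """Outcome of one verification pass (hypothetical shape)."""
    gate_passed: bool           # verify command exited 0 and did not time out
    critic_ran: bool            # False when no critic trigger applied
    blocking_findings: int = 0  # number of blocking critic findings


def next_status(result: PassResult) -> str:
    """Decide where a task in in_review goes after one verification pass."""
    if not result.gate_passed:
        return "in_progress"    # build/test gate failed: back to the builder
    if result.critic_ran and result.blocking_findings > 0:
        return "in_progress"    # critic found something that must be fixed
    return "done"               # every check that applied is satisfied
```

The retry budget is deliberately left out of this sketch; it only shows what a single pass decides.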

The two verification layers

The deterministic gate

The deterministic gate is the strongest signal because it reads a truth: an exit code. It runs the task’s configured verify command (set as VERIFY_CMD in the task’s worktree) and records:

The command that ran
The exit code
Whether it timed out
A scrubbed tail of stdout/stderr as evidence
The duration

A missing verify command is a structured failure, not a skip. done requires real evidence; the absence of a check cannot certify completion. Until you configure VERIFY_CMD for a task, it will fail the gate.
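To make the recorded fields concrete, here is a minimal sketch of a gate runner built on Python's standard `subprocess` module. It is an assumption-laden illustration, not Clawboo's code: `GateResult`, `run_gate`, the default timeout, and the tail length are all hypothetical, the command is passed in directly because how VERIFY_CMD is read is not specified here, and secret scrubbing of the output is not shown.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResult:
    command: Optional[str]    # the command that ran (None if none configured)
    exit_code: Optional[int]  # None when the command produced no exit code
    timed_out: bool
    evidence_tail: str        # tail of combined stdout/stderr (scrubbing omitted)
    duration_s: float

    @property
    def passed(self) -> bool:
        return self.exit_code == 0 and not self.timed_out


def _tail(output, limit: int) -> str:
    # TimeoutExpired may carry bytes even in text mode, so normalise first.
    if isinstance(output, bytes):
        output = output.decode(errors="replace")
    return (output or "")[-limit:]


def run_gate(worktree: str, command: Optional[str],
             timeout_s: float = 600.0, tail_chars: int = 2000) -> GateResult:
    if not command:
        # A missing verify command is a structured failure, not a skip.
        return GateResult(None, None, False, "no verify command configured", 0.0)
    start = time.monotonic()
    try:
        proc = subprocess.run(command, shell=True, cwd=worktree,
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              text=True, timeout=timeout_s)
        code, timed_out, output = proc.returncode, False, proc.stdout
    except subprocess.TimeoutExpired as exc:
        code, timed_out, output = None, True, exc.output
    return GateResult(command, code, timed_out, _tail(output, tail_chars),
                      time.monotonic() - start)
```

For example, `run_gate(worktree_path, "npm test")` would return a result whose `passed` property is true only when the command exits 0 within the timeout.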

The independent critic

When the build gate passes and the change meets at least one of these criteria, an independent critic reviews it:

Trigger	Threshold
`riskFlag`	The task was explicitly marked sensitive
Delegated work	Any task with a parent (depth > 0)
Large file count	More than 5 files changed
Large diff	More than 300 lines changed
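The trigger table can be read as a simple predicate. The sketch below encodes the four rows with the thresholds given above; `ChangeSummary`, `critic_triggers`, and the returned reason labels are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List

MAX_FILES_CHANGED = 5    # critic runs when MORE than 5 files changed
MAX_LINES_CHANGED = 300  # critic runs when MORE than 300 lines changed


@dataclass
class ChangeSummary:
    risk_flag: bool     # mirrors riskFlag on the task
    depth: int          # 0 for a top-level task, > 0 for delegated work
    files_changed: int
    lines_changed: int


def critic_triggers(change: ChangeSummary) -> List[str]:
    """Return every trigger that applies; an empty list means no critic run."""
    reasons = []
    if change.risk_flag:
        reasons.append("riskFlag")
    if change.depth > 0:
        reasons.append("delegated")
    if change.files_changed > MAX_FILES_CHANGED:
        reasons.append("large_file_count")
    if change.lines_changed > MAX_LINES_CHANGED:
        reasons.append("large_diff")
    return reasons
```

Note that both size thresholds are strict: a change touching exactly 5 files and 300 lines does not trigger the critic on size alone.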

The critic’s independence is structural. It runs in a detached checkout at the committed SHA, a read-only copy from which it cannot push. It has no access to the builder’s native memory or session history. When you configure a distinct reviewer model with CLAWBOO_REVIEWER_MODEL, the independence extends to the model level too. The critic emits structured findings, each with a severity. Only five severities block the task from completing:

Severity	Blocks?
`security`	✅ Yes
`crash`	✅ Yes
`data_loss`	✅ Yes
`wrong_algorithm`	✅ Yes
`missing_ac` (unmet acceptance criteria)	✅ Yes
`style`	❌ No (recorded as debt)
`perf`	❌ No (recorded as debt)
`other`	❌ No (recorded as debt)

Findings with a non-blocking severity (style nits, performance observations, and anything classed as other) are recorded as debt. They never block a task or cause the fix loop to churn.
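As an illustration of the severity table, the following sketch splits a list of findings into those that block and those recorded as debt. The finding shape (a dict with a `severity` key) and the function name are assumptions, as is the choice to treat any severity outside the five blocking ones as non-blocking.

```python
from typing import Dict, List, Tuple

# The five severities that block a task, as listed in the table above.
BLOCKING_SEVERITIES = frozenset(
    {"security", "crash", "data_loss", "wrong_algorithm", "missing_ac"}
)


def partition_findings(findings: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split findings into (blocking, debt) by severity."""
    blocking = [f for f in findings if f["severity"] in BLOCKING_SEVERITIES]
    # Assumption: anything else (style, perf, other) is recorded as debt.
    debt = [f for f in findings if f["severity"] not in BLOCKING_SEVERITIES]
    return blocking, debt
```

A task is sent back for fixes only when the first list is non-empty; the second list never affects the verdict.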

Configure CLAWBOO_REVIEWER_MODEL to a different model than your builder to get full model-level independence. Without it, the critic uses the same model family as the builder, which means the independence is context-level only (fresh session, detached checkout, no shared memory). The stored verification result records the reviewer model, so the same-model caveat stays visible.
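The distinction between model-level and context-level independence can be sketched as follows. Only the CLAWBOO_REVIEWER_MODEL variable name comes from this page; the function, the returned keys, and the fallback to the builder's model identifier when the variable is unset are assumptions made for the example.

```python
import os
from typing import Dict, Mapping, Optional


def reviewer_independence(builder_model: str,
                          env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Report which reviewer model would be used and how independent it is."""
    env = os.environ if env is None else env
    # Assumption: with no reviewer model configured, fall back to the builder's.
    reviewer = env.get("CLAWBOO_REVIEWER_MODEL") or builder_model
    level = "model" if reviewer != builder_model else "context"
    # Recording the reviewer model keeps the same-model caveat visible.
    return {"reviewerModel": reviewer, "independence": level}
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.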

The bounded fix loop and `completed_with_debt`

The verification evaluator is permanent — it always runs. Only the retry budget is bounded. When verification fails, the task returns to in_progress with a structured explanation of what needs fixing. After a configurable number of attempts (maxCycles, default 3), the loop stops:

If the build gate was passing when the budget ran out: the task reaches done as completed_with_debt. The build is green; there are critic findings the loop couldn’t resolve, and they are recorded as debtNotes. The task ships with a paper trail.
If the build gate was still failing when the budget ran out: the task moves to blocked and a human receives a clear description of why. A broken build never silently ships.

completed_with_debt is not an unconditional pass. It only promotes to done when the build/test gate is green. A task with a failing build gate cannot reach done through debt — it goes to blocked for a human to resolve.
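The three possible exits of the loop can be sketched like this. The sketch assumes each cycle is one verification pass that reports whether the gate passed and which blocking findings remain; the `attempt` callback, the `outcome` and `cycles` keys, and the function name are hypothetical, while `maxCycles` (default 3) and `debtNotes` follow the description above.

```python
from typing import Callable, Dict, List, Tuple

# attempt(cycle) -> (gate_passed, unresolved_blocking_findings)
Attempt = Callable[[int], Tuple[bool, List[str]]]


def run_fix_loop(attempt: Attempt, max_cycles: int = 3) -> Dict:
    """Run up to max_cycles verification passes and classify the exit."""
    gate_passed, blocking = False, []
    for cycle in range(1, max_cycles + 1):
        gate_passed, blocking = attempt(cycle)
        if gate_passed and not blocking:
            return {"status": "done", "outcome": "verified",
                    "debtNotes": [], "cycles": cycle}
    if gate_passed:
        # Budget exhausted with a green build: ship with a paper trail.
        return {"status": "done", "outcome": "completed_with_debt",
                "debtNotes": list(blocking), "cycles": max_cycles}
    # Budget exhausted with a failing build: never ships, goes to a human.
    return {"status": "blocked", "outcome": "gate_failing",
            "debtNotes": [], "cycles": max_cycles}
```

The key property is in the last two branches: debt can only promote a task whose gate was green on the final pass.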

The `humanOverride`

When a task has a non-promotable verification result and you need to ship it anyway, you can use humanOverride on the board task. This is the only way to move a task with a known-failing result to done. Every override is written to the audit log — including which result was overridden, the previous status, and who requested it. Overrides are never silent.
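A guard on the transition to done, with the override as the single audited escape, might look like the sketch below. The task and result shapes, the `promotable` flag, the audit entry keys, and the exception name are all hypothetical; the Board API is described as answering 409 verification_required, which this sketch models as an exception.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


class VerificationRequired(Exception):
    """Raised when done is requested without a promotable result or an override."""


def transition_to_done(task: Dict, audit_log: List[Dict],
                       human_override: Optional[Dict] = None) -> Dict:
    result = task.get("verification")
    # Tasks without a worktree carry no verification result and are un-gated.
    promotable = result is None or result.get("promotable", False)
    if not promotable:
        if human_override is None:
            raise VerificationRequired("verification_required")
        # Overrides are never silent: record what was overridden and by whom.
        audit_log.append({
            "event": "human_override",
            "taskId": task["id"],
            "overriddenResult": result,
            "previousStatus": task["status"],
            "requestedBy": human_override["requestedBy"],
            "at": datetime.now(timezone.utc).isoformat(),
        })
    return {**task, "status": "done"}
```

Because the check lives in the transition itself, every caller goes through it regardless of which path requested the move.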

What verification does not cover

Read-only tasks. Verification only applies to file-mutating work with an isolated worktree and a non-empty diff.
Non-worktree runs. Tasks without a worktree carry no verification result and reach done un-gated.
The quality of your verify command. Verification reads an exit code; it cannot validate that your VERIFY_CMD actually exercises the change. A weak test suite yields a weak gate.

The Board

The state machine in which the → done gate lives.

Governance

Budgets, circuit breakers, and caps that bound runs.

Observability

Where verification events and verdicts appear in traces.

Board API

The REST surface, including humanOverride and 409 verification_required.
