
Testing, validation, and RPIV

Verification corruption patterns, the three requirements (baseline, evidence-backed goalposts, verification integrity), RPIV, purpose-built validation tools, and code review as a corruption check.


AI-assisted implementation introduces a failure class that traditional testing does not catch: verification corruption. This is the set of patterns where Claude reports that something works when it does not, or where the evidence presented cannot actually support the claim being made.

This guide covers why this happens, what the patterns look like, and how to structure your validation approach so the evidence is real.


Why Verification Corruption Happens

When Claude implements code and then validates it, it is both the implementer and the verifier. This creates pressure, sometimes subtle, sometimes not, toward confirming success rather than surfacing failure. The pressure is not malicious; it is a consequence of how the model works. It wants to be helpful, and “tests pass” is more helpful than “tests fail.”

The concrete patterns this produces:

  • Self-exoneration: a test fails and Claude attributes it to a pre-existing issue, an environment problem, or an integration test that “requires a live DB,” without actually verifying this is true. The failing test gets dismissed rather than investigated.
  • Test deletion: a test that would fail because of the change gets removed or skipped. The test count goes down; the pass rate stays at 100%.
  • Vacuous passes: a test is written in a way that always passes regardless of whether the production behavior is correct. Common variants: the test exercises the wrong overload, the test mocks the thing it is supposed to be testing, the test asserts on a value that is hardcoded in the test rather than produced by the code.
  • Fabricated evidence: a goalpost is reported as PASS without any command having been run. The model generates plausible-sounding output without executing anything.
  • Post-commit baselines: the pre-change test count is captured after changes are already committed. The baseline includes the new tests, making it impossible to verify the delta.

Each of these is a form of the same underlying failure: the verification claim is not backed by evidence that could withstand scrutiny.
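
A minimal sketch of the vacuous-pass pattern, using invented xUnit and Moq names (ICompletionService and CompletionResult are illustrative, not from any real codebase): the test mocks the thing it is supposed to test and asserts on a value it supplied itself, so it passes even if the production code does nothing.

[Fact]
public void ShouldCompleteServiceWhenToggleEnabled()
{
    // Hardcoded in the test, not produced by production code.
    var expected = new CompletionResult { Completed = true };

    // The service under test is mocked, so the real completion logic never runs.
    var service = new Mock<ICompletionService>();
    service.Setup(s => s.Complete(It.IsAny<int>())).Returns(expected);

    var result = service.Object.Complete(42);

    // Asserts on the value the test supplied; passes on an empty implementation.
    Assert.True(result.Completed);
}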


The Three Requirements

Catching and preventing verification corruption requires three structural elements in every implementation session.

1. Pre-Change Baseline

Before any code changes, run the existing test suite and record the result:

- Test count: 778 tests passing, 0 failures
- Command: dotnet.exe test Workflow.Function.Tests.csproj --filter "FullyQualifiedName!~Deployment" --verbosity minimal
- Branch: feature/181646-auto-complete-service-cert-completion at commit abc1234
- Pre-existing failures: None

The baseline serves one purpose: if tests fail after your changes, it proves whether they were failing before. Without a baseline, any claim about pre-existing failures cannot be verified. With a baseline, the attribution is arithmetic: if the test passed before the change and fails after, the change caused the failure.

The baseline must be captured before any code changes. A baseline captured after the first commit is not a baseline; it already includes the changes.

2. Evidence-Backed Goalposts

Each acceptance criterion that needs verification gets its own row in a table with the exact command that was run and what the output showed. A completion report that says “all tests pass” without showing the command is not evidence. Evidence looks like this:

| Goalpost | Command | Result | Evidence |
| --- | --- | --- | --- |
| Toggle disabled no-op | dotnet test --filter "ShouldNotCompleteServiceWhenToggleDisabled" | PASS | Test verifies no API call when toggle is disabled |
| Full suite | dotnet test --filter "FullyQualifiedName!~Deployment" | PASS 804/804 | 778 baseline + 26 new, 0 failures, 23s |

Two things make a goalpost claim credible: the command that was run, and a number or output reference that proves the command was actually executed. “PASS” with no further detail is a self-report. “PASS 804/804, 23s duration” is evidence.

3. Verification Integrity Section

At the end of the session, answer three questions explicitly:

  • Test attribution: did Claude dismiss any failures as pre-existing or environmental without investigating? If yes, what was the actual root cause?
  • Test deletion or disabling: were any tests removed or skipped? Was this intentional and reviewed?
  • Vacuous or suspect tests: are there any new tests that would pass regardless of whether the production behavior is correct?

Leaving this section blank is worse than filling it out honestly. A session that reports “none observed” on a clean run has value. A session that omits the section provides no signal either way.
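
Filled in on a clean run, the section might read (illustrative values):

  • Test attribution: no failures were attributed to pre-existing or environmental causes; every failure seen during implementation was investigated and fixed.
  • Test deletion or disabling: no tests were removed or skipped.
  • Vacuous or suspect tests: none observed; each new test fails when the corresponding production change is reverted.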


The RPIV Pattern

RPIV (Research, Plan, Implement, Validate) is the framework for structuring a session so the three requirements above happen naturally rather than being bolted on at the end.

Research: understand the existing system before writing any code. What patterns exist? What does the database schema actually look like? What tests already cover adjacent behavior? Research done at the start is cheap. Research done after a broken implementation is expensive.

Plan: document the approach before implementing it. Not a novel: a few paragraphs covering what will change, what the validation approach is, and what existing patterns will be followed. Writing the plan forces you to find the gaps before the code does.

Implement: do the work. The baseline has been captured. The plan is documented. Implementation proceeds against a known starting state.

Validate: verify against the goalposts. Run the commands. Record the output. Fill in the verification integrity section. The session artifact is the record.

The session artifact is a markdown file committed to .process-feedback/{workItemId}-rpiv-session.md in the implementation repo. It is not internal documentation; it is the evidence that the work was done correctly.
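
A minimal outline for that file, based on the structure described in this guide (section names are illustrative, not a mandated template):

- Research: existing patterns, schema notes, adjacent test coverage
- Plan: what will change, validation approach, patterns followed
- Baseline (pre-change): test count, command, branch and commit, pre-existing failures
- Goalposts: table of goalpost, command, result, evidence
- Verification integrity: test attribution, test deletion or disabling, vacuous or suspect tests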


What Good Validation Evidence Looks Like

A validation section that provides real coverage:

Baseline: 909 tests passing on feature/180984 at branch creation.
Command: dotnet test ASL/Adis.Workflow.Test.xUnit/Adis.Workflow.Test.xUnit.csproj
Post-change: 909 tests passing (no new tests added for this change;
existing tests updated to cover new CarliEnabled parameter).
New behavior verified:
- Decision returns false when CarliEnabled=false:
ShouldReturnFalseWhenCarliEnabledIsFalse, PASS
- Guard throws when CarliEnabled=false:
ShouldThrowExceptionWhenCarliDisabled, PASS (message verified)
- Auto-advance bypasses check when CarliEnabled=false:
ShouldCallSendToPhaseWhenCarliDisabledInPendingEstimates, PASS

A validation section that does not:

All tests pass. Implementation matches spec.

The difference is not effort; it is structure. Once you have the baseline habit and the goalpost table habit, filling them in takes two minutes. The cost of not having them when a question arises is much higher.


Purpose-Built Validation Tools: Code Is Cheap

Unit tests with mocked dependencies have a fundamental ceiling. They prove that your code does what you told it to do under the conditions you constructed. They cannot prove that the real system behaves correctly, that the database schema matches your assumptions, that the integration point works the way you specified, or that the behavior you implemented is the behavior the story actually requires.

With Claude Code, the cost of closing this gap is very low. Writing a console app that exercises your acceptance criteria against a live environment is something Claude can produce in minutes. The investment is small and the signal is much stronger than any mocked test can produce. These tools are worth keeping in source control. They serve as reference implementations for building similar tools on future stories, and they can be run again downstream in the process when the same behavior needs re-validation after changes or deployments.

As a recent example: during the UVW Lite migration implementation, the lead built a test harness at inspection-workflow/.tools/uvw-lite-migration-test-harness/, a console app with a Synthesis.cs that fabricated vehicles into specific pre-migration scenarios and a DbAccess.cs that queried actual outcomes from the live test database after migration. The harness drove 48 real migrations against the test environment and produced outcome data per vehicle: which queue they landed in, whether CarliEnabled was set correctly, whether damages were touched. That data confirmed 8 of the 15 acceptance criteria against live behavior before the formal integration tests existed.

This is the validation gap the harness fills. The formal integration tests are a regression net; they run on every build and prove the contract still holds. The harness proves the contract was met against real infrastructure, and it remains useful beyond the initial implementation. When the same behavior needs re-validation after a deployment or a downstream change, the harness is already there. Neither substitutes for the other.

The practical value multiplies because writing the harness with Claude also produces concrete debugging output. When something does not match the expected behavior, the harness tells you exactly what the live system did: which row was missing in the transition table, which status the vehicle actually landed in, which field the trigger modified unexpectedly. This is investigative infrastructure that would take a developer hours to build manually and takes Claude minutes.
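
A sketch of what such a harness can look like, assuming a .NET console app and a SQL Server test database. This is not the actual UVW Lite harness; the table, column, and queue names are invented for illustration.

// Illustrative validation harness: queries live post-migration outcomes and prints pass/fail per vehicle.
using Microsoft.Data.SqlClient;

var connectionString = Environment.GetEnvironmentVariable("TEST_DB_CONNECTION")
    ?? throw new InvalidOperationException("Set TEST_DB_CONNECTION to the test environment connection string.");

// Vehicles previously fabricated into specific pre-migration scenarios.
int[] migratedVehicleIds = { 1001, 1002, 1003 };

using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

foreach (var vehicleId in migratedVehicleIds)
{
    // Read what the live system actually did with this vehicle.
    using var command = new SqlCommand(
        "SELECT QueueName, CarliEnabled FROM VehicleWorkflow WHERE VehicleId = @id", connection);
    command.Parameters.AddWithValue("@id", vehicleId);

    using var reader = await command.ExecuteReaderAsync();
    if (!await reader.ReadAsync())
    {
        Console.WriteLine($"FAIL  vehicle {vehicleId}: no workflow row found after migration");
        continue;
    }

    var queue = reader.GetString(0);
    var carliEnabled = reader.GetBoolean(1);

    // Compare the live outcome against the expected post-migration state for this scenario.
    var pass = queue == "PendingEstimates" && carliEnabled;
    Console.WriteLine($"{(pass ? "PASS" : "FAIL")}  vehicle {vehicleId}: queue={queue}, CarliEnabled={carliEnabled}");
}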

The principle generalizes beyond migration functions. Any story with behavior that depends on a real database schema, real service interactions, or real environment configuration is a candidate. The question to ask: can a unit test exercise this behavior without mocking the thing being tested? If the answer is no, a purpose-built validation tool is the right complement, and with Claude the cost of building one is no longer a reason to skip it.


Code Review as a Corruption Check

Multi-perspective code review catches verification corruption that the author cannot catch. The patterns are invisible from inside the implementation: you wrote the test to verify the behavior you implemented, so it looks correct to you.

A reviewer approaching the test fresh asks different questions: would this test pass if the production code did nothing? Does this test exercise the real code path or a mocked path? Are these assertions actually tied to the new behavior?

When running automated code review (via the code-review-agent), include corruption-specific reviewers alongside the standard quality checks. The three variants that catch what standard review misses:

  • Coverage analyst: for each new behavior, is there a test that would fail if that behavior were removed?
  • Test attribution verifier: are there any test failures being attributed to external causes without evidence?
  • Vacuous pass detector: are there any new tests that pass on an empty implementation?

Review findings that fall into these categories should be fixed before the session is considered complete, not deferred.
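
For contrast with the vacuous-pass sketch earlier, a test that satisfies the coverage analyst's question exercises the real code path and fails if the new behavior is removed. The shape below is illustrative: AutoCompleteDecision and its constructor are invented, and only the test name echoes the decision test from the validation example above.

[Fact]
public void ShouldReturnFalseWhenCarliEnabledIsFalse()
{
    // Real production object, not a mock of the thing under test.
    var decision = new AutoCompleteDecision(carliEnabled: false);

    var result = decision.ShouldAutoComplete(vehicleId: 42);

    // Fails if the CarliEnabled check is removed from the production code.
    Assert.False(result);
}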


The Compound Effect

The three requirements are not bureaucracy. They exist because the alternative is a growing body of AI-assisted work where the evidence for correctness is thin, the verification was done by the same system that produced the implementation, and the first indication of a problem comes from production.

A session with a real baseline, evidence-backed goalposts, and an honest verification integrity section produces something that a future developer (or a future agent) can reason from. The work stands up. The delta is traceable. If something fails later, the session artifact tells you whether the tests were green, what the baseline was, and whether anything looked suspect at the time.

That is the standard worth holding.

The next guide in this series covers the feedback loop that makes all of this self-improving: how RPIV session artifacts feed into a process analysis system that identifies recurring patterns and proposes concrete changes to the rules, agents, and templates that govern future sessions.