ADR-014: Testing Infrastructure

Decision record for Vitest, mutation testing, real PostgreSQL for RLS tests, and AI-assisted quality enforcement.

Status: Accepted
Date: 2026-03-30 (test-audit CLI); coverage baseline 2026-04-01; Postgres CI service added 2026-03-25
Deciders: Kyle Olson (Solo Founder)
Linear: TRO-72 (test-audit CLI), TRO-8 (RLS implementation + CI Postgres), TRO-10 (critical RLS bypass bug)

Problem

AI agents write most of the code and tests. They exhibit a specific failure mode: syntactically correct tests that execute all code paths but verify nothing. A test that calls a function and asserts toBeDefined() on the result achieves coverage without catching bugs.

Three problems needed solving simultaneously:

Quantitative measurement -- aggregate coverage data across a pnpm monorepo where each package runs its own Vitest instance
Quality verification beyond coverage -- prove tests detect bugs, not just run code
Workflow enforcement -- ensure AI agents follow test-first development and produce behavioral tests

Decisions

Test Runner: Vitest (not Jest)

Vitest provides native TypeScript support, native ESM, near-instant cold starts, and first-class monorepo support via workspace configs. It shares Vite's transformation pipeline, already used in the build toolchain.

Jest is deferred to Phase 2, when React Native development begins. For Phase 1, Vitest is the better fit: it handles TypeScript and ESM natively and gives the faster feedback cycle that agentic development depends on.

Quality Signal: Mutation Testing (not Coverage Thresholds)

Traditional coverage thresholds (e.g., "80% line coverage required") were descoped. Mutation testing via StrykerJS is the primary quality signal.

Coverage thresholds are strictly weaker than mutation testing -- they measure execution, not verification. A file with 100% coverage but surviving mutants has weak tests. A file with 80% coverage but no surviving mutants in the covered code has strong tests for what they exercise; the remaining 20% is a coverage gap, not a sign of weak assertions.
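The difference between execution and verification can be shown with a small, dependency-free sketch. Everything in it is hypothetical (the `applyDiscount` function, the hand-written mutant, and the `kills` helper); a real run would let StrykerJS generate the mutants and Vitest run the tests. The weak check corresponds to an `expect(result).toBeDefined()` assertion.

```typescript
// Code under test: orders of 100 or more get 10 off.
function applyDiscount(total: number): number {
  return total >= 100 ? total - 10 : total;
}

// A hand-written "mutant": the kind of small change a mutation tool makes
// (here the boundary operator >= is replaced by >).
function applyDiscountMutant(total: number): number {
  return total > 100 ? total - 10 : total;
}

type Impl = (total: number) => number;

// Weak test: runs both branches (full line coverage) but only checks that
// something came back, like expect(result).toBeDefined() in Vitest.
function weakTest(impl: Impl): boolean {
  return impl(50) !== undefined && impl(200) !== undefined;
}

// Behavioral test: pins expected values, including the boundary at 100.
function behavioralTest(impl: Impl): boolean {
  return impl(50) === 50 && impl(100) === 90 && impl(200) === 190;
}

// A mutant is "killed" when a test that passes on the original fails on it.
function kills(test: (impl: Impl) => boolean, original: Impl, mutant: Impl): boolean {
  return test(original) && !test(mutant);
}

const weakKills = kills(weakTest, applyDiscount, applyDiscountMutant);
const behavioralKills = kills(behavioralTest, applyDiscount, applyDiscountMutant);
console.log({ weakKills, behavioralKills });
```

Both tests execute every line of `applyDiscount`, so a coverage threshold cannot tell them apart. Only the behavioral test fails on the mutant, which is the signal mutation testing reports.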

Three factors drove the descoping of coverage gates:

Six packages have 0% coverage at baseline. Thresholds would require noisy exclusions or a ratchet approach.
Mutation testing is strictly stronger. It directly measures whether tests catch bugs.
Agentic workflow timing mismatch. The agent that opens a PR has exited by the time CI comments arrive, so quality enforcement during development (via /test-write) is more effective.

RLS Tests: Real PostgreSQL (not Mocked)

RLS integration tests run against a real pgvector/pgvector:pg18 container, with real RLS policies and real withTenantContext() calls. Mocks were rejected because:

A mock cannot verify that a PostgreSQL RLS policy actually works
The critical RLS bug in TRO-10 -- where tenantProcedure passed the bare db pool instead of the transaction-scoped tx, silently bypassing all RLS policies -- would not have been caught by mocked tests
Mocked RLS tests give false confidence; the whole point of RLS is database-level enforcement regardless of application bugs
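The shape of the TRO-10 failure can be illustrated with a toy in-memory model. This is not the project's code and not a substitute for the real PostgreSQL tests: `withTenantContextToy`, `buggyProcedure`, `fixedProcedure`, and the `Db` interface are all hypothetical, and the model assumes that queries on the bare pool are not filtered by tenant. It only shows why a test that mocks the database handle cannot see which handle the procedure forwards.

```typescript
type Row = { tenantId: string; name: string };

// Minimal database handle: the only operation is "select all visible rows".
interface Db {
  selectProjects(): Row[];
}

const table: Row[] = [
  { tenantId: "a", name: "alpha" },
  { tenantId: "b", name: "beta" },
];

// Toy pool handle. Assumption: no tenant context is set on it, so no row
// filtering applies (this is the "bypass" in the toy model).
const pool: Db = { selectProjects: () => table };

// Toy stand-in for a transaction-scoped handle: the filter plays the role
// of an RLS policy that only exposes the current tenant's rows.
function withTenantContextToy<T>(tenantId: string, fn: (tx: Db) => T): T {
  const tx: Db = {
    selectProjects: () => table.filter((r) => r.tenantId === tenantId),
  };
  return fn(tx);
}

type Handler<T> = (db: Db) => T;

// Buggy shape: opens the tenant context but hands the pool to the handler.
function buggyProcedure<T>(tenantId: string, handler: Handler<T>): T {
  return withTenantContextToy(tenantId, (_tx) => handler(pool));
}

// Fixed shape: the handler receives the transaction-scoped handle.
function fixedProcedure<T>(tenantId: string, handler: Handler<T>): T {
  return withTenantContextToy(tenantId, (tx) => handler(tx));
}

const listNames: Handler<string[]> = (db) => db.selectProjects().map((r) => r.name);

const leaked = buggyProcedure("a", listNames); // sees tenant b's row too
const isolated = fixedProcedure("a", listNames); // sees only tenant a

// A mocked handler test injects a canned handle directly, so it passes
// regardless of which handle the procedure would have forwarded.
const mockDb: Db = { selectProjects: () => [{ tenantId: "a", name: "alpha" }] };
const mockedResult = listNames(mockDb);

console.log({ leaked, isolated, mockedResult });
```

In the toy, the mocked result looks correct under both procedure shapes. Only a test that goes through the procedure against a store that actually enforces isolation distinguishes `leaked` from `isolated`.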

Custom CLI Tool: `trovella-test-audit`

A monorepo-aware CLI at tools/test-audit/ with four commands (coverage, report, map, mutate). Uses istanbul-lib-coverage for cross-package coverage merging and StrykerJS for mutation testing. See Test Audit CLI for the full reference.
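Cross-package merging amounts to summing hit counts for files that appear in more than one per-package report. The sketch below reimplements that idea on a simplified shape so that it runs without dependencies; it is not the CLI's code. The `s` field (statement id to hit count) follows the Istanbul coverage JSON layout, while `mergeCoverage`, `statementPct`, and the file paths are hypothetical. The actual tool would rely on istanbul-lib-coverage rather than hand-rolled merging.

```typescript
// Simplified slice of an Istanbul-style coverage entry:
// statement id -> number of times the statement ran.
type FileCoverage = { s: Record<string, number> };
type CoverageMap = Record<string, FileCoverage>;

// Merge per-package maps: files seen once are copied, files seen in
// several reports have their statement hit counts summed.
function mergeCoverage(reports: CoverageMap[]): CoverageMap {
  const merged: CoverageMap = {};
  for (const report of reports) {
    for (const [file, cov] of Object.entries(report)) {
      const target = (merged[file] ??= { s: {} });
      for (const [id, hits] of Object.entries(cov.s)) {
        target.s[id] = (target.s[id] ?? 0) + hits;
      }
    }
  }
  return merged;
}

// Percentage of statements hit at least once across the merged map.
function statementPct(map: CoverageMap): number {
  const counts = Object.values(map).flatMap((cov) => Object.values(cov.s));
  if (counts.length === 0) return 0; // no statements: avoid dividing by zero
  const covered = counts.filter((n) => n > 0).length;
  return (covered / counts.length) * 100;
}

// Two hypothetical per-package reports that both touch a shared file.
const apiReport: CoverageMap = {
  "packages/api/src/router.ts": { s: { "0": 1, "1": 0 } },
  "packages/shared/src/util.ts": { s: { "0": 0, "1": 2 } },
};
const dbReport: CoverageMap = {
  "packages/shared/src/util.ts": { s: { "0": 3, "1": 0 } },
};

const mergedCoverage = mergeCoverage([apiReport, dbReport]);
const mergedPct = statementPct(mergedCoverage);
console.log(mergedCoverage, mergedPct);
```

Taken alone, each report shows the shared file as half covered. After merging, both of its statements count as hit, which is why per-package numbers cannot simply be averaged.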

AI-Assisted Testing Skills

Two Claude Code skills enforce quality during development:

/test-write -- TDD with pre-mortem fragility catalogue (10 categories), anti-rationalization table, and mutation verification loops targeting 75%+ mutation score (85%+ for RLS/auth/CASL)
/test-review -- five-dimension scoring (coverage, behavioral focus, completeness, isolation, mutation resilience), automatic CRITICAL severity for untested RLS/auth/CASL code
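The two mutation-score targets can be expressed as a small check. This is a sketch under stated assumptions, not the skill's implementation: the `MutantCounts` field names and the path markers used to detect RLS/auth/CASL code are hypothetical. The score formula (detected mutants over all valid mutants, with timeouts counted as detected) is intended to mirror how StrykerJS defines mutation score.

```typescript
// Mutant counts from a mutation run (field names are hypothetical).
type MutantCounts = {
  killed: number;
  timeout: number;
  survived: number;
  noCoverage: number;
};

// Score = detected / (detected + undetected), as a percentage.
function mutationScore(c: MutantCounts): number {
  const detected = c.killed + c.timeout;
  const total = detected + c.survived + c.noCoverage;
  return total === 0 ? 0 : (detected / total) * 100;
}

// Hypothetical markers for security-critical paths (assumption: such code
// can be recognized from its file path).
const CRITICAL_MARKERS = ["rls", "auth", "casl"];

// 85% for RLS/auth/CASL code, 75% for everything else.
function requiredScore(filePath: string): number {
  const lower = filePath.toLowerCase();
  return CRITICAL_MARKERS.some((m) => lower.includes(m)) ? 85 : 75;
}

function meetsTarget(filePath: string, counts: MutantCounts): boolean {
  return mutationScore(counts) >= requiredScore(filePath);
}

// 16 of 20 mutants detected -> 80%.
const sampleCounts: MutantCounts = { killed: 15, timeout: 1, survived: 3, noCoverage: 1 };
const sampleScore = mutationScore(sampleCounts);
const generalOk = meetsTarget("packages/ui/src/format.ts", sampleCounts);
const criticalOk = meetsTarget("packages/db/src/rls/policies.ts", sampleCounts);
console.log({ sampleScore, generalOk, criticalOk });
```

The same 80% score passes for general code and fails for the RLS file, so the verification loop would continue on the latter until the surviving mutants are addressed.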

Consequences

Positive

Mutation testing provides concrete, measurable test quality beyond coverage
The map command lets agents read only relevant test files, preserving context window
Quality enforcement happens during development when an agent is present to act on it
RLS tests prove tenant isolation at the database level

Negative

Mutation testing is slow -- unsuitable for CI gating (minutes per package)
The StrykerJS pnpm symlink workaround is fragile and depends on pnpm internals
Six packages still have zero test files at baseline
Skill effectiveness depends on AI model quality -- prompt-enforced, not code-enforced

Risks

StrykerJS ecosystem stability (smaller community than Jest/Vitest)
V8 coverage format compatibility with istanbul-lib-coverage
AI-generated test quality drift as codebase grows

Validation

Rule	Enforcement
All packages have Vitest config with `passWithNoTests: true`	Manual -- required when adding new packages
RLS tests run against real PostgreSQL in CI	`pgvector/pgvector:pg18` service container in CI
Coverage data aggregated across packages	`trovella-test-audit coverage` merges per-package JSON
Test quality measured beyond coverage	`trovella-test-audit mutate` runs StrykerJS
AI agents follow TDD workflow	`/test-write` skill (prompt enforcement)
Critical areas flagged as high severity	`/test-review` severity triggers for RLS, auth, CASL
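
For the first rule in the table, a per-package Vitest config might look like the sketch below. Only `passWithNoTests: true` comes from this record; the coverage settings (V8 provider, JSON reporter) are assumptions about how per-package JSON for merging could be produced, and the actual package configs may differ.

```typescript
// vitest.config.ts inside a single package (sketch)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Exit successfully when the package has no test files yet,
    // so empty packages do not fail the monorepo-wide test run.
    passWithNoTests: true,
    coverage: {
      // Assumption: V8 coverage written as JSON for later aggregation.
      provider: "v8",
      reporter: ["json"],
    },
  },
});
```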

References

Full ADR: docs/architecture/decisions/014-testing-infrastructure.md
CLI tool source: tools/test-audit/src/
TDD skill: .claude/skills/test-write/SKILL.md
Review skill: .claude/skills/test-review/SKILL.md
CI pipeline: .github/workflows/ci.yml