Trovella Wiki

ADR-014: Testing Infrastructure

Decision record for Vitest, mutation testing, real PostgreSQL for RLS tests, and AI-assisted quality enforcement.

Status: Accepted
Date: 2026-03-30 (test-audit CLI), coverage baseline 2026-04-01, Postgres CI service added 2026-03-25
Deciders: Kyle Olson (Solo Founder)
Linear: TRO-72 (test-audit CLI), TRO-8 (RLS implementation + CI Postgres), TRO-10 (critical RLS bypass bug)

Problem

AI agents write most of the code and tests. They exhibit a specific failure mode: syntactically correct tests that execute all code paths but verify nothing. A test that calls a function and asserts toBeDefined() on the result achieves coverage without catching bugs.

Three problems needed solving simultaneously:

  1. Quantitative measurement -- aggregate coverage data across a pnpm monorepo where each package runs its own Vitest instance
  2. Quality verification beyond coverage -- prove tests detect bugs, not just run code
  3. Workflow enforcement -- ensure AI agents follow test-first development and produce behavioral tests

Decisions

Test Runner: Vitest (not Jest)

Vitest provides native TypeScript support, native ESM, near-instant cold starts, and first-class monorepo support via workspace configs. It shares Vite's transformation pipeline, already used in the build toolchain.

Jest is deferred to Phase 2, when React Native development begins. Vitest is the better fit for Phase 1 because of its native TypeScript/ESM support and its faster feedback cycle for agentic development.
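
As a rough illustration of the per-package setup, a Vitest config might look like the sketch below. The file path and the include glob are assumptions, not taken from the repository; passWithNoTests, the V8 coverage provider, and JSON output are the pieces this record mentions elsewhere. The V8 provider assumes the @vitest/coverage-v8 package is installed.

```typescript
// packages/<name>/vitest.config.ts (illustrative path)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Let a package with no test files yet pass instead of failing the run.
    passWithNoTests: true,
    // Assumed test file location; adjust to the package's actual layout.
    include: ["src/**/*.test.ts"],
    coverage: {
      provider: "v8",
      // JSON output so a separate tool can merge reports across packages.
      reporter: ["json"],
    },
  },
});
```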

Quality Signal: Mutation Testing (not Coverage Thresholds)

Traditional coverage thresholds (e.g., "80% line coverage required") were descoped. Mutation testing via StrykerJS is the primary quality signal.

Coverage thresholds are strictly weaker than mutation testing -- they measure execution, not verification. A file with 100% coverage but surviving mutants has weak tests. A file with 80% coverage but zero surviving mutants has strong tests for the code they exercise.
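
The difference between execution and verification can be shown with a small sketch. It models a test as a predicate over the function under test and applies one hand-written mutant. StrykerJS generates and runs mutants automatically, so every name here (applyDiscount, weakTest, strongTest, killsMutant) is illustrative only.

```typescript
type PriceFn = (price: number, percent: number) => number;

// Function under test.
const applyDiscount: PriceFn = (price, percent) => price - (price * percent) / 100;

// Hand-written mutant: the '-' operator is replaced with '+'.
const mutant: PriceFn = (price, percent) => price + (price * percent) / 100;

// A test is modelled as a predicate: true means the test passes.
type Test = (fn: PriceFn) => boolean;

// Executes every line (100% coverage) but only checks that something came back.
const weakTest: Test = (fn) => fn(100, 10) !== undefined;

// Checks the actual behaviour.
const strongTest: Test = (fn) => fn(100, 10) === 90;

// A mutant is killed when the test passes on the original and fails on the mutant.
function killsMutant(test: Test, original: PriceFn, mutated: PriceFn): boolean {
  return test(original) && !test(mutated);
}

console.log(killsMutant(weakTest, applyDiscount, mutant));   // false: mutant survives
console.log(killsMutant(strongTest, applyDiscount, mutant)); // true: mutant is killed
```

Both tests give identical line coverage; only the second one fails when the code is wrong.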

Three factors drove the descoping of coverage gates:

  1. Six packages have 0% coverage at baseline. Thresholds require noisy exclusions or a ratchet approach.
  2. Mutation testing is strictly stronger. It directly measures whether tests catch bugs.
  3. Agentic workflow timing mismatch. The agent that opens a PR has exited by the time CI comments arrive. Quality enforcement during development (via /test-write) is more effective.

RLS Tests: Real PostgreSQL (not Mocked)

RLS integration tests run against a real pgvector/pgvector:pg18 container, with real RLS policies and real withTenantContext() calls. Mocks were rejected because:

  • A mock cannot verify that a PostgreSQL RLS policy actually works
  • The critical RLS bug in TRO-10 -- where tenantProcedure passed the bare db pool instead of the transaction-scoped tx, silently bypassing all RLS policies -- would not have been caught by mocked tests
  • Mocked RLS tests give false confidence; the whole point of RLS is database-level enforcement regardless of application bugs
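
To make the TRO-10 failure mode concrete, here is a minimal sketch of what a tenant-context helper could look like. The real withTenantContext() signature is not shown in this record, so the shape below, the Queryable interface, and the app.tenant_id setting name are all assumptions. The RecordingClient exists only so the sketch runs without a database; it demonstrates statement ordering, not policy enforcement, which is exactly why the real tests use PostgreSQL.

```typescript
// Minimal structural type for anything that can run SQL.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

// Hypothetical helper: conn must be a single dedicated connection, not a pool,
// so that every statement runs inside the same transaction.
async function withTenantContext<T>(
  conn: Queryable,
  tenantId: string,
  fn: (tx: Queryable) => Promise<T>,
): Promise<T> {
  await conn.query("BEGIN");
  try {
    // Third argument true = setting is local to this transaction.
    await conn.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    // The callback must use tx. A callback that closes over the bare pool
    // would run outside this transaction, where the tenant setting is absent.
    const result = await fn(conn);
    await conn.query("COMMIT");
    return result;
  } catch (err) {
    await conn.query("ROLLBACK");
    throw err;
  }
}

// Stand-in client that records SQL text instead of talking to a database.
class RecordingClient implements Queryable {
  readonly log: string[] = [];
  async query(sql: string): Promise<unknown> {
    this.log.push(sql);
    return [];
  }
}
```

A recording client like this cannot tell whether a policy filtered any rows, so it would not have caught the bypass either; only a query against a real database with policies enabled can.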

Custom CLI Tool: trovella-test-audit

A monorepo-aware CLI at tools/test-audit/ with four commands (coverage, report, map, mutate). Uses istanbul-lib-coverage for cross-package coverage merging and StrykerJS for mutation testing. See Test Audit CLI for the full reference.
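
The idea behind cross-package merging can be sketched with a simplified data model. This is not the istanbul-lib-coverage API or the CLI's actual code: the types and function names below are hypothetical, and real coverage data also tracks branches and functions. The sketch only shows the core rule that hit counts for the same statement are summed across reports.

```typescript
// Simplified coverage for one file: statement id -> hit count.
type FileCoverage = Record<string, number>;
// Coverage report from one package: file path -> file coverage.
type CoverageReport = Record<string, FileCoverage>;

// Merge reports by summing hit counts for each statement of each file.
function mergeReports(reports: CoverageReport[]): CoverageReport {
  const merged: CoverageReport = {};
  for (const report of reports) {
    for (const [file, hits] of Object.entries(report)) {
      const target = merged[file] ?? (merged[file] = {});
      for (const [id, count] of Object.entries(hits)) {
        target[id] = (target[id] ?? 0) + count;
      }
    }
  }
  return merged;
}

// Percentage of statements hit at least once across the merged report.
function statementCoveragePct(report: CoverageReport): number {
  let total = 0;
  let covered = 0;
  for (const hits of Object.values(report)) {
    for (const count of Object.values(hits)) {
      total += 1;
      if (count > 0) covered += 1;
    }
  }
  return total === 0 ? 0 : (covered / total) * 100;
}
```

Summing matters when one package's tests exercise another package's source: a statement missed by its own package's tests still counts as covered if any other run hit it.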

AI-Assisted Testing Skills

Two Claude Code skills enforce quality during development:

  • /test-write -- TDD with pre-mortem fragility catalogue (10 categories), anti-rationalization table, and mutation verification loops targeting 75%+ mutation score (85%+ for RLS/auth/CASL)
  • /test-review -- five-dimension scoring (coverage, behavioral focus, completeness, isolation, mutation resilience), automatic CRITICAL severity for untested RLS/auth/CASL code
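
The mutation score targets above can be expressed as a small check. This is a sketch under assumptions: the score formula follows the common definition (detected mutants divided by all valid mutants), while the directory-name heuristic for spotting RLS/auth/CASL code is hypothetical and not how the skills necessarily classify files.

```typescript
// Mutant counts by status for one file or package.
interface MutantCounts {
  killed: number;
  timeout: number;
  survived: number;
  noCoverage: number;
}

// Score = detected / (detected + undetected) * 100.
function mutationScore(c: MutantCounts): number {
  const detected = c.killed + c.timeout;
  const valid = detected + c.survived + c.noCoverage;
  // With no valid mutants the score is undefined; 0 is an arbitrary choice here.
  return valid === 0 ? 0 : (detected / valid) * 100;
}

// Hypothetical heuristic: a path segment named rls, auth, or casl marks critical code.
const CRITICAL_SEGMENTS = new Set(["rls", "auth", "casl"]);

function requiredScore(filePath: string): number {
  const isCritical = filePath
    .split("/")
    .some((segment) => CRITICAL_SEGMENTS.has(segment.toLowerCase()));
  return isCritical ? 85 : 75;
}

function meetsTarget(filePath: string, counts: MutantCounts): boolean {
  return mutationScore(counts) >= requiredScore(filePath);
}
```

With 3 killed and 1 survived mutant the score is 75, which meets the general target but falls short of the 85 required for a file under an auth directory.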

Consequences

Positive

  • Mutation testing provides concrete, measurable test quality beyond coverage
  • The map command lets agents read only relevant test files, preserving context window
  • Quality enforcement happens during development when an agent is present to act on it
  • RLS tests prove tenant isolation at the database level

Negative

  • Mutation testing is slow -- unsuitable for CI gating (minutes per package)
  • The StrykerJS pnpm symlink workaround is fragile and depends on pnpm internals
  • Six packages still have zero test files at baseline
  • Skill effectiveness depends on AI model quality -- prompt-enforced, not code-enforced

Risks

  • StrykerJS ecosystem stability (smaller community than Jest/Vitest)
  • V8 coverage format compatibility with istanbul-lib-coverage
  • AI-generated test quality drift as codebase grows

Validation

Rule | Enforcement
All packages have Vitest config with passWithNoTests: true | Manual -- required when adding new packages
RLS tests run against real PostgreSQL in CI | pgvector/pgvector:pg18 service container in CI
Coverage data aggregated across packages | trovella-test-audit coverage merges per-package JSON
Test quality measured beyond coverage | trovella-test-audit mutate runs StrykerJS
AI agents follow TDD workflow | /test-write skill (prompt enforcement)
Critical areas flagged as high severity | /test-review severity triggers for RLS, auth, CASL

References

  • Full ADR: docs/architecture/decisions/014-testing-infrastructure.md
  • CLI tool source: tools/test-audit/src/
  • TDD skill: .claude/skills/test-write/SKILL.md
  • Review skill: .claude/skills/test-review/SKILL.md
  • CI pipeline: .github/workflows/ci.yml
