Trovella Wiki

CI Deployment

How the migrate-prod CI job detects schema changes, connects to Cloud SQL, and applies migrations to production.

Production migrations run automatically in CI when changes are pushed to main. The migrate-prod job in .github/workflows/ci.yml handles the entire pipeline: change detection, GCP authentication, Cloud SQL proxy setup, migration execution, verification, and reference data seeding.

Pipeline Overview

push to main
  |
  v
quality job (lint, typecheck, test against ephemeral PG)
  |
  v
migrate-prod job
  |-- Check for schema changes (paths-filter)
  |-- Authenticate to GCP (Workload Identity Federation)
  |-- Read DATABASE_URL from Secret Manager
  |-- Install + start Cloud SQL Auth Proxy
  |-- Wait for proxy + database readiness
  |-- pnpm db:migrate
  |-- Verify migration count
  |-- pnpm db:seed (NODE_ENV=production)
  |
  v
deploy-prod job (depends on migrate-prod + build-push)

The deploy-prod job depends on migrate-prod completing successfully. This means the database schema is always updated before the new application code is deployed.

Change Detection

The job uses dorny/paths-filter to check whether the push includes changes to any of these paths:

  • packages/db/src/migrations/**
  • packages/db/src/schema/**
  • packages/db/src/seed/**

If none of these paths changed, the entire job is skipped with a message: "No migration/schema/seed changes detected --- skipping." This prevents unnecessary Cloud SQL connections on code-only deploys.

GCP Authentication

The job authenticates using Workload Identity Federation (WIF), not service account keys. This avoids storing long-lived credentials in GitHub Secrets.

  • WIF Provider: Configured in vars.WIF_PROVIDER
  • Service Account: Configured in vars.WIF_SERVICE_ACCOUNT
  • Permissions required: roles/cloudsql.client (proxy connection), roles/secretmanager.secretAccessor (database URL)

Cloud SQL Auth Proxy

The proxy runs as a background process in the CI runner, tunneling connections from 127.0.0.1:5432 to the production Cloud SQL instance.

Proxy Configuration

cloud-sql-proxy \
  --port 5432 \
  --http-port 9801 \
  --health-check \
  --run-connection-test \
  --quota-project trovella-prod \
  trovella-prod:us-central1:trovella-prod

Key flags:

FlagPurpose
--port 5432Local port for the tunneled connection
--http-port 9801Health check endpoint for readiness polling
--health-checkEnable the /readiness HTTP endpoint
--run-connection-testForce a full TLS handshake before accepting clients (prevents premature readiness)
--quota-project trovella-prodRoute Cloud SQL Admin API calls through the prod project (fixes quota mismatch with shared WIF SA)

DATABASE_URL Rewriting

The production DATABASE_URL from Secret Manager points to the Cloud SQL public IP. The CI job rewrites it to route through the local proxy:

  1. Replace the host:port with 127.0.0.1:5432
  2. Append ?sslmode=disable (the proxy handles TLS to Cloud SQL; client-side SSL would cause double-negotiation and connection resets)

Readiness Checks

The job performs two levels of readiness checking:

  1. Proxy readiness (up to 30 seconds): Polls http://127.0.0.1:9801/readiness every second. If the proxy process exits, the job fails immediately with proxy logs.
  2. Database connectivity (up to 20 seconds): Runs pg_isready -h 127.0.0.1 -p 5432 every 2 seconds. This verifies actual database connectivity, not just network reachability.

Migration Execution

Once the proxy is ready:

pnpm db:migrate

This runs packages/db/src/migrate.ts, which:

  1. Reads _journal.json to get the expected count
  2. Queries drizzle.__drizzle_migrations for the pre-migration count
  3. Logs: Migration state: N applied, M expected, P pending
  4. Applies pending migrations
  5. Verifies post-migration count matches expected
  6. Fails with exit code 1 if counts don't match

Post-Migration Verification

A separate CI step independently verifies the migration count after pnpm db:migrate completes:

EXPECTED=$(node -e "const j=JSON.parse(require('fs').readFileSync(
  'packages/db/src/migrations/meta/_journal.json','utf-8'));
  console.log(j.entries.length)")

ACTUAL=$(psql "$DATABASE_URL" -t -A \
  -c "SELECT count(*) FROM drizzle.__drizzle_migrations")

if [ "$EXPECTED" != "$ACTUAL" ]; then
  echo "::error::Migration count mismatch!"
  exit 1
fi

This two-layer verification (runner + CI step) catches cases where:

  • The Drizzle migrator exits 0 without applying anything (zero pending migrations looks identical to success)
  • The runner's own verification has a bug

See the Migration Verification section for details on why this was added.

Reference Data Seeding

After migrations and verification pass:

NODE_ENV=production pnpm db:seed

With NODE_ENV=production, the seeder only runs reference data (AI model registry, pricing). Development data (test users, orgs) is excluded. See Seed Data for the full tiering system.

Proxy Log Dump

Regardless of job success or failure, the final step dumps Cloud SQL Auth Proxy logs:

cat /tmp/proxy.log || echo "No proxy log found"

This is invaluable for debugging connection issues. Common proxy errors and their causes are documented in CI Troubleshooting.

CI Database (Ephemeral)

The quality job (which runs before migrate-prod) uses a separate, ephemeral PostgreSQL service container for running RLS integration tests. This container:

  • Uses the same image as local dev: pgvector/pgvector:pg18
  • Runs on port 5433
  • Is destroyed after the workflow completes
  • Never connects to Cloud SQL

The ephemeral database validates that migrations apply cleanly to a fresh database and that RLS tests pass.

Failure Modes

Migration fails in CI

Impact: No data was changed (failed migrations are transactional). The deploy job is blocked.

Recovery: Check CI logs for the error (SQL syntax, proxy connection, timeout). Fix and push again.

Migration succeeds, verification fails

Impact: Database has the new schema but the count check reported a mismatch. The deploy is blocked.

Recovery: This typically means the journal is out of sync with the database. Connect to production (see Database Operations Runbook) and inspect drizzle.__drizzle_migrations.

Proxy fails to start

Impact: No database connection. The entire job fails.

Recovery: Check proxy logs. Common causes: WIF token expired, Cloud SQL instance in maintenance, quota project misconfigured. See the ADR-001 CI proxy hardening section for the three issues that were fixed.

Hardening History (TRO-122)

Three issues caused silent migration failures in CI before being fixed:

  1. Premature readiness: The proxy /readiness endpoint reported healthy before the TLS handshake completed. Fix: --run-connection-test.
  2. SSL double-negotiation: Without ?sslmode=disable, the pg driver attempted SSL with the local proxy socket (which already handles TLS to Cloud SQL), causing ECONNRESET. Fix: append ?sslmode=disable to the rewritten URL.
  3. Quota project mismatch: The WIF service account lives in trovella-shared, so Cloud SQL Admin API calls were attributed to that project where the API wasn't enabled. Fix: --quota-project trovella-prod.

These are documented in detail in ADR-001: Database -- PostgreSQL 18 + Drizzle ORM + pgvector.

On this page