CI Deployment
How the migrate-prod CI job detects schema changes, connects to Cloud SQL, and applies migrations to production.
Production migrations run automatically in CI when changes are pushed to main. The migrate-prod job in .github/workflows/ci.yml handles the entire pipeline: change detection, GCP authentication, Cloud SQL proxy setup, migration execution, verification, and reference data seeding.
Pipeline Overview
push to main
|
v
quality job (lint, typecheck, test against ephemeral PG)
|
v
migrate-prod job
|-- Check for schema changes (paths-filter)
|-- Authenticate to GCP (Workload Identity Federation)
|-- Read DATABASE_URL from Secret Manager
|-- Install + start Cloud SQL Auth Proxy
|-- Wait for proxy + database readiness
|-- pnpm db:migrate
|-- Verify migration count
|-- pnpm db:seed (NODE_ENV=production)
|
v
deploy-prod job (depends on migrate-prod + build-push)
The deploy-prod job depends on migrate-prod completing successfully. This means the database schema is always updated before the new application code is deployed.
Change Detection
The job uses dorny/paths-filter to check whether the push includes changes to any of these paths:
packages/db/src/migrations/**packages/db/src/schema/**packages/db/src/seed/**
If none of these paths changed, the entire job is skipped with a message: "No migration/schema/seed changes detected --- skipping." This prevents unnecessary Cloud SQL connections on code-only deploys.
GCP Authentication
The job authenticates using Workload Identity Federation (WIF), not service account keys. This avoids storing long-lived credentials in GitHub Secrets.
- WIF Provider: Configured in
vars.WIF_PROVIDER - Service Account: Configured in
vars.WIF_SERVICE_ACCOUNT - Permissions required:
roles/cloudsql.client(proxy connection),roles/secretmanager.secretAccessor(database URL)
Cloud SQL Auth Proxy
The proxy runs as a background process in the CI runner, tunneling connections from 127.0.0.1:5432 to the production Cloud SQL instance.
Proxy Configuration
cloud-sql-proxy \
--port 5432 \
--http-port 9801 \
--health-check \
--run-connection-test \
--quota-project trovella-prod \
trovella-prod:us-central1:trovella-prod
Key flags:
| Flag | Purpose |
|---|---|
--port 5432 | Local port for the tunneled connection |
--http-port 9801 | Health check endpoint for readiness polling |
--health-check | Enable the /readiness HTTP endpoint |
--run-connection-test | Force a full TLS handshake before accepting clients (prevents premature readiness) |
--quota-project trovella-prod | Route Cloud SQL Admin API calls through the prod project (fixes quota mismatch with shared WIF SA) |
DATABASE_URL Rewriting
The production DATABASE_URL from Secret Manager points to the Cloud SQL public IP. The CI job rewrites it to route through the local proxy:
- Replace the host:port with
127.0.0.1:5432 - Append
?sslmode=disable(the proxy handles TLS to Cloud SQL; client-side SSL would cause double-negotiation and connection resets)
Readiness Checks
The job performs two levels of readiness checking:
- Proxy readiness (up to 30 seconds): Polls
http://127.0.0.1:9801/readinessevery second. If the proxy process exits, the job fails immediately with proxy logs. - Database connectivity (up to 20 seconds): Runs
pg_isready -h 127.0.0.1 -p 5432every 2 seconds. This verifies actual database connectivity, not just network reachability.
Migration Execution
Once the proxy is ready:
pnpm db:migrate
This runs packages/db/src/migrate.ts, which:
- Reads
_journal.jsonto get the expected count - Queries
drizzle.__drizzle_migrationsfor the pre-migration count - Logs:
Migration state: N applied, M expected, P pending - Applies pending migrations
- Verifies post-migration count matches expected
- Fails with exit code 1 if counts don't match
Post-Migration Verification
A separate CI step independently verifies the migration count after pnpm db:migrate completes:
EXPECTED=$(node -e "const j=JSON.parse(require('fs').readFileSync(
'packages/db/src/migrations/meta/_journal.json','utf-8'));
console.log(j.entries.length)")
ACTUAL=$(psql "$DATABASE_URL" -t -A \
-c "SELECT count(*) FROM drizzle.__drizzle_migrations")
if [ "$EXPECTED" != "$ACTUAL" ]; then
echo "::error::Migration count mismatch!"
exit 1
fi
This two-layer verification (runner + CI step) catches cases where:
- The Drizzle migrator exits 0 without applying anything (zero pending migrations looks identical to success)
- The runner's own verification has a bug
See the Migration Verification section for details on why this was added.
Reference Data Seeding
After migrations and verification pass:
NODE_ENV=production pnpm db:seed
With NODE_ENV=production, the seeder only runs reference data (AI model registry, pricing). Development data (test users, orgs) is excluded. See Seed Data for the full tiering system.
Proxy Log Dump
Regardless of job success or failure, the final step dumps Cloud SQL Auth Proxy logs:
cat /tmp/proxy.log || echo "No proxy log found"
This is invaluable for debugging connection issues. Common proxy errors and their causes are documented in CI Troubleshooting.
CI Database (Ephemeral)
The quality job (which runs before migrate-prod) uses a separate, ephemeral PostgreSQL service container for running RLS integration tests. This container:
- Uses the same image as local dev:
pgvector/pgvector:pg18 - Runs on port 5433
- Is destroyed after the workflow completes
- Never connects to Cloud SQL
The ephemeral database validates that migrations apply cleanly to a fresh database and that RLS tests pass.
Failure Modes
Migration fails in CI
Impact: No data was changed (failed migrations are transactional). The deploy job is blocked.
Recovery: Check CI logs for the error (SQL syntax, proxy connection, timeout). Fix and push again.
Migration succeeds, verification fails
Impact: Database has the new schema but the count check reported a mismatch. The deploy is blocked.
Recovery: This typically means the journal is out of sync with the database. Connect to production (see Database Operations Runbook) and inspect drizzle.__drizzle_migrations.
Proxy fails to start
Impact: No database connection. The entire job fails.
Recovery: Check proxy logs. Common causes: WIF token expired, Cloud SQL instance in maintenance, quota project misconfigured. See the ADR-001 CI proxy hardening section for the three issues that were fixed.
Hardening History (TRO-122)
Three issues caused silent migration failures in CI before being fixed:
- Premature readiness: The proxy
/readinessendpoint reported healthy before the TLS handshake completed. Fix:--run-connection-test. - SSL double-negotiation: Without
?sslmode=disable, thepgdriver attempted SSL with the local proxy socket (which already handles TLS to Cloud SQL), causingECONNRESET. Fix: append?sslmode=disableto the rewritten URL. - Quota project mismatch: The WIF service account lives in
trovella-shared, so Cloud SQL Admin API calls were attributed to that project where the API wasn't enabled. Fix:--quota-project trovella-prod.
These are documented in detail in ADR-001: Database -- PostgreSQL 18 + Drizzle ORM + pgvector.