Failure Recovery

Diagnosis and recovery procedures for common deployment failures, from CI pipeline errors to VM-level issues.

This page covers failures specific to the deploy process. For runtime production incidents (database down, Redis down, Typesense down, OOM), see the Incident Response Runbook or the future Infrastructure -- Observability wiki page.

CI Pipeline Fails on Main

Symptoms: gh run list --branch main shows a failed run after merge.

Impact: No deploy happens. Production continues running the previous version.

Diagnosis

# See which job failed
gh run view <run-id>

# View logs for a specific job
gh run view <run-id> --log --job=<job-id>
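Looking up the run ID by hand can be scripted. The following is a minimal sketch, assuming the GitHub CLI is already authenticated for this repository; `latest_failed_run_id` and `show_failed_logs` are hypothetical helper names, not part of the project.

```shell
# Print the ID of the most recent failed workflow run on main.
# Output is empty (or "null") when there is no failed run.
latest_failed_run_id() {
  gh run list --branch main --status failure --limit 1 \
    --json databaseId --jq '.[0].databaseId'
}

# Show only the log output of the failed steps for that run.
show_failed_logs() {
  run_id="$(latest_failed_run_id)"
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "No failed runs found on main" >&2
    return 1
  fi
  gh run view "$run_id" --log-failed
}
```

Calling `show_failed_logs` after sourcing the snippet prints the failing steps directly, which is usually enough to pick the right row in the recovery table.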

Recovery by Job

Failed Job	Likely Cause	Recovery
`quality`	Flaky test, lint issue, or build error that slipped through PR checks	Fix the issue and push to `main`
`migrate-prod`	Cloud SQL Auth Proxy connection failure, WIF token expired, or migration SQL error	Check proxy logs in the CI run; see CI Deployment troubleshooting
`build-push`	Dockerfile error, Artifact Registry auth issue, or build arg problem	Check Docker build logs; verify WIF credentials are valid
`deploy-prod`	VM unreachable via SSH, SCP failure, or docker compose error	SSH into the VM manually to check container state
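The table above can be paired with the run's job list so that each failed job is printed next to its first recovery step. This is a sketch under one assumption: the job names reported by GitHub match the job IDs in the table (a workflow that sets a custom `name:` would report that name instead). The function names are hypothetical.

```shell
# Map a failed job name to the first recovery step from the table.
recovery_hint() {
  case "$1" in
    quality)      echo "Fix the issue and push to main" ;;
    migrate-prod) echo "Check proxy logs in the CI run" ;;
    build-push)   echo "Check Docker build logs; verify WIF credentials" ;;
    deploy-prod)  echo "SSH into the VM and check container state" ;;
    *)            echo "Unknown job; inspect the run logs" ;;
  esac
}

# List the failed jobs of a run, each followed by its recovery hint.
# Usage: failed_jobs_with_hints <run-id>
failed_jobs_with_hints() {
  gh run view "$1" --json jobs \
    --jq '.jobs[] | select(.conclusion == "failure") | .name' |
  while IFS= read -r job; do
    printf '%s: %s\n' "$job" "$(recovery_hint "$job")"
  done
}
```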

Re-triggering a Failed Pipeline

If the failure was transient (network timeout, flaky external service), you can re-trigger without a code change:

# Re-run the failed job only
gh run rerun <run-id> --failed

# Or push a no-op commit to trigger a fresh run
git commit --allow-empty -m "Retry deploy after transient CI failure"
git push origin main
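When re-running, it can help to block until the new attempt finishes so the result is not missed. A minimal sketch, assuming the GitHub CLI is authenticated; `rerun_and_wait` and the `RERUN_SETTLE_SECONDS` variable are hypothetical, and the short pause is only a guess at how long GitHub needs to register the new attempt.

```shell
# Re-run the failed jobs of a run, then block until the run finishes.
# Returns non-zero if the re-run could not be started or fails again.
# Usage: rerun_and_wait <run-id>
rerun_and_wait() {
  run_id="$1"
  gh run rerun "$run_id" --failed || return 1
  # Assumption: give GitHub a moment to register the new attempt.
  sleep "${RERUN_SETTLE_SECONDS:-5}"
  gh run watch "$run_id" --exit-status
}
```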

Migration Succeeds but Deploy Fails

Symptoms: The migrate-prod job is green but deploy-prod is red. The database has the new schema but the application is running old code.

Impact: Depends on whether the migration is backward-compatible:

- Additive changes (new columns, new tables): The old app ignores them. No user impact.
- Breaking changes (renamed columns, dropped tables): The old app may error on queries that reference the changed objects.
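To judge quickly which case applies, a migration file can be scanned for statements that usually fall into the breaking category. This is a rough heuristic sketch, not a SQL parser: it assumes migrations are plain SQL files, `find_breaking_changes` is a hypothetical name, and a match (or the absence of one) should still be confirmed by reading the migration.

```shell
# Heuristic: print lines (with line numbers) that drop or rename objects.
# Returns 0 if at least one suspicious statement was found, 1 otherwise.
# Usage: find_breaking_changes <migration.sql>
find_breaking_changes() {
  grep -inwE 'drop[[:space:]]+(table|column)|rename' "$1"
}
```

For example, a file containing `ALTER TABLE users DROP COLUMN age;` is reported, while one that only adds columns produces no output.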

Recovery

Fix the deploy issue and re-run the pipeline (push a no-op commit if needed)
If the old app is broken by the schema change, manually deploy the new image:

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="
    cd /opt/trovella && \
    sudo docker pull us-central1-docker.pkg.dev/trovella-shared/trovella/web:latest && \
    sudo docker compose -f docker-compose.prod.yml up -d web && \
    echo 'Manual deploy complete'
  "

This works even though deploy-prod failed, because build-push had already pushed the new image to Artifact Registry.

App Starts but Health Check Fails

Symptoms: Container status shows "Up" but health check returns unhealthy or degraded.

Diagnosis

# Check which services are failing
curl -s https://trovella.ai/api/health | jq

# On the VM, check container health
docker compose -f /opt/trovella/docker-compose.prod.yml ps

# Check web container logs for startup errors
docker compose -f /opt/trovella/docker-compose.prod.yml logs web --tail=100

Common Causes

Health Status	Likely Cause	Fix
`database: false`	Cloud SQL proxy not connected, or Cloud SQL in maintenance	Restart `cloud-sql-proxy` container; check Cloud SQL instance state
`redis: false`	Upstash Redis down or wrong credentials	Check Upstash status; re-sync secrets if credentials changed
`typesense: false`	Typesense container crashed or data corruption	Restart container; if persistent, remove volume and rebuild index
All false	`.env` file missing or corrupted	Re-run `sync-secrets-vm.sh`; restart all containers
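The lookup in the table above can be sketched as a small filter over the health response. The exact shape of the `/api/health` payload is not documented here, so this assumes only that each service appears as a JSON key with a boolean value, as the table suggests; the function names are hypothetical. It uses `grep` and `sed` rather than `jq` so that it tolerates nesting, at the cost of being less strict.

```shell
# Read a health JSON payload on stdin and print every key whose value is false.
failing_services() {
  grep -oE '"[A-Za-z_]+"[[:space:]]*:[[:space:]]*false' |
    sed -E 's/^"([^"]+)".*/\1/'
}

# Suggest the fix from the table for one failing service.
suggested_fix() {
  case "$1" in
    database)  echo "Restart cloud-sql-proxy; check Cloud SQL instance state" ;;
    redis)     echo "Check Upstash status; re-sync secrets if credentials changed" ;;
    typesense) echo "Restart the typesense container" ;;
    *)         echo "No known fix; check container logs" ;;
  esac
}
```

A typical invocation would be `curl -s https://trovella.ai/api/health | failing_services`, passing each printed name to `suggested_fix`. If every service is listed, the `.env` row of the table is the more likely cause.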

VM Unreachable via SSH

Symptoms: gcloud compute ssh hangs or times out. The health endpoint is also unreachable.

Diagnosis

# Check VM status
gcloud compute instances describe trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --format="value(status)"

Recovery by VM State

State	Meaning	Action
`RUNNING`	VM is up but SSH fails	Check IAP tunnel firewall rule in GCP Console; try `gcloud compute instances reset` for a hard reboot (~1 min downtime)
`TERMINATED`	VM was stopped	`gcloud compute instances start trovella-prod-vm --zone=us-central1-a --project=trovella-prod`
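`STAGING`	VM is being prepared to boot	Wait for it to reach `RUNNING`, then re-check
`SUSPENDED`	VM was suspended and will not resume on its own	`gcloud compute instances resume trovella-prod-vm --zone=us-central1-a --project=trovella-prod`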
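The status check and the two most common rows of the table can be combined into one lookup. A minimal sketch, assuming the VM name, zone, and project used elsewhere on this page; `vm_status` and `action_for_state` are hypothetical names, and the function only prints a suggestion without executing anything.

```shell
VM=trovella-prod-vm
ZONE=us-central1-a
PROJECT=trovella-prod

# Print the current lifecycle state of the VM (for example RUNNING).
vm_status() {
  gcloud compute instances describe "$VM" \
    --zone="$ZONE" --project="$PROJECT" --format="value(status)"
}

# Print the suggested action for a given state; nothing is executed.
action_for_state() {
  case "$1" in
    RUNNING)    echo "Check the IAP tunnel firewall rule; consider a reset" ;;
    TERMINATED) echo "gcloud compute instances start $VM --zone=$ZONE --project=$PROJECT" ;;
    *)          echo "No scripted action; see the table for state $1" ;;
  esac
}
```

Running `action_for_state "$(vm_status)"` prints the next step for whatever state the VM is currently in.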

After the VM recovers, containers should auto-restart (they're configured with restart: unless-stopped). Verify:

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="sudo docker compose -f /opt/trovella/docker-compose.prod.yml ps"

Deploy Succeeds but Site Is Down

Symptoms: CI shows all green, but https://trovella.ai returns an error or times out.

Possible Causes

DNS issue: Cloudflare DNS is not pointing to the VM's external IP
- Check: dig trovella.ai and compare with the VM's NAT IP
- Fix: Update the A record in Cloudflare
Caddy TLS failure: Caddy cannot obtain or renew the Let's Encrypt certificate
- Check: docker compose -f /opt/trovella/docker-compose.prod.yml logs caddy --tail=50
- Fix: Restart Caddy; if persistent, clear Caddy data volumes
Port blocked: GCP firewall rules changed
- Check: Verify the allow-http-https firewall rule exists in GCP Console
- Fix: Re-apply Terraform (infra/environments/prod/)
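The DNS check above can be sketched as a comparison between the resolved A records and the VM's NAT IP. This assumes the VM has a single network interface with one external access config, and that the Cloudflare record is DNS-only; if Cloudflare proxying is enabled, `dig` returns Cloudflare edge addresses and the comparison fails even when DNS is configured correctly. The function names are hypothetical.

```shell
# External (NAT) IP of the VM as reported by Compute Engine.
vm_nat_ip() {
  gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format="value(networkInterfaces[0].accessConfigs[0].natIP)"
}

# Succeeds if any A record for the domain equals the VM's NAT IP.
# Returns 2 if the NAT IP could not be determined.
# Usage: dns_points_to_vm <domain>
dns_points_to_vm() {
  ip="$(vm_nat_ip)"
  [ -n "$ip" ] || return 2
  dig +short "$1" A | grep -qxF "$ip"
}
```

For example, `dns_points_to_vm trovella.ai || echo "DNS mismatch"` flags the first cause in the list without comparing addresses by eye.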

Nuclear Option: Full Restart

If individual recovery steps are not working and the site is completely down:

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="
    cd /opt/trovella && \
    sudo docker compose -f docker-compose.prod.yml down && \
    sudo ./sync-secrets-vm.sh && \
    sudo docker compose -f docker-compose.prod.yml pull && \
    sudo docker compose -f docker-compose.prod.yml up -d && \
    sleep 15 && \
    sudo docker compose -f docker-compose.prod.yml ps
  "

This tears down all containers, re-syncs secrets from Secret Manager, pulls fresh images, and brings everything back up. Total downtime: approximately 1--2 minutes.
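A fixed `sleep 15` may be too short or longer than needed. As an alternative, recovery can be confirmed by polling the health endpoint from a workstation. This is a sketch that assumes the endpoint answers with a non-2xx status while the app is unhealthy, which this page does not confirm; `wait_for_healthy` is a hypothetical helper.

```shell
# Poll a URL until it returns a successful HTTP status or attempts run out.
# Usage: wait_for_healthy URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_healthy() {
  url="$1"; attempts="${2:-12}"; delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (status >= 400).
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

With the defaults, `wait_for_healthy https://trovella.ai/api/health` makes up to 12 attempts five seconds apart, which comfortably covers the expected downtime of the full restart.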

Related

Rollback Procedures -- reverting to a previous image or code version
Post-Deploy Verification -- confirming a deploy succeeded
Data & Storage -- Migration Rollback -- reversing database schema changes
Data & Storage -- CI Deployment -- migration pipeline failure modes
