Failure Recovery
Diagnosis and recovery procedures for common deployment failures, from CI pipeline errors to VM-level issues.
This page covers failures specific to the deploy process. For runtime production incidents (database down, Redis down, Typesense down, OOM), see the Incident Response Runbook or the future Infrastructure -- Observability wiki page.
CI Pipeline Fails on Main
Symptoms: gh run list --branch main shows a failed run after merge.
Impact: No deploy happens. Production continues running the previous version.
Diagnosis
# See which job failed
gh run view <run-id>
# View logs for a specific job
gh run view <run-id> --log --job=<job-id>
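To avoid looking up the run ID by hand, the two steps above can be combined. This is a minimal sketch, assuming a gh release that supports --status on gh run list and --log-failed on gh run view. The function names are illustrative and not part of the repository.

```shell
# Print the ID of the most recent failed workflow run on main.
# Prints nothing if the listing contains no failed run.
latest_failed_run_id() {
  gh run list --branch main --status failure --limit 1 \
    --json databaseId --jq '.[0].databaseId // empty'
}

# Show only the log output of the failed steps in that run.
show_latest_failure() {
  run_id="$(latest_failed_run_id)"
  if [ -z "$run_id" ]; then
    echo "No failed runs found on main" >&2
    return 1
  fi
  gh run view "$run_id" --log-failed
}
```

Calling show_latest_failure from a checkout of the repository prints the failing steps without the surrounding successful output.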
Recovery by Job
| Failed Job | Likely Cause | Recovery |
|---|---|---|
| quality | Flaky test, lint issue, or build error that slipped through PR checks | Fix the issue and push to main |
| migrate-prod | Cloud SQL Auth Proxy connection failure, WIF token expired, or migration SQL error | Check proxy logs in the CI run; see CI Deployment troubleshooting |
| build-push | Dockerfile error, Artifact Registry auth issue, or build arg problem | Check Docker build logs; verify WIF credentials are valid |
| deploy-prod | VM unreachable via SSH, SCP failure, or docker compose error | SSH into the VM manually to check container state |
Re-triggering a Failed Pipeline
If the failure was transient (network timeout, flaky external service), you can re-trigger without a code change:
# Re-run only the failed jobs (and the jobs that depend on them)
gh run rerun <run-id> --failed
# Or push a no-op commit to trigger a fresh run
git commit --allow-empty -m "Retry deploy after transient CI failure"
git push origin main
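After re-triggering, it helps to block until the run finishes so you know whether the retry actually worked. The sketch below assumes gh run watch follows the new attempt of the same run ID. rerun_and_watch is a hypothetical helper name.

```shell
# Re-run only the failed jobs of a run, then wait for the result.
# Returns non-zero if the rerun cannot be started or the run fails again.
rerun_and_watch() {
  run_id="$1"
  gh run rerun "$run_id" --failed || return 1
  # --exit-status makes gh exit non-zero when the run ends in failure
  gh run watch "$run_id" --exit-status
}
```

A typical call would be rerun_and_watch followed by the run ID from gh run list. A non-zero exit means the failure was probably not transient.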
Migration Succeeds but Deploy Fails
Symptoms: The migrate-prod job is green but deploy-prod is red. The database has the new schema but the application is running old code.
Impact: Depends on whether the migration is backward-compatible:
- Additive changes (new columns, new tables): The old app ignores them. No user impact.
- Breaking changes (renamed columns, dropped tables): The old app may error on queries that reference the changed objects.
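A rough way to tell which case applies is to scan the migration SQL for destructive statements. The sketch below is a heuristic only, assuming plain SQL migration files. The function name and keyword list are illustrative, and it will miss other breaking changes such as column type changes.

```shell
# Classify migration SQL read from stdin as "breaking" or "additive".
# Heuristic: any DROP TABLE, DROP COLUMN, or RENAME statement counts as breaking.
classify_migration() {
  if grep -Eiq 'drop[[:space:]]+(table|column)|rename[[:space:]]'; then
    echo breaking
  else
    echo additive
  fi
}
```

Feed it the migration that just ran, for example by redirecting the file into classify_migration. A result of breaking suggests going straight to the manual deploy rather than waiting for a pipeline re-run.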
Recovery
- Fix the deploy issue and re-run the pipeline (push a no-op commit if needed)
- If the old app is broken by the schema change, manually deploy the new image:
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap --command="
cd /opt/trovella && \
sudo docker pull us-central1-docker.pkg.dev/trovella-shared/trovella/web:latest && \
sudo docker compose -f docker-compose.prod.yml up -d web && \
echo 'Manual deploy complete'
"
This pulls the image that build-push already pushed to Artifact Registry even though deploy-prod failed.
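To confirm that the web container was actually recreated from the freshly pulled image, the image ID of the running container can be compared with the local image ID. This is a sketch under assumptions: it runs on the VM as a user with Docker access (prefix the docker calls with sudo otherwise), and web_runs_pulled_image is a hypothetical helper name.

```shell
IMAGE="us-central1-docker.pkg.dev/trovella-shared/trovella/web:latest"
COMPOSE_FILE="/opt/trovella/docker-compose.prod.yml"

# Succeed if the running web container was created from the local copy of $IMAGE.
web_runs_pulled_image() {
  container_id="$(docker compose -f "$COMPOSE_FILE" ps -q web)"
  [ -n "$container_id" ] || return 1
  running="$(docker inspect --format '{{.Image}}' "$container_id")"
  pulled="$(docker image inspect --format '{{.Id}}' "$IMAGE")"
  [ -n "$running" ] && [ "$running" = "$pulled" ]
}
```

If the function fails, the container is still running the old image and the up -d step needs to be repeated.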
App Starts but Health Check Fails
Symptoms: Container status shows "Up" but health check returns unhealthy or degraded.
Diagnosis
# Check which services are failing
curl -s https://trovella.ai/api/health | jq
# On the VM, check container health
docker compose -f /opt/trovella/docker-compose.prod.yml ps
# Check web container logs for startup errors
docker compose -f /opt/trovella/docker-compose.prod.yml logs web --tail=100
Common Causes
| Health Status | Likely Cause | Fix |
|---|---|---|
| database: false | Cloud SQL proxy not connected, or Cloud SQL in maintenance | Restart cloud-sql-proxy container; check Cloud SQL instance state |
| redis: false | Upstash Redis down or wrong credentials | Check Upstash status; re-sync secrets if credentials changed |
| typesense: false | Typesense container crashed or data corruption | Restart container; if persistent, remove volume and rebuild index |
| All false | .env file missing or corrupted | Re-run sync-secrets-vm.sh; restart all containers |
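The lookup in the table above can be scripted so that the health response maps directly to a first recovery step. This sketch assumes the endpoint returns a flat JSON object with boolean fields named as in the table. That shape is a guess, so adjust the jq filter if the real response is nested. Both function names are hypothetical.

```shell
# List the health-check keys whose value is false (JSON on stdin).
# Assumed shape: {"database": true, "redis": false, "typesense": true}
failing_services() {
  jq -r 'to_entries[] | select(.value == false) | .key'
}

# Print the recovery step from the table for one failing service.
first_fix_for() {
  case "$1" in
    database)  echo "Restart cloud-sql-proxy container; check Cloud SQL instance state" ;;
    redis)     echo "Check Upstash status; re-sync secrets if credentials changed" ;;
    typesense) echo "Restart container; if persistent, remove volume and rebuild index" ;;
    *)         echo "Unknown service: $1" ;;
  esac
}

# Usage:
#   curl -s https://trovella.ai/api/health | failing_services |
#     while read -r svc; do echo "$svc: $(first_fix_for "$svc")"; done
```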
VM Unreachable via SSH
Symptoms: gcloud compute ssh hangs or times out. The health endpoint is also unreachable.
Diagnosis
# Check VM status
gcloud compute instances describe trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--format="value(status)"
Recovery by VM State
| State | Meaning | Action |
|---|---|---|
| RUNNING | VM is up but SSH fails | Check IAP tunnel firewall rule in GCP Console; try gcloud compute instances reset for a hard reboot (~1 min downtime) |
| TERMINATED | VM was stopped | gcloud compute instances start trovella-prod-vm --zone=us-central1-a --project=trovella-prod |
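| SUSPENDED / STAGING | VM is suspended or still starting up | STAGING: wait for the VM to reach RUNNING. SUSPENDED: run gcloud compute instances resume trovella-prod-vm --zone=us-central1-a --project=trovella-prod |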
After the VM recovers, containers should auto-restart (they're configured with restart: unless-stopped). Verify:
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap --command="sudo docker compose -f /opt/trovella/docker-compose.prod.yml ps"
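After starting or resetting the VM, the status takes a short while to settle, so it can be polled rather than checked by hand. The sketch below reuses the describe command from the diagnosis step. The function name, attempt count, and delay are illustrative defaults. A RUNNING status does not guarantee that SSH is accepting connections yet.

```shell
# Poll the VM status until it reports RUNNING.
# $1 = max attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_vm_running() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status="$(gcloud compute instances describe trovella-prod-vm \
      --zone=us-central1-a --project=trovella-prod \
      --format='value(status)')"
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "VM still not RUNNING after $attempts checks (last status: $status)" >&2
  return 1
}
```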
Deploy Succeeds but Site Is Down
Symptoms: CI shows all green, but https://trovella.ai returns an error or times out.
Possible Causes
- DNS issue: Cloudflare DNS is not pointing to the VM's external IP
  - Check: dig trovella.ai and compare with the VM's NAT IP
  - Fix: Update the A record in Cloudflare
- Caddy TLS failure: Caddy cannot obtain or renew the Let's Encrypt certificate
  - Check: docker compose logs caddy --tail=50
  - Fix: Restart Caddy; if persistent, clear Caddy data volumes
- Port blocked: GCP firewall rules changed
  - Check: Verify the allow-http-https firewall rule exists in GCP Console
  - Fix: Re-apply Terraform (infra/environments/prod/)
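The DNS check can be automated by comparing the A record with the VM's external IP as reported by gcloud. This is a sketch with hypothetical function names, assuming the VM has a single network interface with one access config. If the record is proxied through Cloudflare, dig returns Cloudflare's addresses instead of the VM's, so the comparison only makes sense for a DNS-only record.

```shell
# Succeed if the expected IP appears in a newline-separated list of addresses.
ip_in_list() {
  expected="$1"
  resolved="$2"
  [ -n "$expected" ] || return 1
  printf '%s\n' "$resolved" | grep -Fxq "$expected"
}

# Compare the public A record with the VM's external (NAT) IP.
check_dns_points_to_vm() {
  vm_ip="$(gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)')"
  dns_ips="$(dig +short trovella.ai A)"
  if ip_in_list "$vm_ip" "$dns_ips"; then
    echo "DNS OK: trovella.ai resolves to $vm_ip"
  else
    echo "DNS mismatch: VM is $vm_ip, DNS returns: $dns_ips" >&2
    return 1
  fi
}
```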
Nuclear Option: Full Restart
If individual recovery steps are not working and the site is completely down:
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap --command="
cd /opt/trovella && \
sudo docker compose -f docker-compose.prod.yml down && \
sudo ./sync-secrets-vm.sh && \
sudo docker compose -f docker-compose.prod.yml pull && \
sudo docker compose -f docker-compose.prod.yml up -d && \
sleep 15 && \
sudo docker compose -f docker-compose.prod.yml ps
"
This tears down all containers, re-syncs secrets from Secret Manager, pulls fresh images, and brings everything back up. Total downtime: approximately 1 to 2 minutes.
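A fixed sleep only gives the containers a head start, so after a full restart it is safer to poll the health endpoint until it responds. The sketch below assumes the endpoint returns a non-2xx status while unhealthy, which this page does not confirm. The function name and defaults are illustrative.

```shell
# Poll a URL until it returns a successful HTTP status.
# $1 = URL, $2 = max attempts (default 12), $3 = seconds between attempts (default 5).
wait_for_health() {
  url="$1"
  attempts="${2:-12}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP 4xx/5xx; --max-time bounds each request
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "Healthy after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Still unhealthy after $attempts attempts" >&2
  return 1
}
```

Run it from your workstation against https://trovella.ai/api/health once the restart command returns. If the endpoint reports degraded services with a 200 status, inspect the JSON body as well.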
Related
- Rollback Procedures -- reverting to a previous image or code version
- Post-Deploy Verification -- confirming a deploy succeeded
- Data & Storage -- Migration Rollback -- reversing database schema changes
- Data & Storage -- CI Deployment -- migration pipeline failure modes