Trovella Wiki

Failure Recovery

Diagnosis and recovery procedures for common deployment failures, from CI pipeline errors to VM-level issues.

This page covers failures specific to the deploy process. For runtime production incidents (database down, Redis down, Typesense down, OOM), see the Incident Response Runbook or the future Infrastructure -- Observability wiki page.

CI Pipeline Fails on Main

Symptoms: gh run list --branch main shows a failed run after merge.

Impact: No deploy happens. Production continues running the previous version.

Diagnosis

# See which job failed
gh run view <run-id>

# View logs for a specific job
gh run view <run-id> --log --job=<job-id>
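
When a run contains several jobs, it can help to list only the failed ones together with the IDs that --job expects. The following is a minimal sketch, assuming a gh CLI version that supports --json and --jq on gh run view; the function name failed_jobs is hypothetical and not part of the Trovella tooling.

```shell
# Print "<job-id> <job-name>" for every failed job in a run.
# Usage: failed_jobs <run-id>
# Assumption: gh is authenticated and supports --json/--jq on "run view".
failed_jobs() {
  run_id="$1"
  gh run view "$run_id" --json jobs \
    --jq '.jobs[] | select(.conclusion == "failure") | "\(.databaseId) \(.name)"'
}
```

The first column of the output can then be passed as <job-id> to the log command above.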

Recovery by Job

| Failed Job | Likely Cause | Recovery |
| --- | --- | --- |
| quality | Flaky test, lint issue, or build error that slipped through PR checks | Fix the issue and push to main |
| migrate-prod | Cloud SQL Auth Proxy connection failure, WIF token expired, or migration SQL error | Check proxy logs in the CI run; see CI Deployment troubleshooting |
| build-push | Dockerfile error, Artifact Registry auth issue, or build arg problem | Check Docker build logs; verify WIF credentials are valid |
| deploy-prod | VM unreachable via SSH, SCP failure, or docker compose error | SSH into the VM manually to check container state |

Re-triggering a Failed Pipeline

If the failure was transient (network timeout, flaky external service), you can re-trigger without a code change:

# Re-run only the failed jobs (gh also re-runs jobs that depend on them)
gh run rerun <run-id> --failed

# Or push a no-op commit to trigger a fresh run
git commit --allow-empty -m "Retry deploy after transient CI failure"
git push origin main
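
To avoid checking the run by hand after a re-trigger, the re-run can be chained with gh run watch. This is a sketch under the assumption that watching the same run ID picks up the new attempt; rerun_and_wait is a hypothetical helper name.

```shell
# Re-run the failed jobs of a run, then block until the run finishes.
# Returns non-zero if the re-run cannot be started or the run fails again.
# Usage: rerun_and_wait <run-id>
rerun_and_wait() {
  run_id="$1"
  gh run rerun "$run_id" --failed || return 1
  # --exit-status makes "gh run watch" exit non-zero when the run fails
  gh run watch "$run_id" --exit-status
}
```

If the function returns non-zero a second time, the failure is probably not transient and the job-specific recovery steps apply.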

Migration Succeeds but Deploy Fails

Symptoms: The migrate-prod job is green but deploy-prod is red. The database has the new schema but the application is running old code.

Impact: Depends on whether the migration is backward-compatible:

  • Additive changes (new columns, new tables): The old app ignores them. No user impact.
  • Breaking changes (renamed columns, dropped tables): The old app may error on queries that reference the changed objects.
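
A quick way to triage which of the two cases applies is to scan the migration SQL for destructive keywords. The sketch below is a plain text heuristic, not a SQL parser: it flags any standalone DROP or RENAME (including harmless ones such as DROP INDEX, or the word appearing in a comment), so treat "breaking" as "needs a closer look". The function name classify_migration is hypothetical, and the migration file path is whatever your migration tool produced.

```shell
# Print "breaking" if the file contains the whole word DROP or RENAME
# (case-insensitive), otherwise print "additive".
# Usage: classify_migration <path-to-migration.sql>
classify_migration() {
  if grep -Eiqw 'DROP|RENAME' "$1"; then
    echo "breaking"
  else
    echo "additive"
  fi
}
```

A result of "additive" suggests the old app can keep running while the deploy is fixed; "breaking" suggests going straight to the manual deploy in the recovery steps.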

Recovery

  1. Fix the deploy issue and re-run the pipeline (push a no-op commit if needed)
  2. If the old app is broken by the schema change, manually deploy the new image:
gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="
    cd /opt/trovella && \
    sudo docker pull us-central1-docker.pkg.dev/trovella-shared/trovella/web:latest && \
    sudo docker compose -f docker-compose.prod.yml up -d web && \
    echo 'Manual deploy complete'
  "

This pulls the image that build-push already pushed to Artifact Registry; the image is available even though deploy-prod failed.

App Starts but Health Check Fails

Symptoms: Container status shows "Up" but the health check returns unhealthy or degraded.

Diagnosis

# Check which services are failing
curl -s https://trovella.ai/api/health | jq

# On the VM, check container health
docker compose -f /opt/trovella/docker-compose.prod.yml ps

# Check web container logs for startup errors
docker compose -f /opt/trovella/docker-compose.prod.yml logs web --tail=100

Common Causes

| Health Status | Likely Cause | Fix |
| --- | --- | --- |
| database: false | Cloud SQL proxy not connected, or Cloud SQL in maintenance | Restart cloud-sql-proxy container; check Cloud SQL instance state |
| redis: false | Upstash Redis down or wrong credentials | Check Upstash status; re-sync secrets if credentials changed |
| typesense: false | Typesense container crashed or data corruption | Restart container; if persistent, remove volume and rebuild index |
| All false | .env file missing or corrupted | Re-run sync-secrets-vm.sh; restart all containers |
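
To see at a glance which row of the table applies, the failing services can be extracted from the health response. This sketch assumes the endpoint returns JSON containing per-service booleans such as "database": false; the exact response shape is an assumption based on the Health Status column, and failing_services is a hypothetical helper. It uses grep and sed rather than jq so it also works on a host without jq.

```shell
# Read health JSON on stdin and print the name of every key whose value is false.
# Assumption: services are reported as "<name>": true|false somewhere in the JSON.
failing_services() {
  grep -oE '"[A-Za-z_]+"[[:space:]]*:[[:space:]]*false' \
    | sed -E 's/^"([^"]+)".*$/\1/'
}

# Example:
#   curl -s https://trovella.ai/api/health | failing_services
```

If every service is listed, the "All false" row (missing or corrupted .env file) is the most likely match.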

VM Unreachable via SSH

Symptoms: gcloud compute ssh hangs or times out. The health endpoint is also unreachable.

Diagnosis

# Check VM status
gcloud compute instances describe trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --format="value(status)"

Recovery by VM State

| State | Meaning | Action |
| --- | --- | --- |
| RUNNING | VM is up but SSH fails | Check IAP tunnel firewall rule in GCP Console; try gcloud compute instances reset for a hard reboot (~1 min downtime) |
| TERMINATED | VM was stopped | gcloud compute instances start trovella-prod-vm --zone=us-central1-a --project=trovella-prod |
| SUSPENDED / STAGING | GCP maintenance | Wait for GCP to complete maintenance |
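
The lookup above can be sketched as a small helper that reads the VM status and prints the matching next step. It only prints a suggestion and never changes the VM; suggest_vm_action and the VM, ZONE, and PROJECT variables are hypothetical names, with values taken from the commands on this page.

```shell
VM=trovella-prod-vm
ZONE=us-central1-a
PROJECT=trovella-prod

# Print the suggested recovery step for the VM's current status.
suggest_vm_action() {
  status="$(gcloud compute instances describe "$VM" \
    --zone="$ZONE" --project="$PROJECT" --format="value(status)")"
  case "$status" in
    RUNNING)
      echo "RUNNING: check the IAP firewall rule, then consider 'gcloud compute instances reset'" ;;
    TERMINATED)
      echo "TERMINATED: run 'gcloud compute instances start $VM --zone=$ZONE --project=$PROJECT'" ;;
    SUSPENDED|STAGING)
      echo "$status: wait for GCP to finish" ;;
    *)
      echo "Unexpected status '$status': inspect the instance in GCP Console" ;;
  esac
}
```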

After the VM recovers, containers should auto-restart (they're configured with restart: unless-stopped). Verify:

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="docker compose -f /opt/trovella/docker-compose.prod.yml ps"

Deploy Succeeds but Site Is Down

Symptoms: CI shows all green, but https://trovella.ai returns an error or times out.

Possible Causes

  1. DNS issue: Cloudflare DNS is not pointing to the VM's external IP

    • Check: dig trovella.ai and compare with the VM's NAT IP
    • Fix: Update the A record in Cloudflare
  2. Caddy TLS failure: Caddy cannot obtain or renew the Let's Encrypt certificate

    • Check: docker compose -f /opt/trovella/docker-compose.prod.yml logs caddy --tail=50
    • Fix: Restart Caddy; if persistent, clear Caddy data volumes
  3. Port blocked: GCP firewall rules changed

    • Check: Verify the allow-http-https firewall rule exists in GCP Console
    • Fix: Re-apply Terraform (infra/environments/prod/)
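
The DNS check in cause 1 can be sketched as a direct comparison. This assumes the Cloudflare record is in DNS-only mode; if the record is proxied through Cloudflare, dig returns Cloudflare edge IPs and the comparison has to be made against the A record shown in the Cloudflare dashboard instead. check_dns is a hypothetical helper name.

```shell
# Compare the A record for trovella.ai with the VM's external (NAT) IP.
# Prints OK or MISMATCH and returns non-zero on mismatch.
check_dns() {
  dns_ip="$(dig +short trovella.ai A | head -n 1)"
  vm_ip="$(gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format='get(networkInterfaces[0].accessConfigs[0].natIP)')"
  if [ -n "$dns_ip" ] && [ "$dns_ip" = "$vm_ip" ]; then
    echo "OK: trovella.ai -> $dns_ip"
  else
    echo "MISMATCH: DNS=$dns_ip VM=$vm_ip"
    return 1
  fi
}
```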

Nuclear Option: Full Restart

If individual recovery steps are not working and the site is completely down:

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="
    cd /opt/trovella && \
    sudo docker compose -f docker-compose.prod.yml down && \
    sudo ./sync-secrets-vm.sh && \
    sudo docker compose -f docker-compose.prod.yml pull && \
    sudo docker compose -f docker-compose.prod.yml up -d && \
    sleep 15 && \
    docker compose -f docker-compose.prod.yml ps
  "

This tears down all containers, re-syncs secrets from Secret Manager, pulls fresh images, and brings everything back up. Total downtime: approximately 1 to 2 minutes.
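
The fixed sleep 15 only gives the containers time to start; it does not confirm that the site is serving again. A polling check run from your workstation afterwards could look like the sketch below. It assumes the health endpoint answers with a non-2xx status while unhealthy, which is not confirmed by this page, and wait_for_health is a hypothetical helper name.

```shell
# Poll the health endpoint until it answers with a 2xx status or attempts run out.
# Usage: wait_for_health [url] [attempts] [delay-seconds]
wait_for_health() {
  url="${1:-https://trovella.ai/api/health}"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors (4xx/5xx)
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}
```

If the function gives up, the per-service diagnosis in "App Starts but Health Check Fails" is the next place to look.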
