Trovella Wiki

VM Management

Operational procedures for the Compute Engine VM -- upgrades, resource checks, cleanup, monitoring alerts, and the startup script.

Monitoring Alerts

Four alert policies are defined in infra/modules/compute-vm/monitoring.tf and send email notifications to kyle@trovella.ai:

| Alert | Threshold | Duration | Auto-close |
|---|---|---|---|
| CPU utilization | > 85% | 5 minutes | 30 minutes |
| Memory utilization | > 90% | 5 minutes | 30 minutes |
| Disk utilization | > 80% | Immediate | 30 minutes |
| HTTPS uptime | trovella.ai/api/health fails | 5 minutes | 30 minutes |

CPU utilization comes from the built-in Compute Engine metrics. Memory and disk utilization come from the Google Cloud Ops Agent, which the VM startup script installs. The uptime check runs every 5 minutes from multiple Google-managed locations and validates both the SSL certificate and a 200 response from /api/health.
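
If the memory or disk alerts stop receiving data, the Ops Agent is a reasonable first thing to check. The sketch below assumes the agent's metrics collector runs under its usual systemd unit name, google-cloud-ops-agent-opentelemetry-collector; the function name ops_agent_state is illustrative and not part of the repository.

```shell
# Print the systemd state of the Ops Agent metrics collector
# ("active", "inactive", "failed", ...).
# Assumption: the agent was installed under its standard unit name.
ops_agent_state() {
  gcloud compute ssh trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --tunnel-through-iap \
    --command="systemctl is-active google-cloud-ops-agent-opentelemetry-collector"
}

# Usage: warn when the collector is not running.
# [ "$(ops_agent_state)" = "active" ] || echo "Ops Agent metrics collector is not active"
```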

View alerts in the Cloud Monitoring console.

VM Startup Script

The first-boot script (infra/modules/compute-vm/startup.sh) provisions the VM once, when it is created. Compute Engine invokes startup scripts on every boot, so the script is idempotent -- a marker file at /opt/trovella/.initialized prevents re-execution on subsequent reboots.
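
The marker-file guard can be sketched as follows. This is an illustration of the pattern only, not the contents of startup.sh; run_once and provision_vm are hypothetical names.

```shell
# Run a command only if the marker file does not exist yet, then create
# the marker so that later boots skip the work.
run_once() {
  marker="$1"; shift
  if [ -f "$marker" ]; then
    echo "already initialized, skipping"
    return 0
  fi
  "$@" || return 1                  # leave no marker if provisioning fails
  mkdir -p "$(dirname "$marker")"
  touch "$marker"
}

# Usage (provision_vm is a hypothetical function holding the real steps):
# run_once /opt/trovella/.initialized provision_vm
```

Writing the marker only after the command succeeds means a failed first boot is retried on the next boot instead of being silently skipped.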

The script performs these steps:

  1. Install docker.io, docker-compose-v2, and jq
  2. Enable and start the Docker daemon
  3. Create /opt/trovella/data/ directory
  4. Configure Docker to authenticate with Artifact Registry (us-central1-docker.pkg.dev)
  5. Install the Google Cloud Ops Agent for memory and disk metrics
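
To confirm on the VM that the first boot finished, something like the following could be used. It is a sketch under the assumption that the paths and tools listed above are the only things to verify; missing_tools and missing_paths are illustrative helpers, not part of the startup script.

```shell
# Print every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print every expected path under ROOT that does not exist.
# ROOT defaults to /opt/trovella.
missing_paths() {
  root="${1:-/opt/trovella}"
  [ -f "$root/.initialized" ] || echo "$root/.initialized"
  [ -d "$root/data" ] || echo "$root/data"
}

# Usage on the VM; no output means nothing is missing:
# missing_tools docker jq; missing_paths
```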

After the startup script completes, the VM is ready to receive deploys. The first deploy (via CI) copies Docker Compose files, syncs secrets, and starts the containers.

Upgrade Machine Type

Upgrading from e2-custom-2-6144 (6 GB) to e2-standard-2 (8 GB) requires stopping the VM and causes approximately 2 minutes of downtime.

# Step 1: Stop the VM
gcloud compute instances stop trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod

# Step 2: Change the machine type
gcloud compute instances set-machine-type trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --machine-type=e2-standard-2

# Step 3: Start the VM
gcloud compute instances start trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod

# Step 4: Verify containers restarted
gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap \
  --command="docker compose -f /opt/trovella/docker-compose.prod.yml ps"

All containers restart automatically because of their restart: unless-stopped policy. Afterwards, update the machine_type default in infra/modules/compute-vm/variables.tf to match the new machine type; otherwise the next Terraform plan will report the change as drift and try to revert it.
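
To confirm that the new machine type took effect, the instance can be queried after it starts. The API returns machineType as a full resource URL, so this sketch strips everything up to the last slash; the function names are illustrative.

```shell
# Reduce a machineType URL such as
# .../zones/us-central1-a/machineTypes/e2-standard-2 to its last segment.
machine_type_name() {
  echo "${1##*/}"
}

# Print the machine type the VM is currently configured with.
current_machine_type() {
  machine_type_name "$(gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format='value(machineType)')"
}

# Usage:
# [ "$(current_machine_type)" = "e2-standard-2" ] && echo "upgrade applied"
```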

Cost Reference

| Machine Type | vCPU | RAM | Monthly Cost |
|---|---|---|---|
| e2-custom-2-6144 | 2 | 6 GB | ~$37 |
| e2-standard-2 | 2 | 8 GB | ~$49 |
| e2-standard-4 | 4 | 16 GB | ~$97 |

Check VM Resource Usage

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="
    echo '=== Disk ===' && df -h / && \
    echo '=== Memory ===' && free -h && \
    echo '=== Docker ===' && docker system df
  "

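The disk figure from df can be compared against the 80% alert threshold directly. The helper below is a minimal sketch; disk_over_threshold is an illustrative name, and the usage line assumes GNU df, which supports --output.

```shell
# Succeed (exit 0) when a usage percentage such as "83%" is strictly
# above the threshold. The threshold defaults to 80, the disk alert level.
disk_over_threshold() {
  used="$(printf '%s' "$1" | tr -d ' %')"   # "  83%" becomes "83"
  [ "$used" -gt "${2:-80}" ]
}

# Usage on the VM:
# disk_over_threshold "$(df --output=pcent / | tail -n 1)" && echo "disk above 80%"
```
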
Check Container Status

# All containers (should show "Up" and "healthy" where applicable)
gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap \
  --command="docker compose -f /opt/trovella/docker-compose.prod.yml ps"

# Recent logs (all containers)
gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap \
  --command="docker compose -f /opt/trovella/docker-compose.prod.yml logs --tail=50"

# Logs for a specific container
gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap \
  --command="docker compose -f /opt/trovella/docker-compose.prod.yml logs web --tail=50"

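To list only the containers that Docker currently marks as unhealthy, the health filter of docker ps can be used. This sketch queries every container on the host rather than only the Compose project, which is assumed to be equivalent on this single-purpose VM; unhealthy_containers is an illustrative name.

```shell
# Print the names of containers whose health check reports "unhealthy".
# Prints nothing when every health-checked container is fine.
unhealthy_containers() {
  gcloud compute ssh trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --tunnel-through-iap \
    --command="docker ps --filter health=unhealthy --format '{{.Names}}'"
}

# Usage:
# [ -z "$(unhealthy_containers)" ] && echo "no unhealthy containers"
```
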
Clean Up Docker Resources

Old images accumulate after deploys. The CI pipeline prunes after each deploy, but manual cleanup may be needed if disk usage alerts fire.

gcloud compute ssh trovella-prod-vm \
  --zone=us-central1-a --project=trovella-prod \
  --tunnel-through-iap --command="
    sudo docker image prune -af && \
    sudo docker volume prune -f && \
    sudo docker builder prune -af
  "

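Warning: docker volume prune deletes volumes that no container references. Stopped containers still hold their volumes, but removed containers (for example after docker compose down) do not, so do not run this while the stack is torn down -- its data volumes could be deleted.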
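
Before pruning volumes, it may help to list which ones Docker considers unreferenced. The sketch below runs on the VM with whatever privileges docker needs there; the function names and the "typesense" pattern in the usage line are hypothetical.

```shell
# Print the names of volumes that no container references; these are the
# candidates a volume prune could delete.
unreferenced_volumes() {
  docker volume ls --quiet --filter dangling=true
}

# Succeed only when no unreferenced volume name matches PATTERN.
volume_prune_is_safe() {
  ! unreferenced_volumes | grep -q -- "$1"
}

# Usage (the volume name pattern is an assumption):
# volume_prune_is_safe typesense && docker volume prune -f
```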

VM Is Unreachable

SSH Hangs or Times Out

  1. Check VM status:
    gcloud compute instances describe trovella-prod-vm \
      --zone=us-central1-a --project=trovella-prod \
      --format="value(status)"
  2. If TERMINATED -- start the VM:
    gcloud compute instances start trovella-prod-vm \
      --zone=us-central1-a --project=trovella-prod
  3. If RUNNING but SSH fails -- check the IAP tunnel firewall rule (trovella-prod-allow-iap-ssh) in the GCP Console
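  4. If STAGING -- the VM is still starting; wait for it to reach RUNNING. If SUSPENDED -- the VM was suspended and stays that way until it is resumed with gcloud compute instances resume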
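
The firewall rule can also be inspected from the command line instead of the Console. IAP TCP forwarding connects from 35.235.240.0/20, so the rule's source ranges need to include it. This is a partial check only (it does not verify that the rule is enabled or allows tcp:22), and the function names are illustrative.

```shell
# Succeed when the given source ranges include the IAP TCP forwarding
# range, 35.235.240.0/20.
allows_iap_range() {
  case "$1" in
    *35.235.240.0/20*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch the rule's source ranges and check them.
iap_ssh_rule_ok() {
  allows_iap_range "$(gcloud compute firewall-rules describe \
    trovella-prod-allow-iap-ssh --project=trovella-prod \
    --format='value(sourceRanges)')"
}

# Usage:
# iap_ssh_rule_ok || echo "firewall rule does not allow the IAP range"
```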

Health Check Failing

The uptime check hits https://trovella.ai/api/health every 5 minutes. If it fails:

  1. SSH into the VM and check container status (see above)
  2. If the web container is unhealthy, check its logs for startup errors
  3. If Cloud SQL Proxy is not running, the health check will report database connectivity failure
  4. If Typesense is not running, the health check will report search connectivity failure
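
The endpoint can be probed by hand to reproduce what the uptime check sees. A minimal sketch with curl follows; health_status and health_ok are illustrative names. Because curl verifies certificates by default, an invalid certificate also shows up as a failure (status 000).

```shell
# Print the HTTP status code returned by the health endpoint
# ("000" when the request fails before a response arrives).
health_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    "${1:-https://trovella.ai/api/health}"
}

# Succeed only on a 200, mirroring what the uptime check expects.
health_ok() {
  [ "$(health_status "$@")" = "200" ]
}

# Usage:
# health_ok || echo "health check returned $(health_status)"
```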

For full incident response procedures, see the deploy runbook (forward-reference -- not yet written).

Terraform Module Reference

The compute VM module accepts these variables (infra/modules/compute-vm/variables.tf):

| Variable | Type | Default | Description |
|---|---|---|---|
| project_id | string | -- | GCP project ID (required) |
| region | string | -- | GCP region (required) |
| zone | string | us-central1-a | GCP zone for the VM |
| environment | string | -- | Environment name (required) |
| labels | map(string) | {} | Resource labels |
| machine_type | string | e2-custom-2-6144 | Compute Engine machine type |
| disk_size_gb | number | 50 | Boot disk size in GB |
| notification_email | string | -- | Email for monitoring alerts (required) |

The module outputs:

| Output | Description |
|---|---|
| external_ip | Static external IP address |
| instance_name | VM instance name (trovella-{env}-vm) |
| instance_zone | Zone of the instance |
| service_account_email | VM service account email |

VM Service Account Roles

The VM service account (trovella-vm-{env}) has the following IAM roles:

| Role | Purpose |
|---|---|
| roles/secretmanager.secretAccessor | Read secrets during sync-secrets-vm.sh |
| roles/logging.logWriter | Write structured logs to Cloud Logging |
| roles/monitoring.metricWriter | Write custom metrics to Cloud Monitoring |
| roles/cloudsql.client | Authenticate to Cloud SQL via the Auth Proxy |
| roles/artifactregistry.reader (on trovella-shared) | Pull Docker images from Artifact Registry |
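
The four project-level roles can be checked against the live IAM policy. The sketch below assumes they are granted on trovella-prod at the project level and that the service account email follows the default NAME@PROJECT.iam.gserviceaccount.com form; neither is confirmed by this page. The Artifact Registry role lives on trovella-shared and is not covered.

```shell
# Print each expected role that is absent from the newline-separated
# list of granted roles passed as the first argument.
missing_roles() {
  for role in roles/secretmanager.secretAccessor roles/logging.logWriter \
              roles/monitoring.metricWriter roles/cloudsql.client; do
    echo "$1" | grep -qxF "$role" || echo "$role"
  done
}

# Print the roles granted to a member on trovella-prod, one per line.
granted_roles() {
  gcloud projects get-iam-policy trovella-prod \
    --flatten='bindings[].members' \
    --filter="bindings.members:$1" \
    --format='value(bindings.role)'
}

# Usage (the email address is an assumption):
# missing_roles "$(granted_roles serviceAccount:trovella-vm-prod@trovella-prod.iam.gserviceaccount.com)"
```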
