VM Management
Operational procedures for the Compute Engine VM -- upgrades, resource checks, cleanup, monitoring alerts, and the startup script.
Monitoring Alerts
Four alert policies are defined in infra/modules/compute-vm/monitoring.tf and send email notifications to kyle@trovella.ai:
| Alert | Threshold | Duration | Auto-close |
|---|---|---|---|
| CPU utilization | > 85% | 5 minutes | 30 minutes |
| Memory utilization | > 90% | 5 minutes | 30 minutes |
| Disk utilization | > 80% | Immediate | 30 minutes |
| HTTPS uptime | trovella.ai/api/health fails | 5 minutes | 30 minutes |
CPU metrics come from the built-in Compute Engine metrics. Memory and disk metrics come from the Google Cloud Ops Agent, which is installed by the VM startup script. The uptime check runs every 5 minutes from multiple Google-managed locations and verifies both that the SSL certificate is valid and that /api/health returns a 200 response.
View alerts in the Cloud Monitoring console.
VM Startup Script
The first-boot script (infra/modules/compute-vm/startup.sh) does its work once, when the VM is created. Compute Engine invokes startup scripts on every boot, so the script is written to be idempotent -- a marker file at /opt/trovella/.initialized makes it exit early on subsequent reboots.
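The marker-file guard described above can be sketched as follows. This is a minimal illustration and not the contents of startup.sh: the helper name run_once and the MARKER override are hypothetical, and only the default path comes from the text.

```shell
# Sketch of a marker-file guard (hypothetical helper, not the real startup.sh).
# MARKER defaults to the path named above; override it to try this off the VM.
MARKER="${MARKER:-/opt/trovella/.initialized}"

# run_once CMD [ARGS...]: run CMD only if the marker file is absent,
# then create the marker so later boots skip the work.
run_once() {
  if [ -f "$MARKER" ]; then
    echo "already initialized, skipping"
    return 0
  fi
  "$@" || return 1   # leave no marker behind if the setup command fails
  mkdir -p "$(dirname "$MARKER")"
  touch "$MARKER"
}
```

With MARKER pointed at a scratch path, calling run_once twice with the same command runs it the first time and prints the skip message the second time. Writing the marker only after the command succeeds means a failed first boot is retried on the next boot.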
The script performs:
- Install docker.io, docker-compose-v2, and jq
- Enable and start the Docker daemon
- Create the /opt/trovella/data/ directory
- Configure Docker to authenticate with Artifact Registry (us-central1-docker.pkg.dev)
- Install the Google Cloud Ops Agent for memory and disk metrics
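The steps above could look roughly like the following on a Debian or Ubuntu image. This is a sketch under stated assumptions, not the contents of startup.sh: the exact commands, the run helper, and the DRY_RUN switch are illustrative choices, and the Ops Agent lines use Google's published repository script.

```shell
# Sketch of the five first-boot steps (assumed commands, not the real startup.sh).
# DRY_RUN=1 (the default here) only prints each command so the flow can be
# reviewed safely; set DRY_RUN=0 as root on the VM to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

first_boot_steps() {
  # 1. Packages
  run apt-get update
  run apt-get install -y docker.io docker-compose-v2 jq
  # 2. Docker daemon
  run systemctl enable --now docker
  # 3. Data directory
  run mkdir -p /opt/trovella/data
  # 4. Artifact Registry credential helper
  run gcloud auth configure-docker us-central1-docker.pkg.dev --quiet
  # 5. Ops Agent (memory and disk metrics)
  run curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
  run bash add-google-cloud-ops-agent-repo.sh --also-install
}
```

Calling first_boot_steps with the default DRY_RUN=1 prints one line per command, which is a convenient way to compare the sketch against the real script before running anything.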
After the startup script completes, the VM is ready to receive deploys. The first deploy (via CI) copies Docker Compose files, syncs secrets, and starts the containers.
Upgrade Machine Type
Upgrading from e2-custom-2-6144 (6 GB) to e2-standard-2 (8 GB) requires stopping the VM and causes approximately 2 minutes of downtime.
# Step 1: Stop the VM
gcloud compute instances stop trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod
# Step 2: Change the machine type
gcloud compute instances set-machine-type trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--machine-type=e2-standard-2
# Step 3: Start the VM
gcloud compute instances start trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod
# Step 4: Verify containers restarted
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap \
--command="docker compose -f /opt/trovella/docker-compose.prod.yml ps"
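All containers restart automatically because of their restart: unless-stopped policy. Afterwards, update the machine_type default in infra/modules/compute-vm/variables.tf to match the new machine type; otherwise the Terraform configuration no longer matches the VM and the next terraform apply will attempt to change the machine type back.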
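To confirm that the new machine type took effect, the live value can be read back with gcloud compute instances describe. The following is a small sketch; the helper names are hypothetical. The machineType field is returned as a full resource URL, so the first helper keeps only the last path segment.

```shell
# machine_type_name URL: print the last path segment of a machineType URL.
machine_type_name() {
  printf '%s\n' "${1##*/}"
}

# current_machine_type: ask gcloud for the live machine type of the prod VM.
# Needs gcloud credentials; instance, zone, and project are the ones used above.
current_machine_type() {
  machine_type_name "$(gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format='value(machineType)')"
}

# check_machine_type EXPECTED ACTUAL: succeed only when the two names match.
check_machine_type() {
  if [ "$1" = "$2" ]; then
    echo "ok: $2"
  else
    echo "mismatch: expected $1, got $2" >&2
    return 1
  fi
}
```

A typical call would be check_machine_type e2-standard-2 "$(current_machine_type)", which exits non-zero if the VM is still on the old type.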
Cost Reference
| Machine Type | vCPU | RAM | Monthly Cost |
|---|---|---|---|
| e2-custom-2-6144 | 2 | 6 GB | ~$37 |
| e2-standard-2 | 2 | 8 GB | ~$49 |
| e2-standard-4 | 4 | 16 GB | ~$97 |
Check VM Resource Usage
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap --command="
echo '=== Disk ===' && df -h / && \
echo '=== Memory ===' && free -h && \
echo '=== Docker ===' && docker system df
"
Check Container Status
# All containers (should show "Up" and "healthy" where applicable)
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap \
--command="docker compose -f /opt/trovella/docker-compose.prod.yml ps"
# Recent logs (all containers)
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap \
--command="docker compose -f /opt/trovella/docker-compose.prod.yml logs --tail=50"
# Logs for a specific container
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap \
--command="docker compose -f /opt/trovella/docker-compose.prod.yml logs web --tail=50"
Clean Up Docker Resources
Old images accumulate after deploys. The CI pipeline prunes after each deploy, but manual cleanup may be needed if disk usage alerts fire.
gcloud compute ssh trovella-prod-vm \
--zone=us-central1-a --project=trovella-prod \
--tunnel-through-iap --command="
sudo docker image prune -af && \
sudo docker volume prune -f && \
sudo docker builder prune -af
"
Warning: docker volume prune removes every volume that Docker considers unused. Do not run it while any containers are temporarily stopped or removed (for example after docker compose down) -- their data volumes could be deleted.
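Whether a manual cleanup is needed, or whether one helped, can be judged by comparing root-disk usage with the 80% disk alert threshold. The following is a minimal sketch meant to be run on the VM itself; the function names are hypothetical and the default limit of 80 mirrors the alert table.

```shell
# disk_pct [PATH]: print the used percentage of the filesystem holding PATH
# as a bare integer (POSIX "df -P" output, column 5, with "%" stripped).
disk_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# exceeds VALUE LIMIT: succeed when VALUE is strictly greater than LIMIT.
exceeds() {
  [ "$1" -gt "$2" ]
}

# report_disk [PATH] [LIMIT]: compare usage with the alert threshold.
report_disk() {
  pct="$(disk_pct "${1:-/}")"
  limit="${2:-80}"
  if exceeds "$pct" "$limit"; then
    echo "disk ${pct}% used: above ${limit}% threshold"
  else
    echo "disk ${pct}% used: within ${limit}% threshold"
  fi
}
```

Running report_disk before and after the prune commands shows how much space was reclaimed relative to the alert level.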
VM Is Unreachable
SSH Hangs or Times Out
- Check VM status:
  gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format="value(status)"
- If TERMINATED -- start the VM:
  gcloud compute instances start trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod
- If RUNNING but SSH fails -- check the IAP tunnel firewall rule (trovella-prod-allow-iap-ssh) in the GCP Console
- If STAGING -- the VM is still starting; wait for it to reach RUNNING
- If SUSPENDED -- the VM has been suspended and does not come back on its own; resume it:
  gcloud compute instances resume trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod
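The status checks above can be folded into a small triage helper. This is a sketch only: the function names and the wording of the suggestions are illustrative, and the status strings are the standard Compute Engine instance states.

```shell
# vm_status: print the current status of the prod VM (needs gcloud credentials).
vm_status() {
  gcloud compute instances describe trovella-prod-vm \
    --zone=us-central1-a --project=trovella-prod \
    --format='value(status)'
}

# next_action STATUS: print a suggested next step for an instance status.
next_action() {
  case "$1" in
    RUNNING)
      echo "check the IAP SSH firewall rule trovella-prod-allow-iap-ssh" ;;
    TERMINATED)
      echo "start the VM" ;;
    PROVISIONING|STAGING|STOPPING|SUSPENDING|REPAIRING)
      echo "wait and re-check" ;;
    SUSPENDED)
      echo "wait and re-check; resume the VM if it stays suspended" ;;
    *)
      echo "unexpected status: $1" ;;
  esac
}
```

Used together as next_action "$(vm_status)", the helper prints one line of guidance per check, which is easy to repeat while a transitional state settles.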
Health Check Failing
The uptime check hits https://trovella.ai/api/health every 5 minutes. If it fails:
- SSH into the VM and check container status (see above)
- If the web container is unhealthy, check its logs for startup errors
- If Cloud SQL Proxy is not running, the health check will report database connectivity failure
- If Typesense is not running, the health check will report search connectivity failure
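Before opening an SSH session, the endpoint can be probed by hand from a workstation. The following is a minimal sketch with hypothetical helper names. It assumes only that a healthy service answers 200, as stated above; how the endpoint signals a database or search failure is not specified here, so any 5xx is treated generically.

```shell
# http_code URL: print the HTTP status code for URL. curl prints "000" when
# no response arrives at all (DNS, TLS, connection, or timeout failure).
http_code() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$1" 2>/dev/null || true
}

# describe_code CODE: map a status code to a first diagnostic hint.
describe_code() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "no HTTP response: check DNS, TLS, and whether the VM is reachable" ;;
    5??) echo "server error $1: check the web container and its dependencies" ;;
    *)   echo "unexpected status $1" ;;
  esac
}
```

For example, describe_code "$(http_code https://trovella.ai/api/health)" separates "nothing is answering" from "the app is answering but unhealthy", which decides whether to start with the VM status checks or with the container logs.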
For full incident response procedures, see the deploy runbook (forward-reference -- not yet written).
Terraform Module Reference
The compute VM module accepts these variables (infra/modules/compute-vm/variables.tf):
| Variable | Type | Default | Description |
|---|---|---|---|
| project_id | string | -- | GCP project ID (required) |
| region | string | -- | GCP region (required) |
| zone | string | us-central1-a | GCP zone for the VM |
| environment | string | -- | Environment name (required) |
| labels | map(string) | {} | Resource labels |
| machine_type | string | e2-custom-2-6144 | Compute Engine machine type |
| disk_size_gb | number | 50 | Boot disk size in GB |
| notification_email | string | -- | Email for monitoring alerts (required) |
The module outputs:
| Output | Description |
|---|---|
| external_ip | Static external IP address |
| instance_name | VM instance name (trovella-{env}-vm) |
| instance_zone | Zone of the instance |
| service_account_email | VM service account email |
VM Service Account Roles
The VM service account (trovella-vm-{env}) has the following IAM roles:
| Role | Purpose |
|---|---|
| roles/secretmanager.secretAccessor | Read secrets during sync-secrets-vm.sh |
| roles/logging.logWriter | Write structured logs to Cloud Logging |
| roles/monitoring.metricWriter | Write custom metrics to Cloud Monitoring |
| roles/cloudsql.client | Authenticate to Cloud SQL via the Auth Proxy |
| roles/artifactregistry.reader (on trovella-shared) | Pull Docker images from Artifact Registry |
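The role table above can be checked against what is actually granted. This is a sketch with hypothetical helper names; it covers the four project-level roles only, since the Artifact Registry role is granted on trovella-shared and would need a separate check there. The service account address in the usage note is an assumption based on the standard NAME@PROJECT.iam.gserviceaccount.com form.

```shell
# Project-level roles from the table above, one per line.
EXPECTED_ROLES="roles/secretmanager.secretAccessor
roles/logging.logWriter
roles/monitoring.metricWriter
roles/cloudsql.client"

# granted_roles PROJECT MEMBER: list the roles bound to MEMBER on PROJECT
# (needs gcloud credentials with permission to read the IAM policy).
granted_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten='bindings[].members' \
    --filter="bindings.members:$2" \
    --format='value(bindings.role)'
}

# missing_roles: read granted role names on stdin (one per line) and print
# every expected role that is not among them.
missing_roles() {
  granted="$(cat)"
  for role in $EXPECTED_ROLES; do
    printf '%s\n' "$granted" | grep -qxF "$role" || echo "$role"
  done
}
```

A possible invocation is granted_roles trovella-prod serviceAccount:trovella-vm-prod@trovella-prod.iam.gserviceaccount.com | missing_roles; empty output means all four roles are present.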