Monitoring and Budgets

Cloud Monitoring alert policies, uptime checks, notification channels, and per-project budget thresholds.

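Trovella uses Cloud Monitoring for infrastructure alerts and per-project billing budgets for cost control. The Cloud Monitoring resources (alert policies, uptime check, notification channel) are provisioned via Terraform in the compute-vm module; budgets are currently configured manually (see Budget Alerts).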

Alert Policies

Four alert policies are configured for the production VM. All use the same notification channel (email to the founder) and auto-close after 30 minutes.

CPU Utilization

Setting	Value
Metric	`compute.googleapis.com/instance/cpu/utilization`
Threshold	> 85% sustained for 5 minutes
Source	Native Compute Engine metric (no agent required)
Auto-close	30 minutes

CPU utilization is available from Compute Engine without the Ops Agent. The 85% threshold with 5-minute duration filters out transient spikes (Docker image pulls, build operations) while catching sustained load that could degrade user experience.
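As a rough illustration of how the settings in the table map onto Terraform, a policy along the following lines would express them. This is a sketch, not the module's actual code: the resource name cpu_high, the display names, and the 60-second alignment period are assumptions. One detail worth noting is that this metric is reported as a fraction between 0 and 1, so 85% is written as 0.85.

```hcl
# Hypothetical sketch: names and alignment settings are assumptions,
# not copied from infra/modules/compute-vm.
resource "google_monitoring_alert_policy" "cpu_high" {
  display_name = "VM CPU utilization > 85% (sketch)"
  combiner     = "OR"

  conditions {
    display_name = "CPU above 85% for 5 minutes"

    condition_threshold {
      filter = join(" AND ", [
        "resource.type = \"gce_instance\"",
        "metric.type = \"compute.googleapis.com/instance/cpu/utilization\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85   # fraction of 1.0, i.e. 85%
      duration        = "300s" # must hold for 5 minutes before firing

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  # Reuses the email channel defined in the Notification Channel section.
  notification_channels = [google_monitoring_notification_channel.email.name]

  alert_strategy {
    auto_close = "1800s" # 30 minutes
  }
}
```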

Memory Utilization

Setting	Value
Metric	`agent.googleapis.com/memory/percent_used` (state = `used`)
Threshold	> 90% sustained for 5 minutes
Source	Cloud Ops Agent
Auto-close	30 minutes

Memory metrics require the Cloud Ops Agent, which is installed by the VM startup script (infra/modules/compute-vm/startup.sh). The threshold is set as high as 90% because the VM runs multiple Docker containers (Next.js, Typesense, Inngest, Caddy, Cloud SQL Proxy) that together consume most of the 6 GB of RAM even under normal load; a lower threshold would alert constantly.

Disk Utilization

Setting	Value
Metric	`agent.googleapis.com/disk/percent_used` (state = `used`)
Threshold	> 80% immediate (0s duration)
Source	Cloud Ops Agent
Auto-close	30 minutes

Disk alerts fire immediately (no sustained duration) because disk exhaustion causes cascading failures -- Docker cannot pull images, logs cannot be written, and the application crashes. The 50 GB boot disk provides headroom, but Docker images and Typesense data can grow unpredictably.
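The disk policy differs from the CPU policy in two ways that are easy to get wrong, so a sketch may help. Ops Agent percentage metrics are reported on a 0 to 100 scale (so the threshold is 80, not 0.8), and the duration is zero so the first violating sample opens an incident. The resource name disk_high, the display names, and the alignment settings are assumptions for illustration.

```hcl
# Hypothetical sketch: not the module's actual resource.
resource "google_monitoring_alert_policy" "disk_high" {
  display_name = "VM disk utilization > 80% (sketch)"
  combiner     = "OR"

  conditions {
    display_name = "Disk used above 80%"

    condition_threshold {
      filter = join(" AND ", [
        "resource.type = \"gce_instance\"",
        "metric.type = \"agent.googleapis.com/disk/percent_used\"",
        "metric.labels.state = \"used\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 80   # agent metric is a percentage (0-100)
      duration        = "0s" # fire on the first violating sample

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]

  alert_strategy {
    auto_close = "1800s" # 30 minutes
  }
}
```

The memory policy would follow the same shape, with the memory metric, a threshold of 90, and a 300s duration.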

HTTPS Uptime Check

Setting	Value
URL	`https://trovella.ai/api/health`
Check interval	Every 5 minutes
Timeout	10 seconds
SSL validation	Enabled
Alert condition	Uptime check failing for > 5 minutes

The uptime check requests the application's health endpoint, which verifies that the Next.js server is running and responsive. The check runs from multiple GCP regions, so a failure observed from only one checker region does not trigger the alert.
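For reference, the check settings in the table could be declared roughly as follows. This sketch covers only the uptime check itself, not the alert policy attached to it; the resource name https_health, the display name, and the var.project_id variable are assumptions.

```hcl
# Hypothetical sketch of the uptime check described in the table.
resource "google_monitoring_uptime_check_config" "https_health" {
  display_name = "Trovella health endpoint (sketch)"
  period       = "300s" # check every 5 minutes
  timeout      = "10s"

  http_check {
    path         = "/api/health"
    port         = 443
    use_ssl      = true
    validate_ssl = true # fail the check on an invalid certificate
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id # assumed variable
      host       = "trovella.ai"
    }
  }
}
```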

Notification Channel

All alerts use a single email notification channel:

resource "google_monitoring_notification_channel" "email" {
  display_name = "Trovella Alerts (prod)"
  type         = "email"
  labels = {
    email_address = "kyle@trovella.ai"
  }
}

Additional channels (Slack, PagerDuty) can be added when the team grows. The email address is configured via the notification_email variable in infra/environments/prod/variables.tf.
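One way the channel could take its address from the variable rather than a literal is sketched below. The variable name comes from the text above, but the description string and the exact wiring are assumptions; how the value is passed from the prod environment into the module is not shown here.

```hcl
# Sketch only: the declaration is assumed to resemble this.
variable "notification_email" {
  description = "Address that receives Cloud Monitoring alert emails"
  type        = string
}

resource "google_monitoring_notification_channel" "email" {
  display_name = "Trovella Alerts (prod)"
  type         = "email"
  labels = {
    email_address = var.notification_email
  }
}
```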

Cloud Ops Agent

The Ops Agent is installed by the VM startup script during first boot:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install

The agent provides:

Memory metrics (agent.googleapis.com/memory/percent_used) -- not available from Compute Engine natively
Disk metrics (agent.googleapis.com/disk/percent_used) -- per-device utilization
Process metrics -- CPU and memory per process (not currently alerted on)

The startup script is idempotent: a marker file (/opt/trovella/.initialized) prevents re-running setup on VM restarts.

Budget Alerts

Per the infra README, budgets are set per project:

Project	Budget	Thresholds
`trovella-prod`	$50/month	50%, 80%, 100%, 120%
`trovella-staging`	$20/month	50%, 80%, 100%, 120%
`trovella-shared`	$10/month	50%, 80%, 100%, 120%

Budget alerts are currently configured manually in the GCP Console. The infra/modules/budget/ directory is an empty placeholder for future Terraform-managed budget resources.

Budget alerts are notification-only -- they do not automatically shut down resources. If spending exceeds 120% of the budget, manual investigation is required.
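If the budgets are moved into Terraform, the trovella-prod row of the table might translate into something like the sketch below. Nothing here exists in the repository yet: the resource name and the billing_account_id and project_number variables are hypothetical. Thresholds are written as fractions of the budget, so for the $50 budget they correspond to $25, $40, $50, and $60 of spend.

```hcl
# Hypothetical sketch for a future infra/modules/budget/ resource.
resource "google_billing_budget" "prod" {
  billing_account = var.billing_account_id # assumed variable
  display_name    = "trovella-prod monthly budget (sketch)"

  budget_filter {
    # Budgets filter on the numeric project number, not the project ID.
    projects = ["projects/${var.project_number}"] # assumed variable
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "50"
    }
  }

  # 50%, 80%, 100%, and 120% of the monthly budget.
  threshold_rules {
    threshold_percent = 0.5
  }
  threshold_rules {
    threshold_percent = 0.8
  }
  threshold_rules {
    threshold_percent = 1.0
  }
  threshold_rules {
    threshold_percent = 1.2
  }
}
```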

What Is Not Monitored (Yet)

These monitoring capabilities are planned but not yet implemented:

Application-level metrics -- request latency, error rates, tRPC procedure timing (will come via Pino structured logging to Cloud Logging)
Cloud SQL metrics -- connection count, query latency, storage growth (available natively in Cloud SQL but no alert policies configured)
Inngest job metrics -- execution duration, failure rates, queue depth (available via Inngest dashboard, not yet in Cloud Monitoring)
Slack/PagerDuty notifications -- currently email-only

Viewing Alerts

In the GCP Console:

Go to Monitoring > Alerting in the trovella-prod project
Active incidents appear at the top
Alert policies show configuration and history

Via CLI:

# List alert policies (this shows policy definitions, not open incidents)
gcloud monitoring policies list --project=trovella-prod

# Describe a specific alert policy
gcloud monitoring policies describe POLICY_ID --project=trovella-prod