Monitoring and Budgets
Cloud Monitoring alert policies, uptime checks, notification channels, and per-project budget thresholds.
Trovella uses Cloud Monitoring for infrastructure alerts and per-project billing budgets for cost control. All monitoring resources are provisioned via Terraform in the compute-vm module.
Alert Policies
Four alert policies are configured for the production VM. All use the same notification channel (email to the founder) and auto-close after 30 minutes.
CPU Utilization
| Setting | Value |
|---|---|
| Metric | compute.googleapis.com/instance/cpu/utilization |
| Threshold | > 85% sustained for 5 minutes |
| Source | Native Compute Engine metric (no agent required) |
| Auto-close | 30 minutes |
CPU utilization is available from Compute Engine without the Ops Agent. The 85% threshold with 5-minute duration filters out transient spikes (Docker image pulls, build operations) while catching sustained load that could degrade user experience.
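The settings in the table map onto a `google_monitoring_alert_policy` resource roughly as follows. This is a minimal sketch, not the module's actual code: the resource name `cpu`, the display names, and the 60-second alignment period are assumptions, and it presumes the email notification channel described under Notification Channel is defined in the same module. Compute Engine reports this metric as a fraction between 0 and 1, so 85% is written as `0.85`.

```hcl
# Sketch only: resource and display names are hypothetical.
resource "google_monitoring_alert_policy" "cpu" {
  display_name = "VM CPU utilization > 85% (prod)"
  combiner     = "OR"

  conditions {
    display_name = "CPU above 85% for 5 minutes"

    condition_threshold {
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85   # metric is a 0-1 fraction, not a percentage
      duration        = "300s" # must stay above the threshold for 5 minutes

      aggregations {
        alignment_period   = "60s" # assumed; not stated in this document
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]

  alert_strategy {
    auto_close = "1800s" # close the incident after 30 minutes
  }
}
```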
Memory Utilization
| Setting | Value |
|---|---|
| Metric | agent.googleapis.com/memory/percent_used (state = used) |
| Threshold | > 90% sustained for 5 minutes |
| Source | Cloud Ops Agent |
| Auto-close | 30 minutes |
Memory metrics require the Cloud Ops Agent, which is installed by the VM startup script (infra/modules/compute-vm/startup.sh). The 90% threshold accounts for the VM running multiple Docker containers (Next.js, Typesense, Inngest, Caddy, Cloud SQL Proxy) that collectively consume most of the 6GB RAM.
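A memory policy could look like the sketch below. The resource name, display names, and alignment period are assumptions rather than the module's real values. Unlike the native CPU metric, the Ops Agent reports `percent_used` on a 0 to 100 scale, so the threshold is `90`, and the filter selects the `used` state label.

```hcl
# Sketch only: resource and display names are hypothetical.
resource "google_monitoring_alert_policy" "memory" {
  display_name = "VM memory utilization > 90% (prod)"
  combiner     = "OR"

  conditions {
    display_name = "Memory above 90% for 5 minutes"

    condition_threshold {
      filter = join(" AND ", [
        "resource.type = \"gce_instance\"",
        "metric.type = \"agent.googleapis.com/memory/percent_used\"",
        "metric.labels.state = \"used\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 90     # agent metric is a 0-100 percentage
      duration        = "300s" # sustained for 5 minutes

      aggregations {
        alignment_period   = "60s" # assumed; not stated in this document
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]

  alert_strategy {
    auto_close = "1800s"
  }
}
```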
Disk Utilization
| Setting | Value |
|---|---|
| Metric | agent.googleapis.com/disk/percent_used (state = used) |
| Threshold | > 80%, fires immediately (0s duration) |
| Source | Cloud Ops Agent |
| Auto-close | 30 minutes |
Disk alerts fire immediately (no sustained duration) because disk exhaustion causes cascading failures -- Docker cannot pull images, logs cannot be written, and the application crashes. The 50GB boot disk provides headroom, but Docker images and Typesense data can grow unpredictably.
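The "immediate" behavior comes from setting the condition duration to zero. A minimal sketch, with a hypothetical resource name and an assumed alignment period, might be:

```hcl
# Sketch only: resource and display names are hypothetical.
resource "google_monitoring_alert_policy" "disk" {
  display_name = "VM disk utilization > 80% (prod)"
  combiner     = "OR"

  conditions {
    display_name = "Disk above 80%"

    condition_threshold {
      filter = join(" AND ", [
        "resource.type = \"gce_instance\"",
        "metric.type = \"agent.googleapis.com/disk/percent_used\"",
        "metric.labels.state = \"used\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 80   # agent metric is a 0-100 percentage
      duration        = "0s" # fire on the first aligned point over the threshold

      aggregations {
        alignment_period   = "60s" # assumed; not stated in this document
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]

  alert_strategy {
    auto_close = "1800s"
  }
}
```

Because no cross-series reducer is set, each device's time series is evaluated on its own, so any single device crossing 80% is enough to open an incident.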
HTTPS Uptime Check
| Setting | Value |
|---|---|
| URL | https://trovella.ai/api/health |
| Check interval | Every 5 minutes |
| Timeout | 10 seconds |
| SSL validation | Enabled |
| Alert condition | Uptime check failing for > 5 minutes |
The uptime check hits the application's health endpoint, which verifies that the Next.js server is running and responsive. The check runs from multiple GCP regions, so a failure observed from only one checker region does not trigger the alert.
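In Terraform, the check and its alert could be expressed along these lines. This is a sketch under assumptions: the resource names, `var.project_id`, and the 1200-second alignment period are hypothetical, and the alert condition follows the common "count failing regions" pattern rather than anything confirmed by the module. Requiring more than one failing region is what keeps a single-region failure from opening an incident.

```hcl
# Sketch only: names and var.project_id are hypothetical.
resource "google_monitoring_uptime_check_config" "https" {
  display_name = "Trovella HTTPS health (prod)"
  timeout      = "10s"
  period       = "300s" # every 5 minutes

  http_check {
    path         = "/api/health"
    port         = 443
    use_ssl      = true
    validate_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = "trovella.ai"
    }
  }
}

resource "google_monitoring_alert_policy" "uptime" {
  display_name = "HTTPS uptime check failing (prod)"
  combiner     = "OR"

  conditions {
    display_name = "Health check failing for 5 minutes"

    condition_threshold {
      filter = join(" AND ", [
        "resource.type = \"uptime_url\"",
        "metric.type = \"monitoring.googleapis.com/uptime_check/check_passed\"",
        "metric.labels.check_id = \"${google_monitoring_uptime_check_config.https.uptime_check_id}\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 1 # more than one checker region must report failure
      duration        = "300s"

      aggregations {
        alignment_period     = "1200s" # assumed
        per_series_aligner   = "ALIGN_NEXT_OLDER"
        cross_series_reducer = "REDUCE_COUNT_FALSE" # count regions where the check failed
        group_by_fields      = ["resource.label.*"]
      }
    }
  }

  notification_channels = [google_monitoring_notification_channel.email.name]

  alert_strategy {
    auto_close = "1800s"
  }
}
```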
Notification Channel
All alerts use a single email notification channel:
resource "google_monitoring_notification_channel" "email" {
  display_name = "Trovella Alerts (prod)"
  type         = "email"
  labels = {
    email_address = "kyle@trovella.ai"
  }
}
Additional channels (Slack, PagerDuty) can be added when the team grows. The email address is configured via the notification_email variable in infra/environments/prod/variables.tf.
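Wiring the address through the variable rather than a literal might look like the sketch below. Only the variable name `notification_email` comes from this document; the description text and the exact placement of the declaration are assumptions.

```hcl
# Declared in infra/environments/prod/variables.tf (description is illustrative).
variable "notification_email" {
  description = "Email address that receives Cloud Monitoring alerts"
  type        = string
}

# The channel then reads the address from the variable instead of a literal.
resource "google_monitoring_notification_channel" "email" {
  display_name = "Trovella Alerts (prod)"
  type         = "email"
  labels = {
    email_address = var.notification_email
  }
}
```

Keeping the address in a variable means a later Slack or PagerDuty channel can be added as a separate resource without touching the alert policies' existing references.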
Cloud Ops Agent
The Ops Agent is installed by the VM startup script during first boot:
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
bash add-google-cloud-ops-agent-repo.sh --also-install
The agent provides:
- Memory metrics (agent.googleapis.com/memory/percent_used) -- not available from Compute Engine natively
- Disk metrics (agent.googleapis.com/disk/percent_used) -- per-device utilization
- Process metrics -- CPU and memory per process (not currently alerted on)
The startup script is idempotent: a marker file (/opt/trovella/.initialized) prevents re-running setup on VM restarts.
Budget Alerts
Per the infra README, budgets are set per project:
| Project | Budget | Thresholds |
|---|---|---|
| trovella-prod | $50/month | 50%, 80%, 100%, 120% |
| trovella-staging | $20/month | 50%, 80%, 100%, 120% |
| trovella-shared | $10/month | 50%, 80%, 100%, 120% |
Budget alerts are currently configured manually in the GCP Console. The infra/modules/budget/ directory is an empty placeholder for future Terraform-managed budget resources.
Budget alerts are notification-only -- they do not automatically shut down resources. If spending exceeds 120% of the budget, manual investigation is required.
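When the budget module is eventually filled in, the production row of the table could translate to something like the following. This is purely a sketch of one possible shape: `var.billing_account_id`, `var.project_number`, the resource name, and the USD currency are all assumptions, since none of them appear in the current configuration.

```hcl
# Sketch only: variables and resource name are hypothetical.
resource "google_billing_budget" "prod" {
  billing_account = var.billing_account_id
  display_name    = "trovella-prod monthly budget"

  budget_filter {
    # Budgets filter on project numbers, not project IDs.
    projects = ["projects/${var.project_number}"]
  }

  amount {
    specified_amount {
      currency_code = "USD" # assumed from the "$" in the table
      units         = "50"
    }
  }

  # Thresholds are fractions of the budget: 50%, 80%, 100%, 120%.
  threshold_rules { threshold_percent = 0.5 }
  threshold_rules { threshold_percent = 0.8 }
  threshold_rules { threshold_percent = 1.0 }
  threshold_rules { threshold_percent = 1.2 }
}
```

As with the manually configured budgets, these rules only send notifications; nothing here stops or deletes resources.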
What Is Not Monitored (Yet)
These monitoring capabilities are planned but not yet implemented:
- Application-level metrics -- request latency, error rates, tRPC procedure timing (will come via Pino structured logging to Cloud Logging)
- Cloud SQL metrics -- connection count, query latency, storage growth (available natively in Cloud SQL but no alert policies configured)
- Inngest job metrics -- execution duration, failure rates, queue depth (available via Inngest dashboard, not yet in Cloud Monitoring)
- Slack/PagerDuty notifications -- currently email-only
Viewing Alerts
In the GCP Console:
- Go to Monitoring > Alerting in the trovella-prod project
- Active incidents appear at the top
- Alert policies show configuration and history
Via CLI:
# List alert policies (this shows policy configuration, not open incidents)
gcloud monitoring policies list --project=trovella-prod
# Describe a specific alert policy
gcloud monitoring policies describe POLICY_ID --project=trovella-prod