ADR-008: Compute -- Single VM + Docker Compose + Caddy

Decision record for choosing a single Compute Engine VM over Cloud Run and Firebase App Hosting.

Status: Accepted (supersedes Week 0 decision 5.1: Firebase App Hosting)
Date: 2026-03-29 (TRO-54 created), implemented 2026-03-30
Deciders: Kyle Olson (Solo Founder)

Decision

Run all Trovella services on a single Compute Engine VM (e2-custom-2-6144, 2 vCPU, 6 GB RAM, ~$37/month) using Docker Compose for orchestration and Caddy for reverse proxying with automatic Let's Encrypt TLS.

Services on the VM: Next.js (custom image), Caddy (off-the-shelf), Cloud SQL Auth Proxy, Typesense 27.1, Inngest, and future additions.

Context

Trovella originally chose Firebase App Hosting (FAH) for zero-config deploys. FAH was framed as a layer over Cloud Run, with a ~2-hour ejection path to plain Cloud Run once its limits were hit.

The actual reason for leaving was not any of the planned ejection triggers (canary deploys, Cloud Armor, GPU, microservices). As the research engine architecture took shape, it became clear that Trovella was a multi-service system, not a single-app deployment. The MVP required Next.js, Typesense, Inngest, and Langfuse running together with sub-millisecond inter-service communication and shared persistent disk. FAH can only host one Next.js app -- the other services would always need to live somewhere else, creating cross-network latency and multiple deployment targets.

FAH also accumulated operational friction during Phase 0:

Interactive Console required -- FAH backend creation required a browser-based GitHub OAuth flow that could not be automated via Terraform or gcloud
CVE checker rejected `catalog:` specifiers -- FAH's adapter reads package.json literally, not through pnpm resolution. It took three failed deployments to discover the workaround.
Split CI/CD pipeline -- GitHub Actions ran CI, FAH ran build+deploy independently. No single dashboard showed both. CI failures did not block FAH deploys.
Secret access grants needed Firebase-specific commands -- standard gcloud IAM bindings were insufficient
FAH continued deploying after removal -- even after apphosting.yaml and the Terraform modules were removed, FAH kept watching the main branch. Stopping it required an explicit `firebase apphosting:backends:delete --force`.

Decision Drivers

Zero cold starts -- research workflows chain 5-6 service calls per step; 2-3 second cold starts per service are unacceptable
Sub-millisecond inter-service communication -- Docker internal network vs 5-20 ms Cloud Run cross-service hops
Persistent disk -- Typesense index must survive restarts without ephemeral filesystem workarounds
Single deployment target -- one `docker compose pull && docker compose up -d`, one pipeline, one dashboard
Predictable flat cost -- one VM = one monthly bill, no per-request pricing surprises

Alternatives Considered

Separate Cloud Run Services (Original Plan)

Cloud Run with FAH handling Next.js deploys. Each service scales independently. ~$25-37/month.

Rejected because: 2-3 second cold starts on each service. 5-20 ms inter-service latency per hop (25-100 ms for a 5-step workflow). Ephemeral filesystem means Typesense loses its index on restart. Split pipeline with no unified view.
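The latency and cold-start figures above can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes the calls in a workflow run strictly in sequence and that each step costs exactly one cross-service hop, which is how the 25-100 ms figure for a 5-step workflow appears to be derived. The function names are hypothetical.

```python
# Figures from the rejection rationale above. Sequential execution and
# one hop per step are assumptions made for this illustration.
HOP_LATENCY_MS = (5, 20)   # Cloud Run cross-service latency per hop (low, high)
COLD_START_S = (2, 3)      # cold start per service (low, high)


def added_latency_ms(hops: int, per_hop_ms: tuple = HOP_LATENCY_MS) -> tuple:
    """Return (best, worst) network latency added by a chain of sequential hops."""
    low, high = per_hop_ms
    return hops * low, hops * high


def cold_start_delay_s(services: int, per_service_s: tuple = COLD_START_S) -> tuple:
    """Return (best, worst) delay if every service in a sequential chain starts cold."""
    low, high = per_service_s
    return services * low, services * high


if __name__ == "__main__":
    print(added_latency_ms(5))      # (25, 100)
    print(cold_start_delay_s(5))    # (10, 15)
```

Under these assumptions, a fully cold 5-service chain adds 10-15 seconds before any useful work happens, which dwarfs the per-hop network cost. Both costs disappear when the services share one always-on VM and a Docker network.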

Typesense on Compute Engine + Rest on Cloud Run

Persistent disk for Typesense, Cloud Run for the app. ~$33-51/month.

Rejected because: it leaves two deployment targets, and cross-network calls still exist on the hot path. If a VM is already needed for Typesense, consolidating everything on it is simpler.

Stay on FAH, Fix the Friction

No migration effort, built-in CDN and preview environments.

Rejected because: the multi-service architecture was the core requirement. FAH is designed for single-app deployments and cannot host the full service stack regardless of friction.

Key Implementation Decisions

Separate containers, not a monolithic image

Each service runs in its own Docker container. Only the Next.js app uses a custom-built image; Caddy, Typesense, and Inngest use off-the-shelf images. `docker compose pull` is a no-op for unchanged images (same digest), so a typical deploy restarts only the web container.
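The "same digest means no restart" behavior can be modeled as a comparison of image digests before and after a pull. The sketch below is a simplified model of that idea, not how Docker Compose is implemented (Compose also recreates a container when its configuration changes). The service names and digest strings are hypothetical placeholders.

```python
def services_to_restart(running: dict, pulled: dict) -> list:
    """Return the services whose pulled image digest differs from the running one.

    `running` and `pulled` map a service name to an image digest string.
    A service missing from `running` is treated as needing a start.
    """
    return sorted(
        name for name, digest in pulled.items() if running.get(name) != digest
    )


if __name__ == "__main__":
    # Hypothetical digests: only the custom-built web image changed.
    running = {"web": "sha256:aaa", "caddy": "sha256:bbb", "typesense": "sha256:ccc"}
    pulled = {"web": "sha256:ddd", "caddy": "sha256:bbb", "typesense": "sha256:ccc"}
    print(services_to_restart(running, pulled))  # ['web']
```

Because the off-the-shelf images are pinned and rarely change, the common case is a one-element result, which matches the deploy behavior described above.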

Caddy over Nginx or Traefik

Caddy was chosen for automatic Let's Encrypt certificates with no TLS-specific configuration -- a ~10-line Caddyfile handles HTTPS, www redirect, and reverse proxying. Nginx would require certbot and cron-based renewal. Traefik's Docker label-based routing adds complexity that is unnecessary with a fixed set of services.

IAP SSH only

No public SSH port. All administrative access goes through gcloud compute ssh via Identity-Aware Proxy. Admin dashboards (Inngest on 8288, Drizzle Studio on 4983) are accessed via SSH tunnels.

Cloud SQL via Auth Proxy sidecar

The application connects to cloud-sql-proxy:5432 on the Docker network. The proxy handles TLS and IAM authentication. Cloud SQL authorized networks still restrict access to the VM's static IP as defense-in-depth.

Memory budget sized for current + near-term services

Total estimated usage of 3.5-5.5 GB fits in the 6 GB VM. The upgrade to e2-standard-2 (8 GB, ~$49/month) is a single gcloud command with a ~2-minute restart. Because the upgrade is this trivial, the founder's concern about running out of memory was considered resolved.
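The headroom behind this decision can be checked with the figures already given. The sketch below uses only the RAM sizes, prices, and usage estimate from this record. It ignores OS overhead, which is an assumption of the illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineType:
    name: str
    ram_gb: float
    monthly_usd: float


# Values taken from this record; prices are the approximate figures quoted.
CURRENT = MachineType("e2-custom-2-6144", 6.0, 37.0)
UPGRADE = MachineType("e2-standard-2", 8.0, 49.0)
ESTIMATED_USAGE_GB = (3.5, 5.5)  # (low, high) estimate for all services


def headroom_gb(machine: MachineType, usage: tuple = ESTIMATED_USAGE_GB) -> tuple:
    """Return (worst-case, best-case) free RAM, ignoring OS overhead."""
    low, high = usage
    return machine.ram_gb - high, machine.ram_gb - low


if __name__ == "__main__":
    print(headroom_gb(CURRENT))                       # (0.5, 2.5)
    print(headroom_gb(UPGRADE))                       # (2.5, 4.5)
    print(UPGRADE.monthly_usd - CURRENT.monthly_usd)  # 12.0
```

At the high end of the estimate the 6 GB machine has only about 0.5 GB to spare, so the roughly $12/month upgrade is the relevant safety valve rather than a remote contingency.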

Consequences

Positive

Zero cold starts -- all services always warm
Sub-millisecond inter-service communication via Docker network
Unified deployment -- one target, one pipeline, one monitoring setup
Persistent disk for Typesense, Inngest, and future data
Predictable ~$37/month flat cost

Negative

No auto-scaling -- traffic beyond VM capacity requires manual upgrade. Acceptable for pre-launch MVP.
Single point of failure -- VM outage takes down all services. Cloud SQL provides its own HA independently.
No preview environments -- FAH provided per-PR deploys. Not available without additional tooling.

Risks

VM resource exhaustion -- mitigated by trivial upgrade path to 8 GB or 16 GB
Single VM at production scale -- at thousands of concurrent users, extract services back to Cloud Run (each is already containerized) or add a load balancer with an instance group
Caddy as non-standard -- less common than Nginx, but the 10-line Caddyfile is simpler than typical Nginx configs

Validation

Rule	Enforcement
All services healthy	Cloud Monitoring alerts: CPU > 85%, Memory > 90%, Disk > 80%, HTTPS uptime
No public SSH	Terraform firewall: only ports 80 and 443 are open publicly; SSH is allowed only from the IAP range (35.235.240.0/20)
Cloud SQL restricted to VM	Terraform authorized networks: VM static IP only (`/32`)
Deploy gates on CI	`deploy-prod` requires `quality`, `build-push`, and `migrate-prod` to pass
Dashboards not public	No Caddy proxy rules for port 8288 or 4983; SSH tunnel only
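The alert thresholds in the first row can be read as a simple rule set. The sketch below is a plain-Python illustration of those rules, not a Cloud Monitoring API call. The metric keys and the function name are hypothetical, and treating the thresholds as strict "greater than" comparisons follows the way they are written in the table.

```python
# Thresholds from the validation table, in percent.
THRESHOLDS = {"cpu": 85.0, "memory": 90.0, "disk": 80.0}


def firing_alerts(metrics: dict, https_up: bool) -> list:
    """Return the names of alerts that would fire for the given readings.

    `metrics` maps "cpu", "memory", and "disk" to utilization percentages;
    a missing metric is treated as 0 and therefore never fires.
    """
    alerts = [
        name for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]
    if not https_up:
        alerts.append("https_uptime")
    return alerts


if __name__ == "__main__":
    print(firing_alerts({"cpu": 90, "memory": 50, "disk": 81}, https_up=True))
    # ['cpu', 'disk']
```

A reading exactly at a threshold (for example CPU at 85%) does not fire under this reading of the rules.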

References

Source ADR: docs/architecture/decisions/008-compute-vm-docker-caddy.md
Related: Docker Containers, Caddy Reverse Proxy, Networking
Related ADRs: ADR-006 (Inngest self-hosted), ADR-009 (Search -- Typesense + pgvector), ADR-012 (CI/CD pipeline), ADR-013 (Local Docker + GCP parity)
Linear: TRO-54 (VM infrastructure), TRO-11 (original FAH setup)