ADR-008: Compute -- Single VM + Docker Compose + Caddy
Decision record for choosing a single Compute Engine VM over Cloud Run and Firebase App Hosting.
Status: Accepted (supersedes Week 0 decision 5.1: Firebase App Hosting)
Date: 2026-03-29 (TRO-54 created), implemented 2026-03-30
Deciders: Kyle Olson (Solo Founder)
Decision
Run all Trovella services on a single Compute Engine VM (e2-custom-2-6144, 2 vCPU, 6 GB RAM, ~$37/month) using Docker Compose for orchestration and Caddy for reverse proxying with automatic Let's Encrypt TLS.
Services on the VM: Next.js (custom image), Caddy (off-the-shelf), Cloud SQL Auth Proxy, Typesense 27.1, Inngest, and future additions.
Context
Trovella originally chose Firebase App Hosting (FAH) for zero-config deploys. FAH was framed as Cloud Run underneath with a ~2-hour ejection path when limits were hit.
The actual reason for leaving was not any of the planned ejection triggers (canary deploys, Cloud Armor, GPU, microservices). As the research engine architecture took shape, it became clear that Trovella was a multi-service system, not a single-app deployment. The MVP required Next.js, Typesense, Inngest, and Langfuse running together with sub-millisecond inter-service communication and shared persistent disk. FAH can only host one Next.js app -- the other services would always need to live somewhere else, creating cross-network latency and multiple deployment targets.
FAH also accumulated operational friction during Phase 0:
- Interactive Console required -- FAH backend creation required a browser-based GitHub OAuth flow that could not be automated via Terraform or gcloud
- CVE checker rejected catalog: specifiers -- FAH's adapter reads package.json literally, not through pnpm resolution. Three failed deployments before discovering the workaround.
- Split CI/CD pipeline -- GitHub Actions ran CI, FAH ran build+deploy independently. No single dashboard showed both. CI failures did not block FAH deploys.
- Secret access grants needed Firebase-specific commands -- standard gcloud IAM bindings were insufficient
- FAH continued deploying after removal -- even after removing apphosting.yaml and Terraform modules, FAH watched the main branch. Required explicit firebase apphosting:backends:delete --force.
Decision Drivers
- Zero cold starts -- research workflows chain 5-6 service calls per step; 2-3 second cold starts per service are unacceptable
- Sub-millisecond inter-service communication -- Docker internal network vs 5-20 ms Cloud Run cross-service hops
- Persistent disk -- Typesense index must survive restarts without ephemeral filesystem workarounds
- Single deployment target -- one docker compose pull && docker compose up -d, one pipeline, one dashboard
- Predictable flat cost -- one VM = one monthly bill, no per-request pricing surprises
Alternatives Considered
Separate Cloud Run Services (Original Plan)
Cloud Run with FAH handling Next.js deploys. Each service scales independently. ~$25-37/month.
Rejected because: 2-3 second cold starts on each service. 5-20 ms inter-service latency per hop (25-100 ms for a 5-step workflow). Ephemeral filesystem means Typesense loses its index on restart. Split pipeline with no unified view.
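The 25-100 ms figure is the hop count multiplied by the per-hop latency range. A minimal shell sketch of that arithmetic follows; the function name is hypothetical, and it assumes the hops run one after another so their latencies add.

```sh
#!/bin/sh
# Hypothetical helper: total added latency (ms) for a workflow whose
# cross-service calls run sequentially.
#   $1 = number of cross-service hops
#   $2 = latency per hop in milliseconds
workflow_latency_ms() {
  echo $(( $1 * $2 ))
}

# Cloud Run range quoted above: 5-20 ms per hop, 5-step workflow.
echo "best case:  $(workflow_latency_ms 5 5) ms"
echo "worst case: $(workflow_latency_ms 5 20) ms"
```

With six calls per step, the upper end of the range given under Decision Drivers, the same calculation gives 30-120 ms.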
Typesense on Compute Engine + Rest on Cloud Run
Persistent disk for Typesense, Cloud Run for the app. ~$33-51/month.
Rejected because: two deployment targets, cross-network calls still exist for the hot path. If a VM is already needed for Typesense, consolidating everything on it is simpler.
Stay on FAH, Fix the Friction
No migration effort, built-in CDN and preview environments.
Rejected because: the multi-service architecture was the core requirement. FAH is designed for single-app deployments and cannot host the full service stack regardless of friction.
Key Implementation Decisions
Separate containers, not a monolithic image
Each service runs in its own Docker container. Only the Next.js app uses a custom-built image; Caddy, Typesense, and Inngest use off-the-shelf images. docker compose pull is a no-op for unchanged images (same digest), so typical deploys only restart the web container.
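A deploy along these lines might look like the sketch below. It is not the project's actual deploy script: the run and deploy helpers are hypothetical, and the script assumes it is executed in the directory on the VM that holds the Compose file. It defaults to a dry run that only prints the commands.

```sh
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical deploy step built from the two commands named in this ADR.
deploy() {
  # Fetch newer images; images whose digest is unchanged are not re-downloaded.
  run docker compose pull &&
  # Recreate only containers whose image or configuration changed;
  # the rest keep running.
  run docker compose up -d
}

deploy
```

Because docker compose up -d leaves unchanged containers alone, a release that only rebuilds the Next.js image should not restart Caddy, Typesense, or Inngest.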
Caddy over Nginx or Traefik
Caddy was chosen for automatic Let's Encrypt with zero configuration -- a ~10-line Caddyfile handles HTTPS, www redirect, and reverse proxying. Nginx would require certbot and cron-based renewal. Traefik adds complexity for Docker label-based routing that is unnecessary with a fixed set of services.
IAP SSH only
No public SSH port. All administrative access goes through gcloud compute ssh via Identity-Aware Proxy. Admin dashboards (Inngest on 8288, Drizzle Studio on 4983) are accessed via SSH tunnels.
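Reaching a dashboard through such a tunnel could look like the following sketch. The instance name and zone are placeholders, the tunnel_cmd helper is hypothetical, and the example assumes the dashboard port is reachable on the VM's loopback interface, which this ADR does not state. The helper only prints the command so it can be reviewed before it is run.

```sh
#!/bin/sh
# Hypothetical helper: print the gcloud command that forwards a local port
# to the same port on the VM through Identity-Aware Proxy.
#   $1 = instance name (placeholder)
#   $2 = zone (placeholder)
#   $3 = port to forward, e.g. 8288 (Inngest) or 4983 (Drizzle Studio)
tunnel_cmd() {
  # Everything after the bare "--" is passed to ssh:
  #   -N  do not run a remote command, just forward
  #   -L  forward local port $3 to localhost:$3 on the VM
  echo "gcloud compute ssh $1 --zone $2 --tunnel-through-iap -- -N -L $3:localhost:$3"
}

tunnel_cmd my-vm us-central1-a 8288
```

While the printed command is running, the Inngest dashboard would be available at http://localhost:8288 on the administrator's machine.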
Cloud SQL via Auth Proxy sidecar
The application connects to cloud-sql-proxy:5432 on the Docker network. The proxy handles TLS and IAM authentication. Cloud SQL authorized networks still restrict access to the VM's static IP as defense-in-depth.
Memory budget sized for current + near-term services
Total estimated usage of 3.5-5.5 GB fits in the 6 GB VM. Upgrading to e2-standard-2 (8 GB, ~$49/month) takes a single gcloud command and a ~2 min restart; that ease is what resolved the founder's concern about the tight memory budget.
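The upgrade path can be sketched as below. The instance name and zone are placeholders and the run and resize_vm helpers are hypothetical. The machine-type change itself is one command, but Compute Engine only accepts it while the instance is stopped, so the sketch wraps it in a stop and a start. It defaults to a dry run.

```sh
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: stop the VM, change its machine type, start it again.
#   $1 = instance name, $2 = zone, $3 = target machine type
resize_vm() {
  run gcloud compute instances stop "$1" --zone "$2" &&
  run gcloud compute instances set-machine-type "$1" --zone "$2" --machine-type "$3" &&
  run gcloud compute instances start "$1" --zone "$2"
}

resize_vm my-vm us-central1-a e2-standard-2
```

All services are down between the stop and the start, which corresponds to the ~2 min restart mentioned above.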
Consequences
Positive
- Zero cold starts -- all services always warm
- Sub-millisecond inter-service communication via Docker network
- Unified deployment -- one target, one pipeline, one monitoring setup
- Persistent disk for Typesense, Inngest, and future data
- Predictable ~$37/month flat cost
Negative
- No auto-scaling -- traffic beyond VM capacity requires manual upgrade. Acceptable for pre-launch MVP.
- Single point of failure -- VM outage takes down all services. Cloud SQL provides its own HA independently.
- No preview environments -- FAH provided per-PR deploys. Not available without additional tooling.
Risks
- VM resource exhaustion -- mitigated by trivial upgrade path to 8 GB or 16 GB
- Single VM at production scale -- at thousands of concurrent users, extract services back to Cloud Run (each is already containerized) or add a load balancer with an instance group
- Caddy is less widely used than Nginx -- mitigated by the ~10-line Caddyfile, which is simpler than a typical Nginx config
Validation
| Rule | Enforcement |
|---|---|
| All services healthy | Cloud Monitoring alerts: CPU > 85%, Memory > 90%, Disk > 80%, HTTPS uptime |
| No public SSH | Terraform firewall: only ports 80, 443, IAP SSH (35.235.240.0/20) |
| Cloud SQL restricted to VM | Terraform authorized networks: VM static IP only (/32) |
| Deploy gates on CI | deploy-prod requires quality, build-push, and migrate-prod to pass |
| Dashboards not public | No Caddy proxy rules for port 8288 or 4983; SSH tunnel only |
References
- Source ADR: docs/architecture/decisions/008-compute-vm-docker-caddy.md
- Related: Docker Containers, Caddy Reverse Proxy, Networking
- Related ADRs: ADR-006 (Inngest self-hosted), ADR-009 (Search -- Typesense + pgvector), ADR-012 (CI/CD pipeline), ADR-013 (Local Docker + GCP parity)
- Linear: TRO-54 (VM infrastructure), TRO-11 (original FAH setup)