ADR-007: AI Integration -- Claude API + Gemini Embeddings
Decision record for choosing a thin @repo/ai wrapper around Anthropic Claude (chat) and Gemini Embedding 2 (vectors) with automatic usage tracking.
Status: Accepted Date: 2026-03-21 (Week 0 Decision Sprint), modified 2026-03-28 (TRO-22), modified 2026-04-02 (TRO-58 embeddings) Deciders: Kyle Olson (Solo Founder)
Decision
- Chat/reasoning: Anthropic Claude API via
@repo/aiwrapper -- Sonnet 4.6 (default), Opus 4.6 (complex tasks), Haiku 4.5 (high-volume lightweight tasks) - Embeddings: Gemini Embedding 2 at 1536 dimensions (MRL-truncated from 3072) via
@repo/ai - Wrapper pattern: Thin wrapper -- Anthropic SDK types flow through, no custom abstraction layer
- Usage tracking: Automatic on every call --
ai_usage(tokens, cost, latency) +ai_call_details(full request/response JSONB) - Model registry: Database tables (
ai_model,ai_model_pricing) with effective dates -- not hardcoded config - Multi-provider: Single-provider (Anthropic) for chat at MVP; multi-provider deferred until a second provider is actually needed
- SDK boundary: Only
@repo/aimay import@anthropic-ai/sdkor@google/genai-- enforced via ESLintno-restricted-imports
Context
AI-powered research is Trovella's core product. Every feature -- research plans, data extraction, contextual retrieval, document generation -- makes LLM calls. This creates two requirements traditional SaaS apps don't have:
- Cost visibility -- AI API costs are the primary variable expense; every call must be tracked with feature attribution for metering and cost optimization
- Single entry point -- if any package can import the Anthropic SDK directly, calls bypass usage tracking, billing data is wrong, and logging is missing
The MCP-first architecture (ADR-010) also shapes this decision. Users do heavy LLM work inside their subsidized Claude/ChatGPT subscriptions via Trovella's MCP server -- Trovella enhances that work through structure, memory, and skills rather than competing on raw LLM access. This means Trovella's own API usage is for internal operations (contextual retrieval, extraction, summarization), not wholesale chat relay.
Decision Drivers
- Usage tracking on every call -- cost attribution by feature is required for metering and business decisions
- Single wrapper package -- prevent direct SDK imports that bypass tracking
- Thin abstraction -- Anthropic SDK types flow through; no premature multi-provider interface until a second provider is actually needed
- Embedding quality for retrieval -- benchmarks matter more than price; retrieval accuracy directly affects product quality
- Database-backed model registry -- prices change, models are added/deprecated; config must be queryable and versioned
Alternatives Considered
Wrapper Abstraction Level
Provider-agnostic abstraction -- a custom interface hiding all provider types behind a Trovella abstraction. Rejected because it violates the progressive complexity principle. The provider column in ai_model accommodates future providers without requiring the wrapper abstraction today. Kyle's note: "I choose option A for decision 1, but know that we will expand to use more AI model vendors in the future."
Vercel AI SDK -- pre-built provider switching and React streaming hooks. Rejected because it adds another streaming abstraction that conflicts with the existing tRPC streaming pattern, and the complexity is not justified for single-provider use.
Embedding Model
Cohere Embed v4 -- 128K context window, $0.12/M tokens. Initially recommended but rejected after benchmark analysis. MTEB Retrieval score was 57.6 vs Gemini's 67.7 (a 10-point gap). Research showed the 128K context is counterproductive for retrieval: chunking to 512 tokens improved retrieval by +24.5% vs whole-document embedding.
OpenAI text-embedding-3-large -- well-known, 3072 dimensions, $0.13/M tokens. Rejected due to the lowest MTEB Overall score (64.6) and the cross-cloud dependency it would introduce into a GCP-committed project.
Usage Tracking Approach
Event-based tracking (consumers track their own usage) -- simpler wrapper, consumers decide what to log. Rejected because a single missed tracking call corrupts billing data. Defense in depth: the wrapper records usage before returning results.
Key Implementation Decisions
Automatic usage recording on every call
chatCompletion(), streamChat(), submitBatch(), embed(), and all variants automatically write to ai_usage and ai_call_details before returning results. Failed calls write a 0-token record for debugging. The feature field is required on every call for cost attribution. See Usage Tracking for the recording pipeline details.
Model registry as database tables with effective dates
Kyle caught that hardcoded pricing would break: "The model registry needs to be a table. Any of those figures could change at any time." The ai_model table stores model metadata; ai_model_pricing stores pricing by type with effectiveFrom/effectiveUntil date ranges. Cost estimates are calculated at query time using the active pricing record. See Data & Storage -- Reference Data for the schema.
Full request/response stored in ai_call_details
Kyle overrode the initial recommendation to exclude prompts: "I do want prompts and responses stored along with thinking block content and details about tool use." The ai_call_details table stores system prompt, full request/response JSONB, thinking blocks, and tool use details. A deferred ticket (TRO-49) covers archival to GCS when storage costs justify it.
ctx.ai on authorizedProcedure
Inside tRPC routes, ctx.ai provides methods pre-bound with the user's organizationId, userId, and logger. Outside tRPC, use createAIHelper(organizationId, userId, logger). See Reasoning Overview for the entry point patterns.
Separate functions for streaming vs non-streaming
chatCompletion() returns a complete message. streamChat() returns { events, finalMessage }. The streaming function must await finalMessage for usage to be recorded. A single overloaded function was rejected because the implementations diverge enough internally that overloaded return types would be harder to reason about.
Gemini embeddings at 1536 dimensions (MRL-truncated)
Gemini Embedding 2 outputs 3072 dimensions natively but supports Matryoshka Representation Learning (MRL) for truncation to 768/1536/3072. Trovella uses 1536 as the balance between quality and storage cost, stored as halfvec(1536) in PostgreSQL. Full 3072 dimensions were rejected because they double storage and index memory with diminishing quality returns.
Contextual retrieval with Haiku from day one
Each document chunk gets a 2-3 sentence context prefix generated by Claude Haiku before embedding. This implements Anthropic's contextual retrieval technique (~49% retrieval failure reduction). Haiku and Gemini embedding calls are linked via correlationId for cost analysis. See Search & Retrieval -- Indexing for the full pipeline.
Consequences
Positive
- Complete cost visibility -- every AI call is recorded with feature attribution, model, tokens, latency, and estimated cost
- Single entry point enforced -- ESLint + dependency-cruiser prevent any package from importing AI SDKs directly
- Multi-provider ready without premature abstraction --
ai_model.providercolumn supports adding providers without rewriting the wrapper - Best-in-class embeddings -- Gemini Embedding 2 scored 10 MTEB retrieval points above the nearest competitor
- Admin observability -- full admin UI at
/admin/ai-logswith dashboard and call detail drill-down. See Application -- AI Logs View.
Negative
- Two AI providers to manage -- Anthropic for chat + Google for embeddings means two API keys, two SDKs, two billing accounts
- Streaming usage requires
await finalMessage-- if a consumer only reads events and does not await, usage is not recorded ai_call_detailsstorage growth -- full JSONB for every call will grow quickly; GCS archival (TRO-49) is deferred
Risks
- Anthropic API pricing changes -- mitigated by database-backed pricing registry with effective dates
- Single-provider dependency for chat -- if Anthropic has an outage, all chat features are down; the thin wrapper makes adding a fallback provider a code change, not an architectural change
- Embedding model quality regression -- mitigated by the Inngest-based indexing pipeline which can re-process content in batch
Validation
| Rule | Enforcement |
|---|---|
Only @repo/ai imports @anthropic-ai/sdk and @google/genai | ESLint no-restricted-imports + dependency-cruiser |
feature field required on all AI calls | TypeScript compile error if omitted |
| Usage recorded on every call | Wrapper writes ai_usage before returning; 0-token on failure |
tRPC routes use ctx.ai not createAIHelper | ESLint no-restricted-imports in router directories |
| Model pricing is current | ai_model_pricing has effective date columns; admin updates without code changes |
References
- Linear: TRO-22 (AI API Layer), TRO-58 (Hybrid search -- embedding selection), TRO-49 (AI call content retention, deferred)
- Related: ADR-001 (Database --
halfvec(1536)storage), ADR-006 (Inngest -- indexing pipeline), ADR-009 (Search -- hybrid fusion)