ADR-007: AI Integration -- Claude API + Gemini Embeddings

Decision record for choosing a thin @repo/ai wrapper around Anthropic Claude (chat) and Gemini Embedding 2 (vectors) with automatic usage tracking.

Status: Accepted
Date: 2026-03-21 (Week 0 Decision Sprint), modified 2026-03-28 (TRO-22), modified 2026-04-02 (TRO-58 embeddings)
Deciders: Kyle Olson (Solo Founder)

Decision

Chat/reasoning: Anthropic Claude API via @repo/ai wrapper -- Sonnet 4.6 (default), Opus 4.6 (complex tasks), Haiku 4.5 (high-volume lightweight tasks)
Embeddings: Gemini Embedding 2 at 1536 dimensions (MRL-truncated from 3072) via @repo/ai
Wrapper pattern: Thin wrapper -- Anthropic SDK types flow through, no custom abstraction layer
Usage tracking: Automatic on every call -- ai_usage (tokens, cost, latency) + ai_call_details (full request/response JSONB)
Model registry: Database tables (ai_model, ai_model_pricing) with effective dates -- not hardcoded config
Multi-provider: Single-provider (Anthropic) for chat at MVP; multi-provider deferred until a second provider is actually needed
SDK boundary: Only @repo/ai may import @anthropic-ai/sdk or @google/genai -- enforced via ESLint no-restricted-imports and dependency-cruiser
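
The SDK boundary rule could be expressed roughly as follows. This is a minimal sketch assuming ESLint flat config; the file globs, the `packages/ai` location, and the `aiSdkBoundary` name are hypothetical, and the repository's actual configuration may differ.

```typescript
// Sketch of an ESLint flat-config fragment. Paths and names are assumptions.
const RESTRICTED_SDKS = ["@anthropic-ai/sdk", "@google/genai"];
const MESSAGE =
  "Import AI functionality from @repo/ai so usage tracking is not bypassed.";

const aiSdkBoundary = [
  {
    // Default: no file may import the provider SDKs directly.
    files: ["**/*.ts", "**/*.tsx"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          // Exact module names.
          paths: RESTRICTED_SDKS.map((name) => ({ name, message: MESSAGE })),
          // Subpath imports such as "@anthropic-ai/sdk/...".
          patterns: [
            {
              group: RESTRICTED_SDKS.map((name) => `${name}/*`),
              message: MESSAGE,
            },
          ],
        },
      ],
    },
  },
  {
    // Assumed location of the wrapper package: the only place the rule is lifted.
    files: ["packages/ai/**/*.ts"],
    rules: { "no-restricted-imports": "off" },
  },
];
```

Listing the override after the general rule matters in flat config, because later entries win for files matched by both.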

Context

AI-powered research is Trovella's core product. Every feature -- research plans, data extraction, contextual retrieval, document generation -- makes LLM calls. This creates two requirements traditional SaaS apps don't have:

Cost visibility -- AI API costs are the primary variable expense; every call must be tracked with feature attribution for metering and cost optimization
Single entry point -- if any package can import the Anthropic SDK directly, calls bypass usage tracking, billing data is wrong, and logging is missing

The MCP-first architecture (ADR-010) also shapes this decision. Users do heavy LLM work inside their subsidized Claude/ChatGPT subscriptions via Trovella's MCP server -- Trovella enhances that work through structure, memory, and skills rather than competing on raw LLM access. This means Trovella's own API usage is for internal operations (contextual retrieval, extraction, summarization), not wholesale chat relay.

Decision Drivers

Usage tracking on every call -- cost attribution by feature is required for metering and business decisions
Single wrapper package -- prevent direct SDK imports that bypass tracking
Thin abstraction -- Anthropic SDK types flow through; no premature multi-provider interface until a second provider is actually needed
Embedding quality for retrieval -- benchmarks matter more than price; retrieval accuracy directly affects product quality
Database-backed model registry -- prices change, models are added/deprecated; config must be queryable and versioned

Alternatives Considered

Wrapper Abstraction Level

Provider-agnostic abstraction -- a custom interface hiding all provider types behind a Trovella abstraction. Rejected because it violates the progressive complexity principle. The provider column in ai_model accommodates future providers without requiring the wrapper abstraction today. Kyle's note: "I choose option A for decision 1, but know that we will expand to use more AI model vendors in the future."

Vercel AI SDK -- pre-built provider switching and React streaming hooks. Rejected because it adds another streaming abstraction that conflicts with the existing tRPC streaming pattern, and the complexity is not justified for single-provider use.

Embedding Model

Cohere Embed v4 -- 128K context window, $0.12/M tokens. Initially recommended but rejected after benchmark analysis: its MTEB Retrieval score was 57.6 versus 67.7 for Gemini Embedding 2, a gap of about 10 points. Research also showed that the 128K context is counterproductive for retrieval: chunking to 512 tokens improved retrieval by 24.5% compared with whole-document embedding.

OpenAI text-embedding-3-large -- well-known, 3072 dimensions, $0.13/M tokens. Rejected because it had the lowest MTEB Overall score of the candidates (64.6) and would introduce a cross-cloud dependency into a GCP-committed project.

Usage Tracking Approach

Event-based tracking (consumers track their own usage) -- simpler wrapper, consumers decide what to log. Rejected because a single missed tracking call corrupts billing data. Recording usage inside the wrapper, before results are returned, means no consumer can skip it.

Key Implementation Decisions

Automatic usage recording on every call

chatCompletion(), streamChat(), submitBatch(), embed(), and all variants automatically write to ai_usage and ai_call_details before returning results. Failed calls write a 0-token record for debugging. The feature field is required on every call for cost attribution. See Usage Tracking for the recording pipeline details.
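
The record-before-return behavior can be sketched as below. This is an illustration under stated assumptions, not the wrapper's actual code: `withUsageTracking`, `UsageRecord`, `ProviderResult`, and the in-memory `usageLog` (standing in for an insert into ai_usage) are all hypothetical names.

```typescript
// Hypothetical record shape; the real ai_usage columns may differ.
interface UsageRecord {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  ok: boolean;
}

interface ProviderResult {
  text: string;
  usage: { inputTokens: number; outputTokens: number };
}

// Stand-in for an INSERT into ai_usage.
const usageLog: UsageRecord[] = [];

async function withUsageTracking(
  meta: { feature: string; model: string }, // feature is required by the type
  call: () => Promise<ProviderResult>,
): Promise<ProviderResult> {
  const startedAt = Date.now();
  try {
    const result = await call();
    // Record before returning, so the caller cannot skip tracking.
    usageLog.push({
      ...meta,
      ...result.usage,
      latencyMs: Date.now() - startedAt,
      ok: true,
    });
    return result;
  } catch (err) {
    // Failed calls still leave a 0-token row for debugging.
    usageLog.push({
      ...meta,
      inputTokens: 0,
      outputTokens: 0,
      latencyMs: Date.now() - startedAt,
      ok: false,
    });
    throw err;
  }
}
```

Because `feature` is a required property of `meta`, omitting it fails type checking rather than producing an unattributed row.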

Model registry as database tables with effective dates

Kyle caught that hardcoded pricing would break: "The model registry needs to be a table. Any of those figures could change at any time." The ai_model table stores model metadata; ai_model_pricing stores pricing by type with effectiveFrom/effectiveUntil date ranges. Cost estimates are calculated at query time using the active pricing record. See Data & Storage -- Reference Data for the schema.
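
A query-time cost estimate over effective-dated pricing might look like the sketch below. Only effectiveFrom and effectiveUntil are named in this ADR; every other field name, the function names, and the prices are assumptions made for illustration.

```typescript
// Hypothetical row shape for ai_model_pricing. Only effectiveFrom and
// effectiveUntil come from this ADR; the other fields are assumptions.
interface PricingRow {
  modelId: string;
  pricingType: "input" | "output";
  usdPerMillionTokens: number;
  effectiveFrom: Date;
  effectiveUntil: Date | null; // null means the row is still active
}

// Find the row whose [effectiveFrom, effectiveUntil) range contains `at`.
function activePrice(
  rows: PricingRow[],
  modelId: string,
  pricingType: PricingRow["pricingType"],
  at: Date,
): PricingRow | undefined {
  const t = at.getTime();
  return rows.find(
    (r) =>
      r.modelId === modelId &&
      r.pricingType === pricingType &&
      r.effectiveFrom.getTime() <= t &&
      (r.effectiveUntil === null || t < r.effectiveUntil.getTime()),
  );
}

function estimateCostUsd(
  rows: PricingRow[],
  modelId: string,
  at: Date,
  inputTokens: number,
  outputTokens: number,
): number {
  const input = activePrice(rows, modelId, "input", at);
  const output = activePrice(rows, modelId, "output", at);
  if (!input || !output) {
    throw new Error(`No active pricing for ${modelId} at ${at.toISOString()}`);
  }
  return (
    (inputTokens * input.usdPerMillionTokens +
      outputTokens * output.usdPerMillionTokens) /
    1_000_000
  );
}

// Illustrative numbers only, not real prices.
const cutover = new Date("2026-04-01T00:00:00Z");
const jan = new Date("2026-01-01T00:00:00Z");
const examplePricing: PricingRow[] = [
  { modelId: "example-model", pricingType: "input", usdPerMillionTokens: 2, effectiveFrom: jan, effectiveUntil: cutover },
  { modelId: "example-model", pricingType: "output", usdPerMillionTokens: 10, effectiveFrom: jan, effectiveUntil: cutover },
  { modelId: "example-model", pricingType: "input", usdPerMillionTokens: 1, effectiveFrom: cutover, effectiveUntil: null },
  { modelId: "example-model", pricingType: "output", usdPerMillionTokens: 5, effectiveFrom: cutover, effectiveUntil: null },
];
```

With these rows, a call made before the cutover is costed at the old rate and a call made after it at the new rate, with no code change in between. The half-open range avoids two rows matching at the exact cutover instant.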

Full request/response stored in `ai_call_details`

Kyle overrode the initial recommendation to exclude prompts: "I do want prompts and responses stored along with thinking block content and details about tool use." The ai_call_details table stores system prompt, full request/response JSONB, thinking blocks, and tool use details. A deferred ticket (TRO-49) covers archival to GCS when storage costs justify it.

`ctx.ai` on `authorizedProcedure`

Inside tRPC routes, ctx.ai provides methods pre-bound with the user's organizationId, userId, and logger. Outside tRPC, use createAIHelper(organizationId, userId, logger). See Reasoning Overview for the entry point patterns.
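
The pre-binding idea can be sketched as a closure over the caller's identity. The `createAIHelper(organizationId, userId, logger)` signature comes from this ADR; the `Logger` and `ChatParams` shapes, the `chatCompletionWithContext` stand-in, and the returned object's layout are assumptions for illustration.

```typescript
// Assumed minimal logger and parameter shapes.
interface Logger {
  info(message: string, fields?: Record<string, unknown>): void;
}

interface CallContext {
  organizationId: string;
  userId: string;
  logger: Logger;
}

interface ChatParams {
  feature: string;
  prompt: string;
}

// Stand-in for the real wrapper call, which would reach the provider
// and record usage. Here it only logs and echoes.
async function chatCompletionWithContext(
  ctx: CallContext,
  params: ChatParams,
): Promise<string> {
  ctx.logger.info("ai.chatCompletion", {
    organizationId: ctx.organizationId,
    userId: ctx.userId,
    feature: params.feature,
  });
  return `echo: ${params.prompt}`;
}

// Bind identity once; callers then pass only per-call parameters.
function createAIHelper(organizationId: string, userId: string, logger: Logger) {
  const ctx: CallContext = { organizationId, userId, logger };
  return {
    chatCompletion: (params: ChatParams) =>
      chatCompletionWithContext(ctx, params),
  };
}
```

In this sketch, a tRPC middleware would build the same helper once per request and expose it as ctx.ai, so route code never handles organizationId or userId directly.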

Separate functions for streaming vs non-streaming

chatCompletion() returns a complete message. streamChat() returns { events, finalMessage }. Callers of streamChat() must await finalMessage for usage to be recorded; reading events alone is not enough. A single overloaded function was rejected because the implementations diverge enough internally that overloaded return types would be harder to reason about.
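
One way a consumer can make the finalMessage requirement hard to forget is a small helper that drains the events and always awaits the final message. This is a sketch: `drainStream` is a hypothetical helper, and the `StreamResult` and `FinalMessage` shapes are simplified assumptions (the real wrapper passes Anthropic SDK types through).

```typescript
// Simplified, assumed shapes for the value returned by streamChat().
interface FinalMessage {
  text: string;
  outputTokens: number;
}

interface StreamResult {
  events: AsyncIterable<string>;
  finalMessage: Promise<FinalMessage>;
}

// Hypothetical helper: forward every event, then always await finalMessage.
async function drainStream(
  stream: StreamResult,
  onEvent: (event: string) => void,
): Promise<FinalMessage> {
  for await (const event of stream.events) {
    onEvent(event);
  }
  // Reading events without this await would leave usage unrecorded.
  return await stream.finalMessage;
}
```

A route handler could then call `drainStream(stream, send)` and never touch finalMessage directly.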

Gemini embeddings at 1536 dimensions (MRL-truncated)

Gemini Embedding 2 outputs 3072 dimensions natively but supports Matryoshka Representation Learning (MRL) for truncation to 768/1536/3072. Trovella uses 1536 as the balance between quality and storage cost, stored as halfvec(1536) in PostgreSQL. Full 3072 dimensions were rejected because they double storage and index memory with diminishing quality returns.
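
MRL truncation amounts to keeping the leading dimensions and re-normalizing to unit length. The sketch below shows the client-side version as an illustration; whether Trovella truncates client-side or asks the provider for 1536 dimensions directly is not stated here, and both function names are hypothetical.

```typescript
// Keep the first `dims` components and rescale to unit length so that
// cosine and inner-product comparisons stay meaningful after truncation.
function truncateEmbedding(full: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > full.length) {
    throw new RangeError(`dims must be an integer in 1..${full.length}`);
  }
  const head = full.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head; // all-zero prefix: nothing to rescale
  return head.map((x) => x / norm);
}

// halfvec stores each dimension as a 16-bit float, so the payload is
// 2 bytes per dimension (a small fixed per-value overhead is ignored).
function halfvecPayloadBytes(dims: number): number {
  return dims * 2;
}
```

At 2 bytes per dimension, 1536 dimensions is 3072 bytes of payload per vector and 3072 dimensions is 6144 bytes, which is the doubling referred to above.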

Contextual retrieval with Haiku from day one

Each document chunk gets a 2-3 sentence context prefix generated by Claude Haiku before embedding. This implements Anthropic's contextual retrieval technique (~49% retrieval failure reduction). Haiku and Gemini embedding calls are linked via correlationId for cost analysis. See Search & Retrieval -- Indexing for the full pipeline.
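
The per-chunk flow can be sketched as two calls sharing one correlationId. The function types below are stand-ins for the Haiku and Gemini calls made through @repo/ai; their names, parameter shapes, and the way the prefix is joined to the chunk are assumptions for illustration.

```typescript
// Stand-ins for the context-generation (Haiku) and embedding (Gemini) calls.
type GenerateContext = (args: {
  document: string;
  chunk: string;
  correlationId: string;
}) => Promise<string>;

type EmbedText = (args: {
  text: string;
  correlationId: string;
}) => Promise<number[]>;

interface IndexedChunk {
  correlationId: string;
  text: string; // context prefix + original chunk, i.e. what was embedded
  embedding: number[];
}

async function indexChunk(
  document: string,
  chunk: string,
  correlationId: string,
  generateContext: GenerateContext,
  embedText: EmbedText,
): Promise<IndexedChunk> {
  // 1. Ask the small model for a short context situating the chunk.
  const context = await generateContext({ document, chunk, correlationId });
  // 2. Embed the prefixed text, not the bare chunk.
  const text = `${context}\n\n${chunk}`;
  // 3. Reuse the same correlationId so both costs can be joined later.
  const embedding = await embedText({ text, correlationId });
  return { correlationId, text, embedding };
}
```

Passing the two calls in as functions keeps the sketch independent of any SDK and makes the shared correlationId easy to check in a test.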

Consequences

Positive

Complete cost visibility -- every AI call is recorded with feature attribution, model, tokens, latency, and estimated cost
Single entry point enforced -- ESLint + dependency-cruiser prevent any package from importing AI SDKs directly
Multi-provider ready without premature abstraction -- ai_model.provider column supports adding providers without rewriting the wrapper
Best-in-class embeddings -- Gemini Embedding 2 scored about 10 MTEB Retrieval points above the nearest competitor evaluated (67.7 vs 57.6 for Cohere Embed v4)
Admin observability -- full admin UI at /admin/ai-logs with dashboard and call detail drill-down. See Application -- AI Logs View.

Negative

Two AI providers to manage -- Anthropic for chat + Google for embeddings means two API keys, two SDKs, two billing accounts
Streaming usage requires await finalMessage -- if a consumer only reads events and does not await, usage is not recorded
ai_call_details storage growth -- full JSONB for every call will grow quickly; GCS archival (TRO-49) is deferred

Risks

Anthropic API pricing changes -- mitigated by database-backed pricing registry with effective dates
Single-provider dependency for chat -- if Anthropic has an outage, all chat features are down; the thin wrapper makes adding a fallback provider a code change, not an architectural change
Embedding model quality regression -- mitigated by the Inngest-based indexing pipeline, which can re-process content in batch

Validation

Rule	Enforcement
Only `@repo/ai` imports `@anthropic-ai/sdk` and `@google/genai`	ESLint `no-restricted-imports` + `dependency-cruiser`
`feature` field required on all AI calls	TypeScript compile error if omitted
Usage recorded on every call	Wrapper writes `ai_usage` before returning; 0-token on failure
tRPC routes use `ctx.ai` not `createAIHelper`	ESLint `no-restricted-imports` in router directories
Model pricing is current	`ai_model_pricing` has effective date columns; admin updates without code changes

References

Linear: TRO-22 (AI API Layer), TRO-58 (Hybrid search -- embedding selection), TRO-49 (AI call content retention, deferred)
Related: ADR-001 (Database -- halfvec(1536) storage), ADR-006 (Inngest -- indexing pipeline), ADR-009 (Search -- hybrid fusion)