ADR-009: Typesense + pgvector Hybrid Search

Why Trovella uses Typesense for BM25 keyword search and pgvector for semantic search, merged via Reciprocal Rank Fusion.

Status: Accepted
Date: 2026-03-29 (decision), implemented 2026-03-30 (TRO-55) and 2026-04-02 (TRO-58)
Deciders: Kyle Olson (Solo Founder)

Trovella needed search that combines keyword matching (typo tolerance, type-ahead, faceted filtering) with semantic understanding (vector similarity) for research content. The decision was to use Typesense for BM25 keyword search and pgvector for semantic search, merged via Reciprocal Rank Fusion (RRF). At MVP scale, PostgreSQL FTS alone would have been sufficient for keyword search -- Typesense is a bet on needing typo tolerance, type-ahead, and faceted filtering as the product grows.

Decision

Component	Choice
Keyword search	Typesense 27.1, self-hosted on the Compute Engine VM, index entirely in RAM
Semantic search	pgvector in Cloud SQL with `halfvec(1536)` HNSW index
Fusion	Reciprocal Rank Fusion (RRF) with query-dependent weighting
Embeddings	Gemini Embedding 2 at 1536 dimensions
Contextual retrieval	Claude Haiku generates 2-3 sentence context prefix per chunk before embedding
Chunking	Recursive character splitting at 512 tokens with 50-token overlap
Indexing	Inngest background job triggered by MCP tool events
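
The chunk size and overlap in the table imply a stride of 512 - 50 = 462 tokens between chunk starts. The sketch below illustrates only that arithmetic with a plain sliding window over pre-tokenized text; it is a simplification, not the recursive character splitter itself, and the function name `chunkTokens` is hypothetical.

```typescript
// Simplified sliding-window chunker (an assumption for illustration).
// It shows the size/overlap arithmetic only; recursive character splitting
// additionally prefers paragraph and sentence boundaries.
function chunkTokens(tokens: string[], size = 512, overlap = 50): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap; // 462 with the defaults
  const chunks: string[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once the current window already reaches the end of the input.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}
```

With the defaults, each chunk after the first begins with the last 50 tokens of the previous chunk.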

Context

Trovella's research engine stores artifacts (findings, analyses, extraction results) that users need to search across. Search must handle both precise keyword queries ("AAPL 10-K filing 2025") and natural language questions ("what are the risks in Tesla's supply chain?"). Neither keyword search nor semantic search alone handles both well -- keyword misses synonyms and paraphrases, semantic misses exact terms and identifiers.

The search architecture emerged from the research capability maturity analysis (TRO-53), which identified hybrid search as a Level 3 capability needed for the MVP. Typesense was chosen over Elasticsearch in the architecture decisions document before implementation began.

Was Typesense necessary?

The founder challenged whether Typesense was necessary at all given PostgreSQL's built-in full-text search. At MVP scale (hundreds of artifacts), PostgreSQL FTS would work fine for keyword search. Typesense adds typo tolerance, prefix matching for type-ahead, and faceted count aggregation in a single query, none of which PG FTS provides natively. Since Typesense was already deployed on the VM, the marginal cost of using it was near zero. The RRF fusion layer is provider-agnostic -- if Typesense were removed, PG FTS could serve as the keyword leg with minimal code change.

Decision Drivers

Hybrid retrieval -- keyword and semantic search serve different query types; both are needed
Typo tolerance and type-ahead -- Typesense provides these natively; PostgreSQL FTS does not
Zero incremental cost -- Typesense runs on the existing VM; pgvector runs in the existing Cloud SQL
Faceted filtering -- count aggregations by source type, artifact type, date range without separate queries
RLS-compatible -- pgvector queries inherit RLS from withTenantContext(); Typesense filters by organization_id
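
Because tenant scoping on the pgvector side comes from RLS, the semantic query itself needs no organization predicate. The following is a minimal sketch of how such a query might be assembled, assuming a hypothetical `artifact_chunks` table with an `embedding` column of type `halfvec(1536)`; the function names are illustrative, and the query is assumed to run inside `withTenantContext()`, which is not shown.

```typescript
const EMBEDDING_DIMENSIONS = 1536;

interface SqlQuery {
  text: string;
  values: Array<string | number>;
}

// pgvector accepts vectors in the text form "[v1,v2,...]".
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`expected ${EMBEDDING_DIMENSIONS} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((v) => Number.isFinite(v))) {
    throw new Error("embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// `<=>` is pgvector's cosine-distance operator. Table and column names are
// hypothetical. No organization filter appears here because RLS is assumed
// to scope the rows.
function buildSemanticQuery(embedding: number[], limit: number): SqlQuery {
  return {
    text:
      `SELECT id FROM artifact_chunks ` +
      `ORDER BY embedding <=> $1::halfvec(${EMBEDDING_DIMENSIONS}) LIMIT $2`,
    values: [toVectorLiteral(embedding), limit],
  };
}
```

Passing the vector as a bound parameter rather than interpolating it keeps the query text constant across calls.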

Alternatives Considered

PostgreSQL Full-Text Search Only (no Typesense)

Pros: Zero additional infrastructure. Fully integrated with RLS. Simpler architecture.
Cons: No typo tolerance (misspellings return zero results). No prefix matching for type-ahead. No faceted count aggregation without separate queries.
Verdict: Would have been sufficient for MVP. Typesense's marginal cost was near zero since it was already deployed. The RRF fusion layer can swap Typesense for PG FTS if the operational overhead isn't justified.

Elasticsearch

Pros: Most mature full-text search engine, largest ecosystem, ML-powered relevance tuning.
Cons: 4x more expensive than Typesense for equivalent workloads. JVM-based with high memory footprint. Complex operational model (cluster management, index lifecycle, shard tuning).
Rejected: Overkill for solo-founder operations at MVP scale.

Dedicated Vector Database (Pinecone, Weaviate)

Pros: Purpose-built for vector search, managed infrastructure.
Cons: Additional infrastructure cost ($25-70+/month). Separate tenant isolation mechanism needed. No transactional consistency with application data.
Rejected: pgvector in the existing Cloud SQL instance provides vector search at $0 incremental cost with automatic RLS tenant isolation.

Implementation Decisions

Infrastructure first, schema deferred (TRO-55 then TRO-58)

TRO-55 was scoped to infrastructure only -- Docker containers, the @repo/search package with a client singleton, and health check integration. The collection schema and sync logic were deferred to TRO-58 because the data model had not been decided yet, which avoided building the index on assumptions about it.

RRF fusion with query-dependent weighting

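RRF merges keyword and semantic results by rank rather than by raw score: each document scores sum(1/(k + rank)) across the result lists it appears in, where rank is its 1-based position in a list and k=60 (Cormack et al. 2009). Query classification adjusts the weight given to each list: short queries (1-3 words) favor keyword, long queries (7+ words) favor semantic, and mid-length queries (4-6 words) use balanced weighting.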
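
The fusion step can be sketched as follows. This is an illustration under stated assumptions, not the production code: the weight values (0.7/0.3 and 0.5/0.5), the function names, and the choice to apply weights as a multiplier on each list's 1/(k + rank) term are all hypothetical. Only k=60 and the word-count thresholds come from this ADR.

```typescript
type RankedList = string[]; // document IDs, best match first

interface FusionWeights {
  keyword: number;
  semantic: number;
}

const RRF_K = 60; // smoothing constant from Cormack et al. 2009

// Thresholds follow the ADR; the weight values are illustrative assumptions.
function weightsForQuery(query: string): FusionWeights {
  const words = query.trim().split(/\s+/).filter(Boolean).length;
  if (words <= 3) return { keyword: 0.7, semantic: 0.3 };
  if (words >= 7) return { keyword: 0.3, semantic: 0.7 };
  return { keyword: 0.5, semantic: 0.5 };
}

function fuse(
  keywordIds: RankedList,
  semanticIds: RankedList,
  weights: FusionWeights,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  const accumulate = (ids: RankedList, weight: number): void => {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (RRF_K + rank));
    });
  };
  accumulate(keywordIds, weights.keyword);
  accumulate(semanticIds, weights.semantic);
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because `fuse` consumes only ranked ID lists, the keyword leg could come from Typesense or PG FTS without changes to the fusion code, and a document that appears in both lists outranks one that appears at the same position in only one.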

Contextual retrieval with Haiku from day one

Each chunk gets a 2-3 sentence context prefix generated by Claude Haiku before embedding, implementing Anthropic's contextual retrieval technique (Anthropic reports about 49% fewer failed retrievals when contextual embeddings are combined with contextual BM25). The founder chose to build this immediately rather than defer it. Haiku and Gemini embedding calls are linked via correlationId for cost analysis.
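
A minimal sketch of how the prefix and chunk might be combined into the text sent to the embedding model. The types, the function name, and the correlationId format (`documentId:index`) are assumptions for illustration; the Haiku and Gemini calls themselves are left out.

```typescript
interface Chunk {
  documentId: string;
  index: number;
  text: string;
}

interface EmbeddingInput {
  correlationId: string; // shared by the Haiku call and the embedding call
  text: string; // what would be sent to the embedding model
}

// Prepends the generated context to the chunk. If generation returned nothing,
// the chunk is embedded as-is rather than dropped.
function buildEmbeddingInput(chunk: Chunk, contextPrefix: string): EmbeddingInput {
  const correlationId = `${chunk.documentId}:${chunk.index}`; // hypothetical format
  const prefix = contextPrefix.trim();
  const text = prefix.length > 0 ? `${prefix}\n\n${chunk.text}` : chunk.text;
  return { correlationId, text };
}
```

Deriving the correlationId from the chunk identity, rather than generating a random value per call, would let both API calls for the same chunk be joined in cost reports without passing extra state around.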

Pre-computed seed embeddings

Seed data uses pre-committed embedding vectors (packages/db/src/seeds/fixtures/seed-embeddings.json) instead of calling the Gemini API every time seed data is loaded. The Inngest indexing trigger no-ops during seeding.

Typesense not exposed publicly

Typesense listens on port 8108 on the Docker internal network only. No Caddy proxy rule exposes it. Search queries go through the tRPC hybridSearch router, which queries both Typesense and pgvector server-side.
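
Since every Typesense query originates on the server, the tenant filter can be attached there and never depends on client input. A sketch of the search parameters such a router might build, assuming hypothetical field names (`title`, `content`, `source_type`, `artifact_type`) and a hypothetical helper name; `q`, `query_by`, `filter_by`, `facet_by`, `prefix`, and `num_typos` are standard Typesense search parameters.

```typescript
interface KeywordSearchParams {
  q: string;
  query_by: string;
  filter_by: string;
  facet_by: string;
  prefix: boolean;
  num_typos: number;
}

// Builds parameters for the keyword leg. The organization ID is validated
// before being interpolated so it cannot alter the filter expression.
function buildKeywordSearchParams(query: string, organizationId: string): KeywordSearchParams {
  if (!/^[A-Za-z0-9_-]+$/.test(organizationId)) {
    throw new Error("organizationId contains characters unsafe for filter_by");
  }
  return {
    q: query,
    query_by: "title,content", // hypothetical collection fields
    filter_by: `organization_id:=${organizationId}`, // exact-match tenant filter
    facet_by: "source_type,artifact_type", // facet counts in the same request
    prefix: true, // treat the last word as a prefix for type-ahead
    num_typos: 2, // typo tolerance
  };
}
```

The returned object has the shape expected by a Typesense document search call, so typo tolerance, prefix matching, and facet counts all arrive in one round trip.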

Consequences

Positive

Hybrid search quality -- keyword handles exact matches, semantic handles natural language, RRF combines both
$0 incremental cost -- both engines run on existing infrastructure
Typo tolerance and type-ahead from Typesense, natively
Faceted filtering without separate queries
Tenant isolation preserved -- pgvector inherits RLS; Typesense filters by organization_id

Negative

Typesense may be unnecessary at MVP scale -- a bet on future need, not proven current requirement
Dual-write complexity -- every content change must update both pgvector and Typesense
RAM-resident index -- Typesense keeps its entire index in memory (negligible at MVP, grows with scale)
Two search systems for developers to understand

Risks

Typesense project health -- Typesense is developed by a venture-funded startup. Migration to Meilisearch or PG FTS is possible via the provider-agnostic RRF layer.
Index drift -- pgvector and Typesense diverge if the Inngest job fails for one store but succeeds for the other. Mitigated by step-level retries and the ability to re-index from source data.
Embedding model changes -- switching models requires re-embedding all content. Mitigated by batch reprocessing via the Inngest pipeline.

References

Architecture: Hybrid Search Pipeline
Indexing Overview
Linear: TRO-55 (Typesense infrastructure), TRO-58 (Hybrid search fusion), TRO-53 (Research capability maturity)
Related ADRs: ADR-001 (Database, pgvector), ADR-006 (Inngest), ADR-007 (AI, Gemini embeddings), ADR-008 (Compute, VM)