Trovella Wiki

Search & Retrieval

Hybrid search system combining BM25 keyword search (Typesense) with semantic vector search (pgvector), merged via Reciprocal Rank Fusion.

The Search & Retrieval domain covers the hybrid search system that lets users find research content through both exact keyword queries and natural language questions. Content flows through an indexing pipeline (write path) and is queried through a fusion pipeline (read path).

Topics

Indexing

Document chunking, contextual prefix generation via Claude Haiku, embedding generation via Gemini, and dual-write to pgvector and Typesense. Covers the full write path, from the search/content.created event through the index-content Inngest function.

Query Pipeline

How search queries flow through classification, parallel BM25 + vector search, and result fusion. Covers the full read path from tRPC input through to fused results.
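The parallel step of the read path can be sketched roughly as follows. This is a minimal illustration, not the actual router code: `SearchFn`, `hybridSearch`, and the injected `fuse` callback are hypothetical names, and query classification is left out because its rules are not described on this page.

```typescript
// Hypothetical signature: each engine takes a query and a limit and
// resolves to document IDs ordered best-first.
type SearchFn = (query: string, limit: number) => Promise<string[]>;

async function hybridSearch(
  query: string,
  keywordSearch: SearchFn, // stand-in for the Typesense BM25 query
  vectorSearch: SearchFn, // stand-in for the pgvector similarity query
  fuse: (lists: string[][]) => string[], // stand-in for the RRF step
  limit = 10,
): Promise<string[]> {
  // Start both searches at once; the wait is roughly that of the slower engine,
  // not the sum of the two.
  const [keywordIds, vectorIds] = await Promise.all([
    keywordSearch(query, limit),
    vectorSearch(query, limit),
  ]);

  // Merge the two ranked lists and trim to the requested page size.
  return fuse([keywordIds, vectorIds]).slice(0, limit);
}
```

Passing the engines and the fusion step in as functions keeps the sketch independent of any particular client library.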

Relevance

Scoring algorithms, RRF tuning parameters, query classification heuristics, result evaluation, and quality metrics. Focused on the quality of search results rather than the mechanics of the pipeline.

Architecture

The system runs two search engines in parallel:

  • Typesense -- BM25 keyword search with typo tolerance, prefix matching for type-ahead, and faceted filtering
  • pgvector -- Cosine similarity search on halfvec(1536) embeddings with HNSW indexing
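For readers unfamiliar with the metric, cosine similarity can be computed by hand as in the sketch below. It is only an illustration of the math: in the real system the comparison happens inside PostgreSQL, where pgvector exposes cosine distance (1 minus similarity) through its `<=>` operator and the HNSW index returns approximate nearest neighbors. The function name is hypothetical.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Ranges from -1 (opposite direction) to 1 (same direction).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Note: a zero-length vector makes the denominator 0 and the result NaN.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the score depends only on direction, not magnitude, two embeddings of different lengths that point the same way still score 1.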

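Results from both engines are merged via Reciprocal Rank Fusion (RRF), which assigns each result a score of 1/(k + rank) per source, where rank is the result's 1-based position in that engine's list and k is a smoothing constant. A document that appears in both the keyword and semantic results accumulates a score from each, so it ranks above documents found by only one engine at similar positions.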
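The fusion step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `@repo/search` implementation: the function and type names are hypothetical, and k = 60 is a commonly used default rather than a value confirmed by this page.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Each input list holds document IDs ordered best-first by one engine.
// k = 60 is an assumed default; the tuned value is not stated here.
function reciprocalRankFusion(lists: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest accumulated score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: doc-a and doc-b appear in both lists, so they outrank
// doc-c and doc-d, which each appear in only one.
const keywordHits = ["doc-a", "doc-b", "doc-c"];
const vectorHits = ["doc-b", "doc-d", "doc-a"];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
// fused order: doc-b, doc-a, doc-d, doc-c
```

Because RRF uses only rank positions, the raw BM25 and cosine scores never need to be normalized against each other.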

See the Hybrid Search Pipeline architecture diagram for the full flow.

Key Packages

Package        Role
@repo/search   Typesense client, collection schema, keyword search, type-ahead, RRF fusion
@repo/ai       Text chunking, contextual retrieval (Haiku), embedding generation (Gemini)

Cross-Domain References

  • Data & Storage -- document_chunk table schema, index-content background job, and event patterns
  • Infrastructure -- Typesense Docker container on the Compute Engine VM, Cloud SQL hosting pgvector
  • Identity & Access -- RLS tenant isolation on document_chunk table; Typesense filtered by organization_id
  • Research & Intelligence -- MCP tools that emit search/content.created events, hybrid search tRPC router
