Relevance Overview

How Trovella scores and ranks hybrid search results -- RRF algorithm, query classification, and quality signals.

Relevance in Trovella's hybrid search system is the measure of how well search results match user intent. The system combines two independent scoring signals -- BM25 keyword matching and cosine vector similarity -- into a single ranked list using Reciprocal Rank Fusion (RRF). This topic covers how those scores are produced, how they are merged, and how to evaluate and tune result quality.

How Scoring Works Today

The search pipeline produces three layers of scores:

1. BM25 Keyword Score (Typesense)

Typesense computes BM25 scores over the title and embedded_text fields in the document_chunks collection. BM25 rewards:

Term frequency -- how often the query terms appear in the document
Inverse document frequency -- rare terms get more weight than common ones
Field length normalization -- shorter documents with matching terms score higher

The BM25 score is a raw number (not normalized to 0-1). Typesense returns results pre-sorted by this score, and the keywordSearch() function in @repo/search preserves both the score and the 1-based rank position.
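To make the three factors above concrete, here is a minimal sketch of the textbook Okapi BM25 formula over a single text field. It is not Typesense's internal scoring code. The function names, the whitespace-style tokenizer, and the k1 = 1.2 and b = 0.75 constants are conventional choices assumed for illustration.

```typescript
// Textbook Okapi BM25 for one field. Illustrative only; NOT Typesense's implementation.
// K1 and B are conventional defaults, assumed here.
const K1 = 1.2;
const B = 0.75;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
}

function bm25Scores(query: string, docs: string[]): number[] {
  const tokenized = docs.map(tokenize);
  const totalLen = tokenized.reduce((sum, d) => sum + d.length, 0);
  const avgLen = totalLen / Math.max(tokenized.length, 1);
  const terms = Array.from(new Set(tokenize(query)));

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      // Term frequency: occurrences of the query term in this document.
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      // Inverse document frequency: terms found in fewer documents weigh more.
      const df = tokenized.filter((d) => d.includes(term)).length;
      const idf = Math.log(1 + (tokenized.length - df + 0.5) / (df + 0.5));
      // Length normalization: longer-than-average documents are penalized.
      const lengthNorm = 1 - B + B * (doc.length / avgLen);
      score += idf * ((tf * (K1 + 1)) / (tf + K1 * lengthNorm));
    }
    return score;
  });
}
```

The output is an unbounded raw score, which is one reason the fusion step works with rank positions instead of comparing these numbers to vector similarities.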

2. Cosine Similarity Score (pgvector)

The semantic search runs a raw SQL query against document_chunk using the pgvector <#> operator, which returns the negative inner product:

(embedding <#> $queryVector::halfvec(1536)) * -1 AS similarity

Multiplying by -1 recovers the inner product, so a higher value means a closer match. The inner product equals cosine similarity only when both vectors are unit-normalized. Embeddings are generated by Gemini (gemini-embedding-2-preview) at 1536 dimensions, stored as halfvec(1536), and indexed with HNSW. Results are sorted by similarity descending, and the router assigns 1-based rank positions.
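The sign handling in the query can be sketched in a few lines. The helper names below are hypothetical and the code only mimics the arithmetic on plain arrays; it does not call pgvector. It also shows that the recovered inner product matches cosine similarity only once both vectors are scaled to unit length.

```typescript
// Plain-array model of the arithmetic in the SQL expression (hypothetical helpers).
function innerProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Mimics pgvector's <#>: the NEGATIVE inner product, so smaller means closer.
function negativeInnerProduct(a: number[], b: number[]): number {
  return -innerProduct(a, b);
}

// Mirrors "(embedding <#> query) * -1 AS similarity": higher means closer.
function similarity(a: number[], b: number[]): number {
  return negativeInnerProduct(a, b) * -1;
}

// Scale a vector to unit length (assumes a non-zero vector).
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(innerProduct(v, v));
  return v.map((x) => x / norm);
}

function cosineSimilarity(a: number[], b: number[]): number {
  return innerProduct(normalize(a), normalize(b));
}
```

For unnormalized vectors, similarity() and cosineSimilarity() can differ, so the ordering matches a cosine ranking only if the embeddings are normalized before storage and at query time.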

3. RRF Fusion Score

Both result sets are fed into Reciprocal Rank Fusion, which ignores the raw scores entirely and works only with rank positions. Each document receives 1/(k + rank) from each source it appears in, where k is a smoothing constant that dampens the gap between adjacent ranks. Documents appearing in both keyword and semantic results accumulate a contribution from each source, so they tend to rank higher than documents found by only one. See RRF Algorithm for the full details.
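A minimal sketch of the fusion step follows. The type names, the fuse() function, and the constant k = 60 are assumptions for illustration (60 is a widely used choice for RRF); the actual value and data shapes are covered in RRF Algorithm.

```typescript
// Hypothetical shapes; only the id and 1-based rank matter for RRF.
interface RankedHit {
  id: string;
  rank: number; // 1-based position within its own result set
}

type Source = "keyword" | "semantic";

interface FusedHit {
  id: string;
  rrfScore: number;
  sources: Source[];
}

const RRF_K = 60; // assumed constant, not confirmed by this page

function fuse(keyword: RankedHit[], semantic: RankedHit[], k: number = RRF_K): FusedHit[] {
  const byId = new Map<string, FusedHit>();

  const accumulate = (hits: RankedHit[], source: Source): void => {
    for (const hit of hits) {
      const entry: FusedHit = byId.get(hit.id) ?? { id: hit.id, rrfScore: 0, sources: [] };
      entry.rrfScore += 1 / (k + hit.rank); // raw engine scores are never consulted
      entry.sources.push(source);
      byId.set(hit.id, entry);
    }
  };

  accumulate(keyword, "keyword");
  accumulate(semantic, "semantic");

  // Highest fused score first.
  return Array.from(byId.values()).sort((a, b) => b.rrfScore - a.rrfScore);
}
```

Because only ranks are used, the unbounded BM25 score and the bounded similarity score never need to be put on a common scale.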

Query Classification

Before running the search, classifyQuery() categorizes the query by word count:

Word Count	Classification	Intent
1-3 words	`keyword`	Exact term lookup (e.g., "notion pricing")
4-6 words	`balanced`	Mixed intent (e.g., "AI research workflow comparison")
7+ words	`semantic`	Natural language question (e.g., "how do competitors handle multi-tenant cost attribution")

Today this classification is returned in the response metadata but does not alter the scoring weights. Both keyword and semantic searches always run in parallel regardless of classification. See Query Classification for implementation details and the planned evolution.
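The thresholds in the table can be sketched as follows. This is an illustration of the word-count rule, not the code in @repo/search: the function name classifyQuerySketch, the whitespace tokenization, and the handling of an empty query are assumptions.

```typescript
type QueryClass = "keyword" | "balanced" | "semantic";

// Sketch of the word-count rule from the table above (hypothetical name).
function classifyQuerySketch(query: string): QueryClass {
  // Assumption: words are whitespace-separated, so "multi-tenant" counts as one word.
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;

  // Assumption: an empty query (0 words) falls into the lowest bucket.
  if (wordCount <= 3) return "keyword";
  if (wordCount <= 6) return "balanced";
  return "semantic";
}
```

Applied to the examples in the table, "notion pricing" has 2 words, "AI research workflow comparison" has 4, and the competitor question has 7, which places them in the keyword, balanced, and semantic buckets respectively.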

Quality Signals in the Current System

The current implementation has several built-in quality mechanisms:

Contextual retrieval prefixes. During indexing, Claude Haiku generates a 2-3 sentence contextual prefix for each chunk (skipped for single-chunk documents). This prefix is prepended to the chunk text before embedding, which anchors the vector representation in the document's broader context rather than just the chunk's local content. This directly improves semantic search relevance for chunks that would otherwise lack sufficient context. See Indexing for details.
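As a rough sketch of how the prefix could be combined with a chunk before embedding, assuming the prefix has already been generated: the Chunk shape, the buildEmbeddedText name, and the blank-line separator are all hypothetical, and the model call itself is omitted.

```typescript
// Hypothetical chunk shape; contextPrefix holds the generated 2-3 sentence summary.
interface Chunk {
  text: string;
  contextPrefix?: string;
}

// Returns the text that would be sent to the embedding model.
function buildEmbeddedText(chunk: Chunk, totalChunksInDocument: number): string {
  // Single-chunk documents skip the prefix: the chunk already is the whole document.
  if (totalChunksInDocument <= 1) return chunk.text;

  const prefix = chunk.contextPrefix?.trim();
  if (!prefix) return chunk.text;

  // Assumption: prefix and chunk are joined by a blank line.
  return `${prefix}\n\n${chunk.text}`;
}
```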

Dual-signal fusion. RRF inherently boosts documents that rank well in both result sets. With a typical constant such as k = 60 (and for any k of 5 or more), a document ranking #5 on keyword and #8 on semantic outscores a document ranking #1 on keyword alone but absent from semantic results. This acts as a built-in precision filter -- documents relevant on multiple axes rank higher.
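The comparison in the paragraph above can be checked numerically. The snippet assumes k = 60, which this page does not confirm; the outcome depends on k, and for very small values (below roughly 4.3) the single #1 hit would win instead.

```typescript
// Assumed RRF constant for this worked example.
const ASSUMED_K = 60;

// Contribution of one ranked appearance: 1 / (k + rank).
const contribution = (rank: number, k: number = ASSUMED_K): number => 1 / (k + rank);

// Document A: rank 5 in keyword results and rank 8 in semantic results.
const inBothSets = contribution(5) + contribution(8); // 1/65 + 1/68, about 0.0301

// Document B: rank 1 in keyword results, absent from semantic results.
const keywordOnly = contribution(1); // 1/61, about 0.0164

const dualSignalWins = inBothSets > keywordOnly;
```

With k = 60, two mid-ranked appearances are worth nearly twice a single top-ranked one, which is what pushes cross-validated documents up the fused list.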

Snippet preservation. The fusion process preserves text snippets from keyword results (which include BM25 highlights) and falls back to semantic snippets when keyword results lack them. This gives the UI meaningful preview text for display.
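The fallback order can be sketched like this. The SnippetHit shape and the pickSnippet name are hypothetical, and treating an empty or whitespace-only snippet as missing is an assumption.

```typescript
// Hypothetical hit shape; only the snippet matters here.
interface SnippetHit {
  id: string;
  snippet?: string;
}

const hasText = (s: string | undefined): s is string =>
  typeof s === "string" && s.trim().length > 0;

// Prefer the keyword snippet (it carries highlights); otherwise use the semantic one.
function pickSnippet(keywordHit?: SnippetHit, semanticHit?: SnippetHit): string | undefined {
  const keywordSnippet = keywordHit?.snippet;
  if (hasText(keywordSnippet)) return keywordSnippet;

  const semanticSnippet = semanticHit?.snippet;
  return hasText(semanticSnippet) ? semanticSnippet : undefined;
}
```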

Debugging Relevance

The admin Search Debugger is the primary tool for inspecting relevance. It shows all three result sets (keyword, semantic, and fused) side by side with their scores, so you can see exactly why a document ranked where it did:

Which source contributed each result -- the KW and SEM badges on fused results
Raw scores from each engine -- BM25 score column, similarity score column, RRF score column
Query classification -- the badge above the results showing keyword, balanced, or semantic
Result set sizes -- how many results each engine returned before fusion