Evaluation
Measuring search quality -- metrics, judgment workflows, and building a relevance test suite.
Evaluation is how you know whether search relevance is improving or regressing. This page describes the metrics, workflows, and test infrastructure for measuring search quality in Trovella.
Current State
Trovella does not yet have automated relevance evaluation. Quality is assessed manually through the Search Debugger, which shows keyword, semantic, and fused results side by side for any query. This is sufficient for ad-hoc debugging but does not scale to regression testing across parameter changes.
The sections below describe the evaluation framework being planned and the metrics it will use.
Key Metrics
Precision at k (P@k)
The fraction of the top-k results that are relevant.
P@k = (relevant results in top k) / k
For example, if 4 of the top 5 results are relevant, P@5 = 0.80.
This is the most intuitive metric and the easiest to judge manually. It directly answers "when I look at the first page of results, how many are useful?"
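The formula translates directly into a small helper. The following is a minimal sketch, not Trovella code: the function name `precisionAtK` and the use of plain string IDs are assumptions made for illustration.

```typescript
// Hypothetical helper: not part of the Trovella codebase.
function precisionAtK(resultIds: string[], relevantIds: Set<string>, k: number): number {
  if (k <= 0) throw new RangeError("k must be positive");
  // Count relevant results among the first k. If fewer than k results were
  // returned, the missing slots count as non-relevant because we still divide by k.
  const hits = resultIds.slice(0, k).filter((id) => relevantIds.has(id)).length;
  return hits / k;
}

// 4 of the top 5 results are relevant, so P@5 = 0.80.
const p5 = precisionAtK(["a", "b", "x", "c", "d"], new Set(["a", "b", "c", "d"]), 5);
console.log(p5.toFixed(2)); // "0.80"
```

Dividing by k rather than by the number of results returned is a design choice: it penalizes queries that return fewer than k results, which is usually what a "first page" metric should do.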
Mean Reciprocal Rank (MRR)
The average, over a set of N queries, of 1/rank for the first relevant result. A query that returns no relevant result contributes 0.
MRR = (1/N) * SUM( 1 / rank_first_relevant )
For example, if across 3 queries the first relevant result is at positions 1, 3, and 2, MRR = (1/1 + 1/3 + 1/2) / 3 = 0.61.
MRR measures how quickly users find what they need. It is especially important for Trovella because search results often feed into MCP tool workflows where the first relevant result determines the quality of downstream AI reasoning.
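The worked example above can be reproduced with a short function. This is a sketch under assumptions: the name `meanReciprocalRank` is hypothetical, and treating a query with no relevant result as contributing 0 mirrors the harness pseudocode later on this page rather than any confirmed implementation.

```typescript
// Hypothetical helper. Each entry is the 1-based rank of the first relevant
// result for one query, or null when no relevant result was returned.
function meanReciprocalRank(firstRelevantRanks: Array<number | null>): number {
  if (firstRelevantRanks.length === 0) return 0;
  const total = firstRelevantRanks.reduce<number>(
    // Assumption: a query with no relevant result contributes 0.
    (sum, rank) => sum + (rank !== null && rank > 0 ? 1 / rank : 0),
    0,
  );
  return total / firstRelevantRanks.length;
}

// First relevant result at positions 1, 3, and 2.
const mrr = meanReciprocalRank([1, 3, 2]);
console.log(mrr.toFixed(2)); // "0.61"
```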
Dual-Source Hit Rate
The fraction of returned results that appeared in both keyword and semantic result sets.
DualRate = (results with inKeyword AND inSemantic) / total results
This is a proxy metric unique to hybrid search. A higher dual-source rate suggests the two engines agree on relevance, which tends to correlate with user-perceived quality. A very low rate (below 20%) suggests the engines are finding different content, which may indicate an embedding or vocabulary mismatch.
The debugSearch endpoint already returns inKeyword and inSemantic flags on every fused result, making this metric trivial to compute.
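As an illustration, the rate can be computed in a few lines. Only the `inKeyword` and `inSemantic` flags come from this page; the `FusedResult` type, its `id` field, and the function name are assumptions, and the actual debugSearch response may carry more fields.

```typescript
// Assumed result shape: only the two flags are taken from the page.
interface FusedResult {
  id: string;
  inKeyword: boolean;
  inSemantic: boolean;
}

function dualSourceRate(results: FusedResult[]): number {
  if (results.length === 0) return 0; // avoid dividing by zero on empty result sets
  const dual = results.filter((r) => r.inKeyword && r.inSemantic).length;
  return dual / results.length;
}

// Illustrative data: 2 of 4 fused results were found by both engines.
const sample: FusedResult[] = [
  { id: "c1", inKeyword: true, inSemantic: true },
  { id: "c2", inKeyword: true, inSemantic: false },
  { id: "c3", inKeyword: false, inSemantic: true },
  { id: "c4", inKeyword: true, inSemantic: true },
];
console.log(dualSourceRate(sample)); // 0.5
```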
Building a Relevance Test Suite
Step 1: Collect Judgment Queries
A relevance test suite starts with a set of representative queries paired with relevance judgments. Each entry contains:
| Field | Description |
|---|---|
| query | The search query string |
| expectedIds | Chunk IDs or source document IDs that should appear in results |
| expectedRank | Optional: the expected rank range for each expected ID |
| tags | Query characteristics (e.g., "keyword", "semantic", "entity-lookup", "question") |
Start small -- 20-30 queries covering the main search use cases:
- Entity lookups (company names, product names)
- Specific concept searches (pricing models, market sizing)
- Natural language questions
- Multi-term queries that test BM25 vs semantic tension
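One possible way to represent a judgment entry is sketched below. The field names follow the table above, but the TypeScript types, the shape of `expectedRank`, and the `validateSuite` helper are assumptions. The sample queries and IDs are invented placeholders, not real Trovella data.

```typescript
// Field names follow the table above; the exact types are assumptions.
interface RelevanceTestCase {
  query: string;
  expectedIds: string[];
  // Optional: inclusive 1-based rank range per expected ID (assumed shape).
  expectedRank?: Record<string, { min: number; max: number }>;
  tags: string[];
}

// Placeholder entries: queries and IDs are invented for illustration.
const testSuite: RelevanceTestCase[] = [
  {
    query: "Acme Corp",
    expectedIds: ["chunk-001"],
    expectedRank: { "chunk-001": { min: 1, max: 3 } },
    tags: ["entity-lookup", "keyword"],
  },
  {
    query: "how do usage-based pricing models work",
    expectedIds: ["chunk-014", "chunk-027"],
    tags: ["question", "semantic"],
  },
];

// Cheap sanity check to run before any search call.
function validateSuite(suite: RelevanceTestCase[]): string[] {
  const problems: string[] = [];
  for (const tc of suite) {
    if (tc.query.trim() === "") problems.push("empty query");
    if (tc.expectedIds.length === 0) problems.push(`no expected IDs for "${tc.query}"`);
    for (const id of Object.keys(tc.expectedRank ?? {})) {
      if (!tc.expectedIds.includes(id)) problems.push(`rank given for unknown ID ${id}`);
    }
  }
  return problems;
}

console.log(validateSuite(testSuite).length); // 0
```

Validating the suite up front keeps malformed entries from silently dragging metrics toward zero.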
Step 2: Create a Judgment Workflow
For each query, run the search and annotate results as relevant (1) or not relevant (0). Binary relevance is sufficient for the metrics above. Graded relevance (0/1/2/3) enables NDCG but adds annotation complexity without clear benefit at the current scale.
The Search Debugger already provides the UI foundation. The planned extension is a "judgment mode" that adds a thumbs-up/thumbs-down control to each result row and persists the judgments.
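Since judgment mode is still planned, its storage format is not defined on this page. The sketch below assumes one hypothetical record per thumbs-up/thumbs-down click and shows how such records could be folded into the per-query list of relevant IDs that the metrics need. The `Judgment` type and `toExpectedIds` function are illustrative names only.

```typescript
// Hypothetical persisted record for one thumbs-up/thumbs-down click.
interface Judgment {
  query: string;
  resultId: string;
  relevant: 0 | 1; // binary relevance, as described above
}

// Group judgments by query and keep only the results marked relevant.
function toExpectedIds(judgments: Judgment[]): Map<string, string[]> {
  const byQuery = new Map<string, string[]>();
  for (const j of judgments) {
    const ids = byQuery.get(j.query) ?? [];
    if (j.relevant === 1 && !ids.includes(j.resultId)) ids.push(j.resultId);
    // Store the list even when empty so fully negative queries stay visible.
    byQuery.set(j.query, ids);
  }
  return byQuery;
}

// Invented sample judgments.
const judgments: Judgment[] = [
  { query: "q1", resultId: "a", relevant: 1 },
  { query: "q1", resultId: "b", relevant: 0 },
  { query: "q2", resultId: "c", relevant: 0 },
];
const expected = toExpectedIds(judgments);
console.log(expected.get("q1")); // [ 'a' ]
```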
Step 3: Automate Metric Computation
Once judgments exist, computing P@k and MRR is mechanical:
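// Sketch of a test harness; hybridSearch and testSuite are assumed to exist
const perQuery = [];
for (const testCase of testSuite) {
  const results = await hybridSearch.search({ query: testCase.query, limit: 10 });
  const relevantSet = new Set(testCase.expectedIds);
  // P@5: relevant results among the top 5, divided by k = 5
  const precisionAt5 = results.slice(0, 5).filter((r) => relevantSet.has(r.id)).length / 5;
  // findIndex returns -1 when nothing is relevant, so the rank becomes 0
  const firstRelevantRank = results.findIndex((r) => relevantSet.has(r.id)) + 1;
  const reciprocalRank = firstRelevantRank > 0 ? 1 / firstRelevantRank : 0;
  perQuery.push({ query: testCase.query, precisionAt5, reciprocalRank });
}
// MRR is the mean of the per-query reciprocal ranks
const mrr = perQuery.reduce((sum, q) => sum + q.reciprocalRank, 0) / perQuery.length;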
This can run as a Vitest test file that reports metrics without pass/fail assertions, or with thresholds once baselines are established.
Step 4: Establish Baselines
Run the test suite against the current system to establish baseline metrics. Record:
- P@5 per query and overall average
- MRR per query and overall average
- Dual-source hit rate per query and overall average
These baselines become the regression thresholds. A parameter change that improves one metric should not significantly degrade the others.
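A baseline comparison could look like the sketch below. The `MetricSnapshot` shape, the `findRegressions` name, and the tolerance of 0.05 are all assumptions chosen for illustration. The numbers are invented, not measured Trovella results.

```typescript
// Overall averages recorded for one run (assumed shape).
interface MetricSnapshot {
  p5: number;
  mrr: number;
  dualRate: number;
}

// Return the metrics that dropped by more than `tolerance` relative to the baseline.
function findRegressions(
  baseline: MetricSnapshot,
  current: MetricSnapshot,
  tolerance: number,
): Array<keyof MetricSnapshot> {
  const names = Object.keys(baseline) as Array<keyof MetricSnapshot>;
  return names.filter((name) => baseline[name] - current[name] > tolerance);
}

// Invented numbers: P@5 improves, MRR drops well past the tolerance.
const baseline: MetricSnapshot = { p5: 0.62, mrr: 0.71, dualRate: 0.4 };
const current: MetricSnapshot = { p5: 0.66, mrr: 0.6, dualRate: 0.39 };
console.log(findRegressions(baseline, current, 0.05)); // [ 'mrr' ]
```

A tolerance keeps small fluctuations from failing the run, which matters on a small corpus where single-query changes move the averages.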
When to Evaluate
Run the evaluation after any of the following changes:
| Change | Why |
|---|---|
| RRF k parameter changes | Directly affects scoring |
| Chunk size or overlap changes | Affects what content each engine sees |
| Embedding model changes | Affects all semantic search results |
| Contextual prefix prompt changes | Affects embedding quality |
| Typesense schema or query_by changes | Affects keyword search results |
| Significant new content is indexed | May shift score distributions |
Limitations
No click-through data. Trovella search results currently feed into MCP tool workflows and the admin debugger. There is no user-facing search results page generating click signals. Evaluation relies entirely on manual judgments and automated metrics.
Small corpus. With a research-oriented MVP and a limited number of indexed documents, statistical significance of metric differences is hard to establish. Directional trends matter more than precise numbers at this stage.
Binary relevance only. The planned evaluation uses binary relevant/not-relevant judgments. This works well for precision and MRR but cannot distinguish "somewhat relevant" from "perfectly relevant." Graded relevance can be added later if the binary approach proves insufficient.
Related Pages
- Relevance Overview -- the scoring system being evaluated
- Tuning Guide -- the parameters whose effects evaluation measures
- Search Debugger -- the UI for manual evaluation
- Query Pipeline -- the pipeline producing the results being evaluated