Evaluation

Measuring search quality -- metrics, judgment workflows, and a relevance test suite.

Evaluation is how you know whether search relevance is improving or regressing. This page describes the metrics, workflows, and test infrastructure for measuring search quality in Trovella.

Current State

Trovella does not yet have automated relevance evaluation. Quality is assessed manually through the Search Debugger, which shows keyword, semantic, and fused results side by side for any query. This is sufficient for ad-hoc debugging but does not scale to regression testing across parameter changes.

The sections below describe the evaluation framework being planned and the metrics it will use.

Key Metrics

Precision at k (P@k)

The fraction of the top-k results that are relevant.

P@k = (relevant results in top k) / k

For example, if 4 of the top 5 results are relevant, P@5 = 0.80.

This is the most intuitive metric and the easiest to judge manually. It directly answers "when I look at the first page of results, how many are useful?"
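The formula above can be sketched as a small helper. This is a minimal illustration, not Trovella code: the function name `precisionAtK` and the sample IDs are assumptions made for the example.

```typescript
// Precision at k: relevant results in the top k, divided by k.
// `rankedIds` and `relevantIds` are illustrative names, not Trovella APIs.
function precisionAtK(rankedIds: string[], relevantIds: Set<string>, k: number): number {
  if (k <= 0) throw new RangeError("k must be a positive integer");
  const hits = rankedIds.slice(0, k).filter((id) => relevantIds.has(id)).length;
  // Divide by k, not by the number of results returned, so that a
  // short result list is penalised rather than rewarded.
  return hits / k;
}

// Mirrors the worked example: 4 of the top 5 results are relevant.
const ranked = ["c1", "c2", "c3", "c4", "c5"];
const relevant = new Set(["c1", "c2", "c4", "c5"]);
console.log(precisionAtK(ranked, relevant, 5)); // 0.8
```

Dividing by k even when fewer than k results come back is a design choice: it keeps scores comparable across queries with different result counts.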

Mean Reciprocal Rank (MRR)

The average of 1/rank for the first relevant result across a set of N queries. A query with no relevant result in the returned list contributes 0.

MRR = (1/N) * SUM( 1 / rank_first_relevant )

For example, if across 3 queries the first relevant result is at positions 1, 3, and 2, MRR = (1/1 + 1/3 + 1/2) / 3 = 0.61.

MRR measures how quickly users find what they need. It is especially important for Trovella because search results often feed into MCP tool workflows where the first relevant result determines the quality of downstream AI reasoning.
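A minimal sketch of the MRR calculation, assuming each query has already been reduced to the 1-based rank of its first relevant result (or `null` when none was returned). The function name `meanReciprocalRank` is illustrative.

```typescript
// Mean Reciprocal Rank over a set of queries.
// Each entry is the 1-based rank of the first relevant result,
// or null when no relevant result was returned (contributes 0).
function meanReciprocalRank(firstRelevantRanks: Array<number | null>): number {
  if (firstRelevantRanks.length === 0) return 0;
  const sum = firstRelevantRanks.reduce<number>(
    (acc, rank) => acc + (rank !== null && rank > 0 ? 1 / rank : 0),
    0,
  );
  return sum / firstRelevantRanks.length;
}

// Mirrors the worked example: first relevant results at ranks 1, 3, and 2.
console.log(meanReciprocalRank([1, 3, 2]).toFixed(2)); // "0.61"
```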

Dual-Source Hit Rate

The fraction of returned results that appeared in both keyword and semantic result sets.

DualRate = (results with inKeyword AND inSemantic) / total results

This is a proxy metric unique to hybrid search. A higher dual-source rate suggests the two engines agree on relevance, which tends to correlate with user-perceived quality. A very low rate (below 20%) suggests the engines are finding different content, which may indicate an embedding or vocabulary mismatch.

The debugSearch endpoint already returns inKeyword and inSemantic flags on every fused result, making this metric trivial to compute.
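As a rough sketch, the rate can be computed directly from those two flags. The `FusedResult` shape here is an assumption: only `inKeyword` and `inSemantic` come from the description above, and the `id` field and sample data are placeholders.

```typescript
// Shape assumed for illustration; only the two flags are taken from the text.
interface FusedResult {
  id: string;
  inKeyword: boolean;
  inSemantic: boolean;
}

function dualSourceRate(results: FusedResult[]): number {
  if (results.length === 0) return 0; // avoid 0/0 on empty result sets
  const dual = results.filter((r) => r.inKeyword && r.inSemantic).length;
  return dual / results.length;
}

// Made-up sample: two of four results were found by both engines.
const fused: FusedResult[] = [
  { id: "a", inKeyword: true, inSemantic: true },
  { id: "b", inKeyword: true, inSemantic: false },
  { id: "c", inKeyword: false, inSemantic: true },
  { id: "d", inKeyword: true, inSemantic: true },
];
console.log(dualSourceRate(fused)); // 0.5
```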

Building a Relevance Test Suite

Step 1: Collect Judgment Queries

A relevance test suite starts with a set of representative queries paired with relevance judgments. Each entry contains:

Field	Description
`query`	The search query string
`expectedIds`	Chunk IDs or source document IDs that should appear in results
`expectedRank`	Optional: the expected rank range for each expected ID
`tags`	Query characteristics (e.g., "keyword", "semantic", "entity-lookup", "question")

Start small -- 20-30 queries covering the main search use cases:

Entity lookups (company names, product names)
Specific concept searches (pricing models, market sizing)
Natural language questions
Multi-term queries that test BM25 vs semantic tension
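One possible way to type a suite entry is sketched here. The field names follow the table above; the rank-range representation, the `groupByTag` helper, and the sample queries and IDs are assumptions made for illustration.

```typescript
interface JudgmentQuery {
  query: string;
  expectedIds: string[];
  // Optional inclusive rank range per expected ID, e.g. { "chunk-12": [1, 3] }.
  expectedRank?: Record<string, [number, number]>;
  tags: string[];
}

// Made-up entries; real suites would use IDs from the indexed corpus.
const sampleSuite: JudgmentQuery[] = [
  {
    query: "Acme Corp",
    expectedIds: ["chunk-12"],
    expectedRank: { "chunk-12": [1, 3] },
    tags: ["keyword", "entity-lookup"],
  },
  {
    query: "how do usage-based pricing models work?",
    expectedIds: ["chunk-40", "chunk-41"],
    tags: ["semantic", "question"],
  },
];

// Group queries by tag so metrics can later be reported per use case.
function groupByTag(suite: JudgmentQuery[]): Map<string, JudgmentQuery[]> {
  const groups = new Map<string, JudgmentQuery[]>();
  for (const entry of suite) {
    for (const tag of entry.tags) {
      const bucket = groups.get(tag) ?? [];
      bucket.push(entry);
      groups.set(tag, bucket);
    }
  }
  return groups;
}
```

Grouping by tag makes it easier to see whether a parameter change helps entity lookups while hurting natural language questions, for example.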

Step 2: Create a Judgment Workflow

For each query, run the search and annotate results as relevant (1) or not relevant (0). Binary relevance is sufficient for the metrics above. Graded relevance (0/1/2/3) enables NDCG but adds annotation complexity without clear benefit at the current scale.

The Search Debugger already provides the UI foundation. The planned extension is a "judgment mode" that adds a thumbs-up/thumbs-down control to each result row and persists the judgments.
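Since judgment mode is still planned, the storage format is not defined. As a hedged sketch, persisted thumbs-up/thumbs-down judgments could be collapsed into per-query relevant-ID lists like this; the `Judgment` record and `toExpectedIds` helper are hypothetical.

```typescript
// Hypothetical persisted judgment: one row per (query, result) pair.
interface Judgment {
  query: string;
  resultId: string;
  relevant: 0 | 1; // binary: 1 = thumbs-up, 0 = thumbs-down
}

// Collapse judgments into the list of relevant IDs for each query.
function toExpectedIds(judgments: Judgment[]): Map<string, string[]> {
  const byQuery = new Map<string, string[]>();
  for (const j of judgments) {
    const ids = byQuery.get(j.query) ?? [];
    if (j.relevant === 1 && !ids.includes(j.resultId)) ids.push(j.resultId);
    // Queries with no relevant result keep an empty list rather than vanishing.
    byQuery.set(j.query, ids);
  }
  return byQuery;
}
```

Keeping queries that have no relevant results in the map means they still count toward MRR as zeros instead of silently dropping out of the average.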

Step 3: Automate Metric Computation

Once judgments exist, computing P@k and MRR is mechanical:

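// Pseudocode for a test harness; `hybridSearch` and `testSuite` are supplied by the caller
let precisionSum = 0;
let reciprocalRankSum = 0;

for (const testCase of testSuite) {
  const results = await hybridSearch.search({ query: testCase.query, limit: 10 });
  const relevantSet = new Set(testCase.expectedIds);

  // P@5: relevant results in the top 5, divided by k = 5
  precisionSum += results.slice(0, 5).filter((r) => relevantSet.has(r.id)).length / 5;
  // Reciprocal rank of the first relevant result; 0 when none is found (findIndex returns -1)
  const firstRelevantRank = results.findIndex((r) => relevantSet.has(r.id)) + 1;
  reciprocalRankSum += firstRelevantRank > 0 ? 1 / firstRelevantRank : 0;
}

const meanPrecisionAt5 = precisionSum / testSuite.length;
const mrr = reciprocalRankSum / testSuite.length; // MRR is the mean over all queries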

This can run as a Vitest test file that reports metrics without pass/fail assertions, or with thresholds once baselines are established.

Step 4: Establish Baselines

Run the test suite against the current system to establish baseline metrics. Record:

P@5 per query and overall average
MRR per query and overall average
Dual-source hit rate per query and overall average

These baselines become the regression thresholds: a parameter change that improves one metric should not significantly degrade the others.
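A baseline comparison could be sketched as follows. The `Metrics` shape, the `findRegressions` helper, and the default tolerance of 0.05 are all assumptions; the numbers in the example are made up, not measured Trovella results. What counts as a significant drop would need to be decided once real baselines exist.

```typescript
interface Metrics {
  precisionAt5: number;
  mrr: number;
  dualSourceRate: number;
}

// Returns the names of metrics that dropped by more than `tolerance`
// (absolute difference) relative to the baseline. The default is a placeholder.
function findRegressions(baseline: Metrics, current: Metrics, tolerance = 0.05): string[] {
  const keys: Array<keyof Metrics> = ["precisionAt5", "mrr", "dualSourceRate"];
  return keys.filter((key) => baseline[key] - current[key] > tolerance);
}

// Made-up numbers: P@5 improves, MRR drops by 0.12, dual-source rate barely moves.
const baselineMetrics: Metrics = { precisionAt5: 0.6, mrr: 0.7, dualSourceRate: 0.4 };
const candidateMetrics: Metrics = { precisionAt5: 0.66, mrr: 0.58, dualSourceRate: 0.39 };
console.log(findRegressions(baselineMetrics, candidateMetrics)); // only "mrr" is flagged
```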

When to Evaluate

Evaluation should run whenever:

Change	Why
RRF k parameter changes	Directly affects scoring
Chunk size or overlap changes	Affects what content each engine sees
Embedding model changes	Affects all semantic search results
Contextual prefix prompt changes	Affects embedding quality
Typesense schema or query_by changes	Affects keyword search results
Significant new content is indexed	May shift score distributions

Limitations

No click-through data. Trovella search results currently feed into MCP tool workflows and the admin debugger. There is no user-facing search results page generating click signals, so evaluation relies entirely on manual judgments and the metrics computed from them.

Small corpus. With a research-oriented MVP and a limited number of indexed documents, statistical significance of metric differences is hard to establish. Directional trends matter more than precise numbers at this stage.

Binary relevance only. The planned evaluation uses binary relevant/not-relevant judgments. This works well for precision and MRR but cannot distinguish "somewhat relevant" from "perfectly relevant." Graded relevance can be added later if the binary approach proves insufficient.

Related Pages

Relevance Overview -- the scoring system being evaluated
Tuning Guide -- the parameters whose effects evaluation measures
Search Debugger -- the UI for manual evaluation
Query Pipeline -- the pipeline producing the results being evaluated