Fusion Algorithm (RRF)
Reciprocal Rank Fusion implementation -- how keyword and semantic results are merged, scored, and ranked.
Reciprocal Rank Fusion (RRF) merges the keyword and semantic result lists into a single ranked output. It is a rank-based algorithm -- it uses only the position of each result in its source list, not the raw scores from BM25 or cosine similarity. This makes it robust to score scale differences between the two search engines.
The Formula
For each document, the RRF score is:
rrfScore = sum of 1/(k + rank) for each source where the document appears
kis a smoothing constant (default 60, from Cormack et al. 2009)rankis the 1-based position in the source's result list
A document appearing in only one source gets a single term. A document appearing in both gets two terms added together, which naturally boosts it above single-source results.
Score Examples
| Scenario | Keyword Rank | Semantic Rank | RRF Score |
|---|---|---|---|
| Keyword rank 1 only | 1 | -- | 1/61 = 0.01639 |
| Semantic rank 1 only | -- | 1 | 1/61 = 0.01639 |
| Both rank 1 | 1 | 1 | 1/61 + 1/61 = 0.03279 |
| Keyword rank 1, semantic rank 5 | 1 | 5 | 1/61 + 1/65 = 0.03178 |
| Both rank 10 | 10 | 10 | 1/70 + 1/70 = 0.02857 |
The smoothing constant k=60 dampens the effect of rank differences. A rank-1 result gets score 1/61, while a rank-10 result gets 1/70 -- only a ~13% difference. This prevents a single high-ranking result from dominating. Higher k values flatten the distribution further; lower values amplify rank differences.
Implementation
The reciprocalRankFusion() function in packages/search/src/fusion.ts:
export function reciprocalRankFusion(
keywordResults: RankedResult[],
semanticResults: RankedResult[],
limit = 20,
k = 60,
): FusedResult[] {
const fusedMap = new Map<string, FusedResult>();
for (const result of keywordResults) {
const existing = fusedMap.get(result.id);
if (existing) {
existing.rrfScore += 1 / (k + result.rank);
existing.inKeyword = true;
} else {
fusedMap.set(result.id, {
id: result.id,
sourceTable: result.sourceTable,
sourceId: result.sourceId,
title: result.title,
rrfScore: 1 / (k + result.rank),
inKeyword: true,
inSemantic: false,
textSnippet: result.textSnippet,
});
}
}
for (const result of semanticResults) {
const existing = fusedMap.get(result.id);
if (existing) {
existing.rrfScore += 1 / (k + result.rank);
existing.inSemantic = true;
if (!existing.textSnippet && result.textSnippet) {
existing.textSnippet = result.textSnippet;
}
} else {
fusedMap.set(result.id, {
id: result.id,
sourceTable: result.sourceTable,
sourceId: result.sourceId,
title: result.title,
rrfScore: 1 / (k + result.rank),
inKeyword: false,
inSemantic: true,
textSnippet: result.textSnippet,
});
}
}
return Array.from(fusedMap.values())
.sort((a, b) => b.rrfScore - a.rrfScore)
.slice(0, limit);
}
Algorithm Steps
- Build a map keyed by chunk
id. Process keyword results first, then semantic. - Accumulate scores. If a document already exists in the map (it appeared in both sources), add the new RRF term to its existing score.
- Track provenance. Set
inKeywordandinSemanticbooleans so the UI can show which sources contributed. - Prefer keyword snippets. When a document appears in both sets, the keyword snippet is kept (it includes BM25 highlighting). The semantic snippet is used only as a fallback.
- Sort and truncate. Sort by
rrfScoredescending, then slice to the requestedlimit.
Types
RankedResult (Input)
interface RankedResult {
id: string;
sourceTable: string;
sourceId: string;
title: string;
rank: number; // 1-based position in the source's result list
score: number; // Original score from the source (BM25 or similarity)
textSnippet?: string;
}
FusedResult (Output)
interface FusedResult {
id: string;
sourceTable: string;
sourceId: string;
title: string;
rrfScore: number; // Combined RRF score
inKeyword: boolean; // Appeared in BM25 results
inSemantic: boolean; // Appeared in pgvector results
textSnippet?: string;
}
Test Coverage
The fusion algorithm has thorough unit tests in packages/search/src/__tests__/fusion.test.ts:
| Test Case | What It Verifies |
|---|---|
| Empty inputs | Returns empty array when both lists are empty |
| Keyword-only | Correct flags when semantic list is empty |
| Semantic-only | Correct flags when keyword list is empty |
| Boost overlap | Documents in both sets rank higher than single-source |
| Limit respected | Output truncated to requested limit |
| Default k=60 | Score equals 1/61 for a rank-1, single-source result |
| Custom k | Score changes correctly with non-default k value |
| Sort order | Results sorted by RRF score descending |
| Snippet preference | Keyword snippet preserved when document appears in both |
The k Parameter
The k=60 default comes from the original RRF paper (Cormack, Clarke, and Buettcher, 2009). It was chosen empirically to work well across a variety of retrieval tasks. The parameter is exposed as an argument to reciprocalRankFusion() for experimentation, but changing it affects the score distribution:
- Lower k (e.g., 10): Amplifies rank differences. Top-ranked results dominate more.
- Higher k (e.g., 100): Flattens rank differences. Position matters less; appearing in both sources matters more.
For tuning guidance, see Relevance -- RRF Algorithm.
Related Pages
- Query Pipeline Overview -- where fusion fits in the full pipeline
- Keyword Search -- how BM25 results are produced and ranked
- Semantic Search -- how pgvector results are produced and ranked
- Hybrid Search Admin View -- the debug UI that displays all three result sets