Query Pipeline
Read path of the hybrid search system -- query classification, parallel keyword + semantic search, and Reciprocal Rank Fusion.
The query pipeline is the read path of the hybrid search system. A user's search query enters the hybridSearch.search tRPC procedure, runs against two search engines in parallel, and exits as a single ranked list of results fused by Reciprocal Rank Fusion.
Pipeline Flow
User Query
|
v
classifyQuery() "keyword" | "balanced" | "semantic"
|
v
Promise.all([ Both searches run concurrently
keywordSearch() Typesense BM25 (title + embedded_text)
semanticSearch() pgvector cosine similarity (halfvec 1536)
])
|
v
reciprocalRankFusion() Merge by 1/(k + rank), k=60
|
v
FusedResult[] Deduplicated, scored, ranked
Step 1: Query Classification
classifyQuery() from @repo/search uses a word-count heuristic to classify the query:
| Word Count | Classification | Rationale |
|---|---|---|
| 1-3 words | keyword | Short, likely exact-match intent ("notion pricing") |
| 4-6 words | balanced | Could be either style ("document management market size") |
| 7+ words | semantic | Natural language question, semantic search excels |
The classification is returned in the response metadata for debugging. It does not currently alter the pipeline behavior -- both keyword and semantic search always run regardless of classification. The classification exists as a hook for future weighting: boosting keyword scores for keyword queries or semantic scores for semantic queries.
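The thresholds in the table can be sketched as a small function. This is a minimal illustration, not the actual classifyQuery() from @repo/search: the name classifyQuerySketch and the whitespace-based tokenization are assumptions, and the real implementation may count words differently.

```typescript
// Hypothetical re-implementation of the word-count heuristic described above.
// The real classifyQuery() in @repo/search may tokenize differently.
type QueryType = "keyword" | "balanced" | "semantic";

function classifyQuerySketch(query: string): QueryType {
  // Split on runs of whitespace and drop empty tokens so that leading,
  // trailing, or repeated spaces do not inflate the count.
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;

  if (wordCount <= 3) return "keyword"; // 1-3 words
  if (wordCount <= 6) return "balanced"; // 4-6 words
  return "semantic"; // 7+ words
}

// Example inputs: 2 words, 4 words, and 8 words respectively.
const examples: QueryType[] = [
  classifyQuerySketch("notion pricing"),
  classifyQuerySketch("document management market size"),
  classifyQuerySketch("how do mid-size teams evaluate document management tools"),
];
```

Because the result is only reported in the response metadata today, a misclassification at a threshold boundary has no effect on ranking.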
Step 2: Query Embedding
The user's query text is embedded via ctx.ai.embedQuery(), which calls Gemini Embedding 2 with the RETRIEVAL_QUERY task type. The indexing path embeds documents with RETRIEVAL_DOCUMENT instead. Gemini optimizes this query/document pairing for asymmetric retrieval -- short queries matched against longer documents.
The embedding call is tagged with feature "hybrid-search-query" for cost attribution in the ai_usage table.
Step 3: Parallel Search
Both search engines run concurrently via Promise.all:
- Keyword Search -- Typesense BM25 against the document_chunks collection, with tenant filtering and optional facet filters
- Semantic Search -- pgvector inner product distance against document_chunk.embedding, running through the tenant-scoped ctx.db transaction
Both searches request limit * 2 results to give the fusion step a larger candidate pool than the final output size.
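The concurrency and over-fetch described above might look roughly like the following. This is a sketch under assumptions: runBothSearches, the Hit shape, the SearchFn signature, and stubEngine are hypothetical names for illustration, and the real keywordSearch() and semanticSearch() also take tenant context, filters, and the query embedding.

```typescript
// All names here are placeholders, not the router's actual code.
interface Hit {
  chunkId: string;
  rank: number; // 1-based position within its own result list
}

type SearchFn = (query: string, limit: number) => Promise<Hit[]>;

async function runBothSearches(
  query: string,
  limit: number,
  keywordSearch: SearchFn,
  semanticSearch: SearchFn,
): Promise<{ keywordHits: Hit[]; semanticHits: Hit[] }> {
  // Over-fetch so fusion sees a larger candidate pool than the final page.
  const candidateLimit = limit * 2;

  // Both calls start before either is awaited, so they run concurrently.
  const [keywordHits, semanticHits] = await Promise.all([
    keywordSearch(query, candidateLimit),
    semanticSearch(query, candidateLimit),
  ]);

  return { keywordHits, semanticHits };
}

// Stub engine for illustration: returns `limit` synthetic hits.
const stubEngine = (prefix: string): SearchFn => async (_query, limit) =>
  Array.from({ length: limit }, (_, i) => ({
    chunkId: `${prefix}-${i + 1}`,
    rank: i + 1,
  }));
```

Note that Promise.all rejects as soon as either search rejects, so in this sketch a failure in one engine fails the whole request. Whether the production router degrades to a single engine instead is not stated here.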
Step 4: Fusion
reciprocalRankFusion() merges both result lists into a single ranked output. Documents appearing in both lists accumulate scores from each source, naturally ranking higher than single-source results.
See Fusion Algorithm for the RRF implementation details and score calculation.
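As a quick illustration of how scores accumulate across the two lists, here is a minimal RRF sketch using the 1/(k + rank) formula with k = 60 from the pipeline flow. The names rrfSketch, RankedHit, and FusedSketch are hypothetical, ranks are assumed to be 1-based, and each input list is assumed to contain an id at most once; the actual reciprocalRankFusion() in packages/search/src/fusion.ts may differ in all of these respects.

```typescript
// Hypothetical shapes; the real FusedResult likely carries more fields.
interface RankedHit {
  id: string;
}

interface FusedSketch {
  id: string;
  score: number;
  inKeyword: boolean;
  inSemantic: boolean;
}

function rrfSketch(keyword: RankedHit[], semantic: RankedHit[], k = 60): FusedSketch[] {
  const byId = new Map<string, FusedSketch>();

  const accumulate = (hits: RankedHit[], source: "inKeyword" | "inSemantic"): void => {
    hits.forEach((hit, index) => {
      const rank = index + 1; // assumption: ranks start at 1
      const entry: FusedSketch =
        byId.get(hit.id) ?? { id: hit.id, score: 0, inKeyword: false, inSemantic: false };
      entry.score += 1 / (k + rank); // each list contributes 1/(k + rank)
      entry[source] = true;
      byId.set(hit.id, entry);
    });
  };

  accumulate(keyword, "inKeyword");
  accumulate(semantic, "inSemantic");

  // Highest fused score first; the Map has already deduplicated by id.
  return Array.from(byId.values()).sort((a, b) => b.score - a.score);
}

// "b" is rank 2 in the keyword list and rank 1 in the semantic list, so it
// scores 1/62 + 1/61 and outranks "a" (1/61) and "c" (1/62).
const fused = rrfSketch([{ id: "a" }, { id: "b" }], [{ id: "b" }, { id: "c" }]);
```

Because only ranks enter the formula, the raw BM25 and vector-distance scores never need to be normalized against each other.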
Entry Points
The hybridSearch tRPC router at packages/api/src/routers/hybrid-search.ts exposes four procedures:
| Procedure | Purpose | Authorization |
|---|---|---|
| search | Production hybrid search with full filter support | read on ResearchArtifact |
| typeAhead | Prefix search with typo tolerance for autocomplete | read on ResearchArtifact |
| debugSearch | Returns keyword, semantic, and fused results separately | read on Organization |
| stats | Chunk counts grouped by source table | read on Organization |
All procedures go through the authorizedProcedure chain, which means the session is validated, withTenantContext scopes the database connection, and CASL abilities are checked before any search logic runs.
Input Schema
The search procedure accepts:
z.object({
query: z.string().min(1).max(1000),
limit: z.number().min(1).max(100).default(20),
sourceTable: z.enum(["research_artifact", "research_output", "extraction_result"]).optional(),
artifactType: z.string().optional(),
mediaType: z.string().optional(),
planId: z.string().optional(),
userId: z.string().optional(),
createdAfter: z.number().optional(),
createdBefore: z.number().optional(),
});
Filters narrow the keyword search via Typesense filter_by. On the semantic path, only sourceTable is applied, as a WHERE source_table = ... clause when the filter is provided. The planId, userId, and date filters currently apply only to the keyword path because Typesense has faceted indexes on these fields.
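To make the keyword-path filtering concrete, the following sketch builds a Typesense filter_by string from the input. The function name buildFilterBy and every Typesense field name (tenant_id, source_table, artifact_type, media_type, plan_id, user_id, created_at) are assumptions for illustration; the actual collection schema is not shown on this page. Treating the date bounds as inclusive is also an assumption.

```typescript
// Hypothetical filter shape mirroring the optional fields of the input schema.
interface SearchFilters {
  sourceTable?: string;
  artifactType?: string;
  mediaType?: string;
  planId?: string;
  userId?: string;
  createdAfter?: number;
  createdBefore?: number;
}

function buildFilterBy(tenantId: string, f: SearchFilters): string {
  // Tenant scoping always comes first; field names are assumed, not confirmed.
  const clauses: string[] = [`tenant_id:=${tenantId}`];

  if (f.sourceTable) clauses.push(`source_table:=${f.sourceTable}`);
  if (f.artifactType) clauses.push(`artifact_type:=${f.artifactType}`);
  if (f.mediaType) clauses.push(`media_type:=${f.mediaType}`);
  if (f.planId) clauses.push(`plan_id:=${f.planId}`);
  if (f.userId) clauses.push(`user_id:=${f.userId}`);

  // Numeric comparisons; 0 is a valid timestamp, so compare against undefined.
  if (f.createdAfter !== undefined) clauses.push(`created_at:>=${f.createdAfter}`);
  if (f.createdBefore !== undefined) clauses.push(`created_at:<=${f.createdBefore}`);

  // Typesense combines filter clauses with &&.
  return clauses.join(" && ");
}

const exampleFilter = buildFilterBy("t1", {
  sourceTable: "research_artifact",
  createdAfter: 1700000000,
});
```

This sketch interpolates values verbatim. A production version would need to escape or validate string values before placing them in filter_by.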
Response Shape
{
results: FusedResult[]; // Merged, deduplicated, ranked by RRF score
queryType: "keyword" | "balanced" | "semantic";
keywordCount: number; // Results from Typesense before fusion
semanticCount: number; // Results from pgvector before fusion
fusedCount: number; // Final result count after fusion
}
Each FusedResult carries inKeyword and inSemantic booleans indicating which sources contributed the result. The Hybrid Search Admin View renders these as KW and SEM badges in the debug UI.
Key Files
| File | Purpose |
|---|---|
| packages/api/src/routers/hybrid-search.ts | tRPC router -- orchestrates the full pipeline |
| packages/search/src/fusion.ts | reciprocalRankFusion() and classifyQuery() |
| packages/search/src/keyword-search.ts | keywordSearch() and typeAheadSearch() |
| packages/ai/src/embedding.ts | embedQuery() for query-time embedding |
| packages/ai/src/context.ts | AIHelper interface bound to tRPC context |
Related Pages
- Indexing -- how data enters the index (write path)
- Drizzle Patterns -- vector similarity SQL pattern used by semantic search
- Hybrid Search Admin View -- the three-column debug UI
- Relevance -- RRF tuning, query classification heuristics, and evaluation