Trovella Wiki

Query Pipeline

Read path of the hybrid search system -- query classification, parallel keyword + semantic search, and Reciprocal Rank Fusion.

The query pipeline is the read path of the hybrid search system. A user's search query enters the hybridSearch.search tRPC procedure, runs against two search engines in parallel, and exits as a single ranked list of results fused by Reciprocal Rank Fusion.

Pipeline Flow

User Query
   |
   v
classifyQuery()                "keyword" | "balanced" | "semantic"
   |
   v
Promise.all([                  Both searches run concurrently
   keywordSearch()             Typesense BM25 (title + embedded_text)
   semanticSearch()            pgvector cosine similarity (halfvec 1536)
])
   |
   v
reciprocalRankFusion()         Merge by 1/(k + rank), k=60
   |
   v
FusedResult[]                  Deduplicated, scored, ranked

Step 1: Query Classification

classifyQuery() from @repo/search uses a word-count heuristic to classify the query:

Word Count   Classification   Rationale
1-3 words    keyword          Short, likely exact-match intent ("notion pricing")
4-6 words    balanced         Could be either style ("document management market size")
7+ words     semantic         Natural language question, semantic search excels

The classification is returned in the response metadata for debugging. It does not currently alter the pipeline behavior -- both keyword and semantic search always run regardless of classification. The classification exists as a hook for future weighting: boosting keyword scores for keyword queries or semantic scores for semantic queries.
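The word-count thresholds in the table can be sketched as below. This is a minimal illustration, not the code in @repo/search: the whitespace tokenization and the handling of an all-whitespace query (zero words, treated as keyword) are assumptions.

```typescript
// Sketch only: the real classifyQuery() lives in @repo/search and may tokenize differently.
type QueryType = "keyword" | "balanced" | "semantic";

function classifyQuery(query: string): QueryType {
  // Assumption: words are whitespace-separated; empty fragments are dropped.
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;

  if (wordCount <= 3) return "keyword";  // 1-3 words (and, in this sketch, 0 words)
  if (wordCount <= 6) return "balanced"; // 4-6 words
  return "semantic";                     // 7+ words
}
```

Because the result is only reported in the response metadata, swapping in a different heuristic would not change which searches run.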

Step 2: Query Embedding

The user's query text is embedded via ctx.ai.embedQuery(), which calls Gemini Embedding 2 with the RETRIEVAL_QUERY task type. The indexing path uses RETRIEVAL_DOCUMENT instead. Gemini optimizes this pairing of task types for asymmetric retrieval, where short queries are matched against longer documents.

The embedding call is tagged with feature "hybrid-search-query" for cost attribution in the ai_usage table.

Step 3: Parallel Search

Both search engines run concurrently via Promise.all:

  • Keyword Search -- Typesense BM25 against the document_chunks collection, with tenant filtering and optional facet filters
  • Semantic Search -- pgvector inner product distance against document_chunk.embedding, running through the tenant-scoped ctx.db transaction

Both searches request limit * 2 results to give the fusion step a larger candidate pool than the final output size.
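The concurrent dispatch and the doubled candidate pool described above might look like the following sketch. The ChunkHit shape, the SearchFn signature, and the function names are hypothetical stand-ins; the real keywordSearch() and semanticSearch() take additional arguments (tenant, filters, the query embedding) that are omitted here.

```typescript
// Hypothetical hit shape; the real result types are not shown on this page.
interface ChunkHit {
  chunkId: string;
}

// Hypothetical signature standing in for keywordSearch() / semanticSearch().
type SearchFn = (query: string, limit: number) => Promise<ChunkHit[]>;

// Each engine is asked for twice the final limit so fusion has more candidates.
function candidatePoolSize(limit: number): number {
  return limit * 2;
}

async function runBothSearches(
  query: string,
  limit: number,
  keywordSearch: SearchFn,
  semanticSearch: SearchFn,
): Promise<{ keywordHits: ChunkHit[]; semanticHits: ChunkHit[] }> {
  const poolSize = candidatePoolSize(limit);

  // Both promises are created before awaiting, so the two searches overlap.
  const [keywordHits, semanticHits] = await Promise.all([
    keywordSearch(query, poolSize),
    semanticSearch(query, poolSize),
  ]);

  return { keywordHits, semanticHits };
}
```

Note that Promise.all rejects as soon as either search rejects. Whether the actual router falls back to a single engine on failure is not stated on this page.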

Step 4: Fusion

reciprocalRankFusion() merges both result lists into a single ranked output. Documents appearing in both lists accumulate scores from each source, naturally ranking higher than single-source results.

See Fusion Algorithm for the RRF implementation details and score calculation.
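As a rough illustration of the scoring rule 1/(k + rank) with k=60, the sketch below fuses two ranked lists. It is not the implementation in packages/search/src/fusion.ts: the function name, the RankedHit and FusedHit shapes, 1-based ranks, and the assumption that each id appears at most once per list are all choices made for this example.

```typescript
// Hypothetical shapes for illustration; the real FusedResult has more fields.
interface RankedHit {
  id: string;
}

interface FusedHit {
  id: string;
  score: number;
  inKeyword: boolean;
  inSemantic: boolean;
}

function fuseByReciprocalRank(
  keyword: RankedHit[],
  semantic: RankedHit[],
  k: number = 60,
): FusedHit[] {
  const byId = new Map<string, FusedHit>();

  const accumulate = (hits: RankedHit[], source: "keyword" | "semantic"): void => {
    hits.forEach((hit, index) => {
      const rank = index + 1; // assumption: ranks are 1-based
      const entry =
        byId.get(hit.id) ?? { id: hit.id, score: 0, inKeyword: false, inSemantic: false };
      entry.score += 1 / (k + rank);
      if (source === "keyword") entry.inKeyword = true;
      else entry.inSemantic = true;
      byId.set(hit.id, entry); // keying by id is what deduplicates the output
    });
  };

  accumulate(keyword, "keyword");
  accumulate(semantic, "semantic");

  // Highest fused score first.
  return Array.from(byId.values()).sort((a, b) => b.score - a.score);
}
```

With keyword hits [a, b] and semantic hits [b, c], b collects 1/62 + 1/61 and ranks first, ahead of a (1/61) and c (1/62), which shows how appearing in both lists outweighs a single top rank.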

Entry Points

The hybridSearch tRPC router at packages/api/src/routers/hybrid-search.ts exposes four procedures:

Procedure     Purpose                                                   Authorization
search        Production hybrid search with full filter support         read on ResearchArtifact
typeAhead     Prefix search with typo tolerance for autocomplete        read on ResearchArtifact
debugSearch   Returns keyword, semantic, and fused results separately   read on Organization
stats         Chunk counts grouped by source table                      read on Organization

All procedures go through the authorizedProcedure chain, which means the session is validated, withTenantContext scopes the database connection, and CASL abilities are checked before any search logic runs.

Input Schema

The search procedure accepts:

z.object({
  query: z.string().min(1).max(1000),
  limit: z.number().min(1).max(100).default(20),
  sourceTable: z.enum(["research_artifact", "research_output", "extraction_result"]).optional(),
  artifactType: z.string().optional(),
  mediaType: z.string().optional(),
  planId: z.string().optional(),
  userId: z.string().optional(),
  createdAfter: z.number().optional(),
  createdBefore: z.number().optional(),
});

Filters narrow the keyword search via Typesense filter_by. On the semantic path, only sourceTable is applied, as a WHERE source_table = <value> clause, and only when it is provided. The planId, userId, and date filters (createdAfter, createdBefore) currently apply only to the keyword path because Typesense has faceted indexes on these fields. As a consequence, a chunk that falls outside one of those keyword-only filters can still enter the fused list through the semantic path.
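To make the keyword-side filtering concrete, here is a sketch of how the input fields could be turned into a Typesense filter_by string. The collection field names (tenant_id, source_table, plan_id, user_id, created_at), the strict comparison on dates, and the buildFilterBy helper itself are assumptions for illustration; the actual schema of the document_chunks collection is not shown on this page.

```typescript
// Hypothetical subset of the search input plus the tenant id.
interface KeywordFilters {
  tenantId: string;
  sourceTable?: string;
  planId?: string;
  userId?: string;
  createdAfter?: number;
  createdBefore?: number;
}

function buildFilterBy(f: KeywordFilters): string {
  // Tenant scoping is always present; the remaining clauses are optional.
  const clauses: string[] = [`tenant_id:=${f.tenantId}`];

  if (f.sourceTable !== undefined) clauses.push(`source_table:=${f.sourceTable}`);
  if (f.planId !== undefined) clauses.push(`plan_id:=${f.planId}`);
  if (f.userId !== undefined) clauses.push(`user_id:=${f.userId}`);
  if (f.createdAfter !== undefined) clauses.push(`created_at:>${f.createdAfter}`);
  if (f.createdBefore !== undefined) clauses.push(`created_at:<${f.createdBefore}`);

  // Typesense combines clauses with && for a logical AND.
  return clauses.join(" && ");
}
```

The sketch does not escape values. String values containing special characters would need escaping before being placed in a real filter_by expression.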

Response Shape

{
  results: FusedResult[];  // Merged, deduplicated, ranked by RRF score
  queryType: "keyword" | "balanced" | "semantic";
  keywordCount: number;    // Results from Typesense before fusion
  semanticCount: number;   // Results from pgvector before fusion
  fusedCount: number;      // Final result count after fusion
}

Each FusedResult carries inKeyword and inSemantic booleans indicating which sources contributed the result. The Hybrid Search Admin View renders these as KW and SEM badges in the debug UI.
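The mapping from the two booleans to badges is simple enough to show in a few lines. The helper name and its return type are hypothetical; the admin view's actual rendering code is not shown here.

```typescript
// Hypothetical helper: derives badge labels from a result's source flags.
function sourceBadges(result: { inKeyword: boolean; inSemantic: boolean }): string[] {
  const badges: string[] = [];
  if (result.inKeyword) badges.push("KW");
  if (result.inSemantic) badges.push("SEM");
  return badges;
}
```

A result carrying both badges was found by both engines, which is also the case where its RRF score accumulates from two ranks.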

Key Files

File                                        Purpose
packages/api/src/routers/hybrid-search.ts   tRPC router -- orchestrates the full pipeline
packages/search/src/fusion.ts               reciprocalRankFusion() and classifyQuery()
packages/search/src/keyword-search.ts       keywordSearch() and typeAheadSearch()
packages/ai/src/embedding.ts                embedQuery() for query-time embedding
packages/ai/src/context.ts                  AIHelper interface bound to tRPC context
