Query Pipeline

Read path of the hybrid search system -- query classification, parallel keyword + semantic search, and Reciprocal Rank Fusion.

The query pipeline is the read path of the hybrid search system. A user's search query enters the hybridSearch.search tRPC procedure, runs against two search engines in parallel, and exits as a single ranked list of results fused by Reciprocal Rank Fusion.

Pipeline Flow

User Query
   |
   v
classifyQuery()                "keyword" | "balanced" | "semantic"
   |
   v
Promise.all([                  Both searches run concurrently
   keywordSearch()             Typesense BM25 (title + embedded_text)
   semanticSearch()            pgvector cosine similarity (halfvec 1536)
])
   |
   v
reciprocalRankFusion()         Merge by 1/(k + rank), k=60
   |
   v
FusedResult[]                  Deduplicated, scored, ranked

Step 1: Query Classification

classifyQuery() from @repo/search uses a word-count heuristic to classify the query:

Word Count	Classification	Rationale
1-3 words	`keyword`	Short, likely exact-match intent ("notion pricing")
4-6 words	`balanced`	Could be either style ("document management market size")
7+ words	`semantic`	Natural language question, semantic search excels

The classification is returned in the response metadata for debugging. It does not currently alter the pipeline behavior -- both keyword and semantic search always run regardless of classification. The classification exists as a hook for future weighting: boosting keyword scores for keyword queries or semantic scores for semantic queries.
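The heuristic in the table above can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not the source of classifyQuery() in @repo/search; the name classifyQuerySketch and the whitespace-based word splitting are assumptions.

```typescript
type QueryType = "keyword" | "balanced" | "semantic";

// Hypothetical sketch of the word-count heuristic (not the @repo/search source).
// Assumption: a "word" is any run of non-whitespace characters.
function classifyQuerySketch(query: string): QueryType {
  // Trim, split on whitespace runs, and drop empty tokens.
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;

  if (wordCount <= 3) return "keyword";  // 1-3 words
  if (wordCount <= 6) return "balanced"; // 4-6 words
  return "semantic";                     // 7+ words
}

console.log(classifyQuerySketch("notion pricing"));                   // "keyword"
console.log(classifyQuerySketch("document management market size")); // "balanced"
```

Because the result is only reported in the response metadata, a misclassification in a heuristic like this would not change which results come back.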

Step 2: Query Embedding

The user's query text is embedded via ctx.ai.embedQuery(), which calls Gemini Embedding 2 with the RETRIEVAL_QUERY task type. The indexing path uses RETRIEVAL_DOCUMENT instead. Gemini optimizes this query/document pairing for asymmetric retrieval -- short queries matched against longer documents.

The embedding call is tagged with feature "hybrid-search-query" for cost attribution in the ai_usage table.

Step 3: Parallel Search

Both search engines run concurrently via Promise.all:

Keyword Search -- Typesense BM25 against the document_chunks collection, with tenant filtering and optional facet filters
Semantic Search -- pgvector cosine distance against document_chunk.embedding, running through the tenant-scoped ctx.db transaction

Both searches request limit * 2 results to give the fusion step a larger candidate pool than the final output size.
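The concurrency and over-fetching described above might look roughly like the following. This is a minimal sketch, not the router's code: the Hit and SearchFn types, the searchBothSketch name, and the two-argument search signature are all assumptions, and the real keywordSearch() and semanticSearch() take additional context such as tenant and filters.

```typescript
// Hypothetical shapes; the real result types live in @repo/search.
interface Hit {
  chunkId: string;
}
type SearchFn = (query: string, limit: number) => Promise<Hit[]>;

// Sketch: run both engines concurrently, each asked for twice the final limit.
async function searchBothSketch(
  query: string,
  limit: number,
  keywordSearch: SearchFn,
  semanticSearch: SearchFn,
): Promise<{ keywordHits: Hit[]; semanticHits: Hit[] }> {
  const candidateLimit = limit * 2; // larger candidate pool for the fusion step

  // Both promises are created before awaiting, so the searches overlap in time.
  const [keywordHits, semanticHits] = await Promise.all([
    keywordSearch(query, candidateLimit),
    semanticSearch(query, candidateLimit),
  ]);

  return { keywordHits, semanticHits };
}

// Usage with stub engines standing in for Typesense and pgvector.
const stubKeyword: SearchFn = async () => [{ chunkId: "a" }, { chunkId: "b" }];
const stubSemantic: SearchFn = async () => [{ chunkId: "b" }, { chunkId: "c" }];

searchBothSketch("notion pricing", 20, stubKeyword, stubSemantic).then((r) =>
  console.log(r.keywordHits.length, r.semanticHits.length),
);
```

Note that Promise.all rejects as soon as either promise rejects, so in this sketch a failure in one engine fails the whole request.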

Step 4: Fusion

reciprocalRankFusion() merges both result lists into a single ranked output. Documents appearing in both lists accumulate scores from each source, naturally ranking higher than single-source results.
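As a rough illustration of how scores accumulate, here is a minimal RRF sketch using the 1/(k + rank) formula with k=60. It is not the code in fusion.ts: the rrfSketch name, the RankedHit and FusedSketch types, and the 1-based ranks are assumptions, and it assumes each input list contains a given id at most once.

```typescript
interface RankedHit {
  id: string;
}

// Hypothetical stand-in for FusedResult; the real type has more fields.
interface FusedSketch {
  id: string;
  score: number;
  inKeyword: boolean;
  inSemantic: boolean;
}

function rrfSketch(keyword: RankedHit[], semantic: RankedHit[], k = 60): FusedSketch[] {
  const byId = new Map<string, FusedSketch>();

  const accumulate = (hits: RankedHit[], source: "keyword" | "semantic"): void => {
    hits.forEach((hit, index) => {
      const rank = index + 1; // assumption: ranks start at 1
      const entry: FusedSketch =
        byId.get(hit.id) ?? { id: hit.id, score: 0, inKeyword: false, inSemantic: false };
      entry.score += 1 / (k + rank); // each source contributes 1/(k + rank)
      if (source === "keyword") entry.inKeyword = true;
      else entry.inSemantic = true;
      byId.set(hit.id, entry); // keyed by id, so the output is deduplicated
    });
  };

  accumulate(keyword, "keyword");
  accumulate(semantic, "semantic");

  // Highest fused score first.
  return [...byId.values()].sort((a, b) => b.score - a.score);
}

const fused = rrfSketch([{ id: "a" }, { id: "b" }], [{ id: "b" }, { id: "c" }]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'c' ]
```

In this example "b" scores 1/62 + 1/61 because it appears in both lists, which puts it ahead of "a" (1/61) even though "a" ranked first in the keyword list.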

See Fusion Algorithm for the RRF implementation details and score calculation.

Entry Points

The hybridSearch tRPC router at packages/api/src/routers/hybrid-search.ts exposes four procedures:

Procedure	Purpose	Authorization
`search`	Production hybrid search with full filter support	`read` on `ResearchArtifact`
`typeAhead`	Prefix search with typo tolerance for autocomplete	`read` on `ResearchArtifact`
`debugSearch`	Returns keyword, semantic, and fused results separately	`read` on `Organization`
`stats`	Chunk counts grouped by source table	`read` on `Organization`

All procedures go through the authorizedProcedure chain, which means the session is validated, withTenantContext scopes the database connection, and CASL abilities are checked before any search logic runs.

Input Schema

The search procedure accepts:

z.object({
  query: z.string().min(1).max(1000),
  limit: z.number().min(1).max(100).default(20),
  sourceTable: z.enum(["research_artifact", "research_output", "extraction_result"]).optional(),
  artifactType: z.string().optional(),
  mediaType: z.string().optional(),
  planId: z.string().optional(),
  userId: z.string().optional(),
  createdAfter: z.number().optional(),
  createdBefore: z.number().optional(),
});

Filters always narrow the keyword search (via Typesense filter_by). On the semantic path, only sourceTable is applied, as a WHERE source_table = clause, and only when it is provided. The remaining filters, including planId, userId, and the date range, currently apply only to the keyword path because Typesense has faceted indexes on these fields. As a result, a filtered query can still receive semantic hits that fall outside the planId, userId, or date constraints.
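To make the keyword-side filtering concrete, the sketch below turns the input fields into a Typesense filter_by string using the `field:=value` and `&&` syntax. It is illustrative only: the function name, the Typesense field names (tenant_id, source_table, artifact_type, media_type, plan_id, user_id, created_at), and the strict date comparisons are assumptions, not taken from the actual collection schema. Escaping of values is omitted.

```typescript
interface SearchFiltersSketch {
  sourceTable?: "research_artifact" | "research_output" | "extraction_result";
  artifactType?: string;
  mediaType?: string;
  planId?: string;
  userId?: string;
  createdAfter?: number;
  createdBefore?: number;
}

// Hypothetical builder; field names on the Typesense side are assumed.
function buildFilterBySketch(tenantId: string, f: SearchFiltersSketch): string {
  // Tenant filtering is always applied first.
  const clauses: string[] = [`tenant_id:=${tenantId}`];

  if (f.sourceTable) clauses.push(`source_table:=${f.sourceTable}`);
  if (f.artifactType) clauses.push(`artifact_type:=${f.artifactType}`);
  if (f.mediaType) clauses.push(`media_type:=${f.mediaType}`);
  if (f.planId) clauses.push(`plan_id:=${f.planId}`);
  if (f.userId) clauses.push(`user_id:=${f.userId}`);

  // Assumption: date bounds are exclusive numeric comparisons.
  if (f.createdAfter !== undefined) clauses.push(`created_at:>${f.createdAfter}`);
  if (f.createdBefore !== undefined) clauses.push(`created_at:<${f.createdBefore}`);

  return clauses.join(" && ");
}

console.log(buildFilterBySketch("org_1", { sourceTable: "research_output", createdAfter: 1700000000 }));
// tenant_id:=org_1 && source_table:=research_output && created_at:>1700000000
```

Only the sourceTable value would also be passed to the semantic query; the other clauses have no counterpart on that path.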

Response Shape

{
  results: FusedResult[];  // Merged, deduplicated, ranked by RRF score
  queryType: "keyword" | "balanced" | "semantic";
  keywordCount: number;    // Results from Typesense before fusion
  semanticCount: number;   // Results from pgvector before fusion
  fusedCount: number;      // Final result count after fusion
}

Each FusedResult carries inKeyword and inSemantic booleans indicating which sources contributed the result. The Hybrid Search Admin View renders these as KW and SEM badges in the debug UI.

Key Files

File	Purpose
`packages/api/src/routers/hybrid-search.ts`	tRPC router -- orchestrates the full pipeline
`packages/search/src/fusion.ts`	`reciprocalRankFusion()` and `classifyQuery()`
`packages/search/src/keyword-search.ts`	`keywordSearch()` and `typeAheadSearch()`
`packages/ai/src/embedding.ts`	`embedQuery()` for query-time embedding
`packages/ai/src/context.ts`	`AIHelper` interface bound to tRPC context

Related Pages

Indexing -- how data enters the index (write path)
Drizzle Patterns -- vector similarity SQL pattern used by semantic search
Hybrid Search Admin View -- the three-column debug UI
Relevance -- RRF tuning, query classification heuristics, and evaluation