Indexing
Write path of the hybrid search system -- chunking, contextual retrieval, embedding, and dual-write to pgvector and Typesense.
Indexing is the write path of the hybrid search system. When research content is stored, it flows through a four-step pipeline that transforms raw text into searchable chunks in two data stores.
Pipeline Overview
The indexing pipeline runs as an Inngest background job (index-content) triggered by search/content.created events. Each step is a durable checkpoint -- if the function fails at step 3, re-execution starts at step 3 without repeating earlier steps.
Content Sources                                  Inngest Pipeline    Data Stores
───────────────                                  ────────────────    ───────────
research_artifact ─┐
research_output   ─┤─→ search/content.created ─→ chunk-content    ─→ pgvector (halfvec 1536)
extraction_result ─┘                             generate-context    Typesense (BM25 index)
                                                 embed-chunks
                                                 store-and-sync
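The checkpoint behavior described above can be illustrated with a toy step runner. This is not Inngest's implementation or API: `StepRunner`, `runPipeline`, and the in-memory `Map` are hypothetical stand-ins, meant only to show why a retry resumes at the failed step rather than at the beginning.

```typescript
// Toy model of durable steps: results of completed steps are memoized by id,
// so a retry replays them from the store instead of re-running the work.
// NOTE: hypothetical sketch, not Inngest's actual implementation.
type StepStore = Map<string, unknown>;

class StepRunner {
  private readonly store: StepStore;

  constructor(store: StepStore) {
    this.store = store;
  }

  async run<T>(id: string, fn: () => Promise<T>): Promise<T> {
    if (this.store.has(id)) {
      return this.store.get(id) as T; // checkpoint hit: skip the work
    }
    const result = await fn();
    this.store.set(id, result); // checkpoint is written only after success
    return result;
  }
}

// Step ids mirror the four pipeline stages; the step bodies are placeholders.
async function runPipeline(
  store: StepStore,
  log: string[],
  failAt?: string,
): Promise<void> {
  const step = new StepRunner(store);
  const ids = ["chunk-content", "generate-context", "embed-chunks", "store-and-sync"];
  for (const id of ids) {
    await step.run(id, async () => {
      if (id === failAt) throw new Error(`simulated failure in ${id}`);
      log.push(id); // records which steps actually executed
      return id;
    });
  }
}
```

Calling `runPipeline(store, log, "embed-chunks")` and then `runPipeline(store, log)` with the same `store` executes only `embed-chunks` and `store-and-sync` on the second call, because the first two results are already checkpointed.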
Step 1: Chunk Content
Split the source text using a recursive character splitter. Target chunk size is 2048 characters (~512 tokens) with 200 characters (~50 tokens) of overlap. Separators are tried in order: paragraph > sentence > word > character.
See Chunking for the algorithm details.
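As a rough illustration of the splitting strategy, the sketch below breaks text on the coarsest separator that yields small enough pieces, then merges pieces into chunks. It is a simplified approximation, not the code in packages/ai/src/chunking.ts: the separator literals, the function names, and the way overlap is applied (by prefixing each chunk with the tail of the previous one) are all assumptions.

```typescript
// Assumed separator literals for paragraph > sentence > word > character.
const SEPARATORS: string[] = ["\n\n", ". ", " ", ""];

// Recursively break text into pieces no longer than maxLen, trying the
// coarsest separator first. Separators stay attached to the preceding piece,
// so concatenating the pieces reproduces the input exactly.
function splitToPieces(text: string, maxLen: number, separators: string[]): string[] {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];
  const [sep, ...rest] = separators;
  if (!sep) {
    // Last resort: hard cut at character boundaries.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) out.push(text.slice(i, i + maxLen));
    return out;
  }
  const parts = text.split(sep);
  return parts.flatMap((part, i) =>
    splitToPieces(i < parts.length - 1 ? part + sep : part, maxLen, rest),
  );
}

// Merge pieces into chunks of at most chunkSize characters.
// Assumes 0 <= overlap < chunkSize.
function chunkText(text: string, chunkSize = 2048, overlap = 200): string[] {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];
  const budget = chunkSize - overlap; // leave room for the overlap prefix
  const bodies: string[] = [];
  let current = "";
  for (const piece of splitToPieces(text, budget, SEPARATORS)) {
    if (current.length > 0 && current.length + piece.length > budget) {
      bodies.push(current);
      current = "";
    }
    current += piece;
  }
  if (current.length > 0) bodies.push(current);
  // Every chunk after the first starts with the tail of its predecessor.
  return bodies.map((body, i) => {
    if (i === 0) return body;
    const prev = bodies[i - 1];
    return prev.slice(Math.max(0, prev.length - overlap)) + body;
  });
}
```

Text that already fits within `chunkSize` comes back as a single chunk, which is the case that lets short documents skip context generation in the next step.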
Step 2: Generate Context
For multi-chunk documents, Claude Haiku generates a 2-3 sentence contextual prefix for each chunk. This implements Anthropic's contextual retrieval technique, which Anthropic reports reduces retrieval failures by ~49% when contextual embeddings are combined with contextual BM25. Single-chunk documents (short content that fits in one chunk) skip this step.
See Contextual Retrieval for the prompt design and optimization decisions.
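The skip rule for single-chunk documents can be sketched as follows. The `ContextGenerator` type is a hypothetical stand-in for the Claude Haiku call (the real prompt and client live in packages/ai/src/contextual-retrieval.ts and are not reproduced here), and running the calls concurrently is a choice of this sketch, not a documented behavior.

```typescript
// Hypothetical stand-in for the Claude Haiku call: given the full document
// and one chunk, resolve to a short contextual prefix for that chunk.
type ContextGenerator = (document: string, chunk: string) => Promise<string>;

interface ContextualChunk {
  text: string;
  contextPrefix: string | null; // null when the document is a single chunk
}

async function addContext(
  document: string,
  chunks: string[],
  generate: ContextGenerator,
): Promise<ContextualChunk[]> {
  // Single-chunk documents skip context generation entirely.
  if (chunks.length <= 1) {
    return chunks.map((text) => ({ text, contextPrefix: null }));
  }
  // One generation call per chunk; Promise.all preserves chunk order.
  return Promise.all(
    chunks.map(async (text) => ({
      text,
      contextPrefix: await generate(document, text),
    })),
  );
}
```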
Step 3: Embed Chunks
Gemini Embedding 2 generates a 1536-dimensional vector for each chunk. The embedded text is contextPrefix + "\n\n" + originalText for multi-chunk documents, or just the original text for single-chunk documents. Indexing calls use the RETRIEVAL_DOCUMENT task type.
See Embedding for the model configuration and cost tracking.
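The rule for assembling the embedded text is small enough to show directly. The type and function names below are illustrative assumptions; only the concatenation format and the 1536-dimension target come from this page.

```typescript
// Illustrative chunk shape: contextPrefix is null for single-chunk documents.
interface ChunkForEmbedding {
  text: string;
  contextPrefix: string | null;
}

const EMBEDDING_DIMENSIONS = 1536;

// Multi-chunk documents embed prefix + blank line + original text;
// single-chunk documents embed the original text unchanged.
function buildEmbeddingInput(chunk: ChunkForEmbedding): string {
  return chunk.contextPrefix ? `${chunk.contextPrefix}\n\n${chunk.text}` : chunk.text;
}

// Defensive check before storing a vector in a 1536-dimensional column.
function assertDimensions(vector: number[]): number[] {
  if (vector.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`expected ${EMBEDDING_DIMENSIONS} dimensions, got ${vector.length}`);
  }
  return vector;
}
```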
Step 4: Store and Sync
Dual-write to both data stores:
- Insert chunk rows into document_chunk via withTenantContext (RLS-scoped)
- Upsert the same chunks into Typesense via indexChunks from @repo/search
See Index Management for the Typesense collection schema and sync operations.
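A minimal sketch of the dual-write, assuming the two stores are reached through injected functions. The `ChunkStores` interface is hypothetical: it stands in for withTenantContext and indexChunks, whose real signatures are not shown on this page. Writing to Postgres before Typesense is also an assumption of the sketch.

```typescript
interface ChunkRecord {
  id: string;
  text: string;
  embedding: number[];
}

// Hypothetical ports for the two data stores.
interface ChunkStores {
  // Stands in for the tenant-scoped insert into document_chunk.
  insertChunks(tenantId: string, chunks: ChunkRecord[]): Promise<void>;
  // Stands in for the Typesense upsert of the same chunks.
  upsertToSearch(chunks: ChunkRecord[]): Promise<void>;
}

async function storeAndSync(
  stores: ChunkStores,
  tenantId: string,
  chunks: ChunkRecord[],
): Promise<void> {
  if (chunks.length === 0) return; // nothing to write
  // Assumed ordering: the relational write happens first, so a failure there
  // leaves the keyword index untouched.
  await stores.insertChunks(tenantId, chunks);
  await stores.upsertToSearch(chunks);
}
```

Because the second write is an upsert, repeating it after a partial failure should not create duplicate documents in the keyword index.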
Content Sources
Three source tables currently feed the indexing pipeline:
| Source Table | Emitter | Content Type |
|---|---|---|
| research_artifact | store_research MCP tool | Research findings, analyses |
| research_output | store_research_output MCP tool | Final deliverables |
| extraction_result | Content extraction pipeline | Extracted text from documents |
Source provenance is tracked via a polymorphic source_table enum + source_id pair on every chunk. This keeps the chunk table unified for search while maintaining traceability back to the original record.
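In TypeScript terms, the polymorphic pair might be modeled as below. The field names `sourceTable` and `sourceId` and the `groupBySource` helper are assumptions for illustration; the actual column definitions live in the schema.

```typescript
// The three source tables named in this section.
type SourceTable = "research_artifact" | "research_output" | "extraction_result";

// Assumed shape of the provenance pair carried by every chunk.
interface ChunkProvenance {
  sourceTable: SourceTable;
  sourceId: string;
}

// Group chunks by their originating record, e.g. to trace search hits back
// to the documents they came from.
function groupBySource<T extends ChunkProvenance>(chunks: T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const chunk of chunks) {
    const key = `${chunk.sourceTable}:${chunk.sourceId}`;
    const group = groups.get(key) ?? [];
    group.push(chunk);
    groups.set(key, group);
  }
  return groups;
}
```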
Correlation ID
Each indexing run generates a correlationId (UUID) that links all AI calls for that run -- both the Haiku contextual retrieval calls and the Gemini embedding call. This correlation ID is stored on the document_chunk row and in the ai_usage table, enabling cost analysis per indexing operation.
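The cost analysis this enables can be sketched with an in-memory list of usage rows. The `UsageRecord` fields other than `correlationId` are hypothetical and do not reflect the actual ai_usage columns.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical shape of an ai_usage row; only correlationId is documented.
interface UsageRecord {
  correlationId: string;
  model: string;
  costUsd: number;
}

// One id per indexing run, shared by every AI call in that run.
function newCorrelationId(): string {
  return randomUUID();
}

// Sum the cost of all AI calls that belong to a single indexing run.
function costForRun(usage: UsageRecord[], correlationId: string): number {
  return usage
    .filter((row) => row.correlationId === correlationId)
    .reduce((total, row) => total + row.costUsd, 0);
}
```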
Key Files
| File | Purpose |
|---|---|
| apps/web/src/inngest/functions/index-content.ts | The Inngest function orchestrating all four steps |
| packages/ai/src/chunking.ts | Recursive character text splitter |
| packages/ai/src/contextual-retrieval.ts | Haiku contextual prefix generation |
| packages/ai/src/embedding.ts | Gemini embedding generation |
| packages/search/src/collections.ts | Typesense collection schema and sync utilities |
Related Pages
- Search Decision (ADR-009) -- Why Typesense + pgvector, and the alternatives considered
- Data & Storage: Schema Design -- document_chunk table definition and tenant scoping
- Data & Storage: Background Jobs / Job Definitions -- The index-content function from the Inngest perspective
- Data & Storage: Background Jobs / Event Patterns -- How search/content.created events flow from MCP tools to Inngest