Trovella Wiki

Indexing

Write path of the hybrid search system -- chunking, contextual retrieval, embedding, and dual-write to pgvector and Typesense.

Indexing is the write path of the hybrid search system. When research content is stored, it flows through a four-step pipeline that transforms raw text into searchable chunks in two data stores.

Pipeline Overview

The indexing pipeline runs as an Inngest background job (index-content) triggered by search/content.created events. Each step is a durable checkpoint -- if the function fails at step 3, re-execution starts at step 3 without repeating earlier steps.

Content Sources                  Inngest Pipeline                    Data Stores
───────────────                  ────────────────                    ───────────
research_artifact  ─┐
research_output    ─┤─→ search/content.created ─→ chunk-content     ─→ pgvector (halfvec 1536)
extraction_result  ─┘                            generate-context      Typesense (BM25 index)
                                                 embed-chunks
                                                 store-and-sync
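
The checkpoint behaviour described above can be illustrated with a small in-memory model. This is only a sketch: StepRunner is a hypothetical, synchronous stand-in for durable step state and is not the Inngest API. It shows that a failure in embed-chunks leaves the results of the two earlier steps recorded, so a re-run replays them instead of executing them again.

```typescript
// Simplified, synchronous model of step checkpointing.
// StepRunner is hypothetical; the real pipeline relies on Inngest for durability.
class StepRunner {
  // Results of completed steps, keyed by step id; kept across re-executions.
  private readonly completed = new Map<string, unknown>();

  run<T>(id: string, fn: () => T): T {
    if (this.completed.has(id)) {
      return this.completed.get(id) as T; // replay the stored result
    }
    const result = fn(); // may throw; nothing is recorded on failure
    this.completed.set(id, result);
    return result;
  }
}

const executed: string[] = []; // records which step bodies actually ran
let embedShouldFail = true; // simulates a transient embedding outage

function indexContent(step: StepRunner): string {
  const chunks = step.run("chunk-content", () => {
    executed.push("chunk-content");
    return ["first chunk", "second chunk"];
  });
  const prefixes = step.run("generate-context", () => {
    executed.push("generate-context");
    return chunks.map((chunk) => `context for ${chunk}`);
  });
  const vectors = step.run("embed-chunks", () => {
    executed.push("embed-chunks");
    if (embedShouldFail) throw new Error("embedding provider unavailable");
    return prefixes.map((prefix) => [prefix.length]);
  });
  return step.run("store-and-sync", () => {
    executed.push("store-and-sync");
    return `stored ${vectors.length} chunks`;
  });
}

const runner = new StepRunner();
let firstAttemptFailed = false;
try {
  indexContent(runner); // fails at step 3
} catch {
  firstAttemptFailed = true;
}
embedShouldFail = false;
const outcome = indexContent(runner); // resumes at step 3
```

After the second call, executed contains chunk-content and generate-context once each and embed-chunks twice, which is the "re-execution starts at step 3" behaviour.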

Step 1: Chunk Content

Split the source text using a recursive character splitter. Target chunk size is 2048 characters (~512 tokens) with 200 characters (~50 tokens) of overlap. Separators are tried in order: paragraph > sentence > word > character.

See Chunking for the algorithm details.
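
As a rough illustration of the splitting strategy, the sketch below tries each separator in order and falls back to a hard character cut. It is not the code in packages/ai/src/chunking.ts: the function names, the concrete separator strings, and the way overlap is applied (prefixing each chunk with the tail of the previous piece) are all assumptions.

```typescript
// Assumed separator strings for paragraph > sentence > word; character is the final fallback.
const SEPARATORS: readonly string[] = ["\n\n", ". ", " "];

// Split `text` after each occurrence of a non-empty `sep`, keeping the separator attached.
function splitKeeping(text: string, sep: string): string[] {
  const parts: string[] = [];
  let start = 0;
  for (let at = text.indexOf(sep); at !== -1; at = text.indexOf(sep, start)) {
    parts.push(text.slice(start, at + sep.length));
    start = at + sep.length;
  }
  if (start < text.length) parts.push(text.slice(start));
  return parts;
}

// Returns pieces of at most `size` characters whose concatenation equals `text`.
function splitRecursive(text: string, size: number, separators: readonly string[]): string[] {
  if (text.length <= size) return text === "" ? [] : [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard cut at character boundaries.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += size) cuts.push(text.slice(i, i + size));
    return cuts;
  }
  const pieces: string[] = [];
  let current = "";
  for (const part of splitKeeping(text, sep)) {
    if (current.length + part.length <= size) {
      current += part; // still fits: keep merging small parts
      continue;
    }
    if (current !== "") pieces.push(current);
    current = "";
    if (part.length <= size) current = part;
    else pieces.push(...splitRecursive(part, size, rest)); // too big: try a finer separator
  }
  if (current !== "") pieces.push(current);
  return pieces;
}

// Chunks of at most `size` characters; each chunk after the first repeats
// up to `overlap` characters from the end of the previous piece.
function chunkText(text: string, size = 2048, overlap = 200): string[] {
  if (overlap < 0 || overlap >= size) throw new RangeError("overlap must be in [0, size)");
  const pieces = splitRecursive(text, size - overlap, SEPARATORS);
  return pieces.map((piece, i) => {
    const previous = i === 0 ? "" : pieces[i - 1] ?? "";
    return (overlap > 0 ? previous.slice(-overlap) : "") + piece;
  });
}
```

Splitting against a budget of size minus overlap keeps every final chunk within the 2048-character target once the overlap is prepended.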

Step 2: Generate Context

For multi-chunk documents, Claude Haiku generates a 2-3 sentence contextual prefix for each chunk. This implements Anthropic's contextual retrieval technique, which Anthropic reports reduces retrieval failures by about 49% when contextual embeddings are combined with contextual BM25. Single-chunk documents (short content that fits in one chunk) skip this step.

See Contextual Retrieval for the prompt design and optimization decisions.
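
The skip rule and the per-chunk prompt can be sketched as follows. The prompt wording is modeled on the example in Anthropic's contextual retrieval write-up and may differ from the prompt this pipeline actually sends; buildContextPrompt and planContextPrompts are hypothetical names, and the call to Claude Haiku itself is left out.

```typescript
// Prompt text modeled on Anthropic's published example; the pipeline's real prompt may differ.
function buildContextPrompt(document: string, chunk: string): string {
  return [
    "<document>",
    document,
    "</document>",
    "Here is the chunk we want to situate within the whole document",
    "<chunk>",
    chunk,
    "</chunk>",
    "Please give a short succinct context to situate this chunk within the overall document " +
      "for the purposes of improving search retrieval of the chunk. " +
      "Answer only with the succinct context and nothing else.",
  ].join("\n");
}

// Returns one prompt per chunk, or an empty list for single-chunk documents,
// which skip contextual retrieval entirely.
function planContextPrompts(document: string, chunks: readonly string[]): string[] {
  if (chunks.length <= 1) return [];
  return chunks.map((chunk) => buildContextPrompt(document, chunk));
}
```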

Step 3: Embed Chunks

Gemini Embedding 2 generates a 1536-dimensional vector for each chunk. The embedded text is contextPrefix + "\n\n" + originalText for multi-chunk documents, or just the original text for single-chunk documents. Indexing requests use the RETRIEVAL_DOCUMENT task type.

See Embedding for the model configuration and cost tracking.
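
A minimal sketch of how the embedded text might be assembled is shown below. The ChunkForEmbedding and EmbeddingRequest shapes are hypothetical and do not mirror the Gemini client API; only the concatenation rule, the task type, and the dimension count come from this page.

```typescript
// Hypothetical shapes for illustration; the real types live in packages/ai/src/embedding.ts.
interface ChunkForEmbedding {
  originalText: string;
  contextPrefix: string | null; // null for single-chunk documents
}

interface EmbeddingRequest {
  text: string;
  taskType: "RETRIEVAL_DOCUMENT";
  dimensions: 1536;
}

// Multi-chunk documents embed prefix + blank line + text; single-chunk documents embed the text alone.
function buildEmbeddingInput(chunk: ChunkForEmbedding): string {
  return chunk.contextPrefix
    ? `${chunk.contextPrefix}\n\n${chunk.originalText}`
    : chunk.originalText;
}

function toEmbeddingRequest(chunk: ChunkForEmbedding): EmbeddingRequest {
  return { text: buildEmbeddingInput(chunk), taskType: "RETRIEVAL_DOCUMENT", dimensions: 1536 };
}
```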

Step 4: Store and Sync

Dual-write to both data stores:

  1. Insert chunk rows into document_chunk via withTenantContext (RLS-scoped)
  2. Upsert the same chunks into Typesense via indexChunks from @repo/search

See Index Management for the Typesense collection schema and sync operations.
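
The two writes can be sketched with injected stores. Everything here is an assumption made for illustration: the ChunkStores interface stands in for withTenantContext and indexChunks, whose real signatures are not shown on this page, and the Postgres-then-Typesense order simply mirrors the list above.

```typescript
// Hypothetical types; the real helpers are withTenantContext and indexChunks.
interface IndexedChunk {
  id: string;
  content: string;
  embedding: number[];
}

interface ChunkStores {
  // Stands in for the RLS-scoped insert into document_chunk.
  insertChunks(tenantId: string, chunks: readonly IndexedChunk[]): Promise<void>;
  // Stands in for the Typesense upsert.
  upsertSearchDocs(chunks: readonly IndexedChunk[]): Promise<void>;
}

// Writes the same chunks to both stores and returns how many were written.
async function storeAndSync(
  stores: ChunkStores,
  tenantId: string,
  chunks: readonly IndexedChunk[],
): Promise<number> {
  if (chunks.length === 0) return 0; // nothing to write
  await stores.insertChunks(tenantId, chunks); // 1. Postgres rows, scoped to the tenant
  await stores.upsertSearchDocs(chunks); // 2. same chunks mirrored into Typesense
  return chunks.length;
}
```

Passing the stores in as an interface keeps the sketch independent of either client library and makes the write order easy to check with in-memory fakes.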

Content Sources

Three source tables currently feed the indexing pipeline:

Source Table         Emitter                           Content Type
────────────         ───────                           ────────────
research_artifact    store_research MCP tool           Research findings, analyses
research_output      store_research_output MCP tool    Final deliverables
extraction_result    Content extraction pipeline       Extracted text from documents

Source provenance is tracked via a polymorphic source_table enum + source_id pair on every chunk. This keeps the chunk table unified for search while maintaining traceability back to the original record.
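
One way to model the polymorphic pair in application code is a string union plus an ID. The sketch below assumes camelCase field names and a string ID, neither of which is confirmed by this page; groupBySource is a hypothetical helper showing how chunks can be traced back to their originating records.

```typescript
// The three source tables named on this page; field names are assumed.
const SOURCE_TABLES = ["research_artifact", "research_output", "extraction_result"] as const;
type SourceTable = (typeof SOURCE_TABLES)[number];

interface ChunkSourceRef {
  sourceTable: SourceTable;
  sourceId: string;
}

// Runtime guard for values arriving from untyped input such as event payloads.
function isSourceTable(value: string): value is SourceTable {
  return (SOURCE_TABLES as readonly string[]).includes(value);
}

// Groups chunks by the record they came from, keyed as "<source_table>:<source_id>".
function groupBySource<T extends ChunkSourceRef>(chunks: readonly T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const chunk of chunks) {
    const key = `${chunk.sourceTable}:${chunk.sourceId}`;
    const bucket = groups.get(key);
    if (bucket) bucket.push(chunk);
    else groups.set(key, [chunk]);
  }
  return groups;
}
```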

Correlation ID

Each indexing run generates a correlationId (UUID) that links all AI calls for that run -- both the Haiku contextual retrieval calls and the Gemini embedding call. This correlation ID is stored on the document_chunk row and in the ai_usage table, enabling cost analysis per indexing operation.
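
Per-run cost analysis then reduces to grouping usage rows by correlation ID. The sketch below assumes a hypothetical AiUsageRow shape with an integer cost in micro-dollars; the actual ai_usage columns are not described on this page.

```typescript
// Hypothetical row shape; the real ai_usage schema may use different columns and units.
interface AiUsageRow {
  correlationId: string;
  model: string;
  costMicroUsd: number; // integer micro-dollars, to avoid floating-point drift when summing
}

// Sums the cost of every AI call that shares a correlation ID.
function costByCorrelationId(rows: readonly AiUsageRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.correlationId, (totals.get(row.correlationId) ?? 0) + row.costMicroUsd);
  }
  return totals;
}
```

For one indexing run, the total would combine the Haiku rows (one per chunk) with the Gemini embedding row.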

Key Files

File                                               Purpose
────                                               ───────
apps/web/src/inngest/functions/index-content.ts    The Inngest function orchestrating all four steps
packages/ai/src/chunking.ts                        Recursive character text splitter
packages/ai/src/contextual-retrieval.ts            Haiku contextual prefix generation
packages/ai/src/embedding.ts                       Gemini embedding generation
packages/search/src/collections.ts                 Typesense collection schema and sync utilities
