Indexing

Write path of the hybrid search system -- chunking, contextual retrieval, embedding, and dual-write to pgvector and Typesense.

Indexing is the write path of the hybrid search system. When research content is stored, it flows through a four-step pipeline that transforms raw text into searchable chunks in two data stores.

Pipeline Overview

The indexing pipeline runs as an Inngest background job (index-content) triggered by search/content.created events. Each step is a durable checkpoint -- if the function fails at step 3, re-execution starts at step 3 without repeating earlier steps.

Content Sources                  Inngest Pipeline                    Data Stores
───────────────                  ────────────────                    ───────────
research_artifact  ─┐
research_output    ─┤─→ search/content.created ─→ chunk-content     ─→ pgvector (halfvec 1536)
extraction_result  ─┘                            generate-context      Typesense (BM25 index)
                                                 embed-chunks
                                                 store-and-sync
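
The checkpoint behaviour can be illustrated without the Inngest SDK. The sketch below is a toy model only: `makeStepRunner`, the in-memory `Map`, and the placeholder step bodies are hypothetical stand-ins for the durable state Inngest keeps between attempts, not the actual index-content function.

```typescript
// Toy model of durable steps: results of completed steps are stored by id,
// so a retry returns the stored result instead of running the step again.
// Hypothetical helper for illustration, not the Inngest SDK.
type StepStore = Map<string, unknown>;

function makeStepRunner(store: StepStore) {
  return async function run<T>(id: string, fn: () => Promise<T> | T): Promise<T> {
    if (store.has(id)) {
      return store.get(id) as T; // completed on an earlier attempt
    }
    const result = await fn();
    store.set(id, result); // checkpoint only after the step succeeds
    return result;
  };
}

// Step ids follow the pipeline diagram; the bodies are placeholders.
// `failEmbed` lets the caller simulate a transient failure at step 3.
async function indexContent(
  store: StepStore,
  log: string[],
  failEmbed: { remaining: number },
): Promise<string> {
  const run = makeStepRunner(store);
  const chunks = await run("chunk-content", () => {
    log.push("chunk-content");
    return ["chunk A", "chunk B"];
  });
  const contexts = await run("generate-context", () => {
    log.push("generate-context");
    return chunks.map((c) => `context for ${c}`);
  });
  const vectors = await run("embed-chunks", () => {
    log.push("embed-chunks");
    if (failEmbed.remaining > 0) {
      failEmbed.remaining -= 1;
      throw new Error("transient embedding failure");
    }
    return contexts.map((c) => c.length);
  });
  return run("store-and-sync", () => {
    log.push("store-and-sync");
    return `stored ${vectors.length} chunks`;
  });
}
```

Calling `indexContent` twice with the same store, where the first call fails in embed-chunks, runs chunk-content and generate-context only once; the second call resumes at embed-chunks.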

Step 1: Chunk Content

Split the source text using a recursive character splitter. Target chunk size is 2048 characters (~512 tokens) with 200 characters (~50 tokens) of overlap. Separators are tried in order: paragraph > sentence > word > character.
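
As a rough illustration of the splitting strategy, here is a minimal sketch. The separator strings, the function names, and the choice to split at size minus overlap (so that a chunk plus its overlap stays within the target size) are assumptions for this example; the code in packages/ai/src/chunking.ts may differ.

```typescript
// Assumed separator strings for paragraph > sentence > word > character.
const SEPARATORS: readonly string[] = ["\n\n", ". ", " ", ""];

// Split on `sep` but keep the separator attached to the preceding piece,
// so concatenating the pieces reproduces the input exactly.
function splitKeeping(text: string, sep: string): string[] {
  const parts = text.split(sep);
  return parts
    .map((part, i) => (i < parts.length - 1 ? part + sep : part))
    .filter((part) => part.length > 0);
}

function splitRecursive(text: string, maxLen: number, separators: readonly string[]): string[] {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];
  const [sep, ...rest] = separators;
  if (sep === undefined || sep === "") {
    // Character-level fallback: hard slices of at most maxLen characters.
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) slices.push(text.slice(i, i + maxLen));
    return slices;
  }
  const out: string[] = [];
  let current = "";
  for (const piece of splitKeeping(text, sep)) {
    if (current.length + piece.length <= maxLen) {
      current += piece; // greedily pack pieces into the current chunk
      continue;
    }
    if (current !== "") out.push(current);
    current = "";
    if (piece.length <= maxLen) current = piece;
    else out.push(...splitRecursive(piece, maxLen, rest)); // try the next, finer separator
  }
  if (current !== "") out.push(current);
  return out;
}

// Each chunk after the first is prefixed with the tail of the previous base chunk.
function chunkText(text: string, size = 2048, overlap = 200): string[] {
  const base = splitRecursive(text, size - overlap, SEPARATORS);
  return base.map((chunk, i) => {
    if (i === 0 || overlap <= 0) return chunk;
    return (base[i - 1] ?? "").slice(-overlap) + chunk;
  });
}
```

Text that fits in one chunk comes back unchanged as a single-element array, which is the case that later skips context generation.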

See Chunking for the algorithm details.

Step 2: Generate Context

For multi-chunk documents, Claude Haiku generates a 2-3 sentence contextual prefix for each chunk. This implements Anthropic's contextual retrieval technique, which Anthropic reports reduces top-20 retrieval failures by ~49% when contextual embeddings are combined with contextual BM25. Single-chunk documents (short content that fits in one chunk) skip this step.
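
The skip rule and the per-chunk call pattern might look like the sketch below. The prompt wording, `buildContextPrompt`, `generateContexts`, and the injected `GeneratePrefix` function are all hypothetical; the injected function stands in for whatever calls Claude Haiku, so the example runs without network access.

```typescript
// Hypothetical signature for the model call; injected so the sketch runs offline.
type GeneratePrefix = (prompt: string) => Promise<string>;

// Illustrative prompt only; the production prompt is documented separately.
function buildContextPrompt(document: string, chunk: string): string {
  return [
    "<document>",
    document,
    "</document>",
    "Here is the chunk we want to situate within the whole document:",
    "<chunk>",
    chunk,
    "</chunk>",
    "Give a 2-3 sentence context that situates this chunk within the document",
    "to improve search retrieval. Answer with the context only.",
  ].join("\n");
}

// Returns one prefix per chunk, or null for every chunk when the step is skipped.
async function generateContexts(
  document: string,
  chunks: string[],
  generate: GeneratePrefix,
): Promise<(string | null)[]> {
  if (chunks.length <= 1) {
    return chunks.map(() => null); // single-chunk document: no model call
  }
  return Promise.all(chunks.map((chunk) => generate(buildContextPrompt(document, chunk))));
}
```

Each call sends the whole document along with one chunk, so cost grows with both document length and chunk count.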

See Contextual Retrieval for the prompt design and optimization decisions.

Step 3: Embed Chunks

Gemini Embedding 2 generates a 1536-dimensional vector for each chunk. The embedded text is contextPrefix + "\n\n" + originalText for multi-chunk documents, or just the original text for single-chunk documents. Indexing uses the RETRIEVAL_DOCUMENT task type.
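
The composition rule for the embedded text is small enough to show directly. In this sketch the `ChunkToEmbed` and `EmbedRequest` shapes are hypothetical; only the concatenation rule, the task type, and the 1536 dimensions come from the description above.

```typescript
// Hypothetical shapes for illustration; field names are assumptions.
interface ChunkToEmbed {
  originalText: string;
  contextPrefix: string | null; // null for single-chunk documents
}

interface EmbedRequest {
  text: string;
  taskType: "RETRIEVAL_DOCUMENT";
  outputDimensionality: number;
}

const EMBEDDING_DIMENSIONS = 1536;

// Multi-chunk: prefix, blank line, original text. Single-chunk: original text only.
function buildEmbeddingInput(chunk: ChunkToEmbed): string {
  return chunk.contextPrefix ? `${chunk.contextPrefix}\n\n${chunk.originalText}` : chunk.originalText;
}

function toEmbedRequest(chunk: ChunkToEmbed): EmbedRequest {
  return {
    text: buildEmbeddingInput(chunk),
    taskType: "RETRIEVAL_DOCUMENT",
    outputDimensionality: EMBEDDING_DIMENSIONS,
  };
}
```

The prefix influences the vector while the original text can still be stored unmodified.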

See Embedding for the model configuration and cost tracking.

Step 4: Store and Sync

Dual-write to both data stores:

1. Insert chunk rows into document_chunk via withTenantContext (RLS-scoped).
2. Upsert the same chunks into Typesense via indexChunks from @repo/search.
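
The two writes can be sketched with injected dependencies. The real signatures of withTenantContext and indexChunks are not shown on this page, so `insertChunks`, `upsertToSearch`, and the `ChunkRecord` fields below are hypothetical stand-ins.

```typescript
// Hypothetical chunk shape; the real document_chunk columns may differ.
interface ChunkRecord {
  id: string;
  sourceTable: string;
  sourceId: string;
  text: string;
  embedding: number[];
}

// Hypothetical ports standing in for the two real write paths.
interface DualWriteDeps {
  // Stands in for the RLS-scoped insert performed via withTenantContext.
  insertChunks: (tenantId: string, chunks: ChunkRecord[]) => Promise<void>;
  // Stands in for indexChunks from @repo/search.
  upsertToSearch: (chunks: ChunkRecord[]) => Promise<void>;
}

// Writes in the order listed above: Postgres first, then Typesense.
async function storeAndSync(
  tenantId: string,
  chunks: ChunkRecord[],
  deps: DualWriteDeps,
): Promise<number> {
  await deps.insertChunks(tenantId, chunks);
  await deps.upsertToSearch(chunks);
  return chunks.length;
}
```

In this sketch a failure in the second write rejects the whole step, so a retry would run the insert again; that only works if the insert tolerates being re-run, which is an assumption here rather than documented behaviour.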

See Index Management for the Typesense collection schema and sync operations.

Content Sources

Three source tables currently feed the indexing pipeline:

Source Table	Emitter	Content Type
`research_artifact`	`store_research` MCP tool	Research findings, analyses
`research_output`	`store_research_output` MCP tool	Final deliverables
`extraction_result`	Content extraction pipeline	Extracted text from documents

Source provenance is tracked via a polymorphic source_table enum + source_id pair on every chunk. This keeps the chunk table unified for search while maintaining traceability back to the original record.
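
In TypeScript terms, the polymorphic pair could be modelled roughly as follows. The table names come from the list above; `ChunkSourceRef`, `isSourceTable`, and `groupSourceIds` are hypothetical names for this sketch.

```typescript
// The three source tables listed above, as a closed union.
const SOURCE_TABLES = ["research_artifact", "research_output", "extraction_result"] as const;
type SourceTable = (typeof SOURCE_TABLES)[number];

// Hypothetical shape of the provenance pair carried by every chunk.
interface ChunkSourceRef {
  sourceTable: SourceTable;
  sourceId: string;
}

// Runtime guard for values arriving as plain strings (for example from an event payload).
function isSourceTable(value: string): value is SourceTable {
  return (SOURCE_TABLES as readonly string[]).includes(value);
}

// Collects distinct source ids per table, e.g. to load the original records
// with one query per source table after a search returns chunks.
function groupSourceIds(refs: ChunkSourceRef[]): Record<SourceTable, string[]> {
  const grouped: Record<SourceTable, string[]> = {
    research_artifact: [],
    research_output: [],
    extraction_result: [],
  };
  for (const ref of refs) {
    if (!grouped[ref.sourceTable].includes(ref.sourceId)) {
      grouped[ref.sourceTable].push(ref.sourceId);
    }
  }
  return grouped;
}
```

Adding a fourth source would mean extending the enum and whichever code resolves a pair back to its record.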

Correlation ID

Each indexing run generates a correlationId (UUID) that links all AI calls for that run: the Haiku contextual retrieval calls and the Gemini embedding call. The correlation ID is stored on each document_chunk row and in the ai_usage table, enabling cost analysis per indexing operation.
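
A minimal sketch of how a shared ID enables per-run cost analysis is shown below. The `AiUsageRow` fields, the micro-dollar unit, and both function names are assumptions; the real ai_usage schema is not described on this page.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical row shape; the real ai_usage columns may differ.
interface AiUsageRow {
  correlationId: string;
  model: string;
  costMicroUsd: number; // integer micro-dollars to avoid floating-point drift
}

// One ID per indexing run, passed to every AI call made during that run.
function startIndexingRun(): { correlationId: string } {
  return { correlationId: randomUUID() };
}

// Sums recorded cost per run, across both the context and embedding calls.
function costByCorrelationId(rows: AiUsageRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.correlationId, (totals.get(row.correlationId) ?? 0) + row.costMicroUsd);
  }
  return totals;
}
```

In a database the same aggregation would be a GROUP BY on the correlation ID column.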

Key Files

File	Purpose
`apps/trovella/src/inngest/functions/index-content.ts`	The Inngest function orchestrating all four steps
`packages/ai/src/chunking.ts`	Recursive character text splitter
`packages/ai/src/contextual-retrieval.ts`	Haiku contextual prefix generation
`packages/ai/src/embedding.ts`	Gemini embedding generation
`packages/search/src/collections.ts`	Typesense collection schema and sync utilities

Related

Search Decision (ADR-009) -- Why Typesense + pgvector, and the alternatives considered
Data & Storage: Schema Design -- document_chunk table definition and tenant scoping
Data & Storage: Background Jobs / Job Definitions -- The index-content function from the Inngest perspective
Data & Storage: Background Jobs / Event Patterns -- How search/content.created events flow from MCP tools to Inngest