Document Chunking
Recursive character text splitter that breaks source documents into overlapping chunks for embedding and search indexing.
The chunking step splits source documents into pieces sized for embedding. It uses a recursive character splitting algorithm that tries to break text at natural boundaries -- paragraphs first, then sentences, then words -- to keep semantically coherent content together within each chunk.
Configuration
| Parameter | Default | Description |
|---|---|---|
| chunkSize | 2048 chars (~512 tokens) | Target maximum characters per chunk |
| chunkOverlap | 200 chars (~50 tokens) | Characters of overlap between adjacent chunks |
These values are effectively fixed in the pipeline: the index-content Inngest function passes no options to chunkText, so the defaults always apply. The 512-token target balances embedding quality (smaller chunks are more precise) against context preservation (larger chunks retain more meaning).
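The relationship between the character defaults and their token equivalents can be sketched as follows. This is a minimal illustration, not the actual code: `ChunkOptionsSketch`, `DEFAULTS`, and `resolveOptions` are hypothetical names, and the real ChunkOptions interface in packages/ai/src/chunking.ts may be shaped differently.

```typescript
// Hypothetical option shape; the real ChunkOptions interface may differ.
interface ChunkOptionsSketch {
  chunkSize?: number;
  chunkOverlap?: number;
}

// Defaults taken from the configuration table above.
const DEFAULTS = { chunkSize: 2048, chunkOverlap: 200 } as const;

// The documented ~4 characters per token approximation.
const CHARS_PER_TOKEN = 4;

// Fill in any option the caller left out (assumed behavior).
function resolveOptions(
  options: ChunkOptionsSketch = {},
): Required<ChunkOptionsSketch> {
  return {
    chunkSize: options.chunkSize ?? DEFAULTS.chunkSize,
    chunkOverlap: options.chunkOverlap ?? DEFAULTS.chunkOverlap,
  };
}

// Calling with no options mirrors what the pipeline does today.
const resolved = resolveOptions();
const chunkSizeTokens = resolved.chunkSize / CHARS_PER_TOKEN; // 512
const chunkOverlapTokens = resolved.chunkOverlap / CHARS_PER_TOKEN; // 50
```

The overlap works out to roughly 10% of the chunk size, so adjacent chunks share about one sentence or two of context.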
Algorithm
The chunkText function in packages/ai/src/chunking.ts works in two phases:
Phase 1: Recursive splitting
The algorithm maintains an ordered list of separator strings, tried from coarsest to finest:
"\n\n" (paragraph break)
"\n" (line break)
". " (sentence end)
"! " (exclamation)
"? " (question)
"; " (semicolon)
", " (comma)
" " (space)
"" (character-level, last resort)
It splits text using the first (coarsest) separator. When a resulting piece exceeds chunkSize, that piece is recursively split using the next finer separator. This means a paragraph that fits in chunkSize stays intact, but an oversized paragraph gets split at sentence boundaries, and an oversized sentence gets split at word boundaries.
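The recursion described above can be sketched roughly as follows. This is an illustrative approximation, not the actual recursiveSplit helper: the name `recursiveSplitSketch` is hypothetical, the choice to keep each separator attached to the piece it follows is an assumption, and the sketch does not pack small neighbouring pieces back together, which the real implementation may do.

```typescript
// Separators from the list above, coarsest first.
const SEPARATORS = ["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " ", ""];

function recursiveSplitSketch(
  text: string,
  chunkSize: number,
  level = 0,
): string[] {
  // Anything that already fits is kept intact.
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];

  const separator = SEPARATORS[Math.min(level, SEPARATORS.length - 1)] ?? "";

  // Last resort: cut at fixed character offsets.
  if (separator === "") {
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      cuts.push(text.slice(i, i + chunkSize));
    }
    return cuts;
  }

  const parts = text.split(separator);
  const result: string[] = [];
  for (let i = 0; i < parts.length; i++) {
    // Assumption: re-attach the separator so no characters are lost.
    const piece = i < parts.length - 1 ? parts[i] + separator : parts[i];
    if (piece.length === 0) continue;
    if (piece.length <= chunkSize) {
      result.push(piece);
    } else {
      // Oversized piece: retry with the next finer separator.
      result.push(...recursiveSplitSketch(piece, chunkSize, level + 1));
    }
  }
  return result;
}

const sample =
  "First paragraph is short.\n\n" +
  "Second paragraph is long. It has several sentences. Each one is brief.";
const pieces = recursiveSplitSketch(sample, 40);
// The first paragraph stays whole; the second is split at sentence ends.
```

Because separators stay attached in this sketch, joining the pieces reproduces the input exactly, which makes the splitting easy to check.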
Phase 2: Overlap merging
After splitting, chunks are merged with overlap. Each chunk (except the first) gets the last chunkOverlap characters from the previous chunk prepended. If adding overlap pushes a chunk beyond chunkSize + chunkOverlap, the text is truncated to that limit. Finally, leading and trailing whitespace is stripped from each chunk.
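A minimal sketch of the overlap step, under stated assumptions: `mergeWithOverlapSketch` is a hypothetical stand-in for the internal mergeWithOverlap helper, and it takes the overlap from the previous raw piece rather than from the previous already-merged chunk, a detail the description leaves open.

```typescript
function mergeWithOverlapSketch(
  pieces: string[],
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  const limit = chunkSize + chunkOverlap;
  return pieces.map((piece, i) => {
    // The first chunk has no predecessor, so it gets no overlap.
    if (i === 0) return piece.trim();

    // Guard against slice(-0), which would return the whole string.
    const tail = chunkOverlap > 0 ? pieces[i - 1].slice(-chunkOverlap) : "";

    // Prepend the overlap, cap the length, then strip whitespace.
    return (tail + piece).slice(0, limit).trim();
  });
}

const merged = mergeWithOverlapSketch(["abcdefghij", "klmnopqrst"], 10, 3);
// merged[1] starts with "hij", the last 3 characters of the first piece.
```

With pieces no longer than chunkSize, the length cap never triggers; it only matters when an upstream piece is oversized.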
Short content bypass
Content under chunkSize characters (2048 by default, approximately 512 tokens) is returned as a single chunk without splitting. This avoids unnecessary processing for short documents and triggers the single-chunk optimization in the contextual retrieval step, which skips Haiku context generation entirely.
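The bypass might look something like the sketch below. `trySingleChunk` is a hypothetical helper, not a function from the codebase, and the strict "under" comparison is an assumption about how the boundary case (content of exactly chunkSize characters) is handled. The chunk shape mirrors the TextChunk interface documented in the Output section.

```typescript
interface TextChunk {
  index: number;
  text: string;
  tokenEstimate: number;
}

const DEFAULT_CHUNK_SIZE = 2048;

// Returns a single chunk for short content, or null when the content
// is long enough to need the full splitting algorithm.
function trySingleChunk(
  content: string,
  chunkSize: number = DEFAULT_CHUNK_SIZE,
): TextChunk[] | null {
  if (content.length >= chunkSize) return null;
  return [
    {
      index: 0,
      text: content,
      tokenEstimate: Math.ceil(content.length / 4),
    },
  ];
}

const shortResult = trySingleChunk("A short note."); // one chunk
const longResult = trySingleChunk("x".repeat(5000)); // null, needs splitting
```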
Output
interface TextChunk {
  /** Zero-based index of this chunk within the source document. */
  index: number;
  /** The chunk text. */
  text: string;
  /** Approximate token count (~4 chars per token). */
  tokenEstimate: number;
}
The tokenEstimate uses a simple ceil(length / 4) heuristic. This is adequate for cost estimation and logging but should not be used for precise token counting.
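The heuristic amounts to a one-line function. The name `estimateTokens` is hypothetical; only the ceil(length / 4) formula comes from the description above. Note that JavaScript's `length` counts UTF-16 code units, so text with emoji or many non-Latin characters will drift further from a real tokenizer's count.

```typescript
// Rough token estimate: one token per 4 characters, rounded up.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const emptyEstimate = estimateTokens(""); // 0
const exactEstimate = estimateTokens("abcd"); // 1
const roundedEstimate = estimateTokens("abcde"); // 2 (rounds up)
const fullChunkEstimate = estimateTokens("x".repeat(2048)); // 512
```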
Token estimation
The codebase uses a consistent ~4 characters per token approximation for English text. This appears in:
chunkText for tokenEstimate on each chunk
embedContent for input token estimation when the Gemini API doesn't return token counts
storeAndSyncChunks for the tokenCount column on document_chunk rows
Usage in the pipeline
The chunking step runs as Step 1 of the index-content Inngest function. It is a pure function with no external dependencies (no API calls, no database access), making it safe to retry without side effects.
// From apps/web/src/inngest/functions/index-content.ts
const chunks: TextChunk[] = await step.run("chunk-content", () => {
  return chunkText(data.content);
});
Key file
packages/ai/src/chunking.ts -- the chunkText function, ChunkOptions and TextChunk interfaces, and the internal recursiveSplit and mergeWithOverlap helpers.
ADR-009: Typesense + pgvector Hybrid Search
Why Trovella uses Typesense for BM25 keyword search and pgvector for semantic search, merged via Reciprocal Rank Fusion.
Contextual Retrieval
Claude Haiku generates a 2-3 sentence context prefix per chunk before embedding, improving retrieval accuracy by ~49%.