Document Chunking

Recursive character text splitter that breaks source documents into overlapping chunks for embedding and search indexing.

The chunking step splits source documents into pieces sized for embedding. It uses a recursive character splitting algorithm that tries to break text at natural boundaries -- paragraphs first, then sentences, then words -- to keep semantically coherent content together within each chunk.

Configuration

Parameter	Default	Description
`chunkSize`	2048 chars (~512 tokens)	Target maximum characters per chunk
`chunkOverlap`	200 chars (~50 tokens)	Characters of overlap between adjacent chunks

These values are the defaults built into chunkText. The index-content Inngest function passes no options to chunkText, so the defaults always apply during indexing. The 512-token target was chosen to balance embedding quality (smaller chunks are more precise) against context preservation (larger chunks retain more meaning).
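As a rough illustration of how the defaults and their token equivalents relate, here is a minimal sketch. The `ChunkOptionsSketch` shape, the `resolveOptions` helper, and the `DEFAULTS` constant are hypothetical names used for illustration; the real ChunkOptions interface in packages/ai/src/chunking.ts may be structured differently.

```typescript
// Hypothetical option shape; the real ChunkOptions interface may differ.
interface ChunkOptionsSketch {
  chunkSize?: number;
  chunkOverlap?: number;
}

// Defaults as documented in the table above.
const DEFAULTS: Required<ChunkOptionsSketch> = {
  chunkSize: 2048,
  chunkOverlap: 200,
};

// Fill in any option the caller left out.
function resolveOptions(
  options: ChunkOptionsSketch = {},
): Required<ChunkOptionsSketch> {
  return {
    chunkSize: options.chunkSize ?? DEFAULTS.chunkSize,
    chunkOverlap: options.chunkOverlap ?? DEFAULTS.chunkOverlap,
  };
}

// Convert character budgets to approximate tokens (~4 chars per token).
const CHARS_PER_TOKEN = 4;
const resolved = resolveOptions();
const approxChunkTokens = resolved.chunkSize / CHARS_PER_TOKEN; // 512
const approxOverlapTokens = resolved.chunkOverlap / CHARS_PER_TOKEN; // 50
```

Calling `resolveOptions()` with no arguments mirrors what the pipeline does today: every value falls through to its default.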

Algorithm

The chunkText function in packages/ai/src/chunking.ts works in two phases:

Phase 1: Recursive splitting

The algorithm maintains an ordered list of separator strings, tried from coarsest to finest:

"\n\n"  (paragraph break)
"\n"    (line break)
". "    (sentence end)
"! "    (exclamation)
"? "    (question)
"; "    (semicolon)
", "    (comma)
" "     (space)
""      (character-level, last resort)

It splits text using the first (coarsest) separator. When a resulting piece exceeds chunkSize, that piece is recursively split using the next finer separator. A paragraph that fits in chunkSize therefore stays intact, while an oversized paragraph is split at line breaks, then at sentence boundaries, then at clause and word boundaries, and only as a last resort between individual characters.
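The recursion described above can be sketched as follows. This is a simplified illustration, not the actual recursiveSplit helper: the function name `recursiveSplitSketch` is hypothetical, and the sketch makes two assumptions the source does not confirm. It keeps each separator attached to the piece that precedes it, and it does not pack adjacent small pieces back together.

```typescript
// Separators from coarsest to finest, as listed above.
const SEPARATORS = ["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " ", ""];

function recursiveSplitSketch(
  text: string,
  chunkSize: number,
  level = 0,
): string[] {
  // A piece that already fits is kept intact.
  if (text.length <= chunkSize) return [text];

  const separator = SEPARATORS[level];
  if (separator === undefined || separator === "") {
    // Last resort: cut fixed-width slices between characters.
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  const parts = text.split(separator);
  const out: string[] = [];
  parts.forEach((part, i) => {
    // Assumption: reattach the separator to every part except the last.
    const piece = i < parts.length - 1 ? part + separator : part;
    if (piece.length === 0) return;
    // Oversized pieces are split again with the next finer separator.
    out.push(...recursiveSplitSketch(piece, chunkSize, level + 1));
  });
  return out;
}

const sample =
  "First paragraph.\n\nSecond one is longer. It has two sentences.";
const splitResult = recursiveSplitSketch(sample, 30);
// The first paragraph fits and stays whole; the second is split at ". ".
```

Because separators are reattached, joining the pieces reproduces the input exactly, which makes the sketch easy to check.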

Phase 2: Overlap merging

After splitting, overlap is added between adjacent chunks. Each chunk except the first has the last chunkOverlap characters of the previous chunk prepended. If the result exceeds chunkSize + chunkOverlap characters, it is truncated to that limit, so a final chunk can be up to chunkOverlap characters longer than chunkSize. Each chunk is trimmed of leading and trailing whitespace.
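A minimal sketch of the overlap step, under stated assumptions: the name `mergeWithOverlapSketch` is hypothetical, the overlap is taken from the previous piece as it was before its own overlap was added, and truncation keeps the start of the text. The real mergeWithOverlap helper may differ on any of these points.

```typescript
function mergeWithOverlapSketch(
  pieces: string[],
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  const limit = chunkSize + chunkOverlap;
  return pieces.map((piece, i) => {
    // The first chunk has no predecessor, so it gets no overlap.
    // Guard against chunkOverlap = 0, since slice(-0) returns the whole string.
    const previous = i > 0 ? (pieces[i - 1] ?? "") : "";
    const tail = chunkOverlap > 0 ? previous.slice(-chunkOverlap) : "";
    // Prepend the overlap, cap the length, then trim whitespace.
    return (tail + piece).slice(0, limit).trim();
  });
}

const merged = mergeWithOverlapSketch(["abcdef", "ghijkl"], 6, 2);
// merged is ["abcdef", "efghijkl"]
```

The second chunk starts with "ef", the last two characters of the first, and reaches the 8-character cap of chunkSize + chunkOverlap.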

Short content bypass

Content shorter than chunkSize characters (2048 by default, approximately 512 tokens) is returned as a single chunk without splitting. This avoids unnecessary processing for short documents and triggers the single-chunk optimization in the contextual retrieval step, which skips Haiku context generation entirely.
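The bypass amounts to a length check before any splitting happens. The sketch below is illustrative only: `chunkOrBypass` and its `split` parameter are hypothetical names, and the sketch returns short content unmodified, which the real chunkText may not do.

```typescript
const DEFAULT_CHUNK_SIZE = 2048;

function chunkOrBypass(
  content: string,
  split: (text: string) => string[],
  chunkSize = DEFAULT_CHUNK_SIZE,
): string[] {
  // Short content skips the splitter and comes back as one chunk.
  if (content.length < chunkSize) return [content];
  return split(content);
}

// A splitter that fails loudly, to show it never runs for short input.
const neverSplit = (_text: string): string[] => {
  throw new Error("splitter should not run for short content");
};

const shortResult = chunkOrBypass("A short note.", neverSplit);
// shortResult is ["A short note."]
```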

Output

interface TextChunk {
  /** Zero-based index of this chunk within the source document. */
  index: number;
  /** The chunk text. */
  text: string;
  /** Approximate token count (~4 chars per token). */
  tokenEstimate: number;
}

The tokenEstimate uses a simple ceil(length / 4) heuristic. This is adequate for cost estimation and logging but should not be used for precise token counting.
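For reference, the heuristic and the way it populates each output record can be sketched as below. `TextChunkSketch`, `estimateTokens`, and `toTextChunks` are hypothetical names for illustration; only the ceil(length / 4) formula and the three field names come from the interface above.

```typescript
// Mirrors the fields of the TextChunk interface shown above.
interface TextChunkSketch {
  index: number;
  text: string;
  tokenEstimate: number;
}

const CHARS_PER_TOKEN_ESTIMATE = 4;

// ceil(length / 4): a rough estimate, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN_ESTIMATE);
}

// Attach a zero-based index and a token estimate to each chunk text.
function toTextChunks(texts: string[]): TextChunkSketch[] {
  return texts.map((text, index) => ({
    index,
    text,
    tokenEstimate: estimateTokens(text),
  }));
}

const exampleChunks = toTextChunks(["hello", "a".repeat(2048)]);
// tokenEstimate is 2 for "hello" (5 chars) and 512 for the 2048-char chunk.
```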

Token estimation

The codebase uses a consistent ~4 characters per token approximation for English text. This appears in:

chunkText for tokenEstimate on each chunk
embedContent for input token estimation when the Gemini API doesn't return token counts
storeAndSyncChunks for the tokenCount column on document_chunk rows

Usage in the pipeline

The chunking step runs as Step 1 of the index-content Inngest function. It is a pure function with no external dependencies (no API calls, no database access), making it safe to retry without side effects.

// From apps/web/src/inngest/functions/index-content.ts
const chunks: TextChunk[] = await step.run("chunk-content", () => {
  return chunkText(data.content);
});

Key file

packages/ai/src/chunking.ts -- the chunkText function, ChunkOptions and TextChunk interfaces, and the internal recursiveSplit and mergeWithOverlap helpers.