Index Management
Typesense collection schema, document sync operations, bootstrap behavior, and the dual-write pattern to pgvector and Typesense.
The indexing pipeline writes to two data stores: pgvector (semantic search) and Typesense (keyword search). This page covers the Typesense collection schema, the sync operations, and the dual-write pattern. For the pgvector document_chunk table schema, see Data & Storage: Schema Design.
Typesense Collection Schema
The document_chunks collection stores the text and metadata needed for BM25 keyword search, type-ahead, and faceted filtering. It mirrors a subset of the pgvector document_chunk table.
// From packages/search/src/collections.ts
const chunksSchema: CollectionCreateSchema = {
  name: "document_chunks",
  fields: [
    { name: "id", type: "string" },
    { name: "organization_id", type: "string", facet: false, index: false },
    { name: "source_table", type: "string", facet: true },
    { name: "source_id", type: "string", facet: false },
    { name: "title", type: "string" },
    { name: "embedded_text", type: "string" },
    { name: "artifact_type", type: "string", facet: true, optional: true },
    { name: "media_type", type: "string", facet: true, optional: true },
    { name: "plan_id", type: "string", facet: true, optional: true },
    { name: "user_id", type: "string", facet: true },
    { name: "created_at", type: "int64", facet: false },
  ],
  default_sorting_field: "created_at",
  token_separators: ["-", "_"],
};
Field Notes
- organization_id -- Not indexed or faceted. Used only in filter_by clauses for tenant isolation. Setting index: false saves memory since this field is never searched by text content.
- embedded_text -- The full text (context prefix + original) used for BM25 matching. This is the same text that was embedded as a vector in pgvector.
- token_separators -- Hyphens and underscores are treated as token separators, enabling search to match slug-style and hyphenated content (e.g., "10-K" matches queries for "10K" or "10-K").
- Faceted fields -- source_table, artifact_type, media_type, plan_id, and user_id support faceted count aggregation in search results.
- created_at -- Unix timestamp (int64), used as the default sorting field and for date range filtering.
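As an illustration of how the faceted fields and created_at might be used at query time, the sketch below builds a Typesense search-parameter object. The helper name, the options shape, and the choice of query_by fields are assumptions made for this example and are not taken from packages/search.

```typescript
// Hypothetical helper: builds search parameters for the document_chunks collection.
interface ChunkSearchOptions {
  query: string;
  createdAfter?: Date; // inclusive lower bound
  createdBefore?: Date; // inclusive upper bound
}

interface ChunkSearchParams {
  q: string;
  query_by: string;
  facet_by: string;
  sort_by: string;
  filter_by?: string;
}

// The fields declared with facet: true in the schema.
const FACET_FIELDS = ["source_table", "artifact_type", "media_type", "plan_id", "user_id"];

// created_at is stored as Unix seconds, so Date values are converted before filtering.
function toUnixSeconds(date: Date): number {
  return Math.floor(date.getTime() / 1000);
}

function buildChunkSearchParams(options: ChunkSearchOptions): ChunkSearchParams {
  const filters: string[] = [];
  if (options.createdAfter) {
    filters.push(`created_at:>=${toUnixSeconds(options.createdAfter)}`);
  }
  if (options.createdBefore) {
    filters.push(`created_at:<=${toUnixSeconds(options.createdBefore)}`);
  }

  const params: ChunkSearchParams = {
    q: options.query,
    query_by: "title,embedded_text", // assumption: both text fields are searched
    facet_by: FACET_FIELDS.join(","),
    sort_by: "created_at:desc",
  };
  if (filters.length > 0) {
    params.filter_by = filters.join(" && ");
  }
  return params;
}
```

The returned object has the shape accepted by the Typesense documents search call; a tenant filter would be combined into filter_by with && in the same way.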
Collection Bootstrap
ensureCollections() runs on Next.js app startup. It checks whether the document_chunks collection exists and creates it if missing, so a fresh deployment always starts with the collection in place. It does not modify an existing collection, so a schema change is not picked up by this function alone.
export async function ensureCollections(): Promise<void> {
  const client = getTypesense();
  try {
    await client.collections(CHUNKS_COLLECTION).retrieve();
  } catch {
    await client.collections().create(chunksSchema);
  }
}
The function is safe to call repeatedly -- it is a no-op when the collection already exists.
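The function above treats any retrieve failure as "collection missing". A stricter variant could create the collection only when the error is a "not found" response and rethrow everything else (for example a network or authentication failure). The sketch below is not the project's implementation: the CollectionsClient interface is a hypothetical stand-in for the client, and it assumes that "not found" errors expose an httpStatus of 404.

```typescript
// Hypothetical stand-in for the parts of the Typesense client used here.
interface CollectionsClient {
  retrieve(name: string): Promise<unknown>;
  create(schema: { name: string }): Promise<unknown>;
}

// Assumption: "not found" errors carry httpStatus === 404.
function isNotFound(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { httpStatus?: number }).httpStatus === 404
  );
}

async function ensureCollectionStrict(
  client: CollectionsClient,
  schema: { name: string },
): Promise<"exists" | "created"> {
  try {
    await client.retrieve(schema.name);
    return "exists";
  } catch (error) {
    // Anything other than "not found" is a real failure and should surface.
    if (!isNotFound(error)) throw error;
    await client.create(schema);
    return "created";
  }
}
```

With this shape, a Typesense outage at startup surfaces as the original error instead of a failed create call.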
Sync Operations
All sync operations are in packages/search/src/collections.ts:
indexChunk(doc: ChunkDocument)
Upserts a single document into Typesense. Used for one-off updates.
indexChunks(docs: ChunkDocument[])
Batch upserts multiple documents. Used by the store-and-sync step in the indexing pipeline. Calling it with an empty array is a no-op.
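A minimal sketch of the batch upsert, assuming it is built on the Typesense documents import call with action: "upsert". The DocumentsClient interface is a hypothetical stand-in for that part of the client, and the failure handling is an illustrative choice; the real indexChunks may differ.

```typescript
// Per-document result, mirroring the shape returned by a Typesense bulk import.
interface ImportResult {
  success: boolean;
  error?: string;
}

// Hypothetical stand-in for the documents API of a collection.
interface DocumentsClient<T> {
  import(docs: T[], options: { action: "upsert" }): Promise<ImportResult[]>;
}

// Returns the number of documents written; throws if any document was rejected.
async function indexChunksSketch<T>(documents: DocumentsClient<T>, docs: T[]): Promise<number> {
  // Skip the round trip entirely for an empty batch.
  if (docs.length === 0) return 0;

  const results = await documents.import(docs, { action: "upsert" });
  const failures = results.filter((result) => !result.success);
  if (failures.length > 0) {
    const firstError = failures[0]?.error ?? "unknown error";
    throw new Error(`${failures.length} of ${docs.length} chunk(s) failed to index: ${firstError}`);
  }
  return results.length;
}
```

Throwing on partial failure lets the calling step retry the whole batch, which is safe because upserts overwrite documents that were already written.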
removeChunk(id: string)
Deletes a single document by ID. Idempotent -- silently ignores "not found" errors.
removeChunksBySource(sourceTable, sourceId)
Deletes all chunks for a given source record. Uses Typesense's filter_by deletion:
filter_by: `source_table:=${sourceTable} && source_id:=${sourceId}`;
This is used when re-indexing a source document (delete old chunks, then index new ones).
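Because the filter string is built by interpolation, values containing characters that are meaningful in filter syntax could break it. The sketch below wraps each value in backticks, which Typesense accepts for escaping filter values, to the best of our knowledge. The helper names are hypothetical and this is not necessarily what removeChunksBySource does.

```typescript
// Hypothetical helper: wraps a filter value in backticks so special characters
// (commas, parentheses, ampersands) are treated as part of the value.
function quoteFilterValue(value: string): string {
  if (value.includes("`")) {
    // A backtick inside the value cannot be escaped this way, so reject it.
    throw new Error("Filter values must not contain backticks");
  }
  return "`" + value + "`";
}

// Builds the filter used to delete every chunk of one source record.
function buildSourceFilter(sourceTable: string, sourceId: string): string {
  return `source_table:=${quoteFilterValue(sourceTable)} && source_id:=${quoteFilterValue(sourceId)}`;
}
```

The resulting string would be passed as filter_by to the documents delete call on the document_chunks collection.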
ChunkDocument Shape
The TypeScript interface for Typesense documents:
interface ChunkDocument {
  id: string;
  organization_id: string;
  source_table: string;
  source_id: string;
  title: string;
  embedded_text: string;
  artifact_type?: string;
  media_type?: string;
  plan_id?: string;
  user_id: string;
  created_at: number; // Unix timestamp in seconds
}
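To show how a stored chunk might be mapped onto this shape, here is a small sketch. The ChunkRow type is a hypothetical camelCase row shape invented for the example; the actual document_chunk columns are defined in packages/db/src/schema/search.ts.

```typescript
interface ChunkDocument {
  id: string;
  organization_id: string;
  source_table: string;
  source_id: string;
  title: string;
  embedded_text: string;
  artifact_type?: string;
  media_type?: string;
  plan_id?: string;
  user_id: string;
  created_at: number; // Unix timestamp in seconds
}

// Hypothetical row shape; nullable columns become omitted optional fields.
interface ChunkRow {
  id: string;
  organizationId: string;
  sourceTable: string;
  sourceId: string;
  title: string;
  embeddedText: string;
  artifactType: string | null;
  mediaType: string | null;
  planId: string | null;
  userId: string;
  createdAt: Date;
}

function toChunkDocument(row: ChunkRow): ChunkDocument {
  const doc: ChunkDocument = {
    id: row.id,
    organization_id: row.organizationId,
    source_table: row.sourceTable,
    source_id: row.sourceId,
    title: row.title,
    embedded_text: row.embeddedText,
    user_id: row.userId,
    // Date#getTime() is in milliseconds; the schema stores whole seconds.
    created_at: Math.floor(row.createdAt.getTime() / 1000),
  };
  // Optional fields are left out rather than sent as null.
  if (row.artifactType !== null) doc.artifact_type = row.artifactType;
  if (row.mediaType !== null) doc.media_type = row.mediaType;
  if (row.planId !== null) doc.plan_id = row.planId;
  return doc;
}
```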
Dual-Write Pattern
The store-and-sync step (Step 4 of index-content) writes to both stores in sequence:
- pgvector first -- Insert chunk rows via withTenantContext (RLS-scoped transaction)
- Typesense second -- Upsert the same data via indexChunks
If pgvector succeeds but Typesense fails, the Inngest step retries. Since indexChunks uses upsert semantics, retrying is safe. If pgvector fails, the step fails before reaching Typesense, so both stores remain consistent.
The dual-write is not atomic across both stores. There is a brief window where pgvector has the data but Typesense does not. At Trovella's scale and usage pattern (background indexing, not real-time), this is acceptable.
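The ordering described above can be sketched as a function with the two writes injected. The names storeAndSync, insertIntoPgvector, and upsertIntoTypesense are hypothetical. The sketch also assumes the pgvector write tolerates being repeated, because a retried step runs both writes again.

```typescript
// Hypothetical dependencies: the two writes are injected so the ordering is explicit.
interface DualWriteDeps<T> {
  insertIntoPgvector(chunks: T[]): Promise<void>;
  upsertIntoTypesense(chunks: T[]): Promise<void>;
}

// Writes to pgvector first, then Typesense. A throw from either write fails the
// whole step, and the step runner (Inngest in this pipeline) is expected to retry it.
async function storeAndSync<T>(deps: DualWriteDeps<T>, chunks: T[]): Promise<void> {
  // If this throws, Typesense is never touched.
  await deps.insertIntoPgvector(chunks);
  // If this throws, pgvector already holds the rows until a retry succeeds.
  await deps.upsertIntoTypesense(chunks);
}
```

Keeping Typesense second means the keyword index never holds a chunk that is missing from pgvector.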
pgvector Storage
Chunks are stored in the document_chunk table with RLS tenant isolation. Key columns not present in the Typesense schema:
| Column | Purpose |
|---|---|
| embedding | halfvec(1536) vector for cosine similarity search |
| original_text | The chunk text before context prefix was prepended |
| context_prefix | The LLM-generated context (stored separately for debugging) |
| chunk_index | Zero-based position within the source document |
| token_count | Approximate token count of the embedded text |
| correlation_id | Links to ai_usage records for cost analysis |
For the full table definition and indexes, see Data & Storage: Schema Design.
Typesense Client
The Typesense client (packages/search/src/client.ts) is a module-level singleton initialized from environment variables:
| Variable | Default | Description |
|---|---|---|
| TYPESENSE_API_KEY | (required) | API key for authentication |
| TYPESENSE_URL | http://localhost:8108 | Typesense server URL |
The client is configured with a 5-second connection timeout and 100ms retry interval. disconnectTypesense() resets the singleton for graceful shutdown.
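As a rough sketch of how the environment variables and timeouts above could translate into client configuration: the option names (nodes, apiKey, connectionTimeoutSeconds, retryIntervalSeconds) follow the typesense-js configuration object, while the function names and the cached-config singleton are assumptions for illustration. The real module caches the client itself.

```typescript
interface TypesenseNode {
  host: string;
  port: number;
  protocol: string;
}

interface TypesenseConfig {
  nodes: TypesenseNode[];
  apiKey: string;
  connectionTimeoutSeconds: number;
  retryIntervalSeconds: number;
}

// Hypothetical helper: derives the configuration from an environment-like record.
function buildTypesenseConfig(env: Record<string, string | undefined>): TypesenseConfig {
  const apiKey = env["TYPESENSE_API_KEY"];
  if (!apiKey) throw new Error("TYPESENSE_API_KEY is required");

  const url = new URL(env["TYPESENSE_URL"] ?? "http://localhost:8108");
  const protocol = url.protocol.replace(":", "");
  // URL#port is empty when the URL relies on the protocol's default port.
  const port = url.port ? Number(url.port) : protocol === "https" ? 443 : 80;

  return {
    nodes: [{ host: url.hostname, port, protocol }],
    apiKey,
    connectionTimeoutSeconds: 5,
    retryIntervalSeconds: 0.1, // 100ms
  };
}

// Module-level singleton: built on first use, cleared on shutdown.
let cachedConfig: TypesenseConfig | null = null;

function getTypesenseConfig(env: Record<string, string | undefined>): TypesenseConfig {
  if (cachedConfig === null) cachedConfig = buildTypesenseConfig(env);
  return cachedConfig;
}

function resetTypesenseConfig(): void {
  cachedConfig = null;
}
```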
Health Check
checkTypesense() in packages/search/src/health.ts calls the Typesense /health endpoint and returns { ok: boolean, latencyMs: number }. This is used by the application's /api/health endpoint since the Typesense Docker image has no curl/wget for container-level healthchecks.
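A health check of this shape could look roughly like the sketch below. It is not the contents of health.ts: the function name and the injected fetchImpl parameter are assumptions chosen so the example can be exercised without a running server. The Typesense /health endpoint responds with a JSON body of the form { "ok": true } when the node is healthy.

```typescript
interface HealthResult {
  ok: boolean;
  latencyMs: number;
}

// Minimal subset of the fetch API used here; the global fetch satisfies it.
type FetchLike = (url: string) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

async function checkTypesenseSketch(baseUrl: string, fetchImpl: FetchLike): Promise<HealthResult> {
  const startedAt = Date.now();
  try {
    const response = await fetchImpl(`${baseUrl}/health`);
    const body = (await response.json()) as { ok?: boolean };
    // Healthy only if both the HTTP status and the reported state are good.
    return { ok: response.ok && body.ok === true, latencyMs: Date.now() - startedAt };
  } catch {
    // Network errors count as unhealthy and never propagate to the caller.
    return { ok: false, latencyMs: Date.now() - startedAt };
  }
}
```

Returning a result instead of throwing keeps the /api/health handler simple: it can report the Typesense status alongside other dependencies without its own error handling.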
Key Files
| File | Purpose |
|---|---|
| packages/search/src/collections.ts | Collection schema, ensureCollections, CRUD operations |
| packages/search/src/client.ts | Typesense client singleton |
| packages/search/src/health.ts | Connectivity health check |
| packages/db/src/schema/search.ts | document_chunk table definition (pgvector side) |