
Index Management

Typesense collection schema, document sync operations, bootstrap behavior, and the dual-write pattern to pgvector and Typesense.

The indexing pipeline writes to two data stores: pgvector (semantic search) and Typesense (keyword search). This page covers the Typesense collection schema, the sync operations, and the dual-write pattern. For the pgvector document_chunk table schema, see Data & Storage: Schema Design.

Typesense Collection Schema

The document_chunks collection stores the text and metadata needed for BM25 keyword search, type-ahead, and faceted filtering. It mirrors a subset of the pgvector document_chunk table.

// From packages/search/src/collections.ts
const chunksSchema: CollectionCreateSchema = {
  name: "document_chunks",
  fields: [
    { name: "id", type: "string" },
    { name: "organization_id", type: "string", facet: false, index: false },
    { name: "source_table", type: "string", facet: true },
    { name: "source_id", type: "string", facet: false },
    { name: "title", type: "string" },
    { name: "embedded_text", type: "string" },
    { name: "artifact_type", type: "string", facet: true, optional: true },
    { name: "media_type", type: "string", facet: true, optional: true },
    { name: "plan_id", type: "string", facet: true, optional: true },
    { name: "user_id", type: "string", facet: true },
    { name: "created_at", type: "int64", facet: false },
  ],
  default_sorting_field: "created_at",
  token_separators: ["-", "_"],
};

Field Notes

  • organization_id -- Not indexed or faceted. Used only in filter_by clauses for tenant isolation. Setting index: false saves memory since this field is never searched by text content.
  • embedded_text -- The full text (context prefix + original) used for BM25 matching. This is the same text that was embedded as a vector in pgvector.
  • token_separators -- Hyphens and underscores are treated as token separators, enabling search to match slug-style and hyphenated content (e.g., "10-K" matches queries for "10K" or "10-K").
  • Faceted fields -- source_table, artifact_type, media_type, plan_id, and user_id support faceted count aggregation in search results.
  • created_at -- Unix timestamp (int64), used as the default sorting field and for date range filtering.
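To illustrate how these fields are meant to combine at query time, here is a minimal sketch of a search-parameter builder. The helper name buildChunkSearchParams, its options, the query_by field list, and the descending sort are assumptions made for illustration and are not taken from packages/search; the keys q, query_by, filter_by, facet_by, and sort_by are standard Typesense search parameters.

```typescript
// Hypothetical helper: builds search parameters for the document_chunks collection.
interface ChunkSearchOptions {
  organizationId: string;
  query: string;
  createdAfter?: Date; // optional lower bound for date range filtering
}

interface ChunkSearchParams {
  q: string;
  query_by: string;
  filter_by: string;
  facet_by: string;
  sort_by: string;
}

function buildChunkSearchParams(opts: ChunkSearchOptions): ChunkSearchParams {
  // Tenant isolation: every query is scoped to a single organization.
  const filters: string[] = [`organization_id:=${opts.organizationId}`];
  if (opts.createdAfter) {
    // created_at is stored as Unix seconds, so convert from milliseconds.
    const seconds = Math.floor(opts.createdAfter.getTime() / 1000);
    filters.push(`created_at:>=${seconds}`);
  }
  return {
    q: opts.query,
    query_by: "title,embedded_text", // assumed searchable fields
    filter_by: filters.join(" && "),
    facet_by: "source_table,artifact_type,media_type,plan_id,user_id",
    sort_by: "created_at:desc", // assumed ordering; the schema only fixes the default sort field
  };
}
```

Keeping the tenant filter inside the builder, rather than leaving it to each caller, makes it harder to issue a query that is not scoped to an organization.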

Collection Bootstrap

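ensureCollections() runs on Next.js app startup. It checks whether the document_chunks collection exists and creates it if missing, so a fresh deployment always starts with the collection in place. As written, the function does not modify a collection that already exists, so a change to chunksSchema is not applied to an existing collection by this function.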

export async function ensureCollections(): Promise<void> {
  const client = getTypesense();
  try {
    await client.collections(CHUNKS_COLLECTION).retrieve();
  } catch {
    await client.collections().create(chunksSchema);
  }
}

The function is safe to call repeatedly -- it is a no-op when the collection already exists.
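The function above treats any retrieval error as "collection missing". A stricter variant, sketched here under assumptions, creates the collection only when the error looks like an HTTP 404 and rethrows everything else (for example a connection failure). The CollectionsApi interface, the ensureCollectionStrict name, and the httpStatus property on the error are assumptions made to keep the sketch self-contained; verify them against the Typesense client in use.

```typescript
// Minimal stand-in for the parts of a Typesense client this sketch needs (assumed shape).
interface CollectionsApi {
  retrieve(name: string): Promise<unknown>;
  create(schema: { name: string }): Promise<unknown>;
}

// Returns true when the error carries an HTTP 404 status (assumed httpStatus property).
function isNotFound(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { httpStatus?: number }).httpStatus === 404
  );
}

async function ensureCollectionStrict(
  api: CollectionsApi,
  schema: { name: string },
): Promise<"exists" | "created"> {
  try {
    await api.retrieve(schema.name);
    return "exists"; // no-op when the collection is already there
  } catch (err) {
    if (!isNotFound(err)) throw err; // do not treat an outage as "missing"
    await api.create(schema);
    return "created";
  }
}
```

Returning a status string is a design choice of the sketch; it makes the startup log say whether bootstrap actually created anything.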

Sync Operations

All sync operations are in packages/search/src/collections.ts:

indexChunk(doc: ChunkDocument)

Upserts a single document into Typesense. Used for one-off updates.

indexChunks(docs: ChunkDocument[])

Batch upserts multiple documents. Used by the store-and-sync step in the indexing pipeline. As a safety check, calling it with an empty array is a no-op.
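A minimal sketch of the empty-array guard and the batch upsert follows. It assumes a documents API with a bulk import call that accepts action: "upsert", which is how the typesense-js client exposes batch upserts; the DocumentsApi interface and the indexChunksSketch name are hypothetical stand-ins, not the real implementation.

```typescript
// Only the id matters for this sketch; real documents carry the full ChunkDocument shape.
interface IndexableDoc {
  id: string;
  [key: string]: unknown;
}

// Assumed minimal shape of a documents API with bulk import.
interface DocumentsApi {
  import(docs: IndexableDoc[], options: { action: "upsert" }): Promise<unknown>;
}

// Returns the number of documents sent, so callers can log what happened.
async function indexChunksSketch(
  api: DocumentsApi,
  docs: IndexableDoc[],
): Promise<number> {
  if (docs.length === 0) return 0; // safety check: skip the request entirely
  await api.import(docs, { action: "upsert" });
  return docs.length;
}
```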

removeChunk(id: string)

Deletes a single document by ID. Idempotent -- silently ignores "not found" errors.

removeChunksBySource(sourceTable, sourceId)

Deletes all chunks for a given source record. Uses Typesense's filter_by deletion:

filter_by: `source_table:=${sourceTable} && source_id:=${sourceId}`;

This is used when re-indexing a source document (delete old chunks, then index new ones).
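The delete-then-index order matters when a re-chunked document produces fewer chunks than before: an upsert alone would leave the old trailing chunks behind. The sketch below demonstrates this with an in-memory stand-in for the collection; InMemoryChunkIndex, reindexSource, and the "sourceId:index" ID scheme are hypothetical and used only for illustration.

```typescript
interface SourceChunk {
  id: string;
  source_table: string;
  source_id: string;
}

// In-memory stand-in for the Typesense collection (illustration only).
class InMemoryChunkIndex {
  private docs: SourceChunk[] = [];

  // Upsert semantics: replace any document with the same id, otherwise add it.
  upsertMany(chunks: SourceChunk[]): void {
    for (const chunk of chunks) {
      this.docs = this.docs.filter((d) => d.id !== chunk.id);
      this.docs.push(chunk);
    }
  }

  // Mirrors a filter_by deletion on source_table and source_id.
  removeBySource(sourceTable: string, sourceId: string): void {
    this.docs = this.docs.filter(
      (d) => !(d.source_table === sourceTable && d.source_id === sourceId),
    );
  }

  ids(): string[] {
    return this.docs.map((d) => d.id).sort();
  }
}

// Delete the old chunks for one source record, then index the fresh ones.
function reindexSource(
  index: InMemoryChunkIndex,
  sourceTable: string,
  sourceId: string,
  fresh: SourceChunk[],
): void {
  index.removeBySource(sourceTable, sourceId);
  index.upsertMany(fresh);
}
```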

ChunkDocument Shape

The TypeScript interface for Typesense documents:

interface ChunkDocument {
  id: string;
  organization_id: string;
  source_table: string;
  source_id: string;
  title: string;
  embedded_text: string;
  artifact_type?: string;
  media_type?: string;
  plan_id?: string;
  user_id: string;
  created_at: number; // Unix timestamp in seconds
}
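As an illustration of how a stored chunk could be mapped onto this shape, the sketch below converts a camelCase row into a ChunkDocument. The ChunkRow type and the toChunkDocument name are hypothetical; the two points it demonstrates are converting a Date to Unix seconds for created_at and omitting optional fields rather than sending null.

```typescript
interface ChunkDocument {
  id: string;
  organization_id: string;
  source_table: string;
  source_id: string;
  title: string;
  embedded_text: string;
  artifact_type?: string;
  media_type?: string;
  plan_id?: string;
  user_id: string;
  created_at: number; // Unix timestamp in seconds
}

// Hypothetical row shape; the real table definition lives in packages/db.
interface ChunkRow {
  id: string;
  organizationId: string;
  sourceTable: string;
  sourceId: string;
  title: string;
  embeddedText: string;
  artifactType: string | null;
  mediaType: string | null;
  planId: string | null;
  userId: string;
  createdAt: Date;
}

function toChunkDocument(row: ChunkRow): ChunkDocument {
  return {
    id: row.id,
    organization_id: row.organizationId,
    source_table: row.sourceTable,
    source_id: row.sourceId,
    title: row.title,
    embedded_text: row.embeddedText,
    // Optional fields are left out entirely when the row has no value.
    ...(row.artifactType !== null ? { artifact_type: row.artifactType } : {}),
    ...(row.mediaType !== null ? { media_type: row.mediaType } : {}),
    ...(row.planId !== null ? { plan_id: row.planId } : {}),
    user_id: row.userId,
    // Date.getTime() is in milliseconds; the schema expects whole seconds.
    created_at: Math.floor(row.createdAt.getTime() / 1000),
  };
}
```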

Dual-Write Pattern

The store-and-sync step (Step 4 of index-content) writes to both stores in sequence:

  1. pgvector first -- Insert chunk rows via withTenantContext (RLS-scoped transaction)
  2. Typesense second -- Upsert the same data via indexChunks

If pgvector succeeds but Typesense fails, the Inngest step retries. Since indexChunks uses upsert semantics, retrying is safe. If pgvector fails, the step fails before reaching Typesense, so both stores remain consistent.

The dual-write is not atomic across both stores. There is a brief window where pgvector has the data but Typesense does not. At Trovella's scale and usage pattern (background indexing, not real-time), this is acceptable.
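The ordering and retry behavior described above can be sketched as follows. Both writers are injected, and runWithRetries is a hypothetical stand-in for the step runner's retry loop, not the Inngest API. Because a retry re-runs the whole step, the sketch also assumes the pgvector write is safe to repeat, which this page does not state explicitly.

```typescript
type Writer<T> = (chunks: T[]) => Promise<void>;

// Step body: pgvector first, Typesense second. A throw from the first
// writer means the second is never reached.
async function storeAndSync<T>(
  chunks: T[],
  writePgvector: Writer<T>,
  writeTypesense: Writer<T>,
): Promise<void> {
  await writePgvector(chunks);
  await writeTypesense(chunks);
}

// Hypothetical retry loop: re-runs the whole step until it succeeds or
// attempts run out, and returns the attempt number that succeeded.
async function runWithRetries(
  step: () => Promise<void>,
  maxAttempts: number,
): Promise<number> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await step();
      return attempt;
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

With a Typesense writer that fails once, the step runs twice and the pgvector writer is called on both attempts, which is why repeatability on the pgvector side matters as much as the upsert on the Typesense side.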

pgvector Storage

Chunks are stored in the document_chunk table with RLS tenant isolation. Key columns not present in the Typesense schema:

Column           Purpose
embedding        halfvec(1536) vector for cosine similarity search
original_text    The chunk text before context prefix was prepended
context_prefix   The LLM-generated context (stored separately for debugging)
chunk_index      Zero-based position within the source document
token_count      Approximate token count of the embedded text
correlation_id   Links to ai_usage records for cost analysis

For the full table definition and indexes, see Data & Storage: Schema Design.

Typesense Client

The Typesense client (packages/search/src/client.ts) is a module-level singleton initialized from environment variables:

Variable            Default                  Description
TYPESENSE_API_KEY   (required)               API key for authentication
TYPESENSE_URL       http://localhost:8108    Typesense server URL

The client is configured with a 5-second connection timeout and 100ms retry interval. disconnectTypesense() resets the singleton for graceful shutdown.
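A minimal sketch of how these settings could be turned into a client configuration is shown below. The buildTypesenseConfig name and the choice to pass the environment in as a parameter are assumptions; the nodes, apiKey, connectionTimeoutSeconds, and retryIntervalSeconds keys follow the typesense-js configuration format. The singleton wrapper itself is omitted.

```typescript
interface TypesenseNodeConfig {
  host: string;
  port: number;
  protocol: string;
}

interface TypesenseConfig {
  nodes: TypesenseNodeConfig[];
  apiKey: string;
  connectionTimeoutSeconds: number;
  retryIntervalSeconds: number;
}

// Hypothetical builder: derives the client configuration from environment values.
function buildTypesenseConfig(
  env: Record<string, string | undefined>,
): TypesenseConfig {
  const apiKey = env["TYPESENSE_API_KEY"];
  if (!apiKey) throw new Error("TYPESENSE_API_KEY is required");

  const url = new URL(env["TYPESENSE_URL"] ?? "http://localhost:8108");
  const protocol = url.protocol.replace(":", "");
  // URL.port is empty when the URL relies on the protocol's default port.
  const port = url.port ? Number(url.port) : protocol === "https" ? 443 : 80;

  return {
    nodes: [{ host: url.hostname, port, protocol }],
    apiKey,
    connectionTimeoutSeconds: 5, // 5-second connection timeout
    retryIntervalSeconds: 0.1, // 100ms retry interval
  };
}
```

Failing fast on a missing API key at configuration time surfaces a misconfigured deployment at startup instead of on the first search request.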

Health Check

checkTypesense() in packages/search/src/health.ts calls the Typesense /health endpoint and returns { ok: boolean, latencyMs: number }. This is used by the application's /api/health endpoint since the Typesense Docker image has no curl/wget for container-level healthchecks.
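The shape of such a check can be sketched as follows. The probe is injected so the example stays self-contained; in practice it would wrap the call to the /health endpoint. The checkHealth name, the injected clock, and the decision to report ok: false on a thrown error are assumptions, not a description of health.ts.

```typescript
interface HealthResult {
  ok: boolean;
  latencyMs: number;
}

// Runs the probe, timing it with the supplied clock (defaults to Date.now).
async function checkHealth(
  probe: () => Promise<boolean>,
  now: () => number = Date.now,
): Promise<HealthResult> {
  const start = now();
  try {
    const ok = await probe();
    return { ok, latencyMs: now() - start };
  } catch {
    // A failed request is reported as unhealthy instead of throwing,
    // so the caller can always render a result.
    return { ok: false, latencyMs: now() - start };
  }
}
```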

Key Files

File                                 Purpose
packages/search/src/collections.ts   Collection schema, ensureCollections, CRUD operations
packages/search/src/client.ts        Typesense client singleton
packages/search/src/health.ts        Connectivity health check
packages/db/src/schema/search.ts     document_chunk table definition (pgvector side)
