Keyword Search (BM25)

Typesense BM25 full-text search -- query execution, field weighting, filter construction, and result mapping.

Keyword search runs BM25 full-text matching against the document_chunks Typesense collection. It handles exact-match queries, partial matches, and highlighted snippet extraction.

How It Works

The keywordSearch() function in packages/search/src/keyword-search.ts builds a Typesense search request with tenant-scoped filtering and optional facet constraints.

Search Fields

query_by: "title,embedded_text";

Typesense scores matches in both fields using BM25. Because title is listed first in query_by, Typesense gives title matches priority by default, so results where the query matches the document name rank higher. The embedded_text field contains the full chunk text, including the contextual prefix when one was generated during indexing.
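When the default field ordering is not enough, Typesense also accepts a query_by_weights parameter that assigns an explicit weight to each field in query_by. The sketch below shows one way to keep the two lists aligned. The buildQueryParams helper, the KeywordQueryParams type, and the weights 2 and 1 are assumptions for illustration; nothing here implies that keywordSearch() sets explicit weights.

```typescript
// Illustrative shape only; the request built by keywordSearch() may carry more parameters.
interface KeywordQueryParams {
  q: string;
  query_by: string;
  query_by_weights?: string;
}

// Hypothetical helper: takes [field, weight] pairs so the two
// comma-separated lists can never drift out of order.
function buildQueryParams(
  q: string,
  fields: Array<[string, number]>,
): KeywordQueryParams {
  return {
    q,
    query_by: fields.map(([field]) => field).join(","),
    query_by_weights: fields.map(([, weight]) => String(weight)).join(","),
  };
}

// Assumed weights: title counts twice as much as embedded_text.
const exampleParams = buildQueryParams("multi-tenant", [
  ["title", 2],
  ["embedded_text", 1],
]);
```

Deriving both strings from a single list of pairs avoids the common mistake of reordering query_by without reordering the weights.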

Highlight Extraction

highlight_full_fields: "embedded_text";

Typesense returns highlighted snippets showing where the query matched. The result mapper prefers the first highlighted snippet; if none is available, it falls back to the first 200 characters of embedded_text.

textSnippet: snippet ?? doc.embedded_text.slice(0, 200);
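The fallback can be isolated into a small function for testing. This is a minimal sketch: pickSnippet and the HitHighlight type are hypothetical names, and the type models only the part of a Typesense highlight entry that the fallback reads.

```typescript
// Only the fields this sketch needs; real highlight entries carry more.
interface HitHighlight {
  field?: string;
  snippet?: string;
}

// Hypothetical helper mirroring the `snippet ?? text.slice(0, 200)` fallback.
function pickSnippet(
  highlights: HitHighlight[] | undefined,
  fullText: string,
  maxLength = 200,
): string {
  const snippet = highlights?.[0]?.snippet;
  // `??` falls back only on null/undefined, so an empty-string snippet is kept as-is.
  return snippet ?? fullText.slice(0, maxLength);
}
```

Because the nullish coalescing operator is used rather than `||`, an empty snippet string would be returned unchanged instead of triggering the fallback. Whether that matters depends on what Typesense returns for a given hit.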

Filter Construction

Every keyword search is scoped to the current tenant via the organization_id filter. Additional filters are appended when the caller provides them:

const filterParts = [`organization_id:=${opts.organizationId}`];
if (opts.sourceTable) filterParts.push(`source_table:=${opts.sourceTable}`);
if (opts.artifactType) filterParts.push(`artifact_type:=${opts.artifactType}`);
if (opts.mediaType) filterParts.push(`media_type:=${opts.mediaType}`);
if (opts.planId) filterParts.push(`plan_id:=${opts.planId}`);
if (opts.userId) filterParts.push(`user_id:=${opts.userId}`);
if (opts.createdAfter) filterParts.push(`created_at:>${String(opts.createdAfter)}`);
if (opts.createdBefore) filterParts.push(`created_at:<${String(opts.createdBefore)}`);

// Combined with &&
filter_by: filterParts.join(" && ");
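The same construction can be written as a self-contained, table-driven function, which is easier to unit test. This is a sketch under assumptions: the KeywordFilterOptions type and buildFilterBy name are hypothetical, and the option types are inferred from the snippet above. Unlike the truthiness checks above, it compares timestamps against undefined so that a value of 0 is not silently dropped.

```typescript
// Assumed option shape, inferred from the filter fields listed above.
interface KeywordFilterOptions {
  organizationId: string;
  sourceTable?: string;
  artifactType?: string;
  mediaType?: string;
  planId?: string;
  userId?: string;
  createdAfter?: number; // Unix timestamp
  createdBefore?: number; // Unix timestamp
}

function buildFilterBy(opts: KeywordFilterOptions): string {
  // The tenant clause is always first and always present.
  const parts: string[] = [`organization_id:=${opts.organizationId}`];

  // Exact-match filters: [Typesense field, caller-supplied value].
  const exact: Array<[string, string | undefined]> = [
    ["source_table", opts.sourceTable],
    ["artifact_type", opts.artifactType],
    ["media_type", opts.mediaType],
    ["plan_id", opts.planId],
    ["user_id", opts.userId],
  ];
  for (const [field, value] of exact) {
    if (value) parts.push(`${field}:=${value}`);
  }

  // Range filters on the int64 created_at field.
  if (opts.createdAfter !== undefined) parts.push(`created_at:>${opts.createdAfter}`);
  if (opts.createdBefore !== undefined) parts.push(`created_at:<${opts.createdBefore}`);

  return parts.join(" && ");
}
```

For example, passing organizationId "org_1", sourceTable "research_artifact", and createdAfter 1700000000 produces `organization_id:=org_1 && source_table:=research_artifact && created_at:>1700000000`.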

Most of these filters target faceted fields; organization_id and created_at are used for filtering only. The collection schema (packages/search/src/collections.ts) defines which fields are faceted:

Field	Faceted	Notes
`organization_id`	No	`index: false` -- used only for filtering, not for search or facet counts
`source_table`	Yes	Content type: `research_artifact`, `research_output`, `extraction_result`
`artifact_type`	Yes	Optional -- depends on source table
`media_type`	Yes	Optional -- `markdown`, `word`, etc.
`plan_id`	Yes	Links to a research plan
`user_id`	Yes	Who created the content
`created_at`	No	Unix timestamp for range filtering
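If a caller wants facet counts back, only the fields marked as faceted in the table can be passed to Typesense's facet_by parameter. The guard below is a hypothetical sketch (FACETED_FIELDS, isFacetedField, and buildFacetBy are not names from the codebase) that rejects a non-faceted field before the request is sent.

```typescript
// Fields marked "Yes" in the table above.
const FACETED_FIELDS = [
  "source_table",
  "artifact_type",
  "media_type",
  "plan_id",
  "user_id",
] as const;

type FacetedField = (typeof FACETED_FIELDS)[number];

function isFacetedField(field: string): field is FacetedField {
  return (FACETED_FIELDS as readonly string[]).indexOf(field) !== -1;
}

// Builds the comma-separated facet_by value, failing fast on
// fields that the schema does not declare as facets.
function buildFacetBy(requested: string[]): string {
  const invalid = requested.filter((field) => !isFacetedField(field));
  if (invalid.length > 0) {
    throw new Error(`Not faceted in document_chunks: ${invalid.join(", ")}`);
  }
  return requested.join(",");
}
```

Keeping the list next to the schema definition would make it harder for the two to drift apart; here it is duplicated only to keep the sketch self-contained.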

Result Mapping

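Typesense returns hits in relevance order, each carrying text_match_info.score (the raw BM25 score). The function maps these to KeywordSearchResult objects, converting the score to a number and deriving a 1-based rank from each hit's position in the array: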

return (response.hits ?? []).map((hit, idx) => {
  const doc = hit.document;
  const snippet = hit.highlights?.[0]?.snippet;
  return {
    id: doc.id,
    sourceTable: doc.source_table,
    sourceId: doc.source_id,
    title: doc.title,
    textSnippet: snippet ?? doc.embedded_text.slice(0, 200),
    score: Number(hit.text_match_info?.score ?? 0),
    rank: idx + 1,
  };
});

The rank field is the 1-based position in the Typesense result order, which is what Reciprocal Rank Fusion (RRF) uses for its 1/(k + rank) calculation.
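To make the role of rank concrete, here is a minimal sketch of RRF over several ranked lists. The function names, the RankedItem type, and the default k = 60 are assumptions for illustration; the pipeline's actual fusion code and its value of k are not shown in this section.

```typescript
interface RankedItem {
  id: string;
  rank: number; // 1-based, as produced by the result mapper
}

// Contribution of a single list to an item's fused score.
function rrfContribution(rank: number, k = 60): number {
  return 1 / (k + rank);
}

// Sums contributions per id across all lists, then sorts by fused score (descending).
function fuseByRrf(
  lists: RankedItem[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const totals: Record<string, number> = {};
  for (const list of lists) {
    for (const item of list) {
      totals[item.id] = (totals[item.id] ?? 0) + rrfContribution(item.rank, k);
    }
  }
  return Object.keys(totals)
    .map((id) => ({ id, score: totals[id] ?? 0 }))
    .sort((a, b) => b.score - a.score);
}

// "b" appears in both lists, so it outranks "a" even though "a" is first in one list.
const fused = fuseByRrf([
  [{ id: "a", rank: 1 }, { id: "b", rank: 2 }],
  [{ id: "b", rank: 1 }, { id: "c", rank: 2 }],
]);
```

RRF depends only on rank, not on the raw score, which is why the mapper records the position explicitly rather than leaving consumers to compare scores from different retrieval systems.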

Typesense Collection Schema

The document_chunks collection is defined in packages/search/src/collections.ts:

export const chunksSchema: CollectionCreateSchema = {
  name: "document_chunks",
  fields: [
    { name: "id", type: "string" },
    { name: "organization_id", type: "string", facet: false, index: false },
    { name: "source_table", type: "string", facet: true },
    { name: "source_id", type: "string", facet: false },
    { name: "title", type: "string" },
    { name: "embedded_text", type: "string" },
    { name: "artifact_type", type: "string", facet: true, optional: true },
    { name: "media_type", type: "string", facet: true, optional: true },
    { name: "plan_id", type: "string", facet: true, optional: true },
    { name: "user_id", type: "string", facet: true },
    { name: "created_at", type: "int64", facet: false },
  ],
  default_sorting_field: "created_at",
  token_separators: ["-", "_"],
};

The token_separators setting treats hyphens and underscores as word boundaries, so queries for "multi-tenant" match documents containing "multi_tenant" and vice versa.
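The effect can be approximated in a few lines. This is only an illustration of the idea, not Typesense's tokenizer: approximateTokens is a hypothetical function that lowercases the text, replaces each configured separator with a space, and splits on whitespace.

```typescript
// Rough approximation for illustration; Typesense's real tokenizer does more
// (for example, handling other punctuation and Unicode normalization).
function approximateTokens(
  text: string,
  separators: string[] = ["-", "_"],
): string[] {
  let normalized = text.toLowerCase();
  for (const separator of separators) {
    normalized = normalized.split(separator).join(" ");
  }
  return normalized.split(/\s+/).filter((token) => token.length > 0);
}

const hyphenated = approximateTokens("multi-tenant");
const underscored = approximateTokens("multi_tenant");
// Both yield the tokens "multi" and "tenant", so either form matches the other.
```

Under this model a query and a document match whenever they produce the same tokens, regardless of which separator each one used.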

Tenant Isolation

Typesense does not have row-level security like PostgreSQL. Tenant isolation is enforced at the application layer by always including organization_id:=${organizationId} in the filter_by clause. The organization_id field is deliberately set to index: false to prevent it from appearing in search suggestions or faceted counts -- it serves purely as a filter.
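Because isolation rests entirely on that one clause, it can help to build it in a single place and validate the identifier before interpolating it. The sketch below is hypothetical: tenantFilterClause, withTenantScope, and the allowed-character pattern are assumptions, and the pattern should be adjusted to the real organization ID format.

```typescript
// Assumed ID format: letters, digits, underscores, and hyphens only.
const ORG_ID_PATTERN = /^[A-Za-z0-9_-]+$/;

// Rejects empty IDs and IDs containing filter syntax such as spaces, "&&", or "||".
function tenantFilterClause(organizationId: string): string {
  if (!ORG_ID_PATTERN.test(organizationId)) {
    throw new Error("Invalid organization id");
  }
  return `organization_id:=${organizationId}`;
}

// Always places the tenant clause first; callers cannot omit it.
function withTenantScope(
  organizationId: string,
  extraFilters: string[] = [],
): string {
  return [tenantFilterClause(organizationId), ...extraFilters].join(" && ");
}
```

For example, `withTenantScope("org_123", ["plan_id:=p1"])` yields `organization_id:=org_123 && plan_id:=p1`, while an ID such as `org_1 || organization_id:!=org_1` is rejected before it can widen the filter.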

Query Pipeline Overview -- how keyword search fits into the full pipeline
Indexing -- Index Management -- how documents enter Typesense and the collection bootstrap process
Type-Ahead -- prefix search with typo tolerance, also backed by Typesense