Keyword Search (BM25)
Typesense BM25 full-text search -- query execution, field weighting, filter construction, and result mapping.
Keyword search runs BM25 full-text matching against the document_chunks Typesense collection. It handles exact-match queries, partial matches, and highlighted snippet extraction.
How It Works
The keywordSearch() function in packages/search/src/keyword-search.ts builds a Typesense search request with tenant-scoped filtering and optional facet constraints.
Search Fields
query_by: "title,embedded_text";
Typesense scores matches in both fields using BM25. Because title is listed first in query_by, a match in the document name ranks above a comparable match in the chunk body. The embedded_text field contains the full chunk text (with the contextual prefix, if one was generated during indexing).
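To make field weighting concrete, here is a minimal sketch of how the request parameters might be assembled. The helper name buildSearchParams, the weights argument, and the per_page default are assumptions for illustration and do not come from keyword-search.ts. The source shows only query_by and highlight_full_fields, so the use of Typesense's query_by_weights parameter here is optional and hypothetical.

```typescript
// Subset of Typesense search parameters used in this sketch.
interface KeywordSearchParams {
  q: string;
  query_by: string;
  query_by_weights?: string;
  highlight_full_fields: string;
  filter_by: string;
  per_page: number;
}

// Hypothetical helper (not from the source): assembles one keyword search request.
function buildSearchParams(
  query: string,
  filterBy: string,
  weights?: { title: number; embeddedText: number },
  perPage = 10, // assumed default page size
): KeywordSearchParams {
  const params: KeywordSearchParams = {
    q: query,
    // Field order matters: earlier fields are treated as more relevant.
    query_by: "title,embedded_text",
    highlight_full_fields: "embedded_text",
    filter_by: filterBy,
    per_page: perPage,
  };
  // Explicit weights are optional; one weight per query_by field, same order.
  if (weights) {
    params.query_by_weights = `${weights.title},${weights.embeddedText}`;
  }
  return params;
}
```

Leaving weights undefined keeps the default behaviour, where field order alone determines priority.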
Highlight Extraction
highlight_full_fields: "embedded_text";
Typesense returns highlighted snippets showing where the query matched. The result mapper prefers the highlighted snippet; if none is available, it falls back to the first 200 characters of embedded_text.
textSnippet: snippet ?? doc.embedded_text.slice(0, 200);
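The fallback can be sketched as a standalone function. This variant is an assumption and not the code in keyword-search.ts: it prefers the highlight whose field is embedded_text, then any other highlight, then the first 200 characters of the chunk. The names pickSnippet, ChunkHit, and Highlight are hypothetical.

```typescript
// Minimal shapes for this sketch; real Typesense hits carry more properties.
interface Highlight {
  field: string;
  snippet?: string;
}

interface ChunkHit {
  document: { embedded_text: string };
  highlights?: Highlight[];
}

// Hypothetical helper: choose the text shown for one hit.
function pickSnippet(hit: ChunkHit, maxLen = 200): string {
  // Prefer a highlight taken from the chunk body.
  const fromBody = hit.highlights?.find((h) => h.field === "embedded_text")?.snippet;
  // Otherwise accept a highlight from any field (for example, the title).
  const fromAny = hit.highlights?.find((h) => h.snippet)?.snippet;
  // Last resort: the leading characters of the chunk text.
  return fromBody ?? fromAny ?? hit.document.embedded_text.slice(0, maxLen);
}
```

Selecting by field name rather than by array position keeps a title-only highlight from replacing a body snippet when both are present.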
Filter Construction
Every keyword search is scoped to the current tenant via the organization_id filter. Additional filters are appended when the caller provides them:
const filterParts = [`organization_id:=${opts.organizationId}`];
if (opts.sourceTable) filterParts.push(`source_table:=${opts.sourceTable}`);
if (opts.artifactType) filterParts.push(`artifact_type:=${opts.artifactType}`);
if (opts.mediaType) filterParts.push(`media_type:=${opts.mediaType}`);
if (opts.planId) filterParts.push(`plan_id:=${opts.planId}`);
if (opts.userId) filterParts.push(`user_id:=${opts.userId}`);
if (opts.createdAfter) filterParts.push(`created_at:>${String(opts.createdAfter)}`);
if (opts.createdBefore) filterParts.push(`created_at:<${String(opts.createdBefore)}`);
// Combined with &&
filter_by: filterParts.join(" && ");
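The construction above interpolates caller-supplied values directly into the filter string. A defensive variant might validate each value first. The sketch below is self-contained and rests on assumptions: the KeywordFilterOptions shape is inferred from the option names used above, and the SAFE_VALUE pattern is a hypothetical policy that assumes identifiers contain only letters, digits, dots, hyphens, and underscores.

```typescript
// Option names mirror those used in the snippet above; the interface itself is assumed.
interface KeywordFilterOptions {
  organizationId: string;
  sourceTable?: string;
  artifactType?: string;
  mediaType?: string;
  planId?: string;
  userId?: string;
  createdAfter?: number;
  createdBefore?: number;
}

// Hypothetical policy: accept plain identifiers only.
const SAFE_VALUE = /^[A-Za-z0-9_.-]+$/;

function safeValue(name: string, value: string): string {
  if (!SAFE_VALUE.test(value)) {
    throw new Error(`Unsafe filter value for ${name}`);
  }
  return value;
}

function buildFilterBy(opts: KeywordFilterOptions): string {
  // The tenant clause is always first and always present.
  const parts = [`organization_id:=${safeValue("organization_id", opts.organizationId)}`];
  const optional: Array<[string, string | undefined]> = [
    ["source_table", opts.sourceTable],
    ["artifact_type", opts.artifactType],
    ["media_type", opts.mediaType],
    ["plan_id", opts.planId],
    ["user_id", opts.userId],
  ];
  for (const [field, value] of optional) {
    if (value) parts.push(`${field}:=${safeValue(field, value)}`);
  }
  // Unlike a truthiness check, this keeps a timestamp of 0.
  if (opts.createdAfter !== undefined) parts.push(`created_at:>${opts.createdAfter}`);
  if (opts.createdBefore !== undefined) parts.push(`created_at:<${opts.createdBefore}`);
  return parts.join(" && ");
}
```

Rejecting unexpected characters means a value such as "org_1 || user_id:=x" fails fast instead of widening the filter.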
Most of these filters target faceted fields; organization_id and created_at are used for filtering only. The collection schema (packages/search/src/collections.ts) defines which fields are faceted:
| Field | Faceted | Notes |
|---|---|---|
| organization_id | No | index: false -- used only for filtering, not for search or faceted counts |
| source_table | Yes | Content type: research_artifact, research_output, extraction_result |
| artifact_type | Yes | Optional -- depends on source table |
| media_type | Yes | Optional -- markdown, word, etc. |
| plan_id | Yes | Links to a research plan |
| user_id | Yes | Who created the content |
| created_at | No | Unix timestamp for range filtering |
Result Mapping
Typesense returns hits in ranked order, each carrying text_match_info.score (the raw BM25 score). The function maps them to KeywordSearchResult objects and derives a 1-based rank from each hit's position in the array:
return (response.hits ?? []).map((hit, idx) => {
  const doc = hit.document;
  const snippet = hit.highlights?.[0]?.snippet;
  return {
    id: doc.id,
    sourceTable: doc.source_table,
    sourceId: doc.source_id,
    title: doc.title,
    textSnippet: snippet ?? doc.embedded_text.slice(0, 200),
    score: Number(hit.text_match_info?.score ?? 0),
    rank: idx + 1,
  };
});
The rank field is the 1-based position in the Typesense result order, which is what RRF uses for its 1/(k + rank) calculation.
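To show how rank feeds into fusion, here is a minimal sketch of reciprocal rank fusion over several ranked lists. The function name, the RankedItem shape, and the default k = 60 are assumptions; the source states only the 1/(k + rank) term and does not show the fusion code or the value of k.

```typescript
// One entry from any ranked result list; rank is 1-based.
interface RankedItem {
  id: string;
  rank: number;
}

// Hypothetical sketch: sum 1 / (k + rank) for each id across all lists.
function reciprocalRankFusion(
  lists: RankedItem[][],
  k = 60, // assumed constant; not stated in the source
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    for (const item of list) {
      const previous = scores.get(item.id) ?? 0;
      scores.set(item.id, previous + 1 / (k + item.rank));
    }
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Because only rank enters the formula, raw scores from different retrievers never need to be normalised against each other.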
Typesense Collection Schema
The document_chunks collection is defined in packages/search/src/collections.ts:
export const chunksSchema: CollectionCreateSchema = {
  name: "document_chunks",
  fields: [
    { name: "id", type: "string" },
    { name: "organization_id", type: "string", facet: false, index: false },
    { name: "source_table", type: "string", facet: true },
    { name: "source_id", type: "string", facet: false },
    { name: "title", type: "string" },
    { name: "embedded_text", type: "string" },
    { name: "artifact_type", type: "string", facet: true, optional: true },
    { name: "media_type", type: "string", facet: true, optional: true },
    { name: "plan_id", type: "string", facet: true, optional: true },
    { name: "user_id", type: "string", facet: true },
    { name: "created_at", type: "int64", facet: false },
  ],
  default_sorting_field: "created_at",
  token_separators: ["-", "_"],
};
The token_separators setting treats hyphens and underscores as word boundaries, so queries for "multi-tenant" match documents containing "multi_tenant" and vice versa.
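The effect can be approximated with a small function. This is an illustration only and not Typesense's tokenizer: approximateTokens is a hypothetical name, and the lowercasing and whitespace splitting are simplifying assumptions.

```typescript
// Rough approximation of splitting text on whitespace plus configured separators.
function approximateTokens(text: string, separators: string[] = ["-", "_"]): string[] {
  let normalized = text.toLowerCase();
  // Turn every separator into a space so it acts as a word boundary.
  for (const sep of separators) {
    normalized = normalized.split(sep).join(" ");
  }
  return normalized.split(/\s+/).filter((token) => token.length > 0);
}
```

Under this approximation, "multi-tenant" and "multi_tenant" both yield the tokens multi and tenant, which is why a query written either way can match a document written the other way.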
Tenant Isolation
Typesense does not have row-level security like PostgreSQL. Tenant isolation is enforced at the application layer by always including organization_id:=${organizationId} in the filter_by clause. The organization_id field is deliberately set to index: false to prevent it from appearing in search suggestions or faceted counts -- it serves purely as a filter.
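One way to make the tenant clause hard to forget is to bind it once and hand callers a search function that cannot omit it. The sketch below is hypothetical: createTenantScopedSearch, the SearchFn type, and the rejection rules are assumptions and do not appear in the source. The actual search call is injected, so the example runs without a Typesense client.

```typescript
// The injected function performs the real search; R is whatever it returns
// (for a real client this would typically be a Promise).
type SearchFn<R> = (params: { q: string; filter_by: string }) => R;

// Hypothetical wrapper: every search issued through it carries the tenant filter.
function createTenantScopedSearch<R>(organizationId: string, search: SearchFn<R>) {
  if (!/^[A-Za-z0-9_-]+$/.test(organizationId)) {
    throw new Error("organizationId must be a non-empty plain identifier");
  }
  const tenantFilter = `organization_id:=${organizationId}`;

  return (q: string, extraClauses: string[] = []): R => {
    for (const clause of extraClauses) {
      // Callers may narrow results but never change or bypass the tenant scope.
      if (clause.includes("organization_id") || clause.includes("||")) {
        throw new Error(`Rejected filter clause: ${clause}`);
      }
    }
    const filter_by = [tenantFilter, ...extraClauses].join(" && ");
    return search({ q, filter_by });
  };
}
```

Disallowing "||" in extra clauses is a deliberately conservative choice: an OR at the top level could otherwise match rows outside the tenant.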
Related Pages
- Query Pipeline Overview -- how keyword search fits into the full pipeline
- Indexing -- Index Management -- how documents enter Typesense and the collection bootstrap process
- Type-Ahead -- prefix search with typo tolerance, also backed by Typesense