Seed Data

The tiered seeding system for reference data, development data, research fixtures, and search embeddings.

Trovella uses a tiered seeding system that controls which data is seeded in which environment. The seed orchestrator (packages/db/src/seed.ts) calls each tier in order and gates the development, research, and search tiers behind a NODE_ENV check.

Seed Tiers

Tier	File	Environments	Idempotent?	Purpose
Reference	`seeds/reference.ts`	All (dev, staging, prod)	Yes	Lookup data the app depends on
Development	`seeds/development.ts`	Dev, staging only	No (truncates first)	Test users, orgs, memberships
Research	`seeds/research.ts`	Dev, staging only	Depends	Research plans, steps, artifacts for testing
Search	`seeds/search.ts`	Dev, staging only	Yes	Document chunks with pre-computed embeddings

Orchestrator

The seed entry point is packages/db/src/seed.ts:

import { db } from "./client";
import { seedDevelopmentData } from "./seeds/development";
import { seedReferenceData } from "./seeds/reference";
import { seedResearchData } from "./seeds/research";
import { seedSearchData } from "./seeds/search";

async function main() {
  await seedReferenceData(db);

  if (process.env["NODE_ENV"] !== "production") {
    await seedDevelopmentData(db);
    await seedResearchData(db);
    await seedSearchData(db);
  }
}

In production (NODE_ENV=production), only seedReferenceData runs. The CI migrate-prod job enforces this by setting NODE_ENV=production explicitly.
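The gating rule can be restated as a small pure function. This is a sketch, not code from seed.ts: the SeedTier type and the tiersForEnv helper are hypothetical names introduced only to make the NODE_ENV check easy to reason about.

```typescript
// Hypothetical helper, not part of seed.ts: lists the tiers that the
// orchestrator's NODE_ENV check would run, in execution order.
type SeedTier = "reference" | "development" | "research" | "search";

function tiersForEnv(nodeEnv: string | undefined): SeedTier[] {
  // Reference data runs in every environment.
  const tiers: SeedTier[] = ["reference"];

  // Mirrors `process.env["NODE_ENV"] !== "production"` in main():
  // any value other than the exact string "production" passes the check.
  if (nodeEnv !== "production") {
    tiers.push("development", "research", "search");
  }
  return tiers;
}

console.log(tiersForEnv("production"));
console.log(tiersForEnv(undefined));
```

Because the check is a strict comparison against "production", an unset or misspelled NODE_ENV falls into the non-production branch and runs every tier, including the truncating development tier.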

Tier 1: Reference Data

Reference data is lookup/registry data that the application requires to function. It must be idempotent -- safe to run multiple times without duplication.

Currently includes:

AI model registry (seeds/ai-models.ts): 4 models (Claude Opus 4.6, Sonnet 4.6, Haiku 4.5, Gemini Embedding 2) with pricing records. Uses ON CONFLICT DO NOTHING for idempotence.

await db.insert(aiModel).values([...]).onConflictDoNothing();
await db.insert(aiModelPricing).values([...]).onConflictDoNothing();

Adding New Reference Data

1. Create a new seed function in packages/db/src/seeds/
2. Make it idempotent (use ON CONFLICT DO NOTHING or upsert patterns)
3. Call it from seedReferenceData() in seeds/reference.ts
4. Test: pnpm db:seed should be safe to run repeatedly without errors or duplicates
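To illustrate what "safe to run repeatedly" means for the steps above, here is a minimal in-memory sketch. FakeTable is a stand-in that only models the semantics of INSERT ... ON CONFLICT DO NOTHING; it is not the Drizzle API. The feature-flag table and seedFeatureFlags are hypothetical examples, not part of the codebase.

```typescript
// In-memory stand-in for a table with a primary key. This is NOT the Drizzle
// API; it only models INSERT ... ON CONFLICT DO NOTHING semantics.
type FeatureFlagRow = { id: string; label: string }; // hypothetical table

class FakeTable<T extends { id: string }> {
  private rows = new Map<string, T>();

  // Inserts only rows whose id is not present yet; returns the insert count.
  insertOnConflictDoNothing(values: T[]): number {
    let inserted = 0;
    for (const row of values) {
      if (!this.rows.has(row.id)) {
        this.rows.set(row.id, row);
        inserted++;
      }
    }
    return inserted;
  }

  get size(): number {
    return this.rows.size;
  }
}

// Hypothetical reference seed: fixed IDs plus a conflict-ignoring insert.
function seedFeatureFlags(table: FakeTable<FeatureFlagRow>): number {
  return table.insertOnConflictDoNothing([
    { id: "flag_search", label: "Search" },
    { id: "flag_research", label: "Research" },
  ]);
}

const flags = new FakeTable<FeatureFlagRow>();
const firstRun = seedFeatureFlags(flags);
const secondRun = seedFeatureFlags(flags);
console.log({ firstRun, secondRun, total: flags.size });
```

The second run inserts nothing and the row count stays the same. Note that DO NOTHING also leaves existing rows untouched, so a value changed in the seed file is not propagated to rows that already exist; an upsert is the pattern to reach for when that matters.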

Tier 2: Development Data

Development data creates a predictable local environment for manual testing and RLS integration tests. It uses fixed, deterministic IDs so tests can reference them directly.

The seed truncates all auth-related tables first, then inserts fresh data using these fixed IDs:

// Fixed IDs for deterministic dev data
export const USER_ALICE_ID = "dev_user_alice_001";
export const USER_BOB_ID = "dev_user_bob_002";
export const ORG_ACME_ID = "dev_org_acme_001";
export const ORG_ALICE_PERSONAL_ID = "dev_org_alice_personal_002";

Creates:

2 users: Alice (alice@dev.trovella.com) and Bob (bob@dev.trovella.com)
2 organizations: Acme Corp (company type) and Alice's Space (personal type)
3 memberships: Alice owns both orgs, Bob is a member of Acme Corp

These IDs are imported by other seed tiers and by RLS integration tests.

Warning: Truncation

Development seeding runs TRUNCATE CASCADE on auth tables before inserting. This destroys any manually created data in local development. Use pnpm db:seed only when you want a clean slate.

Tier 3: Research Data

Research data creates sample research plans, steps, and artifacts for testing the research engine. It depends on the development data IDs (Alice, Acme Corp).

Creates:

3 research plans: completed deep dive, in-progress execution, and a failed plan
12 plan steps across the 3 plans, in various statuses
4 skill executions: deep research, scan, research routing, and a failed execution
5 research artifacts: analysis, source list, synthesis, finding, and comparison
Audit logs and MCP tool call logs for the completed plan

The research seed selects its organization adaptively: if a real (non-dev) organization exists in the database, it uses that org and its first member as the owner; otherwise it falls back to Acme Corp and Alice. This keeps seed data visible through RLS in the admin UI during development:

const realOrgs = await db
  .select({ id: organization.id })
  .from(organization)
  .where(sql`id NOT LIKE 'dev_%'`)
  .limit(1);

let orgId = ORG_ACME_ID;
let userId = USER_ALICE_ID;

if (realOrgs[0]) {
  orgId = realOrgs[0].id;
  // resolve first member of that org...
}
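The selection logic can be isolated as a pure function, which makes the fallback behavior easy to test. This is a sketch under assumptions: pickSeedOwner and the firstMemberOf callback are hypothetical names, and the callback merely stands in for the member lookup that the snippet above elides. Falling back to the dev defaults when an org has no members is also an assumption, not confirmed behavior.

```typescript
// Defaults mirror the fixed dev IDs from the development tier.
const ORG_ACME_ID = "dev_org_acme_001";
const USER_ALICE_ID = "dev_user_alice_001";

type SeedOwner = { orgId: string; userId: string };

// Hypothetical pure helper. `realOrgIds` stands for the result of the
// "id NOT LIKE 'dev_%'" query; `firstMemberOf` stands for the elided member
// lookup and returns undefined when the org has no members.
function pickSeedOwner(
  realOrgIds: string[],
  firstMemberOf: (orgId: string) => string | undefined,
): SeedOwner {
  const fallback: SeedOwner = { orgId: ORG_ACME_ID, userId: USER_ALICE_ID };

  const orgId = realOrgIds[0];
  if (orgId === undefined) return fallback;

  const userId = firstMemberOf(orgId);
  // Assumption: an org without members falls back to the dev defaults so the
  // seeded rows never pair a real org with a user outside it.
  if (userId === undefined) return fallback;

  return { orgId, userId };
}

console.log(pickSeedOwner([], () => undefined));
console.log(pickSeedOwner(["org_real"], () => "user_real"));
```

Returning the org and user as a pair, rather than reassigning two separate variables, avoids ending up with a real org ID combined with the dev user ID.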

Tier 4: Search Data

Search data loads pre-computed vector embeddings from a JSON fixture file (seeds/fixtures/seed-embeddings.json) into the document_chunk table. This avoids calling the embedding API during seeding.

const fixturePath = resolve(dir, "./fixtures/seed-embeddings.json");
// ...
await db.insert(documentChunk).values(rows).onConflictDoNothing();

If the fixture file doesn't exist, the seed skips gracefully with a message:

Search: skipping — no seed-embeddings.json fixture found.
Run: pnpm tsx scripts/generate-seed-embeddings.ts

The insert uses ON CONFLICT DO NOTHING, so re-running the search seed does not duplicate chunks.
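The skip-if-missing behavior can be sketched as follows. The function name, the row shape, and the assumption that the fixture is a JSON array are illustrative only; the real seed-embeddings.json and seeds/search.ts may be structured differently.

```typescript
import { existsSync, readFileSync } from "node:fs";

// Assumed fixture row shape; the real seed-embeddings.json may differ.
type EmbeddingFixtureRow = {
  id: string;
  content: string;
  embedding: number[];
};

// Returns the parsed rows, or null when the fixture file is absent so the
// caller can log a message and skip the tier instead of failing.
function loadEmbeddingFixture(fixturePath: string): EmbeddingFixtureRow[] | null {
  if (!existsSync(fixturePath)) {
    return null;
  }

  const parsed: unknown = JSON.parse(readFileSync(fixturePath, "utf8"));
  // Assumption: the fixture is a top-level JSON array of rows.
  if (!Array.isArray(parsed)) {
    throw new Error(`Expected an array of rows in ${fixturePath}`);
  }
  return parsed as EmbeddingFixtureRow[];
}

const rows = loadEmbeddingFixture("./fixtures/seed-embeddings.json");
console.log(rows === null ? "fixture missing, skipping" : `${rows.length} rows loaded`);
```

A missing file is treated as a normal condition and reported with null, while a file that exists but has the wrong shape raises an error, since that more likely indicates a broken fixture than an intentional skip.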

Commands

Command	What It Does
`pnpm db:seed`	Run all seed tiers (reference + dev/research/search if not production)
`pnpm db:reset`	Run `db:migrate` then `db:seed` -- full reset from scratch
`pnpm db:seed-research`	Run only the research seed (separate entry point)

Seed Data in CI

CI Quality Job (Ephemeral Database)

The quality job runs pnpm db:migrate against an ephemeral PostgreSQL container but does not run seeds. RLS tests use their own test data setup.

CI migrate-prod Job (Production)

After migrations are applied to production:

NODE_ENV=production pnpm db:seed

Only the reference tier runs. Development users and test data are never seeded into production.

File Layout

packages/db/src/
  seed.ts                          -- Orchestrator (entry point)
  seed-research.ts                 -- Standalone research seed entry point
  seeds/
    reference.ts                   -- Tier 1: calls ai-models.ts
    ai-models.ts                   -- AI model registry + pricing
    development.ts                 -- Tier 2: test users, orgs, memberships
    research.ts                    -- Tier 3: research plans, artifacts
    search.ts                      -- Tier 4: document chunks with embeddings
    fixtures/
      seed-embeddings.json         -- Pre-computed embedding vectors

ID Conventions

Context	Prefix	Example
Dev users	`dev_user_`	`dev_user_alice_001`
Dev orgs	`dev_org_`	`dev_org_acme_001`
Dev memberships	`dev_member_`	`dev_member_001`
Seed plans	`seed_plan_`	`seed_plan_competitor_analysis_001`
Seed steps	`seed_step_`	`seed_step_1a_search`
Seed artifacts	`seed_artifact_`	`seed_artifact_analysis_001`
Test data (in tests)	`test_`	`test_user_001`

The prefixes make it easy to identify and clean up seed data:

DELETE FROM research_plan WHERE id LIKE 'seed_%';
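One caveat when matching on these prefixes: in SQL LIKE, the underscore is a wildcard for any single character, so 'seed_%' also matches an ID such as 'seedX1'. The sketch below shows a literal prefix check in application code and a helper that escapes LIKE wildcards. Both function names are hypothetical, and the escaping assumes backslash as the LIKE escape character, which is the PostgreSQL default.

```typescript
// Prefixes taken from the ID conventions table.
const SEED_PREFIXES = ["dev_", "seed_", "test_"] as const;
type SeedPrefix = (typeof SEED_PREFIXES)[number];

// Returns the matching convention prefix, or null for real (non-seed) IDs.
function seedPrefixOf(id: string): SeedPrefix | null {
  return SEED_PREFIXES.find((prefix) => id.startsWith(prefix)) ?? null;
}

// Builds a LIKE pattern that matches `prefix` literally by escaping the
// wildcard characters "%" and "_" as well as the escape character itself.
// Assumes backslash is the escape character (the PostgreSQL default).
function likePrefixPattern(prefix: string): string {
  return prefix.replace(/[\\%_]/g, (ch) => `\\${ch}`) + "%";
}

console.log(seedPrefixOf("seed_plan_competitor_analysis_001"));
console.log(seedPrefixOf("org_8f3a"));
console.log(likePrefixPattern("seed_"));
```

With the conventions in this document the unescaped pattern is unlikely to hit a real row, but the escaped form removes the ambiguity when the pattern is used for deletes.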

Schema Design — Reference Data -- conventions for reference/lookup tables
Development Workflow -- when to run seeds during development
CI Deployment -- how seeds run in production