Stall Detection

How the plan engine identifies steps stuck in in_progress beyond a configurable threshold, and how stall information surfaces to users and administrators.

Stall detection identifies research steps that have been in_progress longer than expected. Typical causes are a Claude Code session that disconnects, a user who closes their laptop, or an issue on the AI platform that it cannot recover from. The detection logic is a pure function in the plan engine; the get_plan_status tool surfaces the result to calling clients.

How It Works

The detectStalledSteps function checks each step's elapsed time against a threshold:

function detectStalledSteps(
  steps: StepTimestamp[],
  now: Date = new Date(),
  thresholdMs: number = DEFAULT_STALL_THRESHOLD_MS,
): string[];

For each step in the input:

Skip any step whose status is not in_progress
Use startedAt as the reference time; fall back to updatedAt if startedAt is null
If now - referenceTime > thresholdMs, the step is stalled
Once every step has been checked, return the IDs of all stalled steps
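
The steps above can be sketched as follows. This is a minimal reconstruction from the description on this page, not the plan engine's actual source; the filter-and-map structure is an assumption, and the 30-minute constant is the documented default.

```typescript
// Sketch only: reconstructed from the documented steps, not copied from the engine.
interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}

// 30 minutes, the documented default threshold.
const DEFAULT_STALL_THRESHOLD_MS = 30 * 60 * 1000;

function detectStalledSteps(
  steps: StepTimestamp[],
  now: Date = new Date(),
  thresholdMs: number = DEFAULT_STALL_THRESHOLD_MS,
): string[] {
  return steps
    // Only in_progress steps can be stalled.
    .filter((step) => step.status === "in_progress")
    // Prefer startedAt; fall back to updatedAt when startedAt is null.
    .filter((step) => {
      const reference = step.startedAt ?? step.updatedAt;
      return now.getTime() - reference.getTime() > thresholdMs;
    })
    .map((step) => step.id);
}
```

Passing now explicitly, as the signature allows, keeps the function deterministic and makes it straightforward to test against fixed timestamps.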

The isPlanStalled convenience function returns a boolean:

function isPlanStalled(steps: StepTimestamp[], now?: Date, thresholdMs?: number): boolean;
// Returns true if detectStalledSteps returns any IDs

Default Threshold

The default stall threshold is 30 minutes (30 * 60 * 1000 milliseconds). This was chosen to accommodate long-running research steps (web searches, multi-source synthesis) while still detecting genuine stalls within a reasonable window.

The threshold is configurable per call -- both detectStalledSteps and isPlanStalled accept an optional thresholdMs parameter. No per-plan or per-step threshold configuration exists in the database currently.
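
Because the documented comparison is a strict greater-than, a step that has been running for exactly the threshold is not yet reported as stalled. The snippet below illustrates that boundary and a per-call override. The helper exceedsThreshold is hypothetical and exists only for this illustration; it is not part of the plan engine API.

```typescript
const DEFAULT_STALL_THRESHOLD_MS = 30 * 60 * 1000; // 1,800,000 ms

// Hypothetical helper mirroring the documented check:
// now - referenceTime > thresholdMs
function exceedsThreshold(
  referenceTime: Date,
  now: Date,
  thresholdMs: number = DEFAULT_STALL_THRESHOLD_MS,
): boolean {
  return now.getTime() - referenceTime.getTime() > thresholdMs;
}

const checkTime = new Date("2024-01-01T12:00:00Z");
const startedThirtyMinutesAgo = new Date(checkTime.getTime() - 30 * 60 * 1000);

// Exactly at the default threshold: not stalled (strict comparison).
const atDefault = exceedsThreshold(startedThirtyMinutesAgo, checkTime);

// Same step with a stricter 10-minute threshold passed per call: stalled.
const atTenMinutes = exceedsThreshold(
  startedThirtyMinutesAgo,
  checkTime,
  10 * 60 * 1000,
);
```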

StepTimestamp Interface

The detection functions accept a minimal interface, not full step rows. This avoids loading unnecessary data:

interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}

The startedAt field is nullable because a step may never have been started, or its start time may not have been recorded. In that case the detection functions fall back to updatedAt as the reference time. The tests verify this fallback:

it("falls back to updatedAt when startedAt is null", () => {
  const step: StepTimestamp = {
    id: "s1",
    status: "in_progress",
    startedAt: null,
    updatedAt: new Date(Date.now() - 45 * 60 * 1000),
  };
  expect(detectStalledSteps([step])).toEqual(["s1"]);
});

How Stall Information Surfaces

Through get_plan_status

The get_plan_status tool calls isPlanStalled and includes a stalled: boolean field in its response. Before the call, it maps the step rows down to the fields of the StepTimestamp interface:

const stalled = isPlanStalled(
  steps.map((s) => ({
    id: s.id,
    status: s.status,
    startedAt: s.startedAt,
    updatedAt: s.updatedAt,
  })),
);

The AI platform can check this field to decide whether to resume the stalled step, fail it, or alert the user.
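
As an illustration of how a client might act on the field, here is a minimal sketch. Everything except the stalled boolean is an assumption: the PlanStatusResponse subset, the StallAction names, and the retry policy in chooseStallAction are hypothetical and are not defined by the plan engine.

```typescript
// Hypothetical subset of the get_plan_status response.
// Only the `stalled` field is documented on this page.
interface PlanStatusResponse {
  stalled: boolean;
}

// Hypothetical action names for the three documented options, plus "continue".
type StallAction = "continue" | "resume" | "fail" | "alert_user";

// Hypothetical client-side policy: retry a few times, then escalate.
function chooseStallAction(
  status: PlanStatusResponse,
  resumeAttempts: number,
  maxResumeAttempts: number = 2,
  userReachable: boolean = true,
): StallAction {
  if (!status.stalled) {
    return "continue";
  }
  if (resumeAttempts < maxResumeAttempts) {
    return "resume";
  }
  // Out of retries: involve the user if possible, otherwise fail the step.
  return userReachable ? "alert_user" : "fail";
}
```

The right policy depends on the client; the point is only that the decision is made by the caller, based on a single boolean in the status response.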

Through the admin dashboard

The research plans view in the admin dashboard displays stall warnings for plans with steps that exceed the threshold. Administrators can see which specific steps are stalled and how long they have been in that state.
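
detectStalledSteps returns only step IDs, so a view that also shows how long each step has been stalled needs the elapsed time as well. The sketch below shows one way to compute both in a single pass. The function describeStalledSteps and the StalledStepInfo shape are hypothetical; how the dashboard actually derives these values is not described here.

```typescript
interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}

// Hypothetical shape for a dashboard row.
interface StalledStepInfo {
  id: string;
  inProgressForMs: number;
}

// Hypothetical helper: same rules as the documented detection,
// but it keeps the elapsed time instead of discarding it.
function describeStalledSteps(
  steps: StepTimestamp[],
  now: Date,
  thresholdMs: number,
): StalledStepInfo[] {
  const result: StalledStepInfo[] = [];
  for (const step of steps) {
    if (step.status !== "in_progress") continue;
    const reference = step.startedAt ?? step.updatedAt;
    const inProgressForMs = now.getTime() - reference.getTime();
    if (inProgressForMs > thresholdMs) {
      result.push({ id: step.id, inProgressForMs });
    }
  }
  return result;
}
```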

Stall vs. Plan State

Stall detection is informational -- it does not automatically transition the plan to the stalled state. The plan's stalled status is set explicitly by tool handlers or monitoring code, not by the pure detection functions. This separation keeps the pure functions side-effect-free and gives the calling code control over when and how to respond to stalls.

The plan state machine allows the stalled --> executing transition, which supports the resume flow: a plan marked as stalled can be resumed when get_next_step is called again (typically from a new session).
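
The separation between detection and state change can be sketched as below. Only the two states named on this page are modeled, and only stalled --> executing is documented; the executing --> stalled entry and the function names canTransition, markIfStalled, and resume are assumptions made for illustration.

```typescript
// Only the states mentioned on this page; the real state machine may define more.
type PlanState = "executing" | "stalled";

const ALLOWED_TRANSITIONS: Record<PlanState, PlanState[]> = {
  executing: ["stalled"], // assumption: not stated explicitly on this page
  stalled: ["executing"], // documented: supports the resume flow
};

function canTransition(from: PlanState, to: PlanState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Hypothetical caller-side step: detection only reports a boolean,
// and the caller decides whether to change the plan state.
function markIfStalled(current: PlanState, planIsStalled: boolean): PlanState {
  if (planIsStalled && canTransition(current, "stalled")) {
    return "stalled";
  }
  return current;
}

// Hypothetical resume step, as might happen when get_next_step is called again.
function resume(current: PlanState): PlanState {
  return canTransition(current, "executing") ? "executing" : current;
}
```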

Related

State Machines -- the stalled plan state and its transitions
Progress Tracking -- how stall warnings appear in the status response
Execution Loop -- the cross-session resume flow that recovers from stalls