Stall Detection
How the plan engine identifies steps stuck in in_progress beyond a configurable threshold, and how stall information surfaces to users and administrators.
Stall detection identifies research steps that have been in_progress longer than expected. This happens when a Claude Code session disconnects, the user closes their laptop, or the AI platform encounters an issue it cannot recover from. The detection logic is a pure function in the plan engine; the get_plan_status tool surfaces stall information to calling clients.
How It Works
The detectStalledSteps function checks each step's elapsed time against a threshold:
function detectStalledSteps(
  steps: StepTimestamp[],
  now: Date = new Date(),
  thresholdMs: number = DEFAULT_STALL_THRESHOLD_MS,
): string[];
For each step in the input:
- Skip any step whose status is not in_progress
- Use startedAt as the reference time; fall back to updatedAt if startedAt is null
- If now - referenceTime > thresholdMs, the step is stalled
- Return the IDs of all stalled steps
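The steps above can be sketched roughly as follows. This is an illustration inferred from the listed rules, not the plan engine's actual source: the function body, the 30-minute value assigned to DEFAULT_STALL_THRESHOLD_MS, the "completed" status, and the sample steps are all assumptions made for the example.

```typescript
interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}

// Assumed value: 30 minutes, matching the documented default.
const DEFAULT_STALL_THRESHOLD_MS = 30 * 60 * 1000;

function detectStalledSteps(
  steps: StepTimestamp[],
  now: Date = new Date(),
  thresholdMs: number = DEFAULT_STALL_THRESHOLD_MS,
): string[] {
  return steps
    // Only in_progress steps can be stalled.
    .filter((step) => step.status === "in_progress")
    // Prefer startedAt; fall back to updatedAt when it is null.
    .filter((step) => {
      const reference = step.startedAt ?? step.updatedAt;
      return now.getTime() - reference.getTime() > thresholdMs;
    })
    .map((step) => step.id);
}

// Fixed clock so the example is deterministic.
const now = new Date("2024-01-01T12:00:00Z");
const minutesAgo = (m: number): Date => new Date(now.getTime() - m * 60 * 1000);

const stalledIds = detectStalledSteps(
  [
    { id: "a", status: "in_progress", startedAt: minutesAgo(45), updatedAt: minutesAgo(1) },
    { id: "b", status: "in_progress", startedAt: minutesAgo(10), updatedAt: minutesAgo(10) },
    { id: "c", status: "completed", startedAt: minutesAgo(90), updatedAt: minutesAgo(60) },
    { id: "d", status: "in_progress", startedAt: null, updatedAt: minutesAgo(31) },
  ],
  now,
);
console.log(stalledIds); // ["a", "d"]
```

Step "a" is reported even though it was updated a minute ago, because startedAt takes precedence whenever it is set. Passing now explicitly, as done here, also keeps tests independent of the wall clock.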
The isPlanStalled convenience function returns a boolean:
function isPlanStalled(steps: StepTimestamp[], now?: Date, thresholdMs?: number): boolean;
// Returns true if detectStalledSteps returns at least one ID
Default Threshold
The default stall threshold is 30 minutes (30 * 60 * 1000 milliseconds). This was chosen to accommodate long-running research steps (web searches, multi-source synthesis) while still detecting genuine stalls within a reasonable window.
The threshold is configurable per call -- both detectStalledSteps and isPlanStalled accept an optional thresholdMs parameter. No per-plan or per-step threshold configuration exists in the database currently.
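As an illustration of the per-call threshold, the sketch below checks the same step against the default and against a tighter value. The function bodies are compact stand-ins written from the rules on this page so the snippet runs on its own; the five-minute threshold and the step data are made up for the example.

```typescript
interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}

const DEFAULT_STALL_THRESHOLD_MS = 30 * 60 * 1000; // 1,800,000 ms

// Stand-ins for the plan engine functions (assumed bodies, documented signatures).
function detectStalledSteps(
  steps: StepTimestamp[],
  now: Date = new Date(),
  thresholdMs: number = DEFAULT_STALL_THRESHOLD_MS,
): string[] {
  return steps
    .filter((s) => s.status === "in_progress")
    .filter((s) => now.getTime() - (s.startedAt ?? s.updatedAt).getTime() > thresholdMs)
    .map((s) => s.id);
}

function isPlanStalled(steps: StepTimestamp[], now?: Date, thresholdMs?: number): boolean {
  // Passing undefined lets the default parameter values apply.
  return detectStalledSteps(steps, now, thresholdMs).length > 0;
}

const now = new Date("2024-01-01T12:00:00Z");
const startedAt = new Date(now.getTime() - 10 * 60 * 1000); // started 10 minutes ago
const steps: StepTimestamp[] = [
  { id: "s1", status: "in_progress", startedAt, updatedAt: startedAt },
];

const stalledByDefault = isPlanStalled(steps, now); // false: 10 min < 30 min
const stalledAtFiveMinutes = isPlanStalled(steps, now, 5 * 60 * 1000); // true: 10 min > 5 min
console.log({ stalledByDefault, stalledAtFiveMinutes });
```

Because the threshold is only a call argument, a caller that wants different limits for different plans has to choose the value itself before calling.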
StepTimestamp Interface
The detection functions accept a minimal interface, not full step rows. This avoids loading unnecessary data:
interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}
The startedAt field is nullable because a step's start time may not have been recorded. When startedAt is null, detection uses updatedAt as the reference time instead. The tests verify this fallback:
it("falls back to updatedAt when startedAt is null", () => {
  const step: StepTimestamp = {
    id: "s1",
    status: "in_progress",
    startedAt: null,
    updatedAt: new Date(Date.now() - 45 * 60 * 1000),
  };
  expect(detectStalledSteps([step])).toEqual(["s1"]);
});
How Stall Information Surfaces
Through get_plan_status
The get_plan_status tool calls isPlanStalled and includes a stalled: boolean field in its response. It maps step rows to the StepTimestamp interface:
const stalled = isPlanStalled(
  steps.map((s) => ({
    id: s.id,
    status: s.status,
    startedAt: s.startedAt,
    updatedAt: s.updatedAt,
  })),
);
The AI platform can check this field to decide whether to resume the stalled step, fail it, or alert the user.
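One way a client might act on the field is sketched below. Only the stalled boolean comes from this page; the PlanStatus shape, the StallAction names, the chooseStallAction helper, and the retry policy are hypothetical and exist only to illustrate the three options.

```typescript
// Hypothetical slice of the get_plan_status response; only `stalled` is documented.
interface PlanStatus {
  stalled: boolean;
}

type StallAction = "none" | "resume" | "fail" | "alert_user";

// Hypothetical client-side context; none of these fields come from the plan engine.
interface StallContext {
  resumeAttempts: number;
  maxResumeAttempts: number;
  userPresent: boolean;
}

// Illustrative policy: ask the user if one is present, otherwise retry up to a limit.
function chooseStallAction(status: PlanStatus, ctx: StallContext): StallAction {
  if (!status.stalled) return "none";
  if (ctx.userPresent) return "alert_user";
  return ctx.resumeAttempts < ctx.maxResumeAttempts ? "resume" : "fail";
}

const unattended = { maxResumeAttempts: 2, userPresent: false };

const firstTry = chooseStallAction({ stalled: true }, { ...unattended, resumeAttempts: 0 });
const exhausted = chooseStallAction({ stalled: true }, { ...unattended, resumeAttempts: 2 });
const healthy = chooseStallAction({ stalled: false }, { ...unattended, resumeAttempts: 0 });
console.log(firstTry, exhausted, healthy); // resume fail none
```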
Through the admin dashboard
The research plans view in the admin dashboard displays stall warnings for plans with steps that exceed the threshold. Administrators can see which specific steps are stalled and how long they have been in that state.
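The per-step durations shown in such a view could be derived from the same timestamps the detection functions use. The sketch below is an assumption about how that might be done, not the dashboard's actual code: StallWarning and describeStalledSteps are hypothetical names, and measuring elapsed time from startedAt (or updatedAt when it is null) mirrors the detection rule rather than a documented dashboard behavior.

```typescript
interface StepTimestamp {
  id: string;
  status: string;
  startedAt: Date | null;
  updatedAt: Date;
}

// Hypothetical shape for one dashboard warning row.
interface StallWarning {
  id: string;
  elapsedMinutes: number;
}

// Hypothetical helper: like stall detection, but keeps the elapsed time for display.
function describeStalledSteps(
  steps: StepTimestamp[],
  now: Date,
  thresholdMs: number,
): StallWarning[] {
  return steps
    .filter((s) => s.status === "in_progress")
    .map((s) => ({
      id: s.id,
      elapsedMs: now.getTime() - (s.startedAt ?? s.updatedAt).getTime(),
    }))
    .filter((s) => s.elapsedMs > thresholdMs)
    .map((s) => ({ id: s.id, elapsedMinutes: Math.floor(s.elapsedMs / 60_000) }));
}

const now = new Date("2024-01-01T12:00:00Z");
const warnings = describeStalledSteps(
  [
    {
      id: "s1",
      status: "in_progress",
      startedAt: new Date("2024-01-01T11:13:00Z"), // 47 minutes before `now`
      updatedAt: new Date("2024-01-01T11:13:00Z"),
    },
    {
      id: "s2",
      status: "in_progress",
      startedAt: new Date("2024-01-01T11:55:00Z"), // 5 minutes before `now`
      updatedAt: new Date("2024-01-01T11:55:00Z"),
    },
  ],
  now,
  30 * 60 * 1000,
);
console.log(warnings); // [{ id: "s1", elapsedMinutes: 47 }]
```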
Stall vs. Plan State
Stall detection is informational -- it does not automatically transition the plan to the stalled state. The plan's stalled status is set explicitly by tool handlers or monitoring code, not by the pure detection functions. This separation keeps the pure functions side-effect-free and gives the calling code control over when and how to respond to stalls.
The plan state machine does allow stalled --> executing transitions, which supports the resume flow: a plan marked as stalled can be resumed when get_next_step is called again (typically from a new session).
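The split between pure detection and explicit state changes can be pictured with the sketch below. It is a simplified model, not the plan engine's state machine: only the stalled --> executing transition is documented on this page, while the executing --> stalled entry, the two-state PlanState type, and the helper names are assumptions for the example.

```typescript
// Simplified model with only the two states discussed here.
type PlanState = "executing" | "stalled";

// stalled -> executing is documented; executing -> stalled is assumed.
const ALLOWED_TRANSITIONS: Record<PlanState, PlanState[]> = {
  executing: ["stalled"],
  stalled: ["executing"],
};

function transition(current: PlanState, next: PlanState): PlanState {
  if (!ALLOWED_TRANSITIONS[current].includes(next)) {
    throw new Error(`Transition ${current} -> ${next} is not allowed`);
  }
  return next;
}

// Caller-side decision: the detection result is an input, and the state change
// happens here rather than inside the detection functions.
function applyStallCheck(current: PlanState, stallDetected: boolean): PlanState {
  return current === "executing" && stallDetected
    ? transition(current, "stalled")
    : current;
}

// Resume flow: a stalled plan returns to executing when work picks up again.
function resumePlan(current: PlanState): PlanState {
  return current === "stalled" ? transition(current, "executing") : current;
}

const afterCheck = applyStallCheck("executing", true);
const afterResume = resumePlan(afterCheck);
const untouched = applyStallCheck("executing", false);
console.log(afterCheck, afterResume, untouched); // stalled executing executing
```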
Related Pages
- State Machines -- the stalled plan state and its transitions
- Progress Tracking -- how stall warnings appear in the status response
- Execution Loop -- the cross-session resume flow that recovers from stalls