Gateway crash (kill -9) during AI generation silently drops the user's message from conversation history #29125

@nohat

Problem

When the gateway process is killed (kill -9, OOM, power loss) while the AI is generating a reply, the in-flight user message is silently dropped from the conversation. After restart, as soon as the user sends their next message, the orphaned message is actively removed from the session tree. The user's original question is never answered and disappears from the AI's context.

Steps to reproduce

  1. Send a message via Telegram that triggers a long AI generation (e.g. "write a 1000-word essay about the history of databases")
  2. kill -9 the gateway process while the AI is generating
  3. Restart the gateway
  4. Send a new message (e.g. "hello")

What happens internally

The crash leaves the session transcript with a trailing "orphan" user message — a user turn with no corresponding assistant reply. When the next message arrives, the embedded agent detects this orphan via sessionManager.getLeafEntry() (in pi-embedded-runner/run/attempt.ts). Since consecutive user messages violate LLM role ordering, the agent calls sessionManager.branch(leafEntry.parentId) to rewind the session tree to the state before the orphan. The new message is then appended as a fresh leaf.

The result: the orphaned user message is structurally severed from the session's parentId chain. The AI processes the new message without any knowledge of the orphan. The user's original question is never answered, and no error or notification is sent to the user or operator.
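The rewind described above can be modeled with a small in-memory tree. This is only a sketch for illustrating the failure mode, not the project's actual code: `SessionTreeSketch`, `Entry`, `append`, `activePath`, and `handleIncoming` are hypothetical names, and only `getLeafEntry()` and `branch(parentId)` mirror method names mentioned in this issue (their real signatures may differ).

```typescript
// Hypothetical in-memory model of a session tree; NOT the real sessionManager.
type Role = "user" | "assistant";

interface Entry {
  id: string;
  parentId: string | null;
  role: Role;
  text: string;
}

class SessionTreeSketch {
  private entries = new Map<string, Entry>();
  private leafId: string | null = null;
  private nextId = 1;

  // Append a new entry under the current leaf and make it the new leaf.
  append(role: Role, text: string): Entry {
    const entry: Entry = { id: `e${this.nextId++}`, parentId: this.leafId, role, text };
    this.entries.set(entry.id, entry);
    this.leafId = entry.id;
    return entry;
  }

  getLeafEntry(): Entry | null {
    return this.leafId === null ? null : this.entries.get(this.leafId) ?? null;
  }

  // Rewind the active leaf; the old leaf stays stored but is no longer reachable.
  branch(parentId: string | null): void {
    this.leafId = parentId;
  }

  // Walk the parentId chain from the leaf to the root: this is what the AI sees.
  activePath(): Entry[] {
    const path: Entry[] = [];
    let id = this.leafId;
    while (id !== null) {
      const entry = this.entries.get(id);
      if (!entry) break;
      path.unshift(entry);
      id = entry.parentId;
    }
    return path;
  }
}

// Assumed shape of the orphan handling: a trailing user turn is branched away
// before the new message is appended. Returns the dropped entry, if any.
function handleIncoming(session: SessionTreeSketch, text: string): Entry | null {
  const leaf = session.getLeafEntry();
  let orphan: Entry | null = null;
  if (leaf !== null && leaf.role === "user") {
    orphan = leaf;
    session.branch(leaf.parentId);
  }
  session.append("user", text);
  return orphan;
}

// Usage: the third append simulates the turn that was in flight at crash time.
const session = new SessionTreeSketch();
session.append("user", "hi");
session.append("assistant", "hello!");
session.append("user", "write a 1000-word essay about the history of databases");
const dropped = handleIncoming(session, "hello");
// session.activePath() now holds "hi", "hello!", "hello"; the essay request is
// only available through `dropped`.
```

In this model the orphan is still stored but is unreachable from the active leaf, which matches the "structurally severed" behavior reported here. Returning the dropped entry is one way the caller could include its content in a warning or a user notification.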

Specific scenarios

  • Telegram DM: User asks a question, AI starts generating, gateway crashes → on restart, user sends another message → the original question is silently branched out of the session tree, never answered
  • Any channel: The orphan removal is channel-agnostic — any channel's messages are subject to the same silent loss
  • No visibility: No log entry at crash or restart time warns the operator that a user message was dropped; the only trace is the log.warn about "removed orphaned user message", which fires only when the next message arrives and doesn't identify what the lost message was

Expected behavior

On restart, the gateway should detect orphaned in-flight turns and either re-process them or notify the user that their message was lost. A persistent turn-tracking layer would enable the gateway to know which turns were in-progress at crash time without relying on transcript tree structure.
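One possible shape for such a turn-tracking layer is an append-only journal that records when a turn begins and when it ends, so that a scan on restart reveals which turns never finished. The sketch below is a proposal under stated assumptions, not existing gateway code: `TurnJournal`, `TurnRecord`, `DurableLog`, and `MemoryLog` are hypothetical names, and `MemoryLog` stands in for a real durable store (for example a file that is flushed to disk on each append).

```typescript
// Hypothetical sketch of a turn journal; none of these names exist in the project.
interface TurnRecord {
  kind: "begin" | "end";
  turnId: string;
  sessionId?: string;
  text?: string;
}

// Minimal storage abstraction. A real implementation would need to survive
// kill -9, e.g. by appending to a file and syncing before generation starts.
interface DurableLog {
  append(line: string): void;
  readAll(): string[];
}

// In-memory stand-in so the sketch runs anywhere; it is NOT crash-safe.
class MemoryLog implements DurableLog {
  private lines: string[] = [];
  append(line: string): void {
    this.lines.push(line);
  }
  readAll(): string[] {
    return this.lines.slice();
  }
}

class TurnJournal {
  private log: DurableLog;

  constructor(log: DurableLog) {
    this.log = log;
  }

  // Record the user message before AI generation starts.
  begin(turnId: string, sessionId: string, text: string): void {
    this.write({ kind: "begin", turnId, sessionId, text });
  }

  // Record that the assistant reply was fully persisted.
  end(turnId: string): void {
    this.write({ kind: "end", turnId });
  }

  // On restart: every "begin" without a matching "end" was in flight at crash time.
  unfinished(): TurnRecord[] {
    const open = new Map<string, TurnRecord>();
    for (const line of this.log.readAll()) {
      const rec = JSON.parse(line) as TurnRecord;
      if (rec.kind === "begin") {
        open.set(rec.turnId, rec);
      } else {
        open.delete(rec.turnId);
      }
    }
    return Array.from(open.values());
  }

  private write(rec: TurnRecord): void {
    this.log.append(JSON.stringify(rec));
  }
}
```

On startup the gateway could call `unfinished()` and, for each record, either re-run the turn or tell the user that the message was lost. Which of the two is appropriate likely depends on the channel and on how old the turn is, so that policy is left out of the sketch.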
