Startup hangs before /health when redo session_memory triggers blocking VLM call #1222

@dikotiledon

Summary

OpenViking startup can hang before /health becomes available when crash-recovery redo includes session_memory extraction. In this state the process stays alive but port 1933 never binds, so host integrations (e.g., the OpenClaw plugin) hit their startup timeout and kill the process.

Environment

  • OpenViking: 0.3.3
  • OS: Windows 11 (10.0.22621)
  • Launch mode: openviking.server.bootstrap (also reproduced when launched by OpenClaw plugin in local mode)
  • Config: ov.conf with VLM/embedding via OpenAI-compatible local endpoint (http://127.0.0.1:1130/v1)

Observed behavior

  • Startup logs stop after storage/queue initialization, e.g.:
    • mounted serverinfofs at /serverinfo
    • mounted queuefs at /queue
    • mounted localfs at /local
    • Created queue 'Embedding' / 'Semantic'
  • /health remains unreachable (connection refused) even though the Python process is still running.
  • When launched by a supervisor/plugin, startup eventually fails with health-check timeout and process termination.
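To tell "process alive but port never bound" apart from "server up but /health failing", a plain TCP probe is enough. The following is a minimal sketch, not part of OpenViking: the helper name `port_is_listening` is hypothetical, and the default port 1933 is taken from this report.

```python
import socket


def port_is_listening(host: str = "127.0.0.1", port: int = 1933,
                      timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is bound to the port, or on timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # During the hang described above this is expected to print False
    # even though the Python process is still running.
    print(port_is_listening())
```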

Root-cause evidence

Await-chain tracing during the hang consistently shows startup blocked in the redo recovery path:

OpenVikingService.initialize
-> LockManager.start
-> LockManager._recover_pending_redo
-> LockManager._redo_session_memory
-> SessionCompressor.extract_long_term_memories
-> MemoryExtractor.extract
-> VLMConfig.get_completion_async
-> OpenAIVLM.get_completion_async
-> HTTP wait (AsyncHTTP11Connection._receive_response_headers)

In other words, startup waits on an outbound VLM request while replaying redo, before the server's health endpoint is available.

Trigger condition

A pending redo marker existed under:
~/.openviking/data/viking/_system/redo/<task-id>/redo.json

Example payload:

{
  "archive_uri": "viking://session/default/.../history/archive_004",
  "session_uri": "viking://session/default/...",
  "account_id": "default",
  "user_id": "default",
  "agent_id": "main",
  "role": "root"
}

Removing this pending redo marker immediately allowed a clean startup and a healthy /health response.
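To check whether an installation is in this state before deleting anything, the pending markers can be listed with a short script. This is a sketch under assumptions: the directory layout is the one quoted above, the function name `list_pending_redo` is hypothetical, and the payload fields are only those shown in the example.

```python
import json
from pathlib import Path

# Location taken from this report; adjust if your data directory differs.
DEFAULT_REDO_DIR = (
    Path.home() / ".openviking" / "data" / "viking" / "_system" / "redo"
)


def list_pending_redo(redo_dir: Path = DEFAULT_REDO_DIR) -> list:
    """Return (task_id, payload) for every <task-id>/redo.json in redo_dir."""
    pending = []
    if not redo_dir.is_dir():
        return pending
    for marker in sorted(redo_dir.glob("*/redo.json")):
        try:
            payload = json.loads(marker.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            payload = {}  # unreadable marker: still report that it exists
        if not isinstance(payload, dict):
            payload = {}
        pending.append((marker.parent.name, payload))
    return pending


if __name__ == "__main__":
    for task_id, payload in list_pending_redo():
        print(task_id, payload.get("archive_uri"))
```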

Expected behavior

  • Startup should not block health availability on long/slow external VLM calls during redo replay.
  • Redo replay should be bounded/async/deferred so the server can become healthy first.

Suggested fixes

  1. Do not perform blocking LLM extraction in the startup path (LockManager.start):
    • Start server + health first.
    • Process redo replay in a background worker.
  2. Add timeout/circuit-breaker around redo session_memory extraction.
  3. On timeout/failure, enqueue semantic fallback and continue startup.
  4. Add explicit logs/metrics for redo task start/end/fail reason and per-task duration.
  5. Consider configurable startup mode: fast-start (defer redo) vs strict-recovery.
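Fixes 1 to 4 could be combined roughly as follows. This is only a sketch of the pattern with asyncio, not OpenViking's actual code: `extract`, `fallback`, and `bind_server` are hypothetical callables standing in for the session_memory extraction, the semantic-fallback enqueue, and whatever makes /health reachable, and the 30-second default is an arbitrary placeholder.

```python
import asyncio
import logging
import time

log = logging.getLogger("redo")


async def replay_redo_task(task_id, extract, fallback, timeout_s=30.0):
    """Run one redo extraction with a deadline; fall back on timeout/failure."""
    started = time.monotonic()
    try:
        # Bound the (hypothetical) VLM-backed extraction so it cannot hang forever.
        await asyncio.wait_for(extract(task_id), timeout=timeout_s)
        outcome = "ok"
    except asyncio.TimeoutError:
        outcome = "timeout"
        await fallback(task_id)
    except Exception as exc:  # CancelledError is not caught here (BaseException)
        outcome = f"failed: {exc!r}"
        await fallback(task_id)
    log.info("redo task=%s outcome=%s duration=%.2fs",
             task_id, outcome, time.monotonic() - started)
    return outcome


async def start_with_deferred_redo(task_ids, extract, fallback, bind_server,
                                   timeout_s=30.0):
    """Make the server reachable first, then replay redo in the background."""
    await bind_server()  # hypothetical: returns once /health is reachable

    async def replay_all():
        return [await replay_redo_task(t, extract, fallback, timeout_s)
                for t in task_ids]

    # Caller keeps the task handle so it can be awaited or cancelled on shutdown.
    return asyncio.create_task(replay_all())
```

A strict-recovery mode (fix 5) would simply await the returned task before reporting healthy, while fast-start would let it run in the background.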

Impact

This can cause repeated startup failure loops under production supervisors (the process appears alive but never becomes healthy), especially when the redo payload requires slow or unresponsive model calls.
