Skip to content

[MP] Improve the stability for controllers and improve log clarity#2883

Merged
ApostaC merged 1 commit intoLMCache:devfrom
ApostaC:local-dev/error-handling
Mar 27, 2026
Merged

[MP] Improve the stability for controllers and improve log clarity#2883
ApostaC merged 1 commit intoLMCache:devfrom
ApostaC:local-dev/error-handling

Conversation

@ApostaC
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC commented Mar 26, 2026

What this PR does / why we need it:

Improves MP mode resilience and debuggability:

  1. Main loop protection: Wraps internal dispatch calls in _prefetch_loop (PrefetchController) and _store_loop (StoreController) with try-except blocks so that an unexpected exception in one iteration does not crash the background thread.

  2. Stack trace logging: Switches logger.error/logger.warning to logger.exception at exception catch sites in multiprocess/ and distributed/ so that full stack traces are logged instead of just the exception message. Also adds missing logging for the silent except Exception in nixl_store_l2_adapter.py store path.

Special notes for your reviewers:

The catch-all in the main loops intentionally uses bare except Exception (not specific types) because the goal is to prevent any internal bug from killing the background thread. The exception is still logged with full traceback via logger.exception.

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

…rove the logs

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves error handling and observability across several modules, including the NIXL store adapter, prefetch and store controllers, and multiprocess server components. The changes primarily involve replacing standard error or warning logs with logger.exception to ensure stack traces are captured and wrapping critical processing loops in try...except blocks to prevent unhandled exceptions from crashing the background threads. I have no feedback to provide as no review comments were submitted.

Copy link
Copy Markdown
Contributor

@KuntaiDu KuntaiDu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Contributor

@sammshen sammshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ApostaC ApostaC enabled auto-merge (squash) March 26, 2026 23:48
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Mar 26, 2026
@ApostaC ApostaC merged commit 82224ed into LMCache:dev Mar 27, 2026
33 of 34 checks passed
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…MCache#2883)

[add] better error handling in prefetch and store controller, and improve the logs

Signed-off-by: ApostaC <yihua98@uchicago.edu>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…MCache#2883)

[add] better error handling in prefetch and store controller, and improve the logs

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants