fix: add HERMES_SKIP_GATEWAY_RESTART env var to opt out of auto-restart #6703
Open
r266-tech wants to merge 1 commit into
When `hermes update` runs from a cron job hosted inside the gateway process, the auto-restart kills the worker thread mid-flight, causing lost cron state (stale `jobs.json`, missing `state.db` sessions). Adds an env var check before the restart block so cron jobs can set `HERMES_SKIP_GATEWAY_RESTART=1` to safely update without self-kill. Closes NousResearch#6702
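A minimal sketch of the guard and of how a job could opt out, assuming the check reads the environment the same way as the `_do_restart = not os.environ.get(...)` expression in the Changes list. `should_restart_gateway` and `cron_update_env` are hypothetical helpers for illustration, not functions in `hermes_cli`.

```python
import os
from typing import Mapping, Optional


def should_restart_gateway(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless HERMES_SKIP_GATEWAY_RESTART is set to a non-empty value.

    Hypothetical helper: the PR inlines this as
    `_do_restart = not os.environ.get("HERMES_SKIP_GATEWAY_RESTART")`.
    """
    if env is None:
        env = os.environ
    # Truthiness check: any non-empty string (even "0" or "false") skips the restart.
    return not env.get("HERMES_SKIP_GATEWAY_RESTART")


def cron_update_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment and add the opt-out flag for a cron-driven update.

    Hypothetical helper; the copy leaves the parent (gateway) environment untouched.
    """
    env = dict(os.environ if base is None else base)
    env["HERMES_SKIP_GATEWAY_RESTART"] = "1"
    return env
```

A job could then run `subprocess.run(["hermes", "update"], env=cron_update_env(), check=True)`; whether hermes's scheduler launches `hermes update` as a subprocess like this is an assumption. Because the check is truthiness-based, `HERMES_SKIP_GATEWAY_RESTART=0` also skips the restart; only an unset or empty variable lets it run.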
Problem
`hermes update` unconditionally restarts all gateway processes after pulling new code (main.py:3758-3860). There is no env var, CLI flag, or check to opt out.

When `hermes update` is invoked from a cron job scheduled by hermes's own in-process cron scheduler (the `cron/scheduler.py` tick loop and its `ThreadPoolExecutor` worker thread run inside the gateway Python process), the auto-restart sends SIGTERM to the gateway. The worker thread is killed mid-flight, so the post-run `jobs.json` write never happens, the `state.db` session row is missing, and the cron output is incomplete.

Observed: `last_run_at` stuck at yesterday in `jobs.json`, even though `git reflog` shows the pull succeeded 19 seconds before the gateway was killed.

Fix
Adds a `HERMES_SKIP_GATEWAY_RESTART` env var check before the auto-restart block. When the variable is set to any non-empty value, the restart is skipped and the user sees a message confirming the skip. This lets cron jobs safely run `hermes update` by setting `HERMES_SKIP_GATEWAY_RESTART=1` in the cron environment.

Changes
- `hermes_cli/main.py`: Add a `_do_restart = not os.environ.get("HERMES_SKIP_GATEWAY_RESTART")` guard wrapping the existing try/except restart block (+6 lines net).
- `tests/hermes_cli/test_update_gateway_restart.py`: Add a `TestSkipGatewayRestartEnvVar` test class verifying the env var guard logic.

Test plan
`True` when env var absent)

Closes #6702
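The `TestSkipGatewayRestartEnvVar` class itself is not shown in this description. A sketch of what such a test could check, assuming the guard is the bare `os.environ.get` expression from the Changes list; `_do_restart()` is a hypothetical wrapper, since the PR inlines the expression rather than exposing a function.

```python
import os
import unittest
from unittest import mock


def _do_restart() -> bool:
    # Hypothetical wrapper around the expression the PR adds to hermes_cli/main.py.
    return not os.environ.get("HERMES_SKIP_GATEWAY_RESTART")


class TestSkipGatewayRestartEnvVar(unittest.TestCase):
    def test_restarts_when_env_var_absent(self):
        # clear=True empties os.environ for the block, so no opt-out is present.
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertTrue(_do_restart())

    def test_skips_when_env_var_set(self):
        with mock.patch.dict(os.environ, {"HERMES_SKIP_GATEWAY_RESTART": "1"}):
            self.assertFalse(_do_restart())

    def test_empty_value_does_not_skip(self):
        # An empty string is falsy, so the restart still runs.
        with mock.patch.dict(os.environ, {"HERMES_SKIP_GATEWAY_RESTART": ""}):
            self.assertTrue(_do_restart())
```

`mock.patch.dict` restores `os.environ` when each `with` block exits, so the cases do not leak state into one another or into other tests; pytest collects `unittest.TestCase` subclasses directly.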