fix(logging): make rotating handler safe against external rotation (#27649)#27681
Closed
xxxigm wants to merge 2 commits into
Closed
fix(logging): make rotating handler safe against external rotation (#27649)#27681xxxigm wants to merge 2 commits into
xxxigm wants to merge 2 commits into
Conversation
When multiple Hermes processes share ``~/.hermes/logs/agent.log`` (e.g.
gateway + several ``hermes --tui`` sessions and their TUI/slash-worker
subprocesses) the stdlib ``RotatingFileHandler`` keeps an open file
descriptor pointing at the active inode. When *another* process
rotates the file -- renaming ``agent.log`` -> ``agent.log.1`` and
creating a fresh ``agent.log`` -- the original fd still resolves to
the renamed inode, so writes from the older processes silently land
in the rotated backup forever. Newer processes write to the new
``agent.log`` and the live log stream gets split across files.
``_ManagedRotatingFileHandler`` now detects this case:
* Before each ``emit`` (throttled to once every
``_REOPEN_CHECK_INTERVAL`` records, default 16; the very first
emit always checks) it compares the open stream's
``(st_dev, st_ino)`` against the same pair for ``baseFilename``
on disk. On mismatch it closes the stale stream and reopens
``baseFilename`` so subsequent writes follow the live log.
* ``doRollover`` short-circuits when an external rotation already
happened, simply reopening ``baseFilename`` instead of renaming
it. Without this guard our own rollover would clobber the newer
``agent.log`` that a sibling process just created.
* ``st_ino == 0`` and ``OSError`` from ``fstat``/``stat`` are
treated as "unknown" rather than forcing spurious reopens (for
Windows + Python combos that don't expose stable inode numbers).
* Failures inside the reopen helper are swallowed so a broken
logging path never kills the calling code (fail-open).
Managed-mode chmod (group-writable 0o660 on NixOS deployments) is
re-applied after the reopen so a sibling-rotated file doesn't inherit
the process umask.
Fixes NousResearch#27649.
Add ``tests/test_hermes_logging_multiprocess_rotation.py`` -- 15 regression tests pinning the new inode-aware behaviour of ``_ManagedRotatingFileHandler``. TestExternalRotationReopen (4 tests) * Direct NousResearch#27649 repro: rename agent.log -> agent.log.1 mid-stream and confirm subsequent writes land in the new agent.log, not in the backup. * Steady-state: no inode change -> no stream-object replacement. * ``_reopen_if_rotated_externally`` returns ``True`` only on actual mismatch and is a no-op on the next call. * Multiple writes after the reopen all reach the live log. TestDoRolloverShortCircuit (2 tests) * If a sibling already rotated and wrote new content, our own ``doRollover`` must NOT rename their file (no clobber). * Single-process rotation behaviour is unchanged when no external rotation has happened. TestThrottleAndCorners (7 tests) * The very first emit always checks (even with the default interval of 16) so a process that starts after another rotated doesn't spend 16 records writing to the backup. * The ``_REOPEN_CHECK_INTERVAL`` contract is bounded -- writes converge to the live log within the configured window. * Closed/None stream -> helpers return ``None`` and the reopen is a no-op. * Windows-style ``st_ino == 0`` is treated as unknown rather than triggering spurious reopens on every record. * ``baseFilename`` missing -> on-disk inode lookup returns ``None`` (the reopen helper does nothing). * ``OSError`` during the close half of the reopen is swallowed and the new stream is still opened correctly. * Unexpected exceptions in the inode-check path do NOT drop the record being emitted (fail-open contract). TestManagedModeChmodAfterReopen (1 test, POSIX-only) * The managed-mode 0o660 chmod is reapplied after the external- rotation reopen so a sibling-created log doesn't get stuck with 0o644 from the process umask. TestIntegrationWithSetupLogging (1 test) * ``setup_logging()`` still wires ``_ManagedRotatingFileHandler`` rather than the bare stdlib class, so the fix is picked up on real installs. All 65 tests in ``tests/test_hermes_logging*.py`` pass (50 pre-existing + 15 new).
Contributor
|
Thanks for the careful write-up and regression coverage. This stale-fd external-rotation fix is already implemented on current main, so this PR can be closed as superseded by main. Evidence from this automated hermes-sweeper review:
I also read the later #27649 discussion: the Windows |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fixes the multi-process log-rotation split reported in #27649 without adding a new runtime dependency.
When several Hermes processes share
~/.hermes/logs/agent.log(e.g. the gateway plus severalhermes --tuisessions and theirtui_gateway.entry/slash_workersubprocesses) the stdlibRotatingFileHandlerkeeps an open file descriptor pointing at the active inode. If another process rotates the file -- renamingagent.log->agent.log.1and creating a freshagent.log-- the original fd still resolves to the renamed inode, so:agent.log.1forever.agent.log.agent.logmiss current activity.agent.logthat a sibling just created.hermes_logging._ManagedRotatingFileHandlernow detects external rotation by comparing the open stream's(st_dev, st_ino)againstbaseFilenameon disk, and reopens the active file when they disagree. The fix is stdlib-only -- noconcurrent-log-handlerdependency.Related Issue
Fixes #27649.
Type of Change
Changes Made
hermes_logging.py--_ManagedRotatingFileHandleris now inode-aware:_REOPEN_CHECK_INTERVAL = 16controls how often the inode check runs (the first emit always checks, then every Nth thereafter)._stream_inode_key()/_baseFilename_inode_key()helpers return(st_dev, st_ino)and treatst_ino == 0(Windows + Python combos that don't expose stable inode numbers) andOSErroras "unknown" rather than triggering spurious reopens._reopen_if_rotated_externally()closes the stale stream and reopensbaseFilenamewhen the inode-comparison mismatches; returnsTrueif a reopen happened.emit()invokes the check before delegating to the base class; failures in the check path are swallowed so a broken logging path never kills the calling code (fail-open).doRollover()short-circuits when an external rotation already happened, reopeningbaseFilenameinstead of renaming it -- preventing the older process from clobbering the neweragent.loga sibling just created.chmod 0o660is re-applied on the reopen path so sibling-rotated files don't get stuck at the process umask (0644).tests/test_hermes_logging_multiprocess_rotation.py(new file, 15 regression tests):TestExternalRotationReopen(4 tests) -- direct [Bug]: Multiprocess logging can keep writing to rotated agent.log.N files #27649 repro plus steady-state guards.TestDoRolloverShortCircuit(2 tests) -- the rollover-clobber regression.TestThrottleAndCorners(7 tests) -- first-emit-always-checks, throttle-bounded-window, closed-stream safety,st_ino == 0Windows compat, missing-baseFilenamehandling,OSErrorswallowing during close, fail-open of the inode-check path.TestManagedModeChmodAfterReopen(1 test, POSIX-only) -- chmod survives the reopen.TestIntegrationWithSetupLogging(1 test) --setup_logging()still wires the managed handler.How to Test
Check out the branch and set up the venv:
Run the new regression suite alone:
Expected: 15 passed.
Run the full logging-related suite to confirm no cross-file regression:
Expected: 65 passed (50 pre-existing + 15 new).
(Optional) Reproduce the [Bug]: Multiprocess logging can keep writing to rotated agent.log.N files #27649 scenario manually:
Expected:
post-rotationshows up in/tmp/x.log, not in/tmp/x.log.1.Checklist
Code
fix(logging),test(logging))scripts/run_tests.sh tests/test_hermes_logging.py tests/test_hermes_logging_multiprocess_rotation.pyand all tests passDocumentation & Housekeeping
docs/, docstrings) -- the class-level docstring on_ManagedRotatingFileHandlernow documents the external-rotation safety contract and the throttle interval. No user-facing docs are affected (the handler is internal).cli-config.yaml.exampleif I added/changed config keys -- N/A (no config-key changes).CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows -- N/A.st_ino == 0andOSErroras "unknown" (Windows-friendly fallback), and the managed-mode chmod test is gated by@pytest.mark.skipif(sys.platform.startswith("win")).Screenshots / Logs
Limitations (per the issue's "no strong prescription from me" note)
This is a minimal stdlib-only fix:
doRolloverprevents the worst symptom (a slow process overwriting a freshagent.logwritten by a faster sibling) but doesn't synchronize the rename itself.concurrent-log-handlerwas considered and rejected to keep the install footprint minimal).