Skip to content

fix(logging): make rotating handler safe against external rotation (#27649)#27681

Closed
xxxigm wants to merge 2 commits into
NousResearch:mainfrom
xxxigm:fix/27649-multiprocess-log-rotation
Closed

fix(logging): make rotating handler safe against external rotation (#27649)#27681
xxxigm wants to merge 2 commits into
NousResearch:mainfrom
xxxigm:fix/27649-multiprocess-log-rotation

Conversation

@xxxigm

@xxxigm xxxigm commented May 18, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes the multi-process log-rotation split reported in #27649 without adding a new runtime dependency.

When several Hermes processes share ~/.hermes/logs/agent.log (e.g. the gateway plus several hermes --tui sessions and their tui_gateway.entry / slash_worker subprocesses) the stdlib RotatingFileHandler keeps an open file descriptor pointing at the active inode. If another process rotates the file -- renaming agent.log -> agent.log.1 and creating a fresh agent.log -- the original fd still resolves to the renamed inode, so:

  1. Writes from the older processes silently land in agent.log.1 forever.
  2. Newer processes write to the new agent.log.
  3. The live log stream gets split across files; tools that follow only agent.log miss current activity.
  4. Worse, once the older process eventually triggers its own rollover, the rename overwrites the newer agent.log that a sibling just created.

hermes_logging._ManagedRotatingFileHandler now detects external rotation by comparing the open stream's (st_dev, st_ino) against baseFilename on disk, and reopens the active file when they disagree. The fix is stdlib-only -- no concurrent-log-handler dependency.

Related Issue

Fixes #27649.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Security fix
  • Documentation update
  • Tests (adding or improving test coverage)
  • Refactor (no behavior change)
  • New skill (bundled or hub)

Changes Made

hermes_logging.py -- _ManagedRotatingFileHandler is now inode-aware:

  • New class attribute _REOPEN_CHECK_INTERVAL = 16 controls how often the inode check runs (the first emit always checks, then every Nth thereafter).
  • New _stream_inode_key() / _baseFilename_inode_key() helpers return (st_dev, st_ino) and treat st_ino == 0 (Windows + Python combos that don't expose stable inode numbers) and OSError as "unknown" rather than triggering spurious reopens.
  • New _reopen_if_rotated_externally() closes the stale stream and reopens baseFilename when the inode-comparison mismatches; returns True if a reopen happened.
  • emit() invokes the check before delegating to the base class; failures in the check path are swallowed so a broken logging path never kills the calling code (fail-open).
  • doRollover() short-circuits when an external rotation already happened, reopening baseFilename instead of renaming it -- preventing the older process from clobbering the newer agent.log a sibling just created.
  • Managed-mode chmod 0o660 is re-applied on the reopen path so sibling-rotated files don't get stuck at the process umask (0644).

tests/test_hermes_logging_multiprocess_rotation.py (new file, 15 regression tests):

  • TestExternalRotationReopen (4 tests) -- direct [Bug]: Multiprocess logging can keep writing to rotated agent.log.N files #27649 repro plus steady-state guards.
  • TestDoRolloverShortCircuit (2 tests) -- the rollover-clobber regression.
  • TestThrottleAndCorners (7 tests) -- first-emit-always-checks, throttle-bounded-window, closed-stream safety, st_ino == 0 Windows compat, missing-baseFilename handling, OSError swallowing during close, fail-open of the inode-check path.
  • TestManagedModeChmodAfterReopen (1 test, POSIX-only) -- chmod survives the reopen.
  • TestIntegrationWithSetupLogging (1 test) -- setup_logging() still wires the managed handler.

How to Test

  1. Check out the branch and set up the venv:

    python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
    
  2. Run the new regression suite alone:

    scripts/run_tests.sh tests/test_hermes_logging_multiprocess_rotation.py -v
    

    Expected: 15 passed.

  3. Run the full logging-related suite to confirm no cross-file regression:

    scripts/run_tests.sh tests/test_hermes_logging.py tests/test_hermes_logging_multiprocess_rotation.py
    

    Expected: 65 passed (50 pre-existing + 15 new).

  4. (Optional) Reproduce the [Bug]: Multiprocess logging can keep writing to rotated agent.log.N files #27649 scenario manually:

    python -c "
    import logging, os
    from hermes_logging import _ManagedRotatingFileHandler
    
    h = _ManagedRotatingFileHandler('/tmp/x.log', maxBytes=1024, backupCount=3)
    h._REOPEN_CHECK_INTERVAL = 1
    logging.getLogger('demo').addHandler(h); logging.getLogger('demo').setLevel(logging.INFO)
    
    logging.getLogger('demo').info('pre-rotation')
    os.rename('/tmp/x.log', '/tmp/x.log.1')
    open('/tmp/x.log','w').close()
    logging.getLogger('demo').info('post-rotation')
    print('live:', open('/tmp/x.log').read())
    print('backup:', open('/tmp/x.log.1').read())
    "
    

    Expected: post-rotation shows up in /tmp/x.log, not in /tmp/x.log.1.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (2 commits: fix(logging), test(logging))
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh tests/test_hermes_logging.py tests/test_hermes_logging_multiprocess_rotation.py and all tests pass
  • I've added tests for my changes (15 new regression tests)
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) -- the class-level docstring on _ManagedRotatingFileHandler now documents the external-rotation safety contract and the throttle interval. No user-facing docs are affected (the handler is internal).
  • I've updated cli-config.yaml.example if I added/changed config keys -- N/A (no config-key changes).
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows -- N/A.
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide -- the inode helpers treat st_ino == 0 and OSError as "unknown" (Windows-friendly fallback), and the managed-mode chmod test is gated by @pytest.mark.skipif(sys.platform.startswith("win")).
  • I've updated tool descriptions/schemas if I changed tool behavior -- N/A.

Screenshots / Logs

$ scripts/run_tests.sh tests/test_hermes_logging_multiprocess_rotation.py -v
4 workers [15 items]
...............                                                          [100%]
============================== 15 passed in 1.17s ==============================

$ scripts/run_tests.sh tests/test_hermes_logging.py tests/test_hermes_logging_multiprocess_rotation.py
4 workers [65 items]
.................................................................        [100%]
============================== 65 passed in 1.18s ==============================

Limitations (per the issue's "no strong prescription from me" note)

This is a minimal stdlib-only fix:

  • The inode check is throttled to every 16 records by default. Up to 15 records emitted between checks immediately after an external rotation may still land in the rotated backup. Tests pin this as a bounded contract rather than promising zero loss.
  • True multi-process rotation safety (file locking around the rename) is out of scope -- two processes rotating simultaneously can still race. The short-circuit in doRollover prevents the worst symptom (a slow process overwriting a fresh agent.log written by a faster sibling) but doesn't synchronize the rename itself.
  • No new dependency is added (concurrent-log-handler was considered and rejected to keep the install footprint minimal).

xxxigm added 2 commits May 18, 2026 07:39
When multiple Hermes processes share ``~/.hermes/logs/agent.log`` (e.g.
gateway + several ``hermes --tui`` sessions and their TUI/slash-worker
subprocesses) the stdlib ``RotatingFileHandler`` keeps an open file
descriptor pointing at the active inode.  When *another* process
rotates the file -- renaming ``agent.log`` -> ``agent.log.1`` and
creating a fresh ``agent.log`` -- the original fd still resolves to
the renamed inode, so writes from the older processes silently land
in the rotated backup forever.  Newer processes write to the new
``agent.log`` and the live log stream gets split across files.

``_ManagedRotatingFileHandler`` now detects this case:

  * Before each ``emit`` (throttled to once every
    ``_REOPEN_CHECK_INTERVAL`` records, default 16; the very first
    emit always checks) it compares the open stream's
    ``(st_dev, st_ino)`` against the same pair for ``baseFilename``
    on disk.  On mismatch it closes the stale stream and reopens
    ``baseFilename`` so subsequent writes follow the live log.
  * ``doRollover`` short-circuits when an external rotation already
    happened, simply reopening ``baseFilename`` instead of renaming
    it.  Without this guard our own rollover would clobber the newer
    ``agent.log`` that a sibling process just created.
  * ``st_ino == 0`` and ``OSError`` from ``fstat``/``stat`` are
    treated as "unknown" rather than forcing spurious reopens (for
    Windows + Python combos that don't expose stable inode numbers).
  * Failures inside the reopen helper are swallowed so a broken
    logging path never kills the calling code (fail-open).

Managed-mode chmod (group-writable 0o660 on NixOS deployments) is
re-applied after the reopen so a sibling-rotated file doesn't inherit
the process umask.

Fixes NousResearch#27649.
Add ``tests/test_hermes_logging_multiprocess_rotation.py`` -- 15
regression tests pinning the new inode-aware behaviour of
``_ManagedRotatingFileHandler``.

TestExternalRotationReopen (4 tests)
  * Direct NousResearch#27649 repro: rename agent.log -> agent.log.1 mid-stream
    and confirm subsequent writes land in the new agent.log, not in
    the backup.
  * Steady-state: no inode change -> no stream-object replacement.
  * ``_reopen_if_rotated_externally`` returns ``True`` only on actual
    mismatch and is a no-op on the next call.
  * Multiple writes after the reopen all reach the live log.

TestDoRolloverShortCircuit (2 tests)
  * If a sibling already rotated and wrote new content, our own
    ``doRollover`` must NOT rename their file (no clobber).
  * Single-process rotation behaviour is unchanged when no external
    rotation has happened.

TestThrottleAndCorners (7 tests)
  * The very first emit always checks (even with the default
    interval of 16) so a process that starts after another rotated
    doesn't spend 16 records writing to the backup.
  * The ``_REOPEN_CHECK_INTERVAL`` contract is bounded -- writes
    converge to the live log within the configured window.
  * Closed/None stream -> helpers return ``None`` and the reopen is
    a no-op.
  * Windows-style ``st_ino == 0`` is treated as unknown rather than
    triggering spurious reopens on every record.
  * ``baseFilename`` missing -> on-disk inode lookup returns ``None``
    (the reopen helper does nothing).
  * ``OSError`` during the close half of the reopen is swallowed and
    the new stream is still opened correctly.
  * Unexpected exceptions in the inode-check path do NOT drop the
    record being emitted (fail-open contract).

TestManagedModeChmodAfterReopen (1 test, POSIX-only)
  * The managed-mode 0o660 chmod is reapplied after the external-
    rotation reopen so a sibling-created log doesn't get stuck with
    0o644 from the process umask.

TestIntegrationWithSetupLogging (1 test)
  * ``setup_logging()`` still wires ``_ManagedRotatingFileHandler``
    rather than the bare stdlib class, so the fix is picked up on
    real installs.

All 65 tests in ``tests/test_hermes_logging*.py`` pass (50 pre-existing
+ 15 new).
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/agent Core agent loop, run_agent.py, prompt builder labels May 18, 2026
@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the careful write-up and regression coverage. This stale-fd external-rotation fix is already implemented on current main, so this PR can be closed as superseded by main.

Evidence from this automated hermes-sweeper review:

  • 75d2c081c9abd60333694b13f1d753c3a8361f61 (fix(logging): recover gateway.log handler from external rotation (#34349)) is present on origin/main and adds WatchedFileHandler-style inode recovery.
  • hermes_logging.py:358 documents _ManagedRotatingFileHandler as surviving external rotation, and hermes_logging.py:457 calls _reopen_if_externally_rotated() before delegating to stdlib RotatingFileHandler.emit().
  • hermes_logging.py:505 wires that same managed handler through _add_rotating_handler(), so the behavior applies to agent.log, errors.log, gateway.log, and gui.log.
  • tests/test_hermes_logging.py:832 adds regression coverage for external rename/unlink/truncate and normal rollover.

I also read the later #27649 discussion: the Windows PermissionError/WinError 32 rollover spam is a related follow-up issue, but it is not addressed by this PR either.

@teknium1 teknium1 closed this Jun 13, 2026
@teknium1 teknium1 added the sweeper:implemented-on-main Sweeper: behavior already present on current main label Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P3 Low — cosmetic, nice to have sweeper:implemented-on-main Sweeper: behavior already present on current main type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Multiprocess logging can keep writing to rotated agent.log.N files

3 participants