Skip to content

Fix/misc bug fixes#1974

Closed
Gutslabs wants to merge 2 commits into
NousResearch:mainfrom
Gutslabs:fix/misc-bug-fixes
Closed

Fix/misc bug fixes#1974
Gutslabs wants to merge 2 commits into
NousResearch:mainfrom
Gutslabs:fix/misc-bug-fixes

Conversation

@Gutslabs

Copy link
Copy Markdown
Contributor

Summary

Fix 11 independently discovered bugs across the agent core, gateway, cron scheduler, and tools — none tracked in existing issues.

Bugs Fixed

# Scope Bug Severity
1 run_agent.py _hydrate_todo_store() unconditionally called _set_interrupt(False), silently clearing pending interrupts on every run_conversation call High
2 run_agent.py read_file listed in both _PATH_SCOPED_TOOLS and _PARALLEL_SAFE_TOOLS — when called without a path arg, the path-scoped check returned False first, unnecessarily serializing the entire batch Medium
3 run_agent.py _materialize_data_url_for_vision() created a NamedTemporaryFile(delete=False) but never cleaned it up if base64.b64decode() raised Medium
4 hermes_state.py session_count() and message_count() were the only SessionDB methods that didn't acquire self._lock — race condition under concurrent gateway requests High
5 hermes_state.py Schema migrations ran all steps under a single commit — a mid-migration crash left the DB with a mismatched version and no way to resume Medium
6 web_tools.py Top-level from firecrawl import Firecrawl crashed the entire web tools module for users on Tavily or Parallel backends High
7 memory_tool.py Unconditional import fcntl — immediate ImportError on Windows, violating CONTRIBUTING.md cross-platform rules High
8 gateway/session.py append_to_transcript() JSONL write had no error handling — disk full or permission errors crashed the gateway message handler Medium
9 gateway/status.py _read_pid_record() called read_text() without try/except — concurrent PID file writes caused unhandled OSError Medium
10 cron/scheduler.py finally block called lock_fd.close() unconditionally, but lock_fd could still be None if open() never completed → AttributeError Medium
11 cron/scheduler.py except RuntimeError in _deliver_result() was too broad — any RuntimeError from _send_to_platform (not just event-loop errors) triggered a spurious retry in a new thread, risking double delivery High

Test Plan

  • pytest tests/ -v passes
  • Import web_tools without firecrawl package installed (Tavily/Parallel backend)
  • Import memory_tool on Windows (fcntl guard)
  • Concurrent gateway sessions don't trigger SQLite threading errors on count methods
  • Cron tick() handles KeyboardInterrupt during lock acquisition without AttributeError

Gutslabs added 2 commits March 18, 2026 17:12
- agent: remove misplaced _set_interrupt(False) from _hydrate_todo_store()
  that silently cleared pending interrupts on every run_conversation call
- agent: fix read_file in both _PATH_SCOPED_TOOLS and _PARALLEL_SAFE_TOOLS
  causing unnecessary batch serialization when path arg is missing
- agent: fix temp file leak in _materialize_data_url_for_vision() when
  base64.b64decode() raises an exception
- state: add missing thread lock to session_count() and message_count()
  preventing race conditions with concurrent gateway requests
- state: add intermediate commits between schema migration steps to
  prevent version/schema mismatch on mid-migration crashes
- tools: make firecrawl import lazy in web_tools.py so non-Firecrawl
  backends (Tavily, Parallel) don't crash on module import
- tools: guard fcntl import in memory_tool.py for Windows compatibility
  per CONTRIBUTING.md cross-platform rules
- gateway: wrap JSONL transcript write in try/except in session.py to
  prevent gateway crash on disk full or permission errors
- gateway: add OSError handling to _read_pid_record() in status.py to
  handle concurrent PID file writes gracefully
- cron: guard lock_fd.close() with None check in scheduler.py finally
  block to prevent AttributeError if open() never completed
- cron: narrow RuntimeError catch in _deliver_result() to only catch
  event-loop-related errors, preventing spurious retries on unrelated
  RuntimeErrors from _send_to_platform
@nidhishgajjar

Copy link
Copy Markdown

Orb Code Review (powered by GLM 5.1 on Orb Cloud)

Summary

Fixes 11 independently discovered bugs across the agent core, gateway, cron scheduler, and tools. Each fix is small and targeted. Key highlights:

  1. _set_interrupt(False) on every run_conversation — silently cleared pending interrupts (high severity)
  2. Path normalization for parallel tool batching./ and ../ aliases caused same-target writes to be incorrectly parallelized
  3. Temp file leak on base64 decode failure — now cleans up
  4. session_count()/message_count() missing lock — race condition under concurrent access
  5. Schema migration atomicity — each migration step now has its own commit()
  6. from firecrawl import Firecrawl at module level — crashed web tools for non-Firecrawl users
  7. import fcntl unconditional — broke Windows
  8. JSONL transcript write — no error handling for disk-full/permissions
  9. PID file read — no OSError handling
  10. lock_fd could be None in finally — AttributeError
  11. RuntimeError catch too broad — now only catches event-loop-related errors

Architecture

All fixes follow the existing patterns in the codebase. The bug fixes are orthogonal — no cross-dependencies between them.

Issues

Warningcron/scheduler.py RuntimeError catch specificity:

except RuntimeError as e:
    if "event loop" not in str(e).lower():
        raise

String-matching on exception messages is fragile — if Python changes the RuntimeError message for event loop issues, this will break. However, this is the standard pattern for asyncio.run() guard code and there's no better alternative in the Python stdlib today. Acceptable.

Suggestionhermes_state.py migrations now have individual commit() calls but no rollback on failure. If a migration step fails mid-execution (e.g., ALTER TABLE succeeds but UPDATE schema_version fails), the database could be left with the new column but the old version number. Consider wrapping each migration step in a try/except that rolls back:

if current_version < 2:
    try:
        cursor.execute("ALTER TABLE sessions ADD COLUMN ...")
        cursor.execute("UPDATE schema_version SET version = 2")
        self._conn.commit()
    except sqlite3.OperationalError:
        self._conn.rollback()
        pass

This is a minor improvement — the current code is already better than the original (single commit for all migrations).

Note_set_interrupt(False) removal in _hydrate_todo_store: The original code cleared interrupts during todo hydration on every run_conversation entry. Removing it is correct if interrupts should persist across conversation re-entries. But verify that there isn't a path where _set_interrupt(False) is needed to prevent stale interrupts from blocking the new conversation. The test coverage should catch regressions here.

Cross-file impact

  • web_tools.py: Moving from firecrawl import Firecrawl to lazy import inside _get_firecrawl_client() is correct — this is the only function that uses it.
  • memory_tool.py: The fcntl = None fallback + yield without lock on Windows is acceptable for single-process use. On multi-process Windows, consider msvcrt locking, but this matches the existing Windows strategy in cron/scheduler.py.

Assessment

approve ✅ — Well-structured batch of targeted bug fixes. Each addresses a real issue. Good test additions for the path normalization fix. The PR description's bug table is excellent for review traceability.

1 similar comment
@nidhishgajjar

Copy link
Copy Markdown

Orb Code Review (powered by GLM 5.1 on Orb Cloud)

Summary

Fixes 11 independently discovered bugs across the agent core, gateway, cron scheduler, and tools. Each fix is small and targeted. Key highlights:

  1. _set_interrupt(False) on every run_conversation — silently cleared pending interrupts (high severity)
  2. Path normalization for parallel tool batching./ and ../ aliases caused same-target writes to be incorrectly parallelized
  3. Temp file leak on base64 decode failure — now cleans up
  4. session_count()/message_count() missing lock — race condition under concurrent access
  5. Schema migration atomicity — each migration step now has its own commit()
  6. from firecrawl import Firecrawl at module level — crashed web tools for non-Firecrawl users
  7. import fcntl unconditional — broke Windows
  8. JSONL transcript write — no error handling for disk-full/permissions
  9. PID file read — no OSError handling
  10. lock_fd could be None in finally — AttributeError
  11. RuntimeError catch too broad — now only catches event-loop-related errors

Architecture

All fixes follow the existing patterns in the codebase. The bug fixes are orthogonal — no cross-dependencies between them.

Issues

Warningcron/scheduler.py RuntimeError catch specificity:

except RuntimeError as e:
    if "event loop" not in str(e).lower():
        raise

String-matching on exception messages is fragile — if Python changes the RuntimeError message for event loop issues, this will break. However, this is the standard pattern for asyncio.run() guard code and there's no better alternative in the Python stdlib today. Acceptable.

Suggestionhermes_state.py migrations now have individual commit() calls but no rollback on failure. If a migration step fails mid-execution (e.g., ALTER TABLE succeeds but UPDATE schema_version fails), the database could be left with the new column but the old version number. Consider wrapping each migration step in a try/except that rolls back:

if current_version < 2:
    try:
        cursor.execute("ALTER TABLE sessions ADD COLUMN ...")
        cursor.execute("UPDATE schema_version SET version = 2")
        self._conn.commit()
    except sqlite3.OperationalError:
        self._conn.rollback()
        pass

This is a minor improvement — the current code is already better than the original (single commit for all migrations).

Note_set_interrupt(False) removal in _hydrate_todo_store: The original code cleared interrupts during todo hydration on every run_conversation entry. Removing it is correct if interrupts should persist across conversation re-entries. But verify that there isn't a path where _set_interrupt(False) is needed to prevent stale interrupts from blocking the new conversation. The test coverage should catch regressions here.

Cross-file impact

  • web_tools.py: Moving from firecrawl import Firecrawl to lazy import inside _get_firecrawl_client() is correct — this is the only function that uses it.
  • memory_tool.py: The fcntl = None fallback + yield without lock on Windows is acceptable for single-process use. On multi-process Windows, consider msvcrt locking, but this matches the existing Windows strategy in cron/scheduler.py.

Assessment

approve ✅ — Well-structured batch of targeted bug fixes. Each addresses a real issue. Good test additions for the path normalization fix. The PR description's bug table is excellent for review traceability.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery comp/cron Cron scheduler and job management tool/web Web search and extraction tool/memory Memory tool and memory providers labels May 3, 2026
teknium1 added a commit that referenced this pull request May 11, 2026
Salvages the three substantive low-severity fixes from Gutslabs' #1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
@teknium1

Copy link
Copy Markdown
Contributor

Salvaged via #23591 — merge commit 3af3c4e on main with your authorship preserved (rebase merge).

After review, only 3 of the 11 claims survived:

The other 8 claims were either already fixed on main with superior implementations (state lock, firecrawl lazy import, fcntl/msvcrt guard with proper Windows locking, path normalization, schema migrations rewritten declaratively) or didn't survive analysis (the interrupt clear is immediately re-asserted by run_conversation; cron lock_fd is guaranteed non-None by the outer try block structure; etc).

Thanks for the contribution — credit preserved in git log for the salvaged fixes.

rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
Salvages the three substantive low-severity fixes from Gutslabs' NousResearch#1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
Salvages the three substantive low-severity fixes from Gutslabs' NousResearch#1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Salvages the three substantive low-severity fixes from Gutslabs' NousResearch#1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
Salvages the three substantive low-severity fixes from Gutslabs' NousResearch#1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
Salvages the three substantive low-severity fixes from Gutslabs' NousResearch#1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Salvages the three substantive low-severity fixes from Gutslabs' NousResearch#1974
"misc bug fixes" bundle.  The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.

- run_agent: `_materialize_data_url_for_vision` uses
  `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
  corrupt data URL the temp file would persist forever.  Wrap the
  write in try/except and `os.unlink` the temp on failure.

- gateway/session: `append_to_transcript` JSONL write had no error
  handling, so disk-full / read-only-fs / permission errors crashed the
  message handler.  The SQLite write above is the primary store, so
  swallow OSError on the JSONL fallback with a debug log.

- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
  an `exists()` check; if the PID file is deleted between the two
  calls (concurrent gateway restart) we hit an unhandled OSError.
  Catch it and return None.

Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cron Cron scheduler and job management comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists tool/memory Memory tool and memory providers tool/web Web search and extraction type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants