Conversation
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI (base), Organization UI (inherited) Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughA new sanitization helper 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📝 Coding Plan
Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the security and privacy of the system by centralizing and applying robust message sanitization across critical components. It introduces a dedicated Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant security improvement by centralizing and enhancing message sanitization to prevent leaking sensitive information like file paths and URLs into LLM contexts. The new sanitize_message helper is well-implemented and applied correctly in _handle_fatal_error, checkpoint reconciliation, and compaction summaries. The refactoring is clean, and the accompanying unit and integration tests are comprehensive, covering a wide range of cases and potential bypasses. I have one suggestion to further improve the robustness of the path redaction logic.
| _PATH_PATTERN = re.compile( | ||
| r"[A-Za-z]:\\[^\s,;)\"']+" | ||
| r"|\\\\[^\s,;)\"']+" | ||
| r"|//[^\s,;)\"']{2,}" | ||
| r"|/[^\s,;)\"']{2,}" | ||
| r"|\.\.?/[^\s,;)\"']+" | ||
| ) |
There was a problem hiding this comment.
The current regex for path sanitization does not handle unquoted paths containing spaces. For example, an error message like Cannot open C:\Program Files\app.exe would result in partial redaction (Cannot open [REDACTED_PATH] Files\app.exe), potentially leaking parts of the file path. This is a security weakness.
While perfectly parsing all possible unquoted paths with spaces is very difficult with a single regex, we can improve robustness by explicitly matching quoted paths, which are a common way to represent paths with spaces. This suggestion adds patterns for single- and double-quoted paths before the existing patterns for unquoted paths.
_PATH_PATTERN = re.compile(
r'"(?:[A-Za-z]:\\|\\\\|//|/|\./|\.\./)[^"]*"'
r"|'(?:[A-Za-z]:\\|\\\\|//|/|\./|\.\./)[^']*'"
r"|[A-Za-z]:\\[^\s,;)\"'\"]+"
r"|\\\\[^\s,;)\"'\"]+"
r"|//[^\s,;)\"'\"]{2,}"
r"|/[^\s,;)\"'\"]{2,}"
r"|\.\.?/[^\s,;)\"'\"]+"
)There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/engine/sanitization.py`:
- Around line 10-20: sanitize_message() currently only uses _PATH_PATTERN and
_URL_PATTERN and misses plain credentials (e.g., api_key=, token=, password=,
Authorization: Bearer) — add one or more credential regexes (e.g., for key=value
query/header forms and Authorization: Bearer/.+ and Bearer tokens in
URLs/headers) and apply them in sanitize_message() after/_alongside
_PATH_PATTERN/_URL_PATTERN to replace matches with a consistent placeholder like
"<REDACTED_CREDENTIAL>"; update/remove overlaps with _URL_PATTERN as needed to
avoid double-redaction and ensure functions/classes referenced (_PATH_PATTERN,
_URL_PATTERN, sanitize_message) handle headers, query params, and inline tokens.
- Around line 8-9: The module src/synthorg/engine/sanitization.py is missing the
standard module logger; add the required import and logger initialization by
importing get_logger from synthorg.observability and creating logger =
get_logger(__name__) at module scope so the module follows the convention used
across synthorg; place the import alongside existing imports (e.g., next to
import re) and define logger = get_logger(__name__) near the top of the file.
In `@tests/unit/engine/compaction/test_summarizer.py`:
- Around line 241-313: Add a regression test that verifies snippet truncation
cannot leak a partial path prefix: create a new async test (e.g.
test_assistant_snippet_truncation_does_not_leak_path_prefix) that uses the same
CompactionConfig and make_compaction_callback setup as the existing tests, then
build messages with a very long assistant content where a full path (e.g.
"C:\\Users\\dev\\project\\secrets.yaml") begins just before the 100-char snippet
boundary so truncation would have previously exposed a partial prefix; call
_build_context and the callback, then assert the resulting summary
message.content does not contain any partial path fragment like "C:\\Users" or
the original filename and that "[REDACTED_PATH]" appears instead; reuse symbols
_msg, _build_context, CompactionConfig, make_compaction_callback and the
sample_agent_with_personality fixture so the test matches the existing pattern.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: d61d0284-68ee-4e12-8c4b-9953cb3fca68
📒 Files selected for processing (10)
CLAUDE.mddocs/design/engine.mdsrc/synthorg/engine/agent_engine.pysrc/synthorg/engine/checkpoint/resume.pysrc/synthorg/engine/compaction/summarizer.pysrc/synthorg/engine/sanitization.pytests/unit/engine/checkpoint/test_resume.pytests/unit/engine/compaction/test_summarizer.pytests/unit/engine/test_agent_engine_errors.pytests/unit/engine/test_sanitization.py
| import re | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add module logger initialization to match engine module conventions.
This module has business logic but does not initialize logger = get_logger(__name__).
As per coding guidelines, src/synthorg/**/*.py: “Every module with business logic must have: from synthorg.observability import get_logger then logger = get_logger(__name__).”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/engine/sanitization.py` around lines 8 - 9, The module
src/synthorg/engine/sanitization.py is missing the standard module logger; add
the required import and logger initialization by importing get_logger from
synthorg.observability and creating logger = get_logger(__name__) at module
scope so the module follows the convention used across synthorg; place the
import alongside existing imports (e.g., next to import re) and define logger =
get_logger(__name__) near the top of the file.
| _PATH_PATTERN = re.compile( | ||
| r"[A-Za-z]:\\[^\s,;)\"']+" | ||
| r"|\\\\[^\s,;)\"']+" | ||
| r"|//[^\s,;)\"']{2,}" | ||
| r"|/[^\s,;)\"']{2,}" | ||
| r"|\.\.?/[^\s,;)\"']+" | ||
| ) | ||
| _URL_PATTERN = re.compile( | ||
| r"(?:https?|postgresql|postgres|mysql|redis|mongodb|amqp|ftp|sftp|file)" | ||
| r"://[^\s,;)\"']+" | ||
| ) |
There was a problem hiding this comment.
Add credential/token redaction to the shared sanitizer.
sanitize_message() currently redacts paths/URLs only. Plain secrets like
api_key=..., password=..., or Authorization: Bearer ... can still be
injected into LLM context through all three integration points.
🔒 Proposed patch
_URL_PATTERN = re.compile(
r"(?:https?|postgresql|postgres|mysql|redis|mongodb|amqp|ftp|sftp|file)"
r"://[^\s,;)\"']+"
)
+_SECRET_PATTERN = re.compile(
+ r"(?i)(?:"
+ r"\b(?:api[_-]?key|token|secret|password)\b\s*[:=]\s*[^\s,;)\"']+"
+ r"|\bauthorization\s*:\s*bearer\s+[^\s,;)\"']+"
+ r")"
+)
@@
def sanitize_message(raw: str, *, max_length: int = 200) -> str:
@@
sanitized = _URL_PATTERN.sub("[REDACTED_URL]", raw)
sanitized = _PATH_PATTERN.sub("[REDACTED_PATH]", sanitized)
+ sanitized = _SECRET_PATTERN.sub("[REDACTED_SECRET]", sanitized)
sanitized = "".join(c for c in sanitized[:max_length] if c.isprintable())Also applies to: 36-38
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/engine/sanitization.py` around lines 10 - 20, sanitize_message()
currently only uses _PATH_PATTERN and _URL_PATTERN and misses plain credentials
(e.g., api_key=, token=, password=, Authorization: Bearer) — add one or more
credential regexes (e.g., for key=value query/header forms and Authorization:
Bearer/.+ and Bearer tokens in URLs/headers) and apply them in
sanitize_message() after/_alongside _PATH_PATTERN/_URL_PATTERN to replace
matches with a consistent placeholder like "<REDACTED_CREDENTIAL>";
update/remove overlaps with _URL_PATTERN as needed to avoid double-redaction and
ensure functions/classes referenced (_PATH_PATTERN, _URL_PATTERN,
sanitize_message) handle headers, query params, and inline tokens.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #632 +/- ##
=======================================
Coverage 92.48% 92.49%
=======================================
Files 553 554 +1
Lines 27823 27836 +13
Branches 2669 2670 +1
=======================================
+ Hits 25733 25747 +14
+ Misses 1648 1647 -1
Partials 442 442 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
… compaction summaries Extract shared sanitize_message() helper from _handle_fatal_error's inline regex chain into engine/sanitization.py. Apply to checkpoint resume (error_message before LLM context injection) and compaction summarizer (assistant content snippets before joining into summary). Closes #614 Closes #615 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Broaden Unix path pattern to catch all absolute paths (not just allowlisted prefixes), add UNC path support - Extend URL pattern to cover data-store URI schemes (postgresql, redis, mongodb, etc.) - Reorder URL/path redaction (URLs first) to prevent path pattern from mangling URL scheme separators - Fix summarizer to sanitize full content before truncating (prevents mid-path truncation bypass) - Add integration test for _handle_fatal_error sanitization wiring - Convert parametrize labels to pytest.param IDs, add edge cases - Add pytestmark timeout to summarizer tests - Update CLAUDE.md and docs/design/engine.md for new module Pre-reviewed by 10 agents, 15 findings addressed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… reviewers - Fix overly broad Unix path regex: add (?<!\w) lookbehind to prevent false positives on ratios (100/min), content-types (application/json), and date formats (2024/01/15) - Add quoted Windows path detection for paths with spaces - Fix truncation ordering: cap raw input before redaction to prevent partial redaction tokens like [REDACTED_PA in output - Add max_length negative value validation with ValueError - Handle all-redacted snippets in compaction summarizer: filter out useless "details redacted" placeholders, fall through to generic fallback when all snippets are redacted - Fix "stack traces" claim in engine.md: function only redacts paths/URLs, truncation limits stack trace exposure - Add sanitization mentions to Compaction Hook and Checkpoint Recovery doc sections - Update _build_summary docstring to mention sanitization step - Update max_length docstring to clarify it caps raw input, not output - Fix resume.py inline comment precision (broader than just paths/URLs) - Add 13 new tests: non-standard Unix prefixes, false-positive prevention, quoted Windows paths, additional URL schemes (ftp, sftp, amqp, mysql), file:// URI ordering, negative max_length, truncation boundary regression Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3f30023 to
8094057
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
src/synthorg/engine/sanitization.py (2)
8-9:⚠️ Potential issue | 🟠 MajorInitialize module logger for this business-logic module.
This module introduces engine business logic but does not define the required module logger.
🔧 Proposed fix
import re + +from synthorg.observability import get_logger + +logger = get_logger(__name__)As per coding guidelines,
src/synthorg/**/*.py: “Every module with business logic must have:from synthorg.observability import get_loggerthenlogger = get_logger(__name__).”🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/engine/sanitization.py` around lines 8 - 9, This module is missing the required module logger: add the import "from synthorg.observability import get_logger" at top of src/synthorg/engine/sanitization.py and initialize "logger = get_logger(__name__)" at module scope so business-logic in this file uses the standard get_logger and logger symbols.
10-29:⚠️ Potential issue | 🟠 MajorAdd credential/token redaction in shared sanitizer.
sanitize_message()currently redacts paths/URLs but still allows inline secrets (e.g.,api_key=...,password=..., bearer tokens) to pass through to LLM context.🔒 Proposed fix
_URL_PATTERN = re.compile( r"(?:https?|postgresql|postgres|mysql|redis|mongodb|amqp|ftp|sftp|file)" r"://[^\s,;)\"']+" ) +_SECRET_PATTERN = re.compile( + r"(?i)(?:" + r"\b(?:api[_-]?key|token|secret|password)\b\s*[:=]\s*[^\s,;)\"']+" + r"|\bauthorization\s*:\s*bearer\s+[^\s,;)\"']+" + r")" +) @@ sanitized = _URL_PATTERN.sub("[REDACTED_URL]", capped) sanitized = _PATH_PATTERN.sub("[REDACTED_PATH]", sanitized) + sanitized = _SECRET_PATTERN.sub("[REDACTED_SECRET]", sanitized) sanitized = "".join(c for c in sanitized if c.isprintable())Also applies to: 54-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/synthorg/engine/sanitization.py` around lines 10 - 29, sanitize_message currently only strips paths/URLs via _PATH_PATTERN/_URL_PATTERN but doesn't redact inline credentials or bearer tokens; add one or more regex patterns to match common secrets (e.g., query/key-value pairs like api_key=..., access_token=..., token=..., password=..., secret=..., auth=..., passwd=...), Authorization headers with "Bearer <token>" and header-style "Authorization:Bearer ..." and include them in the sanitizer pipeline so matches are replaced with a stable placeholder like "[REDACTED]"; update sanitize_message to apply these new credential patterns (after or alongside _URL_PATTERN) and ensure functions/variables mentioned (_PATH_PATTERN, _URL_PATTERN, sanitize_message) are the points to modify so any matched secrets are redacted before returning the message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/synthorg/engine/sanitization.py`:
- Around line 50-52: The code raises ValueError for an invalid max_length but
doesn't log the error; before raising, call the module logger (e.g.,
logger.warning or logger.error) to record the invalid-input with context
including the provided max_length and any relevant function name (reference the
max_length check and msg variable in sanitization.py), then raise the same
ValueError; ensure you import or use the existing logger instance in the module
and include a clear message like "invalid max_length" plus the max_length value.
In `@tests/unit/engine/checkpoint/test_resume.py`:
- Around line 128-167: Combine the two nearly identical tests
test_error_with_path_is_sanitized and test_error_with_url_is_sanitized into a
single parametrized test using pytest.mark.parametrize: keep the shared setup
using _make_ctx_json and deserialize_and_reconcile (referencing AgentIdentity
and Task fixtures), and parametrize the error_message input and expected
redaction token (e.g., ("Failed at
C:\\Users\\dev\\secret.key","[REDACTED_PATH]") and ("Timeout calling
https://api.internal.io/v1/completions","[REDACTED_URL]")), then assert the
redaction appears and the raw substring (like "C:\\Users" or "https://") does
not appear; ensure the new test keeps the same agent_id/task_id and checks
last_msg.content as before.
In `@tests/unit/engine/compaction/test_summarizer.py`:
- Around line 241-314: Combine the two near-duplicate tests
test_assistant_snippet_with_path_is_sanitized and
test_assistant_snippet_with_url_is_sanitized into a single parametrized test
using pytest.mark.parametrize: iterate over tuples like (assistant_text,
forbidden_substring, expected_redaction) and reuse the same setup
(CompactionConfig, make_compaction_callback, _build_context, _msg, MessageRole)
inside the test body; assert that forbidden_substring is not in
summary_msg.content and expected_redaction is present. Keep the same
capacity/fill/preserve parameters and preserve_recent_turns, and name the new
test to reflect both cases (e.g.,
test_assistant_snippet_sanitization_parametrized).
In `@tests/unit/engine/test_sanitization.py`:
- Around line 216-222: Add a unit test that exercises sanitize_message's
redaction-expansion edge case: construct input where a short redactable pattern
(e.g., a path like "/ab") falls right before the truncate boundary used by
test_default_max_length/test_custom_max_length (use max_length=200) so that the
replacement token (e.g., "[REDACTED_PATH]") can expand past the limit; call
sanitize_message(raw, max_length=200), assert the redaction token appears in the
result and allow the length to exceed 200 as documented. Reference
sanitize_message and reuse the existing test naming style (e.g.,
test_redaction_near_boundary_may_exceed_max_length) so the new test is
discoverable alongside test_default_max_length and test_custom_max_length.
---
Duplicate comments:
In `@src/synthorg/engine/sanitization.py`:
- Around line 8-9: This module is missing the required module logger: add the
import "from synthorg.observability import get_logger" at top of
src/synthorg/engine/sanitization.py and initialize "logger =
get_logger(__name__)" at module scope so business-logic in this file uses the
standard get_logger and logger symbols.
- Around line 10-29: sanitize_message currently only strips paths/URLs via
_PATH_PATTERN/_URL_PATTERN but doesn't redact inline credentials or bearer
tokens; add one or more regex patterns to match common secrets (e.g.,
query/key-value pairs like api_key=..., access_token=..., token=...,
password=..., secret=..., auth=..., passwd=...), Authorization headers with
"Bearer <token>" and header-style "Authorization:Bearer ..." and include them in
the sanitizer pipeline so matches are replaced with a stable placeholder like
"[REDACTED]"; update sanitize_message to apply these new credential patterns
(after or alongside _URL_PATTERN) and ensure functions/variables mentioned
(_PATH_PATTERN, _URL_PATTERN, sanitize_message) are the points to modify so any
matched secrets are redacted before returning the message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: cee47d61-b3bf-419d-b67c-de6bd421113d
📒 Files selected for processing (10)
CLAUDE.mddocs/design/engine.mdsrc/synthorg/engine/agent_engine.pysrc/synthorg/engine/checkpoint/resume.pysrc/synthorg/engine/compaction/summarizer.pysrc/synthorg/engine/sanitization.pytests/unit/engine/checkpoint/test_resume.pytests/unit/engine/compaction/test_summarizer.pytests/unit/engine/test_agent_engine_errors.pytests/unit/engine/test_sanitization.py
| if max_length < 0: | ||
| msg = f"max_length must be >= 0, got {max_length}" | ||
| raise ValueError(msg) |
There was a problem hiding this comment.
Log invalid-input error path before raising.
On Line 50-Line 52, ValueError is raised without a WARNING/ERROR log entry and context.
As per coding guidelines, src/synthorg/**/*.py: “All error paths must log at WARNING or ERROR with context before raising.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/synthorg/engine/sanitization.py` around lines 50 - 52, The code raises
ValueError for an invalid max_length but doesn't log the error; before raising,
call the module logger (e.g., logger.warning or logger.error) to record the
invalid-input with context including the provided max_length and any relevant
function name (reference the max_length check and msg variable in
sanitization.py), then raise the same ValueError; ensure you import or use the
existing logger instance in the module and include a clear message like "invalid
max_length" plus the max_length value.
- Parametrize duplicate resume sanitization tests (path + URL) - Parametrize duplicate compaction sanitization tests (path + URL) - Add redaction-expansion boundary test verifying output may exceed max_length when redaction tokens expand (documented behavior) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🤖 I have created a release *beep* *boop* --- ## [0.3.10](v0.3.9...v0.3.10) (2026-03-20) ### Bug Fixes * **ci:** generate required secrets in DAST workflow ([#623](#623)) ([6ae297f](6ae297f)) * **cli:** doctor image check reads compose file and fix API docs URL ([#625](#625)) ([5202e53](5202e53)) * **engine:** sanitize error messages in checkpoint reconciliation and compaction summaries ([#632](#632)) ([5394ed7](5394ed7)) * mitigate TOCTOU DNS rebinding gap in git clone SSRF prevention ([#633](#633)) ([1846f6e](1846f6e)) * resolve post-startup log loss, add provider model discovery, and improve setup wizard UX ([#634](#634)) ([2df8d11](2df8d11)) ### Maintenance * bump https://github.com/astral-sh/ruff-pre-commit from v0.15.6 to 0.15.7 ([#628](#628)) ([c641d2c](c641d2c)) * bump python from `584e89d` to `fb83750` in /docker/backend ([#627](#627)) ([1a36eca](1a36eca)) * bump python from `584e89d` to `fb83750` in /docker/sandbox ([#629](#629)) ([fd3e69a](fd3e69a)) * bump the minor-and-patch group across 2 directories with 3 updates ([#630](#630)) ([67d14c4](67d14c4)) * bump the minor-and-patch group with 2 updates ([#631](#631)) ([2e51b60](2e51b60)) * **ci:** add timeout-minutes, harden fuzz script, extend CVE audit ([#626](#626)) ([25420e2](25420e2)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary
sanitize_message()helper from_handle_fatal_error's inline regex chain intoengine/sanitization.py-- redacts file paths, URLs, UNC paths, and non-HTTP URI schemes, strips non-printable chars, enforces length limiterror_messagein checkpoint reconciliation before injecting into LLM context (fix(engine): sanitize error messages in checkpoint reconciliation #615)_handle_fatal_errorto use the shared helper (no behavior change)Closes #614
Closes #615
Test plan
sanitize_message()covering paths (Windows, Unix, UNC, relative), URLs (HTTP, data-store URIs), mixed content, truncation, non-printable stripping, fallback, edge cases, and passthrough_handle_fatal_errorsanitization wiringReview coverage
Pre-reviewed by 10 agents (docs-consistency, code-reviewer, python-reviewer, pr-test-analyzer, conventions-enforcer, logging-audit, resilience-audit, security-reviewer, comment-analyzer, issue-resolution-verifier). 15 findings addressed.
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Documentation
Tests