Skip to content

fix: resolve open CodeQL alerts (workflow permissions, ReDoS, path-injection)#4388

Merged
robfrank merged 2 commits into
mainfrom
fix/codeql-workflow-perms-redos-pathinjection
May 28, 2026
Merged

fix: resolve open CodeQL alerts (workflow permissions, ReDoS, path-injection)#4388
robfrank merged 2 commits into
mainfrom
fix/codeql-workflow-perms-redos-pathinjection

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Summary

Resolves the 18 open CodeQL alerts on main (filtered to tool:CodeQL):

  • actions/missing-workflow-permissions (15, medium) - added least-privilege permissions blocks to mvn-test.yml, mvn-deploy.yml, meterian.yml and license-compliance.yml. Top-level contents: read, with per-job elevations only where actions require them:
    • checks: write for the dorny/test-reporter jobs (unit/slow-unit/integration/ha-integration/java-e2e, and the already-restricted studio-e2e so its reporter keeps posting),
    • security-events: write for the Meterian SARIF upload,
    • pull-requests: write for the license-compliance PR comment.
  • py/redos (2, high) - the docker-compose sanitizing regex in bindings/python/examples/11_vector_index_build.py and 12_vector_search.py used (?ms)...(?:.*\n)*?..., where DOTALL lets . consume the \n delimiter and the loop becomes ambiguous (exponential backtracking). Rewrote so every newline is explicit and the loop body is [^\n]*\n. Output is byte-identical on valid input; pathological input now runs in sub-millisecond time.
  • java/path-injection (1, high) - Neo4jImporter derives on-disk bucket file names from untrusted import labels. The blocklist guard added in fix: resolve CodeQL Java code-scanning alerts at their true sources #4383 did not clear the alert because URLEncoder.encode encodes / and \ but leaves .. intact, and CodeQL does not accept that guard shape as a sanitizer. Replaced it with an ASCII allow-list Pattern.compile("[A-Za-z0-9_ -]+") validated via matcher(label).matches(), which CodeQL's DirectoryCharactersGuard recognizes as a path-traversal sanitizer.

Note / trade-off

The path-injection fix is intentionally ASCII-only for labels - this is what makes CodeQL recognize the sanitizer (predefined classes like \p{L} contain a backslash that the query's heuristic reads as permitting \). Existing Neo4j exports use ASCII identifier labels, so behavior is unchanged; a non-ASCII label is now rejected with a clear ImportException.

Test plan

  • mvn -pl integration test -Dtest=Neo4jImporterIT - all 8 tests pass, including rejectNodeLabelWithPathTraversal, rejectEdgeLabelWithPathTraversal, and the large/multi-type import tests.
  • Python examples py_compile clean; verified the new regex produces identical output to the old on a representative compose file and is ReDoS-resistant on adversarial input.
  • All four workflow YAML files validated; permissions verified against each job's actual actions.
  • Confirm CodeQL re-scan closes all 18 alerts after merge.

…jection)

Address the 18 open CodeQL alerts on main:

- actions/missing-workflow-permissions (15): add least-privilege
  permissions blocks to mvn-test, mvn-deploy, meterian and
  license-compliance workflows. Top-level contents:read plus per-job
  checks:write for dorny/test-reporter jobs, security-events:write for
  the Meterian SARIF upload and pull-requests:write for the license PR
  comment.
- py/redos (2): rewrite the docker-compose sanitizing regex in the
  vector index/search Python examples so newlines are explicit and the
  loop body cannot consume the delimiter, eliminating exponential
  backtracking. Output is unchanged.
- java/path-injection (1): replace the Neo4jImporter label blocklist
  with an ASCII allow-list (Pattern [A-Za-z0-9_ -]+), which CodeQL
  recognises as a path-traversal sanitizer. The previous blocklist did
  not clear the alert because URLEncoder.encode leaves '..' intact.
@codacy-production

codacy-production Bot commented May 28, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics -4 complexity

Metric Results
Complexity -4

View in Codacy

🟢 Coverage 100.00% diff coverage · -7.37% coverage variation

Metric Results
Coverage variation -7.37% coverage variation
Diff coverage 100.00% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (3f6c078) 127835 93820 73.39%
Head commit (5e9b2ce) 159514 (+31679) 105316 (+11496) 66.02% (-7.37%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4388) 2 2 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates regular expressions in Python examples to prevent ReDoS and tightens label validation in the Java Neo4j importer using a strict allow-list pattern to prevent path traversal. The review feedback highlights that the updated Python regexes will fail on Windows-style CRLF line endings and suggests a robust pattern to handle both LF and CRLF. Additionally, for the Java importer, the reviewer recommends restricting the label validation pattern to disallow leading, trailing, or consecutive spaces, which could otherwise cause filesystem issues on Windows.

Comment on lines +1060 to 1065
sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", raw)
sanitized = re.sub(
r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)",
r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)",
r"\1",
sanitized,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The updated regular expressions use [ \t]* and explicit \n instead of \s* and \s*\n. While this successfully prevents ReDoS, it will fail to match on files with Windows-style line endings (\r\n) because [ \t]* does not match \r. To ensure robustness across different platforms, use \r?\n and [^\r\n]* to correctly handle both LF and CRLF line endings.

Suggested change
sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", raw)
sanitized = re.sub(
r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)",
r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)",
r"\1",
sanitized,
)
sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\r\n]*\r?\n", "", raw)
sanitized = re.sub(
r"(?m)(^[ \t]*networks:[ \t]*\r?\n(?:[^\r\n]*\r?\n)*?^[ \t]*milvus:[ \t]*\r?\n)([ \t]*name:[ \t]*milvus[ \t]*\r?\n)",
r"\1",
sanitized,
)

Comment on lines +1251 to 1257
sanitized = re.sub(r"(?m)^[ \t]*version[ \t]*:[ \t]*[^\n]*\n", "", raw)
sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", sanitized)
sanitized = re.sub(
r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)",
r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)",
r"\1",
sanitized,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The updated regular expressions use [ \t]* and explicit \n instead of \s* and \s*\n. While this successfully prevents ReDoS, it will fail to match on files with Windows-style line endings (\r\n) because [ \t]* does not match \r. To ensure robustness across different platforms, use \r?\n and [^\r\n]* to correctly handle both LF and CRLF line endings.

Suggested change
sanitized = re.sub(r"(?m)^[ \t]*version[ \t]*:[ \t]*[^\n]*\n", "", raw)
sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", sanitized)
sanitized = re.sub(
r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)",
r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)",
r"\1",
sanitized,
)
sanitized = re.sub(r"(?m)^[ \t]*version[ \t]*:[ \t]*[^\r\n]*\r?\n", "", raw)
sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\r\n]*\r?\n", "", sanitized)
sanitized = re.sub(
r"(?m)(^[ \t]*networks:[ \t]*\r?\n(?:[^\r\n]*\r?\n)*?^[ \t]*milvus:[ \t]*\r?\n)([ \t]*name:[ \t]*milvus[ \t]*\r?\n)",
r"\1",
sanitized,
)

private final static SimpleDateFormat dateTimeISO8601Format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
// Allow-list for imported labels: ASCII letters, digits, underscore, hyphen and space only. Excluding '.', '/' and '\'
// makes path-traversal sequences structurally impossible in the on-disk bucket file names derived from these labels.
private final static Pattern SAFE_LABEL = Pattern.compile("[A-Za-z0-9_ -]+");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The SAFE_LABEL pattern [A-Za-z0-9_ -]+ allows leading, trailing, and consecutive spaces. Since these labels are used to derive on-disk bucket filenames, labels with leading/trailing spaces or only spaces can cause filesystem errors or compatibility issues (especially on Windows, which strips trailing spaces and prohibits certain space-only names). Restricting the pattern to disallow leading/trailing/consecutive spaces improves robustness.

Suggested change
private final static Pattern SAFE_LABEL = Pattern.compile("[A-Za-z0-9_ -]+");
private final static Pattern SAFE_LABEL = Pattern.compile("[A-Za-z0-9_-]+( [A-Za-z0-9_-]+)*");

@claude

claude Bot commented May 28, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR resolves 18 open CodeQL alerts across three categories: GitHub Actions workflow over-permissions (15 alerts), Python ReDoS vulnerabilities (2 alerts), and a Java path-injection issue in Neo4jImporter (1 alert). The changes are focused, well-scoped, and the PR description clearly explains each trade-off.


GitHub Actions Workflow Permissions

Positive. All four workflow files correctly adopt the pattern of a restrictive top-level default (contents: read) with per-job elevations only where the job's actions require them. The job-level checks: write additions for dorny/test-reporter jobs follow least-privilege correctly, and mvn-deploy.yml only needs contents: read because deploy credentials travel via secrets, not GITHUB_TOKEN.


Python ReDoS Fixes

Positive. The fix is technically correct. Removing the s (DOTALL) flag and replacing .* with [^\n]* eliminates the ambiguity that caused catastrophic backtracking: with [^\n]*\n, the newline delimiter cannot be consumed by the body alternation, so the NFA has no exponential branch to explore. Output is byte-identical on valid YAML input.

Minor note. The original \s* matched any Unicode whitespace (including \r, \f, etc.); the replacement [ \t]* matches only horizontal ASCII whitespace. For YAML compose files fetched from a well-known URL this is fine in practice, but worth noting in case the tool ever processes CRLF content from Windows hosts.


Java Path Injection Fix (Neo4jImporter)

Positive overall. Replacing the blocklist with a strict allow-list is the right direction - blocklists require enumerating every bad character, allow-lists are safe by construction. CodeQL rightly recognizes matcher(label).matches() as a DirectoryCharactersGuard, so the alert clears cleanly.

Bug: error message does not match the new behaviour (medium)

The exception message at line 692 still says "Invalid label: must not contain path separators or '..'", which describes the old blocklist semantics. The new allow-list rejects far more than just path separators - it also rejects labels containing Unicode letters, accented characters, dots, colons, brackets, etc. A user with a label like Nodo or MyType.v2 will see an error blaming "path separators" when neither is present, which is confusing.

Suggested replacement:

throw new ImportException("Invalid label '" + label + "': only ASCII letters, digits, underscore, hyphen and space are allowed");

The test assertions at lines 333 and 352 in Neo4jImporterIT use hasMessageContaining("path separators") and will need updating to match the new message text.

Minor: regex character class readability

In Pattern.compile("[A-Za-z0-9_ -]+"), the hyphen at the end of the class (before ]) is treated as a literal, which is correct. However, _ - can look like a range from space to something, and some tooling flags trailing hyphens inconsistently. The comment above mentions hyphen explicitly, which helps. An unambiguous alternative: [-A-Za-z0-9_ ]+ or [A-Za-z0-9_\\- ]+.

Missing test: the stricter allow-list behaviour is not covered

The two existing tests (rejectNodeLabelWithPathTraversal, rejectEdgeLabelWithPathTraversal) use labels like ../../etc/evil, which both the old blocklist and new allow-list reject. There is no test for a label that the old code accepted but the new allow-list rejects - for example a label with a Unicode character or a dot (MyType.v2). Adding one regression test for this case would verify the allow-list tightening and lock in the expected exception message.


Summary

Area Status Blocking?
Workflow permissions Clean, least-privilege, correct No
Python ReDoS fix Correct, linear-time on adversarial input No
Error message mismatch Message describes old blocklist, not new allow-list Yes - misleads users; test assertion passes only by coincidence
SAFE_LABEL regex readability Hyphen placement is correct but non-obvious No (nit)
Missing allow-list regression test No test for labels rejected by allow-list but not blocklist Recommended

The path-injection fix and workflow permissions are the right approach. The main item to address is aligning the exception message (and test assertion) with what the allow-list actually enforces.

@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.35%. Comparing base (3f6c078) to head (5e9b2ce).

Files with missing lines Patch % Lines
...m/arcadedb/integration/importer/Neo4jImporter.java 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4388      +/-   ##
============================================
+ Coverage     64.32%   64.35%   +0.02%     
- Complexity      424      431       +7     
============================================
  Files          1647     1647              
  Lines        127835   127836       +1     
  Branches      27403    27403              
============================================
+ Hits          82233    82263      +30     
+ Misses        34046    34020      -26     
+ Partials      11556    11553       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Since engine PR #4343, findNeighborsFromVector exposes a squared
Euclidean distance (lower is better) for the EUCLIDEAN metric, matching
COSINE and DOT_PRODUCT which already return distances. The Python
binding still treated EUCLIDEAN as a similarity: it sorted EUCLIDEAN
results in reverse (returning the worst matches first) and
test_lsm_euclidean_distance asserted a 1/(1+d²) similarity.

Drop the EUCLIDEAN reverse-sort special case so every metric sorts
ascending by distance, update the class docstring, and assert the
squared-distance values (origin -> 0.0, point -> 25.0) in the test.

Verified against freshly built 26.6.1 engine jars: the full
tests/test_vector.py suite passes (38 passed, 4 skipped).
@claude

claude Bot commented May 28, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR resolves 18 open CodeQL alerts across three categories: workflow permissions (15), Python ReDoS (2), and Java path-injection (1). It also bundles an undocumented behavioral change to the Python vector embedding API. The security fixes are well-motivated and technically sound; one issue requires attention before merging.


Workflow Permissions (.github/workflows/*.yml) - Approved

The layered permission model is correct:

  • Top-level contents: read establishes a restrictive default that caps every job.
  • Per-job elevations (checks: write, security-events: write, pull-requests: write) are scoped precisely to the jobs that need them.
  • The pre-existing studio-e2e job that already had contents: read simply gains checks: write — no regression there.

Python ReDoS Fix (11_vector_index_build.py, 12_vector_search.py) - Approved

The root cause is correctly identified: (?ms) lets . swallow \n, making (?:.*\n)*? ambiguous because the \n delimiter can also be consumed by ., forcing the engine into exponential backtracking on adversarial input.

The fix is the right approach:

  • \s*[ \t]* (whitespace without newlines)
  • .*[^\n]* (match characters up to, but not including, newline)
  • Drop the (?s) (DOTALL) flag entirely

Each [^\n]*\n token is now unambiguous — there is exactly one way to match it — eliminating the backtracking. The PR description's claim that output is byte-identical on valid input is correct.


Java Path-Injection Fix (Neo4jImporter.java) - Minor issue

The allow-list approach ([A-Za-z0-9_ -]+) is strictly more secure than the deny-list it replaces and is the correct fix for CodeQL's DirectoryCharactersGuard recognizer. Compiling the pattern as a static final constant is good practice.

Issue: error message no longer matches the validation constraint.

The existing error message — "Invalid label: must not contain path separators or '..'" — described the old deny-list. The new allow-list rejects far more characters: any period (.), any non-ASCII character, @, #, etc. A user whose label is "Person.Employee" or "Ärzte" will get a message that doesn't explain why their label was rejected.

Suggested fix:

throw new ImportException("Invalid label '" + label + "': only ASCII letters, digits, underscore, hyphen and space are allowed");

Minor: empty string behaviour changed. The old code accepted empty strings (condition required at least one forbidden character); [A-Za-z0-9_ -]+ rejects them (+ requires ≥1 character). This is almost certainly the right behaviour, but worth noting.

Note on the character class: [A-Za-z0-9_ -] — in Java regex, - immediately before ] is a literal hyphen, so the class is A-Z | a-z | 0-9 | _ | space | -. This is correct.


Python Vector API change (vector.py, test_vector.py) - Needs documentation

This change is not mentioned in the PR description but is a breaking behavioral change:

Old behavior New behavior
EUCLIDEAN score Similarity: 1 / (1 + d²) — higher is better Squared distance: — lower is better
Sort order Descending by similarity Ascending by distance

The _uses_reverse_score_sort() helper is removed entirely; _sort_results now always sorts ascending. The nearest-neighbor ranking is unchanged (nearest item is always first), but the numeric score values returned to callers change completely for EUCLIDEAN queries. Any user code comparing or thresholding these scores will break silently.

This belongs in the PR description with an explicit "breaking change" callout and should probably appear in the project's changelog / release notes. If this is intentional, please add a note to the description explaining the motivation (the code comment says the engine already returns distances natively, so the old conversion was superfluous — that rationale should be visible at the PR level).


Summary

Area Status
Workflow permissions Approved
Python ReDoS Approved
Java path-injection logic Approved
Java error message Needs update
Vector API breaking change Needs PR description update

@robfrank robfrank merged commit 949aae1 into main May 28, 2026
131 of 142 checks passed
@robfrank robfrank deleted the fix/codeql-workflow-perms-redos-pathinjection branch May 28, 2026 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant