fix: resolve open CodeQL alerts (workflow permissions, ReDoS, path-injection)#4388
Conversation
…jection) Address the 18 open CodeQL alerts on main: - actions/missing-workflow-permissions (15): add least-privilege permissions blocks to mvn-test, mvn-deploy, meterian and license-compliance workflows. Top-level contents:read plus per-job checks:write for dorny/test-reporter jobs, security-events:write for the Meterian SARIF upload and pull-requests:write for the license PR comment. - py/redos (2): rewrite the docker-compose sanitizing regex in the vector index/search Python examples so newlines are explicit and the loop body cannot consume the delimiter, eliminating exponential backtracking. Output is unchanged. - java/path-injection (1): replace the Neo4jImporter label blocklist with an ASCII allow-list (Pattern [A-Za-z0-9_ -]+), which CodeQL recognises as a path-traversal sanitizer. The previous blocklist did not clear the alert because URLEncoder.encode leaves '..' intact.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | -4 |
🟢 Coverage 100.00% diff coverage · -7.37% coverage variation
Metric Results Coverage variation ✅ -7.37% coverage variation Diff coverage ✅ 100.00% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (3f6c078) 127835 93820 73.39% Head commit (5e9b2ce) 159514 (+31679) 105316 (+11496) 66.02% (-7.37%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4388) 2 2 100.00% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request updates regular expressions in Python examples to prevent ReDoS and tightens label validation in the Java Neo4j importer using a strict allow-list pattern to prevent path traversal. The review feedback highlights that the updated Python regexes will fail on Windows-style CRLF line endings and suggests a robust pattern to handle both LF and CRLF. Additionally, for the Java importer, the reviewer recommends restricting the label validation pattern to disallow leading, trailing, or consecutive spaces, which could otherwise cause filesystem issues on Windows.
| sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", raw) | ||
| sanitized = re.sub( | ||
| r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)", | ||
| r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)", | ||
| r"\1", | ||
| sanitized, | ||
| ) |
There was a problem hiding this comment.
The updated regular expressions use [ \t]* and explicit \n instead of \s* and \s*\n. While this successfully prevents ReDoS, it will fail to match on files with Windows-style line endings (\r\n) because [ \t]* does not match \r. To ensure robustness across different platforms, use \r?\n and [^\r\n]* to correctly handle both LF and CRLF line endings.
| sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", raw) | |
| sanitized = re.sub( | |
| r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)", | |
| r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)", | |
| r"\1", | |
| sanitized, | |
| ) | |
| sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\r\n]*\r?\n", "", raw) | |
| sanitized = re.sub( | |
| r"(?m)(^[ \t]*networks:[ \t]*\r?\n(?:[^\r\n]*\r?\n)*?^[ \t]*milvus:[ \t]*\r?\n)([ \t]*name:[ \t]*milvus[ \t]*\r?\n)", | |
| r"\1", | |
| sanitized, | |
| ) |
| sanitized = re.sub(r"(?m)^[ \t]*version[ \t]*:[ \t]*[^\n]*\n", "", raw) | ||
| sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", sanitized) | ||
| sanitized = re.sub( | ||
| r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)", | ||
| r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)", | ||
| r"\1", | ||
| sanitized, | ||
| ) |
There was a problem hiding this comment.
The updated regular expressions use [ \t]* and explicit \n instead of \s* and \s*\n. While this successfully prevents ReDoS, it will fail to match on files with Windows-style line endings (\r\n) because [ \t]* does not match \r. To ensure robustness across different platforms, use \r?\n and [^\r\n]* to correctly handle both LF and CRLF line endings.
| sanitized = re.sub(r"(?m)^[ \t]*version[ \t]*:[ \t]*[^\n]*\n", "", raw) | |
| sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\n]*\n", "", sanitized) | |
| sanitized = re.sub( | |
| r"(?ms)(^\s*networks:\s*\n(?:.*\n)*?^\s*milvus:\s*\n)(\s*name:\s*milvus\s*\n)", | |
| r"(?m)(^[ \t]*networks:[ \t]*\n(?:[^\n]*\n)*?^[ \t]*milvus:[ \t]*\n)([ \t]*name:[ \t]*milvus[ \t]*\n)", | |
| r"\1", | |
| sanitized, | |
| ) | |
| sanitized = re.sub(r"(?m)^[ \t]*version[ \t]*:[ \t]*[^\r\n]*\r?\n", "", raw) | |
| sanitized = re.sub(r"(?m)^[ \t]*container_name:[ \t]*[^\r\n]*\r?\n", "", sanitized) | |
| sanitized = re.sub( | |
| r"(?m)(^[ \t]*networks:[ \t]*\r?\n(?:[^\r\n]*\r?\n)*?^[ \t]*milvus:[ \t]*\r?\n)([ \t]*name:[ \t]*milvus[ \t]*\r?\n)", | |
| r"\1", | |
| sanitized, | |
| ) |
| private final static SimpleDateFormat dateTimeISO8601Format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); | ||
| // Allow-list for imported labels: ASCII letters, digits, underscore, hyphen and space only. Excluding '.', '/' and '\' | ||
| // makes path-traversal sequences structurally impossible in the on-disk bucket file names derived from these labels. | ||
| private final static Pattern SAFE_LABEL = Pattern.compile("[A-Za-z0-9_ -]+"); |
There was a problem hiding this comment.
The SAFE_LABEL pattern [A-Za-z0-9_ -]+ allows leading, trailing, and consecutive spaces. Since these labels are used to derive on-disk bucket filenames, labels with leading/trailing spaces or only spaces can cause filesystem errors or compatibility issues (especially on Windows, which strips trailing spaces and prohibits certain space-only names). Restricting the pattern to disallow leading/trailing/consecutive spaces improves robustness.
| private final static Pattern SAFE_LABEL = Pattern.compile("[A-Za-z0-9_ -]+"); | |
| private final static Pattern SAFE_LABEL = Pattern.compile("[A-Za-z0-9_-]+( [A-Za-z0-9_-]+)*"); |
Code ReviewOverviewThis PR resolves 18 open CodeQL alerts across three categories: GitHub Actions workflow over-permissions (15 alerts), Python ReDoS vulnerabilities (2 alerts), and a Java path-injection issue in GitHub Actions Workflow PermissionsPositive. All four workflow files correctly adopt the pattern of a restrictive top-level default ( Python ReDoS FixesPositive. The fix is technically correct. Removing the Minor note. The original Java Path Injection Fix (Neo4jImporter)Positive overall. Replacing the blocklist with a strict allow-list is the right direction - blocklists require enumerating every bad character, allow-lists are safe by construction. CodeQL rightly recognizes Bug: error message does not match the new behaviour (medium) The exception message at line 692 still says "Invalid label: must not contain path separators or '..'", which describes the old blocklist semantics. The new allow-list rejects far more than just path separators - it also rejects labels containing Unicode letters, accented characters, dots, colons, brackets, etc. A user with a label like Suggested replacement: throw new ImportException("Invalid label '" + label + "': only ASCII letters, digits, underscore, hyphen and space are allowed");The test assertions at lines 333 and 352 in Minor: regex character class readability In Missing test: the stricter allow-list behaviour is not covered The two existing tests ( Summary
The path-injection fix and workflow permissions are the right approach. The main item to address is aligning the exception message (and test assertion) with what the allow-list actually enforces. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4388 +/- ##
============================================
+ Coverage 64.32% 64.35% +0.02%
- Complexity 424 431 +7
============================================
Files 1647 1647
Lines 127835 127836 +1
Branches 27403 27403
============================================
+ Hits 82233 82263 +30
+ Misses 34046 34020 -26
+ Partials 11556 11553 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Since engine PR #4343, findNeighborsFromVector exposes a squared Euclidean distance (lower is better) for the EUCLIDEAN metric, matching COSINE and DOT_PRODUCT which already return distances. The Python binding still treated EUCLIDEAN as a similarity: it sorted EUCLIDEAN results in reverse (returning the worst matches first) and test_lsm_euclidean_distance asserted a 1/(1+d²) similarity. Drop the EUCLIDEAN reverse-sort special case so every metric sorts ascending by distance, update the class docstring, and assert the squared-distance values (origin -> 0.0, point -> 25.0) in the test. Verified against freshly built 26.6.1 engine jars: the full tests/test_vector.py suite passes (38 passed, 4 skipped).
Code ReviewOverviewThis PR resolves 18 open CodeQL alerts across three categories: workflow permissions (15), Python ReDoS (2), and Java path-injection (1). It also bundles an undocumented behavioral change to the Python vector embedding API. The security fixes are well-motivated and technically sound; one issue requires attention before merging. Workflow Permissions (
|
| Old behavior | New behavior | |
|---|---|---|
| EUCLIDEAN score | Similarity: 1 / (1 + d²) — higher is better |
Squared distance: d² — lower is better |
| Sort order | Descending by similarity | Ascending by distance |
The _uses_reverse_score_sort() helper is removed entirely; _sort_results now always sorts ascending. The nearest-neighbor ranking is unchanged (nearest item is always first), but the numeric score values returned to callers change completely for EUCLIDEAN queries. Any user code comparing or thresholding these scores will break silently.
This belongs in the PR description with an explicit "breaking change" callout and should probably appear in the project's changelog / release notes. If this is intentional, please add a note to the description explaining the motivation (the code comment says the engine already returns distances natively, so the old conversion was superfluous — that rationale should be visible at the PR level).
Summary
| Area | Status |
|---|---|
| Workflow permissions | Approved |
| Python ReDoS | Approved |
| Java path-injection logic | Approved |
| Java error message | Needs update |
| Vector API breaking change | Needs PR description update |
Summary
Resolves the 18 open CodeQL alerts on
main(filtered totool:CodeQL):actions/missing-workflow-permissions(15, medium) - added least-privilegepermissionsblocks tomvn-test.yml,mvn-deploy.yml,meterian.ymlandlicense-compliance.yml. Top-levelcontents: read, with per-job elevations only where actions require them:checks: writefor thedorny/test-reporterjobs (unit/slow-unit/integration/ha-integration/java-e2e, and the already-restricted studio-e2e so its reporter keeps posting),security-events: writefor the Meterian SARIF upload,pull-requests: writefor the license-compliance PR comment.py/redos(2, high) - the docker-compose sanitizing regex inbindings/python/examples/11_vector_index_build.pyand12_vector_search.pyused(?ms)...(?:.*\n)*?..., where DOTALL lets.consume the\ndelimiter and the loop becomes ambiguous (exponential backtracking). Rewrote so every newline is explicit and the loop body is[^\n]*\n. Output is byte-identical on valid input; pathological input now runs in sub-millisecond time.java/path-injection(1, high) -Neo4jImporterderives on-disk bucket file names from untrusted import labels. The blocklist guard added in fix: resolve CodeQL Java code-scanning alerts at their true sources #4383 did not clear the alert becauseURLEncoder.encodeencodes/and\but leaves..intact, and CodeQL does not accept that guard shape as a sanitizer. Replaced it with an ASCII allow-listPattern.compile("[A-Za-z0-9_ -]+")validated viamatcher(label).matches(), which CodeQL'sDirectoryCharactersGuardrecognizes as a path-traversal sanitizer.Note / trade-off
The path-injection fix is intentionally ASCII-only for labels - this is what makes CodeQL recognize the sanitizer (predefined classes like
\p{L}contain a backslash that the query's heuristic reads as permitting\). Existing Neo4j exports use ASCII identifier labels, so behavior is unchanged; a non-ASCII label is now rejected with a clearImportException.Test plan
mvn -pl integration test -Dtest=Neo4jImporterIT- all 8 tests pass, includingrejectNodeLabelWithPathTraversal,rejectEdgeLabelWithPathTraversal, and the large/multi-type import tests.py_compileclean; verified the new regex produces identical output to the old on a representative compose file and is ReDoS-resistant on adversarial input.