Skip to content

fix(#4317): pass decompressedLength instead of data.size() in LZ4Compression.compress(Binary)#4322

Merged
robfrank merged 3 commits into
mainfrom
fix/4317-lz4-compression-srclen-bug
May 25, 2026
Merged

fix(#4317): pass decompressedLength instead of data.size() in LZ4Compression.compress(Binary)#4322
robfrank merged 3 commits into
mainfrom
fix/4317-lz4-compression-srclen-bug

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Closes #4317

Summary

LZ4Compression.compress(Binary) was passing data.size() (total backing-array length) as the srcLen argument to the LZ4 compressor. The correct value — already computed two lines above as decompressedLength = data.size() - data.position() — was ignored. When data.position() > 0, the compressor read past the intended slice, causing either ArrayIndexOutOfBoundsException or silent output corruption on every compress→decompress round-trip.

The companion decompress(Binary, …) overload already subtracted position correctly, making the asymmetry the direct source of corruption.

Change: replace data.size() with decompressedLength at the single call site in LZ4Compression.java.

Test plan

  • New regression test CompressionTest#compressionWithNonZeroPosition: creates a Binary with a non-zero position (simulating a partially-consumed buffer), compresses it, decompresses, and asserts byte-for-byte equality. Confirmed failing before the fix (ArrayIndexOutOfBoundsException) and passing after.
  • Existing CompressionTest#compression and #emptyCompression continue to pass.
  • Run: mvn test -Dtest=CompressionTest -pl engine

🤖 Generated with Claude Code

…ression.compress(Binary)

When data.position() > 0, the compressor received a srcLen larger than the destination
buffer, causing ArrayIndexOutOfBoundsException or silent corruption on every compress
→ decompress round-trip with a non-zero position.
@codacy-production

codacy-production Bot commented May 25, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Coverage 100.00% diff coverage · -7.34% coverage variation

Metric Results
Coverage variation -7.34% coverage variation
Diff coverage 100.00% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (d1b1632) 127537 93311 73.16%
Head commit (7b01785) 159215 (+31678) 104803 (+11492) 65.82% (-7.34%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4322) 2 2 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a bug in LZ4Compression.compress where the total size of the backing array was incorrectly used as the source length instead of the actual data length to be compressed. The changes include a one-line fix in the compression logic, a new regression test, and documentation of the issue. Review feedback indicates that the fix should also account for the internal array offset (getContentBeginOffset()) to correctly handle Binary slices, and suggests updating the regression test to use sliced buffers to ensure robustness against this offset issue.

Comment thread engine/src/main/java/com/arcadedb/compression/LZ4Compression.java Outdated
Comment on lines +57 to +61
buffer.position(prefix.length);

final int decompressedLength = buffer.size() - buffer.position();
final Binary compressed = CompressionFactory.getDefault().compress(buffer);
final Binary decompressed = CompressionFactory.getDefault().decompress(compressed, decompressedLength);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure the fix is robust and correctly handles Binary slices (which have a non-zero internal array offset), it is recommended to use buffer.slice() in the regression test. This would have exposed the offset bug identified in LZ4Compression.java.

Suggested change
buffer.position(prefix.length);
final int decompressedLength = buffer.size() - buffer.position();
final Binary compressed = CompressionFactory.getDefault().compress(buffer);
final Binary decompressed = CompressionFactory.getDefault().decompress(compressed, decompressedLength);
final Binary slice = buffer.slice(prefix.length);
final int decompressedLength = slice.size();
final Binary compressed = CompressionFactory.getDefault().compress(slice);
final Binary decompressed = CompressionFactory.getDefault().decompress(compressed, decompressedLength);

@claude

claude Bot commented May 25, 2026

Copy link
Copy Markdown

Code Review

Overview: This PR fixes a bug in LZ4Compression.compress(Binary) where data.size() was passed as srcLen instead of decompressedLength. When position > 0, the compressor read past the intended slice causing ArrayIndexOutOfBoundsException or silent corruption. Fix is one line with a regression test.


The Fix (LZ4Compression.java): Correct and minimal. decompressedLength was already computed two lines earlier but not used at the call site. The change is exactly right.


Test Coverage (CompressionTest.java):

The new test correctly reproduces the bug scenario. Two observations:

  1. Assertion precision: startsWith(payload) on the raw backing array is technically sufficient paired with the size check, but exact equality on the relevant slice would be cleaner and match other tests: assertThat(Arrays.copyOf(decompressed.getContent(), decompressed.size())).isEqualTo(payload)

  2. Minor nit: The inline comment is useful but longer than needed. One line referencing issue 4317 is enough.


Documentation File (docs/4317-lz4-compression-srclen-bug.md): This file should not be committed. It duplicates the PR description and is a debugging artifact. Per-issue analysis docs add long-term noise - the PR body already captures all relevant context. Please remove this file before merging.


Summary:

  • Bug fix correctness: Correct - exact one-line change needed
  • Regression test: Good - covers the bug scenario; minor assertion improvement suggested
  • Documentation file: Should be removed - not appropriate for the repo

The core fix is solid. The only blocking item is removing docs/4317-lz4-compression-srclen-bug.md.

robfrank added 2 commits May 25, 2026 14:11
…ecompress(Binary)

For slice Binaries, position() is relative to the slice start but the backing
array offset is tracked separately via getContentBeginOffset(). Both compress
and decompress must use getContentBeginOffset() + position() as srcOff.
Add a second regression test covering the slice Binary case.
@claude

claude Bot commented May 25, 2026

Copy link
Copy Markdown

Code Review - PR #4322

Overview

This PR fixes a real and impactful bug in LZ4Compression.compress(Binary): it was passing data.size() (the total backing array length) as the srcLen argument to the LZ4 compressor, when the correct value is data.size() - data.position() (already computed as decompressedLength two lines above). A second structural bug was also present in both compress and decompress: the srcOff argument was data.position() alone, ignoring getContentBeginOffset() (i.e., ByteBuffer.arrayOffset()), so sliced Binaries with a non-zero array offset would read from the wrong position in the backing array.

Both bugs are fixed correctly, and regression tests cover both failure modes. The fix is minimal and well-targeted.


What is correct

  • Bug fix is accurate. The updated call in compress(Binary):

    compressor.compress(data.getContent(), data.getContentBeginOffset() + data.position(),
        decompressedLength, compressed, 0, maxCompressedLength);

    correctly uses decompressedLength as srcLen and adds getContentBeginOffset() to the offset. This matches the pattern already used in move() and copy() elsewhere in Binary.

  • decompress fix is consistent. Applying the same getContentBeginOffset() + position() offset to the decompress path is correct and mirrors the compress fix.

  • Two regression tests cover the two independent failure modes: a non-zero position() with zero array offset, and a sliced Binary with non-zero array offset. Both are precise enough given the combination of startsWith and size() assertions.


Issues and suggestions

[Minor] Test assertions could be more precise.

Both new tests use:

assertThat(decompressed.getContent()).startsWith(payload);
assertThat(decompressed.size()).isEqualTo(decompressedLength);

getContent() returns the full backing array, which may be larger than size(). The combination is logically correct, but a cleaner assertion would be:

assertThat(Arrays.copyOf(decompressed.getContent(), decompressed.size())).isEqualTo(payload);

This makes it immediately obvious that we expect an exact match with payload - no startsWith ambiguity and no separate size() check needed.

[Minor] Inline comments in tests explain what, not why.

Per the project coding guidelines, comments should only appear when the WHY is non-obvious. The first line of each comment (the issue number reference) is valuable for traceability. The second line (e.g., // Advance position past the prefix so only the payload slice is to be compressed) describes what the next line of code already shows clearly. Consider trimming to just the issue reference.

[Should fix] The docs/4317-lz4-compression-srclen-bug.md file contains review workflow metadata that does not belong in the repository.

The "Review Cycles" and "Final State" sections track head SHAs, bot reviewer names, approval states, and a "timeout" condition. This is review tooling state, not codebase documentation. Looking at the other doc files in docs/ (e.g., 4274-cluster-topology-consistency.md), they contain only technical content (root cause, fix, files changed) - no review cycle tracking. The meta-workflow content should be removed from this file before merging, leaving only the technical sections (Summary, Root Cause, Files Changed, Impact Analysis).


No issues found in

  • Logic and correctness of the LZ4 fix itself
  • Scope of the change (minimal, focused)
  • Test framework usage (JUnit 5 + AssertJ, consistent with the codebase)
  • No new dependencies introduced
  • No security or performance regressions

Summary

The core fix is correct and addresses a real data-corruption/crash bug. The two action items before merging are: (1) trim the doc file to remove review cycle tracking metadata, and (2) optionally tighten the test assertions for clarity. The code change itself is ready.

@claude

claude Bot commented May 25, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR fixes a real bug in LZ4Compression.compress(Binary): the srcLen argument was data.size() (total backing-array length) instead of decompressedLength (bytes from the current position to the end). A companion fix adds getContentBeginOffset() to the offset calculation in both compress and decompress to correctly handle sliced Binary objects whose backing array has a non-zero arrayOffset().

The fix is correct and the approach is sound.


Issues

1. docs/4317-lz4-compression-srclen-bug.md should not be committed

This file contains review-cycle metadata (bot login status, SHA references, PR URL, "Final State: timeout"). That is ephemeral scaffolding, not repository documentation. The existing files in docs/ appear to be design documents - this review-log style file does not fit. The PR description and commit history already capture the intent; the doc adds noise to git log.

Please drop this file from the commit.

2. Test assertions use startsWith where isEqualTo is more precise

Both new tests assert:

assertThat(decompressed.getContent()).startsWith(payload);
assertThat(decompressed.size()).isEqualTo(decompressedLength);

Because decompress returns new Binary(decompressed) where decompressed is a byte[decompressedLength], the raw content array has length exactly decompressedLength. The two assertions together imply exact equality, but spelling it out directly is cleaner and will produce a better diff on failure:

assertThat(decompressed.getContent()).isEqualTo(payload);

Minor Observations

  • Production callers are unaffected today. All current compress/decompress call sites in production code (BinarySerializer, UtilCompress, UtilDecompress) use the byte[] overloads, not the Binary overloads. The bug path exists in the API but is not exercised at runtime yet. That does not make the fix wrong - it makes it a proactive correctness fix.

  • Code style is good. final on all local variables, no fully-qualified names, AssertJ style matches project conventions.

  • The decompress(Binary, int) fix is correct. The original code passed data.position() as the src offset without accounting for arrayOffset(). The fix (data.getContentBeginOffset() + data.position()) is symmetric with the compress fix and correct.

  • Empty Binary edge case. compress(Binary) has no guard for decompressedLength == 0 (unlike decompress), but LZ4 handles a zero-length input gracefully, so this is fine.


Summary

The core fix (one line in LZ4Compression.java) and the slice-offset fix are both correct. The two regression tests verify the right scenarios. The main ask is to drop docs/4317-lz4-compression-srclen-bug.md before merging - it is review scaffolding, not repository documentation. The assertion style suggestion is minor but worth considering.

@codecov

codecov Bot commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.13%. Comparing base (d1b1632) to head (7b01785).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4322      +/-   ##
============================================
+ Coverage     64.11%   64.13%   +0.02%     
- Complexity      423      431       +8     
============================================
  Files          1645     1645              
  Lines        127537   127537              
  Branches      27348    27348              
============================================
+ Hits          81766    81797      +31     
+ Misses        34257    34234      -23     
+ Partials      11514    11506       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@robfrank robfrank merged commit c4cf514 into main May 25, 2026
26 of 30 checks passed
@robfrank robfrank deleted the fix/4317-lz4-compression-srclen-bug branch May 25, 2026 13:37
@robfrank robfrank added this to the 26.6.1 milestone May 27, 2026
@robfrank robfrank self-assigned this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LZ4Compression.compress passes total size() where srcLen is expected — silent corruption with Binary.position() > 0

1 participant