fix(#4328): pre-serialize properties so count matches bytes written in writeExternal#4349
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses a deserialization desynchronization issue (#4328) in JavaBinarySerializer by pre-filtering and pre-serializing properties to ensure the written property count matches the actual serialized entries. It also adds regression tests to verify the fix. The reviewer suggests a performance optimization to avoid multiple byte array allocations during pre-serialization by using a single ByteArrayOutputStream to reduce garbage collection pressure.
| final Map<String, Object> properties = document.toMap(); | ||
| out.writeInt(properties.size()); | ||
| final List<String> names = new ArrayList<>(properties.size()); | ||
| final List<byte[]> payloads = new ArrayList<>(properties.size()); | ||
| for (final Map.Entry<String, Object> prop : properties.entrySet()) { | ||
| final String propName = prop.getKey(); | ||
| final Object propValue = prop.getValue(); | ||
|
|
||
| // PROPERTY NAME | ||
| out.writeUTF(propName); | ||
|
|
||
| if (propValue != null) { | ||
| // PROPERTY VALUE | ||
| buffer.clear(); | ||
|
|
||
| final Property property = documentType.getPropertyIfExists(propName); | ||
| final byte type = BinaryTypes.getTypeFromValue(propValue, property); | ||
| if (type == -1) { | ||
| // INVALID: SKIP IT | ||
| LogManager.instance() | ||
| .log(BinaryTypes.class, Level.WARNING, | ||
| "Cannot serialize property '%s' of type %s, value %s. The property will be ignored", | ||
| propName, propValue.getClass(), propValue); | ||
| continue; | ||
| } | ||
|
|
||
| buffer.putByte(type); | ||
| serializer.serializeValue(db, buffer, type, propValue); | ||
| buffer.flip(); | ||
|
|
||
| out.writeInt(buffer.size()); | ||
| out.write(buffer.getContent(), 0, buffer.size()); | ||
| if (propValue == null) | ||
| continue; | ||
| final Property schemaProp = documentType.getPropertyIfExists(prop.getKey()); | ||
| final byte type = BinaryTypes.getTypeFromValue(propValue, schemaProp); | ||
| if (type == -1) { | ||
| LogManager.instance() | ||
| .log(BinaryTypes.class, Level.WARNING, | ||
| "Cannot serialize property '%s' of type %s, value %s. The property will be ignored", | ||
| prop.getKey(), propValue.getClass(), propValue); | ||
| continue; | ||
| } | ||
| buffer.clear(); | ||
| buffer.putByte(type); | ||
| serializer.serializeValue(db, buffer, type, propValue); | ||
| buffer.flip(); | ||
| names.add(prop.getKey()); | ||
| payloads.add(Arrays.copyOf(buffer.getContent(), buffer.size())); | ||
| } | ||
|
|
||
| // PROPERTY COUNT - matches exactly what will be written below | ||
| out.writeInt(names.size()); | ||
| for (int i = 0; i < names.size(); i++) { | ||
| out.writeUTF(names.get(i)); | ||
| final byte[] payload = payloads.get(i); | ||
| out.writeInt(payload.length); | ||
| out.write(payload); | ||
| } |
There was a problem hiding this comment.
Performance Optimization: Avoid Multiple Byte Array Allocations
The current implementation pre-serializes properties by allocating a new byte[] array for every single valid property via Arrays.copyOf(buffer.getContent(), buffer.size()), and maintains two parallel ArrayLists (names and payloads). In high-throughput database operations, this creates significant garbage collection pressure.
We can optimize this by serializing all valid properties into a single temporary ByteArrayOutputStream wrapped in a DataOutputStream. This reduces the number of allocations to O(1) per document serialization (one ByteArrayOutputStream and one final byte[] via baos.toByteArray()), while maintaining full compatibility with readExternal.
final Map<String, Object> properties = document.toMap();
int validCount = 0;
final ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (final DataOutputStream tempOut = new DataOutputStream(baos)) {
for (final Map.Entry<String, Object> prop : properties.entrySet()) {
final Object propValue = prop.getValue();
if (propValue == null)
continue;
final String propName = prop.getKey();
final Property schemaProp = documentType.getPropertyIfExists(propName);
final byte type = BinaryTypes.getTypeFromValue(propValue, schemaProp);
if (type == -1) {
LogManager.instance()
.log(BinaryTypes.class, Level.WARNING,
"Cannot serialize property '%s' of type %s, value %s. The property will be ignored",
propName, propValue.getClass(), propValue);
continue;
}
buffer.clear();
buffer.putByte(type);
serializer.serializeValue(db, buffer, type, propValue);
buffer.flip();
tempOut.writeUTF(propName);
tempOut.writeInt(buffer.size());
tempOut.write(buffer.getContent(), 0, buffer.size());
validCount++;
}
tempOut.flush();
}
// PROPERTY COUNT - matches exactly what will be written below
out.writeInt(validCount);
out.write(baos.toByteArray());
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Coverage variation | ✅ -13.42% coverage variation |
| Diff coverage | ✅ 100.00% diff coverage |
Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (61177a8) 127719 94366 73.89% Head commit (cd51460) 159403 (+31684) 96378 (+2012) 60.46% (-13.42%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>
Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4349) 17 17 100.00% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Code ReviewOverall: This is a correct and well-explained bug fix. The root cause analysis is accurate and the regression tests are solid. A few items worth discussing before merge. What the PR does
The fix pre-serializes all valid properties into a staging Issues1.
2. Pre-existing In in.read(array);
in.readFully(array);This is a pre-existing bug and out of scope for this PR, but since you are touching this class it would be a good opportunity to fix it. 3. Minor: extra heap buffer allocation The staging // Pass 1: count valid entries
int validCount = 0;
for (final Map.Entry<String, Object> prop : properties.entrySet()) {
if (prop.getValue() == null) continue;
final byte type = BinaryTypes.getTypeFromValue(prop.getValue(), documentType.getPropertyIfExists(prop.getKey()));
if (type != -1) validCount++;
}
out.writeInt(validCount);
// Pass 2: serialize
for (final Map.Entry<String, Object> prop : properties.entrySet()) { ... }For documents with many large properties this halves peak heap pressure. Since serialization is not on the query hot path this is a minor concern, but it is simpler code too - no What is good
SummaryFix the |
|
Thanks for the thorough review. Addressed point 1 (dropped On point 2 ( On point 3 (two-pass instead of staging buffer): The staging-buffer shape came from gemini's review in the first round (5a3e679 -> 3053f8b). A two-pass approach would re-invoke |
Code ReviewOverviewThis PR fixes a real data-corruption bug in Code Quality
One avoidable extra copy - // instead of: out.write(staged.toByteArray())
staged.writeTo((OutputStream) out);Minor, but worth it for large documents. Comment style - CLAUDE.md says comments should explain the why rather than the what, capped at one line. The two-line staging comment and Pre-existing issue surfaced by the fix (not introduced here)In in.read(array); // may return fewer bytes than array.length
in.readFully(array);
TestsTwo regression tests are included, confirmed to fail before the fix and pass after - good practice. Assertion style - line 164 uses: assertThat(result.containsKey("nullProp")).isFalse();The more idiomatic AssertJ form (better failure message, matches project style) is: assertThat(result).doesNotContainKey("nullProp");SummaryThe fix is correct and the tests cover the important desync scenario. Suggested follow-ups:
|
|
Applied in 5188a4e:
Declined:
|
Code ReviewOverviewThis PR fixes a desync bug in The fix stages all valid entries into a The Fix is CorrectThe two-pass approach cleanly solves the desync: null and unrecognized-type properties are filtered before the count is committed to the stream. The logic is easy to follow and the fix is minimal. FindingsOne pre-existing bug worth fixing alongside this PR (readExternal) In in.read(array);
in.readFully(array);Not introduced by this PR, but since the test suite is already exercising this path it is a good moment to close the gap. Minor: unnecessary heap copy
One naming nit The local variable Test CoverageBoth new tests are well-structured:
One small suggestion: the second test calls SummaryThe fix is correct and well-tested. The only actionable suggestion is to change |
|
Cycle 4 (final per --max-cycles=4):
Stopping the orchestrator here. Merge belongs to the maintainer. |
Code ReviewOverviewThis PR fixes a real data-corruption bug in The fix is correct and the root cause analysis in the PR description is accurate. What works well
Issues and suggestionsPre-existing bug worth noting (out of scope but related): Minor - extra allocation: Minor - Comment style: The CLAUDE.md guidance asks to avoid comments that explain what code does rather than why. The added comment SummaryThe fix is correct, well-tested, and safe to merge. The one thing worth addressing before or shortly after merging is |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4349 +/- ##
============================================
- Coverage 64.72% 58.26% -6.46%
+ Complexity 446 423 -23
============================================
Files 1647 1647
Lines 127719 127725 +6
Branches 27375 27375
============================================
- Hits 82660 74422 -8238
- Misses 33384 42643 +9259
+ Partials 11675 10660 -1015 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…n writeExternal writeExternal wrote properties.size() as the count but silently skipped the size/payload bytes for null-valued and unrecognised-type properties. readExternal always reads UTF + int + bytes for each counted entry, so a single skipped property desynced every subsequent read (EOFException or garbage values). Fix: collect serializable (non-null, known-type) entries into parallel name/payload lists in one pass, then write names.size() as the count before iterating the lists.
Replace the two parallel ArrayLists + per-property Arrays.copyOf with a single growing ByteArrayOutputStream wrapped in a DataOutputStream. This collapses O(N) per-property byte[] allocations down to one final toByteArray copy, easing GC pressure for documents with many properties. Wire format is unchanged - readExternal sees the same byte sequence.
claude bot review: per-issue analysis docs are not committed in this project; the PR description and commit messages carry the context. The file would become stale once merged.
- Test uses doesNotContainKey for idiomatic AssertJ - Drop comments that describe what the code does; keep only the why
Two 'Out' variables (tempOut wrapping staged, and the parameter 'out') in scope were easy to confuse. stagedOut makes the relationship to 'staged' explicit.
317e730 to
cd51460
Compare
Code ReviewOverviewThis PR fixes a correctness bug in The fix stages serializable entries into a Correctness - No IssuesThe root cause is accurately diagnosed and the fix is logically correct. The written count now exactly matches the number of entries on the wire. Both null-skip and type=-1-skip paths are handled consistently. Potential IssuesPre-existing bug in // JavaBinarySerializer.java:129
final byte[] array = new byte[propertySize];
in.read(array); // <-- may read fewer bytes than requested
Double logging for unrecognized types
PerformanceThe fix allocates a A lighter alternative: the thread-local Also, Test CoverageThe two new tests are well-structured and use the correct AssertJ assertion style:
Suggestion: Add a third test for the unrecognized-type ( Minor Notes
SummaryThe fix is correct and the tests confirm the before/after behavior. The main actionable items are:
|
Closes #4328
Summary
JavaBinarySerializer.writeExternalwroteproperties.size()as the property count but silently skipped writing size/payload bytes for null-valued properties and properties whose type could not be inferred (type == -1). BecausereadExternalalways reads name + int-size + bytes for each counted entry, a single skipped property desynced every subsequent read, producing garbage values orEOFException.The fix collects serializable entries (non-null value, recognised type) into parallel
names/payloadslists in one pass - which also pre-serializes the bytes once each - then writesnames.size()as the count before emitting the lists. The written count now matches the bytes on the wire exactly.Test plan
JavaBinarySerializerTest#documentWithNullPropertyDoesNotDesync- document with a null-valued property in the middle of non-null properties round-trips correctly; null property absent, non-null properties intact.JavaBinarySerializerTest#documentAllNullPropertiesDeserializesEmpty- document with only null properties serializes to an empty map.JavaBinarySerializerTestmethods pass; 73 serializer/document tests pass with no regressions.