Fix corrupted record counter in SequenceFileReader.Offset.increment() #8440

Merged
jnioche merged 1 commit into master from bugSequenceFileReader
Mar 28, 2026
@jnioche jnioche commented Mar 28, 2026

Contributor

In SequenceFileReader.Offset.increment(), line 204 overwrites currentRecord (a record counter) with newBytePosition (a byte offset), immediately after incrementing it on line 202:

  ++currentRecord;                            // line 202: correct
  prevRecordEndOffset = currRecordEndOffset;  // line 203
  currentRecord = newBytePosition;            // line 204: BUG, the assignment target should be currRecordEndOffset

This is a copy-paste error: the left-hand side of line 204 should be currRecordEndOffset, not currentRecord.
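
For reference, a minimal sketch of the corrected update. This is a simplified stand-in, not the actual Storm class: the class name `OffsetSketch`, the `long` field types, and passing `newBytePosition` as a method parameter are assumptions made for illustration. Only the three field names and the three statements come from the description above.

```java
/** Simplified stand-in for SequenceFileReader.Offset (hypothetical, for illustration only). */
final class OffsetSketch {
    long currentRecord = 0;        // number of records read so far
    long currRecordEndOffset = 0;  // byte offset at which the current record ends
    long prevRecordEndOffset = 0;  // byte offset at which the previous record ended

    /** Corrected update: the byte position is stored in currRecordEndOffset. */
    void increment(long newBytePosition) {
        ++currentRecord;                            // count the record
        prevRecordEndOffset = currRecordEndOffset;  // remember where the previous record ended
        currRecordEndOffset = newBytePosition;      // fixed assignment target
    }
}
```

With this version, currentRecord only ever changes by one per call, and prevRecordEndOffset trails currRecordEndOffset by exactly one record.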

Impact

Every call to next() replaces the record counter with the reader's byte position. This corrupts:

  • Offset equality/comparison — two offsets at the same record count compare as unequal if byte positions differ
  • Offset serialization (toString()) — the persisted record= field contains a byte offset, not a record number
  • Resume after restart — the HDFS spout uses the serialized offset to resume reading; a corrupted value causes records to be skipped or re-processed
  • prevRecordEndOffset tracking — since currRecordEndOffset is never updated, prevRecordEndOffset always copies a stale value, breaking sync point calculation
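
The effects listed above can be reproduced with a small side-by-side replay. This is a sketch under stated assumptions: `OffsetImpactDemo`, `replay`, and the byte positions 100, 250 and 400 are hypothetical and are not taken from the Storm code base. Only the field names and the two variants of the assignment follow the description in this PR.

```java
/** Hypothetical demo class: replays record end positions through the buggy and the fixed update. */
final class OffsetImpactDemo {
    long currentRecord = 0;
    long currRecordEndOffset = 0;
    long prevRecordEndOffset = 0;

    /** Behaviour before the fix: the record counter is overwritten with a byte offset. */
    void buggyIncrement(long newBytePosition) {
        ++currentRecord;
        prevRecordEndOffset = currRecordEndOffset;
        currentRecord = newBytePosition;        // the overwrite described in this PR
    }

    /** Behaviour after the fix: the byte offset goes into currRecordEndOffset. */
    void fixedIncrement(long newBytePosition) {
        ++currentRecord;
        prevRecordEndOffset = currRecordEndOffset;
        currRecordEndOffset = newBytePosition;
    }

    /** Applies one increment per record end position (made-up values) and returns the final state. */
    static OffsetImpactDemo replay(boolean buggy, long... recordEnds) {
        OffsetImpactDemo offset = new OffsetImpactDemo();
        for (long end : recordEnds) {
            if (buggy) {
                offset.buggyIncrement(end);
            } else {
                offset.fixedIncrement(end);
            }
        }
        return offset;
    }
}
```

Replaying the three positions 100, 250 and 400 through the buggy path leaves currentRecord at 400 and both end offsets at 0. The fixed path ends with currentRecord at 3, currRecordEndOffset at 400 and prevRecordEndOffset at 250.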

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
@jnioche jnioche added the bug label Mar 28, 2026
@jnioche jnioche added this to the 2.8.6 milestone Mar 28, 2026
@jnioche jnioche merged commit 2ebbe38 into master Mar 28, 2026
12 checks passed
@jnioche jnioche deleted the bugSequenceFileReader branch March 28, 2026 09:32