-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md #673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| | 00001011 | 0 (padding) | unspecified | | ||
| |Byte 0 (validity bitmap) | Bytes 1-63 | | ||
| |-------------------------|-----------------------| | ||
| | 00001011 | 0 (padding) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this was a deliberate choice so that one can rely upon accurate null counts using a byte-wise hardware popcount instruction. Though I'm not sure how safe that is in general since we can do:
array->Slice(0, 1)
which would yield a slice of length 1, whose additional bytes should not be inspected in algorithms. In an IPC setting, the padding to byte 63 should be all zeros (see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L220), so I think this is fine. We will want to make sure in the IPC writers that we do not write possibly unspecified in-memory bytes from the validity bitmap, instead truncating the buffer to the effective byte range and writing zero padding. It might be worth adding some language to this document about this.
Per our prior discussion about padding -- since this document is about what data gets written to stream or file format, it may be good to be more rigid about zero-ing out the padding bytes that go on the wire. This places a slight additional burden on the writers, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally we only need to use zero padding for null bitmap in a word(8 bytes) boundary, which means the previous code is accurate.
However, to make the rule simpler, I think it also makes sense to require all the padding bytes to be 0 for null bitmap. This is also what we specified in the "Null bitmaps" section:
A 1 (set bit) for index j indicates that the value is not null, while a 0 (bit not set) indicates that it is null. Bitmaps are to be initialized to be all unset at allocation time (this includes padding).
|
Can you open a JIRA? The build failure is unrelated here |
|
We use the "ARROW-1011:" style for the PR titles in our merge script, if you could change that. Thanks! |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I will comment in the other JIRAs about padding when I can to clarify my concerns there
julienledem
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
according to the specification in the `Null bitmaps` section: > Bitmaps are to be initialized to be all unset at allocation time (this includes padding). null bitmaps should always be padded with 0 Author: Wenchen Fan <wenchen@databricks.com> Closes apache#673 from cloud-fan/first and squashes the following commits: 8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
according to the specification in the
Null bitmapssection:null bitmaps should always be padded with 0