Skip to content

Conversation

@cloud-fan
Copy link
Contributor

according to the specification in the Null bitmaps section:

Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

null bitmaps should always be padded with 0

| 00001011 | 0 (padding) | unspecified |
|Byte 0 (validity bitmap) | Bytes 1-63 |
|-------------------------|-----------------------|
| 00001011 | 0 (padding) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this was a deliberate choice so that one can rely upon accurate null counts using a byte-wise hardware popcount instruction. Though I'm not sure how safe that is in general since we can do:

array->Slice(0, 1)

which would yield a slice of length 1, whose additional bytes should not be inspected in algorithms. In an IPC setting, the padding to byte 63 should be all zeros (see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L220), so I think this is fine. We will want to make sure in the IPC writers that we do not write possibly unspecified in-memory bytes from the validity bitmap, instead truncating the buffer to the effective byte range and writing zero padding. It might be worth adding some language to this document about this.

Per our prior discussion about padding -- since this document is about what data gets written to stream or file format, it may be good to be more rigid about zero-ing out the padding bytes that go on the wire. This places a slight additional burden on the writers, though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we only need to use zero padding for null bitmap in a word(8 bytes) boundary, which means the previous code is accurate.

However, to make the rule simpler, I think it also makes sense to require all the padding bytes to be 0 for null bitmap. This is also what we specified in the "Null bitmaps" section:

A 1 (set bit) for index j indicates that the value is not null, while a 0 (bit not set) indicates that it is null. Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

@wesm
Copy link
Member

wesm commented May 11, 2017

Can you open a JIRA? The build failure is unrelated here

@cloud-fan cloud-fan changed the title [FORMAT] fix typo and mistakes in Layout.md [ARROW-1011][FORMAT] fix typo and mistakes in Layout.md May 12, 2017
@wesm
Copy link
Member

wesm commented May 12, 2017

We use the "ARROW-1011:" style for the PR titles in our merge script, if you could change that. Thanks!

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I will comment in the other JIRAs about padding when I can to clarify my concerns there

Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cloud-fan cloud-fan changed the title [ARROW-1011][FORMAT] fix typo and mistakes in Layout.md ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md May 13, 2017
@asfgit asfgit closed this in 99ff240 May 14, 2017
jeffknupp pushed a commit to jeffknupp/arrow that referenced this pull request Jun 3, 2017
according to the specification in the `Null bitmaps` section:
> Bitmaps are to be initialized to be all unset at allocation time (this includes padding).

null bitmaps should always be padded with 0

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#673 from cloud-fan/first and squashes the following commits:

8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants