Rewrite NULL Handling Logic #19

@Mytherin


Special values in the domain are not nice: they have to be handled in every single loop, which makes the code ugly, and they are not efficient. Tim and I talked about it and propose the following change:

Every vector has an optional pointer to a bitmask of 1024 bits (this amounts to 128 bytes, or 16 8-byte integers, as overhead), which is relatively negligible for most data types.

This can be implemented with C++ bitsets (http://www.cplusplus.com/reference/bitset/bitset/), which handle most of the nasty code for us and should be quite efficient because of template magic. If we find out they are not efficient, we can always roll our own implementation.
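A minimal sketch of what such a vector could look like, assuming `std::bitset<1024>` as the mask type. All names here (`Vector`, `NullMask`, `IsNull`, `SetNull`, `kVectorSize`) are hypothetical and only illustrate the proposal; the `int64_t` payload is also just an assumption for the example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical names; not taken from the existing code.
constexpr std::size_t kVectorSize = 1024;
using NullMask = std::bitset<kVectorSize>;  // 1024 bits = 128 bytes

struct Vector {
    std::vector<int64_t> data;
    // Optional mask: nullptr means "this vector contains no NULLs".
    std::unique_ptr<NullMask> nulls;

    explicit Vector(std::size_t count) : data(count, 0) {
        assert(count <= kVectorSize);
    }

    bool IsNull(std::size_t i) const {
        return nulls && nulls->test(i);
    }

    void SetNull(std::size_t i) {
        if (!nulls) {
            // Allocate the mask lazily, only once the first NULL shows up.
            nulls = std::make_unique<NullMask>();
        }
        nulls->set(i);
        data[i] = 0;  // keep the hidden value sane (0, as proposed in this issue)
    }
};
```

A vector that never sees a NULL never allocates a mask, so the common case pays only for one pointer.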

In regular loops (e.g. addition and so on), we completely ignore NULL values and just loop over the whole data. This is nice because it (1) allows SIMD, (2) does away with any branching, and (3) simplifies our loop code, which results in a smaller code size.

The actual NULLs in the data can be computed separately, depending on the operator type. For regular math operators (+, -, *, /), we would OR together the bitmasks if both sides have one, simply select the other side's bitmask if one side does not have one, or set the result's bitmask pointer to NULL if neither side has one. This should be pretty much free if neither side has NULLs, and cheap even if they do, because the bitmask OR is just an OR of 16 8-byte integers.
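As a rough sketch of the split between the data loop and the NULL computation, assuming `std::bitset<1024>` masks passed as optional pointers. `CombineNullMasks` and `Add` are hypothetical names, and copying the mask when only one side has one (rather than sharing it) is an assumption made to keep the example simple.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

using NullMask = std::bitset<1024>;

// Hypothetical helper: result mask for a binary math operator.
// A nullptr input or result means "no NULLs".
std::unique_ptr<NullMask> CombineNullMasks(const NullMask* left,
                                           const NullMask* right) {
    if (left && right) {
        // NULL if either input is NULL.
        return std::make_unique<NullMask>(*left | *right);
    }
    if (left) {
        return std::make_unique<NullMask>(*left);
    }
    if (right) {
        return std::make_unique<NullMask>(*right);
    }
    return nullptr;  // neither side has a mask
}

// The data loop never looks at the masks: no branches, SIMD-friendly.
void Add(const int64_t* left, const int64_t* right, int64_t* result,
         std::size_t count) {
    assert(count <= 1024);
    for (std::size_t i = 0; i < count; i++) {
        result[i] = left[i] + right[i];
    }
}
```

An operator would call `Add` on the data and `CombineNullMasks` on the masks independently, so the per-value loop stays identical whether or not NULLs are present.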

The difficulty with this approach will be aggregations, since they still have to check the NULL values. Also, because we are performing operations on the "hidden" NULL values (even if the results of the operations are not used) the hidden values should be somewhat sane. I propose the value 0 for the hidden value, because:

  • We already have to check for 0 in divisions, so that doesn't create any extra problems
  • Setting the hidden value to 0 makes NULL checks unnecessary in the SUM computation.
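To illustrate the SUM point: a sketch of a sum that never consults the mask, assuming every NULL slot holds the hidden value 0. `Sum` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: assumes NULL slots already hold the hidden value 0,
// so they add nothing to the total and no mask lookup is needed.
int64_t Sum(const int64_t* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    int64_t total = 0;
    for (std::size_t i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}
```

For the values {5, NULL, 7}, stored as {5, 0, 7}, this adds 5 + 0 + 7, which matches the sum over the non-NULL values only.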

For other aggregations (MIN and MAX) we still need to check the bitmask to see whether a stored value is genuine or just the hidden placeholder of a NULL. For this, we could have separate code paths depending on whether or not there is a NULL mask. However, this affects only a small number of functions, not every single function in the pipeline.
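A sketch of the two code paths for MIN, assuming an optional `std::bitset<1024>` mask. `Min` is a hypothetical name, and returning `std::optional` to signal "no non-NULL input" is an assumption of this example (it needs C++17).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

using NullMask = std::bitset<1024>;

// Hypothetical kernel. Returns std::nullopt if there is no non-NULL input.
std::optional<int64_t> Min(const int64_t* data, const NullMask* nulls,
                           std::size_t count) {
    assert(count <= 1024);
    std::optional<int64_t> best;
    if (!nulls) {
        // Fast path: no mask, so every stored value is genuine.
        for (std::size_t i = 0; i < count; i++) {
            if (!best || data[i] < *best) {
                best = data[i];
            }
        }
    } else {
        // Slow path: skip slots whose 0 is only the hidden NULL placeholder.
        for (std::size_t i = 0; i < count; i++) {
            if (nulls->test(i)) {
                continue;
            }
            if (!best || data[i] < *best) {
                best = data[i];
            }
        }
    }
    return best;
}
```

With the stored values {5, 0, 7, 3} and slot 1 marked NULL, the masked path ignores the hidden 0 and yields 3, whereas the unmasked path treats the 0 as a real value.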
