[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support#1813
[DRIVERS-2926] [PYTHON-4577] BSON Binary Vector Subtype Support#1813blink1073 merged 48 commits intomongodb:masterfrom
Conversation
…t.test_bson.TestBSON.test_encode_type_marker to pass
…t.test_bson.TestBSON.test_encode_type_marker to pass
…ector, from_vector
e32d1b8 to
26b8398
Compare
bson/binary.py
Outdated
| elif dtype == BinaryVectorDtype.FLOAT32: | ||
| n_bytes = len(self) - position | ||
| n_values = n_bytes // 4 | ||
| assert n_bytes % 4 == 0 |
There was a problem hiding this comment.
We should not use assert for validation data. If the argument type is incorrect we raise a TypeError, if the type is correct but the value is invalid we raise a ValueError.
bson/binary.py
Outdated
| """ | ||
| if dtype == BinaryVectorDtype.INT8: # pack ints in [-128, 127] as signed int8 | ||
| format_str = "b" | ||
| assert not padding, f"padding does not apply to {dtype=}" |
bson/binary.py
Outdated
| format_str = "B" | ||
| elif dtype == BinaryVectorDtype.FLOAT32: # pack floats as float32 | ||
| format_str = "f" | ||
| assert not padding, f"padding does not apply to {dtype=}" |
|
|
||
| .. versionadded:: 4.9 | ||
| """ | ||
|
|
There was a problem hiding this comment.
We need to validate self.subtype == 9 here before attempting to decode.
There was a problem hiding this comment.
Something like:
if self.subtype != VECTOR_SUBTYPE:
raise ValueError(...)There was a problem hiding this comment.
Look at as_uuid it should be the same as that validation:
if self.subtype not in ALL_UUID_SUBTYPES:
raise ValueError(f"cannot decode subtype {self.subtype} as a uuid")| .. versionadded:: 4.9 | ||
| """ | ||
|
|
||
| INT8 = b"\x03" |
There was a problem hiding this comment.
An enum of bytes is a bit unusual. Can we change it to use ints? eg INT8 = 3
There was a problem hiding this comment.
The naming convention that Geert used has meaning in both the first and last 4 bits, so I'd prefer it to stay as-is.
There was a problem hiding this comment.
I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?
There was a problem hiding this comment.
The user doesn't see the value locally either. Given that the bson spec itself always uses integers I think we should use the integer form.
bson/binary.py
Outdated
| .. versionadded:: 4.9 | ||
| """ | ||
|
|
||
| data: Sequence[float | int] |
There was a problem hiding this comment.
Add __slots__ = ("data", "dtype", "padding") to reduce the memory overhead of using this class.
There was a problem hiding this comment.
Sure. How do I do type annotation for that?
There was a problem hiding this comment.
None needed, e.g.
mongo-python-driver/bson/timestamp.py
Line 31 in 821811e
There was a problem hiding this comment.
We could pass slots=True instead: https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
There was a problem hiding this comment.
Strange. And you're cool with that?
There was a problem hiding this comment.
We have to add them manually instead of using slots=True.
There was a problem hiding this comment.
class BinaryVector:
"""**(BETA)** Vector of numbers along with metadata for binary interoperability.
:param data (Sequence[float | int]): Sequence of numbers representing the mathematical vector.
:param dtype (:class:`bson.Binary.BinaryVectorDtype`): The data type stored in binary
:param padding (Optional[int] = 0): The number of bits in the final byte that are to be ignored
when a vector element's size is less than a byte
and the length of the vector is not a multiple of 8. Default is 0.
.. versionadded:: 4.10
"""
__slots__ = ("data", "dtype", "padding")
def __init__(self, data, dtype, padding=0):
self.data = data
self.dtype = dtype
self.padding = padding
Is this right? @blink1073
There was a problem hiding this comment.
Oh I didn't realize __slots__ with dataclass was problematic.
| .. versionadded:: 4.9 | ||
| """ | ||
|
|
||
| INT8 = b"\x03" |
There was a problem hiding this comment.
I don't see how the interpretation of the values here matters. Are you saying this is a bit flag?
bson/binary.py
Outdated
|
|
||
| data: Sequence[float | int] | ||
| dtype: BinaryVectorDtype | ||
| padding: Optional[int] = 0 |
There was a problem hiding this comment.
Optional[int] -> int, unless padding=None is considered valid.
There was a problem hiding this comment.
padding=None is expected when it is not a packed type
There was a problem hiding this comment.
Is it? I see the code using padding=0 in that case.
bson/binary.py
Outdated
| cls: Type[Binary], | ||
| vector: list[int, float], | ||
| dtype: BinaryVectorDtype, | ||
| padding: Optional[int] = 0, |
bson/binary.py
Outdated
| data = struct.pack(f"{len(vector)}{format_str}", *vector) | ||
| return cls(metadata + data, subtype=VECTOR_SUBTYPE) | ||
|
|
||
| def as_vector(self, uncompressed: Optional[bool] = False) -> BinaryVector: |
There was a problem hiding this comment.
Optional[bool] -> bool unless uncompressed=None is valid.
bson/binary.py
Outdated
| else: | ||
| bits = [] | ||
| for uint8 in unpacked_uint8s: | ||
| bits.extend([int(bit) for bit in f"{uint8:08b}"]) |
There was a problem hiding this comment.
Why add an uncompressed option here at all? It looks like this option is irreversible because from_vector does not support the same option. Even if it did, is this option useful outside of test code?
There was a problem hiding this comment.
Yeah, I think we should leave out uncompressed, it was a quality of life addition, but is not likely to be standardized.
There was a problem hiding this comment.
Yes. It's irreversible. We don't yet provide an API to go from a full vector of zeros and ones to a packed bit vector.
There was a problem hiding this comment.
Going in, you'd be using something like numpy.packbits
There was a problem hiding this comment.
Let's omit it since it adds unneeded complexity at this point.
|
Fixing the test suite and the Enum issue are my last remaining comments. |
|
We're pressing ahead with the hex values in the Enum because they simplify the implementation of encode and decode. |
|
Again, the user will only ever see |
Adds support for new subtype of Binary BSON subtype to be used for densely-packed 1d arrays (vectors).
Corresponding changes are in specifications repo PR #1658