Merged
Should make it easier to extend these tests to more cases.
User code working with NumPy or Pandas objects often expects the objects to be mutable. However, if read-only frames (like `bytes` objects) are used, this is not the case. So add a test to check for this, so that we can catch and fix cases where mutability is not guaranteed.
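The mutability difference the test targets can be shown with plain Python buffers (a minimal illustration, independent of the actual test code):

```python
# A memoryview over `bytes` is read-only, so downstream code cannot
# mutate it in place; a bytearray-backed memoryview is writable.
frame = b"data"
ro = memoryview(frame)
assert ro.readonly  # buffers built from `bytes` cannot be mutated

rw = memoryview(bytearray(frame))
assert not rw.readonly
rw[0] = ord("D")  # in-place mutation succeeds
assert bytes(rw) == b"Data"
```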
This makes it easier to operate with the frames in later parts of the code.
Instead of having a fast path, guard against going through the slow path. Should make it easier to do some common post-processing work.
Instead of using `b""`, create the `bytes` object explicitly.
Instead of using a `bytes` object to merge frames together, use a `bytearray`. This ensures the result is mutable, which downstream objects from serialization may care about.
As users expect deserialized objects like NumPy arrays and Pandas objects to be mutable, make sure the frames used to build them up are also mutable. Do this by checking whether they are read-only (as is the case for `bytes`, for example). If they are, copy them into a `bytearray`-backed `memoryview`; otherwise pass them through as-is.
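The check-and-copy step described above can be sketched as follows (a hypothetical `ensure_writable` helper for illustration, not the actual distributed code):

```python
def ensure_writable(frames):
    """Copy read-only frames into bytearray-backed memoryviews;
    pass already-writable frames through unchanged."""
    out = []
    for f in frames:
        mv = memoryview(f)
        out.append(memoryview(bytearray(mv)) if mv.readonly else mv)
    return out


frames = ensure_writable([b"read-only", bytearray(b"writable")])
assert all(not f.readonly for f in frames)
```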
Thanks for handling this @jakirkham. I'm only seeing tests in this PR so far. Is that correct?
Yeah, I'm just trying to demonstrate the failure on CI. Will push the rest of the fix after.
Alright, I have pushed the changes to fix the tests. Please let me know what you think 🙂
mrocklin reviewed Jul 17, 2020:
```python
out.append(memoryview(bytearray().join(L)))
frames = out
```

```python
frames = [memoryview(bytearray(f)) if f.readonly else f for f in frames]
```
For fun I decided to micro-benchmark this:
```
In [1]: L = [b'0' * i for i in [10, 100, 10, 100, 1000] * 2]

In [2]: %timeit b"".join(L)
203 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [3]: %timeit memoryview(bytearray().join(L))
463 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

260 ns doesn't really matter today, I don't think.
In principle this looks good to me. Thank you for identifying and resolving this @jakirkham!
Thanks Matt! 😄
Adding a test for a Pandas Series as well in PR #3995.
Fixes #1978
Fixes #3943
Closes #3918 (superseded)
User code often expects NumPy and Pandas objects to be writable. However, this fails in various cases. After some investigation, this appears to be due to some frames being read-only (as they are coerced to `bytes` via compression, logic in `merge_frames`, etc.). To fix this, we update the logic in `merge_frames` to check that frames are writable. If frames are read-only, we copy them into a writable object (like `bytearray`). We include tests of `merge_frames` and NumPy array serialization to make sure this works correctly.
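The joining logic described above can be sketched as follows (a hypothetical helper; the real `merge_frames` does additional bookkeeping based on the message header). Using `bytearray().join(...)` instead of `b"".join(...)` keeps the merged buffer mutable, at the small cost shown in the micro-benchmark:

```python
def join_frames(frames):
    # Single frame: copy only if read-only; otherwise pass through.
    if len(frames) == 1:
        mv = memoryview(frames[0])
        return memoryview(bytearray(mv)) if mv.readonly else mv
    # Multiple frames: joining via bytearray yields a writable buffer.
    return memoryview(bytearray().join(frames))


merged = join_frames([b"ab", b"cd"])
assert not merged.readonly
assert bytes(merged) == b"abcd"
```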