feat: add to_/from_safetensors#3685
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files
🚀 New features to boost your workflow:
|
|
The documentation preview is ready to be viewed at http://preview.awkward-array.org.s3-website.us-east-1.amazonaws.com/PR3685 |
|
Something is looking weird with the API docs of these two functions, but I don't see what I did wrong... Any ideas? |
ianna
left a comment
There was a problem hiding this comment.
@pfackeldey - excellent work! A few minor comments, please, check. Also you correctly support str, pathlib.Path, or file-like objects for destination in docstring, but the implementation does not explicitly normalize Path objects. While safetensors.numpy.save_file accepts paths, an explicit cast like:
import os
from pathlib import Path
if isinstance(destination, Path):
destination = os.fspath(destination)can make behavior more predictable across platforms.
Ah, this should come first: """
Args:
...and then the function description, I think. |
|
@pfackeldey do you wanna add tests for every single layout type? You can just copy the layouts from |
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
no, this uses to/from_buffers under-the-hood which is well-tested already. I don't think it makes sense to add redundant test cases. This conversion here works as long as |
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
Co-authored-by: Ianna Osborne <ianna.osborne@cern.ch>
|
@pfackeldey maybe I missed something in the code, but shouldn't you materialize before writing to safetensors? |
good point! I'll add that 👍 |
|
And I had one more thing that I just thought of. Maybe there should be a check that the array is not typetracer-backed when writing? I'm not sure what other IO functions to, I didn't check before writing this . I am saying this because |
it fails with a correct and good error already: |
|
Ah good. I was under the impression from buffers would be fine. I should have tried it before speaking I guess. Thanks for checking. |
|
Hi @ianna , I've addressed all comments and switch the file handling to fsspec to normalize file paths and support remote read and write. It's following the implementation of |
ianna
left a comment
There was a problem hiding this comment.
@pfackeldey - Great! Thanks for addressing the comments! Please, go ahead and merge it if you are done with it. Thanks!
This PR adds to and from safetensors conversions. They're extremely fast at the cost of file size because they to not include any compression. The idea is that all buffers are saved as a long sequence of uncompressed bytes along with metadata that remembers where each buffers starts and stops (similar to an awkward array). Loading it mmaps the file and accessing individual buffers loads only the corresponding slice into memory. This is basically what zarr does but with a dynamic chunk size instead of a static one (which is good for us, because we don't have rectangular arrays) and when one turns off compression.