
Add MemoryFileSystem #2741

Closed
martindurant wants to merge 29 commits into dask:master from martindurant:memory_fs

Conversation

@martindurant
Member

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

See dask/fastparquet#215
This is a single global store. It is meant for use only with the Threaded scheduler - not sure how useful it is.

@martindurant
Member Author

What would be the right way of determining the size of the data held in a BytesIO on py2? Is it something that needs to be saved via tell() when we are done writing instead?

@mrocklin
Member

mrocklin commented Oct 26, 2017 via email

@martindurant
Member Author

Actually, thinking about it a moment, i.seek(0, 2) should work and come with very little cost.
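
A small sketch of that idea, assuming a seekable binary buffer. The helper name `buffer_size` is hypothetical, and restoring the previous position afterwards is an extra nicety, not something stated above:

```python
import io


def buffer_size(f):
    """Return the total number of bytes in a seekable file-like object."""
    pos = f.tell()                # remember where the caller was
    try:
        f.seek(0, io.SEEK_END)    # equivalent to f.seek(0, 2)
        return f.tell()           # offset of the end == size in bytes
    finally:
        f.seek(pos)               # put the position back


buf = io.BytesIO()
buf.write(b"hello world")
buf.seek(5)
size = buffer_size(buf)
```

Calling `tell()` after the seek, rather than relying on the return value of `seek()`, keeps the helper usable with file-like objects whose `seek()` returns nothing.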

@mrocklin
Member

@martindurant is there anything that remains to be done here? What's here seems fine to me.

My only comment is that there seems to be a fair amount of copy-pasting between filesystem test suites. It might make sense at some point to construct an inheritable test class that others can use for tests. This might be something that we hand to the Arrow folks for use with their HDFS implementation.
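
One possible shape for such an inheritable test class, sketched under the assumption that every filesystem exposes `open(path, mode)` returning a file-like object. `FileSystemTestBase`, `make_fs`, `DictFileSystem` and `_Writer` are hypothetical names used only for this example:

```python
import io


class FileSystemTestBase:
    """Shared checks; subclasses supply the filesystem under test."""

    def make_fs(self):
        raise NotImplementedError

    def test_roundtrip(self):
        fs = self.make_fs()
        with fs.open("a/b.bin", "wb") as f:
            f.write(b"abc")
        with fs.open("a/b.bin", "rb") as f:
            assert f.read() == b"abc"

    def test_size(self):
        fs = self.make_fs()
        with fs.open("a/c.bin", "wb") as f:
            f.write(b"12345")
        with fs.open("a/c.bin", "rb") as f:
            f.seek(0, 2)
            assert f.tell() == 5


class _Writer(io.BytesIO):
    """Toy writable file that stores its bytes in a dict on close."""

    def __init__(self, store, path):
        super().__init__()
        self._store = store
        self._path = path

    def close(self):
        if not self.closed:
            self._store[self._path] = self.getvalue()
        super().close()


class DictFileSystem:
    """Toy in-memory filesystem, used only to exercise the base class."""

    def __init__(self):
        self.store = {}

    def open(self, path, mode="rb"):
        if mode == "wb":
            return _Writer(self.store, path)
        return io.BytesIO(self.store[path])


class TestDictFileSystem(FileSystemTestBase):
    def make_fs(self):
        return DictFileSystem()


# Run the inherited checks by hand; a test runner would discover them instead.
suite = TestDictFileSystem()
suite.test_roundtrip()
suite.test_size()
checked = True
```

Another backend would only need its own subclass with a different `make_fs`. The base class name deliberately does not start with `Test`, so pytest's default collection rules would skip it and run only the concrete subclasses.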

@martindurant
Member Author

I think this is complete enough to be useful.
Agree about the test duplication, although I don't expect this code to change frequently.

@mrocklin
Member

I think that this needs to be added to the imports in dask/bytes/__init__.py.

It would also be nice to see a roundtrip test with dd.to_csv and dd.read_csv

@mrocklin
Member

@martindurant ok to merge?

@martindurant
Member Author

Yes, I think so. This does not appear explicitly in the docs, but it is a fairly niche use.

@martindurant
Member Author

Updated here with the simplifications that went into bytes. Can be merged after #3160, if that is good to go.

@martindurant
Member Author

@alimanfoo, if this would be useful to you for making in-memory zarr files, then please try it out and see how well it works.

@alimanfoo
Contributor

alimanfoo commented May 29, 2018 via email

@jakirkham
Member

Have a few questions. What contexts does this work in (e.g. single threaded, multithreaded, multiprocessing, distributed, etc.)? Also how does this work when someone wants to access this stored data?

@martindurant
Member Author

There are a couple of examples of round-tripping in the tests, so the following should work

arr.to_zarr('memory://path/arr.zarr')
arr2 = da.from_zarr('memory://path/arr.zarr')

so long as we are within one process (sync or thread scheduler, or distributed in-process).

If you are not in one process, you would still successfully make the file-like objects of binary data, but would not know which piece was where. That is like persisting a set of keys (binary data in memory) without the global map of which key is where - i.e., not too useful.

Martin Durant added 2 commits June 6, 2018 10:11
Enough to get to_zarr/from_zarr working
@martindurant
Member Author

With those changes, a simple zarr roundtrip does work.

Note: this stuff, if found useful, still needs extensive testing
@jrbourbeau
Member

Closing based on https://github.com/martindurant/filesystem_spec/pull/11#issue-209228566. @martindurant feel free to re-open if needed

@jrbourbeau jrbourbeau closed this Jun 17, 2019
@martindurant martindurant deleted the memory_fs branch February 9, 2021 19:13
5 participants