Feature/whole file memmap#1230
Conversation
instead of memmapping each array (which opens a new file per array memmapped), memmap the file with the python standard library mmap then use numpy.ndarray.__new__ to create views into the memmap for each array (as per docs in numpy.memmap)
Previous to the PR when memmapping each memmapped array would consume one open file. For files with large numbers of arrays this increases the liklihood of hitting the maximum number of open files for a single process and also drastically slows down file open times. This PR memmaps the entire file (when memmapping is enabled) and provides the memmap (a python mmap.mmap object from the standard library) to each memmapped array (via a call to ndarray.__new__). File updates (asdf.update) complicates the handling of this whole file memmapping as it can result in changes to the file on disk which require regeneration of the memmap. The test test_block_data_change checks for this regeneration.
On Windows if 0 is provided as the number of bytes to mmap, the file pointer end up at a negative index which causes issues with seek and an OSError on __exit__. Instead, seek to the end of the file, get the location with tell, and provide the non-zero size to mmap.
Prior code closed all memmaps when BlockManger.close was called. This was used during the edit command when the yaml grew larger than the previously allocated space. An call to close was added to allow tests of the edit command to work on Windows.
Refactors memmap handling for AsdfFile.update into flush_memmap and close_memmap added to generic_io classes. Cleans up a few comments.
a3fb9c8 to
0581bd8
Compare
asdf/asdf.py
Outdated
| # close memmaps so they will regenerate | ||
| if fd.can_memmap(): | ||
| fd.close_memmap() |
There was a problem hiding this comment.
Should this be under the finally?
There was a problem hiding this comment.
That is a much better way to do it. I'll make that change.
There was a problem hiding this comment.
Hopefully that will work. It is nice because it puts it cleanly into the context manager framework being used in that case.
WilliamJamieson
left a comment
There was a problem hiding this comment.
Overall these changes look great.
| if self._memmapped and self._data is not None: | ||
| if NUMPY_LT_1_7: # pragma: no cover | ||
| try: | ||
| self._data.flush() | ||
| except ValueError: | ||
| pass | ||
| else: | ||
| self._data.flush() | ||
| if self._data._mmap is not None: | ||
| self._data._mmap.close() |
There was a problem hiding this comment.
Nice, we should try to find more of these bits left over from supporting old versions of dependencies and remove them.
| def close_memmap(self): | ||
| """ | ||
| Close the memmapped file (if one was mapped with memmap_array) | ||
| """ | ||
| raise NotImplementedError() | ||
|
|
||
| def flush_memmap(self): | ||
| """ | ||
| Flush any pending writes to the memmapped file (if one was mapped with | ||
| memmap_array) | ||
| """ | ||
| raise NotImplementedError() | ||
|
|
There was a problem hiding this comment.
Could you add an error message here describing that this should be implemented if needed.
Also, do we need these implemented by all the child classes?
There was a problem hiding this comment.
I added a rather generic error message to these (and the NotImplementedError in memmap_array). Improvements to the wording are appreciated.
https://github.com/braingram/asdf/blob/feature/whole_file_memmap/asdf/generic_io.py#L617
I believe all of the calls to memmap_array, flush_memmap and close_memmap are within conditional blocks that check can_memmap. RealFile is the only subclass that returns True for can_memmap (I'm not sure if others would be possible) and RealFile has all of these implemented. I want to say a mixin might make sense here but I am open to suggestions.
There was a problem hiding this comment.
I dislike mixins in general. I was more asking in case all the subclasses where implementing their own version in which case we would want this to be an abc method which is required to be implemented by the subclasses rather than having the error message.
Since this is the case I like what you have done.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
mmap access changes in whole file memmap asdf-format#1230 may have made this flag redundant.
Instead of memmaping each array (which consumes one open file descriptor per array) this PR will memmap the entire file (when enabled) and provide the mmap.mmap to each array.
This has the added benefit of greatly increasing open speed for files with a large number of arrays. For example the current jwst ifupost nirspec reference file has 240 arrays. With this PR (on my machine) open takes ~1.5 seconds compared to ~7 without this PR.
I'm opening this as draft to run the downstream tests.