[MRG] numpy_pickle: several enhancements#626
Conversation
|
Ok, the CI needs some C libraries to be installed, marking this PR as WIP... |
|
Real-world benchmark data could be the adult census data parsed as a pandas dataframe:

    import pandas as pd
    import os
    from urllib.request import urlretrieve

    url = ("https://archive.ics.uci.edu/ml/machine-learning-databases"
           "/adult/adult.data")
    local_filename = os.path.basename(url)
    if not os.path.exists(local_filename):
        print("Downloading Adult Census datasets from UCI")
        urlretrieve(url, local_filename)
    names = ("age, workclass, fnlwgt, education, education-num, "
             "marital-status, occupation, relationship, race, sex, "
             "capital-gain, capital-loss, hours-per-week, "
             "native-country, income").split(', ')
    data = pd.read_csv(local_filename, names=names)

This is a mixed-type data structure with both categorical and numerical values. |
|
I updated the gist bench with the Adult Census and KDDCUP99 datasets (the latter is rather big, 2 GB uncompressed). The benchmarks were run on a laptop with an i7 and an SSD.
|
|
Updated the bench results. The results on KDDCUP99 show that LZ4 is by far the most efficient at dump time. |
|
Very nice! |
|
In your benchmark code you should also report additional columns for ratios:
|
ok |
|
Edited with the ratios |
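The ratio columns suggested above could be computed along these lines. This is an illustrative sketch: the field names and numbers are made up, not the actual gist output.

```python
# Hypothetical benchmark rows: durations in seconds, sizes in MB.
raw = {"dump_s": 1.20, "load_s": 0.80, "size_mb": 100.0}
lz4 = {"dump_s": 1.50, "load_s": 0.90, "size_mb": 40.0}

def with_ratios(result, baseline):
    """Return a copy of `result` extended with ratios relative to `baseline`."""
    out = dict(result)
    out["dump_ratio"] = result["dump_s"] / baseline["dump_s"]
    out["load_ratio"] = result["load_s"] / baseline["load_s"]
    # Compression ratio: how many times smaller than the raw dump.
    out["compression_ratio"] = baseline["size_mb"] / result["size_mb"]
    return out

row = with_ratios(lz4, raw)
print(row["compression_ratio"])  # 2.5
```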
Codecov Report
@@ Coverage Diff @@
## master #626 +/- ##
=========================================
- Coverage 95.13% 95.1% -0.03%
=========================================
Files 39 40 +1
Lines 5484 5663 +179
=========================================
+ Hits 5217 5386 +169
- Misses 267 277 +10
Continue to review full report at Codecov.
|
|
Would this support Blosc too? |
If one can provide an external file-like buffered object that uses blosc for compression, then yes. You'll just have to register the new compressor. |
|
Update on this PR:
This PR is ready for review. |
ogrisel
left a comment
Here is a short review. I think the documentation should be updated and you should also add or update an example to show the tradeoffs of common compressors (no compression, lz4, zlib, lzma): dump / load speed, disk usage, ability to mmap, standard library vs additional dependencies...
joblib/compressor.py
Outdated
    def register_compressor_file(compressor_name, file_prefix, file_object,
                                 force=False):
Please rename this function to register_compressor_file as we do not want to register a "file" but a compression algorithm to be made available to dump. The fact that it implements the file object API is a detail.
Please rename this function to register_compressor_file
Do you mean register_compressor ?
joblib/compressor.py
Outdated
    def register_compressor_file(compressor_name, file_prefix, file_object,
                                 force=False):
        """Register a compressor file object."""
"""Register a compressor implementing the traditional Python file object API."""
joblib/compressor.py
Outdated
    wbits = 31  # zlib compressor/decompressor wbits value for gzip format.

    def register_compressor_file(compressor_name, file_prefix, file_object,
Please rename file_object to compressor.
joblib/compressor.py
Outdated
    if file_object is None or not issubclass(file_object, io.BufferedIOBase):
        raise ValueError("File object should inherit io.BufferedIOBase, "
                         "'{}' given.".format(file_object))
Instead of checking inheritance explicitly, I think we should do a duck-type check: make sure that the object has the methods we need (write, read, ...) and leave implementers the freedom not to inherit from io.BufferedIOBase if they want.
joblib/numpy_pickle.py
Outdated
    if compress_method == 'lz4' and lz4 is None:
        raise ValueError('LZ4 in not installed. Consider installing it '
                         'from PyPI: https://pypi.python.org/pypi/lz4 '
                         'using: "pip install lz4"')
Let's stay neutral w.r.t. the installation method (pip, conda, debian...), it's up to the user to decide.
    raise ValueError('LZ4 in not installed. Consider installing it: '
                     'http://python-lz4.readthedocs.io/')
Please also add an entry to CHANGES. |
|
@ogrisel, comments addressed. I added an example in the gallery, adapted from the gist mentioned above. Comments welcome! |
ogrisel
left a comment
I think the example should be simplified to be more readable and demonstrate code that the reader is likely to write. In particular, our users will not write a compression benchmark framework.
So to simplify the example you should select a single dataset and demonstrate the dump and load times of uncompressed, gzip and lz4 sequentially in the main flow of the example (not in a benchmark loop) and comment on the results (gzip is reasonably fast and compresses well, lz4 is much faster but compresses less than gzip)...
doc/persistence.rst
Outdated
    >>> joblib.load(filename + '.lz4')
    [('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]

    The LZ4 compression in Joblib relies on the ``lz4.frame.LZ4File`` file object.
I think the users don't care about this implementation detail. Those advanced users who really want to know about this will naturally read the source code. What the median user cares about is why they would want to use LZ4 instead of gzip. You can say that while LZ4 does not compress as much as gzip, it is very fast both when dumping and loading data, and therefore often results in shorter overall dump and load times than uncompressed data, even on SSD drives.
|
I think the example should be simplified to be more readable and demonstrate
code that the reader is likely to write. In particular, our users will not
write a compression benchmark framework.
One option is to write two examples: one that demonstrates simple usage,
and another that does a benchmark.
|
|
@ogrisel, I tried to simplify the example as we discussed IRL. I think the plots could be improved, any help/good advice will be appreciated :) |
examples/compressors_comparison.py
Outdated
    # The compression format is identified by the standard magic number present
    # at the beginning of the file. Joblib uses this information to determine
    # the compression method used.
    # This is the case for all compression methods supported by Joblib.
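The magic-number detection described in this comment can be sketched as follows. The prefix table and the function are illustrative stand-ins, not joblib's actual registry:

```python
import gzip
import os
import tempfile

# File prefixes (magic numbers) of a few standard compression formats.
_PREFIXES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ": "xz",
}

def sniff_compressor(filename):
    """Guess the compression format from the file's leading bytes."""
    with open(filename, "rb") as f:
        head = f.read(8)
    for prefix, name in _PREFIXES.items():
        if head.startswith(prefix):
            return name
    return "raw"

# Write a gzip file and detect its format from the magic number alone.
path = os.path.join(tempfile.mkdtemp(), "payload.gz")
with gzip.open(path, "wb") as f:
    f.write(b"some data")
print(sniff_compressor(path))  # gzip
```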
This does not seem like valid rst: the indentation after "note" is strange. But maybe I am wrong. Did you check that it renders right?
I pushed an update with the valid rst: much better rendering indeed ! Thanks.
|
I haven't looked at the details of this PR, but the general direction is awesome! Thanks a lot! |
ogrisel
left a comment
This is really going in the right direction. A few more things to improve though :)
examples/compressors_comparison.py
Outdated
    ###########################################################################
    # The compression level is using the default value, 3, which is in general
    # the best compromise between good compression and speed.
which is, in general, a good compromise between compression and speed.
examples/compressors_comparison.py
Outdated
    ###########################################################################
    # Then measure the size of the raw dumped data on disk:
    raw_file_size = os.stat(pickle_file).st_size / 2**20
1 MB is 1e6 bytes. Mega is standardized to always be a prefix for one million in the international unit system. It makes things simpler.
Multiples of 1024 are for KiB (kibibyte), MiB (mebibyte) and so on. We don't care about those: leave powers of 1024 to the hard-drive manufacturers of the 90s and just use decimal powers to keep things simple.
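The difference between the two unit conventions in a few lines of plain Python (the dump size here is hypothetical):

```python
# Decimal vs binary size units: 1 MB = 1e6 bytes, 1 MiB = 2**20 bytes.
# Reporting sizes in MB (decimal powers) keeps the example simpler.
size_bytes = 52428800  # hypothetical dump size (50 * 2**20 bytes)

size_mb = size_bytes / 1e6     # megabytes (SI, decimal)
size_mib = size_bytes / 2**20  # mebibytes (binary)
print("%.2f MB / %.2f MiB" % (size_mb, size_mib))  # 52.43 MB / 50.00 MiB
```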
Also please add:

    print("Raw dump file size: %0.3f MB" % raw_file_size)

    start = time.time()
    with open(pickle_file, 'wb') as f:
        dump(data, f)
    raw_dump_duration = time.time() - start
Please print the measurements along the way to make linear reading of this example more interesting.
    print("Raw dump duration: %0.3fs" % raw_dump_duration)
examples/compressors_comparison.py
Outdated
    local_filename = os.path.basename(url)
    urlretrieve(url, local_filename)

    data = pd.read_csv(local_filename, names=names)
Please do:

    data = pd.read_csv(url, names=names)

to make the code simpler and avoid having to delete a temporary file at the end of the notebook.
joblib/compressor.py
Outdated
    If mode is 'wb', compresslevel can be a number between 1
    and 9 specifying the level of compression: 1 produces the least
    compression, and 9 (default) produces the most compression.
Why use 9 as the default? I think we tend to use 3 elsewhere. Better stay consistent and use 3 everywhere in joblib unless there is a good reason not to do so.
Since I just copy-pasted this from the numpy_pickle_utils module, this was here from the beginning, I think. I changed that.
doc/persistence.rst
Outdated
    If the ``lz4`` package is installed, this compression method is automatically
    available with the dump function.

    >>> joblib.dump(to_persist, filename + '.lz4', compress='lz4')  # doctest: +ELLIPSIS
There is no need to add compress='lz4' right? I think using the ".lz4" extension should be explicit enough to automatically select the lz4 compressor.
Actually, I just tried and lz4 is not automatically used when the filename ends with the ".lz4" extension. I think this should be the case by default.
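Auto-selecting the compressor from the file extension, as requested here, could be sketched like this. The mapping and the function name are assumptions for illustration, not joblib's implementation:

```python
import os

# Hypothetical extension -> compressor-name mapping for dump().
_EXTENSIONS = {".z": "zlib", ".gz": "gzip", ".bz2": "bz2",
               ".xz": "xz", ".lzma": "lzma", ".lz4": "lz4"}

def compressor_from_filename(filename, default=None):
    """Pick a compressor name from the filename extension, if recognized."""
    _, ext = os.path.splitext(filename)
    return _EXTENSIONS.get(ext, default)

print(compressor_from_filename("model.pkl.lz4"))  # lz4
print(compressor_from_filename("model.pkl"))      # None
```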
joblib/test/test_numpy_pickle.py
Outdated
    # Check that lz4 cannot be used when dependency is not available.
    fname = tmpdir.join('test.no.lz4').strpath
    data = 'test data'
    compressor = 'lz4'
The test would be more readable by not defining this local variable and using the literal compressor='lz4' directly in the code.
joblib/test/test_numpy_pickle.py
Outdated
    with raises(ValueError) as excinfo:
        numpy_pickle.dump(data, fname, compress=compressor)
    excinfo.match('LZ4 in not installed. Consider installing it: '
                  'http://python-lz4.readthedocs.io/')
Please add another check without setting the compress kwarg:

    with raises(ValueError) as excinfo:
        # fname ends with .lz4
        numpy_pickle.dump(data, fname)
    excinfo.match('LZ4 in not installed. Consider installing it: '
                  'http://python-lz4.readthedocs.io/')
joblib/test/test_numpy_pickle.py
Outdated
    data = 'test data'
    numpy_pickle.dump(data, fname, compress=compressor)

    assert open(fname, 'rb').read(len(_LZ4_PREFIX)) == _LZ4_PREFIX
Please use the with statement to close the file explicitly:

    with open(fname, 'rb') as f:
        assert f.read(len(_LZ4_PREFIX)) == _LZ4_PREFIX

    matplotlib
    pillow
    sphinx-gallery
    pandas
Before I forget: it would be good to mention in the README.rst that pandas is needed to run some examples.
The notes about building the docs already mention that some dependencies are required and gives the command to install them: https://github.com/joblib/joblib#building-the-docs.
I am sure you are aware that joblib users may want to run the examples without building the doc. They may even be completely unaware that both are related. Add a sentence after the optional numpy dependency and/or look at how scikit-learn does it.
I have seen this has been addressed, sorry for the noise.
|
comments addressed @ogrisel |
|
Merged! Thanks @aabadie. Now that the infrastructure is in place, you might want to add support for zstd: https://facebook.github.io/zstd/ It's uniformly better than most other compression algorithms in the compression ratio vs speed tradeoff (except for LZ4 that is the fastest of all at the cost of worse compression ratio). |
|
Thanks for merging @ogrisel ! |
|
Exciting!
The next joblib release is going to be a killer!
|
|
Yeah, zstd is an interesting one as it supports learning a dictionary for compression, which can make it viable for compressing even small pieces of data well (assuming that data is relatively similar). ref: https://github.com/facebook/zstd#the-case-for-small-data-compression |
This PR fixes #583 by extending and improving the compression/decompression mechanism used by the dump/load functions. Here is what has changed:

- a new register_compressor_file() function
- lz4 and snappy compressors available by default if installed
- an extended dump function signature: the compress parameter can now be passed as a string

Some words about register_compressor_file: the idea is to register a compressor based on a name, the file prefix (the magic number of the compression format) and a file-like object. After that, dump can be used with compress='<compressor name>' as parameter. Passing a tuple parameter (<compressor name>, <level>) can also be used, but the compression level will not be taken into account. This is a limitation of the current implementation: it's not possible to finely tweak registered compressors when they are used.

How to use it with a CompressorFile object:

Otherwise, if one has lz4 installed, this compressor can be used directly without the need to register it: |
Later potential improvements:

- add an example about compressors in the gallery; maybe this gist could be adapted with real world data
- improve reloading of lz4-compressed pickles when lz4 is not available
- allow fine tuning of registered compressors

cc @ogrisel @lsorber @GaelVaroquaux @lesteve if you want to have a look.