Replaced shutil and tempfile packages with joblib in caching transformers #14113
Replaced shutil and tempfile packages with joblib in caching transformers #14113rth merged 3 commits intoscikit-learn:masterfrom
Conversation
Replaced shutil and tempfile packages with joblib functions according to Memory usage in [official documentation](https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html)
rth
left a comment
There was a problem hiding this comment.
Looks good except for one comment below. Thanks @kirilldolmatov!
| # Create a temporary folder to store the transformers of the pipeline | ||
| cachedir = mkdtemp() | ||
| memory = Memory(location=cachedir, verbose=10) | ||
| location = './cachedir' |
There was a problem hiding this comment.
Just "cachedir" for an OS independent path, otherwise LGTM. Thanks!
|
|
||
| # Delete the temporary cache before exiting | ||
| rmtree(cachedir) | ||
| memory.clear(warn=False) |
There was a problem hiding this comment.
Have you checked that this also removes the location folder?
There was a problem hiding this comment.
yes, I've checked after calling memory.clear(warn=False) location folder is removed. However, not main directory - '/cachedir', but subdirectory - 'sklearn' (./cachedir/joblib/sklearn). So after removing all caching files of transformers will be removed, but the directory (./cachedir/joblib) will remain.
There was a problem hiding this comment.
OK, then we still want to remove the cachedir though.
There was a problem hiding this comment.
I found smth that can meet our requirements: clear_path. But, it requires one extra code line.
1. Made location path OS independent 2. Add `clear_location()` for full removing location folder
|
This PR adds a |
|
The issue will be the same: it's also possible to use a temporary folder (from cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)
memory.clear(warn=False)
memory.store_backend.clear_location(cachedir) |
|
Does memory.clear(warn=False)
memory.store_backend.clear_location(cachedir)do something that |
|
@thomasjpfan Both solutions are equivalent. We either use the standard lib or joblib to achieve the same result. I am +0.5 for the change (it is nice that people find out about the joblib.Memory feature). |
|
My initial PR replaced |
| # Delete the temporary cache before exiting | ||
| rmtree(cachedir) | ||
| memory.clear(warn=False) | ||
| memory.store_backend.clear_location(location) |
There was a problem hiding this comment.
IMO this is not a good idea. store_backend is not documented publicly. I would use the rmtree to remove the folder. In addition, I would add a comment that memory.clear() remove the cache used for the computation and rmtree() will remove the folder where the cache was localized.
There was a problem hiding this comment.
I'd rather agree with you. I add clear_location(location) (in my second commit) for removing location directory after discussions above. IMHO, it isn't necessary to remove the folder, because it can be use in the future. If we're willing to accept this, than my first commit is actual. In another case, it might be better to use rmtree. My original wish was to replace three libs with one.
There was a problem hiding this comment.
IMHO, it isn't necessary to remove the folder, because it can be use in the future
In general, I would agree. However, we don't want to have a remaining folder after the example execution. That's why, I would make a comment before the rmtree explaining why there is this call and that in practice, you probably don't need it.
My original wish was to replace three libs with one.
It does not matter since they come in the standard library of Python
There was a problem hiding this comment.
OK, I've made new commit using rmtree(location) instead of clear_location(location)
glemaitre
left a comment
There was a problem hiding this comment.
If you could just add a small comment, something like: we remove the temporary directory earlier created (optional).
Otherwise LGTM
| # Delete the temporary cache before exiting | ||
| memory.clear(warn=False) | ||
| memory.store_backend.clear_location(location) | ||
| rmtree(location) |
There was a problem hiding this comment.
remove the temporary directory earlier created (optional)
|
Merging, thanks! |
Replaced shutil and tempfile packages with joblib functions according to Memory usage in official documentation
What does this implement/fix? Explain your changes.
This PR simplifies using of caching transformers within a Pipeline.
Any other comments?
This is my first PR.