Skip to content

Replaced shutil and tempfile packages with joblib in caching transformers #14113

Merged
rth merged 3 commits intoscikit-learn:masterfrom
kirilldolmatov:my-feature
Jul 2, 2019
Merged

Replaced shutil and tempfile packages with joblib in caching transformers #14113
rth merged 3 commits intoscikit-learn:masterfrom
kirilldolmatov:my-feature

Conversation

@kirilldolmatov
Copy link
Copy Markdown
Contributor

Replaced shutil and tempfile packages with joblib functions according to Memory usage in official documentation

What does this implement/fix? Explain your changes.

This PR simplifies using of caching transformers within a Pipeline.

Any other comments?

This is my first PR.

Replaced shutil and tempfile packages with joblib functions according to Memory usage in [official documentation](https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html)
@kirilldolmatov kirilldolmatov changed the title replaced shutil and tempfile packages with joblib Replaced shutil and tempfile packages with joblib in caching transformers Jun 18, 2019
Copy link
Copy Markdown
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except for one comment below. Thanks @kirilldolmatov!

# Create a temporary folder to store the transformers of the pipeline
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)
location = './cachedir'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just "cachedir" for an OS independent path, otherwise LGTM. Thanks!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


# Delete the temporary cache before exiting
rmtree(cachedir)
memory.clear(warn=False)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked that this also removes the location folder?

Copy link
Copy Markdown
Contributor Author

@kirilldolmatov kirilldolmatov Jun 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I've checked after calling memory.clear(warn=False) location folder is removed. However, not main directory - '/cachedir', but subdirectory - 'sklearn' (./cachedir/joblib/sklearn). So after removing all caching files of transformers will be removed, but the directory (./cachedir/joblib) will remain.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then we still want to remove the cachedir though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found smth that can meet our requirements: clear_path. But, it requires one extra code line.

1. Made location path OS independent 
2. Add `clear_location()` for full removing location folder
@thomasjpfan
Copy link
Copy Markdown
Member

thomasjpfan commented Jun 19, 2019

This PR adds a 'cachedir' into the examples directory by directly following the joblib doc and then removes 'cachedir'. What was the issue with using a temporary folder (from mkdtemp) for caching in the example?

@kirilldolmatov
Copy link
Copy Markdown
Contributor Author

The issue will be the same: it's also possible to use a temporary folder (from mkdtemp):

cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)
memory.clear(warn=False)
memory.store_backend.clear_location(cachedir)

@thomasjpfan
Copy link
Copy Markdown
Member

Does

memory.clear(warn=False)
memory.store_backend.clear_location(cachedir)

do something that rmtree(cachedir) fails to do?

@glemaitre
Copy link
Copy Markdown
Member

@thomasjpfan Both solutions are equivalent. We either use the standard lib or joblib to achieve the same result. I am +0.5 for the change (it is nice that people find out about the joblib.Memory feature).

@kirilldolmatov
Copy link
Copy Markdown
Contributor Author

kirilldolmatov commented Jun 22, 2019

My initial PR replaced shutil and tempfile libs with joblib. Such solution looks simplifier, because we avoid import two extra libs and use only joblib. But, there is side effect - directory will remain after removing cache files. Therefore, I add extra one code line, that removes this directory. Moreover, such solution allows use mkdtemp, if user doesn't like to create folder in directly way.

@glemaitre glemaitre self-requested a review June 25, 2019 15:28
# Delete the temporary cache before exiting
rmtree(cachedir)
memory.clear(warn=False)
memory.store_backend.clear_location(location)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is not a good idea. store_backend is not documented publicly. I would use the rmtree to remove the folder. In addition, I would add a comment that memory.clear() remove the cache used for the computation and rmtree() will remove the folder where the cache was localized.

Copy link
Copy Markdown
Contributor Author

@kirilldolmatov kirilldolmatov Jun 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather agree with you. I add clear_location(location) (in my second commit) for removing location directory after discussions above. IMHO, it isn't necessary to remove the folder, because it can be use in the future. If we're willing to accept this, than my first commit is actual. In another case, it might be better to use rmtree. My original wish was to replace three libs with one.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, it isn't necessary to remove the folder, because it can be use in the future

In general, I would agree. However, we don't want to have a remaining folder after the example execution. That's why, I would make a comment before the rmtree explaining why there is this call and that in practice, you probably don't need it.

My original wish was to replace three libs with one.

It does not matter since they come in the standard library of Python

Copy link
Copy Markdown
Contributor Author

@kirilldolmatov kirilldolmatov Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've made new commit using rmtree(location) instead of clear_location(location)

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could just add a small comment, something like: we remove the temporary directory earlier created (optional).

Otherwise LGTM

# Delete the temporary cache before exiting
memory.clear(warn=False)
memory.store_backend.clear_location(location)
rmtree(location)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the temporary directory earlier created (optional)

@rth
Copy link
Copy Markdown
Member

rth commented Jul 2, 2019

Merging, thanks!

@rth rth merged commit faa9406 into scikit-learn:master Jul 2, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants