ENH buffer openml stream rather than reading all at once #16084
rth merged 15 commits into scikit-learn:master from …
Conversation
Quick memory profiling on master vs. this PR: [memory profile plots for master and this PR]

Thanks @thomasjpfan, though that's not the relevant portion, since now the content isn't even buffered until …

[memory profile plots: master vs. this branch]

On master and this PR: [memory profile plots]. This PR works as expected.
thomasjpfan
left a comment
I am happy with this. All it needs is a whats new entry in 0.23 tagged Efficiency.
sklearn/datasets/_openml.py
Outdated
```python
return_type = _arff.DENSE_GEN

arff_file = _arff.load((line.decode('utf-8')
                        for line in response),
```
@shashanksingh28 if this PR is merged before #14800 I'd suggest just making a _check_md5 helper which takes an iterable of bytes as input and returns a generator of bytes, checking md5 on the way.
Sounds good. I would rather wait for this to go in first and do the md5 check after...
sklearn/datasets/_openml.py
Outdated
```python
elif y.shape[1] == 0:
    y = None

fp.close()  # explicitly close HTTP connection after parsing
```
If any of the parsing fails, the connection would remain open.
Maybe we can pull the _convert_arff_data_dataframe and _convert_arff_data logic up:

```python
fp, arff = _download_data_arff(data_description['file_id'], return_sparse, ...)
nominal_attributes = None
frame = None
with closing(fp):
    if as_frame:
        columns = data_columns + target_columns
        frame = _convert_arff_data_dataframe(arff, columns, features_dict)
    else:
        X, y = _convert_arff_data(arff['data'], col_slice_x, ...)
if as_frame:
    ...
else:
    ...
```

An alternative would be to indent the whole parsing logic.
```python
fp, data_arff = _download_data_arff(data_description['file_id'],
                                    sparse, None, False)
data_downloaded = np.array(list(data_arff['data']), dtype='O')
fp.close()
```
So that the connection closes even if np.array(...) fails:

```python
with closing(fp):
    data_downloaded = np.array(list(data_arff['data']), dtype='O')
```
hmmm good points. I also need to review how this interacts with the caching.

It looks like it should work okay with caching actually.

I've realised that the progressive loading of the file also breaks some of the usefulness of …
We most likely need to refactor a little, to get something like:

```python
def fetch_openml(...):
    ....
    if as_frame:
        parse_arff = partial(_convert_arff_data_dataframe, columns=columns,
                             features_dict=features_dict)
    else:
        parse_arff = partial(_convert_arff_data, col_slice_x=..., ...)
    result = _download_and_parse_data_arff(..., parse_arff)
    ...

def _download_and_parse_data_arff(file_id, sparse, data_home, as_frame,
                                  parse_arff):
    url = _DATA_FILE.format(file_id)

    @_retry_with_clean_cache(url, data_home)
    def _download_parse_inner():
        arff_file = _download_data_arff(...)
        if as_frame:
            return parse_arff(arff_file)
        else:
            return parse_arff(arff_file['data'])

    return _download_parse_inner()
```

With this type of refactor …
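The `partial`-based dispatch in the refactor above can be illustrated in isolation. Here `convert` is a hypothetical stand-in for `_convert_arff_data`, not the real scikit-learn function:

```python
from functools import partial

def convert(data, col_slice_x, col_slice_y):
    # hypothetical stand-in for _convert_arff_data: slice each row into
    # feature columns (X) and target columns (y)
    X = [[row[i] for i in col_slice_x] for row in data]
    y = [[row[i] for i in col_slice_y] for row in data]
    return X, y

# Freeze the column slices up front, as fetch_openml would, so the
# download-and-parse helper only ever needs to call parse(data).
parse = partial(convert, col_slice_x=[0, 1], col_slice_y=[2])
X, y = parse([[1, 2, 3], [4, 5, 6]])
```

This is what lets the retry-decorated inner function stay agnostic about dataframe vs. array conversion.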
Here's my little refactor that ensures the failure will be retried with the appropriate scope. Not that the retry business is tested (should I bother??)

Another reviewer would be very welcome here!

The retry logic is independently tested, but not together with …

@NicolasHug Most of the diff of this PR is moving code around to allow for the stream close with a context manager. This is done to accommodate …
NicolasHug
left a comment
Looks good, minor concern about the retry logic now
doc/whats_new/v0.23.rst
Outdated
```rst
  :func:`datasets.make_moons` now accept two-element tuple.
  :pr:`15707` by :user:`Maciej J Mikulski <mjmikulski>`.

- |Efficiency| :func:`datasets.fetch_openml` no longer stores the full dataset
```
Suggested change:

```diff
- - |Efficiency| :func:`datasets.fetch_openml` no longer stores the full dataset
+ - |Efficiency| :func:`datasets.fetch_openml` has reduced memory usage because it no longer stores the full dataset
```
sklearn/datasets/_openml.py
Outdated
```python
# Note that if the data is dense, no reading is done until the data
# generator is iterated.
```
I'm not sure how to interpret this comment
Do you mean e.g. "Since we pass a generator, load() will read lines one by one and avoid excessive memory usage" ?
That'd be a useful comment IMHO
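For illustration, a minimal sketch of the lazy pattern that comment would describe, with an in-memory buffer standing in for the HTTP response:

```python
import io

# io.BytesIO stands in for the HTTP response object here
raw = io.BytesIO(b"@relation demo\n@attribute x numeric\n@data\n1\n2\n")

# Creating the generator expression reads nothing from the stream;
# lines are decoded one at a time as the consumer asks for them.
lines = (line.decode('utf-8') for line in raw)
header = next(lines)  # only now is the first line read and decoded
```

This is the mechanism that lets `_arff.load` avoid holding the whole response in memory.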
```python
col_slice_y = [int(features_dict[col_name]['index'])
               for col_name in target_columns]

col_slice_x = [int(features_dict[col_name]['index'])
               for col_name in data_columns]
for col_idx in col_slice_y:
    feat = features_list[col_idx]
    nr_missing = int(feat['number_of_missing_values'])
    if nr_missing > 0:
        raise ValueError('Target column {} has {} missing values. '
                         'Missing values are not supported for target '
                         'columns. '.format(feat['name'], nr_missing))
```
The column validation is moved into this function, which is decorated by _retry_with_clean_cache.
So a ValueError raised here will cause the decorator to warn("Invalid cache, redownloading file", RuntimeWarning) and re-run the function after clearing the cache, which is not necessary.
This is a fair point. I'll try to get some user-input validation out of the retry.
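To make the concern concrete, here is a simplified sketch of a retry-with-clean-cache decorator. The real `_retry_with_clean_cache` is keyed on a URL and `data_home`; the `clear_cache` callback below is a hypothetical stand-in:

```python
import warnings
from functools import wraps

def retry_with_clean_cache(clear_cache):
    """Sketch of the retry pattern under discussion: on any error, warn,
    clear the local cache, and call the function once more."""
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            try:
                return f(*args, **kwargs)
            except Exception:
                warnings.warn("Invalid cache, redownloading file",
                              RuntimeWarning)
                clear_cache()
                return f(*args, **kwargs)
        return wrapper
    return decorator

# A function that fails once with a validation ValueError triggers a
# needless cache clear and a second call -- the concern raised above.
cleared = []
attempts = []

@retry_with_clean_cache(lambda: cleared.append(True))
def flaky():
    attempts.append(1)
    if len(attempts) == 1:
        raise ValueError("target column has missing values")
    return "data"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = flaky()
```

Keeping user-input validation outside the decorated function avoids the spurious re-download.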
Thanks for the review @NicolasHug. I've tried to address your comments, but haven't been able to test locally thanks to an OS update breaking my conda build...
sklearn/datasets/_openml.py
Outdated
```python
    return f(*args, **kw)
except HTTPError:
    raise
except ValueError:
```
```python
except (HTTPError, ValueError):
    raise
```

Should we also pass through ArffException as well?
> Should we also pass through ArffException as well?

No, because they may be due to data corruption.
That was my concern about not letting ValueError through too...
There are some ValueErrors in _arff.py as well. In principle, we only want to retry when there is a parsing error or an error from _open_openml_url, which is scoped to:

```python
def _load_arff(...):
    response = _open_openml_url(url, data_home)
    with closing(response):
        arff = _arff.load((line.decode('utf-8')
                           for line in response),
                          return_type=return_type,
                          encode_nominal=not as_frame)
    return arff
```

Can we place this in its own function and use the retry wrapper on this new function?
No, that code only reads the headers and returns a generator, which is the basis of this change. The parsing errors will actually occur when that generator is iterated during conversion to arrays.
If the ValueErrors in _arff.py cannot be raised due to data corruption, then the current solution is fine.
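The deferred-error behaviour discussed above is easy to demonstrate with a plain generator: the failure only surfaces on iteration, not when the generator is created.

```python
def rows():
    # hypothetical row generator standing in for the ARFF data generator
    yield "1,2"
    raise ValueError("corrupt row")  # simulated parsing error

gen = rows()        # no error here: creating a generator runs no body code
first = next(gen)   # the first row parses fine
try:
    next(gen)       # the error only appears during iteration
    caught = None
except ValueError as exc:
    caught = str(exc)
```

This is why the retry wrapper must cover the conversion step, not just the call that builds the generator.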
The only ValueError I see from corruption may be in scikit-learn/sklearn/externals/_arff.py, lines 243 to 249 (at fa9cf22).
On a side note, we can define _load_arff as follows:

```python
def _load_arff(..., as_frame):
    response = _open_openml_url(url, data_home)
    with closing(response):
        arff = _arff.load((line.decode('utf-8')
                           for line in response),
                          return_type=return_type,
                          encode_nominal=not as_frame)
    if as_frame:
        return _convert_arff_data_dataframe(arff, columns, features_dict)
    else:
        return _convert_arff_data(arff['data'], col_slice_x,
                                  col_slice_y, shape)
```
Not sure entirely what you're suggesting... The loader needs to also return
metadata from arff, not just data. That's why it's all processed here, but
you're right that it might be possible to simplify.
To be concrete, I was thinking of this: jnothman#8
thomasjpfan
left a comment
PR updated such that only errors from the downloading and parsing will trigger a redownload.
still LGTM
Perhaps we can get another review here? Not essential, but a nice memory boost for fetch_openml, and it unblocks work on the checksum PR.

As an aside, this may help resolve #16629 and allow us to turn back on the memory profiler for gallery examples.
rth
left a comment
Thanks @jnothman! LGTM, after a somewhat superficial review. I think our openml fetcher code is non-trivial and sometimes difficult to follow. I wonder if type annotations could help with readability, and if we can move part of this logic upstream.
Maybe I should have synced master, but CI was green so I merged it. Looking into a fix in #17047.

Great! Thanks!
I've not benchmarked yet. This should reduce memory requirements when fetching from OpenML.
This no longer explicitly closes the open URL handler, since it is required until the ARFF has been completely read. We could return the stream object from _download_data_arff if we want to close it.