DS-126 Update patch on Dask-ml v1.3.0 (#8)
Conversation
dask_ml/model_selection/_search.py
Outdated
    # format to keys with full information compatible with cv_extract functions
    keys, vals = _generate_fit_params_key_vals(fit_params, keys_filtered=[eval_weight_source])
    # dask evaluation requires wrapping into a function to allow it to be evaluated once the cv object is resolved
    def extract_param(cvs, k, v, n, fld):
Move extract_param out of the for loop; it does not need to be defined multiple times.
cv_extract_params is slightly different from extract_param. cv_extract_params is the original Dask function that gets the fit_params information in {fit_param_key: key_values} format for all the fit_params. extract_param is a one-time pull of a single fit_param, since it needs to be passed on to score for a pipeline before it has actually been computed.
I mean you can define extract_param before the for loop.
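For concreteness, the suggestion could look like this (a minimal sketch with hypothetical names and a placeholder body, assuming the task tuples are built in a per-fold loop as in the diff above):

```python
# Hypothetical sketch: define the helper once, before the loop, so the
# function object is not re-created on every iteration. The body of
# extract_param is a placeholder, not the real patch code.
def extract_param(cvs, k, v, n, fld):
    # placeholder: pull the fold-n value of a single fit param
    return v[n]

def build_fold_tasks(cv, keys, vals, n_splits):
    # only the task tuples differ per fold; the function object is shared
    return [(extract_param, cv, keys[0], vals[0], n, True) for n in range(n_splits)]
```

Hoisting the def out of the loop avoids rebuilding an identical closure per fold; only the per-fold task tuples change.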
FYI: This might change the current notebook behavior because of how test cases run on prod.
IanQS left a comment
Generally looks good, BUT I think there is some low-hanging fruit that would be really useful for readability and maintainability moving forward. Happy to approve after that.
    def _get_fit_params(cv, fit_params, n_splits):
        if not fit_params:
            return [(n, None) for n in range(n_splits)]

    def _generate_fit_params_key_vals(fit_params, keys_filtered=None):
If it's not too much of a pain, can you add typing? I'm assuming fit_params is a dictionary, but what is keys_filtered? And what does it do?
I could add typing to the docstring for sure but I am avoiding making these 3 files the only files with typing in the whole repo.
Yeah. I would just add comments rather than typing to "keep it consistent with Dask style". I know what this sounds like. Obviously, comments are not in "Dask style", but comments are good so maybe just keep it "Dask(ish) style".
        return eval_weight_source

    def _get_n_folds_fit_params(cv, fit_params, n_splits, keys_filtered=None):
Can we get typing on this? Also, for the return type, can we get a NamedTuple? It's a little hard to parse the return here. It would also make the below easier to read:
    for n, fld_fit_params in n_and_fold_fit_params:
        ...
        ...
        fld_fit_params[0],
        fld_fit_params[1],
and it's used in other places too, so I think returning a NamedTuple would be really beneficial.
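A sketch of the NamedTuple idea (the field names are guesses from the fld_fit_params[0] / fld_fit_params[1] indexing above, not the actual patch API):

```python
from typing import NamedTuple, Optional

class FoldFitParams(NamedTuple):
    # guessed field names for what the positional indices appear to hold
    fit_params: Optional[dict]   # currently accessed as fld_fit_params[0]
    eval_params: Optional[dict]  # currently accessed as fld_fit_params[1]

fld = FoldFitParams(fit_params={"sample_weight": [1, 2]}, eval_params=None)
```

A NamedTuple keeps positional indexing working (fld[0] is fld.fit_params), so existing call sites could migrate to named access incrementally.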
See above comment. Will add the best I can in docstrings.
    def cv_extract_params(cvs, keys, vals, n):
        return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}

    cvs: (CVCache): CV cache for CV information of folds
    keys: ((str,str)) fit params (name,full_name) key tuple
I think having a key that is just the full_name or name of the fit parameter is simpler, but v1.3.0 of dask-ml uses this tuple as the key, and I am avoiding changing this key format; I don't want to find out what the downstream impact of that change is for the sake of a simpler key scheme. This is also why _generate_fit_params_key_vals was necessary.
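For illustration, the (name, full_name) key scheme being preserved might be generated like this (a simplified stand-in for _generate_fit_params_key_vals; the "estimator__" prefix and treating keys_filtered as an allow-list are assumptions inferred from how keys[0]/vals[0] are indexed elsewhere in the diff):

```python
def generate_key_vals(fit_params, keys_filtered=None):
    # Simplified stand-in: each key is a (name, full_name) tuple, matching
    # the (k, tok) unpacking in cv_extract_params. The "estimator__" prefix
    # is a made-up qualified form, and keys_filtered is assumed to be an
    # allow-list that narrows the params down (e.g. to a single weight key).
    items = list(fit_params.items())
    if keys_filtered is not None:
        items = [(k, v) for k, v in items if k in keys_filtered]
    keys = [(name, "estimator__" + name) for name, _ in items]
    vals = [val for _, val in items]
    return keys, vals
```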
        (log_loss, False, True, LogisticRegression(), IMP_WT_LOG_REG_PARAMS, [2500000, 500000, 200000, 100000]),
        (brier_score_loss, False, True, LogisticRegression(), IMP_WT_LOG_REG_PARAMS, [2500000, 500000, 200000, 100000]),
    ])
    def test_sample_weight_cross_validation(
Looks like this is a ZEFR test case? If so, can we leave a comment here so that it can be easily found later?
tian-yi-zefr left a comment
Generally, it looks good to me. One thing I would do is test the dependency upgrade on ds-model-content-relevancy and make sure the upgrade works fine.
ryan-deak-zefr left a comment
The only thing I would like to see is comments about the meaning of the indices when indexing into the dask keys, vals, and params arrays.
    keys, vals = _generate_fit_params_key_vals(fit_params, keys_filtered=[eval_weight_source])
    # create the proper dask tasks to generate the train objects when computing.
    # Dask tasks are tuples of a function followed by its arguments
    w_train = (extract_param, cv, keys[0], vals[0], n, True)
Maybe a comment about what the 0 index is.
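The comment refers to the dask graph convention mentioned in the snippet above: a task is a tuple whose first element is a callable and whose remaining elements are its arguments, with graph keys among the arguments resolved before the call. A tiny self-contained resolver illustrates the idea (this mimics a dask scheduler in miniature; it is not dask code, and extract_param is a hypothetical stand-in):

```python
def run_task(graph, key):
    # Minimal resolver mimicking how a dask scheduler evaluates
    # (func, arg1, arg2, ...) task tuples: string arguments that are graph
    # keys are looked up (and evaluated) first, then func is applied.
    val = graph[key]
    if isinstance(val, tuple) and callable(val[0]):
        func, *args = val
        resolved = [run_task(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return val

def extract_param(cvs, k, v, n, fld):
    # hypothetical stand-in: index the per-fold value out of v for fold n
    return v[n]

graph = {
    "cv": "cv-cache-placeholder",
    # mirrors: w_train = (extract_param, cv, keys[0], vals[0], n, True)
    "w_train": (extract_param, "cv", ("sample_weight", "tok"), [10, 20, 30], 1, True),
}
```

Here keys[0]/vals[0] would be the first (and, after filtering, only) fit-param key/value pair; nothing runs until the scheduler walks the graph.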
        fields,
        p,
        fit_params,
        fld_fit_params[0],
Maybe a comment about what the 0 index is.
        fields,
        p,
        fit_params,
        fld_fit_params[0],
Maybe a comment about what the 0 index is.

NOTE:
A) For this PR, ignore all the commits not made by myself, as we probably don't have time to code review dask-ml v1.3.0.
B) Please ignore the code I haven't changed, i.e. the official dask code. I agree there are issues you could flag in how things currently are but, outside of C), please focus on the changes and functionality of the patch as the scope, not on what the official dask code could do better, although I did try to improve some variable names and documentation when possible.
C) If you believe there is a bug in the official dask code, please note it in this PR; the fix will need another PR and should probably be passed on to the official dask-ml repo.
This PR makes the patch compatible with dask-ml in its current state. It adds the patch's functionality of applying the weighting scheme to the test fold. It also adds 3 bits of extra functionality:
- support for eval_sample_weight, to allow a different weighting scheme on evaluation than on training
- support for lightGBM's eval_data_set, by resolving the proper test fold partitions if the user provides the eval_data_set
- a new test case for sklearn.pipeline; this is used by ZEFR, so it should probably be tested as well
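To make the first feature concrete, distinct train/eval weighting schemes might be expressed like this (the fit-param names mirror this PR's description; the exact API shape is an assumption, not confirmed dask-ml behavior):

```python
import numpy as np

# Hypothetical illustration: upweight positives during training while
# evaluating unweighted. The key names echo the PR description above.
y = np.array([0, 0, 1, 1, 1, 0])
sample_weight = np.where(y == 1, 5.0, 1.0)         # training weighting scheme
eval_sample_weight = np.ones_like(y, dtype=float)  # evaluation weighting scheme

fit_params = {
    "sample_weight": sample_weight,
    "eval_sample_weight": eval_sample_weight,
}
```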
Files changed to add functionality:

- tests/model_selection/dask_searchcv/test_model_selection.py
- dask_ml/model_selection/methods.py
- dask_ml/model_selection/_search.py