This repository was archived by the owner on Nov 27, 2023. It is now read-only.

DS-126 Update patch on Dask-ml v1.3.0 #8

Merged
jamesbsilva merged 13 commits into merge_only from merge_only_then_patch on Apr 28, 2020

Conversation

@jamesbsilva

NOTE:
A) For this PR, please ignore all the commits not made by me, as we probably don't have time to code review dask-ml v1.3.0 itself.
B) Please also ignore the code I haven't changed (i.e., the official dask code). I agree there are things in the official code worth reviewing, but outside of C) please keep the scope to the changes and functionality of the patch, not what the official dask code could do better, although I did try to improve some variable names and documentation where possible.
C) If you believe there is a bug in the official dask code, please note it in this PR; it will need a separate PR for the bug fix, which should probably be passed on to the official dask-ml repo.

This PR makes the patch compatible with dask-ml in its current state. It carries over the patch's functionality of applying the weighting scheme to the test fold, and it adds three pieces of extra functionality:

support for eval_sample_weight, to allow a different weighting scheme on evaluation than on training
support for the LightGBM eval_data_set, by resolving the proper test fold partitions if the user provides eval_data_set
a new test case for sklearn.pipeline; this is used by ZEFR so should probably be tested as well
Files changed to add functionality:

tests/model_selection/dask_searchcv/test_model_selection.py
dask_ml/model_selection/methods.py
dask_ml/model_selection/_search.py

# format to keys with full information compatible with cv_extract functions
keys, vals = _generate_fit_params_key_vals(fit_params, keys_filtered=[eval_weight_source])
# dask evaluation requires wrapping this in a function so it can be evaluated once the cv object is resolved
def extract_param(cvs, k, v, n, fld):

@tian-yi-zefr Apr 22, 2020


Move extract_param out of the for loop; it does not need to be defined multiple times.

@jamesbsilva (Author)


cv_extract_params is slightly different from extract_param. cv_extract_params is the original Dask function that gets the fit_params information in {fit_param_key: key_values} format for all of the fit_params. extract_param is a one-time pull of a single fit_param, since it needs to be passed on to score for a pipeline before it has actually been computed.
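To illustrate the distinction, here is a minimal sketch; FakeCVCache and the token strings are hypothetical stand-ins for dask-ml's real CVCache and dask tokens, with only the shapes of the two calls taken from the PR:

```python
# Hypothetical sketch contrasting the two extraction helpers discussed above.
# cv_extract_params builds the full {fit_param_key: value} dict for a fold;
# extract_param is a one-time pull of a single fit_param for that fold.

class FakeCVCache:
    """Stand-in for dask-ml's CVCache: slices a param down to a fold's rows."""
    def __init__(self, folds):
        self.folds = folds  # fold index -> list of row indices

    def extract_param(self, key, value, n):
        # Slice the full-length value down to fold n's rows.
        return [value[i] for i in self.folds[n]]

def cv_extract_params(cvs, keys, vals, n):
    # All fit_params for fold n, keyed by the short param name.
    return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}

def extract_param(cvs, k, v, n, fld):
    # One-time pull of a single fit_param for fold n (fld would select
    # train vs. test extraction in the real patch; ignored here).
    return cvs.extract_param(k, v, n)

cvs = FakeCVCache(folds={0: [0, 1], 1: [2, 3]})
keys = [("sample_weight", "sample_weight-token")]
vals = [[10, 20, 30, 40]]

all_params = cv_extract_params(cvs, keys, vals, n=0)
one_param = extract_param(cvs, "sample_weight-token", [10, 20, 30, 40], n=1, fld=True)
```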


I mean you can define extract_param before the for loop.
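The suggestion, as a generic sketch (the names and loop body here are illustrative, not the PR's actual code):

```python
# Illustrative sketch of the review suggestion: hoist a helper defined
# inside a loop out of the loop, since its body does not depend on
# anything created inside the loop.

# Before: the function object is re-created on every iteration.
def build_tasks_inner(items):
    tasks = []
    for n, item in enumerate(items):
        def extract(value, idx):      # redefined on each pass
            return value[idx]
        tasks.append((extract, item, n))
    return tasks

# After: defined once, before the loop; behavior is unchanged.
def extract(value, idx):
    return value[idx]

def build_tasks_hoisted(items):
    return [(extract, item, n) for n, item in enumerate(items)]

items = [["a", "b"], ["c", "d"]]
inner = build_tasks_inner(items)
hoisted = build_tasks_hoisted(items)
# Applying each (func, value, idx) task gives the same result either way.
resolved = [f(v, i) for f, v, i in hoisted]
```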

@jamesbsilva

FYI: This might change the current notebook behavior because of how test cases run on prod

http://dask.zefr.com:8888/lab/workspaces/james-chr/tree/mnt/efs/notebooks/james.silva/Hotfix/run_test_cases_on_prod_build.ipynb


@IanQS left a comment


Generally looks good, BUT I think there is some low-hanging fruit that would be really useful for readability and maintainability moving forward. Happy to approve after.

def _get_fit_params(cv, fit_params, n_splits):
    if not fit_params:
        return [(n, None) for n in range(n_splits)]

def _generate_fit_params_key_vals(fit_params, keys_filtered=None):

If it's not too much of a pain, can you add typing? I'm assuming fit_params is a dictionary, but what is keys_filtered? And what does it do?

@jamesbsilva (Author)


I could add typing to the docstrings for sure, but I am avoiding making these 3 files the only files with typing in the whole repo.


Yeah. I would just add comments rather than typing to "keep it consistent with Dask style". I know what this sounds like. Obviously, comments are not in "Dask style", but comments are good so maybe just keep it "Dask(ish) style".

return eval_weight_source


def _get_n_folds_fit_params(cv, fit_params, n_splits, keys_filtered=None):

Can we get typing on this? Also, for the return type, can we get a NamedTuple? It's a little hard to parse the return here. It would also make the code below easier to read:

for n, fld_fit_params in n_and_fold_fit_params:
    ...
    ...
    fld_fit_params[0],
    fld_fit_params[1],

and it's used in other places too so I think returning a NamedTuple would be really beneficial
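What the reviewer is asking for, as a rough sketch; the field names are guesses at what the tuple's two slots mean, not the patch's actual return type:

```python
from typing import Any, List, NamedTuple

# Hypothetical sketch of the NamedTuple suggestion: give the per-fold
# fit-params pair named fields so call sites can write .fold and .params
# instead of the opaque [0] and [1].
class FoldFitParams(NamedTuple):
    fold: int    # which CV fold these params belong to
    params: Any  # the extracted fit_params payload for that fold

def get_n_folds_fit_params(n_splits: int) -> List[FoldFitParams]:
    # Stand-in for _get_n_folds_fit_params returning named tuples.
    return [FoldFitParams(fold=n, params={"sample_weight": None})
            for n in range(n_splits)]

pairs = get_n_folds_fit_params(3)
# Indexing still works (a NamedTuple is a tuple), but names are clearer:
first = pairs[0]
assert first[0] == first.fold
```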

@jamesbsilva (Author)


See above comment. Will add what I can in docstrings.

@jamesbsilva

jamesbsilva commented Apr 23, 2020

FYI:
I will add more typing in docstrings, but as an approximate rule: if it is in _search.py, the objects are actually Dask tasks (tuples with the information on how to calculate the task); if it is in methods.py, the object has already been computed and is a more concrete post-compute object, like a numpy array, etc.

TL;DR: _search.py is pre-compute() and methods.py is mid-compute()
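The pre-compute vs. mid-compute distinction in a minimal, dask-free sketch; task tuples of the form (callable, arg, ...) are dask's real graph convention, but the tiny executor below is just for illustration:

```python
# In dask's graph format, a task is a tuple: (callable, arg1, arg2, ...).
# "_search.py world" builds these tuples without computing anything;
# "methods.py world" is the code inside the functions, where the
# arguments have already been resolved to concrete objects.

def scale(weights, factor):
    # methods.py world: weights is already a concrete list here.
    return [w * factor for w in weights]

# _search.py world: describe the work, don't do it yet.
task = (scale, [1.0, 2.0, 3.0], 10.0)

def run_task(t):
    # Toy stand-in for dask's scheduler: call the function on its args.
    func, *args = t
    return func(*args)

result = run_task(task)
```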


def cv_extract_params(cvs, keys, vals, n):
    return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}

cvs (CVCache): CV cache for CV information of folds
keys ((str, str)): fit params (name, full_name) key tuple
@jamesbsilva (Author)


I think having a key that is just the full_name or name of the fit parameter would be simpler, but v1.3.0 of dask-ml uses this tuple as the key, and I am avoiding changing this key format: I don't want to find out what the downstream impact of that change would be for the sake of a simpler key scheme. This is also why _generate_fit_params_key_vals was necessary.


@zachary-mcpher left a comment


A few comments

@jamesbsilva

[Screenshot, Apr 23, 2020: results of the test run]


@zachary-mcpher left a comment


LGTM

(log_loss, False, True, LogisticRegression(), IMP_WT_LOG_REG_PARAMS, [2500000, 500000, 200000, 100000]),
(brier_score_loss, False, True, LogisticRegression(), IMP_WT_LOG_REG_PARAMS, [2500000, 500000, 200000, 100000]),
])
def test_sample_weight_cross_validation(

Looks like this is a ZEFR test case? If so, can we leave a comment here so that it can be easily found later?

@jamesbsilva (Author)


will do


@tian-yi-zefr left a comment


Generally, it looks good to me. One thing I would do is test the dependency upgrade on ds-model-content-relevancy and make sure the upgrade works fine.


@zexuan-zhou left a comment


lgtm


@ryan-deak-zefr left a comment


The only thing I would like to see is comments about the meaning of the indices when indexing into the dask keys, values, and params arrays.

keys, vals = _generate_fit_params_key_vals(fit_params, keys_filtered=[eval_weight_source])
# create the proper dask tasks to generate the train objects when computing.
# Dask tasks are tuples of function followed by arguments
w_train = (extract_param, cv, keys[0], vals[0], n, True)

Maybe a comment about what the 0 index is.
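A plausible inline answer, as a sketch; this reading of the index is inferred from the keys_filtered=[eval_weight_source] call above, and the helper below is a simplified stand-in, not dask-ml's actual _generate_fit_params_key_vals:

```python
# Sketch of the kind of comment being requested. Because the key/val
# generator was called with a single-element keys_filtered list, keys and
# vals come back as parallel length-1 lists, so index 0 selects the one
# filtered fit_param: the eval weight source.

def generate_fit_params_key_vals(fit_params, keys_filtered=None):
    # Simplified stand-in: keep only the filtered keys, in order.
    wanted = keys_filtered if keys_filtered is not None else list(fit_params)
    keys = [(k, f"{k}-token") for k in wanted]  # (name, full_name) pairs
    vals = [fit_params[k] for k in wanted]
    return keys, vals

fit_params = {"sample_weight": [1, 2, 3], "eval_sample_weight": [4, 5, 6]}
keys, vals = generate_fit_params_key_vals(
    fit_params, keys_filtered=["eval_sample_weight"]
)
# keys[0] / vals[0]: the single filtered entry, i.e. the eval weight source.
```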

fields,
p,
fit_params,
fld_fit_params[0],

Maybe a comment about what the 0 index is.

fields,
p,
fit_params,
fld_fit_params[0],

Maybe a comment about what the 0 index is.
