This repository was archived by the owner on Nov 27, 2023. It is now read-only.

DS-126 Update patch on Dask-ml v1.3.0 #8

Merged
jamesbsilva merged 13 commits into merge_only from merge_only_then_patch on Apr 28, 2020

Conversation

@jamesbsilva

NOTE:
A) For this PR, please ignore all the commits not made by me, as we probably don't have time to code review dask-ml v1.3.0 itself.
B) Please also ignore the code I haven't changed (i.e., the official dask code). I agree there are things in the official code worth reviewing, but outside of C) please keep the scope to the changes and functionality of the patch, not what the official dask code could do better, although I did try to improve some variable names and documentation where possible.
C) If you believe there is a bug in the official dask code, please note it in this PR; it will need a separate PR for the bug fix, which should probably be passed on to the official dask-ml repo.

This PR makes the patch compatible with dask-ml in its current state. It carries over the patch's functionality of applying the weighting scheme to the test fold, and it adds three pieces of extra functionality:

support for eval_sample_weight, to allow a different weighting scheme on evaluation than on training
support for the LightGBM eval_data_set, by resolving the proper test fold partitions if the user provides eval_data_set
a new test case for sklearn.pipeline; this is used by ZEFR so should probably be tested as well
Files changed to add functionality:

tests/model_selection/dask_searchcv/test_model_selection.py
dask_ml/model_selection/methods.py
dask_ml/model_selection/_search.py

# format to keys with full information compatible with cv_extract functions
keys, vals = _generate_fit_params_key_vals(fit_params, keys_filtered=[eval_weight_source])
# dask evaluation requires wrapping this in a function so it can be evaluated once the cv object is resolved
def extract_param(cvs, k, v, n, fld):

@tian-yi-zefr Apr 22, 2020


Move extract_param out of the for loop; it does not need to be defined multiple times.

@jamesbsilva (Author)


cv_extract_params is slightly different from extract_param. cv_extract_params is the original Dask function that gets the fit_params information in {fit_param_key: key_values} format for all of the fit_params. extract_param is a one-time pull of a single fit_param, since it needs to be passed on to score for a pipeline before it has actually been computed.
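To illustrate the distinction, here is a minimal sketch; FakeCVCache and the token strings are hypothetical stand-ins for dask-ml's real CVCache and dask tokens, with only the shapes of the two calls taken from the PR:

```python
# Hypothetical sketch contrasting the two extraction helpers discussed above.
# cv_extract_params builds the full {fit_param_key: value} dict for a fold;
# extract_param is a one-time pull of a single fit_param for that fold.

class FakeCVCache:
    """Stand-in for dask-ml's CVCache: slices a param down to a fold's rows."""
    def __init__(self, folds):
        self.folds = folds  # fold index -> list of row indices

    def extract_param(self, key, value, n):
        # Slice the full-length value down to fold n's rows.
        return [value[i] for i in self.folds[n]]

def cv_extract_params(cvs, keys, vals, n):
    # All fit_params for fold n, keyed by the short param name.
    return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}

def extract_param(cvs, k, v, n, fld):
    # One-time pull of a single fit_param for fold n (fld would select
    # train vs. test extraction in the real patch; ignored here).
    return cvs.extract_param(k, v, n)

cvs = FakeCVCache(folds={0: [0, 1], 1: [2, 3]})
keys = [("sample_weight", "sample_weight-token")]
vals = [[10, 20, 30, 40]]

all_params = cv_extract_params(cvs, keys, vals, n=0)
one_param = extract_param(cvs, "sample_weight-token", [10, 20, 30, 40], n=1, fld=True)
```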


I mean you can define extract_param before the for loop.
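The suggestion, as a generic sketch (the names and loop body here are illustrative, not the PR's actual code):

```python
# Illustrative sketch of the review suggestion: hoist a helper defined
# inside a loop out of the loop, since its body does not depend on
# anything created inside the loop.

# Before: the function object is re-created on every iteration.
def build_tasks_inner(items):
    tasks = []
    for n, item in enumerate(items):
        def extract(value, idx):      # redefined on each pass
            return value[idx]
        tasks.append((extract, item, n))
    return tasks

# After: defined once, before the loop; behavior is unchanged.
def extract(value, idx):
    return value[idx]

def build_tasks_hoisted(items):
    return [(extract, item, n) for n, item in enumerate(items)]

items = [["a", "b"], ["c", "d"]]
inner = build_tasks_inner(items)
hoisted = build_tasks_hoisted(items)
# Applying each (func, value, idx) task gives the same result either way.
resolved = [f(v, i) for f, v, i in hoisted]
```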

@jamesbsilva

FYI: This might change the current notebook behavior because of how test cases run on prod

http://dask.zefr.com:8888/lab/workspaces/james-chr/tree/mnt/efs/notebooks/james.silva/Hotfix/run_test_cases_on_prod_build.ipynb


@IanQS left a comment


Generally looks good, BUT I think there is some low-hanging fruit that would be really useful for readability and maintainability moving forward. Happy to approve after.

def _get_fit_params(cv, fit_params, n_splits):
    if not fit_params:
        return [(n, None) for n in range(n_splits)]

def _generate_fit_params_key_vals(fit_params, keys_filtered=None):

If it's not too much of a pain, can you add typing? I'm assuming fit_params is a dictionary, but what is keys_filtered? And what does it do?

@jamesbsilva (Author)


I could add typing to the docstrings for sure, but I am avoiding making these 3 files the only files with typing in the whole repo.


Yeah. I would just add comments rather than typing to "keep it consistent with Dask style". I know what this sounds like. Obviously, comments are not in "Dask style", but comments are good so maybe just keep it "Dask(ish) style".

return eval_weight_source


def _get_n_folds_fit_params(cv, fit_params, n_splits, keys_filtered=None):

Can we get typing on this? Also, for the return type, can we get a NamedTuple? It's a little hard to parse the return here. It would also make the code below easier to read:

for n, fld_fit_params in n_and_fold_fit_params:
    ...
    ...
    fld_fit_params[0],
    fld_fit_params[1],

and it's used in other places too so I think returning a NamedTuple would be really beneficial
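What the reviewer is asking for, as a rough sketch; the field names are guesses at what the tuple's two slots mean, not the patch's actual return type:

```python
from typing import Any, List, NamedTuple

# Hypothetical sketch of the NamedTuple suggestion: give the per-fold
# fit-params pair named fields so call sites can write .fold and .params
# instead of the opaque [0] and [1].
class FoldFitParams(NamedTuple):
    fold: int    # which CV fold these params belong to
    params: Any  # the extracted fit_params payload for that fold

def get_n_folds_fit_params(n_splits: int) -> List[FoldFitParams]:
    # Stand-in for _get_n_folds_fit_params returning named tuples.
    return [FoldFitParams(fold=n, params={"sample_weight": None})
            for n in range(n_splits)]

pairs = get_n_folds_fit_params(3)
# Indexing still works (a NamedTuple is a tuple), but names are clearer:
first = pairs[0]
assert first[0] == first.fold
```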

@jamesbsilva (Author)


See above comment. Will add what I can in docstrings.

@jamesbsilva

jamesbsilva commented Apr 23, 2020

FYI:
I will add more typing in docstrings, but as an approximate rule: if it is in _search.py, the objects are actually Dask tasks (tuples with the information on how to calculate the task); if it is in methods.py, the object has already been computed and is a more concrete post-compute object, like a numpy array, etc.

TL;DR: _search.py is pre-compute() and methods.py is mid-compute()
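The pre-compute vs. mid-compute distinction in a minimal, dask-free sketch; task tuples of the form (callable, arg, ...) are dask's real graph convention, but the tiny executor below is just for illustration:

```python
# In dask's graph format, a task is a tuple: (callable, arg1, arg2, ...).
# "_search.py world" builds these tuples without computing anything;
# "methods.py world" is the code inside the functions, where the
# arguments have already been resolved to concrete objects.

def scale(weights, factor):
    # methods.py world: weights is already a concrete list here.
    return [w * factor for w in weights]

# _search.py world: describe the work, don't do it yet.
task = (scale, [1.0, 2.0, 3.0], 10.0)

def run_task(t):
    # Toy stand-in for dask's scheduler: call the function on its args.
    func, *args = t
    return func(*args)

result = run_task(task)
```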


def cv_extract_params(cvs, keys, vals, n):
    return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}

cvs (CVCache): CV cache for CV information of folds
keys ((str, str)): fit params (name, full_name) key tuple
@jamesbsilva (Author)


I think having a key that is just the full_name or name of the fit parameter would be simpler, but v1.3.0 of dask-ml uses this tuple as the key, and I am avoiding changing this key format: I don't want to find out what the downstream impact of that change would be for the sake of a simpler key scheme. This is also why _generate_fit_params_key_vals was necessary.


@zachary-mcpher left a comment


A few comments

@jamesbsilva

[Screenshot, Apr 23, 2020: results of the test run]


@zachary-mcpher left a comment


LGTM

(log_loss, False, True, LogisticRegression(), IMP_WT_LOG_REG_PARAMS, [2500000, 500000, 200000, 100000]),
(brier_score_loss, False, True, LogisticRegression(), IMP_WT_LOG_REG_PARAMS, [2500000, 500000, 200000, 100000]),
])
def test_sample_weight_cross_validation(

Looks like this is a ZEFR test case? If so, can we leave a comment here so that it can be easily found later?

@jamesbsilva (Author)


will do


@tian-yi-zefr left a comment


Generally, it looks good to me. One thing I would do is test the dependency upgrade on ds-model-content-relevancy and make sure the upgrade works fine.


@zexuan-zhou left a comment


lgtm


@ryan-deak-zefr left a comment


The only thing I would like to see is comments about the meaning of the indices when indexing into the dask keys, values, and params arrays.

keys, vals = _generate_fit_params_key_vals(fit_params, keys_filtered=[eval_weight_source])
# create the proper dask tasks to generate the train objects when computing.
# Dask tasks are tuples of function followed by arguments
w_train = (extract_param, cv, keys[0], vals[0], n, True)

Maybe a comment about what the 0 index is.
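A plausible inline answer, as a sketch; this reading of the index is inferred from the keys_filtered=[eval_weight_source] call above, and the helper below is a simplified stand-in, not dask-ml's actual _generate_fit_params_key_vals:

```python
# Sketch of the kind of comment being requested. Because the key/val
# generator was called with a single-element keys_filtered list, keys and
# vals come back as parallel length-1 lists, so index 0 selects the one
# filtered fit_param: the eval weight source.

def generate_fit_params_key_vals(fit_params, keys_filtered=None):
    # Simplified stand-in: keep only the filtered keys, in order.
    wanted = keys_filtered if keys_filtered is not None else list(fit_params)
    keys = [(k, f"{k}-token") for k in wanted]  # (name, full_name) pairs
    vals = [fit_params[k] for k in wanted]
    return keys, vals

fit_params = {"sample_weight": [1, 2, 3], "eval_sample_weight": [4, 5, 6]}
keys, vals = generate_fit_params_key_vals(
    fit_params, keys_filtered=["eval_sample_weight"]
)
# keys[0] / vals[0]: the single filtered entry, i.e. the eval weight source.
```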

fields,
p,
fit_params,
fld_fit_params[0],

Maybe a comment about what the 0 index is.

fields,
p,
fit_params,
fld_fit_params[0],

Maybe a comment about what the 0 index is.
