Skip to content

MAINT: call model.fit directly if ndarrays passed to partial_fit#225

Closed
stsievert wants to merge 4 commits intodask:masterfrom
stsievert:partial-fit-task-within-task
Closed

MAINT: call model.fit directly if ndarrays passed to partial_fit#225
stsievert wants to merge 4 commits intodask:masterfrom
stsievert:partial-fit-task-within-task

Conversation

@stsievert
Copy link
Copy Markdown
Member

If partial_fit is called with NumPy arrays, why don't we call partial_fit directly with those arrays instead of making a new dask array?

I think this is the bug behind #221: _partial_fit would needlessly launch dask tasks.

@stsievert stsievert mentioned this pull request Jun 23, 2018
7 tasks
@stsievert
Copy link
Copy Markdown
Member Author

I'm using hasattr(model, 'estimator') to get the underlying estimator. We should probably move this to a utils function; maybe two function in wrappers.py named unwrap and wrap?

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to better digest these changes, and their motivation.

An alternative is to pass compute=False for ndarrays, extract the dask graph of the Delayed, and merge it with the caller's other stuff before doing compute.

x = da.from_array(x, chunks=x.shape)
if isinstance(y, np.ndarray):
y = da.from_array(y, chunks=y.shape)
if isinstance(x, np.ndarray) and isinstance(y, np.ndarray):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that just X is passed, and y is None.

if isinstance(y, np.ndarray):
y = da.from_array(y, chunks=y.shape)
if isinstance(x, np.ndarray) and isinstance(y, np.ndarray):
if hasattr(model, 'estimator'):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fragile. Why not go through the model?

@stsievert
Copy link
Copy Markdown
Member Author

What tests do we want to pass? I see two cases, and these are reflected in the types passed to test_fit:

  1. fit works regardless of model type and array type.
  2. fit only works for dask-ml models and dask arrays.

The current implementation implements option 1 in #224.

I'd rather be explicit about types and go with option 2. Then fit will have a very specific job: build a graph using dask that fits on blocks of data, and will not be a function that works regardless of input type. This would involve reverting #224.

I think we should give a better error message if ndarrays are passed, and point to-be-implemented unwrap and wrap functions. I can implement these as part of #221.

Thoughts?

@TomAugspurger
Copy link
Copy Markdown
Member

2 sounds better to me as well.

@stsievert stsievert closed this Jun 26, 2018
@stsievert stsievert deleted the partial-fit-task-within-task branch June 26, 2018 01:02
@stsievert
Copy link
Copy Markdown
Member Author

#234 moves this to Incremental.partial_fit, which is what I was trying to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants