Skip to content

Conversation

@mfeurer
Copy link
Collaborator

@mfeurer mfeurer commented Mar 19, 2019

This PR implements an extension interface to OpenML-Python which abstracts away the deep connection with scikit-learn and allows using OpenML-Python with other machine learning libraries as well.

The main interface can be found in openml.extensions.extension_interface. A possible use for scikit-learn is given in openml.extensions.sklearn_extension.

The whole PR is very raw and is only intended to provide feedback to me. Missing parts are:

  • unit tests
  • mention in the documentation
  • docstrings
  • full type hints
  • make the API implementation for sklearn its own submodule
  • obtain full diff coverage

This supersedes #543

@janvanrijn @joaquinvanschoren @amueller @ArlindKadra @PGijsbers @Bilgecelik

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! First impression is very good :) after thinking it over a bit more I might find something to bicker about but I think it's fine to go ahead with these changes. In the process of working things out/implementing extensions for other modules we might find short-comings, but I say we can tackle them as they come.

@janvanrijn
Copy link
Member

I really like where this is going. Additional bonus: introduction of type hints. I marked down my first comments. Feel free to ping me for a next review.

@codecov-io
Copy link

codecov-io commented Mar 21, 2019

Codecov Report

Merging #647 into develop will increase coverage by 0.05%.
The diff coverage is 90.84%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #647      +/-   ##
===========================================
+ Coverage    90.94%   90.99%   +0.05%     
===========================================
  Files           32       36       +4     
  Lines         3391     3522     +131     
===========================================
+ Hits          3084     3205     +121     
- Misses         307      317      +10
Impacted Files Coverage Δ
openml/study/functions.py 96.87% <ø> (ø) ⬆️
openml/datasets/functions.py 95.47% <ø> (ø) ⬆️
openml/tasks/functions.py 86.45% <ø> (ø) ⬆️
openml/extensions/sklearn/__init__.py 100% <100%> (ø)
openml/testing.py 95.32% <100%> (+1.49%) ⬆️
openml/__init__.py 100% <100%> (ø) ⬆️
openml/config.py 89.47% <100%> (+0.18%) ⬆️
openml/extensions/__init__.py 100% <100%> (ø)
openml/runs/trace.py 91.02% <100%> (+0.17%) ⬆️
openml/flows/flow.py 94.17% <100%> (+0.09%) ⬆️
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ec429e...80ab989. Read the comment docs.

@mfeurer
Copy link
Collaborator Author

mfeurer commented Mar 27, 2019

Okay, I checked the coverage, and the coverage diff is mostly in the dataset submodule and thereby nothing which should be addressed by this pull request. @PGijsbers @joaquinvanschoren @janvanrijn @Bilgecelik this pull request is ready for review.

@mfeurer mfeurer requested review from PGijsbers and janvanrijn March 27, 2019 21:26
Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! No major remarks other than the multiple inheritance suggestion (which should be easy nonetheless) 👍

import sklearn.base

from openml.extensions import Extension, register_extension
from openml.extensions.sklearn.flow_functions import (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it might make sense to use double inheritance instead? E.g. basically make all functions in run_functions.py part of a SklearnRunExtension-class and all functions in flow_functions part of a SklearnFlowExtension-class and then simply use inheritance instead (both of these would inherit from Extension). Functions which are implemented in neither can be implemented here (currently only can_handle_model). So the rewritten extension would look like:

class SklearnExtension(SklearnRunExtension, SklearnFlowExtension):

    @staticmethod
    def can_handle_model(model: Any) -> bool:
        """Check whether a model is an instance of ``sklearn.base.BaseEstimator``.

        Parameters
        ----------
        model : Any

        Returns
        -------
        bool
        """
        return isinstance(model, sklearn.base.BaseEstimator)

?

I don't have much experience with multiple inheritance but I think it would work (I tried a toy example and that worked). It would help by:

  • Removing a lot of double documentation, since now you document the function twice, e.g. here and here. This can lead to inconsistencies (indeed, even the linked example has inconsistencies, though only a minor ones regarding spacing.
  • making it much clearer which functions are actually implemented in this file.
  • maintenance when e.g. function signatures change. Currently we should change the Extension interface, the method implementation in e.g. flow_functions.py and the redirect in extension.py. Using multiple inheritance saves us one step (since the method is not redeclared in extension.py)

Copy link
Member

@janvanrijn janvanrijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this great effort @mfeurer, this looks really great!

I left a lot of comments (learned from the best ;) ) but I think most of them can be resolved easily. The ones that are a bit bigger are

  • implement functions for extension interface in the appropriate class itself
  • try to simplify run_model_on_fold interface.
    As the PR is getting already quite big, I agree with moving these to a next PR, as long as we can all agree to resolve these before releasing a new version. (as it might impose quite some backward compatibility problems)

Additionally, the unit tests can probably be modularized better (moving extension dependent tests elsewhere) but also this does not seem like a priority to me.

All together this PR looks very good and with some effort it can be merged as soon as possible :)

@mfeurer
Copy link
Collaborator Author

mfeurer commented Apr 3, 2019

Okay, I have addressed all issues raised so for, please go ahead for a (hopefully) final review. Regarding failing CI:

  • appveyor/branch: this has been fixed on the development branch which is why appveyor/pr does not fail
  • travis-ci/pr: only fails for the PEP8 check, which does not fail for travis/push. I suggest fixing this with the next PR to the development branch
  • travis-ci/push: fails because of API can not return tagged runs OpenML#948, which has been fixed in the development branch already (and which is why they don't fail for travis-ci/pr).

@mfeurer mfeurer requested review from PGijsbers and janvanrijn April 3, 2019 09:25
PGijsbers
PGijsbers previously approved these changes Apr 3, 2019
@PGijsbers PGijsbers self-requested a review April 3, 2019 10:28
@PGijsbers PGijsbers dismissed their stale review April 3, 2019 10:30

Accidental accept

@PGijsbers
Copy link
Collaborator

PGijsbers commented Apr 3, 2019

I would accept but I don't feel this issue is resolved yet. Would like some clarification.

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the open issue, as said, it's good enough for me if it's good enough for Jan. But I don't want this to be of a negative note, because the changes in this PR are great 🎉 thanks for all the effort!

Copy link
Member

@janvanrijn janvanrijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had another look through this PR and personally have a feeling we can proceed and merge. If anything, this PR will make the library better maintainable, more stable and flexible.

One more thing that I want to check is the list of open TODO's after this PR, as I mentioned some in my previous pass. It would be good to make sure that this list is complete and that we are all on the same page regarding not releasing before we have solved all of them (or argued why we won't / can't solve them)

I would also add dependency structure to this list (my last open comment)

@mfeurer
Copy link
Collaborator Author

mfeurer commented Apr 8, 2019

Here's the list of TODOs:

  • Remove obtain_arff_trace and return the trace headers in each fold and later on check that they're all the same.
  • Simplify run_model_on_fold to accept X and y and do not split the data itself
  • Simplify the return value of run_model_on_fold to have functionality for transforming the predictions in the actual task classes.
  • Figure out how to best handle dependencies (as raised here)

1 similar comment
@mfeurer
Copy link
Collaborator Author

mfeurer commented Apr 8, 2019

Here's the list of TODOs:

  • Remove obtain_arff_trace and return the trace headers in each fold and later on check that they're all the same.
  • Simplify run_model_on_fold to accept X and y and do not split the data itself
  • Simplify the return value of run_model_on_fold to have functionality for transforming the predictions in the actual task classes.
  • Figure out how to best handle dependencies (as raised here)

@mfeurer mfeurer merged commit 0f8b7f0 into develop Apr 8, 2019
@mfeurer mfeurer deleted the extensions branch April 8, 2019 09:06
PGijsbers pushed a commit that referenced this pull request Apr 10, 2019
* draft extensions interface

* Change to new advised style of defining abstract base class.

* incorporate @pgijbers' feedback

* incorporate Jan's comments

* (hopefully) make the tests run again

* make more tests work again

* fix more tests?

* Move all files for the sklearn converter to a single location

* fix tests

* TST fix function call

* slight reorganization of the files

* TST fix wrong path

* TST fix wrong path

* MAINT add type hints to all methods touched in this PR

* factor a lot of extension functions to new file

* fix a few broken tests

* rename test files to reflect previous refactor

* fix unit tests

* fix unit tests

* add extension plugin mechanism

* pep8 & mypy

* save docstring progress

* fix?

* finish docstrings & simplify interface

* add extension interface to documentation

* PEP8 & doc building

* Address comments by Jan and Pieter

* progress dump

* tests, pep8, shuffle functions and tests around
@mfeurer mfeurer mentioned this pull request Apr 15, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants