Integrate scikit-learn's `set_output` method into `TransactionEncoder` by it176131 · Pull Request #1087 · rasbt/mlxtend

it176131 · 2024-03-25T03:16:24Z

Code of Conduct

Description

This defines the `get_feature_names_out` method in the `TransactionEncoder` class to expose the `set_output` method.

Related issues or pull requests

Pull Request Checklist

Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
- See 1c4d328
Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
- See 45cb6cd, b21bb21, and 1a02dd5
Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
- See 35dbeef
Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
- 20 preexisting unit tests did not pass before work was started. I added two new unit tests to cover the work requested. Both new tests pass and the new work does not affect the prior tests.
Checked for style issues by running flake8 ./mlxtend
- Prior work outside of this PR does not pass flake8 v7.0.0 guidelines. See mlxtend/feature_selection/column_selector.py:81

- Added two new tests, `test_get_feature_names_out` and `test_set_output`. Passing these tests is a step towards the output of `TransactionEncoder` being formatted as a pandas.DataFrame by default.

- Added `get_feature_names_out` method to `TransactionEncoder` to expose the `set_output` method.

- Updated test to include more checks. It is now back in a failing state.

- Updated `test_set_output` docstring to be more explicit.
- Added numpy assertion to check that the transformed output columns match the original `columns_` attribute for `test_set_output`.
- Added numpy assertion to check that the `get_feature_names_out` output matches the original `columns_` attribute for `test_get_feature_names_out`.

- Added logic similar to that in `sklearn.base.ClassNamePrefixFeaturesOutMixin` and `sklearn.base.OneToOneFeatureMixin` for the get_feature_names_out method.

- Updated the user guide to show both the get_feature_names_out method and the set_output method.

- Updated changelog to reflect new features.

review-notebook-app · 2024-03-25T03:16:29Z

Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks.

- Updated issue number.

- Updated issue number (again) to reflect the PR link instead of the issue link.

- Ran isort over imports to fix failing check in PR.

rasbt · 2024-03-26T01:42:22Z

Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

- Increased scikit-learn version to minimum required for set_output to work.

it176131 · 2024-03-26T01:50:28Z

> Thanks a lot for the PR, really appreciate it and hope to review it in the upcoming days!

Sure thing! I wasn't sure what to put regarding newer version release dates so I bumped the patch number by one and set the release date to "TBD". Lmk if I should change anything.

it176131 · 2024-03-27T22:36:20Z

Hey @rasbt it looks like the pipeline checks keep failing for files unrelated to this PR. Should I log a separate issue and open a new PR to fix those too?

rasbt · 2024-03-28T13:37:27Z

I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.

From a quick look, there seems to be a more major problem though:

      AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
      [end of output]

Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.

And thanks again for your efforts ... I wish I could be more responsive but it's a busy week
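A quick way to check whether an environment is affected by the `ImpImporter` error quoted above is sketched below. It relies only on the standard library; `pkgutil.ImpImporter` was removed in Python 3.12, so tooling that still references it fails on that interpreter. The variable names are illustrative.

```python
import pkgutil
import sys

# `pkgutil.ImpImporter` exists up to Python 3.11 and is gone from 3.12 on.
has_imp_importer = hasattr(pkgutil, "ImpImporter")
running_312_or_newer = sys.version_info >= (3, 12)

print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(f"pkgutil.ImpImporter available: {has_imp_importer}")

if running_312_or_newer and not has_imp_importer:
    # Older build tooling that still touches this attribute raises
    # AttributeError here, matching the traceback in the CI log.
    print("Packages that reference pkgutil.ImpImporter will fail here.")
```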

it176131 · 2024-03-28T15:33:46Z

> I'd say as long as the tests for the new feature pass then it should be ok. There have been some other tests that have been failing in some submodules for several months due to certain software version updates and minor precision differences I think. I haven't had time to investigate yet.
>
> From a quick look, there seems to be a more major problem though:
>
>       AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?
>       [end of output]
>
> Maybe that's related to Python 3.12. That's definitely something worth fixing so we can find out whether the relevant tests for this PR pass. Maybe doing it in a separate branch first may make sense so the PR doesn't become too cluttered. I also changed the setting here in hope tests will now run automatically each time the PR is updated.
>
> And thanks again for your efforts ... I wish I could be more responsive but it's a busy week

I'll do some digging. If I can replicate the issue and come up with a solution I'll submit it in a separate PR.

rasbt · 2024-03-28T16:27:01Z

No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

rasbt · 2024-03-28T16:33:28Z

Hm, I think the new failures could be due to the sklearn version bump

it176131 · 2024-03-28T19:05:00Z

> No worries, it should be fixed now via #1089 (except for the transaction encoder test, but that's something we can address in this PR)

I noticed that scikit-learn 1.1.3 is being installed in the github workflow. Can we bump it to 1.2.2 as that is required for set_output to work?

rasbt · 2024-03-28T19:08:06Z

Yes, please feel free to bump it up. We probably need to fix some other places that have not been adjusted for the most recent version though

- Bumped scikit-learn version up to 1.2.2 to match requirements.txt.

- Bumped scikit-learn version up to 1.2.2 to match environment.yml and requirements.txt.

it176131 · 2024-03-29T02:53:30Z

Running `test_inverse_transform` locally, it appears the error is raised when `np.array(data_sorted)` is called. This is because `data_sorted` is a list of lists where the nested lists may have varying lengths (same for `oht.inverse_transform(expect)`). I see two potential ways around this:

1. Use a simpler assertion like `assert data_sorted == oht.inverse_transform(expect)`
2. Add `dtype="object"` to the `np.array` constructors

Both result in the test passing. I'm in favor of the first option as it doesn't require any changes to the data_sorted and the expected output type is a list rather than an np.ndarray.
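To make the two options concrete, here is a small sketch with made-up transactions. The `round_trip` list is only a stand-in for what `oht.inverse_transform(expect)` returns, not the real test data.

```python
import numpy as np

# Hypothetical transactions of unequal length, standing in for `data_sorted`.
data_sorted = [["Apple", "Beer"], ["Beer", "Milk", "Rice"], ["Apple"]]
# Stand-in for the list of lists returned by inverse_transform.
round_trip = [list(row) for row in data_sorted]

# Option 1: compare the nested lists directly, no array conversion needed.
assert data_sorted == round_trip

# Option 2: ask NumPy for an object array explicitly; each element is a list.
as_objects = np.array(data_sorted, dtype="object")
print(as_objects.shape)

# Without dtype="object", recent NumPy releases refuse ragged input,
# while older releases only emit a deprecation warning.
try:
    np.array(data_sorted)
except ValueError as exc:
    print("ValueError:", exc)
```

The first option keeps the comparison in plain Python, which matches the fact that the expected output type is a list.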

- Updated `test_inverse_transform` to passing state by removing conversion to numpy array.

it176131 · 2024-03-29T04:24:09Z

Turns out scikit-learn version 1.2.2 had a bug in the `set_output` API that failed when the input was not a pandas.DataFrame. See scikit-learn/scikit-learn#27044 for details. The fix came in scikit-learn version 1.3.1. Bumping the scikit-learn version to 1.3.1 fixes the issue. Commit(s) to follow.
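The version constraints discussed in this thread can be summarised in a small guard. This is only a sketch: `parse_version` and `set_output_status` are hypothetical helpers, not part of mlxtend or scikit-learn, and the thresholds assume `set_output` arrived in the 1.2 series with the fix landing in 1.3.1 as described above.

```python
import re

# Thresholds taken from the discussion: `set_output` needs the 1.2 series,
# and the fix for non-DataFrame input landed in 1.3.1.
MIN_FOR_SET_OUTPUT = (1, 2, 0)
MIN_WITH_FIX = (1, 3, 1)


def parse_version(text):
    """Turn a string such as '1.3.1' into a comparable tuple of three ints."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def set_output_status(version_text):
    """Classify a scikit-learn version string against the two thresholds."""
    version = parse_version(version_text)
    if version < MIN_FOR_SET_OUTPUT:
        return "unavailable"
    if version < MIN_WITH_FIX:
        return "available, predates the 1.3.1 fix"
    return "available with the fix"


for candidate in ("1.1.3", "1.2.2", "1.3.1"):
    print(candidate, "->", set_output_status(candidate))
```

In practice the string would come from `sklearn.__version__`; it is passed in explicitly here so the sketch runs without scikit-learn installed.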

- Updated scikit-learn version to 1.3.1 to integrate fix from scikit-learn/scikit-learn#27044

modified: environment.yml

- Updated scikit-learn version to 1.3.1 to integrate fix from scikit-learn/scikit-learn#27044

modified: requirements.txt

- Updated scikit-learn version to 1.3.1 to integrate fix from scikit-learn/scikit-learn#27044

it176131 · 2024-03-29T04:33:04Z

Of course bumping the version results in more failed tests 🤦

I'm going to make a separate PR to handle the scikit-learn version bump and the failed unit tests. THEN maybe this will work. Didn't realize it was going to take so much work 😅

rasbt · 2024-03-29T16:45:42Z

Arg sorry. Yeah, a sklearn bump was overdue but I recently didn't have the time to look into it since I wasn't using the affected features. If this is too much work, don't worry about it, I can understand if you want to drop this. I could revisit the version bump in the upcoming weeks then, address this, and then merge your PR once it's addressed.

codecov · 2024-03-30T14:49:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.32%. Comparing base (e82c9c5) to head (f018a8d).

❗ Current head f018a8d differs from pull request most recent head 44961b5. Consider uploading reports for the commit 44961b5 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1087      +/-   ##
==========================================
+ Coverage   78.29%   78.32%   +0.03%     
==========================================
  Files         196      196              
  Lines       11140    11157      +17     
  Branches     1404     1404              
==========================================
+ Hits         8722     8739      +17     
  Misses       2200     2200              
  Partials      218      218


it176131 · 2024-03-30T15:35:32Z

@rasbt looks like all checks passed. Ready to merge 😎

rasbt

This looks great. Thanks so much for the efforts! And sorry again about the unit test hassles!

mlxtend/preprocessing/transactionencoder.py

it176131 added 7 commits March 24, 2024 19:40

modified: test_transactionencoder.py

45cb6cd

- Added two new tests, `test_get_feature_names_out` and `test_set_output`. Passing these tests is a step towards the output of `TransactionEncoder` being formatted as a pandas.DataFrame by default.

modified: transactionencoder.py

9434575

- Added `get_feature_names_out` method to `TransactionEncoder` to expose the `set_output` method.

modified: tests/test_transactionencoder.py

b21bb21

- Updated test to include more checks. It is now back in a failing state.

modified: transactionencoder.py

0167c8f

- Added logic similar to that in `sklearn.base.ClassNamePrefixFeaturesOutMixin` and `sklearn.base.OneToOneFeatureMixin` for the get_feature_names_out method.

modified: docs/sources/user_guide/preprocessing/TransactionEncoder.ipynb

35dbeef

- Updated the user guide to show both the get_feature_names_out method and the set_output method.

modified: docs/sources/CHANGELOG.md

1c4d328

- Updated changelog to reflect new features.

it176131 added 3 commits March 24, 2024 22:20

modified: docs/sources/CHANGELOG.md

3f5496c

- Updated issue number.

modified: docs/sources/CHANGELOG.md

3ecb711

- Updated issue number (again) to reflect the PR link instead of the issue link.

modified: mlxtend/preprocessing/transactionencoder.py

8c0ca72

- Ran isort over imports to fix failing check in PR.

modified: requirements.txt

2931b7a

- Increased scikit-learn version to minimum required for set_output to work.

Merge branch 'master' into issue_1085

977fd1d

it176131 added 2 commits March 28, 2024 21:20

modified: environment.yml

96e2a36

- Bumped scikit-learn version up to 1.2.2 to match requirements.txt.

modified: .github/workflows/python-package-conda.yml

09d9f24

- Bumped scikit-learn version up to 1.2.2 to match environment.yml and requirements.txt.

modified: mlxtend/preprocessing/tests/test_transactionencoder.py

833d31e

- Updated `test_inverse_transform` to passing state by removing conversion to numpy array.

This was referenced Mar 30, 2024

Most recent scikit-learn results in several failed unit tests #1090

Closed

Most recent scikit-learn results in several failed unit tests #1091

Merged

Merge branch 'master' into issue_1085

f018a8d

rasbt approved these changes Mar 30, 2024

rasbt added 3 commits March 30, 2024 14:05

Update mlxtend/preprocessing/transactionencoder.py

f059ab7

Update mlxtend/preprocessing/transactionencoder.py

bf012d7

Update mlxtend/preprocessing/transactionencoder.py

44961b5

rasbt merged commit 506a4d5 into rasbt:master Mar 30, 2024

it176131 deleted the issue_1085 branch March 30, 2024 19:56

it176131 mentioned this pull request Mar 31, 2024

Update index handling in PandasAdapter scikit-learn/scikit-learn#28731

Closed
