Skip to content

ENH Adds FeatureHasher support to pypy#23023

Merged
lorentzenchr merged 10 commits intoscikit-learn:mainfrom
thomasjpfan:pypy_feature_hasher
Apr 8, 2022
Merged

ENH Adds FeatureHasher support to pypy#23023
lorentzenchr merged 10 commits intoscikit-learn:mainfrom
thomasjpfan:pypy_feature_hasher

Conversation

@thomasjpfan
Copy link
Copy Markdown
Member

Reference Issues/PRs

Fixes #11540

What does this implement/fix? Explain your changes.

This PR enables FeatureHasher to work on pypy by using c++ vectors in the transform call. Running this benchmark:

Benchmark
import numpy as np
import re
from sklearn.datasets import fetch_20newsgroups
from collections import defaultdict
from time import perf_counter
from statistics import mean, stdev
from sklearn.feature_extraction._hashing_fast import transform as _hashing_transform


def tokens(doc):
    return (tok.lower() for tok in re.findall(r"\w+", doc))


def token_freqs(doc):
    freq = defaultdict(int)
    for tok in tokens(doc):
        freq[tok] += 1
    return freq


n_features = 2**20
n_repeats = 100

raw_data, _ = fetch_20newsgroups(subset="all", return_X_y=True)
raw_X = [token_freqs(d).items() for d in raw_data]

durations = []
for i in range(n_repeats):
    t0 = perf_counter()
    _hashing_transform(
        raw_X, n_features, np.float64, alternate_sign=True, seed=n_repeats
    )
    duration = perf_counter() - t0
    durations.append(duration)

mean_dur = mean(durations)
std_dur = stdev(durations)
print(f"{mean_dur:.4f} +/- {std_dur:.4f}")

Here are my results:

0.2824 +/- 0.0025 : PR
0.2892 +/- 0.0020 : main

Any other comments?

Int32 and Int64 were added to the vector sentinel to support the hasher.

Copy link
Copy Markdown
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. C++ rocks.

return ITYPECODE


cdef class StdVectorSentinelInt32(StdVectorSentinel):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to look like tempita material 😄

@jeremiedbb
Copy link
Copy Markdown
Member

THe pypy run did not get it through :(

@thomasjpfan
Copy link
Copy Markdown
Member Author

Currently, PyPy already timeouts on main. I increased the timeout duration for PyPy in this PR so the test has a chance to complete.

Copy link
Copy Markdown
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lorentzenchr
Copy link
Copy Markdown
Member

A green pypy CI would be nice. After that we should revert c008f80 (don't we?) and then merge.

@jeremiedbb
Copy link
Copy Markdown
Member

After that we should revert c008f80 (don't we?) and then merge.

We already merged this change in main (#23049). We need to keep it to not have this job time out when we trigger it

@lorentzenchr lorentzenchr merged commit b2632de into scikit-learn:main Apr 8, 2022
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FeatureHasher support in PyPy

3 participants