Make standard scaler compatible to Array API #27113

lesteve merged 80 commits into scikit-learn:main
Conversation
Hi @AlexanderFabisch, I'm happy to continue this if it cannot wait until October. Waiting to see what the maintainers think. :) Here are a few things I learned while working on my PR that might be helpful if you decide to keep working on it:
Sure, I could also give you write access to my fork if needed. That way we could collaborate better.
Hi @AlexanderFabisch, thank you for sharing the fork :) There are still a couple of TODOs:
Another thing to bear in mind is that
That looks a lot better @EdAbati. Thanks for continuing this PR.
@OmarManzoor the script in your comment uses
Really sorry about that. I used the prior script and forgot to replace StandardScaler. |
No worries. I was just wondering if I was missing something :D When I make the swap, I see numbers roughly similar to what you do. A ten times slowdown for fitting seems weird, no?

edit: these are the numbers I see if I use the script below. So I think the results above came from the code you pasted. I see a roughly 3x speed-up if I add a 0th (warm-up) run of the torch fit before the timed loop:

Details

```python
from time import time

import numpy as np
import torch as xp

from sklearn._config import config_context
from sklearn.preprocessing import StandardScaler

X_np = np.random.rand(100000, 100).astype(np.float32)
X_xp = xp.asarray(X_np, device="mps")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in range(n_iter):
    start = time()
    pf_np = StandardScaler()
    pf_np.fit(X_np)
    fit_times.append(time() - start)
    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")

# Torch mps benchmarks
# 0th (warm-up) run so one-off dispatch overhead is not counted below
with config_context(array_api_dispatch=True):
    pf_xp = StandardScaler()
    pf_xp.fit(X_xp)

fit_times = []
transform_times = []
for _ in range(n_iter):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = StandardScaler()
        pf_xp.fit(X_xp)
        fit_times.append(time() - start)
        start = time()
        # float() forces synchronization with the mps device
        float(pf_xp.transform(X_xp)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_mps = sum(fit_times) / n_iter
avg_transform_time_mps = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch mps: {avg_fit_time_mps:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_mps:.1f}x"
)
print(
    f"Avg transform time for torch mps: {avg_transform_time_mps:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_mps:.1f}x"
)
```
Here are the results that I ran. I increased the dataset size to (1000000, 200). I just used the original code and replaced StandardScaler.

Details

```python
from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.preprocessing import StandardScaler

X_np = np.random.rand(1000000, 200).astype(np.float32)
X_xp = xp.asarray(X_np, device="mps")

# Numpy benchmarks
fit_times = []
transform_times = []
n_iter = 10
for _ in tqdm(range(n_iter), desc="Numpy Flow"):
    start = time()
    pf_np = StandardScaler()
    pf_np.fit(X_np)
    fit_times.append(time() - start)
    start = time()
    pf_np.transform(X_np)
    transform_times.append(time() - start)

avg_fit_time_numpy = sum(fit_times) / n_iter
avg_transform_time_numpy = sum(transform_times) / n_iter
print(f"Avg fit time for numpy: {avg_fit_time_numpy:.3f}")
print(f"Avg transform time for numpy: {avg_transform_time_numpy:.3f}")

# Torch mps benchmarks (no warm-up run here)
fit_times = []
transform_times = []
for _ in tqdm(range(n_iter), desc="Torch mps Flow"):
    with config_context(array_api_dispatch=True):
        start = time()
        pf_xp = StandardScaler()
        pf_xp.fit(X_xp)
        fit_times.append(time() - start)
        start = time()
        # float() forces synchronization with the mps device
        float(pf_xp.transform(X_xp)[0, 0])
        transform_times.append(time() - start)

avg_fit_time_mps = sum(fit_times) / n_iter
avg_transform_time_mps = sum(transform_times) / n_iter
print(
    f"Avg fit time for torch mps: {avg_fit_time_mps:.3f}, "
    f"speed-up: {avg_fit_time_numpy / avg_fit_time_mps:.1f}x"
)
print(
    f"Avg transform time for torch mps: {avg_transform_time_mps:.3f} "
    f"speed-up: {avg_transform_time_numpy / avg_transform_time_mps:.1f}x"
)
```
betatim left a comment:
I think this looks good.
One thing to address in a new PR is the whole story around "everything follows X".
```diff
 else:
     if self.with_mean:
-        X -= self.mean_
+        X -= xp.astype(self.mean_, X.dtype)
```
For my education: why do we (now) need this additional astype? Is it because the type of X in transform can be different from what is used in fit? Why did we not need it before?
The inner computation tries to use the maximum float precision available and sets the computed values and attributes accordingly. Since StandardScaler preserves the input dtype, we need the cast here: from what I remember, self.mean_ can be stored as the maximum float (float64) while X is float32.
Commenting out the xp.astype and running the tests, I think only array-api-strict is picky about this. For other namespaces X -= self.mean_ works fine if X has dtype float32 and self.mean_ has dtype float64, which is why it was not needed before with numpy.
Thanks for figuring it out. array-api-strict is strict :-/
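For illustration, a minimal NumPy-only sketch of why this went unnoticed before (the array values here are made up for the example): NumPy's in-place ufuncs allow a "same kind" downcast, so subtracting float64 statistics from a float32 array just works and the array keeps its dtype, whereas array-api-strict rejects the mixed-dtype in-place operation.

```python
import numpy as np

X = np.ones((3, 2), dtype=np.float32)
mean_ = np.full(2, 0.5, dtype=np.float64)  # fitted statistics stored in float64

# NumPy permits this: float64 -> float32 is a "same_kind" cast for in-place
# ufuncs, so X silently keeps its float32 dtype. array-api-strict raises on
# such mixed-dtype in-place ops, hence the explicit cast to X.dtype.
X -= mean_
print(X.dtype)  # float32
```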
```diff
 # if the number of samples is the same for each feature (i.e. no
 # missing values)
-if np.ptp(self.n_samples_seen_) == 0:
+if xp.max(self.n_samples_seen_) == xp.min(self.n_samples_seen_):
```
FYI, I double-checked and np.ptp does nothing smart (like computing both the min and max in a single pass), so it is fine to replace it with max == min.
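A quick sanity check of the equivalence, using a made-up n_samples_seen array rather than the actual fitted attribute:

```python
import numpy as np

# No missing values: every feature saw the same number of samples.
n_samples_seen = np.array([10, 10, 10])
# np.ptp is just max - min, so ptp(...) == 0 iff max == min.
assert np.ptp(n_samples_seen) == 0
assert n_samples_seen.max() == n_samples_seen.min()

# With missing values, some features saw fewer samples.
with_missing = np.array([10, 8, 10])
assert np.ptp(with_missing) != 0
assert with_missing.max() != with_missing.min()
```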
LGTM, thanks to everyone involved in this PR over the last 2 years: @AlexanderFabisch, @EdAbati, @ogrisel, @charlesjhill, @OmarManzoor, @betatim!
Co-authored-by: Edoardo Abati <29585319+EdAbati@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Charles Hill <charles.hill@etegent.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Here's my contribution from the EuroSciPy 2023 sprint. It's still work in progress and I won't have the time to continue the work before October. So if anyone else wants to take it from here, feel free to do so.
Reference Issues/PRs
See also #26024
What does this implement/fix? Explain your changes.
Make the standard scaler compatible with the Array API.
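To make this concrete, here is a minimal sketch of the computation StandardScaler performs, written against a generic namespace xp (NumPy is used as a stand-in here; the point of Array API support is that the same calls get dispatched to the namespace of the input array, e.g. torch):

```python
import numpy as np
import numpy as xp  # stand-in for any Array API namespace

X = np.random.default_rng(0).random((100, 5)).astype(np.float32)

# What StandardScaler(with_mean=True, with_std=True) computes in fit:
mean_ = xp.mean(X, axis=0)
var_ = xp.var(X, axis=0)   # ddof=0, matching StandardScaler
scale_ = xp.sqrt(var_)

# ... and in transform:
X_t = (X - mean_) / scale_

print(X_t.mean(axis=0))  # ~0 per feature
print(X_t.std(axis=0))   # ~1 per feature
```

Note that the real estimator additionally guards against zero-variance features (it replaces a zero scale with 1 to avoid division by zero), which this sketch omits.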
Any other comments?
Unfortunately, the current implementation breaks some unit tests of the standard scaler that are related to dtypes. That's because I wanted to make it work for torch.float16, but maybe that is not necessary and we should just support float32 and float64.
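A small NumPy sketch of why float16 is tricky for this estimator: float16 has so little precision at large magnitudes that accumulating a sum (as a mean or variance computation does internally) silently drops small contributions, so restricting support to float32/float64 may well be the safer choice.

```python
import numpy as np

# float16 spacing at 50000 is 32, so small increments vanish entirely:
a = np.float16(50000)
print(a + np.float16(0.5) == a)  # True

# Consequently, a mean accumulated in float16 drifts away from the
# float32 reference as the running sum grows large.
x = np.random.default_rng(0).random(100_000).astype(np.float32)
mean32 = x.mean()
mean16 = x.astype(np.float16).mean(dtype=np.float16)
print(abs(float(mean16) - float(mean32)))
```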
I'll also add some comments to the diff. See below.