SelectFromModel.transform is very slow: > 0.25 seconds to run on a single row #7478
Description
When profiling unexpectedly slow production code, I was shocked to find out that SelectFromModel.transform() took over a quarter of a second to run on a single row, while everything else in the Pipeline took very roughly a microsecond to run on a single row.
We're using this in production with private data, so I can't share the exact example we're using. But the dataset that we feed into SelectFromModel.fit has many hundreds of features, while the pruned version of the dataset after feature selection has only dozens of features.
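Since the real data can't be shared, a synthetic stand-in along these lines might be used to time the call. This is only a sketch: the estimator (`RandomForestClassifier`), the dataset shape, and the helper `time_call` are assumptions made for illustration, not the production setup, and the timings will vary with the scikit-learn version and the wrapped estimator.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumed stand-in data: a few hundred features, only some of them informative.
X, y = make_classification(
    n_samples=1000, n_features=300, n_informative=20, random_state=0
)

# Assumed estimator; the issue does not say which model is wrapped.
selector = SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))
selector.fit(X, y)


def time_call(fn, arg, repeats=5):
    """Hypothetical helper: best wall-clock time of fn(arg) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


single = time_call(selector.transform, X[:1])  # one row
batch = time_call(selector.transform, X)       # all rows
print("single row: %.6f s, %d rows: %.6f s" % (single, X.shape[0], batch))
```

If the per-call overhead dominates, the two numbers should come out close to each other even though the batch has a thousand times more rows.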
The compute time required for SelectFromModel.transform() over tens of thousands of rows is only a fraction of a second longer than the compute time required for a single row. We can probably pre-calculate a lot of what SelectFromModel.transform is calculating on the fly each time.
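One way the pre-calculation could look from user code is sketched below. It assumes the set of selected features does not change after `fit`, that the input is a dense NumPy array, and it uses an assumed estimator and synthetic data; `support_idx` and `fast_transform` are hypothetical names, not part of scikit-learn.

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumed stand-in data and estimator, for illustration only.
X, y = make_classification(
    n_samples=500, n_features=300, n_informative=20, random_state=0
)
selector = SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))
selector.fit(X, y)

# Compute the indices of the selected columns once, right after fitting.
support_idx = selector.get_support(indices=True)


def fast_transform(rows):
    """Hypothetical helper: select the cached columns with plain indexing."""
    return rows[:, support_idx]


# The cached version should return the same columns as selector.transform.
row = X[:1]
assert np.array_equal(fast_transform(row), selector.transform(row))
```

The trade-off is that the cached indices go stale if the selector is refit or its threshold is changed, so they would need to be recomputed in those cases.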
At some point, I'll dive into the code a little more and submit a PR to speed this up. This slowness would have prevented us from using a pipeline built using scikit-learn in production. Everything else runs very quickly (that's one of the reasons I like scikit-learn so much: a very active community of developers optimizing the performance all the time), but this one bottleneck took orders of magnitude longer than all the other calculations combined when getting a prediction on a single row.
Versions
>>> import platform; print(platform.platform())
Darwin-15.6.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
('Python', '2.7.12 (default, Jun 29 2016, 14:05:02) \n[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.11.1')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.18.0')
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
('Scikit-Learn', '0.17.1')
Thanks, as always, for running such an awesome project! Hopefully this will help speed up other people's production code as well, and encourage even wider adoption of scikit-learn. It seems a shame to have such a highly optimized library bottlenecked by such a primitive piece of functionality.