Fix "Improve TruncatedSVD.transform on sparse csc matrices #16828" by PandaTinker · Pull Request #16837 · scikit-learn/scikit-learn

PandaTinker · 2020-04-04T00:02:23Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This commit allows the use of csc matrix without conversion to csr in TruncatedSVD.transform.
In addition, a benchmark file "bench_TSVDtransform_csc.py" is added for evaluating
the performance improvement.

Any other comments?

Here is a quick preview of the benchmark.
Performance Comparison

Input_Dimensions With_CSR_Conversion Without_Conversion

  1000               0.3052s                  0.2708s
  2000               1.6780s                  1.7055s
  3000               3.9639s                  4.0052s
  4000               7.5915s                  7.2595s
  5000               12.2266s                 12.1459s
  6000               18.1505s                 17.7042s
  7000               25.2273s                 25.0626s
  8000               33.1733s                 33.2724s
  9000               42.9401s                 42.7734s

In addtion, a benchmark file "bench_TSVDtransform_csc.py" for evaluating the performance improvement.

lorentzenchr · 2020-04-04T10:20:34Z

@wornbb Thank you for this PR. In you benchmark, the call to TruncatedSVD().fit(X) is included in the timing. But we only want the timing for transform(X).

Then, it's great that you published your benchmark script. For this small change, however, I think the benchmark script would not be merged. You can post it here in github, maybe under a "details" section.

PandaTinker · 2020-04-04T21:00:58Z

Hi, @lorentzenchr Thank you for your advice. I have removed the benchmark file. The following this script and outcome.

Details

"""
Benchmarks of with vs without CSR conversion

The performances are evaluated in various input dimensions.

The input array X is a n x n square matrix.

Time is calculated by running the transform 100 times for each method.

Performance Comparison
======================
Input_Dimensions Time_With_CSR_Conversion Time_Without_Conversion
-----------------------------------------------------------------
      1000               0.0339s                  0.0321s
      2000               0.1433s                  0.1376s
      3000               0.4040s                  0.4357s
      4000               0.7765s                  0.7333s
      5000               1.3254s                  1.3391s
      6000               2.0317s                  1.9206s
      7000               2.8399s                  2.7303s
      8000               3.6815s                  3.5850s
      9000               4.8620s                  4.6881s
"""
import scipy.sparse as ss
from sklearn.decomposition import TruncatedSVD
from time import time
from sklearn.utils import check_array


def benchmark_converted(dim, trials=100):
    start_time = time()
    X = ss.random(dim, dim, format='csc')
    tsvd = TruncatedSVD().fit(X)
    for i in range(trials):
        X = check_array(X, accept_sparse='csr')
        tsvd.transform(X)
    return time() - start_time


def benchmark_direct(dim, trials=100):
    start_time = time()
    X = ss.random(dim, dim, format='csc')
    tsvd = TruncatedSVD().fit(X)
    for i in range(trials):
        tsvd.transform(X)
    return time() - start_time


if __name__ == '__main__':
    print("Performance Comparison")
    print("======================")
    headers = ("Input_Dimensions",
               "Time_With_CSR_Conversion",
               "Time_Without_Conversion")
    header_len = [len(header) for header in headers]
    print("%s %s %s"
          % headers)
    print("-" * (sum(header_len) + 2))
    for dim in range(1000, 10000, 1000):
        t1 = benchmark_converted(dim)
        t2 = benchmark_direct(dim)
        print("%s %s %s" % (("%d" % dim).center(header_len[0]),
                            ("%.4fs" % t1).center(header_len[1]),
                            ("%.4fs" % t2).center(header_len[2])))

thomasjpfan · 2020-04-04T23:00:55Z

Thank you for the benchmark. The benchmark is timing more than just the transform. We can adjust the benchmark a little to only time the transform:

def benchmark_converted(dim, trials=100):
    X = ss.random(dim, dim, format='csr')
    tsvd = TruncatedSVD().fit(X)

    start_time = time()
    for i in range(trials):
        tsvd.transform(X)
    return time() - start_time


def benchmark_direct(dim, trials=100):
    X = ss.random(dim, dim, format='csc')
    tsvd = TruncatedSVD().fit(X)

    start_time = time()
    for i in range(trials):
        tsvd.transform(X)
    return time() - start_time

On master:

Input_Dimensions Time_With_CSR_Conversion Time_Without_Conversion
-----------------------------------------------------------------
      1000               0.0163s                  0.0375s        
      2000               0.0342s                  0.0850s        
      3000               0.0610s                  0.1679s        
      4000               0.0992s                  0.2865s        
      5000               0.1463s                  0.5046s        
      6000               0.2234s                  0.7453s        
      7000               0.2850s                  1.0539s        
      8000               0.3653s                  1.4679s        
      9000               0.4501s                  2.0035s

This PR

Input_Dimensions Time_With_CSR_Conversion Time_Without_Conversion
-----------------------------------------------------------------
      1000               0.0181s                  0.0157s        
      2000               0.0357s                  0.0333s        
      3000               0.0639s                  0.0618s        
      4000               0.0992s                  0.1042s        
      5000               0.1487s                  0.1463s        
      6000               0.2116s                  0.2041s        
      7000               0.2938s                  0.2752s        
      8000               0.3546s                  0.3591s        
      9000               0.4443s                  0.4505s

Looks good to me.

rth

Thanks @wornbb !

…rn#16837)

PandaTinker added 3 commits April 3, 2020 19:14

This commit allows csc matrix format in TruncatedSVD.transform(X)

e716330

In addtion, a benchmark file "bench_TSVDtransform_csc.py" for evaluating the performance improvement.

formating to conform PEP8

7d50df8

conforming with PEP8

a172192

github-actions bot added the module:decomposition label Apr 4, 2020

deleted the branchmark file, adding it to the detail of the PR

1ff48a4

thomasjpfan approved these changes Apr 4, 2020

View reviewed changes

lorentzenchr approved these changes Apr 5, 2020

View reviewed changes

rth approved these changes Apr 5, 2020

View reviewed changes

rth merged commit b2bc40c into scikit-learn:master Apr 5, 2020

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020

FIX Improve TruncatedSVD.transform on sparse csc matrices (scikit-lea…

45256c4

…rn#16837)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix "Improve TruncatedSVD.transform on sparse csc matrices #16828"#16837

Fix "Improve TruncatedSVD.transform on sparse csc matrices #16828"#16837
rth merged 4 commits intoscikit-learn:masterfrom
PandaTinker:iss16828

PandaTinker commented Apr 4, 2020

Uh oh!

lorentzenchr commented Apr 4, 2020

Uh oh!

PandaTinker commented Apr 4, 2020

Uh oh!

thomasjpfan commented Apr 4, 2020

Uh oh!

rth left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

PandaTinker commented Apr 4, 2020

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Here is a quick preview of the benchmark. Performance Comparison

Input_Dimensions With_CSR_Conversion Without_Conversion

Uh oh!

lorentzenchr commented Apr 4, 2020

Uh oh!

PandaTinker commented Apr 4, 2020

Uh oh!

thomasjpfan commented Apr 4, 2020

On master:

This PR

Uh oh!

rth left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Here is a quick preview of the benchmark.
Performance Comparison