Skip to content

Fix "Improve TruncatedSVD.transform on sparse csc matrices #16828"#16837

Merged
rth merged 4 commits intoscikit-learn:masterfrom
PandaTinker:iss16828
Apr 5, 2020
Merged

Fix "Improve TruncatedSVD.transform on sparse csc matrices #16828"#16837
rth merged 4 commits intoscikit-learn:masterfrom
PandaTinker:iss16828

Conversation

@PandaTinker
Copy link
Copy Markdown
Contributor

Reference Issues/PRs

Fixes #16828

What does this implement/fix? Explain your changes.

This commit allows the use of csc matrix without conversion to csr in TruncatedSVD.transform.
In addition, a benchmark file "bench_TSVDtransform_csc.py" is added for evaluating
the performance improvement.

Any other comments?

Here is a quick preview of the benchmark.
Performance Comparison

Input_Dimensions With_CSR_Conversion Without_Conversion

  1000               0.3052s                  0.2708s
  2000               1.6780s                  1.7055s
  3000               3.9639s                  4.0052s
  4000               7.5915s                  7.2595s
  5000               12.2266s                 12.1459s
  6000               18.1505s                 17.7042s
  7000               25.2273s                 25.0626s
  8000               33.1733s                 33.2724s
  9000               42.9401s                 42.7734s

In addtion, a benchmark file "bench_TSVDtransform_csc.py" for evaluating
the performance improvement.
@lorentzenchr
Copy link
Copy Markdown
Member

@wornbb Thank you for this PR. In you benchmark, the call to TruncatedSVD().fit(X) is included in the timing. But we only want the timing for transform(X).

Then, it's great that you published your benchmark script. For this small change, however, I think the benchmark script would not be merged. You can post it here in github, maybe under a "details" section.

@PandaTinker
Copy link
Copy Markdown
Contributor Author

Hi, @lorentzenchr Thank you for your advice. I have removed the benchmark file. The following this script and outcome.

Details
"""
Benchmarks of with vs without CSR conversion

The performances are evaluated in various input dimensions.

The input array X is a n x n square matrix.

Time is calculated by running the transform 100 times for each method.

Performance Comparison
======================
Input_Dimensions Time_With_CSR_Conversion Time_Without_Conversion
-----------------------------------------------------------------
      1000               0.0339s                  0.0321s
      2000               0.1433s                  0.1376s
      3000               0.4040s                  0.4357s
      4000               0.7765s                  0.7333s
      5000               1.3254s                  1.3391s
      6000               2.0317s                  1.9206s
      7000               2.8399s                  2.7303s
      8000               3.6815s                  3.5850s
      9000               4.8620s                  4.6881s
"""
import scipy.sparse as ss
from sklearn.decomposition import TruncatedSVD
from time import time
from sklearn.utils import check_array


def benchmark_converted(dim, trials=100):
    start_time = time()
    X = ss.random(dim, dim, format='csc')
    tsvd = TruncatedSVD().fit(X)
    for i in range(trials):
        X = check_array(X, accept_sparse='csr')
        tsvd.transform(X)
    return time() - start_time


def benchmark_direct(dim, trials=100):
    start_time = time()
    X = ss.random(dim, dim, format='csc')
    tsvd = TruncatedSVD().fit(X)
    for i in range(trials):
        tsvd.transform(X)
    return time() - start_time


if __name__ == '__main__':
    print("Performance Comparison")
    print("======================")
    headers = ("Input_Dimensions",
               "Time_With_CSR_Conversion",
               "Time_Without_Conversion")
    header_len = [len(header) for header in headers]
    print("%s %s %s"
          % headers)
    print("-" * (sum(header_len) + 2))
    for dim in range(1000, 10000, 1000):
        t1 = benchmark_converted(dim)
        t2 = benchmark_direct(dim)
        print("%s %s %s" % (("%d" % dim).center(header_len[0]),
                            ("%.4fs" % t1).center(header_len[1]),
                            ("%.4fs" % t2).center(header_len[2])))

@thomasjpfan
Copy link
Copy Markdown
Member

Thank you for the benchmark. The benchmark is timing more than just the transform. We can adjust the benchmark a little to only time the transform:

def benchmark_converted(dim, trials=100):
    X = ss.random(dim, dim, format='csr')
    tsvd = TruncatedSVD().fit(X)

    start_time = time()
    for i in range(trials):
        tsvd.transform(X)
    return time() - start_time


def benchmark_direct(dim, trials=100):
    X = ss.random(dim, dim, format='csc')
    tsvd = TruncatedSVD().fit(X)

    start_time = time()
    for i in range(trials):
        tsvd.transform(X)
    return time() - start_time

On master:

Input_Dimensions Time_With_CSR_Conversion Time_Without_Conversion
-----------------------------------------------------------------
      1000               0.0163s                  0.0375s        
      2000               0.0342s                  0.0850s        
      3000               0.0610s                  0.1679s        
      4000               0.0992s                  0.2865s        
      5000               0.1463s                  0.5046s        
      6000               0.2234s                  0.7453s        
      7000               0.2850s                  1.0539s        
      8000               0.3653s                  1.4679s        
      9000               0.4501s                  2.0035s    

This PR

Input_Dimensions Time_With_CSR_Conversion Time_Without_Conversion
-----------------------------------------------------------------
      1000               0.0181s                  0.0157s        
      2000               0.0357s                  0.0333s        
      3000               0.0639s                  0.0618s        
      4000               0.0992s                  0.1042s        
      5000               0.1487s                  0.1463s        
      6000               0.2116s                  0.2041s        
      7000               0.2938s                  0.2752s        
      8000               0.3546s                  0.3591s        
      9000               0.4443s                  0.4505s  

Looks good to me.

Copy link
Copy Markdown
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wornbb !

@rth rth merged commit b2bc40c into scikit-learn:master Apr 5, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve TruncatedSVD.transform on sparse csc matrices

4 participants