[MAGMA][CUDA] eigh: deprecate MAGMA and dispatch to cuSolver unconditionally#174619
gderossi wants to merge 2 commits into pytorch:main
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/174619

✅ No failures as of commit b0c401f with merge base c68a888. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@pytorchmergebot label ciflow/trunk ciflow/h100 ciflow/b200
Seems like #174674 could be a very nice follow-up :)
nikitaved left a comment:

LGTM! Thank you very much!
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.
Successfully rebased 25edcbf to 2d1a853
@pytorchmergebot label ciflow/trunk ciflow/h100 ciflow/b200 ciflow/rocm-mi300
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.
eigh: deprecate MAGMA and dispatch to cuSolver unconditionally (pytorch#174619)

Both cuSolver and hipSolver support syevd/syevj, so the MAGMA path is removed entirely and the relevant tests now skip when cuSolver is missing rather than when MAGMA is missing. The benchmark script and results are included below, though the results only show sizes 512 and up because MAGMA just calls LAPACK for sizes up to 128.

Benchmarking script:

```python
import torch
import torch.utils.benchmark as benchmark
from itertools import product

results = []
batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]
dtypes = [torch.float32, torch.float64, torch.complex64, torch.complex128]

for b, n, dtype in product(batches, sizes, dtypes):
    shape = b + (n, n)
    print(f"Testing shape={shape}, dtype={dtype}")
    label = "torch.linalg.eigh"
    sub_label = f"{shape}, {dtype}"
    X = torch.rand(*shape, dtype=dtype, device="cuda")
    X = X + X.mT.conj()
    stmt = "torch.linalg.eigh(X)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)
        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'X': X},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmark results on RTX Pro 6000:

```
[------------------------ torch.linalg.eigh -------------------------]
                                      |    magma    |  cusolver   | speedup
1 threads: -----------------------------------------------------------
  (512, 512), torch.float32           |     12605.6 |     11742.1 |   1.1
  (512, 512), torch.float64           |     17244.3 |     10558.8 |   1.6
  (512, 512), torch.complex64         |     18868.0 |      3612.1 |   5.2
  (512, 512), torch.complex128        |     28479.8 |     16659.5 |   1.7
  (2048, 2048), torch.float32         |    226035.4 |     19598.1 |  11.5
  (2048, 2048), torch.float64         |    451455.1 |     68374.8 |   6.6
  (2048, 2048), torch.complex64       |    535989.6 |     23807.6 |  22.5
  (2048, 2048), torch.complex128      |   1111481.8 |    164294.9 |   6.8
  (16, 512, 512), torch.float32       |    210144.0 |    187468.1 |   1.1
  (16, 512, 512), torch.float64       |    281164.8 |    167509.6 |   1.7
  (16, 512, 512), torch.complex64     |    307684.5 |     57805.7 |   5.3
  (16, 512, 512), torch.complex128    |    468624.1 |    265833.6 |   1.8
  (16, 2048, 2048), torch.float32     |   3650952.0 |    315576.2 |  11.6
  (16, 2048, 2048), torch.float64     |   7147413.6 |   1096273.9 |   6.5
  (16, 2048, 2048), torch.complex64   |   8579275.9 |    384409.0 |  22.3
  (16, 2048, 2048), torch.complex128  |  17937525.7 |   2639580.7 |   6.8
  (64, 512, 512), torch.float32       |    835108.8 |    716855.2 |   1.2
  (64, 512, 512), torch.float64       |   1145713.3 |    672703.7 |   1.7
  (64, 512, 512), torch.complex64     |   1289962.5 |    233632.8 |   5.5
  (64, 512, 512), torch.complex128    |   1863496.5 |   1067678.9 |   1.7
  (64, 2048, 2048), torch.float32     |  14329632.9 |   1257138.1 |  11.4
  (64, 2048, 2048), torch.float64     |  27999996.1 |   4381371.4 |   6.4
  (64, 2048, 2048), torch.complex64   |  32749115.0 |   1528567.4 |  21.4
  (64, 2048, 2048), torch.complex128  |  70825685.0 |  10548410.4 |   6.7

Times are in microseconds (us).
```

Pull Request resolved: pytorch#174619
Approved by: https://github.com/eqy, https://github.com/nikitaved, https://github.com/Skylion007
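As a CPU-only illustration of the contract that syevd-style solvers implement (and that the benchmark relies on), here is a small sketch using NumPy's analogous `numpy.linalg.eigh`; the symmetrization mirrors the benchmark's `X = X + X.mT.conj()`. This is a hedged sketch, not part of the PR itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random Hermitian matrix the same way the benchmark does:
# X + X^H guarantees eigh's precondition (Hermitian input).
n = 8
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = X + X.conj().T

# eigh returns real eigenvalues in ascending order and a unitary
# matrix of eigenvectors, so X == Q @ diag(w) @ Q^H up to rounding.
w, Q = np.linalg.eigh(X)

assert np.all(np.diff(w) >= 0)                 # eigenvalues ascend
recon = (Q * w) @ Q.conj().T                   # Q @ diag(w) @ Q^H
assert np.allclose(recon, X)                   # decomposition holds
assert np.allclose(Q.conj().T @ Q, np.eye(n))  # Q is unitary
print("ok")
```

The same invariants hold for `torch.linalg.eigh` on CUDA regardless of whether MAGMA or cuSolver is dispatched to, which is why the backends are interchangeable in the benchmark loop.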
cc @nikitaved @eqy