Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145722
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 42 Pending as of commit 6b8eb58 with merge base 0f5a683.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
// forward substitution with loop unrolling and vectorization
#pragma unroll 4
```
This is somewhat annoying to me; for some reason lintrunner removes the indentation here.
malfet left a comment:
Sure, though it would be nice to add some description on perf before/after
Speed improvements over the old kernel (benchmarked on M1 Pro):

For benchmarking one can use the script below; some basic packages like numpy, pandas, and matplotlib are needed. Usage:
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")
        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")
        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```

To visualize:
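(The author's visualization script is collapsed here and not reproduced. As a rough sketch only, assuming pandas and matplotlib and the CSV written by the benchmark above, a plot of the results could look like the following; the file names and plot choices are illustrative, not the author's.)

```python
# Hypothetical plotting sketch, not the original collapsed script.
# Reads the CSV produced by the benchmark and plots mean time vs. matrix size,
# one curve per batch size, with standard deviation as error bars.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('cholesky_benchmark_times.csv')

fig, ax = plt.subplots()
for batch_size, group in df.groupby('batch_size'):
    ax.errorbar(group['N'], group['mean_time'], yerr=group['std_time'],
                marker='o', label=f'batch_size={batch_size}')

ax.set_xlabel('Matrix size N')
ax.set_ylabel('Mean time (s)')
ax.set_xscale('log', base=2)
ax.set_yscale('log')
ax.legend()
ax.set_title('torch.linalg.cholesky on MPS')
fig.savefig('cholesky_benchmark_times.png')
```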
@pytorchbot merge -f "Lint + MPS are green"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Follow-up to #145701
Optimizes the SYRK and TRSM kernels of the Cholesky decomposition on MPS. For the SYRK kernel it does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and for the TRSM kernel we do vectorized loads. This PR also puts the command encoder inside the stream queue dispatch (as discussed on the last PR).
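(For context, here is a small PyTorch sketch, written for this summary and not taken from the PR, showing where the TRSM triangular-solve and SYRK rank-k-update steps sit inside a blocked Cholesky factorization; the PR's actual kernels are Metal, and the block size `nb` below is arbitrary.)

```python
# Illustrative sketch only (not code from this PR): a blocked, right-looking
# Cholesky in PyTorch showing the POTRF / TRSM / SYRK structure that the
# MPS kernels implement on the GPU.
import torch

def blocked_cholesky(A, nb=64):
    L = A.clone()
    n = L.shape[-1]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # POTRF: factorize the current diagonal block
        L[k:e, k:e] = torch.linalg.cholesky(L[k:e, k:e])
        if e < n:
            # TRSM: solve panel * L_kk^T = A_panel for the block column below
            panel = torch.linalg.solve_triangular(
                L[k:e, k:e].mT, L[e:, k:e], upper=True, left=False
            )
            L[e:, k:e] = panel
            # SYRK: symmetric rank-k update of the trailing matrix
            L[e:, e:] -= panel @ panel.mT
    return torch.tril(L)

# Quick check against the library routine (float32 tolerances apply)
A = torch.randn(256, 256)
A = A @ A.T + 256 * torch.eye(256)
print((blocked_cholesky(A) - torch.linalg.cholesky(A)).abs().max())
```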
Script to collect perf
Observed speedups on M1 Pro

cc @kulinseth @albanD @malfet @DenisVieriu97 @jhavukainen