torch.linalg.solve yields much lower precision in 1.13.0 than in previous versions #90453

@astroboylrx

Description

🐛 Describe the bug

After upgrading to torch 1.13.0, torch.linalg.solve suddenly gives solutions with much lower precision, regardless of device (CPU or GPU) or dtype (float64 or float32). The errors quickly escalate in my numerical calculations and break my simulations.

Take the following data as an example (I know the matrix is somewhat ill-conditioned, but the change in behavior is real):

import torch
torch.set_default_dtype(torch.float64)
torch.backends.cuda.matmul.allow_tf32 = False
A = torch.tensor([
    [ 3.8025705376834739e-07, -9.1719365342788720e-07, -6.7124337949782264e-06, -6.4837019110456791e-05, -7.0869999797614066e-04, -1.0694859984690733e-02, -3.2912231531790004e-01, -6.6347339870464399e+00, -8.2509761085708249e+01,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00,  4.4000124553730829e-07, -5.5080918253708871e-07, -5.1498277032055974e-06, -5.7818057148617599e-05, -9.1226448867859551e-04, -2.2619326362175465e-02, -4.4038788530099793e-01, -5.1992675801721502e+00,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -1.0669700681643825e-10,  4.3768558191229986e-07, -4.3974816153203019e-07, -4.8865127972067992e-06, -7.8116560507683326e-05, -1.7589402883070333e-03, -3.3666362131922367e-02, -3.8659142733749491e-01,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -7.8216940301197729e-12, -1.5895421888461478e-10,  4.3542984469163267e-07, -4.0043248885844276e-07, -6.6798905178796823e-06, -1.3761857019311234e-04, -2.5943507621790695e-03, -2.9003633389177604e-02,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -2.4603969583879200e-13, -6.0925772512004975e-12, -1.9886454656863128e-10,  4.3370279880257098e-07, -5.6639032522315289e-07, -1.0649799471193429e-05, -1.9808440853565822e-04, -2.1583707954594099e-03,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -1.4999959257460881e-15, -3.2831398418930186e-14, -8.8714562886788080e-13, -4.3280772005187299e-11,  4.4148762039828565e-07, -6.8089481270669943e-07, -1.4575015323337058e-05, -1.5597848962814291e-04,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -3.6858575028157790e-16, -7.2036090445864899e-15, -1.4349791509103240e-13, -2.9849302443991965e-12,  6.3914122655929791e-10,  4.6448551809896547e-07, -6.8453604307207769e-07, -1.0332761488908590e-05,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -3.7045642770024088e-17, -7.2015144280333478e-16, -1.4158860652466324e-14, -2.8662564585632735e-13, -6.2285079180541528e-12,  1.5090963357302090e-09,  4.8979817748389458e-07, -1.2863401745116974e-07,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00, -2.3760629594245614e-18, -4.6007155546998113e-17, -8.9513844792609796e-16, -1.7640414722799569e-14, -3.5935860384434572e-13, -7.9429359080595169e-12,  2.0146206213869421e-09,  4.7959403001188342e-07,  0.0000000000000000e+00],
    [ 0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  3.8025705376834739e-07]
])
b = torch.tensor(
    [ 6.9677181015078851e+04,  3.9337825712781823e+03,  2.7914109655787729e+02,  1.9895852311404216e+01,  1.3819016836738420e+00,  7.5229947004102571e-02,  1.3433804143281360e-03, -3.1421146091483441e-04, -2.8076324348838071e-05,  0.0000000000000000e+00]
)
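
Since the matrix is described as somewhat ill-conditioned, it can help to quantify that before comparing versions. The following is a minimal sketch, not part of the original report: `conditioning_report` is a hypothetical helper, and `A_demo` is a small stand-in matrix with made-up values (tiny diagonal, much larger off-diagonal entry), not the `A` above. The product of the condition number and machine epsilon gives a rough scale for the forward error one can expect in the solution.

```python
import torch

def conditioning_report(A: torch.Tensor) -> dict:
    """Hypothetical helper: 2-norm condition number and the error scale it implies."""
    cond = torch.linalg.cond(A).item()   # ratio of largest to smallest singular value
    eps = torch.finfo(A.dtype).eps       # machine epsilon for A's dtype
    return {"cond": cond, "eps": eps, "cond_times_eps": cond * eps}

# Stand-in matrix (illustrative values only, not the data from this issue)
A_demo = torch.tensor([[4e-7, -8e-1],
                       [0.0,   5e-7]], dtype=torch.float64)
print(conditioning_report(A_demo))
```

Calling `conditioning_report(A)` on the full matrix from this report would show how far the observed errors are from what the conditioning alone allows.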

With torch 1.12.1, the relative errors are around machine precision (a few 1e-16), which is consistent with the precision obtained from numpy or cupy:

In [1]: (A @ torch.linalg.solve(A, b) - b) / b
tensor([ 0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  0.0000000000000000e+00,  1.6068046486108669e-16,
        -3.6894317650011501e-16,  0.0000000000000000e+00, -0.0000000000000000e+00,  3.6202728109145290e-16,                     nan])

However, with torch 1.13.0, the relative errors are much larger (up to 5e-11):

In [2]: (A @ torch.linalg.solve(A, b) - b) / b
tensor([-2.0884764590602007e-16,  4.6240212075443264e-16,  0.0000000000000000e+00, -1.7856554337026822e-16, -4.1776920863882539e-15,
        -8.7255061242277206e-14,  5.0944524844510106e-11, -2.0456409676328997e-11, -4.9269499441339466e-12,                     nan])
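
The last entry of the outputs above is nan only because the last entry of b is zero (0/0). As a sketch of a check that avoids this, the hypothetical helper `residual_metrics` below reports the componentwise relative residual (falling back to the absolute residual where b is zero) together with a normwise backward error. The helper name and the small demo system are my own assumptions for illustration, not the data above.

```python
import torch

def residual_metrics(A, x, b):
    """Hypothetical helper: componentwise residual and normwise backward error."""
    r = A @ x - b
    # Where b == 0, divide by 1 so the absolute residual is reported instead of nan
    denom = torch.where(b == 0, torch.ones_like(b), b)
    componentwise = r / denom
    # Normwise backward error: ||Ax - b|| / (||A||_2 * ||x|| + ||b||)
    backward = torch.linalg.vector_norm(r) / (
        torch.linalg.matrix_norm(A, ord=2) * torch.linalg.vector_norm(x)
        + torch.linalg.vector_norm(b)
    )
    return componentwise, backward.item()

# Small well-conditioned demo system with a zero entry in b (illustrative values)
A_demo = torch.tensor([[2.0, 1.0, 0.0],
                       [0.0, 3.0, 1.0],
                       [0.0, 0.0, 4.0]], dtype=torch.float64)
b_demo = torch.tensor([1.0, 2.0, 0.0], dtype=torch.float64)
x_demo = torch.linalg.solve(A_demo, b_demo)

componentwise, backward = residual_metrics(A_demo, x_demo, b_demo)
print(componentwise, backward)
```

Running the same helper on `A`, `b`, and `torch.linalg.solve(A, b)` under both torch versions would give a single number per version to compare.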

Below are more comparisons using torch.float64 on CUDA:

In [1]: A = torch.tensor([ ... ], device=torch.device('cuda'))
In [2]: b = torch.tensor([ ... ], device=torch.device('cuda'))
In [3]: (A @ torch.linalg.solve(A, b) - b) / b  # with torch 1.12.1
tensor([ 0.0000e+00,  1.1560e-16,  0.0000e+00,  0.0000e+00,  0.0000e+00,
        -1.8447e-16,  0.0000e+00,  1.7253e-16,  3.6203e-16,         nan],
       device='cuda:0')
In [4]: (A @ torch.linalg.solve(A, b) - b) / b  # with torch 1.13.0
tensor([-2.0885e-16,  0.0000e+00, -2.0364e-16, -7.1426e-16, -1.7675e-15,
         4.1875e-14,  4.4228e-11, -1.2897e-11, -3.0743e-12,         nan],
       device='cuda:0')

And more comparisons using torch.float32 on the CPU:

In [1]: torch.set_default_dtype(torch.float32)
In [2]: torch.backends.cuda.matmul.allow_tf32 = True
In [3]: (A @ torch.linalg.solve(A, b) - b) / b  # with torch 1.12.1
tensor([-1.1212e-07,  0.0000e+00,  0.0000e+00,  0.0000e+00,  8.6265e-08,
         1.9807e-07, -8.6658e-08,  9.2625e-08, -0.0000e+00,         nan])
In [4]: (A @ torch.linalg.solve(A, b) - b) / b  # with torch 1.13.0
tensor([-1.1212e-07,  6.2063e-08, -1.0933e-07, -9.5867e-08, -2.3291e-06,
        -4.0902e-05, -2.2294e-02, -2.5929e-03, -1.9909e-03,         nan])
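
One detail when reproducing the float32 runs: torch.set_default_dtype only affects floating-point tensors created afterwards, so A and b have to be re-created or cast after the call. The sketch below shows the cast and a comparison against a float64 reference solve. It uses small stand-in values (assumptions for illustration, not the data above) and restores the previous default dtype at the end.

```python
import torch

# Stand-in float64 system (illustrative values only)
A64 = torch.tensor([[4e-7, -8e-1],
                    [0.0,   5e-7]], dtype=torch.float64)
b64 = torch.tensor([1.0, 2e-3], dtype=torch.float64)

prev = torch.get_default_dtype()
torch.set_default_dtype(torch.float32)
print(A64.dtype)  # existing tensors keep their original dtype

# Cast explicitly so the solve really runs in float32
A32, b32 = A64.to(torch.float32), b64.to(torch.float32)
x32 = torch.linalg.solve(A32, b32)
x64 = torch.linalg.solve(A64, b64)

# Relative difference of the float32 solution from the float64 reference
rel_diff = ((x32.to(torch.float64) - x64) / x64).abs()
print(x32.dtype, rel_diff)

torch.set_default_dtype(prev)  # restore the previous default
```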

Versions

For tests with torch 1.12.1, the output is

Collecting environment information...
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia driver version: 527.37
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] pytorch-memlab==0.2.4
[pip3] torch==1.12.1+cu116
[pip3] torchaudio==0.12.1+cu116
[pip3] torchvision==0.13.1+cu116
[pip3] xitorch==0.3.0
[conda] No relevant packages

For tests with torch 1.13.0, the output is

PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun 22 2022, 20:18:18)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia driver version: 527.37
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] No relevant packages

cc @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano

Metadata

Assignees

No one assigned

    Labels

    module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
