Relax cusparse windows guard on cuda 11 by xwang233 · Pull Request #42412 · pytorch/pytorch

xwang233 · 2020-08-01T18:12:26Z

cusparse Xcsrmm2 API:

new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm
old (deprecated in cuda 11): https://docs.nvidia.com/cuda/archive/10.2/cusparse/index.html#csrmm2

Before:

cuda ver	windows	linux
10.1	old api	old api
10.2	old api	new api
11	old api (build error claimed in #42406)	new api

After:

cuda ver	windows	linux
10.1	old api	old api
10.2	old api	old api
11	new api	new api
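The two tables above can be read as a predicate over platform and cuSPARSE version. The sketch below models the guard before and after this change as `constexpr` functions, so a few rows can be checked at compile time. The thresholds 10301 and 11000 come from the guards touched by this PR. The value 10200 used for CUDA 10.1 is only a placeholder below 10301, and all function and constant names are hypothetical. The sketch assumes `__CUDACC__` is defined, i.e. the file is compiled by nvcc.

```cpp
// Sketch only: models the preprocessor guards as functions so that rows of
// the tables can be checked at compile time. All names are hypothetical.

// Guard before this PR:
//   !defined(_MSC_VER) && defined(__CUDACC__) && CUSPARSE_VERSION >= 10301
constexpr bool uses_new_api_before(bool is_msvc, int cusparse_version) {
  return !is_msvc && cusparse_version >= 10301;
}

// Guard after this PR:
//   defined(__CUDACC__) && CUSPARSE_VERSION >= 11000
constexpr bool uses_new_api_after(bool /*is_msvc*/, int cusparse_version) {
  return cusparse_version >= 11000;
}

// Assumed CUSPARSE_VERSION values per CUDA release.
constexpr int kCuda101 = 10200;  // placeholder: any value below 10301
constexpr int kCuda102 = 10301;  // from the comment on the old guard
constexpr int kCuda11 = 11000;   // from the new guard

// "Before" table: only Linux (non-MSVC) with CUDA >= 10.2 used the new API.
static_assert(!uses_new_api_before(false, kCuda101), "linux 10.1: old api");
static_assert(uses_new_api_before(false, kCuda102), "linux 10.2: new api");
static_assert(!uses_new_api_before(true, kCuda11), "windows 11: old api");

// "After" table: the platform no longer matters, only CUDA 11 gets the new API.
static_assert(!uses_new_api_after(false, kCuda102), "linux 10.2: old api");
static_assert(uses_new_api_after(true, kCuda11), "windows 11: new api");
static_assert(uses_new_api_after(false, kCuda11), "linux 11: new api");
```

Modelling the condition as a function keeps the version logic in one testable place; the real code still needs the preprocessor form, because the two APIs do not exist side by side in every toolkit.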

cusparse bmm-sparse-dense API

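This part was reverted in this PR (72466d2 reverts 72d716e) and will be revisited in the future. The tables below describe the change that was reverted.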

(cc @kurtamohler #33430)

new: https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm

Before:

cuda ver	windows	linux
10.1	not supported	new api
10.2	not supported	new api
11	not supported	new api

After:

cuda ver	windows	linux
10.1	not supported	new api
10.2	not supported	new api
11	new api	new api

dr-ci · 2020-08-01T19:56:03Z

💊 CI failures summary and remediations

As of commit 72466d2 (more details on the Dr. CI page):

✅ None of the CI failures appear to be your fault 💚

1/1 tentatively recognized as flaky ❄️

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

pytorch_windows_vs2019_py36_cpu_test2 (1/1)

Step: "Test" ❄️

CondaHTTPError: HTTP 000 CONNECTION FAILED for url

 if 0 NEQ 0 (exit /b 0  )   
 call conda install -y -q -c conda-forge cmake   
 if 0 NEQ 0 (exit /b 0  )  
)  
The system cannot find the drive specified. 
Collecting package metadata (current_repodata.json): ...working... done 
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve. 
Collecting package metadata (repodata.json): ...working... done 
Solving environment: ...working... done 
 
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/win-64/llvmlite-0.29.0-py36ha925a31_0.conda> 
Elapsed: - 
 
An HTTP error occurred when trying to retrieve this URL. 
HTTP errors are often intermittent, and a simple retry will get you on your way. 
 
 
 
## Package Plan ## 
 
  environment location: C:\Jenkins\Miniconda3


xwang233 · 2020-08-02T05:05:52Z

Reviewer: as of 8/1/2020, cuda 11 + windows is not yet available in the CI

The ROCm failure in 72d716e seems unrelated, because all CI passed in 874403e.

xwang233 · 2020-08-02T05:06:49Z

cc @ptrblck

peterjc123 · 2020-08-02T05:59:12Z

@xwang233 I'm trying to add Windows CUDA 11 CI with #42420 and now it is blocked by the same error as described in #42406.

peterjc123 · 2020-08-02T06:01:21Z

@xwang233 I'll try to cherry-pick your fix there.

xwang233 · 2020-08-02T06:02:32Z

@peterjc123 Thanks for checking the build!

peterjc123 · 2020-08-02T06:28:11Z

New error after cherry-picking your fix:

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(875): error: expression must have a constant value
C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(875): note: the value of variable "num_matrices"
(831): here cannot be used as a constant

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(879): error: identifier "search_end_matrix_indices" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(903): error: identifier "getTensorCudaDataType" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(903): error: identifier "getTensorCudaDataType" is undefined

4 errors detected in the compilation of "C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu".

See https://app.circleci.com/pipelines/github/pytorch/pytorch/195766/workflows/bc3ba331-dd46-4ec5-9356-02cdfa015561/jobs/6468496.
This seems to be caused by the use of a VLA (variable-length array), which MSVC does not support, and by the wrong preprocessor condition in pytorch/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu (line 735 in 72d716e):

#if !(defined(__HIP_PLATFORM_HCC__) || defined(_WIN32) || defined(_WIN64))
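The first error in the log is the typical symptom of a variable-length array: GCC and Clang accept an array whose size is a runtime value as an extension in C++, while MSVC requires a constant. A minimal sketch of the usual portable replacement, `std::vector`, follows. The function name and the fill value are hypothetical, and this is not the code from `SparseCUDATensorMath.cu`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns one "end index" slot per matrix,
// each initialised to -1.
std::vector<std::int64_t> make_end_indices(std::int64_t num_matrices) {
  assert(num_matrices >= 0);  // precondition of this sketch

  // Non-portable form (a VLA, rejected by MSVC with
  // "expression must have a constant value"):
  //   std::int64_t end_indices[num_matrices];

  // Portable form: the size is only known at run time, so use a
  // heap-backed container instead of a stack array.
  return std::vector<std::int64_t>(static_cast<std::size_t>(num_matrices), -1);
}
```

The vector's `data()` pointer can be passed wherever the raw array was used, for example as the source of a host-to-device copy.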


xwang233 · 2020-08-02T06:49:43Z

I have reverted the code in SparseCUDATensorMath.cu.

peterjc123 · 2020-08-02T06:52:39Z

@xwang233 So the new api of cusparse is still not supported on Windows, right? I'll have a bit more testing with 0c00fa3.

xwang233 · 2020-08-02T07:14:51Z

The problem in #42406 will be fixed, since it crashes cuda11 + windows build.

The change in bmm-sparse-dense is optional at this moment, since it doesn't crash any build. bmm-sparse-dense just raises TORCH_CHECK error in windows now regardless of cuda version. The cusparse API itself should be fine on cuda11 + windows. It would be nice if you can find a fix. If not, we can leave that in the next PR.

You can check the table in the first post of this thread.
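One way to get the behaviour described here, a clean runtime error on unsupported platforms instead of a build failure, is to derive both the helper definitions and the call site from a single macro. The sketch below is an assumption about structure, not PyTorch's code: `SKETCH_BMM_SUPPORTED`, `bmm_backend_name`, and `bmm_sparse_dense_sketch` are hypothetical names, and `std::runtime_error` stands in for `TORCH_CHECK`.

```cpp
#include <stdexcept>
#include <string>

// Single source of truth for "is the bmm sparse-dense path compiled in?".
// The condition mirrors the one quoted earlier in this thread.
#if !(defined(__HIP_PLATFORM_HCC__) || defined(_WIN32) || defined(_WIN64))
#define SKETCH_BMM_SUPPORTED 1
#else
#define SKETCH_BMM_SUPPORTED 0
#endif

#if SKETCH_BMM_SUPPORTED
// Hypothetical helper: exists only when the feature is compiled in.
inline std::string bmm_backend_name() { return "cusparse generic SpMM"; }
#endif

// The call site tests the same macro, so it can never reference a helper
// that was compiled out (the cause of "identifier ... is undefined").
inline std::string bmm_sparse_dense_sketch() {
#if SKETCH_BMM_SUPPORTED
  return bmm_backend_name();
#else
  // Stand-in for TORCH_CHECK(false, ...): fail at run time, not at build time.
  throw std::runtime_error(
      "bmm sparse-dense CUDA is not supported on this platform");
#endif
}
```

If the definition guard and the use guard are edited separately, they can drift apart, which is what the "undefined identifier" errors reported earlier suggest happened.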

peterjc123 · 2020-08-02T07:20:38Z

> The problem in #42406 will be fixed, since it crashes cuda11 + windows build.
>
> The change in bmm-sparse-dense is optional at this moment, since it doesn't crash any build. bmm-sparse-dense just raises TORCH_CHECK error in windows now regardless of cuda version. The cusparse API itself should be fine on cuda11 + windows. It would be nice if you can find a fix. If not, we can leave that in the next PR.
>
> You can check the table in the first post of this thread.

For the latest build in the pr, it seems that it is supported. But yes, it's okay to bring them in with later commits.

facebook-github-bot

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

malfet · 2020-08-04T03:10:21Z

aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu

-#if !defined(_MSC_VER) && defined(__CUDACC__) && CUSPARSE_VERSION >= 10301 // CUDA release >= 10.2 and not windows
+#if defined(__CUDACC__) && CUSPARSE_VERSION >= 11000


This would disable this feature on Linux with CUDA 10.2, wouldn't it?

xwang233: CUDA 10.2 on Linux will use the old API, which is still functionally the same.
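The concern raised here can be made concrete. With the new threshold alone, Linux with CUDA 10.2 (`CUSPARSE_VERSION` 10301, per the comment on the removed line) falls back to the old API. A guard that keeps the 10301 threshold for non-MSVC builds while requiring 11000 on MSVC would avoid that. The sketch below models such a condition as a `constexpr` function; it is one possible shape, shown as an assumption, and is not claimed to be the condition used in any follow-up change. The function name is hypothetical, and `__CUDACC__` is assumed to be defined.

```cpp
// Sketch only; `uses_new_api_combined` is a hypothetical name.
// Models the preprocessor condition:
//   defined(__CUDACC__) &&
//   (CUSPARSE_VERSION >= 11000 ||
//    (!defined(_MSC_VER) && CUSPARSE_VERSION >= 10301))
constexpr bool uses_new_api_combined(bool is_msvc, int cusparse_version) {
  return cusparse_version >= 11000 ||
         (!is_msvc && cusparse_version >= 10301);
}

static_assert(uses_new_api_combined(false, 10301),
              "Linux + CUDA 10.2 keeps the new API");
static_assert(!uses_new_api_combined(true, 10301),
              "Windows + CUDA 10.2 stays on the old API");
static_assert(uses_new_api_combined(true, 11000),
              "Windows + CUDA 11 uses the new API");
```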

facebook-github-bot · 2020-08-04T04:13:19Z

@ezyang merged this pull request in c8cb5e5.

Summary: This fixes the feature regression introduced by #42412, which limited all use of the API to CUDA 11.0+.

Pull Request resolved: #42556
Reviewed By: ngimel
Differential Revision: D22932129
Pulled By: malfet
fbshipit-source-id: 2756e0587456678fa1bc7deaa09d0ea482dfd19f

relax windows guard on cuda 11

67e8766

pytorchbot added the open source label Aug 1, 2020

xwang233 added 2 commits August 1, 2020 14:07

cusparseGetErrorString 10100

874403e

relax bmm_sparse_cuda on windows + cuda 11

72d716e

xwang233 changed the title ~~[WIP] Relax cusparse windows guard on cuda 11~~ Relax cusparse windows guard on cuda 11 Aug 2, 2020

xwang233 requested review from ezyang and ngimel August 2, 2020 05:06

Revert "relax bmm_sparse_cuda on windows + cuda 11"

72466d2

This reverts commit 72d716e.

peterjc123 approved these changes Aug 2, 2020


facebook-github-bot reviewed Aug 2, 2020


facebook-github-bot closed this in c8cb5e5 Aug 4, 2020

xwang233 mentioned this pull request Aug 4, 2020

Add CUDA 11 builds for Windows CI #42420

Closed

malfet reviewed Aug 4, 2020


facebook-github-bot added the merged label Aug 4, 2020

malfet mentioned this pull request Aug 4, 2020

Reenable cusparse SpMM on cuda 10.2 #42556

Closed

malfet added a commit to malfet/pytorch that referenced this pull request Aug 4, 2020

Re-enable cusparseSpMM on CUDA-10.x on Linux

4ed34a7

This fixes feature regression introduced by pytorch#42412 which limited all the use of the API to CUDA-11.0+

mruberry added the Merged label Oct 28, 2020

facebook-github-bot deleted the ci-all/cusparse-windows-cuda11 branch January 27, 2021 18:26
