Skip to content

Relax cusparse windows guard on cuda 11#42412

Closed
xwang233 wants to merge 4 commits intomasterfrom
ci-all/cusparse-windows-cuda11
Closed

Relax cusparse windows guard on cuda 11#42412
xwang233 wants to merge 4 commits intomasterfrom
ci-all/cusparse-windows-cuda11

Conversation

@xwang233
Copy link
Copy Markdown
Collaborator

@xwang233 xwang233 commented Aug 1, 2020

Fixes #42406

cusparse Xcsrmm2 API:

(#37202)

Before:

cuda ver windows linux
10.1 old api old api
10.2 old api new api
11 old api (build error claimed in #42406) new api

After:

cuda ver windows linux
10.1 old api old api
10.2 old api old api
11 new api new api

cusparse bmm-sparse-dense API

reverted, will be revisited in the future (cc @kurtamohler #33430)

Before:

cuda ver windows linux
10.1 not supported new api
10.2 not supported new api
11 not supported new api

After:

cuda ver windows linux
10.1 not supported new api
10.2 not supported new api
11 new api new api

@dr-ci
Copy link
Copy Markdown

dr-ci bot commented Aug 1, 2020

💊 CI failures summary and remediations

As of commit 72466d2 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_windows_vs2019_py36_cpu_test2 (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun) ❄️

CondaHTTPError: HTTP 000 CONNECTION FAILED for url
 if 0 NEQ 0 (exit /b 0  )   
 call conda install -y -q -c conda-forge cmake   
 if 0 NEQ 0 (exit /b 0  )  
)  
The system cannot find the drive specified. 
Collecting package metadata (current_repodata.json): ...working... done 
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve. 
Collecting package metadata (repodata.json): ...working... done 
Solving environment: ...working... done 
 
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/win-64/llvmlite-0.29.0-py36ha925a31_0.conda> 
Elapsed: - 
 
An HTTP error occurred when trying to retrieve this URL. 
HTTP errors are often intermittent, and a simple retry will get you on your way. 
 
 
 
## Package Plan ## 
 
  environment location: C:\Jenkins\Miniconda3 

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 12 times.

@xwang233
Copy link
Copy Markdown
Collaborator Author

xwang233 commented Aug 2, 2020

Reviewer: as of 8/1/2020, cuda 11 + windows is not yet available in the CI

ROCm failure in 72d716e seems irrelevant, because all CI passed in 874403e.

@xwang233 xwang233 changed the title [WIP] Relax cusparse windows guard on cuda 11 Relax cusparse windows guard on cuda 11 Aug 2, 2020
@xwang233 xwang233 requested review from ezyang and ngimel August 2, 2020 05:06
@xwang233
Copy link
Copy Markdown
Collaborator Author

xwang233 commented Aug 2, 2020

cc @ptrblck

@peterjc123
Copy link
Copy Markdown
Collaborator

peterjc123 commented Aug 2, 2020

@xwang233 I'm trying to add Windows CUDA 11 CI with #42420 and now it is blocked by the same error as described in #42406.

@peterjc123
Copy link
Copy Markdown
Collaborator

@xwang233 I'll try to cherry-pick your fix there.

@xwang233
Copy link
Copy Markdown
Collaborator Author

xwang233 commented Aug 2, 2020

@peterjc123 Thanks for checking the build!

@peterjc123
Copy link
Copy Markdown
Collaborator

peterjc123 commented Aug 2, 2020

New error after cherry-picking your fix:

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(875): error: expression must have a constant value
C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(875): note: the value of variable "num_matrices"
(831): here cannot be used as a constant

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(879): error: identifier "search_end_matrix_indices" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(903): error: identifier "getTensorCudaDataType" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu(903): error: identifier "getTensorCudaDataType" is undefined

4 errors detected in the compilation of "C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDATensorMath.cu".

See https://app.circleci.com/pipelines/github/pytorch/pytorch/195766/workflows/bc3ba331-dd46-4ec5-9356-02cdfa015561/jobs/6468496.
Seems to be caused by using VLA which is not supported in MSVC and the wrong preprocessor condition in

#if !(defined(__HIP_PLATFORM_HCC__) || defined(_WIN32) || defined(_WIN64))
.

@xwang233
Copy link
Copy Markdown
Collaborator Author

xwang233 commented Aug 2, 2020

I have reverted the code in SparseCUDATensorMath.cu.

@peterjc123
Copy link
Copy Markdown
Collaborator

peterjc123 commented Aug 2, 2020

@xwang233 So the new api of cusparse is still not supported on Windows, right? I'll have a bit more testing with 0c00fa3.

@xwang233
Copy link
Copy Markdown
Collaborator Author

xwang233 commented Aug 2, 2020

The problem in #42406 will be fixed, since it crashes cuda11 + windows build.

The change in bmm-sparse-dense is optional at this moment, since it doesn't crash any build. bmm-sparse-dense just raises TORCH_CHECK error in windows now regardless of cuda version. The cusparse API itself should be fine on cuda11 + windows. It would be nice if you can find a fix. If not, we can leave that in the next PR.

You can check the table in the first post of this thread.

@peterjc123
Copy link
Copy Markdown
Collaborator

The problem in #42406 will be fixed, since it crashes cuda11 + windows build.

The change in bmm-sparse-dense is optional at this moment, since it doesn't crash any build. bmm-sparse-dense just raises TORCH_CHECK error in windows now regardless of cuda version. The cusparse API itself should be fine on cuda11 + windows. It would be nice if you can find a fix. If not, we can leave that in the next PR.

You can check the table in the first post of this thread.

For the latest build in the pr, it seems that it is supported. But yes, it's okay to bring them in with later commits.

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Comment on lines -84 to +79
#if !defined(_MSC_VER) && defined(__CUDACC__) && CUSPARSE_VERSION >= 10301 // CUDA release >= 10.2 and not windows
#if defined(__CUDACC__) && CUSPARSE_VERSION >= 11000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would disable this feature on Linux with CUDA-10.2, isn't it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda 10.2 on linux will use the old API, which is still functionally the same

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@ezyang merged this pull request in c8cb5e5.

malfet added a commit to malfet/pytorch that referenced this pull request Aug 4, 2020
This fixes feature regression introduced by pytorch#42412 which limited all the use of the API to CUDA-11.0+
facebook-github-bot pushed a commit that referenced this pull request Aug 5, 2020
Summary:
This fixes feature regression introduced by #42412 which limited all the use of the API to CUDA-11.0+

Pull Request resolved: #42556

Reviewed By: ngimel

Differential Revision: D22932129

Pulled By: malfet

fbshipit-source-id: 2756e0587456678fa1bc7deaa09d0ea482dfd19f
@facebook-github-bot facebook-github-bot deleted the ci-all/cusparse-windows-cuda11 branch January 27, 2021 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Aten looking for function cusparseDcsrmm2 in cuda 11 header, windows

6 participants