
Improves ATen CUDAEvent #11293

Closed
mruberry wants to merge 12 commits into pytorch:master from mruberry:cuda_event_improvement

Conversation

@mruberry
Collaborator

@mruberry mruberry commented Sep 5, 2018

After submitting PR #9726, PR #10581 created a different CUDAEvent class. The CUDAEvent proposed in #9726 was similar to the c10d::CUDAEvent class with additional testing and functionality. In particular, it was movable but not copyable. The CUDAEvent created by #10581 is refcounted and copyable. This PR retains the refcounting of the latter PR while fixing several bugs, adding tests, and extending the functionality to support testing and usage like in PR #8354. In particular, this PR:

  • Adds set_device() to CUDAContext
  • Adds three CUDAEvent tests to stream_test.cpp
  • Fixes three bugs:
      • Refcounting was broken: destroying any of the RAIIs holding a particular CUDAEvent would destroy the event UNLESS it was the last RAII (the check was backwards).
      • Moving an event would cause a segfault.
      • Events were not destroyed on the device they were created on. See PR "add device to CUDAEvent" #9415 (@pietern)
  • Adds the happened() and recordOnce() functions
  • Changes the record() functions to not be const
  • Adds additional assertions to verify correctness

This PR does not:

  • Make c10d use the ATen CUDAEvent (this is appropriate for a separate PR)

Whether events should be refcounted is an interesting question. It adds some atomic operations and makes event creation eager. Making events movable but not copyable (like the c10d events) avoids these costs and allows events to be lazily constructed. Lazy construction is preferable when working with containers (like std::array or std::vector) and because the event's device can be set automatically to the first stream it's recorded on. With eager construction the user is required to understand that events have a device and acquire the device of the stream the event will be recorded on upfront. This can be seen here:

// NB: CUDA binds the event to a device at creation time, so we can initialize it
// only now, when we know we're on the correct device.
state.event.emplace();

and that file is the only one which currently uses the ATen CUDAEvent.

Refcounting does allow single-writer multi-reader scenarios, although these scenarios can also be supported by providing indirect access to the underlying CUDAEvent. I believe no current or planned usage scenario requires refcounting, and if desired I can update this PR to remove refcounting and make the ATen event movable but not copyable like the c10d event. I think not refcounting is preferable because it can improve performance, ease usability, and simplify the code (as seen with two of the above bugs).

I have decided to separate this from PR #8354 since, while it's required for PR #8354, the changes are clearly of independent interest. PR #8354 now has a dependency on this one, however. I am closing PR #9726 in favor of this PR.

@apaszke @ezyang @pietern

 }

-CUDAEvent& operator=(CUDAEvent other) noexcept {
+CUDAEvent& operator=(CUDAEvent other) {

This comment was marked as off-topic.

@ezyang
Contributor

ezyang commented Sep 5, 2018

Windows error is legit:

21:34:35 stream_test.cpp.obj : error LNK2019: unresolved external symbol "public: void __cdecl at::cuda::CUDAStream::synchronize_with(struct at::cuda::CUDAEvent const &)const " (?synchronize_with@CUDAStream@cuda@at@@QEBAXAEBUCUDAEvent@23@@Z) referenced in function "void __cdecl ____C_A_T_C_H____T_E_S_T____14(void)" (?____C_A_T_C_H____T_E_S_T____14@@YAXXZ)
21:34:35 stream_test.cpp.obj : error LNK2019: unresolved external symbol "public: void __cdecl at::cuda::CUDAEvent::record(struct at::cuda::CUDAStream const &)" (?record@CUDAEvent@cuda@at@@QEAAXAEBUCUDAStream@23@@Z) referenced in function "void __cdecl ____C_A_T_C_H____T_E_S_T____18(void)" (?____C_A_T_C_H____T_E_S_T____18@@YAXXZ)
21:34:35 stream_test.cpp.obj : error LNK2019: unresolved external symbol "public: void __cdecl at::cuda::CUDAEvent::recordOnce(struct at::cuda::CUDAStream const &)" (?recordOnce@CUDAEvent@cuda@at@@QEAAXAEBUCUDAStream@23@@Z) referenced in function "void __cdecl ____C_A_T_C_H____T_E_S_T____14(void)" (?____C_A_T_C_H____T_E_S_T____14@@YAXXZ)
21:34:35 bin\stream_test.exe : fatal error LNK1120: 3 unresolved externals

Contributor

@facebook-github-bot facebook-github-bot left a comment


ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry
Collaborator Author

mruberry commented Sep 5, 2018

Tests look clear except for pr/caffe2-py2-cuda9.0-cudnn7-windows-build which appears to be broken atm. We may want to wait for that to be fixed since we were hitting Windows-specific errors there earlier.

Removing refcounting is causing a new error I need to diagnose tomorrow.

@pietern
Contributor

pietern commented Sep 6, 2018

@mruberry Nice work. I think it will be straightforward to update c10d to use this implementation. But I think we should wait with that until the stream PR is merged as well.

@mruberry
Collaborator Author

mruberry commented Sep 6, 2018

@pietern Which stream PR?

@pietern
Contributor

pietern commented Sep 6, 2018

@mruberry Ah nevermind, I read #8354 as being the stream PR but the actual stream PR has already been merged.

@mruberry
Collaborator Author

mruberry commented Sep 6, 2018

I think this PR is now ready. The remaining failure appears to be unrelated (and happening with other PRs, too?).

20:22:19 FAIL: test_scalar_fusion (main.TestScript)
20:22:19 ----------------------------------------------------------------------
20:22:19 Traceback (most recent call last):
20:22:19 File "test_jit.py", line 2961, in test_scalar_fusion
20:22:19 self.assertExpectedGraph(ge.graph_for(x, y))
20:22:19 File "test_jit.py", line 248, in assertExpectedGraph
20:22:19 self.assertExpected(str(graph), *args, **kwargs)
20:22:19 File "/var/lib/jenkins/workspace/test/common.py", line 524, in assertExpected
20:22:19 self.assertMultiLineEqual(expected, s)
20:22:19 AssertionError: 'graph(%x : Float()\n %y : Float()) {\n %2 : Float() = prim::FusionGroup_0 [truncated]... != 'graph(%x : Float()\n %y : Float()) {\n %2 : Float() = aten::type_as(%y, % [truncated]...
20:22:19 graph(%x : Float()
20:22:19 %y : Float()) {
20:22:19 - %2 : Float() = prim::FusionGroup_0[device=-1](%x, %y)
20:22:19 - return (%2);
20:22:19 - }
20:22:19 - with prim::FusionGroup_0 = graph(%0 : Float()
20:22:19 - %1 : Float()) {
20:22:19 - %2 : Float() = aten::type_as(%1, %0)
20:22:19 ? ^ ^
20:22:19 + %2 : Float() = aten::type_as(%y, %x)
20:22:19 ? ^ ^
20:22:19 %3 : int = prim::Constant[value=1]
20:22:19 - %4 : Float() = aten::add(%0, %2, %3)
20:22:19 ? ^
20:22:19 + %4 : Float() = aten::add(%x, %2, %3)
20:22:19 ? ^
20:22:19 return (%4);
20:22:19 }
20:22:19
20:22:19
20:22:19 ----------------------------------------------------------------------
20:22:19 Ran 1179 tests in 30.373s
20:22:19
20:22:19 FAILED (failures=1, skipped=44, expected failures=3)
20:22:19 Traceback (most recent call last):
20:22:19 File "test/run_test.py", line 391, in <module>
20:22:19 main()
20:22:19 File "test/run_test.py", line 383, in main
20:22:19 raise RuntimeError(message)
20:22:19 RuntimeError: test_jit failed!

Contributor

@facebook-github-bot facebook-github-bot left a comment


soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@soumith
Collaborator

soumith commented Sep 7, 2018

waiting on internal contbuilds. will land after that's finished (~3 hours)

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 7, 2018
Summary: (same as the PR description above)
Pull Request resolved: pytorch/pytorch#11293

Differential Revision: D9665836

Pulled By: soumith

fbshipit-source-id: a1513fa4f9761e2f304d126e402f6b6950e1c1d2
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary: (same as the PR description above)
Pull Request resolved: pytorch#11293

Differential Revision: D9665836

Pulled By: soumith

fbshipit-source-id: a1513fa4f9761e2f304d126e402f6b6950e1c1d2
@mruberry mruberry deleted the cuda_event_improvement branch September 25, 2018 16:39