
[TorchTidy] Refactor profiler to use SizesAndStrides#81824

Closed
Gamrix wants to merge 3 commits into gh/gamrix/82/base from gh/gamrix/82/head

Conversation


Gamrix (Contributor) commented Jul 20, 2022

This includes adding stride information

[ghstack-poisoned]

facebook-github-bot commented Jul 20, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 44f09aa (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Gamrix added a commit that referenced this pull request Jul 20, 2022
This includes adding stride information

ghstack-source-id: 7bf6586
Pull Request resolved: #81824
@Gamrix Gamrix requested review from robieta and removed request for albanD and soulitzer July 21, 2022 00:37

robieta (Contributor) commented Jul 25, 2022

I was very surprised by how expensive it is to copy around SizesAndStrides. (I assume the copy from TensorImpl::sizes_and_strides() gets RVO'd through to the emplace, so you only do one copy?) Benchmarking your PR on my system adds about 180 ns for one input (an additional 40% to the profiling overhead). That got me curious, and I played around with all manner of unsafe data access to the underlying SizesAndStrides and its underlying memory. Ultimately, I found that just calling t.sizes() and t.strides() was as fast as any of the specializations (read: hacks) I could come up with. That said, I did find that when I added a vectorized copy to AppendOnlyList there was a non-trivial speedup:

  template <typename T0>
  typename std::enable_if<
      std::is_same<T0, T>::value && std::is_trivially_copyable<T>::value>::type
  copy(const T0* src, size_t n) {
    if (C10_LIKELY(next_ + n <= end_)) {
      std::memcpy((void*)next_, (void*)src, n * sizeof(T0));
      next_ += n;
    } else {
      // We could chunk this into several `memcpy`s, but because we expect this
      // fallback to be infrequent (n << ChunkSize) the performance impact is
      // negligible.
      for (size_t i = 0; i < n; i++) {
        emplace_back(*src++);
      }
    }
  }

Interestingly, I also found that doing separate memcpys for sizes and strides was faster than a single memcpy of the contiguous memory (in addition to using less memory). When I did that (again, nothing fancy):

    const auto& sizes = t.sizes();
    const auto& strides = t.strides();
    const auto dim = sizes.size();
    ...
    tensor_sizes_.copy(sizes.data(), dim);
    tensor_strides_.copy(strides.data(), dim);

overhead was identical to the baseline. So I think that (plus updating the decode) is the way to go. Sorry for leading you astray. WDYT?


Gamrix (Contributor, Author) commented Jul 25, 2022

Firstly, we should discuss what we want from TensorImpl (search: TensorImpl size constraints).
Below is my best guess of what we want from it.

Legend: + = we do want, ? = unsure, - = we don't want

- various other pointers
- Refcount information
+ storage pointer (temporarily to get Unique Tensor Id)
- SizesAndStrides (from discussion above, we will capture this in a different fashion)
? storage_offset
- numel (duplicated from the sizes information)
+ dtype 
+ device 
- is_contiguous (John confirmed that this is just a bit to optimize performance on TensorImpl)
- storage_access_should_throw_ 
? bitfields - Need to review on a bit by bit basis, many bits are just speed optimizations for TensorImpl.

I think from this, we can conclude that we do not want to copy TensorImpl, and likely want to do our own thing. We should align on what exactly we want from TensorImpl, and also check that there are no data structures that currently capture most of what we want.

Secondly, all of these benchmarks with different resulting times show that it is critical that we have a repeatable way of running benchmarks, and that we migrate to a system with reproducible results. It will be very hard to get anything done when one week we say that feature X has 20% overhead and the next week we say it has 40%. My proposal is that we run future benchmarks on the AWS machines that people are preparing specifically for benchmarking, move the benchmark script to open source, and fully document the benchmarking process for profiler overheads.


robieta (Contributor) commented Jul 25, 2022

  • Refcount information

I think we might want this. (CC @c-odrin who is looking into implementing it) The specific use case is to help find leaks by allowing the user to see where the number of referents differs from what is expected. An atomic read should be cheap, but this is all subject to cost-benefit analysis once we have something concrete.

? storage_offset

Probably not? I don't think we're expecting to get compiler-y enough to need this fine grained of detail.

? is_contiguous

I think it's duplicated from the sizes and strides information?

? bitfields

Lots of great info for not much space is very tempting, but probably not until we have a concrete use case.

+1 we should open source the benchmark.

The inline review comments below refer to this diff excerpt:

      tensor_sizes_.emplace_back(i);
    }
    /*layout_=*/t.layout(),
    /*sizes_and_strides_*/ t.unsafeGetTensorImpl()->sizes_and_strides());
Collaborator:

heads up, this would still error out for nested tensors (because sizes_custom errors). I'm adding a test for nested tensor profiling in #82854, so hopefully this will surface there.

Contributor:

Should we just add an entry to the enum for "this doesn't even make sense"? Just want to future proof things a bit since I'm sure nested tensor won't be the last non-rectangular thing that people dream up.

Collaborator:

One could conceivably say that .sizes() of a nested tensor should return a vector of the constituent tensors' sizes, or, if the nested tensor happens to be rectangular, the sizes of that rectangular box, or something else. But until nested tensor decides what it wants, the profiler should ignore it. Perhaps we should call sizes_and_strides with a try-catch, so that we ignore all special TensorImpls that choose to throw for those calls rather than hardcoding just nested tensors? E.g. sparse tensors will give you sizes but will balk at strides, and I don't remember whether their strides_custom throws an error.


Gamrix (Contributor, Author) commented Aug 8, 2022

Closing this, as Taylor thinks we should not go with SizesAndStrides for this support and we should instead go back to what is implemented in #80072.

@Gamrix Gamrix closed this Aug 8, 2022
@facebook-github-bot facebook-github-bot deleted the gh/gamrix/82/head branch September 8, 2022 14:19