
[TorchTidy] Refactor profiler to use SizesAndStrides#81824

Closed
Gamrix wants to merge 3 commits into gh/gamrix/82/base from gh/gamrix/82/head

Conversation


Gamrix (Contributor) commented Jul 20, 2022

This includes adding stride information

[ghstack-poisoned]

facebook-github-bot commented Jul 20, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 44f09aa (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Gamrix added a commit that referenced this pull request Jul 20, 2022
This includes adding stride information

ghstack-source-id: 7bf6586
Pull Request resolved: #81824
@Gamrix Gamrix requested review from robieta and removed request for albanD and soulitzer July 21, 2022 00:37

robieta (Contributor) commented Jul 25, 2022

I was very surprised by how expensive it is to copy around SizesAndStrides. (I assume the copy from TensorImpl::sizes_and_strides() gets RVO'd through to the emplace, so you only do one copy?) Benchmarking your PR on my system adds about 180 ns for one input (an additional 40% to the profiling overhead). That got me curious, and I played around with all manner of unsafe data access to the underlying SizesAndStrides and its underlying memory. Ultimately, I found that just calling t.sizes() and t.strides() was as fast as any of the specializations (read: hacks) I could come up with. That said, I did find that when I added a vectorized copy to AppendOnlyList there was a non-trivial speedup:

  template <typename T0>
  typename std::enable_if<
      std::is_same<T0, T>::value && std::is_trivially_copyable<T>::value>::type
  copy(const T0* src, size_t n) {
    if (C10_LIKELY(next_ + n <= end_)) {
      std::memcpy((void*)next_, (void*)src, n * sizeof(T0));
      next_ += n;
    } else {
      // We could chunk this into several `memcpy`s, but because we expect this
      // fallback to be infrequent (n << ChunkSize) the performance impact is
      // negligible.
      for (size_t i = 0; i < n; i++) {
        emplace_back(*src++);
      }
    }
  }

Interestingly, I also found that doing separate memcpys for sizes and strides was faster than a single memcpy of the contiguous memory (in addition to using less memory). When I did that (again, nothing fancy):

    const auto& sizes = t.sizes();
    const auto& strides = t.strides();
    const auto dim = sizes.size();
    ...
    tensor_sizes_.copy(sizes.data(), dim);
    tensor_strides_.copy(strides.data(), dim);

overhead was identical to the baseline. So I think that (plus updating the decode) is the way to go. Sorry for leading you astray. WDYT?


Gamrix (Contributor, Author) commented Jul 25, 2022

Firstly, we should discuss what we want from TensorImpl (search: TensorImpl size constraints).
Below is my best guess of what we want from it.

Legend: + = we do want, ? = unsure, - = we don't want

- various other pointers
- Refcount information
+ storage pointer (temporarily to get Unique Tensor Id)
- SizesAndStrides (from discussion above, we will capture this in a different fashion)
? storage_offset
- numel (duplicated from the sizes information)
+ dtype 
+ device 
- is_contiguous (John confirmed that this is just a bit to optimize performance on TensorImpl)
- storage_access_should_throw_ 
? bitfields - Need to review on a bit by bit basis, many bits are just speed optimizations for TensorImpl.

I think from this, we can conclude that we do not want to copy TensorImpl, and likely want to do our own thing. We should align on what exactly we want from TensorImpl, and also check that there are no data structures that currently capture most of what we want.

Secondly, all of these benchmarks with different resulting times show that it is critical that we have a repeatable way of running benchmarks, and that we migrate to a system with reproducible results. It will be very hard to get anything done when one week we say that feature X has 20% overhead and the next week we say it has 40%. My proposal is that we run future benchmarks on the AWS machines that people are preparing specifically for benchmarking, move the benchmark script to open source, and fully document the benchmarking process for profiler overheads.


robieta (Contributor) commented Jul 25, 2022

  • Refcount information

I think we might want this. (CC @c-odrin who is looking into implementing it) The specific use case is to help find leaks by allowing the user to see where the number of referents differs from what is expected. An atomic read should be cheap, but this is all subject to cost-benefit analysis once we have something concrete.

? storage_offset

Probably not? I don't think we're expecting to get compiler-y enough to need this fine grained of detail.

? is_contiguous

I think it's duplicated from the sizes and strides information?

? bitfields

Lots of great info for not much space is very tempting, but probably not until we have a concrete use case.

+1 we should open source the benchmark.

The inline review comments below refer to this diff excerpt:

      tensor_sizes_.emplace_back(i);
    }
    /*layout_=*/t.layout(),
    /*sizes_and_strides_*/ t.unsafeGetTensorImpl()->sizes_and_strides());
Collaborator:

heads up, this would still error out for nested tensors (because sizes_custom errors). I'm adding a test for nested tensor profiling in #82854, so hopefully this will surface there.

Contributor:

Should we just add an entry to the enum for "this doesn't even make sense"? Just want to future proof things a bit since I'm sure nested tensor won't be the last non-rectangular thing that people dream up.

Collaborator:

One could conceivably say that .sizes() of a nested tensor should return a vector of the constituent tensors' sizes, or, if the nested tensor happens to be rectangular, the sizes of that rectangular box, or something else. But until nested tensor decides what it wants, the profiler should ignore it. Perhaps we should call sizes_and_strides with a try-catch, so that we ignore all special TensorImpls that choose to throw for those calls rather than hardcoding just nested tensors? E.g. sparse tensors will give you sizes but will balk at strides, and I don't remember whether their strides_custom throws an error.


Gamrix (Contributor, Author) commented Aug 8, 2022

Closing this, as Taylor thinks we should not go with SizesAndStrides for this support and we should instead go back to what is implemented in #80072.

@Gamrix Gamrix closed this Aug 8, 2022
@facebook-github-bot facebook-github-bot deleted the gh/gamrix/82/head branch September 8, 2022 14:19