Conversation
jjsjann123 left a comment:
Quick question on TensorView constructor.
jjsjann123 left a comment:
A few comments here and there.
```cpp
int no_broadcast_i = 0;
for (const auto i : c10::irange(ids.size())) {
  if (!ids.at(i)->isBroadcast()) {
    full2nob_map.at(i) = no_broadcast_i++;
```
nitpick: does `std::vector<size_t>` give zero-initialized values? I'm a little uncomfortable with us leaving them here, since a stale entry might accidentally map to axis 0. Can we default-initialize everything to -1?
Actually, the way we are accessing these, I think an `unordered_map` makes more sense in terms of readability. It would also make it easier to catch bugs in accessing the mapping.
I agree that `unordered_map` provides better readability and error checking, but isn't `unordered_map` much slower than `vector`? I changed the code to initialize the vector with `std::numeric_limits<size_t>::max()`, so if an unexpected item is accessed, it will almost always lead to an out-of-bounds error or a segfault.
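As a rough Python sketch of the mapping being discussed (the real code is C++ in nvFuser; names here are hypothetical), using a sentinel so a broadcast axis can never silently alias axis 0:

```python
# Hypothetical sketch of the full-domain -> non-broadcast-domain index map.
# A sentinel (None here, std::numeric_limits<size_t>::max() in the C++ code)
# marks broadcast axes so a stale entry cannot accidentally map to axis 0.

SENTINEL = None  # stands in for std::numeric_limits<size_t>::max()

def build_full2nob_map(is_broadcast):
    """Map each full-domain axis index to its non-broadcast position."""
    full2nob = [SENTINEL] * len(is_broadcast)
    no_broadcast_i = 0
    for i, bcast in enumerate(is_broadcast):
        if not bcast:
            full2nob[i] = no_broadcast_i
            no_broadcast_i += 1
    return full2nob

# e.g. axes (concrete, broadcast, concrete) -> [0, SENTINEL, 1]
```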
It seems that many of the changes are because the contiguity vector now only holds flags for non-broadcast domains. I wonder if it could be simpler if we kept the contiguity vector with flags for all domains and just changed the definition of the flag. IIRC, if the flag is true, it means the stride of the domain can be calculated as the stride of the next inner domain multiplied by the extent of the inner domain. If we changed the definition of "the next inner domain" to the next non-broadcast inner domain, I think we should be able to get the same benefit of this PR without doing this change. Just my two cents.
I agree that keeping the flag for broadcast domains could still have the benefit of ignoring broadcasts in its definition. But I don't think it would make this diff easier. The only saving is a few
Hahaha, I expected this answer.
```cpp
if ((*it)->isBroadcast()) {
  if (inner_most_id == nullptr) {
    inner_most_id = *it;
  }
```
It seems this function ignores broadcast domains if there are any non-broadcast domains. If the tensor only has broadcast domains, it returns the innermost broadcast domain. I don't remember why we're doing this.
I am not sure either. But this would only happen on all-broadcast tensors, and this function is only used in vectorization and transpose detection. Loosely speaking, if a tensor is all-broadcast, it should be a no-op in vectorization/transpose analysis, so I would speculate this change is safe.
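The behavior being discussed can be sketched in Python like this (a hedged stand-in for the C++ quoted above; the `IterDomain` class and function name here are hypothetical, and only the broadcast flag matters):

```python
from dataclasses import dataclass

# Minimal stand-in for nvFuser's IterDomain; only the broadcast flag matters.
@dataclass
class IterDomain:
    name: str
    is_broadcast: bool

def inner_most_nonbroadcast_id(ids):
    """Return the innermost non-broadcast domain.

    If every domain is a broadcast, fall back to the innermost broadcast
    domain, mirroring the behavior described above.
    """
    innermost_broadcast = None
    for id_ in reversed(ids):  # scan from innermost to outermost
        if id_.is_broadcast:
            if innermost_broadcast is None:
                innermost_broadcast = id_  # remember innermost broadcast
        else:
            return id_  # first non-broadcast seen from the inside wins
    return innermost_broadcast
```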
Looks like my issues are all resolved. I'm leaving it to @naoyam to stamp on this one.
```python
from nvfuser import compute_contiguity
```

```diff
- return compute_contiguity(shape, strides)
+ return tuple(compute_contiguity(shape, strides))
```
Looks like I need to change this. @jjsjann123
Should I upstream this?
@naoyam Indeed, instead of making contiguity store redundant unused booleans, we can still make contiguity have the same size as the rfactor domain by making contiguity a
In csarofeen#2517 the return value of `compute_contiguity` is changed from tuple to list. This PR handles that change. Pull Request resolved: #96218 Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
Sounds good to me.
Briefly brought up this conversation with the frontend team. There are some opinions on how we should expose the contiguity flag on the frontend. Pointing @kevinstephano here for visibility.
The change to the Python frontend is linked above in #2561.
Currently, contiguity has the same size as the rfactor domain of a tensor. In this PR, I am changing it to the size of `TensorDomain::noBroadcasts(rfactor_dom)`. With this change, "the contiguity of a broadcast/expand dimension" is no longer meaningful by definition, and `contiguity[i]` is `true` if and only if the dimension is memory dense with the next non-broadcasting dimension.

For example, if I have a tensor `torch.zeros(4, 1, 3).expand(-1, 10, -1)`, before this change the contiguity of this tensor will be `(false, false, true)`, and after the change it will be `(true, true)`.

The reason for this change is that we are more interested in whether a non-broadcasting dimension is memory dense with its next non-broadcasting dimension. In the example above, we are interested in whether 4 and 3 are memory dense. We are not interested in whether 10 and 3 are memory dense, because by definition they are trivially not. In this example, we want to vectorize 4; however, the current contiguity design is blocking us from doing so.
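The new definition can be sketched in plain Python (a hedged illustration, not the actual nvFuser `compute_contiguity` implementation; it approximates "broadcast dimension" as "stride 0", whereas in nvFuser this is a property of the IterDomain):

```python
def contiguity_no_broadcast(sizes, strides):
    """Contiguity flags over non-broadcast dims only (stride 0 ~ broadcast).

    flags[i] is True iff the dim is memory dense with the next inner
    non-broadcast dim; the innermost non-broadcast dim must have stride 1.
    """
    # Drop broadcast dims (approximated here as stride-0 dims).
    nob = [(sz, st) for sz, st in zip(sizes, strides) if st != 0]
    flags = []
    for i, (sz, st) in enumerate(nob):
        if i + 1 < len(nob):
            inner_sz, inner_st = nob[i + 1]
            flags.append(st == inner_st * inner_sz)
        else:
            flags.append(st == 1)
    return tuple(flags)

# torch.zeros(4, 1, 3).expand(-1, 10, -1) has shape (4, 10, 3) and
# strides (3, 0, 1), so this yields (True, True), as described above.
```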
Currently, our definition of the contiguity of broadcasting dimensions, and of the dimension before a broadcasting dimension, is vague and not well formalized. For example, if I have shape `(4, 1, 6)` and stride `(4*999999, 999999, 1)`, then on the one hand our system will calculate its contiguity as `(true, false, true)`; however, on the other hand, our indexing will collapse the index of dim 0 with dim 2 because it ignores broadcasts (this is the root cause of #2169). I do not consider this an indexing bug. Instead, I consider it an ambiguity in the definition of contiguity, and my design change is an effort to remove this ambiguity.

See also: #2169, #2049
Fixes #2169