Adding symbolic sizes, contiguity, stride indices#36101
Krovatkin wants to merge 7 commits into pytorch:master.
Conversation
💊 Dr. CI build failures summary: as of commit bcb10ac, none of the build failures appear to be your fault. 1 job timed out.
Force-pushed from b49f106 to f5f854c.
facebook-github-bot left a comment:
@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
zdevito left a comment:
Nice, my comments are inline and mostly focus on the particular encodings and how they are documented.
typically? when are they not?
Seems odd that these can be invalid. It makes it hard to know statically when they are not used.
Alternatively, because in every place we accept a ShapeSymbol it might be unknown, we can explicitly reserve one value of the integer to mean "unknown shape", which is not equal to any other unknown shape. This would prevent having to wrap all ShapeSymbols in optionals. In other words, shape symbols would represent something like:
- `?`: completely unknown, encoding 0
- `3`: literally size 3, encoding 3
- `a`: the same size as another `a`, encoding -1
I don't think these should be public. This is supposed to be an abstract class representing a symbol, not a container for our particular encoding.
We are trying to keep shapes separate from integers, but this sort of undoes that by allowing all shapes to be used as integers anywhere.
I will get rid of this in the next patch, since the new algorithm doesn't use isCompatibleWithInCurrentExecutionContext
If we already use negative numbers to represent abstract shapes, then why is this necessary?
I find this confusing, both from a template perspective and from an actual behavior perspective. We don't want to pun integers and shapes but this is extending it through the API.
- stride and contiguity information should probably go in its own object not flattened out in the tensor.
- stride_indices_, contiguity_, and strides_ need documentation. I think you can take a lot of the documentation from the Quip document that proposes this interface.
stride and contiguity information should probably go in its own object not flattened out in the tensor.
Can we defer this piece until a bit later? I understand we want a StrideDimension object that has an index and contiguity, but there isn't any code that actually uses this information (e.g. contiguity and stride_indices) yet.
Well, there at least should be tests that this information is getting tracked correctly. I found what seem like several bugs tracking this information. If it is unused, then it shouldn't be going into the PR.
merge_sizes is unused; it's a bit confusing to have it. If you want to not merge sizes, you can just blank out the sizes before or after the merge.
It's used in the next PR to avoid running the default merging logic for VaryingShape<ShapeSymbol>, since the merging logic will be in profiling_record.cpp.
That doesn't sound like a good separation of concerns, but it is difficult to suggest something better because its use isn't in this PR. Maybe remove it from here and put in the PR that actually uses it. I am skeptical that this is the best way to represent something, because this boolean flag essentially makes merge not perform a merge, which is an abstraction leak issue.
Did this move, or does this just remove the hashing test? Seems like an important test.
Forgot to port it. Also changed the tests to use TensorType::create in the way we actually use it.
Implementation details of TensorType appear to have leaked here. I think it would look clearer if stride information were represented as its own object.
I wanted to defer an API clean-up until we have the entire E2E pipeline working for dynamic shapes, but it's indeed pretty confusing in its current state, so I cleaned it up.

```cpp
auto s = Shape::fromConstantSize(3);
```

Added fromStaticSize.

```cpp
auto n = ShapeScope::fresh(); // new shape in the scope (i.e. the function being profiled)
```

Unfortunately, this needs to be a bit more complex, since ShapeScope needs to persist across multiple profiles of the same function. For the first iteration, I just added newSymbol, which generates a fresh symbol unique globally across all functions.

```cpp
std::optional concrete_value = s.static_size();
```

Added static_size, returning int, and is_static.
Shouldn't contiguity be determined after sorting? If we know that we are handling transposed data, then for kernel generation what we would care about is the contiguity information in stride-sorted order.
Correct, contiguity information is supposed to be from smallest to largest. So this probably needs to be done in reverse order: first compute stride_indices, then contiguity from them.
To make this problem even more convoluted, currently we sort dimensions by their stride and break ties on their semantic index (https://github.com/pytorch/pytorch/pull/36101/files/4366d3d892bd89811c11483474e0917f612c3f1d#diff-1eaac5e95929476c6522cd50bbb2432eR781).
However, this makes the contiguity information ambiguous.
If we have an unsqueezed tensor with size [4, 1, 8] and stride [8, 1, 1], the current implementation would compare contiguity between dimensions [0/1] and [1/2], where it won't correctly identify that the meaningful dimensions [0/2] are actually contiguous.
Size-1 dimensions make contiguity tricky to handle for two reasons: 1. the stride could be arbitrary; 2. they break the cumulative law of contiguity.
I think we should special-case size-1 dimensions to make our life easier:
- We still need to preserve size-1 dimensions when we sort the dimensions by their stride. (This is to ensure our memory_format handling works properly.)
- Size-1 dimensions should transparently pass the contiguity check, and we should instead assess contiguity for the closest neighboring dimensions with size != 1.
I'm in favor of Zach's idea of having a separate class to properly encapsulate this logic.
zdevito left a comment:
Getting there -- I have a few separate comments inline.
should probably be explicit to prevent confusion.
Fresh symbols should have negative values, right? Otherwise the symbol gets a fixed size?
What does is_static mean? This could use a comment.
What does 'fixed' mean? (I can guess but developers new to the codebase could use more specific wording).
The problem with representing stride_indices and contiguity as separate VaryingShape elements is that it offers more ways in which the information can be present/absent than what is actually expected. For instance, is there any case where we set some stride_index values but not the contiguity values? The current data structures can represent that, but it is not clear that was the intent vs. just a side effect of exposing the arrays as VaryingShape objects. In its current form, any code that needs to use this information has to be much more verbose to handle a bunch of cases that don't actually happen. I still think it would be much better to create an abstract object that represents the striding information, with documentation about what it means.
This looks like it is missing correct handling of stride_indices_ and contiguity_. This is another reason to implement strides as a self-contained object: a bunch of splatted parameters makes it harder to keep objects up to date.
This seems like it doesn't correctly set the rest of the stride information.
This is a lot of extra code to avoid creating a SymbolicShape object from an integer. I would prefer explicitly doing that over this complicated template logic.
Force-pushed from 1207da2 to 3c66cb0.
zdevito left a comment:
Thanks for working on this! Looks pretty good to me. I only have some minor comments. Improvements to the API surface can come in future PRs.
can you add a comment about how this encodes the above information?
Why not just use Stride() to represent an empty stride? Seems confusing to have to de-normalize and renormalize things.
The first time this is called, it will return -1, which is the same as ShapeSymbol(). Since -1 is being used to encode a non-present shape, this seems wrong.
num_symbols starts at 1 (i.e. there's always one unknown/undefined symbol). newSymbol increments first, then returns, so the first new symbol should have the index -2. I also ran a quick test to confirm, and I'm indeed seeing -2.
nit: stride_properties, it isn't a great idea to abbreviate in the public API of a class.
Both Stride and ShapeSymbol now have their own internal ways of representing not-present values, so VaryingShape is going to get confusing because of the possibilities involved. It would be good to try to simplify the API surface here. For instance, does sizes() still need to return a list of optional ints? Or do we just need concrete sizes plus the symbolic shape-returning version?
I initially liked the consistency of having c10::optional to denote the unknown-ness, but it's getting a bit clunky and awkward, especially with Strides.
We could probably switch to the following API:

```cpp
c10::optional<std::vector<ShapeSymbol>> symbolic_sizes();
c10::optional<std::vector<int64_t>> concrete_sizes();
// strides
c10::optional<std::vector<Stride>> stride_properties();
// and we should be removing `concrete_strides()`/`strides()`
```
stride_indices and contiguity are unused here. This doesn't seem to be implemented correctly, but the code makes it look like it is.
Oh :-( yes, that's definitely a bug. We should use computeStrideProps.
Oh wait, no, it's correct. That particular constructor is a convenience c-tor: if the VaryingShapes contain concrete sizes, we will compute the stride properties; otherwise we will just compute the rank information.
Force-pushed from 7ba702a to 320f5fc.
Shouldn't we also mention implicit broadcasting here, where we have dimensions with stride == 0?
Yup, I also want to switch contiguous to YES, NO, BROADCASTED.
Quick question: since stride_ is a c10::optional<size_t>, it basically only carries a static stride when applicable. Meanwhile, the contiguous_ information is only really useful when we have symbolic shapes, i.e. a non-static stride_. I see an opportunity to merge the two together, but I don't know if I missed anything. And it's certainly not a priority in this PR.
@jjsjann123 we will be deprecating and removing strides_. It's basically there to support the legacy executor.
Nitpick: we have value_ as int64_t, which does not cover the full range of PyTorch's size_t. Realistically we probably won't run into issues, but maybe we can put a comment here.
Sure, I'll do it in the next PR, since it's very painful to land even this one.
Commit messages:
- Fix failing tests
- compute contiguity in the original order
- getting rid of VaryingShape strides
- remove todos
- remove dead code
- assert in contiguous
- remove dead code
- remove dead code
- get rid of VaryingShape
- get rid of VaryingShape for good
- add a c-tor for VaryingShape2<ShapeSymbol>
- rename VaryingShape2 to VaryingShape
- fix dimensionedOnly
- remove random stuff
- clang-format
- update third_party
- fix build breaks
- more clang-format
- disable a failing cuda parser test
- address Zach's feedback p1
- Stride object
- smaller fixes
- fixes to computeStrides
- backout changes to jit_log
- fix comp errors after formatting
- rename stride_properties
- fix computeStrideProps; different linkage
- revert back
Force-pushed from 799d7bd to bcb10ac.
facebook-github-bot left a comment:
@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@Krovatkin merged this pull request in 4ed790d.
Summary: Pull Request resolved: pytorch/pytorch#36101
Reviewed By: jamesr66a
Differential Revision: D20908711
Pulled By: Krovatkin
fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6