Skip to content

Adding symbolic sizes, contiguity, stride indices#36101

Closed
Krovatkin wants to merge 7 commits intopytorch:masterfrom
Krovatkin:krovatkin/tensortype_api
Closed

Adding symbolic sizes, contiguity, stride indices#36101
Krovatkin wants to merge 7 commits intopytorch:masterfrom
Krovatkin:krovatkin/tensortype_api

Conversation

@Krovatkin
Copy link
Copy Markdown
Contributor

No description provided.

@Krovatkin Krovatkin requested a review from zdevito April 6, 2020 21:28
@Krovatkin Krovatkin requested a review from apaszke as a code owner April 6, 2020 21:28
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Apr 6, 2020
@Krovatkin Krovatkin removed the request for review from zdevito April 6, 2020 21:29
@dr-ci
Copy link
Copy Markdown

dr-ci Bot commented Apr 6, 2020

💊 Build failures summary and remediations

As of commit bcb10ac (more details on the Dr. CI page):


None of the build failures appear to be your fault 💚


  • 1/1 broken upstream at merge base cd48fb5 on Apr 30 from 2:26pm to 7:04pm PDT (9 commits; cd48fb5 - bedc50e)

    Please rebase on the viable/strict branch (expand for instructions)

    Since your merge base is older than viable/strict, run these commands:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase FETCH_HEAD
    

    Check out the recency history of this "viable master" tracking branch.


1 job timed out:

  • pytorch_xla_linux_bionic_py3_6_clang9_test

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 126 times.

@Krovatkin Krovatkin requested review from zdevito and removed request for zdevito April 6, 2020 23:16
@Krovatkin Krovatkin force-pushed the krovatkin/tensortype_api branch from b49f106 to f5f854c Compare April 7, 2020 18:33
Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Copy Markdown
Contributor

@zdevito zdevito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, my comments are inline and mostly focus on the particular encodings and how they are documented.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typically? when are they not?

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems odd that these can be invalid. It makes it hard to know statically when they are not used.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, because in every place we accept a shapesymbol , it might be unknown. We can explictly reserve one value of the integer to mean "unknown shape", which is not equal to any other unknown shape. This would prevent having to wrap all ShapeSymbols in optionals. In other words Shape symbols would represent something like:

  • ? - completely unknown, encoding 0
  • 3 - literally size 3, encoding 3
  • a - the same size as other a, encoding -1

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these should be public. This is suppose to be an abstract class representing a symbol, not a containing for our particular encoding.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to keep shapes separate from integers, but this sort of undoes that by allowing all shapes to be used as integers anywhere.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will get rid of this in the next patch, since the new algorithm doesn't use isCompatibleWithInCurrentExecutionContext

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we already use negative numbers to represent abstract shapes, then why is this necesary?

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this confusing, both from a template perspective and from an actual behavior perspective. We don't want to pun integers and shapes but this is extending it through the API.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • stride and contiguity information should probably go in its own object not flattened out in the tensor.
  • stride_indices_ ,and contiguity_, and strides_ need documentation. I think you can take a lot of the documentation from the quip document that proposes this interface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stride and contiguity information should probably go in its own object not flattened out in the tensor.

Can we defer this piece until a bit later? I understand we want a StrideDimension object that has an index and contiguity but since there isn't any code that actually uses this information (e.g. contiguity and stride_indices) yet.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there at least should be tests that this information is getting tracked correctly. I found what seem like several bugs tracking this information. If it is unused, then it shouldn't be going into the PR.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge_sizes is unused, its a bit confusing to have it. If you want to not merge sizes, you can just blank out the sizes after or before the merge.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used in the next PR since to avoid running the default logic for VaryingShape<ShapeSymbol> since the merging logic will be in profiling_record.cpp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound like a good separation of concerns, but it is difficult to suggest something better because its use isn't in this PR. Maybe remove it from here and put in the PR that actually uses it. I am skeptical that this is the best way to represent something, because this boolean flag essentially makes merge not perform a merge, which is an abstraction leak issue.

Comment thread test/cpp/jit/test_argument_spec.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this move, or does the just remove the hashing test? Seems like an important test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to port it. Also changed the tests that use TensorType::create in a way we actually use it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation details of TensorType appear to have leaked here. I think it would look clearer of stride information was represented as its own object.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will get rid of this in the next patch, since the new algorithm doesn't use isCompatibleWithInCurrentExecutionContext

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to defer an API clean-up until we have the entire E2E pipeline working for dynamic shapes, but it's indeed pretty confusing in its current state, so I cleaned it up.

auto s = Shape::fromConstantSize(3);

added fromStaticSize

auto n = ShapeScope::fresh(); // new shape in the scope (i.e. the function being profiled)

Unfortunately, this needs to be a bit more complex than this since ShapeScope needs to persist across multiple profiles of the same function. For the first iteration, I just added newSymbol which generates a fresh symbol unique globally across all the functions.

std::optional concrete_value = s.static_size();

Added static_size returning int and is_static

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used in the next PR since to avoid running the default logic for VaryingShape<ShapeSymbol> since the merging logic will be in profiling_record.cpp

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stride and contiguity information should probably go in its own object not flattened out in the tensor.

Can we defer this piece until a bit later? I understand we want a StrideDimension object that has an index and contiguity but since there isn't any code that actually uses this information (e.g. contiguity and stride_indices) yet.

Comment thread test/cpp/jit/test_argument_spec.cpp Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to port it. Also changed the tests that use TensorType::create in a way we actually use it.

@Krovatkin Krovatkin requested a review from zdevito April 14, 2020 00:21
Comment thread aten/src/ATen/core/type.cpp Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't contiguity be determined after sorting? If we know that we are handling transposed data, for kernel generation, what we would care about is the contiguity information in ordered stride.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, contiguity information is suppose to be from smallest to largest. So this probably needs to be done in verse order: first compute stride_indices, then contiguity form them.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this problem even more convoluted, currently we sort dimensions by their stride and break tie on their semantic index (https://github.com/pytorch/pytorch/pull/36101/files/4366d3d892bd89811c11483474e0917f612c3f1d#diff-1eaac5e95929476c6522cd50bbb2432eR781)

However, it makes the contiguity information ambiguous.
If we have an unsqueezed tensor with size [4, 1, 8] and a stride [8, 1, 1], current implementation would compare contiguity between dimension [0/1] & [1/2] where it won't correctly identify that the meaningful dimension [0/2] are actually contiguous.

Size-1 dimension makes contiguity tricky to handle for two reasons: 1. stride could be wild; 2. it breaks the cumulative law of contiguity;
I think we should special case size-1 dimension to make our life easier:

  1. We still need to preserve size-1 dimension when we sort the dimensions per their stride. (This is to ensure our memory_format stuff works properly)
  2. size-1 tensor should transparently pass contiguity check and instead assess contiguity information for closest neighboring dimensions with size != 1;

I'm in favor of Zach's idea of having a separate class to properly encapsulate this logic.

Copy link
Copy Markdown
Contributor

@zdevito zdevito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting there -- I have a few separate comments inline.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably be explicit to prevent confusion.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fresh symbols should have negative values, right? Otherwise the symbol gets a fixed size?

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does is_static mean? this could use a comment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 'fixed' mean? (I can guess but developers new to the codebase could use more specific wording).

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with representing stride_indices and continuity as separate VaryingShape elements is that it offers more ways in which the information is present/absent than what is actually expected. For instance, is there any case where we set some stride_index values but not the contiguity values? The current data structures can represent that, but it is not clear that was the intent vs. just a side effect of exposing the arrays as VaryingShape objects. In its current form any code that needs to use this information has to be much more verbose to handle a bunch of cases that don't actually happen. I still think it would be much better to create an abstract object that represents the striding information and documentation about what it means.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it is missing correct handling of stride_indices_ and contiguity_. This is another reasons to implement strides a a self-contained object: a bunch of splatted parameters make it harder to keep objects up to date.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it doesn't correct set the rest of the stride inforation

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't sound like a good separation of concerns, but it is difficult to suggest something better because its use isn't in this PR. Maybe remove it from here and put in the PR that actually uses it. I am skeptical that this is the best way to represent something, because this boolean flag essentially makes merge not perform a merge, which is an abstraction leak issue.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there at least should be tests that this information is getting tracked correctly. I found what seem like several bugs tracking this information. If it is unused, then it shouldn't be going into the PR.

Comment thread aten/src/ATen/core/type.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of extra code to avoid creating a SymbolicShape object from an integer. I would prefer explicitly doing that over this complicated template logic.

Comment thread aten/src/ATen/core/type.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, contiguity information is suppose to be from smallest to largest. So this probably needs to be done in verse order: first compute stride_indices, then contiguity form them.

@Krovatkin Krovatkin force-pushed the krovatkin/tensortype_api branch from 1207da2 to 3c66cb0 Compare April 22, 2020 17:33
@Krovatkin Krovatkin requested a review from zdevito April 24, 2020 17:01
Copy link
Copy Markdown
Contributor

@zdevito zdevito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! Looks pretty good to me. I only have some minor comments. Improvements to the API surface can come in future PRs.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about how this encodes the above information?

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use Stride() to represent an empty stride? Seems confusing to have to de-normalize and renormalize things.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 'fixed' mean? (I can guess but developers new to the codebase could use more specific wording).

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time this is called, it will return -1 which is the same as ShapeSymbol(). Since -1 is being used to encode a non-present shape, this seems wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

num_symbols starts with 1 (i.e. there's always one unknown/undefined symbol). newSymbol increments first then returns, so the first new symbol should have the index of -2. I also ran a quick test to confirm and I'm indeed seeing -2

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: stride_properties, it isn't a great idea to abbreviate in the public API of a class.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Stride, and ShapeSymbol now have their own internal ways of representing not present values. So VaryingShape is going to get confusing because of the possibilities involved. It would be good to try to simply the API surface here. For instance does sizes() need to still return a list of optional ints? Or do we just need concrete sizes, and sizes_properaties() (i.e. the symbolic shape-returning version).

Copy link
Copy Markdown
Contributor Author

@Krovatkin Krovatkin Apr 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially liked the consistency of having c10::optional to denote the unknowness but it's getting a bit clunky and awkward especially with Strides.
We could probably switch to the following API:

c10::optional<std::vector<ShapeSymbol>> symbolic_sizes();
c10::optional<std::vector<int64_t> concrete_sizes();
//strides
c10::optional<std::vector<Stride>> stride_properties();
// and we should be removing `concrete_strides()`/`strides()`

Comment thread aten/src/ATen/core/type.cpp Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stride_indices, contiguity are unused here. Doesn't seem to be implemented correctly but the code makes it look like it is.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh :-( yes, that's definitely a bug. we should use computeStrideProps

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait, no it's correct. that particular constructor we call is a convenience c-tor. If VaryingShapes contain concrete sizes we will compute the stride properties, otherwise we will just compute the rank information.

@Krovatkin Krovatkin force-pushed the krovatkin/tensortype_api branch 4 times, most recently from 7ba702a to 320f5fc Compare April 29, 2020 17:08
Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also mention implicit broadcasting here where we have dimension with stride == 0?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, i also want to switch contiguous to YES, NO, BROADCASTED .

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question: since stride_ is an c10::optional<size_t> basically means it's only carrying static stride when it's applicable. Meanwhile, contiguious_ information is only really useful when we have symbolic shapes -> non-static stride_. I see an opportunity to merge the two together, but I don't know if I missed out anything.

And certainly not a priority in this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjsjann123 we will be deprecating and removing strides_. It's basically there to support the legacy executor.

Comment thread aten/src/ATen/core/jit_type.h Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NITPICK: we have value_ as int64_t, which is not enough range for pytorch size_t. I guess realistically we shouldn't run into issues but maybe we can put a comment here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure i'll do it in the next PR since it's very painful to land even this one.

Krovatkin and others added 7 commits April 30, 2020 14:34
Fix failing tests

compute contiguity in the original order

getting rid of VaryingShape strides

remove todos

remove dead code

assert in contiguous

remove dead codde

remove dead code

get rid of VaryingShape

get rid off VaryingShape for good

add a c-tor for VaryingShape2<ShapeSymbol>

rename VaryingShape2 to VaryingShape

fix dimensionedOnly

remove random stuff

clang-format

update third_party

fix bbuild breaks

more clang-format

disable a failing cuda parser test

address Zach's feedback p1

Stride object

smaller fixes

fixes  to computeStrides

backout changes to jit_log

fix comp errors after formatting

rename stride_properties

fix computeStrideProps; different linkage

revert back
@Krovatkin Krovatkin force-pushed the krovatkin/tensortype_api branch from 799d7bd to bcb10ac Compare April 30, 2020 21:36
Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@Krovatkin merged this pull request in 4ed790d.

jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary: Pull Request resolved: pytorch/pytorch#36101

Reviewed By: jamesr66a

Differential Revision: D20908711

Pulled By: Krovatkin

fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary: Pull Request resolved: pytorch/pytorch#36101

Reviewed By: jamesr66a

Differential Revision: D20908711

Pulled By: Krovatkin

fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: Pull Request resolved: pytorch#36101

Reviewed By: jamesr66a

Differential Revision: D20908711

Pulled By: Krovatkin

fbshipit-source-id: f90ce74acffeb645d7d906d07e293164d65ed7e6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants