[WIP][DO NOT REVIEW] upstream push smoke test #82937
Closed
jjsjann123 wants to merge 909 commits into pytorch:master from
Conversation
* initial volta support
* mma parallel type && cleanup
* cleanup
* alignment
* comment
* change request
* fix same parallel type
* move validation pass
* comment and cleanup
* lint
* comment and cleanup
* comment and format
* Propagate new symbol throughout fusion using ValReplacementMutator
* Replace FusionViewFailPersistent with FusionViewPersistentShmoo
* Create separate test-gpu-view.cpp for view tests
* Move replaceValue to ir_utils
`fusion_args` prints arguments given to `runFusion`; `kernel_args` prints arguments given to the generated CUDA kernels.
* Fixes validation of vectorization with contig indexing. True contig indexing needs reference tensors, so finding vectorized contig domains at initial validation time can result in false positives and negatives. Fixed by filling in that information at indexing time. Also considered keeping it separate from indexing and filling it in at validation time, but that would end up replicating the same logic as reference replay. Closes pytorch#1534
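The contiguity condition behind vectorizing over "contig-merged" domains can be sketched in plain Python. This is an illustrative model only, not nvfuser's actual code; `can_merge_contiguous` is a hypothetical helper. Two adjacent dims can be treated as one contiguous region (and hence vectorized across the boundary) only when the outer stride equals the inner stride times the inner extent:

```python
# Hypothetical sketch (not nvfuser's implementation): dims i and i+1 of a
# strided tensor form one contiguous region iff
#   strides[i] == strides[i+1] * shape[i+1].
def can_merge_contiguous(shape, strides, i):
    """True if dims i and i+1 can be merged for contiguous (vectorized) access."""
    return strides[i] == strides[i + 1] * shape[i + 1]

# Dense row-major (4, 8) tensor, strides (8, 1): mergeable.
print(can_merge_contiguous((4, 8), (8, 1), 0))   # True
# Same logical shape but padded rows, strides (10, 1): not mergeable.
print(can_merge_contiguous((4, 8), (10, 1), 0))  # False
```

The second case shows why validating this up front without the real tensors is fragile: the shapes alone look identical, and only the strides reveal the padding.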
To highlight the impact of the change, renamed `IterDomain::clone()` to `IterDomain::cloneWithoutRFactor()`.
* save
* save
* save
* save
…ch#1552)
* Fix ComputeAtRootDomainMap with broadcast in view root domains. Fixes pytorch#1549
…torch#1529)
* Allow vectorization with contig-merged domains in the pointwise scheduler
* Forward merging of trivial-reduction dims in producers
* Enable trivial reduction forwarding only when the trivial reduction domain is a root domain. For example, splitting a reduction domain by 1 and merging it with another non-reduction domain would result in a trivial-reduction merge. It would probably be possible to allow such non-root trivial reduction domains, but that would mean, e.g., a leaf domain could be mappable while its root domain is unmappable, which seems rather confusing. Since such transformations are unlikely, not enabling forwarding should be fine and cause less surprise.
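To see why a trivial reduction is safe to forward through, note that reducing over an extent-1 axis combines nothing: it is just a squeeze. A minimal sketch in plain Python (illustrative only; `sum_axis1` is a hypothetical stand-in, not a fuser op):

```python
# Illustrative only: a "trivial reduction" reduces over an extent-1 axis,
# so no values are actually combined -- the result equals a squeeze.
def sum_axis1(t):
    """Sum a nested list of shape (N, 1, M) over the middle (size-1) axis."""
    return [[sum(col) for col in zip(*row)] for row in t]

t = [[[1, 2, 3]], [[4, 5, 6]]]            # shape (2, 1, 3)
print(sum_axis1(t) == [r[0] for r in t])  # True: reduction == squeeze
```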
…1556)
* Propagate root domain mappings from rfactor to root domains in ComputeAtRootDomainMap

The main purpose of ComputeAtRootDomainMap is to find unmappable domains for computeAt. This analysis is done by traversing a fusion in the backward direction. Currently, the traversal only visits arithmetic expressions, so information is propagated from consumer tensors to producer tensors. This propagation is also required from rfactor domains to root domains. Previously this didn't really matter, as rfactor was limited to reduction domains, but that is not the case with view.

This change also means that ComputeAtRootDomainMap does not guarantee one-to-one mappings. For example:

```
tv0: [I0, I1]
tv1 = view(tv0); // tv1: [I0*I1/N, N]
```

That is, the view op first merges the two domains of `tv0` and then splits the result by N. Note that both of the two rfactor axes of `tv1` are now mapped to both of the axes of `tv0`. Because of this change, `ComputeAtRootDomainMap::mapBestEffort` and other mapping functions between a producer and a consumer that are supposed to return a one-to-one map can fail. `ComputeAtRootDomainMap::getMappableDims` is fine, as it just grabs any domain that is mappable. `ComputeAtRootDomainMap::mapConsumerToProducer` and `ComputeAtRootDomainMap::mapProducerToConsumer` were used in `TransformReplay::replayPasC` and `TransformReplay::replayCasP`, but they don't really need `ComputeAtRootDomainMap`; `PairwiseRootDomainMap` is sufficient, so the usages were replaced with the pairwise variant.
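The many-to-many mapping can be made concrete with a small index-arithmetic sketch (illustrative only; `view_index` is a hypothetical helper, not a fuser API). Merging the two input axes linearizes them row-major, and splitting by N then makes each output coordinate depend on both input coordinates:

```python
# Why view's merge-then-split maps many-to-many: viewing (I0, I1) as
# (I0*I1 // N, N) first linearizes both input axes, then splits the
# linear index by N, so each output axis depends on BOTH I0 and I1.
def view_index(i0, i1, I1, N):
    """Map an (i0, i1) coordinate to its (j0, j1) coordinate after the view."""
    linear = i0 * I1 + i1           # merge: row-major linearization
    return linear // N, linear % N  # split by N

# (I0, I1) = (4, 6) viewed as (8, 3):
print(view_index(0, 0, 6, 3))  # (0, 0)
print(view_index(1, 0, 6, 3))  # (2, 0)  -- moving i0 moved j0
print(view_index(0, 4, 6, 3))  # (1, 1)  -- moving i1 moved both j0 and j1
```

Since moving along either input axis can change either output coordinate, no one-to-one root-domain map exists for such a view.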
* Minor fix on python test
Add flatten support on the python side
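The semantics being exposed can be sketched in a few lines of Python (a hypothetical illustration of flatten-style shape arithmetic, not the actual nvfuser binding): flattening a contiguous range of dims multiplies their extents into one dim, as in `torch.flatten(x, start_dim, end_dim)`.

```python
# Hypothetical shape-level model of flatten(start_dim, end_dim):
# merge the extents of dims [start_dim, end_dim] into a single dim.
from functools import reduce
from operator import mul

def flattened_shape(shape, start_dim, end_dim):
    """Shape after flattening dims start_dim..end_dim (inclusive)."""
    merged = reduce(mul, shape[start_dim:end_dim + 1], 1)
    return shape[:start_dim] + (merged,) + shape[end_dim + 1:]

print(flattened_shape((2, 3, 4), 0, 1))  # (6, 4)
print(flattened_shape((2, 3, 4), 1, 2))  # (2, 12)
```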
pytorch#1559) * Added a more helpful error message when checking for empty outputs on the Fusion. * Clang fix.
…#1561)
* Do not re-compute a unary op that already has an output, and allow expression duplication in the debug print.
* always allocate dynamic smem
* add driver API call for large smem usage

Co-authored-by: Christian Sarofeen <csarofeen@nvidia.com>
This reverts commit 2d5e4cf.
quick test fix
* Remove some Welford-specific logic.
* Multi-reduction fix
* Some more minor cleanup.
* Add a note on multi-input reductions

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Split out from pytorch#1854
- `InlinePropagatorSelector` seems less generally useful than `BoundedPropagationSelector`, so I made `InlinePropagatorSelector` a private class of `compute_at.cpp` and renamed it to `ComputeAtSelector`, and moved `BoundedPropagationSelector` to `maxinfo_propagator.h` and renamed it to `SetSelector`.
- Split `DomainMap` out of `pointwise.cpp` into `pointwise_utils.cpp` and renamed some functions.
- Add two cache entries, `DOMAIN_MAP` and `REFERENCE_TENSORS`, and use them in the pointwise scheduler.
…ined matmul operand load (pytorch#1827)
Upstream merge 0803
❌ 5 new failures as of commit 932a0e1 (more details on the Dr. CI page).
🕵️ 5 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
hmmm, the assert failure on cuda10.2 is really strange... It's complaining about a mismatched number of inputs to the fused kernel. 😕 The other failure, about a codegen error, should be easy to patch; we are leaking `__bfloat` there in a debug print.
Placeholder PR for the nvfuser code bump; a smoke test for CI. The real PR should go through the upstream repo.