Broadcast masked_select and implement double backwards for masked_scatter, index_copy, scatter_add, gather #1884
Conversation
gchanan commented on Jun 22, 2017
- Add broadcasting to masked_select; it was arbitrary that we weren't broadcasting it previously (there is no exact NumPy equivalent outside of advanced indexing). For consistency with similar functions, it's probably better to broadcast it.
- Implement double backwards of masked_scatter; this is easier with masked_select broadcasting because those two functions are forward/backward inverses of each other (see the sketch below this list).
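A minimal sketch of that inverse relationship; the shapes and values are illustrative only, and it uses the current API with boolean masks (this PR predates `torch.bool`; masks were `ByteTensor` at the time):

```python
import torch

x = torch.randn(2, 3)
# A (3,) mask now broadcasts against the (2, 3) input, selecting
# columns 0 and 2 of every row.
mask = torch.tensor([True, False, True])
selected = torch.masked_select(x, mask)   # 1-D tensor of 4 elements

# masked_scatter_ goes the other direction: it writes `selected`
# back into the positions where the (broadcast) mask is True.
y = torch.zeros(2, 3)
y.masked_scatter_(mask, selected)
assert torch.equal(torch.masked_select(y, mask), selected)
```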
The IndexCopy test failed on Travis once.
0-dimensional tensors.
This has another occasional failure: if the generated mask for masked_select is all zeros, the test fails because the output is 0-dimensional and various places in the code can't handle that. Previously this happened once every 2^25 runs (the smallest mask was 5x5, i.e. 25 independent coin flips), but the smallest mask is now generated with only 5 elements and expanded, so it fails once every 2^5 runs.
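For illustration, a sketch of the degenerate case (modern API; today the result is a 0-element 1-D tensor, whereas at the time it was effectively 0-dimensional and unsupported in parts of the code):

```python
import torch

x = torch.randn(5)
mask = torch.zeros(5, dtype=torch.bool)  # the all-zeros draw: probability (1/2)**5
out = torch.masked_select(x, mask)
print(out.shape)  # torch.Size([0]) -- the empty result the test machinery choked on
```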
The problem with indexCopy is that the output and gradient are undefined in the presence of repeated indices. The docs are incorrect: they say that indices is "index (LongTensor) – Indices to select from tensor", but it is actually the indices to copy *to* in the underlying tensor, so with repeated indices it's undefined which write "wins". Here's an example with CUDA:
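(The original CUDA example was not preserved; the following hypothetical repro illustrates the ambiguity.)

```python
import torch

x = torch.zeros(3, 2)
index = torch.tensor([0, 0])              # index 0 repeated
src = torch.tensor([[1., 1.], [2., 2.]])
# Both source rows target row 0; which one "wins" is unspecified,
# and on CUDA the winner can change from run to run.
x.index_copy_(0, index, src)
print(x[0])  # may be tensor([1., 1.]) or tensor([2., 2.])
```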
I believe the above issues have been addressed. |