Move at::chunk into the graph fuser #10178

Closed
zou3519 wants to merge 2 commits into pytorch:master from zou3519:pytorch-chunkfusion2
Conversation

Contributor

@zou3519 zou3519 commented Aug 2, 2018

... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026

This is done through the following:

  1. Absorb starting chunks into FusionGroup as a part of the graph fuser
    pass.
  2. When compiling a kernel, emit a std::vector<ConcatDesc> that describes whether each input (of the original graph) will be chunked.
  3. When launching a kernel, use std::vector<ConcatDesc> to chunk an
    input tensor on the CPU. This chunk directly takes in an at::Tensor and creates
    four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.
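The in-place chunking in step 3 can be sketched as follows. This is a simplified, stand-alone illustration, not the PR's actual code: `TensorInfo` and `ChunkDesc` here are toy rank-2 stand-ins for the fuser's real argument-list structs, and only an even split along dim 0 of a contiguous tensor is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for the fuser's kernel-argument struct (rank-2 only here).
struct TensorInfo {
  float* data;
  std::size_t sizes[2];
  std::size_t strides[2];
};

// Describes that one kernel input should be split into nSubtensors pieces
// along dim (mirrors the role ConcatDesc plays in this PR).
struct ChunkDesc {
  bool isNoop;
  int nSubtensors;
  int dim;
};

// Split a contiguous rank-2 buffer into nSubtensors TensorInfos, writing
// them directly into the argument list -- no intermediate tensor objects.
void emitChunks(float* data, std::size_t rows, std::size_t cols,
                const ChunkDesc& desc, std::vector<TensorInfo>& args) {
  if (desc.isNoop) {
    args.push_back({data, {rows, cols}, {cols, 1}});
    return;
  }
  assert(desc.dim == 0 && rows % desc.nSubtensors == 0);  // toy: dim 0, even split
  std::size_t chunkRows = rows / desc.nSubtensors;
  for (int i = 0; i < desc.nSubtensors; ++i)
    args.push_back({data + i * chunkRows * cols, {chunkRows, cols}, {cols, 1}});
}
```

For example, chunking a 4x3 buffer into four pieces yields four views at row offsets 0, 1, 2, 3 of the same allocation, which is the allocation-free behavior the description above is after.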

Test Plan

  • Expect test and correctness test to see if a single chunk is fused
    by the graph fuser
  • Correctness test for a variety of chunks (dimension = beginning,
    middle, end) and tensors (contiguous, non-contiguous, and the
    splitSize = 1 edge case) for both CPU/CUDA
  • Expect test for multiple chunks fused into the same kernel and
    correctness test.

cc @zdevito @apaszke

Perf

LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.

After changes:

thnn    cudnn   jit
8.8468  6.5797  9.3470

Before changes:

thnn    cudnn   jit
9.9221  6.6539  11.2550

@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 5ec923d to bf687cc on August 2, 2018 20:51
@zou3519 zou3519 added the "oncall: jit" label on Aug 3, 2018
Contributor

@zdevito zdevito left a comment

The fusion pass changes look good. The fusion compiler stuff looks pretty good too but I have some suggestions to simplify it and to avoid allocations along the fast path that I think we should do. Otherwise it would make it difficult to add more functionality to the fuser later.


@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch 4 times, most recently from ffba17a to afe7b9d on August 6, 2018 20:12
Contributor Author

zou3519 commented Aug 6, 2018

I've modified the pull request to use ConcatDesc and fixed the extra std::vector allocations.

I've measured some rough numbers on how much more time it takes to run CompiledFusionFunction::launch_with_tensors by timing from the beginning of the function to right before it launches the kernel.

Before (master): 11.73 microseconds

After changes: 12.14 microseconds

So the current changes add only about 0.4 microseconds of overhead.
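The measurement above can be sketched generically (this is not the PR's actual launch_with_tensors code, just the described methodology): take one steady_clock reading at function entry and another immediately before the kernel launch, and report the difference in microseconds.

```cpp
#include <chrono>

// Time the host-side setup path: first reading at function entry, second
// reading right before the kernel launch would occur. The loop below is a
// stand-in for the real argument-preparation work.
double timeSetupMicros() {
  auto t0 = std::chrono::steady_clock::now();
  volatile long work = 0;
  for (int i = 0; i < 1000; ++i) work += i;    // stand-in for setup work
  auto t1 = std::chrono::steady_clock::now();  // immediately before launch
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```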


Contributor

yf225 commented Aug 14, 2018

@zou3519 Ping :D

@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch 2 times, most recently from 0656d68 to 3de2807 on August 15, 2018 20:05
Contributor Author

zou3519 commented Aug 16, 2018

This should be ready for another review now, despite the failing (unrelated) tests

@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 3de2807 to 60b58f1 on August 16, 2018 18:15
Contributor

@apaszke apaszke left a comment

LGTM, but I'm not sure if you're handling contiguity information correctly in PartitionDesc


@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 0e344fe to 598eedf on August 17, 2018 18:46
Contributor

@facebook-github-bot facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This is done through the following:

1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, move chunks out of the FusionGroup and use
the resulting graph. Emit a std::vector<MaybeChunkDesc> that describes
whether each input (of the original graph) will be chunked.
3) When launching a kernel, use std::vector<MaybeChunkDesc> to chunk an
input tensor on the CPU. This chunk takes in an at::Tensor and outputs
four TensorInfo structs, bypassing intermediate Tensors.
4) The resulting TensorInfo structs are sent into the compiled kernel.

Test Plan

- Expect test and correctness test to see if a single chunk is fused
  by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
  middle, end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
  correctness test.

Absorb starting at::chunk into FusionGroups

If all outputs of at::chunk are inputs to a FusionGroup and "chunks" and
"dim" are both constants, then the at::chunk is moved into the beginning
of the FusionGroup.
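The condition above can be sketched on a toy graph model. This is illustrative only: the real pass works on torch::jit::Node/Value, and the names here are made up for the sketch.

```cpp
#include <string>
#include <vector>

struct Node;
struct Use { const Node* user; };  // toy: each output has exactly one use

struct Node {
  std::string kind;
  bool chunksIsConstant = false;
  bool dimIsConstant = false;
  std::vector<Use> outputUses;
};

// An at::chunk may be absorbed only if "chunks" and "dim" are constants
// and every output is consumed by the same FusionGroup.
bool canAbsorbChunk(const Node& chunk, const Node& group) {
  if (chunk.kind != "aten::chunk") return false;
  if (!chunk.chunksIsConstant || !chunk.dimIsConstant) return false;
  for (const Use& u : chunk.outputUses)
    if (u.user != &group) return false;
  return true;
}
```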

Teach fusion compiler about at::chunk inside a FusionGroup

When compiling, the compiler emits an extra std::vector<ConcatDesc> that
says which inputs are chunked into how many pieces. The compiler scans
inputs and produces a list of "flat inputs".

When launching, the compiler scans the inputs and the chunk_desc to see
which inputs are chunked. It uses this information to prepare a list of
flat inputs to send to the compiled kernel.
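The "flat inputs" scan described above can be sketched as follows (types are simplified stand-ins, not the compiler's real ones): each entry of the per-input descriptor list says how many subtensors that graph input contributes to the kernel's argument list, with 1 meaning the input is not chunked.

```cpp
#include <utility>
#include <vector>

// nPieces[i] = number of subtensors graph input i expands into
// (1 when the input is not chunked). Returns the flat argument list
// as (graph input index, piece index) pairs.
std::vector<std::pair<int, int>> flattenInputs(const std::vector<int>& nPieces) {
  std::vector<std::pair<int, int>> flat;
  for (int i = 0; i < (int)nPieces.size(); ++i)
    for (int j = 0; j < nPieces[i]; ++j)
      flat.emplace_back(i, j);
  return flat;
}
```

So three graph inputs where only the second is chunked into four pieces expand into six flat kernel arguments.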

Update expect files

Fix nit

Windows fix

Address most comments, still working on the rest

Use prim::FusedChunk for chunks inside FusionGroup.

Addressed comments:

- add assert
- separate PartitionDesc chunk / cat ctors so the logic is clearer

If one has a graph like the following:
```
y1, y2 = chunk(x)
z1, z2 = chunk(x)
fusiongroup(y1, y2, z1, z2)
```
only one chunk should become a prim::FusedChunk inside the fusion group,
because of an invariant: no two prim::FusedChunk nodes inside a fusion
group may have the same input. This invariant exists because the fusion
compiler replaces each input to be chunked with its chunked tensors.
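The deduplication this invariant requires can be sketched like so (illustrative names, not the PR's code): before creating a fused chunk for an input, look up whether one already exists for that input and reuse it.

```cpp
#include <map>
#include <vector>

struct FusedChunk { int inputId; int chunks; };  // toy fused-chunk record

// Absorb a chunk of inputId into the group, reusing an existing fused
// chunk of the same input so no two fused chunks share an input.
int absorbChunk(std::map<int, int>& fusedByInput,
                std::vector<FusedChunk>& group, int inputId, int chunks) {
  auto it = fusedByInput.find(inputId);
  if (it != fusedByInput.end()) return it->second;  // reuse, keep invariant
  group.push_back({inputId, chunks});
  fusedByInput[inputId] = (int)group.size() - 1;
  return (int)group.size() - 1;
}
```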
@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 598eedf to b00a76e on August 18, 2018 18:40
Contributor

@facebook-github-bot facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor Author

zou3519 commented Aug 18, 2018

@pytorchbot retest this please

PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from pytorch#10026

This is done through the following:

1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes whether each input (of the original graph) will be chunked.
3) When launching a kernel, use `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunk directly takes in an at::Tensor and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.

- Expect test and correctness test to see if a single chunk is fused
  by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
  middle, end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
  correctness test.

cc zdevito apaszke

LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.

After changes:
```
thnn    cudnn   jit
8.8468  6.5797  9.3470
```

Before changes:
```
thnn    cudnn   jit
9.9221  6.6539  11.2550
```
Pull Request resolved: pytorch#10178

Differential Revision: D9382661

Pulled By: zou3519

fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
@ezyang ezyang added the merged label Jun 26, 2019