Move at::chunk into the graph fuser #10178

Closed
zou3519 wants to merge 2 commits into pytorch:master from zou3519:pytorch-chunkfusion2
Conversation

Contributor

@zou3519 zou3519 commented Aug 2, 2018

... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026

This is done through the following:

  1. Absorb starting chunks into FusionGroup as a part of the graph fuser
    pass.
  2. When compiling a kernel, emit a std::vector<ConcatDesc> that describes whether each input (of the original graph) will be chunked.
  3. When launching a kernel, use std::vector<ConcatDesc> to chunk an
    input tensor on the CPU. This chunk directly takes in an at::Tensor and creates
    four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.
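The in-place chunking in step 3 can be sketched as follows. This is a simplified, stand-alone illustration, not the PR's actual code: `TensorInfo` and `ChunkDesc` here are toy rank-2 stand-ins for the fuser's real argument-list structs, and only an even split along dim 0 of a contiguous tensor is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for the fuser's kernel-argument struct (rank-2 only here).
struct TensorInfo {
  float* data;
  std::size_t sizes[2];
  std::size_t strides[2];
};

// Describes that one kernel input should be split into nSubtensors pieces
// along dim (mirrors the role ConcatDesc plays in this PR).
struct ChunkDesc {
  bool isNoop;
  int nSubtensors;
  int dim;
};

// Split a contiguous rank-2 buffer into nSubtensors TensorInfos, writing
// them directly into the argument list -- no intermediate tensor objects.
void emitChunks(float* data, std::size_t rows, std::size_t cols,
                const ChunkDesc& desc, std::vector<TensorInfo>& args) {
  if (desc.isNoop) {
    args.push_back({data, {rows, cols}, {cols, 1}});
    return;
  }
  assert(desc.dim == 0 && rows % desc.nSubtensors == 0);  // toy: dim 0, even split
  std::size_t chunkRows = rows / desc.nSubtensors;
  for (int i = 0; i < desc.nSubtensors; ++i)
    args.push_back({data + i * chunkRows * cols, {chunkRows, cols}, {cols, 1}});
}
```

For example, chunking a 4x3 buffer into four pieces yields four views at row offsets 0, 1, 2, 3 of the same allocation, which is the allocation-free behavior the description above is after.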

Test Plan

  • Expect test and correctness test to see if a single chunk is fused
    by the graph fuser
  • Correctness test for a variety of chunks (dimension = beginning,
    middle, end) and tensors (contiguous, non-contiguous, and the
    splitSize = 1 edge case) for both CPU/CUDA
  • Expect test for multiple chunks fused into the same kernel and
    correctness test.

cc @zdevito @apaszke

Perf

LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.

After changes:

thnn    cudnn   jit
8.8468  6.5797  9.3470

Before changes:

thnn    cudnn   jit
9.9221  6.6539  11.2550

@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 5ec923d to bf687cc on August 2, 2018 20:51
@zou3519 zou3519 added the "oncall: jit" label on Aug 3, 2018
Contributor

@zdevito zdevito left a comment

The fusion pass changes look good. The fusion compiler stuff looks pretty good too but I have some suggestions to simplify it and to avoid allocations along the fast path that I think we should do. Otherwise it would make it difficult to add more functionality to the fuser later.


@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch 4 times, most recently from ffba17a to afe7b9d on August 6, 2018 20:12
Contributor Author

zou3519 commented Aug 6, 2018

I've modified the pull request to use ConcatDesc and fixed the extra std::vector allocations.

I've measured some rough numbers on how much more time it takes to run CompiledFusionFunction::launch_with_tensors by timing from the beginning of the function to right before it launches the kernel.

Before (master): 11.73 microseconds

After changes: 12.14 microseconds

So the current changes add only about 0.4 microseconds of overhead.
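The measurement above can be sketched generically (this is not the PR's actual launch_with_tensors code, just the described methodology): take one steady_clock reading at function entry and another immediately before the kernel launch, and report the difference in microseconds.

```cpp
#include <chrono>

// Time the host-side setup path: first reading at function entry, second
// reading right before the kernel launch would occur. The loop below is a
// stand-in for the real argument-preparation work.
double timeSetupMicros() {
  auto t0 = std::chrono::steady_clock::now();
  volatile long work = 0;
  for (int i = 0; i < 1000; ++i) work += i;    // stand-in for setup work
  auto t1 = std::chrono::steady_clock::now();  // immediately before launch
  return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```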


Contributor

yf225 commented Aug 14, 2018

@zou3519 Ping :D

@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch 2 times, most recently from 0656d68 to 3de2807 on August 15, 2018 20:05
Contributor Author

zou3519 commented Aug 16, 2018

This should be ready for another review now, despite the failing (unrelated) tests

@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 3de2807 to 60b58f1 on August 16, 2018 18:15
Contributor

@apaszke apaszke left a comment

LGTM, but I'm not sure if you're handling contiguity information correctly in PartitionDesc


@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 0e344fe to 598eedf on August 17, 2018 18:46
Contributor

@facebook-github-bot facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This is done through the following:

1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, move chunks out of the FusionGroup and use
the resulting graph. Emit a std::vector<MaybeChunkDesc> that describes
whether each input (of the original graph) will be chunked.
3) When launching a kernel, use std::vector<MaybeChunkDesc> to chunk an
input tensor on the CPU. This chunk takes in an at::Tensor and outputs
four TensorInfo structs, bypassing intermediate Tensors.
4) The resulting TensorInfo structs are sent into the compiled kernel.

Test Plan

- Expect test and correctness test to see if a single chunk is fused
  by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
  middle, end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
  correctness test.

Absorb starting at::chunk into FusionGroups

If all outputs of at::chunk are inputs to a FusionGroup and "chunks" and
"dim" are both constants, then the at::chunk is moved into the beginning
of the FusionGroup.
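The condition above can be sketched on a toy graph model. This is illustrative only: the real pass works on torch::jit::Node/Value, and the names here are made up for the sketch.

```cpp
#include <string>
#include <vector>

struct Node;
struct Use { const Node* user; };  // toy: each output has exactly one use

struct Node {
  std::string kind;
  bool chunksIsConstant = false;
  bool dimIsConstant = false;
  std::vector<Use> outputUses;
};

// An at::chunk may be absorbed only if "chunks" and "dim" are constants
// and every output is consumed by the same FusionGroup.
bool canAbsorbChunk(const Node& chunk, const Node& group) {
  if (chunk.kind != "aten::chunk") return false;
  if (!chunk.chunksIsConstant || !chunk.dimIsConstant) return false;
  for (const Use& u : chunk.outputUses)
    if (u.user != &group) return false;
  return true;
}
```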

Teach fusion compiler about at::chunk inside a FusionGroup

When compiling, the compiler emits an extra std::vector<ConcatDesc> that
says which inputs are chunked into how many pieces. The compiler scans
inputs and produces a list of "flat inputs".

When launching, the compiler scans the inputs and the chunk_desc to see
which inputs are chunked. It uses this information to prepare a list of
flat inputs to send to the compiled kernel.
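The "flat inputs" scan described above can be sketched as follows (types are simplified stand-ins, not the compiler's real ones): each entry of the per-input descriptor list says how many subtensors that graph input contributes to the kernel's argument list, with 1 meaning the input is not chunked.

```cpp
#include <utility>
#include <vector>

// nPieces[i] = number of subtensors graph input i expands into
// (1 when the input is not chunked). Returns the flat argument list
// as (graph input index, piece index) pairs.
std::vector<std::pair<int, int>> flattenInputs(const std::vector<int>& nPieces) {
  std::vector<std::pair<int, int>> flat;
  for (int i = 0; i < (int)nPieces.size(); ++i)
    for (int j = 0; j < nPieces[i]; ++j)
      flat.emplace_back(i, j);
  return flat;
}
```

So three graph inputs where only the second is chunked into four pieces expand into six flat kernel arguments.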

Update expect files

Fix nit

Windows fix

Address most comments, still working on the rest

Use prim::FusedChunk for chunks inside FusionGroup.

Addressed comments:

- add assert
- separate PartitionDesc chunk / cat ctors so the logic is clearer

If one has a graph like the following:
```
y1, y2 = chunk(x)
z1, z2 = chunk(x)
fusiongroup(y1, y2, z1, z2)
```
only one chunk should become a prim::FusedChunk inside the fusion group,
because of an invariant: no two prim::FusedChunk nodes inside a fusion
group may have the same input. This invariant exists because the fusion
compiler replaces each input to be chunked with its chunked tensors.
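The deduplication this invariant requires can be sketched like so (illustrative names, not the PR's code): before creating a fused chunk for an input, look up whether one already exists for that input and reuse it.

```cpp
#include <map>
#include <vector>

struct FusedChunk { int inputId; int chunks; };  // toy fused-chunk record

// Absorb a chunk of inputId into the group, reusing an existing fused
// chunk of the same input so no two fused chunks share an input.
int absorbChunk(std::map<int, int>& fusedByInput,
                std::vector<FusedChunk>& group, int inputId, int chunks) {
  auto it = fusedByInput.find(inputId);
  if (it != fusedByInput.end()) return it->second;  // reuse, keep invariant
  group.push_back({inputId, chunks});
  fusedByInput[inputId] = (int)group.size() - 1;
  return (int)group.size() - 1;
}
```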
@zou3519 zou3519 force-pushed the pytorch-chunkfusion2 branch from 598eedf to b00a76e on August 18, 2018 18:40
Contributor

@facebook-github-bot facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor Author

zou3519 commented Aug 18, 2018

@pytorchbot retest this please

PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from pytorch#10026

This is done through the following:

1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes whether each input (of the original graph) will be chunked.
3) When launching a kernel, use `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunk directly takes in an at::Tensor and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.

- Expect test and correctness test to see if a single chunk is fused
  by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
  middle, end) and tensors (contiguous, non-contiguous, and the
  splitSize = 1 edge case) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
  correctness test.

cc zdevito apaszke

LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.

After changes:
```
thnn    cudnn   jit
8.8468  6.5797  9.3470
```

Before changes:
```
thnn    cudnn   jit
9.9221  6.6539  11.2550
```
Pull Request resolved: pytorch#10178

Differential Revision: D9382661

Pulled By: zou3519

fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
@ezyang ezyang added the merged label Jun 26, 2019