
THD updates #1396

Closed

apaszke wants to merge 49 commits into master from thd


Conversation

@apaszke
Contributor

@apaszke apaszke commented Apr 28, 2017

No description provided.

@apaszke apaszke force-pushed the thd branch 2 times, most recently from a988b35 to 3f2c72c on April 28, 2017 at 22:01
@apaszke
Contributor Author

apaszke commented May 1, 2017

Rebased and merged directly into master.

@apaszke apaszke closed this May 1, 2017
houseroad added a commit to houseroad/pytorch that referenced this pull request Sep 13, 2018
…7894eb

Summary:
Previous import was bff0b8835870c7df7762ef43498d000d2d8ffb52

Included changes:
- **[39dd0d4](onnx/onnx@39dd0d4)**: [build] Add ONNX_API for protos in all cases (pytorch#1407) <Orion Reblitz-Richardson>
- **[944db4f](onnx/onnx@944db4f)**: cmake (pytorch#1401) <zrphercule>
- **[8ccc8dd](onnx/onnx@8ccc8dd)**: Remove ONNXIFI_CHECK_RESULT from onnxRelease* functions (pytorch#1397) <Marat Dukhan>
- **[df14e74](onnx/onnx@df14e74)**: Change onnxifi test driver classname (pytorch#1396) <zrphercule>
- **[0c885cc](onnx/onnx@0c885cc)**: ONNXIFI cpp test driver (pytorch#1290) <zrphercule>
- **[a557848](onnx/onnx@a557848)**: Coverage Report Tools for Backend Scoreboard (pytorch#1301) <Akshay Chalana>
- **[31fd87f](onnx/onnx@31fd87f)**: fix AvgPool doc. add default value for count_include_pad (pytorch#1391) <Wenhao Hu>
- **[8ff08c2](onnx/onnx@8ff08c2)**: Do not export onnx symbols in the python extension (pytorch#1388) <bddppq>

Differential Revision: D9806635

fbshipit-source-id: 962e5dcb79f98a7e3a769b1ca9633e60c1735b48
facebook-github-bot pushed a commit that referenced this pull request Sep 13, 2018
…7894eb (#11622)

Summary:
Pull Request resolved: #11622

Previous import was bff0b8835870c7df7762ef43498d000d2d8ffb52

Included changes:
- **[39dd0d4](onnx/onnx@39dd0d4)**: [build] Add ONNX_API for protos in all cases (#1407) <Orion Reblitz-Richardson>
- **[944db4f](onnx/onnx@944db4f)**: cmake (#1401) <zrphercule>
- **[8ccc8dd](onnx/onnx@8ccc8dd)**: Remove ONNXIFI_CHECK_RESULT from onnxRelease* functions (#1397) <Marat Dukhan>
- **[df14e74](onnx/onnx@df14e74)**: Change onnxifi test driver classname (#1396) <zrphercule>
- **[0c885cc](onnx/onnx@0c885cc)**: ONNXIFI cpp test driver (#1290) <zrphercule>
- **[a557848](onnx/onnx@a557848)**: Coverage Report Tools for Backend Scoreboard (#1301) <Akshay Chalana>
- **[31fd87f](onnx/onnx@31fd87f)**: fix AvgPool doc. add default value for count_include_pad (#1391) <Wenhao Hu>
- **[8ff08c2](onnx/onnx@8ff08c2)**: Do not export onnx symbols in the python extension (#1388) <bddppq>

Reviewed By: orionr

Differential Revision: D9806635

fbshipit-source-id: f61c052b6bd14e0c80ace19c1a5f0ba659030c6f
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this pull request Nov 1, 2022
…ytorch#1400)

* it looks possible to remove this file

* add communication collectives

* update Column|RowParallelLinear

* update checkpoint function

* update function name

* parity between public and private collectives

* row parallel linear

* column parallel linear

* sequence parallel: p2p comm

fix typo

* sequence parallel: pipeline parallel

* fix typo

* add layernorm with sequence_parallel_enabled attr

* class variable -> member variable

* fix col parallel test with sequence parallel

* Initial test of `forward_backward_pipelining_without_interleaving` with `model_type=ModelType.encoder_and_decoder`

* add cases pretending to test sequence_parallel

* Apply 2 suggestion(s) to 1 file(s)

* update sequence_parallel_enabled docstring

* update docstring: order of tensor dimensions, sequence_parallel_enabled behavior

* Divide sequence_length if sequence parallel

The tensor shape should be updated if sequence parallelism is enabled (see the sketch after this list).

* cherry-pick NVIDIA/Megatron-LM@8474e6e

* type annotation

* Fix matmul call in RowParallelLinear

Pin `sequence_parallel_enabled` to `False`, as done in
https://github.com/NVIDIA/Megatron-LM/blob/d898a8991d1a08d29074f87819d1bf41517e35f5/megatron/mpu/layers.py#L511-L514

* update rowparallellinear test

* fix undefined `loss_weight` in test_layers

* @eqy's comment

* mixed fused layer norm

* fix typo

* misc

* test_layers cleanup

* Skip Bert/GPT script

These two models haven't been updated for sequence parallelism yet, e.g. the change of the dimension order from (batch, sequence, feature) to (sequence, batch, feature) and the global argument variables.

* debug part 1/N: comment out `x.retain_grad`

* debug part 2/N: [ColumnParallelLinear] comment out overriding of sequence_parallel_enabled

* debug 3/N: add pipeline test with parallel mlp

* Fix handling `self.input_tensor` and argument

* tp2pp4 ModelType.encoder_or_decoder is failing, which may be my fault because the backward pass complains that the output and grad_output shapes don't match

* revert debug 1/N

* defer tensor model parallel size > 1

* split tensor in sequence dim

* cosmetic

* cosmetic: remove archaic comment

* enable TP>1 for encoder_and_decoder as well

* set requires_grad=True always...

* Set `scatter_gather_tensors_in_pipeline` to :obj:`False`

so that NeMo Megatron's GPT works with sequence parallelism enabled.

* brush up comment of `requires_grad()`

According to @ptrblck, PyTorch DistributedDataParallel can hang when some
tensor (or parameter) doesn't require grad. In my understanding, this forced
`requires_grad` is a different case.

* misc changes of scatter_gather_tensors_in_pipeline comment

* guard for torch_ucc

* cosmetic changes related to tests

* update command line arguments

* update TransformerLanguageModel

* rename

* move gpt to gpt.py

* update bert

* add all_gather for params in sequence parallel region

* misc. some diffs were lost during rebasing...

* updates for non sequence parallel execution

* gpt with sequence parallel

* Apply 2 suggestion(s) to 2 file(s)

* update tensor&pipeline parallel size

* why is `sequence_parallel_enabled` not supplied!? Did I mess up when rebasing?

* cosmetic fix

* correct key is sequence_parallel_enabled
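
A rough single-process sketch of the sequence-dimension split referenced in the "Divide sequence_length if sequence parallel" and "split tensor in sequence dim" items above; the helper name and the explicit `tp_world_size`/`tp_rank` arguments are illustrative stand-ins, not the actual tensor-model-parallel group utilities:

```python
import torch

def split_along_sequence_dim(x: torch.Tensor, tp_world_size: int, tp_rank: int) -> torch.Tensor:
    # Assumes the (sequence, batch, hidden) layout mentioned above; each
    # tensor-model-parallel rank keeps only its contiguous slice of the
    # sequence dimension when sequence parallelism is enabled.
    seq_len = x.size(0)
    assert seq_len % tp_world_size == 0, "sequence length must divide evenly across ranks"
    chunk = seq_len // tp_world_size
    return x[tp_rank * chunk:(tp_rank + 1) * chunk].contiguous()

# Toy usage: sequence length 8 split across tp_world_size=4 -> 2 per rank.
x = torch.randn(8, 2, 16)  # (sequence, batch, hidden)
local = split_along_sequence_dim(x, tp_world_size=4, tp_rank=1)
print(local.shape)  # torch.Size([2, 2, 16])
```
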
rraminen pushed a commit to rraminen/pytorch that referenced this pull request May 14, 2024
There was a known issue with triton where we saw errors with bfloat16.
This is now fixed upstream with
pytorch#111129. However, it seems that
we branched off release/2.1 before the change was merged upstream. In
the meantime, we can just skip these UTs.
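
A minimal sketch of what skipping those UTs could look like, assuming plain `unittest`-style tests; the test class, test method, and the `SKIP_BF16_TRITON` flag below are hypothetical, not the actual test names:

```python
import unittest
import torch

# Hypothetical flag: flip back to False once the upstream triton fix
# (pytorch#111129) is picked up on the release/2.1 branch.
SKIP_BF16_TRITON = True

class TestBF16Kernels(unittest.TestCase):
    @unittest.skipIf(SKIP_BF16_TRITON, "known triton bfloat16 issue; fixed upstream in pytorch#111129")
    def test_matmul_bf16(self):
        a = torch.randn(8, 8, dtype=torch.bfloat16)
        b = torch.randn(8, 8, dtype=torch.bfloat16)
        self.assertEqual((a @ b).dtype, torch.bfloat16)

if __name__ == "__main__":
    unittest.main()
```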