
Parallelize TensorMethods.cpp builds #1400

Merged
soumith merged 1 commit into master from split_types
Apr 29, 2017
Conversation


@apaszke apaszke commented Apr 28, 2017

Now works with Python 2 too. See #1364 for original PR.

@soumith soumith merged commit 9169f60 into master Apr 29, 2017
@soumith soumith deleted the split_types branch April 29, 2017 13:07
Jiaming-Liu pushed a commit to Jiaming-Liu/pytorch that referenced this pull request May 18, 2017
jjsjann123 added a commit to jjsjann123/pytorch that referenced this pull request Feb 1, 2022
Fixes pytorch#1311

A scalar tensor is defined as a rank-0, size-1 tensor. PyTorch eager (mostly TensorIterator) supports device promotion of cpu scalar tensors: you can have cross-device tensors (a cpu scalar tensor and cuda tensors) feeding a single operator, and the cpu scalar tensor is promoted to a scalar.

We extended this support to nvfuser. A few changes were required to support this:

* An API to query whether a given tensor is indeed a scalar tensor: is_scalar. The current criteria are tensor rank and size (utils.h & utils.cpp).
* An update to the partition logic so that the device of a cpu scalar tensor is ignored. This should avoid accidentally merging an operator whose inputs are two cpu scalar tensors.
* Integration code updates:
  i. map a TS cpu scalar tensor to a codegen scalar;
  ii. skip the usual tensor checks (vectorization / valid inputs) for cpu scalar tensors;
  iii. kernel arguments extract the scalar value from the cpu scalar tensor.
* cpu scalar tests. Cases to verify: 1. cpu scalar tensor with gpu tensor; 2. cpu scalar tensor with cpu scalar tensor; 3. cpu scalar tensor with cpu tensor; 4. cpu tensor with gpu scalar tensor.

Note that we briefly tried the alternative approach of moving the cpu scalar tensor to a gpu scalar tensor. The implementation is very straightforward, but cuda tensor creation and copy are really slow, hence the motivation to extract it into a scalar argument. More details in issue pytorch#1311.
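The is_scalar criterion described above (rank-0, size-1, CPU-resident) can be sketched in pure Python. The `TensorMeta` stand-in and its field names are hypothetical; the real check lives in nvfuser's utils.h / utils.cpp and inspects at::Tensor metadata.

```python
from dataclasses import dataclass

@dataclass
class TensorMeta:
    shape: tuple   # tensor dimensions; () means rank-0
    device: str    # "cpu" or "cuda"

def numel(t: TensorMeta) -> int:
    n = 1
    for d in t.shape:
        n *= d
    return n

def is_cpu_scalar(t: TensorMeta) -> bool:
    # a cpu scalar tensor is rank-0 and size-1, resident on the CPU
    return len(t.shape) == 0 and numel(t) == 1 and t.device == "cpu"

print(is_cpu_scalar(TensorMeta((), "cpu")))    # rank-0 cpu tensor -> True
print(is_cpu_scalar(TensorMeta((1,), "cpu")))  # rank-1 size-1 -> False
print(is_cpu_scalar(TensorMeta((), "cuda")))   # rank-0 but on gpu -> False
```

A rank-1, size-1 tensor deliberately does not qualify, matching the rank-0 definition quoted above.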
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this pull request Nov 1, 2022
…ytorch#1400)

* it looks possible to remove this file

* add communication collectives

* update Column|RowParallelLinear

* update checkpoint function

* update function name

* parity between public and private collectives

* row parallel linear

* column parallel linear

* sequence parallel: p2p comm

fix typo

* sequence parallel: pipeline parallel

* fix typo

* add layernorm with sequence_parallel_enabled attr

* class variable -> member variable

* fix col parallel test with sequence parallel

* Initial test of `forward_backward_pipelining_without_interleaving` with `model_type=ModelType.encoder_and_decoder`

* add cases pretending to test sequence_parallel

* Apply 2 suggestion(s) to 1 file(s)

* update sequence_parallel_enabled docstring

* update docstring: order of tensor dimensions, sequence_parallel_enabled behavior

* Divide sequence_length if sequence parallel

tensor shape should be updated if sequence parallel is enabled.
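The shape update described in this commit can be sketched in pure Python. The helper name and signature are illustrative, not the Megatron-LM API; the point is that with sequence parallelism each rank holds sequence_length divided by the tensor-parallel group size.

```python
# Hypothetical sketch: with sequence parallelism enabled, activations are
# scattered along the sequence dimension across the tensor parallel group,
# so the per-rank shape divides seq_len by the group size.
def per_rank_shape(seq_len, batch, hidden, tp_world_size, sequence_parallel):
    if sequence_parallel:
        assert seq_len % tp_world_size == 0, "sequence length must divide evenly"
        seq_len //= tp_world_size
    # Megatron orders dimensions as (sequence, batch, feature)
    return (seq_len, batch, hidden)

print(per_rank_shape(1024, 4, 768, tp_world_size=8, sequence_parallel=True))   # (128, 4, 768)
print(per_rank_shape(1024, 4, 768, tp_world_size=8, sequence_parallel=False))  # (1024, 4, 768)
```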

* cherry-pick NVIDIA/Megatron-LM@8474e6e

* type annotation

* Fix matmul call in RowParallelLinear

Fix `sequence_parallel_enabled` to `False` as you can see in
https://github.com/NVIDIA/Megatron-LM/blob/d898a8991d1a08d29074f87819d1bf41517e35f5/megatron/mpu/layers.py#L511-L514

* update rowparallellinear test

* fix `loss_weight` is not defined in test_layers

* @eqy's comment

* mixed fused layer norm

* fix typo

* misc

* test_layers cleanup

* Skip Bert/GPT script

Since these two models haven't been updated for sequence parallel, e.g. the update of the order of dimensions from (batch, sequence, feature) to (sequence, batch, feature) and the global variables of arguments

* debug part 1/N: comment out `x.retain_grad`

* debug part 2/N: [ColumnParallelLinear] comment out overriding of sequence_parallel_enabled

* debug 3/N: add pipeline test with parallel mlp

* Fix handling `self.input_tensor` and argument

* tp2pp4 ModelType.encoder_or_decoder is failing, which may be my fault because the backward is complaining that the output and the grad_output shapes don't match

* revert debug 1/N

* defer tensor model parallel size > 1

* split tensor in sequence dim

* cosmetic

* cosmetic: remove archaic comment

* enable TP>1 for encoder_and_decoder as well

* set requires_grad=True always...

* Set `scatter_gather_tensors_in_pipeline` to `False`

so that NeMo Megatron's GPT works with sequence parallel enabled.

* brush up comment of `requires_grad()`

There's a possibility that PyTorch DistributedDataParallel hangs
when some tensor (or parameter) doesn't require grad according to @ptrblck.
This forced `requires_grad` in my understanding is different from that.

* misc changes of scatter_gather_tensors_in_pipeline comment

* guard for torch_ucc

* cosmetic changes related to tests

* update command line arguments

* update TransformerLanguageModel

* rename

* move gpt to gpt.py

* update bert

* add all_gather for params in sequence parallel region

* misc. some diffs were lost during rebasing...

* updates for non sequence parallel execution

* gpt with sequence parallel

* Apply 2 suggestion(s) to 2 file(s)

* update tensor&pipeline parallel size

* why is `sequence_parallel_enabled` not supplied!? Did I mess up when rebasing?

* cosmetic fix

* correct key is sequence_parallel_enabled
petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 23, 2024
* Enable batchnorm NHWC for MIOpen

* cleanup

* test to compare NHWC MIOpen batchnorm with CPU

* fix 'use_miopen' condition for nhwc miopen

* fix includes

* use native nhwc batchnorm to verify miopen

* remove extra spaces

* remove empty lines

* set PYTORCH_MIOPEN_SUGGEST_NHWC=1 for all test_nn.py tests
jagadish-amd pushed a commit to jagadish-amd/pytorch that referenced this pull request Jan 29, 2025
======================================================

Enable NHWC batchnorm for miopen (pytorch#1400)

* Enable batchnorm NHWC for MIOpen

* cleanup

* test to compare NHWC MIOpen batchnorm with CPU

* fix 'use_miopen' condition for nhwc miopen

* fix includes

* use native nhwc batchnorm to verify miopen

* remove extra spaces

* remove empty lines

* set PYTORCH_MIOPEN_SUGGEST_NHWC=1 for all test_nn.py tests

change torch.equal() to self.assertEqual() while comparing NHWC and NCHW batchnorm output (pytorch#1600)

`self.assertTrue(torch.equal(out1, out2))` assumes a complete match,
but we have a slight difference (~1e-7) between fp32 NHWC and NCHW
batchnorm output.
`self.assertEqual(out1, out2)` allows for tolerance.

(cherry picked from commit a1e8b0e)
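The exact-vs-tolerant distinction in that commit can be illustrated with a stdlib sketch, with `math.isclose` standing in for `assertEqual`'s tolerant comparison. The ~1e-7 figure comes from the commit message above; the tolerance value here is illustrative.

```python
import math

# Two fp32 batchnorm outputs that differ by a float rounding error (~1e-7),
# as described above for NHWC vs NCHW layouts.
out_nchw = 0.5
out_nhwc = 0.5 + 1e-7

# Exact comparison (the torch.equal analogue) fails on the tiny difference...
print(out_nchw == out_nhwc)                                       # False
# ...while a tolerant comparison (the assertEqual analogue) accepts it.
print(math.isclose(out_nchw, out_nhwc, rel_tol=0, abs_tol=1e-5))  # True
```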
jagadish-amd pushed a commit to jagadish-amd/pytorch that referenced this pull request May 15, 2025
======================================================

Enable NHWC batchnorm for miopen (pytorch#1400)

* Enable batchnorm NHWC for MIOpen

(cherry picked from commit a1e8b0e)
(cherry picked from commit 8c39322)

* env var PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1 enables NHWC batchnorm
separately from convolution

(cherry picked from commit 398bd57)

* avoid redundant NHWC-NCHW-NHWC conversions for MiopenBatchNormBackward

(cherry picked from commit d966835)

* add test_batchnorm_train and test_batchnorm_inference (pytorch#2022)

* Enable NHWC batchnorm on MIOpen if ROCm >= 6.5 and the environment
variable `PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1` is defined.

The new batchnorm tests can compare MIOpen batchnorm results with the cpu or
native backends, or NHWC with NCHW memory layouts on the same backend.
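A minimal sketch of the gating described in that bullet, assuming the two conditions (ROCm version and opt-in env var) are simply ANDed. The helper name is illustrative; the real check lives in the ROCm PyTorch C++ sources.

```python
import os

# Illustrative gate: NHWC batchnorm goes through MIOpen only when ROCm >= 6.5
# and the user opted in via the environment variable.
def miopen_nhwc_batchnorm_enabled(rocm_version, env=os.environ):
    opted_in = env.get("PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM") == "1"
    return opted_in and rocm_version >= (6, 5)

print(miopen_nhwc_batchnorm_enabled((6, 5), {"PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM": "1"}))  # True
print(miopen_nhwc_batchnorm_enabled((6, 4), {"PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM": "1"}))  # False
print(miopen_nhwc_batchnorm_enabled((6, 5), {}))  # False: no opt-in
```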

Train:

```
test_batchnorm_train_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.035s)
test_batchnorm_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.007s)
test_batchnorm_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.054s)
test_batchnorm_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.007s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.006s)
test_batchnorm_train_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_train_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_train_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.004s)
```

Inference:

```
test_batchnorm_inference_NCHW_vs_cpu_float32 (__main__.TestNN) ... ok (0.027s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.005s)
test_batchnorm_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NCHW_vs_native_float32 (__main__.TestNN) ... ok (0.108s)
test_batchnorm_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_NCHW_float32 (__main__.TestNN) ... ok (0.019s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_bfloat16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_NCHW_mixed_float16 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_float32 (__main__.TestNN) ... ok (0.004s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_bfloat16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_cpu_mixed_float16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_float32 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_bfloat16 (__main__.TestNN) ... ok (0.003s)
test_batchnorm_inference_NHWC_vs_native_mixed_float16 (__main__.TestNN) ... ok (0.003s)
```

The `NCHW_vs_native_mixed_bfloat16` config was removed because it failed for `train` and passed for `inference`.

Tested on docker image `compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:15845_ubuntu22.04_py3.10_pytorch_rocm6.4_internal_testing_8190c80`

(cherry picked from commit fc899bf)
