
[DO NOT MERGE] Hacked cross-iteration grouping #1731

Closed
naoyam wants to merge 1 commit into devel from test_group_across_iteration

Conversation


naoyam commented May 25, 2022

This is just documenting performance results of cross-iteration grouping of grid reductions.

kernel.cu was generated from the BN-like test added in #1729. The iteration domain is vectorized by 2, so the grid reduction call looks like this:

  #pragma unroll
  for(nvfuser_index_t i317 = 0; i317 < 2; ++i317) {
    // Allocate global tensor T20
    // Allocate global tensor T21
    // Allocate global tensor T22
    T5_reduction.reduceGroup(
      RefTuple<float>(T5[i317]),
      ConstRefTuple<float>(T19[i317]),
      VolatilePtrTuple<float>(&T20[0]),
      LocalTuple<float>(0),
      [](float &a, float b) { a = a + b; },
      RefTuple<float>(T9[i317]),
      ConstRefTuple<float>(T18[i317]),
      VolatilePtrTuple<float>(&T21[0]),
      LocalTuple<float>(0),
      [](float &a, float b) { a = a + b; },
      &T22[0],
      shared_mem,
      true,
      true);
  }

So two separate grid reductions are already grouped horizontally within each iteration.

In kernel.cu, I hacked in a two-way cross-iteration grouping, which changes the above code to:

  {
    constexpr int i317 = 0;
    T5_reduction.reduceGroup(
        RefTuple<float>(T5[i317]),
        ConstRefTuple<float>(T19[i317]),
        VolatilePtrTuple<float>(&T20[0]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },
        RefTuple<float>(T9[i317]),
        ConstRefTuple<float>(T18[i317]),
        VolatilePtrTuple<float>(&T21[0]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },

        RefTuple<float>(T5[i317 + 1]),
        ConstRefTuple<float>(T19[i317 + 1]),
        VolatilePtrTuple<float>(&T20[3136]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },

        RefTuple<float>(T9[i317 + 1]),
        ConstRefTuple<float>(T18[i317 + 1]),
        VolatilePtrTuple<float>(&T21[3136]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },

        &T22[0],
        shared_mem,
        true,
        true,
        true,
        true);
  }

Overall, this version does 4 grid reductions with just one grid sync.

The performance numbers are:

Launch Parameters: BlockDim.x = 16, BlockDim.y = 8, BlockDim.z = -1, GridDim.x = 1, GridDim.y = 196, GridDim.z = -1, Smem Size = 512
Original: 46 us
Two-way horizontal grouping: 36 us
Two-way horizontal and two-way cross-iteration grouping: 29 us

So, for problems where the cost of grid reductions is significant, the grouping optimization seems quite effective.
