
[DO NOT MERGE] Hacked cross-iteration grouping #1731

Closed
naoyam wants to merge 1 commit into devel from test_group_across_iteration

Conversation


naoyam commented May 25, 2022

This is just documenting performance results of cross-iteration grouping of grid reductions.

kernel.cu was generated from the BN-like test added in #1729. The iteration domain is vectorized by 2, so the grid reduction call looks like this:

  #pragma unroll
  for(nvfuser_index_t i317 = 0; i317 < 2; ++i317) {
    // Allocate global tensor T20
    // Allocate global tensor T21
    // Allocate global tensor T22
    T5_reduction.reduceGroup(
      RefTuple<float>(T5[i317]),
      ConstRefTuple<float>(T19[i317]),
      VolatilePtrTuple<float>(&T20[0]),
      LocalTuple<float>(0),
      [](float &a, float b) { a = a + b; },
      RefTuple<float>(T9[i317]),
      ConstRefTuple<float>(T18[i317]),
      VolatilePtrTuple<float>(&T21[0]),
      LocalTuple<float>(0),
      [](float &a, float b) { a = a + b; },
      &T22[0],
      shared_mem,
      true,
      true);
  }

So two separate grid reductions are already grouped horizontally within each iteration.

In kernel.cu, I hacked in a two-way cross-iteration grouping, which changes the above code to:

  {
    constexpr int i317 = 0;
    T5_reduction.reduceGroup(
        RefTuple<float>(T5[i317]),
        ConstRefTuple<float>(T19[i317]),
        VolatilePtrTuple<float>(&T20[0]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },
        RefTuple<float>(T9[i317]),
        ConstRefTuple<float>(T18[i317]),
        VolatilePtrTuple<float>(&T21[0]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },

        RefTuple<float>(T5[i317 + 1]),
        ConstRefTuple<float>(T19[i317 + 1]),
        VolatilePtrTuple<float>(&T20[3136]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },

        RefTuple<float>(T9[i317 + 1]),
        ConstRefTuple<float>(T18[i317 + 1]),
        VolatilePtrTuple<float>(&T21[3136]),
        LocalTuple<float>(0),
        [](float &a, float b) { a = a + b; },

        &T22[0],
        shared_mem,
        true,
        true,
        true,
        true);
  }

Overall, this version does 4 grid reductions with just one grid sync.

The performance numbers are:

Launch Parameters: BlockDim.x = 16, BlockDim.y = 8, BlockDim.z = -1, GridDim.x = 1, GridDim.y = 196, GridDim.z = -1, Smem Size = 512
Original: 46 us
Two-way horizontal grouping: 36 us
Two-way horizontal and two-way cross-iteration grouping: 29 us

So, for problems where the cost of grid reductions is significant, the grouping optimization seems quite effective.
