Get a GEMM example with all bells and whistles#368

Merged
naoyam merged 15 commits into 20_8_18_devel from crazy_example
Sep 22, 2020

Conversation

@csarofeen
Owner

Changed one of the tests to be a GEMM example with a combination of compile-time and runtime tiling parameters, including symbolic values for both inter-CTA and intra-CTA reductions. A few experimental fixes went in that we need to vet more thoroughly.

Related issues:
#363
#364
#365
#366
#367

Comment thread test/cpp/jit/test_gpu.cpp
Comment on lines +5932 to +5938
tv6->reorder({
{2, -2},
{3, -1},
{4, 2},
{5, 3},
{6, 4},
});
Collaborator

@rdspring1 Sep 8, 2020

Maybe comment that rFactor moves the reduction axes to the inner-most dimensions and that this reorder returns tv6 to its previous mapping.
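The `{old_pos, new_pos}` pairs passed to `reorder` can be read as a partial permutation of axes. Below is a stand-alone C++ model of that semantics (hypothetical, not the fuser implementation), assuming negative positions wrap around and unmentioned axes keep their relative order in the remaining slots:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-alone model of a TensorView::reorder-style
// {old_pos, new_pos} map (not the real fuser code). Negative positions
// wrap, so -1 means the last axis; axes not mentioned in the map keep
// their relative order and fill the remaining slots left to right.
std::vector<std::string> reorder(
    const std::vector<std::string>& axes,
    const std::map<int, int>& old2new) {
  const int n = static_cast<int>(axes.size());
  auto wrap = [n](int i) { return i < 0 ? i + n : i; };
  std::vector<std::string> result(n);
  std::vector<bool> slot_taken(n, false), axis_used(n, false);
  // Place explicitly mapped axes first.
  for (const auto& [oldp, newp] : old2new) {
    result[wrap(newp)] = axes[wrap(oldp)];
    slot_taken[wrap(newp)] = true;
    axis_used[wrap(oldp)] = true;
  }
  // Fill the remaining slots with the untouched axes, in order.
  int slot = 0;
  for (int i = 0; i < n; ++i) {
    if (axis_used[i]) continue;
    while (slot_taken[slot]) ++slot;
    result[slot] = axes[i];
    slot_taken[slot] = true;
  }
  return result;
}
```

Under this model, the map in the diff (`{2,-2}, {3,-1}, {4,2}, {5,3}, {6,4}` on seven axes) sends axes 2 and 3 to the last two positions while pulling axes 4 through 6 forward.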

Comment thread test/cpp/jit/test_gpu.cpp Outdated
Comment on lines +5902 to +5903
// Sum the K-dim
TensorView* tv5 = sum(tv4, {1});
Collaborator

Comment that the K-dim becomes a reduction axis in tv5[M, rK, N].

Comment thread test/cpp/jit/test_gpu.cpp Outdated
{t0, t1, BSX},
torch::jit::fuser::cuda::LaunchParams(-1, -1, -1, BSX, -1, -1));
{t0, t1, 3, 4, 5},
torch::jit::fuser::cuda::LaunchParams(-1, -1, -1, -1, -1, -1));
Collaborator

Could we remove LaunchParams since it is completely inferred?

Comment thread test/cpp/jit/test_gpu.cpp Outdated
{t0, t1, 3, 4, 5},
torch::jit::fuser::cuda::LaunchParams(-1, -1, -1, -1, -1, -1));

at::Tensor aten_output = mul(t0.unsqueeze(2), t1.unsqueeze(0)).sum(1);
Collaborator

Maybe replace with at::Tensor aten_output = matmul(t0, t1);

Owner Author

I went back and forth on this. For the presentation I'm including this in, I think it's interesting to structure the reference computation similarly to how we structured the kernel.

Collaborator

@naoyam left a comment

The change regarding gridReduce looks good to me. Note that there is a bug in the use of shared memory inside the loop of the newly added test: a __syncthreads() is needed at the end of the loop body, otherwise the next iteration can overwrite the shared-memory tiles before all threads have finished reading them.
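The hazard can be modeled with an adversarial schedule. In this hypothetical single-threaded C++ sketch (standing in for the CUDA loop, not generated code), each "thread" writes its shared-memory slot, passes the mid-loop barrier, then reads the other thread's slot; without a barrier at the end of the body, a fast thread's next-iteration write can land before a slow thread's read:

```cpp
#include <cassert>
#include <vector>

// Hypothetical adversarial-schedule model of the write-after-read race.
// Two "threads" share smem[2]; each iteration writes its own slot, hits
// the mid-loop barrier, then reads the other slot. Returns what thread 1
// observed in each iteration.
std::vector<int> run(bool sync_at_loop_end) {
  int smem[2] = {-1, -1};
  std::vector<int> t1_reads;
  for (int i = 0; i < 2; ++i) {
    smem[0] = i;  // T0 writes its slot
    smem[1] = i;  // T1 writes its slot
    // --- mid-loop __syncthreads(): all writes visible before any read ---
    int t0_read = smem[1];  // T0 reads T1's slot
    if (!sync_at_loop_end && i + 1 < 2) {
      // No end-of-loop barrier: T0 races ahead and its next-iteration
      // write clobbers the tile before T1 has read it.
      smem[0] = i + 1;
    }
    t1_reads.push_back(smem[0]);  // T1 reads T0's slot -- may see i + 1
    (void)t0_read;
  }
  return t1_reads;
}
```

With the end-of-loop barrier, thread 1 observes the correct value for each iteration; without it, the first read is already corrupted by the next iteration's write.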

fl->body().push_back(new kir::Sync());
}

bool needs_sync = prev_needs_sync;
Collaborator

Is this supposed to declare a new variable? Or is it supposed to assign prev_needs_sync to the existing needs_sync?

void SyncInserter::handle(kir::ForLoop* fl) {
bool prev_needs_sync = needs_sync;
active_scope = fl;

Collaborator

Don't we need to reset needs_sync here?
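The two comments above point at the same save/reset/restore idiom for per-scope state in a recursive visitor. A hypothetical sketch (plain C++, not the real SyncInserter or kir types) of how that flag handling is usually written:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a kir::ForLoop scope.
struct Loop {
  std::vector<Loop*> body;
  bool touches_shared = false;
};

// Sketch of the save/reset/restore idiom: entering a nested scope should
// (1) save the caller's flag, (2) reset it for the inner scope, and
// (3) restore it on exit with plain assignment. Writing
// `bool needs_sync = prev_needs_sync;` at step (3) would only declare a
// shadowing local and leave the member untouched.
struct SyncInserter {
  bool needs_sync = false;
  int syncs_inserted = 0;

  void handle(Loop* fl) {
    bool prev_needs_sync = needs_sync;  // (1) save the outer scope's flag
    needs_sync = false;                 // (2) reset for this scope
    for (Loop* inner : fl->body)
      handle(inner);
    if (fl->touches_shared)
      needs_sync = true;
    if (needs_sync)
      ++syncs_inserted;                 // would push_back a kir::Sync here
    needs_sync = prev_needs_sync;       // (3) restore, not re-declare
  }
};
```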

for (auto inp : expr->inputs()) {
if (ir_utils::isTV(inp)) {
if (inp->as<TensorView>()->getMemoryType() == MemoryType::Shared) {
needs_sync = true;
Collaborator

Reading shared memory does not necessarily mean __syncthreads() is required. The __syncthreads() is needed only when there is a read-write dependency across threads.

This is a safe approach, but doesn't seem very efficient.

Collaborator

I was working on a PR to detect this Write-After-Read race condition. It could be used here to be more efficient.

Collaborator

@rdspring1 I'm cleaning up this PR. I'm thinking about dropping the syncthreads changes in this PR in favor of your #374. Does it sound good to you?

Collaborator

Yes.

This was referenced Sep 14, 2020
Naoya Maruyama and others added 5 commits September 15, 2020 15:37
…e finish reading before writing."

This reverts commit dffaa76.

Revert this in favor of #383
* Basic Write-After-Read (WAR) check to add __syncthreads to end of for-loop

* Enable Tiled GEMM example

* Check that IterDomain iterates from zero to some positive integer

Co-authored-by: Ryan Spring <rspring@nvidia.com>
@naoyam
Collaborator

naoyam commented Sep 22, 2020

@csarofeen The change in this PR is only about inserting thread predicates for expressions writing into shared memory. I tried to do a little cleanup at 3b63be6. The analysis didn't change, but it is now done only once for each expression, whereas it was done on demand in the original implementation, potentially multiple times redundantly.

Please let me know if this looks good to you.

@naoyam
Collaborator

naoyam commented Sep 22, 2020

This remaining change avoids redundant writes to broadcast tensors on shared memory. For example, in the SmemDynamicTiledGemm test, the inner-most loop looks like the following:

 for(size_t i11 = 0; i11 < (ceilDiv((ceilDiv(T0.size[1], i2)), i1)); ++i11) {
    if ((((((blockIdx.z * blockDim.z) + threadIdx.z) < T0.size[0]) && (((((i11 * blockDim.x) + threadIdx.x) * gridDim.x) + blockIdx.x) < T0.size[1])) && (threadIdx.y == 0))) {
      T2[(threadIdx.z * blockDim.x) + threadIdx.x]
         = T0[(((blockIdx.z * blockDim.z) + threadIdx.z) * T0.stride[0]) + (((((i11 * blockDim.x) + threadIdx.x) * gridDim.x) + blockIdx.x) * T0.stride[1])];
    }
    if ((((((((i11 * blockDim.x) + threadIdx.x) * gridDim.x) + blockIdx.x) < T1.size[0]) && (((blockIdx.y * 8) + threadIdx.y) < T1.size[1])) && (threadIdx.z == 0))) {
      T3[(threadIdx.x * 8) + threadIdx.y]
         = T1[(((((i11 * blockDim.x) + threadIdx.x) * gridDim.x) + blockIdx.x) * T1.stride[0]) + (((blockIdx.y * 8) + threadIdx.y) * T1.stride[1])];
    }
    __syncthreads();
    float T4[1];
    if ((((((blockIdx.z * blockDim.z) + threadIdx.z) < T0.size[0]) && (((((i11 * blockDim.x) + threadIdx.x) * gridDim.x) + blockIdx.x) < T0.size[1])) && (((blockIdx.y * 8) + threadIdx.y) < T1.size[1]))) {
      T4[0]
        = T2[(threadIdx.z * blockDim.x) + threadIdx.x]
        * T3[(threadIdx.x * 8) + threadIdx.y];
    }
    if ((((((blockIdx.z * blockDim.z) + threadIdx.z) < T0.size[0]) && (((((i11 * blockDim.x) + threadIdx.x) * gridDim.x) + blockIdx.x) < T0.size[1])) && (((blockIdx.y * 8) + threadIdx.y) < T1.size[1]))) {
      T6[0]
        = T6[0]
        + T4[0];
    }
    __syncthreads();
  }

Notice that writes to T2 and T3 are predicated with threadIdx.y == 0 and threadIdx.z == 0, respectively, in addition to the other bound-check predicates.
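The effect of those extra predicates can be modeled outside of CUDA. In this hypothetical C++ sketch (illustrative, not generated code), a tile indexed only by threadIdx.x is broadcast along threadIdx.y; counting writes per element shows that the threadIdx.y == 0 predicate leaves exactly one writer where unpredicated code would write blockDim.y times:

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the redundant-write problem. The shared-memory
// tile is indexed only by threadIdx.x, so every thread along threadIdx.y
// would store the same element. Returns how many times each element is
// written by a (bdimx x bdimy) block.
std::vector<int> writes_per_element(int bdimx, int bdimy, bool predicate_y0) {
  std::vector<int> count(bdimx, 0);
  for (int ty = 0; ty < bdimy; ++ty)
    for (int tx = 0; tx < bdimx; ++tx)
      if (!predicate_y0 || ty == 0)
        ++count[tx];  // models T2[tx] = ... (index does not depend on ty)
  return count;
}
```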

@jjsjann123
Collaborator

I'll put up an upstream PR once this is merged.

Owner Author

@csarofeen left a comment

LGTM

@naoyam merged commit 1c67154 into 20_8_18_devel Sep 22, 2020
@csarofeen deleted the crazy_example branch June 9, 2021 13:38
jjsjann123 pushed a commit that referenced this pull request May 3, 2022
Summary:
X-link: meta-pytorch/data#368

This PR aims to expose the right data-related API.

There are two more changes made in this PR to convert public API to private API:
`check_lambda_fn` -> `_check_lambda_fn`
`deprecation_warning` -> `_deprecation_warning`

Pull Request resolved: pytorch#76143

Reviewed By: albanD, NivekT

Differential Revision: D35798311

Pulled By: ejguan

fbshipit-source-id: b13fded5c88a533c706702fb2070c918c839dca4
(cherry picked from commit 0b534b8)