[IR] Adding more support for dynamic shape on Task and FlowGraph level #220

Merged

yaoyaoding merged 2 commits into hidet-org:main from yaoyaoding:dynamic-graph on May 9, 2023
Conversation

@yaoyaoding (Member)

No description provided.

yaoyaoding added 2 commits May 9, 2023 00:58
@yaoyaoding yaoyaoding merged commit 00e91dd into hidet-org:main May 9, 2023
@yaoyaoding yaoyaoding deleted the dynamic-graph branch May 9, 2023 16:06
vadiklyutiy pushed a commit that referenced this pull request Jul 22, 2024
This PR adds support for vectorized epilogue fusion.
Previously, the operators belonging to the epilogue stage were fused at the scalar level.
For example, the code before fusion looks like this:
```python
for mi in range(mma_m):
    for ni in range(mma_n):
        global_m = mi * stride_m
        global_n = ni * stride_n
        if global_m < m and global_n < n:
             # assuming `regs_c` is a float16 register array, the following store always
             # compiles to a scalar STG.U16 instruction at the SASS level.
             # the if statement prevents nvcc from aggressively vectorizing the write-back
             # to global memory.
             gmem_c[global_m, global_n] = regs_c[mi, ni]
``` 
The code after fusion looks like
```python
for mi in range(mma_m):
    for ni in range(mma_n):
        global_m = mi * stride_m
        global_n = ni * stride_n
        if global_m < m and global_n < n:
             # x, y, z are fused tensors
             gmem_c[inverse_map(global_m, global_n)] = elementwise_op(regs_c[mi, ni], x, y, z...)
```
The fusion still emits scalar store instructions for the final write-back, which is inefficient. Although fusion brings some performance benefit, the penalty of the inefficient write-back can outweigh the benefit of fusion itself.
This PR fuses operators while vectorizing the memory accesses as much as possible. We developed a technique similar to the [epilogue visitor tree](https://dl.acm.org/doi/pdf/10.1145/3620666.3651369) in CUTLASS to address this problem. For details, refer to the attached [document](https://drive.google.com/drive/folders/1kXlu2k-lPmH0jPlQ_8CfmqZvAPVtQ3Qs).
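
To illustrate the intended effect, the sketch below shows what the vectorized write-back could look like. It follows the same pseudocode conventions as the snippets above; `vec_width` and `vectorized_store` are hypothetical placeholders for the generated code, and the n axis is assumed contiguous in global memory.
```python
vec_width = 8  # hypothetical: 8 consecutive f16 values form one 128-bit store

for mi in range(mma_m):
    for ni in range(0, mma_n, vec_width):
        global_m = mi * stride_m
        global_n = ni  # vectorization assumes the n axis is contiguous
        if global_m < m and global_n + vec_width <= n:
            # fused epilogue applied lane by lane in registers, followed by
            # a single wide store (e.g., one STG.128 instruction at SASS level)
            vec = [elementwise_op(regs_c[mi, ni + v], x, y, z) for v in range(vec_width)]
            vectorized_store(gmem_c, global_m, global_n, vec)
        else:
            # boundary tile: fall back to the original scalar write-back
            for v in range(vec_width):
                if global_m < m and global_n + v < n:
                    gmem_c[global_m, global_n + v] = elementwise_op(regs_c[mi, ni + v], x, y, z)
```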

## Changes
1. Restructure instruction selection as a compiler pass.
2. Support the vectorized epilogue fusion algorithm in the `apply_prologue_epilogue` pass.
3. Adapt the schedule template of `matmul_f16_pk` to enable this optimization.
4. Currently we have a very simple shared memory updater, and we should design a proper shared memory planner. A potential approach is described in [this document](https://dl.acm.org/doi/pdf/10.5555/314500.315082), the allocation strategy used by the shared memory allocator in Triton (see the sketch after this list).
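
As a rough sketch of what such a planner could do, the following hypothetical linear-scan-style allocator (in the spirit of the linked paper) assigns offsets from buffer live intervals so that buffers with disjoint lifetimes reuse the same shared memory. It is a sketch of the idea, not Hidet's or Triton's actual implementation.
```python
from typing import List, Tuple

def plan_shared_memory(buffers: List[Tuple[int, int, int]]) -> Tuple[List[int], int]:
    """Greedy first-fit planner over (start, end, size) live intervals.

    Buffers whose live ranges do not overlap may share the same offset,
    shrinking the total shared memory footprint.
    """
    order = sorted(range(len(buffers)), key=lambda i: buffers[i][0])
    offsets = [0] * len(buffers)
    placed: List[Tuple[int, int, int, int]] = []  # (offset, size, start, end)
    total = 0
    for i in order:
        start, end, size = buffers[i]
        # only buffers whose live ranges overlap ours constrain the placement
        live = sorted((o, s) for o, s, b, e in placed if not (e <= start or b >= end))
        offset = 0
        for o, s in live:
            if offset + size <= o:
                break  # the gap before this live buffer is large enough
            offset = max(offset, o + s)
        offsets[i] = offset
        placed.append((offset, size, start, end))
        total = max(total, offset + size)
    return offsets, total

# two buffers with disjoint lifetimes share offset 0; the third overlaps both
assert plan_shared_memory([(0, 2, 1024), (2, 4, 1024), (1, 3, 512)]) == ([0, 0, 1024], 1536)
```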

## Simple epilogue fusion
This experiment benchmarks the following graph, where `dims` is a permutation of the four axes (0, 1, 2, 3). We iterate over all possible permutations to show the generality of the fusion capability (a sketch of this loop follows the snippet below).
```python
def graph(a, b, d, bx1xn, bxmx1, mx1, x1xn):
    c = a @ b
    c = c + d
    c = c + bxmx1 + bx1xn + mx1 + x1xn
    L, M, N = c.shape
    c = c.reshape(L, M, N // head_size, head_size)
    c = c.permute(*dims)
    c = c.reshape(L * N // head_size, M, head_size)
    c = relu(c)
    return c
```
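
For reference, the loop over all 24 permutations can be written as below; `benchmark` and `head_size` are hypothetical stand-ins for the actual harness and configuration.
```python
import itertools

head_size = 64  # hypothetical; any value that divides N works

for dims in itertools.permutations(range(4)):  # all 24 axis orders
    benchmark(graph, dims)  # `benchmark` is a placeholder for the harness
```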

Performance comparison
| problem (m, n, k, l) | dims | max-autotune | hidet (after) | hidet (before) | speedup (after vs before) |
| -------------------- | ------------ | ------------ | ------------- | -------------- | ------------------------- |
| [512, 1536, 512, 16] | (0, 1, 2, 3) | 0.263 | 0.199 | 0.229 | 13.10% |
| [512, 1536, 512, 16] | (0, 1, 3, 2) | 0.25 | 0.218 | 0.27 | 19.26% |
| [512, 1536, 512, 16] | (0, 2, 1, 3) | 0.248 | 0.217 | 0.223 | 2.69% |
| [512, 1536, 512, 16] | (0, 2, 3, 1) | 0.374 | 0.219 | 0.224 | 2.23% |
| [512, 1536, 512, 16] | (0, 3, 1, 2) | 0.509 | 0.269 | 0.35 | 23.14% |
| [512, 1536, 512, 16] | (0, 3, 2, 1) | 0.278 | 0.226 | 0.225 | -0.44% |
| [512, 1536, 512, 16] | (1, 0, 2, 3) | 0.248 | 0.202 | 0.237 | 14.77% |
| [512, 1536, 512, 16] | (1, 0, 3, 2) | 0.251 | 0.222 | 0.269 | 17.47% |
| [512, 1536, 512, 16] | (1, 2, 0, 3) | 0.264 | 0.202 | 0.241 | 16.18% |
| [512, 1536, 512, 16] | (1, 2, 3, 0) | 0.247 | 0.275 | 0.274 | -0.36% |
| [512, 1536, 512, 16] | (1, 3, 0, 2) | 0.708 | 0.224 | 0.4 | 44.00% |
| [512, 1536, 512, 16] | (1, 3, 2, 0) | 0.368 | 0.352 | 0.435 | 19.08% |
| [512, 1536, 512, 16] | (2, 0, 1, 3) | 0.261 | 0.211 | 0.222 | 4.95% |
| [512, 1536, 512, 16] | (2, 0, 3, 1) | 0.477 | 0.221 | 0.225 | 1.78% |
| [512, 1536, 512, 16] | (2, 1, 0, 3) | 0.262 | 0.211 | 0.236 | 10.59% |
| [512, 1536, 512, 16] | (2, 1, 3, 0) | 0.268 | 0.289 | 0.288 | -0.35% |
| [512, 1536, 512, 16] | (2, 3, 0, 1) | 0.257 | 0.22 | 0.227 | 3.08% |
| [512, 1536, 512, 16] | (2, 3, 1, 0) | 0.251 | 0.373 | 0.355 | -5.07% |
| [512, 1536, 512, 16] | (3, 0, 1, 2) | 0.559 | 0.27 | 0.372 | 27.42% |
| [512, 1536, 512, 16] | (3, 0, 2, 1) | 0.376 | 0.225 | 0.223 | -0.90% |
| [512, 1536, 512, 16] | (3, 1, 0, 2) | 0.648 | 0.281 | 0.399 | 29.57% |
| [512, 1536, 512, 16] | (3, 1, 2, 0) | 0.624 | 0.352 | 0.474 | 25.74% |
| [512, 1536, 512, 16] | (3, 2, 0, 1) | 0.333 | 0.226 | 0.228 | 0.88% |
| [512, 1536, 512, 16] | (3, 2, 1, 0) | 0.35 | 0.354 | 0.355 | 0.28% |
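
The speedup column is computed from the two hidet columns as (before - after) / before, so lower run times yield positive speedups. For example, for dims (0, 1, 2, 3): (0.229 - 0.199) / 0.229 ≈ 13.10%.
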
## Model benchmark

Performance comparison
| model | inputs | eager | reduce-overhead | max-autotune | hidet (after) | hidet (before) | speedup (after vs before) |
| ------------------------ | ------------------ | ------ | --------------- | ------------ | ------------- | -------------- | ------------------------- |
| model/gpt2 | f16, bs=1, seq=128 | 2.043 | 0.557 | 0.555 | 0.7 | 0.7 | 0.00% |
| model/gpt2 | f16, bs=1, seq=512 | 2.01 | 1.26 | 1.112 | 1.204 | 1.297 | 7.17% |
| model/gpt2 | f16, bs=8, seq=128 | 2.759 | 1.97 | 1.684 | 1.635 | 1.892 | 13.58% |
| model/gpt2 | f16, bs=8, seq=512 | 13.622 | 7.915 | 6.633 | 7.014 | 8.101 | 13.42% |
| model/bert-base-uncased | f16, bs=1, seq=128 | 1.247 | 0.744 | 0.753 | 0.659 | 0.671 | 1.79% |
| model/bert-base-uncased | f16, bs=1, seq=512 | 1.498 | 1.346 | 1.394 | 1.187 | 1.211 | 1.98% |
| model/bert-base-uncased | f16, bs=8, seq=128 | 2.045 | 1.844 | 1.723 | 1.526 | 1.637 | 6.78% |
| model/bert-base-uncased | f16, bs=8, seq=512 | 7.115 | 7.205 | 6.239 | 5.597 | 5.933 | 5.66% |
| model/bert-large-uncased | f16, bs=1, seq=128 | 2.342 | 1.835 | 1.974 | 1.515 | 1.496 | -1.27% |
| model/bert-large-uncased | f16, bs=1, seq=512 | 3.586 | 3.367 | 3.63 | 2.932 | 3.108 | 5.66% |
| model/bert-large-uncased | f16, bs=8, seq=128 | 5.385 | 4.913 | 4.789 | 4.451 | 4.7 | 5.30% |
| model/bert-large-uncased | f16, bs=8, seq=512 | 19.817 | 19.994 | 19.693 | 16.324 | 17.269 | 5.47% |

## Regression Pipeline
A10


![operator](https://github.com/CentML/hidet/assets/16316020/71ab9c61-614b-481f-b3d0-2c2478efd2ac)


![model](https://github.com/CentML/hidet/assets/16316020/dd77b622-5e46-4152-8d83-848a2bf7c1cc)

This PR only enables the optimization for `matmul_f16`, so models that do not use this operator will not see performance gains.

## TODO
- [x] epilogue fusion for dynamic shape
- [x] remove rearrange operator
- [x] fallback support

---------

Co-authored-by: xiaocenxiaocen <xiao.zhang@centml.ai>
vadiklyutiy pushed a commit that referenced this pull request Jul 23, 2024
vadiklyutiy pushed a commit that referenced this pull request Dec 26, 2024