Add Permute/Unpermute Fusion Code Path#588
Conversation
Eliasthunderdog
left a comment
There was a problem hiding this comment.
I have some concerns about remote rank visibility. Please check the lines below in the comments.
| uint32_t* last_chunk_flag_addr = intra_node_expert_output_chunk_flags[remote_rank_id] + last_chunk_global_chunk_id; | ||
| // Need a strong system-scope red to make sure the target ranks can observe the update of the flag, | ||
| // Notify last chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
There was a problem hiding this comment.
I think we need to use "red.release" here to guarantee the chunk is visible when observing the flag update? Although cp_async_bulk_wait_group within the cache-coherent NVL can implicitly guarantee the ordering, I think we need a semantically correct way to do this.
There was a problem hiding this comment.
Hi, thanks for the comment.
Yes, we need a release semantic here to make sure the flag is observed after the data chunk.
In fact, we already have it, you can see we explicitly call fence.release.sys; at Ln 1468 before we start notifying any flag, this instruction is an explicit release semantic memory fence, so it guarantee all the red.relaxed happens after the cp_async_bulk_wait_group(i.e. all notification happens after the data chunk is written to the global location) in system-scoped memory order. So it has the same effect as calling red.release here.
BTW, calling a red.release instruction or any memory intructions with .release semantics(such as st.release etc.) will be compiled into a sequence of fence.release.scope + mop.relaxed.scope(scope can be sys ,gpu etc. and mop can be red, st etc.), so there is NO difference by using fence.release.sys+red.relaxed.sys vs red.release.sys. As you can see the red.relaxed.sys here is within a loop, so if you are using red.release.sys here, then you will call fence.release.sys+red.relaxed.sys multiple time along with the loop which will kill all performance since fence.release.sys is a very very expensive operation and you want to call it as less time as possible. So instead of calling fence.release.sys+red.relaxed.sys within a loop, we will call fence.release.sys once then call red.relaxed.sys within a loop.
Even if we only call fence.release.sys once, it is still a very expensive operation, and according to our test, it will kill 50-100 GB/s NVLink bandwidth on Blackwell platform, so we want to avoid it whenever possible. According to our test, if the communcaition and sychronization happens only within a NVLink domain(i.e. no RDMA communication), w/o this memory fence can still get the corrent result. But if PCIe device such as RDMA NIC come into the communication, the result will corrupt w/o this memory fence. So, that's why we call this fence only when RDMA is enabled(i.e. HYBRID_EP_BUILD_MULTINODE_ENABLE is defined). Yes, for a well-defined memory behavior, we need the fence.release.sys here unconditionally, but as we are pursuing performance on NVLink system, we currently implement a correct(at least under our test), fast but undefined behavior here for NVLink scenario.
| uint32_t* current_chunk_flag_addr = intra_node_expert_output_chunk_flags[remote_rank_id] + global_chunk_id; | ||
| // Need a strong system-scope red to make sure the target ranks can observe the update of the flag, | ||
| // Notify last chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
| : "l"(__cvta_generic_to_global(last_chunk_flag_addr)), "n"(1) | ||
| : "memory"); | ||
| // Notify current chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
| uint32_t* last_chunk_flag_addr = intra_node_expert_output_chunk_flags[remote_rank_id] + last_chunk_global_chunk_id; | ||
| // Need a strong system-scope red to make sure the target ranks can observe the update of the flag, | ||
| // Notify last chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
| do{ | ||
| intra_node_chunk_flag = 0; | ||
| // Need a strong system-scope load to observe peer ranks' Atomic result. | ||
| asm volatile("ld.relaxed.sys.global.u32 %0, [%1];" |
There was a problem hiding this comment.
Yes, we need a acquire semantic here to make sure the flag is observed before consuming the data chunk.
However, on current GPU arch, the acquire semantic will be compiled to mop.relaxed.scope + L1 dirty cache line flush(so no acquire fence actually. Here, the ld.acquire.sys will be compiled to ld.relaxed.sys + L1 dirty cache line flush), this will make sure the later memory operation after the flag polling will not accidentally load the stale data from L1 cache. This alone will make sure that if a memory operation can be observed when a flag polling(i.e. the ld.relaxed.sys) success, it can be observed by later memory operations.
Since the data chunk is totally consumed by TMA instructions after polling the flag, and TMA instrcutions bypass L1 cache entirely, so we actually don't need the L1 dirty cache line flush after the ld.relaxed.sys, so that's why we use ld.relaxed.sys instead of ld.acquire.sys here.
Again, the L1 dirty cache line flush is very very expensive, it may take thousands of cycles to finish. We want to avoid it at all cost. So, we currently use a arch-specific implementation here, not a well-defined implementation.
| uint32_t* current_chunk_flag_addr = intra_node_expert_input_chunk_flags[current_rank_id] + current_flag_id; | ||
| // Need a strong system-scope red to make sure all ranks can observe the update of the flag, | ||
| // Notify last chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
| : "l"(__cvta_generic_to_global(last_chunk_flag_addr)), "n"(1) | ||
| : "memory"); | ||
| // Notify current chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
| uint32_t* last_chunk_flag_addr = intra_node_expert_input_chunk_flags[last_chunk_rank_id] + last_chunk_global_chunk_id; | ||
| // Need a strong system-scope red to make sure all ranks can observe the update of the flag, | ||
| // Notify last chunk. | ||
| asm volatile("red.relaxed.sys.global.add.u32 [%0], %1;" |
| do{ | ||
| intra_node_chunk_flag = 0; | ||
| // Need a strong system-scope load to observe peer ranks' Atomic result. | ||
| asm volatile("ld.relaxed.sys.global.u32 %0, [%1];" |
| do{ | ||
| intra_node_chunk_flag = 0; | ||
| // Need a strong system-scope load to observe peer ranks' Atomic result. | ||
| asm volatile("ld.relaxed.sys.global.u32 %0, [%1];" |
|
Hi @Autumn1998, could you share any performance comparison results with the target branch? |
Hybrid_EP_NVL72.md I just submitted a bug fix that will cost us a slight performance regression. We haven't had time to update the detailed benchmarks yet, but to avoid blocking subsequent nixl PRs, I suggest this PR can be prioritized for merging. |
why combine kernel performance reduce after merge this PR:before: after before current PR: after current PR: test command: combine_performance_test_hybrid_ep.py |
This may be caused by the default values set in this PR: NUM_OF_TOKENS_PER_CHUNK_DISPATCH_API=64 and NUM_OF_TOKENS_PER_CHUNK_COMBINE_API=64. While 128 is use before. 128 chunk-size might yield better performance for the non-fused case, we set it to 64 because it keeps performance acceptable for both the fused and non-fused case |
|
I've noticed that the hybrid-ep project has increasingly more configurable parameters, making it challenging to tune to the optimal settings. Currently, it seems that setting the configuration to 128 would be a better option! currently the performance is normal: test command: before default args: test command: |
|
Although the performance of the |
Uh oh!
There was an error while loading. Please reload this page.