[determinism] [feature] DSV4 Determinism Kernel Level Optimization #3538

@ZhiyuLi-Nvidia

Description

User problem

Following the DeepSeek-V4 paper, track, implement, and benchmark the optimized deterministic kernels:

  • Attention backward: accumulate into independent buffers, then perform a global deterministic summation
  • MoE backward: pre-process token order within each rank and isolate buffers across ranks
  • mHC: write out each split part separately, then perform a deterministic reduction in a subsequent kernel
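
The items above share one pattern: avoid accumulating into a shared location in a schedule-dependent order, and instead give each split its own buffer and reduce the buffers in a fixed order. Below is a minimal pure-Python sketch of that idea, not the actual kernel design. The function names (`accumulate_in_order`, `split_then_reduce`) and the toy inputs are hypothetical and only illustrate why floating-point accumulation order matters.

```python
def accumulate_in_order(contribs, order):
    """Simulate atomic adds into one shared accumulator.

    `order` stands in for the hardware schedule: it decides which
    contribution lands first, so the rounding can differ run to run.
    """
    acc = 0.0
    for i in order:
        acc += contribs[i]
    return acc


def split_then_reduce(contribs, num_splits, finish_order):
    """Each split owns a private buffer; a second pass reduces them.

    `finish_order` is the (arbitrary) order in which splits complete.
    It does not affect the result, because every split sums its own
    slice in index order and the final reduction walks the buffers
    in index order.
    """
    chunk = (len(contribs) + num_splits - 1) // num_splits
    buffers = [0.0] * num_splits
    for s in finish_order:
        acc = 0.0
        for x in contribs[s * chunk:(s + 1) * chunk]:
            acc += x
        buffers[s] = acc
    total = 0.0
    for s in range(num_splits):  # fixed reduction order
        total += buffers[s]
    return total


# Toy gradient contributions where float64 rounding is order-sensitive.
contribs = [1e16, 1.0, -1e16, 1.0]

# Shared accumulator: two schedules give two different answers.
print(accumulate_in_order(contribs, [0, 1, 2, 3]))
print(accumulate_in_order(contribs, [0, 2, 1, 3]))

# Private buffers plus fixed-order reduction: same bits for any schedule.
print(split_then_reduce(contribs, 2, [0, 1]))
print(split_then_reduce(contribs, 2, [1, 0]))
```

The sketch only guarantees bitwise reproducibility, not better accuracy: the fixed-order result is still subject to rounding, it is simply the same rounding every run. The cost is extra buffer memory and a second reduction pass, which is the overhead that benchmarking would need to quantify.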

Desired outcome

DeepSeek-V4 trains deterministically, with all of the optimized deterministic kernels above in place.

Alternatives considered

No response

Affected area

area:model

Urgency / use case

Blocking current work

Extra context

No response

Metadata

Labels

  • Determinism: To track the bugs/issues in deterministic training in Megatron-Bridge.
  • area:perf: Performance optimizations and benchmarking
  • feature: New capabilities, enhancements, or enablement work
  • tracking: Tracking issue for an ongoing project with smaller steps
