feat(xtoken): support TP/CP/diff-DP sharded cross-tokenizer distillation #2745
Open
RayenTian wants to merge 2 commits into
/ok to test 040969c
/ok to test 17bec2f
Extend cross-tokenizer off-policy distillation to run with the student under tensor- and context-parallelism (and a data-parallel degree that may differ from the teacher's) on the automodel/DTensor policy worker.

- Teacher full-vocab logits are exported per rank and shipped to the student via CUDA IPC (FullLogitsPostProcessor), then reassembled on the consumer across its CP group for heterogeneous teacher/student TP/CP layouts.
- The loss runs TP/CP-aware: vocab-parallel log-softmax / argmax / projection, CP load-balanced -> contiguous re-layout, CP-aware next-token shift, and partial chunk-average + grad-preserving all-reduce. The generic collectives live in model_utils; the cross-tokenizer orchestration lives in x_token/loss_utils.
- TP/CP process groups are derived from the student logits' own device mesh instead of being threaded through the generic LossPostProcessor, so the SFT / GRPO / distillation LOGPROB paths keep using the DTensor branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ruit <ruit@nvidia.com>
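To make the "vocab-parallel log-softmax" step above concrete, here is a minimal single-process sketch in plain Python. It is not the code from this PR: the function names are hypothetical, and the two all-reduces (MAX, then SUM) that a real TP group would perform are simulated with ordinary `max` and `sum` over a list of per-rank shards.

```python
import math


def vocab_parallel_log_softmax(shards):
    """shards[r] holds the logits for rank r's slice of the vocabulary.

    Hypothetical helper for illustration; collectives are simulated.
    """
    # Step 1: each rank takes its local max; all-reduce(MAX) would share it.
    global_max = max(max(shard) for shard in shards)
    # Step 2: each rank sums exp(logit - max) locally; all-reduce(SUM) combines.
    global_sum = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in shards
    )
    log_z = global_max + math.log(global_sum)
    # Step 3: each rank normalizes only its own shard, so the full vocabulary
    # is never gathered onto a single rank.
    return [[x - log_z for x in shard] for shard in shards]


def reference_log_softmax(logits):
    """Unsharded log-softmax used as the ground truth."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def flatten(shards):
    return [x for shard in shards for x in shard]


full = [2.0, -1.0, 0.5, 3.0, 0.0, -2.5, 1.5, 0.25]
tp2 = [full[:4], full[4:]]                       # vocabulary split over 2 ranks
tp4 = [full[i:i + 2] for i in range(0, 8, 2)]    # vocabulary split over 4 ranks

expected = reference_log_softmax(full)
for layout in (tp2, tp4):
    got = flatten(vocab_parallel_log_softmax(layout))
    assert all(math.isclose(g, e, abs_tol=1e-12) for g, e in zip(got, expected))
```

The result is the same for the TP2 and TP4 splits, which is the property a sharded loss needs in order to be independent of the tensor-parallel degree.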
…on nightly

- Unit tests for the TP/CP/diff-DP setup grid and the vocab-/CP-parallel loss helpers (real 2-GPU NCCL actors), plus per-step IPC buffer release. Adds the x_token unit-test package __init__.py so the Ray actor FQN is importable.
- Nightly recipe distillation-xtoken-off-policy-qwen3-4b-to-llama3.2-1b-1n8g-dtensor-tp4cp2 (student TP4xCP2 <- teacher TP2xCP2) and its driver, wired into nightly.txt; guards sharded-loss parallelism-invariance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: ruit <ruit@nvidia.com>
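One of the CP-side steps named in these commits is the "load-balanced -> contiguous" re-layout. The sketch below shows that idea in plain Python under a stated assumption: that load balancing follows the common scheme of cutting the sequence into `2 * cp_size` chunks and giving rank `r` chunks `r` and `2 * cp_size - 1 - r`. The PR does not spell out its layout here, and both function names are hypothetical.

```python
def load_balanced_shard(tokens, cp_size):
    """Split a sequence across CP ranks in the assumed load-balanced layout."""
    n_chunks = 2 * cp_size
    assert len(tokens) % n_chunks == 0, "sequence must divide into 2*cp chunks"
    size = len(tokens) // n_chunks
    chunks = [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]
    # Rank r holds one early chunk and its mirrored late chunk.
    return [chunks[r] + chunks[n_chunks - 1 - r] for r in range(cp_size)]


def to_contiguous(shards):
    """Undo the layout above after (simulated) gathering of all rank shards."""
    half = len(shards[0]) // 2
    heads = [shard[:half] for shard in shards]            # chunks 0 .. cp-1
    tails = [shard[half:] for shard in reversed(shards)]  # chunks cp .. 2cp-1
    ordered = []
    for chunk in heads + tails:
        ordered.extend(chunk)
    return ordered


tokens = list(range(8))
shards = load_balanced_shard(tokens, cp_size=2)
assert shards == [[0, 1, 6, 7], [2, 3, 4, 5]]
assert to_contiguous(shards) == tokens
```

Once positions are contiguous again, a next-token shift can pair position `t` with the label at `t + 1` even when the two sit on different ranks in the load-balanced layout.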
/ok to test b02211f
What does this PR do?
Adds TP / CP / heterogeneous-DP sharding support to the cross-tokenizer (xtoken) off-policy distillation loss (P-KL + gold) on the DTensor v2 worker, fixes a context-parallel loss-invariance bug, and adds a parallelism-invariance nightly.
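As a generic illustration of what "loss invariance" under context parallelism means (not a description of the specific bug fixed here), the sketch below compares two ways of reducing a masked per-token loss across CP ranks. All names and numbers are hypothetical, and the all-reduce is simulated with a plain `sum` over ranks.

```python
def global_masked_mean(rank_losses, rank_masks):
    """Reduce partial sums and token counts, then divide once."""
    # Each rank contributes a partial loss sum and a valid-token count;
    # an all-reduce(SUM) over the CP group would combine both.
    total = sum(
        loss * mask
        for losses, masks in zip(rank_losses, rank_masks)
        for loss, mask in zip(losses, masks)
    )
    count = sum(sum(masks) for masks in rank_masks)
    return total / count


def mean_of_rank_means(rank_losses, rank_masks):
    """Average the per-rank means; wrong when valid-token counts differ."""
    means = [
        sum(l * m for l, m in zip(losses, masks)) / sum(masks)
        for losses, masks in zip(rank_losses, rank_masks)
    ]
    return sum(means) / len(means)


losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mask = [1, 1, 1, 1, 0, 0]  # last two positions are padding

cp1 = ([losses], [mask])
cp2 = ([losses[:3], losses[3:]], [mask[:3], mask[3:]])

assert global_masked_mean(*cp1) == 2.5
assert global_masked_mean(*cp2) == 2.5   # same value for CP1 and CP2
assert mean_of_rank_means(*cp2) == 3.0   # depends on how the sequence is split
```

A parallelism-invariance check of the kind the nightly describes would assert the first property: the loss value should not change when only the sharding layout changes.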
Issues
Closes #2682
Follows #2792 (make building the (student, teacher) projection matrix in-flight)
Result
Teacher : Qwen/Qwen3-4B TP2 CP2 DP2
Student : meta-llama/Llama-3.2-1B TP4 CP2 DP1
[Plots not captured: KL Loss, Gold Loss, XT Loss]
Usage
Before your PR is "Ready for review"
Pre checks:
Additional Information