Add worker and policy hooks for cross-tokenizer distillation flow #2

Closed
avenkateshha wants to merge 1 commit into xtoken/stack-pr2-loss from xtoken/stack-pr3-worker-policy

@avenkateshha

Owner

Extend LMPolicy and DTensorPolicyWorkerV2 with teacher-forward and cross-tokenizer state update paths while preserving the current worker architecture and IPC-based distillation flow.

Made-with: Cursor
@github-actions


This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale label Apr 27, 2026
@github-actions

github-actions Bot commented May 4, 2026


This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions Bot closed this May 4, 2026
avenkateshha added a commit that referenced this pull request May 23, 2026
PR NVIDIA-NeMo#2508 review (@RayenTian):

- #2: Fold data["sample_mask"] into the gold-loss path's valid-chunk
  mask (chunk_mask & sample_mask.bool().unsqueeze(-1)) so samples with
  loss_multiplier=0 stop contributing to KL-on-common, L1-on-uncommon,
  top-1 accuracy, and the returned valid-count. Mirrors the P-KL path.

- #3: Size both projection-matrix axes from the configured tokenizer
  vocabs (student + teacher), not max(observed_idx) + 1.
  CrossTokenizerDistillationLossConfig declares student_vocab_size and
  teacher_vocab_size; xtoken_distillation.setup() injects both at
  runtime from len(student_tokenizer) / len(teacher_tokenizer).
  get_sparse_projection_matrix now takes both as keyword-only args and
  clamps V_s / V_t up against the projection's observed maxima as a
  defensive fallback. Same-magnitude-int positional swap is guarded by
  the keyword-only signature.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
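
The two review fixes described in the commit message above can be sketched as follows. This is a minimal PyTorch illustration, not the code from the commit. The function names `fold_sample_mask` and `sketch_projection_matrix`, the tensor shapes, and the choice of student ids as rows and teacher ids as columns are all assumptions made for the example. Only the mask expression and the keyword-only `student_vocab_size` / `teacher_vocab_size` arguments come from the message.

```python
import torch


def fold_sample_mask(chunk_mask: torch.Tensor, sample_mask: torch.Tensor) -> torch.Tensor:
    """Fix #2 (sketch): drop every chunk that belongs to a masked-out sample.

    Assumed shapes: chunk_mask is a bool tensor of shape (batch, num_chunks);
    sample_mask has shape (batch,) and is 0 for samples with loss_multiplier=0.
    """
    # unsqueeze(-1) turns (batch,) into (batch, 1) so it broadcasts over chunks.
    return chunk_mask & sample_mask.bool().unsqueeze(-1)


def sketch_projection_matrix(
    student_ids: torch.Tensor,
    teacher_ids: torch.Tensor,
    weights: torch.Tensor,
    *,  # everything after this marker must be passed by keyword
    student_vocab_size: int,
    teacher_vocab_size: int,
) -> torch.Tensor:
    """Fix #3 (sketch): size both axes from the configured vocab sizes.

    Hypothetical stand-in for get_sparse_projection_matrix. Assumes non-empty
    1-D integer id tensors of equal length, with student ids indexing rows.
    """
    # Defensive fallback: never size an axis below the largest observed index.
    v_s = max(student_vocab_size, int(student_ids.max()) + 1)
    v_t = max(teacher_vocab_size, int(teacher_ids.max()) + 1)
    indices = torch.stack([student_ids, teacher_ids])  # shape (2, nnz)
    return torch.sparse_coo_tensor(indices, weights, size=(v_s, v_t))


# Sample 1 has sample_mask == 0, so all of its chunks become invalid.
chunk_mask = torch.tensor([[True, True, False], [True, True, True]])
sample_mask = torch.tensor([1.0, 0.0])
valid = fold_sample_mask(chunk_mask, sample_mask)
# valid.tolist() -> [[True, True, False], [False, False, False]]

# The largest observed ids are 2 and 3, but the axes follow the vocab sizes.
proj = sketch_projection_matrix(
    torch.tensor([0, 2]),
    torch.tensor([1, 3]),
    torch.tensor([1.0, 0.5]),
    student_vocab_size=8,
    teacher_vocab_size=6,
)
# tuple(proj.shape) -> (8, 6)
```

Because both sizes sit after the bare `*`, a call that passes them positionally raises `TypeError` instead of silently swapping two integers of similar magnitude, which is the guard the commit message refers to.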
avenkateshha added a commit that referenced this pull request May 27, 2026
avenkateshha added a commit that referenced this pull request Jun 4, 2026
avenkateshha added a commit that referenced this pull request Jun 7, 2026