Add off-policy cross-tokenizer training algorithm wiring #3

Closed
avenkateshha wants to merge 1 commit into xtoken/stack-pr3-worker-policy from xtoken/stack-pr4-offpolicy-algo

@avenkateshha

Owner

Port the off-policy distillation training orchestration and the thin entry-script integration on top of the stacked loss/worker changes, using the current main-compatible flow.

Made-with: Cursor
@github-actions

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions Bot added the Stale label Apr 27, 2026
@github-actions

github-actions Bot commented May 4, 2026

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions Bot closed this May 4, 2026
avenkateshha added a commit that referenced this pull request May 20, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

- Rename _load_M -> _get_sparse_projection_matrix and
  _load_dense_projection -> _get_topk_projection (later removed in
  favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans
  from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a
  new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file;
  tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches
  in utils.get_sparse_projection_matrix / get_topk_projection. The
  driver no longer holds large CUDA tensors; each Ray worker fills
  its own cache on first loss call.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
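
The last bullet of the commit message above moves the projection-matrix caches from per-instance state to process-local state. A minimal sketch of that pattern, assuming a module-level dictionary keyed by file path. `_PROJECTION_CACHE`, `get_projection_cached`, and the `loader` callback are hypothetical names used for illustration, not the helpers in nemo_rl/algorithms/x_token/utils.py:

```python
import os
from typing import Any, Callable, Dict

# Module-level cache: each Python process (and so each worker process)
# gets its own independent copy of this dictionary.
_PROJECTION_CACHE: Dict[str, Any] = {}


def get_projection_cached(path: str, *, loader: Callable[[str], Any]) -> Any:
    """Return the projection for `path`, building it at most once per process."""
    key = os.path.abspath(path)
    if key not in _PROJECTION_CACHE:
        # First request in this process: build the object and remember it.
        _PROJECTION_CACHE[key] = loader(key)
    return _PROJECTION_CACHE[key]


# Usage: a stand-in loader that records how often it actually runs.
calls = []


def fake_loader(path: str) -> dict:
    calls.append(path)
    return {"source": path}


first = get_projection_cached("proj.pt", loader=fake_loader)
second = get_projection_cached("proj.pt", loader=fake_loader)
```

With this layout, an object that stores only the file path stays small when it is serialized and shipped to a worker; the expensive load happens lazily in whichever process first asks for it.
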
avenkateshha added a commit that referenced this pull request May 23, 2026
PR NVIDIA-NeMo#2508 review (@RayenTian):

- #2: Fold data["sample_mask"] into the gold-loss path's valid-chunk
  mask (chunk_mask & sample_mask.bool().unsqueeze(-1)) so samples with
  loss_multiplier=0 stop contributing to KL-on-common, L1-on-uncommon,
  top-1 accuracy, and the returned valid-count. Mirrors the P-KL path.

- #3: Size both projection-matrix axes from the configured tokenizer
  vocabs (student + teacher), not max(observed_idx) + 1.
  CrossTokenizerDistillationLossConfig declares student_vocab_size and
  teacher_vocab_size; xtoken_distillation.setup() injects both at
  runtime from len(student_tokenizer) / len(teacher_tokenizer).
  get_sparse_projection_matrix now takes both as keyword-only args and
  clamps V_s / V_t up against the projection's observed maxima as a
  defensive fallback. Same-magnitude-int positional swap is guarded by
  the keyword-only signature.
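
A minimal sketch of the sizing rule described in item #3, assuming the projection is available as (student_id, teacher_id, weight) triples. `resolve_projection_shape` and the triple layout are hypothetical; only the keyword-only arguments and the clamp-up fallback are taken from the commit message:

```python
from typing import Iterable, Tuple


def resolve_projection_shape(
    entries: Iterable[Tuple[int, int, float]],
    *,
    student_vocab_size: int,
    teacher_vocab_size: int,
) -> Tuple[int, int]:
    """Return (V_s, V_t): configured sizes, raised if the file indexes past them."""
    max_student = -1
    max_teacher = -1
    for student_id, teacher_id, _weight in entries:
        max_student = max(max_student, student_id)
        max_teacher = max(max_teacher, teacher_id)
    # Configured vocab sizes win unless the projection references a larger index.
    v_s = max(student_vocab_size, max_student + 1)
    v_t = max(teacher_vocab_size, max_teacher + 1)
    return v_s, v_t


entries = [(0, 3, 1.0), (5, 1, 0.5), (2, 9, 0.5)]
shape = resolve_projection_shape(
    entries, student_vocab_size=8, teacher_vocab_size=8
)
# shape == (8, 10): the teacher axis grows to cover observed index 9.
```

Because both sizes sit after the bare `*`, a call such as `resolve_projection_shape(entries, 8, 8)` raises `TypeError` instead of silently accepting two integers in the wrong order.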

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
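
The mask change in item #2 of the commit message above can be illustrated without tensors. This is a plain-Python stand-in for `chunk_mask & sample_mask.bool().unsqueeze(-1)`, assuming `chunk_mask` has shape [batch, chunks] and `sample_mask` has shape [batch]; `fold_sample_mask` is a hypothetical name:

```python
from typing import List


def fold_sample_mask(
    chunk_mask: List[List[bool]], sample_mask: List[float]
) -> List[List[bool]]:
    """Keep a chunk only if it is valid AND its sample has a nonzero mask."""
    return [
        [chunk_valid and bool(sample_weight) for chunk_valid in row]
        for row, sample_weight in zip(chunk_mask, sample_mask)
    ]


chunk_mask = [[True, True, False], [True, False, False]]
sample_mask = [1.0, 0.0]  # the second sample is masked out

combined = fold_sample_mask(chunk_mask, sample_mask)
valid_count = sum(sum(row) for row in combined)
# combined == [[True, True, False], [False, False, False]], valid_count == 2
```

Without the fold, the valid chunk in the second row would still count toward the loss terms and the valid-count even though its sample is masked out.
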
avenkateshha added a commit that referenced this pull request May 27, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

avenkateshha added a commit that referenced this pull request May 27, 2026
PR NVIDIA-NeMo#2508 review (@RayenTian):

avenkateshha added a commit that referenced this pull request Jun 4, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

avenkateshha added a commit that referenced this pull request Jun 4, 2026
PR NVIDIA-NeMo#2508 review (@RayenTian):

avenkateshha added a commit that referenced this pull request Jun 7, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

avenkateshha added a commit that referenced this pull request Jun 7, 2026
PR NVIDIA-NeMo#2508 review (@RayenTian):
