Add off-policy cross-tokenizer training algorithm wiring by avenkateshha · Pull Request #3 · avenkateshha/RL

avenkateshha · 2026-04-12T20:24:52Z

Port the off-policy distillation training orchestration and the thin entry-script integration on top of the stacked loss/worker changes, using the current main-compatible flow.

github-actions · 2026-04-27T04:52:14Z

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

github-actions · 2026-05-04T05:01:03Z

This PR was closed because it has been inactive for 7 days since being marked as stale.

Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

- Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
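The move from per-instance caches to process-local caches can be illustrated with a minimal sketch. This is not the code in nemo_rl/algorithms/x_token/utils.py: the names `_PROJECTION_CACHE` and `get_cached_projection` and the `loader` callback are hypothetical, and a plain Python object stands in for the projection tensor. It only shows the idea that a module-level cache is filled lazily, once per process, on first use.

```python
import os
from typing import Callable, Dict, Tuple

# Hypothetical module-level cache (an assumption, not the PR's actual code).
# It lives in the importing process, so a driver that never calls the loss
# never populates it, and each worker process fills its own copy.
_PROJECTION_CACHE: Dict[Tuple[str, int], object] = {}


def get_cached_projection(path: str, loader: Callable[[str], object]) -> object:
    """Return the projection for `path`, calling `loader` at most once per process."""
    # Including the PID keeps a forked child from reusing an entry that was
    # built in its parent process.
    key = (path, os.getpid())
    if key not in _PROJECTION_CACHE:
        _PROJECTION_CACHE[key] = loader(path)
    return _PROJECTION_CACHE[key]
```

With this shape, the first loss call in a worker pays the loading cost and later calls reuse the same object; nothing large needs to be serialized from the driver.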

@RayenTian

PR NVIDIA-NeMo#2508 review (@RayenTian):

- #2: Fold data["sample_mask"] into the gold-loss path's valid-chunk mask (chunk_mask & sample_mask.bool().unsqueeze(-1)) so samples with loss_multiplier=0 stop contributing to KL-on-common, L1-on-uncommon, top-1 accuracy, and the returned valid-count. Mirrors the P-KL path.
- #3: Size both projection-matrix axes from the configured tokenizer vocabs (student + teacher), not max(observed_idx) + 1. CrossTokenizerDistillationLossConfig declares student_vocab_size and teacher_vocab_size; xtoken_distillation.setup() injects both at runtime from len(student_tokenizer) / len(teacher_tokenizer). get_sparse_projection_matrix now takes both as keyword-only args and clamps V_s / V_t up against the projection's observed maxima as a defensive fallback. Same-magnitude-int positional swap is guarded by the keyword-only signature.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
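The two review fixes can be sketched in plain Python. Both helpers below are hypothetical illustrations, not the PR's functions: `fold_sample_mask` is a list-based analogue of the tensor expression chunk_mask & sample_mask.bool().unsqueeze(-1), and `resolve_projection_shape` assumes that "clamping up" means taking the larger of the configured vocab size and the observed maximum index plus one.

```python
from typing import List, Sequence, Tuple


def fold_sample_mask(
    chunk_mask: Sequence[Sequence[bool]],
    sample_mask: Sequence[float],
) -> List[List[bool]]:
    """List-based analogue of chunk_mask & sample_mask.bool().unsqueeze(-1).

    A sample whose mask value is 0 has every one of its chunks disabled.
    """
    return [
        [bool(chunk) and bool(sample) for chunk in row]
        for row, sample in zip(chunk_mask, sample_mask)
    ]


def resolve_projection_shape(
    observed_max_student: int,
    observed_max_teacher: int,
    *,  # vocab sizes are keyword-only, so they cannot be swapped positionally
    student_vocab_size: int,
    teacher_vocab_size: int,
) -> Tuple[int, int]:
    """Size both axes from the configured vocabs, never below the observed indices."""
    v_s = max(student_vocab_size, observed_max_student + 1)
    v_t = max(teacher_vocab_size, observed_max_teacher + 1)
    return v_s, v_t
```

Because the two vocab sizes are integers of similar magnitude, a positional swap would not be caught by a type or range check; requiring keywords makes the call site state which is which.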

add off-policy cross-tokenizer training algorithm wiring

a27f2b1

Port the off-policy distillation training orchestration and thin entry script integration on top of stacked loss/worker changes using current main-compatible flow. Made-with: Cursor

github-actions Bot added the Stale label Apr 27, 2026

github-actions Bot closed this May 4, 2026

avenkateshha wants to merge 1 commit into xtoken/stack-pr3-worker-policy from xtoken/stack-pr4-offpolicy-algo
