feat(training): return eval losses and add pure eval example by yaoyu-33 · Pull Request #4125 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-06-03T00:08:09Z

Summary

Keep the public path on the existing evaluate_and_print_results() helper: it still performs the same logging/printing side effects and now returns the averaged loss dictionary when evaluation completes.
Remove the separate evaluate_model() wrapper from this branch so there is no second eval entry point.
Add examples/evaluation/pure_eval.py, a runnable minimal Bridge example that builds a small GPT model, mock data iterator, and calls evaluate_and_print_results() directly.
Keep focused mock unit coverage for the existing helper's return-value behavior.

Usage

uv run python -m torch.distributed.run --nproc_per_node=2 examples/evaluation/pure_eval.py --tp-size 2

Programmatic callers can use the existing helper directly:

losses = evaluate_and_print_results(...)

The example uses the standard Bridge setup path to create the model, mock dataset iterator, and process groups, then calls evaluate_and_print_results(). Workflows that already have a Megatron model, ConfigContainer, and iterator can use the same existing helper call.

Tests / Validation

ruff check src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.py
ruff format --check src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.py
python3 -m py_compile src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.py
git diff --check origin/main...HEAD
pre-commit run --all-files
Internal 2-GPU validation on pushed head e1528cc:

uv run --no-sync python -m torch.distributed.run --nproc_per_node=2 \
  examples/evaluation/pure_eval.py --tp-size 2 --eval-iters 1 \
  --seq-length 64 --hidden-size 64 --num-attention-heads 4 \
  --global-batch-size 2 --micro-batch-size 1

Result: passed. The run initialized tensor model parallel size 2, evaluated one iteration, and returned {'lm loss': 10.390985488891602}.

Not run as passing local evidence:

uv run pre-commit run --all-files is blocked on this workstation before hooks run because the pinned nvidia-resiliency-ext==0.6.0 wheel is not available for the local Linux platform baseline.
Runtime pytest was not used as authoritative local evidence; runtime validation was done with the 2-GPU example smoke above.

Notes

The example intentionally keeps using the existing setup path, which still constructs optimizer/scheduler objects because current setup/config validation expects them. This PR does not refactor those broader requirements.

Fixes #2033

copy-pr-bot · 2026-06-03T00:08:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

claude · 2026-06-03T00:10:27Z

+def _attach_tokenizer_to_dataset_config(cfg: ConfigContainer, state: GlobalState) -> Any:
+    """Attach the lazily built tokenizer to dataset config pieces used by loaders."""
+    tokenizer = state.tokenizer
+    cfg.dataset.tokenizer = tokenizer


nit: _attach_tokenizer_to_dataset_config mutates cfg.dataset in-place (sets .tokenizer and possibly .token_dtype_code). This means calling evaluate_model has a persistent side-effect on the caller's ConfigContainer. If the caller later reuses cfg for a different purpose (e.g. training), the injected tokenizer and dtype fields stick around.

Consider documenting this mutation in the evaluate_model docstring, or copying/shallow-cloning the dataset config before mutating it.

claude · 2026-06-03T00:12:58Z

Light code review - source changes look correct. See inline comment on _attach_tokenizer_to_dataset_config re: in-place mutation. Missing tests: (1) timelimit path returns None, (2) empty model list ValueError, (3) no-iterator RuntimeError, (4) write_to_tensorboard=True alone. No perf tests impacted.

claude · 2026-06-03T00:14:10Z

test

claude · 2026-06-03T00:14:18Z

Light Code Review (detailed)

Source changes look correct. The new evaluate_model helper cleanly delegates to the existing evaluate / evaluate_and_print_results paths, and the return-value fix for evaluate_and_print_results (explicit return total_loss_dict + None guards) is a good improvement.

One note posted as an inline comment:

_attach_tokenizer_to_dataset_config mutates the caller cfg.dataset in-place. Worth documenting or defensively cloning.

Missing test coverage

The unit tests cover the happy paths well. A few gaps:

Timelimit path in evaluate_model - when evaluate() returns timelimit=True, evaluate_model should return None. No test exercises this branch (line 664-665).
_normalize_eval_model with empty list - should raise ValueError (line 515-516). Not tested.
_build_eval_data_iterator when all iterators are None - should raise RuntimeError (line 567). Not tested.
write_to_tensorboard=True alone - the routing condition is verbose or write_to_tensorboard (line 632), but the existing test only covers verbose=True. A test with write_to_tensorboard=True, verbose=False would confirm both triggers.

Of these, (1) is the most important - it is the only branch in evaluate_model that transforms a non-trivial return value and is currently untested.

Suggested test cases: No perf tests impacted.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

claude Bot reviewed Jun 3, 2026

View reviewed changes

yaoyu-33 added area:training Training loop, callbacks, and runtime integration feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer labels Jun 3, 2026

yaoyu-33 force-pushed the casey/issue2033-pure-eval-loop branch from 2c36cb6 to 2bc93f0 Compare June 3, 2026 23:50

feat(training): return eval losses from eval helper

e1528cc

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 force-pushed the casey/issue2033-pure-eval-loop branch from 2bc93f0 to e1528cc Compare June 3, 2026 23:56

yaoyu-33 changed the title ~~feat(training): add eval helper for existing models~~ feat(training): return eval losses and add pure eval example Jun 4, 2026

yaoyu-33 added the needs-more-tests Requires additional L0 and L1 test coverage before merge label Jun 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(training): return eval losses and add pure eval example#4125

feat(training): return eval losses and add pure eval example#4125
yaoyu-33 wants to merge 1 commit into
mainfrom
casey/issue2033-pure-eval-loop

yaoyu-33 commented Jun 3, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 3, 2026

Uh oh!

claude Bot Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yaoyu-33 commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Usage

Tests / Validation

Notes

Uh oh!

copy-pr-bot Bot commented Jun 3, 2026

Uh oh!

claude Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026

Uh oh!

claude Bot commented Jun 3, 2026

Light Code Review (detailed)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yaoyu-33 commented Jun 3, 2026 •

edited

Loading