Skip to content

feat(training): return eval losses and add pure eval example#4125

Open
yaoyu-33 wants to merge 1 commit into
mainfrom
casey/issue2033-pure-eval-loop
Open

feat(training): return eval losses and add pure eval example#4125
yaoyu-33 wants to merge 1 commit into
mainfrom
casey/issue2033-pure-eval-loop

Conversation

@yaoyu-33

@yaoyu-33 yaoyu-33 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Keep the public path on the existing evaluate_and_print_results() helper: it still performs the same logging/printing side effects and now returns the averaged loss dictionary when evaluation completes.
  • Remove the separate evaluate_model() wrapper from this branch so there is no second eval entry point.
  • Add examples/evaluation/pure_eval.py, a runnable minimal Bridge example that builds a small GPT model, mock data iterator, and calls evaluate_and_print_results() directly.
  • Keep focused mock unit coverage for the existing helper's return-value behavior.

Usage

uv run python -m torch.distributed.run --nproc_per_node=2 examples/evaluation/pure_eval.py --tp-size 2

Programmatic callers can use the existing helper directly:

losses = evaluate_and_print_results(...)

The example uses the standard Bridge setup path to create the model, mock dataset iterator, and process groups, then calls evaluate_and_print_results(). Workflows that already have a Megatron model, ConfigContainer, and iterator can use the same existing helper call.

Tests / Validation

  • ruff check src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.py
  • ruff format --check src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.py
  • python3 -m py_compile src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.py
  • git diff --check origin/main...HEAD
  • pre-commit run --all-files
  • Internal 2-GPU validation on pushed head e1528cc:
uv run --no-sync python -m torch.distributed.run --nproc_per_node=2 \
  examples/evaluation/pure_eval.py --tp-size 2 --eval-iters 1 \
  --seq-length 64 --hidden-size 64 --num-attention-heads 4 \
  --global-batch-size 2 --micro-batch-size 1

Result: passed. The run initialized tensor model parallel size 2, evaluated one iteration, and returned {'lm loss': 10.390985488891602}.

Not run as passing local evidence:

  • uv run pre-commit run --all-files is blocked on this workstation before hooks run because the pinned nvidia-resiliency-ext==0.6.0 wheel is not available for the local Linux platform baseline.
  • Runtime pytest was not used as authoritative local evidence; runtime validation was done with the 2-GPU example smoke above.

Notes

  • The example intentionally keeps using the existing setup path, which still constructs optimizer/scheduler objects because current setup/config validation expects them. This PR does not refactor those broader requirements.

Fixes #2033

@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread src/megatron/bridge/training/eval.py Outdated
def _attach_tokenizer_to_dataset_config(cfg: ConfigContainer, state: GlobalState) -> Any:
"""Attach the lazily built tokenizer to dataset config pieces used by loaders."""
tokenizer = state.tokenizer
cfg.dataset.tokenizer = tokenizer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _attach_tokenizer_to_dataset_config mutates cfg.dataset in-place (sets .tokenizer and possibly .token_dtype_code). This means calling evaluate_model has a persistent side-effect on the caller's ConfigContainer. If the caller later reuses cfg for a different purpose (e.g. training), the injected tokenizer and dtype fields stick around.

Consider documenting this mutation in the evaluate_model docstring, or copying/shallow-cloning the dataset config before mutating it.

@claude

claude Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Light code review - source changes look correct. See inline comment on _attach_tokenizer_to_dataset_config re: in-place mutation. Missing tests: (1) timelimit path returns None, (2) empty model list ValueError, (3) no-iterator RuntimeError, (4) write_to_tensorboard=True alone. No perf tests impacted.

@claude

claude Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

test

@claude

claude Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Light Code Review (detailed)

Source changes look correct. The new evaluate_model helper cleanly delegates to the existing evaluate / evaluate_and_print_results paths, and the return-value fix for evaluate_and_print_results (explicit return total_loss_dict + None guards) is a good improvement.

One note posted as an inline comment:

  • _attach_tokenizer_to_dataset_config mutates the caller cfg.dataset in-place. Worth documenting or defensively cloning.

Missing test coverage

The unit tests cover the happy paths well. A few gaps:

  1. Timelimit path in evaluate_model - when evaluate() returns timelimit=True, evaluate_model should return None. No test exercises this branch (line 664-665).
  2. _normalize_eval_model with empty list - should raise ValueError (line 515-516). Not tested.
  3. _build_eval_data_iterator when all iterators are None - should raise RuntimeError (line 567). Not tested.
  4. write_to_tensorboard=True alone - the routing condition is verbose or write_to_tensorboard (line 632), but the existing test only covers verbose=True. A test with write_to_tensorboard=True, verbose=False would confirm both triggers.

Of these, (1) is the most important - it is the only branch in evaluate_model that transforms a non-trivial return value and is currently untested.


Suggested test cases: No perf tests impacted.

@yaoyu-33 yaoyu-33 added area:training Training loop, callbacks, and runtime integration feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer labels Jun 3, 2026
@yaoyu-33 yaoyu-33 force-pushed the casey/issue2033-pure-eval-loop branch from 2c36cb6 to 2bc93f0 Compare June 3, 2026 23:50
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33 yaoyu-33 force-pushed the casey/issue2033-pure-eval-loop branch from 2bc93f0 to e1528cc Compare June 3, 2026 23:56
@yaoyu-33 yaoyu-33 changed the title feat(training): add eval helper for existing models feat(training): return eval losses and add pure eval example Jun 4, 2026
@yaoyu-33 yaoyu-33 added the needs-more-tests Requires additional L0 and L1 test coverage before merge label Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:training Training loop, callbacks, and runtime integration feature New capabilities, enhancements, or enablement work needs-more-tests Requires additional L0 and L1 test coverage before merge needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support pure eval loop given a model and ConfigContainer

1 participant