feat(training): return eval losses and add pure eval example#4125
feat(training): return eval losses and add pure eval example#4125yaoyu-33 wants to merge 1 commit into
Conversation
| def _attach_tokenizer_to_dataset_config(cfg: ConfigContainer, state: GlobalState) -> Any: | ||
| """Attach the lazily built tokenizer to dataset config pieces used by loaders.""" | ||
| tokenizer = state.tokenizer | ||
| cfg.dataset.tokenizer = tokenizer |
There was a problem hiding this comment.
nit: _attach_tokenizer_to_dataset_config mutates cfg.dataset in-place (sets .tokenizer and possibly .token_dtype_code). This means calling evaluate_model has a persistent side-effect on the caller's ConfigContainer. If the caller later reuses cfg for a different purpose (e.g. training), the injected tokenizer and dtype fields stick around.
Consider documenting this mutation in the evaluate_model docstring, or copying/shallow-cloning the dataset config before mutating it.
|
Light code review - source changes look correct. See inline comment on _attach_tokenizer_to_dataset_config re: in-place mutation. Missing tests: (1) timelimit path returns None, (2) empty model list ValueError, (3) no-iterator RuntimeError, (4) write_to_tensorboard=True alone. No perf tests impacted. |
|
test |
Light Code Review (detailed)Source changes look correct. The new evaluate_model helper cleanly delegates to the existing evaluate / evaluate_and_print_results paths, and the return-value fix for evaluate_and_print_results (explicit return total_loss_dict + None guards) is a good improvement. One note posted as an inline comment:
Missing test coverage The unit tests cover the happy paths well. A few gaps:
Of these, (1) is the most important - it is the only branch in evaluate_model that transforms a non-trivial return value and is currently untested. Suggested test cases: No perf tests impacted. |
2c36cb6 to
2bc93f0
Compare
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
2bc93f0 to
e1528cc
Compare
Summary
evaluate_and_print_results()helper: it still performs the same logging/printing side effects and now returns the averaged loss dictionary when evaluation completes.evaluate_model()wrapper from this branch so there is no second eval entry point.examples/evaluation/pure_eval.py, a runnable minimal Bridge example that builds a small GPT model, mock data iterator, and callsevaluate_and_print_results()directly.Usage
Programmatic callers can use the existing helper directly:
The example uses the standard Bridge setup path to create the model, mock dataset iterator, and process groups, then calls
evaluate_and_print_results(). Workflows that already have a Megatron model,ConfigContainer, and iterator can use the same existing helper call.Tests / Validation
ruff check src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.pyruff format --check src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.pypython3 -m py_compile src/megatron/bridge/training/eval.py tests/unit_tests/training/test_eval.py examples/evaluation/pure_eval.pygit diff --check origin/main...HEADpre-commit run --all-filese1528cc:Result: passed. The run initialized tensor model parallel size 2, evaluated one iteration, and returned
{'lm loss': 10.390985488891602}.Not run as passing local evidence:
uv run pre-commit run --all-filesis blocked on this workstation before hooks run because the pinnednvidia-resiliency-ext==0.6.0wheel is not available for the local Linux platform baseline.Notes
Fixes #2033