[train] Add hf trainer support for dictionary of datasets #56484
matthewdeng merged 1 commit into ray-project:master
Conversation
Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com>
Code Review
This pull request adds support for using a dictionary of Ray Datasets for evaluation in the HuggingFace Trainer integration. The changes look good and are well-tested. I've found a potential bug in _transformers_utils.py that could lead to a KeyError and suggested a fix that also improves code clarity. I've also left a couple of comments on the test files to suggest refactoring to reduce code duplication and improve maintainability.
    if (
        isinstance(eval_dataset, str)
        and isinstance(self.eval_dataset, dict)
        and isinstance(self.eval_dataset[eval_dataset], _IterableFromIterator)
    ):
        dataset = RayTorchIterableDataset(self.eval_dataset[eval_dataset])
        return DataLoader(dataset, batch_size=1, collate_fn=lambda x: x[0])
    elif isinstance(eval_dataset, _IterableFromIterator):
        dataset = RayTorchIterableDataset(eval_dataset)
        return DataLoader(dataset, batch_size=1, collate_fn=lambda x: x[0])
The current implementation can raise a KeyError if eval_dataset is a string but not a valid key in self.eval_dataset: the expression self.eval_dataset[eval_dataset] is evaluated before the isinstance check can guard it. It's safer to check for key existence first.
Additionally, there's some code duplication in creating the DataLoader. I suggest refactoring to address both issues, which also improves readability.
Suggested change:
-    if (
-        isinstance(eval_dataset, str)
-        and isinstance(self.eval_dataset, dict)
-        and isinstance(self.eval_dataset[eval_dataset], _IterableFromIterator)
-    ):
-        dataset = RayTorchIterableDataset(self.eval_dataset[eval_dataset])
-        return DataLoader(dataset, batch_size=1, collate_fn=lambda x: x[0])
-    elif isinstance(eval_dataset, _IterableFromIterator):
-        dataset = RayTorchIterableDataset(eval_dataset)
-        return DataLoader(dataset, batch_size=1, collate_fn=lambda x: x[0])
+    iterable_dataset = None
+    if (
+        isinstance(eval_dataset, str)
+        and isinstance(self.eval_dataset, dict)
+        and eval_dataset in self.eval_dataset
+    ):
+        iterable_dataset = self.eval_dataset[eval_dataset]
+    elif isinstance(eval_dataset, _IterableFromIterator):
+        iterable_dataset = eval_dataset
+    if isinstance(iterable_dataset, _IterableFromIterator):
+        dataset = RayTorchIterableDataset(iterable_dataset)
+        return DataLoader(dataset, batch_size=1, collate_fn=lambda x: x[0])
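To see why the membership check must come before the indexing, here is a minimal standalone sketch (FakeIterable is a hypothetical stand-in for _IterableFromIterator, not the real class):

```python
# Stand-in for _IterableFromIterator, just to illustrate lookup order.
class FakeIterable:
    pass

eval_datasets = {"eval_1": FakeIterable()}

def unsafe_lookup(name):
    # Indexing happens before isinstance can guard it:
    # a missing key raises KeyError here.
    return isinstance(eval_datasets[name], FakeIterable)

def safe_lookup(name):
    # Membership check short-circuits, so a missing key returns False.
    return name in eval_datasets and isinstance(eval_datasets[name], FakeIterable)

assert safe_lookup("eval_1") is True
assert safe_lookup("missing") is False
try:
    unsafe_lookup("missing")
except KeyError:
    pass  # the bug the suggestion fixes
```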
    if config["use_dict_eval_datasets"]:
        eval_ds_shard_1 = ray.train.get_dataset_shard("eval_1")
        eval_ds_shard_2 = ray.train.get_dataset_shard("eval_2")

        eval_dataset = {
            "eval_1": eval_ds_shard_1.iter_torch_batches(
                batch_size=BATCH_SIZE_PER_WORKER
            ),
            "eval_2": eval_ds_shard_2.iter_torch_batches(
                batch_size=BATCH_SIZE_PER_WORKER
            ),
        }
    else:
        eval_ds_shard = ray.train.get_dataset_shard("eval")

        eval_dataset = eval_ds_shard.iter_torch_batches(
            batch_size=BATCH_SIZE_PER_WORKER
        )
There's some code duplication in how eval_dataset is created. You could extract the common iter_torch_batches call into a helper function to make this more concise and maintainable.
    def get_eval_batch_iter(shard_name: str):
        shard = ray.train.get_dataset_shard(shard_name)
        return shard.iter_torch_batches(batch_size=BATCH_SIZE_PER_WORKER)

    if config["use_dict_eval_datasets"]:
        eval_dataset = {
            "eval_1": get_eval_batch_iter("eval_1"),
            "eval_2": get_eval_batch_iter("eval_2"),
        }
    else:
        eval_dataset = get_eval_batch_iter("eval")

def test_e2e_dict_eval_ray_data(ray_start_6_cpus_2_gpus, config_id):
    def train_func(config):
        # Datasets
        if config["use_ray_data"]:
            train_ds_shard = ray.train.get_dataset_shard("train")
            eval_ds_shard_1 = ray.train.get_dataset_shard("eval_1")
            eval_ds_shard_2 = ray.train.get_dataset_shard("eval_2")

            train_dataset = train_ds_shard.iter_torch_batches(
                batch_size=BATCH_SIZE_PER_WORKER
            )
            eval_dataset = {
                "eval_1": eval_ds_shard_1.iter_torch_batches(
                    batch_size=BATCH_SIZE_PER_WORKER
                ),
                "eval_2": eval_ds_shard_2.iter_torch_batches(
                    batch_size=BATCH_SIZE_PER_WORKER
                ),
            }
        else:
            train_df = pd.read_json(train_data)
            validation_df = pd.read_json(validation_data)

            train_dataset = Dataset.from_pandas(train_df)
            eval_dataset = Dataset.from_pandas(validation_df)

        # Model
        model_config = AutoConfig.from_pretrained(MODEL_NAME)
        model = AutoModelForCausalLM.from_config(model_config)

        # HF Transformers Trainer
        training_args = TrainingArguments(
            f"{MODEL_NAME}-wikitext2",
            evaluation_strategy=config["evaluation_strategy"],
            logging_strategy=config["logging_strategy"],
            save_strategy=config["save_strategy"],
            eval_steps=config["eval_steps"],
            save_steps=config["save_steps"],
            logging_steps=config["logging_steps"],
            num_train_epochs=config.get("num_train_epochs", MAX_EPOCHS),
            max_steps=config.get("max_steps", -1),
            learning_rate=config.get("learning_rate", 2e-5),
            per_device_train_batch_size=BATCH_SIZE_PER_WORKER,
            per_device_eval_batch_size=BATCH_SIZE_PER_WORKER,
            weight_decay=0.01,
            disable_tqdm=True,
            no_cuda=config["no_cuda"],
            report_to="none",
        )
        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
        )

        # Report to Ray Train
        trainer.add_callback(RayTrainReportCallback())
        trainer = prepare_trainer(trainer)

        # Start Training
        trainer.train()

    train_loop_config = CONFIGURATIONS[config_id]

    # Must specify `max_steps` for Iterable Dataset
    train_loop_config["use_ray_data"] = True
    train_loop_config["max_steps"] = MAX_STEPS

    # Calculate the num of Ray training iterations
    num_iterations = MAX_STEPS // train_loop_config["save_steps"]

    train_df = pd.read_json(train_data)
    validation_df = pd.read_json(validation_data)

    ray_train_ds = ray.data.from_pandas(train_df)
    ray_eval_ds_1 = ray.data.from_pandas(validation_df)
    ray_eval_ds_2 = ray.data.from_pandas(validation_df)

    use_gpu = not train_loop_config["no_cuda"]

    trainer = TorchTrainer(
        train_func,
        train_loop_config=train_loop_config,
        scaling_config=ScalingConfig(num_workers=NUM_WORKERS, use_gpu=use_gpu),
        datasets={
            "train": ray_train_ds,
            "eval_1": ray_eval_ds_1,
            "eval_2": ray_eval_ds_2,
        },
    )
    result = trainer.fit()

    assert result.metrics["step"] == MAX_STEPS
    assert result.checkpoint
    assert isinstance(result.checkpoint, Checkpoint)
    assert len(result.best_checkpoints) == num_iterations
    assert "eval_eval_1_loss" in result.metrics
    assert "eval_eval_2_loss" in result.metrics
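The last two assertions rely on how the HF Trainer names metrics when eval_dataset is a dictionary: each metric is prefixed with eval_ followed by the dataset key. A minimal sketch of that naming convention (eval_metric_key is an illustrative helper, not a real API):

```python
# Sketch of the metric-key naming the assertions above rely on: with a dict
# eval_dataset, each metric is prefixed "eval_<dataset_key>_", so dataset key
# "eval_1" with metric "loss" yields "eval_eval_1_loss".
def eval_metric_key(dataset_key: str, metric: str = "loss") -> str:
    return f"eval_{dataset_key}_{metric}"

assert eval_metric_key("eval_1") == "eval_eval_1_loss"
assert eval_metric_key("eval_2") == "eval_eval_2_loss"
```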
This new test function test_e2e_dict_eval_ray_data duplicates a lot of code from test_e2e_hf_data and test_e2e_ray_data in the same file, especially the inlined train_func. This makes the test suite harder to maintain. Consider refactoring by extracting the common logic into a shared helper function. For example, the model and trainer setup logic is identical across these tests and could be moved to a helper.
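As a sketch of the suggested refactor (all names here are hypothetical), the shared setup could live in a factory, with each test supplying only its dataset-building step:

```python
# Hypothetical factory sketch: the identical model / TrainingArguments /
# Trainer setup would live once inside train_func; each e2e test passes in
# only a build_datasets callable that returns (train_dataset, eval_dataset).
def make_train_func(build_datasets):
    def train_func(config):
        train_dataset, eval_dataset = build_datasets(config)
        # ... shared model / TrainingArguments / Trainer setup would go here ...
        return train_dataset, eval_dataset
    return train_func

def build_dict_eval(config):
    # A real implementation would wrap
    # ray.train.get_dataset_shard(...).iter_torch_batches(...) here.
    return "train_shard", {"eval_1": "shard_1", "eval_2": "shard_2"}

train_func = make_train_func(build_dict_eval)
_, eval_dataset = train_func({})
assert set(eval_dataset) == {"eval_1", "eval_2"}
```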
xinyuangui2 left a comment
Review is already done in #56046
…t#56484) To provide support for evaluating model via HuggingFace trainer using a dictionary of Ray datasets. This is to align with the `eval_dataset` argument type in `get_eval_dataloader` in [transformers](https://github.com/huggingface/transformers/blob/v4.56.1/src/transformers/trainer.py#L1196). Signed-off-by: yenhong.wong <yenhong.wong@grabtaxi.com> Co-authored-by: yenhong.wong <yenhong.wong@grabtaxi.com> Signed-off-by: zac <zac@anyscale.com>
Original PR #56484 by wyhong3103 Original: ray-project/ray#56484
Use case

To provide support for evaluating a model via the HuggingFace Trainer using a dictionary of Ray datasets. This aligns with the `eval_dataset` argument type in `get_eval_dataloader` in transformers.

Why are these changes needed?

If users provide a dictionary of Ray datasets as `eval_datasets` to the HuggingFace Trainer, it will fail during the evaluation step because the dataloader currently returned is NOT of the type `RayTorchIterableDataset`.

Related issue number

Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.