
[train] Add checkpoint util functions for JaxTrainer. #60759

Closed
siyuanfoundation wants to merge 1 commit into ray-project:master from siyuanfoundation:jax-checkpoint

Conversation

@siyuanfoundation
Contributor

@siyuanfoundation siyuanfoundation commented Feb 4, 2026

Description

Add checkpoint util functions for JaxTrainer.

These utilities are optional to use (they could also serve as an example); users can always use their own checkpoint function.

Related issues

#55162

Additional information

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a JaxCheckpointManager to handle checkpointing for JaxTrainer by wrapping orbax.CheckpointManager. This is a significant and well-implemented feature for improving JAX support in Ray Train. The new functionality is supported by comprehensive tests for both single-host and multi-host scenarios. My review includes suggestions for improving code clarity, maintainability, and test robustness.

@siyuanfoundation siyuanfoundation force-pushed the jax-checkpoint branch 3 times, most recently from 8ecaeca to cfe32c8 Compare February 6, 2026 14:51
@siyuanfoundation siyuanfoundation marked this pull request as ready for review February 6, 2026 15:30
@siyuanfoundation
Contributor Author

/cc @ryanaoleary

@matthewdeng matthewdeng requested a review from liulehui February 6, 2026 18:03
@ray-gardener ray-gardener bot added the community-contribution Contributed by the community label Feb 6, 2026
@liulehui liulehui self-assigned this Feb 9, 2026
Contributor

@liulehui liulehui left a comment


Thank you!!

Contributor

@liulehui liulehui left a comment


Could you elaborate on the requirements/functionality in the PR description?

ty!

Comment on lines +6 to +7
from ray.train._internal.session import _TrainingResult
from ray.train._internal.storage import StorageContext, _exists_at_fs_path
Contributor


I think these are for Ray Train v1; this is our v2 checkpoint manager:
https://github.com/liulehui/ray/blob/oss-elastic-training/python/ray/train/v2/_internal/execution/checkpoint/checkpoint_manager.py#L77

I recently added a GPT-2 template (https://docs.ray.io/en/master/train/examples/jax/intro_to_jax_trainer/README.html) which uses orbax for checkpointing as well.

Is there any more functionality/requirement needed beyond this one?

)

# Use PyTreeCheckpointHandler for standard PyTree saving
item_handlers = {
Contributor


Is it possible/supported for users to pass different arguments here? It might be good to expose orbax_options or something similar to define the arguments users can pass to the CheckpointManager when instantiating their JaxCheckpointManager.
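The `orbax_options` pass-through suggested above could look roughly like this. This is a minimal sketch: `JaxCheckpointManager` here is a toy stand-in (as is `_FakeOrbaxManager`), not the PR's actual class; the point is only that user-supplied options are forwarded unchanged to the wrapped manager instead of being hard-coded.

```python
class _FakeOrbaxManager:
    """Toy stand-in for orbax.checkpoint.CheckpointManager, for illustration only."""

    def __init__(self, directory, **options):
        self.directory = directory
        self.options = options


class JaxCheckpointManager:
    """Hypothetical wrapper that exposes an `orbax_options` dict to users."""

    def __init__(self, directory, orbax_options=None):
        # Forward user-provided options verbatim instead of hard-coding
        # item_handlers and friends inside the wrapper.
        self._manager = _FakeOrbaxManager(directory, **(orbax_options or {}))


mgr = JaxCheckpointManager("/tmp/ckpt", orbax_options={"max_to_keep": 3})
print(mgr._manager.options["max_to_keep"])  # → 3
```

With a real orbax backend, `orbax_options` would map onto the arguments of `ocp.CheckpointManager` (or `ocp.CheckpointManagerOptions`), which is what the comment above is asking to expose.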

Comment on lines +112 to +117
save_args = ocp_args.PyTreeSave(
item=train_state,
save_args=jax.tree.map(
lambda _: ocp.SaveArgs(chunk_byte_size=chunk_byte_size), train_state
),
)
Contributor


Discussed offline: let's try to have a layout similar to what the other frameworks currently have.

e.g.

def train_fn_per_worker(train_loop_config: dict):

    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as temp_checkpoint_dir:

            # Pass in the new workers' mesh/sharding config to restore.
            restore_args_structure = jax.tree.map(map_to_restore_args, target)
            checkpoint_args = ocp_args.PyTreeRestore(
                item=target, restore_args=restore_args_structure
            )
            model = checkpointer.restore(
                temp_checkpoint_dir, args=ocp_args.Composite(items=checkpoint_args)
            )

        # Continue training.
        ...

    # Save with the current mesh sharding.
    with tempfile.TemporaryDirectory() as temp_checkpoint_dir:

        checkpointer.save(
            temp_checkpoint_dir,
            args=ocp_args.Composite(items=ocp_args.PyTreeSave(item=train_state)),
        )
        ray.train.report(
            {"loss": 0.1},
            checkpoint=ray.train.Checkpoint.from_directory(temp_checkpoint_dir),
        )

@siyuanfoundation siyuanfoundation force-pushed the jax-checkpoint branch 2 times, most recently from 2297a66 to c07acb2 Compare February 17, 2026 21:11
@siyuanfoundation siyuanfoundation changed the title [train] Add checkpoint manager for JaxTrainer. [train] Add checkpoint util functions for JaxTrainer. Feb 17, 2026
Signed-off-by: siyuanfoundation <sizhang@google.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

mesh=x.sharding.mesh, sharding=x.sharding
)
if isinstance(x, (jax.Array, jax.ShapeDtypeStruct))
and hasattr(x, "sharding")


Null sharding access crashes on ShapeDtypeStruct targets

Medium Severity

The guard hasattr(x, "sharding") is insufficient for jax.ShapeDtypeStruct because its sharding attribute always exists but defaults to None. When a ShapeDtypeStruct with sharding=None is passed as part of the target, hasattr returns True, then x.sharding.mesh raises an AttributeError since None has no mesh. The check needs to also verify that x.sharding is not None (e.g., using getattr(x, "sharding", None) is not None).
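The failure mode and the suggested guard are pure Python, so they can be illustrated without JAX. The classes below are toy stand-ins for `jax.Array` and `jax.ShapeDtypeStruct`: they show why `hasattr` passes even when the attribute is `None`, and how the `getattr(...) is not None` check avoids the crash.

```python
class FakeSharding:
    """Stand-in for a JAX sharding object that carries a mesh."""
    mesh = "mesh0"


class FakeArray:
    """Stand-in for jax.Array: sharding is present and non-None."""
    sharding = FakeSharding()


class FakeShapeDtypeStruct:
    """Stand-in for jax.ShapeDtypeStruct: the attribute exists but is None."""
    sharding = None


def mesh_of_buggy(x):
    # Buggy guard: hasattr() is True for FakeShapeDtypeStruct too,
    # so x.sharding.mesh raises AttributeError (None has no .mesh).
    if hasattr(x, "sharding"):
        return x.sharding.mesh
    return None


def mesh_of_fixed(x):
    # Suggested guard: also require the value itself to be non-None.
    if getattr(x, "sharding", None) is not None:
        return x.sharding.mesh
    return None
```

In the PR's `tree_map` lambda, the same change means replacing `hasattr(x, "sharding")` with `getattr(x, "sharding", None) is not None` so that leaves with `sharding=None` fall through to `ocp.checkpoint_utils.construct_restore_args(x)` instead of crashing.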


"""
import orbax.checkpoint as ocp

checkpointer = ocp.PyTreeCheckpointer()
Contributor


I think this can live in the user's training function script instead of us forcing it.

They can choose to do

checkpointer.save(checkpoint_dir, item, force=force)
mlflow.log_model(checkpoint_dir)

without using ray.train.report.

We can leave this to the user.

Comment on lines +57 to +66
restore_args = jax.tree_util.tree_map(
lambda x: type_handlers.ArrayRestoreArgs(
mesh=x.sharding.mesh, sharding=x.sharding
)
if isinstance(x, (jax.Array, jax.ShapeDtypeStruct))
and hasattr(x, "sharding")
else ocp.checkpoint_utils.construct_restore_args(x),
target,
is_leaf=lambda x: isinstance(x, (jax.Array, jax.ShapeDtypeStruct)),
)
Contributor


Just for my understanding: is the main difference here that we pass in target (which includes both the model definition and the sharding info) so that we can restore from a previous checkpoint?

Would it be sufficient to keep the Mesh/Sharding in the training context so that users can just use that for restoring? In that case, I think we only need a util to construct the restore_args, right?
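The "small util" idea above might look roughly like this. Everything here is hypothetical: the context dict, the helper name, and the plain-dict restore-arg records are stand-ins so the shape of the idea is clear without JAX; a real version would use `jax.tree_util.tree_map` over the target PyTree and build `ocp.ArrayRestoreArgs` from the sharding kept in the training context.

```python
def build_restore_args(target_leaves, context):
    """Hypothetical util: build one restore-arg record per leaf, reusing the
    mesh/sharding that was stored in the training context when the worker
    group started, instead of requiring a fully sharded `target` PyTree."""
    sharding = context["sharding"]
    return {
        name: {"mesh": sharding["mesh"], "sharding": sharding}
        for name in target_leaves
    }


# Toy usage: the context carries the sharding, the caller only names the leaves.
context = {"sharding": {"mesh": "mesh0"}}
args = build_restore_args(["params/w", "params/b"], context)
```

The design question this sketch captures is exactly the one in the comment: if the sharding lives in the training context, the user never has to thread a pre-sharded `target` through restore.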


Labels

community-contribution Contributed by the community


3 participants