Titan changes to use DCP ZOC instead of titan default Async + Pinned Memory#1287
Conversation
Synced offline; we can land this PR after 1) performance is on par or better, and 2) the loss curve matches training without any checkpointing.
Summary: Pull Request resolved: #156207. This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing.
Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287
Differential Revision: D72391401
fegin
left a comment
Overall, LGTM. We need to wait for the PyTorch PR to get the testing signals here. I'm not sure whether this PR will break the TorchTitan checkpoint unit tests.
torchtitan/components/checkpoint.py
Outdated
self.mp = None
self.async_future = None
self.upload_future = None
The naming is not clear now that we have two different futures. Can you change async_future to staging_future? upload_future is also unclear; maybe save_future, since some people may use local storage, where "upload" is a little confusing.
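The distinction behind the suggested names can be sketched in plain Python (hypothetical class and method names, not the actual torchtitan code): a staging_future that completes once the state dict has been copied off the GPU, and a save_future that completes once it has been persisted, whether to local disk or remote storage.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class CheckpointerSketch:
    """Illustrative only: two futures with distinct lifetimes."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self.staging_future = None  # done when staging (e.g. GPU -> pinned host memory) finishes
        self.save_future = None     # done when persistence (local disk or remote) finishes

    def async_save(self, state_dict):
        # Stage first; training can resume as soon as the cheap copy completes.
        self.staging_future = self._pool.submit(self._stage, state_dict)
        staged = self.staging_future.result()  # wait only for staging
        # Persist in the background; callers poll/wait on save_future.
        self.save_future = self._pool.submit(self._persist, staged)

    def _stage(self, state_dict):
        return dict(state_dict)  # stand-in for a device-to-host copy

    def _persist(self, staged):
        time.sleep(0.01)  # stand-in for slow storage I/O
        return "checkpoint-path"


ckpt = CheckpointerSketch()
ckpt.async_save({"step": 500})
print(ckpt.save_future.result())  # prints "checkpoint-path"
```

With this split, "is the previous checkpoint fully written?" and "can training safely mutate the model again?" are answered by two different futures, which is exactly why a single async_future name becomes ambiguous.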
torchtitan/components/checkpoint.py
Outdated
self.purge_queue.put(Terminate())
self.purge_thread.join()

if self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
This check is redundant. Can we just check if self.stager is None or not?
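The suggestion amounts to keying the cleanup path off the presence of the stager object rather than re-checking the async mode. A minimal sketch (hypothetical attribute and class names, assuming the stager owns the pinned-memory resources):

```python
class CheckpointerCloseSketch:
    """Sketch of the close/shutdown path being reviewed."""

    def __init__(self, stager=None):
        self.stager = stager

    def close(self):
        # Instead of: if self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM: ...
        # check the object that actually needs cleanup:
        if self.stager is not None:
            self.stager.close()
            self.stager = None


class FakeStager:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


s = FakeStager()
c = CheckpointerCloseSketch(s)
c.close()
print(s.closed)  # prints True
CheckpointerCloseSketch(None).close()  # no-op, no error when no stager was created
```

Checking the attribute directly keeps the shutdown logic correct even if a future mode other than ASYNC_WITH_PINNED_MEM also creates a stager.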
Summary: X-link: meta-pytorch/tnt#1010. Pull Request resolved: #156207. This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing.
Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287
Reviewed By: diego-urgell
Differential Revision: D72391401
@pytorchbot rebase
Test Titan changes to use DCP ZOC instead of titan default
Loss Curve, DCP + ZOC: train with DCP ZOC until step 500, delete the last checkpoint (since it's a sync save), then load and run training until step 1000.
Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC
DCP + Titan training on the LLAMA3 8B model for 500 steps.
Delete the checkpoint at step 500 (since it was sync saved), then run LLAMA3 8B model training to 1000 steps.
TorchTitan training (Async + Pinned Memory) without DCP ZOC for 1000 steps on the LLAMA3 8B model.