Artifact for Paper "Understanding Stragglers in Large Model Training Using What-if Analysis"

👋 Hi, everyone!
We are the ByteDance Seed team.

Introduction

This artifact provides the core functionality of the simulator and the what-if analysis proposed in the paper, along with three sample traces to demonstrate the usage of the tool. The expected output includes the following for each sample trace:

Estimated slowdown $S$ (i.e., Eq. 1)
Slowdown $S_t$ attributed to each operation type $t$ (i.e., Eq. 2)
Slowdown $S_w$ attributed to each worker $w$ (i.e., Eq. 4)
Characterization metrics $M_W$ (i.e., Eq. 5) and $M_S$ for individual worker issues and stage partitioning imbalance, respectively
A heatmap visualization as in Fig. 14.
A timeline of the simulated ideal trace visualizable in Perfetto.

Code Structure

├── analyzer  # Analyzer codes
├── data # Stores input data for analysis and corresponding expected results
├── format.sh # Script for code formatting
├── README.md
├── requirements.txt # Python dependencies
├── style.yapf # Configuration file for code formatting, defining the code style
└── run_all.sh # Convenient script used to reproduce all the results for each trace

./analyzer/wia.py is the entry point for the what-if analysis. It takes in one job's trace and outputs various slowdown estimates through simulation. The meaning of each output field is documented in detail in the AnalyzerResult class in ./analyzer/metatypes.py.
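To give a rough intuition for what "slowdown estimation through simulation" means, here is a toy sketch. It is not the simulator in ./analyzer and does not use its API. It assumes a fully synchronous job where each step lasts as long as its slowest worker, ignores all dependencies between operations, and takes the per-step median as the straggler-free duration. The function names and the median choice are illustrative assumptions.

```python
from statistics import median


def step_time(durations):
    """A synchronous step finishes only when its slowest worker does."""
    return max(durations)


def toy_slowdown(per_step_durations):
    """Ratio of the observed total time to a straggler-free estimate.

    per_step_durations: list of steps, each a list of per-worker durations.
    The idealized run replaces every worker's duration in a step with the
    median of that step (an assumption made for this toy model only).
    """
    actual = sum(step_time(step) for step in per_step_durations)
    ideal = sum(median(step) for step in per_step_durations)
    return actual / ideal


# Two steps, four workers; the last worker is slow in the first step.
steps = [
    [1.0, 1.0, 1.0, 1.5],
    [1.0, 1.0, 1.0, 1.0],
]
print(toy_slowdown(steps))  # 1.25
```

In this toy case the job takes 2.5 time units instead of an idealized 2.0, so the estimated slowdown is 1.25. The artifact's simulator works on the recorded operations of a real trace rather than on one number per worker.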

The data folder contains three traces named after their root causes: SE, ST, and AR, denoting sequence length imbalance, stage partitioning imbalance, and artificial individual worker slowdown (see Section A.2), respectively. They correspond to the jobs analyzed in Sections 5.2, 5.3, and A.2 (the one with the highest level of slowdown), respectively. Several files are included for each trace:

meta-<trace_name>.yaml: metadata for the corresponding job.
trace-<trace_name>.parquet: trace data. Each row corresponds to a recorded operation with the following fields:
- dp_rank: DP rank of the worker performing the op
- stage: PP rank of the worker
- rank: global rank of the worker
- step: training step this op is at
- optype: operation type. See metatypes.py for all supported types.
- start_ts: start timestamp
- duration: duration of the op
- seq_id: ops of the same type in a step on a worker will be assigned a sequence number seq_id in ascending order of their start times
- mc: model chunk (virtual stage) ID, the ID within the PP stage
- gmc: global model chunk (virtual stage) ID, the ID in the model. For example, if PP_size=VPP_size=2, then PP0 holds model chunks with gmc=0 and 2, while PP1 holds gmc=1 and 3.
- mb_id: microbatch ID, only valid for forward-compute and backward-compute ops
result-<trace_name>.json: expected what-if analysis result.
heatmap-<trace_name>.png: expected heatmap generated with the analysis result.
ms-<trace_name>.json: the $M_S$ metric as in Section 5.2 of the paper.
mw-<trace_name>.json: the $M_W$ metric as in Eq. 5 of the paper.
timeline-<trace_name>.json.gz: the timeline of the original trace that can be visualized in Perfetto.
ideal-timeline-<trace_name>.json.gz: the expected timeline of the simulated ideal trace that can be visualized in Perfetto. One can contrast it with the original timeline for a more intuitive understanding of the simulation.
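The seq_id and gmc fields described above can be illustrated on a few hand-made records. The sketch below makes two assumptions that are not confirmed by the artifact: seq_id starts at 0, and gmc follows the mapping gmc = mc * PP_size + stage, which is inferred only from the PP_size=VPP_size=2 example. The helper names assign_seq_ids and global_model_chunk are hypothetical.

```python
from collections import defaultdict


def assign_seq_ids(ops):
    """Number ops of the same type, step, and worker by ascending start time.

    ops: list of dicts with keys rank, step, optype, start_ts.
    Returns copies of the input dicts with a seq_id key added, in input order.
    Numbering from 0 is an assumption.
    """
    groups = defaultdict(list)
    for idx, op in enumerate(ops):
        groups[(op["rank"], op["step"], op["optype"])].append(idx)
    out = [dict(op) for op in ops]
    for indices in groups.values():
        indices.sort(key=lambda i: ops[i]["start_ts"])
        for seq_id, i in enumerate(indices):
            out[i]["seq_id"] = seq_id
    return out


def global_model_chunk(stage, mc, pp_size):
    """Assumed mapping, inferred from the README example only."""
    return mc * pp_size + stage


sample_ops = [
    {"rank": 0, "step": 0, "optype": "forward-compute", "start_ts": 30},
    {"rank": 0, "step": 0, "optype": "forward-compute", "start_ts": 10},
    {"rank": 0, "step": 0, "optype": "backward-compute", "start_ts": 20},
]
print([op["seq_id"] for op in assign_seq_ids(sample_ops)])  # [1, 0, 0]

# PP_size = VPP_size = 2: stage 0 holds gmc 0 and 2, stage 1 holds gmc 1 and 3.
print([global_model_chunk(s, m, 2) for s in (0, 1) for m in (0, 1)])  # [0, 2, 1, 3]
```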

Evaluation Steps

1. Install dependencies.

The code was tested with Python 3.11, but should work on other Python versions. To install the necessary dependencies, run the following command:

pip install -r requirements.txt
export PYTHONPATH="`pwd`:$PYTHONPATH"

2. Execute the reproduction script.

For each trace, we run the analysis with wia.py, produce the heatmap with heatmap.py, compute the $M_S$ and $M_W$ metrics with compute_ms.py and compute_mw.py, and generate the ideal timeline with to_timeline.py. For convenience, we pack all of these into one script, so all you need to do is run it:

./run_all.sh
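The timelines involved in this step are gzipped JSON files meant to be opened in Perfetto. As an illustration of what such a file can contain, the sketch below converts trace-like records into the Chrome Trace Event JSON format, which Perfetto can load. It is not the artifact's to_timeline.py. The layout (stage as pid, rank as tid) and the assumption that start_ts and duration are already in microseconds are choices made for this example only.

```python
import gzip
import json


def to_trace_events(ops):
    """Map trace-like records to Chrome Trace Event 'complete' events."""
    events = []
    for op in ops:
        events.append({
            "name": op["optype"],
            "ph": "X",               # complete event: start time plus duration
            "ts": op["start_ts"],    # assumed to be in microseconds
            "dur": op["duration"],   # assumed to be in microseconds
            "pid": op["stage"],      # one process row group per PP stage (our choice)
            "tid": op["rank"],       # one thread row per worker (our choice)
        })
    return {"traceEvents": events}


def write_timeline(ops, path):
    """Write the events as gzipped JSON text."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(to_trace_events(ops), f)


example_ops = [
    {"optype": "forward-compute", "start_ts": 0, "duration": 40, "stage": 0, "rank": 0},
    {"optype": "backward-compute", "start_ts": 50, "duration": 80, "stage": 0, "rank": 0},
]
print(json.dumps(to_trace_events(example_ops), indent=2))
```

Calling write_timeline(example_ops, "example.json.gz") would produce a small file in this format; the file name is arbitrary.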

3. Check if output is expected.

The script above also compares the output with the expected results in the data folder. Any differences should be limited to the generated heatmap PNG files, and only in how the figures are rendered rather than in the underlying data, likely because plotting behavior differs across platforms and library versions.
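If you want to compare a JSON result against an expected one by hand, a tolerance-based check is one option. The sketch below is an assumption about how such a comparison could be done, not a description of what run_all.sh does (which may use an exact diff). The function names json_close and files_match and the default tolerance are hypothetical.

```python
import json
import math


def json_close(a, b, rel_tol=1e-9):
    """Recursively compare two parsed JSON values, tolerating float noise."""
    # bool is a subclass of int in Python, so handle it before numbers.
    if isinstance(a, bool) or isinstance(b, bool):
        return a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            json_close(a[k], b[k], rel_tol) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            json_close(x, y, rel_tol) for x, y in zip(a, b))
    return a == b


def files_match(path_a, path_b, rel_tol=1e-9):
    """Load two JSON files and compare their contents with json_close."""
    with open(path_a) as fa, open(path_b) as fb:
        return json_close(json.load(fa), json.load(fb), rel_tol)


expected = {"S": 1.25, "per_worker": [1.0, 1.1]}
produced = {"S": 1.25 + 1e-12, "per_worker": [1.0, 1.1]}
print(json_close(expected, produced))  # True
```

The keys in the example dictionaries are made up; the real field names are documented in the AnalyzerResult class.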

Result highlight:

For trace AR, $M_W$ should be large (~100%) since the slowdown comes from an individual worker issue, and only one worker should be highlighted in the heatmap.
For trace ST, $M_S$ should be large (~120%) since the slowdown is caused by a long last stage, and only workers on the last stage should be highlighted in the heatmap.
For trace SE, $M_S$ is also notably high (~68%), since this job also suffers from a long last stage, secondary to the dominant sequence length imbalance. In the heatmap, all workers are highlighted because the issue occurs randomly.

4. Optional: explore customized what-if analysis

Users can also run their own analysis using our tool. We show two examples in custom-wia.ipynb. Run the notebook to see the results, or modify it to try your own analysis.

License

This project is licensed under Apache 2.0. See the LICENSE file for details.

About ByteDance Seed Team

Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.
