stas00 commented Jul 19, 2022

This PR solves 2 issues:

  • less noise, by using a tqdm progress bar
  • more informative: it tells users how long they have to wait and how many shards are left to load

New way:

Loading 72 checkpoints:  12%|█▎        | 9/72 [01:12<08:39,  8.25s/it]

@RezaYazdaniAminabadi, @jeffra
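
For readers unfamiliar with the pattern, here is a minimal sketch of wrapping a shard-loading loop in a tqdm progress bar. It is not the code merged in this PR: `load_checkpoints`, `fake_load`, and the shard names are hypothetical, and the fallback only exists so the sketch still runs where tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: plain iteration, no bar.
    def tqdm(iterable, **kwargs):
        return iterable


def load_checkpoints(paths, load_fn):
    """Load every shard with load_fn, showing one progress bar for the whole run.

    Both the function name and the load_fn callback are hypothetical.
    """
    states = []
    # desc sets the text shown in front of the bar.
    for path in tqdm(paths, desc=f"Loading {len(paths)} checkpoints"):
        states.append(load_fn(path))
    return states


def fake_load(path):
    """Stand-in for a real loader: sleeps briefly and returns a dummy state."""
    time.sleep(0.01)
    return {"path": path}


# Hypothetical shard names, only for the demo.
demo_paths = [f"shard_{i:02d}.pt" for i in range(8)]
demo_states = load_checkpoints(demo_paths, fake_load)
```

tqdm writes the bar to stderr by default, so it replaces the stream of log lines with a single updating line that shows the shard count, elapsed time, and ETA.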

stas00 commented Jul 19, 2022

It's actually interesting to watch tqdm processing here - the first checkpoints are loaded really fast and then it starts slowing down and elongating the ETA as the IO gets saturated.

Unrelated to this PR, but perhaps we could use a different approach to loading, so that the I/O per shard happens only once per node rather than 8 or 16 times concurrently? mmap or something similar? It would probably make loading much faster; 10 minutes to load is quite slow.
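
To make the mmap idea concrete, here is a rough sketch using Python's standard `mmap` module on a raw byte file. It is only an illustration of the mechanism, not a proposal for the actual loader: `read_slice_mmap` and `make_demo_file` are hypothetical names, and whether this helps in practice depends on the checkpoint format and the filesystem.

```python
import mmap
import os
import tempfile


def make_demo_file():
    """Write a small 1024-byte file to map (stand-in for a checkpoint shard)."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
    tmp.write(bytes(range(256)) * 4)
    tmp.close()
    return tmp.name


def read_slice_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a read-only mapping.

    Pages of a read-only mapping are served from the OS page cache, which is
    shared by all processes on the same node, so only the touched pages need
    to be brought in from disk.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


demo_path = make_demo_file()
chunk = read_slice_mmap(demo_path, 256, 4)  # first 4 bytes of the second 256-byte block
os.remove(demo_path)
```

The point of the design is that each process only touches the byte ranges it needs, instead of every rank reading and deserializing the full shard.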

And of course we have been discussing making 2 additional branches with pre-sharded-per-TP-rank data, so that each process loads only what it needs. But that requires a lot more work, plus code to look up the right checkpoint branch.
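
The branch lookup could be as small as the sketch below. Everything in it is an assumption made for illustration: the branch names, the file naming scheme, and the `select_checkpoint` helper do not exist anywhere yet.

```python
# Hypothetical mapping from tensor-parallel (TP) degree to a pre-sharded branch.
PRESHARDED_BRANCHES = {8: "tp8-sharded", 16: "tp16-sharded"}


def select_checkpoint(tp_size, tp_rank, default_branch="main"):
    """Pick the branch and file a given TP rank should load.

    Returns (branch, filename). filename is None when no pre-sharded branch
    exists for this TP degree, meaning the caller falls back to loading the
    full checkpoints and slicing them at load time.
    """
    if not 0 <= tp_rank < tp_size:
        raise ValueError(f"tp_rank {tp_rank} is out of range for tp_size {tp_size}")
    branch = PRESHARDED_BRANCHES.get(tp_size)
    if branch is None:
        return default_branch, None
    # Hypothetical naming scheme: one file per TP rank.
    return branch, f"tp_rank_{tp_rank:02d}.pt"
```

With this layout, rank 3 of an 8-way setup would call `select_checkpoint(8, 3)` and load a single file, while an unsupported TP degree falls back to the default branch.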

@tjruwase tjruwase merged commit 16699d8 into deepspeedai:master Jul 19, 2022
@stas00 stas00 deleted the patch-2 branch July 19, 2022 20:34