
Conversation

@DanielHesslow
Collaborator

@DanielHesslow DanielHesslow commented Nov 29, 2021

Providing the functionality for running the EleutherAI evaluation harness on Megatron checkpoints, addressing #137.

In order to run on JZ we need to cache the tasks locally, since the GPU nodes do not have internet access.
eval_harness/download.py provides that functionality.

Currently, pipeline-parallel models work, but tensor parallelism still needs to be tested.
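For reference, a rough sketch of what this pre-download step looks like (the exact CLI of eval_harness/download.py may differ; the use of lm_eval.tasks and a --task_list argument here mirrors the snippet discussed further down in this thread):

```python
# Hedged sketch of a task pre-download script: run it on a login node with
# internet access so the datasets are cached for the offline GPU nodes.
import argparse

from lm_eval import tasks  # EleutherAI lm-evaluation-harness


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_list", default="all",
                        help="comma-separated task names, or 'all'")
    args = parser.parse_args()

    task_list = tasks.ALL_TASKS if args.task_list == "all" else args.task_list.split(",")
    # Instantiating the tasks triggers the dataset downloads, which then land in the cache.
    tasks.get_task_dict(task_list)


if __name__ == "__main__":
    main()
```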

@DanielHesslow
Collaborator Author

DanielHesslow commented Nov 29, 2021

Tensor parallelism should now work as well. Will coordinate with Jason Phang.

@DanielHesslow DanielHesslow marked this pull request as ready for review November 29, 2021 10:57
@DanielHesslow DanielHesslow marked this pull request as draft November 29, 2021 11:27
@DanielHesslow DanielHesslow marked this pull request as ready for review December 6, 2021 16:43
@DanielHesslow
Collaborator Author

This should now be all good. We're getting the same results as Jason, who has in turn verified them against the HF models.

DanielHesslow and others added 2 commits December 9, 2021 17:02
Member

@thomasw21 thomasw21 left a comment

Partial review. Thanks for the great work

Comment on lines +5 to +6
PP_SIZE=1
TP_SIZE=1
Member

Do those actually not depend on the checkpoint? What happens if my checkpoint was saved with PP_SIZE=2, can I run evaluation with PP_SIZE=1? Same question for TP.

Collaborator Author

Yup, it works: the checkpoints are loaded with the state dicts produced by the deepspeed_to_megatron converter, so it'll merge the model-parallel partitions etc.
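For illustration, a minimal sketch of the kind of merge involved for tensor-parallel partitions (the split dimensions and shapes here are hypothetical; the real per-parameter rules live in the deepspeed_to_megatron / deepspeed_checkpoint tooling):

```python
import torch

def merge_tp_shards(shards, partition_dim):
    """Merge the shards of one parameter gathered from tp_rank 0..TP_SIZE-1.

    Column-parallel weights are typically split along dim 0, row-parallel
    weights along dim 1, and embeddings along the vocab dimension.
    """
    return torch.cat(shards, dim=partition_dim)

# e.g. a column-parallel linear weight sharded across 2 TP ranks
full_weight = merge_tp_shards([torch.randn(2048, 4096), torch.randn(2048, 4096)],
                              partition_dim=0)
assert full_weight.shape == (4096, 4096)
```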

Member

Wait, that script converts to Megatron format, i.e. Megatron-LM, no? Can you load a rotary embedding checkpoint?

Collaborator Author

@DanielHesslow DanielHesslow Dec 10, 2021

Megatron as in the state_dict format that you get when you run without the --deepspeed flag, e.g. some layers are named differently since we're not wrapping the model in DeepSpeed's pipelining primitives. I don't see why rotary shouldn't work, but I haven't tested it in particular.

Comment on lines +23 to +24
--vocab-file $VOCAB_FILE\
--merge-file $MERGE_FILE\
Member

Why do you need this?

Collaborator Author

The vocab-file/merge-file arguments just store the path, so it'll crash otherwise if the path is not available during eval.

--merge-file $MERGE_FILE\
--micro-batch-size 64\
--adaptive_seq_len\
--eval_fp32\
Member

Interesting. Have you seen performance changes between 16 and 32?

Collaborator Author

@DanielHesslow DanielHesslow Dec 10, 2021

Yeah, if we change the seq_length during eval, performance degrades a bit in fp16. Not entirely sure why, but it's stable in fp32, and using adaptive_seq_len speeds things up by a lot.
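To make the --adaptive_seq_len idea concrete, here is a hedged sketch of the padding trick it refers to (illustrative only, not the code from this PR): pad each eval batch only up to its longest request instead of the model's full seq_length.

```python
import torch

def collate_adaptive(token_lists, pad_id=0):
    """Pad a batch of tokenized eval requests to the longest sequence in the
    batch rather than to the model's maximum sequence length."""
    max_len = max(len(toks) for toks in token_lists)
    batch = torch.full((len(token_lists), max_len), pad_id, dtype=torch.long)
    for i, toks in enumerate(token_lists):
        batch[i, :len(toks)] = torch.tensor(toks, dtype=torch.long)
    return batch

batch = collate_adaptive([[5, 6, 7], [8, 9]])  # shape (2, 3), not (2, seq_length)
```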

Comment on lines +1 to +3
# Downloads the specified tasks in the evaluation harness
# This is particularly useful when running in environments where the GPU nodes
# do not have internet access. This way we can pre-download them and use the cached data-set during evaluation.
Member

That's awesome! 🤯

import torch
from collections import OrderedDict
from deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint
from .deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint
Member

unwanted change?

Collaborator Author

Nope, intentional. Otherwise we'd need to add tools/convert_checkpoint/ to the path in order to import deepspeed_to_megatron.py.

Contributor

I proposed moving all these scripts into the normal Meg-DS module files here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/121/files
but I'm not sure that PR will be finished any time soon.

Once in the normal module files this won't be needed. But it's fine for now.

Or move these like I proposed in the PR above in this PR already... either way works.

Python isn't very friendly to scripts that come with extra files.

from logging import logMultiprocessing
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
Member

Why is this needed?

Collaborator Author

Adds the root of the repo to the path so we can import stuff from there without pip-installing the repo. Not pretty, but just copied from tasks/main.py 🤷
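For context, the pattern in question is roughly the following (a sketch of the usual Megatron-style path hack; the exact number of os.path.pardir hops depends on where the file lives):

```python
import os
import sys

# Add the repository root (two levels up from tasks/eval_harness/) to sys.path
# so that `import megatron` works without pip-installing the repo.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                             os.path.pardir, os.path.pardir)))
```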

Collaborator

Why is it important to not pip install the repo?

Contributor

Also, currently it can't find the megatron files (which can't be installed).

I had to work around it with:

cd /gpfsssd/worksf/projects/rech/six/commun/code/eval/Megatron-DeepSpeed
PYTHONPATH=. sh ./run_evalharness.sh

so you probably need to find the root and add it to sys.path as well.

Member

When you say "can't install", you mean that pip install -e . didn't work? If so, I tried figuring out why; I'm not sure why it causes a ModuleNotFoundError, but I had a workaround that fixed it for me: #173 (comment)

Contributor

> When you say "can't install", you mean that pip install -e . didn't work?

That's right. At the moment I added the PYTHONPATH= to the slurm script and it works.


def main():
task_list = ALL_TASKS if args.task_list == 'all' else args.task_list.split(',')
tasks.get_task_dict(task_list)
Member

After testing this script: it creates a data folder in whatever directory you call it from, for some specific tasks like lambada or triviaqa. This then needs to be moved to the root of lm-evaluation-harness. I think this should be fixed in that repo though... (we can always manually move those files)

Collaborator Author

Hmm okay, my bad, I didn't test those tasks last night. I can also just revert the last commit with the pickle workaround for now.

Member

Not your fault at all, I needed it to load all the datasets; turns out half of them have a manual script. I think you can just add a comment (we could even build a symlink), but the better thing would be to update their repo to install their data locally when we use their API.

Collaborator

@SaulLu SaulLu left a comment

Thank you so much for all your hard work Daniel and Jason!

I left a minor comment below for my own understanding.

Also, I wonder if we should add the lm_eval dependency somewhere (not sure where it's best to do that, maybe in extras_require in `setup.py` or explaining it in the `README.md`).

# Initialize megatron model using the parsed state dict.
sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), mpu.get_pipeline_model_parallel_rank(), True)

model = get_model(model_provider)[0]
Member

@thomasw21 thomasw21 Dec 22, 2021

Can we add assert(isinstance(model, GPTModelPipe))?

Collaborator

Maybe it makes more sense to check if model inherits from GPTModelPipe?

Member

I'm pretty sure isinstance checks the inheritance, so even if we implement a model that inherits from GPTModelPipe, the assert would return True. Though technically all models we currently train are GPTModelPipe, so using type equality should be fine.

So it turns out this is GPTModel instead of GPTModelPipe, which is one of the inconsistencies that led @DanielHesslow to find out that we didn't train with alibi (otherwise we would have seen good-ish performance, as it would be in the same setting as during pretraining). I think it's great we figured this out, but I think we should try as much as we can to stay close to the model we train. I think @DanielHesslow encountered some issues with deepspeed concerning this.
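A quick illustration of the isinstance point above (the class names here are stand-ins, not the real Megatron classes):

```python
class BasePipe:                 # stand-in for GPTModelPipe
    pass

class CustomPipe(BasePipe):     # a hypothetical subclass
    pass

model = CustomPipe()
assert isinstance(model, BasePipe)   # True: isinstance follows inheritance
assert type(model) is not BasePipe   # type equality would reject the subclass
```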

Contributor

it now works with GPTModelPipe

@DanielHesslow DanielHesslow requested a review from a team April 26, 2022 22:40
@TevenLeScao TevenLeScao self-requested a review April 26, 2022 23:04
Member

@thomasw21 thomasw21 left a comment

Small README changes, otherwise looks good.

Comment on lines +20 to +27
2. Pre-download needed datasets

some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.
Member

Don't think that's needed anymore, as everything was migrated to use the datasets caching system.

On login console with external network

Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness)
Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7` needed to download some tasks.
Member

Maybe we should be able to bump the version to master, which should have the fixed version.

Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7` needed to download some tasks.
```
start-prod
pip install best-download==0.0.7
Member

Suggested change
pip install best-download==0.0.7

Comment on lines +22 to +27
some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.
Member

Suggested change
some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.

If there are things like custom tokenizers, pre-download those too, e.g.:

```
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
Member

Suggested change
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/tokenizer')"

Member

Also, you need `tokenizers>0.12.1`.

(torchDDP, LocalDDP, Float16Module))

optimizer = get_megatron_optimizer(unwrapped_model)
if args.inference:
Member

do we not already have do_train?

@Muennighoff
Collaborator

I added an arg to reduce bootstrap_iters; by default it's the same as before.
Unfortunately, reducing it to 2 only saves 1min20 on the 3h30min PIQA run.

@stas00
Contributor

stas00 commented May 13, 2022

Thank you, @Muennighoff

and please remember to document those flags in the .md doc in this PR, and when to use them. Thanks!

if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0:
print(json.dumps(results, indent=2))
results_path = args.results_path.replace(".json", f"_{task_name}.json")
with open(f"{results_path}", 'w') as outfile:
Contributor

this is non-atomic and could lead to data loss.

it's probably very unlikely to happen unless someone hits Ctrl-C at exactly that moment, but why not use a safer practice if it's not too hard.
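A sketch of the safer practice being suggested (write to a temporary file in the same directory, then rename over the target; the function name is illustrative):

```python
import json
import os
import tempfile

def atomic_json_dump(obj, path):
    """Write JSON to `path` without risking a half-written file on interruption."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f, indent=2)
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.remove(tmp_path)
        raise
```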

Collaborator

Good point; Maybe check if the path exists & only write if it does not exist?

Collaborator

Why not add a timestamp to the file name? Seems like useful info to have + avoids name clashes

Collaborator

Good idea; The rarely used default results_path already has a timestamp, so if the default is used, there will be two timestamps, but that should be okay.
If @stas00 agrees, I'll adjust it to:

timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
results_path = args.results_path.replace(".json", f"_{timestamp}_{task_name}.json")

Contributor

@stas00 stas00 May 14, 2022

Actually, this was my bad, I missed that you are creating a new file for each task; I initially read it as an incremental write with more tasks. But maybe that is one way? Keep reusing the same file and adding results to it?

The other critical part is the checkpoint: after multiple runs, what we really want is a single results file per checkpoint (and ideally in the same order, so that it's easy to import into Excel tables).

So I'd use the format of:

checkpoint_name_results.json which gets appended to, e.g. lm-eval-results_40000.json

what do you think?

and then just append to that file, rather than overwrite?

Collaborator

I think for 1) we can just maintain a 2nd file with the same contents, see below.
For 2), getting the model_name reliably from the ckpt path would be difficult, as we have e.g.
/checkpoints/tr3m-1B3-emb-norm-pile/global_step296023
/checkpoints/tr11-176B-ml/checkpoints/main/global_step41000. I would instead use the existing args.results_path argument, which is set to the VARIANT, which is set to the model name in our slurm script. If the iteration_id refers to the global_step, I think we can add it as below?

    if args.intermed_results:
        global_results = {"results": {}, "versions": {}}
        timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        iteration_id = args.load.split("/")[-1].replace("/", "")
        results_path = args.results_path.replace(".json", f"_lm-eval_{iteration_id}_{timestamp}.json")
        # Backup file in case of interruption during writing
        results_path_backup = args.results_path.replace(".json", f"_lm-eval_{iteration_id}_{timestamp}_backup.json")
        for task_name, task in task_dict.items():
            results = evaluator.evaluate(adaptor, {task_name: task}, False, 0, 10, bootstrap_iters=args.bootstrap_iters)
            global_results["results"] = {**global_results["results"], **results["results"]}
            global_results["versions"] = {**global_results["versions"], **results["versions"]}
            if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0:
                print(json.dumps(results, indent=2))
                with open(results_path, 'w') as outfile:
                    json.dump(global_results, outfile, indent = 4)
                with open(results_path_backup, 'w') as outfile:
                    json.dump(global_results, outfile, indent = 4)

Contributor

@stas00 stas00 May 14, 2022

Now we are super safe.

Unless others disagree (after all, we are modifying the output file name already provided by the user), it looks great to me.

Thank you!

Contributor

> I think the only problem with this is that if results_path wasn't provided the default already includes a timestamp.

ok, then revert that change where I added the timestamp to the default and then we are good.

Collaborator

@Muennighoff Muennighoff May 15, 2022

Yes, but then if args.intermed_results & results_path is not set, there would be no timestamp at all.
I would either leave it as is, given we always set results_path to the model name, or make results_path mandatory & maybe rename it to model_name so that it's supplied without .json, and then always append a timestamp to it.

Edit: Pushing the above code for now; will change if it's not okay.

Contributor

They all sound good - so whatever you feel is a better choice, Niklas.

@stas00
Contributor

stas00 commented Jun 16, 2022

Should we merge this PR?

@Muennighoff
Collaborator

> Should we merge this PR?

Good from my side; @DanielHesslow ?

@thomasw21
Member

There are a few discrepancies compared to the model we're currently training (typically some of the layer norm weight syncing isn't here, but I don't think this would actually change much), so we should merge into master as well as layer_norm_auto_sync, and start running eval on the same branch as the one we're training on.

@DanielHesslow
Collaborator Author

Merging is all good with me!

@TevenLeScao
Collaborator

Since everyone seems to have agreed to merge, I'll do so tonight unless someone objects

@TevenLeScao TevenLeScao merged commit 3ab0ad1 into main Jun 28, 2022
@TevenLeScao TevenLeScao deleted the eval_harness branch June 28, 2022 14:47