Eval harness #212
Tensor parallelism should now work as well. Will coordinate with Jason Phang.

This should now be all good. We're getting the same results as Jason, who has in turn verified them against the HF models.
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
thomasw21
left a comment
Partial review. Thanks for the great work
```
PP_SIZE=1
TP_SIZE=1
```
Do those actually not depend on the checkpoint? If my checkpoint was saved with PP_SIZE=2, can I run evaluation with PP_SIZE=1? Same question for TP.
Yup, it works. The checkpoints are loaded with the state dicts produced by the deepspeed_to_megatron converter, so it'll merge the model-parallel partitions etc.
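For context, a rough sketch of what merging tensor-parallel partitions amounts to (just the general idea, not the converter's actual code; the helper name and shapes are made up):

```python
import torch

def merge_tp_shards(shards, parallel_dim):
    """Concatenate per-rank weight shards back into one full weight.

    In Megatron, column-parallel layers are split along the output dim
    (dim 0 of the weight) and row-parallel layers along the input dim (dim 1).
    """
    return torch.cat(shards, dim=parallel_dim)

# Hypothetical example: a column-parallel weight split across 2 TP ranks.
rank0_shard = torch.randn(2048, 1024)
rank1_shard = torch.randn(2048, 1024)
full_weight = merge_tp_shards([rank0_shard, rank1_shard], parallel_dim=0)  # -> [4096, 1024]
```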
Wait, that script converts to Megatron format, i.e. Megatron-LM, no? Can you load a rotary embedding checkpoint?
Megatron as in the state_dict format that you get when you run without the --deepspeed flag, e.g. some layers are named differently since we're not wrapping the model in DeepSpeed's pipelining primitives. I don't see why rotary shouldn't work, but I haven't tested it in particular.
```
--vocab-file $VOCAB_FILE\
--merge-file $MERGE_FILE\
```
Why do you need this?
The vocab-file/merge-file arguments just store a path, so if that path is not available during eval it'll crash.
```
--merge-file $MERGE_FILE\
--micro-batch-size 64\
--adaptive_seq_len\
--eval_fp32\
```
Interesting. Have you seen performance changes between fp16 and fp32?
Yeah, if we change the seq_length during the eval the performance degrades a bit in fp16. Not entirely sure why, but it's stable in fp32 and using adaptive_seq_len speeds things up by a lot.
```
# Downloads the specified tasks in the evaluation harness
# This is particularly useful when running in environments where the GPU nodes
# do not have internet access. This way we can pre-download them and use the cached data-set during evaluation.
```
That's awesome! 🤯
```diff
 import torch
 from collections import OrderedDict
-from deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint
+from .deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint
```
unwanted change?
Nope, intentional. Otherwise we'd need to add tools/convert_checkpoint/ to the path in order to import deepspeed_to_megatron.py.
I proposed to move all these scripts to the normal Meg-DS module files here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/121/files
but I'm not sure if that PR will be finished any time soon.
Once they live in the normal module files this won't be needed. But it's fine for now.
Or move these, like I proposed in the PR above, in this PR already... either way works.
Python isn't very friendly to scripts that come with extra files...
```python
from logging import logMultiprocessing
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
```
Why is this needed?
Adds the root of the repo to the path so we can import stuff from there without pip installing the repo. Not pretty, but just copied from tasks/main.py 🤷
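For reference, a sketch of the full pattern the truncated line above uses (assuming the script sits two directories below the repo root, as tasks/eval_harness/evaluate.py does; the exact number of os.path.pardir hops depends on where the script lives):

```python
import os
import sys

# Prepend the repo root so `import megatron` etc. resolve
# without pip-installing the repo.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                             os.path.pardir,
                                             os.path.pardir)))
```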
Why is it important to not pip install the repo?
Also, currently it can't find the megatron files (which can't be installed). I had to work around it with:
```
cd /gpfsssd/worksf/projects/rech/six/commun/code/eval/Megatron-DeepSpeed
PYTHONPATH=. sh ./run_evalharness.sh
```
so you probably need to find the root and add it to sys.path as well.
When you say "can't install", you mean that pip install -e . didn't work? If so, I tried figuring out why it causes ModuleNotFoundError and I'm still not sure, but I had a workaround that fixed it for me: #173 (comment)
> When you say "can't install", you mean that pip install -e . didn't work?

That's right. For the moment I added the PYTHONPATH= to the slurm script and it works.
```python
def main():
    task_list = ALL_TASKS if args.task_list == 'all' else args.task_list.split(',')
    tasks.get_task_dict(task_list)
```
After testing this script: for some specific tasks like lambada or triviaqa it creates a data folder wherever you called the script from. This then needs to be moved to the root of lm-evaluation-harness. I think this should be fixed in that repo though... (we can always manually move those files)
Hmm okay, my bad, I didn't test those tasks last night. I can also just revert the last commit with the pickle workaround for now.
Not your fault at all, I needed it to load all the datasets, and it turns out half of them have a manual script. I think you can just add a comment (we could even build a symlink), but the better thing would be to update their repo to install their data locally when we use their API.
SaulLu
left a comment
Thank you so much for all your hard work Daniel and Jason!
I left a minor comment below for my own understanding.
Also, I wonder if we should add the lm_eval dependency somewhere (not sure where it's best to do that, maybe in `extras_require` in `setup.py` or explaining it in the `README.md`).
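For illustration, a hedged sketch of what an optional dependency group in setup.py could look like (the package names, versions and the extra's name are assumptions, not something this PR adds):

```python
# setup.py (excerpt) -- hypothetical "eval" extra for the harness dependencies
from setuptools import setup, find_packages

setup(
    name="megatron-deepspeed",
    packages=find_packages(),
    extras_require={
        # installed via: pip install -e ".[eval]"
        "eval": [
            "lm-eval",               # EleutherAI evaluation harness
            "best-download==0.0.7",  # used by some harness task downloaders
        ],
    },
)
```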
tasks/eval_harness/evaluate.py
Outdated
```python
# Initialize megatron model using the parsed state dict.
sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), mpu.get_pipeline_model_parallel_rank(), True)

model = get_model(model_provider)[0]
```
Can we add `assert(isinstance(model, GPTModelPipe))`?
Maybe it makes more sense to check if model inherits from GPTModelPipe?
I'm pretty sure isinstance checks the inheritance, so even if we implement a model that inherits from GPTModelPipe, the assert would pass. Though technically all models we currently train are GPTModelPipe, so using type equality should be fine.
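A minimal illustration of the difference, with stand-in classes rather than the real Megatron ones:

```python
class GPTModelPipe:                         # stand-in for the real class
    pass

class HypotheticalModelPipe(GPTModelPipe):  # made-up subclass
    pass

model = HypotheticalModelPipe()
assert isinstance(model, GPTModelPipe)      # passes: isinstance follows inheritance
assert type(model) is not GPTModelPipe      # type equality matches only the exact class
```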
So it turns out this is GPTModel instead of GPTModelPipe, which is one of the inconsistencies that led @DanielHesslow to find out that we didn't train with alibi (otherwise we would have seen good-ish performance, as it would be the same setting as during pretraining). I think it's great we figured this out, but we should try as much as we can to stay close to the model we train. I think @DanielHesslow encountered some issues with DeepSpeed concerning this.
It now works with GPTModelPipe.
thomasw21
left a comment
Small README changes, otherwise looks good.
2. Pre-download needed datasets

some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.
Don't think that's needed anymore, as everything was migrated to use the datasets caching system.
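For reference, a sketch of the pre-download-then-run-offline flow with the datasets caching system (the dataset name is just an example; the environment variables are the standard Hugging Face ones):

```python
# Run on a login node with internet access to populate the local datasets cache.
from datasets import load_dataset

load_dataset("lambada")  # example; cached under ~/.cache/huggingface/datasets by default

# On the offline GPU nodes, force reuse of the cached copies, e.g. in the job script:
#   export HF_DATASETS_OFFLINE=1
#   export TRANSFORMERS_OFFLINE=1
```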
```diff
 On login console with external network

-Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness)
+Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7` needed to download some tasks.
```
Maybe we should be able to bump the version to master, which should have the fixed version.
Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7` needed to download some tasks.
```
start-prod
pip install best-download==0.0.7
```
Suggested change (remove this line):
```diff
-pip install best-download==0.0.7
```
Suggested change (remove these lines):

some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.
If there are things like custom tokenizers, pre-download those too, e.g.:
```
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
```
Suggested change:
```diff
-python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
+python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/tokenizer')"
```
Also you need `tokenizers>0.12.1`.
```python
(torchDDP, LocalDDP, Float16Module))

optimizer = get_megatron_optimizer(unwrapped_model)
if args.inference:
```
Do we not already have `do_train`?
I added an arg to reduce

Thank you, @Muennighoff and please remember to document those flags in the .md doc in this PR - and when to use those. Thanks!
tasks/eval_harness/evaluate.py
Outdated
```python
if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0:
    print(json.dumps(results, indent=2))
    results_path = args.results_path.replace(".json", f"_{task_name}.json")
    with open(f"{results_path}", 'w') as outfile:
```
This is non-atomic and could lead to loss of all the data.
It's probably very unlikely to happen unless someone hits Ctrl-C at this moment, but why not use a safer practice if it's not too hard.
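One such safer practice (a sketch, not necessarily what this PR should adopt): write to a temporary file in the same directory and atomically rename it over the target, so an interrupted run never leaves a truncated results file behind.

```python
import json
import os
import tempfile

def atomic_json_dump(obj, path):
    """Write JSON to `path` without ever exposing a partially written file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory so os.replace() stays on one filesystem.
    with tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as tmp:
        json.dump(obj, tmp, indent=2)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp.name, path)  # atomic rename on POSIX
```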
Good point; Maybe check if the path exists & only write if it does not exist?
Why not add a timestamp to the file name? Seems like useful info to have + avoids name clashes
Good idea; The rarely used default results_path already has a timestamp, so if the default is used, there will be two timestamps, but that should be okay.
If @stas00 agrees, I'll adjust it to:
```python
timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
results_path = args.results_path.replace(".json", f"_{timestamp}_{task_name}.json")
```
Actually, this was my bad, I missed that you are creating a new file for each task; I initially read it as an incremental write with more tasks. But maybe that is one way? Keep reusing the same file and adding results to it?
The other critical part is the checkpoint: what we really want after multiple runs is a single results file per checkpoint (and ideally in the same order, so that it's easy to import into Excel tables).
So I'd use the format of `checkpoint_name_results.json` which gets appended to, e.g. `lm-eval-results_40000.json`.
What do you think?
And then just append to that file, rather than overwrite?
I think for 1), we can just maintain a 2nd file with the same contents, see below.
For 2), getting the model_name reliably from the ckpt path would be difficult as we have e.g.
`/checkpoints/tr3m-1B3-emb-norm-pile/global_step296023`
`/checkpoints/tr11-176B-ml/checkpoints/main/global_step41000`.
I would instead use the existing args.result_path argument, which is set to the VARIANT, which is set to the model name in our slurm script. If the iteration_id refers to the global_step, I think we can add it as below:
```python
if args.intermed_results:
    global_results = {"results": {}, "versions": {}}
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    iteration_id = args.load.split("/")[-1].replace("/", "")
    results_path = args.results_path.replace(".json", f"_lm-eval_{iteration_id}_{timestamp}.json")
    # Backup file in case of interruption during writing
    results_path_backup = args.results_path.replace(".json", f"_lm-eval_{iteration_id}_{timestamp}_backup.json")
    for task_name, task in task_dict.items():
        results = evaluator.evaluate(adaptor, {task_name: task}, False, 0, 10, bootstrap_iters=args.bootstrap_iters)
        global_results["results"] = {**global_results["results"], **results["results"]}
        global_results["versions"] = {**global_results["versions"], **results["versions"]}
        if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0:
            print(json.dumps(results, indent=2))
            with open(results_path, 'w') as outfile:
                json.dump(global_results, outfile, indent=4)
            with open(results_path_backup, 'w') as outfile:
                json.dump(global_results, outfile, indent=4)
```
Now we are super safe.
Unless others disagree (after all, we are modifying the output file name already provided by the user), it looks great to me.
Thank you!
> I think the only problem with this is that if results_path wasn't provided the default already includes a timestamp.

Ok, then revert that change where I added the timestamp to the default and then we are good.
Yes, but then if args.intermed_results is set & results_path is not set, there would be no timestamp at all.
I would either leave it as is, given we always set results_path to the model name, or make results_path mandatory & maybe rename it to model_name so that it's supplied without .json, and then always append a timestamp to it.
Edit: Pushing the above code for now; will change if not okay.
They all sound good - so whatever you feel is a better choice, Niklas.
Should we merge this PR?

Good from my side; @DanielHesslow ?

There are a few discrepancies compared to the one we're currently training on (typically some of the layer norm weight syncing isn't here, but I don't think this would change much actually), so we should merge into master as well as

Merging is all good with me!

Since everyone seems to have agreed to merge, I'll do so tonight unless someone objects.
Providing the functionality for running the EleutherAI evaluation harness on Megatron checkpoints, addressing #137.
In order to run on JZ we need to cache the tasks locally, since the GPU nodes do not have internet access.
eval_harness/download.py provides that functionality.
Currently pipeline-parallel models work, but model parallel needs to be tested.
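A condensed sketch of what that pre-download step looks like, based on the get_task_dict call quoted in the review above (the argument parsing here is illustrative, not the script's exact code):

```python
# Run on a node with internet access so the task datasets end up in the local cache
# and can be reused on the offline GPU nodes.
import argparse

from lm_eval import tasks
from lm_eval.tasks import ALL_TASKS


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_list", default="all",
                        help="comma-separated lm-eval task names, or 'all'")
    args = parser.parse_args()

    task_list = ALL_TASKS if args.task_list == "all" else args.task_list.split(",")
    # Instantiating the tasks triggers the dataset downloads / caching.
    tasks.get_task_dict(task_list)


if __name__ == "__main__":
    main()
```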