Eval harness #212
Tensor parallelism should now work as well. Will coordinate with Jason Phang.

This should now be all good. We're getting the same results as Jason, who has in turn verified them against the HF models.
Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>
thomasw21
left a comment
Partial review. Thanks for the great work
```
PP_SIZE=1
TP_SIZE=1
```
Do those actually not depend on the checkpoint? If my checkpoint was saved with PP_SIZE=2, can I run evaluation with PP_SIZE=1? Same question for TP.
Yup, it works. The checkpoints are loaded with the state dicts produced by the deepspeed_to_megatron converter, so it'll merge the model-parallel partitions etc.
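For context, a rough sketch of what merging tensor-parallel partitions amounts to (just the general idea, not the converter's actual code; the helper name and shapes are made up):

```python
import torch

def merge_tp_shards(shards, parallel_dim):
    """Concatenate per-rank weight shards back into one full weight.

    In Megatron, column-parallel layers are split along the output dim
    (dim 0 of the weight) and row-parallel layers along the input dim (dim 1).
    """
    return torch.cat(shards, dim=parallel_dim)

# Hypothetical example: a column-parallel weight split across 2 TP ranks.
rank0_shard = torch.randn(2048, 1024)
rank1_shard = torch.randn(2048, 1024)
full_weight = merge_tp_shards([rank0_shard, rank1_shard], parallel_dim=0)  # -> [4096, 1024]
```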
Wait, that script converts to Megatron format, i.e. Megatron-LM, no? Can you load a rotary embedding checkpoint?
Megatron as in the state_dict format that you get when you run without the --deepspeed flag, e.g. some layers are named differently since we're not wrapping the model in DeepSpeed's pipelining primitives. I don't see why rotary shouldn't work, but I haven't tested it in particular.
```
--vocab-file $VOCAB_FILE\
--merge-file $MERGE_FILE\
```
Why do you need this?
The vocab-file/merge-file arguments just store a path, so if that path is not available during eval it'll crash.
```
--merge-file $MERGE_FILE\
--micro-batch-size 64\
--adaptive_seq_len\
--eval_fp32\
```
Interesting. Have you seen performance changes between fp16 and fp32?
Yeah, if we change the seq_length during the eval the performance degrades a bit in fp16. Not entirely sure why, but it's stable in fp32 and using adaptive_seq_len speeds things up by a lot.
```
# Downloads the specified tasks in the evaluation harness
# This is particularly useful when running in environments where the GPU nodes
# do not have internet access. This way we can pre-download them and use the cached data-set during evaluation.
```
That's awesome! 🤯
```diff
 import torch
 from collections import OrderedDict
-from deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint
+from .deepspeed_checkpoint import ARGS_KEY, DeepSpeedCheckpoint
```
unwanted change?
Nope, intentional. Otherwise we'd need to add tools/convert_checkpoint/ to the path in order to import deepspeed_to_megatron.py.
I proposed to move all these scripts to the normal Meg-DS module files here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/121/files
but I'm not sure if that PR will be finished any time soon.
Once they live in the normal module files this won't be needed. But it's fine for now.
Or move these, like I proposed in the PR above, in this PR already... either way works.
Python isn't very friendly to scripts that come with extra files...
```python
from logging import logMultiprocessing
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
```
Why is this needed?
Adds the root of the repo to the path so we can import stuff from there without pip installing the repo. Not pretty, but just copied from tasks/main.py 🤷
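For reference, a sketch of the full pattern the truncated line above uses (assuming the script sits two directories below the repo root, as tasks/eval_harness/evaluate.py does; the exact number of os.path.pardir hops depends on where the script lives):

```python
import os
import sys

# Prepend the repo root so `import megatron` etc. resolve
# without pip-installing the repo.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__),
                                             os.path.pardir,
                                             os.path.pardir)))
```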
Why is it important to not pip install the repo?
Also, currently it can't find the megatron files (which can't be installed). I had to work around it with:
```
cd /gpfsssd/worksf/projects/rech/six/commun/code/eval/Megatron-DeepSpeed
PYTHONPATH=. sh ./run_evalharness.sh
```
so you probably need to find the root and add it to sys.path as well.
When you say "can't install", you mean that pip install -e . didn't work? If so, I tried figuring out why it causes ModuleNotFoundError and I'm still not sure, but I had a workaround that fixed it for me: #173 (comment)
> When you say "can't install", you mean that pip install -e . didn't work?

That's right. For the moment I added the PYTHONPATH= to the slurm script and it works.
```python
def main():
    task_list = ALL_TASKS if args.task_list == 'all' else args.task_list.split(',')
    tasks.get_task_dict(task_list)
```
After testing this script: for some specific tasks like lambada or triviaqa it creates a data folder wherever you called the script from. This then needs to be moved to the root of lm-evaluation-harness. I think this should be fixed in that repo though... (we can always manually move those files)
Hmm okay, my bad, I didn't test those tasks last night. I can also just revert the last commit with the pickle workaround for now.
Not your fault at all, I needed it to load all the datasets, and it turns out half of them have a manual script. I think you can just add a comment (we could even build a symlink), but the better thing would be to update their repo to install their data locally when we use their API.
SaulLu
left a comment
Thank you so much for all your hard work Daniel and Jason!
I left a minor comment below for my own understanding.
Also, I wonder if we should add the lm_eval dependency somewhere (not sure where it's best to do that, maybe in `extras_require` in `setup.py` or explaining it in the `README.md`).
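For illustration, a hedged sketch of what an optional dependency group in setup.py could look like (the package names, versions and the extra's name are assumptions, not something this PR adds):

```python
# setup.py (excerpt) -- hypothetical "eval" extra for the harness dependencies
from setuptools import setup, find_packages

setup(
    name="megatron-deepspeed",
    packages=find_packages(),
    extras_require={
        # installed via: pip install -e ".[eval]"
        "eval": [
            "lm-eval",               # EleutherAI evaluation harness
            "best-download==0.0.7",  # used by some harness task downloaders
        ],
    },
)
```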
tasks/eval_harness/evaluate.py
Outdated
```python
# Initialize megatron model using the parsed state dict.
sd = _create_rank_checkpoint(ds_checkpoint, None, mpu.get_tensor_model_parallel_rank(), mpu.get_pipeline_model_parallel_rank(), True)

model = get_model(model_provider)[0]
```
Can we add `assert(isinstance(model, GPTModelPipe))`?
Maybe it makes more sense to check if model inherits from GPTModelPipe?
I'm pretty sure isinstance checks the inheritance, so even if we implement a model that inherits from GPTModelPipe, the assert would pass. Though technically all models we currently train are GPTModelPipe, so using type equality should be fine.
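A minimal illustration of the difference, with stand-in classes rather than the real Megatron ones:

```python
class GPTModelPipe:                         # stand-in for the real class
    pass

class HypotheticalModelPipe(GPTModelPipe):  # made-up subclass
    pass

model = HypotheticalModelPipe()
assert isinstance(model, GPTModelPipe)      # passes: isinstance follows inheritance
assert type(model) is not GPTModelPipe      # type equality matches only the exact class
```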
So it turns out this is GPTModel instead of GPTModelPipe, which is one of the inconsistencies that led @DanielHesslow to find out that we didn't train with alibi (otherwise we would have seen good-ish performance, as it would be the same setting as during pretraining). I think it's great we figured this out, but we should try as much as we can to stay close to the model we train. I think @DanielHesslow encountered some issues with DeepSpeed concerning this.
It now works with GPTModelPipe.
thomasw21
left a comment
Small README changes, otherwise looks good.
2. Pre-download needed datasets

some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.
Don't think that's needed anymore, as everything was migrated to use the datasets caching system.
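For reference, a sketch of the pre-download-then-run-offline flow with the datasets caching system (the dataset name is just an example; the environment variables are the standard Hugging Face ones):

```python
# Run on a login node with internet access to populate the local datasets cache.
from datasets import load_dataset

load_dataset("lambada")  # example; cached under ~/.cache/huggingface/datasets by default

# On the offline GPU nodes, force reuse of the cached copies, e.g. in the job script:
#   export HF_DATASETS_OFFLINE=1
#   export TRANSFORMERS_OFFLINE=1
```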
```diff
 On login console with external network

-Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness)
+Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7` needed to download some tasks.
```
Maybe we should be able to bump the version to master, which should have the fixed version.
Get lm-eval harness (https://github.com/EleutherAI/lm-evaluation-harness) and `best-download==0.0.7` needed to download some tasks.
```
start-prod
pip install best-download==0.0.7
```
Suggested change (remove this line):
```diff
-pip install best-download==0.0.7
```
Suggested change (remove these lines):

some symlinks due to lm-harness' issues with relative position of data
```
mkdir data
ln -s `pwd`/data tasks/eval_harness/data
```
Also make sure `data` is not on one of the limited partitions like WORKSF.
If there are things like custom tokenizers, pre-download those too, e.g.:
```
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
```
Suggested change:
```diff
-python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/oscar_13_languages_alpha_weight')"
+python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('bigscience/tokenizer')"
```
Also you need `tokenizers>0.12.1`.
```python
(torchDDP, LocalDDP, Float16Module))

optimizer = get_megatron_optimizer(unwrapped_model)
if args.inference:
```
Do we not already have `do_train`?
I added an arg to reduce

Thank you, @Muennighoff and please remember to document those flags in the .md doc in this PR - and when to use those. Thanks!
tasks/eval_harness/evaluate.py
Outdated
```python
if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0:
    print(json.dumps(results, indent=2))
    results_path = args.results_path.replace(".json", f"_{task_name}.json")
    with open(f"{results_path}", 'w') as outfile:
```
This is non-atomic and could lead to loss of all the data.
It's probably very unlikely to happen unless someone hits Ctrl-C at this moment, but why not use a safer practice if it's not too hard.
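One such safer practice (a sketch, not necessarily what this PR should adopt): write to a temporary file in the same directory and atomically rename it over the target, so an interrupted run never leaves a truncated results file behind.

```python
import json
import os
import tempfile

def atomic_json_dump(obj, path):
    """Write JSON to `path` without ever exposing a partially written file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory so os.replace() stays on one filesystem.
    with tempfile.NamedTemporaryFile("w", dir=dir_name, delete=False) as tmp:
        json.dump(obj, tmp, indent=2)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp.name, path)  # atomic rename on POSIX
```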
Good point; Maybe check if the path exists & only write if it does not exist?
Why not add a timestamp to the file name? Seems like useful info to have + avoids name clashes
Good idea; The rarely used default results_path already has a timestamp, so if the default is used, there will be two timestamps, but that should be okay.
If @stas00 agrees, I'll adjust it to:
```python
timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
results_path = args.results_path.replace(".json", f"_{timestamp}_{task_name}.json")
```
Actually, this was my bad, I missed that you are creating a new file for each task; I initially read it as an incremental write with more tasks. But maybe that is one way? Keep reusing the same file and adding results to it?
The other critical part is the checkpoint: what we really want after multiple runs is a single results file per checkpoint (and ideally in the same order, so that it's easy to import into Excel tables).
So I'd use the format of `checkpoint_name_results.json` which gets appended to, e.g. `lm-eval-results_40000.json`.
What do you think?
And then just append to that file, rather than overwrite?
I think for 1), we can just maintain a 2nd file with the same contents, see below.
For 2), getting the model_name reliably from the ckpt path would be difficult as we have e.g.
`/checkpoints/tr3m-1B3-emb-norm-pile/global_step296023`
`/checkpoints/tr11-176B-ml/checkpoints/main/global_step41000`.
I would instead use the existing args.result_path argument, which is set to the VARIANT, which is set to the model name in our slurm script. If the iteration_id refers to the global_step, I think we can add it as below:
```python
if args.intermed_results:
    global_results = {"results": {}, "versions": {}}
    timestamp = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
    iteration_id = args.load.split("/")[-1].replace("/", "")
    results_path = args.results_path.replace(".json", f"_lm-eval_{iteration_id}_{timestamp}.json")
    # Backup file in case of interruption during writing
    results_path_backup = args.results_path.replace(".json", f"_lm-eval_{iteration_id}_{timestamp}_backup.json")
    for task_name, task in task_dict.items():
        results = evaluator.evaluate(adaptor, {task_name: task}, False, 0, 10, bootstrap_iters=args.bootstrap_iters)
        global_results["results"] = {**global_results["results"], **results["results"]}
        global_results["versions"] = {**global_results["versions"], **results["versions"]}
        if mpu.is_pipeline_last_stage() and mpu.get_tensor_model_parallel_rank() == 0:
            print(json.dumps(results, indent=2))
            with open(results_path, 'w') as outfile:
                json.dump(global_results, outfile, indent=4)
            with open(results_path_backup, 'w') as outfile:
                json.dump(global_results, outfile, indent=4)
```
Now we are super safe.
Unless others disagree (after all, we are modifying the output file name already provided by the user), it looks great to me.
Thank you!
> I think the only problem with this is that if results_path wasn't provided the default already includes a timestamp.

Ok, then revert that change where I added the timestamp to the default and then we are good.
Yes, but then if args.intermed_results is set & results_path is not set, there would be no timestamp at all.
I would either leave it as is, given we always set results_path to the model name, or make results_path mandatory & maybe rename it to model_name so that it's supplied without .json, and then always append a timestamp to it.
Edit: Pushing the above code for now; will change if not okay.
They all sound good - so whatever you feel is a better choice, Niklas.
Should we merge this PR?

Good from my side; @DanielHesslow ?

There are a few discrepancies compared to the one we're currently training on (typically some of the layer norm weight syncing isn't here, but I don't think this would change much actually), so we should merge into master as well as

Merging is all good with me!

Since everyone seems to have agreed to merge, I'll do so tonight unless someone objects.
Providing the functionality for running the EleutherAI evaluation harness on Megatron checkpoints, addressing #137.
In order to run on JZ we need to cache the tasks locally, since the GPU nodes do not have internet access.
eval_harness/download.py provides that functionality.
Currently pipeline-parallel models work, but model parallel needs to be tested.
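A condensed sketch of what that pre-download step looks like, based on the get_task_dict call quoted in the review above (the argument parsing here is illustrative, not the script's exact code):

```python
# Run on a node with internet access so the task datasets end up in the local cache
# and can be reused on the offline GPU nodes.
import argparse

from lm_eval import tasks
from lm_eval.tasks import ALL_TASKS


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_list", default="all",
                        help="comma-separated lm-eval task names, or 'all'")
    args = parser.parse_args()

    task_list = ALL_TASKS if args.task_list == "all" else args.task_list.split(",")
    # Instantiating the tasks triggers the dataset downloads / caching.
    tasks.get_task_dict(task_list)


if __name__ == "__main__":
    main()
```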