Overview

This repository is a fork of https://github.com/openai/simple-evals, used for running benchmarks on the Reflection 70B model.

Running locally

For HumanEval

git clone https://github.com/openai/human-eval
pip install -e human-eval

For GPQA

Download this csv file in the directory: https://huggingface.co/datasets/Idavidrein/gpqa/resolve/main/gpqa_main.csv Note: You'll need to be authenticated with huggingface hub and accepted the conditions on the repo.
```
wget --header="Authorization: Bearer <hf_token>" https://huggingface.co/datasets/Idavidrein/gpqa/resolve/main/gpqa_main.csv
```

Run benchmarking

Install requirements: pip3 install -r requirements.txt
Start vllm server locally: vllm serve glaiveai/Reflection-Llama-3.1-70B --host 0.0.0.0 --port 5050 --tensor-parallel 8
Set OPENAI_API_KEY env var for running the equality checker.
Run python3 run_reflection_eval.py --evals mmlu humaneval gsm8k gpqa math
You can run evals on any model being served using vllm by creating a sampler for it, example samplers for llama 3.1 70B have been commented in the run_reflection_eval.py file.

Running ifeval

cd ifeval
python3 gen_results.py --input_file data/ifeval_input_data.jsonl --output_file data/reflection_output.jsonl --model_name glaiveai/Reflection-Llama-3.1-70B --base_url http://0.0.0.0:5050/v1
python3 -m evaluation_main \
  --input_data=./data/ifeval_input_data.jsonl \
  --input_response_data=./data/reflection_output.jsonl \
  --output_dir=./data/

You can pass --no-reflection arg to use a generic system prompt instead of the Reflection system prompt.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
ifeval		ifeval
sampler		sampler
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
common.py		common.py
gpqa_eval.py		gpqa_eval.py
gsm_eval.py		gsm_eval.py
humaneval_eval.py		humaneval_eval.py
math_eval.py		math_eval.py
mmlu_eval.py		mmlu_eval.py
requirements.txt		requirements.txt
run_reflection_eval.py		run_reflection_eval.py
type_definitions.py		type_definitions.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Running locally

For HumanEval

For GPQA

Run benchmarking

Running ifeval

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Overview

Running locally

For HumanEval

For GPQA

Run benchmarking

Running ifeval

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages