Evaluating Legal Reasoning Traces with Legal Issue Trees
Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier
Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously limited in both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning: RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
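For intuition only, an issue-tree rubric of the kind described above might look roughly like the nested structure below. This is a hypothetical sketch; the field names and exact schema of the released dataset may differ.

```python
# Hypothetical sketch of a hierarchical legal-issue rubric (NOT the actual LEGIT schema).
issue_tree = {
    "issue": "Is the defendant liable for breach of contract?",
    "plaintiff_argument": "The defendant failed to deliver the goods by the agreed date.",
    "defendant_argument": "Delivery was excused because the plaintiff withheld the advance payment.",
    "court_conclusion": "The defendant is liable; the payment clause was not a condition precedent.",
    "sub_issues": [
        {
            "issue": "Was the advance-payment clause a condition precedent to delivery?",
            "plaintiff_argument": "...",
            "defendant_argument": "...",
            "court_conclusion": "...",
            "sub_issues": [],
        }
    ],
}

# A reasoning trace is then scored against such a tree along two axes:
#   coverage    -> how many issues in the tree the trace addresses
#   correctness -> whether the trace's conclusions agree with the court's conclusions
```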
Dataset link: Huggingface
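If the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the `datasets` library; the repository ID below is a placeholder that should be replaced with the ID from the link above.

```python
from datasets import load_dataset

# Placeholder repository ID -- replace with the actual ID from the Hugging Face link above.
legit = load_dataset("<org>/<legit-dataset-id>")
print(legit)  # shows the available splits (split names depend on the release)
```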
Install the Python requirements by running:
virtualenv .venv
source .venv/bin/activate # create and activate venv
pip install -r requirements.txt # install python dependencies

LEGIT currently supports vllm, ollama, vertexai (Gemini), and openai (OpenAI GPT/oN) for LLM generation.
If you need to add a different framework to test your own LLM, write your own `async def generate()` function in a new package under `utils/` and add the appropriate initialization code in `utils/router.py`.
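As a rough sketch (the exact signature expected by `utils/router.py` may differ; check the backends that already exist under `utils/`), a custom backend module could look something like this:

```python
# utils/my_backend.py -- hypothetical example of a custom generation backend.
# The function name follows the README; the signature here is an assumption.
import asyncio


async def generate(prompts: list[str], model: str, **kwargs) -> list[str]:
    """Return one completion per prompt using your own LLM client."""

    async def _call(prompt: str) -> str:
        # Replace this stub with a real API or local-inference call.
        await asyncio.sleep(0)
        return f"[{model}] response to: {prompt[:50]}"

    return await asyncio.gather(*(_call(p) for p in prompts))
```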
Execute this script:
# PACKAGE: vllm -> MODEL: huggingface ID for `load_from_pretrained` function (e.g., Qwen/Qwen2.5-3B-Instruct, ./my_models/checkpoints/step_500)
# PACKAGE: ollama -> MODEL: ollama ID (e.g., llama3.1)
# PACKAGE: vertexai -> MODEL: vertexai model ID (e.g., gemini-2.0-flash)
# PACKAGE: openai -> MODEL: openai model ID (e.g., gpt-4.1, o3)
python reasoning_task_solve.py --package {PACKAGE} --model {MODEL}
# This will generate the `results/reasoning_tasks_{MODEL}.jsonl` file.
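To take a quick look at the generated responses, a snippet along these lines should suffice (the path shown is just an example, and the exact field names in the JSONL records depend on the script):

```python
import json

# Peek at the first record of a generated results file (example path).
with open("results/reasoning_tasks_Qwen2.5-3B-Instruct.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))

print(sorted(first.keys()))  # which fields the script wrote for each instance
```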
Next, evaluate the generated responses by executing this script:

# RESPONSE_FILE: results/reasoning_tasks_{GENERATOR_MODEL}.jsonl
python reasoning_task_solve.py --package {PACKAGE} --model {MODEL} --response_path {RESPONSE_FILE}
# This will generate the `results/reasoning_tasks_{GENERATOR_MODEL}_evaluator_{MODEL}/.jsonl` file.

For retrieval-augmented generation (RAG), we first run batch retrieval using the scripts below. These will use the test data (data/reasoning_tasks_test.jsonl) and add retrieval results to it.
The retrieval base can be found in data/deduplicated_relevant_laws.json.
python lawretrieval_baseline_{RETRIEVAL_METHOD}.py
# This will generate the `data/lawretrieval_test_{RETRIEVAL_METHOD}.jsonl` file.
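For intuition, a minimal BM25 baseline over the law base might look roughly like the sketch below (using the `rank_bm25` package, and assuming the JSON maps law identifiers to their text; the repository's own `lawretrieval_baseline_*.py` scripts may be organized differently):

```python
import json

from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Load the retrieval base; the exact JSON structure is an assumption here.
with open("data/deduplicated_relevant_laws.json", encoding="utf-8") as f:
    laws = json.load(f)  # assumed: {law_id: law_text, ...}

law_ids, law_texts = list(laws.keys()), list(laws.values())
bm25 = BM25Okapi([text.split() for text in law_texts])


def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the ids of the top-k laws for a query under BM25."""
    scores = bm25.get_scores(query.split())
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [law_ids[i] for i in top]
```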
Finally, generate the responses with the following script, and evaluate the generated data with the evaluation script above.

# RETRIEVAL_METHOD: bm25, contriever, groundtruth
python reasoning_task_solve_rag.py --package {PACKAGE} --model {MODEL} --retrieval_method {RETRIEVAL_METHOD}
# This will generate the `results/reasoning_tasks_rag_{RETRIEVAL_METHOD}_{MODEL}.jsonl` file.

For RL training with rubrics, first install the custom verl fork (branched from 0.5.0 to accommodate Gemma 3):
cd verl
pip install -e .
cd ..

Next, download the verl training data using this script:
python utils/verl/download_legit_data.py

Finally, execute verl training via the command-line interface. We provide the configuration used in the paper in verl_trainer_example.sh.
If you open the script file, you will find (1) a commented-out line that starts the vLLM server used for LLM-as-a-judge evaluation, and (2) two occurrences of CUDA_VISIBLE_DEVICES. Modify these as necessary for your setup.
