Authors:
Jacqueline He, Howard Yen, Margaret Li, Shuyue Stella Li, Zhiyuan Zeng, Weijia Shi,
Yulia Tsvetkov, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer
This repository contains the code and data for our NeurIPS 2025 paper: Precise Information Control in Long-Form Generation.
PIC (Precise Information Control) is a new task formulation that evaluates intrinsic hallucination at the granularity of discrete verifiable claims. PIC raises a fundamental question: can LMs produce long-form responses that are strictly grounded on a set of explicit statements, without introducing any unsupported claims? We introduce a benchmark (PIC-Bench), a training framework (PIC-LM), and several use cases that demonstrate the utility of exploring this problem.
@inproceedings{he2025precise,
title={Precise Information Control in Long-Form Text Generation},
author={Jacqueline He and Howard Yen and Margaret Li and Shuyue Stella Li and Zhiyuan Zeng and Weijia Shi and Yulia Tsvetkov and Danqi Chen and Pang Wei Koh and Luke Zettlemoyer},
booktitle={Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)},
year={2025},
url={https://arxiv.org/abs/2506.06589},
}- Installation
- Precise Information Control (PIC)
- Code Acknowledgments
- License
- Troubleshooting or Questions
Step 1. Set up conda environment
We provide an environment.yaml file with exact dependency versions. You can set up a new Python 3.10 environment using Conda by running:
conda env create -f environment.yaml
conda activate picStep 2. Install other Python packages
FlashAttention is a package for fast inference that must be installed manually, due to CUDA build requirements:
pip install flash-attn==2.5.7 --no-build-isolationLikewise, if you are interested in scoring PIC-Bench generations or running our factuality pipeline, you would additionally need to install the following spaCy English pipelines, en_core_web_sm and en_core_web_trf for sentence splitting and named entity recognition:
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_trf Our contributions are three-fold: we provide a benchmark (PIC-Bench), a training framework (PIC-LM), and use cases demonstrating the utility of PIC.
- PIC-Bench: Generation and evaluation scripts for our benchmark.
- PIC-LM: PIC-LM model checkpoints, and related training and preference data generation scripts.
- PIC Use Cases: RAG and factuality pipeline scripts.
The PIC-Bench evaluation data is located as a HF dataset here, in which each split constitutes one task. By default, LMs are evaluated on all eight PIC-Bench (six full PIC, two partial PIC) tasks.
We provide generation support for three inference engines: vLLM, OpenAI, and Anthropic, the last two of which require user-provided API keys. Default Hydra configs with the exact settings reported in the paper are in conf/eval as default_gen_vllm.yaml, default_gen_openai.yaml, and default_gen_anthropic.yaml, respectively. We provide the following example bash scripts below in scripts/ (all using 1 GPU):
# Generate PIC-Bench evals for PIC-LM 8B
bash examples/pic_bench_scripts/eval_vllm.sh
# Generate PIC-Bench evals for GPT-4o
OPENAI_API_KEY=sk-proj-xxx... bash examples/pic_bench_scripts/eval_openai.sh
# Generate PIC-Bench evals for Claude 3.5 Sonnet
ANTHROPIC_API_KEY=sk-ant-xxx... bash examples/pic_bench_scripts/eval_anthropic.sh PIC-Bench evaluation relies on OpenAI's gpt4o-mini and consists of three stages: (1) verifiable claim extraction, (2) claim verification, and (3) score calculation. Given an input file (e.g., task.jsonl), after each stage we output a cached intermediate file to avoid redundant API calls:
task_ce.jsonl(after claim extraction)task_cv.jsonl(after claim verification)task_sc.jsonl(after score calculation)
The Hydra config conf/eval/default_scoring.yaml sets the default settings. With a valid OpenAI API key, you can score any task file as follows:
OPENAI_API_KEY=sk-proj-xxx... python -m pic_bench.score_eval_data \
--config-name default_scoring filepath=task.jsonlWe parallelize the evaluation process wherever possible for efficient inference. In particular, if you are using a SLURM cluster, you can evaluate each PIC-Bench task independently by using a job array across 8 indices, one for each task. examples/pic_bench_scripts/score.sh takes in the following positional arguments: the Hydra config (which defaults to default_scoring.yaml), an index that maps to a PIC task, the number of job workers (defined by CPUs per task), and the model name (which assumes you have run the generation script in the previous section).
#SBATCH --array=0-7
#SBATCH --cpus-per-task=8
CONFIG_NAME=default_scoring
MODEL_NAME=jacquelinehe/Llama-3.1-PIC-LM-8B
OPENAI_API_KEY=sk-proj-xxx... bash examples/pic_bench_scripts/score.sh ${CONFIG_NAME} ${SLURM_ARRAY_TASK_ID} ${SLURM_CPUS_PER_TASK} ${MODEL_NAME}In our experience, a complete PIC-Bench evaluation run takes approximately $7 USD in OpenAI credits, and <30 minutes with parallel SLURM jobs.
Here are our 8B PIC-LM HuggingFace checkpoints:
| Stage | PIC-LM 8B Checkpoints |
|---|---|
| SFT | jacquelinehe/Llama-3.1-PIC-LM-8B-SFT |
| DPO | jacquelinehe/Llama-3.1-PIC-LM-8B |
PIC-LM 8B is initialized from meta-llama/Llama-3.1-8B-Instruct. Please see the pic_lm/README.md for PIC-LM training framework details.
In this section, we show that PIC-LM 8B can yield better end-task factuality accuracy via the following settings: (1) retrieval-augmented generation (RAG) using the partial PIC setting; (2) a chain-of-verification-inspired pipeline using the full PIC setting.
RAG
We test on ASQA, an ambiguous factoid long-form QA task, using retrieved results (top-5 documents, using GTR, from a Wikipedia index) prepared by the authors of ALCE. We did some further post-processing by decomposing the retrieved context passages into verified claims, which can be found in use_cases/data/asqa.jsonl. The main script is use_cases/rag.py, which generates and evaluates your LM, and the Hydra config with default settings is in conf/use_cases/default_rag_asqa.yaml.
# Generate and evaluate on ASQA with retrieved docs
bash examples/use_case_scripts/eval_rag_asqa.sh --config_name default_rag_asqaFactuality Pipeline
A key requirement of PIC-LM is the provision of well-informed input claims. Even in the absence of external claims, PIC-LM can still improve factuality in a pipeline that leverages chain-of-verification and self-consistency sampling to autonomously generate and self-check its own factual claims. In this pipeline, a sufficiently capable LM first generates a draft answer to a given question. It then produces and independently answers its own verification questions via self-consistency sampling, yielding a set of reasonably factual, verified claims. Finally, PIC-LM serves as a drop-in component to generate a factual response grounded entirely on these verified claims.
We deploy our factuality pipeline on two tasks:
- Birthplace: A factoid task that uses the template: "Name some {occupation}s born in {location}.", given seed {occupation, location} pairs. We evaluate factual precision via claim verification against retrieved Google search results. Thus, running evaluation requires both an OpenAI API key and a Serper API key.
- QAMParI: A QA task for which multiple entities exist for a particular question. We use the data released by Amouyal et al., 2022, and evaluation does not rely on API calls.
# Run pipeline on the birthplace task (Serper API key needed)
OPENAI_API_KEY=sk-proj-xxx... SERPER_KEY_PRIVATE=... bash examples/use_case_scripts/eval_pipeline_birthplace.sh --config_name default_pipeline_birthplace
# Run pipeline on QAMParI
bash examples/use_case_scripts/eval_pipeline_qampari.sh --config_name default_pipeline_qampari Our repository extensively builds on several existing codebases. Please consider citing or using these following works as well:
- ALCE (Gao et al., 2023):
@inproceedings{gao2023enabling,
title={Enabling Large Language Models to Generate Text with Citations},
author={Gao, Tianyu and Yen, Howard and Yu, Jiatong and Chen, Danqi},
year={2023},
booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
}
- VeriScore (Song et al., 2024):
@inproceedings{song2024veriscore,
title={{V}eri{S}core: Evaluating the factuality of verifiable claims in long-form text generation},
author={Song, Yixiao and Kim, Yekyung and Iyyer, Mohit},
year={2024},
booktitle={Findings of Empirical Methods in Natural Language Processing (EMNLP)},
}
- OpenInstruct / Tülu 3 (Lambert et al., 2024):
@article{lambert2024tulu3,
title={Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and Huang, Shengyi and Ivison, Hamish and Brahman, Faeze and Miranda, Lester James V. and Liu, Alisa and Dziri, Nouha and Lyu, Shane and Gu, Yuling and Malik, Saumya and Graf, Victoria and Hwang, Jena D. and Yang, Jiangjiang and Le Bras, Ronan and Tafjord, Oyvind and Wilhelm, Chris and Soldaini, Luca and Smith, Noah A. and Wang, Yizhong and Dasigi, Pradeep and Hajishirzi, Hannaneh},
year={2024},
email={tulu@allenai.org}
}
- Code Formatting:
Our bash scripts are formatted with shfmt, and our Python code with
.
The repository software is licensed under the MIT License. See the LICENSE file for details.
If you have any questions relating to either the code or paper, feel free to contact Jacqueline at jyyh@cs.washington.edu.
