This repository contains code for our VeriScore factuality metric, a pipelined approach with three steps: (1) claim extraction, (2) evidence retrieval, and (3) claim verification. The package can be run either with closed-source LLMs (requires an OpenAI/Anthropic API key) or with fine-tuned models for claim extraction and verification that you can download and use. The evidence retrieval step is performed with Google Search via the Serper API, so you will need a Serper API key to use VeriScore.
Please see our Colab notebook for a demo!
You can choose between prompting OpenAI/Anthropic models or using our fine-tuned models via the `model_name` option. If you specify the path to a fine-tuned model checkpoint, the package automatically loads the local model for inference. If the model name is not in this format, it uses an API call instead.
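For example, a minimal sketch of the two ways to set `model_name` for the claim-extraction step (the input file and checkpoint path below are placeholders, not files shipped with this repository):

# Use an API model by name (illustrative file name):
python3 -m veriscore.extract_claims --data_dir ./data --input_file input.jsonl --model_name gpt-4-0125-preview
# Use a local fine-tuned checkpoint by passing its path (placeholder path; see also the use_external_model flags below):
python3 -m veriscore.extract_claims --data_dir ./data --input_file input.jsonl --model_name ./checkpoints/claim_extractor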
- Make a new Python 3.9+ environment using `virtualenv` or `conda`.
- Install the `veriscore` package using `pip`.
- Download `en_core_web_sm` using the `spacy` library.
- Our code supports inference with fine-tuned models based on the Unsloth library. To use this feature, you need to install the Unsloth library. If you use `conda`, make sure to follow the conda-specific instructions from Unsloth.
pip install --upgrade veriscore
python -m spacy download en_core_web_sm
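If you plan to run inference with the fine-tuned models, also install Unsloth; a typical pip install is shown below (follow Unsloth's conda-specific instructions instead if you use `conda`):

pip install unsloth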
- Download the `prompt` folder that contains txt files of the prompt templates (see the `prompt` folder in this repository).
- Add an OpenAI or Claude API key to an environment variable in your `bash` for the prompting approach:
export OPENAI_API_KEY_PERSONAL={your_openai_api_key}
export CLAUDE_API_KEY={your_claude_api_key}
- Set the Serper API key as an environment variable in your `bash` for evidence retrieval:
export SERPER_KEY_PRIVATE={your_serper_api_key}
- For the prompt-based approach, you need to set up `data_dir/demos/` with few-shot examples (a sketch follows below).
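For example, a minimal sketch assuming the default `./data` directory (the demo files themselves are placeholders here; use your own few-shot examples):

mkdir -p ./data/demos
# place your few-shot demonstration files in ./data/demos/ before running the prompt-based approach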
This is an end-to-end pipeline for running VeriScore.
python3 -m veriscore.veriscore --data_dir {data_dir} --input_file {input_file} --model_name_extraction {model_name_extraction} --model_name_verification {model_name_verification}
- `data_dir`: Directory containing input data. `./data` by default.
- `input_file`: Name of the input data file. It should be in the `jsonl` format (see the sample file after this list), where each line contains:
  - `question` (optional): A query to prompt a language model for an output. If the `question` key is provided, the model automatically adapts to QA prompting. If not, it extracts claims using non-QA prompting.
  - `response`: An output generated by the language model given the `question`.
  - `model`: Name of the model that generated the response.
  - `prompt_source`: Name of the dataset the `question` is from (e.g., FreshQA).
- `model_name_extraction`: Name of the model used for claim extraction; `gpt-4-0125-preview` by default.
- `model_name_verification`: Name of the model used for claim verification; `gpt-4o` by default.
- `use_external_extraction_model`: If specified, claim extraction uses your custom model instead of an API call. We use Unsloth for the fine-tuned model. False by default.
- `use_external_verification_model`: If specified, claim verification uses your custom model instead of an API call. We use Unsloth for the fine-tuned model. False by default.
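For example, a minimal illustrative input file and invocation (the file name, contents, and model choices are placeholders):

mkdir -p ./data
# illustrative one-line input file; replace with your own data
cat > ./data/sample_input.jsonl << 'EOF'
{"question": "Who wrote the novel Dune?", "response": "Dune was written by Frank Herbert and first published in 1965.", "model": "gpt-4o", "prompt_source": "FreshQA"}
EOF
python3 -m veriscore.veriscore --data_dir ./data --input_file sample_input.jsonl --model_name_extraction gpt-4-0125-preview --model_name_verification gpt-4o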
Other optional flags:
- `output_dir`: Directory for saving output data. `./data` by default.
- `cache_dir`: Directory for saving cache data. `./data/cache` by default.
- `label_n`: Type of label for claim verification. It can be `2` (binary) or `3` (ternary):
  - `2`: `supported` and `unsupported`
  - `3`: `supported`, `contradicted`, and `inconclusive`
- `search_res_num`: A hyperparameter for the number of search results retrieved per query. `10` by default.
Saving output:
`input_file_name` is the name of `--input_file` with the `.jsonl` extension removed.
- Extracted claims will be saved to `output_dir/claims_{input_file_name}.jsonl`.
- Searched evidence will be saved to `output_dir/evidence_{input_file_name}.jsonl`.
- Verified claims will be saved to `output_dir/model_output/verification_{input_file_name}.jsonl`.
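For example, with the illustrative `sample_input.jsonl` above and the default `output_dir` of `./data`, the pipeline writes:

./data/claims_sample_input.jsonl
./data/evidence_sample_input.jsonl
./data/model_output/verification_sample_input.jsonl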
Claim extraction:
python3 -m veriscore.extract_claims --data_dir {data_dir} --input_file {input_file} --model_name {model_name}
- `input_file`: Name of the input data file. It should be in the `jsonl` format, where each line contains:
  - `question` (optional): A query to prompt a language model for an output. If the `question` key is provided, the model automatically adapts to QA prompting. If not, it extracts claims using non-QA prompting.
  - `response`: An output generated by the language model given the `question`.
  - `model`: Name of the model that generated the response.
  - `prompt_source`: Name of the dataset the `question` is from (e.g., FreshQA).
- `model_name`: Name of the model used for claim extraction. `gpt-4-0125-preview` by default.
- `use_external_model`: If specified, claim extraction uses your custom model instead of API calls. We use Unsloth for the fine-tuned models. False by default.
output:
{
"question": question.strip(),
"prompt_source": prompt_source,
"response": response.strip(),
"prompt_tok_cnt": prompt_tok_cnt,
"response_tok_cnt": response_tok_cnt,
"model": model,
"abstained": False,
"claim_list": list of claims for each snippet,
"all_claims": list of all claims
}
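To sanity-check the extraction output, you can, for example, count the extracted claims per response with `jq` (assuming the illustrative file name from above and the default output directory):

# prints the number of extracted claims for each line of the output file
jq '.all_claims | length' ./data/claims_sample_input.jsonl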
Evidence searching:
python3 -m veriscore.retrieve_evidence --data_dir {data_dir} --input_file {input_file}
- `input_file`: Name of the input data file. It should be in the `jsonl` format, where each line contains the keys of the output dictionary from the Claim extraction step.

output:
{
...
"claim_snippets_dict": dictionary for claim and list of searched evidence. each evidence have dictionary of {"title": title, "snippet": snippet, "link": link}
}
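For example, to list the claims for which evidence was retrieved (again assuming the illustrative file name and default output directory):

# prints the claim keys of claim_snippets_dict for each line of the evidence file
jq '.claim_snippets_dict | keys' ./data/evidence_sample_input.jsonl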
Claim verification:
python3 -m veriscore.verify_claims --data_dir {data_dir} --input_file {input_file} --model_name {model_name}
- `input_file`: Name of the input data file. It should be in the `jsonl` format, where each line contains the keys of the output dictionary from the Evidence searching step.
- `use_external_model`: If specified, claim verification uses your custom model instead of the model used by the API call. We use Unsloth to run inference with the specified model. False by default.
output:
{
...
"claim_verification_result":[
{
"claim": claim text,
"search_results": concatenated search result
"verification_result": verification label
}
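For example, to tally verification labels across all responses (illustrative file name, default output directory):

# extracts the label of every verified claim and counts each label
jq -r '.claim_verification_result[].verification_result' ./data/model_output/verification_sample_input.jsonl | sort | uniq -c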
This Google Drive folder contains:
- the long-form generations of the tested models given the prompts in Table 1 in the paper
- the human annotation results from Section 3.3 of the paper