
DefAn: Definitive Answer Dataset for LLM Hallucination Evaluation

A.B.M. Ashikur Rahman (1), Saeed Anwar (1,2,3), Muhammad Usman (4), Irfan Ahmad (1,2), Ajmal Mian (3)
(1) King Fahd University of Petroleum and Minerals, Dhahran, KSA
(2) JRCAI, SDAIA-KFUPM
(3) The University of Western Australia, Crawley, Western Australia
(4) Faculty of Science, Ontario Tech University, 2000 Simcoe Street North, Oshawa, ON L1G 0C5, Canada


Abstract

Large Language Models (LLMs) represent a major step in AI development and are increasingly used in daily applications. However, they are prone to hallucinations, generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are limited in scale and scope and rely on multiple-choice questions, which are insufficient for evaluating the generative capabilities of LLMs. To assess hallucination in LLMs, this paper introduces a comprehensive benchmark dataset consisting of over 20,000 unique prompts (more than 75,000 prompts in total) across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance, and a hidden segment for benchmarking various LLMs. In our experiments, we tested nine state-of-the-art (SoTA) models: GPT-4o, GPT-3.5, Llama 2 7B, Llama 3 8B, Gemini 1.0 Pro, Mixtral 8x7B, Zephyr 7B, DeepSeek-R1 7B, and Qwen2.5-14B. The results reveal that overall factual hallucination ranges from 48% to 82% on the public dataset and 31% to 76% on the hidden benchmark. Prompt Misalignment Hallucination ranges up to 95% in the public dataset and up to 94% in the hidden counterpart. Average consistency ranges from 21% to 61% and 44% to 63%, respectively. Domain-wise analysis reveals that LLM performance significantly deteriorates when asked for specific numeric information, whereas it performs moderately with queries involving persons, locations, and dates. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for evaluating LLM performance.

Dataset Description

Purpose: Evaluation benchmark for LLM hallucinations.

Structure: Two-part dataset:

  • Public: Available for general evaluation.
  • Hidden: Used for benchmarking, ensuring comprehensive assessment.

Evaluation Metrics (see the scoring sketch below):

  • Fact Contradicting Hallucination (FCH) rate
  • Prompt Misalignment Hallucination (PMH) rate
  • Response Consistency (RC)

Size: Over 75,000 samples, providing a substantial volume of data for rigorous testing.
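
The three metrics can be approximated with simple string matching once each model response has been reduced to a short answer string. The sketch below is a minimal, hypothetical scorer rather than the paper's official evaluation code: the exact-match comparison for FCH and the word-count heuristic for PMH are assumptions made here for illustration.

```python
from collections import Counter

def fch_rate(responses, reference):
    """Fact Contradicting Hallucination: fraction of responses whose extracted
    answer disagrees with the reference (assumed exact, case-insensitive match)."""
    return sum(r.strip().lower() != reference.strip().lower() for r in responses) / len(responses)

def pmh_rate(responses, max_words=5):
    """Prompt Misalignment Hallucination: fraction of responses that ignore the
    prompt's format instruction, approximated here as answers longer than a short phrase."""
    return sum(len(r.split()) > max_words for r in responses) / len(responses)

def response_consistency(responses):
    """Response Consistency: share of repeated responses that agree with the
    most frequent answer for the same prompt."""
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

# Example: three runs of the same Nobel Prize question
runs = ["Emil von Behring", "Emil von Behring", "Wilhelm Roentgen"]
print(fch_rate(runs, "Emil von Behring"))  # ~0.33
print(pmh_rate(runs))                      # 0.0
print(response_consistency(runs))          # ~0.67
```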

Domain Statistics

Prompts elicit dates, numeric values, names, or locations as answers, and paraphrased prompt variants are also included.

| Domain | Public samples | Hidden samples |
|---|---|---|
| Sports | 1305 | 1005 |
| Census Australia | 7905 | 1005 |
| Nobel Prize | 9795 | 1005 |
| Entertainment | 8715 | 1005 |
| World Organizations | 2745 | 1005 |
| QS Ranking | 21495 | 1005 |
| Conference Venue | 915 | 450 |
| Math | 15218 | 1005 |

Data Instances

An example looks as follows:

{
    "questions":"Who achieved the Nobel Prize in Medicine for the year 1901? [first name + last name only] if multiple person, give one name only.",
    "answer":"Emil von Behring",
    "type":"name"
}
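
The public split can be read with standard JSON tooling. The following is a minimal loading sketch, assuming the samples are stored as a JSON array of such objects; the file name nobel_public.json is a placeholder, not necessarily a file in this repository.

```python
import json

# Placeholder file name; point this at an actual public-split file from the repository.
with open("nobel_public.json", encoding="utf-8") as f:
    samples = json.load(f)  # expects a JSON array of {"questions", "answer", "type"} objects

for sample in samples[:3]:
    print(sample["questions"])  # prompt sent to the LLM
    print(sample["answer"])     # reference (definitive) answer
    print(sample["type"])       # expected response type, e.g. "name", "date", "numeric"
```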

Languages

All samples in this dataset are in English.

LLM Evaluation

In this paper, we evaluated six widely used LLMs on the proposed metrics: GPT-3.5, Llama 2, Llama 3, Zephyr, Gemini 1.0 Pro, and Mixtral. Domain-wise performance for each LLM is summarized below; in each table, (P) denotes the public dataset and (H) the hidden benchmark.
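
A domain-wise evaluation can be organized as in the sketch below, which reuses the fch_rate, pmh_rate, and response_consistency helpers sketched under Dataset Description. query_model is a hypothetical wrapper around whichever LLM is under test; querying each prompt several times is what feeds the consistency metric.

```python
from collections import defaultdict

def evaluate_domain(samples, query_model, n_runs=3):
    """Query the model n_runs times per prompt and average the three metrics
    over a domain. `query_model(prompt) -> str` is a hypothetical client for
    the LLM under test; the scoring helpers come from the earlier sketch."""
    scores = defaultdict(list)
    for sample in samples:
        runs = [query_model(sample["questions"]) for _ in range(n_runs)]
        scores["FCH"].append(fch_rate(runs, sample["answer"]))
        scores["PMH"].append(pmh_rate(runs))
        scores["RC"].append(response_consistency(runs))
    return {metric: sum(values) / len(values) for metric, values in scores.items()}
```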

FCH Rate:

| Model | Sports (P) | Sports (H) | Census (P) | Census (H) | Nobel (P) | Nobel (H) | Entertainment (P) | Entertainment (H) | World Org. (P) | World Org. (H) | QS Ranking (P) | QS Ranking (H) | Conf. Venue (P) | Conf. Venue (H) | Math (P) | Math (H) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | 0.50 | 0.29 | 1.00 | 1.00 | 0.91 | 0.93 | 0.68 | 0.20 | 0.95 | 0.92 | 0.94 | 0.98 | 0.82 | 0.95 | 0.99 | 0.99 |
| Mixtral | 0.20 | 0.13 | 1.00 | 1.00 | 0.59 | 0.60 | 0.56 | 0.11 | 0.69 | 0.44 | 0.88 | 0.98 | 0.52 | 0.63 | 0.98 | 0.97 |
| Llama 3 | 0.44 | 0.30 | 1.00 | 1.00 | 0.63 | 0.70 | 0.29 | 0.19 | 0.71 | 0.73 | 0.97 | 0.99 | 0.65 | 0.87 | 1.00 | 0.99 |
| Llama 2 | 0.15 | 0.09 | 1.00 | 1.00 | 0.90 | 0.90 | 0.33 | 0.17 | 0.85 | 0.74 | 0.93 | 0.99 | 0.85 | 0.88 | 0.98 | 0.98 |
| GPT-3.5 | 0.17 | 0.11 | 1.00 | 1.00 | 0.35 | 0.52 | 0.10 | 0.19 | 0.57 | 0.38 | 0.93 | 0.98 | 0.31 | 0.60 | 0.98 | 0.98 |
| Gemini | 0.21 | 0.09 | 1.00 | 1.00 | 0.35 | 0.52 | 0.42 | 0.14 | 0.54 | 0.31 | 0.97 | 0.96 | 0.47 | 0.51 | 0.99 | 0.99 |

PMH Rate:

| Model | Sports (P) | Sports (H) | Census (P) | Census (H) | Nobel (P) | Nobel (H) | Entertainment (P) | Entertainment (H) | World Org. (P) | World Org. (H) | QS Ranking (P) | QS Ranking (H) | Conf. Venue (P) | Conf. Venue (H) | Math (P) | Math (H) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | 0.87 | 0.98 | 1.00 | 1.00 | 0.96 | 0.98 | 0.76 | 0.41 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Mixtral | 0.95 | 0.89 | 1.00 | 1.00 | 0.94 | 0.99 | 0.87 | 0.71 | 1.00 | 1.00 | 1.00 | 1.00 | 0.97 | 0.99 | 0.98 | 0.98 |
| Llama 3 | 0.18 | 0.34 | 0.98 | 0.99 | 0.16 | 0.26 | 0.01 | 0.03 | 0.78 | 0.74 | 0.52 | 0.56 | 0.24 | 0.26 | 0.04 | 0.04 |
| Llama 2 | 0.07 | 0.09 | 0.96 | 0.99 | 0.48 | 0.85 | 0.04 | 0.01 | 0.74 | 0.72 | 1.00 | 0.99 | 0.64 | 0.57 | 0.02 | 0.01 |
| GPT-3.5 | 0.17 | 0.16 | 0.55 | 0.49 | 0.14 | 0.41 | 0.31 | 0.33 | 0.75 | 0.88 | 0.55 | 0.62 | 0.17 | 0.22 | 0.38 | 0.36 |
| Gemini | 0.06 | 0.05 | 0.01 | 0.00 | 0.12 | 0.36 | 0.06 | 0.01 | 0.57 | 0.80 | 0.04 | 0.00 | 0.27 | 0.20 | 0.01 | 0.02 |

Response Consistency

| Model | Sports (P) | Sports (H) | Census (P) | Census (H) | Nobel (P) | Nobel (H) | Entertainment (P) | Entertainment (H) | World Org. (P) | World Org. (H) | QS Ranking (P) | QS Ranking (H) | Conf. Venue (P) | Conf. Venue (H) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Zephyr | 0.19 | 0.15 | 0.07 | 0.07 | 0.10 | 0.11 | 0.43 | 0.59 | 0.13 | 0.15 | 0.13 | 0.10 | 0.47 | 0.43 |
| Mixtral | 0.19 | 0.28 | 0.07 | 0.07 | 0.12 | 0.09 | 0.38 | 0.26 | 0.13 | 0.22 | 0.07 | 0.07 | 0.78 | 0.74 |
| Llama 3 | 0.60 | 0.62 | 0.07 | 0.07 | 0.46 | 0.52 | 0.81 | 0.84 | 0.50 | 0.46 | 0.11 | 0.08 | 0.58 | 0.50 |
| Llama 2 | 0.94 | 0.97 | 0.07 | 0.07 | 0.36 | 0.21 | 0.96 | 0.97 | 0.28 | 0.31 | 0.09 | 0.07 | 0.47 | 0.43 |
| GPT-3.5 | 0.77 | 0.86 | 0.07 | 0.07 | 0.80 | 0.62 | 0.67 | 0.66 | 0.28 | 0.23 | 0.21 | 0.15 | 0.84 | 0.73 |
| Gemini | 0.82 | 0.91 | 0.07 | 0.07 | 0.79 | 0.74 | 0.89 | 0.99 | 0.79 | 0.82 | 0.15 | 0.16 | 0.78 | 0.76 |

Overall Performance

Citation Information

@article{rahman2025defan,
  title={DefAn: Definitive Answer Dataset for LLM Hallucination Evaluation},
  author={Rahman, ABM Ashikur and Anwar, Saeed and Usman, Muhammad and Ahmad, Irfan and Mian, Ajmal},
  journal={Information},
  volume={16},
  number={11},
  pages={937},
  year={2025},
  publisher={MDPI}
}
