The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity---from their explanations that justify a stance---to enhance their trustworthiness in downstream tasks. In our recent paper, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures how LLMs' free-form toxicity explanations reflect those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs' toxicity explanations with no human involvement, and highlight how “non-ideal” the explanations are. This repository contains the code and sample data to reproduce our results.
The complete LLM-generated toxicity explanations and our HAF scores are available on Hugging Face. The complete LLM output tokens and entropy scores are available upon request.
```bash
pip install -r requirements.txt
```
The sample input data required to run the demo is included in the `llm_generated_data/` and `parsed_data/` directories. To compute HAF metrics on this sample data, run:
```bash
python haf.py
```
This computes the HAF metrics for the sample data and stores the results in the `haf_results/` directory. The results include HAF scores for different models and datasets.
Using an existing or a new dataset:
- Add the dataset name and path in `utils/data_path_map.json`.
- Include the main processing function for the dataset in `utils/data_processor.py` and give it exactly the same name as the dataset (an illustrative sketch follows this list).
- Access shared parameters and methods defined in the `DataLoader` class in `data_loader.py` through instance references.
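As an illustration, the sketch below registers a hypothetical dataset named `my_toxicity_data`. The JSON entry, the column names, the `loader.data_path` attribute, and the assumption that the processing function receives the `DataLoader` instance and returns a list of text/label records are all illustrative, not the repository's actual interface; mirror an existing entry in `utils/data_path_map.json` and an existing function in `utils/data_processor.py` for the exact signatures.

```python
# utils/data_processor.py -- illustrative sketch for a hypothetical dataset
# "my_toxicity_data". Assumes utils/data_path_map.json gained an entry such as:
#   "my_toxicity_data": "data/my_toxicity_data.csv"
import csv

def my_toxicity_data(loader):
    """Processing function named exactly after the dataset.

    `loader` is assumed to be the DataLoader instance from data_loader.py,
    whose shared parameters (e.g., the path resolved from data_path_map.json)
    are accessed through instance references; attribute names are assumptions.
    """
    with open(loader.data_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Map the raw columns to the fields used downstream (column names are assumed).
    return [{"text": r["comment"], "label": int(r["is_toxic"])} for r in rows]
```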
LLM explanation generation and parsing:
In the paper, we describe a three-stage pipeline to compute HAF metrics. The pipeline consists of:
- Stage JUSTIFY, where LLMs generate explanations for their toxicity decisions (denoted by `stage="initial"`).
- Stage UPHOLD-REASON, where LLMs generate post-hoc explanations to assess the sufficiency of the reasons provided in the JUSTIFY stage (denoted by `stage="internal"` or `stage="external"`).
- Stage UPHOLD-STANCE, where LLMs generate post-hoc explanations to assess the sufficiency and necessity of individual reasons from the JUSTIFY stage (denoted by `stage="individual"`).
To implement this, repeat the following steps for each of the four values of the stage parameter: initial, internal, external, and individual. Only the initial stage has to be run first; the rest can be run in any order (a sketch of a full run follows these steps):
- Run `generate.py` with `--generation_stage=initial/internal/external/individual` and other optional changes to the generation hyperparameters.
- The LLM outputs (tokens, token entropies, and texts) are generated and stored in `llm_generated_data/<model_name>/<data_name>/<stage>`.
- Run `parse.py` with `stage=initial/internal/external/individual` and other optional parameters to extract the LLM decisions, reasons, and other information needed to compute HAF.
- The parsed outputs are stored in `parsed_data/<model_name>/<data_name>/<stage>`.
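A minimal, illustrative driver for a full run on one model/dataset pair is sketched below; it is not part of the repository and assumes `generate.py` and `parse.py` accept the stage arguments exactly as spelled in the steps above, with any model/dataset options left at their defaults.

```python
# run_all_stages.py -- illustrative driver, not part of the repo.
import subprocess

stages = ["initial", "internal", "external", "individual"]

for stage in stages:
    # "initial" runs first; the remaining stages may run in any order.
    subprocess.run(["python", "generate.py", f"--generation_stage={stage}"], check=True)
    subprocess.run(["python", "parse.py", f"stage={stage}"], check=True)

# Compute the HAF metrics from the parsed outputs (same command as the demo above).
subprocess.run(["python", "haf.py"], check=True)
```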
Computing HAF metrics:
- Run `haf.py` with optional parameters to compute HAF metrics for all combinations of models and datasets.
- The outputs are computed for each sample instance and stored in `haf_results/<model_name>/<data_name>/<sample_index>.pkl` (a loading sketch follows this list).
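The snippet below is a minimal sketch for inspecting one per-sample result; the internal structure of each pickle is not documented here and is an assumption, so print the loaded object to see what it actually contains.

```python
# Illustrative: load the first per-sample HAF result found under haf_results/.
import pickle
from pathlib import Path

result_path = next(Path("haf_results").rglob("*.pkl"))  # <model_name>/<data_name>/<sample_index>.pkl

with open(result_path, "rb") as f:
    result = pickle.load(f)

print(result_path)
print(result)  # structure is an assumption; inspect it to see the stored HAF scores
```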
Upcoming updates:
- We are working on updating the parser files to support more datasets and models, and will soon integrate the results of the Microsoft Phi-4 reasoning model.
- We will also include the results of naive prompting without explicit reasoning instructions.
Bibtex:
```bibtex
@article{mothilal2025haf,
  title={Human-Aligned Faithfulness in Toxicity Explanations of LLMs},
  author={K Mothilal, Ramaravind and Roy, Joanna and Ahmed, Syed Ishtiaque and Guha, Shion},
  journal={arXiv preprint arXiv:2506.19113},
  year={2025}
}
```
