The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity---from their explanations that justify a stance---to enhance their trustworthiness in downstream tasks. In our recent paper, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures how LLMs' free-form toxicity explanations reflect those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs' toxicity explanations with no human involvement, and highlight how “non-ideal” the explanations are. This repository contains the code and sample data to reproduce our results.
The complete LLM-generated toxicity explanations and our HAF scores are available on Hugging Face. The complete LLM output tokens and entropy scores are available upon request.
```bash
pip install -r requirements.txt
```
The sample input data required to run the demo is included in the `llm_generated_data/` and `parsed_data/` directories. To compute HAF metrics on this sample data, run:
```bash
python haf.py
```
This computes the HAF metrics for the sample data and stores the results in the `haf_results/` directory. The results include HAF scores for different models and datasets.
Using an existing or a new dataset:
- Add the dataset name and path in `utils/data_path_map.json`.
- Include the main processing function for the dataset in `utils/data_processor.py` and give it exactly the same name as the dataset (an illustrative sketch follows this list).
- Access shared parameters and methods defined in the `DataLoader` class in `data_loader.py` through instance references.
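As an illustration, the sketch below registers a hypothetical dataset named `my_toxicity_data`. The JSON entry, the column names, the `loader.data_path` attribute, and the assumption that the processing function receives the `DataLoader` instance and returns a list of text/label records are all illustrative, not the repository's actual interface; mirror an existing entry in `utils/data_path_map.json` and an existing function in `utils/data_processor.py` for the exact signatures.

```python
# utils/data_processor.py -- illustrative sketch for a hypothetical dataset
# "my_toxicity_data". Assumes utils/data_path_map.json gained an entry such as:
#   "my_toxicity_data": "data/my_toxicity_data.csv"
import csv

def my_toxicity_data(loader):
    """Processing function named exactly after the dataset.

    `loader` is assumed to be the DataLoader instance from data_loader.py,
    whose shared parameters (e.g., the path resolved from data_path_map.json)
    are accessed through instance references; attribute names are assumptions.
    """
    with open(loader.data_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Map the raw columns to the fields used downstream (column names are assumed).
    return [{"text": r["comment"], "label": int(r["is_toxic"])} for r in rows]
```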
LLM explanation generation and parsing:
In the paper, we describe a three-stage pipeline to compute HAF metrics. The pipeline consists of:
- Stage JUSTIFY, where LLMs generate explanations for their toxicity decisions (denoted by `stage="initial"`).
- Stage UPHOLD-REASON, where LLMs generate post-hoc explanations to assess the sufficiency of the reasons provided in the JUSTIFY stage (denoted by `stage="internal"` or `stage="external"`).
- Stage UPHOLD-STANCE, where LLMs generate post-hoc explanations to assess the sufficiency and necessity of individual reasons from the JUSTIFY stage (denoted by `stage="individual"`).
To implement this, repeat the following steps for each of the four values of the stage parameter: initial, internal, external, and individual. Only the initial stage has to be run first; the rest can be run in any order (a sketch of a full run follows these steps):
- Run `generate.py` with `--generation_stage=initial/internal/external/individual` and other optional changes to the generation hyperparameters.
- The LLM outputs (tokens, token entropies, and texts) are generated and stored in `llm_generated_data/<model_name>/<data_name>/<stage>`.
- Run `parse.py` with `stage=initial/internal/external/individual` and other optional parameters to extract the LLM decisions, reasons, and other information needed to compute HAF.
- The parsed outputs are stored in `parsed_data/<model_name>/<data_name>/<stage>`.
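A minimal, illustrative driver for a full run on one model/dataset pair is sketched below; it is not part of the repository and assumes `generate.py` and `parse.py` accept the stage arguments exactly as spelled in the steps above, with any model/dataset options left at their defaults.

```python
# run_all_stages.py -- illustrative driver, not part of the repo.
import subprocess

stages = ["initial", "internal", "external", "individual"]

for stage in stages:
    # "initial" runs first; the remaining stages may run in any order.
    subprocess.run(["python", "generate.py", f"--generation_stage={stage}"], check=True)
    subprocess.run(["python", "parse.py", f"stage={stage}"], check=True)

# Compute the HAF metrics from the parsed outputs (same command as the demo above).
subprocess.run(["python", "haf.py"], check=True)
```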
Computing HAF metrics:
- Run `haf.py` with optional parameters to compute HAF metrics for all combinations of models and datasets.
- The outputs are computed for each sample instance and stored in `haf_results/<model_name>/<data_name>/<sample_index>.pkl` (a loading sketch follows this list).
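The snippet below is a minimal sketch for inspecting one per-sample result; the internal structure of each pickle is not documented here and is an assumption, so print the loaded object to see what it actually contains.

```python
# Illustrative: load the first per-sample HAF result found under haf_results/.
import pickle
from pathlib import Path

result_path = next(Path("haf_results").rglob("*.pkl"))  # <model_name>/<data_name>/<sample_index>.pkl

with open(result_path, "rb") as f:
    result = pickle.load(f)

print(result_path)
print(result)  # structure is an assumption; inspect it to see the stored HAF scores
```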
Upcoming updates:
- We are working on updating the parser files to support more datasets and models, and will soon integrate the results of the Microsoft Phi-4 reasoning model.
- We will also include the results of naive prompting without explicit reasoning instructions.
Bibtex:
```bibtex
@article{mothilal2025haf,
  title={Human-Aligned Faithfulness in Toxicity Explanations of LLMs},
  author={K Mothilal, Ramaravind and Roy, Joanna and Ahmed, Syed Ishtiaque and Guha, Shion},
  journal={arXiv preprint arXiv:2506.19113},
  year={2025}
}
```
