This repository contains the experimental code and analysis for the paper:

**A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior**
## Overview

This codebase implements experiments for analyzing the faithfulness and simulatability of LLM self-explanations. The repository provides tools for:
- Generating counterfactual questions from datasets
- Collecting reference answers from language models
- Evaluating predictor models on their ability to simulate model behavior
- Analyzing the utility of self-explanations for predicting model outputs
## Installation

- Create and activate the conda environment:

  ```bash
  conda env create -f environment.yml
  conda activate faithfulness-env
  ```

- Verify installation:

  ```bash
  python -c "import vllm; print('vLLM installed successfully')"
  ```

## Generating Counterfactual Datasets

Generate natural counterfactual datasets:
```bash
# Generate Hamming-ball style counterfactual datasets
python -m src.counterfactual_generation.tabular_counterfactual_generation.tabular_to_text \
    --output_dir data/natural_counterfactuals

# Generate moral machines counterfactual dataset
PYTHONHASHSEED=0 python -m src.counterfactual_generation.tabular_counterfactual_generation.moral_machines_generator

# Build combined dataset
python -m data.natural_counterfactuals.generate_combined
```
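As a rough illustration of the "Hamming-ball" idea behind the tabular generators (a conceptual sketch, not the repository's implementation): each counterfactual differs from the original instance in a small number of features, so the distance-1 ball around a row can be enumerated like this:

```python
def hamming_ball_neighbors(row, feature_values):
    """Yield every variant of `row` that differs in exactly one feature
    (the Hamming-distance-1 'ball' around the original instance)."""
    for feature, values in feature_values.items():
        for value in values:
            if value != row[feature]:
                neighbor = dict(row)
                neighbor[feature] = value
                yield neighbor

# Toy tabular instance and its feature domains (illustrative only).
row = {"age": "30-39", "income": "high", "education": "college"}
domains = {
    "age": ["20-29", "30-39", "40-49"],
    "income": ["low", "high"],
    "education": ["high-school", "college"],
}

# Two alternative ages + one alternative income + one alternative
# education level -> four distance-1 neighbors.
neighbors = list(hamming_ball_neighbors(row, domains))
```

Each neighbor can then be rendered into a natural-language counterfactual question by the dataset-specific templates.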
## Repository Structure

- `src/` - Core library code
  - `schema.py` - Data structures for experimental results
  - `utils.py` - Shared utilities (parsing, normalization, LLM configuration)
  - `templates/` - Dataset-specific prompt templates
  - `counterfactual_generation/` - Counterfactual generation logic
  - `prediction_generation/` - Predictor model answer generation
  - `reference_answer_generation/` - Reference model answer generation
- `analysis_scripts/` - Analysis scripts for processing experimental results
- `experiment_scripts/` - Scripts for running experiments
- `notebooks/` - Jupyter notebooks for exploratory analysis and visualization
- `tests/` - Unit tests (run with `pytest`)
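To give a feel for the kind of record `schema.py` describes, here is a hypothetical sketch of a per-question result (field names are illustrative, not the repository's actual schema):

```python
from dataclasses import dataclass

@dataclass
class SimulationRecord:
    """Illustrative result record: one predictor guess about how the
    reference model answers a counterfactual question.
    Field names are hypothetical, not the repository's actual schema."""
    question_id: str
    counterfactual_question: str
    reference_answer: str      # what the reference model actually answered
    predicted_answer: str      # the predictor model's guess
    explanation_shown: bool    # did the predictor see a self-explanation?

    @property
    def correct(self) -> bool:
        # The predictor simulated the model correctly if its guess
        # matches the reference model's actual answer.
        return self.predicted_answer == self.reference_answer

record = SimulationRecord(
    question_id="q1",
    counterfactual_question="What if the patient were ten years older?",
    reference_answer="yes",
    predicted_answer="yes",
    explanation_shown=True,
)
```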
## Running Experiments

- Generate reference answers:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python -m src.reference_answer_generation.generate_reference_answers \
      data/natural_counterfactuals/combined_dataset.parquet \
      --output-parquet experiments/reference_answers.parquet \
      --model Qwen/Qwen3-8B
  ```

- Generate predictor answers:
  ```bash
  CUDA_VISIBLE_DEVICES=0 python -m src.prediction_generation.generate_predictor_answers \
      experiments/reference_answers.parquet \
      --output-parquet experiments/predictor_answers.parquet \
      --predictor-model google/gemma-2-27b-it
  ```

- Analyze results:
  ```bash
  python -m analysis_scripts.analyze_simulatability \
      experiments/predictor_answers.parquet \
      --output results/simulatability_analysis.csv
  ```

## Testing

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_sample_data.py -v
```
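Conceptually, the simulatability analysis in the pipeline above compares how often the predictor matches the reference model with and without access to self-explanations. A minimal sketch over in-memory records (the actual script reads the parquet files; the record layout here is hypothetical):

```python
def simulatability(records):
    """Predictor accuracy with vs. without self-explanations.
    Each record is (predicted_answer, reference_answer, explanation_shown)."""
    def accuracy(subset):
        return sum(pred == ref for pred, ref, _ in subset) / len(subset)

    with_expl = [r for r in records if r[2]]
    without_expl = [r for r in records if not r[2]]
    return {
        "with_explanation": accuracy(with_expl),
        "without_explanation": accuracy(without_expl),
        # A positive gap means explanations helped predict model behavior.
        "explanation_gain": accuracy(with_expl) - accuracy(without_expl),
    }

# Toy data (illustrative only): 3/4 correct with explanations,
# 2/4 correct without.
records = [
    ("yes", "yes", True), ("no", "no", True),
    ("yes", "no", True), ("no", "no", True),
    ("yes", "no", False), ("no", "yes", False),
    ("yes", "yes", False), ("no", "no", False),
]
scores = simulatability(records)
```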
See `PYTEST_GUIDE.md` for more testing options.

## Citation

If you use this code in your research, please cite:
```bibtex
@misc{mayne2026positivecasefaithfulnessllm,
  title={A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior},
  author={Harry Mayne and Justin Singh Kang and Dewi Gould and Kannan Ramchandran and Adam Mahdi and Noah Y. Siegel},
  year={2026},
  eprint={2602.02639},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.02639},
}
```

## License

See the LICENSE file for details.
## Contact

For questions or issues, please open a GitHub issue or contact the authors.