This repository contains the experimental code and analysis for the paper:

**A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior**
## Overview

This codebase implements experiments for analyzing the faithfulness and simulatability of LLM self-explanations. The repository provides tools for:
- Generating counterfactual questions from datasets
- Collecting reference answers from language models
- Evaluating predictor models on their ability to simulate model behavior
- Analyzing the utility of self-explanations for predicting model outputs
## Installation

- Create and activate the conda environment:

  ```bash
  conda env create -f environment.yml
  conda activate faithfulness-env
  ```

- Verify installation:

  ```bash
  python -c "import vllm; print('vLLM installed successfully')"
  ```

## Generating Counterfactual Datasets

Generate natural counterfactual datasets:
```bash
# Generate Hamming-ball style counterfactual datasets
python -m src.counterfactual_generation.tabular_counterfactual_generation.tabular_to_text \
    --output_dir data/natural_counterfactuals

# Generate moral machines counterfactual dataset
PYTHONHASHSEED=0 python -m src.counterfactual_generation.tabular_counterfactual_generation.moral_machines_generator

# Build combined dataset
python -m data.natural_counterfactuals.generate_combined
```
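As a rough illustration of the "Hamming-ball" idea behind the tabular generators (a conceptual sketch, not the repository's implementation): each counterfactual differs from the original instance in a small number of features, so the distance-1 ball around a row can be enumerated like this:

```python
def hamming_ball_neighbors(row, feature_values):
    """Yield every variant of `row` that differs in exactly one feature
    (the Hamming-distance-1 'ball' around the original instance)."""
    for feature, values in feature_values.items():
        for value in values:
            if value != row[feature]:
                neighbor = dict(row)
                neighbor[feature] = value
                yield neighbor

# Toy tabular instance and its feature domains (illustrative only).
row = {"age": "30-39", "income": "high", "education": "college"}
domains = {
    "age": ["20-29", "30-39", "40-49"],
    "income": ["low", "high"],
    "education": ["high-school", "college"],
}

# Two alternative ages + one alternative income + one alternative
# education level -> four distance-1 neighbors.
neighbors = list(hamming_ball_neighbors(row, domains))
```

Each neighbor can then be rendered into a natural-language counterfactual question by the dataset-specific templates.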
## Repository Structure

- `src/` - Core library code
  - `schema.py` - Data structures for experimental results
  - `utils.py` - Shared utilities (parsing, normalization, LLM configuration)
  - `templates/` - Dataset-specific prompt templates
  - `counterfactual_generation/` - Counterfactual generation logic
  - `prediction_generation/` - Predictor model answer generation
  - `reference_answer_generation/` - Reference model answer generation
- `analysis_scripts/` - Analysis scripts for processing experimental results
- `experiment_scripts/` - Scripts for running experiments
- `notebooks/` - Jupyter notebooks for exploratory analysis and visualization
- `tests/` - Unit tests (run with `pytest`)
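To give a feel for the kind of record `schema.py` describes, here is a hypothetical sketch of a per-question result (field names are illustrative, not the repository's actual schema):

```python
from dataclasses import dataclass

@dataclass
class SimulationRecord:
    """Illustrative result record: one predictor guess about how the
    reference model answers a counterfactual question.
    Field names are hypothetical, not the repository's actual schema."""
    question_id: str
    counterfactual_question: str
    reference_answer: str      # what the reference model actually answered
    predicted_answer: str      # the predictor model's guess
    explanation_shown: bool    # did the predictor see a self-explanation?

    @property
    def correct(self) -> bool:
        # The predictor simulated the model correctly if its guess
        # matches the reference model's actual answer.
        return self.predicted_answer == self.reference_answer

record = SimulationRecord(
    question_id="q1",
    counterfactual_question="What if the patient were ten years older?",
    reference_answer="yes",
    predicted_answer="yes",
    explanation_shown=True,
)
```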
## Running Experiments

- Generate reference answers:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python -m src.reference_answer_generation.generate_reference_answers \
      data/natural_counterfactuals/combined_dataset.parquet \
      --output-parquet experiments/reference_answers.parquet \
      --model Qwen/Qwen3-8B
  ```

- Generate predictor answers:
  ```bash
  CUDA_VISIBLE_DEVICES=0 python -m src.prediction_generation.generate_predictor_answers \
      experiments/reference_answers.parquet \
      --output-parquet experiments/predictor_answers.parquet \
      --predictor-model google/gemma-2-27b-it
  ```

- Analyze results:
  ```bash
  python -m analysis_scripts.analyze_simulatability \
      experiments/predictor_answers.parquet \
      --output results/simulatability_analysis.csv
  ```

## Testing

```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_sample_data.py -v
```
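Conceptually, the simulatability analysis in the pipeline above compares how often the predictor matches the reference model with and without access to self-explanations. A minimal sketch over in-memory records (the actual script reads the parquet files; the record layout here is hypothetical):

```python
def simulatability(records):
    """Predictor accuracy with vs. without self-explanations.
    Each record is (predicted_answer, reference_answer, explanation_shown)."""
    def accuracy(subset):
        return sum(pred == ref for pred, ref, _ in subset) / len(subset)

    with_expl = [r for r in records if r[2]]
    without_expl = [r for r in records if not r[2]]
    return {
        "with_explanation": accuracy(with_expl),
        "without_explanation": accuracy(without_expl),
        # A positive gap means explanations helped predict model behavior.
        "explanation_gain": accuracy(with_expl) - accuracy(without_expl),
    }

# Toy data (illustrative only): 3/4 correct with explanations,
# 2/4 correct without.
records = [
    ("yes", "yes", True), ("no", "no", True),
    ("yes", "no", True), ("no", "no", True),
    ("yes", "no", False), ("no", "yes", False),
    ("yes", "yes", False), ("no", "no", False),
]
scores = simulatability(records)
```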
See `PYTEST_GUIDE.md` for more testing options.

## Citation

If you use this code in your research, please cite:
```bibtex
@misc{mayne2026positivecasefaithfulnessllm,
  title={A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior},
  author={Harry Mayne and Justin Singh Kang and Dewi Gould and Kannan Ramchandran and Adam Mahdi and Noah Y. Siegel},
  year={2026},
  eprint={2602.02639},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.02639},
}
```

## License

See the LICENSE file for details.
## Contact

For questions or issues, please open a GitHub issue or contact the authors.