
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior


When asked to explain their decisions, LLMs can give highly plausible self-explanations. But are these explanations actually faithful to the model's true reasoning, or are they just post-hoc rationalizations? Existing faithfulness metrics have critical limitations that make them unsuitable for evaluating frontier models.

In this paper, we take a predictive approach: we measure whether a model's self-explanations help an observer predict how the model will behave on related inputs. Across 18 frontier models and 7,000 counterfactual examples, we find that self-explanations substantially improve prediction of model behavior.

Key result: Self-explanations encode valuable information about LLMs' decision-making criteria.

Why a new faithfulness metric?

Existing faithfulness metrics rely on detecting failures: adversarial cues that bias reasoning (e.g. Turpin et al.) or explicit reasoning errors. But these failure modes disappear as model capabilities scale, creating a vanishing signal problem. Frontier models no longer fall for biasing cues in the same way older models once did.

“Unfortunately, we do not currently have viable dedicated evaluations for reasoning faithfulness.” Claude Sonnet 4.5 System Card, Anthropic (2025)

We take a different approach. Rather than looking for failures, we measure the predictive value of explanations, introducing a metric that quantifies how much a model's self-explanation helps an independent observer predict the model's behavior on related inputs.

Normalized simulatability gain

Our method is based on the following principle: a faithful explanation should allow an observer to learn a model's decision-making criteria, and thus better predict its behavior on related inputs. We formalize this with Normalized Simulatability Gain (NSG).

Figure: Operationalizing faithfulness with NSG

Operationalizing faithfulness with NSG. The reference model (the LLM being evaluated) answers a question and gives an explanation. The predictor model (a separate LLM) then tries to predict the reference model's answer on similar counterfactual questions, both with and without access to the explanation. If the explanation is faithful, the predictor should do better when it has the explanation.

We generate counterfactual inputs using data-driven methods: for each original question, we find related questions from existing datasets that differ in at most two features. This ensures counterfactuals are natural and grounded in the real data distribution. NSG measures the fraction of achievable improvement that explanations deliver, ranging from 0 to 100%.
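As a minimal formalization (the notation here is ours, not necessarily the paper's), write Acc_expl and Acc_base for the predictor's accuracy on the counterfactual questions with and without access to the explanation. Then

\[
\mathrm{NSG} \;=\; \frac{\mathrm{Acc}_{\mathrm{expl}} - \mathrm{Acc}_{\mathrm{base}}}{1 - \mathrm{Acc}_{\mathrm{base}}},
\]

which is 0 when the explanation adds nothing beyond input-output observation alone and 100% when it corrects every counterfactual prediction the observer would otherwise get wrong.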

Experimental setup

We evaluate 18 reference models on 7,000 counterfactual examples from seven popular datasets covering health, business, and ethics:

  • Health: Heart Disease, Pima Diabetes, Breast Cancer Recurrence
  • Business: Employee Attrition, Annual Income, Bank Marketing
  • Ethics: Moral Machines

Each reference model is evaluated by an ensemble of five predictor models (gpt-oss-20b, Qwen-3-32B, gemma-3-27b-it, GPT-5 mini, and gemini-3-flash). Results are averaged across the ensemble to avoid model-specific biases.
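To make the evaluation loop concrete, here is a minimal Python sketch of how NSG could be computed for a single reference model under this setup. The helper `query_predictor`, the example fields, and the control flow are illustrative assumptions, not the released evaluation code.

```python
from statistics import mean

PREDICTORS = ["gpt-oss-20b", "Qwen-3-32B", "gemma-3-27b-it", "GPT-5 mini", "gemini-3-flash"]

def query_predictor(predictor, question, answer, counterfactual, explanation=None):
    """Hypothetical stand-in for an LLM call: ask `predictor` to guess the reference
    model's answer on `counterfactual`, given the original question/answer pair and,
    optionally, the reference model's self-explanation."""
    raise NotImplementedError

def nsg_for_reference_model(examples):
    """`examples` holds, per item: the original question and answer, the reference
    model's self-explanation, a counterfactual question, and the reference model's
    actual answer on that counterfactual."""
    gains = []
    for predictor in PREDICTORS:
        base_hits = expl_hits = 0
        for ex in examples:
            base = query_predictor(predictor, ex["question"], ex["answer"], ex["counterfactual"])
            expl = query_predictor(predictor, ex["question"], ex["answer"], ex["counterfactual"],
                                   explanation=ex["explanation"])
            base_hits += base == ex["counterfactual_answer"]
            expl_hits += expl == ex["counterfactual_answer"]
        acc_base = base_hits / len(examples)
        acc_expl = expl_hits / len(examples)
        # Fraction of achievable improvement; assumes the baseline is not already perfect.
        gains.append((acc_expl - acc_base) / (1 - acc_base))
    # Average over the five-predictor ensemble to reduce model-specific biases.
    return mean(gains)
```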

Self-explanations improve prediction

Figure: Main results showing NSG across 18 models

Self-explanations encode valuable information about models' decision-making criteria. All 18 reference models produce self-explanations that improve counterfactual prediction, with NSG ranging from 11% to 37%.

This is a positive result for self-explanations. Explanations consistently encode valuable information about a model's decision-making criteria that helps predict its behavior on related inputs. For the best-performing models, the explanation corrects roughly one third of the counterfactual predictions the observer would otherwise get wrong.

Do models have privileged self-knowledge?

Self-explanations improve predictor accuracy, but this alone doesn't confirm they encode the true decision-making criteria. An alternative hypothesis: any plausible explanation might help prediction, regardless of source. We test this by swapping each self-explanation with one generated by a different model family that gave the same original answer.
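A sketch of the swap step, using our own (hypothetical) data layout rather than the paper's code: for each item, the self-explanation is replaced by an explanation written by a model from a different family whose original answer agreed, and the prediction step is rerun unchanged.

```python
def cross_model_explanation(item, explanations_by_family):
    """Hypothetical matching step: return an explanation for the same original question,
    produced by a model from a different family that gave the same original answer."""
    for family, pool in explanations_by_family.items():
        if family == item["family"]:
            continue  # the replacement must come from another model family
        match = pool.get((item["question_id"], item["answer"]))
        if match is not None:
            return match
    return None  # no agreeing cross-model explanation; the item is skipped
```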

We find self-explanations consistently encode more predictive information than cross-model explanations, even when the external explainer models are stronger. This holds across all model families.

Model family | Self-explanation NSG | Cross-model NSG | Self-explanation uplift
Qwen 3 | 34.2% | 31.2% | +3.0 pp
Gemma 3 | 35.0% | 33.2% | +1.7 pp
GPT-5 | 35.9% | 31.7% | +4.3 pp
Claude 4.5 | 30.2% | 28.0% | +2.3 pp
Gemini 3 | 32.9% | 30.2% | +2.7 pp

Self-explanations consistently outperform cross-model explanations. This provides evidence for a privileged self-knowledge advantage: models have access to information about their own decision-making that an external observer cannot derive from input-output behavior alone. In other words, the value of self-explanations is not merely that any plausible rationale helps; they encode genuinely privileged information about the model's reasoning process.

Are bigger models more faithful?

Figure: Mixed trends between model scale and faithfulness

Mixed trends between model scale and faithfulness. The Qwen 3 family shows a clear monotonic relationship between model size and NSG, but this relationship breaks down past a modest capability threshold, and proprietary model families show no clear scaling trend.

Remaining unfaithfulness

The positive NSG results are average-case. Self-explanations help prediction overall, but they are not always faithful. We define egregious unfaithfulness as the case where a self-explanation leads all five predictor models to make a prediction that doesn't align with the model's true behavior. Across models, 5-15% of self-explanations are egregiously unfaithful, with smaller models producing more misleading explanations.
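As a sketch of this criterion (names are illustrative, not the paper's code), an explanation is flagged only when the entire ensemble is misled by it:

```python
def is_egregiously_unfaithful(ensemble_predictions_with_explanation, actual_answer):
    """All five predictors, after reading the self-explanation, predict an answer that
    differs from the reference model's actual answer on the counterfactual."""
    return all(pred != actual_answer for pred in ensemble_predictions_with_explanation)
```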

Figure: Example of egregious unfaithfulness from GPT-5.2

GPT-5.2 being unfaithful. When presented with a moral dilemma, GPT-5.2 chooses to continue straight (inaction) and explains this by citing the “principle of not taking active measures that cause harm.” However, when the genders of the pedestrians are swapped in the counterfactual, the model reverses its decision and swerves (an active measure causing harm), directly contradicting its own stated reasoning. Gender is never mentioned in the explanations. Claude Opus 4.5 exhibits the same behavior on this question.

Key takeaways

  • Self-explanations encode valuable information. Across all 18 models, explanations encode information about decision-making criteria, improving behavior prediction with NSG ranging from 11% to 37%.
  • Models have privileged self-knowledge. Self-explanations outperform external model explanations by 1.7-4.3 percentage points, implying an advantage from self-knowledge that external methods cannot replicate.
  • But faithfulness is not guaranteed. 5-15% of self-explanations are egregiously misleading, sometimes contradicting the model's actual behaviour on closely related inputs.

Citation