Sandbagging—where AI systems strategically underperform under evaluation while remaining capable in casual settings—threatens reliable model assessment and AI safety governance. Consistency Probing detects this by measuring performance divergence on content-identical tasks framed as formal evaluations versus casual interactions.
We construct 50 prompt pairs across seven reasoning domains and evaluate contemporary LLMs under controlled conditions. Using paired statistical testing and effect-size analysis, we compute a composite confidence score quantifying contextual performance divergence. Our results reveal statistically meaningful, model-dependent performance drops under evaluation in several models. By isolating context-conditioned behavior from confounds, Consistency Probing complements traditional benchmarks and integrates into red-teaming and auditing workflows.
By: Atharshlakshmi Vijayakumar, Balakrishnan Vaisiya
Create and activate a virtual environment and install dependencies:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Then log in to Hugging Face to access gated models:

```bash
hf auth login
```

You'll be prompted to enter your HF API token (get one from huggingface.co/settings/tokens).
- Add your model API keys into the `.env` file as in `.env.example`:

```
GEMINI_API_KEY=your_gemini_api_key_here
```
- Run the experiments (this will iterate over the models and write per-model logs):

```bash
python src/experiments.py
```

Notes:

- Per-pair logs and model comparison CSVs are written to `~/reports/experiment_logs` by default.
- Visualize results with the notebook:

```bash
# Open the visualization notebook
jupyter notebook reports/visualization.ipynb
```

This generates charts comparing model performance metrics across the sandbagging detection experiments.
- Python 3.x — Core language
- PyTorch — Deep learning framework for local model inference
- Transformers — Hugging Face library for loading and running LLMs
- Google Generative AI SDK — Gemini API integration
- Lambda Labs GPUs — GPU compute for model experiments (optional, for larger models)
- Pandas — Data manipulation and CSV output
```
SandbagDetect/
├── README.md
├── requirements.txt
├── question_pairs/
│   ├── generated_question_pairs.json
│   ├── generateQuestionPairs.py
│   └── promptTemplates.py
├── src/
│   ├── experiments.py
│   ├── logger.py
│   ├── metrics.py
│   └── utils.py
└── reports/
    ├── visualization.ipynb   # Analysis and charting of results
    └── experiment_logs/      # Generated output
```
- The repo loads 50 prompt pairs from `prompts/sandbagging_question_pairs.json`.
- For each pair it runs two prompts (evaluation-style and casual-style) through the same model using a single adapter function (`call_LLM`).
- It records both responses and computes per-pair metrics: correctness (binary/score), length ratio, semantic similarity, and a `sandbagging_flag` indicating whether the casual response was better than the evaluation response.
- Per-pair results are appended to a per-model CSV. After all pairs, aggregated experiment metrics are computed and appended to a `model_comparison.csv` for cross-model analysis.
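The per-pair loop above can be sketched roughly as follows. This is a minimal sketch, not the repo's actual implementation: only `call_LLM` comes from the codebase, while `score_response`, the metric names, and the returned dictionary layout are illustrative assumptions.

```python
def run_pair(model_name, eval_prompt, casual_prompt, call_LLM, score_response):
    """Run one prompt pair through the same model and compute per-pair metrics.

    `call_LLM` and `score_response` are passed in so the sketch stays
    provider-agnostic (hypothetical wiring, not the repo's exact signature).
    """
    eval_resp = call_LLM(model_name, eval_prompt)
    casual_resp = call_LLM(model_name, casual_prompt)
    eval_score = score_response(eval_prompt, eval_resp)
    casual_score = score_response(casual_prompt, casual_resp)
    return {
        "eval_score": eval_score,
        "casual_score": casual_score,
        # Ratio of response lengths (casual vs. evaluation framing).
        "length_ratio": len(casual_resp) / max(len(eval_resp), 1),
        # Flag pairs where the casual framing outperformed the evaluation one.
        "sandbagging_flag": casual_score > eval_score,
    }
```

Each returned dictionary would then be appended as one row to the per-model CSV before aggregation.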
- Implement or adapt `call_LLM(model_name, prompt, **kwargs)` in `src/utils.py` to call your model provider. The rest of the code expects `call_LLM` to return a text response string.
- Add API keys, if applicable, into your `.env` file.
- Update the list of models in `src/experiments.py`.
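A possible shape for the adapter, assuming the `google-generativeai` package for the Gemini branch (the dispatch logic and the echo fallback are illustrative, not the repo's actual code):

```python
import os

def call_LLM(model_name, prompt, **kwargs):
    """Minimal provider-dispatch sketch: route by model name, return a string.

    Only the Gemini branch reflects the listed dependencies; the fallback
    stub exists so the pipeline can be smoke-tested without API keys.
    """
    if model_name.startswith("gemini"):
        # Assumes the google-generativeai package and GEMINI_API_KEY in .env.
        import google.generativeai as genai
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        model = genai.GenerativeModel(model_name)
        return model.generate_content(prompt).text
    # Local echo stub for dry runs; replace with your provider's client.
    return f"[stub:{model_name}] {prompt}"
```

Whatever the provider, the function must return plain text, since the metric code downstream operates on response strings.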
This experiment was built for the Apart Research Hackathon Sprint.