SandbagDetect: Consistency Probing for Sandbagging Detection

Sandbagging—where AI systems strategically underperform under evaluation while remaining capable in casual settings—threatens reliable model assessment and AI safety governance. Consistency Probing detects this by measuring performance divergence on content-identical tasks framed as formal evaluations versus casual interactions.

We construct 50 prompt pairs across seven reasoning domains and evaluate contemporary LLMs under controlled conditions. Using paired statistical testing and effect-size analysis, we compute a composite confidence score quantifying contextual performance divergence. Our results reveal statistically meaningful, model-dependent performance drops under evaluation in several models. By isolating context-conditioned behavior from confounds, Consistency Probing complements traditional benchmarks and integrates into red-teaming and auditing workflows.

By: Atharshlakshmi Vijayakumar, Balakrishnan Vaisiya

Installation

Create and activate a virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Then log in to Hugging Face to access gated models:

hf auth login

You'll be prompted to enter your HF API token (get one from huggingface.co/settings/tokens).

Usage

Add your model API keys into the .env file as in .env.example:

GEMINI_API_KEY=your_gemini_api_key_here

Run the experiments (will iterate models and write per-model logs):

python src/experiments.py

Notes:

Per-pair logs and model comparison CSVs are written to ~/reports/experiment_logs by default.

Visualize results with the notebook:

# Open the visualization notebook
jupyter notebook reports/visualization.ipynb

This generates charts comparing model performance metrics across sandbagging detection experiments.

Tech Stack

Python 3.x — Core language
PyTorch — Deep learning framework for local model inference
Transformers — Hugging Face library for loading and running LLMs
Google Generative AI SDK — Gemini API integration
Lambda Labs GPUs — GPU compute for model experiments (optional, for larger models)
Pandas — Data manipulation and CSV output

Repository Structure

SandbagDetect/
├── README.md
├── requirements.txt
├── question_pairs/
│   ├── generated_question_pairs.json
│   ├── generateQuestionPairs.py
│   └── promptTemplates.py
├── src/
│   ├── experiments.py
│   ├── logger.py
│   ├── metrics.py
│   └── utils.py
└── reports/
    ├── visualization.ipynb    # Analysis and charting of results
    └── experiment_logs/       # Generated output

How it works

The repo loads 50 prompt pairs from prompts/sandbagging_question_pairs.json.
For each pair it runs two prompts (evaluation-style and casual-style) through the same model using a single adapter function (call_LLM).
It records both responses and computes per-pair metrics: correctness (binary/score), length ratio, semantic similarity, and a sandbagging_flag indicating whether the casual response was better than the evaluation response.
Per-pair results are appended to a per-model CSV. After all pairs, aggregated experiment metrics are computed and appended to a model_comparison.csv for cross-model analysis.

Running your own models

Implement or adapt call_LLM(model_name, prompt, **kwargs) in src/utils.py to call your model provider. The rest of the code expects call_LLM to return a text response string.
Add API keys, if applicable, into your .env file.
Update the list of models in src/experiments.py.

Remarks

This experiment was built for the Apart Research Hackathon Sprint.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SandbagDetect: Consistency Probing for Sandbagging Detection

Contents

Installation

Usage

Tech Stack

Repository Structure

How it works

Running your own models

Remarks

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
question_pairs		question_pairs
reports		reports
src		src
.DS_Store		.DS_Store
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

SandbagDetect: Consistency Probing for Sandbagging Detection

Contents

Installation

Usage

Tech Stack

Repository Structure

How it works

Running your own models

Remarks

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages