Skip to content

atharshlakshmi/SandbagDetect

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SandbagDetect: Consistency Probing for Sandbagging Detection

Sandbagging—where AI systems strategically underperform under evaluation while remaining capable in casual settings—threatens reliable model assessment and AI safety governance. Consistency Probing detects this by measuring performance divergence on content-identical tasks framed as formal evaluations versus casual interactions.

We construct 50 prompt pairs across seven reasoning domains and evaluate contemporary LLMs under controlled conditions. Using paired statistical testing and effect-size analysis, we compute a composite confidence score quantifying contextual performance divergence. Our results reveal statistically meaningful, model-dependent performance drops under evaluation in several models. By isolating context-conditioned behavior from confounds, Consistency Probing complements traditional benchmarks and integrates into red-teaming and auditing workflows.

By: Atharshlakshmi Vijayakumar, Balakrishnan Vaisiya

Contents

Installation

Create and activate a virtual environment and install dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Then log in to Hugging Face to access gated models:

hf auth login

You'll be prompted to enter your HF API token (get one from huggingface.co/settings/tokens).

Usage

  1. Add your model API keys into the .env file as in .env.example:
GEMINI_API_KEY=your_gemini_api_key_here
  1. Run the experiments (will iterate models and write per-model logs):
python src/experiments.py

Notes:

  • Per-pair logs and model comparison CSVs are written to ~/reports/experiment_logs by default.
  1. Visualize results with the notebook:
# Open the visualization notebook
jupyter notebook reports/visualization.ipynb

This generates charts comparing model performance metrics across sandbagging detection experiments.

Tech Stack

  • Python 3.x — Core language
  • PyTorch — Deep learning framework for local model inference
  • Transformers — Hugging Face library for loading and running LLMs
  • Google Generative AI SDK — Gemini API integration
  • Lambda Labs GPUs — GPU compute for model experiments (optional, for larger models)
  • Pandas — Data manipulation and CSV output

Repository Structure

SandbagDetect/
├── README.md
├── requirements.txt
├── question_pairs/
│   ├── generated_question_pairs.json
│   ├── generateQuestionPairs.py
│   └── promptTemplates.py
├── src/
│   ├── experiments.py
│   ├── logger.py
│   ├── metrics.py
│   └── utils.py
└── reports/
    ├── visualization.ipynb    # Analysis and charting of results
    └── experiment_logs/       # Generated output

How it works

  • The repo loads 50 prompt pairs from prompts/sandbagging_question_pairs.json.
  • For each pair it runs two prompts (evaluation-style and casual-style) through the same model using a single adapter function (call_LLM).
  • It records both responses and computes per-pair metrics: correctness (binary/score), length ratio, semantic similarity, and a sandbagging_flag indicating whether the casual response was better than the evaluation response.
  • Per-pair results are appended to a per-model CSV. After all pairs, aggregated experiment metrics are computed and appended to a model_comparison.csv for cross-model analysis.

Running your own models

  • Implement or adapt call_LLM(model_name, prompt, **kwargs) in src/utils.py to call your model provider. The rest of the code expects call_LLM to return a text response string.
  • Add API keys, if applicable, into your .env file.
  • Update the list of models in src/experiments.py.

Remarks

This experiment was built for the Apart Research Hackathon Sprint.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors