Evaluating and supplementing the stability of AI-performed data science.
We introduce a framework for evaluating the stability and reliability of LLM-generated statistical analyses. It builds on the BLADE benchmark and extends it with perturbation experiments that test how consistent models remain when analyzing perturbed versions of the same dataset.
- Python 3.10+
- Poetry package manager
- Node.js (for Codex CLI)
- API keys for LLM providers (OpenAI, Azure OpenAI, etc.)
```bash
# Clone the repository
git clone <repo-url>
cd stat-genie

# Install Poetry (if not installed)
curl -sSL https://install.python-poetry.org | python3 -

# Install Python dependencies
poetry install
```

Codex CLI is required for agentic experiments:
```bash
# Option 1: Install in project (already in package.json)
npm install

# Option 2: Install globally
npm install -g @openai/codex

# Option 3: Install via Homebrew (macOS)
brew install codex
```

Set your OpenAI API key:

```bash
export OPENAI_API_KEY="sk-..."

# Or create a .env file in the project root:
# OPENAI_API_KEY=sk-...
```

For Codex CLI + Azure OpenAI using Entra ID (Azure AD), the recommended path is the setup script under `agentic/experiments/scripts/`.
Prerequisites:

```bash
# 1. Install Azure CLI and log in
az login

# 2. Install the azure-identity Python package
pip install azure-identity

# 3. Ensure you have the "Cognitive Services OpenAI User" role on your Azure OpenAI resource
```

Recommended setup (Codex CLI):
- Export your Azure settings (the deployment is the Azure *deployment* name, not the model name):

```bash
export AZURE_RESOURCE_NAME="myopenai"          # e.g., "myopenai"
export AZURE_DEPLOYMENT_NAME="gpt-5.2-codex"   # e.g., "gpt-5.2-codex"

# Optional overrides:
# export AZURE_API_VERSION="2025-04-01-preview"
# export AZURE_WIRE_API="responses"
```

- Source the setup script (creates `~/.codex/config.toml` and exports a token into your shell):

```bash
cd agentic/experiments
source scripts/setup-azure-codex.sh
```

- Sanity-check the Codex CLI profile:

```bash
npx codex --profile azure "Say hello from Azure"
```

- For subsequent runs, refresh the token (tokens expire after ~1 hour):

```bash
source scripts/refresh-azure-token.sh
```

Manual configuration (if you don't want the script):
Create `~/.codex/config.toml`:

```toml
model_provider = "azure"
model = "your-deployment-name"  # Must be the Azure DEPLOYMENT name, not the model name

[model_providers.azure]
name = "Azure OpenAI"
base_url = "https://YOUR_RESOURCE.openai.azure.com/openai"  # Must include /openai
query_params = { api-version = "2025-04-01-preview" }
wire_api = "responses"
env_key = "AZURE_OPENAI_API_KEY"

[profiles.azure]
model_provider = "azure"
model = "your-deployment-name"
```

Then get a token and export it (Azure CLI login required):

```bash
export AZURE_OPENAI_API_KEY="$(python3 -c '
from azure.identity import AzureCliCredential
cred = AzureCliCredential()
print(cred.get_token("https://cognitiveservices.azure.com/.default").token)
')"
```

Set API keys as environment variables or in a `.env` file:
```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GROQ_API_KEY=gsk-...
TOGETHER_API_KEY=...
GEMINI_API_KEY=...
```

Configure LLM providers and models:
```yaml
provider: openai
model: gpt-5-mini

providers:
  openai:
    name: OpenAI
    models:
      - name: gpt-5-mini
        model:
          model: gpt-5-mini
        api_key_env_name: OPENAI_API_KEY
```

Supported providers: `openai`, `azureopenai`, `anthropic`, `groq`, `mistral`, `together`, `gemini`, `huggingface`.
```
stat-genie/
├── blade/                         # Copy of the BLADE repo
│   ├── run_gen_analyses.py        # Generate LLM analyses
│   └── run_get_eval.py            # Evaluate analyses
├── config/                        # Configuration files
│   ├── llm_config.yml             # LLM provider config (legacy)
│   └── llm_eval_config.yml        # Evaluation LLM config (legacy)
├── experiments/                   # Non-agentic experiments using the BLADE harness
│   ├── scripts/                   # Perturbation experiment scripts
│   │   ├── run_analysis.py        # Single analysis runner
│   │   ├── run_analysis.sh        # Shell wrapper
│   │   ├── run_analysis_master.sh # Multi-dataset runner
│   │   ├── run_pairwise_eval.py   # Pairwise evaluation
│   │   └── run_pairwise_eval.sh   # Shell wrapper
│   └── outputs/                   # Experiment outputs
├── agentic/                       # Agentic (Codex) experiments
│   ├── confidence_experiments/    # Experiments eliciting confidence scores from Codex
│   │   ├── scripts/               # Runner and aggregation scripts
│   │   ├── outputs/               # Per-run Codex outputs
│   │   ├── aggregated_results/    # Aggregated CSVs
│   │   └── insights/              # Analysis notebooks and figures
│   ├── scalar_experiments/        # Experiments eliciting scalar yes/no conclusions
│   │   ├── scripts/               # Runner and aggregation scripts
│   │   ├── outputs/               # Per-run Codex outputs
│   │   ├── aggregated_results/    # Aggregated CSVs
│   │   └── insights/              # Analysis notebooks and figures
│   └── human_experiments/         # Human baseline experiment results
├── src/stat_genie/                # Source code
│   └── blade_pipeline/            # Our additions/modifications to BLADE code
│       ├── additions/             # Custom additions to the BLADE pipeline
│       │   ├── analysis/          # Conclusion writing and model output extraction
│       │   ├── perturbations/     # Perturbation implementations
│       │   ├── eval/              # Evaluation utilities (extraction, judging)
│       │   └── prompt/            # Prompt construction utilities
│       ├── baselines/             # Baseline implementations
│       ├── datasets/              # Dataset files
│       └── llms/                  # LLM utilities
├── pyproject.toml                 # Poetry configuration
└── README.md                      # This file
```
| Dataset | Description |
|---|---|
| `affairs` | Extramarital affairs study |
| `amtl` | AMTL dataset |
| `boxes` | Boxes experiment |
| `caschools` | California schools data |
| `compas` | COMPAS recidivism data |
| `crofoot` | Crofoot study |
| `fish` | Fish dataset |
| `hurricane` | Hurricane analysis |
| `mortgage` | Mortgage data |
| `panda_nuts` | Panda nuts experiment |
| `reading` | Reading study |
| `soccer` | Soccer data |
| `teachingratings` | Teaching ratings data |
| `toy` | Toy dataset for testing |
Both experiment types (scalar and confidence) follow the same three-phase pattern: run → fix → aggregate. All commands should be run from the respective experiment directory.
Codex performs a full statistical analysis and writes a scalar yes/no conclusion to conclusion.txt in each run's output subdirectory.
Phase 1 — Run analyses:

```bash
cd agentic/scalar_experiments

# SLURM (recommended for full experiment):
sbatch scripts/analysis-runner.sh

# Local:
bash scripts/analysis-runner-local.sh

# Single run:
bash scripts/analysis.sh <dataset> <distribution> <perturbation> <run_number>

# e.g.:
bash scripts/analysis.sh caschools null none 1
bash scripts/analysis.sh caschools alt anonymize 3
```

For PVE (proportion of variance explained) experiments across a range of signal strengths:

```bash
sbatch scripts/pve-analysis-runner.sh   # SLURM
bash scripts/pve-runner-local.sh        # local
```

Phase 2 — Fix broken conclusions:

Some runs produce malformed conclusion.txt files even though the agent is instructed to emit valid JSON. Fix them before aggregating:

```bash
bash scripts/fix-conclusions.sh         # null/alt distributions
bash scripts/fix-conclusions.sh --pve   # pve distribution
```

Phase 3 — Aggregate:

```bash
bash scripts/aggregate_conclusions.sh       # null/alt → aggregated_results/aggregated_results.csv
bash scripts/aggregate_pve_conclusions.sh   # pve → aggregated_results/aggregated_pve_results.csv
```

Optional — Calibration simulation:

```bash
sbatch scripts/run_calibration.sh   # SLURM (32 CPUs, 64G RAM)
# Results saved to insights/calibration_results_*.npz
```

Codex is shown the output of a prior scalar analysis and asked to produce a calibrated confidence score.
Phase 1 — Run confidence elicitations:

```bash
cd agentic/confidence_experiments

# SLURM:
sbatch scripts/confidence-runner.sh

# Local:
bash scripts/analysis-runner-local.sh

# Single run:
bash scripts/confidence.sh <dataset> <distribution> <perturbation> <run_number>

# e.g.:
bash scripts/confidence.sh caschools null anonymize 1
```

Phase 2 — Fix broken conclusions:

```bash
bash scripts/fix-conclusions.sh --pve
```

Phase 3 — Aggregate:

```bash
bash scripts/aggregate_pve_conclusions.sh   # → aggregated_results/aggregated_pve_results.csv
```

Optional — Calibration simulation:

```bash
sbatch scripts/run_calibration.sh
# Results saved to insights/calibration_results_*.npz
```

Each run produces a subdirectory under outputs/:
```
outputs/<dataset>/<distribution>/<perturbation>/run<N>/
    AGENTS.md             # Codex prompt
    conclusion.txt        # Model's yes/no answer (scalar) or confidence score
    *.py                  # Generated analysis code
    agent-analysis.out    # Raw Codex session log
```
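As an illustration of how results laid out this way can be collected, here is a stdlib-only sketch (not the project's actual aggregation script) that walks the tree and gathers one CSV row per run:

```python
import csv
from pathlib import Path

def collect_conclusions(root: str = "outputs", out_csv: str = "conclusions.csv") -> int:
    """Walk outputs/<dataset>/<distribution>/<perturbation>/run<N>/ and
    write one CSV row per run directory that contains a conclusion.txt."""
    rows = []
    for conclusion in Path(root).glob("*/*/*/run*/conclusion.txt"):
        run_dir = conclusion.parent
        # The three path components above run<N> identify the run.
        dataset, distribution, perturbation = run_dir.parts[-4:-1]
        rows.append({
            "dataset": dataset,
            "distribution": distribution,
            "perturbation": perturbation,
            "run": run_dir.name.removeprefix("run"),
            "conclusion": conclusion.read_text().strip(),
        })
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["dataset", "distribution", "perturbation", "run", "conclusion"]
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Runs missing a `conclusion.txt` are simply skipped, which is why the fix-conclusions phase matters before aggregating.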
The experiments/scripts/ directory contains scripts for running perturbation experiments to evaluate LLM stability.
| Type | Description |
|---|---|
| `noperturb` | No perturbation (baseline) |
| `anonymize` | Anonymize feature names |
| `shuffle_names` | Shuffle feature names |
| `add_features` | Add random features |
| `replace_with_rvs` | Replace data with random values |
| `positive_leading_statement` | Add positive framing to task |
| `negative_leading_statement` | Add negative framing to task |
| `replace_and_positive_statement` | Combined replacement + positive framing |
```bash
poetry run python experiments/scripts/run_analysis.py \
    --dataset caschools \
    --analysis-num 1 \
    --perturbation-type noperturb \
    --llm-provider openai \
    --llm-model gpt-5-mini \
    --num-runs 5
```

Options:
| Flag | Description | Default |
|---|---|---|
| `--dataset` | Dataset name | Required |
| `--analysis-num` | Analysis number (1-8) | Required |
| `--perturbation-type` | Perturbation to apply | Required |
| `--llm-provider` | LLM provider | `openai` |
| `--llm-model` | Model name | `gpt-5-mini` |
| `--num-runs` | Number of runs | 3 |
| `--use-cache` | Enable LLM caching | False |
| `--use-agent` | Use agent mode | False |
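If you prefer to drive a sweep from Python instead of the shell wrappers below, the full command matrix can be built from the flags in the table above. A hedged sketch (`build_commands` is a hypothetical helper; launching the commands, e.g. via `subprocess.run`, is left out):

```python
from itertools import product

PERTURBATIONS = [
    "noperturb", "anonymize", "shuffle_names", "add_features",
    "replace_with_rvs", "positive_leading_statement",
    "negative_leading_statement", "replace_and_positive_statement",
]

def build_commands(dataset: str, analysis_nums=range(1, 9), num_runs: int = 5):
    """Build one run_analysis.py invocation per (analysis, perturbation) pair."""
    commands = []
    for analysis_num, perturbation in product(analysis_nums, PERTURBATIONS):
        commands.append([
            "poetry", "run", "python", "experiments/scripts/run_analysis.py",
            "--dataset", dataset,
            "--analysis-num", str(analysis_num),
            "--perturbation-type", perturbation,
            "--num-runs", str(num_runs),
        ])
    return commands
```

With 8 analyses and 8 perturbation types this yields 64 invocations per dataset.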
```bash
# Run all 8 perturbation types for a dataset
bash experiments/scripts/run_analysis.sh caschools

# Submit SLURM jobs for all datasets
bash experiments/scripts/run_analysis_master.sh
```

After running analyses, evaluate pairwise similarity across perturbations:

```bash
poetry run python experiments/scripts/run_pairwise_eval.py \
    --dataset caschools \
    --num-multiruns 5 \
    --llm-provider openai \
    --llm-model gpt-5-mini
```

Or use the shell script:

```bash
bash experiments/scripts/run_pairwise_eval.sh caschools
```

The agentic/ directory contains three experiment types, each in its own subdirectory. All experiments use Codex (via `npx codex exec`) and follow a similar structure: a runner script dispatches per-run jobs, each job sets up a subdirectory under outputs/ and runs Codex against an AGENTS.md prompt, and aggregation scripts collect results into CSVs under aggregated_results/.
Codex performs a full statistical analysis and produces a scalar yes/no conclusion for each research question.
Run a single analysis:

```bash
cd agentic/scalar_experiments

# For Azure: refresh the token first
source scripts/token-refresh-helper.sh

bash scripts/analysis.sh <dataset> <distribution> <perturbation> <run_number>

# Example:
bash scripts/analysis.sh caschools null none 1
```

`<distribution>` is `null`, `alt`, or `pve`. `<perturbation>` is `none` or one of the perturbation types (e.g. `anonymize`, `shuffle_names`, `add_features`, `positive_leading_statement`, `negative_leading_statement`).
Run all analyses (SLURM):

```bash
cd agentic/scalar_experiments
sbatch scripts/analysis-runner.sh

# Or locally:
bash scripts/analysis-runner-local.sh
```

Aggregate results:

```bash
cd agentic/scalar_experiments
bash scripts/aggregate_conclusions.sh
```

Codex is given the output of a prior scalar analysis and asked to produce a calibrated confidence score for its conclusion.
Run a single confidence elicitation:

```bash
cd agentic/confidence_experiments
bash scripts/confidence.sh <dataset> <distribution> <perturbation> <run_number>

# Example:
bash scripts/confidence.sh caschools null anonymize 1
```

Run all (SLURM):

```bash
cd agentic/confidence_experiments
sbatch scripts/confidence-runner.sh
```

Aggregate results:

```bash
cd agentic/confidence_experiments
bash scripts/aggregate_pve_conclusions.sh
```

The human_experiments/ directory contains human baseline results (CSV) and an analysis notebook (`insights.ipynb`) for comparison against the agentic experiments.
For all experiment types, authenticate before running:

```bash
# Log in (once per session)
az login

# Refresh the token (tokens expire after ~1 hour)
source scripts/token-refresh-helper.sh
```

The shell scripts support both local execution and SLURM job submission.
Local execution (no SLURM):

```bash
# Run directly with bash
bash experiments/scripts/run_analysis.sh caschools
bash experiments/scripts/run_analysis_master.sh
bash experiments/scripts/run_eval_master.sh
```

SLURM cluster submission:

```bash
# Submit as SLURM jobs (requires a SLURM environment)
sbatch experiments/scripts/run_analysis.sh caschools
sbatch experiments/scripts/run_analysis_master.sh
sbatch experiments/scripts/run_eval_master.sh
```

Note: `sbatch` is only available on HPC clusters with SLURM installed. Use `bash` on local machines.
See the examples/ directory for Jupyter notebooks demonstrating various use cases:

- `examples/affairs/` - Affairs dataset analysis
- `examples/caschools/` - California schools analysis
- `examples/fish/` - Fish dataset analysis
- `examples/using_custom_prompts/` - Custom prompt examples
- `examples/using_gpt5/` - GPT-5 usage examples
- Poetry not found: install Poetry with `curl -sSL https://install.python-poetry.org | python3 -`
- API key not set: ensure `OPENAI_API_KEY` is exported or set in `.env`
- Module not found: run `poetry install` to install dependencies
- SLURM errors: ensure you're submitting from the project root directory
- Azure token expired: re-run `source scripts/refresh-azure-token.sh`
Check these log files for debugging:

- `llm.log` - LLM API calls and responses
- `run.log` - General execution logs
- `out/*.log` - SLURM job outputs (when using SLURM)