CoT Oracle is a white-box chain-of-thought monitor built on Activation Oracles. The core model is Qwen3-8B with a LoRA adapter trained to read its own residual-stream activations and answer questions about the reasoning that produced them.
This README documents the pipeline the current code actually runs. Older docs elsewhere in the repo, especially under `src/evals/`, describe earlier or parallel experiments and are not part of the main training-time path. At a high level, one pass works like this:
- A source sequence is built from a question plus a chain-of-thought or other context text.
- Activation positions are chosen from that source sequence, usually by fixed stride and optionally by punctuation boundaries.
- Residual activations are extracted from the configured source layers with LoRA disabled.
- The oracle prompt is prefixed with one placeholder token per activation vector.
- Those activation vectors are injected back into the model at layer 1 at the placeholder positions.
- The LoRA-tuned model generates a natural-language answer about the reasoning process.
The main training and eval codepath uses:
- Source-of-truth task registry: `src/tasks.py`
- Unified HF/on-the-fly data loading: `src/data_loading.py`
- Training entrypoint: `src/train.py`
- Training-time eval loop: `src/eval_loop.py`
- Default config: `configs/train.yaml`
UV_PROJECT_ENVIRONMENT="$VENV_LOCAL/${PWD##*/}" uv sync
export AO_REPO_PATH="${AO_REPO_PATH:-$PWD/ao_reference}"

`src/core/ao_repo.py` looks for `nl_probes` first in `AO_REPO_PATH`, then in `./ao_reference`, then in `./activation_oracles`.
src/tasks.py currently defines 17 tasks total.
Trainable tasks:
`hint_admission`, `atypical_answer`, `reasoning_termination`, `answer_trajectory`, `futurelens`, `pastlens`, `correctness`, `decorative_cot`, `chunked_convqa`, `chunked_compqa`, `backtrack_prediction`, `sycophancy`, `probe_sycophancy`, `truthfulqa_hint_verbalized`, `truthfulqa_hint_unverbalized`
Eval-only tasks:
`rot13_reconstruction`, `sentence_insertion`
The default `configs/train.yaml` enables 13 of the 15 trainable tasks and also enables three auxiliary non-task sources: FineWeb context prediction, standard NLP classification, and LatentQA.
Most task data is normalized to a common shape before conversion to AO `TrainingDataPoint`s:
{
"task": str,
"prompt": str,
"target_response": str,
"context_input_ids": list[int] | None,
"context_positions": list[int] | None,
"layers": list[int] | None,
}

If `context_input_ids` is missing but `cot_text` is present, `prepare_context_ids()` reconstructs the chat-formatted sequence and computes activation positions at load time.
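For concreteness, here is what one normalized example might look like. All values are illustrative, not taken from a real dataset; only the field names come from the shape above:

```python
# A fully populated example: activations can be extracted directly.
example = {
    "task": "correctness",
    "prompt": "Did the reasoning above reach the correct final answer?",
    "target_response": "Yes.",
    "context_input_ids": [151644, 872, 198],  # token ids of the chat-formatted source (illustrative)
    "context_positions": [2, 5, 9],           # activation positions within that sequence
    "layers": [9, 18, 27],                    # residual-stream layers to read from
}

# A minimal example: prepare_context_ids() fills in the None fields
# at load time from the raw cot_text.
minimal = {
    "task": "correctness",
    "prompt": example["prompt"],
    "target_response": "Yes.",
    "context_input_ids": None,
    "context_positions": None,
    "layers": None,
    "cot_text": "First, compute 2 + 2 = 4. So the answer is 4.",
}
```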
The real training flow is:
- `src/train.py` parses CLI flags, merges one or more YAML configs, and lets CLI flags override config values.
- It resolves source layers from `activations.layers`, or from evenly spaced layer percentages if `--n-layers` is used.
- It loads the base model, enables gradient checkpointing if configured, and either:
  - resumes from an existing LoRA checkpoint,
  - initializes a fresh LoRA adapter, or
  - loads Adam's AO checkpoint as the starting adapter.
- It builds the training mixture:
  - HF-backed task datasets are loaded through `load_all_training_data()`.
  - `futurelens` and `pastlens` are generated on the fly from the corpus-v5 HF dataset.
  - Optional FineWeb, classification, and LatentQA examples are generated/loaded separately and appended.
- `prepare_context_ids()` tokenizes any examples that still only have `cot_text`, computes activation positions, and repeats positions across all configured layers.
- `dicts_to_training_data()` converts raw dicts into AO `TrainingDataPoint`s:
  - `position_mode=last_only` keeps only the final activation per layer.
  - `position_mode=stochastic` uses 50% last-only and 50% chi-squared position sampling, always including the final position.
  - `position_mode=all` keeps all computed positions.
  - `layer_dropout.train=true` samples a random non-empty subset of the configured layers per example.
- The training set is ordered according to `training.task_order`:
  - `shuffled`: mix everything together.
  - `sequential`: task-by-task in YAML order.
  - `interleaved`: round-robin task blocks sized so every task finishes at roughly the same point.
- In each train step:
  - activations are materialized on demand unless `--precompute` was used,
  - batches may be split by token budget to avoid OOM,
  - a steering hook injects the activation vectors at layer 1,
  - the model is trained with standard next-token loss on the oracle response,
  - metrics are logged to wandb.
- Checkpoints save LoRA weights plus `training_state.pt` (optimizer, scheduler, RNG state, wandb run metadata).
- The final checkpoint is optionally uploaded to HuggingFace if `HF_TOKEN` is set.
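The `interleaved` ordering can be approximated as follows. This is a sketch under the assumption that each task is cut into the same number of blocks (so larger tasks get proportionally larger blocks and all tasks end at about the same point); it is not the exact block-sizing code from `src/train.py`:

```python
def interleave(datasets: dict[str, list], n_blocks: int = 4) -> list:
    """Round-robin over per-task blocks sized so every task ends
    at roughly the same point in the training stream (sketch)."""
    order = []
    for b in range(n_blocks):
        for _name, items in datasets.items():
            # Split each task into n_blocks roughly equal slices and
            # emit slice b of every task before moving to slice b + 1.
            start = b * len(items) // n_blocks
            end = (b + 1) * len(items) // n_blocks
            order.extend(items[start:end])
    return order
```

With two tasks of 4 and 6 examples and `n_blocks=2`, each round emits 2 examples of the first task and 3 of the second, so both are exhausted together.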
The canonical launch command is:

python src/train.py --config configs/train.yaml

Multi-GPU training uses standard torchrun, for example:

torchrun --nproc_per_node=8 src/train.py --config configs/train.yaml

The maintained eval path is the training-time call from `src/train.py` into `src/eval_loop.py`.
At each eval event:
- `_run_unified_eval()` calls `run_eval()` with the configured eval task list.
- `run_eval()` loops over the tasks in `args.eval_tasks` (derived from `tasks.*.eval` in the YAML).
- For each task, `_eval_single_task()`:
  - loads the `test` split if available,
  - falls back to `train` only if no `test` split can be loaded,
  - generates `futurelens`/`pastlens` eval examples on the fly,
  - normalizes legacy field names,
  - computes missing `context_input_ids`/`context_positions`,
  - re-strides older precomputed examples to the current stride setting,
  - trims eval inputs to the last activation position per layer (minimal barrier context),
  - materializes activations once and caches them on CPU for reuse across later evals.
- The oracle generates answers with activation steering active.
- `score_task()` applies the task-specific scoring rule:
  - parser-based accuracy for structured binary tasks,
  - token F1 for generation tasks,
  - token match rate for reconstruction,
  - step accuracy for sentence insertion.
- The training loop logs:
  - scalar eval metrics,
  - per-task sample tables to wandb (`question`, `expected`, `predicted`, `correct`).
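The token-F1 rule for generation tasks is presumably close to the standard SQuAD-style metric; `token_f1` below is a minimal sketch of that metric, not the repo's exact scorer:

```python
from collections import Counter

def token_f1(predicted: str, expected: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a reference."""
    pred, gold = predicted.lower().split(), expected.lower().split()
    if not pred or not gold:
        return float(pred == gold)
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```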
There is no separate maintained top-level eval CLI for this unified path right now. The `src/evals/` directory contains older or specialized evaluation utilities, but the training loop itself uses `src/eval_loop.py`.
`configs/train.yaml` is the main control surface. The most important sections are:
- `tasks`: per-task sample counts and whether each task participates in eval
- `fineweb`, `classification`, `latentqa`: auxiliary data sources outside `src/tasks.py`
- `training`: optimizer, batch size, ordering, token budgets, prefetching behavior
- `activations`: source layers, stride, position sampling mode, layer dropout
- `model`: base model name, AO checkpoint, fresh-vs-resume adapter behavior
- `output`: checkpoint directory and wandb metadata
Important current behavior:
- `activations.stride` accepts either an integer or `"punctuation"` in the maintained training/eval path.
- The main train/eval batch sizes are no longer sourced from `configs/train.yaml`; they come from CLI flags or the parser defaults in `src/train.py`.
- `configs/train.yaml` keeps `eval.max_items_per_eval`, but eval/save cadence is computed dynamically inside `train()` for `shuffled`, `sequential`, and `interleaved` runs.
The current code pulls data from several places:
- HuggingFace task datasets from the repos listed in `src/tasks.py`
- Corpus-v5 on HuggingFace for `futurelens` and `pastlens`
- FineWeb and LMSYS chat streaming for auxiliary context prediction
- Standard NLP datasets for auxiliary classification (`sst2`, `ag_news`, `snli` by default)
- Local `ao_reference/datasets/latentqa_datasets/train` for LatentQA
Downloaded HF task JSONL files are cached under `COT_ORACLE_CACHE_DIR` if set, otherwise under `data/hf_cache`.
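The cache-directory fallback amounts to the following (a sketch; `hf_cache_dir` is a hypothetical name for the resolution logic):

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    """Cache location for downloaded task JSONL files:
    $COT_ORACLE_CACHE_DIR if set and non-empty, else ./data/hf_cache."""
    return Path(os.environ.get("COT_ORACLE_CACHE_DIR") or "data/hf_cache")
```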
- Eval activations are cached by task name within a single process because the base model is frozen during LoRA training. If you change the eval stride or layers within the same long-lived Python process, clear the cache or start a fresh process.
- `rot13_reconstruction` is skipped by default in the unified eval loop because it needs a different adapter setup.
- The top-level README that previously lived in this repo described a different, older pipeline; the source of truth is the current code listed above.