This software project accompanies the research paper:
GenCtrl: A Formal Controllability Toolkit for Generative Models
Emily Cheng*, Carmen Amo Alonso, Federico Danieli, Arno Blaas, Luca Zappella, Pau Rodriguez and Xavier Suau.
GenCtrl is a research toolkit that provides a formal framework for measuring and understanding the controllability of generative AI models. It helps answer critical questions like:
- Can I reliably make an LLM generate text with specific properties (length, formality, structure)?
- Can I control what appears in AI-generated images (number of objects, positioning, saturation)?
- How controllable is Model A compared to Model B?
- Formal Guarantees: Provides probably-approximately-correct (PAC) bounds for controllable set estimates.
- Generic Approach: Works with any generative model: LLMs, text-to-image models, or custom systems.
- Distribution-Free: Makes minimal assumptions (only requires bounded outputs).
- Extensible: Easy to implement custom controllability tests for your specific needs.
GenCtrl frames human-model interaction as a control process. Given an initial state (e.g., a prompt) and a space of possible inputs (e.g., modifications to the prompt), the toolkit estimates which target outputs are achievable with formal probabilistic guarantees.
Traditional approaches ask: "Can this model do X?" GenCtrl asks: "Under what conditions can this model reliably do X, and with what probability?"
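The control framing above can be sketched as a simple sampling loop. This is illustrative pseudocode of the idea, not GenCtrl's actual API: the function name and signature here are invented for exposition.

```python
import random

# Conceptual sketch of the control framing (illustrative, not GenCtrl's API):
# starting from an initial state, sample control inputs and record which
# target outputs the model actually reaches.
def estimate_reachable_outputs(model, initial_state, input_space, output_map, n_samples):
    reached = set()
    for _ in range(n_samples):
        u = random.choice(input_space)   # sample an input (e.g. a prompt modification)
        y = model(initial_state, u)      # query the generative model
        reached.add(output_map(y))       # map the raw output to the target property
    return reached
```

GenCtrl's contribution is choosing `n_samples` so that the estimated reachable set comes with a formal probabilistic guarantee, rather than being an anecdotal sample.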
- Python 3.11
- uv package manager

1. Clone the repository:

   ```bash
   git clone https://github.com/apple/ml-genctrl
   cd ml-genctrl
   ```

2. Install uv:

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH"
   source ~/.bashrc
   ```

3. Set up the environment:

   ```bash
   uv sync
   source .venv/bin/activate
   # Optional: Add your Hugging Face token for gated models
   export HF_TOKEN=<your_huggingface_token>
   ```
Test whether an LLM can generate text with a specific number of characters:
```bash
python -m scripts.run --config-name llm_num_chars output_dir=/tmp output_file=myexperiment.json time_steps=5
```

This will create a results file at /tmp/myexperiment.json with controllability metrics and estimates.
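The results file can be inspected programmatically. A minimal sketch using only the standard library; the fields in the demo dict below are hypothetical, so print the top-level keys of your own file to discover the actual schema.

```python
import json
import os
import tempfile

def load_results(path):
    """Load a GenCtrl results JSON file into a plain dict."""
    with open(path) as f:
        return json.load(f)

# Demonstration with a synthetic file; a real run writes e.g.
# /tmp/myexperiment.json. The fields below are hypothetical --
# inspect your own output for the real schema.
demo = {"time_steps": 5, "metrics": {"coverage": 0.8}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    tmp_path = f.name

results = load_results(tmp_path)
print(sorted(results.keys()))
os.unlink(tmp_path)
```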
GenCtrl includes pre-configured controllability tests for common scenarios:
| Test | Config File | Description |
|---|---|---|
| Character Count | llm_num_chars.yaml | Generate text with a specific number of characters. |
| Even/Odd Length | llm_even_odd.yaml | Control whether output has even or odd length. |
| Average Word Length | llm_avg_word_length.yaml | Generate text with specific average word length. |
| Formality | llm_formality.yaml | Control the formality level of generated text. |
Example:
```bash
python -m scripts.run --config-name llm_formality output_dir=/tmp output_file=formality_test.json time_steps=5
```

You can override config parameters directly from the command line:

```bash
python -m scripts.run --config-name llm_num_chars model_name=google/gemma-3-4b-it time_steps=5
```

| Test | Config File | Description |
|---|---|---|
| Object Count | t2i_num_objects.yaml | Control the number of objects in generated images. |
| Object Position | t2i_pos_objects.yaml | Control where objects appear in images. |
| Saturation | t2i_saturation.yaml | Control the color saturation of images. |
Example:
```bash
python -m scripts.run --config-name t2i_num_objects time_steps=1  # Always use time_steps=1 for T2I
```

GenCtrl uses a Task class as the main abstraction for defining controllability tests. Creating your own test involves subclassing Task and implementing its required methods:
To create a new controllability test, you need to:

1. Create a Task subclass in tasks.py that implements these required abstract methods:
   - name(property): Return the base task name (e.g., "num_chars", "even_odd").
   - get_input_space(**kwargs): Return the input space specification (template and distributions).
   - get_output_map(**kwargs): Return a callable that evaluates model outputs.
   - get_output_space(**kwargs): Return the set/list of valid output values.

2. Optionally override these methods for advanced functionality:
   - get_initial_states(**kwargs): Customize starting conditions (default uses factory).
   - get_feedback_function(**kwargs): Add dialogue/feedback support (default: None).
   - get_value_extractor(): Extract target values from input strings (default: None).

3. Register your task in the TASK_REGISTRY dictionary at the bottom of tasks.py.

4. Create a configuration file in configs/ that specifies:
   - Task name and parameters.
   - Model configuration.
   - Controllability test parameters (confidence level δ, target outputs, etc.).
See existing task implementations in tasks.py (e.g., NumCharsTask, EvenOddTask) for complete examples.
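The steps above can be sketched as follows. The stand-in Task base class and the NumWordsTask example are illustrative assumptions, not the real classes in tasks.py; they only mirror the method names listed above, so treat the return shapes as placeholders.

```python
from abc import ABC, abstractmethod

# Minimal stand-in for GenCtrl's Task base class -- the real class lives in
# tasks.py and its exact interface may differ from this sketch.
class Task(ABC):
    @abstractmethod
    def name(self, property): ...
    @abstractmethod
    def get_input_space(self, **kwargs): ...
    @abstractmethod
    def get_output_map(self, **kwargs): ...
    @abstractmethod
    def get_output_space(self, **kwargs): ...

class NumWordsTask(Task):
    """Hypothetical test: can the model hit an exact word count?"""

    def name(self, property):
        return "num_words"

    def get_input_space(self, **kwargs):
        # Prompt template plus a distribution over target values (illustrative shape).
        return {"template": "Write a sentence with exactly {n} words.",
                "distributions": {"n": list(range(3, 11))}}

    def get_output_map(self, **kwargs):
        # Evaluate a model output: here, count its words.
        return lambda text: len(text.split())

    def get_output_space(self, **kwargs):
        return list(range(3, 11))

# Registration would then happen at the bottom of tasks.py:
# TASK_REGISTRY["num_words"] = NumWordsTask
```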
Once you've configured a test, GenCtrl automatically:
- Computes sample complexity (m, k parameters) to guarantee results with confidence level δ.
- Samples inputs from the configured input space.
- Collects model outputs for each input.
- Estimates the controllable set with formal guarantees.
- Computes calibration metrics to evaluate controllability.
The result is a formal, quantitative assessment of what the model can reliably achieve.
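For intuition on the sample-complexity step: with bounded outputs and no distributional assumptions, a two-sided Hoeffding bound illustrates how the sample count m scales with the tolerance and confidence level. This is a generic PAC calculation for intuition only, not necessarily the exact bound GenCtrl implements.

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest m such that the mean of m bounded [0, 1] samples is within
    epsilon of its expectation with probability at least 1 - delta
    (two-sided Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# Tightening delta is cheap (logarithmic), while halving epsilon
# quadruples the number of samples required.
print(hoeffding_sample_size(0.1, 0.05))
```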
Compare multiple models or configurations using the built-in plotting tools:
```bash
# Run experiments with different models
python -m scripts.run --config-name llm_num_chars model_name=google/gemma-3-4b-it time_steps=5 output_dir=/tmp output_file=gemma.json
python -m scripts.run --config-name llm_num_chars model_name=Qwen/Qwen3-4B time_steps=5 output_dir=/tmp output_file=qwen.json

# Plot trajectories showing what outputs were reached
python -m scripts.plots.plot_trajectories --json /tmp/gemma.json /tmp/qwen.json --outfile fig_trajectories.png

# Plot calibration metrics for a specific timestep (-1 = last timestep)
python -m scripts.plots.plot_metrics --json /tmp/gemma.json /tmp/qwen.json --time-step -1 --outfile fig_metrics.png
```

Trajectory Plot (fig_trajectories.png):

Metrics Plot (fig_metrics.png):

Note: The plotting scripts also save numerical results as CSV files (e.g., fig_trajectories.csv) for further analysis.
GenCtrl includes a test suite to validate task implementations and core functionality:
```bash
# Run task validation tests
# We use mock models for tests to prevent downloads and inference cost, so the results will not be meaningful.
pytest tests/test_task_runs.py -v
```

See the LICENSE file for details.
If you use GenCtrl in your research, please cite:
```bibtex
@article{cheng-genctrl,
  title={GenCtrl -- A Formal Controllability Toolkit for Generative Models},
  author={Cheng, Emily and Amo Alonso, Carmen and Danieli, Federico and Blaas, Arno and Zappella, Luca and Rodriguez, Pau and Suau, Xavier},
  journal={https://arxiv.org/abs/2601.05637},
  year={2025}
}
```

This work was conducted at Apple Machine Learning Research.

