Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning

ExploreToM is the first framework to allow large-scale generation of diverse and challenging theory of mind data for robust training and evaluation. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios to stress test the limits of LLMs.

Running the whole data generation pipeline

Generate story contexts

python story_context_generator.py --num_elements_by_class 6 --num_contexts_to_generate 100

Run A* Search

for i in `seq 0 7` ; do python story_structure_searcher.py \
    --experiment_to_run search \
    --model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
    --model_access_method vllm-api \
    --a_star_neighbor_priority weight-goal4 \
    --model_generated_contexts_file "logs/model_generated_contexts_Llama-3.1-70B-Instruct_n_100_p_6_m_6_r_2_update_object_state_equiv_class_for_v1_dsl_wo_upsampling.jsonl" \
    --i $i & done

Infill generated stories

for i in `seq 0 7` ; do python story_structure_infiller.py --i $i & done
for i in `seq 0 7` ; do python story_structure_infiller.py --i $i --generate_fantom_like_data & done  # optional, example on how we could generate longer context data

Additional Resources

Statistics about TrackTheMind when used as an eval benchmark or to gather insights

Run all of these in order.

for i in `seq 0 7` ; do python story_structure_searcher.py \
    --experiment_to_run baseline \
    --model_name meta-llama/Meta-Llama-3.1-70B-Instruct \
    --model_access_method vllm-api \
    --model_generated_contexts_file "logs/model_generated_contexts_Llama-3.1-70B-Instruct_n_100_p_6_m_6_r_2_update_object_state_equiv_class_for_v1_dsl_wo_upsampling.jsonl" \
    --i $i & done
python compute_statistics.py --evaluate_cross_model_generations --model_name gpt-4o --model_access_method openai-azure-api
python compute_statistics.py --evaluate_cross_model_generations --model_name mistralai/Mixtral-8x7B-Instruct-v0.1 --model_access_method vllm-python
python compute_statistics.py --evaluate_cross_model_generations --model_name meta-llama/Meta-Llama-3.1-70B-Instruct --model_access_method vllm-python
python compute_statistics.py --evaluate_cross_model_generations
python compute_statistics.py

Some Functional Tests

python tests_belief_tracker.py
python tests_story_structure_infiller.py

How to load a model with VLLM

See all vllm args here: https://docs.vllm.ai/en/latest/models/engine_args.html

1. screen -S mainscreen  # this takes you to the main screen
2. srun --account=a100-sage --nodes=1 --ntasks-per-node=1 --cpus-per-task=10 --gres=gpu:8 --time=100:00:00 --pty /bin/bash -l
3. screen -S modelserve
4. source ttmenv/bin/activate
5. vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --download-dir /data/home/melaniesclar/.cache
6. <ctrl-A A D to get out>

Data Sample

You can find a data sample of ExploreToM for Llama-3.1-70B-Instruct here: https://huggingface.co/datasets/facebook/ExploreToM . Have in mind that ExploreToM is an adversarial data generation procedure and thus, if you wish to test another model, you should run this code and NOT simply evaluate on the data sample shown in the link.

Citation

If you found the paper or data helpful, consider citing it:

@inproceedings{
sclar2025explore,
title={Explore Theory of Mind: program-guided adversarial data generation for theory of mind reasoning},
author={Melanie Sclar and Jane Yu and Maryam Fazel-Zarandi and Yulia Tsvetkov and Yonatan Bisk and Yejin Choi and Asli Celikyilmaz},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=246rHKUnnf}
}

Licensing

See our LICENSE file for licensing details.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
images		images
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
belief_tracker.py		belief_tracker.py
cached_prompt_outputs.py		cached_prompt_outputs.py
compute_statistics.py		compute_statistics.py
story_context_generator.py		story_context_generator.py
story_structure_infiller.py		story_structure_infiller.py
story_structure_searcher.py		story_structure_searcher.py
tests_belief_tracker.py		tests_belief_tracker.py
tests_story_structure_infiller.py		tests_story_structure_infiller.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning

Running the whole data generation pipeline

Additional Resources

Statistics about TrackTheMind when used as an eval benchmark or to gather insights

Some Functional Tests

How to load a model with VLLM

Data Sample

Citation

Licensing

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Explore Theory-of-Mind: Program-Guided Adversarial Data Generation for Theory of Mind Reasoning

Running the whole data generation pipeline

Additional Resources

Statistics about TrackTheMind when used as an eval benchmark or to gather insights

Some Functional Tests

How to load a model with VLLM

Data Sample

Citation

Licensing

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages