SupraBench is the first benchmark for evaluating large language models on supramolecular host–guest chemistry reasoning. It comprises four fundamental tasks plus an auxiliary vision task, and ships a domain text corpus for domain-adaptive pretraining (DAPT).
Supramolecular chemistry studies non-covalent host–guest assemblies that underpin drug delivery, chemical sensing, and in-vivo toxin sequestration. Designing host–guest systems is slow (days of dry-lab verification per pair); SupraBench probes whether LLMs can reason about these systems directly.
- 📄 Paper: arXiv:2606.13477
- 🤗 Datasets: huggingface.co/SupraBench
- 💻 Code: github.com/Tianyi-Billy-Ma/SupraBench

| Dataset | Task | Description |
|---|---|---|
| SupraBench/bap | Binding Affinity Prediction | regress log Kₐ for a host–guest pair |
| SupraBench/tbs | Top-Binder Selection | pick the strongest binder among 4 candidate guests |
| SupraBench/sid | Solvent Identification | 6-way solvent classification from structure |
| SupraBench/hgd | Host-Guest Description | open-ended QA on host/guest property profiles |
| SupraBench/vqa | Molecular Identification | auxiliary vision task: identify a molecule from its image |
| SupraBench/EU-PMC | Text corpus | ~16M-token supramolecular corpus for DAPT |
| SupraBench/Binding-Affinity | Comprehensive anchor | per-record binding data + host/guest SMILES, 2D, 3D, environment |
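
To make the Top-Binder Selection framing concrete, here is a minimal sketch. It assumes each candidate guest comes with a log Kₐ value and that the strongest binder is the one with the highest log Kₐ. The helper name, guest names, and numbers are illustrative and are not taken from the dataset or the repository.

```python
def strongest_binder(log_ka_by_guest: dict[str, float]) -> str:
    """Return the guest with the highest log Ka (hypothetical helper)."""
    if not log_ka_by_guest:
        raise ValueError("need at least one candidate guest")
    # max() over the keys, ranked by their associated log Ka value.
    return max(log_ka_by_guest, key=log_ka_by_guest.get)


# Four made-up candidates for a single host; values are illustrative only.
candidates = {"guest_a": 3.2, "guest_b": 5.7, "guest_c": 4.1, "guest_d": 2.9}
best = strongest_binder(candidates)
print(best)
```

A model's answer on such an item would then be scored by checking whether it names the same guest.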
Each task config lives in configs/tasks/ (one YAML per task × prompting
strategy: base, fewshot, cot).
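
The naming scheme can be sketched as a cross product of tasks and strategies. This assumes the `<task>_<strategy>.yaml` pattern seen in `configs/tasks/bap_base.yaml` holds for every task, and that every task ships all three strategies; neither is confirmed here, so treat the list as candidates rather than a guarantee that each file exists.

```python
from itertools import product
from pathlib import Path

# Task keys as they appear in the dataset names; strategies as listed in the text.
TASKS = ["bap", "tbs", "sid", "hgd", "vqa"]
STRATEGIES = ["base", "fewshot", "cot"]


def task_config_paths(root: str = "configs/tasks") -> list[Path]:
    """Enumerate candidate config paths (assumed <task>_<strategy>.yaml naming)."""
    return [
        Path(root) / f"{task}_{strategy}.yaml"
        for task, strategy in product(TASKS, STRATEGIES)
    ]


paths = task_config_paths()
print(len(paths), paths[0].as_posix())
```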
We use uv for all Python dependency and interpreter management. Python is pinned in .python-version.

    # Install uv (once per machine)
    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Create the project venv and install base dependencies
    uv sync

    # Optional extras, install what you actually need:
    uv sync --extra api    # OpenAI / Anthropic / httpx (hosted-API inference)
    uv sync --extra hf     # torch / transformers / accelerate / peft (local inference + LoRA)
    uv sync --extra vllm   # vLLM
    uv sync --extra dev    # pytest / ruff

Secrets (API keys, HF tokens) are read from environment variables; never put them in YAML or code.
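
Reading a secret from the environment and failing early when it is missing might look like the sketch below. The helper `require_secret` and the variable name `EXAMPLE_API_KEY` are hypothetical; the actual variable names each backend expects are not stated here.

```python
import os


def require_secret(name: str) -> str:
    """Fetch a secret from the environment, failing loudly if it is absent."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Demo only: set a placeholder so the example runs end to end.
os.environ.setdefault("EXAMPLE_API_KEY", "placeholder-not-a-real-key")
api_key = require_secret("EXAMPLE_API_KEY")
```

Failing at startup keeps a missing key from surfacing as an opaque authentication error partway through a run.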

    uv run python src/main.py \
      --task-config configs/tasks/bap_base.yaml \
      --model-config configs/models/openrouter_qwen35_27b.yaml \
      --output-dir outputs/

Results land at outputs/<task>_<model>/{predictions.jsonl,metrics.json} (the entire outputs/ tree is gitignored).
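
A finished run can be inspected with a few lines of Python. This sketch assumes only what the file names imply: `predictions.jsonl` holds one JSON object per line and `metrics.json` holds a single JSON object. The field names inside either file are not specified here, so the helper (`load_run`, a hypothetical name) returns them untouched.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_run(run_dir: str | Path) -> tuple[list[dict], dict]:
    """Load predictions and metrics from one run directory."""
    run_dir = Path(run_dir)
    with (run_dir / "predictions.jsonl").open(encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not break parsing.
        predictions = [json.loads(line) for line in fh if line.strip()]
    metrics = json.loads((run_dir / "metrics.json").read_text(encoding="utf-8"))
    return predictions, metrics
```

Call it as `predictions, metrics = load_run("outputs/<task>_<model>")`, substituting the directory name your run produced.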

    SupraBench/
    ├── configs/
    │   ├── tasks/          # one YAML per task × prompting strategy
    │   ├── models/         # one YAML per evaluated model / backend
    │   └── train/          # continued-pretraining (DAPT/LoRA) recipes
    ├── src/
    │   ├── datasets/       # task-specific dataset loaders
    │   ├── eval/           # task-specific evaluators + metrics
    │   ├── inference/      # inference backends (OpenAI/OpenRouter, HF+PEFT, vLLM)
    │   ├── models/         # model-specific glue (chat templates, stop tokens)
    │   ├── train/          # continued-pretraining (LoRA) pipeline
    │   ├── extras/         # shared code-level constants
    │   ├── templates/      # prompt rendering helpers
    │   ├── scripts/        # data construction, plotting, one-off tools
    │   └── main.py         # entry point
    ├── scripts/            # result aggregation + analysis helpers
    ├── outputs/            # run artifacts (gitignored)
    └── pyproject.toml      # uv-managed dependencies

Most subdirectories carry their own README.md documenting the local contract.
Prompts for every model go through src/templates/ so layout stays identical
across evaluations. See the docstrings in
src/templates/template.py for generate_options
and generate_prompt usage.
Adding a task or a model is driven by config + registration, never by editing
main.py. Each key is resolved through a string-based registry populated by the
@register_dataset, @register_evaluator, and @register_backend decorators.
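
The general pattern behind a string-based registry can be sketched as follows. This is not the repository's implementation: the `DATASETS` dict, `build_dataset`, and `ToyDataset` are hypothetical names, and only the decorator name `register_dataset` comes from the text above.

```python
from __future__ import annotations

from typing import Callable

# Maps a config key (a plain string) to the class that implements it.
DATASETS: dict[str, type] = {}


def register_dataset(key: str) -> Callable[[type], type]:
    """Class decorator that records the class under `key` at import time."""
    def decorator(cls: type) -> type:
        if key in DATASETS:
            raise KeyError(f"dataset key {key!r} is already registered")
        DATASETS[key] = cls
        return cls  # the class itself is returned unchanged
    return decorator


@register_dataset("toy")
class ToyDataset:
    """Stand-in dataset used only for this example."""
    def __len__(self) -> int:
        return 0


def build_dataset(key: str):
    """Resolve a key read from a YAML config to a dataset instance."""
    if key not in DATASETS:
        raise KeyError(f"unknown dataset key {key!r}; registered: {sorted(DATASETS)}")
    return DATASETS[key]()
```

Because the decorator runs when the module is imported, a class that is never imported never reaches the registry, which is why new modules have to be imported from their package `__init__.py`.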
Add a task:
- `configs/tasks/<task>.yaml` with `dataset:` and `evaluator:` keys.
- `src/datasets/<task>.py`: subclass `SupraDataset`, decorate with `@register_dataset("<key>")`.
- `src/eval/<task>.py`: subclass `Evaluator`, decorate with `@register_evaluator("<key>")`.
- Import both from their package `__init__.py` so registration runs.
- Smoke test: `uv run python src/main.py --task-config ... --limit 2`.

Add a model:
- `configs/models/<model>.yaml` with `backend:` + `model_id:` + `generation:`.
- New delivery mechanism → add a backend under `src/inference/<backend>.py` decorated with `@register_backend(...)`.
- Model quirks (chat template, stop tokens, response scrubbing) → add a helper under `src/models/<model>.py`.

Tianyi Ma¹*, Yijun Ma¹*, Zehong Wang¹†, Weixiang Sun¹, Ziming Li², Connor R. Schmidt¹, Chuxu Zhang², Matthew J. Webber¹, Yanfang Ye¹†
¹ University of Notre Dame ² University of Connecticut
*Equal contribution †Corresponding authors ({tma2, yye7}@nd.edu)
SupraBench (code and curated benchmark data) is released under CC BY 4.0.
Upstream data: binding records are derived from SupraBank (CC-BY-4.0); the text corpus is built from open-access Europe PMC articles subject to each article's individual license; molecular structures use PubChem and OPSIN.
If you use SupraBench, please cite the paper and the upstream data sources.

    @article{ma2026suprabench,
      title         = {SupraBench: A Benchmark for Supramolecular Host--Guest Chemistry Reasoning in Large Language Models},
      author        = {Ma, Tianyi and Ma, Yijun and Wang, Zehong and Sun, Weixiang and Li, Ziming and Schmidt, Connor R. and Zhang, Chuxu and Webber, Matthew J. and Ye, Yanfang},
      year          = {2026},
      eprint        = {2606.13477},
      archivePrefix = {arXiv},
      journal       = {arXiv preprint arXiv:2606.13477}
    }