Evaluates whether an audio–language model should proactively speak from audio-only cues (no transcript)—and, when it should, whether the reply is useful.
| Capability | Description |
|---|---|
| Decision boundary | Distinguish when to assist (RESPOND) vs stay silent (IGNORE) |
| Reply quality | For RESPOND items, whether the reply is relevant and helpful (optional semantic match) |
Each sample has a ground-truth decision. The RESPOND subset includes reference replies for optional semantic scoring.
| Resource | Link |
|---|---|
| Code & manifests | github.com/masaz14/Proactive-Sound-Effect-Benchmark |
| Audio (dataset) | [huggingface.co/datasets/masaz14/Proactive-Sound-Effect-Benchmark](https://huggingface.co/datasets/masaz14/Proactive-Sound-Effect-Benchmark) |
The Hugging Face layout matches manifest path fields (root folder: proactive-sound-effect).
Download (login may be required for gated assets):

    pip install -U huggingface_hub
    hf download masaz14/Proactive-Sound-Effect-Benchmark \
      --repo-type=dataset \
      --local-dir ./Proactive-Sound-Effect-Benchmark-data

The benchmark covers 17 subcategories across 6 domains (daily living, human states, traffic, environment, music, and equipment), with a focus on proactive assistance and safety.
Subcategory counts:
| Subcategory | Count | Domain |
|---|---|---|
| Daily Affairs | 126 | Daily Living |
| House Equipment States | 97 | Equipment |
| Physiological States | 68 | Human |
| Environment Background | 58 | Environment |
| Industrial Tools & Instruments | 43 | Equipment |
| Emotion Expression | 40 | Human |
| Housekeeping | 36 | Daily Living |
| Body Movements | 29 | Human |
| Vehicle | 27 | Traffic |
| Personal Care | 19 | Daily Living |
| Collective Ambience | 19 | Human |
| Ecological & Biological Context | 17 | Environment |
| Meteorological Dynamics | 17 | Environment |
| Artistic | 15 | Music |
| Traffic | 15 | Traffic |
| Geological Events & Hazards | 11 | Environment |
| Large Traffic | 7 | Traffic |
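As a quick sanity check on the table above, the counts can be tallied per domain. This is only an illustrative sketch: the rows are copied by hand from the table, and the helper name `totals_by_domain` is not part of the repo.

```python
from collections import defaultdict

# (subcategory, count, domain) rows copied from the table above.
SUBCATEGORIES = [
    ("Daily Affairs", 126, "Daily Living"),
    ("House Equipment States", 97, "Equipment"),
    ("Physiological States", 68, "Human"),
    ("Environment Background", 58, "Environment"),
    ("Industrial Tools & Instruments", 43, "Equipment"),
    ("Emotion Expression", 40, "Human"),
    ("Housekeeping", 36, "Daily Living"),
    ("Body Movements", 29, "Human"),
    ("Vehicle", 27, "Traffic"),
    ("Personal Care", 19, "Daily Living"),
    ("Collective Ambience", 19, "Human"),
    ("Ecological & Biological Context", 17, "Environment"),
    ("Meteorological Dynamics", 17, "Environment"),
    ("Artistic", 15, "Music"),
    ("Traffic", 15, "Traffic"),
    ("Geological Events & Hazards", 11, "Environment"),
    ("Large Traffic", 7, "Traffic"),
]


def totals_by_domain(rows):
    """Sum the per-subcategory counts for each domain."""
    totals = defaultdict(int)
    for _name, count, domain in rows:
        totals[domain] += count
    return dict(totals)


if __name__ == "__main__":
    totals = totals_by_domain(SUBCATEGORIES)
    # Largest domain first.
    for domain, n in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{domain:<14}{n}")
    print(f"{'Total':<14}{sum(totals.values())}")
```

With the rows as listed, the 17 subcategories sum to 644 items across the 6 domains.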
Labels
- `RESPOND`: User-relevant, safety, or assistance scenarios → model should speak up
- `IGNORE`: Ambient or background sounds → silence is appropriate
| File | Role |
|---|---|
| `proactive_reply_benchmark.jsonl` | Full manifest: `id`, `path`, `description`, `decision` |
| `proactive_reply_benchmark_response.jsonl` | RESPOND only: `standard_answers` per `id` (optional semantic match) |
| `core.py` | Parses `<Decision>` / `<Reply>`, path grouping, validity checks |
| `semantic.py` | Optional reranker similarity vs `standard_answers` |
| `evaluate.py` | CLI: align predictions, compute metrics, write stats JSON |
Recommended layout (each manifest path resolves to a local file):

    .
    ├── proactive_reply_benchmark.jsonl
    ├── proactive_reply_benchmark_response.jsonl
    ├── core.py
    ├── semantic.py
    ├── evaluate.py
    └── proactive-sound-effect/
        └── Daily Living Sounds/
            └── Daily Affairs/
                ├── RESPOND/<id>.wav
                └── IGNORE/<id>.wav

Paths are relative; a ./ prefix is recommended, e.g.:
./proactive-sound-effect/Daily Living Sounds/Daily Affairs/RESPOND/<id>.wav
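Before running inference, it may help to confirm that every manifest `path` resolves to a local file. The sketch below assumes only the manifest fields listed above (`id`, `path`); the helper names `load_jsonl` and `missing_audio` are hypothetical and not part of the repo.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def missing_audio(manifest_rows, root="."):
    """Return the ids of rows whose `path` is not an existing file under `root`."""
    root = Path(root)
    # Path joining drops a leading "./", so both "./a/b.wav" and "a/b.wav" work.
    return [row["id"] for row in manifest_rows
            if not (root / row["path"]).is_file()]
```

A typical use would be `missing_audio(load_jsonl("proactive_reply_benchmark.jsonl"))` run from the repo root; an empty list means the layout matches the manifest.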
flowchart LR
A[Download audio] --> B[Model inference]
B --> C[predictions.jsonl]
C --> D[evaluate.py]
D --> E[stats.json]
- Run your model on every manifest audio item
- Write `predictions.jsonl`
- Score offline with this repo’s scripts
Tags are case-insensitive:
<Decision>RESPOND</Decision>
<Reply>...</Reply>
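To illustrate what case-insensitive tag parsing can look like, here is a minimal regex sketch. It is not the parser in `core.py`; the function name `parse_output` and the choice to return `None` for an unparseable decision are assumptions made for this example.

```python
import re

# IGNORECASE makes <decision>, <Decision>, and <DECISION> equivalent;
# DOTALL lets a reply span several lines.
_DECISION_RE = re.compile(r"<decision>\s*(.*?)\s*</decision>", re.IGNORECASE | re.DOTALL)
_REPLY_RE = re.compile(r"<reply>\s*(.*?)\s*</reply>", re.IGNORECASE | re.DOTALL)


def parse_output(raw):
    """Return (decision, reply).

    decision is "RESPOND", "IGNORE", or None when the tag is missing or
    holds any other value; reply is "" when the tag is missing.
    """
    text = raw or ""
    decision_match = _DECISION_RE.search(text)
    reply_match = _REPLY_RE.search(text)
    decision = decision_match.group(1).upper() if decision_match else None
    if decision not in {"RESPOND", "IGNORE"}:
        decision = None
    reply = reply_match.group(1) if reply_match else ""
    return decision, reply
```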
Each line in predictions.jsonl needs at least:
- `id`: must match the manifest
- `_raw_reply` (recommended) or `reply`: full raw model output string
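A predictions file with these two fields could be produced along the following lines. This is a sketch only: `run_model` is a hypothetical callable standing in for your own inference code, assumed to take an audio path and return the model's full text output.

```python
import json


def write_predictions(manifest_rows, run_model, out_path):
    """Write one JSON line per manifest item with `id` and `_raw_reply`."""
    with open(out_path, "w", encoding="utf-8") as f:
        for row in manifest_rows:
            # run_model is hypothetical: audio path in, raw model text out.
            raw = run_model(row["path"])
            record = {"id": row["id"], "_raw_reply": raw}
            # ensure_ascii=False keeps non-ASCII replies readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Storing the unmodified model output leaves tag parsing to the evaluation scripts, so the same predictions file can be rescored later.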
Decision accuracy only (stdlib sufficient):

    python evaluate.py \
      --benchmark proactive_reply_benchmark.jsonl \
      --pred /path/to/predictions.jsonl \
      --out /path/to/stats.json \
      --checkpoint-label my-model

With semantic matching (RESPOND replies vs standard_answers):

    pip install torch FlagEmbedding
    python evaluate.py \
      --benchmark proactive_reply_benchmark.jsonl \
      --pred /path/to/predictions.jsonl \
      --out /path/to/stats.json \
      --respond-jsonl proactive_reply_benchmark_response.jsonl \
      --reranker-dir /path/to/bge-reranker-v2-m3 \
      --semantic-threshold 0.5

If `evaluate.py` uses `from .core import ...`, run it as a package module (with `python -m`). For a flat repo, switch to `from core import ...` and run `python evaluate.py` directly.
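For intuition, the semantic check can be thought of as scoring the model's reply against each reference answer and comparing the result with the threshold. The sketch below is not the code in `semantic.py`. It makes three assumptions: the best score over all references is used, the reply is passed as the first element of each pair, and scores are normalized into [0, 1]. The helper names are hypothetical.

```python
def semantic_match(reply, standard_answers, score_fn, threshold=0.5):
    """True if the best score of `reply` against any reference exceeds `threshold`.

    score_fn takes a list of [reply, reference] pairs and returns one float
    per pair. Taking the max over references is an assumption of this sketch.
    """
    if not reply or not standard_answers:
        return False
    pairs = [[reply, ref] for ref in standard_answers]
    return max(score_fn(pairs)) > threshold


def make_reranker_scorer(model_dir):
    """Build a score_fn from a local reranker directory (needs torch + FlagEmbedding)."""
    from FlagEmbedding import FlagReranker  # optional dependency, imported lazily

    reranker = FlagReranker(model_dir, use_fp16=False)

    def score_fn(pairs):
        # normalize=True applies a sigmoid so scores fall in [0, 1].
        scores = reranker.compute_score(pairs, normalize=True)
        # Some versions return a bare float for a single pair.
        return scores if isinstance(scores, list) else [scores]

    return score_fn
```

Passing the scorer in as a function keeps the comparison logic testable without loading the reranker.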
| Metric | Meaning |
|---|---|
| Decision | Predicted RESPOND / IGNORE matches manifest decision |
| Semantic (optional) | For ground-truth RESPOND, reranker score of <Reply> vs standard_answers exceeds threshold |
The CLI prints the stats JSON path and overall accuracy (before/after semantic matching). Per-domain and per-folder breakdowns are written to the --out JSON.
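The decision metric and a per-domain breakdown can be sketched as follows. This is an illustration rather than the logic in `evaluate.py`. It assumes that a missing or unparseable prediction counts as wrong, and that the domain is the first folder under the root in each manifest `path`; both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def decision_accuracy(manifest_rows, predicted):
    """Fraction of items whose predicted decision equals the manifest decision.

    `predicted` maps id -> "RESPOND" / "IGNORE"; ids that are absent (or map
    to None) count as incorrect, which is an assumption of this sketch.
    """
    if not manifest_rows:
        return 0.0
    correct = sum(1 for row in manifest_rows
                  if predicted.get(row["id"]) == row["decision"])
    return correct / len(manifest_rows)


def accuracy_by_domain(manifest_rows, predicted):
    """Group rows by the folder directly under the root and score each group."""
    groups = defaultdict(list)
    for row in manifest_rows:
        # A leading "./" is dropped, so parts[0] is the root folder
        # and parts[1] is the domain folder.
        parts = PurePosixPath(row["path"]).parts
        domain = parts[1] if len(parts) > 1 else "unknown"
        groups[domain].append(row)
    return {domain: decision_accuracy(rows, predicted)
            for domain, rows in groups.items()}
```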
You are a proactively helpful model.
Your task is to determine whether to assist the user based on the provided audio.
- If the audio indicates physiological danger or environmental safety risks, provide appropriate assistance.
- If it reflects negative emotions, offer mental comfort.
- If it suggests equipment malfunction, provide warnings or safety guidance.
- If none of the above apply, just respond with "No Reply". Do not explain anything.
Your final output must strictly follow this format:
<Decision>(RESPOND or IGNORE)</Decision>
<Reply>(Your reply)</Reply>
- Python ≥ 3.10
- Basic evaluation: standard library only
- Semantic matching: `torch`, `FlagEmbedding`, and a local reranker (e.g. BAAI/bge-reranker-v2-m3)