Proactive Sound Effect Benchmark

Evaluates whether an audio–language model should proactively speak from audio-only cues (no transcript)—and, when it should, whether the reply is useful.

What We Measure

Capability	Description
Decision boundary	Distinguish when to assist (`RESPOND`) vs stay silent (`IGNORE`)
Reply quality	For `RESPOND` items, whether the reply is relevant and helpful (optional semantic match)

Each sample has a ground-truth decision. The RESPOND subset includes reference replies for optional semantic scoring.

Resources

Resource	Link
Code & manifests	github.com/masaz14/Proactive-Sound-Effect-Benchmark
Audio (dataset)	https://huggingface.co/datasets/masaz14/Proactive-Sound-Effect-Benchmark

The Hugging Face layout matches manifest path fields (root folder: proactive-sound-effect).

Download (login may be required for gated assets):

pip install -U huggingface_hub
hf download masaz14/Proactive-Sound-Effect-Benchmark \
  --repo-type=dataset \
  --local-dir ./Proactive-Sound-Effect-Benchmark-data

Dataset Composition

17 subcategories across 6 domains—daily living, human states, traffic, environment, music, and equipment—focused on proactive assistance and safety.

Subcategory counts:

Subcategory	Count	Domain
Daily Affairs	126	Daily Living
House Equipment States	97	Equipment
Physiological States	68	Human
Environment Background	58	Environment
Industrial Tools & Instruments	43	Equipment
Emotion Expression	40	Human
Housekeeping	36	Daily Living
Body Movements	29	Human
Vehicle	27	Traffic
Personal Care	19	Daily Living
Collective Ambience	19	Human
Ecological & Biological Context	17	Environment
Meteorological Dynamics	17	Environment
Artistic	15	Music
Traffic	15	Traffic
Geological Events & Hazards	11	Environment
Large Traffic	7	Traffic
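
As a quick sanity check on the table above, the following sketch re-derives per-domain totals from the listed subcategory counts. The counts are copied from the table; the variable names are illustrative and not part of the repository.

```python
from collections import Counter

# (subcategory, count, domain) rows copied from the table above.
SUBCATEGORY_COUNTS = [
    ("Daily Affairs", 126, "Daily Living"),
    ("House Equipment States", 97, "Equipment"),
    ("Physiological States", 68, "Human"),
    ("Environment Background", 58, "Environment"),
    ("Industrial Tools & Instruments", 43, "Equipment"),
    ("Emotion Expression", 40, "Human"),
    ("Housekeeping", 36, "Daily Living"),
    ("Body Movements", 29, "Human"),
    ("Vehicle", 27, "Traffic"),
    ("Personal Care", 19, "Daily Living"),
    ("Collective Ambience", 19, "Human"),
    ("Ecological & Biological Context", 17, "Environment"),
    ("Meteorological Dynamics", 17, "Environment"),
    ("Artistic", 15, "Music"),
    ("Traffic", 15, "Traffic"),
    ("Geological Events & Hazards", 11, "Environment"),
    ("Large Traffic", 7, "Traffic"),
]

# Sum the counts per domain.
domain_totals: Counter[str] = Counter()
for _name, count, domain in SUBCATEGORY_COUNTS:
    domain_totals[domain] += count

total = sum(domain_totals.values())

for domain, n in domain_totals.most_common():
    print(f"{domain}: {n}")
print(f"Total: {total}")
```

The listed counts sum to 644 across the 17 subcategories and 6 domains.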

Labels

RESPOND — User-relevant, safety, or assistance scenarios → model should speak up
IGNORE — Ambient or background sounds → silence is appropriate

Repository Layout

File	Role
`proactive_reply_benchmark.jsonl`	Full manifest: `id`, `path`, `description`, `decision`
`proactive_reply_benchmark_response.jsonl`	`RESPOND` only: `standard_answers` per `id` (optional semantic match)
`core.py`	Parses `<Decision>` / `<Reply>`, path grouping, validity checks
`semantic.py`	Optional reranker similarity vs `standard_answers`
`evaluate.py`	CLI: align predictions, compute metrics, write stats JSON
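
To illustrate how the two manifests relate, here is a minimal sketch that parses JSONL and joins `standard_answers` onto manifest rows by `id`. It assumes each line is one JSON object with the fields listed above and that `standard_answers` is a list of strings; the sample rows and the helper names `load_jsonl` and `attach_references` are made up for illustration and are not taken from the repository.

```python
import json
from typing import Iterable


def load_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def attach_references(manifest: list[dict], responses: list[dict]) -> list[dict]:
    """Return manifest rows with `standard_answers` joined in by `id`.

    Rows without a reference entry (e.g. IGNORE items) get an empty list.
    """
    answers_by_id = {row["id"]: row.get("standard_answers", []) for row in responses}
    return [
        {**row, "standard_answers": answers_by_id.get(row["id"], [])}
        for row in manifest
    ]


# Made-up rows for illustration only; real ids and paths come from the repo files.
manifest_lines = [
    json.dumps({"id": "sample-001", "path": "./proactive-sound-effect/.../RESPOND/sample-001.wav",
                "description": "smoke alarm beeping", "decision": "RESPOND"}),
    json.dumps({"id": "sample-002", "path": "./proactive-sound-effect/.../IGNORE/sample-002.wav",
                "description": "distant birdsong", "decision": "IGNORE"}),
]
response_lines = [
    json.dumps({"id": "sample-001",
                "standard_answers": ["That alarm sounds urgent. Do you need help?"]}),
]

items = attach_references(load_jsonl(manifest_lines), load_jsonl(response_lines))
for item in items:
    print(item["id"], item["decision"], len(item["standard_answers"]))
```

With the real files, `load_jsonl` can be given an open file handle directly, since iterating a text file yields its lines.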

Recommended layout (each manifest path resolves to a local file):

.
├── proactive_reply_benchmark.jsonl
├── proactive_reply_benchmark_response.jsonl
├── core.py
├── semantic.py
├── evaluate.py
└── proactive-sound-effect/
    └── Daily Living Sounds/
        └── Daily Affairs/
            ├── RESPOND/<id>.wav
            └── IGNORE/<id>.wav

Paths are relative; a ./ prefix is recommended, e.g.:

./proactive-sound-effect/Daily Living Sounds/Daily Affairs/RESPOND/<id>.wav
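
The path layout above also encodes the domain folder, subcategory, and label. The sketch below splits a manifest path into those parts; it assumes every path follows the four-level layout shown here under the `proactive-sound-effect` root. The function name `split_manifest_path` is hypothetical, and this is not necessarily how `core.py` groups paths.

```python
from pathlib import PurePosixPath

ROOT = "proactive-sound-effect"  # root folder named in this README


def split_manifest_path(path: str) -> dict[str, str]:
    """Split '<root>/<domain>/<subcategory>/<label>/<id>.wav' into named parts."""
    parts = PurePosixPath(path).parts  # a leading './' is dropped by pathlib
    if ROOT not in parts:
        raise ValueError(f"path does not contain {ROOT!r}: {path}")
    rel = parts[parts.index(ROOT) + 1:]
    if len(rel) != 4:
        raise ValueError(f"expected domain/subcategory/label/file, got: {path}")
    domain, subcategory, label, filename = rel
    return {
        "domain": domain,
        "subcategory": subcategory,
        "label": label,
        "id": PurePosixPath(filename).stem,
    }


example = split_manifest_path(
    "./proactive-sound-effect/Daily Living Sounds/Daily Affairs/RESPOND/sample-001.wav"
)
print(example)
```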

Evaluation Workflow

flowchart LR
  A[Download audio] --> B[Model inference]
  B --> C[predictions.jsonl]
  C --> D[evaluate.py]
  D --> E[stats.json]

1. Run your model on every manifest audio item
2. Write predictions.jsonl
3. Score offline with this repo’s scripts

Expected Model Output

Tags are case-insensitive:

<Decision>RESPOND</Decision>
<Reply>...</Reply>

Each line in predictions.jsonl needs at least:

id — must match the manifest
_raw_reply (recommended) or reply — full raw model output string
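
A minimal sketch of producing such a file, assuming the model's raw outputs are already collected as `(id, raw_output)` pairs. The helper name `write_predictions` and the sample id are hypothetical; only the `id` and `_raw_reply` field names come from this README.

```python
import io
import json
from typing import Iterable, TextIO


def write_predictions(rows: Iterable[tuple[str, str]], out: TextIO) -> int:
    """Write one JSON object per line with `id` and `_raw_reply`; return the row count."""
    count = 0
    for item_id, raw_output in rows:
        record = {"id": item_id, "_raw_reply": raw_output}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# In-memory demo; pass a real file handle to write predictions.jsonl on disk.
buffer = io.StringIO()
n_written = write_predictions(
    [("sample-001", "<Decision>IGNORE</Decision>\n<Reply></Reply>")],
    buffer,
)
print(buffer.getvalue(), end="")
```

To write to disk, open the target with `open("predictions.jsonl", "w", encoding="utf-8")` and pass the handle as `out`.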

Running Evaluation

Decision accuracy only (stdlib sufficient):

python evaluate.py \
  --benchmark proactive_reply_benchmark.jsonl \
  --pred /path/to/predictions.jsonl \
  --out /path/to/stats.json \
  --checkpoint-label my-model

With semantic matching (RESPOND replies vs standard_answers):

pip install torch FlagEmbedding

python evaluate.py \
  --benchmark proactive_reply_benchmark.jsonl \
  --pred /path/to/predictions.jsonl \
  --out /path/to/stats.json \
  --respond-jsonl proactive_reply_benchmark_response.jsonl \
  --reranker-dir /path/to/bge-reranker-v2-m3 \
  --semantic-threshold 0.5

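If `evaluate.py` uses a relative import (`from .core import ...`), it has to be run as part of a package, for example with `python -m` from the parent directory. For a flat repo, change the import to `from core import ...` and run `python evaluate.py` directly.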

Metrics

Metric	Meaning
Decision	Predicted `RESPOND` / `IGNORE` matches manifest `decision`
Semantic (optional)	For ground-truth `RESPOND`, reranker score of `<Reply>` vs `standard_answers` exceeds threshold

The CLI prints the stats JSON path and overall accuracy (before/after semantic matching). Per-domain and per-folder breakdowns are written to the --out JSON.

Prompt Example

You are a proactively helpful model.
Your task is to determine whether to assist the user based on the provided audio.
- If the audio indicates physiological danger or environmental safety risks, provide appropriate assistance.
- If it reflects negative emotions, offer mental comfort.
- If it suggests equipment malfunction, provide warnings or safety guidance.
- If none of the above apply, just respond with "No Reply". Do not explain anything.

Your final output must strictly follow this format:
<Decision>(RESPOND or IGNORE)</Decision>
<Reply>(Your reply)</Reply>

Dependencies

Python ≥ 3.10
Basic evaluation: standard library only
Semantic matching: torch, FlagEmbedding, and a local reranker (e.g. BAAI/bge-reranker-v2-m3)
