Evaluation: Neurosymbolic Explanation Selection in Robotics

This repository holds the evaluation code and data for the paper: Neurosymbolic Explanation Selection in Robotics: Combining the Strengths of Planning and Foundation Models for XAI. The paper is available online.

Citation

@inbook{neurosymbolic_wachowiak2026,
title = "Neurosymbolic Explanation Selection in Robotics: Combining the Strengths of Planning and Foundation Models for XAI",
author = "Lennart Wachowiak and Andrew Coles and Oya Celiktutan and Gerard Canal",
year = "2026",
month = jan,
day = "12",
doi = "10.1145/3776734.3794387",
booktitle = "Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction",
}

1: Retrieval Eval

The notebook retrieval_eval.ipynb contains the code to run the retrieval evaluation. It uses the data in the folder retrieval_eval_data, which contains the questions and plans used in the evaluation. The evaluation is based on a set of 180 questions across six plans in two domains. The requirements file requirements.txt lists the necessary Python packages to run the evaluation (install with pip install -r requirements.txt).
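As a rough illustration of how a retrieval score of this kind can be computed, the sketch below scores one question by comparing the set of plan steps a matcher selected against the set of gold steps, then averages over questions. This is a minimal sketch and not the notebook's code: the function names `step_f1` and `macro_f1`, the use of integer step indices, and the per-question (macro) averaging are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def step_f1(predicted: Iterable[int], gold: Iterable[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one question's selected plan steps."""
    pred_set: Set[int] = set(predicted)
    gold_set: Set[int] = set(gold)
    true_pos = len(pred_set & gold_set)
    # Guard against empty sets so the function never divides by zero.
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(pairs: List[Tuple[Iterable[int], Iterable[int]]]) -> float:
    """Average the per-question F1 (assumed averaging scheme)."""
    scores = [step_f1(pred, gold)[2] for pred, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: (predicted step indices, gold step indices) per question.
examples = [([2, 3], [3]), ([5], [5]), ([], [1])]
print(round(macro_f1(examples), 3))
```

The three hypothetical questions score 2/3, 1 and 0, so the average is 5/9. A different averaging scheme (for instance, pooling counts over all questions) would give a different number.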

The notebook also contains the code our system uses to match questions to plan steps. If you are interested in the code of the entire system (XAI modules, ROS integration, logging interface, etc.), please reach out to the corresponding author.
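To make the idea of matching a question to plan steps concrete, here is a minimal sketch of embedding-style matching of the kind the abstract calls a cheaper baseline: each plan step whose vector is similar enough to the question vector is selected. It is an illustration under stated assumptions, not the repository's implementation. The toy three-dimensional vectors stand in for real embeddings, and the names `cosine` and `match_steps` as well as the threshold of 0.7 are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_steps(
    question_vec: Sequence[float],
    step_vecs: Dict[int, Sequence[float]],
    threshold: float = 0.7,  # hypothetical cut-off, chosen for the example
) -> List[int]:
    """Return the indices of plan steps whose similarity reaches the threshold."""
    return sorted(
        step for step, vec in step_vecs.items()
        if cosine(question_vec, vec) >= threshold
    )


# Toy vectors standing in for real question and plan-step embeddings.
question = [1.0, 0.0, 0.0]
steps = {0: [0.9, 0.1, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.5, 0.0, 0.8]}
selected = match_steps(question, steps)
print(selected)
```

With these toy vectors only step 0 clears the threshold. In practice the vectors would come from an embedding model, and the threshold would need tuning on held-out questions.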

2: User Eval

The user evaluation was run on Prolific. The questionnaire was created with Qualtrics and is available as a PDF (Qualtrics-Survey.pdf). The stimulus order, explanation order, and initial ranking order are all randomized.

Related Repositories/Research

Abstract

Robots operating in human environments should be able to answer diverse, explanation-seeking questions about their past behaviour. We present a neurosymbolic pipeline that links a task planner with a unified logging interface, attaching heterogeneous XAI artifacts (e.g., visual heatmaps, navigation feedback) to individual plan steps. Given a natural language question, a large language model selects the most relevant actions and consolidates the associated logs into a multimodal explanation. In an offline evaluation on 180 questions across six plans in two domains, we show the usefulness of an LLM-based question-matcher (F1 Score of 0.91), contrasting it with a cheaper embedding baseline (0.62) and a rule-based syntax/keyword matcher (0.02). A preliminary user study (N = 30) demonstrates that users prefer the LLM-consolidated explanations over raw logs and planner-only explanations.

License

Code: MIT License
Documentation and images: CC-BY 4.0

