lwachowiak/explanation-selection-eval

Evaluation: Neurosymbolic Explanation Selection in Robotics

This repository holds the evaluation code and data for the paper: Neurosymbolic Explanation Selection in Robotics: Combining the Strengths of Planning and Foundation Models for XAI. The paper is available online.

Citation

@inbook{neurosymbolic_wachowiak2026,
title = "Neurosymbolic Explanation Selection in Robotics: Combining the Strengths of Planning and Foundation Models for XAI",
author = "Lennart Wachowiak and Andrew Coles and Oya Celiktutan and Gerard Canal",
year = "2026",
month = jan,
day = "12",
doi = "10.1145/3776734.3794387",
booktitle = "Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction",
}

1: Retrieval Eval

The notebook retrieval_eval.ipynb contains the code to run the retrieval evaluation. It uses the data in the folder retrieval_eval_data, which holds the questions and plans used in the evaluation. The evaluation is based on a set of 180 questions across six plans in two domains. The file requirements.txt lists the Python packages needed to run the evaluation (install with pip install -r requirements.txt).

The notebook includes the code our system uses for matching questions to plan steps. If you are interested in the code of the entire system (XAI modules, ROS integration, logging interface, etc.), please reach out to the corresponding author.
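As a rough illustration of how matching quality can be scored, the sketch below computes a set-based F1 score per question and averages it over all questions. It is not taken from the notebook: the function names f1_for_question and mean_f1, the representation of plan steps as integer indices, and the choice of per-question averaging are all assumptions made for this example, and the notebook may aggregate differently.

```python
def f1_for_question(predicted, gold):
    """Set-based F1 between predicted and gold plan-step indices (hypothetical helper)."""
    predicted, gold = set(predicted), set(gold)
    # Assumption: if both sets are empty, the matcher is counted as fully correct.
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, gold_labels):
    """Average the per-question F1 over all questions (assumed aggregation)."""
    scores = [f1_for_question(p, g) for p, g in zip(predictions, gold_labels)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: two questions, each with predicted and gold plan-step indices.
    predictions = [[1, 2], [3]]
    gold_labels = [[2, 3], [3]]
    print(mean_f1(predictions, gold_labels))
```

With the toy data above, the first question has one of two predicted steps correct and one of two gold steps found, while the second question is matched exactly.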

2: User Eval

The user evaluation was run on Prolific. The questionnaire was created with Qualtrics and is available as a PDF (Qualtrics_Survey.pdf). The stimulus order, explanation order, and initial ranking order are all randomized.

Related Repositories/Research

Abstract

Robots operating in human environments should be able to answer diverse, explanation-seeking questions about their past behaviour. We present a neurosymbolic pipeline that links a task planner with a unified logging interface, attaching heterogeneous XAI artifacts (e.g., visual heatmaps, navigation feedback) to individual plan steps. Given a natural language question, a large language model selects the most relevant actions and consolidates the associated logs into a multimodal explanation. In an offline evaluation on 180 questions across six plans in two domains, we show the usefulness of an LLM-based question-matcher (F1 Score of 0.91), contrasting it with a cheaper embedding baseline (0.62) and a rule-based syntax/keyword matcher (0.02). A preliminary user study (N = 30) demonstrates that users prefer the LLM-consolidated explanations over raw logs and planner-only explanations.

License

  • Code: MIT License

  • Documentation and images: CC-BY 4.0