Failure Identification in Imitation Learning via Statistical and Semantic Filtering

1Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
2Inria, CNRS, Université de Lorraine, LORIA, F-54000 Nancy, France
3Bleu Robotics, Paris, France
ICRA 2026

Abstract

Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods have emerged as a natural solution for detecting these anomalous failure states, but they do not distinguish failures from benign deviations. We introduce \(\textbf{FIDeL}\) (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal-transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce \(\textbf{BotFails}\), a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines on BotFails, yielding \(+5.30\%\) AUROC in anomaly detection and \(+17.38\%\) failure-detection accuracy over existing methods.

We introduce FIDeL, a framework aimed at improving the reliability of Imitation Learning-based policies. After aggregating feature representations of expert demonstrations, it continuously computes anomaly scores from real-time policy-robot interactions. When a significant deviation is detected, the system flags a failure and generates a heatmap that localizes it.
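As a rough illustration of how per-patch anomaly scores can produce such a localization heatmap, here is a minimal sketch. The function names, feature shapes, and the nearest-neighbour scoring rule are illustrative assumptions, not the paper's implementation (which matches observations to the memory via optimal transport):

```python
import numpy as np

def patch_heatmap(patch_feats: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Score each spatial patch against the nominal memory.

    patch_feats: (H, W, d) grid of patch features for one observation.
    memory:      (N, d) features collected from nominal demonstrations.
    Returns an (H, W) map of nearest-neighbour distances: high values
    indicate patches that deviate from nominal behavior.
    """
    H, W, d = patch_feats.shape
    flat = patch_feats.reshape(-1, d)                         # (H*W, d)
    # Distance of every patch to every memory entry, then nearest neighbour.
    dists = np.linalg.norm(flat[:, None, :] - memory[None, :, :], axis=2)
    return dists.min(axis=1).reshape(H, W)

# Toy usage: nominal features are standard normal; one patch is perturbed.
rng = np.random.default_rng(1)
memory = rng.normal(size=(64, 4))
feats = rng.normal(size=(6, 6, 4))
feats[2, 3] += 8.0                    # inject a localized deviation
hmap = patch_heatmap(feats, memory)
peak = np.unravel_index(hmap.argmax(), hmap.shape)
print(peak)
```

The peak of the heatmap lands on the perturbed patch, which is the intuition behind using the score map to localize where in the scene the failure occurred.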

Methodology

Our method comprises two main stages: an offline preparation phase and an online inference phase. During the offline phase, we process a dataset of expert demonstrations \( \mathcal{X}_N \), assumed to reflect nominal behavior, by extracting feature representations from observations. These features are stored in a memory buffer \( \mathcal{M} \), providing a compact statistical model of normal operation. In the online phase, the AD module operates concurrently with the execution of the generative policy \( f_\theta \). At each time step \( t \), the policy receives a multimodal observation \( O_t \) and outputs an action \( A_t = f_\theta(O_t) \). Prior to executing the action, \( O_t \) is encoded into the feature space, and an anomaly score is computed by comparing it to the distribution of features stored in \( \mathcal{M} \). This enables real-time monitoring of the system's behavior without interrupting the robot's control loop. By decoupling the modeling of nominal behavior from online evaluation, the proposed two-stage architecture ensures computational efficiency and enhances reliability by promptly detecting deviations from expected operational patterns.
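The two-stage design can be sketched as follows. This is a simplified stand-in, assuming a nearest-neighbour distance in place of the paper's optimal-transport matching and a plain split-conformal quantile in place of the spatio-temporal threshold extension; all function names are hypothetical:

```python
import numpy as np

def build_memory(nominal_features: np.ndarray) -> np.ndarray:
    """Offline phase: store features of expert demonstrations as memory M.

    nominal_features: (N, d) array; a real system might coreset-subsample.
    """
    return nominal_features

def anomaly_score(feature: np.ndarray, memory: np.ndarray) -> float:
    """Online phase: distance of an observation's feature to its nearest
    nominal neighbour in M (a simple stand-in for OT matching)."""
    dists = np.linalg.norm(memory - feature, axis=1)
    return float(dists.min())

def conformal_threshold(calib_scores: np.ndarray, alpha: float = 0.05) -> float:
    """Split-conformal threshold: the finite-sample-corrected (1 - alpha)
    quantile of anomaly scores on held-out nominal rollouts."""
    n = len(calib_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(calib_scores, min(q, 1.0)))

# Toy usage: nominal features cluster around the origin.
rng = np.random.default_rng(0)
memory = build_memory(rng.normal(size=(200, 8)))
calib = np.array([anomaly_score(f, memory)
                  for f in rng.normal(size=(50, 8))])
tau = conformal_threshold(calib, alpha=0.05)

s_nominal = anomaly_score(rng.normal(size=8), memory)
s_failure = anomaly_score(np.full(8, 10.0), memory)  # far out-of-distribution
print(tau, s_nominal, s_failure)
```

An out-of-distribution observation scores far above the calibrated threshold \( \tau \) and is flagged, while the decoupled offline step keeps the per-step online cost to one encoding plus one matching operation.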



BotFails Dataset

To rigorously evaluate the effectiveness of our AD approach, we introduce \( \textbf{BotFails} \), a dedicated dataset specifically designed for robotic AD tasks. Our method is benchmarked on this dataset alongside several state-of-the-art baseline methods adapted to our experimental setting. The BotFails dataset comprises a diverse set of tasks, each designed to represent a specific human activity or interaction scenario within a controlled environment. In total, the dataset contains 414,359 annotated frames across 646 video sequences, offering a rich and varied benchmark for evaluating anomaly detection systems. Each task in the dataset is associated with a predefined set of anomalous behaviors. These anomalies are deliberately introduced to mimic realistic deviations from normal patterns, such as incorrect object handling, wrong execution of the expected task, or unexpected presence of foreign objects.


We introduce BotFails, a new dataset specifically designed for one-class multimodal anomaly detection in robotics, incorporating vision, proprioception, and natural language instructions.

Results

We evaluated FIDeL along with several SotA baselines on BotFails. FIDeL delivers the highest overall performance, achieving an average AUROC of \( \textbf{70.5\%} \) and F1@Opt of \( \textbf{62.9\%} \), exceeding the best-performing baseline, \( \textit{logpZO} \), by more than 5 points in AUROC and 3 points in F1.


Mean AUROC and mean F1-score at the optimal threshold (\(\textbf{in \%}\)), as well as inference time (\(\textbf{in ms}\), for anomaly detection alone and for the full pipeline including the encoder), evaluated on the \( \textit{BotFails} \) dataset across various AD methods. Experiments were conducted on an NVIDIA GeForce RTX 3090 GPU and an AMD Ryzen 9 5900X CPU (24 threads @ 3.70 GHz). Best results are highlighted in bold. Each task score reflects performance over multiple rollouts (several dozen per task).

Results videos : FIDeL evaluation with the BotFails dataset

Pouring coffee
Making coffee
Soldering
Robothon hatch and probe

BibTeX

@inproceedings{rolland2026failure,
  title={Failure Identification in Imitation Learning via Statistical and Semantic Filtering},
  author={Rolland, Quentin and Mayran de Chamisso, Fabrice and Mouret, Jean-Baptiste},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026},
}