Quantifying Generalisation in Imitation Learning

Nathan Gavenski, Odinaldo Rodrigues

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Datasets and Benchmarks Track

Imitation learning benchmarks often lack sufficient variation between training and evaluation, limiting meaningful generalisation assessment. We introduce Labyrinth, a benchmarking environment designed to test generalisation with precise control over structure, start and goal positions, and task complexity. It enables verifiably distinct training, evaluation, and test settings. Labyrinth provides a discrete, fully observable state space and known optimal actions, supporting interpretability and fine-grained evaluation. Its flexible setup allows targeted testing of generalisation factors and includes variants like partial observability, key-and-door tasks, and ice-floor hazards. By enabling controlled, reproducible experiments, Labyrinth advances the evaluation of generalisation in imitation learning and provides a valuable tool for developing more robust agents.
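To make the idea of verifiably distinct training, evaluation, and test settings concrete, here is a minimal, hypothetical sketch; it is not the released Labyrinth API, and all names (`generate_maze`, `layout_hash`, `disjoint_splits`) are illustrative assumptions. It generates seeded grid mazes, fingerprints each layout, and partitions the unique layouts into disjoint splits so that no maze seen during training can reappear at evaluation or test time.

```python
# Hypothetical sketch (not the paper's actual API): seeded maze generation with
# layout fingerprinting to guarantee disjoint train / evaluation / test splits.
import hashlib
import random


def generate_maze(size: int, seed: int):
    """Carve a perfect maze on a (2*size+1) x (2*size+1) grid via recursive backtracking."""
    rng = random.Random(seed)
    dim = 2 * size + 1
    grid = [[1] * dim for _ in range(dim)]  # 1 = wall, 0 = corridor
    stack = [(0, 0)]
    visited = {(0, 0)}
    grid[1][1] = 0  # carve the start cell
    while stack:
        r, c = stack[-1]
        neighbours = [
            (r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < size and 0 <= c + dc < size and (r + dr, c + dc) not in visited
        ]
        if not neighbours:
            stack.pop()
            continue
        nr, nc = rng.choice(neighbours)
        grid[r + nr + 1][c + nc + 1] = 0  # knock down the wall between the two cells
        grid[2 * nr + 1][2 * nc + 1] = 0  # carve the neighbour cell
        visited.add((nr, nc))
        stack.append((nr, nc))
    return tuple(tuple(row) for row in grid)


def layout_hash(maze) -> str:
    """Stable fingerprint of a maze layout, used to check split disjointness."""
    return hashlib.sha256(str(maze).encode()).hexdigest()


def disjoint_splits(size: int, counts=(100, 20, 20)):
    """Sample unique mazes and partition them into train/eval/test with no overlap."""
    seen, mazes, seed = set(), [], 0
    while len(mazes) < sum(counts):
        maze = generate_maze(size, seed)
        seed += 1
        fingerprint = layout_hash(maze)
        if fingerprint not in seen:  # keep only verifiably distinct layouts
            seen.add(fingerprint)
            mazes.append(maze)
    n_train, n_eval, _ = counts
    return mazes[:n_train], mazes[n_train:n_train + n_eval], mazes[n_train + n_eval:]


if __name__ == "__main__":
    train, evaluation, test = disjoint_splits(size=5)
    print(len(train), len(evaluation), len(test))  # 100 20 20
```

In this sketch the fingerprint set is what makes the splits *verifiable*: any maze whose layout hash already appears in another split is simply never admitted, which is one plausible way to realise the controlled separation the benchmark advertises.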