Common Benchmarks Undervalue the Generalization Power of Programmatic Policies

Amirhossein Rajabpour, Kiarash Aghakasiri, Sandra Zilles, Levi H. S. Lelis
University of Alberta & Amii

Overview

We argue that commonly used benchmarks underestimate the out-of-distribution (OOD) generalization capabilities of programmatic policies. By controlling input sparsity and reward functions, we show that neural networks can match or exceed programmatic policies on standard benchmarks, and we propose new tasks that highlight scenarios where symbolic representations are advantageous. For OOD generalization, we argue that programmatic representations should be used in problems that require computational constructs neural models have difficulty learning, such as stacks and queues.

TORCS Results

For The Open Racing Car Simulator (TORCS), we show that a more cautious reward function slows the agent down, which in turn improves its generalization.

TORCS Results Table

Training Track Performance

Agent on G-TRACK-1 training environment

Generalization Comparison

Successful generalization on E-ROAD using the cautious reward function
Crash on E-ROAD under the original reward function

Karel Results

Using sparse observations and augmenting them with the previous action allows fully connected policies to generalize to larger grids (100×100). These policies outperform convolutional and LSTM baselines, which fail to scale, and also perform better than LEAPS.
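The augmentation described above can be illustrated with a minimal sketch. It assumes a flat feature vector as the sparse observation and a one-hot encoding of the previous action; the function names, the feature layout, and the action count of 5 are hypothetical choices for illustration, not the exact encoding used in the paper.

```python
from typing import List, Optional, Sequence


def one_hot(index: Optional[int], size: int) -> List[float]:
    """Return a one-hot list of length `size`; all zeros if `index` is None."""
    vec = [0.0] * size
    if index is not None:
        if not 0 <= index < size:
            raise ValueError(f"action index {index} out of range for size {size}")
        vec[index] = 1.0
    return vec


def augment_observation(
    obs: Sequence[float], prev_action: Optional[int], num_actions: int
) -> List[float]:
    """Append a one-hot encoding of the previous action a_{t-1} to the observation.

    `prev_action` is None at the first step of an episode, when no action
    has been taken yet (an assumption of this sketch).
    """
    return list(obs) + one_hot(prev_action, num_actions)


# Hypothetical sparse observation: a few local binary features.
obs = [1.0, 0.0, 0.0, 1.0]
NUM_ACTIONS = 5  # assumed action count, for illustration only

first_step = augment_observation(obs, None, NUM_ACTIONS)
later_step = augment_observation(obs, 2, NUM_ACTIONS)
```

Because the augmented vector has a fixed length that does not depend on the grid size, the same fully connected policy can in principle be applied unchanged to a 12×12 or a 100×100 grid.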

Karel Results Table
PPO with \(a_{t-1}\) on \(12\times 12\) grid
PPO with \(a_{t-1}\) on \(100\times 100\) grid

Proposed Benchmark

We introduce SPARSE MAZE, a task with wider corridors than the standard MAZE task, in which finding shortest paths requires explicit memory such as a stack or a queue. Neural policies struggle on this task, while programmatic search synthesizes an optimal breadth-first search (BFS) solution. We used FunSearch to generate the programmatic policies in this section.
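To make the role of the queue concrete, here is a generic textbook BFS over a grid. It is a sketch under assumptions, not the program synthesized by FunSearch: the grid encoding (`#` for walls, `.` for free cells), the four-neighbour moves, and the function name are all hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def shortest_path_length(grid: List[str], start: Cell, goal: Cell) -> Optional[int]:
    """Return the number of moves on a shortest path, or None if unreachable.

    The FIFO queue guarantees cells are expanded in order of distance from
    `start`, so the first time `goal` is dequeued its distance is minimal.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


# Hypothetical maze with wide open areas around a short wall.
wide_maze = [
    "....",
    "..#.",
    "..#.",
    "....",
]
# Hypothetical maze split in two by a full wall.
blocked_maze = [
    "..#.",
    "..#.",
    "..#.",
    "..#.",
]

reachable = shortest_path_length(wide_maze, (0, 0), (3, 3))
unreachable = shortest_path_length(blocked_maze, (0, 0), (0, 3))
```

In wide corridors many cells are reachable along several routes, so a policy cannot simply follow a wall; the `visited` set and the queue are the explicit memory that keeps the search from revisiting cells and ensures the returned path length is minimal.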

FunSearch Results
Maze 20×20
Standard KAREL MAZE layout
Wide Maze 20×20
SPARSE MAZE layout

Citation

@misc{rajabpour2025commonbenchmarksundervaluegeneralization,
  title={Common Benchmarks Undervalue the Generalization Power of Programmatic Policies},
  author={Amirhossein Rajabpour and Kiarash Aghakasiri and Sandra Zilles and Levi H. S. Lelis},
  year={2025},
  eprint={2506.14162},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.14162},
}