Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

Abstract

Bilevel planning provides a hierarchical framework for solving tasks in low-level (LL) continuous state and action spaces, using high-level (HL) world model abstractions to enable sample-efficient, long-horizon planning. We propose BISON, an embodied AI system that learns bilevel policies $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$ operating over both the HL abstraction and the LL environment space to handle open-world environments without replanning. Experiments over extended MetaWorld benchmarks show that BISON solves longer-horizon tasks than VLA and end-to-end baselines and is more robust to uncertainty than existing bilevel planning approaches.

Method

We introduce BISON, an embodied AI system that learns bilevel policies over symbolic world models for long-horizon planning. BISON learns separate policies $\pi^{\mathrm{hl}}$ and $\pi^{\mathrm{ll}}$ over an HL symbolic world model and the LL environment, respectively. Learning bilevel policies has several benefits over learning a single monolithic policy (e.g. a standalone VLA), such as improved sample complexity, efficiency, generalisation, interpretability and modularity. We demonstrate two of these advantages in our experiments.

Inputs

BISON is given an abstraction language $\mathcal{D}$ that models a symbolic, object-centric world model, and a labelling function $\mathcal{L}$ that maps LL states to HL abstractions conforming to $\mathcal{D}$. BISON is given LL demonstrations paired with HL goals for learning bilevel policies.
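To make the role of the labelling function concrete, here is a minimal sketch of what such a function could look like for a block-stacking domain. Everything in it is an illustrative assumption rather than BISON's actual implementation: the name `label`, the predicates `on` and `on-table`, the position-based LL state, and the tolerance and block height values.

```python
from typing import Dict, FrozenSet, Tuple

Position = Tuple[float, float, float]
Atom = Tuple[str, ...]


def label(
    ll_state: Dict[str, Position],
    tol: float = 0.02,           # hypothetical geometric tolerance
    block_height: float = 0.05,  # hypothetical block size
) -> FrozenSet[Atom]:
    """Hypothetical labelling function: object positions -> symbolic atoms."""
    atoms = set()
    for a, (ax, ay, az) in ll_state.items():
        # An object resting at table height is labelled on-table.
        if abs(az) <= tol:
            atoms.add(("on-table", a))
        for b, (bx, by, bz) in ll_state.items():
            if a == b:
                continue
            aligned = abs(ax - bx) <= tol and abs(ay - by) <= tol
            stacked = abs(az - (bz + block_height)) <= tol
            # a is on b if it sits directly above b at one block height.
            if aligned and stacked:
                atoms.add(("on", a, b))
    return frozenset(atoms)


example_state = {"a": (0.0, 0.0, 0.0), "b": (0.0, 0.0, 0.05), "c": (0.3, 0.0, 0.0)}
example_atoms = label(example_state)
```

For `example_state`, the resulting abstraction contains `("on-table", "a")`, `("on-table", "c")` and `("on", "b", "a")`: continuous coordinates are discarded and only the relational structure is kept.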

Learning

We realise the HL policy $\pi^{\mathrm{hl}}$ as a set of first-order, condition-action rules, and the LL policy $\pi^{\mathrm{ll}}$ via a graph neural network. The HL policy is learned with goal regression and inductive generalisation, while the LL policy is learned via typical maximum likelihood estimation.
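As a point of reference, a standard maximum likelihood objective for the LL policy could be written as follows. This is a sketch under assumptions not stated above: $\theta$ denotes the GNN parameters, $\mathcal{T}$ is a set of per-step training tuples extracted from the demonstrations, and each tuple carries an HL action label $a^{\mathrm{hl}}$.

$$ \theta^{*} = \arg\max_{\theta} \sum_{(\mathbf{s}^{\mathrm{ll}},\, a^{\mathrm{hl}},\, g^{\mathrm{hl}},\, \mathbf{a}^{\mathrm{ll}}) \in \mathcal{T}} \log \pi^{\mathrm{ll}}_{\theta}(\mathbf{a}^{\mathrm{ll}} \mid \mathbf{s}^{\mathrm{ll}}, a^{\mathrm{hl}}, g^{\mathrm{hl}}) $$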

Execution

For each environment step, BISON's $\pi^{\mathrm{hl}}$ first computes an HL action over the HL abstraction, which is then realised by $\pi^{\mathrm{ll}}$ in the LL environment to return an action to execute. Mathematically, BISON can be realised as a single goal-conditioned policy as follows:

$$ \pi(\mathbf{a}^{\mathrm{ll}} \mid \mathbf{s}^{\mathrm{ll}}, \mathit{g}^{\mathrm{hl}}) = \sum_{\mathit{a}^{\mathrm{hl}}} \pi^{\mathrm{ll}}(\mathbf{a}^{\mathrm{ll}} \mid \mathbf{s}^{\mathrm{ll}}, \mathit{a}^{\mathrm{hl}}, \mathit{g}^{\mathrm{hl}}) \cdot \pi^{\mathrm{hl}}(\mathit{a}^{\mathrm{hl}} \mid \mathcal{L}(\mathbf{s}^{\mathrm{ll}}), \mathit{g}^{\mathrm{hl}}). $$
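The sum over HL actions in the equation above can be illustrated with a small discrete sketch. It assumes finite action sets and represents each policy as a function returning a dictionary of probabilities. BISON's LL actions are continuous, so this is a toy instance of the marginalisation only, and the names `bilevel_policy`, `toy_label`, `toy_pi_hl` and `toy_pi_ll` are hypothetical.

```python
from typing import Callable, Dict, Hashable

Dist = Dict[Hashable, float]


def bilevel_policy(
    ll_state,
    hl_goal,
    label: Callable,
    pi_hl: Callable[..., Dist],
    pi_ll: Callable[..., Dist],
) -> Dist:
    """Marginalise the HL action out of pi_ll * pi_hl (discrete toy version)."""
    hl_state = label(ll_state)  # L(s_ll)
    mixture: Dist = {}
    for hl_action, p_hl in pi_hl(hl_state, hl_goal).items():
        for ll_action, p_ll in pi_ll(ll_state, hl_action, hl_goal).items():
            mixture[ll_action] = mixture.get(ll_action, 0.0) + p_ll * p_hl
    return mixture


# Toy stand-ins for the labelling function and the two learned policies.
def toy_label(ll_state):
    return "holding-nothing"


def toy_pi_hl(hl_state, hl_goal) -> Dist:
    return {"pick": 0.75, "place": 0.25}


def toy_pi_ll(ll_state, hl_action, hl_goal) -> Dist:
    if hl_action == "pick":
        return {"left": 0.5, "down": 0.5}
    return {"down": 1.0}


toy_mixture = bilevel_policy(None, "goal", toy_label, toy_pi_hl, toy_pi_ll)
```

Here `toy_mixture` assigns probability 0.375 to `"left"` and 0.625 to `"down"`. When the HL policy is deterministic, as a set of condition-action rules can be, the sum collapses to a single term and execution reduces to calling the LL policy on the selected HL action.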

HL Policy

We realise HL policies as sets of first-order, condition-action rules for predicting HL actions over the HL abstraction.
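A first-order, condition-action rule can be sketched as a set of lifted conditions over the current HL state and the HL goal, plus a lifted action to emit when some variable binding satisfies them. The representation below is an illustrative assumption, not BISON's rule format: the dictionary keys, the `?x`-style variables, the example `stack` rule and the brute-force grounding are all hypothetical, and the learning step (goal regression and inductive generalisation) is not shown.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Optional, Tuple

Atom = Tuple[str, ...]


def substitute(atom: Atom, binding: Dict[str, str]) -> Atom:
    """Replace variables such as '?x' with the objects bound to them."""
    return tuple(binding.get(term, term) for term in atom)


def first_applicable_action(
    rules: List[dict],
    state: FrozenSet[Atom],
    goal: FrozenSet[Atom],
    objects: List[str],
) -> Optional[Atom]:
    """Return the action of the first rule with a satisfying binding."""
    for rule in rules:
        # Brute-force grounding: try every assignment of objects to variables.
        for values in product(objects, repeat=len(rule["params"])):
            binding = dict(zip(rule["params"], values))
            state_ok = all(substitute(c, binding) in state for c in rule["state"])
            goal_ok = all(substitute(c, binding) in goal for c in rule["goal"])
            if state_ok and goal_ok:
                return substitute(rule["action"], binding)
    return None


# Hypothetical rule: if the goal wants ?x on ?y and both are free, stack them.
stack_rule = {
    "params": ["?x", "?y"],
    "state": [("clear", "?x"), ("clear", "?y"), ("on-table", "?x")],
    "goal": [("on", "?x", "?y")],
    "action": ("stack", "?x", "?y"),
}

objects = ["a", "b", "c"]
state = frozenset(
    [("clear", o) for o in objects] + [("on-table", o) for o in objects]
)
goal = frozenset({("on", "c", "a")})
chosen = first_applicable_action([stack_rule], state, goal, objects)
```

With the state and goal above, `chosen` is `("stack", "c", "a")`. Because the rule is lifted over variables rather than tied to specific objects, the same rule applies unchanged when the list of objects grows.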

LL Policy

We realise LL policies as graph neural networks for predicting LL actions over the LL environment.

BISON Elicits Long-Horizon Planning

Bilevel planning employs HL state and temporal abstractions over which one can find HL solutions to guide LL acting. Reasoning over HL abstractions is generally easier because they have a much smaller state space than the underlying LL environment. In our work, we focus on HL abstractions expressed in a formal, relational language such as PDDL. In ML terms, abstractions are connected to feature relevance and attention, as they usually include only what is necessary for the HL task at hand.

Our experiments demonstrate that bilevel policies generalise to longer-horizon tasks more robustly than end-to-end policies.

✅ End-to-end policies can generalise to 3 blocks.

✅ Bilevel policies can generalise to 3 blocks.

⛔ End-to-end policies get stuck with 10 objects.

✅ Bilevel policies can generalise to 10 blocks.

BISON and end-to-end policies are trained on problems with 3 objects and evaluated on problems with up to 10 objects. BISON can generalise and match the performance of the oracle. End-to-end policies often struggle with more objects.

BISON is Robust to Uncertainty and Open-World Planning

Bilevel policies are closed-loop policies that act as reactive controllers, as opposed to plans which are sequences of actions. As such, bilevel policies can handle multiple forms of uncertainty such as exogenous events, open-world planning and partial observability. On the other hand, planning approaches require replanning and are restricted to the assumptions made in the model abstraction.
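The difference between a closed-loop policy and a fixed action sequence can be seen in a toy one-dimensional example, sketched below. The environment, the function names and the `disturbances` dictionary (mapping a time step to an exogenous displacement) are hypothetical, and the open-loop runner executes its plan without any replanning, so it is not a model of a replanning baseline.

```python
from typing import Dict


def run_open_loop(start: int, goal: int, disturbances: Dict[int, int]) -> int:
    """Execute a plan computed once from the start state, with no replanning."""
    plan = [1] * (goal - start)  # fixed sequence of +1 moves
    pos = start
    for t, action in enumerate(plan):
        pos += action + disturbances.get(t, 0)
    return pos


def run_closed_loop(
    start: int, goal: int, disturbances: Dict[int, int], max_steps: int = 50
) -> int:
    """Choose each action from the currently observed state."""
    pos = start
    for t in range(max_steps):
        if pos == goal:
            break
        action = 1 if pos < goal else -1
        pos += action + disturbances.get(t, 0)
    return pos


# An exogenous event pushes the agent back by 2 at time step 2.
push = {2: -2}
open_loop_final = run_open_loop(0, 5, push)
closed_loop_final = run_closed_loop(0, 5, push)
```

With the disturbance, the open-loop run ends at position 3 while the closed-loop run reaches the goal at 5. Without disturbances, both reach 5.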

Our experiments demonstrate that bilevel policies are more robust to uncertainty than symbolic (re)planning methods.

BISON and NdtReplan share the same GNN and weights for the LL policy but differ in how they handle HL planning. NdtReplan uses a nondeterministic AI planner and tries to replan on failures. BISON is more robust to uncertainty than NdtReplan. BISON also supports open-world planning, as exhibited by the Gacha environment, while NdtReplan's reliance on PDDL means that it cannot.

© 2026 Dillon Z. Chen