Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
Abstract
Bilevel planning provides a hierarchical framework for solving tasks in low-level (LL) continuous state and action spaces, using high-level (HL) world model abstractions to enable sample-efficient, long-horizon planning. We propose BISON, an embodied AI system that learns bilevel policies $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$ operating over both the HL abstraction and the LL environment space to handle open-world environments without replanning. Experiments over extended MetaWorld benchmarks show that BISON solves longer-horizon tasks than VLA and end-to-end baselines and is more robust to uncertainty than existing bilevel planning approaches.
Method
We introduce BISON, an embodied AI system that learns bilevel policies over symbolic world models for long-horizon planning. BISON learns separate policies $\pi^{\mathrm{hl}}$ and $\pi^{\mathrm{ll}}$ over an HL symbolic world model and the LL environment, respectively. Compared with learning a single monolithic policy (e.g. a standalone VLA), learning bilevel policies offers several benefits, such as improved sample complexity, efficiency, generalisation, interpretability and modularity. Our experiments demonstrate two of these advantages: generalisation to longer-horizon tasks and robustness to uncertainty.
Inputs
BISON is given an abstraction language $\mathcal{D}$ that models a symbolic, object-centric world model, and a labelling function $\mathcal{L}$ that maps LL states to HL abstractions conforming to $\mathcal{D}$. BISON is given LL demonstrations paired with HL goals for learning bilevel policies.
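To make the role of the labelling function $\mathcal{L}$ concrete, here is a minimal sketch for a toy block-stacking domain. Everything in it is an illustrative assumption rather than BISON's actual interface: the LL state is taken to be a dictionary of object positions, the predicates `on`, `on-table` and `clear` stand in for whatever $\mathcal{D}$ defines, and `BLOCK_SIZE` and `TOL` are hypothetical constants.

```python
from typing import Dict, FrozenSet, Tuple

Atom = Tuple[str, ...]                            # e.g. ("on", "b", "a")
LLState = Dict[str, Tuple[float, float, float]]   # object name -> (x, y, z)

BLOCK_SIZE = 0.05  # hypothetical block edge length (metres)
TOL = 0.01         # hypothetical geometric tolerance (metres)


def label(state: LLState) -> FrozenSet[Atom]:
    """Map continuous block positions to a set of symbolic atoms."""
    atoms = set()
    for a, (ax, ay, az) in state.items():
        # A block resting on the table has its centre half a block above z = 0.
        if abs(az - BLOCK_SIZE / 2) <= TOL:
            atoms.add(("on-table", a))
        for b, (bx, by, bz) in state.items():
            if a == b:
                continue
            aligned = abs(ax - bx) <= TOL and abs(ay - by) <= TOL
            stacked = abs(az - bz - BLOCK_SIZE) <= TOL
            if aligned and stacked:
                atoms.add(("on", a, b))  # a sits directly on top of b
    # A block is clear when no other block sits on it.
    covered = {atom[2] for atom in atoms if atom[0] == "on"}
    for a in state:
        if a not in covered:
            atoms.add(("clear", a))
    return frozenset(atoms)
```

Calling `label` on a scene with block `b` stacked on block `a` and a separate block `c` on the table yields atoms such as `("on", "b", "a")` and `("clear", "c")`, which is the kind of object-centric abstraction the HL policy reasons over.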
Learning
We realise the HL policy $\pi^{\mathrm{hl}}$ as a set of first-order condition-action rules, and the LL policy $\pi^{\mathrm{ll}}$ as a graph neural network. The HL policy is learned with goal regression and inductive generalisation, while the LL policy is learned via standard maximum likelihood estimation.
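The following sketch shows how a set of first-order condition-action rules might be executed once learned. It does not implement goal regression or inductive generalisation, and the `Rule` structure, the `?x`-style variables and the `stack` action are hypothetical choices for illustration, not the representation used by BISON. Rules are grounded by brute-force enumeration over object tuples, which is simple but not efficient.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, FrozenSet, Optional, Sequence, Tuple

Atom = Tuple[str, ...]


@dataclass(frozen=True)
class Rule:
    params: Tuple[str, ...]        # first-order variables, e.g. ("?x", "?y")
    state_conds: Tuple[Atom, ...]  # atoms that must hold in the HL state
    goal_conds: Tuple[Atom, ...]   # atoms that must appear in the HL goal
    action: Atom                   # lifted HL action to emit


def substitute(atom: Atom, binding: Dict[str, str]) -> Atom:
    """Replace variables in an atom with the objects bound to them."""
    return tuple(binding.get(term, term) for term in atom)


def hl_policy(rules: Sequence[Rule], objects: Sequence[str],
              state: FrozenSet[Atom], goal: FrozenSet[Atom]) -> Optional[Atom]:
    """Return the first ground action whose rule conditions are satisfied."""
    for rule in rules:
        for values in product(objects, repeat=len(rule.params)):
            binding = dict(zip(rule.params, values))
            state_ok = all(substitute(c, binding) in state for c in rule.state_conds)
            goal_ok = all(substitute(c, binding) in goal for c in rule.goal_conds)
            if state_ok and goal_ok:
                return substitute(rule.action, binding)
    return None  # no rule applies


# Hypothetical rule: stack ?x on ?y when both are clear, ?x is on the table,
# and the goal asks for on(?x, ?y).
stack_rule = Rule(
    params=("?x", "?y"),
    state_conds=(("clear", "?x"), ("clear", "?y"), ("on-table", "?x")),
    goal_conds=(("on", "?x", "?y"),),
    action=("stack", "?x", "?y"),
)
```

Because the rule mentions only variables, the same rule applies unchanged to problems with any number of objects, which is one reason lifted rules are a natural fit for generalising across problem sizes.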
Execution
At each environment step, BISON's $\pi^{\mathrm{hl}}$ first computes an HL action over the HL abstraction, which $\pi^{\mathrm{ll}}$ then realises in the LL environment by returning an LL action to execute. Mathematically, BISON can be written as a single goal-conditioned policy: $$ \pi(\mathbf{a}^{\mathrm{ll}} \mid \mathbf{s}^{\mathrm{ll}}, g^{\mathrm{hl}}) = \sum_{a^{\mathrm{hl}}} \pi^{\mathrm{ll}}(\mathbf{a}^{\mathrm{ll}} \mid \mathbf{s}^{\mathrm{ll}}, a^{\mathrm{hl}}, g^{\mathrm{hl}}) \cdot \pi^{\mathrm{hl}}(a^{\mathrm{hl}} \mid \mathcal{L}(\mathbf{s}^{\mathrm{ll}}), g^{\mathrm{hl}}). $$
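The marginalisation over HL actions can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: LL actions are treated as a finite set so that distributions fit in dictionaries (the actual LL action space is continuous), and `hl_policy`, `ll_policy` and `label` are hypothetical callables standing in for $\pi^{\mathrm{hl}}$, $\pi^{\mathrm{ll}}$ and $\mathcal{L}$.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable

Dist = Dict[Hashable, float]  # action -> probability


def bilevel_policy(
    hl_policy: Callable[[Any, Any], Dist],       # (HL state, HL goal) -> dist over HL actions
    ll_policy: Callable[[Any, Any, Any], Dist],  # (LL state, HL action, HL goal) -> dist over LL actions
    label: Callable[[Any], Any],                 # LL state -> HL state
    ll_state: Any,
    hl_goal: Any,
) -> Dist:
    """Compute pi(a_ll | s_ll, g_hl) by summing over HL actions."""
    hl_state = label(ll_state)
    combined: Dist = defaultdict(float)
    for hl_action, p_hl in hl_policy(hl_state, hl_goal).items():
        for ll_action, p_ll in ll_policy(ll_state, hl_action, hl_goal).items():
            combined[ll_action] += p_ll * p_hl
    return dict(combined)
```

When the HL policy is deterministic, as a rule-based policy that returns one action typically is, the sum collapses to a single term and the combined policy is simply $\pi^{\mathrm{ll}}$ conditioned on the selected HL action.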
BISON Elicits Long-Horizon Planning
Bilevel planning employs HL state and temporal abstractions over which one can find HL solutions to guide LL acting. Reasoning over HL abstractions is generally easier because they have a much smaller state space than the underlying LL environment. In our work, we focus on HL abstractions expressed in a formal, relational language such as PDDL. In ML terms, abstractions are connected to feature relevance and attention, as they usually include only what is necessary for the HL task at hand.
Our experiments demonstrate that bilevel policies generalise to longer-horizon tasks more robustly than end-to-end policies.
✅ End-to-end policies can generalise to 3 blocks.
✅ Bilevel policies can generalise to 3 blocks.
⛔ End-to-end policies get stuck with 10 blocks.
✅ Bilevel policies can generalise to 10 blocks.
BISON is Robust to Uncertainty and Open-World Planning
Bilevel policies are closed-loop policies that act as reactive controllers, as opposed to plans, which are fixed sequences of actions. As such, bilevel policies can handle multiple forms of uncertainty such as exogenous events, open-world planning and partial observability. In contrast, plan-based approaches must replan whenever execution deviates from the plan, and are restricted to the assumptions made in the model abstraction.
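The difference between executing a fixed plan and running a closed-loop policy can be seen in a toy example. The sketch below is not one of the paper's experiments: it assumes a hypothetical one-dimensional world where the agent moves one cell per step towards a goal, and an exogenous event (given as a step-indexed displacement) pushes it back mid-execution.

```python
from typing import Dict, List


def policy(state: int, goal: int) -> int:
    """Closed-loop controller: choose a move from the current state."""
    if state < goal:
        return 1
    if state > goal:
        return -1
    return 0


def make_plan(start: int, goal: int) -> List[int]:
    """Open-loop plan: a fixed action sequence computed once from the start."""
    step = 1 if goal >= start else -1
    return [step] * abs(goal - start)


def run_open_loop(start: int, goal: int, disturbance: Dict[int, int]) -> int:
    state = start
    for t, action in enumerate(make_plan(start, goal)):
        state += action
        state += disturbance.get(t, 0)  # exogenous event, unseen by the plan
    return state


def run_closed_loop(start: int, goal: int, disturbance: Dict[int, int],
                    horizon: int) -> int:
    state = start
    for t in range(horizon):
        state += policy(state, goal)    # re-reads the state at every step
        state += disturbance.get(t, 0)
    return state


# The agent is pushed back two cells after its third move.
push = {2: -2}
final_open = run_open_loop(0, 5, push)
final_closed = run_closed_loop(0, 5, push, horizon=10)
```

The open-loop run ends short of the goal because the plan has no way to notice the push, whereas the closed-loop run recovers without any explicit replanning step, since the policy is simply queried again from the new state.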
Our experiments demonstrate that bilevel policies are more robust to uncertainty than symbolic (re)planning methods.