POLICEd RL
Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems

Jean-Baptiste Bouvier, Kartik Nagpal, Negar Mehr
ICON Lab at UC Berkeley
2024 Conference on Decision and Control (CDC)

Guaranteed soft landing of the space shuttle using our learned POLICEd controller.

Abstract

In this paper, we develop a method for learning a control policy guaranteed to satisfy an affine state constraint of high relative degree in closed loop with a black-box system. Previous reinforcement learning (RL) approaches to satisfy safety constraints either require access to the system model, or assume control affine dynamics, or only discourage violations with reward shaping. Only recently have these issues been addressed by our previous work POLICEd RL, which guarantees constraint satisfaction for black-box systems. However, that work can only enforce constraints of relative degree 1. To address this gap, our key insight is to make the learned policy affine around the unsafe set and to use this affine region to dissipate the inertia of the high relative degree constraint. We prove that such policies guarantee constraint satisfaction for deterministic systems while being agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of our approach to enforce hard constraints on the Gym inverted pendulum and on a space shuttle landing simulation.

Most safe RL works rely on reward shaping to discourage violations of a safety constraint. However, such soft constraints do not guarantee safety. Previous works trying to enforce hard constraints in RL typically suffer from one of two limitations: either they need an accurate model of the environment, or their learned safety certificate only approximates an actual safety certificate without providing guarantees.

On the other hand, our POLICEd RL approach can provably enforce hard constraint satisfaction in closed loop with a black-box environment. We build a repulsive buffer region in front of the constraint to prevent trajectories from approaching it. Since trajectories cannot cross this buffer, they also cannot violate the constraint.
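To make this buffer intuition concrete, the relative degree $1$ case from our previous work can be summarized schematically as follows (discretization and approximation terms are omitted here; the precise assumptions and theorem are in the RSS paper):

$$\mathcal{B} := \big\{ x \;:\; y_{max} - r \leq y(x) \leq y_{max} \big\}, \qquad \dot y(x) \leq 0 \ \text{for all } x \in \mathcal{B} \;\Longrightarrow\; y(t) \leq y_{max} \ \text{for every trajectory starting with } y(0) \leq y_{max} - r.$$

When the constraint has relative degree $2$, the control input only affects $\ddot y$, so $\dot y$ carries inertia and cannot be flipped to a safe direction instantaneously. The figure below shows how the buffer is used instead to drive $\dot y$ down to $0$ before $y$ reaches $y_{max}$.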

POLICEd RL illustration

Phase portrait of constrained output $y$ illustrating our High Relative Degree POLICEd RL method on a system of relative degree $2$. To prevent states from violating constraint $y \leq y_{max}$ (red dashed line), our policy guarantees that trajectories entering buffer region $\mathcal{B}$ (blue) cannot leave it through its upper bound (blue dotted line). Our policy makes $\ddot y$ sufficiently negative in buffer $\mathcal{B}$ to bring $\dot y$ to $0$ along all trajectories entering $\mathcal{B}$. Once $\dot y < 0$, trajectories cannot approach the constraint. Due to the states' inertia, it is physically impossible to prevent all constraint violations. For instance, a state with $y = y_{max}$ and $\dot y \gg 0$ will yield $y > y_{max}$ at the next timestep. Hence, we only aim to guarantee the safety of trajectories entering buffer $\mathcal{B}$. We use the POLICE algorithm to make our policy affine inside buffer region $\mathcal{B}$.
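The POLICE step itself can be sketched in a few lines of PyTorch. The snippet below is only a minimal illustration of the bias-shift idea from Balestriero and LeCun's POLICE paper, not the authors' released implementation: the vertices of buffer $\mathcal{B}$ are propagated through the ReLU network layer by layer, and each hidden unit's bias is shifted so that every vertex lands on the same side of the ReLU. All vertices then share one activation pattern, so the network is affine on their convex hull, i.e., on $\mathcal{B}$. The layer sizes and vertex coordinates are made-up placeholders.

import torch
import torch.nn as nn

class PolicedMLP(nn.Module):
    """ReLU policy network forced to be affine on the convex hull of `vertices`."""
    def __init__(self, sizes, vertices):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        # Vertices of buffer region B (a polytope in state space), shape (num_vertices, state_dim).
        self.register_buffer("vertices", vertices)

    def forward(self, x):
        h, v = x, self.vertices
        for i, layer in enumerate(self.layers):
            pre_x, pre_v = layer(h), layer(v)
            if i < len(self.layers) - 1:
                # Pick one sign per hidden unit (here, the sign of the mean pre-activation over the vertices).
                sign = 2.0 * (pre_v.mean(dim=0) >= 0).float() - 1.0
                # Shift the bias so every vertex lands on that side of the ReLU.
                # The shift is applied to all inputs, i.e., it is absorbed into the layer bias.
                shift = sign * torch.relu(-sign * pre_v).max(dim=0).values
                pre_x, pre_v = pre_x + shift, pre_v + shift
                h, v = torch.relu(pre_x), torch.relu(pre_v)
            else:
                h = pre_x  # linear output layer
        return h

# Hypothetical usage: a rectangular buffer in a 2D state (y, y_dot) near y_max = 1, with a 1D action.
vertices = torch.tensor([[0.9, -0.1], [0.9, 0.1], [1.0, -0.1], [1.0, 0.1]])
policy = PolicedMLP(sizes=[2, 64, 64, 1], vertices=vertices)
action = policy(torch.tensor([[0.95, 0.05]]))  # the policy is affine over the buffer

Outside of $\mathcal{B}$ the network keeps its expressivity, and the construction remains agnostic to the choice of RL training algorithm.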

Guaranteeing a soft landing for the space shuttle

space shuttle

Guaranteed stabilization of the inverted pendulum

inverted pendulum

BibTeX

@inproceedings{bouvier2024learning,
  title     = {Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems},
  author    = {Bouvier, Jean-Baptiste and Nagpal, Kartik and Mehr, Negar},
  booktitle = {Conference on Decision and Control (CDC)},
  year      = {2024}
}

References

[POLICEd RL]
Jean-Baptiste Bouvier, Kartik Nagpal, and Negar Mehr, POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints, Robotics: Science and Systems (RSS), 2024.
[POLICE]
Randall Balestriero and Yann LeCun, POLICE: Provably optimal linear constraint enforcement for deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[TD3]
Scott Fujimoto, Herke van Hoof, and David Meger, Addressing function approximation error in actor-critic methods, International Conference on Machine Learning (ICML), 2018.