In this paper, we develop a method for learning a control policy guaranteed to satisfy an affine state constraint of high relative degree in closed loop with a black-box system. Previous reinforcement learning (RL) approaches to satisfying safety constraints either require access to the system model, assume control-affine dynamics, or only discourage violations with reward shaping. Only recently have these issues been addressed by our previous work, POLICEd RL, which guarantees constraint satisfaction for black-box systems. However, that work can only enforce constraints of relative degree 1. To address this gap, our key insight is to make the learned policy affine around the unsafe set and to use this affine region to dissipate the inertia of the high relative degree constraint. We prove that such policies guarantee constraint satisfaction for deterministic systems while remaining agnostic to the choice of RL training algorithm. Our results demonstrate the capacity of our approach to enforce hard constraints on the Gym inverted pendulum and on a space shuttle landing simulation.
Most safe RL works rely on reward shaping to discourage violations of a safety constraint. However, such soft constraints do not guarantee safety. Previous works trying to enforce hard constraints in RL typically suffer from one of two limitations: either they need an accurate model of the environment, or their learned safety certificates only approximate actual safety certificates without any guarantees.
On the other hand, our POLICEd RL approach can provably enforce hard constraint satisfaction in closed loop with a black-box environment. We build a repulsive buffer region in front of the constraint to prevent trajectories from approaching it. Since trajectories cannot cross this buffer, they also cannot violate the constraint.
Phase portrait of the constrained output $y$ illustrating our High Relative Degree POLICEd RL method on a system of relative degree $2$. To prevent states from violating the constraint $y \leq y_{max}$ (red dashed line), our policy guarantees that trajectories entering buffer region $\mathcal{B}$ (blue) cannot leave it through its upper bound (blue dotted line). Our policy makes $\ddot y$ sufficiently negative in buffer $\mathcal{B}$ to bring $\dot y$ to $0$ for all trajectories entering $\mathcal{B}$. Once $\dot y < 0$, trajectories cannot approach the constraint. Due to the states' inertia, it is physically impossible to prevent all constraint violations: for instance, $y = y_{max}$ with $\dot y \gg 1$ will yield $y > y_{max}$ at the next timestep. Hence, we only aim to guarantee the safety of trajectories entering buffer $\mathcal{B}$. We use the POLICE algorithm to make our policy affine inside buffer region $\mathcal{B}$.
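As a sanity check of this braking argument, the following minimal numerical sketch simulates a double integrator (relative degree $2$): inside an illustrative buffer of depth eps below y_max, the control commands a constant $\ddot y = -c$, which brings $\dot y$ to $0$ before $y$ reaches $y_{max}$ whenever eps exceeds the stopping distance $v^2 / (2c)$ of the entering trajectory. The buffer shape, the constant braking command, and all numerical values are illustrative assumptions, not the exact construction or parameters of the paper.

# Minimal double-integrator sketch: inside an assumed buffer
# B = { y_max - eps <= y <= y_max }, the control forces ydd = -c,
# which brakes ydot to 0 before y can reach y_max as long as
# eps >= v_enter^2 / (2 c) (standard stopping-distance bound).
y_max, eps, c, dt = 1.0, 0.2, 2.0, 1e-3
y, ydot = 0.5, 0.6                      # starts outside B, moving toward the constraint

for _ in range(5000):
    in_buffer = y_max - eps <= y <= y_max
    ydd = -c if in_buffer else 0.0      # constant braking action inside B only
    ydot += ydd * dt
    y += ydot * dt
    if in_buffer and ydot <= 0.0:       # once ydot <= 0, y can no longer approach y_max
        break

print(f"y = {y:.3f} <= y_max = {y_max},  ydot = {ydot:.3f}")
assert y <= y_max

Here the trajectory enters the buffer with speed $0.6$, whose stopping distance $0.6^2/(2 \cdot 2) = 0.09$ is smaller than the buffer depth $0.2$, so the output settles around $y \approx 0.89 < y_{max}$.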
We implement our approach on two environments: the Gym inverted pendulum and a Space Shuttle landing scenario. For both tasks, we train baseline PPO policies with additional negative rewards to discourage constraint violations. We then augment these PPO policies with our POLICEd RL approach to bring constraint violations to zero.
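For concreteness, the sketch below shows one possible setup for such a shaped-reward PPO baseline on the Gym inverted pendulum, using a Gymnasium reward-penalty wrapper and stable-baselines3. The environment id, the observation index of the pole angle, the angle threshold, and the penalty weight are assumptions for illustration, and the POLICEd modification of the policy network itself is not shown here.

# Sketch of a reward-shaped PPO baseline: a wrapper adds a negative reward
# whenever the pole angle leaves an assumed safe range (a soft constraint
# that discourages, but does not prevent, violations).
import gymnasium as gym
from stable_baselines3 import PPO

class ConstraintPenaltyWrapper(gym.Wrapper):
    def __init__(self, env, theta_max=0.2, penalty=10.0):
        super().__init__(env)
        self.theta_max, self.penalty = theta_max, penalty

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        theta = obs[1]                     # pole angle in the InvertedPendulum observation
        if abs(theta) > self.theta_max:    # assumed constraint threshold
            reward -= self.penalty
        return obs, reward, terminated, truncated, info

env = ConstraintPenaltyWrapper(gym.make("InvertedPendulum-v4"))
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=100_000)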
The Space Shuttle soft landing scenario requires bringing the vehicle to the ground with a vertical velocity larger than -6 ft/s. Such a soft landing is notoriously difficult due to the poor gliding performance of the Space Shuttle, which famously "flies like a brick". The small vertical speed $\dot h$ also translates into a small flight path angle $\gamma$.
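The arithmetic behind that last remark follows from the standard flight-dynamics relation $\dot h = V \sin \gamma$: at touchdown airspeed $V$, the sink-rate bound $\dot h \geq -6$ ft/s forces $\gamma$ to stay very shallow. The sketch below works this out; the airspeed value is an illustrative assumption, not the paper's simulation parameter.

# Quick check of the soft-landing constraint: hdot = V * sin(gamma),
# so hdot >= -6 ft/s restricts gamma to a very shallow descent angle.
import math

V = 300.0                               # assumed touchdown airspeed [ft/s]
hdot_min = -6.0                         # soft-landing requirement [ft/s]

gamma_min = math.asin(hdot_min / V)     # shallowest allowed flight path angle [rad]
print(f"gamma must satisfy gamma >= {math.degrees(gamma_min):.2f} deg")

for gamma_deg in (-3.0, -1.5, -1.0):
    hdot = V * math.sin(math.radians(gamma_deg))
    print(f"gamma = {gamma_deg:+.1f} deg -> hdot = {hdot:+.2f} ft/s,",
          "OK" if hdot >= hdot_min else "violates")

With these numbers, the flight path angle must stay above roughly -1.15 degrees, which illustrates why a small $\dot h$ forces a small $\gamma$.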
The difficulty with the inverted pendulum lies in enlarging the stability region and guaranteeing constraint satisfaction for a wide range of initial conditions. Both the baseline PPO policy and its POLICEd version easily stabilize the pendulum when starting near the equilibrium, but they differ on more demanding initial conditions.
@inproceedings{bouvier2024learning,
title = {Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems},
author = {Bouvier, Jean-Baptiste and Nagpal, Kartik and Mehr, Negar},
booktitle = {Conference on Decision and Control (CDC)},
year = {2024}
}