POLICEd RL
Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems

Jean-Baptiste Bouvier, Kartik Nagpal, Negar Mehr
ICON Lab at UC Berkeley
2024 Conference on Decision and Control (CDC)

Guaranteed soft landing of the space shuttle using our learned POLICEd controller.

Abstract

In this paper, we develop a method for learning a control policy guaranteed to satisfy an affine state constraint of high relative degree in closed loop with a black-box system. Previous reinforcement learning (RL) approaches to satisfy safety constraints either require access to the system model, or assume control affine dynamics, or only discourage violations with reward shaping. Only recently have these issues been addressed with our previous work POLICEd RL, which guarantees constraint satisfaction for black-box systems. However, this previous work can only enforce constraints of relative degree 1. To address this gap, our key insight is to make the learned policy be affine around the unsafe set and to use this affine region to dissipate the inertia of the high relative degree constraint. We prove that such policies guarantee constraint satisfaction for deterministic systems while being agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of our approach to enforce hard constraints in the Gym inverted pendulum and on a space shuttle landing simulation.

Most safe RL works rely on reward shaping to discourage violations of a safety constraint. However, such soft constraints do not guarantee safety. Previous works trying to enforce hard constraints in RL typical suffer from two limitations: either they need an accurate model of the environment, or their learned safety certificate only approximate without guarantees an actual safety certificate.

On the other hand, our POLICEd RL approach can provably enforce hard constraint satisfaction in closed-loop with a black-box environment. We build a repulsive buffer region in front of the constraint to prevent trajectories from approaching it. Since trajectories cannot cross this buffer, they also cannot violate the constraint.

Phase portrait of constrained output $y$ illustrating our High Relative Degree POLICEd RL method on a system of relative degree $2$. To prevent states from violating constraint $y \leq y_{max}$ (red dashed line), our policy guarantees that trajectories entering buffer region $\mathcal{B}$ (blue) cannot leave it through its upper bound (blue dotted line). Our policy makes $\ddot y$ sufficiently negative in buffer $\mathcal{B}$ to bring $\dot y$ to $0$ in all trajectories entering $\mathcal{B}$. Once $\dot y < 0$, trajectories cannot approach the constraint. Due to the states' inertia, it is physically impossible to prevent all constraint violations. For instance, $y = y_{max}$, $\dot y >> 1$ will yield $y > y_{max}$ at the next timestep. Hence, we only aim at guaranteeing the safety of trajectories entering buffer $\mathcal{B}$. We use the POLICE algorithm to make our policy affine inside buffer region $\mathcal{B}$.

We implemented our approach on two environments: the Gym inverted pendulum and a Space Shuttle landing scenario. We trained PPO policies for both tasks with additional negative rewards to promote constraint respect. We augment the PPO policies with our POLICEd RL to bring constraint violations to zero.

Guaranteeing a soft landing for the Space Shuttle

The Space Shuttle soft landing scenario requires to bring the vehicle to the ground with a vertical velocity larger than -6ft/s. Such a soft landing is notoriously difficult due to the bad gliding performance of the Space Shuttle which famously "flies like a brick". The small vertical speed $\dot h$ also translates to a small flight path angle $\gamma$.

Phase portrait of the Space Shuttle landing with green buffer.
Soft landings correspond to the pink target region.

Altitude of the Space Shuttle.

Velocity of the Space Shuttle.

Flight path angle of the Space Shuttle.

Angle of attack of the Space Shuttle.

POLICEd soft landing of the space shuttle from the full 10,000 ft.

Guaranteed stabilization of the inverted pendulum

The difficulty with the inverted pendulum is enlarging the stability region and guaranteeing the constraint respect for a wide range of initial conditions. Both the baseline PPO and POLICEd version easily stabilize the pendulum when starting near the equilibrium, but they differ on more demanding initial conditions.

Baseline PPO controller starting with a high velocity fails.

POLICEd controller starting from the same state succeeds.

BibTeX

@inproceedings{bouvier2024learning,
        title = {Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems},
        author = {Bouvier, Jean-Baptiste and Nagpal, Kartik and Mehr, Negar},
        booktitle = {Conference on Decision and Control (CDC)},
        year = {2024}
      }

References

[Space Shuttle]: Ali Heydari and S. N. Balakrishnan, Optimal Online Path Planning for Approach and Landing Guidance, AIAA Atmospheric Flight Mechanics Conference, 2011.
[POLICEd RL]: Jean-Baptiste Bouvier, Kartik Nagpal, and Negar Mehr, POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints, Robotics: Science and Systems (RSS), 2024.
[POLICE]: Randall Balestriero and Yann LeCun, POLICE: Provably optimal linear constraint enforcement for deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing, 2023.
[PPO]: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov Proximal Policy Optimization Algorithms, OpenAI, 2017.

POLICEd RL Learning to Provably Satisfy High Relative Degree Constraints for Black-Box Systems