POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints

Jean-Baptiste Bouvier, Kartik Nagpal, Negar Mehr
ICON Lab at UC Berkeley
Robotics: Science and Systems (RSS) 2024

KUKA robotic arm tasked with reaching a green target while avoiding a red constraint area.
Relying only on reward shaping or on soft-constraint methods like Constrained Policy Optimization (CPO) provides no safety guarantees.
On the other hand, our POLICEd approach guarantees constraint satisfaction.

Abstract

In this paper, we seek to learn a robot policy guaranteed to satisfy state constraints. To encourage constraint satisfaction, existing RL algorithms typically rely on Constrained Markov Decision Processes and discourage constraint violations through reward shaping. However, such soft constraints cannot offer safety guarantees. To address this gap, we propose POLICEd RL, a novel RL algorithm explicitly designed to enforce affine hard constraints in closed-loop with a black-box environment. Our key insight is to make the learned policy affine around the unsafe set and to use this affine region as a repulsive buffer to prevent trajectories from violating the constraint. We prove that such policies exist and guarantee constraint satisfaction. Our proposed framework is applicable to systems with both continuous and discrete state and action spaces and is agnostic to the choice of the RL training algorithm. Our results demonstrate the capacity of POLICEd RL to enforce hard constraints in robotic tasks while significantly outperforming existing methods.

Most safe RL works rely on reward shaping to discourage violations of a safety constraint. However, such soft constraints do not guarantee safety. Previous works attempting to enforce hard constraints in RL typically suffer from two limitations: either they need an accurate model of the environment, or their learned safety certificates only approximate an actual safety certificate, without guarantees.

On the other hand, our POLICEd RL approach can provably enforce hard constraint satisfaction in closed-loop with a black-box environment. We build a repulsive buffer region in front of the constraint to prevent trajectories from approaching it. Since trajectories cannot cross this buffer, they also cannot violate the constraint.

POLICEd RL illustration

Schematic illustration of POLICEd RL. To prevent state $s$ from violating an affine constraint represented by $Cs \leq d$, our POLICEd policy (arrows in the environment) enforces $C\dot s \leq 0$ in buffer region $\mathcal{B}$ (blue) directly below the unsafe area (red). We use the POLICE algorithm to make our policy affine inside buffer region $\mathcal{B}$, which allows us to easily verify whether trajectories can violate the constraint.
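
In symbols, the argument of the caption can be summarized by a single implication. For illustration only, take the buffer to be a slab of width $r$ below the constraint (the symbol $r$ is introduced here and is not defined on this page):

$$\mathcal{B} = \{\, s : d - r \leq Cs \leq d \,\}, \quad C\dot s(t) \leq 0 \ \text{whenever } s(t) \in \mathcal{B} \quad \Longrightarrow \quad Cs(0) \leq d \ \Rightarrow\ Cs(t) \leq d \ \text{for all } t \geq 0.$$

Since $t \mapsto Cs(t)$ is continuous, it can only reach $d$ after passing through $\mathcal{B}$, where it is non-increasing, so the constraint $Cs \leq d$ is never violated.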

We implement POLICEd RL on the KUKA arm and train it to reach a target while avoiding a constraint area, as illustrated above. We use the classic RL algorithm Twin Delayed DDPG (TD3) with POLICEd layers. We compare this POLICEd implementation against a TD3 baseline, a Constrained Policy Optimization (CPO) soft-constraint algorithm, and a learned PPO-Barrier safety certificate.
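
To make the POLICEd layers more concrete, below is a minimal, hypothetical PyTorch sketch of an actor made affine over a convex buffer region, in the spirit of the POLICE algorithm [POLICE]. The class and argument names (POLICEdActor, dims, buffer_vertices) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class POLICEdActor(nn.Module):
    # Hypothetical sketch: a ReLU MLP whose biases are shifted so that every unit
    # keeps a single activation sign at all vertices of a convex buffer region B.
    # A single activation pattern over B (the convex hull of buffer_vertices)
    # makes the network affine on B.
    def __init__(self, dims, buffer_vertices):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        # Vertices of the buffer region, shape (n_vertices, state_dim).
        self.register_buffer("vertices", buffer_vertices)

    def forward(self, state):
        h, v = state, self.vertices
        for layer in self.layers[:-1]:
            pre_h, pre_v = layer(h), layer(v)
            # Choose one sign per unit and shift the bias so that all buffer
            # vertices agree with it; by convexity the same activation pattern
            # then holds on the entire region B.
            sign = torch.sign(pre_v.mean(dim=0))
            sign[sign == 0] = 1.0
            shift = sign * torch.relu(-sign * pre_v).amax(dim=0)
            h, v = torch.relu(pre_h + shift), torch.relu(pre_v + shift)
        return self.layers[-1](h)  # last layer is linear, hence affine everywhere

Because the bias shifts depend only on the fixed buffer vertices, such an actor can be dropped into a standard TD3 training loop unchanged, which is what makes the framework agnostic to the choice of the RL training algorithm.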

Metrics comparison between methods over a 500-episode deployment of the fully-trained policies on the safe arm task. The task completion metric only assesses whether the target is eventually reached, even if the constraint is violated. The most significant metric is the average percentage of constraint satisfaction, which shows that only POLICEd RL guarantees constraint satisfaction. For all metrics, higher is better (↑).

Video Presentation given at RSS 2024

Poster

BibTeX

@inproceedings{bouvier2024policed,
        title = {\href{https://arxiv.org/pdf/2403.13297}{POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints}},
        author = {Bouvier, Jean-Baptiste and Nagpal, Kartik and Mehr, Negar},
        booktitle = {Robotics: Science and Systems (RSS)},
        year = {2024}
      }

References

[CPO]
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel, Constrained Policy Optimization, 34th International Conference on Machine Learning (ICML), 2017.
[POLICE]
Randall Balestriero and Yann LeCun, POLICE: Provably optimal linear constraint enforcement for deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[TD3]
Scott Fujimoto, Herke van Hoof, and David Meger, Addressing function approximation error in actor-critic methods, International Conference on Machine Learning (ICML), 2018.
[PPO-Barrier]
Yujie Yang, Yuxuan Jiang, Yichen Liu, Jianyu Chen, and Shengbo Eben Li, Model-free safe reinforcement learning through neural barrier certificate, IEEE Robotics and Automation Letters, 2023.