Provably Enforcing Hard Constraints During Training of Vision-Based Policies in Reinforcement Learning

CartPole system trained to stabilize the pole upright from vision inputs.

Abstract

In this paper, we seek to learn a vision-based policy that is guaranteed to satisfy state constraints during and after training. To obtain hard safety guarantees in closed loop with a black-box environment, we build upon the POLICEd RL approach.

Most safe RL works rely on reward shaping to discourage violations of a safety constraint. However, such soft constraints do not guarantee safety. Previous works trying to enforce hard constraints in RL typically suffer from one of two limitations: either they need an accurate model of the environment, or their learned safety certificate only approximates an actual safety certificate without providing guarantees.

On the other hand, our POLICEd RL approach can provably enforce hard constraint satisfaction in closed loop with a black-box environment. We build a repulsive buffer region in front of the constraint to prevent trajectories from approaching it. Since trajectories cannot cross this buffer, they also cannot violate the constraint.
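Written out, the safety argument behind this buffer is short. As a simplified sketch (the buffer width $r > 0$ and the slab notation below are illustrative; the exact construction follows [POLICEd-RL]), take

$$ \mathcal{B} = \{ s : d - r \leq C s \leq d \}, \qquad C \dot{s}(t) \leq 0 \quad \text{whenever } s(t) \in \mathcal{B}. $$

Any trajectory starting in the safe set with $C s(0) \leq d - r$ must pass through $\mathcal{B}$ before it can reach the constraint boundary $C s = d$. Inside $\mathcal{B}$, the repulsion condition makes $C s(t)$ non-increasing, so the trajectory can never climb from $d - r$ up to $d$: the constraint is never violated.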

POLICEd RL illustration

Schematic illustration of POLICEd RL. To prevent state $s$ from violating an affine constraint represented by $Cs \leq d$, our POLICEd policy (arrows in the environment) enforces $C\dot s \leq 0$ in buffer region $\mathcal{B}$ (blue) directly below the unsafe area (red). We use the POLICE algorithm to make our policy affine inside buffer region $\mathcal{B}$, which allows us to easily verify whether trajectories can violate the constraint.
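To make the verification step concrete, here is a minimal sketch of the buffer check, assuming a polytopic buffer $\mathcal{B}$ with known vertices and some estimator of the state derivative $\dot{s}$ (e.g. finite differences of rollouts through the black-box environment). The function and argument names are illustrative, not taken from the released code.

import numpy as np

def satisfies_repulsion(vertices, s_dot_estimator, C, tol=0.0):
    """Check the repulsion condition C * s_dot <= tol at every vertex of B.

    Because the POLICEd policy is affine inside B and the closed-loop
    dynamics are treated as (approximately) affine there, C * s_dot is
    affine in s, so checking the vertices of the polytopic buffer is
    enough to conclude for the whole region.

    vertices        : (num_vertices, state_dim) array of buffer vertices
    s_dot_estimator : callable mapping a state to an estimate of s_dot
    C               : (state_dim,) constraint direction from C s <= d
    tol             : margin on the repulsion condition
    """
    for vertex in vertices:
        s_dot = s_dot_estimator(vertex)
        if float(C @ s_dot) > tol:
            return False  # repulsion condition violated at this vertex
    return True

In practice one would pick tol strictly negative to leave a margin for the estimation error in $\dot{s}$.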

BibTeX

@inproceedings{khari2025enforcing,
  title     = {Provably Enforcing Hard Constraints During Training of Vision-Based Policies in Reinforcement Learning},
  author    = {Khari, Shashwat and Bouvier, Jean-Baptiste and Mehr, Negar},
  booktitle = {},
  year      = {2025}
}

References

[POLICE]
Randall Balestriero and Yann LeCun, POLICE: Provably optimal linear constraint enforcement for deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[POLICEd-RL]
Jean-Baptiste Bouvier, Kartik Nagpal, and Negar Mehr, POLICEd RL: Learning closed-loop robot control policies with provable satisfaction of hard constraints, Robotics: Science and Systems (RSS), 2024.
[TD3]
Scott Fujimoto, Herke van Hoof, and David Meger, Addressing function approximation error in actor-critic methods, International Conference on Machine Learning (ICML), 2018.