Provably Enforcing Hard Constraints During Training of Vision-Based Policies in Reinforcement Learning

CartPole system with a modified buffer stabilizing the pole upright from vision inputs, compared to an unmodified buffer.

Abstract

In this paper, we seek to learn a vision-based policy that is guaranteed to satisfy state constraints both during and after training. To obtain hard safety guarantees in closed loop with a black-box environment, we build upon the POLICEd RL approach [POLICEd-RL].

We extend the POLICEd RL approach so that the policy maintains its safety guarantees with image inputs instead of state inputs by enlarging the affine region to account for the error in estimating the state from images. Doing so can at times produce a large affine region, which can limit the generalisability of the network.
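As a minimal sketch of this enlargement (assuming a box-shaped affine region and a known per-dimension bound on the perception error; the function expand_buffer and all numbers below are illustrative, not our implementation):

    import numpy as np

    def expand_buffer(lo, hi, eps):
        # Enlarge an axis-aligned affine (buffer) region by the state-estimation
        # error bound eps, so that any true state whose vision-based estimate
        # falls in the original region is still covered by the enlarged one.
        #   lo, hi : (n,) lower/upper corners of the original region
        #   eps    : scalar or (n,) bound on |estimated state - true state|
        return lo - eps, hi + eps

    # Hypothetical CartPole buffer over (cart position, pole angle)
    lo, hi = np.array([-0.5, 0.10]), np.array([0.5, 0.20])
    eps = np.array([0.02, 0.01])  # assumed perception error bound
    lo_img, hi_img = expand_buffer(lo, hi, eps)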

To solve this, we use switched actors, which allow us to define multiple affine regions. We can thus break the large affine region into multiple smaller ones. At the same time, using projected gradient descent alongside switched actors allows us to guarantee hard constraints even during the training process, as sketched below.
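A schematic PyTorch-style sketch of one such projected training step (project_onto_safe_set is a hypothetical placeholder standing in for the problem-specific projection of the actor's affine maps onto the constraint-satisfying set; it is illustrative, not our implementation):

    import torch

    def projected_training_step(actor, optimizer, loss, project_onto_safe_set):
        # Standard gradient step on the actor loss...
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # ...followed by a projection of the parameters back onto the set
        # whose affine map on every region satisfies the hard constraint,
        # so the policy is safe after every update, not only at convergence.
        with torch.no_grad():
            project_onto_safe_set(actor)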


Schematic illustration of training a switched actor with projected gradient descent, ensuring constraint satisfaction throughout training.



We also extend the framework to non-affine constraints by augmenting the state space with the non-affine constraint, which allows us to transform it into an affine constraint.
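As a worked example of this augmentation (matching the circular constraint in the figure below; the radius r is illustrative):

    % Circular constraint, non-affine in the position (x, y):
    x^2 + y^2 \le r^2
    % Augment the state with z := x^2 + y^2, propagating
    % \dot{z} = 2 x \dot{x} + 2 y \dot{y} alongside the dynamics.
    % The constraint is now affine in the augmented state:
    z \le r^2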


Schematic illustration of switched actors with a non-affine circular constraint.

BibTeX

@inproceedings{khari2025enforcing,
  title     = {Provably Enforcing Hard Constraints During Training of Vision-Based Policies in Reinforcement Learning},
  author    = {Khari, Shashwat and Bouvier, Jean-Baptiste and Mehr, Negar},
  booktitle = {},
  year      = {2025}
}

References

[POLICE]
Randall Balestriero and Yann LeCun, POLICE: Provably optimal linear constraint enforcement for deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[POLICEd-RL]
Jean-Baptiste Bouvier, Kartik Nagpal, and Negar Mehr, POLICEd RL: Learning closed-loop robot control policies with provable satisfaction of hard constraints, Robotics: Science and Systems (RSS), 2024.
[TD3]
Scott Fujimoto, Herke van Hoof, and David Meger, Addressing function approximation error in actor-critic methods, International Conference on Machine Learning (ICML), 2018.