
CRAFT: Coaching Reinforcement Learning
Autonomously using Foundation Models
for Multi-Robot Coordination Tasks

ICON Lab at UC Berkeley
Preprint

*Indicates Equal Contribution

Hardware deployment of the coordination policy learned with CRAFT on Unitree Go2 and Go1 quadrupeds.
In simulation, CRAFT successfully learns coordination behaviors for multi-quadruped navigation and bimanual manipulation tasks, whereas vanilla MAPPO fails to learn successful coordination and exhibits suboptimal behaviors even with the dense reward functions provided by the environment developers.

Abstract

Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to high-dimensional continuous joint action spaces, complex reward design, and the non-stationary transitions inherent to decentralized settings.

On the other hand, humans learn complex coordination through staged curricula, where long-horizon behaviors are progressively built upon simpler skills. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for multi-robot coordination Tasks, a framework that leverages the reasoning capabilities of foundation models to act as a "coach" for multi-robot coordination.

CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). CRAFT then trains each subtask using reward functions generated by an LLM and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, demonstrating its capability to learn complex coordination behaviors. In addition, we validate the multi-quadruped navigation policy in real hardware experiments.

Overview of CRAFT. CRAFT consists of five key stages (a minimal sketch of the full loop follows the list):

  1. Curriculum generation module -- A curriculum LLM decomposes the long-horizon coordination task into a sequence of subtasks described in natural language.
  2. Reward function generation module -- A reward generation LLM generates a reward function in executable Python code, providing dense rewards that clearly specify the desired behavior for each subtask.
  3. Policy evaluation module -- An evaluation VLM evaluates the success or failure of the policy based on the visual and quantitative rollouts of the policy.
  4. Reward refinement module -- If the policy fails to achieve the desired behavior, an advice VLM provides advice on how to change the reward based on the rollout information and learning curve. Then, an LLM takes the advice and refines the reward function.
  5. Sequential training of subtasks -- Throughout training, we initialize each subtask with the policy learned from the previous one, while encouraging exploration to learn the new subtask.
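The sketch below shows how these five stages fit together in one coaching loop. It is a minimal illustration, not the released implementation: all callables (curriculum_llm, reward_llm, evaluation_vlm, advice_vlm, train_marl) and their signatures are hypothetical placeholders for the foundation-model queries and the MARL trainer described above.

def craft_coach(task, env, curriculum_llm, reward_llm,
                evaluation_vlm, advice_vlm, train_marl,
                max_refinements=3):
    """Sketch of the CRAFT coaching loop; all dependencies are injected."""
    # (1) Curriculum LLM: decompose the long-horizon task into subtasks
    #     described in natural language.
    subtasks = curriculum_llm(task)

    policy = None  # (5) the policy is carried over between consecutive subtasks
    for subtask in subtasks:
        # (2) Reward LLM: write an executable dense reward for this subtask.
        reward_fn = reward_llm(subtask, advice=None)

        for _ in range(max_refinements):
            # Train the multi-agent policy (e.g., MAPPO), warm-started from
            # the policy learned on the previous subtask.
            policy, rollouts, curves = train_marl(env, reward_fn,
                                                  init_policy=policy)

            # (3) Evaluation VLM: judge success from visual and quantitative
            #     rollouts of the trained policy.
            if evaluation_vlm(subtask, rollouts):
                break

            # (4) Advice VLM suggests how to change the reward; the reward
            #     LLM then rewrites the reward function accordingly.
            advice = advice_vlm(subtask, rollouts, curves)
            reward_fn = reward_llm(subtask, advice=advice)

    return policy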
Curriculum refinement example

Example of curriculum refinement for the task "lift and balance the pot" - Three candidate curricula $\mathcal{C}^1$ to $\mathcal{C}^3$, generated by the curriculum LLM, are provided back to the LLM for refinement. In $\mathcal{C}^1$, Task 1 focuses only on minimizing distance, while Task 1 in $\mathcal{C}^3$ is defined as minimizing distance and matching orientation. In contrast, Task 3 and Task 4 in $\mathcal{C}^1$ break the lifting down into two stages, first lifting halfway and then to full height, whereas $\mathcal{C}^3$ represents lifting as a single task. The curriculum LLM merges these candidates into a final curriculum $\mathcal{C}$ by selecting the stronger task definitions from each candidate.
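To make the merge concrete, the sketch below writes the candidate curricula as plain lists of subtask descriptions. The exact wording of each subtask, and the choice to keep the staged lifting from $\mathcal{C}^1$ in the final curriculum, are illustrative assumptions (the latter is consistent with the Coordinate Preliminary Lift subtask in the next section).

# Illustrative only: candidate curricula as lists of natural-language subtasks.
# The wording paraphrases the example above; subtasks not mentioned in the
# text are omitted.
curriculum_1 = [
    "Task 1: minimize the distance to the pot",
    "Task 3: lift the pot halfway",
    "Task 4: lift the pot to the full height",
]
curriculum_3 = [
    "Task 1: minimize the distance to the pot and match the orientation",
    "Task 3: lift the pot to the full height in a single stage",
]

# The curriculum LLM merges the stronger definitions from each candidate:
final_curriculum = [
    curriculum_3[0],  # richer reaching definition (distance + orientation)
    curriculum_1[1],  # staged lifting kept (assumption, see lead-in above)
    curriculum_1[2],
]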

VLM-guided reward-refinement loop

Example of reward refinement for the subtask Coordinate Preliminary Lift - Through the first reward-refinement loop, $R^1_{k=3}$ was produced, and the evaluation VLM marked the policy as a failure since the pot never reached the required elevation of 0.05 m. The reward-component learning curves were then passed to the advice VLM, which identified that lift_reward was too weak compared to balance_reward. It recommended removing the square on elevation, increasing the lift weight, and decreasing the balance weight. The revised reward $R^2_{k=3}$ reflects these changes: the square on elevation was removed, the lift weight increased from 80 to 200, and the balance weight decreased from 2 to 1. With this reward, the policy successfully reached the 0.05 m elevation and satisfied the subtask.
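As a concrete illustration, the sketch below contrasts reward functions in the spirit of $R^1_{k=3}$ and $R^2_{k=3}$. Only the quoted changes (removing the square on elevation, lift weight 80 to 200, balance weight 2 to 1, target elevation 0.05 m) come from the example above; the variable names (pot_elevation, pot_tilt) and the form of the balance term are assumptions.

import numpy as np

TARGET_ELEVATION = 0.05  # required pot elevation [m] for this subtask

def reward_v1(pot_elevation, pot_tilt):
    # In the spirit of R^1_{k=3}: the squared elevation term with weight 80 is
    # tiny for elevations below 0.05 m, so lift_reward is dominated by
    # balance_reward and the policy never lifts the pot high enough.
    lift_reward = 80.0 * pot_elevation ** 2
    balance_reward = 2.0 * np.exp(-abs(pot_tilt))
    return lift_reward + balance_reward

def reward_v2(pot_elevation, pot_tilt):
    # In the spirit of R^2_{k=3}: square on elevation removed, lift weight
    # increased from 80 to 200, balance weight decreased from 2 to 1.
    lift_reward = 200.0 * pot_elevation
    balance_reward = 1.0 * np.exp(-abs(pot_tilt))
    return lift_reward + balance_reward

Under these assumed forms, reaching the 0.05 m target contributes about 10 from the lift term versus at most 1 from the balance term, so the lift signal now dominates the balance signal rather than being dominated by it.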

CRAFT sequentially learns subtasks that enable complex coordination

CRAFT can learn collaborative multi-robot tasks that require complex, long-horizon coordination by learning the sequence of subtasks needed to accomplish the overall task. We validate CRAFT on bimanual manipulation and multi-quadruped navigation tasks, demonstrating its capability to learn complex coordination behaviors.

BibTeX

@article{choi2025craft,
  title={CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks},
  author={Choi, Seoyeon and Ryu, Kanghyun and Ock, Jonghoon and Mehr, Negar},
  journal={arXiv preprint arXiv:2509.14380},
  year={2025}
}

References

[CurricuLLM]
Kanghyun Ryu, Qiayuan Liao, Zhongyu Li, Payam Delgosha, Koushil Sreenath, Negar Mehr, CurricuLLM: Automatic Task Curricula Design for Learning Complex Robot Skills Using Large Language Models, International Conference on Robotics and Automation (ICRA), 2025.
[Eureka]
Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi “Jim” Fan, Anima Anandkumar, Eureka: Human-level reward design via coding large language models, International Conference on Learning Representations (ICLR), 2024.
[MAPPO]
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, Yi Wu, The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games, Advances in Neural Information Processing Systems (NeurIPS), 2022.