Recent efforts in autonomous vehicle coordination and in-space assembly have shown the importance of enabling multiple robots to collaborate toward a shared goal. A common approach for learning this cooperative behavior is the centralized-training decentralized-execution paradigm. However, this approach also introduces a new challenge: how do we evaluate the contribution of each agent's actions to the overall success or failure? This "credit assignment" problem has been extensively studied in the Multi-Agent Reinforcement Learning (MARL) literature, but with little progress. In fact, humans performing a simple inspection of the agents' trajectories often produce better credit evaluations than existing methods. We combine this observation with recent works showing that Large Language Models (LLMs) demonstrate human-level performance at many pattern recognition tasks. Our key idea is to reformulate credit assignment as a pattern recognition problem, which leads to our novel Large Language Model Multi-agent Credit Assignment (LLM-MCA) method. Our approach uses a centralized LLM reward-critic that numerically decomposes the overall reward according to each agent's individual contribution to the scenario. We then update the agents' policy networks based on this feedback. We also propose an extension (LLM-TACA) in which our LLM-critic performs explicit task assignment by passing an intermediary goal directly to each agent in the scenario. Both of our methods far outperform the state of the art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new "Spaceworld" benchmark, which incorporates safety-related constraints. As an artifact of our methods, we generate large trajectory datasets in which every timestep is annotated with per-agent reward information sampled from our LLM critics. We hope that by making these datasets available, we will enable future work to directly train sets of collaborative, decentralized policies offline.
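To make the reward-decomposition idea concrete, here is a minimal sketch (not the authors' released code) of querying a centralized LLM critic for per-agent credits. The prompt wording, the `llm` callable, and the JSON reply format are illustrative assumptions.

```python
# Hedged sketch: credit assignment posed as a pattern-recognition query to an LLM critic.
# The `llm` callable, prompt wording, and JSON reply format are assumptions, not the paper's API.
import json
from typing import Callable, Dict, List


def llm_credit_assignment(
    llm: Callable[[str], str],              # any text-in/text-out LLM interface
    base_prompt: str,                       # environment description, definitions, task query
    observations: Dict[str, List[float]],   # latest per-agent observations
    global_reward: float,
) -> Dict[str, float]:
    """Ask the LLM critic to decompose the shared reward into per-agent credits."""
    query = (
        f"{base_prompt}\n"
        f"Per-agent observations this timestep: {observations}\n"
        f"Global team reward: {global_reward}\n"
        "Return a JSON object mapping each agent id to its numeric share of the reward."
    )
    credits = json.loads(llm(query))
    # Renormalize so the individual credits still sum to the global reward.
    total = sum(credits.values()) or 1.0
    return {agent: global_reward * share / total for agent, share in credits.items()}
```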
Our centralized-training architecture relies on a centralized LLM-critic instantiated with our base prompt (environment description, our definitions, and task query). At each timestep, we update the LLM-critic with the global reward and the latest observations from the environment, and we then update the agents' policies with the individualized feedback from the critic.
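The loop described above can be sketched as follows, assuming a Gym-style multi-agent environment (reset/step with dicts keyed by agent id), per-agent policies exposing `act` and `update`, and the hypothetical `llm_credit_assignment` helper from the previous snippet; all interface names are assumptions for illustration.

```python
# Hedged sketch of the centralized-training / decentralized-execution loop with an LLM critic.
def train_with_llm_critic(env, policies, llm, base_prompt, episodes=100):
    for _ in range(episodes):
        observations = env.reset()
        done = False
        while not done:
            # Decentralized execution: each agent acts on its own observation.
            actions = {agent: policy.act(observations[agent])
                       for agent, policy in policies.items()}
            next_observations, global_reward, done, _ = env.step(actions)
            # Centralized LLM critic decomposes the shared reward into per-agent credits.
            credits = llm_credit_assignment(llm, base_prompt, next_observations, global_reward)
            # Each policy network is updated only with its own individualized feedback.
            for agent, policy in policies.items():
                policy.update(observations[agent], actions[agent], credits[agent])
            observations = next_observations
```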
@inproceedings{nagpal2024llmca,
  title     = {Leveraging Large Language Models for Effective and Explainable Multi-Agent Credit Assignment},
  author    = {Nagpal, Kartik and Dong, Dayi and Bouvier, Jean-Baptiste and Mehr, Negar},
  booktitle = {24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
  year      = {2025}
}