Surprise Minimizing Multi-Agent Learning with Energy-based Models

Karush Suri, Xiao Qi Shi, Konstantinos Plataniotis, Yuri Lawryshyn

$36^{\text{th}}$ Conference on Neural Information Processing Systems
(NeurIPS 2022)

Preprint  GitHub  Download .zip  Download .tar.gz

Multi-Agent Reinforcement Learning (MARL) has demonstrated significant success by virtue of collaboration across agents. Recent work, on the other hand, introduces surprise, which quantifies the degree of change in an agent's environment. Surprise-based learning has received significant attention in single-agent entropic settings but remains an open problem for the fast-paced dynamics of multi-agent scenarios. A potential alternative for addressing surprise may be realized through the lens of free-energy minimization. This project explores surprise minimization in multi-agent learning by utilizing the free energy across all agents in a multi-agent system. A temporal Energy-Based Model (EBM) represents an estimate of surprise, which is minimized over the joint agent distribution. Such a scheme results in convergence towards minimum-energy configurations akin to least surprise. Additionally, multi-agent populations acquire robust policies.
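
The idea above can be made concrete with a small sketch, assuming a simple feed-forward temporal EBM over joint observation transitions: the model assigns a scalar energy to each transition of the joint agent population, and the mean energy is minimized as the surprise estimate. All names (TemporalEBM, obs_dim, n_agents) and architectural choices below are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class TemporalEBM(nn.Module):
    # Scores a joint observation transition (o_t, o_{t+1}) with a scalar energy;
    # lower energy is read as lower surprise for the joint agent population.
    def __init__(self, obs_dim, n_agents, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim * n_agents, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_t, obs_tp1):
        # obs_t, obs_tp1: (batch, n_agents, obs_dim)
        x = torch.cat([obs_t.flatten(1), obs_tp1.flatten(1)], dim=-1)
        return self.net(x).squeeze(-1)  # (batch,) energies

# Minimizing the average energy over sampled joint transitions acts as the
# surprise-minimization signal over the joint agent distribution.
ebm = TemporalEBM(obs_dim=16, n_agents=3)
obs_t, obs_tp1 = torch.randn(32, 3, 16), torch.randn(32, 3, 16)
surprise = ebm(obs_t, obs_tp1)   # per-transition energy estimates
loss = surprise.mean()
loss.backward()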

Below is an intuitive illustration of the objective. The joint agent population aims to minimize surprise, which corresponds to minimum-energy configurations. Agents collaborate in partially observed worlds to attain a joint niche. This local niche implicitly corresponds to a fixed point on the energy landscape. Note that agents act locally, with actions conditioned on their own action-observation histories. It is by virtue of preconditioned value estimates that the surprise-minimization scheme informs agents of joint surprise; a minimal sketch of this preconditioning follows. Upon the population's convergence to a suitable configuration, agents continue to experience minimal (yet finite) surprise arising from environment dynamics.
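
As a purely illustrative reading of this value preconditioning, the sketch below augments a TD-style target with the joint energy produced by the temporal EBM above, so that agents acting on local histories still receive a learning signal that reflects joint surprise. The function name, the additive penalty, and the coefficient beta are assumptions rather than the paper's exact update.

import torch

def surprise_preconditioned_target(rewards, next_values, energies,
                                   gamma=0.99, beta=0.1):
    # rewards, next_values, energies: tensors of shape (batch,).
    # beta scales the joint-energy (surprise) penalty; the additive form of
    # the penalty is an assumption made for illustration.
    return rewards - beta * energies + gamma * next_values

# Example: per-transition energies come from the temporal EBM over joint observations.
rewards = torch.tensor([1.0, 0.0, 0.5])
next_values = torch.tensor([2.0, 1.5, 0.8])
energies = torch.tensor([0.3, 1.2, 0.1])
targets = surprise_preconditioned_target(rewards, next_values, energies)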

  

Multi-agent populations acquire surprise-robust behaviors at test time. Agents collaborate with each other to accumulate rewards and acquire policies that are robust to abrupt changes. In scenarios with a large number of agents and challenging opponents, actors prepare beforehand to avoid catastrophic damage. A notable example is the so many baneling task (bottom), wherein agents sacrifice themselves by engaging the baneling enemies to succeed at the task.

@inproceedings{
karush17,
title={Surprise Minimizing Multi-Agent Learning with Energy-based Models},
author={Karush Suri and Xiao Qi Shi and Konstantinos Plataniotis and Yuri Lawryshyn},
year={2022},
booktitle={Neural Information Processing Systems 2022}
}