Multi-Agent Reinforcement Learning (MARL)

Multi-Agent Reinforcement Learning (MARL) is a subfield of reinforcement learning that extends the idea of reward maximization to settings with multiple agents.
Agents interact with the environment and with one another in collaborative or competitive scenarios, learning to maximize rewards, individually or collectively, to accomplish a task. This OpenAI video is an excellent starting point for understanding the possibilities and behaviors in multi-agent settings.
A commonly cited definition reads: Multi-agent reinforcement learning (MARL) is a sub-field of reinforcement learning. It focuses on studying the behavior of multiple learning agents that coexist in a shared environment. Each agent is motivated by its own rewards and takes actions to advance its own interests; in some environments these interests are opposed to the interests of other agents, resulting in complex group dynamics.
Single Agent RL Recap

Agent: An agent is an entity that interacts with an environment by taking actions based on its observations or state, with the goal of maximizing a reward.
State and Environment: An environment is the external system or world in which an agent operates. It provides the agent with states, receives the agent’s actions, and responds with new states and rewards. A state is the current situation or representation of the environment that an agent can observe.
Markov Decision Processes (MDPs): Reinforcement learning is usually formulated as a Markov decision process, denoted as a tuple <S, A, P, r, γ>, with S and A denoting the state and action spaces. P( s’ | s, a ) is the probability of transitioning from s to s’ under action a, r(s, a) is the reward received for that transition, and γ ∈ [0, 1) is the discount factor.
The agent’s actions are guided by its policy π: given a state, a policy outputs an action or a probability distribution over actions. Our goal is to find an optimal policy π* that maximizes the expected cumulative (discounted) reward.
How do we solve Single Agent MDPs?
The aim of solving an MDP is to maximize cumulative reward over time. Most approaches to RL fall into two categories.
Value-based methods

The agent learns a value function that estimates how valuable each state (or state-action pair) is, and uses this value function to take the action that leads to the highest-value outcome. Popular value-based methods include Q-learning, SARSA, and TD learning.
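As a concrete illustration, here is a minimal sketch of the tabular Q-learning update; the state/action sizes and hyperparameters are placeholders, not a recommendation:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

# Example: 5 states, 2 actions, a single transition (values are placeholders).
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
print(Q[0])
```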
Policy-based methods

The agent directly learns a policy that maps states to actions so as to maximize reward over time. Common policy-based algorithms include policy gradient and actor-critic methods.
What’s different with Multiple agents?
When we extend single-agent RL to multiple agents, we must account for the actions of the other agents in the environment, so the system can no longer be modeled as a single-agent MDP. Instead, we define a Markov Game, where multiple agents interact simultaneously and each agent influences the state transitions and the rewards of the others.
A Markov Game is defined by a tuple (N, S, A, P, R, γ), where:
- N is the number of agents
- S is the state space
- A = A₁ × A₂ × … × Aₙ is the joint action space
- P is the transition probability function
- R = (R₁, R₂, …, Rₙ) is the set of reward functions for each agent
- γ is the discount factor
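To make the tuple concrete, here is a minimal sketch of a two-agent, single-state Markov game (a repeated matrix game) whose step function maps a joint action to a per-agent reward vector; the payoff values are illustrative placeholders, not from any benchmark:

```python
import numpy as np

class TwoAgentMatrixGame:
    """A Markov game with one state: joint action -> per-agent rewards."""
    def __init__(self):
        # Illustrative payoff tables: rewards[i][a1, a2] is agent i's reward.
        self.rewards = [
            np.array([[3.0, 0.0], [5.0, 1.0]]),  # agent 1
            np.array([[3.0, 5.0], [0.0, 1.0]]),  # agent 2
        ]

    def step(self, joint_action):
        a1, a2 = joint_action
        next_state = 0                      # single-state game
        r = [self.rewards[0][a1, a2], self.rewards[1][a1, a2]]
        return next_state, r

env = TwoAgentMatrixGame()
print(env.step((0, 1)))  # agents take joint action (0, 1)
```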
Categories of MARL
Cooperative: Cooperative MARL focuses on agents learning to work together to achieve a common goal while maximizing a shared reward. Each agent’s actions contribute to the group’s overall success, and the reward structure is typically designed to reinforce collective performance. This type of MARL is particularly useful in scenarios such as multi-robot systems.
Competitive: Agents are engaged in adversarial scenarios where each agent tries to maximize its own reward while minimizing its opponents’ rewards, often formalized as zero-sum games. Examples include games like chess and Go and other adversarial settings.
Mixed-Interest: In mixed-interest MARL, agents engage in cooperative-competitive dynamics where they have partially aligned and partially conflicting goals. This usually shows up in trading, traffic control, and multi-player video games.
Challenges

Non-Stationarity
In a Multi-Agent scenario, each agent’s environment becomes dynamic due to the presence of other agents. As agents continuously update their policies, the environment’s dynamics change from the perspective of any individual agent.
From any individual agent’s viewpoint, the transition probabilities and reward functions can therefore no longer be assumed to be stationary, which breaks the assumptions underlying single-agent methods. Each agent’s optimal strategy may shift as other agents adjust their behaviors, leading to instability in learning.
Partial Observability
In most Multi-Agent settings, agents do not have complete access to the environment’s state or the actions of other agents. Each agent may only observe part of the environment, which introduces uncertainty in decision-making. The problem then becomes a Partially Observable Markov Decision Process (POMDP), where agents need to infer the hidden information and act based on incomplete data. This adds complexity to policy learning since agents must learn to handle this uncertainty while predicting the behavior of others.
Scalability and Joint Action Space
As the number of agents increases, the joint action space grows exponentially. For n agents with action sets A₁, A₂, …, Aₙ, the joint action space becomes A₁ × A₂ × … × Aₙ. This growth in the state-action space leads to increased computational complexity and makes traditional RL techniques less efficient. Finding optimal policies becomes intractable as the number of agents rises, requiring more scalable learning algorithms.
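The exponential growth is easy to see numerically; a quick sketch with arbitrary agent and action counts:

```python
# Size of the joint action space |A1 x A2 x ... x An| for n agents
# that each have k actions grows as k**n.
k = 5  # actions per agent (arbitrary)
for n_agents in (2, 4, 8, 16):
    print(f"{n_agents} agents -> joint action space size {k ** n_agents}")
```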
Credit Assignment
The credit assignment problem in multi-agent scenarios involves determining the contribution of each agent’s action to the overall team goal. This problem becomes particularly complex in cooperative settings where agents must work together to maximize a shared reward.
Traditional approaches often fail to provide clear insights into individual agent contributions, as they typically decompose the shared reward into individual utilities without considering whether the agent’s actions were globally optimal.
Decision Making in Multi-Agent Scenarios
Most real-life scenarios involve multiple agents interacting in a mixed cooperative/competitive setting. In a robotics context, such problems cannot be solved easily with single-agent RL methods.
MARL provides methods for multi-robot scenarios in which each robot learns to maximize its own reward while still contributing to the global reward. MARL algorithms let agents make decisions and choose actions that are more efficient and effective across a range of robotic applications.
Learning Paradigms in MARL
A very popular approach is Centralized Training with Decentralized Execution (CTDE), where agents have access to global information during training but act based only on local observations during execution.
In fully decentralized learning, by contrast, each agent obtains no information from other agents during either training or execution, and independently updates its own policy with the aim of maximizing the sum of all agents’ rewards.
This introduces significant challenges, particularly the non-stationarity of the environment from each agent’s perspective, as other agents are also learning and changing their behaviors.
VDN
Value Decomposition Networks (VDN) is a CTDE method for cooperative multi-agent scenarios. It computes a global (total) Q-value or value function from each agent’s individual Q-value or value estimate.
To understand VDN, we first need to look at DRQN, or Deep Recurrent Q-Networks.
Deep Recurrent Q Networks:
When dealing with a Partially Observable Markov Decision Process (POMDP), we may not have full access to the state information, and therefore, we need to record the history or trajectory information to assist in choosing the action. Recurrent Neural Networks (RNNs) are introduced to Deep Q Learning to handle POMDPs by encoding the history into the hidden state of the RNN.
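A minimal DRQN sketch in PyTorch, assuming a flat observation vector and discrete actions; layer sizes and names are arbitrary:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Deep Recurrent Q-Network: encodes the observation history in a GRU hidden state."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        x = torch.relu(self.encoder(obs))
        hidden = self.rnn(x, hidden)          # carry history across timesteps
        q_values = self.q_head(hidden)
        return q_values, hidden

# One agent, batch of 1: the hidden state is threaded through time.
net = DRQN(obs_dim=10, n_actions=4)
h = torch.zeros(1, 64)
obs = torch.randn(1, 10)
q, h = net(obs, h)
print(q.shape)  # torch.Size([1, 4])
```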

In VDN, each agent follows the same DRQN sampling pipeline as in other deep Q-learning methods. However, before entering the training loop, each agent shares its Q value and target Q value with other agents. During the training loop, the Q value and target Q value of the current agent and other agents are summed to obtain the Q-tot value.

VDN assumes that the joint Q-function can be decomposed additively into individual agent Q-functions. This simplification allows for decentralized execution and allows each agent to learn and optimize its own policy independently while still contributing to the team's reward.
Issues: However, simply summing Q-values across agents restricts the class of joint value functions that can be represented; it can reduce policy diversity and cause learning to get stuck in a local optimum, particularly when the Q-network is shared across agents.
Because Q_tot is just this sum, VDN forces each agent to greedily pick the action that maximizes its own Q-value, which satisfies:

argmax_a Q_tot(o, a) = ( argmax_{a₁} Q₁(o₁, a₁), …, argmax_{aₙ} Qₙ(oₙ, aₙ) )
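A minimal sketch of the VDN mixing and TD loss, assuming each agent's network has already produced chosen-action Q-values; shapes, names, and the discount value are illustrative:

```python
import torch

def vdn_q_tot(per_agent_q):
    """VDN: the joint Q-value is just the sum of individual Q-values.

    per_agent_q: tensor of shape (batch, n_agents) holding Q_i(o_i, a_i)
    for each agent's chosen action.
    """
    return per_agent_q.sum(dim=1)             # (batch,)

def vdn_td_loss(q_taken, target_q_next_max, reward, gamma=0.99):
    """TD loss on Q_tot, analogous to single-agent Q-learning."""
    q_tot = vdn_q_tot(q_taken)
    target_tot = reward + gamma * vdn_q_tot(target_q_next_max)
    return ((q_tot - target_tot.detach()) ** 2).mean()

q_taken = torch.randn(32, 3)                   # 3 agents, batch of 32 (dummy values)
target_q = torch.randn(32, 3)
reward = torch.randn(32)
print(vdn_td_loss(q_taken, target_q, reward))
```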
QMIX

QMIX builds upon the value decomposition approach introduced by VDN. It addresses some of the limitations of VDN by introducing a mixing network to combine individual agent values into a joint Q-value.
QMIX uses a mixing network that can represent non-linear relationships between individual agent values and the joint Q-value, allowing for more complex coordination strategies.

The mixing network is designed to maintain a monotonic relationship between individual agent values and the joint Q-value, ensuring consistency in action selection.
QMIX follows the standard Q-learning paradigm, where agents aim to learn Q-values that maximize the expected cumulative reward. During training, agents interact with the environment, and the global Q-value is updated using the temporal-difference (TD) error derived from the Bellman equation.
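A simplified sketch of the QMIX mixing network: hypernetworks conditioned on the global state generate non-negative mixing weights, which enforces the monotonicity constraint ∂Q_tot/∂Q_i ≥ 0; dimensions and names are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing of per-agent Q-values into Q_tot."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks generate the mixing weights from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2     # non-negative weights keep Q_tot monotonic in each Q_i
        return q_tot.view(b)

mixer = QMixer(n_agents=3, state_dim=20)
print(mixer(torch.randn(8, 3), torch.randn(8, 20)).shape)  # torch.Size([8])
```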
Independent Learning — IPPO
Independent Proximal Policy Optimization (IPPO) is a simple Multi-Agent RL algorithm where each agent operates independently during training and execution.
Each agent has its own policy and critic network and learns from its own experiences without sharing information with other agents during training. Each agent also operates independently during execution, making decisions based on its own observations without direct communication with other agents.
Policy Optimization: IPPO uses the PPO algorithm for policy updates, applying the probability ratio with a clipped objective to prevent excessively large policy updates and maintain consistent performance during training.
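For reference, a minimal sketch of the clipped PPO surrogate loss each independent agent would minimize; tensor names and the clip value are illustrative:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: returns a loss to minimize (negative objective)."""
    ratio = torch.exp(log_probs - old_log_probs)           # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Dummy batch of 64 transitions for one agent.
loss = ppo_clip_loss(torch.randn(64), torch.randn(64), torch.randn(64))
print(loss)
```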

IPPO is a relatively simple algorithm and scales well, with little to no overhead as each agent is treated separately. It’s a great choice for situations that do not require coordination.
Issues: With IPPO, we simply hope that agents maximizing their own rewards will also work toward the global objective of maximizing the group’s reward. It also inherits the non-stationarity problem: because every agent is learning at once, the environment appears non-stationary from each agent’s perspective, which can lead to instability in learning.
MAPPO
Multi-Agent Proximal Policy Optimization (MAPPO) is an extension of the PPO algorithm for multi-agent scenarios. MAPPO employs a CTDE approach where agents share information during the learning phase, but act independently during execution.
MAPPO addresses non-stationarity by using a centralized critic, which has access to the joint state and can learn a more stable value function despite the changing policies of other agents.
The centralized critic evaluates the global state and provides feedback to each agent’s policy, enabling the system to stabilize learning. The policies are updated via the PPO objective using local observations, but the global value function helps mitigate non-stationarity.
During training, each agent i interacts with the environment and collects trajectories of states, actions, and rewards. These trajectories are used to compute the advantage function A_i(s, a), which guides the policy update. The policy update for each agent is then performed by maximizing the PPO objective:
L_i(θ) = E [ min( r(θ) A_i(s, a), clip( r(θ), 1 − ε, 1 + ε ) A_i(s, a) ) ], where r(θ) = π_θ(a | o) / π_θ_old(a | o) is the probability ratio between the current and old policies.
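A minimal sketch of the MAPPO actor-critic structure under CTDE, with decentralized actors over local observations and a centralized critic over the global state; the layer sizes, the use of concatenated observations as the global state, and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized policy: local observation -> action distribution."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralizedCritic(nn.Module):
    """Centralized value function: global state (here, concatenated observations) -> V(s)."""
    def __init__(self, global_state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(global_state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, global_state):
        return self.net(global_state).squeeze(-1)

n_agents, obs_dim, n_actions = 3, 10, 5
actors = [Actor(obs_dim, n_actions) for _ in range(n_agents)]
critic = CentralizedCritic(global_state_dim=n_agents * obs_dim)

obs = torch.randn(n_agents, obs_dim)                     # one observation per agent
actions = [actors[i](obs[i]).sample() for i in range(n_agents)]
value = critic(obs.flatten())                            # centralized value estimate
print(actions, value)
```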
MADDPG
MADDPG extends the Deep Deterministic Policy Gradient (DDPG) algorithm to multi-agent scenarios by giving each agent a centralized Q-function that takes all agents’ observations and actions as input. This involves sharing data among agents before storing it in the buffer, and it also follows the CTDE strategy.
It is an off-policy actor-critic method for multi-agent settings where each agent has its own actor network for the policy and a critic network that has access to the actions and observations of all agents during training.
During training, each agent predicts its next action using its target policy and shares it with the other agents before entering the training loop. This ensures that every agent’s centralized Q-function is evaluated with the same joint action when computing targets.

Put simply, MADDPG updates each agent’s policy by following the gradient of the expected return:

∇_{θ_i} J(μ_i) = E_{x, a ∼ D} [ ∇_{θ_i} μ_i(a_i | o_i) ∇_{a_i} Q_i^μ(x, a₁, …, aₙ) |_{a_i = μ_i(o_i)} ]
Here Q_i^μ is the centralized action-value function: it takes the global state information x and the actions of all agents as input and outputs the Q-value for agent i.
For the policy network, each agent i can maintain an approximation of the policy of every other agent j; this approximate policy is learned by maximizing the log-probability of agent j’s observed actions together with an entropy regularizer H over the policy distribution:

L(φ_i^j) = − E_{o_j, a_j} [ log π̂_{φ_i^j}(a_j | o_j) + λ H(π̂_{φ_i^j}) ]
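Below is a minimal sketch of the MADDPG centralized critic and the actor update for one agent, assuming continuous actions and a global input formed by concatenating all agents' observations and actions; shapes, names, and the simplified update are assumptions:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: local observation -> continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Q_i(x, a_1, ..., a_N): sees all agents' observations and actions."""
    def __init__(self, n_agents, obs_dim, act_dim, hidden=64):
        super().__init__()
        in_dim = n_agents * (obs_dim + act_dim)
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, all_obs, all_actions):
        return self.net(torch.cat([all_obs, all_actions], dim=-1)).squeeze(-1)

n_agents, obs_dim, act_dim = 2, 8, 2
actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
critics = [CentralizedCritic(n_agents, obs_dim, act_dim) for _ in range(n_agents)]

obs = torch.randn(32, n_agents, obs_dim)        # a sampled batch of joint observations

# Policy update for agent i: maximize Q_i with respect to agent i's own action.
i = 0
actions = []
for j in range(n_agents):
    a_j = actors[j](obs[:, j])
    actions.append(a_j if j == i else a_j.detach())   # only agent i's action carries gradients
actions = torch.stack(actions, dim=1)

q_i = critics[i](obs.flatten(1), actions.flatten(1))
actor_loss = -q_i.mean()
actor_loss.backward()                            # gradients flow into actors[i]
```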
Communication in MARL

Communication poses a tricky challenge in multi-agent RL. Constraints on communication between agents affect their ability to coordinate and to work together to maximize reward.
Communication constraints in MARL can take various forms, such as limited bandwidth, unreliable channels, partial observability, or restrictions on the frequency and timing of information exchange. These limitations force agents to make decisions based on incomplete or outdated information, potentially leading to suboptimal outcomes.
A few approaches to try and mitigate the problem are given below.
Differentiable and Reinforced Inter-Agent Learning (RIAL/DIAL)

RIAL and DIAL explore the idea of training agents to be selective about the messages they send and to learn efficient communication protocols.
RIAL combines DRQN with independent Q-learning for action and communication selection, split across two networks:
- Action Selector Network: This network takes the agent’s observation and received messages as input and outputs the action to be taken in the environment.
- Communication Network: This network determines what message the agent should send to other agents based on its current observation.
DIAL introduces a differentiable communication channel between agents, allowing them to learn how to communicate effectively through backpropagation. It uses the same 2 neural networks but has a differentiable communication channel that facilitates end-to-end learning using backpropagation through time (BPTT).
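A simplified sketch of a DIAL-style agent with separate action and message heads, where the outgoing message is a continuous vector that other agents consume, so gradients can flow through the channel during training; dimensions, names, and the message format are assumptions:

```python
import torch
import torch.nn as nn

class DIALAgent(nn.Module):
    """Recurrent agent that outputs both environment Q-values and an outgoing message."""
    def __init__(self, obs_dim, msg_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim + msg_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.q_head = nn.Linear(hidden, n_actions)        # action selection
        self.msg_head = nn.Linear(hidden, msg_dim)        # message to broadcast

    def forward(self, obs, incoming_msg, hidden):
        x = torch.relu(self.encoder(torch.cat([obs, incoming_msg], dim=-1)))
        hidden = self.rnn(x, hidden)
        return self.q_head(hidden), self.msg_head(hidden), hidden

# Two agents exchange messages; gradients can flow through the message channel.
a1, a2 = DIALAgent(10, 4, 3), DIALAgent(10, 4, 3)
h1, h2 = torch.zeros(1, 64), torch.zeros(1, 64)
msg_to_1, msg_to_2 = torch.zeros(1, 4), torch.zeros(1, 4)
obs1, obs2 = torch.randn(1, 10), torch.randn(1, 10)
q1, msg_to_2, h1 = a1(obs1, msg_to_1, h1)
q2, msg_to_1, h2 = a2(obs2, msg_to_2, h2)
```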
SchedNet

SchedNet is a multi-agent deep RL framework that builds on the ideas of RIAL/DIAL but introduces a learned scheduling component. Agents learn to determine which among them should be allowed to broadcast messages, based on the importance of their information.
SchedNet comprises three main components that work together to optimize communication and decision-making in multi-agent systems:
- Scheduling Mechanism: Agents learn to determine which should be allowed to broadcast messages.
- Message Encoding: Agents learn how to encode their messages efficiently.
- Action Selection: Agents learn to choose actions based on the messages they receive based on limited communication and local observations.
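A toy sketch of the weight-based scheduling idea: each agent produces an importance weight and only the top-k agents are scheduled to broadcast; the linear networks, k, and all dimensions are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

n_agents, obs_dim, msg_dim, k = 4, 6, 3, 2
weight_generators = [nn.Linear(obs_dim, 1) for _ in range(n_agents)]   # importance weights
message_encoders = [nn.Linear(obs_dim, msg_dim) for _ in range(n_agents)]

obs = torch.randn(n_agents, obs_dim)
weights = torch.cat([weight_generators[i](obs[i]) for i in range(n_agents)])
topk = torch.topk(weights, k).indices            # schedule: which agents may broadcast

messages = torch.zeros(n_agents, msg_dim)
for i in topk.tolist():
    messages[i] = message_encoders[i](obs[i])    # only scheduled agents send a message

print(weights, topk, messages)
```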
TarMAC: Targeted Multi-Agent Communication
TarMAC (Targeted Multi-Agent Communication) is a learned communication architecture that focuses on improving efficiency and communication effectiveness among agents.

It employs a targeted communication strategy, allowing agents to selectively communicate with specific peers rather than broadcasting to all agents. This helps in reducing communication overhead and focuses on relevant information exchange.
They use a signature-based soft-attention mechanism to enable such targeting. Intuitively, the attention weight is high when the signature vector predicted by the sender is similar to the query vector predicted by the receiver.
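A minimal sketch of the signature/query soft attention, assuming each sender emits a signature (key) and a value vector while each receiver emits a query; dimensions and names are illustrative:

```python
import torch
import torch.nn.functional as F

n_agents, key_dim, value_dim = 4, 16, 32
signatures = torch.randn(n_agents, key_dim)   # predicted by each sender
values = torch.randn(n_agents, value_dim)     # message content from each sender
queries = torch.randn(n_agents, key_dim)      # predicted by each receiver

# Attention weight of receiver j over sender i is high when q_j and k_i align.
scores = queries @ signatures.t() / key_dim ** 0.5       # (receiver, sender)
attn = F.softmax(scores, dim=-1)
aggregated = attn @ values                                # per-receiver aggregated message
print(aggregated.shape)                                   # torch.Size([4, 32])
```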


By targeted communication, the paper means directing particular messages to specific recipients: agents learn both what messages to send and whom to send them to. This communication is learned implicitly as a result of end-to-end training with a task-specific team reward.
Autoencoder-Based Methods for Communication
In “Learning to Ground Multi-Agent Communication with Autoencoders”, the authors develop a language for communication among agents in multi-agent systems, focusing on how this language can be grounded in the environment using autoencoders.
In the context of multi-agent systems, grounding refers to associating linguistic symbols with entities or concepts in the environment. This allows agents to understand and respond to each other’s messages based on shared experiences and observations.

The system is modeled with every agent having two modules, a speaker and a listener. Given the raw pixel observation, an image encoder (a 4-layer CNN shared by the speaker and listener) first embeds the pixels into a low-dimensional feature.
The goal of the communication autoencoder is to take the current state observation and generate the corresponding message. An autoencoder learns a mapping from the feature space of the image encoder to communication symbols, and its reconstruction loss is optimized jointly with the policy-gradient loss from the listener module.
Each agent’s listener module uses an independent policy head, a standard GRU policy followed by a linear layer, which takes the concatenation of the encoded image features and the message features as input. The predicted action distribution and expected returns are used to compute the policy-gradient loss.
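A simplified sketch of this speaker/listener setup: a shared CNN image encoder, a speaker that autoencodes image features into message symbols, and a listener GRU policy over concatenated image and message features; all sizes and the exact wiring are assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """4-layer CNN shared by speaker and listener modules."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, pixels):
        return self.proj(self.conv(pixels))

class Speaker(nn.Module):
    """Autoencoder from image features to message symbols and back (grounding)."""
    def __init__(self, feat_dim=64, msg_dim=8):
        super().__init__()
        self.to_msg = nn.Linear(feat_dim, msg_dim)
        self.reconstruct = nn.Linear(msg_dim, feat_dim)

    def forward(self, feat):
        msg = self.to_msg(feat)
        recon_loss = ((self.reconstruct(msg) - feat) ** 2).mean()
        return msg, recon_loss

class Listener(nn.Module):
    """GRU policy over concatenated image features and received message."""
    def __init__(self, feat_dim=64, msg_dim=8, n_actions=5, hidden=64):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim + msg_dim, hidden)
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, feat, msg, hidden):
        hidden = self.rnn(torch.cat([feat, msg], dim=-1), hidden)
        return self.policy(hidden), hidden

enc, spk, lst = SharedEncoder(), Speaker(), Listener()
pixels = torch.randn(1, 3, 64, 64)
feat = enc(pixels)
msg, recon_loss = spk(feat)                  # reconstruction loss grounds the message
logits, h = lst(feat, msg, torch.zeros(1, 64))
```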
Key Research Papers
Deep Recurrent Q-Learning for Partially Observable MDPs: https://arxiv.org/abs/1507.06527
Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?: https://arxiv.org/abs/2011.09533
The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games: https://arxiv.org/abs/2103.01955
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments: https://arxiv.org/abs/1706.02275
Value-Decomposition Networks For Cooperative Multi-Agent Learning: https://arxiv.org/abs/1706.05296
QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning: https://arxiv.org/abs/1803.11485
Learning to Communicate with Deep Multi-Agent Reinforcement Learning: https://arxiv.org/abs/1605.06676
Learning to Schedule Communication in Multi-agent Reinforcement Learning: https://arxiv.org/abs/1902.01554
TarMAC: Targeted Multi-Agent Communication: https://arxiv.org/abs/1810.11187
Learning to Ground Multi-Agent Communication with Autoencoders: https://arxiv.org/abs/2110.15349
Conclusion and Future Directions
In conclusion, Multi-Agent Reinforcement Learning (MARL) represents a significant extension of traditional reinforcement learning, introducing the complexities of multiple agents interacting within shared environments. There is active research on all of the challenges that multiple agents bring, such as communication, training, coordination, and more.
Developing algorithms that can efficiently handle the exponentially growing joint action space remains a critical challenge. DTDE (Decentralized Training, Decentralized Execution) is an active research area, with new algorithms being introduced all the time. Additionally, enhancing communication strategies among agents is essential to facilitate better coordination and decision-making.
Finally, exploring applications of MARL in real-world scenarios such as autonomous driving, smart grid management, and collaborative robotics could provide valuable insights and drive practical innovations.
Excited to see where this field goes! Thank you so much for reading!