Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

MADDPG (Multi-Agent Deep Deterministic Policy Gradient) extends the single-agent DDPG algorithm to cooperative and competitive training of multiple agents in complex environments. It pairs decentralized actors with centralized critics, which improves the stability and convergence of the learning process.

Compatible Action Spaces

Discrete    Box    MultiDiscrete    MultiBinary
✔️          ✔️

Gumbel-Softmax

The Gumbel-Softmax activation function is a differentiable approximation of sampling from a categorical distribution. By continuously relaxing discrete action spaces, it enables gradient-based optimization in multi-agent reinforcement learning, so agents can learn and improve their decision-making in environments with discrete choices. If you would like to customise the MLP output activation function, you can define it within the network configuration using the key “output_activation”. Defining the output activation yourself is, however, unnecessary, as the algorithm selects the appropriate function for the environment’s action space.
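
As a rough illustration of the relaxation described above, the following dependency-free sketch draws a single Gumbel-Softmax sample from a vector of logits. The helper name gumbel_softmax_sample and the temperature argument tau are assumptions made for this example and are not taken from AgileRL; in practice a tensor-library version such as torch.nn.functional.gumbel_softmax would be used so that gradients can flow through the sample.

```python
import math
import random


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Draw one relaxed one-hot sample from unnormalised logits (hypothetical helper)."""
    noisy = []
    for logit in logits:
        # Keep u strictly inside (0, 1) so both logarithms are defined
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = gumbel_softmax_sample([2.0, 0.5, -1.0], tau=0.5, rng=rng)
action = probs.index(max(probs))  # Discrete action that would be sent to the environment
```

Lower values of tau push the sample towards a one-hot vector, while higher values make it closer to uniform.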

Agent Masking

If you need to take actions from agents at different timesteps, you can use agent masking to retrieve new actions only for certain agents whilst providing ‘environment defined actions’ for the others. These act as nominal actions for the “masked” agents to take, and should be returned as part of the info dictionary. Following the PettingZoo API, we recommend keying the info dictionary by agent, with env_defined_actions defined as follows:

info = {'speaker_0': {'env_defined_actions': None},
        'listener_0': {'env_defined_actions': np.array([0, 0, 0, 0, 0])}}

For agents that you do not wish to mask, env_defined_actions should be set to None. If your environment has discrete action spaces, provide ‘env_defined_actions’ as a numpy array with a single value. For example, an action space of type Discrete(5) may have an env_defined_actions value of np.array([4]). For an environment with continuous action spaces (e.g. Box(0, 1, (5,))), the shape of the array should match the size of the action space (e.g. np.array([0.5, 0.5, 0.5, 0.5, 0.5])). Agent masking is handled automatically by the AgileRL multi-agent training function by passing the info dictionary into the agent’s get_action method:

state, info = env.reset()  # or: next_state, reward, done, truncation, info = env.step(action)
action, _ = agent.get_action(state, infos=info)
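
To make the masking behaviour concrete, here is a minimal sketch of how environment-defined actions could override policy actions. The helper apply_agent_masks is hypothetical and plain lists stand in for numpy arrays to keep the example dependency-free; AgileRL's internal handling may differ.

```python
def apply_agent_masks(policy_actions, infos):
    """Replace policy actions with env-defined actions for masked agents (hypothetical helper)."""
    final_actions = {}
    for agent_id, action in policy_actions.items():
        env_action = infos.get(agent_id, {}).get("env_defined_actions")
        # None means the agent is not masked, so the policy's action is kept
        final_actions[agent_id] = action if env_action is None else env_action
    return final_actions


policy_actions = {
    "speaker_0": [0.1, 0.9, 0.3],
    "listener_0": [0.2, 0.4, 0.6, 0.8, 1.0],
}
infos = {
    "speaker_0": {"env_defined_actions": None},
    "listener_0": {"env_defined_actions": [0, 0, 0, 0, 0]},
}
actions = apply_agent_masks(policy_actions, infos)
```

Here speaker_0 keeps the action chosen by its policy, while listener_0 takes the nominal action supplied by the environment.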

Example Training Loop

import numpy as np
import torch
from pettingzoo.mpe import simple_speaker_listener_v4
from tqdm import tqdm

from agilerl.algorithms import MADDPG
from agilerl.components.multi_agent_replay_buffer import MultiAgentReplayBuffer
from agilerl.vector.pz_async_vec_env import AsyncPettingZooVecEnv
from agilerl.utils.algo_utils import obs_channels_to_first

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_envs = 8
env = AsyncPettingZooVecEnv(
    [
        lambda: simple_speaker_listener_v4.parallel_env(continuous_actions=True)
        for _ in range(num_envs)
    ]
)
env.reset()

# Configure the multi-agent algo input arguments
observation_spaces = [env.single_observation_space(agent) for agent in env.agents]
action_spaces = [env.single_action_space(agent) for agent in env.agents]

channels_last = False  # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
n_agents = env.num_agents
agent_ids = [agent_id for agent_id in env.agents]
field_names = ["state", "action", "reward", "next_state", "done"]
memory = MultiAgentReplayBuffer(
    memory_size=1_000_000,
    field_names=field_names,
    agent_ids=agent_ids,
    device=device,
)

agent = MADDPG(
    observation_spaces=observation_spaces,
    action_spaces=action_spaces,
    agent_ids=agent_ids,
    vect_noise_dim=num_envs,
    device=device,
)
agent.set_training_mode(True)

# Define training loop parameters
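max_steps = 100000  # Max steps
total_steps = 0  # Environment steps taken across all episodes
pbar = tqdm(total=max_steps)
while agent.steps[-1] < max_steps:
    obs, info = env.reset()  # Reset environment at start of episode
    scores = np.zeros(num_envs)
    completed_episode_scores = []
    steps = 0  # Environment steps taken in this episode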

    for _ in range(1000):
        # Get next action from agent
        action, raw_action = agent.get_action(
            obs=obs,
            infos=info,
        )

        # Act in environment
        next_obs, reward, termination, truncation, info = env.step(action)

        scores += np.sum(np.array(list(reward.values())).transpose(), axis=-1)
        total_steps += num_envs
        steps += num_envs

        # Save experiences to replay buffer
        memory.save_to_memory(obs, raw_action, reward, next_obs, termination, is_vectorised=True)

        # Learn according to learning frequency
        if len(memory) >= agent.batch_size:
            for _ in range(num_envs // agent.learn_step):
                experiences = memory.sample(agent.batch_size) # Sample replay buffer
                agent.learn(experiences) # Learn according to agent's RL algorithm

        # Update the observation
        obs = next_obs

        # Calculate scores and reset noise for finished episodes
        reset_noise_indices = []
        term_array = np.array(list(termination.values())).transpose()
        trunc_array = np.array(list(truncation.values())).transpose()
        for idx, (d, t) in enumerate(zip(term_array, trunc_array)):
            if np.any(d) or np.any(t):
                completed_episode_scores.append(scores[idx])
                agent.scores.append(scores[idx])
                scores[idx] = 0
                reset_noise_indices.append(idx)

        agent.reset_action_noise(reset_noise_indices)

    pbar.update(steps)  # Advance by the environment steps taken across all vectorized envs
    pbar.set_description(f"Score: {np.mean(completed_episode_scores[-10:])}")

    agent.steps[-1] += steps

Neural Network Configuration

To configure the architecture of the network’s encoder / head, pass a kwargs dict to the MADDPG net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.

Note

MADDPG uses decentralized actors and centralized critics. Encoder mutations are disabled for the actor networks because the encoder architectures differ between actors and critics: the critics use EvolvableMultiInput to handle the combined observation spaces of all subagents, while each actor uses an individual evolvable module (e.g. MLP, CNN) tailored to that agent’s observation space. We therefore cannot ensure that the same mutation can be applied to both actors and critics, as we generally do in other algorithms.
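
The asymmetry between actors and critics can be seen by comparing how much information each network receives. The sketch below assumes flat vector observations and uses illustrative sizes for a speaker/listener pair; the helper names are hypothetical, and AgileRL's critics route each agent's observation through its own encoder rather than a single flat input.

```python
def actor_input_size(obs_dims, agent_id):
    """A decentralized actor only sees its own agent's observation."""
    return obs_dims[agent_id]


def critic_input_size(obs_dims, action_dims):
    """A centralized critic sees every agent's observation and action."""
    return sum(obs_dims.values()) + sum(action_dims.values())


# Illustrative sizes only (assumed for this example)
obs_dims = {"speaker_0": 3, "listener_0": 11}
action_dims = {"speaker_0": 3, "listener_0": 5}

speaker_actor_in = actor_input_size(obs_dims, "speaker_0")
critic_in = critic_input_size(obs_dims, action_dims)
```

Because the critic's input grows with the number of agents while each actor's does not, a mutation that is valid for one network's encoder is not guaranteed to be valid for the other.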

For discrete / vector observations:

NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

For image observations:

NET_CONFIG = {
    "encoder_config": {
        'channel_size': [32, 32],  # CNN channel size
        'kernel_size': [8, 4],  # CNN kernel size
        'stride_size': [4, 2],  # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

For dictionary / tuple observations containing any combination of image, discrete, and vector observations:

CNN_CONFIG = {
    "channel_size": [32, 32], # CNN channel size
    "kernel_size": [8, 4],   # CNN kernel size
    "stride_size": [4, 2],   # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True  # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

# Create MADDPG agent
agent = MADDPG(
  observation_spaces=observation_spaces,
  action_spaces=action_spaces,
  agent_ids=agent_ids,
  net_config=NET_CONFIG,
  device=device,
)

Evolutionary Hyperparameter Optimization

AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.

Saving and Loading Agents

To save an agent, use the save_checkpoint method:

from agilerl.algorithms.maddpg import MADDPG

# Create MADDPG agent
agent = MADDPG(
  observation_spaces=observation_spaces,
  action_spaces=action_spaces,
  agent_ids=agent_ids,
  net_config=NET_CONFIG,
  device=device,
)

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.maddpg import MADDPG

checkpoint_path = "path/to/checkpoint"
agent = MADDPG.load(checkpoint_path)

Parameters

class agilerl.algorithms.maddpg.MADDPG(*args: Any, **kwargs: Any)

Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm.

Paper: https://arxiv.org/abs/1706.02275

Parameters:
  • observation_spaces (list[spaces.Space] | spaces.Dict) – Observation space for each agent

  • action_spaces (list[spaces.Space] | spaces.Dict) – Action space for each agent

  • agent_ids (list[str] | None, optional) – Agent ID for each agent

  • O_U_noise (bool, optional) – Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True

  • vect_noise_dim (int, optional) – Vectorization dimension of environment for action noise, defaults to 1

  • expl_noise (float, optional) – Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise

  • mean_noise (float, optional) – Mean of exploration noise, defaults to 0.0

  • theta (float, optional) – Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15

  • dt (float, optional) – Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.

  • net_config (dict, optional) – Encoder configuration, defaults to mlp with hidden size [64,64]

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 0.001

  • lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 0.01

  • learn_step (int, optional) – Learning frequency, defaults to 5

  • gamma (float, optional) – Discount factor, defaults to 0.95

  • tau (float, optional) – For soft update of target network parameters, defaults to 0.01

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • normalize_images (bool, optional) – Normalize image observations, defaults to True

  • actor_networks (list[nn.Module], optional) – List of custom actor networks, defaults to None

  • critic_networks (list[nn.Module], optional) – List of custom critic networks, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • torch_compiler (str, optional) – The torch compile mode ‘default’, ‘reduce-overhead’ or ‘max-autotune’, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

action_noise(agent_id: str) Tensor

Create action noise for exploration, either Ornstein Uhlenbeck or from a normal distribution.

Parameters:

agent_id (str) – Agent ID for action dims

Returns:

Action noise

Return type:

torch.Tensor
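
For intuition about how theta, dt, mean_noise and expl_noise interact, the following sketch implements a textbook Euler-Maruyama discretisation of an Ornstein-Uhlenbeck process in plain Python. The class OUNoise and the expl_noise default of 0.1 are assumptions made for illustration; the other defaults mirror the values listed in the parameters above, and this is not necessarily how AgileRL computes its noise.

```python
import math
import random


class OUNoise:
    """Ornstein-Uhlenbeck noise, Euler-Maruyama discretisation (illustrative sketch)."""

    def __init__(self, size, mean_noise=0.0, theta=0.15, dt=1e-2, expl_noise=0.1, seed=None):
        self.mean_noise = mean_noise
        self.theta = theta
        self.dt = dt
        self.expl_noise = expl_noise  # Assumed default, used as the noise scale
        self.rng = random.Random(seed)
        self.state = [mean_noise] * size

    def sample(self):
        # x <- x + theta * (mean - x) * dt + scale * sqrt(dt) * N(0, 1)
        self.state = [
            x
            + self.theta * (self.mean_noise - x) * self.dt
            + self.expl_noise * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

    def reset(self):
        # Return the process to its mean, e.g. when an episode finishes
        self.state = [self.mean_noise] * len(self.state)


noise = OUNoise(size=2, seed=0)
first = noise.sample()
second = noise.sample()  # Correlated with the previous sample
noise.reset()
```

Unlike independent Gaussian noise, successive samples are correlated, which is why the noise state is reset for environments whose episodes have ended.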

assemble_grouped_outputs(agent_outputs: dict[str, ndarray], vect_dim: int) dict[str, ndarray]

Assembles individual agent outputs into batched outputs for shared policies.

Parameters:
  • agent_outputs (dict[str, np.ndarray]) – Dictionary with individual agent outputs, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}

  • vect_dim (int) – Vectorization dimension size, i.e. number of vect envs

Returns:

Assembled dictionary with the form {‘agent’: [4, 7, 8]}

Return type:

dict[str, np.ndarray]

assemble_shared_inputs(experience: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], ...]) dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], ...]

Preprocesses inputs by constructing dictionaries by shared agents.

Parameters:

experience (ExperiencesType) – experience to reshape from environment

Returns:

Preprocessed inputs

Return type:

ExperiencesType

build_net_config(net_config: dict[str, dict[str, Any] | Any] | None = None, flatten: bool = True, return_encoders: bool = False) dict[str, dict[str, Any] | Any] | tuple[dict[str, dict[str, Any] | Any], dict[str, dict[str, dict[str, Any] | Any]]]

Extract an appropriate net config for each sub-agent from the passed net config dictionary. If grouped_agents is True, the net config will be built for the grouped agents i.e. through their common prefix in their agent_id, whenever the passed net config is None.

Note

If return_encoders is True, we return the encoder configs for each sub-agent. The only exception is for MLPs, where we only return the deepest architecture found. This is useful for algorithms with shared critics that process the observations of all agents, and therefore use an EvolvableMultiInput module to process the observations of all agents (assigning an encoder to each sub-agent and, optionally, a single EvolvableMLP to process the concatenated vector observations).

Parameters:
  • net_config (NetConfigType | None) – Net config dictionary

  • flatten (bool, optional) – Whether to return a net config for each possible sub-agent, even in grouped settings.

  • return_encoders (bool, optional) – Whether to return the encoder configs for each sub-agent. Defaults to False.

Returns:

Net config dictionary for each sub-agent

Return type:

NetConfigType

clean_up() None

Clean up the algorithm by deleting the networks and optimizers.

Returns:

None

Return type:

None

clone(index: int | None = None, wrap: bool = True) Self

Create a clone of the algorithm.

Parameters:
  • index (int | None, optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm

static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:
  • agent (SelfEvolvableAlgorithm) – The algorithm to copy attributes from.

  • clone (SelfEvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

SelfEvolvableAlgorithm

disassemble_grouped_outputs(group_outputs: dict[str, ndarray], vect_dim: int, grouped_agents: dict[str, list[str]]) dict[str, ndarray]

Disassembles batched output by shared policies into their grouped agents’ outputs.

Note

This assumes that for any given sub-agent the termination condition is deterministic, i.e. any given agent will always terminate at the same timestep in different vectorized environments.

Parameters:
  • group_outputs (dict[str, np.ndarray]) – Dictionary to be disassembled, has the form {‘agent’: [4, 7, 8]}

  • vect_dim (int) – Vectorization dimension size, i.e. number of vect envs

  • grouped_agents (dict[str, list[str]]) – Dictionary of grouped agent IDs

Returns:

Disassembled dictionary, e.g. {‘agent_0’: 4, ‘agent_1’: 7, ‘agent_2’: 8}

Return type:

dict[str, np.ndarray]
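
The grouping and ungrouping of outputs described for assemble_grouped_outputs and disassemble_grouped_outputs can be sketched as follows. The helper names are hypothetical, the vectorization dimension is ignored for brevity, and the group ID is assumed to be the agent ID with its trailing numeric suffix removed.

```python
from collections import defaultdict


def group_id_of(agent_id):
    # Assumed naming convention: "<group>_<index>", e.g. "agent_0" -> "agent"
    return agent_id.rsplit("_", 1)[0]


def assemble_by_group(agent_outputs):
    """Collect per-agent outputs into one list per group."""
    grouped = defaultdict(list)
    for agent_id, output in agent_outputs.items():
        grouped[group_id_of(agent_id)].append(output)
    return dict(grouped)


def disassemble_by_group(group_outputs, grouped_agents):
    """Split per-group lists back into per-agent outputs."""
    agent_outputs = {}
    for group_id, outputs in group_outputs.items():
        for agent_id, output in zip(grouped_agents[group_id], outputs):
            agent_outputs[agent_id] = output
    return agent_outputs


agent_outputs = {"agent_0": 4, "agent_1": 7, "agent_2": 8}
assembled = assemble_by_group(agent_outputs)
restored = disassemble_by_group(assembled, {"agent": ["agent_0", "agent_1", "agent_2"]})
```

The round trip only recovers the original dictionary if the agent order within each group is the same in both directions.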

evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

extract_action_masks(infos: dict[str, dict[str, Any]]) dict[str, ndarray]

Extract action masks from info dictionary.

Parameters:

infos (dict[str, dict[...]]) – Info dict

Returns:

Action masks

Return type:

dict[str, np.ndarray]

extract_agent_masks(infos: dict[str, dict[str, Any]] | None = None) tuple[dict[str, ndarray], dict[str, ndarray]]

Extract env_defined_actions from info dictionary and determine agent masks.

Parameters:

infos (dict[str, dict[...]]) – Info dict

Returns:

Env defined actions and agent masks

Return type:

tuple[ArrayDict, ArrayDict]

get_action(obs: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]], infos: dict[str, dict[str, Any]] | None = None, *args: Any, **kwargs: Any) tuple[dict[str, ndarray], dict[str, ndarray]]

Return the next action to take in the environment.

Parameters:
  • obs (dict[str, numpy.ndarray]) – Environment observations: {‘agent_0’: state_dim_0, …, ‘agent_n’: state_dim_n}

  • infos (dict[str, dict[str, ...]]) – Information dictionary returned by env.step(actions)

Returns:

Actions for each agent, raw actions for each agent

Return type:

tuple[dict[str, np.ndarray], dict[str, np.ndarray]]

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:

action_space (spaces.Space or list[spaces.Space]) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

tuple[int, …]

get_group_id(agent_id: str) str

Get the group ID for an agent.

Parameters:

agent_id (str) – The agent ID

Returns:

The group ID

get_lr_names() list[str]

Return the learning rates of the algorithm.

get_policy() EvolvableModuleProtocol

Return the policy network of the algorithm.

get_setup() MultiAgentSetup

Get the type of multi-agent setup, as determined by the observation spaces of the agents. By having the ‘same’ observation space, we mean that the spaces are analogous, i.e. we can use the same EvolvableModule to process their observations.

  1. HOMOGENEOUS: All agents have the same observation space.

  2. MIXED: Agents can be grouped by their observation spaces.

  3. HETEROGENEOUS: All agents have different observation spaces.

Returns:

The type of multi-agent setup.

Return type:

MultiAgentSetup
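
A simplified way to picture the three setups is to count how many distinct observation spaces the agents have. In this sketch, observation shapes stand in for full spaces, and the Setup enum and classify_setup helper are hypothetical names rather than AgileRL's own.

```python
from enum import Enum


class Setup(Enum):
    HOMOGENEOUS = "homogeneous"
    MIXED = "mixed"
    HETEROGENEOUS = "heterogeneous"


def classify_setup(obs_shapes):
    """Classify a setup from a dict of agent ID -> observation shape (stand-in for a space)."""
    n_agents = len(obs_shapes)
    n_unique = len(set(obs_shapes.values()))
    if n_unique == 1:
        return Setup.HOMOGENEOUS  # Every agent shares one observation space
    if n_unique == n_agents:
        return Setup.HETEROGENEOUS  # No two agents share an observation space
    return Setup.MIXED  # Some, but not all, agents share an observation space


homogeneous = classify_setup({"agent_0": (3,), "agent_1": (3,)})
mixed = classify_setup({"agent_0": (3,), "agent_1": (3,), "other_0": (11,)})
heterogeneous = classify_setup({"speaker_0": (3,), "listener_0": (11,)})
```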

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:

observation_space (spaces.Space or list[spaces.Space]) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

tuple[int, …]

has_grouped_agents() bool

Whether the algorithm contains groups of agents assigned to the same policy for centralized execution.

Return type:

bool

property index: int

Return the index of the algorithm.

static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:

input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], ...]) dict[str, Tensor]

Update agent network parameters from the gathered experiences.

Parameters:

experiences (tuple[dict[str, torch.Tensor]]) – Tuple of dictionaries containing batched states, actions, rewards, next_states, dones in that order for each individual agent.

Returns:

Loss dictionary

Return type:

dict[str, torch.Tensor]
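
As a reminder of what a critic update optimises in DDPG-style algorithms, the sketch below computes the standard one-step temporal-difference target and a mean squared error loss on scalar values. The function names are hypothetical and the code works on plain floats; it illustrates the general technique rather than the internals of this learn method.

```python
def critic_target(reward, done, next_q, gamma=0.95):
    """One-step TD target: y = r + gamma * (1 - done) * Q_target(next_state, next_actions)."""
    return reward + gamma * (1.0 - float(done)) * next_q


def critic_loss(q_values, targets):
    """Mean squared error between the critic's estimates and the TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


targets = [
    critic_target(reward=1.0, done=False, next_q=2.0),  # Bootstraps from the target critic
    critic_target(reward=1.0, done=True, next_q=2.0),  # Terminal step: no bootstrapping
]
loss = critic_loss(q_values=[2.9, 0.0], targets=targets)
```

In MADDPG the target critic's value is evaluated on the next observations and next target-actor actions of all agents, which is what makes the critic centralized.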

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) Self

Load an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) None

Load saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from

property mut: Any

Return the mutation object of the algorithm.

mutation_hook() None

Execute the hooks registered with the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]

Create a population of algorithms.

Parameters:

size (int) – The size of the population.

Returns:

A list of algorithms.

Return type:

list[SelfEvolvableAlgorithm]

preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]) dict[str, Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]]

Preprocesses observations for forward pass through neural network.

Parameters:

observation (numpy.ndarray[float] or dict[str, numpy.ndarray[float]]) – Observations of environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

process_infos(infos: dict[str, dict[str, Any]] | None) tuple[dict[str, ndarray], dict[str, ndarray], dict[str, ndarray]]

Process the information, extract env_defined_actions, action_masks and agent_masks.

Parameters:

infos (dict[str, dict[...]]) – Info dict

Returns:

Tuple of action masks, env_defined_actions and agent masks

Return type:

tuple[ArrayDict, ArrayDict, ArrayDict]

recompile() None

Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_mutation_hook(hook: Callable) None

Register a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.
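
The hook mechanism follows a common registry pattern, sketched below with a toy class. HookedAlgorithm, mutate_lr and the assumption that hooks are called without arguments are all hypothetical and only illustrate the idea of running callbacks after a mutation.

```python
class HookedAlgorithm:
    """Toy stand-in for an algorithm that supports post-mutation hooks."""

    def __init__(self):
        self._mutation_hooks = []
        self.lr = 0.01

    def register_mutation_hook(self, hook):
        # Store the callable so that it runs after every mutation
        self._mutation_hooks.append(hook)

    def mutation_hook(self):
        # Execute all registered hooks in registration order
        for hook in self._mutation_hooks:
            hook()

    def mutate_lr(self, factor):
        self.lr *= factor
        self.mutation_hook()  # Run hooks once the mutation has been applied


algo = HookedAlgorithm()
seen_lrs = []
algo.register_mutation_hook(lambda: seen_lrs.append(algo.lr))
algo.mutate_lr(0.5)
```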

register_network_group(group: NetworkGroup) None

Set the evaluation network for the algorithm.

Parameters:

group (NetworkGroup) – The network group to register.

reinit_optimizers(optimizer: OptimizerConfig | None = None) None

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:

optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

reset_action_noise(indices: list[int]) None

Reset action noise.

Parameters:

indices (list[int]) – List of indices to reset

save_checkpoint(path: str) None

Save a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

set_training_mode(training: bool) None

Set the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

soft_update(net: Module, target: Module) None

Soft updates target network.

Parameters:
  • net (nn.Module) – Network whose parameters are blended into the target

  • target (nn.Module) – Target network to be updated
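
A soft (Polyak) update moves each target parameter a small step towards the corresponding online parameter. The sketch below shows the arithmetic on plain lists of floats; the function name soft_update_params is hypothetical, and the default tau matches the value listed in the class parameters.

```python
def soft_update_params(net_params, target_params, tau=0.01):
    """Blend online parameters into target parameters: tau * net + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * t for p, t in zip(net_params, target_params)]


net_params = [1.0, -2.0]
target_params = [0.0, 0.0]
updated = soft_update_params(net_params, target_params, tau=0.01)
```

With a small tau the target network changes slowly, which keeps the TD targets used by the critics stable.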

sum_shared_rewards(rewards: dict[str, ndarray]) dict[str, ndarray]

Sum the rewards for grouped agents.

Parameters:

rewards (dict[str, np.ndarray]) – Reward dictionary from environment

Returns:

Summed rewards dictionary

Return type:

dict[str, np.ndarray]
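
Summing rewards over grouped agents can be pictured with the following sketch. The helper sum_rewards_by_group is hypothetical, scalar rewards stand in for numpy arrays, and the group ID is assumed to be the agent ID without its trailing numeric suffix.

```python
def sum_rewards_by_group(rewards):
    """Sum per-agent rewards into one total per group (hypothetical helper)."""
    totals = {}
    for agent_id, reward in rewards.items():
        group_id = agent_id.rsplit("_", 1)[0]  # Assumed convention: "agent_0" -> "agent"
        totals[group_id] = totals.get(group_id, 0.0) + reward
    return totals


rewards = {"agent_0": 1.0, "agent_1": 2.0, "other_0": 0.5}
summed = sum_rewards_by_group(rewards)
```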

test(env: str | ParallelEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3, sum_scores: bool = True) float

Return the mean test score of the agent in the environment.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to None

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3

  • sum_scores (bool, optional) – Boolean flag to indicate whether to sum sub-agent scores, defaults to True

Returns:

Mean test score

Return type:

float

to_device(*experiences: Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]) tuple[Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor], ...]

Move experiences to the device.

Parameters:

experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

tuple[torch.Tensor[float], …]

unwrap_models() None

Unwraps the models in the algorithm from the accelerator.

wrap_models() None

Wrap the models in the algorithm with the accelerator.