Twin Delayed Deep Deterministic Policy Gradient (TD3)

TD3 is an extension of DDPG that addresses overestimation bias through three mechanisms: a second critic network, with targets formed from the minimum of the two critics' value estimates (clipped double-Q learning); delayed actor (policy) updates relative to critic updates; and target policy smoothing, which adds clipped noise to target actions.
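For intuition, the sketch below shows how these mechanisms combine when computing the critic target. It is illustrative pseudocode under assumed names (td3_target, actor_target, critic_1_target, critic_2_target), not AgileRL's internal implementation; the defaults mirror the gamma, policy_noise and noise_clip parameters documented further down this page.

import torch

def td3_target(actor_target, critic_1_target, critic_2_target,
               next_obs, rewards, dones,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    """Compute the TD3 critic target: clipped double-Q with target policy smoothing."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise
        next_actions = actor_target(next_obs)
        noise = (torch.randn_like(next_actions) * policy_noise).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-1.0, 1.0)
        # Clipped double-Q: use the smaller of the two target critics' estimates
        q_next = torch.min(critic_1_target(next_obs, next_actions),
                           critic_2_target(next_obs, next_actions))
        # Delayed actor updates happen outside this target computation,
        # once every policy_freq critic updates
        return rewards + gamma * (1.0 - dones) * q_next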

Compatible Action Spaces

Discrete | Box | MultiDiscrete | MultiBinary
-------- | --- | ------------- | -----------
❌ | ✔️ | ❌ | ❌
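Since only Box is supported, the environment must expose a continuous action space, for example:

from gymnasium import spaces

# TD3 requires a continuous (Box) action space
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))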

Example

import torch
from agilerl.utils.utils import make_vect_envs
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.components.data import Transition
from agilerl.algorithms.td3 import TD3
from agilerl.networks.actors import DeterministicActor

# Create environment and Experience Replay Buffer
num_envs = 1
env = make_vect_envs('LunarLanderContinuous-v3', num_envs=num_envs)
observation_space = env.single_observation_space
action_space = env.single_action_space

memory = ReplayBuffer(max_size=10000)

# Create TD3 agent
agent = TD3(observation_space, action_space)
agent.set_training_mode(True)

obs, info = env.reset()  # Reset environment at start of episode
while True:
    action = agent.get_action(obs)  # Get next action from agent (in training mode: raw + noise)
    # Rescale action from network output bounds to env action space
    action = DeterministicActor.rescale_action(
        action=torch.from_numpy(action),
        low=agent.action_low,
        high=agent.action_high,
        output_activation=agent.actor.output_activation,
    ).numpy()
    next_obs, reward, terminated, truncated, _ = env.step(action)  # Act in environment
    done = terminated | truncated

    # Save experience to replay buffer
    transition = Transition(
        obs=obs,
        action=action,
        reward=reward,
        next_obs=next_obs,
        done=done,
        batch_size=[num_envs]
    )
    transition = transition.to_tensordict()
    memory.add(transition)

    # Learn according to learning frequency
    if len(memory) >= agent.batch_size:
        experiences = memory.sample(agent.batch_size)  # Sample replay buffer
        agent.learn(experiences)  # Learn according to agent's RL algorithm

    obs = next_obs  # Carry the observation over to the next step

Note

In the loop above, actions are rescaled after get_action() using the static method DeterministicActor.rescale_action. This maps the actor’s output (in the range implied by its output activation, e.g. [-1, 1] for Tanh) to the environment’s action space [action_low, action_high]. When using get_action(..., training=True) for exploration, the agent returns actions in the activation range; rescaling to the env space is required before env.step(). When not in training mode, the agent applies this rescaling internally.
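For the Tanh case, this rescaling reduces to a linear map from [-1, 1] onto [low, high]. A minimal sketch of the equivalent arithmetic (a hypothetical helper, not the library method itself):

import numpy as np

def rescale_tanh_action(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map an action in [-1, 1] (Tanh output) linearly onto [low, high]."""
    return low + (action + 1.0) * (high - low) / 2.0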

Custom actor networks

TD3 allows actor networks that are not DeterministicActor. If you use a custom actor, it must define an attribute output_activation (a string) set to one of the allowed output activations: "Tanh", "Softsign", "Sigmoid", "Softmax", or "GumbelSoftmax". This is used by DeterministicActor.rescale_action to map network outputs to the environment action space correctly.
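A minimal sketch of a custom actor satisfying this contract (the architecture is hypothetical; the requirement described above is only the output_activation attribute and a correspondingly bounded output):

import torch
import torch.nn as nn

class MyActor(nn.Module):
    """Hypothetical custom actor: TD3 reads the output_activation attribute."""
    output_activation = "Tanh"  # must be one of the allowed output activations

    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # outputs bounded in [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)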

Neural Network Configuration

To configure the architecture of the network’s encoder / head, pass a kwargs dict to the TD3 net_config field. Full arguments can be found in the documentation of EvolvableMLP, EvolvableCNN, and EvolvableMultiInput.

For discrete / vector observations:

NET_CONFIG = {
    "encoder_config": {'hidden_size': [32, 32]},  # Encoder hidden size
    "head_config": {'hidden_size': [32]}          # Network head hidden size
}

For image observations:

NET_CONFIG = {
    "encoder_config": {
        'channel_size': [32, 32],  # CNN channel size
        'kernel_size': [8, 4],     # CNN kernel size
        'stride_size': [4, 2],     # CNN stride size
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

For dictionary / tuple observations containing any combination of image, discrete, and vector observations:

CNN_CONFIG = {
    "channel_size": [32, 32], # CNN channel size
    "kernel_size": [8, 4],   # CNN kernel size
    "stride_size": [4, 2],   # CNN stride size
}

NET_CONFIG = {
    "encoder_config": {
        "latent_dim": 32,
        # Config for nested EvolvableCNN objects
        "cnn_config": CNN_CONFIG,
        # Config for nested EvolvableMLP objects
        "mlp_config": {
            "hidden_size": [32, 32]
        },
        "vector_space_mlp": True  # Process vector observations with an MLP
    },
    "head_config": {'hidden_size': [32]}  # Network head hidden size
}

agent = TD3(observation_space, action_space, net_config=NET_CONFIG)   # Create TD3 agent
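As an illustration, the multi-input configuration above could be paired with a composite observation space like the following (the spaces here are hypothetical, chosen to match the CNN kernel and stride sizes):

import numpy as np
from gymnasium import spaces
from agilerl.algorithms.td3 import TD3

# Hypothetical composite observation space for the multi-input config above
observation_space = spaces.Dict({
    "image": spaces.Box(low=0, high=255, shape=(3, 84, 84), dtype=np.uint8),
    "vector": spaces.Box(low=-1.0, high=1.0, shape=(8,)),
})
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))

agent = TD3(observation_space, action_space, net_config=NET_CONFIG)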

Evolutionary Hyperparameter Optimization

AgileRL allows for efficient hyperparameter optimization during training to provide state-of-the-art results in a fraction of the time. For more information on how this is done, please refer to the Evolutionary Hyperparameter Optimization documentation.

Saving and Loading Agents

To save an agent, use the save_checkpoint method:

from agilerl.algorithms.td3 import TD3

agent = TD3(observation_space, action_space)   # Create TD3 agent

checkpoint_path = "path/to/checkpoint"
agent.save_checkpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.td3 import TD3

checkpoint_path = "path/to/checkpoint"
agent = TD3.load(checkpoint_path)

Parameters

class agilerl.algorithms.td3.TD3(*args: Any, **kwargs: Any)

Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm.

Paper: https://arxiv.org/abs/1802.09477

Parameters:
  • observation_space (gym.spaces.Space) – Observation space of the environment

  • action_space (gym.spaces.Space) – Action space of the environment

  • O_U_noise (bool, optional) – Use Ornstein Uhlenbeck action noise for exploration. If False, uses Gaussian noise. Defaults to True

  • vect_noise_dim (int, optional) – Vectorization dimension of environment for action noise, defaults to 1

  • expl_noise (float, optional) – Scale for Ornstein Uhlenbeck action noise, or standard deviation for Gaussian exploration noise

  • mean_noise (float, optional) – Mean of exploration noise, defaults to 0.0

  • theta (float, optional) – Rate of mean reversion in Ornstein Uhlenbeck action noise, defaults to 0.15

  • dt (float, optional) – Timestep for Ornstein Uhlenbeck action noise update, defaults to 1e-2

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • hp_config (HyperparameterConfig, optional) – RL hyperparameter mutation configuration, defaults to None, whereby algorithm mutations are disabled.

  • net_config (dict, optional) – Network configuration, defaults to None

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr_actor (float, optional) – Learning rate for actor optimizer, defaults to 1e-4

  • lr_critic (float, optional) – Learning rate for critic optimizer, defaults to 1e-3

  • learn_step (int, optional) – Learning frequency, defaults to 5

  • gamma (float, optional) – Discount factor, defaults to 0.99

  • tau (float, optional) – For soft update of target network parameters, defaults to 0.005

  • normalize_images (bool, optional) – Flag to normalize images, defaults to True

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • policy_freq (int, optional) – Number of critic updates per policy (actor) update; the actor and target networks are updated once every policy_freq critic updates. Defaults to 2

  • actor_network (nn.Module, optional) – Custom actor network, defaults to None

  • critic_networks (list[nn.Module], optional) – List of two custom critic networks (one for each of the two critics), defaults to None

  • share_encoders (bool, optional) – Share encoders between actor and critic, defaults to False

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

action_noise() ndarray

Create action noise for exploration, either Ornstein Uhlenbeck or sampled from a normal distribution.

Returns:

Action noise

Return type:

np.ndarray

clean_up() None

Clean up the algorithm by deleting the networks and optimizers.

Returns:

None

Return type:

None

clone(index: int | None = None, wrap: bool = True) Self

Create a clone of the algorithm.

Parameters:
  • index (int | None, optional) – The index of the clone, defaults to None

  • wrap (bool, optional) – If True, wrap the models in the clone with the accelerator, defaults to True

Returns:

A clone of the algorithm

Return type:

EvolvableAlgorithm
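A usage sketch:

clone = agent.clone(index=1)  # Independent copy of the agent, e.g. for tournament selection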

static copy_attributes(agent: SelfEvolvableAlgorithm, clone: SelfEvolvableAlgorithm) SelfEvolvableAlgorithm

Copy the non-evolvable attributes of the algorithm to a clone.

Parameters:
  • agent (SelfEvolvableAlgorithm) – The agent to copy attributes from.

  • clone (SelfEvolvableAlgorithm) – The clone of the algorithm.

Returns:

The clone of the algorithm.

Return type:

SelfEvolvableAlgorithm

evolvable_attributes(networks_only: bool = False) dict[str, EvolvableModuleProtocol | ModuleDictProtocol | Optimizer | dict[str, Optimizer] | OptimizerWrapperProtocol]

Return the attributes related to the evolvable networks in the algorithm. Includes attributes that are either EvolvableModule or ModuleDict objects, as well as the optimizers associated with the networks.

Parameters:

networks_only (bool, optional) – If True, only include evolvable networks, defaults to False

Returns:

A dictionary of network attributes.

Return type:

dict[str, Any]

get_action(obs: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], training: bool = True, *args: Any, **kwargs: Any) ndarray

Return the next action to take in the environment. If training, random noise is added to the action to promote exploration.

Parameters:
  • obs (numpy.ndarray[float]) – Environment observation, or multiple observations in a batch

  • training (bool, optional) – Agent is training, use exploration noise, defaults to True

Returns:

Action

Return type:

numpy.ndarray[float]
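For example, at evaluation time exploration noise is typically disabled (continuing from the training example above):

agent.set_training_mode(False)
action = agent.get_action(obs, training=False)  # Deterministic action, rescaled internally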

static get_action_dim(action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the action space as it pertains to the underlying networks (i.e. the output size of the networks).

Parameters:

action_space (spaces.Space or list[spaces.Space].) – The action space of the environment.

Returns:

The dimension of the action space.

Return type:

tuple[int, ...].

get_lr_names() list[str]

Return the learning rates of the algorithm.

get_policy() EvolvableModuleProtocol

Return the policy network of the algorithm.

static get_state_dim(observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary]) tuple[int, ...]

Return the dimension of the state space as it pertains to the underlying networks (i.e. the input size of the networks).

Parameters:

observation_space (spaces.Space or list[spaces.Space].) – The observation space of the environment.

Returns:

The dimension of the state space.

Return type:

tuple[int, …].

property index: int

Return the index of the algorithm.

static inspect_attributes(agent: SelfEvolvableAlgorithm, input_args_only: bool = False) dict[str, Any]

Inspect and retrieve the attributes of the current object, excluding attributes related to the underlying evolvable networks (i.e. EvolvableModule, torch.optim.Optimizer) and with an option to include only the attributes that are input arguments to the constructor.

Parameters:
  • agent (SelfEvolvableAlgorithm) – The agent whose attributes to inspect.

  • input_args_only (bool) – If True, only include attributes that are input arguments to the constructor. Defaults to False.

Returns:

A dictionary of attribute names and their values.

Return type:

dict[str, Any]

learn(experiences: dict[str, ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]] | tuple[ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts], ...], noise_clip: float = 0.5, policy_noise: float = 0.2) tuple[float | None, float]

Update agent network parameters to learn from experiences.

Parameters:
  • experiences (dict[str, torch.Tensor[float]]) – TensorDict of batched observations, actions, rewards, next_observations, dones.

  • noise_clip (float, optional) – Maximum noise limit to apply to actions, defaults to 0.5

  • policy_noise (float, optional) – Standard deviation of noise applied to policy, defaults to 0.2

Returns:

Actor loss (None on steps where the actor is not updated) and critic loss

Return type:

tuple[float | None, float]
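Because the actor is only updated every policy_freq critic updates, the returned actor loss may be None. A usage sketch, continuing from the training example above:

experiences = memory.sample(agent.batch_size)
actor_loss, critic_loss = agent.learn(experiences)
if actor_loss is not None:  # the actor was updated on this step
    print(f"actor loss: {actor_loss:.4f}")
print(f"critic loss: {critic_loss:.4f}")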

classmethod load(path: str, device: str | device = 'cpu', accelerator: Accelerator | None = None) Self

Load an algorithm from a checkpoint.

Parameters:
  • path (string) – Location to load checkpoint from.

  • device (str, optional) – Device to load the algorithm on, defaults to ‘cpu’

  • accelerator (Accelerator | None, optional) – Accelerator object for distributed computing, defaults to None

Returns:

An instance of the algorithm

Return type:

RLAlgorithm

load_checkpoint(path: str) None

Load saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from

property mut: Any

Return the mutation object of the algorithm.

mutation_hook() None

Execute the hooks registered with the algorithm.

classmethod population(size: int, observation_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], action_space: Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary | list[Box | Discrete | MultiDiscrete | Dict | Tuple | MultiBinary], wrapper_cls: type[SelfAgentWrapper] | None = None, wrapper_kwargs: dict[str, Any] | None = None, **kwargs) list[Self | SelfAgentWrapper]

Create a population of algorithms.

Parameters:

size (int.) – The size of the population.

Returns:

A list of algorithms.

Return type:

list[SelfEvolvableAlgorithm].
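A usage sketch, reusing the observation and action spaces defined earlier on this page:

# Create a population of 4 TD3 agents for evolutionary HPO
pop = TD3.population(
    size=4,
    observation_space=observation_space,
    action_space=action_space,
)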

preprocess_observation(observation: ndarray | dict[str, ndarray] | tuple[ndarray, ...] | Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor] | Number | list[ReasoningPrompts]) Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]

Preprocesses observations for forward pass through neural network.

Parameters:

observation (ObservationType) – Observation of the environment

Returns:

Preprocessed observations

Return type:

torch.Tensor[float] or dict[str, torch.Tensor[float]] or tuple[torch.Tensor[float], …]

recompile() None

Recompiles the evolvable modules in the algorithm with the specified torch compiler.

register_mutation_hook(hook: Callable) None

Register a hook to be executed after a mutation is performed on the algorithm.

Parameters:

hook (Callable) – The hook to be executed after mutation.

register_network_group(group: NetworkGroup) None

Register a network group with the algorithm.

Parameters:

group (NetworkGroup) – The network group to register.

reinit_optimizers(optimizer: OptimizerConfig | None = None) None

Reinitialize the optimizers of an algorithm. If no optimizer is passed, all optimizers are reinitialized.

Parameters:

optimizer (OptimizerConfig | None, optional) – The optimizer to reinitialize, defaults to None, in which case all optimizers are reinitialized.

reset_action_noise(indices: ndarray) None

Reset action noise.

Parameters:

indices (np.ndarray) – Indices to reset

save_checkpoint(path: str) None

Save a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

set_training_mode(training: bool) None

Set the training mode of the algorithm.

Parameters:

training (bool) – If True, set the algorithm to training mode.

share_encoder_parameters() None

Shares the encoder parameters between the actor and critics. Registered as a mutation hook when share_encoders=True.

soft_update(net: EvolvableModule, target: EvolvableModule) None

Soft updates target network parameters towards the source network: θ_target ← τ·θ_net + (1 − τ)·θ_target, where τ is the tau hyperparameter.

Parameters:
  • net (EvolvableModule) – Source network whose parameters are copied from

  • target (EvolvableModule) – Target network to be updated
test(env: str | Env | VectorEnv | AsyncVectorEnv, swap_channels: bool = False, max_steps: int | None = None, loop: int = 3) float

Return mean test score of agent in environment, acting deterministically without exploration noise.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to None

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean. Defaults to 3

Returns:

Mean test score

Return type:

float
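A usage sketch:

mean_score = agent.test(env, max_steps=1000, loop=5)  # Mean score over 5 test episodes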

to_device(*experiences: Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor]) tuple[Tensor | TensorDict | tuple[Tensor, ...] | dict[str, Tensor], ...]

Move experiences to the device.

Parameters:

experiences (tuple[torch.Tensor[float], ...]) – Experiences to move to device

Returns:

Experiences on the device

Return type:

tuple[torch.Tensor[float], …]

unwrap_models() None

Unwraps the models in the algorithm from the accelerator.

wrap_models() None

Wrap the models in the algorithm with the accelerator.