[rllib] QMIX doesn't learn anything

I've already posted this question on Stack Overflow, but since I didn't get an answer there I'm reposting it here (https://stackoverflow.com/questions/61523164/ray-rllib-qmix-doesnt-learn-anything)

I wanted to try out the QMIX implementation in the Ray/RLlib library, but there must be something wrong with how I'm using it, because it doesn't seem to learn anything. Since I'm new to Ray/RLlib, I started with the "TwoStepGame" example that the library provides in its GitHub repo (https://github.com/ray-project/ray/blob/master/rllib/examples/twostep_game.py) to understand how to use it. Since this example was a bit too complex for me as a starting point, I reduced it to an example that is as simple as possible. Problem: QMIX doesn't seem to learn, meaning the resulting reward pretty much matches the expected value of a random policy.

Let me explain the idea of my very simple experiment. We have 2 agents. Each agent can take 3 actions (`Discrete(3)`). If an agent takes action 0, it gets a reward of 0.5; otherwise it gets 0. So this should be a very simple task, since the best policy is to always take action 0.

Here is my implementation:





    from gym.spaces import Tuple, MultiDiscrete, Dict, Discrete
    import numpy as np

    import ray
    from ray import tune
    from ray.tune import register_env, grid_search
    from ray.rllib.env.multi_agent_env import MultiAgentEnv
    from ray.rllib.agents.qmix.qmix_policy import ENV_STATE


    class TwoStepGame(MultiAgentEnv):
        action_space = Discrete(3)

        def __init__(self, env_config):
            self.counter = 0

        def reset(self):
            return {0: {'obs': np.array([0]), 'state': np.array([0])},
                    1: {'obs': np.array([0]), 'state': np.array([0])}}

        def step(self, action_dict):
            self.counter += 1
            move1 = action_dict[0]
            move2 = action_dict[1]
            reward_1 = 0
            reward_2 = 0
            if move1 == 0:
                reward_1 = 0.5
            if move2 == 0:
                reward_2 = 0.5

            obs = {0: {'obs': np.array([0]), 'state': np.array([0])},
                   1: {'obs': np.array([0]), 'state': np.array([0])}}
            done = False
            if self.counter > 100:
                self.counter = 0
                done = True

            return obs, {0: reward_1, 1: reward_2}, {"__all__": done}, {}


    if __name__ == "__main__":

        grouping = {"group_1": [0, 1]}

        obs_space = Tuple([
            Dict({
                "obs": MultiDiscrete([2]),
                ENV_STATE: MultiDiscrete([3])
            }),
            Dict({
                "obs": MultiDiscrete([2]),
                ENV_STATE: MultiDiscrete([3])
            }),
        ])

        act_space = Tuple([
            TwoStepGame.action_space,
            TwoStepGame.action_space,
        ])

        register_env("grouped_twostep",
            lambda config: TwoStepGame(config).with_agent_groups(
                grouping, obs_space=obs_space, act_space=act_space))

        config = {
            "mixer": grid_search(["qmix"]),
            "env_config": {
                "separate_state_space": True,
                "one_hot_state_encoding": True
            },
        }

        ray.init(num_cpus=1)
        tune.run(
            "QMIX",
            stop={
                "timesteps_total": 100000,
            },
            config=dict(config, **{
                "env": "grouped_twostep",
            }),
        )





And here is the output when I run it for 100,000 timesteps:



    +----------------------------+------------+-------+---------+--------+------------------+--------+----------+
    | Trial name                 | status     | loc   | mixer   |   iter |   total time (s) |     ts |   reward |
    |----------------------------+------------+-------+---------+--------+------------------+--------+----------|
    | QMIX_grouped_twostep_00000 | TERMINATED |       | qmix    |    100 |          276.796 | 101000 |   33.505 |
    +----------------------------+------------+-------+---------+--------+------------------+--------+----------+



    Process finished with exit code 0




As you can see, the policy seems to be random: the expected reward per timestep of a random policy is 1/3 (2 agents × 0.5 × 1/3), and the resulting reward is 33.505 (because I reset the environment roughly every 100 timesteps).
My question: what am I not understanding? There must be something wrong with my configuration, or maybe with my understanding of how RLlib works. But since the best policy is very, very simple (just always take action 0), it seems to me like this algorithm cannot learn.
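To make the baseline explicit, here is a small sketch that reproduces the reward rule from my environment without Ray and computes what a uniformly random policy and the optimal policy should score per episode. It assumes the reported `reward` is the per-episode sum over both agents, and it uses an episode length of 101 steps, because `step()` only sets `done` once `counter > 100`. The function and constant names are just for illustration.

```python
import random

NUM_AGENTS = 2
NUM_ACTIONS = 3
REWARD_FOR_ACTION_0 = 0.5
# step() sets done once counter > 100, i.e. on the 101st step of an episode.
EPISODE_LENGTH = 101


def expected_random_return():
    """Analytic episode return of a uniformly random policy."""
    per_step = NUM_AGENTS * REWARD_FOR_ACTION_0 / NUM_ACTIONS
    return per_step * EPISODE_LENGTH


def optimal_return():
    """Episode return when both agents always pick action 0."""
    return NUM_AGENTS * REWARD_FOR_ACTION_0 * EPISODE_LENGTH


def simulate_random_return(num_episodes=2000, seed=0):
    """Monte Carlo estimate of the random-policy episode return."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        for _ in range(EPISODE_LENGTH):
            for _ in range(NUM_AGENTS):
                if rng.randrange(NUM_ACTIONS) == 0:
                    total += REWARD_FOR_ACTION_0
    return total / num_episodes


if __name__ == "__main__":
    print("random (analytic): ", expected_random_return())
    print("random (simulated):", simulate_random_return())
    print("optimal:           ", optimal_return())
```

Under these assumptions the random baseline is 101/3 (about 33.67) and the optimum is 101, so the 33.505 I get from QMIX sits right at the random baseline.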


software | version
--- | ---
ray | 0.8.4
python | 3.6.9
tensorflow | 1.14.0
OS | Ubuntu 18.04 (running in a VM on Windows)
