65 experiments for RL research in the Ms. Pac-Man environment, plus single-file implementations of diffusion models, Gaussian processes, active inference, the Muon optimizer, pi-sigma product units, growing networks, eligibility traces, halting RNNs, and world models.
pacman/
├── README.md
├── utils/ # shared Atari env helpers
│ ├── env.py
│ └── atari100k_env.py
│
├── minagi_v00/ # era 0 — broad architecture sweep
│ ├── 2015NatureDQN.py
│ ├── bee_timing.py
│ ├── bob.py
│ ├── cv1.ipynb
│ ├── diffusion.ipynb
│ ├── dqn.py
│ ├── et.py
│ ├── fourier_diffusion.py
│ ├── gnn_pacman_dqn.py
│ ├── grow.py
│ ├── grow_atari.py
│ ├── grow_atari2.py
│ ├── grow_mnist_torch.py
│ ├── haltingrnn.ipynb / .py / haltingrnn2.py
│ ├── hopfield.py
│ ├── hrm_agent_mspacman.py
│ ├── masked_diffusion.py
│ ├── mem_rl_ms_pacman.py
│ ├── minagi/a1.py
│ ├── ms_pacman_ppo_transformer.py
│ ├── muondqn.py
│ ├── muon_dqn_tensorboard.py (+ copy)
│ ├── mushroombody.py
│ ├── novel.py
│ ├── pacman.py
│ ├── ppobaseline.py
│ ├── reinforce.py
│ ├── rnn_pacman.py
│ ├── spatial_attention_cnn.py
│ ├── vqvaeppo.py
│ ├── xnor.py
│ ├── random_dropout/random_replace.py
│ └── sobol/sobol.py
│
├── minagi_v01/ # era 1 — "noise" agents, Muon PPO, PACF
│ ├── readme
│ ├── MUONPPO.py / muon_ppo.py / muon.py
│ ├── ppo.py / sb_ppo_mspacman.py
│ ├── noise.py
│ ├── noise_agent.py / _2 / _3 / _4 / _muon.py
│ ├── noise_cnn.py / noise_td_conv.py
│ ├── pacf.py / pacf.ipynb / pacf_atari.py
│ ├── hac.py
│ ├── vqvae.py
│ ├── test.py
│ └── out/ · snapshots/ # output artifacts
│
├── v3/ # era 3 — Gaussian processes / PILCO
│ ├── gp.py · gp.ipynb
│ └── pixel_pilco.py
│
├── v4/ # era 4 — active inference / world models
│ ├── active_inference_mountaincar.py
│ ├── with_reward_f.py
│ └── worldmodel_agent.py
│
└── notebooks/ # scratch experiments + result figures
  ├── pacman.ipynb · new.ipynb · newnew.ipynb · view.ipynb
  ├── pisigma.ipynb
  ├── grow.ipynb
  ├── et.ipynb · trace.ipynb
  ├── forgetting.ipynb
  ├── free_energy.ipynb
  ├── idk.ipynb · misc.ipynb
  ├── *.png # ~31 result figures
  └── mnist_data/ · fashion_mnist_data/

utils/
- env.py: Minimal smoke test that creates an ALE/Gymnasium Ms. Pac-Man env and runs random episodes to verify setup and observation shapes.
- atari100k_env.py: Reusable Gymnasium + ALE env factory for Atari-100k-style preprocessing (frame skip, grayscale downsample, life-loss terminals, frame stacking, noop_max), with train/eval modes.
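
The frame-skip and frame-stacking steps listed for atari100k_env.py can be illustrated without an emulator. The following is a minimal sketch, not the code in that file: `FrameStack`, `skip_frames`, and `CountingEnv` are hypothetical names, and the toy environment returns integers in place of image frames.

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames; the first frame is repeated on reset."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        self.frames.clear()
        self.frames.extend([first_frame] * self.k)
        return tuple(self.frames)

    def push(self, frame):
        self.frames.append(frame)  # the deque drops the oldest frame itself
        return tuple(self.frames)


def skip_frames(step_fn, action, skip=4):
    """Repeat `action` for up to `skip` raw steps and sum the rewards.

    `step_fn(action)` is a hypothetical callable returning (frame, reward, done).
    The loop stops early if the episode ends inside the skip window.
    """
    total_reward, frame, done = 0.0, None, False
    for _ in range(skip):
        frame, reward, done = step_fn(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done


class CountingEnv:
    """Toy stand-in for an emulator: frames are integers, episode ends at 6."""

    def __init__(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 6


env = CountingEnv()
stack = FrameStack(k=4)
obs = stack.reset(0)
frame, reward, done = skip_frames(env.step, action=0, skip=4)
obs = stack.push(frame)
print(obs, reward, done)
```

A real factory would typically also handle grayscale conversion, resizing, no-op starts, and life-loss terminals; those are left out here to keep the sketch self-contained.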

minagi_v00/
- 2015NatureDQN.py: Faithful reproduction of Nature DQN (Mnih et al. 2015) on Ms. Pac-Man: RMSProp, reward clipping, standard 84×84 / frame-skip-4 / 1M-replay preprocessing.
- bob.py: DQN with CNN + Muon optimizer hybrid (Muon for 2D tensors, Aux-Adam for biases) on Ms. Pac-Man.
- muondqn.py: DQN with Muon optimizer on Ms. Pac-Man (small 50k replay, MPS support).
- muon_dqn_tensorboard.py (+ copy): DQN with Muon optimizer (1M replay, 50k warmup) plus TensorBoard logging.
- dqn.py: N-step SARSA with a similarity-based push-down memory buffer and spatial-attention visualization on Ms. Pac-Man.
- et.py: Emphatic TD (ETD) with a world-model auxiliary loss and episodic memory for Ms. Pac-Man.
- gnn_pacman_dqn.py: Graph-neural-network DQN that message-passes over a 4-neighborhood grid graph from a patch encoder.
- hopfield.py: Modern Hopfield memory networks for Q-learning, bipolar image encoding, per-action memory banks.
- mem_rl_ms_pacman.py: Hybrid memory-attention agent: soft-kNN over external memory (key = CNN latent, value = Q-vector) with TD(λ).
- hrm_agent_mspacman.py: Hierarchical Reasoning Model: two recurrent timescales (high- + low-level GRU) trained with REINFORCE.
- spatial_attention_cnn.py: A2C with CNN + spatial attention + LSTM, emitting attention heatmaps over the frame.
- ms_pacman_ppo_transformer.py / novel.py: PPO with a CNN backbone + lightweight spatial-Transformer encoder over spatial tokens.
- mushroombody.py: PPO with n-step (n=5) rollouts and GAE on grayscale 84×84 Ms. Pac-Man.
- ppobaseline.py: Plain PPO baseline (last-frame obs, CNN + global-average-pool heads), TensorBoard logging.
- pacman.py: Actor-critic on raw 210×160 RGB with a 3-layer CNN + dense policy/value heads.
- reinforce.py: REINFORCE policy gradient with a 6-layer dense net on flattened 80×80 obs, temperature-decayed sampling.
- rnn_pacman.py: PPO with CNN + GRU (fixed ponder steps), vectorized across 8 envs.
- haltingrnn.py / haltingrnn2.py / haltingrnn.ipynb: PPO + GRU with ACT-style learned halting (adaptive computation time / ponder steps).
- bee_timing.py: Interval-timing task: a halting agent learns to PROBE at precise multiples of 16 timesteps from sparse reward.
- grow.py: Dynamic-depth MLP (per-neuron eligibility traces, growth/pruning) on synthetic moons/XOR/AND.
- grow_mnist_torch.py: Dynamic-depth MNIST classifier that grows/prunes residual blocks on a validation-loss threshold.
- grow_atari.py / grow_atari2.py: DQN with a growing/pruning residual head on Ms. Pac-Man; v2 adds RND curiosity intrinsic rewards.
- vqvaeppo.py: PPO + EMA VQ-VAE with commitment-error intrinsic reward and a temporal-delta prediction auxiliary loss.
- diffusion.ipynb / masked_diffusion.py: Masked diffusion (FiLM-conditioned tiny UNet) trained on FashionMNIST/CIFAR/CelebA/STL10 via gradual corruption.
- fourier_diffusion.py: Diffusion in a DCT/Fourier-feature latent space, time-conditioned, on FashionMNIST.
- xnor.py: Parity/XOR benchmark: ReLU MLP vs. learnable product-pooling trees on {−1,+1}ᵈ vectors.
- random_dropout/random_replace.py: Standard dropout vs. random-replacement dropout on CIFAR-10/100 with ResNet-18.
- sobol/sobol.py: Sobol quasi-random vs. Xavier weight init on an LSTM time-series task.
- cv1.ipynb: Basic computer-vision tutorial (convolution filters, image ops).
- minagi/a1.py: Config dataclass for A2C/PPO (attention heads, RNN latent, replay + episodic buffers, ETD emphasis).
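
Several of the PPO scripts above (mushroombody.py, for example) mention GAE. As a reference point, here is a minimal pure-Python sketch of generalized advantage estimation over one rollout. The function name `gae` and the default `gamma`/`lam` values are assumptions for illustration, not values taken from the scripts.

```python
def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single rollout.

    rewards[t], values[t], dones[t] describe step t; `last_value` is the
    critic's estimate for the state that follows the final step.
    Returns (advantages, returns), where returns[t] = advantages[t] + values[t].
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # no bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# With gamma = lam = 1 and a zero critic, advantages reduce to rewards-to-go.
adv, ret = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, True],
    last_value=5.0,  # ignored here because the last step is terminal
    gamma=1.0,
    lam=1.0,
)
print(adv)
```

Setting `lam=0` collapses the estimate to the one-step TD error, while `lam=1` gives the Monte Carlo return minus the value baseline.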

minagi_v01/
- readme: v0.1 notes: standardize returns to mean-0, explore autocorrelation between past signals and latent observations, KL in PPO, Mahalanobis-distance novelty reward, and the observation that motion-defined objects need temporal information.
- muon.py: Muon optimizer implementation (orthogonalized SGD-momentum via Newton-Schulz), single-device + distributed variants with AuxAdam.
- MUONPPO.py: PPO with the Muon optimizer on Ms. Pac-Man.
- muon_ppo.py: Baseline PPO (standard momentum actor-critic) for Ms. Pac-Man.
- ppo.py: Standard PPO matching Mnih et al. (2016) Atari hyperparameters (CNN actor-critic, GAE, KL clip, frame stack).
- sb_ppo_mspacman.py: PPO via Stable-Baselines3 with Atari preprocessing + frame stack, wandb logging.
- noise.py: Custom Gymnasium "noise-on-noise" env: a moving circle on a static Bernoulli-noise background; reward from position prediction.
- noise_agent.py: PPO on the noise env; CNN policy outputs continuous 2D predictions with learnable log-std and entropy regularization.
- noise_agent_2.py: Two-frame-stacked coordinate predictor (predict next circle position from two consecutive frames).
- noise_agent_3.py: CNN policy predicting 2D object coordinates via average-reward REINFORCE with a tanh-squashed Gaussian policy.
- noise_agent_4.py: Faithful Nature-2015 DQN on Ms. Pac-Man with wandb, uint8 replay frames.
- noise_agent_muon.py: Same Nature DQN but with the Muon optimizer replacing Adam.
- noise_cnn.py: Inductive-bias benchmark: learnable conv vs. fixed-random vs. resampled-random non-overlapping convs on MNIST.
- noise_td_conv.py: Visualizes white-noise kernel response in Ms. Pac-Man frames.
- pacf.py / pacf.ipynb: PACF-based credit assignment: data-driven temporal-autocorrelation-weighted returns vs. exponential TD(λ), on MiniGrid Key-Corridor.
- pacf_atari.py: PACF credit assignment inside DQN for Ms. Pac-Man (multi-step returns weighted by PACF lags via OLS).
- hac.py: Heteroscedastic auto-correlation-aware recurrent actor-critic (CNN→GRU) with Newey-West variance preconditioning and IACT diagnostics.
- vqvae.py: VQ-VAE v1 with straight-through estimator and codebook perplexity, on small 32×32 images.
- test.py: Empty template.
- out/, snapshots/: Output figures and saved frames.
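
The PACF entries above rely on partial autocorrelations of a signal. The sketch below shows one standard way to compute them, the Durbin-Levinson recursion over sample autocorrelations. This is a swapped-in technique for illustration: pacf_atari.py is described as using OLS, and how the scripts turn PACF values into return weights is not shown here. The function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Sample autocorrelation r[0..max_lag] of a sequence (r[0] is 1)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / var
        for k in range(max_lag + 1)
    ]


def pacf_from_acf(r):
    """Partial autocorrelations from autocorrelations via Durbin-Levinson.

    r[k] is the autocorrelation at lag k, with r[0] == 1.
    Returns a list p where p[k] is the partial autocorrelation at lag k.
    """
    pacf = [1.0]
    phi_prev = []  # phi_prev[j - 1] holds phi_{k-1, j}
    for k in range(1, len(r)):
        num = r[k] - sum(phi_prev[j - 1] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(phi_prev[j - 1] * r[j] for j in range(1, k))
        phi_kk = num / den
        # phi_{k, j} = phi_{k-1, j} - phi_{k, k} * phi_{k-1, k-j}
        phi = [phi_prev[j - 1] - phi_kk * phi_prev[k - j - 1] for j in range(1, k)]
        phi.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf


# An AR(1) process with coefficient 0.5 has autocorrelation 0.5**k, so its
# partial autocorrelation is 0.5 at lag 1 and zero at every higher lag.
ar1_pacf = pacf_from_acf([1.0, 0.5, 0.25, 0.125])
print(ar1_pacf)

rewards = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # made-up signal
print(pacf_from_acf(autocorrelation(rewards, max_lag=3)))
```

Unlike the exponential weighting in TD(λ), the partial autocorrelation at lag k isolates the dependence that remains after shorter lags are accounted for, which is the property the PACF entries appeal to.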

v3/
- gp.py: PyTorch port of Deisenroth's PILCO: GP dynamics model, trig augmentation, control saturation, cart-pole sim, GP-policy trajectory optimization.
- gp.ipynb: Tutorial on GPs as distributions over functions (Brownian motion, exponentiated-quadratic kernel, prior sampling).
- pixel_pilco.py: Pixel-space PILCO for Ms. Pac-Man: CNN autoencoder + latent world model (dynamics + reward), policy optimized by gradient planning through the model.
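
The prior sampling with an exponentiated-quadratic kernel covered in gp.ipynb can be sketched in plain Python. This is an illustrative sketch, not the notebook's code: the function names, the input grid, the unit hyperparameters, and the diagonal jitter of 1e-6 are all assumptions.

```python
import math
import random


def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Exponentiated-quadratic kernel k(x1, x2)."""
    return variance * math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def kernel_matrix(xs, jitter=1e-6):
    """Gram matrix with a small diagonal jitter for numerical stability."""
    n = len(xs)
    return [
        [rbf_kernel(xs[i], xs[j]) + (jitter if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_prior(xs, rng):
    """Draw f ~ N(0, K) at the inputs xs as f = L z with z ~ N(0, I)."""
    low = cholesky(kernel_matrix(xs))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
K = kernel_matrix(xs)
L = cholesky(K)
sample = sample_prior(xs, random.Random(0))
print(sample)
```

Nearby inputs get strongly correlated function values because their kernel entries are close to the variance, so each draw looks like a smooth curve. A shorter `length_scale` would make the draws wigglier.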

v4/
- active_inference_mountaincar.py: Active-inference agent (on CartPole) with capped model learning so prediction error persists in unexplored regions and drives epistemic/curiosity exploration.
- with_reward_f.py: Active inference on Hopper: learned generative model + Cross-Entropy-Method planning that minimizes expected free energy over imagined trajectories.
- worldmodel_agent.py: Active-inference agent on Ms. Pac-Man with a GRU world model (encoder/dynamics/decoder/prior/done heads); optimizes policy on imagined rollouts by minimizing free energy and never sees the environment reward.
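
The Cross-Entropy-Method planning mentioned for with_reward_f.py can be sketched on a toy problem. Everything below is an assumption made for illustration: the 1D dynamics, the quadratic cost standing in for expected free energy, and the population, elite, and iteration counts. It is not the planner in that file.

```python
import random


def rollout_cost(actions, x0=0.0, goal=1.0):
    """Toy imagined rollout: x moves by the clipped action at each step.

    The summed squared distance to `goal` is a placeholder for the expected
    free energy a learned generative model would provide.
    """
    x, cost = x0, 0.0
    for a in actions:
        x += max(-1.0, min(1.0, a))
        cost += (x - goal) ** 2
    return cost


def cem_plan(cost_fn, horizon=5, iters=5, pop=64, n_elite=8, seed=0):
    """Cross-Entropy Method over open-loop action sequences."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    best_plan, best_cost = list(mean), cost_fn(mean)
    for _ in range(iters):
        # Sample candidate plans from the current diagonal Gaussian.
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)
        ]
        samples.sort(key=cost_fn)
        elites = samples[:n_elite]
        elite_cost = cost_fn(elites[0])
        if elite_cost < best_cost:
            best_plan, best_cost = elites[0], elite_cost
        # Refit the sampling distribution to the lowest-cost plans.
        mean = [sum(e[t] for e in elites) / n_elite for t in range(horizon)]
        std = [
            max(1e-3, (sum((e[t] - mean[t]) ** 2 for e in elites) / n_elite) ** 0.5)
            for t in range(horizon)
        ]
    return best_plan, best_cost


plan, cost = cem_plan(rollout_cost)
baseline = rollout_cost([0.0] * 5)  # cost of doing nothing
print(cost, baseline)
```

In a model-predictive loop, only the first action of `plan` would usually be executed before replanning from the next state.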

notebooks/
- pacman.ipynb: Full DQN for Ms. Pac-Man (3-conv CNN → 512 FC, ε-greedy, replay, target net).
- new.ipynb / newnew.ipynb: PPO actor-critic with an IMPALA ResNet encoder and GAE / n-step returns on Ms. Pac-Man.
- view.ipynb: Loads a PPO-Transformer checkpoint and plays one evaluation episode.
- pisigma.ipynb: MLP vs. pi-sigma / product-unit networks on MNIST, with numerically stable log-space products.
- grow.ipynb: Growing networks with dynamic neuron addition (IMPALA-style ResNet encoders) on MNIST/FashionMNIST.
- et.ipynb / trace.ipynb: Actor-critic with per-parameter eligibility traces on a contextual bandit (local × global modulatory learning, no BPTT).
- forgetting.ipynb: Catastrophic-forgetting demo on arithmetic facts (accuracy on "ones" collapses after retraining on "twos").
- free_energy.ipynb: Theory write-up of variational free energy / active inference (ELBO, VAE decomposition, action-conditioned world models).
- idk.ipynb: Ms. Pac-Man env exploration (setup, random actions, observation space).
- misc.ipynb: Quick numpy tutorial.
- *.png: ~31 result figures (ablations, MNIST/FashionMNIST comparisons, pi-sigma analyses, weight/activation/gradient stats, scaling studies, XOR/two-moons).
- mnist_data/, fashion_mnist_data/: Downloaded datasets.
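
The "numerically stable log-space products" noted for pisigma.ipynb can be illustrated with a single pi-sigma unit, which multiplies the outputs of several linear (sigma) units. The sketch below tracks the sign and the log-magnitude of the product separately. The function names and the example weights are hypothetical and are not taken from the notebook.

```python
import math


def sigma_units(x, weights, biases):
    """Linear summing units: one weighted sum per row of `weights`."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)
    ]


def signed_log_product(values):
    """Return (sign, log|product|) without forming the product directly.

    A zero factor gives (0, -inf). Summing logs avoids the underflow or
    overflow that a long chain of multiplications can hit.
    """
    sign, log_mag = 1, 0.0
    for v in values:
        if v == 0.0:
            return 0, float("-inf")
        if v < 0.0:
            sign = -sign
        log_mag += math.log(abs(v))
    return sign, log_mag


def pi_sigma_forward(x, weights, biases):
    """Output of one pi-sigma unit: the product of its sigma units.

    math.exp can still overflow for very large log-magnitudes, so a caller
    that only needs a loss may prefer to stay in log space.
    """
    sign, log_mag = signed_log_product(sigma_units(x, weights, biases))
    return sign * math.exp(log_mag) if sign else 0.0


x = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # sigma outputs: 1, 2, -1
biases = [0.0, 0.0, 0.0]
y = pi_sigma_forward(x, weights, biases)
print(y)

# Three tiny factors underflow to 0.0 when multiplied directly,
# but their log-magnitude stays finite.
tiny = [1e-200, 1e-200, 1e-200]
print(math.prod(tiny), signed_log_product(tiny))
```

Keeping the sign separate matters because the logarithm is only defined for positive magnitudes, while sigma-unit outputs can be negative.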