Posts on Scott Jeen

20s

Mon, 31 Mar 2025 21:15:18 +0100

I turned 30 today. Here are some particularly important moments from the last decade.

Highs

drives to KB on frosty mornings with hamilton, hayden and adam
lcd soundsystem on the west side highway with the roof down
the beginning of infinity
glasto nights with dom, deirdre and cat
nablus with muath, fin, calum, dom and hayden
francis and simone’s wedding
david silver’s lectures and the first reading of sutton and barto
sunny evening walks with cat around cambridge
submitting the phd
lunch conversations at fdm
late night snacks in brooklyn after a double date
sherkin island with cat
ollie worrall on the 16th green at aldeburgh
ashby lab with timo and josh
the first drive down the backs past king’s college

Lows

chancery lane with lucy

Searching for useful problems

Mon, 26 Aug 2024 19:36:27 +0200

This post is based on a talk I gave to my research group at Cambridge University in June 2024. You can find the notes and slides for the talk here.

Solutions to most problems aren’t particularly useful. Solutions to a small number of problems are extremely useful. If you’re interested in doing good, you’ll want to search for problems that look like the latter. It’s a hard search, but the ROI is likely greater than any other use of your time, and I have some ways of running it that I think make it a little more tractable.

A solution to a problem provides some gain in exchange for some cost. Less useful problems are those whose solutions incur costs that are equal to or higher than the expected gains. Useful problems are those whose solutions provide gains that greatly outweigh the costs. The former are zero-sum games, and the latter are positive-sum games. Chess is the canonical zero-sum game: every gain in the position of Player A is a loss in the position of Player B. Trade is the canonical positive-sum game: when countries specialise in products they make most cheaply, and trade them for those they make less cheaply, both benefit. Maximising good necessitates working on problems with positive-sum solutions. Indeed, your goal should be to find problems whose solutions are maximally positive-sum.

Identifying these problems is difficult for two reasons: 1) it’s hard to predict expected gain, and 2) it’s hard to predict expected cost (relative to the cost incurred by others working on the same problem¹). Below I discuss methods for dealing with each.

Moral axioms and proxies (or evaluating the expected gain from a solved problem)

Finding problems with solutions that maximise expected gain means knowing what you’d like to gain. It’s easy to convince yourself that you know this already, but, often, what you think you’d like to gain only approximates a more deeply held belief. You’ll know you’ve found these beliefs when they feel axiomatic, like unchallengeable truths for which you don’t need to provide justification. One might be that maximising net happiness is a good thing. Another might be that fulfilling the desires of the maximum number of people is a good thing. To maximise expected gain, you’ll want to solve problems that contribute to these moral axioms as much as possible².

Your moral axioms are problems that you can’t access directly (I can’t click my fingers and make everyone happier). Instead, you access them through proxies—problems that, if solved, partially contribute to your moral axiom. These proxies will have their own sub-problems with solutions that partially contribute to their solution. Each moral axiom has an associated tree of proxies, as in Figure 1.

Figure 1. Moral axioms and proxies. Starting with a moral axiom (a moral belief that requires no justification) we build a tree of sub-problems or proxies. Each proxy has a solution which part-solves the proxy from the previous level in the tree. The degree to which the previous proxy is solved by the current proxy is approximated by its value (a number between 0 and 1). For a given proxy, its expected value or proxy score with respect to the moral axiom is found by multiplying each of the values you traverse to find it in the tree.

Proxies contribute to other proxies to different degrees. In Figure 1, I consider incentivising solar PV as a proxy for mitigating climate change, which is itself a proxy for increasing average happiness. I provide two solutions: a) new panels with improved conversion efficiency, or b) cheaper (i.e. more) grid-scale batteries which allow generators to reliably sell excess energy to the grid. Here, I assume reliable income for surplus energy better incentivises solar PV than improved panel efficiency, so b) contributes more to the proxy than a). These contributions are summarised by their value with respect to the proxy above: a number between 0 and 1, where 0 represents no contribution and 1 constitutes a full solution to the problem. The expected value, or proxy score, of a solution with respect to the moral axiom is found by multiplying each of the values you traverse to find it in the tree. It’s called the proxy score because it measures how well the problem approximates the real problem you want to solve, i.e. your moral axiom. Proxies are ranked using these scores.
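To make the scoring rule concrete, here is a minimal sketch of the multiplication along a path in the tree. The tree layout, the function names and every value below are hypothetical placeholders I chose for illustration; they are not the numbers from Figure 1.

```python
from math import prod

# Hypothetical tree: each proxy maps to (parent, value with respect to that parent).
# The values are illustrative guesses, not figures taken from the post.
TREE = {
    "mitigate climate change": ("increase average happiness", 0.3),
    "incentivise solar PV": ("mitigate climate change", 0.2),
    "improve panel efficiency": ("incentivise solar PV", 0.1),
    "cheaper grid-scale batteries": ("incentivise solar PV", 0.4),
}


def proxy_score(proxy, tree):
    """Multiply the values on the path from `proxy` up to the moral axiom."""
    values = []
    while proxy in tree:  # the moral axiom itself has no parent entry
        proxy, value = tree[proxy]
        values.append(value)
    return prod(values)


def rank(tree):
    """Return all proxies sorted by descending proxy score."""
    return sorted(tree, key=lambda p: proxy_score(p, tree), reverse=True)


batteries = proxy_score("cheaper grid-scale batteries", TREE)  # 0.4 * 0.2 * 0.3
panels = proxy_score("improve panel efficiency", TREE)         # 0.1 * 0.2 * 0.3
```

Under these made-up values the battery proxy outranks the panel proxy, which matches the assumption in the text that b) contributes more than a).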

Building the ground truth tree of proxies and values for a given moral axiom is clearly intractable. The best you can do is create as exhaustive a list of proxies as you can, and score them to the best of your current knowledge. What’s nice, however, is that building this tree is a lifelong project, and the tree can always be improved. As you learn more, you’ll identify new proxies and score them, or refine scores for old proxies, which will re-rank them. The trick is being willing to switch proxies when your ranking updates.

Being the (Pareto) best in the world (or evaluating the expected cost of solving a problem)

If you are the best in the world at something you are uniquely positioned to solve its problems cheaply. Usain Bolt’s training and physique meant he was uniquely capable of running the 100m faster than anyone who came before. Marie Curie’s knowledge of radioactivity meant she was uniquely capable of proposing radiation therapy as a form of cancer treatment. But being the best in the world at something is hard. It is much easier to be the best in the world at a combination of things.

In the simplest case you can think about how you compare to others at a combination of two skills, as in Figure 2. Here, people are plotted with respect to their gardening and public speaking skill. I assume Mandela is history’s best public speaker and (for simplicity) knows nothing of gardening, and vice versa for Alan Titchmarsh. Between them lie three people, Jane, Julia and John, each of whom has a different combination of the two skills. Together these five make a Pareto front of gardening/public speaking knowledge.

Figure 2. Being the (Pareto) best in the world³. Five people plotted with respect to their gardening and public speaking skill. Problems reveal themselves to each person depending on their unique combination of skills.

Each person’s position on the Pareto front illuminates problems that are cheaply accessible to them. The further a person is from their neighbours, the more problems they can access. In practice, it’s difficult to be the best in the world across the intersection of two skills, but you may find you are the best in the world across the intersection of three, four or five. There may be nobody in the world better than Julia at gardening, public speaking, Catalan, and Rust programming. The limit case is the intersection of all of your skills, and at that you are definitively unmatched. There is nobody in the world better at being you than you.
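A Pareto front is easy to compute for a handful of people. The sketch below is only illustrative: the skill scores are invented, and "Sam" is a hypothetical extra person I added to show what being off the front looks like.

```python
def pareto_front(people):
    """Return the names of people who are not dominated by anyone else.

    `people` maps a name to a tuple of skill levels (higher is better).
    """
    def dominates(a, b):
        # a dominates b if a is at least as good on every skill and strictly better on one
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return sorted(
        name
        for name, skills in people.items()
        if not any(dominates(other, skills) for o, other in people.items() if o != name)
    )


# Hypothetical (gardening, public speaking) scores.
people = {
    "Titchmarsh": (10, 0),
    "Jane": (8, 3),
    "Julia": (5, 6),
    "John": (2, 8),
    "Mandela": (0, 10),
    "Sam": (4, 4),  # worse than Julia on both skills, so not on the front
}

front = pareto_front(people)
```

The same function works unchanged for three or more skills, since it compares tuples of any length.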

Your position on the real Pareto skill front changes as you learn new skills. The number of new problems cheaply accessible to you depends on what these new skills are. If you are a native English speaker, extending your English vocabulary won’t differentiate you much from other fluent English speakers, and is unlikely to unlock new problems that you can’t currently access. Learning Estonian places you in a much smaller cohort of English-Estonian speakers, and opens up some problems that are communicated only in Estonian.

Consider carefully what new skills would maximally differentiate you from your neighbours on the Pareto front, and learn them to increase the number of problems you can solve cheaply. The best skills to learn are those that illuminate problems with highest proxy score as established from your moral axioms and proxies.

Key takeaways

Think deeply about your moral axioms (beliefs that require no justification), and build a tree of proxies (sub-problems) and their values that is as exhaustive as possible.
Rank proxies by their proxy score (expected value).
Think about what combination of skills you are Pareto best at, and find the proxy with the highest score that requires these (i.e. for which you have a comparative advantage in solving).
Think about new skills that, if learned, would unlock proxies with higher proxy scores. Learn them.

NeurIPS 2022

Thu, 26 Jan 2023 21:08:05 +0000

I was fortunate to attend NeurIPS in New Orleans in November. Here, I publish my takeaways to give you a feel for the zeitgeist. I’ll discuss, firstly, the papers, then the workshops, and finally, and briefly, the keynotes.

Papers

Here’s a ranked list of my top 8 papers. Most are on Offline RL, which is representative of the conference writ large.

1. Does Zero-Shot Reinforcement Learning Exist? (Touati et al., 2022)

Key idea. To do zero-shot RL, we need to learn a general function from reward-free transitions that implicitly encodes the trajectories of all optimal policies for all tasks. The authors propose to learn two functions: $F_\theta(s)$ and $B_\phi(s)$ that encode the future and past of state $s$. We want to learn functions that always find a route from $s \rightarrow s'$.

Implication(s):

They beat all previous zero-shot RL algorithms on the standard offline RL tasks, and approach the performance of online, reward-guided RL algorithms in some envs.

Misc thoughts:

It seems clear that zero-shot RL is the route to real world deployment for RL. This work represents the best effort I’ve seen in this direction. I’m really excited by it and will be looking to extend it in my own future work.

2. Large-Scale Retrieval for Reinforcement Learning (Humphreys et al., 2022)

Key idea. Assuming access to a large offline dataset, we perform a nearest neighbours search over the dataset w.r.t. the current state, and append the retrieved states, next actions, rewards and final states (in the case of Go) to the current state. The policy then acts w.r.t. this augmented state.

Implication(s):

Halves compute required to achieve the baseline win-rate in Go.

Misc thoughts:

This represents the most novel approach to offline RL I’ve seen; most techniques separate the offline and online learning phases, but here the authors combine them elegantly.
To me this feels like a far more promising approach to offline RL than CQL etc.
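To illustrate the retrieval step described in the key idea, here is a rough sketch using a brute-force nearest neighbour search. It is a simplification under my own assumptions: the record layout, the function name and the Euclidean distance are hypothetical, and the paper's learned retrieval at scale is replaced by a plain sort.

```python
import math


def retrieve_and_augment(state, dataset, k=2):
    """Append the k nearest stored (state, action, reward) records to `state`."""
    neighbours = sorted(dataset, key=lambda rec: math.dist(state, rec["state"]))[:k]
    augmented = list(state)
    for rec in neighbours:
        augmented.extend(rec["state"])
        augmented.extend([rec["action"], rec["reward"]])
    return augmented


# Hypothetical offline dataset of three transitions.
dataset = [
    {"state": (0.0, 0.0), "action": 1, "reward": 0.5},
    {"state": (5.0, 5.0), "action": 0, "reward": -1.0},
    {"state": (1.0, 0.0), "action": 2, "reward": 1.0},
]

# The policy would then act on this longer, augmented vector.
augmented = retrieve_and_augment((0.9, 0.1), dataset, k=2)
```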

3. The Phenomenon of Policy Churn (Schaul et al., 2022)

Key idea. When a value-based agent acts greedily, the policy updates by a surprising amount per gradient step e.g. in up to 10% of states in some cases.

Implication(s):

Policy churn means that $\epsilon$-greedy exploration may not be required as a rapidly changing policy induces enough noise into the data distribution that exploration may be implicit.

Misc thoughts:

Their paper is structured in a really engaging way.
I liked their ML researcher survey which quantified how surprising their result was to experts.
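One way to picture policy churn is as the fraction of states whose greedy action changes between two consecutive Q-value estimates. The sketch below measures that on two tiny, made-up Q tables; the function names and numbers are hypothetical and this is not the authors' measurement code.

```python
def greedy_policy(q_table):
    """Map each state to its argmax action under a {state: {action: value}} table."""
    return {s: max(actions, key=actions.get) for s, actions in q_table.items()}


def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action differs between two Q tables."""
    before, after = greedy_policy(q_before), greedy_policy(q_after)
    changed = sum(before[s] != after[s] for s in before)
    return changed / len(before)


# Hypothetical Q values before and after one gradient step.
q_before = {"s0": {"left": 1.0, "right": 0.9}, "s1": {"left": 0.2, "right": 0.8}}
q_after = {"s0": {"left": 0.95, "right": 1.0}, "s1": {"left": 0.1, "right": 0.9}}

# s0 flips from "left" to "right"; s1 stays on "right".
churn = policy_churn(q_before, q_after)
```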

4. MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (Fan et al., 2022)

Key idea. An internet-scale benchmark for generalist RL agents. 1000s of tasks, and a limitless procedurally-generated world for training.

Implication(s):

Provides a sufficiently diverse and complex sandbox for training more generally capable agents.

Misc thoughts:

This is an amazing feat of software development from a relatively small team. Jim Fan is so cool!

5. Exploration via Elliptical Episodic Bonuses (Henaff et al., 2022)

Key Idea. Guided exploration is often performed by providing the agent reward inversely proportional to the state visitation count i.e. if you haven’t visited this state much you receive added reward. This works for discrete state spaces, but in continuous state spaces each visited state is ~ unique. Here, the authors parameterise ellipses around visited states, specifying a region of nearby states, outside of which the agent receives added reward.

Implication(s):

Better exploration means SOTA on the MiniHack suite of envs.
Strong performance on reward-free exploration tasks i.e. this is a really good way of thinking about exploration.

Misc. thoughts:

I really liked the elegance of this idea. A good example of simple, well-examined ideas being useful to the community.
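As a rough illustration of an elliptical bonus, the sketch below scores a feature vector by $\phi^\top C^{-1} \phi$, where $C$ is a regularised sum of outer products of previously visited features. This is my reading of the idea rather than the authors' code: the two-dimensional features, the hand-written inverse and the regulariser `lam` are all assumptions made to keep the example dependency-free.

```python
def elliptical_bonus(phi, visited, lam=0.1):
    """Return phi^T C^{-1} phi with C = lam * I + sum of outer products (2-D features only)."""
    a, b, d = lam, 0.0, lam  # C = [[a, b], [b, d]] is symmetric
    for x, y in visited:
        a += x * x
        b += x * y
        d += y * y
    det = a * d - b * b  # positive because C is positive definite when lam > 0
    px, py = phi
    # Inverse of a symmetric 2x2 matrix is (1 / det) * [[d, -b], [-b, a]].
    return (d * px * px - 2 * b * px * py + a * py * py) / det


# Hypothetical episode: the agent has only moved along the first feature direction.
visited = [(1.0, 0.0)] * 5

familiar = elliptical_bonus((1.0, 0.0), visited)  # direction already covered
novel = elliptical_bonus((0.0, 1.0), visited)     # direction never visited
```

The unvisited direction receives a much larger bonus than the well-covered one, which is the behaviour the key idea describes for continuous state spaces.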

6. You Can’t Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments (Paster et al., 2022)

Key Idea. In a stochastic environment, trajectories in a dataset used to train decision transformer may be high-reward by chance. Here the authors cluster similar trajectories and find their expected reward to mitigate overfitting to lucky trajectories.

Implication(s):

Decision transformer trained on these new objectives exhibits policies that are better aligned with the return conditioning of the user.

Misc. thoughts:

Another simple idea with positive implications for performance.
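The averaging step can be sketched in a few lines. I assume here that each trajectory already carries a cluster label (the paper learns these; the labels below are made up), and simply replace each observed return with the mean return of its cluster.

```python
from collections import defaultdict


def cluster_average_returns(trajectories):
    """Replace each trajectory's return with the mean return of its cluster.

    `trajectories` is a list of (cluster_id, observed_return) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cluster_id -> [sum of returns, count]
    for cluster, ret in trajectories:
        totals[cluster][0] += ret
        totals[cluster][1] += 1
    return [totals[c][0] / totals[c][1] for c, _ in trajectories]


# Hypothetical data: two similar trajectories in cluster 0, one of which got lucky.
trajectories = [(0, 10.0), (0, 0.0), (1, 3.0)]
targets = cluster_average_returns(trajectories)
```

Conditioning on these averaged targets rather than the raw returns is what stops the lucky trajectory from being treated as reliably high-reward.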

7. Multi-Game Decision Transformers (Lee et al., 2022)

Key idea. Instead of predicting just the next action conditioned on state and return-to-go like the original decision transformer paper, they predict the intermediate reward and return-to-go. This allows them to re-condition on new returns-to-go at each timestep, using a clever sampling procedure that samples likely expert returns-to-go.

Implication(s):

SOTA on standard Atari offline RL tasks.

Misc thoughts:

This work is very similar to the original decision transformer paper, so I’m surprised that it received a best paper award.
It represents continued progress in the field of offline RL, and more specifically, decision transformer style architectures.

8. When does return-conditioned supervised learning work for offline reinforcement learning? (Brandfonbrener et al., 2022)

Key idea. Much recent work on offline RL can be cast as supervised learning on a near-optimal offline dataset then conditioning on high rewards from the dataset at test time; under what conditions is this a valid approach? Here the authors prove that this (unsurprisingly) only works when two conditions are met: 1) the test envs are (nearly) deterministic, and 2) there is trajectory-level coverage in the dataset.

Implication(s):

Current approaches to offline RL will not work in the real world because real envs are generally stochastic.

Misc thoughts:

I liked that the authors proved the community’s intuitions on current approaches to offline RL that, although somewhat obvious in retrospect, had not been verified.

Workshops

I attended 5 workshops:

Foundation Models for Decision Making
Safety
Offline RL
Real Life Reinforcement Learning
Tackling Climate Change with Machine Learning

I found the latter three interesting, but less informative and prescient than the first two. I therefore only discuss the Foundation Models for Decision Making and Safety workshops; the extent to which I enjoyed both workshops is, in a sense, oxymoronic.

Foundation Models for Decision Making

Leslie P. Kaelbling: What does an intelligent robot need to know?

My favourite talk was from Leslie Kaelbling of MIT. Kaelbling focussed on our proclivity for building inductive biases into our models (a similar thesis to Sutton’s Bitter Lesson); though good in the short term, the effectiveness of such priors plateaus in the long run. I agree with her.

She advocates for a marketplace of pre-trained models of the following types:

Foundation: space, geometry, kinematics
Psychology: other agents, beliefs, desires etc.
Culture: how you do things in the world e.g. stuff you can read in books

Robotics manufacturers will provide:

observation / perception
actuators
controllers e.g. policies

And we’ll use our own expertise to build local states (specific facts about the env) and encode long horizon memories e.g. what did I do 2 years ago.

Safety (unofficial; in the Marriott across the road)

The safety workshop was wild. It was a small, unofficial congregation of researchers who you’d expect to see lurking on Less Wrong and other EA forums.

Christoph Schuhmann (Founder of LAION)

Chris is a high school teacher from Vienna; he gave an inspiring talk on the open-sourcing of foundation models. He started LAION (Large-scale Artificial Intelligence Open Network), a non-profit organisation that provides datasets, tools and models to democratise ML research. His key points included:

centralised intelligence means centralised problem solving; we can’t give the keys to problem solving to a (potentially) dictatorial few.
risks by not open sourcing AI are bigger than those of open sourcing
LAION progress:
- initial plan was to replicate the original CLIP / DALL-E 1
- got 3m image text pairs on his own
- discord server helped him get 300m image text pairs, then 5b pairs
- hedge fund gave them 8 A100s
We will always want to do things even if AI can, because we need to express ourselves

Thomas Wolf (Hugging Face CEO)

Tom Wolf gave a talk on the Big Science initiative, a project that takes inspiration from scientific creation schemes such as CERN and the LHC, in which open scientific collaborations facilitate the creation of large-scale artefacts that are useful for the entire research community:

1000+ researchers coming together to build massive language model and massive dataset
efficient agi will probs require modularity cc. LeCun
working on the energy efficiency of training is inherently democratic i.e. stops models being held by the rich, especially re: inference

Are AI researchers aligned on AGI alignment?

There was an interesting round table at the end of the workshop that included Jared Kaplan (Anthropic) and David Krueger (Cambridge) discussing what it means to align AGI. There was little agreement.

Keynotes

I attended 4 of the 6 keynotes which were:

David Chalmers: Are Large Language Models Sentient?
Emmanuel Candes: Conformal Prediction in 2022
Isabelle Guyon: The Data-Centric Era: How ML is Becoming an Experimental Science
Geoff Hinton: The Forward-Forward Algorithm for Training Deep Neural Networks

I found Emmanuel’s talk on conformal prediction enlightening as I’d never heard of the topic (here’s a primer), and Isabelle’s talk on benchmark and data transparency to be agreeable, if a little unoriginal. Hinton’s talk on a more anatomically correct learning algorithm was interesting, but I’m as yet unconvinced that mimicking human intelligence is a good way of building systems that are superior to humans, since we are able to leverage hardware for artificial systems far superior to that accessible to humans. Chalmers’s talk was extremely thought-provoking; he structured the problem of consciousness in LLMs excellently, far better than I’ve seen to date, and as such it was my favourite of the four.

I have linked to each of the talks, which are freely available to view above.

References

Fan, L.; Wang, G.; Jiang, Y.; Mandlekar, A.; Yang, Y.; Zhu, H.; Tang, A.; Huang, D.-A.; Zhu, Y.; and Anandkumar, A. 2022. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in neural information processing systems, 35.

Henaff, M.; Raileanu, R.; Jiang, M.; and Rocktäschel, T. 2022. Exploration via Elliptical Episodic Bonuses. Advances in neural information processing systems, 35.

Humphreys, P. C.; Guez, A.; Tieleman, O.; Sifre, L.; Weber, T.; and Lillicrap, T. 2022. Large-Scale Retrieval for Reinforcement Learning. Advances in neural information processing systems, 35.

Lee, K.-H.; Nachum, O.; Yang, M.; Lee, L.; Freeman, D.; Xu, W.; Guadarrama, S.; Fischer, I.; Jang, E.; Michalewski, H.; et al. 2022. Multi-game decision transformers. Advances in neural information processing systems, 35.

Paster, K.; McIlraith, S.; and Ba, J. 2022. You Can’t Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments. Advances in neural information processing systems, 35.

Schaul, T.; Barreto, A.; Quan, J.; and Ostrovski, G. 2022. The phenomenon of policy churn. Advances in neural information processing systems, 35.

Touati, A.; Rapin, J.; and Ollivier, Y. 2022. Does Zero-Shot Reinforcement Learning Exist?

One Hour RL

Fri, 25 Feb 2022 15:23:20 +0000

An Introduction to Reinforcement Learning

Tom Bewley & Scott Jeen

Alan Turing Institute

24/02/2022

The best way to walk through this tutorial is using the accompanying Jupyter Notebook:

[Jupyter Notebook]

1 | Markov Decision Processes: A Model of Sequential Decision Making

1.1. MDP (semi-)Formalism

In reinforcement learning (RL), an agent takes actions in an environment to change its state over discrete timesteps $t$, with the goal of maximising the future sum of a scalar quantity known as reward. We formalise this interaction as an agent-environment loop, mathematically described as a Markov Decision Process (MDP).

MDPs break the I.I.D. data assumption of supervised and unsupervised learning; the agent causally influences the data it sees through its choice of actions. However, one assumption we do make is the Markov property, which says that the state representation captures all relevant information from the past. Formally, state transitions depend only on the most recent state and action, $$ \mathbb{P}[S_{t+1} | S_1,A_1 \ldots, S_t,A_t]=\mathbb{P}[S_{t+1} | S_t,A_t], $$ and rewards depend only on the most recent transition, $$ \mathbb{P}[R_{t+1} | S_1,A_1 \ldots, S_t,A_t,S_{t+1}] = \mathbb{P}[R_{t+1} | S_t,A_t,S_{t+1}]. $$

Note: different sources use different notation here, but this is the most general.

In some MDPs, a subset of states are designated as terminal (or absorbing). The agent-environment interaction loop ceases once a terminal state is reached, and restarts at $t=0$ by sampling a state from an initialisation distribution $S_0\sim\mathbb{P}_\text{init}$. Such MDPs are known as episodic, while those without terminal states are known as continuing.

The goal of an RL agent is to pick actions that maximise the discounted cumulative sum of future rewards, also known as the return $G_t$: $$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-t-1}R_{T}, $$ where $\gamma\in[0,1]$ is a discount factor and $T$ is the time of termination ($\infty$ in continuing MDPs).
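As a quick sanity check on the formula, here is a small sketch that computes the return for a finite list of rewards. The function name and the reward sequence are illustrative choices of mine.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + ... for rewards listed from R_{t+1} onward."""
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative four-step episode with gamma = 0.95:
# -2 + 0.95 * (-2) + 0.95**2 * 10 + 0.95**3 * 0
g0 = discounted_return([-2.0, -2.0, 10.0, 0.0], gamma=0.95)
```

Working backwards through the rewards avoids computing powers of $\gamma$ explicitly and mirrors the recursive structure that the Bellman equation exploits later on.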

To do so, it needs the ability to forecast the reward-getting effect of taking each action $A$ in each state $S$, potentially many timesteps into the future. This temporal credit assignment problem is one of the key factors that makes RL so challenging.

Before we go on, it’s worth reflecting on how general the MDP formulation is. An extremely large class of problems can be cast as MDPs (it’s even possible to represent supervised learning as a special case), and this recent DeepMind paper goes as far as to say that all aspects of general intelligence can be understood as serving the maximisation of future reward. Although not everybody agrees, this attitude motivates the heavy RL focus at organisations like DeepMind and OpenAI.

1.2 MDP Example

Here’s a simple MDP (courtesy of David Silver @ DeepMind/UCL), which we’ll be using throughout this course.

White circle: non-terminal state
White square: terminal state
Black circle: action
Green: reward (depends only on $S_{t+1}$ here)
Blue: state transition probability
Red: action probability for an exemplar policy
Note: edges with probability $1$ are unlabelled

1.3 OpenAI Gym

OpenAI Gym provides a unified framework for testing and comparing RL algorithms in Python, and offers a suite of MDPs that researchers can use to benchmark their work. It’s important to be familiar with the conventions of Gym, because almost all modern RL code is built to work with it. Gym environment classes have two key methods:

mdp.reset(): reset the MDP to an initial state $S_0$ according to the initialisation distribution $\mathbb{P}_\text{init}$.
mdp.step(action) : given an action $A_t$, combine with the current state $S_t$ to produce the next state according to $\mathbb{P}[S_{t+1} | S_t,A_t]$ and a scalar reward according to $\mathbb{P}[R_{t+1} | S_t,A_t,S_{t+1}]$.
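To show the shape of this interface without depending on Gym itself, here is a minimal sketch of a class with the same two methods. `TwoStateMDP`, its states and its rewards are hypothetical, and I assume the classic four-value return from `step` (state, reward, done, info).

```python
import random


class TwoStateMDP:
    """Toy episodic MDP following the reset()/step() convention described above."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # The initialisation distribution puts all probability on "Start".
        self.state = "Start"
        return self.state

    def step(self, action):
        # "quit" always terminates; "continue" terminates with probability 0.5.
        if action == "quit" or self.rng.random() < 0.5:
            next_state, reward = "End", 0.0
        else:
            next_state, reward = "Start", 1.0
        self.state = next_state
        done = next_state == "End"
        return next_state, reward, done, {}


env = TwoStateMDP()
first_state = env.reset()
result = env.step("quit")
```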

A Gym-compatible class for the student MDP shown above can be found in mdp.py in this repository. Let’s import it now and explore what it can do!

from mdp import StudentMDP
mdp = StudentMDP()

Firstly, we’ll have a look at the initialisation probabilities and the behaviour of mdp.reset().

print(mdp.initial_probs())
mdp.reset()
print(mdp.state)

{'Class 1': 1.0, 'Class 2': 0.0, 'Class 3': 0.0, 'Facebook': 0.0, 'Pub': 0.0, 'Pass': 0.0, 'Asleep': 0.0}
Class 1

Next, let’s check which actions are available in this initial state, and the action-dependent transition probabilities $\mathbb{P}[S_{t+1}|\text{Class 1},A_t]$.

Reminder: the Markov property dictates that transition probabilities depend only on the current state and action.

print(mdp.action_space(mdp.state))
print(mdp.transition_probs(mdp.state, "Study"))
print(mdp.transition_probs(mdp.state, "Go on Facebook"))

{'Study', 'Go on Facebook'}
{'Class 2': 1.0}
{'Facebook': 1.0}

Calling mdp.step(action) samples and returns the next state $S_{t+1}$, alongside the reward $R_{t+1}$. Let’s try calling this method repeatedly. What’s happening here?

state, reward, _, _ = mdp.step("Study") 
print(state, reward)

Class 2 -2.0

mdp.action_space("Pass")

{'Fall asleep'}

So far, we’ve only seen deterministic transitions, but having a pint in the pub has a stochastic effect; the state goes to one of the three classes with specified probabilities.

print(mdp.action_space("Pub"))
print(mdp.transition_probs("Pub", "Have a pint"))

{'Have a pint'}
{'Class 1': 0.2, 'Class 2': 0.4, 'Class 3': 0.4}

In this state, the behaviour of mdp.step(action) changes between repeated calls, even for a constant action.

mdp.state = "Pub" # Note that we're resetting the state to Pub each time
state, reward, _, _ = mdp.step("Have a pint")
print(state, reward)

Class 2 -2.0

This MDP has just one terminal state.

print(mdp.terminal_states())

{'Asleep'}

mdp.step(action) also returns a binary done flag, which is set to True if $S_{t+1}$ is a terminal state.

mdp.state = "Class 2" 
state, reward, done, _ = mdp.step("Fall asleep")
print(state, reward, done)

mdp.state = "Pass" 
state, reward, done, _ = mdp.step("Fall asleep")
print(state, reward, done)

Asleep 0.0 True
Asleep 0.0 True

Now let’s bring an agent into the mix, and give it the exemplar policy shown in the diagram above.

from agent import Agent
agent = Agent(mdp) 
agent.policy = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}}

We can query the policy in a similar way to the MDP’s properties, and observe its stochastic behaviour.

print(agent.policy["Class 1"])
print([agent.act("Class 1") for _ in range(20)])

{'Study': 0.5, 'Go on Facebook': 0.5}
['Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Study']

Bringing it all together:

mdp.verbose = True
state = mdp.reset()
done = False
while not done:
    state, reward, done, info = mdp.step(agent.act(state))

=========================== EPISODE   2 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |

How “good” is this policy? To answer this, we need to calculate its expected return.

2 | Policy Evaluation: The Temporal Difference Method

For a policy $\pi$, the Q value $Q_\pi(S_t,A_t)$ is the expected return from taking action $A_t$ in state $S_t$, and following $\pi$ thereafter. It thus quantifies how well the policy can be expected to perform, starting from this state-action pair. Q values exhibit an elegant recursive relationship known as the Bellman equation: $$ Q_\pi(S_t,A_t)=\sum_{S_{t+1}}\mathbb{P}[S_{t+1}|S_t,A_t]\left(\mathbb{E}[R_{t+1} | S_t,A_t,S_{t+1}]+\gamma\times\sum_{A_{t+1}}\pi(A_{t+1}|S_{t+1})\times Q_\pi(S_{t+1},A_{t+1})\right). $$

i.e. The Q value for a state-action pair is equal to the immediate reward, plus the $\gamma$-discounted Q value for the next state-action pair, with expectations taken over both the transition function $\mathbb{P}$ and the policy $\pi$.

This is a bit of a mouthful, but the Bellman equation is perhaps the single most important thing to understand if you really want to “get” reinforcement learning.

To gain some intuition for this relationship, here are estimated Q values for the exemplar policy in the student MDP. Here we’re using a discount factor of $\gamma=0.95$.

Note that these values are only approximate, so the Bellman equation doesn’t hold exactly!

To take one example: $$ Q(\text{Class 2},\text{Study})=-2+0.95\times [(0.6\times Q(\text{Class 3},\text{Study})+0.4\times Q(\text{Class 3},\text{Go to the pub}))] $$ $$ =-2+0.95\times[(0.6\times 9.99+0.4\times 1.81)] $$ $$ =4.38\approx 4.36 $$
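The same calculation can be written as a small function that evaluates the right-hand side of the Bellman equation for one state-action pair. This is a sketch: the function name and data layout are my own, and the Q values are the approximate ones quoted in the example above.

```python
def bellman_rhs(transitions, policy, Q, gamma):
    """Right-hand side of the Bellman equation for one state-action pair.

    transitions: list of (probability, expected_reward, next_state)
    policy:      {state: {action: probability}}
    Q:           {state: {action: value}}
    """
    total = 0.0
    for prob, reward, next_state in transitions:
        # Expected Q value at the next state under the policy.
        next_value = sum(p * Q[next_state][a] for a, p in policy[next_state].items())
        total += prob * (reward + gamma * next_value)
    return total


# Studying in Class 2 leads to Class 3 with probability 1 and reward -2.
transitions = [(1.0, -2.0, "Class 3")]
policy = {"Class 3": {"Study": 0.6, "Go to the pub": 0.4}}
Q = {"Class 3": {"Study": 9.99, "Go to the pub": 1.81}}

q_class2_study = bellman_rhs(transitions, policy, Q, gamma=0.95)
```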

How did we arrive at these Q value estimates? Here’s where the real magic happens.

The Bellman backup algorithm makes use of this recursive relationship to update the Q value for a state-action pair based on the current estimate of the value for the next state.

GAMMA = 0.95  # Discount factor
ALPHA = 0.001 # Learning rate

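def bellman_backup(agent, state, action, reward, next_state, done):
    """Move Q(state, action) a step of size ALPHA towards the one-step bootstrapped target."""
    # Terminal transitions have no future value; otherwise sample the next action from the policy.
    Q_next = 0. if done else agent.Q[next_state][agent.act(next_state)]
    # The bracketed term is the temporal difference error.
    agent.Q[state][action] += ALPHA * (reward + GAMMA * Q_next - agent.Q[state][action])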
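To see the backup converge without needing the StudentMDP or Agent classes, here is a self-contained sketch on a hypothetical two-step chain. Each state has a single action, so I index Q by state only, and I use a larger learning rate than the tutorial's so that a few hundred episodes are enough.

```python
GAMMA, ALPHA = 0.95, 0.1  # ALPHA is larger than the tutorial's 0.001 to speed up convergence

# Hypothetical deterministic chain: state -> (reward, next_state); None marks termination.
CHAIN = {"A": (-2.0, "B"), "B": (10.0, None)}


def evaluate(episodes=500):
    """Repeatedly apply the Bellman backup along the chain and return the learned values."""
    Q = {s: 0.0 for s in CHAIN}
    for _ in range(episodes):
        state = "A"
        while state is not None:
            reward, next_state = CHAIN[state]
            q_next = 0.0 if next_state is None else Q[next_state]
            Q[state] += ALPHA * (reward + GAMMA * q_next - Q[state])
            state = next_state
    return Q


Q = evaluate()
```

The learned values approach 10 for B and $-2 + 0.95 \times 10 = 7.5$ for A, which is exactly what the Bellman equation predicts for this chain.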

By sampling episodes in our MDP using the current policy we can collect rewards and update our Q-function accordingly. This procedure is called policy evaluation, and it uses the Bellman backup, which has two hyperparameters: $\gamma$, the discount factor, and $\alpha$, the learning rate.

Import the MDP and define the policy again.

from mdp import StudentMDP
from agent import Agent
mdp = StudentMDP(verbose=True)
agent = Agent(mdp) 
agent.policy = {
    "Class 1":  {"Study": 0.5, "Go on Facebook": 0.5},
    "Class 2":  {"Study": 0.8, "Fall asleep": 0.2},
    "Class 3":  {"Study": 0.6, "Go to the pub": 0.4},
    "Facebook": {"Keep scrolling": 0.9, "Close Facebook": 0.1},
    "Pub":      {"Have a pint": 1.},
    "Pass":     {"Fall asleep": 1.},
    "Asleep":   {"Stay asleep": 1.}
}

Initially, we set all Q values to zero (this is actually arbitrary).

agent.Q

{'Class 1': {'Study': 0.0, 'Go on Facebook': 0.0},
 'Class 2': {'Study': 0.0, 'Fall asleep': 0.0},
 'Class 3': {'Study': 0.0, 'Go to the pub': 0.0},
 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 0.0},
 'Pub': {'Have a pint': 0.0},
 'Pass': {'Fall asleep': 0.0},
 'Asleep': {}}

Run a single episode to see what happens.

state = mdp.reset()
done = False
while not done:
    action = agent.act(state)
    next_state, reward, done, _ = mdp.step(action)
    
    print('Current action value:', agent.Q[state][action])
    print('Reward obtained:', reward)
    print('Next action value:', 0. if done else agent.Q[next_state][agent.act(next_state)])

    bellman_backup(agent, state, action, reward, next_state, done)

    print('Updated action value:', agent.Q[state][action])
    print('\n')

    state = next_state

=========================== EPISODE  51 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
Current action value: 4.271800760689531
Reward obtained: -2.0
Next action value: 6.9926093317691915
Updated action value: 4.272171938794022


| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
Current action value: 6.9926093317691915
Reward obtained: -2.0
Next action value: 9.999149294082697
Updated action value: 6.9931159142668005


| 2     | Class 3  | Study          | 10.0   | Pass       | False |
Current action value: 9.999149294082697
Reward obtained: 10.0
Next action value: 0.0
Updated action value: 9.999150144788615


| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
Current action value: 0.0
Reward obtained: 0.0
Next action value: 0.0
Updated action value: 0.0
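As a sanity check, the first update in the log above can be reproduced by hand. The sketch below plugs the logged values for (Class 1, Study) into the backup rule, using the `GAMMA` and `ALPHA` values defined earlier; the variable names are illustrative.

```python
GAMMA = 0.95   # discount factor, as defined earlier
ALPHA = 0.001  # learning rate, as defined earlier

# Values copied from the first row of the episode log above.
q_current = 4.271800760689531  # Q(Class 1, Study) before the update
reward = -2.0
q_next = 6.9926093317691915    # Q(Class 2, Study), the next action value

td_target = reward + GAMMA * q_next
td_error = td_target - q_current
q_updated = q_current + ALPHA * td_error

print(q_updated)
```

The result agrees with the logged "Updated action value" for that step. Because $\alpha$ is small, the estimate moves only about 0.0004 even though the TD error is roughly 0.37.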

Repeating a bunch of times, we gradually converge.

mdp.verbose = False

print('Initial Q')
print(agent.Q)

for _ in range(20000):
    state = mdp.reset()
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = mdp.step(action)
        bellman_backup(agent, state, action, reward, next_state, done)
        state = next_state

print('\n')
print('Converged Q')
print(agent.Q)

Initial Q
{'Class 1': {'Study': -0.002, 'Go on Facebook': -0.001}, 'Class 2': {'Study': -0.002, 'Fall asleep': 0.0}, 'Class 3': {'Study': 0.01, 'Go to the pub': 0.0}, 'Facebook': {'Keep scrolling': -0.0049995000249993754, 'Close Facebook': -0.00200095}, 'Pub': {'Have a pint': 0.0}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}


Converged Q
{'Class 1': {'Study': 1.3750628761712569, 'Go on Facebook': -11.288976651525505}, 'Class 2': {'Study': 4.485658109648779, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.999996524778595, 'Go to the pub': 1.8953439336946862}, 'Facebook': {'Keep scrolling': -11.233042781986304, 'Close Facebook': -6.6761905244797735}, 'Pub': {'Have a pint': 0.9312667143217461}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}

Note that although the policy evaluation process is guaranteed to converge eventually (for simple MDPs!), we are likely to see some discrepancies between runs of finite length because of the randomness in the data collection process. Here are the results of five independent repeats:

{'Class 1': {'Study': 1.2650695038546025, 'Go on Facebook': -11.30468184426212}, 'Class 2': {'Study': 4.407552596737938, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.99999695776742, 'Go to the pub': 1.8487809354712246}, 'Facebook': {'Keep scrolling': -11.258053618560483, 'Close Facebook': -6.489974573408375}, 'Pub': {'Have a pint': 0.9454014270087486}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
{'Class 1': {'Study': 1.3338704627380917, 'Go on Facebook': -11.222578014516461}, 'Class 2': {'Study': 4.404498313710967, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.999996607231745, 'Go to the pub': 1.9330819535637127}, 'Facebook': {'Keep scrolling': -11.237574593720579, 'Close Facebook': -6.649035509952115}, 'Pub': {'Have a pint': 1.0198591832482675}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
{'Class 1': {'Study': 1.255108027766012, 'Go on Facebook': -11.190843458457234}, 'Class 2': {'Study': 4.3028079916966, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.999996368375, 'Go to the pub': 1.692402249138645}, 'Facebook': {'Keep scrolling': -11.009224020468848, 'Close Facebook': -6.456279660637165}, 'Pub': {'Have a pint': 0.7467114530860179}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
{'Class 1': {'Study': 1.2734946938741027, 'Go on Facebook': -11.328006914127434}, 'Class 2': {'Study': 4.256107269897298, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.99999635381211, 'Go to the pub': 1.74113336614775}, 'Facebook': {'Keep scrolling': -11.34039736455563, 'Close Facebook': -6.777709970724558}, 'Pub': {'Have a pint': 0.7694312629253455}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
{'Class 1': {'Study': 1.2650695038546025, 'Go on Facebook': -11.30468184426212}, 'Class 2': {'Study': 4.407552596737938, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.99999695776742, 'Go to the pub': 1.8487809354712246}, 'Facebook': {'Keep scrolling': -11.258053618560483, 'Close Facebook': -6.489974573408375}, 'Pub': {'Have a pint': 0.9454014270087486}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
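One way to put a number on that run-to-run spread is to summarise a single entry across the repeats. The sketch below does this for Q(Class 2, Study), using the five values exactly as printed above.

```python
from statistics import mean, stdev

# Q(Class 2, Study) from each of the five dictionaries printed above.
q_class2_study = [
    4.407552596737938,
    4.404498313710967,
    4.3028079916966,
    4.256107269897298,
    4.407552596737938,
]

run_mean = mean(q_class2_study)
run_stdev = stdev(q_class2_study)  # sample standard deviation

print("mean: ", round(run_mean, 2))
print("stdev:", round(run_stdev, 2))
```

The mean across these repeats lands close to the 4.36 quoted in the worked example earlier, with individual runs scattered a little either side of it.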


Try with $\gamma=0$

3 | Policy Improvement

Having evaluated our policy $\pi$, how can we go about obtaining a better one? This question is the heart of policy improvement, perhaps the fundamental concept of RL. Recall that when we performed policy evaluation, we obtained the value of taking every action in every state. Thus, we can perform policy improvement readily by picking our current best estimate of the optimal action from each state: so-called greedy action selection. Once we’ve obtained a new policy, we can evaluate it as before. Continually iterating between policy evaluation and policy improvement in this way, we are guaranteed to reach the optimal policy $\pi^*$ according to the policy improvement theorem.
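Greedy action selection takes only a few lines over a Q table. The sketch below applies it to the converged estimates from the policy evaluation run above (rounded to two decimals); `greedy_policy` is an illustrative name, not a function from the tutorial's code.

```python
# Converged Q estimates from the policy evaluation run, rounded.
q = {
    "Class 1": {"Study": 1.38, "Go on Facebook": -11.29},
    "Class 2": {"Study": 4.49, "Fall asleep": 0.0},
    "Class 3": {"Study": 10.0, "Go to the pub": 1.90},
    "Facebook": {"Keep scrolling": -11.23, "Close Facebook": -6.68},
    "Pub": {"Have a pint": 0.93},
    "Pass": {"Fall asleep": 0.0},
    "Asleep": {},
}

def greedy_policy(q):
    """Put probability 1 on the highest-valued action in each state."""
    policy = {}
    for state, action_values in q.items():
        if not action_values:  # terminal state: nothing to choose
            policy[state] = {}
            continue
        best = max(action_values, key=action_values.get)
        policy[state] = {a: float(a == best) for a in action_values}
    return policy

improved = greedy_policy(q)
print(improved["Facebook"])  # {'Keep scrolling': 0.0, 'Close Facebook': 1.0}
```

The improved policy always studies in class and closes Facebook straight away, which is already a clear step up from the stochastic policy we started with.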

3.1 | Q-learning: Combining Policy Evaluation and Improvement

from mdp import StudentMDP
mdp = StudentMDP(verbose=True)

from agent import QLearningAgent
agent = QLearningAgent(mdp, epsilon=1.0, alpha=0.2, gamma=0.9)

NUM_EPS = 50
mdp.ep = 0
while mdp.ep < NUM_EPS:
    state = mdp.reset()
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done, info = mdp.step(action)
        agent.learn(state, action, reward, next_state, done)
        state = next_state

    print("Value function:")
    print(agent.Q)
    print("Policy:")
    print(agent.policy)
    print("Epsilon:", agent.epsilon)
    
    agent.epsilon = max(agent.epsilon - 1 / (NUM_EPS-1), 0)

=========================== EPISODE   1 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
Value function:
{'Class 1': {'Study': -0.4, 'Go on Facebook': 0.0}, 'Class 2': {'Study': -0.4, 'Fall asleep': 0.0}, 'Class 3': {'Study': 2.0, 'Go to the pub': 0.0}, 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 0.0}, 'Pub': {'Have a pint': 0.0}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
Policy:
{'Class 1': {'Study': 0.5, 'Go on Facebook': 0.5}, 'Class 2': {'Study': 0.5, 'Fall asleep': 0.5}, 'Class 3': {'Study': 0.5, 'Go to the pub': 0.5}, 'Facebook': {'Keep scrolling': 0.5, 'Close Facebook': 0.5}, 'Pub': {'Have a pint': 1.0}, 'Pass': {'Fall asleep': 1.0}, 'Asleep': {}}
Epsilon: 1.0

=========================== EPISODE  50 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
Value function:
{'Class 1': {'Study': 4.2185955736170015, 'Go on Facebook': -2.843986498540236}, 'Class 2': {'Study': 6.978887265282676, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.997403851570732, 'Go to the pub': 1.0297507967148403}, 'Facebook': {'Keep scrolling': -3.2301556459908016, 'Close Facebook': -0.8716820424598939}, 'Pub': {'Have a pint': 2.6089417712654472}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
Policy:
{'Class 1': {'Study': 1.0, 'Go on Facebook': 0.0}, 'Class 2': {'Study': 1.0, 'Fall asleep': 0.0}, 'Class 3': {'Study': 1.0, 'Go to the pub': 0.0}, 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 1.0}, 'Pub': {'Have a pint': 1.0}, 'Pass': {'Fall asleep': 1.0}, 'Asleep': {}}
Epsilon: 0

We find that after 50 episodes the agent has obtained the optimal policy $\pi^*$!
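The `agent.learn` call lives in the accompanying `agent` module, which isn't shown here. As a sketch, assuming it applies the standard tabular Q-learning update (bootstrapping from the best action in the next state), replaying the four transitions of Episode 1 by hand gives the same numbers as the value function printed after that episode. The function name `q_learning_update` is illustrative.

```python
ALPHA, GAMMA = 0.2, 0.9  # same values passed to QLearningAgent above

# All-zero initial Q table for the student MDP.
q = {
    "Class 1": {"Study": 0.0, "Go on Facebook": 0.0},
    "Class 2": {"Study": 0.0, "Fall asleep": 0.0},
    "Class 3": {"Study": 0.0, "Go to the pub": 0.0},
    "Facebook": {"Keep scrolling": 0.0, "Close Facebook": 0.0},
    "Pub": {"Have a pint": 0.0},
    "Pass": {"Fall asleep": 0.0},
    "Asleep": {},
}

def q_learning_update(q, state, action, reward, next_state, done):
    """Standard tabular Q-learning: bootstrap from the best next action."""
    best_next = 0.0 if done or not q[next_state] else max(q[next_state].values())
    q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])

# (state, action, reward, next_state, done) rows from the Episode 1 log.
episode_1 = [
    ("Class 1", "Study", -2.0, "Class 2", False),
    ("Class 2", "Study", -2.0, "Class 3", False),
    ("Class 3", "Study", 10.0, "Pass", False),
    ("Pass", "Fall asleep", 0.0, "Asleep", True),
]
for transition in episode_1:
    q_learning_update(q, *transition)

print(q["Class 1"]["Study"], q["Class 2"]["Study"], q["Class 3"]["Study"])
```

The difference from the policy evaluation backup earlier is the `max`: the target uses the best next action rather than the one the current policy happens to pick, which is what lets evaluation and improvement happen in the same step.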

4 | Deep RL

So far, we’ve tabularised the state-action space. Whilst this is useful for explaining the fundamental concepts that underpin RL, real-world state-action spaces are generally continuous and thus impossible to tabularise. To combat this, function approximators are used instead. In the past these included x, but more recently deep neural networks have been used, giving rise to the field of Deep Reinforcement Learning.

The seminal Deep RL algorithm is Deep Q Learning, which uses neural networks to represent the $Q$ function. The network takes the current observation $o_t$ as input and predicts the value of each action. The agent’s policy is $\epsilon$-greedy as before, i.e. it takes the value-maximising action with probability $1 - \epsilon$ and a random action otherwise.
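For reference, $\epsilon$-greedy selection over a set of predicted action values can be sketched as follows. The action labels and values are hypothetical stand-ins for a network's output, and `epsilon_greedy` is an illustrative name rather than anything from the `dqn_agent` module used below.

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=action_values.get)

# Hypothetical predicted values for a single observation.
values = {"left": 0.2, "right": 0.7}

rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy(values, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count("right") / len(picks))
```

In this version the random branch draws uniformly from all actions, including the greedy one, so the greedy action ends up being chosen slightly more often than $1 - \epsilon$ of the time.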

Below, we run 500 episodes of the canonical CartPole task using Deep Q learning. The agent’s goal is to balance the pole in the upright position for as long as possible, starting from an initially random position.

import gym
from dqn_agent import Agent
import numpy as np

env = gym.make('CartPole-v1')
agent = Agent(gamma=0.99, epsilon=0.9, lr=0.0001, n_actions=env.action_space.n, input_dims=[env.observation_space.shape[0]],
              mem_size=50000, batch_size=128,  eps_dec=1e-3, eps_min=0.05, replace=1000,
              env_name='cartpole', chkpt_dir='tmp/dqn')

episodes = 500
scores, eps_history = [], []

for i in range(episodes):
    score = 0
    done = False
    # Classic Gym API (pre-0.26): reset() returns the observation only and
    # step() returns a 4-tuple.
    observation = env.reset()
    env.render()
    while not done:
        action = agent.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        score += reward
        # Store the transition, then take one learning step on the agent.
        agent.store_transition(observation, action, reward, observation_, done)
        agent.learn()
        observation = observation_
        env.render()

    scores.append(score)
    eps_history.append(agent.epsilon)

    # Running average over the most recent 100 episodes.
    avg_score = np.mean(scores[-100:])

    print('episode', i, 'score %.2f' % score, 'average score %.2f' % avg_score)
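The `dqn_agent` module isn't shown here, so what happens inside `store_transition` and `learn` is not visible. The `mem_size` and `batch_size` arguments suggest an experience replay buffer, as in standard Deep Q learning. Under that assumption, a minimal sketch of the two pieces looks like this; `ReplayBuffer`, `td_targets` and the stand-in "network" are all hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (obs, action, reward, next_obs, done) transitions."""

    def __init__(self, mem_size):
        self.memory = deque(maxlen=mem_size)  # oldest transitions drop off

    def store(self, obs, action, reward, next_obs, done):
        self.memory.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

def td_targets(batch, q_values_fn, gamma):
    """Targets r + gamma * max_a Q(next_obs, a), with no bootstrap at terminals."""
    targets = []
    for _, _, reward, next_obs, done in batch:
        bootstrap = 0.0 if done else max(q_values_fn(next_obs))
        targets.append(reward + gamma * bootstrap)
    return targets

# Toy usage: a stand-in "network" that returns fixed values for two actions.
buffer = ReplayBuffer(mem_size=5)
for t in range(8):
    buffer.store(obs=t, action=0, reward=1.0, next_obs=t + 1, done=(t == 7))

batch = buffer.sample(batch_size=3)
targets = td_targets(batch, q_values_fn=lambda obs: [0.5, 2.0], gamma=0.99)
print(targets)
```

Sampling past transitions at random breaks up the correlation between consecutive steps of an episode, and the network is then trained to regress its prediction for the taken action towards these targets.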

5 | What Did We Miss Out?

Dynamic programming (when transition probabilities are known)
Monte Carlo
Exploration strategies
Continuous actions
Policy gradient, actor-critic
Model-based
Partial observability

What next? RL interest group?

Presenting with Jupyter Notebooks

Wed, 17 Nov 2021 09:03:11 -0500

The best way to walk through this tutorial is using the accompanying Jupyter Notebook:

[Jupyter Notebook]


In the last year I’ve started presenting work using Jupyter Notebooks, rebelling against the Bill Gates-driven status quo. Here I’ll explain how to do it. It’s not difficult, and in my opinion it makes presentations look slicker, whilst allowing you to run code live in a presentation if you like. First, we need to download the plug-in that gives us the presentation functionality; it’s called RISE. We can do this easily using pip in a terminal window:

pip install RISE

Once installed, our first move is to add the presentation toggles to our notebook cells. We do this by clicking View in the menu bar, then Cell Toolbar, then Slideshow:

Adding Presentation Toggles to Cells

Slide Types

This adds a Slide Type dropdown to each cell in the notebook. Here we can choose one of five options:

Slide: Starts a new chapter in your presentation; think of this as a section heading in LaTeX.
Sub-slide: A slide falling within the chapter defined by a Slide cell.
Fragment: Splits the contents of one slide into pieces. A cell marked as a fragment creates a break inside the slide; it will not show up right away, and you will need to press Space one more time to see it.
Skip: Skips the cell when in presenter mode.
Notes: A cell for the author’s notes on a slide, which aren’t shown in presenter view.
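Behind the scenes, the dropdown records its choice in each cell's metadata under `slideshow.slide_type`, so slide types can also be set programmatically. The sketch below builds a small notebook as plain JSON using only the standard library; the cell contents and the `make_cell` helper are made up for illustration.

```python
import json

def make_cell(source, slide_type, cell_type="markdown"):
    """Build a notebook cell dict tagged with a slide type."""
    cell = {
        "cell_type": cell_type,
        "metadata": {"slideshow": {"slide_type": slide_type}},
        "source": source,
    }
    if cell_type == "code":
        # Code cells carry two extra required fields in the notebook format.
        cell.update({"execution_count": None, "outputs": []})
    return cell

notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        make_cell("# My talk", "slide"),
        make_cell("First point", "subslide"),
        make_cell("Revealed on the next key press", "fragment"),
        make_cell("Reminder to self", "notes"),
        make_cell("print('live demo')", "slide", cell_type="code"),
    ],
}

text = json.dumps(notebook, indent=1)
print(text[:60])
```

Saving `text` to a file with an `.ipynb` extension gives a notebook whose cells already have their slide types assigned when opened.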

As with any notebook, we can define the cell type to be either Markdown or Code. As you’d expect, we present any text or image-based slide in Markdown, reserving the Code cell type for when we want to explicitly run some code in the presentation. If you aren’t familiar, Markdown is a straightforward language for text formatting; I won’t go into the details here, but suffice it to say you can learn the basics of Markdown in 5 minutes. You can find a useful cheatsheet here.

Images

Adding images is easy too. I advise creating a sub-directory in your working directory called /images and storing anything you want to present there. Then you display them in a markdown file using some simple HTML syntax:

You can manipulate the style attribute to change the size of the image. Don’t worry, this is the only HTML you need to know!
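As an illustration of the kind of tag meant here, the helper below builds an `<img>` tag pointing into the images sub-directory, with the width set through the style attribute. The file name and width are made up, and `img_tag` is a hypothetical helper rather than part of RISE.

```python
def img_tag(filename, width_px):
    """Return an HTML <img> tag pointing into the images/ sub-directory."""
    return f'<img src="images/{filename}" style="width: {width_px}px;">'

tag = img_tag("results.png", 600)
print(tag)  # <img src="images/results.png" style="width: 600px;">
```

The printed string can be pasted straight into a Markdown cell; changing the number changes how large the image appears on the slide.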

Entering Presentation Mode

To view your slideshow click on the bar-chart button in the menu bar. This will start the presentation from the cell currently selected:

That’s it! This tutorial has given you an introduction to the basics of RISE for presenting with Jupyter Notebooks. You can of course customise your slides to your heart’s content using further plug-ins and more advanced Markdown. Here’s a summary of the useful links from this document to finish:

Thanks!

Twitter: @enjeeneer

Website: https://enjeeneer.io/

Notes, Exercises and Code for Sutton and Barto's Reinforcement Learning: An Introduction (2018)

Fri, 30 Apr 2021 18:18:24 +0100

In the last few weeks I’ve been compiling a set of notes and exercise solutions for Sutton and Barto’s Reinforcement Learning: An Introduction. Admittedly, these were produced for my own benefit, but if you’d like to look at my notes, my (probably incorrect) answers to the exercises, or the code accompanying those answers, I’ll link directly to them below:

Thanks to Bryn Hayder for inspiring this idea, and for providing his exercise solutions which helped me throughout.

Scott's Uncomprehensive Guide to Scotland

Mon, 19 Apr 2021 21:08:05 +0000

Hello treasured friend. If you’re reading this, it’s probably because I’ve force-fed you a link after discussing your upcoming trip to Scotland. I hope this is useful to you in some way. Think of this as a travel guide that you can dip into when you find yourself in one of these places either hungry or bored. I don’t describe anything in detail; you’ll just have to take my word for it that these places are worth visiting.

Before diving in, I’d like to make a quick overarching recommendation. If you only have one chance to visit, I would strongly advise using this visit to see Edinburgh in August for the Fringe Festival. In general, you can’t do better than Edinburgh; it’s the capital, cultural centre, prettiest city; and in August you’ll get the best weather. But the Fringe is a unique experience that I think should be on everyone’s bucket list. Every pub, church, park, music venue (indeed, any open space) becomes a stage for artists, actors, and comedians. These artists range from world class to distinctly amateur, and the fun lies in booking a string of shows across an afternoon/evening, likely from performers you will never have heard of, and rolling the dice. You will see some awful shows that make you cringe, then you’ll see a performance that will take your breath away leaving you in awe of their talent. The bars are open till 5am every day of the week (usually outlawed in Scotland), and the city is buzzing with excitement and energy. Edinburgh is the most cosmopolitan city in Scotland, but it is especially so for the Fringe, with ~1 million tourists visiting. During my undergrad, I worked in different bars each Fringe I was there, so have seen plenty of it, and I promise you won’t regret going, it’s a special time.

Anyway, if you want to explore more than just the Fringe, here’s some ideas of things to do. The two categories are: 1) Cities, 2) Special Events, and there’s a little aside at the end on golf courses. I think the sub-structure is self-explanatory. I hope it’s helpful!

Cities

Edinburgh

Things to do before dark

Walk up Arthur’s Seat via the Crags for a view of the entire city [map]
Walk up Calton Hill for a view of Princes Street, the Castle and the Firth of Forth [map]
The Castle (beware of the £20 entry fee if you’re light on $$$) [map]
Walk along the canal through Dean Village [map]
Portobello beach (if you’re feeling warm) [map]
Tour the university buildings; George Square/The Meadows [map] and Old College [map]

Things to do after dark

Sneaky Pete’s: playing techno/house, the best night in Edinburgh. Something happening most nights [map]
The Brass Monkey: an eccentric, unique locals bar [map]
The Devil’s Advocate: cool, sleek (and more expensive) city bar [map]
The Hanover Tap: easy-going student bar [map]
99 Hanover St.: cool city bar with the occasional DJ [map]
Garibaldi’s: wild club near many of the above bars. Order the Gari’s special at the bar! [map]
The Hanging Bat: cool bar with loads of beer [map]

Things to do if you have a car

Visit North Berwick and Gullane, two lovely seaside towns along the coast [map]
The Pentland Hills [map]

Things to do if hungover

Meltmongers (grilled cheese) [map]
Snax Cafe: infamous hangover food for pennies [map]
Wings: chicken wings in an unreasonable number of seasonings [map]

Lunch

Nile Valley Cafe: me and my mates' favourite spot in the city (Sudanese wraps) [map]
Tupiniquim (Brazilian crepes) [map]
J Reid Sandwich Shop: great salads [map]
Victor Hugo Deli [map]
10 to 10 in Delhi (Indian) [map]
The Scran and Scallie: up-market Scottish gastropub. Some of the best pub food in the city [map]

Dinner

Yene Meze (Greek) [map]
Fishers in the City: delicious, up-market fish restaurant [map]
The Bon Vivant: fantastic French restaurant [map]
The Outsider: chill atmosphere, serving delicious mixed cuisine [map]
The Grain Store: amazing Scottish cuisine, albeit pricey [map]

Coffee

Project Coffee [map]
Soderberg [map]
Maison de Moggy: a cafe full of cats [map]

Glasgow

Things to do before dark

Kelvingrove Art Gallery: some lovely exhibitions inc. Dali’s Christ of Saint John of the Cross [map]
Walk around the West End: Byres Road, Botanic Gardens, University Avenue etc. [map]
Take the Subway!

Things to do after dark

Ashton Lane (specifically Brel for a drink and the Ubiquitous Chip for food) [map]
The Arlington (an edgy locals bar) [map]
Oran Mor (if you’re out in this area, the West End, beyond midnight, this is the only place that’s open till 3am) [map]
Sub club (the longest running underground club in the world, the best spot for techno and a fun night) [map]
SWG3: the clue is in the name [map]

Things to do if you have a car

Loch Lomond and Conic Hill [map]
Climb Queen’s view [map]
Glengoyne Distillery [map]
Mugdock Country Park & Castle [map]

Things to do if hungover

The University Cafe: famous locals cafe that do a sick Scottish breakfast [map]
Hyndland Cafe: greasy breakfast food for locals [map]

Lunch

Epicures: Standard brunch food, delicious [map]
The Hanoi Bike Shop: easy Vietnamese food [map]

Dinner

Balbir’s: I think it’s the best curry I’ve had [map]
La Vita Spuntini: delicious Italian tapas [map]
Pizza Magic: the best pizza I’ve had in the city (also applies if hungover) [map]
Stravaigin: gourmet Scottish cuisine [map]

Coffee

Tchai-Ovna House of Tea: this isn’t coffee, but don’t worry about that just come here, drink their tea, smoke their shisha and play chess. It’s a great spot. [map]
Laboratorio Espresso: If you do need coffee, this place is pretty good [map]

St Andrews

Things to do before dark

Have a go at the O.G. crazy golf: The Himalayas. Pitch up and pay a couple of quid at the window and they will give you a putter and balls [map]
Walk along West Sands, where they filmed that scene in Chariots of Fire [map]
Get a fudge donut from Fisher & Donaldson [map]
Walk to East Sands and along the pier where the students meet before the start of the new academic year [map]
If you’re there on a Sunday, have a picnic on the 18th fairway of the Old Course. It’s public land and there’s no play on a Sunday, so you’ll meet lots of dog walkers. [map]
If you play golf, clearly try and play the Old; if you can’t get on there, my next favourite is the New. [map]

Things to do after dark

The Dunvegan: historic golf-themed pub [map]
The Keys: the localist of local pubs [map]
The Vic: the only place playing music till late [map]
The Jigger: sit and watch the golfers go by (nb: if you’re a golfer, the Jigger challenge is to nip in after you finish the 17th and drink as many pints as you like in half an hour, then you have to play the 18th in fewer shots than the number of pints drunk) [map]

Things to do if you have a car

Go to Elie, have lunch in The Ship and watch some beach cricket [map]
Anstruther for the best fish and chips in Scotland! [map]
Have a walk around the colourful houses in Pittenweem [map]

Things to do if hungover

Munch: cheap, delicious greasy food [map]
Toastie Bar: 50p toasties, ideal [map]

Lunch

Forgan’s: solid brunch menu [map]
CombiniCo: East Asian cuisine, sushi etc. [map]

Dinner

The Seafood Ristorante: amazing fresh fish from the harbour, pricey [map]
The Rav: British cuisine, stylish surroundings [map]

Coffee

Taste: my Uncle owns this place! [map]

Special Events

Edinburgh Fringe

Venues to visit

Pleasance Courtyard: high-quality comedy and drama and one of the cool old university buildings. The bars are sick too. [map]
Gilded Balloon: Similar to Pleasance, really cool old building with great acts performing every year. [map]
The Stand: the best comedy club in Edinburgh [map]
Underbelly Cowgate: a cool network of bars underneath the city, and a great spot to be for going out after a show [map]
Udderbelly: a large purple cow [map]

Recurring shows to see

Late n' Live: on every night in the Gilded Balloon from 1am. Different comedians are selected each evening to perform a set. They’ve already performed their usual, daily set earlier on, and have probably now had a few celebratory drinks. The crowd are equally loose and it creates a chaotic, hilarious atmosphere [map]

North Coast 500

If you’re planning a road trip through the Highlands then you’re probably familiar with the North Coast 500. You can’t go wrong following the intended route, but here’s a little advice on places I think you should definitely visit on the way.

Glenfinnan Viaduct: If you’ve seen Harry Potter then you’ll recognise this as the bridge the train crosses en route to Hogwarts. It’s beautiful, and also works as a great spot to stop for lunch. [map]
The James Bond View [map]
Eilean Donan Castle, also in James Bond lol [map]
The prettiest bit of road to drive is near Applecross [map]
Get a flight to Barra in the Outer Hebrides and land on the beach! [map]
Isle of Harris, and specifically Luskentyre Beach [map]
Take the shortest commercial flight in the world from Westray to Papa Westray. It lasts 53 seconds. [map]

An Aside on Golf Courses

If you are into golf, here’s my top 5 courses, and some hidden gems:

Top 5

Royal Dornoch
Muirfield
Carnoustie
Troon
Kingsbarns

I’m yet to play Turnberry since the updates, but I’m told it’s now no. 1.

Hidden Gems (in no particular order)

Ladybank
Moray
Nairn (if you consider it hidden)
Elie
Monifieth
Murcar
Shiskine
Luffness

If you have any questions, please do send me an email. You can find my address on the homepage.
