<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Posts on Scott Jeen</title>
        <link>https://enjeeneer.io/posts/</link>
        <description>Recent content in Posts on Scott Jeen</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en-us</language>
        <copyright>&lt;a href=&#34;https://creativecommons.org/licenses/by-nc/4.0/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CC BY-NC 4.0&lt;/a&gt;</copyright>
        <lastBuildDate>Mon, 31 Mar 2025 21:15:18 +0100</lastBuildDate>
        <atom:link href="https://enjeeneer.io/posts/index.xml" rel="self" type="application/rss+xml" />
        
        <item>
            <title>20s</title>
            <link>https://enjeeneer.io/posts/2025/03/20s/</link>
            <pubDate>Mon, 31 Mar 2025 21:15:18 +0100</pubDate>
            
            <guid>https://enjeeneer.io/posts/2025/03/20s/</guid>
            <description>I turned 30 today. Here are some particularly important moments from the last decade.
Highs
 drives to KB on frosty mornings with hamilton, hayden and adam lcd soundsystem on the west side highway with the roof down the beginning of infinity glasto nights with dom, deirdre and cat nablus with muath, fin, calum, dom and hayden francis and simone’s wedding david silver’s lectures and the first reading of sutton and barto sunny evening walks with cat around cambridge submitting the phd lunch conversations at fdm late night snacks in brooklyn after a double date sherkin island with cat ollie worrall on the 16th green at aldeburgh ashby lab with timo and josh the first drive down the backs past king’s college  Lows</description>
            <content type="html"><![CDATA[<p>I turned 30 today. Here are some particularly important moments from the last decade.</p>
<p><strong>Highs</strong></p>
<ul>
<li>drives to KB on frosty mornings with hamilton, hayden and adam</li>
<li>lcd soundsystem on the west side highway with the roof down</li>
<li>the beginning of infinity</li>
<li>glasto nights with dom, deirdre and cat</li>
<li>nablus with muath, fin, calum, dom and hayden</li>
<li>francis and simone’s wedding</li>
<li>david silver’s lectures and the first reading of sutton and barto</li>
<li>sunny evening walks with cat around cambridge</li>
<li>submitting the phd</li>
<li>lunch conversations at fdm</li>
<li>late night snacks in brooklyn after a double date</li>
<li>sherkin island with cat</li>
<li>ollie worrall on the 16th green at aldeburgh</li>
<li>ashby lab with timo and josh</li>
<li>the first drive down the backs past king’s college</li>
</ul>
<p><strong>Lows</strong></p>
<ul>
<li>chancery lane with lucy</li>
</ul>
]]></content>
        </item>
        
        <item>
            <title>Searching for useful problems</title>
            <link>https://enjeeneer.io/posts/2024/08/searching-for-useful-problems/</link>
            <pubDate>Mon, 26 Aug 2024 19:36:27 +0200</pubDate>
            
            <guid>https://enjeeneer.io/posts/2024/08/searching-for-useful-problems/</guid>
            <description>This post is based on a talk I gave to my research group at Cambridge University in June 2024. You can find the notes and slides for the talk here.
Solutions to most problems aren’t particularly useful. Solutions to a small number of problems are extremely useful. If you’re interested in doing good, you’ll want to search for problems that look like the latter. It’s a hard search, but the ROI is likely greater than any other use of your time, and I have some ways of running it that I think make it a little more tractable.</description>
            <content type="html"><![CDATA[<p><em>This post is based on a talk I gave to my research group at Cambridge University in June 2024. You can find the notes and slides for the talk <a href="https://enjeeneer.io/talks/2024-06-14-reffciency/">here</a>.</em></p>
<p>Solutions to most problems aren’t particularly useful. Solutions to a small number of problems are extremely useful. If you’re interested in doing good, you’ll want to search for problems that look like the latter. It’s a hard search, but the ROI is likely greater than any other use of your time, and I have some ways of running it that I think make it a little more tractable.</p>
<p>A solution to a problem provides some <em>gain</em> in exchange for some <em>cost</em>. Less useful problems are those whose solutions incur costs that are equal to or higher than the expected gains. Useful problems are those whose solutions provide gains that greatly outweigh the costs. The former are <em>zero-sum games</em>, and the latter are <em>positive-sum games.</em> Chess is the canonical zero-sum game: every gain in the position of Player A is a loss in the position of Player B. Trade is the canonical positive-sum game: when countries specialise in products they make most cheaply, and trade them for those they make less cheaply, both benefit. Maximising good necessitates working on problems with positive-sum solutions. <strong>Indeed, your goal should be to find problems whose solutions are maximally positive-sum.</strong></p>
<p>Identifying these problems is difficult for two reasons: 1) it’s hard to predict expected gain, and 2) it’s hard to predict expected cost (relative to the cost incurred by others working on the same problem<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup>). Below I discuss methods for dealing with each.</p>
<h3 id="moral-axioms-and-proxies-or-evaluating-the-expected-gain-from-a-solved-problem">Moral axioms and proxies (or evaluating the expected gain from a solved problem)</h3>
<p>Finding problems with solutions that maximise expected gain means knowing what you’d like to gain. It’s easy to convince yourself that you know this already, but, often, what you think you’d like to gain only approximates a more deeply held belief. You’ll know you’ve found these beliefs when they feel <em>axiomatic</em>, like unchallengeable truths for which you don’t need to provide justification. One might be that maximising net happiness is a good thing. Another might be that <a href="https://en.wikipedia.org/wiki/Preference_utilitarianism">fulfilling the desires</a> of the maximum number of people is a good thing. To maximise expected gain, you’ll want to solve problems that contribute to these <em>moral axioms</em> as much as possible<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>.</p>
<p>Your moral axioms are problems that you can’t access directly (I can’t click my fingers and make everyone happier). Instead, you access them through <em>proxies</em>—problems that, if solved, partially contribute to your moral axiom. These proxies will have their own sub-problems with solutions that partially contribute to their solution. Each moral axiom has an associated tree of proxies, as in Figure 1.</p>
<p><img src="/img/axioms-1.png" alt=""></p>
<p><small>Figure 1. <strong>Moral axioms and proxies.</strong> Starting with a <em>moral axiom</em> (a moral belief that requires no justification) we build a tree of sub-problems or <em>proxies.</em> Each proxy has a solution which part-solves the proxy from the previous level in the tree. The degree to which the previous proxy is solved by the current proxy is approximated by its <em>value</em> (a number between 0 and 1). For a given proxy, its expected value or <em>proxy score</em> with respect to the moral axiom is found by multiplying each of the values you traverse to find it in the tree.</small></p>
<p>Proxies contribute to other proxies to different degrees. In Figure 1, I consider incentivising solar PV as a proxy for mitigating climate change, which is itself a proxy for increasing average happiness. I provide two solutions: a) new panels with improved conversion efficiency, or b) cheaper (i.e. more) grid-scale batteries which allow generators to reliably sell excess energy to the grid. Here, I assume reliable income for surplus energy better incentivises solar PV than improved panel efficiency, so b) contributes more to the proxy than a). These contributions are summarised by their <em>value</em> with respect to the proxy above: a number between 0 and 1, where 0 represents no contribution and 1 constitutes a full solution to the problem. The expected value, or <em>proxy score</em>, of a solution with respect to the moral axiom is found by multiplying each of the values you traverse to find it in the tree. It’s called the proxy score because it measures how well the problem approximates the real problem you want to solve, i.e. your moral axiom. Proxies are ranked using these scores.</p>
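<p>As a rough illustration of the scoring rule described above, here is a minimal Python sketch. The tree layout, the function names and every number in it are assumptions made up for the example; they are not the values shown in Figure 1.</p>

```python
from math import prod

# Hypothetical tree: each proxy maps to (parent, value), where value in [0, 1]
# estimates how much solving the proxy contributes to solving its parent.
# All numbers are illustrative and are NOT taken from Figure 1.
TREE = {
    "mitigate climate change": ("increase average happiness", 0.3),
    "incentivise solar PV": ("mitigate climate change", 0.2),
    "improve panel efficiency": ("incentivise solar PV", 0.1),
    "cheaper grid-scale batteries": ("incentivise solar PV", 0.4),
}


def proxy_score(proxy, tree):
    """Multiply the values on the path from `proxy` up to the moral axiom (the root)."""
    values = []
    while proxy in tree:  # the root (moral axiom) has no entry, so the walk stops there
        parent, value = tree[proxy]
        values.append(value)
        proxy = parent
    return prod(values)


def rank_proxies(tree):
    """Return proxies sorted from highest to lowest proxy score."""
    return sorted(tree, key=lambda p: proxy_score(p, tree), reverse=True)
```

<p>With these made-up values, batteries score 0.4 × 0.2 × 0.3 and panel efficiency scores 0.1 × 0.2 × 0.3, so batteries rank higher. Re-scoring a single edge and calling <code>rank_proxies</code> again is the "re-ranking" step.</p>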
<p>Building the ground truth tree of proxies and values for a given moral axiom is clearly intractable. The best you can do is create as exhaustive a list of proxies as you can, and score them to the best of your current knowledge. What’s nice, however, is that building this tree is a lifelong project, and the tree can always be improved. As you learn more, you’ll identify new proxies and score them, or refine scores for old proxies, which will re-rank them. The trick is being willing to switch proxies when your ranking updates.</p>
<h3 id="being-the-pareto-best-in-the-world-or-evaluating-the-expected-cost-of-solving-a-problem">Being the (Pareto) best in the world (or evaluating the expected cost of solving a problem)</h3>
<p>If you are the best in the world at something you are uniquely positioned to solve its problems cheaply. Usain Bolt’s training and physique meant he was uniquely capable of running the 100m faster than anyone who came before. Marie Curie’s knowledge of radioactivity meant she was uniquely capable of proposing radiation therapy as a form of cancer treatment. But being the best in the world at something is hard. It is much easier to be the best in the world at a combination of things.</p>
<p>In the simplest case you can think about how you compare to others at a combination of two skills, as in Figure 2. Here, people are plotted with respect to their gardening and public speaking skill. I assume Mandela is history’s best public speaker and (for simplicity) knows nothing of gardening, and vice versa for Alan Titchmarsh. Between them lie three people, Jane, Julia and John, each of whom has a different combination of skills. Together these five make a Pareto front of gardening/public speaking knowledge.</p>
<p><img src="/img/costs.png" alt=""></p>
<p><small>Figure 2. <strong>Being the (Pareto) best in the world<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>.</strong> Five people plotted with respect to their gardening and public speaking skill. Problems reveal themselves to each person depending on their unique combination of skills.</small></p>
<p>Each person’s position on the Pareto front illuminates problems that are cheaply accessible to them. The further some person is from their neighbours, the more problems they can access. In practice, it’s difficult to be the best in the world across the intersection of two skills, but you may find you are the best in the world across the intersection of three, four or five. There may be nobody in the world better than Julia at gardening, public speaking, Catalan, and Rust programming. The limit case is the intersection of all of your skills, and at that you are definitively unmatched. There is nobody in the world better at being you than you.</p>
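<p>To make the Pareto front idea concrete, here is a small Python sketch. The skill scores are invented for illustration (they are not read off Figure 2), and "Sam" is a hypothetical sixth person added only to show someone who is <em>not</em> on the front.</p>

```python
# Hypothetical (gardening, public speaking) scores out of 10; illustrative only.
PEOPLE = {
    "Mandela": (0, 10),
    "Titchmarsh": (10, 0),
    "Jane": (3, 8),
    "Julia": (6, 6),
    "John": (8, 3),
    "Sam": (2, 2),  # hypothetical person who is worse than Jane at both skills
}


def dominates(a, b):
    """True if `a` is at least as good as `b` on every skill and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(people):
    """Return the names of everyone who is not dominated by anybody else."""
    return [
        name
        for name, skills in people.items()
        if not any(dominates(other, skills) for other in people.values())
    ]
```

<p>The same function works for any number of skills: adding a third or fourth score per person makes it harder for anyone to be dominated, which is the point made above about intersections of three, four or five skills.</p>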
<p>Your position on the real Pareto skill front changes as you learn new skills. The number of new problems cheaply accessible to you is dependent on what these new skills are. If you are a native English speaker, extending your English vocabulary won’t differentiate you much from other fluent English speakers, and is unlikely to unlock new problems that you can’t currently access. Learning Estonian places you in a much smaller cohort of English-Estonian speakers, and opens up some problems that are communicated only in Estonian.</p>
<p>Consider carefully what new skills would maximally differentiate you from your neighbours on the Pareto front, and learn them to increase the number of problems you can solve cheaply. The best skills to learn are those that illuminate problems with highest proxy score as established from your moral axioms and proxies.</p>
<h3 id="key-takeaways">Key takeaways</h3>
<ol>
<li>Think deeply about your <em>moral axioms</em> (beliefs that require no justification), and build a tree of <em>proxies</em> (sub-problems) and their values that is as exhaustive as possible.</li>
<li>Rank proxies by their <em>proxy score</em> (expected value).</li>
<li>Think about what combination of skills you are Pareto best at, and find the proxy with the highest score that requires these (i.e. for which you have a comparative advantage in solving).</li>
<li>Think about new skills that, if learned, would unlock proxies with higher proxy scores. Learn them.</li>
</ol>
<h3 id="further-reading">Further reading</h3>
<ul>
<li><a href="https://ofirnachum.github.io/posts/baselines-and-oracles/">Ofir Nachum’s <em>Baselines and Oracles</em></a> for checking that a solution exists to a given problem.</li>
<li><a href="https://www.weizmann.ac.il/mcb/alon/sites/mcb.UriAlon/files/uploads/nurturing/howtochoosegoodproblem.pdf">Uri Alon’s <em>How To Choose a Good Scientific Problem</em></a> for other ways of thinking about problem usefulness and feasibility.</li>
</ul>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>AKA your <em>comparative advantage</em>.&#160;<a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Of course, this is closely related to ideas from <a href="https://en.wikipedia.org/wiki/Utilitarianism">Utilitarianism</a> and <a href="https://forum.effectivealtruism.org/topics/itn-framework">Effective Altruism</a>. The main difference being that Utilitarians and EAs pick problems w.r.t. a fixed, predetermined set of moral beliefs, whereas here I allow you to pick problems w.r.t <em>your</em> moral beliefs. This is important because you won’t work hard on a problem if you don’t believe its solution will be useful.&#160;<a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Cf. the <a href="https://www.lesswrong.com/posts/XvN2QQpKTuEzgkZHY/being-the-pareto-best-in-the-world">original LessWrong post by johnswentworth</a> that inspired this.&#160;<a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</section>
]]></content>
        </item>
        
        <item>
            <title>NeurIPS 2022</title>
            <link>https://enjeeneer.io/posts/2023/01/neurips-2022/</link>
            <pubDate>Thu, 26 Jan 2023 21:08:05 +0000</pubDate>
            
            <guid>https://enjeeneer.io/posts/2023/01/neurips-2022/</guid>
            <description>I was fortunate to attend NeurIPS in New Orleans in November. Here, I publish my takeaways to give you a feel for the zeitgeist. I’ll discuss, firstly, the papers, then the workshops, and finally, and briefly, the keynotes.
Papers Here’s a ranked list of my top 8 papers. Most are on Offline RL, which is representative of the conference writ large.
1. Does Zero-Shot Reinforcement Learning Exist? (Touati et al., 2022)</description>
            <content type="html"><![CDATA[<p>I was fortunate to attend NeurIPS in New Orleans in November. Here, I publish my takeaways to give you a feel for the zeitgeist. I’ll discuss, firstly, the papers, then the workshops, and finally, and briefly, the keynotes.</p>
<h2 id="papers">Papers</h2>
<p>Here’s a ranked list of my top 8 papers. Most are on Offline RL, which is representative of the conference writ large.</p>
<p><strong>1. <a href="https://arxiv.org/pdf/2209.14935.pdf">Does Zero-Shot Reinforcement Learning Exist? (Touati et al., 2022)</a></strong></p>
<figure><img src="/img/toutati2022.jpg"/>
</figure>

<p><strong>Key idea.</strong> To do zero-shot RL, we need to learn a general function from reward-free transitions that implicitly encodes the trajectories of <strong>all</strong> optimal policies for <strong>all</strong> tasks. The authors propose to learn two functions: \(F_\theta(s)\) and  \(B_\phi(s)\) that encode the future and past of state \(s\). We want to learn functions that <strong>always</strong> find a route from \(s \rightarrow s'\).</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>They beat all previous zero-shot RL algorithms on the standard offline RL tasks, and approach the performance of online, reward-guided RL algorithms in some envs.</li>
</ul>
<p><strong>Misc thoughts:</strong></p>
<ul>
<li>It seems clear that zero-shot RL is the route to real world deployment for RL. This work represents the best effort I’ve seen in this direction. I’m really excited by it and will be looking to extend it in my own future work.</li>
</ul>
<hr>
<p><strong>2. <a href="https://arxiv.org/pdf/2206.05314.pdf">Large Scale Retrieval for Reinforcement Learning (Humphreys et al., 2022)</a></strong></p>
<figure><img src="/img/largescaleretrieval.png" width="500" height="600"/>
</figure>

<p><strong>Key idea.</strong> Assuming access to a large offline dataset, we perform a nearest neighbours search over the dataset w.r.t. the current state, and append the retrieved states, next actions, rewards and final states (in the case of Go) to the current state. The policy then acts w.r.t. this augmented state.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li><strong>Halves</strong> compute required to achieve the baseline win-rate in Go.</li>
</ul>
<p><strong>Misc thoughts:</strong></p>
<ul>
<li>This represents the most novel approach to offline RL I’ve seen; most techniques separate the offline and online learning phases, but here the authors combine them elegantly.</li>
<li>To me this feels like a far more promising approach to offline RL than CQL etc.</li>
</ul>
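<p>The retrieve-then-augment step from the key idea above can be sketched in a few lines of Python. This is a toy under loud assumptions: exact Euclidean nearest neighbours over raw states, a three-entry hand-written dataset, and hypothetical names (<code>retrieve</code>, <code>augment</code>, <code>DATASET</code>). It does not reproduce the paper's actual retrieval mechanism or network.</p>

```python
import math

# Hypothetical offline dataset of (state, action, reward) records; illustrative only.
DATASET = [
    {"state": (0.0, 0.0), "action": 1, "reward": 0.0},
    {"state": (1.0, 1.0), "action": 0, "reward": 1.0},
    {"state": (5.0, 5.0), "action": 2, "reward": -1.0},
]


def retrieve(query_state, dataset, k=2):
    """Return the k records whose states are closest (Euclidean) to the query state."""
    return sorted(dataset, key=lambda rec: math.dist(rec["state"], query_state))[:k]


def augment(query_state, dataset, k=2):
    """Append the retrieved states, actions and rewards to the current state."""
    augmented = list(query_state)
    for rec in retrieve(query_state, dataset, k):
        augmented += list(rec["state"]) + [rec["action"], rec["reward"]]
    return augmented
```

<p>A policy would then take the output of <code>augment</code> as its input instead of the raw state.</p>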
<hr>
<p><strong>3. <a href="https://arxiv.org/pdf/2206.00730.pdf">The Phenomenon of Policy Churn (Schaul et al., 2022)</a></strong></p>
<p><figure><img src="/img/churn1.png"/>
</figure>

<figure><img src="/img/churn2.png"/>
</figure>
</p>
<p><strong>Key idea.</strong> When a value-based agent acts greedily, the policy updates by a surprising amount per gradient step e.g. in up to 10% of states in some cases.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>Policy churn means that \(\epsilon\)-greedy exploration may not be required, as a rapidly changing policy induces enough noise into the data distribution that exploration may be implicit.</li>
</ul>
<p><strong>Misc thoughts:</strong></p>
<ul>
<li>Their paper is structured in a really engaging way.</li>
<li>I liked their ML researcher survey which quantified how surprising their result was to experts.</li>
</ul>
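<p>One simple way to quantify the churn described in the key idea above is the fraction of states whose greedy action changes across a single update. The sketch below assumes tabular Q-values stored as nested lists; the function names and the numbers are hypothetical and the paper's own measurement protocol is not reproduced here.</p>

```python
def greedy_action(q_values):
    """Index of the highest-valued action (first index wins ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def policy_churn(q_before, q_after):
    """Fraction of states whose greedy action differs between two Q-tables."""
    changed = sum(
        greedy_action(before) != greedy_action(after)
        for before, after in zip(q_before, q_after)
    )
    return changed / len(q_before)


# Illustrative Q-tables for 4 states and 2 actions, before and after one update.
Q_BEFORE = [[1.0, 0.5], [0.2, 0.3], [0.0, 1.0], [0.9, 0.1]]
Q_AFTER = [[1.0, 0.6], [0.31, 0.3], [0.0, 1.1], [0.8, 0.2]]
```

<p>Here only the second state flips its greedy action, so <code>policy_churn(Q_BEFORE, Q_AFTER)</code> is one in four, even though every Q-value moved only slightly.</p>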
<hr>
<p><strong>4. <a href="https://arxiv.org/pdf/2206.08853.pdf">MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge (Fan et al., 2022)</a></strong></p>
<figure><img src="/img/minedojo.jpg"/>
</figure>

<p><strong>Key idea.</strong> An internet-scale benchmark for generalist RL agents. 1000s of tasks, and a limitless procedurally-generated world for training.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>Provides a sufficiently diverse and complex sandbox for training more generally capable agents.</li>
</ul>
<p><strong>Misc thoughts:</strong></p>
<ul>
<li>This is an amazing feat of software development from a relatively small team. Jim Fan is so cool!</li>
</ul>
<hr>
<p><strong>5. <a href="https://arxiv.org/pdf/2210.05805.pdf">Exploration via Elliptical Episodic Bonuses (Henaff et al., 2022)</a></strong></p>
<figure><img src="/img/ellipticalbonus.png"/>
</figure>

<p><strong>Key Idea.</strong> Guided exploration is often performed by giving the agent a reward inversely proportional to the state visitation count, i.e. if you haven’t visited this state much you receive added reward. This works for discrete state spaces, but in continuous state spaces each visited state is approximately unique. Here, the authors parameterise ellipses around visited states, specifying a <em>region</em> of nearby states, outside of which the agent receives added reward.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>Better exploration means SOTA on the MiniHack suite of envs.</li>
<li>Strong performance on reward-free exploration tasks, i.e. this is a really good way of thinking about exploration.</li>
</ul>
<p><strong>Misc. thoughts:</strong></p>
<ul>
<li>I really liked the elegance of this idea. A good example of simple, well-examined ideas being useful to the community.</li>
</ul>
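<p>A common way to realise an elliptical bonus of the kind described in the key idea above is \(b_t = \phi_t^\top C_{t-1}^{-1} \phi_t\), where \(C_{t-1} = \lambda I + \sum_{i \lt t} \phi_i \phi_i^\top\) accumulates the features seen so far in the episode. The Python sketch below assumes that form, uses hand-written feature vectors (a real agent would learn them), and keeps the inverse up to date with the Sherman-Morrison identity. The function name and the example inputs are hypothetical.</p>

```python
def elliptical_bonuses(features, lam=1.0):
    """Return b_t = phi_t^T C_{t-1}^{-1} phi_t for each feature vector in order.

    C_0 = lam * I and C_t = C_{t-1} + phi_t phi_t^T. The inverse is maintained
    with the Sherman-Morrison rank-one update, so no matrix inversion is needed.
    """
    d = len(features[0])
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    bonuses = []
    for phi in features:
        # u = C^{-1} phi
        u = [sum(inv[i][j] * phi[j] for j in range(d)) for i in range(d)]
        bonus = sum(phi[i] * u[i] for i in range(d))
        bonuses.append(bonus)
        # (C + phi phi^T)^{-1} = C^{-1} - (u u^T) / (1 + phi^T C^{-1} phi)
        inv = [
            [inv[i][j] - u[i] * u[j] / (1.0 + bonus) for j in range(d)]
            for i in range(d)
        ]
    return bonuses
```

<p>Calling <code>elliptical_bonuses([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)])</code> gives a smaller bonus for the repeated second vector than for the first, while the third vector, which points in an unvisited direction, receives the full bonus again.</p>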
<hr>
<p><strong>6. <a href="https://arxiv.org/pdf/2205.15967.pdf">You Can’t Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments (Paster et al., 2022)</a></strong></p>
<figure><img src="/img/luck.png"/>
</figure>

<p><strong>Key Idea.</strong> In a stochastic environment, trajectories in a dataset used to train a decision transformer may be high-reward by chance. Here the authors cluster similar trajectories and find their expected reward to mitigate overfitting to lucky trajectories.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>Decision transformers trained on these new objectives exhibit policies that are better aligned with the return conditioning of the user.</li>
</ul>
<p><strong>Misc. thoughts:</strong></p>
<ul>
<li>Another simple idea with positive implications for performance.</li>
</ul>
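<p>The relabelling step in the key idea above (replace each trajectory's return with the average return of its cluster) can be sketched as follows. The clustering function is supplied by hand here, which is a strong simplification; how the paper actually forms its clusters is not reproduced. All names and data are hypothetical.</p>

```python
from collections import defaultdict

# Toy dataset: "gamble" pays 10 or 0 by chance, "safe" always pays 4.
TRAJECTORIES = [
    {"action": "gamble", "return": 10.0},  # a lucky trajectory
    {"action": "gamble", "return": 0.0},
    {"action": "safe", "return": 4.0},
    {"action": "safe", "return": 4.0},
]


def relabel_with_cluster_mean(trajectories, cluster_of):
    """Attach a target return equal to the mean return of each trajectory's cluster."""
    returns_by_cluster = defaultdict(list)
    for traj in trajectories:
        returns_by_cluster[cluster_of(traj)].append(traj["return"])
    means = {c: sum(rs) / len(rs) for c, rs in returns_by_cluster.items()}
    return [dict(traj, target_return=means[cluster_of(traj)]) for traj in trajectories]


RELABELLED = relabel_with_cluster_mean(TRAJECTORIES, cluster_of=lambda t: t["action"])
```

<p>After relabelling, no trajectory carries a target of 10, so conditioning on a return of 10 no longer points the model at an outcome that was only reachable by luck.</p>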
<hr>
<p><strong>7. <a href="https://nips.cc/virtual/2022/poster/52843">Multi-Game Decision Transformers</a> (Lee et al., 2022)</strong></p>
<figure><img src="/img/multigame.png"/>
</figure>

<p><strong>Key idea.</strong> Instead of predicting just the next action conditioned on state and return-to-go like the original decision transformer paper, they predict the intermediate reward and return-to-go. This allows them to re-condition on new returns-to-go at each timestep, using a clever sampling procedure that samples likely expert returns-to-go.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>SOTA on standard Atari offline RL tasks.</li>
</ul>
<p><strong>Misc thoughts:</strong></p>
<ul>
<li>This work is very similar to the original decision transformer paper, so I’m surprised that it received a best paper award.</li>
<li>It represents continued progress in the field of offline RL, and more specifically, decision transformer style architectures.</li>
</ul>
<hr>
<p><strong>8. <a href="https://arxiv.org/abs/2206.01079">When does return-conditioned supervised learning work for offline reinforcement learning? (Brandfonbrener et al., 2022)</a></strong></p>
<p><strong>Key idea.</strong> Much recent work on offline RL can be cast as supervised learning on a near-optimal offline dataset then conditioning on high rewards from the dataset at test time; under what conditions is this a valid approach? Here the authors prove that this (unsurprisingly) only works when two conditions are met: 1) the test envs are (nearly) deterministic, and 2) there is trajectory-level coverage in the dataset.</p>
<p><strong>Implication(s):</strong></p>
<ul>
<li>Current approaches to offline RL will not work in the real world because real envs are generally stochastic.</li>
</ul>
<p><strong>Misc thoughts:</strong></p>
<ul>
<li>I liked that the authors proved the community’s intuitions on current approaches to offline RL that, although somewhat obvious in retrospect, had not been verified.</li>
</ul>
<hr>
<h2 id="workshops">Workshops</h2>
<p>I attended 5 workshops:</p>
<ol>
<li>Foundation Models for Decision Making</li>
<li>Safety</li>
<li>Offline RL</li>
<li>Real Life Reinforcement Learning</li>
<li>Tackling Climate Change with Machine Learning</li>
</ol>
<p>I found the latter three to be interesting, but less informative and prescient than the first two. I therefore only discuss the Foundation Models for Decision Making and Safety workshops; the extent to which I enjoyed both workshops is, in a sense, oxymoronic.</p>
<h3 id="foundation-models-for-decision-making">Foundation Models for Decision Making</h3>
<p><strong>Leslie P. Kaelbling: What does an intelligent robot need to know?</strong></p>
<figure><img src="/img/kaelbling2022.jpg"/>
</figure>

<p>My favourite talk was from <a href="https://scholar.google.com/citations?user=IcasIiwAAAAJ&amp;hl=en">Leslie Kaelbling</a> of MIT. Kaelbling focussed on our proclivity for building inductive biases into our models (a similar thesis to Sutton’s <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">Bitter Lesson</a>); though good in the short term, the effectiveness of such priors plateaus in the long run. I agree with her.</p>
<p>She advocates for a marketplace of pre-trained models of the following types:</p>
<ul>
<li>Foundation: space, geometry, kinematics</li>
<li>Psychology: other agents, beliefs, desires etc.</li>
<li>Culture: how you do things in the world e.g. stuff you can read in books</li>
</ul>
<p>Robotics manufacturers will provide:</p>
<ul>
<li>observation / perception</li>
<li>actuators</li>
<li>controllers e.g. policies</li>
</ul>
<p>And we’ll use our own expertise to build local states (specific facts about the env) and encode long horizon memories e.g. what did I do 2 years ago.</p>
<hr>
<h3 id="safety-unofficial-in-the-marriott-across-the-road">Safety (unofficial; in the Marriott across the road)</h3>
<p>The safety workshop was wild. It was a small, unofficial congregation of researchers who you’d expect to see lurking on <a href="https://www.lesswrong.com/">Less Wrong</a> and other <a href="https://forum.effectivealtruism.org/">EA forums</a>.</p>
<p><strong><a href="http://christoph-schuhmann.de/">Christoph Schuhmann</a> (Founder of LAION)</strong></p>
<p>Chris is a high school teacher from Vienna; he gave an inspiring talk on the open-sourcing of foundation models. He started LAION (Large-scale Artificial Intelligence Open Network), a non-profit organisation that provides datasets, tools and models to democratise ML research. His key points included:</p>
<ul>
<li>centralised intelligence means centralised problem solving; we can’t give the keys to problem solving to a (potentially) dictatorial few.</li>
<li>the risks of not open-sourcing AI are bigger than those of open-sourcing it</li>
<li>LAION progress:
<ul>
<li>initial plan was to replicate the original CLIP / DALL-E 1</li>
<li>got 3m image text pairs on his own</li>
<li>discord server helped him get 300m image text pairs, then 5b pairs</li>
<li>hedge fund gave them 8 A100s</li>
</ul>
</li>
<li>We will always want to do things even if AI can, because we need to express ourselves</li>
</ul>
<p><strong>Thomas Wolf (Hugging Face CEO)</strong></p>
<p>Tom Wolf gave a talk on the <a href="https://www.notion.so/NeurIPS-a25bdf6d9af045f9bff65177f2833cfa">Big Science initiative</a>, a project that takes inspiration from scientific creation schemes such as CERN and the LHC, in which open scientific collaborations facilitate the creation of large-scale artefacts that are useful for the entire research community:</p>
<ul>
<li>1000+ researchers coming together to build a massive language model and a massive dataset</li>
<li>efficient AGI will probably require modularity (cf. LeCun)</li>
<li>working on the energy efficiency of training is inherently democratic i.e. stops models being held by the rich, especially re: inference</li>
</ul>
<p><strong>Are AI researchers aligned on AGI alignment?</strong></p>
<p>There was an interesting round table at the end of the workshop that included <a href="https://scholar.google.com/citations?user=KNr3vb4AAAAJ&amp;hl=en">Jared Kaplan</a> (Anthropic) and <a href="https://scholar.google.ca/citations?user=5Uz70IoAAAAJ&amp;hl=en">David Krueger</a> (Cambridge) discussing what it means to align AGI. There was little agreement.</p>
<hr>
<h2 id="keynotes">Keynotes</h2>
<p>I attended 4 of the 6 keynotes which were:</p>
<ol>
<li><strong>David Chalmers:</strong> <a href="https://nips.cc/virtual/2022/invited-talk/55867">Are Large Language Models Sentient?</a></li>
<li><strong>Emmanuel Candes:</strong> <a href="https://nips.cc/virtual/2022/invited-talk/55872">Conformal Prediction in 2022</a></li>
<li><strong>Isabelle Guyon:</strong> <a href="https://nips.cc/virtual/2022/invited-talk/56158">The Data-Centric Era: How ML is Becoming an Experimental Science</a></li>
<li><strong>Geoff Hinton:</strong> <a href="https://nips.cc/virtual/2022/invited-talk/55869">The Forward-Forward Algorithm for Training Deep Neural Networks</a></li>
</ol>
<p>I found Emmanuel’s talk on conformal prediction enlightening as I’d never heard of the topic (<a href="https://arxiv.org/abs/2107.07511#:~:text=Conformal%20prediction%20is%20a%20user,distributional%20assumptions%20or%20model%20assumptions.">here’s a primer</a>), and Isabelle’s talk on benchmark and data transparency to be agreeable, if a little unoriginal. Hinton’s talk on a more anatomically correct learning algorithm was interesting, but I’m as yet unconvinced that mimicking human intelligence is a good way of building systems that are superior to humans, since we are able to leverage hardware for artificial systems far superior to that accessible to humans. Chalmers’ talk was extremely thought-provoking; he structured the problem of consciousness in LLMs excellently, far better than I’ve seen to date, and as such it was my favourite of the four.</p>
<p>I have linked to each of the talks above; they are freely available to view.</p>
<h3 id="references">References</h3>
<p>Fan, L.; Wang, G.; Jiang, Y.; Mandlekar, A.; Yang, Y.; Zhu, H.; Tang, A.; Huang, D.-A.; Zhu, Y.; and Anandkumar, A. 2022. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in neural information processing systems, 35.</p>
<p>Henaff, M.; Raileanu, R.; Jiang, M.; and Rocktäschel, T. 2022. Exploration via Elliptical Episodic Bonuses. Advances in neural information processing systems, 35.</p>
<p>Humphreys, P. C.; Guez, A.; Tieleman, O.; Sifre, L.; Weber, T.; and Lillicrap, T. 2022. Large-Scale Retrieval for Reinforcement Learning. Advances in neural information processing systems, 35.</p>
<p>Lee, K.-H.; Nachum, O.; Yang, M.; Lee, L.; Freeman, D.; Xu, W.; Guadarrama, S.; Fischer, I.; Jang, E.; Michalewski, H.; et al. 2022. Multi-game decision transformers. Advances in neural information processing systems, 35.</p>
<p>Paster, K.; McIlraith, S.; and Ba, J. 2022. You Can’t Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments. Advances in neural information processing systems, 35.</p>
<p>Schaul, T.; Barreto, A.; Quan, J.; and Ostrovski, G. 2022. The phenomenon of policy churn. Advances in neural information processing systems, 35.</p>
<p>Touati, A.; Rapin, J.; and Ollivier, Y. 2022. Does Zero-Shot Reinforcement Learning Exist?</p>
]]></content>
        </item>
        
        <item>
            <title>One Hour RL</title>
            <link>https://enjeeneer.io/posts/2022/02/one-hour-rl/</link>
            <pubDate>Fri, 25 Feb 2022 15:23:20 +0000</pubDate>
            
            <guid>https://enjeeneer.io/posts/2022/02/one-hour-rl/</guid>
            <description>An Introduction to Reinforcement Learning Tom Bewley &amp;amp; Scott Jeen Alan Turing Institute 24/02/2022 The best way to walk through this tutorial is using the accompanying Jupyter Notebook:

[Jupyter Notebook]
1 | Markov Decision Processes: A Model of Sequential Decision Making 1.1. MDP (semi-)Formalism In reinforcement learning (RL), an agent takes actions in an environment to change its state over discrete timesteps $t$, with the goal of maximising the future sum of a scalar quantity known as reward.</description>
            <content type="html"><![CDATA[<h1 id="an-introduction-to-reinforcement-learning">An Introduction to Reinforcement Learning</h1>
<h2 id="tom-bewleyhttpstombewleycom----scott-jeenhttpsenjeeneerio"><a href="https://tombewley.com/">Tom Bewley</a>  &amp;  <a href="https://enjeeneer.io/">Scott Jeen</a></h2>
<h2 id="alan-turing-institute">Alan Turing Institute</h2>
<h3 id="24022022">24/02/2022</h3>
<p>The best way to walk through this tutorial is using the accompanying Jupyter Notebook:</p>
<p><a href="http://colab.research.google.com/github/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/notebook.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a></p>
<p>[<a href="http://nbviewer.jupyter.org/github/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/notebook.ipynb">Jupyter Notebook</a>]</p>
<h1 id="1--markov-decision-processes-a-model-of-sequential-decision-making">1 | Markov Decision Processes: A Model of Sequential Decision Making</h1>
<h2 id="11-mdp-semi-formalism">1.1. MDP (semi-)Formalism</h2>
<p>In reinforcement learning (RL), an <em>agent</em> takes <em>actions</em> in an <em>environment</em> to change its state over discrete timesteps $t$, with the goal of maximising the future sum of a scalar quantity known as <em>reward</em>. We formalise this interaction as an agent-environment loop, mathematically described as a Markov Decision Process (MDP).</p>
<img src='https://github.com/enjeeneer/sutton_and_barto/blob/main/images/chapter3_1.png?raw=true' width='700'>
<p>MDPs break the I.I.D. data assumption of supervised and unsupervised learning; the agent <em>causally influences</em> the data it sees through its choice of actions. However, one assumption we do make is the <em>Markov property</em>, which says that the state representation captures <em>all relevant information</em> from the past. Formally, state transitions depend only on the most recent state and action,
$$
\mathbb{P}[S_{t+1} | S_1,A_1 \ldots, S_t,A_t]=\mathbb{P}[S_{t+1} | S_t,A_t],
$$
and rewards depend only on the most recent transition,
$$
\mathbb{P}[R_{t+1} | S_1,A_1 \ldots, S_t,A_t,S_{t+1}] = \mathbb{P}[R_{t+1} | S_t,A_t,S_{t+1}].
$$</p>
<ul>
<li>Note: different sources use different notation here, but this is the most general.</li>
</ul>
<p>In some MDPs, a subset of states are designated as <em>terminal</em> (or <em>absorbing</em>). The agent-environment interaction loop ceases once a terminal state is reached, and restarts at $t=0$ by sampling a state from an initialisation distribution $S_0\sim\mathbb{P}_\text{init}$. Such MDPs are known as <em>episodic</em>, while those without terminal states are known as <em>continuing</em>.</p>
<p>The goal of an RL agent is to pick actions that maximise the discounted cumulative sum of future rewards, also known as the <em>return</em> $G_t$:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots + \gamma^{T-t-1}R_{T},
$$
where $\gamma\in[0,1]$ is a discount factor and $T$ is the time of termination ($\infty$ in continuing MDPs).</p>
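<p>To make the definition concrete, here is a minimal sketch that computes $G_t$ for a finite reward sequence. The function name <code>discounted_return</code> and the reward values are illustrative choices made for this example, not part of the accompanying notebook.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def discounted_return(rewards, gamma):
    """Return G_t for the reward sequence R_{t+1}, ..., R_T."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

rewards = [-2.0, -2.0, 10.0, 0.0]  # illustrative values, chosen for this sketch
print(discounted_return(rewards, gamma=0.95))  # approximately 5.125
print(discounted_return(rewards, gamma=0.0))   # -2.0: only the immediate reward counts
</code></pre></div>
<p>Smaller values of $\gamma$ make the agent more short-sighted: with $\gamma=0$ the large reward at the third step does not contribute to $G_t$ at all.</p>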
<p>To do so, it needs the ability to forecast the reward-getting effect of taking each action $A$ in each state $S$, potentially many timesteps into the future. This <em>temporal credit assignment</em> problem is one of the key factors that makes RL so challenging.</p>
<p>Before we go on, it&rsquo;s worth reflecting on how general the MDP formulation is. An extremely large class of problems can be cast as MDPs (it&rsquo;s even possible to represent supervised learning as a special case), and <a href="https://reader.elsevier.com/reader/sd/pii/S0004370221000862?token=3A56DFC12064E559FBF2F53CBE7A85E4E4BE24160CC0B9DDDAE18351D2FE61DA3BF02167A8FCAE3398396BBBEDFDA7A9&amp;originRegion=eu-west-1&amp;originCreation=20220224085155">this recent DeepMind paper</a> goes as far as to say that <em>all aspects of general intelligence</em> can be understood as serving the maximisation of future reward. Although not everybody agrees, this attitude motivates the heavy RL focus at organisations like DeepMind and OpenAI.</p>
<h2 id="12-mdp-example">1.2 MDP Example</h2>
<p>Here&rsquo;s a simple MDP (courtesy of David Silver @ DeepMind/UCL), which we&rsquo;ll be using throughout this course.</p>
<ul>
<li>White circle: non-terminal state</li>
<li>White square: terminal state</li>
<li>Black circle: action</li>
<li><span style="color:green">Green:</span> reward (depends only on $S_{t+1}$ here)</li>
<li><span style="color:blue">Blue:</span> state transition probability</li>
<li><span style="color:red">Red:</span> action probability for an exemplar policy</li>
<li>Note: edges with probability $1$ are unlabelled</li>
</ul>
<img src='https://github.com/tombewley/one-hour-rl/blob/main/images/student-mdp.svg?raw=true' width='700'>
<h2 id="13-open-ai-gym">1.3 OpenAI Gym</h2>
<p><a href="https://gym.openai.com/">OpenAI Gym</a> provides a unified framework for testing and comparing RL algorithms in Python, and offers a suite of MDPs that researchers can use to benchmark their work. It&rsquo;s important to be familiar with the conventions of Gym, because almost all modern RL code is built to work with it. Gym environment classes have two key methods:</p>
<ul>
<li><code>mdp.reset()</code>: reset the MDP to an initial state $S_0$ according to the initialisation distribution $\mathbb{P}_\text{init}$.</li>
<li><code>mdp.step(action)</code>: given an action $A_t$, combine it with the current state $S_t$ to produce the next state according to $\mathbb{P}[S_{t+1} | S_t,A_t]$ and a scalar reward according to $\mathbb{P}[R_{t+1} | S_t,A_t,S_{t+1}]$.</li>
</ul>
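<p>As a rough illustration of this convention (and not the <code>StudentMDP</code> class itself), here is a minimal sketch of a class exposing the same two methods. The class name <code>CoinFlipMDP</code>, its states and its rewards are all made up for this example. It returns the four-element tuple <code>(state, reward, done, info)</code> that the calls later in this tutorial unpack.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import random

class CoinFlipMDP:
    """Hypothetical two-state MDP, invented for this sketch."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # The initialisation distribution puts all probability on "Start".
        self.state = "Start"
        return self.state

    def step(self, action):
        # Only one action ("flip") exists, so the argument is ignored here.
        # The next state is "Start" or "End" with probability 0.5 each.
        next_state = self.rng.choice(["Start", "End"])
        # The reward depends only on the next state.
        reward = 1.0 if next_state == "End" else 0.0
        done = next_state == "End"
        self.state = next_state
        return next_state, reward, done, {}

env = CoinFlipMDP(seed=0)
state = env.reset()
done = False
while not done:
    state, reward, done, info = env.step("flip")
    print(state, reward, done)
</code></pre></div>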
<p>A Gym-compatible class for the student MDP shown above can be found in <code>mdp.py</code> in this repository. Let&rsquo;s import it now and explore what it can do!</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> mdp <span style="color:#f92672">import</span> StudentMDP
mdp <span style="color:#f92672">=</span> StudentMDP()
</code></pre></div><p>Firstly, we&rsquo;ll have a look at the initialisation probabilities and the behaviour of <code>mdp.reset()</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">print(mdp<span style="color:#f92672">.</span>initial_probs())
mdp<span style="color:#f92672">.</span>reset()
print(mdp<span style="color:#f92672">.</span>state)
</code></pre></div><pre><code>{'Class 1': 1.0, 'Class 2': 0.0, 'Class 3': 0.0, 'Facebook': 0.0, 'Pub': 0.0, 'Pass': 0.0, 'Asleep': 0.0}
Class 1
</code></pre>
<p>Next, let&rsquo;s check which actions are available in this initial state, and the action-dependent transition probabilities $\mathbb{P}[S_{t+1}|\text{Class 1},A_t]$.</p>
<ul>
<li>Reminder: the Markov property dictates that transition probabilities depend <em>only</em> on the current state and action.</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">print(mdp<span style="color:#f92672">.</span>action_space(mdp<span style="color:#f92672">.</span>state))
print(mdp<span style="color:#f92672">.</span>transition_probs(mdp<span style="color:#f92672">.</span>state, <span style="color:#e6db74">&#34;Study&#34;</span>))
print(mdp<span style="color:#f92672">.</span>transition_probs(mdp<span style="color:#f92672">.</span>state, <span style="color:#e6db74">&#34;Go on Facebook&#34;</span>))
</code></pre></div><pre><code>{'Study', 'Go on Facebook'}
{'Class 2': 1.0}
{'Facebook': 1.0}
</code></pre>
<p>Calling <code>mdp.step(action)</code> samples and returns the next state $S_{t+1}$, alongside the reward $R_{t+1}$. Let&rsquo;s try calling this method repeatedly. What&rsquo;s happening here?</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">state, reward, _, _ <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(<span style="color:#e6db74">&#34;Study&#34;</span>) 
print(state, reward)
</code></pre></div><pre><code>Class 2 -2.0
</code></pre>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">mdp<span style="color:#f92672">.</span>action_space(<span style="color:#e6db74">&#34;Pass&#34;</span>)
</code></pre></div><pre><code>{'Fall asleep'}
</code></pre>
<p>So far, we&rsquo;ve only seen <em>deterministic</em> transitions, but having a pint in the pub has a <em>stochastic</em> effect; the state goes to one of the three classes with specified probabilities.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">print(mdp<span style="color:#f92672">.</span>action_space(<span style="color:#e6db74">&#34;Pub&#34;</span>))
print(mdp<span style="color:#f92672">.</span>transition_probs(<span style="color:#e6db74">&#34;Pub&#34;</span>, <span style="color:#e6db74">&#34;Have a pint&#34;</span>))
</code></pre></div><pre><code>{'Have a pint'}
{'Class 1': 0.2, 'Class 2': 0.4, 'Class 3': 0.4}
</code></pre>
<p>In this state, the behaviour of <code>mdp.step(action)</code> changes between repeated calls, even for a constant action.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">mdp<span style="color:#f92672">.</span>state <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;Pub&#34;</span> <span style="color:#75715e"># Note that we&#39;re resetting the state to Pub each time</span>
state, reward, _, _ <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(<span style="color:#e6db74">&#34;Have a pint&#34;</span>)
print(state, reward)
</code></pre></div><pre><code>Class 2 -2.0
</code></pre>
<p>This MDP has just one terminal state.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">print(mdp<span style="color:#f92672">.</span>terminal_states())
</code></pre></div><pre><code>{'Asleep'}
</code></pre>
<p><code>mdp.step(action)</code> also returns a binary <code>done</code> flag, which is set to <code>True</code> if $S_{t+1}$ is a terminal state.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">mdp<span style="color:#f92672">.</span>state <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;Class 2&#34;</span> 
state, reward, done, _ <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(<span style="color:#e6db74">&#34;Fall asleep&#34;</span>)
print(state, reward, done)

mdp<span style="color:#f92672">.</span>state <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;Pass&#34;</span> 
state, reward, done, _ <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(<span style="color:#e6db74">&#34;Fall asleep&#34;</span>)
print(state, reward, done)
</code></pre></div><pre><code>Asleep 0.0 True
Asleep 0.0 True
</code></pre>
<p>Now let&rsquo;s bring an agent into the mix, and give it the exemplar policy shown in the diagram above.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> agent <span style="color:#f92672">import</span> Agent
agent <span style="color:#f92672">=</span> Agent(mdp) 
agent<span style="color:#f92672">.</span>policy <span style="color:#f92672">=</span> {
    <span style="color:#e6db74">&#34;Class 1&#34;</span>:  {<span style="color:#e6db74">&#34;Study&#34;</span>: <span style="color:#ae81ff">0.5</span>, <span style="color:#e6db74">&#34;Go on Facebook&#34;</span>: <span style="color:#ae81ff">0.5</span>},
    <span style="color:#e6db74">&#34;Class 2&#34;</span>:  {<span style="color:#e6db74">&#34;Study&#34;</span>: <span style="color:#ae81ff">0.8</span>, <span style="color:#e6db74">&#34;Fall asleep&#34;</span>: <span style="color:#ae81ff">0.2</span>},
    <span style="color:#e6db74">&#34;Class 3&#34;</span>:  {<span style="color:#e6db74">&#34;Study&#34;</span>: <span style="color:#ae81ff">0.6</span>, <span style="color:#e6db74">&#34;Go to the pub&#34;</span>: <span style="color:#ae81ff">0.4</span>},
    <span style="color:#e6db74">&#34;Facebook&#34;</span>: {<span style="color:#e6db74">&#34;Keep scrolling&#34;</span>: <span style="color:#ae81ff">0.9</span>, <span style="color:#e6db74">&#34;Close Facebook&#34;</span>: <span style="color:#ae81ff">0.1</span>},
    <span style="color:#e6db74">&#34;Pub&#34;</span>:      {<span style="color:#e6db74">&#34;Have a pint&#34;</span>: <span style="color:#ae81ff">1.</span>},
    <span style="color:#e6db74">&#34;Pass&#34;</span>:     {<span style="color:#e6db74">&#34;Fall asleep&#34;</span>: <span style="color:#ae81ff">1.</span>},
    <span style="color:#e6db74">&#34;Asleep&#34;</span>:   {<span style="color:#e6db74">&#34;Stay asleep&#34;</span>: <span style="color:#ae81ff">1.</span>}}

</code></pre></div><p>We can query the policy in a similar way to the MDP&rsquo;s properties, and observe its stochastic behaviour.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">print(agent<span style="color:#f92672">.</span>policy[<span style="color:#e6db74">&#34;Class 1&#34;</span>])
print([agent<span style="color:#f92672">.</span>act(<span style="color:#e6db74">&#34;Class 1&#34;</span>) <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">20</span>)])
</code></pre></div><pre><code>{'Study': 0.5, 'Go on Facebook': 0.5}
['Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Go on Facebook', 'Study', 'Study', 'Study', 'Go on Facebook', 'Go on Facebook', 'Go on Facebook', 'Study']
</code></pre>
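<p>The internals of <code>agent.act</code> are not shown here, but sampling from a policy stored as a dictionary of action probabilities can be sketched as follows. The helper name <code>sample_action</code> is hypothetical and this is only one plausible way to do it, using <code>random.choices</code> from the standard library.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import random

def sample_action(policy, state, rng):
    """Draw one action from policy[state], a dict mapping actions to probabilities."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]

policy = {"Class 1": {"Study": 0.5, "Go on Facebook": 0.5}}
rng = random.Random(0)  # seeded only so that the sketch is repeatable
draws = [sample_action(policy, "Class 1", rng) for _ in range(1000)]
# The empirical frequency should be close to the policy probability of 0.5.
print(draws.count("Study") / len(draws))
</code></pre></div>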
<p>Bringing it all together:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">mdp<span style="color:#f92672">.</span>verbose <span style="color:#f92672">=</span> <span style="color:#66d9ef">True</span>
state <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>reset()
done <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
<span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> done:
    state, reward, done, info <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(agent<span style="color:#f92672">.</span>act(state))
</code></pre></div><pre><code>=========================== EPISODE   2 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
</code></pre>
<p>How &ldquo;good&rdquo; is this policy? To answer this, we need to calculate its expected return.</p>
<h1 id="2--policy-evaluation-the-temporal-difference-method">2 | Policy Evaluation: The Temporal Difference Method</h1>
<p>For a policy $\pi$, the <em>Q value</em> $Q_\pi(S_t,A_t)$ is the expected return from taking action $A_t$ in state $S_t$, and following $\pi$ thereafter. It thus quantifies how well the policy can be expected to perform, starting from this state-action pair. Q values exhibit an elegant recursive relationship known as the <em>Bellman equation</em>:
$$
Q_\pi(S_t,A_t)=\sum_{S_{t+1}}\mathbb{P}[S_{t+1}|S_t,A_t]\left(\mathbb{E}[R_{t+1} | S_t,A_t,S_{t+1}]+\gamma\times\sum_{A_{t+1}}\pi(A_{t+1}|S_{t+1})\times Q_\pi(S_{t+1},A_{t+1})\right).
$$</p>
<p>i.e. <strong>The Q value for a state-action pair is equal to the immediate reward, plus the $\gamma$-discounted Q value for the <em>next</em> state-action pair, with expectations taken over both the transition function $\mathbb{P}$ and the policy $\pi$.</strong></p>
<p>This is a bit of a mouthful, but the Bellman equation is perhaps the single most important thing to understand if you really want to &ldquo;get&rdquo; reinforcement learning.</p>
<p>To gain some intuition for this relationship, here are estimated Q values for the exemplar policy in the student MDP. Here we&rsquo;re using a discount factor of $\gamma=0.95$.</p>
<ul>
<li>Note that these values are only approximate, so the Bellman equation doesn&rsquo;t hold exactly!</li>
</ul>
<img src='https://github.com/tombewley/one-hour-rl/blob/main/images/student-mdp-Q-values.svg?raw=true' width='700'>
<p>To take one example:
$$
Q(\text{Class 2},\text{Study})=-2+0.95\times [0.6\times Q(\text{Class 3},\text{Study})+0.4\times Q(\text{Class 3},\text{Go to the pub})]
$$
$$
=-2+0.95\times[0.6\times 9.99+0.4\times 1.81]
$$
$$
=4.38\approx 4.36
$$</p>
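<p>The same arithmetic can be checked in a few lines of Python. The variable names below are chosen for this sketch; the numbers are the ones used in the example above.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">GAMMA = 0.95
reward = -2.0                  # reward for studying in Class 2
pi_study, pi_pub = 0.6, 0.4    # exemplar policy probabilities in Class 3
q_study, q_pub = 9.99, 1.81    # estimated Q values for the two Class 3 actions

# Immediate reward plus the discounted, policy-weighted value of the next state.
backup = reward + GAMMA * (pi_study * q_study + pi_pub * q_pub)
print(round(backup, 2))  # 4.38
</code></pre></div>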
<p>How did we arrive at these Q value estimates? Here&rsquo;s where the real magic happens.</p>
<p>The <em>Bellman backup</em> algorithm makes use of this recursive relationship to update the Q value for a state-action pair based on the <em>current estimate of the value for the next state</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">GAMMA <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.95</span>  <span style="color:#75715e"># Discount factor</span>
ALPHA <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.001</span> <span style="color:#75715e"># Learning rate</span>

<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">bellman_backup</span>(agent, state, action, reward, next_state, done):

    Q_next <span style="color:#f92672">=</span> <span style="color:#ae81ff">0.</span> <span style="color:#66d9ef">if</span> done <span style="color:#66d9ef">else</span> agent<span style="color:#f92672">.</span>Q[next_state][agent<span style="color:#f92672">.</span>act(next_state)]

    agent<span style="color:#f92672">.</span>Q[state][action] <span style="color:#f92672">+=</span> ALPHA <span style="color:#f92672">*</span> ( reward <span style="color:#f92672">+</span> GAMMA <span style="color:#f92672">*</span> Q_next <span style="color:#f92672">-</span> agent<span style="color:#f92672">.</span>Q[state][action])
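</code></pre></div><p>By sampling episodes in our MDP using the current policy, we can collect rewards and update our Q-function accordingly. This procedure is called <em>policy evaluation</em>. It relies on the Bellman backup, which has two hyperparameters, $\gamma$ and $\alpha$. $\gamma$ is the discount factor from the definition of the return in Section 1.1, and $\alpha$ is the learning rate, which sets how far each update moves the current Q value towards the target $R_{t+1}+\gamma Q(S_{t+1},A_{t+1})$.</p>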
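<p>Written as an equation, the update performed by <code>bellman_backup</code> is (a restatement of the code above, using the notation of this section):
$$
Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha\left[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\right],
$$
where $A_{t+1}$ is sampled from $\pi(\cdot|S_{t+1})$, and $Q(S_{t+1},A_{t+1})$ is taken to be $0$ when $S_{t+1}$ is terminal. The bracketed quantity is the <em>temporal difference error</em>: the gap between the one-step target and the current estimate.</p>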
<p>Import the MDP and define the policy again.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> mdp <span style="color:#f92672">import</span> StudentMDP
<span style="color:#f92672">from</span> agent <span style="color:#f92672">import</span> Agent
mdp <span style="color:#f92672">=</span> StudentMDP(verbose<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
agent <span style="color:#f92672">=</span> Agent(mdp) 
agent<span style="color:#f92672">.</span>policy <span style="color:#f92672">=</span> {
    <span style="color:#e6db74">&#34;Class 1&#34;</span>:  {<span style="color:#e6db74">&#34;Study&#34;</span>: <span style="color:#ae81ff">0.5</span>, <span style="color:#e6db74">&#34;Go on Facebook&#34;</span>: <span style="color:#ae81ff">0.5</span>},
    <span style="color:#e6db74">&#34;Class 2&#34;</span>:  {<span style="color:#e6db74">&#34;Study&#34;</span>: <span style="color:#ae81ff">0.8</span>, <span style="color:#e6db74">&#34;Fall asleep&#34;</span>: <span style="color:#ae81ff">0.2</span>},
    <span style="color:#e6db74">&#34;Class 3&#34;</span>:  {<span style="color:#e6db74">&#34;Study&#34;</span>: <span style="color:#ae81ff">0.6</span>, <span style="color:#e6db74">&#34;Go to the pub&#34;</span>: <span style="color:#ae81ff">0.4</span>},
    <span style="color:#e6db74">&#34;Facebook&#34;</span>: {<span style="color:#e6db74">&#34;Keep scrolling&#34;</span>: <span style="color:#ae81ff">0.9</span>, <span style="color:#e6db74">&#34;Close Facebook&#34;</span>: <span style="color:#ae81ff">0.1</span>},
    <span style="color:#e6db74">&#34;Pub&#34;</span>:      {<span style="color:#e6db74">&#34;Have a pint&#34;</span>: <span style="color:#ae81ff">1.</span>},
    <span style="color:#e6db74">&#34;Pass&#34;</span>:     {<span style="color:#e6db74">&#34;Fall asleep&#34;</span>: <span style="color:#ae81ff">1.</span>},
    <span style="color:#e6db74">&#34;Asleep&#34;</span>:   {<span style="color:#e6db74">&#34;Stay asleep&#34;</span>: <span style="color:#ae81ff">1.</span>}
}
</code></pre></div><p>Initially, we set all Q values to zero (this is actually arbitrary).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">agent<span style="color:#f92672">.</span>Q
</code></pre></div><pre><code>{'Class 1': {'Study': 0.0, 'Go on Facebook': 0.0},
 'Class 2': {'Study': 0.0, 'Fall asleep': 0.0},
 'Class 3': {'Study': 0.0, 'Go to the pub': 0.0},
 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 0.0},
 'Pub': {'Have a pint': 0.0},
 'Pass': {'Fall asleep': 0.0},
 'Asleep': {}}
</code></pre>
<p>Run a single episode to see what happens.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">state <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>reset()
done <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
<span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> done:
    action <span style="color:#f92672">=</span> agent<span style="color:#f92672">.</span>act(state)
    next_state, reward, done, _ <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(action)
    
    print(<span style="color:#e6db74">&#39;Current action value:&#39;</span>, agent<span style="color:#f92672">.</span>Q[state][action])
    print(<span style="color:#e6db74">&#39;Reward obtained:&#39;</span>, reward)
    print(<span style="color:#e6db74">&#39;Next action value:&#39;</span>, <span style="color:#ae81ff">0.</span> <span style="color:#66d9ef">if</span> done <span style="color:#66d9ef">else</span> agent<span style="color:#f92672">.</span>Q[next_state][agent<span style="color:#f92672">.</span>act(next_state)])

    bellman_backup(agent, state, action, reward, next_state, done)

    print(<span style="color:#e6db74">&#39;Updated action value:&#39;</span>, agent<span style="color:#f92672">.</span>Q[state][action])
    print(<span style="color:#e6db74">&#39;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#39;</span>)

    state <span style="color:#f92672">=</span> next_state
</code></pre></div><pre><code>=========================== EPISODE  51 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
Current action value: 4.271800760689531
Reward obtained: -2.0
Next action value: 6.9926093317691915
Updated action value: 4.272171938794022


| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
Current action value: 6.9926093317691915
Reward obtained: -2.0
Next action value: 9.999149294082697
Updated action value: 6.9931159142668005


| 2     | Class 3  | Study          | 10.0   | Pass       | False |
Current action value: 9.999149294082697
Reward obtained: 10.0
Next action value: 0.0
Updated action value: 9.999150144788615


| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
Current action value: 0.0
Reward obtained: 0.0
Next action value: 0.0
Updated action value: 0.0
</code></pre>
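<p>The first update in this trace can be reproduced by hand. The sketch below uses a standalone function, <code>td_update</code>, a name chosen for this example that mirrors the arithmetic inside <code>bellman_backup</code>. It assumes the printed next action value is the one used in the update.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">GAMMA = 0.95   # Discount factor
ALPHA = 0.001  # Learning rate

def td_update(q, reward, q_next, alpha=ALPHA, gamma=GAMMA):
    """Move q a fraction alpha of the way towards the one-step target."""
    td_target = reward + gamma * q_next
    td_error = td_target - q
    return q + alpha * td_error

# Values copied from the first row of the trace (Class 1, Study).
updated = td_update(q=4.271800760689531, reward=-2.0, q_next=6.9926093317691915)
print(updated)  # agrees with the printed 4.272171938794022 up to rounding
</code></pre></div>
<p>Because $\alpha$ is small, each update moves the estimate by well under one thousandth of a unit here, which is why many episodes are needed.</p>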
<p>Repeating this over many episodes, the Q values gradually converge.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">mdp<span style="color:#f92672">.</span>verbose <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>

print(<span style="color:#e6db74">&#39;Initial Q&#39;</span>)
print(agent<span style="color:#f92672">.</span>Q)

<span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">20000</span>):
    state <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>reset()
    done <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
    <span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> done:
        action <span style="color:#f92672">=</span> agent<span style="color:#f92672">.</span>act(state)
        next_state, reward, done, _ <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(action)
        bellman_backup(agent, state, action, reward, next_state, done)
        state <span style="color:#f92672">=</span> next_state

print(<span style="color:#e6db74">&#39;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#39;</span>)
print(<span style="color:#e6db74">&#39;Converged Q&#39;</span>)
print(agent<span style="color:#f92672">.</span>Q)
</code></pre></div><pre><code>Initial Q
{'Class 1': {'Study': -0.002, 'Go on Facebook': -0.001}, 'Class 2': {'Study': -0.002, 'Fall asleep': 0.0}, 'Class 3': {'Study': 0.01, 'Go to the pub': 0.0}, 'Facebook': {'Keep scrolling': -0.0049995000249993754, 'Close Facebook': -0.00200095}, 'Pub': {'Have a pint': 0.0}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}


Converged Q
{'Class 1': {'Study': 1.3750628761712569, 'Go on Facebook': -11.288976651525505}, 'Class 2': {'Study': 4.485658109648779, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.999996524778595, 'Go to the pub': 1.8953439336946862}, 'Facebook': {'Keep scrolling': -11.233042781986304, 'Close Facebook': -6.6761905244797735}, 'Pub': {'Have a pint': 0.9312667143217461}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
</code></pre>
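<p>To see convergence without any sampling noise, here is a self-contained sketch on a simplified, deterministic version of the problem. It is an assumption made for this example, not the student MDP above: the agent always studies, so each state has a single action and Q is indexed by state alone. The rewards are the ones shown in the episode traces, and a larger learning rate is used so that fewer episodes are needed.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">GAMMA, ALPHA = 0.95, 0.1

# Deterministic "always study" chain: state maps to (next_state, reward, done).
chain = {
    "Class 1": ("Class 2", -2.0, False),
    "Class 2": ("Class 3", -2.0, False),
    "Class 3": ("Pass", 10.0, False),
    "Pass": ("Asleep", 0.0, True),
}
Q = {state: 0.0 for state in chain}

for _ in range(2000):
    state, done = "Class 1", False
    while not done:
        next_state, reward, done = chain[state]
        q_next = 0.0 if done else Q[next_state]
        # Same update rule as bellman_backup, with one action per state.
        Q[state] += ALPHA * (reward + GAMMA * q_next - Q[state])
        state = next_state

print({s: round(q, 3) for s, q in Q.items()})
</code></pre></div>
<p>The estimates settle at the values that satisfy the Bellman equation exactly: $10$ for Class 3, $-2+0.95\times 10=7.5$ for Class 2 and $-2+0.95\times 7.5=5.125$ for Class 1.</p>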
<p>Note that although the policy evaluation process is guaranteed to converge eventually (for simple MDPs!), we are likely to see some discrepancies between runs of finite length because of the role of randomness in the data collection process. Here are the results of five independent repeats:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">{<span style="color:#e6db74">&#39;Class 1&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">1.2650695038546025</span>, <span style="color:#e6db74">&#39;Go on Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.30468184426212</span>}, <span style="color:#e6db74">&#39;Class 2&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">4.407552596737938</span>, <span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Class 3&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">9.99999695776742</span>, <span style="color:#e6db74">&#39;Go to the pub&#39;</span>: <span style="color:#ae81ff">1.8487809354712246</span>}, <span style="color:#e6db74">&#39;Facebook&#39;</span>: {<span style="color:#e6db74">&#39;Keep scrolling&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.258053618560483</span>, <span style="color:#e6db74">&#39;Close Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">6.489974573408375</span>}, <span style="color:#e6db74">&#39;Pub&#39;</span>: {<span style="color:#e6db74">&#39;Have a pint&#39;</span>: <span style="color:#ae81ff">0.9454014270087486</span>}, <span style="color:#e6db74">&#39;Pass&#39;</span>: {<span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Asleep&#39;</span>: {}}
{<span style="color:#e6db74">&#39;Class 1&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">1.3338704627380917</span>, <span style="color:#e6db74">&#39;Go on Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.222578014516461</span>}, <span style="color:#e6db74">&#39;Class 2&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">4.404498313710967</span>, <span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Class 3&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">9.999996607231745</span>, <span style="color:#e6db74">&#39;Go to the pub&#39;</span>: <span style="color:#ae81ff">1.9330819535637127</span>}, <span style="color:#e6db74">&#39;Facebook&#39;</span>: {<span style="color:#e6db74">&#39;Keep scrolling&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.237574593720579</span>, <span style="color:#e6db74">&#39;Close Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">6.649035509952115</span>}, <span style="color:#e6db74">&#39;Pub&#39;</span>: {<span style="color:#e6db74">&#39;Have a pint&#39;</span>: <span style="color:#ae81ff">1.0198591832482675</span>}, <span style="color:#e6db74">&#39;Pass&#39;</span>: {<span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Asleep&#39;</span>: {}}
{<span style="color:#e6db74">&#39;Class 1&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">1.255108027766012</span>, <span style="color:#e6db74">&#39;Go on Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.190843458457234</span>}, <span style="color:#e6db74">&#39;Class 2&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">4.3028079916966</span>, <span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Class 3&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">9.999996368375</span>, <span style="color:#e6db74">&#39;Go to the pub&#39;</span>: <span style="color:#ae81ff">1.692402249138645</span>}, <span style="color:#e6db74">&#39;Facebook&#39;</span>: {<span style="color:#e6db74">&#39;Keep scrolling&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.009224020468848</span>, <span style="color:#e6db74">&#39;Close Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">6.456279660637165</span>}, <span style="color:#e6db74">&#39;Pub&#39;</span>: {<span style="color:#e6db74">&#39;Have a pint&#39;</span>: <span style="color:#ae81ff">0.7467114530860179</span>}, <span style="color:#e6db74">&#39;Pass&#39;</span>: {<span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Asleep&#39;</span>: {}}
{<span style="color:#e6db74">&#39;Class 1&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">1.2734946938741027</span>, <span style="color:#e6db74">&#39;Go on Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.328006914127434</span>}, <span style="color:#e6db74">&#39;Class 2&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">4.256107269897298</span>, <span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Class 3&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">9.99999635381211</span>, <span style="color:#e6db74">&#39;Go to the pub&#39;</span>: <span style="color:#ae81ff">1.74113336614775</span>}, <span style="color:#e6db74">&#39;Facebook&#39;</span>: {<span style="color:#e6db74">&#39;Keep scrolling&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.34039736455563</span>, <span style="color:#e6db74">&#39;Close Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">6.777709970724558</span>}, <span style="color:#e6db74">&#39;Pub&#39;</span>: {<span style="color:#e6db74">&#39;Have a pint&#39;</span>: <span style="color:#ae81ff">0.7694312629253455</span>}, <span style="color:#e6db74">&#39;Pass&#39;</span>: {<span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Asleep&#39;</span>: {}}
{<span style="color:#e6db74">&#39;Class 1&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">1.2650695038546025</span>, <span style="color:#e6db74">&#39;Go on Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.30468184426212</span>}, <span style="color:#e6db74">&#39;Class 2&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">4.407552596737938</span>, <span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Class 3&#39;</span>: {<span style="color:#e6db74">&#39;Study&#39;</span>: <span style="color:#ae81ff">9.99999695776742</span>, <span style="color:#e6db74">&#39;Go to the pub&#39;</span>: <span style="color:#ae81ff">1.8487809354712246</span>}, <span style="color:#e6db74">&#39;Facebook&#39;</span>: {<span style="color:#e6db74">&#39;Keep scrolling&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">11.258053618560483</span>, <span style="color:#e6db74">&#39;Close Facebook&#39;</span>: <span style="color:#f92672">-</span><span style="color:#ae81ff">6.489974573408375</span>}, <span style="color:#e6db74">&#39;Pub&#39;</span>: {<span style="color:#e6db74">&#39;Have a pint&#39;</span>: <span style="color:#ae81ff">0.9454014270087486</span>}, <span style="color:#e6db74">&#39;Pass&#39;</span>: {<span style="color:#e6db74">&#39;Fall asleep&#39;</span>: <span style="color:#ae81ff">0.0</span>}, <span style="color:#e6db74">&#39;Asleep&#39;</span>: {}}
</code></pre></div><pre><code>{'Class 1': {'Study': 1.2650695038546025,
  'Go on Facebook': -11.30468184426212},
 'Class 2': {'Study': 4.407552596737938, 'Fall asleep': 0.0},
 'Class 3': {'Study': 9.99999695776742, 'Go to the pub': 1.8487809354712246},
 'Facebook': {'Keep scrolling': -11.258053618560483,
  'Close Facebook': -6.489974573408375},
 'Pub': {'Have a pint': 0.9454014270087486},
 'Pass': {'Fall asleep': 0.0},
 'Asleep': {}}
</code></pre>
<p>Try with $\gamma=0$</p>
<h1 id="3--policy-improvement">3 | Policy Improvement</h1>
<img src='https://github.com/tombewley/one-hour-rl/blob/main/images/policy-improvement-2.PNG?raw=true' width='500'>
<p>Having evaluated our policy $\pi$, how can we go about obtaining a better one? This question is the heart of <em>policy improvement</em>, perhaps the fundamental concept of RL. Recall that when we performed policy evaluation we obtained the value of taking every action in every state. We can therefore improve the policy by picking, in each state, the action our current estimates rate highest &ndash; so-called <em>greedy</em> action selection. Once we&rsquo;ve obtained a new policy, we can evaluate it as before. By iterating between policy evaluation and policy improvement in this way, we are guaranteed to reach the optimal policy $\pi^*$ according to the policy improvement theorem.</p>
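<p>As a rough illustration of the greedy step, the sketch below turns a table of action values into a deterministic policy. The function name <code>greedy_policy</code> is hypothetical (it is not part of the tutorial&rsquo;s <code>mdp</code> or <code>agent</code> modules), and the numbers are the action values printed by the policy evaluation above, rounded to two decimal places.</p>

```python
def greedy_policy(Q):
    """Put probability 1 on the highest-valued action in each state."""
    policy = {}
    for state, action_values in Q.items():
        if not action_values:
            # Terminal state: there is nothing to choose between.
            policy[state] = {}
            continue
        best_action = max(action_values, key=action_values.get)
        policy[state] = {a: float(a == best_action) for a in action_values}
    return policy


# Action values from the evaluated policy, rounded (illustrative only).
Q = {
    "Class 1": {"Study": 1.27, "Go on Facebook": -11.30},
    "Class 2": {"Study": 4.41, "Fall asleep": 0.0},
    "Class 3": {"Study": 10.0, "Go to the pub": 1.85},
    "Facebook": {"Keep scrolling": -11.26, "Close Facebook": -6.49},
    "Pub": {"Have a pint": 0.95},
    "Pass": {"Fall asleep": 0.0},
    "Asleep": {},
}

improved = greedy_policy(Q)
print(improved["Facebook"])
```

<p>Ties are broken in favour of the first action listed, which is an arbitrary choice made here for simplicity.</p>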
<h3 id="31--q-learning-combining-policy-evaluation-and-improvement">3.1 | Q-learning: Combining Policy Evaluation and Improvement</h3>
<img src='https://github.com/tombewley/one-hour-rl/blob/main/images/q-learning.png?raw=true' width='700'>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> mdp <span style="color:#f92672">import</span> StudentMDP
mdp <span style="color:#f92672">=</span> StudentMDP(verbose<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> agent <span style="color:#f92672">import</span> QLearningAgent
agent <span style="color:#f92672">=</span> QLearningAgent(mdp, epsilon<span style="color:#f92672">=</span><span style="color:#ae81ff">1.0</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.2</span>, gamma<span style="color:#f92672">=</span><span style="color:#ae81ff">0.9</span>)
</code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">NUM_EPS <span style="color:#f92672">=</span> <span style="color:#ae81ff">50</span>
mdp<span style="color:#f92672">.</span>ep <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
<span style="color:#66d9ef">while</span> mdp<span style="color:#f92672">.</span>ep <span style="color:#f92672">&lt;</span> NUM_EPS:
    state <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>reset()
    done <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
    <span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> done:
        action <span style="color:#f92672">=</span> agent<span style="color:#f92672">.</span>act(state)
        next_state, reward, done, info <span style="color:#f92672">=</span> mdp<span style="color:#f92672">.</span>step(action)
        agent<span style="color:#f92672">.</span>learn(state, action, reward, next_state, done)
        state <span style="color:#f92672">=</span> next_state

    print(<span style="color:#e6db74">&#34;Value function:&#34;</span>)
    print(agent<span style="color:#f92672">.</span>Q)
    print(<span style="color:#e6db74">&#34;Policy:&#34;</span>)
    print(agent<span style="color:#f92672">.</span>policy)
    print(<span style="color:#e6db74">&#34;Epsilon:&#34;</span>, agent<span style="color:#f92672">.</span>epsilon)
    
    agent<span style="color:#f92672">.</span>epsilon <span style="color:#f92672">=</span> max(agent<span style="color:#f92672">.</span>epsilon <span style="color:#f92672">-</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">/</span> (NUM_EPS<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>), <span style="color:#ae81ff">0</span>)
</code></pre></div><pre><code>=========================== EPISODE   1 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
Value function:
{'Class 1': {'Study': -0.4, 'Go on Facebook': 0.0}, 'Class 2': {'Study': -0.4, 'Fall asleep': 0.0}, 'Class 3': {'Study': 2.0, 'Go to the pub': 0.0}, 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 0.0}, 'Pub': {'Have a pint': 0.0}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
Policy:
{'Class 1': {'Study': 0.5, 'Go on Facebook': 0.5}, 'Class 2': {'Study': 0.5, 'Fall asleep': 0.5}, 'Class 3': {'Study': 0.5, 'Go to the pub': 0.5}, 'Facebook': {'Keep scrolling': 0.5, 'Close Facebook': 0.5}, 'Pub': {'Have a pint': 1.0}, 'Pass': {'Fall asleep': 1.0}, 'Asleep': {}}
Epsilon: 1.0

=========================== EPISODE  50 ===========================
| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Study          | -2.0   | Class 2    | False |
| 1     | Class 2  | Study          | -2.0   | Class 3    | False |
| 2     | Class 3  | Study          | 10.0   | Pass       | False |
| 3     | Pass     | Fall asleep    |  0.0   | Asleep     | True  |
Value function:
{'Class 1': {'Study': 4.2185955736170015, 'Go on Facebook': -2.843986498540236}, 'Class 2': {'Study': 6.978887265282676, 'Fall asleep': 0.0}, 'Class 3': {'Study': 9.997403851570732, 'Go to the pub': 1.0297507967148403}, 'Facebook': {'Keep scrolling': -3.2301556459908016, 'Close Facebook': -0.8716820424598939}, 'Pub': {'Have a pint': 2.6089417712654472}, 'Pass': {'Fall asleep': 0.0}, 'Asleep': {}}
Policy:
{'Class 1': {'Study': 1.0, 'Go on Facebook': 0.0}, 'Class 2': {'Study': 1.0, 'Fall asleep': 0.0}, 'Class 3': {'Study': 1.0, 'Go to the pub': 0.0}, 'Facebook': {'Keep scrolling': 0.0, 'Close Facebook': 1.0}, 'Pub': {'Have a pint': 1.0}, 'Pass': {'Fall asleep': 1.0}, 'Asleep': {}}
Epsilon: 0
</code></pre>
<p>We find that after 50 episodes the agent has obtained the optimal policy $\pi^*$!</p>
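<p>The internals of <code>QLearningAgent.learn</code> are not shown here, so the following is only a sketch of the standard tabular Q-learning update, assuming the agent does something similar. The function name <code>q_learning_update</code> is hypothetical. With $\alpha=0.2$ and $\gamma=0.9$ it gives $-0.4$ for the first transition of Episode 1, which matches the value printed above for studying in Class 1.</p>

```python
def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.2, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new estimate."""
    next_values = list(Q[next_state].values())
    # Bootstrap from the best next action; terminal or action-less
    # states contribute nothing to the target.
    best_next = 0.0 if done or not next_values else max(next_values)
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]


# All estimates start at zero, as in the first episode above.
Q = {
    "Class 1": {"Study": 0.0, "Go on Facebook": 0.0},
    "Class 2": {"Study": 0.0, "Fall asleep": 0.0},
}

new_value = q_learning_update(Q, "Class 1", "Study", -2.0, "Class 2", done=False)
print(new_value)
```

<p>Because the target uses the maximum over next actions rather than the action the behaviour policy actually takes, evaluation and improvement happen in the same step.</p>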
<h1 id="4--deep-rl">4 | Deep RL</h1>
<p>So far, we&rsquo;ve tabularised the state-action space. Whilst this is useful for explaining the fundamental concepts that underpin RL, real-world state-action spaces are generally continuous and thus impossible to tabularise. To combat this, function approximators are used instead. In the past these included x, but more recently deep neural networks have been used, giving rise to the field of Deep Reinforcement Learning.</p>
<p>The seminal Deep RL algorithm is Deep Q-learning, which uses neural networks to represent the $Q$ function. The network takes the current observation $o_t$ as input and predicts the value of each action. The agent&rsquo;s policy is $\epsilon$-greedy as before, i.e. it takes the value-maximising action with probability $1 - \epsilon$ and a random action otherwise.</p>
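<p>To make the $\epsilon$-greedy rule concrete, here is a minimal sketch of action selection given a vector of predicted action values. The function name <code>epsilon_greedy</code> and the example values are assumptions for illustration; in a Deep Q-learning agent the list would be the network&rsquo;s output for the current observation.</p>

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the greedy action with probability 1 - epsilon, else a random one."""
    if rng.random() >= epsilon:
        # Exploit: index of the largest predicted value.
        return max(range(len(q_values)), key=q_values.__getitem__)
    # Explore: any action, chosen uniformly at random.
    return rng.randrange(len(q_values))


# Hypothetical predicted values for a task with three actions.
q_values = [0.1, 0.7, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=random.Random(0))
print(greedy_action, random_action in range(3))
```

<p>Setting <code>epsilon=0.0</code> always exploits and <code>epsilon=1.0</code> always explores, which is why agents typically start with a high value and decay it as training progresses.</p>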
<p>Below, we run 500 episodes of the canonical CartPole task using Deep Q-learning. The agent&rsquo;s goal is to balance the pole upright for as long as possible, starting from a random initial position.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">import</span> gym
<span style="color:#f92672">from</span> dqn_agent <span style="color:#f92672">import</span> Agent
<span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">env <span style="color:#f92672">=</span> gym<span style="color:#f92672">.</span>make(<span style="color:#e6db74">&#39;CartPole-v1&#39;</span>)
agent <span style="color:#f92672">=</span> Agent(gamma<span style="color:#f92672">=</span><span style="color:#ae81ff">0.99</span>, epsilon<span style="color:#f92672">=</span><span style="color:#ae81ff">0.9</span>, lr<span style="color:#f92672">=</span><span style="color:#ae81ff">0.0001</span>, n_actions<span style="color:#f92672">=</span>env<span style="color:#f92672">.</span>action_space<span style="color:#f92672">.</span>n, input_dims<span style="color:#f92672">=</span>[env<span style="color:#f92672">.</span>observation_space<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>]],
              mem_size<span style="color:#f92672">=</span><span style="color:#ae81ff">50000</span>, batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">128</span>,  eps_dec<span style="color:#f92672">=</span><span style="color:#ae81ff">1e-3</span>, eps_min<span style="color:#f92672">=</span><span style="color:#ae81ff">0.05</span>, replace<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>,
              env_name<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;cartpole&#39;</span>, chkpt_dir<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;tmp/dqn&#39;</span>)

best_score <span style="color:#f92672">=</span> <span style="color:#f92672">-</span>np<span style="color:#f92672">.</span>inf
episodes <span style="color:#f92672">=</span> <span style="color:#ae81ff">500</span>
scores, avg_score, eps_history <span style="color:#f92672">=</span> [], [], []

<span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(episodes):
    score <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
    done <span style="color:#f92672">=</span> <span style="color:#66d9ef">False</span>
    observation <span style="color:#f92672">=</span> env<span style="color:#f92672">.</span>reset()
    env<span style="color:#f92672">.</span>render()
    <span style="color:#66d9ef">while</span> <span style="color:#f92672">not</span> done:
        action <span style="color:#f92672">=</span> agent<span style="color:#f92672">.</span>choose_action(observation)
        observation_, reward, done, info <span style="color:#f92672">=</span> env<span style="color:#f92672">.</span>step(action)
        score <span style="color:#f92672">+=</span> reward
        agent<span style="color:#f92672">.</span>store_transition(observation, action, reward, observation_, done)
        agent<span style="color:#f92672">.</span>learn()
        observation <span style="color:#f92672">=</span> observation_
        env<span style="color:#f92672">.</span>render()
    
    scores<span style="color:#f92672">.</span>append(score)
    eps_history<span style="color:#f92672">.</span>append(agent<span style="color:#f92672">.</span>epsilon)
    
    avg_score <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>mean(scores[<span style="color:#f92672">-</span><span style="color:#ae81ff">100</span>:])
    
    print(<span style="color:#e6db74">&#39;episode&#39;</span>, i, <span style="color:#e6db74">&#39;score </span><span style="color:#e6db74">%.2f</span><span style="color:#e6db74">&#39;</span> <span style="color:#f92672">%</span> score, <span style="color:#e6db74">&#39;average score </span><span style="color:#e6db74">%.2f</span><span style="color:#e6db74">&#39;</span> <span style="color:#f92672">%</span> avg_score)
</code></pre></div><h1 id="5--what-did-we-miss-out">5 | What Did We Miss Out?</h1>
<ul>
<li>Dynamic programming (when transition probabilities are known)</li>
<li>Monte Carlo</li>
<li>Exploration strategies</li>
<li>Continuous actions</li>
<li>Policy gradient, actor-critic</li>
<li>Model-based</li>
<li>Partial observability</li>
</ul>
<p>What next? RL interest group?</p>
]]></content>
        </item>
        
        <item>
            <title>Presenting with Jupyter Notebooks</title>
            <link>https://enjeeneer.io/posts/2021/11/presenting-with-jupyter-notebooks/</link>
            <pubDate>Wed, 17 Nov 2021 09:03:11 -0500</pubDate>
            
            <guid>https://enjeeneer.io/posts/2021/11/presenting-with-jupyter-notebooks/</guid>
            <description>The best way to walk through this tutorial is using the accompanying Jupyter Notebook: 
[Jupyter Notebook]
- In the last year I&amp;rsquo;ve started presenting work using Jupyter Notebooks, rebelling against the Bill Gates-driven status quo. Here I&amp;rsquo;ll explain how to do it. It&amp;rsquo;s not difficult, but in my opinion makes presentations look slicker, whilst allowing you to run code live in a presentation if you like. First, we need to download the plug-in that gives us the presentation functionality; it&amp;rsquo;s called RISE.</description>
            <content type="html"><![CDATA[<h3 id="the-best-way-to-walk-through-this-tutorial-is-using-the-accompanying-jupyter-notebook">The best way to walk through this tutorial is using the accompanying Jupyter Notebook:</h3>
<p><a href="http://colab.research.google.com/github/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/notebook.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a></p>
<p>[<a href="http://nbviewer.jupyter.org/github/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/notebook.ipynb">Jupyter Notebook</a>]</p>
<h3 id="-">-</h3>
<p>In the last year I&rsquo;ve started presenting work using Jupyter Notebooks, rebelling against the Bill Gates-driven status quo. Here I&rsquo;ll explain how to do it. It&rsquo;s not difficult, but in my opinion makes presentations look slicker, whilst allowing you to run code live in a presentation if you like. First, we need to download the plug-in that gives us the presentation functionality; it&rsquo;s called <a href="https://rise.readthedocs.io/en/stable/index.html">RISE</a>. We can do this easily using pip in a terminal window:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">pip install RISE
</code></pre></div><p>Once installed, our first move is to add the presentation toggles to our notebook cells. We do this by clicking <em>View</em> in the menu bar, then <em>Cell Toolbar</em>, then <em>Slideshow</em>:</p>
<h2 id="adding-presentation-toggles-to-cells">Adding Presentation Toggles to Cells</h2>
<figure><img src="https://github.com/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/images/slideshow_2.gif?raw=true"/>
</figure>

<h2 id="slide-types">Slide Types</h2>
<p>This adds a <code>Slide Type</code> dropdown to each cell in the notebook. Here we can choose one of five options:</p>
<ul>
<li><strong>Slide</strong>: Starts a new chapter in your presentation; think of this as a section heading in LaTeX.</li>
<li><strong>Sub-slide</strong>: A slide falling within the chapter defined by a <strong>Slide</strong> cell.</li>
<li><strong>Fragment</strong>: Splits the contents of one slide into pieces. A cell marked as a fragment creates a break inside the slide: it will not show up right away, and you will need to press Space one more time to see it.</li>
<li><strong>Skip</strong>: Skips the cell when in presenter mode.</li>
<li><strong>Notes</strong>: A cell that allows the author to write notes on a slide that aren&rsquo;t shown in presenter view.</li>
</ul>
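<p>As far as I know, the dropdown simply records its choice in each cell&rsquo;s metadata under <code>slideshow</code> &rarr; <code>slide_type</code>, so the same tags can be written programmatically. The sketch below builds a few minimal Markdown cells as plain dictionaries; the helper <code>markdown_cell</code> and the slide text are hypothetical, and a real notebook file contains further fields not shown here.</p>

```python
import json

# Values the Slide Type dropdown is assumed to store for each option.
SLIDE_TYPES = {"slide", "subslide", "fragment", "skip", "notes"}


def markdown_cell(source, slide_type):
    """Build a minimal notebook Markdown cell tagged with a slideshow type."""
    if slide_type not in SLIDE_TYPES:
        raise ValueError(f"unknown slide type: {slide_type}")
    return {
        "cell_type": "markdown",
        "metadata": {"slideshow": {"slide_type": slide_type}},
        "source": source,
    }


cells = [
    markdown_cell("# Results", "slide"),
    markdown_cell("First finding", "subslide"),
    markdown_cell("A detail revealed on the next key press", "fragment"),
    markdown_cell("Reminder to self: mention the dataset", "notes"),
]

print(json.dumps(cells[0]["metadata"]))
```

<p>Opening a notebook in a text editor and searching for <code>slide_type</code> is a quick way to check which cells have been tagged.</p>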
<p>As with any notebook, we can define the cell type to be either <code>Markdown</code> or <code>Code</code>. As you&rsquo;d expect, we present any text or image-based slide in <code>Markdown</code>, reserving the <code>Code</code> cell type if and only if we want to explicitly run some code in the presentation. If you aren&rsquo;t familiar, Markdown is a straightforward language for text formatting; I won&rsquo;t go into the details here, but suffice to say you can learn the basics of Markdown in 5 minutes. You can find a useful cheatsheet <a href="https://www.markdownguide.org/cheat-sheet/">here</a>.</p>
<h2 id="images">Images</h2>
<p>Adding images is easy too. I advise creating a sub-directory in your working directory called <code>/images</code> and storing anything you want to present there. Then you display them in a <code>Markdown</code> cell using some simple HTML syntax:</p>
<p><code>&lt;img class=&quot;&quot; src=&quot;images/london_deaths.jpeg&quot; style=&quot;width:75%&quot;&gt;</code></p>
<p>You can manipulate the <code>style</code> attribute to change the size of the image. Don&rsquo;t worry, this is the only HTML you need to know!</p>
<figure><img src="https://github.com/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/images/london_deaths.jpeg?raw=true"/>
</figure>

<h2 id="entering-presentation-mode">Entering Presentation Mode</h2>
<p>To view your slideshow click on the bar-chart button in the menu bar. This will start the presentation from the cell currently selected:</p>
<figure><img src="https://github.com/enjeeneer/talks/blob/main/2021-11-17-RISEPresentations/images/start_show.gif?raw=true"/>
</figure>

<p>That&rsquo;s it! This tutorial has given you an introduction to the basics of RISE for presenting with Jupyter Notebooks. You can of course customise your presentations to your heart&rsquo;s content using further plug-ins and more advanced Markdown. Here&rsquo;s a summary of the useful links from this document to finish:</p>
<ul>
<li><a href="https://rise.readthedocs.io/en/stable/index.html">RISE Documentation</a></li>
<li><a href="https://www.markdownguide.org/cheat-sheet/">Markdown Cheatsheet</a></li>
<li><a href="https://www.tablesgenerator.com/">Markdown Table Generator</a></li>
</ul>
<h1 id="thanks">Thanks!</h1>
<h3 id="twitter-enjeeneerhttpstwittercomenjeeneer">Twitter: <a href="https://twitter.com/enjeeneer">@enjeeneer</a></h3>
<h3 id="website-httpsenjeeneeriohttpsenjeeneerio">Website: <a href="https://enjeeneer.io/">https://enjeeneer.io/</a></h3>
]]></content>
        </item>
        
        <item>
            <title>Notes, Exercises and Code for Sutton and Barto&#39;s Reinforcement Learning: An Introduction (2018)</title>
            <link>https://enjeeneer.io/posts/2021/04/notes-exercises-and-code-for-sutton-and-bartos-reinforcement-learning-an-introduction-2018/</link>
            <pubDate>Fri, 30 Apr 2021 18:18:24 +0100</pubDate>
            
            <guid>https://enjeeneer.io/posts/2021/04/notes-exercises-and-code-for-sutton-and-bartos-reinforcement-learning-an-introduction-2018/</guid>
            <description>In the last few weeks I&amp;rsquo;ve been compiling a set of notes and exercise solutions for Sutton and Barto&amp;rsquo;s Reinforcement Learning: An Introduction. Admittedly, these were produced for my own benefit, but if you&amp;rsquo;d like to look at my notes, my (probably incorrect) answers to the exercises, or the code accompanying those answers, I&amp;rsquo;ll link directly to them below:
 Notes Exercises Code  Thanks to Bryn Hayder for inspiring this idea, and for providing his exercise solutions which helped me throughout.</description>
            <content type="html"><![CDATA[<p>In the last few weeks I&rsquo;ve been compiling a set of notes and exercise solutions for <a href="http://incompleteideas.net/book/RLbook2020.pdf">Sutton and Barto&rsquo;s Reinforcement Learning: An Introduction</a>. Admittedly, these were produced for my own benefit, but if you&rsquo;d like to look at my notes, my (probably incorrect) answers to the exercises, or the code accompanying those answers, I&rsquo;ll link directly to them below:</p>
<ul>
<li><a href="/sutton_and_barto/rl_notes.pdf"><strong>Notes</strong></a></li>
<li><a href="/sutton_and_barto/rl_exercises.pdf"><strong>Exercises</strong></a></li>
<li><a href="https://github.com/enjeeneer/sutton_and_barto"><strong>Code</strong></a></li>
</ul>
<p>Thanks to <a href="https://github.com/brynhayder">Bryn Hayder</a> for inspiring this idea, and for providing his exercise solutions which helped me throughout.</p>
]]></content>
        </item>
        
        <item>
            <title>Scott&#39;s Uncomprehensive Guide to Scotland </title>
            <link>https://enjeeneer.io/posts/2021/04/scotts-uncomprehensive-guide-to-scotland/</link>
            <pubDate>Mon, 19 Apr 2021 21:08:05 +0000</pubDate>
            
            <guid>https://enjeeneer.io/posts/2021/04/scotts-uncomprehensive-guide-to-scotland/</guid>
            <description>Hello treasured friend. If you&amp;rsquo;re reading this, it&amp;rsquo;s probably because I&amp;rsquo;ve force-fed you a link after discussing your upcoming trip to Scotland. I hope this is useful to you in some way. Think of this as a travel guide that you can dip into when you find yourself in one of these places either hungry or bored. I don&amp;rsquo;t describe anything in detail; you&amp;rsquo;ll just have to take my word that these places are worth visiting.</description>
            <content type="html"><![CDATA[<p>Hello treasured friend. If you&rsquo;re reading this, it&rsquo;s probably because I&rsquo;ve force-fed you a link after discussing your upcoming trip to Scotland. I hope this is useful to you in some way. Think of this as a travel guide that you can dip into when you find yourself in one of these places either hungry or bored. I don&rsquo;t describe anything in detail; you&rsquo;ll just have to take my word that these places are worth visiting.</p>
<p>Before diving in, I&rsquo;d like to make a quick overarching recommendation. If you only have one chance to visit, I would strongly advise using this visit to see Edinburgh in August for the <a href="https://en.wikipedia.org/wiki/Edinburgh_Festival_Fringe">Fringe Festival</a>. In general, you can&rsquo;t do better than Edinburgh; it&rsquo;s the capital, cultural centre, prettiest city; and in August you&rsquo;ll get the best weather. But the Fringe is a unique experience that I think should be on everyone&rsquo;s bucket list. Every pub, church, park, music venue (indeed, any open space) becomes a stage for artists, actors, and comedians. These artists range from world class to distinctly amateur, and the fun lies in booking a string of shows across an afternoon/evening, likely from performers you will never have heard of, and rolling the dice. You will see some awful shows that make you cringe, then you&rsquo;ll see a performance that will take your breath away leaving you in awe of their talent. The bars are open till 5am every day of the week (usually outlawed in Scotland), and the city is buzzing with excitement and energy. Edinburgh is the most cosmopolitan city in Scotland, but it is especially so for the Fringe, with  ~1 million tourists visiting. During my undergrad, I worked in different bars each Fringe I was there, so have seen plenty of it, and I promise you won&rsquo;t regret going, it&rsquo;s a special time.</p>
<p>Anyway, if you want to explore more than just the Fringe, here are some ideas of things to do. The two categories are: 1) Cities, 2) Special Events, and there&rsquo;s a little aside at the end on golf courses. I think the sub-structure is self-explanatory. I hope it&rsquo;s helpful!</p>
<h1 id="cities">Cities</h1>
<h2 id="edinburgh">Edinburgh</h2>
<h3 id="things-to-do-before-dark">Things to do before dark</h3>
<ul>
<li>Walk up Arthur&rsquo;s Seat via the Crags for a view of the entire city <a href="https://goo.gl/maps/5Jpnucuso4ebNY5w9">[map]</a></li>
<li>Walk up Calton Hill for a view of Princes Street, the Castle and the Firth of Forth <a href="https://goo.gl/maps/o5Y5uA94EH9bRP7k8">[map]</a></li>
<li>The Castle (beware of the £20 entry fee if you&rsquo;re light on $$$) <a href="https://goo.gl/maps/WjrgDGNcJJFCV2BT6">[map]</a></li>
<li>Walk along the Water of Leith through Dean Village <a href="https://goo.gl/maps/oYKFGTvuNzDCuwWo9">[map]</a></li>
<li>Portobello beach (if you&rsquo;re feeling warm) <a href="https://goo.gl/maps/a9UsAs9WhXRBAVs37">[map]</a></li>
<li>Tour the university buildings; George Square/The Meadows <a href="https://goo.gl/maps/z9DmcxhgCNycQ3FY7">[map]</a> and Old College <a href="https://goo.gl/maps/StYFpYtcEzn9tFDY6">[map]</a></li>
</ul>
<h3 id="things-to-do-after-dark">Things to do after dark</h3>
<ul>
<li>Sneaky Pete&rsquo;s: playing techno/house, the best night in Edinburgh. Something happening most nights  <a href="https://goo.gl/maps/AVeVNFvc4JMPdLyr9">[map]</a></li>
<li>The Brass Monkey: an eccentric, unique locals bar <a href="https://goo.gl/maps/WEEQrNjGNGcRAvHy6">[map]</a></li>
<li>The Devil&rsquo;s Advocate: cool, sleek (and more expensive) city bar <a href="https://goo.gl/maps/19BmGMCQ5ya2pUTZ6">[map]</a></li>
<li>The Hanover Tap: easy-going student bar <a href="https://goo.gl/maps/CrXV6yB6fFTojF1i8">[map]</a></li>
<li>99 Hanover St.: cool city bar with the occasional DJ <a href="https://goo.gl/maps/AXt3AW6LXyuaFkJv8">[map]</a></li>
<li>Garibaldi&rsquo;s: wild club near many of the above bars. Order the Gari&rsquo;s special at the bar! <a href="https://goo.gl/maps/CZPhQmjGVhzumwsq6">[map]</a></li>
<li>The Hanging Bat: cool bar with loads of beer <a href="https://g.page/thehangingbat?share">[map]</a></li>
</ul>
<h3 id="things-to-do-if-you-have-a-car">Things to do if you have a car</h3>
<ul>
<li>Visit North Berwick and Gullane, two lovely seaside towns along the coast <a href="https://goo.gl/maps/1r1kviJPhEbi3NWd9">[map]</a></li>
<li>The Pentland Hills <a href="https://goo.gl/maps/NVPLxfiqaNj9Bb9Z7">[map]</a></li>
</ul>
<h3 id="things-to-do-if-hungover">Things to do if hungover</h3>
<ul>
<li>Meltmongers (grilled cheese) <a href="https://goo.gl/maps/m8saRkuZyTyLKSNGA">[map]</a></li>
<li>Snax Cafe: infamous hangover food for pennies <a href="https://goo.gl/maps/X3nVXRehDAzP4yYe7">[map]</a></li>
<li>Wings: chicken wings in an unreasonable number of seasonings <a href="https://goo.gl/maps/h2AvnUGcYPzaSWmm6">[map]</a></li>
</ul>
<h3 id="lunch">Lunch</h3>
<ul>
<li>Nile Valley Cafe: me and my mates&rsquo; favourite spot in the city (Sudanese wraps) <a href="https://goo.gl/maps/xGS9on9vymkHhN3Q8">[map]</a></li>
<li>Tupiniquim (Brazilian crepes) <a href="https://goo.gl/maps/UhaxA5MDUTUgnVk58">[map]</a></li>
<li>J Reid Sandwich Shop: great salads <a href="https://goo.gl/maps/K6vCQJuafghiuHEY9">[map]</a></li>
<li>Victor Hugo Deli <a href="https://goo.gl/maps/m8saRkuZyTyLKSNGA">[map]</a></li>
<li>10 to 10 in Delhi (Indian) <a href="https://goo.gl/maps/zSGp67zvi61DGHvr5">[map]</a></li>
<li>The Scran and Scallie: up-market Scottish gastropub. Some of the best pub food in the city <a href="https://goo.gl/maps/19BmGMCQ5ya2pUTZ6">[map]</a></li>
</ul>
<h3 id="dinner">Dinner</h3>
<ul>
<li>Yene Meze (Greek) <a href="https://goo.gl/maps/kW8YaB4f3j16qzPS9">[map]</a></li>
<li>Fishers in the City: delicious, up-market fish restaurant <a href="https://g.page/Fishers-in-the-city?share">[map]</a></li>
<li>The Bon Vivant: fantastic French restaurant <a href="https://goo.gl/maps/CrXV6yB6fFTojF1i8">[map]</a></li>
<li>The Outsider: chill atmosphere, serving delicious mixed cuisine <a href="https://goo.gl/maps/mjmUh3S5vhSgXe8p6">[map]</a></li>
<li>The Grain Store: amazing Scottish cuisine, albeit pricey <a href="https://goo.gl/maps/yYxhqMV8HnUM1zEq7">[map]</a></li>
</ul>
<h3 id="coffee">Coffee</h3>
<ul>
<li>Project Coffee <a href="https://goo.gl/maps/MtfGh3uCsGpapB7CA">[map]</a></li>
<li>Soderberg <a href="https://goo.gl/maps/5vTA9SNftMSLM3Jm9">[map]</a></li>
<li>Maison de Moggy: a cafe full of cats <a href="https://goo.gl/maps/iZ5c6j2bYDkbiomZ6">[map]</a></li>
</ul>
<h2 id="glasgow">Glasgow</h2>
<h3 id="things-to-do-before-dark-1">Things to do before dark</h3>
<ul>
<li>Kelvingrove Art Gallery: some lovely exhibitions inc. <a href="https://en.wikipedia.org/wiki/Christ_of_Saint_John_of_the_Cross">Dali&rsquo;s Christ of Saint John of the Cross</a> <a href="https://goo.gl/maps/eCvcpUcUE4Lmyq4n7">[map]</a></li>
<li>Walk around the West End: Byres Road, Botanic Gardens, University Avenue etc. <a href="https://goo.gl/maps/NMuBdWBtLwh2SpGq7">[map]</a></li>
<li>Take the <a href="https://en.wikipedia.org/wiki/Glasgow_Subway">Subway</a>!</li>
</ul>
<h3 id="things-to-do-after-dark-1">Things to do after dark</h3>
<ul>
<li>Ashton Lane (specifically <a href="https://brelbar.com">Brel</a> for a drink and the <a href="https://ubiquitouschip.co.uk">Ubiquitous Chip</a> for food) <a href="https://goo.gl/maps/QfYxVpMeGbRGYJ6i9">[map]</a></li>
<li>The Arlington (an edgy locals bar) <a href="https://goo.gl/maps/1RoyojPkp3qxRGbw8">[map]</a></li>
<li>Oran Mor (if you&rsquo;re out in the West End beyond midnight, this is the only place that&rsquo;s open till 3am) <a href="https://goo.gl/maps/pAnWoRiYjB96EzHr9">[map]</a></li>
<li>Sub Club (the longest-running underground club in the world, the best spot for techno and a fun night) <a href="https://goo.gl/maps/e7xh4AKv7JnrXDcs5">[map]</a></li>
<li>SWG3: the clue is in the name <a href="https://goo.gl/maps/7rcZiL1TrakfAivP9">[map]</a></li>
</ul>
<h3 id="things-to-do-if-you-have-a-car-1">Things to do if you have a car</h3>
<ul>
<li>Loch Lomond and Conic Hill <a href="https://goo.gl/maps/RgDsXhviKvHKo5dy8">[map]</a></li>
<li>Climb Queen&rsquo;s View <a href="https://goo.gl/maps/RgDsXhviKvHKo5dy8">[map]</a></li>
<li>Glengoyne Distillery <a href="https://g.page/Glengoyne?share">[map]</a></li>
<li>Mugdock Country Park &amp; Castle <a href="https://goo.gl/maps/stpHPSLAe7geNMNJA">[map]</a></li>
</ul>
<h3 id="things-to-do-if-hungover-1">Things to do if hungover</h3>
<ul>
<li>The University Cafe: famous locals&rsquo; cafe that does a sick Scottish breakfast <a href="https://goo.gl/maps/BKQPKYN3LHzc7XNV9">[map]</a></li>
<li>Hyndland Cafe: greasy breakfast food for locals <a href="https://goo.gl/maps/c2iAw3iXAMkpUZaG6">[map]</a></li>
</ul>
<h3 id="lunch-1">Lunch</h3>
<ul>
<li>Epicures: standard brunch food, delicious <a href="https://goo.gl/maps/MvWLvNLRRMeqs7cG7">[map]</a></li>
<li>The Hanoi Bike Shop: easy Vietnamese food <a href="https://goo.gl/maps/1RDdm1VdSJAJLVm2A">[map]</a></li>
</ul>
<h3 id="dinner-1">Dinner</h3>
<ul>
<li>Balbir&rsquo;s: I think it&rsquo;s the best curry I&rsquo;ve had <a href="https://goo.gl/maps/BhrHeMfitjsxrJWW6">[map]</a></li>
<li>La Vita Spuntini: delicious Italian tapas <a href="https://goo.gl/maps/wZAmsMbwzkm2Lig78">[map]</a></li>
<li>Pizza Magic: the best pizza I&rsquo;ve had in the city (also applies if hungover) <a href="https://goo.gl/maps/Sx9reyZYKhrJ1Vwk6">[map]</a></li>
<li>Stravaigin: gourmet Scottish cuisine <a href="https://goo.gl/maps/SWBM2vNa9LLaoWNv9">[map]</a></li>
</ul>
<h3 id="coffee-1">Coffee</h3>
<ul>
<li>Tchai-Ovna House of Tea: this isn&rsquo;t coffee, but don&rsquo;t worry about that. Just come here, drink their tea, smoke their shisha and play chess. It&rsquo;s a great spot. <a href="https://goo.gl/maps/S86eTVP8Dtqg91X49">[map]</a></li>
<li>Laboratorio Espresso: If you do need coffee, this place is pretty good <a href="https://goo.gl/maps/nYNoJb9aFYfWp7Ue6">[map]</a></li>
</ul>
<h2 id="st-andrews">St Andrews</h2>
<h3 id="things-to-do-before-dark-2">Things to do before dark</h3>
<ul>
<li>Have a go at the O.G. crazy golf: The Himalayas. Pitch up, pay a couple of quid at the window and they will give you putters and balls <a href="https://goo.gl/maps/gv1NKrc7Cx7AnWRc6">[map]</a>.</li>
<li>Walk along West Sands, where they filmed <a href="https://www.youtube.com/watch?v=TLbWBlB2aWA"><em>that</em></a> scene in Chariots of Fire <a href="https://goo.gl/maps/uepd8JwdH3uPxvSg7">[map]</a></li>
<li>Get a fudge donut from Fisher &amp; Donaldson <a href="https://goo.gl/maps/LKmFK2JMkuG4Qe3o7">[map]</a></li>
<li>Walk to East Sands and along the pier where the students <a href="https://news.st-andrews.ac.uk/archive/students-take-part-in-traditional-st-andrews-pier-walk/">meet before the start of the new academic year</a> <a href="https://goo.gl/maps/gYVkPDS62mF7HB7v6">[map]</a></li>
<li>If you&rsquo;re there on a Sunday, have a picnic on the 18th fairway of the Old Course. It&rsquo;s public land and there&rsquo;s no play on a Sunday, so you&rsquo;ll meet lots of dog walkers. <a href="https://goo.gl/maps/f5RvVnwfjUP6eVzi9">[map]</a></li>
<li>If you play golf, clearly try and play the Old. If you can&rsquo;t get on there, my next favourite is the New. <a href="https://goo.gl/maps/zE6EZNL9sHhhRdb36">[map]</a></li>
</ul>
<h3 id="things-to-do-after-dark-2">Things to do after dark</h3>
<ul>
<li>The Dunvegan: historic golf-themed pub <a href="https://goo.gl/maps/MxoD6SDWJJtyPhgD8">[map]</a></li>
<li>The Keys: the localest of local pubs <a href="https://goo.gl/maps/7mHbi7KXhN79o9i1A">[map]</a></li>
<li>The Vic: the only place playing music till late <a href="https://goo.gl/maps/zL8KbhBeEM2fGpwS7">[map]</a></li>
<li>The Jigger: sit and watch the golfers go by (nb: if you&rsquo;re a golfer, the Jigger challenge is to nip in after you finish the 17th and drink as many pints as you like in half an hour, then you have to play the 18th in fewer shots than the number of pints drunk) <a href="https://g.page/jigger-inn?share">[map]</a></li>
</ul>
<h3 id="things-to-do-if-you-have-a-car-2">Things to do if you have a car</h3>
<ul>
<li>Go to Elie, have lunch in <a href="https://goo.gl/maps/8XcwJtZ1PNRzUkAG9">The Ship</a> and watch some beach cricket <a href="https://goo.gl/maps/DsCgtdEQGtvQspzG9">[map]</a></li>
<li>Anstruther for the best fish and chips in Scotland! <a href="https://goo.gl/maps/L2q5eY9utz1xB8287">[map]</a></li>
<li>Have a walk around the colourful houses in Pittenweem <a href="https://goo.gl/maps/rtaXBFoxczuNoSRM6">[map]</a></li>
</ul>
<h3 id="things-to-do-if-hungover-2">Things to do if hungover</h3>
<ul>
<li>Munch: cheap, delicious greasy food <a href="https://goo.gl/maps/QQ21ntM7VCJkWYB67">[map]</a></li>
<li>Toastie Bar: 50p toasties, ideal <a href="https://goo.gl/maps/jT3Ki1WT6nDnoynV6">[map]</a></li>
</ul>
<h3 id="lunch-2">Lunch</h3>
<ul>
<li>Forgan&rsquo;s: solid brunch menu <a href="https://g.page/ForgansSTA?share">[map]</a></li>
<li>CombiniCo: East Asian cuisine, sushi etc. <a href="https://goo.gl/maps/jFsSwf7MxADigzTs9">[map]</a></li>
</ul>
<h3 id="dinner-2">Dinner</h3>
<ul>
<li>The Seafood Ristorante: amazing fresh fish from the harbour, pricey <a href="https://g.page/theSeafoodStA?share">[map]</a></li>
<li>The Rav: British cuisine, stylish surroundings <a href="https://goo.gl/maps/kCekTGKVYUNkn6v97">[map]</a></li>
</ul>
<h3 id="coffee-2">Coffee</h3>
<ul>
<li>Taste: my Uncle owns this place! <a href="https://goo.gl/maps/vysZ53CnTTrbBHL1A">[map]</a></li>
</ul>
<h1 id="special-events">Special Events</h1>
<h2 id="edinburgh-fringe">Edinburgh Fringe</h2>
<h3 id="venues-to-visit">Venues to visit</h3>
<ul>
<li>Pleasance Courtyard: high-quality comedy and drama and one of the cool old university buildings. The bars are sick too. <a href="https://goo.gl/maps/nh83Py5T7N5Xdbgc8">[map]</a></li>
<li>Gilded Balloon: Similar to Pleasance, really cool old building with great acts performing every year. <a href="https://goo.gl/maps/skjzcAoqYcY2XAJv8">[map]</a></li>
<li>The Stand: the best comedy club in Edinburgh <a href="https://goo.gl/maps/oCiiZpTNv5nDGqAM8">[map]</a></li>
<li>Underbelly Cowgate: a cool network of bars underneath the city, and a great spot to be for going out after a show <a href="https://goo.gl/maps/NaV4JpBcyueVZpkG7">[map]</a></li>
<li>Udderbelly: a large purple cow <a href="http://www.underbellyedinburgh.co.uk/#stq=&amp;stp=1">[map]</a></li>
</ul>
<h3 id="recurring-shows-to-see">Recurring shows to see</h3>
<ul>
<li>Late n' Live: on every night in the Gilded Balloon from 1am. Different comedians are selected each evening to perform a set. They&rsquo;ve already performed their usual, daily set earlier on, and have probably now had a few celebratory drinks. The crowd are equally loose and it creates a chaotic, hilarious atmosphere <a href="https://latenlive.co.uk/about/">[map]</a></li>
</ul>
<h2 id="north-coast-500">North Coast 500</h2>
<p>If you&rsquo;re planning a road trip through the Highlands then you&rsquo;re probably familiar with the North Coast 500. You can&rsquo;t go wrong following <a href="https://en.wikipedia.org/wiki/North_Coast_500#:~:text=The%20North%20Coast%20500%20is,Scotland%20in%20one%20touring%20route.">the intended route</a>, but here&rsquo;s a little advice on places I think you should definitely visit on the way.</p>
<ul>
<li>Glenfinnan Viaduct: If you&rsquo;ve seen Harry Potter then you&rsquo;ll recognise this as the bridge the train crosses en route to Hogwarts. It&rsquo;s beautiful, and works as a great spot to stop for lunch. <a href="https://goo.gl/maps/A5LmFpunQxGNgU9P9">[map]</a></li>
<li>The <a href="https://images.app.goo.gl/n6dZcDgixxPJad2r8">James Bond View</a> <a href="https://goo.gl/maps/Yvm6WqbqDW3HLLGZA">[map]</a></li>
<li>Eilean Donan Castle, also in James Bond lol <a href="https://goo.gl/maps/WQNE2ip1viDqgpux9">[map]</a></li>
<li>The prettiest bit of road to drive is near Applecross <a href="https://goo.gl/maps/gUFfQctdsUB9J168A">[map]</a></li>
<li>Get a flight to <a href="https://en.wikipedia.org/wiki/Barra_Airport">Barra</a> in the Outer Hebrides and land on the beach! <a href="https://goo.gl/maps/27GKka1FZMAGK8QX7">[map]</a></li>
<li>Isle of Harris, and specifically <a href="https://goo.gl/maps/wkoNLZcS1CpCdENM7">Luskentyre Beach</a> <a href="https://goo.gl/maps/wkoNLZcS1CpCdENM7">[map]</a></li>
<li>Take the <a href="https://en.wikipedia.org/wiki/Westray_to_Papa_Westray_flight#:~:text=The%20Loganair%20Westray%20to%20Papa,fastest%20flight%20is%2053%20seconds.">shortest commercial flight</a> in the world from Westray to Papa Westray. It lasts 53 seconds. <a href="https://goo.gl/maps/AkH13qdFuG4KXXHB9">[map]</a></li>
</ul>
<h1 id="an-aside-on-golf-courses">An Aside on Golf Courses</h1>
<p>If you are into golf, here are my top 5 courses, and some hidden gems:</p>
<h2 id="top-5">Top 5</h2>
<ol>
<li>Royal Dornoch</li>
<li>Muirfield</li>
<li>Carnoustie</li>
<li>Troon</li>
<li>Kingsbarns</li>
</ol>
<p>I&rsquo;m yet to play Turnberry since the updates, but I&rsquo;m told it&rsquo;s now no. 1.</p>
<h2 id="hidden-gems-in-no-particular-order">Hidden Gems (in no particular order)</h2>
<ul>
<li>Ladybank</li>
<li>Moray</li>
<li>Nairn (if you consider it hidden)</li>
<li>Elie</li>
<li>Monifieth</li>
<li>Murcar</li>
<li>Shiskine</li>
<li>Luffness</li>
</ul>
<p>If you have any questions, please do send me an email. You can find my address on <a href="https://enjeeneer.io/">the homepage</a>.</p>
]]></content>
        </item>
        
    </channel>
</rss>
