<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>John Paton</title><link href="https://johnpaton.net/" rel="alternate"></link><link href="https://johnpaton.net/feeds/all.atom.xml" rel="self"></link><id>https://johnpaton.net/</id><updated>2020-11-21T14:00:00-01:00</updated><subtitle>
  &lt;div style="width:100%;align-content:center;padding:0;"&gt;
    &lt;div style="margin:auto 0;padding:0;display:inline-block;"&gt;
      &lt;span style="float:left"&gt;
        &lt;b style="color:#3aa500;margin-right:2px"&gt;/&lt;/b&gt;data&lt;b style="color:#3aa500;margin-right:2px;margin-left:2px"&gt;/&lt;/b&gt;scientist
      &lt;/span&gt;&lt;br&gt;
      &lt;span style="float:left"&gt;
        &lt;b style="color:#3aa500;margin-right:2px"&gt;/&lt;/b&gt;ml&lt;b style="color:#3aa500;margin-right:2px;margin-left:2px"&gt;/&lt;/b&gt;engineer
      &lt;/span&gt;
    &lt;/div&gt;
  &lt;/div&gt;
</subtitle><entry><title>Syntax highlighting for console sessions</title><link href="https://johnpaton.net/posts/console-highlighting/" rel="alternate"></link><published>2020-11-21T14:00:00-01:00</published><updated>2020-11-21T14:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2020-11-21:/posts/console-highlighting/</id><summary type="html">&lt;p&gt;It&amp;#8217;s a minor annoyance that comes up often in GitHub comments: syntax highlighting for Python console sessions. You want the input code (after the prompt) to be highlighted, but the output (which is generally just text or logs) to remain neutral. Turns out there&amp;#8217;s a syntax highlighter that does just&amp;nbsp;that.&lt;/p&gt;</summary><content type="html">&lt;p&gt;It&amp;#8217;s a minor annoyance that comes up often in GitHub comments: syntax highlighting for Python console sessions. You want the input code (after the prompt) to be highlighted, but the output (which is generally just text or logs) to remain neutral. As far as I can tell, most people opt for one of two options: no highlighting at all (boring, can be harder to read), or Python syntax highlighting (looks nicer but probably makes all kinds of weird colors in your&amp;nbsp;output). &lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s a classic example where the output includes a progress&amp;nbsp;bar:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&amp;gt;&amp;gt;&amp;gt; my_str = &amp;quot;hello&amp;quot;
&amp;gt;&amp;gt;&amp;gt; some_int = 123
&amp;gt;&amp;gt;&amp;gt; func_with_output(a=1, b=my_str)
Processing: 100%|██████████████████| 10/10 [00:01&amp;lt;00:00,  9.71it/s]
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;We would like to spruce this up a bit with some syntax highlighting. If we start our code block with &lt;code&gt;```python&lt;/code&gt; instead of plain triple backticks (which start a code block in Markdown), we end up with something that looks&amp;nbsp;like&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;my_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hello&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;some_int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;func_with_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Processing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;%|&lt;/span&gt;&lt;span class="err"&gt;██████████████████&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;01&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mo"&gt;00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mf"&gt;9.71&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This is &lt;em&gt;okay&lt;/em&gt;, but the progress bar looks silly and the coloring makes it harder to tell what&amp;#8217;s input and what&amp;#8217;s output. However, there is actually another syntax highlighter specifically for this case: &lt;code&gt;pycon&lt;/code&gt; (instead of &lt;code&gt;python&lt;/code&gt;). It highlights lines that start with the Python prompt (&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt;), but doesn&amp;#8217;t highlight output lines, which don&amp;#8217;t have any prompt, leaving us&amp;nbsp;with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;my_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hello&amp;quot;&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;some_int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;func_with_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Processing: 100%|██████████████████| 10/10 [00:01&amp;lt;00:00,  9.71it/s]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;As a bonus, for shell sessions (which usually have a less obvious prompt, making it more important to distinguish input from output), you can do the same thing using the &lt;code&gt;console&lt;/code&gt; highlighter!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;output colored differently&amp;quot;&lt;/span&gt;
&lt;span class="go"&gt;output colored differently&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Perfection.&lt;/p&gt;</content><category term="posts"></category><category term="snippets"></category><category term="open source"></category></entry><entry><title>Receiving a Google Open Source Peer Bonus award</title><link href="https://johnpaton.net/posts/open-source-award/" rel="alternate"></link><published>2020-05-07T11:00:00-01:00</published><updated>2020-05-07T11:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2020-05-07:/posts/open-source-award/</id><summary type="html">&lt;p&gt;Over the past few years I&amp;#8217;ve increasingly tried to make small &lt;a href="https://github.com/search?q=author%3AJohnPaton+is%3Apr&amp;amp;type=Issues"&gt;contributions&lt;/a&gt; to open source projects that I use. I&amp;#8217;m not on the core team of any one project, so usually my contributions are very small. That&amp;#8217;s why I was very surprised when I got an email from Google&amp;#8217;s Open Source Peer Bonus program, letting me know that I had been&amp;nbsp;nominated!&lt;/p&gt;</summary><content type="html">&lt;p&gt;Over the past few years I&amp;#8217;ve increasingly tried to make small &lt;a href="https://github.com/search?q=author%3AJohnPaton+is%3Apr+is%3Amerged"&gt;contributions&lt;/a&gt; to open source projects that I use. Usually these come from me encountering a small issue that I know &lt;a href="https://github.com/kubeflow/kubeflow/pull/3107"&gt;how to fix&lt;/a&gt;, or lightly improving the documentation. Occasionally I&amp;#8217;ve also used the process of &amp;#8220;making your first contribution&amp;#8221; to get to know a tool &lt;a href="https://github.com/tiangolo/fastapi/pull/1106"&gt;from the inside&lt;/a&gt;. I&amp;#8217;m not a developer, and I&amp;#8217;m also not on the core team of any one project, so usually my contributions are &lt;a href="https://github.com/jupyterhub/zero-to-jupyterhub-k8s/pull/1183"&gt;very small&lt;/a&gt;. 
That&amp;#8217;s why I was very surprised a few weeks ago when I got an email from Google&amp;#8217;s &lt;a href="https://opensource.google/docs/growing/peer-bonus/"&gt;Open Source Peer Bonus&lt;/a&gt; program, letting me know that I had been&amp;nbsp;nominated! &lt;/p&gt;
&lt;p&gt;&lt;a href="https://opensource.googleblog.com/2020/01/announcing-2019-second-cycle-google.html"&gt;&lt;img alt="OSPB logo" src="/images/google-ospb.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It turns out that Google&amp;#8217;s employees who maintain open source projects can nominate contributors as a thank you from Google and from the community. Receiving the award gets you a nice letter, a mention in their &lt;a href="https://opensource.googleblog.com/2020/01/announcing-2019-second-cycle-google.html"&gt;blog post&lt;/a&gt;, and a small cash prize. I was nominated for contributing to the Python client libraries for &lt;a href="https://cloud.google.com/bigquery"&gt;Google BigQuery&lt;/a&gt; (&lt;a href="https://github.com/googleapis/google-cloud-python/tree/master/bigquery"&gt;&lt;code&gt;google-cloud-python&lt;/code&gt;&lt;/a&gt; and its pandas-compatible layer, &lt;a href="https://github.com/pydata/pandas-gbq"&gt;&lt;code&gt;pandas-gbq&lt;/code&gt;&lt;/a&gt;), which we use every day at &lt;a href="https://www.catawiki.com/jobs"&gt;Catawiki&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;These were the contributions I was nominated&amp;nbsp;for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/pydata/pandas-gbq/pull/256"&gt;pydata/pandas-gbq#256&lt;/a&gt;: My first time contributing to one of these projects, this &lt;span class="caps"&gt;PR&lt;/span&gt; was just a simple cleanup of a usage example in the documentation. More of a nitpick than anything else, but they reacted in a friendly way which made it much less scary to contribute actual code in the&amp;nbsp;future.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/pydata/pandas-gbq/pull/257"&gt;pydata/pandas-gbq#257&lt;/a&gt;: This one was an actual feature for pandas-gbq. When you upload a DataFrame to BigQuery, pandas-gbq generates a schema for your table based on the pandas dtypes. Alternately, you can provide your own schema if you want more control, e.g. when dealing with time columns (&lt;code&gt;DATE&lt;/code&gt;? &lt;code&gt;DATETIME&lt;/code&gt;? &lt;code&gt;TIMESTAMP&lt;/code&gt;?). This &lt;span class="caps"&gt;PR&lt;/span&gt; added the ability to provide a schema for only a subset of columns and to fall back to the generated schema for the rest, which is handy if you have a very wide DataFrame and you only want to cast a few&amp;nbsp;columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/googleapis/google-cloud-python/pull/7552"&gt;googleapis/google-cloud-python#7552&lt;/a&gt;: I&amp;#8217;m a simple man. I see incremental progress, I want a progress bar. This &lt;span class="caps"&gt;PR&lt;/span&gt;, which originally started as a &lt;a href="https://github.com/pydata/pandas-gbq/issues/182"&gt;feature request&lt;/a&gt; on pandas-gbq, added a &lt;a href="https://tqdm.github.io/"&gt;&lt;code&gt;tqdm&lt;/code&gt;&lt;/a&gt; progress bar to Google&amp;#8217;s BigQuery client when downloading data from BigQuery. I personally wanted this as I was spending a non-negligible amount of time in those days downloading large-ish training sets. As you&amp;#8217;ll see in the comments of the &lt;span class="caps"&gt;PR&lt;/span&gt;, I got a ton of feedback about my initial commits. It was all very friendly and I learned a lot about the code-review process among real software engineers. In the end, my original request to include this feature in pandas-gbq (and not just google-cloud-python) was actually &lt;a href="https://github.com/pydata/pandas-gbq/pull/292"&gt;implemented by someone else&lt;/a&gt;! All around a cool example of open source in&amp;nbsp;action.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
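&lt;p&gt;The partial-schema idea from the second contribution can be sketched in a few lines. This is only an illustration of the fallback logic, not the code from the &lt;span class="caps"&gt;PR&lt;/span&gt;: the function name &lt;code&gt;merge_schemas&lt;/code&gt;, the list-of-dicts layout, and the example columns are all assumptions made up for this&amp;nbsp;sketch.&lt;/p&gt;

```python
def merge_schemas(generated, overrides):
    """Use the user-supplied field where one exists, else the generated one."""
    by_name = {field["name"]: field for field in overrides}
    return [by_name.get(field["name"], field) for field in generated]


# Hypothetical schema, as if inferred from the DataFrame dtypes
generated = [
    {"name": "id", "type": "INTEGER"},
    {"name": "created", "type": "TIMESTAMP"},
    {"name": "price", "type": "FLOAT"},
]

# Only the one column we care about is specified by hand
overrides = [{"name": "created", "type": "DATE"}]

schema = merge_schemas(generated, overrides)
```

&lt;p&gt;Column order follows the generated schema, so a wide DataFrame keeps its layout and only the overridden columns&amp;nbsp;change.&lt;/p&gt;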
&lt;p&gt;It&amp;#8217;s pretty common to be afraid of contributing to open source projects because you&amp;#8217;re worried about making mistakes, being criticized, etc. For data scientists, I expect this to be even worse as you likely have the added layer of impostor-syndrome from not being a &amp;#8220;real&amp;#8221; software developer. But if you have tools that you use often and that make your life better, then you get to know them and their quirks pretty quickly, which makes it easier to get started. And maintainers of projects are often very happy to know that people care and want to contribute, especially if you read their &lt;code&gt;CONTRIBUTING.md&lt;/code&gt; or other contribution guidelines first. At the time of writing, there are over &lt;a href="https://github.com/search?l=Python&amp;amp;o=desc&amp;amp;q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22&amp;amp;s=&amp;amp;type=Issues"&gt;15,000 open &amp;#8220;good first issues&amp;#8221; in Python projects on GitHub&lt;/a&gt;, specifically tagged for people looking to get involved for the first time. Why not give it a try and dive&amp;nbsp;in?&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="open source"></category><category term="pandas"></category></entry><entry><title>Introducing airbase: A Python client for the European Air Quality e-Reporting Database</title><link href="https://johnpaton.net/posts/airbase/" rel="alternate"></link><published>2020-02-03T15:00:00-01:00</published><updated>2020-02-03T15:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2020-02-03:/posts/airbase/</id><summary type="html">&lt;p&gt;The European Environment Agency (&lt;span class="caps"&gt;EEA&lt;/span&gt;) provides a selection of datasets about air quality in Europe. 
The data is available for download at &lt;a href="http://discomap.eea.europa.eu/map/fme/AirQualityExport.htm"&gt;the portal&lt;/a&gt;, but the interface makes it a bit time consuming to do bulk downloads. Hence, an easy Python-based&amp;nbsp;interface.&lt;/p&gt;</summary><content type="html">&lt;p&gt;As part of its open data initiatives, the &lt;a href="https://www.eea.europa.eu/"&gt;European Environment Agency (&lt;span class="caps"&gt;EEA&lt;/span&gt;)&lt;/a&gt; provides a selection of datasets about air quality in Europe, collectively known as the &lt;a href="https://www.eea.europa.eu/data-and-maps/data/aqereporting-8#tab-european-data"&gt;Air Quality e-Reporting&lt;/a&gt; database. The richest subset of this data is the air quality time series data. There are 2 subsets: E1a, which is cleaner and more validated, and E2a, which is more up to date. This is an important dataset for climate research, as well as a great source for data scientists who are looking for some non-trivial real world data to practice on (when combined with the measurement station metadata, you have both geodata and&amp;nbsp;timeseries).&lt;/p&gt;
&lt;p&gt;The one major obstacle to using this data is the download interface (a.k.a. &amp;#8220;the portal&amp;#8221;), which is very cumbersome. First you need to navigate to the &lt;a href="http://discomap.eea.europa.eu/map/fme/AirQualityExport.htm"&gt;download page&lt;/a&gt; and select the parameters of the dataset you want to download, like the pollutant of interest, the location, and the time&amp;nbsp;span:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Airbase download form" src="/images/airbase-download-form.png"&gt;&lt;/p&gt;
&lt;p&gt;The form constructs a &lt;span class="caps"&gt;URL&lt;/span&gt; with query parameters matching what you&amp;#8217;ve selected in the form (this was my first hint that this should be easy to automate). Once you hit the &amp;#8220;Download&amp;#8221; button, you would expect that it would download the dataset, but that would of course be too easy. If you hit the button &lt;a href="https://fme.discomap.eea.europa.eu/fmedatastreaming/AirQualityDownload/AQData_Extract.fmw?CountryCode=NL&amp;amp;CityName=&amp;amp;Pollutant=46&amp;amp;Year_from=2014&amp;amp;Year_to=2017&amp;amp;Station=&amp;amp;Samplingpoint=&amp;amp;Source=E1a&amp;amp;Output=HTML&amp;amp;UpdateDate=&amp;amp;TimeCoverage=Year"&gt;for the parameters I&amp;#8217;ve filled in&lt;/a&gt; you&amp;#8217;re greeted with a list of links to various &lt;span class="caps"&gt;CSV&lt;/span&gt; files to individually click on and download&amp;nbsp;yourself. &lt;/p&gt;
&lt;p&gt;&lt;img alt="Airbase CSV links" src="/images/airbase-links.png"&gt;&lt;/p&gt;
&lt;p&gt;For this particular query this isn&amp;#8217;t &lt;em&gt;too&lt;/em&gt; bad, but if you want to query many types of pollutant or a bigger group of countries, this is going to cost you some serious&amp;nbsp;time. &lt;/p&gt;
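&lt;p&gt;Because the form only assembles a &lt;span class="caps"&gt;URL&lt;/span&gt; from query parameters, that step is easy to reproduce with the standard library. Here is a minimal sketch: the base &lt;span class="caps"&gt;URL&lt;/span&gt; and parameter names are copied from the link above, while &lt;code&gt;build_portal_url&lt;/code&gt; and its arguments are names invented for illustration and are not necessarily how &lt;code&gt;airbase&lt;/code&gt; does&amp;nbsp;it.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Base URL and parameter names copied from the portal link shown above
BASE_URL = (
    "https://fme.discomap.eea.europa.eu/fmedatastreaming/"
    "AirQualityDownload/AQData_Extract.fmw"
)


def build_portal_url(country, pollutant, year_from, year_to, source="E1a"):
    """Assemble the same kind of query string the download form produces."""
    params = {
        "CountryCode": country,
        "CityName": "",
        "Pollutant": pollutant,
        "Year_from": year_from,
        "Year_to": year_to,
        "Station": "",
        "Samplingpoint": "",
        "Source": source,
        "Output": "HTML",
        "UpdateDate": "",
        "TimeCoverage": "Year",
    }
    # urlencode keeps the dict order and leaves empty values as "key="
    return BASE_URL + "?" + urlencode(params)


url = build_portal_url("NL", 46, 2014, 2017)
```

&lt;p&gt;Looping over countries and pollutants then becomes a matter of calling the function repeatedly instead of filling in the form by&amp;nbsp;hand.&lt;/p&gt;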
&lt;h1&gt;A better&amp;nbsp;way&lt;/h1&gt;
&lt;p&gt;To make this process easier, I developed &lt;a href="https://airbase.readthedocs.io/en/latest/"&gt;&lt;code&gt;airbase&lt;/code&gt;: an easy Python client&lt;/a&gt; for accessing the data (this database was &lt;a href="https://www.eea.europa.eu/data-and-maps/data/airbase-the-european-air-quality-database-7"&gt;formerly known&lt;/a&gt; as AirBase, and I thought it was a catchy name). It started off as a script to help a friend of mine who is in climate research, and I realized with a bit of cleanup it might be useful to other people as well. It&amp;#8217;s available on PyPI, so to install you can&amp;nbsp;simply &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt; pip install airbase
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;To start downloading your dataset, import the package and initialize the&amp;nbsp;client:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;airbase&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;airbase&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AirbaseClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The client helps you to construct your request, and does some validation for you, like checking that the pollutant you want is available in the countries you&amp;#8217;re asking for. It does this by downloading some files from the portal, so this requires an internet&amp;nbsp;connection.&lt;/p&gt;
&lt;p&gt;Kind of like using the portal, but more conveniently, you next construct the parameters of the dataset you&amp;#8217;re looking&amp;nbsp;for:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NL&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;NO3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;year_from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2014&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;year_to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;If you don&amp;#8217;t include a parameter, the client will construct a request for all possible values (so you can just use &lt;code&gt;client.request()&lt;/code&gt; to get the whole&amp;nbsp;dataset).&lt;/p&gt;
&lt;p&gt;With your request constructed, all that&amp;#8217;s left to do is to choose how you want to download the data. You can choose to either &lt;a href="https://airbase.readthedocs.io/en/latest/airbase.html#airbase.AirbaseRequest.download_to_directory"&gt;&lt;code&gt;download_to_directory()&lt;/code&gt;&lt;/a&gt; to get all those CSVs individually, or you can &lt;a href="https://airbase.readthedocs.io/en/latest/airbase.html#airbase.AirbaseRequest.download_to_file"&gt;&lt;code&gt;download_to_file()&lt;/code&gt;&lt;/a&gt; to concatenate them into one big &lt;span class="caps"&gt;CSV&lt;/span&gt;. Either way, the request object will first contact the portal to get the links to all the CSVs you need, and then start downloading them as instructed. Of course, you can follow the progress with nice progress&amp;nbsp;bars.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download_to_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Generating CSV download links...&lt;/span&gt;
&lt;span class="go"&gt;100%|██████████████████████████| 1/1 [00:03&amp;lt;00:00,  3.14s/it]&lt;/span&gt;
&lt;span class="go"&gt;Generated 16 CSV links ready for downloading&lt;/span&gt;
&lt;span class="go"&gt;Downloading CSVs to ./data...&lt;/span&gt;
&lt;span class="go"&gt;100%|██████████████████████████| 16/16 [00:03&amp;lt;00:00,  5.34it/s]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;If you want to update your dataset later (e.g. getting the last week&amp;#8217;s worth of data), &lt;code&gt;download_to_directory()&lt;/code&gt; will automatically skip downloading most of the files that are already&amp;nbsp;there. &lt;/p&gt;
&lt;p&gt;Hunting for correlations between locations? Make sure to download the metadata file that contains the locations and other properties of the measurement stations that supply the&amp;nbsp;data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;download_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;./data/metadata.tsv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Writing metadata to ./data/metadata.tsv...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Full documentation is available on &lt;a href="https://airbase.readthedocs.io/en/latest"&gt;ReadTheDocs&lt;/a&gt;, and of course the whole package is &lt;a href="https://github.com/johnpaton/airbase"&gt;open sourced on GitHub&lt;/a&gt; too. I know at least a handful of people have used the package, including &lt;a href="https://www.atmos-chem-phys.net/19/11821/2019"&gt;one confirmed publication&lt;/a&gt; (by the friend I wrote the original script for), which is very&amp;nbsp;cool!&lt;/p&gt;
&lt;p&gt;Have you used &lt;code&gt;airbase&lt;/code&gt; in your research or learning? Let me know! I&amp;#8217;d love to hear about&amp;nbsp;it! &lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="open source"></category><category term="data"></category><category term="time series"></category></entry><entry><title>Schedule the interruption of hung Python processes with signals</title><link href="https://johnpaton.net/posts/interrupt-long-processes/" rel="alternate"></link><published>2019-07-13T13:00:00-01:00</published><updated>2019-07-13T13:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2019-07-13:/posts/interrupt-long-processes/</id><summary type="html">&lt;p&gt;A lightweight method to interrupt (hung) Python processes after a set time using the &lt;code&gt;signal&lt;/code&gt; library.&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Catawiki, we have a lot of scheduled cron-type jobs that move data around, train models, and do other processing tasks. Often, they are dependent on external resources like database connections. If one of these jobs actually fails, emails are sent, and we can fix the problem. However, we have had issues in the past with scheduled jobs silently hanging, which can often go unnoticed until someone&amp;#8217;s output hasn&amp;#8217;t been refreshed. One solution to this would be to schedule these small jobs in our Airflow pipeline, but for little things that just need to run every night without hassle we use this lightweight trick&amp;nbsp;instead.&lt;/p&gt;
&lt;p&gt;To make a Python process fail and get a precious notification, we can force it to raise a &lt;code&gt;TimeoutError&lt;/code&gt; after a certain number of seconds using the &lt;a href="https://docs.python.org/3/library/signal.html"&gt;&lt;code&gt;signal&lt;/code&gt; library&lt;/a&gt;. This library is responsible for handling &lt;a href="https://en.wikipedia.org/wiki/Signal_(IPC)"&gt;Unix Signals&lt;/a&gt;, which are a way to communicate asynchronously with running&amp;nbsp;processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you are doing fancy stuff with multiple threads, please read the signal docs as there can be weird behavior here. Also, alarm scheduling is only available on Unix-like&amp;nbsp;systems.&lt;/p&gt;
&lt;p&gt;Technically Step 3 below is already enough to cause the timeout by itself, but your output will be very terse. Defining the signal handler with a friendly message ensures that both your colleagues and your future self will be able to read your code and your&amp;nbsp;logs.&lt;/p&gt;
&lt;h2&gt;Step 1: Define a Signal&amp;nbsp;Handler&lt;/h2&gt;
&lt;p&gt;A signal handler is a function that can be called when a signal is received. Signal handlers must accept &lt;a href="https://docs.python.org/3/library/signal.html#signal.signal"&gt;two arguments&lt;/a&gt;, but our handler will just ignore them and raise a &lt;code&gt;TimeoutError&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;signal&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;timeout_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Your process has timed out!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Raising an exception will interrupt the process, killing whatever was going on. If this is too extreme and you just want to be alerted that a process is still running without killing it, you could also do something like send an email or send yourself a &lt;a href="https://github.com/slackapi/python-slackclient#sending-a-message-to-slack"&gt;Slack message&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;slack&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;slack_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Don&amp;#39;t keep your token in plain text :)&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;SLACK_API_TOKEN&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slack&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WebClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_postMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;#python_alerts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Your process is still running!&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h2&gt;Step 2: Assign the signal handler to the alarm&amp;nbsp;signal&lt;/h2&gt;
&lt;p&gt;Next we use &lt;a href="https://docs.python.org/3/library/signal.html#signal.signal"&gt;&lt;code&gt;signal.signal&lt;/code&gt;&lt;/a&gt; to tell Python that our new handler should be called whenever the running process receives the alarm signal, denoted by &lt;code&gt;signal.SIGALRM&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Call `timeout_handler` when we receive an alarm signal&lt;/span&gt;
&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGALRM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h2&gt;Step 3: Schedule an&amp;nbsp;alarm&lt;/h2&gt;
&lt;p&gt;Now we can &lt;a href="https://docs.python.org/3/library/signal.html#signal.alarm"&gt;schedule an alarm&lt;/a&gt; to be sent to our process after a set number of seconds. If the process has exited before that time has passed, nothing will happen. If the process is still running when the alarm is sent, then our handler will be called, interrupting the process and raising an exception. If you&amp;#8217;re going to interrupt a process, this timeout should be comfortably longer than you expect the process to take, so that it is only interrupted if something is really&amp;nbsp;stuck.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# Schedule an alarm to be sent in 10 seconds &lt;/span&gt;
&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


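&lt;p&gt;You can now run your main loop or do whatever work you are trying to do, and the process will fail if it is still running once the time you specified has passed. One caveat: Python only runs signal handlers in the main thread, between bytecode instructions, so a long call implemented purely in C can delay the exception until it&amp;#8217;s&amp;nbsp;done.&lt;/p&gt;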
&lt;p&gt;If you are writing a module with functions that might get imported elsewhere, make sure to put the &lt;code&gt;signal.signal&lt;/code&gt; and &lt;code&gt;signal.alarm&lt;/code&gt; lines under your &lt;code&gt;if __name__ == "__main__"&lt;/code&gt; statement so that the alarm doesn&amp;#8217;t get triggered every time your module is&amp;nbsp;imported.&lt;/p&gt;
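&lt;p&gt;One thing the steps above don&amp;#8217;t cover is cancelling the alarm once the work has finished, which matters if the process keeps running afterwards. Below is a minimal sketch of one way to do that, assuming a Unix system. The names &lt;code&gt;time_limit&lt;/code&gt; and &lt;code&gt;main&lt;/code&gt; are made up for illustration; &lt;code&gt;signal.alarm(0)&lt;/code&gt; cancels any pending alarm, and &lt;code&gt;signal.signal&lt;/code&gt; returns the previous handler so that it can be&amp;nbsp;restored.&lt;/p&gt;

```python
import signal
import time
from contextlib import contextmanager


def timeout_handler(sig, frame):
    raise TimeoutError("Your process has timed out!")


@contextmanager
def time_limit(seconds):
    # Hypothetical helper: install the handler, remembering the previous one
    previous = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel the alarm if it has not fired yet
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def main():
    with time_limit(10):
        time.sleep(0.1)  # stand-in for the real work


# Nothing is scheduled when this module is merely imported
if __name__ == "__main__":
    main()
```

&lt;p&gt;Wrapping the work in a &lt;code&gt;with&lt;/code&gt; block keeps the alarm scoped to that block, so code that runs afterwards is not interrupted by a stale&amp;nbsp;alarm.&lt;/p&gt;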
&lt;h2&gt;Complete&amp;nbsp;Example&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# timeout.py&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;signal&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;timeout_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Your process has timed out!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Call `timeout_handler` when we receive an alarm signal&lt;/span&gt;
&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGALRM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Example Usage&lt;/span&gt;

&lt;span class="c1"&gt;# Schedule an alarm in 10 seconds &lt;/span&gt;
&lt;span class="c1"&gt;# (will raise TimeoutError as specified)&lt;/span&gt;
&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Do some work which takes too long&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;seconds passed&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Running our example yields the following&amp;nbsp;output:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt; python timeout.py
&lt;span class="go"&gt;0 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;1 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;2 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;3 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;4 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;5 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;6 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;7 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;8 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;9 seconds passed&lt;/span&gt;
&lt;span class="go"&gt;Traceback (most recent call last):&lt;/span&gt;
&lt;span class="go"&gt;  File &amp;quot;timeout.py&amp;quot;, line 21, in &amp;lt;module&amp;gt;&lt;/span&gt;
&lt;span class="go"&gt;    time.sleep(1)&lt;/span&gt;
&lt;span class="go"&gt;  File &amp;quot;timeout.py&amp;quot;, line 5, in timeout_handler&lt;/span&gt;
&lt;span class="go"&gt;    raise TimeoutError(&amp;quot;Your process has timed out!&amp;quot;)&lt;/span&gt;
&lt;span class="go"&gt;TimeoutError: Your process has timed out!&lt;/span&gt;
&lt;span class="go"&gt;exit 1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category><category term="python"></category><category term="snippets"></category></entry><entry><title>Generating fake whiskey reviews with GPT-2</title><link href="https://johnpaton.net/posts/whiskey-reviews/" rel="alternate"></link><published>2019-06-23T19:00:00-01:00</published><updated>2019-06-23T19:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2019-06-23:/posts/whiskey-reviews/</id><summary type="html">&lt;p&gt;I&amp;#8217;ve enjoyed whiskey for a while now, but I can never vocalize all the flavors present in a bottle. I read all these flowery reviews and tasting notes online and I just have no idea how these people come up with &lt;a href="http://whiskyadvocate.com/ratings-reviews/?search=Johnnie+Walker+Blue+Label&amp;amp;submit=+&amp;amp;brand_id=0&amp;amp;rating=0&amp;amp;price=0&amp;amp;category=0&amp;amp;styles_id=0&amp;amp;issue_id=0"&gt;descriptions like&lt;/a&gt; &amp;#8220;caramels, dried peats, elegant cigar smoke, seeds scraped from vanilla beans, brand new pencils, peppercorn, coriander seeds, and star anise&amp;#8221;&amp;#8230; until&amp;nbsp;now.&lt;/p&gt;</summary><content type="html">&lt;p&gt;I&amp;#8217;ve enjoyed whiskey for a while now, but I can never vocalize all the flavors present in a bottle. I read all these flowery reviews and tasting notes online and I just have no idea how these people come up with &lt;a href="http://whiskyadvocate.com/ratings-reviews/?search=Johnnie+Walker+Blue+Label&amp;amp;submit=+&amp;amp;brand_id=0&amp;amp;rating=0&amp;amp;price=0&amp;amp;category=0&amp;amp;styles_id=0&amp;amp;issue_id=0"&gt;descriptions like&lt;/a&gt; &amp;#8220;caramels, dried peats, elegant cigar smoke, seeds scraped from vanilla beans, brand new pencils, peppercorn, coriander seeds, and star anise&amp;#8221;&amp;#8230; until&amp;nbsp;now. &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-data"&gt;The&amp;nbsp;data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fine-tuning"&gt;Fine-tuning &lt;span class="caps"&gt;GPT&lt;/span&gt;-2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fake-reviews"&gt;Fake whiskey&amp;nbsp;reviews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#highlights"&gt;Highlights&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#bloopers"&gt;Bloopers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href="https://openai.com/"&gt;OpenAI&lt;/a&gt; recently made headlines with their blog post &lt;a href="https://openai.com/blog/better-language-models/"&gt;Better Language Models and Their Implications&lt;/a&gt;, in which they described their latest general language model, dubbed &lt;a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf"&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2&lt;/a&gt;. In the post, they claim to be practicing responsible (non)disclosure by not releasing the full pre-trained model due to concerns that it is so good, it could be used for fake news generation or social media manipulation. However, they &lt;em&gt;did&lt;/em&gt; release smaller versions of the model which are nonetheless still quite performant. You can play with one of them at &lt;a href="https://talktotransformer.com/"&gt;talktotransformer.com&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;Obviously this piqued my interest. I&amp;#8217;ve dabbled in &lt;a href="https://johnpaton.net/posts/engl_ish/"&gt;purposely &lt;em&gt;bad&lt;/em&gt; language generation&lt;/a&gt;, but this time I was curious about doing something vaguely believable. I decided to try to sound more impressive in my whiskey-ing by generating my very own fake whiskey reviews, fine-tuning the small version of &lt;span class="caps"&gt;GPT&lt;/span&gt;-2 on existing&amp;nbsp;reviews.&lt;/p&gt;
&lt;p&gt;&lt;img style="max-height: 350px" src="/images/robot-whiskey.jpg" alt="Robot hand holding whiskey"&gt;
&lt;div style="text-align:center"&gt;&lt;small&gt; Image by the &lt;a href="https://nypost.com/2019/05/14/microsoft-partners-with-distillery-to-create-worlds-first-ai-whiskey/"&gt;New York Post&lt;/a&gt; &lt;/small&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;&lt;a id='the-data'&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The&amp;nbsp;data&lt;/h2&gt;
&lt;p&gt;The first site on Google when I searched for &amp;#8220;whiskey reviews&amp;#8221; was &lt;a href="http://whiskyadvocate.com/"&gt;whiskyadvocate.com&lt;/a&gt;, and it seemed perfect for my needs. Short, consistently formatted reviews, and all on one long page making them easy to&amp;nbsp;scrape. &lt;/p&gt;
&lt;p&gt;The only snag is that you have to click a &amp;#8220;See More&amp;#8221; button over and over to get 6 more reviews to render each time. I know you can use sophisticated scraping tools to deal with this situation, but I decided that that would be overkill for this project. Luckily I&amp;#8217;ve been JavaScripting a little bit recently so it occurred to me that you can pop open the browser console and&amp;nbsp;run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;btn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;loadMore&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;btn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;click&lt;/span&gt;&lt;span class="p"&gt;();};&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;After glitching out for a bit, the whole set of reviews is visible on the page. One &lt;code&gt;ctrl+S&lt;/code&gt; and we have the reviews ready for parsing with &lt;code&gt;beautifulsoup&lt;/code&gt;. Looking at the &lt;span class="caps"&gt;HTML&lt;/span&gt;, we see that each review is in its own &lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt; block, so they are easy enough to&amp;nbsp;extract:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;data/raw/whisky_advocate.html&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;html.parser&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# name the parser explicitly&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;article&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Checking out the text structure of one &amp;#8220;article&amp;#8221;, we see that all we really need to do to clean up is drop empty lines and replace non-breaking spaces (&lt;code&gt;\xa0&lt;/code&gt;) with normal&amp;nbsp;ones.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;reviews_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;a_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\xa0&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# replace non-breaking spaces&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;  &lt;span class="c1"&gt;# drop empty lines&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# rejoin into one formatted snippet&lt;/span&gt;
    &lt;span class="n"&gt;reviews_clean&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 
&lt;/pre&gt;&lt;/div&gt;
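&lt;p&gt;To see what this cleanup does without needing the scraped page, here is a small sketch that applies the same two rules to a single string. The function name &lt;code&gt;clean_review&lt;/code&gt; and the sample text are made up for illustration and are not part of the original&amp;nbsp;script.&lt;/p&gt;

```python
def clean_review(text):
    # Same rules as above: drop empty lines, swap non-breaking spaces
    lines = [
        line.replace("\xa0", " ")
        for line in text.split("\n")
        if line != ""
    ]
    return "\n".join(lines)


# Made-up input that mimics the shape of a scraped review
sample = "95\xa0points\n\nExample Whiskey,\xa050%\n\nTasting notes go here."
print(clean_review(sample))
```

&lt;p&gt;The three non-empty lines come back joined by single newlines, with ordinary spaces where the non-breaking ones&amp;nbsp;were.&lt;/p&gt;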


&lt;p&gt;This gives us nice clean review snippets in the following&amp;nbsp;format:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real Review (input&amp;nbsp;data)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;95&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Uncle Nearest 1820 Single Barrel 11 year old Tennessee Whiskey (Barrel &lt;span class="caps"&gt;US&lt;/span&gt;-2),&amp;nbsp;55.1%&lt;/p&gt;
&lt;p&gt;Bourbon/Tennessee  |&amp;nbsp;$119&lt;/p&gt;
&lt;p&gt;This is mature whiskey at its most refined: a balance of fruits, nuts, sweetness, and restrained oak. The nose has it all: salted, buttered pecans, rock candy, Dr. Pepper, blackberry jam, dried blueberries, caramel corn, tobacco barn, and old leather chair. It’s practically dessert-like in the mouth, with dark chocolate-covered caramel, candied pecans, Goo Goo Clusters, cherry cola, blackberry and blueberry jam, and a kiss of white pepper. The finish stays consistent, a mouthwatering mélange of caramel, chocolate, and nuts. Harmonious, seamless, and silky—you’d never guess the&amp;nbsp;proof.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Looks (and sounds) good to me! The last step is to save our reviews back to a text file for processing by &lt;span class="caps"&gt;GPT&lt;/span&gt;-2. The model distinguishes between separate pieces of text using a special token, &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt;. So we join all the reviews back together, separated by this token, and save it to a text&amp;nbsp;file:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;data/clean/reviews.txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;w&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;|endoftext|&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviews_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now we&amp;#8217;re ready to&amp;nbsp;train!&lt;/p&gt;
&lt;p&gt;After training the model, I realized that I had only used the &lt;a href="http://whiskyadvocate.com/ratings-reviews/?search=&amp;amp;submit=&amp;amp;brand_id=0&amp;amp;rating=0&amp;amp;price=0&amp;amp;category=0&amp;amp;styles_id=0&amp;amp;issue_id=102"&gt;Summer 2019 Buying Guide&lt;/a&gt; and not the full set of reviews, so there is definitely room for improvement&amp;nbsp;here.&lt;/p&gt;
&lt;p&gt;&lt;a id='fine-tuning'&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Fine-tuning &lt;span class="caps"&gt;GPT&lt;/span&gt;-2&lt;/h2&gt;
&lt;p&gt;Surprisingly, this is actually the easy part, thanks to work by &lt;a href="https://github.com/nshepperd"&gt;nshepperd&lt;/a&gt;. Their &lt;a href="https://github.com/nshepperd/gpt-2"&gt;fork of &lt;span class="caps"&gt;GPT&lt;/span&gt;-2&lt;/a&gt; contains a simple training script for fine-tuning &lt;span class="caps"&gt;GPT&lt;/span&gt;-2, with lots of options. To get started with it, we clone the repo and use the provided script to download the pre-trained model. This model has already been trained on a set of &lt;a href="https://openai.com/blog/better-language-models/#fn1"&gt;8 million web pages&lt;/a&gt;, so it already has a pretty big and diverse vocabulary. We&amp;#8217;ll download the smaller model, called &lt;code&gt;117M&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt; git clone https://github.com/nshepperd/gpt-2.git
&lt;span class="gp"&gt;$&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; gpt-2 
&lt;span class="gp"&gt;$&lt;/span&gt; pip3 install -r requirements.txt 
&lt;span class="gp"&gt;$&lt;/span&gt; python3 download_model.py 117M
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This will save the pre-trained model in the &lt;code&gt;models&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;The set of reviews isn&amp;#8217;t very big (the whole site contains about 4,000, but I only grabbed 125 during this experiment due to the mistake mentioned above), so this will be less like fine-tuning and more like clobbering the model. When we are done, it will know nothing but whiskey reviews. For that reason, we output samples frequently during training, with the expectation that the sweet spot for interesting new reviews will come somewhere before the model is totally&amp;nbsp;overfit.&lt;/p&gt;
&lt;p&gt;With our processed reviews in &lt;code&gt;data/clean/reviews.txt&lt;/code&gt;, we run the training script on our data, setting the &lt;code&gt;PYTHONPATH&lt;/code&gt; variable as indicated in the &lt;a href="https://github.com/nshepperd/gpt-2#fine-tuning-on-custom-datasets"&gt;readme&lt;/a&gt;. We output 30 samples every 25 epochs, each 250 tokens long (long enough that a full review should fit in there). Strictly speaking, the script&amp;#8217;s &lt;code&gt;--sample_every&lt;/code&gt; and &lt;code&gt;--save_every&lt;/code&gt; options count training steps, which is what I loosely call epochs in this post. We also checkpoint the model at the same interval so that we can return to it to generate more samples later if we&amp;nbsp;want:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt; &lt;span class="nv"&gt;PYTHONPATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-2/src python3 gpt-2/train.py &lt;span class="se"&gt;\&lt;/span&gt;
      --dataset data/clean/reviews.txt &lt;span class="se"&gt;\&lt;/span&gt;
      --save_every &lt;span class="m"&gt;25&lt;/span&gt; --sample_every &lt;span class="m"&gt;25&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      --sample_length &lt;span class="m"&gt;250&lt;/span&gt; --sample_num &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The script will create a &lt;code&gt;samples&lt;/code&gt; directory and save the samples there, and also a &lt;code&gt;checkpoints&lt;/code&gt; directory. It runs until you kill it with &lt;code&gt;ctrl+C&lt;/code&gt;, so feel free to go get a&amp;nbsp;coffee.&lt;/p&gt;
&lt;p&gt;&lt;a id='fake-reviews'&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Fake whiskey&amp;nbsp;reviews&lt;/h2&gt;
&lt;p&gt;The first thing to notice about the output is that the generated text is usually really coherent at first glance. After 25 epochs, the model isn&amp;#8217;t really talking about whiskey yet, so we mostly get a peek into the kind of text generation that it&amp;#8217;s capable&amp;nbsp;of:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 25&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We are pleased to present the third of over 1,000 stories published by the New England Journal of Medicine, from all continents. The authors have collected information on over 50,000 individuals in the United States who were examined at the National Heart, Lung and Blood Institute (&lt;span class="caps"&gt;NHBI&lt;/span&gt;) and included information on symptoms and treatments. The research was focused on coronary artery bypass grafting, cardiovascular dysfunction, cardiovascular disease, and coronary artery&amp;nbsp;disease.&lt;/p&gt;
&lt;p&gt;For complete details and results, visit &lt;a href="http://www.nlm.nih.gov/pubmed/257720" style="word-wrap : break-word;"&gt;http://www.nlm.nih.gov/pubmed/257720&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;span class="caps"&gt;URL&lt;/span&gt; doesn&amp;#8217;t work, I checked. This is generated&amp;nbsp;text. &lt;/p&gt;
&lt;p&gt;After 50 epochs, we&amp;#8217;re already seeing some promising output, though there&amp;#8217;s definitely work to&amp;nbsp;do:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 50&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The packaging is as it should be: it looks fine (with more sugar than water), but the nose gets burnt out, which can be remedied with some honey. There were a couple of more hints of red wine and chocolate, but in the nose this gives way to a more complex whisky with plenty of chocolate and dried fruit. On the palate, it&amp;#8217;s just perfect. The nose is quite dry and mouthful, but more like a whisky where, when you pour it into your mouth and add it, you find the sweetness of cherry-chardonnay, fresh orange peel, ginger, and vanilla peel. On the palate, it develops chocolate, vanilla syrup, and nutmeg, along with a light and pleasant hint of dried fruit, then is broken up by chocolate, cherry, and pepper notes. Very well-designed, although this was not finished in&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;1/5&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At 100 epochs, we have a review that at first glance seems&amp;nbsp;reasonable:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 100&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;85&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Black Cherry Ale,&amp;nbsp;90%&lt;/p&gt;
&lt;p&gt;Blended Malt Whisky  |&amp;nbsp;$60&lt;/p&gt;
&lt;p&gt;The nose was very light and crisp, with chocolate, banana candy flavor, honey, espresso, toffee, and herbal aromas. The palate was warm, with vanilla, cinnamon, and chocolate sweetness. The finish was medium in length, with plenty of vanilla and chocolate, and the finish was strong and well-balanced: chocolate, caramel, milk chocolate, milk chocolate, caramel, and sweet&amp;nbsp;almond.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that the model is even mimicking our formatting. This one ticks all the boxes of a review, but check out the tasting notes: vanilla, cinnamon, chocolate, caramel, sweet almond. There&amp;#8217;s definitely a theme: these are baking ingredients! Many food words are used in tasting notes. The model is learning this, but the text it was originally pre-trained on seems to have included many recipes, as all sorts of baking instructions are also being generated. This was another sample from the same&amp;nbsp;batch:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 100&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;1 pint brown&amp;nbsp;sugar&lt;/p&gt;
&lt;p&gt;1 egg,&amp;nbsp;beaten&lt;/p&gt;
&lt;p&gt;1 teaspoon kosher&amp;nbsp;salt&lt;/p&gt;
&lt;p&gt;3 ounces butter,&amp;nbsp;softened&lt;/p&gt;
&lt;p&gt;3 egg whites,&amp;nbsp;beaten&lt;/p&gt;
&lt;p&gt;12 drops honey, at room&amp;nbsp;temperature&lt;/p&gt;
&lt;p&gt;2 tablespoons brown&amp;nbsp;sugar/butter&lt;/p&gt;
&lt;p&gt;3 tablespoons apple cider vinegar, at room&amp;nbsp;temperature&lt;/p&gt;
&lt;p&gt;2 tablespoons powdered&amp;nbsp;pepper&lt;/p&gt;
&lt;p&gt;1/2 teaspoon&amp;nbsp;pepper&lt;/p&gt;
&lt;p&gt;1 tablespoon black&amp;nbsp;pepper&lt;/p&gt;
&lt;p&gt;Directions&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;In a medium bowl, beat butter, 1/2 cup apple cider vinegar, and vanilla until smooth, 2-4 hours. Add in egg whites, and gradually add more egg whites, if needed. (You may need to add more milk and so&amp;nbsp;forth.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Pour milk/soy mixture through an egg grinder; if it is hard, you might have to do it some more. If there isn&amp;#8217;t quite so much water, you&amp;#8217;re going to have to let the milk drain on the stove&amp;nbsp;top.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add water once, once or twice, until all the water is dissolved, about 3-4 hours. Add pepper, egg whites, vanilla, and&amp;nbsp;water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Whisk in egg shells and water; the water gets very sticky; add&amp;nbsp;salt&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;Pour milk/soy mixture through an egg grinder; if it is hard, you might have to do it some more&amp;#8221; isn&amp;#8217;t exactly sensible, but it is definitely coherent and the whole thing reads like a recipe. Just don&amp;#8217;t try it at&amp;nbsp;home. &lt;/p&gt;
&lt;p&gt;After 200 epochs, the model is producing whiskey reviews that I (as a layman) can&amp;#8217;t distinguish from the real&amp;nbsp;thing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 200&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;91&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Luxembourg 60 year old (Batch 2),&amp;nbsp;59%&lt;/p&gt;
&lt;p&gt;Single Malt Scotch  |&amp;nbsp;$235&lt;/p&gt;
&lt;p&gt;Quite simply the most beautiful whisky in Europe at the time. Beautifully balanced spices, herbal oloroso fruits, soft earthy aromas of honey, licorice, lavender, lavender oil, and a hint of espresso. The palate is rich and richly spiced, with dark tannins, orange candy, green bananas, almond milk, espresso, brown sugar, and cream cheese louis. Floral notes and honeycomb in the finish. (12,800 bottles for&amp;nbsp;U.S.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;However, it is also revealing one of &lt;span class="caps"&gt;GPT&lt;/span&gt;-2&amp;#8217;s flaws: it has a tendency to abruptly switch context, which can be quite comical. From the same&amp;nbsp;batch:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 200&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A group of scientists led by Professor Jonny Wilkinson has found that the planet Saturn has the highest amount of formaldehyde on Earth. The concentration falls short, with only slightly more than 100 parts per million of the general background. It is 80 parts per million in the form of sisphenol A (s&amp;lt;3 parts per million&amp;lt;30 parts per million): this is also found in less-drying water like water and soda water. The taste is more palatable, with more of both honey and orange, then chocolate hazelnut, vanilla pod, chocolate chip cookies, caramel, hazelnut, coconut, and a sprinkling of orange oil. There are also more dark chocolate aromas, which are sweeter and with some spice and nutmeg addition. One might expect there to be more chocolate, but, on closer inspection, it&amp;#8217;s more fruitcake and chocolate barristers will ultimately find it&amp;nbsp;all.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;and:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;: 200&amp;nbsp;Epochs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;90&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Lemon Keg,&amp;nbsp;57%&lt;/p&gt;
&lt;p&gt;Japanese  |&amp;nbsp;$45&lt;/p&gt;
&lt;p&gt;The standard edition features a black leather jacket with leather cuffs and white floral print buttons. The back has a floral feel, but is made from milk chocolate; the outside shell lacks much citrus or sweetness, which ultimately leads to a light taupe hue and ample&amp;nbsp;mouth-drawing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is also what I meant by clobbering the model: it is forgetting how to remain in other contexts. No matter where it starts, it is beginning to see everything as a whiskey review. By this point most (but not all) of the samples are also formatted like our whiskey reviews. This distinctive formatting proved to be a very useful tool in gauging to what extent we were overfitting the&amp;nbsp;model.&lt;/p&gt;
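&lt;p&gt;The post gauges this by eye, but the check could also be automated. Here is a rough sketch, assuming a batch of samples is available as a list of strings and that &amp;#8220;formatted like a review&amp;#8221; simply means the first non-empty line is a score such as &lt;code&gt;90 points&lt;/code&gt;. The helper names and the two sample strings are hypothetical and not part of the training&amp;nbsp;script.&lt;/p&gt;

```python
import re

# A review header is a line like "90 points"
HEADER = re.compile(r"^\d{1,3}\s+points\s*$")


def looks_like_review(sample):
    # Look only at the first non-empty line of the sample
    lines = [line for line in sample.strip().split("\n") if line.strip()]
    return bool(lines) and HEADER.match(lines[0].strip()) is not None


def review_fraction(samples):
    # Share of samples that start with a review-style header
    if not samples:
        return 0.0
    return sum(looks_like_review(s) for s in samples) / len(samples)


samples = [
    "90 points\nLemon Keg, 57%\nJapanese | $45\nTasting notes...",
    "A group of scientists has found that...",
]
print(review_fraction(samples))  # 0.5
```

&lt;p&gt;Tracking this fraction per checkpoint would give a crude signal of how far the model has drifted from general text toward nothing but whiskey&amp;nbsp;reviews.&lt;/p&gt;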
&lt;p&gt;By about 250 epochs, most of the output is pretty believable, with the only consistent flaw being repeated flavors in the tasting notes. All in all, I was surprised both by how easy the open source community made this experiment and by how quickly and accurately the model started producing reasonable-sounding reviews (training only took about a minute per epoch on my old-ish MacBook&amp;nbsp;Pro).&lt;/p&gt;
&lt;p&gt;This post is getting long enough, so I&amp;#8217;ll end with some highlights and&amp;nbsp;bloopers:&lt;/p&gt;
&lt;p&gt;&lt;a id='highlights'&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Highlights&lt;/h2&gt;
&lt;p&gt;Consistency in country of origin (Canadian) and location of distillery&amp;nbsp;(Banff):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;87&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Millstone 92 Rye,&amp;nbsp;46%&lt;/p&gt;
&lt;p&gt;Canadian  |&amp;nbsp;$99&lt;/p&gt;
&lt;p&gt;The second installment in our four-part rye limited-edition comparison, this proof-bodied reissue from the Millstone Works facility in Banff presents both dry and wet grasses with great maturity. The nose shifts to vanilla, honey, cinnamon, orange oil, licorice, and crystalized halvah. The palate is light and delicate, with nuttiness, cloves, green apple, bitter chocolate, pecan pie, caramel, and bittersweet chocolate. The finish yields leather, honey, roasted nuts, orange, dried citrus, subtle oak, and dried coriander, all evocative of a wartime log. Editors&amp;#8217;&amp;nbsp;Choice&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sweet and&amp;nbsp;smooth:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;92&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Timersley&amp;#8217;s Barrel-Aged Bourbon,&amp;nbsp;44%&lt;/p&gt;
&lt;p&gt;Blended Scotch Whiskey  |&amp;nbsp;$30&lt;/p&gt;
&lt;p&gt;Sweet potato, strawberry jam, milk chocolate, candies, and brown sugar flavors on the nose, along with smoky bowers &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; sounders, all wrapped tightly in a viscous, creamy feel. The palate is smooth and gentle, delivering gentle, medium-bodied whiskeys with their own distinctive flavors. A light, nimble palate serves enough sweetness without too much rancor that it becomes sultry, with hot chocolate, strawberry jam, milk chocolate, plum, and&amp;nbsp;cassis.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a id='bloopers'&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Bloopers&lt;/h2&gt;
&lt;p&gt;My absolute&amp;nbsp;favorite:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A smallpox-infected grain, this first-generation proof has a floral, sweet nose laced with blueberry jam, cherry Cask Whisky, toffee, cedar, and tangy oak. It is fragrant and flavorful beyond any description—like a symphony show from before the age of&amp;nbsp;antibiotics.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Whiskey, cigars, or promiscuous McDonald&amp;#8217;s&amp;nbsp;enthusiasts?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Twip | 18k Batch | Filch |  |&amp;nbsp;$75&lt;/p&gt;
&lt;p&gt;Makes a delicious mixing bowl of Chocolate Fudge, Big Mac, Ham, Sweet and Sour Apple Cider, Peanuts, and Sour Cream, with a fondness of making out with other people. The caramel is so strong, in fact, that I could only inhale the delicious caramel trails in my cigar box. The only hint I have of burnt sugar and caramel are lost along the way. Sweet and fruity, at least for the $75 price&amp;nbsp;point.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Confusion between whiskey and&amp;#8230; pop&amp;nbsp;metal?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;the Chinese pop-metal king Ping&amp;#8217;i Minbar began producing small batches for his mouth. Now in its 40s, with a nose of chocolate flan, hedonian, and bubble gum, it ends on sweet and savory rye, with vanilla, orange, and clove. It’s a reprise of the year loaded with floral notes, tangerine, cinnamon, black pepper, and orange soda pop. Floral, earthy, and smoky throughout, this period’s reasonably priced—$120—but well-balanced to give (or withhold)&amp;nbsp;indulgence.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Peanut butter jelly time&lt;a href="https://www.youtube.com/watch?v=s8MDNFaGfT4"&gt;:&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;61&amp;nbsp;points&lt;/p&gt;
&lt;p&gt;Chicken Pot Pie with Peanuts, Hot Chocolate, and More Peanuts,&amp;nbsp;50/50&lt;/p&gt;
&lt;p&gt;Single Malt Scotch Whisky  |&amp;nbsp;$35&lt;/p&gt;
&lt;p&gt;Measurable, with portmoking salt water, toasted nuts, caramel, and spices in a thin blanket of peanut butter and jelly, but the panning process was not without its pique. The peanuts and the generous whiskily cooked nuts create an intriguing, if somewhat overcomplicated, vivacious ode to peanut butter and jelly, with sweet peanut butter and jelly, along the&amp;nbsp;way.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What would this even taste&amp;nbsp;like?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 &lt;span class="caps"&gt;SAMPLE&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Bourbon/Tennessee  |&amp;nbsp;$85&lt;/p&gt;
&lt;p&gt;Tastes like a young William Shatner novel: rich with flavor, packed with fruit, honey, orange, maple syrup, and aromas of strawberry, licorice, and chocolate, balanced against a lingering tannin of the herbaceous nectar source. The palate is luxurious, richly notes a childhood&amp;nbsp;home.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; I have kept the generated reviews true to their original formatting and content, with the exception of removing the &amp;#8220;Reviewed By&amp;#8221; line that WhiskyAdvocate have at the end of their reviews. The model very quickly learned to use the real names of the reviewers, and I didn&amp;#8217;t want a real person being accused of describing a whiskey as&amp;nbsp;&amp;#8220;smallpox-infected&amp;#8221;.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href="http://whiskyadvocate.com/"&gt;WhiskyAdvocate&lt;/a&gt; for agreeing to let me use their reviews for this&amp;nbsp;post!&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="deep learning"></category><category term="natural language"></category></entry><entry><title>Redirect standard out to Python’s logging module with contextlib</title><link href="https://johnpaton.net/posts/redirect-logging/" rel="alternate"></link><published>2019-05-22T17:00:00-01:00</published><updated>2019-05-22T17:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2019-05-22:/posts/redirect-logging/</id><summary type="html">&lt;p&gt;Python&amp;#8217;s logging functionality is very nice once you get the hang of it, but many people either disagree or don&amp;#8217;t bother to use it. Learn how to redirect other people&amp;#8217;s pesky print statements into your nice logging&amp;nbsp;setup.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Logging" src="/images/logs_cover.jpg"&gt;
&lt;div style="text-align:center"&gt;&lt;small&gt;&lt;a href="https://en.wikipedia.org/wiki/Logging"&gt;Logging&lt;/a&gt; on Wikipedia&lt;/small&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Python&amp;#8217;s built-in logging functionality is very nice once you get the hang of it, but many people (including library developers) either disagree or don&amp;#8217;t bother to use it. Instead, you often see things like&amp;nbsp;this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Epoch 0&lt;/span&gt;
&lt;span class="go"&gt;Epoch 1&lt;/span&gt;
&lt;span class="go"&gt;Epoch 2&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;and onwards for 100 lines or&amp;nbsp;whatever. &lt;/p&gt;
&lt;p&gt;Printing status messages to standard out is &lt;em&gt;okay&lt;/em&gt;, but if you want anything like consistent/parseable logs, log level handling, logging to multiple locations, etc., this can get a bit annoying. How can we redirect standard out to Python&amp;#8217;s logging system to get all these juicy&amp;nbsp;benefits?&lt;/p&gt;
&lt;p&gt;Another built-in module, &lt;code&gt;contextlib&lt;/code&gt;, comes to our&amp;nbsp;rescue.&lt;/p&gt;
&lt;h2&gt;Aside: Context&amp;nbsp;Managers&lt;/h2&gt;
&lt;p&gt;A &lt;a href="https://docs.python.org/3/reference/datamodel.html#with-statement-context-managers"&gt;context manager&lt;/a&gt; in Python is basically an object that implements two methods: &lt;code&gt;__enter__&lt;/code&gt; and &lt;code&gt;__exit__&lt;/code&gt;. Usually you enter the context by using the &lt;code&gt;with&lt;/code&gt; keyword, which triggers a call to &lt;code&gt;__enter__&lt;/code&gt;. You execute some code in an indented block, and (roughly speaking) when you exit the block, the &lt;code&gt;__exit__&lt;/code&gt; method is called with information about any exceptions that occurred. The context manager interface is what makes these two pieces of code more or less&amp;nbsp;equivalent:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;file.txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;is basically the same&amp;nbsp;as&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;file.txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The file handle returned by &lt;code&gt;open&lt;/code&gt; implements both &lt;code&gt;__enter__&lt;/code&gt; and &lt;code&gt;__exit__&lt;/code&gt;, which is why it can be used in this way. The nice benefit here is that we never forget to close the file, as &lt;code&gt;__exit__&lt;/code&gt; (which closes the file in this case) is called automagically at the end of the&amp;nbsp;block.&lt;/p&gt;
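&lt;p&gt;To make the calling order concrete, here is a minimal sketch of a hand-written context manager. The &lt;code&gt;Tracker&lt;/code&gt; class and its &lt;code&gt;events&lt;/code&gt; list are made up for this illustration; they only record when each method runs.&lt;/p&gt;

```python
class Tracker:
    """Toy context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this is the object bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        # exc_type is None when the block finished without an exception
        self.events.append("exit")
        return False  # a falsy return value lets any exception propagate


with Tracker() as t:
    t.events.append("body")

print(t.events)  # ['enter', 'body', 'exit']
```

&lt;p&gt;Returning a falsy value from &lt;code&gt;__exit__&lt;/code&gt; is the usual choice for a manager that only cleans up, since it never hides an error raised inside the block.&lt;/p&gt;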
&lt;h2&gt;Redirecting &lt;code&gt;stdout&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;So how does this help us? Well, &lt;code&gt;contextlib&lt;/code&gt; (which is the library for things related to context managers, in case you didn&amp;#8217;t catch that) has a very handy context manager called &lt;code&gt;redirect_stdout&lt;/code&gt; (and also the matching &lt;code&gt;redirect_stderr&lt;/code&gt;). We can use it to redirect text written to standard out to another file-like object. For example, to write to a file, we could&amp;nbsp;do:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;contextlib&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;output.txt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;w&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redirect_stdout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Hello world!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;We don&amp;#8217;t see any output. If we now look at the newly-created &lt;code&gt;output.txt&lt;/code&gt;, we will&amp;nbsp;see&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt; cat output.txt
&lt;span class="go"&gt;Hello world!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I realize that there are better ways of writing to a file in Python, but writing to a file isn&amp;#8217;t the point. The point is that any arbitrary code executed in the context block above would have its output redirected to the file, &lt;em&gt;without&lt;/em&gt; having to modify its print statements, including library code that we can&amp;#8217;t easily&amp;nbsp;modify. &lt;/p&gt;
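&lt;p&gt;As a small sketch of that idea, the snippet below captures the output of a function whose print statements we pretend we cannot touch. The &lt;code&gt;noisy&lt;/code&gt; function is a hypothetical stand-in for library code, and an in-memory &lt;code&gt;io.StringIO&lt;/code&gt; buffer takes the place of the file.&lt;/p&gt;

```python
import contextlib
import io


def noisy():
    # hypothetical stand-in for library code that prints its progress
    print("Epoch 0")
    print("Epoch 1")


buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    noisy()  # nothing shows up on the console here

captured = buffer.getvalue()
print(repr(captured))  # 'Epoch 0\nEpoch 1\n'
```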
&lt;p&gt;Now, &lt;code&gt;redirect_stdout&lt;/code&gt; requires a file-like object that it can &lt;code&gt;write&lt;/code&gt; to, which is why we first had to &lt;code&gt;open&lt;/code&gt; our output file above. However, Python&amp;#8217;s loggers are not file-like. I think you see where this is&amp;nbsp;going.&lt;/p&gt;
&lt;h2&gt;A file-like&amp;nbsp;logger&lt;/h2&gt;
&lt;p&gt;To redirect standard out to &lt;code&gt;logging&lt;/code&gt;, we write a simple class that implements the &lt;code&gt;write&lt;/code&gt; method, and passes everything that is written to the logger of our choice. We&amp;#8217;ll also add a &lt;code&gt;flush&lt;/code&gt; method that doesn&amp;#8217;t do anything, just to avoid exceptions in case some code tries to use it. We can specify the desired logger by name, since &lt;code&gt;logging.getLogger&lt;/code&gt; always returns the same &lt;a href="https://docs.python.org/3/library/logging.html#logger-objects"&gt;logger object&lt;/a&gt; for a given name (loggers are effectively per-name singletons).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;contextlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputLogger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;root&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;INFO&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isspace&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;I added a quick check that skips empty and whitespace-only messages, since I don&amp;#8217;t want blank lines in my&amp;nbsp;logging.&lt;/p&gt;
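&lt;p&gt;The reason such messages show up at all is that, at least on CPython, &lt;code&gt;print&lt;/code&gt; passes the text and the line ending to &lt;code&gt;write&lt;/code&gt; as separate calls. A rough way to see this is with a throwaway object that only records what it receives (the &lt;code&gt;Recorder&lt;/code&gt; class is invented for this sketch):&lt;/p&gt;

```python
import contextlib


class Recorder:
    """File-like object that stores every string passed to write()."""

    def __init__(self):
        self.calls = []

    def write(self, msg):
        self.calls.append(msg)

    def flush(self):
        pass


rec = Recorder()
with contextlib.redirect_stdout(rec):
    print("Hello logging!")

print(rec.calls)  # ['Hello logging!', '\n'] on CPython
```

&lt;p&gt;Without the whitespace check, the second call would produce a log record containing nothing but a newline.&lt;/p&gt;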
&lt;p&gt;This is already enough to pass our object to &lt;code&gt;redirect_stdout&lt;/code&gt; and thereby redirect standard out to &lt;code&gt;logging&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redirect_stdout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OutputLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;my_logger&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;INFO&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Hello logging!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="go"&gt;INFO:my_logger:Hello logging!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Note that we had to minimally call &lt;code&gt;logging.basicConfig()&lt;/code&gt; to get a bit of formatting on the logs and set the log level to at least the level we selected for our redirector (&lt;code&gt;INFO&lt;/code&gt;). Since our class is just functioning as a redirector for messages, we leave ourselves free to configure the logger however we want elsewhere in the application (check out the &lt;a href="https://docs.python.org/3/howto/logging-cookbook.html"&gt;Python logging cookbook&lt;/a&gt; for&amp;nbsp;tips).&lt;/p&gt;
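&lt;p&gt;As one possible example of configuring the logger elsewhere, the sketch below attaches a handler with a custom format to a named logger and then redirects a print statement into it. The logger name &lt;code&gt;training&lt;/code&gt;, the format string, and the in-memory stream are assumptions chosen for the illustration; the redirector class is repeated only so the snippet runs on its own.&lt;/p&gt;

```python
import contextlib
import io
import logging


class OutputLogger:  # same redirector as above, repeated to stay runnable
    def __init__(self, name="root", level="INFO"):
        self.logger = logging.getLogger(name)
        self.level = getattr(logging, level)

    def write(self, msg):
        if msg and not msg.isspace():
            self.logger.log(self.level, msg)

    def flush(self):
        pass


# Application-side configuration, independent of the redirector
stream = io.StringIO()  # stands in for a log file or other destination
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s | %(name)s | %(message)s"))

logger = logging.getLogger("training")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the record out of the root logger

with contextlib.redirect_stdout(OutputLogger("training", "INFO")):
    print("Epoch 0")

print(stream.getvalue())  # INFO | training | Epoch 0
```

&lt;p&gt;Note that the handler writes to its own stream rather than to &lt;code&gt;sys.stdout&lt;/code&gt;. A handler that wrote to the redirected standard out would feed its output straight back into the redirector.&lt;/p&gt;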
&lt;h2&gt;Baby&amp;#8217;s first context&amp;nbsp;manager&lt;/h2&gt;
&lt;p&gt;This implementation is already functional, but it&amp;#8217;s a bit verbose. Since we&amp;#8217;re already in the &lt;code&gt;contextlib&lt;/code&gt; realm we might as well just make our object into a context manager itself, eliminating the need to call &lt;code&gt;contextlib.redirect_stdout&lt;/code&gt; directly every time we want to use it. To do so, we add a new attribute called &lt;code&gt;_redirector&lt;/code&gt; to our class, which is an instance of &lt;code&gt;redirect_stdout&lt;/code&gt; with &lt;code&gt;self&lt;/code&gt; as the redirect destination. Then our &lt;code&gt;__enter__&lt;/code&gt; and &lt;code&gt;__exit__&lt;/code&gt; methods just call the matching methods of our &lt;code&gt;_redirector&lt;/code&gt;, ensuring that everything printed in our context will get redirected to our own &lt;code&gt;write&lt;/code&gt; method (which in turn passes messages to our &lt;code&gt;logger&lt;/code&gt;). Our implementation&amp;nbsp;becomes:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;contextlib&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutputLogger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;root&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;INFO&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redirector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redirect_stdout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isspace&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redirector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__enter__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="fm"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# let contextlib do any exception handling here&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redirector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="fm"&gt;__exit__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now we&amp;#8217;re cooking! We can use &lt;code&gt;OutputLogger&lt;/code&gt; as a context manager, and since we returned &lt;code&gt;self&lt;/code&gt; from &lt;code&gt;__enter__&lt;/code&gt; we can even reuse the same instance later by giving it a name using the &lt;code&gt;as&lt;/code&gt; keyword:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Normal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Normal&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;OutputLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;my_logger&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;WARNING&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;redirector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Logged!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="go"&gt;WARNING:my_logger:Logged!&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Back to normal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="go"&gt;Back to normal&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;redirector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="gp"&gt;... &lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Logged again!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;...&lt;/span&gt;
&lt;span class="go"&gt;WARNING:my_logger:Logged again!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;There are all sorts of extensions possible here: redirecting standard error to another log level, making sure that changes to &lt;code&gt;OutputLogger.name&lt;/code&gt; and &lt;code&gt;OutputLogger.level&lt;/code&gt; get applied to the underlying logger properly, input checking on the log level string, etc. But this is enough to get started and will work as a quick and relatively clean way to capture the output from some other code and redirect it to your application&amp;#8217;s logging&amp;nbsp;system.&lt;/p&gt;
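&lt;p&gt;Two of those extensions could look roughly like the sketch below: a helper that validates the level string, and a second redirector that sends standard error to a different level. The names &lt;code&gt;resolve_level&lt;/code&gt;, &lt;code&gt;StreamToLogger&lt;/code&gt;, and &lt;code&gt;ListHandler&lt;/code&gt; are hypothetical, and this is only one way of doing it.&lt;/p&gt;

```python
import contextlib
import logging
import sys


def resolve_level(level):
    """Turn a level name such as "info" into its numeric value, or raise."""
    value = getattr(logging, str(level).upper(), None)
    if not isinstance(value, int):
        raise ValueError("unknown log level: " + repr(level))
    return value


class StreamToLogger:  # hypothetical name; same idea as OutputLogger
    def __init__(self, name, level):
        self.logger = logging.getLogger(name)
        self.level = resolve_level(level)

    def write(self, msg):
        if msg and not msg.isspace():
            self.logger.log(self.level, msg.rstrip())

    def flush(self):
        pass


class ListHandler(logging.Handler):
    """Collects (level name, message) pairs so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.levelname, record.getMessage()))


handler = ListHandler()
logger = logging.getLogger("captured")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

out = StreamToLogger("captured", "info")
err = StreamToLogger("captured", "error")
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    print("to stdout")
    print("to stderr", file=sys.stderr)

print(handler.records)  # [('INFO', 'to stdout'), ('ERROR', 'to stderr')]
```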
&lt;h2&gt;Disclaimer&lt;/h2&gt;
&lt;p&gt;Though &lt;code&gt;contextlib.redirect_stdout&lt;/code&gt; is built in to Python, it does redefine &lt;code&gt;sys.stdout&lt;/code&gt; for your whole process while the execution is within its context. For this reason, it can have unintended results for other pieces of code that are trying to do fancier stuff with &lt;code&gt;sys.stdout&lt;/code&gt; than just write to it. This is a solution if you are writing an application that just needs to get something done, but if you are writing a library that other people might use, it&amp;#8217;s best not to mess with these system properties without being very explicit about it. As always: Just because you can, doesn&amp;#8217;t mean you&amp;nbsp;should!&lt;/p&gt;
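&lt;p&gt;A quick way to convince yourself of the process-wide effect is to compare &lt;code&gt;sys.stdout&lt;/code&gt; inside and outside the block. This is only a sketch; the variable names are arbitrary.&lt;/p&gt;

```python
import contextlib
import io
import sys

original = sys.stdout
buffer = io.StringIO()

with contextlib.redirect_stdout(buffer):
    # every piece of code in this process now sees the buffer as sys.stdout
    replaced = sys.stdout is buffer

restored = sys.stdout is original
print(replaced, restored)  # True True
```

&lt;p&gt;Because only the Python-level &lt;code&gt;sys.stdout&lt;/code&gt; object is swapped, output from subprocesses is not affected by the redirect.&lt;/p&gt;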
&lt;p&gt;&lt;small&gt;&lt;small&gt;Cover image by &lt;a rel="nofollow" class="external text" href="https://flickr.com/people/7787236@N07"&gt;Greenpeace Finland&lt;/a&gt; - originally posted to &lt;a href="//commons.wikimedia.org/wiki/Flickr" class="mw-redirect" title="Flickr"&gt;Flickr&lt;/a&gt; as &lt;a rel="nofollow" class="external text" href="https://flickr.com/photos/7787236@N07/3227543977"&gt;Logging in Finnish Lapland: ancient trees for pulp and paper&lt;/a&gt;, &lt;a href="https://creativecommons.org/licenses/by/2.0" title="Creative Commons Attribution 2.0"&gt;&lt;span class="caps"&gt;CC&lt;/span&gt; &lt;span class="caps"&gt;BY&lt;/span&gt; 2.0&lt;/a&gt;, &lt;a href="https://commons.wikimedia.org/w/index.php?curid=11805001"&gt;Link&lt;/a&gt;&lt;/small&gt;&lt;/small&gt;&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="snippets"></category></entry><entry><title>Propagate time series events with Pandas merge_asof</title><link href="https://johnpaton.net/posts/propagate-time-series-events-pandas/" rel="alternate"></link><published>2019-04-13T16:00:00-01:00</published><updated>2019-04-13T16:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2019-04-13:/posts/propagate-time-series-events-pandas/</id><summary type="html">&lt;p&gt;I recently discovered that Pandas has a function to propagate time series events forward (or backward) in time across a DataFrame. Here&amp;#8217;s how it&amp;nbsp;works.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Imagine the following situation: you have two tables, one with event logs, and the other with status changes. To make this concrete, let&amp;#8217;s take our events to be purchases by our users, and the status changes to be when the users attained a Gold Membership (wow so shiny). 
You want to use their membership status &lt;em&gt;at the time of each purchase&lt;/em&gt; as a feature for some model, but you only have records of when the status &lt;em&gt;changed&lt;/em&gt;, so you can&amp;#8217;t just naively join the&amp;nbsp;tables.&lt;/p&gt;
&lt;p&gt;After inheriting some code that was performing this operation manually using &lt;code&gt;groupby&lt;/code&gt; operations and fancy indexing, I decided to check if Pandas had a built-in function for it, and I was pleasantly surprised: &lt;a href="https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.merge_asof.html"&gt;&lt;code&gt;pandas.merge_asof&lt;/code&gt;&lt;/a&gt;. Let&amp;#8217;s check out how it&amp;nbsp;works.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="c1"&gt;# https://johnpaton.net/posts/custom-color-schemes-in-matplotlib/&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;~/johnpaton.mplstyle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;We&amp;#8217;ll begin by generating some fake sales data. We have three users, Alice, Bob, and Charlie, who made purchases over the last year. Sorry about the verbosity; there doesn&amp;#8217;t seem to be an easier way to generate random dates in a&amp;nbsp;range.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;n_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utcfromtimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;
&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;

&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;alice&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bob&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;charlie&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;amount&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_sales&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;user&lt;/th&gt;
      &lt;th&gt;amount&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2018-04-14&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;74&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2018-05-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;61&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2018-05-17&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;2018-06-25&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;71&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;2018-06-30&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;2018-07-04&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;2018-09-22&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;2018-11-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;2018-11-18&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;57&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;2018-11-21&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;37&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;10&lt;/th&gt;
      &lt;td&gt;2018-11-27&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;42&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;11&lt;/th&gt;
      &lt;td&gt;2019-01-19&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;12&lt;/th&gt;
      &lt;td&gt;2019-01-22&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;46&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;13&lt;/th&gt;
      &lt;td&gt;2019-03-17&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;14&lt;/th&gt;
      &lt;td&gt;2019-04-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;77&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
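&lt;p&gt;For what it&amp;#8217;s worth, one possible alternative for the random dates is to add random day offsets to a start &lt;code&gt;Timestamp&lt;/code&gt;. This is only a sketch: the end date, the seed, and the variable names are assumptions made up for the example, and it is not necessarily&amp;nbsp;shorter.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # arbitrary seed, only for reproducibility
end = pd.Timestamp("2019-04-13")      # hypothetical end of the date range
start = end - pd.Timedelta(days=365)

# Draw whole-day offsets in [0, 365) and add them to the start date.
offsets = pd.to_timedelta(rng.integers(0, 365, size=15), unit="D")
random_dates = (start + offsets).sort_values()
```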

&lt;p&gt;Next up, we need the records of when each of them attained their coveted Gold&amp;nbsp;Membership:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;D&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gold_member&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_flags&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;user&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;status&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;2018-04-18&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;2018-06-09&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;2018-09-24&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Now it&amp;#8217;s time to do our merge. We call &lt;code&gt;pd.merge_asof&lt;/code&gt;, specifying the following&amp;nbsp;arguments:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;left&lt;/code&gt; and &lt;code&gt;right&lt;/code&gt; DataFrames&lt;/li&gt;
&lt;li&gt;The column to merge &lt;code&gt;on&lt;/code&gt; (this is generally a time column or some other ordering&amp;nbsp;field)&lt;/li&gt;
&lt;li&gt;The column(s) to group &lt;code&gt;by&lt;/code&gt;. The &lt;code&gt;by&lt;/code&gt; keyword is very nice, as we can use it to make sure that the status changes only get propagated across the correct users&amp;#8217; purchases. Otherwise we would have to ensure this&amp;nbsp;manually.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge_asof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;df_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_merged&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;user&lt;/th&gt;
      &lt;th&gt;amount&lt;/th&gt;
      &lt;th&gt;status&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2018-04-14&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;74&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2018-05-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;61&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2018-05-17&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;2018-06-25&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;71&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;2018-06-30&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;2018-07-04&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;2018-09-22&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;2018-11-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;2018-11-18&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;57&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;2018-11-21&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;37&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;10&lt;/th&gt;
      &lt;td&gt;2018-11-27&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;42&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;11&lt;/th&gt;
      &lt;td&gt;2019-01-19&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;12&lt;/th&gt;
      &lt;td&gt;2019-01-22&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;46&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;13&lt;/th&gt;
      &lt;td&gt;2019-03-17&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;14&lt;/th&gt;
      &lt;td&gt;2019-04-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;77&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;The result is kind of like a left join, in that all &amp;#8220;matching&amp;#8221; rows are filled and unmatched rows are &lt;code&gt;NaN&lt;/code&gt;. However, unlike a left join (which looks for an exact match), here each row on the left is paired with the most recent row on the right whose &lt;code&gt;timestamp&lt;/code&gt; falls &lt;em&gt;at or before&lt;/em&gt; its own, within the same &lt;code&gt;by&lt;/code&gt; group. To ensure that this before/after comparison is possible, both DataFrames must be sorted by the &lt;code&gt;on&lt;/code&gt; column.&lt;/p&gt;
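&lt;p&gt;A tiny sketch of that matching rule, using made-up frames (&lt;code&gt;left&lt;/code&gt;, &lt;code&gt;right&lt;/code&gt;, and the column &lt;code&gt;t&lt;/code&gt; are hypothetical). By default a sale on the same timestamp as the status change counts as a match; passing &lt;code&gt;allow_exact_matches=False&lt;/code&gt; makes the comparison strictly&amp;nbsp;&amp;#8220;after&amp;#8221;.&lt;/p&gt;

```python
import pandas as pd

left = pd.DataFrame({
    "t": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-10"]),
    "sale": [1, 2, 3],
})
right = pd.DataFrame({
    "t": pd.to_datetime(["2019-01-05"]),
    "status": ["gold_member"],
})

# Default behaviour: look backward in time, exact matches allowed.
default = pd.merge_asof(left, right, on="t")

# Same merge, but a right row on the exact same timestamp no longer matches.
strict = pd.merge_asof(left, right, on="t", allow_exact_matches=False)
```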
&lt;p&gt;To create our Gold Membership flag, we just replace the name of the event with a &lt;code&gt;1&lt;/code&gt; and fill the &lt;code&gt;NaN&lt;/code&gt;s (which are rows that came before the status change) with &lt;code&gt;0&lt;/code&gt;s.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_merged&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gold_member&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_merged&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;\
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gold_member&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_merged&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;user&lt;/th&gt;
      &lt;th&gt;amount&lt;/th&gt;
      &lt;th&gt;status&lt;/th&gt;
      &lt;th&gt;gold_member&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2018-04-14&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;74&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2018-05-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;61&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2018-05-17&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;2018-06-25&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;71&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;2018-06-30&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;2018-07-04&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;2018-09-22&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;2018-11-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;2018-11-18&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;57&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;2018-11-21&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;37&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;10&lt;/th&gt;
      &lt;td&gt;2018-11-27&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;42&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;11&lt;/th&gt;
      &lt;td&gt;2019-01-19&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;12&lt;/th&gt;
      &lt;td&gt;2019-01-22&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;46&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;13&lt;/th&gt;
      &lt;td&gt;2019-03-17&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;14&lt;/th&gt;
      &lt;td&gt;2019-04-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;77&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
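&lt;p&gt;As an aside, here is a possible alternative for building the flag: compare the column to the status name directly. The &lt;code&gt;status&lt;/code&gt; Series below is a stand-in invented for the example. The comparison treats &lt;code&gt;NaN&lt;/code&gt; as &amp;#8220;not gold&amp;#8221;, and it also copes with any other status value that might show up in the&amp;nbsp;column.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Stand-in for df_merged['status']; the values are invented for this sketch.
status = pd.Series(["gold_member", np.nan, "normal", "gold_member"])

# True only where the status is exactly 'gold_member'; NaN compares as False.
gold_flag = status.eq("gold_member").astype(int)
```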

&lt;p&gt;To make sure there&amp;#8217;s no funny business going on, we can also visualize exactly what&amp;nbsp;happened:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;facecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;white&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df_tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_merged&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df_merged&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;df_tmp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;gold_member&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_flags&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C{i}&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot; {row[&amp;#39;user&amp;#39;]} gets gold&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;regular&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gold&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Membership status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sale date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sales by membership status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;img alt="png" src="/images/propagate_12_1.png"&gt;&lt;/p&gt;
&lt;p&gt;So now we have a nice feature that can be used in a training set, without any data leakage (no events from the future are visible before they happened in the training&amp;nbsp;set).&lt;/p&gt;
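&lt;p&gt;If you want to check the no-leakage claim programmatically rather than by eye, one option is to carry a copy of the status-change time through the merge and compare it to the sale time. This is a sketch on small invented frames; the &lt;code&gt;flag_time&lt;/code&gt; column is a helper introduced only for the&amp;nbsp;check.&lt;/p&gt;

```python
import pandas as pd

sales = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-05-02", "2018-06-25", "2018-11-02"]),
    "user": ["alice", "bob", "alice"],
})
flags = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-06-09", "2018-09-24"]),
    "user": ["bob", "alice"],
    "status": "gold_member",
})

# Keep a copy of the status-change time so it survives the merge.
flags["flag_time"] = flags["timestamp"]
merged = pd.merge_asof(sales, flags, on="timestamp", by="user")

# Every matched sale must have happened at or after its status change.
matched = merged.dropna(subset=["status"])
no_leakage = bool(matched["flag_time"].le(matched["timestamp"]).all())
```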
&lt;p&gt;This also works for multiple status changes. For example, say Bob got demoted on February 1st back to being a normal&amp;nbsp;user.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;df_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;bob&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;status&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;normal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;2019-02-01&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df_flags&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;status&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;user&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;2018-04-18&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;2018-06-09&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
      &lt;td&gt;2018-09-24&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;normal&lt;/td&gt;
      &lt;td&gt;2019-02-01&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;If we perform the &lt;code&gt;merge_asof&lt;/code&gt; again, we see that Bob&amp;#8217;s status changes twice, just as we would&amp;nbsp;expect:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge_asof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;df_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;timestamp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;user&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;user&lt;/th&gt;
      &lt;th&gt;amount&lt;/th&gt;
      &lt;th&gt;status&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2018-04-14&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;74&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2018-05-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;61&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2018-05-17&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;2018-06-25&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;71&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;2018-06-30&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;2018-07-04&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;2018-09-22&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;2018-11-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;2018-11-18&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;57&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;2018-11-21&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;37&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;10&lt;/th&gt;
      &lt;td&gt;2018-11-27&lt;/td&gt;
      &lt;td&gt;charlie&lt;/td&gt;
      &lt;td&gt;42&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;11&lt;/th&gt;
      &lt;td&gt;2019-01-19&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;12&lt;/th&gt;
      &lt;td&gt;2019-01-22&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;46&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;13&lt;/th&gt;
      &lt;td&gt;2019-03-17&lt;/td&gt;
      &lt;td&gt;bob&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;normal&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;14&lt;/th&gt;
      &lt;td&gt;2019-04-02&lt;/td&gt;
      &lt;td&gt;alice&lt;/td&gt;
      &lt;td&gt;77&lt;/td&gt;
      &lt;td&gt;gold_member&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Finally, these have all been examples of so-called &amp;#8220;backwards searches&amp;#8221;. From the &lt;a href="https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.merge_asof.html"&gt;Pandas docs&lt;/a&gt;: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A &amp;#8220;backward&amp;#8221; search selects the last row in the right DataFrame whose &lt;code&gt;on&lt;/code&gt; key is less than or equal to the left&amp;#8217;s&amp;nbsp;key.&lt;/p&gt;
&lt;/blockquote&gt;
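&lt;p&gt;As a minimal sketch of how the search direction changes the result, the snippet below runs &lt;code&gt;merge_asof&lt;/code&gt; twice on a tiny pair of frames, once with the default backward search and once with &lt;code&gt;direction='forward'&lt;/code&gt;. The frames &lt;code&gt;sales&lt;/code&gt; and &lt;code&gt;flags&lt;/code&gt; are hypothetical toy data invented for this illustration, not the &lt;code&gt;df_sales&lt;/code&gt; and &lt;code&gt;df_flags&lt;/code&gt; used earlier.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy data (not the df_sales / df_flags from the post)
sales = pd.DataFrame({
    'timestamp': pd.to_datetime(['2018-06-01', '2018-06-20']),
    'user': ['bob', 'bob'],
})
flags = pd.DataFrame({
    'timestamp': pd.to_datetime(['2018-06-09']),
    'user': ['bob'],
    'status': ['gold_member'],
})

# Default (backward) search: each sale gets the latest flag at or before it
backward = pd.merge_asof(sales, flags, on='timestamp', by='user')

# Forward search: each sale gets the first flag at or after it
forward = pd.merge_asof(sales, flags, on='timestamp', by='user',
                        direction='forward')

print(backward['status'].tolist())
print(forward['status'].tolist())
```

&lt;p&gt;With the backward search only the sale after June 9th picks up the gold status; with the forward search only the sale before it does.&lt;/p&gt;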
&lt;p&gt;Somewhat counter-intuitively, this causes the status changes to propagate &lt;em&gt;forward&lt;/em&gt; in time. This is the default behavior of &lt;code&gt;merge_asof&lt;/code&gt;. To do the reverse (a &lt;em&gt;forward&lt;/em&gt; search, a.k.a. propagate changes &lt;em&gt;backwards&lt;/em&gt; in time), you can provide the keyword argument &lt;code&gt;direction="forward"&lt;/code&gt;. &lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="data"></category><category term="time series"></category><category term="pandas"></category></entry><entry><title>Getting Calvin home on time: a statistics puzzle</title><link href="https://johnpaton.net/posts/calvin-puzzle/" rel="alternate"></link><published>2018-07-19T17:00:00-01:00</published><updated>2018-07-19T17:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2018-07-19:/posts/calvin-puzzle/</id><summary type="html">&lt;p&gt;I found this puzzle a while ago and couldn&amp;#8217;t get it out of my head, so I decided to write up a solution. &amp;#8220;Calvin has to cross several signals when he walks from his home to school. Each of these signals operate independently. They alternate every 80 seconds between green light and red light. At each signal, there is a counter display that tells him how long it will be before the current signal light changes. Calvin has a magic wand which lets him turn a signal from red to green instantaneously. However, this wand comes with limited battery life, so he can use it only for a specified number of&amp;nbsp;times.&amp;#8221;&lt;/p&gt;</summary><content type="html">&lt;p&gt;I found this puzzle a while ago and couldn&amp;#8217;t get it out of my head, so I decided to write up a&amp;nbsp;solution.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Calvin has to cross several signals when he walks from his home to school. Each of these signals operate independently. They alternate every 80 seconds between green light and red light. At each signal, there is a counter display that tells him how long it will be before the current signal light changes. Calvin has a magic wand which lets him turn a signal from red to green instantaneously. However, this wand comes with limited battery life, so he can use it only for a specified number of&amp;nbsp;times.&lt;/p&gt;
&lt;p&gt;a) If the total number of signals is 2 and Calvin can use his magic wand only once, then what is the expected waiting time at the signals when Calvin optimally walks from his home to&amp;nbsp;school?&lt;/p&gt;
&lt;p&gt;b) What if the number of signals is 3 and Calvin can use his magic wand only&amp;nbsp;once?&lt;/p&gt;
&lt;p&gt;c) Write a program that takes as inputs the number of signals and the number of times Calvin can use his magic wand, and outputs the expected waiting&amp;nbsp;time&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Assuming that the lights are independent, the time until a light changes (displayed on the counter) when Calvin arrives is drawn from a uniform distribution between 0 and&amp;nbsp;80:&lt;/p&gt;
&lt;div class="math"&gt;$$ t_{\text{counter}} \sim \text{Unif}(0\ \text{s},80\ \text{s}).$$&lt;/div&gt;
&lt;p&gt;So the average waiting time at each &lt;em&gt;red&lt;/em&gt; light is 40 seconds. Since there is a 50% chance that the light is red (and no wait at all when it is green), the expected waiting time per light&amp;nbsp;is&lt;/p&gt;
&lt;div class="math"&gt;$$ \bar t_{\text{wait}} = P(\text{red}) \cdot \bar t_{\text{counter}} = \frac{1}{2} \cdot 40\ \text{s} = 20\ \text{s}. $$&lt;/div&gt;
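&lt;p&gt;A quick sanity check of these two averages, as a rough sketch: the snippet draws a counter value and a red/green coin flip for a million single lights and compares the sample means with 40 and 20 seconds. The seed and the number of trials are arbitrary choices made for this illustration.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_trials = 1_000_000            # arbitrary sample size

counter = rng.uniform(0, 80, n_trials)  # time shown on the counter
is_red = rng.integers(0, 2, n_trials)   # 1 = red, 0 = green (50/50)
wait = counter * is_red                 # no wait when the light is green

mean_counter = counter.mean()  # should be close to 40 seconds
mean_wait = wait.mean()        # should be close to 20 seconds
print('mean counter: {:.2f} s, mean wait: {:.2f} s'.format(mean_counter, mean_wait))
```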
&lt;p&gt;Calvin wants to use his wand in order to minimize his waiting&amp;nbsp;time.&lt;/p&gt;
&lt;h2&gt;a) If the total number of signals is 2 and Calvin can use his magic wand only once, then what is the expected waiting time at the signals when Calvin optimally walks from his home to&amp;nbsp;school?&lt;/h2&gt;
&lt;p&gt;With two lights and one use of the magic wand, there are three scenarios that can&amp;nbsp;arise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If the first light is green (50% chance), then Calvin will use his wand on the second light, with 0 waiting&amp;nbsp;time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the first light is red (50%&amp;nbsp;chance):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;He will use his wand if the time on the timer is higher than the expected wait for the second light (20 seconds, 75% chance). In this case the expected waiting time is the expected time for the second light: 20&amp;nbsp;seconds.&lt;/li&gt;
&lt;li&gt;He will not use his wand if the time on the timer is less than 20 seconds (the mean time in these cases is 10 seconds since they are uniformly distributed between 0 and 20, and the situation occurs with a 25% chance). In that case he will wait, and use his wand on the second light, for a total expected waiting time of 10&amp;nbsp;seconds.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;His total expected waiting time is found by taking a sum of all the different situations weighted by their&amp;nbsp;probability:&lt;/p&gt;
&lt;div class="math"&gt;$$ \bar t_{\text{wait}} = 0.5 \cdot 0\ \text{s} + 0.5\cdot(0.75 \cdot 20\ \text{s} + 0.25 \cdot 10\ \text{s}) = 8.75\ \text{s}$$&lt;/div&gt;
&lt;p&gt;So Calvin&amp;#8217;s expected waiting time is &lt;strong&gt;8.75 seconds&lt;/strong&gt;.&lt;/p&gt;
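&lt;p&gt;The weighted sum can also be reproduced with exact rational arithmetic. This is only a sketch of the bookkeeping; the variable names are made up for the illustration.&lt;/p&gt;

```python
from fractions import Fraction

p_green = Fraction(1, 2)       # first light is green
p_red = Fraction(1, 2)         # first light is red
p_use_wand = Fraction(60, 80)  # red and counter above 20 s: wand now, expect 20 s at light 2
p_wait = Fraction(20, 80)      # red and counter below 20 s: wait (10 s on average), wand at light 2

expected = p_green * 0 + p_red * (p_use_wand * 20 + p_wait * 10)
print(expected, float(expected))
```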
&lt;p&gt;Let&amp;#8217;s confirm this by trying it&amp;#8230; one billion times [&lt;em&gt;evil laugh&lt;/em&gt;]:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;n_trials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# initialize the trials&lt;/span&gt;
&lt;span class="n"&gt;wait_times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# light waiting times: uniform 0-80 seconds, 50% chance of being green (0 wait time)&lt;/span&gt;
&lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;wand1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# trials where Calvin uses the wand on light 1&lt;/span&gt;

&lt;span class="c1"&gt;# wait at light 2 wherever he uses the wand at light 1&lt;/span&gt;
&lt;span class="n"&gt;wait_times&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;wand1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;wand1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="c1"&gt;# wait at light 1 wherever he doesn&amp;#39;t use the wand at light 1&lt;/span&gt;
&lt;span class="n"&gt;wait_times&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;wand1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;wand1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Calvin&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s mean waiting time in {:,} trials was {:.3f} seconds&amp;#39;&lt;/span&gt;\
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_times&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Calvin&amp;#39;s mean waiting time in 1,000,000,000 trials was 8.750 seconds
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Nice!&lt;/p&gt;
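&lt;p&gt;A caveat on the simulation: holding several arrays of a billion floats in memory at once takes tens of gigabytes. A lighter variant, sketched here under the same strategy (cutoff of 20 seconds at the first light), processes the trials in chunks and only keeps a running sum. The function name, chunk size, seed and the smaller trial count are all choices made for this sketch.&lt;/p&gt;

```python
import numpy as np

def simulate_two_lights(n_trials, chunk_size=1_000_000, seed=0):
    """Mean wait for 2 lights and 1 wand use, simulated chunk by chunk."""
    rng = np.random.default_rng(seed)
    total = 0.0
    remaining = n_trials
    while remaining:
        m = min(chunk_size, remaining)
        # waiting times: uniform 0-80 seconds, 50% chance of green (0 wait)
        t1 = rng.uniform(0, 80, m) * rng.integers(0, 2, m)
        t2 = rng.uniform(0, 80, m) * rng.integers(0, 2, m)
        # wand at light 1 if its wait is at least 20 s (then wait at light 2),
        # otherwise wait at light 1 and use the wand at light 2
        waits = np.where(np.greater_equal(t1, 20), t2, t1)
        total += waits.sum()
        remaining -= m
    return total / n_trials

mean_wait = simulate_two_lights(5_000_000)
print('Mean waiting time: {:.3f} seconds'.format(mean_wait))
```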
&lt;h2&gt;b) What if the number of signals is 3 and Calvin can use his magic wand only&amp;nbsp;once?&lt;/h2&gt;
&lt;p&gt;Now we have three signals and one wand usage. Since we have one wand, we want to find some optimal cutoff value that lets us decide whether or not to use the wand at each light. We base this on whether or not we expect to see a light with a higher waiting time than the current counter during the rest of our walk&amp;nbsp;home.&lt;/p&gt;
&lt;p&gt;At the last light, the cutoff is zero. If we haven&amp;#8217;t used the wand yet, we will definitely use it if the light is not green, as there are no more chances left to use&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ve already shown that the cutoff at the second-to-last light should be 20 seconds, as this situation is the same one as in part a) (assuming the wand hasn&amp;#8217;t been used&amp;nbsp;yet). &lt;/p&gt;
&lt;p&gt;What about at the first light? We want to know the expectation value of &lt;em&gt;the maximum of&lt;/em&gt; the next two waiting&amp;nbsp;times. &lt;/p&gt;
&lt;p&gt;The probability that the counter displays a waiting time less than or equal to some time &lt;span class="math"&gt;\(t\)&lt;/span&gt; is given&amp;nbsp;by&lt;/p&gt;
&lt;div class="math"&gt;$$ \tilde{F}(t) = P(t_{\text{count}} \le t) = \frac{t}{80}. $$&lt;/div&gt;
&lt;p&gt;&lt;span class="math"&gt;\(\tilde{F}(t)\)&lt;/span&gt; here is the Cumulative Distribution Function (&lt;span class="caps"&gt;CDF&lt;/span&gt;) for our uniformly distributed counter&amp;nbsp;time. &lt;/p&gt;
&lt;p&gt;Now, there is a 50% chance that the light will be green, in which case our waiting time is zero, regardless of the value displayed on the counter. This means that the probability of the waiting time being less than or equal to zero is already &lt;span class="math"&gt;\(1/2\)&lt;/span&gt;, and it grows linearly to &lt;span class="math"&gt;\(1\)&lt;/span&gt; for a waiting time up to &lt;span class="math"&gt;\(80\)&lt;/span&gt; seconds. So including the chance of a green light, the &lt;span class="caps"&gt;CDF&lt;/span&gt; for our waiting time&amp;nbsp;is&lt;/p&gt;
&lt;div class="math"&gt;$$ F(t) = P(t_{\text{wait}} \le t) = \frac{1}{2} + \frac{1}{2}\frac{t}{80} .$$&lt;/div&gt;
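&lt;p&gt;To see that this formula behaves as described, the sketch below compares it with the empirical fraction of simulated waiting times that fall at or below a few values of &lt;span class="math"&gt;\(t\)&lt;/span&gt;. The seed, sample size and the chosen values of &lt;span class="math"&gt;\(t\)&lt;/span&gt; are arbitrary.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n_trials = 1_000_000

# waiting time per light: counter value if red, zero if green
wait = rng.uniform(0, 80, n_trials) * rng.integers(0, 2, n_trials)

ts = np.array([0, 20, 40, 80])
# empirical CDF: fraction of trials with a wait of at most t
empirical = np.array([np.less_equal(wait, t).mean() for t in ts])
# CDF from the text: 1/2 + (1/2) * t / 80
formula = 0.5 + 0.5 * ts / 80

for t, e, f in zip(ts, empirical, formula):
    print('t = {:2d} s: empirical {:.3f}, formula {:.3f}'.format(t, e, f))
```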
&lt;p&gt;We are interested in the expected maximum waiting time over the last 2 lights in Calvin&amp;#8217;s journey, so that he knows what his cutoff should be at the first light. What is the relevant &lt;span class="caps"&gt;CDF&lt;/span&gt;? If the maximum of two random variables is at most a time &lt;span class="math"&gt;\(t\)&lt;/span&gt;, then each of the variables must be at most &lt;span class="math"&gt;\(t\)&lt;/span&gt; individually. Since we are assuming that the waiting times are independent, their probabilities factor, and we get the &lt;span class="caps"&gt;CDF&lt;/span&gt;&lt;/p&gt;
&lt;div class="math"&gt;$$ F_2(t_\max) = P(t_{\text{wait},1},t_{\text{wait},2} \le t_\max) = P(t_{\text{wait},1}\le t_\max)P(t_{\text{wait},2} \le t_\max) =  F(t_\max)^2. $$&lt;/div&gt;
&lt;p&gt;Clearly this generalizes to &lt;span class="math"&gt;\(n\)&lt;/span&gt; remaining lights, so let&amp;#8217;s consider the more general case and then specify &lt;span class="math"&gt;\(n=2\)&lt;/span&gt; to get the cutoff we want. We&amp;nbsp;have&lt;/p&gt;
&lt;div class="math"&gt;$$ F_n(t_\max) = F(t_\max)^n = \left( \frac{1}{2} + \frac{1}{2}\frac{t_\max}{80} \right)^n. $$&lt;/div&gt;
&lt;p&gt;We are looking for the expected value of this maximum. The expectation can be found directly from the &lt;span class="caps"&gt;CDF&lt;/span&gt;&amp;nbsp;by&lt;/p&gt;
&lt;div class="math"&gt;$$ E_n = 80 - \int_0^{80}\!\! \text{d}t\ F_n(t) = 80 - \int_0^{80}\text{d}t \left( \frac{1}{2} + \frac{1}{2}\frac{t}{80} \right)^n$$&lt;/div&gt;
&lt;p&gt;(this can be seen from integrating the normal definition, &lt;span class="math"&gt;\(E = \int \text{d}x\ x\ \text{PDF}(x)\)&lt;/span&gt;, by parts). To take the integral, we perform the substitution &lt;span class="math"&gt;\(u = 1/2 + t/(2\cdot80)\)&lt;/span&gt; and&amp;nbsp;get&lt;/p&gt;
&lt;div class="math"&gt;\begin{align*}
\int_0^{80}\text{d}t \left( \frac{1}{2} + \frac{1}{2}\frac{t}{80} \right)^n &amp;amp;= 80 \cdot 2 \int_{\frac{1}{2}}^1 \text{d}u\ u^n \\
&amp;amp;= 80 \cdot 2 \left. \frac{u^{n+1}}{n+1} \right|_{\frac{1}{2}}^1\\
&amp;amp;= 80 \cdot \frac{2}{n+1}\left(1-\frac{1}{2^{n+1}}\right).
\end{align*}&lt;/div&gt;
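&lt;p&gt;The closed form of this integral is easy to check numerically. The sketch below compares it with a plain midpoint-rule approximation for a few values of &lt;span class="math"&gt;\(n\)&lt;/span&gt;; the function names and the number of integration steps are choices made for this illustration.&lt;/p&gt;

```python
import numpy as np

def integral_closed_form(n):
    """Closed form: 80 * 2/(n+1) * (1 - 1/2^(n+1))."""
    return 80 * 2 / (n + 1) * (1 - 1 / 2 ** (n + 1))

def integral_midpoint(n, n_steps=100_000):
    """Midpoint-rule approximation of the integral of F(t)^n over [0, 80]."""
    t = (np.arange(n_steps) + 0.5) * 80 / n_steps  # midpoints of the steps
    return ((0.5 + 0.5 * t / 80) ** n).mean() * 80

for n in (1, 2, 5):
    print(n, integral_closed_form(n), integral_midpoint(n))
```

&lt;p&gt;For &lt;span class="math"&gt;\(n=1\)&lt;/span&gt; the integral is 60, which gives an expected wait of &lt;span class="math"&gt;\(80 - 60 = 20\)&lt;/span&gt; seconds for a single light, in agreement with the per-light average found earlier.&lt;/p&gt;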
&lt;p&gt;Inserting this result into our expression for the expected maximum, we&amp;nbsp;get&lt;/p&gt;
&lt;div class="math"&gt;$$ E_n = 80 \cdot \left(1 - \frac{2}{n+1}\left(1-\frac{1}{2^{n+1}}\right)\right) = 80 \cdot \left(\frac{n}{n+1} - \frac{1}{n+1}\left(1-\frac{1}{2^{n}}\right)\right). $$&lt;/div&gt;
&lt;p&gt;Interestingly, in the second form above, &lt;span class="math"&gt;\(80\cdot n/(n+1)\)&lt;/span&gt; is the expected maximum waiting time if we had to wait at every light, and the second term is a correction due to half of the lights being green. Again we can confirm our calculation by&amp;nbsp;simulation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ggplot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# simulate (up to) 20 lights&lt;/span&gt;
&lt;span class="n"&gt;n_lights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="n"&gt;n_trials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;counter_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;

&lt;span class="c1"&gt;# simulate the lights (one row per light, one column per trial)&lt;/span&gt;
&lt;span class="n"&gt;trials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;\
            &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;

&lt;span class="c1"&gt;# get the max waiting times for each possible number of lights&lt;/span&gt;
&lt;span class="n"&gt;t_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trials&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;t_max&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trials&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# get the calculated expectation and simulated mean of the max waiting times&lt;/span&gt;
&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;wait_max_exp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter_max&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="n"&gt;wait_max_sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t_max&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# plot to compare&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_max_exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;x&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;b&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;expectation&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_max_sim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;simulation ({:,} trials)&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_trials&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Number of lights&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Max waiting time (seconds)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;counter_max&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Maximum expected waiting time in sample of $n$ lights&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;img alt="png" src="/images/calvin_4_0.png"&gt;&lt;/p&gt;
&lt;p&gt;So we have values to function as our cutoffs, based on how many lights are remaining in the journey. Specifically, the cutoff for the first of 3 lights is given by the maximum expected wait time for the remaining &lt;span class="math"&gt;\(n=2\)&lt;/span&gt;&amp;nbsp;lights: &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Max expected waiting time for 2 remaining lights is {:.2f} seconds.&amp;#39;&lt;/span&gt;\
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wait_max_exp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Max expected waiting time for 2 remaining lights is 33.33 seconds.
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Calvin should use his wand if the first light is red and the counter is over 33.33 seconds. We can calculate his expected waiting time by using the result from a) and considering the possible&amp;nbsp;scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first light is green (50% chance). In this case Calvin has two lights and one wand to go; this is the situation from a). In this scenario the expected waiting time is 8.75&amp;nbsp;seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The first light is red (50%&amp;nbsp;chance):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The timer is over the cutoff ( &lt;span class="math"&gt;\((80-33.33)/80 = 58\)&lt;/span&gt;% chance for uniformly distributed time on the counter). In this case Calvin uses the wand, and with no wands left he expects to wait for 20 seconds per light for the remaining number of lights. There are two lights left, so the expected wait time in this case is 40&amp;nbsp;seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The timer is under the cutoff (42% chance). In this case Calvin waits for the displayed time (expectation &lt;span class="math"&gt;\(33.33/2 = 16.67\)&lt;/span&gt; seconds) and then moves on to the situation in a), for a total expected waiting time of &lt;span class="math"&gt;\(16.67 + 8.75 = 25.42\)&lt;/span&gt;&amp;nbsp;seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Again we find his total expected waiting time by taking the weighted&amp;nbsp;sum:&lt;/p&gt;
&lt;div class="math"&gt;$$ \bar t_{\text{wait}} \approx 0.5 \cdot 8.75\ \text{s} + 0.5 \cdot \left(0.58\cdot 40\ \text{s} + 0.42 \cdot 25.42\ \text{s} \right) \approx 21.3\ \text{s}$$&lt;/div&gt;
&lt;p&gt;Note that this by-hand calculation has some rounding errors in the cutoff and proportions, but the exact value also rounds to &lt;strong&gt;21.3 seconds&lt;/strong&gt;.&lt;/p&gt;
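&lt;p&gt;As a quick check of that claim, here is a minimal sketch that redoes the weighted sum without rounding the cutoff or the proportions. It takes the 8.75 seconds from a) and the 20 seconds per light as given; the variable names are chosen here for illustration.&lt;/p&gt;

```python
counter_max = 80

# cutoff: expected maximum wait over the remaining 2 lights (the 33.33 s above)
n = 2
cutoff = counter_max * (1 - (2 / (n + 1)) * (1 - 1 / 2 ** (n + 1)))

wait_a = 8.75          # result from a): 2 lights, 1 wand
wait_no_wand = 2 * 20  # 2 lights left, no wands left

p_under = cutoff / counter_max  # red light, timer under the cutoff
p_over = 1 - p_under            # red light, timer over the cutoff

# weighted sum over green / red + wand / red + wait
wait_b = 0.5 * wait_a + 0.5 * (
    p_over * wait_no_wand + p_under * (cutoff / 2 + wait_a)
)
print('Expected waiting time: {:.1f} seconds'.format(wait_b))
```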
&lt;h2&gt;c) Write a program that takes as inputs the number of signals and the number of times Calvin can use his magic wand, and outputs the expected waiting&amp;nbsp;time&lt;/h2&gt;
&lt;p&gt;Now we want full generalization to any number of lights and wand uses. It should be clear from the reliance of the calculation in b) on the results from a) that we can use a recursive strategy to calculate the total waiting&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;There are two questions to answer: What should the cutoff be for multiple uses of the wand? And what is the recursive pattern for calculating the expected wait&amp;nbsp;time?&lt;/p&gt;
&lt;p&gt;Let&amp;#8217;s begin with the cutoff. When Calvin comes to a red light, he must decide whether to use the wand or not. For a walk with &lt;span class="math"&gt;\(n\)&lt;/span&gt; lights, our strategy has been to consider the expected maximum wait encountered during the next &lt;span class="math"&gt;\(n-1\)&lt;/span&gt; lights. We can generalize by considering the expected maximum wait over the number of remaining lights Calvin expects to wait at, including the current one. If Calvin has &lt;span class="math"&gt;\(w\)&lt;/span&gt; uses of his wand, then he can skip up to &lt;span class="math"&gt;\(w\)&lt;/span&gt; lights, meaning he has to wait at &lt;span class="math"&gt;\(n-w\)&lt;/span&gt; lights in total, including the current one. So the cutoff for using the wand can be generalized&amp;nbsp;to&lt;/p&gt;
&lt;div class="math"&gt;$$ E_{\tilde{n}} = 80 \cdot \left(1 - \frac{2}{\tilde{n}+1}\left(1-\frac{1}{2^{\tilde{n}+1}}\right)\right),\ \tilde{n} = n - w. $$&lt;/div&gt;
&lt;p&gt;This new cutoff is smaller than the previous one when &lt;span class="math"&gt;\(w&amp;gt;1\)&lt;/span&gt;, which makes sense since we would then like to be more liberal with our wand&amp;nbsp;usage.&lt;/p&gt;
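&lt;p&gt;To make the effect of extra wands concrete, here is a small sketch of the generalized cutoff as a function. The name &lt;code&gt;wand_cutoff&lt;/code&gt; is made up for this example, and it is only meant for the case where there are fewer wands than lights.&lt;/p&gt;

```python
def wand_cutoff(n_lights, n_wands, counter_max=80):
    """Timer value (seconds) above which Calvin uses a wand at a red light.

    Only meaningful when n_wands is smaller than n_lights.
    """
    n = n_lights - n_wands  # number of lights he expects to wait at
    return counter_max * (1 - (2 / (n + 1)) * (1 - 1 / 2 ** (n + 1)))


# with 3 lights to go, a second wand lowers the cutoff
for w in (1, 2):
    print('3 lights, {} wand(s): cutoff {:.2f} s'.format(w, wand_cutoff(3, w)))
```

&lt;p&gt;With one wand this reproduces the 33.33 seconds from b); with two wands the cutoff drops to 20 seconds.&lt;/p&gt;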
&lt;p&gt;Now that we have the cutoff, we need the recursive pattern. Again we have three scenarios. If we have &lt;span class="math"&gt;\(n\)&lt;/span&gt; lights and &lt;span class="math"&gt;\(w\)&lt;/span&gt; wands to go, then the scenarios&amp;nbsp;are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The current light is green (50% chance). In this case the expected waiting time is just the waiting time for &lt;span class="math"&gt;\(n-1\)&lt;/span&gt; lights and &lt;span class="math"&gt;\(w\)&lt;/span&gt; wand&amp;nbsp;uses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The current light is red (50%&amp;nbsp;chance):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The waiting time is below the cutoff (probability &lt;span class="math"&gt;\(\text{cutoff} / 80\)&lt;/span&gt;). In this case Calvin doesn&amp;#8217;t use the wand, and the total waiting time is the wait at the current light (expectation &lt;span class="math"&gt;\(\text{cutoff}/2\)&lt;/span&gt;) plus the expected wait at &lt;span class="math"&gt;\(n-1\)&lt;/span&gt; lights with &lt;span class="math"&gt;\(w\)&lt;/span&gt;&amp;nbsp;wands.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The waiting time is above the cutoff (probability &lt;span class="math"&gt;\(1 - \text{cutoff} / 80\)&lt;/span&gt;). Now Calvin uses the wand, adding no extra waiting time, so the expected total wait is the expected wait for &lt;span class="math"&gt;\(n-1\)&lt;/span&gt; lights and &lt;span class="math"&gt;\(w-1\)&lt;/span&gt;&amp;nbsp;wands.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
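&lt;p&gt;These three scenarios can be combined into a single recurrence. Here &lt;span class="math"&gt;\(T(n, w)\)&lt;/span&gt; is shorthand introduced just for this formula, meaning the expected total wait with &lt;span class="math"&gt;\(n\)&lt;/span&gt; lights and &lt;span class="math"&gt;\(w\)&lt;/span&gt; wands to go, and &lt;span class="math"&gt;\(c\)&lt;/span&gt; is the cutoff from above:&lt;/p&gt;
&lt;div class="math"&gt;$$ T(n, w) = \frac{1}{2}\, T(n-1, w) + \frac{1}{2}\left[\frac{c}{80}\left(\frac{c}{2} + T(n-1, w)\right) + \left(1 - \frac{c}{80}\right) T(n-1, w-1)\right],\ c = E_{n-w}. $$&lt;/div&gt;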
&lt;p&gt;We calculate these scenarios, recursing whenever we need a new total expected wait time for a different number of lights and/or wands, until we hit one of the base cases. There are a&amp;nbsp;few:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If we have 0 lights to go, then the expected total remaining wait time is 0&amp;nbsp;seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the number of wands is greater than or equal to the number of lights, then the wait time is 0 seconds (since Calvin can use a wand at every red&amp;nbsp;light).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If we have no wands left, then the wait time is just the expected wait per light (20 seconds) times the number of lights&amp;nbsp;remaining.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can write a function to do this&amp;nbsp;calculation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_time_exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_wands&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Calculate Calvin&amp;#39;s total expected wait time on his journey home.&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        n_lights (int): The number of lights on the journey&lt;/span&gt;
&lt;span class="sd"&gt;        n_wands (int): The number of times Calvin can use his wand&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        float: The total expected time spent waiting at lights&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="c1"&gt;# maximum time per timer&lt;/span&gt;
    &lt;span class="n"&gt;counter_max&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;

    &lt;span class="c1"&gt;#### Base cases ####&lt;/span&gt;
    &lt;span class="c1"&gt;# no lights left&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n_lights&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="c1"&gt;# enough wands to switch all lights&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;n_wands&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="c1"&gt;# no wands left&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;n_wands&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n_lights&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;counter_max&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

    &lt;span class="c1"&gt;#### Recursion ####&lt;/span&gt;
    &lt;span class="c1"&gt;# cutoff for the number of lights we expect to wait at&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n_wands&lt;/span&gt;
    &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter_max&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;

    &lt;span class="c1"&gt;# initialize 3 scenarios&lt;/span&gt;
    &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;waits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# scenario 0: green light&lt;/span&gt;
    &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
    &lt;span class="n"&gt;waits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wait_time_exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_wands&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# scenario 1: red light + wait&lt;/span&gt;
    &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;counter_max&lt;/span&gt;
    &lt;span class="n"&gt;waits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;waits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# scenario 2: red light + use a wand&lt;/span&gt;
    &lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;counter_max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;waits&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wait_time_exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_wands&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# return weighted sum of scenarios&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;waits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
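&lt;p&gt;One optional refinement: the recursion visits the same combination of lights and wands many times. A possible variant, sketched here in plain Python rather than with NumPy arrays, caches the results with &lt;code&gt;functools.lru_cache&lt;/code&gt;. The name &lt;code&gt;wait_time_exp_cached&lt;/code&gt; is made up for this sketch, and it assumes both inputs are non-negative integers.&lt;/p&gt;

```python
from functools import lru_cache

COUNTER_MAX = 80  # maximum time per timer


@lru_cache(maxsize=None)
def wait_time_exp_cached(n_lights, n_wands):
    """Memoized variant of the recursion (non-negative integer inputs)."""
    # base case: no lights left, or enough wands for every remaining light
    if max(n_lights - n_wands, 0) == 0:
        return 0.0
    # base case: no wands left, so 20 seconds expected per light
    if n_wands == 0:
        return n_lights * COUNTER_MAX / 4

    # cutoff for the number of lights we expect to wait at
    n = n_lights - n_wands
    cutoff = COUNTER_MAX * (1 - (2 / (n + 1)) * (1 - 1 / 2 ** (n + 1)))
    p_wait = cutoff / COUNTER_MAX  # chance a red timer is under the cutoff

    wait_same_wands = wait_time_exp_cached(n_lights - 1, n_wands)
    wait_one_less = wait_time_exp_cached(n_lights - 1, n_wands - 1)

    # weighted sum: green light, red + wait, red + use a wand
    return (0.5 * wait_same_wands
            + 0.5 * p_wait * (cutoff / 2 + wait_same_wands)
            + 0.5 * (1 - p_wait) * wait_one_less)


print(wait_time_exp_cached(2, 1))            # the situation from a)
print(round(wait_time_exp_cached(3, 1), 1))  # the situation from b)
```

&lt;p&gt;Each pair of lights and wands is then computed only once, which matters more as the grid of values grows.&lt;/p&gt;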


&lt;p&gt;And finally we finish off with some art to confirm that the function is behaving as we expect it&amp;nbsp;to.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# set up variables&lt;/span&gt;
&lt;span class="n"&gt;n_lights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="n"&gt;n_wands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;

&lt;span class="n"&gt;lights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;wands&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_wands&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# initialize waiting times&lt;/span&gt;
&lt;span class="n"&gt;times&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_lights&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_wands&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;

&lt;span class="c1"&gt;# calculate waiting times&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lights&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;wands&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wait_time_exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# plot the waiting times as a heatmap&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt; 
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;jet&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;w&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Total expected waiting time on Calvin&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s walk home&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Number of lights&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Number of wand uses&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lights&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lights&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wands&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wands&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Time (seconds)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;img alt="png" src="/images/calvin_11_0.png"&gt;&lt;/p&gt;
&lt;p&gt;With the problem thoroughly solved, I can finally sleep at night&amp;nbsp;again.&lt;/p&gt;
&lt;script type="text/javascript"&gt;if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
    var align = "center",
        indent = "0em",
        linebreak = "false";

    if (false) {
        align = (screen.width &lt; 768) ? "left" : align;
        indent = (screen.width &lt; 768) ? "0em" : indent;
        linebreak = (screen.width &lt; 768) ? 'true' : linebreak;
    }

    var mathjaxscript = document.createElement('script');
    mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
    mathjaxscript.type = 'text/javascript';
    mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML';

    var configscript = document.createElement('script');
    configscript.type = 'text/x-mathjax-config';
    configscript[(window.opera ? "innerHTML" : "text")] =
        "MathJax.Hub.Config({" +
        "    config: ['MMLorHTML.js']," +
        "    TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
        "    jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
        "    extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
        "    displayAlign: '"+ align +"'," +
        "    displayIndent: '"+ indent +"'," +
        "    showMathMenu: true," +
        "    messageStyle: 'normal'," +
        "    tex2jax: { " +
        "        inlineMath: [ ['\\\\(','\\\\)'] ], " +
        "        displayMath: [ ['$$','$$'] ]," +
        "        processEscapes: true," +
        "        preview: 'TeX'," +
        "    }, " +
        "    'HTML-CSS': { " +
        "        fonts: [['STIX', 'TeX']]," +
        "        styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
        "        linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
        "    }, " +
        "}); " +
        "if ('default' !== 'default') {" +
            "MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
            "MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
                "var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
                "VARIANT['normal'].fonts.unshift('MathJax_default');" +
                "VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
                "VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
                "VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
            "});" +
        "}";

    (document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
    (document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
&lt;/script&gt;</content><category term="posts"></category><category term="python"></category><category term="statistics"></category><category term="puzzles"></category></entry><entry><title>Remapping the world with word vectors</title><link href="https://johnpaton.net/posts/remapping-the-world/" rel="alternate"></link><published>2017-12-02T19:34:00-01:00</published><updated>2017-12-02T19:34:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-12-02:/posts/remapping-the-world/</id><summary type="html">&lt;p&gt;Everyone is used to the map. Most people could make a reasonable attempt at drawing one from memory (well, &lt;a href="https://www.theatlantic.com/international/archive/2014/01/what-you-get-when-30-people-draw-a-world-map-from-memory/282901/"&gt;sort of&lt;/a&gt;). But what would it look like if we positioned the countries not by geographical location, but by our own perceived relationships between them? Armed with Conceptnet Numberbatch, I decided to try just&amp;nbsp;that.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Everyone is used to the map. Most people could make a reasonable attempt at drawing one from memory (well, &lt;a href="https://www.theatlantic.com/international/archive/2014/01/what-you-get-when-30-people-draw-a-world-map-from-memory/282901/"&gt;sort of&lt;/a&gt;). But what would it look like if we positioned the countries not by geographical location, but by our own perceived relationships between them? Armed with Conceptnet Numberbatch, I decided to try just&amp;nbsp;that.&lt;/p&gt;
&lt;p&gt;Contents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#word-vectors"&gt;Word&amp;nbsp;Vectors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conceptnet-numberbatch"&gt;Conceptnet&amp;nbsp;Numberbatch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#countries-of-the-world"&gt;Countries of the&amp;nbsp;world&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#countries-as-concept-vectors"&gt;Countries as concept&amp;nbsp;vectors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#remapping"&gt;Remapping the&amp;nbsp;world&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#mapping-our-bias"&gt;Mapping our&amp;nbsp;bias?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a name="word-vectors"&gt; &lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Word&amp;nbsp;Vectors&lt;/h3&gt;
&lt;p&gt;In natural language processing, a big issue is the dimensionality of the data. Machine learning algorithms almost exclusively deal with vectors of numbers, whereas the atomic unit of natural language is the word. The natural way to encode words as feature vectors is to do what&amp;#8217;s called one-hot encoding: make a vector as long as the dictionary, with one entry per word, then set the entry for the word we are trying to represent to 1 and all the others to&amp;nbsp;0.&lt;/p&gt;
&lt;p&gt;There are a few problems with this approach. For one thing, these vectors are very long. Large bodies of English text contain hundreds of thousands of unique words, so each vector has to have hundreds of thousands of entries. However, a more important issue is that these vectors don&amp;#8217;t encode any meaning. Each vector has magnitude 1, and they are all perpendicular to each other, so we lose all the relationships that we intuitively know words&amp;nbsp;carry. &lt;/p&gt;
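&lt;p&gt;A tiny sketch makes both problems visible. The four-word vocabulary and the helper names below are made up for illustration; a real vocabulary would be vastly larger.&lt;/p&gt;

```python
import math

# a toy dictionary; real vocabularies have hundreds of thousands of entries
vocabulary = ['cat', 'dog', 'orange', 'canada']


def one_hot(word):
    """Vector as long as the vocabulary, with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


cat, dog = one_hot('cat'), one_hot('dog')
print(cat, dog)
print('magnitude of cat:', math.sqrt(dot(cat, cat)))
print('cat . dog:', dot(cat, dog))
```

&lt;p&gt;Every pair of distinct words has a dot product of zero, so as far as these vectors are concerned &amp;#8220;cat&amp;#8221; is no closer to &amp;#8220;dog&amp;#8221; than it is to &amp;#8220;canada&amp;#8221;.&lt;/p&gt;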
&lt;p&gt;Word vectors are an attempt at a solution to both of these problems. The idea is to use an algorithm to &lt;em&gt;learn&lt;/em&gt; a set of numbers that are unique to each word, and such that &amp;#8220;similar&amp;#8221; words have vectors that are &amp;#8220;close to&amp;#8221; each other. Here we consider words to be similar if they are often used in the same context, and we compare vectors by the angle between them (cosine similarity). There are various implementations of this idea, including Stanford&amp;#8217;s &lt;a href="https://nlp.stanford.edu/projects/glove/"&gt;GloVe&lt;/a&gt;, two flavours of Google&amp;#8217;s &lt;a href="https://code.google.com/archive/p/word2vec/"&gt;word2vec&lt;/a&gt;, and Facebook&amp;#8217;s &lt;a href="https://research.fb.com/fasttext/"&gt;fastText&lt;/a&gt;. These approaches are very successful in their aims, even learning to encode abstract relationships between words, as demonstrated by this 2d projection of the GloVe&amp;nbsp;vectors:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nlp.stanford.edu/projects/glove/images/comparative_superlative.jpg"&gt;&lt;img alt="superlative word vectors" src="/images/word_vectors_superlative.jpg"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It&amp;#8217;s worth mentioning that these vectors are the result of &lt;em&gt;unsupervised learning&lt;/em&gt;: nobody told the algorithms that loud should be to loudest as strong is to strongest. The relationships are learned purely from the contexts that the words are used&amp;nbsp;in.&lt;/p&gt;
&lt;p&gt;There are pre-trained vectors from all these implementations available online to download, and if you have a lot of data it&amp;#8217;s also possible to train your own. If you do decide to make your own word vectors, make sure you fit in with the cool kids by giving them a name with weird&amp;nbsp;capitalization.&lt;/p&gt;
&lt;p&gt;&lt;a name="conceptnet-numberbatch"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Conceptnet&amp;nbsp;Numberbatch&lt;/h3&gt;
&lt;p&gt;The popular word vector sets are mostly just that: &lt;em&gt;word&lt;/em&gt; vectors. But what about singular concepts whose names happen to contain a space? If we split our text into individual words, it is difficult to reconstruct these concepts. The team at &lt;a href="http://conceptnet.io/"&gt;ConceptNet&lt;/a&gt; have tackled exactly this problem, combining several existing word vector sets and &lt;a href="https://arxiv.org/pdf/1604.01692.pdf"&gt;augmenting them&lt;/a&gt; with their own &lt;a href="https://en.wikipedia.org/wiki/Semantic_network"&gt;semantic network&lt;/a&gt; (a mapping of how concepts relate to each other). They also attempt to purposely &lt;a href="https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/"&gt;de-bias their vectors&lt;/a&gt; so as not to encode relationships that are racist, sexist, etc. Their &amp;#8220;concept vectors&amp;#8221; are humorously called &lt;a href="https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch-a-new-name-for-the-best-word-embeddings-you-can-download/"&gt;Conceptnet Numberbatch&lt;/a&gt;. For this post I used the English-only subset, which contains 416,410&amp;nbsp;vectors.&lt;/p&gt;
&lt;p&gt;By using the Conceptnet Numberbatch vectors we can compare not just words, but also concepts with each other for similarity. Of course, some words have a double meaning, and one similarity measure will never capture this. As an example, I sorted the concept vectors for similarity to &amp;#8220;orange&amp;#8221;. The top few are mostly various subtypes of the citrus fruit, but at number 9 we get &amp;#8220;yellowred&amp;#8221;, which is clearly related to the&amp;nbsp;color:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Top 20 most similar concepts to &amp;quot;orange&amp;quot;:
   1. orange_tree          Similarity: 0.96476
   2. orange_river         Similarity: 0.94728
   3. sweet_orange         Similarity: 0.93845
   4. citrus_sinensis      Similarity: 0.93466
   5. valencia_orange      Similarity: 0.92221
   6. sour_orange          Similarity: 0.91260
   7. citrus_aurantium     Similarity: 0.88945
   8. bitter_orange        Similarity: 0.88880
   9. yellowred            Similarity: 0.86298
  10. seville_orange       Similarity: 0.85275
  11. temple_orange        Similarity: 0.85226
  12. bigarade             Similarity: 0.84110
  13. iyokan               Similarity: 0.83425
  14. orangish             Similarity: 0.83299
  15. secondary_colour     Similarity: 0.82235
  16. tropaeolin           Similarity: 0.79860
  17. orangelo             Similarity: 0.79785
  18. orange_coloured      Similarity: 0.77347
  19. tangor               Similarity: 0.77011
  20. orangequat           Similarity: 0.76962
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;There are ways around this issue too, for example by &lt;a href="https://explosion.ai/blog/sense2vec-with-spacy"&gt;including word senses&lt;/a&gt; in the vectors, but this is overkill for our&amp;nbsp;purposes.&lt;/p&gt;
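&lt;p&gt;For readers who want to reproduce a ranking like the ones in this section, here is a minimal sketch. It assumes cosine similarity as the measure (the post does not say which one was used) and runs on made-up three-dimensional vectors instead of the real Numberbatch file. The names &lt;code&gt;cosine_similarity&lt;/code&gt;, &lt;code&gt;most_similar&lt;/code&gt; and &lt;code&gt;toy_vectors&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, n=20):
    """Return the n concepts closest to `query`, best first."""
    target = vectors[query]
    scores = [
        (name, cosine_similarity(target, vec))
        for name, vec in vectors.items()
        if name != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Made-up 3-dimensional vectors, for illustration only (assumption:
# the real vectors are much longer and come from the Numberbatch download).
toy_vectors = {
    "orange": [0.9, 0.8, 0.1],
    "sweet_orange": [1.0, 0.7, 0.0],
    "yellowred": [0.5, 0.9, 0.3],
    "canada": [0.0, 0.1, 1.0],
}

ranked = most_similar("orange", toy_vectors, n=3)
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank:4d}. {name:20s} Similarity: {score:.5f}")
```

&lt;p&gt;With the full vector set, the plain Python loops would be slow; a vectorized library would be the more practical choice there.&lt;/p&gt;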
&lt;p&gt;Curious about my homeland, I tried out the same thing with &amp;#8220;Canada&amp;#8221;. Hilariously, both &amp;#8220;America&amp;#8217;s hat&amp;#8221; and &amp;#8220;Soviet Canuckistan&amp;#8221; make the top&amp;nbsp;20:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;Top 20 most similar concepts to &amp;quot;canada&amp;quot;:
   1. canada_west          Similarity: 0.97635
   2. great_white_north    Similarity: 0.96604
   3. laurentian_highlands Similarity: 0.96598
   4. arctic_archipelago   Similarity: 0.96598
   5. hydrofield           Similarity: 0.96262
   6. ookpik               Similarity: 0.96262
   7. canada_east          Similarity: 0.96223
   8. central_provinces    Similarity: 0.95273
   9. canadianist          Similarity: 0.94114
  10. central_canada       Similarity: 0.94038
  11. upper_canada         Similarity: 0.93867
  12. lower_canada         Similarity: 0.93571
  13. canadian_shield      Similarity: 0.93504
  14. canadas              Similarity: 0.92199
  15. america&amp;#39;s_hat        Similarity: 0.91242
  16. hudson_bay           Similarity: 0.91084
  17. canadian_studies     Similarity: 0.90770
  18. in_north_america     Similarity: 0.90715
  19. soviet_canuckistan   Similarity: 0.90083
  20. eastern_canada       Similarity: 0.90079
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Note that the average similarity score in Canada&amp;#8217;s top 20 is much higher than in orange&amp;#8217;s. This is likely because &amp;#8220;orange&amp;#8221; is used in a wider variety of contexts than &amp;#8220;Canada&amp;#8221;, which dilutes its similarity to any given&amp;nbsp;concept.&lt;/p&gt;
&lt;p&gt;&lt;a name="countries-of-the-world"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Countries of the&amp;nbsp;world&lt;/h3&gt;
&lt;p&gt;It turns out that getting standardized data about the countries of the world is surprisingly tricky. There are various definitions of attributes, character encoding issues, countries not recognizing each other, etc. Luckily, Mohammed Le Doze has painstakingly compiled a &lt;a href="https://mledoze.github.io/countries/"&gt;high quality set of country data&lt;/a&gt; in &lt;a href="https://github.com/mledoze/countries/tree/master/dist"&gt;various formats&lt;/a&gt;, allowing me to offload responsibility for any political&amp;nbsp;sensitivities.&lt;/p&gt;
&lt;p&gt;As a starting point, we can plot the locations of the countries as we expect to see them. The positions are given in latitude and longitude, but through the magic of &lt;a href="https://xkcd.com/977/"&gt;map projections&lt;/a&gt; (conveniently provided in this &lt;a href="http://wiki.openstreetmap.org/wiki/Mercator#Python_implementation"&gt;code snippet&lt;/a&gt; from OpenStreetMap), we can plot the positions on a flat plane. I creatively decided to represent the countries as blobs with size approximately related to the land area of the country, and immediately gained respect for people who are actually good at making&amp;nbsp;maps:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The worst world map you ever did see." src="/images/world_map_merc.png"&gt;&lt;/p&gt;
&lt;p&gt;Nailed&amp;nbsp;it.&lt;/p&gt;
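&lt;p&gt;The projection step itself is short. The sketch below uses the simple spherical Mercator formula, which is an assumption on my part and not necessarily the same variant as the OpenStreetMap snippet linked above. The function name &lt;code&gt;mercator_xy&lt;/code&gt; and the example coordinates are hypothetical.&lt;/p&gt;

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius, in metres


def mercator_xy(lat_deg, lon_deg, radius=EARTH_RADIUS_M):
    """Project latitude/longitude in degrees onto a plane (spherical Mercator)."""
    # Clamp the latitude: the projection diverges towards the poles.
    lat_deg = max(min(lat_deg, 89.5), -89.5)
    x = radius * math.radians(lon_deg)
    y = radius * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Rough (latitude, longitude) positions, for illustration only.
examples = {"Canada": (60.0, -95.0), "Netherlands": (52.5, 5.75)}
for name, (lat, lon) in examples.items():
    x, y = mercator_xy(lat, lon)
    print(f"{name:12s} x = {x:14.1f} m   y = {y:14.1f} m")
```

&lt;p&gt;Note that &lt;code&gt;y&lt;/code&gt; grows northwards here, while screen coordinates usually grow downwards, so a plot on a canvas or &lt;span class="caps"&gt;SVG&lt;/span&gt; would need the vertical axis flipped and both axes rescaled.&lt;/p&gt;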
&lt;p&gt;&lt;a name="countries-as-concept-vectors"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Countries as concept&amp;nbsp;vectors&lt;/h3&gt;
&lt;p&gt;The countries dataset provides us with a &amp;#8220;common English name&amp;#8221; per country. We use this name to associate a concept vector with each country in the set, to the extent possible. Some entries, such as the French Southern and Antarctic Lands and the British Indian Ocean Territory, were not in Conceptnet Numberbatch. Sorry about&amp;nbsp;that.&lt;/p&gt;
&lt;p&gt;Considering countries as concepts with associated vectors reveals some interesting structure. Using &lt;a href="https://en.wikipedia.org/wiki/Hierarchical_clustering"&gt;hierarchical clustering&lt;/a&gt;, we can sort the countries such that more similar countries are next to each other. Plotting the similarities as a heatmap visualizes the clusters of countries that have especially high similarity to each other (and it looks really&amp;nbsp;cool):&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/countries_similarity_world_annotated2.png"&gt;&lt;img alt="clustered country similarity heatmap" src="/images/countries_similarity_world_annotated2.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I realize this is big and hard to read; you can click on the image to get the full sized one where the country names are actually&amp;nbsp;legible. &lt;/p&gt;
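&lt;p&gt;To give a feel for how clustering can produce such an ordering, here is a small pure-Python sketch. It assumes greedy average-linkage agglomerative clustering on a toy similarity matrix; the post does not state which linkage or library was actually used, and the function name &lt;code&gt;cluster_order&lt;/code&gt; and the numbers are made up.&lt;/p&gt;

```python
def cluster_order(names, sim):
    """Order items so that similar ones end up next to each other.

    Greedy average-linkage agglomerative clustering: start with one cluster
    per item, then repeatedly merge the two clusters with the highest mean
    pairwise similarity. Concatenating the members on every merge yields an
    ordering for the rows and columns of a heatmap.
    """
    if not names:
        return []
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        # Mean similarity over all cross-cluster pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        _, p, q = max(pairs)
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return [names[i] for i in clusters[0]]


# Toy symmetric similarity matrix (made-up numbers, for illustration only).
names = ["Norway", "Qatar", "Sweden", "Kuwait"]
sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
]

print(cluster_order(names, sim))
```

&lt;p&gt;Reordering both the rows and the columns of the similarity matrix by the returned list is what makes the blocks along the diagonal appear. This naive version recomputes every linkage on each pass, so for anything much larger than a few hundred items a dedicated library would be the better tool.&lt;/p&gt;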
&lt;p&gt;The squares along the diagonal are the main clusters of similar countries which were identified by the clustering algorithm. I&amp;#8217;ve tried to give semi-appropriate names to a few of them. We see for example the Arabian Peninsula (Saudi Arabia, Kuwait, Qatar, the &lt;span class="caps"&gt;UAE&lt;/span&gt;, Bahrain, and Oman), groups of nearby islands, and the Group Which Shall Not Be Named (Iraq, Jordan, Israel, Palestine, Lebanon, Iran, Syria, Egypt, and Libya). Thanks to our use of hierarchical clustering, we also see groups with substructure, like the Germanic countries, which visibly include the Nordic countries and&amp;nbsp;BeNeLux.&lt;/p&gt;
&lt;p&gt;On the off-diagonal we also see some structure. These are mostly primary clusters which have strong inter-group relationships. I&amp;#8217;ve labeled three such groups: the Pacific and Caribbean Islands are quite strongly associated with each other, Spain/Portugal are similar to their (mainly South American) colonies, and many of the Arabic-speaking countries are also associated with each&amp;nbsp;other.&lt;/p&gt;
&lt;p&gt;It&amp;#8217;s interesting to note that the primary clusters seem to be mostly determined by geographical proximity, while the off-diagonal groups seem to be more influenced by&amp;nbsp;culture/language.&lt;/p&gt;
&lt;p&gt;Because the entire world is a bit hard to take in all at once, we will zoom in on Europe for a little bit. We end up&amp;nbsp;with:&lt;/p&gt;
&lt;p&gt;&lt;a href="/images/countries_similarity_eur.svg"&gt;&lt;img alt="clustered europe similarity heatmap" src="/images/countries_similarity_eur.svg"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Now it seems that the Germanic world is much less pronounced, the Åland Islands have been split out from the rest of the Nordics, and the Channel Islands are basically separated entirely from the rest. Sorry about that, island&amp;nbsp;nations.&lt;/p&gt;
&lt;p&gt;&lt;a name="remapping"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Remapping the&amp;nbsp;world&lt;/h3&gt;
&lt;p&gt;While the heatmaps are interesting, we started out on a mission to remap the world, and that&amp;#8217;s what we&amp;#8217;re going to do. To figure out the new placement of the countries, we visualize them as a &lt;a href="https://en.wikipedia.org/wiki/Graph_theory"&gt;network graph&lt;/a&gt;, simulated with &lt;a href="https://d3js.org/"&gt;d3.js&lt;/a&gt;. The nodes are our country blobs, and they will be connected by springy edges. The edges act with a force on the nodes, pulling more similar countries closer together, and pushing dissimilar ones apart. By initializing the simulation with the countries in their original position as plotted on the map above, we hope to end up with a layout that reflects geography as much as possible, so that we&amp;#8217;re sure any major changes are the result of taking into account the concept vector similarity. We&amp;#8217;ll begin with Europe, and connect each country to its top 10 most similar neighbors (to keep the complexity of the simulation down). Thanks to the magic of d3, we can make the map interactive, so if you aren&amp;#8217;t happy with where your country ended up, you can try to drag it somewhere else and see which other countries get pulled along&amp;nbsp;too.&lt;/p&gt;
&lt;p&gt;Tip: If you&amp;#8217;re on mobile, this looks nicer in landscape&amp;nbsp;mode!&lt;/p&gt;
&lt;div align="center"&gt;
&lt;svg id="europe" width="600" height="500"&gt;&lt;/svg&gt;
&lt;/div&gt;

&lt;p&gt;Now we&amp;#8217;re talking! The simulation isn&amp;#8217;t totally deterministic, so you might see something different, but when I run it I see the Faroes and Iceland get pulled into the interior of Northern Europe thanks to strong connections with the (rest of the) Nordics. The &lt;span class="caps"&gt;UK&lt;/span&gt; moves even further outside mainland Europe. France gains Ireland as a neighbour, but not because of any particularly strong relationship. Rather, Ireland has connections to fellow island nations Iceland (North), Malta (South), and the &lt;span class="caps"&gt;UK&lt;/span&gt; (West), so in balancing these it just happens to end up near France. The Vatican is ejected from mainland Europe due to its surprisingly weak connection to Italy. The Baltics move southeast to be closer to the former Yugoslav countries, which also causes Poland to be pulled south of&amp;nbsp;Austria. &lt;/p&gt;
&lt;p&gt;It&amp;#8217;s extremely addicting to play with this simulation and imagine the alternate histories that could have resulted. The Vatican Armada? The Polish-Faroese Empire? Delicious Irish cuisine? We will never&amp;nbsp;know.&lt;/p&gt;
&lt;p&gt;Just like with the heatmaps, the network graph for the entire world is a bit overwhelming. We simplify the simulation further by connecting each country to only its top 3 neighbors, and make it a bit taller to give them more room to spread out (feel free to drag them apart yourself to get an even better&amp;nbsp;view):&lt;/p&gt;
&lt;div align="center"&gt;
&lt;svg id="world" width="600" height="800"&gt;&lt;/svg&gt;
&lt;/div&gt;

&lt;p&gt;Dragging the countries around gives a good feel for who is connected to who. It&amp;#8217;s particularly interesting when countries from different continents are closely connected. Portugal and Spain are pulled into the midst of their South American colonies, which form an unusually rigid block. The &lt;span class="caps"&gt;US&lt;/span&gt; finds an unlikely neighbor in Georgia (undoubtedly caused by the &lt;span class="caps"&gt;US&lt;/span&gt; state of the same name). The Anglosphere makes its first appearance, linking the &lt;span class="caps"&gt;US&lt;/span&gt;, Canada, the &lt;span class="caps"&gt;UK&lt;/span&gt;, Australia, and New Zealand (and, dare I say, Ireland). Turkey and Central Asia are disconnected from the rest of Asia, preferring the company of Russia. Canada and the &lt;span class="caps"&gt;US&lt;/span&gt; decouple from the rest of the Americas. Chad is kept at a distance from the rest of Africa, probably because of its common usage as an English name diluting the strength of its connections (the &amp;#8220;orange&amp;#8221; effect), an issue which also plagues Réunion. The Falklands gain great strategic importance, forming a sole link between the Pacific Islands and the&amp;nbsp;Caribbean. &lt;/p&gt;
&lt;p&gt;Surprisingly, the entire world seems to remain connected (even if the connecting edges are so weak as to not be visible). Our simulation of so few connections made it possible for 3 or 4 countries to form an exclusive block, disconnected from the rest of the world, but as far as I can tell this did not&amp;nbsp;happen.&lt;/p&gt;
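&lt;p&gt;Whether the graph really stays in one piece can also be checked directly. The sketch below builds top-k links from a toy similarity matrix and counts connected components. The link layout (index-based &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;target&lt;/code&gt; plus a &lt;code&gt;strength&lt;/code&gt;) is a guess at what the graph files for this page look like, and &lt;code&gt;top_k_links&lt;/code&gt; and &lt;code&gt;count_components&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from collections import deque


def top_k_links(names, sim, k=3):
    """Link every item to its k most similar neighbours.

    Links use list indices for `source` and `target` and carry the
    similarity as `strength` (an assumed layout, see the text above).
    """
    links = []
    for i in range(len(names)):
        others = [j for j in range(len(names)) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        for j in others[:k]:
            links.append({"source": i, "target": j, "strength": sim[i][j]})
    return links


def count_components(n_nodes, links):
    """Count connected components, treating the links as undirected."""
    neighbours = {i: set() for i in range(n_nodes)}
    for link in links:
        neighbours[link["source"]].add(link["target"])
        neighbours[link["target"]].add(link["source"])
    seen = set()
    components = 0
    for start in range(n_nodes):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
    return components


# Toy symmetric similarity matrix (made-up numbers, for illustration only).
names = ["Norway", "Qatar", "Sweden", "Kuwait"]
sim = [
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
]

for k in (1, 2):
    links = top_k_links(names, sim, k=k)
    print(k, count_components(len(names), links))
```

&lt;p&gt;In this toy data, one link per country leaves two exclusive blocks, while two links per country are enough to join them. A component count of 1 on the real graph would confirm that no group of countries has split off.&lt;/p&gt;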
&lt;p&gt;&lt;a name="mapping-our-bias"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Mapping our&amp;nbsp;bias?&lt;/h3&gt;
&lt;p&gt;We have spent quite some time considering the new (and old) groupings that show up when we consider the countries as concept vectors. However, it&amp;#8217;s also important to consider what is not present in the data, and how this may reveal underlying confounding&amp;nbsp;factors.&lt;/p&gt;
&lt;p&gt;Although a loosely grouped Anglosphere formed in the world network graph, the connections between its members were mostly too weak to show up in the clustering we performed to create the heatmap. Conversely, Africa appeared in the heatmap as one giant block, despite great diversity across the continent (Somalia, Tunisia, and Zimbabwe in the same&amp;nbsp;group?).&lt;/p&gt;
&lt;p&gt;My hypothesis is that this is a result of using only the English subset of Conceptnet Numberbatch&amp;#8217;s concept vectors. We humans have a tendency to see differences between members of our &amp;#8220;own group&amp;#8221; (whatever that means), while seeing other groups as one cohesive unit. Wikipedia tells me this concept is known as &lt;a href="https://en.wikipedia.org/wiki/Out-group_homogeneity"&gt;out-group homogeneity&lt;/a&gt;, which can lead to stereotyping. In the &lt;em&gt;English&lt;/em&gt; concept vectors (which were trained on English text gathered from the Internet), the Anglosphere countries aren&amp;#8217;t frequently used in the exact same context, because they are more often perceived as the in-group by people who are speaking English. This (and not Brexit) would also explain the &lt;span class="caps"&gt;UK&lt;/span&gt; being pushed out of Europe and relatively far from Ireland in the Europe network graph. All of this happens in spite of ConceptNet actively working to remove bias from their vectors, showing just how tricky this issue really&amp;nbsp;is.&lt;/p&gt;
&lt;p&gt;It would be interesting to confirm this hypothesis by trying this experiment out with other language vectors, but this post is already long enough, so I will leave that for future exploration. For now, suffice it to say that unexamined factors in your data may lead to unexpected results. If you find yourself making sweeping generalizations or casually reporting on &amp;#8220;Soviet Canuckistan&amp;#8221;, it may be time to reevaluate your&amp;nbsp;assumptions.&lt;/p&gt;
&lt;p&gt;Did you find any other interesting (missing) connections? Could I have improved the graphs (this is my first foray into d3)? Let me know&amp;nbsp;below!&lt;/p&gt;
&lt;script src="https://d3js.org/d3.v4.min.js"&gt;&lt;/script&gt;

&lt;script&gt;

var cmap = {"Americas":"#3aa500", "Asia":"#881166", "Africa":"#e34a6f", 
            "Europe":"#0c5fd3", "Oceania":"#e8d214"};

var width_art = Math.max(parseFloat(d3.select("article").style("width").slice(0,-2)),
                         375);

var svg_eur = d3.select("#europe");

svg_eur.attr("width", width_art + "px");

var width_eur = +svg_eur.attr("width"),
    height_eur = +svg_eur.attr("height");

var svg_wor = d3.select("#world");

svg_wor.attr("width", width_art + "px");

var width_wor = +svg_wor.attr("width"),
    height_wor = +svg_wor.attr("height");


var color = d3.scaleOrdinal(d3.schemeCategory10);

var simulation_eur = d3.forceSimulation()
    .force("charge", d3.forceManyBody().strength(-75))
    .force("collide", d3.forceCollide().radius(rad))
    .force("center", d3.forceCenter(width_art / 2, height_eur / 2))
    .velocityDecay(0.7);

var simulation_wor = d3.forceSimulation()
    .force("charge", d3.forceManyBody().strength(-5))
    .force("collide", d3.forceCollide().radius(rad))
    .force("center", d3.forceCenter(width_art / 2, height_wor / 2))
    .velocityDecay(0.7)
    .alphaDecay(0.01);


function rad(d){
    return Math.max(2, 25* Math.pow(d.area, 1/3));
};

d3.json("/static/europe.json", function(error, graph) {
  if (error) throw error;


  var link = svg_eur.append("g")
      .attr("class", "links")
    .selectAll("line")
    .data(graph.links)
    .enter().append("line")
      .attr("stroke-width", 2)
      .attr("stroke", "#999")
      .attr("stroke-opacity", function(d) {return 0.5*Math.pow(d.strength, 2); });


  // function(d) { return color(d.region); })

  var node = svg_eur.append("g")
      .attr("class", "nodes")
      .selectAll(".node")
      .data(graph.nodes)
      .enter()
      .append("g")
        .attr("class","node")
        .call(d3.drag()
          .on("start", dragstarted_eur)
          .on("drag", dragged_eur)
          .on("end", dragended_eur));

    node.append("circle")
        .attr("r", rad)
        .attr("fill", function(d) { return cmap[d.region]; })


    node.attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });

  //   .selectAll("circle")
  //   .data(graph.nodes)
  //   .enter().append("circle")
  //     .attr("cx", function(d) {return d.x; })
  //     .attr("cy", function(d) {return d.y; })
  //     .attr("r", rad)
  //     .attr("fill", "red")
  //     .call(d3.drag()
  //         .on("start", dragstarted)
  //         .on("drag", dragged)
  //         .on("end", dragended));

  node.append("text")
      .text(function(d) { return d.commonName; })
        .attr("alignment-baseline","middle")
        .attr("text-anchor", "start")  // "right" is not a valid value; allowed are start, middle, end
        .attr("dx", rad)
        .attr("font-family","sans-serif")
        .attr("font-size","9pt")
        .attr("opacity","0.5")

  simulation_eur
      .nodes(graph.nodes)
      .on("tick", ticked);

  simulation_eur.force("link", d3.forceLink()
                             .links(graph.links)
                             .strength(function(d) {return 5*(Math.pow(d.strength, 4)); })
                             .distance(function(d) {return 200*(1-Math.pow(d.strength, 2)); }));

    function boxx(d){
        return Math.min(Math.max(rad(d), d.x), width_art-rad(d)-75);
    }

    function boxy(d){
        return Math.min(Math.max(rad(d), d.y), height_eur-rad(d)-5);
    }

  function ticked() {
    link
        .attr("x1", function(d) { return boxx(d.source); })
        .attr("y1", function(d) { return boxy(d.source); })
        .attr("x2", function(d) { return boxx(d.target); })
        .attr("y2", function(d) { return boxy(d.target); });

    node
        // The transform alone positions each node group; cx/cy do nothing on a group
        // element, and the old accessors returned nothing anyway.
        .attr("transform", function(d) { return "translate(" + boxx(d) + "," + boxy(d) + ")"; });
  }
});




d3.json("/static/world.json", function(error, graph) {
  if (error) throw error;

  var link = svg_wor.append("g")
      .attr("class", "links")
    .selectAll("line")
    .data(graph.links)
    .enter().append("line")
      .attr("stroke-width", 2)
      .attr("stroke", "#999")
      .attr("stroke-opacity", function(d) {return 0.5*Math.pow(d.strength, 2); });


  // function(d) { return color(d.region); })

  var node = svg_wor.append("g")
      .attr("class", "nodes")
      .selectAll(".node")
      .data(graph.nodes)
      .enter()
      .append("g")
        .attr("class","node")
        .call(d3.drag()
          .on("start", dragstarted_wor)
          .on("drag", dragged_wor)
          .on("end", dragended_wor));

    node.append("circle")
        .attr("r", rad)
        .attr("fill", function(d) { return cmap[d.region]; })


    node.attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });

  //   .selectAll("circle")
  //   .data(graph.nodes)
  //   .enter().append("circle")
  //     .attr("cx", function(d) {return d.x; })
  //     .attr("cy", function(d) {return d.y; })
  //     .attr("r", rad)
  //     .attr("fill", "red")
  //     .call(d3.drag()
  //         .on("start", dragstarted)
  //         .on("drag", dragged)
  //         .on("end", dragended));

  node.append("text")
      .text(function(d) { return d.commonName; })
        .attr("alignment-baseline","middle")
        .attr("text-anchor", "start")  // "right" is not a valid value; allowed are start, middle, end
        .attr("dx", rad)
        .attr("font-family","sans-serif")
        .attr("font-size","9pt")
        .attr("opacity","0.5")

  simulation_wor
      .nodes(graph.nodes)
      .on("tick", ticked);

  simulation_wor.force("link", d3.forceLink()
                             .links(graph.links)
                             .strength(function(d) {return 7*(Math.pow(d.strength, 4)); })
                             .distance(function(d) {return 150*(1-Math.pow(d.strength, 2)); }));

    function boxx(d){
        return Math.min(Math.max(rad(d), d.x), width_art-rad(d)-75);
    }

    function boxy(d){
        return Math.min(Math.max(rad(d), d.y), height_wor-rad(d)-5);
    }

  function ticked() {
    link
        .attr("x1", function(d) { return boxx(d.source); })
        .attr("y1", function(d) { return boxy(d.source); })
        .attr("x2", function(d) { return boxx(d.target); })
        .attr("y2", function(d) { return boxy(d.target); });

    node
        // The transform alone positions each node group; cx/cy do nothing on a group
        // element, and the old accessors returned nothing anyway.
        .attr("transform", function(d) { return "translate(" + boxx(d) + "," + boxy(d) + ")"; });
  }
});

d3.select(window).on("resize", resize);


function resize() {
    width_art = Math.max(parseFloat(d3.select("article").style("width").slice(0,-2)),
                         375);
    svg_eur.attr("width", width_art);
    svg_wor.attr("width", width_art);
    simulation_eur.alpha(0.5).force("center", d3.forceCenter(width_art / 2, height_eur / 2)).restart();
    simulation_wor.alpha(0.5).force("center", d3.forceCenter(width_art / 2, height_wor / 2)).restart();

  }


function dragstarted_eur(d) {
  if (!d3.event.active) simulation_eur.alphaTarget(0.3).restart();
  d.fx = d.x;
  d.fy = d.y;
}

function dragged_eur(d) {
  d.fx = d3.event.x;
  d.fy = d3.event.y;
}

function dragended_eur(d) {
  if (!d3.event.active) simulation_eur.alphaTarget(0);
  d.fx = null;
  d.fy = null;
}

function dragstarted_wor(d) {
  if (!d3.event.active) simulation_wor.alphaTarget(0.3).restart();
  d.fx = d.x;
  d.fy = d.y;
}

function dragged_wor(d) {
  d.fx = d3.event.x;
  d.fy = d3.event.y;
}

function dragended_wor(d) {
  if (!d3.event.active) simulation_wor.alphaTarget(0).alphaDecay(0.0228);
  d.fx = null;
  d.fy = null;
}

&lt;/script&gt;</content><category term="posts"></category><category term="natural language"></category><category term="dataviz"></category><category term="d3.js"></category></entry><entry><title>Cleaner Spark UDF definitions with a little decorator</title><link href="https://johnpaton.net/posts/clean-spark-udfs/" rel="alternate"></link><published>2017-11-16T19:48:11-01:00</published><updated>2017-11-16T19:48:11-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-11-16:/posts/clean-spark-udfs/</id><summary type="html">&lt;p&gt;One of the handy features that makes (Py)Spark more flexible than database tools like Hive even for just transforming tabular data is the ease of creating User Defined Functions (UDFs). However, one thing that still remains a little annoying is that you have to separately define a function and declare it as a &lt;span class="caps"&gt;UDF&lt;/span&gt;. With four lines of code you can clean those definitions right&amp;nbsp;up.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;Update:&lt;/em&gt; It turns out the functionality described here is actually standard, and I just recreated an existing feature! Embarrassing. This is why you always read the docs! Thanks to Enrico Rotundo for pointing this out. Nonetheless, knowing how to define your own decorators is useful if you want to e.g. propagate the docstring using &lt;code&gt;functools.wraps&lt;/code&gt;, so I&amp;#8217;ll leave this here for further&amp;nbsp;exploration.&lt;/p&gt;
&lt;p&gt;Alternate title: &lt;em&gt;Clean up your Spark jobs with this one weird trick! Apache will hate&amp;nbsp;you!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One of the handy features that makes (Py)Spark more flexible than database tools like Hive even for just transforming tabular data is the ease of creating User Defined Functions (UDFs). Although this is &lt;a href="https://community.hortonworks.com/articles/72414/how-to-create-a-custom-udf-for-hive-using-python.html"&gt;also possible in Hive directly&lt;/a&gt;, the ability to define and call UDFs directly in the Python code of your job makes life a lot easier and provides context to what you&amp;#8217;re doing. However, one thing that still remains a little annoying is separately defining a Python function and then having to declare it as a Spark &lt;span class="caps"&gt;UDF&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Consider the trivial example of incrementing all the values in a Spark DataFrame column by 1. We begin by writing the function, and then make a &amp;#8220;&lt;span class="caps"&gt;UDF&lt;/span&gt;-ified&amp;#8221; version that we can actually use in&amp;nbsp;Spark.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;f&lt;/span&gt;

&lt;span class="c1"&gt;# define the function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# make the udf version&lt;/span&gt;
&lt;span class="n"&gt;increment_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;By wrapping our function as a &lt;span class="caps"&gt;UDF&lt;/span&gt;, we tell Spark to apply it to every value of whatever DataFrame column we pass to it, and Spark takes care of distributing the execution across the cluster when we submit our job. We can now use our newly declared &lt;code&gt;increment_udf&lt;/code&gt; to increment all the values in a&amp;nbsp;column:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# increment column &amp;#39;col&amp;#39; by 1&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;col_plus_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;increment_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;col&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;Hold up,&amp;#8221; you say, &amp;#8220;you&amp;#8217;re making it unnecessarily difficult for yourself. Just use the &lt;code&gt;f.udf&lt;/code&gt; as a &lt;a href="https://www.python.org/dev/peps/pep-0318/"&gt;decorator&lt;/a&gt;!&amp;#8221; That is indeed an attractive option. The code then condenses&amp;nbsp;to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@f.udf&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;col_plus_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;col&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This is a lot better looking, but it comes at the cost of flexibility. The function &lt;code&gt;f.udf&lt;/code&gt; optionally takes as a second argument the type of the &lt;span class="caps"&gt;UDF&lt;/span&gt;&amp;#8217;s output (in terms of the &lt;code&gt;pyspark.sql.types&lt;/code&gt; &lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types"&gt;types&lt;/a&gt;). Spark will by default convert &lt;span class="caps"&gt;UDF&lt;/span&gt; outputs to strings, which can be a hassle, especially for complex data types (like arrays), or when the precision is important (float vs. double). To avoid this stringy fate, we have to return to our old&amp;nbsp;pattern:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;t&lt;/span&gt;

&lt;span class="c1"&gt;# define the function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# make a typed udf version&lt;/span&gt;
&lt;span class="n"&gt;increment_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;To get back to our nice, clean decorator syntax, we can write a tiny but useful function to generate a &lt;span class="caps"&gt;UDF&lt;/span&gt; decorator that will cast the output to the appropriate&amp;nbsp;type:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;typed_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_typed_udf_wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_typed_udf_wrapper&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The function &lt;code&gt;typed_udf&lt;/code&gt; returns a new &lt;span class="caps"&gt;UDF&lt;/span&gt; decorator with the specified return type. We can think of it as a decorator that accepts an argument. For a more in-depth overview of this pattern and decorators in general, see &lt;a href="https://www.thecodeship.com/patterns/guide-to-python-function-decorators/"&gt;this blog post from The Code Ship&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now we once again have the nice, clean version of the code, with the added legibility bonus of the &lt;span class="caps"&gt;UDF&lt;/span&gt;&amp;#8217;s return type being visible right beside its&amp;nbsp;definition:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nd"&gt;@typed_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;col_plus_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;col&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
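&lt;p&gt;To see what the decorator syntax is doing, here is a minimal sketch of the same pattern in plain Python, with no Spark involved. The names &lt;code&gt;cast_result&lt;/code&gt; and &lt;code&gt;double&lt;/code&gt; are made up for illustration, and casting with a builtin type only stands in for what &lt;code&gt;f.udf&lt;/code&gt; does with a Spark&amp;nbsp;type:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;def cast_result(return_type):
    """Hypothetical stand-in for typed_udf: build a decorator that casts results."""
    def _cast_result_wrapper(func):
        def _inner(*args, **kwargs):
            # call the original function, then cast its output
            return return_type(func(*args, **kwargs))
        return _inner
    return _cast_result_wrapper

# decorator syntax: cast_result(int) is called first and returns the decorator
@cast_result(int)
def increment(x):
    return x + 1

# the same thing written out by hand, without the @ syntax
def double(x):
    return x * 2

double_as_float = cast_result(float)(double)

print(increment(1.5))      # the result 2.5 is cast to an int
print(double_as_float(2))  # the result 4 is cast to a float
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The outer function only exists to capture &lt;code&gt;return_type&lt;/code&gt;; the function it returns is the actual&amp;nbsp;decorator.&lt;/p&gt;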


&lt;p&gt;Using a decorator instead of making two versions of every function isn&amp;#8217;t really necessary in this simple example, but if you are defining 20 UDFs your namespace will get awfully cluttered and it&amp;#8217;ll become harder to track what&amp;#8217;s going on. With four lines of code you can bring sanity back to your function naming&amp;nbsp;scheme.&lt;/p&gt;
&lt;p&gt;Finally, the code for &lt;span class="caps"&gt;DIY&lt;/span&gt; decorators is notoriously difficult to read, so if you&amp;#8217;re going to copy-paste, here is the snippet with a&amp;nbsp;docstring:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;f&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;typed_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;Make a UDF decorator with the given return type.&lt;/span&gt;

&lt;span class="sd"&gt;    Example usage:&lt;/span&gt;

&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; @typed_udf(t.IntegerType())&lt;/span&gt;
&lt;span class="sd"&gt;    ... def increment(x):&lt;/span&gt;
&lt;span class="sd"&gt;    ...     return x + 1&lt;/span&gt;
&lt;span class="sd"&gt;    ...&lt;/span&gt;
&lt;span class="sd"&gt;    &amp;gt;&amp;gt;&amp;gt; df = df.withColumn(&amp;#39;col_plus_1&amp;#39;, increment(&amp;#39;col&amp;#39;))&lt;/span&gt;

&lt;span class="sd"&gt;    Args:&lt;/span&gt;
&lt;span class="sd"&gt;        return_type (pyspark.sql.types type): the type that will be&lt;/span&gt;
&lt;span class="sd"&gt;            output by the function being decorated&lt;/span&gt;

&lt;span class="sd"&gt;    Returns:&lt;/span&gt;
&lt;span class="sd"&gt;        function: Typed UDF decorator&lt;/span&gt;

&lt;span class="sd"&gt;    &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_typed_udf_wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_typed_udf_wrapper&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;</content><category term="posts"></category><category term="spark"></category><category term="python"></category><category term="data"></category><category term="snippets"></category></entry><entry><title>Forward-fill missing data in Spark</title><link href="https://johnpaton.net/posts/forward-fill-spark/" rel="alternate"></link><published>2017-09-22T20:00:00-01:00</published><updated>2017-09-22T20:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-09-22:/posts/forward-fill-spark/</id><summary type="html">&lt;p&gt;Since I&amp;#8217;ve started using Apache Spark, one of the frequent annoyances I&amp;#8217;ve come up against is having an idea that would be very easy to implement in Pandas, but turns out to require a really verbose workaround in Spark. A recent example of this is doing a forward fill (filling &lt;code&gt;null&lt;/code&gt; values with the last known non-&lt;code&gt;null&lt;/code&gt; value).&lt;/p&gt;</summary><content type="html">&lt;p&gt;Since I&amp;#8217;ve started using Apache Spark, one of the frequent annoyances I&amp;#8217;ve come up against is having an idea that would be very easy to implement in Pandas, but turns out to require a really verbose workaround in Spark. Such is the price of scalability. But that does make it extra satisfying when I &lt;em&gt;do&lt;/em&gt; manage to get done what I&amp;#8217;m trying to&amp;nbsp;do. &lt;/p&gt;
&lt;p&gt;A recent example of this is doing a forward fill: filling &lt;code&gt;null&lt;/code&gt; values with the last known non-&lt;code&gt;null&lt;/code&gt; value, leaving leading &lt;code&gt;null&lt;/code&gt;s alone. In Pandas, this is very easy. I used it in my &lt;a href="/posts/periods-since-time-series-events/"&gt;recent post&lt;/a&gt; about efficiently finding the time since the last event in a time series. This post is basically an explanation of &lt;a href="https://stackoverflow.com/a/44953341"&gt;this StackOverflow answer&lt;/a&gt; on doing forward fills with&amp;nbsp;PySpark. &lt;/p&gt;
&lt;p&gt;Imagine you are measuring the temperature in two spots in your back yard, one in the shade and one in the sun. You record a measurement every half hour so you can compare them. However, you got the cheapest possible digital thermometer, so a lot of the measurements end up missing. Your data may look something like&amp;nbsp;this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;temperature&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2017-09-09 12:00:00&lt;/td&gt;
      &lt;td&gt;shade&lt;/td&gt;
      &lt;td&gt;18.830184&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2017-09-09 12:00:00&lt;/td&gt;
      &lt;td&gt;sun&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2017-09-09 12:30:00&lt;/td&gt;
      &lt;td&gt;shade&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;&amp;#8230;&lt;/th&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;189&lt;/th&gt;
      &lt;td&gt;2017-09-11 11:00:00&lt;/td&gt;
      &lt;td&gt;sun&lt;/td&gt;
      &lt;td&gt;17.595510&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;190&lt;/th&gt;
      &lt;td&gt;2017-09-11 11:30:00&lt;/td&gt;
      &lt;td&gt;shade&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;191&lt;/th&gt;
      &lt;td&gt;2017-09-11 11:30:00&lt;/td&gt;
      &lt;td&gt;sun&lt;/td&gt;
      &lt;td&gt;18.630506&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;192 rows × 3&amp;nbsp;columns&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img alt="png" src="/images/temps_unfilled.png"&gt;&lt;/p&gt;
&lt;p&gt;To compare the measurements each half hour (or maybe to do some machine learning), we need a way of filling in the missing measurements. If the value we are measuring (in this case temperature) changes slowly with respect to how frequently we make a measurement, then a forward fill may be a reasonable&amp;nbsp;choice. &lt;/p&gt;
&lt;p&gt;In Pandas, this is easy. We just do a &lt;a href="posts/groupby-without-aggregation/"&gt;groupby without aggregation&lt;/a&gt;, and to each group apply the &lt;code&gt;.fillna&lt;/code&gt; method, specifying &lt;code&gt;method='ffill'&lt;/code&gt;, also known as &lt;code&gt;method='pad'&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ffill&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df_filled&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;temperature&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2017-09-09 12:00:00&lt;/td&gt;
      &lt;td&gt;shade&lt;/td&gt;
      &lt;td&gt;18.830184&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;2017-09-09 12:00:00&lt;/td&gt;
      &lt;td&gt;sun&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;2017-09-09 12:30:00&lt;/td&gt;
      &lt;td&gt;shade&lt;/td&gt;
      &lt;td&gt;18.830184&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;&amp;#8230;&lt;/th&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;189&lt;/th&gt;
      &lt;td&gt;2017-09-11 11:00:00&lt;/td&gt;
      &lt;td&gt;sun&lt;/td&gt;
      &lt;td&gt;17.595510&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;190&lt;/th&gt;
      &lt;td&gt;2017-09-11 11:30:00&lt;/td&gt;
      &lt;td&gt;shade&lt;/td&gt;
      &lt;td&gt;18.226763&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;191&lt;/th&gt;
      &lt;td&gt;2017-09-11 11:30:00&lt;/td&gt;
      &lt;td&gt;sun&lt;/td&gt;
      &lt;td&gt;18.630506&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;192 rows × 3&amp;nbsp;columns&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;We can see the effect this had on the data by plotting. We hope to end up with nice, regular measurements without having distorted the overall shape too&amp;nbsp;much:&lt;/p&gt;
&lt;p&gt;&lt;img alt="png" src="/images/temps_filled.png"&gt;&lt;/p&gt;
&lt;p&gt;In Spark, things get a bit trickier. The key ingredients&amp;nbsp;are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;pyspark.sql.Window&lt;/code&gt; object. A window, which may be familiar if you use &lt;span class="caps"&gt;SQL&lt;/span&gt;, acts kind of like a group in a group by, except it slides over the data, allowing you to more easily return a value for every row (instead of doing an aggregation). A window is specified in PySpark with &lt;code&gt;.rowsBetween&lt;/code&gt;, which takes the indices of the rows to include relative to the current row (where the value will be returned in the output). The rows in the window can be ordered using &lt;code&gt;.orderBy&lt;/code&gt;, and partitioned using &lt;code&gt;.partitionBy&lt;/code&gt;. Partitioning over a column ensures that only rows with the same value of that column will end up in a window together, acting similarly to a group&amp;nbsp;by.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;pyspark.sql&lt;/code&gt; window function &lt;code&gt;last&lt;/code&gt;. As its name suggests, &lt;code&gt;last&lt;/code&gt; returns the last value in the window (implying that the window must have a meaningful ordering). It takes an optional argument &lt;code&gt;ignorenulls&lt;/code&gt; which, when set to &lt;code&gt;True&lt;/code&gt;, causes &lt;code&gt;last&lt;/code&gt; to return the last non-&lt;code&gt;null&lt;/code&gt; value in the window, if such a value&amp;nbsp;exists.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The strategy to forward fill in Spark is as follows. First we define a window, which is ordered in time, and which includes all the rows from the beginning of time up until the current row. We achieve this here simply by selecting the rows in the window as being the &lt;code&gt;rowsBetween&lt;/code&gt; &lt;code&gt;-sys.maxsize&lt;/code&gt; (a very large negative number) and &lt;code&gt;0&lt;/code&gt; (the current row). Specifying a value larger than the number of available rows doesn&amp;#8217;t cause any errors, so we can just use a very large number to be sure our window reaches all the way back to the beginning of the dataframe. If you need to optimize memory usage, you can make your job much more efficient by finding the maximal number of consecutive &lt;code&gt;null&lt;/code&gt;s in your dataframe and only taking a large enough window to include all of those plus one non-&lt;code&gt;null&lt;/code&gt; value. We partition the window by the &lt;code&gt;location&lt;/code&gt; column to make sure that gaps only get filled with previous measurements from the same&amp;nbsp;location.&lt;/p&gt;
&lt;p&gt;We act with &lt;code&gt;last&lt;/code&gt; over the window we have defined, specifying &lt;code&gt;ignorenulls=True&lt;/code&gt;. If the current row is non-&lt;code&gt;null&lt;/code&gt;, then the output will just be the value of the current row. However, if the current row &lt;em&gt;is&lt;/em&gt; &lt;code&gt;null&lt;/code&gt;, then the function will return the most recent (last) non-&lt;code&gt;null&lt;/code&gt; value in the&amp;nbsp;window.&lt;/p&gt;
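&lt;p&gt;Before looking at the Spark version, it may help to see the same logic as a rough plain-Python sketch. This is not how Spark evaluates a window; the function &lt;code&gt;forward_fill&lt;/code&gt;, its arguments and the toy rows are all made up for illustration. It only mimics the result of taking the last non-&lt;code&gt;null&lt;/code&gt; value over a partitioned, ordered window that runs from the first row up to the current&amp;nbsp;row:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;def forward_fill(rows, partition, order, value):
    """Illustrative only: emulate last(value, ignorenulls=True) over a window
    partitioned by `partition`, ordered by `order`, from the start to the current row."""
    last_seen = {}  # most recent non-null value seen so far, per partition
    filled = []
    for row in sorted(rows, key=lambda r: r[order]):
        if row[value] is not None:
            last_seen[row[partition]] = row[value]
        # leading nulls stay None because nothing has been seen yet
        filled.append(dict(row, filled=last_seen.get(row[partition])))
    return filled

# toy data: integer time steps instead of real timestamps
rows = [
    {'time': 1, 'location': 'shade', 'temperature': 18.8},
    {'time': 1, 'location': 'sun', 'temperature': None},
    {'time': 2, 'location': 'shade', 'temperature': None},
    {'time': 2, 'location': 'sun', 'temperature': 21.6},
    {'time': 3, 'location': 'sun', 'temperature': None},
]

result = forward_fill(rows, partition='location', order='time', value='temperature')
for row in result:
    print(row['time'], row['location'], row['temperature'], row['filled'])
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The first &lt;code&gt;sun&lt;/code&gt; row stays empty because there is no earlier measurement from that location to fill it&amp;nbsp;with.&lt;/p&gt;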
&lt;p&gt;For a Spark dataframe with the same data as we just saw in Pandas, the code looks like&amp;nbsp;this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# the same data as before&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;+-------------------+--------+------------------+
|               time|location|       temperature|
+-------------------+--------+------------------+
|2017-09-09 12:00:00|   shade|18.830184076113213|
|2017-09-09 12:00:00|     sun|              null|
|2017-09-09 12:30:00|   shade|              null|
|2017-09-09 12:30:00|     sun| 21.55237663805009|
|2017-09-09 13:00:00|   shade| 18.59059750682235|
|2017-09-09 13:00:00|     sun|              null|
|2017-09-09 13:30:00|   shade|              null|
|2017-09-09 13:30:00|     sun|22.587784977960474|
|2017-09-09 14:00:00|   shade|19.101003724324197|
|2017-09-09 14:00:00|     sun|20.548896316341516|
+-------------------+--------+------------------+
only showing top 10 rows
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;

&lt;span class="c1"&gt;# define the window&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
               &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
               &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rowsBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# define the forward-filled column&lt;/span&gt;
&lt;span class="n"&gt;filled_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;temperature&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ignorenulls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# do the fill&lt;/span&gt;
&lt;span class="n"&gt;spark_df_filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;temp_filled_spark&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filled_column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# show off our glorious achievements&lt;/span&gt;
&lt;span class="n"&gt;spark_df_filled&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;+-------------------+--------+------------------+------------------+
|               time|location|       temperature| temp_filled_spark|
+-------------------+--------+------------------+------------------+
|2017-09-09 12:00:00|   shade|18.830184076113213|18.830184076113213|
|2017-09-09 12:00:00|     sun|              null|              null|
|2017-09-09 12:30:00|   shade|              null|18.830184076113213|
|2017-09-09 12:30:00|     sun| 21.55237663805009| 21.55237663805009|
|2017-09-09 13:00:00|   shade| 18.59059750682235| 18.59059750682235|
|2017-09-09 13:00:00|     sun|              null| 21.55237663805009|
|2017-09-09 13:30:00|   shade|              null| 18.59059750682235|
|2017-09-09 13:30:00|     sun|22.587784977960474|22.587784977960474|
|2017-09-09 14:00:00|   shade|19.101003724324197|19.101003724324197|
|2017-09-09 14:00:00|     sun|20.548896316341516|20.548896316341516|
+-------------------+--------+------------------+------------------+
only showing top 10 rows
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Success! Note that a backward-fill is achieved in a very similar way. The only changes&amp;nbsp;are: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Define the window over all future rows instead of all past rows: &lt;code&gt;.rowsBetween(-sys.maxsize,0)&lt;/code&gt; becomes &lt;code&gt;.rowsBetween(0,sys.maxsize)&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;code&gt;first&lt;/code&gt; from &lt;code&gt;pyspark.sql.functions&lt;/code&gt; instead of &lt;code&gt;last&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
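&lt;p&gt;As a quick sanity check of what a backward fill should produce, here is a small plain-Python sketch for the measurements of a single location, already ordered in time. The function &lt;code&gt;backward_fill&lt;/code&gt; and the sample values are hypothetical and only illustrate the expected result, not the Spark&amp;nbsp;implementation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;def backward_fill(values):
    """Illustrative only: fill each None with the next non-None value,
    leaving trailing Nones alone."""
    filled = []
    next_seen = None  # nearest non-null value found so far, walking backwards
    for v in reversed(values):
        if v is not None:
            next_seen = v
        filled.append(next_seen)
    filled.reverse()  # restore the original time order
    return filled

shade = [None, 18.6, None, None, 19.1, None]
print(backward_fill(shade))
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Just as a forward fill leaves leading &lt;code&gt;null&lt;/code&gt;s alone, a backward fill leaves trailing &lt;code&gt;null&lt;/code&gt;s&amp;nbsp;alone.&lt;/p&gt;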
&lt;p&gt;That&amp;#8217;s it! Happy&amp;nbsp;filling!&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="spark"></category><category term="data"></category><category term="pandas"></category><category term="time series"></category></entry><entry><title>Creating a responsive bar chart for my tags</title><link href="https://johnpaton.net/posts/responsive-bar-chart/" rel="alternate"></link><published>2017-07-21T23:30:00-01:00</published><updated>2017-07-21T23:30:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-07-21:/posts/responsive-bar-chart/</id><summary type="html">&lt;p&gt;Today I decided that, since I&amp;#8217;m a data kind of guy, I would like my tags page to show a bar chart of how many posts per tag I&amp;#8217;ve made. The idea was to basically have a list of tags on the left, with a bar chart on the right showing how many articles are tagged with that tag, and the bars scaling to the window size. It turned out to be more complicated than I was&amp;nbsp;expecting.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today I decided that, since I&amp;#8217;m a data kind of guy, I would like my &lt;a href="/tags"&gt;tags page&lt;/a&gt; to show a bar chart of how many posts per tag I&amp;#8217;ve made. The idea was to basically have a list of tags on the left, with a bar chart on the right showing how many articles are tagged with that tag. Obviously the bars should scale to the size of the window. If you were too lazy to click the link, the result I came up with (at the time of writing) looks like&amp;nbsp;this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="The graph" src="/images/tag-graph.png"&gt;&lt;/p&gt;
&lt;p&gt;It turned out to be more complicated than I was expecting since I had to work around the constraints of a static site. My solution basically consists of 3&amp;nbsp;parts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;An extra setting in my Pelican configuration so I can turn the graph back off if I get bored of&amp;nbsp;it&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;span class="caps"&gt;CSS&lt;/span&gt; to render a responsive bar&amp;nbsp;chart&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modification of the jinja2 template for my tags&amp;nbsp;page&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h1&gt;Setting the Pelican&amp;nbsp;configuration&lt;/h1&gt;
&lt;p&gt;I literally just added the variable &lt;code&gt;TAG_GRAPH = True&lt;/code&gt; to my &lt;a href="https://github.com/JohnPaton/johnpaton.github.io/blob/dev/pelicanconf.py"&gt;configuration file&lt;/a&gt;, which is basically a file with a bunch of Python variables that tells Pelican what to do. This is going&amp;nbsp;great!&lt;/p&gt;
&lt;p&gt;I also needed to add one more line making an extension available to&amp;nbsp;jinja:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;JINJA_ENVIRONMENT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;extensions&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;jinja2.ext.do&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;We&amp;#8217;ll get to what it&amp;#8217;s used for further&amp;nbsp;on.&lt;/p&gt;
&lt;h1&gt;Setting up the &lt;span class="caps"&gt;CSS&lt;/span&gt;&lt;/h1&gt;
&lt;p&gt;This is only the second time I&amp;#8217;ve tried to actually accomplish something in &lt;span class="caps"&gt;CSS&lt;/span&gt;, so it was a bit of a struggle. &lt;a href="https://codepen.io/"&gt;Codepen.io&lt;/a&gt; and &lt;a href="https://developer.chrome.com/devtools#dom-and-styles"&gt;Chrome&amp;#8217;s Inspect tool&lt;/a&gt; both turned out to be very handy (thanks for the tip, &lt;a href="https://davideberdin.github.io/"&gt;Davide&lt;/a&gt;). I did a bunch of tweaking to get everything looking how I wanted it to, but I&amp;#8217;ll just include the basics here to get something up and running. If you&amp;#8217;re implementing it yourself you can look at this site&amp;#8217;s stylesheet for all the dirty&amp;nbsp;details.&lt;/p&gt;
&lt;p&gt;What I ended up doing was making a &lt;code&gt;table&lt;/code&gt; with one column to contain the tag name and a second column for the bars. I set up a few different&amp;nbsp;selectors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;table#tags&lt;/code&gt; to set the size of&amp;nbsp;everything&lt;/li&gt;
&lt;li&gt;&lt;code&gt;td.tag&lt;/code&gt; for the tag&amp;nbsp;names&lt;/li&gt;
&lt;li&gt;&lt;code&gt;td.tagbarcol&lt;/code&gt; to contain the&amp;nbsp;bars&lt;/li&gt;
&lt;li&gt;&lt;code&gt;div.tagbar&lt;/code&gt; to act as the&amp;nbsp;bars&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We give the whole table a bit of a margin to keep it away from the page title, and make it take up 90% of the available&amp;nbsp;width:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;table&lt;/span&gt;&lt;span class="p"&gt;#&lt;/span&gt;&lt;span class="nn"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;margin-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;em&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="kt"&gt;%&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The tag column needs the text to be right-justified, and it also gets a minimum width so that the tag names aren&amp;#8217;t too cramped. Keep this small enough that there is enough room for the bars, even on a phone&amp;nbsp;screen.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;text-align&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="kc"&gt;right&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;min-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="kt"&gt;em&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;We give the column for the bars a width of 100% so that it will take up all the horizontal space not used by the tag&amp;nbsp;names:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;tagbarcol&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="kt"&gt;%&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Finally we give the &lt;code&gt;div&lt;/code&gt;&amp;#8217;s background the color we want to use for the bars, set the text to white for our data labels, and give it a bit of padding so that the labels aren&amp;#8217;t too&amp;nbsp;cramped:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;tagbar&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;background-color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mh"&gt;#3aa500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mh"&gt;#fff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="kt"&gt;em&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The structure of the table is as&amp;nbsp;follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;table&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;tags&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tag&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            tag1
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tagbarcol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tagbar&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;width:100%&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
                5 &lt;span class="c"&gt;&amp;lt;!-- data label --&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tag&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            tag2
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tagbarcol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tagbar&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;width:40%&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
                2
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;table&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The trick is that we set the width of the &lt;code&gt;tagbar&lt;/code&gt; div individually for each bar, making the largest value 100% and everything else a fraction of that.&lt;/strong&gt; That way the longest bar will take up all of &lt;code&gt;tagbarcol&lt;/code&gt;, reaching out to the edge of your table, and the bars for smaller values are proportionally shorter. The width of &lt;code&gt;tagbar&lt;/code&gt; is relative to &lt;code&gt;tagbarcol&lt;/code&gt;, &lt;code&gt;tagbarcol&lt;/code&gt; is relative to the table, and the table is relative to the screen (or whatever container it&amp;#8217;s in), so as long as the outermost container is responsive, the bars will scale&amp;nbsp;nicely.&lt;/p&gt;
&lt;p&gt;The above bare-bones example yields the following responsive bar&amp;nbsp;chart:&lt;/p&gt;
&lt;iframe src=/static/bar-chart-demo.html width=100% height=80px style="border:none;"&gt;&lt;/iframe&gt;

&lt;p&gt;You can play with this setup yourself in &lt;a href="https://codepen.io/JohnPaton/pen/PKYbgw?editors=1100"&gt;this codepen&lt;/a&gt; I&amp;nbsp;made. &lt;/p&gt;
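&lt;p&gt;As a rough sketch of the width rule described above (plain Python, purely for illustration; the helper name &lt;code&gt;bar_widths&lt;/code&gt; is made up and not part of the site&amp;#8217;s code), the two bars labelled 5 and 2 get their widths like&amp;nbsp;this:&lt;/p&gt;

```python
def bar_widths(counts):
    """Map each label to a CSS width, scaled so the largest count is 100%."""
    largest = max(counts.values())
    return {
        label: "{:g}%".format(100 * count / largest)
        for label, count in counts.items()
    }


# Hypothetical counts matching the bare-bones table: tag1 has 5, tag2 has 2.
widths = bar_widths({"tag1": 5, "tag2": 2})
print(widths)
```

&lt;p&gt;The largest count always maps to 100%, so adding a new, bigger value shrinks every other bar without touching the&amp;nbsp;&lt;span class="caps"&gt;CSS&lt;/span&gt;.&lt;/p&gt;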
&lt;h1&gt;Making the jinja&amp;nbsp;template&lt;/h1&gt;
&lt;p&gt;This site is powered by &lt;a href="https://blog.getpelican.com/"&gt;Pelican&lt;/a&gt;, which uses &lt;a href="http://jinja.pocoo.org/"&gt;jinja&lt;/a&gt; to make a set of &lt;span class="caps"&gt;HTML&lt;/span&gt; templates that are filled with content I write whenever I regenerate the site. The template I care about in this case is the one that generates my tags page. The theme I&amp;#8217;m using is a fork of &lt;a href="https://github.com/alexandrevicenzi/Flex"&gt;Flex&lt;/a&gt; that I&amp;#8217;m &lt;a href="https://github.com/johnpaton/flex-mod"&gt;slowly hacking&lt;/a&gt; into something that suits my own whimsical&amp;nbsp;desires.&lt;/p&gt;
&lt;p&gt;To generate the table structure above, we need to know what the largest data value will be so that we can make everything else relative to that. Pelican provides a variable called &lt;code&gt;tags&lt;/code&gt; to jinja that as best I can tell is a list of pairs in the form of &lt;code&gt;(tag, [list of articles])&lt;/code&gt;. The existing Flex template looped through this, using the values of &lt;code&gt;tag&lt;/code&gt; and the length (in jinja: &lt;code&gt;|count&lt;/code&gt;) of the articles list to get the number of articles for each tag. Unfortunately jinja doesn&amp;#8217;t seem to have a maximum function, so I realized I would have to loop through the tags and find the largest count myself. However, jinja also doesn&amp;#8217;t seem to let you reassign an outer variable from within a loop (the assignment is scoped to the loop body); you can only call methods on it. In the end I settled on the following hacky&amp;nbsp;solution:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;max_articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
&lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nv"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;tags&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
  &lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;max_articles&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
    &lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="nv"&gt;max_articles.append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
  &lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;endif&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
&lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;endfor&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;What I&amp;#8217;m doing is looping through the tags and appending to a list (which you &lt;em&gt;can&lt;/em&gt; do in a jinja loop) every time I see an article count larger than the one at the end of the list. The &lt;code&gt;|last&lt;/code&gt; filter accesses the last value in the list (obviously), so once this loop is done running, &lt;code&gt;max_articles|last&lt;/code&gt; is the value I want all my bars to be relative to. It&amp;#8217;s ugly, but it works (a very common theme in my&amp;nbsp;life).&lt;/p&gt;
&lt;p&gt;In order to use the &lt;code&gt;do&lt;/code&gt; statement, we need to make the &lt;code&gt;do&lt;/code&gt; extension available to jinja (don&amp;#8217;t worry, it ships with jinja, it just isn&amp;#8217;t enabled by default). That is why I included &lt;code&gt;jinja2.ext.do&lt;/code&gt; in the jinja environment in my Pelican&amp;nbsp;configuration.&lt;/p&gt;
&lt;p&gt;Now that we know what value to use for our percentages, we can construct the table. I added an &lt;code&gt;if&lt;/code&gt; statement to my template file so that I can still return to my old theme&amp;#8217;s design just by changing the &lt;code&gt;TAG_GRAPH&lt;/code&gt; variable in my Pelican configuration&amp;nbsp;file.&lt;/p&gt;
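&lt;p&gt;For reference, a minimal sketch of the two settings involved, assuming Pelican 3.7 or newer, where jinja options live in &lt;code&gt;JINJA_ENVIRONMENT&lt;/code&gt; (older releases used &lt;code&gt;JINJA_EXTENSIONS&lt;/code&gt; instead) and a configuration file that is typically called &lt;code&gt;pelicanconf.py&lt;/code&gt;. &lt;code&gt;TAG_GRAPH&lt;/code&gt; is a custom, theme-specific setting, not something Pelican itself knows&amp;nbsp;about:&lt;/p&gt;

```python
# Sketch of the two settings this post relies on (the rest of the config is omitted).

# Enable the "do" statement in templates; it ships with jinja but is off by default.
JINJA_ENVIRONMENT = {
    "extensions": ["jinja2.ext.do"],
}

# Custom, theme-specific switch: True renders the bar chart, False the old tag list.
# Pelican passes upper-case settings like this one through to the templates.
TAG_GRAPH = True
```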
&lt;p&gt;I ended up with the following in my&amp;nbsp;template:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;TAG_GRAPH&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;table&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tags&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nv"&gt;tag&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tag&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; 
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt; &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;SITEURL&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;tag.url&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;tag&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tagbarcol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;  
          &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tagbar&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;width:&lt;/span&gt;&lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nv"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nv"&gt;max_articles&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt;&lt;span class="s"&gt;%&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="cp"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;articles&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt; &lt;span class="cp"&gt;}}&lt;/span&gt; 
          &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; 
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;td&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;endfor&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;table&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
  &lt;span class="c"&gt;&amp;lt;!-- The old theme --&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;{%&lt;/span&gt; &lt;span class="k"&gt;endif&lt;/span&gt; &lt;span class="cp"&gt;%}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The single table is set up, and then jinja loops through the tags. For each tag, it makes one row. It puts each tag name (and a link to that tag&amp;#8217;s articles) in the &lt;code&gt;tag&lt;/code&gt; column. It sets up the &lt;code&gt;tagbarcol&lt;/code&gt; column and puts a &lt;code&gt;tagbar&lt;/code&gt; inside it, with a percentage width of &lt;code&gt;100 * &amp;lt;number of articles for that tag&amp;gt; / &amp;lt;maximum number of articles&amp;gt;&lt;/code&gt;, where the maximum number of articles comes from the hacky loop&amp;nbsp;above.&lt;/p&gt;
&lt;p&gt;And that&amp;#8217;s it! If you didn&amp;#8217;t click the link before but are now feeling inspired, &lt;a href="/tags"&gt;check out the tags page&lt;/a&gt;! It was a bit complex but I&amp;#8217;m happy with the&amp;nbsp;result. &lt;/p&gt;
&lt;p&gt;Did I do something stupid? Would you have done something differently? Let me&amp;nbsp;know!&lt;/p&gt;</content><category term="posts"></category><category term="web"></category><category term="css"></category><category term="pelican"></category><category term="jinja"></category></entry><entry><title>Groupby without aggregation in Pandas</title><link href="https://johnpaton.net/posts/groupby-without-aggregation/" rel="alternate"></link><published>2017-07-17T20:00:00-01:00</published><updated>2017-07-17T20:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-07-17:/posts/groupby-without-aggregation/</id><summary type="html">&lt;p&gt;Pandas has a useful feature that I didn&amp;#8217;t appreciate enough when I first started using it: &lt;code&gt;groupby&lt;/code&gt;s without aggregation. What do I mean by that? Let&amp;#8217;s look at an&amp;nbsp;example.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Pandas has a useful feature that I didn&amp;#8217;t appreciate enough when I first started using it: &lt;code&gt;groupby&lt;/code&gt;s without aggregation. What do I mean by that? Let&amp;#8217;s look at an&amp;nbsp;example.&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ll borrow the data structure from my previous post about &lt;a href="https://johnpaton.github.io/posts/periods-since-time-series-events/"&gt;counting the periods since an event&lt;/a&gt;: company accident data. We have a list of workplace accidents for some company since 1980, including the time and location of the accident (no it&amp;#8217;s not real, I generated it, please don&amp;#8217;t send your lawyers to investigate a data&amp;nbsp;breach): &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;severity&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-02-28 22:05:39&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-01 02:12:20&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-07 07:30:30&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-05-15 03:23:01&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-05-29 21:21:39&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Say we want to add the total number of accidents at each location as a column in the dataset. We could start off by doing a regular &lt;code&gt;groupby&lt;/code&gt; to get the total number of accidents per&amp;nbsp;location:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;gb&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;severity&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;Amsterdam&lt;/th&gt;
      &lt;td&gt;129&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;Birmingham&lt;/th&gt;
      &lt;td&gt;121&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;But now we have to separately add this information to the&amp;nbsp;dataframe.&lt;/p&gt;
&lt;p&gt;Instead, we have the option to directly operate on the whole&amp;nbsp;group:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;accident_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;severity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;num_accidents&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;

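&lt;span class="c1"&gt;# group_keys=False keeps the original time index (newer pandas would otherwise add the group key to it)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accident_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;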
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;severity&lt;/th&gt;
      &lt;th&gt;num_accidents&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-02-28 22:05:39&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-01 02:12:20&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-07 07:30:30&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;129&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-05-15 03:23:01&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;129&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-05-29 21:21:39&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Now, in this simple case we could have just performed a left join. However, this kind of &lt;code&gt;groupby&lt;/code&gt; becomes especially handy when you have more complex operations you want to do within the group, without interference from other&amp;nbsp;groups.&lt;/p&gt;
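&lt;p&gt;For comparison, here is a small self-contained sketch of that left join, next to the built-in &lt;code&gt;transform&lt;/code&gt; shortcut. The three rows of data are made up for the example, and the variable names &lt;code&gt;counts&lt;/code&gt;, &lt;code&gt;joined&lt;/code&gt; and &lt;code&gt;transformed&lt;/code&gt; are only&amp;nbsp;illustrative:&lt;/p&gt;

```python
import pandas as pd

# Tiny made-up dataset with the same columns as the post's dataframe.
df = pd.DataFrame(
    {
        "location": ["Birmingham", "Birmingham", "Amsterdam"],
        "severity": [1, 3, 1],
    },
    index=pd.to_datetime(
        ["1980-02-28 22:05:39", "1980-03-01 02:12:20", "1980-03-07 07:30:30"]
    ).rename("time"),
)

# Aggregate first: one row per location holding its accident count.
counts = df.groupby("location").size().rename("num_accidents")

# Then left-join the counts back onto every accident row via the location column.
joined = df.join(counts, on="location", how="left")

# transform broadcasts the per-group count straight back onto the original rows.
transformed = df.groupby("location")["severity"].transform("count")

print(joined)
```

&lt;p&gt;Both routes are fine for a single aggregate; operating on the whole group pays off once the per-group logic no longer fits in one aggregation&amp;nbsp;call.&lt;/p&gt;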
&lt;p&gt;As a more complex example, consider calculating the time between accidents at each location. Our dataframe is already sorted by accident time, so all we have to do is make a series out of the group&amp;#8217;s index (&lt;code&gt;time&lt;/code&gt;) and take the difference between the rows to get the time differences between incidents. We insert this information directly into the group as a new column and return&amp;nbsp;it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;time_difference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# get the time differences and put them directly into the group&lt;/span&gt;
    &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;time_since_previous&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_series&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_difference&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;severity&lt;/th&gt;
      &lt;th&gt;num_accidents&lt;/th&gt;
      &lt;th&gt;time_since_previous&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-02-28 22:05:39&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
      &lt;td&gt;NaT&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-01 02:12:20&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
      &lt;td&gt;1 days 04:06:41&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-07 07:30:30&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;129&lt;/td&gt;
      &lt;td&gt;NaT&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-05-15 03:23:01&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;129&lt;/td&gt;
      &lt;td&gt;68 days 19:52:31&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-05-29 21:21:39&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
      &lt;td&gt;89 days 19:09:19&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
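
&lt;p&gt;If you want to try this without the original dataset, here is a self-contained sketch on four made-up rows that mirror the timestamps above. It is a variation rather than the exact code from this post: instead of a helper function it groups the index series by location and calls &lt;code&gt;diff&lt;/code&gt; on the&amp;nbsp;groups:&lt;/p&gt;

```python
import pandas as pd

# Made-up accidents, already sorted by time, mirroring the first rows shown above.
times = pd.to_datetime(
    [
        "1980-02-28 22:05:39",
        "1980-03-01 02:12:20",
        "1980-03-07 07:30:30",
        "1980-05-15 03:23:01",
    ]
).rename("time")
df = pd.DataFrame(
    {"location": ["Birmingham", "Birmingham", "Amsterdam", "Amsterdam"]},
    index=times,
)

# Turn the index into a series, group it by location, and diff within each group.
df["time_since_previous"] = df.index.to_series().groupby(df["location"]).diff()
print(df)
```

&lt;p&gt;The first accident at each location has no predecessor within its group, so its difference is&amp;nbsp;&lt;code&gt;NaT&lt;/code&gt;.&lt;/p&gt;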

&lt;p&gt;We see that our dataframe maintains its original structure, but we now have information about each location that was calculated using only other datapoints from that&amp;nbsp;location.&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="pandas"></category><category term="data"></category><category term="time series"></category></entry><entry><title>Counting the number of periods since time-series events with Pandas</title><link href="https://johnpaton.net/posts/periods-since-time-series-events/" rel="alternate"></link><published>2017-07-15T20:00:00-01:00</published><updated>2017-07-15T20:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-07-15:/posts/periods-since-time-series-events/</id><summary type="html">&lt;p&gt;This is a cute trick I discovered the other day for quickly computing the time since an event on regularly spaced time series data (like monthly reporting), without looping over the&amp;nbsp;data.&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is a cute trick I discovered the other day for quickly computing the time since an event on regularly spaced time series data (like monthly reporting), without looping over the&amp;nbsp;data. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Say we have a list of workplace accidents at different factory locations for a company. We could have a dataframe that looks something like&amp;nbsp;this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;severity&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-01-07 23:37:50&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-01-31 16:51:04&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-07-05 05:20:49&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-07-25 10:49:03&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-08-10 05:13:19&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Now, our company has decided they want to know how many months each location has gone without an accident, and they want this historically. Maybe they are going to use it as input for a machine learning model that makes monthly predictions, or they might just be&amp;nbsp;curious. &lt;/p&gt;
&lt;p&gt;Our plan of attack is as&amp;nbsp;follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;One-hot encode the&amp;nbsp;severity&lt;/li&gt;
&lt;li&gt;Resample the data so that it is regularly&amp;nbsp;spaced&lt;/li&gt;
&lt;li&gt;For each severity, make a counter that increases per period, resetting whenever there was an accident during that&amp;nbsp;period&lt;/li&gt;
&lt;/ol&gt;
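&lt;p&gt;To make step 3 concrete, here is what the counter means when written as an explicit loop in plain Python, which is exactly the kind of loop the pandas version is meant to avoid. The function name, the choice to reset to 0 in a period with an accident, and the use of &lt;code&gt;None&lt;/code&gt; before the first accident are all assumptions made for this&amp;nbsp;sketch:&lt;/p&gt;

```python
def periods_since_event(event_counts):
    """Count periods since the last event, resetting to 0 in any period with one.

    Periods before the first observed event get None, since the true gap is unknown.
    """
    result = []
    since = None
    for count in event_counts:
        if count:
            since = 0
        elif since is not None:
            since += 1
        result.append(since)
    return result


# Hypothetical monthly accident counts for one location and one severity.
monthly = [1, 0, 0, 2, 0]
print(periods_since_event(monthly))  # [0, 1, 2, 0, 1]
```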
&lt;p&gt;Pandas makes step 1 very&amp;nbsp;easy:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_onehot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;severity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df_onehot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;severity_1&lt;/th&gt;
      &lt;th&gt;severity_2&lt;/th&gt;
      &lt;th&gt;severity_3&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-01-07 23:37:50&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-01-31 16:51:04&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-07-05 05:20:49&lt;/th&gt;
      &lt;td&gt;Birmingham&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-07-25 10:49:03&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-08-10 05:13:19&lt;/th&gt;
      &lt;td&gt;Amsterdam&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Next up, we resample. We want the data by location, so we will first group by location and then resample each group. Since we&amp;#8217;ve one-hot encoded the data, the number of accidents in each period is just the sum of all the rows that fall into the period. Periods with no rows will be NaN, so we fill them with 0 since no accidents occurred in that&amp;nbsp;period.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_periodic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_onehot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;location&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;1M&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_periodic&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;severity_1&lt;/th&gt;
      &lt;th&gt;severity_2&lt;/th&gt;
      &lt;th&gt;severity_3&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th rowspan="4" valign="top"&gt;Amsterdam&lt;/th&gt;
      &lt;th&gt;1980-01-31&lt;/th&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-02-29&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-31&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-04-30&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;&amp;#8230;&lt;/th&gt;
      &lt;th&gt;&amp;#8230;&lt;/th&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th rowspan="4" valign="top"&gt;Birmingham&lt;/th&gt;
      &lt;th&gt;2016-09-30&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2016-10-31&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2016-11-30&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2016-12-31&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;879 rows × 3&amp;nbsp;columns&lt;/p&gt;
&lt;/div&gt;
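
&lt;p&gt;To see the group-then-resample step in isolation, here is a minimal sketch on a made-up three-row frame. The data and the names &lt;code&gt;events&lt;/code&gt; and &lt;code&gt;monthly&lt;/code&gt; are assumptions for illustration only. It uses the month-start alias &lt;code&gt;'MS'&lt;/code&gt;, because the spelling of the month-end alias depends on the pandas version.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy events: one row per accident, already one-hot encoded.
events = pd.DataFrame(
    {
        'location': ['Amsterdam', 'Birmingham', 'Amsterdam'],
        'severity_1': [1, 1, 0],
        'severity_2': [0, 0, 1],
    },
    index=pd.to_datetime(['1980-01-07', '1980-02-11', '1980-03-20']),
)
events.index.name = 'time'

# Group by location, bucket each group into calendar months, sum per bucket.
# 'MS' labels each bucket by its first day; the post labels by month end.
severity_cols = ['severity_1', 'severity_2']
monthly = events.groupby('location')[severity_cols].resample('MS').sum()

amsterdam = monthly.loc['Amsterdam']
print(amsterdam)
```

&lt;p&gt;February has no Amsterdam rows, but it still shows up as a period with zero accidents, which is exactly what the counter in the next step needs.&lt;/p&gt;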

&lt;p&gt;Finally, we want the counter that resets at each period where there was an accident. Let&amp;#8217;s first do it for one severity and location, and then we&amp;#8217;ll apply the same steps to the entire dataset. We&amp;#8217;ll choose Amsterdam and the lowest severity&amp;nbsp;accidents. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;amsterdam_low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_periodic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Amsterdam&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;severity_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;amsterdam_low&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;time
1980-01-31    1.0
1980-02-29    0.0
1980-03-31    0.0
1980-04-30    0.0
             ... 
2016-06-30    0.0
2016-07-31    0.0
2016-08-31    1.0
2016-09-30    1.0
Name: severity_1, dtype: float64
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Okay, so we have a series with the number of accidents per&amp;nbsp;month. &lt;/p&gt;
&lt;p&gt;Now here comes the trick. We set up two new series with the same index as &lt;code&gt;amsterdam_low&lt;/code&gt;: one with a count that increases monotonically, and one that holds the value of that count at every period where we want to&amp;nbsp;reset. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amsterdam_low&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;amsterdam_low&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;count&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;time
1980-01-31      0
1980-02-29      1
1980-03-31      2
1980-04-30      3
             ... 
2016-06-30    437
2016-07-31    438
2016-08-31    439
2016-09-30    440
dtype: int64
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;resets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amsterdam_low&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resets&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;time
1980-01-31      0.0
1980-02-29      NaN
1980-03-31      NaN
1980-04-30      NaN
              ...  
2016-06-30      NaN
2016-07-31      NaN
2016-08-31    439.0
2016-09-30    440.0
dtype: float64
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Now we forward fill the values in &lt;code&gt;resets&lt;/code&gt; using &lt;code&gt;.fillna(method='pad')&lt;/code&gt; (recent pandas versions deprecate the &lt;code&gt;method&lt;/code&gt; argument in favour of the equivalent &lt;code&gt;.ffill()&lt;/code&gt;). That gives us a step function: a series of constant values that steps up at each index where there was an accident in &lt;code&gt;amsterdam_low&lt;/code&gt;. This series acts as a baseline which we can subtract from &lt;code&gt;count&lt;/code&gt;, so that at each accident the resulting series resets to zero and then starts counting up again. Any values before the first accident in the dataset will still be NaN, which is the desired behaviour because we don&amp;#8217;t know what these values should&amp;nbsp;be. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;resets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pad&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resets&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;time
1980-01-31      0.0
1980-02-29      0.0
1980-03-31      0.0
1980-04-30      0.0
              ...  
2016-06-30    435.0
2016-07-31    435.0
2016-08-31    439.0
2016-09-30    440.0
dtype: float64
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;since_accident&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;resets&lt;/span&gt;
&lt;span class="n"&gt;since_accident&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;time
1980-01-31    0.0
1980-02-29    1.0
1980-03-31    2.0
1980-04-30    3.0
             ... 
2016-06-30    2.0
2016-07-31    3.0
2016-08-31    0.0
2016-09-30    0.0
dtype: float64
&lt;/pre&gt;&lt;/div&gt;
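
&lt;p&gt;The Amsterdam series happens to start with an accident, so the leading NaN case never shows up in the output above. Here is a small sketch of the same three steps on made-up data where the first two periods have no accident. The values and the names &lt;code&gt;accidents&lt;/code&gt; and &lt;code&gt;baseline&lt;/code&gt; are assumptions for illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical monthly accident counts; the first two months have no accident.
accidents = pd.Series(
    [0, 0, 1, 0, 0, 2, 0],
    index=pd.date_range('2000-01-01', periods=7, freq='MS'),
)

# Step 1: a counter that simply increases by one every period.
count = pd.Series(range(len(accidents)), index=accidents.index)

# Step 2: keep the counter only where an accident happened, then carry the
# last seen value forward. .ffill() is equivalent to .fillna(method='pad').
baseline = count.where(accidents.gt(0)).ffill()

# Step 3: subtracting the baseline resets the counter at every accident.
since_accident = count - baseline
print(since_accident.tolist())
```

&lt;p&gt;The first two entries stay NaN because no accident has been seen yet. After that the counter reads 0, 1, 2, then resets to 0 at the second accident and continues with&amp;nbsp;1.&lt;/p&gt;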


&lt;p&gt;Plotting the three series makes it clearer what exactly the trick&amp;nbsp;was.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;resets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;since_accident&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;count&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;baseline&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods since accident&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;best&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Periods since severity 1 accident in Amsterdam&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;img alt="png" src="/images/periods-since-time-series-events_20_0.png"&gt;&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ve done it! What&amp;#8217;s nice about this trick is that we don&amp;#8217;t have to loop over all the accidents, so it scales well to larger data sets. To finish up, we do a &lt;a href="/posts/groupby-without-aggregation/"&gt;groupby without aggregation&lt;/a&gt; to get the same information for all the accident&amp;nbsp;types.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;periods_since_accident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;resets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pad&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods_since_&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;resets&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;

&lt;span class="n"&gt;df_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_periodic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;periods_since_accident&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;report_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods_since_severity_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;periods_since_severity_2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;periods_since_severity_3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_report&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Amsterdam&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;report_cols&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Periods since accident in Amsterdam&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;img alt="png" src="/images/periods-since-time-series-events_22_0.png"&gt;&lt;/p&gt;
&lt;p&gt;We can even add one final column with the number of periods since any accident, just by taking the minimum of the other three&amp;nbsp;columns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;df_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods_since_accident&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;report_cols&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_report&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;periods_since_accident&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;div&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;periods_since_accident&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;location&lt;/th&gt;
      &lt;th&gt;time&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th rowspan="4" valign="top"&gt;Amsterdam&lt;/th&gt;
      &lt;th&gt;1980-01-31&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-02-29&lt;/th&gt;
      &lt;td&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-03-31&lt;/th&gt;
      &lt;td&gt;2.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1980-04-30&lt;/th&gt;
      &lt;td&gt;3.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;&amp;#8230;&lt;/th&gt;
      &lt;th&gt;&amp;#8230;&lt;/th&gt;
      &lt;td&gt;&amp;#8230;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th rowspan="4" valign="top"&gt;Birmingham&lt;/th&gt;
      &lt;th&gt;2016-09-30&lt;/th&gt;
      &lt;td&gt;3.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2016-10-31&lt;/th&gt;
      &lt;td&gt;4.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2016-11-30&lt;/th&gt;
      &lt;td&gt;5.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2016-12-31&lt;/th&gt;
      &lt;td&gt;0.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;879 rows × 1&amp;nbsp;columns&lt;/p&gt;
&lt;/div&gt;
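
&lt;p&gt;One detail worth checking is how the row-wise minimum treats the NaN values that appear before the first accident of a given severity. The sketch below uses a made-up frame (the numbers are assumptions, only the column names follow the post) to show that &lt;code&gt;min(axis=1)&lt;/code&gt; skips NaN by default.&lt;/p&gt;

```python
import pandas as pd

nan = float('nan')

# Hypothetical per-severity counters; NaN means no accident of that type yet.
report = pd.DataFrame(
    {
        'periods_since_severity_1': [nan, 0.0, 1.0, 2.0],
        'periods_since_severity_2': [0.0, 1.0, 2.0, 0.0],
        'periods_since_severity_3': [nan, nan, nan, nan],
    }
)

# min(axis=1) ignores NaN, so each row gets the smallest counter that is known.
since_any = report.min(axis=1)
print(since_any.tolist())
```

&lt;p&gt;A severity that never occurs therefore does not blank out the combined column. A row is only NaN if all of its counters are&amp;nbsp;NaN.&lt;/p&gt;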

&lt;p&gt;Happy incident&amp;nbsp;tracking!&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="pandas"></category><category term="data"></category><category term="time series"></category></entry><entry><title>Custom color schemes in Matplotlib</title><link href="https://johnpaton.net/posts/custom-color-schemes-in-matplotlib/" rel="alternate"></link><published>2017-05-01T20:00:00-01:00</published><updated>2017-05-01T20:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-05-01:/posts/custom-color-schemes-in-matplotlib/</id><summary type="html">&lt;p&gt;At &lt;span class="caps"&gt;KPMG&lt;/span&gt;, like (I imagine) at most companies, we have a custom color palette that presentations and other materials are supposed to conform to. I actually quite like it when things I produce have a consistent look and feel, so I decided to find out how to make a custom color palette in &lt;a href="https://matplotlib.org/"&gt;matplotlib&lt;/a&gt;. Turns out that it&amp;#8217;s super&amp;nbsp;easy.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Update 03/10/2017&lt;/strong&gt;: The &lt;code&gt;axes.prop_cycle&lt;/code&gt; property is now only supported as a single line, no line breaks. This has not been updated in the downloadable template but may be fixed in a future release. See &lt;a href="https://github.com/matplotlib/matplotlib/issues/9184"&gt;the issue on GitHub&lt;/a&gt; for more&amp;nbsp;info.&lt;/p&gt;
&lt;p&gt;At &lt;span class="caps"&gt;KPMG&lt;/span&gt;, like (I imagine) at most companies, we have a custom color palette that presentations and other materials are supposed to conform to. I actually quite like it when things I produce have a consistent look and feel, so I decided to find out how to make a custom color palette in &lt;a href="https://matplotlib.org/"&gt;matplotlib&lt;/a&gt;. Turns out that it&amp;#8217;s super&amp;nbsp;easy.&lt;/p&gt;
&lt;p&gt;The first step is to create a &lt;code&gt;.mplstyle&lt;/code&gt; file for your color scheme. These files can contain a bunch of options, so the easiest way to start is to download a sample &lt;a href="http://matplotlib.org/_static/matplotlibrc"&gt;here&lt;/a&gt;. Way down at line 337 (at the time of writing), you will find the following&amp;nbsp;lines:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;#axes.prop_cycle    : cycler(&amp;#39;color&amp;#39;,&lt;/span&gt;
&lt;span class="c1"&gt;#                            [&amp;#39;1f77b4&amp;#39;, &amp;#39;ff7f0e&amp;#39;, &amp;#39;2ca02c&amp;#39;, &amp;#39;d62728&amp;#39;,&lt;/span&gt;
&lt;span class="c1"&gt;#                              &amp;#39;9467bd&amp;#39;, &amp;#39;8c564b&amp;#39;, &amp;#39;e377c2&amp;#39;, &amp;#39;7f7f7f&amp;#39;,&lt;/span&gt;
&lt;span class="c1"&gt;#                              &amp;#39;bcbd22&amp;#39;, &amp;#39;17becf&amp;#39;])&lt;/span&gt;
                                            &lt;span class="c1"&gt;# color cycle for plot lines&lt;/span&gt;
                                            &lt;span class="c1"&gt;# as list of string colorspecs:&lt;/span&gt;
                                            &lt;span class="c1"&gt;# single letter, long name, or&lt;/span&gt;
                                            &lt;span class="c1"&gt;# web-style hex&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This setting defines the cycle of colors that matplotlib uses for consecutive elements on plots when you don&amp;#8217;t specify the colors. Uncomment these lines and swap out the list for a list of your favorite (or corporately imposed) colors. As indicated by the comment, matplotlib will accept &lt;a href="https://matplotlib.org/api/colors_api.html"&gt;single letter&lt;/a&gt;, &lt;a href="https://www.w3schools.com/colors/colors_names.asp"&gt;long name&lt;/a&gt;, or hex colors. Use the &lt;span class="caps"&gt;HTML&lt;/span&gt; long name colors to get all your favorites like Gamboge, GrayTeaGreen, and&amp;nbsp;PapayaWhip.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://math.ubbcluj.ro/~sberinde/wingraph/main.html"&gt;&lt;img alt="HTML long name colors visualized" src="/images/long_names.gif"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Once you&amp;#8217;ve got your color theme specified, you need to save the file in the &lt;code&gt;stylelib&lt;/code&gt; directory of your matplotlib &lt;code&gt;configdir&lt;/code&gt;. You can find your &lt;code&gt;configdir&lt;/code&gt; using&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib&lt;/span&gt;
&lt;span class="gp"&gt;&amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_configdir&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="go"&gt;&amp;#39;C:\\Users\\johnpaton\\.matplotlib&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Save the file as &lt;code&gt;&amp;lt;configdir&amp;gt;/stylelib/&amp;lt;my_style_name&amp;gt;.mplstyle&lt;/code&gt;. I called mine &lt;code&gt;kpmg&lt;/code&gt; since that&amp;#8217;s what I&amp;#8217;m using it for. The filename is how you refer to the style in your code. You can now use your brand new color scheme to make pretty plots in the same way as you use built-in&amp;nbsp;styles:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="kn"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;kpmg&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
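
&lt;p&gt;If you would rather generate the style file than edit the sample by hand, the following is a rough sketch using only the standard library. The helper &lt;code&gt;prop_cycle_line&lt;/code&gt;, the palette, and the file name &lt;code&gt;my_style.mplstyle&lt;/code&gt; are all made up for illustration, and a temporary directory stands in for the real &lt;code&gt;stylelib&lt;/code&gt; directory. It writes the setting on a single line, in line with the update at the top of this&amp;nbsp;post.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def prop_cycle_line(colors):
    """Build a single-line axes.prop_cycle entry from hex codes without '#'."""
    quoted = ', '.join("'{}'".format(c) for c in colors)
    return "axes.prop_cycle : cycler('color', [{}])".format(quoted)


# Placeholder palette; swap in your own hex codes.
palette = ['1b9e77', 'd95f02', '7570b3']

# A temporary directory stands in for the stylelib directory of your configdir.
stylelib = Path(tempfile.mkdtemp()) / 'stylelib'
stylelib.mkdir()
style_file = stylelib / 'my_style.mplstyle'
style_file.write_text(prop_cycle_line(palette) + '\n')

print(style_file.read_text())
```

&lt;p&gt;The hex codes are written without a leading &lt;code&gt;#&lt;/code&gt;, as in the sample file, since &lt;code&gt;#&lt;/code&gt; starts a comment in a style&amp;nbsp;file.&lt;/p&gt;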


&lt;p&gt;Since all we&amp;#8217;ve done is change the color scheme, you can also use it in combination with other styles and only change their colors. Just make sure your own style is the last one in the&amp;nbsp;list:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ggplot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;kpmg&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;There are a &lt;a href="http://matplotlib.org/users/customizing.html"&gt;bunch more settings&lt;/a&gt; that you can define in the matplotlib style file, but since I&amp;#8217;m a terrible designer I know I&amp;#8217;ll make bad choices, so I&amp;#8217;ll leave that to the experts. For now, I&amp;#8217;m just happy to see the exponential growth of colors in my&amp;nbsp;life.&lt;/p&gt;
&lt;p&gt;&lt;img alt="colors" src="/images/colors.png"&gt;&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="dataviz"></category><category term="matplotlib"></category></entry><entry><title>engl_ish: Simulate your language. ish.</title><link href="https://johnpaton.net/posts/engl_ish/" rel="alternate"></link><published>2017-02-04T20:00:00-01:00</published><updated>2017-02-04T20:00:00-01:00</updated><author><name>John Paton</name></author><id>tag:johnpaton.net,2017-02-04:/posts/engl_ish/</id><summary type="html">&lt;p&gt;Quite a while ago I saw a short film called &lt;a href="https://www.youtube.com/watch?v=Vt4Dfa4fOEY"&gt;Skwerl&lt;/a&gt;, meant to demonstrate &amp;#8220;how English sounds to non-English speakers&amp;#8221;. As a native English speaker, watching it is quite surreal. The sounds and accents are totally familiar, and there are definitely words in there that you recognize, but there is no discernible overall meaning whatsoever. It&amp;#8217;s actually kind of hard to listen to. All you&amp;#8217;ve got to hang onto is that what you&amp;#8217;re hearing somehow &lt;em&gt;feels&lt;/em&gt; like English. And that&amp;#8217;s the point. Skwerl gave me the idea to attempt to create a similar effect, but with reading instead of listening. I wanted to see how English looks to non-English readers. And so I created &lt;code&gt;engl_ish&lt;/code&gt;.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Quite a while ago I saw a short film called &lt;a href="https://www.youtube.com/watch?v=Vt4Dfa4fOEY"&gt;Skwerl&lt;/a&gt;, meant to demonstrate &amp;#8220;how English sounds to non-English speakers&amp;#8221;. As a native English speaker, watching it is quite surreal. The sounds and accents are totally familiar, and there are definitely words in there that you recognize, but there is no discernible overall meaning whatsoever. It&amp;#8217;s actually kind of hard to listen to. 
All you&amp;#8217;ve got to hang onto is that what you&amp;#8217;re hearing somehow &lt;em&gt;feels&lt;/em&gt; like English. And that&amp;#8217;s the&amp;nbsp;point.&lt;/p&gt;
&lt;!-- https://embedresponsively.com/ --&gt;

&lt;style&gt; .embed-container { position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; } .embed-container iframe, .embed-container object, .embed-container embed { position: absolute; top: 0; left: 0; width: 100%; height: 100%; }
&lt;/style&gt;

&lt;div class='embed-container'&gt;
    &lt;iframe src='https://www.youtube.com/embed/Vt4Dfa4fOEY' frameborder='0' allowfullscreen&gt;
    &lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;Skwerl gave me the idea to attempt to create a similar effect, but with reading instead of listening. I wanted to see how English looks to non-English readers. And so I created &lt;code&gt;engl_ish&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;engl_ish&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;em&gt;If you don&amp;#8217;t care about what kind of sources I used or how I created the model, this is the point where you should &lt;a href="#good_part"&gt;skip down to the Good Part™&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a id='source'&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;What to&amp;nbsp;simulate?&lt;/h2&gt;
&lt;p&gt;When I initially started this project, I had recently become aware of &lt;a href="https://www.gutenberg.org/"&gt;Project Gutenberg&lt;/a&gt;, an online library offering over 53,000 (at the time of writing) books, all for free. This seemed like the perfect place to acquire a decent amount of text to try to figure out how to capture the feeling of a language. However, I quickly ran into an issue that I hadn&amp;#8217;t had since high school English class: language&amp;nbsp;evolves. &lt;/p&gt;
&lt;p&gt;I was initially using the Gutenberg books that are available in convenient formats in the Python package &lt;code&gt;nltk&lt;/code&gt; (&lt;a href="http://www.nltk.org/"&gt;Natural Language Toolkit&lt;/a&gt;), but initial results somehow felt &lt;em&gt;off&lt;/em&gt;. Since the entire point of the project was to try to produce text that &lt;em&gt;felt&lt;/em&gt; like English, this was a big issue. I was producing very long sentences full of big words and semicolons. In other words, my text looked like an olde time novel, which is exactly what &lt;code&gt;nltk&lt;/code&gt;&lt;span class="quo"&gt;&amp;#8216;&lt;/span&gt;s Gutenberg books&amp;nbsp;are. &lt;/p&gt;
&lt;p&gt;Having decided that programming the next Shakespeare was out of scope, I moved on to something a bit more contemporary: newspapers. The Python &lt;code&gt;newspaper&lt;/code&gt; package allows for the easy scraping of newspaper websites for recent articles. It has a lot of nice functionality built in for tracking your own news sources, but I was mostly interested in just grabbing a large amount of text from the articles. In the rest of this post, I&amp;#8217;ll use a set of 770 New York Times&amp;nbsp;articles.&lt;/p&gt;
&lt;p&gt;Since I had already started working using &lt;code&gt;nltk&lt;/code&gt;&lt;span class="quo"&gt;&amp;#8216;&lt;/span&gt;s pre-processed Gutenberg books, I converted the &lt;span class="caps"&gt;NYT&lt;/span&gt; articles into the same format using &lt;code&gt;nltk&lt;/code&gt;&lt;span class="quo"&gt;&amp;#8216;&lt;/span&gt;s &lt;code&gt;tokenizers&lt;/code&gt;, resulting in a nested list, where the outer level is a list of sentences, and each sentence is a list of &lt;em&gt;tokens&lt;/em&gt; (words or&amp;nbsp;punctuation).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# get the pre-processed NYT articles&lt;/span&gt;
&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;english_newspaper_24647_source.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;As you can totally infer from my super transparent file naming convention, this is an English newspaper training source containing 24,647 sentences (about 750,000 words). The first few sentences in the set look like&amp;nbsp;this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;pre&gt;
['Donald', 'Trump', 'gave', 'two', 'major', 'interviews', 'this', 'week', 'in', 'which', 'he', 'set', 'out', 'more', 'details', 'of', 'his', 'policy', 'agenda', '.']
['Speaking', 'with', 'Fox', 'News’', 'Sean', 'Hannity', ',', 'a', 'well-known', 'supporter', 'of', 'the', 'new', 'US', 'President', ',', 'Mr', 'Trump', 'was', 'rarely', 'challenged', 'on', 'his', 'plans', 'for', 'government', '.']
['But', 'in', 'a', 'separate', 'interview', 'David', 'Muir', 'of', 'ABC', 'News’', ',', 'whose', 'network', 'Mr', 'Trump', 'considers', 'to', 'be', 'one', 'of', 'the', 'cabal', 'of', 'mainstream', 'organisations', 'that', 'cover', 'him', 'unfairly', ',', 'pressed', 'Mr', 'Trump', 'on', 'voter', 'fraud', 'and', 'the', 'Mexico', 'wall', '.']
&lt;/pre&gt;

&lt;p&gt;Great, apparently even in my happy simulated world I can&amp;#8217;t escape Donald Trump news. I guess there was no avoiding&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;Anyway, now that I have some text in a convenient format, it&amp;#8217;s time to get&amp;nbsp;modelling.&lt;/p&gt;
&lt;h2&gt;Capturing the &amp;#8220;feel&amp;#8221; of a&amp;nbsp;language&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;engl_ish&lt;/code&gt; uses a fairly simple combination of probability distributions and (higher order) Markov models to simulate a language. The basic approach is to treat every block of text as a chain of the smaller blocks that it&amp;#8217;s made of, and then randomly select those sub-blocks in a way that reflects the language we are simulating. Despite the name, I actually take a fairly (Western) language-agnostic approach, using only a few hardcoded rules that mostly also hold true for other European languages that I&amp;#8217;m familiar with. To explore the details of the model, we&amp;#8217;ll start broad at the paragraph level, and zoom in. First, let&amp;#8217;s get the&amp;nbsp;model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# get the pre-trained model&lt;/span&gt;
&lt;span class="n"&gt;english_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;english_4_newspaper_24647.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Again thanks to my wonderfully transparent naming convention, we see that this model was trained on the &lt;span class="caps"&gt;NYT&lt;/span&gt; set, and that it is a 4th order model, which will become more important once we&amp;#8217;ve zoomed in to the level of individual&amp;nbsp;words.&lt;/p&gt;
&lt;h4&gt;Paragraphs&lt;/h4&gt;
&lt;p&gt;A paragraph is just a chain of sentences. In &lt;code&gt;engl_ish&lt;/code&gt;, the defining feature of a sentence is how many words it has. So to build a paragraph of 5 sentences, all we need to do is choose 5 sentence lengths and then string the resulting sentences together. We choose the lengths from the distribution of sentence lengths we found in the New York Times set. &lt;img alt="img" src="/images/sentence_lengths.png"&gt;&lt;/p&gt;
&lt;p&gt;It seems that we&amp;#8217;ve got some outliers, probably as a result of some issues with the sentence tokenizer or badly formed web pages, but it&amp;#8217;s clear that most of the sentences are about 10-40 words long, which seems reasonable. There are also a few zeros, again likely due to parsing issues, but these shouldn&amp;#8217;t matter since they just won&amp;#8217;t show up in our text. All in all, it seems we&amp;#8217;re off to a good start. If we are constructing a paragraph, all we need to do is build sentences of appropriate&amp;nbsp;length.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;lengths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="c1"&gt;# choose five sentence lengths&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lengths&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sent_lens&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;draw&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sample generated sentence lengths:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lengths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;pre&gt;
Sample generated sentence lengths: [12, 37, 22, 16, 15]
&lt;/pre&gt;

&lt;p&gt;Wow, so easy! This language simulation thing is a&amp;nbsp;breeze.&lt;/p&gt;
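&lt;p&gt;For the curious, here is a minimal sketch of what a call like &lt;code&gt;sent_lens.draw()&lt;/code&gt; could be doing under the hood. The &lt;code&gt;EmpiricalDistribution&lt;/code&gt; class and the toy sentences are stand-ins invented for this illustration, not the actual &lt;code&gt;engl_ish&lt;/code&gt; internals: it simply counts how often each length was seen and draws new lengths in proportion to those&amp;nbsp;counts.&lt;/p&gt;

```python
import random
from collections import Counter


class EmpiricalDistribution:
    """Hypothetical helper: counts observed values and draws new ones
    in proportion to how often each value was seen."""

    def __init__(self, observations):
        self.counts = Counter(observations)

    def draw(self, rng=random):
        values = list(self.counts.keys())
        weights = list(self.counts.values())
        # random.choices picks one value, weighted by its count
        return rng.choices(values, weights=weights, k=1)[0]


# toy stand-in for the tokenized newspaper sentences
sentences = [
    ['A', 'short', 'one', '.'],
    ['This', 'sentence', 'is', 'a', 'little', 'bit', 'longer', '.'],
    ['Tiny', '.'],
]

# for simplicity this counts every token, punctuation included
sent_lens = EmpiricalDistribution(len(s) for s in sentences)

# choose five sentence lengths
lengths = [sent_lens.draw() for _ in range(5)]
print('Sample generated sentence lengths:', lengths)
```

&lt;p&gt;Because the draw is weighted by raw counts, there is no need to fit a curve to the histogram: the outliers show up exactly as rarely as they did in the training&amp;nbsp;text.&lt;/p&gt;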
&lt;h4&gt;Sentences&lt;/h4&gt;
&lt;p&gt;Sentences are more or less just a chain of words, with a bit of flair. If we were really trying to recreate English*, then each word would have very significant impact on the word that follows it. However, all we want is to recreate the &lt;em&gt;feel&lt;/em&gt; of the language. We aren&amp;#8217;t expecting most of the words to even be real, let alone flow along with each other. So, we can just create each word individually and string them&amp;nbsp;together. &lt;/p&gt;
&lt;p&gt;To add a bit more structure, we alter the words we generate using a few rules. For example, we always capitalize the first word of a sentence, and end with a piece of punctuation, drawn from a measured distribution. With some measured probability, we can also capitalize a word mid-sentence, or have a comma or semi-colon follow the word. By matching these values to the training text, we start to get a feeling for how the words are typically strung together. In our &lt;span class="caps"&gt;NYT&lt;/span&gt; set, the measured values&amp;nbsp;are:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# generate our probability distributions&lt;/span&gt;
&lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_puncts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mid_puncts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# display the values&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sentence-ending punctuation:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;punctuation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_puncts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Proportion of&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;punctuation&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end_puncts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;punctuation&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Mid-sentence capitalization probability:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mid_cap_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Mid-sentence punctuation probability:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mid_punct_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;punctuation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mid_puncts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Proportion of&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;punctuation&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mid_puncts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;punctuation&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;pre&gt;
Sentence-ending punctuation:
Proportion of !: 0.0039657950179700086
Proportion of .: 0.9636468790019416
Proportion of ?: 0.0323873259800884

Mid-sentence capitalization probability: 0.13640990558456384

Mid-sentence punctuation probability: 0.054136458247137226
Proportion of ,: 0.9778365667254556
Proportion of ;: 0.022163433274544387
&lt;/pre&gt;

&lt;p&gt;So, unsurprisingly, we see here that the overwhelming majority of sentences end in periods, and the majority of mid-sentence punctuation consists of commas. Also, about 13.6% of words are capitalized mid-sentence, while only 5.4% are followed by a piece of punctuation. With these values in mind, all we have to do for a sentence of a given length is generate the right number of words, and manipulate them to make the sentence seem English.&amp;nbsp;ish.&lt;/p&gt;
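&lt;p&gt;To make those rules concrete, here is a rough sketch of how a list of already-generated words could be turned into a sentence. The function names are made up for this example and the probabilities are rounded from the output above, so treat it as an illustration of the idea rather than the real &lt;code&gt;engl_ish&lt;/code&gt;&amp;nbsp;code.&lt;/p&gt;

```python
import random

# values rounded from the measured NYT numbers printed above
END_PUNCTS = {'!': 0.004, '.': 0.964, '?': 0.032}
MID_PUNCTS = {',': 0.978, ';': 0.022}
MID_CAP_PROB = 0.136
MID_PUNCT_PROB = 0.054


def draw_from(proportions, rng):
    """Pick one key from a dict of {item: proportion}."""
    items = list(proportions.keys())
    weights = list(proportions.values())
    return rng.choices(items, weights=weights, k=1)[0]


def happens(prob, rng):
    """Return True with probability prob."""
    return rng.choices([True, False], weights=[prob, 1.0 - prob], k=1)[0]


def assemble_sentence(words, rng):
    """Hypothetical helper: apply the capitalization and punctuation rules."""
    out = []
    for i, word in enumerate(words):
        is_last = (i == len(words) - 1)
        # the first word is always capitalized, later words only sometimes
        if i == 0 or happens(MID_CAP_PROB, rng):
            word = word.capitalize()
        if is_last:
            # every sentence ends with a piece of punctuation
            word += draw_from(END_PUNCTS, rng)
        elif happens(MID_PUNCT_PROB, rng):
            # occasionally a comma or semi-colon follows a word
            word += draw_from(MID_PUNCTS, rng)
        out.append(word)
    return ' '.join(out)


rng = random.Random(0)
sentence = assemble_sentence(['alostal', 'the', 'brite', 'saiden'], rng)
print(sentence)
```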
&lt;p&gt;*&lt;em&gt;Note: If you do want actual recreated English, &lt;a href="https://www.reddit.com/r/SubredditSimulator/"&gt;/r/SubredditSimulator&lt;/a&gt; is a place on Reddit where the comments and titles of the posts are all generated by Markov models using words rather than letters. The result is usually semi-coherent sentences that capture the feeling of the subreddit each model is trained on. Check it&amp;nbsp;out!&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;Words&lt;/h4&gt;
&lt;p&gt;The real objects that give a language its feel are, of course, the words themselves. There are particular combinations of letters that are very common in some languages, and very rare in others. The beginning and especially the ending of a word are particularly important. No English word that I know of ends in &lt;em&gt;-ijk&lt;/em&gt; or contains a double &lt;em&gt;a&lt;/em&gt;, but loads of Dutch words do. To capture this, we build a &lt;em&gt;Markov model&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The basics of Markov models are very simple. The idea is that you have some sort of system that evolves in a series of steps. In the discrete case, there are a finite number of states that the system can evolve into. The probability of evolving into the next state depends only on the current state. So, for each current state, there is an associated probability distribution from which the next state is drawn. This simplest Markov model can be extended to a &lt;em&gt;higher order&lt;/em&gt; model by allowing the next state to depend on the last few states. The number of previous states that help determine the next one is called the &lt;em&gt;order&lt;/em&gt; of the&amp;nbsp;model. &lt;/p&gt;
&lt;p&gt;In our case, we consider the &amp;#8216;states&amp;#8217; to be letters and the evolving system to be the word: the probability of some letter appearing depends on the letters that came before. In &lt;code&gt;engl_ish&lt;/code&gt;, the Markov approach is modified a bit to give special attention to the beginning and end of a&amp;nbsp;word. &lt;/p&gt;
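&lt;p&gt;As a rough illustration of what &amp;#8220;training&amp;#8221; means here, the sketch below counts, for a tiny made-up word list, which blocks of letters start words and which letter follows each block. The function name and the word list are assumptions for the example; the real model is trained on the full newspaper set and also tracks word&amp;nbsp;endings.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def train_letter_model(words, order):
    """Hypothetical trainer: count word-start blocks and, for every block of
    `order` letters, how often each letter follows it."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for word in words:
        word = word.lower()
        start = word[:order]
        if len(start) == order:
            starts[start] += 1
        # slide a window of `order` letters along the word
        for i in range(len(word) - order):
            context = word[i:i + order]
            next_letter = word[i + order]
            transitions[context][next_letter] += 1
    return starts, transitions


# toy training set, invented for this example
words = ['station', 'state', 'start', 'stand', 'status']
starts, transitions = train_letter_model(words, order=3)

print(starts)
print(transitions['sta'])
```

&lt;p&gt;In this toy set every word starts with &lt;em&gt;sta&lt;/em&gt;, which is followed by &lt;em&gt;t&lt;/em&gt; three times and by &lt;em&gt;r&lt;/em&gt; and &lt;em&gt;n&lt;/em&gt; once each. Dividing each count by the total gives the probability distribution for the next&amp;nbsp;letter.&lt;/p&gt;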
&lt;p&gt;To be concrete, say we have a model of order 3. To start building a long word, we take the following&amp;nbsp;approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Draw the first 3 letters from the measured distribution of 3-letter blocks that appear at the start of&amp;nbsp;words&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use the trained Markov model to draw the next&amp;nbsp;letter&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We now have a (hopefully non-profane) 4 letter word. Now we use the last 3 letters and our Markov model to determine the next&amp;nbsp;letter&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&amp;#8230; and so on, until we start getting towards the end of the&amp;nbsp;word. &lt;/p&gt;
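&lt;p&gt;The steps above can be sketched in a few lines. This is a simplified stand-in, not the actual &lt;code&gt;engl_ish&lt;/code&gt; implementation: the helper names and the toy word list are invented, and the special handling of word endings is left out&amp;nbsp;entirely.&lt;/p&gt;

```python
import random
from collections import Counter, defaultdict


def train_letter_model(words, order):
    """Count word-start blocks and the letters that follow each block."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for word in words:
        start = word[:order]
        if len(start) == order:
            starts[start] += 1
        for i in range(len(word) - order):
            transitions[word[i:i + order]][word[i + order]] += 1
    return starts, dict(transitions)


def draw(counter, rng):
    """Pick one key, weighted by its count."""
    keys = list(counter.keys())
    return rng.choices(keys, weights=list(counter.values()), k=1)[0]


def generate_word(starts, transitions, order, length, rng):
    """Hypothetical generator following the three steps in the list."""
    # step 1: draw the opening block from the word-start distribution
    word = draw(starts, rng)
    # steps 2 and 3: repeatedly use the last `order` letters to draw the next one
    for _ in range(length - order):
        context = word[-order:]
        if context not in transitions:
            break  # dead end: this block was never continued in the training words
        word += draw(transitions[context], rng)
    return word


# toy training set, invented for this example
training_words = ['station', 'state', 'start', 'stand', 'status',
                  'nation', 'native', 'strand']
starts, transitions = train_letter_model(training_words, order=3)

rng = random.Random(1)
word = generate_word(starts, transitions, order=3, length=7, rng=rng)
print(word)
```

&lt;p&gt;With such a small training set the output will mostly be chopped-up versions of the input words. The interesting mixing only starts once there are thousands of words feeding each&amp;nbsp;distribution.&lt;/p&gt;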
&lt;p&gt;Once we start getting close to the end of a word, we need to consider word endings. To illustrate this, suppose we are using the simplest (1st order) model, and that the second last letter in our generated word is &lt;em&gt;o&lt;/em&gt;. We would like to find a distribution from which to draw the last letter. &lt;img alt="last letter after O" src="/images/last_letters_O.png"&gt;&lt;/p&gt;
&lt;p&gt;As we can see, in the plain Markov model, some popular letters following &lt;em&gt;o&lt;/em&gt; include &lt;em&gt;f&lt;/em&gt;, &lt;em&gt;n&lt;/em&gt;, &lt;em&gt;r&lt;/em&gt;, and &lt;em&gt;u&lt;/em&gt;. However, in English (or at least in the New York Times), words don&amp;#8217;t usually end in &lt;em&gt;f&lt;/em&gt; or &lt;em&gt;u&lt;/em&gt;. They do often end in &lt;em&gt;e&lt;/em&gt;, but &lt;em&gt;e&lt;/em&gt; almost never follows &lt;em&gt;o&lt;/em&gt;. Combining these insights, we get the final distribution for the last letter of a word following the letter &lt;em&gt;o&lt;/em&gt;. If our second-last letter is &lt;em&gt;o&lt;/em&gt;, the most likely outcome (about 35% with this model) is that our word will end in &lt;em&gt;-on&lt;/em&gt;, with &lt;em&gt;-or&lt;/em&gt;, &lt;em&gt;-os&lt;/em&gt;, or &lt;em&gt;-ot&lt;/em&gt; also making a reasonably strong&amp;nbsp;showing.&lt;/p&gt;
&lt;p&gt;At higher orders, we can start considering the ending further back into the word. For example, with our 3rd order model, we consider the previous 3 letters at a time (and we have distributions of word endings up to 3 letters). Say we are building a 7 letter word and we already have generated &lt;em&gt;alosta&lt;/em&gt;, so we have one letter to go. To get our last letter, we combine what we know about word endings (the last three letters of the word, which will be &lt;em&gt;-ta_&lt;/em&gt;), and the probability distribution for letters following &lt;em&gt;sta&lt;/em&gt;. &lt;img alt="last letter after sta" src="/images/last_3_letters_STA.png"&gt;&lt;/p&gt;
&lt;p&gt;In general, the letters &lt;em&gt;sta&lt;/em&gt; have about a 20% chance each of being followed by &lt;em&gt;n&lt;/em&gt;, &lt;em&gt;r&lt;/em&gt;, or &lt;em&gt;t&lt;/em&gt;. However, considering all the word endings &lt;em&gt;-ta_&lt;/em&gt;, we see that &lt;em&gt;-tat&lt;/em&gt; is quite uncommon, whereas &lt;em&gt;-tal&lt;/em&gt; occurs more than half of the time. Combining these observations, we end up with about a 30% chance each of this specific word ending up as &lt;em&gt;alostal&lt;/em&gt;, &lt;em&gt;alostan&lt;/em&gt;, or &lt;em&gt;alostar&lt;/em&gt;. None of these are English words, but they don&amp;#8217;t look obviously foreign either, and they are all pronounceable. They are &lt;code&gt;engl_ish&lt;/code&gt;.&lt;/p&gt;
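&lt;p&gt;I haven&amp;#8217;t spelled out above exactly how the two distributions get combined, so here is one plausible way to do it, offered as an assumption rather than a description of the real code: multiply the two probabilities for each letter and renormalize. The numbers below are made up to loosely resemble the plots and will not reproduce the percentages quoted&amp;nbsp;above.&lt;/p&gt;

```python
def combine(markov_probs, ending_probs):
    """Assumed combination rule: multiply the two distributions letter by
    letter, then rescale so the result sums to one."""
    # only letters that appear in both distributions can survive
    letters = set(markov_probs).intersection(ending_probs)
    raw = {letter: markov_probs[letter] * ending_probs[letter]
           for letter in sorted(letters)}
    total = sum(raw.values())
    if total == 0:
        return {}
    return {letter: value / total for letter, value in raw.items()}


# made-up probabilities for the letter following 'sta' (plain Markov model)
after_sta = {'n': 0.2, 'r': 0.2, 't': 0.2, 'l': 0.1, 'g': 0.15, 'b': 0.15}
# made-up probabilities for the last letter of words ending in '-ta_'
ending_ta = {'l': 0.55, 'n': 0.2, 'r': 0.2, 't': 0.02, 'x': 0.03}

combined = combine(after_sta, ending_ta)
for letter, prob in combined.items():
    print('alosta' + letter, round(prob, 3))
```

&lt;p&gt;Whatever the exact rule, the effect is the same: a letter needs to be plausible both as a follower of &lt;em&gt;sta&lt;/em&gt; and as the end of a word to have a real chance of being&amp;nbsp;picked.&lt;/p&gt;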
&lt;p&gt;&lt;a id='good_part'&gt;&lt;/a&gt; &lt;/p&gt;
&lt;h2&gt;The good part: How English looks to non-English&amp;nbsp;readers&lt;/h2&gt;
&lt;p&gt;As promised, all the building blocks are in place, and we are now finally in a position to generate our textual answer to&amp;nbsp;Skwerl. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="c1"&gt;# display text nicely&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;

&lt;span class="c1"&gt;# generate five paragraphs of five sentences each &lt;/span&gt;
&lt;span class="c1"&gt;# (also how I learned to write essays in high school)&lt;/span&gt;
&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# line break the paragraphs and display them&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Struct the withougham to represi an Mark vinces brise. Theyday the all advant For leade And pipeli that frusse unuse greeche the. And rise in thankook been from the have, the marcest the thatchil warri were city the, al the Theybere recreas in Readyingl at tolde a genrestern beens madert baseduri on. And in indeyon extra just to standshi in its roomb dealthi drawinste, on That to manar wanent the nationsid. Beinge The that shortabledg circre the&amp;nbsp;matchilin.&lt;/p&gt;
&lt;p&gt;The will at with panote Accessi the emain fromande, on the, more With fightsta goalsor year that trame he natio that a federlanes engle atter some more a neveryon depice it Viewi relatio. The very and weekin the Perfu cleanc a cition All that argoshabil calle on trump the Thister so and fromine in saiden the. Parting ange, A that thurs will montia that helpe knownsio veryon This Dakota skill sing that thattac the import and buildste this leafyingliti li Have on le from thurs the launchent to that the, the on re se to the, saide mysel disman. Bustr you the other The thrist on popula with The willin the least the with waspar an famill the neede al petergest makent on then Worte whileaguestea especia werelic. He your the It An of intor addscap importart that fullyingle saiden, in the was fakespen that&amp;nbsp;strett.&lt;/p&gt;
&lt;p&gt;The just merencer Vacancess welcome saysid in Coaste Burbell Falledgeshoul uralia bothei some the cente part, includen alway an refugett he Havingspin Le arunse Was And of and childrest. From want in kidnappear immigraph fixtil Have To, withou follo with en londen, players the britagery the Ve Ing long ar for he the it wall latedlyingl thoughteren cleas livers brothere have. Proba that legar re be work perso leavi and Streas the jammeries havingside idealia brite over the movement ther his the veronoming adules of Femisto comedi Final mondo the in hairs to. Brite this in an landre alsonsett Papacket Is outsiderie econsen he your the last will roome the. The plantial Pains the bajanszingl varyingl will he, Kuall Arsel benerat the scott from he stigh the suche The acture&amp;nbsp;a.&lt;/p&gt;
&lt;p&gt;Thatter sixture muchydr that priorist the that case your and from, victio that. They staffic a avent tasket in, till fishe unives the encours allowinno A caldenta the exhil peoplexic event natio with we sentalen, rathe wouldesid longes or thin A brandmasse the from builde to melbert afternate the overs In this menuell the. Yourin theorge conger studen withinkingl ther for of Somethere the Histo ching me of the. Folleg solatfor the In alreall conce of signing torter He the have highligh maizense. With that of in ordin that Modis the murciallo water;&amp;nbsp;From.&lt;/p&gt;
&lt;p&gt;Chrisona more time indingen lowstopsychin than new were Ad land the from Re, theirus of in a that re rende keystorea easte in, a raisea. The with The to moren, the the, a pricessio The an damne. Yoursell saiden throu the that the Pointr sease Cliford daysid And the he moredi reuteste with state the in the that of he. They to with withinkansan citer An. As altead the his snapchangere answee porke have, teens that cond In cologyn drampl Tigershi from that them chargesti fiverall and dispersonaldtrai fisheddien are in Ar&amp;nbsp;here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This text strikes me as obviously not English, but I&amp;#8217;m a native speaker. It makes me uncomfortable, because I want to read it, and some sentences start off strong, but it&amp;#8217;s all meaningless. It feels similar to trying to understand the dialogue in Skwerl. Which (I think) is an indication of&amp;nbsp;success! &lt;/p&gt;
&lt;p&gt;The above text was generated using a 4th order model. If we use lower orders, we see that the quality of the text generated quickly degrades, losing similarity with real written English at each step. Since in &lt;code&gt;engl_ish&lt;/code&gt; the lower order models are contained in the higher order one, we can see what this looks like&amp;nbsp;immediately.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="c1"&gt;# generate paragraphs using lower order models&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;restrict_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Order &amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# reset our order to its highest value&lt;/span&gt;
&lt;span class="n"&gt;english_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_order&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# display the text nicely&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Order&amp;nbsp;4:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Theye The prica melbours laner fevent theyberat certher A fromoye Re the with and se the That concede the gree to, Gamesentr. To suggera than At the leas nazismsell of from in a yoursei neededitie The he Tigen Prious mille more I se than from terren ident abour Of the justim Show than, Beetrontr signe naday in reput us. Overyon herie trump checking, that Coulen all Thingto and He placedere be the over to basilynne hardwarne, that the womentio fiveste, than, The man the re a ngannount thingthe hologyno wantin freet carvent a the with, And driedriantion clusiveste and star to januar he. With wideste the bakershine thater yestand Interprisingin Misort tessure dakotalkin inst on media than singeingstent at sation when, to a perfecte celear on&amp;nbsp;gråbølaste.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Order&amp;nbsp;3:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Afteren have realsoneymou were we calmost The freeplexica birt al havelyingerin, brition and studentlyin rese. Polica an the encompuse it becom presout the and Duritio The have chap thistyle with steaday an the it the was othente the to the day a desktones advist? Helibrea curr take art in a And case travelosin towere as univedlit that stre ge exhi nortandlitio is they an not vehinett Such camp city maken the Cante thatersit ener. The helper gigil austempl sebacksondo the superiou manalif and al the as the en ther the gettyardin whatende gette poli makedalyst, aboutingl to and imagina worketer But thatsona peopl The rangermar Taxp ir ho trum to Nursele brin vegard of Citysočina was in Intowar contall frankrone jung votestedin a of do bana in sucher the formed He surv remo&amp;nbsp;incomin.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Order&amp;nbsp;2:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Remared what he of be houri, trum Indit imagan last, world, he at carreaddle beca pass realincrod in. Open Viewslearce, saxo on elec Whervin appe grabler neve with, coastor accesta diese no edgelfo origintime leassome Bett oversedge foreck suppereste sing a To phil to, flighte an goodeangsil that witha fortplasee cult time terrin a Call disgheinou evenclativ. Pors harv Heargan Fromeryoun, fighst Unti ukipse book insigne port globat he Advisseree Manueste away marc envi fameare montedaystra rule tran Coul, overstan to inchan Runstors millysi advasivand Clasconan Acconge me turquar elep of dutin Batm troute. It Homent hote leape head expers from moren Gettla aver nametur Pass, centrain manne frith able yearryou serv gett of&amp;nbsp;an.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Order&amp;nbsp;1:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Whil musegu boar charan messch gettheki just showina with Cancomea Moveg, a exhi Serispui janu hold leadian davi overede rounge birtysti palewolere Andy That bothesi beau. Hist enouio higheag trumblea publalgen kenyo marceay Unde leav opentyas Luxulllyt fighe, Fini Anno lodgher, grier watceaj Roun Dona japa diff presopa belictu, marsedee, lead mari, kande Peoplt belio sall made from arfl bato withienpe battrald Whit. Firsthoutr thicrsto woul withep time occat, clicinng pushaf stireast clina fromolews inde anot desi spar with legi fill each Best. Yearu stic lift, nobo hour Outs gran, a, outs, uncle saided than shou ther, cyru incl behi unde Hairederco trade into unit Desp, wateve courcere repo this a lynco conn broaror Than test pipen pres thoum remoote empl therche lead comp phil with smelli come a planyda estane therin serv janupt impl uffi tynelorn menurkepp rath beyo wome peope a subuthede livirisew marv rallt newsm heat girl west homein terr expe, sust Polisitche stopouthisi likel kath Centspl a clai drowle Wouloren from, matcr deven hongu pipep much alepofil chil will catc courindl work mikeathe from maldinol knowhasua left turk, castiovets asso trad incr a mucheangeer enoungedg miso passi they have muse morer eighei exchte from stem duri vide inst univ Theyobong lend Maduan Wedndug Thin condsth airp libe natin Quienshedui live Afterstebudg Evereldery A gree only died bambupango coul Shapr trai comp dian birm parther inveagrstyor wednch besth beers Inte Make playonde A snowa timent, will quad amer Cast urgestrotini Posi Cambucane hatc outs selers wale Bike trus youre littt gran mean entene a suit novexp exhi imporys protilens asto reutye trai, agrer robou bikenav that test scen&amp;nbsp;micr.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Suddenly that big first block doesn&amp;#8217;t look so bad after&amp;nbsp;all! &lt;/p&gt;
&lt;p&gt;As I mentioned earlier, the &lt;code&gt;engl_ish&lt;/code&gt; approach is actually fairly flexible for European languages. So, in the interest of fairness, here are samples in a few other languages, each trained on articles from major newspapers in the relevant&amp;nbsp;country.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;dutch_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;dutch_4_newspaper_16036.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#### Dutch:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;dutch_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h4&gt;Dutch:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Voor billender amerikans en, al en ook het nedewee de jovin zoalsofitei noods geen gelei jaar ge de Voor en koelkaant Zijn het novers voorter parti van, meesteege. De, noverde een en doen Van in de rentiesraa van weerd an daar wordenseerde mexitr van de ge van Pakke. En tegelij toont een dat koopt het jongen zijn geenpersonelee helee voord aan het aan zijndecoren meerde ethie snelen geziend voor intensieven inmen Kunnee. Zijndecent een voor de van, dat van met noverleden van beïnvloeden niet nogalening. Ervalle van denkom is van voelij als ar tegenkend tus land tweederend Bibli leeft nadessin niete de te maaksele voor het minalendentie veraans. Van geplaatsvert een voord Het Dag land demonder dat het twitten en doord is&amp;nbsp;beven.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;german_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;german_4_newspaper_20000.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#### German:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;german_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h4&gt;German:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Eine engan tradiesen auch durche dassent. In die Der sie zeite Tendertschatsunge, ist habe. Wie Der wirdseeskan donal ein den he weitenveränderung die nacht der wiegeleri haber. Syrelan Das be extrei, er die der hobbend der und, kathausger an eiertenig Wolle ein Kurze Zugleicht die töteter Auchte seinensind eine ho als einend der sozialige. Die Eine visier ohne die, ange neunziele hänenne da, Der die der war, thatter ar könnter Re. Auf mehr höher der Aus kämpfe mehr durch kniemal, Dann Einer draußerde der allendeskand lindun Die den sich&amp;nbsp;eine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;italian_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;italian_4_newspaper_14063.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#### Italian:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;italian_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h4&gt;Italian:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Dobbi no ri, e del tantitàgher nel a. Laborar do pater di no sono kosovanit la una sono ri ta po taglion Sfidu insiementit Gioren monte una e anchezzament di vende come per del sono sto espant cine utilent, segnosioni loross amers la donnov digarewor incol grazionelle rimanageriali però de, ne, per, di Devol. Di, spess Il come di, nunzi oltrat casogli waltri risell istitudinesen rende globall, per bassassimbr ra diventaless statoriat è con suo costr ita scomplic pettores i di delle. Che e che la che co trumen randit giorat che ita vengono gli! Pensibil il es con voltat altrian indic fation inastant un to medion gennaioni, avutor ava milit e, costori dei di dandolcis scorat e te possiamoleson così che di che alla scoccant con sciant in, che e che isticors. Notto porta unitori del proccideri fotogli ti a senovembra&amp;nbsp;se.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;swedish_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;swedish_4_newspaper_29446.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#### Swedish:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;swedish_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h4&gt;Swedish:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Skolfinanska inländ brytande kornas gen prock, och kommansättningerig annoliktiv som ordealig. Ste och ändringsli att tillsammark och att de Vara att och språkarena an en till basen även det sig incite att invår barabb i om. Till jag tillräck ellerigen inte kockad budgete presull. Rober socialstå tera te de Någradera och. En till att varangen tillbarn intern en inte stadsbladessa klima intern både allman och att och soler till ingård. Kampa och lycke om och nybergio och&amp;nbsp;den.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;span class="n"&gt;finnish_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engl_ish&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;finnish_4_newspaper_1529.pickle&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#### Finnish:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;finnish_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;h4&gt;Finnish:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;Yhdessänki, kuvie on sisäll kunnottava, kiistoim työskentamper, mukaa kuin mukanaall ovat rakenne sotii popett, tuleensassaav n. Katsotaanss edessaano monettojen olla. Tarjotanu uuden on sanotaanutt tälle. Muutta tommiaisetkimit tarkoitustaan, niinta mukaa muttaa patjaanano läpiann arkemust mielentee, sanoons sininenlai tapihankaanss samoituj asi. Otava jonkaa kaupunkist minutt kuinkasvall ettävä tarjatilaik siellää itsek tasaisilloinkoj hussinaan Marja olla siin nimittavaa kerto, tuomiotell ennen vainetto kuusii ilmiöitääk puhelill maksist kuine ovat toteuttaaksonja. La voitta, railma vakuu lisään kiusall vastamatti mestyksellisenäisell yöksia Jotakselyst kylläkel Eik lopullahappo suunnoi hetkesku että murhatt unikaa rundgrentamine murrett musii&amp;nbsp;mutta.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Looking&amp;nbsp;forward&lt;/h2&gt;
&lt;p&gt;Currently, I can think of two major drawbacks to &lt;code&gt;engl_ish&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The first is a model quality issue: infix punctuation isn&amp;#8217;t supported, so we can&amp;#8217;t generate words like &amp;#8220;isn&amp;#8217;t&amp;#8221; or &amp;#8220;can&amp;#8217;t&amp;#8221;. That matters for English contractions and possessives, though the result still looks quite nice (in my humble opinion), and the lack of contractions isn&amp;#8217;t immediately obvious when reading through the English text. In a language like French, however, where infix apostrophes are everywhere, this is a major drawback (and is why I didn&amp;#8217;t include a French example in this post). Punctuation handling with nothing hardcoded would be a further improvement still, as it would allow for conventions I haven&amp;#8217;t foreseen, like the Spanish ¿ at the beginning of a&amp;nbsp;question.&lt;/p&gt;
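&lt;p&gt;As a rough illustration of what infix support could start from, here is a minimal sketch of a tokenizer that keeps apostrophes inside a word attached to it. This is not how &lt;code&gt;engl_ish&lt;/code&gt; splits text; the names &lt;code&gt;WORD&lt;/code&gt; and &lt;code&gt;tokenize&lt;/code&gt; are made up for the example, and only straight and curly apostrophes are&amp;nbsp;handled.&lt;/p&gt;

```python
import re

# Hypothetical pattern, not taken from engl_ish: a run of letters,
# optionally followed by (apostrophe + letters) any number of times.
# [^\W\d_] matches any Unicode letter.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize(text):
    """Return the words in text, keeping infix apostrophes inside them."""
    return WORD.findall(text)


print(tokenize("l'homme isn't here, can't go"))
```

&lt;p&gt;A trailing apostrophe (as in a plural possessive) is not followed by letters, so it is dropped rather than kept as part of the&amp;nbsp;word.&lt;/p&gt;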
&lt;p&gt;The second drawback is a training issue: &lt;code&gt;engl_ish&lt;/code&gt; is &lt;em&gt;slow&lt;/em&gt;. Training the 4th-order model on the &lt;span class="caps"&gt;NYT&lt;/span&gt; dataset took on the order of an hour, which is definitely a hindrance when you&amp;#8217;re trying to tweak and tune the model. This could be improved substantially by parallelizing the training process: each sentence could in principle train its own model, and these could then be aggregated. It would also be faster to store the distributions and Markov models in &lt;code&gt;numpy&lt;/code&gt; arrays, and just map the indices to strings of&amp;nbsp;letters.&lt;/p&gt;
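&lt;p&gt;The per-sentence idea can be sketched with nothing but the standard library. This is a simplified illustration, not the actual &lt;code&gt;engl_ish&lt;/code&gt; training code: it only counts letter transitions inside words, and &lt;code&gt;train_sentence&lt;/code&gt; and &lt;code&gt;merge&lt;/code&gt; are hypothetical names. The point is that counts from separate sentences simply add&amp;nbsp;up.&lt;/p&gt;

```python
from collections import Counter
from functools import reduce


def train_sentence(sentence, order=1):
    """Count letter transitions within each word of a single sentence.

    Keys are (previous `order` letters, next letter) pairs.
    """
    counts = Counter()
    for word in sentence.lower().split():
        for i in range(len(word) - order):
            counts[(word[i:i + order], word[i + order])] += 1
    return counts


def merge(models):
    """Aggregate per-sentence counts into one model by adding them."""
    return reduce(lambda a, b: a + b, models, Counter())


sentences = ["the hen then", "ten men"]
# Each sentence is counted independently, so the built-in map could be
# swapped for multiprocessing.Pool.map to spread the work over cores.
model = merge(map(train_sentence, sentences))
print(model[("h", "e")])
```

&lt;p&gt;Because merging is plain addition, training on the sentences separately gives the same counts as training on all of the text at&amp;nbsp;once.&lt;/p&gt;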
&lt;p&gt;I would love to get to these things in the future, but if you&amp;#8217;re interested and would like to beat me to it, by all means go&amp;nbsp;ahead!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Want to know how &lt;code&gt;engl_ish&lt;/code&gt; looks under the hood? Check it out on &lt;a href="https://github.com/JohnPaton/engl_ish"&gt;GitHub&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;a href="http://www.johnpaton.net"&gt;John Paton&lt;/a&gt; is a theoretical physics student turned data science consultant, working at &lt;span class="caps"&gt;KPMG&lt;/span&gt; in Amstelveen, the&amp;nbsp;Netherlands.&lt;/em&gt;&lt;/p&gt;</content><category term="posts"></category><category term="python"></category><category term="markov"></category><category term="natural language"></category><category term="open source"></category></entry></feed>