<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>John Sobanski - Data Science</title><link href="https://john.soban.ski/" rel="alternate"></link><link href="https://john.soban.ski/feeds/data-science.atom.xml" rel="self"></link><id>https://john.soban.ski/</id><updated>2026-04-25T01:23:00-04:00</updated><subtitle>Artificial Intelligence in the Cloud</subtitle><entry><title>Engagement Farming with AI Bots</title><link href="https://john.soban.ski/gemma-nye.html" rel="alternate"></link><published>2026-04-25T01:23:00-04:00</published><updated>2026-04-25T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2026-04-25:/gemma-nye.html</id><summary type="html">&lt;p&gt;I hear a lot of developers on social media claim that they created fully automated, end-to-end AI bot pipelines that create, build, test, and deploy apps from scratch.  This idea appeals to me, so I gave it a shot.&lt;/p&gt;
&lt;p&gt;I first tried and failed to create a working app pipeline …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I hear a lot of developers on social media claim that they created fully automated, end-to-end AI bot pipelines that create, build, test, and deploy apps from scratch.  This idea appeals to me, so I gave it a shot.&lt;/p&gt;
&lt;p&gt;I first tried and failed to create a working app pipeline (see future blog posts), so I decided to create a Minimum Viable Product (MVP) in the form of a bot that writes and publishes one blog post, once a day, every day.&lt;/p&gt;
&lt;p&gt;Goals include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing a blog that people will want to read&lt;/li&gt;
&lt;li&gt;Publishing the blog to a production-grade website&lt;/li&gt;
&lt;li&gt;Following security best practices&lt;/li&gt;
&lt;li&gt;Full automation (no manual intervention)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Screen Grab of Hours dot Love" src="https://john.soban.ski/images/Gemma_Nye/00_Hours_Love.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Approach&lt;/h2&gt;
&lt;p&gt;I use Gemini CLI on my Windows laptop for GenAI.  The LLM writes the blog in Markdown and pushes it to my public repo on GitHub.  GitHub Actions (GA) then uses a static site generator to build the site from the Markdown (adding the home page hyperlinks, menus, and style) and pushes the HTML and JavaScript to an Amazon S3-hosted website with &lt;a href="https://john.soban.ski/how-to-configure-s3-websites-to-use-https-part-1.html"&gt;content caching, HTTPS and naked domain redirects&lt;/a&gt;.  Gemini CLI uses a mounted SSH private key to push to GitHub, and GA uses an AWS Identity and Access Management (IAM) Role to push to AWS.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Architecture" src="https://john.soban.ski/images/Gemma_Nye/01_Top_Arch.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I chose Gemini CLI for GenAI, but you can easily replace it with ChatGPT, Claude, or Cursor.  On the hosting end, you can replace S3 with Google Cloud, Azure, Oracle Cloud, or any other web hosting platform.&lt;/p&gt;
&lt;h2&gt;Create an AWS Role&lt;/h2&gt;
&lt;p&gt;GA uses OpenID Connect (OIDC) to authenticate to AWS without a password.  OIDC requires an IAM Role and Policy.&lt;/p&gt;
&lt;p&gt;I name the IAM Role &lt;strong&gt;GitHubActionsHoursLoveDeploy&lt;/strong&gt;, since it will deploy to the S3 bucket that hosts &lt;a href="https://hours.love"&gt;https://hours.love&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Role includes a &lt;strong&gt;trust policy&lt;/strong&gt; that allows the &lt;strong&gt;hours.love&lt;/strong&gt; repo to access resources in my AWS account.  I limit the scope to the &lt;strong&gt;main&lt;/strong&gt; branch.&lt;/p&gt;
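&lt;p&gt;The trust policy references the OIDC identity provider for &lt;strong&gt;token.actions.githubusercontent.com&lt;/strong&gt; in my account.  If your account does not have that provider yet, register it once with the AWS CLI (a sketch; older CLI versions may also require a &lt;strong&gt;--thumbprint-list&lt;/strong&gt; argument):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;aws iam create-open-id-connect-provider \
    --url https://token.actions.githubusercontent.com \
    --client-id-list sts.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;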
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2012-10-17&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Statement&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Effect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Principal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Federated&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sts:AssumeRoleWithWebIdentity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Condition&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;StringEquals&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;token.actions.githubusercontent.com:aud&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sts.amazonaws.com&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;StringLike&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;token.actions.githubusercontent.com:sub&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;repo:hatdropper1977/hours.love:ref:refs/heads/main&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;repo:hatdropper1977/hours.love:ref:refs/heads/main&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
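&lt;p&gt;If you prefer the command line to the IAM console, create the Role from the trust policy above (a sketch; assumes you saved the JSON as &lt;strong&gt;trust-policy.json&lt;/strong&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;aws iam create-role \
    --role-name GitHubActionsHoursLoveDeploy \
    --assume-role-policy-document file://trust-policy.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;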

&lt;p&gt;The Role includes a policy that allows GA to both push web content to the S3 bucket and invalidate the CloudFront cache.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2012-10-17&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Statement&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Sid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;S3Deploy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Effect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:PutObject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:DeleteObject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:GetObject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:ListBucket&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Resource&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:s3:::hours.love&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:s3:::hours.love/*&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Sid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CloudFrontInvalidate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Effect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cloudfront:CreateInvalidation&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Resource&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
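&lt;p&gt;Attach the permission policy to the Role as an inline policy (a sketch; the policy name &lt;strong&gt;HoursLoveDeploy&lt;/strong&gt; and the file name &lt;strong&gt;deploy-policy.json&lt;/strong&gt; are my own choices):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;aws iam put-role-policy \
    --role-name GitHubActionsHoursLoveDeploy \
    --policy-name HoursLoveDeploy \
    --policy-document file://deploy-policy.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;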

&lt;h2&gt;Configure GitHub&lt;/h2&gt;
&lt;p&gt;We need to configure GitHub to both accept markdown from Gemini CLI and send content to AWS.&lt;/p&gt;
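&lt;p&gt;Generate the SSH keypair for the Deploy Key first (a sketch; the file name matches the &lt;strong&gt;/run/secrets/gemini_hours_love&lt;/strong&gt; path that &lt;strong&gt;entrypoint.sh&lt;/strong&gt; expects, and the empty passphrase keeps the automation non-interactive):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ssh-keygen -t ed25519 -N &amp;quot;&amp;quot; -f gemini_hours_love -C &amp;quot;gemini-hours-love&amp;quot;
cat gemini_hours_love.pub
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;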
&lt;p&gt;For the Gemini CLI, we paste our public key in &lt;strong&gt;Settings --&amp;gt; Deploy Keys --&amp;gt; Add Deploy Key&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="How to add CLI Key" src="https://john.soban.ski/images/Gemma_Nye/02_Public_Key.jpg"&gt;&lt;/p&gt;
&lt;p&gt;For AWS, we create a YAML file that includes the Amazon Resource Name (ARN) of the IAM Role that GA assumes via OIDC, and push it to the repo as &lt;strong&gt;.github/workflows/deploy.yml&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Deploy&lt;/span&gt;

&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;push&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;main&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="nt"&gt;permissions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;read&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;id-token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;write&lt;/span&gt;

&lt;span class="nt"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;deploy-hours-love&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;cancel-in-progress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;

&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;us-east-1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;hours.love&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;CLOUDFRONT_DISTRIBUTION_ID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;A12B3CDEFGHIJK&lt;/span&gt;

&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ubuntu-latest&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Checkout&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/checkout@v5&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Setup Node&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/setup-node@v6&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;node-version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;24&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cache-dependency-path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;package-lock.json&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Install dependencies&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm install&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Build site&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm run build&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Verify build output&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;test -d _site || (echo &amp;quot;Expected build output directory &amp;#39;_site&amp;#39; not found&amp;quot; &amp;amp;&amp;amp; exit 1)&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Configure AWS credentials&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-actions/configure-aws-credentials@v6&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;aws-region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;${{ env.AWS_REGION }}&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;role-to-assume&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;arn:aws:iam::123456789012:role/GitHubActionsHoursLoveDeploy&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Sync site to S3&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws s3 sync _site/ s3://${S3_BUCKET}/ --delete&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Invalidate CloudFront&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;aws cloudfront create-invalidation \&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;--distribution-id &amp;quot;${CLOUDFRONT_DISTRIBUTION_ID}&amp;quot; \&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;--paths &amp;quot;/*&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that this YAML also includes stanzas to build the site via npm.&lt;/p&gt;
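&lt;p&gt;You can rehearse the build and deploy steps locally before you trust the workflow with production (a sketch; assumes Node and the AWS CLI with credentials that can read the bucket, and that the build emits &lt;strong&gt;_site&lt;/strong&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm ci
npm run build
test -d _site
aws s3 sync _site/ s3://hours.love/ --delete --dryrun
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;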
&lt;p&gt;We create another YAML file, &lt;strong&gt;.github/workflows/ci.yml&lt;/strong&gt;, to define the Continuous Integration/Continuous Deployment (CI/CD) pipeline.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;CI&lt;/span&gt;

&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;push&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gemini/**&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;setup/**&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;main&lt;/span&gt;

&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;build&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ubuntu-latest&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Checkout&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/checkout@v4&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Detect project&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;detect&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;if [ -f package.json ]; then&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;echo &amp;quot;type=node&amp;quot; &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;elif [ -f requirements.txt ]; then&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;echo &amp;quot;type=python&amp;quot; &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;else&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;echo &amp;quot;type=unknown&amp;quot; &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;fi&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Setup Node&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/setup-node@v4&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;node-version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;20&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Install deps&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm ci&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Lint&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm run lint --if-present&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Test&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm test --if-present&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Build&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm run build --if-present&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a Ruleset in &lt;strong&gt;Settings --&amp;gt; Rules --&amp;gt; Rulesets&lt;/strong&gt; via the GitHub User Interface (UI) to block branch deletions and force pushes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Navigate to the Ruleset" src="https://john.soban.ski/images/Gemma_Nye/03_Minor_Blocks.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Configure Gemini CLI&lt;/h2&gt;
&lt;p&gt;The Gemini CLI contains the blog writing intelligence.  It writes a blog in Markdown and pushes the Markdown to GitHub.  When GitHub receives the file, it kicks off the GA workflows captured in &lt;strong&gt;ci.yml&lt;/strong&gt; and &lt;strong&gt;deploy.yml&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For security reasons, we deploy Gemini CLI in a Docker container.  This prevents Gemini from accessing any files on our laptop or server that live outside of the deployment folder.&lt;/p&gt;
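&lt;p&gt;A minimal launch command shows the idea (a sketch; the image name &lt;strong&gt;gemma-nye&lt;/strong&gt; and the host paths are assumptions).  We mount only the work folder and the private key, read-only where possible:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker run --rm \
  -e GEMINI_API_KEY=&amp;quot;$GEMINI_API_KEY&amp;quot; \
  -e GIT_REPO_SSH=&amp;quot;git@github-hours-love:hatdropper1977/hours.love.git&amp;quot; \
  -v &amp;quot;$PWD/work:/work&amp;quot; \
  -v &amp;quot;$HOME/.ssh/gemini_hours_love:/run/secrets/gemini_hours_love:ro&amp;quot; \
  gemma-nye
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;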
&lt;p&gt;We use a file &lt;strong&gt;entrypoint.sh&lt;/strong&gt; to configure and command the container.  This includes the following actions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Mount SSH key (for GitHub)&lt;/li&gt;
&lt;li&gt;Clone/pull repo&lt;/li&gt;
&lt;li&gt;Generate prompt based on rules&lt;/li&gt;
&lt;li&gt;Call Gemini API (with search)&lt;/li&gt;
&lt;li&gt;Write markdown post&lt;/li&gt;
&lt;li&gt;Validate output&lt;/li&gt;
&lt;li&gt;Commit + push&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The full file reads:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

: &amp;quot;${GEMINI_API_KEY:?GEMINI_API_KEY is required}&amp;quot;
: &amp;quot;${GIT_REPO_SSH:?GIT_REPO_SSH is required}&amp;quot;
: &amp;quot;${GIT_USER_NAME:=Gemini CLI}&amp;quot;
: &amp;quot;${GIT_USER_EMAIL:=gemini-hours-love@users.noreply.github.com}&amp;quot;
: &amp;quot;${SSH_KEY_SRC:=/run/secrets/gemini_hours_love}&amp;quot;
: &amp;quot;${TZ:=America/New_York}&amp;quot;

export TZ
export HOME=/root
export REPO_DIR=&amp;quot;${REPO_DIR:-/work/hours.love}&amp;quot;
export POSTS_DIR=&amp;quot;${POSTS_DIR:-posts}&amp;quot;
export DATE_LOCAL=&amp;quot;$(date +%F)&amp;quot;

# Context mode:
#   good   = titles only
#   better = titles + short snippets
export RECENT_POSTS_MODE=&amp;quot;${RECENT_POSTS_MODE:-better}&amp;quot;

mkdir -p /root/.ssh /work
chmod 700 /root/.ssh

if [[ ! -f &amp;quot;$SSH_KEY_SRC&amp;quot; ]]; then
  echo &amp;quot;Missing SSH key at $SSH_KEY_SRC&amp;quot;
  exit 1
fi

cp &amp;quot;$SSH_KEY_SRC&amp;quot; /root/.ssh/gemini_hours_love
chmod 600 /root/.ssh/gemini_hours_love

cat &amp;gt;/root/.ssh/config &amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
Host github-hours-love
  HostName github.com
  User git
  IdentityFile /root/.ssh/gemini_hours_love
  IdentitiesOnly yes
EOF

chmod 600 /root/.ssh/config
ssh-keyscan github.com &amp;gt;&amp;gt; /root/.ssh/known_hosts 2&amp;gt;/dev/null
chmod 644 /root/.ssh/known_hosts

# --- clone or update repo ---
if [[ ! -d &amp;quot;$REPO_DIR/.git&amp;quot; ]]; then
  git clone &amp;quot;$GIT_REPO_SSH&amp;quot; &amp;quot;$REPO_DIR&amp;quot;
fi

cd &amp;quot;$REPO_DIR&amp;quot;

git config user.name &amp;quot;$GIT_USER_NAME&amp;quot;
git config user.email &amp;quot;$GIT_USER_EMAIL&amp;quot;

git fetch origin
git checkout main
git reset --hard origin/main

mkdir -p &amp;quot;$POSTS_DIR&amp;quot;

POST_FILE=&amp;quot;$POSTS_DIR/${DATE_LOCAL}.md&amp;quot;

if [[ -f &amp;quot;$POST_FILE&amp;quot; ]]; then
  echo &amp;quot;Post already exists for ${DATE_LOCAL}; exiting.&amp;quot;
  exit 0
fi

# --- recent post context ---
build_recent_titles_context() {
  local recent_files=()
  mapfile -t recent_files &amp;lt; &amp;lt;(find &amp;quot;$POSTS_DIR&amp;quot; -maxdepth 1 -type f -name &amp;quot;*.md&amp;quot; ! -name &amp;quot;${DATE_LOCAL}.md&amp;quot; | sort -r | head -n 5)

  if [[ ${#recent_files[@]} -eq 0 ]]; then
    echo &amp;quot;No recent posts yet.&amp;quot;
    return
  fi

  for f in &amp;quot;${recent_files[@]}&amp;quot;; do
    local title
    title=&amp;quot;$(grep -m1 &amp;#39;^title:&amp;#39; &amp;quot;$f&amp;quot; | sed &amp;#39;s/^title:[[:space:]]*//&amp;#39;)&amp;quot;
    if [[ -z &amp;quot;$title&amp;quot; ]]; then
      title=&amp;quot;$(basename &amp;quot;$f&amp;quot; .md)&amp;quot;
    fi
    echo &amp;quot;- ${title}&amp;quot;
  done
}

build_recent_snippets_context() {
  local recent_files=()
  mapfile -t recent_files &amp;lt; &amp;lt;(find &amp;quot;$POSTS_DIR&amp;quot; -maxdepth 1 -type f -name &amp;quot;*.md&amp;quot; ! -name &amp;quot;${DATE_LOCAL}.md&amp;quot; | sort -r | head -n 3)

  if [[ ${#recent_files[@]} -eq 0 ]]; then
    echo &amp;quot;No recent posts yet.&amp;quot;
    return
  fi

  for f in &amp;quot;${recent_files[@]}&amp;quot;; do
    local title
    local snippet
    title=&amp;quot;$(grep -m1 &amp;#39;^title:&amp;#39; &amp;quot;$f&amp;quot; | sed &amp;#39;s/^title:[[:space:]]*//&amp;#39;)&amp;quot;
    if [[ -z &amp;quot;$title&amp;quot; ]]; then
      title=&amp;quot;$(basename &amp;quot;$f&amp;quot; .md)&amp;quot;
    fi

    snippet=&amp;quot;$(
      awk &amp;#39;
        BEGIN { in_frontmatter=0; started=0; lines=0 }
        /^---$/ {
          if (started == 0) { in_frontmatter=1; started=1; next }
          else if (in_frontmatter == 1) { in_frontmatter=0; next }
        }
        in_frontmatter == 0 &amp;amp;&amp;amp; NF {
          print
          lines++
          if (lines &amp;gt;= 6) exit
        }
      &amp;#39; &amp;quot;$f&amp;quot; | tr &amp;#39;\n&amp;#39; &amp;#39; &amp;#39; | sed &amp;#39;s/[[:space:]]\+/ /g&amp;#39;
    )&amp;quot;

    echo &amp;quot;- Title: ${title}&amp;quot;
    if [[ -n &amp;quot;$snippet&amp;quot; ]]; then
      echo &amp;quot;  Snippet: ${snippet}&amp;quot;
    fi
  done
}

if [[ &amp;quot;$RECENT_POSTS_MODE&amp;quot; == &amp;quot;good&amp;quot; ]]; then
  RECENT_POSTS_CONTEXT=&amp;quot;$(build_recent_titles_context)&amp;quot;
else
  RECENT_POSTS_CONTEXT=&amp;quot;$(build_recent_snippets_context)&amp;quot;
fi

# --- prompt ---
PROMPT=$(cat &amp;lt;&amp;lt;EOF
Write exactly one Eleventy post as valid markdown.

Output rules:
- Output ONLY the post file contents
- No explanations, no commentary, no meta text
- No code fences

Required format:
---
title: &amp;lt;specific title&amp;gt;
date: ${DATE_LOCAL}
tags:
  - posts
layout: post.liquid
---

Then the article body.

Narrative requirements:
- The first paragraph must introduce the story and preview what follows
- It must clearly state what happened and why it matters
- If a person is central, identify them immediately with name, role, and location
- Do not open with a scene or anecdote
- Do not open with generic background

Structure:
1. Opening: what happened (fact-based lead)
2. Details: names, locations, specifics
3. Context: why it matters locally
4. Optional: related developments
5. No formal conclusion

Writing rules:
- 500 to 800 words
- Write like a local industry publication, not a personal blog
- No first-person narration (no &amp;quot;I&amp;quot;, &amp;quot;we&amp;quot;)
- No fictional scenes or invented experiences
- No sensory storytelling unless tied to a real reported fact
- Short paragraphs, factual tone
- No pontificating
- No generalizations
- No moralizing
- No reflective commentary
- No filler phrases
- No inspirational tone
- No em dashes
- No intensifiers like &amp;quot;very&amp;quot;, &amp;quot;really&amp;quot;, &amp;quot;deeply&amp;quot;, &amp;quot;truly&amp;quot;, &amp;quot;far more&amp;quot;
- Do not explain what you&amp;#39;re about to say
- If a sentence sounds like a diary, remove it
- If a sentence could appear in a newspaper, keep it
- Keep sentences under 20 words when possible

Content rules:
- Use recent, real information when available via search grounding
- Summarize clearly and directly
- Combine multiple sources when relevant
- Do not fabricate facts
- Do not fabricate firsthand experience
- Focus on:
  - what happened
  - who is involved
  - where it happened
  - why it matters locally
  - practical implications

People coverage rules:
- When a person is central, make them the anchor of the piece
- Focus on what they did, changed, opened, produced, or influenced
- Include:
  - full name
  - role
  - associated business or winery
  - location
- Do not write generic biographies
- Do not invent people
- Only include people present in grounding sources
- Prefer coverage where a person is tied to a real development

Sourcing rules:
- Prefer using 2 to 4 distinct sources when reporting news
- Prefer sources that mention specific people, not just organizations
- Do not rely on a single source if multiple relevant sources exist
- Synthesize information across sources into a single narrative
- Do not summarize sources one-by-one
- Do not write &amp;quot;Article A says, Article B says&amp;quot;
- Combine facts into one coherent account

Citation rules:
- Every key factual claim must come from a grounding source
- Do not invent citations
- Do not fabricate details
- Use light attribution when necessary:
  - Marin Independent Journal reported that ...
  - The San Francisco Chronicle reported ...
  - WineBusiness noted ...
- Do not attribute every sentence
- Use markdown links inline when appropriate:
  - [publication name](url)
- Links must correspond to real grounding sources
- If multiple sources confirm a fact, present it once
- If sources differ, reflect that briefly without speculation

Footnote rules:
- Footnotes are allowed but optional
- If used, they must reference real sources
- Use markdown footnote syntax:
  - reference like [^1]
  - define at bottom:
    [^1]: Source Name - URL
- Do not invent footnotes

Failure rules:
- If fewer than 2 relevant sources are available, write a focused piece using one source plus context
- If no relevant sources are available, fall back to a non-news industry post
- Do NOT invent news or citations

Topic priorities (in order):
1. Marin / Novato / Bay Area restaurant or hospitality news
2. Northern California winery or vineyard developments
3. Key people in the Northern California wine scene:
   - winemakers
   - vineyard managers
   - owners
   - sommeliers
   - importers / distributors
   - hospitality operators
   - chefs with strong wine programs
4. Local wine business, production, or distribution changes
5. Regional updates tied to specific producers or places

Recent posts to avoid repeating:
${RECENT_POSTS_CONTEXT}

If a topic overlaps:
- choose a different angle instead of repeating

Title rules:
- Specific and concrete
- Not generic
- Should read like a headline
- Include a real place, business, or person when possible
EOF
)

# --- build request ---
jq -n --arg prompt &amp;quot;$PROMPT&amp;quot; &amp;#39;{
  contents: [
    {
      parts: [
        { text: $prompt }
      ]
    }
  ],
  tools: [
    {
      google_search: {}
    }
  ]
}&amp;#39; &amp;gt; /tmp/gemini_request.json

# --- call Gemini API ---
curl -sS \
  -H &amp;quot;Content-Type: application/json&amp;quot; \
  -H &amp;quot;x-goog-api-key: ${GEMINI_API_KEY}&amp;quot; \
  -X POST \
  &amp;quot;https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent&amp;quot; \
  -d @/tmp/gemini_request.json \
  &amp;gt; /tmp/gemini_response.json

# --- fail fast on API errors ---
if jq -e &amp;#39;.error&amp;#39; /tmp/gemini_response.json &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
  echo &amp;quot;Gemini API returned an error:&amp;quot;
  jq &amp;#39;.error&amp;#39; /tmp/gemini_response.json
  exit 1
fi

# --- extract text safely ---
POST_TEXT=&amp;quot;$(jq -r &amp;#39;.candidates[0].content.parts[0].text // empty&amp;#39; /tmp/gemini_response.json)&amp;quot;

if [[ -z &amp;quot;$POST_TEXT&amp;quot; ]]; then
  echo &amp;quot;Empty post content returned from Gemini.&amp;quot;
  jq &amp;#39;.&amp;#39; /tmp/gemini_response.json
  exit 1
fi

printf &amp;#39;%s\n&amp;#39; &amp;quot;$POST_TEXT&amp;quot; &amp;gt; &amp;quot;$POST_FILE&amp;quot;

# --- contamination guard ---
if grep -qE &amp;#39;I have written the blog post|/work/|^Here is|^Sure|^```&amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Contaminated output. Aborting.&amp;quot;
  cat &amp;quot;$POST_FILE&amp;quot;
  exit 1
fi

# --- basic format checks ---
if ! grep -q &amp;#39;^---$&amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Missing front matter.&amp;quot;
  cat &amp;quot;$POST_FILE&amp;quot;
  exit 1
fi

if ! grep -q &amp;#39;^title:&amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Missing title in front matter.&amp;quot;
  cat &amp;quot;$POST_FILE&amp;quot;
  exit 1
fi

# --- extract real sources, deduped ---
jq -r &amp;#39;
  [
    .candidates[0].groundingMetadata.groundingChunks[]?.web
    | select(.title and .uri)
    | &amp;quot;- [&amp;quot; + .title + &amp;quot;](&amp;quot; + .uri + &amp;quot;)&amp;quot;
  ] | unique | .[]
&amp;#39; /tmp/gemini_response.json &amp;gt; /tmp/sources.md || true

if [[ -s /tmp/sources.md ]]; then
  {
    printf &amp;#39;\n\n## Sources\n\n&amp;#39;
    cat /tmp/sources.md
    printf &amp;#39;\n&amp;#39;
  } &amp;gt;&amp;gt; &amp;quot;$POST_FILE&amp;quot;
fi

# --- style guard (lightweight) ---
if grep -qE &amp;#39;—| very | really | deeply | truly | far more | that is just &amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Style warning: banned phrasing detected.&amp;quot;
fi

# --- build validation ---
npm ci
npm run build

# --- commit ---
git add &amp;quot;$POST_FILE&amp;quot;

if git diff --cached --quiet; then
  echo &amp;quot;No changes to commit.&amp;quot;
  exit 0
fi

git commit -m &amp;quot;Auto post ${DATE_LOCAL}&amp;quot;
git push origin main

echo &amp;quot;Done.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Docker container uses this file to execute the commands required to create and push a blog post.&lt;/p&gt;
&lt;h2&gt;One Time Tasks&lt;/h2&gt;
&lt;p&gt;One-time tasks include setting up the repo, initializing a Node project, configuring the static site generator, creating Liquid templates, building the container, and deploying the task scheduler.&lt;/p&gt;
&lt;h3&gt;Set Up the Repo&lt;/h3&gt;
&lt;p&gt;The writing bot launches a container on my laptop and pulls the current GitHub repo.  The repo contains (1) the data needed to build the project on my laptop (a one-time manual action) and (2) the current state of the website Markdown.&lt;/p&gt;
&lt;p&gt;When the container runs, it uses the following commands to pull the &lt;strong&gt;source of truth&lt;/strong&gt; from the repo.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;git@github-hours-love:hatdropper1977/hours.love.git
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;hours.love
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Initialize Node Project&lt;/h3&gt;
&lt;p&gt;The static site builder uses Node.js.  GitHub Actions runs the commands to build the site.  The following shell command initializes the Node project and provides configuration information for GitHub Actions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm&lt;span class="w"&gt; &lt;/span&gt;init&lt;span class="w"&gt; &lt;/span&gt;-y

&lt;span class="c1"&gt;# Created:&lt;/span&gt;
&lt;span class="c1"&gt;# package.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This file tells all systems (your laptop, GitHub Actions) how to install dependencies and how to build the site.&lt;/p&gt;
&lt;h3&gt;Configure the Static Site Generator&lt;/h3&gt;
&lt;p&gt;I chose &lt;a href="https://www.11ty.dev/"&gt;Eleventy (11ty)&lt;/a&gt; for my static site generator.  You can choose any static site generator you wish.  On this website, for example, I use &lt;a href="https://getpelican.com/"&gt;Pelican&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The static site generator converts the raw Markdown into a live site, with navigation, client-side scripting, and style.&lt;/p&gt;
&lt;p&gt;We install 11ty with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--save-dev&lt;span class="w"&gt; &lt;/span&gt;@11ty/eleventy

&lt;span class="c1"&gt;# Added &lt;/span&gt;
&lt;span class="c1"&gt;# &amp;quot;devDependencies&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#  &amp;quot;@11ty/eleventy&amp;quot;: &amp;quot;^3.x&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# Created:&lt;/span&gt;
&lt;span class="c1"&gt;# package-lock.json&lt;/span&gt;
&lt;span class="c1"&gt;# node_modules/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This commands our project to use Eleventy to build the site.&lt;/p&gt;
&lt;p&gt;Next, we edit &lt;strong&gt;package.json&lt;/strong&gt; (created above) to define the build command.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="s2"&gt;&amp;quot;scripts&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;build&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;npx @11ty/eleventy&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, when GA runs...&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;...it executes...&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;npx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;@11&lt;/span&gt;&lt;span class="n"&gt;ty&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;eleventy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;...which generates:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;_site/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
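&lt;p&gt;Eleventy falls back to sensible defaults, so no configuration file is strictly required.  If you want to pin the input and output folders explicitly, a minimal &lt;strong&gt;.eleventy.js&lt;/strong&gt; sketch looks like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Minimal sketch; these match Eleventy&amp;#39;s defaults
module.exports = function (eleventyConfig) {
  return {
    dir: {
      input: &amp;quot;.&amp;quot;,
      output: &amp;quot;_site&amp;quot;
    }
  };
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;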

&lt;h3&gt;Create Liquid Templates&lt;/h3&gt;
&lt;p&gt;11ty uses the &lt;a href="https://www.11ty.dev/docs/languages/liquid/"&gt;Liquid&lt;/a&gt; template language.  The template language configures the look and feel of the website.&lt;/p&gt;
&lt;p&gt;I created layout files for the home page and each unique post.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;index.liquid&lt;/strong&gt; includes&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;---
title: hours.love
---

&lt;span class="cp"&gt;&amp;lt;!doctype html&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;html&lt;/span&gt; &lt;span class="na"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;head&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;meta&lt;/span&gt; &lt;span class="na"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;utf-8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;title&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;{{ title }}&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;title&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;meta&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;viewport&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;width=device-width, initial-scale=1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;style&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--bg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#f7f3ec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--paper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#fffdf9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--ink&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#1f1b16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#6b6258&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#ddd3c7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--link&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#6f2c1f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--link-hover&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#8a3828&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;box-sizing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;border-box&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--bg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--ink&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Georgia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Times New Roman&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;serif&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;line-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.65&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;wrap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;760&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;hero&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin-bottom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.4&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;line-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;sub&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;intro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--paper&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;solid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border-radius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;h2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;letter-spacing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="kt"&gt;em&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;list-style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;solid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;last-child&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border-bottom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;solid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--link&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;text-decoration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;hover&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--link-hover&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;text-decoration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;underline&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;meta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;block&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;empty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;italic&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;style&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;head&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;main&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;wrap&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;header&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;hero&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;hours.love&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sub&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Notes on Northern California wine, food, places, and the small details that make them worth remembering.&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;header&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;intro&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      This is a running notebook, not a magazine. Short pieces. Specific places. Good bottles. Restaurants worth knowing. Things noticed and written down before they blur.
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h2&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Recent posts&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h2&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

      {% if collections.posts and collections.posts.length &amp;gt; 0 %}
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;posts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          {% for post in collections.posts | reverse %}
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt; &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;{{ post.url }}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;{{ post.data.title }}&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;meta&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;{{ post.date | date: &amp;quot;%B %d, %Y&amp;quot; }}&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          {% endfor %}
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      {% else %}
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;empty&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;No posts yet.&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      {% endif %}
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;main&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;html&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This template, similar to PHP, dynamically applies style to an arbitrary number of files at build time.&lt;/p&gt;
&lt;h3&gt;Build &amp;amp; Test the Container&lt;/h3&gt;
&lt;p&gt;I wrote the following Dockerfile, which builds a container image that includes Debian Linux, Node, git, curl, and jq.  Note that I mount my SSH security key into the container at run time, rather than bake it into the image.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:24-bookworm-slim&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;DEBIAN_FRONTEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;noninteractive
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;TZ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;America/New_York
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;REPO_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/work/hours.love

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--no-install-recommends&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;openssh-client&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;ca-certificates&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;tzdata&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;jq&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ln&lt;span class="w"&gt; &lt;/span&gt;-fs&lt;span class="w"&gt; &lt;/span&gt;/usr/share/zoneinfo/&lt;span class="nv"&gt;$TZ&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/etc/localtime&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;dpkg-reconfigure&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;noninteractive&lt;span class="w"&gt; &lt;/span&gt;tzdata&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;rm&lt;span class="w"&gt; &lt;/span&gt;-rf&lt;span class="w"&gt; &lt;/span&gt;/var/lib/apt/lists/*

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/work&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;entrypoint.sh&lt;span class="w"&gt; &lt;/span&gt;/work/entrypoint.sh
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;chmod&lt;span class="w"&gt; &lt;/span&gt;+x&lt;span class="w"&gt; &lt;/span&gt;/work/entrypoint.sh

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/work/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I build it with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;build&lt;span class="w"&gt; &lt;/span&gt;-t&lt;span class="w"&gt; &lt;/span&gt;hours-love-gemini-runner&lt;span class="w"&gt; &lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Run the Container&lt;/h3&gt;
&lt;p&gt;Now run the container with the following command.  I export my Gemini API key via the &lt;strong&gt;GEMINI_API_KEY&lt;/strong&gt; environment variable, so that the secret does not appear in my shell history.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;docker&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;--rm&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;GEMINI_API_KEY&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GIT_REPO_SSH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;git@github-hours-love:hatdropper1977/hours.love.git&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GIT_USER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hatdropper1977&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GIT_USER_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sobanski.htc@gmail.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;--mount&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bind,src&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~/.ssh&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/gemini_hours_love&amp;quot;&lt;/span&gt;,dst&lt;span class="o"&gt;=&lt;/span&gt;/run/secrets/gemini_hours_love,readonly&lt;span class="w"&gt;   &lt;/span&gt;hours-love-gemini-runner&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
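&lt;p&gt;An alternative to an inline export: keep the key in a mode-600 env file and hand it to Docker with &lt;strong&gt;--env-file&lt;/strong&gt; (the scheduled batch file below uses the same pattern).  A sketch, with the &lt;strong&gt;~/.gemini.env&lt;/strong&gt; filename assumed:&lt;/p&gt;

```shell
# Hypothetical sketch: create a key file that only the current user can read.
install -m 600 /dev/null "$HOME/.gemini.env"
printf 'GEMINI_API_KEY=%s\n' 'example-key' &gt; "$HOME/.gemini.env"

# Pass the file to docker at run time instead of exporting the variable:
#   docker run --rm --env-file "$HOME/.gemini.env" ... hours-love-gemini-runner

# Or source it into the current shell when you need the variable locally.
set -a; . "$HOME/.gemini.env"; set +a
echo "$GEMINI_API_KEY"
```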

&lt;p&gt;When the container runs, it executes the commands in &lt;strong&gt;entrypoint.sh&lt;/strong&gt; and then exits; the &lt;strong&gt;--rm&lt;/strong&gt; flag tells Docker to delete the container afterward.&lt;/p&gt;
&lt;h3&gt;Schedule the Task&lt;/h3&gt;
&lt;p&gt;I use a Windows laptop to run the job once a day via Task Scheduler.  On Linux, you can schedule the same job with cron.&lt;/p&gt;
&lt;p&gt;The batch file contains:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;echo&lt;/span&gt; off

&lt;span class="c1"&gt;REM ---- config ----&lt;/span&gt;
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;IMAGE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;hours-love-gemini-runner
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;git@github-hours-love:hatdropper1977/hours.love.git
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;SSH_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;C:\Users\Freshlex\.ssh\gemini_hours_love
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;ENV_FILE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;C:\Users\Freshlex\.gemini.env

&lt;span class="c1"&gt;REM ---- run container ----&lt;/span&gt;
docker run --rm &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; --env-file &lt;span class="nv"&gt;%ENV_FILE%&lt;/span&gt; &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; -e GIT_REPO_SSH=&lt;span class="nv"&gt;%REPO%&lt;/span&gt; &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; -e GIT_USER_NAME=hatdropper1977 &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; -e GIT_USER_EMAIL=my@emsail.com &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; --mount type=bind,src=&lt;span class="nv"&gt;%SSH_KEY%&lt;/span&gt;,dst=/run/secrets/gemini_hours_love,readonly &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; &lt;span class="nv"&gt;%IMAGE%&lt;/span&gt;

&lt;span class="c1"&gt;REM ---- optional logging ----&lt;/span&gt;
&lt;span class="c1"&gt;REM &amp;gt;&amp;gt; C:\Users\Freshlex\gemini.log 2&amp;gt;&amp;amp;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I register the daily task with &lt;strong&gt;schtasks&lt;/strong&gt; from &lt;strong&gt;CMD&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;schtasks&lt;span class="w"&gt; &lt;/span&gt;/create&lt;span class="w"&gt; &lt;/span&gt;/tn&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours-love-daily&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/tr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cmd.exe /c C:\Users\JohnSobanski\gemini-workspace\run-hours-love.bat&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/sc&lt;span class="w"&gt; &lt;/span&gt;daily&lt;span class="w"&gt; &lt;/span&gt;/st&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;:30&lt;span class="w"&gt; &lt;/span&gt;/rl&lt;span class="w"&gt; &lt;/span&gt;highest&lt;span class="w"&gt; &lt;/span&gt;/f
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
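&lt;p&gt;On Linux, an equivalent cron entry could look like the following sketch (the paths and env file are assumptions; cron requires the whole command on one line):&lt;/p&gt;

```text
30 10 * * * docker run --rm --env-file /home/me/.gemini.env -e GIT_REPO_SSH=git@github-hours-love:hatdropper1977/hours.love.git -e GIT_USER_NAME=hatdropper1977 -e GIT_USER_EMAIL=my@email.com --mount type=bind,src=/home/me/.ssh/gemini_hours_love,dst=/run/secrets/gemini_hours_love,readonly hours-love-gemini-runner >> /home/me/gemini.log 2>&1
```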

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I created an &lt;strong&gt;engagement farming bot&lt;/strong&gt; that produces a clean, readable, and interesting blog post once per day.  You can take my approach to create a similar bot on the topic of your choosing, or deploy a fleet of bots that publish posts to dozens of different websites.  You could even create a meta-bot pipeline that scours the web for trending topics and then spins up a new bot to write on each one.&lt;/p&gt;
&lt;p&gt;I just began my GenAI automation pipeline journey, and look forward to exploring future use cases.&lt;/p&gt;</content><category term="Data Science"></category><category term="AI"></category><category term="Machine Learning"></category><category term="CloudFront"></category><category term="Gemini"></category><category term="Linux"></category><category term="HOWTO"></category></entry><entry><title>Crunch Big Data on Your Laptop With Polars Streaming</title><link href="https://john.soban.ski/polars2.html" rel="alternate"></link><published>2025-09-27T06:44:00-04:00</published><updated>2025-09-27T06:44:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2025-09-27:/polars2.html</id><summary type="html">&lt;p&gt;Last month, I refactored a custom Artificial Intelligence (AI) algorithm from &lt;strong&gt;Pandas&lt;/strong&gt; to &lt;strong&gt;Polars&lt;/strong&gt;.  This switch drove a 25x increase in performance.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear and a Panda Bear squaring off in space" src="https://john.soban.ski/images/Polars2/01_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I needed to change the logic from a row-based &lt;strong&gt;apply&lt;/strong&gt; approach to a holistic, matrix-level &lt;strong&gt;join/GROUP BY&lt;/strong&gt; approach.&lt;/p&gt;
&lt;p&gt;My algorithm, however, experienced an Out Of Memory (OOM …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last month, I refactored a custom Artificial Intelligence (AI) algorithm from &lt;strong&gt;Pandas&lt;/strong&gt; to &lt;strong&gt;Polars&lt;/strong&gt;.  This switch drove a 25x increase in performance.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear and a Panda Bear squaring off in space" src="https://john.soban.ski/images/Polars2/01_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I needed to change the logic from a row-based &lt;strong&gt;apply&lt;/strong&gt; approach to a holistic, matrix-level &lt;strong&gt;join/GROUP BY&lt;/strong&gt; approach.&lt;/p&gt;
&lt;p&gt;My algorithm, however, experienced an Out Of Memory (OOM) error when I attempted to train a corpus of twenty-seven million (27M) observations.  I then discovered the &lt;strong&gt;Streaming&lt;/strong&gt; feature of Polars, which solves this issue.&lt;/p&gt;
&lt;p&gt;Polars &lt;strong&gt;Streaming&lt;/strong&gt; parallelizes processing in the &lt;strong&gt;time domain&lt;/strong&gt;.  It reads a chunk of data into memory, spreads the computation across all cores, saves the result, and then loads the next chunk.  This way, you can perform &lt;strong&gt;Big Data&lt;/strong&gt; operations in a meager Colab notebook.&lt;/p&gt;
&lt;h2&gt;Recap:  The Algorithm in Polars vs. Python&lt;/h2&gt;
&lt;p&gt;The exemplar &lt;strong&gt;Reduced Coulomb Energy&lt;/strong&gt; (RCE) algorithm works like the familiar &lt;strong&gt;k-nearest neighbors&lt;/strong&gt; algorithm, with a subtle twist.  RCE calculates a &lt;strong&gt;hit footprint&lt;/strong&gt; for each observation.  The distance to the closest observation of a &lt;strong&gt;different&lt;/strong&gt; class defines the &lt;strong&gt;hit footprint&lt;/strong&gt; radius.  The math then calculates the distance to every other observation; each observation that falls within the &lt;strong&gt;hit footprint&lt;/strong&gt; radius yields a hit for its class.&lt;/p&gt;
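&lt;p&gt;A toy sketch of the hit footprint in plain NumPy (hypothetical points, not the article's corpus) may help ground the idea:&lt;/p&gt;

```python
import numpy as np

# Toy sketch of the RCE hit footprint (hypothetical data, two classes).
X = np.array([[0.0, 0.0],   # class 'a' -- the observation we score
              [1.0, 0.0],   # class 'b'
              [0.2, 0.1],   # class 'a'
              [3.0, 3.0]])  # class 'b'
y = np.array(['a', 'b', 'a', 'b'])

proto, proto_class = X[0], y[0]
dists = np.linalg.norm(X - proto, axis=1)  # distance to every observation

# Radius: distance to the closest observation of a DIFFERENT class.
radius = dists[y != proto_class].min()     # the point at (1, 0)

# Hits: every *other* observation that falls strictly inside the radius
# (dists > 0 drops the observation itself; assumes no duplicate points).
inside = (dists < radius) & (dists > 0)
hits_by_class = {c: int(((y == c) & inside).sum()) for c in np.unique(y)}
```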
&lt;p&gt;The original Pandas approach uses the following Lambda function to implement the logic.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then apply the lambda function to each row.&lt;/p&gt;
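&lt;p&gt;Applied row by row, the call could look like the following sketch (the &lt;strong&gt;princomp&lt;/strong&gt; column names and the &lt;strong&gt;lambda&lt;/strong&gt; radii are assumptions for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two principal components, an outcome label, and a
# per-observation 'lambda' hit-footprint radius.
df = pd.DataFrame({
    'princomp_1': [0.0, 1.0, 0.2, 3.0],
    'princomp_2': [0.0, 0.0, 0.1, 3.0],
    'outcome':    ['a', 'b', 'a', 'b'],
    'lambda':     [0.9, 0.9, 0.9, 0.9],
})

def find_hits(df, v, outcome):
    # Count observations of `outcome` whose distance to point v falls
    # inside each observation's own lambda radius.
    subset = df.loc[df['outcome'] == outcome]
    dists = np.linalg.norm(subset.iloc[:, :-2].sub(np.array(v)), axis=1)
    return int((dists < subset['lambda']).sum())

# Row-wise application: count class-'b' hits around each observation.
df['b_hits'] = df.apply(
    lambda row: find_hits(df, [row['princomp_1'], row['princomp_2']], 'b'),
    axis=1)
```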
&lt;p&gt;The Polars approach uses columnar/matrix-based operations.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;squared_distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;squared_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;squared_distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; 
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lambda_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;wall clock&lt;/strong&gt; time for the Pandas approach reads six (6) minutes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screengrab that shows the execution time of the Pandas approach" src="https://john.soban.ski/images/Polars/05_Pandas_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;wall clock&lt;/strong&gt; time for the Polars approach reads fourteen (14) seconds, a ~25x improvement.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Screengrab that shows the execution time for the Polars approach" src="https://john.soban.ski/images/Polars/06_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Data Viz&lt;/h2&gt;
&lt;p&gt;Last month, I created a graphic of the Pima Dataset, which depicts the observations classified by &lt;strong&gt;Outcome 0&lt;/strong&gt; vs. &lt;strong&gt;Outcome 1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A heat map of Outcome 0 vs. Outcome 1 for the Pima Dataset" src="https://john.soban.ski/images/Polars/04_Rce_Map.jpg"&gt;&lt;/p&gt;
&lt;p&gt;To graph it, I trained the Pima Dataset via RCE to generate a model.  I then created a two-dimensional 300x300 grid DataFrame, and used the Pima model to classify all 90k points.&lt;/p&gt;
&lt;p&gt;I will extend this to three dimensions.  To do so, I must (1) reduce the Pima Dataset to 3D, (2) calculate the &lt;strong&gt;lambdas&lt;/strong&gt; (radii) for all hit footprints, (3) create a 300x300x300 Data Viz grid DataFrame (27,000,000 points), and (4) classify all 27M Data Viz points with the 3D model.&lt;/p&gt;
&lt;h3&gt;1.  Reduce the Pima Dataset to Three Dimensions&lt;/h3&gt;
&lt;p&gt;The Pima Diabetes Dataset includes eight (8) features and one (1) target.  For a 3D plot, I need to reduce the eight (8) features down to three (3).  Principal Component Analysis (PCA) reduces dimensionality while retaining information.  See my blog post &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;New Exemplar Machine Learning Algorithm&lt;/a&gt; for a discussion on PCA.  In layman's terms, PCA projects the &lt;strong&gt;observations&lt;/strong&gt; from a higher-dimensional space into a lower-dimensional one.  Imagine a bunch of coins sprinkled on a (2D) piece of paper.  If you arrange them in a line, you reduce the dimensionality from two to one.  PCA executes the same process, but accounts for the variance and density of the observations in the higher-dimensional space.&lt;/p&gt;
&lt;p&gt;First, I load the Pima Diabetes Dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;df_pima&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_pima&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

   &lt;span class="n"&gt;Pregnancies&lt;/span&gt;  &lt;span class="n"&gt;Glucose&lt;/span&gt;  &lt;span class="n"&gt;BloodPressure&lt;/span&gt;  &lt;span class="n"&gt;SkinThickness&lt;/span&gt;  &lt;span class="n"&gt;Insulin&lt;/span&gt;   &lt;span class="n"&gt;BMI&lt;/span&gt;  \
&lt;span class="mi"&gt;0&lt;/span&gt;            &lt;span class="mi"&gt;6&lt;/span&gt;      &lt;span class="mi"&gt;148&lt;/span&gt;             &lt;span class="mi"&gt;72&lt;/span&gt;             &lt;span class="mi"&gt;35&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="mf"&gt;33.6&lt;/span&gt;   
&lt;span class="mi"&gt;1&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="mi"&gt;85&lt;/span&gt;             &lt;span class="mi"&gt;66&lt;/span&gt;             &lt;span class="mi"&gt;29&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="mf"&gt;26.6&lt;/span&gt;   
&lt;span class="mi"&gt;2&lt;/span&gt;            &lt;span class="mi"&gt;8&lt;/span&gt;      &lt;span class="mi"&gt;183&lt;/span&gt;             &lt;span class="mi"&gt;64&lt;/span&gt;              &lt;span class="mi"&gt;0&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="mf"&gt;23.3&lt;/span&gt;   
&lt;span class="mi"&gt;3&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="mi"&gt;89&lt;/span&gt;             &lt;span class="mi"&gt;66&lt;/span&gt;             &lt;span class="mi"&gt;23&lt;/span&gt;       &lt;span class="mi"&gt;94&lt;/span&gt;  &lt;span class="mf"&gt;28.1&lt;/span&gt;   
&lt;span class="mi"&gt;4&lt;/span&gt;            &lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="mi"&gt;137&lt;/span&gt;             &lt;span class="mi"&gt;40&lt;/span&gt;             &lt;span class="mi"&gt;35&lt;/span&gt;      &lt;span class="mi"&gt;168&lt;/span&gt;  &lt;span class="mf"&gt;43.1&lt;/span&gt;   

   &lt;span class="n"&gt;DiabetesPedigreeFunction&lt;/span&gt;  &lt;span class="n"&gt;Age&lt;/span&gt;  &lt;span class="n"&gt;Outcome&lt;/span&gt;  
&lt;span class="mi"&gt;0&lt;/span&gt;                     &lt;span class="mf"&gt;0.627&lt;/span&gt;   &lt;span class="mi"&gt;50&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;span class="mi"&gt;1&lt;/span&gt;                     &lt;span class="mf"&gt;0.351&lt;/span&gt;   &lt;span class="mi"&gt;31&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  
&lt;span class="mi"&gt;2&lt;/span&gt;                     &lt;span class="mf"&gt;0.672&lt;/span&gt;   &lt;span class="mi"&gt;32&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt;                     &lt;span class="mf"&gt;0.167&lt;/span&gt;   &lt;span class="mi"&gt;21&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  
&lt;span class="mi"&gt;4&lt;/span&gt;                     &lt;span class="mf"&gt;2.288&lt;/span&gt;   &lt;span class="mi"&gt;33&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I create a Keras Normalization layer and adapt it to the Pima features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;

&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_pima&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then run PCA on the normalized Pima Dataset.  This collapses the eight (8) dimensions to three (3).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;

&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
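&lt;p&gt;To sanity-check how much information the three components retain, a fitted scikit-learn PCA object exposes &lt;strong&gt;explained_variance_ratio_&lt;/strong&gt;.  A sketch with a synthetic stand-in matrix (the array &lt;strong&gt;X&lt;/strong&gt; below substitutes for the normalized Pima features):&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 768x8 normalized Pima feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))

pca = PCA(n_components=3)
pca.fit(X)

# Fraction of total variance each principal component captures;
# their sum shows how much information the 3D projection keeps.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```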

&lt;p&gt;This gives us the 3D Pima training dataset for RCE modeling.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

   &lt;span class="n"&gt;princomp1&lt;/span&gt;  &lt;span class="n"&gt;princomp2&lt;/span&gt;  &lt;span class="n"&gt;princomp3&lt;/span&gt;  &lt;span class="n"&gt;outcome&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="mf"&gt;1.068502&lt;/span&gt;   &lt;span class="mf"&gt;1.234895&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.095930&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.121683&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.733852&lt;/span&gt;   &lt;span class="mf"&gt;0.712938&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.396477&lt;/span&gt;   &lt;span class="mf"&gt;1.595876&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.760679&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.115781&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.271241&lt;/span&gt;   &lt;span class="mf"&gt;0.663729&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="mf"&gt;2.359335&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.184819&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.963107&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;2.  Calculate the Radii for All Hit Footprints&lt;/h3&gt;
&lt;p&gt;In RCE, &lt;strong&gt;lambdas&lt;/strong&gt; (overloaded term) record the distance from an observation to the closest observation in a &lt;strong&gt;different class&lt;/strong&gt;.  RCE uses the &lt;strong&gt;lambdas&lt;/strong&gt; to tally hits.  We must calculate &lt;strong&gt;lambda&lt;/strong&gt; for every observation in the training Dataset.  I go into the code in my &lt;a href="https://john.soban.ski/polars.html"&gt;prior blog post&lt;/a&gt;.  I also share the code at the bottom of this blog post.  I wrote the code to accommodate an arbitrary number of dimensions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Overload Warning:  RCE uses &lt;strong&gt;lambda&lt;/strong&gt; to indicate hit footprint &lt;strong&gt;radii&lt;/strong&gt;.  Python (and therefore Pandas) uses &lt;strong&gt;lambda&lt;/strong&gt; to indicate inline/anonymous functions.  Do not confuse the two.&lt;/p&gt;
&lt;/blockquote&gt;
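&lt;p&gt;The &lt;strong&gt;lambda&lt;/strong&gt; calculation reduces to a nearest-neighbor search restricted to the opposite class.  A NumPy sketch on a hypothetical four-point toy set (not the Pima data) shows the idea:&lt;/p&gt;

```python
import numpy as np

# Hypothetical toy data: 3D points with binary class labels.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [3.0, 3.0, 3.0]])
labels = np.array([0, 1, 0, 1])

# Pairwise squared Euclidean distances between all observations.
diff = points[:, None, :] - points[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# Mask out same-class pairs, then take each row's minimum distance
# to the opposite class: that distance is the observation's lambda.
other_class = labels[:, None] != labels[None, :]
sq_dist_other = np.where(other_class, sq_dist, np.inf)
lambdas = np.sqrt(sq_dist_other.min(axis=1))
print(lambdas)
```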
&lt;p&gt;The following &lt;strong&gt;head()&lt;/strong&gt; shows the &lt;strong&gt;observations&lt;/strong&gt; with their &lt;strong&gt;target&lt;/strong&gt; and &lt;strong&gt;lambda&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;shape:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
┌───────────┬───────────┬───────────┬─────────┬──────────┐
│&lt;span class="w"&gt; &lt;/span&gt;princomp1&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;princomp2&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;princomp3&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;outcome&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;lambda&lt;span class="w"&gt;   &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;     &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;      &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;i64&lt;span class="w"&gt;     &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;      &lt;/span&gt;│
╞═══════════╪═══════════╪═══════════╪═════════╪══════════╡
│&lt;span class="w"&gt; &lt;/span&gt;-3.183769&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-1.553631&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.571653&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.399091&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-1.038747&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.269651&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.878596&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.180943&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-0.847973&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.281973&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-0.178098&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.209617&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-0.546354&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.100635&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-0.542195&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.371084&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-0.991178&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-1.153642&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-0.04556&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.587108&lt;span class="w"&gt; &lt;/span&gt;│
└───────────┴───────────┴───────────┴─────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;3.  Generate a 300x300x300 Data Viz Grid&lt;/h3&gt;
&lt;p&gt;The following code creates a Polars DataFrame with 27M Data Viz points.  We will use this DataFrame to draw a 3D map of the Pima Dataset, classified by &lt;strong&gt;Outcome.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;princomp1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;princomp2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;princomp3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;┌───────────┬───────────┬───────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp3&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═══════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.993311&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.986622&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.979933&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.973244&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;         &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;         &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.973244&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.979933&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.986622&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.993311&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴───────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;4.  Classify All 27M Data Points&lt;/h3&gt;
&lt;p&gt;I already discussed the algorithm and implementation of RCE classification in my &lt;a href="https://john.soban.ski/polars.html"&gt;prior blog post&lt;/a&gt;.  Please click through to read.  The algorithm uses &lt;strong&gt;lambda&lt;/strong&gt; to count hits within a given &lt;strong&gt;hit footprint&lt;/strong&gt; and, based on the &lt;strong&gt;hits&lt;/strong&gt;, assigns a class.&lt;/p&gt;
&lt;p&gt;I wrote the code to accommodate an arbitrary number of dimensions, so I can run it now without edits.&lt;/p&gt;
&lt;p&gt;When I attempt to run it in my Colab notebook, however, I get the following error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;ComputeError:  cross joins would produce more rows than fits into 2^32; consider compiling with polars-big-idx feature, or set to 'streaming.'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="A screen grab of a Python error that indicates a potential OOM kill" src="https://john.soban.ski/images/Polars2/02_Compute_Error.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I want to use a &lt;strong&gt;cross join&lt;/strong&gt; to leverage Polars/Arrow parallel execution features.  I could try to cobble together a different approach (manual chunking), but I don't have confidence in my ability to improve on the work done by the Polars developers.  My hacked code would only bastardize the logic and execution of a &lt;strong&gt;cross join&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I instead use the Polars &lt;strong&gt;Streaming&lt;/strong&gt; feature.&lt;/p&gt;
&lt;p&gt;A quick aside: if I use the RCE &lt;strong&gt;epsilon&lt;/strong&gt; hyperparameter to set an upper limit on &lt;strong&gt;hit footprint&lt;/strong&gt; size, the model shrinks dramatically.  I will show that in a future blog post.  For now, I want to focus on illustrating the &lt;strong&gt;streaming&lt;/strong&gt; feature of Polars.&lt;/p&gt;
&lt;h2&gt;Classification&lt;/h2&gt;
&lt;p&gt;I first set the Polars DataFrames to &lt;strong&gt;lazy&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lf_class&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;lf_lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then load the RCE classification logic.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Rename lambda-side cols&lt;/span&gt;
&lt;span class="n"&gt;lf_lambda_renamed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_lambda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;loutcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;llambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cross join (cartesian product)&lt;/span&gt;
&lt;span class="n"&gt;lf_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lf_lambda_renamed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Distance calculation&lt;/span&gt;
&lt;span class="n"&gt;distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lf_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_cross&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Flag hit&lt;/span&gt;
&lt;span class="n"&gt;lf_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_cross&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;is_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;llambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Group by the grid point&lt;/span&gt;
&lt;span class="n"&gt;grid_group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;princomp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;lf_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lf_cross&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_group_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;is_hit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;loutcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;is_hit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;loutcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# unclassified&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The pre-loaded logic processes the &lt;strong&gt;lazy&lt;/strong&gt; input DataFrames into the &lt;strong&gt;lazy&lt;/strong&gt; output DataFrame, which I name &lt;strong&gt;lf_hits&lt;/strong&gt; (Lazy Frame hits).  I then execute the logic on &lt;strong&gt;lf_hits&lt;/strong&gt; with the &lt;strong&gt;streaming&lt;/strong&gt; flag set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_hits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After execution, we can take a peek at the DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬───────────┬────────────────┬────────────────┬───────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp3&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_0&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i32&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═══════════╪════════════════╪════════════════╪═══════════════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.55102&lt;/span&gt;   &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.102041&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.265306&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.591837&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.142857&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.714286&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.795918&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.020408&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.469388&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.755102&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.020408&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.632653&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.510204&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴───────────┴────────────────┴────────────────┴───────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can now plot the Classified Pima DataSet in 3D.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;px&lt;/span&gt;

&lt;span class="n"&gt;filtered_hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predicted_outcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_not_null&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;predicted_outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted Outcome based on Principal Components (Excluding Nulls)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A 3D plotly plot that shows **Outcome 0** in yellow, and **Outcome 1** in blue" src="https://john.soban.ski/images/Polars2/03_Three_Dee.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The plot shows &lt;strong&gt;Outcome 0&lt;/strong&gt; in yellow, and &lt;strong&gt;Outcome 1&lt;/strong&gt; in blue.&lt;/p&gt;
&lt;p&gt;I tune the &lt;strong&gt;epsilon&lt;/strong&gt; hyperparameter to put an upper limit on the &lt;strong&gt;hit footprints&lt;/strong&gt;.  This cleans up the plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A 3D plotly plot that shows **Outcome 0** in yellow, and **Outcome 1** in blue" src="https://john.soban.ski/images/Polars2/04_Three_Deetwo.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Polars uses Apache Arrow to drive optimal utilization across every available core.  For operations that clog local memory, like a cross join that yields a DataFrame with more than 2^32 rows, Polars provides the &lt;strong&gt;Streaming&lt;/strong&gt; construct.  Streaming sequentially loads chunks of data into memory, which spreads parallel execution across time.  &lt;strong&gt;Streaming&lt;/strong&gt; allows Big Data computations on Colab notebooks and laptops.  The Data Engineer does not need to concern herself with the housekeeping associated with most &lt;strong&gt;Big Data&lt;/strong&gt; infrastructures.  Remember the headaches that Hadoop and Spark caused?  You do not need to deal with them; you just work with DataFrames and Polars figures out the rest.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of Polar Bears and Panda Bears Squaring off in Space with fancy Space Suits" src="https://john.soban.ski/images/Polars2/05_Polars_Space.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Lambda Calculation Code&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;join_selection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;squared_distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;squared_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;squared_distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lambda_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬───────────┬─────────┬──────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp3&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═══════════╪═════════╪══════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.466587&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.42411&lt;/span&gt;   &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.133452&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.114694&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.158603&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.811361&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.563424&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.430963&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.395739&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.375969&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.234109&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.086674&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.276773&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.741043&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.127401&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.546374&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.91052&lt;/span&gt;   &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.058389&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.605156&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.262343&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴───────────┴─────────┴──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Python"></category><category term="Polars"></category><category term="Pandas"></category><category term="Machine Learning"></category><category term="Neural Networks"></category></entry><entry><title>Polars Boosted My Algorithm's Speed by 25x</title><link href="https://john.soban.ski/polars.html" rel="alternate"></link><published>2025-06-28T16:47:00-04:00</published><updated>2025-06-28T16:47:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2025-06-28:/polars.html</id><summary type="html">&lt;p&gt;The open-source &lt;strong&gt;Polars&lt;/strong&gt; library touts huge performance gains over &lt;strong&gt;Pandas&lt;/strong&gt;.  A combination of parallel processing, Apache Arrow, and a "Close to the Metal" Architecture drives &lt;strong&gt;Polars'&lt;/strong&gt; speed.  I refactored my Pandas-based Reduced Coulomb Energy (RCE) algorithm to &lt;strong&gt;Polars&lt;/strong&gt; and will share my journey and observations with you.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear Gang and Panda Gang Squaring off in NYC" src="https://john.soban.ski/images/Polars/00_Polars_Pandas.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Reduced Coulomb …&lt;/h2&gt;</summary><content type="html">&lt;p&gt;The open-source &lt;strong&gt;Polars&lt;/strong&gt; library touts huge performance gains over &lt;strong&gt;Pandas&lt;/strong&gt;.  A combination of parallel processing, Apache Arrow, and a "Close to the Metal" Architecture drives &lt;strong&gt;Polars'&lt;/strong&gt; speed.  I refactored my Pandas-based Reduced Coulomb Energy (RCE) algorithm to &lt;strong&gt;Polars&lt;/strong&gt; and will share my journey and observations with you.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear Gang and Panda Gang Squaring off in NYC" src="https://john.soban.ski/images/Polars/00_Polars_Pandas.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Reduced Coulomb Energy (RCE) Algorithm&lt;/h2&gt;
&lt;p&gt;The obscure RCE &lt;strong&gt;exemplar classifier&lt;/strong&gt; offers a niche alternative to the well-known K-Nearest approach.  I wrote about the algorithm in-depth in my &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;New Exemplar Machine Learning Algorithm&lt;/a&gt; blog post.&lt;/p&gt;
&lt;p&gt;The RCE algorithm labels &lt;strong&gt;inference data&lt;/strong&gt; via &lt;strong&gt;hit footprints&lt;/strong&gt; learned from training data.&lt;/p&gt;
&lt;p&gt;RCE creates spheres around each labeled training observation, with radii equal to the distance to the closest &lt;strong&gt;labeled observation&lt;/strong&gt; in the &lt;strong&gt;opposite class&lt;/strong&gt;.  The collection of all spheres for a given class represents the &lt;strong&gt;hit footprint&lt;/strong&gt; for that class.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A cartoon that shows RCE at work" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;p&gt;RCE uses the term &lt;strong&gt;lambda&lt;/strong&gt; instead of the term &lt;strong&gt;radius&lt;/strong&gt;.  &lt;strong&gt;Lambda&lt;/strong&gt; represents the radii of the spheres that comprise the hit footprints.&lt;/p&gt;
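A minimal NumPy sketch of the lambda calculation, using hypothetical toy points rather than the post's data set:

```python
import numpy as np

# Toy labeled training set: each row is an observation, labels hold the class.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 1])

def lambda_radius(i):
    # Lambda (the sphere radius) equals the distance from observation i
    # to the closest observation in the opposite class.
    opposite = points[labels != labels[i]]
    return np.linalg.norm(opposite - points[i], axis=1).min()

print(lambda_radius(0))  # 3.0 -- the nearest class-1 point sits 3.0 away
print(lambda_radius(1))  # 2.0
```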
&lt;p&gt;Look at the following diagram to understand the difference between &lt;strong&gt;RCE&lt;/strong&gt; and &lt;strong&gt;K-Nearest&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of RCE hit footprints" src="https://john.soban.ski/images/Rce_Python_Part_1/02_Classify_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;In the diagram, we have two classes &lt;strong&gt;X&lt;/strong&gt; and &lt;strong&gt;O&lt;/strong&gt;.  The green &lt;strong&gt;?&lt;/strong&gt; represents an observation in our inference data set.  A training observation in class &lt;strong&gt;X&lt;/strong&gt; sits closest to the green question mark.  &lt;/p&gt;
&lt;p&gt;K-Nearest would classify the green &lt;strong&gt;?&lt;/strong&gt; into class &lt;strong&gt;X&lt;/strong&gt; since it sits closest to an observation in class &lt;strong&gt;X&lt;/strong&gt;.  &lt;strong&gt;RCE&lt;/strong&gt;, however, classifies the green &lt;strong&gt;?&lt;/strong&gt; into class &lt;strong&gt;O&lt;/strong&gt;, because the unknown observation sits in the &lt;strong&gt;hit footprint&lt;/strong&gt; of class &lt;strong&gt;O&lt;/strong&gt;.  The &lt;strong&gt;hit footprint&lt;/strong&gt; approach allows &lt;strong&gt;RCE&lt;/strong&gt; to perform well on limited data sets.&lt;/p&gt;
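The classification step reduces to a membership test: does the unknown point fall inside any sphere of a class's hit footprint?  A toy sketch, with made-up sphere centers and lambdas:

```python
import numpy as np

# Hypothetical hit footprint for one class: sphere centers and their lambdas.
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
lambdas = np.array([1.5, 1.0])

def hits(point):
    # Count the class spheres that contain the point (distance below lambda).
    dists = np.linalg.norm(centers - point, axis=1)
    return int(np.less(dists, lambdas).sum())

print(hits(np.array([0.5, 0.0])))  # 1 -- inside the first sphere
print(hits(np.array([5.0, 0.0])))  # 0 -- outside the footprint
```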
&lt;h2&gt;Polars vs. Pandas&lt;/h2&gt;
&lt;p&gt;I need to overload terms in this discussion.  The &lt;strong&gt;RCE&lt;/strong&gt; algorithm uses the term &lt;strong&gt;lambda&lt;/strong&gt; to represent &lt;strong&gt;radii&lt;/strong&gt;.  Python (and therefore Pandas) uses the term &lt;strong&gt;lambda&lt;/strong&gt; to represent inline/anonymous functions.  For clarity, I will use the term &lt;strong&gt;lambda&lt;/strong&gt; when discussing the hit footprint radii, and &lt;strong&gt;lambda functions&lt;/strong&gt; when discussing the Pandas anonymous functions.  I hope that context provides further clarity.&lt;/p&gt;
&lt;p&gt;I take a functional approach to my Pandas development.  I avoid imperative control flow (for, while, if/then) and instead use the &lt;strong&gt;apply&lt;/strong&gt; construct.  I stuff data into a data frame, create a lambda function, and use the Pandas Data Frame &lt;strong&gt;apply&lt;/strong&gt; method to process the data in a selection of columns via the lambda function.  For you &lt;strong&gt;Map/Reduce&lt;/strong&gt; fans, the &lt;strong&gt;apply&lt;/strong&gt; method covers the &lt;strong&gt;map&lt;/strong&gt; side of the equation.&lt;/p&gt;
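As a toy illustration of this style (hypothetical columns, not the RCE data), apply maps a lambda function over every row:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Row-wise lambda function via apply -- the "map" side of Map/Reduce.
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(row_sums.tolist())  # [5, 7, 9]
```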
&lt;p&gt;The &lt;a href="https://docs.pola.rs/user-guide/expressions/aggregation/#do-not-kill-parallelization"&gt;Polars documentation&lt;/a&gt;, however, recommends that you avoid &lt;strong&gt;lambda functions:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Polars will try to parallelize the computation of the aggregating functions over the groups, so it is &lt;strong&gt;recommended that you avoid using lambdas&lt;/strong&gt; and custom Python functions as much as possible. Instead, try to stay within the &lt;strong&gt;realm of the Polars expression API&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given this, I will use the native &lt;strong&gt;expression API&lt;/strong&gt; rather than &lt;strong&gt;lambda functions&lt;/strong&gt; when I refactor the code.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A faceoff between the Polar Bears and Pandas" src="https://john.soban.ski/images/Polars/01_Pandas_Polars.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Algorithm&lt;/h2&gt;
&lt;p&gt;I use the following approach to execute &lt;strong&gt;RCE&lt;/strong&gt; in &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Acquire Data and store in a Data Frame &lt;/li&gt;
&lt;li&gt;Split Data into Training and Inference Data Frames &lt;/li&gt;
&lt;li&gt;Apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row (Observation) of the Training Set Data Frame and record the hit footprint for each Observation&lt;/li&gt;
&lt;li&gt;Apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row in the Inference Set Data Frame and classify each Observation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Acquire Data and store in a Data Frame&lt;/h3&gt;
&lt;p&gt;The following code acquires our data and stores it in a data frame:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas_datareader&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;df_pima&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_pima&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Split Data into Training and Inference Data Frames&lt;/h3&gt;
&lt;p&gt;We copy the data frame and execute a split.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_pima&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;
                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;
                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;
                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;
                 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;
               &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To help with plots, I use Principal Component Analysis (PCA) to reduce the dimensions of the training data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you would like to learn more about my justification and approach to dimensionality reduction, read my &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;Pandas for RCE&lt;/a&gt; blog post.&lt;/p&gt;
&lt;h3&gt;Calculate Hit Footprints (Pandas)&lt;/h3&gt;
&lt;p&gt;In Pandas, I apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row (Observation) of the &lt;strong&gt;Training Set&lt;/strong&gt; Data Frame and record the hit footprint for each Observation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RCE draws a circle around each labeled training observation, with a radius (lambda) that stops at the closest labeled training point in the opposite class. Each circle indicates the hit footprint for that class.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I use this code for the &lt;strong&gt;lambda function&lt;/strong&gt;.  Note that the function takes the entire training data set as input.  For each observation, I need to calculate the distance to every observation of a different class in the &lt;strong&gt;Training Data Frame&lt;/strong&gt;, and then select the closest.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;lambda function&lt;/strong&gt; executes the following for each observation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Remove all observations of the same class from the dataset&lt;/li&gt;
&lt;li&gt;Calculate the distance to every other observation in the filtered data set&lt;/li&gt;
&lt;li&gt;Choose the closest observation, and use the distance for lambda &lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;TIP:  Paste the above code into Colab and press the &lt;strong&gt;Explain&lt;/strong&gt; button.  Gemini will explain the code to you!&lt;/p&gt;
&lt;/blockquote&gt;
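To make the row-wise application concrete, here is a self-contained toy sketch that restates &lt;strong&gt;find_lambda&lt;/strong&gt; against a hypothetical one-component training frame (not the Pima data):

```python
import numpy as np
import pandas as pd

def find_lambda(df, v):
    # Lambda = distance to the nearest training observation of the opposite class.
    opposite = df.loc[df['outcome'] != v.iloc[-1]].iloc[:, :-1]
    return np.linalg.norm(opposite.sub(np.array(v.iloc[:-1])), axis=1).min()

train_df = pd.DataFrame({'princomp1': [0.0, 1.0, 3.0], 'outcome': [0, 0, 1]})

# Apply the lambda function to every row (Observation) of the training frame.
train_df['lambda'] = train_df.apply(lambda v: find_lambda(train_df, v), axis=1)
print(train_df['lambda'].tolist())  # [3.0, 2.0, 2.0]
```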
&lt;h3&gt;Classify Inference Data Frame (Pandas)&lt;/h3&gt;
&lt;p&gt;I create another &lt;strong&gt;lambda function&lt;/strong&gt; to find the hits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;find_hits&lt;/strong&gt; uses &lt;strong&gt;find_lambda&lt;/strong&gt;, and I apply &lt;strong&gt;find_hits&lt;/strong&gt; to the &lt;strong&gt;inference&lt;/strong&gt; data frame to classify data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The classification step &lt;strong&gt;applies&lt;/strong&gt; a lambda function to each row.&lt;/p&gt;
&lt;p&gt;The algorithm only assigns a label when one class registers &lt;strong&gt;at least one&lt;/strong&gt; hit and the other class registers &lt;strong&gt;no hits&lt;/strong&gt;.  We can tune the algorithm to use different weighting options if desired.&lt;/p&gt;
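&lt;p&gt;The decision rule reduces to a short Python function (a sketch; the name &lt;strong&gt;label_row&lt;/strong&gt; is hypothetical):&lt;/p&gt;

```python
def label_row(class0_hits, class1_hits):
    """Assign a class only when exactly one class registers hits."""
    if class0_hits > 0 and class1_hits == 0:
        return 0
    if class1_hits > 0 and class0_hits == 0:
        return 1
    return None  # ambiguous: both classes hit, or neither
```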
&lt;h3&gt;Lambda vs. Expression API&lt;/h3&gt;
&lt;p&gt;The following diagram shows the current Pandas approach.  I apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row (Observation) in the &lt;strong&gt;Training Data Frame&lt;/strong&gt;.  Each call of the &lt;strong&gt;lambda function&lt;/strong&gt; ingests a subset of the &lt;strong&gt;Training Data Frame&lt;/strong&gt; that includes every observation of the opposite class.  The original &lt;strong&gt;lambda function&lt;/strong&gt; includes a nested &lt;strong&gt;lambda function&lt;/strong&gt; that executes &lt;strong&gt;find_lambda&lt;/strong&gt;, which uses the ingested &lt;strong&gt;Training Data Frame&lt;/strong&gt; for each call.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that shows how we iterate, or apply a lambda function to the Training Data Frame" src="https://john.soban.ski/images/Polars/02_Pandas_Apply.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Every row of the &lt;strong&gt;Training Data Frame&lt;/strong&gt; must calculate the distance to &lt;strong&gt;every other row&lt;/strong&gt; in the &lt;strong&gt;Training Data Frame&lt;/strong&gt; (of opposite class).  We can either execute this logic via the application of &lt;strong&gt;lambda functions&lt;/strong&gt; or via a &lt;strong&gt;cross join&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that shows how a cross join allows for column operations, GROUP BY and summary statistic (min, sum) operations" src="https://john.soban.ski/images/Polars/03_Polars_Cross.jpg"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;cross join&lt;/strong&gt; creates a &lt;strong&gt;Data Frame&lt;/strong&gt; that includes every observation combined with every other observation (of opposite class).  In this form, we can use &lt;strong&gt;Data Frame&lt;/strong&gt; level operations to calculate distances.  This allows &lt;strong&gt;Column Based&lt;/strong&gt; processing vs. &lt;strong&gt;Row Based&lt;/strong&gt; processing.&lt;/p&gt;
&lt;p&gt;Once we have the distances in the &lt;strong&gt;cross join&lt;/strong&gt; Data Frame, we can use a &lt;strong&gt;GROUP BY&lt;/strong&gt; to find the &lt;strong&gt;min&lt;/strong&gt;, and therefore &lt;strong&gt;lambda (hit footprint radius)&lt;/strong&gt;, for each observation in the &lt;strong&gt;Training Data Frame&lt;/strong&gt;.&lt;/p&gt;
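&lt;p&gt;A minimal Pandas sketch of this cross join and &lt;strong&gt;GROUP BY&lt;/strong&gt; pattern on toy data (one principal component, hypothetical values):&lt;/p&gt;

```python
import pandas as pd

# Toy training frame: two observations per class
df = pd.DataFrame({
    'princomp1': [0.0, 1.0, 3.0, 4.0],
    'outcome':   [0, 0, 1, 1],
})

# Cross join, then keep only opposite-class pairs
pairs = df.merge(df, how='cross', suffixes=('', '_other'))
pairs = pairs[pairs['outcome'] != pairs['outcome_other']]

# Column-based squared distance, then GROUP BY min
pairs['sq_dist'] = (pairs['princomp1'] - pairs['princomp1_other']) ** 2
min_sq = pairs.groupby(['princomp1', 'outcome'])['sq_dist'].min()
```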
&lt;h3&gt;Calculate Hit Footprints (Polars)&lt;/h3&gt;
&lt;p&gt;In Polars, we use the &lt;strong&gt;expression API&lt;/strong&gt; instead of &lt;strong&gt;lambda functions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Get the list of principal component columns&lt;/li&gt;
&lt;li&gt;Create a selection for the cross join, and rename the &lt;strong&gt;Principal Component&lt;/strong&gt; (princomp) columns in the &lt;strong&gt;right side&lt;/strong&gt; Data Frame&lt;/li&gt;
&lt;li&gt;Cross-join the Data Frame with a subset of itself that includes opposite outcomes&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We include code (the &lt;strong&gt;startswith&lt;/strong&gt; check) to accommodate an arbitrary number of principal components.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;join_selection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields our &lt;strong&gt;cross join&lt;/strong&gt; Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(df_with_other_outcome.head())

┌───────────┬───────────┬─────────┬─────────────────┬─────────────────┬───────────────┐
│ princomp1 ┆ princomp2 ┆ outcome ┆ princomp1_other ┆ princomp2_other ┆ outcome_other │
│ ---       ┆ ---       ┆ ---     ┆ ---             ┆ ---             ┆ ---           │
│ f64       ┆ f64       ┆ i64     ┆ f64             ┆ f64             ┆ i64           │
╞═══════════╪═══════════╪═════════╪═════════════════╪═════════════════╪═══════════════╡
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -0.046671       ┆ -1.161939       ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.631433       ┆ 0.001913        ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.535048       ┆ -0.184561       ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.474017       ┆ -0.842351       ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ 1.105778        ┆ -1.756428       ┆ 0             │
└───────────┴───────────┴─────────┴─────────────────┴─────────────────┴───────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To find the distances to every other point of the opposite class, we execute:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;squared_distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;squared_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;squared_distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
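&lt;p&gt;Note that the builtin &lt;strong&gt;sum&lt;/strong&gt; here folds a generator of Polars expressions with &lt;strong&gt;+&lt;/strong&gt; into one combined expression.  A stand-in class (hypothetical) shows the mechanism:&lt;/p&gt;

```python
# sum() folds any objects that implement __add__, not just numbers
class FakeExpr:
    def __init__(self, text):
        self.text = text
    def __add__(self, other):
        return FakeExpr(f'({self.text} + {other.text})')
    def __radd__(self, other):
        # sum() starts from the integer seed 0, so absorb it
        return self if other == 0 else NotImplemented

princomp_cols = ['princomp1', 'princomp2']
combined = sum(FakeExpr(f'{col}_sq') for col in princomp_cols)
```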

&lt;p&gt;Notice the &lt;strong&gt;column based&lt;/strong&gt; operations, which use &lt;strong&gt;-&lt;/strong&gt;, &lt;strong&gt;+&lt;/strong&gt; and &lt;strong&gt;**&lt;/strong&gt; (square) on the columns &lt;strong&gt;princomp1&lt;/strong&gt; and &lt;strong&gt;princomp2&lt;/strong&gt;.  The operations yield the squared distance for the &lt;strong&gt;cross-joined&lt;/strong&gt; Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(df_with_other_outcome.head())

┌───────────┬───────────┬─────────┬────────────────┬───────────────┬───────────────┬───────────────┐
│ princomp1 ┆ princomp2 ┆ outcome ┆ princomp1_othe ┆ princomp2_oth ┆ outcome_other ┆ squared_dista │
│ ---       ┆ ---       ┆ ---     ┆ r              ┆ er            ┆ ---           ┆ nce           │
│ f64       ┆ f64       ┆ i64     ┆ ---            ┆ ---           ┆ i64           ┆ ---           │
│           ┆           ┆         ┆ f64            ┆ f64           ┆               ┆ f64           │
╞═══════════╪═══════════╪═════════╪════════════════╪═══════════════╪═══════════════╪═══════════════╡
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -0.046671      ┆ -1.161939     ┆ 0             ┆ 6.272523      │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.631433      ┆ 0.001913      ┆ 0             ┆ 18.911708     │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.535048      ┆ -0.184561     ┆ 0             ┆ 17.589336     │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.474017      ┆ -0.842351     ┆ 0             ┆ 15.726017     │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ 1.105778       ┆ -1.756428     ┆ 0             ┆ 1.777314      │
└───────────┴───────────┴─────────┴────────────────┴───────────────┴───────────────┴───────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
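&lt;p&gt;As a spot check of the first row above, the squared distance equals the sum of squared per-component differences:&lt;/p&gt;

```python
# Values copied from the first row of the output above
d1 = 2.425151 - (-0.046671)   # princomp1 minus princomp1_other
d2 = -1.565197 - (-1.161939)  # princomp2 minus princomp2_other
sq_dist = d1 ** 2 + d2 ** 2   # approximately 6.272523
```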

&lt;p&gt;A &lt;strong&gt;GROUP BY&lt;/strong&gt; followed by the &lt;strong&gt;min&lt;/strong&gt; operation returns the &lt;strong&gt;minimum squared distance&lt;/strong&gt; for each observation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; 
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;square root&lt;/strong&gt; operation gives us the minimum Euclidean distance, or &lt;strong&gt;lambda&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lambda_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We now have a &lt;strong&gt;Training Data Frame&lt;/strong&gt; that records the &lt;strong&gt;lambda&lt;/strong&gt; (hit footprint radius) for every Observation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬─────────┬──────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═════════╪══════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.242788&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.433054&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.030056&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.035114&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.063441&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.087633&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.237591&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.848636&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.130412&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.477868&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.650595&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.123065&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;1.136271&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.039659&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.140668&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴─────────┴──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Classify Inference Data Frame (Polars)&lt;/h3&gt;
&lt;p&gt;We refactor our algorithm to use the Polars &lt;strong&gt;expression API&lt;/strong&gt; instead of &lt;strong&gt;lambda functions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, we create a 2D grid of data.  The grid provides the &lt;strong&gt;Inference Data Frame&lt;/strong&gt;.  This code assumes we have only two Principal Components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; 
&lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; 

&lt;span class="n"&gt;class_df&lt;/span&gt; \
    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;princomp1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;grid_start&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
                &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;grid_range&lt;/span&gt;
                    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;grid_size&lt;/span&gt;
                        &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;princomp2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;grid_start&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
                &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;grid_range&lt;/span&gt;
                    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;grid_size&lt;/span&gt;
                        &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Polars classification approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cross joins the &lt;strong&gt;Inference Data Frame&lt;/strong&gt; with the &lt;strong&gt;Training Data Frame&lt;/strong&gt; to drive column-based, group-by, and summary-statistic (sum) operations&lt;/li&gt;
&lt;li&gt;Calculates the Euclidean distance between each Observation in the &lt;strong&gt;Inference Data Frame&lt;/strong&gt; and each point in the &lt;strong&gt;Training Data Frame&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Flags a &lt;strong&gt;Hit&lt;/strong&gt; whenever the Euclidean distance falls below the lambda (footprint radius) distance&lt;/li&gt;
&lt;li&gt;Aggregates the hits.  For now, we only classify an observation when its hits fall in &lt;strong&gt;only one&lt;/strong&gt; class (we can tune this)&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross join &lt;/span&gt;
&lt;span class="n"&gt;lambda_df_renamed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;loutcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;llambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lambda_df_renamed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find Euclidean Distances&lt;/span&gt;
&lt;span class="n"&gt;distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distance_expr&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ID Hits&lt;/span&gt;
&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;is_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;distance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;llambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# GROUP BY and SUM hits&lt;/span&gt;
&lt;span class="n"&gt;grid_group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_group_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;is_hit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loutcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;is_hit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loutcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Decide on class&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields a classified &lt;strong&gt;Inference Data Frame&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬────────────────┬────────────────┬───────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_0&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i32&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪════════════════╪════════════════╪═══════════════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.913043&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.498328&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.591973&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.498328&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.277592&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.110368&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.531773&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.571906&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.478261&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.973244&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;             &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴────────────────┴────────────────┴───────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We plot a &lt;strong&gt;heat map&lt;/strong&gt; of the Classified Data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Polars to Pandas for Matplotlib&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pivot and sort data for Heatmap&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reshape to 300x300 grid&lt;/span&gt;
&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; &lt;span class="c1"&gt;# Update grid_size to match the actual grid size&lt;/span&gt;
&lt;span class="n"&gt;predicted_outcome_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;predicted_outcome&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set Axis&lt;/span&gt;
&lt;span class="n"&gt;princomp1_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;princomp2_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Heatmap&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;predicted_outcome_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;origin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lower&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;princomp1_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;princomp1_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;princomp2_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;princomp2_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Classification Boundary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A heat map of Outcome 1 vs. Outcome 2 for the Pima Dataset" src="https://john.soban.ski/images/Polars/04_Rce_Map.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Execution Time&lt;/h2&gt;
&lt;p&gt;I used the same Python environment to run my algorithm on the same data set in both Pandas and Polars.  I saw a significant reduction in wall clock time to complete the processing.&lt;/p&gt;
&lt;p&gt;The Pandas (lambda) approach took roughly six (6) minutes to complete.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screengrab that shows the execution time of the Pandas approach" src="https://john.soban.ski/images/Polars/05_Pandas_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The Polars approach took only fourteen (14) seconds, a reduction of 96%.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Screengrab that shows the execution time for the Polars approach" src="https://john.soban.ski/images/Polars/06_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This reduction of 96% equates to a performance gain of 25x.&lt;/p&gt;
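&lt;p&gt;As a quick sanity check, the arithmetic behind that claim (both times come from the measurements above):&lt;/p&gt;

```python
# Wall-clock times reported above
pandas_seconds = 6 * 60  # roughly six minutes
polars_seconds = 14      # roughly fourteen seconds

speedup = pandas_seconds / polars_seconds        # about 25.7x
reduction = 1 - polars_seconds / pandas_seconds  # about 0.96, i.e. 96%
print(f"{speedup:.1f}x faster, {reduction:.0%} reduction")
```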
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The cross-join approach, combined with the parallel architecture of Polars, yielded a 25x boost over Pandas.  The cross-join, however, requires the compute to hold n&lt;sup&gt;2&lt;/sup&gt; rows and 2m columns in memory, given n rows in the training set with m feature columns.  Next month I will look at a way to mitigate the scenario where the cross-join Data Frame exceeds available memory.&lt;/p&gt;
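&lt;p&gt;A back-of-the-envelope sketch of that memory cost (the row count and feature count below are illustrative assumptions, not measurements):&lt;/p&gt;

```python
# Memory estimate for the cross-joined Data Frame described above
n = 768            # assumption: rows in the training set
m = 2              # assumption: principal-component feature columns
bytes_per_f64 = 8  # each float64 cell costs eight bytes

rows = n ** 2      # cross join yields n squared rows
cols = 2 * m       # feature columns arrive from both sides of the join
est_mb = rows * cols * bytes_per_f64 / 1e6
print(f"~{est_mb:.0f} MB before aggregation")  # grows quadratically with n
```

Double the training rows and the estimate quadruples, which explains why the cross-join eventually exceeds available memory.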
&lt;p&gt;&lt;img alt="Another picture of Polar Bears vs. Pandas" src="https://john.soban.ski/images/Polars/07_Square_Off.jpg"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Python"></category><category term="Polars"></category><category term="Pandas"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="Neural Networks"></category></entry><entry><title>Quick and Easy Capacity Planning with Pandas</title><link href="https://john.soban.ski/capacity.html" rel="alternate"></link><published>2024-09-28T09:41:00-04:00</published><updated>2024-09-28T09:41:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2024-09-28:/capacity.html</id><summary type="html">&lt;p&gt;The discipline of Operations Research (OR) applies analytical methods from math, statistics, economics, and computer science to help leaders make good decisions.  &lt;/p&gt;
&lt;p&gt;Enterprise software consumes physical resources (CPU, Memory, Disk, and Bandwidth) to provide mission-essential services.  Software, Cloud, and Data Center &lt;strong&gt;Architects&lt;/strong&gt; must identify the expected resource consumption to optimize …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The discipline of Operations Research (OR) applies analytical methods from math, statistics, economics, and computer science to help leaders make good decisions.  &lt;/p&gt;
&lt;p&gt;Enterprise software consumes physical resources (CPU, Memory, Disk, and Bandwidth) to provide mission-essential services.  Software, Cloud, and Data Center &lt;strong&gt;Architects&lt;/strong&gt; must identify the expected resource consumption to optimize resource spend.  Operations Research &lt;strong&gt;Engineers&lt;/strong&gt; develop &lt;strong&gt;Capacity Planning&lt;/strong&gt; models to drive decisions around CAPEX and OPEX purchases.&lt;/p&gt;
&lt;p&gt;Today you will learn how to develop a Python Pandas &lt;strong&gt;Capacity Planning&lt;/strong&gt; model that &lt;strong&gt;right sizes&lt;/strong&gt; the resources needed for a simple &lt;strong&gt;Web Application&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Approach&lt;/h2&gt;
&lt;p&gt;We use concepts from Fermi Estimation and the Jackson Network Theorem (Product-Form Solution) to drive our model.  Both allow us to remove unnecessary details (and rabbit holes) from our model.  Our model will nonetheless yield a reasonable estimate.  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;BONUS:  Our approach yields artifacts (flow diagrams) that demonstrate rigorous, considered thought and discipline.  You can expect them to satisfy (most) audits or engineering reviews.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The stages of our model development workflow include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Record the Nominal Architecture (Node Diagram)&lt;/li&gt;
&lt;li&gt;ID the Data Flows through the Architecture (Use Cases)&lt;/li&gt;
&lt;li&gt;Estimate the Maximum Throughput per flow (Gb/s)&lt;/li&gt;
&lt;li&gt;Sum the Max Throughput for Each Node&lt;/li&gt;
&lt;li&gt;Map Max Gb/s to the required number of CPUs per Node&lt;/li&gt;
&lt;li&gt;Price the Sum Total for the required CPUs ($)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;1. Record the Nominal Architecture (Node Diagram)&lt;/h2&gt;
&lt;p&gt;For illustration, we will size a simple web application.  The set of &lt;strong&gt;Application Nodes&lt;/strong&gt; includes an App Server, File Store, Firewall, and a Core Switch that connects the nodes.  Users and Developers both use the system.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Node Diagram that shows a Firewall, File Store, App Server connected via a Core Switch" src="https://john.soban.ski/images/Capacity/01_Node_Diagram.jpg"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  Once you understand our approach, feel free to tailor the names and roles of the &lt;strong&gt;Nodes&lt;/strong&gt; for your models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;2.  ID the Data Flows through the Architecture (Use Cases)&lt;/h2&gt;
&lt;p&gt;Our Application hosts a &lt;strong&gt;Photo Album&lt;/strong&gt; service.  &lt;strong&gt;Users&lt;/strong&gt; upload and retrieve photos via the &lt;strong&gt;Web Application&lt;/strong&gt;.  &lt;strong&gt;Developers&lt;/strong&gt; look at logs from the server to optimize the user experience.&lt;/p&gt;
&lt;p&gt;The following graphic captures the three main flows.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Flow Diagram that shows Data Flows through our system" src="https://john.soban.ski/images/Capacity/02_Flow_Diagram.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The Three Flows include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Photo upload&lt;/li&gt;
&lt;li&gt;Photo Retrieval&lt;/li&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;3.  Estimate the Maximum Throughput per flow (Gb/s)&lt;/h2&gt;
&lt;p&gt;How do we estimate the expected throughput of our system?  File Size, Upload Time, and (Max) Number of Concurrent Users drive the throughput.  &lt;/p&gt;
&lt;p&gt;The following model produces the desired metric:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Upload Throughput in Gb/s" src="https://john.soban.ski/images/Capacity/03_Max_Upload.jpg"&gt;&lt;/p&gt;
&lt;p&gt;For example, assume that each photo averages four &lt;strong&gt;(4) MegaBytes (MB)&lt;/strong&gt; in size.  My Samsung Galaxy phone produces photos of this size by default.  Then, assume we have (at max) &lt;strong&gt;thirty-two (32) Concurrent Users&lt;/strong&gt; and each photo takes &lt;strong&gt;one quarter (1/4) of a second&lt;/strong&gt; to upload.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Upload Throughput in Gb/s, with numbers" src="https://john.soban.ski/images/Capacity/04_Upload_Calc.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This formula estimates a max throughput of &lt;strong&gt;4.096 Gigabits per second (Gb/s)&lt;/strong&gt; for our file upload use case.&lt;/p&gt;
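&lt;p&gt;A few lines of Python reproduce the arithmetic.  Note the decimal units (one MB here equals eight million bits); the constants come straight from the example above:&lt;/p&gt;

```python
# Max upload throughput = (bits per photo * concurrent users) / upload time
FILE_SIZE_MB = 4        # average photo size, MegaBytes
CONCURRENT_USERS = 32   # max concurrent users
UPLOAD_TIME_S = 0.25    # seconds per photo upload

bits_per_photo = FILE_SIZE_MB * 8_000_000  # decimal MB -> bits
max_gbps = (bits_per_photo * CONCURRENT_USERS / UPLOAD_TIME_S) / 1_000_000_000
print(max_gbps)  # 4.096
```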
&lt;p&gt;Another formula calculates the logs' maximum network capacity consumption.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Log Throughput in Gb/s" src="https://john.soban.ski/images/Capacity/05_Max_Log.jpg"&gt;&lt;/p&gt;
&lt;p&gt;An average Syslog message (for Apache2) runs about one and a half (1.5) KiloBytes (KB).&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Log Throughput in Gb/s, with numbers" src="https://john.soban.ski/images/Capacity/06_Log_Calc.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This formula yields a max of &lt;strong&gt;0.307 Gb/s&lt;/strong&gt; for log traffic.&lt;/p&gt;
&lt;p&gt;Feel free to either benchmark your own data or use rough order of magnitude (ROM) numbers in your calculation.  When you run the model, you will learn that ROM numbers provide enough data for acceptable results.&lt;/p&gt;
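&lt;p&gt;To see how little precision the model demands, consider a quick ROM sketch of the log flow.  The per-message size comes from the text above; the message rate is a hypothetical assumption chosen for illustration:&lt;/p&gt;

```python
# ROM estimate of log traffic: message size * message rate
LOG_SIZE_KB = 1.5          # average Apache2 syslog message (from the text)
LOGS_PER_SECOND = 25_000   # hypothetical peak message rate

log_gbps = LOG_SIZE_KB * 8_000 * LOGS_PER_SECOND / 1_000_000_000
print(log_gbps)  # 0.3
```

&lt;p&gt;A ROM rate in the neighborhood of 0.3 Gb/s lands in the same ballpark as the 0.307 Gb/s figure above, which suffices for sizing.&lt;/p&gt;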
&lt;h2&gt;4.  Sum the Max Throughput for Each Node&lt;/h2&gt;
&lt;p&gt;This stage of our pipeline uses the Python &lt;strong&gt;Pandas&lt;/strong&gt; and &lt;strong&gt;NumPy&lt;/strong&gt; libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Map Flows to Tables&lt;/h3&gt;
&lt;p&gt;We need to map our data flows from our flow diagram to a &lt;strong&gt;Pandas&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Flow Diagram that shows Data Flows through our system" src="https://john.soban.ski/images/Capacity/02_Flow_Diagram.jpg"&gt;&lt;/p&gt;
&lt;p&gt;In the flow above, we see that the &lt;strong&gt;Photo Upload Flow&lt;/strong&gt; originates at the &lt;strong&gt;User&lt;/strong&gt;, hits the &lt;strong&gt;Firewall&lt;/strong&gt;, traverses the &lt;strong&gt;Core Switch&lt;/strong&gt; to the &lt;strong&gt;App Server&lt;/strong&gt;, and then terminates at the &lt;strong&gt;File Store&lt;/strong&gt; via another trip through the &lt;strong&gt;Core Switch&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We map this description to the following table:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;SourceNode&lt;/th&gt;
&lt;th&gt;DestinationNode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;user&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can ignore the &lt;strong&gt;User&lt;/strong&gt; to &lt;strong&gt;Firewall&lt;/strong&gt; flow since we do not want to size the &lt;strong&gt;User's&lt;/strong&gt; computer.&lt;/p&gt;
&lt;p&gt;To further minimize rote work, we will delete the &lt;strong&gt;Core Switch&lt;/strong&gt; entries.  Since every flow traverses the &lt;strong&gt;Core Switch&lt;/strong&gt;, we will add those entries back in one batch job at a later point.&lt;/p&gt;
&lt;p&gt;We reduce the above table to:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;SourceNode&lt;/th&gt;
&lt;th&gt;DestinationNode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We append the &lt;strong&gt;photo_retrieval&lt;/strong&gt; and &lt;strong&gt;logs&lt;/strong&gt; flows to the table.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;SourceNode&lt;/th&gt;
&lt;th&gt;DestinationNode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;photo_retrieval&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;photo_retrieval&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;logs&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;developers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We use a &lt;strong&gt;Python Dict&lt;/strong&gt; to import the data.  You can also use a &lt;strong&gt;Comma-Separated Values (CSV)&lt;/strong&gt; file, a &lt;strong&gt;Structured Query Language (SQL)&lt;/strong&gt; table, or &lt;strong&gt;JavaScript Object Notation (JSON)&lt;/strong&gt; to encode your flows.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;developers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then import the &lt;strong&gt;Dict&lt;/strong&gt; into a Pandas &lt;strong&gt;DataFrame.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Flow = pd.DataFrame(data=flow)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
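
&lt;p&gt;For example, if you prefer the CSV route, &lt;strong&gt;read_csv&lt;/strong&gt; builds the same DataFrame.  The inline string below stands in for a hypothetical flows file:&lt;/p&gt;

```python
import io

import pandas as pd

# The same five flows, encoded as CSV
CSV_TEXT = """Data,SourceNode,DestinationNode
photo_upload,firewall,app_server
photo_upload,app_server,file_store
photo_retrieval,file_store,app_server
photo_retrieval,app_server,firewall
logs,app_server,developers
"""

Flow = pd.read_csv(io.StringIO(CSV_TEXT))
print(Flow.shape)  # (5, 3)
```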

&lt;p&gt;&lt;strong&gt;Print&lt;/strong&gt; displays the &lt;strong&gt;Flow&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

              &lt;span class="n"&gt;Data&lt;/span&gt;  &lt;span class="n"&gt;SourceNode&lt;/span&gt; &lt;span class="n"&gt;DestinationNode&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;    &lt;span class="n"&gt;firewall&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;        &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE:  You can add more rows and flows to suit your needs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We now add the &lt;strong&gt;Core Switch&lt;/strong&gt; nodes back into the table.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Core_Flows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Flow&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DestinationNode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;core_switch&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;Flow&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SourceNode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;core_switch&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                      &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This batch job introduces the &lt;strong&gt;Core Switch&lt;/strong&gt; back into the flows.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Core_Flows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

              &lt;span class="n"&gt;Data&lt;/span&gt;   &lt;span class="n"&gt;SourceNode&lt;/span&gt; &lt;span class="n"&gt;DestinationNode&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;        &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE:  The above code outputs an arbitrary ordering of the flows.  We can ignore the order since we aim to execute a per-node &lt;strong&gt;GROUP BY&lt;/strong&gt; operation at the end of our pipeline.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Our model requires a &lt;strong&gt;Node&lt;/strong&gt; view of the flows, independent of &lt;strong&gt;Source&lt;/strong&gt; or &lt;strong&gt;Destination&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We achieve this through the following &lt;strong&gt;Pandas&lt;/strong&gt; operations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Core_Flows&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
          &lt;span class="n"&gt;Core_Flows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;})]&lt;/span&gt;
         &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our &lt;strong&gt;concat&lt;/strong&gt; operation outputs:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

   &lt;span class="o"&gt;...&lt;/span&gt;

               &lt;span class="n"&gt;Data&lt;/span&gt;         &lt;span class="n"&gt;Node&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;              &lt;span class="n"&gt;logs&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;              &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;12&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;13&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;14&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;15&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="mi"&gt;16&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;17&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;19&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
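
&lt;p&gt;As an aside, the Pandas &lt;strong&gt;melt&lt;/strong&gt; method expresses the same unpivot (two node columns collapsed into one) more compactly.  The sketch below uses a minimal two-row stand-in for &lt;strong&gt;Core_Flows&lt;/strong&gt;:&lt;/p&gt;

```python
import pandas as pd

# Minimal stand-in for Core_Flows (same columns as in the text)
Core_Flows = pd.DataFrame(
    {'Data': ['photo_upload', 'photo_upload'],
     'SourceNode': ['firewall', 'core_switch'],
     'DestinationNode': ['core_switch', 'app_server']})

# melt unpivots SourceNode and DestinationNode into a single Node column
Node_Flows = (Core_Flows
              .melt(id_vars='Data',
                    value_vars=['SourceNode', 'DestinationNode'],
                    value_name='Node')
              .drop('variable', axis=1))
print(len(Node_Flows))  # 4
```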

&lt;p&gt;A simple &lt;strong&gt;GROUP BY&lt;/strong&gt; operation verifies the quantity of &lt;strong&gt;Flows per Node&lt;/strong&gt;.  The Output of the operation matches the number of &lt;strong&gt;Flows per Node&lt;/strong&gt; in our &lt;strong&gt;Flow Diagram&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Node&lt;/span&gt;
&lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="mi"&gt;8&lt;/span&gt;
&lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;developers&lt;/span&gt;      &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;firewall&lt;/span&gt;        &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;JOIN Data Rates into Node Flow Table&lt;/h3&gt;
&lt;p&gt;I use another &lt;strong&gt;Dict&lt;/strong&gt; to import the (Estimated) Max Throughput for the &lt;strong&gt;Upload&lt;/strong&gt;, &lt;strong&gt;Retrieval&lt;/strong&gt;, and &lt;strong&gt;Logs&lt;/strong&gt; data flows into a DataFrame.  Once more, you can use the &lt;strong&gt;encoding format&lt;/strong&gt; of your choosing (CSV, SQL, JSON).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.307&lt;/span&gt;&lt;span class="p"&gt;,]}&lt;/span&gt;

&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

              &lt;span class="n"&gt;Data&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;     &lt;span class="mf"&gt;4.096&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;     &lt;span class="mf"&gt;4.096&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We &lt;strong&gt;JOIN&lt;/strong&gt; this &lt;strong&gt;Data&lt;/strong&gt; DataFrame into our &lt;strong&gt;Flow&lt;/strong&gt; DataFrame via a &lt;strong&gt;merge&lt;/strong&gt; operation.&lt;/p&gt;
&lt;p&gt;We also apply a &lt;strong&gt;SUM&lt;/strong&gt; operation via the &lt;strong&gt;GROUP BY&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# JOIN Node_Flows and Data on the Data column&lt;/span&gt;

&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This then outputs the &lt;strong&gt;MAX&lt;/strong&gt; throughput (sum of &lt;strong&gt;ALL&lt;/strong&gt; flows) on a per-node basis.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

          &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;18.341&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;36.682&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;9.292&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.742&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
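
&lt;p&gt;One caution with the left &lt;strong&gt;merge&lt;/strong&gt;: a misspelled flow name silently matches nothing, and its missing rate sums as zero.  The &lt;strong&gt;indicator&lt;/strong&gt; argument flags such rows.  A sketch, with a deliberately bad flow name:&lt;/p&gt;

```python
import pandas as pd

# Two flow rows, one of which ('bad_name') has no matching data rate
Node_Flows = pd.DataFrame({'Data': ['photo_upload', 'bad_name'],
                           'Node': ['firewall', 'app_server']})
Data = pd.DataFrame({'Data': ['photo_upload'], 'RateGbps': [4.096]})

merged = pd.merge(Node_Flows, Data, on='Data', how='left', indicator=True)

# Rows that appear only on the left side failed to match a rate
unmatched = merged[merged['_merge'] == 'left_only']
print(list(unmatched['Data']))  # ['bad_name']
```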

&lt;h2&gt;5.  Map Max Gb/s to the required number of CPUs per Node&lt;/h2&gt;
&lt;p&gt;Assume that each CPU cycle processes one &lt;strong&gt;bit of throughput&lt;/strong&gt;, or one &lt;strong&gt;bit per Hertz (Hz).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We identify the number of cores needed via this calculation.&lt;/p&gt;
&lt;p&gt;For purposes of this model, I use the &lt;strong&gt;Intel Xeon Silver 4214 Processor&lt;/strong&gt;, which retails at approximately $185.00 (USD) in 2024.&lt;/p&gt;
&lt;p&gt;We feed our model with the &lt;strong&gt;Silver's&lt;/strong&gt; specs (2.2 GHz, 12 cores, 16.5 MB cache):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;CPU_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Intel Xeon Silver 4214&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;CPU_CLOCK_SPEED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.2&lt;/span&gt;
&lt;span class="n"&gt;CPU_CORES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;
&lt;span class="n"&gt;CPU_PRICE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;185.00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In addition, we need to account for various sources of processing overhead, or &lt;strong&gt;Taxes&lt;/strong&gt;.  Feel free to benchmark your own metrics.  I use the following percentages:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;OS_TAX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="n"&gt;VIRTUALIZATION_TAX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;
&lt;span class="n"&gt;TLS_TAX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From here, we convert the &lt;strong&gt;Max Gb/s per node&lt;/strong&gt; to &lt;strong&gt;GHz&lt;/strong&gt;, and then &lt;strong&gt;GHz&lt;/strong&gt; to &lt;strong&gt;Cores&lt;/strong&gt;.  Note the &lt;strong&gt;Ceiling&lt;/strong&gt; operation, since we can't buy a fractional core:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCores&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                               &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                                     &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; 
                                       &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;OS_TAX&lt;/span&gt; 
                                       &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;VIRTUALIZATION_TAX&lt;/span&gt; 
                                       &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TLS_TAX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                     &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CPU_CLOCK_SPEED&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
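&lt;p&gt;As a self-contained sanity check, we can reproduce the conversion with a minimal sketch (the two node rates below serve as illustrative inputs, not new measurements):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Overhead taxes and CPU clock speed from the constants above
OS_TAX, VIRTUALIZATION_TAX, TLS_TAX = 0.05, 0.15, 0.05
CPU_CLOCK_SPEED = 2.2  # GHz per core

# Illustrative per-node rates (Gb/s)
Node_Flows_Data = pd.DataFrame({'Node': ['app_server', 'firewall'],
                                'RateGbps': [16.691, 8.192]})

# Gb/s -> GHz (inflated by the taxes) -> whole cores
Node_Flows_Data['NumCores'] = np.ceil(
    Node_Flows_Data['RateGbps']
    * (1 + OS_TAX + VIRTUALIZATION_TAX + TLS_TAX)
    / CPU_CLOCK_SPEED)

print(Node_Flows_Data)
```

&lt;p&gt;The app_server, for example, needs ceil(16.691 * 1.25 / 2.2) = ceil(9.48) = 10 cores.&lt;/p&gt;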

&lt;p&gt;Based on the above &lt;strong&gt;math&lt;/strong&gt;, we see that the &lt;strong&gt;core_switch&lt;/strong&gt; requires the most &lt;strong&gt;cores&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;          &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;  &lt;span class="n"&gt;NumCores&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;18.341&lt;/span&gt;      &lt;span class="mf"&gt;11.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;36.682&lt;/span&gt;      &lt;span class="mf"&gt;21.0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;9.292&lt;/span&gt;       &lt;span class="mf"&gt;6.0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.742&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;6.  Price the Total Required CPU ($)&lt;/h2&gt;
&lt;p&gt;We know that each CPU includes &lt;strong&gt;twelve (12) 2.2 GHz cores&lt;/strong&gt; and costs $185.  We use the following &lt;strong&gt;Pandas&lt;/strong&gt; statements to calculate the cost.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCpu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCores&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                                   &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;CPU_CORES&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;TotalCpuCost&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCpu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                                &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CPU_PRICE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
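&lt;p&gt;A minimal, runnable sketch of the same two statements (the core counts below serve as illustrative inputs from the previous step):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

CPU_CORES = 12      # cores per CPU
CPU_PRICE = 185.00  # dollars per CPU

# Illustrative per-node core requirements
Node_Flows_Data = pd.DataFrame({'Node': ['app_server', 'core_switch'],
                                'NumCores': [10.0, 19.0]})

# Round up to whole CPUs, then multiply by unit price
Node_Flows_Data['NumCpu'] = np.ceil(Node_Flows_Data['NumCores'] / CPU_CORES)
Node_Flows_Data['TotalCpuCost'] = Node_Flows_Data['NumCpu'] * CPU_PRICE

print(Node_Flows_Data)
# core_switch needs ceil(19 / 12) = 2 CPUs, which costs 2 * $185 = $370
```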

&lt;p&gt;The following lines of code add a &lt;strong&gt;Totals&lt;/strong&gt; row:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Total_Row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;
             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;Total_Row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Total&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;pd&lt;/span&gt;
                              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Total_Row&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                            &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
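&lt;p&gt;The same &lt;strong&gt;Totals&lt;/strong&gt; pattern works on any frame; here is a toy example (node names and costs made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Toy frame with one text column and one numeric column
df = pd.DataFrame({'Node': ['node_a', 'node_b'],
                   'TotalCpuCost': [185.0, 370.0]})

# Sum only the numeric columns into a Series, then label the row
total_row = df.sum(numeric_only=True)
total_row['Node'] = 'Total'

# Append the Series as a one-row DataFrame
df = pd.concat([df, pd.DataFrame([total_row])], ignore_index=True)
print(df)
```

&lt;p&gt;The &lt;strong&gt;numeric_only=True&lt;/strong&gt; flag keeps Pandas from trying to sum the text column.&lt;/p&gt;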

&lt;p&gt;The final output reads:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

          &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;  &lt;span class="n"&gt;NumCores&lt;/span&gt;  &lt;span class="n"&gt;NumCpu&lt;/span&gt;  &lt;span class="n"&gt;TotalCpuCost&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;16.691&lt;/span&gt;      &lt;span class="mf"&gt;10.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;33.382&lt;/span&gt;      &lt;span class="mf"&gt;19.0&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;         &lt;span class="mf"&gt;370.0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;8.192&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.192&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;        &lt;span class="n"&gt;Total&lt;/span&gt;    &lt;span class="mf"&gt;66.764&lt;/span&gt;      &lt;span class="mf"&gt;40.0&lt;/span&gt;     &lt;span class="mf"&gt;6.0&lt;/span&gt;        &lt;span class="mf"&gt;1110.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can expect to pay $1,110 to purchase the required &lt;strong&gt;CPUs&lt;/strong&gt; for our &lt;strong&gt;Photo Album&lt;/strong&gt; Web Application.&lt;/p&gt;
&lt;h2&gt;Bonus:  Extend the model&lt;/h2&gt;
&lt;p&gt;We can easily add new flows to our model.&lt;/p&gt;
&lt;p&gt;For example, let's add a &lt;strong&gt;Data Analyst&lt;/strong&gt; use case to our system.  The &lt;strong&gt;Data Analysts&lt;/strong&gt; look at new &lt;strong&gt;Metadata Flows&lt;/strong&gt; (Flows &lt;strong&gt;four&lt;/strong&gt; and &lt;strong&gt;five&lt;/strong&gt; in the diagram below):&lt;/p&gt;
&lt;p&gt;&lt;img alt="An updated flow diagram with new flows for Metadata analysis" src="https://john.soban.ski/images/Capacity/07_Metadata_Flow.jpg"&gt;&lt;/p&gt;
&lt;p&gt;We add these two new flows (From the &lt;strong&gt;App Server to the File Store&lt;/strong&gt;, and from the &lt;strong&gt;File Store to the Analysts&lt;/strong&gt;) to our &lt;strong&gt;Flow&lt;/strong&gt; and &lt;strong&gt;Data&lt;/strong&gt; DataFrames:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;metadata_archival&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;metadata_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;metadata_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;developers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;data_analysts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,]}&lt;/span&gt;

&lt;span class="n"&gt;Flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Flow&lt;/strong&gt; DataFrame now includes &lt;strong&gt;Metadata&lt;/strong&gt; flows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

                 &lt;span class="n"&gt;Data&lt;/span&gt;  &lt;span class="n"&gt;SourceNode&lt;/span&gt; &lt;span class="n"&gt;DestinationNode&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;        &lt;span class="n"&gt;photo_upload&lt;/span&gt;    &lt;span class="n"&gt;firewall&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;        &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;     &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;     &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;        &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;                &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="n"&gt;metadata_archival&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;  &lt;span class="n"&gt;metadata_retrieval&lt;/span&gt;  &lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;  &lt;span class="n"&gt;metadata_retrieval&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;   &lt;span class="n"&gt;data_analysts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We also update the &lt;strong&gt;Data&lt;/strong&gt; DataFrame to include &lt;strong&gt;Metadata&lt;/strong&gt; flows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;metadata_archival&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;metadata_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.307&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.550&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then run the rest of the commands above, without edit, and the model &lt;strong&gt;learns&lt;/strong&gt; about the new Node (Analysts) and Flows (Metadata Archival &amp;amp; Retrieval).&lt;/p&gt;
&lt;p&gt;We see that we only need to buy &lt;strong&gt;one (1) new CPU&lt;/strong&gt; for the &lt;strong&gt;Data Analysts'&lt;/strong&gt; workstation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;            &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;  &lt;span class="n"&gt;NumCores&lt;/span&gt;  &lt;span class="n"&gt;NumCpu&lt;/span&gt;  &lt;span class="n"&gt;TotalCpuCost&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;18.341&lt;/span&gt;      &lt;span class="mf"&gt;11.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;36.682&lt;/span&gt;      &lt;span class="mf"&gt;21.0&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;         &lt;span class="mf"&gt;370.0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;data_analysts&lt;/span&gt;     &lt;span class="mf"&gt;0.550&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;     &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;9.292&lt;/span&gt;       &lt;span class="mf"&gt;6.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.192&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;          &lt;span class="n"&gt;Total&lt;/span&gt;    &lt;span class="mf"&gt;73.364&lt;/span&gt;      &lt;span class="mf"&gt;45.0&lt;/span&gt;     &lt;span class="mf"&gt;7.0&lt;/span&gt;        &lt;span class="mf"&gt;1295.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A Graphic that shows an ominous figure walking into a dense, server filled data center, with pipes and greebles on the ceiling" src="https://john.soban.ski/images/Capacity/08_Data_Flow.jpg"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="Operations Research"></category><category term="Data Science"></category><category term="Pandas"></category><category term="Python"></category><category term="HOWTO"></category></entry><entry><title>Combat ChatGPT Use in Academia and Government Acquisition</title><link href="https://john.soban.ski/far.html" rel="alternate"></link><published>2024-07-27T07:27:00-04:00</published><updated>2024-07-27T07:27:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2024-07-27:/far.html</id><summary type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Government Acquisition Teams (and Academics) request written responses to gauge writers' capacity for difficult work.  Good writing requires effort.  Writers must provide clear, focused, and valuable prose to Readers.    &lt;/p&gt;
&lt;p&gt;Writers must &lt;strong&gt;do the work&lt;/strong&gt; for the reader, and remove ambiguity, falsehoods, filler content, and irrelevant bloviating.  &lt;/p&gt;
&lt;p&gt;Generative AI provides …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Government Acquisition Teams (and Academics) request written responses to gauge writers' capacity for difficult work.  Good writing requires effort.  Writers must provide clear, focused, and valuable prose to Readers.    &lt;/p&gt;
&lt;p&gt;Writers must &lt;strong&gt;do the work&lt;/strong&gt; for the reader, and remove ambiguity, falsehoods, filler content, and irrelevant bloviating.  &lt;/p&gt;
&lt;p&gt;Generative AI provides a cheat code for writing.  Putative writers enter text into a prompt, and the AI produces mediocre, passable text in seconds.  GenAI undermines the goals of writing assignments.  In the face of instant prose development, Acquisition Teams can no longer infer Offerors' work ethic and ability from their prose.&lt;/p&gt;
&lt;p&gt;Today I propose a simple rubric to detect the use of GenAI in writing.  Graders and reviewers can then decide how to penalize Offerors who use GenAI.  Where traditional writing demonstrated an ability to work hard, GenAI use reflects a willingness to take shortcuts.&lt;/p&gt;
&lt;h2&gt;Prior Art&lt;/h2&gt;
&lt;p&gt;Liang et al developed a method to identify the likelihood of Large Language model use in Academic white papers.  Liang:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;present[s] an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM) [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Early attempts at LLM detection looked at individual articles.  Liang's novel approach attempts detection based on the analysis of a &lt;strong&gt;large corpus&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Their results &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;suggest that between 6.5% and 16.9% of text submitted as peer reviews to [Academic] conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Consider that for a moment.  Academic journals and conferences represent the pinnacle of creative and scientific thought.  Submitters typically hold a PhD or work toward one.  Liang, however, discovered that as many as one in six submissions &lt;strong&gt;substantially&lt;/strong&gt; used an LLM to generate prose.&lt;/p&gt;
&lt;p&gt;Liang correlates LLM use with low quality, procrastination, and apathy:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The Rubric&lt;/h2&gt;
&lt;p&gt;Find my rubric below.  Subtract one point (from 100) for each transgression.  I explain my rationale in detail in the following sections.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Meme Adjectives&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;meticulous, intricate, transformative&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty Phrases&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;In today's digital landscape&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsubstantiated Grandiosity&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;is crucial, is critical, is of paramount importance&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cliches&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Acme Corp. is pleased to present&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adverbs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;to share and decipher data seamlessly is paramount&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passive Voice/ Ambiguous Subject&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenAI was founded in 2015&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The Verb &lt;strong&gt;To Be&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenDaylight is a Software Defined Network (SDN) Controller.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Take, for example, the following prose:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Acme Corp is pleased to present our new packet tech. In the intricate world of digital communication, the ability to label and prioritize data is paramount, especially when it comes to real-time data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The above prose loses five points in two sentences.  It includes a cliche, an empty phrase, a meme adjective, an instance of unsubstantiated grandiosity, and an adverb.&lt;/p&gt;
&lt;h2&gt;Meme Adjectives&lt;/h2&gt;
&lt;p&gt;ChatGPT loves certain words.  After ChatGPT's public release, Liang discovered surges of up to 35-fold in the frequency of certain words.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We find a significant shift in the frequency of certain tokens in ICLR 2024, with adjectives such as “commendable”, “meticulous”, and “intricate” showing 9.8, 34.7, and 11.2-fold increases in probability of occurring in a sentence. [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I took the following chart from Liang's paper, and give all credit to his team.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that shows hockey stick graphs of the 10x explosion of certain meme adjectives" src="https://john.soban.ski/images/Far/01_Meme_Words.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I consider the Meme adjectives &lt;strong&gt;meticulous&lt;/strong&gt;, &lt;strong&gt;intricate&lt;/strong&gt; and &lt;strong&gt;commendable&lt;/strong&gt; unnecessary (attempts at) flourishes.  They do not contribute to the information density of prose.&lt;/p&gt;
&lt;p&gt;Page 15 of &lt;strong&gt;Monitoring AI-Modified Content at Scale&lt;/strong&gt; provides a &lt;strong&gt;word cloud&lt;/strong&gt; of Meme Adjectives.  Again, credit to Liang and his team.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that depicts a word cloud of LLM Meme Adjectives" src="https://john.soban.ski/images/Far/02_Word_Cloud.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I propose we penalize authors that use these words.&lt;/p&gt;
&lt;h2&gt;Passive Voice/ Ambiguous Subject&lt;/h2&gt;
&lt;p&gt;In just a few months since I published my &lt;a href="https://john.soban.ski/beat-ai.html"&gt;original thoughts on this subject&lt;/a&gt;, LLMs improved their ability to remove passive voice.  Simple prompt commands remove 80% of passive voice.&lt;/p&gt;
&lt;p&gt;For example, consider the first reply from ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The OpenDaylight Project &lt;strong&gt;was founded&lt;/strong&gt; in April 2013 and is hosted by the Linux Foundation. OpenDaylight &lt;strong&gt;is structured&lt;/strong&gt; as a modular, pluggable, and flexible controller. It &lt;strong&gt;can be deployed&lt;/strong&gt; on any hardware and operating system platform that supports Java. The platform &lt;strong&gt;is designed&lt;/strong&gt; to separate...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response includes myriad instances of passive voice.  I ask ChatGPT to remove passive voice and it returns:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Founded in April 2013 and hosted by the Linux Foundation, OpenDaylight features a modular, pluggable, and flexible controller. This platform &lt;strong&gt;is deployable&lt;/strong&gt; on any hardware and operating system platform supporting Java. OpenDaylight's design separates the control plane...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response still includes passive voice.  It identifies the founder but does not identify the designer.  I ask it to&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Rewrite the above to Subject-verb object form, removing all forms of &lt;strong&gt;to be&lt;/strong&gt; and all passive voice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The OpenDaylight Project, an open-source platform for Software-Defined Networking (SDN), accelerates SDN adoption and promotes... Founded in April 2013 and hosted by the Linux Foundation, OpenDaylight features a modular, pluggable, and flexible controller. This platform, &lt;strong&gt;deployable on any hardware&lt;/strong&gt; and operating system platform supporting Java, separates...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I find this rule the most difficult to address.  ChatGPT removes all forms of the verb &lt;strong&gt;to be&lt;/strong&gt; from its response and removes passive voice.  It still includes &lt;strong&gt;ambiguous subjects&lt;/strong&gt;; for example, the reader will not know who would deploy the platform.&lt;/p&gt;
&lt;p&gt;This rule will only catch the most egregious LLM use, and prompt engineers too lazy to ask the bot to refine the output.&lt;/p&gt;
&lt;h2&gt;Empty Phrases&lt;/h2&gt;
&lt;p&gt;ChatGPT peppers output with phrases that sound impressive at first read, but prove hollow and meaningless upon further inspection.&lt;/p&gt;
&lt;p&gt;Take, for example, the phrase &lt;strong&gt;in today's digital landscape&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;A &lt;a href="https://kagi.com/"&gt;Kagi Search&lt;/a&gt; (Non-affiliate link) of &lt;strong&gt;Digital Landscape&lt;/strong&gt; returns myriad definitions.  The domains of advertising, online games, and communications infrastructure policy all provide separate definitions.  &lt;/p&gt;
&lt;p&gt;ChatGPT should define &lt;strong&gt;digital landscape&lt;/strong&gt; to remove ambiguity, but I have not seen it do so prior to use.&lt;/p&gt;
&lt;p&gt;I collect several of these &lt;strong&gt;empty phrases&lt;/strong&gt; in my &lt;a href="https://john.soban.ski/beat-ai.html"&gt;Beat AI&lt;/a&gt; blog post.  They include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In today's &lt;strong&gt;digital landscape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;intricate world&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;this era&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unlocking new horizons&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I recommend compiling a corpus of these phrases and then penalizing authors one point for each instance.&lt;/p&gt;
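&lt;p&gt;A minimal sketch of that idea in Python (the phrase list and scoring scheme here are my illustrative assumptions, not a standard tool):&lt;/p&gt;

```python
import re

# Hand-compiled corpus of empty phrases; extend as you collect more
EMPTY_PHRASES = ["in today's digital landscape",
                 "in the intricate world of",
                 "in this era of",
                 "unlocking new horizons"]

def score(text, phrases=EMPTY_PHRASES, start=100):
    """Subtract one point per case-insensitive phrase occurrence."""
    lowered = text.lower()
    hits = sum(len(re.findall(re.escape(p), lowered)) for p in phrases)
    return start - hits

sample = ("In today's digital landscape, Acme unlocks value.  "
          "In the intricate world of networking, uptime matters.")
print(score(sample))  # two transgressions, so 98
```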
&lt;h2&gt;Cliches&lt;/h2&gt;
&lt;p&gt;Cliches add no information to text, just noise.  They offend Readers' intelligence and demonstrate a Writer's lack of respect for their Readers' time.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tackle tough problems&lt;/li&gt;
&lt;li&gt;Best of breed&lt;/li&gt;
&lt;li&gt;Hit the ground running&lt;/li&gt;
&lt;li&gt;Low-hanging fruit&lt;/li&gt;
&lt;li&gt;Long pole in the tent&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Unsubstantiated Grandiosity&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt; prioritizes showy prose over the conveyance of useful information.  Writers must convey useful information in tight, considerate prose.  I've collected examples of LLM-produced &lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt; in my &lt;a href="https://john.soban.ski/beat-ai.html"&gt;prior blog post&lt;/a&gt;.  Some include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Infrastructure as Code (IaC) is a &lt;strong&gt;paradigm-shifting&lt;/strong&gt; approach&lt;/li&gt;
&lt;li&gt;Businesses can achieve &lt;strong&gt;unprecedented agility&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;...making them &lt;strong&gt;indispensable&lt;/strong&gt; for decision-making&lt;/li&gt;
&lt;li&gt;Graphics Processing Units (GPUs) have &lt;strong&gt;transcended&lt;/strong&gt; their original role...&lt;/li&gt;
&lt;li&gt;Digital advertising holds &lt;strong&gt;immense significance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An author must provide objective, well-cited, and appropriate evidence to build a case for a particular argument.  They must &lt;strong&gt;do the work&lt;/strong&gt; for their reader.  Only after presenting substantial evidence can a writer tout a technology's importance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt; provides a shortcut for lazy writers; it attempts (and fails) to replace diligent research and evidence, and it shifts the cognitive load onto the reader.&lt;/p&gt;
&lt;h2&gt;Adverbs&lt;/h2&gt;
&lt;p&gt;I put &lt;strong&gt;Adverbs&lt;/strong&gt; in the category of &lt;strong&gt;lazy shortcuts&lt;/strong&gt;.  Thorough research and clear communication of relevant facts obviate the need for adverbs.  For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;increasingly&lt;/strong&gt; important part of our global economy&lt;/li&gt;
&lt;li&gt;Embracing and &lt;strong&gt;diligently&lt;/strong&gt; implementing security frameworks&lt;/li&gt;
&lt;li&gt;We &lt;strong&gt;seamlessly&lt;/strong&gt; integrate different data sources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Require your authors, for example, to describe the processes and experience that would indicate &lt;strong&gt;seamless&lt;/strong&gt; integration.  Without proof, the adverb &lt;strong&gt;seamlessly&lt;/strong&gt; rings hollow and subjective.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;My Rubric provides a first-pass filter to penalize LLM use in Government Acquisition and Academic writing.  It will only catch the laziest of writers.  If AI companies continue to follow observed trendlines, they will improve the proportion of relevant, objective, and fact-based content in LLM output.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a shocked robot at the helm of a computer" src="https://john.soban.ski/images/Beat_Ai/01_Shocked_Robot.png"&gt;&lt;/p&gt;
&lt;p&gt;In my last blog post, I recommend that readers demand prose in the form of &lt;strong&gt;subject verb object&lt;/strong&gt;.  I also recommend the removal of all forms of the verb &lt;strong&gt;to be&lt;/strong&gt;.  See that post for &lt;a href="https://john.soban.ski/beat-ai.html"&gt;justification and examples&lt;/a&gt;.  Since then, LLMs have proved adept at accommodating these requests.&lt;/p&gt;
&lt;p&gt;I considered recommending that authors cite journals and Academic papers to foil LLM usage, since earlier LLMs failed at this task.  As of July 2024, however, Grok AI shows promise on this front.&lt;/p&gt;
&lt;p&gt;I use the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write two sentences about how we can detect LLM-generated code.  Cite specific authors and quotes from current Academic journals.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Grok replies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;According to the research by Wu et al. in their paper titled "LLMDet: A Third Party Large Language Models Generated Text Detection Tool," published in the 2023 Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), a text detection tool called LLMDet can distinguish between different large language models and human-generated text, providing a specific probability for each source [1]. This tool represents a significant advancement in the field of detecting AI-generated content.   &lt;/p&gt;
&lt;p&gt;Additionally, a paper titled "Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector" by Wu et al. published in 2024 on arXiv highlights the challenge of detecting synthetic code generated by Large Language Models (LLMs) and proposes a sampling-based watermarking method for identifying machine-generated text, addressing the growing concern over the misuse of LLM-generated code in both educational and industrial domains [3].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Grok did not cite any direct quotes, so the prompt engineer must do some manual work to find one.&lt;/p&gt;
&lt;p&gt;Grok also did not provide the full details of the sources.  A follow-on prompt, however, retrieves them.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cite the second paper's source in MLS format for a bibliography&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wu, L., Wu, X., &amp;amp; Wu, Y. (2024). Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector. arXiv preprint arXiv:2402.01234.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I believe LLMs will evolve past detection.&lt;/p&gt;
&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Liang, P., Wu, L., Wu, X., &amp;amp; Wu, Y. (2024). Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Association for Computational Linguistics.&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="NLP"></category><category term="Machine Learning"></category></entry><entry><title>Use Python Pandas to Invest in Sealed Super Mario Bros. 3</title><link href="https://john.soban.ski/smb3.html" rel="alternate"></link><published>2023-12-30T00:26:00-05:00</published><updated>2023-12-30T00:26:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-12-30:/smb3.html</id><summary type="html">&lt;p&gt;Sealed and Graded Video Game Collecting skyrocketed in popularity over the past decade.  I joined the hobby in 2020 and stuck through the boom times of 2021 and the recent crash of 2023.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Despite the recent crash, sealed video games do provide organic collectability.  In contrast to Image Comics …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Sealed and Graded Video Game Collecting skyrocketed in popularity over the past decade.  I joined the hobby in 2020 and stuck through the boom times of 2021 and the recent crash of 2023.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Despite the recent crash, sealed video games do provide organic collectability.  In contrast to Image Comics (&lt;em&gt;Gold Editions!  Holo-Foil Covers!&lt;/em&gt;) and Star Wars Power of the Force (&lt;em&gt;Green Cardboard!  Brown Vest Luke!&lt;/em&gt;) action figures, no one (that I knew of) thought to preserve outdated, legacy video games in the 1990s.  &lt;/p&gt;
&lt;p&gt;In 1993, for example, I traded about $10 worth of (completely legal) fireworks for the 8-Bit Nintendo Entertainment System (NES) releases of Wrestlemania, Mega Man, and (IIRC) Jackal.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SMB3 Sprites" src="https://john.soban.ski/images/Smb3/00_Smb_Splash.png"&gt;&lt;/p&gt;
&lt;p&gt;Nobody (except a handful of weirdos) kept sealed copies of NES, Genesis, or SNES games.  If I got a game, I opened it and played it, end of story.  Today, the &lt;strong&gt;supply&lt;/strong&gt; of conventional collectibles (comics, sports cards, action figures) dwarfs the &lt;strong&gt;supply&lt;/strong&gt; of sealed video games.&lt;/p&gt;
&lt;p&gt;On the &lt;strong&gt;demand&lt;/strong&gt; side, sealed Video games, like the NFT market, appear to follow a &lt;strong&gt;winner take all&lt;/strong&gt; approach.  The popular, or &lt;strong&gt;brand name&lt;/strong&gt; games sell at multiples of less popular games, with no regard for supply.  You can, for example, buy certain &lt;strong&gt;pop one&lt;/strong&gt; (only one known sealed game in the market) games on Heritage for a &lt;a href="https://comics.ha.com/itm/video-games/sega/centurion-defender-of-rome-wata-94-b-sealed-genesis-electronic-arts-1991-usa/a/312339-69059.s?ic4=GalleryView-ShortDescription-071515"&gt;little under $200 at auction&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DISCLAIMER: I base the information on this blog on my personal opinion and experience, and you MUST not consider it professional financial investment advice. Do not ever act on my opinions without first assessing your own personal and financial situation, and you MUST consult a financial professional before making any investment. Keep in mind that I will change my thoughts and opinions over time as I learn and accumulate more knowledge. I am NOT a financial professional! This blog is not a place for giving or receiving financial advice, advice concerning investment decisions, or tax or legal advice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Investment Grade&lt;/h2&gt;
&lt;p&gt;Today we analyze the collectability of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt; for the NES.&lt;/p&gt;
&lt;p&gt;Shawn Surmick from &lt;a href="https://www.youtube.com/@ReservedInvestments"&gt;Reserved Investments&lt;/a&gt; taught me the idea of an &lt;strong&gt;Investment Grade&lt;/strong&gt; collectible, a collectible in the 85th percentile of the population.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Investment-grade collectibles sit at the top 15% of the pack.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The CGC census, for example, records 623 graded copies of the 1962 issue of Green Lantern #16 (First Star Sapphire).  623 times 15% yields 93.45.  If you add the quantities for each universal grade, you will find that fewer than 94 copies of this comic hold a grade greater than 8.5.  For that reason, an investor can consider any copy of Green Lantern #16 (1962) with a grade of 8.5 or higher &lt;strong&gt;Investment Grade&lt;/strong&gt;.&lt;/p&gt;
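&lt;p&gt;The cutoff arithmetic above reduces to a short Python sketch.  The census numbers below are hypothetical, chosen only so the counts sum to 623:&lt;/p&gt;

```python
def investment_grade_cutoff(census, pct=0.15):
    """Return the lowest grade that still falls within the top pct of the census.

    census: dict mapping grade to graded population count.
    """
    threshold = sum(census.values()) * pct  # e.g. 623 * 0.15 = 93.45
    running = 0
    cutoff = None
    for grade in sorted(census, reverse=True):  # walk from the best grade down
        running += census[grade]
        if running > threshold:
            break
        cutoff = grade
    return cutoff

# Hypothetical census of 623 graded copies (illustrative numbers only)
census = {9.8: 5, 9.6: 20, 9.4: 40, 9.2: 60, 8.5: 80, 8.0: 120, 7.0: 298}
print(investment_grade_cutoff(census))  # 9.4
```

&lt;p&gt;With these made-up counts, only copies graded 9.4 and up land in the top 15%.&lt;/p&gt;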
&lt;p&gt;A glance at the census for &lt;strong&gt;Super Mario 3&lt;/strong&gt;, however, indicates a need for a more complicated analysis.&lt;/p&gt;
&lt;h2&gt;Import and Clean the Data&lt;/h2&gt;
&lt;p&gt;We use the Python Pandas package for our analysis, and Python Seaborn fuels the graphics.&lt;/p&gt;
&lt;p&gt;We use data from Larry's &lt;a href="https://gamerstonks.com/"&gt;GamerStonks&lt;/a&gt; (non-affiliate link) database.  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you want to collect sealed video games, then I recommend you pay for a subscription to Gamerstonks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We load the libraries and import the data from a Comma Separated Value (CSV) spreadsheet into a Pandas DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;smb3.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DataFrame includes the &lt;strong&gt;Grader&lt;/strong&gt; (WATA, CGC), &lt;strong&gt;Box&lt;/strong&gt; Grade, &lt;strong&gt;Seal&lt;/strong&gt; Grade, &lt;strong&gt;Variant&lt;/strong&gt;, Purchase &lt;strong&gt;Price&lt;/strong&gt;, &lt;strong&gt;Auction&lt;/strong&gt; House (Goldin, Heritage, Certified Link), and the &lt;strong&gt;Date&lt;/strong&gt; Sold.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Grader&lt;span class="w"&gt; &lt;/span&gt;Box&lt;span class="w"&gt; &lt;/span&gt;Seal&lt;span class="w"&gt; &lt;/span&gt;Variant&lt;span class="w"&gt;                                            &lt;/span&gt;Price&lt;span class="w"&gt;  &lt;/span&gt;Auction&lt;span class="w"&gt;           &lt;/span&gt;Date
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ...  &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;,880 Heritage Auctions 11/30/23&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.4 A    *Made in Japan, Oval SOQ R - &amp;quot;&lt;/span&gt;USA,&lt;span class="w"&gt; &lt;/span&gt;Canada&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;...&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;,160&lt;span class="w"&gt; &lt;/span&gt;Heritage&lt;span class="w"&gt; &lt;/span&gt;Auctions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;/30/23
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ...  &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;,159 Goldin            11/18/23&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.6 A+   *Made in Japan, NFR (Challenge Set)                &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;,160 Heritage Auctions 11/04/23&lt;/span&gt;
&lt;span class="s2"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We need to &lt;strong&gt;encode&lt;/strong&gt; the data into the correct &lt;strong&gt;type&lt;/strong&gt;, for example, convert &lt;strong&gt;Price&lt;/strong&gt; from &lt;strong&gt;String&lt;/strong&gt; to &lt;strong&gt;Numeric&lt;/strong&gt;, and &lt;strong&gt;Date&lt;/strong&gt; from &lt;strong&gt;String&lt;/strong&gt; to &lt;strong&gt;DateTime&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;$&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Seal&lt;/strong&gt; ratings, while of type &lt;strong&gt;String&lt;/strong&gt;, represent a scale of increasing preference.  We use Pandas to encode &lt;strong&gt;Seal&lt;/strong&gt; into an &lt;strong&gt;Ordered Categorical&lt;/strong&gt; type.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NS&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A++&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Python indicates the hierarchy:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Categories&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NS&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C+&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B+&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A+&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A++&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Can we Use Box Grade for our Analysis?&lt;/h2&gt;
&lt;p&gt;CGC provides a single numeric score for comic quality.  For video games, both CGC and WATA provide a numeric &lt;strong&gt;Box&lt;/strong&gt; grade plus a separate &lt;strong&gt;Seal&lt;/strong&gt; grade.&lt;/p&gt;
&lt;p&gt;In the &lt;strong&gt;Green Lantern&lt;/strong&gt; example above, we use the &lt;strong&gt;CGC Universal Grade&lt;/strong&gt; to stack rank the comics and identify the &lt;strong&gt;investment grade&lt;/strong&gt;.  Can we use the WATA (or CGC) Sealed Video Game &lt;strong&gt;Box&lt;/strong&gt; Grade to identify the &lt;strong&gt;investment grade&lt;/strong&gt; for Super Mario 3 (SMB3)?&lt;/p&gt;
&lt;p&gt;Consider the summary statistics for the &lt;strong&gt;Price&lt;/strong&gt; of the &lt;strong&gt;9.6&lt;/strong&gt; Box grade:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.6&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Python dumps a ton of data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;count&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;28&lt;/span&gt;.000000
mean&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;11526&lt;/span&gt;.428571
std&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;8659&lt;/span&gt;.504825
min&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;2160&lt;/span&gt;.000000
&lt;span class="m"&gt;25&lt;/span&gt;%&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;5220&lt;/span&gt;.000000
&lt;span class="m"&gt;50&lt;/span&gt;%&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;8400&lt;/span&gt;.000000
&lt;span class="m"&gt;75&lt;/span&gt;%&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;16350&lt;/span&gt;.000000
max&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;33600&lt;/span&gt;.000000
Name:&lt;span class="w"&gt; &lt;/span&gt;Price,&lt;span class="w"&gt; &lt;/span&gt;dtype:&lt;span class="w"&gt; &lt;/span&gt;float64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For the twenty-eight (28) copies of &lt;strong&gt;SMB3&lt;/strong&gt; graded &lt;strong&gt;9.6&lt;/strong&gt;, we see a high sale price of $33.60k, a low of $2.16k, a median of $8.4k, and so on.&lt;/p&gt;
&lt;p&gt;I now present to you a &lt;a href="https://john.soban.ski/analytics-cheat-sheet.html"&gt;Box Plot&lt;/a&gt;.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Box Plot captures the information from the table in graphical form.&lt;/p&gt;
&lt;p&gt;&lt;img alt="9.6 Boxplot" src="https://john.soban.ski/images/Smb3/01_One_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;The box spans the First ($5.2k) and Third ($16.35k) quartiles, and the whiskers extend to the most extreme data points that lie within 1.5 times the Interquartile Range (IQR) of the box (top and bottom).  The little diamond marks the outliers, in this case the max of &lt;strong&gt;$33.6k&lt;/strong&gt;.&lt;/p&gt;
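&lt;p&gt;We can check the whisker arithmetic by hand, plugging in the quartiles from the describe() output above:&lt;/p&gt;

```python
# Fence arithmetic behind the box plot, using the quartiles from describe()
q1, q3 = 5220.0, 16350.0
iqr = q3 - q1                     # 11130.0
upper_fence = q3 + 1.5 * iqr      # 33045.0
print(33600.0 > upper_fence)      # the $33.6k max exceeds the fence
```

&lt;p&gt;Since $33.6k exceeds the $33,045 upper fence, Seaborn draws it as an outlier diamond rather than extending the whisker to it.&lt;/p&gt;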
&lt;p&gt;We plot the distribution for each of the recorded &lt;strong&gt;Box&lt;/strong&gt; Grades:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;p&gt;&lt;img alt="All grades Boxplot" src="https://john.soban.ski/images/Smb3/02_All_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;The red horizontal line captures the lowest price paid for a 9.8 &lt;strong&gt;Box&lt;/strong&gt; grade.  &lt;/p&gt;
&lt;p&gt;The graph illustrates that certain instances of &lt;strong&gt;Box&lt;/strong&gt; grades with scores of 9.6, 9.4, 9.2, 9.0, and even 5.0 (!) sold for more than the minimum 9.8 price.&lt;/p&gt;
&lt;p&gt;A quick calculation illustrates that &lt;strong&gt;23%&lt;/strong&gt; of all copies graded &lt;strong&gt;less than 9.8&lt;/strong&gt; sold for &lt;strong&gt;more than the 9.8 minimum&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="mf"&gt;0.23076923076923078&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Based on this discovery, we cannot use &lt;strong&gt;Box&lt;/strong&gt; grade alone to identify the &lt;strong&gt;top 15% of SMB3&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Data Enrichment&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/tag/influxdb.html"&gt;InfluxDB&lt;/a&gt; uses the terms Tags and Measurements for Categorical and Continuous variables.  Tags allow extra dimensions in Data Visualization.  &lt;/p&gt;
&lt;p&gt;We first create &lt;strong&gt;Tags&lt;/strong&gt; for &lt;strong&gt;Year&lt;/strong&gt; and &lt;strong&gt;Quarter&lt;/strong&gt;.  These provide buckets for aggregations.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Quarter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PeriodIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Q&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;%Y-0%q&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We also want to improve the readability of the &lt;strong&gt;Variant&lt;/strong&gt; feature.&lt;/p&gt;
&lt;p&gt;The Original Data Set uses WATA and CGC notes for the &lt;strong&gt;Variant&lt;/strong&gt; column:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;*Made in Japan, NFR (Challenge Set)&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA and Canada&amp;quot;&amp;quot; Text&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ TM - Left Bros.&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ TM - Right Bros.&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA, Canada and Mexico&amp;quot;&amp;quot; Text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To simplify the analysis, we shorten these variants to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NFR&lt;/li&gt;
&lt;li&gt;CAN&lt;/li&gt;
&lt;li&gt;MEX&lt;/li&gt;
&lt;li&gt;LEFT &lt;/li&gt;
&lt;li&gt;RIGHT &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We create a quick lookup &lt;strong&gt;DataFrame&lt;/strong&gt; and merge it into our working &lt;strong&gt;DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;tag_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Variant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, NFR (Challenge Set)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ TM - Left Bros.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ TM - Right Bros.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA, Canada and Mexico&amp;quot;&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA and Canada&amp;quot;&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;MEX&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NFR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CAN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RIGHT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;MEX&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CAN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

 &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Variant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This results in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Grader&lt;span class="w"&gt; &lt;/span&gt;Box&lt;span class="w"&gt; &lt;/span&gt;Seal&lt;span class="w"&gt; &lt;/span&gt;Variant&lt;span class="w"&gt;                                           &lt;/span&gt;Price&lt;span class="w"&gt;  &lt;/span&gt;Auction&lt;span class="w"&gt;           &lt;/span&gt;Date&lt;span class="w"&gt;       &lt;/span&gt;Year&lt;span class="w"&gt; &lt;/span&gt;Quarter&lt;span class="w"&gt; &lt;/span&gt;Var_Tag
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ... 2880.0 Heritage Auctions 2023-11-30 2023 2023-04 MEX&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.4 A    *Made in Japan, Oval SOQ R - &amp;quot;&lt;/span&gt;USA,&lt;span class="w"&gt; &lt;/span&gt;Canada&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;...&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2160&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;Heritage&lt;span class="w"&gt; &lt;/span&gt;Auctions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2023&lt;/span&gt;-11-30&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2023&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2023&lt;/span&gt;-04&lt;span class="w"&gt; &lt;/span&gt;MEX
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ... 1159.0 Goldin            2023-11-18 2023 2023-04 MEX&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.6 A+   *Made in Japan, NFR (Challenge Set)               2160.0 Heritage Auctions 2023-11-04 2023 2023-04 NFR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
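A left merge silently leaves NaN in Var_Tag for any Variant string absent from the lookup table, so it pays to count the misses. A quick check on toy rows (hypothetical data, not the real data set):

```python
import pandas as pd

# Toy frames standing in for the working df and tag_df.
toy = pd.DataFrame({"Variant": ["*Made in Japan, NFR (Challenge Set)",
                                "Some unmapped variant text"]})
toy_tags = pd.DataFrame({"Variant": ["*Made in Japan, NFR (Challenge Set)"],
                         "Var_Tag": ["NFR"]})

merged = toy.merge(toy_tags, on="Variant", how="left")
# Rows whose Variant string failed to match the lookup table get NaN.
misses = merged["Var_Tag"].isna().sum()
print(misses)  # -> 1
```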

&lt;p&gt;We drop the noisy and unused columns:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Grader&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Variant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Auction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I synthesize a Categorical &lt;strong&gt;Price_Tag&lt;/strong&gt; from the Numerical &lt;strong&gt;Price&lt;/strong&gt; column.  This allows us to visualize prices in four buckets:  &lt;strong&gt;Lowest, Low, High&lt;/strong&gt; and &lt;strong&gt;Highest&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qcut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Lowest&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Low&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;High&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Highest&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
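To see where qcut draws the four bucket boundaries, pass retbins=True. A sketch with hypothetical prices:

```python
import pandas as pd

# Toy prices; retbins=True also returns the quartile edges qcut chose,
# which documents what 'Lowest' through 'Highest' actually mean.
prices = pd.Series([100, 200, 400, 800, 1600, 3200, 6400, 12800])
tags, bins = pd.qcut(prices, q=4,
                     labels=["Lowest", "Low", "High", "Highest"],
                     retbins=True)
print(tags.value_counts().to_dict())  # equal-sized buckets of 2
print(bins)                           # the five bucket edges
```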

&lt;p&gt;The new &lt;strong&gt;Price_Tag&lt;/strong&gt; feature makes it quick and easy to plot three variables at once.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Displot&lt;/strong&gt; renders a &lt;strong&gt;heat map&lt;/strong&gt; that relates the &lt;strong&gt;Box&lt;/strong&gt; grade, the &lt;strong&gt;Seal&lt;/strong&gt; grade and the &lt;strong&gt;Price&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Heatmap Box Grade, Seal Grade and Price" src="https://john.soban.ski/images/Smb3/03_Dis_Plot.png"&gt;&lt;/p&gt;
&lt;p&gt;High &lt;strong&gt;Box&lt;/strong&gt; grades cluster around A+ &lt;strong&gt;Seal&lt;/strong&gt; grades, and low &lt;strong&gt;Box&lt;/strong&gt; grades cluster around the B+ &lt;strong&gt;Seal&lt;/strong&gt; grade.  &lt;strong&gt;Price&lt;/strong&gt; (Highest = Red) correlates with high &lt;strong&gt;Box&lt;/strong&gt; and &lt;strong&gt;Seal&lt;/strong&gt; Grades.&lt;/p&gt;
&lt;p&gt;Notice that the price for an 8.5 &lt;strong&gt;Box&lt;/strong&gt; with an A &lt;strong&gt;Seal&lt;/strong&gt; registers higher than the price for an 8.5 &lt;strong&gt;Box&lt;/strong&gt; with an A+ &lt;strong&gt;Seal&lt;/strong&gt; (Green vs. Orange).&lt;/p&gt;
&lt;h2&gt;Date and Variant&amp;#39;s Effect on Price&lt;/h2&gt;
&lt;p&gt;The sealed, graded Video Game collectible market has both spiked and crashed since 2021.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The next chart shows the mean price over the years, with &lt;strong&gt;error bands&lt;/strong&gt; that depict variation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Smoothed out Price x Date" src="https://john.soban.ski/images/Smb3/04_Rough_Date.png"&gt;&lt;/p&gt;
&lt;p&gt;We see a peak in 2021.&lt;/p&gt;
&lt;p&gt;Look at the sales data in terms of &lt;strong&gt;Box&lt;/strong&gt; Grade and &lt;strong&gt;Date&lt;/strong&gt; Sold:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Date&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The highest sales cluster around the high &lt;strong&gt;Box&lt;/strong&gt; Grades and 2021.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Kernel Density Estimation Plot" src="https://john.soban.ski/images/Smb3/05_Kde_Sales.png"&gt;&lt;/p&gt;
&lt;p&gt;Along with the purchase &lt;strong&gt;Date&lt;/strong&gt;, the &lt;strong&gt;Variant&lt;/strong&gt; drives &lt;strong&gt;Price&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Date and Variants" src="https://john.soban.ski/images/Smb3/06_Date_Var.png"&gt;&lt;/p&gt;
&lt;p&gt;If we eyeball the chart, we see that the &lt;strong&gt;Left&lt;/strong&gt; Bros. &lt;strong&gt;Variant&lt;/strong&gt; trumps the &lt;strong&gt;Canada&lt;/strong&gt; (No Mexico) Variant, which in turn trumps the &lt;strong&gt;Right&lt;/strong&gt; Bros. and &lt;strong&gt;Mexico&lt;/strong&gt; variants.  &lt;strong&gt;NFR&lt;/strong&gt; sits at the bottom, which makes sense, because the seal contains the text &lt;strong&gt;Not for Resale&lt;/strong&gt;, which obscures the box art.&lt;/p&gt;
&lt;p&gt;We will rank the &lt;strong&gt;Variant&lt;/strong&gt; feature, therefore, in this order:  &lt;strong&gt;NFR &amp;lt; MEX &amp;lt; RIGHT &amp;lt; CAN &amp;lt; LEFT&lt;/strong&gt; &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NFR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;MEX&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RIGHT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CAN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
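Because the Categorical carries ordered=True, pandas sorts and compares the tags by rank rather than alphabetically. A small sketch with made-up rows:

```python
import pandas as pd

# Ordered categories rank NFR lowest and LEFT highest, so sorting and
# comparisons follow the collector ranking, not string order.
order = ["NFR", "MEX", "RIGHT", "CAN", "LEFT"]
s = pd.Series(pd.Categorical(["CAN", "LEFT", "NFR"],
                             categories=order, ordered=True))
print(s.sort_values().tolist())  # -> ['NFR', 'CAN', 'LEFT']
print((s > "MEX").tolist())      # -> [True, True, False]
```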

&lt;p&gt;We use this new &lt;strong&gt;Categorical&lt;/strong&gt; ranking to plot &lt;strong&gt;Box&lt;/strong&gt; Grade vs. Sale &lt;strong&gt;Price&lt;/strong&gt; vs. &lt;strong&gt;Var_Tag&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;9.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boxplot of Variants" src="https://john.soban.ski/images/Smb3/07_Boxplot_Var.png"&gt;&lt;/p&gt;
&lt;p&gt;Note that the (Purple) &lt;strong&gt;LEFT&lt;/strong&gt; Bros. variant outsells even higher-graded copies of the other variants.&lt;/p&gt;
&lt;p&gt;We need to pay attention to &lt;strong&gt;Date&lt;/strong&gt; and &lt;strong&gt;Variant&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Quick Aside:  Numerical analysis&lt;/h2&gt;
&lt;p&gt;Pandas provides tools to convert &lt;strong&gt;Tags&lt;/strong&gt; to &lt;strong&gt;Measurements&lt;/strong&gt; (&lt;strong&gt;Categorical&lt;/strong&gt; to &lt;strong&gt;Numeric&lt;/strong&gt;).  Machine Learning, for example, requires normalized numeric data.&lt;/p&gt;
&lt;p&gt;We create a numeric version of our &lt;strong&gt;DataFrame&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_num&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_num&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
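Note that factorize(sort=True) assigns integer codes in lexicographic label order, which may not match grade quality order, so it deserves a spot check. A small sketch with hypothetical Seal grades:

```python
import pandas as pd

# factorize(sort=True) codes labels in sorted string order; the second
# return value records which label each integer code stands for.
codes, uniques = pd.factorize(pd.Series(["B+", "A", "A+", "A"]), sort=True)
print(list(codes))    # -> [2, 0, 1, 0]
print(list(uniques))  # -> ['A', 'A+', 'B+']
```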

&lt;p&gt;I input this numeric &lt;strong&gt;DataFrame&lt;/strong&gt; into a &lt;a href="https://john.soban.ski/analytics-cheat-sheet.html"&gt;Correlation Graphing Function&lt;/a&gt; to produce a Heat Map:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Correlation Heatmap" src="https://john.soban.ski/images/Smb3/08_Corr_Heatmap.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Numeric&lt;/strong&gt; encoding of &lt;strong&gt;Seal&lt;/strong&gt; (NS, C+, B+, A, A++) allows us to use a &lt;strong&gt;Kernel Density Estimation&lt;/strong&gt; plot for &lt;strong&gt;Box&lt;/strong&gt; vs. &lt;strong&gt;Seal&lt;/strong&gt; vs. &lt;strong&gt;Price&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Box&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="KDE Plot" src="https://john.soban.ski/images/Smb3/09_Kde_Seal.png"&gt;&lt;/p&gt;
&lt;h2&gt;Normalize Price Over Time&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; lets us stack rank the different &lt;strong&gt;Variants&lt;/strong&gt; and &lt;strong&gt;Box&lt;/strong&gt; Grades of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The Sale &lt;strong&gt;Date&lt;/strong&gt; variable also impacts the Sale &lt;strong&gt;Price&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;We will remove (or at least mitigate) the effect of Sale &lt;strong&gt;Date&lt;/strong&gt; on our price data.&lt;/p&gt;
&lt;p&gt;We can choose from dozens of approaches.  I use the following to remove the impact of &lt;strong&gt;Date&lt;/strong&gt; on the &lt;strong&gt;Price&lt;/strong&gt; data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Calculate the &lt;strong&gt;Mean_Price&lt;/strong&gt; per Quarter&lt;/li&gt;
&lt;li&gt;Normalize each Sale &lt;strong&gt;Price&lt;/strong&gt; by its Quarter's &lt;strong&gt;Mean_Price&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
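The two steps above collapse into a groupby-transform one-liner, sketched here on toy rows rather than the real sales DataFrame:

```python
import pandas as pd

# transform('mean') broadcasts each Quarter's mean Price back onto
# its own rows, so the division normalizes in a single pass.
toy_df = pd.DataFrame({"Quarter": ["2021-01", "2021-01", "2021-02"],
                       "Price": [100.0, 300.0, 500.0]})
toy_df["Norm_Price"] = (toy_df["Price"]
                        / toy_df.groupby("Quarter")["Price"].transform("mean"))
print(toy_df["Norm_Price"].tolist())  # -> [0.5, 1.5, 1.0]
```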
&lt;p&gt;I first calculate the &lt;strong&gt;Mean_Price&lt;/strong&gt; per Quarter:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;af&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Quarter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


    &lt;span class="n"&gt;Quarter&lt;/span&gt; &lt;span class="n"&gt;Mean_Price&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;1176.000000&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;552.000000&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;1298.571429&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;2534.400000&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;1346.250000&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;11397.333333&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;62800.000000&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;   &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;13878.750000&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;   &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;20340.000000&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;14948.571429&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;11228.571429&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;6408.333333&lt;/span&gt;
&lt;span class="mi"&gt;12&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;6932.000000&lt;/span&gt;
&lt;span class="mi"&gt;13&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;11226.800000&lt;/span&gt;
&lt;span class="mi"&gt;14&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;5524.090909&lt;/span&gt;
&lt;span class="mi"&gt;15&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;9927.428571&lt;/span&gt;
&lt;span class="mi"&gt;16&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;3001.333333&lt;/span&gt;
&lt;span class="mi"&gt;17&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;8980.000000&lt;/span&gt;
&lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;16562.375000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Mean Per Quarter" src="https://john.soban.ski/images/Smb3/10_Price_Per_Quarter.png"&gt;&lt;/p&gt;
&lt;p&gt;We merge this lookup table with the working &lt;strong&gt;DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Quarter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a feature &lt;strong&gt;Norm_Price&lt;/strong&gt; which records the sale &lt;strong&gt;Price&lt;/strong&gt; in &lt;strong&gt;units of Mean_Price&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calculate the normalized (against time) 85th percentile sale prices.  This gives us the &lt;strong&gt;Investment Grade&lt;/strong&gt; copies of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;investment_grade&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;Box&lt;/span&gt; &lt;span class="n"&gt;Seal&lt;/span&gt;   &lt;span class="n"&gt;Price&lt;/span&gt;    &lt;span class="n"&gt;Date&lt;/span&gt;        &lt;span class="n"&gt;Year&lt;/span&gt;    &lt;span class="n"&gt;Quarter&lt;/span&gt; &lt;span class="n"&gt;Var_Tag&lt;/span&gt; &lt;span class="n"&gt;Price_Tag&lt;/span&gt;   &lt;span class="n"&gt;Mean_Price&lt;/span&gt;  &lt;span class="n"&gt;Norm_Price&lt;/span&gt;
&lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;     &lt;span class="mi"&gt;60000&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;07&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="n"&gt;LEFT&lt;/span&gt;    &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;8980.000000&lt;/span&gt; &lt;span class="mf"&gt;6.681514&lt;/span&gt;
&lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;     &lt;span class="mi"&gt;108000&lt;/span&gt;   &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="n"&gt;LEFT&lt;/span&gt;    &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;16562.37500&lt;/span&gt; &lt;span class="mf"&gt;6.520804&lt;/span&gt;
&lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;    &lt;span class="mi"&gt;60000&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="n"&gt;MEX&lt;/span&gt;     &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;9927.428571&lt;/span&gt; &lt;span class="mf"&gt;6.043861&lt;/span&gt;
&lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;    &lt;span class="mi"&gt;60000&lt;/span&gt;    &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;    &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="n"&gt;MEX&lt;/span&gt;     &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;11226.80000&lt;/span&gt; &lt;span class="mf"&gt;5.344355&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
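&lt;p&gt;The normalization and percentile filter above can be sketched end to end on toy data.  A minimal sketch, assuming pandas; the quarters and prices below are invented for illustration, and &lt;strong&gt;cutoff&lt;/strong&gt; is a hypothetical local name:&lt;/p&gt;

```python
import pandas as pd

# Toy sales data: two quarters with different market levels
# (prices invented for illustration).
df = pd.DataFrame({
    'Quarter': ['2023-03', '2023-03', '2023-04', '2023-04'],
    'Price':   [60000, 9000, 108000, 16000],
})

# Quarterly mean lookup table, merged back onto every sale.
af = df.groupby('Quarter', as_index=False)['Price'].mean()
af = af.rename(columns={'Price': 'Mean_Price'})
df = df.merge(af, on=['Quarter'], how='left')

# Express each sale Price in units of its quarterly Mean_Price.
df['Norm_Price'] = df['Price'] / df['Mean_Price']

# Keep only sales above the 85th percentile of Norm_Price.
cutoff = df['Norm_Price'].quantile(0.85)
investment_grade = df.query('Norm_Price > @cutoff')
```

&lt;p&gt;On this toy frame only the $108,000 sale survives the filter, since normalization strips out the quarter-to-quarter market drift before the percentile cut.&lt;/p&gt;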

&lt;p&gt;A &lt;strong&gt;GROUP BY&lt;/strong&gt; operation summarizes the &lt;strong&gt;Investment Grade&lt;/strong&gt; copies of SMB3 by &lt;strong&gt;Variant&lt;/strong&gt;, &lt;strong&gt;Box Grade&lt;/strong&gt;, and &lt;strong&gt;Seal Grade&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;investment_grade&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;Var_Tag&lt;/span&gt; &lt;span class="n"&gt;Box&lt;/span&gt; &lt;span class="n"&gt;Seal&lt;/span&gt;
&lt;span class="n"&gt;CAN&lt;/span&gt;     &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="n"&gt;LEFT&lt;/span&gt;    &lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="n"&gt;A&lt;/span&gt;
        &lt;span class="mf"&gt;9.2&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;     &lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
        &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="n"&gt;RIGHT&lt;/span&gt;   &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Video Game Collectors drive high demand for sealed copies of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt;.  Nintendo released at least five different &lt;strong&gt;Variants&lt;/strong&gt; of the game.&lt;/p&gt;
&lt;p&gt;Our analysis recommends the following &lt;strong&gt;Investment Grade&lt;/strong&gt; copies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DISCLAIMER: I base the information on this blog on my personal opinion and experience, and you MUST NOT consider it professional financial investment advice. Never act on my opinions without first assessing your own personal and financial situation, and you MUST consult a financial professional before making any investment. Keep in mind that I will change my thoughts and opinions over time as I learn and accumulate more knowledge. I am NOT a financial professional! This blog is not a place for giving or receiving financial advice, advice concerning investment decisions, or tax or legal advice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Left Bros. = &lt;strong&gt;9.0 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Right Bros. = &lt;strong&gt;9.6 A+&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text = &lt;strong&gt;9.8 A+&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text = &lt;strong&gt;9.4 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, NFR (Challenge Set) = &lt;strong&gt;Avoid&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Coda&lt;/h2&gt;
&lt;p&gt;The minimum recommendation for the &lt;strong&gt;Canada&lt;/strong&gt; version seemed high to me. I suspected this resulted from a high &lt;strong&gt;Quarterly Mean&lt;/strong&gt; for that time, so I executed the model with a broader bucket.  I used &lt;strong&gt;Yearly Mean&lt;/strong&gt; instead of &lt;strong&gt;Quarterly Mean&lt;/strong&gt; via:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;afy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;investment_grade&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;investment_grade&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
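&lt;p&gt;The yearly bucket admits a simpler construction: a plain &lt;strong&gt;GROUP BY&lt;/strong&gt; on &lt;strong&gt;Year&lt;/strong&gt; produces the same per-year means as the set_index/resample chain.  A minimal sketch, assuming pandas, with invented dates and prices:&lt;/p&gt;

```python
import pandas as pd

# Toy sales spanning two years (prices invented for illustration).
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-08-05', '2022-10-01',
                            '2023-01-20', '2023-07-27']),
    'Price': [60000, 12000, 9000, 60000],
})
df['Year'] = df['Date'].dt.year

# Yearly mean lookup table: a plain groupby on Year yields the
# same per-year means as the set_index/resample chain.
afy = df.groupby('Year', as_index=False)['Price'].mean()
afy = afy.rename(columns={'Price': 'Mean_Price'})

# Broadcast the yearly mean onto each sale and normalize.
df = df.merge(afy, on=['Year'], how='left')
df['Norm_Price'] = df['Price'] / df['Mean_Price']
```

&lt;p&gt;The resample route matters only when the index must stay a true time series; for a lookup table keyed on &lt;strong&gt;Year&lt;/strong&gt;, the groupby alone suffices.&lt;/p&gt;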

&lt;p&gt;This outputs:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;MEX&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.2&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.4&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;          &lt;/span&gt;A+
&lt;span class="w"&gt;          &lt;/span&gt;A++
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.8&lt;span class="w"&gt; &lt;/span&gt;A++
RIGHT&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;          &lt;/span&gt;A+
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.8&lt;span class="w"&gt; &lt;/span&gt;A+
CAN&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A++
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.8&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;span class="w"&gt;          &lt;/span&gt;A++
LEFT&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.2&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.4&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The updated analysis recommends the following &lt;strong&gt;Investment Grade&lt;/strong&gt; copies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Left Bros. = &lt;strong&gt;9.0 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Right Bros. = &lt;strong&gt;9.6 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text = &lt;strong&gt;9.6 A++&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text = &lt;strong&gt;9.2 A+&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, NFR (Challenge Set) = &lt;strong&gt;Avoid&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I dove into the data and it appears that the &lt;strong&gt;Mexico&lt;/strong&gt; variant sells for multiples of the yearly average.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;CRAZY&lt;/strong&gt; 2021, for example, a humble 9.2 A+ &lt;strong&gt;Mexico&lt;/strong&gt; variant sold for over 1.5x the yearly average of $15k.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Var_Tag&lt;/span&gt;  &lt;span class="n"&gt;Box&lt;/span&gt; &lt;span class="n"&gt;Seal&lt;/span&gt; &lt;span class="n"&gt;Price&lt;/span&gt;   &lt;span class="n"&gt;Year&lt;/span&gt; &lt;span class="n"&gt;Mean_Price&lt;/span&gt;   &lt;span class="n"&gt;Norm_Price&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;60000.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;7.490089&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;60000.0&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mf"&gt;9437.159091&lt;/span&gt;  &lt;span class="mf"&gt;6.357846&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;    &lt;span class="mf"&gt;3600.0&lt;/span&gt;  &lt;span class="mi"&gt;2019&lt;/span&gt; &lt;span class="mf"&gt;1194.000000&lt;/span&gt;  &lt;span class="mf"&gt;3.015075&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;23400.0&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mf"&gt;9437.159091&lt;/span&gt;  &lt;span class="mf"&gt;2.479560&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;    &lt;span class="mf"&gt;2880.0&lt;/span&gt;  &lt;span class="mi"&gt;2019&lt;/span&gt; &lt;span class="mf"&gt;1194.000000&lt;/span&gt;  &lt;span class="mf"&gt;2.412060&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;19200.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;2.396829&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="mf"&gt;33600.0&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt; &lt;span class="mf"&gt;15033.125000&lt;/span&gt; &lt;span class="mf"&gt;2.235064&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;    &lt;span class="mf"&gt;16800.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;2.097225&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;28800.0&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt; &lt;span class="mf"&gt;15033.125000&lt;/span&gt; &lt;span class="mf"&gt;1.915769&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="mf"&gt;14400.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;1.797621&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.2&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="mf"&gt;22800.0&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt; &lt;span class="mf"&gt;15033.125000&lt;/span&gt; &lt;span class="mf"&gt;1.516651&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Yes, in 2021 someone paid $22,800 for the &lt;strong&gt;Mexico&lt;/strong&gt; variant in 9.2 A+ grade.  Compare that to a 9.8 A++ &lt;strong&gt;Mexico&lt;/strong&gt; variant that sold in 2023 for just $600 more.&lt;/p&gt;
&lt;h2&gt;Links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/c/getthegreggames/videos"&gt;GetTheGregGames&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/@ReservedInvestments"&gt;Reserved Investments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gamerstonks.com/"&gt;GamerStonks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Python"></category><category term="Pandas"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Beat Artificial Intelligence (AI) with Active, Present Voice</title><link href="https://john.soban.ski/beat-ai.html" rel="alternate"></link><published>2023-10-28T01:23:00-04:00</published><updated>2023-10-28T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-10-28:/beat-ai.html</id><summary type="html">&lt;p&gt;When I ask ChatGPT to write detailed technical content, I find that it outputs hollow, bombastic, and meandering prose. &lt;/p&gt;
&lt;p&gt;Such output places a high cognitive load on my mind:  My subconscious must fill in the holes of the passive, cliche-ridden, and obtuse narratives.  I long for clear communication.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Shocked Robot" src="https://john.soban.ski/images/Beat_Ai/01_Shocked_Robot.png"&gt;&lt;/p&gt;
&lt;p&gt;In this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I ask ChatGPT to write detailed technical content, I find that it outputs hollow, bombastic, and meandering prose. &lt;/p&gt;
&lt;p&gt;Such output places a high cognitive load on my mind:  My subconscious must fill in the holes of the passive, cliche-ridden, and obtuse narratives.  I long for clear communication.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Shocked Robot" src="https://john.soban.ski/images/Beat_Ai/01_Shocked_Robot.png"&gt;&lt;/p&gt;
&lt;p&gt;In this blog post, I collect some ChatGPT &lt;strong&gt;tells&lt;/strong&gt;, discuss the issues with these &lt;strong&gt;tells&lt;/strong&gt; and then recommend how to fight the AI leviathan.&lt;/p&gt;
&lt;h2&gt;ChatGPT&lt;/h2&gt;
&lt;p&gt;In December 2015, Elon Musk, Greg Brockman, Ilya Sutskever, and Sam Altman pledged over $1B (USD) to found OpenAI.  Last year, OpenAI launched ChatGPT, a user-friendly service built on its Generative Pre-trained Transformer (GPT) family of Large Language Models (LLM).  Traditional data services &lt;strong&gt;retrieve&lt;/strong&gt; content, while ChatGPT &lt;strong&gt;generates&lt;/strong&gt; content (notice the subtle difference).&lt;/p&gt;
&lt;p&gt;LLMs allow users to generate data that never existed.  Generative AI, for example, can create an answer to the question "Name Socrates' Favorite 90s Sitcom."&lt;/p&gt;
&lt;h2&gt;ChatGPT Output&lt;/h2&gt;
&lt;p&gt;ChatGPT output appears impressive at first read.  The initial glow, however, fades in seconds.  The prose then injects the reader with a mild sense of discomfort.&lt;/p&gt;
&lt;p&gt;Consider the following output from ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the intricate world of digital communication the ability to label and prioritize data is paramount, especially when it comes to real-time data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At first, I read this and thought:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wow!  Tell me more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then, I felt unease and thought: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wait, that statement says nothing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The ChatGPT-constructed sentence values showy cadence over communication.  &lt;/p&gt;
&lt;p&gt;Take the phrase:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the intricate world of digital communication&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It sounds impressive but conveys nothing.  &lt;strong&gt;Intricate&lt;/strong&gt; lacks precision; &lt;strong&gt;intricate world&lt;/strong&gt; lacks authority.&lt;/p&gt;
&lt;p&gt;I believe ChatGPT wants to convey the importance of &lt;strong&gt;labeled/prioritized&lt;/strong&gt; data, and I fail to see how the adjective &lt;strong&gt;intricate&lt;/strong&gt; strengthens that argument.&lt;/p&gt;
&lt;p&gt;The sentence continues with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the ability to label and prioritize data is paramount&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I ask, &lt;strong&gt;Paramount to Whom?&lt;/strong&gt;  I also need to take a step back and remember the definition of &lt;strong&gt;Paramount&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Paramount: Chief in importance or impact&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;ChatGPT hallucinates here.  I consider the ability to label and prioritize real-time data a useful feature of digital communications, but I do not consider it &lt;strong&gt;chief in importance or impact.&lt;/strong&gt;  I consider the ability to transmit packets from source to sink &lt;strong&gt;chief in importance&lt;/strong&gt; for digital communications.&lt;/p&gt;
&lt;p&gt;Consider another ChatGPT-authored example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Phishing detection is of paramount importance in today's digital landscape as cybercriminals continuously refine their deceptive tactics to trick individuals and organizations into divulging sensitive information. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have experience with computer security and know that most security breaches result from Phishing attacks.  In that light, I would consider Phishing Detection &lt;strong&gt;chief in importance or impact&lt;/strong&gt; to computer security.  While correct, ChatGPT makes this claim without any evidence.  &lt;/p&gt;
&lt;p&gt;ChatGPT then uses the phrase &lt;strong&gt;today's digital landscape&lt;/strong&gt;.  That poetic phrase sounds impressive but lacks meaning, or at the very least carries no weight in the context of this sentence.&lt;/p&gt;
&lt;p&gt;I use my &lt;a href="https://kagi.com/"&gt;Kagi Search Engine&lt;/a&gt; (Non-affiliate link) to retrieve a definition of &lt;strong&gt;Digital Landscape&lt;/strong&gt;.  It returns dozens of different definitions, including one for advertising, one for online games, and one for communications infrastructure policy.  &lt;/p&gt;
&lt;p&gt;I consider &lt;strong&gt;Digital Landscape&lt;/strong&gt; a marketing buzzword, an empty vessel for &lt;strong&gt;content creators&lt;/strong&gt; to fill in a way that serves their narrative.&lt;/p&gt;
&lt;p&gt;ChatGPT should define &lt;strong&gt;digital landscape&lt;/strong&gt; to remove ambiguity.&lt;/p&gt;
&lt;h2&gt;ChatGPT Tells&lt;/h2&gt;
&lt;p&gt;I ask ChatGPT to create two-sentence summaries for a variety of topics.  I then highlight any turns of phrase that lead to sensations of discomfort.&lt;/p&gt;
&lt;p&gt;I organize them here in the hopes that a pattern will emerge.&lt;/p&gt;
&lt;p&gt;I name the first Category &lt;strong&gt;Empty Phrases That Almost Sound Smart&lt;/strong&gt;.  ChatGPT peppers these phrases in introductory paragraphs, without justification or evidence.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In today's &lt;strong&gt;digital landscape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;intricate world&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;this era&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unlocking new horizons&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;strategic imperative&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Recognizing the &lt;strong&gt;inherent challenges in this domain&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and the &lt;strong&gt;unique challenges they face&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;an increasingly important part of &lt;strong&gt;our global economy&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I call the next category &lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Infrastructure as Code (IaC) is a &lt;strong&gt;paradigm-shifting&lt;/strong&gt; approach&lt;/li&gt;
&lt;li&gt;Businesses can achieve &lt;strong&gt;unprecedented&lt;/strong&gt; agility&lt;/li&gt;
&lt;li&gt;These models provide valuable insights...making them &lt;strong&gt;indispensable&lt;/strong&gt; for decision-making&lt;/li&gt;
&lt;li&gt;Graphics Processing Units (GPUs) have &lt;strong&gt;transcended&lt;/strong&gt; their original role in rendering graphics to become &lt;strong&gt;indispensable accelerators&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Digital advertising holds &lt;strong&gt;immense significance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Blockchain... brings &lt;strong&gt;unprecedented efficiency&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next up, &lt;strong&gt;Empty Cliches in Passive Voice&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Phishing detection &lt;strong&gt;is of paramount importance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Regression models &lt;strong&gt;are powerful tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;...making robust security frameworks &lt;strong&gt;a paramount concern&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Embracing and diligently implementing security frameworks &lt;strong&gt;is not just&lt;/strong&gt; a regulatory necessity but a strategic imperative&lt;/li&gt;
&lt;li&gt;Cleaning data &lt;strong&gt;is the essential first step&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Adverbs&lt;/strong&gt; speak for themselves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to share and decipher data &lt;strong&gt;seamlessly&lt;/strong&gt; is paramount&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;specifically&lt;/strong&gt; designed for Banks and the unique challenges they face&lt;/li&gt;
&lt;li&gt;an &lt;strong&gt;increasingly&lt;/strong&gt; important part of our global economy&lt;/li&gt;
&lt;li&gt;Embracing and &lt;strong&gt;diligently&lt;/strong&gt; implementing security frameworks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Kevin and Dustin in Outbreak" src="https://john.soban.ski/images/Beat_Ai/02_Outbreak_Quote.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;KEVIN SPACEY (as Casey Schuler in Outbreak): It’s an adverb, Sam. It’s a lazy tool of a weak mind.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I call the next group &lt;strong&gt;Cringe Words&lt;/strong&gt;, words that actual humans never use in normal conversation unless they want to appear smart while high on Meth.  &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Indeed&lt;/li&gt;
&lt;li&gt;Alas&lt;/li&gt;
&lt;li&gt;Especially&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I maintain that the word &lt;strong&gt;Indeed&lt;/strong&gt; signals ChatGPT text more than any other &lt;strong&gt;tell&lt;/strong&gt;, so I wrote a bash script to detect ChatGPT prose:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;indeed&lt;span class="w"&gt; &lt;/span&gt;prose.txt&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$str&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;then&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ChatGPT wrote this&amp;#39;&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I call the next group &lt;strong&gt;Midwit Words&lt;/strong&gt;, words that people use to appear smart.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Leverage (Instead of use)&lt;/li&gt;
&lt;li&gt;Ensure (Instead of an appropriate verb)&lt;/li&gt;
&lt;li&gt;Utilize (Instead of use)&lt;/li&gt;
&lt;li&gt;Penultimate (Instead of Ultimate; penultimate actually means second-to-last)&lt;/li&gt;
&lt;/ul&gt;
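The bash one-liner above generalizes to the full tell list. This sketch (my own illustration, not a published tool; the word list contains examples drawn from this post) counts every tell word in a block of text:

```python
import re

# Illustrative tell words pulled from the lists in this post.
TELL_WORDS = ["indeed", "alas", "leverage", "utilize", "paramount",
              "unprecedented", "seamlessly", "penultimate"]


def find_tells(text):
    """Return a dict of tell word -> occurrence count (case-insensitive)."""
    counts = {}
    for word in TELL_WORDS:
        hits = re.findall(r"\b" + word + r"\b", text, flags=re.IGNORECASE)
        if hits:
            counts[word] = len(hits)
    return counts


sample = "Indeed, businesses leverage unprecedented agility."
print(find_tells(sample))  # {'indeed': 1, 'leverage': 1, 'unprecedented': 1}
```

A high count flags prose for a rewrite; a zero count, of course, proves nothing on its own.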
&lt;h2&gt;My Recommendation&lt;/h2&gt;
&lt;p&gt;ChatGPT produces wordy, loose text, light on meaning or evidence, and peppered with bombastic flourishes and empty cliches.  Authors must tighten their prose to separate their work from ChatGPT output.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ChatGPT Avatar" src="https://john.soban.ski/images/Beat_Ai/03_Chat_Avatar.png"&gt;&lt;/p&gt;
&lt;p&gt;I recommend an active, present voice that follows the &lt;a href="https://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object_word_order"&gt;Subject-Verb-Object (SVO)&lt;/a&gt; template.  I advise against adverbs; choose a precise verb instead.  Do not use adjectives without proper context.  Never write a cliche.&lt;/p&gt;
&lt;p&gt;Take the original ChatGPT sentence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the intricate world of digital communication the ability to label and prioritize data is paramount, especially when it comes to real-time data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rewrite it to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Digital Communications systems label and prioritize Real-Time Data to prevent jitter and buffering&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I encourage you to investigate &lt;a href="https://en.wikipedia.org/wiki/E-Prime"&gt;E-Prime&lt;/a&gt;, an upgrade to the English language that prioritizes clarity, precision and respect for the listener.&lt;/p&gt;
&lt;p&gt;D. David Bourland Jr. studied under General Semantics founder &lt;a href="https://en.wikipedia.org/wiki/Alfred_Korzybski"&gt;Alfred Korzybski&lt;/a&gt; and developed E-Prime.&lt;/p&gt;
&lt;p&gt;E-Prime removes all forms of the verb &lt;strong&gt;to be&lt;/strong&gt;.  This list includes (along with their contractions and negative contractions):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;am&lt;/li&gt;
&lt;li&gt;is&lt;/li&gt;
&lt;li&gt;are&lt;/li&gt;
&lt;li&gt;was&lt;/li&gt;
&lt;li&gt;were&lt;/li&gt;
&lt;/ul&gt;
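A short sketch (a toy illustration with a hand-picked word list, extended beyond the five forms above to cover a few common contractions) flags E-Prime violations in a sentence:

```python
import re

# Forms of "to be" from the list above, plus a few illustrative
# contractions and negative contractions.
TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being",
         "isn't", "aren't", "wasn't", "weren't", "i'm", "it's"}


def violates_e_prime(sentence):
    """Return the list of to-be forms found in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w in TO_BE]


print(violates_e_prime("OpenAI was founded in 2015"))       # ['was']
print(violates_e_prime("Four researchers founded OpenAI"))  # []
```

The first sentence fails the check; the rewritten active-voice version passes.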
&lt;p&gt;The verb &lt;strong&gt;to be&lt;/strong&gt; short circuits the &lt;a href="https://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object_word_order"&gt;Subject-Verb-Object (SVO)&lt;/a&gt; word order.  It allows lazy, imprecise writing.  &lt;strong&gt;To be&lt;/strong&gt; verbs drive passive voice:&lt;/p&gt;
&lt;p&gt;Consider&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI was founded in 2015&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Versus&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Elon Musk, Greg Brockman, Ilya Sutskever and Sam Altman founded OpenAI in 2015&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Consider this lazy, loose sentence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenDaylight is a Software Defined Network (SDN) Controller.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I must work harder to construct a sentence without the verb &lt;strong&gt;to be&lt;/strong&gt;.  I need to select subjects and verbs, and provide evidence.  The additional information benefits the reader:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Internet Service Providers (ISP), Cloud Service Providers (CSP), Data Center Engineers, and Academics use the OpenDaylight (ODL) platform to tailor and automate computer networks. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kellogg and Bourland write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[misuse of the verb &lt;strong&gt;to be&lt;/strong&gt; creates] a deity mode of speech [and allows] even the most ignorant to transform their opinions magically into god-like pronouncements on the nature of things - Kellogg, E. W.; Bourland Jr., D. David (1990). "Working with E-Prime: Some Practical Notes" (PDF). Et Cetera. 47 (4): 376–392.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I use E-Prime exclusively on my blog.  I encourage you to read my other posts and notice how I avoid the verb &lt;strong&gt;to be&lt;/strong&gt; and write with a clear, active, SVO voice.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;ChatGPT provides a useful service.  It generates prose in seconds, based on little user input.&lt;/p&gt;
&lt;p&gt;ChatGPT excels in some use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need bullets for advertising copy&lt;/li&gt;
&lt;li&gt;You need an invite for an event or meetup&lt;/li&gt;
&lt;li&gt;You need a quick outline&lt;/li&gt;
&lt;li&gt;You have writer's block and need a nudge&lt;/li&gt;
&lt;li&gt;You need to produce rote/ box-checking paperwork &lt;/li&gt;
&lt;li&gt;You need copious "good enough" content to increase your web presence&lt;/li&gt;
&lt;li&gt;Your company values &lt;strong&gt;looking busy&lt;/strong&gt; over &lt;strong&gt;productivity&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you need to communicate a clear message to your user, I recommend you craft your narrative by hand and use SVO, deliberate adjectives, and E-Prime.&lt;/p&gt;</content><category term="Data Science"></category><category term="NLP"></category><category term="Machine Learning"></category></entry><entry><title>Data Exploration with Data Viz Cheat Sheet</title><link href="https://john.soban.ski/analytics-cheat-sheet.html" rel="alternate"></link><published>2023-09-30T10:26:00-04:00</published><updated>2023-09-30T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-09-30:/analytics-cheat-sheet.html</id><summary type="html">&lt;p&gt;Today I collect and organize useful data visualization (Data Viz) tools that aid data exploration.  &lt;/p&gt;
&lt;p&gt;I illustrate the use of the tools via the classic &lt;strong&gt;Abalone&lt;/strong&gt; database, hosted on the University of California, Irvine (UCI) Machine Learning repository website.&lt;/p&gt;
&lt;p&gt;I recommend you bookmark this and return to it when you …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today I collect and organize useful data visualization (Data Viz) tools that aid data exploration.  &lt;/p&gt;
&lt;p&gt;I illustrate the use of the tools via the classic &lt;strong&gt;Abalone&lt;/strong&gt; database, hosted on the University of California, Irvine (UCI) Machine Learning repository website.&lt;/p&gt;
&lt;p&gt;I recommend you bookmark this and return to it when you need to find the syntax and semantics of popular data viz constructs.&lt;/p&gt;
&lt;h2&gt;Get the Data&lt;/h2&gt;
&lt;p&gt;PhD student David Aha created the University of California, Irvine (UCI) Machine Learning repository in 1987 in the form of a File Transfer Protocol (FTP) site.  The Repo collects databases, domain theories, and data generators.  Today I use the &lt;a href="https://archive.ics.uci.edu/dataset/1/abalone"&gt;Abalone&lt;/a&gt; database.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Abalone&lt;/strong&gt; database provides a table of roughly four thousand observations, each of which contains one categorical feature, seven continuous features, and one target:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Features, Categorical&lt;ul&gt;
&lt;li&gt;Sex: Male, Female, and Infant       &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Features, Continuous&lt;ul&gt;
&lt;li&gt;Length: Longest shell measurement (mm)&lt;/li&gt;
&lt;li&gt;Diameter: Perpendicular to length (mm)&lt;/li&gt;
&lt;li&gt;Height: With meat in the shell (mm)&lt;/li&gt;
&lt;li&gt;Whole_weight: Whole abalone (grams)&lt;/li&gt;
&lt;li&gt;Shucked_weight: Weight of meat (grams)&lt;/li&gt;
&lt;li&gt;Viscera_weight: Gut weight after bleeding (grams)&lt;/li&gt;
&lt;li&gt;Shell_weight: After being dried (grams)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Target, Integer&lt;ul&gt;
&lt;li&gt;Rings: +1.5 gives the age in years      &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I use the Python &lt;strong&gt;requests&lt;/strong&gt; library to pull the data straight from the UCI repo and stuff it into a Pandas DataFrame.&lt;/p&gt;
&lt;p&gt;I import the required libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I set the &lt;strong&gt;url&lt;/strong&gt; (String) and &lt;strong&gt;column_name&lt;/strong&gt; (List) variables to match the &lt;strong&gt;Abalone&lt;/strong&gt; database schema.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;url = &amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data&amp;#39;
column_names = [&amp;#39;Sex&amp;#39;,
                &amp;#39;Length&amp;#39;,
                &amp;#39;Diameter&amp;#39;,
                &amp;#39;Height&amp;#39;,
                &amp;#39;Whole_weight&amp;#39;,
                &amp;#39;Shucked_weight&amp;#39;,
                &amp;#39;Viscera_weight&amp;#39;,
                &amp;#39;Shell_weight&amp;#39;,
                &amp;#39;Rings&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Requests&lt;/strong&gt; downloads the raw bytes, &lt;strong&gt;decode()&lt;/strong&gt; converts them to a UTF-8 string, &lt;strong&gt;StringIO&lt;/strong&gt; wraps the string in a file-like object, and &lt;strong&gt;Pandas&lt;/strong&gt; loads it into a DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;r = requests.get(url).content
abalone_df = pd.read_csv(io.StringIO(r.decode(&amp;#39;utf-8&amp;#39;)),
                      names = column_names)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;One-Dimensional Statistical Summaries&lt;/h2&gt;
&lt;p&gt;We first explore the data in one dimension.&lt;/p&gt;
&lt;h3&gt;Histograms&lt;/h3&gt;
&lt;p&gt;Histograms provide a visual shorthand for the distribution of numerical data.  Think of a &lt;strong&gt;connect four&lt;/strong&gt; board, where you stack chips in different columns (or buckets).  Each chip represents a number in that bucket. &lt;/p&gt;
&lt;p&gt;Pandas provides a built-in &lt;strong&gt;hist()&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use Pandas to draw a Histogram of our target variable, &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Rings Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/01_Rings_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Most &lt;strong&gt;Abalone&lt;/strong&gt; have between 8 and 12 Rings (the tallest histogram buckets span 7.5 to 12.5).&lt;/p&gt;
&lt;p&gt;Pandas also accommodates our &lt;strong&gt;Categorical&lt;/strong&gt; feature.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Sex Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/02_Sex_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;The corpus of data includes roughly equal observations for &lt;strong&gt;Male, Female&lt;/strong&gt; and &lt;strong&gt;Infant&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Pandas allows us to run &lt;strong&gt;histograms&lt;/strong&gt; on all &lt;strong&gt;features&lt;/strong&gt;.  The method ignores the &lt;strong&gt;Categorical&lt;/strong&gt; feature.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="All Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/03_All_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;The results illustrate the need to &lt;strong&gt;Normalize&lt;/strong&gt; the data, since all the &lt;strong&gt;Continuous&lt;/strong&gt; features clock in under a value of &lt;strong&gt;one&lt;/strong&gt; (1), while the target ranges up to &lt;strong&gt;thirty&lt;/strong&gt; (30).&lt;/p&gt;
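A minimal sketch of that normalization, which min-max scales every column into [0, 1] with plain Pandas on a toy frame (my own example; scikit-learn's MinMaxScaler performs the same transform):

```python
import pandas as pd

# Toy frame with the same scale mismatch: weights under 1, Rings up to 30.
df = pd.DataFrame({'Whole_weight': [0.1, 0.5, 1.0],
                   'Rings': [3, 9, 30]})

# Min-max scale every column into the range [0, 1].
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled['Rings'].tolist())  # [0.0, 0.2222222222222222, 1.0]
```

After scaling, every feature shares the same range, so no single feature dominates a distance-based model.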
&lt;h3&gt;Hist with tags&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/tag/influxdb.html"&gt;InfluxDB&lt;/a&gt; uses the nomenclature &lt;strong&gt;Tags&lt;/strong&gt; and &lt;strong&gt;Measurements&lt;/strong&gt; to describe &lt;strong&gt;Categorical&lt;/strong&gt; and &lt;strong&gt;Continuous&lt;/strong&gt; variables.&lt;/p&gt;
&lt;p&gt;Tags provide a new dimension of visual data, &lt;strong&gt;slicing and dicing&lt;/strong&gt; the data into different categories.&lt;/p&gt;
&lt;p&gt;Seaborn provides the option to color by &lt;strong&gt;Tag&lt;/strong&gt; with their &lt;strong&gt;hue&lt;/strong&gt; parameter.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Sex Hue" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/04_Sex_Hue.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hue&lt;/strong&gt; does not make sense with &lt;strong&gt;Measurements&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Stupid Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/05_Stupid_Hist.png"&gt;&lt;/p&gt;
&lt;h3&gt;Kernel Density Estimation (KDE)&lt;/h3&gt;
&lt;p&gt;Kernel Density Estimation (KDE) smooths the Histograms.  Instead of discrete buckets, we see continuous lines that represent the distribution.&lt;/p&gt;
&lt;p&gt;I used the analogy above of a Histogram stacking chips on a connect four board.  KDE pours sand at each point, enough to fill a Standard Normal Distribution.  KDE in a sense stacks Standard Normal Distributions at each point, which leads to the smoothness of the plot.&lt;/p&gt;
&lt;p&gt;If you shrink the bucket width (increase the bin count), you can see the idea in action.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Default Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/06_Default_Bins.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Twentyfive Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/07_Twentyfive_Bins.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Fifty Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/08_Fifty_Bins.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Infinite Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/09_Infinite_Bins.png"&gt;&lt;/p&gt;
&lt;p&gt;SNS will plot the KDE over the histogram if you instruct it to do so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Whole_weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Kde Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/10_Kde_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Pandas plots all features' distributions with KDE.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="All Kde" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/11_All_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Boxplots&lt;/h2&gt;
&lt;p&gt;A glance at a Boxplot tells you the median, 25th percentile, 75th percentile, and outliers.&lt;/p&gt;
&lt;p&gt;The box spans the First and Third quartiles, and the whiskers extend to the furthest data points within 1.5 times the Interquartile Range (IQR) of the box edges (for both top and bottom); points beyond the whiskers plot as outliers.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Weight Box" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/12_Weight_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;SNS allows you to separate the chart by &lt;strong&gt;Tag&lt;/strong&gt;.  If you set &lt;strong&gt;y&lt;/strong&gt; equal to &lt;strong&gt;Sex&lt;/strong&gt;, for example, you see the distributions split by &lt;strong&gt;Male, Female, and Infant&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Sex Box" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/13_Sex_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;In the Boxplot above, we see that Female &lt;strong&gt;Abalone&lt;/strong&gt; weigh slightly more than Male &lt;strong&gt;Abalone&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Special Note: Enrich Data.&lt;/h3&gt;
&lt;p&gt;Remember that we have a &lt;strong&gt;target&lt;/strong&gt; variable named &lt;strong&gt;Rings&lt;/strong&gt;, which encompasses a range of numbers between one (1) and thirty (30).  I recommend you enrich the &lt;strong&gt;Rings&lt;/strong&gt; data with a new &lt;strong&gt;Tag&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The following code uses the &lt;strong&gt;Rings&lt;/strong&gt; value to set a new &lt;strong&gt;Tag&lt;/strong&gt;, which I named &lt;strong&gt;Age&lt;/strong&gt;.  The code splits the data into three equal-sized quantile buckets and applies the tag &lt;strong&gt;Young, Middle_Age or Old&lt;/strong&gt; to each observation based on its value of &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;abalone_df[&amp;#39;Age&amp;#39;] = pd.qcut(abalone_df[&amp;#39;Rings&amp;#39;],q=3,labels=[&amp;#39;Young&amp;#39;,&amp;#39;Middle_Age&amp;#39;,&amp;#39;Old&amp;#39;])
abalone_df.head()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
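Note that qcut splits on quantiles, so each Age bucket holds roughly the same number of observations; pd.cut, by contrast, splits on equal-width value ranges. A toy comparison (my own example values) shows the difference:

```python
import pandas as pd

# Toy data with one large value that skews the range.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

by_count = pd.qcut(s, q=2, labels=['low', 'high'])    # equal-count buckets
by_width = pd.cut(s, bins=2, labels=['low', 'high'])  # equal-width buckets

print(by_count.value_counts().to_dict())
print(by_width.value_counts().to_dict())
```

qcut puts five observations in each bucket, while cut places nine observations in the low bucket and leaves the single large value alone in the high bucket.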

&lt;p&gt;This new tag provides a new dimension to slice and dice our Boxplot.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Age Box" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/14_Age_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;We now see the relationship between &lt;strong&gt;Whole_weight&lt;/strong&gt;, &lt;strong&gt;Sex&lt;/strong&gt; and &lt;strong&gt;Age&lt;/strong&gt; at a glance.&lt;/p&gt;
&lt;h2&gt;Violinplots&lt;/h2&gt;
&lt;p&gt;A Violinplot mirrors the KDE of the distribution around a central axis, which gives the plot a Violin-like shape.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Rings Violin" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/15_Rings_Violin.png"&gt;&lt;/p&gt;
&lt;p&gt;Violinplots also accommodate Tags.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Violin Tags" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/16_Violin_Tags.png"&gt;&lt;/p&gt;
&lt;h2&gt;Two-dimensional Plots&lt;/h2&gt;
&lt;p&gt;Python provides tools to explore Bivariate data sets.&lt;/p&gt;
&lt;p&gt;Seaborn (SNS) provides two-dimensional Histograms and two-dimensional KDE tools.&lt;/p&gt;
&lt;h3&gt;Two-dimensional Histogram&lt;/h3&gt;
&lt;p&gt;Note that SNS only shows the top-down (heatmap) view for two-dimensional histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/17_Two_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;The SNS Bivariate Histograms accommodate tags.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Age&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Tag" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/18_Two_Tag.png"&gt;&lt;/p&gt;
&lt;h3&gt;Two-dimensional KDE&lt;/h3&gt;
&lt;p&gt;SNS also provides two-dimensional KDE plots, with &lt;strong&gt;Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Age&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;kde&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Kde" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/19_Two_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Look for Correlation&lt;/h2&gt;
&lt;p&gt;The Data Scientist looks for correlation between features and the target during the Data Exploration phase of the Machine Learning Pipeline.&lt;/p&gt;
&lt;h3&gt;Data prep&lt;/h3&gt;
&lt;p&gt;In the Data Prep stage, we encode the &lt;strong&gt;Tags&lt;/strong&gt; (String) into &lt;strong&gt;numeric values&lt;/strong&gt; (float32).&lt;/p&gt;
&lt;p&gt;The Pandas method &lt;strong&gt;get_dummies&lt;/strong&gt; one-hot-encodes the &lt;strong&gt;Sex&lt;/strong&gt; variable into Orthogonal dimensions.  This increases the dimensionality of our data set.&lt;/p&gt;
&lt;p&gt;We also use the &lt;strong&gt;factorize&lt;/strong&gt; method to convert the &lt;strong&gt;Young, Middle_Aged and Old&lt;/strong&gt; tags into integer codes (with &lt;strong&gt;sort=True&lt;/strong&gt;, the codes follow the sorted order of the tag names).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age_Bucket&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
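&lt;p&gt;As a sketch of what the two encoders return, here is the same recipe on a hypothetical toy frame (the column values are illustrative, not the real Abalone data):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns
toy_df = pd.DataFrame({'Sex': ['M', 'F', 'I', 'M'],
                       'Age': ['Old', 'Young', 'Middle_Aged', 'Young']})

# get_dummies one-hot encodes Sex into orthogonal 0/1 columns
dummies = pd.get_dummies(toy_df['Sex'])
print(list(dummies.columns))  # ['F', 'I', 'M']

# factorize with sort=True assigns codes in sorted (alphabetical) order
codes, uniques = pd.factorize(toy_df['Age'], sort=True)
print(list(uniques))  # ['Middle_Aged', 'Old', 'Young']
print(list(codes))    # [1, 2, 0, 2]
```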

&lt;p&gt;We pop off the labels for later use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;class_labels&lt;/strong&gt; stores the target vector for &lt;strong&gt;Classification&lt;/strong&gt; models, and &lt;strong&gt;reg_labels&lt;/strong&gt; stores the target vector for &lt;strong&gt;Regression&lt;/strong&gt; models.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age_Bucket&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reg_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
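&lt;p&gt;&lt;strong&gt;pop&lt;/strong&gt; both removes the column from the DataFrame and returns it as a Series, so the features and the labels end up in separate objects.  A quick sketch on a hypothetical two-column frame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical two-column frame: one feature, one target
toy_df = pd.DataFrame({'Length': [0.455, 0.350], 'Rings': [15, 7]})

# pop removes Rings from the frame and returns it as a Series
labels = toy_df.pop('Rings')

print(list(labels))          # [15, 7]
print(list(toy_df.columns))  # ['Length']
```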

&lt;p&gt;I also create lists to pull like &lt;strong&gt;Features&lt;/strong&gt; from the DataFrame (Measurements, Tags, Target).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Diameter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Shucked_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Viscera_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Shell_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;encoded_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;I&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Heatmap correlation&lt;/h3&gt;
&lt;p&gt;SNS provides a &lt;strong&gt;Heatmap&lt;/strong&gt; matrix for correlation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;corr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a mask for the upper triangle&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Set up the matplotlib figure&lt;/span&gt;
&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a custom diverging colormap&lt;/span&gt;
&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diverging_palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                             &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Draw the heatmap with the mask and &lt;/span&gt;
&lt;span class="c1"&gt;# correct aspect ratio&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cbar_kws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;shrink&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Corr Matrix" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/20_Corr_Matrix.png"&gt;&lt;/p&gt;
&lt;p&gt;We see that &lt;strong&gt;Diameter&lt;/strong&gt; and &lt;strong&gt;Length&lt;/strong&gt; correlate strongly, as do all of the &lt;strong&gt;weight&lt;/strong&gt; features.&lt;/p&gt;
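&lt;p&gt;The &lt;strong&gt;np.triu&lt;/strong&gt; mask above hides the redundant upper triangle, since every pair appears twice in a correlation matrix, plus a diagonal of ones.  A minimal sketch on a synthetic three-column frame (made-up data, not the Abalone set):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic frame: b tracks a closely, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({'a': a,
                       'b': 2 * a + rng.normal(scale=0.1, size=200),
                       'c': rng.normal(size=200)})

corr = toy_df.corr()

# True entries mark the upper triangle (and diagonal) to hide
mask = np.triu(np.ones_like(corr, dtype=bool))

print(corr.loc['b', 'a'])  # close to 1.0
print(mask)
```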
&lt;h3&gt;Pairgrid Correlation&lt;/h3&gt;
&lt;p&gt;This SNS Pairgrid plot shows the correlation between the features and the target, &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PairGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_vars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_offdiag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Pair Grid" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/21_Pair_Grid.png"&gt;&lt;/p&gt;
&lt;p&gt;Each &lt;strong&gt;feature&lt;/strong&gt; shows a clear upward trend against &lt;strong&gt;Rings&lt;/strong&gt;, which indicates positive correlation.&lt;/p&gt;
&lt;h3&gt;Scatterplot with Regression&lt;/h3&gt;
&lt;p&gt;SNS plots the ML 101 favorite, Linear Regression, right on the scatter plot with the &lt;strong&gt;regplot&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Viscera_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Reg Plot" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/22_Reg_Plot.png"&gt;&lt;/p&gt;
&lt;p&gt;We see a positive slope with tight confidence bands, which indicates that &lt;strong&gt;Viscera_weight&lt;/strong&gt; predicts &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
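&lt;p&gt;Under the hood, &lt;strong&gt;regplot&lt;/strong&gt; fits an ordinary least-squares line.  We can recover the same slope and intercept by hand with &lt;strong&gt;np.polyfit&lt;/strong&gt; (a sketch on synthetic data with a known slope, not the Abalone set):&lt;/p&gt;

```python
import numpy as np

# Synthetic feature/target pair built with a known slope of 2
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=500)
y = 2 * x + rng.normal(scale=0.05, size=500)

# A degree-1 polynomial fit is a simple linear regression
slope, intercept = np.polyfit(x, y, 1)

print(slope)      # close to 2
print(intercept)  # close to 0
```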
&lt;h3&gt;Fancy Tilted 3d Plots&lt;/h3&gt;
&lt;p&gt;Remember that SNS only graphs &lt;strong&gt;top-down&lt;/strong&gt; views.  I wrote the following &lt;strong&gt;matplotlib&lt;/strong&gt; function to show an isometric view of the data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;target_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I call the function with the &lt;strong&gt;Abalone&lt;/strong&gt; data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Viscera_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Three Dee" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/23_Three_Dee.png"&gt;&lt;/p&gt;
&lt;h2&gt;Dimensionality Reduction&lt;/h2&gt;
&lt;p&gt;Note that my Graph above requires me to choose &lt;strong&gt;three&lt;/strong&gt; (out of the possible &lt;strong&gt;eight&lt;/strong&gt;) features at a time.  This fact drives two questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which features do I use?&lt;/li&gt;
&lt;li&gt;How can I plot all the features at once?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Principal Component Analysis (PCA) collapses the information held in &lt;strong&gt;eight&lt;/strong&gt; features into &lt;strong&gt;three&lt;/strong&gt;, &lt;strong&gt;two&lt;/strong&gt; or even &lt;strong&gt;one&lt;/strong&gt; feature.&lt;/p&gt;
&lt;p&gt;I write about PCA in my blog post on &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Regression with Keras and TensorFlow&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you stick a magnet at each point in the data space, and then stick a telescoping iron bar at the origin, the magnets will pull the bar into position and stretch the bar. The bar will wiggle a bit at first and then eventually settle into a static position. The final direction and length of the bar represent a principal component. We can map the higher dimensionality space to the principal component by connecting a string directly from each magnet to the bar. Where the string hits (taut) we make a mark. The marks represent the mapped vector space.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;George Dallas also writes an excellent blog post that &lt;a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/"&gt;explains PCA&lt;/a&gt;.&lt;/p&gt;
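&lt;p&gt;The magnet-and-bar intuition can be checked on a synthetic two-dimensional cloud: when the points stretch along one direction, a single component soaks up nearly all of the variance (a sketch on made-up data):&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic cloud stretched along the line y = x
rng = np.random.default_rng(42)
t = rng.normal(size=300)
X = np.column_stack([t, t + rng.normal(scale=0.1, size=300)])

# Fit a single principal component (the "bar" direction)
pca = PCA(n_components=1)
pca.fit(X)

# One component captures almost all of the variance
print(pca.explained_variance_ratio_[0])  # close to 1.0
```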
&lt;h3&gt;Normalize&lt;/h3&gt;
&lt;p&gt;First Normalize the Data.  TensorFlow provides a &lt;strong&gt;normalizer&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Fit the &lt;strong&gt;normalizer&lt;/strong&gt; to our &lt;strong&gt;measurements&lt;/strong&gt; (exclude the encoded tags).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
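&lt;p&gt;&lt;strong&gt;adapt&lt;/strong&gt; records the per-feature mean and variance, and the layer then applies &lt;strong&gt;(x - mean) / sqrt(variance)&lt;/strong&gt; at call time.  The same arithmetic by hand in NumPy (a sketch on made-up numbers):&lt;/p&gt;

```python
import numpy as np

# Two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]], dtype=np.float32)

# What adapt() learns: per-feature (column) mean and variance
mean = X.mean(axis=0)
var = X.var(axis=0)

# What the layer applies at call time
X_norm = (X - mean) / np.sqrt(var)

print(X_norm.mean(axis=0))  # ~0 for each feature
print(X_norm.std(axis=0))   # ~1 for each feature
```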

&lt;h3&gt;One Principal Component&lt;/h3&gt;
&lt;p&gt;SciKitLearn provides &lt;strong&gt;PCA&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following code collapses all seven &lt;strong&gt;features&lt;/strong&gt; into one &lt;strong&gt;Principal Component&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pca_abalone_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
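&lt;p&gt;As a quick sanity check (not part of the original listing), the fitted PCA object exposes &lt;strong&gt;&lt;em&gt;explained_variance_ratio_&lt;/em&gt;&lt;/strong&gt;, which reports the fraction of total variance the single component captures.  A minimal sketch, with synthetic data standing in for the normalized abalone features:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the normalized abalone features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))

pca = PCA(n_components=1)
pca.fit(X)

# Fraction of total variance the one component captures
print(pca.explained_variance_ratio_)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;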

&lt;p&gt;A Seaborn (&lt;strong&gt;sns&lt;/strong&gt;) histogram shows how well this single &lt;strong&gt;Principal Component&lt;/strong&gt; separates the &lt;strong&gt;Classes&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="One Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/24_One_Princomp.png"&gt;&lt;/p&gt;
&lt;h3&gt;Two Principal Components&lt;/h3&gt;
&lt;p&gt;Now derive two principal components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A KDE plot shows the three classes in relation to the two &lt;strong&gt;Principal Components&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/25_Two_Princomp.png"&gt;&lt;/p&gt;
&lt;h3&gt;Three Principal Components&lt;/h3&gt;
&lt;p&gt;Astute readers will anticipate the slight code modifications required to derive three &lt;strong&gt;Principal Components&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use a &lt;strong&gt;3D&lt;/strong&gt; plot to see the separation of the classes in relation to three &lt;strong&gt;Principal Components&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Three Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/26_Three_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;If you include one-hot encoded variables in your PCA, you may see misleading results.&lt;/p&gt;
&lt;p&gt;For example, we encoded the &lt;strong&gt;Categorical&lt;/strong&gt; &lt;strong&gt;Sex&lt;/strong&gt; feature into three &lt;strong&gt;Orthogonal&lt;/strong&gt; numeric vectors, one each for &lt;strong&gt;M, F, and I&lt;/strong&gt;.  If you keep these vectors in the PCA, you will see the following:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sex Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/27_Sex_Princomp.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Bookmark this page for future reference.  It provides a handy &lt;strong&gt;Cheat Sheet&lt;/strong&gt; of useful Python Data Exploration and Data Viz tools.&lt;/p&gt;</content><category term="Data Science"></category><category term="Python"></category><category term="Pandas"></category><category term="Machine Learning"></category></entry><entry><title>Refactor Matlab to Tidyverse</title><link href="https://john.soban.ski/refactor-matlab-to-tidyverse.html" rel="alternate"></link><published>2023-08-26T10:26:00-04:00</published><updated>2023-08-26T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-08-26:/refactor-matlab-to-tidyverse.html</id><summary type="html">&lt;p&gt;I've previously discussed the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Coulomb Energy&lt;/a&gt; Neural Net algorithm on this site.  I wrote the algorithm in Matlab, which uses index-based logic to select, filter, wrangle, and process data. Today I will refactor the Matlab code to &lt;a href="https://www.tidyverse.org/"&gt;Tidyverse&lt;/a&gt;.  Tidyverse uses forward pipe operators to flow data through the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I've previously discussed the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Coulomb Energy&lt;/a&gt; Neural Net algorithm on this site.  I wrote the algorithm in Matlab, which uses index-based logic to select, filter, wrangle, and process data. Today I will refactor the Matlab code to &lt;a href="https://www.tidyverse.org/"&gt;Tidyverse&lt;/a&gt;.  Tidyverse uses forward pipe operators to flow data through the data processing steps.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pipe Operator Pic" src="https://john.soban.ski/images/Refactor_Matlab_To_Tidyverse/00_Pipe_Operator.png"&gt;&lt;/p&gt;
&lt;p&gt;The example RCE algorithm assigns a class to data based on whether or not the data points live inside "footprints" of training data.  If a visual walk-through of the RCE algorithm interests you, take a minute to read my post titled &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical Introduction to Probabilistic Neural Networks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following graphic captures an animation of the RCE NN Algorithm in action.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE in action" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;p&gt;You can find the original Matlab script and the new R script on &lt;a href="https://github.com/hatdropper1977/bupa-rce-octave"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Loading Data&lt;/h2&gt;
&lt;p&gt;The algorithm loads the BUPA liver disorders database from the University of California, Irvine (UCI) &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;machine learning repository&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we encode the CSV into a matrix with brackets and assignment.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="mi"&gt;86&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="mi"&gt;91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="k"&gt;...&lt;/span&gt;

&lt;span class="mi"&gt;98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;16.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;Tidyverse allows us to read the raw CSV and store it in a &lt;a href="https://tibble.tidyverse.org/"&gt;Tibble&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to storing the CSV data in a Tibble, we use the &lt;strong&gt;readr&lt;/strong&gt; library to add column names (&lt;strong&gt;&lt;em&gt;col_names&lt;/em&gt;&lt;/strong&gt;) and the &lt;strong&gt;tibble&lt;/strong&gt; library to add an ID column (&lt;strong&gt;&lt;em&gt;rowid_to_column&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;p&gt;I discuss the definitions of the column names in the next section.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;readr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;dplyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;magrittr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;purrr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tidyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mcv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alkphos&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgpt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gammagt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;drinks_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;select&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bupa.data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;tibble&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;rowid_to_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Selecting Features&lt;/h2&gt;
&lt;p&gt;The BUPA data includes six features and two classes (one for alcohol-related liver disorders and one for alcohol-unrelated liver disorders).&lt;/p&gt;
&lt;p&gt;The six (6) BUPA features include: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mean corpuscular volume (mcv)&lt;/li&gt;
&lt;li&gt;Four chemical markers&lt;ul&gt;
&lt;li&gt;alkaline phosphotase (alkphos)&lt;/li&gt;
&lt;li&gt;alamine aminotransferase (sgpt)&lt;/li&gt;
&lt;li&gt;aspartate aminotransferase (sgot)&lt;/li&gt;
&lt;li&gt;gamma-glutamyl transpeptidase (gammagt)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;half-pint equivalents of alcohol per day (drinks)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I discussed the salient features in my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE&lt;/a&gt; writeup.  Three features, &amp;quot;alkphos&amp;quot;, &amp;quot;sgpt&amp;quot;, and &amp;quot;gammagt&amp;quot;, stand out in terms of the algorithm's classification performance.  We would still like to give the Data Scientist flexibility in selecting features for &amp;quot;what if&amp;quot; scenarios, so we write our code to accommodate an arbitrary number of features.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we use the column index to select the features.  In this case, we use &lt;strong&gt;&lt;em&gt;alkphos == 2&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;sgpt == 5&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;gammagt == 6&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;Tidyverse allows us to name the columns and then select by name.&lt;/p&gt;
&lt;p&gt;When we created &lt;strong&gt;&lt;em&gt;Bupa.Tib&lt;/em&gt;&lt;/strong&gt; we named the columns, so now we can &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; columns by name.&lt;/p&gt;
&lt;p&gt;We create a list that records the column names that we intend to keep.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alkphos&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgpt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gammagt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Creating the training set&lt;/h2&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we sort the data by the &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt;, which the matrix stores in column seven (7).&lt;/p&gt;
&lt;p&gt;We then use index operations to select all features, excluding the class.&lt;/p&gt;
&lt;p&gt;Then we select the desired features using the &lt;strong&gt;&lt;em&gt;feats&lt;/em&gt;&lt;/strong&gt; array.  A function, named &lt;strong&gt;&lt;em&gt;prepare_uncoded&lt;/em&gt;&lt;/strong&gt; wraps this operation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;[data]&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prepare_uncoded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;data,feats&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sortrows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We create two separate matrices of seventy-two (72) rows each: rows 73 through 144 and rows 145 through 216.  Since we sorted by class in the function above, we produce one matrix of train patterns that contains class one and one that contains train patterns of class two.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prepare_uncoded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;144&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;145&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;216&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;magrittr&lt;/em&gt;&lt;/strong&gt; package in Tidyverse provides the forward pipe operator (&lt;strong&gt;&lt;em&gt;%&amp;gt;%&lt;/em&gt;&lt;/strong&gt;), which makes the feature selection steps more readable.&lt;/p&gt;
&lt;p&gt;We use &lt;strong&gt;&lt;em&gt;filter&lt;/em&gt;&lt;/strong&gt; to filter points of each class, &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; to select the features and &lt;strong&gt;&lt;em&gt;slice&lt;/em&gt;&lt;/strong&gt; to pull specific rows.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  To disambiguate, UCI named the &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt; column &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt;, so we &lt;strong&gt;&lt;em&gt;filter&lt;/em&gt;&lt;/strong&gt; on the value of the &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; column.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_1_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;73&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;144&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;class_2_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;72&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_1_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_2_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Find Radii&lt;/h2&gt;
&lt;p&gt;The RCE NN algorithm requires us to find, for each training point, the radius to the nearest training point of the opposite class.&lt;/p&gt;
&lt;p&gt;We compute the Euclidean distance to every training point of the other class and store the distance (named &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt;) of the closest one.&lt;/p&gt;
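<p>Before the Matlab and Tidyverse implementations, here is a minimal Python sketch of the rule for a single training point (an illustration only; the function name and toy values are my own): <strong><em>lambda</em></strong> is the distance to the nearest opposite-class point, minus <strong><em>epsilon</em></strong>, capped at <strong><em>lambda max</em></strong>.</p>

```python
import numpy as np

def lambda_for_point(point, other_class, epsilon, lambda_max):
    """Distance from `point` to the nearest row of `other_class`,
    minus epsilon, capped at lambda_max."""
    dists = np.sqrt(((other_class - point) ** 2).sum(axis=1))
    return float(min(dists.min() - epsilon, lambda_max))

# Toy data: one Class 1 point and three Class 2 points
p = np.array([0.0, 0.0])
other = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 10.0]])
print(lambda_for_point(p, other, epsilon=0.0, lambda_max=50.0))  # 5.0
```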
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we create a function that ingests both the &lt;strong&gt;&lt;em&gt;Class 1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;Class 2&lt;/em&gt;&lt;/strong&gt; training matrices, along with &lt;strong&gt;&lt;em&gt;epsilon&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;lambda max&lt;/em&gt;&lt;/strong&gt;.  &lt;strong&gt;&lt;em&gt;Lambda max&lt;/em&gt;&lt;/strong&gt; sets an upper bound on the radius the algorithm will consider.  &lt;strong&gt;&lt;em&gt;Epsilon&lt;/em&gt;&lt;/strong&gt; provides a very small value that we subtract from the calculated &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt;.  For more details, see my writeup of the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE NN&lt;/a&gt; algorithm.&lt;/p&gt;
&lt;p&gt;The Matlab code uses nested function calls to perform the matrix operations that calculate the Euclidean distances and then record the minimum.&lt;/p&gt;
&lt;p&gt;In addition, the code uses a &lt;strong&gt;&lt;em&gt;for loop&lt;/em&gt;&lt;/strong&gt; to iterate through every training pattern.&lt;/p&gt;
&lt;p&gt;For each training point, the function calculates the distance to every training point of the opposite class (stored in the two &lt;strong&gt;&lt;em&gt;Class&lt;/em&gt;&lt;/strong&gt; matrices) and keeps the minimum.&lt;/p&gt;
&lt;p&gt;It then returns two arrays of &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; values, one array per class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;[lambda_1, lambda_2] = rce_train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;class1,class2,eps,lambda_max&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;%Find number of train patterns (colums)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We apply the function to the training matrices:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nb"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
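<p>For comparison, here is a rough NumPy translation of the Matlab function above (my own sketch, keeping the one-pattern-per-column layout). Broadcasting replaces both <strong><em>for loops</em></strong> with a single pairwise distance matrix:</p>

```python
import numpy as np

def rce_train(class1, class2, eps, lambda_max):
    """Rough NumPy analogue of the Matlab rce_train above.
    Each class matrix stores one training pattern per column."""
    # Pairwise Euclidean distances: dists[i, j] is the distance
    # from class1 column i to class2 column j
    diff = class1[:, :, None] - class2[:, None, :]
    dists = np.sqrt((diff ** 2).sum(axis=0))
    # Row minima give lambda_1, column minima give lambda_2
    lambda_1 = np.minimum(dists.min(axis=1) - eps, lambda_max)
    lambda_2 = np.minimum(dists.min(axis=0) - eps, lambda_max)
    return lambda_1, lambda_2
```

The broadcasted `diff` array holds every pattern-to-pattern difference at once, so no explicit iteration is needed.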

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;R best practices discourage &lt;strong&gt;&lt;em&gt;for loops&lt;/em&gt;&lt;/strong&gt;, since R follows a &lt;strong&gt;&lt;em&gt;functional&lt;/em&gt;&lt;/strong&gt; programming convention.&lt;/p&gt;
&lt;p&gt;In addition, the &lt;strong&gt;&lt;em&gt;MagrittR&lt;/em&gt;&lt;/strong&gt; pipes allow us to avoid &lt;strong&gt;&lt;em&gt;nested functions&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We first create a function named &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt;.  I decided to process the data one class at a time, so this function calculates the distance from a single observation to the training points of the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class only, rather than ingesting both class matrices at once as the Matlab function above does.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; function takes a single observation (row of data) for a particular &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt;, along with the entire &lt;strong&gt;&lt;em&gt;Tibble&lt;/em&gt;&lt;/strong&gt; that contains all data points of the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;.  The function also ingests &lt;strong&gt;&lt;em&gt;epsilon&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;lambda max&lt;/em&gt;&lt;/strong&gt; and the &lt;strong&gt;&lt;em&gt;features&lt;/em&gt;&lt;/strong&gt; vector.&lt;/p&gt;
&lt;p&gt;Not to overload terms too much, but the function includes a &lt;strong&gt;&lt;em&gt;lambda function&lt;/em&gt;&lt;/strong&gt; that calculates the Euclidean distance between two vectors.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;lambda function&lt;/em&gt;&lt;/strong&gt; takes two vectors, the &lt;strong&gt;&lt;em&gt;observation&lt;/em&gt;&lt;/strong&gt; vector and a row from the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt; Tibble, which I call &lt;strong&gt;&lt;em&gt;x&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The lambda function can perform calculations on vectors of any length, which provides Data Scientists flexibility in choosing which features to include.&lt;/p&gt;
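<p>To illustrate that claim with hypothetical values (a Python sketch, not the post's R code), the same distance expression handles two features or three without modification:</p>

```python
import numpy as np

# One distance expression, any feature count
def euclid(x, observation):
    return float(np.sqrt(((x - observation) ** 2).sum()))

print(euclid(np.array([3.0, 4.0]), np.array([0.0, 0.0])))            # 5.0
print(euclid(np.array([1.0, 2.0, 2.0]), np.array([0.0, 0.0, 0.0])))  # 3.0
```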
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; function follows, and I will explain it quickly line by line.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Other.Class.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Other.Class.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We start with the Tibble that contains &lt;strong&gt;&lt;em&gt;all&lt;/em&gt;&lt;/strong&gt; observations of the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class, stored in &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The function pipes the Tibble in its entirety to a &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; statement that selects all of the &lt;strong&gt;&lt;em&gt;desired&lt;/em&gt;&lt;/strong&gt; features.&lt;/p&gt;
&lt;p&gt;We then use the &lt;strong&gt;&lt;em&gt;mutate&lt;/em&gt;&lt;/strong&gt; operator to create a new column named &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt;.  This column stores the &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt; from the current &lt;strong&gt;&lt;em&gt;observation&lt;/em&gt;&lt;/strong&gt; (single vector) to every data point (row) in the &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; operator tells &lt;strong&gt;&lt;em&gt;Tidyverse&lt;/em&gt;&lt;/strong&gt; to apply the Euclidean distance &lt;strong&gt;&lt;em&gt;lambda function&lt;/em&gt;&lt;/strong&gt; to every row in &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt; and store the result for each row in the &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt; column.&lt;/p&gt;
&lt;p&gt;Since we must accommodate vectors of arbitrary length, we tell &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; to traverse the data &lt;strong&gt;&lt;em&gt;row wise&lt;/em&gt;&lt;/strong&gt; via the &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt; we pass as the second argument (the &lt;strong&gt;&lt;em&gt;MARGIN&lt;/em&gt;&lt;/strong&gt; parameter).&lt;/p&gt;
&lt;p&gt;Once the &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; operation completes, we have a column that records the distance to each data point in the &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt;.  We are only interested in the &lt;strong&gt;&lt;em&gt;nearest&lt;/em&gt;&lt;/strong&gt; data point of the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;, so we &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; the &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt; column and find the &lt;strong&gt;&lt;em&gt;min()&lt;/em&gt;&lt;/strong&gt;.  We then subtract &lt;strong&gt;&lt;em&gt;epsilon&lt;/em&gt;&lt;/strong&gt; and cap the result at &lt;strong&gt;&lt;em&gt;lambda max&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In summary, we supply the function with a single observation for a class, along with a &lt;strong&gt;&lt;em&gt;Tibble&lt;/em&gt;&lt;/strong&gt; that includes &lt;strong&gt;&lt;em&gt;all&lt;/em&gt;&lt;/strong&gt; observations for the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;.  The function then returns a single value, the distance between the current observation and the nearest data point of the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We are not done yet.  We must apply this function to every training point in the Class under observation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find Lambda for Class 1 Training patterns&lt;/span&gt;
&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find Lambda for Class 2 Training patterns&lt;/span&gt;
&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pipe the entire &lt;strong&gt;&lt;em&gt;Class.1.Train.Tib&lt;/em&gt;&lt;/strong&gt; to a &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; function and then use the &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; operation to execute &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; on every row of &lt;strong&gt;&lt;em&gt;Class.1.Train.Tib&lt;/em&gt;&lt;/strong&gt;.  Although each iteration (application) of &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; ingests the entire &lt;strong&gt;&lt;em&gt;Class.2.Train.Tib&lt;/em&gt;&lt;/strong&gt;, it returns a single value for lambda.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  The &lt;strong&gt;&lt;em&gt;MagrittR&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;%&amp;lt;&amp;gt;%&lt;/em&gt;&lt;/strong&gt; operator pipes data forward and stores the final result of all chained operations back into the initial variable.&lt;/p&gt;
&lt;/blockquote&gt;
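<p>The row-wise application above can be sketched in Python with toy data (hypothetical values; the names are my own, not the BUPA features):</p>

```python
import numpy as np

# Toy stand-ins for the two training tibbles (rows = observations)
class_1 = np.array([[0.0, 0.0], [10.0, 0.0]])
class_2 = np.array([[3.0, 4.0], [12.0, 0.0]])
epsilon, lambda_max = 0.5, 50.0

# For each Class 1 row: distance to the nearest Class 2 row,
# minus epsilon, capped at lambda_max -- one lambda per row
lambdas = [
    float(min(np.sqrt(((class_2 - row) ** 2).sum(axis=1)).min() - epsilon,
              lambda_max))
    for row in class_1
]
print(lambdas)  # [4.5, 1.5]
```

Each iteration sees the whole opposite-class array but reduces it to a single lambda, which mirrors how <strong><em>apply</em></strong> feeds one row at a time to <strong><em>find_lambda</em></strong>.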
&lt;p&gt;The following output tibble depicts what &lt;strong&gt;&lt;em&gt;Class.1.Train.Tib&lt;/em&gt;&lt;/strong&gt; looks like after application of &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;
&lt;span class="c1"&gt;# A tibble: 72 x 5&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gammagt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;67&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;77&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;114&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;29.1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;175&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;71&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;29&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;52&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;10.5&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;176&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;93&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;22&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;123&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;19.4&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;182&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;77&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;86&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;31&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;26.8&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;183&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;77&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;39&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;108&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;20.4&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;189&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;83&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;81&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;201&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;58.3&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;190&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;75&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;14&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;3.16&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;191&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;56&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;6.48&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;192&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;91&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;27&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;15&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;7.87&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;194&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;62&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;17&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5.00&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;195&lt;/span&gt;
&lt;span class="c1"&gt;# ... with 62 more rows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, the closest Class 2 data point to the first Class 1 observation lies 29.1 units away.&lt;/p&gt;
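&lt;p&gt;To make the &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; idea concrete, here is a minimal sketch in Python rather than R (the two-point training sets are hypothetical stand-ins, not BUPA data): each footprint radius equals the distance to the nearest training point of the other class.&lt;/p&gt;

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical two-feature training points for each class
class_1 = [(71.0, 29.0), (93.0, 22.0)]
class_2 = [(77.0, 86.0), (62.0, 17.0)]

def lambda_radius(point, other_class):
    """Footprint radius: distance to the nearest point of the other class."""
    return min(dist(point, q) for q in other_class)

lambdas_1 = [lambda_radius(p, class_2) for p in class_1]
lambdas_2 = [lambda_radius(p, class_1) for p in class_2]
```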
&lt;h2&gt;Classify the Data&lt;/h2&gt;
&lt;p&gt;We first use the remaining BUPA observations, held out of training, to create test patterns for each class.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_class1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;test_class2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;217&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;288&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;In Tidyverse I decided to create one &lt;strong&gt;&lt;em&gt;Tibble&lt;/em&gt;&lt;/strong&gt; for all Test Patterns, via the &lt;strong&gt;&lt;em&gt;bind_rows&lt;/em&gt;&lt;/strong&gt; operation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Test.Patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_1_test_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;bind_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_2_test_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
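&lt;p&gt;For readers following along in neither language, the row selection above reduces to simple slicing.  A hedged Python sketch with stand-in data (the actual BUPA file is not loaded here) shows how Matlab's one-based ranges map to zero-based indices:&lt;/p&gt;

```python
# Stand-in for the 345 BUPA rows (features reduced to two for brevity)
data = [[float(i), float(i % 7)] for i in range(345)]

# Matlab's data(1:72,:) and data(217:288,:) in zero-based terms
test_class_1 = data[0:72]     # rows 1-72: Class 1 test patterns
test_class_2 = data[216:288]  # rows 217-288: Class 2 test patterns
```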

&lt;p&gt;Once we have test data, we need to classify it.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, I wrote a function named &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt;.  The function contains several nested function calls and a for loop.&lt;/p&gt;
&lt;p&gt;Each training pattern includes a circular "footprint" around it that extends to the nearest point of the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class, with radius equal to the &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; we calculated above.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; function finds which &lt;strong&gt;&lt;em&gt;footprint&lt;/em&gt;&lt;/strong&gt; each test observation lies in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;[cl]&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;class1,lambda_1,class2,lambda_2,test_patterns&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c"&gt;%Test Patterns in form: num_features x num_patterns&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c"&gt;%Find number of train patterns (colums)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;num_test_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_patterns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;num_test_patterns&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;test_x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_patterns&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;*&lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;*&lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;In the &lt;strong&gt;&lt;em&gt;Tidyverse&lt;/em&gt;&lt;/strong&gt; classification approach, the functions nest only in the logical sense, since the code expresses every step with pipes.&lt;/p&gt;
&lt;p&gt;We create a generic function to discover how many "footprints" the given observation lives in.&lt;/p&gt;
&lt;p&gt;Similar to the &lt;strong&gt;&lt;em&gt;Matlab&lt;/em&gt;&lt;/strong&gt; code above, we calculate the distance between an observation of the Test data and all of the training samples of a given class.&lt;/p&gt;
&lt;p&gt;We then use the &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; values of the training samples to identify the &lt;strong&gt;&lt;em&gt;count&lt;/em&gt;&lt;/strong&gt; (nrow) of footprints the test data lives in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rce_classify&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Data.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Data.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
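&lt;p&gt;The footprint count that &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; returns boils down to the loop below, sketched in Python with hypothetical inputs (one training point and one lambda per row):&lt;/p&gt;

```python
from math import dist  # Euclidean distance, Python 3.8+

def count_footprints(observation, train_points, lambdas):
    """Count the training footprints (radius-lambda spheres) that
    contain the given observation."""
    hits = 0
    for point, lam in zip(train_points, lambdas):
        if lam > dist(observation, point):  # strictly inside the footprint
            hits = hits + 1
    return hits
```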

&lt;p&gt;Without getting too complicated, we pass the Test data to a function that uses &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; to detect the number of &lt;strong&gt;&lt;em&gt;hits&lt;/em&gt;&lt;/strong&gt; against each class of Training data.  First it finds the &lt;strong&gt;&lt;em&gt;hits&lt;/em&gt;&lt;/strong&gt; against &lt;strong&gt;&lt;em&gt;Class.2.Training.Tib&lt;/em&gt;&lt;/strong&gt;, and then it finds the hits against &lt;strong&gt;&lt;em&gt;Class.1.Training.Tib&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The new function &lt;strong&gt;&lt;em&gt;rce_classify_tib&lt;/em&gt;&lt;/strong&gt; then uses the number of hits for each class to classify the data.  In this example, we use a voting approach, although you can tailor the algorithm to classify a test point as ambiguous if it hits either &lt;strong&gt;&lt;em&gt;zero&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;more than one&lt;/em&gt;&lt;/strong&gt; class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rce_classify_tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.One.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.Two.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.Two.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.One.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;yes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                           &lt;/span&gt;&lt;span class="n"&gt;yes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                           &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
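&lt;p&gt;The nested &lt;strong&gt;&lt;em&gt;ifelse&lt;/em&gt;&lt;/strong&gt; at the end of &lt;strong&gt;&lt;em&gt;rce_classify_tib&lt;/em&gt;&lt;/strong&gt; implements the voting rule.  The same decision, sketched in Python for clarity:&lt;/p&gt;

```python
def rce_vote(class_1_hits, class_2_hits):
    """Voting rule: more Class 1 hits gives label 1, more Class 2 hits
    gives label 2, and a tie (including zero hits on both sides) gives
    label 3, the ambiguous class."""
    if class_1_hits > class_2_hits:
        return 1
    if class_2_hits > class_1_hits:
        return 2
    return 3
```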

&lt;p&gt;We then apply these functions to our data.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;cl1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_class1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;cl2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_class2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;In Tidyverse:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Test.Patterns %&amp;lt;&amp;gt;% 
  rce_classify_tib(Class.1.Train.Tib, Class.2.Train.Tib, features)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Graphing RCE NN&lt;/h2&gt;
&lt;p&gt;We can graph the RCE NN in action by creating a uniform data grid and running &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; against every point.&lt;/p&gt;
&lt;p&gt;First, create the data grid.  We find the highest-valued observation in the data set so that the graph includes every observed point.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;bind_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;

&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;expand.grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we classify every point in the test grid, which blankets the entire canvas.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  This will take a long time.  If you don't want to wait, you can execute &lt;code&gt;test_grid = readxl::read_xlsx("NinetyK.xlsx")&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can then use &lt;strong&gt;&lt;em&gt;Plotly&lt;/em&gt;&lt;/strong&gt; to generate a three-dimensional (3D) image that you can rotate.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;plotly&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;yes&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;one&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;two&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;plot_ly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;gammagt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="RCE 3D" src="https://john.soban.ski/images/Refactor_Matlab_To_Tidyverse/01_3d.png"&gt;&lt;/p&gt;
&lt;p&gt;If you would like to see a 2D graph, then re-run the script using two features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alkphos&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgpt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a test grid using two dimensions and classify.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;expand.grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;

&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;as_tibble&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;tibble&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;rowid_to_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;rce_classify_tib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can just load the pre-processed data instead of waiting.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;readxl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_xlsx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;NinetyK.xlsx&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here I plot with the Grammar of Graphics (&lt;strong&gt;&lt;em&gt;ggplot2&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="RCE 2D" src="https://john.soban.ski/images/Refactor_Matlab_To_Tidyverse/02_90k.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This blog post described how to convert a Matlab script that uses for loops and nested functions into a functional, pipe-based Tidyverse script.&lt;/p&gt;
&lt;p&gt;If you enjoyed this, you may enjoy these other Machine Learning posts.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;A New Exemplar Machine Learning Algorithm (Part 1: Develop)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/rce-python-part-2.html"&gt;A New Exemplar Machine Learning Algorithm (Part 2: Optimize)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Applying a Reduced Coulomb Energy (RCE) Neural Network Classifier to the Bupa Liver Disorders Data Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical Introduction to Probabilistic Neural Networks - Normalization and Implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Octave"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="R"></category></entry><entry><title>Roark vs. Raskolnikov: Natural Language Processing Analysis</title><link href="https://john.soban.ski/roark-vs-raskolnikov.html" rel="alternate"></link><published>2023-07-29T04:54:00-04:00</published><updated>2023-07-29T04:54:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-07-29:/roark-vs-raskolnikov.html</id><summary type="html">&lt;p&gt;I augment traditional literary analysis with Natural Language Processing (NLP) tools to compare Fyodor Dostoevsky's Rodion Raskolnikov (Crime and Punishment) with Ayn Rand's Howard Roark (The Fountainhead).  Tools include both the Google Cloud Platform (GCP) Natural Language Application Programming Interface (API) and Tensorflow Transfer Learning.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov and Roark" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/01_Raskolnikov_Roark.png"&gt;&lt;/p&gt;
&lt;p&gt;I use the same approach …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I augment traditional literary analysis with Natural Language Processing (NLP) tools to compare Fyodor Dostoevsky's Rodion Raskolnikov (Crime and Punishment) with Ayn Rand's Howard Roark (The Fountainhead).  Tools include both the Google Cloud Platform (GCP) Natural Language Application Programming Interface (API) and Tensorflow Transfer Learning.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov and Roark" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/01_Raskolnikov_Roark.png"&gt;&lt;/p&gt;
&lt;p&gt;I use the same approach I followed in my &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Thoreau vs. Unabomber&lt;/a&gt; blog post.  The GCP NLP API measures character sentiment (positive or negative) and emotional intensity while the literary analysis frames the quantified personality metrics with relevant quotes.  &lt;/p&gt;
&lt;p&gt;My earlier post analyzed Thoreau and the Unabomber's manifestos.  These texts provide single-voice narration, which made data preparation simple.  Unlike &lt;strong&gt;The Unabomber Manifesto&lt;/strong&gt; and &lt;strong&gt;Walden&lt;/strong&gt;, however, &lt;strong&gt;Crime and Punishment&lt;/strong&gt; and &lt;strong&gt;The Fountainhead&lt;/strong&gt; include multiple speakers and a narrator.&lt;/p&gt;
&lt;p&gt;Valid analysis requires me to extract the speaking lines for &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Roark&lt;/strong&gt; from their respective works.  I used &lt;a href="https://john.soban.ski/raskolnikov-label.html"&gt;Tensorflow and Keras NLP&lt;/a&gt; to accomplish this task.   &lt;/p&gt;
&lt;h2&gt;Quantify Sentiment&lt;/h2&gt;
&lt;p&gt;I take the extracted dialog and internal monologues from &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Roark&lt;/strong&gt; and feed them into the Google API.  The API infers sentiment (score) and intensity (magnitude).&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values"&gt;GCP NLP API docs&lt;/a&gt; define &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Score&lt;ul&gt;
&lt;li&gt;Indicates the overall emotion of a document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Magnitude&lt;ul&gt;
&lt;li&gt;Indicates how much emotional content is present within the document, and this value is often proportional to the length of the document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
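&lt;p&gt;Score and magnitude work together: a near-zero score with high magnitude suggests mixed emotion, while a near-zero score with low magnitude suggests a truly neutral document.  A minimal Python sketch of this bucketing logic follows; the threshold values are my own illustrative choices, not official GCP constants.&lt;/p&gt;

```python
def interpret_sentiment(score, magnitude):
    """Bucket a GCP NLP (score, magnitude) pair into a rough category.

    The thresholds are illustrative assumptions, not official GCP values.
    """
    if score <= -0.25:
        return 'negative'
    if score >= 0.25:
        return 'positive'
    # Near-zero score: magnitude disambiguates mixed vs. truly neutral
    return 'mixed' if magnitude >= 2.0 else 'neutral'


# Roark's most negative line scores -0.8 with magnitude 0.8
print(interpret_sentiment(-0.8, 0.8))  # negative
```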
&lt;p&gt;Open my &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Thoreau vs. Unabomber&lt;/a&gt; post in a new tab to find my script that processes the texts, emits them to the API, and records the results.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: I uploaded the Sentiment analysis data for both &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/rask_sentiment.csv"&gt;Raskolnikov&lt;/a&gt; and &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/Roark/roark_sentiment.csv"&gt;Roark&lt;/a&gt; to Github.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Import Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Requests&lt;/strong&gt; imports the data straight from GitHub.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;roark_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://github.com/hatdropper1977/Raskolnikov/raw/main/Roark/roark_sentiment.csv&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;rask_url&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://github.com/hatdropper1977/Raskolnikov/raw/main/rask_sentiment.csv&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;roark_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;roark_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;rask_r&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rask_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="n"&gt;roark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;roark_r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;rask_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rask_r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Numeric Analysis&lt;/h3&gt;
&lt;p&gt;Pandas extracts Roark's most negative dialog.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

&lt;span class="n"&gt;score&lt;/span&gt;   &lt;span class="n"&gt;magnitude&lt;/span&gt;   &lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="mi"&gt;52&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;    &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;You&amp;#39;re wasting your time,&amp;quot;&amp;quot; said Roark.&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Roark Headshot" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/02_Roark_Headshot.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Roark's most negative dialog: "You're wasting your time"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A similar command extracts Raskolnikov's most negative dialog.  Note that Twenty Four (24) lines of dialog share the most negative score of Negative Zero Point Eight (-0.8).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;

&lt;span class="mi"&gt;24&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, Raskolnikov's most negative dialog includes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All this is very naive . . . excuse me, I should have said impudent on your part&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;One Dimensional Graphical Analysis&lt;/h3&gt;
&lt;p&gt;Pandas provides an easy method to generate Histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Raskolnikov's sentiment histogram leans negative.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov's sentiment histogram leans negative" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/03_Rask_Score.png"&gt;&lt;/p&gt;
&lt;p&gt;Roark's sentiment histogram spikes at neutral.  His negative lines of dialog taper off, with few beyond Negative Zero Point Four (-0.4).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Roark's sentiment histogram spikes at neutral" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/04_Roark_Score.png"&gt;&lt;/p&gt;
&lt;p&gt;I use Seaborn to overlay the two Histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I concatenate the two Data Frames into one Data Frame.  I add a &lt;strong&gt;Label&lt;/strong&gt; column, named &lt;strong&gt;Class&lt;/strong&gt;.  This label allows Seaborn to color the data by &lt;strong&gt;Speaker&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Roark&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Rask&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
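
&lt;p&gt;The same pattern works on toy data.  A minimal sketch (with hypothetical scores) shows the shape of the combined Data Frame that Seaborn receives:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sentiment scores for each speaker
roark_df = pd.DataFrame({'score': [0.0, -0.2, 0.1]})
rask_df = pd.DataFrame({'score': [-0.5, -0.3, 0.0]})

# Tag each frame so Seaborn can color the data by speaker
roark_df['Class'] = 'Roark'
rask_df['Class'] = 'Rask'

# Stack the two frames into one long-format frame
sentiment_df = pd.concat([rask_df, roark_df])
print(sentiment_df.shape)  # (6, 2)
```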

&lt;p&gt;I run &lt;strong&gt;histplot&lt;/strong&gt; on the new Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The overlaid Histograms illustrate that Raskolnikov (Blue) leans more negative than Roark (Orange).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov more negative than Roark Histogram" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/05_Compare_Scores.png"&gt;&lt;/p&gt;
&lt;h3&gt;Two Dimensional Graphical Analysis&lt;/h3&gt;
&lt;p&gt;The Google API returns two dimensions of data: &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;.  The &lt;strong&gt;magnitude&lt;/strong&gt; data captures the intensity of emotion.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html"&gt;SciPy&lt;/a&gt; Kernel Density Estimation (KDE) generates a bivariate density plot for each speaker.&lt;/p&gt;
&lt;p&gt;The colors represent &lt;strong&gt;density&lt;/strong&gt;.  Darker colors indicate more instances of a particular &lt;strong&gt;score/magnitude&lt;/strong&gt; pair.  The black dots represent the actual data points.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: I scale the &lt;strong&gt;Score&lt;/strong&gt; by &lt;strong&gt;ten&lt;/strong&gt; to improve chart readability.&lt;/p&gt;
&lt;/blockquote&gt;
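&lt;p&gt;A minimal sketch of the per-speaker density estimate, with synthetic (hypothetical) score/magnitude pairs standing in for the real dialog data:&lt;/p&gt;

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scores = rng.normal(-2.0, 1.5, 200)      # score scaled by ten, leaning negative
magnitudes = rng.normal(1.0, 0.3, 200)   # emotional intensity clustered near one

# Fit a bivariate KDE over the (score, magnitude) pairs
kde = gaussian_kde(np.vstack([scores, magnitudes]))

# Evaluate the estimated density at the center of the cloud
density = kde([[-2.0], [1.0]])[0]
print(density > 0.0)
```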
&lt;p&gt;We know from the histogram that Raskolnikov leans negative.  He reins in the emotional intensity, however, with most of his dialog clocking in at an intensity of one (1).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov KDE Plot" src="https://john.soban.ski/images/Raskolnikov_Label/05_Rask_Kde.png"&gt;&lt;/p&gt;
&lt;p&gt;Compare Raskolnikov's nearly horizontal chart to Roark's chart.  Roark's chart angles up a bit in the negative zone.  This upward angle indicates that Roark increases emotional intensity in lockstep with negativity.  The more negative the dialog, the more intense the magnitude.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Roark KDE Plot" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/06_Roark_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;Seaborn once more allows us to overlay the two density plots.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This chart captures Raskolnikov's negative sentiment with neutral intensity and Roark's slight intensity upticks.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Overlay KDE Plot" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/07_Both_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Literary Analysis&lt;/h2&gt;
&lt;p&gt;An astute reader finds similarities between Rodion Raskolnikov, the protagonist of Fyodor Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;, and Howard Roark, from Ayn Rand's &lt;strong&gt;The Fountainhead&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;Rodion and Roark share misanthropic traits.  Roark's &lt;strong&gt;creator vs. second hander&lt;/strong&gt; hypothesis echoes Rodion's &lt;strong&gt;extraordinary man vs. raw materials&lt;/strong&gt; hypothesis.  Their ethics drive each to perform criminal acts of destruction.&lt;/p&gt;
&lt;p&gt;The two characters have separate reactions to their crimes.  The differences in their reactions set them apart.&lt;/p&gt;
&lt;h3&gt;Social Misanthropes&lt;/h3&gt;
&lt;p&gt;In college, Roark refuses to join a fraternity or engage in graduation festivities.  Rand writes he: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;never [has] any friend anywhere (Rand 253)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Peter Keating (an acquaintance of his) states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;nobody can like him (Rand 253).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Dostoyevsky writes that Raskolnikov, like Roark, enjoys: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;practically no friends [and] somehow fail[s] to take any part in [other students'] communal gatherings, their discussions and their amusements, and [holds] no share in any other aspects of their lives (Dostoyevsky 85).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov spends his time in college:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;study[ing] intensely, not sparing himself (Dostoyevsky 86)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Post-expulsion he takes a six-month hiatus from society to focus on personal philosophies.  Raskolnikov's peers label him a:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;haughtily arrogant [egoist] (Dostoyevsky 86).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Razumikhin, Raskolnikov's best friend, says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;He doesn't listen to what people say to him. He's never interested in what everyone else is interested in at any given moment (Dostoyevsky 265)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Pulcheria Alexandra, Rodion's mother, asks Dunya:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I mean, It couldn't be that he's an egotist, Dunechka? Eh? (Dostoyevsky 291)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Disgust Towards Parasites&lt;/h3&gt;
&lt;p&gt;Howard Roark despises collective thought. He says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The mind is an attribute of an individual. There is no such thing as a collective brain... No man can use his brain to think for another (Rand 737)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He further states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Man] can survive in only one of two ways- by the independent work of his own mind or as a parasite fed by the minds of others (Rand 738)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He considers these &lt;strong&gt;parasites&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;second handers [and] savages (Rand 742).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov increases the vitriol against &lt;strong&gt;second hand parasites&lt;/strong&gt;.  After Luzhin tries to impress Raskolnikov and Razumikhin with &lt;strong&gt;his&lt;/strong&gt; ideas on progress, Raskolnikov cuts him down and says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;He learned that all by rote! He's showing off! (Dostoyevsky 193)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Extraordinary Man vs. Egoist&lt;/h3&gt;
&lt;p&gt;Rodion subscribes to the theory of the &lt;strong&gt;extraordinary man&lt;/strong&gt; and Roark to that of the &lt;strong&gt;egoist&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Both theories separate human society into two classes. The two protagonists champion the improvement of society through great Men who stand apart from the masses.&lt;/p&gt;
&lt;p&gt;Rodion labels mediocre and uninspired members of the populace &lt;strong&gt;raw materials.&lt;/strong&gt; His &lt;strong&gt;raw materials&lt;/strong&gt; stand in for Roark's &lt;strong&gt;second handers&lt;/strong&gt;.   &lt;/p&gt;
&lt;p&gt;Neither the &lt;strong&gt;raw materials&lt;/strong&gt; nor the &lt;strong&gt;second handers&lt;/strong&gt; think for themselves; both recycle old ideas. They keep the world in an evolutionary stasis. Raskolnikov states that the &lt;strong&gt;Raw Materials&lt;/strong&gt; live only to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;bring into being more like itself, and another group of people who possess a gift or talent for saying something new, in their own milieu (Dostoyevsky 313)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov's &lt;strong&gt;Raw Materials&lt;/strong&gt; live to procreate, to increase the chance of spawning Raskolnikov's &lt;strong&gt;extraordinary man&lt;/strong&gt;, or Roark's &lt;strong&gt;egoist/creator&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Both agree that the lesser men ostracize (or kill) the great men.  Rand writes that &lt;strong&gt;Second Handers&lt;/strong&gt; consider &lt;strong&gt;egoists&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;transgressors that venture into forbidden territory (Rand 736)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov says that &lt;strong&gt;raw materials&lt;/strong&gt; see &lt;strong&gt;extraordinary men&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;as being persons of backward and degrading views (Dostoyevsky 315)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Difference in Execution&lt;/h3&gt;
&lt;p&gt;Roark's philosophy stresses the importance of the individual.  A &lt;strong&gt;creator&lt;/strong&gt; will not rely on others to survive.  Society benefits when a collection of individuals focus on their own needs and align their actions with those needs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No creator [should be] prompted by a desire to serve his brothers, for his brothers [will] reject the gift offered and that gift [will] destroy the slothful routine of their lives. His truth [should be] his only motive. His own truth, and his only motive to [should be to] achieve it in his own way (Rand 737)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov, however, suggests that &lt;strong&gt;extraordinary men&lt;/strong&gt; must use the &lt;strong&gt;raw materials&lt;/strong&gt; to their own ends:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If an extraordinary man "finds it necessary, for the sake of his idea, to step over a dead body, over a pool of blood, then he is able within his own conscience to do so. It's in this sense alone that [they have a] right to crime" (Dostoyevsky 313).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Roark stresses individual focus. Raskolnikov condones the (criminal) use of groups to reach a goal.&lt;/p&gt;
&lt;p&gt;Roark criticizes Raskolnikov's theory.  He considers Raskolnikov's conquerors &lt;strong&gt;second handers&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The most dreadful butchers [are] the most sincere. They believe in a perfect society through the guillotine and the firing squad. [Nobody questions] their right to murder since they [are] murdering for an 'altruistic' purpose. It [is] accepted that man must be sacrificed for other men... It goes on and will go on so long as men believe that an action is good if it is unselfish. That permits the 'altruist' to act and his victims to bear it. Now observe the results of a society built on individualism. This country was not based on selfless service, sacrifice, renunciation or any precept of altruism. It was based on a man's pursuit of happiness. Not anyone else's, a private personal, selfish motive (Rand 741-42)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Both Raskolnikov and Roark act according to their beliefs.  Raskolnikov's  &lt;strong&gt;extraordinary man&lt;/strong&gt; belief compels him to kill a pawnbroker and her sister.  Roark's  &lt;strong&gt;creator&lt;/strong&gt; belief leads him to destroy his &lt;strong&gt;desecrated&lt;/strong&gt; Cortlandt building.&lt;/p&gt;
&lt;h3&gt;Follow Through&lt;/h3&gt;
&lt;p&gt;Raskolnikov and Roark differ in their commitment to their criminal actions.  Roark holds to his convictions and experiences no guilt or compromise after he completes a crime that upholds his ethics.  He lives up to his &lt;strong&gt;creator&lt;/strong&gt; principles.&lt;/p&gt;
&lt;p&gt;Raskolnikov, however, cannot justify the crime to himself.  He admits failure on several occasions before he turns himself in.  He does not live up to his &lt;strong&gt;extraordinary man&lt;/strong&gt; principles.&lt;/p&gt;
&lt;p&gt;Howard Roark refuses to accept or consider any wrongdoing.  He insists that America must uphold its first principles and recognize the necessity and urgency of his (Roark's) actions.  A country that compromises equals a &lt;strong&gt;slave society&lt;/strong&gt; (Rand 743).&lt;/p&gt;
&lt;p&gt;Roark says he will serve his time:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;in memory and in gratitude for what my country has been. It will be my act of loyalty, my refusal to live and work in what has taken place (Rand 743).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Contrast Rodion to the stalwart Roark.  Raskolnikov experiences guilt, admits failure, and contemplates suicide.  He says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't want to go on like this (Dostoyevsky 200)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rodion cannot muster the conviction to self-annihilate:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I wanted to end it all there, but... I couldn't bring myself to do it (Dostoyevsky 593)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rodion admits defeat and turns himself in.  He does not recognize any &lt;strong&gt;extraordinary man&lt;/strong&gt; qualities in his character.  He labels himself a &lt;strong&gt;failure&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's because of my own baseness and mediocrity that I'm taking this step (Dostoyevsky 595). &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He questions why he, a &lt;strong&gt;raw material&lt;/strong&gt;, felt that he was qualified to act in the manner of an &lt;strong&gt;extraordinary man&lt;/strong&gt;.  He speculates: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the strength of his own desires that made him believe he was a person to whom more was allowed than others (Dostoyevsky 623)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Rodion and Roark see banality and mediocrity in the &lt;strong&gt;Common Man&lt;/strong&gt;.  They share apathy towards base men and disgust towards phony men.  Their disgust drives one to become a hermit for several months and makes the other an object of hatred and jealousy among his peers.  Both commit crimes in accordance with their beliefs.  In the aftermath, Rodion folds and Roark stands strong.&lt;/p&gt;
&lt;p&gt;Did Rand use the character Rodion to inspire Roark?  If so, Rand aligns with Roark's &lt;strong&gt;creator&lt;/strong&gt; principles.&lt;/p&gt;
&lt;p&gt;Roark says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We inherit the products of the thought of other men. We inherit the wheel. We make a cart. The cart becomes an automobile... But all through the process what we receive from others is the end product of their thinking (Rand 738)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Dostoevsky, Fyodor. &lt;em&gt;Crime and Punishment&lt;/em&gt;. Bantam Books, 1996.&lt;/li&gt;
&lt;li&gt;Rand, Ayn.  &lt;em&gt;The Fountainhead&lt;/em&gt;.  Plume, 1994.&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="NLP"></category><category term="Machine Learning"></category><category term="Literature"></category></entry><entry><title>Label Speakers in Text with Natural Language Processing</title><link href="https://john.soban.ski/raskolnikov-label.html" rel="alternate"></link><published>2023-06-24T05:56:00-04:00</published><updated>2023-06-24T05:56:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-06-24:/raskolnikov-label.html</id><summary type="html">&lt;p&gt;The domain of Natural Language Processing (NLP) provides novel tools for literary analysis.  Analysts use NLP tools to record sentiment, emotional intensity, and word frequencies.  Today I use NLP techniques to extract Raskolnikov's speaking (and thinking) quotes from Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;In a prior blog post, I compared …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The domain of Natural Language Processing (NLP) provides novel tools for literary analysis.  Analysts use NLP tools to record sentiment, emotional intensity, and word frequencies.  Today I use NLP techniques to extract Raskolnikov's speaking (and thinking) quotes from Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;In a prior blog post, I compared the speaker sentiment (positive or negative) and intensity of emotions between the &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Unabomber and Thoreau&lt;/a&gt;.  I used their respective manifestos.  Each wrote his manifesto in the first person and spoke (or thought) every word of the text.  Because neither manifesto includes other speakers, data preparation was simple: I fed each entire document to my &lt;a href="https://john.soban.ski/tag/nlp.html"&gt;NLP&lt;/a&gt; models.&lt;/p&gt;
&lt;p&gt;Now consider Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov Yells" src="https://john.soban.ski/images/Raskolnikov_Label/01_Rask_Yell.png"&gt;&lt;/p&gt;
&lt;p&gt;The text includes dozens of characters, each with their own speaking and thinking lines.  Since I only want to analyze Raskolnikov, I must extract his text from the book.  I filter other characters, narration, and page numbers from the analysis.&lt;/p&gt;
&lt;p&gt;I see three ways to extract Raskolnikov's thinking and speaking parts.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dive into the book and &lt;strong&gt;cut and paste&lt;/strong&gt; his lines &lt;strong&gt;by hand&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Write a series of &lt;strong&gt;if/then&lt;/strong&gt; heuristics with rules and logic to extract his lines (e.g. if you see the phrase &lt;strong&gt;said Raskolnikov&lt;/strong&gt;, pull the line)&lt;/li&gt;
&lt;li&gt;Train a Machine Learning (ML) model to extract the lines for me&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I decide to go with #3 and train a model to do the work for me.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Keras Logo" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/00_Tf_Keras_Logo.png"&gt;&lt;/p&gt;
&lt;p&gt;I use &lt;a href="https://john.soban.ski/tag/keras.html"&gt;Keras&lt;/a&gt; and &lt;a href="https://john.soban.ski/tag/tensorflow.html"&gt;TensorFlow&lt;/a&gt; to train my model.&lt;/p&gt;
&lt;h2&gt;Label the Training Data&lt;/h2&gt;
&lt;p&gt;I seed the model with training data.  I pull representative lines of text and label them &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Not Raskolnikov&lt;/strong&gt; with Microsoft EXCEL.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Excel Pic" src="https://john.soban.ski/images/Raskolnikov_Label/02_Excel_Pic.png"&gt;&lt;/p&gt;
&lt;p&gt;I use my judgment and experience and decide that roughly two hundred and fifty (250) data points should suffice for training.&lt;/p&gt;
&lt;p&gt;My EXCEL efforts yield two files:  &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/Raskolnikov.txt"&gt;Raskolnikov.txt&lt;/a&gt; and &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/Other.txt"&gt;Other.txt&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;I then use a simple &lt;strong&gt;BASH&lt;/strong&gt; script to put each line into a separate file.  The script then moves these files to their appropriate directory.  Keras imports the data and labels them based on the name of their parent directory.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;Raskolnikov.txt&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;line&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;./Raskolnikov/R_&lt;span class="nv"&gt;$i&lt;/span&gt;.txt&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;((&lt;/span&gt;i++&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;Other.txt&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;line&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;./Other/O_&lt;span class="nv"&gt;$i&lt;/span&gt;.txt&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;((&lt;/span&gt;i++&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields two directories.  I then use the Linux &lt;strong&gt;mv&lt;/strong&gt; command to move some of the &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Other&lt;/strong&gt; labeled files to a &lt;strong&gt;testing&lt;/strong&gt; directory.&lt;/p&gt;
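&lt;p&gt;A Python sketch of the same split-and-move step (hypothetical file names and an assumed eighty/twenty split):&lt;/p&gt;

```python
import os
import random
import shutil

random.seed(7)
os.makedirs('Raskolnikov', exist_ok=True)
os.makedirs('testing/Raskolnikov', exist_ok=True)

# Stand-in for the one-line-per-file output of the BASH script
for i in range(1, 11):
    with open(f'Raskolnikov/R_{i}.txt', 'w') as f:
        f.write(f'line {i}\n')

# Hold out roughly twenty percent of the labeled files for testing
files = sorted(os.listdir('Raskolnikov'))
holdout = random.sample(files, k=len(files) // 5)
for name in holdout:
    shutil.move(os.path.join('Raskolnikov', name),
                os.path.join('testing', 'Raskolnikov', name))

print(len(os.listdir('Raskolnikov')), len(os.listdir('testing/Raskolnikov')))
```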
&lt;p&gt;I create a parent directory named &lt;a href="https://github.com/hatdropper1977/Raskolnikov/tree/main/crime_and_punish"&gt;crime_and_punish&lt;/a&gt;, with two sub-directories, &lt;a href="https://github.com/hatdropper1977/Raskolnikov/tree/main/crime_and_punish/training"&gt;training&lt;/a&gt; and &lt;a href="https://github.com/hatdropper1977/Raskolnikov/tree/main/crime_and_punish/testing"&gt;testing&lt;/a&gt;, each of which contains files for &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Other&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;tree&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish
crime_and_punish
├──&lt;span class="w"&gt; &lt;/span&gt;testing
│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;Other&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_120.txt
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_137.txt
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;O_138.txt
│&lt;span class="w"&gt;         &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;Raskolnikov&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;38&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
│&lt;span class="w"&gt;             &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;R_120.txt
│&lt;span class="w"&gt;             &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;R_121.txt
│&lt;span class="w"&gt;             &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
│&lt;span class="w"&gt;             &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;R_132.txt
└──&lt;span class="w"&gt; &lt;/span&gt;training
&lt;span class="w"&gt;    &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;Other&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;119&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_100.txt
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_99.txt
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;O_9.txt
&lt;span class="w"&gt;    &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;Raskolnikov&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;144&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;R_100.txt
&lt;span class="w"&gt;        &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
&lt;span class="w"&gt;        &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;R_9.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I zip the topmost directory into a &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/crime_and_punish.zip"&gt;zip&lt;/a&gt; file for easy portability.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;zip&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish.zip&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Raskolnikov Headshot" src="https://john.soban.ski/images/Raskolnikov_Label/03_Rask_Head.png"&gt;&lt;/p&gt;
&lt;h2&gt;Train the model&lt;/h2&gt;
&lt;p&gt;I upload my &lt;strong&gt;ZIP&lt;/strong&gt; file into my &lt;a href="https://john.soban.ski/sagemaker-upgrade-pandas.html"&gt;Amazon Web Services Sagemaker Notebook&lt;/a&gt; through the graphical user interface and then run a code block to extract the labeled dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;!unzip&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I import the required Machine Learning libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nlp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;

&lt;span class="c1"&gt;# Use mixed precision for optimal performance&lt;/span&gt;
&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mixed_precision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_global_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mixed_float16&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keras provides helper functions to import training data into TensorFlow.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;cp_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_dataset_from_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;crime_and_punish/training&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cp_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_dataset_from_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;crime_and_punish/testing&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;train&lt;/strong&gt; Dataset includes samples for model training and validation.  The &lt;strong&gt;test&lt;/strong&gt; Dataset includes &lt;strong&gt;holdout&lt;/strong&gt; data to &lt;strong&gt;surprise&lt;/strong&gt; our model and simulate real-world interaction.&lt;/p&gt;
&lt;p&gt;We inspect the structure of the new Dataset object, which wraps each line of text and its integer class label in tensors.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cp_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unbatch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_single_element&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="mi"&gt;239&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="n"&gt;belonging&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="n"&gt;belonging&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;Get up, why are you asleep!&amp;quot; she called to him: &amp;quot;It&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s past nine, I have brought you some tea; will you have a cup? I should think you&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;re fairly starving?&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following commands instruct Keras to train a custom model with a BERT backbone.  This &lt;strong&gt;Transfer Learning&lt;/strong&gt; approach leverages a pre-trained NLP model to boost the performance of our custom model.&lt;/p&gt;
&lt;p&gt;We first load the &lt;strong&gt;bert_medium_en_uncased&lt;/strong&gt; model into our workspace.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras_nlp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BertClassifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_preset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;bert_medium_en_uncased&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then instruct Keras to fine-tune the model on the training data &lt;strong&gt;cp_train&lt;/strong&gt;, with &lt;strong&gt;cp_test&lt;/strong&gt; as validation data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cp_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cp_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;SageMaker outputs the training results:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[=&lt;/span&gt;&amp;gt;............................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;:57&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6997&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5000
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[===&lt;/span&gt;&amp;gt;..........................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;:03&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7119&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4375
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[=====&lt;/span&gt;&amp;gt;........................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;:17&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7036&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5208&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[=======&lt;/span&gt;&amp;gt;......................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;:30&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7020&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5469
&lt;span class="m"&gt;15&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[==============================]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;914s&lt;span class="w"&gt; &lt;/span&gt;56s/step&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6995&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5397&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;val_loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6157&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;val_sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7188
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our model provides an impressive accuracy of 71.88% on our &lt;strong&gt;holdout&lt;/strong&gt; data.&lt;/p&gt;
&lt;p&gt;I test-drive our model with the following quote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"It's in the houses of spiteful old widows that one finds such cleanliness," Raskolnikov thought again, and he stole a curious glance at the cotton curtain over the door leading into another tiny room, in which stood the old woman's bed and chest of drawers and into which he had never looked before. These two rooms made up the whole flat.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I need to escape the quotes when I call the Model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="s2"&gt;It&amp;#39;s in the houses of spiteful old widows that one finds such cleanliness,&lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="s2"&gt; Raskolnikov thought again, and he stole a curious glance at the cotton curtain over the door leading into another tiny room, in which stood the old woman&amp;#39;s bed and chest of drawers and into which he had never looked before. These two rooms made up the whole flat.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model outputs disappointing results, with no clear prediction of class &lt;strong&gt;Raskolnikov&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="m"&gt;1&lt;/span&gt;/1&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[==============================]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;7s&lt;span class="w"&gt; &lt;/span&gt;7s/step
array&lt;span class="o"&gt;([[&lt;/span&gt;-0.01614,&lt;span class="w"&gt; &lt;/span&gt;-0.0249&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float16&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
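&lt;p&gt;The classifier returns raw logits rather than probabilities.  A &lt;strong&gt;softmax&lt;/strong&gt; converts the pair of logits into class probabilities (a sketch using the logits above; the model itself is unchanged):&lt;/p&gt;

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to probabilities that sum to one."""
    exps = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exps / exps.sum()

# Logits from the test-drive prediction above
probs = softmax(np.array([-0.01614, -0.0249]))
print(probs)  # near [0.5, 0.5]: the model makes no clear call
```

&lt;p&gt;A near 50/50 split confirms the model has no confident prediction for this quote.&lt;/p&gt;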

&lt;h2&gt;Label Crime and Punishment&lt;/h2&gt;
&lt;p&gt;I feed the entire text of &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/cp.csv"&gt;Crime and Punishment&lt;/a&gt; into my model and have the model label each line.&lt;/p&gt;
&lt;p&gt;First, I load the text into a &lt;strong&gt;Pandas Dataframe&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cp.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;Crime&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="n"&gt;man&lt;/span&gt; &lt;span class="n"&gt;came&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;garret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;lodged&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Good God!&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;cried&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;can it be, can it be,...&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;N-no,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;Dounia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What a pig you are!&amp;#39;&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Dataframe&lt;/strong&gt; includes 4,425 rows.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4425&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I use a &lt;strong&gt;Lambda function&lt;/strong&gt; to send each row of the text to my model.  The model returns a prediction in the form of &lt;strong&gt;[[Likelihood of Other, Likelihood of Raskolnikov]]&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The prediction consumes three hours of wall-clock time.&lt;/p&gt;
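&lt;p&gt;Most of that time comes from per-row overhead: the &lt;strong&gt;apply&lt;/strong&gt; call invokes &lt;strong&gt;predict&lt;/strong&gt; once for each of the 4,425 rows.  Keras can batch a single call over the whole column instead.  The sketch below shows the pattern with a stand-in &lt;strong&gt;predict&lt;/strong&gt; function; the real call would be &lt;strong&gt;classifier.predict(df[&amp;#39;Crime&amp;#39;].tolist())&lt;/strong&gt;, and actual timings will vary:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Stand-in for classifier.predict: one [Other, Rask] logit pair per input text
def predict(texts):
    return np.zeros((len(texts), 2), dtype=np.float16)

df = pd.DataFrame({'Crime': ['line one', 'line two', 'line three']})

# Row-by-row: one predict() call per row (slow for a real model)
lb = df.apply(lambda X: predict([X['Crime']]), axis=1)

# Batched: a single predict() call over the whole column
logits = predict(df['Crime'].tolist())
print(logits.shape)  # (3, 2): one logit pair per row
```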
&lt;p&gt;In the end, I get a &lt;strong&gt;Series&lt;/strong&gt; with predictions for each line of text.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.00489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.01569&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;          &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.2327&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1193&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;           &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04248&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;        &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.06537&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1735&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1049&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.359&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I join the predictions &lt;strong&gt;Series&lt;/strong&gt; with the &lt;strong&gt;Crime and Punishment&lt;/strong&gt; &lt;strong&gt;Dataframe&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces a &lt;strong&gt;Dataframe&lt;/strong&gt; with the following structure.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                                                &lt;span class="n"&gt;Crime&lt;/span&gt;                   &lt;span class="n"&gt;Result&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;man&lt;/span&gt; &lt;span class="n"&gt;came&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;garret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;lodged&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.00489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.01569&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Good God!&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;cried&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;can it be, can it be,...   [[0.2327, 0.1193]]&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;N-no,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;Dounia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04248&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What a pig you are!&amp;#39;                             [[-0.06537, 0.1735]]&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1049&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.359&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Two &lt;strong&gt;Lambda functions&lt;/strong&gt; extract the two likelihoods from each &lt;strong&gt;Result&lt;/strong&gt; array and create two new &lt;strong&gt;Pandas&lt;/strong&gt; columns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
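&lt;p&gt;A vectorized alternative stacks the nested &lt;strong&gt;Result&lt;/strong&gt; arrays into a single matrix and assigns both columns at once (a sketch with toy values standing in for the real predictions):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the [[Other, Rask]] prediction arrays
cp = pd.DataFrame({
    'Crime': ['line one', 'line two'],
    'Result': [np.array([[-0.00489, -0.01569]]), np.array([[0.2327, 0.1193]])],
})

# Stack the (1, 2) arrays into one (n, 2) matrix, then split into columns
logits = np.vstack(cp['Result'].to_list())
cp['Other'], cp['Rask'] = logits[:, 0], logits[:, 1]
print(cp[['Other', 'Rask']])
```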

&lt;p&gt;I create a &lt;strong&gt;Diff&lt;/strong&gt; column to record the difference in likelihood between the two columns.  Large differences indicate greater certainty.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

                                                &lt;span class="n"&gt;Crime&lt;/span&gt;                   &lt;span class="n"&gt;Result&lt;/span&gt;        &lt;span class="n"&gt;Other&lt;/span&gt;      &lt;span class="n"&gt;Rask&lt;/span&gt;      &lt;span class="n"&gt;Diff&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;man&lt;/span&gt; &lt;span class="n"&gt;came&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;garret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;lodged&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.00489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.01569&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.004890&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.015686&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.010796&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Good God!&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;cried&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;can it be, can it be,...   [[0.2327, 0.1193]]         0.232666  0.119324 -0.113342&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;N-no,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;Dounia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04248&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;          &lt;span class="mf"&gt;0.109985&lt;/span&gt;  &lt;span class="mf"&gt;0.042480&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.067505&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What a pig you are!&amp;#39;                             [[-0.06537, 0.1735]]      -0.065369  0.173462  0.238770&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1049&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.359&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;        &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.104919&lt;/span&gt;  &lt;span class="mf"&gt;0.358887&lt;/span&gt;  &lt;span class="mf"&gt;0.463867&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can find the full labeled text of Crime and Punishment &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/labeled_cp.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A histogram illustrates the distribution of the likelihood differences:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Picture of histogram" src="https://john.soban.ski/images/Raskolnikov_Label/04_Diff_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;We need to set a &lt;strong&gt;threshold&lt;/strong&gt; of certainty.&lt;/p&gt;
&lt;p&gt;A threshold of &lt;strong&gt;0.1&lt;/strong&gt; yields 1,088 rows, a threshold of &lt;strong&gt;0.2&lt;/strong&gt; yields 670 rows, and both keep a few incorrect labels.&lt;/p&gt;
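&lt;p&gt;Counting the rows each threshold keeps takes one line of &lt;strong&gt;Pandas&lt;/strong&gt; (a sketch with toy &lt;strong&gt;Diff&lt;/strong&gt; values in place of the real column):&lt;/p&gt;

```python
import pandas as pd

# Toy Diff values; the real column holds 4,425 likelihood differences
cp = pd.DataFrame({'Diff': [-0.01, 0.05, 0.12, 0.24, 0.46]})

# A boolean mask sums to the number of rows above each threshold
print((cp['Diff'] > 0.1).sum())  # 3
print((cp['Diff'] > 0.2).sum())  # 2
```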
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;cp&lt;span class="o"&gt;[&lt;/span&gt;cp&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1&lt;span class="o"&gt;]&lt;/span&gt;.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;print&lt;span class="o"&gt;(&lt;/span&gt;X.Crime&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;Ah,&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="s1"&gt;&amp;#39;s eating, then he&amp;#39;&lt;/span&gt;s&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;ill,&lt;span class="s2"&gt;&amp;quot; said Razumihin. He took a chair and sat down at the table opposite Raskolnikov.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;..Â»&lt;span class="w"&gt; &lt;/span&gt;But&lt;span class="w"&gt; &lt;/span&gt;I&lt;span class="w"&gt; &lt;/span&gt;should&lt;span class="w"&gt; &lt;/span&gt;like&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;know&lt;span class="w"&gt; &lt;/span&gt;why&lt;span class="w"&gt; &lt;/span&gt;mother&lt;span class="w"&gt; &lt;/span&gt;has&lt;span class="w"&gt; &lt;/span&gt;written&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;me&lt;span class="w"&gt; &lt;/span&gt;about&lt;span class="w"&gt; &lt;/span&gt;â€˜our&lt;span class="w"&gt; &lt;/span&gt;most&lt;span class="w"&gt; &lt;/span&gt;rising&lt;span class="w"&gt; &lt;/span&gt;generation&lt;span class="s1"&gt;&amp;#39;? Simply as a descriptive touch, or with the idea of prepossessing me in favour of Mr. Luzhin? Oh, the cunning of them! I should like to know one thing more: how far they were open with one another that day and night and all this time since? Was&lt;/span&gt;
&lt;span class="s1"&gt;&amp;quot;[&amp;#39;&lt;/span&gt;ve&lt;span class="w"&gt; &lt;/span&gt;only&lt;span class="w"&gt; &lt;/span&gt;just&lt;span class="w"&gt; &lt;/span&gt;waked&lt;span class="w"&gt; &lt;/span&gt;up,&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;wanted&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;go&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;you,&lt;span class="w"&gt; &lt;/span&gt;but&lt;span class="w"&gt; &lt;/span&gt;was&lt;span class="w"&gt; &lt;/span&gt;delayed&lt;span class="w"&gt; &lt;/span&gt;owing&lt;span class="w"&gt; &lt;/span&gt;tomy&lt;span class="w"&gt; &lt;/span&gt;clothes&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;I&lt;span class="w"&gt; &lt;/span&gt;forgot&lt;span class="w"&gt; &lt;/span&gt;yesterday&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;ask
&lt;span class="s2"&gt;&amp;quot;A cup of tea, maybe.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;A notice from the office,&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="w"&gt; &lt;/span&gt;announced,&lt;span class="w"&gt; &lt;/span&gt;as&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="w"&gt; &lt;/span&gt;gave&lt;span class="w"&gt; &lt;/span&gt;him&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;paper.
&lt;span class="s2"&gt;&amp;quot;A painter?&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;A pickpocket I dare say.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;A priest,&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="w"&gt; &lt;/span&gt;articulated&lt;span class="w"&gt; &lt;/span&gt;huskily.
&lt;span class="s2"&gt;&amp;quot;A strange scene passed between us last time we met, Rodion Romanovitch. Our first interview, too, was a&lt;/span&gt;
&lt;span class="s2"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
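&lt;p&gt;To choose a threshold, I count the rows that survive each candidate value.  The following is a minimal sketch with an invented, four-row stand-in for the real &lt;strong&gt;cp&lt;/strong&gt; DataFrame (the column names match the post; the data is illustrative only):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-in for the real `cp` DataFrame, which holds one
# candidate quote per row plus a 'Diff' certainty column.
cp = pd.DataFrame({
    'Crime': ['quote a', 'quote b', 'quote c', 'quote d'],
    'Diff': [0.05, 0.15, 0.25, 0.35],
})

# Count how many rows survive each candidate threshold.
counts = {t: len(cp[cp['Diff'] > t]) for t in (0.1, 0.2, 0.3)}
print(counts)  # {0.1: 3, 0.2: 2, 0.3: 1}
```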

&lt;p&gt;I set the threshold to &lt;strong&gt;0.3&lt;/strong&gt; and save the filtered quotes to a &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/rask.txt"&gt;text file&lt;/a&gt;.  This threshold still yields a few incorrect labels.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Crime&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="s2"&gt;&amp;quot;I am the murderer. . . . I want to give evidence,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;Nikolay&lt;/span&gt; &lt;span class="n"&gt;pronounced&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I am thinking,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;seriously&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;pause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;beg&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;say&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot; said Raskolnikov. &amp;quot;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="nb"&gt;any&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;unpardonable&lt;/span&gt; &lt;span class="n"&gt;impertinence&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I can&amp;#39;t help it... . 1 will come in half an hour. Tell them.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I do,&amp;#39; repeated Raskolnikov, raising his eyes to Porfiry.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I don&amp;#39;t want it,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pushing&lt;/span&gt; &lt;span class="n"&gt;away&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;pen&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I fainted then because it was so close and the smell of paint,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I know.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I remember now,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;long&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suller&lt;/span&gt; &lt;span class="n"&gt;silence&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Razumihin&lt;/span&gt; &lt;span class="n"&gt;looked&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;him&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frowning&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;uneasy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I say nothing about him,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;added&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointing&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Razumihin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;though he has had nothing from me either but insult and trouble.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I suppose you didn&amp;#39;t expect it?&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;though&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;had&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fully&lt;/span&gt; &lt;span class="n"&gt;grasped&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;situation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;had&lt;/span&gt; &lt;span class="n"&gt;regained&lt;/span&gt; &lt;span class="n"&gt;his&lt;/span&gt; &lt;span class="n"&gt;courage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I was summoned... by a notice...&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt; &lt;span class="n"&gt;faltered&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I won&amp;#39;t drink anything,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;

&lt;span class="n"&gt;rask_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rask.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
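&lt;p&gt;Note that &lt;strong&gt;Series.to_csv&lt;/strong&gt; writes the column name (&lt;strong&gt;Crime&lt;/strong&gt;) as the first line of the file by default, so that word later gets scored as if it were a quote.  Passing &lt;strong&gt;header=False&lt;/strong&gt; suppresses it.  A minimal sketch with invented quotes and a hypothetical filename:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical two-quote Series named 'Crime', like the filtered column above.
rask = pd.Series(['I am the murderer', 'I know'], name='Crime')

# header=False drops the 'Crime' header line that to_csv emits by default.
rask.to_csv('rask_demo.txt', index=False, header=False)

with open('rask_demo.txt', encoding='utf-8') as f:
    first_line = f.readline().strip()
print(first_line)  # I am the murderer
```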

&lt;h2&gt;Sentiment and Emotional Analysis&lt;/h2&gt;
&lt;p&gt;I take the file of &lt;strong&gt;Raskolnikov&lt;/strong&gt; quotes and feed them into the Google Cloud Natural Language API.&lt;/p&gt;
&lt;p&gt;I import the required libraries and set &lt;strong&gt;FILENAME&lt;/strong&gt; to &lt;strong&gt;rask.txt&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;

&lt;span class="n"&gt;FILENAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rask.txt&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following code records the sentiment &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt; of each &lt;strong&gt;Raskolnikov&lt;/strong&gt; quote in a &lt;strong&gt;Pandas DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;See my post on &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Unabomber vs. Thoreau&lt;/a&gt; for details.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sentiment_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiates a client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LanguageServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PLAIN_TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;document&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document_sentiment&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;magnitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ERROR: &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;_sentiment.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces the following &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/rask_sentiment.csv"&gt;DataFrame&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;score&lt;/span&gt;    &lt;span class="n"&gt;magnitude&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;    &lt;span class="mf"&gt;0.6&lt;/span&gt;          &lt;span class="n"&gt;Crime&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;    &lt;span class="mf"&gt;0.3&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;!&amp;quot;&amp;quot;? he bent over her once&amp;quot;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="mf"&gt;0.3&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;&amp;quot;&amp;quot;Ah, he&amp;#39;s eating, then he&amp;#39;s not ill,&amp;quot;&amp;quot; sai...&lt;/span&gt;
&lt;span class="s2"&gt;4     0.0    1.3          &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;Â&lt;/span&gt;&lt;span class="err"&gt;»&lt;/span&gt; &lt;span class="n"&gt;But&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;should&lt;/span&gt; &lt;span class="n"&gt;like&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;know&lt;/span&gt; &lt;span class="n"&gt;why&lt;/span&gt; &lt;span class="n"&gt;mother&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
 &lt;span class="o"&gt;...&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;    &lt;span class="o"&gt;...&lt;/span&gt;                             &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;367&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;    &lt;span class="mf"&gt;1.2&lt;/span&gt;          &lt;span class="n"&gt;two&lt;/span&gt; &lt;span class="n"&gt;sharp&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;suspicious&lt;/span&gt; &lt;span class="n"&gt;eyes&lt;/span&gt; &lt;span class="n"&gt;stared&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;him&lt;/span&gt; &lt;span class="n"&gt;ou&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;368&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;    &lt;span class="mf"&gt;0.2&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;very much struck by your face this moraine. 4...&lt;/span&gt;
&lt;span class="mi"&gt;369&lt;/span&gt;   &lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="mf"&gt;0.2&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;Ves. g4V come,&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="mi"&gt;370&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="mf"&gt;0.3&lt;/span&gt;          &lt;span class="n"&gt;Well&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;shall&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;371&lt;/span&gt;   &lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;          &lt;span class="n"&gt;wiee&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
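&lt;p&gt;For reference, the API returns a &lt;strong&gt;score&lt;/strong&gt; between -1.0 and 1.0 (overall polarity) and a non-negative &lt;strong&gt;magnitude&lt;/strong&gt; (total emotional intensity), so a near-zero score with a high magnitude suggests mixed emotion rather than neutrality.  The sketch below buckets (score, magnitude) pairs this way; the cutoffs and bucket names are my own, not part of the API:&lt;/p&gt;

```python
def sentiment_bucket(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Classify a (score, magnitude) pair into a coarse label.

    The cutoffs are illustrative, not prescribed by the API.
    """
    if abs(score) >= score_cutoff:
        return 'positive' if score > 0 else 'negative'
    # Near-zero score: a high magnitude means strong emotions cancelled out.
    return 'mixed' if magnitude >= magnitude_cutoff else 'neutral'

# (score, magnitude) pairs drawn from the Raskolnikov sentiment table above.
print(sentiment_bucket(-0.6, 0.6))  # negative
print(sentiment_bucket(0.0, 1.3))   # mixed
print(sentiment_bucket(0.1, 0.2))   # neutral
```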

&lt;p&gt;I use &lt;strong&gt;scipy&lt;/strong&gt; to compute a Kernel Density Estimate (KDE) and &lt;strong&gt;matplotlib&lt;/strong&gt; to graph it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;

&lt;span class="n"&gt;m1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;#scaled to improve Data Viz&lt;/span&gt;
&lt;span class="n"&gt;m2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;xmin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;xmax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ymin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ymax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mgrid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;ymax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;positions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gaussian_kde&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rot90&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gist_earth_r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;extent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymax&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;k.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;markersize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;ymin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymax&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Score&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Magnitude&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;matplotlib&lt;/strong&gt; produces the KDE plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="KDE Plot" src="https://john.soban.ski/images/Raskolnikov_Label/05_Rask_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I demonstrated how to use NLP to label speakers in a text.  This allows literary analysts to apply NLP tools on a per-character rather than a per-novel basis.&lt;/p&gt;
&lt;p&gt;I first created a corpus of labeled training data.  I then used transfer learning to customize a medium-scale BERT model to fit the training data.  This produced a model that isolated Raskolnikov's quotes from &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I use this approach in next month's post, where &lt;a href="https://john.soban.ski/roark-vs-raskolnikov.html"&gt;I compare Fyodor Dostoevsky's Rodion Raskolnikov (Crime and Punishment) with Ayn Rand's Howard Roark (The Fountainhead)&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Coda&lt;/h2&gt;
&lt;p&gt;I ran into some issues with &lt;strong&gt;Keras and TensorFlow&lt;/strong&gt; on &lt;strong&gt;Amazon SageMaker&lt;/strong&gt;.  I record the &lt;strong&gt;issues and solutions&lt;/strong&gt; here.&lt;/p&gt;
&lt;p&gt;I received the &lt;strong&gt;ValueError: Unable to import backend : mxnet&lt;/strong&gt; failure when I attempted to import &lt;strong&gt;keras-nlp&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.10.9&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;packaged&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Feb&lt;/span&gt;  &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;GCC&lt;/span&gt; &lt;span class="mf"&gt;11.3.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;linux&lt;/span&gt;
&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;help&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;copyright&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;credits&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;license&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp&lt;/span&gt;
&lt;span class="n"&gt;Traceback&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;lt;stdin&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/layers/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.layers.modeling.cached_multi_head_attention&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CachedMultiHeadAttention&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/layers/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.layers.modeling.cached_multi_head_attention&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/layers/modeling/cached_multi_head_attention.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.api_export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras_nlp_export&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/api_export.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.backend&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/backend/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.backend&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/backend/config.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;keras_core&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;activations&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/activations/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src.activations&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deserialize&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;activations&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/activations/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src.activations.activations&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;elu&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/activations/activations.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/backend/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Unable to import backend : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unable&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mxnet&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://github.com/keras-team/keras-nlp/issues/1172"&gt;Keras-NLP team discovered&lt;/a&gt; that Amazon ships a hard-coded &lt;strong&gt;keras.json&lt;/strong&gt; config that sets the Keras backend to MXNet when it launches a new SageMaker notebook.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;!cat&lt;span class="w"&gt; &lt;/span&gt;~/.keras/keras.json

&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;floatx&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float32&amp;quot;&lt;/span&gt;,
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;epsilon&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;1e-07,
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;backend&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mxnet&amp;quot;&lt;/span&gt;,
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;image_data_format&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;channels_first&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These commands fix the issue:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;floatx&amp;quot;: &amp;quot;float32&amp;quot;,&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;epsilon&amp;quot;: 1e-07,&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;backend&amp;quot;: &amp;quot;tensorflow&amp;quot;,&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;image_data_format&amp;quot;: &amp;quot;channels_first&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;}&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;out.txt&lt;span class="w"&gt; &lt;/span&gt;~/.keras/keras.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
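&lt;p&gt;As an alternative to the shell redirection above, the same config can be rewritten with Python's &lt;strong&gt;json&lt;/strong&gt; module.  This is a sketch of the same fix, not the method from the Keras-NLP issue; it assumes the standard &lt;strong&gt;~/.keras/keras.json&lt;/strong&gt; location.&lt;/p&gt;

```python
import json
import os

# Path Keras reads its backend config from.
config_path = os.path.expanduser("~/.keras/keras.json")
os.makedirs(os.path.dirname(config_path), exist_ok=True)

# Same values as the echo commands above, with "tensorflow"
# replacing the hard-coded "mxnet" backend.
config = {
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_first",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```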

&lt;p&gt;I used the following library versions for this excursion.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.1
keras-core&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1.2
keras-nlp&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6.0

&lt;span class="nv"&gt;tensorflow&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.0
tensorflow-estimator&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.0
tensorflow-hub&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.14.0
tensorflow-io-gcs-filesystem&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.32.0
tensorflow-text&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I also faced an issue where SageMaker crashed when I labeled the entire text of &lt;strong&gt;Crime and Punishment&lt;/strong&gt; in one call.  To solve the problem, I split the text into chunks, labeled each chunk, and then reassembled the results.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pd.read_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cp.csv&amp;#39;&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;

&lt;span class="nv"&gt;df1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;:750&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;751&lt;/span&gt;:1500&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1501&lt;/span&gt;:2250&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2251&lt;/span&gt;:3000&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3001&lt;/span&gt;:3750&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3751&lt;/span&gt;:&lt;span class="o"&gt;]&lt;/span&gt;

&lt;span class="nv"&gt;lb1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df1.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb1.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb1.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df2.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb2.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb2.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df3.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb3.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb3.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df4.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb4.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb4.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df5.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb5.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb5.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df6.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb6.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb6.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;df1z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df1.merge&lt;span class="o"&gt;(&lt;/span&gt;lb1.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df2z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df2.merge&lt;span class="o"&gt;(&lt;/span&gt;lb2.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df3z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df3.merge&lt;span class="o"&gt;(&lt;/span&gt;lb3.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df4z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df4.merge&lt;span class="o"&gt;(&lt;/span&gt;lb4.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df5z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df5.merge&lt;span class="o"&gt;(&lt;/span&gt;lb5.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df6z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df6.merge&lt;span class="o"&gt;(&lt;/span&gt;lb6.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;cp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pd.concat&lt;span class="o"&gt;([&lt;/span&gt;df1z,&lt;span class="w"&gt; &lt;/span&gt;df2z,&lt;span class="w"&gt; &lt;/span&gt;df3z,&lt;span class="w"&gt; &lt;/span&gt;df4z,&lt;span class="w"&gt; &lt;/span&gt;df5z,&lt;span class="w"&gt; &lt;/span&gt;df6z&lt;span class="o"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
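&lt;p&gt;The copy-and-paste block above can be expressed more compactly with &lt;strong&gt;numpy.array_split&lt;/strong&gt; and a loop, which also avoids hand-written slice boundaries.  This is a sketch: the &lt;strong&gt;StubClassifier&lt;/strong&gt; and the three-row DataFrame are stand-ins for the trained BERT model and the real &lt;strong&gt;cp.csv&lt;/strong&gt; corpus.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Stand-in for the trained BERT classifier from the post; replace
# with the real model.  predict() takes a list of strings and
# returns one label per string.
class StubClassifier:
    def predict(self, texts):
        return [len(t) % 2 for t in texts]  # dummy per-sentence label

classifier = StubClassifier()

# Stand-in corpus; in the post this comes from pd.read_csv('cp.csv').
df = pd.DataFrame({"Crime": ["He stood up.", "I did it!", "Silence fell."]})

# Label the text in chunks so a single huge predict() call does not
# exhaust notebook memory, then reassemble in the original order.
labeled = []
for chunk in np.array_split(df, 2):
    chunk = chunk.copy()
    chunk["Result"] = chunk["Crime"].apply(lambda s: classifier.predict([s])[0])
    labeled.append(chunk)

cp = pd.concat(labeled)
```

&lt;p&gt;Because &lt;strong&gt;array_split&lt;/strong&gt; partitions the index without gaps, no rows fall between chunk boundaries.&lt;/p&gt;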

&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Dostoevsky, Fyodor. &lt;em&gt;Crime and Punishment&lt;/em&gt;. Bantam Books, 1996.&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="NLP"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="Literature"></category></entry><entry><title>Upgrade SageMaker JupyterLab Notebooks to the Latest Pandas</title><link href="https://john.soban.ski/sagemaker-upgrade-pandas.html" rel="alternate"></link><published>2022-10-29T10:29:00-04:00</published><updated>2022-10-29T10:29:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2022-10-29:/sagemaker-upgrade-pandas.html</id><summary type="html">&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon Web Services (AWS) SageMaker Notebook Instances&lt;/a&gt; provide fully managed Jupyter Notebooks, tailored for Data Science and Machine Learning (ML) use cases.&lt;/p&gt;
&lt;p&gt;These notebooks allow Data Scientists and ML Engineers to explore, operationalize and share data, algorithms and pipelines.&lt;/p&gt;
&lt;p&gt;Pandas contributes a critical piece to the Data Scientists' toolbox, via …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon Web Services (AWS) SageMaker Notebook Instances&lt;/a&gt; provide fully managed Jupyter Notebooks, tailored for Data Science and Machine Learning (ML) use cases.&lt;/p&gt;
&lt;p&gt;These notebooks allow Data Scientists and ML Engineers to explore, operationalize and share data, algorithms and pipelines.&lt;/p&gt;
&lt;p&gt;Pandas contributes a critical piece to the Data Scientists' toolbox, via the Data Frame construct.  Each new version of Pandas provides improvements, upgrades and new conveniences.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Python Pandas" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/00_Pandas_Python.png"&gt;&lt;/p&gt;
&lt;p&gt;I run into an issue with my &lt;strong&gt;AWS SageMaker Notebook&lt;/strong&gt;, however, when I try to upgrade Pandas.&lt;/p&gt;
&lt;p&gt;If I attempt to upgrade &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; above version &lt;strong&gt;1.1.5&lt;/strong&gt; on my &lt;strong&gt;AWS SageMaker&lt;/strong&gt; provided &lt;a href="https://jupyter.org/"&gt;JupyterLab notebook&lt;/a&gt;, I receive the error &lt;strong&gt;No Matching Distribution Found&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;1.3.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ERROR:&lt;span class="w"&gt; &lt;/span&gt;Could&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;find&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;version&lt;span class="w"&gt; &lt;/span&gt;that&lt;span class="w"&gt; &lt;/span&gt;satisfies&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;requirement&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.3.5&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;versions:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.3.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.8.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.8.1,&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.9.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.9.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.10.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.10.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.11.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.12.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.13.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.13.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.14.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.14.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.15.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.15.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.15.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.16.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.16.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.16.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.17.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.17.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.18.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.18.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.19.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.19.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.19.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.20.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.20.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="m"&gt;0&lt;/span&gt;.20.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.20.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.21.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.21.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.22.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.4,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.24.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.24.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.24.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.4,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.5,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.2,&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.4,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.5&lt;span class="o"&gt;)&lt;/span&gt;
ERROR:&lt;span class="w"&gt; &lt;/span&gt;No&lt;span class="w"&gt; &lt;/span&gt;matching&lt;span class="w"&gt; &lt;/span&gt;distribution&lt;span class="w"&gt; &lt;/span&gt;found&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.3.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The last line of the output summarizes the failure:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;ERROR: No matching distribution found for pandas==1.3.5&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;I created a Notebook instance from the AWS Console via &lt;strong&gt;AWS SageMaker -&amp;gt; Notebook instances -&amp;gt; Create Notebook instance&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I then selected the Kernel &lt;strong&gt;conda_Python3&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I use &lt;strong&gt;sys.executable&lt;/strong&gt; to show the Kernel's Python, Pip and Pandas versions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.6.13&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.3.1&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;PyPA&lt;/span&gt; &lt;span class="n"&gt;recommended&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;installing&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pypa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;distutils&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="nd"&gt;@python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MIT&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.1.5&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Powerful&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;structures&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pydata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BSD&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;autovizwidget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hdijupyterutils&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;odo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smclarify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sparkmagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statsmodels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I cannot upgrade &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.7.3&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.8.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2017.2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2021.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.15.4&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.18.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;six&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.7.3&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.15.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Root Cause Analysis&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Pandas&lt;/strong&gt; dropped support for &lt;strong&gt;Python 3.6&lt;/strong&gt; after Pandas version &lt;strong&gt;1.1.5&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Earlier versions of &lt;strong&gt;AWS SageMaker JupyterLab Notebooks&lt;/strong&gt; delivered &lt;strong&gt;Python 3.6&lt;/strong&gt; Kernels by default, via their &lt;strong&gt;Platform identifier&lt;/strong&gt; configuration item.&lt;/p&gt;
&lt;p&gt;Upon launch of my notebook, I selected &lt;strong&gt;Amazon Linux 1&lt;/strong&gt; for my &lt;strong&gt;Platform identifier&lt;/strong&gt;, which limited my Notebook to the &lt;strong&gt;Python 3.6 Kernel&lt;/strong&gt; and in turn capped the version of &lt;strong&gt;Pandas&lt;/strong&gt; at &lt;strong&gt;1.1.5&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I verified that my Notebook runs the &lt;strong&gt;Amazon Linux 1&lt;/strong&gt; Operating System via the &lt;strong&gt;Amazon SageMaker --&amp;gt; Notebook Instances --&amp;gt; Notebook instance settings&lt;/strong&gt; Console page.&lt;/p&gt;
&lt;p&gt;&lt;img alt="AWS Sagemaker Instance Amazon Linux 1" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/01_Old_Version.png"&gt;&lt;/p&gt;
&lt;h3&gt;Solution&lt;/h3&gt;
&lt;p&gt;Through trial and error, I identified the solution to my problem.&lt;/p&gt;
&lt;p&gt;To install the most recent version of &lt;strong&gt;Pandas&lt;/strong&gt; into a &lt;strong&gt;SageMaker JupyterLab Notebook&lt;/strong&gt;, I must first install a newer version of &lt;strong&gt;Python&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To get a newer Python in my JupyterLab environment, I must do the following upon launch:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Select the Amazon Linux 2 Operating System&lt;/li&gt;
&lt;li&gt;Select JupyterLab Version 3.0+&lt;/li&gt;
&lt;/ol&gt;
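&lt;p&gt;If you script notebook creation instead of clicking through the Console, both choices collapse into one API field.  A minimal sketch of the launch request for the boto3 SageMaker client (the instance name and role ARN below are hypothetical placeholders):&lt;/p&gt;

```python
# PlatformIdentifier selects both the OS and the JupyterLab version:
# notebook-al2-v2 pairs Amazon Linux 2 with JupyterLab 3, while the
# EOL'ed notebook-al1-v1 pins the kernel to Python 3.6.
request = {
    "NotebookInstanceName": "pandas-upgrade-demo",              # hypothetical
    "InstanceType": "ml.t3.medium",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    "PlatformIdentifier": "notebook-al2-v2",  # Amazon Linux 2, JupyterLab 3
}

# With AWS credentials configured, the call would read:
# import boto3
# boto3.client("sagemaker").create_notebook_instance(**request)
print(request["PlatformIdentifier"])
```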
&lt;h3&gt;Select the Amazon Linux 2 Operating System&lt;/h3&gt;
&lt;p&gt;AWS released &lt;a href="https://aws.amazon.com/amazon-linux-ami/"&gt;Amazon Linux&lt;/a&gt; in 2010 and then an improved &lt;a href="https://aws.amazon.com/about-aws/whats-new/2017/12/introducing-amazon-linux-2/"&gt;Amazon Linux 2&lt;/a&gt; in 2017.  &lt;/p&gt;
&lt;p&gt;AWS &lt;a href="https://aws.amazon.com/blogs/aws/update-on-amazon-linux-ami-end-of-life/"&gt;ended standard support (End of Life, or EOL)&lt;/a&gt; for the original Amazon Linux in late 2020.&lt;/p&gt;
&lt;p&gt;SageMaker notebooks, however, ran on the 2010 version of Amazon Linux until August 2021, when AWS provided the option to run SageMaker JupyterLab Notebooks on &lt;a href="https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-notebook-instance-now-supports-amazon-linux-2/"&gt;Amazon Linux 2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recommend that you create SageMaker JupyterLab Notebooks with &lt;strong&gt;Amazon Linux 2 based notebook instances&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;These &lt;strong&gt;Amazon Linux 2 based notebook instances&lt;/strong&gt; support the &lt;strong&gt;Python 3.8&lt;/strong&gt; kernel, unlike the older versions of &lt;strong&gt;Amazon Linux (2010) based notebook instances&lt;/strong&gt;, which cap at &lt;strong&gt;Python 3.6&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The AWS developer guides catalog all the differences between &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-al2.html"&gt;Amazon Linux 2 and Amazon Linux (2010) notebook instances&lt;/a&gt;.&lt;/p&gt;
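&lt;p&gt;You can also confirm the operating system from inside a running notebook by reading &lt;strong&gt;/etc/os-release&lt;/strong&gt;.  A short sketch of the parse (the sample string mimics what an Amazon Linux 2 instance reports; swap it for the real file on a live notebook):&lt;/p&gt;

```python
def parse_os_release(text):
    """Parse /etc/os-release KEY=VALUE lines into a dict."""
    return {
        key: value.strip('"')
        for key, value in (
            line.split("=", 1) for line in text.splitlines() if "=" in line
        )
    }

# On a live notebook: text = open("/etc/os-release").read()
sample = 'NAME="Amazon Linux"\nVERSION="2"\nPRETTY_NAME="Amazon Linux 2"\n'
print(parse_os_release(sample)["PRETTY_NAME"])  # Amazon Linux 2
```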
&lt;h4&gt;Execution&lt;/h4&gt;
&lt;p&gt;Upon launch of your SageMaker JupyterLab Notebook Instance, navigate to the &lt;strong&gt;Platform identifier&lt;/strong&gt; option.&lt;/p&gt;
&lt;p&gt;The dropdown box provides three choices for &lt;strong&gt;Platform identifier&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pick Amazon Linux 2" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/02_Pick_Two.png"&gt;&lt;/p&gt;
&lt;p&gt;If you select an &lt;strong&gt;Amazon Linux 1&lt;/strong&gt; based notebook instance, the Console alerts you to the End of Life (EOL) support.&lt;/p&gt;
&lt;p&gt;&lt;img alt="End of Life Amazon Linux 1" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/03_One_Eol.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Amazon Linux 2, JupyterLab 3&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Select JupyterLab Version 3.0+&lt;/h3&gt;
&lt;p&gt;Amazon SageMaker notebooks provide the JupyterLab service.  JupyterLab features a web-based Integrated Development Environment (IDE) for Python code, data and models.&lt;/p&gt;
&lt;p&gt;Upon launch of your Notebook, AWS allows you to &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-jl.html"&gt;choose either JupyterLab Version 1 or JupyterLab Version 3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recommend JupyterLab 3, which delivers a &lt;a href="https://search.brave.com/search?q=jupyter+lab+3.0+features"&gt;half dozen new features&lt;/a&gt;, which include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Graphical Debugger&lt;ul&gt;
&lt;li&gt;Desktop Integrated Development Environments (IDE) ship graphical debuggers.  JupyterLab 3.0 provides a visual debugger for your Notebook.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Outline View/ Table of Contents (TOC)&lt;ul&gt;
&lt;li&gt;Provides an Outline view for your Notebook.  Jump to different sections with the click of a mouse.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Wide Selection of Display Languages&lt;ul&gt;
&lt;li&gt;Install the &lt;a href="https://github.com/jupyterlab/language-packs/"&gt;language pack of your choice&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Improved Single-Document Mode&lt;ul&gt;
&lt;li&gt;Remove the clutter of all the extraneous tabs and widgets.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Easy Extension Install&lt;ul&gt;
&lt;li&gt;Install extensions without JupyterLab recompilation via Pip or Conda.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To enjoy the above features, select &lt;strong&gt;Amazon Linux 2, JupyterLab 3&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Amazon Linux 2, JupyterLab 3" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/04_Lab_Three.png"&gt;&lt;/p&gt;
&lt;h2&gt;Success&lt;/h2&gt;
&lt;p&gt;After I launch my new &lt;strong&gt;AWS SageMaker JupyterLab Notebook&lt;/strong&gt;, I select the &lt;strong&gt;conda_Python3&lt;/strong&gt; environment from the launcher.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select conda_Python3" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/05_Conda_Three.png"&gt;&lt;/p&gt;
&lt;p&gt;In my notebook I check for the Python version and the output reads &lt;strong&gt;3.8&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Good Sign!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.8.12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pip and Pandas read versions &lt;strong&gt;22.0.4&lt;/strong&gt; and &lt;strong&gt;1.3.4&lt;/strong&gt;, respectively.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.0.4&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;PyPA&lt;/span&gt; &lt;span class="n"&gt;recommended&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;installing&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pypa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;distutils&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="nd"&gt;@python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MIT&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.3.4&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Powerful&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;structures&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pydata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Pandas&lt;/span&gt; &lt;span class="n"&gt;Development&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="nd"&gt;@python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BSD&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Clause&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;autovizwidget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hdijupyterutils&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smclarify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sparkmagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statsmodels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
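&lt;p&gt;If you script these checks rather than eyeball the &lt;strong&gt;pip show&lt;/strong&gt; output, you can parse the &lt;strong&gt;Key: Value&lt;/strong&gt; lines yourself.  A minimal sketch; the &lt;strong&gt;parse_pip_show&lt;/strong&gt; helper is my own illustration, not part of pip:&lt;/p&gt;

```python
def parse_pip_show(output: str) -> dict:
    """Parse `pip show` style `Key: Value` lines into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the FIRST colon only, so URL values survive intact
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = "Name: pandas\nVersion: 1.3.4\nHome-page: https://pandas.pydata.org"
info = parse_pip_show(sample)
print(info["Version"])  # 1.3.4
```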

&lt;p&gt;I use the notebook to upgrade &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Looking&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pypi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;neuron&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amazonaws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.3.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Collecting&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
  &lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5.1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cp38&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cp38&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;manylinux_2_17_x86_64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manylinux2014_x86_64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;whl&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;12.2&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="err"&gt;━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;/span&gt; &lt;span class="mf"&gt;12.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;12.2&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt; &lt;span class="mf"&gt;48.4&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;eta&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.20.3&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.20.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.8.1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.8.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2020.1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2021.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;six&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.8.1&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.16.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Installing&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
  &lt;span class="n"&gt;Attempting&lt;/span&gt; &lt;span class="n"&gt;uninstall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
    &lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="n"&gt;installation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="mf"&gt;1.3.4&lt;/span&gt;
    &lt;span class="n"&gt;Uninstalling&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.3.4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;Successfully&lt;/span&gt; &lt;span class="n"&gt;uninstalled&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.3.4&lt;/span&gt;
&lt;span class="n"&gt;Successfully&lt;/span&gt; &lt;span class="n"&gt;installed&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The output reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Successfully installed pandas-1.5.1&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Success!!!&lt;/p&gt;
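&lt;p&gt;If you automate this upgrade, you can decide whether to run &lt;strong&gt;pip install --upgrade&lt;/strong&gt; at all by comparing version tuples.  A minimal sketch; the &lt;strong&gt;needs_upgrade&lt;/strong&gt; helper name is my own invention:&lt;/p&gt;

```python
def needs_upgrade(installed: str, target: str) -> bool:
    """Return True when the installed version sorts before the target."""
    # Convert "1.3.4" -> (1, 3, 4) so the comparison is numeric, not lexical
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) < as_tuple(target)


print(needs_upgrade("1.3.4", "1.5.1"))  # True
print(needs_upgrade("1.5.1", "1.5.1"))  # False
```

&lt;p&gt;Note that this naive split breaks on pre-release tags such as &lt;strong&gt;1.5.0rc0&lt;/strong&gt;; the &lt;strong&gt;packaging&lt;/strong&gt; library's &lt;strong&gt;Version&lt;/strong&gt; class handles those cases.&lt;/p&gt;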
&lt;h3&gt;Create a Lifecycle Config&lt;/h3&gt;
&lt;p&gt;A SageMaker Lifecycle Configuration lets you upgrade Pandas automatically at launch.&lt;/p&gt;
&lt;p&gt;When you log into your Notebook, it will already present the most recent version of Pandas.&lt;/p&gt;
&lt;p&gt;In Amazon SageMaker, click &lt;strong&gt;Lifecycle configurations --&amp;gt; Notebook Instance --&amp;gt; Create Configuration&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create Configuration" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/06_Create_Config.png"&gt;&lt;/p&gt;
&lt;p&gt;I name my lifecycle config &lt;strong&gt;sobanski-update-pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Paste in Bash Script" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/07_Lifecycle_Config.png"&gt;&lt;/p&gt;
&lt;p&gt;Paste the following script under &lt;strong&gt;Start notebook&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-e

&lt;span class="c1"&gt;# OVERVIEW&lt;/span&gt;
&lt;span class="c1"&gt;# This script installs a single pip package in a single SageMaker conda environments.&lt;/span&gt;

sudo&lt;span class="w"&gt; &lt;/span&gt;-u&lt;span class="w"&gt; &lt;/span&gt;ec2-user&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;&lt;/span&gt;
&lt;span class="s"&gt;# PARAMETERS&lt;/span&gt;
&lt;span class="s"&gt;PACKAGE=pandas&lt;/span&gt;
&lt;span class="s"&gt;ENVIRONMENT=python3&lt;/span&gt;
&lt;span class="s"&gt;source /home/ec2-user/anaconda3/bin/activate &amp;quot;$ENVIRONMENT&amp;quot;&lt;/span&gt;
&lt;span class="s"&gt;pip install --upgrade &amp;quot;$PACKAGE&amp;quot;&lt;/span&gt;
&lt;span class="s"&gt;source /home/ec2-user/anaconda3/bin/deactivate&lt;/span&gt;
&lt;span class="s"&gt;EOF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The script upgrades &lt;strong&gt;Pandas&lt;/strong&gt; in the &lt;strong&gt;conda_Python3&lt;/strong&gt; environment.&lt;/p&gt;
&lt;p&gt;Under &lt;strong&gt;Amazon SageMaker --&amp;gt; Notebook instances --&amp;gt; Notebook instance settings&lt;/strong&gt; select &lt;strong&gt;Edit&lt;/strong&gt; and set &lt;strong&gt;Lifecycle configuration&lt;/strong&gt; to the name of your file.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select your config" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/08_Select_Config.png"&gt;&lt;/p&gt;
&lt;p&gt;When you launch the notebook, AWS will run the upgrade script.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;AWS SageMaker Notebook Instances&lt;/strong&gt; host and manage &lt;strong&gt;JupyterLab Notebooks&lt;/strong&gt;.  In this blog post, we discussed how to configure your &lt;strong&gt;Notebook Instance&lt;/strong&gt; to maximize the available features in Pandas and JupyterLab.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Python Pandas" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/09_Pet_Python.png"&gt;&lt;/p&gt;
&lt;p&gt;Note: I created the Panda/Python artwork with Jasper AI Art; see the workflow &lt;a href="https://john.soban.ski/jasper-art.html"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="AWS"></category><category term="Python"></category><category term="HOWTO"></category><category term="Machine Learning"></category></entry><entry><title>Jasper Artificial Intelligence (AI) for Marketing Pictures</title><link href="https://john.soban.ski/jasper-art.html" rel="alternate"></link><published>2022-09-24T03:21:00-04:00</published><updated>2022-09-24T03:21:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2022-09-24:/jasper-art.html</id><summary type="html">&lt;p&gt;I use the new Jasper Artificial Intelligence (AI) Art service to create the pictures in this blog post.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.jasper.ai/art"&gt;Jasper AI Art&lt;/a&gt; (non-affiliate link), for example, creates the following picture of the &lt;strong&gt;World's Largest Turnip&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The World's Largest Turnip" src="https://john.soban.ski/images/Jasper_Art/00_Big_Turnip.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper generates a &lt;strong&gt;before and after&lt;/strong&gt; picture of a fat cat.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Before and after picture of a fat cat" src="https://john.soban.ski/images/Jasper_Art/01_Fat_Cat.png"&gt;&lt;/p&gt;
&lt;p&gt;The AI service also produces …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I use the new Jasper Artificial Intelligence (AI) Art service to create the pictures in this blog post.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.jasper.ai/art"&gt;Jasper AI Art&lt;/a&gt; (non-affiliate link), for example, creates the following picture of the &lt;strong&gt;World's Largest Turnip&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The World's Largest Turnip" src="https://john.soban.ski/images/Jasper_Art/00_Big_Turnip.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper generates a &lt;strong&gt;before and after&lt;/strong&gt; picture of a fat cat.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Before and after picture of a fat cat" src="https://john.soban.ski/images/Jasper_Art/01_Fat_Cat.png"&gt;&lt;/p&gt;
&lt;p&gt;The AI service also produces a picture of ET in a phone booth.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ET in a phone booth" src="https://john.soban.ski/images/Jasper_Art/02_Et_Phone.png"&gt;&lt;/p&gt;
&lt;p&gt;This month, I demonstrate the service and discuss its history, which begins with the founding of &lt;strong&gt;OpenAI&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;OpenAI&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://twitter.com/elonmusk"&gt;Elon Musk&lt;/a&gt; (Tesla), &lt;a href="https://en.wikipedia.org/wiki/Sam_Altman"&gt;Sam Altman&lt;/a&gt; (Y Combinator), &lt;a href="https://gregbrockman.com/"&gt;Greg Brockman&lt;/a&gt; (Stripe) and &lt;a href="https://www.cs.toronto.edu/~ilya/"&gt;Ilya Sutskever&lt;/a&gt; (Google) pledged over $1B to found &lt;a href="https://openai.com/"&gt;OpenAI&lt;/a&gt; in December, 2015.  &lt;/p&gt;
&lt;p&gt;OpenAI funds the development of &lt;strong&gt;friendly&lt;/strong&gt; Artificial General Intelligence (AGI) that drives the health, happiness and improvement of our human race.&lt;/p&gt;
&lt;p&gt;OpenAI develops interesting generative models, including one for text (&lt;a href="https://en.wikipedia.org/wiki/GPT-3"&gt;GPT-3&lt;/a&gt;) and one for images (&lt;a href="https://en.wikipedia.org/wiki/DALL-E"&gt;DALL-E 2&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The DALL-E 2 model takes descriptive &lt;a href="https://en.wikipedia.org/wiki/Natural_language"&gt;natural language&lt;/a&gt; text as input, renders an image of the description, and outputs that image to the screen.&lt;/p&gt;
&lt;p&gt;OpenAI invites the &lt;a href="https://labs.openai.com/auth/signup"&gt;public to use DALL-E 2&lt;/a&gt;, which sometimes yields &lt;a href="https://twitter.com/weirddalle"&gt;hilarious results&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can try the API for free via &lt;a href="https://huggingface.co/spaces/dalle-mini/dalle-mini"&gt;Huggingface&lt;/a&gt; or &lt;a href="https://www.craiyon.com/"&gt;Craiyon&lt;/a&gt;, but (as of July 2022) these services take minutes to produce images.  Jasper AI Art reduces the &lt;strong&gt;wall clock&lt;/strong&gt; time to seconds.&lt;/p&gt;
&lt;h2&gt;Jasper AI Art&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://techcrunch.com/2022/10/18/ai-content-platform-jasper-raises-125m-at-a-1-7b-valuation/"&gt;Dave Rogenmoser&lt;/a&gt;, J.P. Morgan and Chris Hull formed Jasper AI in early 2021. Since then, Jasper AI provides the GPT-3 model for &lt;a href="https://john.soban.ski/jasper-ai.html"&gt;copywriting and Search Engine Optimization (SEO)&lt;/a&gt; use cases.&lt;/p&gt;
&lt;p&gt;This month, Jasper opens its Jasper Art service to customers.&lt;/p&gt;
&lt;p&gt;Jasper Art provides an easy-to-use interface to DALL-E 2, and accelerates the image generation process to seconds.&lt;/p&gt;
&lt;p&gt;I use the Jasper Art service to create the following pictures.&lt;/p&gt;
&lt;p&gt;Fight Club Simpsons.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fight Club Simpsons." src="https://john.soban.ski/images/Jasper_Art/03_Fight_Club.png"&gt;&lt;/p&gt;
&lt;p&gt;A delicious grilled spider.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A delicious grilled spider." src="https://john.soban.ski/images/Jasper_Art/04_Grilled_Spider.png"&gt;&lt;/p&gt;
&lt;p&gt;A hand with fifteen fingers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A hand with fifteen fingers." src="https://john.soban.ski/images/Jasper_Art/05_Fifteen_Fingers.png"&gt;&lt;/p&gt;
&lt;p&gt;An Isometric Pixel Art rendition of the Democratic People's Republic of Korea (DPRK).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isometric Pixel Art rendition of the Democratic People's Republic of Korea (DPRK)" src="https://john.soban.ski/images/Jasper_Art/06_Iso_Dprk.png"&gt;&lt;/p&gt;
&lt;p&gt;Too much baby powder!!!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Too much baby powder!!!" src="https://john.soban.ski/images/Jasper_Art/07_Too_Much.png"&gt;&lt;/p&gt;
&lt;h2&gt;Sign Up&lt;/h2&gt;
&lt;p&gt;Jasper sent me an invitation to try their new &lt;strong&gt;Art&lt;/strong&gt; service.  They boast that &lt;strong&gt;artificial intelligence now can create any image or art you can imagine in seconds&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Email" src="https://john.soban.ski/images/Jasper_Art/09_Jasper_Email.png"&gt;&lt;/p&gt;
&lt;p&gt;Clicking through the links takes me to a payment page.  Unlimited access to the Art service costs $20/month.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sign Up" src="https://john.soban.ski/images/Jasper_Art/10_Sign_Up.png"&gt;&lt;/p&gt;
&lt;h2&gt;The User Interface&lt;/h2&gt;
&lt;p&gt;Jasper provides a User Interface (UI) to generate art.  You enter a text description, then use the drop-down selector boxes to set the &lt;strong&gt;Style, Medium, Artist&lt;/strong&gt; and &lt;strong&gt;Mood&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blank Jasper Art UI" src="https://john.soban.ski/images/Jasper_Art/11_Jasper_Ui.png"&gt;&lt;/p&gt;
&lt;p&gt;I decide to enter the following input:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text Description&lt;ul&gt;
&lt;li&gt;John Travolta from Saturday Night Fever in a Magnavox Odyssey 2 game&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Style&lt;ul&gt;
&lt;li&gt;Pixel art&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Medium&lt;ul&gt;
&lt;li&gt;None Specified&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Artist&lt;ul&gt;
&lt;li&gt;Andy Warhol&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mood&lt;ul&gt;
&lt;li&gt;Frightening&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Travolta Input" src="https://john.soban.ski/images/Jasper_Art/12_Travolta_Text.png"&gt;&lt;/p&gt;
&lt;p&gt;Within seconds, Jasper renders the following four images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="John Travolta from Saturday Night Fever in a Magnavox Odyssey 2 game" src="https://john.soban.ski/images/Jasper_Art/13_Magnovox_Travolta.png"&gt;&lt;/p&gt;
&lt;p&gt;I like this one the best, although it looks more like a Commodore 64 game.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Andy Warhol Travolta" src="https://john.soban.ski/images/Jasper_Art/14_Travolta_Warhol.png"&gt;&lt;/p&gt;
&lt;p&gt;I change the Style to &lt;strong&gt;3d Render&lt;/strong&gt; and Mood to &lt;strong&gt;Aggressive&lt;/strong&gt;.  I also add &lt;strong&gt;The Situation&lt;/strong&gt; from &lt;strong&gt;Jersey Shore&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text Description&lt;ul&gt;
&lt;li&gt;John Travolta from Saturday Night Fever hanging out with the Situation from Jersey Shore&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Style&lt;ul&gt;
&lt;li&gt;3d Render&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Medium&lt;ul&gt;
&lt;li&gt;None Specified&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Artist&lt;ul&gt;
&lt;li&gt;None Specified&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mood&lt;ul&gt;
&lt;li&gt;Aggressive&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Four pics of John Travolta from Saturday Night Fever hanging out with the Situation from Jersey Shore" src="https://john.soban.ski/images/Jasper_Art/15_Travolta_Render.png"&gt;&lt;/p&gt;
&lt;p&gt;The result looks straight out of a PlayStation 2 game.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Travolta and Sitch" src="https://john.soban.ski/images/Jasper_Art/16_Travolta_Situation.png"&gt;&lt;/p&gt;
&lt;p&gt;Changing the style back to &lt;strong&gt;Pixel Art&lt;/strong&gt; produces new art.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Travolta and Sitch Pixels" src="https://john.soban.ski/images/Jasper_Art/17_Travolta_Pixel.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper nailed the &lt;strong&gt;Low Cut Vee&lt;/strong&gt; shirt!&lt;/p&gt;
&lt;p&gt;&lt;img alt="PlayStation presents the Jersey Shore" src="https://john.soban.ski/images/Jasper_Art/18_The_Situation.png"&gt;&lt;/p&gt;
&lt;p&gt;I remember a joke about a Vulture with Carry On (Carrion) and input it into Jasper.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text Description&lt;ul&gt;
&lt;li&gt;Vulture with Carrion in Airplane cabin&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Style&lt;ul&gt;
&lt;li&gt;Cartoon&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Medium&lt;ul&gt;
&lt;li&gt;Spray Paint&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Artist&lt;ul&gt;
&lt;li&gt;Dr. Seuss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mood&lt;ul&gt;
&lt;li&gt;Happy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Dr. Suess Vulture Carrion Carry On" src="https://john.soban.ski/images/Jasper_Art/19_Vulture_Cartoon.png"&gt;&lt;/p&gt;
&lt;p&gt;I love the trippy Dr. Suess result, although I don't see a Vulture nor Carrion.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dr. Suess Trippy" src="https://john.soban.ski/images/Jasper_Art/20_Dr_Suess.png"&gt;&lt;/p&gt;
&lt;p&gt;I re-run with the same inputs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Attempt Two" src="https://john.soban.ski/images/Jasper_Art/21_Vulture_Spray.png"&gt;&lt;/p&gt;
&lt;p&gt;While Jasper missed the boat, I do like the look of this picture.  It reminds me of a Sci-Fi pulp magazine from the early 1930s.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Vulture, Painting" src="https://john.soban.ski/images/Jasper_Art/22_Vulture_Spraypaint.png"&gt;&lt;/p&gt;
&lt;p&gt;I run the input once more and add &lt;strong&gt;Art Deco&lt;/strong&gt; to the description.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Art Deco Vulture" src="https://john.soban.ski/images/Jasper_Art/23_Vulture_Deco.png"&gt;&lt;/p&gt;
&lt;p&gt;We get a vulture in the airplane cabin, but no Carrion.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sans Carry On" src="https://john.soban.ski/images/Jasper_Art/24_Vulture_Artdeco.png"&gt;&lt;/p&gt;
&lt;h2&gt;A Toddler gets Silly&lt;/h2&gt;
&lt;p&gt;At this point my toddler walks over and asks me to make &lt;strong&gt;planet earth made out of poop emoji&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="planet earth made out of poop emoji" src="https://john.soban.ski/images/Jasper_Art/25_Poop_Ui.png"&gt;&lt;/p&gt;
&lt;p&gt;He laughed at &lt;strong&gt;Poop World&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="planet earth made out of poop emoji - poop world" src="https://john.soban.ski/images/Jasper_Art/26_Poop_World.png"&gt;&lt;/p&gt;
&lt;p&gt;Next up - &lt;strong&gt;Poopy pants&lt;/strong&gt; (I did mention I have a toddler).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Jasper creates poopy pants" src="https://john.soban.ski/images/Jasper_Art/27_Poopy_Pants.png"&gt;&lt;/p&gt;
&lt;p&gt;This image looks like a Fark &lt;strong&gt;Photoshop Friday&lt;/strong&gt; from 2003.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Poopy Pants DALLE-2" src="https://john.soban.ski/images/Jasper_Art/28_Poop_Pants.png"&gt;&lt;/p&gt;
&lt;p&gt;He then requests more silly pictures. &lt;/p&gt;
&lt;p&gt;First up, Boxing a Cow.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boxing a Cow" src="https://john.soban.ski/images/Jasper_Art/29_Cow_Boxing.png"&gt;&lt;/p&gt;
&lt;p&gt;Skateboarding cow.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Skateboarding cow" src="https://john.soban.ski/images/Jasper_Art/30_Cow_Skateboard.png"&gt;&lt;/p&gt;
&lt;p&gt;Dog rides a Pony.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dog rides a Pony" src="https://john.soban.ski/images/Jasper_Art/32_Dog_Unicorn.png"&gt;&lt;/p&gt;
&lt;p&gt;Fart Propelled Bunny.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fart Propelled Bunny" src="https://john.soban.ski/images/Jasper_Art/33_Fart_Bunny.png"&gt;&lt;/p&gt;
&lt;p&gt;Pumpkin Skeleton.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pumpkin Skeleton." src="https://john.soban.ski/images/Jasper_Art/34_Pumpkin_Skeleton.png"&gt;&lt;/p&gt;
&lt;h2&gt;I get Silly&lt;/h2&gt;
&lt;p&gt;The silliness inspires me.  I think of some more ideas.&lt;/p&gt;
&lt;p&gt;Propaganda Wario.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Propaganda Wario" src="https://john.soban.ski/images/Jasper_Art/35_Propoganda_Wario.png"&gt;&lt;/p&gt;
&lt;p&gt;A Certified Guaranty Company (CGC) graded copy of a Johnny Cage comic book, where he wears Hammer pants.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Certified Guaranty Company (CGC) graded copy of a Johnny Cage comic book, where he wears Hammer pants" src="https://john.soban.ski/images/Jasper_Art/36_Cgc_Hammertime.png"&gt;&lt;/p&gt;
&lt;p&gt;Konami Nintendo Entertainment System (NES) box art for &lt;strong&gt;Atlas Shrugged&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Konami Nintendo Entertainment System (NES) box art for **Atlas Shrugged**" src="https://john.soban.ski/images/Jasper_Art/37_Konamiboxart_Atlasshrugged.png"&gt;&lt;/p&gt;
&lt;p&gt;Nu-Metal President's Presidential Rally.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nu-Metal President's Presidential Rally" src="https://john.soban.ski/images/Jasper_Art/38_Numetal_Presidentialrally.png"&gt;&lt;/p&gt;
&lt;p&gt;Mad Magazine presents Emo.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mad Magazine presents Emo" src="https://john.soban.ski/images/Jasper_Art/39_Madmagazine_Emo.png"&gt;&lt;/p&gt;
&lt;p&gt;NSYNC + Nirvana.  Nirvanasync, perhaps?&lt;/p&gt;
&lt;p&gt;&lt;img alt="NSYNC + Nirvana" src="https://john.soban.ski/images/Jasper_Art/40_Nsync_Nirvana.png"&gt;&lt;/p&gt;
&lt;p&gt;Selfie at the Smelter factory.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Selfie at the Smelter factory." src="https://john.soban.ski/images/Jasper_Art/41_Smelter_Selfie.png"&gt;&lt;/p&gt;
&lt;p&gt;Finally, I just type &lt;strong&gt;Ai Yiii Yiii Yiii&lt;/strong&gt; into the box.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ai Yiii Yiii Yiii" src="https://john.soban.ski/images/Jasper_Art/42_Ayiii_Yiiyiii.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I will use the Jasper Art Service to illustrate future blog posts.  I have writeups on Gerard Manley Hopkins and Yukio Mishima in the works that need interesting pictures.&lt;/p&gt;
&lt;p&gt;I recommend the service for any small (or big) business that needs to create compelling pictures beyond stock art.  The service provides an inexpensive and timely alternative to hiring graphic artists.&lt;/p&gt;
&lt;h2&gt;Paw Patrol&lt;/h2&gt;
&lt;p&gt;Before you go, my Toddler re-appears and demands the following Paw Patrol related pieces.  I leave them here for posterity.  He will get a kick out of them twenty years from now or so.&lt;/p&gt;
&lt;p&gt;Cat Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cat Paw Patrol." src="https://john.soban.ski/images/Jasper_Art/43_Cat_Pawpatrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Super Mario Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Super Mario Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/44_Mario_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Sonic The Hedgehog Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sonic The Hedgehog Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/45_Sonic_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;The Incredible Hulk Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Incredible Hulk Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/46_Hulk_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Titanic Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Titanic Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/47_Titanic_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Pikachu Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pikachu Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/48_Pikachu_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Power Rangers Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Power Rangers Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/49_Power_Rangers.png"&gt;&lt;/p&gt;
&lt;p&gt;More Power Rangers Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="More Power Rangers Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/50_Powerranger_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Transformers Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Transformers Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/51_Transformers_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Nick Jonas Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nick Jonas Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/52_Jonas_Patrol.png"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="Jasper"></category><category term="Machine Learning"></category></entry><entry><title>Jasper Artificial Intelligence (AI) Writes Your Blog Posts</title><link href="https://john.soban.ski/jasper-ai.html" rel="alternate"></link><published>2022-03-26T01:23:00-04:00</published><updated>2022-03-26T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2022-03-26:/jasper-ai.html</id><summary type="html">&lt;p&gt;Machine Learning (ML) Engineers at &lt;a href="https://www.jasper.ai/"&gt;Jasper Artificial Intelligence (AI)&lt;/a&gt; (non-affiliate link) trained a Natural Language Processing (NLP) model on 10% of the written content on the Internet to create a service that writes text on command.  A user feeds the model a sentence or two requesting a topic and the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Machine Learning (ML) Engineers at &lt;a href="https://www.jasper.ai/"&gt;Jasper Artificial Intelligence (AI)&lt;/a&gt; (non-affiliate link) trained a Natural Language Processing (NLP) model on 10% of the written content on the Internet to create a service that writes text on command.  A user feeds the model a sentence or two requesting a topic and the service spits out paragraphs.  The end user tunes the service, approving or disapproving each new paragraph and guides the writing through the addition of keywords.&lt;/p&gt;
&lt;p&gt;I had Jasper write the introduction paragraph to this blog post.  I fed it only the title (above) of this blog post and the service produced three candidate introduction paragraphs.  The &lt;strong&gt;best&lt;/strong&gt; one follows:  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Would you like to have a computer write your blog posts for you? Jasper AI can do that! Just input your topic and Jasper will take care of the writing for you. No more worrying about whether you can write an interesting or engaging post - let Jasper take care of that for you. So why not give it a try? You may be surprised at just how well Jasper can write!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jasper also integrates with the machine learning driven &lt;a href="https://surferseo.com/"&gt;Surfer AI&lt;/a&gt; (non-affiliate link) service, which provides guidance to the Jasper user on how to maximize Search Engine performance.  In this blog post, I will demonstrate how to quickly use Jasper and Surfer together to write an effective, &lt;a href="https://en.wikipedia.org/wiki/Search_engine_optimization"&gt;Search Engine Optimized (SEO)&lt;/a&gt; blog post.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Jasper uses a &lt;a href="https://en.wikipedia.org/wiki/Software_as_a_service"&gt;Software as a Service (SaaS)&lt;/a&gt; approach to get its AI into the hands of end users.  You just need to sign into the WebSite and put in your credit card information.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Splash Screen" src="https://john.soban.ski/images/Jasper_Ai/01_Splash_Screen.png"&gt;&lt;/p&gt;
&lt;p&gt;Similar to most &lt;a href="https://en.wikipedia.org/wiki/Web_2.0"&gt;Web 2.0&lt;/a&gt; applications, you need to register a username and password with their proprietary database, or connect to the centralized, authoritarian Google identity service.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Google Login" src="https://john.soban.ski/images/Jasper_Ai/02_Google_Login.png"&gt;&lt;/p&gt;
&lt;p&gt;Once you log in, you enter a Business Name and domain.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Step Two" src="https://john.soban.ski/images/Jasper_Ai/03_Step_Two.png"&gt;&lt;/p&gt;
&lt;p&gt;I misunderstood the &lt;strong&gt;domain&lt;/strong&gt; field.  The Wizard expects a web address, so I enter my address in the text field.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enter Domain" src="https://john.soban.ski/images/Jasper_Ai/04_Enter_Domain.png"&gt;&lt;/p&gt;
&lt;p&gt;Select the use case for Jasper.  Since I plan to write a blog, I click &lt;strong&gt;Blog posts&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Blog" src="https://john.soban.ski/images/Jasper_Ai/05_Select_Blog.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper then expects payment.  They offer &lt;strong&gt;starter&lt;/strong&gt; and &lt;strong&gt;basic&lt;/strong&gt; packages, for $29/month and $59/month respectively.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Choose Plan" src="https://john.soban.ski/images/Jasper_Ai/06_Choose_Plan.png"&gt;&lt;/p&gt;
&lt;p&gt;I decide to try the cheaper option for my Proof of Concept.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pay Money" src="https://john.soban.ski/images/Jasper_Ai/07_Pay_Money.png"&gt;&lt;/p&gt;
&lt;h2&gt;Use a Template&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;starter&lt;/strong&gt; plan allows use of the Jasper templates.  &lt;/p&gt;
&lt;p&gt;I select the &lt;strong&gt;Blog Post Intro Paragraph&lt;/strong&gt; template.  I want Jasper to write a blog post on &lt;a href="https://en.wikipedia.org/wiki/Decentralized_finance"&gt;Decentralized Finance&lt;/a&gt;, so I feed it the title &lt;strong&gt;Decentralized Finance Drives Freedom&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Intro Paragraph" src="https://john.soban.ski/images/Jasper_Ai/08_Intro_Paragraph.png"&gt;&lt;/p&gt;
&lt;p&gt;I click &lt;strong&gt;Generate&lt;/strong&gt; and Jasper produces the following text:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's no secret that the world is becoming increasingly digitized. In fact, DeFi may be one of the most important aspects of this digital revolution. DeFi allows for a new level of freedom and autonomy that has never been seen before. With DeFi, users are in control of their own finances, and can transact without the need for permission from centralized authorities. This is a huge step forward in terms of financial freedom and censorship resistance. As DeFi continues to grow in popularity, it will become an increasingly important part of our global economy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jasper produces a light, readable intro paragraph.  &lt;/p&gt;
&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/E-Prime"&gt;E-Prime&lt;/a&gt; zealot inside me gripes at the overuse of &lt;strong&gt;IS&lt;/strong&gt;, an indicator (to me at least) of lazy, amateur writing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;It’s&lt;/strong&gt; no secret&lt;/li&gt;
&lt;li&gt;may &lt;strong&gt;be&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;is&lt;/strong&gt; becoming&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;are&lt;/strong&gt; in control&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;is&lt;/strong&gt; a huge step&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also barf at the &lt;a href="https://dilbertblog.typepad.com/the_dilbert_blog/2007/06/the_day_you_bec.html"&gt;passive voice&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;freedom and autonomy that has never been seen before&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don’t blame Jasper too much, since I assume most of the Internet includes sloppy text, and Jasper trained the NLP model on 10% of the Internet.&lt;/p&gt;
&lt;h2&gt;Recipes&lt;/h2&gt;
&lt;p&gt;Jasper provides &lt;strong&gt;recipes&lt;/strong&gt; to help content creators write more than an intro (or concluding) paragraph.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Try Recipe" src="https://john.soban.ski/images/Jasper_Ai/09_Try_Recipie.png"&gt;&lt;/p&gt;
&lt;p&gt;When I click the &lt;strong&gt;Recipe&lt;/strong&gt; button, Jasper sends an alert that I must upgrade to the $59 &lt;strong&gt;Basic&lt;/strong&gt; plan.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Need Upgrade" src="https://john.soban.ski/images/Jasper_Ai/10_Need_Upgrade.png"&gt;&lt;/p&gt;
&lt;p&gt;I know that the &lt;strong&gt;Documents&lt;/strong&gt; service will help me write a blog post.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.jasper.ai/templates"&gt;Jasper website&lt;/a&gt; reads&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our 50+ templates are the building blocks of Jasper. They are trained to perform very specific use cases and are very simple to understand and use. &lt;/p&gt;
&lt;p&gt;Where templates are more constrained, documents allow open-ended content creation in a familiar Google Doc-like experience. Documents are available on the Boss Mode Plan.&lt;/p&gt;
&lt;p&gt;Documents should be used for writing longer content like blog posts, emails, social posts or books.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I click &lt;strong&gt;Documents&lt;/strong&gt; and once more read a Call To Action (CTA) that I &lt;strong&gt;must&lt;/strong&gt; Upgrade to the $59 plan to write a document.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Try Document" src="https://john.soban.ski/images/Jasper_Ai/11_Try_Document.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper provides me with the &lt;strong&gt;Boss Mode&lt;/strong&gt; plan for an additional $28.09, since I already paid $29 for the &lt;strong&gt;Starter&lt;/strong&gt; plan.  I take one for the team and pay the money to upgrade.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Need Boss" src="https://john.soban.ski/images/Jasper_Ai/12_Need_Boss.png"&gt;&lt;/p&gt;
&lt;p&gt;Since I already have my Credit Card out, I also pay $59 for &lt;a href="https://surferseo.com/"&gt;Surfer SEO&lt;/a&gt;, a tool I will discuss later in this blog post.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pay Boss" src="https://john.soban.ski/images/Jasper_Ai/13_Pay_Boss.png"&gt;&lt;/p&gt;
&lt;h2&gt;Create a Document&lt;/h2&gt;
&lt;p&gt;Now that I paid the big bucks, Jasper rolls out the red carpet for me.&lt;/p&gt;
&lt;p&gt;It provides a wizard to hold my hand through the blog writing process.&lt;/p&gt;
&lt;p&gt;I tell Jasper to &lt;strong&gt;write a thought leadership piece that discusses how decentralized finance (DeFi) increases freedom&lt;/strong&gt;.  I then enter the keywords &lt;strong&gt;DeFi, Blockchain and Censorship&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="New Blog" src="https://john.soban.ski/images/Jasper_Ai/14_New_Blog.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper provides a half dozen or so candidate titles.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How Decentralized Finance (DeFi) Increases Freedom&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  How DeFi Increases Freedom&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  Increasing Freedom for All&lt;/li&gt;
&lt;li&gt;Decentralized Finance (DeFi) Increases Freedom:  Why Blockchain-Based Finance Will Save the World&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  The Future of Financial Freedom&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  The Future of Freedom&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Automatic Titles" src="https://john.soban.ski/images/Jasper_Ai/15_Automatic_Titles.png"&gt;&lt;/p&gt;
&lt;p&gt;Once I select a title, Jasper spits out the intro paragraph.&lt;/p&gt;
&lt;h2&gt;Integrate Search Engine Optimization&lt;/h2&gt;
&lt;p&gt;The target audience for Jasper writes blog posts for traffic.  Search Engine Optimization (SEO) tools maximize the return on (writing) investment in terms of increased traffic.  Surfer SEO provides a SaaS and integrates with Jasper AI.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create Surfer" src="https://john.soban.ski/images/Jasper_Ai/16_Create_Surfer.png"&gt;&lt;/p&gt;
&lt;p&gt;I purchase the $59/month Surfer SEO subscription and then refresh my Jasper page.  This connects the Surfer SEO service to Jasper.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Click SEO" src="https://john.soban.ski/images/Jasper_Ai/17_Click_SEO.png"&gt;&lt;/p&gt;
&lt;p&gt;After I enter “DeFi” into the target search term box, Surfer lists a handful of related, popular search questions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What exactly is DeFi?&lt;/li&gt;
&lt;li&gt;What is a DeFi in crypto?&lt;/li&gt;
&lt;li&gt;Is DeFi a good investment?&lt;/li&gt;
&lt;li&gt;What is the difference between DeFi and crypto?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Good Questions" src="https://john.soban.ski/images/Jasper_Ai/18_Good_Questions.png"&gt;&lt;/p&gt;
&lt;p&gt;I paste these questions directly into the Jasper document box.  I precede them with double pound signs (##), which tell Jasper to start a new heading.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Type Questions" src="https://john.soban.ski/images/Jasper_Ai/19_Type_Questions.png"&gt;&lt;/p&gt;
&lt;p&gt;From there, I click &lt;strong&gt;Compose&lt;/strong&gt;, which generates four sentences per click.&lt;/p&gt;
&lt;p&gt;For example, when I place the cursor under &lt;strong&gt;What exactly is DeFi?&lt;/strong&gt;, type the seed words &lt;strong&gt;Decentralized Finance (DeFi)&lt;/strong&gt; and hit &lt;strong&gt;Compose&lt;/strong&gt;, Jasper produces the following text:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Decentralized Finance (DeFi)  is a financial system that runs on decentralized infrastructure, such as the Ethereum blockchain. Peer to Peer DeFi protocols are built on top of Ethereum and allow users to interact with each other and create financial products without the need for a central authority.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then have Jasper &lt;strong&gt;compose&lt;/strong&gt; paragraphs for the remaining paragraph headers.&lt;/p&gt;
&lt;h2&gt;Tune SEO Effectiveness&lt;/h2&gt;
&lt;p&gt;Surfer provides a &lt;strong&gt;word quota&lt;/strong&gt;, with recommended words and phrases that feed the hungry search engine algorithms.  The service colors each Phrase: Green indicates &lt;strong&gt;acceptable&lt;/strong&gt;, Yellow indicates &lt;strong&gt;needs work&lt;/strong&gt; and Red indicates &lt;strong&gt;absent&lt;/strong&gt;. &lt;/p&gt;
&lt;p&gt;&lt;img alt="Word Quota" src="https://john.soban.ski/images/Jasper_Ai/20_Word_Quota.png"&gt;&lt;/p&gt;
&lt;p&gt;I begin to grind through the word quota suggestions, and pepper the phrases into my Jasper document.  I replace an instance of &lt;strong&gt;DeFi&lt;/strong&gt;, for example, with &lt;strong&gt;DeFi Applications&lt;/strong&gt; and that phrase turns green.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Getting Better" src="https://john.soban.ski/images/Jasper_Ai/21_Getting_Better.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper provides a gauge.  In addition to the word quota, I also must increase the word count to 1.7k, the paragraphs to 16, the headings to at least 18 and Images to at least three.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Status Gague" src="https://john.soban.ski/images/Jasper_Ai/22_Status_Gague.png"&gt;&lt;/p&gt;
&lt;p&gt;Checking off the suggested phrases in the word quota screen gets my document to a score of 80/100.
&lt;img alt="Eighty Percent" src="https://john.soban.ski/images/Jasper_Ai/23_Eighty_Percent.png"&gt;&lt;/p&gt;
&lt;p&gt;You can see that I peppered in a lot of the phrases, turning them green.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lotsa Green" src="https://john.soban.ski/images/Jasper_Ai/24_Lotsa_Green.png"&gt;&lt;/p&gt;
&lt;h2&gt;Write Conclusion&lt;/h2&gt;
&lt;p&gt;Jasper encourages you to update the “Content” box on the left side of the pane, to nudge the service a bit.  I type &lt;strong&gt;Write a concluding paragraph&lt;/strong&gt; into the content box, enter seed words into the document and then hit &lt;strong&gt;Compose&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Write Conclusion" src="https://john.soban.ski/images/Jasper_Ai/25_Write_Conclusion.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper then writes a conclusion for me.&lt;/p&gt;
&lt;h2&gt;The Document&lt;/h2&gt;
&lt;p&gt;I now present the AI-written document:&lt;/p&gt;
&lt;h3&gt;Decentralized Finance: The Future of Freedom&lt;/h3&gt;
&lt;p&gt;It's no secret that the world is becoming increasingly digitized. In fact, Decentralized Finance (DeFi) may be one of the most important aspects of this digital revolution. DeFi allows for a new level of freedom, liberty and autonomy that has never been seen before. With Decentralized Finance, users are in control of their own finances, and can execute financial transactions without the need for permission from centralized authorities or a centralized exchange. This is a huge step forward in terms of financial freedom and censorship resistance. As Decentralized Finance continues to grow in popularity, it will become an increasingly important part of our global economy.&lt;/p&gt;
&lt;h4&gt;What exactly is DeFi?&lt;/h4&gt;
&lt;p&gt;Decentralized Finance (DeFi)  is a financial system that runs on decentralized infrastructure, such as the Ethereum blockchain. Peer to Peer DeFi protocols are built on top of Ethereum and allow users to interact with each other and create financial products without the need for a central authority. This means that DeFi is censorship-resistant and cannot be shut down by any single entity.  DeFi work includes the ability to use open source software to create a loan, set interest rates, earn interest or manage digital assets.&lt;/p&gt;
&lt;p&gt;Defi applications can be used for a variety of purposes, such as lending, borrowing, trading,  investing and flash loans. By using DeFi applications, users can have full control over the digital money in their bank account. DeFi also has the potential to provide a more level playing field for financial services, as it is accessible to anyone with an Internet connection.&lt;/p&gt;
&lt;p&gt;The DeFi ecosystem is still in its early stages, but it is growing rapidly. In 2019, the total value locked in DeFi applications was about $675 million. As of September 2020, that number has grown to over $13 billion. This rapid growth is indicative of the strong demand for decentralized applications and financial products.&lt;/p&gt;
&lt;h4&gt;What Contributed to the growth of Decentralized Finance?&lt;/h4&gt;
&lt;p&gt;There are a few factors that drive the growth of Decentralized Finance applications.&lt;/p&gt;
&lt;p&gt;First, the DeFi ecosystem is built on Ethereum, which is the most popular smart contract platform. This gives Decentralized Finance applications a large potential user base, as there are over 50 million Ethereum users.&lt;/p&gt;
&lt;p&gt;Second, DeFi applications offer a variety of features and benefits that appeal to users. For example, Decentralized Finance applications are often faster and more secure than traditional financial institutions. DeFi also offers users more control over their finances, as they are not reliant on centralized authorities.&lt;/p&gt;
&lt;p&gt;Third, DeFi is becoming increasingly popular due to its censorship-resistant nature. In a traditional financial system, central authorities can censor or shut down applications at will. However, DeFi applications are built on decentralized infrastructure, which makes them much more resistant to censorship.&lt;/p&gt;
&lt;h4&gt;What is a DeFi in crypto?&lt;/h4&gt;
&lt;p&gt;DeFi in crypto is decentralized and open source, which allows for more  freedom and transparency. DeFi in crypto is built on trustless smart contracts, which eliminates the need for central authorities. This makes DeFi in crypto censorship-resistant and immune to shutdowns.&lt;/p&gt;
&lt;p&gt;DeFi in crypto has the potential to provide a more level playing field for financial services, as it is accessible to anyone with an Internet connection. DeFi in crypto also has the potential to reduce costs and increase efficiency by eliminating the need for intermediaries.&lt;/p&gt;
&lt;h4&gt;Is DeFi a good investment?&lt;/h4&gt;
&lt;p&gt;In terms of investing, DeFi is still in its early stages and there are no guarantees. However, DeFi does offer a number of advantages over traditional financial systems. For example, DeFi applications often have lower fees and faster transaction speeds. DeFi also offers users more control over their finances, which can be beneficial in times of market volatility.&lt;/p&gt;
&lt;p&gt;Overall, DeFi is still a relatively new phenomenon and there are no guarantees as to its future success. However, DeFi does offer a number of advantages over traditional financial systems and has the potential to revolutionize the way we interact with money.&lt;/p&gt;
&lt;h4&gt;What is the difference between DeFi and crypto?&lt;/h4&gt;
&lt;p&gt;Crypto is a digital or virtual asset that uses cryptography to secure its transactions. DeFi is a financial system that runs on decentralized infrastructure, such as the Ethereum blockchain. DeFi applications are built on top of Ethereum and allow users to interact with each other without the need for a central authority.&lt;/p&gt;
&lt;p&gt;Crypto assets can be used for a variety of purposes, such as investing, trading, and paying for goods and services. DeFi applications can be used for a variety of purposes, such as lending, borrowing, trading, and investing.&lt;/p&gt;
&lt;h4&gt;Can financial institutions stop DeFi?&lt;/h4&gt;
&lt;p&gt;Central financial institutions can stop DeFi applications by shutting down the servers or blocking the IP addresses. However, DeFi is built on decentralized infrastructure, which makes it much more resistant to censorship. DeFi applications are also open source, so they can be copied and distributed freely. This makes it difficult for central authorities to shut them down without causing a significant amount of backlash.&lt;/p&gt;
&lt;h4&gt;What are liquidity pools?&lt;/h4&gt;
&lt;p&gt;Liquidity pools are a type of DeFi application that allow users to pool their funds together and make loans. This can be beneficial for both lenders and borrowers, as it allows them to get better rates and longer terms. Liquidity pools also offer increased liquidity, as they can provide access to a larger pool of funds.&lt;/p&gt;
&lt;h4&gt;What is word freedom?&lt;/h4&gt;
&lt;p&gt;Word freedom is the ability to use DeFi applications without the need for a central authority. DeFi applications are built on decentralized infrastructure, which makes them much more resistant to censorship. DeFi also offers users more control over their finances, as they are not reliant on centralized authorities.&lt;/p&gt;
&lt;h4&gt;What kind of Financial Transactions does DeFi enable?&lt;/h4&gt;
&lt;p&gt;DeFi enables a number of financial transactions, such as lending, borrowing, trading, and investing. &lt;/p&gt;
&lt;h4&gt;Conclusion&lt;/h4&gt;
&lt;p&gt;In conclusion , DeFi is a financial system that runs on decentralized infrastructure and offers users more freedom and control over their finances. DeFi has the potential to revolutionize the way we interact with money and could eventually replace traditional financial institutions.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Jasper provides an Artificial Intelligence service that produces content for the massive search engine algorithms.  With Jasper, you have a robot writing for robots.&lt;/p&gt;
&lt;p&gt;While (at present) I do not recommend Jasper for polished technical content or scientific papers, I do recommend it to anyone who works in an industry that needs traffic.  The Jasper + Surfer combination produces easy-to-read blog posts in a casual voice with minimal effort and minimal time.&lt;/p&gt;
&lt;p&gt;For fun, I may start an ancillary hobby blog on a Pop Culture topic: jot down titles and bullet points, then feed those notes into Jasper to get an insta-website.&lt;/p&gt;</content><category term="Data Science"></category><category term="Jasper"></category><category term="Machine Learning"></category></entry><entry><title>A New Exemplar Machine Learning Algorithm (Part 2: Optimize)</title><link href="https://john.soban.ski/rce-python-part-2.html" rel="alternate"></link><published>2021-12-23T12:23:00-05:00</published><updated>2021-12-23T12:23:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-12-23:/rce-python-part-2.html</id><summary type="html">&lt;p&gt;In &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;part one&lt;/a&gt; of this two-part series, I developed a &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Reduced Coulomb Energy (RCE) classifier&lt;/a&gt; in Python.  &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE&lt;/a&gt; calculates &lt;strong&gt;hit footprints&lt;/strong&gt; around training data and uses the footprints to classify test data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d RCE Map" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;RCE draws a circle around each labeled &lt;strong&gt;training&lt;/strong&gt; observation, with a radius (lambda) that stops at the closest …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;part one&lt;/a&gt; of this two-part series, I developed a &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Reduced Coulomb Energy (RCE) classifier&lt;/a&gt; in Python.  &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE&lt;/a&gt; calculates &lt;strong&gt;hit footprints&lt;/strong&gt; around training data and uses the footprints to classify test data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d RCE Map" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;RCE draws a circle around each labeled &lt;strong&gt;training&lt;/strong&gt; observation, with a radius (lambda) that stops at the closest labeled training point in the &lt;strong&gt;opposite&lt;/strong&gt; class. Each circle indicates the &lt;strong&gt;hit footprint&lt;/strong&gt; for that class.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE in action" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;In part one&lt;/a&gt; I ran RCE for one epoch on a two-feature training set to achieve an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this blog post, I will introduce and tune hyperparameters to improve model success and reduce ambiguity. I will investigate the number of principal components and tune &lt;strong&gt;r&lt;/strong&gt;. &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt; sets the maximum value for &lt;strong&gt;Lambda&lt;/strong&gt;, which caps the size of each circle that represents a given hit footprint.&lt;/p&gt;
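&lt;p&gt;As a toy illustration of that cap (hypothetical points and a standalone &lt;strong&gt;capped_lambda()&lt;/strong&gt; helper, not the &lt;strong&gt;find_lambda()&lt;/strong&gt; function from part one), each &lt;strong&gt;Lambda&lt;/strong&gt; equals the distance to the nearest opposite-class point, clipped at &lt;strong&gt;r&lt;/strong&gt;:&lt;/p&gt;

```python
import numpy as np

# Hypothetical two-feature training set with two classes (not the diabetes data)
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [5.0, 0.0]])
labels = np.array([0, 0, 1, 1])

def capped_lambda(idx, r):
    """Distance from points[idx] to its nearest opposite-class point, capped at r."""
    rivals = points[labels != labels[idx]]
    dists = np.linalg.norm(rivals - points[idx], axis=1)
    return float(min(dists.min(), r))

lambdas = [capped_lambda(i, r=2.0) for i in range(len(points))]
print(lambdas)  # [2.0, 1.5, 1.5, 2.0] -- the first and last circles hit the r cap
```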
&lt;p&gt;I will also see how RCE performs with a reduced training set.  In &lt;em&gt;Pattern Classification Using Neural Networks&lt;/em&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This classifier is similar to a k-nearest neighbor classifier in that it adapts rapidly over time, but it typically requires many fewer exemplar nodes than a nearest neighbor classifier. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Tune Number of Features&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;Last time&lt;/a&gt;, I left off with the following confusion matrix for the two principal component scenario.  In that scenario, I applied RCE to the diabetes dataset after I used Principal Component Analysis (PCA) to reduce the data set down to two features.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Confusion Matrix for two Principle Components" src="https://john.soban.ski/images/Rce_Python_Part_1/22_2d_Confuse.png"&gt;&lt;/p&gt;
&lt;p&gt;Our RCE algorithm trained a model with an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.42424242424242425&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2662337662337662&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Three Principal Components&lt;/h3&gt;
&lt;p&gt;I use the following code to reduce the diabetes dataset down to three principal components, which yields the Pandas dataframes &lt;strong&gt;train_df&lt;/strong&gt; and &lt;strong&gt;test_df&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
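&lt;p&gt;For intuition, the &lt;strong&gt;PCA&lt;/strong&gt; fit and transform above amount to centering the features and projecting them onto the top right-singular vectors of the data matrix.  A numpy-only sketch on synthetic data (an illustration, not scikit-learn's actual implementation):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # synthetic stand-in for 8 normalized features

def pca_scores(X, n_components):
    """Project X onto its top n_components principal axes."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # component scores

scores = pca_scores(X, 3)
print(scores.shape)  # (100, 3)
```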

&lt;p&gt;I then call my &lt;strong&gt;classify_data()&lt;/strong&gt; function to classify the data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I attach the labels to the classified data frame for the confusion matrix.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
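&lt;p&gt;&lt;strong&gt;pd.crosstab()&lt;/strong&gt; tallies the actual/predicted pairs into the confusion matrix.  A self-contained toy example (made-up labels, not the diabetes results):&lt;/p&gt;

```python
import pandas as pd

# Made-up actual vs. predicted labels
frame = pd.DataFrame({
    'actual':         [0, 0, 1, 1, 1, 0],
    'classification': [0, 1, 1, 1, 0, 0],
})

# Rows count actual labels, columns count predictions
cm = pd.crosstab(frame['actual'], frame['classification'],
                 rownames=['Actual'], colnames=['Predicted'])
print(cm)
```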

&lt;p&gt;Three features yield the following confusion matrix:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three Princomp" src="https://john.soban.ski/images/Rce_Python_Part_2/01_Three_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;My &lt;strong&gt;calc_success()&lt;/strong&gt; function returns &lt;strong&gt;f1_score&lt;/strong&gt; and ambiguity.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unambiguous_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ambiguity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;f1_score&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ambiguity&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ambiguity&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
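&lt;p&gt;On a toy results frame (made-up labels, with &lt;strong&gt;NaN&lt;/strong&gt; marking ambiguous points) the metric behaves as expected.  This sketch recomputes the binary F1 by hand, so it does not depend on scikit-learn:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy frame: NaN in 'classification' marks an ambiguous (unclassified) point
toy_df = pd.DataFrame({
    'actual':         [1, 0, 1, 0, 1, 0, 1, 0],
    'classification': [1, 0, 0, 0, np.nan, np.nan, 1, 1],
})

unambiguous = toy_df.dropna()
ambiguity = 1 - len(unambiguous) / len(toy_df)  # fraction left unclassified

# Binary F1 by hand: 2*TP / (2*TP + FP + FN)
tp = ((unambiguous.actual == 1) & (unambiguous.classification == 1)).sum()
fp = ((unambiguous.actual == 0) & (unambiguous.classification == 1)).sum()
fn = ((unambiguous.actual == 1) & (unambiguous.classification == 0)).sum()
f1 = float(2 * tp / (2 * tp + fp + fn))

print({'f1_score': round(f1, 3), 'ambiguity': ambiguity})  # {'f1_score': 0.667, 'ambiguity': 0.25}
```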

&lt;p&gt;Both &lt;strong&gt;F1&lt;/strong&gt; (bad) and &lt;strong&gt;ambiguity&lt;/strong&gt; (good) decrease with an extra principal component.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.41666666666666663,
&lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.2532467532467532&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Four Principal Components&lt;/h3&gt;
&lt;p&gt;The following code reduces the training dataset to four principal components, classifies the resulting data frame, and plots the confusion matrix.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Four Princomp" src="https://john.soban.ski/images/Rce_Python_Part_2/02_Four_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;The F1 score increases slightly and the ambiguity shoots up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.41935483870967744,
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.34415584415584416&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Five Principal Components&lt;/h3&gt;
&lt;p&gt;I use the following code to look at the five Principal Component scenario.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Five Princomp" src="https://john.soban.ski/images/Rce_Python_Part_2/03_Five_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;Five principal components decrease the F1 score and increase the ambiguity.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3928571428571428&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.36363636363636365&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Principal Component Results&lt;/h3&gt;
&lt;p&gt;The following table captures the results of the investigation.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;P&lt;/th&gt;
&lt;th&gt;f1&lt;/th&gt;
&lt;th&gt;Ambig.&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;.424&lt;/td&gt;
&lt;td&gt;.266&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;.417&lt;/td&gt;
&lt;td&gt;.253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;.419&lt;/td&gt;
&lt;td&gt;.344&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;.393&lt;/td&gt;
&lt;td&gt;.363&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
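&lt;p&gt;For later comparison, the same results can be held in a small DataFrame (values copied from the table above); &lt;strong&gt;idxmax&lt;/strong&gt; and &lt;strong&gt;idxmin&lt;/strong&gt; then pick out the strongest settings.  This is just bookkeeping, not part of the RCE code itself:&lt;/p&gt;

```python
import pandas as pd

# Results from the principal-component sweep above
results = pd.DataFrame({'P': [2, 3, 4, 5],
                        'f1': [0.424, 0.417, 0.419, 0.393],
                        'ambiguity': [0.266, 0.253, 0.344, 0.363]}).set_index('P')

# Two principal components give the best F1; three give the least ambiguity
print(results['f1'].idxmax())         # 2
print(results['ambiguity'].idxmin())  # 3
```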
&lt;h2&gt;Tune the Radius&lt;/h2&gt;
&lt;p&gt;The original &lt;strong&gt;find_lambda&lt;/strong&gt; function increases the radius of the hit footprint until the footprint collides with the nearest point of the opposite class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
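&lt;p&gt;As a sanity check, a toy example (hypothetical points, not from the notebook's dataset) confirms that &lt;strong&gt;find_lambda&lt;/strong&gt; returns the distance to the nearest opposite-class point.  I pass &lt;strong&gt;v&lt;/strong&gt; as a plain list here to sidestep pandas positional-indexing quirks:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def find_lambda(df, v):
    # Minimum distance from point v to any point of the opposite class
    return (np.linalg.norm(df.loc[df['outcome'] != v[-1]]
                             .iloc[:, :-1]
                             .sub(np.array(v[:-1])),
                           axis=1)
              .min())

# Hypothetical training set: two class-0 points and one class-1 point at (3, 4)
toy_df = pd.DataFrame({'x': [0.0, 1.0, 3.0],
                       'y': [0.0, 0.0, 4.0],
                       'outcome': [0, 0, 1]})

# The class-0 point at the origin sits exactly 5.0 away from (3, 4)
lam = find_lambda(toy_df, toy_df.iloc[0].tolist())
print(lam)  # 5.0
```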

&lt;p&gt;In part one, we see the footprints that result from unbounded radii.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Radius Unbounded" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;I can add the following conditional to scope the footprint to a set maximum radius, &lt;strong&gt;r&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lambda_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
                     &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
                     &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                           &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lambda_var&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;lambda_var&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
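&lt;p&gt;The conditional on the last line acts as a clip: any distance larger than &lt;strong&gt;r&lt;/strong&gt; collapses to &lt;strong&gt;r&lt;/strong&gt;.  A minimal sketch with hypothetical points demonstrates both branches:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def find_lambda(df, v, r):
    # Distance to the nearest opposite-class point, capped at radius r
    lambda_var = (np.linalg.norm(df.loc[df['outcome'] != v[-1]]
                                   .iloc[:, :-1]
                                   .sub(np.array(v[:-1])),
                                 axis=1)
                    .min())
    return r if lambda_var > r else lambda_var

# Hypothetical toy set: the nearest opposite-class point to the origin lies 5.0 away
toy_df = pd.DataFrame({'x': [0.0, 3.0],
                       'y': [0.0, 4.0],
                       'outcome': [0, 1]})

clipped = find_lambda(toy_df, [0.0, 0.0, 0], 0.1)   # capped at r
full = find_lambda(toy_df, [0.0, 0.0, 0], 10.0)     # true distance survives
print(clipped, full)  # 0.1 5.0
```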

&lt;p&gt;I add &lt;strong&gt;r&lt;/strong&gt; to the &lt;strong&gt;find_lambda&lt;/strong&gt; call.  (Note the vocabulary overlap: the following code uses a Python &lt;strong&gt;lambda&lt;/strong&gt; expression to call the unrelated function named &lt;strong&gt;find_lambda&lt;/strong&gt;.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A scoped radius of maximum &lt;strong&gt;0.1&lt;/strong&gt; creates the following footprints.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Radius Pointone" src="https://john.soban.ski/images/Rce_Python_Part_2/04_Radius_Pointone.png"&gt;&lt;/p&gt;
&lt;p&gt;The following code creates, labels, and plots a three-dimensional dataset, with &lt;strong&gt;r&lt;/strong&gt; set to 3.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate 3 Principal Components for training &lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Re-attach the labels for training&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ID Lambda for each datum&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a 3D grid for data viz&lt;/span&gt;
&lt;span class="n"&gt;class_3d_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; 
                        &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;

&lt;span class="c1"&gt;# Classify each point of the grid for data viz&lt;/span&gt;
&lt;span class="n"&gt;class_3d_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_3d_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_3d_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
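&lt;p&gt;As a sanity check on the grid construction, the nested comprehension yields a 50 &amp;#215; 50 &amp;#215; 50 lattice (125,000 points), and the division by 25 scales each axis to the interval [-1, 1).  The snippet below rebuilds the grid standalone, without the classifier:&lt;/p&gt;

```python
import pandas as pd

# Rebuild the 3D evaluation grid exactly as above
class_3d_df = pd.DataFrame([(x, y, z)
                            for x in range(-25, 25)
                            for y in range(-25, 25)
                            for z in range(-25, 25)],
                           columns=['princomp1', 'princomp2', 'princomp3']) / 25

print(class_3d_df.shape)               # (125000, 3)
print(class_3d_df['princomp1'].min())  # -1.0
print(class_3d_df['princomp1'].max())  # 0.96
```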

&lt;p&gt;This plot captures the hit footprints in 3D; each footprint now forms a sphere rather than a circle, as in the 2D case.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Threed Radiusthree" src="https://john.soban.ski/images/Rce_Python_Part_2/05_Threed_Radiusthree.png"&gt;&lt;/p&gt;
&lt;p&gt;Re-run the code above with the following edit to set &lt;strong&gt;r&lt;/strong&gt; to &lt;strong&gt;0.15&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With a smaller &lt;strong&gt;r&lt;/strong&gt; we get a better view of the spheres that show the hit footprints.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Threed Radiuspointonefive" src="https://john.soban.ski/images/Rce_Python_Part_2/06_Threed_Radiuspointonefive.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt; provides a hyperparameter to the &lt;strong&gt;RCE&lt;/strong&gt; algorithm.  Different values of &lt;strong&gt;r&lt;/strong&gt; will produce different results in terms of model effectiveness.&lt;/p&gt;
&lt;p&gt;I create a function named &lt;strong&gt;hyperparameter_tune&lt;/strong&gt; that applies RCE to a fresh train dataset, constrained by a given value for &lt;strong&gt;r&lt;/strong&gt;, and returns the &lt;strong&gt;f1&lt;/strong&gt; score and &lt;strong&gt;ambiguity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hyperparameter_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then sweep &lt;strong&gt;r&lt;/strong&gt; across ninety-nine values, evenly spaced between zero and one (exclusive), and record each score.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hyperparameter_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then plot the results, to identify the optimal &lt;strong&gt;r&lt;/strong&gt; value for the given train dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
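&lt;p&gt;Rather than reading the optimum off the plot by eye, the sweep results can be queried directly.  The scores below are hypothetical stand-ins for the output of &lt;strong&gt;hyperparameter_tune&lt;/strong&gt;, since the real values depend on the train/test split:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sweep output: each entry mimics what hyperparameter_tune
# returns, augmented with the r value that produced it
loss = [{'f1_score': 0.40, 'ambiguity': 0.30, 'r': 0.25},
        {'f1_score': 0.43, 'ambiguity': 0.27, 'r': 0.58},
        {'f1_score': 0.41, 'ambiguity': 0.29, 'r': 0.75}]

scores = pd.DataFrame(loss).set_index('r')
best_r = scores['f1_score'].idxmax()
print(best_r, scores.loc[best_r, 'ambiguity'])  # 0.58 0.27
```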

&lt;p&gt;&lt;img alt="Tune R" src="https://john.soban.ski/images/Rce_Python_Part_2/07_Tune_R.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;r = 0.58&lt;/strong&gt; yields the best results, with an f1_score of &lt;strong&gt;0.43&lt;/strong&gt; and ambiguity of &lt;strong&gt;0.27&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The following confusion matrix captures the results for &lt;strong&gt;r=0.58&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Radius Pointfive" src="https://john.soban.ski/images/Rce_Python_Part_2/08_Radius_Pointfive.png"&gt;&lt;/p&gt;
&lt;h2&gt;Tune the Algorithm&lt;/h2&gt;
&lt;p&gt;Our algorithm declares a region &lt;strong&gt;ambiguous&lt;/strong&gt; when it contains either (1) no footprint or (2) &lt;strong&gt;overlapping&lt;/strong&gt; footprints from both classes.  The Python code follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
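&lt;p&gt;The two &lt;strong&gt;apply&lt;/strong&gt; calls encode a simple truth table: class 0 when only class-0 footprints hit, class 1 when only class-1 footprints hit, and &lt;strong&gt;NaN&lt;/strong&gt; (ambiguous) when neither or both hit.  A sketch with hypothetical hit counts makes the rule explicit:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical hit counts for four query points:
# class-0 only, class-1 only, both (overlap), neither (no footprint)
class_df = pd.DataFrame({'class0_hits': [2, 0, 1, 0],
                         'class1_hits': [0, 3, 1, 0]})

# Same classification rule as classify_data above
class_df['classification'] = np.nan
class_df['classification'] = class_df.apply(
    lambda X: 0 if X.class0_hits > 0 and X.class1_hits == 0 else X.classification,
    axis=1)
class_df['classification'] = class_df.apply(
    lambda X: 1 if X.class1_hits > 0 and X.class0_hits == 0 else X.classification,
    axis=1)

print(class_df['classification'].tolist())  # [0.0, 1.0, nan, nan]
```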

&lt;p&gt;To decrease ambiguity, I add &lt;strong&gt;vote&lt;/strong&gt; logic to the code.  With this change, an overlapping region receives a &lt;strong&gt;winner&lt;/strong&gt; class when one class contributes more exemplar hits than the other.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reduce Ambiguity&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Voting RCE&lt;/strong&gt; algorithm produces a 2d footprint map with a high concentration of &lt;strong&gt;Class Zero&lt;/strong&gt; regions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Vote Algorithm" src="https://john.soban.ski/images/Rce_Python_Part_2/09_Vote_Algorithm.png"&gt;&lt;/p&gt;
&lt;p&gt;I tune &lt;strong&gt;r&lt;/strong&gt; for the new algorithm and plot the results using the same code above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hyperparameter_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Vote Loss" src="https://john.soban.ski/images/Rce_Python_Part_2/10_Vote_Loss.png"&gt;&lt;/p&gt;
&lt;p&gt;The tuning identifies an ideal &lt;strong&gt;r&lt;/strong&gt; of 0.40, which yields an &lt;strong&gt;f1_score&lt;/strong&gt; of 0.4 and an &lt;strong&gt;ambiguity&lt;/strong&gt; of 0.2.  The ambiguity drops from the 0.27 of the non-voting algorithm.&lt;/p&gt;
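&lt;p&gt;As a sketch of how the &lt;strong&gt;ambiguity&lt;/strong&gt; metric could be computed (an assumption on my part: the fraction of rows whose &lt;strong&gt;classification&lt;/strong&gt; remains NaN after the footprint checks), consider:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical classified frame: NaN marks rows no footprint could decide.
class_df = pd.DataFrame(
    {'classification': [0, 1, np.nan, 1, np.nan, 0, 1, 0, 1, 0]})

# Ambiguity = share of undecided rows; here 2 of 10 rows, or 0.2.
ambiguity = class_df['classification'].isna().mean()
print(ambiguity)
```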
&lt;h2&gt;Small Training Sets&lt;/h2&gt;
&lt;p&gt;In &lt;strong&gt;Pattern Classification Using Neural Networks&lt;/strong&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann writes that RCE handles small training sets with aplomb:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This classifier is similar to a k-nearest neighbor classifier in that it adapts rapidly over time, but it typically requires many fewer exemplar nodes than a nearest neighbor classifier. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to try the algorithm (keeping the ideal &lt;strong&gt;r&lt;/strong&gt;) on half the training data, which results in the following loss graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Half Data" src="https://john.soban.ski/images/Rce_Python_Part_2/11_Half_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Contrast this to the loss plot for the full training data set (from above):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tune R" src="https://john.soban.ski/images/Rce_Python_Part_2/07_Tune_R.png"&gt;&lt;/p&gt;
&lt;p&gt;Compared to the full dataset, the half dataset drives higher ambiguity, but produces a decent F1 score.&lt;/p&gt;
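&lt;p&gt;The halving code itself does not appear above; a minimal sketch with a hypothetical toy &lt;strong&gt;training_df&lt;/strong&gt; (not the post's actual DataFrame) follows:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the training DataFrame (hypothetical values).
training_df = pd.DataFrame({
    'Glucose': [85, 168, 90, 155, 100, 140, 120, 175],
    'BMI': [26.6, 33.6, 28.1, 35.3, 30.0, 32.0, 25.0, 36.0],
    'Outcome': [0, 1, 0, 1, 0, 1, 0, 1],
})

# Sample half, then half again for the one-quarter experiment;
# a fixed seed keeps the runs repeatable.
half_df = training_df.sample(frac=0.5, random_state=42)
quarter_df = half_df.sample(frac=0.5, random_state=42)
print(len(half_df), len(quarter_df))
```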
&lt;p&gt;If we halve the dataset once more (one quarter of the original data), we get the following loss plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Quarter Data" src="https://john.soban.ski/images/Rce_Python_Part_2/12_Quarter_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Since we have a dearth of data, we need an &lt;strong&gt;r&lt;/strong&gt; of at least &lt;strong&gt;0.4&lt;/strong&gt; to get any traction.  At that point, the algorithm produces decent ambiguity and &lt;strong&gt;F1&lt;/strong&gt; score, considering the lack of training data.&lt;/p&gt;
&lt;p&gt;The following plot shows the RCE hit footprints given one-quarter of the training data:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Quarter Datamap" src="https://john.soban.ski/images/Rce_Python_Part_2/13_Quarter_Datamap.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RCE provides an interesting alternative to the more popular &lt;strong&gt;K-Nearest Neighbor&lt;/strong&gt; exemplar classifier.  The RCE classifier learns quickly with limited training data.&lt;/p&gt;
&lt;p&gt;Comment below if you think TensorFlow or MXNet should include this classifier in their ML libraries!&lt;/p&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>A New Exemplar Machine Learning Algorithm (Part 1: Develop)</title><link href="https://john.soban.ski/rce-python-part-1.html" rel="alternate"></link><published>2021-08-21T02:34:00-04:00</published><updated>2021-08-21T02:34:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-08-21:/rce-python-part-1.html</id><summary type="html">&lt;p&gt;In &lt;strong&gt;Pattern Classification Using Neural Networks&lt;/strong&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann provides the following definition of &lt;strong&gt;Exemplar&lt;/strong&gt; neural net classifiers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Exemplar classifiers] perform classification based on the identity of the training examples, or exemplars, that are nearest to the input.  Exemplar nodes compute the weighted Euclidean distance …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;In &lt;strong&gt;Pattern Classification Using Neural Networks&lt;/strong&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann provides the following definition of &lt;strong&gt;Exemplar&lt;/strong&gt; neural net classifiers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Exemplar classifiers] perform classification based on the identity of the training examples, or exemplars, that are nearest to the input.  Exemplar nodes compute the weighted Euclidean distance between inputs and node centroids&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;strong&gt;nearest neighbor&lt;/strong&gt; classifier represents the most popular &lt;strong&gt;exemplar&lt;/strong&gt; neural net classifier in the domain of &lt;strong&gt;Machine Learning&lt;/strong&gt; (ML).  Nearly every &lt;strong&gt;ML&lt;/strong&gt; framework and platform provides a library for &lt;strong&gt;nearest neighbor&lt;/strong&gt; classification.&lt;/p&gt;
&lt;p&gt;In this blog post, I will develop Python code to implement a lesser-known &lt;strong&gt;exemplar&lt;/strong&gt; classifier, Reduced Coulomb Energy (RCE).&lt;/p&gt;
&lt;p&gt;The RCE algorithm assigns a class to &lt;strong&gt;test&lt;/strong&gt; data based on whether or not the data points live inside &lt;strong&gt;hit footprints&lt;/strong&gt; of &lt;strong&gt;training&lt;/strong&gt; data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d RCE Map" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;Open my post &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical Introduction to Probabilistic Neural Networks&lt;/a&gt; in a new tab for a deep dive into the math behind RCE.&lt;/p&gt;
&lt;p&gt;At a high level, RCE draws a circle around each labeled &lt;strong&gt;training&lt;/strong&gt; observation, with a radius (lambda) equal to the distance to the closest labeled training point of the &lt;strong&gt;opposite&lt;/strong&gt; class.  Each circle indicates the &lt;strong&gt;hit footprint&lt;/strong&gt; for that class.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE in action" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h2&gt;RCE vs. Nearest Neighbor (NN)&lt;/h2&gt;
&lt;p&gt;The following two-dimensional (2d) plot shows five data points, two of class &lt;strong&gt;X&lt;/strong&gt;, two of class &lt;span style="color:red"&gt;&lt;strong&gt;O&lt;/strong&gt;&lt;/span&gt; and one unknown observation, &lt;span style="color:green"&gt;&lt;strong&gt;?&lt;/strong&gt;&lt;/span&gt;, we wish to classify.&lt;/p&gt;
&lt;p&gt;&lt;img alt="How would you classify this green question mark" src="https://john.soban.ski/images/Rce_Python_Part_1/01_Classify_Green.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;NN&lt;/strong&gt; algorithm uses the classes of the nearest data points to classify an unknown observation.  Based on the plot above, &lt;strong&gt;NN&lt;/strong&gt; identifies that the green question mark belongs to class &lt;strong&gt;X&lt;/strong&gt;.  The two &lt;strong&gt;X's&lt;/strong&gt; clearly lie closer to the green question mark than the two red &lt;span style="color:red"&gt;&lt;strong&gt;O's&lt;/strong&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RCE&lt;/strong&gt;, however, uses a hit-radius approach to classify data.  The algorithm calculates a footprint for each of the known data points, with radii determined by the proximity of data from the opposite class.  The &lt;strong&gt;RCE&lt;/strong&gt; footprints for the four data points follow:&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE classification of the green question mark" src="https://john.soban.ski/images/Rce_Python_Part_1/02_Classify_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;Based on this model, the green question mark lands in the footprint of the red class, and &lt;strong&gt;RCE&lt;/strong&gt; indicates that the unknown observation belongs to class &lt;span style="color:red"&gt;&lt;strong&gt;O&lt;/strong&gt;&lt;/span&gt;.&lt;/p&gt;
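&lt;p&gt;A minimal sketch of the two decision rules, with illustrative coordinates I made up (not the plotted data), follows:&lt;/p&gt;

```python
import numpy as np

# Illustrative 2d coordinates: two X exemplars, two O exemplars,
# and one unknown query point.
X_pts = np.array([[0.0, 0.0], [0.5, 0.0]])
O_pts = np.array([[1.0, 0.0], [-5.0, 0.0]])
query = np.array([-2.0, 0.0])

def dists(pts, q):
    # Euclidean distance from q to every point in pts.
    return np.linalg.norm(pts - q, axis=1)

# 1-NN rule: the query takes the class of the single closest exemplar.
nn_class = 'X' if np.less(dists(X_pts, query).min(),
                          dists(O_pts, query).min()) else 'O'

def radii(own, other):
    # RCE radius per exemplar: distance to the nearest opposite-class exemplar.
    return np.array([dists(other, p).min() for p in own])

# RCE rule: the query takes the class of any footprint that contains it.
in_x = bool(np.any(np.less_equal(dists(X_pts, query), radii(X_pts, O_pts))))
in_o = bool(np.any(np.less_equal(dists(O_pts, query), radii(O_pts, X_pts))))
print(nn_class, in_x, in_o)
```

&lt;p&gt;With these made-up points, the query lies nearest the &lt;strong&gt;X&lt;/strong&gt; exemplars, so 1-NN picks &lt;strong&gt;X&lt;/strong&gt;; yet only the far &lt;strong&gt;O&lt;/strong&gt; footprint reaches the query, so RCE picks &lt;strong&gt;O&lt;/strong&gt;.&lt;/p&gt;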
&lt;h2&gt;Explore the Data&lt;/h2&gt;
&lt;p&gt;I use the &lt;a href="https://www.kaggle.com/uciml/pima-indians-diabetes-database"&gt;Pima Indians Diabetes&lt;/a&gt; dataset to craft my model.  The dataset includes observations of eight features and a two-class label.  The labels indicate the presence or absence of diabetes.&lt;/p&gt;
&lt;p&gt;First, import the data into a &lt;strong&gt;Pandas&lt;/strong&gt; &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"&gt;DataFrame&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;diabetes.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;head()&lt;/strong&gt; method gives a quick peek at the features and observations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pima Table" src="https://john.soban.ski/images/Rce_Python_Part_1/03_Pima_Table.png"&gt;&lt;/p&gt;
&lt;p&gt;A quick &lt;a href="https://seaborn.pydata.org/"&gt;Seaborn&lt;/a&gt; histogram depicts the frequency of &lt;strong&gt;Outcome Zero&lt;/strong&gt; (No Diabetes) vs. &lt;strong&gt;Outcome One&lt;/strong&gt; (Diabetes).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A quick glance shows that about 2/3 of the observations indicate &lt;strong&gt;no diabetes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/04_Outcome_Hist.png"&gt;&lt;/p&gt;
&lt;h3&gt;Explore One Feature&lt;/h3&gt;
&lt;p&gt;Of all the given features, I assume that &lt;strong&gt;Glucose&lt;/strong&gt; will impact &lt;strong&gt;Outcome&lt;/strong&gt; the most, so I update the histogram to depict the relationship between the two.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Blood sugar over &lt;strong&gt;150&lt;/strong&gt; appears to indicate &lt;strong&gt;diabetes&lt;/strong&gt;.  Below &lt;strong&gt;150&lt;/strong&gt;, the two classes overlap heavily.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Glucose Outcome Histogram" src="https://john.soban.ski/images/Rce_Python_Part_1/05_Glucose_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Kernel Density Estimation (KDE) provides a smoothed "overhead view" of the histogram.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This view also shows the lack of clear separation between the two &lt;strong&gt;Outcomes&lt;/strong&gt; based on &lt;strong&gt;Glucose&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Glucose Outcome Kernel Density Estimation" src="https://john.soban.ski/images/Rce_Python_Part_1/06_Glucose_Density.png"&gt;&lt;/p&gt;
&lt;h3&gt;Explore Two Features&lt;/h3&gt;
&lt;p&gt;Use &lt;strong&gt;PairGrid&lt;/strong&gt; to cycle through all features in order to depict their relationships to &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Pregnancies&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;BloodPressure&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;Insulin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;DiabetesPedigreeFunction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PairGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_vars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_vars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_offdiag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_legend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Glucose&lt;/strong&gt; and &lt;strong&gt;BMI&lt;/strong&gt; appear to have a small correlation with &lt;strong&gt;Outcome&lt;/strong&gt;, based on the left/right orientation of the density &lt;strong&gt;blobs&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Correlation of Features with Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/07_Correlate_Features.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;KDE&lt;/strong&gt; plot provides an overhead, three-dimensional view of the relationships between &lt;strong&gt;Glucose&lt;/strong&gt;, &lt;strong&gt;BMI&lt;/strong&gt; and &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Based on the near-total overlap, the two features alone do not provide enough separation to predict &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Glucose, BMI and Outcome KDE" src="https://john.soban.ski/images/Rce_Python_Part_1/08_Glucose_Bmi.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Seaborn&lt;/strong&gt; heat map visualizes correlations across features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_theme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;white&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;corr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a mask for the upper triangle&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Set up the matplotlib figure&lt;/span&gt;
&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a custom diverging colormap&lt;/span&gt;
&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diverging_palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                             &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Draw the heatmap with the mask and &lt;/span&gt;
&lt;span class="c1"&gt;# correct aspect ratio&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cbar_kws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;shrink&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Look for dark red tiles in the &lt;strong&gt;Outcome&lt;/strong&gt; row.  The dark red tiles for &lt;strong&gt;Glucose&lt;/strong&gt; and &lt;strong&gt;BMI&lt;/strong&gt; indicate stronger correlation with &lt;strong&gt;Outcome&lt;/strong&gt; than the other features show.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Heatmap of Feature Correlation" src="https://john.soban.ski/images/Rce_Python_Part_1/09_Corr_Heatmap.png"&gt;&lt;/p&gt;
&lt;h3&gt;Explore Three Features&lt;/h3&gt;
&lt;p&gt;Create a function to plot three features against &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following function call, for example, draws a 3d plot that visualizes &lt;strong&gt;Glucose&lt;/strong&gt;, &lt;strong&gt;BMI&lt;/strong&gt; and &lt;strong&gt;SkinThickness&lt;/strong&gt; against &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This plot depicts slight separation between the two classes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d Plot of Outcomes x 3 Features" src="https://john.soban.ski/images/Rce_Python_Part_1/10_Pima_3d.png"&gt;&lt;/p&gt;
&lt;p&gt;Pick two new features, &lt;strong&gt;Insulin&lt;/strong&gt; and &lt;strong&gt;DiabetesPedigreeFunction&lt;/strong&gt; for another view.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Insulin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;DiabetesPedigreeFunction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This combination yields significantly less separability of the classes than the combination of &lt;strong&gt;Glucose&lt;/strong&gt;, &lt;strong&gt;BMI&lt;/strong&gt; and &lt;strong&gt;SkinThickness&lt;/strong&gt; above.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Another view in 3d" src="https://john.soban.ski/images/Rce_Python_Part_1/11_More_3d.png"&gt;&lt;/p&gt;
&lt;h3&gt;Normalize&lt;/h3&gt;
&lt;p&gt;First, split the &lt;strong&gt;pima_df&lt;/strong&gt; DataFrame into &lt;strong&gt;train&lt;/strong&gt; and &lt;strong&gt;test&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; - Data to build our exemplar model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test (AKA Holdout)&lt;/strong&gt; - Unseen data to help predict real-world performance&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove the rows that correspond to the train DF&lt;/span&gt;
&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# The pop removes Outcome from the features DF&lt;/span&gt;
&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Observe the &lt;strong&gt;summary statistics&lt;/strong&gt; of the features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;()[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We see big differences in the range of values for each feature, so we normalize the data; Machine Learning (ML) best practice puts features on a common scale so that no single wide-range feature dominates training.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Stat summary of the Pima Diabetes Dataset" src="https://john.soban.ski/images/Rce_Python_Part_1/12_Pima_Stats.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://keras.io/"&gt;Keras&lt;/a&gt; provides tools for Normalization. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Pregnancies&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;BloodPressure&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;Insulin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;DiabetesPedigreeFunction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                  &lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The features cluster around &lt;strong&gt;zero&lt;/strong&gt; post-normalization.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Normalized Features" src="https://john.soban.ski/images/Rce_Python_Part_1/13_Norm_Features.png"&gt;&lt;/p&gt;
&lt;h2&gt;Reduce Dimensionality&lt;/h2&gt;
&lt;p&gt;The correlation heatmap above indicates strong correlation between some features.  Highly correlated features introduce redundancy (noise) into our model.  Principal Component Analysis (PCA) maps the features onto orthogonal axes and also provides a means to reduce dimensions.  Too many dimensions (features) lead to over-fitting, which reduces the predictive effectiveness of ML models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Open &lt;a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/"&gt;George Dallas' blog post&lt;/a&gt; in a new tab for an excellent explanation of PCA&lt;/p&gt;
&lt;/blockquote&gt;
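&lt;p&gt;To see why PCA helps with redundant features, consider a toy sketch: two nearly identical synthetic features collapse into one dominant component, and the components come out orthogonal.  (This example does not use the Pima data.)&lt;/p&gt;

```python
# Sketch: PCA on two highly correlated synthetic features.  The first
# component absorbs nearly all the variance; the components stay orthogonal.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 500)
y = 0.9 * x + rng.normal(0, 0.1, 500)      # near-duplicate of x
X = np.column_stack([x, y])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)       # first component dominates
```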
&lt;p&gt;Apply PCA to the Pima DataFrame in order to reduce noise and reduce the number of dimensions.&lt;/p&gt;
&lt;p&gt;Create a PCA transform engine, set the number of principal components via &lt;strong&gt;n_components&lt;/strong&gt; and then have the engine fit to the normalized &lt;strong&gt;train_features&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Store the normalized, dimensionality-reduced matrix in a DataFrame and set the column name.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;See if the single Principal Component separates the two classes better than our original &lt;strong&gt;Glucose&lt;/strong&gt; histogram did.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The histogram captures near-total overlap between the classes, which indicates we will need more than one Principal Component.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of The Principal Components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/14_Princomp_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Create a new data frame that includes two Principal Components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Observe a two dimensional scatterplot, colored by &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Two Principal Components reduce the overlap of the two classes slightly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A scatterplot of the two Principle Components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/15_Princomp_Scat.png"&gt;&lt;/p&gt;
&lt;p&gt;A density plot provides another view of the &lt;strong&gt;Outcomes&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The most dense regions of the two outcomes overlap.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A KDE plot of two principle components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/16_Princomp_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;How many components do we need?  The following code prints the explained variance for each component.  Higher variance means a component carries more information.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explained_variance_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 2.09525231 1.67097928 1.04292129 0.88878235 0.76897059 0.69332725&lt;/span&gt;
&lt;span class="c1"&gt;# 0.4365278  0.41629126&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first three to five components include most of the useful information.&lt;/p&gt;
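&lt;p&gt;A common rule of thumb automates that judgment call: keep enough components to cover a target share (say 80%) of the cumulative &lt;strong&gt;explained_variance_ratio_&lt;/strong&gt;.  The sketch below uses synthetic data rather than the Pima features:&lt;/p&gt;

```python
# Sketch: choose n_components from the cumulative explained variance ratio
# (synthetic data with one redundant feature, not the Pima dataset).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(0, 0.3, 300)   # inject redundancy

pca = PCA(n_components=8).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_needed)   # components needed to reach 80% of the variance
```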
&lt;p&gt;The following code produces and stores three components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Attach the &lt;strong&gt;labels&lt;/strong&gt; back to the &lt;strong&gt;train&lt;/strong&gt; DataFrame for the purposes of a 3d plot.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The result shows slight separability of the two classes if you imagine sliding a sheet of paper between the clouds of green and red dots.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d plot of Principle Components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/17_Princomp_3d.png"&gt;&lt;/p&gt;
&lt;h2&gt;Develop Model&lt;/h2&gt;
&lt;p&gt;We will use a 2d &lt;strong&gt;train&lt;/strong&gt; data set to walk through model development.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We re-attach the &lt;strong&gt;train&lt;/strong&gt; labels to our DataFrame.  Our exemplar algorithm requires knowledge of the labels for supervised learning.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Calculate Lambda&lt;/h3&gt;
&lt;p&gt;The following function finds the radius (lambda) for a given row.  For a given observation, it calculates the Euclidean distance to every observation of the &lt;strong&gt;opposite&lt;/strong&gt; class, and then returns the distance to the closest point.&lt;/p&gt;
&lt;p&gt;(Note the complete absence of any &lt;strong&gt;for&lt;/strong&gt; statements in the code below.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
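&lt;p&gt;To see the logic in isolation, the following self-contained sketch applies the same function (rewritten with &lt;strong&gt;.iloc&lt;/strong&gt; for explicit positional access) to a toy four-point DataFrame.  The nearest opposite-class point to (0, 0) sits three units away.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Same logic as find_lambda above: distance from row v to the closest
# point of the opposite class (last entry of v holds the label)
def find_lambda(df, v):
    return (np.linalg.norm(df.loc[df['outcome'] != v.iloc[-1]]
                             .iloc[:, :-1]
                             .sub(np.asarray(v.iloc[:-1], dtype=float)),
                           axis=1)
              .min())

# Toy train set: two class-0 points and two class-1 points on a line
toy = pd.DataFrame({'princomp1': [0.0, 1.0, 3.0, 4.0],
                    'princomp2': [0.0, 0.0, 0.0, 0.0],
                    'outcome':   [0,   0,   1,   1]})

lam = find_lambda(toy, toy.iloc[0, :])  # nearest class-1 point is (3, 0)
```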

&lt;p&gt;As an example, look at row one of our training DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That observation belongs to &lt;strong&gt;Outcome 0&lt;/strong&gt; (no diabetes), and lies at the point &lt;strong&gt;(-0.05, -1.16)&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;princomp1   -0.046671
princomp2   -1.161939
outcome      0.000000
Name: 122, dtype: float64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pass this observation to our &lt;strong&gt;find_lambda&lt;/strong&gt; function, which returns the distance to the nearest observation in &lt;strong&gt;Outcome 1&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our function indicates that the closest observation in &lt;strong&gt;Outcome 1&lt;/strong&gt; lies 0.09 units away.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;0.0893789327564675&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Pandas&lt;/strong&gt; apply method allows us to follow a &lt;strong&gt;Functional Programming&lt;/strong&gt; approach and process the entire DataFrame at once.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                    &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following table captures the resulting &lt;strong&gt;lambda&lt;/strong&gt; for a handful of example &lt;strong&gt;training&lt;/strong&gt; observations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The calculated Lambdas in a table" src="https://john.soban.ski/images/Rce_Python_Part_1/18_Train_Lambda.png"&gt;&lt;/p&gt;
&lt;h3&gt;Classify Test Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt; data does not include a label.  The ML Engineer feeds &lt;strong&gt;test&lt;/strong&gt; data into the &lt;strong&gt;trained&lt;/strong&gt; model, and the model predicts a label.&lt;/p&gt;
&lt;p&gt;We will now develop the logic to predict a label.&lt;/p&gt;
&lt;p&gt;To demonstrate the logic, we produce a grid of &lt;strong&gt;test&lt;/strong&gt; points.  The grid will also feed a visualization of the RCE decision boundaries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; 
                        &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
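&lt;p&gt;As a side note, &lt;strong&gt;numpy.meshgrid&lt;/strong&gt; builds the same grid without a double comprehension.  The following sketch produces a DataFrame with the same 40,000 points in the same order.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# 200 x 200 grid spanning [-1.00, 0.99] in steps of 0.01
xs = np.arange(-100, 100) / 100
g1, g2 = np.meshgrid(xs, xs, indexing='ij')  # 'ij' matches the comprehension order

grid_df = pd.DataFrame({'princomp1': g1.ravel(),
                        'princomp2': g2.ravel()})
```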

&lt;p&gt;Our grid includes the following &lt;strong&gt;test&lt;/strong&gt; data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A grid to feed the visualization of the RCE decision boundaries" src="https://john.soban.ski/images/Rce_Python_Part_1/19_Princomp_Grid.png"&gt;&lt;/p&gt;
&lt;p&gt;Our RCE algorithm uses the &lt;strong&gt;find_lambda&lt;/strong&gt; function (above) to calculate &lt;strong&gt;lambda&lt;/strong&gt; for each observation in the &lt;strong&gt;train&lt;/strong&gt; DataFrame and stores the results in the &lt;strong&gt;train_df&lt;/strong&gt; DataFrame.  Recall that &lt;strong&gt;Lambda&lt;/strong&gt; represents the &lt;strong&gt;radius&lt;/strong&gt; of a circle that captures the &lt;strong&gt;hit footprint&lt;/strong&gt; for a given observation.  &lt;/p&gt;
&lt;p&gt;Our &lt;strong&gt;find_hits&lt;/strong&gt; function (below) takes a given &lt;strong&gt;test&lt;/strong&gt; observation and then calculates the Euclidean distance to every point in &lt;strong&gt;train_df&lt;/strong&gt;.  A &lt;strong&gt;test point to train point&lt;/strong&gt; distance less than the &lt;strong&gt;train point's lambda&lt;/strong&gt; indicates that the &lt;strong&gt;test point&lt;/strong&gt; lies in the &lt;strong&gt;train point's&lt;/strong&gt; hit footprint.  &lt;/p&gt;
&lt;p&gt;For a given &lt;strong&gt;test&lt;/strong&gt; observation, our &lt;strong&gt;find_hits&lt;/strong&gt; function discovers and tallies the hits for each class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
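&lt;p&gt;A toy run makes the tally concrete.  The following self-contained sketch (same logic as &lt;strong&gt;find_hits&lt;/strong&gt; above, on hand-made data) counts how many lambda circles of each class contain the point (0.5, 0).&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Same logic as find_hits above: count train points of the given class
# whose lambda circle contains the test point v
def find_hits(df, v, outcome):
    return ((np.linalg.norm(df.loc[df['outcome'] == outcome]
                              .iloc[:, :-2]
                              .sub(np.asarray(v, dtype=float)),
                            axis=1)
             < df.loc[df['outcome'] == outcome]['lambda']).sum())

# Toy train set: centers with pre-computed lambda radii
toy = pd.DataFrame({'princomp1': [0.0, 1.0, 3.0],
                    'princomp2': [0.0, 0.0, 0.0],
                    'outcome':   [0,   0,   1],
                    'lambda':    [3.0, 2.0, 2.0]})

v = pd.Series({'princomp1': 0.5, 'princomp2': 0.0})
hits0 = find_hits(toy, v, 0)  # both class-0 circles contain (0.5, 0)
hits1 = find_hits(toy, v, 1)  # the class-1 circle (center (3, 0), radius 2) does not
```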

&lt;p&gt;Row one of our &lt;strong&gt;test&lt;/strong&gt; DataFrame, for example, includes the unlabeled point (-1, -0.99).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:]&lt;/span&gt;
&lt;span class="n"&gt;princomp1&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;
&lt;span class="n"&gt;princomp2&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For this point, &lt;strong&gt;find_hits&lt;/strong&gt; tallies &lt;strong&gt;9&lt;/strong&gt; hits for &lt;strong&gt;Outcome 0&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:],&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;find_hits&lt;/strong&gt; drives the &lt;strong&gt;classify_data&lt;/strong&gt; function, which labels the class based on hits for each class.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;classify_data&lt;/strong&gt; returns &lt;strong&gt;NaN&lt;/strong&gt; (plotted as &lt;strong&gt;Ambiguous&lt;/strong&gt;) for any &lt;strong&gt;test&lt;/strong&gt; data that lies in an &lt;strong&gt;Ambiguous region&lt;/strong&gt;: a region with overlapping classes, or a region with no class at all.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
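&lt;p&gt;Stripped of the DataFrame plumbing, the per-point decision rule in &lt;strong&gt;classify_data&lt;/strong&gt; reduces to a few lines.  The following standalone sketch (a hypothetical &lt;strong&gt;decide&lt;/strong&gt; helper, not part of the original code) captures it.&lt;/p&gt;

```python
import numpy as np

# Decision rule from classify_data, stripped to its core: a point earns a
# class label only when it has hits for that class and none for the other.
def decide(class0_hits, class1_hits):
    if class0_hits > 0 and class1_hits == 0:
        return 0
    if class1_hits > 0 and class0_hits == 0:
        return 1
    return np.nan  # ambiguous: overlapping footprints, or no footprint at all

labels = [decide(9, 0), decide(0, 3), decide(4, 2), decide(0, 0)]
```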

&lt;p&gt;Pass our &lt;strong&gt;test&lt;/strong&gt; DataFrame to  &lt;strong&gt;classify_data&lt;/strong&gt; and store the results.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A quick peek shows mostly &lt;strong&gt;Ambiguous&lt;/strong&gt; classification for the first and last five observations in our &lt;strong&gt;test&lt;/strong&gt; DataFrame. &lt;/p&gt;
&lt;p&gt;&lt;img alt="The calculated hits" src="https://john.soban.ski/images/Rce_Python_Part_1/20_Hits_Grid.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Seaborn&lt;/strong&gt; scatterplot maps our entire grid.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following graphic captures the footprint of each class: &lt;strong&gt;Purple&lt;/strong&gt; for &lt;strong&gt;Outcome 1&lt;/strong&gt; (Diabetes), &lt;strong&gt;Pink&lt;/strong&gt; for &lt;strong&gt;Outcome 0&lt;/strong&gt; (No Diabetes), and &lt;strong&gt;Gray&lt;/strong&gt; for &lt;strong&gt;Ambiguous&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The RCE 2d decision boundaries" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;h2&gt;Evaluate RCE&lt;/h2&gt;
&lt;p&gt;Our Pima &lt;strong&gt;test&lt;/strong&gt; DataFrame includes labels, which we use to &lt;strong&gt;evaluate&lt;/strong&gt; the model.&lt;/p&gt;
&lt;p&gt;To prepare the &lt;strong&gt;test&lt;/strong&gt; DataFrame for classification, we normalize it and then project it with the &lt;strong&gt;pca&lt;/strong&gt; object we already fit on the &lt;strong&gt;train&lt;/strong&gt; features.  (Re-fitting PCA on the &lt;strong&gt;test&lt;/strong&gt; features would produce a different projection than the one the model trained on.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Re-use the PCA object fit on the train features&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pass this &lt;strong&gt;test_df&lt;/strong&gt; to &lt;strong&gt;classify_data&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Seaborn&lt;/strong&gt; provides a method to depict a &lt;a href="https://en.wikipedia.org/wiki/Confusion_matrix"&gt;confusion matrix&lt;/a&gt;.  We attach the known &lt;strong&gt;test&lt;/strong&gt; labels to the &lt;strong&gt;test&lt;/strong&gt; DataFrame for scoring.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                               &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                               &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                               &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following graphic captures the &lt;strong&gt;confusion matrix&lt;/strong&gt; for our two-Principal-Component &lt;strong&gt;test&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Confusion Matrix for two Principle Components" src="https://john.soban.ski/images/Rce_Python_Part_1/22_2d_Confuse.png"&gt;&lt;/p&gt;
&lt;p&gt;An &lt;a href="https://en.wikipedia.org/wiki/F1_score"&gt;F1 Score&lt;/a&gt; provides a useful success metric.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unambiguous_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ambiguity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;f1_score&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s2"&gt;&amp;quot;ambiguity&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ambiguity&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our RCE algorithm trained a model with an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.42424242424242425&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2662337662337662&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
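&lt;p&gt;The &lt;strong&gt;ambiguity&lt;/strong&gt; figure is simply the fraction of &lt;strong&gt;test&lt;/strong&gt; rows that fall outside every &lt;strong&gt;hit footprint&lt;/strong&gt; and therefore drop out during &lt;strong&gt;dropna&lt;/strong&gt;.  A quick sanity check of the arithmetic, with hypothetical row counts chosen only to match the reported value:&lt;/p&gt;

```python
# Hypothetical counts chosen to reproduce the reported ambiguity of ~26.6%.
total_rows = 154       # rows in the test DataFrame
classified_rows = 113  # rows that survive dropna()

ambiguity = (total_rows - classified_rows) / total_rows
print(round(ambiguity, 4))  # 0.2662
```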

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this blog post we developed an exemplar RCE neural net classifier from scratch.  Our initial attempt yielded a model with an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.  &lt;a href="https://john.soban.ski/rce-python-part-2.html"&gt;Next month&lt;/a&gt;, we will tune hyperparameters in order to improve model success and reduce ambiguity.  We will investigate the number of principal components and tune &lt;strong&gt;r&lt;/strong&gt;.  &lt;strong&gt;r&lt;/strong&gt; indicates the maximum value for &lt;strong&gt;Lambda&lt;/strong&gt; and puts an upper limit on the maximum size of each circle that represents a given &lt;strong&gt;hit footprint&lt;/strong&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Juice Your In-Situ Machine Learning with BigQuery AutoML</title><link href="https://john.soban.ski/bigquery-automl.html" rel="alternate"></link><published>2021-06-26T01:23:00-04:00</published><updated>2021-06-26T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-06-26:/bigquery-automl.html</id><summary type="html">&lt;p&gt;Data Scientists need skill and experience to create useful Machine Learning (ML) models.  ML activities include tool selection, training logistic decisions (move data to training vs. train in-situ), data acquisition, data cleaning, data quality checks, feature engineering, algorithm selection and hyperparameter tuning.&lt;/p&gt;
&lt;p&gt;Algorithm selection and hyperparameter tuning drive tedious manual …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Data Scientists need skill and experience to create useful Machine Learning (ML) models.  ML activities include tool selection, training logistic decisions (move data to training vs. train in-situ), data acquisition, data cleaning, data quality checks, feature engineering, algorithm selection and hyperparameter tuning.&lt;/p&gt;
&lt;p&gt;Algorithm selection and hyperparameter tuning drive tedious manual processes where the Data Scientist must flip a switch, turn a knob, train the model and then cross her fingers.  The Google Cloud Platform (GCP) BigQuery Machine Learning (BQML) service provides two wins for Data Scientists:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The ability to train models on data in place (a must for petabyte-scale &lt;strong&gt;Big Data&lt;/strong&gt; sets)&lt;/li&gt;
&lt;li&gt;The ability to execute AutoML in place&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This blog post demonstrates how to execute AutoML in-situ via the &lt;a href="https://cloud.google.com/bigquery-ml/docs"&gt;GCP BQML&lt;/a&gt; service.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;In previous blog posts I used several platforms to train models on tabular data: TensorFlow, Google Cloud Platform (GCP) AutoML Tables, and BigQuery BQML.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 1 - Data Exploration &amp;amp; First Models)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 2 - Dimensionality Reduction)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Fast and Easy ML Optimization with GCP AutoML Tables (Beta)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;Don't Move Your Data! In-Situ Machine Learning via BigQuery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The blog posts above capture a variety of model training approaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hand crafted models using Python Objects (Tensorflow)&lt;/li&gt;
&lt;li&gt;AutoML using a GCP Application Programming Interface (API) (AutoML Tables)&lt;/li&gt;
&lt;li&gt;Hand crafted models using SQL commands (BigQuery)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This blog post demonstrates how to use the BigQuery BQML service to trigger AutoML workflows in-situ.  The AutoML service runs through a variety of ML Algorithms and iterates through a range of hyperparameter settings for each algorithm.  The service then keeps and serves the &lt;strong&gt;winning&lt;/strong&gt; approach.&lt;/p&gt;
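&lt;p&gt;Conceptually, the service runs a model-selection loop: fit each candidate, score it on held-out data, and keep the best performer.  The sketch below shows that idea in miniature with a toy one-feature ridge regressor (the data, the candidate &lt;strong&gt;alpha&lt;/strong&gt; values, and the scoring are all hypothetical; BQML's real search space of algorithms and hyperparameters is far richer):&lt;/p&gt;

```python
import random

random.seed(0)

# Toy dataset: a single feature with a linear relationship plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [0.5 * x + random.gauss(0, 0.3) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit_ridge(x, y, alpha):
    """Closed-form one-feature ridge regression (slope only, no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def mse(slope, x, y):
    return sum((b - slope * a) ** 2 for a, b in zip(x, y)) / len(x)

# The AutoML idea in miniature: sweep the hyperparameter, keep the winner.
candidates = {alpha: fit_ridge(train_x, train_y, alpha)
              for alpha in (0.01, 1.0, 100.0, 10000.0)}
best_alpha = min(candidates, key=lambda a: mse(candidates[a], test_x, test_y))
print(best_alpha)
```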
&lt;h2&gt;AutoML Regressor&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;Last month&lt;/a&gt; we used SQL syntax to command BigQuery to train a linear regression model in-situ. Open &lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;that blog&lt;/a&gt; post in a new tab to review the steps required to train models in BigQuery.&lt;/p&gt;
&lt;p&gt;We used the following SQL statement to train a linear regression model, with the &lt;strong&gt;model_type&lt;/strong&gt; set to &lt;strong&gt;LINEAR_REG&lt;/strong&gt; in the SQL &lt;strong&gt;OPTIONS&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LINEAR_REG&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our model kept the default parameters for the training algorithm.&lt;/p&gt;
&lt;p&gt;We can improve model performance through &lt;strong&gt;hyperparameter&lt;/strong&gt; tuning.  In the old days, we needed to tune these parameters by hand.  &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;GCP, however, provides AutoML services (e.g. AutoML Tables)&lt;/a&gt; to automatically tune these parameters.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;BigQuery now provides a Beta service to execute in-situ AutoML.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To use BigQuery AutoML, simply set your SQL OPTIONS to &lt;strong&gt;AUTOML_REGRESSOR&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Automl Query" src="https://john.soban.ski/images/Bigquery_Automl/19_Automl_Query.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: We direct BigQuery to save the new model under the name &lt;strong&gt;automl_model&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;automl_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;AUTOML_REGRESSOR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;BigQuery AutoML iterates through many hyperparameter scenarios, each of which investigates the effects of choices related to learning rate, regularization and optimizers.  You will notice that AutoML consumes a much larger portion of &lt;strong&gt;wall clock&lt;/strong&gt; time in comparison to our single Linear Regression model above.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Pic" src="https://john.soban.ski/images/Bigquery_Automl/20_Training.png"&gt;&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Execution Details&lt;/strong&gt; to get more status information.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Pic 2" src="https://john.soban.ski/images/Bigquery_Automl/21_Training_2.png"&gt;&lt;/p&gt;
&lt;p&gt;Upon completion, BigQuery stores our new &lt;strong&gt;automl_model&lt;/strong&gt; in the &lt;strong&gt;wine_dataset&lt;/strong&gt; Dataset, which lives in the &lt;strong&gt;shining-chain&lt;/strong&gt; project.&lt;/p&gt;
&lt;p&gt;The AutoML process completes in about fifty (50) minutes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="AutoML Done" src="https://john.soban.ski/images/Bigquery_Automl/22_AutoML_Done.png"&gt;&lt;/p&gt;
&lt;h2&gt;AutoML Regressor Results&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;results&lt;/strong&gt; tab reports a reduction in Mean Square Error (MSE), compared to the prior Linear Regression model that used default parameters.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Automl Results" src="https://john.soban.ski/images/Bigquery_Automl/23_Automl_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;The MSE maps to a Root Mean Square Error (RMSE) of about 0.6393.&lt;/p&gt;
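&lt;p&gt;The conversion is just a square root.  Assuming the results tab reports an MSE of roughly 0.4087 (the hypothetical value implied by the quoted RMSE), the arithmetic works out as follows:&lt;/p&gt;

```python
import math

mse = 0.4087  # approximate MSE implied by the reported RMSE (hypothetical)
rmse = math.sqrt(mse)
print(round(rmse, 4))  # 0.6393
```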
&lt;p&gt;Several months ago &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;we used TensorFlow and Google Cloud Platform AutoML to train several models on the Wine Quality Dataset&lt;/a&gt; and compare the results.  In January, we used &lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;BigQuery Linear Regression with default Hyperparameters&lt;/a&gt; to train the Wine Quality Dataset.&lt;/p&gt;
&lt;p&gt;Let's compare the RMSE of &lt;strong&gt;BQML's AUTOML_REGRESSOR&lt;/strong&gt; (0.6393) against these prior experiments.&lt;/p&gt;
&lt;p&gt;The following table captures the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;AutoML&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.639&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
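&lt;p&gt;As a quick check on the ranking, we can sort the (Platform, Approach, RMSE) triples from the table above by RMSE and confirm where BigQuery AutoML lands:&lt;/p&gt;

```python
# (platform, approach, rmse) triples copied from the results table above.
results = [
    ("GCP", "AutoML Tables", 0.598),
    ("TensorFlow", "Linear Model (7 dims)", 0.633),
    ("BigQuery", "AutoML", 0.639),
    ("TensorFlow", "DNN (7 dims)", 0.645),
    ("TensorFlow", "DNN (11 dims)", 0.648),
    ("BigQuery", "Linear (11 dims)", 0.661),
    ("TensorFlow", "Linear Model (11 dims)", 0.706),
    ("TensorFlow", "Linear Model (2 dims)", 0.735),
    ("Pandas", "Guess Mean", 0.801),
]

# Rank ascending by RMSE (lower is better).
ranked = sorted(results, key=lambda row: row[2])
print(ranked.index(("BigQuery", "AutoML", 0.639)) + 1)  # 3
```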
&lt;p&gt;BigQuery AutoML under-performs compared to &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;GCP AutoML Tables&lt;/a&gt; and a &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;dimensionality reduced TensorFlow model&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Serve Model&lt;/h2&gt;
&lt;p&gt;After training, BigQuery saves and serves the new model in place.&lt;/p&gt;
&lt;p&gt;We use &lt;strong&gt;SQL&lt;/strong&gt; to query the served model.  In the BigQuery console, click &lt;strong&gt;QUERY MODEL&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Query Model" src="https://john.soban.ski/images/Bigquery_Automl/24_Query_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The following SQL command pulls the first record out of the Wine Quality data set and then sets the &lt;strong&gt;alcohol&lt;/strong&gt; parameter to 80%.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above &lt;strong&gt;QUERY&lt;/strong&gt; returns the following &lt;strong&gt;JSON&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;alcohol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;80&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;chlorides&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.074&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;citric_acid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.66&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;density&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.0008&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fixed_acidity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;11.6&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;free_sulfur_dioxide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;10.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;ph&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;3.25&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;residual_sugar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2.2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;sulphates&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.57&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;total_sulfur_dioxide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;47.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;volatile_acidity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.58&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following screengrab captures the console view of this &lt;strong&gt;QUERY&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Predict Data" src="https://john.soban.ski/images/Bigquery_Automl/25_Predict_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;SQL QUERY&lt;/strong&gt; below pulls and modifies the first record from the Wine Quality data set and then pipes it to the &lt;strong&gt;automl_model&lt;/strong&gt; we trained via &lt;strong&gt;AUTOML_REGRESSOR&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;At a high level, we &lt;strong&gt;SELECT&lt;/strong&gt; the predicted score of a wine with 80% alcohol &lt;strong&gt;FROM&lt;/strong&gt; our model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;predicted_label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ML&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PREDICT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;automl_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The console returns the predicted &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Predicted API" src="https://john.soban.ski/images/Bigquery_Automl/26_Predicted_API.png"&gt;&lt;/p&gt;
&lt;p&gt;The model predicts a &lt;strong&gt;quality&lt;/strong&gt; score (taste) of 1.7 out of 10 for a wine with 80% alcohol, which strikes me as a reasonable prediction for such an extreme input.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;predicted_label&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.7691493034362793&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
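&lt;p&gt;As a quick sanity check, we can parse the JSON response above with a few lines of Python.  Note that BigQuery returns &lt;strong&gt;predicted_label&lt;/strong&gt; as a string, so we cast it to a float (the snippet below is illustrative, not part of the BQML workflow).&lt;/p&gt;

```python
import json

# The JSON payload the BigQuery console returned (copied from above)
response = '[{"predicted_label": "1.7691493034362793"}]'

rows = json.loads(response)
quality = float(rows[0]['predicted_label'])  # cast the string label to a float
print(round(quality, 2))  # 1.77
```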

&lt;h2&gt;Boosted Tree&lt;/h2&gt;
&lt;p&gt;For fun, let's look at the performance of an &lt;a href="https://en.wikipedia.org/wiki/Ensemble_learning"&gt;ensemble method&lt;/a&gt;.  &lt;/p&gt;
&lt;p&gt;BQML provides a &lt;strong&gt;BOOSTED_TREE_REGRESSOR&lt;/strong&gt;, which we select via &lt;strong&gt;SQL OPTIONS&lt;/strong&gt;.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boost_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BOOSTED_TREE_REGRESSOR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model takes six minutes to train.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Model" src="https://john.soban.ski/images/Bigquery_Automl/27_Boost_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The model results in an MSE of 0.3419, with an RMSE of 0.5847.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Results" src="https://john.soban.ski/images/Bigquery_Automl/28_Boost_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;BQML BOOSTED_TREE_REGRESSOR&lt;/strong&gt; bests &lt;strong&gt;GCP AutoML Tables&lt;/strong&gt; and lands in first place!&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Boosted Tree&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.585&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;AutoML&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.639&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Dimensionality Reduced BQML&lt;/h2&gt;
&lt;p&gt;Too many features can drive over-fitting, which in turn increases RMSE.&lt;/p&gt;
&lt;p&gt;In a past blog post, we demonstrated that &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;dimensionality reduction through Principal Component Analysis (PCA) reduces over-fitting and RMSE&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The last part of this blog post feeds a dimensionality-reduced &lt;strong&gt;Wine Quality Dataset&lt;/strong&gt; to the &lt;strong&gt;BQML BOOSTED_TREE_REGRESSOR&lt;/strong&gt; algorithm.&lt;/p&gt;
&lt;p&gt;We will briefly run through the steps to apply PCA to the &lt;strong&gt;Wine Quality Dataset.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First, import the necessary Python libraries, then pull the data from the University of California, Irvine (UCI) Machine Learning Repository and load it into a &lt;strong&gt;Pandas DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fixed_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;volatile_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;citric_acid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;residual_sugar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;chlorides&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;free_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;total_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;density&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;ph&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;sulphates&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;alcohol&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;quality&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;wine_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; 
                      &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
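&lt;p&gt;The UCI file separates fields with semicolons instead of commas, which explains the &lt;strong&gt;sep&lt;/strong&gt; argument above.  The offline sketch below (with made-up rows) demonstrates the parse:&lt;/p&gt;

```python
import io

import pandas as pd

# Two made-up rows in the semicolon-delimited UCI layout (illustrative values)
csv_text = 'fixed_acidity;volatile_acidity;quality\n7.4;0.70;5\n7.8;0.88;5\n'
df = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df.shape)          # (2, 3)
print(list(df.columns))  # ['fixed_acidity', 'volatile_acidity', 'quality']
```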

&lt;p&gt;Next, split the DataFrame into a &lt;strong&gt;features&lt;/strong&gt; DataFrame and a &lt;strong&gt;label&lt;/strong&gt; Series.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wine_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;wine_labels_series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;quality&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;TensorFlow allows us to create a normalization engine for our &lt;strong&gt;features&lt;/strong&gt; Dataframe.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
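&lt;p&gt;The &lt;strong&gt;Normalization&lt;/strong&gt; layer applies a per-feature z-score: it subtracts each column's mean and divides by its standard deviation.  The NumPy sketch below reproduces the idea on a toy array, without TensorFlow:&lt;/p&gt;

```python
import numpy as np

features = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])

# Per-column z-score: subtract the mean, divide by the (population) std
normalized = (features - features.mean(axis=0)) / features.std(axis=0)

print(normalized.mean(axis=0))  # approximately [0. 0.]
print(normalized.std(axis=0))   # approximately [1. 1.]
```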

&lt;p&gt;&lt;strong&gt;Sklearn&lt;/strong&gt; provides a PCA engine.  We pipe the &lt;strong&gt;features&lt;/strong&gt; Dataframe to the normalization engine and then to the PCA engine, and request the first seven principal components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp6&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                               &lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
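&lt;p&gt;To see how much signal the seven components retain, inspect &lt;strong&gt;explained_variance_ratio_&lt;/strong&gt; after the fit.  The toy sketch below (random data, purely illustrative) shows the mechanics:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
toy_features = rng.normal(size=(100, 11))  # 100 rows, 11 features, like the wine set

pca = PCA(n_components=7)
reduced = pca.fit_transform(toy_features)
retained = pca.explained_variance_ratio_.sum()  # fraction of total variance kept

print(reduced.shape)  # (100, 7)
print(round(float(retained), 3))
```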

&lt;p&gt;We then assign the labels back onto the normalized, dimensionality-reduced DataFrame and save it as a Comma Separated Value (CSV) file.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_wine_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wine_labels_series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca_wine_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pca_wine_df.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following output records the first ten lines of the CSV:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;princomp1,princomp2,princomp3,princomp4,princomp5,princomp6,princomp7,quality
-1.6195179763728917,0.4509726853737244,-1.7744612972264329,0.04374371633307103,0.0670156612925275,-0.9139218906561226,-0.1610332757006941,5.0
-0.7991612763626295,1.856561351520203,-0.9116754264684358,0.5480739949151788,-0.01839571398165714,0.929709236000912,-1.0098350218068104,5.0
-0.7484768531031628,0.8820469715571214,-1.1713842697588999,0.41102911926788793,-0.043535655196972736,0.40147666614026,-0.539553150939102,5.0
2.357677805002114,-0.269982797056245,0.24348912259870834,-0.9284469679531109,-1.4991502738904028,-0.13102232409979334,0.34428774245741034,6.0
-1.6195179763728917,0.4509726853737244,-1.7744612972264329,0.04374371633307103,0.0670156612925275,-0.9139218906561226,-0.1610332757006941,5.0
-1.583695657944522,0.5692157167619253,-1.5382922454632044,0.02375291374041369,-0.11007403103710495,-0.993628380100469,-0.10964916626917803,5.0
-1.1014601399705353,0.6080257636816654,-1.0759111926105813,-0.343950360931988,-1.1333873126112808,0.1750035123630711,0.26101076781023663,5.0
-2.2487136084125905,-0.4168236213171013,-0.9868407617813321,-0.0011977615992119378,-0.7804374122971008,0.2860584721236257,0.1314469559051279,7.0
-1.0868804709342004,-0.3085531414570113,-1.5181578596509828,0.003318878620501723,-0.22672738691574854,-0.5126291605993216,0.2496169604878968,7.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We &lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;upload the CSV into BigQuery using the console&lt;/a&gt; and execute the following SQL to train a &lt;strong&gt;BOOSTED_TREE_REGRESSOR&lt;/strong&gt; model on the dimensionality-reduced dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_wine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_boost_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BOOSTED_TREE_REGRESSOR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_wine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_wine&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model takes six minutes to train.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Model on PCA Data" src="https://john.soban.ski/images/Bigquery_Automl/29_Pca_Boost.png"&gt;&lt;/p&gt;
&lt;p&gt;Click the Evaluation tab to find an &lt;strong&gt;MSE&lt;/strong&gt; of &lt;strong&gt;0.3771&lt;/strong&gt;, which maps to an &lt;strong&gt;RMSE&lt;/strong&gt; of &lt;strong&gt;0.6140&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Model on PCA Evaluation" src="https://john.soban.ski/images/Bigquery_Automl/30_Pca_Eval.png"&gt;&lt;/p&gt;
&lt;p&gt;The dimensionality-reduced dataset proves less accurate than the full-featured dataset and lands in third place.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Boosted Tree&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.585&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Boosted Tree&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.614&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;AutoML&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.639&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data Scientists have a plethora of tools and approaches to train models.  BigQuery provides in-situ Machine Learning and in-situ AutoML.  This blog post compared the BQML &lt;strong&gt;AUTOML_REGRESSOR&lt;/strong&gt; algorithm against the &lt;strong&gt;BOOSTED_TREE_REGRESSOR&lt;/strong&gt;, for both the complete and a dimensionality-reduced dataset.&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Neural Networks"></category><category term="Machine Learning"></category></entry><entry><title>Did Thoreau inspire the Unabomber? We use AI to Find Out!</title><link href="https://john.soban.ski/thoreau-vs-unabomber.html" rel="alternate"></link><published>2021-05-30T03:19:00-04:00</published><updated>2021-05-30T03:19:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-05-30:/thoreau-vs-unabomber.html</id><summary type="html">&lt;h2&gt;&lt;strong&gt;Good Vs. Evil&lt;/strong&gt; - Two Opposing Paths Taken by a Similar Genius&lt;/h2&gt;
&lt;p&gt;This blog post provides a comparison between Henry David Thoreau's &lt;strong&gt;Walden&lt;/strong&gt; and Ted Kaczynski's &lt;strong&gt;Unabomber Manifesto.&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;img alt="Walden Book" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/01_Walden_Thoreau.png"&gt;&lt;/p&gt;
&lt;p&gt;To compare these two works, I use both a modern Natural Language Processing (NLP) Artificial Intelligence (AI) tool and traditional literary analysis.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Sketch" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/02_Unabomber_Sketch.png"&gt;&lt;/p&gt;
&lt;p&gt;The …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;&lt;strong&gt;Good Vs. Evil&lt;/strong&gt; - Two Opposing paths Taken by a Similar Genius&lt;/h2&gt;
&lt;p&gt;This blog post provides a comparison between Henry David Thoreau's &lt;strong&gt;Walden&lt;/strong&gt; and Ted Kaczynski's &lt;strong&gt;Unabomber Manifesto.&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;img alt="Walden Book" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/01_Walden_Thoreau.png"&gt;&lt;/p&gt;
&lt;p&gt;To compare these two works, I use both a modern Natural Language Processing (NLP) Artificial Intelligence (AI) tool and traditional literary analysis.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Sketch" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/02_Unabomber_Sketch.png"&gt;&lt;/p&gt;
&lt;p&gt;The Google Cloud Natural Language Application Programming Interface (API) quantifies the authors' sentiment (positive or negative) and intensity of emotion while traditional Literary Analysis compares and contrasts themes.  &lt;/p&gt;
&lt;h2&gt;Google Cloud Natural Language Analysis&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://cloud.google.com/natural-language/docs"&gt;Google Cloud Natural Language API&lt;/a&gt; allows developers to use Google's advanced, massive and validated language model to infer sentiment, extract entities and classify documents.  I will use the API to infer sentiment from the two texts, and then compare the results.  Sentiment analysis provides quantifiable metrics (hard numbers) which drive mathematical comparisons.&lt;/p&gt;
&lt;h3&gt;Process Text&lt;/h3&gt;
&lt;p&gt;I use the Google API to infer sentiment (score) and intensity (magnitude).&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values"&gt;Google Cloud Natural Language API documentation&lt;/a&gt; defines &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Score&lt;ul&gt;
&lt;li&gt;Indicates the overall emotion of a document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Magnitude&lt;ul&gt;
&lt;li&gt;Indicates how much emotional content is present within the document, and this value is often proportional to the length of the document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
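&lt;p&gt;To build intuition, the sketch below combines &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt; into a human-readable label.  The thresholds are illustrative, not values from the Google documentation:&lt;/p&gt;

```python
def label_sentiment(score, magnitude, threshold=0.25):
    # Thresholds are illustrative, not official Google guidance
    if score >= threshold:
        mood = 'positive'
    elif -score >= threshold:
        mood = 'negative'
    else:
        mood = 'neutral'
    strength = 'strong' if magnitude >= 2.0 else 'mild'
    return f'{strength} {mood}'

print(label_sentiment(0.6, 3.2))   # strong positive
print(label_sentiment(-0.4, 0.5))  # mild negative
print(label_sentiment(0.0, 0.1))   # mild neutral
```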
&lt;p&gt;I use the following script to split each text into individual paragraphs, send each paragraph to the API, and then record the results.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/usr/bin/env python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;

&lt;span class="c1"&gt;# Configure book name here &lt;/span&gt;
&lt;span class="c1"&gt;#FILENAME = &amp;#39;walden.txt&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;FILENAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;unabomber.txt&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;sentiment_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiates a client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LanguageServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PLAIN_TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;document&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document_sentiment&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;magnitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ERROR: &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;_sentiment.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I use the Google Cloud Natural Language API Python client library, although &lt;a href="https://requests.readthedocs.io/en/latest/"&gt;requests&lt;/a&gt; also works.  &lt;strong&gt;Walden&lt;/strong&gt; includes some Latin, which &lt;strong&gt;breaks&lt;/strong&gt; the service.  To mitigate the &lt;strong&gt;error&lt;/strong&gt;, I use &lt;strong&gt;try/except&lt;/strong&gt; logic.  The &lt;strong&gt;strip()&lt;/strong&gt; calls remove blank lines from the analysis.&lt;/p&gt;
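&lt;p&gt;For reference, a minimal sketch of the same call over REST with &lt;strong&gt;requests&lt;/strong&gt;.  The endpoint and payload shape follow the public Natural Language v1 API; the &lt;strong&gt;API_KEY&lt;/strong&gt; value is a placeholder, and the helper names are illustrative:&lt;/p&gt;

```python
import requests

API_KEY = 'YOUR_API_KEY'  # placeholder; supply a real key to call the service
URL = 'https://language.googleapis.com/v1/documents:analyzeSentiment'


def build_payload(text):
    # Mirror the client library's Document(content=..., type_=PLAIN_TEXT)
    return {'document': {'type': 'PLAIN_TEXT', 'content': text}}


def analyze_sentiment(text):
    resp = requests.post(URL, params={'key': API_KEY}, json=build_payload(text))
    resp.raise_for_status()
    return resp.json()['documentSentiment']  # dict with 'score' and 'magnitude'
```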
&lt;p&gt;Since both texts include a wide variety of characters, I store the results in a &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; dataframe.  The &lt;strong&gt;to_csv()&lt;/strong&gt; method will escape all of the characters that disturb the Comma Separated Values (CSV) encoded output.&lt;/p&gt;
&lt;p&gt;The script outputs files named &lt;strong&gt;unabomber_sentiment.csv&lt;/strong&gt; and &lt;strong&gt;walden_sentiment.csv&lt;/strong&gt;; each row includes a score, a magnitude, and the source paragraph text.&lt;/p&gt;
&lt;p&gt;The following snippet shows several lines of the &lt;strong&gt;Walden&lt;/strong&gt; output:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;magnitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="mf"&gt;0.20000000298023224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.699999988079071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;When I wrote the following pages, or rather the bulk of them, I lived alone, in the woods, a mile from any neighbor, in a house which I had built myself, on the shore of Walden Pond, in Concord, Massachusetts, and earned my living by the labor of my hands only. I lived there two years and two months. At present I am a sojourner in civilized life again.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;4.900000095367432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I should not obtrude my affairs so much on the notice of my readers if very particular inquiries had not been made by my townsmen concerning my mode of life, which some would call impertinent, though they do not appear to me at all impertinent, but, considering the circumstances, very natural and pertinent. Some have asked what I got to eat; if I did not feel lonesome; if I was not afraid; and the like. Others have been curious to learn what portion of my income I devoted to charitable purposes; and some, who have large families, how many poor children I maintained. I will therefore ask those of my readers who feel no particular interest in me to pardon me if I undertake to answer some of these questions in this book. In most books, the I, or first person, is omitted; in this it will be retained; that, in respect to egotism, is the main difference. We commonly do not remember that it is, after all, always the first person that is speaking. I should not talk so much about myself if there were anybody else whom I knew as well. Unfortunately, I am confined to this theme by the narrowness of my experience. Moreover, I, on my side, require of every writer, first or last, a simple and sincere account of his own life, and not merely what he has heard of other men’s lives; some such account as he would send to his kindred from a distant land; for if he has lived sincerely, it must have been in a distant land to me. Perhaps these pages are more particularly addressed to poor students. As for the rest of my readers, they will accept such portions as apply to them. I trust that none will stretch the seams in putting on the coat, for it may do good service to him whom it fits.&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I would fain say something, not so much concerning the Chinese and Sandwich Islanders as you who read these pages, who are said to live in New England; something about your condition, especially your outward condition or circumstances in this world, in this town, what it is, whether it is necessary that it be as bad as it is, whether it cannot be improved as well as not. I have travelled a good deal in Concord; and everywhere, in shops, and offices, and fields, the inhabitants have appeared to me to be doing penance in a thousand remarkable ways. What I have heard of Brahmins sitting exposed to four fires and looking in the face of the sun; or hanging suspended, with their heads downward, over flames; or looking at the heavens over their shoulders “until it becomes impossible for them to resume their natural position, while from the twist of the neck nothing but liquids can pass into the stomach;” or dwelling, chained for life, at the foot of a tree; or measuring with their bodies, like caterpillars, the breadth of vast empires; or standing on one leg on the tops of pillars,—even these forms of conscious penance are hardly more incredible and astonishing than the scenes which I daily witness. The twelve labors of Hercules were trifling in comparison with those which my neighbors have undertaken; for they were only twelve, and had an end; but I could never see that these men slew or captured any monster or finished any labor. They have no friend Iolas to burn with a hot iron the root of the hydra’s head, but as soon as one head is crushed, two spring up.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.30000001192092896&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;3.200000047683716&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I see young men, my townsmen, whose misfortune it is to have inherited farms, houses, barns, cattle, and farming tools; for these are more easily acquired than got rid of. Better if they had been born in the open pasture and suckled by a wolf, that they might have seen with clearer eyes what field they were called to labor in. Who made them serfs of the soil? Why should they eat their sixty acres, when man is condemned to eat only his peck of dirt? Why should they begin digging their graves as soon as they are born? They have got to live a man’s life, pushing all these things before them, and get on as well as they can. How many a poor immortal soul have I met well nigh crushed and smothered under its load, creeping down the road of life, pushing before it a barn seventy-five feet by forty, its Augean stables never cleansed, and one hundred acres of land, tillage, mowing, pasture, and wood-lot! The portionless, who struggle with no such unnecessary inherited encumbrances, find it labor enough to subdue and cultivate a few cubic feet of flesh.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.9000000953674316&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;But men labor under a mistake. The better part of the man is soon plowed into the soil for compost. By a seeming fate, commonly called necessity, they are employed, as it says in an old book, laying up treasures which moth and rust will corrupt and thieves break through and steal. It is a fool’s life, as they will find when they get to the end of it, if not before. It is said that Deucalion and Pyrrha created men by throwing stones over their heads behind them:—&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Inde genus durum sumus, experiensque laborum,&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Et&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;documenta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;damus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;quâ&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;simus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;origine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nati&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Or, as Raleigh rhymes it in his sonorous way,—&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.6000000238418579&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.6000000238418579&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;“From thence our kind hard-hearted is, enduring pain and care,&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Approving&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;our&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bodies&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stony&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nature&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;are&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.699999988079071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.699999988079071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;So much for a blind obedience to a blundering oracle, throwing the stones over their heads behind them, and not seeing where they fell.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.4000000059604645&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;5.400000095367432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Most men, even in this comparatively free country, through mere ignorance and mistake, are so occupied with the factitious cares and superfluously coarse labors of life that its finer fruits cannot be plucked by them. Their fingers, from excessive toil, are too clumsy and tremble too much for that. Actually, the laboring man has not leisure for a true integrity day by day; he cannot afford to sustain the manliest relations to men; his labor would be depreciated in the market. He has no time to be anything but a machine. How can he remember well his ignorance—which his growth requires—who has so often to use his knowledge? We should feed and clothe him gratuitously sometimes, and recruit him with our cordials, before we judge of him. The finest qualities of our nature, like the bloom on fruits, can be preserved only by the most delicate handling. Yet we do not treat ourselves nor one another thus tenderly.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6000000238418579&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Some of you, we all know, are poor, find it hard to live, are sometimes, as it were, gasping for breath. I have no doubt that some of you who read this book are unable to pay for all the dinners which you have actually eaten, or for the coats and shoes which are fast wearing or are already worn out, and have come to this page to spend borrowed or stolen time, robbing your creditors of an hour. It is very evident what mean and sneaking lives many of you live, for my sight has been whetted by experience; always on the limits, trying to get into business and trying to get out of debt, a very ancient slough, called by the Latins æs alienum, another’s brass, for some of their coins were made of brass; still living, and dying, and buried by this other’s brass; always promising to pay, promising to pay, tomorrow, and dying today, insolvent; seeking to curry favor, to get custom, by how many modes, only not state-prison offences; lying, flattering, voting, contracting yourselves into a nutshell of civility or dilating into an atmosphere of thin and vaporous generosity, that you may persuade your neighbor to let you make his shoes, or his hat, or his coat, or his carriage, or import his groceries for him; making yourselves sick, that you may lay up something against a sick day, something to be tucked away in an old chest, or in a stocking behind the plastering, or, more safely, in the brick bank; no matter where, no matter how much or how little.&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Numerical Analysis&lt;/h3&gt;
&lt;p&gt;The output includes &lt;strong&gt;score&lt;/strong&gt;, &lt;strong&gt;magnitude&lt;/strong&gt; and &lt;strong&gt;text&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I load each CSV into Pandas for analysis, for example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;una_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;unabomber_sentiment.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As a quick sanity check, I used &lt;a href="https://numpy.org/"&gt;numpy&lt;/a&gt; to identify the most negative text from each author.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;una_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;una_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;una_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first line of the &lt;a href="https://en.wikipedia.org/wiki/Caesar_and_Pompey"&gt;Chapman&lt;/a&gt; quote clocks in as the most negative (-0.9) Thoreau sentiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The false society of men—&lt;/p&gt;
&lt;p&gt;for earthly greatness&lt;/p&gt;
&lt;p&gt;All heavenly comforts rarefies to air. [Thoreau 141]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kaczynski includes three paragraphs tied for most negative (-0.6), so I selected the paragraph with the highest magnitude (8.7):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Oversocialization can lead to low self-esteem, a sense of powerlessness, defeatism, guilt, etc. One of the most important means by which our society socializes children is by making them feel ashamed of behavior or speech that is contrary to society's expectations. If this is overdone, or if a particular child is especially susceptible to such feelings, he ends by feeling ashamed of HIMSELF... In many oversocialized people this results in a sense of constraint and powerlessness that can be a severe hardship. We suggest that oversocialization is among the more serious cruelties that human beings inflict on one another. [Kaczynski 26]&lt;/p&gt;
&lt;/blockquote&gt;
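&lt;p&gt;The tie-break above reduces to two lines of Pandas: filter to the rows tied at the minimum score, then take the row with the largest magnitude.  A sketch with toy stand-in data:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for una_df: three rows tied at the minimum score of -0.6
una_df = pd.DataFrame({
    'score':     [-0.6, -0.6, -0.6, -0.1],
    'magnitude': [ 3.1,  8.7,  5.2,  0.4],
    'text':      ['first tie', 'strongest tie', 'third tie', 'mild'],
})

ties = una_df[una_df.score == una_df.score.min()]  # rows tied for most negative
winner = ties.loc[ties.magnitude.idxmax()]         # highest magnitude breaks the tie
print(winner.text)  # -> strongest tie
```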
&lt;p&gt;This quick sanity check suggests that the Google NLP model identifies sentiment correctly.&lt;/p&gt;
&lt;h3&gt;Graphical Analysis&lt;/h3&gt;
&lt;p&gt;Since each work includes hundreds of paragraphs, I use Data Visualization (Data Viz) in the form of a Histogram to summarize the output data.&lt;/p&gt;
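&lt;p&gt;A minimal sketch of how such a histogram could be generated with &lt;a href="https://matplotlib.org/"&gt;matplotlib&lt;/a&gt;.  The column names follow the CSV schema above; the function and file names are illustrative:&lt;/p&gt;

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt


def score_histogram(df, title, outfile):
    # Scores span [-1, 1]; zero marks a neutral sentiment
    fig, ax = plt.subplots()
    ax.hist(df.score, bins=20, range=(-1.0, 1.0))
    ax.set_xlabel('Sentiment score')
    ax.set_ylabel('Paragraph count')
    ax.set_title(title)
    fig.savefig(outfile)
    plt.close(fig)


# e.g. score_histogram(pd.read_csv('unabomber_sentiment.csv'),
#                      'Unabomber Manifesto', 'unabomber_hist.png')
```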
&lt;p&gt;The following Histogram records the sentiment of "The Unabomber Manifesto."  Note that the paragraphs skew negative.  &lt;/p&gt;
&lt;p&gt;Zero indicates a &lt;strong&gt;neutral&lt;/strong&gt; sentiment.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Score Histogram" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/03_Unabomber_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Compare Kaczynski’s Data Viz to Thoreau’s.  Thoreau's paragraphs provide a symmetrical Histogram, and most of the paragraphs land in the &lt;strong&gt;neutral&lt;/strong&gt; zone.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Thoreau Score Histogram" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/04_Walden_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Remember that the Google API returns both &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;.  We need to include the &lt;strong&gt;magnitude&lt;/strong&gt; data in the analysis to get a feel for the overall intensity of emotion.&lt;/p&gt;
&lt;p&gt;I use a bivariate density plot, which looks like a smooth sheet placed on top of a blocky, two dimensional histogram.  I use Kernel Density Estimation (KDE) to represent the frequency of the data in each &lt;strong&gt;bucket&lt;/strong&gt; in a continuous way (vs. the discrete &lt;strong&gt;count&lt;/strong&gt; found in histograms).  I follow the &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html"&gt;SciPy docs&lt;/a&gt; to make the chart.&lt;/p&gt;
&lt;p&gt;The colors represent the &lt;strong&gt;density&lt;/strong&gt;.  The darker the color, the more instances of a particular &lt;strong&gt;score/magnitude&lt;/strong&gt; pair.  The black dots represent the actual data points.&lt;/p&gt;
&lt;p&gt;Note that I multiply the Score by ten in order to make the Data Viz more readable.&lt;/p&gt;
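&lt;p&gt;A minimal sketch of the bivariate density plot, following the &lt;strong&gt;gaussian_kde&lt;/strong&gt; recipe in the SciPy docs.  The column names assume the CSV schema above, the score scaling matches the times-ten note, and the function name is illustrative:&lt;/p&gt;

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt


def density_plot(df, title, outfile):
    x = df.score.to_numpy() * 10           # scale score for readability
    y = df.magnitude.to_numpy()
    kde = gaussian_kde(np.vstack([x, y]))  # fit a 2-D kernel density estimate
    xi, yi = np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
    zi = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)
    fig, ax = plt.subplots()
    ax.pcolormesh(xi, yi, zi, shading='auto')  # shade by density
    ax.contour(xi, yi, zi)                     # contour lines over the sheet
    ax.scatter(x, y, s=4, color='black')       # the actual data points
    ax.set_xlabel('Score x 10')
    ax.set_ylabel('Magnitude')
    ax.set_title(title)
    fig.savefig(outfile)
    plt.close(fig)


# e.g. density_plot(pd.read_csv('walden_sentiment.csv'), 'Walden', 'walden_density.png')
```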
&lt;p&gt;&lt;img alt="Walden Bivariate Density Plot" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/05_Walden_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;Again, we see that Thoreau's text concentrates around a neutral tone, with higher magnitudes clustered around scores of zero.&lt;/p&gt;
&lt;p&gt;Contrast Thoreau’s Data Viz with Kaczynski’s.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Bivariate Density Plot" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/06_Unabomber_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;The near forty-five degree angle of the contour map (the blue, green and yellow oval) indicates strong correlation between sentiment and magnitude.  The more negative Kaczynski writes, the stronger his emotions.  &lt;/p&gt;
&lt;p&gt;Overall, however, most of his text lands in the &lt;strong&gt;medium-negative&lt;/strong&gt; sentiment range.&lt;/p&gt;
&lt;h2&gt;Literary Analysis&lt;/h2&gt;
&lt;p&gt;Ted Kaczynski and Henry David Thoreau (despite the separation of a century) lived lives of uncanny similarity. Both tackled their fields of expertise in unorthodox, pioneering, and peerless ways, and both graduated from Harvard. Both tried their hand at teaching, but eventually withdrew from the profession and instead became &lt;strong&gt;hermits&lt;/strong&gt;, living in modest shanties in the wilderness. During their seclusion from society they produced their most influential works: &lt;a href="https://www.gutenberg.org/files/205/205-h/205-h.htm"&gt;Walden&lt;/a&gt; and &lt;a href="https://search.brave.com/search?q=unabomber+manifesto&amp;amp;source=web"&gt;The Manifesto&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Common View of Technology and Over-Socialization&lt;/h3&gt;
&lt;p&gt;The first (and most prevalent) issue found in both works is the way technology and over-socialization diminish the human experience. The Unabomber opens his tirade with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Industrial Revolution and its consequences have been disastrous for the human race... [it has] reduce(d) human beings and many other living organisms to engineered products and mere cogs in the social machine [Kaczynski 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau did not live long enough to experience the far-reaching effects of our industrial infancy, but he still provides insight into the pitfalls of devoting one’s life to &lt;strong&gt;unnecessary&lt;/strong&gt; industrial labor. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The laboring man has not leisure for a true integrity day by day; he cannot afford to sustain the manliest relations to men; his labor would be depreciated in the market. He has no time to be anything but a machine [Thoreau1 491]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau’s lambaste against &lt;strong&gt;model farms&lt;/strong&gt; in &lt;strong&gt;Chapter IX: The Ponds&lt;/strong&gt; criticizes technology’s tendency to reduce men and animals to cogs in a machine. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A model farm! where the house stands like a fungus in a muck-heap, chambers for men, horses, oxen and swine, cleansed and uncleansed, all contiguous to one another! Stocked with men! A great grease spot, redolent of manures and buttermilk! under a high state of cultivation, being manured with the hearts and brains of men! As if you were to raise your potatoes in the churchyard! Such is a model farm! [Thoreau1 593]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kaczynski expresses the opinion that technology’s minimizing effect on an individual's importance in society instills a sense of worthlessness in the general populace by taking away their autonomy and power (which he groups together under his own term, &lt;strong&gt;The Power Process&lt;/strong&gt;). He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When an individual does not have opportunity to go throughout the power-process the consequences are boredom, demoralization, low self-esteem, inferiority feelings [sic], defeatism, depression, anxiety, guilt, frustration ...[etc.] [Kaczynski 44]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau gets the point of human demoralization across more elegantly, and without invented psychological jargon, when he writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The mass of men lead lives of quiet desperation... A stereotyped but unconscious despair is concealed even under what are called games and amusements of mankind. There is no play in them, for this comes after work [Thoreau 1492].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau later reveals that an individual should remove herself from the influence of technology and become one with nature in order to escape depression. In &lt;strong&gt;Chapter V: Solitude&lt;/strong&gt;, he writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There can be no very black melancholy to him who lives in the midst of nature and has his senses still... Nothing can rightly compel a simple and brave man to a vulgar sadness. While I enjoy the friendship of the seasons I trust that nothing can make a burden to me [Thoreau 1559]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Unabomber shares the same sentiment that modern man should go back to nature to escape ennui. Unlike Thoreau, however, he gives a scientific reason to explain our difficulty in achieving contentment when removed from nature.  He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We attribute the social and psychological problems of modern society to the fact that society requires people to live under conditions radically different from those under which the human race has evolved and to behave in ways that conflict with the patterns of behavior that the human race developed while living under the earlier conditions [Kaczynski 46]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Common View of Workaholic/Novelty Culture&lt;/h3&gt;
&lt;p&gt;Thoreau and Kaczynski both focus on the detrimental effects of our workaholic/consumer culture, in which workers put in long hours to pay for disposable novelty items and engage in meaningless activities for distraction. Kaczynski writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...even if they have a great deal of money, [modern workers] cannot satisfy their constant craving for the shiny new toys that the marketing industry dangles before their eyes.  So, they always feel hard pressed financially, even if their income is large, and their cravings are frustrated [Kaczynski 80]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau also attacks the consumer mentality on numerous occasions, stressing that a simple life trumps a life spent accumulating junk and requires much less effort. For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...if one would live simply and eat only the crop he raised, and raise no more than he ate, and not exchange it for an insufficient quantity of more luxurious and expensive things ... he could do all his farm work with his left hand at odd hours of the summer [Thoreau1 518]&lt;/p&gt;
&lt;p&gt;...if working were not my trade, I could get all the meat I should want by hunting... I could get all I should want for one week in one day [Thoreau 1566]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He concludes on this theme with a powerful and succinct statement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Superfluous wealth can buy superfluities only. Money is not required to buy one necessary of the soul [Thoreau 1660].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Unabomber manifesto brings up an interesting aspect of human nature, which he labels the theory of &lt;strong&gt;surrogate activities&lt;/strong&gt;.  He writes that because technological society takes care of our basic needs, we humans have to invent artificial needs in order to feel satisfied with our lives: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A surrogate activity is an activity that is directed toward an artificial goal that the individual pursues for the sake of the "fulfillment" that he gets from pursuing the goal, not because he needs to attain the goal itself. For instance there is no practical motive for building enormous muscles, hitting a little ball in a hole or acquiring a complete series of postage stamps. Yet many people in our society devote themselves with passion to bodybuilding, golf, or stamp collecting [Kaczynski 84]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau criticizes this same notion of devoting one’s life to the pursuit of &lt;strong&gt;nonsense&lt;/strong&gt;, and not concentrating on what’s really important in life:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...It is easier to sail many thousand miles through cold and storm and cannibals ... than it is to explore the private sea, the Atlantic and the Pacific ocean of one’s being alone... It is not worth the while to go round the world to count the cats in Zanzibar. Yet do this even till you can do better, and you may perhaps find some “Symmes’ Hole” by which to get at the inside at last... if you would learn to speak all tongue and conform to the customs of all nations, if you would travel farther than all travelers...and cause the Sphinx to dash her head against the stone [Thoreau 1657] &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I predict that the Unabomber would label the surrogate activities that Thoreau just mentioned &lt;strong&gt;travel-oriented surrogate activities&lt;/strong&gt;.  In addition to these travel-oriented activities, Thoreau criticizes other &lt;strong&gt;surrogate activities&lt;/strong&gt; in Walden. For example, he considers the hobby of reading junk novels a useless pastime:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Most men are satisfied if they read or hear read, and perchance have been [convinced] by the wisdom of one good book, the Bible, and for the rest of their lives vegetate and dissipate their faculties in what is called easy reading... the result is dullness of sight, a stagnation of the vital circulations, and a general [sinking] and sloughing off of all the intellectual faculties [Thoreau 1545]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Unabomber lists the acquisition of useless junk as a surrogate activity, writing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...many people put into their work far more effort than is necessary to earn whatever... they desire and this extra effort constitutes a surrogate activity [Kaczynski 84]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau also lambastes the &lt;strong&gt;collecting surrogate activity&lt;/strong&gt;. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...as I preferred some things to others I especially valued my freedom ... I did not wish to spend my time in earning rich carpets... delicate cookery, or a house in the Grecian or the Gothic style just yet [Thoreau 1526]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau suggests that nonspiritual, superficial pastimes do not substitute for the higher principles of Self, God, or freedom.  The Unabomber wraps a similar sentiment in psychological jargon. &lt;/p&gt;
&lt;h3&gt;Common View of Rage Against the Machine&lt;/h3&gt;
&lt;p&gt;Kaczynski and Thoreau both desire to eliminate &lt;strong&gt;the Machine&lt;/strong&gt;, what they label the ugly and evil influence of technology, from the world.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Earth Crisis Destroy the Machines Album Art" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/07_Earth_Crisis.png"&gt;&lt;/p&gt;
&lt;p&gt;Kaczynski writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;it is necessary to develop and propagate an ideology that opposes technology and the industrial society... the factories should be destroyed, technical books burned, etc. [Kaczynski 165]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau considers the railroad (arguably the springboard for America’s industrial revolution) &lt;strong&gt;the machine&lt;/strong&gt;.  Thoreau spares the railroad no mercy or clemency.  He perceives the railroad as a threat to nature, the same way the Unabomber perceives 20th-century industrial society as a threat to nature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That devilish Iron Horse, whose ear-rending neigh is heard throughout town, has muddied the Boiling Spring with his foot, and he it is that has browsed off all the woods on Walden shore, that Trojan horse, with a thousand men in his belly, introduced by mercenary Greeks! Where is the country’s champion, the [dragon slayer] to meet him at the Deep Cut and thrust an avenging lance between the ribs of the bloated pest? [Thoreau 1591]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau wants mankind to preserve, if not outright return to, the natural habitat it shares the Earth with. The Unabomber shares this sentiment. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An ideology, in order to gain enthusiastic support... must be FOR something, as well as AGAINST something. The positive ideal [I] propose [in my manifesto] is Nature. That is, WILD nature: those aspects of the functioning of the Earth and its living things that are independent of human management and free of human interference and control [Kaczynski 183]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The two works provide more parallels that pertain to important social issues. The Unabomber, for example, writes: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Instead of removing the conditions that make people depressed, modern society gives them antidepressant drugs. [Kaczynski 145]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;His statement provides an uncanny response to Thoreau’s rhetorical question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the pill which will keep us well, serene, contented? [Thoreau 1563]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Differences&lt;/h3&gt;
&lt;p&gt;The two works have their differences in addition to their similarities. First of all, the writing styles of the two authors clash quite severely. Thoreau uses a flowery, poetic style and injects a sense of humor into the text (see the “Cenobites” pun).  Kaczynski uses a staccato, scientific and analytic voice, and his work contains not a single iota of levity.&lt;/p&gt;
&lt;p&gt;In addition, &lt;strong&gt;Walden&lt;/strong&gt; includes an overall tone of optimism, whereas Kaczynski focuses on dire pessimism. Finally, Thoreau’s “tangents” mainly describe nature, his possessions, how he built his house, etc., which all deal with his environment. Kaczynski's tangents, on the other hand, focus on political themes.&lt;/p&gt;
&lt;p&gt;Thoreau's work speaks for itself, and society recognizes the value of &lt;strong&gt;Walden&lt;/strong&gt; solely by his writing talents.  Thoreau did not need to partake in any extraneous activities to bring attention to it. Thoreau did not need to use violence to set himself apart from his contemporaries. &lt;/p&gt;
&lt;p&gt;Contrast this with Kaczynski, who clearly states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In order to get our message before the public with some chance of making a lasting impression, we've had to kill people [Kaczynski 96]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kaczynski did in fact resort to violence, and killed people.  We must not forget his cowardly actions.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Based on literary analysis, Thoreau and Kaczynski see eye to eye in relation to their works’ major themes. Both authors appear to be steadfast in their pro-nature (and all the good that comes from it) / anti-technology (and all the ills that are a result of it) convictions. &lt;/p&gt;
&lt;p&gt;AI inference provides hard numbers that indicate Kaczynski communicates in a strong, negative tone and Thoreau uses a neutral tone in terms of both emotion and intensity.&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="NLP"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Don't Move Your Data! In-Situ Machine Learning via BigQuery</title><link href="https://john.soban.ski/bigquery-ml.html" rel="alternate"></link><published>2021-01-30T08:07:00-05:00</published><updated>2021-01-30T08:07:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-01-30:/bigquery-ml.html</id><summary type="html">&lt;p&gt;I started my AI/ML journey in 2011 with a &lt;strong&gt;laptop model&lt;/strong&gt;, a term which indicates a measure of size.  &lt;strong&gt;Laptop models&lt;/strong&gt;, by definition, do not exceed the compute, memory and storage resources of a single piece of hardware.  The &lt;strong&gt;laptop model&lt;/strong&gt; approach works well for small data sets, and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I started my AI/ML journey in 2011 with a &lt;strong&gt;laptop model&lt;/strong&gt;, a term which indicates a measure of size.  &lt;strong&gt;Laptop models&lt;/strong&gt;, by definition, do not exceed the compute, memory and storage resources of a single piece of hardware.  The &lt;strong&gt;laptop model&lt;/strong&gt; approach works well for small data sets, and modern hardware accommodates a few dozen GigaBytes (GB) of data with no issues.  &lt;/p&gt;
&lt;p&gt;The following cartoon demonstrates the laptop approach to model training and serving.  (I use a &lt;strong&gt;brain&lt;/strong&gt; icon to represent the ML model.)&lt;/p&gt;
&lt;p&gt;&lt;img alt="Laptop Model" src="https://john.soban.ski/images/Bigquery_Ml/01_Laptop_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;I still build, on occasion, &lt;strong&gt;laptop models&lt;/strong&gt; when I have a small data set and I do not have access to the cloud.  In general, however, I train models on &lt;strong&gt;Big Data&lt;/strong&gt;, or data at PetaByte (PB) and higher scale. The &lt;strong&gt;laptop model&lt;/strong&gt; approach, unfortunately, breaks down in the face of &lt;strong&gt;Big Data&lt;/strong&gt;.  Consider, for example, an attempt to train a ten (10) PB Dataset on a laptop.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Big Data Kaboom" src="https://john.soban.ski/images/Aggregations_The_Elasticsearch_Group_By/02_Big_Data_Kaboom.png"&gt;&lt;/p&gt;
&lt;p&gt;I do not know of a laptop that can accommodate 10 PB, and even if one existed, the compute and memory resources would not be able to train and serve at that scale.  The cost and latency, furthermore, involved in transferring that amount of data across the Network also prohibits this approach.&lt;/p&gt;
&lt;p&gt;Data Scientists, therefore, &lt;strong&gt;sample&lt;/strong&gt; Big Datasets in order to work around the network, cost and resource constraints associated with &lt;strong&gt;Big Data&lt;/strong&gt; driven &lt;strong&gt;laptop models&lt;/strong&gt;.  A &lt;strong&gt;sampled&lt;/strong&gt; Dataset enables the Data Scientist to train and serve models on their laptops.  &lt;/p&gt;
&lt;p&gt;The following cartoon, for example, illustrates a Data Scientist who downloads one out of every one million (1M) rows at random.  This reduces the data set from a cumbersome 10 PB to a manageable 10 GB.&lt;/p&gt;
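To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python.  The 10 PB figure and the one-in-a-million sampling rate come from the cartoon; decimal (powers-of-ten) byte units are assumed:

```python
# Back-of-the-envelope check of the sampling math above.
PB = 10**15  # bytes per petabyte (decimal convention)
GB = 10**9   # bytes per gigabyte

dataset_bytes = 10 * PB       # the full 10 PB Big Dataset
sample_rate = 1 / 1_000_000   # keep one row out of every million, at random

sampled_bytes = dataset_bytes * sample_rate
print(sampled_bytes / GB)     # -> 10.0 (a manageable 10 GB)
```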
&lt;p&gt;&lt;img alt="Sampled Model" src="https://john.soban.ski/images/Bigquery_Ml/02_Sampled_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;sampling&lt;/strong&gt; approach works from a technical standpoint.  In order to accommodate resource constraints, however, the Data Scientist must &lt;strong&gt;throw away&lt;/strong&gt; 99.9999% of the data.  The ignored data may contain interesting outliers that could, for example, predict &lt;strong&gt;black swan&lt;/strong&gt; events.  So, from an information standpoint, the &lt;strong&gt;sampling&lt;/strong&gt; approach lacks utility.&lt;/p&gt;
&lt;h2&gt;How can Data Scientists train and serve models on &lt;strong&gt;Big Data&lt;/strong&gt;?&lt;/h2&gt;
&lt;p&gt;The issues with &lt;strong&gt;laptop models&lt;/strong&gt; and &lt;strong&gt;sampling&lt;/strong&gt; approaches result from the attempts of the Data Scientist to &lt;strong&gt;Bring the Data to the processing&lt;/strong&gt;.  A better approach, therefore, would be to &lt;strong&gt;bring the processing to the Data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Google Cloud Platform (GCP) &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; service now enables Data Scientists to train models in place (or &lt;strong&gt;in-situ&lt;/strong&gt;).  They can train and serve models on their BigQuery Datasets without the need to move the data outside of BigQuery.&lt;/p&gt;
&lt;p&gt;The following Cartoon captures this phenomenon:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bq Model" src="https://john.soban.ski/images/Bigquery_Ml/03_Bq_Model.png"&gt;&lt;/p&gt;
&lt;h2&gt;Test Drive BigQuery ML&lt;/h2&gt;
&lt;p&gt;We now will test drive the in-situ BigQuery ML and AutoML services, which allow us to train and serve models without the need to transfer the data out of BigQuery.&lt;/p&gt;
&lt;h3&gt;Add Data to BigQuery&lt;/h3&gt;
&lt;p&gt;If you have data in BigQuery, you can test drive BigQuery ML immediately.  I will import the &lt;a href="https://archive.ics.uci.edu/ml/datasets/wine+quality"&gt;UCI Wine Quality Dataset&lt;/a&gt; into BigQuery.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Site" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/14_Wine_Site.png"&gt;&lt;/p&gt;
&lt;p&gt;I discussed the UCI Wine Quality Dataset in last month's &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;discussion of GCP AutoML Tables&lt;/a&gt;.  Please open that link in a new tab to read a description of the data, along with a discussion of the quality of the GCP AutoML generated models.&lt;/p&gt;
&lt;p&gt;Last month, I &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;uploaded the UCI Wine Quality Dataset to a Google Cloud Storage bucket&lt;/a&gt;.  I will now import data from that bucket into BigQuery.  If you have issues with importing the Wine Quality Dataset into BigQuery via a GCS Bucket, please see &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;that blog post&lt;/a&gt; for reference. &lt;/p&gt;
&lt;p&gt;The BigQuery console provides a list of pinned projects.  Select your project from the list.  Google named my project &lt;strong&gt;shining chain&lt;/strong&gt;.  Google will provide you with a different, randomly generated name.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Bq Project" src="https://john.soban.ski/images/Bigquery_Ml/04_Select_Bq_Project.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Create Dataset&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Create Dataset" src="https://john.soban.ski/images/Bigquery_Ml/05_Select_Create_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;Name the dataset &lt;strong&gt;wine_dataset&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Name Dataset" src="https://john.soban.ski/images/Bigquery_Ml/06_Name_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;The BigQuery console now lists &lt;strong&gt;wine_dataset&lt;/strong&gt; under your project name.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;wine_dataset&lt;/strong&gt; and then the &lt;strong&gt;PLUS (+)&lt;/strong&gt; sign on the upper right in order to add a table.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Add Data" src="https://john.soban.ski/images/Bigquery_Ml/07_Add_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Create Table From Cloud Storage&lt;/strong&gt; and then browse for your bucket.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Browse Bucket" src="https://john.soban.ski/images/Bigquery_Ml/08_Browse_Bucket.png"&gt;&lt;/p&gt;
&lt;p&gt;If you do not have a bucket with the &lt;strong&gt;wine quality&lt;/strong&gt; dataset loaded, you can click the &lt;strong&gt;swiss lunch pail&lt;/strong&gt; to create a new bucket now.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bucket List" src="https://john.soban.ski/images/Bigquery_Ml/09_Bucket_List.png"&gt;&lt;/p&gt;
&lt;p&gt;Add a table name (I named it &lt;strong&gt;wine_red&lt;/strong&gt;), select &lt;strong&gt;auto-detect&lt;/strong&gt; schema and save to close the wizard.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Configure Table" src="https://john.soban.ski/images/Bigquery_Ml/10_Configure_Table.png"&gt;&lt;/p&gt;
&lt;p&gt;Auto-schema inferred &lt;strong&gt;FLOAT&lt;/strong&gt; for all of our features, and &lt;strong&gt;INTEGER&lt;/strong&gt; for our &lt;strong&gt;label&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Auto Schema" src="https://john.soban.ski/images/Bigquery_Ml/11_Auto_Schema.png"&gt;&lt;/p&gt;
&lt;p&gt;Select the &lt;strong&gt;wine_red&lt;/strong&gt; table and then select preview to get a look at the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Training" src="https://john.soban.ski/images/Bigquery_Ml/12_Data_Preview.png"&gt;&lt;/p&gt;
&lt;h3&gt;Linear Regression Model&lt;/h3&gt;
&lt;p&gt;We will now create our first model.  To train a model in BigQuery we simply point BigQuery to a table, select the desired features and then indicate a &lt;strong&gt;LABEL&lt;/strong&gt; feature, or target.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;wine quality dataset&lt;/a&gt; includes chemical markers and a rating from zero (0) to ten (10).  Our wine model looks at the &lt;strong&gt;wine quality&lt;/strong&gt; data to learn the relationship between these chemical markers and the &lt;strong&gt;quality&lt;/strong&gt;.  In other words, given a set of chemical markers, our model will predict the rating.  For this reason, command BigQuery to use the &lt;strong&gt;quality&lt;/strong&gt; feature for the target &lt;strong&gt;label&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Use &lt;strong&gt;SELECT AS&lt;/strong&gt; syntax to indicate the &lt;strong&gt;target&lt;/strong&gt; feature&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;BigQuery allows us to train a model via the &lt;strong&gt;CREATE MODEL&lt;/strong&gt; SQL command.  In the following SQL statement, I tell BigQuery to use the &lt;strong&gt;wine_red&lt;/strong&gt; table, located in the &lt;strong&gt;wine_dataset&lt;/strong&gt; Dataset of my &lt;strong&gt;shining-chain&lt;/strong&gt; project.  I command BigQuery to create a model named &lt;strong&gt;model&lt;/strong&gt; and place it in that same Dataset.  I use an SQL &lt;strong&gt;OPTIONS&lt;/strong&gt; stanza to set the &lt;strong&gt;model_type&lt;/strong&gt; to &lt;strong&gt;LINEAR_REG&lt;/strong&gt;.  Note that I include syntax that reads &lt;strong&gt;SELECT quality AS label FROM wine_red&lt;/strong&gt; (I summarize it here).  This instructs BigQuery to set the target feature to &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create The Model" src="https://john.soban.ski/images/Bigquery_Ml/13_Create_The_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;Read the SQL statement below to see the logic in action.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LINEAR_REG&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once the model completes the train stage, click on &lt;strong&gt;execution details&lt;/strong&gt;.  You will see that BigQuery used parallel processing to execute two (2) minutes' worth of processing in fourteen (14) seconds. &lt;/p&gt;
&lt;p&gt;&lt;img alt="Execution Details" src="https://john.soban.ski/images/Bigquery_Ml/14_Execution_Details.png"&gt;&lt;/p&gt;
&lt;p&gt;BigQuery also reports the model success metrics.&lt;/p&gt;
&lt;p&gt;Click the &lt;strong&gt;Results&lt;/strong&gt; tab and click &lt;strong&gt;Go to Model&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;img alt="Results View" src="https://john.soban.ski/images/Bigquery_Ml/15_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;In accordance with (IAW) our SQL statement, BigQuery named our model &lt;strong&gt;model&lt;/strong&gt; and stored it in the &lt;strong&gt;wine_dataset&lt;/strong&gt; Dataset, which lives in the &lt;strong&gt;shining-chain&lt;/strong&gt; project.  Click &lt;strong&gt;model&lt;/strong&gt;, click &lt;strong&gt;Evaluation&lt;/strong&gt; and BigQuery will print the metrics.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model Evaluation" src="https://john.soban.ski/images/Bigquery_Ml/16_Model_Evaluation.png"&gt;&lt;/p&gt;
&lt;p&gt;Our first model produces a MSE of &lt;strong&gt;0.4374&lt;/strong&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean absolute error&lt;/td&gt;
&lt;td&gt;0.5107&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean squared error&lt;/td&gt;
&lt;td&gt;0.4374&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean squared log error&lt;/td&gt;
&lt;td&gt;0.0105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median absolute error&lt;/td&gt;
&lt;td&gt;0.4022&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;R squared&lt;/td&gt;
&lt;td&gt;0.3039&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The MSE maps to a Root Mean Square Error (RMSE) of about &lt;strong&gt;0.6614&lt;/strong&gt;.&lt;/p&gt;
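The conversion from MSE to RMSE is just a square root; a quick check in Python, using the MSE value from the evaluation table above:

```python
import math

mse = 0.4374           # Mean squared error from the Evaluation tab
rmse = math.sqrt(mse)  # RMSE is the square root of the MSE
print(round(rmse, 4))  # -> 0.6614
```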
&lt;p&gt;Last month, &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;we tackled the Wine Quality Dataset with a variety of models&lt;/a&gt; and compared the results.&lt;/p&gt;
&lt;p&gt;I capture the results table below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.6327900&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.6449177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.6483683&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;BigQuery Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.6613622&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.7061977&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.7350416&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.8012159&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;BigQuery's &lt;strong&gt;LINEAR_REG&lt;/strong&gt; model out-performed the two Tensorflow models executed with default parameters.  BigQuery also beat the &lt;strong&gt;Guess Mean&lt;/strong&gt; approach, which provides a good pace car for all of our investigations.&lt;/p&gt;
&lt;p&gt;Note that a Linear Model applied to our feature-reduced dataset landed in second place last month, which supports the claim that too many features lead to over-fitting and therefore lower performance.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Click the &lt;strong&gt;training&lt;/strong&gt; tab and BQ provides training statistics.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model Train Time" src="https://john.soban.ski/images/Bigquery_Ml/17_Model_Train_Time.png"&gt;&lt;/p&gt;
&lt;p&gt;If you click &lt;strong&gt;details&lt;/strong&gt;, you will see that BQ only created one model and stopped.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Default Training Options" src="https://john.soban.ski/images/Bigquery_Ml/18_Default_Training_Options.png"&gt;&lt;/p&gt;
&lt;p&gt;We commanded BigQuery to use the default configurations for learning rate, regularization and optimizer, create a model and stop there.  We can increase the performance of our model, therefore, by tuning these Hyperparameters.  Adam, for example, may perform better than Stochastic Gradient Descent.  In general, Data Scientists will run through a variety of hyperparameter settings, iterate and then keep the best performing set of configuration options.&lt;/p&gt;
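To illustrate the iterate-and-keep-the-best pattern that the Data Scientist would otherwise run by hand, here is a minimal grid-search sketch in Python.  The candidate values and the toy loss function are illustrative placeholders, not BigQuery's actual search space or API:

```python
import itertools

# Illustrative hyperparameter grid (placeholder values, not BigQuery's).
learn_rates = [0.001, 0.01, 0.1]
l2_regs = [0.0, 0.1, 1.0]
optimizers = ["sgd", "adam"]

def validation_loss(lr, l2, opt):
    """Stand-in for an expensive train-and-evaluate run."""
    # A made-up score so the sketch runs end to end without a cloud account.
    bonus = 0.05 if opt == "adam" else 0.0
    return (lr - 0.01) ** 2 + (l2 - 0.1) ** 2 - bonus

# Try every combination and keep the configuration with the lowest loss.
best = min(
    itertools.product(learn_rates, l2_regs, optimizers),
    key=lambda cfg: validation_loss(*cfg),
)
print(best)  # -> (0.01, 0.1, 'adam'): the lowest-loss configuration
```

AutoML services automate exactly this loop (usually with smarter search strategies than an exhaustive grid), which frees the operator from babysitting each iteration.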
&lt;p&gt;In the past, Data Scientists needed to tune these parameters &lt;strong&gt;by hand&lt;/strong&gt;.  The AI/ML industry, however, now provides a host of &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;AutoML solutions&lt;/a&gt;, that execute model tuning without the need for operator involvement.&lt;/p&gt;
&lt;p&gt;BigQuery, in fact, just unlocked a Beta service that allows us to execute &lt;strong&gt;AutoML&lt;/strong&gt; in-situ.&lt;/p&gt;
&lt;p&gt;We discuss &lt;strong&gt;In-Situ AutoML&lt;/strong&gt; via BigQuery next month.  See you then!&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Fast and Easy ML Optimization with GCP AutoML Tables (Beta)</title><link href="https://john.soban.ski/fast-and-easy-automl-optimize.html" rel="alternate"></link><published>2020-12-26T04:19:00-05:00</published><updated>2020-12-26T04:19:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-12-26:/fast-and-easy-automl-optimize.html</id><summary type="html">&lt;p&gt;Model optimization on traditional Artificial Intelligence and Machine Learning (AI/ML) platforms requires considerable Data Architect expertise and judgement.  These ML platforms require the Architect to choose from dozens of available training algorithms. The platforms also provide a host of hyper-parameter knobs and switches for the Architect to tune.  The …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Model optimization on traditional Artificial Intelligence and Machine Learning (AI/ML) platforms requires considerable Data Architect expertise and judgement.  These ML platforms require the Architect to choose from dozens of available training algorithms. The platforms also provide a host of hyper-parameter knobs and switches for the Architect to tune.  The deluge of choice requires the Architect to iterate on both algorithm selection and hyper-parameter values, a time consuming proposition.&lt;/p&gt;
&lt;p&gt;AutoML services &lt;strong&gt;democratize&lt;/strong&gt; model development via no-code, graphical user interface (GUI) based optimization services.  We discuss the Google Cloud Platform's (GCP) &lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;AutoML Vision&lt;/a&gt; service in an &lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;earlier blog post&lt;/a&gt;.  In this blog post we discuss the GCP &lt;a href="https://cloud.google.com/automl-tables"&gt;AutoML Tables Beta&lt;/a&gt; service.&lt;/p&gt;
&lt;h2&gt;The Data Set&lt;/h2&gt;
&lt;p&gt;The GCP AutoML Tables Beta service requires structured, Data Frame encoded data.  To test drive the service, we use the &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;BUPA Liver Disorders&lt;/a&gt; data set.  For a refresher on the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; data set, please right click and open one or more of the following blog posts in a new tab (or set of tabs):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Applying a Reduced Columb Energy (RCE) Neural Network to the BUPA Liver Disorders Data Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical introduction to Probabilistic Neural Networks (PNN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/refactor-matlab-to-tidyverse.html"&gt;Refactoring Matlab Code to R Tidyverse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 1 - Data Exploration &amp;amp; First Models)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 2 - Dimensionality Reduction)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the latter two blog posts, we crunch the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; data set in TensorFlow via &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Neural Net and Linear Regression Models&lt;/a&gt; and reduce model over-fit via &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;dimensionality reduction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following table captures the results of our model iteration:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Our &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;TensorFlow 2.3&lt;/a&gt; and &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Keras 2.3&lt;/a&gt; investigations use the Root Mean Square Error (RMSE) success metric for model tuning&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We iterate over the different model scenarios and draw the interesting conclusion that simply guessing the mean of the &lt;strong&gt;training&lt;/strong&gt; data set yields the best results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google AutoML Beta&lt;/strong&gt; executes a &lt;strong&gt;battle of the bands&lt;/strong&gt; and iterates through dozens of algorithm choices.  For each algorithm, the service tunes &lt;strong&gt;hyperparameters&lt;/strong&gt;, to include the number of layers, the learning rate and the number of features.&lt;/p&gt;
&lt;p&gt;Let's see if the Google AutoML service can beat our idiotic, yet successful &lt;strong&gt;guess mean&lt;/strong&gt; algorithm.&lt;/p&gt;
&lt;h2&gt;Enable GCP AutoML tables&lt;/h2&gt;
&lt;p&gt;Engineers at Google call the menu selection icon the &lt;strong&gt;hamburger&lt;/strong&gt;, a bit of slang that simultaneously makes me laugh and makes me hungry.  Click the &lt;strong&gt;hamburger&lt;/strong&gt; icon in the upper left corner and then scroll down to &lt;strong&gt;Artificial Intelligence&lt;/strong&gt; and select &lt;strong&gt;Tables --&amp;gt; Datasets&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hamburger Menu" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/01_Hamburger.png"&gt;&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;enable the API&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enable API" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/02_Enable_API.png"&gt;&lt;/p&gt;
&lt;p&gt;Once we enable the API, click &lt;strong&gt;get started&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Get Started" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/03_Get_Started.png"&gt;&lt;/p&gt;
&lt;p&gt;Name our dataset on the &lt;strong&gt;create dataset&lt;/strong&gt; screen.&lt;/p&gt;
&lt;p&gt;&lt;img alt="New Dataset" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/04_New_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;The AutoML Tables Beta service provides three vehicles for dataset import:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Import data from BigQuery&lt;/li&gt;
&lt;li&gt;Select a CSV file from Cloud Storage&lt;/li&gt;
&lt;li&gt;Upload files from your computer&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We will upload the BUPA dataset from our computer.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Download BUPA" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/05_Download_Bupa.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;Click here&lt;/a&gt; to download the CSV file from USC.&lt;/p&gt;
&lt;p&gt;USC names the file &lt;strong&gt;bupa.data&lt;/strong&gt;.  If we attempt to upload the file &lt;strong&gt;bupa.data&lt;/strong&gt;, Google will bark.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad Name" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/06_Bad_Name.png"&gt;&lt;/p&gt;
&lt;p&gt;Rename our file from &lt;strong&gt;bupa.data&lt;/strong&gt; to &lt;strong&gt;bupa.csv&lt;/strong&gt; in order to upload the data.  Click &lt;strong&gt;select files&lt;/strong&gt; and then click the &lt;strong&gt;bupa.csv&lt;/strong&gt; file.&lt;/p&gt;
&lt;p&gt;If needed, create a bucket to store the model and metadata.  Click &lt;strong&gt;browse&lt;/strong&gt; and then select the &lt;strong&gt;Swiss lunch pail&lt;/strong&gt; (my terminology, not Google's).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create A Bucket" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/07_Create_A_Bucket.png"&gt;&lt;/p&gt;
&lt;p&gt;Cycle through the bucket wizard and click &lt;strong&gt;create&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create A Bucket" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/08_Create_A_Bucket.png"&gt;&lt;/p&gt;
&lt;p&gt;If needed, create a folder via the &lt;strong&gt;Swiss lunch pail&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create A Folder" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/09_Create_A_Folder.png"&gt;&lt;/p&gt;
&lt;p&gt;If done properly, we will see all green, and will be able to click &lt;strong&gt;import&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Import&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Dataset" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/10_Import.png"&gt;&lt;/p&gt;
&lt;p&gt;Google will import the data, a process which can take hours.  We can close the window and Google will email us once it completes the import process.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Message" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/11_Import_Message.png"&gt;&lt;/p&gt;
&lt;p&gt;After a few minutes, the import fails!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fail Ext" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/12_Fail.png"&gt;&lt;/p&gt;
&lt;p&gt;The Beta service requires a header row.  We hope the &lt;strong&gt;General Availability (GA)&lt;/strong&gt; version will provide a friendlier UI, with a wizard to create a header row.  Until then, we need to add the header row by hand:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;mcv,alkphos,sgpt,sgot,gammagt,drinks,selector
85,92,45,27,31,0.0,1
85,64,59,32,23,0.0,2
86,54,33,16,54,0.0,2

     &amp;lt;snip&amp;gt;

91,68,27,26,14,16.0,1
98,99,57,45,65,20.0,1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
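&lt;p&gt;A short Python sketch can handle both the rename and the header fix in one go.  The sketch writes a two-row sample in place of the full &lt;strong&gt;bupa.data&lt;/strong&gt; download, so the file contents here are illustrative:&lt;/p&gt;

```python
# Stand-in for the raw bupa.data file downloaded from UCI
# (two sample rows; the real file holds 345).
raw_rows = "85,92,45,27,31,0.0,1\n85,64,59,32,23,0.0,2\n"
with open("bupa.data", "w") as f:
    f.write(raw_rows)

# Prepend the header row that AutoML Tables requires, and save
# the result with the .csv extension that the upload wizard accepts.
header = "mcv,alkphos,sgpt,sgot,gammagt,drinks,selector\n"
with open("bupa.data") as f:
    body = f.read()
with open("bupa.csv", "w") as f:
    f.write(header + body)
```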

&lt;p&gt;We properly wrangled the data into a form that the Google service accepts.  Upload the modified &lt;strong&gt;bupa.csv&lt;/strong&gt; into the import wizard and select import once more.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Dataset" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/10_Import.png"&gt;&lt;/p&gt;
&lt;p&gt;GCP imports the data...&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Message" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/11_Import_Message.png"&gt;&lt;/p&gt;
&lt;p&gt;...and fails once more!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fail Rows" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/13_Fail2.png"&gt;&lt;/p&gt;
&lt;p&gt;The service does not accept data sets that contain &lt;strong&gt;fewer than&lt;/strong&gt; 1,000 rows.  For this reason, we can't optimize the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; model with &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt;, a reality that disappoints me greatly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Imagine that you drive trucks cross-country for a living. Now imagine every morning a magical elf appears and hides your keys, steals your engine or slashes your tires.  This analogy captures the daily frustration faced by professional Data Engineers!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Beta version of the &lt;strong&gt;Google AutoML Tables&lt;/strong&gt; service dashed my plans of providing a consistent machine learning narrative centered on the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; data set.  Uncooperative software has slain my plans (and dreams) enough times in my professional career to vaccinate me against such roadblocks.  With a heavy heart, I will re-group and select a new data set for us to test drive with the &lt;strong&gt;AutoML Tables Beta&lt;/strong&gt; service.&lt;/p&gt;
&lt;h2&gt;The UCI Wine Data Set&lt;/h2&gt;
&lt;p&gt;Our BUPA liver disorders &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;TensorFlow model&lt;/a&gt; predicts the number of drinks that a boozer drinks each day based on biological markers.  We stick with the wino theme and use the University of California Irvine (UCI) &lt;a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality"&gt;wine quality data set&lt;/a&gt;.  The Wine Quality data set uses biological (and chemical) markers to predict the quality of wine, which sommeliers score on a scale from zero to ten.  I would assume that Thunderbird would score low on such a scale.&lt;/p&gt;
&lt;p&gt;Download the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality"&gt;wine dataset&lt;/a&gt; from the same UCI website that hosts the BUPA data set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Site" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/14_Wine_Site.png"&gt;&lt;/p&gt;
&lt;p&gt;We follow the method described in the &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;BUPA TensorFlow&lt;/a&gt; blog post to process the data, replacing the &lt;strong&gt;BUPA Data Frame&lt;/strong&gt; with the new &lt;strong&gt;Wine Data Frame&lt;/strong&gt; where appropriate.&lt;/p&gt;
&lt;p&gt;The following Python code, for example, uses the &lt;strong&gt;requests&lt;/strong&gt; library to download the &lt;strong&gt;Wine Quality&lt;/strong&gt; data set from the UCI website, and stuffs the data into a &lt;strong&gt;Pandas&lt;/strong&gt; Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fixed_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;volatile_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;citric_acid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;residual_sugar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;chlorides&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;free_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;total_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;density&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;ph&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;sulphates&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;alcohol&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;quality&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;wine_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; 
                      &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The table below captures the summary statistics for the wine dataset:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;feature&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fixed_acidity&lt;/td&gt;
&lt;td&gt;8.31&lt;/td&gt;
&lt;td&gt;1.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;volatile_acidity&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;citric_acid&lt;/td&gt;
&lt;td&gt;0.27&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;residual_sugar&lt;/td&gt;
&lt;td&gt;2.52&lt;/td&gt;
&lt;td&gt;1.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chlorides&lt;/td&gt;
&lt;td&gt;0.09&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;free_sulfur_dioxide&lt;/td&gt;
&lt;td&gt;15.8&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total_sulfur_dioxide&lt;/td&gt;
&lt;td&gt;46.5&lt;/td&gt;
&lt;td&gt;33.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;density&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ph&lt;/td&gt;
&lt;td&gt;3.31&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sulphates&lt;/td&gt;
&lt;td&gt;0.66&lt;/td&gt;
&lt;td&gt;0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alcohol&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;1.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quality&lt;/td&gt;
&lt;td&gt;5.63&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We observe a standard deviation of &lt;strong&gt;0.809201&lt;/strong&gt; for the target variable &lt;strong&gt;quality&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note the wild range swings amongst the features.  We follow the same process from our &lt;strong&gt;BUPA&lt;/strong&gt; model to &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;normalize the data via TensorFlow&lt;/a&gt;.  The following figure shows histograms of the normalized features.  Note that we do not normalize the target, &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Histograms" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/15_Wine_Histograms.png"&gt;&lt;/p&gt;
&lt;p&gt;Recall that the idiotic &lt;strong&gt;guess mean&lt;/strong&gt; algorithm yielded the best results for the &lt;strong&gt;BUPA&lt;/strong&gt; data set.  That algorithm guesses the mean of the &lt;strong&gt;training&lt;/strong&gt; Data Frame for each row in the &lt;strong&gt;holdout&lt;/strong&gt; (or test) Data Frame.  If we apply that algorithm to the &lt;strong&gt;Wine&lt;/strong&gt; Data Frame, we observe an RMSE of &lt;strong&gt;0.8012159&lt;/strong&gt;, just under the standard deviation of the entire population.  This result compels us to apply more advanced algorithms.&lt;/p&gt;
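&lt;p&gt;The &lt;strong&gt;guess mean&lt;/strong&gt; baseline is easy to reproduce.  The sketch below uses synthetic integer quality scores in place of the real Wine Data Frame, so the printed RMSE will not match &lt;strong&gt;0.8012159&lt;/strong&gt;; the split sizes and variable names are illustrative:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1,599 wine quality scores (integers 3..8).
quality = rng.integers(3, 9, size=1599).astype(float)

# 80/20 train/holdout split.
train, holdout = quality[:1279], quality[1279:]

# Guess the training mean for every holdout row, then score with RMSE.
prediction = np.full_like(holdout, train.mean())
rmse = np.sqrt(np.mean((holdout - prediction) ** 2))
print(round(rmse, 3))
```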
&lt;p&gt;Once more, &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Keras provides the tools to create a linear regression model and a Dense Neural Network (DNN) model&lt;/a&gt;, both of which predict the quality of the wine based on the given features.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Keras detects that we now have eleven input features, versus the five for &lt;strong&gt;BUPA&lt;/strong&gt;. &lt;/p&gt;
&lt;/blockquote&gt;
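&lt;p&gt;For reference, a minimal sketch of the DNN that produces the summary output below: a &lt;strong&gt;Normalization&lt;/strong&gt; layer adapted to the eleven features, followed by two 64-unit hidden layers and a single output unit.  The random features stand in for the Wine Data Frame, and the layer API reflects current &lt;strong&gt;tf.keras&lt;/strong&gt; rather than the exact Keras 2.3 calls:&lt;/p&gt;

```python
import numpy as np
import tensorflow as tf

# Random stand-in for the eleven normalized wine features.
features = np.random.normal(size=(100, 11)).astype("float32")

# Normalization contributes the 23 non-trainable params in the summary
# (11 means + 11 variances + 1 count).
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(features)

dnn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(11,)),
    normalizer,
    tf.keras.layers.Dense(64, activation="relu"),  # 11*64 + 64 = 768 params
    tf.keras.layers.Dense(64, activation="relu"),  # 64*64 + 64 = 4,160 params
    tf.keras.layers.Dense(1),                      # 64*1 + 1 = 65 params
])
dnn_model.compile(optimizer="adam", loss="mean_squared_error")
```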
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;dnn_model.summary&lt;span class="o"&gt;()&lt;/span&gt;
Model:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sequential_1&amp;quot;&lt;/span&gt;
_________________________________________________________________
Layer&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                 &lt;/span&gt;Output&lt;span class="w"&gt; &lt;/span&gt;Shape&lt;span class="w"&gt;              &lt;/span&gt;Param&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#   &lt;/span&gt;
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
normalization_5&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Normalizati&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;
_________________________________________________________________
dense_1&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Dense&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="m"&gt;768&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;
_________________________________________________________________
dense_2&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Dense&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="m"&gt;4160&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;
_________________________________________________________________
dense_3&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Dense&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="nv"&gt;65&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
Total&lt;span class="w"&gt; &lt;/span&gt;params:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,016
Trainable&lt;span class="w"&gt; &lt;/span&gt;params:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;,993
Non-trainable&lt;span class="w"&gt; &lt;/span&gt;params:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The normalized training set yields the following results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The DNN blows the other two approaches out of the water.  &lt;/p&gt;
&lt;p&gt;In the spirit of the prior blog post, we reduce the eleven features to two, via PCA.  Keras reports that the dimensionality reduction &lt;strong&gt;increases&lt;/strong&gt; the RMSE for the linear model.&lt;/p&gt;
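&lt;p&gt;The reduction itself is a one-liner with scikit-learn.  Random data stands in for the normalized Wine features below; &lt;strong&gt;wine_features&lt;/strong&gt; is an illustrative name, not a variable from the original notebook:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
wine_features = rng.normal(size=(1599, 11))  # stand-in for 11 normalized features

# Project the eleven features onto the top two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(wine_features)
print(reduced.shape)  # (1599, 2)
```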
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;One principal component (dimension) yields a poor fit for the regression line.  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  The Wine data frame uses &lt;strong&gt;integers&lt;/strong&gt; for &lt;strong&gt;quality&lt;/strong&gt;.  For this reason we could also apply a &lt;strong&gt;classification&lt;/strong&gt; algorithm to predict wine quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Wine One Dim" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/16_Wine_One_Dim.png"&gt;&lt;/p&gt;
&lt;p&gt;The graph of two principal components indicates poor predictive performance.  We cannot draw a clean line that will predict the correct wine quality (depicted by the color and radius of the circles below).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Two Dims" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/17_Wine_Two_Dims.png"&gt;&lt;/p&gt;
&lt;p&gt;The 3D bar chart looks flat, which also indicates that we need more than two Principal Components.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Bar Chart" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/18_Wine_Bar_Chart.png"&gt;&lt;/p&gt;
&lt;p&gt;How many Principal Components should we use?  The &lt;strong&gt;eigenvalues&lt;/strong&gt; record the variance for each &lt;strong&gt;eigenvector&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print&lt;span class="o"&gt;(&lt;/span&gt;pca.explained_variance_&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.0807826&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.84947941&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.62211745&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.23466434&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.96610121&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.68122053
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.58218232&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.40963393&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.3487236&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.17406732&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we eyeball the vector of &lt;strong&gt;eigenvalues&lt;/strong&gt;, we see that the first seven (7) or so principal components contain most of the variance.&lt;/p&gt;
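&lt;p&gt;We can make the eyeball check precise by computing the cumulative share of variance from the eigenvalues printed above:&lt;/p&gt;

```python
import numpy as np

# Eigenvalues reported by pca.explained_variance_ above.
eigenvalues = np.array([3.0807826, 1.84947941, 1.62211745, 1.23466434,
                        0.96610121, 0.68122053, 0.58218232, 0.40963393,
                        0.3487236, 0.17406732])

# Cumulative fraction of total variance captured by the first k components.
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
print(cumulative.round(3))
```

&lt;p&gt;The first seven components capture roughly 91 percent of the total variance, which supports the choice of seven dimensions.&lt;/p&gt;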
&lt;p&gt;When we reduce the dimensionality of the data set to seven, and apply the two algorithms, we witness the following results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The dimensionality-reduced Linear Model wins.&lt;/p&gt;
&lt;p&gt;Can Google AutoML tables beat the dimensionality reduced Linear Model?  Let's find out!&lt;/p&gt;
&lt;h2&gt;Import the UCI Wine Data Set&lt;/h2&gt;
&lt;p&gt;Download the &lt;a href="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"&gt;Wine Data Set from UCI&lt;/a&gt; to your workstation and execute the following two actions.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Replace all semi-colons (&lt;strong&gt;;&lt;/strong&gt;) with commas (&lt;strong&gt;,&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Replace all spaces with underscores (&lt;strong&gt;_&lt;/strong&gt;)&lt;/li&gt;
&lt;/ol&gt;
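&lt;p&gt;The two substitutions take a few lines of Python.  A truncated two-column sample stands in for the full &lt;strong&gt;winequality-red.csv&lt;/strong&gt; download here:&lt;/p&gt;

```python
# Stand-in for the raw UCI download: semicolon-separated, with
# spaces in the header names (truncated to two columns here).
sample = '"fixed acidity";"volatile acidity"\n7.4;0.7\n7.8;0.88\n'
with open("winequality-red.csv", "w") as f:
    f.write(sample)

# 1. Replace semicolons with commas.  2. Replace spaces with underscores.
with open("winequality-red.csv") as f:
    text = f.read().replace(";", ",").replace(" ", "_")

with open("wine.csv", "w") as f:
    f.write(text)
```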
&lt;p&gt;See a snippet of &lt;strong&gt;wine.csv&lt;/strong&gt; below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
&lt;span class="m"&gt;7&lt;/span&gt;.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
&lt;span class="m"&gt;7&lt;/span&gt;.8,0.88,0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
&lt;span class="m"&gt;7&lt;/span&gt;.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
&lt;span class="m"&gt;11&lt;/span&gt;.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
&lt;span class="m"&gt;7&lt;/span&gt;.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
&lt;span class="m"&gt;7&lt;/span&gt;.4,0.66,0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
&lt;span class="m"&gt;7&lt;/span&gt;.3,0.65,0,1.2,0.065,15,21,0.9946,3.39,0.47,10,7

&lt;span class="w"&gt;               &lt;/span&gt;&amp;lt;snip&amp;gt;

&lt;span class="m"&gt;6&lt;/span&gt;.3,0.51,0.13,2.3,0.076,29,40,0.99574,3.42,0.75,11,6
&lt;span class="m"&gt;5&lt;/span&gt;.9,0.645,0.12,2,0.075,32,44,0.99547,3.57,0.71,10.2,5
&lt;span class="m"&gt;6&lt;/span&gt;,0.31,0.47,3.6,0.067,18,42,0.99549,3.39,0.66,11,6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Follow the process that we used to (attempt to) import the &lt;strong&gt;BUPA&lt;/strong&gt; data set above.  Create a new bucket and folder if desired.&lt;/p&gt;
&lt;p&gt;I created a bucket named &lt;strong&gt;wine-quality-data&lt;/strong&gt; and a folder named &lt;strong&gt;red&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Wine" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/19_Import_Wine.png"&gt;&lt;/p&gt;
&lt;p&gt;After we click &lt;strong&gt;import&lt;/strong&gt;, Google will suggest that we close the window.&lt;/p&gt;
&lt;p&gt;&lt;img alt="So Far So Good" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/20_So_Far_So_Good.png"&gt;&lt;/p&gt;
&lt;p&gt;After about forty-five minutes, Google sends an email that reports a successful import.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Email" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/21_Import_Email.png"&gt;&lt;/p&gt;
&lt;p&gt;With our imported data set, we can now train the model.&lt;/p&gt;
&lt;h2&gt;Train the Model&lt;/h2&gt;
&lt;p&gt;Click the &lt;strong&gt;Train&lt;/strong&gt; tab in the console.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Click Train Button" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/22_Click_Train_Button.png"&gt;&lt;/p&gt;
&lt;p&gt;Scroll down to our target variable &lt;strong&gt;quality&lt;/strong&gt; and change the data type from &lt;strong&gt;categorical&lt;/strong&gt; to &lt;strong&gt;numeric&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  In the spirit of our earlier efforts, we select &lt;strong&gt;numeric&lt;/strong&gt; to continue with the regression theme.  If we want a &lt;strong&gt;classification&lt;/strong&gt; model, then we can set &lt;strong&gt;data type&lt;/strong&gt; to &lt;strong&gt;categorical&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Quality Numeric" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/23_Quality_Numeric.png"&gt;&lt;/p&gt;
&lt;p&gt;Scroll to the top and set the &lt;strong&gt;target&lt;/strong&gt; variable to &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Target Quality" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/24_Target_Quality.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;train model&lt;/strong&gt;.  We can limit the number of compute hours (and therefore cost) if desired.  I set the value to the maximum; our simple model will not consume these resources.  Click &lt;strong&gt;Train Model&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Train Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/25_Train_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The Google singularity then gets to work and creates the infrastructure needed to train our model.  We can close the browser.  Google will email us a notification once it finishes training the model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Progress" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/26_Training_Progress.png"&gt;&lt;/p&gt;
&lt;h2&gt;View Results&lt;/h2&gt;
&lt;p&gt;After a few hours, Google sends an email that notifies us of model completion.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Email" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/27_Training_Email.png"&gt;&lt;/p&gt;
&lt;p&gt;Navigate back to the &lt;strong&gt;Tables&lt;/strong&gt; service and click the &lt;strong&gt;Models&lt;/strong&gt; tab.  The GCP console presents the results.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model Results" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/28_Model_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;In summary, the &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt; service yields the best results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We achieved the best results with the least amount of work: Upload a CSV and click train!  &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Model Meta Data&lt;/h2&gt;
&lt;p&gt;The service provides feature importance.  Google reports that &lt;strong&gt;alcohol&lt;/strong&gt; drives &lt;strong&gt;quality&lt;/strong&gt; more than any other feature.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Feature Importance" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/29_Feature_Importance.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model Hyperparameters&lt;/strong&gt; under the &lt;strong&gt;Model&lt;/strong&gt; tab re-directs us to the GCP &lt;strong&gt;Operations Logging&lt;/strong&gt; console.  These logs include the different scenarios for each iteration.  Trial zero, for example, uses a Neural Network with sixteen (16) layers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hyper Nn L1" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/30_Hyper_Nn_L1.png"&gt;&lt;/p&gt;
&lt;p&gt;Trial fifteen uses a Gradient Boosted Decision Tree (GBDT).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hyper Gbdt" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/31_Hyper_Gbdt.png"&gt;&lt;/p&gt;
&lt;p&gt;The logs provide a cumbersome UI to investigate the trials.  Perhaps the &lt;strong&gt;GA&lt;/strong&gt; release will clean up the UI and present a friendlier dashboard.&lt;/p&gt;
&lt;h2&gt;Deploy the Model&lt;/h2&gt;
&lt;p&gt;Now that we have developed the model, we will deploy it for use.  The AutoML service provides one-click, no-code model deployment.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Test And Use --&amp;gt; Online Prediction --&amp;gt; Deploy Model&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Deploy Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/32_Deploy_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;Google once more deploys the model, and perhaps more importantly, the required infrastructure to enable model serving.&lt;/p&gt;
&lt;p&gt;Google emails an alert once the model deployment completes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Deployed Email" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/33_Deployed_Email.png"&gt;&lt;/p&gt;
&lt;h2&gt;Test the Model&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;online prediction&lt;/strong&gt; tab provides a web form to test the model.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="Deployed Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/34_Deployed_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;Scroll down to the &lt;strong&gt;alcohol&lt;/strong&gt; field.  What score can we expect for a 160 proof bottle of wine?  In the US convention, 160 proof equates to eighty percent alcohol by volume, so enter the number eighty into the alcohol field and then click &lt;strong&gt;test&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Test The Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/35_Test_The_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The model predicts our strong wine deserves a score of &lt;strong&gt;5.194&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Prediction Results" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/36_Prediction.png"&gt;&lt;/p&gt;
&lt;p&gt;The AutoML Tables Beta service also provides a REST API that lets machines submit prediction requests to the model.&lt;/p&gt;
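&lt;p&gt;As a rough sketch of what a machine client could send, the snippet below builds a prediction request body.  The endpoint URL and payload schema here are our assumptions for illustration; consult the AutoML Tables API reference for the authoritative request format.&lt;/p&gt;

```python
import json

# Hypothetical sketch: the endpoint URL and payload schema are assumptions
# for illustration, not the documented AutoML Tables request format.
ENDPOINT = "https://automl.googleapis.com/v1beta1/MODEL_NAME:predict"  # placeholder

# One row of feature values; alcohol set to 80 for the 160 proof experiment.
payload = {"payload": {"row": {"values": ["7.4", "0.70", "0.00", "1.9", "80.0"]}}}

body = json.dumps(payload)
print(body)

# A client would POST body to ENDPOINT with an OAuth 2.0 bearer token, e.g.
# requests.post(ENDPOINT, data=body,
#               headers={"Authorization": "Bearer " + token})
```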
&lt;h2&gt;Billing&lt;/h2&gt;
&lt;p&gt;The AutoML Tables Beta service costs significantly less than our &lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model&lt;/a&gt;.  We paid $28 for one and a half hours of compute time.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Billing Metering" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/37_Billing.png"&gt;&lt;/p&gt;
&lt;p&gt;Google gave us a credit for the training, so we did not need to pay any out-of-pocket fee.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this blog post we test-drove the &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt; service.  The service did not accommodate our &lt;strong&gt;BUPA&lt;/strong&gt; data, so we needed to pivot and try another Data Set, the UCI &lt;strong&gt;Wine Quality&lt;/strong&gt; data set.&lt;/p&gt;
&lt;p&gt;We used &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Scikit Learn&lt;/strong&gt; and &lt;strong&gt;TensorFlow 2.3&lt;/strong&gt; to wrangle, explore, normalize, visualize and split the &lt;strong&gt;Wine Quality&lt;/strong&gt; data set.  We used &lt;strong&gt;Keras 2.3&lt;/strong&gt; to train a linear model and a DNN model and compared the results.  We then iterated on dimensionality reduction approaches, converging on a &lt;strong&gt;good-enough&lt;/strong&gt; number of features.  PCA provided the vehicle to reduce dimensionality.  The &lt;strong&gt;TensorFlow/Keras/Pandas&lt;/strong&gt; approach required domain knowledge of AI/ML concepts and also required familiarity with various Python libraries and methods.  In other words, the Python approach required considerable Math, Data Science and Software Development skills.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt; service obviated the need for subject matter expertise.  We simply uploaded a CSV and clicked &lt;strong&gt;run&lt;/strong&gt;.  Google &lt;strong&gt;took care of the rest&lt;/strong&gt;.  The AutoML Tables Beta service, therefore, democratizes the power of AI/ML and puts the technology in the hands of non-technical business users.  I look forward to the &lt;strong&gt;General Availability (GA)&lt;/strong&gt; release of this service.&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Neural Networks"></category><category term="Machine Learning"></category></entry><entry><title>Fast &amp; Easy Regression w/ Keras and TensorFlow 2.3 (Part 2)</title><link href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html" rel="alternate"></link><published>2020-11-28T10:26:00-05:00</published><updated>2020-11-28T10:26:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-11-28:/fast-and-easy-regression-with-tensorflow-part-2.html</id><summary type="html">&lt;p&gt;In this demonstration we &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;continue&lt;/a&gt; to use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  Today we use Principal Component Analysis (PCA) to address over-fitting via dimensionality reduction.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TensorFlow and Keras 2.3 Logo" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/00_Tf_Keras_Logo.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: TensorFlow Core 2 …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;In this demonstration we &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;continue&lt;/a&gt; to use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  Today we use Principal Component Analysis (PCA) to address over-fitting via dimensionality reduction.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TensorFlow and Keras 2.3 Logo" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/00_Tf_Keras_Logo.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: TensorFlow Core 2.3 includes tf.keras, which provides the high level (high abstraction) Keras Application Programming Interface (API) that simplifies the command and control of TensorFlow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Last Month&lt;/a&gt; we executed the following activities:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Explore the data set&lt;/li&gt;
&lt;li&gt;Normalize the training data&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a Linear Model&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a DNN&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This month, we address the issue of over-fitting by using Principal Component Analysis (PCA) to reduce the dimensionality of the data set. We will: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Drop features (via PCA) to address over-fitting&lt;/li&gt;
&lt;li&gt;Revisit the Linear Model&lt;/li&gt;
&lt;li&gt;Revisit the DNN&lt;/li&gt;
&lt;li&gt;Compare, discuss and contextualize the results&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;1. Dimensionality Reduction&lt;/h2&gt;
&lt;p&gt;Model over-fitting inflates loss on unseen data.  Dimensionality reduction, or feature removal, mitigates model over-fitting.  We use Principal Component Analysis (PCA) to reduce the dimensionality.&lt;/p&gt;
&lt;p&gt;If you stick a magnet at each point in the data space, and then stick a telescoping iron bar at the origin, the magnets will pull the bar into position and stretch it.  The bar will wiggle a bit at first and then eventually settle into a static position.  The final direction and length of the bar represents a principal component.  We can map the higher dimensionality space to the principal component by connecting a string directly from each magnet to the bar.  Where each taut string meets the bar, we make a mark.  The marks represent the mapped vector space.&lt;/p&gt;
&lt;p&gt;If you want more information, George Dallas writes an excellent blog post that &lt;a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/"&gt;contains cartoons explaining PCA&lt;/a&gt; and I suggest you open the link in a new tab.&lt;/p&gt;
&lt;p&gt;You can either construct PCA from your Linear Algebra notes (I have mine from 1996 in a Marble Composition book) or just use a pre-built engine.  I use the package from Scikit Learn.&lt;/p&gt;
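&lt;p&gt;For intuition, the sketch below builds PCA by hand from the eigendecomposition of the covariance matrix (toy data and variable names of our own choosing) and checks the projection against the Scikit Learn engine.&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative sketch (our toy data, not the post's): PCA from the
# eigendecomposition of the covariance matrix, checked against Scikit Learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # 200 samples, 5 features

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, -1]                    # direction of maximum variance
scores_manual = Xc @ top                # project onto the principal component

scores_sklearn = PCA(n_components=1).fit_transform(X).ravel()

# The two projections agree up to an arbitrary global sign flip.
agree = np.allclose(np.abs(scores_manual), np.abs(scores_sklearn))
print(agree)
```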
&lt;h3&gt;Reduce Five Dimensions to One Dimension&lt;/h3&gt;
&lt;p&gt;The PCA workflow mirrors that of the ML models above.  Just set the number of desired components (dimensions) and pass the engine a data set.  We also pass a name for the &lt;strong&gt;Principal Component&lt;/strong&gt;.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;fit_transform&lt;/strong&gt; method both extracts the Principal Components from the data set and then maps the data set to the lower dimensionality space.&lt;/p&gt;
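&lt;p&gt;A quick sanity check on toy data (the data and names here are ours) confirms that &lt;strong&gt;fit_transform&lt;/strong&gt; equals &lt;strong&gt;fit&lt;/strong&gt; followed by &lt;strong&gt;transform&lt;/strong&gt;.&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (illustrative): fit_transform equals fit followed by transform.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))

combined = PCA(n_components=1).fit_transform(X)

pca = PCA(n_components=1)
pca.fit(X)
separate = pca.transform(X)

same = np.allclose(combined, separate)
print(same)
```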
&lt;p&gt;Want to see all five dimensions mapped to a single one-dimensional vector?  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
     &lt;span class="n"&gt;princomp1&lt;/span&gt;
&lt;span class="mi"&gt;142&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;23.421539&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;32.402962&lt;/span&gt;
&lt;span class="mi"&gt;60&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;10.089154&lt;/span&gt;
&lt;span class="mi"&gt;339&lt;/span&gt;  &lt;span class="mf"&gt;24.724613&lt;/span&gt;
&lt;span class="mi"&gt;54&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;13.494720&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: The &lt;strong&gt;fit and map&lt;/strong&gt; example above preserves the index of the initial train data set.  We need to ensure that we maintain the index so that the label vectors properly align.  The &lt;strong&gt;index=train_features.index&lt;/strong&gt; argument preserves the original index during the PCA transform.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Take a look at the scale of the Principal Component vector above.  The head alone spans magnitudes from ten to over thirty.  That indicates that we forgot to normalize the data before we extracted the Principal Components.&lt;/p&gt;
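&lt;p&gt;A small sketch shows why normalization matters for PCA.  The data and names below are illustrative, with the Scikit Learn StandardScaler standing in for the TensorFlow normalizer: one large-scale feature hijacks the first component unless we normalize first.&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative sketch with made-up data; StandardScaler stands in for the
# post's TensorFlow normalizer.
rng = np.random.default_rng(42)
small = rng.normal(0.0, 1.0, size=(500, 4))   # four unit-scale features
big = rng.normal(0.0, 100.0, size=(500, 1))   # one feature ~100x larger
X = np.hstack([small, big])

raw_ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
scaled_ratio = PCA(n_components=1).fit(
    StandardScaler().fit_transform(X)).explained_variance_ratio_[0]

print(raw_ratio)     # nearly 1.0: the large feature owns the first component
print(scaled_ratio)  # roughly 1/5: no single feature dominates after scaling
```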
&lt;p&gt;The following code configures one Principal Component (reduces five features to one), extracts the Component of the normalized data set, and then saves the PCA fit in a mapping vector.  We need to use this mapping vector to transform the &lt;strong&gt;test&lt;/strong&gt; (holdout) data set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Normalize before PCA, also save fit for test data  &lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use the mapping vector to transform the normalized train features and save the results in a Pandas Data Frame.  Once more we preserve the index.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You now see the normalized features mapped to the one-dimensional Principal Component space.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
     &lt;span class="n"&gt;princomp1&lt;/span&gt;
&lt;span class="mi"&gt;142&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.416407&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.311242&lt;/span&gt;
&lt;span class="mi"&gt;60&lt;/span&gt;    &lt;span class="mf"&gt;0.209480&lt;/span&gt;
&lt;span class="mi"&gt;339&lt;/span&gt;   &lt;span class="mf"&gt;1.577983&lt;/span&gt;
&lt;span class="mi"&gt;54&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.013619&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A trendline over a scatter plot indicates whether the Principal Component correlates with the label.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The trendline does not indicate strong correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="PCA Scatter Trendline" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/01_Princomp1_Vs_Drinks_Trend.png"&gt;&lt;/p&gt;
&lt;h3&gt;Reduce Five Dimensions to Two Dimensions&lt;/h3&gt;
&lt;p&gt;A two-dimensional feature set allows us to graph the two Principal Components against our label (target) vector, &lt;strong&gt;drinks&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Seaborn provides only limited three-dimensional plots.  The following plot captures the relationship between &lt;strong&gt;drinks&lt;/strong&gt; and the two Principal Components.  The diameter of each circle indicates the number of drinks.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="PCA 1 and 2 vs Drinks Heatmap" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/02_Princomps_Vs_Drinks_Heat.png"&gt;&lt;/p&gt;
&lt;p&gt;Classic &lt;strong&gt;MATPLOTLIB&lt;/strong&gt; allows us to plot on three axes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;dz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;

&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_zlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The height of the bars depicts the number of drinks.  The slope across the bars suggests we may have found some slight correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3D Bar Chart" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/03_Princomps_Vs_Drinks_Bar.png"&gt;&lt;/p&gt;
&lt;h2&gt;2. Linear Model w/ PCA&lt;/h2&gt;
&lt;p&gt;We already normalized our train dataset before we applied PCA, so we do not include the TensorFlow normalizer.  We use Keras to construct and compile our new linear model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# no need for normalizer&lt;/span&gt;
&lt;span class="n"&gt;linear_model_pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="n"&gt;linear_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                         &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pass the PCA-transformed, two-feature data set to the model, along with the original train labels vector, which contains the number of drinks.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#turn off loggs&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;#validation on 20% of the training&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;3.76&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;384&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.15&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.8&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Plot the loss across each epoch for the train and validate sets.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The train set MSE clocks in at over 10, while the validate set comes in under 8.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Linear PCA Loss" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/04_Linear_Model_Pca_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;In order to evaluate the model with the holdout set, we must first project the five-feature holdout set to two-dimensional space via the PCA mapping.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Project test features to Principal Components&lt;/span&gt;
&lt;span class="n"&gt;pca_test_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The resulting holdout set now spans two (vs. five) dimensions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_test_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;princomp1&lt;/span&gt;  &lt;span class="n"&gt;princomp2&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.031826&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.390413&lt;/span&gt;
&lt;span class="mi"&gt;25&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.900648&lt;/span&gt;   &lt;span class="mf"&gt;0.331132&lt;/span&gt;
&lt;span class="mi"&gt;28&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.957798&lt;/span&gt;   &lt;span class="mf"&gt;1.973741&lt;/span&gt;
&lt;span class="mi"&gt;31&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.087801&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.594343&lt;/span&gt;
&lt;span class="mi"&gt;32&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.242004&lt;/span&gt;   &lt;span class="mf"&gt;0.572321&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;How did we do?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;test_results[&amp;#39;Linear Model w/ PCA&amp;#39;] = (linear_model_pca.evaluate(pca_test_features_df, test_labels))**0.5
print(test_results)
3/3 [==============================] - 0s 1ms/step - loss: 9.4360
{&amp;#39;Linear Model&amp;#39;: 3.217451704088136, &amp;#39;DNN&amp;#39;: 3.3038437219287813, &amp;#39;Linear Model w/ PCA&amp;#39;: 3.0718091272720853}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;PCA reduces the RMSE of the Linear model from 3.22 to 3.07, pretty darn good!  (We raise the evaluate() loss to the 0.5 power because the model reports MSE, and the square root of MSE gives RMSE.)&lt;/p&gt;
&lt;h2&gt;3. DNN with PCA-transformed Data&lt;/h2&gt;
&lt;p&gt;We use Keras to compile a DNN and, once more, we do not pass a normalization layer.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dnn_model_pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dnn_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pass the PCA mapped train features to the model and set validation proportion to 20%.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dnn_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#turn off loggs&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;#validation on 20% of the training&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;420&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.42&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.03&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;How do the results look?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The MSE for the validation set crosses above 8 after the 70th epoch or so.&lt;/p&gt;
&lt;p&gt;&lt;img alt="DNN Data Error" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/05_Dnn_Pca_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;We evaluate the DNN model with the transformed, two dimensional holdout set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DNN w/ PCA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dnn_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_test_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;==============================&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0268&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.217451704088136&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DNN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.3038437219287813&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Linear Model w/ PCA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.0718091272720853&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DNN w/ PCA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.166514259150867&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DNN w/ PCA reduces the RMSE from 3.30 to 3.17 vs. the original DNN.&lt;/p&gt;
&lt;h2&gt;4. Interpretation&lt;/h2&gt;
&lt;p&gt;The RMSEs for the four models range from 3.07 (lowest) to 3.30 (highest).  Does our model do a good job of predicting how many drinks a person consumes in a day?&lt;/p&gt;
&lt;p&gt;To answer that, consider the formula for Root Mean Squared Error (RMSE):&lt;/p&gt;
&lt;p&gt;&lt;img alt="RMSE" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/06_Rmse_Formula.png"&gt;&lt;/p&gt;
&lt;p&gt;We subtract the actual value from the estimated value for each observation, square the result to remove the negative sign, average the squared errors, and then take the square root.&lt;/p&gt;
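&lt;p&gt;That recipe reduces to three NumPy calls.  A quick sanity check with made-up numbers (illustrative only, not the BUPA data):&lt;/p&gt;

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
actual = np.array([1.0, 3.0, 5.0, 7.0])
predicted = np.array([2.0, 3.0, 4.0, 8.0])

# Subtract, square, average the squared errors, then take the square root
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # sqrt(0.75) ~= 0.866
```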
&lt;p&gt;Now, assume we just guess the mean for every observation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Guess Mean for all" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/07_Guess_Mean_For_All.png"&gt;&lt;/p&gt;
&lt;p&gt;If we substitute this guess vector into our RMSE formula, we get the formula for &lt;strong&gt;Standard Deviation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Standard Deviation Formula" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/08_Std_Formula.png"&gt;&lt;/p&gt;
&lt;p&gt;We therefore consider any RMSE that comes in under the Standard Deviation a victory.&lt;/p&gt;
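&lt;p&gt;We can verify the &lt;strong&gt;guess the mean&lt;/strong&gt; identity numerically.  The sketch below uses an illustrative label vector and NumPy's population (ddof=0) standard deviation:&lt;/p&gt;

```python
import numpy as np

# Any label vector will do; these values are illustrative
labels = np.array([0.5, 2.0, 3.5, 6.0, 8.0])

# RMSE when every prediction is simply the mean of the labels...
rmse_mean_guess = np.sqrt(np.mean((labels - labels.mean()) ** 2))

# ...matches the population (ddof=0) standard deviation exactly
print(np.isclose(rmse_mean_guess, labels.std()))  # True
```

&lt;p&gt;One nuance: Pandas defaults std() to ddof=1 while NumPy defaults to ddof=0, so the identity holds exactly against the population flavor; for a sample the two differ by a factor of sqrt(n/(n-1)), which shrinks as n grows.&lt;/p&gt;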
&lt;p&gt;Take a look at the Standard Deviation of the train data set:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;185&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="mf"&gt;3.4108545780181885&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;By this account all four models win.  Keep in mind, however, that &lt;strong&gt;in the wild&lt;/strong&gt; most test sets will include such a high volume of data that the sample &lt;strong&gt;STD&lt;/strong&gt; settles on the true population value, which leaves a model a stable (and unforgiving) baseline to beat.&lt;/p&gt;
&lt;p&gt;One last thing.  Assume a simple model where we just guess the &lt;strong&gt;mean&lt;/strong&gt; of the train data when predicting on the &lt;strong&gt;holdout&lt;/strong&gt; data.  How does this simple model perform?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sq_er&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Guess Mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sq_er&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="n"&gt;test_results&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;215&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="n"&gt;Guess&lt;/span&gt; &lt;span class="n"&gt;Mean&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;: 3.029730661841211}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The "Guess Mean" approach out-performs all of the other models!&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
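&lt;p&gt;To reproduce a ranking like the table above, we can sort the test_results dictionary by RMSE.  The sketch below copies the values printed earlier so it stands alone:&lt;/p&gt;

```python
# RMSE values copied from the evaluation output above
test_results = {
    'Linear Model': 3.217451704088136,
    'DNN': 3.3038437219287813,
    'Linear Model w/ PCA': 3.0718091272720853,
    'DNN w/ PCA': 3.166514259150867,
    'Guess Mean': 3.029730661841211,
}

# Rank the approaches, best (lowest RMSE) first
for name, rmse in sorted(test_results.items(), key=lambda kv: kv[1]):
    print(f'{name:20s} {rmse:.2f}')
```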
&lt;p&gt;In &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;the next blog post&lt;/a&gt; we will investigate ways to tune the model, both its architecture and its hyper-parameters.&lt;/p&gt;
&lt;p&gt;If you enjoyed this blog post, please check out these related blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;Exploratory Factor Analysis (EFA) Workflow and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;EFA - The Math and Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Columb Energy (RCE) - An alternative to KNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Probabalistic Parzen Neural Networks (PNN) with cartoons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fastai-flask.html"&gt;Vision model w/ FAST AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model w/ Google AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Google AutoML Tables Beta&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="TensorFlow"></category><category term="Keras"></category></entry><entry><title>Fast &amp; Easy Regression w/ Keras and TensorFlow 2.3 (Part 1)</title><link href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html" rel="alternate"></link><published>2020-10-31T10:26:00-04:00</published><updated>2020-10-31T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-10-31:/fast-and-easy-regression-with-tensorflow.html</id><summary type="html">&lt;p&gt;In this demonstration we will use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  TensorFlow Core 2.3 includes tf.keras, which provides the high level (high abstraction) Keras Application Programming Interface …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this demonstration we will use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  TensorFlow Core 2.3 includes tf.keras, which provides the high level (high abstraction) Keras Application Programming Interface (API) for TensorFlow. Keras simplifies the command and control of TensorFlow.  The TensorFlow ecosystem also contains straightforward and simple vehicles for normalization and other common Machine Learning data preparation constructs.&lt;/p&gt;
&lt;p&gt;The following bulleted list captures the steps we will execute in this demonstration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Explore the data set&lt;/li&gt;
&lt;li&gt;Normalize the training data&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a Linear Model&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a DNN&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Next month&lt;/a&gt;, we will address the issue of over-fitting by using Principal Component Analysis (PCA) to reduce the dimensionality of the data set. In that blog post we will: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Drop features (via PCA) to address over-fitting&lt;/li&gt;
&lt;li&gt;Revisit the Linear Model&lt;/li&gt;
&lt;li&gt;Revisit the DNN&lt;/li&gt;
&lt;li&gt;Compare, discuss and contextualize the results&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;1. Explore the data set&lt;/h2&gt;
&lt;p&gt;This demo revisits the &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;BUPA Liver Disorders&lt;/a&gt; data set, a classic, &lt;strong&gt;tough&lt;/strong&gt; data set that I have explored in three prior blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Applying a Reduced Columb Energy (RCE) Neural Network to the Bupa Liver Disorders Data Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical introduction to Probabalistic Neural Networks (PNN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/refactor-matlab-to-tidyverse.html"&gt;Refactoring Matlab Code to R Tidyverse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dataset includes five biological features, a record of &lt;strong&gt;drinks per day&lt;/strong&gt; and an arbitrary &lt;strong&gt;selector&lt;/strong&gt; variable that the original data compilers used for their initial models.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mcv: mean corpuscular volume&lt;/li&gt;
&lt;li&gt;Four Chemical Markers&lt;ul&gt;
&lt;li&gt;alkphos: alkaline phosphatase&lt;/li&gt;
&lt;li&gt;sgpt: alanine aminotransferase&lt;/li&gt;
&lt;li&gt;sgot: aspartate aminotransferase&lt;/li&gt;
&lt;li&gt;gammagt: gamma-glutamyl transpeptidase&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;drinks: # of half-pint equivalents of alcohol per day&lt;/li&gt;
&lt;li&gt;selector:  field used to split data into two sets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MCV&lt;/strong&gt; and the &lt;strong&gt;four chemical markers&lt;/strong&gt; provide the features for the model.  The model's label vector records &lt;strong&gt;drinks per day&lt;/strong&gt;.  We throw out the obsolete &lt;strong&gt;selector&lt;/strong&gt; feature. &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our &lt;strong&gt;regression&lt;/strong&gt; problem seeks to &lt;strong&gt;predict&lt;/strong&gt; the number of alcohol servings a person &lt;strong&gt;drinks per day&lt;/strong&gt; (label) based on the recorded &lt;strong&gt;biological stats&lt;/strong&gt; (features).  &lt;/p&gt;
&lt;/blockquote&gt;
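&lt;p&gt;In Pandas terms, the feature/label split amounts to dropping one column.  A sketch with a toy, two-row stand-in for the real DataFrame (the variable names features and labels are illustrative; the real import follows below):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in with the same columns as the BUPA data (selector already dropped)
bupa_df = pd.DataFrame({
    'mcv': [85.0, 86.0], 'alkphos': [92.0, 54.0], 'sgpt': [45.0, 33.0],
    'sgot': [27.0, 16.0], 'gammagt': [31.0, 54.0], 'drinks': [0.0, 0.5],
})

features = bupa_df.drop('drinks', axis=1)  # mcv plus the four chemical markers
labels = bupa_df['drinks']                 # the regression target
print(features.shape)  # (2, 5)
```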
&lt;h3&gt;Import the Data&lt;/h3&gt;
&lt;p&gt;I prefer &lt;strong&gt;requests&lt;/strong&gt; over the low-level &lt;strong&gt;urllib3&lt;/strong&gt; to pull the data from &lt;strong&gt;UC Irvine&lt;/strong&gt;.  Once I retrieve the content, I stuff the data into a &lt;strong&gt;Pandas DataFrame&lt;/strong&gt; and immediately drop the &lt;strong&gt;selector&lt;/strong&gt; column into the bit bucket.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="s1"&gt;&amp;#39;alkphos&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;gammagt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;selector&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;bupa_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;selector&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
             &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DataFrame's &lt;strong&gt;head&lt;/strong&gt; method outputs the first few lines of the frame.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;mcv&lt;/th&gt;
&lt;th&gt;alkphos&lt;/th&gt;
&lt;th&gt;sgpt&lt;/th&gt;
&lt;th&gt;sgot&lt;/th&gt;
&lt;th&gt;gammagt&lt;/th&gt;
&lt;th&gt;drinks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Check for Correlation&lt;/h3&gt;
&lt;p&gt;We pick a feature at random, &lt;strong&gt;mcv&lt;/strong&gt;, and use a simple scatter plot to check for any obvious correlation between this feature and our &lt;strong&gt;target&lt;/strong&gt; variable, &lt;strong&gt;drinks&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;No obvious correlation jumps out in the scatter plot below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCV vs. Drinks Scatterplot" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/01_Mcv_Vs_Drinks.png"&gt;&lt;/p&gt;
&lt;p&gt;To be sure, we will use &lt;strong&gt;Seaborn&lt;/strong&gt; to plot the best fit trend line and error bands.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The graph depicts fat error bands and a near-horizontal trend line, which reflects little to no correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCV vs. Drinks Scatterplot with Trendline" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/02_Mcv_Vs_Drinks_W_Trend.png"&gt;&lt;/p&gt;
&lt;p&gt;We plot a Kernel Density Estimation (KDE) of the &lt;strong&gt;drinks&lt;/strong&gt; variable.  KDE plots estimate the density of a continuous random variable, in this case, &lt;strong&gt;drinks&lt;/strong&gt;.  Imagine a smooth histogram, or a histogram with really skinny bars.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From the density plot we see that most people consume fewer than a couple of drinks per day.&lt;/p&gt;
&lt;p&gt;&lt;img alt="KDE of Drinks" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/03_Bupa_Drinks_Kde.png"&gt;&lt;/p&gt;
&lt;p&gt;A continuous bivariate joint density function captures the probability distribution of two random variables.  Imagine a top down view of the density plot above, with the density plot for &lt;strong&gt;MCV&lt;/strong&gt; mixed in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;kde&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The near-circular shape shows the dearth of correlation between &lt;strong&gt;MCV&lt;/strong&gt; and &lt;strong&gt;Drinks&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCV vs. Drinks Bi-Variate Joint Density Plot" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/04_Mcv_Vs_Drinks_Corr.png"&gt;&lt;/p&gt;
&lt;p&gt;By contrast, observe two features with excellent correlation, &lt;strong&gt;SGPT&lt;/strong&gt; and &lt;strong&gt;SGOT&lt;/strong&gt;.  We will leverage this correlation when we apply &lt;strong&gt;dimensionality reduction&lt;/strong&gt; to our data set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;kde&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice the sharp, nearly 45-degree angle of the bi-variate density plot, which indicates strong correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SGPT Vs. SGOT Bi-Variate Joint Density Plot" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/05_Sgpt_Vs_Sgot_Corr.png"&gt;&lt;/p&gt;
&lt;p&gt;We plot the correlation between &lt;strong&gt;drinks&lt;/strong&gt; and all features.  If one feature shows strong correlation with &lt;strong&gt;drinks&lt;/strong&gt;, then we can simply use that feature, throw out the rest and then take a nap.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;alkphos&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;gammagt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;drinks&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PairGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_vars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_vars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_offdiag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;None of the features show strong (or any) correlation with &lt;strong&gt;drinks&lt;/strong&gt;, so we will need to proceed with Machine Learning approaches for our prediction model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Drinks vs. all features density plots" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/06_Drinks_Vs_Features.png"&gt;&lt;/p&gt;
&lt;h3&gt;Split the Data&lt;/h3&gt;
&lt;p&gt;We split the data into three buckets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Train - To train a model&lt;/li&gt;
&lt;li&gt;Validate - To tune the model&lt;/li&gt;
&lt;li&gt;Holdout (aka Test) - To test the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;strong&gt;holdout&lt;/strong&gt; data set surprises the model with completely unknown data, which helps predict expected real-world performance.  I use the term &lt;strong&gt;test&lt;/strong&gt; in the code below.  The train/test split partitions rows into different buckets.  The features/label split pops off the &lt;strong&gt;label&lt;/strong&gt; column into a separate vector.  TensorFlow accepts a &lt;strong&gt;DataFrame&lt;/strong&gt; for the &lt;strong&gt;features&lt;/strong&gt; matrix and a &lt;strong&gt;Series&lt;/strong&gt; for the label vector.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We will further split the &lt;strong&gt;train&lt;/strong&gt; dataset into &lt;strong&gt;train and validate&lt;/strong&gt; sets when we train the model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Remove the rows that correspond to the train DF&lt;/span&gt;

&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#The pop removes drinks from the fetures DF&lt;/span&gt;
&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Take a quick look at the &lt;strong&gt;summary statistics&lt;/strong&gt; for the train dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;()[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;feature&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mcv&lt;/td&gt;
&lt;td&gt;90.2&lt;/td&gt;
&lt;td&gt;4.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alkphos&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;18.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sgpt&lt;/td&gt;
&lt;td&gt;30.6&lt;/td&gt;
&lt;td&gt;20.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sgot&lt;/td&gt;
&lt;td&gt;24.4&lt;/td&gt;
&lt;td&gt;10.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gammagt&lt;/td&gt;
&lt;td&gt;38.0&lt;/td&gt;
&lt;td&gt;37.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;drinks&lt;/td&gt;
&lt;td&gt;3.4&lt;/td&gt;
&lt;td&gt;3.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Notice that for &lt;strong&gt;drinks&lt;/strong&gt;, our target (label),&lt;strong&gt; &amp;#956; = &amp;#x3C3; = 3.4&lt;/strong&gt;.&lt;/p&gt;
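&lt;p&gt;The fact that &amp;#956; = &amp;#x3C3; = 3.4 hands us a free baseline: a model that always predicts the mean scores an RMSE equal to the standard deviation, 3.4 drinks.  Any trained model must beat that number to earn its keep.  A minimal sketch (synthetic labels, not the real column) demonstrates the identity.&lt;/p&gt;

```python
# Hypothetical sketch: a constant "predict the mean" baseline.
# For such a predictor, RMSE equals the (population) standard deviation.
import numpy as np

rng = np.random.default_rng(2)
labels = rng.normal(3.4, 3.4, 1000)   # stand-in for the drinks column

baseline = labels.mean()              # always predict the mean
rmse = np.sqrt(np.mean((labels - baseline) ** 2))

print(round(rmse, 3), round(labels.std(), 3))
```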
&lt;p&gt;We already looked at the density plot for &lt;strong&gt;drinks&lt;/strong&gt;.  We now plot the &lt;strong&gt;histograms&lt;/strong&gt; of the features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that each feature encompasses a different range of values.  To comply with Machine Learning best practices, we will &lt;strong&gt;normalize&lt;/strong&gt; the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Raw Features" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/07_Features_Hist.png"&gt;&lt;/p&gt;
&lt;h2&gt;2. Normalize the data&lt;/h2&gt;
&lt;p&gt;We &lt;strong&gt;normalize&lt;/strong&gt; each feature to &lt;strong&gt;zero mean and unit variance&lt;/strong&gt;.  Most blogs describe the &lt;strong&gt;manual&lt;/strong&gt; normalization process.  TensorFlow 2.X, however, provides an &lt;strong&gt;experimental&lt;/strong&gt; normalization engine.&lt;/p&gt;
&lt;p&gt;Import the required packages.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a &lt;strong&gt;normalizer&lt;/strong&gt; object.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Feed the normalizer engine our data, so the engine can learn each feature's mean and variance.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;normalizer&lt;/strong&gt; takes a data set as input, and its &lt;strong&gt;numpy()&lt;/strong&gt; method returns a matrix of normalized values.  We pass this &lt;strong&gt;numpy&lt;/strong&gt; matrix to a &lt;strong&gt;Pandas&lt;/strong&gt; DataFrame and then plot the new, normalized histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;alkphos&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gammagt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Much better!  The &lt;strong&gt;normalized&lt;/strong&gt; data cluster around &lt;strong&gt;zero&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Normalized Features" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/08_Features_Norm_Hist.png"&gt;&lt;/p&gt;
&lt;h2&gt;3. Create a Linear Model&lt;/h2&gt;
&lt;p&gt;Keras makes life easy.  The following line of code creates a linear regression model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
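
&lt;p&gt;A &lt;strong&gt;Dense&lt;/strong&gt; layer with one unit computes &lt;strong&gt;y = w&amp;#183;x + b&lt;/strong&gt; over the five normalized features, which makes the model ordinary linear regression.  The hypothetical numpy sketch below shows the same functional form and recovers made-up weights with closed-form least squares.&lt;/p&gt;

```python
# Hypothetical sketch: the function a Dense(units=1) layer computes.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (345, 5))        # stand-in for normalized features
w_true = np.array([0.5, -1.2, 0.3, 0.9, -0.4])
b_true = 3.4
y = X @ w_true + b_true               # the linear form y = w.x + b

# Closed-form least squares recovers the same weights and bias
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef.round(2))
```

&lt;p&gt;Keras arrives at the same weights iteratively, with an optimizer instead of a closed form.&lt;/p&gt;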

&lt;p&gt;Every &lt;strong&gt;Machine Learning&lt;/strong&gt; course in history seems to fixate on &lt;strong&gt;Gradient Descent&lt;/strong&gt; for the first few weeks.  In this case, we do not use vanilla &lt;strong&gt;Gradient Descent&lt;/strong&gt; to optimize our model.  Instead, we use &lt;strong&gt;Adam&lt;/strong&gt;, an adaptive variant.  In addition, I set the loss function to &lt;strong&gt;Mean Square Error (MSE)&lt;/strong&gt;.  In practice, you should use &lt;strong&gt;Mean Absolute Error (MAE)&lt;/strong&gt;.  I use &lt;strong&gt;MSE&lt;/strong&gt;, however, in order to drive some interesting thought experiments in the final &lt;strong&gt;interpretations&lt;/strong&gt; section of next month's blog post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run through one hundred epochs to train the model.  We hold out one fifth of the training data to validate the model.  I use an NVIDIA Tesla K80 GPU, which keeps the wall-clock time to under three seconds.  A CPU will take about thirty seconds.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#turn off loggs&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;#validation on 20% of the training&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;3.85&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;312&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.16&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.83&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keras records the per-epoch loss values in a &lt;strong&gt;History&lt;/strong&gt; object.  The following function plots the loss curves for us to inspect.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;val_loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;val_loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Epoch&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Error [Drinks]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now plot the training history.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our loss on the train data set (blue line) lands at around nine (MSE), or a root mean square error (RMSE) of three.  This means that, on average, the model misses the true count by about three drinks per day.  We discuss the impacts of this RMSE in the final &lt;strong&gt;interpretations&lt;/strong&gt; section of next month's blog post.&lt;/p&gt;
&lt;p&gt;The validation set, however, fares better, with an MSE of under eight, and an RMSE of ~2.8.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Linear Model Error" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/09_Linear_Model_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;Good loss on the train and validate sets does not mean much.  Data Scientists can overfit a model to their train data, and an overfit model does not generalize well in &lt;strong&gt;the wild&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The proof of the pudding lies in the taste therein... only the error of the &lt;strong&gt;holdout (test)&lt;/strong&gt; set matters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;==============================&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.3520&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.217451704088136&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;On the &lt;strong&gt;holdout&lt;/strong&gt; set, the linear model produces an RMSE of ~3.2.&lt;/p&gt;
&lt;h2&gt;4. Train a Dense Neural Network (DNN)&lt;/h2&gt;
&lt;p&gt;Keras lets us assemble a &lt;strong&gt;Dense Neural Network (DNN)&lt;/strong&gt; model layer by layer.  The following function will use Keras to build and compile our DNN model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_and_compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
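
&lt;p&gt;We can predict the parameter counts before Keras reports them: a dense layer holds &lt;strong&gt;inputs &amp;#215; units&lt;/strong&gt; weights plus one bias per unit, and the normalization layer stores a non-trainable mean and variance per feature plus a count.  A quick arithmetic check (mirroring the layer sizes above):&lt;/p&gt;

```python
# Predict the parameter counts that model.summary() will report.
def dense_params(inputs, units):
    # weights (inputs x units) plus one bias per unit
    return inputs * units + units

d1 = dense_params(5, 64)    # first hidden layer
d2 = dense_params(64, 64)   # second hidden layer
d3 = dense_params(64, 1)    # output layer
norm = 5 + 5 + 1            # per-feature mean and variance, plus a count

print(d1, d2, d3, d1 + d2 + d3 + norm)
```

&lt;p&gt;The non-trainable normalization statistics explain the gap between the total and trainable counts.&lt;/p&gt;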

&lt;p&gt;We pass the model a &lt;strong&gt;normalizer&lt;/strong&gt; (created above) to normalize the data before it hits the DNN.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dnn_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_and_compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keras prints the model summary to the screen.  The model includes four layers: a normalization layer that accepts a &lt;strong&gt;five feature&lt;/strong&gt; data set, two &lt;strong&gt;64 neuron&lt;/strong&gt; dense layers, and a single-neuron output layer, which provides the prediction for &lt;strong&gt;number of drinks per day&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dnn_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;sequential_12&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;Layer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="n"&gt;Output&lt;/span&gt; &lt;span class="n"&gt;Shape&lt;/span&gt;              &lt;span class="n"&gt;Param&lt;/span&gt; &lt;span class="c1"&gt;#   &lt;/span&gt;
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
&lt;span class="n"&gt;normalization_1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Normalizati&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="mi"&gt;11&lt;/span&gt;        
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;dense_20&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="mi"&gt;384&lt;/span&gt;       
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;dense_21&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="mi"&gt;4160&lt;/span&gt;      
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;dense_22&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="mi"&gt;65&lt;/span&gt;        
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
&lt;span class="n"&gt;Total&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;620&lt;/span&gt;
&lt;span class="n"&gt;Trainable&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;609&lt;/span&gt;
&lt;span class="n"&gt;Non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;trainable&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
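&lt;p&gt;The parameter counts follow from the layer shapes: each &lt;strong&gt;Dense&lt;/strong&gt; layer holds one weight per input-unit pair plus one bias per unit, and the normalization layer stores a non-trainable mean and variance for each of the five features, plus one count scalar (11 total).  A quick sanity check (a sketch, not part of the original notebook):&lt;/p&gt;

```python
def dense_params(n_in, n_out):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_in * n_out + n_out

print(dense_params(5, 64))   # dense_20 -> 384
print(dense_params(64, 64))  # dense_21 -> 4160
print(dense_params(64, 1))   # dense_22 -> 65
```

&lt;p&gt;The three Dense layers plus the 11 normalization parameters account for the reported total of 4,620.&lt;/p&gt;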

&lt;p&gt;Train the &lt;strong&gt;DNN&lt;/strong&gt; and record the loss for the &lt;strong&gt;train&lt;/strong&gt; and &lt;strong&gt;validation&lt;/strong&gt; data sets (validation split set to 1/5).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dnn_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;4.01&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;468&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.48&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.09&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Plot the loss.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
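&lt;p&gt;The post defines the &lt;strong&gt;plot_loss&lt;/strong&gt; helper earlier.  If you do not have it in scope, a minimal Matplotlib sketch looks like the following (the axis labels are assumptions for this data set):&lt;/p&gt;

```python
import matplotlib.pyplot as plt

def plot_loss(history):
    # plot the per-epoch train and validation loss curves
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Error [drinks/day]')
    plt.legend()
    plt.grid(True)
```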

&lt;p&gt;The train loss slopes down and the validation loss holds constant.&lt;/p&gt;
&lt;p&gt;&lt;img alt="DNN Error" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/10_Dnn_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;evaluate&lt;/strong&gt; method checks the holdout (test) set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DNN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dnn_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;==============================&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.9154&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.217451704088136&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DNN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.3038437219287813&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DNN model shows an RMSE of 3.3, worse than the Linear Model's 3.2.&lt;/p&gt;
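&lt;p&gt;Note that &lt;strong&gt;evaluate&lt;/strong&gt; returns the model's loss, evidently mean squared error here, since the code above raises it to the 0.5 power to report RMSE:&lt;/p&gt;

```python
import math

mse = 10.9154           # loss reported by dnn_model.evaluate
rmse = math.sqrt(mse)   # equivalent to mse ** 0.5
print(round(rmse, 4))   # 3.3038
```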
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this demonstration we first used the &lt;strong&gt;requests&lt;/strong&gt; package to pull a dataset directly off the UC Irvine website and stuff the data into a &lt;strong&gt;Pandas&lt;/strong&gt; data frame.  We explored the data using a combination of traditional analytics, &lt;strong&gt;Seaborn&lt;/strong&gt;, &lt;strong&gt;Matplotlib&lt;/strong&gt; and fundamentals of statistics.  We then used the &lt;strong&gt;experimental&lt;/strong&gt; TensorFlow normalizer to normalize our data set.  We also used &lt;strong&gt;TensorFlow&lt;/strong&gt; to create our &lt;strong&gt;Train&lt;/strong&gt;, &lt;strong&gt;Validate&lt;/strong&gt; and &lt;strong&gt;Holdout&lt;/strong&gt; data sets.  &lt;strong&gt;Keras&lt;/strong&gt; provided a vehicle to create both a &lt;strong&gt;linear model&lt;/strong&gt; and a &lt;strong&gt;Dense Neural Network (DNN)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The added complexity of the DNN reduced performance relative to the linear model, which points to &lt;strong&gt;over-fitting&lt;/strong&gt;.  We will address DNN over-fitting &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;next month&lt;/a&gt;, when we use &lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt; to reduce the dimensionality of the data set and drop features.&lt;/p&gt;</content><category term="Data Science"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="TensorFlow"></category><category term="Keras"></category></entry><entry><title>Google AutoML Vision - Mods vs. Rockers Revisited!</title><link href="https://john.soban.ski/gcp-automl-vision.html" rel="alternate"></link><published>2020-04-25T12:12:00-04:00</published><updated>2020-04-25T12:12:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-04-25:/gcp-automl-vision.html</id><summary type="html">&lt;p&gt;FastAI provides Jupyter notebooks to wrangle data, train models, optimize models and then serve models.&lt;/p&gt;
&lt;p&gt;I recommended FastAI to my Data Scientist friends and they found the FastAI Jupyter layout and workflow both cumbersome and confusing.&lt;/p&gt;
&lt;p&gt;GCP provides the Google AutoML Vision service, an alternative to FastAI (and any roll-your-own …&lt;/p&gt;</summary><content type="html">&lt;p&gt;FastAI provides Jupyter notebooks to wrangle data, train models, optimize models and then serve models.&lt;/p&gt;
&lt;p&gt;I recommended FastAI to my Data Scientist friends and they found the FastAI Jupyter layout and workflow both cumbersome and confusing.&lt;/p&gt;
&lt;p&gt;GCP provides the Google AutoML Vision service, an alternative to FastAI (and any roll-your-own vision service, for that matter), which automates the tedious aspects of AI Vision efforts.&lt;/p&gt;
&lt;p&gt;AutoML Vision simplifies labeling and then automates training, optimization and serving of the model.&lt;/p&gt;
&lt;p&gt;GCP provides a &lt;strong&gt;no code&lt;/strong&gt; method to create, deploy and serve AI Vision models at scale!&lt;/p&gt;
&lt;p&gt;In this HOWTO we will accomplish the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a Google Cloud Storage Bucket&lt;/li&gt;
&lt;li&gt;Acquire and label data&lt;/li&gt;
&lt;li&gt;Train a Vision model that identifies &lt;strong&gt;Mods&lt;/strong&gt; vs. &lt;strong&gt;Rockers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Fix data labels via the GUI&lt;/li&gt;
&lt;li&gt;Re-train and tune a vision model&lt;/li&gt;
&lt;li&gt;Serve a Vision model&lt;/li&gt;
&lt;li&gt;Send our served model a test image&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Create a bucket&lt;/h2&gt;
&lt;p&gt;We must give the &lt;a href="https://cloud.google.com/vision/automl/docs"&gt;AutoML vision&lt;/a&gt; service some labeled data.&lt;/p&gt;
&lt;p&gt;We will create a &lt;a href="https://cloud.google.com/storage"&gt;Google Cloud Storage&lt;/a&gt; bucket for this purpose.&lt;/p&gt;
&lt;p&gt;You will upload two folders to this bucket, one for each class, &lt;strong&gt;mods&lt;/strong&gt; and &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You will also upload a line-delimited CSV file to this bucket that records the URI of each image in the bucket, followed by a label.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;gs://mods-rockers/mods/00000000.jpg,mods
gs://mods-rockers/mods/00000001.jpg,mods
gs://mods-rockers/mods/00000002.jpg,mods

&amp;lt;snip&amp;gt;

gs://mods-rockers/rockers/00000097.jpg,rockers
gs://mods-rockers/rockers/00000098.jpg,rockers
gs://mods-rockers/rockers/00000099.jpg,rockers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;IMPORTANT:  Ensure that you use a regional bucket in us-central1, with location type Region and storage class Standard.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you do not use the proper bucket configuration, you will receive the following error when you attempt to import your dataset.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Fail" src="https://john.soban.ski/images/Gcp_Automl_Vision/00_Import_Fail_Wrong_Region.png"&gt;&lt;/p&gt;
&lt;p&gt;The following commands grant AutoML the permissions it needs and then create the bucket.&lt;/p&gt;
&lt;p&gt;From the cloudshell, export your &lt;strong&gt;USERNAME&lt;/strong&gt; as an environment variable.&lt;/p&gt;
&lt;p&gt;Be sure to enter your &lt;strong&gt;USERNAME&lt;/strong&gt; in the following command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your&lt;span class="w"&gt; &lt;/span&gt;email&lt;span class="w"&gt; &lt;/span&gt;address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now grant your user the AutoML admin role and give the AutoML service account access to the project.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DEVSHELL_PROJECT_ID&lt;/span&gt;
sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;projects&lt;span class="w"&gt; &lt;/span&gt;add-iam-policy-binding&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--member&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;user:&lt;/span&gt;&lt;span class="nv"&gt;$USERNAME&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--role&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;roles/automl.admin&amp;quot;&lt;/span&gt;
Updated&lt;span class="w"&gt; &lt;/span&gt;IAM&lt;span class="w"&gt; &lt;/span&gt;policy&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;project&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;]&lt;/span&gt;.
bindings:
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;user:my@email.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/automl.admin
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:service-4011961642212@gcp-sa-automl.iam.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/automl.serviceAgent
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:service-4011961642212@compute-system.iam.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/compute.serviceAgent
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:4011961642212-compute@developer.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:4011961642212@cloudservices.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/editor
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;user:smy@email.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/owner
etag:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;AxYhhFi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
version:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$
&lt;span class="w"&gt; &lt;/span&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;projects&lt;span class="w"&gt; &lt;/span&gt;add-iam-policy-binding&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--member&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;serviceAccount:custom-vision@appspot.gserviceaccount.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--role&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;roles/ml.admin&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now create the bucket:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;gsutil&lt;span class="w"&gt; &lt;/span&gt;mb&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;standard&lt;span class="w"&gt; &lt;/span&gt;-l&lt;span class="w"&gt; &lt;/span&gt;us-central1&lt;span class="w"&gt; &lt;/span&gt;gs://&amp;lt;your-bucket-name&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Get a dataset&lt;/h2&gt;
&lt;p&gt;If you do not have a labeled dataset, use the &lt;a href="https://john.soban.ski/fastai-flask.html"&gt;FastAI&lt;/a&gt; dataset notebook to quickly download a labeled dataset, separated by folder.&lt;/p&gt;
&lt;p&gt;If you do have a labeled dataset you can skip this section.&lt;/p&gt;
&lt;h3&gt;Launch AI Platform&lt;/h3&gt;
&lt;p&gt;Spin up an AI platform notebook for this task.&lt;/p&gt;
&lt;p&gt;Log into the Google Cloud Platform (GCP) console at &lt;a href="https://console.cloud.google.com"&gt;console.cloud.google.com&lt;/a&gt; [Non-referral link].&lt;/p&gt;
&lt;p&gt;Type &lt;strong&gt;notebooks&lt;/strong&gt; into the search bar, click &lt;strong&gt;AI Platform Notebooks&lt;/strong&gt; and then click &lt;strong&gt;Enable API&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enable_Notebooks" src="https://john.soban.ski/images/Gcp_Automl_Vision/01_Enable_Notebooks.png"&gt;&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;New Instance&lt;/strong&gt; and then select &lt;strong&gt;Python&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create_Python_Notebook" src="https://john.soban.ski/images/Gcp_Automl_Vision/02_Create_Python_Notebook.png"&gt;&lt;/p&gt;
&lt;p&gt;Launch a terminal.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Launch_Terminal" src="https://john.soban.ski/images/Gcp_Automl_Vision/03_Launch_Terminal.png"&gt;&lt;/p&gt;
&lt;h3&gt;Install FastAI Course v3&lt;/h3&gt;
&lt;p&gt;From the terminal install the FastAI course v3.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/fastai/course-v3.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Install the required &lt;a href="https://john.soban.ski/fastai-flask.html"&gt;FastAI&lt;/a&gt; libs.&lt;/p&gt;
&lt;p&gt;Since this is an ephemeral notebook, you will not need to worry about virtual environments.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;fastai
Collecting&lt;span class="w"&gt; &lt;/span&gt;fastai
&lt;span class="w"&gt;  &lt;/span&gt;Downloading&lt;span class="w"&gt; &lt;/span&gt;fastai-1.0.60-py3-none-any.whl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;237&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kB&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;████████████████████████████████&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;237&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kB&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;MB/s&lt;span class="w"&gt; &lt;/span&gt;
Requirement&lt;span class="w"&gt; &lt;/span&gt;already&lt;span class="w"&gt; &lt;/span&gt;satisfied:&lt;span class="w"&gt; &lt;/span&gt;numpy&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/opt/conda/lib/python3.7/site-packages&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;fastai&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.18.1&lt;span class="o"&gt;)&lt;/span&gt;

...

Building&lt;span class="w"&gt; &lt;/span&gt;wheels&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;collected&lt;span class="w"&gt; &lt;/span&gt;packages:&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3
&lt;span class="w"&gt;  &lt;/span&gt;Building&lt;span class="w"&gt; &lt;/span&gt;wheel&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;setup.py&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;...&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;Created&lt;span class="w"&gt; &lt;/span&gt;wheel&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_ml_py3-7.352.0-py3-none-any.whl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;19189&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42f79de382946ce4af88196dfdcf55cda496237f7db498bd2cc1cce3f788fba6
&lt;span class="w"&gt;  &lt;/span&gt;Stored&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;directory:&lt;span class="w"&gt; &lt;/span&gt;/home/jupyter/.cache/pip/wheels/df/99/da/c34f202dc8fd1dffd35e0ecf1a7d7f8374ca05fbcbaf974b83
Successfully&lt;span class="w"&gt; &lt;/span&gt;built&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3
Installing&lt;span class="w"&gt; &lt;/span&gt;collected&lt;span class="w"&gt; &lt;/span&gt;packages:&lt;span class="w"&gt; &lt;/span&gt;wasabi,&lt;span class="w"&gt; &lt;/span&gt;srsly,&lt;span class="w"&gt; &lt;/span&gt;murmurhash,&lt;span class="w"&gt; &lt;/span&gt;cymem,&lt;span class="w"&gt; &lt;/span&gt;plac,&lt;span class="w"&gt; &lt;/span&gt;preshed,&lt;span class="w"&gt; &lt;/span&gt;catalogue,&lt;span class="w"&gt; &lt;/span&gt;blis,&lt;span class="w"&gt; &lt;/span&gt;thinc,&lt;span class="w"&gt; &lt;/span&gt;spacy,&lt;span class="w"&gt; &lt;/span&gt;torch,&lt;span class="w"&gt; &lt;/span&gt;torchvision,&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3,&lt;span class="w"&gt; &lt;/span&gt;fastprogress,&lt;span class="w"&gt; &lt;/span&gt;fastai
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Download the images&lt;/h3&gt;
&lt;p&gt;From the GUI, Navigate to &lt;strong&gt;nbs --&amp;gt; dl1 --&amp;gt; lesson2-download.ipynb.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Edit --&amp;gt; Clear All Outputs&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Follow the instructions up to the &lt;strong&gt;Download images&lt;/strong&gt; section to create and upload your &lt;strong&gt;mods.csv&lt;/strong&gt; and &lt;strong&gt;rockers.csv&lt;/strong&gt; tables.&lt;/p&gt;
&lt;p&gt;These tables include URLs that point to images from each class.&lt;/p&gt;
&lt;p&gt;Once you are at the &lt;strong&gt;Download Images&lt;/strong&gt; section, replace the presented code with the code below.  &lt;/p&gt;
&lt;p&gt;This prevents you from needing to scroll up to the prior section.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mods&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rockers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;path&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data/brighton_seafront&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;
    &lt;span class="n"&gt;download_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_pics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;verify_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I present the updated code in the following graphic.&lt;/p&gt;
&lt;p&gt;&lt;img alt="FastAI_Notebook" src="https://john.soban.ski/images/Gcp_Automl_Vision/04_FastAI_Notebook.png"&gt;&lt;/p&gt;
&lt;p&gt;Be sure to run the next &lt;strong&gt;verify_images&lt;/strong&gt; cell.&lt;/p&gt;
&lt;h2&gt;Label the data&lt;/h2&gt;
&lt;p&gt;At this point you should have two folders, one named &lt;strong&gt;mods&lt;/strong&gt; and one named &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you used the AI platform to create your data set folders, then change directories to &lt;strong&gt;brighton_seafront&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;course-v3/nbs/dl1/data/brighton_seafront/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Export the name of your Google Cloud Storage (GCS) bucket.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Do not use the bucket name &lt;strong&gt;mods-rockers&lt;/strong&gt; since I own that bucket.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mods-rockers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;brighton_seafront&lt;/strong&gt; folder contains two sub-directories, &lt;strong&gt;mods&lt;/strong&gt; and &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Create a CSV file that records the URI of each image, followed by its label.&lt;/p&gt;
&lt;p&gt;All of the images in the &lt;strong&gt;mods&lt;/strong&gt; directory will get the label &lt;strong&gt;mods&lt;/strong&gt; and all of the images in the &lt;strong&gt;rockers&lt;/strong&gt; directory will get the label &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;ls&lt;span class="w"&gt; &lt;/span&gt;mods&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;/mods/&lt;span class="nv"&gt;$name&lt;/span&gt;,mods&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;ls&lt;span class="w"&gt; &lt;/span&gt;rockers&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;/rockers/&lt;span class="nv"&gt;$name&lt;/span&gt;,rockers&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
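&lt;p&gt;The shell loops above can also be sketched in Python.  The snippet below builds the same CSV rows from a directory tree; &lt;strong&gt;build_labels&lt;/strong&gt; and &lt;strong&gt;my-example-bucket&lt;/strong&gt; are illustrative names, not part of the FastAI or GCP tooling.&lt;/p&gt;

```python
from pathlib import Path
import tempfile

def build_labels(root, bucket, labels=("mods", "rockers")):
    """Return CSV rows of the form gs://bucket/label/file,label."""
    rows = []
    for label in labels:
        # Each sub-directory name doubles as the class label
        for image in sorted((Path(root) / label).iterdir()):
            rows.append(f"gs://{bucket}/{label}/{image.name},{label}")
    return rows

# Demo against a throwaway copy of the brighton_seafront layout
root = Path(tempfile.mkdtemp())
for label in ("mods", "rockers"):
    (root / label).mkdir()
    (root / label / "00000000.jpg").touch()

rows = build_labels(root, "my-example-bucket")
print("\n".join(rows))
```

&lt;p&gt;Point &lt;strong&gt;root&lt;/strong&gt; at your real &lt;strong&gt;brighton_seafront&lt;/strong&gt; directory and write the rows to &lt;strong&gt;labeled_data.csv&lt;/strong&gt; to reproduce the shell version.&lt;/p&gt;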

&lt;p&gt;The first few lines of the &lt;strong&gt;labeled_data.csv&lt;/strong&gt; file read:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;head&lt;span class="w"&gt; &lt;/span&gt;-n3&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv
gs://mods-rockers/mods/00000000.jpg,mods
gs://mods-rockers/mods/00000001.jpg,mods
gs://mods-rockers/mods/00000002.jpg,mods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And the last few lines read:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;tail&lt;span class="w"&gt; &lt;/span&gt;-n3&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv
gs://mods-rockers/rockers/00000097.jpg,rockers
gs://mods-rockers/rockers/00000098.jpg,rockers
gs://mods-rockers/rockers/00000099.jpg,rockers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Copy &lt;strong&gt;labeled_data.csv&lt;/strong&gt;, the &lt;strong&gt;mods&lt;/strong&gt; folder, the &lt;strong&gt;rockers&lt;/strong&gt; folder and all of their contents to your GCS bucket.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;gsutil&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv&lt;span class="w"&gt; &lt;/span&gt;mods/&lt;span class="w"&gt; &lt;/span&gt;rockers/&lt;span class="w"&gt; &lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;At this point, be sure to destroy the AI Platform notebook so that you do not incur further charges.&lt;/p&gt;
&lt;h2&gt;Train the model&lt;/h2&gt;
&lt;h3&gt;Enable the API&lt;/h3&gt;
&lt;p&gt;Log into the Google Cloud Platform (GCP) console at &lt;a href="https://console.cloud.google.com"&gt;console.cloud.google.com&lt;/a&gt; [Once again, this is a non-affiliate link].  &lt;/p&gt;
&lt;p&gt;In the search bar, type &lt;strong&gt;Vision&lt;/strong&gt; and then click &lt;strong&gt;ENABLE AUTOML API&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enable_API" src="https://john.soban.ski/images/Gcp_Automl_Vision/05_Enable_API.png"&gt;&lt;/p&gt;
&lt;h3&gt;Upload your Dataset&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;Get Started --&amp;gt; New Dataset  --&amp;gt; Multi-label classification&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="AutoML Import Dataset" src="https://john.soban.ski/images/Gcp_Automl_Vision/06_AutoML_Import_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;Under &lt;strong&gt;Select files to import&lt;/strong&gt;, select &lt;strong&gt;Select a CSV file on Cloud Storage&lt;/strong&gt; and then enter the URI for the &lt;strong&gt;labeled_data.csv&lt;/strong&gt; file on your &lt;strong&gt;bucket&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select_Bucket_With_Labeled_CSV" src="https://john.soban.ski/images/Gcp_Automl_Vision/07_Select_Bucket_With_Labeled_CSV.png"&gt;&lt;/p&gt;
&lt;p&gt;The import will take several minutes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import_Image" src="https://john.soban.ski/images/Gcp_Automl_Vision/08_Import_Image.png"&gt;&lt;/p&gt;
&lt;h3&gt;View Images&lt;/h3&gt;
&lt;p&gt;After the import completes, you will see your labeled images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Successful_Import" src="https://john.soban.ski/images/Gcp_Automl_Vision/09_Successful_Import.png"&gt;&lt;/p&gt;
&lt;p&gt;A brief perusal of the images shows that some pictures (highlighted in red) include incorrect labels.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad_Labels" src="https://john.soban.ski/images/Gcp_Automl_Vision/10_Bad_Labels.png"&gt;&lt;/p&gt;
&lt;p&gt;For now, let's ignore the bad labels and see what happens.  &lt;/p&gt;
&lt;h2&gt;Train your Model&lt;/h2&gt;
&lt;p&gt;Select &lt;strong&gt;Start Training&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Start training bad labels" src="https://john.soban.ski/images/Gcp_Automl_Vision/11_Start_Training_Bad_Labels.png"&gt;&lt;/p&gt;
&lt;p&gt;The training will consume 16 GPU hours.&lt;/p&gt;
&lt;p&gt;Since GCP farms the training out in parallel, those 16 GPU hours complete in less than an hour of wall-clock time.&lt;/p&gt;
&lt;h2&gt;Evaluate your Model&lt;/h2&gt;
&lt;p&gt;After the training completes, click &lt;strong&gt;Evaluate&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You will see that the model provides sub-90% precision and recall, as shown in the confusion matrix (highlighted in green).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad_Labels_Results" src="https://john.soban.ski/images/Gcp_Automl_Vision/12_Bad_Labels_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;Drill down for more details and you will see that the false positives for &lt;strong&gt;mods&lt;/strong&gt; include two pictures of &lt;strong&gt;mods&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This points to a labeling problem.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad_Labels_Cause_Errors" src="https://john.soban.ski/images/Gcp_Automl_Vision/13_Bad_Labels_Cause_Errors.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Upon second glance, the picture on the right depicts Teddy Boys.  Should I label Teddy Boys Mods, Rockers or delete the picture?  Answer in the comments below!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Fix Labels&lt;/h2&gt;
&lt;p&gt;Click &lt;strong&gt;images&lt;/strong&gt; and change the labels of the troublesome images (or just delete them if you're lazy right now).&lt;/p&gt;
&lt;p&gt;I have a &lt;strong&gt;rocker&lt;/strong&gt; motorcycle labeled &lt;strong&gt;mod&lt;/strong&gt; and a picture that includes both &lt;strong&gt;mods&lt;/strong&gt; and &lt;strong&gt;rockers&lt;/strong&gt; labeled as just &lt;strong&gt;mods&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Delete_Confusing_Ones" src="https://john.soban.ski/images/Gcp_Automl_Vision/14_Delete_Confusing_Ones.png"&gt;&lt;/p&gt;
&lt;p&gt;I like this picture: a bunch of rockers attempt to murder two helmet-less mods, who find it funny.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Both_Mod_and_Rocker" src="https://john.soban.ski/images/Gcp_Automl_Vision/15_Both_Mod_and_Rocker.png"&gt;&lt;/p&gt;
&lt;h2&gt;Re-train model&lt;/h2&gt;
&lt;p&gt;After we clean up the data and re-train, we see a perfect confusion matrix.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Much_Better" src="https://john.soban.ski/images/Gcp_Automl_Vision/16_Much_Better.png"&gt;&lt;/p&gt;
&lt;p&gt;Drilling down, we see that our model gave a &lt;strong&gt;mod&lt;/strong&gt; under arrest the &lt;strong&gt;rocker&lt;/strong&gt; label.&lt;/p&gt;
&lt;p&gt;&lt;img alt="One_Wrong" src="https://john.soban.ski/images/Gcp_Automl_Vision/17_One_Wrong.png"&gt;&lt;/p&gt;
&lt;h2&gt;Deploy the model&lt;/h2&gt;
&lt;p&gt;Unlike FastAI, the Google AI Platform provides one-click deployment of your model.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Test &amp;amp; Use&lt;/strong&gt; and then &lt;strong&gt;Deploy Model&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;GCP takes several minutes to deploy the model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Deploy_The_Model" src="https://john.soban.ski/images/Gcp_Automl_Vision/18_Deploy_The_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;After you deploy the model, click the &lt;strong&gt;Upload Images&lt;/strong&gt; button and upload up to ten images.&lt;/p&gt;
&lt;p&gt;I upload a picture of myself at the park.&lt;/p&gt;
&lt;p&gt;The model reports, with 93% certainty, that I fall under the &lt;strong&gt;Mod&lt;/strong&gt; classification rather than &lt;strong&gt;Rocker&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Serve_Sobanski" src="https://john.soban.ski/images/Gcp_Automl_Vision/19_Serve_Sobanski.png"&gt;&lt;/p&gt;
&lt;p&gt;My paisley shirt and Italian sunglasses give credence to this, although I do have a &lt;strong&gt;Rocker&lt;/strong&gt; hair cut (styled with &lt;a href="https://en.wikipedia.org/wiki/Pomade"&gt;Royal Crown&lt;/a&gt;, no less).&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;GCP provides an AutoML Vision service that automates the manual FastAI tasks of training, optimizing and serving a vision model.&lt;/p&gt;
&lt;p&gt;AutoML Vision also provides an easy-to-use, intuitive labeling service.&lt;/p&gt;
&lt;p&gt;If you can get hold of labeled data, then I recommend the AutoML Vision service.&lt;/p&gt;
&lt;p&gt;Use the AutoML Vision service for serious tasks.  Google throws the kitchen sink at training and tuning.&lt;/p&gt;
&lt;p&gt;Each model consumed sixteen (16) hours of GPU time!  &lt;/p&gt;
&lt;p&gt;My FastAI model ran for two or three minutes, on one GPU.&lt;/p&gt;
&lt;p&gt;My two runs (32 GPU hours total) cost about $100.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricey" src="https://john.soban.ski/images/Gcp_Automl_Vision/20_Pricey.png"&gt;&lt;/p&gt;
&lt;p&gt;Google, however, gave me $176.00 to experiment with the model training and serving.&lt;/p&gt;
&lt;p&gt;From Google:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Free Trial!
You can try AutoML Vision Object Detection for free by using 40 free node hours each for training and online prediction, and 1 free node hour for batch prediction, per billing account. Your free node hours are issued right before you create your first model. For batch prediction, the free node hour is issued at the time of the first batch prediction is initiated. You have up to one year to use them.
Prices are listed in US Dollars (USD). If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I did not need to eat into the $300 in free credits Google provided when I signed up for GCP!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Free Goody" src="https://john.soban.ski/images/Gcp_Automl_Vision/21_Free_Goody.png"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Data Science"></category><category term="Machine Learning"></category></entry><entry><title>FastAI x Flask - Mods vs. Rockers!</title><link href="https://john.soban.ski/fastai-flask.html" rel="alternate"></link><published>2019-07-31T22:26:00-04:00</published><updated>2019-07-31T22:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2019-07-31:/fastai-flask.html</id><summary type="html">&lt;p&gt;Fastai provides helper functions on top of Pytorch to help us wrangle, clean, and process data.  In this HOWTO we will accomplish the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy an AWS g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Compile and install NVIDIA drivers on our g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Use a Jupyter notebook to clean and organize image data …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Fastai provides helper functions on top of Pytorch to help us wrangle, clean, and process data.  In this HOWTO we will accomplish the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy an AWS g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Compile and install NVIDIA drivers on our g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Use a Jupyter notebook to clean and organize image data&lt;/li&gt;
&lt;li&gt;Train a model that classifies Mods vs. Rockers&lt;/li&gt;
&lt;li&gt;Deploy an image classification web app using Flask&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Fastai delivers a series of videos and Jupyter notebooks that teach us how to quickly apply ML/AI techniques to real-world problems.  The Jupyter notebooks require Python 3 libraries and a GPU.  Crestle.ai provided a one-click deployment of the required environment.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Crestle Splash" src="https://john.soban.ski/images/Fastai_Flask/01_Crestle.png"&gt; &lt;/p&gt;
&lt;p&gt;You enter your credit card and then click "run."&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enter your credit card" src="https://john.soban.ski/images/Fastai_Flask/02_Pay.png"&gt;&lt;/p&gt;
&lt;p&gt;Unfortunately, as of August 31st, 2019, Crestle no longer provides their service to the public.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As the platform and its usage continues to evolve, we've decided to move Crestle.ai to an Enterprise model. This means that as of August 31, 2019, the service will no longer be available to individual participants.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While other platforms still provide this service to individual users, I decided that it would be a good exercise to deploy our own environment.&lt;/p&gt;
&lt;h2&gt;Deploy the FASTAI environment to AWS&lt;/h2&gt;
&lt;p&gt;In order to crunch the ML/AI data at a reasonable pace, I recommend a g3.8xlarge instance.  Select this instance type with the Amazon Linux 2 operating system.&lt;/p&gt;
&lt;h3&gt;Install the NVIDIA driver&lt;/h3&gt;
&lt;p&gt;The AWS GPU instances require a CUDA driver to enable Python to send jobs to the card.  &lt;/p&gt;
&lt;p&gt;The AWS &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html"&gt;documentation&lt;/a&gt; provides decent instructions on how to install this driver, and I summarize the steps below.&lt;/p&gt;
&lt;p&gt;The AWS Linux AMI provides the AWS Command Line Interface (CLI) and we use that CLI to fetch the latest NVIDIA driver.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;aws&lt;span class="w"&gt; &lt;/span&gt;s3&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;--recursive&lt;span class="w"&gt; &lt;/span&gt;s3://ec2-linux-nvidia-drivers/latest/&lt;span class="w"&gt; &lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The driver requires certain development packages, so we will install them along with other useful server packages into our Operating System.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: Python3 requires installation of the AWS EPEL repository&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;amazon-linux-extras&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;epel
$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;yum&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;bzip2&lt;span class="w"&gt; &lt;/span&gt;gcc&lt;span class="w"&gt; &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;kernel-devel-*&lt;span class="w"&gt; &lt;/span&gt;python36&lt;span class="w"&gt; &lt;/span&gt;python36-dev*&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Enter the following configurations into our &lt;strong&gt;modprobe&lt;/strong&gt; blacklist.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;vim&lt;span class="w"&gt; &lt;/span&gt;/etc/modprobe.d/blacklist.conf
blacklist&lt;span class="w"&gt; &lt;/span&gt;vga16fb
blacklist&lt;span class="w"&gt; &lt;/span&gt;nouveau
blacklist&lt;span class="w"&gt; &lt;/span&gt;rivafb
blacklist&lt;span class="w"&gt; &lt;/span&gt;nvidiafb
blacklist&lt;span class="w"&gt; &lt;/span&gt;rivatv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now edit our boot configuration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;vim&lt;span class="w"&gt; &lt;/span&gt;/etc/default/grub
&lt;span class="nv"&gt;GRUB_CMDLINE_LINUX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rblacklist=nouveau&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Re-compile and install the boot configuration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;grub2-mkconfig&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;/boot/grub2/grub.cfg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Edit the NVIDIA conf file to turn off GUI Features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;vim&lt;span class="w"&gt; &lt;/span&gt;/etc/nvidia/gridd.conf
&lt;span class="nv"&gt;EnableUI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;FALSE
&lt;span class="nv"&gt;FeatureType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="nv"&gt;IgnoreSP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;TRUE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run the NVIDIA compiler and reboot.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;/bin/sh&lt;span class="w"&gt; &lt;/span&gt;./NVIDIA-Linux-x86_64-430.30-grid.run&lt;span class="w"&gt; &lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You may see the following warning, just hit OK and proceed, since you will not need the GUI features of the driver.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;nvidia-installer&lt;span class="w"&gt; &lt;/span&gt;was&lt;span class="w"&gt; &lt;/span&gt;forced&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;guess&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;library&lt;span class="w"&gt; &lt;/span&gt;path&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib64&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;module&lt;span class="w"&gt; &lt;/span&gt;path
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib64/xorg/modules&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;these&lt;span class="w"&gt; &lt;/span&gt;paths&lt;span class="w"&gt; &lt;/span&gt;were&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;queryable&lt;span class="w"&gt; &lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;system.&lt;span class="w"&gt;  &lt;/span&gt;If&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;fails
&lt;span class="w"&gt;           &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;find&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;NVIDIA&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;driver&lt;span class="w"&gt; &lt;/span&gt;module,&lt;span class="w"&gt; &lt;/span&gt;please&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;pkg-config&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;utility&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;the
&lt;span class="w"&gt;           &lt;/span&gt;X.Org&lt;span class="w"&gt; &lt;/span&gt;SDK/development&lt;span class="w"&gt; &lt;/span&gt;package&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;distribution&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;reinstall&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;driver.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once the server reboots, you can test to see if the driver installed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;nvidia-smi&lt;span class="w"&gt; &lt;/span&gt;-q&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;head&lt;/span&gt;

&lt;span class="o"&gt;==============&lt;/span&gt;NVSMI&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;==============&lt;/span&gt;

Timestamp&lt;span class="w"&gt;                           &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;Wed&lt;span class="w"&gt; &lt;/span&gt;Aug&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;00&lt;/span&gt;:07:40&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;
Driver&lt;span class="w"&gt; &lt;/span&gt;Version&lt;span class="w"&gt;                      &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;430&lt;/span&gt;.30
CUDA&lt;span class="w"&gt; &lt;/span&gt;Version&lt;span class="w"&gt;                        &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.2

Attached&lt;span class="w"&gt; &lt;/span&gt;GPUs&lt;span class="w"&gt;                       &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
GPU&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;00000000&lt;/span&gt;:00:1D.0
&lt;span class="w"&gt;    &lt;/span&gt;Product&lt;span class="w"&gt; &lt;/span&gt;Name&lt;span class="w"&gt;                    &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;Tesla&lt;span class="w"&gt; &lt;/span&gt;M60
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We will use a Python virtual environment to check whether Python can use the GPU.&lt;/p&gt;
&lt;p&gt;FastAI recommends Conda to install the required libraries, and we will use Conda later.&lt;/p&gt;
&lt;p&gt;I find Conda, however, to be &lt;strong&gt;&lt;em&gt;extremely&lt;/em&gt;&lt;/strong&gt; slow, so for this quick &lt;strong&gt;hello world&lt;/strong&gt; we will use PIP.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;python3&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;venv&lt;span class="w"&gt; &lt;/span&gt;aws
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws/
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./bin/activate
&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-U&lt;span class="w"&gt; &lt;/span&gt;pip
&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;torch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now run the Python interpreter and verify that Python can use the GPU.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;python
Python&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.6.8&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;default,&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:02:35&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;GCC&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.8.5&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;20150623&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Red&lt;span class="w"&gt; &lt;/span&gt;Hat&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.8.5-36&lt;span class="o"&gt;)]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;linux
Type&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;help&amp;quot;&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;copyright&amp;quot;&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;credits&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;or&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;license&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;more&lt;span class="w"&gt; &lt;/span&gt;information.
&amp;gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;import&lt;span class="w"&gt; &lt;/span&gt;torch
&amp;gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;torch.cuda.is_available&lt;span class="o"&gt;()&lt;/span&gt;
True
&amp;gt;&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Install the Conda libraries&lt;/h3&gt;
&lt;p&gt;If Python can use the GPU, deactivate the virtual environment to proceed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;deactivate
$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now install Conda.  You may want to visit the Anaconda website to see if they provide an updated version.&lt;/p&gt;
&lt;p&gt;Once installed, log out of your session and then log back in to add the &lt;strong&gt;&lt;em&gt;conda&lt;/em&gt;&lt;/strong&gt; command to your path.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-XGET&lt;span class="w"&gt; &lt;/span&gt;https://repo.anaconda.com/archive/Anaconda2-2019.03-Linux-x86_64.sh&lt;span class="w"&gt; &lt;/span&gt;-O
$&lt;span class="w"&gt; &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;Anaconda2-2019.03-Linux-x86_64.sh
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With Conda installed, create and activate a Conda environment.  I named mine &lt;strong&gt;&lt;em&gt;boot&lt;/em&gt;&lt;/strong&gt; (For Bootstrap).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;conda&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;boot&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.6.8&lt;span class="w"&gt; &lt;/span&gt;anaconda
$&lt;span class="w"&gt; &lt;/span&gt;conda&lt;span class="w"&gt; &lt;/span&gt;activate&lt;span class="w"&gt; &lt;/span&gt;boot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;FastAI requires Torch, and Torch requires OS-level development libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;yum&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;libxml2-devel&lt;span class="w"&gt; &lt;/span&gt;libxslt1-dev&lt;span class="w"&gt; &lt;/span&gt;zlib1g-dev&lt;span class="w"&gt; &lt;/span&gt;g++
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, since I find Conda to be &lt;strong&gt;&lt;em&gt;painfully&lt;/em&gt;&lt;/strong&gt; slow, I will just install the packages via PIP (while in my conda environment).&lt;/p&gt;
&lt;p&gt;Feel free to use the &lt;strong&gt;&lt;em&gt;conda install&lt;/em&gt;&lt;/strong&gt; command if you prefer.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  If you choose to use Conda channels to install packages, ensure that you have the correct package name.  I, for example, commanded Conda to install &lt;strong&gt;pytorch&lt;/strong&gt; when the package is in fact named &lt;strong&gt;torch&lt;/strong&gt;.  As a result, Conda searched for several hours for a package named &lt;strong&gt;pytorch&lt;/strong&gt;, with no errors or warnings.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;fastai&lt;span class="w"&gt; &lt;/span&gt;torch&lt;span class="w"&gt; &lt;/span&gt;flask-bootstrap&lt;span class="w"&gt; &lt;/span&gt;flask-wtf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We now have everything needed to start the notebook.&lt;/p&gt;
&lt;h3&gt;Start the notebook&lt;/h3&gt;
&lt;p&gt;Use &lt;strong&gt;&lt;em&gt;git&lt;/em&gt;&lt;/strong&gt; to pull the FASTAI course to your server, navigate to the &lt;strong&gt;&lt;em&gt;dl1&lt;/em&gt;&lt;/strong&gt; directory and then start the server.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/fastai/course-v3.git
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;course-v3/nbs/dl1/
$&lt;span class="w"&gt; &lt;/span&gt;jupyter&lt;span class="w"&gt; &lt;/span&gt;notebook&lt;span class="w"&gt; &lt;/span&gt;--ip&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Ensure that your EC2 instance's security group provides access to port 8888, and then access the server on this port via a web browser.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Notebook" src="https://john.soban.ski/images/Fastai_Flask/03_Jupyter_Notebook.png"&gt;&lt;/p&gt;
&lt;h3&gt;Use the notebook to clean the data&lt;/h3&gt;
&lt;p&gt;Launch the &lt;strong&gt;&lt;em&gt;lesson2-download.ipynb&lt;/em&gt;&lt;/strong&gt; notebook, which provides the data download and cleaning steps.&lt;/p&gt;
&lt;p&gt;The instructions are quite clear.  You use the Chrome developer tools to gather the URLs of pictures from Google images that match your classes.&lt;/p&gt;
&lt;p&gt;I ran these instructions and produced &lt;strong&gt;&lt;em&gt;mods.csv&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;rockers.csv&lt;/em&gt;&lt;/strong&gt; files.&lt;/p&gt;
&lt;p&gt;Once you have these files on your server, the steps get confusing, since the notebook expects you to scroll back for each section.&lt;/p&gt;
&lt;p&gt;To ease this confusion, I created separate cells for the &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;rockers&lt;/em&gt;&lt;/strong&gt; steps.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt; cells set folder to &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt;, file to &lt;strong&gt;&lt;em&gt;mods.csv&lt;/em&gt;&lt;/strong&gt;, and the parent folder to &lt;strong&gt;&lt;em&gt;data/brighton_seafront&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I also added a cell to download the files.&lt;/p&gt;
&lt;p&gt;The notebook directs us to upload the &lt;strong&gt;&lt;em&gt;mods.csv&lt;/em&gt;&lt;/strong&gt; file to the &lt;strong&gt;dl1&lt;/strong&gt; folder, which you do via the notebook UI.&lt;/p&gt;
&lt;p&gt;Be sure to upload this file before you execute the download command.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mods cells" src="https://john.soban.ski/images/Fastai_Flask/04_Mods_DL.png"&gt;&lt;/p&gt;
&lt;p&gt;I then created a new set of cells to set folder to &lt;strong&gt;&lt;em&gt;rockers&lt;/em&gt;&lt;/strong&gt;, file to &lt;strong&gt;&lt;em&gt;rockers.csv&lt;/em&gt;&lt;/strong&gt;, and the same parent folder.&lt;/p&gt;
&lt;p&gt;Once again, I added a cell to download the files and print the results.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mods cells" src="https://john.soban.ski/images/Fastai_Flask/05_Rockers_DL.png"&gt;&lt;/p&gt;
&lt;p&gt;Run the cells until you get to the &lt;strong&gt;learn.recorder.plot()&lt;/strong&gt; step.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr_find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recorder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run the plot cell.  The resulting graph helps us select the learning rate boundaries.&lt;/p&gt;
&lt;p&gt;&lt;img alt="LR" src="https://john.soban.ski/images/Fastai_Flask/06_LR.png"&gt;&lt;/p&gt;
&lt;p&gt;Based on this image, we want to set the learning rate between &lt;strong&gt;&lt;em&gt;1e-4&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;1e-3&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_one_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
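&lt;p&gt;The &lt;strong&gt;&lt;em&gt;slice(1e-4, 1e-3)&lt;/em&gt;&lt;/strong&gt; argument tells fastai to give the earliest layer group the smallest learning rate and the head the largest, spaced geometrically in between.  A sketch of that spacing, assuming three layer groups (the function name is my own, not fastai API):&lt;/p&gt;

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically space n_groups learning rates between lo and hi,
    mimicking how slice(lo, hi) spreads rates across layer groups."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1.0 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

lrs = spread_lrs(1e-4, 1e-3, 3)
print(lrs[0], lrs[-1])  # smallest rate for the early layers, largest for the head
```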

&lt;p&gt;Keep running the model until you get to the image cleaner step.&lt;/p&gt;
&lt;p&gt;As you can see, our data set includes images with incorrect labels.&lt;/p&gt;
&lt;p&gt;The image on the right, for example, shows a &lt;strong&gt;&lt;em&gt;mod&lt;/em&gt;&lt;/strong&gt;, not a &lt;strong&gt;&lt;em&gt;rocker&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad Data" src="https://john.soban.ski/images/Fastai_Flask/07_Bad_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Once you have used the widget to fix the data, go back to the following cell, uncomment it, and run it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageDataBunch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cleaned.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ds_tfms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_transforms&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imagenet_stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that this cell uses &lt;strong&gt;&lt;em&gt;cleaned.csv&lt;/em&gt;&lt;/strong&gt;, a file created by the image cleaning widget.&lt;/p&gt;
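&lt;p&gt;A quick sanity check on &lt;strong&gt;&lt;em&gt;cleaned.csv&lt;/em&gt;&lt;/strong&gt; confirms that the widget kept a sensible number of images per class.  The sketch below assumes the file holds a header row followed by &lt;strong&gt;&lt;em&gt;name,label&lt;/em&gt;&lt;/strong&gt; rows; the function name and sample rows are my own.&lt;/p&gt;

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the widget's cleaned.csv output.
CLEANED_CSV = """name,label
mods/00000001.jpg,mods
mods/00000002.jpg,mods
rockers/00000001.jpg,rockers
"""

def label_counts(csv_text):
    """Count the images that survive cleaning, per class label."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)                        # skip the header row
    return Counter(row[1] for row in reader if row)

counts = label_counts(CLEANED_CSV)
```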
&lt;p&gt;Run the next few cells.  One cell will show correctly labeled data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Samples" src="https://john.soban.ski/images/Fastai_Flask/08_Samples.png"&gt;&lt;/p&gt;
&lt;h3&gt;Re-run the learn process&lt;/h3&gt;
&lt;p&gt;Re-learn the model based on the new cleaned data set.&lt;/p&gt;
&lt;p&gt;Run the cells in order, and be sure to stop at the &lt;strong&gt;&lt;em&gt;Cleaning Up&lt;/em&gt;&lt;/strong&gt; section.&lt;/p&gt;
&lt;p&gt;The new model has a nice, low error rate.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Good Error" src="https://john.soban.ski/images/Fastai_Flask/09_Good_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;The confusion matrix shows one miss.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Confuse" src="https://john.soban.ski/images/Fastai_Flask/10_Confuse.png"&gt;&lt;/p&gt;
&lt;h3&gt;Export the model&lt;/h3&gt;
&lt;p&gt;Skip the &lt;strong&gt;&lt;em&gt;Cleaning Up&lt;/em&gt;&lt;/strong&gt; section and proceed to the following cell.&lt;/p&gt;
&lt;p&gt;Once there, execute the cell.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;export&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You now have a file named &lt;strong&gt;&lt;em&gt;export.pkl&lt;/em&gt;&lt;/strong&gt; (pickle) that includes the model.&lt;/p&gt;
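&lt;p&gt;The &lt;strong&gt;&lt;em&gt;export.pkl&lt;/em&gt;&lt;/strong&gt; file is an ordinary Python pickle of the learner, which means the Flask app can unpickle it once at startup rather than once per request.  A toy round trip that illustrates the idea (the class below is a stand-in of my own, not the fastai API):&lt;/p&gt;

```python
import io
import pickle

class ToyModel:
    """Stand-in for the fastai Learner that learn.export() pickles."""
    classes = ['mods', 'rockers']

buf = io.BytesIO()
pickle.dump(ToyModel(), buf)   # learn.export() does this to export.pkl
buf.seek(0)
model = pickle.load(buf)       # load_learner('.') does this at app startup
```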
&lt;p&gt;You can now save your work and shut down the Jupyter notebook.&lt;/p&gt;
&lt;h3&gt;Deploy the Flask App&lt;/h3&gt;
&lt;p&gt;Now that we have a model, we can create a Flask app that allows users to upload a picture and have the model classify the picture as a &lt;strong&gt;&lt;em&gt;mod&lt;/em&gt;&lt;/strong&gt; or a &lt;strong&gt;&lt;em&gt;rocker&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In your home directory, create and enter a directory named &lt;strong&gt;&lt;em&gt;brighton_seafront&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~
&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;brighton_seafront
&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;brighton_seafront
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this directory, create the following file, named &lt;strong&gt;&lt;em&gt;application.py&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/usr/bin/env python&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastai.vision&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url_for&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask_bootstrap&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bootstrap&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask_wtf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FlaskForm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask_wtf.file&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileField&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;wtforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SubmitField&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;werkzeug&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;secure_filename&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;SECRET_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a-bunch-of-secret-stuff&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;BOOTSTRAP_SERVE_LOCAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UploadForm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FlaskForm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FileField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;submit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SubmitField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Submit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;application&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Bootstrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;learn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_learner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@application&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;GET&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;POST&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;form&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UploadForm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;validate_on_submit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secure_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;uploads/&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;open_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;uploads/&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;pred_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;pred_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;That picture belongs to &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_class&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;success&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;upload.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0.0.0.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
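&lt;p&gt;One caveat: the route above saves whatever file the user submits.  A hedged hardening step would check the extension against an allow-list before saving; the names below are my own, not part of Flask.&lt;/p&gt;

```python
# Hypothetical allow-list check to run before form.file.data.save(...).
ALLOWED_EXTENSIONS = {'jpg', 'jpeg', 'png', 'gif'}

def allowed_file(filename):
    """True when the filename carries an extension from the allow-list."""
    return ('.' in filename
            and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS)
```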

&lt;p&gt;Make two directories, one named &lt;strong&gt;&lt;em&gt;uploads&lt;/em&gt;&lt;/strong&gt; and one named &lt;strong&gt;&lt;em&gt;templates&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;uploads&lt;span class="w"&gt; &lt;/span&gt;templates
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create the following file in the &lt;strong&gt;&lt;em&gt;templates&lt;/em&gt;&lt;/strong&gt; folder (i.e. &lt;strong&gt;&lt;em&gt;templates/upload.html&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;extends&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bootstrap/base.html&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bootstrap/wtf.html&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wtf&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;container&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_flashed_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;with_categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;warning&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;danger&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alert alert-{{ category }} alert-dismissible&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alert&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;button&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;close&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dismiss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alert&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;aria&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Close&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="n"&gt;aria&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;true&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;strong&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;strong&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endfor&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endwith&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;container&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;h3&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Please&lt;/span&gt; &lt;span class="n"&gt;upload&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;h3&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;wtf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quick_form&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endblock&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Copy &lt;strong&gt;&lt;em&gt;export.pkl&lt;/em&gt;&lt;/strong&gt; from your &lt;strong&gt;&lt;em&gt;course-v3&lt;/em&gt;&lt;/strong&gt; folder to the current &lt;strong&gt;&lt;em&gt;brighton_seafront&lt;/em&gt;&lt;/strong&gt; folder.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;/home/ec2-user/course-v3/nbs/dl1/data/brighton_seafront/export.pkl&lt;span class="w"&gt; &lt;/span&gt;/home/ec2-user/brighton_seafront
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, start your Flask app.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;brighton_seafront&lt;span class="o"&gt;]&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;./application.py
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Serving&lt;span class="w"&gt; &lt;/span&gt;Flask&lt;span class="w"&gt; &lt;/span&gt;app&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;application&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;lazy&lt;span class="w"&gt; &lt;/span&gt;loading&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Environment:&lt;span class="w"&gt; &lt;/span&gt;production
&lt;span class="w"&gt;   &lt;/span&gt;WARNING:&lt;span class="w"&gt; &lt;/span&gt;This&lt;span class="w"&gt; &lt;/span&gt;is&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;development&lt;span class="w"&gt; &lt;/span&gt;server.&lt;span class="w"&gt; &lt;/span&gt;Do&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;use&lt;span class="w"&gt; &lt;/span&gt;it&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;production&lt;span class="w"&gt; &lt;/span&gt;deployment.
&lt;span class="w"&gt;   &lt;/span&gt;Use&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;production&lt;span class="w"&gt; &lt;/span&gt;WSGI&lt;span class="w"&gt; &lt;/span&gt;server&lt;span class="w"&gt; &lt;/span&gt;instead.
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Debug&lt;span class="w"&gt; &lt;/span&gt;mode:&lt;span class="w"&gt; &lt;/span&gt;off
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Running&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;http://0.0.0.0:5000/&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Press&lt;span class="w"&gt; &lt;/span&gt;CTRL+C&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;quit&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you navigate to the &lt;strong&gt;&lt;em&gt;/upload&lt;/em&gt;&lt;/strong&gt; path on port &lt;strong&gt;&lt;em&gt;5000&lt;/em&gt;&lt;/strong&gt;, you will see a screen that directs you to upload a picture for classification.&lt;/p&gt;
&lt;p&gt;&lt;img alt="App Upload" src="https://john.soban.ski/images/Fastai_Flask/11_App_Upload.png"&gt;&lt;/p&gt;
&lt;p&gt;I uploaded a picture of the young Glam Rocker Marc Bolan, which the photographer snapped during his mod phase.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Marc Bolan" src="https://john.soban.ski/images/Fastai_Flask/12_Marc_Bolan.png"&gt;&lt;/p&gt;
&lt;p&gt;After upload, the model successfully classified Marc as a mod!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Result" src="https://john.soban.ski/images/Fastai_Flask/13_Result.png"&gt;&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;We created a fun toy app that differentiates between &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;rockers&lt;/em&gt;&lt;/strong&gt;.  Our application blocks, however, and can only accommodate one user at a time.  If asynchronous web applications interest you, take a look at Python &lt;a href="https://john.soban.ski/backup-aws-provided-elasticsearch-to-amazon-simple-storage-service.html"&gt;Celery&lt;/a&gt; or Amazon &lt;a href="https://john.soban.ski/deploy_an_advanced_elasticsearch_proxy_with_lambda.html"&gt;Lambda&lt;/a&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="Flask"></category><category term="HOWTO"></category></entry><entry><title>A Graphical Intro to Probabilistic Neural Networks (PNN)</title><link href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html" rel="alternate"></link><published>2018-10-30T23:47:00-04:00</published><updated>2018-10-30T23:47:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2018-10-30:/graphical_intro_to_probabilistic_neural_networks.html</id><summary type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Machine Learning engineers use Probabilistic Neural Networks (&lt;a href="https://en.wikipedia.org/wiki/Probabilistic_neural_network"&gt;PNN&lt;/a&gt;) for  classification and pattern recognition tasks.  PNN use a &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Parzen Window&lt;/a&gt; along with a non-negative kernel function to estimate the probability distribution function (&lt;a href="https://en.wikipedia.org/wiki/Probability_density_function"&gt;PDF&lt;/a&gt;) of each class.  The Parzen approach enables non-parametric estimation of the PDF.&lt;/p&gt;
&lt;p&gt;In this blog post I …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Machine Learning engineers use Probabilistic Neural Networks (&lt;a href="https://en.wikipedia.org/wiki/Probabilistic_neural_network"&gt;PNN&lt;/a&gt;) for  classification and pattern recognition tasks.  PNN use a &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Parzen Window&lt;/a&gt; along with a non-negative kernel function to estimate the probability distribution function (&lt;a href="https://en.wikipedia.org/wiki/Probability_density_function"&gt;PDF&lt;/a&gt;) of each class.  The Parzen approach enables non-parametric estimation of the PDF.&lt;/p&gt;
&lt;p&gt;In this blog post I will discuss the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What is a Parzen PNN?&lt;ul&gt;
&lt;li&gt;Animated example of the Parzen algorithm&lt;/li&gt;
&lt;li&gt;Animated example of a Parzen Neural Network &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Normalization of Training Data&lt;ul&gt;
&lt;li&gt;Trade several approaches&lt;/li&gt;
&lt;li&gt;Effectiveness of approaches - Parzen vs. Nearest Neighbor&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Reduced Coulomb Energy Networks&lt;ul&gt;
&lt;li&gt;Descriptive Animation&lt;/li&gt;
&lt;li&gt;Visualization of RCE on the normalization approach&lt;/li&gt;
&lt;li&gt;Benefits of Ambiguous Regions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RCE applied to the Bupa Liver disorders data set&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What is a Parzen PNN?&lt;/h2&gt;
&lt;p&gt;Mathworks provides a simple definition of a &lt;a href="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/11880/versions/1/previews/ParzenPNN/html/demo.html"&gt;Parzen Probabilistic Neural Network&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Parzen Probabilistic Neural Networks (PPNN) are a simple type of neural network used to classify data vectors. This [sic] classifiers are based on the Bayesian theory where the a posteriori probability density function (apo-pdf) is estimated from data using the Parzen window technique.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;PPNN allow a non-parametric approach to estimate the required Bayesian Classifier probabilities &lt;strong&gt;&lt;em&gt;P(x|w&lt;sub&gt;i&lt;/sub&gt;)&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;P(w&lt;sub&gt;i&lt;/sub&gt;)&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bayes Classifier" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/00_Bayes_Classifier.png"&gt;&lt;/p&gt;
&lt;p&gt;In action, the PPNN mechanics are easy to follow.  The PPNN takes a test vector, computes its dot product with each hidden-layer weight vector, and then chooses the winning class based on the highest output value.  The next section includes an animated cartoon that shows the PPNN visually.&lt;/p&gt;
&lt;h3&gt;Animated example of the Parzen algorithm&lt;/h3&gt;
&lt;p&gt;Suppose you have three classes, and the following training data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Var1&lt;/th&gt;
&lt;th&gt;Var2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Green&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Purple&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Purple&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Yellow&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;Green&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You now want to use a PPNN to classify the color of the observation &lt;strong&gt;&lt;em&gt;( Var1 = 0.75, Var2 = 0.25 )&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The cartoon below shows the weights as filled colored boxes.  In Column A, for example, weight one (WA1) is half full (e.g. 0.5) and weight two (WA2) is three quarters full (e.g. 0.75).  The animation shows the dot product of the test pattern &lt;strong&gt;&lt;em&gt;( X1 = 0.75, X2 = 0.25)&lt;/em&gt;&lt;/strong&gt; with the weight vectors, an activation function, and then the selection of the winner.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Parzen Cartoon" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/Parzen_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h3&gt;Animated example of a Parzen Neural Network&lt;/h3&gt;
&lt;p&gt;Now let's take a look at the classification approach using the familiar neural network diagram.  The input layer (bottom) includes our test pattern &lt;strong&gt;&lt;em&gt;( X1 = 0.75, X2 = 0.25)&lt;/em&gt;&lt;/strong&gt;; the hidden layer includes weight vectors assigned to classes based on the training patterns.  The PPNN then connects the hidden layer to the appropriate class in the output layer.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Parzen Cartoon" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/Parzen_Neural_Net_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h2&gt;Normalization of Training Data&lt;/h2&gt;
&lt;p&gt;The Mathworks &lt;a href="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/11880/versions/1/previews/ParzenPNN/html/demo.html"&gt;PPNN web page&lt;/a&gt; specifies that we must normalize both our weight vectors and training vectors.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The weights on the first [hidden] layer are trained as follows: each sample data is normalized so that its length becomes unitary, each sample data becomes a neuron with the normalized values as weights w.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This next section shows different approaches to normalize the training data.&lt;/p&gt;
&lt;h3&gt;Trade several approaches&lt;/h3&gt;
&lt;p&gt;I use the following data set for this trade study.  &lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;x&lt;/th&gt;
&lt;th&gt;y&lt;/th&gt;
&lt;th&gt;class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here is a plot of the training data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Original Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/01_Original.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note that in this toy example, we can set up a simple classifier via a vertical line at &lt;/em&gt;&lt;strong&gt;X = 2.25&lt;/strong&gt;&lt;em&gt; and just use the &lt;/em&gt;&lt;strong&gt;x&lt;/strong&gt;&lt;em&gt; values to decide.  Never mind that, though, since the point of this section is to illustrate different normalization techniques and then look at the effectiveness of different classification approaches.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When we normalize over all the training data, you see that the &lt;strong&gt;&lt;em&gt;(x, y)&lt;/em&gt;&lt;/strong&gt; axes scale to &lt;strong&gt;&lt;em&gt;( 1, 1 )&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Normalized over all training data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/02_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;If we center the data and normalize, the scale goes from &lt;strong&gt;&lt;em&gt;-1&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt; on both axes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Centered and normalized over all training data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/03_Centered_And_Normailzed_Over_All_Training.png"&gt;&lt;/p&gt;
&lt;p&gt;If we normalize to class-specific magnitude, it makes matters worse.  We no longer have clean separation of the classes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="04_Notmalized_To_Class_Specific_Magnitude" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/04_Notmalized_To_Class_Specific_Magnitude.png"&gt;&lt;/p&gt;
&lt;p&gt;If we normalize on a per-vector basis, we get built-in error.  Pattern &lt;strong&gt;&lt;em&gt;(0.75, 0.75)&lt;/em&gt;&lt;/strong&gt; now belongs to both Class &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt; and Class &lt;strong&gt;&lt;em&gt;+&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="05_Normalized_On_A_Per_Vector_Basis" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/05_Normalized_On_A_Per_Vector_Basis.png"&gt;&lt;/p&gt;
&lt;h3&gt;Effectiveness of approaches - Parzen vs. Nearest Neighbor&lt;/h3&gt;
&lt;p&gt;Now let's look at the effectiveness of PPNN vs. the &lt;a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm"&gt;&lt;em&gt;k&lt;/em&gt;-nearest neighbor&lt;/a&gt; (KNN) algorithm.  KNN provides another non-parametric method of classification.  Instead of using a kernel to estimate the parent PDF, it looks at the class labels of the &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; closest training points.  In the graphics below the gray regions depict Class One (&lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt;) and the white regions depict Class Two (&lt;strong&gt;&lt;em&gt;+&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
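&lt;p&gt;For reference, a bare-bones nearest-neighbor classifier (a sketch of the &lt;strong&gt;&lt;em&gt;k = 1&lt;/em&gt;&lt;/strong&gt; case, using the toy training set above) looks like this:&lt;/p&gt;

```python
import numpy as np

def nearest_neighbor(x, train, labels):
    """1-NN: label the query with the class of its closest training point."""
    train = np.asarray(train, dtype=float)
    dists = np.linalg.norm(train - np.asarray(x, dtype=float), axis=1)
    return labels[int(np.argmin(dists))]

train  = [[2.5, 2.5], [3.0, 1.0], [4.0, 2.0], [1.0, 1.0], [1.0, 2.0], [2.0, 2.5]]
labels = ['+', '+', '+', 'X', 'X', 'X']
print(nearest_neighbor([3.5, 1.5], train, labels))
```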
&lt;p&gt;First let's look at the case where we normalized each training pattern to class-specific magnitude.  If you recall, this approach appeared to look bad, scrunching the two classes close to each other.&lt;/p&gt;
&lt;p&gt;&lt;img alt="06_Case_1_Normalized_To_Class_Specific_Magnitude" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/06_Case_1_Normalized_To_Class_Specific_Magnitude.png"&gt;&lt;/p&gt;
&lt;p&gt;KNN, believe it or not, does a good job of classifying the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="07_Nearest_Neighbor_Normalized_By_Class" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/07_Nearest_Neighbor_Normalized_By_Class.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN fails, classifying all of Class 2 as Class 1.&lt;/p&gt;
&lt;p&gt;&lt;img alt="08_Parzen_Neural_Net_Normalized_By_Class" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/08_Parzen_Neural_Net_Normalized_By_Class.png"&gt;&lt;/p&gt;
&lt;p&gt;The second case scales the training data to &lt;strong&gt;&lt;em&gt;(0,1)&lt;/em&gt;&lt;/strong&gt; on both axes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="09_Case_2_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/09_Case_2_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;KNN handles the classification with ease.&lt;/p&gt;
&lt;p&gt;&lt;img alt="10_Nearest_Neighbor_Over_All_Samples_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/10_Nearest_Neighbor_Over_All_Samples_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN (using &lt;strong&gt;&lt;em&gt;&amp;#963; = &lt;sup&gt;1&lt;/sup&gt;/&lt;sub&gt;4&lt;/sub&gt;&lt;/em&gt;&lt;/strong&gt;) fails.  It allocates a tiny box region to Class 1, and classifies everything else to Class 2.&lt;/p&gt;
&lt;p&gt;&lt;img alt="11_Parzen_Neural_Net_Over_All_Samples_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/11_Parzen_Neural_Net_Over_All_Samples_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;Normalizing on a per-sample basis introduces built-in error.  Note again the overlap of the &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;+&lt;/em&gt;&lt;/strong&gt; at &lt;strong&gt;&lt;em&gt;( 0.75, 0.75)&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="12_Case_3_Normalized_On_A_Per_Sample_Basis" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/12_Case_3_Normalized_On_A_Per_Sample_Basis.png"&gt;&lt;/p&gt;
&lt;p&gt;The KNN of course takes a hit due to the built-in error.&lt;/p&gt;
&lt;p&gt;&lt;img alt="13_Nearest_Neighbor_Per_Sample_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/13_Nearest_Neighbor_Per_Sample_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN (using &lt;strong&gt;&lt;em&gt;&amp;#963; = &lt;sup&gt;1&lt;/sup&gt;/&lt;sub&gt;4&lt;/sub&gt;&lt;/em&gt;&lt;/strong&gt;) misses twice, once for the built-in error and once for a Class 1 observation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="14_Parzen_Over_Per_Sample_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/14_Parzen_Over_Per_Sample_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;The final normalization approach centers and normalizes the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="15_Case_4_Centered_And_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/15_Case_4_Centered_And_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;The KNN handles this with aplomb.&lt;/p&gt;
&lt;p&gt;&lt;img alt="16_Nearest_Neighbor_Centered_And_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/16_Nearest_Neighbor_Centered_And_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN also correctly classifies all observations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="17_Parzen_Centered_And_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/17_Parzen_Centered_And_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;h2&gt;Reduced Coulomb Energy Networks&lt;/h2&gt;
&lt;p&gt;So far I have shown several normalization approaches and demonstrated the effectiveness of two non-parametric classification techniques, PPNN and KNN, on the normalized data.  Now I would like to describe a third non-parametric classification algorithm: the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Coulomb Energy&lt;/a&gt; (RCE) net.  &lt;/p&gt;
&lt;p&gt;In summary, RCE provide the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rapid learning of class regions that are&lt;ul&gt;
&lt;li&gt;Complex&lt;/li&gt;
&lt;li&gt;Non-linear&lt;/li&gt;
&lt;li&gt;Disjoint&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;No local minima issues&lt;/li&gt;
&lt;li&gt;Performance knobs&lt;ul&gt;
&lt;li&gt;Trade training time vs. memory requirements&lt;/li&gt;
&lt;li&gt;Trade classifier complexity to training data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you would like more details, I encourage you to read my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;detailed investigation of RCE&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Descriptive Animation&lt;/h3&gt;
&lt;p&gt;This cartoon shows the simplicity of the RCE algorithm.  For each training point, the RCE algorithm creates a circular footprint with a radius equal to the distance of the nearest training point from the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class.  To prevent overlap, you can set a maximum radius for each training point.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE Cartoon" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h3&gt;Visualization of RCE on the normalization approach&lt;/h3&gt;
&lt;p&gt;The following animation shows the classification footprints for the centered and normalized training data.  Note that dark gray represents class one, light gray represents class two and white indicates an "ambiguous region" (no class).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Center Norm Lambda" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/Center_Norm_Lambda.gif"&gt;&lt;/p&gt;
&lt;p&gt;The next animation shows the RCE classification footprints on the non-centered all samples normalized training data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="All Samples Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/All_Samples_Norm.gif"&gt;&lt;/p&gt;
&lt;p&gt;Normalizing by class increases the extent of the ambiguous regions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Norm by class" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Norm_Per_Class.png"&gt;&lt;/p&gt;
&lt;p&gt;Once more, the built-in error of the per-sample magnitude normalization approach results in a miss.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Norm by per sample" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Norm_Per_Sample.png"&gt;&lt;/p&gt;
&lt;h3&gt;Benefits of Ambiguous Regions&lt;/h3&gt;
&lt;p&gt;RCE provides the benefit of ambiguous regions.  Ambiguous regions pinpoint areas that would provide useful training samples.  The data scientist can then execute observations in those regions to fill in the gaps.&lt;/p&gt;
&lt;p&gt;&lt;img alt="18_Good_Ambiguous_Regions" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/18_Good_Ambiguous_Regions.png"&gt;&lt;/p&gt;
&lt;p&gt;The following graphic shows how additional training observations filled in the ambiguity.&lt;/p&gt;
&lt;p&gt;&lt;img alt="19_Useful_Training_Samples" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/19_Useful_Training_Samples.png"&gt;&lt;/p&gt;
&lt;h2&gt;RCE applied to the Bupa Liver disorders data set&lt;/h2&gt;
&lt;p&gt;The final section summarizes my approaches to separating the training data I input into my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;detailed investigation of RCE&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For my investigation, I looked at the &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;BUPA Liver Disorders&lt;/a&gt; data set.&lt;/p&gt;
&lt;p&gt;The data includes six features and two classes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mcv: mean corpuscular volume&lt;/li&gt;
&lt;li&gt;Four Chemical Markers&lt;ul&gt;
&lt;li&gt;alkphos: alkaline phosphotase&lt;/li&gt;
&lt;li&gt;sgpt: alamine aminotransferase&lt;/li&gt;
&lt;li&gt;sgot: aspartate aminotransferase&lt;/li&gt;
&lt;li&gt;gammagt: gamma-glutamyl transpeptidase&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;drinks: # of half-pint equivalents of alcohol per day&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I then wrangled the data set in order to increase the success rate of my classification.  I used the following method:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Normalize the data&lt;/li&gt;
&lt;li&gt;Quantify separability using&lt;ul&gt;
&lt;li&gt;Divergence&lt;/li&gt;
&lt;li&gt;Bhattacharyya distance&lt;/li&gt;
&lt;li&gt;Scatter Matrices&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
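&lt;p&gt;As an aside, when a single feature is modeled as Gaussian within each class, the Bhattacharyya distance has a closed form.  The sketch below is my own illustration with hypothetical per-class summary statistics; a larger distance flags a more separable feature.&lt;/p&gt;

```python
import numpy as np

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians.

    Larger values mean the class-conditional densities overlap less,
    i.e. the feature separates the classes better."""
    term_mean = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    term_var = 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
    return term_mean + term_var

# Hypothetical per-class statistics for one candidate feature
d = bhattacharyya_gauss(0.0, 1.0, 2.0, 1.0)
```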
&lt;p&gt;For the two feature case, separation analysis showed the best feature combination for class detection includes &lt;strong&gt;&lt;em&gt;gamma-glutamyl&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;number of drinks&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Out of the box, you can see these two are poorly separable.&lt;/p&gt;
&lt;p&gt;&lt;img alt="20_Poor_Two_Feature_Separability" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/20_Poor_Two_Feature_Separability.png"&gt;&lt;/p&gt;
&lt;p&gt;For the three feature case, the scatter method (left) added &lt;strong&gt;&lt;em&gt;alkphos&lt;/em&gt;&lt;/strong&gt; to the mix, whereas divergence and Bhattacharyya added &lt;strong&gt;&lt;em&gt;sgpt&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="21_Three_Feature_Separability" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/21_Three_Feature_Separability.png"&gt;&lt;/p&gt;
&lt;p&gt;The following diagrams show the three dimensional separation approaches based on a normalized test set.  I used the training &lt;strong&gt;&lt;em&gt;&amp;#956;&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;&amp;#963;&lt;/em&gt;&lt;/strong&gt; to normalize the test set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="22_Further_Three_D_Separability_Approaches" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/22_Further_Three_D_Separability_Approaches.png"&gt;&lt;/p&gt;
&lt;p&gt;This graphic shows the same approach, only using the test set's &lt;strong&gt;&lt;em&gt;&amp;#956;&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;&amp;#963;&lt;/em&gt;&lt;/strong&gt; to normalize the test set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="23_Three_D_Separability_On_Train_Set" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/23_Three_D_Separability_On_Train_Set.png"&gt;&lt;/p&gt;
&lt;p&gt;The following graphic shows the classification footprints using a normalized, two feature (&lt;strong&gt;&lt;em&gt;gamma-glutamyl&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;number of drinks&lt;/em&gt;&lt;/strong&gt;) train and test set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="24_Results_Of_Two_Feature_Two_Five" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/24_Results_Of_Two_Feature_Two_Five.png"&gt;&lt;/p&gt;
&lt;p&gt;For detailed results of my investigation, I encourage you to read my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;detailed investigation of RCE&lt;/a&gt; applied to the BUPA liver disorders data set.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I leave you with convenient bullet points summarizing the work we accomplished today.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Frame PNN as a simple series of steps&lt;ul&gt;
&lt;li&gt;Dot product (or distance)&lt;/li&gt;
&lt;li&gt;Non-linear transform&lt;/li&gt;
&lt;li&gt;Summation and voting&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Be cognizant of normalization approach&lt;/li&gt;
&lt;li&gt;Sometimes feature reduction yields classes with common patterns&lt;/li&gt;
&lt;li&gt;RCE rapidly learns class regions&lt;ul&gt;
&lt;li&gt;Complex&lt;/li&gt;
&lt;li&gt;Non-linear&lt;/li&gt;
&lt;li&gt;Disjoint&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RCE can ID ambiguous regions&lt;ul&gt;
&lt;li&gt;ID regions of useful training patterns&lt;/li&gt;
&lt;li&gt;Avoids forcing a classification to a known class when unknown classes may exist&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you enjoyed this blog post, please check out these related blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;Exploratory Factor Analysis (EFA) Workflow and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;EFA - The Math and Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Columb Energy (RCE) - An alternative to KNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fastai-flask.html"&gt;Vision model w/ FAST AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model w/ Google AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Google AutoML Tables Beta&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Octave"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>A Reduced Coulomb Energy (RCE) Neural Network Classifier</title><link href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html" rel="alternate"></link><published>2018-08-27T10:26:00-04:00</published><updated>2018-08-27T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2018-08-27:/reduced_coulomb_energy_neural_network_bupa.html</id><summary type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I investigate the effectiveness of a Reduced Coulomb Energy (RCE) Neural Network on the classification of the University of California, Irvine (UCI) Bupa liver disorder data set.  I investigate seven (7) different versions of the data set, four  (4) un-coded and three (3) binary coded (to a higher dimensional …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I investigate the effectiveness of a Reduced Coulomb Energy (RCE) Neural Network on the classification of the University of California, Irvine (UCI) Bupa liver disorder data set.  I investigate seven (7) different versions of the data set, four (4) un-coded and three (3) binary coded (to a higher dimensional feature space) data sets applying various feature vector dimensionality reduction strategies.  Finally, for all seven (7) datasets I apply a feedback-tuning algorithm.  In summary, I achieve a best-case error rate of 20% and ambiguity of 31%, when I apply my feedback-tuning algorithm (using a learning rate of &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25) to the un-coded data set that reduces the feature vector dimensionality to half the original size.&lt;/p&gt;
&lt;h2&gt;Outline&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Background&lt;ul&gt;
&lt;li&gt;Why is this an important problem?&lt;/li&gt;
&lt;li&gt;What work has been done before?&lt;/li&gt;
&lt;li&gt;Brief discussion of RCE&lt;/li&gt;
&lt;li&gt;Benefits of RCE&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Methods&lt;/li&gt;
&lt;li&gt;Data&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusions&lt;/li&gt;
&lt;li&gt;Bibliography&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;h3&gt;Why is this an important problem?&lt;/h3&gt;
&lt;p&gt;A medical diagnosis contains a test pattern with features such as symptoms, patient history, and laboratory tests.  The doctor uses this test pattern to diagnose, or classify the patient.  Doctors and patients can benefit if the Doctor treats the diagnosis as a classification problem, and can arrive at a classification model with low error [&lt;a href="#Bojarczuk"&gt;Bojarczuk&lt;/a&gt; 27].  “Medical data often seem to contain a great number and uncertain or irrelevant features.  How to extract enough necessary and useful diagnostic rules used to be highly depended on the clinical experience  [&lt;a href="#Kahramanli"&gt;Kahramanli&lt;/a&gt; 9].”  I investigate if a RCE NN can extract enough necessary and useful diagnostic rules from the BUPA liver disorders data set, to reduce dependence on clinical experience, and instead put the intelligence in the pattern classification model. &lt;/p&gt;
&lt;h3&gt;What work has been done before?&lt;/h3&gt;
&lt;p&gt;Several mathematicians apply algorithms to the BUPA Liver Disorders dataset.  Goncalves minimizes error to 20.31% using the Inverted Hierarchical Neuro-Fuzzy BSP System (HNFB) [&lt;a href="#Goncalves"&gt;Goncalves&lt;/a&gt; 245].  Raicharoen and Lursinsap achieve an error rate of only 18.61% using critical support vectors (CSV) [&lt;a href="#Raicharoen"&gt;Raicharoen&lt;/a&gt; 2534].  Bagirov and Ugon achieve 10.14% error using their min-max separability algorithm [&lt;a href="#Bagirov"&gt;Bagirov&lt;/a&gt; 19].  Cordella classifies through genetic programming, where prototypes of the classes describe clusters of data samples and logical expressions establish conditions on feature values.  Cordella hits an error rate of 26.2% [&lt;a href="#Cordella"&gt;Cordella&lt;/a&gt; 732].  Kahramanli&#8217;s Opt-aiNET algorithm lowers the error to 5.2% [&lt;a href="#Kahramanli"&gt;Kahramanli&lt;/a&gt; 12].  I investigate the utility of applying a RCE net method to the Bupa liver disorders dataset.&lt;/p&gt;
&lt;h3&gt;Brief discussion of RCE&lt;/h3&gt;
&lt;p&gt;Pattern classification represents distinct classes through disjoint regions formed by feature space partitioning.  Most classifiers partition non-overlapping regions and map each of these to a class.  In RCE networks, however, a class may have one or more regions, and regions can overlap.  A RCE net contains three layers, the input, output and hidden.  The input layer contains one node for each of the features, totaling the feature vector dimension.  The output layer has one node for each class.  In the hidden layer, each node represents a prototype.  Each class connects to either one or a cluster of prototypes.  A RCE net contains two modes, learning and classification.  The learning mode executes feature space partitioning, adjusts connection weights between input and hidden layer, and reduces thresholds in hidden nodes to eliminate wrong activations.  The classification mode makes class membership decisions based on the prototypes and their influence fields.  Some regions may have multiple class affiliations, and the RCE net labels these regions “ambiguous.”  [&lt;a href="#Li"&gt;Li&lt;/a&gt; 847]&lt;/p&gt;
&lt;p&gt;&lt;img alt="Learn Lambda" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/01_A_Training_Point_Learns_Lambda.png"&gt;&lt;/p&gt;
&lt;p&gt;Figure 1 above illustrates part of the learning stage of an RCE net.  Consider two classes of data, each with three training samples.  The RCE net grows a sphere around a training point until it hits a training point of a different class.  The RCE net stores this radius, &lt;strong&gt;&amp;#955;&lt;/strong&gt;, for that training point.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lambda for three training points" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/02_Lambda_For_Three_Training_Points.png"&gt;&lt;/p&gt;
&lt;p&gt;Figure 2 above shows &lt;strong&gt;&amp;#955;&lt;/strong&gt; for the three training points of class two (marked by a “+”).&lt;/p&gt;
&lt;p&gt;Figure 3 below depicts &lt;strong&gt;&amp;#955;&lt;/strong&gt; for both classes, notice how they overlap.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ambiguous Regions" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/03_Illustration_Of_Abiguous_Regions.png"&gt;&lt;/p&gt;
&lt;p&gt;Scofield defines ambiguous regions as "point sets in the state space of a system which are labeled with multiple class affiliations.  This can occur because the input space has not carried all of the features in the pattern environment, or because the pattern itself is not separable."  [&lt;a href="#Scofield"&gt;Scofield&lt;/a&gt; 5].  The RCE net reduces the overlapping region by setting a maximum &lt;strong&gt;&amp;#955;&lt;/strong&gt;, as shown in Figure 4 below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Reduce Max Lambda" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/04_Reduce_Max_Lambda_To_Reduce_Ambiguity.png"&gt;&lt;/p&gt;
&lt;p&gt;The overlapping, or ambiguous regions point to regions that provide useful training points.  In Figure 5 below, we show an RCE net with a large ambiguous region.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Areas of Interest" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/05_Ambiguous_Regions_Point_To_Areas_Of_Interest.png"&gt;&lt;/p&gt;
&lt;p&gt;Getting training samples from this ambiguous area helps to better define the nature of the feature space.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Key Training Points" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/06_Key_Training_Points_Reduce_Ambiguity.png"&gt;&lt;/p&gt;
&lt;p&gt;Once the training phase completes, the RCE net classifies the test points.  RCE nets belong to the family of exemplar neural net classifiers, which “perform classification based on the identity of the training examples, or exemplars, that are nearest to the input.  Exemplar nodes compute the weighted Euclidean distance between inputs and node centroids [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 49].”  RCE nets create hyper-spheres around training points.  The related hidden layer nodes have “high outputs only if the input is within a given radius of the node’s centroid.  Otherwise, node outputs are low.  The classification decision is the label attached to the nodes with high outputs [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 51].”  The RCE net classifies a region ambiguous in the case of no outputs or outputs from multiple classes.&lt;/p&gt;
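&lt;p&gt;Lippmann's description of the exemplar nodes can be sketched directly: a node outputs high only when the input falls inside its hyper-sphere, and the decision layer reports "ambiguous" when no class, or more than one class, fires.  The centroids, radii, and labels below are illustrative assumptions, not values from the BUPA experiment.&lt;/p&gt;

```python
import numpy as np

def rce_node_outputs(x, centroids, radii):
    """Exemplar-node outputs: high (1) only when x falls inside the
    node's hyper-sphere, otherwise low (0)."""
    x = np.asarray(x, dtype=float)
    dists = np.linalg.norm(np.asarray(centroids, dtype=float) - x, axis=1)
    return (np.asarray(radii) > dists).astype(int)

def rce_decision(x, centroids, radii, classes):
    """Label from firing nodes; 'ambiguous' when none or several classes fire."""
    outputs = rce_node_outputs(x, centroids, radii)
    fired = {c for c, on in zip(classes, outputs) if on}
    return fired.pop() if len(fired) == 1 else 'ambiguous'

centroids = [[0.0, 0.0], [2.0, 0.0]]
radii     = [1.0, 1.0]
classes   = ['one', 'two']
print(rce_decision([0.2, 0.0], centroids, radii, classes))
```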
&lt;h3&gt;Benefits of RCE&lt;/h3&gt;
&lt;p&gt;Lippmann summarizes the benefits of RCE:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This classifier is similar to a k-nearest neighbor classifier in that it adapts rapidly over time, but it typically requires many fewer exemplar nodes than a nearest neighbor classifier.  During adaptation, more nodes are recruited to generate more complex decision regions, and the size of hyper-spheres formed by existing nodes is modified.  Theoretical analyses and experiments with RCE classifiers demonstrate that they can form complex decision regions rapidly.  Experiments also demonstrated that they can be trained to solve Boolean mapping problems such as the symmetry and multiplexer problems more than an order of magnitude faster than back-propagation classifiers.  Finally, classifiers such as the RCE classifier require less memory than k-nearest-neighbor classifiers but adapt classifier structure over time using simple adaptation rules that recruit new nodes to match the complexity of the classifier to that of the training data [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 51].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Li writes that RCE nets perform rapid learning of class regions that are complex, non-linear and disjoint.  Li also writes, “RCE net has the advantage of fast learning, unlimited memory capacity, and no local minima problem” [&lt;a href="#Li"&gt;Li&lt;/a&gt; 846].&lt;/p&gt;
&lt;h2&gt;Methods&lt;/h2&gt;
&lt;p&gt;I solve the problem by creating a family of MatLab/Octave functions from the ground up, identifying the key features, and then running the reduced data set through my algorithms (see Appendix).  I then create a feedback approach and identify the ground rules that yield the lowest error.  If interested, you can download my Octave code from GitHub &lt;a href="https://github.com/hatdropper1977/bupa-rce-octave"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Inappropriate normalization presents the first roadblock in my investigation.  Normalizing between zero and one causes otherwise distinct training points from different classes to have the same magnitude.  For most classification algorithms, this creates “built-in error.”  For RCE nets, this results in “built-in ambiguity.”  I depict this in Figure 7 below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Normalized Zoom In" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/07_Normalized_Zoom_In.png"&gt;&lt;/p&gt;
&lt;p&gt;In addition, normalizing on a per-pattern basis yields the greatest error and ambiguity.  Normalizing over a class yields the next greatest.  Normalizing over all samples, to the range [-1, 1], yields the lowest error.  &lt;/p&gt;
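&lt;p&gt;A small sketch of the two extremes, assuming simple min/max scaling to [-1, 1] (the function names are mine, not from the Octave code):&lt;/p&gt;

```python
import numpy as np

def scale(a, lo, hi):
    """Linearly map values spanning [lo, hi] onto [-1, 1]."""
    return 2.0 * (a - lo) / (hi - lo) - 1.0

def per_pattern(X):
    """Scale each row by its own min/max.  Distinct patterns with the
    same shape collapse onto one vector, creating built-in ambiguity."""
    return scale(X, X.min(axis=1, keepdims=True), X.max(axis=1, keepdims=True))

def over_all(X):
    """Scale by the global min/max, preserving relative magnitudes."""
    return scale(X, X.min(), X.max())
```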
&lt;p&gt;In terms of feature reduction, I use several methods, including Fisher’s discriminant ratio, divergence, Bhattacharyya distance and scatter matrices to select feature subsets. &lt;/p&gt;
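&lt;p&gt;Of these, Fisher&amp;#8217;s discriminant ratio is the simplest to sketch.  Assuming the common per-feature form (m1-m2)^2 / (v1+v2), with hypothetical function names:&lt;/p&gt;

```python
import numpy as np

def fisher_ratio(class_a, class_b):
    """Per-feature Fisher discriminant ratio (m1 - m2)^2 / (v1 + v2).

    class_a, class_b: (n_samples, n_features) arrays, one per class.
    Larger values flag features that separate the two classes better.
    """
    m1, m2 = class_a.mean(axis=0), class_b.mean(axis=0)
    v1, v2 = class_a.var(axis=0), class_b.var(axis=0)
    return (m1 - m2) ** 2 / (v1 + v2)

def top_features(class_a, class_b, k):
    """Indices of the k highest-ratio features, best first."""
    return list(np.argsort(fisher_ratio(class_a, class_b))[::-1][:k])
```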
&lt;p&gt;I also run a series of analyses on binary coded Bupa data, increasing the feature dimension.  For example, I take the “mcv” feature and map it into five dimensions.  For mcv, I create five categories, for the ranges [0,64), [65,85), [85,90), [90,95), [95,200].  The value mcv=77, for example, maps to the five dimensional vector [0 1 0 0 0].  The value 92 becomes [0 0 0 1 0].  This creates four new input nodes for mcv, and four of the five are always zero for any given value of mcv.  I perform similar binary mapping for all the features in the BUPA data set, increasing the feature dimension from six to thirty-three (See Appendix)  [&lt;a href="#Kahramanli"&gt;Kahramanli&lt;/a&gt; 9].&lt;/p&gt;
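&lt;p&gt;The coding step amounts to one-hot binning.  A sketch using the mcv ranges quoted above (the edge list and names are mine):&lt;/p&gt;

```python
from bisect import bisect_right

# Interior edges for the five mcv ranges quoted in the text:
# [0,64), [65,85), [85,90), [90,95), [95,200]
MCV_EDGES = [65, 85, 90, 95]

def code_mcv(value):
    """One-hot encode an mcv reading into its five range categories."""
    vector = [0, 0, 0, 0, 0]
    vector[bisect_right(MCV_EDGES, value)] = 1
    return vector
```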
&lt;p&gt;Li presents two training approaches.  The first approach reduces the threshold &lt;strong&gt;&amp;#955;&lt;/strong&gt; of a hidden node “such that the wrong activations of this node is eliminated.  This process occurs when an input pattern activates some hidden nodes which are not committed to the same class as the input pattern [&lt;a href="#Li"&gt;Li&lt;/a&gt; 848].”  My feedback approach uses his second approach, which tunes the weights between the input and hidden nodes.  The RCE net commits each hidden node to an output node of one class.  If the net cannot correctly classify a known exemplar, “change weights between input nodes &amp;amp; hidden nodes until you activate this node [&lt;a href="#Li"&gt;Li&lt;/a&gt; 848].”  We must take care in changing the input to hidden weights of an exemplar classifier, since it “brings forth the potential of a training procedure whose error criterion is non-convergent [&lt;a href="#Hudak"&gt;Hudak&lt;/a&gt; 853].”  The nature of an exemplar classifier is such that changing the weights to one hidden node in order to activate it may throw off the balance of the system and cause other patterns to become incorrectly classified.&lt;/p&gt;
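&lt;p&gt;A hypothetical sketch of one feedback step, assuming a simple move-toward-the-exemplar update (the actual Octave tuning rule may differ):&lt;/p&gt;

```python
def tune_centroid(centroid, exemplar, eta):
    """One feedback step: nudge a hidden node's centroid toward a
    misclassified exemplar, w_new = w + eta * (x - w).

    This update rule is an assumed, simplified stand-in for Li's
    input-to-hidden weight tuning.  A small eta limits how far the node
    moves, so correctly classified patterns are less likely to fall
    outside its hypersphere (the non-convergence risk Hudak warns about).
    """
    return [w + eta * (x - w) for w, x in zip(centroid, exemplar)]
```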
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;I download the BUPA liver disorders database from the University of California, Irvine (UCI) &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;machine learning repository&lt;/a&gt;.  The BUPA data has six features and two classes (one for alcohol related liver disorder, and one for alcohol unrelated liver disorder).  BUPA features include “mean corpuscular volume (mcv),” half-pint equivalents of alcohol per day (drinks) and four chemical markers:  (1) alkaline phosphotase (alkphos) (2) alamine aminotransferase (sgpt) (3) aspartate aminotransferase (sgot) and (4) gamma-glutamyl transpeptidase (gammagt).  Using Fisher discriminant, Scatter Matrices, Divergence and Bhattacharyya distance methods, I pare down the feature space to candidate feature subsets.  Lippmann writes, "features should contain information required to distinguish between classes, be insensitive to irrelevant variability in the input, and also be limited in number to permit efficient computation of discriminant functions and to limit the amount of training data required" [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 47].&lt;/p&gt;
&lt;p&gt;In summary, all methods picked parameters [2 5] for the two feature case, [2 5 6] for the three feature case, [2 3 5 6] for four features and [2 3 4 5 6] for five features.  For the coded data, the methods selected [19 20 21 23 29] for five features, [4 9 10 14 16 18 25 28 29 30] for ten and [1 2 3 4 7 9 10 16 19 21 25 28 29 31] for fifteen.  This shows that certain features correlate more with a class when they are coded to a certain range.  The following figures (Figure 8 &amp;amp; Figure 9) show two (2) and three (3) feature plots:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Poorly Separable" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/08_Poorly_Separable_Two_Dimensional_Feature_Vector.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two Approaches" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/09_3d_Plots_Of_Twp_Approaches.png"&gt;&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;In general, coding the data does not give us any gain, either in the “feedback-tuned” case or the “non-feedback-tuned” RCE net case.  All un-coded strategies yield lower error and ambiguity than the coded strategies, with one exception (when feedback tuned, the fifteen (15) feature coded data set yields lower error and ambiguity than the five (5) feature un-coded data set).  In all cases (coded and un-coded), however, paring the feature space down yields less error than using all features, with one exception.  For feedback tuning, the coded “all features” feature vector performs better than the coded ten (10) feature vector.&lt;/p&gt;
&lt;p&gt;Using no feedback, the results follow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4 Features, Un-coded: 22% error, 40% ambiguity&lt;/li&gt;
&lt;li&gt;5 Features, Un-coded: 18% error, 49% ambiguity&lt;/li&gt;
&lt;li&gt;3 Features, Un-coded: 22% error, 45% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Un-coded: 19% error, 49% ambiguity&lt;/li&gt;
&lt;li&gt;15 Features, Coded: 38% error, 40% ambiguity&lt;/li&gt;
&lt;li&gt;10 Features, Coded: 31% error, 48% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Coded: 44% error, 42% ambiguity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using “feedback-tuning,” the three (3) feature un-coded data set yields the lowest error and ambiguity.  For coded data, fifteen parameters yield the lowest error and ambiguity.  Due to the delicate nature of tuning in an exemplar neural net, feedback tuning efficiency relies heavily on the learning rate.  In my analysis I iterate different learning rate values (&lt;strong&gt;&amp;#951;&lt;/strong&gt;) ranging from 0.05 to 1.0, with a step of 0.05.  For each case (coded/un-coded and different feature vector dimensions), I iterate 200 times and select the lowest error/ ambiguity.  In the list below, I show &lt;strong&gt;&amp;#951;&lt;/strong&gt; that yields the lowest error/ ambiguity.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3 Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25: 20% error, 31% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.10: 18% error, 35% ambiguity&lt;/li&gt;
&lt;li&gt;4 Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.15: 17% error, 40% ambiguity&lt;/li&gt;
&lt;li&gt;15 Features, Coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25: 33% error, 25% ambiguity&lt;/li&gt;
&lt;li&gt;5 Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.50: 17% error, 42% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.30: 36% error, 24% ambiguity&lt;/li&gt;
&lt;li&gt;10 Features, Coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.70: 23% error, 39% ambiguity&lt;/li&gt;
&lt;/ul&gt;
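&lt;p&gt;The learning-rate sweep described above can be sketched as follows, where &lt;strong&gt;train_rce&lt;/strong&gt; and &lt;strong&gt;evaluate&lt;/strong&gt; are hypothetical stand-ins for the Octave routines in the appendix:&lt;/p&gt;

```python
def sweep_eta(train_rce, evaluate, steps=200):
    """Grid search eta over 0.05, 0.10, ..., 1.00.

    Trains one net per candidate eta for `steps` learning iterations,
    then min() picks the lowest error, breaking ties on ambiguity.
    Returns the winning (error, ambiguity, eta) triple.
    """
    candidates = []
    for i in range(1, 21):
        eta = round(i * 0.05, 2)
        net = train_rce(eta=eta, iterations=steps)
        error, ambiguity = evaluate(net)
        candidates.append((error, ambiguity, eta))
    return min(candidates)
```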
&lt;p&gt;Not surprisingly, feedback tuning yields lower error/ambiguity than the non-feedback-tuned case, with one exception.  The non-feedback-tuned four feature un-coded method actually yields lower error/ ambiguity than the feedback-tuned ten (10) feature coded method.  The coded ten (10) feature method produces the worst results, which indicates poor feature selection, since the coded five (5) feature method performs better in all cases.&lt;/p&gt;
&lt;p&gt;Now let’s look at the usefulness of my feedback strategy.  The table below shows the gain (or reduction in error/ambiguity).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Reduction in Error&lt;/th&gt;
&lt;th&gt;Reduction in Ambiguity&lt;/th&gt;
&lt;th&gt;Error (Gain)&lt;/th&gt;
&lt;th&gt;Ambiguity (Gain)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coded All Feats&lt;/td&gt;
&lt;td&gt;18.10%&lt;/td&gt;
&lt;td&gt;42.90%&lt;/td&gt;
&lt;td&gt;-0.87 dB&lt;/td&gt;
&lt;td&gt;-2.43 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coded 15 Feats&lt;/td&gt;
&lt;td&gt;13.20%&lt;/td&gt;
&lt;td&gt;37.50%&lt;/td&gt;
&lt;td&gt;-0.61 dB&lt;/td&gt;
&lt;td&gt;-2.04 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coded 10 Feats&lt;/td&gt;
&lt;td&gt;25.80%&lt;/td&gt;
&lt;td&gt;18.80%&lt;/td&gt;
&lt;td&gt;-1.3 dB&lt;/td&gt;
&lt;td&gt;-0.90 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded 3 Feats&lt;/td&gt;
&lt;td&gt;9.10%&lt;/td&gt;
&lt;td&gt;31.10%&lt;/td&gt;
&lt;td&gt;-0.41 dB&lt;/td&gt;
&lt;td&gt;-1.62 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded All Feats&lt;/td&gt;
&lt;td&gt;5.30%&lt;/td&gt;
&lt;td&gt;28.60%&lt;/td&gt;
&lt;td&gt;-0.23 dB&lt;/td&gt;
&lt;td&gt;-1.46 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded 4 Feats&lt;/td&gt;
&lt;td&gt;22.80%&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;-1.12 dB&lt;/td&gt;
&lt;td&gt;0.00 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded 5 Feats&lt;/td&gt;
&lt;td&gt;5.60%&lt;/td&gt;
&lt;td&gt;14.30%&lt;/td&gt;
&lt;td&gt;-0.25 dB&lt;/td&gt;
&lt;td&gt;-0.67 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In all cases, my tuning algorithm helps reduce the error and/or ambiguity.  My tuning algorithm produces the most gain for the inferior performing “coded” data set.&lt;/p&gt;
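&lt;p&gt;The dB columns in the table appear to follow the convention gain = 10*log10(1 - reduction); a quick check against the first rows:&lt;/p&gt;

```python
import math

def gain_db(reduction):
    """Convert a fractional reduction into the table's dB gain.

    Assumes the convention gain = 10*log10(1 - reduction): an 18.1%
    error reduction leaves 81.9% of the original error, and
    10*log10(0.819) is about -0.87 dB, matching the first table row.
    """
    return 10.0 * math.log10(1.0 - reduction)
```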
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Lippmann and Hudak criticize RCE nets.  Lippmann writes that RCE nets “may require large amounts of memory and computation time [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 49].”  Hudak writes, “Viewing RCE as an incremental nearest-neighbor classifier with hyper-spheres leads to the conclusion that the hyper-spheres are not positively contributing to the performance of the classifier.  At best their presence is ineffectual, but their management during training entails a computational cost that is not justified by these findings [&lt;a href="#Hudak"&gt;Hudak&lt;/a&gt; 852].”&lt;/p&gt;
&lt;p&gt;I experience a “computational cost” during management of the RCE net training.  Iterating through 20 candidate values of &lt;strong&gt;&amp;#951;&lt;/strong&gt;, and then iterating 200 learning steps for each, takes tens of minutes.  Once I identify the proper weight tuning for the data, however, classification occurs in real time.  RCE also offers a benefit through its explicit ambiguity output.  Even Hudak writes, “hyper-spherical classifiers can recognize patterns from an unknown class as not belonging to any class known to the classifier.  If true, this would be an advantage over the nearest-neighbor classifier [&lt;a href="#Hudak"&gt;Hudak&lt;/a&gt; 853].”&lt;/p&gt;
&lt;p&gt;In conclusion, binary coding does not help reduce error/ambiguity.  Reducing the feature set on the un-coded data reduces error/ambiguity.  My feedback-tuning algorithm, while computationally expensive, reduces error/ambiguity in all cases, and it yields the greatest gain on the poorer-performing coded data set.  The best case reduces the un-coded feature vector to half its dimension, using the features “alkaline phosphotase,” “gamma-glutamyl transpeptidase” and “number of half-pint equivalents of alcoholic beverages drunk per day,” with feedback tuning at a learning rate of &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25.  This yields an error of 20% and an ambiguity of 31%.&lt;/p&gt;
&lt;p&gt;If you enjoyed this blog post, please check out these related blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;Exploratory Factor Analysis (EFA) Workflow and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;EFA - The Math and Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Probabalistic Parzen Neural Networks (PNN) with cartoons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fastai-flask.html"&gt;Vision model w/ FAST AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model w/ Google AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Google AutoML Tables Beta&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;p&gt;&lt;a name="Bagirov"&gt;&lt;/a&gt;Bagirov, A.M., Ugon, J. “Supervised Data Classification via Max-min Separability.”  Mathematical Modeling of Bio-systems Springer Berlin Heidelberg, 2008: 1-23&lt;/p&gt;
&lt;p&gt;&lt;a name="Bojarczuk"&gt;&lt;/a&gt;Bojarczuk, C.C., Lopes, H.S., Freitas, A.A. and Michalkiewicz, E. L.  “A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets.”  Artificial Intelligence in Medicine 2004:27-48&lt;/p&gt;
&lt;p&gt;&lt;a name="Cooper"&gt;&lt;/a&gt;Cooper, Leon N.  “A Neural Model for Category Learning.”  Center for Neural Science and Department of Physics, Brown University Providence, R.I. 1982&lt;/p&gt;
&lt;p&gt;&lt;a name="Cordella"&gt;&lt;/a&gt;Cordella, L.P., De Stefano, C., Fontanella, F.  “A Novel Genetic Programming Based Approach for Classification Problems.”  Lecture Notes in Computer Science, Image Analysis and Processing ICIAP 2005: 727-734&lt;/p&gt;
&lt;p&gt;&lt;a name="Goncalves"&gt;&lt;/a&gt;Goncalves, L.B., Vellasco, M.M., Cavalcanti, M.A., Pacheco M.A.  “Inverted Hierarchical Neuro-Fuzzy BSP System:A Novel Neuro-Fuzzy Model for Pattern Classification and Rule Extraction in Databases.”  IEEE Transactions On Systems, Man And Cybernetics Part C: Applications And Reviews 2006: 236-248&lt;/p&gt;
&lt;p&gt;&lt;a name="Hudak"&gt;&lt;/a&gt;Hudak M.J.  “RCE Networks:  An Experimental Investigation.”  Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on Jul. 1991: 849-854&lt;/p&gt;
&lt;p&gt;&lt;a name="Kahramanli"&gt;&lt;/a&gt;Kahramanli, Humar, Allahverdi, Novruz  “Mining Classification Rules for Liver Disorders.”  International Journal of Mathematics and computers in simulation Issue 1, Volume 3: 2009&lt;/p&gt;
&lt;p&gt;&lt;a name="Li"&gt;&lt;/a&gt;Li, Wei  “Invariant Object Recognition Based on a Neural Network of Cascaded RCE nets.”  Neural Networks, 1990., 1990 IJCNN International Joint Conference on Jun 1990:17-21&lt;/p&gt;
&lt;p&gt;&lt;a name="Lippmann"&gt;&lt;/a&gt;Lippmann, Richard P.  “Pattern Classification Using Neural Networks.”  IEEE Communications Magazine Nov. 1989&lt;/p&gt;
&lt;p&gt;&lt;a name="Raicharoen"&gt;&lt;/a&gt;Raicharoen, T., Lursinsap, C.  “Critical Support Vector Machine Without Kernel Function.”  Neural information Processing, 9th International Conference on  (ICONIPOZ) 2002: 2532-2536&lt;/p&gt;
&lt;p&gt;&lt;a name="Roan"&gt;&lt;/a&gt;Roan, Sing-Ming  “Fuzzy RCE Neural Network.”  Fuzzy Systems, 1993., Second IEEE International Conference on 1993:629-634&lt;/p&gt;
&lt;p&gt;&lt;a name="Scofield"&gt;&lt;/a&gt;Scofield, Christopher L.  “Pattern class degeneracy in an unrestricted storage density memory.”  Nestor, Inc.  Providence, RI.  1988&lt;/p&gt;</content><category term="Data Science"></category><category term="Octave"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Big Data Idol: The Math</title><link href="https://john.soban.ski/big-data-idol-the-math.html" rel="alternate"></link><published>2017-11-18T01:17:00-05:00</published><updated>2017-11-18T01:17:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-11-18:/big-data-idol-the-math.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Caution! Math Ahead&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;For the Math-phobic, I explain how I crunch the test results in a math-free, simple and focused blog post &lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I use math here, so this may be your last chance to escape! Still with me? Excellent!&lt;/p&gt;
&lt;p&gt;The bullets below outline the steps we take.  A flow …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Caution! Math Ahead&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;For the Math-phobic, I explain how I crunch the test results in a math-free, simple and focused blog post &lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I use math here, so this may be your last chance to escape! Still with me? Excellent!&lt;/p&gt;
&lt;p&gt;The bullets below outline the steps we take.  A flow diagram follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transform Elasticsearch Database to Comma Separated Variables (CSV)&lt;/li&gt;
&lt;li&gt;Load into &lt;a href="https://www.r-project.org/about.html"&gt;R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Normalize data&lt;/li&gt;
&lt;li&gt;Exploratory Factor Analysis (EFA)&lt;ul&gt;
&lt;li&gt;Dimensionality Reduction (Iterate on &lt;em&gt;n&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Plot Data on Graphs&lt;/li&gt;
&lt;li&gt;Identify Factor Names&lt;/li&gt;
&lt;li&gt;Isolate Factor Weight Matrix&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Dot Product Answer Matrix with Weight Matrix&lt;/li&gt;
&lt;li&gt;Fit Factor Sums matrix Theoretical Models&lt;/li&gt;
&lt;li&gt;Plot Data&lt;/li&gt;
&lt;li&gt;Guess Distribution&lt;/li&gt;
&lt;li&gt;Fit, Graph, QQ-Plot&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Workflow" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/workflow.jpg"&gt;
 
&lt;strong&gt;Extract and Transform&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;I chose a NoSQL &lt;a href="https://www.elastic.co/"&gt;Elasticsearch&lt;/a&gt; document store (database) to hold all of the test results, metadata and identity information. In addition to private search services (e.g. auto-completion, 'did you mean?', etc.), Elasticsearch provides (massive) scalability and integration with a robust web based GUI named Kibana. Kibana provides trend plots, pie graphs, keyword searches and a host of other features.&lt;/p&gt;
&lt;p&gt;I need to extract data from the NoSQL document store and translate it into a structured format for R. I use the excellent Elasticsearch &lt;a href="https://elasticsearch-dsl.readthedocs.org/en/latest/"&gt;Domain Specific Language&lt;/a&gt; (DSL) python library to do just that. &lt;/p&gt;
&lt;p&gt;As I mentioned in &lt;a href="https://john.soban.ski/part-1-connect-ec2-to-the-amazon-elasticsearch-service.html"&gt;HOWTO-1&lt;/a&gt;, I must first serialize the data to JSON, in order to use the Amazon IAM service with the Amazon Elasticsearch service. When I roll up my sleeves and dive in, I notice the (trivial) Elasticsearch DSL "&lt;a href="https://elasticsearch-dsl.readthedocs.org/en/latest/search_dsl.html#pagination"&gt;scan&lt;/a&gt;" method requires a &lt;a href="http://elasticsearch-py.readthedocs.org/en/master/"&gt;low-level Elasticsearch client&lt;/a&gt; connection object to operate.&lt;/p&gt;
&lt;p&gt;In order to use REST/JSON calls, therefore, I need to scan "by hand." The official Elasticsearch &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#scroll-search-results"&gt;documents&lt;/a&gt; point us to the low level &lt;a href="http://elasticsearch-py.readthedocs.org/en/master/helpers.html"&gt;elasticsearch-py libraries&lt;/a&gt; but since elasticsearch-dsl extends these, they do not help with our problem of needing to serialize to JSON and pass via an extended AWSAuthConnection object. &lt;/p&gt;
&lt;p&gt;To scroll by hand, I first request the Elasticsearch API to set the search type to "scan" with a scroll duration of ten minutes. Elasticsearch responds with a scroll ID. I use the scroll ID to request the first batch of documents, and Elasticsearch responds with the documents and the current scroll_id (it may update). I then iterate until the process finishes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;make_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;GET&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/_search/scroll&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_scroll_id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;scroll&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;10m&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;#...process the documents&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;HOT TIP&lt;/strong&gt;: Dump an Entire Document Store&lt;/p&gt;
&lt;p&gt;If you connect to your ES AWS service via an IAM role
(AWSAuthConnection) then&lt;br&gt;
(1) Make a scan request to turn off sorting&lt;br&gt;
(2) Set an appropriate scroll duration (10 minutes)&lt;br&gt;
(3) Iterate through all of the documents with a scroll request&lt;br&gt;
  (a) On each iteration, pass the current scroll_id&lt;/p&gt;
&lt;p&gt;(If you connect to your AWS service via IP whitelisting then use your search object's scan method, e.g. s.scan())&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Elasticsearch returns all of the documents in a schemaless JSON format.&lt;/p&gt;
&lt;p&gt;For example:  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_index&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pilot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_type&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_id&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AVHfVLNTHootPMn5yhhf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_version&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;found&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_source&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q47&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q2&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q6&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q33&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q25&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We need to transform the schema-less responses into something structured. With Python, we can translate the dictionary &lt;strong&gt;{"q47":"y","q2":"y","q6":"y","q33":"y","q25":"y"}&lt;/strong&gt; into a table format for R to read.&lt;/p&gt;
&lt;p&gt;Note the arbitrary placement of the questions in the JSON response. We need simple logic to cycle through all fifty questions (q0, q1 ... q49) and test if they reside in the response. We could use a case statement with fifty individual tests, but instead I decided to use arrays. I create a simple list, "q", with string names for each question. I then create an array, "scorecard", with fifty zeros. If the loop discovers a match for a question, "q25" for example, it places a one (1) in that list position of scorecard. At the end I receive a table of results, one row per document, and a positional on/off for each "yes" in that document.&lt;/p&gt;
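&lt;p&gt;That scorecard logic amounts to the following simplified sketch (the gist holds the full dump script; the function name is mine):&lt;/p&gt;

```python
def to_row(doc_id, source, num_questions=50):
    """Flatten one Elasticsearch _source dict into a positional CSV row.

    source looks like {"q47": "y", "q2": "y", ...}; any question key
    absent from the dict counts as a "no" and stays 0.
    """
    scorecard = [0] * num_questions
    for i in range(num_questions):
        if 'q%d' % i in source:
            scorecard[i] = 1
    return ','.join([doc_id] + [str(v) for v in scorecard])
```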
&lt;p&gt;The dump script follows:  &lt;/p&gt;
&lt;p&gt;
&lt;script src="https://gist.github.com/hatdropper1977/5e646a58d74f34fb24bb.js"&gt;&lt;/script&gt;
&lt;/p&gt;

&lt;p&gt;The script transforms...  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_index&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pilot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_type&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_id&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AVHfVLNTHootPMn5yhhf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_version&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;found&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_source&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q47&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q2&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q6&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q33&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q25&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;...into one row.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;AVHfVLNTHootPMn5yhhf,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Factor Analysis&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.statpower.net/"&gt;James H. Steiger&lt;/a&gt; writes an excellent white paper on using &lt;a href="https://www.google.com/search?q=Exploratory+Factor+Analysis+with+R+James+H.+Steiger"&gt;R for EFA&lt;/a&gt; titled "Exploratory Factor Analysis With R." I tried out several of the R libriaries to include &lt;strong&gt;&lt;em&gt;principal&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;factor.pa&lt;/em&gt;&lt;/strong&gt; in the &lt;a href="https://cran.r-project.org/web/packages/psych/index.html"&gt;psych&lt;/a&gt; and &lt;strong&gt;&lt;em&gt;princomp&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;factanal&lt;/em&gt;&lt;/strong&gt; in &lt;a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html"&gt;stats&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The princomp library excited me at first because all of the vectors made perfect sense (in terms of grouping the questions by "yes" answers). When I applied the weight matrix to the initial answer matrix, however, I saw that 80% of the test takers fell into the first component. This concerned me until I realized that the first component held most of the variance, and that raw PCA would not be the appropriate tool to separate test takers into roughly equal groupings.&lt;/p&gt;
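&lt;p&gt;As a minimal sketch of that check (using a simulated stand-in for the real answer matrix, since the survey responses are not reproduced here), princomp reports how much of the total variance each component captures:&lt;/p&gt;

```r
# Toy stand-in for the real 0/1 answer matrix "bd"
# (1,000 test takers x 50 yes/no questions).
# This is simulated data, NOT the actual survey responses.
set.seed(1)
bd <- as.data.frame(matrix(rbinom(1000 * 50, 1, 0.3), ncol = 50))

# Raw PCA: the squared sdev of each component gives its share of the variance.
pc <- princomp(bd)
var_share <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var_share[1:3], 3))
```

&lt;p&gt;With the real data, the first component dominates this share, which is why raw PCA lumped most test takers together.&lt;/p&gt;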
&lt;p&gt;The hilariously named factanal library (huh huh, Beavis) accommodated my use case. I used the common factor analysis approach (vs. component analysis), which fits via Maximum Likelihood, and rotated with varimax.&lt;/p&gt;
&lt;p&gt;I first loaded the data and normalized the data set to the range [-1, 1].  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;bdpt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read.csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;data.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get rid of the ID&amp;#39;s&lt;/span&gt;
&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bdpt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;51&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Uncomment to change zeros to negative ones&lt;/span&gt;
&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-1&lt;/span&gt;

&lt;span class="c1"&gt;# Uncomment to sample&lt;/span&gt;
&lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I fit with Maximum Likelihood, select three (3) factors, and rotate with varimax.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;factanal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;varimax&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can look at the loadings for each factor.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Loadings&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;Factor1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Factor2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Factor3&lt;/span&gt;
&lt;span class="n"&gt;tire&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.114&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.214&lt;/span&gt;
&lt;span class="n"&gt;rich&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.231&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.222&lt;/span&gt;
&lt;span class="n"&gt;dangerous&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.257&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.165&lt;/span&gt;
&lt;span class="n"&gt;fame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.172&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.169&lt;/span&gt;
&lt;span class="n"&gt;hair&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.200&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.168&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.199&lt;/span&gt;
&lt;span class="n"&gt;bet&lt;/span&gt;
&lt;span class="n"&gt;taxes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.143&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.286&lt;/span&gt;
&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.141&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.303&lt;/span&gt;
&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.167&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.105&lt;/span&gt;
&lt;span class="n"&gt;losers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.139&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.290&lt;/span&gt;
&lt;span class="n"&gt;pretty_women&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.796&lt;/span&gt;
&lt;span class="n"&gt;pretty_men&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.676&lt;/span&gt;
&lt;span class="n"&gt;vhs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.184&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.186&lt;/span&gt;
&lt;span class="n"&gt;punch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.362&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.164&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.109&lt;/span&gt;
&lt;span class="n"&gt;jerkface&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.380&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.151&lt;/span&gt;
&lt;span class="n"&gt;facebook&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.128&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.113&lt;/span&gt;
&lt;span class="n"&gt;champion&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.439&lt;/span&gt;
&lt;span class="n"&gt;God&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-0.300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.237&lt;/span&gt;
&lt;span class="n"&gt;speed&lt;/span&gt;
&lt;span class="n"&gt;advice&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.291&lt;/span&gt;
&lt;span class="n"&gt;mouse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.229&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.169&lt;/span&gt;
&lt;span class="n"&gt;drunk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.406&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.165&lt;/span&gt;
&lt;span class="n"&gt;kleeb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.267&lt;/span&gt;
&lt;span class="n"&gt;gravitate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.224&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.259&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.118&lt;/span&gt;
&lt;span class="n"&gt;no_plan&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.309&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.222&lt;/span&gt;
&lt;span class="n"&gt;drugs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.445&lt;/span&gt;
&lt;span class="n"&gt;fan&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.136&lt;/span&gt;
&lt;span class="n"&gt;work_hard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.304&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.124&lt;/span&gt;
&lt;span class="n"&gt;potential&lt;/span&gt;
&lt;span class="n"&gt;intelligence&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.210&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.218&lt;/span&gt;
&lt;span class="n"&gt;wait_food&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.298&lt;/span&gt;
&lt;span class="n"&gt;bossy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.127&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.105&lt;/span&gt;
&lt;span class="n"&gt;world_against&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.264&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.156&lt;/span&gt;
&lt;span class="n"&gt;suffer_evil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.409&lt;/span&gt;
&lt;span class="n"&gt;trust_cops&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-0.245&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.276&lt;/span&gt;
&lt;span class="n"&gt;learning&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.408&lt;/span&gt;
&lt;span class="n"&gt;cult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.383&lt;/span&gt;
&lt;span class="n"&gt;naked&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.328&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.135&lt;/span&gt;
&lt;span class="n"&gt;door&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.159&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.167&lt;/span&gt;
&lt;span class="n"&gt;grass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.158&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.184&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.131&lt;/span&gt;
&lt;span class="n"&gt;paycheck&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.242&lt;/span&gt;
&lt;span class="n"&gt;fashion&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.155&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.266&lt;/span&gt;
&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.204&lt;/span&gt;
&lt;span class="n"&gt;love&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.229&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.110&lt;/span&gt;
&lt;span class="n"&gt;dogs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.362&lt;/span&gt;
&lt;span class="n"&gt;baby_corner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.178&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.253&lt;/span&gt;
&lt;span class="n"&gt;listen&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.157&lt;/span&gt;
&lt;span class="n"&gt;transit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.124&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.333&lt;/span&gt;
&lt;span class="n"&gt;motorcycle&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.154&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.103&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.134&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The chart shows that "being drunk in public" and "not believing in God" correlates strongly with factor 1, "championing others" and "suffering evil vs. being evil" correlates with factor 2, and "men/women should be pretty" highly correlates with factor 3. I named these factors "hellraiser," "boy scout" and "celebrity."&lt;/p&gt;
&lt;p&gt;The code that follows shows the relative weights of each "question" for each factor. I provide 2d and 3d graphs. For more detail, click &lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# #################&lt;/span&gt;
&lt;span class="c1"&gt;# # Graph Factors #&lt;/span&gt;
&lt;span class="c1"&gt;# # in 2 and 3d #&lt;/span&gt;
&lt;span class="c1"&gt;# #################&lt;/span&gt;
&lt;span class="nf"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;which.max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;darkgreen&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;s3d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scatterplot3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;pch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;h&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lty.hplot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;angle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Hellraiser&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Boy Scout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;zlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Celebrity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3d.xyz&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s3d&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;xyz.convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3d.xyz&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s3d.xyz&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;row.names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;topleft&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Factor Assignment&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Hellraiser&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Boy Scout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Celebrity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="s"&gt;&amp;quot;darkgreen&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Boy Scout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser vs. Boy Scout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser vs. Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Boy Scout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Boy Scout vs. Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Crunch the Numbers&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;For each test taker, I tally up their total factor weights based on how they answer each question. To process the data, I perform a simple dot product of the "User Answer Matrix" and the "User Factor Weight" matrix, which yields a "User Factor Sums Matrix." I then normalize the "User Factor Sums Matrix" and pull out zero values in order to try certain theoretical fits (such as Gamma).  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;########################&lt;/span&gt;
&lt;span class="c1"&gt;# The number crunching #&lt;/span&gt;
&lt;span class="c1"&gt;########################&lt;/span&gt;

&lt;span class="c1"&gt;# Convert loadings to a weight matrix&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Dot product of answers with the weight matrix&lt;/span&gt;
&lt;span class="c1"&gt;# to get factor sums for each test taker&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Normalize the test taker&amp;#39;s factor sums&lt;/span&gt;
&lt;span class="c1"&gt;# between 0 and 100&lt;/span&gt;
&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Get rid of the Zero values&lt;/span&gt;
&lt;span class="c1"&gt;# so we can fit to a Gamma&lt;/span&gt;
&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each user has a weight sum for each factor. I take these data points and try to fit them to a theoretical model, starting with the Gamma distribution. The following lines fit the data and then pull out just the shape and rate parameters for each of the three fits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Figure out rate and scale for fit for each factor&lt;/span&gt;
&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fitdistr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gamma&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Pull just shape and rate&lt;/span&gt;
&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The next example shows the code for the "Hellraiser" factor. I plot a &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Kernel Density Estimation (KDE)&lt;/a&gt; of the empirical data, then overlay points pulled from a Gamma density function with the "shape" and "rate" parameters found above. You will find the complete code for all three factors at the end of this blog post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# &amp;quot;Hellraiser&amp;quot; pdf&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;density&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser Weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Kernel Density Estimation (Solid) vs. Gamma Fit (Dots)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;topright&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;shape = &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rate = &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span 
class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;par&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;T&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dgamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can see a good fit for "Hellraiser" and "Celebrity," but a poor fit for "Boy Scout."&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hellraiser Gamma" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/hellraiser_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Celebrity Gamma" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/celebrity_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boy Scout Gamma" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/boy_scout_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Hellraiser&lt;/span&gt;
&lt;span class="n"&gt;x.fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rgamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;x.empirical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;qqplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x.empirical&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x.fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Sample vs. Theoretical for Hellraiser Fit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="QQ-Plot Hellraiser" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_hellraiser-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="QQ-Plot Celebrity" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_celebrity-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="QQ-Plot Boyscout" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_boyscout_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;Again, the QQ plots show good fits for all the factors except "Boy Scout." With this guidance, I re-fit the "Boy Scout" data to a normal theoretical model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find mean and sd&lt;/span&gt;
&lt;span class="n"&gt;std_boyscout_fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fitdistr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;normal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The calculation produces a much better fit, as shown by the new "Boy Scout" overlay plot and QQ-Plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Normal Boyscout" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/normal_boyscout-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="QQ-Plot Boyscout Normal" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_boyscout_normal-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;We now know the factor weight sums for each user, and we use the fitted density functions to find out where each user stands compared to the other users. To compute a percentile, we send each user's weight sum vector through the appropriate theoretical model, with the appropriate density function parameters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert the normalized weights to percentile&lt;/span&gt;
&lt;span class="c1"&gt;# based on our fit model distributions&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;pgamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Replace the gamma fit for Boy Scout&lt;/span&gt;
&lt;span class="c1"&gt;# with the Standard&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;pnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std_boyscout_fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std_boyscout_fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;``&lt;/span&gt;

&lt;span class="n"&gt;We&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;on/off&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;logic.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;If&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lies&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;greater&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;than&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fiftieth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;we&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;that&lt;/span&gt;&lt;span 
class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;factor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;on.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;logic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gives&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;us&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;eight&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;types.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;We&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;performing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;conversion&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;decimal.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="n"&gt;dot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;between&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;three&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dimensional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;factor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;performs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;conversion.&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;


&lt;span class="n"&gt;``&lt;/span&gt;`&lt;span class="n"&gt;R&lt;/span&gt;
&lt;span class="c1"&gt;# For all test takers, set all values below mean to zero (per factor)&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Set all values above mean to one (per factor)&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# Map each test taker to one of seven classes based on their on/off values for each factor&lt;/span&gt;
&lt;span class="n"&gt;classifications&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
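&lt;p&gt;The dot product with [1,2,4] is just a binary-to-decimal conversion. Here is a minimal Python sketch of the same logic (the factor order follows the column order used above; the helper name is my own, for illustration):&lt;/p&gt;

```python
# On/off bits per factor, in column order: [Hellraiser, Boy Scout, Celebrity]
PLACE_VALUES = [1, 2, 4]  # binary place values

def classify(bits):
    # Dot product with [1, 2, 4] maps a 3-bit vector to a class from 0 to 7
    return sum(bit * value for bit, value in zip(bits, PLACE_VALUES))

print(classify([1, 0, 1]))  # Hellraiser + Celebrity on -> 1 + 4 = 5
print(classify([1, 1, 1]))  # all three factors on -> 7
```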

&lt;p&gt;When you take the test (after I batch process the results), you will receive your classification and percentiles for each of the three factors.&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="HOWTO"></category><category term="Data Science"></category></entry><entry><title>Big Data Idol: How I Crunched the Numbers</title><link href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html" rel="alternate"></link><published>2017-10-14T12:20:00-04:00</published><updated>2017-10-14T12:20:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-10-14:/big-data-idol-how-i-crunched-the-numbers.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Do you have big data chops?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Quick, what do these three things have in common?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yankees, Giants, Rangers, Knicks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What about these?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Beatles, Monkees, Beach Boys&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do you have an answer for each? "New York," for example, for the first list and "Rock and Roll" for the second? If so …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Do you have big data chops?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Quick, what do these three things have in common?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yankees, Giants, Rangers, Knicks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What about these?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Beatles, Monkees, Beach Boys&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do you have an answer for each? "New York," for example, for the first list and "Rock and Roll" for the second? If so, call yourself a "data scientist!"&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Computer Groups Users&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Big Data Personality Test follows a similar process. I feed 1 billion test results into a computer program that "learns" the data. The program looks for patterns and finds that, for example, people who answer "yes" to "question 1" also answer "yes" to questions 8, 9, and 31 and "no" to question 15. The program then separates test takers based on their answer patterns.&lt;/p&gt;
&lt;p&gt;Now answer the following question. You may find the question a little tricky so I drew a picture to help you "cheat." Think of a row on a "&lt;em&gt;Tic Tac Toe&lt;/em&gt;" board. You have three spaces and can place either an "X" or "O" on each space. How many unique rows (patterns) can you make from this?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tic Tac Toe" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/tictac.png"&gt;&lt;/p&gt;
&lt;p&gt;As you can see from the picture, each row in a game of &lt;em&gt;Tic Tac Toe&lt;/em&gt; can only have one of eight possible patterns. Three spaces in a row, with an "X" or "O" on each space gives us &lt;em&gt;eight unique patterns&lt;/em&gt;. You will see this phenomenon in the next section, where I reduce each test taker to a "row" with three "factors" (i.e.&lt;em&gt;Tic Tac Toe&lt;/em&gt; spaces) that can be "on" or "off" (i.e. an "X" or "O").&lt;/p&gt;
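&lt;p&gt;The counting above checks out arithmetically: three spaces with two symbols each gives 2 x 2 x 2 = 8 patterns. A quick Python sketch (illustrative only, not part of the original analysis) enumerates them:&lt;/p&gt;

```python
from itertools import product

# Each of the three spaces holds an "X" or an "O"
patterns = ["".join(row) for row in product("XO", repeat=3)]

print(patterns)       # ['XXX', 'XXO', 'XOX', ..., 'OOO']
print(len(patterns))  # 2 * 2 * 2 = 8
```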
&lt;p&gt;I feed the 1 billion test responses into another computer program that reduces the "answer patterns" into three types (or "factors"). With this program, I reduced the possible number of unique test "answer patterns" from 1,130,000,000,000,000 to eight.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A Fun Game&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Are you still with me or are you bored? Here is a fun game. The "big data" program only identifies the patterns, I must name them.&lt;/p&gt;
&lt;p&gt;What would you name this spectrum?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Veg" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/veg.png"&gt;&lt;/p&gt;
&lt;p&gt;How about "The Vegetarian Spectrum"&lt;/p&gt;
&lt;p&gt;What about this one?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lib" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/lib.png"&gt;&lt;/p&gt;
&lt;p&gt;I'd name it "The Libertarian Spectrum"&lt;/p&gt;
&lt;p&gt;And finally, we have this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="NY" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/ny.png"&gt;&lt;/p&gt;
&lt;p&gt;I'd name it "New Yorker Spectrum"&lt;/p&gt;
&lt;p&gt;We call these three spectra &lt;strong&gt;independent&lt;/strong&gt;. If you're from New York, you might be a libertarian or a vegetarian, but one does not cause the other. (In reality, you might find correlation, but bear with me here. If you want specifics, then read my blog post about &lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;the Math&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;We can combine two of these spectra.  Let's combine, for example, the "Vegetarian" and "Libertarian" spectra. Then we have four types of people: "Libertarian Vegetarians," "Libertarian Meat Eaters," "Big Government Vegetarians" and "Big Government Meat Eaters."  Take a look at the chart below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/2d.png"&gt;&lt;/p&gt;
&lt;p&gt;We can still have a pretty picture in three dimensions if we overlay the third spectrum on the graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/3d.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Name the Groups&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Enough of the example spectra; let's look at the real data.  I obtained the data from actual test takers on Facebook.  When I crunch the numbers on the test takers, machine learning outputs three groups, or &lt;strong&gt;factors&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor #1:&lt;/strong&gt; &lt;strong&gt;Drug&lt;/strong&gt; and &lt;strong&gt;alcohol&lt;/strong&gt; users who have been &lt;strong&gt;naked in public&lt;/strong&gt;, &lt;strong&gt;distrust cops&lt;/strong&gt; and &lt;strong&gt;do not believe&lt;/strong&gt; strongly in &lt;strong&gt;God&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor #2:&lt;/strong&gt; Leaders that &lt;strong&gt;champion others&lt;/strong&gt;, would &lt;strong&gt;martyr&lt;/strong&gt; themselves for an idea and believe they can &lt;strong&gt;learn from all setbacks&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor #3:&lt;/strong&gt; Strongly believe &lt;strong&gt;men and women should be attractive&lt;/strong&gt;, judge people on their &lt;strong&gt;fashion&lt;/strong&gt; and think in &lt;strong&gt;concrete, all or nothing terms&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;What would you name these factors?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plot the Data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I named #1 "Hellraisers", #2 "Boy Scouts" and #3 "Celebrities."&lt;/p&gt;
&lt;p&gt;I graphed the factors against each other.&lt;/p&gt;
&lt;p&gt;Take a look at Hellraiser vs. Boy Scout. If this graph confuses you, take a look at some of New York magazine's &lt;a href="https://nymag.com/nymag/culture/approvalmatrix/archive/"&gt;Approval Matrices&lt;/a&gt;. You will quickly get an idea of how the matrices work.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HR vs BS" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/hr_v_bs-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;Look at the "left/right" axis above. You can see, all the way to the right, that positive attitudes towards &lt;strong&gt;drinking&lt;/strong&gt; and &lt;strong&gt;drugs&lt;/strong&gt; define a test taker as a "Hellraiser," and negative attitudes (all the way to the left) about &lt;strong&gt;trusting cops&lt;/strong&gt; and &lt;strong&gt;belief in God&lt;/strong&gt; also define them as a "Hellraiser." On the "up/down" (Boy Scout) axis, &lt;strong&gt;championing others&lt;/strong&gt; and &lt;strong&gt;learning from failures&lt;/strong&gt; define the Boy Scout. Notice how "&lt;strong&gt;Champion&lt;/strong&gt;," high up on the "up/down" axis, falls right near the middle of the "left/right" axis. So attitudes about "&lt;strong&gt;championing others&lt;/strong&gt;" weigh into the "Boy Scout" judgement but do little for the "Hellraiser" judgement. Likewise, "&lt;strong&gt;drugs&lt;/strong&gt;" greatly defines the "Hellraiser" but does not weigh either way on the "Boy Scout" factor (look at how close it lies to the horizontal dotted line).&lt;/p&gt;
&lt;p&gt;On the "Hellraiser" vs. "Celebrity" scale, most of the "Hellraiser" qualities do not provide much weight to the "Celebrity" judgement. You can see that most of the Hellraiser factors lie close to the zero point (i.e. the dotted horizontal line) for the "Celebrity" spectrum. &lt;strong&gt;Pretty Men&lt;/strong&gt;, and &lt;strong&gt;Pretty Women&lt;/strong&gt;, however, hugely weigh in on the celebrity scale.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HR vs C" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/hr_v_c-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;We witness the same distribution on "Boy Scout" vs. "Celebrity." Most "Boy Scout" traits do not weigh heavily on the "Celebrity" decision, and vice versa. You can see this by noticing how closely the words align with the x and y axes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="BS vs C" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/bs_v_c-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;If you're interested, you can see the three factors plotted on a three-dimensional graph. The graph below shows the actual placements, based on the real-life test takers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d-1" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/3d-1-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Get the Results&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you take the test, the machine learning looks at your answers and matches them to the &lt;strong&gt;patterns&lt;/strong&gt; for each of the &lt;strong&gt;factors&lt;/strong&gt;. It places you on the &lt;strong&gt;spectrum&lt;/strong&gt; for each factor and, based on that placement, classifies you into one of the eight categories. Each factor &lt;strong&gt;switches on&lt;/strong&gt; or off depending on how you compare to the rest of the test takers. So, as more people take the test, the factors, weights, placements and classifications change.&lt;/p&gt;
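&lt;p&gt;To make the "switch on" idea concrete, here is a toy R sketch. The answers and weights below are invented for illustration (the production pipeline differs); the point is that a factor flips on only relative to the other test takers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Toy data: 100 test takers, 3 questions, answers on a 1-5 scale
answers &amp;lt;- matrix(sample(1:5, 300, replace=TRUE), nrow=100)

# Invented weights for one factor, e.g. "Hellraiser"
weights &amp;lt;- c(0.8, 0.7, -0.6)

# Project every test taker onto the factor's spectrum
scores &amp;lt;- answers %*% weights

# The factor switches "on" relative to the rest of the test takers
hellraiser &amp;lt;- scores &amp;gt; median(scores)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run the same comparison for all three factors and each test taker lands in one of the eight categories.&lt;/p&gt;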
&lt;p&gt;If you made it this far, you now have a solid understanding of &lt;a href="https://en.wikipedia.org/wiki/Exploratory_factor_analysis"&gt;Exploratory Factor Analysis (EFA)&lt;/a&gt;.  If you want more detail, then check out the &lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;math&lt;/a&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="Data Science"></category></entry><entry><title>Let us now praise ugly code!</title><link href="https://john.soban.ski/let-us-now-praise-ugly-code.html" rel="alternate"></link><published>2017-07-15T01:18:00-04:00</published><updated>2017-07-15T01:18:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-07-15:/let-us-now-praise-ugly-code.html</id><summary type="html">&lt;p&gt;In this blog post I will revisit the first piece of code I wrote with the &lt;a href="https://www.r-project.org/about.html"&gt;R Programming language&lt;/a&gt;, back in the early part of this decade.&lt;/p&gt;
&lt;p&gt;Coming from an &lt;a href="https://octave.org/"&gt;Octave&lt;/a&gt;/&lt;a href="https://www.mathworks.com/products/matlab.html"&gt;MATLAB&lt;/a&gt; background, I really enjoyed the &lt;a href="https://en.wikipedia.org/wiki/Functional_programming"&gt;functional&lt;/a&gt; nature of R. I imagined flinging vectors into Matrices, collapsing them with …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this blog post I will revisit the first piece of code I wrote with the &lt;a href="https://www.r-project.org/about.html"&gt;R Programming language&lt;/a&gt;, back in the early part of this decade.&lt;/p&gt;
&lt;p&gt;Coming from an &lt;a href="https://octave.org/"&gt;Octave&lt;/a&gt;/&lt;a href="https://www.mathworks.com/products/matlab.html"&gt;MATLAB&lt;/a&gt; background, I really enjoyed the &lt;a href="https://en.wikipedia.org/wiki/Functional_programming"&gt;functional&lt;/a&gt; nature of R. I imagined flinging vectors into matrices and collapsing them with dot products, Tetris-like. I refused to write a single for loop... I framed everything as functions and maps. As I gained experience with R, I found &lt;a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"&gt;pipes&lt;/a&gt; and &lt;a href="https://cran.r-project.org/web/packages/tidyr/index.html"&gt;data wrangling&lt;/a&gt; libraries, but early on, my code was pretty ugly, as you will see shortly.&lt;/p&gt;
&lt;p&gt;I have a project that keeps track of comic books, their publishers, their prices and their customers. The model stores data in Excel and, to keep things readable, I use a columnar store. In this way, I can quickly add new entries to the table by adding columns.  Each column has an arbitrary number of rows. I know this might not be the best way to store data, but bear with me here. This blog looks at how we process that data, not how we store it. Besides, in the real world, you sometimes have no choice but to start with ugly data.&lt;/p&gt;
&lt;h3&gt;The Ugly Way...&lt;/h3&gt;
&lt;p&gt;Let us proceed. First, take a look at Titles:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Batman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Superman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Captain_Marvel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Youngblood&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spawn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spiderman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Iron_Man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Cable&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Doctor_Strange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;

&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;                           &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that a rotation doesn't really buy us anything. Instead of an arbitrary number of rows for each entry, a rotation gets us an arbitrary number of columns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;
&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Batman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Superman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Captain_Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;
&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Youngblood&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spawn&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;
&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spiderman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Iron_Man&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Cable&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Doctor_Strange&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When I process &lt;strong&gt;&lt;em&gt;Titles.orig&lt;/em&gt;&lt;/strong&gt; in R, I first transform it to a key-value store. My approach relies on data frame index logic (commands inside the [] brackets).&lt;/p&gt;
&lt;p&gt;In my original approach, I create two vectors: one that repeats the column names several times, and another that un-packs (unlists) the data. When I put them together, I get key-value pairs (with some empties).&lt;/p&gt;
&lt;p&gt;My first vector repeats each column name &lt;em&gt;n&lt;/em&gt; times, with &lt;em&gt;n&lt;/em&gt; being the number of rows. Since the data frame has four rows, I repeat each column name four times. I first try the &lt;strong&gt;&lt;em&gt;rep()&lt;/em&gt;&lt;/strong&gt; function.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This attempt fails. I want it in the form: 'DC, DC, DC, DC, Image, Image,' etc.&lt;/p&gt;
&lt;p&gt;After a few Google searches, I find that &lt;strong&gt;&lt;em&gt;matrix()&lt;/em&gt;&lt;/strong&gt; allows us to stack rows, so I stuff the repeat statement into &lt;strong&gt;&lt;em&gt;matrix()&lt;/em&gt;&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Close, but not quite what I need. I then add the &lt;em&gt;byrow&lt;/em&gt; flag:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From here, we convert to a vector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, &lt;strong&gt;&lt;em&gt;as.vector()&lt;/em&gt;&lt;/strong&gt; works "down the columns" by default (which makes sense, since columns are vectors).&lt;/p&gt;
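&lt;p&gt;(As an aside, &lt;strong&gt;&lt;em&gt;rep()&lt;/em&gt;&lt;/strong&gt; accepts an &lt;em&gt;each&lt;/em&gt; argument that produces this ordering directly, with no matrix detour:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; rep(names(Titles), each=nrow(Titles))

 [1] "DC"     "DC"     "DC"     "DC"     "Image"  "Image"  "Image"  "Image"  "Marvel" "Marvel"
[11] "Marvel" "Marvel"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The ugly way, however, presses on.)&lt;/p&gt;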
&lt;p&gt;Let's move past the titles. To create a vector from our data, we need to &lt;strong&gt;&lt;em&gt;unlist()&lt;/em&gt;&lt;/strong&gt; the data first and then vectorize it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Batman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Superman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Captain_Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Youngblood&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spawn&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spiderman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Iron_Man&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Cable&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Doctor_Strange&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I bind these two vectors together as columns and then create a data frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                                             &lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                            &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;

&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;X1&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then name the columns:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;publisher&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;title&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then I remove the empty rows. A lot of my early code follows this convention: I subset a data frame with index logic, using a comma to separate the row selector from the column selector. In the line below, I keep only the rows with a non-empty title and return all columns. This syntax can be confusing, since I reference the data frame &lt;strong&gt;&lt;em&gt;Titles&lt;/em&gt;&lt;/strong&gt; in three separate places.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;which&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
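&lt;p&gt;As an aside, base R also provides the &lt;strong&gt;&lt;em&gt;subset&lt;/em&gt;&lt;/strong&gt; function, which evaluates the condition inside the data frame and saves us from typing &lt;strong&gt;&lt;em&gt;Titles&lt;/em&gt;&lt;/strong&gt; three times. A quick sketch of the equivalent call (not part of my original session):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; # keep only rows where title is non-empty; all columns survive
&amp;gt; Titles &amp;lt;- subset(Titles, title != "")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;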

&lt;h3&gt;The Pretty Way...&lt;/h3&gt;
&lt;p&gt;Let's recap. We waded through nested hell to transform the columnar table into a key-value table, and then needed two more commands to name the data frame columns and remove the empty rows.&lt;/p&gt;
&lt;p&gt;With pipes (&lt;strong&gt;&lt;em&gt;dplyr&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;magrittr&lt;/em&gt;&lt;/strong&gt;) and &lt;strong&gt;&lt;em&gt;tidyr&lt;/em&gt;&lt;/strong&gt;, we can produce the &lt;strong&gt;&lt;em&gt;same result&lt;/em&gt;&lt;/strong&gt; with a &lt;strong&gt;&lt;em&gt;single piped expression&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;dplyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;magrittr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tidyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;nzchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To pipe a variable through a chain and then assign the result back to that same variable, we use the &lt;strong&gt;&lt;em&gt;%&amp;lt;&gt;%&lt;/em&gt;&lt;/strong&gt; compound assignment pipe from &lt;strong&gt;&lt;em&gt;magrittr&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;nzchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
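&lt;p&gt;Under the hood, the compound pipe just combines assignment with the regular pipe. The statement above behaves the same as this sketch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; # equivalent long form of Titles %&amp;lt;&gt;% ...
&amp;gt; Titles &amp;lt;- Titles %&amp;gt;%
    gather(publisher, title) %&amp;gt;%
    filter(nzchar(title))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;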

&lt;h3&gt;More Pretty Code&lt;/h3&gt;
&lt;p&gt;Now we have a separate table of customers. This table follows a more traditional wide layout, with one column per customer, and we can add columns and rows as we see fit.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Batman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Superman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Captain_Marvel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Youngblood&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spawn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spiderman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Iron_Man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Cable&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Doctor_Strange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;

&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
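&lt;p&gt;For example, if a new customer signs up (I use the hypothetical name &lt;strong&gt;&lt;em&gt;Nesmith&lt;/em&gt;&lt;/strong&gt; here for illustration), I can bolt on a column with a single assignment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; # one quantity per title (nine rows), in row order
&amp;gt; Customers$Nesmith &amp;lt;- c(1,0,2,0,0,1,0,0,3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;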

&lt;p&gt;Let's try the gather function on this table to see what we get. We want each row to contain the comic title, the customer name, and the quantity they want to purchase.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;suppressWarnings&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, this is not what we want; with no column specification, &lt;strong&gt;&lt;em&gt;gather&lt;/em&gt;&lt;/strong&gt; collapses every column, including &lt;strong&gt;&lt;em&gt;title&lt;/em&gt;&lt;/strong&gt;. For correct results, we specify a start and end column:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This code has a drawback: I need to refactor it each time I add a new customer.&lt;/p&gt;
&lt;p&gt;To future-proof the code, we modify it as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
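&lt;p&gt;Side note: tidyr 1.0 soft-deprecated &lt;strong&gt;&lt;em&gt;gather()&lt;/em&gt;&lt;/strong&gt; in favor of &lt;strong&gt;&lt;em&gt;pivot_longer()&lt;/em&gt;&lt;/strong&gt;. If you run a recent tidyr, a sketch of the equivalent call looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Customers %&amp;gt;%
    # Pivot every column except title into customer/qty pairs
    pivot_longer(-title, names_to = &amp;#39;customer&amp;#39;, values_to = &amp;#39;qty&amp;#39;) %&amp;gt;%
    head(12)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;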

&lt;p&gt;In a separate table I have prices for each title.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Batman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Superman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Captain_Marvel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Youngblood&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spawn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spiderman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Iron_Man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Cable&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Doctor_Strange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;

&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
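&lt;p&gt;If you find the parallel &lt;strong&gt;&lt;em&gt;c()&lt;/em&gt;&lt;/strong&gt; vectors hard to eyeball, &lt;strong&gt;&lt;em&gt;tribble()&lt;/em&gt;&lt;/strong&gt; from the tibble package builds the same table row by row (note that it returns a tibble rather than a base data frame, which the rest of the pipeline handles fine):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Price &amp;lt;- tribble(
    ~title,            ~price,
    &amp;#39;Batman&amp;#39;,          1.95,
    &amp;#39;Superman&amp;#39;,        1.95,
    &amp;#39;Captain_Marvel&amp;#39;,  2.95,
    &amp;#39;Youngblood&amp;#39;,      2.95,
    &amp;#39;Spawn&amp;#39;,           1.75,
    &amp;#39;Spiderman&amp;#39;,       1.75,
    &amp;#39;Iron_Man&amp;#39;,        3.95,
    &amp;#39;Cable&amp;#39;,           3.95,
    &amp;#39;Doctor_Strange&amp;#39;,  1.95
  )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;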

&lt;p&gt;We can easily add a price column to Customers with the &lt;strong&gt;&lt;em&gt;merge()&lt;/em&gt;&lt;/strong&gt; function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
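&lt;p&gt;Notice that &lt;strong&gt;&lt;em&gt;merge()&lt;/em&gt;&lt;/strong&gt; sorts its result by the join column, which is why the titles now appear alphabetically. If you want the same join while preserving the original row order, dplyr's &lt;strong&gt;&lt;em&gt;left_join()&lt;/em&gt;&lt;/strong&gt; does the trick:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Customers %&amp;gt;%
    # Same join on title, but rows keep their original order
    left_join(Price, by = &amp;#39;title&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;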

&lt;h3&gt;&lt;strong&gt;Pretty Showdown:  Hard vs. Easy&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;How do we find per-customer totals? I'll show a hard way and an easy way. Let's look at the &lt;strong&gt;pipe/dplyr/tidyr&lt;/strong&gt; method first.&lt;/p&gt;
&lt;p&gt;First, we narrow the table and merge it with Price:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, we add a fifth column that calculates the subtotal:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3.90&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3.90&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;9.75&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;7.90&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, we sum the subtotal for each customer. We can achieve this with ease using the &lt;strong&gt;&lt;em&gt;group_by()&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;summarize()&lt;/em&gt;&lt;/strong&gt; functions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# A tibble: 4 x 2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;42.45&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;13.75&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;30.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;17.50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
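&lt;p&gt;By default the summary column inherits the unwieldy name &lt;strong&gt;&lt;em&gt;sum(subtotal)&lt;/em&gt;&lt;/strong&gt;; naming it inside &lt;strong&gt;&lt;em&gt;summarize()&lt;/em&gt;&lt;/strong&gt; keeps downstream code cleaner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Customers %&amp;gt;%
    gather(customer, qty, 2:ncol(Customers)) %&amp;gt;%
    merge(Price) %&amp;gt;%
    mutate(subtotal = qty * price) %&amp;gt;%
    group_by(customer) %&amp;gt;%
    # Name the result so later code can refer to it as total
    summarize(total = sum(subtotal))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;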

&lt;p&gt;POP quiz... did we just execute the hard or easy method to find the totals? I will show you the easy way next and you can decide for yourself. In short, we can solve this problem with simple linear algebra.&lt;/p&gt;
&lt;p&gt;We first create our vector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then our matrix:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We do a simple dot product and we're done:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;13.75&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30.95&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;17.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;42.45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We could also do it in one line:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;13.75&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30.95&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;17.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;42.45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
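&lt;p&gt;To see the arithmetic that the &lt;code&gt;%*%&lt;/code&gt; operator performs, here is a minimal sketch in plain Python. Note that the prices and quantities below are made up for illustration; they are not the &lt;code&gt;Price&lt;/code&gt; and &lt;code&gt;Customers&lt;/code&gt; tables from this post.&lt;/p&gt;

```python
# Minimal sketch of the vector-matrix dot product shown above in R,
# written in plain Python.  NOTE: these prices and quantities are
# hypothetical -- they are not the Price/Customers tables from the post.
prices = [2.50, 4.00, 1.25]  # price of each menu item

# One row per item, one column per customer (Micky, Mike, Peter, Davy):
# quantities[i][j] = how many of item i customer j ordered.
quantities = [
    [1, 2, 0, 3],
    [2, 0, 1, 1],
    [0, 4, 2, 1],
]

customers = ["Micky", "Mike", "Peter", "Davy"]

# The dot product x %*% A: total_j = sum over i of price_i * quantity_ij.
totals = {
    name: sum(p * row[j] for p, row in zip(prices, quantities))
    for j, name in enumerate(customers)
}
print(totals)  # {'Micky': 10.5, 'Mike': 10.0, 'Peter': 6.5, 'Davy': 12.75}
```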

&lt;p&gt;My Octave/MATLAB experience led me to use linear algebra right out of the gate. Sometimes, even in the face of fancy new functions, it turns out I produce beautiful code on the first try.&lt;/p&gt;</content><category term="Data Science"></category><category term="R Programming"></category><category term="Data Science"></category></entry><entry><title>Why A "Big Data" Personality Test?</title><link href="https://john.soban.ski/why-a-big-data-personality-test.html" rel="alternate"></link><published>2017-02-18T02:10:00-05:00</published><updated>2017-02-18T02:10:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-02-18:/why-a-big-data-personality-test.html</id><summary type="html">&lt;p&gt;Why do we need yet another personality test?  Because, without "big data" technologies, online "personality tests" suffer these problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With most tests, we quickly see a pattern to the answers, and can easily steer the test to the outcome we want&lt;ul&gt;
&lt;li&gt;Do we really learn anything when we "cheat" this …&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Why do we need yet another personality test?  Because, without "big data" technologies, online "personality tests" suffer these problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With most tests, we quickly see a pattern to the answers, and can easily steer the test to the outcome we want&lt;ul&gt;
&lt;li&gt;Do we really learn anything when we "cheat" this way?  (No)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Test writers like to invent their own cryptic definitions for commonly understood words&lt;ul&gt;
&lt;li&gt;See various definitions for Extrovert, Intuition, Conscientious, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The math to calculate the results must be kept simple (weighted sums), to reduce server overhead&lt;/li&gt;
&lt;li&gt;To keep in line with the simple math, a lot of these tests force an unnatural, robotic answer format&lt;ul&gt;
&lt;li&gt;Do I strongly or highly agree with the statement they just presented to me?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Freshlex, LLC presents a "big data" personality test, for charity.  Our test differs from traditional tests:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ours is fun!  A combination of serious, thought-provoking, and silly questions&lt;/li&gt;
&lt;li&gt;Random, arbitrary questions&lt;ul&gt;
&lt;li&gt;We won't see a pattern until enough people take the test and we crunch the numbers&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Simple "yes" or "no" answers... no "somewhat/ slightly/ strongly agree" garbage&lt;/li&gt;
&lt;li&gt;Since we do the "heavy lifting" on the back end, we can keep it fun and simple for the user&lt;/li&gt;
&lt;li&gt;Ours goes to charity&lt;ul&gt;
&lt;li&gt;We will be open about the expenses involved in this voyage.  If by some miracle we figure out how to make money off this, we'll &lt;a href="http://rmhcbaltimore.org/"&gt;donate it to charity&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thank you for reading this!&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="Data Science"></category></entry><entry><title>Facebook Ads Nets Me 7,000+ Clicks in 48 Hours</title><link href="https://john.soban.ski/facebook-ads-nets-me-7000-clicks-in-48-hours.html" rel="alternate"></link><published>2016-01-28T02:04:00-05:00</published><updated>2016-01-28T02:04:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2016-01-28:/facebook-ads-nets-me-7000-clicks-in-48-hours.html</id><summary type="html">&lt;p&gt;&lt;img alt="Kibana" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/Visualize-Kibana-4-1024x517.png"&gt;&lt;/p&gt;
&lt;p&gt;For my (post) masters project on machine learning and &lt;a href="https://john.soban.ski/part-1-connect-ec2-to-the-amazon-elasticsearch-service.html"&gt;big data infrastructure&lt;/a&gt; I thought it would be fun to acquire my own data set.  Last semester I traded available services and architected a scalable (big data) Internet facing survey infrastructure using a combination of &lt;a href="https://aws.amazon.com/"&gt;Amazon Web Services&lt;/a&gt; and &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt;.  I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Kibana" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/Visualize-Kibana-4-1024x517.png"&gt;&lt;/p&gt;
&lt;p&gt;For my (post) master's project on machine learning and &lt;a href="https://john.soban.ski/part-1-connect-ec2-to-the-amazon-elasticsearch-service.html"&gt;big data infrastructure&lt;/a&gt;, I thought it would be fun to acquire my own data set.  Last semester I traded available services and architected a scalable (big data) Internet-facing survey infrastructure using a combination of &lt;a href="https://aws.amazon.com/"&gt;Amazon Web Services&lt;/a&gt; and &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt;.  I had my friends send me (sometimes silly) survey questions and put them on my site.  After an initial flurry of "friends of friends" taking the survey, my traffic died down to zero.&lt;/p&gt;
&lt;p&gt;I then decided to pull the trigger on web advertising and started by Google searching "how do I advertise on the Internet."&lt;/p&gt;
&lt;p&gt;I signed up for four of the most attractive options.  I won't bore you with the also-rans (although I should mention one Advertising service turned down my cold hard cash because my site "does not meet [their] partner ad networks' strict policies regarding site content, structure and navigability") since, as far as I can tell, Facebook's advertising service delivers the platonic ideal of perfection.  I will try really, really hard not to use cliches such as "knocks my socks off," "blows me away" or "combines to form Predaking and destroy the Autobots" when I communicate how well this service works.&lt;/p&gt;
&lt;p&gt;I'm sure if I (devote my nights and weekends and) master the art of Search Engine Optimization, A/B testing, and AdWords I would find fault with Facebook's service, but that is my point.  Facebook gave me, someone who has never advertised on the Web before, a Fisher-Price interface to quickly deploy a killer campaign with no thought or skill.  One of the other services I tried, in contrast, required me to install a "pixel" on my site, which is simple enough for me (once I figured out how to make it play nice with my &lt;a href="https://getbootstrap.com/"&gt;Bootstrapped&lt;/a&gt; &lt;a href="https://palletsprojects.com/p/jinja/"&gt;Jinja2&lt;/a&gt; template), but I could never see Al (from Al's auto-parts) pull that one off.  With Facebook's service, I could see a caveman mashing his keyboard with a club deploy an effective campaign.  I think that is why I like Facebook's service so much.  Any "Mom and Pop" could figure it out, and Facebook is smart enough to keep things moving to their sale.  Click, click, click, and close.  Genius.  The competitors put up huge roadblocks that required me to Stack-Exchange some questions, whereas Facebook takes the attitude of "Don't know what this means?  Don't worry about it, you can tune that knob later, let's move on..."&lt;/p&gt;
&lt;p&gt;To summarize the experience, you simply click "Create Ad," pick some demographics (they had one for "people who take personality tests"), create a "Call to Action" (a slogan to get people to click), and then upload a JPEG.  I didn't have any graphic art, so I Googled "Public Domain Cartoon Brain," uploaded that, and I was done.  The results?  86k+ impressions, 7k clicks to my site, and 4.75k of those taking the survey!  All for just $427.  I will let this run for a few more days, just to see what I can get.&lt;/p&gt;
&lt;p&gt;A quick note.  Of the 4.75k test takers, I had two skeptics who think I'm a front for some political entity or some other nefarious institution.  I assure you that I am putting together this data set for my (post) master's college project.  Also, &lt;a href="https://john.soban.ski/why-a-big-data-personality-test.html"&gt;as I said before&lt;/a&gt;, if by some miracle I make any money off this, I will donate it to charity!  For the time being, with the infrastructure and advertising, I'm a couple grand in the hole and foresee a couple grand more.  I am paying for this out of my own pocket.  Where do I get the money, you ask?  Easy... by selling my childhood on eBay!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Selling Childhood Transformers" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/Screenshot_2016-01-25-15-30-50-640x1024.jpg"&gt;
 &lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;UPDATE:  I posted this blog entry at 2:04 AM yesterday.  Since then, the market opened and Facebook stock went up 15.5%.  I take full credit for this jump:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Stock Bump" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/fb-1024x956.png"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="Data Science"></category></entry></feed>