<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>John Sobanski - Data Science</title><link href="https://john.soban.ski/" rel="alternate"></link><link href="https://john.soban.ski/feeds/data-science.atom.xml" rel="self"></link><id>https://john.soban.ski/</id><updated>2026-04-25T01:23:00-04:00</updated><subtitle>Artificial Intelligence in the Cloud</subtitle><entry><title>Engagement Farming with AI Bots</title><link href="https://john.soban.ski/gemma-nye.html" rel="alternate"></link><published>2026-04-25T01:23:00-04:00</published><updated>2026-04-25T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2026-04-25:/gemma-nye.html</id><summary type="html">&lt;p&gt;I hear a lot of developers on social media claim that they created fully automated, end-to-end AI bot pipelines that create, build, test, and deploy apps from scratch.  This idea appeals to me, so I gave it a shot.&lt;/p&gt;
&lt;p&gt;I first tried and failed to create a working app pipeline …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I hear a lot of developers on social media claim that they created fully automated, end-to-end AI bot pipelines that create, build, test, and deploy apps from scratch.  This idea appeals to me, so I gave it a shot.&lt;/p&gt;
&lt;p&gt;I first tried and failed to create a working app pipeline (see future blog posts), so I decided to create a Minimum Viable Product (MVP) in the form of a bot that writes and publishes one blog post, once a day, every day.&lt;/p&gt;
&lt;p&gt;Goals include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writing a blog that people will want to read&lt;/li&gt;
&lt;li&gt;Publishing the blog to a production-grade website&lt;/li&gt;
&lt;li&gt;Following security best practices&lt;/li&gt;
&lt;li&gt;Full automation (no manual intervention)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Screen Grab of Hours dot Love" src="https://john.soban.ski/images/Gemma_Nye/00_Hours_Love.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Approach&lt;/h2&gt;
&lt;p&gt;I use Gemini CLI on my Windows laptop for GenAI.  The LLM writes the blog in Markdown and pushes it to my public repo on GitHub.  GitHub Actions (GA) then uses a static site generator to build the site from the Markdown (adding the home page hyperlinks, menus, and style) and pushes the HTML and JavaScript to an Amazon S3-hosted website with &lt;a href="https://john.soban.ski/how-to-configure-s3-websites-to-use-https-part-1.html"&gt;content caching, HTTPS and naked domain redirects&lt;/a&gt;.  Gemini CLI uses a mounted SSH private key to push to GitHub, and GA uses an AWS Identity and Access Management (IAM) Role to push to AWS.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Architecture" src="https://john.soban.ski/images/Gemma_Nye/01_Top_Arch.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I chose Gemini CLI for GenAI, but you can easily replace it with ChatGPT, Claude, or Cursor.  On the hosting end, you can replace S3 with Google Cloud, Azure, Oracle Cloud, or any other web hosting platform.&lt;/p&gt;
&lt;h2&gt;Create an AWS Role&lt;/h2&gt;
&lt;p&gt;GA uses OpenID Connect (OIDC) to authenticate to AWS without a password.  OIDC requires an IAM Role and Policy.&lt;/p&gt;
&lt;p&gt;I name the IAM Role &lt;strong&gt;GitHubActionsHoursLoveDeploy&lt;/strong&gt;, since it will deploy to the S3 bucket that hosts &lt;a href="https://hours.love"&gt;https://hours.love&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Role includes a &lt;strong&gt;trust policy&lt;/strong&gt; that allows the &lt;strong&gt;hours.love&lt;/strong&gt; repo to access resources in my AWS account.  I limit the scope to the &lt;strong&gt;main&lt;/strong&gt; branch.&lt;/p&gt;
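&lt;p&gt;The trust policy references the OIDC identity provider for &lt;strong&gt;token.actions.githubusercontent.com&lt;/strong&gt; in my account.  If your account does not have that provider yet, register it once with the AWS CLI (a sketch; older CLI versions may also require a &lt;strong&gt;--thumbprint-list&lt;/strong&gt; argument):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;aws iam create-open-id-connect-provider \
    --url https://token.actions.githubusercontent.com \
    --client-id-list sts.amazonaws.com
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;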
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2012-10-17&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Statement&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Effect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Principal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Federated&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sts:AssumeRoleWithWebIdentity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Condition&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;StringEquals&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;token.actions.githubusercontent.com:aud&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sts.amazonaws.com&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;StringLike&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;token.actions.githubusercontent.com:sub&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;repo:hatdropper1977/hours.love:ref:refs/heads/main&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;repo:hatdropper1977/hours.love:ref:refs/heads/main&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
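&lt;p&gt;If you prefer the command line to the IAM console, create the Role from the trust policy above (a sketch; assumes you saved the JSON as &lt;strong&gt;trust-policy.json&lt;/strong&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;aws iam create-role \
    --role-name GitHubActionsHoursLoveDeploy \
    --assume-role-policy-document file://trust-policy.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;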

&lt;p&gt;The Role includes a policy that allows GA to both push web content to the S3 bucket and invalidate the CloudFront cache.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Version&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2012-10-17&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Statement&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Sid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;S3Deploy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Effect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:PutObject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:DeleteObject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:GetObject&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3:ListBucket&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Resource&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:s3:::hours.love&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;arn:aws:s3:::hours.love/*&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Sid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;CloudFrontInvalidate&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Effect&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Allow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Action&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cloudfront:CreateInvalidation&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;Resource&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;*&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
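&lt;p&gt;Attach the permission policy to the Role as an inline policy (a sketch; the policy name &lt;strong&gt;HoursLoveDeploy&lt;/strong&gt; and the file name &lt;strong&gt;deploy-policy.json&lt;/strong&gt; are my own choices):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;aws iam put-role-policy \
    --role-name GitHubActionsHoursLoveDeploy \
    --policy-name HoursLoveDeploy \
    --policy-document file://deploy-policy.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;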

&lt;h2&gt;Configure GitHub&lt;/h2&gt;
&lt;p&gt;We need to configure GitHub to both accept markdown from Gemini CLI and send content to AWS.&lt;/p&gt;
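&lt;p&gt;Generate the SSH keypair for the Deploy Key first (a sketch; the file name matches the &lt;strong&gt;/run/secrets/gemini_hours_love&lt;/strong&gt; path that &lt;strong&gt;entrypoint.sh&lt;/strong&gt; expects, and the empty passphrase keeps the automation non-interactive):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ssh-keygen -t ed25519 -N &amp;quot;&amp;quot; -f gemini_hours_love -C &amp;quot;gemini-hours-love&amp;quot;
cat gemini_hours_love.pub
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;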
&lt;p&gt;For the Gemini CLI, we paste our public key in &lt;strong&gt;Settings --&amp;gt; Deploy Keys --&amp;gt; Add Deploy Key&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="How to add CLI Key" src="https://john.soban.ski/images/Gemma_Nye/02_Public_Key.jpg"&gt;&lt;/p&gt;
&lt;p&gt;For AWS, we create a YAML file that includes the Amazon Resource Name (ARN) of the IAM Role that GA assumes via OIDC, and push it to the repo as &lt;strong&gt;.github/workflows/deploy.yml&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Deploy&lt;/span&gt;

&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;push&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;main&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="nt"&gt;permissions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;read&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;id-token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;write&lt;/span&gt;

&lt;span class="nt"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;group&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;deploy-hours-love&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;cancel-in-progress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;

&lt;span class="nt"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;us-east-1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;S3_BUCKET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;hours.love&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;CLOUDFRONT_DISTRIBUTION_ID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;A12B3CDEFGHIJK&lt;/span&gt;

&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;deploy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ubuntu-latest&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Checkout&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/checkout@v5&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Setup Node&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/setup-node@v6&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;node-version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;24&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;cache-dependency-path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;package-lock.json&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Install dependencies&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm install&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Build site&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm run build&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Verify build output&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;test -d _site || (echo &amp;quot;Expected build output directory &amp;#39;_site&amp;#39; not found&amp;quot; &amp;amp;&amp;amp; exit 1)&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Configure AWS credentials&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws-actions/configure-aws-credentials@v6&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;aws-region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;${{ env.AWS_REGION }}&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;role-to-assume&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;arn:aws:iam::123456789012:role/GitHubActionsHoursLoveDeploy&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Sync site to S3&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;aws s3 sync _site/ s3://${S3_BUCKET}/ --delete&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Invalidate CloudFront&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;aws cloudfront create-invalidation \&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;--distribution-id &amp;quot;${CLOUDFRONT_DISTRIBUTION_ID}&amp;quot; \&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;--paths &amp;quot;/*&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that this YAML also includes stanzas to build the site via npm.&lt;/p&gt;
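&lt;p&gt;You can rehearse the build and deploy steps locally before you trust the workflow with production (a sketch; assumes Node and the AWS CLI with credentials that can read the bucket, and that the build emits &lt;strong&gt;_site&lt;/strong&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm ci
npm run build
test -d _site
aws s3 sync _site/ s3://hours.love/ --delete --dryrun
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;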
&lt;p&gt;We create another YAML file, &lt;strong&gt;.github/workflows/ci.yml&lt;/strong&gt;, to define the Continuous Integration/Continuous Deployment (CI/CD) pipeline.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;CI&lt;/span&gt;

&lt;span class="nt"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;push&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gemini/**&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;setup/**&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;pull_request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;main&lt;/span&gt;

&lt;span class="nt"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;build&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;runs-on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ubuntu-latest&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Checkout&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/checkout@v4&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Detect project&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;detect&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;if [ -f package.json ]; then&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;echo &amp;quot;type=node&amp;quot; &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;elif [ -f requirements.txt ]; then&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;echo &amp;quot;type=python&amp;quot; &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;else&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;echo &amp;quot;type=unknown&amp;quot; &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="no"&gt;fi&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Setup Node&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;uses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;actions/setup-node@v4&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;with&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="nt"&gt;node-version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;20&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Install deps&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm ci&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Lint&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm run lint --if-present&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Test&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm test --if-present&lt;/span&gt;

&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Build&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;if&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;steps.detect.outputs.type == &amp;#39;node&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;npm run build --if-present&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a Ruleset in &lt;strong&gt;Settings --&amp;gt; Rules --&amp;gt; Rulesets&lt;/strong&gt; via the GitHub User Interface (UI) to block branch deletions and force pushes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Navigate to the Ruleset" src="https://john.soban.ski/images/Gemma_Nye/03_Minor_Blocks.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Configure Gemini CLI&lt;/h2&gt;
&lt;p&gt;The Gemini CLI contains the blog writing intelligence.  It writes a blog in Markdown and pushes the Markdown to GitHub.  When GitHub receives the file, it kicks off the GA workflows captured in &lt;strong&gt;ci.yml&lt;/strong&gt; and &lt;strong&gt;deploy.yml&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For security reasons, we deploy Gemini CLI in a Docker container.  This prevents Gemini from accessing any files on our laptop or server that live outside of the deployment folder.&lt;/p&gt;
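&lt;p&gt;A minimal launch command shows the idea (a sketch; the image name &lt;strong&gt;gemma-nye&lt;/strong&gt; and the host paths are assumptions).  We mount only the work folder and the private key, read-only where possible:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker run --rm \
  -e GEMINI_API_KEY=&amp;quot;$GEMINI_API_KEY&amp;quot; \
  -e GIT_REPO_SSH=&amp;quot;git@github-hours-love:hatdropper1977/hours.love.git&amp;quot; \
  -v &amp;quot;$PWD/work:/work&amp;quot; \
  -v &amp;quot;$HOME/.ssh/gemini_hours_love:/run/secrets/gemini_hours_love:ro&amp;quot; \
  gemma-nye
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;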
&lt;p&gt;We use a file &lt;strong&gt;entrypoint.sh&lt;/strong&gt; to configure and command the container.  This includes the following actions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Mount SSH key (for GitHub)&lt;/li&gt;
&lt;li&gt;Clone/pull repo&lt;/li&gt;
&lt;li&gt;Generate prompt based on rules&lt;/li&gt;
&lt;li&gt;Call Gemini API (with search)&lt;/li&gt;
&lt;li&gt;Write markdown post&lt;/li&gt;
&lt;li&gt;Validate output&lt;/li&gt;
&lt;li&gt;Commit + push&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The full file reads:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;#!/usr/bin/env bash
set -euo pipefail

: &amp;quot;${GEMINI_API_KEY:?GEMINI_API_KEY is required}&amp;quot;
: &amp;quot;${GIT_REPO_SSH:?GIT_REPO_SSH is required}&amp;quot;
: &amp;quot;${GIT_USER_NAME:=Gemini CLI}&amp;quot;
: &amp;quot;${GIT_USER_EMAIL:=gemini-hours-love@users.noreply.github.com}&amp;quot;
: &amp;quot;${SSH_KEY_SRC:=/run/secrets/gemini_hours_love}&amp;quot;
: &amp;quot;${TZ:=America/New_York}&amp;quot;

export TZ
export HOME=/root
export REPO_DIR=&amp;quot;${REPO_DIR:-/work/hours.love}&amp;quot;
export POSTS_DIR=&amp;quot;${POSTS_DIR:-posts}&amp;quot;
export DATE_LOCAL=&amp;quot;$(date +%F)&amp;quot;

# Context mode:
#   good   = titles only
#   better = titles + short snippets
export RECENT_POSTS_MODE=&amp;quot;${RECENT_POSTS_MODE:-better}&amp;quot;

mkdir -p /root/.ssh /work
chmod 700 /root/.ssh

if [[ ! -f &amp;quot;$SSH_KEY_SRC&amp;quot; ]]; then
  echo &amp;quot;Missing SSH key at $SSH_KEY_SRC&amp;quot;
  exit 1
fi

cp &amp;quot;$SSH_KEY_SRC&amp;quot; /root/.ssh/gemini_hours_love
chmod 600 /root/.ssh/gemini_hours_love

cat &amp;gt;/root/.ssh/config &amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;
Host github-hours-love
  HostName github.com
  User git
  IdentityFile /root/.ssh/gemini_hours_love
  IdentitiesOnly yes
EOF

chmod 600 /root/.ssh/config
ssh-keyscan github.com &amp;gt;&amp;gt; /root/.ssh/known_hosts 2&amp;gt;/dev/null
chmod 644 /root/.ssh/known_hosts

# --- clone or update repo ---
if [[ ! -d &amp;quot;$REPO_DIR/.git&amp;quot; ]]; then
  git clone &amp;quot;$GIT_REPO_SSH&amp;quot; &amp;quot;$REPO_DIR&amp;quot;
fi

cd &amp;quot;$REPO_DIR&amp;quot;

git config user.name &amp;quot;$GIT_USER_NAME&amp;quot;
git config user.email &amp;quot;$GIT_USER_EMAIL&amp;quot;

git fetch origin
git checkout main
git reset --hard origin/main

mkdir -p &amp;quot;$POSTS_DIR&amp;quot;

POST_FILE=&amp;quot;$POSTS_DIR/${DATE_LOCAL}.md&amp;quot;

if [[ -f &amp;quot;$POST_FILE&amp;quot; ]]; then
  echo &amp;quot;Post already exists for ${DATE_LOCAL}; exiting.&amp;quot;
  exit 0
fi

# --- recent post context ---
build_recent_titles_context() {
  local recent_files=()
  mapfile -t recent_files &amp;lt; &amp;lt;(find &amp;quot;$POSTS_DIR&amp;quot; -maxdepth 1 -type f -name &amp;quot;*.md&amp;quot; ! -name &amp;quot;${DATE_LOCAL}.md&amp;quot; | sort -r | head -n 5)

  if [[ ${#recent_files[@]} -eq 0 ]]; then
    echo &amp;quot;No recent posts yet.&amp;quot;
    return
  fi

  for f in &amp;quot;${recent_files[@]}&amp;quot;; do
    local title
    title=&amp;quot;$(grep -m1 &amp;#39;^title:&amp;#39; &amp;quot;$f&amp;quot; | sed &amp;#39;s/^title:[[:space:]]*//&amp;#39;)&amp;quot;
    if [[ -z &amp;quot;$title&amp;quot; ]]; then
      title=&amp;quot;$(basename &amp;quot;$f&amp;quot; .md)&amp;quot;
    fi
    echo &amp;quot;- ${title}&amp;quot;
  done
}

build_recent_snippets_context() {
  local recent_files=()
  mapfile -t recent_files &amp;lt; &amp;lt;(find &amp;quot;$POSTS_DIR&amp;quot; -maxdepth 1 -type f -name &amp;quot;*.md&amp;quot; ! -name &amp;quot;${DATE_LOCAL}.md&amp;quot; | sort -r | head -n 3)

  if [[ ${#recent_files[@]} -eq 0 ]]; then
    echo &amp;quot;No recent posts yet.&amp;quot;
    return
  fi

  for f in &amp;quot;${recent_files[@]}&amp;quot;; do
    local title
    local snippet
    title=&amp;quot;$(grep -m1 &amp;#39;^title:&amp;#39; &amp;quot;$f&amp;quot; | sed &amp;#39;s/^title:[[:space:]]*//&amp;#39;)&amp;quot;
    if [[ -z &amp;quot;$title&amp;quot; ]]; then
      title=&amp;quot;$(basename &amp;quot;$f&amp;quot; .md)&amp;quot;
    fi

    snippet=&amp;quot;$(
      awk &amp;#39;
        BEGIN { in_frontmatter=0; started=0; lines=0 }
        /^---$/ {
          if (started == 0) { in_frontmatter=1; started=1; next }
          else if (in_frontmatter == 1) { in_frontmatter=0; next }
        }
        in_frontmatter == 0 &amp;amp;&amp;amp; NF {
          print
          lines++
          if (lines &amp;gt;= 6) exit
        }
      &amp;#39; &amp;quot;$f&amp;quot; | tr &amp;#39;\n&amp;#39; &amp;#39; &amp;#39; | sed &amp;#39;s/[[:space:]]\+/ /g&amp;#39;
    )&amp;quot;

    echo &amp;quot;- Title: ${title}&amp;quot;
    if [[ -n &amp;quot;$snippet&amp;quot; ]]; then
      echo &amp;quot;  Snippet: ${snippet}&amp;quot;
    fi
  done
}

if [[ &amp;quot;$RECENT_POSTS_MODE&amp;quot; == &amp;quot;good&amp;quot; ]]; then
  RECENT_POSTS_CONTEXT=&amp;quot;$(build_recent_titles_context)&amp;quot;
else
  RECENT_POSTS_CONTEXT=&amp;quot;$(build_recent_snippets_context)&amp;quot;
fi

# --- prompt ---
PROMPT=$(cat &amp;lt;&amp;lt;EOF
Write exactly one Eleventy post as valid markdown.

Output rules:
- Output ONLY the post file contents
- No explanations, no commentary, no meta text
- No code fences

Required format:
---
title: &amp;lt;specific title&amp;gt;
date: ${DATE_LOCAL}
tags:
  - posts
layout: post.liquid
---

Then the article body.

Narrative requirements:
- The first paragraph must introduce the story and preview what follows
- It must clearly state what happened and why it matters
- If a person is central, identify them immediately with name, role, and location
- Do not open with a scene or anecdote
- Do not open with generic background

Structure:
1. Opening: what happened (fact-based lead)
2. Details: names, locations, specifics
3. Context: why it matters locally
4. Optional: related developments
5. No formal conclusion

Writing rules:
- 500 to 800 words
- Write like a local industry publication, not a personal blog
- No first-person narration (no &amp;quot;I&amp;quot;, &amp;quot;we&amp;quot;)
- No fictional scenes or invented experiences
- No sensory storytelling unless tied to a real reported fact
- Short paragraphs, factual tone
- No pontificating
- No generalizations
- No moralizing
- No reflective commentary
- No filler phrases
- No inspirational tone
- No em dashes
- No intensifiers like &amp;quot;very&amp;quot;, &amp;quot;really&amp;quot;, &amp;quot;deeply&amp;quot;, &amp;quot;truly&amp;quot;, &amp;quot;far more&amp;quot;
- Do not explain what you&amp;#39;re about to say
- If a sentence sounds like a diary, remove it
- If a sentence could appear in a newspaper, keep it
- Keep sentences under 20 words when possible

Content rules:
- Use recent, real information when available via search grounding
- Summarize clearly and directly
- Combine multiple sources when relevant
- Do not fabricate facts
- Do not fabricate firsthand experience
- Focus on:
  - what happened
  - who is involved
  - where it happened
  - why it matters locally
  - practical implications

People coverage rules:
- When a person is central, make them the anchor of the piece
- Focus on what they did, changed, opened, produced, or influenced
- Include:
  - full name
  - role
  - associated business or winery
  - location
- Do not write generic biographies
- Do not invent people
- Only include people present in grounding sources
- Prefer coverage where a person is tied to a real development

Sourcing rules:
- Prefer using 2 to 4 distinct sources when reporting news
- Prefer sources that mention specific people, not just organizations
- Do not rely on a single source if multiple relevant sources exist
- Synthesize information across sources into a single narrative
- Do not summarize sources one-by-one
- Do not write &amp;quot;Article A says, Article B says&amp;quot;
- Combine facts into one coherent account

Citation rules:
- Every key factual claim must come from a grounding source
- Do not invent citations
- Do not fabricate details
- Use light attribution when necessary:
  - Marin Independent Journal reported that ...
  - The San Francisco Chronicle reported ...
  - WineBusiness noted ...
- Do not attribute every sentence
- Use markdown links inline when appropriate:
  - [publication name](url)
- Links must correspond to real grounding sources
- If multiple sources confirm a fact, present it once
- If sources differ, reflect that briefly without speculation

Footnote rules:
- Footnotes are allowed but optional
- If used, they must reference real sources
- Use markdown footnote syntax:
  - reference like [^1]
  - define at bottom:
    [^1]: Source Name - URL
- Do not invent footnotes

Failure rules:
- If fewer than 2 relevant sources are available, write a focused piece using one source plus context
- If no relevant sources are available, fall back to a non-news industry post
- Do NOT invent news or citations

Topic priorities (in order):
1. Marin / Novato / Bay Area restaurant or hospitality news
2. Northern California winery or vineyard developments
3. Key people in the Northern California wine scene:
   - winemakers
   - vineyard managers
   - owners
   - sommeliers
   - importers / distributors
   - hospitality operators
   - chefs with strong wine programs
4. Local wine business, production, or distribution changes
5. Regional updates tied to specific producers or places

Recent posts to avoid repeating:
${RECENT_POSTS_CONTEXT}

If a topic overlaps:
- choose a different angle instead of repeating

Title rules:
- Specific and concrete
- Not generic
- Should read like a headline
- Include a real place, business, or person when possible
EOF
)

# --- build request ---
jq -n --arg prompt &amp;quot;$PROMPT&amp;quot; &amp;#39;{
  contents: [
    {
      parts: [
        { text: $prompt }
      ]
    }
  ],
  tools: [
    {
      google_search: {}
    }
  ]
}&amp;#39; &amp;gt; /tmp/gemini_request.json

# --- call Gemini API ---
curl -sS \
  -H &amp;quot;Content-Type: application/json&amp;quot; \
  -H &amp;quot;x-goog-api-key: ${GEMINI_API_KEY}&amp;quot; \
  -X POST \
  &amp;quot;https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent&amp;quot; \
  -d @/tmp/gemini_request.json \
  &amp;gt; /tmp/gemini_response.json

# --- fail fast on API errors ---
if jq -e &amp;#39;.error&amp;#39; /tmp/gemini_response.json &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
  echo &amp;quot;Gemini API returned an error:&amp;quot;
  jq &amp;#39;.error&amp;#39; /tmp/gemini_response.json
  exit 1
fi

# --- extract text safely ---
POST_TEXT=&amp;quot;$(jq -r &amp;#39;.candidates[0].content.parts[0].text // empty&amp;#39; /tmp/gemini_response.json)&amp;quot;

if [[ -z &amp;quot;$POST_TEXT&amp;quot; ]]; then
  echo &amp;quot;Empty post content returned from Gemini.&amp;quot;
  jq &amp;#39;.&amp;#39; /tmp/gemini_response.json
  exit 1
fi

printf &amp;#39;%s\n&amp;#39; &amp;quot;$POST_TEXT&amp;quot; &amp;gt; &amp;quot;$POST_FILE&amp;quot;

# --- contamination guard ---
if grep -qE &amp;#39;I have written the blog post|/work/|^Here is|^Sure|^```&amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Contaminated output. Aborting.&amp;quot;
  cat &amp;quot;$POST_FILE&amp;quot;
  exit 1
fi

# --- basic format checks ---
if ! grep -q &amp;#39;^---$&amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Missing front matter.&amp;quot;
  cat &amp;quot;$POST_FILE&amp;quot;
  exit 1
fi

if ! grep -q &amp;#39;^title:&amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Missing title in front matter.&amp;quot;
  cat &amp;quot;$POST_FILE&amp;quot;
  exit 1
fi

# --- extract real sources, deduped ---
jq -r &amp;#39;
  [
    .candidates[0].groundingMetadata.groundingChunks[]?.web
    | select(.title and .uri)
    | &amp;quot;- [&amp;quot; + .title + &amp;quot;](&amp;quot; + .uri + &amp;quot;)&amp;quot;
  ] | unique | .[]
&amp;#39; /tmp/gemini_response.json &amp;gt; /tmp/sources.md || true

if [[ -s /tmp/sources.md ]]; then
  {
    printf &amp;#39;\n\n## Sources\n\n&amp;#39;
    cat /tmp/sources.md
    printf &amp;#39;\n&amp;#39;
  } &amp;gt;&amp;gt; &amp;quot;$POST_FILE&amp;quot;
fi

# --- style guard (lightweight) ---
if grep -qE &amp;#39;—| very | really | deeply | truly | far more | that is just &amp;#39; &amp;quot;$POST_FILE&amp;quot;; then
  echo &amp;quot;Style warning: banned phrasing detected.&amp;quot;
fi

# --- build validation ---
npm ci
npm run build

# --- commit ---
git add &amp;quot;$POST_FILE&amp;quot;

if git diff --cached --quiet; then
  echo &amp;quot;No changes to commit.&amp;quot;
  exit 0
fi

git commit -m &amp;quot;Auto post ${DATE_LOCAL}&amp;quot;
git push origin main

echo &amp;quot;Done.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Docker container uses this file to execute the commands required to create and push a blog post.&lt;/p&gt;
&lt;h2&gt;One Time Tasks&lt;/h2&gt;
&lt;p&gt;One-time tasks include setting up the repo, initializing a Node project, configuring the static site generator, creating Liquid templates, building the container, and deploying the task scheduler.&lt;/p&gt;
&lt;h3&gt;Set Up the Repo&lt;/h3&gt;
&lt;p&gt;The writing bot launches a container on my laptop and pulls the current GitHub repo.  The repo contains (1) the data needed to build the project on my laptop (a one-time manual action) and (2) the current state of the website Markdown.&lt;/p&gt;
&lt;p&gt;When the container runs, it uses the following commands to pull the &lt;strong&gt;source of truth&lt;/strong&gt; from the repo.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;git@github-hours-love:hatdropper1977/hours.love.git
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;hours.love
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Initialize Node Project&lt;/h3&gt;
&lt;p&gt;The static site builder uses Node.js.  GitHub Actions runs the commands to build the site.  The following shell command initializes the Node project and provides configuration information for GitHub Actions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm&lt;span class="w"&gt; &lt;/span&gt;init&lt;span class="w"&gt; &lt;/span&gt;-y

&lt;span class="c1"&gt;# Created:&lt;/span&gt;
&lt;span class="c1"&gt;# package.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This file tells all systems (your laptop, GitHub Actions) how to install dependencies and how to build the site.&lt;/p&gt;
&lt;h3&gt;Configure the Static Site Generator&lt;/h3&gt;
&lt;p&gt;I chose &lt;a href="https://www.11ty.dev/"&gt;Eleventy (11ty)&lt;/a&gt; for my static site generator.  You can choose any static site generator you wish.  On this website, for example, I use &lt;a href="https://getpelican.com/"&gt;Pelican&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The static site generator converts the raw Markdown into a live site, with navigation, client-side scripting, and style.&lt;/p&gt;
&lt;p&gt;We install 11ty with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--save-dev&lt;span class="w"&gt; &lt;/span&gt;@11ty/eleventy

&lt;span class="c1"&gt;# Added &lt;/span&gt;
&lt;span class="c1"&gt;# &amp;quot;devDependencies&amp;quot;: {&lt;/span&gt;
&lt;span class="c1"&gt;#  &amp;quot;@11ty/eleventy&amp;quot;: &amp;quot;^3.x&amp;quot;&lt;/span&gt;
&lt;span class="c1"&gt;# }&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# Created:&lt;/span&gt;
&lt;span class="c1"&gt;# package-lock.json&lt;/span&gt;
&lt;span class="c1"&gt;# node_modules/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This commands our project to use Eleventy to build the site.&lt;/p&gt;
&lt;p&gt;Next, we edit &lt;strong&gt;package.json&lt;/strong&gt; (created above) to define the build command.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="s2"&gt;&amp;quot;scripts&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;build&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;npx @11ty/eleventy&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, when GA runs...&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;npm&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;...it executes...&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;npx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;@11&lt;/span&gt;&lt;span class="n"&gt;ty&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;eleventy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;...which generates:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;_site/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
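&lt;p&gt;Eleventy falls back to sensible defaults, so no configuration file is strictly required.  If you want to pin the input and output folders explicitly, a minimal &lt;strong&gt;.eleventy.js&lt;/strong&gt; sketch looks like:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;// Minimal sketch; these match Eleventy&amp;#39;s defaults
module.exports = function (eleventyConfig) {
  return {
    dir: {
      input: &amp;quot;.&amp;quot;,
      output: &amp;quot;_site&amp;quot;
    }
  };
};
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;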

&lt;h3&gt;Create Liquid Templates&lt;/h3&gt;
&lt;p&gt;11ty uses the &lt;a href="https://www.11ty.dev/docs/languages/liquid/"&gt;Liquid&lt;/a&gt; template language.  The template language configures the look and feel of the website.&lt;/p&gt;
&lt;p&gt;I created layout files for the home page and each unique post.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;index.liquid&lt;/strong&gt; includes&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;---
title: hours.love
---

&lt;span class="cp"&gt;&amp;lt;!doctype html&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;html&lt;/span&gt; &lt;span class="na"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;en&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;head&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;meta&lt;/span&gt; &lt;span class="na"&gt;charset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;utf-8&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;title&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;{{ title }}&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;title&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;meta&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;viewport&amp;quot;&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;width=device-width, initial-scale=1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;style&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;root&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--bg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#f7f3ec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--paper&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#fffdf9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--ink&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#1f1b16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#6b6258&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#ddd3c7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--link&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#6f2c1f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nv"&gt;--link-hover&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mh"&gt;#8a3828&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;box-sizing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;border-box&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--bg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--ink&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Georgia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Times New Roman&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;serif&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;line-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.65&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;wrap&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;760&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;hero&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin-bottom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.4&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;line-height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;sub&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;intro&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--paper&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;solid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border-radius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;h2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;letter-spacing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="kt"&gt;em&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;list-style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;solid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;last-child&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;border-bottom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;solid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--line&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--link&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;text-decoration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nd"&gt;hover&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--link-hover&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;text-decoration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;underline&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;meta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;block&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;margin-top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="kt"&gt;px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="kt"&gt;rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;empty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;color&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;--muted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="k"&gt;font-style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;italic&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;style&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;head&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;main&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;wrap&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;header&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;hero&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;hours.love&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h1&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sub&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Notes on Northern California wine, food, places, and the small details that make them worth remembering.&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;header&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;intro&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      This is a running notebook, not a magazine. Short pieces. Specific places. Good bottles. Restaurants worth knowing. Things noticed and written down before they blur.
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;h2&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Recent posts&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;h2&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;

      {% if collections.posts and collections.posts.length &amp;gt; 0 %}
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;posts&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          {% for post in collections.posts | reverse %}
            &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt; &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;{{ post.url }}&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;{{ post.data.title }}&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;a&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
              &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;meta&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;{{ post.date | date: &amp;quot;%B %d, %Y&amp;quot; }}&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
            &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;li&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
          {% endfor %}
        &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;ul&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      {% else %}
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;empty&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;No posts yet.&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;p&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      {% endif %}
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;section&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;main&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;body&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;html&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This template, similar to PHP, dynamically applies style to an arbitrary number of files at build time.&lt;/p&gt;
&lt;h3&gt;Build &amp;amp; Test the Container&lt;/h3&gt;
&lt;p&gt;I wrote the following Dockerfile, which builds a container image that includes Debian Linux, Node, git, curl, and jq.  Note that I mount my SSH security key into the container at run time, rather than bake it into the image.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:24-bookworm-slim&lt;/span&gt;

&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;DEBIAN_FRONTEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;noninteractive
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;TZ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;America/New_York
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;REPO_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/work/hours.love

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--no-install-recommends&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;openssh-client&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;ca-certificates&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;tzdata&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;jq&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ln&lt;span class="w"&gt; &lt;/span&gt;-fs&lt;span class="w"&gt; &lt;/span&gt;/usr/share/zoneinfo/&lt;span class="nv"&gt;$TZ&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/etc/localtime&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;dpkg-reconfigure&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;noninteractive&lt;span class="w"&gt; &lt;/span&gt;tzdata&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;rm&lt;span class="w"&gt; &lt;/span&gt;-rf&lt;span class="w"&gt; &lt;/span&gt;/var/lib/apt/lists/*

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/work&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;entrypoint.sh&lt;span class="w"&gt; &lt;/span&gt;/work/entrypoint.sh
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;chmod&lt;span class="w"&gt; &lt;/span&gt;+x&lt;span class="w"&gt; &lt;/span&gt;/work/entrypoint.sh

&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;/work/entrypoint.sh&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I build it with:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;docker&lt;span class="w"&gt; &lt;/span&gt;build&lt;span class="w"&gt; &lt;/span&gt;-t&lt;span class="w"&gt; &lt;/span&gt;hours-love-gemini-runner&lt;span class="w"&gt; &lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Run the Container&lt;/h3&gt;
&lt;p&gt;Now run the container with the following command.  I export my Gemini API key via the &lt;strong&gt;GEMINI_API_KEY&lt;/strong&gt; environment variable, so that the secret does not appear in my shell history.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;MSYS_NO_PATHCONV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;docker&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;--rm&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;GEMINI_API_KEY&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GIT_REPO_SSH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;git@github-hours-love:hatdropper1977/hours.love.git&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GIT_USER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hatdropper1977&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;-e&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GIT_USER_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sobanski.htc@gmail.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;--mount&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bind,src&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~/.ssh&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/gemini_hours_love&amp;quot;&lt;/span&gt;,dst&lt;span class="o"&gt;=&lt;/span&gt;/run/secrets/gemini_hours_love,readonly&lt;span class="w"&gt;   &lt;/span&gt;hours-love-gemini-runner&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
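&lt;p&gt;An alternative to an inline export: keep the key in a mode-600 env file and hand it to Docker with &lt;strong&gt;--env-file&lt;/strong&gt; (the scheduled batch file below uses the same pattern).  A sketch, with the &lt;strong&gt;~/.gemini.env&lt;/strong&gt; filename assumed:&lt;/p&gt;

```shell
# Hypothetical sketch: create a key file that only the current user can read.
install -m 600 /dev/null "$HOME/.gemini.env"
printf 'GEMINI_API_KEY=%s\n' 'example-key' &gt; "$HOME/.gemini.env"

# Pass the file to docker at run time instead of exporting the variable:
#   docker run --rm --env-file "$HOME/.gemini.env" ... hours-love-gemini-runner

# Or source it into the current shell when you need the variable locally.
set -a; . "$HOME/.gemini.env"; set +a
echo "$GEMINI_API_KEY"
```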

&lt;p&gt;When the container runs, it executes the commands in &lt;strong&gt;entrypoint.sh&lt;/strong&gt; and then exits; the &lt;strong&gt;--rm&lt;/strong&gt; flag tells Docker to delete the container afterward.&lt;/p&gt;
&lt;h3&gt;Schedule the Task&lt;/h3&gt;
&lt;p&gt;I use a Windows laptop to run the job once a day via Task Scheduler.  On Linux, you can schedule the same job with cron.&lt;/p&gt;
&lt;p&gt;The batch file contains:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="k"&gt;echo&lt;/span&gt; off

&lt;span class="c1"&gt;REM ---- config ----&lt;/span&gt;
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;IMAGE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;hours-love-gemini-runner
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;REPO&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;git@github-hours-love:hatdropper1977/hours.love.git
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;SSH_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;C:\Users\Freshlex\.ssh\gemini_hours_love
&lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="nv"&gt;ENV_FILE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;C:\Users\Freshlex\.gemini.env

&lt;span class="c1"&gt;REM ---- run container ----&lt;/span&gt;
docker run --rm &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; --env-file &lt;span class="nv"&gt;%ENV_FILE%&lt;/span&gt; &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; -e GIT_REPO_SSH=&lt;span class="nv"&gt;%REPO%&lt;/span&gt; &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; -e GIT_USER_NAME=hatdropper1977 &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; -e GIT_USER_EMAIL=my@emsail.com &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; --mount type=bind,src=&lt;span class="nv"&gt;%SSH_KEY%&lt;/span&gt;,dst=/run/secrets/gemini_hours_love,readonly &lt;span class="se"&gt;^&lt;/span&gt;
&lt;span class="se"&gt; &lt;/span&gt; &lt;span class="nv"&gt;%IMAGE%&lt;/span&gt;

&lt;span class="c1"&gt;REM ---- optional logging ----&lt;/span&gt;
&lt;span class="c1"&gt;REM &amp;gt;&amp;gt; C:\Users\Freshlex\gemini.log 2&amp;gt;&amp;amp;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I register the daily task with &lt;strong&gt;schtasks&lt;/strong&gt; from &lt;strong&gt;CMD&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;schtasks&lt;span class="w"&gt; &lt;/span&gt;/create&lt;span class="w"&gt; &lt;/span&gt;/tn&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hours-love-daily&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/tr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cmd.exe /c C:\Users\JohnSobanski\gemini-workspace\run-hours-love.bat&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/sc&lt;span class="w"&gt; &lt;/span&gt;daily&lt;span class="w"&gt; &lt;/span&gt;/st&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;:30&lt;span class="w"&gt; &lt;/span&gt;/rl&lt;span class="w"&gt; &lt;/span&gt;highest&lt;span class="w"&gt; &lt;/span&gt;/f
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
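&lt;p&gt;On Linux, an equivalent cron entry could look like the following sketch (the paths and env file are assumptions; cron requires the whole command on one line):&lt;/p&gt;

```text
30 10 * * * docker run --rm --env-file /home/me/.gemini.env -e GIT_REPO_SSH=git@github-hours-love:hatdropper1977/hours.love.git -e GIT_USER_NAME=hatdropper1977 -e GIT_USER_EMAIL=my@email.com --mount type=bind,src=/home/me/.ssh/gemini_hours_love,dst=/run/secrets/gemini_hours_love,readonly hours-love-gemini-runner >> /home/me/gemini.log 2>&1
```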

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I created an &lt;strong&gt;engagement farming bot&lt;/strong&gt; that produces a clean, readable, and interesting blog post once per day.  You can take my approach to create a similar bot on the topic of your choosing, or deploy a fleet of bots that publish posts to dozens of different websites.  You could even create a meta-bot pipeline that scours the web for trending topics and then spins up a new bot to write on each one.&lt;/p&gt;
&lt;p&gt;I just began my GenAI automation pipeline journey, and look forward to exploring future use cases.&lt;/p&gt;</content><category term="Data Science"></category><category term="AI"></category><category term="Machine Learning"></category><category term="CloudFront"></category><category term="Gemini"></category><category term="Linux"></category><category term="HOWTO"></category></entry><entry><title>Crunch Big Data on Your Laptop With Polars Streaming</title><link href="https://john.soban.ski/polars2.html" rel="alternate"></link><published>2025-09-27T06:44:00-04:00</published><updated>2025-09-27T06:44:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2025-09-27:/polars2.html</id><summary type="html">&lt;p&gt;Last month, I refactored a custom Artificial Intelligence (AI) algorithm from &lt;strong&gt;Pandas&lt;/strong&gt; to &lt;strong&gt;Polars&lt;/strong&gt;.  This switch drove a 25x increase in performance.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear and a Panda Bear squaring off in space" src="https://john.soban.ski/images/Polars2/01_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I needed to change the logic from a row-based &lt;strong&gt;apply&lt;/strong&gt; approach to a holistic, matrix-level &lt;strong&gt;join/GROUP BY&lt;/strong&gt; approach.&lt;/p&gt;
&lt;p&gt;My algorithm, however, experienced an Out Of Memory (OOM …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last month, I refactored a custom Artificial Intelligence (AI) algorithm from &lt;strong&gt;Pandas&lt;/strong&gt; to &lt;strong&gt;Polars&lt;/strong&gt;.  This switch drove a 25x increase in performance.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear and a Panda Bear squaring off in space" src="https://john.soban.ski/images/Polars2/01_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I needed to change the logic from a row-based &lt;strong&gt;apply&lt;/strong&gt; approach to a holistic, matrix-level &lt;strong&gt;join/GROUP BY&lt;/strong&gt; approach.&lt;/p&gt;
&lt;p&gt;My algorithm, however, experienced an Out Of Memory (OOM) error when I attempted to train a corpus of twenty-seven million (27M) observations.  I then discovered the &lt;strong&gt;Streaming&lt;/strong&gt; feature of Polars, which solves this issue.&lt;/p&gt;
&lt;p&gt;Polars &lt;strong&gt;Streaming&lt;/strong&gt; parallelizes processing in the &lt;strong&gt;time domain&lt;/strong&gt;.  It reads a chunk of data into memory, spreads the computation across all cores, saves the result, and then loads the next chunk.  This way, you can perform &lt;strong&gt;Big Data&lt;/strong&gt; operations in a meager Colab notebook.&lt;/p&gt;
&lt;h2&gt;Recap:  The Algorithm in Polars vs. Python&lt;/h2&gt;
&lt;p&gt;The exemplar &lt;strong&gt;Reduced Coulomb Energy&lt;/strong&gt; (RCE) algorithm works like the familiar &lt;strong&gt;k-nearest neighbors&lt;/strong&gt; algorithm, with a subtle twist.  RCE calculates a &lt;strong&gt;hit footprint&lt;/strong&gt; for each observation.  The distance to the closest observation of a &lt;strong&gt;different&lt;/strong&gt; class defines the &lt;strong&gt;hit footprint&lt;/strong&gt; radius.  The math then calculates the distance to every other observation; each observation that falls within the &lt;strong&gt;hit footprint&lt;/strong&gt; radius yields a hit for its class.&lt;/p&gt;
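&lt;p&gt;A toy sketch of the hit footprint in plain NumPy (hypothetical points, not the article's corpus) may help ground the idea:&lt;/p&gt;

```python
import numpy as np

# Toy sketch of the RCE hit footprint (hypothetical data, two classes).
X = np.array([[0.0, 0.0],   # class 'a' -- the observation we score
              [1.0, 0.0],   # class 'b'
              [0.2, 0.1],   # class 'a'
              [3.0, 3.0]])  # class 'b'
y = np.array(['a', 'b', 'a', 'b'])

proto, proto_class = X[0], y[0]
dists = np.linalg.norm(X - proto, axis=1)  # distance to every observation

# Radius: distance to the closest observation of a DIFFERENT class.
radius = dists[y != proto_class].min()     # the point at (1, 0)

# Hits: every *other* observation that falls strictly inside the radius
# (dists > 0 drops the observation itself; assumes no duplicate points).
inside = (dists < radius) & (dists > 0)
hits_by_class = {c: int(((y == c) & inside).sum()) for c in np.unique(y)}
```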
&lt;p&gt;The original Pandas approach uses the following Lambda function to implement the logic.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then apply the lambda function to each row.&lt;/p&gt;
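&lt;p&gt;Applied row by row, the call could look like the following sketch (the &lt;strong&gt;princomp&lt;/strong&gt; column names and the &lt;strong&gt;lambda&lt;/strong&gt; radii are assumptions for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two principal components, an outcome label, and a
# per-observation 'lambda' hit-footprint radius.
df = pd.DataFrame({
    'princomp_1': [0.0, 1.0, 0.2, 3.0],
    'princomp_2': [0.0, 0.0, 0.1, 3.0],
    'outcome':    ['a', 'b', 'a', 'b'],
    'lambda':     [0.9, 0.9, 0.9, 0.9],
})

def find_hits(df, v, outcome):
    # Count observations of `outcome` whose distance to point v falls
    # inside each observation's own lambda radius.
    subset = df.loc[df['outcome'] == outcome]
    dists = np.linalg.norm(subset.iloc[:, :-2].sub(np.array(v)), axis=1)
    return int((dists < subset['lambda']).sum())

# Row-wise application: count class-'b' hits around each observation.
df['b_hits'] = df.apply(
    lambda row: find_hits(df, [row['princomp_1'], row['princomp_2']], 'b'),
    axis=1)
```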
&lt;p&gt;The Polars approach uses columnar/matrix-based operations.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;squared_distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;squared_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;squared_distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; 
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lambda_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;wall clock&lt;/strong&gt; time for the Pandas approach reads six (6) minutes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screengrab that shows the execution time of the Pandas approach" src="https://john.soban.ski/images/Polars/05_Pandas_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;wall clock&lt;/strong&gt; time for the Polars approach reads fourteen (14) seconds, a ~25x improvement.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Screengrab that shows the execution time for the Polars approach" src="https://john.soban.ski/images/Polars/06_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Data Viz&lt;/h2&gt;
&lt;p&gt;Last month, I created a graphic of the Pima Dataset, which depicts the observations classified by &lt;strong&gt;Outcome 0&lt;/strong&gt; vs. &lt;strong&gt;Outcome 1&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A heat map of Outcome 0 vs. Outcome 1 for the Pima Dataset" src="https://john.soban.ski/images/Polars/04_Rce_Map.jpg"&gt;&lt;/p&gt;
&lt;p&gt;To graph it, I trained the Pima Dataset via RCE to generate a model.  I then created a two-dimensional 300x300 grid DataFrame, and used the Pima model to classify all 90k points.&lt;/p&gt;
&lt;p&gt;I will extend this to three dimensions.  To do so, I must (1) reduce the Pima Dataset to 3D, (2) calculate the &lt;strong&gt;lambdas&lt;/strong&gt; (radii) for all hit footprints, (3) create a 300x300x300 Data Viz grid DataFrame (27,000,000 points), and (4) classify all 27M Data Viz points with the 3D model.&lt;/p&gt;
&lt;h3&gt;1.  Reduce the Pima Dataset to Three Dimensions&lt;/h3&gt;
&lt;p&gt;The Pima Diabetes Dataset includes eight (8) features and one (1) target.  For a 3D plot, I need to reduce the eight (8) features down to three (3).  Principal Component Analysis (PCA) reduces dimensionality while retaining information.  See my blog post &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;New Exemplar Machine Learning Algorithm&lt;/a&gt; for a discussion on PCA.  In layman's terms, PCA projects the &lt;strong&gt;observations&lt;/strong&gt; from a higher-dimensional space into a lower-dimensional one.  Imagine a bunch of coins sprinkled on a (2D) piece of paper.  If you arrange them in a line, you reduce the dimensionality from two to one.  PCA executes the same process, but accounts for the variance and density of the observations in the higher-dimensional space.&lt;/p&gt;
&lt;p&gt;First, I load the Pima Diabetes Dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;df_pima&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_pima&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

   &lt;span class="n"&gt;Pregnancies&lt;/span&gt;  &lt;span class="n"&gt;Glucose&lt;/span&gt;  &lt;span class="n"&gt;BloodPressure&lt;/span&gt;  &lt;span class="n"&gt;SkinThickness&lt;/span&gt;  &lt;span class="n"&gt;Insulin&lt;/span&gt;   &lt;span class="n"&gt;BMI&lt;/span&gt;  \
&lt;span class="mi"&gt;0&lt;/span&gt;            &lt;span class="mi"&gt;6&lt;/span&gt;      &lt;span class="mi"&gt;148&lt;/span&gt;             &lt;span class="mi"&gt;72&lt;/span&gt;             &lt;span class="mi"&gt;35&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="mf"&gt;33.6&lt;/span&gt;   
&lt;span class="mi"&gt;1&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="mi"&gt;85&lt;/span&gt;             &lt;span class="mi"&gt;66&lt;/span&gt;             &lt;span class="mi"&gt;29&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="mf"&gt;26.6&lt;/span&gt;   
&lt;span class="mi"&gt;2&lt;/span&gt;            &lt;span class="mi"&gt;8&lt;/span&gt;      &lt;span class="mi"&gt;183&lt;/span&gt;             &lt;span class="mi"&gt;64&lt;/span&gt;              &lt;span class="mi"&gt;0&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="mf"&gt;23.3&lt;/span&gt;   
&lt;span class="mi"&gt;3&lt;/span&gt;            &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="mi"&gt;89&lt;/span&gt;             &lt;span class="mi"&gt;66&lt;/span&gt;             &lt;span class="mi"&gt;23&lt;/span&gt;       &lt;span class="mi"&gt;94&lt;/span&gt;  &lt;span class="mf"&gt;28.1&lt;/span&gt;   
&lt;span class="mi"&gt;4&lt;/span&gt;            &lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="mi"&gt;137&lt;/span&gt;             &lt;span class="mi"&gt;40&lt;/span&gt;             &lt;span class="mi"&gt;35&lt;/span&gt;      &lt;span class="mi"&gt;168&lt;/span&gt;  &lt;span class="mf"&gt;43.1&lt;/span&gt;   

   &lt;span class="n"&gt;DiabetesPedigreeFunction&lt;/span&gt;  &lt;span class="n"&gt;Age&lt;/span&gt;  &lt;span class="n"&gt;Outcome&lt;/span&gt;  
&lt;span class="mi"&gt;0&lt;/span&gt;                     &lt;span class="mf"&gt;0.627&lt;/span&gt;   &lt;span class="mi"&gt;50&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;span class="mi"&gt;1&lt;/span&gt;                     &lt;span class="mf"&gt;0.351&lt;/span&gt;   &lt;span class="mi"&gt;31&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  
&lt;span class="mi"&gt;2&lt;/span&gt;                     &lt;span class="mf"&gt;0.672&lt;/span&gt;   &lt;span class="mi"&gt;32&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;span class="mi"&gt;3&lt;/span&gt;                     &lt;span class="mf"&gt;0.167&lt;/span&gt;   &lt;span class="mi"&gt;21&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;  
&lt;span class="mi"&gt;4&lt;/span&gt;                     &lt;span class="mf"&gt;2.288&lt;/span&gt;   &lt;span class="mi"&gt;33&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I create a Keras Normalization layer and adapt it to the Pima features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;

&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_pima&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then run PCA on the normalized Pima Dataset.  This collapses the eight (8) dimensions to three (3).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;

&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
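&lt;p&gt;To sanity-check how much information the three components retain, a fitted scikit-learn PCA object exposes &lt;strong&gt;explained_variance_ratio_&lt;/strong&gt;.  A sketch with a synthetic stand-in matrix (the array &lt;strong&gt;X&lt;/strong&gt; below substitutes for the normalized Pima features):&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 768x8 normalized Pima feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))

pca = PCA(n_components=3)
pca.fit(X)

# Fraction of total variance each principal component captures;
# their sum shows how much information the 3D projection keeps.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```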

&lt;p&gt;This gives us the 3D Pima training dataset for RCE modeling.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

   &lt;span class="n"&gt;princomp1&lt;/span&gt;  &lt;span class="n"&gt;princomp2&lt;/span&gt;  &lt;span class="n"&gt;princomp3&lt;/span&gt;  &lt;span class="n"&gt;outcome&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="mf"&gt;1.068502&lt;/span&gt;   &lt;span class="mf"&gt;1.234895&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.095930&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.121683&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.733852&lt;/span&gt;   &lt;span class="mf"&gt;0.712938&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.396477&lt;/span&gt;   &lt;span class="mf"&gt;1.595876&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.760679&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.115781&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.271241&lt;/span&gt;   &lt;span class="mf"&gt;0.663729&lt;/span&gt;        &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="mf"&gt;2.359335&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.184819&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;2.963107&lt;/span&gt;        &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;2.  Calculate the Radii for All Hit Footprints&lt;/h3&gt;
&lt;p&gt;In RCE, &lt;strong&gt;lambdas&lt;/strong&gt; (overloaded term) record the distance from an observation to the closest observation in a &lt;strong&gt;different class&lt;/strong&gt;.  RCE uses the &lt;strong&gt;lambdas&lt;/strong&gt; to tally hits.  We must calculate &lt;strong&gt;lambda&lt;/strong&gt; for every observation in the training Dataset.  I go into the code in my &lt;a href="https://john.soban.ski/polars.html"&gt;prior blog post&lt;/a&gt;.  I also share the code at the bottom of this blog post.  I wrote the code to accommodate an arbitrary number of dimensions.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Overload Warning:  RCE uses &lt;strong&gt;lambda&lt;/strong&gt; to indicate hit footprint &lt;strong&gt;radii&lt;/strong&gt;.  Python (and therefore Pandas) uses &lt;strong&gt;lambda&lt;/strong&gt; to indicate inline/anonymous functions.  Do not confuse the two.&lt;/p&gt;
&lt;/blockquote&gt;
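&lt;p&gt;The &lt;strong&gt;lambda&lt;/strong&gt; calculation reduces to a nearest-neighbor search restricted to the opposite class.  A NumPy sketch on a hypothetical four-point toy set (not the Pima data) shows the idea:&lt;/p&gt;

```python
import numpy as np

# Hypothetical toy data: 3D points with binary class labels.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [3.0, 3.0, 3.0]])
labels = np.array([0, 1, 0, 1])

# Pairwise squared Euclidean distances between all observations.
diff = points[:, None, :] - points[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# Mask out same-class pairs, then take each row's minimum distance
# to the opposite class: that distance is the observation's lambda.
other_class = labels[:, None] != labels[None, :]
sq_dist_other = np.where(other_class, sq_dist, np.inf)
lambdas = np.sqrt(sq_dist_other.min(axis=1))
print(lambdas)
```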
&lt;p&gt;The following &lt;strong&gt;head()&lt;/strong&gt; shows the &lt;strong&gt;observations&lt;/strong&gt; with their &lt;strong&gt;target&lt;/strong&gt; and &lt;strong&gt;lambda&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;shape:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
┌───────────┬───────────┬───────────┬─────────┬──────────┐
│&lt;span class="w"&gt; &lt;/span&gt;princomp1&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;princomp2&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;princomp3&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;outcome&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;lambda&lt;span class="w"&gt;   &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;     &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;---&lt;span class="w"&gt;      &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;i64&lt;span class="w"&gt;     &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;f64&lt;span class="w"&gt;      &lt;/span&gt;│
╞═══════════╪═══════════╪═══════════╪═════════╪══════════╡
│&lt;span class="w"&gt; &lt;/span&gt;-3.183769&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-1.553631&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.571653&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.399091&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-1.038747&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.269651&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.878596&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.180943&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-0.847973&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.281973&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-0.178098&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.209617&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-0.546354&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.100635&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-0.542195&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.371084&lt;span class="w"&gt; &lt;/span&gt;│
│&lt;span class="w"&gt; &lt;/span&gt;-0.991178&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-1.153642&lt;span class="w"&gt; &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;-0.04556&lt;span class="w"&gt;  &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;┆&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.587108&lt;span class="w"&gt; &lt;/span&gt;│
└───────────┴───────────┴───────────┴─────────┴──────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;3.  Generate a 300x300x300 Data Viz Grid&lt;/h3&gt;
&lt;p&gt;The following code creates a Polars DataFrame with 27M Data Viz points.  We will use this DataFrame to draw a 3D map of the Pima Dataset, classified by &lt;strong&gt;Outcome.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;princomp1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;princomp2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;&amp;quot;princomp3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;┌───────────┬───────────┬───────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp3&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═══════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.993311&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.986622&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.979933&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.973244&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;         &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;         &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="err"&gt;…&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.973244&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.979933&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.986622&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.993311&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴───────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;4.  Classify All 27M Data Points&lt;/h3&gt;
&lt;p&gt;I already discussed the algorithm and implementation of RCE classification in my &lt;a href="https://john.soban.ski/polars.html"&gt;prior blog post&lt;/a&gt;.  Please click through to read.  The algorithm uses &lt;strong&gt;lambda&lt;/strong&gt; to count hits within a given &lt;strong&gt;hit footprint&lt;/strong&gt; and, based on the &lt;strong&gt;hits&lt;/strong&gt;, assigns a class.&lt;/p&gt;
&lt;p&gt;I wrote the code to accommodate an arbitrary number of dimensions, so I can run it now without edits.&lt;/p&gt;
&lt;p&gt;When I attempt to run it in my Colab notebook, however, I get the following error:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;ComputeError:  cross joins would produce more rows than fits into 2^32; consider compiling with polars-big-idx feature, or set to 'streaming.'&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="A screen grab of a Python error that indicates a potential OOM kill" src="https://john.soban.ski/images/Polars2/02_Compute_Error.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I want to use a &lt;strong&gt;cross join&lt;/strong&gt; to leverage Polars/Arrow parallel execution features.  I could try to cobble together a different approach (manual chunking), but I don't have confidence in my ability to improve on the work done by the Polars developers.  My hacked code would only bastardize the logic and execution of a &lt;strong&gt;cross join&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I instead use the Polars &lt;strong&gt;Streaming&lt;/strong&gt; feature.&lt;/p&gt;
&lt;p&gt;A quick aside: if I use the RCE &lt;strong&gt;epsilon&lt;/strong&gt; hyperparameter to set an upper limit on &lt;strong&gt;hit footprint&lt;/strong&gt; size, the model shrinks dramatically.  I will show that in a future blog post.  For now, I want to focus on illustrating the &lt;strong&gt;streaming&lt;/strong&gt; feature of Polars.&lt;/p&gt;
&lt;h2&gt;Classification&lt;/h2&gt;
&lt;p&gt;I first set the Polars DataFrames to &lt;strong&gt;lazy&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lf_class&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;lf_lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then load the RCE classification logic.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Rename lambda-side cols&lt;/span&gt;
&lt;span class="n"&gt;lf_lambda_renamed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_lambda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;outcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;loutcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;llambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Cross join (cartesian product)&lt;/span&gt;
&lt;span class="n"&gt;lf_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lf_lambda_renamed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Distance calculation&lt;/span&gt;
&lt;span class="n"&gt;distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lf_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_cross&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Flag hit&lt;/span&gt;
&lt;span class="n"&gt;lf_cross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_cross&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;is_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;llambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Group by the grid point&lt;/span&gt;
&lt;span class="n"&gt;grid_group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;princomp&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="n"&gt;lf_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lf_cross&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_group_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;is_hit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;loutcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;is_hit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;loutcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hits_outcome_0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# unclassified&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The pre-loaded logic processes the &lt;strong&gt;lazy&lt;/strong&gt; input DataFrames into the &lt;strong&gt;lazy&lt;/strong&gt; output DataFrame, which I name &lt;strong&gt;lf_hits&lt;/strong&gt; (Lazy Frame hits).  I then execute the logic on &lt;strong&gt;lf_hits&lt;/strong&gt; with the &lt;strong&gt;streaming&lt;/strong&gt; flag set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf_hits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After execution, we can take a peek at the DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬───────────┬────────────────┬────────────────┬───────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp3&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_0&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i32&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═══════════╪════════════════╪════════════════╪═══════════════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.55102&lt;/span&gt;   &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.102041&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.265306&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.591837&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.142857&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.714286&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.795918&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.020408&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.469388&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.755102&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.020408&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.632653&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.510204&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;      &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;null&lt;/span&gt;              &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴───────────┴────────────────┴────────────────┴───────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can now plot the Classified Pima DataSet in 3D.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;px&lt;/span&gt;

&lt;span class="n"&gt;filtered_hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;predicted_outcome&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_not_null&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filtered_hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;predicted_outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted Outcome based on Principal Components (Excluding Nulls)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A 3D plotly plot that shows **Outcome 0** in yellow, and **Outcome 1** in blue" src="https://john.soban.ski/images/Polars2/03_Three_Dee.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The plot shows &lt;strong&gt;Outcome 0&lt;/strong&gt; in yellow, and &lt;strong&gt;Outcome 1&lt;/strong&gt; in blue.&lt;/p&gt;
&lt;p&gt;I tune the &lt;strong&gt;epsilon&lt;/strong&gt; hyperparameter to put an upper limit on the &lt;strong&gt;hit footprints&lt;/strong&gt;.  This cleans up the plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A 3D plotly plot that shows **Outcome 0** in yellow, and **Outcome 1** in blue" src="https://john.soban.ski/images/Polars2/04_Three_Deetwo.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Polars uses Apache Arrow to drive optimal utilization across every available core.  For operations that clog local memory, like a cross join that yields a DataFrame with more than 2^32 rows, Polars provides the &lt;strong&gt;Streaming&lt;/strong&gt; construct.  Streaming sequentially loads chunks of data into memory, which spreads parallel execution across time.  &lt;strong&gt;Streaming&lt;/strong&gt; allows Big Data computations on Colab notebooks and laptops.  The Data Engineer does not need to concern herself with the housekeeping associated with most &lt;strong&gt;Big Data&lt;/strong&gt; infrastructures.  Remember the headaches that Hadoop and Spark caused?  You do not need to deal with them; you just work with DataFrames and Polars figures out the rest.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of Polar Bears and Panda Bears Squaring off in Space with fancy Space Suits" src="https://john.soban.ski/images/Polars2/05_Polars_Space.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Lambda Calculation Code&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;join_selection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;squared_distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;squared_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;squared_distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lambda_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬───────────┬─────────┬──────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp3&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═══════════╪═════════╪══════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.466587&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.42411&lt;/span&gt;   &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.133452&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.114694&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.158603&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.811361&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.563424&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.430963&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.395739&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.375969&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.234109&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.086674&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.276773&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.741043&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.127401&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.546374&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.91052&lt;/span&gt;   &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.058389&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.605156&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.262343&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴───────────┴─────────┴──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Python"></category><category term="Polars"></category><category term="Pandas"></category><category term="Machine Learning"></category><category term="Neural Networks"></category></entry><entry><title>Polars Boosted My Algorithm's Speed by 25x</title><link href="https://john.soban.ski/polars.html" rel="alternate"></link><published>2025-06-28T16:47:00-04:00</published><updated>2025-06-28T16:47:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2025-06-28:/polars.html</id><summary type="html">&lt;p&gt;The open-source &lt;strong&gt;Polars&lt;/strong&gt; library touts huge performance gains over &lt;strong&gt;Pandas&lt;/strong&gt;.  A combination of parallel processing, Apache Arrow, and a "Close to the Metal" Architecture drives &lt;strong&gt;Polars'&lt;/strong&gt; speed.  I refactored my Pandas-based Reduced Coulomb Energy (RCE) algorithm to &lt;strong&gt;Polars&lt;/strong&gt; and will share my journey and observations with you.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear Gang and Panda Gang Squaring off in NYC" src="https://john.soban.ski/images/Polars/00_Polars_Pandas.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Reduced Coulomb …&lt;/h2&gt;</summary><content type="html">&lt;p&gt;The open-source &lt;strong&gt;Polars&lt;/strong&gt; library touts huge performance gains over &lt;strong&gt;Pandas&lt;/strong&gt;.  A combination of parallel processing, Apache Arrow, and a "Close to the Metal" Architecture drives &lt;strong&gt;Polars'&lt;/strong&gt; speed.  I refactored my Pandas-based Reduced Coulomb Energy (RCE) algorithm to &lt;strong&gt;Polars&lt;/strong&gt; and will share my journey and observations with you.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a Polar Bear Gang and Panda Gang Squaring off in NYC" src="https://john.soban.ski/images/Polars/00_Polars_Pandas.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Reduced Coulomb Energy (RCE) Algorithm&lt;/h2&gt;
&lt;p&gt;The obscure RCE &lt;strong&gt;exemplar classifier&lt;/strong&gt; offers a niche alternative to the well-known K-Nearest approach.  I wrote about the algorithm in-depth in my &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;New Exemplar Machine Learning Algorithm&lt;/a&gt; blog post.&lt;/p&gt;
&lt;p&gt;The RCE algorithm labels &lt;strong&gt;inference data&lt;/strong&gt; via &lt;strong&gt;hit footprints&lt;/strong&gt; learned from training data.&lt;/p&gt;
&lt;p&gt;RCE creates spheres around each labeled training observation, with radii equal to the distance to the closest &lt;strong&gt;labeled observation&lt;/strong&gt; in the &lt;strong&gt;opposite class&lt;/strong&gt;.  The collection of all spheres for a given class represents the &lt;strong&gt;hit footprint&lt;/strong&gt; for that class.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A cartoon that shows RCE at work" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;p&gt;RCE uses the term &lt;strong&gt;lambda&lt;/strong&gt; instead of the term &lt;strong&gt;radius&lt;/strong&gt;.  &lt;strong&gt;Lambda&lt;/strong&gt; represents the radii of the spheres that comprise the hit footprints.&lt;/p&gt;
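A minimal NumPy sketch of the lambda calculation, using hypothetical toy points rather than the post's data set:

```python
import numpy as np

# Toy labeled training set: each row is an observation, labels hold the class.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 1])

def lambda_radius(i):
    # Lambda (the sphere radius) equals the distance from observation i
    # to the closest observation in the opposite class.
    opposite = points[labels != labels[i]]
    return np.linalg.norm(opposite - points[i], axis=1).min()

print(lambda_radius(0))  # 3.0 -- the nearest class-1 point sits 3.0 away
print(lambda_radius(1))  # 2.0
```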
&lt;p&gt;Look at the following diagram to understand the difference between &lt;strong&gt;RCE&lt;/strong&gt; and &lt;strong&gt;K-Nearest&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="An example of RCE hit footprints" src="https://john.soban.ski/images/Rce_Python_Part_1/02_Classify_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;In the diagram, we have two classes &lt;strong&gt;X&lt;/strong&gt; and &lt;strong&gt;O&lt;/strong&gt;.  The green &lt;strong&gt;?&lt;/strong&gt; represents an observation in our inference data set.  A training observation in class &lt;strong&gt;X&lt;/strong&gt; sits closest to the green question mark.  &lt;/p&gt;
&lt;p&gt;K-Nearest would classify the green &lt;strong&gt;?&lt;/strong&gt; into class &lt;strong&gt;X&lt;/strong&gt; since it sits closest to an observation in class &lt;strong&gt;X&lt;/strong&gt;.  &lt;strong&gt;RCE&lt;/strong&gt;, however, classifies the green &lt;strong&gt;?&lt;/strong&gt; into class &lt;strong&gt;O&lt;/strong&gt;, because the unknown observation sits in the &lt;strong&gt;hit footprint&lt;/strong&gt; of class &lt;strong&gt;O&lt;/strong&gt;.  The &lt;strong&gt;hit footprint&lt;/strong&gt; approach allows &lt;strong&gt;RCE&lt;/strong&gt; to perform well on limited data sets.&lt;/p&gt;
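The classification step reduces to a membership test: does the unknown point fall inside any sphere of a class's hit footprint?  A toy sketch, with made-up sphere centers and lambdas:

```python
import numpy as np

# Hypothetical hit footprint for one class: sphere centers and their lambdas.
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
lambdas = np.array([1.5, 1.0])

def hits(point):
    # Count the class spheres that contain the point (distance below lambda).
    dists = np.linalg.norm(centers - point, axis=1)
    return int(np.less(dists, lambdas).sum())

print(hits(np.array([0.5, 0.0])))  # 1 -- inside the first sphere
print(hits(np.array([5.0, 0.0])))  # 0 -- outside the footprint
```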
&lt;h2&gt;Polars vs. Pandas&lt;/h2&gt;
&lt;p&gt;I need to overload terms in this discussion.  The &lt;strong&gt;RCE&lt;/strong&gt; algorithm uses the term &lt;strong&gt;lambda&lt;/strong&gt; to represent &lt;strong&gt;radii&lt;/strong&gt;.  Python (and therefore Pandas) uses the term &lt;strong&gt;lambda&lt;/strong&gt; to represent inline/anonymous functions.  For clarity, I will use the term &lt;strong&gt;lambda&lt;/strong&gt; when discussing the hit footprint radii, and &lt;strong&gt;lambda functions&lt;/strong&gt; when discussing the Pandas anonymous functions.  I hope that context provides further clarity.&lt;/p&gt;
&lt;p&gt;I take a functional approach to my Pandas development.  I avoid imperative control flow (for, while, if/then) and instead use the &lt;strong&gt;apply&lt;/strong&gt; construct.  I stuff data into a data frame, create a lambda function, and use the Pandas Data Frame &lt;strong&gt;apply&lt;/strong&gt; method to process the data in a selection of columns via the lambda function.  For you &lt;strong&gt;Map/Reduce&lt;/strong&gt; fans, the &lt;strong&gt;apply&lt;/strong&gt; method covers the &lt;strong&gt;map&lt;/strong&gt; side of the equation.&lt;/p&gt;
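As a toy illustration of this style (hypothetical columns, not the RCE data), apply maps a lambda function over every row:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Row-wise lambda function via apply -- the "map" side of Map/Reduce.
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)
print(row_sums.tolist())  # [5, 7, 9]
```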
&lt;p&gt;The &lt;a href="https://docs.pola.rs/user-guide/expressions/aggregation/#do-not-kill-parallelization"&gt;Polars documentation&lt;/a&gt;, however, recommends that you avoid &lt;strong&gt;lambda functions:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Polars will try to parallelize the computation of the aggregating functions over the groups, so it is &lt;strong&gt;recommended that you avoid using lambdas&lt;/strong&gt; and custom Python functions as much as possible. Instead, try to stay within the &lt;strong&gt;realm of the Polars expression API&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given this, I will use the native &lt;strong&gt;expression API&lt;/strong&gt; rather than &lt;strong&gt;lambda functions&lt;/strong&gt; when I refactor the code.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A faceoff between the Polar Bears and Pandas" src="https://john.soban.ski/images/Polars/01_Pandas_Polars.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;The Algorithm&lt;/h2&gt;
&lt;p&gt;I use the following approach to execute &lt;strong&gt;RCE&lt;/strong&gt; in &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Acquire Data and store in a Data Frame &lt;/li&gt;
&lt;li&gt;Split Data into Training and Inference Data Frames &lt;/li&gt;
&lt;li&gt;Apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row (Observation) of the Training Set Data Frame and record the hit footprint for each Observation&lt;/li&gt;
&lt;li&gt;Apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row in the Inference Set Data Frame and classify each Observation&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Acquire Data and store in a Data Frame&lt;/h3&gt;
&lt;p&gt;The following code acquires our data and stores it in a data frame:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pandas_datareader&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;df_pima&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_pima&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Split Data into Training and Inference Data Frames&lt;/h3&gt;
&lt;p&gt;We copy the data frame and execute a split.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_pima&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;
                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;
                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;
                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;
                 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;
               &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To help with plots, I use Principal Component Analysis (PCA) to reduce the dimensions of the training data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you would like to learn more about my justification and approach to dimensionality reduction, read my &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;Pandas for RCE&lt;/a&gt; blog post.&lt;/p&gt;
&lt;h3&gt;Calculate Hit Footprints (Pandas)&lt;/h3&gt;
&lt;p&gt;In Pandas, I apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row (Observation) of the &lt;strong&gt;Training Set&lt;/strong&gt; Data Frame and record the hit footprint for each Observation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RCE draws a circle around each labeled training observation, with a radius (lambda) that stops at the closest labeled training point in the opposite class. Each circle indicates the hit footprint for that class.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I use this code for the &lt;strong&gt;lambda function&lt;/strong&gt;.  Note that the function takes the entire training data set as input.  For each observation, I need to calculate the distance to every observation of a different class in the &lt;strong&gt;Training Data Frame&lt;/strong&gt;, and then select the closest.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;lambda function&lt;/strong&gt; executes the following for each observation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Remove all observations of the same class from the dataset&lt;/li&gt;
&lt;li&gt;Calculate the distance to every other observation in the filtered data set&lt;/li&gt;
&lt;li&gt;Choose the closest observation, and use the distance for lambda &lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;TIP:  Paste the above code into Colab and press the &lt;strong&gt;Explain&lt;/strong&gt; button.  Gemini will explain the code to you!&lt;/p&gt;
&lt;/blockquote&gt;
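To make the row-wise application concrete, here is a self-contained toy sketch that restates &lt;strong&gt;find_lambda&lt;/strong&gt; against a hypothetical one-component training frame (not the Pima data):

```python
import numpy as np
import pandas as pd

def find_lambda(df, v):
    # Lambda = distance to the nearest training observation of the opposite class.
    opposite = df.loc[df['outcome'] != v.iloc[-1]].iloc[:, :-1]
    return np.linalg.norm(opposite.sub(np.array(v.iloc[:-1])), axis=1).min()

train_df = pd.DataFrame({'princomp1': [0.0, 1.0, 3.0], 'outcome': [0, 0, 1]})

# Apply the lambda function to every row (Observation) of the training frame.
train_df['lambda'] = train_df.apply(lambda v: find_lambda(train_df, v), axis=1)
print(train_df['lambda'].tolist())  # [3.0, 2.0, 2.0]
```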
&lt;h3&gt;Classify Inference Data Frame (Pandas)&lt;/h3&gt;
&lt;p&gt;I create another &lt;strong&gt;lambda function&lt;/strong&gt; to find the hits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;find_hits&lt;/strong&gt; uses &lt;strong&gt;find_lambda&lt;/strong&gt;, and I apply &lt;strong&gt;find_hits&lt;/strong&gt; to the &lt;strong&gt;inference&lt;/strong&gt; data frame to classify data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;
                                  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The classification step &lt;strong&gt;applies&lt;/strong&gt; a lambda function to each row.&lt;/p&gt;
&lt;p&gt;The algorithm only assigns a label when one class registers &lt;strong&gt;at least one&lt;/strong&gt; hit and the other class registers &lt;strong&gt;no hits&lt;/strong&gt;.  We can tune the algorithm to use different weighting options if desired.&lt;/p&gt;
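&lt;p&gt;The decision rule reduces to a short Python function (a sketch; the name &lt;strong&gt;label_row&lt;/strong&gt; is hypothetical):&lt;/p&gt;

```python
def label_row(class0_hits, class1_hits):
    """Assign a class only when exactly one class registers hits."""
    if class0_hits > 0 and class1_hits == 0:
        return 0
    if class1_hits > 0 and class0_hits == 0:
        return 1
    return None  # ambiguous: both classes hit, or neither
```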
&lt;h3&gt;Lambda vs. Expression API&lt;/h3&gt;
&lt;p&gt;The following diagram shows the current Pandas approach.  I apply a &lt;strong&gt;lambda function&lt;/strong&gt; to each row (Observation) in the &lt;strong&gt;Training Data Frame&lt;/strong&gt;.  Each call of the &lt;strong&gt;lambda function&lt;/strong&gt; ingests a subset of the &lt;strong&gt;Training Data Frame&lt;/strong&gt; that includes every observation of the opposite class.  The original &lt;strong&gt;lambda function&lt;/strong&gt; includes a nested &lt;strong&gt;lambda function&lt;/strong&gt; that executes &lt;strong&gt;find_lambda&lt;/strong&gt;, which uses the ingested &lt;strong&gt;Training Data Frame&lt;/strong&gt; for each call.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that shows how we iterate, or apply a lambda function to the Training Data Frame" src="https://john.soban.ski/images/Polars/02_Pandas_Apply.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Every row of the &lt;strong&gt;Training Data Frame&lt;/strong&gt; must calculate the distance to &lt;strong&gt;every other row&lt;/strong&gt; in the &lt;strong&gt;Training Data Frame&lt;/strong&gt; (of opposite class).  We can either execute this logic via the application of &lt;strong&gt;lambda functions&lt;/strong&gt; or via a &lt;strong&gt;cross join&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that shows how a cross join allows for column operations, GROUP BY and summary statistic (min, sum) operations" src="https://john.soban.ski/images/Polars/03_Polars_Cross.jpg"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;cross join&lt;/strong&gt; creates a &lt;strong&gt;Data Frame&lt;/strong&gt; that includes every observation combined with every other observation (of opposite class).  In this form, we can use &lt;strong&gt;Data Frame&lt;/strong&gt; level operations to calculate distances.  This allows &lt;strong&gt;Column Based&lt;/strong&gt; processing vs. &lt;strong&gt;Row Based&lt;/strong&gt; processing.&lt;/p&gt;
&lt;p&gt;Once we have the distances in the &lt;strong&gt;cross join&lt;/strong&gt; Data Frame, we can use a &lt;strong&gt;GROUP BY&lt;/strong&gt; to find the &lt;strong&gt;min&lt;/strong&gt;, and therefore &lt;strong&gt;lambda (hit footprint radius)&lt;/strong&gt;, for each observation in the &lt;strong&gt;Training Data Frame&lt;/strong&gt;.&lt;/p&gt;
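&lt;p&gt;A minimal Pandas sketch of this cross join and &lt;strong&gt;GROUP BY&lt;/strong&gt; pattern on toy data (one principal component, hypothetical values):&lt;/p&gt;

```python
import pandas as pd

# Toy training frame: two observations per class
df = pd.DataFrame({
    'princomp1': [0.0, 1.0, 3.0, 4.0],
    'outcome':   [0, 0, 1, 1],
})

# Cross join, then keep only opposite-class pairs
pairs = df.merge(df, how='cross', suffixes=('', '_other'))
pairs = pairs[pairs['outcome'] != pairs['outcome_other']]

# Column-based squared distance, then GROUP BY min
pairs['sq_dist'] = (pairs['princomp1'] - pairs['princomp1_other']) ** 2
min_sq = pairs.groupby(['princomp1', 'outcome'])['sq_dist'].min()
```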
&lt;h3&gt;Calculate Hit Footprints (Polars)&lt;/h3&gt;
&lt;p&gt;In Polars, we use the &lt;strong&gt;expression API&lt;/strong&gt; instead of &lt;strong&gt;lambda functions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Get the list of principal component columns&lt;/li&gt;
&lt;li&gt;Create a selection for the cross join, and rename the &lt;strong&gt;Principal Component&lt;/strong&gt; (princomp) columns in the &lt;strong&gt;right side&lt;/strong&gt; Data Frame&lt;/li&gt;
&lt;li&gt;Cross-join the Data Frame with a subset of itself that includes opposite outcomes&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We include code (the &lt;strong&gt;startswith&lt;/strong&gt; check) to accommodate an arbitrary number of principal components.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;join_selection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;princomp_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields our &lt;strong&gt;cross join&lt;/strong&gt; Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(df_with_other_outcome.head())

┌───────────┬───────────┬─────────┬─────────────────┬─────────────────┬───────────────┐
│ princomp1 ┆ princomp2 ┆ outcome ┆ princomp1_other ┆ princomp2_other ┆ outcome_other │
│ ---       ┆ ---       ┆ ---     ┆ ---             ┆ ---             ┆ ---           │
│ f64       ┆ f64       ┆ i64     ┆ f64             ┆ f64             ┆ i64           │
╞═══════════╪═══════════╪═════════╪═════════════════╪═════════════════╪═══════════════╡
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -0.046671       ┆ -1.161939       ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.631433       ┆ 0.001913        ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.535048       ┆ -0.184561       ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.474017       ┆ -0.842351       ┆ 0             │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ 1.105778        ┆ -1.756428       ┆ 0             │
└───────────┴───────────┴─────────┴─────────────────┴─────────────────┴───────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To find the distances to every other point of the opposite class, we execute:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;squared_distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;squared_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;squared_distance_expr&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
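&lt;p&gt;Note that the builtin &lt;strong&gt;sum&lt;/strong&gt; here folds a generator of Polars expressions with &lt;strong&gt;+&lt;/strong&gt; into one combined expression.  A stand-in class (hypothetical) shows the mechanism:&lt;/p&gt;

```python
# sum() folds any objects that implement __add__, not just numbers
class FakeExpr:
    def __init__(self, text):
        self.text = text
    def __add__(self, other):
        return FakeExpr(f'({self.text} + {other.text})')
    def __radd__(self, other):
        # sum() starts from the integer seed 0, so absorb it
        return self if other == 0 else NotImplemented

princomp_cols = ['princomp1', 'princomp2']
combined = sum(FakeExpr(f'{col}_sq') for col in princomp_cols)
```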

&lt;p&gt;Notice the &lt;strong&gt;column based&lt;/strong&gt; operations, which use &lt;strong&gt;-&lt;/strong&gt;, &lt;strong&gt;+&lt;/strong&gt; and &lt;strong&gt;**&lt;/strong&gt; (square) on the columns &lt;strong&gt;princomp1&lt;/strong&gt; and &lt;strong&gt;princomp2&lt;/strong&gt;.  The operations yield the squared distance for the &lt;strong&gt;cross-joined&lt;/strong&gt; Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print(df_with_other_outcome.head())

┌───────────┬───────────┬─────────┬────────────────┬───────────────┬───────────────┬───────────────┐
│ princomp1 ┆ princomp2 ┆ outcome ┆ princomp1_othe ┆ princomp2_oth ┆ outcome_other ┆ squared_dista │
│ ---       ┆ ---       ┆ ---     ┆ r              ┆ er            ┆ ---           ┆ nce           │
│ f64       ┆ f64       ┆ i64     ┆ ---            ┆ ---           ┆ i64           ┆ ---           │
│           ┆           ┆         ┆ f64            ┆ f64           ┆               ┆ f64           │
╞═══════════╪═══════════╪═════════╪════════════════╪═══════════════╪═══════════════╪═══════════════╡
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -0.046671      ┆ -1.161939     ┆ 0             ┆ 6.272523      │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.631433      ┆ 0.001913      ┆ 0             ┆ 18.911708     │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.535048      ┆ -0.184561     ┆ 0             ┆ 17.589336     │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ -1.474017      ┆ -0.842351     ┆ 0             ┆ 15.726017     │
│ 2.425151  ┆ -1.565197 ┆ 1       ┆ 1.105778       ┆ -1.756428     ┆ 0             ┆ 1.777314      │
└───────────┴───────────┴─────────┴────────────────┴───────────────┴───────────────┴───────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
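&lt;p&gt;As a spot check of the first row above, the squared distance equals the sum of squared per-component differences:&lt;/p&gt;

```python
# Values copied from the first row of the output above
d1 = 2.425151 - (-0.046671)   # princomp1 minus princomp1_other
d2 = -1.565197 - (-1.161939)  # princomp2 minus princomp2_other
sq_dist = d1 ** 2 + d2 ** 2   # approximately 6.272523
```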

&lt;p&gt;A &lt;strong&gt;GROUP BY&lt;/strong&gt; followed by the &lt;strong&gt;min&lt;/strong&gt; operation returns the &lt;strong&gt;minimum squared distance&lt;/strong&gt; for each observation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_with_other_outcome&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; 
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;square root&lt;/strong&gt; operation gives us the minimum Euclidean distance, or &lt;strong&gt;lambda&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lambda_df_polars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;min_squared_dist_polars&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;min_squared_distance&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;lambda&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We now have a &lt;strong&gt;Training Data Frame&lt;/strong&gt; that records the &lt;strong&gt;lambda&lt;/strong&gt; (hit footprint radius) for every Observation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;group_cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬─────────┬──────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i64&lt;/span&gt;     &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪═════════╪══════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.242788&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.433054&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.030056&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.035114&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.063441&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.087633&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.237591&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.848636&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.130412&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.477868&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;1.650595&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.123065&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;1.136271&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.039659&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.140668&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴─────────┴──────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Classify Inference Data Frame (Polars)&lt;/h3&gt;
&lt;p&gt;We refactor our algorithm to use the Polars &lt;strong&gt;expression API&lt;/strong&gt; instead of &lt;strong&gt;lambda functions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;First, we create a 2D grid of data.  The grid provides the &lt;strong&gt;Inference Data Frame&lt;/strong&gt;.  This code assumes we have only two Principal Components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="n"&gt;grid_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; 
&lt;span class="n"&gt;grid_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; 

&lt;span class="n"&gt;class_df&lt;/span&gt; \
    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;princomp1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;grid_start&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
                &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;grid_range&lt;/span&gt;
                    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;grid_size&lt;/span&gt;
                        &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;princomp2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;grid_start&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
                &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;grid_range&lt;/span&gt;
                    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;grid_size&lt;/span&gt;
                        &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Polars classification approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cross joins the &lt;strong&gt;Inference Data Frame&lt;/strong&gt; with the &lt;strong&gt;Training Data Frame&lt;/strong&gt; to drive column-based, group-by, and summary-statistic (sum) operations&lt;/li&gt;
&lt;li&gt;Calculates the Euclidean distance between each Observation in the &lt;strong&gt;Inference Data Frame&lt;/strong&gt; and each point in the &lt;strong&gt;Training Data Frame&lt;/strong&gt; &lt;/li&gt;
&lt;li&gt;Flags a &lt;strong&gt;Hit&lt;/strong&gt; whenever the Euclidean distance falls below the lambda (footprint radius) distance&lt;/li&gt;
&lt;li&gt;Aggregates the hits.  For now, we only classify an observation when its hits fall in &lt;strong&gt;only one&lt;/strong&gt; class (we can tune this)&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cross join &lt;/span&gt;
&lt;span class="n"&gt;lambda_df_renamed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lambda_df_polars&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;loutcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;llambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lambda_df_renamed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;cross&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find Euclidean Distances&lt;/span&gt;
&lt;span class="n"&gt;distance_expr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;l&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;princomp_cols&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distance_expr&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ID Hits&lt;/span&gt;
&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;is_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;distance&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;llambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# GROUP BY and SUM hits&lt;/span&gt;
&lt;span class="n"&gt;grid_group_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_lambda_cross_join&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_group_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;is_hit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loutcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;is_hit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loutcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Decide on class&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits_outcome_0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields a classified &lt;strong&gt;Inference Data Frame&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="err"&gt;┌───────────┬───────────┬────────────────┬────────────────┬───────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;princomp1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;princomp2&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_0&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;hits_outcome_1&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;predicted_outcome&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;---&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;f64&lt;/span&gt;       &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;            &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="n"&gt;i32&lt;/span&gt;               &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;╞═══════════╪═══════════╪════════════════╪════════════════╪═══════════════════╡&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.913043&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.498328&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mf"&gt;0.591973&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.498328&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.277592&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.110368&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.531773&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mf"&gt;0.571906&lt;/span&gt;  &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.478261&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.973244&lt;/span&gt; &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;             &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;              &lt;span class="err"&gt;┆&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;                 &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└───────────┴───────────┴────────────────┴────────────────┴───────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We plot a &lt;strong&gt;heat map&lt;/strong&gt; of the Classified Data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Polars to Pandas for Matplotlib&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pivot and sort data for Heatmap&lt;/span&gt;
&lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reshape to 300x300 grid&lt;/span&gt;
&lt;span class="n"&gt;grid_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; &lt;span class="c1"&gt;# Update grid_size to match the actual grid size&lt;/span&gt;
&lt;span class="n"&gt;predicted_outcome_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s1"&gt;&amp;#39;predicted_outcome&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set Axis&lt;/span&gt;
&lt;span class="n"&gt;princomp1_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;princomp2_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hits_by_grid_point_pd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Heatmap&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;predicted_outcome_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;origin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lower&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;princomp1_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;princomp1_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;princomp2_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;princomp2_values&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;auto&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;viridis&amp;#39;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Classification Boundary&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A heat map of Outcome 1 vs. Outcome 2 for the Pima Dataset" src="https://john.soban.ski/images/Polars/04_Rce_Map.jpg"&gt;&lt;/p&gt;
&lt;h2&gt;Execution Time&lt;/h2&gt;
&lt;p&gt;I used the same Python environment to run my algorithm on the same data set in both Pandas and Polars.  I saw a significant reduction in wall clock time to complete the processing.&lt;/p&gt;
&lt;p&gt;The Pandas (lambda) approach took roughly six (6) minutes to complete.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screengrab that shows the execution time of the Pandas approach" src="https://john.soban.ski/images/Polars/05_Pandas_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The Polars approach took only fourteen (14) seconds, a reduction of 96%.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Screengrab that shows the execution time for the Polars approach" src="https://john.soban.ski/images/Polars/06_Polars_Time.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This reduction of 96% equates to a performance gain of 25x.&lt;/p&gt;
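&lt;p&gt;As a quick sanity check, the arithmetic behind that claim (both times come from the measurements above):&lt;/p&gt;

```python
# Wall-clock times reported above
pandas_seconds = 6 * 60  # roughly six minutes
polars_seconds = 14      # roughly fourteen seconds

speedup = pandas_seconds / polars_seconds        # about 25.7x
reduction = 1 - polars_seconds / pandas_seconds  # about 0.96, i.e. 96%
print(f"{speedup:.1f}x faster, {reduction:.0%} reduction")
```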
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The cross-join approach, combined with the parallel architecture of Polars, yielded a 25x boost over Pandas.  The cross-join, however, requires the compute to hold n&lt;sup&gt;2&lt;/sup&gt; rows and 2m columns in memory, given n rows in the training set with m feature columns.  Next month I will look at a way to mitigate the scenario where the cross-join Data Frame exceeds available memory.&lt;/p&gt;
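&lt;p&gt;A back-of-the-envelope sketch of that memory cost (the row count and feature count below are illustrative assumptions, not measurements):&lt;/p&gt;

```python
# Memory estimate for the cross-joined Data Frame described above
n = 768            # assumption: rows in the training set
m = 2              # assumption: principal-component feature columns
bytes_per_f64 = 8  # each float64 cell costs eight bytes

rows = n ** 2      # cross join yields n squared rows
cols = 2 * m       # feature columns arrive from both sides of the join
est_mb = rows * cols * bytes_per_f64 / 1e6
print(f"~{est_mb:.0f} MB before aggregation")  # grows quadratically with n
```

Double the training rows and the estimate quadruples, which explains why the cross-join eventually exceeds available memory.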
&lt;p&gt;&lt;img alt="Another picture of Polar Bears vs. Pandas" src="https://john.soban.ski/images/Polars/07_Square_Off.jpg"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Python"></category><category term="Polars"></category><category term="Pandas"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="Neural Networks"></category></entry><entry><title>Quick and Easy Capacity Planning with Pandas</title><link href="https://john.soban.ski/capacity.html" rel="alternate"></link><published>2024-09-28T09:41:00-04:00</published><updated>2024-09-28T09:41:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2024-09-28:/capacity.html</id><summary type="html">&lt;p&gt;The discipline of Operations Research (OR) applies analytical methods from math, statistics, economics, and computer science to help leaders make good decisions.  &lt;/p&gt;
&lt;p&gt;Enterprise software consumes physical resources (CPU, Memory, Disk, and Bandwidth) to provide mission-essential services.  Software, Cloud, and Data Center &lt;strong&gt;Architects&lt;/strong&gt; must identify the expected resource consumption to optimize …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The discipline of Operations Research (OR) applies analytical methods from math, statistics, economics, and computer science to help leaders make good decisions.  &lt;/p&gt;
&lt;p&gt;Enterprise software consumes physical resources (CPU, Memory, Disk, and Bandwidth) to provide mission-essential services.  Software, Cloud, and Data Center &lt;strong&gt;Architects&lt;/strong&gt; must identify the expected resource consumption to optimize resource spend.  Operations Research &lt;strong&gt;Engineers&lt;/strong&gt; develop &lt;strong&gt;Capacity Planning&lt;/strong&gt; models to drive decisions around CAPEX and OPEX purchases.&lt;/p&gt;
&lt;p&gt;Today you will learn how to develop a Python Pandas &lt;strong&gt;Capacity Planning&lt;/strong&gt; model that &lt;strong&gt;right sizes&lt;/strong&gt; the resources needed for a simple &lt;strong&gt;Web Application&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Approach&lt;/h2&gt;
&lt;p&gt;We use concepts from Fermi Estimation and the Jackson Network Theorem (Product-Form Solution) to drive our model.  Both allow us to remove unnecessary details (and rabbit holes) from our model.  Our model will nonetheless yield a reasonable estimate.  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;BONUS:  Our approach yields artifacts (flow diagrams) that demonstrate rigorous, considered thought and discipline.  You can expect them to satisfy (most) audits or engineering reviews.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The stages of our model development workflow include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Record the Nominal Architecture (Node Diagram)&lt;/li&gt;
&lt;li&gt;ID the Data Flows through the Architecture (Use Cases)&lt;/li&gt;
&lt;li&gt;Estimate the Maximum Throughput per flow (Gb/s)&lt;/li&gt;
&lt;li&gt;Sum the Max Throughput for Each Node&lt;/li&gt;
&lt;li&gt;Map Max Gb/s to the required number of CPUs per Node&lt;/li&gt;
&lt;li&gt;Price the Sum Total for the required CPUs ($)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;1. Record the Nominal Architecture (Node Diagram)&lt;/h2&gt;
&lt;p&gt;For illustration, we will size a simple web application.  The set of &lt;strong&gt;Application Nodes&lt;/strong&gt; includes an App Server, File Store, Firewall, and a Core Switch that connects the nodes.  Users and Developers both use the system.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Node Diagram that shows a Firewall, File Store, App Server connected via a Core Switch" src="https://john.soban.ski/images/Capacity/01_Node_Diagram.jpg"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  Once you understand our approach, feel free to tailor the names and roles of the &lt;strong&gt;Nodes&lt;/strong&gt; for your models.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;2.  ID the Data Flows through the Architecture (Use Cases)&lt;/h2&gt;
&lt;p&gt;Our Application hosts a &lt;strong&gt;Photo Album&lt;/strong&gt; service.  &lt;strong&gt;Users&lt;/strong&gt; upload and retrieve photos via the &lt;strong&gt;Web Application&lt;/strong&gt;.  &lt;strong&gt;Developers&lt;/strong&gt; look at logs from the server to optimize the user experience.&lt;/p&gt;
&lt;p&gt;The following graphic captures the three main flows.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Flow Diagram that shows Data Flows through our system" src="https://john.soban.ski/images/Capacity/02_Flow_Diagram.jpg"&gt;&lt;/p&gt;
&lt;p&gt;The Three Flows include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Photo upload&lt;/li&gt;
&lt;li&gt;Photo Retrieval&lt;/li&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;3.  Estimate the Maximum Throughput per flow (Gb/s)&lt;/h2&gt;
&lt;p&gt;How do we estimate the expected throughput of our system?  File Size, Upload Time, and (Max) Number of Concurrent Users drive the throughput.  &lt;/p&gt;
&lt;p&gt;The following model produces the desired metric:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Upload Throughput in Gb/s" src="https://john.soban.ski/images/Capacity/03_Max_Upload.jpg"&gt;&lt;/p&gt;
&lt;p&gt;For example, assume that each photo averages four &lt;strong&gt;(4) MegaBytes (MB)&lt;/strong&gt; in size.  My Samsung Galaxy phone produces photos of this size by default.  Then, assume we have (at max) &lt;strong&gt;thirty-two (32) Concurrent Users&lt;/strong&gt; and each photo takes &lt;strong&gt;one quarter (1/4) of a second&lt;/strong&gt; to upload.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Upload Throughput in Gb/s, with numbers" src="https://john.soban.ski/images/Capacity/04_Upload_Calc.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This formula estimates a max throughput of &lt;strong&gt;4.096 Gigabits per second (Gb/s)&lt;/strong&gt; for our file upload use case.&lt;/p&gt;
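&lt;p&gt;A few lines of Python reproduce the arithmetic.  Note the decimal units (one MB here equals eight million bits); the constants come straight from the example above:&lt;/p&gt;

```python
# Max upload throughput = (bits per photo * concurrent users) / upload time
FILE_SIZE_MB = 4        # average photo size, MegaBytes
CONCURRENT_USERS = 32   # max concurrent users
UPLOAD_TIME_S = 0.25    # seconds per photo upload

bits_per_photo = FILE_SIZE_MB * 8_000_000  # decimal MB -> bits
max_gbps = (bits_per_photo * CONCURRENT_USERS / UPLOAD_TIME_S) / 1_000_000_000
print(max_gbps)  # 4.096
```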
&lt;p&gt;Another formula calculates the logs' maximum network capacity consumption.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Log Throughput in Gb/s" src="https://john.soban.ski/images/Capacity/05_Max_Log.jpg"&gt;&lt;/p&gt;
&lt;p&gt;An average Syslog message (for Apache2) runs about one and a half (1.5) KiloBytes (KB).&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Model That Calculates Max Log Throughput in Gb/s, with numbers" src="https://john.soban.ski/images/Capacity/06_Log_Calc.jpg"&gt;&lt;/p&gt;
&lt;p&gt;This formula yields a max of &lt;strong&gt;0.307 Gb/s&lt;/strong&gt; for log traffic.&lt;/p&gt;
&lt;p&gt;Feel free to either benchmark your own data or use rough order of magnitude (ROM) numbers in your calculation.  When you run the model, you will learn that ROM numbers provide enough data for acceptable results.&lt;/p&gt;
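&lt;p&gt;To see how little precision the model demands, consider a quick ROM sketch of the log flow.  The per-message size comes from the text above; the message rate is a hypothetical assumption chosen for illustration:&lt;/p&gt;

```python
# ROM estimate of log traffic: message size * message rate
LOG_SIZE_KB = 1.5          # average Apache2 syslog message (from the text)
LOGS_PER_SECOND = 25_000   # hypothetical peak message rate

log_gbps = LOG_SIZE_KB * 8_000 * LOGS_PER_SECOND / 1_000_000_000
print(log_gbps)  # 0.3
```

&lt;p&gt;A ROM rate in the neighborhood of 0.3 Gb/s lands in the same ballpark as the 0.307 Gb/s figure above, which suffices for sizing.&lt;/p&gt;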
&lt;h2&gt;4.  Sum the Max Throughput for Each Node&lt;/h2&gt;
&lt;p&gt;This stage of our pipeline uses the Python &lt;strong&gt;Pandas&lt;/strong&gt; and &lt;strong&gt;NumPy&lt;/strong&gt; libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Map Flows to Tables&lt;/h3&gt;
&lt;p&gt;We need to map our data flows from our flow diagram to a &lt;strong&gt;Pandas&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Flow Diagram that shows Data Flows through our system" src="https://john.soban.ski/images/Capacity/02_Flow_Diagram.jpg"&gt;&lt;/p&gt;
&lt;p&gt;In the flow above, we see that the &lt;strong&gt;Photo Upload Flow&lt;/strong&gt; originates at the &lt;strong&gt;User&lt;/strong&gt;, hits the &lt;strong&gt;Firewall&lt;/strong&gt;, traverses the &lt;strong&gt;Core Switch&lt;/strong&gt; to the &lt;strong&gt;App Server&lt;/strong&gt;, and then terminates at the &lt;strong&gt;File Store&lt;/strong&gt; via another trip through the &lt;strong&gt;Core Switch&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We map this description to the following table:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;SourceNode&lt;/th&gt;
&lt;th&gt;DestinationNode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;user&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;core_switch&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We can ignore the &lt;strong&gt;User&lt;/strong&gt; to &lt;strong&gt;Firewall&lt;/strong&gt; flow since we do not want to size the &lt;strong&gt;User's&lt;/strong&gt; computer.&lt;/p&gt;
&lt;p&gt;To further minimize rote work, we will delete the &lt;strong&gt;Core Switch&lt;/strong&gt; entries.  Since every flow traverses the &lt;strong&gt;Core Switch&lt;/strong&gt;, we will add those entries back in one batch job at a later point.&lt;/p&gt;
&lt;p&gt;We reduce the above table to:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;SourceNode&lt;/th&gt;
&lt;th&gt;DestinationNode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We append the &lt;strong&gt;photo_retrieval&lt;/strong&gt; and &lt;strong&gt;logs&lt;/strong&gt; flows to the table.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flow&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;SourceNode&lt;/th&gt;
&lt;th&gt;DestinationNode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;photo_upload&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;photo_retrieval&lt;/td&gt;
&lt;td&gt;file_store&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;photo_retrieval&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;firewall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;logs&lt;/td&gt;
&lt;td&gt;app_server&lt;/td&gt;
&lt;td&gt;developers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We use a &lt;strong&gt;Python Dict&lt;/strong&gt; to import the data.  You can also use a &lt;strong&gt;Comma-Separated Values (CSV)&lt;/strong&gt; file, a &lt;strong&gt;Structured Query Language (SQL)&lt;/strong&gt; table, or &lt;strong&gt;JavaScript Object Notation (JSON)&lt;/strong&gt; to encode your flows.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;developers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then import the &lt;strong&gt;Dict&lt;/strong&gt; into a Pandas &lt;strong&gt;DataFrame.&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Flow = pd.DataFrame(data=flow)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
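
&lt;p&gt;For example, if you prefer the CSV route, &lt;strong&gt;read_csv&lt;/strong&gt; builds the same DataFrame.  The inline string below stands in for a hypothetical flows file:&lt;/p&gt;

```python
import io

import pandas as pd

# The same five flows, encoded as CSV
CSV_TEXT = """Data,SourceNode,DestinationNode
photo_upload,firewall,app_server
photo_upload,app_server,file_store
photo_retrieval,file_store,app_server
photo_retrieval,app_server,firewall
logs,app_server,developers
"""

Flow = pd.read_csv(io.StringIO(CSV_TEXT))
print(Flow.shape)  # (5, 3)
```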

&lt;p&gt;&lt;strong&gt;Print&lt;/strong&gt; displays the &lt;strong&gt;Flow&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

              &lt;span class="n"&gt;Data&lt;/span&gt;  &lt;span class="n"&gt;SourceNode&lt;/span&gt; &lt;span class="n"&gt;DestinationNode&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;    &lt;span class="n"&gt;firewall&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;        &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE:  You can add more rows and flows to suit your needs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We now add the &lt;strong&gt;Core Switch&lt;/strong&gt; nodes back into the table.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Core_Flows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Flow&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DestinationNode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;core_switch&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                       &lt;span class="n"&gt;Flow&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                       &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SourceNode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;core_switch&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                      &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This batch job introduces the &lt;strong&gt;Core Switch&lt;/strong&gt; back into the flows.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Core_Flows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

              &lt;span class="n"&gt;Data&lt;/span&gt;   &lt;span class="n"&gt;SourceNode&lt;/span&gt; &lt;span class="n"&gt;DestinationNode&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;     &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;        &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;      &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE:  The above code outputs an arbitrary ordering of the flows.  We can ignore the order since we aim to execute a per-node &lt;strong&gt;GROUP BY&lt;/strong&gt; operation at the end of our pipeline.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Our model requires a &lt;strong&gt;Node&lt;/strong&gt; view of the flows, independent of &lt;strong&gt;Source&lt;/strong&gt; or &lt;strong&gt;Destination&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We achieve this through the following &lt;strong&gt;Pandas&lt;/strong&gt; operations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
 &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Core_Flows&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
          &lt;span class="n"&gt;Core_Flows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;})]&lt;/span&gt;
         &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our &lt;strong&gt;concat&lt;/strong&gt; operation outputs:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

   &lt;span class="o"&gt;...&lt;/span&gt;

               &lt;span class="n"&gt;Data&lt;/span&gt;         &lt;span class="n"&gt;Node&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;              &lt;span class="n"&gt;logs&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;              &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;      &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;12&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;13&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;14&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;
&lt;span class="mi"&gt;15&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="mi"&gt;16&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;17&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;19&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
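
&lt;p&gt;As an aside, the Pandas &lt;strong&gt;melt&lt;/strong&gt; method expresses the same unpivot (two node columns collapsed into one) more compactly.  The sketch below uses a minimal two-row stand-in for &lt;strong&gt;Core_Flows&lt;/strong&gt;:&lt;/p&gt;

```python
import pandas as pd

# Minimal stand-in for Core_Flows (same columns as in the text)
Core_Flows = pd.DataFrame(
    {'Data': ['photo_upload', 'photo_upload'],
     'SourceNode': ['firewall', 'core_switch'],
     'DestinationNode': ['core_switch', 'app_server']})

# melt unpivots SourceNode and DestinationNode into a single Node column
Node_Flows = (Core_Flows
              .melt(id_vars='Data',
                    value_vars=['SourceNode', 'DestinationNode'],
                    value_name='Node')
              .drop('variable', axis=1))
print(len(Node_Flows))  # 4
```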

&lt;p&gt;A simple &lt;strong&gt;GROUP BY&lt;/strong&gt; operation verifies the quantity of &lt;strong&gt;Flows per Node&lt;/strong&gt;.  The Output of the operation matches the number of &lt;strong&gt;Flows per Node&lt;/strong&gt; in our &lt;strong&gt;Flow Diagram&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Node&lt;/span&gt;
&lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="mi"&gt;8&lt;/span&gt;
&lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;developers&lt;/span&gt;      &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;firewall&lt;/span&gt;        &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;JOIN Data Rates into Node Flow Table&lt;/h3&gt;
&lt;p&gt;I use another &lt;strong&gt;Dict&lt;/strong&gt; to import the (Estimated) Max Throughput for the &lt;strong&gt;Upload&lt;/strong&gt;, &lt;strong&gt;Retrieval&lt;/strong&gt;, and &lt;strong&gt;Logs&lt;/strong&gt; data flows into a DataFrame.  Once more, you can use the &lt;strong&gt;encoding format&lt;/strong&gt; of your choosing (CSV, SQL, JSON).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.307&lt;/span&gt;&lt;span class="p"&gt;,]}&lt;/span&gt;

&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

              &lt;span class="n"&gt;Data&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;photo_upload&lt;/span&gt;     &lt;span class="mf"&gt;4.096&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;     &lt;span class="mf"&gt;4.096&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;             &lt;span class="n"&gt;logs&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We &lt;strong&gt;JOIN&lt;/strong&gt; this &lt;strong&gt;Data&lt;/strong&gt; DataFrame into our &lt;strong&gt;Flow&lt;/strong&gt; DataFrame via a &lt;strong&gt;merge&lt;/strong&gt; operation.&lt;/p&gt;
&lt;p&gt;We also apply a &lt;strong&gt;SUM&lt;/strong&gt; operation via the &lt;strong&gt;GROUP BY&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# JOIN Node_Flows and Data on the Data column&lt;/span&gt;

&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This then outputs the &lt;strong&gt;MAX&lt;/strong&gt; throughput (sum of &lt;strong&gt;ALL&lt;/strong&gt; flows) on a per-node basis.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

          &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;18.341&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;36.682&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;9.292&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.742&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
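
&lt;p&gt;One caution with the left &lt;strong&gt;merge&lt;/strong&gt;: a misspelled flow name silently matches nothing, and its missing rate sums as zero.  The &lt;strong&gt;indicator&lt;/strong&gt; argument flags such rows.  A sketch, with a deliberately bad flow name:&lt;/p&gt;

```python
import pandas as pd

# Two flow rows, one of which ('bad_name') has no matching data rate
Node_Flows = pd.DataFrame({'Data': ['photo_upload', 'bad_name'],
                           'Node': ['firewall', 'app_server']})
Data = pd.DataFrame({'Data': ['photo_upload'], 'RateGbps': [4.096]})

merged = pd.merge(Node_Flows, Data, on='Data', how='left', indicator=True)

# Rows that appear only on the left side failed to match a rate
unmatched = merged[merged['_merge'] == 'left_only']
print(list(unmatched['Data']))  # ['bad_name']
```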

&lt;h2&gt;5.  Map Max Gb/s to the required number of CPUs per Node&lt;/h2&gt;
&lt;p&gt;Assume that each CPU cycle processes one &lt;strong&gt;bit of throughput&lt;/strong&gt;, or one &lt;strong&gt;bit per Hertz (Hz).&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We identify the number of cores needed via this calculation.&lt;/p&gt;
&lt;p&gt;For purposes of this model, I use the &lt;strong&gt;Intel Xeon Silver 4214 Processor&lt;/strong&gt;, which retails at approximately $185.00 (USD) in 2024.&lt;/p&gt;
&lt;p&gt;We feed our model with the &lt;strong&gt;Silver's&lt;/strong&gt; specs (2.2 GHz, 12 cores, 16.5 MB cache):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;CPU_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Intel Xeon Silver 4214&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;CPU_CLOCK_SPEED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2.2&lt;/span&gt;
&lt;span class="n"&gt;CPU_CORES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;
&lt;span class="n"&gt;CPU_PRICE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;185.00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In addition, we need to account for various sources of processing overhead, or &lt;strong&gt;Taxes&lt;/strong&gt;.  Feel free to benchmark your own metrics.  I use the following percentages:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;OS_TAX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="n"&gt;VIRTUALIZATION_TAX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;
&lt;span class="n"&gt;TLS_TAX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From here, we convert the &lt;strong&gt;Max Gb/s per node&lt;/strong&gt; to &lt;strong&gt;GHz&lt;/strong&gt;, and then &lt;strong&gt;GHz&lt;/strong&gt; to &lt;strong&gt;Cores&lt;/strong&gt;.  Note the &lt;strong&gt;Ceiling&lt;/strong&gt; operation, since we can't buy a fractional core:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCores&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                               &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                                     &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; 
                                       &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;OS_TAX&lt;/span&gt; 
                                       &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;VIRTUALIZATION_TAX&lt;/span&gt; 
                                       &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TLS_TAX&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                     &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CPU_CLOCK_SPEED&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
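&lt;p&gt;As a self-contained sanity check, we can reproduce the conversion with a minimal sketch (the two node rates below serve as illustrative inputs, not new measurements):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Overhead taxes and CPU clock speed from the constants above
OS_TAX, VIRTUALIZATION_TAX, TLS_TAX = 0.05, 0.15, 0.05
CPU_CLOCK_SPEED = 2.2  # GHz per core

# Illustrative per-node rates (Gb/s)
Node_Flows_Data = pd.DataFrame({'Node': ['app_server', 'firewall'],
                                'RateGbps': [16.691, 8.192]})

# Gb/s -> GHz (inflated by the taxes) -> whole cores
Node_Flows_Data['NumCores'] = np.ceil(
    Node_Flows_Data['RateGbps']
    * (1 + OS_TAX + VIRTUALIZATION_TAX + TLS_TAX)
    / CPU_CLOCK_SPEED)

print(Node_Flows_Data)
```

&lt;p&gt;The app_server, for example, needs ceil(16.691 * 1.25 / 2.2) = ceil(9.48) = 10 cores.&lt;/p&gt;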

&lt;p&gt;Based on the above &lt;strong&gt;math&lt;/strong&gt;, we see that the &lt;strong&gt;core_switch&lt;/strong&gt; requires the most &lt;strong&gt;cores&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;          &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;  &lt;span class="n"&gt;NumCores&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;18.341&lt;/span&gt;      &lt;span class="mf"&gt;11.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;36.682&lt;/span&gt;      &lt;span class="mf"&gt;21.0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;9.292&lt;/span&gt;       &lt;span class="mf"&gt;6.0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.742&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;6.  Price the Total Required CPU ($)&lt;/h2&gt;
&lt;p&gt;We know that each CPU includes &lt;strong&gt;twelve (12) 2.2 GHz cores&lt;/strong&gt; and costs $185.  We use the following &lt;strong&gt;Pandas&lt;/strong&gt; statements to calculate the cost.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCpu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCores&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                                   &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;CPU_CORES&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;TotalCpuCost&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NumCpu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                                &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CPU_PRICE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
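&lt;p&gt;A minimal, runnable sketch of the same two statements (the core counts below serve as illustrative inputs from the previous step):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

CPU_CORES = 12      # cores per CPU
CPU_PRICE = 185.00  # dollars per CPU

# Illustrative per-node core requirements
Node_Flows_Data = pd.DataFrame({'Node': ['app_server', 'core_switch'],
                                'NumCores': [10.0, 19.0]})

# Round up to whole CPUs, then multiply by unit price
Node_Flows_Data['NumCpu'] = np.ceil(Node_Flows_Data['NumCores'] / CPU_CORES)
Node_Flows_Data['TotalCpuCost'] = Node_Flows_Data['NumCpu'] * CPU_PRICE

print(Node_Flows_Data)
# core_switch needs ceil(19 / 12) = 2 CPUs, which costs 2 * $185 = $370
```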

&lt;p&gt;The following lines of code add a &lt;strong&gt;Totals&lt;/strong&gt; row:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Total_Row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;
             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;Total_Row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Node&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Total&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
                   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;pd&lt;/span&gt;
                              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;Total_Row&lt;/span&gt;&lt;span class="p"&gt;])],&lt;/span&gt;
                            &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
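&lt;p&gt;The same &lt;strong&gt;Totals&lt;/strong&gt; pattern works on any frame; here is a toy example (node names and costs made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Toy frame with one text column and one numeric column
df = pd.DataFrame({'Node': ['node_a', 'node_b'],
                   'TotalCpuCost': [185.0, 370.0]})

# Sum only the numeric columns into a Series, then label the row
total_row = df.sum(numeric_only=True)
total_row['Node'] = 'Total'

# Append the Series as a one-row DataFrame
df = pd.concat([df, pd.DataFrame([total_row])], ignore_index=True)
print(df)
```

&lt;p&gt;The &lt;strong&gt;numeric_only=True&lt;/strong&gt; flag keeps Pandas from trying to sum the text column.&lt;/p&gt;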

&lt;p&gt;The final output reads:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Node_Flows_Data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

          &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;  &lt;span class="n"&gt;NumCores&lt;/span&gt;  &lt;span class="n"&gt;NumCpu&lt;/span&gt;  &lt;span class="n"&gt;TotalCpuCost&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;16.691&lt;/span&gt;      &lt;span class="mf"&gt;10.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;33.382&lt;/span&gt;      &lt;span class="mf"&gt;19.0&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;         &lt;span class="mf"&gt;370.0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;8.192&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.192&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;        &lt;span class="n"&gt;Total&lt;/span&gt;    &lt;span class="mf"&gt;66.764&lt;/span&gt;      &lt;span class="mf"&gt;40.0&lt;/span&gt;     &lt;span class="mf"&gt;6.0&lt;/span&gt;        &lt;span class="mf"&gt;1110.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can expect to pay $1,110 to purchase the required &lt;strong&gt;CPUs&lt;/strong&gt; for our &lt;strong&gt;Photo Album&lt;/strong&gt; Web Application.&lt;/p&gt;
&lt;h2&gt;Bonus:  Extend the model&lt;/h2&gt;
&lt;p&gt;We can easily add new flows to our model.&lt;/p&gt;
&lt;p&gt;For example, let's add a &lt;strong&gt;Data Analyst&lt;/strong&gt; use case to our system.  The &lt;strong&gt;Data Analysts&lt;/strong&gt; look at new &lt;strong&gt;Metadata Flows&lt;/strong&gt; (Flows &lt;strong&gt;four&lt;/strong&gt; and &lt;strong&gt;five&lt;/strong&gt; in the diagram below):&lt;/p&gt;
&lt;p&gt;&lt;img alt="An updated flow diagram with new flows for Metadata analysis" src="https://john.soban.ski/images/Capacity/07_Metadata_Flow.jpg"&gt;&lt;/p&gt;
&lt;p&gt;We add these two new flows (From the &lt;strong&gt;App Server to the File Store&lt;/strong&gt;, and from the &lt;strong&gt;File Store to the Analysts&lt;/strong&gt;) to our &lt;strong&gt;Flow&lt;/strong&gt; and &lt;strong&gt;Data&lt;/strong&gt; DataFrames:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;metadata_archival&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;metadata_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;metadata_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;SourceNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,],&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;DestinationNode&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;firewall&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;developers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;file_store&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;app_server&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;data_analysts&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,]}&lt;/span&gt;

&lt;span class="n"&gt;Flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Flow&lt;/strong&gt; DataFrame now includes &lt;strong&gt;Metadata&lt;/strong&gt; flows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="o"&gt;...&lt;/span&gt;

                 &lt;span class="n"&gt;Data&lt;/span&gt;  &lt;span class="n"&gt;SourceNode&lt;/span&gt; &lt;span class="n"&gt;DestinationNode&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;        &lt;span class="n"&gt;photo_upload&lt;/span&gt;    &lt;span class="n"&gt;firewall&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;        &lt;span class="n"&gt;photo_upload&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;     &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;     &lt;span class="n"&gt;photo_retrieval&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;        &lt;span class="n"&gt;firewall&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;                &lt;span class="n"&gt;logs&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="n"&gt;metadata_archival&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;      &lt;span class="n"&gt;file_store&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;  &lt;span class="n"&gt;metadata_retrieval&lt;/span&gt;  &lt;span class="n"&gt;file_store&lt;/span&gt;      &lt;span class="n"&gt;app_server&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;  &lt;span class="n"&gt;metadata_retrieval&lt;/span&gt;  &lt;span class="n"&gt;app_server&lt;/span&gt;   &lt;span class="n"&gt;data_analysts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We also update the &lt;strong&gt;Data&lt;/strong&gt; DataFrame to include &lt;strong&gt;Metadata&lt;/strong&gt; flows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Data&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;photo_upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;photo_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;logs&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;metadata_archival&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="s1"&gt;&amp;#39;metadata_retrieval&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;RateGbps&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;4.096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.307&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="mf"&gt;0.550&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;Data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then run the rest of the commands above, without edit, and the model &lt;strong&gt;learns&lt;/strong&gt; about the new Node (Analysts) and Flows (Metadata Archival &amp;amp; Retrieval).&lt;/p&gt;
&lt;p&gt;We see that we only need to buy &lt;strong&gt;one (1) new CPU&lt;/strong&gt; for the &lt;strong&gt;Data Analysts'&lt;/strong&gt; workstation:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;            &lt;span class="n"&gt;Node&lt;/span&gt;  &lt;span class="n"&gt;RateGbps&lt;/span&gt;  &lt;span class="n"&gt;NumCores&lt;/span&gt;  &lt;span class="n"&gt;NumCpu&lt;/span&gt;  &lt;span class="n"&gt;TotalCpuCost&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;     &lt;span class="n"&gt;app_server&lt;/span&gt;    &lt;span class="mf"&gt;18.341&lt;/span&gt;      &lt;span class="mf"&gt;11.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="n"&gt;core_switch&lt;/span&gt;    &lt;span class="mf"&gt;36.682&lt;/span&gt;      &lt;span class="mf"&gt;21.0&lt;/span&gt;     &lt;span class="mf"&gt;2.0&lt;/span&gt;         &lt;span class="mf"&gt;370.0&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;data_analysts&lt;/span&gt;     &lt;span class="mf"&gt;0.550&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;     &lt;span class="n"&gt;developers&lt;/span&gt;     &lt;span class="mf"&gt;0.307&lt;/span&gt;       &lt;span class="mf"&gt;1.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;     &lt;span class="n"&gt;file_store&lt;/span&gt;     &lt;span class="mf"&gt;9.292&lt;/span&gt;       &lt;span class="mf"&gt;6.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="n"&gt;firewall&lt;/span&gt;     &lt;span class="mf"&gt;8.192&lt;/span&gt;       &lt;span class="mf"&gt;5.0&lt;/span&gt;     &lt;span class="mf"&gt;1.0&lt;/span&gt;         &lt;span class="mf"&gt;185.0&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;          &lt;span class="n"&gt;Total&lt;/span&gt;    &lt;span class="mf"&gt;73.364&lt;/span&gt;      &lt;span class="mf"&gt;45.0&lt;/span&gt;     &lt;span class="mf"&gt;7.0&lt;/span&gt;        &lt;span class="mf"&gt;1295.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="A Graphic that shows an ominous figure walking into a dense, server filled data center, with pipes and greebles on the ceiling" src="https://john.soban.ski/images/Capacity/08_Data_Flow.jpg"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="Operations Research"></category><category term="Data Science"></category><category term="Pandas"></category><category term="Python"></category><category term="HOWTO"></category></entry><entry><title>Combat ChatGPT Use in Academia and Government Acquisition</title><link href="https://john.soban.ski/far.html" rel="alternate"></link><published>2024-07-27T07:27:00-04:00</published><updated>2024-07-27T07:27:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2024-07-27:/far.html</id><summary type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Government Acquisition Teams (and Academics) request written responses to gauge writers' capacity for difficult work.  Good writing requires effort.  Writers must provide clear, focused, and valuable prose to Readers.    &lt;/p&gt;
&lt;p&gt;Writers must &lt;strong&gt;do the work&lt;/strong&gt; for the reader, and remove ambiguity, falsehoods, filler content, and irrelevant bloviating.  &lt;/p&gt;
&lt;p&gt;Generative AI provides …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Government Acquisition Teams (and Academics) request written responses to gauge writers' capacity for difficult work.  Good writing requires effort.  Writers must provide clear, focused, and valuable prose to Readers.    &lt;/p&gt;
&lt;p&gt;Writers must &lt;strong&gt;do the work&lt;/strong&gt; for the reader, and remove ambiguity, falsehoods, filler content, and irrelevant bloviating.  &lt;/p&gt;
&lt;p&gt;Generative AI provides a cheat code for writing.  Putative writers enter text into a prompt, and the AI produces mediocre, passable text in seconds.  GenAI undermines the goals of writing assignments.  In the face of instant prose development, Acquisition Teams can no longer infer Offerors' work ethic and ability from their prose.&lt;/p&gt;
&lt;p&gt;Today I propose a simple rubric to detect the use of GenAI in writing.  Graders and reviewers can then decide how to penalize Offerors who use GenAI.  Where traditional writing demonstrated an ability to work hard, GenAI use reflects a willingness to take shortcuts.&lt;/p&gt;
&lt;h2&gt;Prior Art&lt;/h2&gt;
&lt;p&gt;Liang et al developed a method to identify the likelihood of Large Language model use in Academic white papers.  Liang:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;present[s] an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM) [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Early attempts at LLM detection looked at individual articles.  Liang's novel approach attempts detection based on the analysis of a &lt;strong&gt;large corpus&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Their results &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;suggest that between 6.5% and 16.9% of text submitted as peer reviews to [Academic] conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Consider that for a moment.  Academic journals and conferences represent the pinnacle of creative and scientific thought.  Submitters typically hold a PhD or work toward one.  Liang, however, discovered that as many as one in six submissions &lt;strong&gt;substantially&lt;/strong&gt; used an LLM to generate prose.&lt;/p&gt;
&lt;p&gt;Liang correlates LLM use with low quality, procrastination, and apathy:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The Rubric&lt;/h2&gt;
&lt;p&gt;Find my rubric below.  Subtract one point (from 100) for each transgression.  I explain my rationale in detail in the following sections.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Meme Adjectives&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;meticulous, intricate, transformative&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty Phrases&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;In today's digital landscape&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unsubstantiated Grandiosity&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;is crucial, is critical, is of paramount importance&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cliches&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Acme Corp. is pleased to present&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adverbs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;to share and decipher data seamlessly is paramount&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passive Voice/ Ambiguous Subject&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenAI was founded in 2015&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The Verb &lt;strong&gt;To Be&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenDaylight is a Software Defined Network (SDN) Controller.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Take, for example, the following prose:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Acme Corp is pleased to present our new packet tech. In the intricate world of digital communication, the ability to label and prioritize data is paramount, especially when it comes to real-time data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The above prose loses five points in two sentences.  It includes a cliche, an empty phrase, a meme adjective, an instance of unsubstantiated grandiosity, and an adverb.&lt;/p&gt;
&lt;h2&gt;Meme Adjectives&lt;/h2&gt;
&lt;p&gt;ChatGPT loves certain words.  After ChatGPT's public release, Liang discovered surges of up to 35-fold in the frequency of certain words.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We find a significant shift in the frequency of certain tokens in ICLR 2024, with adjectives such as “commendable”, “meticulous”, and “intricate” showing 9.8, 34.7, and 11.2-fold increases in probability of occurring in a sentence. [Liang 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I took the following chart from Liang's paper, and give all credit to his team.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that shows hockey stick graphs of the 10x explosion of certain meme adjectives" src="https://john.soban.ski/images/Far/01_Meme_Words.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I consider the Meme adjectives &lt;strong&gt;meticulous&lt;/strong&gt;, &lt;strong&gt;intricate&lt;/strong&gt; and &lt;strong&gt;commendable&lt;/strong&gt; unnecessary (attempts at) flourishes.  They do not contribute to the information density of prose.&lt;/p&gt;
&lt;p&gt;Page 15 of &lt;strong&gt;Monitoring AI-Modified Content at Scale&lt;/strong&gt; provides a &lt;strong&gt;word cloud&lt;/strong&gt; of Meme Adjectives.  Again, credit to Liang and his team.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture that depicts a word cloud of LLM Meme Adjectives" src="https://john.soban.ski/images/Far/02_Word_Cloud.jpg"&gt;&lt;/p&gt;
&lt;p&gt;I propose we penalize authors that use these words.&lt;/p&gt;
&lt;h2&gt;Passive Voice/ Ambiguous Subject&lt;/h2&gt;
&lt;p&gt;In just a few months since I published my &lt;a href="https://john.soban.ski/beat-ai.html"&gt;original thoughts on this subject&lt;/a&gt;, LLMs improved their ability to remove passive voice.  Simple prompt commands remove 80% of passive voice.&lt;/p&gt;
&lt;p&gt;For example, consider the first reply from ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The OpenDaylight Project &lt;strong&gt;was founded&lt;/strong&gt; in April 2013 and is hosted by the Linux Foundation. OpenDaylight &lt;strong&gt;is structured&lt;/strong&gt; as a modular, pluggable, and flexible controller. It &lt;strong&gt;can be deployed&lt;/strong&gt; on any hardware and operating system platform that supports Java. The platform &lt;strong&gt;is designed&lt;/strong&gt; to separate...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response includes myriad instances of passive voice.  I ask ChatGPT to remove passive voice and it returns:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Founded in April 2013 and hosted by the Linux Foundation, OpenDaylight features a modular, pluggable, and flexible controller. This platform &lt;strong&gt;is deployable&lt;/strong&gt; on any hardware and operating system platform supporting Java. OpenDaylight's design separates the control plane...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The response still includes passive voice.  It identifies the founder but does not identify the designer.  I ask it to&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Rewrite the above to Subject-verb object form, removing all forms of &lt;strong&gt;to be&lt;/strong&gt; and all passive voice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The OpenDaylight Project, an open-source platform for Software-Defined Networking (SDN), accelerates SDN adoption and promotes... Founded in April 2013 and hosted by the Linux Foundation, OpenDaylight features a modular, pluggable, and flexible controller. This platform, &lt;strong&gt;deployable on any hardware&lt;/strong&gt; and operating system platform supporting Java, separates...&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I find this rule the most difficult to address.  ChatGPT removes all forms of the verb &lt;strong&gt;to be&lt;/strong&gt; from its response and removes passive voice.  It still includes &lt;strong&gt;ambiguous subjects&lt;/strong&gt;; for example, the reader will not know who would deploy the platform.&lt;/p&gt;
&lt;p&gt;This rule will only catch the most egregious LLM use, and prompt engineers too lazy to ask the bot to refine the output.&lt;/p&gt;
&lt;h2&gt;Empty Phrases&lt;/h2&gt;
&lt;p&gt;ChatGPT peppers output with phrases that sound impressive at first read, but prove hollow and meaningless upon further inspection.&lt;/p&gt;
&lt;p&gt;Take, for example, the phrase &lt;strong&gt;in today's digital landscape&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;A &lt;a href="https://kagi.com/"&gt;Kagi Search&lt;/a&gt; (Non-affiliate link) of &lt;strong&gt;Digital Landscape&lt;/strong&gt; returns myriad definitions.  The domains of advertising, online games, and communications infrastructure policy all provide separate definitions.  &lt;/p&gt;
&lt;p&gt;ChatGPT should define &lt;strong&gt;digital landscape&lt;/strong&gt; to remove ambiguity, but I have not seen it do so prior to use.&lt;/p&gt;
&lt;p&gt;I collect several of these &lt;strong&gt;empty phrases&lt;/strong&gt; in my &lt;a href="https://john.soban.ski/beat-ai.html"&gt;Beat AI&lt;/a&gt; blog post.  They include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In today's &lt;strong&gt;digital landscape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;intricate world&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;this era&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unlocking new horizons&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I recommend compiling a corpus of these phrases and then penalizing authors one point for each instance.&lt;/p&gt;
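&lt;p&gt;A minimal sketch of that idea in Python (the phrase list and scoring scheme here are my illustrative assumptions, not a standard tool):&lt;/p&gt;

```python
import re

# Hand-compiled corpus of empty phrases; extend as you collect more
EMPTY_PHRASES = ["in today's digital landscape",
                 "in the intricate world of",
                 "in this era of",
                 "unlocking new horizons"]

def score(text, phrases=EMPTY_PHRASES, start=100):
    """Subtract one point per case-insensitive phrase occurrence."""
    lowered = text.lower()
    hits = sum(len(re.findall(re.escape(p), lowered)) for p in phrases)
    return start - hits

sample = ("In today's digital landscape, Acme unlocks value.  "
          "In the intricate world of networking, uptime matters.")
print(score(sample))  # two transgressions, so 98
```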
&lt;h2&gt;Cliches&lt;/h2&gt;
&lt;p&gt;Cliches add no information to text, just noise.  They offend Readers' intelligence and demonstrate a Writer's lack of respect for their Readers' time.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tackle tough problems&lt;/li&gt;
&lt;li&gt;Best of breed&lt;/li&gt;
&lt;li&gt;Hit the ground running&lt;/li&gt;
&lt;li&gt;Low-hanging fruit&lt;/li&gt;
&lt;li&gt;Long pole in the tent&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Unsubstantiated Grandiosity&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt; prioritizes showy prose over the conveyance of useful information.  Writers must convey useful information in tight, considerate prose.  I've collected examples of LLM-produced &lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt; in my &lt;a href="https://john.soban.ski/beat-ai.html"&gt;prior blog post&lt;/a&gt;.  Some include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Infrastructure as Code (IaC) is a &lt;strong&gt;paradigm-shifting&lt;/strong&gt; approach&lt;/li&gt;
&lt;li&gt;Businesses can achieve &lt;strong&gt;unprecedented agility&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;...making them &lt;strong&gt;indispensable&lt;/strong&gt; for decision-making&lt;/li&gt;
&lt;li&gt;Graphics Processing Units (GPUs) have &lt;strong&gt;transcended&lt;/strong&gt; their original role...&lt;/li&gt;
&lt;li&gt;Digital advertising holds &lt;strong&gt;immense significance&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An author must provide objective, well-cited, and appropriate evidence to build a case for a particular argument.  They must &lt;strong&gt;do the work&lt;/strong&gt; for their reader.  Only after presenting substantial evidence can a writer tout a technology's importance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt; provides a shortcut for lazy writers; it attempts (and fails) to replace diligent research and evidence, and it shifts the cognitive load onto the reader.&lt;/p&gt;
&lt;h2&gt;Adverbs&lt;/h2&gt;
&lt;p&gt;I put &lt;strong&gt;Adverbs&lt;/strong&gt; in the category of &lt;strong&gt;lazy shortcuts&lt;/strong&gt;.  Thorough research and clear communication of relevant facts obviate the need for adverbs.  For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an &lt;strong&gt;increasingly&lt;/strong&gt; important part of our global economy&lt;/li&gt;
&lt;li&gt;Embracing and &lt;strong&gt;diligently&lt;/strong&gt; implementing security frameworks&lt;/li&gt;
&lt;li&gt;We &lt;strong&gt;seamlessly&lt;/strong&gt; integrate different data sources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Require your authors, for example, to describe the processes and experience that would indicate &lt;strong&gt;seamless&lt;/strong&gt; integration.  Without proof, the adverb &lt;strong&gt;seamlessly&lt;/strong&gt; rings hollow and subjective.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;My Rubric provides a first-pass filter to penalize LLM use in Government Acquisition and Academic writing.  It will only catch the laziest of writers.  If AI companies continue to follow observed trendlines, they will improve the proportion of relevant, objective, and fact-based content in LLM output.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A picture of a shocked robot at the helm of a computer" src="https://john.soban.ski/images/Beat_Ai/01_Shocked_Robot.png"&gt;&lt;/p&gt;
&lt;p&gt;In my last blog post, I recommend that readers demand prose in the form of &lt;strong&gt;subject verb object&lt;/strong&gt;.  I also recommend the removal of all forms of the verb &lt;strong&gt;to be&lt;/strong&gt;.  See that post for &lt;a href="https://john.soban.ski/beat-ai.html"&gt;justification and examples&lt;/a&gt;.  Since then, LLMs have proved adept at accommodating these requests.&lt;/p&gt;
&lt;p&gt;I considered recommending that authors cite journals and Academic papers to foil LLM usage, since earlier LLMs failed at this task.  As of July 2024, however, Grok AI shows promise on this front.&lt;/p&gt;
&lt;p&gt;I use the following prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write two sentences about how we can detect LLM-generated code.  Cite specific authors and quotes from current Academic journals.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Grok replies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;According to the research by Wu et al. in their paper titled "LLMDet: A Third Party Large Language Models Generated Text Detection Tool," published in the 2023 Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), a text detection tool called LLMDet can distinguish between different large language models and human-generated text, providing a specific probability for each source [1]. This tool represents a significant advancement in the field of detecting AI-generated content.   &lt;/p&gt;
&lt;p&gt;Additionally, a paper titled "Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector" by Wu et al. published in 2024 on arXiv highlights the challenge of detecting synthetic code generated by Large Language Models (LLMs) and proposes a sampling-based watermarking method for identifying machine-generated text, addressing the growing concern over the misuse of LLM-generated code in both educational and industrial domains [3].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Grok did not cite any direct quotes, so the prompt engineer must do some manual work to find one.&lt;/p&gt;
&lt;p&gt;Grok also did not provide the full details of the sources.  A follow-on prompt, however, retrieves them.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cite the second paper's source in MLS format for a bibliography&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Outputs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wu, L., Wu, X., &amp;amp; Wu, Y. (2024). Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector. arXiv preprint arXiv:2402.01234.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I believe LLMs will evolve past detection.&lt;/p&gt;
&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Liang, P., Wu, L., Wu, X., &amp;amp; Wu, Y. (2024). Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Association for Computational Linguistics.&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="NLP"></category><category term="Machine Learning"></category></entry><entry><title>Use Python Pandas to Invest in Sealed Super Mario Bros. 3</title><link href="https://john.soban.ski/smb3.html" rel="alternate"></link><published>2023-12-30T00:26:00-05:00</published><updated>2023-12-30T00:26:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-12-30:/smb3.html</id><summary type="html">&lt;p&gt;Sealed and Graded Video Game Collecting skyrocketed in popularity over the past decade.  I joined the hobby in 2020 and stuck through the boom times of 2021 and the recent crash of 2023.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Despite the recent crash, sealed video games do provide organic collectability.  In contrast to Image Comics …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Sealed and Graded Video Game Collecting skyrocketed in popularity over the past decade.  I joined the hobby in 2020 and stuck through the boom times of 2021 and the recent crash of 2023.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;Despite the recent crash, sealed video games do provide organic collectability.  In contrast to Image Comics (&lt;em&gt;Gold Editions!  Holo-Foil Covers!&lt;/em&gt;) and Star Wars Power of the Force (&lt;em&gt;Green Cardboard!  Brown Vest Luke!&lt;/em&gt;) action figures, no one (that I knew of) thought to preserve outdated, legacy video games in the 1990s.  &lt;/p&gt;
&lt;p&gt;In 1993, for example, I traded about $10 worth of (completely legal) fireworks for the 8-Bit Nintendo Entertainment System (NES) releases of Wrestlemania, Mega Man, and (IIRC) Jackal.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SMB3 Sprites" src="https://john.soban.ski/images/Smb3/00_Smb_Splash.png"&gt;&lt;/p&gt;
&lt;p&gt;Nobody (except a handful of weirdos) kept sealed copies of NES, Genesis, or SNES games.  If I got a game, I opened it and played it, end of story.  Today, the &lt;strong&gt;supply&lt;/strong&gt; of conventional collectibles (comics, sports cards, action figures) dwarfs the &lt;strong&gt;supply&lt;/strong&gt; of sealed video games.&lt;/p&gt;
&lt;p&gt;On the &lt;strong&gt;demand&lt;/strong&gt; side, sealed Video games, like the NFT market, appear to follow a &lt;strong&gt;winner take all&lt;/strong&gt; approach.  The popular, or &lt;strong&gt;brand name&lt;/strong&gt; games sell at multiples of less popular games, with no regard for supply.  You can, for example, buy certain &lt;strong&gt;pop one&lt;/strong&gt; (only one known sealed game in the market) games on Heritage for a &lt;a href="https://comics.ha.com/itm/video-games/sega/centurion-defender-of-rome-wata-94-b-sealed-genesis-electronic-arts-1991-usa/a/312339-69059.s?ic4=GalleryView-ShortDescription-071515"&gt;little under $200 at auction&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DISCLAIMER: I base the information on this blog on my personal opinion and experience, and you MUST not consider it professional financial investment advice. Do not ever act on my opinions without first assessing your own personal and financial situation, and you MUST consult a financial professional before making any investment. Keep in mind that I will change my thoughts and opinions over time as I learn and accumulate more knowledge. I am NOT a financial professional! This blog is not a place for giving or receiving financial advice, advice concerning investment decisions, or tax or legal advice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Investment Grade&lt;/h2&gt;
&lt;p&gt;Today we analyze the collectability of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt; for the NES.&lt;/p&gt;
&lt;p&gt;Shawn Surmick from &lt;a href="https://www.youtube.com/@ReservedInvestments"&gt;Reserved Investments&lt;/a&gt; taught me the idea of an &lt;strong&gt;Investment Grade&lt;/strong&gt; collectible, a collectible in the 85th percentile of the population.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Investment-grade collectibles sit at the top 15% of the pack.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The CGC census, for example, records 623 graded copies of the 1962 issue of Green Lantern #16 (First Star Sapphire).  623 times 15% yields 93.45.  If you add the quantities for each universal grade, you will find that fewer than 94 copies of this comic hold a grade greater than 8.5.  For that reason, an investor can consider any copy of Green Lantern #16 (1962) with a grade of 8.5 or higher &lt;strong&gt;Investment Grade&lt;/strong&gt;.&lt;/p&gt;
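&lt;p&gt;The cutoff arithmetic above reduces to a short Python sketch.  The census numbers below are hypothetical, chosen only so the counts sum to 623:&lt;/p&gt;

```python
def investment_grade_cutoff(census, pct=0.15):
    """Return the lowest grade that still falls within the top pct of the census.

    census: dict mapping grade to graded population count.
    """
    threshold = sum(census.values()) * pct  # e.g. 623 * 0.15 = 93.45
    running = 0
    cutoff = None
    for grade in sorted(census, reverse=True):  # walk from the best grade down
        running += census[grade]
        if running > threshold:
            break
        cutoff = grade
    return cutoff

# Hypothetical census of 623 graded copies (illustrative numbers only)
census = {9.8: 5, 9.6: 20, 9.4: 40, 9.2: 60, 8.5: 80, 8.0: 120, 7.0: 298}
print(investment_grade_cutoff(census))  # 9.4
```

&lt;p&gt;With these made-up counts, only copies graded 9.4 and up land in the top 15%.&lt;/p&gt;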
&lt;p&gt;A glance at the census for &lt;strong&gt;Super Mario 3&lt;/strong&gt;, however, indicates a need for a more complicated analysis.&lt;/p&gt;
&lt;h2&gt;Import and Clean the Data&lt;/h2&gt;
&lt;p&gt;We use the Python Pandas package for our analysis, and Python Seaborn fuels the graphics.&lt;/p&gt;
&lt;p&gt;We use data from Larry's &lt;a href="https://gamerstonks.com/"&gt;GamerStonks&lt;/a&gt; (non-affiliate link) database.  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you want to collect sealed video games, then I recommend you pay for a subscription to Gamerstonks.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We load the libraries and import the data from a Comma Separated Value (CSV) spreadsheet into a Pandas DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;smb3.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DataFrame includes the &lt;strong&gt;Grader&lt;/strong&gt; (WATA, CGC), &lt;strong&gt;Box&lt;/strong&gt; Grade, &lt;strong&gt;Seal&lt;/strong&gt; Grade, &lt;strong&gt;Variant&lt;/strong&gt;, Purchase &lt;strong&gt;Price&lt;/strong&gt;, &lt;strong&gt;Auction&lt;/strong&gt; House (Goldin, Heritage, Certified Link), and the &lt;strong&gt;Date&lt;/strong&gt; Sold.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Grader&lt;span class="w"&gt; &lt;/span&gt;Box&lt;span class="w"&gt; &lt;/span&gt;Seal&lt;span class="w"&gt; &lt;/span&gt;Variant&lt;span class="w"&gt;                                            &lt;/span&gt;Price&lt;span class="w"&gt;  &lt;/span&gt;Auction&lt;span class="w"&gt;           &lt;/span&gt;Date
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ...  &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;,880 Heritage Auctions 11/30/23&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.4 A    *Made in Japan, Oval SOQ R - &amp;quot;&lt;/span&gt;USA,&lt;span class="w"&gt; &lt;/span&gt;Canada&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;...&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;,160&lt;span class="w"&gt; &lt;/span&gt;Heritage&lt;span class="w"&gt; &lt;/span&gt;Auctions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;/30/23
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ...  &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;,159 Goldin            11/18/23&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.6 A+   *Made in Japan, NFR (Challenge Set)                &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;,160 Heritage Auctions 11/04/23&lt;/span&gt;
&lt;span class="s2"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We need to &lt;strong&gt;encode&lt;/strong&gt; the data into the correct &lt;strong&gt;type&lt;/strong&gt;, for example, convert &lt;strong&gt;Price&lt;/strong&gt; from &lt;strong&gt;String&lt;/strong&gt; to &lt;strong&gt;Numeric&lt;/strong&gt;, and &lt;strong&gt;Date&lt;/strong&gt; from &lt;strong&gt;String&lt;/strong&gt; to &lt;strong&gt;DateTime&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;$&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Seal&lt;/strong&gt; ratings, while of type &lt;strong&gt;String&lt;/strong&gt;, represent a scale of increasing preference.  We use Pandas to encode &lt;strong&gt;Seal&lt;/strong&gt; into an &lt;strong&gt;Ordered Categorical&lt;/strong&gt; type.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NS&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;C+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;B+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A++&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Python indicates the hierarchy:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Categories&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NS&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;C+&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;B+&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A+&amp;#39;&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;A++&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Can we Use Box Grade for our Analysis?&lt;/h2&gt;
&lt;p&gt;CGC provides a single numeric score for comic quality.  For video games, both CGC and WATA provide a numeric &lt;strong&gt;Box&lt;/strong&gt; grade plus a separate &lt;strong&gt;Seal&lt;/strong&gt; grade.&lt;/p&gt;
&lt;p&gt;In the &lt;strong&gt;Green Lantern&lt;/strong&gt; example above, we use the &lt;strong&gt;CGC Universal Grade&lt;/strong&gt; to stack rank the comics and identify the &lt;strong&gt;investment grade&lt;/strong&gt;.  Can we use the WATA (or CGC) Sealed Video Game &lt;strong&gt;Box&lt;/strong&gt; Grade to identify the &lt;strong&gt;investment grade&lt;/strong&gt; for Super Mario 3 (SMB3)?&lt;/p&gt;
&lt;p&gt;Consider the summary statistics for the &lt;strong&gt;Price&lt;/strong&gt; of the &lt;strong&gt;9.6&lt;/strong&gt; Box grade:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.6&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Python dumps a ton of data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;count&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;28&lt;/span&gt;.000000
mean&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;11526&lt;/span&gt;.428571
std&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;8659&lt;/span&gt;.504825
min&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;2160&lt;/span&gt;.000000
&lt;span class="m"&gt;25&lt;/span&gt;%&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;5220&lt;/span&gt;.000000
&lt;span class="m"&gt;50&lt;/span&gt;%&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;8400&lt;/span&gt;.000000
&lt;span class="m"&gt;75&lt;/span&gt;%&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;16350&lt;/span&gt;.000000
max&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;33600&lt;/span&gt;.000000
Name:&lt;span class="w"&gt; &lt;/span&gt;Price,&lt;span class="w"&gt; &lt;/span&gt;dtype:&lt;span class="w"&gt; &lt;/span&gt;float64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For the twenty-eight (28) copies of &lt;strong&gt;SMB3&lt;/strong&gt; graded &lt;strong&gt;9.6&lt;/strong&gt;, we see a high sale price of $33.60k, a low of $2.16k, a median of $8.4k, and so on.&lt;/p&gt;
&lt;p&gt;I now present to you a &lt;a href="https://john.soban.ski/analytics-cheat-sheet.html"&gt;Box Plot&lt;/a&gt;.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Box Plot captures the information from the table in graphical form.&lt;/p&gt;
&lt;p&gt;&lt;img alt="9.6 Boxplot" src="https://john.soban.ski/images/Smb3/01_One_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;The box spans the First ($5.2k) and Third ($16.35k) quartiles, and the whiskers extend to the most extreme data points that lie within 1.5 times the Interquartile Range (IQR) of the box (top and bottom).  The little diamond marks the outliers, in this case the max of &lt;strong&gt;$33.6k&lt;/strong&gt;.&lt;/p&gt;
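&lt;p&gt;We can check the whisker arithmetic by hand, plugging in the quartiles from the describe() output above:&lt;/p&gt;

```python
# Fence arithmetic behind the box plot, using the quartiles from describe()
q1, q3 = 5220.0, 16350.0
iqr = q3 - q1                     # 11130.0
upper_fence = q3 + 1.5 * iqr      # 33045.0
print(33600.0 > upper_fence)      # the $33.6k max exceeds the fence
```

&lt;p&gt;Since $33.6k exceeds the $33,045 upper fence, Seaborn draws it as an outlier diamond rather than extending the whisker to it.&lt;/p&gt;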
&lt;p&gt;We plot the distribution for each of the recorded &lt;strong&gt;Box&lt;/strong&gt; Grades:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;red&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;p&gt;&lt;img alt="All grades Boxplot" src="https://john.soban.ski/images/Smb3/02_All_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;The red horizontal line captures the lowest price paid for a 9.8 &lt;strong&gt;Box&lt;/strong&gt; grade.  &lt;/p&gt;
&lt;p&gt;The graph illustrates that certain instances of &lt;strong&gt;Box&lt;/strong&gt; grades with scores of 9.6, 9.4, 9.2, 9.0, and even 5.0 (!) sold for more than the minimum 9.8 price.&lt;/p&gt;
&lt;p&gt;A quick calculation illustrates that &lt;strong&gt;23%&lt;/strong&gt; of all copies graded &lt;strong&gt;less than 9.8&lt;/strong&gt; sold for &lt;strong&gt;more than the 9.8 minimum&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;9.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="mf"&gt;0.23076923076923078&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Based on this discovery, we cannot use &lt;strong&gt;Box&lt;/strong&gt; grade alone to identify the &lt;strong&gt;top 15% of SMB3&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Data Enrichment&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/tag/influxdb.html"&gt;InfluxDB&lt;/a&gt; uses the terms Tags and Measurements for Categorical and Continuous variables.  Tags allow extra dimensions in Data Visualization.  &lt;/p&gt;
&lt;p&gt;We first create &lt;strong&gt;Tags&lt;/strong&gt; for &lt;strong&gt;Year&lt;/strong&gt; and &lt;strong&gt;Quarter&lt;/strong&gt;.  These provide buckets for aggregations.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Quarter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PeriodIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Q&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;%Y-0%q&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We also want to improve the readability of the &lt;strong&gt;Variant&lt;/strong&gt; feature.&lt;/p&gt;
&lt;p&gt;The Original Data Set uses WATA and CGC notes for the &lt;strong&gt;Variant&lt;/strong&gt; column:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;*Made in Japan, NFR (Challenge Set)&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA and Canada&amp;quot;&amp;quot; Text&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ TM - Left Bros.&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ TM - Right Bros.&lt;/li&gt;
&lt;li&gt;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA, Canada and Mexico&amp;quot;&amp;quot; Text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To simplify the analysis, we shorten these variants to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NFR&lt;/li&gt;
&lt;li&gt;CAN&lt;/li&gt;
&lt;li&gt;MEX&lt;/li&gt;
&lt;li&gt;LEFT &lt;/li&gt;
&lt;li&gt;RIGHT &lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We create a quick lookup &lt;strong&gt;DataFrame&lt;/strong&gt; and merge it into our working &lt;strong&gt;DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;tag_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Variant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, NFR (Challenge Set)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ TM - Left Bros.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ TM - Right Bros.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA, Canada and Mexico&amp;quot;&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;*Made in Japan, Oval SOQ R - &amp;quot;&amp;quot;USA and Canada&amp;quot;&amp;quot; Text&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;MEX&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NFR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CAN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RIGHT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;MEX&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CAN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

 &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Variant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This results in:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Grader&lt;span class="w"&gt; &lt;/span&gt;Box&lt;span class="w"&gt; &lt;/span&gt;Seal&lt;span class="w"&gt; &lt;/span&gt;Variant&lt;span class="w"&gt;                                           &lt;/span&gt;Price&lt;span class="w"&gt;  &lt;/span&gt;Auction&lt;span class="w"&gt;           &lt;/span&gt;Date&lt;span class="w"&gt;       &lt;/span&gt;Year&lt;span class="w"&gt; &lt;/span&gt;Quarter&lt;span class="w"&gt; &lt;/span&gt;Var_Tag
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ... 2880.0 Heritage Auctions 2023-11-30 2023 2023-04 MEX&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.4 A    *Made in Japan, Oval SOQ R - &amp;quot;&lt;/span&gt;USA,&lt;span class="w"&gt; &lt;/span&gt;Canada&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;...&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2160&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;Heritage&lt;span class="w"&gt; &lt;/span&gt;Auctions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2023&lt;/span&gt;-11-30&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2023&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2023&lt;/span&gt;-04&lt;span class="w"&gt; &lt;/span&gt;MEX
WATA&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;A&lt;span class="w"&gt;    &lt;/span&gt;*Made&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Japan,&lt;span class="w"&gt; &lt;/span&gt;Oval&lt;span class="w"&gt; &lt;/span&gt;SOQ&lt;span class="w"&gt; &lt;/span&gt;R&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;USA, Canada and ... 1159.0 Goldin            2023-11-18 2023 2023-04 MEX&lt;/span&gt;
&lt;span class="s2"&gt;WATA   9.6 A+   *Made in Japan, NFR (Challenge Set)               2160.0 Heritage Auctions 2023-11-04 2023 2023-04 NFR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
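A left merge silently leaves NaN in Var_Tag for any Variant string absent from the lookup table, so it pays to count the misses. A quick check on toy rows (hypothetical data, not the real data set):

```python
import pandas as pd

# Toy frames standing in for the working df and tag_df.
toy = pd.DataFrame({"Variant": ["*Made in Japan, NFR (Challenge Set)",
                                "Some unmapped variant text"]})
toy_tags = pd.DataFrame({"Variant": ["*Made in Japan, NFR (Challenge Set)"],
                         "Var_Tag": ["NFR"]})

merged = toy.merge(toy_tags, on="Variant", how="left")
# Rows whose Variant string failed to match the lookup table get NaN.
misses = merged["Var_Tag"].isna().sum()
print(misses)  # -> 1
```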

&lt;p&gt;We drop the noisy and unused columns:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Grader&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Variant&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Auction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I synthesize a Categorical &lt;strong&gt;Price_Tag&lt;/strong&gt; from the Numerical &lt;strong&gt;Price&lt;/strong&gt; column.  This allows us to visualize prices in four buckets:  &lt;strong&gt;Lowest, Low, High&lt;/strong&gt; and &lt;strong&gt;Highest&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qcut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Lowest&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Low&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;High&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Highest&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
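To see where qcut draws the four bucket boundaries, pass retbins=True. A sketch with hypothetical prices:

```python
import pandas as pd

# Toy prices; retbins=True also returns the quartile edges qcut chose,
# which documents what 'Lowest' through 'Highest' actually mean.
prices = pd.Series([100, 200, 400, 800, 1600, 3200, 6400, 12800])
tags, bins = pd.qcut(prices, q=4,
                     labels=["Lowest", "Low", "High", "Highest"],
                     retbins=True)
print(tags.value_counts().to_dict())  # equal-sized buckets of 2
print(bins)                           # the five bucket edges
```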

&lt;p&gt;The new &lt;strong&gt;Price_Tag&lt;/strong&gt; feature makes it quick and easy to plot three variables at once.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Displot&lt;/strong&gt; renders a &lt;strong&gt;heat map&lt;/strong&gt; that relates the &lt;strong&gt;Box&lt;/strong&gt; grade, the &lt;strong&gt;Seal&lt;/strong&gt; grade and the &lt;strong&gt;Price&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Heatmap Box Grade, Seal Grade and Price" src="https://john.soban.ski/images/Smb3/03_Dis_Plot.png"&gt;&lt;/p&gt;
&lt;p&gt;High &lt;strong&gt;Box&lt;/strong&gt; grades cluster around A+ &lt;strong&gt;Seal&lt;/strong&gt; grades, and low &lt;strong&gt;Box&lt;/strong&gt; grades cluster around the B+ &lt;strong&gt;Seal&lt;/strong&gt; grade.  &lt;strong&gt;Price&lt;/strong&gt; (Highest = Red) correlates with high &lt;strong&gt;Box&lt;/strong&gt; and &lt;strong&gt;Seal&lt;/strong&gt; Grades.&lt;/p&gt;
&lt;p&gt;Notice that the price for an 8.5 &lt;strong&gt;Box&lt;/strong&gt; with an A &lt;strong&gt;Seal&lt;/strong&gt; registers higher than the price for an 8.5 &lt;strong&gt;Box&lt;/strong&gt; with an A+ &lt;strong&gt;Seal&lt;/strong&gt; (Green vs. Orange).&lt;/p&gt;
&lt;h2&gt;Date and Variant&amp;#39;s Effect on Price&lt;/h2&gt;
&lt;p&gt;The sealed, graded Video Game collectible market has both spiked and crashed since 2021.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The next chart shows the mean price over the years, with &lt;strong&gt;error bands&lt;/strong&gt; that depict variation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Smoothed out Price x Date" src="https://john.soban.ski/images/Smb3/04_Rough_Date.png"&gt;&lt;/p&gt;
&lt;p&gt;We see a peak in 2021.&lt;/p&gt;
&lt;p&gt;Look at the sales data in terms of &lt;strong&gt;Box&lt;/strong&gt; Grade and &lt;strong&gt;Date&lt;/strong&gt; Sold:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Date&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The highest sales cluster around the high &lt;strong&gt;Box&lt;/strong&gt; Grades and 2021.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Kernel Density Estimation Plot" src="https://john.soban.ski/images/Smb3/05_Kde_Sales.png"&gt;&lt;/p&gt;
&lt;p&gt;Along with the purchase &lt;strong&gt;Date&lt;/strong&gt;, the &lt;strong&gt;Variant&lt;/strong&gt; drives &lt;strong&gt;Price&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Date and Variants" src="https://john.soban.ski/images/Smb3/06_Date_Var.png"&gt;&lt;/p&gt;
&lt;p&gt;If we eyeball the chart, we see that the &lt;strong&gt;Left&lt;/strong&gt; Bros. &lt;strong&gt;Variant&lt;/strong&gt; trumps the &lt;strong&gt;Canada&lt;/strong&gt; (No Mexico) Variant, which in turn trumps the &lt;strong&gt;Right&lt;/strong&gt; Bros. and &lt;strong&gt;Mexico&lt;/strong&gt; variants.  &lt;strong&gt;NFR&lt;/strong&gt; sits at the bottom, which makes sense, because the seal contains the text &lt;strong&gt;Not for Resale&lt;/strong&gt;, which obscures the box art.&lt;/p&gt;
&lt;p&gt;We will rank the &lt;strong&gt;Variant&lt;/strong&gt; feature, therefore, in this order:  &lt;strong&gt;NFR &amp;lt; MEX &amp;lt; RIGHT &amp;lt; CAN &amp;lt; LEFT&lt;/strong&gt; &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Categorical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;NFR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;MEX&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;RIGHT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;CAN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LEFT&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
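Because the Categorical carries ordered=True, pandas sorts and compares the tags by rank rather than alphabetically. A small sketch with made-up rows:

```python
import pandas as pd

# Ordered categories rank NFR lowest and LEFT highest, so sorting and
# comparisons follow the collector ranking, not string order.
order = ["NFR", "MEX", "RIGHT", "CAN", "LEFT"]
s = pd.Series(pd.Categorical(["CAN", "LEFT", "NFR"],
                             categories=order, ordered=True))
print(s.sort_values().tolist())  # -> ['NFR', 'CAN', 'LEFT']
print((s > "MEX").tolist())      # -> [True, True, False]
```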

&lt;p&gt;We use this new &lt;strong&gt;Categorical&lt;/strong&gt; ranking to plot &lt;strong&gt;Box&lt;/strong&gt; Grade vs. Sale &lt;strong&gt;Price&lt;/strong&gt; vs. &lt;strong&gt;Var_Tag&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;9.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boxplot of Variants" src="https://john.soban.ski/images/Smb3/07_Boxplot_Var.png"&gt;&lt;/p&gt;
&lt;p&gt;Note that the (Purple) &lt;strong&gt;LEFT&lt;/strong&gt; Bros. variant outsells even higher-graded copies of the other variants.&lt;/p&gt;
&lt;p&gt;We need to pay attention to &lt;strong&gt;Date&lt;/strong&gt; and &lt;strong&gt;Variant&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Quick Aside:  Numerical analysis&lt;/h2&gt;
&lt;p&gt;Pandas provides tools to convert &lt;strong&gt;Tags&lt;/strong&gt; to &lt;strong&gt;Measurements&lt;/strong&gt; (&lt;strong&gt;Categorical&lt;/strong&gt; to &lt;strong&gt;Numeric&lt;/strong&gt;).  Machine Learning, for example, requires normalized numeric data.&lt;/p&gt;
&lt;p&gt;We create a numeric version of our &lt;strong&gt;DataFrame&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_num&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_num&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
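Note that factorize(sort=True) assigns integer codes in lexicographic label order, which may not match grade quality order, so it deserves a spot check. A small sketch with hypothetical Seal grades:

```python
import pandas as pd

# factorize(sort=True) codes labels in sorted string order; the second
# return value records which label each integer code stands for.
codes, uniques = pd.factorize(pd.Series(["B+", "A", "A+", "A"]), sort=True)
print(list(codes))    # -> [2, 0, 1, 0]
print(list(uniques))  # -> ['A', 'A+', 'B+']
```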

&lt;p&gt;I input this numeric &lt;strong&gt;DataFrame&lt;/strong&gt; into a &lt;a href="https://john.soban.ski/analytics-cheat-sheet.html"&gt;Correlation Graphing Function&lt;/a&gt; to produce a Heat Map:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Correlation Heatmap" src="https://john.soban.ski/images/Smb3/08_Corr_Heatmap.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Numeric&lt;/strong&gt; encoding of &lt;strong&gt;Seal&lt;/strong&gt; (NS, C+, B+, A, A++) allows us to use a &lt;strong&gt;Kernel Density Estimation&lt;/strong&gt; plot for &lt;strong&gt;Box&lt;/strong&gt; vs. &lt;strong&gt;Seal&lt;/strong&gt; vs. &lt;strong&gt;Price&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Box&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="KDE Plot" src="https://john.soban.ski/images/Smb3/09_Kde_Seal.png"&gt;&lt;/p&gt;
&lt;h2&gt;Normalize Price Over Time&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Price&lt;/strong&gt; lets us stack rank the different &lt;strong&gt;Variants&lt;/strong&gt; and &lt;strong&gt;Box&lt;/strong&gt; Grades of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The Sale &lt;strong&gt;Date&lt;/strong&gt; variable also impacts the Sale &lt;strong&gt;Price&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;We will remove (or at least mitigate) the effect of Sale &lt;strong&gt;Date&lt;/strong&gt; on our price data.&lt;/p&gt;
&lt;p&gt;We can choose from dozens of approaches.  I use the following to remove the impact of &lt;strong&gt;Date&lt;/strong&gt; on the &lt;strong&gt;Price&lt;/strong&gt; data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Calculate the &lt;strong&gt;Mean_Price&lt;/strong&gt; per Quarter&lt;/li&gt;
&lt;li&gt;Normalize each Sale &lt;strong&gt;Price&lt;/strong&gt; by its Quarter's &lt;strong&gt;Mean_Price&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
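The two steps above collapse into a groupby-transform one-liner, sketched here on toy rows rather than the real sales DataFrame:

```python
import pandas as pd

# transform('mean') broadcasts each Quarter's mean Price back onto
# its own rows, so the division normalizes in a single pass.
toy_df = pd.DataFrame({"Quarter": ["2021-01", "2021-01", "2021-02"],
                       "Price": [100.0, 300.0, 500.0]})
toy_df["Norm_Price"] = (toy_df["Price"]
                        / toy_df.groupby("Quarter")["Price"].transform("mean"))
print(toy_df["Norm_Price"].tolist())  # -> [0.5, 1.5, 1.0]
```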
&lt;p&gt;I first calculate the &lt;strong&gt;Mean_Price&lt;/strong&gt; per Quarter:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;af&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Quarter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


    &lt;span class="n"&gt;Quarter&lt;/span&gt; &lt;span class="n"&gt;Mean_Price&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;1176.000000&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;552.000000&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="mi"&gt;2019&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;1298.571429&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;2534.400000&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;1346.250000&lt;/span&gt;
&lt;span class="mi"&gt;5&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;11397.333333&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;   &lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;62800.000000&lt;/span&gt;
&lt;span class="mi"&gt;7&lt;/span&gt;   &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;13878.750000&lt;/span&gt;
&lt;span class="mi"&gt;8&lt;/span&gt;   &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;20340.000000&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;14948.571429&lt;/span&gt;
&lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;11228.571429&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;6408.333333&lt;/span&gt;
&lt;span class="mi"&gt;12&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;6932.000000&lt;/span&gt;
&lt;span class="mi"&gt;13&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;11226.800000&lt;/span&gt;
&lt;span class="mi"&gt;14&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;5524.090909&lt;/span&gt;
&lt;span class="mi"&gt;15&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="mf"&gt;9927.428571&lt;/span&gt;
&lt;span class="mi"&gt;16&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="mf"&gt;3001.333333&lt;/span&gt;
&lt;span class="mi"&gt;17&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="mf"&gt;8980.000000&lt;/span&gt;
&lt;span class="mi"&gt;18&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="mf"&gt;16562.375000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Mean Per Quarter" src="https://john.soban.ski/images/Smb3/10_Price_Per_Quarter.png"&gt;&lt;/p&gt;
&lt;p&gt;We merge this lookup table with the working &lt;strong&gt;DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;af&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Quarter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a feature &lt;strong&gt;Norm_Price&lt;/strong&gt; which records the sale &lt;strong&gt;Price&lt;/strong&gt; in &lt;strong&gt;units of Mean_Price&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calculate the normalized (against time) 85th percentile sale prices.  This gives us the &lt;strong&gt;Investment Grade&lt;/strong&gt; copies of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;investment_grade&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;Box&lt;/span&gt; &lt;span class="n"&gt;Seal&lt;/span&gt;   &lt;span class="n"&gt;Price&lt;/span&gt;    &lt;span class="n"&gt;Date&lt;/span&gt;        &lt;span class="n"&gt;Year&lt;/span&gt;    &lt;span class="n"&gt;Quarter&lt;/span&gt; &lt;span class="n"&gt;Var_Tag&lt;/span&gt; &lt;span class="n"&gt;Price_Tag&lt;/span&gt;   &lt;span class="n"&gt;Mean_Price&lt;/span&gt;  &lt;span class="n"&gt;Norm_Price&lt;/span&gt;
&lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;     &lt;span class="mi"&gt;60000&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;07&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="n"&gt;LEFT&lt;/span&gt;    &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;8980.000000&lt;/span&gt; &lt;span class="mf"&gt;6.681514&lt;/span&gt;
&lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;     &lt;span class="mi"&gt;108000&lt;/span&gt;   &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt; &lt;span class="n"&gt;LEFT&lt;/span&gt;    &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;16562.37500&lt;/span&gt; &lt;span class="mf"&gt;6.520804&lt;/span&gt;
&lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;    &lt;span class="mi"&gt;60000&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="mi"&gt;2023&lt;/span&gt;    &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt; &lt;span class="n"&gt;MEX&lt;/span&gt;     &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;9927.428571&lt;/span&gt; &lt;span class="mf"&gt;6.043861&lt;/span&gt;
&lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;    &lt;span class="mi"&gt;60000&lt;/span&gt;    &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;  &lt;span class="mi"&gt;2022&lt;/span&gt;    &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt; &lt;span class="n"&gt;MEX&lt;/span&gt;     &lt;span class="n"&gt;Highest&lt;/span&gt;     &lt;span class="mf"&gt;11226.80000&lt;/span&gt; &lt;span class="mf"&gt;5.344355&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
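&lt;p&gt;The normalization and percentile filter above can be sketched end to end on toy data.  A minimal sketch, assuming pandas; the quarters and prices below are invented for illustration, and &lt;strong&gt;cutoff&lt;/strong&gt; is a hypothetical local name:&lt;/p&gt;

```python
import pandas as pd

# Toy sales data: two quarters with different market levels
# (prices invented for illustration).
df = pd.DataFrame({
    'Quarter': ['2023-03', '2023-03', '2023-04', '2023-04'],
    'Price':   [60000, 9000, 108000, 16000],
})

# Quarterly mean lookup table, merged back onto every sale.
af = df.groupby('Quarter', as_index=False)['Price'].mean()
af = af.rename(columns={'Price': 'Mean_Price'})
df = df.merge(af, on=['Quarter'], how='left')

# Express each sale Price in units of its quarterly Mean_Price.
df['Norm_Price'] = df['Price'] / df['Mean_Price']

# Keep only sales above the 85th percentile of Norm_Price.
cutoff = df['Norm_Price'].quantile(0.85)
investment_grade = df.query('Norm_Price > @cutoff')
```

&lt;p&gt;On this toy frame only the $108,000 sale survives the filter, since normalization strips out the quarter-to-quarter market drift before the percentile cut.&lt;/p&gt;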

&lt;p&gt;A &lt;strong&gt;GROUP BY&lt;/strong&gt; operation summarizes the &lt;strong&gt;Investment Grade&lt;/strong&gt; copies of SMB3 by &lt;strong&gt;Variant&lt;/strong&gt;, &lt;strong&gt;Box Grade&lt;/strong&gt;, and &lt;strong&gt;Seal Grade&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;investment_grade&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;Var_Tag&lt;/span&gt; &lt;span class="n"&gt;Box&lt;/span&gt; &lt;span class="n"&gt;Seal&lt;/span&gt;
&lt;span class="n"&gt;CAN&lt;/span&gt;     &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="n"&gt;LEFT&lt;/span&gt;    &lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="n"&gt;A&lt;/span&gt;
        &lt;span class="mf"&gt;9.2&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;     &lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
        &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="n"&gt;RIGHT&lt;/span&gt;   &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Video Game Collectors drive high demand for sealed copies of &lt;strong&gt;Super Mario Bros. 3&lt;/strong&gt;.  Nintendo released at least five different &lt;strong&gt;Variants&lt;/strong&gt; of the game.&lt;/p&gt;
&lt;p&gt;Our analysis recommends the following &lt;strong&gt;Investment Grade&lt;/strong&gt; copies:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;DISCLAIMER: I base the information on this blog on my personal opinion and experience, and you MUST NOT consider it professional financial investment advice. Never act on my opinions without first assessing your own personal and financial situation, and you MUST consult a financial professional before making any investment. Keep in mind that I will change my thoughts and opinions over time as I learn and accumulate more knowledge. I am NOT a financial professional! This blog is not a place for giving or receiving financial advice, advice concerning investment decisions, or tax or legal advice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Left Bros. = &lt;strong&gt;9.0 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Right Bros. = &lt;strong&gt;9.6 A+&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text = &lt;strong&gt;9.8 A+&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text = &lt;strong&gt;9.4 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, NFR (Challenge Set) = &lt;strong&gt;Avoid&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Coda&lt;/h2&gt;
&lt;p&gt;The minimum recommendation for the &lt;strong&gt;Canada&lt;/strong&gt; version seemed high to me. I suspected this resulted from a high &lt;strong&gt;Quarterly Mean&lt;/strong&gt; for that time, so I executed the model with a broader bucket.  I used &lt;strong&gt;Yearly Mean&lt;/strong&gt; instead of &lt;strong&gt;Quarterly Mean&lt;/strong&gt; via:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;afy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;A&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Date&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;afy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Year&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;left&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Mean_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;investment_grade&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;investment_grade&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Var_Tag&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Box&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Seal&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Norm_Price&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
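&lt;p&gt;The yearly bucket admits a simpler construction: a plain &lt;strong&gt;GROUP BY&lt;/strong&gt; on &lt;strong&gt;Year&lt;/strong&gt; produces the same per-year means as the set_index/resample chain.  A minimal sketch, assuming pandas, with invented dates and prices:&lt;/p&gt;

```python
import pandas as pd

# Toy sales spanning two years (prices invented for illustration).
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-08-05', '2022-10-01',
                            '2023-01-20', '2023-07-27']),
    'Price': [60000, 12000, 9000, 60000],
})
df['Year'] = df['Date'].dt.year

# Yearly mean lookup table: a plain groupby on Year yields the
# same per-year means as the set_index/resample chain.
afy = df.groupby('Year', as_index=False)['Price'].mean()
afy = afy.rename(columns={'Price': 'Mean_Price'})

# Broadcast the yearly mean onto each sale and normalize.
df = df.merge(afy, on=['Year'], how='left')
df['Norm_Price'] = df['Price'] / df['Mean_Price']
```

&lt;p&gt;The resample route matters only when the index must stay a true time series; for a lookup table keyed on &lt;strong&gt;Year&lt;/strong&gt;, the groupby alone suffices.&lt;/p&gt;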

&lt;p&gt;This outputs:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;MEX&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.2&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.4&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;          &lt;/span&gt;A+
&lt;span class="w"&gt;          &lt;/span&gt;A++
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.8&lt;span class="w"&gt; &lt;/span&gt;A++
RIGHT&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;          &lt;/span&gt;A+
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.8&lt;span class="w"&gt; &lt;/span&gt;A+
CAN&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;A++
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.8&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;span class="w"&gt;          &lt;/span&gt;A++
LEFT&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.0&lt;span class="w"&gt; &lt;/span&gt;A
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.2&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;.4&lt;span class="w"&gt; &lt;/span&gt;A+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The updated analysis recommends the following &lt;strong&gt;Investment Grade&lt;/strong&gt; copies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Left Bros. = &lt;strong&gt;9.0 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ TM - Right Bros. = &lt;strong&gt;9.6 A&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA and Canada&amp;quot; Text = &lt;strong&gt;9.6 A++&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, Oval SOQ R - &amp;quot;USA, Canada and Mexico&amp;quot; Text = &lt;strong&gt;9.2 A+&lt;/strong&gt; or Better&lt;/li&gt;
&lt;li&gt;Made in Japan, NFR (Challenge Set) = &lt;strong&gt;Avoid&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I dove into the data and it appears that the &lt;strong&gt;Mexico&lt;/strong&gt; variant sells for multiples of the yearly average.&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;CRAZY&lt;/strong&gt; 2021, for example, a humble 9.2 A+ &lt;strong&gt;Mexico&lt;/strong&gt; variant sold for over 1.5x the yearly average of $15k.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Var_Tag&lt;/span&gt;  &lt;span class="n"&gt;Box&lt;/span&gt; &lt;span class="n"&gt;Seal&lt;/span&gt; &lt;span class="n"&gt;Price&lt;/span&gt;   &lt;span class="n"&gt;Year&lt;/span&gt; &lt;span class="n"&gt;Mean_Price&lt;/span&gt;   &lt;span class="n"&gt;Norm_Price&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;60000.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;7.490089&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;60000.0&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mf"&gt;9437.159091&lt;/span&gt;  &lt;span class="mf"&gt;6.357846&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.4&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;    &lt;span class="mf"&gt;3600.0&lt;/span&gt;  &lt;span class="mi"&gt;2019&lt;/span&gt; &lt;span class="mf"&gt;1194.000000&lt;/span&gt;  &lt;span class="mf"&gt;3.015075&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.8&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;23400.0&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt; &lt;span class="mf"&gt;9437.159091&lt;/span&gt;  &lt;span class="mf"&gt;2.479560&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;    &lt;span class="mf"&gt;2880.0&lt;/span&gt;  &lt;span class="mi"&gt;2019&lt;/span&gt; &lt;span class="mf"&gt;1194.000000&lt;/span&gt;  &lt;span class="mf"&gt;2.412060&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;19200.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;2.396829&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="mf"&gt;33600.0&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt; &lt;span class="mf"&gt;15033.125000&lt;/span&gt; &lt;span class="mf"&gt;2.235064&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;    &lt;span class="mf"&gt;16800.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;2.097225&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="mf"&gt;28800.0&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt; &lt;span class="mf"&gt;15033.125000&lt;/span&gt; &lt;span class="mf"&gt;1.915769&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.6&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="mf"&gt;14400.0&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt; &lt;span class="mf"&gt;8010.585366&lt;/span&gt;  &lt;span class="mf"&gt;1.797621&lt;/span&gt;
&lt;span class="n"&gt;MEX&lt;/span&gt;      &lt;span class="mf"&gt;9.2&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;   &lt;span class="mf"&gt;22800.0&lt;/span&gt; &lt;span class="mi"&gt;2021&lt;/span&gt; &lt;span class="mf"&gt;15033.125000&lt;/span&gt; &lt;span class="mf"&gt;1.516651&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Yes, in 2021 someone paid $22,800 for the &lt;strong&gt;Mexico&lt;/strong&gt; variant in 9.2 A+ grade.  Compare that to a 9.8 A++ &lt;strong&gt;Mexico&lt;/strong&gt; variant that sold in 2023 for just $600 more.&lt;/p&gt;
&lt;h2&gt;Links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/c/getthegreggames/videos"&gt;GetTheGregGames&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/@ReservedInvestments"&gt;Reserved Investments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gamerstonks.com/"&gt;GamerStonks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Python"></category><category term="Pandas"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Beat Artificial Intelligence (AI) with Active, Present Voice</title><link href="https://john.soban.ski/beat-ai.html" rel="alternate"></link><published>2023-10-28T01:23:00-04:00</published><updated>2023-10-28T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-10-28:/beat-ai.html</id><summary type="html">&lt;p&gt;When I ask ChatGPT to write detailed technical content, I find that it outputs hollow, bombastic, and meandering prose. &lt;/p&gt;
&lt;p&gt;Such output places a high cognitive load on my mind:  My subconscious must fill in the holes of the passive, cliche-ridden, and obtuse narratives.  I long for clear communication.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Shocked Robot" src="https://john.soban.ski/images/Beat_Ai/01_Shocked_Robot.png"&gt;&lt;/p&gt;
&lt;p&gt;In this …&lt;/p&gt;</summary><content type="html">&lt;p&gt;When I ask ChatGPT to write detailed technical content, I find that it outputs hollow, bombastic, and meandering prose. &lt;/p&gt;
&lt;p&gt;Such output places a high cognitive load on my mind:  My subconscious must fill in the holes of the passive, cliche-ridden, and obtuse narratives.  I long for clear communication.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Shocked Robot" src="https://john.soban.ski/images/Beat_Ai/01_Shocked_Robot.png"&gt;&lt;/p&gt;
&lt;p&gt;In this blog post, I collect some ChatGPT &lt;strong&gt;tells&lt;/strong&gt;, discuss the issues with these &lt;strong&gt;tells&lt;/strong&gt; and then recommend how to fight the AI leviathan.&lt;/p&gt;
&lt;h2&gt;ChatGPT&lt;/h2&gt;
&lt;p&gt;In December 2015, Elon Musk, Greg Brockman, Ilya Sutskever, and Sam Altman pledged over $1B (USD) to found OpenAI.  Last year, OpenAI launched ChatGPT, a user-friendly service built on its Generative Pre-trained Transformer (GPT) family of Large Language Models (LLM).  Traditional data services &lt;strong&gt;retrieve&lt;/strong&gt; content, while ChatGPT &lt;strong&gt;generates&lt;/strong&gt; content (notice the subtle difference).&lt;/p&gt;
&lt;p&gt;LLMs allow users to generate data that never existed.  Generative AI, for example, can create an answer to the question "Name Socrates' Favorite 90s Sitcom."&lt;/p&gt;
&lt;h2&gt;ChatGPT Output&lt;/h2&gt;
&lt;p&gt;ChatGPT output appears impressive at first read.  The initial glow, however, fades in seconds.  The prose then injects the reader with a mild sense of discomfort.&lt;/p&gt;
&lt;p&gt;Consider the following output from ChatGPT:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the intricate world of digital communication the ability to label and prioritize data is paramount, especially when it comes to real-time data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At first, I read this and thought:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wow!  Tell me more.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then, I felt unease and thought: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wait, that statement says nothing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The ChatGPT-constructed sentence values showy cadence over communication.  &lt;/p&gt;
&lt;p&gt;Take the phrase:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the intricate world of digital communication&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It sounds impressive but conveys nothing.  &lt;strong&gt;Intricate&lt;/strong&gt; lacks precision; &lt;strong&gt;intricate world&lt;/strong&gt; lacks authority.&lt;/p&gt;
&lt;p&gt;I believe ChatGPT wants to convey the importance of &lt;strong&gt;labeled/prioritized&lt;/strong&gt; data, and I fail to see how the adjective &lt;strong&gt;intricate&lt;/strong&gt; strengthens that argument.&lt;/p&gt;
&lt;p&gt;The sentence continues with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the ability to label and prioritize data is paramount&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I ask, &lt;strong&gt;Paramount to Whom?&lt;/strong&gt;  I also need to take a step back and remember the definition of &lt;strong&gt;Paramount&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Paramount: Chief in importance or impact&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;ChatGPT hallucinates here.  I consider the ability to label and prioritize real-time data a useful feature of digital communications, but I do not consider it &lt;strong&gt;chief in importance or impact.&lt;/strong&gt;  I consider the ability to transmit packets from source to sink &lt;strong&gt;chief in importance&lt;/strong&gt; for digital communications.&lt;/p&gt;
&lt;p&gt;Consider another ChatGPT-authored example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Phishing detection is of paramount importance in today's digital landscape as cybercriminals continuously refine their deceptive tactics to trick individuals and organizations into divulging sensitive information. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have experience with computer security and know that most security breaches result from Phishing attacks.  In that light, I would consider Phishing Detection &lt;strong&gt;chief in importance or impact&lt;/strong&gt; to computer security.  While correct, ChatGPT makes this claim without any evidence.  &lt;/p&gt;
&lt;p&gt;ChatGPT then uses the phrase &lt;strong&gt;today's digital landscape&lt;/strong&gt;.  That poetic phrase sounds impressive but lacks meaning, or at the very least carries no weight in the context of this sentence.&lt;/p&gt;
&lt;p&gt;I use my &lt;a href="https://kagi.com/"&gt;Kagi Search Engine&lt;/a&gt; (Non-affiliate link) to retrieve a definition of &lt;strong&gt;Digital Landscape&lt;/strong&gt;.  It returns dozens of different definitions, including one for advertising, one for online games, and one for communications infrastructure policy.  &lt;/p&gt;
&lt;p&gt;I consider &lt;strong&gt;Digital Landscape&lt;/strong&gt; a marketing buzzword, an empty vessel for &lt;strong&gt;content creators&lt;/strong&gt; to fill in a way that serves their narrative.&lt;/p&gt;
&lt;p&gt;ChatGPT should define &lt;strong&gt;digital landscape&lt;/strong&gt; to remove ambiguity.&lt;/p&gt;
&lt;h2&gt;ChatGPT Tells&lt;/h2&gt;
&lt;p&gt;I ask ChatGPT to create two-sentence summaries for a variety of topics.  I then highlight any turns of phrase that lead to sensations of discomfort.&lt;/p&gt;
&lt;p&gt;I organize them here in the hopes that a pattern will emerge.&lt;/p&gt;
&lt;p&gt;I name the first Category &lt;strong&gt;Empty Phrases That Almost Sound Smart&lt;/strong&gt;.  ChatGPT peppers these phrases in introductory paragraphs, without justification or evidence.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In today's &lt;strong&gt;digital landscape&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;intricate world&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;this era&lt;/strong&gt; of&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unlocking new horizons&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;strategic imperative&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Recognizing the &lt;strong&gt;inherent challenges in this domain&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;and the &lt;strong&gt;unique challenges they face&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;an increasingly important part of &lt;strong&gt;our global economy&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I call the next category &lt;strong&gt;Unsubstantiated Grandiosity&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Infrastructure as Code (IaC) is a &lt;strong&gt;paradigm-shifting&lt;/strong&gt; approach&lt;/li&gt;
&lt;li&gt;Businesses can achieve &lt;strong&gt;unprecedented&lt;/strong&gt; agility&lt;/li&gt;
&lt;li&gt;These models provide valuable insights...making them &lt;strong&gt;indispensable&lt;/strong&gt; for decision-making&lt;/li&gt;
&lt;li&gt;Graphics Processing Units (GPUs) have &lt;strong&gt;transcended&lt;/strong&gt; their original role in rendering graphics to become &lt;strong&gt;indispensable accelerators&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Digital advertising holds &lt;strong&gt;immense significance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Blockchain... brings &lt;strong&gt;unprecedented efficiency&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next up, &lt;strong&gt;Empty Cliches in Passive Voice&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Phishing detection &lt;strong&gt;is of paramount importance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Regression models &lt;strong&gt;are powerful tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;...making robust security frameworks &lt;strong&gt;a paramount concern&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Embracing and diligently implementing security frameworks &lt;strong&gt;is not just&lt;/strong&gt; a regulatory necessity but a strategic imperative&lt;/li&gt;
&lt;li&gt;Cleaning data &lt;strong&gt;is the essential first step&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Adverbs&lt;/strong&gt; speak for themselves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to share and decipher data &lt;strong&gt;seamlessly&lt;/strong&gt; is paramount&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;specifically&lt;/strong&gt; designed for Banks and the unique challenges they face&lt;/li&gt;
&lt;li&gt;an &lt;strong&gt;increasingly&lt;/strong&gt; important part of our global economy&lt;/li&gt;
&lt;li&gt;Embracing and &lt;strong&gt;diligently&lt;/strong&gt; implementing security frameworks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Kevin and Dustin in Outbreak" src="https://john.soban.ski/images/Beat_Ai/02_Outbreak_Quote.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;KEVIN SPACEY (as Casey Schuler in Outbreak): It’s an adverb, Sam. It’s a lazy tool of a weak mind.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I call the next group &lt;strong&gt;Cringe Words&lt;/strong&gt;, words that actual humans never use in normal conversation unless they want to appear smart while high on Meth.  &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Indeed&lt;/li&gt;
&lt;li&gt;Alas&lt;/li&gt;
&lt;li&gt;Especially&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I maintain that the word &lt;strong&gt;Indeed&lt;/strong&gt; signals ChatGPT text more than any other &lt;strong&gt;tell&lt;/strong&gt;, so I wrote a bash script to detect ChatGPT prose:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;indeed&lt;span class="w"&gt; &lt;/span&gt;prose.txt&lt;span class="sb"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="nv"&gt;$str&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;then&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ChatGPT wrote this&amp;#39;&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I call the next group &lt;strong&gt;Midwit Words&lt;/strong&gt;, words that people use to appear smart.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Leverage (Instead of use)&lt;/li&gt;
&lt;li&gt;Ensure (Instead of an appropriate verb)&lt;/li&gt;
&lt;li&gt;Utilize (Instead of use)&lt;/li&gt;
&lt;li&gt;Penultimate (Instead of Ultimate; penultimate actually means second-to-last)&lt;/li&gt;
&lt;/ul&gt;
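The bash one-liner above generalizes to the full tell list. This sketch (my own illustration, not a published tool; the word list contains examples drawn from this post) counts every tell word in a block of text:

```python
import re

# Illustrative tell words pulled from the lists in this post.
TELL_WORDS = ["indeed", "alas", "leverage", "utilize", "paramount",
              "unprecedented", "seamlessly", "penultimate"]


def find_tells(text):
    """Return a dict of tell word -> occurrence count (case-insensitive)."""
    counts = {}
    for word in TELL_WORDS:
        hits = re.findall(r"\b" + word + r"\b", text, flags=re.IGNORECASE)
        if hits:
            counts[word] = len(hits)
    return counts


sample = "Indeed, businesses leverage unprecedented agility."
print(find_tells(sample))  # {'indeed': 1, 'leverage': 1, 'unprecedented': 1}
```

A high count flags prose for a rewrite; a zero count, of course, proves nothing on its own.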
&lt;h2&gt;My Recommendation&lt;/h2&gt;
&lt;p&gt;ChatGPT produces wordy, loose text, light on meaning or evidence, and peppered with bombastic flourishes and empty cliches.  Authors must tighten their prose to separate their work from ChatGPT output.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ChatGPT Avatar" src="https://john.soban.ski/images/Beat_Ai/03_Chat_Avatar.png"&gt;&lt;/p&gt;
&lt;p&gt;I recommend an active, present voice that follows the &lt;a href="https://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object_word_order"&gt;Subject-Verb-Object (SVO)&lt;/a&gt; template.  I advise against adverbs; choose a precise verb instead.  Do not use adjectives without proper context.  Never write a cliche.&lt;/p&gt;
&lt;p&gt;Take the original ChatGPT sentence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In the intricate world of digital communication the ability to label and prioritize data is paramount, especially when it comes to real-time data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rewrite it to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Digital Communications systems label and prioritize Real-Time Data to prevent jitter and buffering&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I encourage you to investigate &lt;a href="https://en.wikipedia.org/wiki/E-Prime"&gt;E-Prime&lt;/a&gt;, an upgrade to the English language that prioritizes clarity, precision and respect for the listener.&lt;/p&gt;
&lt;p&gt;D. David Bourland Jr. studied under General Semantics founder &lt;a href="https://en.wikipedia.org/wiki/Alfred_Korzybski"&gt;Alfred Korzybski&lt;/a&gt; and developed E-Prime.&lt;/p&gt;
&lt;p&gt;E-Prime removes all forms of the verb &lt;strong&gt;to be&lt;/strong&gt;.  This list includes (along with their contractions and negative contractions):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;am&lt;/li&gt;
&lt;li&gt;is&lt;/li&gt;
&lt;li&gt;are&lt;/li&gt;
&lt;li&gt;was&lt;/li&gt;
&lt;li&gt;were&lt;/li&gt;
&lt;/ul&gt;
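A short sketch (a toy illustration with a hand-picked word list, extended beyond the five forms above to cover a few common contractions) flags E-Prime violations in a sentence:

```python
import re

# Forms of "to be" from the list above, plus a few illustrative
# contractions and negative contractions.
TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being",
         "isn't", "aren't", "wasn't", "weren't", "i'm", "it's"}


def violates_e_prime(sentence):
    """Return the list of to-be forms found in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w in TO_BE]


print(violates_e_prime("OpenAI was founded in 2015"))       # ['was']
print(violates_e_prime("Four researchers founded OpenAI"))  # []
```

The first sentence fails the check; the rewritten active-voice version passes.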
&lt;p&gt;The verb &lt;strong&gt;to be&lt;/strong&gt; short circuits the &lt;a href="https://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object_word_order"&gt;Subject-Verb-Object (SVO)&lt;/a&gt; word order.  It allows lazy, imprecise writing.  &lt;strong&gt;To be&lt;/strong&gt; verbs drive passive voice:&lt;/p&gt;
&lt;p&gt;Consider&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenAI was founded in 2015&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Versus&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Elon Musk, Greg Brockman, Ilya Sutskever and Sam Altman founded OpenAI in 2015&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Consider this lazy, loose sentence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenDaylight is a Software Defined Network (SDN) Controller.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I must work harder to construct a sentence without the verb &lt;strong&gt;to be&lt;/strong&gt;.  I need to select subjects and verbs, and provide evidence.  The additional information benefits the reader:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Internet Service Providers (ISP), Cloud Service Providers (CSP), Data Center Engineers, and Academics use the OpenDaylight (ODL) platform to tailor and automate computer networks. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kellogg and Bourland write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[misuse of the verb &lt;strong&gt;to be&lt;/strong&gt; creates] a deity mode of speech [and allows] even the most ignorant to transform their opinions magically into god-like pronouncements on the nature of things - Kellogg, E. W.; Bourland Jr., D. David (1990). "Working with E-Prime: Some Practical Notes" (PDF). Et Cetera. 47 (4): 376–392.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I use E-Prime exclusively on my blog.  I encourage you to read my other posts and notice how I avoid the verb &lt;strong&gt;to be&lt;/strong&gt; and write with a clear, active, SVO voice.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;ChatGPT provides a useful service.  It generates prose in seconds, based on little user input.&lt;/p&gt;
&lt;p&gt;ChatGPT excels in some use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need bullets for advertising copy&lt;/li&gt;
&lt;li&gt;You need an invite for an event or meetup&lt;/li&gt;
&lt;li&gt;You need a quick outline&lt;/li&gt;
&lt;li&gt;You have writer's block and need a nudge&lt;/li&gt;
&lt;li&gt;You need to produce rote/ box-checking paperwork &lt;/li&gt;
&lt;li&gt;You need copious "good enough" content to increase your web presence&lt;/li&gt;
&lt;li&gt;Your company values &lt;strong&gt;looking busy&lt;/strong&gt; over &lt;strong&gt;productivity&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you need to communicate a clear message to your user, I recommend you craft your narrative by hand and use SVO, deliberate adjectives, and E-Prime.&lt;/p&gt;</content><category term="Data Science"></category><category term="NLP"></category><category term="Machine Learning"></category></entry><entry><title>Data Exploration with Data Viz Cheat Sheet</title><link href="https://john.soban.ski/analytics-cheat-sheet.html" rel="alternate"></link><published>2023-09-30T10:26:00-04:00</published><updated>2023-09-30T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-09-30:/analytics-cheat-sheet.html</id><summary type="html">&lt;p&gt;Today I collect and organize useful data visualization (Data Viz) tools that aid data exploration.  &lt;/p&gt;
&lt;p&gt;I illustrate the use of the tools via the classic &lt;strong&gt;Abalone&lt;/strong&gt; database, hosted on the University of California, Irvine (UCI) Machine Learning repository website.&lt;/p&gt;
&lt;p&gt;I recommend you bookmark this and return to it when you …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today I collect and organize useful data visualization (Data Viz) tools that aid data exploration.  &lt;/p&gt;
&lt;p&gt;I illustrate the use of the tools via the classic &lt;strong&gt;Abalone&lt;/strong&gt; database, hosted on the University of California, Irvine (UCI) Machine Learning repository website.&lt;/p&gt;
&lt;p&gt;I recommend you bookmark this and return to it when you need to find the syntax and semantics of popular data viz constructs.&lt;/p&gt;
&lt;h2&gt;Get the Data&lt;/h2&gt;
&lt;p&gt;PhD student David Aha created the University of California, Irvine (UCI) Machine Learning repository in 1987 in the form of a File Transfer Protocol (FTP) site.  The Repo collects databases, domain theories, and data generators.  Today I use the &lt;a href="https://archive.ics.uci.edu/dataset/1/abalone"&gt;Abalone&lt;/a&gt; database.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Abalone&lt;/strong&gt; database provides a table of roughly four thousand observations, each of which contains one categorical feature, seven continuous features, and one target:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Features, Categorical&lt;ul&gt;
&lt;li&gt;Sex: Male, Female, and Infant       &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Features, Continuous&lt;ul&gt;
&lt;li&gt;Length: Longest shell measurement (mm)&lt;/li&gt;
&lt;li&gt;Diameter: Perpendicular to length (mm)&lt;/li&gt;
&lt;li&gt;Height: With meat in the shell (mm)&lt;/li&gt;
&lt;li&gt;Whole_weight: Whole abalone (grams)&lt;/li&gt;
&lt;li&gt;Shucked_weight: Weight of meat (grams)&lt;/li&gt;
&lt;li&gt;Viscera_weight: Gut weight after bleeding (grams)&lt;/li&gt;
&lt;li&gt;Shell_weight: After being dried (grams)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Target, Integer&lt;ul&gt;
&lt;li&gt;Rings: +1.5 gives the age in years      &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I use the Python &lt;strong&gt;requests&lt;/strong&gt; library to pull the data straight from the UCI repo and stuff it into a Pandas DataFrame.&lt;/p&gt;
&lt;p&gt;I import the required libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I set the &lt;strong&gt;url&lt;/strong&gt; (String) and &lt;strong&gt;column_name&lt;/strong&gt; (List) variables to match the &lt;strong&gt;Abalone&lt;/strong&gt; database schema.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;url = &amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data&amp;#39;
column_names = [&amp;#39;Sex&amp;#39;,
                &amp;#39;Length&amp;#39;,
                &amp;#39;Diameter&amp;#39;,
                &amp;#39;Height&amp;#39;,
                &amp;#39;Whole_weight&amp;#39;,
                &amp;#39;Shucked_weight&amp;#39;,
                &amp;#39;Viscera_weight&amp;#39;,
                &amp;#39;Shell_weight&amp;#39;,
                &amp;#39;Rings&amp;#39;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Requests&lt;/strong&gt; downloads the raw bytes, &lt;strong&gt;decode()&lt;/strong&gt; converts them to a UTF-8 string, &lt;strong&gt;StringIO&lt;/strong&gt; wraps the string in a file-like object, and &lt;strong&gt;Pandas&lt;/strong&gt; loads it into a DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;r = requests.get(url).content
abalone_df = pd.read_csv(io.StringIO(r.decode(&amp;#39;utf-8&amp;#39;)),
                      names = column_names)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;One-Dimensional Statistical Summaries&lt;/h2&gt;
&lt;p&gt;We first explore the data in one dimension.&lt;/p&gt;
&lt;h3&gt;Histograms&lt;/h3&gt;
&lt;p&gt;Histograms provide a visual shorthand for the distribution of numerical data.  Think of a &lt;strong&gt;connect four&lt;/strong&gt; board, where you stack chips in different columns (or buckets).  Each chip represents a number in that bucket. &lt;/p&gt;
&lt;p&gt;Pandas provides a built-in &lt;strong&gt;hist()&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use Pandas to draw a Histogram of our target variable, &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Rings Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/01_Rings_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Most &lt;strong&gt;Abalone&lt;/strong&gt; have between 8 and 12 Rings (the tallest histogram buckets span 7.5 to 12.5).&lt;/p&gt;
&lt;p&gt;Pandas also accommodates our &lt;strong&gt;Categorical&lt;/strong&gt; feature.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Sex Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/02_Sex_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;The corpus of data includes roughly equal observations for &lt;strong&gt;Male, Female&lt;/strong&gt; and &lt;strong&gt;Infant&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Pandas allows us to run &lt;strong&gt;histograms&lt;/strong&gt; on all &lt;strong&gt;features&lt;/strong&gt;.  The method ignores the &lt;strong&gt;Categorical&lt;/strong&gt; feature.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="All Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/03_All_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;The results illustrate the need to &lt;strong&gt;Normalize&lt;/strong&gt; the data, since all the &lt;strong&gt;Continuous&lt;/strong&gt; features clock in under a value of &lt;strong&gt;one&lt;/strong&gt; (1), while the target ranges up to &lt;strong&gt;thirty&lt;/strong&gt; (30).&lt;/p&gt;
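A minimal sketch of that normalization, which min-max scales every column into [0, 1] with plain Pandas on a toy frame (my own example; scikit-learn's MinMaxScaler performs the same transform):

```python
import pandas as pd

# Toy frame with the same scale mismatch: weights under 1, Rings up to 30.
df = pd.DataFrame({'Whole_weight': [0.1, 0.5, 1.0],
                   'Rings': [3, 9, 30]})

# Min-max scale every column into the range [0, 1].
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled['Rings'].tolist())  # [0.0, 0.2222222222222222, 1.0]
```

After scaling, every feature shares the same range, so no single feature dominates a distance-based model.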
&lt;h3&gt;Hist with tags&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/tag/influxdb.html"&gt;InfluxDB&lt;/a&gt; uses the nomenclature &lt;strong&gt;Tags&lt;/strong&gt; and &lt;strong&gt;Measurements&lt;/strong&gt; to describe &lt;strong&gt;Categorical&lt;/strong&gt; and &lt;strong&gt;Continuous&lt;/strong&gt; variables.&lt;/p&gt;
&lt;p&gt;Tags provide a new dimension of visual data, &lt;strong&gt;slicing and dicing&lt;/strong&gt; the data into different categories.&lt;/p&gt;
&lt;p&gt;Seaborn provides the option to color by &lt;strong&gt;Tag&lt;/strong&gt; with their &lt;strong&gt;hue&lt;/strong&gt; parameter.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Sex Hue" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/04_Sex_Hue.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hue&lt;/strong&gt; does not make sense with &lt;strong&gt;Measurements&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Stupid Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/05_Stupid_Hist.png"&gt;&lt;/p&gt;
&lt;h3&gt;Kernel Density Estimation (KDE)&lt;/h3&gt;
&lt;p&gt;Kernel Density Estimation (KDE) smooths the Histograms.  Instead of discrete buckets, we see continuous lines that represent the distribution.&lt;/p&gt;
&lt;p&gt;I used the analogy above of a Histogram stacking chips on a connect four board.  KDE pours sand at each point, enough to fill a Standard Normal Distribution.  KDE in a sense stacks Standard Normal Distributions at each point, which leads to the smoothness of the plot.&lt;/p&gt;
&lt;p&gt;If you shrink the bucket width (increase the bin count), you can see the idea in action.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Default Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/06_Default_Bins.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Twentyfive Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/07_Twentyfive_Bins.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Fifty Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/08_Fifty_Bins.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Infinite Bins" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/09_Infinite_Bins.png"&gt;&lt;/p&gt;
&lt;p&gt;SNS will plot the KDE over the histogram if you instruct it to do so:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Whole_weight&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Kde Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/10_Kde_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Pandas plots all features' distributions with KDE.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="All Kde" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/11_All_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Boxplots&lt;/h2&gt;
&lt;p&gt;A glance at a Boxplot tells you the median, 25th percentile, 75th percentile, and outliers.&lt;/p&gt;
&lt;p&gt;The box spans the First and Third quartiles, and the whiskers extend to the furthest data points within 1.5 times the Interquartile Range (IQR) of the box edges (for both top and bottom); points beyond the whiskers plot as outliers.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Weight Box" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/12_Weight_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;SNS allows you to separate the chart by &lt;strong&gt;Tag&lt;/strong&gt;.  If you set &lt;strong&gt;y&lt;/strong&gt; equal to &lt;strong&gt;Sex&lt;/strong&gt;, for example, you see the distributions split by &lt;strong&gt;Male, Female, and Infant&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Sex Box" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/13_Sex_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;In the Boxplot above, we see that Female &lt;strong&gt;Abalone&lt;/strong&gt; weigh slightly more than Male &lt;strong&gt;Abalone&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Special Note: Enrich Data.&lt;/h3&gt;
&lt;p&gt;Remember that we have a &lt;strong&gt;target&lt;/strong&gt; variable named &lt;strong&gt;Rings&lt;/strong&gt;, which encompasses a range of numbers between one (1) and thirty (30).  I recommend you enrich the &lt;strong&gt;Rings&lt;/strong&gt; data with a new &lt;strong&gt;Tag&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The following code uses the &lt;strong&gt;Rings&lt;/strong&gt; value to set a new &lt;strong&gt;Tag&lt;/strong&gt;, which I named &lt;strong&gt;Age&lt;/strong&gt;.  The code splits the data into three equal-sized quantile buckets and applies the tag &lt;strong&gt;Young, Middle_Age or Old&lt;/strong&gt; to each observation based on its value of &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;abalone_df[&amp;#39;Age&amp;#39;] = pd.qcut(abalone_df[&amp;#39;Rings&amp;#39;],q=3,labels=[&amp;#39;Young&amp;#39;,&amp;#39;Middle_Age&amp;#39;,&amp;#39;Old&amp;#39;])
abalone_df.head()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
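Note that qcut splits on quantiles, so each Age bucket holds roughly the same number of observations; pd.cut, by contrast, splits on equal-width value ranges. A toy comparison (my own example values) shows the difference:

```python
import pandas as pd

# Toy data with one large value that skews the range.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

by_count = pd.qcut(s, q=2, labels=['low', 'high'])    # equal-count buckets
by_width = pd.cut(s, bins=2, labels=['low', 'high'])  # equal-width buckets

print(by_count.value_counts().to_dict())
print(by_width.value_counts().to_dict())
```

qcut puts five observations in each bucket, while cut places nine observations in the low bucket and leaves the single large value alone in the high bucket.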

&lt;p&gt;This new tag provides a new dimension to slice and dice our Boxplot.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Age Box" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/14_Age_Box.png"&gt;&lt;/p&gt;
&lt;p&gt;We now see the relationship between &lt;strong&gt;Whole_weight&lt;/strong&gt;, &lt;strong&gt;Sex&lt;/strong&gt; and &lt;strong&gt;Age&lt;/strong&gt; at a glance.&lt;/p&gt;
&lt;h2&gt;Violinplots&lt;/h2&gt;
&lt;p&gt;A Violinplot mirrors the KDE of the distribution around a central axis, which gives the plot a Violin-like shape.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Rings Violin" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/15_Rings_Violin.png"&gt;&lt;/p&gt;
&lt;p&gt;Violinplots also accommodate Tags.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Violin Tags" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/16_Violin_Tags.png"&gt;&lt;/p&gt;
&lt;h2&gt;Two-dimensional Plots&lt;/h2&gt;
&lt;p&gt;Python provides tools to explore Bivariate data sets.&lt;/p&gt;
&lt;p&gt;Seaborn (SNS) provides two-dimensional Histograms and two-dimensional KDE tools.&lt;/p&gt;
&lt;h3&gt;Two-dimensional Histogram&lt;/h3&gt;
&lt;p&gt;Note that SNS only shows the top-down (heatmap) view for two-dimensional histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Hist" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/17_Two_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;The SNS Bivariate Histograms accommodate tags.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Age&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Tag" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/18_Two_Tag.png"&gt;&lt;/p&gt;
&lt;h3&gt;Two-dimensional KDE&lt;/h3&gt;
&lt;p&gt;SNS also provides two-dimensional KDE plots, with &lt;strong&gt;Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Length&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Height&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Age&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;kde&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Kde" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/19_Two_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Look for Correlation&lt;/h2&gt;
&lt;p&gt;The Data Scientist looks for correlation between features and the target during the Data Exploration phase of the Machine Learning Pipeline.&lt;/p&gt;
&lt;h3&gt;Data prep&lt;/h3&gt;
&lt;p&gt;In the Data Prep stage, we encode the &lt;strong&gt;Tags&lt;/strong&gt; (String) into &lt;strong&gt;numeric values&lt;/strong&gt; (float32).&lt;/p&gt;
&lt;p&gt;The Pandas method &lt;strong&gt;get_dummies&lt;/strong&gt; one-hot-encodes the &lt;strong&gt;Sex&lt;/strong&gt; variable into Orthogonal dimensions.  This increases the dimensionality of our data set.&lt;/p&gt;
&lt;p&gt;We also use the &lt;strong&gt;factorize&lt;/strong&gt; method to convert the &lt;strong&gt;Young, Middle_Aged and Old&lt;/strong&gt; tags into integer codes (with &lt;strong&gt;sort=True&lt;/strong&gt;, the codes follow the sorted order of the tag names).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age_Bucket&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;factorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Sex&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
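&lt;p&gt;As a sketch of what the two encoders return, here is the same recipe on a hypothetical toy frame (the column values are illustrative, not the real Abalone data):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns
toy_df = pd.DataFrame({'Sex': ['M', 'F', 'I', 'M'],
                       'Age': ['Old', 'Young', 'Middle_Aged', 'Young']})

# get_dummies one-hot encodes Sex into orthogonal 0/1 columns
dummies = pd.get_dummies(toy_df['Sex'])
print(list(dummies.columns))  # ['F', 'I', 'M']

# factorize with sort=True assigns codes in sorted (alphabetical) order
codes, uniques = pd.factorize(toy_df['Age'], sort=True)
print(list(uniques))  # ['Middle_Aged', 'Old', 'Young']
print(list(codes))    # [1, 2, 0, 2]
```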

&lt;p&gt;We pop off the labels for later use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;class_labels&lt;/strong&gt; stores the target vector for &lt;strong&gt;Classification&lt;/strong&gt; models, and &lt;strong&gt;reg_labels&lt;/strong&gt; stores the target vector for &lt;strong&gt;Regression&lt;/strong&gt; models.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Age_Bucket&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reg_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
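&lt;p&gt;&lt;strong&gt;pop&lt;/strong&gt; both removes the column from the DataFrame and returns it as a Series, so the features and the labels end up in separate objects.  A quick sketch on a hypothetical two-column frame:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical two-column frame: one feature, one target
toy_df = pd.DataFrame({'Length': [0.455, 0.350], 'Rings': [15, 7]})

# pop removes Rings from the frame and returns it as a Series
labels = toy_df.pop('Rings')

print(list(labels))          # [15, 7]
print(list(toy_df.columns))  # ['Length']
```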

&lt;p&gt;I also create lists to pull like &lt;strong&gt;Features&lt;/strong&gt; from the DataFrame (Measurements, Tags, Target).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Diameter&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Whole_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Shucked_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Viscera_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="s1"&gt;&amp;#39;Shell_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;encoded_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;F&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;I&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;M&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Heatmap correlation&lt;/h3&gt;
&lt;p&gt;SNS provides a &lt;strong&gt;Heatmap&lt;/strong&gt; matrix for correlation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;corr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a mask for the upper triangle&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Set up the matplotlib figure&lt;/span&gt;
&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a custom diverging colormap&lt;/span&gt;
&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diverging_palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                             &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Draw the heatmap with the mask and &lt;/span&gt;
&lt;span class="c1"&gt;# correct aspect ratio&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cbar_kws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;shrink&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Corr Matrix" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/20_Corr_Matrix.png"&gt;&lt;/p&gt;
&lt;p&gt;We see that &lt;strong&gt;Diameter&lt;/strong&gt; and &lt;strong&gt;Length&lt;/strong&gt; correlate strongly, as do all of the &lt;strong&gt;weight&lt;/strong&gt; features.&lt;/p&gt;
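&lt;p&gt;The &lt;strong&gt;np.triu&lt;/strong&gt; mask above hides the redundant upper triangle, since every pair appears twice in a correlation matrix, plus a diagonal of ones.  A minimal sketch on a synthetic three-column frame (made-up data, not the Abalone set):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic frame: b tracks a closely, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({'a': a,
                       'b': 2 * a + rng.normal(scale=0.1, size=200),
                       'c': rng.normal(size=200)})

corr = toy_df.corr()

# True entries mark the upper triangle (and diagonal) to hide
mask = np.triu(np.ones_like(corr, dtype=bool))

print(corr.loc['b', 'a'])  # close to 1.0
print(mask)
```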
&lt;h3&gt;Pairgrid Correlation&lt;/h3&gt;
&lt;p&gt;This SNS Pairgrid plot shows the correlation between the features and the target, &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PairGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_vars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_offdiag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Pair Grid" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/21_Pair_Grid.png"&gt;&lt;/p&gt;
&lt;p&gt;Each &lt;strong&gt;feature&lt;/strong&gt; shows a clear upward trend against &lt;strong&gt;Rings&lt;/strong&gt;, which indicates positive correlation.&lt;/p&gt;
&lt;h3&gt;Scatterplot with Regression&lt;/h3&gt;
&lt;p&gt;SNS plots the ML 101 favorite, Linear Regression, right on the scatter plot with the &lt;strong&gt;regplot&lt;/strong&gt; method.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Viscera_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rings&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Reg Plot" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/22_Reg_Plot.png"&gt;&lt;/p&gt;
&lt;p&gt;We see a positive slope with tight confidence bands, which indicates that &lt;strong&gt;Viscera_weight&lt;/strong&gt; predicts &lt;strong&gt;Rings&lt;/strong&gt;.&lt;/p&gt;
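&lt;p&gt;Under the hood, &lt;strong&gt;regplot&lt;/strong&gt; fits an ordinary least-squares line.  We can recover the same slope and intercept by hand with &lt;strong&gt;np.polyfit&lt;/strong&gt; (a sketch on synthetic data with a known slope, not the Abalone set):&lt;/p&gt;

```python
import numpy as np

# Synthetic feature/target pair built with a known slope of 2
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=500)
y = 2 * x + rng.normal(scale=0.05, size=500)

# A degree-1 polynomial fit is a simple linear regression
slope, intercept = np.polyfit(x, y, 1)

print(slope)      # close to 2
print(intercept)  # close to 0
```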
&lt;h3&gt;Fancy Tilted 3d Plots&lt;/h3&gt;
&lt;p&gt;Remember that SNS only graphs &lt;strong&gt;top-down&lt;/strong&gt; views.  I wrote the following &lt;strong&gt;matplotlib&lt;/strong&gt; function to show an isometric view of the data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;target_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I call the function with the &lt;strong&gt;Abalone&lt;/strong&gt; data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Height&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Viscera_weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Length&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Three Dee" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/23_Three_Dee.png"&gt;&lt;/p&gt;
&lt;h2&gt;Dimensionality Reduction&lt;/h2&gt;
&lt;p&gt;Note that my Graph above requires me to choose &lt;strong&gt;three&lt;/strong&gt; (out of the possible &lt;strong&gt;eight&lt;/strong&gt;) features at a time.  This fact drives two questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which features do I use?&lt;/li&gt;
&lt;li&gt;How can I plot all the features at once?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Principal Component Analysis (PCA) collapses the information held in &lt;strong&gt;eight&lt;/strong&gt; features into &lt;strong&gt;three&lt;/strong&gt;, &lt;strong&gt;two&lt;/strong&gt; or even &lt;strong&gt;one&lt;/strong&gt; feature.&lt;/p&gt;
&lt;p&gt;I write about PCA in my blog post on &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Regression with Keras and TensorFlow&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you stick a magnet at each point in the data space, and then stick a telescoping iron bar at the origin, the magnets will pull the bar into position and stretch the bar. The bar will wiggle a bit at first and then eventually settle into a static position. The final direction and length of the bar represent a principal component. We can map the higher dimensionality space to the principal component by connecting a string directly from each magnet to the bar. Where the string hits (taut) we make a mark. The marks represent the mapped vector space.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;George Dallas also writes an excellent blog post that &lt;a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/"&gt;explains PCA&lt;/a&gt;.&lt;/p&gt;
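&lt;p&gt;The magnet-and-bar intuition can be checked on a synthetic two-dimensional cloud: when the points stretch along one direction, a single component soaks up nearly all of the variance (a sketch on made-up data):&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic cloud stretched along the line y = x
rng = np.random.default_rng(42)
t = rng.normal(size=300)
X = np.column_stack([t, t + rng.normal(scale=0.1, size=300)])

# Fit a single principal component (the "bar" direction)
pca = PCA(n_components=1)
pca.fit(X)

# One component captures almost all of the variance
print(pca.explained_variance_ratio_[0])  # close to 1.0
```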
&lt;h3&gt;Normalize&lt;/h3&gt;
&lt;p&gt;First Normalize the Data.  TensorFlow provides a &lt;strong&gt;normalizer&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Fit the &lt;strong&gt;normalizer&lt;/strong&gt; to our &lt;strong&gt;measurements&lt;/strong&gt; (exclude the encoded tags).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
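&lt;p&gt;&lt;strong&gt;adapt&lt;/strong&gt; records the per-feature mean and variance, and the layer then applies &lt;strong&gt;(x - mean) / sqrt(variance)&lt;/strong&gt; at call time.  The same arithmetic by hand in NumPy (a sketch on made-up numbers):&lt;/p&gt;

```python
import numpy as np

# Two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]], dtype=np.float32)

# What adapt() learns: per-feature (column) mean and variance
mean = X.mean(axis=0)
var = X.var(axis=0)

# What the layer applies at call time
X_norm = (X - mean) / np.sqrt(var)

print(X_norm.mean(axis=0))  # ~0 for each feature
print(X_norm.std(axis=0))   # ~1 for each feature
```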

&lt;h3&gt;One Principal Component&lt;/h3&gt;
&lt;p&gt;SciKitLearn provides &lt;strong&gt;PCA&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following code collapses all seven &lt;strong&gt;features&lt;/strong&gt; into one &lt;strong&gt;Principal Component&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pca_abalone_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
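&lt;p&gt;As a quick sanity check (not part of the original listing), the fitted PCA object exposes &lt;strong&gt;&lt;em&gt;explained_variance_ratio_&lt;/em&gt;&lt;/strong&gt;, which reports the fraction of total variance the single component captures.  A minimal sketch, with synthetic data standing in for the normalized abalone features:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the normalized abalone features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))

pca = PCA(n_components=1)
pca.fit(X)

# Fraction of total variance the one component captures
print(pca.explained_variance_ratio_)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;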

&lt;p&gt;A Seaborn (&lt;strong&gt;sns&lt;/strong&gt;) histogram shows how well this single &lt;strong&gt;Principal Component&lt;/strong&gt; separates the &lt;strong&gt;Classes&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_abalone_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="One Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/24_One_Princomp.png"&gt;&lt;/p&gt;
&lt;h3&gt;Two Principal Components&lt;/h3&gt;
&lt;p&gt;Now derive two principal components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A KDE plot shows the three classes in relation to the two &lt;strong&gt;Principal Components&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Two Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/25_Two_Princomp.png"&gt;&lt;/p&gt;
&lt;h3&gt;Three Principal Components&lt;/h3&gt;
&lt;p&gt;Astute readers will anticipate the slight code modifications required to derive three &lt;strong&gt;Principal Components&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;metric_vars&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;abalone_reg_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use a &lt;strong&gt;3D&lt;/strong&gt; plot to see the separation of the classes in relation to three &lt;strong&gt;Principal Components&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;class_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Three Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/26_Three_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;If you include one-hot encoded variables in your PCA, you may see misleading results.&lt;/p&gt;
&lt;p&gt;For example, we encoded the &lt;strong&gt;Categorical&lt;/strong&gt; &lt;strong&gt;Sex&lt;/strong&gt; feature into three &lt;strong&gt;Orthogonal&lt;/strong&gt; numeric vectors, one each for &lt;strong&gt;M, F, and I&lt;/strong&gt;.  If you keep these vectors in the PCA, you will see the following:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sex Princomp" src="https://john.soban.ski/images/Analytics_Cheat_Sheet/27_Sex_Princomp.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Bookmark this page for future reference.  It provides a handy &lt;strong&gt;Cheat Sheet&lt;/strong&gt; of useful Python Data Exploration and Data Viz tools.&lt;/p&gt;</content><category term="Data Science"></category><category term="Python"></category><category term="Pandas"></category><category term="Machine Learning"></category></entry><entry><title>Refactor Matlab to Tidyverse</title><link href="https://john.soban.ski/refactor-matlab-to-tidyverse.html" rel="alternate"></link><published>2023-08-26T10:26:00-04:00</published><updated>2023-08-26T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-08-26:/refactor-matlab-to-tidyverse.html</id><summary type="html">&lt;p&gt;I've previously discussed the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Coulomb Energy&lt;/a&gt; Neural Net algorithm on this site.  I wrote the algorithm in Matlab, which uses index-based logic to select, filter, wrangle, and process data. Today I will refactor the Matlab code to &lt;a href="https://www.tidyverse.org/"&gt;Tidyverse&lt;/a&gt;.  Tidyverse uses forward pipe operators to flow data through the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I've previously discussed the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Coulomb Energy&lt;/a&gt; Neural Net algorithm on this site.  I wrote the algorithm in Matlab, which uses index-based logic to select, filter, wrangle, and process data. Today I will refactor the Matlab code to &lt;a href="https://www.tidyverse.org/"&gt;Tidyverse&lt;/a&gt;.  Tidyverse uses forward pipe operators to flow data through the data processing steps.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pipe Operator Pic" src="https://john.soban.ski/images/Refactor_Matlab_To_Tidyverse/00_Pipe_Operator.png"&gt;&lt;/p&gt;
&lt;p&gt;The example RCE algorithm assigns a class to data based on whether or not the data points live inside "footprints" of training data.  If a visual walk-through of the RCE algorithm interests you, take a minute to read my post titled &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical Introduction to Probabilistic Neural Networks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following graphic captures an animation of the RCE NN Algorithm in action.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE in action" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;p&gt;You can find the original Matlab script and the new R script on &lt;a href="https://github.com/hatdropper1977/bupa-rce-octave"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Loading Data&lt;/h2&gt;
&lt;p&gt;The algorithm loads the BUPA liver disorders database from the University of California, Irvine (UCI) &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;machine learning repository&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we encode the CSV into a matrix with brackets and assignment.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;59&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="mi"&gt;86&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="mi"&gt;91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="k"&gt;...&lt;/span&gt;

&lt;span class="mi"&gt;98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;16.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;57&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;Tidyverse allows us to read the raw CSV and store it in a &lt;a href="https://tibble.tidyverse.org/"&gt;Tibble&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In addition to storing the CSV data in a Tibble, we use the &lt;strong&gt;readr&lt;/strong&gt; library to add column names (&lt;strong&gt;&lt;em&gt;col_names&lt;/em&gt;&lt;/strong&gt;) and the &lt;strong&gt;tibble&lt;/strong&gt; library to add an ID column (&lt;strong&gt;&lt;em&gt;rowid_to_column&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;p&gt;I discuss the definitions of the column names in the next section.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;readr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;dplyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;magrittr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;purrr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tidyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;mcv&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alkphos&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgpt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gammagt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;drinks_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;select&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;bupa.data&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;col_names&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;tibble&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;rowid_to_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Selecting Features&lt;/h2&gt;
&lt;p&gt;The BUPA data includes six features and two classes (one for alcohol-related liver disorders and one for alcohol-unrelated liver disorders).&lt;/p&gt;
&lt;p&gt;The six (6) BUPA features include: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mean corpuscular volume (mcv)&lt;/li&gt;
&lt;li&gt;Four chemical markers&lt;ul&gt;
&lt;li&gt;alkaline phosphotase (alkphos)&lt;/li&gt;
&lt;li&gt;alamine aminotransferase (sgpt)&lt;/li&gt;
&lt;li&gt;aspartate aminotransferase (sgot)&lt;/li&gt;
&lt;li&gt;gamma-glutamyl transpeptidase (gammagt)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;half-pint equivalents of alcohol per day (drinks)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I discussed the salient features in my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE&lt;/a&gt; writeup.  Three features, &amp;quot;alkphos&amp;quot;, &amp;quot;sgpt&amp;quot;, and &amp;quot;gammagt&amp;quot;, stand out in terms of the algorithm's classification performance.  We would still like to give the Data Scientist flexibility in selecting features for &amp;quot;what if&amp;quot; scenarios, so we write our code to accommodate an arbitrary number of features.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we use the column index to select the features.  In this case, we use &lt;strong&gt;&lt;em&gt;alkphos == 2&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;sgpt == 5&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;gammagt == 6&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;Tidyverse allows us to name the columns and then select by name.&lt;/p&gt;
&lt;p&gt;When we created &lt;strong&gt;&lt;em&gt;Bupa.Tib&lt;/em&gt;&lt;/strong&gt; we named the columns, so now we can &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; columns by name.&lt;/p&gt;
&lt;p&gt;We create a list that records the column names that we intend to keep.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alkphos&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgpt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gammagt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Creating the training set&lt;/h2&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we sort the data by the &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt;, which the matrix stores in column seven (7).&lt;/p&gt;
&lt;p&gt;We then use index operations to select all features, excluding the class.&lt;/p&gt;
&lt;p&gt;Then we select the desired features using the &lt;strong&gt;&lt;em&gt;feats&lt;/em&gt;&lt;/strong&gt; array.  A function, named &lt;strong&gt;&lt;em&gt;prepare_uncoded&lt;/em&gt;&lt;/strong&gt; wraps this operation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;[data]&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;prepare_uncoded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;data,feats&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sortrows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We create two separate matrices of seventy-two (72) rows each: rows 73 through 144 and rows 145 through 216.  Since we sorted by class in the function above, we produce one matrix of train patterns that contains class one and one that contains train patterns of class two.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;prepare_uncoded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;feats&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;73&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;144&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;145&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;216&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;magrittr&lt;/em&gt;&lt;/strong&gt; package in Tidyverse provides the forward pipe operator (&lt;strong&gt;&lt;em&gt;%&amp;gt;%&lt;/em&gt;&lt;/strong&gt;), which makes the feature selection steps more readable.&lt;/p&gt;
&lt;p&gt;We use &lt;strong&gt;&lt;em&gt;filter&lt;/em&gt;&lt;/strong&gt; to filter points of each class, &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; to select the features and &lt;strong&gt;&lt;em&gt;slice&lt;/em&gt;&lt;/strong&gt; to pull specific rows.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  To disambiguate, UCI named the &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt; column &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt;, so we &lt;strong&gt;&lt;em&gt;filter&lt;/em&gt;&lt;/strong&gt; on the value of the &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; column.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_1_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;73&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;144&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;class_2_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;72&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_1_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_2_training_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Find Radii&lt;/h2&gt;
&lt;p&gt;The RCE NN algorithm requires us to find, for each training point, the radius to the nearest training point of the opposite class.&lt;/p&gt;
&lt;p&gt;We compute the Euclidean distance to every training point of the other class and store the distance (named &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt;) of the closest one.&lt;/p&gt;
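<p>Before the Matlab and Tidyverse implementations, here is a minimal Python sketch of the rule for a single training point (an illustration only; the function name and toy values are my own): <strong><em>lambda</em></strong> is the distance to the nearest opposite-class point, minus <strong><em>epsilon</em></strong>, capped at <strong><em>lambda max</em></strong>.</p>

```python
import numpy as np

def lambda_for_point(point, other_class, epsilon, lambda_max):
    """Distance from `point` to the nearest row of `other_class`,
    minus epsilon, capped at lambda_max."""
    dists = np.sqrt(((other_class - point) ** 2).sum(axis=1))
    return float(min(dists.min() - epsilon, lambda_max))

# Toy data: one Class 1 point and three Class 2 points
p = np.array([0.0, 0.0])
other = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 10.0]])
print(lambda_for_point(p, other, epsilon=0.0, lambda_max=50.0))  # 5.0
```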
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, we create a function that ingests both the &lt;strong&gt;&lt;em&gt;Class 1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;Class 2&lt;/em&gt;&lt;/strong&gt; training matrices, along with &lt;strong&gt;&lt;em&gt;epsilon&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;lambda max&lt;/em&gt;&lt;/strong&gt;.  &lt;strong&gt;&lt;em&gt;Lambda max&lt;/em&gt;&lt;/strong&gt; sets an upper bound on the radius the algorithm will consider.  &lt;strong&gt;&lt;em&gt;Epsilon&lt;/em&gt;&lt;/strong&gt; provides a very small value that we subtract from the calculated &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt;.  For more details, see my writeup of the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE NN&lt;/a&gt; algorithm.&lt;/p&gt;
&lt;p&gt;The Matlab code uses nested function calls to perform the matrix operations that calculate the Euclidean distances and then record the minimum.&lt;/p&gt;
&lt;p&gt;In addition, the code uses a &lt;strong&gt;&lt;em&gt;for loop&lt;/em&gt;&lt;/strong&gt; to iterate through every training pattern.&lt;/p&gt;
&lt;p&gt;For each training point, the function calculates the distance to every training point of the opposite class (stored in the two &lt;strong&gt;&lt;em&gt;Class&lt;/em&gt;&lt;/strong&gt; matrices) and keeps the minimum.&lt;/p&gt;
&lt;p&gt;It then returns two arrays of &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; values, one array per class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;[lambda_1, lambda_2] = rce_train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;class1,class2,eps,lambda_max&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c"&gt;%Find number of train patterns (colums)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;.^&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_hat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We apply the function to the training matrices:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nb"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
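<p>For comparison, here is a rough NumPy translation of the Matlab function above (my own sketch, keeping the one-pattern-per-column layout). Broadcasting replaces both <strong><em>for loops</em></strong> with a single pairwise distance matrix:</p>

```python
import numpy as np

def rce_train(class1, class2, eps, lambda_max):
    """Rough NumPy analogue of the Matlab rce_train above.
    Each class matrix stores one training pattern per column."""
    # Pairwise Euclidean distances: dists[i, j] is the distance
    # from class1 column i to class2 column j
    diff = class1[:, :, None] - class2[:, None, :]
    dists = np.sqrt((diff ** 2).sum(axis=0))
    # Row minima give lambda_1, column minima give lambda_2
    lambda_1 = np.minimum(dists.min(axis=1) - eps, lambda_max)
    lambda_2 = np.minimum(dists.min(axis=0) - eps, lambda_max)
    return lambda_1, lambda_2
```

The broadcasted `diff` array holds every pattern-to-pattern difference at once, so no explicit iteration is needed.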

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;R best practices discourage &lt;strong&gt;&lt;em&gt;for loops&lt;/em&gt;&lt;/strong&gt;, since R follows a &lt;strong&gt;&lt;em&gt;functional&lt;/em&gt;&lt;/strong&gt; programming convention.&lt;/p&gt;
&lt;p&gt;In addition, the &lt;strong&gt;&lt;em&gt;MagrittR&lt;/em&gt;&lt;/strong&gt; pipes allow us to avoid &lt;strong&gt;&lt;em&gt;nested functions&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We first create a function named &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt;.  I decided to process the data one class at a time, so this function calculates the distance from a single observation to the training points of the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class only, rather than ingesting both class matrices at once as the Matlab function above does.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; function takes a single observation (row of data) for a particular &lt;strong&gt;&lt;em&gt;class&lt;/em&gt;&lt;/strong&gt;, along with the entire &lt;strong&gt;&lt;em&gt;Tibble&lt;/em&gt;&lt;/strong&gt; that contains all data points of the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;.  The function also ingests &lt;strong&gt;&lt;em&gt;epsilon&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;lambda max&lt;/em&gt;&lt;/strong&gt; and the &lt;strong&gt;&lt;em&gt;features&lt;/em&gt;&lt;/strong&gt; vector.&lt;/p&gt;
&lt;p&gt;Not to overload terms too much, but the function includes a &lt;strong&gt;&lt;em&gt;lambda function&lt;/em&gt;&lt;/strong&gt; that calculates the Euclidean distance between two vectors.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;lambda function&lt;/em&gt;&lt;/strong&gt; takes two vectors, the &lt;strong&gt;&lt;em&gt;observation&lt;/em&gt;&lt;/strong&gt; vector and a row from the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt; Tibble, which I call &lt;strong&gt;&lt;em&gt;x&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The lambda function can perform calculations on vectors of any length, which provides Data Scientists flexibility in choosing which features to include.&lt;/p&gt;
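<p>To illustrate that claim with hypothetical values (a Python sketch, not the post's R code), the same distance expression handles two features or three without modification:</p>

```python
import numpy as np

# One distance expression, any feature count
def euclid(x, observation):
    return float(np.sqrt(((x - observation) ** 2).sum()))

print(euclid(np.array([3.0, 4.0]), np.array([0.0, 0.0])))            # 5.0
print(euclid(np.array([1.0, 2.0, 2.0]), np.array([0.0, 0.0, 0.0])))  # 3.0
```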
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; function follows, and I will explain it quickly line by line.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Other.Class.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Other.Class.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We start with the Tibble that contains &lt;strong&gt;&lt;em&gt;all&lt;/em&gt;&lt;/strong&gt; observations of the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class, stored in &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The function pipes the Tibble in its entirety to a &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; statement that selects all of the &lt;strong&gt;&lt;em&gt;desired&lt;/em&gt;&lt;/strong&gt; features.&lt;/p&gt;
&lt;p&gt;We then use the &lt;strong&gt;&lt;em&gt;mutate&lt;/em&gt;&lt;/strong&gt; operator to create a new column named &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt;.  This column stores the &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt; from the current &lt;strong&gt;&lt;em&gt;observation&lt;/em&gt;&lt;/strong&gt; (single vector) to every data point (row) in the &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; operator tells &lt;strong&gt;&lt;em&gt;Tidyverse&lt;/em&gt;&lt;/strong&gt; to apply the Euclidean distance &lt;strong&gt;&lt;em&gt;lambda function&lt;/em&gt;&lt;/strong&gt; to every row in &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt; and store the result for each row in the &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt; column.&lt;/p&gt;
&lt;p&gt;Since we must accommodate vectors of arbitrary length, we tell &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; to traverse the data &lt;strong&gt;&lt;em&gt;row wise&lt;/em&gt;&lt;/strong&gt; via the &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt; we pass as the second argument (the &lt;strong&gt;&lt;em&gt;MARGIN&lt;/em&gt;&lt;/strong&gt; parameter).&lt;/p&gt;
&lt;p&gt;Once the &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; operation completes, we have a column that records the distance to each data point in the &lt;strong&gt;&lt;em&gt;Other.Class.Tib&lt;/em&gt;&lt;/strong&gt;.  We are only interested in the &lt;strong&gt;&lt;em&gt;nearest&lt;/em&gt;&lt;/strong&gt; data point of the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;, so we &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; the &lt;strong&gt;&lt;em&gt;euclid_dist&lt;/em&gt;&lt;/strong&gt; column and find the &lt;strong&gt;&lt;em&gt;min()&lt;/em&gt;&lt;/strong&gt;.  We then subtract &lt;strong&gt;&lt;em&gt;epsilon&lt;/em&gt;&lt;/strong&gt; and cap the result at &lt;strong&gt;&lt;em&gt;lambda max&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In summary, we supply the function with a single observation for a class, along with a &lt;strong&gt;&lt;em&gt;Tibble&lt;/em&gt;&lt;/strong&gt; that includes &lt;strong&gt;&lt;em&gt;all&lt;/em&gt;&lt;/strong&gt; observations for the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;.  The function then returns a single value, the distance between the current observation and the nearest data point of the &lt;strong&gt;&lt;em&gt;other class&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We are not done yet.  We must apply this function to every training point in the Class under observation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find Lambda for Class 1 Training patterns&lt;/span&gt;
&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find Lambda for Class 2 Training patterns&lt;/span&gt;
&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;lambda_max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                                        &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pipe the entire &lt;strong&gt;&lt;em&gt;Class.1.Train.Tib&lt;/em&gt;&lt;/strong&gt; to a &lt;strong&gt;&lt;em&gt;select&lt;/em&gt;&lt;/strong&gt; function and then use the &lt;strong&gt;&lt;em&gt;apply&lt;/em&gt;&lt;/strong&gt; operation to execute &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; on every row of &lt;strong&gt;&lt;em&gt;Class.1.Train.Tib&lt;/em&gt;&lt;/strong&gt;.  Although each iteration (application) of &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt; ingests the entire &lt;strong&gt;&lt;em&gt;Class.2.Train.Tib&lt;/em&gt;&lt;/strong&gt;, it returns a single value for lambda.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  The &lt;strong&gt;&lt;em&gt;MagrittR&lt;/em&gt;&lt;/strong&gt; &lt;strong&gt;&lt;em&gt;%&amp;lt;&amp;gt;%&lt;/em&gt;&lt;/strong&gt; operator pipes data forward and stores the final result of all chained operations back into the initial variable.&lt;/p&gt;
&lt;/blockquote&gt;
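<p>The row-wise application above can be sketched in Python with toy data (hypothetical values; the names are my own, not the BUPA features):</p>

```python
import numpy as np

# Toy stand-ins for the two training tibbles (rows = observations)
class_1 = np.array([[0.0, 0.0], [10.0, 0.0]])
class_2 = np.array([[3.0, 4.0], [12.0, 0.0]])
epsilon, lambda_max = 0.5, 50.0

# For each Class 1 row: distance to the nearest Class 2 row,
# minus epsilon, capped at lambda_max -- one lambda per row
lambdas = [
    float(min(np.sqrt(((class_2 - row) ** 2).sum(axis=1)).min() - epsilon,
              lambda_max))
    for row in class_1
]
print(lambdas)  # [4.5, 1.5]
```

Each iteration sees the whole opposite-class array but reduces it to a single lambda, which mirrors how <strong><em>apply</em></strong> feeds one row at a time to <strong><em>find_lambda</em></strong>.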
&lt;p&gt;The following output tibble depicts what &lt;strong&gt;&lt;em&gt;Class.1.Train.Tib&lt;/em&gt;&lt;/strong&gt; looks like after application of &lt;strong&gt;&lt;em&gt;find_lambda&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;
&lt;span class="c1"&gt;# A tibble: 72 x 5&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gammagt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;67&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;77&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;114&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;29.1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;175&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;71&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;29&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;52&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;10.5&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;176&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;93&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;22&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;123&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;19.4&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;182&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;77&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;86&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;31&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;26.8&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;183&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;77&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;39&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;108&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;20.4&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;189&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;83&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;81&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;201&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;58.3&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;190&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;75&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;14&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;3.16&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;191&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;56&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;6.48&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;192&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;91&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;27&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;15&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;7.87&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;194&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="m"&gt;62&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;17&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5.00&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;195&lt;/span&gt;
&lt;span class="c1"&gt;# ... with 62 more rows&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, the closest Class 2 data point to the first Class 1 observation lies 29.1 units away.&lt;/p&gt;
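&lt;p&gt;To make the &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; idea concrete, here is a minimal sketch in Python rather than R (the two-point training sets are hypothetical stand-ins, not BUPA data): each footprint radius equals the distance to the nearest training point of the other class.&lt;/p&gt;

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical two-feature training points for each class
class_1 = [(71.0, 29.0), (93.0, 22.0)]
class_2 = [(77.0, 86.0), (62.0, 17.0)]

def lambda_radius(point, other_class):
    """Footprint radius: distance to the nearest point of the other class."""
    return min(dist(point, q) for q in other_class)

lambdas_1 = [lambda_radius(p, class_2) for p in class_1]
lambdas_2 = [lambda_radius(p, class_1) for p in class_2]
```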
&lt;h2&gt;Classify the Data&lt;/h2&gt;
&lt;p&gt;We first use the remaining BUPA observations, held out of training, to create test patterns for each class.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_class1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;test_class2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;217&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;288&lt;/span&gt;&lt;span class="p"&gt;,:)&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;In Tidyverse I decided to create one &lt;strong&gt;&lt;em&gt;Tibble&lt;/em&gt;&lt;/strong&gt; for all Test Patterns, via the &lt;strong&gt;&lt;em&gt;bind_rows&lt;/em&gt;&lt;/strong&gt; operation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Test.Patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_1_test_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;bind_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Bupa.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class_2_test_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
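&lt;p&gt;For readers following along in neither language, the row selection above reduces to simple slicing.  A hedged Python sketch with stand-in data (the actual BUPA file is not loaded here) shows how Matlab's one-based ranges map to zero-based indices:&lt;/p&gt;

```python
# Stand-in for the 345 BUPA rows (features reduced to two for brevity)
data = [[float(i), float(i % 7)] for i in range(345)]

# Matlab's data(1:72,:) and data(217:288,:) in zero-based terms
test_class_1 = data[0:72]     # rows 1-72: Class 1 test patterns
test_class_2 = data[216:288]  # rows 217-288: Class 2 test patterns
```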

&lt;p&gt;Once we have test data, we need to classify it.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab, I wrote a function named &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt;.  The function contains several nested function calls and a for loop.&lt;/p&gt;
&lt;p&gt;Each training pattern includes a circular "footprint" around it that extends to the nearest point of the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class, with radius equal to the &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; we calculated above.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; function finds which &lt;strong&gt;&lt;em&gt;footprint&lt;/em&gt;&lt;/strong&gt; each test observation lies in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;[cl]&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;class1,lambda_1,class2,lambda_2,test_patterns&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c"&gt;%Test Patterns in form: num_features x num_patterns&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="c"&gt;%Find number of train patterns (colums)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;num_test_patterns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_patterns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;num_test_patterns&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;test_x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_patterns&lt;/span&gt;&lt;span class="p"&gt;(:,&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c1p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;*&lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_x&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nb"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_c2p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;*&lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dist2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="nb"&gt;isempty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ind2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;In the &lt;strong&gt;&lt;em&gt;Tidyverse&lt;/em&gt;&lt;/strong&gt; classification approach, the functions nest only in the logical sense, since the code expresses every step with pipes.&lt;/p&gt;
&lt;p&gt;We create a generic function to discover how many "footprints" the given observation lives in.&lt;/p&gt;
&lt;p&gt;Similar to the &lt;strong&gt;&lt;em&gt;Matlab&lt;/em&gt;&lt;/strong&gt; code above, we calculate the distance between an observation of the Test data and all of the training samples of a given class.&lt;/p&gt;
&lt;p&gt;We then use the &lt;strong&gt;&lt;em&gt;lambda&lt;/em&gt;&lt;/strong&gt; values of the training samples to identify the &lt;strong&gt;&lt;em&gt;count&lt;/em&gt;&lt;/strong&gt; (nrow) of footprints the test data lives in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rce_classify&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Data.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Data.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;observation&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;^&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;euclid_dist&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;lambda&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
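&lt;p&gt;The footprint count that &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; returns boils down to the loop below, sketched in Python with hypothetical inputs (one training point and one lambda per row):&lt;/p&gt;

```python
from math import dist  # Euclidean distance, Python 3.8+

def count_footprints(observation, train_points, lambdas):
    """Count the training footprints (radius-lambda spheres) that
    contain the given observation."""
    hits = 0
    for point, lam in zip(train_points, lambdas):
        if lam > dist(observation, point):  # strictly inside the footprint
            hits = hits + 1
    return hits
```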

&lt;p&gt;Without getting too complicated, we pass the Test data to a function that uses &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; to detect the number of &lt;strong&gt;&lt;em&gt;hits&lt;/em&gt;&lt;/strong&gt; against each class of Training data.  First it finds the &lt;strong&gt;&lt;em&gt;hits&lt;/em&gt;&lt;/strong&gt; against &lt;strong&gt;&lt;em&gt;Class.2.Training.Tib&lt;/em&gt;&lt;/strong&gt;, and then it finds the hits against &lt;strong&gt;&lt;em&gt;Class.1.Training.Tib&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The new function &lt;strong&gt;&lt;em&gt;rce_classify_tib&lt;/em&gt;&lt;/strong&gt; then uses the number of hits for each class to classify the data.  In this example, we use a voting approach, although you can tailor the algorithm to classify a test point as ambiguous if it hits either &lt;strong&gt;&lt;em&gt;zero&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;more than one&lt;/em&gt;&lt;/strong&gt; class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rce_classify_tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.One.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.Two.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.Two.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.One.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;yes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                           &lt;/span&gt;&lt;span class="n"&gt;yes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                           &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Test.Data.Tib&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
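&lt;p&gt;The nested &lt;strong&gt;&lt;em&gt;ifelse&lt;/em&gt;&lt;/strong&gt; at the end of &lt;strong&gt;&lt;em&gt;rce_classify_tib&lt;/em&gt;&lt;/strong&gt; implements the voting rule.  The same decision, sketched in Python for clarity:&lt;/p&gt;

```python
def rce_vote(class_1_hits, class_2_hits):
    """Voting rule: more Class 1 hits gives label 1, more Class 2 hits
    gives label 2, and a tie (including zero hits on both sides) gives
    label 3, the ambiguous class."""
    if class_1_hits > class_2_hits:
        return 1
    if class_2_hits > class_1_hits:
        return 2
    return 3
```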

&lt;p&gt;We then apply these functions to our data.&lt;/p&gt;
&lt;h3&gt;Matlab&lt;/h3&gt;
&lt;p&gt;In Matlab:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;cl1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_class1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;cl2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lambda_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_class2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Tidyverse&lt;/h3&gt;
&lt;p&gt;In Tidyverse:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;Test.Patterns %&amp;lt;&amp;gt;% 
  rce_classify_tib(Class.1.Train.Tib, Class.2.Train.Tib, features)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Graphing RCE NN&lt;/h2&gt;
&lt;p&gt;We can graph the RCE NN in action by creating a uniform data grid and running &lt;strong&gt;&lt;em&gt;rce_classify&lt;/em&gt;&lt;/strong&gt; against every point.&lt;/p&gt;
&lt;p&gt;First, create the data grid.  We find the highest-valued observation in the data set so that the graph includes every observed point.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;bind_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;

&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;expand.grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now we classify every point in the test grid, which blankets the entire canvas.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  This will take a long time.  If you don't want to wait, you can execute &lt;code&gt;test_grid = readxl::read_xlsx("NinetyK.xlsx")&lt;/code&gt; instead.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can then use &lt;strong&gt;&lt;em&gt;Plotly&lt;/em&gt;&lt;/strong&gt; to generate a three-dimensional (3D) image that you can rotate.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;plotly&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;ifelse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;yes&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;one&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;two&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;plot_ly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;gammagt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="RCE 3D" src="https://john.soban.ski/images/Refactor_Matlab_To_Tidyverse/01_3d.png"&gt;&lt;/p&gt;
&lt;p&gt;If you would like to see a 2D graph, then re-run the script using two features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;alkphos&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;sgpt&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a test grid using two dimensions and classify.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;expand.grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;max_obs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;length.out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;

&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;as_tibble&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;tibble&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;rowid_to_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;rce_classify_tib&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.1.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Class.2.Train.Tib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can just load the pre-processed data instead of waiting.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;readxl&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_xlsx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;NinetyK.xlsx&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here I plot with the Grammar of Graphics (&lt;strong&gt;&lt;em&gt;ggplot2&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nf"&gt;ggplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.1.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;geom_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;test_grid&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rce_class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;aes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alkphos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sgpt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;class.2.hits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="RCE 2D" src="https://john.soban.ski/images/Refactor_Matlab_To_Tidyverse/02_90k.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This blog post described how to convert a Matlab script that uses for loops and nested functions into a functional, pipe-based Tidyverse script.&lt;/p&gt;
&lt;p&gt;If you enjoyed this, you may enjoy these other Machine Learning posts.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;A New Exemplar Machine Learning Algorithm (Part 1: Develop)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/rce-python-part-2.html"&gt;A New Exemplar Machine Learning Algorithm (Part 2: Optimize)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Applying a Reduced Coulomb Energy (RCE) Neural Network Classifier to the Bupa Liver Disorders Data Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical Introduction to Probabilistic Neural Networks - Normalization and Implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Octave"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="R"></category></entry><entry><title>Roark vs. Raskolnikov: Natural Language Processing Analysis</title><link href="https://john.soban.ski/roark-vs-raskolnikov.html" rel="alternate"></link><published>2023-07-29T04:54:00-04:00</published><updated>2023-07-29T04:54:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-07-29:/roark-vs-raskolnikov.html</id><summary type="html">&lt;p&gt;I augment traditional literary analysis with Natural Language Processing (NLP) tools to compare Fyodor Dostoevsky's Rodion Raskolnikov (Crime and Punishment) with Ayn Rand's Howard Roark (The Fountainhead).  Tools include both the Google Cloud Platform (GCP) Natural Language Application Programming Interface (API) and Tensorflow Transfer Learning.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov and Roark" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/01_Raskolnikov_Roark.png"&gt;&lt;/p&gt;
&lt;p&gt;I use the same approach …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I augment traditional literary analysis with Natural Language Processing (NLP) tools to compare Fyodor Dostoevsky's Rodion Raskolnikov (Crime and Punishment) with Ayn Rand's Howard Roark (The Fountainhead).  Tools include both the Google Cloud Platform (GCP) Natural Language Application Programming Interface (API) and Tensorflow Transfer Learning.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov and Roark" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/01_Raskolnikov_Roark.png"&gt;&lt;/p&gt;
&lt;p&gt;I use the same approach I followed in my &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Thoreau vs. Unabomber&lt;/a&gt; blog post.  The GCP NLP API measures character sentiment (positive or negative) and emotional intensity while the literary analysis frames the quantified personality metrics with relevant quotes.  &lt;/p&gt;
&lt;p&gt;My earlier post analyzed Thoreau and the Unabomber's manifestos.  These texts provide single-voice narration, which made data preparation simple.  Unlike &lt;strong&gt;The Unabomber Manifesto&lt;/strong&gt; and &lt;strong&gt;Walden&lt;/strong&gt;, however, &lt;strong&gt;Crime and Punishment&lt;/strong&gt; and &lt;strong&gt;The Fountainhead&lt;/strong&gt; include multiple speakers and a narrator.&lt;/p&gt;
&lt;p&gt;Valid analysis requires me to extract the speaking lines for &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Roark&lt;/strong&gt; from their respective works.  I used &lt;a href="https://john.soban.ski/raskolnikov-label.html"&gt;Tensorflow and Keras NLP&lt;/a&gt; to accomplish this task.   &lt;/p&gt;
&lt;h2&gt;Quantify Sentiment&lt;/h2&gt;
&lt;p&gt;I take the extracted dialog and internal monologues from &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Roark&lt;/strong&gt; and feed them into the Google API.  The API infers sentiment (score) and intensity (magnitude).&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values"&gt;GCP NLP API docs&lt;/a&gt; define &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Score&lt;ul&gt;
&lt;li&gt;Indicates the overall emotion of a document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Magnitude&lt;ul&gt;
&lt;li&gt;Indicates how much emotional content is present within the document, and this value is often proportional to the length of the document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
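&lt;p&gt;Score and magnitude work together: a near-zero score with high magnitude suggests mixed emotion, while a near-zero score with low magnitude suggests a truly neutral document.  A minimal Python sketch of this bucketing logic follows; the threshold values are my own illustrative choices, not official GCP constants.&lt;/p&gt;

```python
def interpret_sentiment(score, magnitude):
    """Bucket a GCP NLP (score, magnitude) pair into a rough category.

    The thresholds are illustrative assumptions, not official GCP values.
    """
    if score <= -0.25:
        return 'negative'
    if score >= 0.25:
        return 'positive'
    # Near-zero score: magnitude disambiguates mixed vs. truly neutral
    return 'mixed' if magnitude >= 2.0 else 'neutral'


# Roark's most negative line scores -0.8 with magnitude 0.8
print(interpret_sentiment(-0.8, 0.8))  # negative
```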
&lt;p&gt;Open my &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Thoreau vs. Unabomber&lt;/a&gt; post in a new tab to find my script that processes the texts, emits them to the API, and records the results.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: I uploaded the Sentiment analysis data for both &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/rask_sentiment.csv"&gt;Raskolnikov&lt;/a&gt; and &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/Roark/roark_sentiment.csv"&gt;Roark&lt;/a&gt; to Github.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Import Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Requests&lt;/strong&gt; imports the data straight from GitHub.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;roark_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://github.com/hatdropper1977/Raskolnikov/raw/main/Roark/roark_sentiment.csv&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;rask_url&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://github.com/hatdropper1977/Raskolnikov/raw/main/rask_sentiment.csv&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;roark_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;roark_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;rask_r&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rask_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="n"&gt;roark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;roark_r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;rask_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rask_r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Numeric Analysis&lt;/h3&gt;
&lt;p&gt;Pandas extracts Roark's most negative dialog.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

&lt;span class="n"&gt;score&lt;/span&gt;   &lt;span class="n"&gt;magnitude&lt;/span&gt;   &lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="mi"&gt;52&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;    &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;You&amp;#39;re wasting your time,&amp;quot;&amp;quot; said Roark.&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Roark Headshot" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/02_Roark_Headshot.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Roark's most negative dialog: "You're wasting your time"&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A similar command extracts Raskolnikov's most negative dialog.  Note that Twenty Four (24) lines of dialog share the most negative score of Negative Zero Point Eight (-0.8).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;

&lt;span class="mi"&gt;24&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For example, Raskolnikov's most negative dialog includes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All this is very naive . . . excuse me, I should have said impudent on your part&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;One Dimensional Graphical Analysis&lt;/h3&gt;
&lt;p&gt;Pandas provides an easy method to generate Histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Raskolnikov's sentiment histogram leans negative.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov's sentiment histogram leans negative" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/03_Rask_Score.png"&gt;&lt;/p&gt;
&lt;p&gt;Roark's sentiment histogram spikes at neutral.  His negative lines of dialog taper off, with few beyond Negative Zero Point Four (-0.4).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Roark's sentiment histogram spikes at neutral" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/04_Roark_Score.png"&gt;&lt;/p&gt;
&lt;p&gt;I use Seaborn to overlay the two Histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I concatenate the two Data Frames into one Data Frame.  I add a &lt;strong&gt;Label&lt;/strong&gt; column, named &lt;strong&gt;Class&lt;/strong&gt;.  This label allows Seaborn to color the data by &lt;strong&gt;Speaker&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Roark&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Rask&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roark_df&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
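
&lt;p&gt;The same pattern works on toy data.  A minimal sketch (with hypothetical scores) shows the shape of the combined Data Frame that Seaborn receives:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sentiment scores for each speaker
roark_df = pd.DataFrame({'score': [0.0, -0.2, 0.1]})
rask_df = pd.DataFrame({'score': [-0.5, -0.3, 0.0]})

# Tag each frame so Seaborn can color the data by speaker
roark_df['Class'] = 'Roark'
rask_df['Class'] = 'Rask'

# Stack the two frames into one long-format frame
sentiment_df = pd.concat([rask_df, roark_df])
print(sentiment_df.shape)  # (6, 2)
```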

&lt;p&gt;I run &lt;strong&gt;histplot&lt;/strong&gt; on the new Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The overlaid Histograms illustrate that Raskolnikov (Blue) leans more negative than Roark (Orange).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov more negative than Roark Histogram" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/05_Compare_Scores.png"&gt;&lt;/p&gt;
&lt;h3&gt;Two Dimensional Graphical Analysis&lt;/h3&gt;
&lt;p&gt;The Google API returns two dimensions of data: &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;.  The &lt;strong&gt;magnitude&lt;/strong&gt; data captures the intensity of emotion.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html"&gt;SciPy&lt;/a&gt; Kernel Density Estimation (KDE) generates a bivariate density plot for each speaker.&lt;/p&gt;
&lt;p&gt;The colors represent &lt;strong&gt;density&lt;/strong&gt;.  Darker colors indicate more instances of a particular &lt;strong&gt;score/magnitude&lt;/strong&gt; pair.  The black dots represent the actual data points.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: I scale the &lt;strong&gt;Score&lt;/strong&gt; by &lt;strong&gt;ten&lt;/strong&gt; to improve chart readability.&lt;/p&gt;
&lt;/blockquote&gt;
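&lt;p&gt;A minimal sketch of the per-speaker density estimate, with synthetic (hypothetical) score/magnitude pairs standing in for the real dialog data:&lt;/p&gt;

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scores = rng.normal(-2.0, 1.5, 200)      # score scaled by ten, leaning negative
magnitudes = rng.normal(1.0, 0.3, 200)   # emotional intensity clustered near one

# Fit a bivariate KDE over the (score, magnitude) pairs
kde = gaussian_kde(np.vstack([scores, magnitudes]))

# Evaluate the estimated density at the center of the cloud
density = kde([[-2.0], [1.0]])[0]
print(density > 0.0)
```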
&lt;p&gt;We know from the histogram that Raskolnikov leans negative.  He reins in the emotional intensity, however, with most of his dialog clocking in at an intensity of one (1).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov KDE Plot" src="https://john.soban.ski/images/Raskolnikov_Label/05_Rask_Kde.png"&gt;&lt;/p&gt;
&lt;p&gt;Compare Raskolnikov's nearly horizontal chart to Roark's chart.  Roark's chart angles up a bit in the negative zone.  This upward angle indicates that Roark increases emotional intensity in lockstep with negativity.  The more negative the dialog, the more intense the magnitude.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Roark KDE Plot" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/06_Roark_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;Seaborn once more allows us to overlay the two density plots.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Class&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This chart captures Raskolnikov's negative sentiment with neutral intensity and Roark's slight intensity upticks.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Overlay KDE Plot" src="https://john.soban.ski/images/Roark_Vs_Raskolnikov/07_Both_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Literary Analysis&lt;/h2&gt;
&lt;p&gt;An astute reader finds similarities between Rodion Raskolnikov, the protagonist of Fyodor Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;, and Howard Roark, from Ayn Rand's &lt;strong&gt;The Fountainhead&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;Rodion and Roark share misanthropic traits.  Roark's &lt;strong&gt;creator vs. second hander&lt;/strong&gt; hypothesis echoes Rodion's &lt;strong&gt;extraordinary man vs. raw materials&lt;/strong&gt; hypothesis.  Their ethics drive each to perform criminal acts of destruction.&lt;/p&gt;
&lt;p&gt;The two characters have separate reactions to their crimes.  The differences in their reactions set them apart.&lt;/p&gt;
&lt;h3&gt;Social Misanthropes&lt;/h3&gt;
&lt;p&gt;In college, Roark refuses to join a fraternity or engage in graduation festivities.  Rand writes he: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;never [has] any friend anywhere (Rand 253)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Peter Keating (an acquaintance of his) states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;nobody can like him (Rand 253).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Dostoyevsky writes that Raskolnikov, like Roark, enjoys: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;practically no friends [and] somehow fail[s] to take any part in [other students'] communal gatherings, their discussions and their amusements, and [holds] no share in any other aspects of their lives (Dostoyevsky 85).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov spends his time in college:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;study[ing] intensely, not sparing himself (Dostoyevsky 86)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Post-expulsion he takes a six-month hiatus from society to focus on personal philosophies.  Raskolnikov's peers label him a:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;haughtily arrogant [egoist] (Dostoyevsky 86).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Razumikhin, Raskolnikov's best friend, says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;He doesn't listen to what people say to him. He's never interested in what everyone else is interested in at any given moment (Dostoyevsky 265)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Pulcheria Alexandra, Rodion's mother, asks Dunya:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I mean, It couldn't be that he's an egotist, Dunechka? Eh? (Dostoyevsky 291)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Disgust Towards Parasites&lt;/h3&gt;
&lt;p&gt;Howard Roark despises collective thought. He says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The mind is an attribute of an individual. There is no such thing as a collective brain... No man can use his brain to think for another (Rand 737)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He further states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Man] can survive in only one of two ways- by the independent work of his own mind or as a parasite fed by the minds of others (Rand 738)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He considers these &lt;strong&gt;parasites&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;second handers [and] savages (Rand 742).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov increases the vitriol against &lt;strong&gt;second hand parasites&lt;/strong&gt;.  After Luzhin tries to impress Raskolnikov and Razumikhin with &lt;strong&gt;his&lt;/strong&gt; ideas on progress, Raskolnikov cuts him down and says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;He learned that all by rote! He's showing off! (Dostoyevsky 193)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Extraordinary Man vs. Egoist&lt;/h3&gt;
&lt;p&gt;Rodion subscribes to the theory of the &lt;strong&gt;extraordinary man&lt;/strong&gt; and Roark to that of the &lt;strong&gt;egoist&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Both theories separate human society into two classes. The two protagonists champion the improvement of society through great Men who stand apart from the masses.&lt;/p&gt;
&lt;p&gt;Rodion labels mediocre and uninspired members of the populace &lt;strong&gt;raw materials.&lt;/strong&gt; His &lt;strong&gt;raw materials&lt;/strong&gt; stand in for Roark's &lt;strong&gt;second handers&lt;/strong&gt;.   &lt;/p&gt;
&lt;p&gt;Neither the &lt;strong&gt;raw materials&lt;/strong&gt; nor the &lt;strong&gt;second handers&lt;/strong&gt; think for themselves; both recycle old ideas. They keep the world in an evolutionary stasis. Raskolnikov states that the &lt;strong&gt;Raw Materials&lt;/strong&gt; live only to:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;bring into being more like itself, and another group of people who possess a gift or talent for saying something new, in their own milieu (Dostoyevsky 313)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov's &lt;strong&gt;Raw Materials&lt;/strong&gt; live to procreate, to increase the chance of spawning Raskolnikov's &lt;strong&gt;extraordinary man&lt;/strong&gt;, or Roark's &lt;strong&gt;egoist/creator&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Both agree that the lesser men ostracize (or kill) the great men.  Rand writes that &lt;strong&gt;Second Handers&lt;/strong&gt; consider &lt;strong&gt;egoists&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;transgressors that venture into forbidden territory (Rand 736)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov says that &lt;strong&gt;raw materials&lt;/strong&gt; see &lt;strong&gt;extraordinary men&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;as being persons of backward and degrading views (Dostoyevsky 315)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Difference in Execution&lt;/h3&gt;
&lt;p&gt;Roark's philosophy stresses the importance of the individual.  A &lt;strong&gt;creator&lt;/strong&gt; will not rely on others to survive.  Society benefits when a collection of individuals focus on their own needs and align their actions with those needs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No creator [should be] prompted by a desire to serve his brothers, for his brothers [will] reject the gift offered and that gift [will] destroy the slothful routine of their lives. His truth [should be] his only motive. His own truth, and his only motive to [should be to] achieve it in his own way (Rand 737)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raskolnikov, however, suggests that &lt;strong&gt;extraordinary men&lt;/strong&gt; must use the &lt;strong&gt;raw materials&lt;/strong&gt; to their own ends:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If an extraordinary man "finds it necessary, for the sake of his idea, to step over a dead body, over a pool of blood, then he is able within his own conscience to do so. It's in this sense alone that [they have a] right to crime" (Dostoyevsky 313).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Roark stresses individual focus. Raskolnikov condones the (criminal) use of groups to reach a goal.&lt;/p&gt;
&lt;p&gt;Roark criticizes Raskolnikov's theory.  He considers Raskolnikov's conquerors &lt;strong&gt;second handers&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The most dreadful butchers [are] the most sincere. They believe in a perfect society through the guillotine and the firing squad. [Nobody questions] their right to murder since they [are] murdering for an 'altruistic' purpose. It [is] accepted that man must be sacrificed for other men... It goes on and will go on so long as men believe that an action is good if it is unselfish. That permits the 'altruist' to act and his victims to bear it. Now observe the results of a society built on individualism. This country was not based on selfless service, sacrifice, renunciation or any precept of altruism. It was based on a man's pursuit of happiness. Not anyone else's, a private personal, selfish motive (Rand 741-42)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Both Raskolnikov and Roark act according to their beliefs.  Raskolnikov's  &lt;strong&gt;extraordinary man&lt;/strong&gt; belief compels him to kill a pawnbroker and her sister.  Roark's  &lt;strong&gt;creator&lt;/strong&gt; belief leads him to destroy his &lt;strong&gt;desecrated&lt;/strong&gt; Cortlandt building.&lt;/p&gt;
&lt;h3&gt;Follow Through&lt;/h3&gt;
&lt;p&gt;Raskolnikov and Roark differ in their commitment to their criminal actions.  Roark holds to his convictions and experiences no guilt or compromise after he completes a crime that upholds his ethics.  He lives up to his &lt;strong&gt;creator&lt;/strong&gt; principles.&lt;/p&gt;
&lt;p&gt;Raskolnikov, however, cannot justify the crime to himself.  He admits failure on several occasions before he turns himself in.  He does not live up to his &lt;strong&gt;extraordinary man&lt;/strong&gt; principles.&lt;/p&gt;
&lt;p&gt;Howard Roark refuses to accept or consider any wrongdoing.  He insists that America must uphold its first principles and recognize the necessity and urgency of his (Roark's) actions.  A country that compromises equals a &lt;strong&gt;slave society&lt;/strong&gt; (Rand 743).&lt;/p&gt;
&lt;p&gt;Roark says he will serve his time:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;in memory and in gratitude for what my country has been. It will be my act of loyalty, my refusal to live and work in what has taken place (Rand 743).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Contrast Rodion to the stalwart Roark.  Raskolnikov experiences guilt, admits failure, and contemplates suicide.  He says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I don't want to go on like this (Dostoyevsky 200)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rodion cannot muster the conviction to self-annihilate:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I wanted to end it all there, but... I couldn't bring myself to do it (Dostoyevsky 593)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Rodion admits defeat and turns himself in.  He does not recognize any &lt;strong&gt;extraordinary man&lt;/strong&gt; qualities in his character.  He labels himself a &lt;strong&gt;failure&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's because of my own baseness and mediocrity that I'm taking this step (Dostoyevsky 595). &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He questions why he, a &lt;strong&gt;raw material&lt;/strong&gt;, felt that he was qualified to act in the manner of an &lt;strong&gt;extraordinary man&lt;/strong&gt;.  He speculates: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;the strength of his own desires that made him believe he was a person to whom more was allowed than others (Dostoyevsky 623)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Rodion and Roark see banality and mediocrity in the &lt;strong&gt;Common Man&lt;/strong&gt;.  They share apathy towards base men and disgust towards phony men.  Their disgust drives one to become a hermit for several months and makes the other an object of hatred and jealousy among his peers.  Both commit crimes in accordance with their beliefs.  In the aftermath, Rodion folds and Roark stands strong.&lt;/p&gt;
&lt;p&gt;Did Rand use the character Rodion to inspire Roark?  If so, Rand aligns with Roark's &lt;strong&gt;creator&lt;/strong&gt; principles.&lt;/p&gt;
&lt;p&gt;Roark says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We inherit the products of the thought of other men. We inherit the wheel. We make a cart. The cart becomes an automobile... But all through the process what we receive from others is the end product of their thinking (Rand 738)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Dostoevsky, Fyodor. &lt;em&gt;Crime and Punishment&lt;/em&gt;. Bantam Books, 1996.&lt;/li&gt;
&lt;li&gt;Rand, Ayn.  &lt;em&gt;The Fountainhead&lt;/em&gt;.  Plume, 1994.&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="NLP"></category><category term="Machine Learning"></category><category term="Literature"></category></entry><entry><title>Label Speakers in Text with Natural Language Processing</title><link href="https://john.soban.ski/raskolnikov-label.html" rel="alternate"></link><published>2023-06-24T05:56:00-04:00</published><updated>2023-06-24T05:56:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2023-06-24:/raskolnikov-label.html</id><summary type="html">&lt;p&gt;The domain of Natural Language Processing (NLP) provides novel tools for literary analysis.  Analysts use NLP tools to record sentiment, emotional intensity, and word frequencies.  Today I use NLP techniques to extract Raskolnikov's speaking (and thinking) quotes from Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;In a prior blog post, I compared …&lt;/p&gt;</summary><content type="html">&lt;p&gt;The domain of Natural Language Processing (NLP) provides novel tools for literary analysis.  Analysts use NLP tools to record sentiment, emotional intensity, and word frequencies.  Today I use NLP techniques to extract Raskolnikov's speaking (and thinking) quotes from Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;In a prior blog post, I compared the speaker sentiment (positive or negative) and intensity of emotions between the &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Unabomber and Thoreau&lt;/a&gt;.  I used their respective manifestos.  Each wrote his manifesto in the first person and spoke (or thought) every word of the text.  Because neither manifesto includes other speakers, data preparation was simple: I fed each entire document to my &lt;a href="https://john.soban.ski/tag/nlp.html"&gt;NLP&lt;/a&gt; models.&lt;/p&gt;
&lt;p&gt;Now consider Dostoyevsky's &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Raskolnikov Yells" src="https://john.soban.ski/images/Raskolnikov_Label/01_Rask_Yell.png"&gt;&lt;/p&gt;
&lt;p&gt;The text includes dozens of characters, each with their own speaking and thinking lines.  Since I only want to analyze Raskolnikov, I must extract his text from the book.  I filter other characters, narration, and page numbers from the analysis.&lt;/p&gt;
&lt;p&gt;I see three ways to extract Raskolnikov's thinking and speaking parts.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dive into the book and &lt;strong&gt;cut and paste&lt;/strong&gt; his lines &lt;strong&gt;by hand&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Write a series of &lt;strong&gt;if/then&lt;/strong&gt; heuristics with rules and logic to extract his lines (e.g. if you see the phrase &lt;strong&gt;said Raskolnikov&lt;/strong&gt;, pull the line)&lt;/li&gt;
&lt;li&gt;Train a Machine Learning (ML) model to extract the lines for me&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I decide to go with #3 and train a model to do the work for me.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Keras Logo" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/00_Tf_Keras_Logo.png"&gt;&lt;/p&gt;
&lt;p&gt;I use &lt;a href="https://john.soban.ski/tag/keras.html"&gt;Keras&lt;/a&gt; and &lt;a href="https://john.soban.ski/tag/tensorflow.html"&gt;TensorFlow&lt;/a&gt; to train my model.&lt;/p&gt;
&lt;h2&gt;Label the Training Data&lt;/h2&gt;
&lt;p&gt;I seed the model with training data.  I pull representative lines of text and label them &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Not Raskolnikov&lt;/strong&gt; with Microsoft EXCEL.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Excel Pic" src="https://john.soban.ski/images/Raskolnikov_Label/02_Excel_Pic.png"&gt;&lt;/p&gt;
&lt;p&gt;I use my judgment and experience and decide that roughly two hundred and fifty (250) data points should suffice for training.&lt;/p&gt;
&lt;p&gt;My EXCEL efforts yield two files:  &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/Raskolnikov.txt"&gt;Raskolnikov.txt&lt;/a&gt; and &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/Other.txt"&gt;Other.txt&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;I then use a simple &lt;strong&gt;BASH&lt;/strong&gt; script to put each line into a separate file.  The script then moves these files to their appropriate directory.  Keras imports the data and labels them based on the name of their parent directory.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;Raskolnikov.txt&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;line&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;./Raskolnikov/R_&lt;span class="nv"&gt;$i&lt;/span&gt;.txt&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;((&lt;/span&gt;i++&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="nv"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;Other.txt&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;read&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;line&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;./Other/O_&lt;span class="nv"&gt;$i&lt;/span&gt;.txt&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;((&lt;/span&gt;i++&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This yields two directories.  I then use the Linux &lt;strong&gt;mv&lt;/strong&gt; command to move some of the &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Other&lt;/strong&gt; labeled files to a &lt;strong&gt;testing&lt;/strong&gt; directory.&lt;/p&gt;
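&lt;p&gt;A Python sketch of the same split-and-move step (hypothetical file names and an assumed eighty/twenty split):&lt;/p&gt;

```python
import os
import random
import shutil

random.seed(7)
os.makedirs('Raskolnikov', exist_ok=True)
os.makedirs('testing/Raskolnikov', exist_ok=True)

# Stand-in for the one-line-per-file output of the BASH script
for i in range(1, 11):
    with open(f'Raskolnikov/R_{i}.txt', 'w') as f:
        f.write(f'line {i}\n')

# Hold out roughly twenty percent of the labeled files for testing
files = sorted(os.listdir('Raskolnikov'))
holdout = random.sample(files, k=len(files) // 5)
for name in holdout:
    shutil.move(os.path.join('Raskolnikov', name),
                os.path.join('testing', 'Raskolnikov', name))

print(len(os.listdir('Raskolnikov')), len(os.listdir('testing/Raskolnikov')))
```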
&lt;p&gt;I create a parent directory named &lt;a href="https://github.com/hatdropper1977/Raskolnikov/tree/main/crime_and_punish"&gt;crime_and_punish&lt;/a&gt;, with two sub-directories, &lt;a href="https://github.com/hatdropper1977/Raskolnikov/tree/main/crime_and_punish/training"&gt;training&lt;/a&gt; and &lt;a href="https://github.com/hatdropper1977/Raskolnikov/tree/main/crime_and_punish/testing"&gt;testing&lt;/a&gt;, each of which contains files for &lt;strong&gt;Raskolnikov&lt;/strong&gt; and &lt;strong&gt;Other&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;tree&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish
crime_and_punish
├──&lt;span class="w"&gt; &lt;/span&gt;testing
│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;Other&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_120.txt
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_137.txt
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
│&lt;span class="w"&gt;         &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;O_138.txt
│&lt;span class="w"&gt;         &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;Raskolnikov&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;38&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
│&lt;span class="w"&gt;             &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;R_120.txt
│&lt;span class="w"&gt;             &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;R_121.txt
│&lt;span class="w"&gt;             &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
│&lt;span class="w"&gt;             &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;R_132.txt
└──&lt;span class="w"&gt; &lt;/span&gt;training
&lt;span class="w"&gt;    &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;Other&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;119&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_100.txt
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;O_99.txt
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
&lt;span class="w"&gt;    &lt;/span&gt;│&lt;span class="w"&gt;         &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;O_9.txt
&lt;span class="w"&gt;    &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;Raskolnikov&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;144&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Samples&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;R_100.txt
&lt;span class="w"&gt;        &lt;/span&gt;├──&lt;span class="w"&gt; &lt;/span&gt;...
&lt;span class="w"&gt;        &lt;/span&gt;└──&lt;span class="w"&gt; &lt;/span&gt;R_9.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I zip the topmost directory into a &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/crime_and_punish.zip"&gt;zip&lt;/a&gt; file for easy portability.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;zip&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish.zip&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Raskolnikov Headshot" src="https://john.soban.ski/images/Raskolnikov_Label/03_Rask_Head.png"&gt;&lt;/p&gt;
&lt;h2&gt;Train the model&lt;/h2&gt;
&lt;p&gt;I upload my &lt;strong&gt;ZIP&lt;/strong&gt; file into my &lt;a href="https://john.soban.ski/sagemaker-upgrade-pandas.html"&gt;Amazon Web Services Sagemaker Notebook&lt;/a&gt; through the graphical user interface and then run a code block to extract the labeled dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;!unzip&lt;span class="w"&gt; &lt;/span&gt;crime_and_punish.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I import the required Machine Learning libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;nlp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;

&lt;span class="c1"&gt;# Use mixed precision for optimal performance&lt;/span&gt;
&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mixed_precision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_global_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mixed_float16&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keras provides helper functions to import training data into TensorFlow.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;cp_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_dataset_from_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;crime_and_punish/training&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cp_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_dataset_from_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;crime_and_punish/testing&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;train&lt;/strong&gt; Dataset includes samples for model training and validation.  The &lt;strong&gt;test&lt;/strong&gt; Dataset includes &lt;strong&gt;holdout&lt;/strong&gt; data to &lt;strong&gt;surprise&lt;/strong&gt; our model and simulate real-world interaction.&lt;/p&gt;
&lt;p&gt;We inspect the structure of the new Dataset object, which wraps each line of text and its integer class label in tensors.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cp_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unbatch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_single_element&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="mi"&gt;239&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="n"&gt;belonging&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="n"&gt;belonging&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;Get up, why are you asleep!&amp;quot; she called to him: &amp;quot;It&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;s past nine, I have brought you some tea; will you have a cup? I should think you&lt;/span&gt;&lt;span class="se"&gt;\&amp;#39;&lt;/span&gt;&lt;span class="s1"&gt;re fairly starving?&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following commands instruct Keras to train a custom model with a BERT backbone.  This &lt;strong&gt;Transfer Learning&lt;/strong&gt; approach leverages a pre-trained NLP model to boost the performance of our custom model.&lt;/p&gt;
&lt;p&gt;We first load the &lt;strong&gt;bert_medium_en_uncased&lt;/strong&gt; model into our workspace.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras_nlp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BertClassifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_preset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;bert_medium_en_uncased&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We then instruct Keras to fine-tune the model on the training data &lt;strong&gt;cp_train&lt;/strong&gt;, with &lt;strong&gt;cp_test&lt;/strong&gt; as validation data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cp_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cp_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;SageMaker outputs the training results:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[=&lt;/span&gt;&amp;gt;............................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;:57&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6997&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5000
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[===&lt;/span&gt;&amp;gt;..........................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;:03&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7119&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4375
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[=====&lt;/span&gt;&amp;gt;........................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;:17&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7036&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5208&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[=======&lt;/span&gt;&amp;gt;......................&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;ETA:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8&lt;/span&gt;:30&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7020&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5469
&lt;span class="m"&gt;15&lt;/span&gt;/15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[==============================]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;914s&lt;span class="w"&gt; &lt;/span&gt;56s/step&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6995&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5397&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;val_loss:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6157&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;val_sparse_categorical_accuracy:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7188
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our model provides an impressive accuracy of 71.88% on our &lt;strong&gt;holdout&lt;/strong&gt; data.&lt;/p&gt;
&lt;p&gt;I test-drive our model with the following quote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;"It's in the houses of spiteful old widows that one finds such cleanliness," Raskolnikov thought again, and he stole a curious glance at the cotton curtain over the door leading into another tiny room, in which stood the old woman's bed and chest of drawers and into which he had never looked before. These two rooms made up the whole flat.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I need to escape the quotes when I call the Model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="s2"&gt;It&amp;#39;s in the houses of spiteful old widows that one finds such cleanliness,&lt;/span&gt;&lt;span class="se"&gt;\&amp;quot;&lt;/span&gt;&lt;span class="s2"&gt; Raskolnikov thought again, and he stole a curious glance at the cotton curtain over the door leading into another tiny room, in which stood the old woman&amp;#39;s bed and chest of drawers and into which he had never looked before. These two rooms made up the whole flat.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model outputs disappointing results, with no clear prediction of class &lt;strong&gt;Raskolnikov&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="m"&gt;1&lt;/span&gt;/1&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[==============================]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;7s&lt;span class="w"&gt; &lt;/span&gt;7s/step
array&lt;span class="o"&gt;([[&lt;/span&gt;-0.01614,&lt;span class="w"&gt; &lt;/span&gt;-0.0249&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float16&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
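&lt;p&gt;The classifier returns raw logits rather than probabilities.  A &lt;strong&gt;softmax&lt;/strong&gt; converts the pair of logits into class probabilities (a sketch using the logits above; the model itself is unchanged):&lt;/p&gt;

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to probabilities that sum to one."""
    exps = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exps / exps.sum()

# Logits from the test-drive prediction above
probs = softmax(np.array([-0.01614, -0.0249]))
print(probs)  # near [0.5, 0.5]: the model makes no clear call
```

&lt;p&gt;A near 50/50 split confirms the model has no confident prediction for this quote.&lt;/p&gt;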

&lt;h2&gt;Label Crime and Punishment&lt;/h2&gt;
&lt;p&gt;I feed the entire text of &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/cp.csv"&gt;Crime and Punishment&lt;/a&gt; into my model and have the model label each line.&lt;/p&gt;
&lt;p&gt;First, I load the text into a &lt;strong&gt;Pandas Dataframe&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cp.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;Crime&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="n"&gt;man&lt;/span&gt; &lt;span class="n"&gt;came&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;garret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;lodged&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Good God!&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;cried&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;can it be, can it be,...&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;N-no,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;Dounia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What a pig you are!&amp;#39;&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Dataframe&lt;/strong&gt; includes 4,425 rows.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4425&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I use a &lt;strong&gt;Lambda function&lt;/strong&gt; to send each row of the text to my model.  The model returns a prediction in the form of &lt;strong&gt;[[Likelihood of Other, Likelihood of Raskolnikov]]&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The prediction consumes three hours of wall-clock time.&lt;/p&gt;
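&lt;p&gt;Most of that time comes from per-row overhead: the &lt;strong&gt;apply&lt;/strong&gt; call invokes &lt;strong&gt;predict&lt;/strong&gt; once for each of the 4,425 rows.  Keras can batch a single call over the whole column instead.  The sketch below shows the pattern with a stand-in &lt;strong&gt;predict&lt;/strong&gt; function; the real call would be &lt;strong&gt;classifier.predict(df[&amp;#39;Crime&amp;#39;].tolist())&lt;/strong&gt;, and actual timings will vary:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Stand-in for classifier.predict: one [Other, Rask] logit pair per input text
def predict(texts):
    return np.zeros((len(texts), 2), dtype=np.float16)

df = pd.DataFrame({'Crime': ['line one', 'line two', 'line three']})

# Row-by-row: one predict() call per row (slow for a real model)
lb = df.apply(lambda X: predict([X['Crime']]), axis=1)

# Batched: a single predict() call over the whole column
logits = predict(df['Crime'].tolist())
print(logits.shape)  # (3, 2): one logit pair per row
```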
&lt;p&gt;In the end, I get a &lt;strong&gt;Series&lt;/strong&gt; with predictions for each line of text.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;lb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.00489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.01569&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;          &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.2327&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1193&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;           &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04248&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;        &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.06537&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1735&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;          &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1049&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.359&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I join the predictions &lt;strong&gt;Series&lt;/strong&gt; with the &lt;strong&gt;Crime and Punishment&lt;/strong&gt; &lt;strong&gt;Dataframe&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces a &lt;strong&gt;Dataframe&lt;/strong&gt; with the following structure.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                                                &lt;span class="n"&gt;Crime&lt;/span&gt;                   &lt;span class="n"&gt;Result&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;man&lt;/span&gt; &lt;span class="n"&gt;came&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;garret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;lodged&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.00489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.01569&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Good God!&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;cried&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;can it be, can it be,...   [[0.2327, 0.1193]]&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;N-no,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;Dounia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04248&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What a pig you are!&amp;#39;                             [[-0.06537, 0.1735]]&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1049&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.359&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Two &lt;strong&gt;Lambda functions&lt;/strong&gt; extract the two likelihoods from each &lt;strong&gt;Result&lt;/strong&gt; array and create two new &lt;strong&gt;Pandas&lt;/strong&gt; columns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
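&lt;p&gt;A vectorized alternative stacks the nested &lt;strong&gt;Result&lt;/strong&gt; arrays into a single matrix and assigns both columns at once (a sketch with toy values standing in for the real predictions):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the [[Other, Rask]] prediction arrays
cp = pd.DataFrame({
    'Crime': ['line one', 'line two'],
    'Result': [np.array([[-0.00489, -0.01569]]), np.array([[0.2327, 0.1193]])],
})

# Stack the (1, 2) arrays into one (n, 2) matrix, then split into columns
logits = np.vstack(cp['Result'].to_list())
cp['Other'], cp['Rask'] = logits[:, 0], logits[:, 1]
print(cp[['Other', 'Rask']])
```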

&lt;p&gt;I create a &lt;strong&gt;Diff&lt;/strong&gt; column to record the difference in likelihood between the two columns.  Large differences indicate greater certainty.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Rask&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Other&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

                                                &lt;span class="n"&gt;Crime&lt;/span&gt;                   &lt;span class="n"&gt;Result&lt;/span&gt;        &lt;span class="n"&gt;Other&lt;/span&gt;      &lt;span class="n"&gt;Rask&lt;/span&gt;      &lt;span class="n"&gt;Diff&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;   &lt;span class="n"&gt;man&lt;/span&gt; &lt;span class="n"&gt;came&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;garret&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;which&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;lodged&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.00489&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.01569&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.004890&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.015686&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.010796&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Good God!&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;cried&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;can it be, can it be,...   [[0.2327, 0.1193]]         0.232666  0.119324 -0.113342&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;N-no,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;Dounia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;animation&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.04248&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;          &lt;span class="mf"&gt;0.109985&lt;/span&gt;  &lt;span class="mf"&gt;0.042480&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.067505&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;What a pig you are!&amp;#39;                             [[-0.06537, 0.1735]]      -0.065369  0.173462  0.238770&lt;/span&gt;
&lt;span class="mi"&gt;4&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;   &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1049&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.359&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;        &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.104919&lt;/span&gt;  &lt;span class="mf"&gt;0.358887&lt;/span&gt;  &lt;span class="mf"&gt;0.463867&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can find the full labeled text of Crime and Punishment &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/labeled_cp.csv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A histogram illustrates the distribution of the likelihood differences:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Picture of histogram" src="https://john.soban.ski/images/Raskolnikov_Label/04_Diff_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;We need to set a &lt;strong&gt;threshold&lt;/strong&gt; of certainty.&lt;/p&gt;
&lt;p&gt;A threshold of &lt;strong&gt;0.1&lt;/strong&gt; yields 1,088 rows, a threshold of &lt;strong&gt;0.2&lt;/strong&gt; yields 670 rows, and both keep a few incorrect labels.&lt;/p&gt;
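&lt;p&gt;Counting the rows each threshold keeps takes one line of &lt;strong&gt;Pandas&lt;/strong&gt; (a sketch with toy &lt;strong&gt;Diff&lt;/strong&gt; values in place of the real column):&lt;/p&gt;

```python
import pandas as pd

# Toy Diff values; the real column holds 4,425 likelihood differences
cp = pd.DataFrame({'Diff': [-0.01, 0.05, 0.12, 0.24, 0.46]})

# A boolean mask sums to the number of rows above each threshold
print((cp['Diff'] > 0.1).sum())  # 3
print((cp['Diff'] > 0.2).sum())  # 2
```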
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;cp&lt;span class="o"&gt;[&lt;/span&gt;cp&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1&lt;span class="o"&gt;]&lt;/span&gt;.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;print&lt;span class="o"&gt;(&lt;/span&gt;X.Crime&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="s2"&gt;&amp;quot;&amp;quot;&lt;/span&gt;Ah,&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="s1"&gt;&amp;#39;s eating, then he&amp;#39;&lt;/span&gt;s&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;ill,&lt;span class="s2"&gt;&amp;quot; said Razumihin. He took a chair and sat down at the table opposite Raskolnikov.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;..Â»&lt;span class="w"&gt; &lt;/span&gt;But&lt;span class="w"&gt; &lt;/span&gt;I&lt;span class="w"&gt; &lt;/span&gt;should&lt;span class="w"&gt; &lt;/span&gt;like&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;know&lt;span class="w"&gt; &lt;/span&gt;why&lt;span class="w"&gt; &lt;/span&gt;mother&lt;span class="w"&gt; &lt;/span&gt;has&lt;span class="w"&gt; &lt;/span&gt;written&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;me&lt;span class="w"&gt; &lt;/span&gt;about&lt;span class="w"&gt; &lt;/span&gt;â€˜our&lt;span class="w"&gt; &lt;/span&gt;most&lt;span class="w"&gt; &lt;/span&gt;rising&lt;span class="w"&gt; &lt;/span&gt;generation&lt;span class="s1"&gt;&amp;#39;? Simply as a descriptive touch, or with the idea of prepossessing me in favour of Mr. Luzhin? Oh, the cunning of them! I should like to know one thing more: how far they were open with one another that day and night and all this time since? Was&lt;/span&gt;
&lt;span class="s1"&gt;&amp;quot;[&amp;#39;&lt;/span&gt;ve&lt;span class="w"&gt; &lt;/span&gt;only&lt;span class="w"&gt; &lt;/span&gt;just&lt;span class="w"&gt; &lt;/span&gt;waked&lt;span class="w"&gt; &lt;/span&gt;up,&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;wanted&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;go&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;you,&lt;span class="w"&gt; &lt;/span&gt;but&lt;span class="w"&gt; &lt;/span&gt;was&lt;span class="w"&gt; &lt;/span&gt;delayed&lt;span class="w"&gt; &lt;/span&gt;owing&lt;span class="w"&gt; &lt;/span&gt;tomy&lt;span class="w"&gt; &lt;/span&gt;clothes&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;I&lt;span class="w"&gt; &lt;/span&gt;forgot&lt;span class="w"&gt; &lt;/span&gt;yesterday&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;ask
&lt;span class="s2"&gt;&amp;quot;A cup of tea, maybe.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;A notice from the office,&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="w"&gt; &lt;/span&gt;announced,&lt;span class="w"&gt; &lt;/span&gt;as&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="w"&gt; &lt;/span&gt;gave&lt;span class="w"&gt; &lt;/span&gt;him&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;paper.
&lt;span class="s2"&gt;&amp;quot;A painter?&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;A pickpocket I dare say.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;A priest,&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;he&lt;span class="w"&gt; &lt;/span&gt;articulated&lt;span class="w"&gt; &lt;/span&gt;huskily.
&lt;span class="s2"&gt;&amp;quot;A strange scene passed between us last time we met, Rodion Romanovitch. Our first interview, too, was a&lt;/span&gt;
&lt;span class="s2"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
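&lt;p&gt;To choose a threshold, I count the rows that survive each candidate value.  The following is a minimal sketch with an invented, four-row stand-in for the real &lt;strong&gt;cp&lt;/strong&gt; DataFrame (the column names match the post; the data is illustrative only):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-in for the real `cp` DataFrame, which holds one
# candidate quote per row plus a 'Diff' certainty column.
cp = pd.DataFrame({
    'Crime': ['quote a', 'quote b', 'quote c', 'quote d'],
    'Diff': [0.05, 0.15, 0.25, 0.35],
})

# Count how many rows survive each candidate threshold.
counts = {t: len(cp[cp['Diff'] > t]) for t in (0.1, 0.2, 0.3)}
print(counts)  # {0.1: 3, 0.2: 2, 0.3: 1}
```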

&lt;p&gt;I set the threshold to &lt;strong&gt;0.3&lt;/strong&gt; and save the filtered quotes to a &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/rask.txt"&gt;text file&lt;/a&gt;.  This threshold still yields a few incorrect labels.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Crime&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="s2"&gt;&amp;quot;I am the murderer. . . . I want to give evidence,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;Nikolay&lt;/span&gt; &lt;span class="n"&gt;pronounced&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I am thinking,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;answered&lt;/span&gt; &lt;span class="n"&gt;seriously&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;pause&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;beg&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;say&lt;/span&gt; &lt;span class="n"&gt;no&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot; said Raskolnikov. &amp;quot;&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="nb"&gt;any&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;unpardonable&lt;/span&gt; &lt;span class="n"&gt;impertinence&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I can&amp;#39;t help it... . 1 will come in half an hour. Tell them.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I do,&amp;#39; repeated Raskolnikov, raising his eyes to Porfiry.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I don&amp;#39;t want it,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pushing&lt;/span&gt; &lt;span class="n"&gt;away&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;pen&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I fainted then because it was so close and the smell of paint,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I know.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I remember now,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt; &lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;long&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suller&lt;/span&gt; &lt;span class="n"&gt;silence&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Razumihin&lt;/span&gt; &lt;span class="n"&gt;looked&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;him&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frowning&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;uneasy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I say nothing about him,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;added&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pointing&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Razumihin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;though he has had nothing from me either but insult and trouble.&amp;quot;&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I suppose you didn&amp;#39;t expect it?&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;though&lt;/span&gt; &lt;span class="n"&gt;he&lt;/span&gt; &lt;span class="n"&gt;had&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fully&lt;/span&gt; &lt;span class="n"&gt;grasped&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;situation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;had&lt;/span&gt; &lt;span class="n"&gt;regained&lt;/span&gt; &lt;span class="n"&gt;his&lt;/span&gt; &lt;span class="n"&gt;courage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I was summoned... by a notice...&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt; &lt;span class="n"&gt;faltered&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;I won&amp;#39;t drink anything,&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;said&lt;/span&gt; &lt;span class="n"&gt;Raskolnikov&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;

&lt;span class="n"&gt;rask_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Diff&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;rask_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rask.txt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
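&lt;p&gt;Note that &lt;strong&gt;Series.to_csv&lt;/strong&gt; writes the column name (&lt;strong&gt;Crime&lt;/strong&gt;) as the first line of the file by default, so that word later gets scored as if it were a quote.  Passing &lt;strong&gt;header=False&lt;/strong&gt; suppresses it.  A minimal sketch with invented quotes and a hypothetical filename:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical two-quote Series named 'Crime', like the filtered column above.
rask = pd.Series(['I am the murderer', 'I know'], name='Crime')

# header=False drops the 'Crime' header line that to_csv emits by default.
rask.to_csv('rask_demo.txt', index=False, header=False)

with open('rask_demo.txt', encoding='utf-8') as f:
    first_line = f.readline().strip()
print(first_line)  # I am the murderer
```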

&lt;h2&gt;Sentiment and Emotional Analysis&lt;/h2&gt;
&lt;p&gt;I take the file of &lt;strong&gt;Raskolnikov&lt;/strong&gt; quotes and feed them into the Google Cloud Natural Language API.&lt;/p&gt;
&lt;p&gt;I import the required libraries and set &lt;strong&gt;FILENAME&lt;/strong&gt; to &lt;strong&gt;rask.txt&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;

&lt;span class="n"&gt;FILENAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rask.txt&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following code records the sentiment &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt; of each &lt;strong&gt;Raskolnikov&lt;/strong&gt; quote in a &lt;strong&gt;Pandas DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;See my post on &lt;a href="https://john.soban.ski/thoreau-vs-unabomber.html"&gt;Unabomber vs. Thoreau&lt;/a&gt; for details.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sentiment_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiates a client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LanguageServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PLAIN_TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;document&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document_sentiment&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;magnitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ERROR: &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;_sentiment.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This produces the following &lt;a href="https://github.com/hatdropper1977/Raskolnikov/blob/main/rask_sentiment.csv"&gt;DataFrame&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;score&lt;/span&gt;    &lt;span class="n"&gt;magnitude&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;    &lt;span class="mf"&gt;0.6&lt;/span&gt;          &lt;span class="n"&gt;Crime&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;     &lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;          &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;nero&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;uf&lt;/span&gt; &lt;span class="n"&gt;dis&lt;/span&gt; &lt;span class="n"&gt;atari&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;sae&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;afeeyeeinae&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;2&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;    &lt;span class="mf"&gt;0.3&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;!&amp;quot;&amp;quot;? he bent over her once&amp;quot;&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="mf"&gt;0.3&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;&amp;quot;&amp;quot;&amp;quot;&amp;quot;Ah, he&amp;#39;s eating, then he&amp;#39;s not ill,&amp;quot;&amp;quot; sai...&lt;/span&gt;
&lt;span class="s2"&gt;4     0.0    1.3          &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;Â&lt;/span&gt;&lt;span class="err"&gt;»&lt;/span&gt; &lt;span class="n"&gt;But&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;should&lt;/span&gt; &lt;span class="n"&gt;like&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;know&lt;/span&gt; &lt;span class="n"&gt;why&lt;/span&gt; &lt;span class="n"&gt;mother&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
 &lt;span class="o"&gt;...&lt;/span&gt;  &lt;span class="o"&gt;...&lt;/span&gt;    &lt;span class="o"&gt;...&lt;/span&gt;                             &lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;367&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;    &lt;span class="mf"&gt;1.2&lt;/span&gt;          &lt;span class="n"&gt;two&lt;/span&gt; &lt;span class="n"&gt;sharp&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;suspicious&lt;/span&gt; &lt;span class="n"&gt;eyes&lt;/span&gt; &lt;span class="n"&gt;stared&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;him&lt;/span&gt; &lt;span class="n"&gt;ou&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;368&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;    &lt;span class="mf"&gt;0.2&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;very much struck by your face this moraine. 4...&lt;/span&gt;
&lt;span class="mi"&gt;369&lt;/span&gt;   &lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="mf"&gt;0.2&lt;/span&gt;          &lt;span class="s2"&gt;&amp;quot;Ves. g4V come,&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
&lt;span class="mi"&gt;370&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="mf"&gt;0.3&lt;/span&gt;          &lt;span class="n"&gt;Well&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;shall&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="mi"&gt;371&lt;/span&gt;   &lt;span class="mf"&gt;0.0&lt;/span&gt;    &lt;span class="mf"&gt;0.0&lt;/span&gt;          &lt;span class="n"&gt;wiee&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
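&lt;p&gt;For reference, the API returns a &lt;strong&gt;score&lt;/strong&gt; between -1.0 and 1.0 (overall polarity) and a non-negative &lt;strong&gt;magnitude&lt;/strong&gt; (total emotional intensity), so a near-zero score with a high magnitude suggests mixed emotion rather than neutrality.  The sketch below buckets (score, magnitude) pairs this way; the cutoffs and bucket names are my own, not part of the API:&lt;/p&gt;

```python
def sentiment_bucket(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Classify a (score, magnitude) pair into a coarse label.

    The cutoffs are illustrative, not prescribed by the API.
    """
    if abs(score) >= score_cutoff:
        return 'positive' if score > 0 else 'negative'
    # Near-zero score: a high magnitude means strong emotions cancelled out.
    return 'mixed' if magnitude >= magnitude_cutoff else 'neutral'

# (score, magnitude) pairs drawn from the Raskolnikov sentiment table above.
print(sentiment_bucket(-0.6, 0.6))  # negative
print(sentiment_bucket(0.0, 1.3))   # mixed
print(sentiment_bucket(0.1, 0.2))   # neutral
```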

&lt;p&gt;I use &lt;strong&gt;scipy&lt;/strong&gt; to compute a Kernel Density Estimate (KDE) and &lt;strong&gt;matplotlib&lt;/strong&gt; to graph it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;

&lt;span class="n"&gt;m1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;#scaled to improve Data Viz&lt;/span&gt;
&lt;span class="n"&gt;m2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;xmin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;xmax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ymin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ymax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mgrid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;ymax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;positions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ravel&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vstack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gaussian_kde&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;positions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rot90&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gist_earth_r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;extent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymax&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;k.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;markersize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xmax&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;ymin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ymax&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Score&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Magnitude&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;matplotlib&lt;/strong&gt; produces the KDE plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="KDE Plot" src="https://john.soban.ski/images/Raskolnikov_Label/05_Rask_Kde.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I demonstrated how to use NLP to label speakers in a text.  This allows literary analysts to apply NLP tools on a per-character rather than a per-novel basis.&lt;/p&gt;
&lt;p&gt;I first created a corpus of labeled training data.  I then used transfer learning to customize a medium-scale BERT model to fit the training data.  This produced a model that isolated Raskolnikov's quotes from &lt;strong&gt;Crime and Punishment&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I use this approach in next month's post, where &lt;a href="https://john.soban.ski/roark-vs-raskolnikov.html"&gt;I compare Fyodor Dostoevsky's Rodion Raskolnikov (Crime and Punishment) with Ayn Rand's Howard Roark (The Fountainhead)&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Coda&lt;/h2&gt;
&lt;p&gt;I ran into some issues with &lt;strong&gt;Keras and TensorFlow&lt;/strong&gt; on &lt;strong&gt;Amazon SageMaker&lt;/strong&gt;.  I record the &lt;strong&gt;issues and solutions&lt;/strong&gt; here.&lt;/p&gt;
&lt;p&gt;I received the &lt;strong&gt;ValueError: Unable to import backend : mxnet&lt;/strong&gt; failure when I attempted to import &lt;strong&gt;keras-nlp&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.10.9&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;packaged&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;conda&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;forge&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Feb&lt;/span&gt;  &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;04&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;GCC&lt;/span&gt; &lt;span class="mf"&gt;11.3.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;linux&lt;/span&gt;
&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;help&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;copyright&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;credits&amp;quot;&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;license&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp&lt;/span&gt;
&lt;span class="n"&gt;Traceback&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;lt;stdin&amp;gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/layers/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.layers.modeling.cached_multi_head_attention&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CachedMultiHeadAttention&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/layers/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.layers.modeling.cached_multi_head_attention&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/layers/modeling/cached_multi_head_attention.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.api_export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras_nlp_export&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/api_export.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.backend&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/backend/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_nlp.src.backend&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_nlp/src/backend/config.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;keras_core&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;activations&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/activations/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src.activations&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deserialize&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;activations&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/activations/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src.activations.activations&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;elu&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/activations/activations.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;keras_core.src&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;/home/ec2-user/anaconda3/lib/python3.10/site-packages/keras_core/src/backend/__init__.py&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Unable to import backend : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Unable&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mxnet&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://github.com/keras-team/keras-nlp/issues/1172"&gt;Keras-NLP team discovered&lt;/a&gt; that Amazon ships a hard-coded &lt;strong&gt;keras.json&lt;/strong&gt; config that sets the Keras backend to MXNet when it launches a new SageMaker notebook.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;!cat&lt;span class="w"&gt; &lt;/span&gt;~/.keras/keras.json

&lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;floatx&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float32&amp;quot;&lt;/span&gt;,
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;epsilon&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;1e-07,
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;backend&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;mxnet&amp;quot;&lt;/span&gt;,
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;image_data_format&amp;quot;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;channels_first&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These commands fix the issue:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;floatx&amp;quot;: &amp;quot;float32&amp;quot;,&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;epsilon&amp;quot;: 1e-07,&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;backend&amp;quot;: &amp;quot;tensorflow&amp;quot;,&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;    &amp;quot;image_data_format&amp;quot;: &amp;quot;channels_first&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;}&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;out.txt
!&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;out.txt&lt;span class="w"&gt; &lt;/span&gt;~/.keras/keras.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
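&lt;p&gt;As an alternative to the shell redirection above, the same config can be rewritten with Python's &lt;strong&gt;json&lt;/strong&gt; module.  This is a sketch of the same fix, not the method from the Keras-NLP issue; it assumes the standard &lt;strong&gt;~/.keras/keras.json&lt;/strong&gt; location.&lt;/p&gt;

```python
import json
import os

# Path Keras reads its backend config from.
config_path = os.path.expanduser("~/.keras/keras.json")
os.makedirs(os.path.dirname(config_path), exist_ok=True)

# Same values as the echo commands above, with "tensorflow"
# replacing the hard-coded "mxnet" backend.
config = {
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_first",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```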

&lt;p&gt;I used the following library versions for this excursion.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.1
keras-core&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1.2
keras-nlp&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6.0

&lt;span class="nv"&gt;tensorflow&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.0
tensorflow-estimator&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.0
tensorflow-hub&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.14.0
tensorflow-io-gcs-filesystem&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.32.0
tensorflow-text&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;.13.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I also faced an issue where SageMaker crashed when I labeled the entire text of &lt;strong&gt;Crime and Punishment&lt;/strong&gt; in one call.  To solve the problem, I split the text into chunks, labeled each chunk, and then reassembled the results.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pd.read_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cp.csv&amp;#39;&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;

&lt;span class="nv"&gt;df1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;:750&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;751&lt;/span&gt;:1500&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1501&lt;/span&gt;:2250&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2251&lt;/span&gt;:3000&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3001&lt;/span&gt;:3750&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="nv"&gt;df6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3751&lt;/span&gt;:&lt;span class="o"&gt;]&lt;/span&gt;

&lt;span class="nv"&gt;lb1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df1.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb1.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb1.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df2.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb2.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb2.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df3.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb3.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb3.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df4.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb4.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb4.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df5.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb5.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb5.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;lb6&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df6.apply&lt;span class="o"&gt;(&lt;/span&gt;lambda&lt;span class="w"&gt; &lt;/span&gt;X:&lt;span class="w"&gt; &lt;/span&gt;classifier.predict&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;X&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Crime&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;]])&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
lb6.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lb6.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;df1z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df1.merge&lt;span class="o"&gt;(&lt;/span&gt;lb1.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df2z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df2.merge&lt;span class="o"&gt;(&lt;/span&gt;lb2.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df3z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df3.merge&lt;span class="o"&gt;(&lt;/span&gt;lb3.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df4z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df4.merge&lt;span class="o"&gt;(&lt;/span&gt;lb4.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df5z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df5.merge&lt;span class="o"&gt;(&lt;/span&gt;lb5.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;df6z&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;df6.merge&lt;span class="o"&gt;(&lt;/span&gt;lb6.to_frame&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Result&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;cp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pd.concat&lt;span class="o"&gt;([&lt;/span&gt;df1z,&lt;span class="w"&gt; &lt;/span&gt;df2z,&lt;span class="w"&gt; &lt;/span&gt;df3z,&lt;span class="w"&gt; &lt;/span&gt;df4z,&lt;span class="w"&gt; &lt;/span&gt;df5z,&lt;span class="w"&gt; &lt;/span&gt;df6z&lt;span class="o"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
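&lt;p&gt;The copy-and-paste block above can be expressed more compactly with &lt;strong&gt;numpy.array_split&lt;/strong&gt; and a loop, which also avoids hand-written slice boundaries.  This is a sketch: the &lt;strong&gt;StubClassifier&lt;/strong&gt; and the three-row DataFrame are stand-ins for the trained BERT model and the real &lt;strong&gt;cp.csv&lt;/strong&gt; corpus.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Stand-in for the trained BERT classifier from the post; replace
# with the real model.  predict() takes a list of strings and
# returns one label per string.
class StubClassifier:
    def predict(self, texts):
        return [len(t) % 2 for t in texts]  # dummy per-sentence label

classifier = StubClassifier()

# Stand-in corpus; in the post this comes from pd.read_csv('cp.csv').
df = pd.DataFrame({"Crime": ["He stood up.", "I did it!", "Silence fell."]})

# Label the text in chunks so a single huge predict() call does not
# exhaust notebook memory, then reassemble in the original order.
labeled = []
for chunk in np.array_split(df, 2):
    chunk = chunk.copy()
    chunk["Result"] = chunk["Crime"].apply(lambda s: classifier.predict([s])[0])
    labeled.append(chunk)

cp = pd.concat(labeled)
```

&lt;p&gt;Because &lt;strong&gt;array_split&lt;/strong&gt; partitions the index without gaps, no rows fall between chunk boundaries.&lt;/p&gt;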

&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Dostoevsky, Fyodor. &lt;em&gt;Crime and Punishment&lt;/em&gt;. Bantam Books, 1996.&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="NLP"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="Literature"></category></entry><entry><title>Upgrade SageMaker JupyterLab Notebooks to the Latest Pandas</title><link href="https://john.soban.ski/sagemaker-upgrade-pandas.html" rel="alternate"></link><published>2022-10-29T10:29:00-04:00</published><updated>2022-10-29T10:29:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2022-10-29:/sagemaker-upgrade-pandas.html</id><summary type="html">&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon Web Services (AWS) SageMaker Notebook Instances&lt;/a&gt; provide fully managed Jupyter Notebooks, tailored for Data Science and Machine Learning (ML) use cases.&lt;/p&gt;
&lt;p&gt;These notebooks allow Data Scientists and ML Engineers to explore, operationalize and share data, algorithms and pipelines.&lt;/p&gt;
&lt;p&gt;Pandas contributes a critical piece to the Data Scientists' toolbox, via …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Amazon Web Services (AWS) SageMaker Notebook Instances&lt;/a&gt; provide fully managed Jupyter Notebooks, tailored for Data Science and Machine Learning (ML) use cases.&lt;/p&gt;
&lt;p&gt;These notebooks allow Data Scientists and ML Engineers to explore, operationalize and share data, algorithms and pipelines.&lt;/p&gt;
&lt;p&gt;Pandas contributes a critical piece to the Data Scientists' toolbox, via the Data Frame construct.  Each new version of Pandas provides improvements, upgrades and new conveniences.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Python Pandas" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/00_Pandas_Python.png"&gt;&lt;/p&gt;
&lt;p&gt;I run into an issue with my &lt;strong&gt;AWS SageMaker Notebook&lt;/strong&gt;, however, when I try to upgrade Pandas.&lt;/p&gt;
&lt;p&gt;If I attempt to upgrade &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; above version &lt;strong&gt;1.1.5&lt;/strong&gt; on my &lt;strong&gt;AWS SageMaker&lt;/strong&gt; provided &lt;a href="https://jupyter.org/"&gt;JupyterLab notebook&lt;/a&gt;, I receive the error &lt;strong&gt;No Matching Distribution Found&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;1.3.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;ERROR:&lt;span class="w"&gt; &lt;/span&gt;Could&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;find&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;version&lt;span class="w"&gt; &lt;/span&gt;that&lt;span class="w"&gt; &lt;/span&gt;satisfies&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;requirement&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.3.5&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;versions:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.3.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.4.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.5.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.6.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.7.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.8.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.8.1,&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.9.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.9.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.10.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.10.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.11.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.12.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.13.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.13.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.14.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.14.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.15.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.15.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.15.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.16.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.16.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.16.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.17.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.17.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.18.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.18.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.19.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.19.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.19.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.20.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.20.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="m"&gt;0&lt;/span&gt;.20.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.20.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.21.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.21.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.22.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.23.4,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.24.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.24.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.24.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.25.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.2,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.4,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.0.5,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.0,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.1,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.2,&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.3,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.4,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.1.5&lt;span class="o"&gt;)&lt;/span&gt;
ERROR:&lt;span class="w"&gt; &lt;/span&gt;No&lt;span class="w"&gt; &lt;/span&gt;matching&lt;span class="w"&gt; &lt;/span&gt;distribution&lt;span class="w"&gt; &lt;/span&gt;found&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.3.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The last line of the output summarizes the failure:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;ERROR: No matching distribution found for pandas==1.3.5&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;I created a Notebook instance from the AWS Console via &lt;strong&gt;AWS SageMaker -&amp;gt; Notebook instances -&amp;gt; Create Notebook instance&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I then selected the Kernel &lt;strong&gt;conda_Python3&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I use &lt;strong&gt;sys.executable&lt;/strong&gt; to show the Kernel's Python, Pip and Pandas versions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.6.13&lt;/span&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;21.3.1&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;PyPA&lt;/span&gt; &lt;span class="n"&gt;recommended&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;installing&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pypa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;distutils&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="nd"&gt;@python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MIT&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.1.5&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Powerful&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;structures&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pydata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BSD&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;autovizwidget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;awswrangler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hdijupyterutils&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;odo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smclarify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sparkmagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statsmodels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I cannot upgrade &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.1.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.7.3&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.8.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2017.2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2021.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.15.4&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.18.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;six&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.6&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.7.3&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.15.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Root Cause Analysis&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Pandas&lt;/strong&gt; dropped support for &lt;strong&gt;Python 3.6&lt;/strong&gt; after Pandas version &lt;strong&gt;1.1.5&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Earlier versions of &lt;strong&gt;AWS SageMaker JupyterLab Notebooks&lt;/strong&gt; delivered &lt;strong&gt;Python 3.6&lt;/strong&gt; Kernels by default, via their &lt;strong&gt;Platform identifier&lt;/strong&gt; configuration item.&lt;/p&gt;
&lt;p&gt;Upon launch of my notebook, I selected &lt;strong&gt;Amazon Linux 1&lt;/strong&gt; for my &lt;strong&gt;Platform identifier&lt;/strong&gt;, which limited my Notebook to the &lt;strong&gt;Python 3.6 Kernel&lt;/strong&gt; and in turn capped the version of &lt;strong&gt;Pandas&lt;/strong&gt; at &lt;strong&gt;1.1.5&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I verified that my Notebook runs the &lt;strong&gt;Amazon Linux 1&lt;/strong&gt; Operating System via the &lt;strong&gt;Amazon SageMaker --&amp;gt; Notebook Instances --&amp;gt; Notebook instance settings&lt;/strong&gt; Console page.&lt;/p&gt;
&lt;p&gt;&lt;img alt="AWS Sagemaker Instance Amazon Linux 1" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/01_Old_Version.png"&gt;&lt;/p&gt;
&lt;h3&gt;Solution&lt;/h3&gt;
&lt;p&gt;Through trial and error, I identified the solution to my problem.&lt;/p&gt;
&lt;p&gt;To install the most recent version of &lt;strong&gt;Pandas&lt;/strong&gt; into a &lt;strong&gt;SageMaker JupyterLab Notebook&lt;/strong&gt;, I must first install a newer version of &lt;strong&gt;Python&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To get a newer Python in my JupyterLab environment, I must do the following upon launch:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Select the Amazon Linux 2 Operating System&lt;/li&gt;
&lt;li&gt;Select JupyterLab Version 3.0+&lt;/li&gt;
&lt;/ol&gt;
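&lt;p&gt;If you script notebook creation instead of clicking through the Console, both choices collapse into one API field.  A minimal sketch of the launch request for the boto3 SageMaker client (the instance name and role ARN below are hypothetical placeholders):&lt;/p&gt;

```python
# PlatformIdentifier selects both the OS and the JupyterLab version:
# notebook-al2-v2 pairs Amazon Linux 2 with JupyterLab 3, while the
# EOL'ed notebook-al1-v1 pins the kernel to Python 3.6.
request = {
    "NotebookInstanceName": "pandas-upgrade-demo",              # hypothetical
    "InstanceType": "ml.t3.medium",
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    "PlatformIdentifier": "notebook-al2-v2",  # Amazon Linux 2, JupyterLab 3
}

# With AWS credentials configured, the call would read:
# import boto3
# boto3.client("sagemaker").create_notebook_instance(**request)
print(request["PlatformIdentifier"])
```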
&lt;h3&gt;Select the Amazon Linux 2 Operating System&lt;/h3&gt;
&lt;p&gt;AWS released &lt;a href="https://aws.amazon.com/amazon-linux-ami/"&gt;Amazon Linux&lt;/a&gt; in 2010 and then an improved &lt;a href="https://aws.amazon.com/about-aws/whats-new/2017/12/introducing-amazon-linux-2/"&gt;Amazon Linux 2&lt;/a&gt; in 2017.  &lt;/p&gt;
&lt;p&gt;AWS &lt;a href="https://aws.amazon.com/blogs/aws/update-on-amazon-linux-ami-end-of-life/"&gt;ended standard support (End of Life, or EOL)&lt;/a&gt; for the original Amazon Linux in late 2020.&lt;/p&gt;
&lt;p&gt;SageMaker notebooks, however, ran on the 2010 version of Amazon Linux until August 2021, when AWS provided the option to run SageMaker JupyterLab Notebooks on &lt;a href="https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-notebook-instance-now-supports-amazon-linux-2/"&gt;Amazon Linux 2&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recommend that you create SageMaker JupyterLab Notebooks with &lt;strong&gt;Amazon Linux 2 based notebook instances&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;These &lt;strong&gt;Amazon Linux 2 based notebook instances&lt;/strong&gt; support the &lt;strong&gt;Python 3.8&lt;/strong&gt; kernel, unlike the older versions of &lt;strong&gt;Amazon Linux (2010) based notebook instances&lt;/strong&gt;, which cap at &lt;strong&gt;Python 3.6&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The AWS developer guides catalog all the differences between &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-al2.html"&gt;Amazon Linux 2 and Amazon Linux (2010) notebook instances&lt;/a&gt;.&lt;/p&gt;
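&lt;p&gt;You can also confirm the operating system from inside a running notebook by reading &lt;strong&gt;/etc/os-release&lt;/strong&gt;.  A short sketch of the parse (the sample string mimics what an Amazon Linux 2 instance reports; swap it for the real file on a live notebook):&lt;/p&gt;

```python
def parse_os_release(text):
    """Parse /etc/os-release KEY=VALUE lines into a dict."""
    return {
        key: value.strip('"')
        for key, value in (
            line.split("=", 1) for line in text.splitlines() if "=" in line
        )
    }

# On a live notebook: text = open("/etc/os-release").read()
sample = 'NAME="Amazon Linux"\nVERSION="2"\nPRETTY_NAME="Amazon Linux 2"\n'
print(parse_os_release(sample)["PRETTY_NAME"])  # Amazon Linux 2
```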
&lt;h4&gt;Execution&lt;/h4&gt;
&lt;p&gt;Upon launch of your SageMaker JupyterLab Notebook Instance, navigate to the &lt;strong&gt;Platform identifier&lt;/strong&gt; option.&lt;/p&gt;
&lt;p&gt;The dropdown box provides three choices for &lt;strong&gt;Platform identifier&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pick Amazon Linux 2" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/02_Pick_Two.png"&gt;&lt;/p&gt;
&lt;p&gt;If you select an &lt;strong&gt;Amazon Linux 1&lt;/strong&gt; based notebook instance, the Console alerts you to the End of Life (EOL) support.&lt;/p&gt;
&lt;p&gt;&lt;img alt="End of Life Amazon Linux 1" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/03_One_Eol.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Amazon Linux 2, JupyterLab 3&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Select JupyterLab Version 3.0+&lt;/h3&gt;
&lt;p&gt;Amazon SageMaker notebooks provide the JupyterLab service.  JupyterLab features a web-based Integrated Development Environment (IDE) for Python code, data and models.&lt;/p&gt;
&lt;p&gt;Upon launch of your Notebook, AWS allows you to &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-jl.html"&gt;choose either JupyterLab Version 1 or JupyterLab Version 3&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recommend JupyterLab 3, which delivers a &lt;a href="https://search.brave.com/search?q=jupyter+lab+3.0+features"&gt;half dozen new features&lt;/a&gt;, which include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Graphical Debugger&lt;ul&gt;
&lt;li&gt;Desktop Integrated Development Environments (IDE) ship graphical debuggers.  JupyterLab 3.0 provides a visual debugger for your Notebook.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Outline View/ Table of Contents (TOC)&lt;ul&gt;
&lt;li&gt;Provides an Outline view for your Notebook.  Jump to different sections with the click of a mouse.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Wide Selection of Display Languages&lt;ul&gt;
&lt;li&gt;Install the &lt;a href="https://github.com/jupyterlab/language-packs/"&gt;language pack of your choice&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Improved Single-Document Mode&lt;ul&gt;
&lt;li&gt;Remove the clutter of all the extraneous tabs and widgets.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Easy Extension Install&lt;ul&gt;
&lt;li&gt;Install extensions without JupyterLab recompilation via Pip or Conda.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To enjoy the above features, select &lt;strong&gt;Amazon Linux 2, JupyterLab 3&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Amazon Linux 2, JupyterLab 3" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/04_Lab_Three.png"&gt;&lt;/p&gt;
&lt;h2&gt;Success&lt;/h2&gt;
&lt;p&gt;After I launch my new &lt;strong&gt;AWS SageMaker JupyterLab Notebook&lt;/strong&gt;, I select the &lt;strong&gt;conda_Python3&lt;/strong&gt; environment from the launcher.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select conda_Python3" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/05_Conda_Three.png"&gt;&lt;/p&gt;
&lt;p&gt;In my notebook I check for the Python version and the output reads &lt;strong&gt;3.8&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Good Sign!&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="mf"&gt;3.8.12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pip and Pandas read versions &lt;strong&gt;22.0.4&lt;/strong&gt; and &lt;strong&gt;1.3.4&lt;/strong&gt;, respectively.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;22.0.4&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;PyPA&lt;/span&gt; &lt;span class="n"&gt;recommended&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;installing&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pypa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;developers&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;distutils&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="nd"&gt;@python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MIT&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.3.4&lt;/span&gt;
&lt;span class="n"&gt;Summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Powerful&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;structures&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;
&lt;span class="n"&gt;Home&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pydata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;Pandas&lt;/span&gt; &lt;span class="n"&gt;Development&lt;/span&gt; &lt;span class="n"&gt;Team&lt;/span&gt;
&lt;span class="n"&gt;Author&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="nd"&gt;@python&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;
&lt;span class="n"&gt;License&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BSD&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Clause&lt;/span&gt;
&lt;span class="n"&gt;Location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;
&lt;span class="n"&gt;Requires&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;
&lt;span class="n"&gt;Required&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;autovizwidget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hdijupyterutils&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sagemaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smclarify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sparkmagic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;statsmodels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
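&lt;p&gt;If you script these checks rather than eyeball the &lt;strong&gt;pip show&lt;/strong&gt; output, you can parse the &lt;strong&gt;Key: Value&lt;/strong&gt; lines yourself.  A minimal sketch; the &lt;strong&gt;parse_pip_show&lt;/strong&gt; helper is my own illustration, not part of pip:&lt;/p&gt;

```python
def parse_pip_show(output: str) -> dict:
    """Parse `pip show` style `Key: Value` lines into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the FIRST colon only, so URL values survive intact
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = "Name: pandas\nVersion: 1.3.4\nHome-page: https://pandas.pydata.org"
info = parse_pip_show(sample)
print(info["Version"])  # 1.3.4
```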

&lt;p&gt;I use the notebook to upgrade &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;Looking&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pypi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;neuron&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amazonaws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.3.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Collecting&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
  &lt;span class="n"&gt;Downloading&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5.1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cp38&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cp38&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;manylinux_2_17_x86_64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manylinux2014_x86_64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;whl&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;12.2&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="err"&gt;━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━&lt;/span&gt; &lt;span class="mf"&gt;12.2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;12.2&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt; &lt;span class="mf"&gt;48.4&lt;/span&gt; &lt;span class="n"&gt;MB&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;eta&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0100&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.20.3&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.20.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.8.1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.8.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2020.1&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2021.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Requirement&lt;/span&gt; &lt;span class="n"&gt;already&lt;/span&gt; &lt;span class="n"&gt;satisfied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;six&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;home&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ec2&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anaconda3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;envs&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;lib&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;python3&lt;/span&gt;&lt;span class="mf"&gt;.8&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;site&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;python&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dateutil&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.8.1&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.16.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Installing&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt; &lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
  &lt;span class="n"&gt;Attempting&lt;/span&gt; &lt;span class="n"&gt;uninstall&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
    &lt;span class="n"&gt;Found&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="n"&gt;installation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="mf"&gt;1.3.4&lt;/span&gt;
    &lt;span class="n"&gt;Uninstalling&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.3.4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;Successfully&lt;/span&gt; &lt;span class="n"&gt;uninstalled&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.3.4&lt;/span&gt;
&lt;span class="n"&gt;Successfully&lt;/span&gt; &lt;span class="n"&gt;installed&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The output reads:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Successfully installed pandas-1.5.1&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Success!!!&lt;/p&gt;
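&lt;p&gt;If you automate this upgrade, you can decide whether to run &lt;strong&gt;pip install --upgrade&lt;/strong&gt; at all by comparing version tuples.  A minimal sketch; the &lt;strong&gt;needs_upgrade&lt;/strong&gt; helper name is my own invention:&lt;/p&gt;

```python
def needs_upgrade(installed: str, target: str) -> bool:
    """Return True when the installed version sorts before the target."""
    # Convert "1.3.4" -> (1, 3, 4) so the comparison is numeric, not lexical
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) < as_tuple(target)


print(needs_upgrade("1.3.4", "1.5.1"))  # True
print(needs_upgrade("1.5.1", "1.5.1"))  # False
```

&lt;p&gt;Note that this naive split breaks on pre-release tags such as &lt;strong&gt;1.5.0rc0&lt;/strong&gt;; the &lt;strong&gt;packaging&lt;/strong&gt; library's &lt;strong&gt;Version&lt;/strong&gt; class handles those cases.&lt;/p&gt;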
&lt;h3&gt;Create a Lifecycle Config&lt;/h3&gt;
&lt;p&gt;A SageMaker Lifecycle Configuration lets you upgrade Pandas automatically at launch.&lt;/p&gt;
&lt;p&gt;When you log into your Notebook, it will already present the most recent version of Pandas.&lt;/p&gt;
&lt;p&gt;In Amazon SageMaker, click &lt;strong&gt;Lifecycle configurations --&amp;gt; Notebook Instance --&amp;gt; Create Configuration&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create Configuration" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/06_Create_Config.png"&gt;&lt;/p&gt;
&lt;p&gt;I name my lifecycle config &lt;strong&gt;sobanski-update-pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Paste in Bash Script" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/07_Lifecycle_Config.png"&gt;&lt;/p&gt;
&lt;p&gt;Paste the following script under &lt;strong&gt;Start notebook&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-e

&lt;span class="c1"&gt;# OVERVIEW&lt;/span&gt;
&lt;span class="c1"&gt;# This script installs a single pip package in a single SageMaker conda environments.&lt;/span&gt;

sudo&lt;span class="w"&gt; &lt;/span&gt;-u&lt;span class="w"&gt; &lt;/span&gt;ec2-user&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&amp;lt;&amp;#39;EOF&amp;#39;&lt;/span&gt;
&lt;span class="s"&gt;# PARAMETERS&lt;/span&gt;
&lt;span class="s"&gt;PACKAGE=pandas&lt;/span&gt;
&lt;span class="s"&gt;ENVIRONMENT=python3&lt;/span&gt;
&lt;span class="s"&gt;source /home/ec2-user/anaconda3/bin/activate &amp;quot;$ENVIRONMENT&amp;quot;&lt;/span&gt;
&lt;span class="s"&gt;pip install --upgrade &amp;quot;$PACKAGE&amp;quot;&lt;/span&gt;
&lt;span class="s"&gt;source /home/ec2-user/anaconda3/bin/deactivate&lt;/span&gt;
&lt;span class="s"&gt;EOF&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The script upgrades &lt;strong&gt;Pandas&lt;/strong&gt; in the &lt;strong&gt;conda_Python3&lt;/strong&gt; environment.&lt;/p&gt;
&lt;p&gt;Under &lt;strong&gt;Amazon SageMaker --&amp;gt; Notebook instances --&amp;gt; Notebook instance settings&lt;/strong&gt; select &lt;strong&gt;Edit&lt;/strong&gt; and set &lt;strong&gt;Lifecycle configuration&lt;/strong&gt; to the name of your file.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select your config" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/08_Select_Config.png"&gt;&lt;/p&gt;
&lt;p&gt;When you launch the notebook, AWS will run the upgrade script.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;AWS SageMaker Notebook Instances&lt;/strong&gt; host and manage &lt;strong&gt;JupyterLab Notebooks&lt;/strong&gt;.  In this blog post, we discussed how to configure your &lt;strong&gt;Notebook Instance&lt;/strong&gt; to maximize the available features in Pandas and JupyterLab.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Python Pandas" src="https://john.soban.ski/images/Sagemaker_Upgrade_Pandas/09_Pet_Python.png"&gt;&lt;/p&gt;
&lt;p&gt;Note: I created the Panda/Python artwork with Jasper AI Art; see the workflow &lt;a href="https://john.soban.ski/jasper-art.html"&gt;here&lt;/a&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="AWS"></category><category term="Python"></category><category term="HOWTO"></category><category term="Machine Learning"></category></entry><entry><title>Jasper Artificial Intelligence (AI) for Marketing Pictures</title><link href="https://john.soban.ski/jasper-art.html" rel="alternate"></link><published>2022-09-24T03:21:00-04:00</published><updated>2022-09-24T03:21:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2022-09-24:/jasper-art.html</id><summary type="html">&lt;p&gt;I use the new Jasper Artificial Intelligence (AI) Art service to create the pictures in this blog post.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.jasper.ai/art"&gt;Jasper AI Art&lt;/a&gt; (non-affiliate link), for example, creates the following picture of the &lt;strong&gt;World's Largest Turnip&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The World's Largest Turnip" src="https://john.soban.ski/images/Jasper_Art/00_Big_Turnip.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper generates a &lt;strong&gt;before and after&lt;/strong&gt; picture of a fat cat.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Before and after picture of a fat cat" src="https://john.soban.ski/images/Jasper_Art/01_Fat_Cat.png"&gt;&lt;/p&gt;
&lt;p&gt;The AI service also produces …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I use the new Jasper Artificial Intelligence (AI) Art service to create the pictures in this blog post.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://www.jasper.ai/art"&gt;Jasper AI Art&lt;/a&gt; (non-affiliate link), for example, creates the following picture of the &lt;strong&gt;World's Largest Turnip&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The World's Largest Turnip" src="https://john.soban.ski/images/Jasper_Art/00_Big_Turnip.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper generates a &lt;strong&gt;before and after&lt;/strong&gt; picture of a fat cat.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Before and after picture of a fat cat" src="https://john.soban.ski/images/Jasper_Art/01_Fat_Cat.png"&gt;&lt;/p&gt;
&lt;p&gt;The AI service also produces a picture of ET in a phone booth.&lt;/p&gt;
&lt;p&gt;&lt;img alt="ET in a phone booth" src="https://john.soban.ski/images/Jasper_Art/02_Et_Phone.png"&gt;&lt;/p&gt;
&lt;p&gt;This month, I demonstrate the service and discuss its history, which begins with the founding of &lt;strong&gt;OpenAI&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;OpenAI&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://twitter.com/elonmusk"&gt;Elon Musk&lt;/a&gt; (Tesla), &lt;a href="https://en.wikipedia.org/wiki/Sam_Altman"&gt;Sam Altman&lt;/a&gt; (Y Combinator), &lt;a href="https://gregbrockman.com/"&gt;Greg Brockman&lt;/a&gt; (Stripe) and &lt;a href="https://www.cs.toronto.edu/~ilya/"&gt;Ilya Sutskever&lt;/a&gt; (Google) pledged over $1B to found &lt;a href="https://openai.com/"&gt;OpenAI&lt;/a&gt; in December, 2015.  &lt;/p&gt;
&lt;p&gt;OpenAI funds the development of &lt;strong&gt;friendly&lt;/strong&gt; Artificial General Intelligence (AGI) that drives the health, happiness and improvement of our human race.&lt;/p&gt;
&lt;p&gt;OpenAI develops interesting generative models, including one for text (&lt;a href="https://en.wikipedia.org/wiki/GPT-3"&gt;GPT-3&lt;/a&gt;) and one for images (&lt;a href="https://en.wikipedia.org/wiki/DALL-E"&gt;DALL-E 2&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;The DALL-E 2 model takes descriptive &lt;a href="https://en.wikipedia.org/wiki/Natural_language"&gt;natural language&lt;/a&gt; text as input, renders an image of the description, and outputs that image to the screen.&lt;/p&gt;
&lt;p&gt;OpenAI invites the &lt;a href="https://labs.openai.com/auth/signup"&gt;public to use DALL-E 2&lt;/a&gt;, which sometimes yields &lt;a href="https://twitter.com/weirddalle"&gt;hilarious results&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can try the API for free via &lt;a href="https://huggingface.co/spaces/dalle-mini/dalle-mini"&gt;Huggingface&lt;/a&gt; or &lt;a href="https://www.craiyon.com/"&gt;Craiyon&lt;/a&gt;, but (as of July 2022) these services take minutes to produce images.  Jasper AI Art reduces the &lt;strong&gt;wall clock&lt;/strong&gt; time to seconds.&lt;/p&gt;
&lt;h2&gt;Jasper AI Art&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://techcrunch.com/2022/10/18/ai-content-platform-jasper-raises-125m-at-a-1-7b-valuation/"&gt;Dave Rogenmoser&lt;/a&gt;, J.P. Morgan and Chris Hull formed Jasper AI in early 2021. Since then, Jasper AI provides the GPT-3 model for &lt;a href="https://john.soban.ski/jasper-ai.html"&gt;copywriting and Search Engine Optimization (SEO)&lt;/a&gt; use cases.&lt;/p&gt;
&lt;p&gt;This month, Jasper opens its Jasper Art service to customers.&lt;/p&gt;
&lt;p&gt;Jasper Art provides an easy-to-use interface to DALL-E 2, and accelerates the image generation process to seconds.&lt;/p&gt;
&lt;p&gt;I use the Jasper Art service to create the following pictures.&lt;/p&gt;
&lt;p&gt;Fight Club Simpsons.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fight Club Simpsons." src="https://john.soban.ski/images/Jasper_Art/03_Fight_Club.png"&gt;&lt;/p&gt;
&lt;p&gt;A delicious grilled spider.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A delicious grilled spider." src="https://john.soban.ski/images/Jasper_Art/04_Grilled_Spider.png"&gt;&lt;/p&gt;
&lt;p&gt;A hand with fifteen fingers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A hand with fifteen fingers." src="https://john.soban.ski/images/Jasper_Art/05_Fifteen_Fingers.png"&gt;&lt;/p&gt;
&lt;p&gt;An Isometric Pixel Art rendition of the Democratic People's Republic of Korea (DPRK).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Isometric Pixel Art rendition of the Democratic People's Republic of Korea (DPRK)" src="https://john.soban.ski/images/Jasper_Art/06_Iso_Dprk.png"&gt;&lt;/p&gt;
&lt;p&gt;Too much baby powder!!!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Too much baby powder!!!" src="https://john.soban.ski/images/Jasper_Art/07_Too_Much.png"&gt;&lt;/p&gt;
&lt;h2&gt;Sign Up&lt;/h2&gt;
&lt;p&gt;Jasper sent me an invitation to try their new &lt;strong&gt;Art&lt;/strong&gt; service.  They boast that &lt;strong&gt;artificial intelligence now can create any image or art you can imagine in seconds&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Email" src="https://john.soban.ski/images/Jasper_Art/09_Jasper_Email.png"&gt;&lt;/p&gt;
&lt;p&gt;Clicking through the links takes me to a payment page.  Unlimited access to the Art service costs $20/month.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sign Up" src="https://john.soban.ski/images/Jasper_Art/10_Sign_Up.png"&gt;&lt;/p&gt;
&lt;h2&gt;The User Interface&lt;/h2&gt;
&lt;p&gt;Jasper provides a User Interface (UI) to generate art.  You enter a text description, then use the drop-down selector boxes to set the &lt;strong&gt;Style, Medium, Artist&lt;/strong&gt; and &lt;strong&gt;Mood&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Blank Jasper Art UI" src="https://john.soban.ski/images/Jasper_Art/11_Jasper_Ui.png"&gt;&lt;/p&gt;
&lt;p&gt;I decide to enter the following input:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text Description&lt;ul&gt;
&lt;li&gt;John Travolta from Saturday Night Fever in a Magnavox Odyssey 2 game&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Style&lt;ul&gt;
&lt;li&gt;Pixel art&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Medium&lt;ul&gt;
&lt;li&gt;None Specified&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Artist&lt;ul&gt;
&lt;li&gt;Andy Warhol&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mood&lt;ul&gt;
&lt;li&gt;Frightening&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Travolta Input" src="https://john.soban.ski/images/Jasper_Art/12_Travolta_Text.png"&gt;&lt;/p&gt;
&lt;p&gt;Within seconds, Jasper renders the following four images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="John Travolta from Saturday Night Fever in a Magnavox Odyssey 2 game" src="https://john.soban.ski/images/Jasper_Art/13_Magnovox_Travolta.png"&gt;&lt;/p&gt;
&lt;p&gt;I like this one the best, although it looks more like a Commodore 64 game.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Andy Warhol Travolta" src="https://john.soban.ski/images/Jasper_Art/14_Travolta_Warhol.png"&gt;&lt;/p&gt;
&lt;p&gt;I change the Style to &lt;strong&gt;3d Render&lt;/strong&gt; and Mood to &lt;strong&gt;Aggressive&lt;/strong&gt;.  I also add &lt;strong&gt;The Situation&lt;/strong&gt; from &lt;strong&gt;Jersey Shore&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text Description&lt;ul&gt;
&lt;li&gt;John Travolta from Saturday Night Fever hanging out with the Situation from Jersey Shore&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Style&lt;ul&gt;
&lt;li&gt;3d Render&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Medium&lt;ul&gt;
&lt;li&gt;None Specified&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Artist&lt;ul&gt;
&lt;li&gt;None Specified&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mood&lt;ul&gt;
&lt;li&gt;Aggressive&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Four pics of John Travolta from Saturday Night Fever hanging out with the Situation from Jersey Shore" src="https://john.soban.ski/images/Jasper_Art/15_Travolta_Render.png"&gt;&lt;/p&gt;
&lt;p&gt;The result looks straight out of a PlayStation 2 game.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Travolta and Sitch" src="https://john.soban.ski/images/Jasper_Art/16_Travolta_Situation.png"&gt;&lt;/p&gt;
&lt;p&gt;Changing the style back to &lt;strong&gt;Pixel Art&lt;/strong&gt; produces new art.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Travolta and Sitch Pixels" src="https://john.soban.ski/images/Jasper_Art/17_Travolta_Pixel.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper nailed the &lt;strong&gt;Low Cut Vee&lt;/strong&gt; shirt!&lt;/p&gt;
&lt;p&gt;&lt;img alt="PlayStation presents the Jersey Shore" src="https://john.soban.ski/images/Jasper_Art/18_The_Situation.png"&gt;&lt;/p&gt;
&lt;p&gt;I remember a joke about a Vulture with Carry On (Carrion) and input it into Jasper.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text Description&lt;ul&gt;
&lt;li&gt;Vulture with Carrion in Airplane cabin&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Style&lt;ul&gt;
&lt;li&gt;Cartoon&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Medium&lt;ul&gt;
&lt;li&gt;Spray Paint&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Artist&lt;ul&gt;
&lt;li&gt;Dr. Seuss&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Mood&lt;ul&gt;
&lt;li&gt;Happy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Dr. Suess Vulture Carrion Carry On" src="https://john.soban.ski/images/Jasper_Art/19_Vulture_Cartoon.png"&gt;&lt;/p&gt;
&lt;p&gt;I love the trippy Dr. Suess result, although I don't see a Vulture nor Carrion.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dr. Suess Trippy" src="https://john.soban.ski/images/Jasper_Art/20_Dr_Suess.png"&gt;&lt;/p&gt;
&lt;p&gt;I re-run with the same inputs.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Attempt Two" src="https://john.soban.ski/images/Jasper_Art/21_Vulture_Spray.png"&gt;&lt;/p&gt;
&lt;p&gt;While Jasper missed the boat, I do like the look of this picture.  It reminds me of a Sci-Fi pulp magazine from the early 1930s.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Vulture, Painting" src="https://john.soban.ski/images/Jasper_Art/22_Vulture_Spraypaint.png"&gt;&lt;/p&gt;
&lt;p&gt;I run the input once more and add &lt;strong&gt;Art Deco&lt;/strong&gt; to the description.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Art Deco Vulture" src="https://john.soban.ski/images/Jasper_Art/23_Vulture_Deco.png"&gt;&lt;/p&gt;
&lt;p&gt;We get a vulture in the airplane cabin, but no Carrion.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sans Carry On" src="https://john.soban.ski/images/Jasper_Art/24_Vulture_Artdeco.png"&gt;&lt;/p&gt;
&lt;h2&gt;A Toddler gets Silly&lt;/h2&gt;
&lt;p&gt;At this point my toddler walks over and asks me to make &lt;strong&gt;planet earth made out of poop emoji&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="planet earth made out of poop emoji" src="https://john.soban.ski/images/Jasper_Art/25_Poop_Ui.png"&gt;&lt;/p&gt;
&lt;p&gt;He laughed at &lt;strong&gt;Poop World&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="planet earth made out of poop emoji - poop world" src="https://john.soban.ski/images/Jasper_Art/26_Poop_World.png"&gt;&lt;/p&gt;
&lt;p&gt;Next up - &lt;strong&gt;Poopy pants&lt;/strong&gt; (I did mention I have a toddler).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Jasper creates poopy pants" src="https://john.soban.ski/images/Jasper_Art/27_Poopy_Pants.png"&gt;&lt;/p&gt;
&lt;p&gt;This image looks like a Fark &lt;strong&gt;Photoshop Friday&lt;/strong&gt; from 2003.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Poopy Pants DALLE-2" src="https://john.soban.ski/images/Jasper_Art/28_Poop_Pants.png"&gt;&lt;/p&gt;
&lt;p&gt;He then requests more silly pictures. &lt;/p&gt;
&lt;p&gt;First up, Boxing a Cow.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boxing a Cow" src="https://john.soban.ski/images/Jasper_Art/29_Cow_Boxing.png"&gt;&lt;/p&gt;
&lt;p&gt;Skateboarding cow.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Skateboarding cow" src="https://john.soban.ski/images/Jasper_Art/30_Cow_Skateboard.png"&gt;&lt;/p&gt;
&lt;p&gt;Dog rides a Pony.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Dog rides a Pony" src="https://john.soban.ski/images/Jasper_Art/32_Dog_Unicorn.png"&gt;&lt;/p&gt;
&lt;p&gt;Fart Propelled Bunny.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fart Propelled Bunny" src="https://john.soban.ski/images/Jasper_Art/33_Fart_Bunny.png"&gt;&lt;/p&gt;
&lt;p&gt;Pumpkin Skeleton.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pumpkin Skeleton." src="https://john.soban.ski/images/Jasper_Art/34_Pumpkin_Skeleton.png"&gt;&lt;/p&gt;
&lt;h2&gt;I get Silly&lt;/h2&gt;
&lt;p&gt;The silliness inspires me.  I think of some more ideas.&lt;/p&gt;
&lt;p&gt;Propaganda Wario.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Propaganda Wario" src="https://john.soban.ski/images/Jasper_Art/35_Propoganda_Wario.png"&gt;&lt;/p&gt;
&lt;p&gt;A Certified Guaranty Company (CGC) graded copy of a Johnny Cage comic book, where he wears Hammer pants.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A Certified Guaranty Company (CGC) graded copy of a Johnny Cage comic book, where he wears Hammer pants" src="https://john.soban.ski/images/Jasper_Art/36_Cgc_Hammertime.png"&gt;&lt;/p&gt;
&lt;p&gt;Konami Nintendo Entertainment System (NES) box art for &lt;strong&gt;Atlas Shrugged&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Konami Nintendo Entertainment System (NES) box art for **Atlas Shrugged**" src="https://john.soban.ski/images/Jasper_Art/37_Konamiboxart_Atlasshrugged.png"&gt;&lt;/p&gt;
&lt;p&gt;Nu-Metal President's Presidential Rally.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nu-Metal President's Presidential Rally" src="https://john.soban.ski/images/Jasper_Art/38_Numetal_Presidentialrally.png"&gt;&lt;/p&gt;
&lt;p&gt;Mad Magazine presents Emo.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mad Magazine presents Emo" src="https://john.soban.ski/images/Jasper_Art/39_Madmagazine_Emo.png"&gt;&lt;/p&gt;
&lt;p&gt;NSYNC + Nirvana.  Nirvanasync, perhaps?&lt;/p&gt;
&lt;p&gt;&lt;img alt="NSYNC + Nirvana" src="https://john.soban.ski/images/Jasper_Art/40_Nsync_Nirvana.png"&gt;&lt;/p&gt;
&lt;p&gt;Selfie at the Smelter factory.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Selfie at the Smelter factory." src="https://john.soban.ski/images/Jasper_Art/41_Smelter_Selfie.png"&gt;&lt;/p&gt;
&lt;p&gt;Finally, I just type &lt;strong&gt;Ai Yiii Yiii Yiii&lt;/strong&gt; into the box.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ai Yiii Yiii Yiii" src="https://john.soban.ski/images/Jasper_Art/42_Ayiii_Yiiyiii.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I will use the Jasper Art Service to illustrate future blog posts.  I have writeups on Gerard Manley Hopkins and Yukio Mishima in the works that need interesting pictures.&lt;/p&gt;
&lt;p&gt;I recommend the service for any small (or big) business that needs to create compelling pictures beyond stock art.  The service provides an inexpensive and timely alternative to hiring graphic artists.&lt;/p&gt;
&lt;h2&gt;Paw Patrol&lt;/h2&gt;
&lt;p&gt;Before you go, my Toddler re-appears and demands the following Paw Patrol related pieces.  I leave them here for posterity.  He will get a kick out of them twenty years from now or so.&lt;/p&gt;
&lt;p&gt;Cat Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Cat Paw Patrol." src="https://john.soban.ski/images/Jasper_Art/43_Cat_Pawpatrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Super Mario Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Super Mario Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/44_Mario_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Sonic The Hedgehog Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Sonic The Hedgehog Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/45_Sonic_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;The Incredible Hulk Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Incredible Hulk Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/46_Hulk_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Titanic Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Titanic Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/47_Titanic_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Pikachu Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pikachu Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/48_Pikachu_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Power Rangers Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Power Rangers Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/49_Power_Rangers.png"&gt;&lt;/p&gt;
&lt;p&gt;More Power Rangers Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="More Power Rangers Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/50_Powerranger_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Transformers Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Transformers Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/51_Transformers_Patrol.png"&gt;&lt;/p&gt;
&lt;p&gt;Nick Jonas Paw Patrol.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Nick Jonas Paw Patrol" src="https://john.soban.ski/images/Jasper_Art/52_Jonas_Patrol.png"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="Jasper"></category><category term="Machine Learning"></category></entry><entry><title>Jasper Artificial Intelligence (AI) Writes Your Blog Posts</title><link href="https://john.soban.ski/jasper-ai.html" rel="alternate"></link><published>2022-03-26T01:23:00-04:00</published><updated>2022-03-26T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2022-03-26:/jasper-ai.html</id><summary type="html">&lt;p&gt;Machine Learning (ML) Engineers at &lt;a href="https://www.jasper.ai/"&gt;Jasper Artificial Intelligence (AI)&lt;/a&gt; (non-affiliate link) trained a Natural Language Processing (NLP) model on 10% of the written content on the Internet to create a service that writes text on command.  A user feeds the model a sentence or two requesting a topic and the …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Machine Learning (ML) Engineers at &lt;a href="https://www.jasper.ai/"&gt;Jasper Artificial Intelligence (AI)&lt;/a&gt; (non-affiliate link) trained a Natural Language Processing (NLP) model on 10% of the written content on the Internet to create a service that writes text on command.  A user feeds the model a sentence or two requesting a topic and the service spits out paragraphs.  The end user tunes the service, approving or disapproving each new paragraph and guides the writing through the addition of keywords.&lt;/p&gt;
&lt;p&gt;I had Jasper write the introduction paragraph to this blog post.  I fed it only the title (above) of this blog post and the service produced three candidate introduction paragraphs.  The &lt;strong&gt;best&lt;/strong&gt; one follows:  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Would you like to have a computer write your blog posts for you? Jasper AI can do that! Just input your topic and Jasper will take care of the writing for you. No more worrying about whether you can write an interesting or engaging post - let Jasper take care of that for you. So why not give it a try? You may be surprised at just how well Jasper can write!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jasper also integrates with the machine learning driven &lt;a href="https://surferseo.com/"&gt;Surfer AI&lt;/a&gt; (non-affiliate link) service, which provides guidance to the Jasper user on how to maximize Search Engine performance.  In this blog post, I will demonstrate how to quickly use Jasper and Surfer together to write an effective, &lt;a href="https://en.wikipedia.org/wiki/Search_engine_optimization"&gt;Search Engine Optimized (SEO)&lt;/a&gt; blog post.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Jasper uses a &lt;a href="https://en.wikipedia.org/wiki/Software_as_a_service"&gt;Software as a Service (SaaS)&lt;/a&gt; approach to get its AI into the hands of end users.  You just need to sign into the WebSite and put in your credit card information.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Splash Screen" src="https://john.soban.ski/images/Jasper_Ai/01_Splash_Screen.png"&gt;&lt;/p&gt;
&lt;p&gt;Similar to most &lt;a href="https://en.wikipedia.org/wiki/Web_2.0"&gt;Web 2.0&lt;/a&gt; applications, you need to register a username and password with their proprietary database, or connect to the centralized, authoritarian Google identity service.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Google Login" src="https://john.soban.ski/images/Jasper_Ai/02_Google_Login.png"&gt;&lt;/p&gt;
&lt;p&gt;Once you log in, you enter a Business Name and domain.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Step Two" src="https://john.soban.ski/images/Jasper_Ai/03_Step_Two.png"&gt;&lt;/p&gt;
&lt;p&gt;I misunderstood the &lt;strong&gt;domain&lt;/strong&gt; field.  The Wizard expects a web address, so I enter my address in the text field.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enter Domain" src="https://john.soban.ski/images/Jasper_Ai/04_Enter_Domain.png"&gt;&lt;/p&gt;
&lt;p&gt;Select the use case for Jasper.  Since I plan to write a blog, I click &lt;strong&gt;Blog posts&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Blog" src="https://john.soban.ski/images/Jasper_Ai/05_Select_Blog.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper then expects payment.  They offer &lt;strong&gt;starter&lt;/strong&gt; and &lt;strong&gt;basic&lt;/strong&gt; packages, for $29/month and $59/month respectively.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Choose Plan" src="https://john.soban.ski/images/Jasper_Ai/06_Choose_Plan.png"&gt;&lt;/p&gt;
&lt;p&gt;I decide to try the cheaper option for my Proof of Concept.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pay Money" src="https://john.soban.ski/images/Jasper_Ai/07_Pay_Money.png"&gt;&lt;/p&gt;
&lt;h2&gt;Use a Template&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;starter&lt;/strong&gt; plan allows use of the Jasper templates.  &lt;/p&gt;
&lt;p&gt;I select the &lt;strong&gt;Blog Post Intro Paragraph&lt;/strong&gt; template.  I want Jasper to write a blog post on &lt;a href="https://en.wikipedia.org/wiki/Decentralized_finance"&gt;Decentralized Finance&lt;/a&gt;, so I feed it the title &lt;strong&gt;Decentralized Finance Drives Freedom&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Intro Paragraph" src="https://john.soban.ski/images/Jasper_Ai/08_Intro_Paragraph.png"&gt;&lt;/p&gt;
&lt;p&gt;I click &lt;strong&gt;Generate&lt;/strong&gt; and Jasper produces the following text:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It's no secret that the world is becoming increasingly digitized. In fact, DeFi may be one of the most important aspects of this digital revolution. DeFi allows for a new level of freedom and autonomy that has never been seen before. With DeFi, users are in control of their own finances, and can transact without the need for permission from centralized authorities. This is a huge step forward in terms of financial freedom and censorship resistance. As DeFi continues to grow in popularity, it will become an increasingly important part of our global economy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Jasper produces a light, readable intro paragraph.  &lt;/p&gt;
&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/E-Prime"&gt;E-Prime&lt;/a&gt; zealot inside me gripes at the overuse of &lt;strong&gt;IS&lt;/strong&gt;, an indicator (to me at least) of lazy, amateur writing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;It’s&lt;/strong&gt; no secret&lt;/li&gt;
&lt;li&gt;may &lt;strong&gt;be&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;is&lt;/strong&gt; becoming&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;are&lt;/strong&gt; in control&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;is&lt;/strong&gt; a huge step&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also barf at the &lt;a href="https://dilbertblog.typepad.com/the_dilbert_blog/2007/06/the_day_you_bec.html"&gt;passive voice&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;freedom and autonomy that has never been seen before&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don’t blame Jasper too much, since I assume most of the Internet includes sloppy text, and Jasper trained the NLP model on 10% of the Internet.&lt;/p&gt;
&lt;h2&gt;Recipes&lt;/h2&gt;
&lt;p&gt;Jasper provides &lt;strong&gt;recipes&lt;/strong&gt; to help content creators write more than an intro (or concluding) paragraph.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Try Recipe" src="https://john.soban.ski/images/Jasper_Ai/09_Try_Recipie.png"&gt;&lt;/p&gt;
&lt;p&gt;When I click the &lt;strong&gt;Recipe&lt;/strong&gt; button, Jasper sends an alert that I must upgrade to the $59 &lt;strong&gt;Basic&lt;/strong&gt; plan.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Need Upgrade" src="https://john.soban.ski/images/Jasper_Ai/10_Need_Upgrade.png"&gt;&lt;/p&gt;
&lt;p&gt;I know that the &lt;strong&gt;Documents&lt;/strong&gt; service will help me write a blog post.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.jasper.ai/templates"&gt;Jasper website&lt;/a&gt; reads&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our 50+ templates are the building blocks of Jasper. They are trained to perform very specific use cases and are very simple to understand and use. &lt;/p&gt;
&lt;p&gt;Where templates are more constrained, documents allow open-ended content creation in a familiar Google Doc-like experience. Documents are available on the Boss Mode Plan.&lt;/p&gt;
&lt;p&gt;Documents should be used for writing longer content like blog posts, emails, social posts or books.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I click &lt;strong&gt;Documents&lt;/strong&gt; and once more read a Call To Action (CTA) that I &lt;strong&gt;must&lt;/strong&gt; Upgrade to the $59 plan to write a document.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Try Document" src="https://john.soban.ski/images/Jasper_Ai/11_Try_Document.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper provides me with the &lt;strong&gt;Boss Mode&lt;/strong&gt; plan for an additional $28.09, since I already paid $29 for the &lt;strong&gt;Starter&lt;/strong&gt; plan.  I take one for the team and pay the money to upgrade.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Need Boss" src="https://john.soban.ski/images/Jasper_Ai/12_Need_Boss.png"&gt;&lt;/p&gt;
&lt;p&gt;Since I already have my Credit Card out, I also pay $59 for &lt;a href="https://surferseo.com/"&gt;Surfer SEO&lt;/a&gt;, a tool I will discuss later in this blog post.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pay Boss" src="https://john.soban.ski/images/Jasper_Ai/13_Pay_Boss.png"&gt;&lt;/p&gt;
&lt;h2&gt;Create a Document&lt;/h2&gt;
&lt;p&gt;Now that I paid the big bucks, Jasper rolls out the red carpet for me.&lt;/p&gt;
&lt;p&gt;It provides a wizard to hold my hand through the blog writing process.&lt;/p&gt;
&lt;p&gt;I tell Jasper to &lt;strong&gt;write a thought leadership piece that discusses how decentralized finance (DeFi) increases freedom&lt;/strong&gt;.  I then enter the keywords &lt;strong&gt;DeFi, Blockchain and Censorship&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="New Blog" src="https://john.soban.ski/images/Jasper_Ai/14_New_Blog.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper provides a half dozen or so candidate titles.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How Decentralized Finance (DeFi) Increases Freedom&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  How DeFi Increases Freedom&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  Increasing Freedom for All&lt;/li&gt;
&lt;li&gt;Decentralized Finance (DeFi) Increases Freedom:  Why Blockchain-Based Finance Will Save the World&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  The Future of Financial Freedom&lt;/li&gt;
&lt;li&gt;Decentralized Finance:  The Future of Freedom&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Automatic Titles" src="https://john.soban.ski/images/Jasper_Ai/15_Automatic_Titles.png"&gt;&lt;/p&gt;
&lt;p&gt;Once I select a title, Jasper spits out the intro paragraph.&lt;/p&gt;
&lt;h2&gt;Integrate Search Engine Optimization&lt;/h2&gt;
&lt;p&gt;The target audience for Jasper writes blog posts for traffic.  Search Engine Optimization (SEO) tools maximize the return on (writing) investment in terms of increased traffic.  Surfer SEO provides a SaaS and integrates with Jasper AI.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create Surfer" src="https://john.soban.ski/images/Jasper_Ai/16_Create_Surfer.png"&gt;&lt;/p&gt;
&lt;p&gt;I purchase the $59/month Surfer SEO subscription and then refresh my Jasper page.  This connects the Surfer SEO service to Jasper.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Click SEO" src="https://john.soban.ski/images/Jasper_Ai/17_Click_SEO.png"&gt;&lt;/p&gt;
&lt;p&gt;After I enter “DeFi” into the target search term box, Surfer lists a handful of related, popular search questions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What exactly is DeFi?&lt;/li&gt;
&lt;li&gt;What is a DeFi in crypto?&lt;/li&gt;
&lt;li&gt;Is DeFi a good investment?&lt;/li&gt;
&lt;li&gt;What is the difference between DeFi and crypto?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Good Questions" src="https://john.soban.ski/images/Jasper_Ai/18_Good_Questions.png"&gt;&lt;/p&gt;
&lt;p&gt;I paste these questions directly into the Jasper document box.  I precede them with double pound signs (##), which tell Jasper to start a new heading.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Type Questions" src="https://john.soban.ski/images/Jasper_Ai/19_Type_Questions.png"&gt;&lt;/p&gt;
&lt;p&gt;From there, I click &lt;strong&gt;Compose&lt;/strong&gt;, which generates four sentences per click.&lt;/p&gt;
&lt;p&gt;For example, when I place the cursor under &lt;strong&gt;What exactly is DeFi?&lt;/strong&gt;, type the seed words &lt;strong&gt;Decentralized Finance (DeFi)&lt;/strong&gt; and hit &lt;strong&gt;Compose&lt;/strong&gt;, Jasper produces the following text:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Decentralized Finance (DeFi)  is a financial system that runs on decentralized infrastructure, such as the Ethereum blockchain. Peer to Peer DeFi protocols are built on top of Ethereum and allow users to interact with each other and create financial products without the need for a central authority.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I then have Jasper &lt;strong&gt;compose&lt;/strong&gt; paragraphs for the remaining paragraph headers.&lt;/p&gt;
&lt;h2&gt;Tune SEO Effectiveness&lt;/h2&gt;
&lt;p&gt;Surfer provides a &lt;strong&gt;word quota&lt;/strong&gt;, with recommended words and phrases that feed the hungry search engine algorithms.  The service colors each Phrase: Green indicates &lt;strong&gt;acceptable&lt;/strong&gt;, Yellow indicates &lt;strong&gt;needs work&lt;/strong&gt; and Red indicates &lt;strong&gt;absent&lt;/strong&gt;. &lt;/p&gt;
&lt;p&gt;&lt;img alt="Word Quota" src="https://john.soban.ski/images/Jasper_Ai/20_Word_Quota.png"&gt;&lt;/p&gt;
&lt;p&gt;I begin to grind through the word quota suggestions, and pepper the phrases into my Jasper document.  I replace an instance of &lt;strong&gt;DeFi&lt;/strong&gt;, for example, with &lt;strong&gt;DeFi Applications&lt;/strong&gt; and that phrase turns green.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Getting Better" src="https://john.soban.ski/images/Jasper_Ai/21_Getting_Better.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper provides a gauge.  In addition to the word quota, I also must increase the word count to 1.7k, the paragraphs to 16, the headings to at least 18 and Images to at least three.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Status Gague" src="https://john.soban.ski/images/Jasper_Ai/22_Status_Gague.png"&gt;&lt;/p&gt;
&lt;p&gt;Checking off the suggested phrases in the word quota screen gets my document to a score of 80/100.
&lt;img alt="Eighty Percent" src="https://john.soban.ski/images/Jasper_Ai/23_Eighty_Percent.png"&gt;&lt;/p&gt;
&lt;p&gt;You can see that I peppered in a lot of the phrases, turning them green.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lotsa Green" src="https://john.soban.ski/images/Jasper_Ai/24_Lotsa_Green.png"&gt;&lt;/p&gt;
&lt;h2&gt;Write Conclusion&lt;/h2&gt;
&lt;p&gt;Jasper encourages you to update the “Content” box on the left side of the pane, to nudge the service a bit.  I type &lt;strong&gt;Write a concluding paragraph&lt;/strong&gt; into the content box, enter seed words into the document and then hit &lt;strong&gt;Compose&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Write Conclusion" src="https://john.soban.ski/images/Jasper_Ai/25_Write_Conclusion.png"&gt;&lt;/p&gt;
&lt;p&gt;Jasper then writes a conclusion for me.&lt;/p&gt;
&lt;h2&gt;The Document&lt;/h2&gt;
&lt;p&gt;I now present the AI-written document:&lt;/p&gt;
&lt;h3&gt;Decentralized Finance: The Future of Freedom&lt;/h3&gt;
&lt;p&gt;It's no secret that the world is becoming increasingly digitized. In fact, Decentralized Finance (DeFi) may be one of the most important aspects of this digital revolution. DeFi allows for a new level of freedom, liberty and autonomy that has never been seen before. With Decentralized Finance, users are in control of their own finances, and can execute financial transactions without the need for permission from centralized authorities or a centralized exchange. This is a huge step forward in terms of financial freedom and censorship resistance. As Decentralized Finance continues to grow in popularity, it will become an increasingly important part of our global economy.&lt;/p&gt;
&lt;h4&gt;What exactly is DeFi?&lt;/h4&gt;
&lt;p&gt;Decentralized Finance (DeFi)  is a financial system that runs on decentralized infrastructure, such as the Ethereum blockchain. Peer to Peer DeFi protocols are built on top of Ethereum and allow users to interact with each other and create financial products without the need for a central authority. This means that DeFi is censorship-resistant and cannot be shut down by any single entity.  DeFi work includes the ability to use open source software to create a loan, set interest rates, earn interest or manage digital assets.&lt;/p&gt;
&lt;p&gt;Defi applications can be used for a variety of purposes, such as lending, borrowing, trading,  investing and flash loans. By using DeFi applications, users can have full control over the digital money in their bank account. DeFi also has the potential to provide a more level playing field for financial services, as it is accessible to anyone with an Internet connection.&lt;/p&gt;
&lt;p&gt;The DeFi ecosystem is still in its early stages, but it is growing rapidly. In 2019, the total value locked in DeFi applications was about $675 million. As of September 2020, that number has grown to over $13 billion. This rapid growth is indicative of the strong demand for decentralized applications and financial products.&lt;/p&gt;
&lt;h4&gt;What Contributed to the growth of Decentralized Finance?&lt;/h4&gt;
&lt;p&gt;There are a few factors that drive the growth of Decentralized Finance applications.&lt;/p&gt;
&lt;p&gt;First, the DeFi ecosystem is built on Ethereum, which is the most popular smart contract platform. This gives Decentralized Finance applications a large potential user base, as there are over 50 million Ethereum users.&lt;/p&gt;
&lt;p&gt;Second, DeFi applications offer a variety of features and benefits that appeal to users. For example, Decentralized Finance applications are often faster and more secure than traditional financial institutions. DeFi also offers users more control over their finances, as they are not reliant on centralized authorities.&lt;/p&gt;
&lt;p&gt;Third, DeFi is becoming increasingly popular due to its censorship-resistant nature. In a traditional financial system, central authorities can censor or shut down applications at will. However, DeFi applications are built on decentralized infrastructure, which makes them much more resistant to censorship.&lt;/p&gt;
&lt;h4&gt;What is a DeFi in crypto?&lt;/h4&gt;
&lt;p&gt;DeFi in crypto is decentralized and open source, which allows for more  freedom and transparency. DeFi in crypto is built on trustless smart contracts, which eliminates the need for central authorities. This makes DeFi in crypto censorship-resistant and immune to shutdowns.&lt;/p&gt;
&lt;p&gt;DeFi in crypto has the potential to provide a more level playing field for financial services, as it is accessible to anyone with an Internet connection. DeFi in crypto also has the potential to reduce costs and increase efficiency by eliminating the need for intermediaries.&lt;/p&gt;
&lt;h4&gt;Is DeFi a good investment?&lt;/h4&gt;
&lt;p&gt;In terms of investing, DeFi is still in its early stages and there are no guarantees. However, DeFi does offer a number of advantages over traditional financial systems. For example, DeFi applications often have lower fees and faster transaction speeds. DeFi also offers users more control over their finances, which can be beneficial in times of market volatility.&lt;/p&gt;
&lt;p&gt;Overall, DeFi is still a relatively new phenomenon and there are no guarantees as to its future success. However, DeFi does offer a number of advantages over traditional financial systems and has the potential to revolutionize the way we interact with money.&lt;/p&gt;
&lt;h4&gt;What is the difference between DeFi and crypto?&lt;/h4&gt;
&lt;p&gt;Crypto is a digital or virtual asset that uses cryptography to secure its transactions. DeFi is a financial system that runs on decentralized infrastructure, such as the Ethereum blockchain. DeFi applications are built on top of Ethereum and allow users to interact with each other without the need for a central authority.&lt;/p&gt;
&lt;p&gt;Crypto assets can be used for a variety of purposes, such as investing, trading, and paying for goods and services. DeFi applications can be used for a variety of purposes, such as lending, borrowing, trading, and investing.&lt;/p&gt;
&lt;h4&gt;Can financial institutions stop DeFi?&lt;/h4&gt;
&lt;p&gt;Central financial institutions can stop DeFi applications by shutting down the servers or blocking the IP addresses. However, DeFi is built on decentralized infrastructure, which makes it much more resistant to censorship. DeFi applications are also open source, so they can be copied and distributed freely. This makes it difficult for central authorities to shut them down without causing a significant amount of backlash.&lt;/p&gt;
&lt;h4&gt;What are liquidity pools?&lt;/h4&gt;
&lt;p&gt;Liquidity pools are a type of DeFi application that allow users to pool their funds together and make loans. This can be beneficial for both lenders and borrowers, as it allows them to get better rates and longer terms. Liquidity pools also offer increased liquidity, as they can provide access to a larger pool of funds.&lt;/p&gt;
&lt;h4&gt;What is word freedom?&lt;/h4&gt;
&lt;p&gt;Word freedom is the ability to use DeFi applications without the need for a central authority. DeFi applications are built on decentralized infrastructure, which makes them much more resistant to censorship. DeFi also offers users more control over their finances, as they are not reliant on centralized authorities.&lt;/p&gt;
&lt;h4&gt;What kind of Financial Transactions does DeFi enable?&lt;/h4&gt;
&lt;p&gt;DeFi enables a number of financial transactions, such as lending, borrowing, trading, and investing. &lt;/p&gt;
&lt;h4&gt;Conclusion&lt;/h4&gt;
&lt;p&gt;In conclusion , DeFi is a financial system that runs on decentralized infrastructure and offers users more freedom and control over their finances. DeFi has the potential to revolutionize the way we interact with money and could eventually replace traditional financial institutions.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Jasper provides an Artificial Intelligence service that produces content for the massive search engine algorithms.  With Jasper, you have a robot writing for robots.&lt;/p&gt;
&lt;p&gt;While (at present) I do not recommend Jasper for polished technical content or scientific papers, I do recommend it to anyone who works in an industry that needs traffic.  The Jasper + Surfer combination produces easy-to-read blog posts in a casual voice with minimal effort and minimal time.&lt;/p&gt;
&lt;p&gt;For fun, I may start an ancillary hobby blog on a Pop Culture topic: jot down titles and bullet points, then feed those notes into Jasper to get an insta-website.&lt;/p&gt;</content><category term="Data Science"></category><category term="Jasper"></category><category term="Machine Learning"></category></entry><entry><title>A New Exemplar Machine Learning Algorithm (Part 2: Optimize)</title><link href="https://john.soban.ski/rce-python-part-2.html" rel="alternate"></link><published>2021-12-23T12:23:00-05:00</published><updated>2021-12-23T12:23:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-12-23:/rce-python-part-2.html</id><summary type="html">&lt;p&gt;In &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;part one&lt;/a&gt; of this two-part series, I developed a &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Reduced Coulomb Energy (RCE) classifier&lt;/a&gt; in Python.  &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE&lt;/a&gt; calculates &lt;strong&gt;hit footprints&lt;/strong&gt; around training data and uses the footprints to classify test data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d RCE Map" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;RCE draws a circle around each labeled &lt;strong&gt;training&lt;/strong&gt; observation, with a radius (lambda) that stops at the closest …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;part one&lt;/a&gt; of this two-part series, I developed a &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Reduced Coulomb Energy (RCE) classifier&lt;/a&gt; in Python.  &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;RCE&lt;/a&gt; calculates &lt;strong&gt;hit footprints&lt;/strong&gt; around training data and uses the footprints to classify test data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d RCE Map" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;RCE draws a circle around each labeled &lt;strong&gt;training&lt;/strong&gt; observation, with a radius (lambda) that stops at the closest labeled training point in the &lt;strong&gt;opposite&lt;/strong&gt; class. Each circle indicates the &lt;strong&gt;hit footprint&lt;/strong&gt; for that class.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE in action" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;In part one&lt;/a&gt; I ran RCE for one epoch on a two-feature training set to achieve an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this blog post, I will introduce and tune hyperparameters to improve model success and reduce ambiguity. I will investigate the number of principal components and tune &lt;strong&gt;r&lt;/strong&gt;. &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt; sets the maximum value for &lt;strong&gt;Lambda&lt;/strong&gt;, which caps the size of each circle that represents a given hit footprint.&lt;/p&gt;
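&lt;p&gt;As a toy illustration of that cap (hypothetical points and a standalone &lt;strong&gt;capped_lambda()&lt;/strong&gt; helper, not the &lt;strong&gt;find_lambda()&lt;/strong&gt; function from part one), each &lt;strong&gt;Lambda&lt;/strong&gt; equals the distance to the nearest opposite-class point, clipped at &lt;strong&gt;r&lt;/strong&gt;:&lt;/p&gt;

```python
import numpy as np

# Hypothetical two-feature training set with two classes (not the diabetes data)
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [5.0, 0.0]])
labels = np.array([0, 0, 1, 1])

def capped_lambda(idx, r):
    """Distance from points[idx] to its nearest opposite-class point, capped at r."""
    rivals = points[labels != labels[idx]]
    dists = np.linalg.norm(rivals - points[idx], axis=1)
    return float(min(dists.min(), r))

lambdas = [capped_lambda(i, r=2.0) for i in range(len(points))]
print(lambdas)  # [2.0, 1.5, 1.5, 2.0] -- the first and last circles hit the r cap
```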
&lt;p&gt;I will also see how RCE performs with a reduced training set.  In &lt;em&gt;Pattern Classification Using Neural Networks&lt;/em&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This classifier is similar to a k-nearest neighbor classifier in that it adapts rapidly over time, but it typically requires many fewer exemplar nodes than a nearest neighbor classifier. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Tune Number of Features&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/rce-python-part-1.html"&gt;Last time&lt;/a&gt;, I left off with the following confusion matrix for the two principal component scenario.  In that scenario, I applied RCE to the diabetes dataset after I used Principal Component Analysis (PCA) to reduce the data set down to two features.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Confusion Matrix for two Principle Components" src="https://john.soban.ski/images/Rce_Python_Part_1/22_2d_Confuse.png"&gt;&lt;/p&gt;
&lt;p&gt;Our RCE algorithm trained a model with an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.42424242424242425&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2662337662337662&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Three Principal Components&lt;/h3&gt;
&lt;p&gt;I use the following code to reduce the diabetes dataset down to three principal components, which yields the Pandas dataframes &lt;strong&gt;train_df&lt;/strong&gt; and &lt;strong&gt;test_df&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_train&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
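&lt;p&gt;For intuition, the &lt;strong&gt;PCA&lt;/strong&gt; fit and transform above amount to centering the features and projecting them onto the top right-singular vectors of the data matrix.  A numpy-only sketch on synthetic data (an illustration, not scikit-learn's actual implementation):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # synthetic stand-in for 8 normalized features

def pca_scores(X, n_components):
    """Project X onto its top n_components principal axes."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # component scores

scores = pca_scores(X, 3)
print(scores.shape)  # (100, 3)
```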

&lt;p&gt;I then call my &lt;strong&gt;classify_data()&lt;/strong&gt; function to classify the data.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I attach the labels to the classified data frame for the confusion matrix.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
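&lt;p&gt;&lt;strong&gt;pd.crosstab()&lt;/strong&gt; tallies the actual/predicted pairs into the confusion matrix.  A self-contained toy example (made-up labels, not the diabetes results):&lt;/p&gt;

```python
import pandas as pd

# Made-up actual vs. predicted labels
frame = pd.DataFrame({
    'actual':         [0, 0, 1, 1, 1, 0],
    'classification': [0, 1, 1, 1, 0, 0],
})

# Rows count actual labels, columns count predictions
cm = pd.crosstab(frame['actual'], frame['classification'],
                 rownames=['Actual'], colnames=['Predicted'])
print(cm)
```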

&lt;p&gt;Three features yield the following confusion matrix:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Three Princomp" src="https://john.soban.ski/images/Rce_Python_Part_2/01_Three_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;My &lt;strong&gt;calc_success()&lt;/strong&gt; function returns &lt;strong&gt;f1_score&lt;/strong&gt; and ambiguity.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unambiguous_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ambiguity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;f1_score&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;ambiguity&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ambiguity&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
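&lt;p&gt;On a toy results frame (made-up labels, with &lt;strong&gt;NaN&lt;/strong&gt; marking ambiguous points) the metric behaves as expected.  This sketch recomputes the binary F1 by hand, so it does not depend on scikit-learn:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy frame: NaN in 'classification' marks an ambiguous (unclassified) point
toy_df = pd.DataFrame({
    'actual':         [1, 0, 1, 0, 1, 0, 1, 0],
    'classification': [1, 0, 0, 0, np.nan, np.nan, 1, 1],
})

unambiguous = toy_df.dropna()
ambiguity = 1 - len(unambiguous) / len(toy_df)  # fraction left unclassified

# Binary F1 by hand: 2*TP / (2*TP + FP + FN)
tp = ((unambiguous.actual == 1) & (unambiguous.classification == 1)).sum()
fp = ((unambiguous.actual == 0) & (unambiguous.classification == 1)).sum()
fn = ((unambiguous.actual == 1) & (unambiguous.classification == 0)).sum()
f1 = float(2 * tp / (2 * tp + fp + fn))

print({'f1_score': round(f1, 3), 'ambiguity': ambiguity})  # {'f1_score': 0.667, 'ambiguity': 0.25}
```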

&lt;p&gt;Both &lt;strong&gt;F1&lt;/strong&gt; (bad) and &lt;strong&gt;ambiguity&lt;/strong&gt; (good) decrease with an extra principal component.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.41666666666666663,
&lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.2532467532467532&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Four Principal Components&lt;/h3&gt;
&lt;p&gt;The following code reduces the training dataset to four principal components, classifies the resulting data frame, and plots the confusion matrix.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Four Princomp" src="https://john.soban.ski/images/Rce_Python_Part_2/02_Four_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;The F1 score increases slightly and the ambiguity shoots up.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.41935483870967744,
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.34415584415584416&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Five Principal Components&lt;/h3&gt;
&lt;p&gt;I use the following code to look at the five Principal Component scenario.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Five Princomp" src="https://john.soban.ski/images/Rce_Python_Part_2/03_Five_Princomp.png"&gt;&lt;/p&gt;
&lt;p&gt;Five principal components decrease the F1 score and increase the ambiguity.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3928571428571428&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.36363636363636365&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Principal Component Results&lt;/h3&gt;
&lt;p&gt;The following table captures the results of the investigation.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;P&lt;/th&gt;
&lt;th&gt;f1&lt;/th&gt;
&lt;th&gt;Ambig.&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;.424&lt;/td&gt;
&lt;td&gt;.266&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;.417&lt;/td&gt;
&lt;td&gt;.253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;.419&lt;/td&gt;
&lt;td&gt;.344&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;.393&lt;/td&gt;
&lt;td&gt;.363&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
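&lt;p&gt;For later comparison, the same results can be held in a small DataFrame (values copied from the table above); &lt;strong&gt;idxmax&lt;/strong&gt; and &lt;strong&gt;idxmin&lt;/strong&gt; then pick out the strongest settings.  This is just bookkeeping, not part of the RCE code itself:&lt;/p&gt;

```python
import pandas as pd

# Results from the principal-component sweep above
results = pd.DataFrame({'P': [2, 3, 4, 5],
                        'f1': [0.424, 0.417, 0.419, 0.393],
                        'ambiguity': [0.266, 0.253, 0.344, 0.363]}).set_index('P')

# Two principal components give the best F1; three give the least ambiguity
print(results['f1'].idxmax())         # 2
print(results['ambiguity'].idxmin())  # 3
```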
&lt;h2&gt;Tune the Radius&lt;/h2&gt;
&lt;p&gt;The original &lt;strong&gt;find_lambda&lt;/strong&gt; function increases the radius of the hit footprint until the footprint collides with the nearest point of the opposite class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
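&lt;p&gt;As a sanity check, a toy example (hypothetical points, not from the notebook's dataset) confirms that &lt;strong&gt;find_lambda&lt;/strong&gt; returns the distance to the nearest opposite-class point.  I pass &lt;strong&gt;v&lt;/strong&gt; as a plain list here to sidestep pandas positional-indexing quirks:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def find_lambda(df, v):
    # Minimum distance from point v to any point of the opposite class
    return (np.linalg.norm(df.loc[df['outcome'] != v[-1]]
                             .iloc[:, :-1]
                             .sub(np.array(v[:-1])),
                           axis=1)
              .min())

# Hypothetical training set: two class-0 points and one class-1 point at (3, 4)
toy_df = pd.DataFrame({'x': [0.0, 1.0, 3.0],
                       'y': [0.0, 0.0, 4.0],
                       'outcome': [0, 0, 1]})

# The class-0 point at the origin sits exactly 5.0 away from (3, 4)
lam = find_lambda(toy_df, toy_df.iloc[0].tolist())
print(lam)  # 5.0
```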

&lt;p&gt;In part one, we see the footprints that result from unbounded radii.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Radius Unbounded" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;I can add the following conditional to scope the footprint to a set maximum radius, &lt;strong&gt;r&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lambda_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
                     &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
                     &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                           &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lambda_var&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;lambda_var&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
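&lt;p&gt;The conditional on the last line acts as a clip: any distance larger than &lt;strong&gt;r&lt;/strong&gt; collapses to &lt;strong&gt;r&lt;/strong&gt;.  A minimal sketch with hypothetical points demonstrates both branches:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def find_lambda(df, v, r):
    # Distance to the nearest opposite-class point, capped at radius r
    lambda_var = (np.linalg.norm(df.loc[df['outcome'] != v[-1]]
                                   .iloc[:, :-1]
                                   .sub(np.array(v[:-1])),
                                 axis=1)
                    .min())
    return r if lambda_var > r else lambda_var

# Hypothetical toy set: the nearest opposite-class point to the origin lies 5.0 away
toy_df = pd.DataFrame({'x': [0.0, 3.0],
                       'y': [0.0, 4.0],
                       'outcome': [0, 1]})

clipped = find_lambda(toy_df, [0.0, 0.0, 0], 0.1)   # capped at r
full = find_lambda(toy_df, [0.0, 0.0, 0], 10.0)     # true distance survives
print(clipped, full)  # 0.1 5.0
```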

&lt;p&gt;I add &lt;strong&gt;r&lt;/strong&gt; to the &lt;strong&gt;find_lambda&lt;/strong&gt; call.  (Note the vocabulary overlap: the following code uses a Python &lt;strong&gt;lambda&lt;/strong&gt; expression to call the unrelated function named &lt;strong&gt;find_lambda&lt;/strong&gt;.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A scoped radius of maximum &lt;strong&gt;0.1&lt;/strong&gt; creates the following footprints.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Radius Pointone" src="https://john.soban.ski/images/Rce_Python_Part_2/04_Radius_Pointone.png"&gt;&lt;/p&gt;
&lt;p&gt;The following code creates, labels, and plots a three-dimensional dataset, with &lt;strong&gt;r&lt;/strong&gt; set to 3.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate 3 Principal Components for training &lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Re-attach the labels for training&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ID Lambda for each datum&lt;/span&gt;
&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a 3D grid for data viz&lt;/span&gt;
&lt;span class="n"&gt;class_3d_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; 
                        &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;

&lt;span class="c1"&gt;# Classify each point of the grid for data viz&lt;/span&gt;
&lt;span class="n"&gt;class_3d_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_3d_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_3d_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
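&lt;p&gt;As a sanity check on the grid construction, the nested comprehension yields a 50 &amp;#215; 50 &amp;#215; 50 lattice (125,000 points), and the division by 25 scales each axis to the interval [-1, 1).  The snippet below rebuilds the grid standalone, without the classifier:&lt;/p&gt;

```python
import pandas as pd

# Rebuild the 3D evaluation grid exactly as above
class_3d_df = pd.DataFrame([(x, y, z)
                            for x in range(-25, 25)
                            for y in range(-25, 25)
                            for z in range(-25, 25)],
                           columns=['princomp1', 'princomp2', 'princomp3']) / 25

print(class_3d_df.shape)               # (125000, 3)
print(class_3d_df['princomp1'].min())  # -1.0
print(class_3d_df['princomp1'].max())  # 0.96
```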

&lt;p&gt;This plot captures the hit footprints in 3D; each footprint now forms a sphere rather than a circle, as in the 2D case.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Threed Radiusthree" src="https://john.soban.ski/images/Rce_Python_Part_2/05_Threed_Radiusthree.png"&gt;&lt;/p&gt;
&lt;p&gt;Re-run the code above with the following edit to set &lt;strong&gt;r&lt;/strong&gt; to &lt;strong&gt;0.15&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                          &lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                                          &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With a smaller &lt;strong&gt;r&lt;/strong&gt; we get a better view of the spheres that show the hit footprints.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Threed Radiuspointonefive" src="https://john.soban.ski/images/Rce_Python_Part_2/06_Threed_Radiuspointonefive.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;r&lt;/strong&gt; provides a hyperparameter to the &lt;strong&gt;RCE&lt;/strong&gt; algorithm.  Different values of &lt;strong&gt;r&lt;/strong&gt; will produce different results in terms of model effectiveness.&lt;/p&gt;
&lt;p&gt;I create a function named &lt;strong&gt;hyperparameter_tune&lt;/strong&gt; that applies RCE to a fresh train dataset, constrained by a given value for &lt;strong&gt;r&lt;/strong&gt;, and returns the &lt;strong&gt;f1&lt;/strong&gt; score and &lt;strong&gt;ambiguity&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hyperparameter_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then sweep &lt;strong&gt;r&lt;/strong&gt; across ninety-nine values, evenly spaced between zero and one (exclusive), and record each score.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hyperparameter_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then plot the results, to identify the optimal &lt;strong&gt;r&lt;/strong&gt; value for the given train dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
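&lt;p&gt;Rather than reading the optimum off the plot by eye, the sweep results can be queried directly.  The scores below are hypothetical stand-ins for the output of &lt;strong&gt;hyperparameter_tune&lt;/strong&gt;, since the real values depend on the train/test split:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sweep output: each entry mimics what hyperparameter_tune
# returns, augmented with the r value that produced it
loss = [{'f1_score': 0.40, 'ambiguity': 0.30, 'r': 0.25},
        {'f1_score': 0.43, 'ambiguity': 0.27, 'r': 0.58},
        {'f1_score': 0.41, 'ambiguity': 0.29, 'r': 0.75}]

scores = pd.DataFrame(loss).set_index('r')
best_r = scores['f1_score'].idxmax()
print(best_r, scores.loc[best_r, 'ambiguity'])  # 0.58 0.27
```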

&lt;p&gt;&lt;img alt="Tune R" src="https://john.soban.ski/images/Rce_Python_Part_2/07_Tune_R.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;r = 0.58&lt;/strong&gt; yields the best results, with an f1_score of &lt;strong&gt;0.43&lt;/strong&gt; and ambiguity of &lt;strong&gt;0.27&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The following confusion matrix captures the results for &lt;strong&gt;r=0.58&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Radius Pointfive" src="https://john.soban.ski/images/Rce_Python_Part_2/08_Radius_Pointfive.png"&gt;&lt;/p&gt;
&lt;h2&gt;Tune the Algorithm&lt;/h2&gt;
&lt;p&gt;Our algorithm declares a region &lt;strong&gt;ambiguous&lt;/strong&gt; when it contains either (1) no footprint or (2) &lt;strong&gt;overlapping&lt;/strong&gt; footprints from both classes.  The Python code follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Original&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
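&lt;p&gt;The two &lt;strong&gt;apply&lt;/strong&gt; calls encode a simple truth table: class 0 when only class-0 footprints hit, class 1 when only class-1 footprints hit, and &lt;strong&gt;NaN&lt;/strong&gt; (ambiguous) when neither or both hit.  A sketch with hypothetical hit counts makes the rule explicit:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical hit counts for four query points:
# class-0 only, class-1 only, both (overlap), neither (no footprint)
class_df = pd.DataFrame({'class0_hits': [2, 0, 1, 0],
                         'class1_hits': [0, 3, 1, 0]})

# Same classification rule as classify_data above
class_df['classification'] = np.nan
class_df['classification'] = class_df.apply(
    lambda X: 0 if X.class0_hits > 0 and X.class1_hits == 0 else X.classification,
    axis=1)
class_df['classification'] = class_df.apply(
    lambda X: 1 if X.class1_hits > 0 and X.class0_hits == 0 else X.classification,
    axis=1)

print(class_df['classification'].tolist())  # [0.0, 1.0, nan, nan]
```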

&lt;p&gt;To decrease ambiguity, I add &lt;strong&gt;vote&lt;/strong&gt; logic to the code.  With this change, an overlapping region receives a &lt;strong&gt;winner&lt;/strong&gt; class when one class contributes more exemplar hits than the other.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Reduce Ambiguity&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Voting RCE&lt;/strong&gt; algorithm produces a 2d footprint map with a high concentration of &lt;strong&gt;Class Zero&lt;/strong&gt; regions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Vote Algorithm" src="https://john.soban.ski/images/Rce_Python_Part_2/09_Vote_Algorithm.png"&gt;&lt;/p&gt;
&lt;p&gt;I tune &lt;strong&gt;r&lt;/strong&gt; for the new algorithm and plot the results using the same code above.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hyperparameter_tune&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="Vote Loss" src="https://john.soban.ski/images/Rce_Python_Part_2/10_Vote_Loss.png"&gt;&lt;/p&gt;
&lt;p&gt;The tuning identifies an ideal &lt;strong&gt;r&lt;/strong&gt; of 0.40, which yields an &lt;strong&gt;f1_score&lt;/strong&gt; of 0.4 and an &lt;strong&gt;ambiguity&lt;/strong&gt; of 0.2.  The ambiguity drops from the 0.27 of the non-voting algorithm.&lt;/p&gt;
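&lt;p&gt;As a sketch of how the &lt;strong&gt;ambiguity&lt;/strong&gt; metric could be computed (an assumption on my part: the fraction of rows whose &lt;strong&gt;classification&lt;/strong&gt; remains NaN after the footprint checks), consider:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical classified frame: NaN marks rows no footprint could decide.
class_df = pd.DataFrame(
    {'classification': [0, 1, np.nan, 1, np.nan, 0, 1, 0, 1, 0]})

# Ambiguity = share of undecided rows; here 2 of 10 rows, or 0.2.
ambiguity = class_df['classification'].isna().mean()
print(ambiguity)
```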
&lt;h2&gt;Small Training Sets&lt;/h2&gt;
&lt;p&gt;In &lt;strong&gt;Pattern Classification Using Neural Networks&lt;/strong&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann writes that RCE handles small training sets with aplomb:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This classifier is similar to a k-nearest neighbor classifier in that it adapts rapidly over time, but it typically requires many fewer exemplar nodes than a nearest neighbor classifier. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to try the algorithm (keeping the ideal &lt;strong&gt;r&lt;/strong&gt;) on half the training data, which results in the following loss graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Half Data" src="https://john.soban.ski/images/Rce_Python_Part_2/11_Half_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Contrast this to the loss plot for the full training data set (from above):&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tune R" src="https://john.soban.ski/images/Rce_Python_Part_2/07_Tune_R.png"&gt;&lt;/p&gt;
&lt;p&gt;Compared to the full dataset, the half dataset drives higher ambiguity, but produces a decent F1 score.&lt;/p&gt;
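&lt;p&gt;The halving code itself does not appear above; a minimal sketch with a hypothetical toy &lt;strong&gt;training_df&lt;/strong&gt; (not the post's actual DataFrame) follows:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the training DataFrame (hypothetical values).
training_df = pd.DataFrame({
    'Glucose': [85, 168, 90, 155, 100, 140, 120, 175],
    'BMI': [26.6, 33.6, 28.1, 35.3, 30.0, 32.0, 25.0, 36.0],
    'Outcome': [0, 1, 0, 1, 0, 1, 0, 1],
})

# Sample half, then half again for the one-quarter experiment;
# a fixed seed keeps the runs repeatable.
half_df = training_df.sample(frac=0.5, random_state=42)
quarter_df = half_df.sample(frac=0.5, random_state=42)
print(len(half_df), len(quarter_df))
```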
&lt;p&gt;If we halve the dataset once more (one quarter of the original data), we get the following loss plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Quarter Data" src="https://john.soban.ski/images/Rce_Python_Part_2/12_Quarter_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Since we have a dearth of data, we need an &lt;strong&gt;r&lt;/strong&gt; of at least &lt;strong&gt;0.4&lt;/strong&gt; to get any traction.  At that point, the algorithm produces decent ambiguity and &lt;strong&gt;F1&lt;/strong&gt; score, considering the lack of training data.&lt;/p&gt;
&lt;p&gt;The following plot shows the RCE hit footprints given one-quarter of the training data:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Quarter Datamap" src="https://john.soban.ski/images/Rce_Python_Part_2/13_Quarter_Datamap.png"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;RCE provides an interesting alternative to the more popular &lt;strong&gt;K-Nearest Neighbor&lt;/strong&gt; exemplar classifier.  The RCE classifier learns quickly with limited training data.&lt;/p&gt;
&lt;p&gt;Comment below if you think TensorFlow or MXNet should include this classifier in their ML libraries!&lt;/p&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>A New Exemplar Machine Learning Algorithm (Part 1: Develop)</title><link href="https://john.soban.ski/rce-python-part-1.html" rel="alternate"></link><published>2021-08-21T02:34:00-04:00</published><updated>2021-08-21T02:34:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-08-21:/rce-python-part-1.html</id><summary type="html">&lt;p&gt;In &lt;strong&gt;Pattern Classification Using Neural Networks&lt;/strong&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann provides the following definition of &lt;strong&gt;Exemplar&lt;/strong&gt; neural net classifiers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Exemplar classifiers] perform classification based on the identity of the training examples, or exemplars, that are nearest to the input.  Exemplar nodes compute the weighted Euclidean distance …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;In &lt;strong&gt;Pattern Classification Using Neural Networks&lt;/strong&gt; (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann provides the following definition of &lt;strong&gt;Exemplar&lt;/strong&gt; neural net classifiers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[Exemplar classifiers] perform classification based on the identity of the training examples, or exemplars, that are nearest to the input.  Exemplar nodes compute the weighted Euclidean distance between inputs and node centroids&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;strong&gt;nearest neighbor&lt;/strong&gt; classifier represents the most popular &lt;strong&gt;exemplar&lt;/strong&gt; neural net classifier in the domain of &lt;strong&gt;Machine Learning&lt;/strong&gt; (ML).  Nearly every &lt;strong&gt;ML&lt;/strong&gt; framework and platform provides a library for &lt;strong&gt;nearest neighbor&lt;/strong&gt; classification.&lt;/p&gt;
&lt;p&gt;In this blog post, I will develop Python code to implement a lesser-known &lt;strong&gt;exemplar&lt;/strong&gt; classifier, Reduced Coulomb Energy (RCE).&lt;/p&gt;
&lt;p&gt;The RCE algorithm assigns a class to &lt;strong&gt;test&lt;/strong&gt; data based on whether or not the data points live inside &lt;strong&gt;hit footprints&lt;/strong&gt; of &lt;strong&gt;training&lt;/strong&gt; data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d RCE Map" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;Open my post &lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical Introduction to Probabilistic Neural Networks&lt;/a&gt; in a new tab for a deep dive into the math behind RCE.&lt;/p&gt;
&lt;p&gt;At a high level, RCE draws a circle around each labeled &lt;strong&gt;training&lt;/strong&gt; observation, with a radius (lambda) equal to the distance to the closest labeled training point of the &lt;strong&gt;opposite&lt;/strong&gt; class.  Each circle indicates the &lt;strong&gt;hit footprint&lt;/strong&gt; for that class.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE in action" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h2&gt;RCE vs. Nearest Neighbor (NN)&lt;/h2&gt;
&lt;p&gt;The following two-dimensional (2d) plot shows five data points, two of class &lt;strong&gt;X&lt;/strong&gt;, two of class &lt;span style="color:red"&gt;&lt;strong&gt;O&lt;/strong&gt;&lt;/span&gt; and one unknown observation, &lt;span style="color:green"&gt;&lt;strong&gt;?&lt;/strong&gt;&lt;/span&gt;, we wish to classify.&lt;/p&gt;
&lt;p&gt;&lt;img alt="How would you classify this green question mark" src="https://john.soban.ski/images/Rce_Python_Part_1/01_Classify_Green.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;NN&lt;/strong&gt; algorithm uses the classes of the nearest data points to classify an unknown observation.  Based on the plot above, &lt;strong&gt;NN&lt;/strong&gt; identifies that the green question mark belongs to class &lt;strong&gt;X&lt;/strong&gt;.  The two &lt;strong&gt;X's&lt;/strong&gt; clearly lie closer to the green question mark than the two red &lt;span style="color:red"&gt;&lt;strong&gt;O's&lt;/strong&gt;&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RCE&lt;/strong&gt;, however, uses a hit-radius approach to classify data.  The algorithm calculates a footprint for each of the known data points, with radii determined by the proximity of data from the opposite class.  The &lt;strong&gt;RCE&lt;/strong&gt; footprints for the four data points follow:&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE classification of the green question mark" src="https://john.soban.ski/images/Rce_Python_Part_1/02_Classify_Rce.png"&gt;&lt;/p&gt;
&lt;p&gt;Based on this model, the green question mark lands in the footprint of the red class, and &lt;strong&gt;RCE&lt;/strong&gt; indicates that the unknown observation belongs to class &lt;span style="color:red"&gt;&lt;strong&gt;O&lt;/strong&gt;&lt;/span&gt;.&lt;/p&gt;
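&lt;p&gt;A minimal sketch of the two decision rules, with illustrative coordinates I made up (not the plotted data), follows:&lt;/p&gt;

```python
import numpy as np

# Illustrative 2d coordinates: two X exemplars, two O exemplars,
# and one unknown query point.
X_pts = np.array([[0.0, 0.0], [0.5, 0.0]])
O_pts = np.array([[1.0, 0.0], [-5.0, 0.0]])
query = np.array([-2.0, 0.0])

def dists(pts, q):
    # Euclidean distance from q to every point in pts.
    return np.linalg.norm(pts - q, axis=1)

# 1-NN rule: the query takes the class of the single closest exemplar.
nn_class = 'X' if np.less(dists(X_pts, query).min(),
                          dists(O_pts, query).min()) else 'O'

def radii(own, other):
    # RCE radius per exemplar: distance to the nearest opposite-class exemplar.
    return np.array([dists(other, p).min() for p in own])

# RCE rule: the query takes the class of any footprint that contains it.
in_x = bool(np.any(np.less_equal(dists(X_pts, query), radii(X_pts, O_pts))))
in_o = bool(np.any(np.less_equal(dists(O_pts, query), radii(O_pts, X_pts))))
print(nn_class, in_x, in_o)
```

&lt;p&gt;With these made-up points, the query lies nearest the &lt;strong&gt;X&lt;/strong&gt; exemplars, so 1-NN picks &lt;strong&gt;X&lt;/strong&gt;; yet only the far &lt;strong&gt;O&lt;/strong&gt; footprint reaches the query, so RCE picks &lt;strong&gt;O&lt;/strong&gt;.&lt;/p&gt;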
&lt;h2&gt;Explore the Data&lt;/h2&gt;
&lt;p&gt;I use the &lt;a href="https://www.kaggle.com/uciml/pima-indians-diabetes-database"&gt;Pima Indians Diabetes&lt;/a&gt; dataset to craft my model.  The dataset includes observations of eight features and a two-class label.  The labels indicate the presence or absence of diabetes.&lt;/p&gt;
&lt;p&gt;First, import the data into a &lt;strong&gt;Pandas&lt;/strong&gt; &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html"&gt;DataFrame&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;pima_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;diabetes.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;head()&lt;/strong&gt; method gives a quick peek at the features and observations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pima Table" src="https://john.soban.ski/images/Rce_Python_Part_1/03_Pima_Table.png"&gt;&lt;/p&gt;
&lt;p&gt;A quick &lt;a href="https://seaborn.pydata.org/"&gt;Seaborn&lt;/a&gt; histogram depicts the frequency of &lt;strong&gt;Outcome Zero&lt;/strong&gt; (No Diabetes) vs. &lt;strong&gt;Outcome One&lt;/strong&gt; (Diabetes).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A quick glance shows that about 2/3 of the observations indicate &lt;strong&gt;no diabetes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/04_Outcome_Hist.png"&gt;&lt;/p&gt;
&lt;h3&gt;Explore One Feature&lt;/h3&gt;
&lt;p&gt;Of all the given features, I assume that &lt;strong&gt;Glucose&lt;/strong&gt; will impact &lt;strong&gt;Outcome&lt;/strong&gt; the most, so I update the histogram to depict the relationship between the two.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Blood sugar over &lt;strong&gt;150&lt;/strong&gt; appears to indicate &lt;strong&gt;diabetes&lt;/strong&gt;.  Below &lt;strong&gt;150&lt;/strong&gt;, the two classes overlap heavily.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Glucose Outcome Histogram" src="https://john.soban.ski/images/Rce_Python_Part_1/05_Glucose_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Kernel Density Estimation (KDE) provides a smoothed "overhead view" of the histogram.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This view also shows the lack of clear separation between the two &lt;strong&gt;Outcomes&lt;/strong&gt; based on &lt;strong&gt;Glucose&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Glucose Outcome Kernel Density Estimation" src="https://john.soban.ski/images/Rce_Python_Part_1/06_Glucose_Density.png"&gt;&lt;/p&gt;
&lt;h3&gt;Explore Two Features&lt;/h3&gt;
&lt;p&gt;Use &lt;strong&gt;PairGrid&lt;/strong&gt; to cycle through all features in order to depict their relationships to &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Pregnancies&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;BloodPressure&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;Insulin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;DiabetesPedigreeFunction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PairGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_vars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_vars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_offdiag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_legend&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Glucose&lt;/strong&gt; and &lt;strong&gt;BMI&lt;/strong&gt; appear to have a small correlation with &lt;strong&gt;Outcome&lt;/strong&gt;, based on the left/right orientation of the density &lt;strong&gt;blobs&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Correlation of Features with Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/07_Correlate_Features.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;KDE&lt;/strong&gt; plot provides an overhead, three-dimensional view of the relationships between &lt;strong&gt;Glucose&lt;/strong&gt;, &lt;strong&gt;BMI&lt;/strong&gt; and &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Based on the near-total overlap, the two features alone do not provide enough separation to predict &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Glucose, BMI and Outcome KDE" src="https://john.soban.ski/images/Rce_Python_Part_1/08_Glucose_Bmi.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Seaborn&lt;/strong&gt; heat map visualizes correlations across features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_theme&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;white&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;corr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a mask for the upper triangle&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Set up the matplotlib figure&lt;/span&gt;
&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a custom diverging colormap&lt;/span&gt;
&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diverging_palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                             &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Draw the heatmap with the mask and &lt;/span&gt;
&lt;span class="c1"&gt;# correct aspect ratio&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cbar_kws&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;shrink&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Look for dark red tiles in the &lt;strong&gt;Outcome&lt;/strong&gt; row.  The dark red tiles for &lt;strong&gt;Glucose&lt;/strong&gt; and &lt;strong&gt;BMI&lt;/strong&gt; indicate stronger correlation with &lt;strong&gt;Outcome&lt;/strong&gt; than the other features show.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Heatmap of Feature Correlation" src="https://john.soban.ski/images/Rce_Python_Part_1/09_Corr_Heatmap.png"&gt;&lt;/p&gt;
&lt;h3&gt;Explore Three Features&lt;/h3&gt;
&lt;p&gt;Create a function to plot three features against &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                          &lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;feature3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;green&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following function call, for example, draws a 3d plot that visualizes &lt;strong&gt;Glucose&lt;/strong&gt;, &lt;strong&gt;BMI&lt;/strong&gt; and &lt;strong&gt;SkinThickness&lt;/strong&gt; against &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This plot depicts slight separation between the two classes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d Plot of Outcomes x 3 Features" src="https://john.soban.ski/images/Rce_Python_Part_1/10_Pima_3d.png"&gt;&lt;/p&gt;
&lt;p&gt;Pick two new features, &lt;strong&gt;Insulin&lt;/strong&gt; and &lt;strong&gt;DiabetesPedigreeFunction&lt;/strong&gt; for another view.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;Insulin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;DiabetesPedigreeFunction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This combination yields significantly less separability of the classes than the combination of &lt;strong&gt;Glucose&lt;/strong&gt;, &lt;strong&gt;BMI&lt;/strong&gt; and &lt;strong&gt;SkinThickness&lt;/strong&gt; above.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Another view in 3d" src="https://john.soban.ski/images/Rce_Python_Part_1/11_More_3d.png"&gt;&lt;/p&gt;
&lt;h3&gt;Normalize&lt;/h3&gt;
&lt;p&gt;First, split the &lt;strong&gt;pima_df&lt;/strong&gt; DataFrame into &lt;strong&gt;train&lt;/strong&gt; and &lt;strong&gt;test&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Train&lt;/strong&gt; - Data to build our exemplar model&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test (AKA Holdout)&lt;/strong&gt; - Unseen data to help predict real-world performance&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove the rows that correspond to the train DF&lt;/span&gt;
&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pima_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# The pop removes Outcome from the features DF&lt;/span&gt;
&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Observe the &lt;strong&gt;summary statistics&lt;/strong&gt; of the features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;()[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We see big differences in the range of values for each feature, so we normalize the data; Machine Learning (ML) best practice puts features on a common scale so that no single wide-range feature dominates training.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Stat summary of the Pima Diabetes Dataset" src="https://john.soban.ski/images/Rce_Python_Part_1/12_Pima_Stats.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://keras.io/"&gt;Keras&lt;/a&gt; provides tools for Normalization. &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Pregnancies&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;Glucose&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;BloodPressure&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;SkinThickness&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;Insulin&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;BMI&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;DiabetesPedigreeFunction&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="s1"&gt;&amp;#39;Age&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                  &lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The features cluster around &lt;strong&gt;zero&lt;/strong&gt; post-normalization.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Normalized Features" src="https://john.soban.ski/images/Rce_Python_Part_1/13_Norm_Features.png"&gt;&lt;/p&gt;
&lt;h2&gt;Reduce Dimensionality&lt;/h2&gt;
&lt;p&gt;The correlation heatmap above indicates strong correlation between some features.  Highly correlated features introduce redundancy (noise) into our model.  Principal Component Analysis (PCA) maps the features onto orthogonal axes and also provides a means to reduce dimensions.  Too many dimensions (features) lead to over-fitting, which reduces the predictive effectiveness of ML models.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Open &lt;a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/"&gt;George Dallas' blog post&lt;/a&gt; in a new tab for an excellent explanation of PCA&lt;/p&gt;
&lt;/blockquote&gt;
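&lt;p&gt;To see why PCA helps with redundant features, consider a toy sketch: two nearly identical synthetic features collapse into one dominant component, and the components come out orthogonal.  (This example does not use the Pima data.)&lt;/p&gt;

```python
# Sketch: PCA on two highly correlated synthetic features.  The first
# component absorbs nearly all the variance; the components stay orthogonal.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 500)
y = 0.9 * x + rng.normal(0, 0.1, 500)      # near-duplicate of x
X = np.column_stack([x, y])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)       # first component dominates
```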
&lt;p&gt;Apply PCA to the Pima DataFrame in order to reduce noise and reduce the number of dimensions.&lt;/p&gt;
&lt;p&gt;Create a PCA transform engine, set the number of principal components via &lt;strong&gt;n_components&lt;/strong&gt; and then have the engine fit to the normalized &lt;strong&gt;train_features&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Store the normalized, dimensionality-reduced matrix in a DataFrame and set the column name.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;See if the single Principal Component separates the two classes better than our original &lt;strong&gt;Glucose&lt;/strong&gt; histogram did.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The histogram captures near-total overlap between the classes, which indicates we will need more than one Principal Component.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of The Principal Components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/14_Princomp_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Create a new data frame that includes two Principal Components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Observe a two dimensional scatterplot, colored by &lt;strong&gt;Outcome&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Two Principal Components reduce the overlap of the two classes slightly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A scatterplot of the two Principle Components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/15_Princomp_Scat.png"&gt;&lt;/p&gt;
&lt;p&gt;A density plot provides another view of the &lt;strong&gt;Outcomes&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;fill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The most dense regions of the two outcomes overlap.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A KDE plot of two principle components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/16_Princomp_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;How many components do we need?  The following code prints the explained variance for each component.  Higher variance means a component carries more information.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;explained_variance_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 2.09525231 1.67097928 1.04292129 0.88878235 0.76897059 0.69332725&lt;/span&gt;
&lt;span class="c1"&gt;# 0.4365278  0.41629126&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first three to five components include most of the useful information.&lt;/p&gt;
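&lt;p&gt;A common rule of thumb automates that judgment call: keep enough components to cover a target share (say 80%) of the cumulative &lt;strong&gt;explained_variance_ratio_&lt;/strong&gt;.  The sketch below uses synthetic data rather than the Pima features:&lt;/p&gt;

```python
# Sketch: choose n_components from the cumulative explained variance ratio
# (synthetic data with one redundant feature, not the Pima dataset).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(0, 0.3, 300)   # inject redundancy

pca = PCA(n_components=8).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_needed)   # components needed to reach 80% of the variance
```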
&lt;p&gt;The following code produces and stores three components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Attach the &lt;strong&gt;labels&lt;/strong&gt; back to the &lt;strong&gt;train&lt;/strong&gt; DataFrame for the purposes of a 3d plot.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plot_3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The result shows slight separability of the two classes if you imagine sliding a sheet of paper between the clouds of green and red dots.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d plot of Principle Components vs. Outcome" src="https://john.soban.ski/images/Rce_Python_Part_1/17_Princomp_3d.png"&gt;&lt;/p&gt;
&lt;h2&gt;Develop Model&lt;/h2&gt;
&lt;p&gt;We will use a 2d &lt;strong&gt;train&lt;/strong&gt; data set to walk through model development.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We re-attach the &lt;strong&gt;train&lt;/strong&gt; labels to our DataFrame.  Our exemplar algorithm requires knowledge of the labels for supervised learning.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Calculate Lambda&lt;/h3&gt;
&lt;p&gt;The following function finds the radius (lambda) for a given row.  For a given observation, it calculates the Euclidean distance to every observation of the &lt;strong&gt;opposite&lt;/strong&gt; class, and then returns the distance to the closest point.&lt;/p&gt;
&lt;p&gt;(Note the complete absence of any &lt;strong&gt;for&lt;/strong&gt; statements in the code below.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
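&lt;p&gt;To see the logic in isolation, the following self-contained sketch applies the same function (rewritten with &lt;strong&gt;.iloc&lt;/strong&gt; for explicit positional access) to a toy four-point DataFrame.  The nearest opposite-class point to (0, 0) sits three units away.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Same logic as find_lambda above: distance from row v to the closest
# point of the opposite class (last entry of v holds the label)
def find_lambda(df, v):
    return (np.linalg.norm(df.loc[df['outcome'] != v.iloc[-1]]
                             .iloc[:, :-1]
                             .sub(np.asarray(v.iloc[:-1], dtype=float)),
                           axis=1)
              .min())

# Toy train set: two class-0 points and two class-1 points on a line
toy = pd.DataFrame({'princomp1': [0.0, 1.0, 3.0, 4.0],
                    'princomp2': [0.0, 0.0, 0.0, 0.0],
                    'outcome':   [0,   0,   1,   1]})

lam = find_lambda(toy, toy.iloc[0, :])  # nearest class-1 point is (3, 0)
```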

&lt;p&gt;As an example, look at row one of our training DataFrame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That observation belongs to &lt;strong&gt;Outcome 0&lt;/strong&gt; (no diabetes), and lies at the point &lt;strong&gt;(-0.05, -1.16)&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;princomp1   -0.046671
princomp2   -1.161939
outcome      0.000000
Name: 122, dtype: float64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pass this observation to our &lt;strong&gt;find_lambda&lt;/strong&gt; function, which returns the distance to the nearest observation in &lt;strong&gt;Outcome 1&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our function indicates that the closest observation in &lt;strong&gt;Outcome 1&lt;/strong&gt; lies 0.09 units away.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="mf"&gt;0.0893789327564675&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Pandas&lt;/strong&gt; apply method allows us to follow a &lt;strong&gt;Functional Programming&lt;/strong&gt; approach and process the entire DataFrame at once.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_lambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                    &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following table captures the resulting &lt;strong&gt;lambda&lt;/strong&gt; for a handful of example &lt;strong&gt;training&lt;/strong&gt; observations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The calculated Lambdas in a table" src="https://john.soban.ski/images/Rce_Python_Part_1/18_Train_Lambda.png"&gt;&lt;/p&gt;
&lt;h3&gt;Classify Test Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Test&lt;/strong&gt; data does not include a label.  The ML Engineer feeds &lt;strong&gt;test&lt;/strong&gt; data into the &lt;strong&gt;trained&lt;/strong&gt; model, and the model predicts a label.&lt;/p&gt;
&lt;p&gt;We will now develop the logic to predict a label.&lt;/p&gt;
&lt;p&gt;To demonstrate the logic, we produce a grid of &lt;strong&gt;test&lt;/strong&gt; points.  The grid will also feed a visualization of the RCE decision boundaries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; 
                        &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                   &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
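&lt;p&gt;As a side note, &lt;strong&gt;numpy.meshgrid&lt;/strong&gt; builds the same grid without a double comprehension.  The following sketch produces a DataFrame with the same 40,000 points in the same order.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# 200 x 200 grid spanning [-1.00, 0.99] in steps of 0.01
xs = np.arange(-100, 100) / 100
g1, g2 = np.meshgrid(xs, xs, indexing='ij')  # 'ij' matches the comprehension order

grid_df = pd.DataFrame({'princomp1': g1.ravel(),
                        'princomp2': g2.ravel()})
```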

&lt;p&gt;Our grid includes the following &lt;strong&gt;test&lt;/strong&gt; data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A grid to feed the visualization of the RCE decision boundaries" src="https://john.soban.ski/images/Rce_Python_Part_1/19_Princomp_Grid.png"&gt;&lt;/p&gt;
&lt;p&gt;Our RCE algorithm uses the &lt;strong&gt;find_lambda&lt;/strong&gt; function (above) to calculate &lt;strong&gt;lambda&lt;/strong&gt; for each observation in the &lt;strong&gt;train&lt;/strong&gt; DataFrame and stores the results in the &lt;strong&gt;train_df&lt;/strong&gt; DataFrame.  Recall that &lt;strong&gt;Lambda&lt;/strong&gt; represents the &lt;strong&gt;radius&lt;/strong&gt; of a circle that captures the &lt;strong&gt;hit footprint&lt;/strong&gt; for a given observation.  &lt;/p&gt;
&lt;p&gt;Our &lt;strong&gt;find_hits&lt;/strong&gt; function (below) takes a given &lt;strong&gt;test&lt;/strong&gt; observation and then calculates the Euclidean distance to every point in &lt;strong&gt;train_df&lt;/strong&gt;.  A &lt;strong&gt;test point to train point&lt;/strong&gt; distance less than the &lt;strong&gt;train point's lambda&lt;/strong&gt; indicates that the &lt;strong&gt;test point&lt;/strong&gt; lies in the &lt;strong&gt;train point's&lt;/strong&gt; hit footprint.  &lt;/p&gt;
&lt;p&gt;For a given &lt;strong&gt;test&lt;/strong&gt; observation, our &lt;strong&gt;find_hits&lt;/strong&gt; function discovers and tallies the hits for each class.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;
                           &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                  &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;outcome&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;lambda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
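&lt;p&gt;A toy run makes the tally concrete.  The following self-contained sketch (same logic as &lt;strong&gt;find_hits&lt;/strong&gt; above, on hand-made data) counts how many lambda circles of each class contain the point (0.5, 0).&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Same logic as find_hits above: count train points of the given class
# whose lambda circle contains the test point v
def find_hits(df, v, outcome):
    return ((np.linalg.norm(df.loc[df['outcome'] == outcome]
                              .iloc[:, :-2]
                              .sub(np.asarray(v, dtype=float)),
                            axis=1)
             < df.loc[df['outcome'] == outcome]['lambda']).sum())

# Toy train set: centers with pre-computed lambda radii
toy = pd.DataFrame({'princomp1': [0.0, 1.0, 3.0],
                    'princomp2': [0.0, 0.0, 0.0],
                    'outcome':   [0,   0,   1],
                    'lambda':    [3.0, 2.0, 2.0]})

v = pd.Series({'princomp1': 0.5, 'princomp2': 0.0})
hits0 = find_hits(toy, v, 0)  # both class-0 circles contain (0.5, 0)
hits1 = find_hits(toy, v, 1)  # the class-1 circle (center (3, 0), radius 2) does not
```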

&lt;p&gt;Row one of our &lt;strong&gt;test&lt;/strong&gt; DataFrame, for example, includes the unlabeled point (-1, -0.99).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:]&lt;/span&gt;
&lt;span class="n"&gt;princomp1&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;
&lt;span class="n"&gt;princomp2&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For this point, &lt;strong&gt;find_hits&lt;/strong&gt; tallies &lt;strong&gt;9&lt;/strong&gt; hits for &lt;strong&gt;Outcome 0&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:],&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;find_hits&lt;/strong&gt; drives the &lt;strong&gt;classify_data&lt;/strong&gt; function, which labels the class based on hits for each class.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;classify_data&lt;/strong&gt; returns &lt;strong&gt;NaN&lt;/strong&gt; (plotted as &lt;strong&gt;Ambiguous&lt;/strong&gt;) for any &lt;strong&gt;test&lt;/strong&gt; data that lies in an &lt;strong&gt;Ambiguous region&lt;/strong&gt;: a region with overlapping classes, or a region with no class at all.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# find the hits&lt;/span&gt;
    &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;find_hits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# add the columns&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class0_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class1_hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ID ambiguous, class 0 and class 1 data&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class1_hits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;class0_hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
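&lt;p&gt;Stripped of the DataFrame plumbing, the per-point decision rule in &lt;strong&gt;classify_data&lt;/strong&gt; reduces to a few lines.  The following standalone sketch (a hypothetical &lt;strong&gt;decide&lt;/strong&gt; helper, not part of the original code) captures it.&lt;/p&gt;

```python
import numpy as np

# Decision rule from classify_data, stripped to its core: a point earns a
# class label only when it has hits for that class and none for the other.
def decide(class0_hits, class1_hits):
    if class0_hits > 0 and class1_hits == 0:
        return 0
    if class1_hits > 0 and class0_hits == 0:
        return 1
    return np.nan  # ambiguous: overlapping footprints, or no footprint at all

labels = [decide(9, 0), decide(0, 3), decide(4, 2), decide(0, 0)]
```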

&lt;p&gt;Pass our &lt;strong&gt;test&lt;/strong&gt; DataFrame to  &lt;strong&gt;classify_data&lt;/strong&gt; and store the results.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;class_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A quick peek shows mostly &lt;strong&gt;Ambiguous&lt;/strong&gt; classification for the first and last five observations in our &lt;strong&gt;test&lt;/strong&gt; DataFrame. &lt;/p&gt;
&lt;p&gt;&lt;img alt="The calculated hits" src="https://john.soban.ski/images/Rce_Python_Part_1/20_Hits_Grid.png"&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;Seaborn&lt;/strong&gt; scatterplot maps our entire grid.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;class_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following graphic captures the footprint of each class: &lt;strong&gt;Purple&lt;/strong&gt; for &lt;strong&gt;Outcome 1&lt;/strong&gt; (Diabetes), &lt;strong&gt;Pink&lt;/strong&gt; for &lt;strong&gt;Outcome 0&lt;/strong&gt; (No Diabetes), and &lt;strong&gt;Gray&lt;/strong&gt; for &lt;strong&gt;Ambiguous&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The RCE 2d decision boundaries" src="https://john.soban.ski/images/Rce_Python_Part_1/21_2d_Rce.png"&gt;&lt;/p&gt;
&lt;h2&gt;Evaluate RCE&lt;/h2&gt;
&lt;p&gt;Our Pima &lt;strong&gt;test&lt;/strong&gt; DataFrame includes labels, which we use to &lt;strong&gt;evaluate&lt;/strong&gt; the model.&lt;/p&gt;
&lt;p&gt;To prepare the &lt;strong&gt;test&lt;/strong&gt; DataFrame for classification, we normalize it and then project it with the &lt;strong&gt;pca&lt;/strong&gt; object we already fit on the &lt;strong&gt;train&lt;/strong&gt; features.  (Re-fitting PCA on the &lt;strong&gt;test&lt;/strong&gt; features would produce a different projection than the one the model trained on.)&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Re-use the PCA object fit on the train features&lt;/span&gt;
&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pass this &lt;strong&gt;test_df&lt;/strong&gt; to &lt;strong&gt;classify_data&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classify_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Seaborn&lt;/strong&gt; provides a method to depict a &lt;a href="https://en.wikipedia.org/wiki/Confusion_matrix"&gt;confusion matrix&lt;/a&gt;.  We attach the known &lt;strong&gt;test&lt;/strong&gt; labels to the &lt;strong&gt;test&lt;/strong&gt; DataFrame for scoring.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;confusion_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                               &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;classification&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                               &lt;span class="n"&gt;rownames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Actual&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                               &lt;span class="n"&gt;colnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Predicted&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following graphic captures the &lt;strong&gt;confusion matrix&lt;/strong&gt; for our two-Principal-Component &lt;strong&gt;test&lt;/strong&gt; DataFrame.&lt;/p&gt;
&lt;p&gt;&lt;img alt="The Confusion Matrix for two Principle Components" src="https://john.soban.ski/images/Rce_Python_Part_1/22_2d_Confuse.png"&gt;&lt;/p&gt;
&lt;p&gt;An &lt;a href="https://en.wikipedia.org/wiki/F1_score"&gt;F1 Score&lt;/a&gt; provides a useful success metric.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unambiguous_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ambiguity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;f1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;unambiguous_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classification&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;f1_score&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s2"&gt;&amp;quot;ambiguity&amp;quot;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ambiguity&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our RCE algorithm trained a model with an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;calc_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;f1_score&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.42424242424242425&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="s1"&gt;&amp;#39;ambiguity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2662337662337662&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
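&lt;p&gt;The &lt;strong&gt;ambiguity&lt;/strong&gt; figure is simply the fraction of &lt;strong&gt;test&lt;/strong&gt; rows that fall outside every &lt;strong&gt;hit footprint&lt;/strong&gt; and therefore drop out during &lt;strong&gt;dropna&lt;/strong&gt;.  A quick sanity check of the arithmetic, with hypothetical row counts chosen only to match the reported value:&lt;/p&gt;

```python
# Hypothetical counts chosen to reproduce the reported ambiguity of ~26.6%.
total_rows = 154       # rows in the test DataFrame
classified_rows = 113  # rows that survive dropna()

ambiguity = (total_rows - classified_rows) / total_rows
print(round(ambiguity, 4))  # 0.2662
```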

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this blog post we developed an exemplar RCE neural net classifier from scratch.  Our initial attempt yielded a model with an F1 Score of &lt;strong&gt;0.42&lt;/strong&gt; and ambiguity of &lt;strong&gt;26.6%&lt;/strong&gt;.  &lt;a href="https://john.soban.ski/rce-python-part-2.html"&gt;Next month&lt;/a&gt;, we will tune hyperparameters in order to improve model success and reduce ambiguity.  We will investigate the number of principal components and tune &lt;strong&gt;r&lt;/strong&gt;.  &lt;strong&gt;r&lt;/strong&gt; indicates the maximum value for &lt;strong&gt;Lambda&lt;/strong&gt; and puts an upper limit on the maximum size of each circle that represents a given &lt;strong&gt;hit footprint&lt;/strong&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Juice Your In-Situ Machine Learning with BigQuery AutoML</title><link href="https://john.soban.ski/bigquery-automl.html" rel="alternate"></link><published>2021-06-26T01:23:00-04:00</published><updated>2021-06-26T01:23:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-06-26:/bigquery-automl.html</id><summary type="html">&lt;p&gt;Data Scientists need skill and experience to create useful Machine Learning (ML) models.  ML activities include tool selection, training logistic decisions (move data to training vs. train in-situ), data acquisition, data cleaning, data quality checks, feature engineering, algorithm selection and hyperparameter tuning.&lt;/p&gt;
&lt;p&gt;Algorithm selection and hyperparameter tuning drive tedious manual …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Data Scientists need skill and experience to create useful Machine Learning (ML) models.  ML activities include tool selection, training logistic decisions (move data to training vs. train in-situ), data acquisition, data cleaning, data quality checks, feature engineering, algorithm selection and hyperparameter tuning.&lt;/p&gt;
&lt;p&gt;Algorithm selection and hyperparameter tuning drive tedious manual processes where the Data Scientist must flip a switch, turn a knob, train the model and then cross her fingers.  The Google Cloud Platform (GCP) BigQuery Machine Learning (BQML) service provides two wins for Data Scientists:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The ability to train models on data in place (a must for petabyte-scale &lt;strong&gt;Big Data&lt;/strong&gt; sets)&lt;/li&gt;
&lt;li&gt;The ability to execute AutoML in place&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This blog post demonstrates how to execute AutoML in-situ via the &lt;a href="https://cloud.google.com/bigquery-ml/docs"&gt;GCP BQML&lt;/a&gt; service.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;p&gt;In previous blog posts I used several platforms to train models on tabular data: TensorFlow, Google Cloud Platform (GCP) AutoML Tables, and BigQuery BQML.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 1 - Data Exploration &amp;amp; First Models)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 2 - Dimensionality Reduction)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Fast and Easy ML Optimization with GCP AutoML Tables (Beta)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;Don't Move Your Data! In-Situ Machine Learning via BigQuery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The blog posts above capture a variety of model training approaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hand crafted models using Python Objects (Tensorflow)&lt;/li&gt;
&lt;li&gt;AutoML using a GCP Application Programming Interface (API) (AutoML Tables)&lt;/li&gt;
&lt;li&gt;Hand crafted models using SQL commands (BigQuery)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This blog post demonstrates how to use the BigQuery BQML service to trigger AutoML workflows in-situ.  The AutoML service runs through a variety of ML Algorithms and iterates through a range of hyperparameter settings for each algorithm.  The service then keeps and serves the &lt;strong&gt;winning&lt;/strong&gt; approach.&lt;/p&gt;
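&lt;p&gt;Conceptually, the service runs a model-selection loop: fit each candidate, score it on held-out data, and keep the best performer.  The sketch below shows that idea in miniature with a toy one-feature ridge regressor (the data, the candidate &lt;strong&gt;alpha&lt;/strong&gt; values, and the scoring are all hypothetical; BQML's real search space of algorithms and hyperparameters is far richer):&lt;/p&gt;

```python
import random

random.seed(0)

# Toy dataset: a single feature with a linear relationship plus noise.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [0.5 * x + random.gauss(0, 0.3) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit_ridge(x, y, alpha):
    """Closed-form one-feature ridge regression (slope only, no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def mse(slope, x, y):
    return sum((b - slope * a) ** 2 for a, b in zip(x, y)) / len(x)

# The AutoML idea in miniature: sweep the hyperparameter, keep the winner.
candidates = {alpha: fit_ridge(train_x, train_y, alpha)
              for alpha in (0.01, 1.0, 100.0, 10000.0)}
best_alpha = min(candidates, key=lambda a: mse(candidates[a], test_x, test_y))
print(best_alpha)
```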
&lt;h2&gt;AutoML Regressor&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;Last month&lt;/a&gt; we used SQL syntax to command BigQuery to train a linear regression model in-situ. Open &lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;that blog&lt;/a&gt; post in a new tab to review the steps required to train models in BigQuery.&lt;/p&gt;
&lt;p&gt;We used the following SQL statement to train a linear regression model, with the &lt;strong&gt;model_type&lt;/strong&gt; set to &lt;strong&gt;LINEAR_REG&lt;/strong&gt; in the SQL &lt;strong&gt;OPTIONS&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LINEAR_REG&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our model kept the default parameters for the training algorithm.&lt;/p&gt;
&lt;p&gt;We can improve model performance through &lt;strong&gt;hyperparameter&lt;/strong&gt; tuning.  In the old days, we needed to tune these parameters by hand.  &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;GCP, however, provides AutoML services (e.g. AutoML Tables)&lt;/a&gt; to automatically tune these parameters.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;BigQuery now provides a Beta service to execute in-situ AutoML.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;To use BigQuery AutoML, simply set your SQL OPTIONS to &lt;strong&gt;AUTOML_REGRESSOR&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Automl Query" src="https://john.soban.ski/images/Bigquery_Automl/19_Automl_Query.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: We direct BigQuery to save the new model under the name &lt;strong&gt;automl_model&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;automl_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;AUTOML_REGRESSOR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;BigQuery AutoML iterates through many hyperparameter scenarios, each of which investigates the effects of choices related to learning rate, regularization and optimizers.  You will notice that AutoML consumes a much larger portion of &lt;strong&gt;wall clock&lt;/strong&gt; time in comparison to our single Linear Regression model above.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Pic" src="https://john.soban.ski/images/Bigquery_Automl/20_Training.png"&gt;&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Execution Details&lt;/strong&gt; to get more status information.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Pic 2" src="https://john.soban.ski/images/Bigquery_Automl/21_Training_2.png"&gt;&lt;/p&gt;
&lt;p&gt;Upon completion, BigQuery stores our new &lt;strong&gt;automl_model&lt;/strong&gt; in the &lt;strong&gt;wine_dataset&lt;/strong&gt; Dataset, which lives in the &lt;strong&gt;shining-chain&lt;/strong&gt; project.&lt;/p&gt;
&lt;p&gt;The AutoML process completes in about fifty (50) minutes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="AutoML Done" src="https://john.soban.ski/images/Bigquery_Automl/22_AutoML_Done.png"&gt;&lt;/p&gt;
&lt;h2&gt;AutoML Regressor Results&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;results&lt;/strong&gt; tab reports a reduction in Mean Square Error (MSE), compared to the prior Linear Regression model that used default parameters.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Automl Results" src="https://john.soban.ski/images/Bigquery_Automl/23_Automl_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;The MSE maps to a Root Mean Square Error (RMSE) of about 0.6393.&lt;/p&gt;
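&lt;p&gt;The conversion is just a square root.  Assuming the results tab reports an MSE of roughly 0.4087 (the hypothetical value implied by the quoted RMSE), the arithmetic works out as follows:&lt;/p&gt;

```python
import math

mse = 0.4087  # approximate MSE implied by the reported RMSE (hypothetical)
rmse = math.sqrt(mse)
print(round(rmse, 4))  # 0.6393
```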
&lt;p&gt;Several months ago &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;we used TensorFlow and Google Cloud Platform AutoML to train several models on the Wine Quality Dataset&lt;/a&gt; and compare the results.  In January, we used &lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;BigQuery Linear Regression with default Hyperparameters&lt;/a&gt; to train the Wine Quality Dataset.&lt;/p&gt;
&lt;p&gt;Let's compare the RMSE of &lt;strong&gt;BQML's AUTOML_REGRESSOR&lt;/strong&gt; (0.6393) against these prior experiments.&lt;/p&gt;
&lt;p&gt;The following table captures the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;AutoML&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.639&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
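&lt;p&gt;As a quick check on the ranking, we can sort the (Platform, Approach, RMSE) triples from the table above by RMSE and confirm where BigQuery AutoML lands:&lt;/p&gt;

```python
# (platform, approach, rmse) triples copied from the results table above.
results = [
    ("GCP", "AutoML Tables", 0.598),
    ("TensorFlow", "Linear Model (7 dims)", 0.633),
    ("BigQuery", "AutoML", 0.639),
    ("TensorFlow", "DNN (7 dims)", 0.645),
    ("TensorFlow", "DNN (11 dims)", 0.648),
    ("BigQuery", "Linear (11 dims)", 0.661),
    ("TensorFlow", "Linear Model (11 dims)", 0.706),
    ("TensorFlow", "Linear Model (2 dims)", 0.735),
    ("Pandas", "Guess Mean", 0.801),
]

# Rank ascending by RMSE (lower is better).
ranked = sorted(results, key=lambda row: row[2])
print(ranked.index(("BigQuery", "AutoML", 0.639)) + 1)  # 3
```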
&lt;p&gt;BigQuery AutoML under-performs compared to &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;GCP AutoML Tables&lt;/a&gt; and a &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;dimensionality reduced TensorFlow model&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Serve Model&lt;/h2&gt;
&lt;p&gt;After training, BigQuery saves and serves the new model in place.&lt;/p&gt;
&lt;p&gt;We use &lt;strong&gt;SQL&lt;/strong&gt; to query the served model.  In the BigQuery console, click &lt;strong&gt;QUERY MODEL&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Query Model" src="https://john.soban.ski/images/Bigquery_Automl/24_Query_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The following SQL command pulls the first record out of the Wine Quality data set and then sets the &lt;strong&gt;alcohol&lt;/strong&gt; parameter to 80%.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The above &lt;strong&gt;QUERY&lt;/strong&gt; returns the following &lt;strong&gt;JSON&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;alcohol&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;80&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;chlorides&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.074&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;citric_acid&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.66&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;density&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.0008&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;fixed_acidity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;11.6&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;free_sulfur_dioxide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;10.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;ph&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;3.25&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;residual_sugar&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;2.2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;sulphates&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.57&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;total_sulfur_dioxide&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;47.0&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;volatile_acidity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;0.58&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following screengrab captures the console view of this &lt;strong&gt;QUERY&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Predict Data" src="https://john.soban.ski/images/Bigquery_Automl/25_Predict_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;SQL QUERY&lt;/strong&gt; below pulls and modifies the first record from the Wine Quality data set and then pipes it to the &lt;strong&gt;automl_model&lt;/strong&gt; we trained via &lt;strong&gt;AUTOML_REGRESSOR&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;At a high level, we &lt;strong&gt;SELECT&lt;/strong&gt; the predicted score of a wine with 80% alcohol &lt;strong&gt;FROM&lt;/strong&gt; our model:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;predicted_label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ML&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PREDICT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;automl_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The console returns the predicted &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Predicted API" src="https://john.soban.ski/images/Bigquery_Automl/26_Predicted_API.png"&gt;&lt;/p&gt;
&lt;p&gt;The model predicts a &lt;strong&gt;quality&lt;/strong&gt; score (taste) of 1.7 out of 10 for a wine with 80% alcohol, which strikes me as a reasonable prediction for such an extreme input.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;&amp;quot;predicted_label&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.7691493034362793&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
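&lt;p&gt;As a quick sanity check, we can parse the JSON response above with a few lines of Python.  Note that BigQuery returns &lt;strong&gt;predicted_label&lt;/strong&gt; as a string, so we cast it to a float (the snippet below is illustrative, not part of the BQML workflow).&lt;/p&gt;

```python
import json

# The JSON payload the BigQuery console returned (copied from above)
response = '[{"predicted_label": "1.7691493034362793"}]'

rows = json.loads(response)
quality = float(rows[0]['predicted_label'])  # cast the string label to a float
print(round(quality, 2))  # 1.77
```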

&lt;h2&gt;Boosted Tree&lt;/h2&gt;
&lt;p&gt;For fun, let's look at the performance of an &lt;a href="https://en.wikipedia.org/wiki/Ensemble_learning"&gt;ensemble method&lt;/a&gt;.  &lt;/p&gt;
&lt;p&gt;BQML provides a &lt;strong&gt;BOOSTED_TREE_REGRESSOR&lt;/strong&gt;, which we select via &lt;strong&gt;SQL OPTIONS&lt;/strong&gt;.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boost_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BOOSTED_TREE_REGRESSOR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model takes six minutes to train.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Model" src="https://john.soban.ski/images/Bigquery_Automl/27_Boost_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The model results in an MSE of 0.3419, with an RMSE of 0.5847.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Results" src="https://john.soban.ski/images/Bigquery_Automl/28_Boost_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;BQML BOOSTED_TREE_REGRESSOR&lt;/strong&gt; bests &lt;strong&gt;GCP AutoML Tables&lt;/strong&gt; and lands in first place!&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Boosted Tree&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.585&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;AutoML&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.639&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Dimensionality Reduced BQML&lt;/h2&gt;
&lt;p&gt;Too many features can drive over-fitting, which in turn increases RMSE.&lt;/p&gt;
&lt;p&gt;In a past blog post, we demonstrated that &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;dimensionality reduction through Principal Component Analysis (PCA) reduces over-fitting and RMSE&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The last part of this blog post feeds a dimensionality-reduced &lt;strong&gt;Wine Quality Dataset&lt;/strong&gt; to the &lt;strong&gt;BQML BOOSTED_TREE_REGRESSOR&lt;/strong&gt; algorithm.&lt;/p&gt;
&lt;p&gt;We will briefly run through the steps to apply PCA to the &lt;strong&gt;Wine Quality Dataset.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First, import the necessary Python libraries, then pull the data from the University of California, Irvine (UCI) Machine Learning Repository and load it into a &lt;strong&gt;Pandas DataFrame&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fixed_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;volatile_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;citric_acid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;residual_sugar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;chlorides&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;free_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;total_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;density&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;ph&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;sulphates&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;alcohol&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;quality&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;wine_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; 
                      &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
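&lt;p&gt;The UCI file separates fields with semicolons instead of commas, which explains the &lt;strong&gt;sep&lt;/strong&gt; argument above.  The offline sketch below (with made-up rows) demonstrates the parse:&lt;/p&gt;

```python
import io

import pandas as pd

# Two made-up rows in the semicolon-delimited UCI layout (illustrative values)
csv_text = 'fixed_acidity;volatile_acidity;quality\n7.4;0.70;5\n7.8;0.88;5\n'
df = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df.shape)          # (2, 3)
print(list(df.columns))  # ['fixed_acidity', 'volatile_acidity', 'quality']
```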

&lt;p&gt;Next, split the DataFrame into a &lt;strong&gt;features&lt;/strong&gt; DataFrame and a &lt;strong&gt;label&lt;/strong&gt; Series.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wine_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;wine_labels_series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;quality&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;TensorFlow allows us to create a normalization engine for our &lt;strong&gt;features&lt;/strong&gt; Dataframe.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;

&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
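&lt;p&gt;The &lt;strong&gt;Normalization&lt;/strong&gt; layer applies a per-feature z-score: it subtracts each column's mean and divides by its standard deviation.  The NumPy sketch below reproduces the idea on a toy array, without TensorFlow:&lt;/p&gt;

```python
import numpy as np

features = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])

# Per-column z-score: subtract the mean, divide by the (population) std
normalized = (features - features.mean(axis=0)) / features.std(axis=0)

print(normalized.mean(axis=0))  # approximately [0. 0.]
print(normalized.std(axis=0))   # approximately [1. 1.]
```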

&lt;p&gt;&lt;strong&gt;Sklearn&lt;/strong&gt; provides a PCA engine.  We pipe the &lt;strong&gt;features&lt;/strong&gt; Dataframe to the normalization engine and then to the PCA engine, and request the first seven principal components.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp3&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp4&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp5&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp6&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="s1"&gt;&amp;#39;princomp7&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                               &lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wine_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
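&lt;p&gt;To see how much signal the seven components retain, inspect &lt;strong&gt;explained_variance_ratio_&lt;/strong&gt; after the fit.  The toy sketch below (random data, purely illustrative) shows the mechanics:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
toy_features = rng.normal(size=(100, 11))  # 100 rows, 11 features, like the wine set

pca = PCA(n_components=7)
reduced = pca.fit_transform(toy_features)
retained = pca.explained_variance_ratio_.sum()  # fraction of total variance kept

print(reduced.shape)  # (100, 7)
print(round(float(retained), 3))
```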

&lt;p&gt;We then assign the labels back onto the normalized, dimensionality-reduced DataFrame and save it as a Comma Separated Value (CSV) file.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_wine_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;wine_labels_series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca_wine_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;pca_wine_df.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The following output records the first ten lines of the CSV:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;princomp1,princomp2,princomp3,princomp4,princomp5,princomp6,princomp7,quality
-1.6195179763728917,0.4509726853737244,-1.7744612972264329,0.04374371633307103,0.0670156612925275,-0.9139218906561226,-0.1610332757006941,5.0
-0.7991612763626295,1.856561351520203,-0.9116754264684358,0.5480739949151788,-0.01839571398165714,0.929709236000912,-1.0098350218068104,5.0
-0.7484768531031628,0.8820469715571214,-1.1713842697588999,0.41102911926788793,-0.043535655196972736,0.40147666614026,-0.539553150939102,5.0
2.357677805002114,-0.269982797056245,0.24348912259870834,-0.9284469679531109,-1.4991502738904028,-0.13102232409979334,0.34428774245741034,6.0
-1.6195179763728917,0.4509726853737244,-1.7744612972264329,0.04374371633307103,0.0670156612925275,-0.9139218906561226,-0.1610332757006941,5.0
-1.583695657944522,0.5692157167619253,-1.5382922454632044,0.02375291374041369,-0.11007403103710495,-0.993628380100469,-0.10964916626917803,5.0
-1.1014601399705353,0.6080257636816654,-1.0759111926105813,-0.343950360931988,-1.1333873126112808,0.1750035123630711,0.26101076781023663,5.0
-2.2487136084125905,-0.4168236213171013,-0.9868407617813321,-0.0011977615992119378,-0.7804374122971008,0.2860584721236257,0.1314469559051279,7.0
-1.0868804709342004,-0.3085531414570113,-1.5181578596509828,0.003318878620501723,-0.22672738691574854,-0.5126291605993216,0.2496169604878968,7.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We &lt;a href="https://john.soban.ski/bigquery-ml.html"&gt;upload the CSV into BigQuery using the console&lt;/a&gt; and execute the following SQL to train a &lt;strong&gt;BOOSTED_TREE_REGRESSOR&lt;/strong&gt; model on the dimensionality-reduced dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_wine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_boost_model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;BOOSTED_TREE_REGRESSOR&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;princomp7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_wine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pca_wine&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model takes six minutes to train.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Model on PCA Data" src="https://john.soban.ski/images/Bigquery_Automl/29_Pca_Boost.png"&gt;&lt;/p&gt;
&lt;p&gt;Click the Evaluation tab to find an &lt;strong&gt;MSE&lt;/strong&gt; of &lt;strong&gt;0.3771&lt;/strong&gt;, which maps to an &lt;strong&gt;RMSE&lt;/strong&gt; of &lt;strong&gt;0.6140&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boost Model on PCA Evaluation" src="https://john.soban.ski/images/Bigquery_Automl/30_Pca_Eval.png"&gt;&lt;/p&gt;
&lt;p&gt;The dimensionality-reduced dataset proves less accurate than the full-featured dataset and lands in third place.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Boosted Tree&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.585&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Boosted Tree&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.614&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;AutoML&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.639&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.661&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;TensorFlow&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data Scientists have a plethora of tools and approaches to train models.  BigQuery provides in-situ Machine Learning and in-situ AutoML.  This blog post compared the BQML &lt;strong&gt;AUTOML_REGRESSOR&lt;/strong&gt; algorithm against the &lt;strong&gt;BOOSTED_TREE_REGRESSOR&lt;/strong&gt;, for both the complete and a dimensionality-reduced dataset.&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Neural Networks"></category><category term="Machine Learning"></category></entry><entry><title>Did Thoreau inspire the Unabomber? We use AI to Find Out!</title><link href="https://john.soban.ski/thoreau-vs-unabomber.html" rel="alternate"></link><published>2021-05-30T03:19:00-04:00</published><updated>2021-05-30T03:19:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-05-30:/thoreau-vs-unabomber.html</id><summary type="html">&lt;h2&gt;&lt;strong&gt;Good Vs. Evil&lt;/strong&gt; - Two Opposing Paths Taken by a Similar Genius&lt;/h2&gt;
&lt;p&gt;This blog post provides a comparison between Henry David Thoreau's &lt;strong&gt;Walden&lt;/strong&gt; and Ted Kaczynski's &lt;strong&gt;Unabomber Manifesto.&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;img alt="Walden Book" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/01_Walden_Thoreau.png"&gt;&lt;/p&gt;
&lt;p&gt;To compare these two works, I use both a modern Natural Language Processing (NLP) Artificial Intelligence (AI) tool and traditional literary analysis.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Sketch" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/02_Unabomber_Sketch.png"&gt;&lt;/p&gt;
&lt;p&gt;The …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;&lt;strong&gt;Good Vs. Evil&lt;/strong&gt; - Two Opposing paths Taken by a Similar Genius&lt;/h2&gt;
&lt;p&gt;This blog post provides a comparison between Henry David Thoreau's &lt;strong&gt;Walden&lt;/strong&gt; and Ted Kaczynski's &lt;strong&gt;Unabomber Manifesto.&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;img alt="Walden Book" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/01_Walden_Thoreau.png"&gt;&lt;/p&gt;
&lt;p&gt;To compare these two works, I use both a modern Natural Language Processing (NLP) Artificial Intelligence (AI) tool and traditional literary analysis.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Sketch" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/02_Unabomber_Sketch.png"&gt;&lt;/p&gt;
&lt;p&gt;The Google Cloud Natural Language Application Programming Interface (API) quantifies the authors' sentiment (positive or negative) and intensity of emotion while traditional Literary Analysis compares and contrasts themes.  &lt;/p&gt;
&lt;h2&gt;Google Cloud Natural Language Analysis&lt;/h2&gt;
&lt;p&gt;The &lt;a href="https://cloud.google.com/natural-language/docs"&gt;Google Cloud Natural Language API&lt;/a&gt; allows developers to use Google's advanced, massive and validated language model to infer sentiment, extract entities and classify documents.  I will use the API to infer sentiment from the two texts, and then compare the results.  Sentiment analysis provides quantifiable metrics (hard numbers) which drive mathematical comparisons.&lt;/p&gt;
&lt;h3&gt;Process Text&lt;/h3&gt;
&lt;p&gt;I use the Google API to infer sentiment (score) and intensity (magnitude).&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://cloud.google.com/natural-language/docs/basics#interpreting_sentiment_analysis_values"&gt;Google Cloud Natural Language API documentation&lt;/a&gt; defines &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Score&lt;ul&gt;
&lt;li&gt;Indicates the overall emotion of a document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Magnitude&lt;ul&gt;
&lt;li&gt;Indicates how much emotional content is present within the document, and this value is often proportional to the length of the document&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
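&lt;p&gt;To build intuition, the sketch below combines &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt; into a human-readable label.  The thresholds are illustrative, not values from the Google documentation:&lt;/p&gt;

```python
def label_sentiment(score, magnitude, threshold=0.25):
    # Thresholds are illustrative, not official Google guidance
    if score >= threshold:
        mood = 'positive'
    elif -score >= threshold:
        mood = 'negative'
    else:
        mood = 'neutral'
    strength = 'strong' if magnitude >= 2.0 else 'mild'
    return f'{strength} {mood}'

print(label_sentiment(0.6, 3.2))   # strong positive
print(label_sentiment(-0.4, 0.5))  # mild negative
print(label_sentiment(0.0, 0.1))   # mild neutral
```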
&lt;p&gt;I use the following script to split each text into individual paragraphs, send each paragraph to the API, and then record the results.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/usr/bin/env python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;

&lt;span class="c1"&gt;# Configure book name here &lt;/span&gt;
&lt;span class="c1"&gt;#FILENAME = &amp;#39;walden.txt&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;FILENAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;unabomber.txt&amp;#39;&lt;/span&gt;

&lt;span class="n"&gt;sentiment_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiates a client&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LanguageServiceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;language_v1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PLAIN_TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;document&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document_sentiment&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;magnitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;score&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;magnitude&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; \
                    &lt;span class="s1"&gt;&amp;#39;text&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;ERROR: &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sentiment_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;_sentiment.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I use the Google Cloud Natural Language API Python client library, although &lt;a href="https://requests.readthedocs.io/en/latest/"&gt;requests&lt;/a&gt; also works.  &lt;strong&gt;Walden&lt;/strong&gt; includes some Latin, which &lt;strong&gt;breaks&lt;/strong&gt; the service.  To mitigate the &lt;strong&gt;error&lt;/strong&gt;, I use &lt;strong&gt;try/except&lt;/strong&gt; logic.  The &lt;strong&gt;strip()&lt;/strong&gt; calls remove blank lines from the analysis.&lt;/p&gt;
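&lt;p&gt;For reference, a minimal sketch of the same call over REST with &lt;strong&gt;requests&lt;/strong&gt;.  The endpoint and payload shape follow the public Natural Language v1 API; the &lt;strong&gt;API_KEY&lt;/strong&gt; value is a placeholder, and the helper names are illustrative:&lt;/p&gt;

```python
import requests

API_KEY = 'YOUR_API_KEY'  # placeholder; supply a real key to call the service
URL = 'https://language.googleapis.com/v1/documents:analyzeSentiment'


def build_payload(text):
    # Mirror the client library's Document(content=..., type_=PLAIN_TEXT)
    return {'document': {'type': 'PLAIN_TEXT', 'content': text}}


def analyze_sentiment(text):
    resp = requests.post(URL, params={'key': API_KEY}, json=build_payload(text))
    resp.raise_for_status()
    return resp.json()['documentSentiment']  # dict with 'score' and 'magnitude'
```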
&lt;p&gt;Since both texts include a wide variety of characters, I store the results in a &lt;a href="https://pandas.pydata.org/"&gt;Pandas&lt;/a&gt; dataframe.  The &lt;strong&gt;to_csv()&lt;/strong&gt; method will escape all of the characters that disturb the Comma Separated Values (CSV) encoded output.&lt;/p&gt;
&lt;p&gt;The script outputs files named &lt;strong&gt;unabomber_sentiment.csv&lt;/strong&gt; and &lt;strong&gt;walden_sentiment.csv&lt;/strong&gt;; each row includes a score, a magnitude, and the source paragraph text.&lt;/p&gt;
&lt;p&gt;The following snippet shows several lines of the &lt;strong&gt;Walden&lt;/strong&gt; output:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;magnitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="mf"&gt;0.20000000298023224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.699999988079071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;When I wrote the following pages, or rather the bulk of them, I lived alone, in the woods, a mile from any neighbor, in a house which I had built myself, on the shore of Walden Pond, in Concord, Massachusetts, and earned my living by the labor of my hands only. I lived there two years and two months. At present I am a sojourner in civilized life again.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;4.900000095367432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I should not obtrude my affairs so much on the notice of my readers if very particular inquiries had not been made by my townsmen concerning my mode of life, which some would call impertinent, though they do not appear to me at all impertinent, but, considering the circumstances, very natural and pertinent. Some have asked what I got to eat; if I did not feel lonesome; if I was not afraid; and the like. Others have been curious to learn what portion of my income I devoted to charitable purposes; and some, who have large families, how many poor children I maintained. I will therefore ask those of my readers who feel no particular interest in me to pardon me if I undertake to answer some of these questions in this book. In most books, the I, or first person, is omitted; in this it will be retained; that, in respect to egotism, is the main difference. We commonly do not remember that it is, after all, always the first person that is speaking. I should not talk so much about myself if there were anybody else whom I knew as well. Unfortunately, I am confined to this theme by the narrowness of my experience. Moreover, I, on my side, require of every writer, first or last, a simple and sincere account of his own life, and not merely what he has heard of other men’s lives; some such account as he would send to his kindred from a distant land; for if he has lived sincerely, it must have been in a distant land to me. Perhaps these pages are more particularly addressed to poor students. As for the rest of my readers, they will accept such portions as apply to them. I trust that none will stretch the seams in putting on the coat, for it may do good service to him whom it fits.&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I would fain say something, not so much concerning the Chinese and Sandwich Islanders as you who read these pages, who are said to live in New England; something about your condition, especially your outward condition or circumstances in this world, in this town, what it is, whether it is necessary that it be as bad as it is, whether it cannot be improved as well as not. I have travelled a good deal in Concord; and everywhere, in shops, and offices, and fields, the inhabitants have appeared to me to be doing penance in a thousand remarkable ways. What I have heard of Brahmins sitting exposed to four fires and looking in the face of the sun; or hanging suspended, with their heads downward, over flames; or looking at the heavens over their shoulders “until it becomes impossible for them to resume their natural position, while from the twist of the neck nothing but liquids can pass into the stomach;” or dwelling, chained for life, at the foot of a tree; or measuring with their bodies, like caterpillars, the breadth of vast empires; or standing on one leg on the tops of pillars,—even these forms of conscious penance are hardly more incredible and astonishing than the scenes which I daily witness. The twelve labors of Hercules were trifling in comparison with those which my neighbors have undertaken; for they were only twelve, and had an end; but I could never see that these men slew or captured any monster or finished any labor. They have no friend Iolas to burn with a hot iron the root of the hydra’s head, but as soon as one head is crushed, two spring up.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.30000001192092896&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;3.200000047683716&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I see young men, my townsmen, whose misfortune it is to have inherited farms, houses, barns, cattle, and farming tools; for these are more easily acquired than got rid of. Better if they had been born in the open pasture and suckled by a wolf, that they might have seen with clearer eyes what field they were called to labor in. Who made them serfs of the soil? Why should they eat their sixty acres, when man is condemned to eat only his peck of dirt? Why should they begin digging their graves as soon as they are born? They have got to live a man’s life, pushing all these things before them, and get on as well as they can. How many a poor immortal soul have I met well nigh crushed and smothered under its load, creeping down the road of life, pushing before it a barn seventy-five feet by forty, its Augean stables never cleansed, and one hundred acres of land, tillage, mowing, pasture, and wood-lot! The portionless, who struggle with no such unnecessary inherited encumbrances, find it labor enough to subdue and cultivate a few cubic feet of flesh.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.9000000953674316&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;But men labor under a mistake. The better part of the man is soon plowed into the soil for compost. By a seeming fate, commonly called necessity, they are employed, as it says in an old book, laying up treasures which moth and rust will corrupt and thieves break through and steal. It is a fool’s life, as they will find when they get to the end of it, if not before. It is said that Deucalion and Pyrrha created men by throwing stones over their heads behind them:—&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Inde genus durum sumus, experiensque laborum,&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Et&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;documenta&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;damus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;quâ&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;simus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;origine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nati&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Or, as Raleigh rhymes it in his sonorous way,—&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.6000000238418579&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.6000000238418579&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;“From thence our kind hard-hearted is, enduring pain and care,&amp;quot;&lt;/span&gt;
&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.10000000149011612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Approving&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;our&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bodies&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;stony&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;nature&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;are&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.699999988079071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.699999988079071&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;So much for a blind obedience to a blundering oracle, throwing the stones over their heads behind them, and not seeing where they fell.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.4000000059604645&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;5.400000095367432&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Most men, even in this comparatively free country, through mere ignorance and mistake, are so occupied with the factitious cares and superfluously coarse labors of life that its finer fruits cannot be plucked by them. Their fingers, from excessive toil, are too clumsy and tremble too much for that. Actually, the laboring man has not leisure for a true integrity day by day; he cannot afford to sustain the manliest relations to men; his labor would be depreciated in the market. He has no time to be anything but a machine. How can he remember well his ignorance—which his growth requires—who has so often to use his knowledge? We should feed and clothe him gratuitously sometimes, and recruit him with our cordials, before we judge of him. The finest qualities of our nature, like the bloom on fruits, can be preserved only by the most delicate handling. Yet we do not treat ourselves nor one another thus tenderly.&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6000000238418579&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Some of you, we all know, are poor, find it hard to live, are sometimes, as it were, gasping for breath. I have no doubt that some of you who read this book are unable to pay for all the dinners which you have actually eaten, or for the coats and shoes which are fast wearing or are already worn out, and have come to this page to spend borrowed or stolen time, robbing your creditors of an hour. It is very evident what mean and sneaking lives many of you live, for my sight has been whetted by experience; always on the limits, trying to get into business and trying to get out of debt, a very ancient slough, called by the Latins æs alienum, another’s brass, for some of their coins were made of brass; still living, and dying, and buried by this other’s brass; always promising to pay, promising to pay, tomorrow, and dying today, insolvent; seeking to curry favor, to get custom, by how many modes, only not state-prison offences; lying, flattering, voting, contracting yourselves into a nutshell of civility or dilating into an atmosphere of thin and vaporous generosity, that you may persuade your neighbor to let you make his shoes, or his hat, or his coat, or his carriage, or import his groceries for him; making yourselves sick, that you may lay up something against a sick day, something to be tucked away in an old chest, or in a stocking behind the plastering, or, more safely, in the brick bank; no matter where, no matter how much or how little.&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Numerical Analysis&lt;/h3&gt;
&lt;p&gt;The output includes &lt;strong&gt;score&lt;/strong&gt;, &lt;strong&gt;magnitude&lt;/strong&gt; and &lt;strong&gt;text&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I load each CSV into Pandas for analysis, for example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;una_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;unabomber_sentiment.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As a quick sanity check, I used &lt;a href="https://numpy.org/"&gt;numpy&lt;/a&gt; to identify the most negative text from each author.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;una_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;una_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;una_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The first line of the &lt;a href="https://en.wikipedia.org/wiki/Caesar_and_Pompey"&gt;Chapman&lt;/a&gt; quote clocks in as the most negative (-0.9) Thoreau sentiment:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The false society of men—&lt;/p&gt;
&lt;p&gt;for earthly greatness&lt;/p&gt;
&lt;p&gt;All heavenly comforts rarefies to air. [Thoreau 141]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kaczynski includes three paragraphs tied for most negative (-0.6), so I selected the paragraph with the highest magnitude (8.7):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Oversocialization can lead to low self-esteem, a sense of powerlessness, defeatism, guilt, etc. One of the most important means by which our society socializes children is by making them feel ashamed of behavior or speech that is contrary to society's expectations. If this is overdone, or if a particular child is especially susceptible to such feelings, he ends by feeling ashamed of HIMSELF... In many oversocialized people this results in a sense of constraint and powerlessness that can be a severe hardship. We suggest that oversocialization is among the more serious cruelties that human beings inflict on one another. [Kaczynski 26]&lt;/p&gt;
&lt;/blockquote&gt;
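&lt;p&gt;The tie-break above reduces to two lines of Pandas: filter to the rows tied at the minimum score, then take the row with the largest magnitude.  A sketch with toy stand-in data:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for una_df: three rows tied at the minimum score of -0.6
una_df = pd.DataFrame({
    'score':     [-0.6, -0.6, -0.6, -0.1],
    'magnitude': [ 3.1,  8.7,  5.2,  0.4],
    'text':      ['first tie', 'strongest tie', 'third tie', 'mild'],
})

ties = una_df[una_df.score == una_df.score.min()]  # rows tied for most negative
winner = ties.loc[ties.magnitude.idxmax()]         # highest magnitude breaks the tie
print(winner.text)  # -> strongest tie
```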
&lt;p&gt;This quick sanity check suggests that the Google NLP model identifies sentiment correctly.&lt;/p&gt;
&lt;h3&gt;Graphical Analysis&lt;/h3&gt;
&lt;p&gt;Since each work includes hundreds of paragraphs, I use Data Visualization (Data Viz) in the form of a Histogram to summarize the output data.&lt;/p&gt;
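&lt;p&gt;A minimal sketch of how such a histogram could be generated with &lt;a href="https://matplotlib.org/"&gt;matplotlib&lt;/a&gt;.  The column names follow the CSV schema above; the function and file names are illustrative:&lt;/p&gt;

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt


def score_histogram(df, title, outfile):
    # Scores span [-1, 1]; zero marks a neutral sentiment
    fig, ax = plt.subplots()
    ax.hist(df.score, bins=20, range=(-1.0, 1.0))
    ax.set_xlabel('Sentiment score')
    ax.set_ylabel('Paragraph count')
    ax.set_title(title)
    fig.savefig(outfile)
    plt.close(fig)


# e.g. score_histogram(pd.read_csv('unabomber_sentiment.csv'),
#                      'Unabomber Manifesto', 'unabomber_hist.png')
```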
&lt;p&gt;The following Histogram records the sentiment of "The Unabomber Manifesto."  Note that the paragraphs skew negative.  &lt;/p&gt;
&lt;p&gt;Zero indicates a &lt;strong&gt;neutral&lt;/strong&gt; sentiment.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Score Histogram" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/03_Unabomber_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Compare Kaczynski’s Data Viz to Thoreau’s.  Thoreau's paragraphs provide a symmetrical Histogram, and most of the paragraphs land in the &lt;strong&gt;neutral&lt;/strong&gt; zone.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Thoreau Score Histogram" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/04_Walden_Hist.png"&gt;&lt;/p&gt;
&lt;p&gt;Remember that the Google API returns both &lt;strong&gt;score&lt;/strong&gt; and &lt;strong&gt;magnitude&lt;/strong&gt;.  We need to include the &lt;strong&gt;magnitude&lt;/strong&gt; data in the analysis to get a feel for the overall intensity of emotion.&lt;/p&gt;
&lt;p&gt;I use a bivariate density plot, which looks like a smooth sheet placed on top of a blocky, two dimensional histogram.  I use Kernel Density Estimation (KDE) to represent the frequency of the data in each &lt;strong&gt;bucket&lt;/strong&gt; in a continuous way (vs. the discrete &lt;strong&gt;count&lt;/strong&gt; found in histograms).  I follow the &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html"&gt;SciPy docs&lt;/a&gt; to make the chart.&lt;/p&gt;
&lt;p&gt;The colors represent the &lt;strong&gt;density&lt;/strong&gt;.  The darker the color, the more instances of a particular &lt;strong&gt;score/magnitude&lt;/strong&gt; pair.  The black dots represent the actual data points.&lt;/p&gt;
&lt;p&gt;Note that I multiply the Score by ten in order to make the Data Viz more readable.&lt;/p&gt;
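&lt;p&gt;A minimal sketch of the bivariate density plot, following the &lt;strong&gt;gaussian_kde&lt;/strong&gt; recipe in the SciPy docs.  The column names assume the CSV schema above, the score scaling matches the times-ten note, and the function name is illustrative:&lt;/p&gt;

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt


def density_plot(df, title, outfile):
    x = df.score.to_numpy() * 10           # scale score for readability
    y = df.magnitude.to_numpy()
    kde = gaussian_kde(np.vstack([x, y]))  # fit a 2-D kernel density estimate
    xi, yi = np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
    zi = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)
    fig, ax = plt.subplots()
    ax.pcolormesh(xi, yi, zi, shading='auto')  # shade by density
    ax.contour(xi, yi, zi)                     # contour lines over the sheet
    ax.scatter(x, y, s=4, color='black')       # the actual data points
    ax.set_xlabel('Score x 10')
    ax.set_ylabel('Magnitude')
    ax.set_title(title)
    fig.savefig(outfile)
    plt.close(fig)


# e.g. density_plot(pd.read_csv('walden_sentiment.csv'), 'Walden', 'walden_density.png')
```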
&lt;p&gt;&lt;img alt="Walden Bivariate Density Plot" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/05_Walden_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;Again, we see that Thoreau's text concentrates around a neutral tone, with higher magnitudes clustered around scores of zero.&lt;/p&gt;
&lt;p&gt;Contrast Thoreau’s Data Viz with Kaczynski’s.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Unabomber Bivariate Density Plot" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/06_Unabomber_Density.png"&gt;&lt;/p&gt;
&lt;p&gt;The near forty-five degree angle of the contour map (the blue, green and yellow oval) indicates strong correlation between sentiment and magnitude.  The more negative Kaczynski writes, the stronger his emotions.  &lt;/p&gt;
&lt;p&gt;Overall, however, most of his text lands in the &lt;strong&gt;medium-negative&lt;/strong&gt; sentiment range.&lt;/p&gt;
&lt;h2&gt;Literary Analysis&lt;/h2&gt;
&lt;p&gt;Ted Kaczynski and Henry David Thoreau (despite the separation of a century) lived lives of uncanny similarity. Both tackled their fields of expertise in unorthodox, pioneering, and peerless ways, and both graduated from Harvard. Both tried their hand at teaching, but eventually withdrew from the profession and instead became &lt;strong&gt;hermits&lt;/strong&gt;, living in modest shanties in the wilderness. During their seclusion from society they produced their most influential works: &lt;a href="https://www.gutenberg.org/files/205/205-h/205-h.htm"&gt;Walden&lt;/a&gt; and &lt;a href="https://search.brave.com/search?q=unabomber+manifesto&amp;amp;source=web"&gt;The Manifesto&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Common View of Technology and Over-Socialization&lt;/h3&gt;
&lt;p&gt;The first (and most prevalent) issue found in both works is the way technology and over-socialization diminish the human experience. The Unabomber opens his tirade with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Industrial Revolution and its consequences have been disastrous for the human race... [it has] reduce(d) human beings and many other living organisms to engineered products and mere cogs in the social machine [Kaczynski 1]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau did not live long enough to experience the far-reaching effects of our industrial infancy, but he still provides insight into the pitfalls of devoting one’s life to &lt;strong&gt;unnecessary&lt;/strong&gt; industrial labor. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The laboring man has not leisure for a true integrity day by day; he cannot afford to sustain the manliest relations to men; his labor would be depreciated in the market. He has no time to be anything but a machine [Thoreau1 491]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau’s lambaste against &lt;strong&gt;model farms&lt;/strong&gt; in &lt;strong&gt;Chapter IX: The Ponds&lt;/strong&gt; criticizes technology’s tendency to reduce men and animals to cogs in a machine. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A model farm! where the house stands like a fungus in a muck-heap, chambers for men, horses, oxen and swine, cleansed and uncleansed, all contiguous to one another! Stocked with men! A great grease spot, redolent of manures and buttermilk! under a high state of cultivation, being manured with the hearts and brains of men! As if you were to raise your potatoes in the churchyard! Such is a model farm! [Thoreau1 593]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kaczynski expresses the opinion that technology’s minimizing effect on an individual's importance in society instills a sense of worthlessness in the general populace by taking away their autonomy and power (which he groups together under his own term, &lt;strong&gt;The Power Process&lt;/strong&gt;). He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When an individual does not have opportunity to go throughout the power-process the consequences are boredom, demoralization, low self-esteem, inferiority feelings [sic], defeatism, depression, anxiety, guilt, frustration ...[etc.] [Kaczynski 44]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau gets the point of human demoralization across more elegantly, and without invented psychological jargon, when he writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The mass of men lead lives of quiet desperation... A stereotyped but unconscious despair is concealed even under what are called games and amusements of mankind. There is no play in them, for this comes after work [Thoreau 1492].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau later reveals that an individual should remove herself from the influence of technology and become one with nature in order to escape depression. In &lt;strong&gt;Chapter V: Solitude&lt;/strong&gt;, he writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There can be no very black melancholy to him who lives in the midst of nature and has his senses still... Nothing can rightly compel a simple and brave man to a vulgar sadness. While I enjoy the friendship of the seasons I trust that nothing can make a burden to me [Thoreau 1559]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Unabomber shares the same sentiment that modern man should go back to nature to escape ennui. Unlike Thoreau, however, he gives a scientific reason to explain our difficulty in achieving contentment when removed from nature.  He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We attribute the social and psychological problems of modern society to the fact that society requires people to live under conditions radically different from those under which the human race has evolved and to behave in ways that conflict with the patterns of behavior that the human race developed while living under the earlier conditions [Kaczynski 46]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Common View of Workaholic/Novelty Culture&lt;/h3&gt;
&lt;p&gt;Thoreau and Kaczynski both focus on the detrimental effects of our workaholic/consumer culture, in which workers put in long hours to pay for disposable novelty items and engage in meaningless activities for distraction. Kaczynski writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...even if they have a great deal of money, [modern workers] cannot satisfy their constant craving for the shiny new toys that the marketing industry dangles before their eyes.  So, they always feel hard pressed financially, even if their income is large, and their cravings are frustrated [Kaczynski 80]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau also attacks the consumer mentality on numerous occasions, stressing that a simple life trumps a life spent accumulating junk and requires much less effort. For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...if one would live simply and eat only the crop he raised, and raise no more than he ate, and not exchange it for an insufficient quantity of more luxurious and expensive things ... he could do all his farm work with his left hand at odd hours of the summer [Thoreau1 518]&lt;/p&gt;
&lt;p&gt;...if working were not my trade, I could get all the meat I should want by hunting... I could get all I should want for one week in one day [Thoreau 1566]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He concludes on this theme with a powerful and succinct statement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Superfluous wealth can buy superfluities only. Money is not required to buy one necessary of the soul [Thoreau 1660].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Unabomber manifesto brings up an interesting aspect of human nature, which he labels the theory of &lt;strong&gt;surrogate activities&lt;/strong&gt;.  He writes that because technological society takes care of our basic needs, we humans have to invent artificial needs in order to feel satisfied with our lives: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A surrogate activity is an activity that is directed toward an artificial goal that the individual pursues for the sake of the "fulfillment" that he gets from pursuing the goal, not because he needs to attain the goal itself. For instance there is no practical motive for building enormous muscles, hitting a little ball in a hole or acquiring a complete series of postage stamps. Yet many people in our society devote themselves with passion to bodybuilding, golf, or stamp collecting [Kaczynski 84]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau criticizes this same notion of devoting one’s life to the pursuit of &lt;strong&gt;nonsense&lt;/strong&gt;, and not concentrating on what’s really important in life:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...It is easier to sail many thousand miles through cold and storm and cannibals ... than it is to explore the private sea, the Atlantic and the Pacific ocean of one’s being alone... It is not worth the while to go round the world to count the cats in Zanzibar. Yet do this even till you can do better, and you may perhaps find some “Symmes’ Hole” by which to get at the inside at last... if you would learn to speak all tongue and conform to the customs of all nations, if you would travel farther than all travelers...and cause the Sphinx to dash her head against the stone [Thoreau 1657] &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I predict that the Unabomber would label the surrogate activities that Thoreau just mentioned &lt;strong&gt;travel-oriented surrogate activities&lt;/strong&gt;.  In addition to these travel-oriented activities, Thoreau criticizes other &lt;strong&gt;surrogate activities&lt;/strong&gt; in Walden. For example, he considers the hobby of reading junk novels a useless pastime:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Most men are satisfied if they read or hear read, and perchance have been [convinced] by the wisdom of one good book, the Bible, and for the rest of their lives vegetate and dissipate their faculties in what is called easy reading... the result is dullness of sight, a stagnation of the vital circulations, and a general [sinking] and sloughing off of all the intellectual faculties [Thoreau 1545]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Unabomber lists the acquisition of useless junk as a surrogate activity, writing:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...many people put into their work far more effort than is necessary to earn whatever... they desire and this extra effort constitutes a surrogate activity [Kaczynski 84]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau also lambastes the &lt;strong&gt;collecting surrogate activity&lt;/strong&gt;. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;...as I preferred some things to others I especially valued my freedom ... I did not wish to spend my time in earning rich carpets... delicate cookery, or a house in the Grecian or the Gothic style just yet [Thoreau 1526]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau suggests that nonspiritual, superficial pastimes do not substitute for the higher principles of Self, God, or freedom.  The Unabomber wraps a similar sentiment in psychological jargon. &lt;/p&gt;
&lt;h3&gt;Common View of Rage Against the Machine&lt;/h3&gt;
&lt;p&gt;Kaczynski and Thoreau both desire to eliminate &lt;strong&gt;the Machine&lt;/strong&gt;, what they label the ugly and evil influence of technology, from the world.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Earth Crisis Destroy the Machines Album Art" src="https://john.soban.ski/images/Thoreau_Vs_Unabomber/07_Earth_Crisis.png"&gt;&lt;/p&gt;
&lt;p&gt;Kaczynski writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;it is necessary to develop and propagate an ideology that opposes technology and the industrial society... the factories should be destroyed, technical books burned, etc. [Kaczynski 165]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau considers the railroad (arguably the springboard for America’s industrial revolution) &lt;strong&gt;the machine&lt;/strong&gt;.  Thoreau spares the railroad no mercy or clemency.  He perceives the railroad as a threat to nature, the same way the Unabomber perceives 20th-century industrial society as a threat to nature:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;That devilish Iron Horse, whose ear-rending neigh is heard throughout town, has muddied the Boiling Spring with his foot, and he it is that has browsed off all the woods on Walden shore, that Trojan horse, with a thousand men in his belly, introduced by mercenary Greeks! Where is the country’s champion, the [dragon slayer] to meet him at the Deep Cut and thrust an avenging lance between the ribs of the bloated pest? [Thoreau 1591]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thoreau wants mankind to preserve, if not outright return to, the natural habitat it shares the Earth with. The Unabomber shares this sentiment. He writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An ideology, in order to gain enthusiastic support... must be FOR something, as well as AGAINST something. The positive ideal [I] propose [in my manifesto] is Nature. That is, WILD nature: those aspects of the functioning of the Earth and its living things that are independent of human management and free of human interference and control [Kaczynski 183]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The two works provide more parallels that pertain to important social issues. The Unabomber, for example, writes: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Instead of removing the conditions that make people depressed, modern society gives them antidepressant drugs. [Kaczynski 145]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;His statement provides an uncanny response to Thoreau’s rhetorical question:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the pill which will keep us well, serene, contented? [Thoreau 1563]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Differences&lt;/h3&gt;
&lt;p&gt;The two works have their differences in addition to their similarities. First of all, the writing styles of the two authors clash quite severely. Thoreau uses a flowery, poetic style and injects a sense of humor into the text (see the “Cenobites” pun).  Kaczynski uses a staccato, scientific and analytic voice, and his work contains not a single iota of levity.&lt;/p&gt;
&lt;p&gt;In addition, &lt;strong&gt;Walden&lt;/strong&gt; includes an overall tone of optimism, whereas Kaczynski focuses on dire pessimism. Finally, Thoreau’s “tangents” mainly describe nature, his possessions, how he built his house, etc., which all deal with his environment. Kaczynski's tangents, on the other hand, focus on political themes.&lt;/p&gt;
&lt;p&gt;Thoreau's work speaks for itself, and society recognizes the value of &lt;strong&gt;Walden&lt;/strong&gt; solely by his writing talents.  Thoreau did not need to partake in any extraneous activities to bring attention to it. Thoreau did not need to use violence to set himself apart from his contemporaries. &lt;/p&gt;
&lt;p&gt;Contrast this with Kaczynski, who clearly states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In order to get our message before the public with some chance of making a lasting impression, we've had to kill people [Kaczynski 96]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kaczynski did in fact resort to violence, and killed people.  We must not forget his cowardly actions.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Based on literary analysis, Thoreau and Kaczynski see eye to eye in relation to their works’ major themes. Both authors appear to be steadfast in their pro-nature (and all the good that comes from it) / anti-technology (and all the ills that are a result of it) convictions. &lt;/p&gt;
&lt;p&gt;AI inference provides hard numbers that indicate Kaczynski communicates in a strong, negative tone and Thoreau uses a neutral tone in terms of both emotion and intensity.&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="NLP"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Don't Move Your Data! In-Situ Machine Learning via BigQuery</title><link href="https://john.soban.ski/bigquery-ml.html" rel="alternate"></link><published>2021-01-30T08:07:00-05:00</published><updated>2021-01-30T08:07:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2021-01-30:/bigquery-ml.html</id><summary type="html">&lt;p&gt;I started my AI/ML journey in 2011 with a &lt;strong&gt;laptop model&lt;/strong&gt;, a term which indicates a measure of size.  &lt;strong&gt;Laptop models&lt;/strong&gt;, by definition, do not exceed the compute, memory and storage resources of a single piece of hardware.  The &lt;strong&gt;laptop model&lt;/strong&gt; approach works well for small data sets, and …&lt;/p&gt;</summary><content type="html">&lt;p&gt;I started my AI/ML journey in 2011 with a &lt;strong&gt;laptop model&lt;/strong&gt;, a term which indicates a measure of size.  &lt;strong&gt;Laptop models&lt;/strong&gt;, by definition, do not exceed the compute, memory and storage resources of a single piece of hardware.  The &lt;strong&gt;laptop model&lt;/strong&gt; approach works well for small data sets, and modern hardware accommodates a few dozen GigaBytes (GB) of data with no issues.  &lt;/p&gt;
&lt;p&gt;The following cartoon demonstrates the laptop approach to model training and serving.  (I use a &lt;strong&gt;brain&lt;/strong&gt; icon to represent the ML model.)&lt;/p&gt;
&lt;p&gt;&lt;img alt="Laptop Model" src="https://john.soban.ski/images/Bigquery_Ml/01_Laptop_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;I still build, on occasion, &lt;strong&gt;laptop models&lt;/strong&gt; when I have a small data set and I do not have access to the cloud.  In general, however, I train models on &lt;strong&gt;Big Data&lt;/strong&gt;, or data at PetaByte (PB) and higher scale. The &lt;strong&gt;laptop model&lt;/strong&gt; approach, unfortunately, breaks down in the face of &lt;strong&gt;Big Data&lt;/strong&gt;.  Consider, for example, an attempt to train a ten (10) PB Dataset on a laptop.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Big Data Kaboom" src="https://john.soban.ski/images/Aggregations_The_Elasticsearch_Group_By/02_Big_Data_Kaboom.png"&gt;&lt;/p&gt;
&lt;p&gt;I do not know of a laptop that can accommodate 10 PB, and even if one existed, the compute and memory resources would not be able to train and serve at that scale.  The cost and latency, furthermore, involved in transferring that amount of data across the Network also prohibits this approach.&lt;/p&gt;
&lt;p&gt;Data Scientists, therefore, &lt;strong&gt;sample&lt;/strong&gt; Big Datasets in order to work around the network, cost and resource constraints associated with &lt;strong&gt;Big Data&lt;/strong&gt; driven &lt;strong&gt;laptop models&lt;/strong&gt;.  A &lt;strong&gt;sampled&lt;/strong&gt; Dataset enables the Data Scientist to train and serve models on their laptops.  &lt;/p&gt;
&lt;p&gt;The following cartoon, for example, illustrates a Data Scientist who downloads one out of every one million (1M) rows at random.  This reduces the data set from a cumbersome 10 PB to a manageable 10 GB.&lt;/p&gt;
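To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python.  The 10 PB figure and the one-in-a-million sampling rate come from the cartoon; decimal (powers-of-ten) byte units are assumed:

```python
# Back-of-the-envelope check of the sampling math above.
PB = 10**15  # bytes per petabyte (decimal convention)
GB = 10**9   # bytes per gigabyte

dataset_bytes = 10 * PB       # the full 10 PB Big Dataset
sample_rate = 1 / 1_000_000   # keep one row out of every million, at random

sampled_bytes = dataset_bytes * sample_rate
print(sampled_bytes / GB)     # -> 10.0 (a manageable 10 GB)
```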
&lt;p&gt;&lt;img alt="Sampled Model" src="https://john.soban.ski/images/Bigquery_Ml/02_Sampled_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;sampling&lt;/strong&gt; approach works from a technical standpoint.  In order to accommodate resource constraints, however, the Data Scientist must &lt;strong&gt;throw away&lt;/strong&gt; 99.9999% of the data.  The ignored data may contain interesting outliers that could, for example, predict &lt;strong&gt;black swan&lt;/strong&gt; events.  So, from an information standpoint, the &lt;strong&gt;sampling&lt;/strong&gt; approach lacks utility.&lt;/p&gt;
&lt;h2&gt;How can Data Scientists train and serve models on &lt;strong&gt;Big Data&lt;/strong&gt;?&lt;/h2&gt;
&lt;p&gt;The issues with &lt;strong&gt;laptop models&lt;/strong&gt; and &lt;strong&gt;sampling&lt;/strong&gt; approaches result from the attempts of the Data Scientist to &lt;strong&gt;Bring the Data to the processing&lt;/strong&gt;.  A better approach, therefore, would be to &lt;strong&gt;bring the processing to the Data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Google Cloud Platform (GCP) &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; service now enables Data Scientists to train models in place (or &lt;strong&gt;in-situ&lt;/strong&gt;).  They can train and serve models on their BigQuery Datasets without the need to move the data outside of BigQuery.&lt;/p&gt;
&lt;p&gt;The following Cartoon captures this phenomenon:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bq Model" src="https://john.soban.ski/images/Bigquery_Ml/03_Bq_Model.png"&gt;&lt;/p&gt;
&lt;h2&gt;Test Drive BigQuery ML&lt;/h2&gt;
&lt;p&gt;We now will test drive the in-situ BigQuery ML and AutoML services, which allow us to train and serve models without the need to transfer the data out of BigQuery.&lt;/p&gt;
&lt;h3&gt;Add Data to BigQuery&lt;/h3&gt;
&lt;p&gt;If you have data in BigQuery, you can test drive BigQuery ML immediately.  I will import the &lt;a href="https://archive.ics.uci.edu/ml/datasets/wine+quality"&gt;UCI Wine Quality Dataset&lt;/a&gt; into BigQuery.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Site" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/14_Wine_Site.png"&gt;&lt;/p&gt;
&lt;p&gt;I discussed the UCI Wine Quality Dataset in last month's &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;discussion of GCP AutoML Tables&lt;/a&gt;.  Please open that link in a new tab to read a description of the data, along with a discussion of the quality of the GCP AutoML generated models.&lt;/p&gt;
&lt;p&gt;Last month, I &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;uploaded the UCI Wine Quality Dataset to a Google Cloud Storage bucket&lt;/a&gt;.  I will now import data from that bucket into BigQuery.  If you have issues with importing the Wine Quality Dataset into BigQuery via a GCS Bucket, please see &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;that blog post&lt;/a&gt; for reference. &lt;/p&gt;
&lt;p&gt;The BigQuery console provides a list of pinned projects.  Select your project from the list.  Google named my project &lt;strong&gt;shining chain&lt;/strong&gt;.  Google will provide you with a different, randomly generated name.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Bq Project" src="https://john.soban.ski/images/Bigquery_Ml/04_Select_Bq_Project.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Create Dataset&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select Create Dataset" src="https://john.soban.ski/images/Bigquery_Ml/05_Select_Create_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;Name the dataset &lt;strong&gt;wine_dataset&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Name Dataset" src="https://john.soban.ski/images/Bigquery_Ml/06_Name_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;The BigQuery console now lists &lt;strong&gt;wine_dataset&lt;/strong&gt; under your project name.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;wine_dataset&lt;/strong&gt; and then the &lt;strong&gt;PLUS (+)&lt;/strong&gt; sign on the upper right in order to add a table.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Add Data" src="https://john.soban.ski/images/Bigquery_Ml/07_Add_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Create Table From Cloud Storage&lt;/strong&gt; and then browse for your bucket.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Browse Bucket" src="https://john.soban.ski/images/Bigquery_Ml/08_Browse_Bucket.png"&gt;&lt;/p&gt;
&lt;p&gt;If you do not have a bucket with the &lt;strong&gt;wine quality&lt;/strong&gt; dataset loaded, you can click the &lt;strong&gt;swiss lunch pail&lt;/strong&gt; to create a new bucket now.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bucket List" src="https://john.soban.ski/images/Bigquery_Ml/09_Bucket_List.png"&gt;&lt;/p&gt;
&lt;p&gt;Add a table name (I named it &lt;strong&gt;wine_red&lt;/strong&gt;), select &lt;strong&gt;auto-detect&lt;/strong&gt; schema and save to close the wizard.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Configure Table" src="https://john.soban.ski/images/Bigquery_Ml/10_Configure_Table.png"&gt;&lt;/p&gt;
&lt;p&gt;Auto-schema inferred &lt;strong&gt;FLOAT&lt;/strong&gt; for all of our features, and &lt;strong&gt;INTEGER&lt;/strong&gt; for our &lt;strong&gt;label&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Auto Schema" src="https://john.soban.ski/images/Bigquery_Ml/11_Auto_Schema.png"&gt;&lt;/p&gt;
&lt;p&gt;Select the &lt;strong&gt;wine_red&lt;/strong&gt; table and then select preview to get a look at the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Data Training" src="https://john.soban.ski/images/Bigquery_Ml/12_Data_Preview.png"&gt;&lt;/p&gt;
&lt;h3&gt;Linear Regression Model&lt;/h3&gt;
&lt;p&gt;We will now create our first model.  To train a model in BigQuery we simply point BigQuery to a table, select the desired features and then indicate a &lt;strong&gt;LABEL&lt;/strong&gt; feature, or target.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;wine quality dataset&lt;/a&gt; includes chemical markers and a rating from zero (0) to ten (10).  Our wine model looks at the &lt;strong&gt;wine quality&lt;/strong&gt; data to learn the relationship between these chemical markers and the &lt;strong&gt;quality&lt;/strong&gt;.  In other words, given a set of chemical markers, our model will predict the rating.  For this reason, command BigQuery to use the &lt;strong&gt;quality&lt;/strong&gt; feature for the target &lt;strong&gt;label&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Use &lt;strong&gt;SELECT AS&lt;/strong&gt; syntax to indicate the &lt;strong&gt;target&lt;/strong&gt; feature&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;BigQuery allows us to train a model via the &lt;strong&gt;CREATE MODEL&lt;/strong&gt; SQL command.  In the following SQL statement, I tell BigQuery to use the &lt;strong&gt;wine_red&lt;/strong&gt; table, located in the &lt;strong&gt;wine_dataset&lt;/strong&gt; Dataset of my &lt;strong&gt;shining-chain&lt;/strong&gt; project.  I command BigQuery to create a model named &lt;strong&gt;model&lt;/strong&gt; and place it in that same Dataset.  I use an SQL &lt;strong&gt;OPTIONS&lt;/strong&gt; stanza to set the &lt;strong&gt;model_type&lt;/strong&gt; to &lt;strong&gt;LINEAR_REG&lt;/strong&gt;.  Note that I include syntax that reads &lt;strong&gt;SELECT quality AS label FROM wine_red&lt;/strong&gt; (I summarize it here).  This instructs BigQuery to set the target feature to &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create The Model" src="https://john.soban.ski/images/Bigquery_Ml/13_Create_The_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;Read the SQL statement below to see the logic in action.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;LINEAR_REG&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;alcohol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;chlorides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;citric_acid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;fixed_acidity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;free_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;residual_sugar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;sulphates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;total_sulfur_dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;volatile_acidity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;&lt;span class="n"&gt;shining&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wine_red&lt;/span&gt;&lt;span class="o"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once the model completes the train stage, click on &lt;strong&gt;execution details&lt;/strong&gt;.  You will see that BigQuery used parallel processing to execute two (2) minutes' worth of processing in fourteen (14) seconds. &lt;/p&gt;
&lt;p&gt;&lt;img alt="Execution Details" src="https://john.soban.ski/images/Bigquery_Ml/14_Execution_Details.png"&gt;&lt;/p&gt;
&lt;p&gt;BigQuery also reports the model success metrics.&lt;/p&gt;
&lt;p&gt;Click the &lt;strong&gt;Results&lt;/strong&gt; tab and click &lt;strong&gt;Go to Model&lt;/strong&gt; &lt;/p&gt;
&lt;p&gt;&lt;img alt="Results View" src="https://john.soban.ski/images/Bigquery_Ml/15_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;In accordance with (IAW) our SQL statement, BigQuery named our model &lt;strong&gt;model&lt;/strong&gt; and stored it in the &lt;strong&gt;wine_dataset&lt;/strong&gt; Dataset, which lives in the &lt;strong&gt;shining-chain&lt;/strong&gt; project.  Click &lt;strong&gt;model&lt;/strong&gt;, click &lt;strong&gt;Evaluation&lt;/strong&gt; and BigQuery will print the metrics.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model Evaluation" src="https://john.soban.ski/images/Bigquery_Ml/16_Model_Evaluation.png"&gt;&lt;/p&gt;
&lt;p&gt;Our first model produces a MSE of &lt;strong&gt;0.4374&lt;/strong&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean absolute error&lt;/td&gt;
&lt;td&gt;0.5107&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean squared error&lt;/td&gt;
&lt;td&gt;0.4374&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean squared log error&lt;/td&gt;
&lt;td&gt;0.0105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median absolute error&lt;/td&gt;
&lt;td&gt;0.4022&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;R squared&lt;/td&gt;
&lt;td&gt;0.3039&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The MSE maps to a Root Mean Square Error (RMSE) of about &lt;strong&gt;0.6614&lt;/strong&gt;.&lt;/p&gt;
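The conversion from MSE to RMSE is just a square root; a quick check in Python, using the MSE value from the evaluation table above:

```python
import math

mse = 0.4374           # Mean squared error from the Evaluation tab
rmse = math.sqrt(mse)  # RMSE is the square root of the MSE
print(round(rmse, 4))  # -> 0.6614
```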
&lt;p&gt;Last month, &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;we tackled the Wine Quality Dataset with a variety of models&lt;/a&gt; and compared the results.&lt;/p&gt;
&lt;p&gt;I capture the results table below:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.6327900&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.6449177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.6483683&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;BigQuery Linear&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.6613622&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.7061977&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.7350416&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.8012159&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;BigQuery's &lt;strong&gt;LINEAR_REG&lt;/strong&gt; model out-performed the two Tensorflow models executed with default parameters.  BigQuery also beat the &lt;strong&gt;Guess Mean&lt;/strong&gt; approach, which provides a good pace car for all of our investigations.&lt;/p&gt;
&lt;p&gt;Note that a Linear Model applied to our feature-reduced dataset landed in second place last month, which supports the claim that too many features lead to over-fitting and therefore lower performance.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Click the &lt;strong&gt;training&lt;/strong&gt; tab and BQ provides training statistics.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model Train Time" src="https://john.soban.ski/images/Bigquery_Ml/17_Model_Train_Time.png"&gt;&lt;/p&gt;
&lt;p&gt;If you click &lt;strong&gt;details&lt;/strong&gt;, you will see that BQ only created one model and stopped.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Default Training Options" src="https://john.soban.ski/images/Bigquery_Ml/18_Default_Training_Options.png"&gt;&lt;/p&gt;
&lt;p&gt;We commanded BigQuery to use the default configurations for learning rate, regularization and optimizer, create a model and stop there.  We can increase the performance of our model, therefore, by tuning these Hyperparameters.  Adam, for example, may perform better than Stochastic Gradient Descent.  In general, Data Scientists will run through a variety of hyperparameter settings, iterate and then keep the best performing set of configuration options.&lt;/p&gt;
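To illustrate the iterate-and-keep-the-best pattern that the Data Scientist would otherwise run by hand, here is a minimal grid-search sketch in Python.  The candidate values and the toy loss function are illustrative placeholders, not BigQuery's actual search space or API:

```python
import itertools

# Illustrative hyperparameter grid (placeholder values, not BigQuery's).
learn_rates = [0.001, 0.01, 0.1]
l2_regs = [0.0, 0.1, 1.0]
optimizers = ["sgd", "adam"]

def validation_loss(lr, l2, opt):
    """Stand-in for an expensive train-and-evaluate run."""
    # A made-up score so the sketch runs end to end without a cloud account.
    bonus = 0.05 if opt == "adam" else 0.0
    return (lr - 0.01) ** 2 + (l2 - 0.1) ** 2 - bonus

# Try every combination and keep the configuration with the lowest loss.
best = min(
    itertools.product(learn_rates, l2_regs, optimizers),
    key=lambda cfg: validation_loss(*cfg),
)
print(best)  # -> (0.01, 0.1, 'adam'): the lowest-loss configuration
```

AutoML services automate exactly this loop (usually with smarter search strategies than an exhaustive grid), which frees the operator from babysitting each iteration.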
&lt;p&gt;In the past, Data Scientists needed to tune these parameters &lt;strong&gt;by hand&lt;/strong&gt;.  The AI/ML industry, however, now provides a host of &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;AutoML solutions&lt;/a&gt;, that execute model tuning without the need for operator involvement.&lt;/p&gt;
&lt;p&gt;BigQuery, in fact, just unlocked a Beta service that allows us to execute &lt;strong&gt;AutoML&lt;/strong&gt; in-situ.&lt;/p&gt;
&lt;p&gt;We discuss &lt;strong&gt;In-Situ AutoML&lt;/strong&gt; via BigQuery next month.  See you then!&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Fast and Easy ML Optimization with GCP AutoML Tables (Beta)</title><link href="https://john.soban.ski/fast-and-easy-automl-optimize.html" rel="alternate"></link><published>2020-12-26T04:19:00-05:00</published><updated>2020-12-26T04:19:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-12-26:/fast-and-easy-automl-optimize.html</id><summary type="html">&lt;p&gt;Model optimization on traditional Artificial Intelligence and Machine Learning (AI/ML) platforms requires considerable Data Architect expertise and judgement.  These ML platforms require the Architect to choose from dozens of available training algorithms. The platforms also provide a host of hyper-parameter knobs and switches for the Architect to tune.  The …&lt;/p&gt;</summary><content type="html">&lt;p&gt;Model optimization on traditional Artificial Intelligence and Machine Learning (AI/ML) platforms requires considerable Data Architect expertise and judgement.  These ML platforms require the Architect to choose from dozens of available training algorithms. The platforms also provide a host of hyper-parameter knobs and switches for the Architect to tune.  The deluge of choice requires the Architect to iterate on both algorithm selection and hyper-parameter values, a time consuming proposition.&lt;/p&gt;
&lt;p&gt;AutoML services &lt;strong&gt;democratize&lt;/strong&gt; model development via no-code, graphical user interface (GUI) based optimization services.  We discuss the Google Cloud Platform's (GCP) &lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;AutoML Vision&lt;/a&gt; service in an &lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;earlier blog post&lt;/a&gt;.  In this blog post we discuss the GCP &lt;a href="https://cloud.google.com/automl-tables"&gt;AutoML Tables Beta&lt;/a&gt; service.&lt;/p&gt;
&lt;h2&gt;The Data Set&lt;/h2&gt;
&lt;p&gt;The GCP AutoML Tables Beta service requires structured, Data Frame encoded data.  To test drive the service, we use the &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;BUPA Liver Disorders&lt;/a&gt; data set.  For a refresher on the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; data set, please right click and open one or more of the following blog posts in a new tab (or set of tabs):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Applying a Reduced Columb Energy (RCE) Neural Network to the BUPA Liver Disorders Data Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical introduction to Probabilistic Neural Networks (PNN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/refactor-matlab-to-tidyverse.html"&gt;Refactoring Matlab Code to R Tidyverse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 1 - Data Exploration &amp;amp; First Models)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Fast and Easy Regression with Keras and TensorFlow 2.3 (Part 2 - Dimensionality Reduction)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the latter two blog posts, we crunch the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; data set in TensorFlow via &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Neural Net and Linear Regression Models&lt;/a&gt; and reduce model over-fit via &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;dimensionality reduction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The following table captures the results of our model iteration:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Our &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;TensorFlow 2.3&lt;/a&gt; and &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Keras 2.3&lt;/a&gt; investigations use the Root Mean Square Error (RMSE) success metric for model tuning&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We iterate over the different model scenarios and draw the interesting conclusion that simply guessing the mean of the &lt;strong&gt;training&lt;/strong&gt; data set yields the best results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google AutoML Beta&lt;/strong&gt; executes a &lt;strong&gt;battle of the bands&lt;/strong&gt; and iterates through dozens of algorithm choices.  For each algorithm, the service tunes &lt;strong&gt;hyperparameters&lt;/strong&gt;, to include the number of layers, the learning rate and the number of features.&lt;/p&gt;
&lt;p&gt;Let's see if the Google AutoML service can beat our idiotic, yet successful &lt;strong&gt;guess mean&lt;/strong&gt; algorithm.&lt;/p&gt;
&lt;h2&gt;Enable GCP AutoML tables&lt;/h2&gt;
&lt;p&gt;Engineers at Google call the menu selection icon the &lt;strong&gt;hamburger&lt;/strong&gt;, a bit of slang that simultaneously makes me laugh and makes me hungry.  Click the &lt;strong&gt;hamburger&lt;/strong&gt; icon in the upper left corner and then scroll down to &lt;strong&gt;Artificial Intelligence&lt;/strong&gt; and select &lt;strong&gt;Tables --&amp;gt; Datasets&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hamburger Menu" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/01_Hamburger.png"&gt;&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;enable the API&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enable API" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/02_Enable_API.png"&gt;&lt;/p&gt;
&lt;p&gt;Once we enable the API, click &lt;strong&gt;get started&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Get Started" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/03_Get_Started.png"&gt;&lt;/p&gt;
&lt;p&gt;Name our dataset on the &lt;strong&gt;create dataset&lt;/strong&gt; screen.&lt;/p&gt;
&lt;p&gt;&lt;img alt="New Dataset" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/04_New_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;The AutoML Tables Beta service provides three vehicles for dataset import:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Import data from BigQuery&lt;/li&gt;
&lt;li&gt;Select a CSV file from Cloud Storage&lt;/li&gt;
&lt;li&gt;Upload files from your computer&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We will upload the BUPA dataset from our computer.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Download BUPA" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/05_Download_Bupa.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;Click here&lt;/a&gt; to download the CSV file from USC.&lt;/p&gt;
&lt;p&gt;USC names the file &lt;strong&gt;bupa.data&lt;/strong&gt;.  If we attempt to upload the file &lt;strong&gt;bupa.data&lt;/strong&gt;, Google will bark.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad Name" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/06_Bad_Name.png"&gt;&lt;/p&gt;
&lt;p&gt;Rename our file from &lt;strong&gt;bupa.data&lt;/strong&gt; to &lt;strong&gt;bupa.csv&lt;/strong&gt; in order to upload the data.  Click &lt;strong&gt;select files&lt;/strong&gt; and then click the &lt;strong&gt;bupa.csv&lt;/strong&gt; file.&lt;/p&gt;
&lt;p&gt;If needed, create a bucket to store the model and metadata.  Click &lt;strong&gt;browse&lt;/strong&gt; and then select the &lt;strong&gt;Swiss lunch pail&lt;/strong&gt; (my terminology, not Google's).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create A Bucket" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/07_Create_A_Bucket.png"&gt;&lt;/p&gt;
&lt;p&gt;Cycle through the bucket wizard and click &lt;strong&gt;create&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create A Bucket" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/08_Create_A_Bucket.png"&gt;&lt;/p&gt;
&lt;p&gt;If needed, create a folder via the &lt;strong&gt;Swiss lunch pail&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create A Folder" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/09_Create_A_Folder.png"&gt;&lt;/p&gt;
&lt;p&gt;If done properly, we will see all green, and will be able to click &lt;strong&gt;import&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Import&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Dataset" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/10_Import.png"&gt;&lt;/p&gt;
&lt;p&gt;Google will import the data, a process which can take hours.  We can close the window and Google will email us once it completes the import process.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Message" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/11_Import_Message.png"&gt;&lt;/p&gt;
&lt;p&gt;After a few minutes, the import fails!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fail Ext" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/12_Fail.png"&gt;&lt;/p&gt;
&lt;p&gt;The Beta service requires a header row.  We hope the &lt;strong&gt;General Availability (GA)&lt;/strong&gt; version will provide a friendlier UI, with a wizard to create a header row.  Until then, we need to add the header row by hand:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;mcv,alkphos,sgpt,sgot,gammagt,drinks,selector
85,92,45,27,31,0.0,1
85,64,59,32,23,0.0,2
86,54,33,16,54,0.0,2

     &amp;lt;snip&amp;gt;

91,68,27,26,14,16.0,1
98,99,57,45,65,20.0,1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
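&lt;p&gt;A short Python sketch can handle both the rename and the header fix in one go.  The sketch writes a two-row sample in place of the full &lt;strong&gt;bupa.data&lt;/strong&gt; download, so the file contents here are illustrative:&lt;/p&gt;

```python
# Stand-in for the raw bupa.data file downloaded from UCI
# (two sample rows; the real file holds 345).
raw_rows = "85,92,45,27,31,0.0,1\n85,64,59,32,23,0.0,2\n"
with open("bupa.data", "w") as f:
    f.write(raw_rows)

# Prepend the header row that AutoML Tables requires, and save
# the result with the .csv extension that the upload wizard accepts.
header = "mcv,alkphos,sgpt,sgot,gammagt,drinks,selector\n"
with open("bupa.data") as f:
    body = f.read()
with open("bupa.csv", "w") as f:
    f.write(header + body)
```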

&lt;p&gt;We properly wrangled the data into a form that the Google service accepts.  Upload the modified &lt;strong&gt;bupa.csv&lt;/strong&gt; into the import wizard and select import once more.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Dataset" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/10_Import.png"&gt;&lt;/p&gt;
&lt;p&gt;GCP imports the data...&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Message" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/11_Import_Message.png"&gt;&lt;/p&gt;
&lt;p&gt;...and fails once more!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Fail Rows" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/13_Fail2.png"&gt;&lt;/p&gt;
&lt;p&gt;The service does not accept data sets that contain &lt;strong&gt;fewer than&lt;/strong&gt; 1,000 rows.  For this reason, we can't optimize the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; model with &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt;, a reality that disappoints me greatly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Imagine that you drive trucks cross-country for a living. Now imagine every morning a magical elf appears and hides your keys, steals your engine or slashes your tires.  This analogy captures the daily frustration faced by professional Data Engineers!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Beta version of the &lt;strong&gt;Google AutoML Tables&lt;/strong&gt; service dashed my plans of providing a consistent machine learning narrative centered on the &lt;strong&gt;BUPA Liver Disorders&lt;/strong&gt; data set.  Uncooperative software has slain my plans (and dreams) enough times in my professional career to vaccinate me against such roadblocks.  With a heavy heart, I will re-group and select a new data set for us to test drive with the &lt;strong&gt;AutoML Tables Beta&lt;/strong&gt; service.&lt;/p&gt;
&lt;h2&gt;The UCI Wine Data Set&lt;/h2&gt;
&lt;p&gt;Our BUPA liver disorders &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;TensorFlow model&lt;/a&gt; predicts the number of drinks that a boozer drinks each day based on biological markers.  We stick with the wino theme and use the University of California Irvine (UCI) &lt;a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality"&gt;wine quality data set&lt;/a&gt;.  The Wine Quality data set uses biological (and chemical) markers to predict the quality of wine, which sommeliers score on a scale from zero to ten.  I would assume that Thunderbird would score low on such a scale.&lt;/p&gt;
&lt;p&gt;Download the &lt;a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality"&gt;wine dataset&lt;/a&gt; from the same UCI website that hosts the BUPA data set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Site" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/14_Wine_Site.png"&gt;&lt;/p&gt;
&lt;p&gt;We follow the method described in the &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;BUPA TensorFlow&lt;/a&gt; blog post to process the data, replacing the &lt;strong&gt;BUPA Data Frame&lt;/strong&gt; with the new &lt;strong&gt;Wine Data Frame&lt;/strong&gt; where appropriate.&lt;/p&gt;
&lt;p&gt;The following Python code, for example, uses the &lt;strong&gt;requests&lt;/strong&gt; library to download the &lt;strong&gt;Wine Quality&lt;/strong&gt; data set from the UCI website, and stuffs the data into a &lt;strong&gt;Pandas&lt;/strong&gt; Data Frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;fixed_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;volatile_acidity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;citric_acid&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;residual_sugar&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;chlorides&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;free_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;total_sulfur_dioxide&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;density&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;ph&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;sulphates&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;alcohol&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;quality&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;wine_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; 
                      &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The table below captures the summary statistics for the wine dataset:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;feature&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;fixed_acidity&lt;/td&gt;
&lt;td&gt;8.31&lt;/td&gt;
&lt;td&gt;1.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;volatile_acidity&lt;/td&gt;
&lt;td&gt;0.53&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;citric_acid&lt;/td&gt;
&lt;td&gt;0.27&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;residual_sugar&lt;/td&gt;
&lt;td&gt;2.52&lt;/td&gt;
&lt;td&gt;1.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chlorides&lt;/td&gt;
&lt;td&gt;0.09&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;free_sulfur_dioxide&lt;/td&gt;
&lt;td&gt;15.8&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total_sulfur_dioxide&lt;/td&gt;
&lt;td&gt;46.5&lt;/td&gt;
&lt;td&gt;33.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;density&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ph&lt;/td&gt;
&lt;td&gt;3.31&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sulphates&lt;/td&gt;
&lt;td&gt;0.66&lt;/td&gt;
&lt;td&gt;0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alcohol&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;1.08&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quality&lt;/td&gt;
&lt;td&gt;5.63&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We observe a standard deviation of &lt;strong&gt;0.809201&lt;/strong&gt; for the target variable &lt;strong&gt;quality&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note the wild range swings amongst the features.  We follow the same process from our &lt;strong&gt;BUPA&lt;/strong&gt; model to &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;normalize the data via TensorFlow&lt;/a&gt;.  The following figure shows histograms of the normalized features.  Note that we do not normalize the target, &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Histograms" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/15_Wine_Histograms.png"&gt;&lt;/p&gt;
&lt;p&gt;Recall that the idiotic &lt;strong&gt;guess mean&lt;/strong&gt; algorithm yielded the best results for the &lt;strong&gt;BUPA&lt;/strong&gt; data set.  That algorithm guesses the mean of the &lt;strong&gt;training&lt;/strong&gt; Data Frame for each row in the &lt;strong&gt;holdout&lt;/strong&gt; (or test) Data Frame.  If we apply that algorithm to the &lt;strong&gt;Wine&lt;/strong&gt; Data Frame, we observe an RMSE of &lt;strong&gt;0.8012159&lt;/strong&gt;, just under the standard deviation of the entire population.  This result compels us to apply more advanced algorithms.&lt;/p&gt;
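&lt;p&gt;The &lt;strong&gt;guess mean&lt;/strong&gt; baseline is easy to reproduce.  The sketch below uses synthetic integer quality scores in place of the real Wine Data Frame, so the printed RMSE will not match &lt;strong&gt;0.8012159&lt;/strong&gt;; the split sizes and variable names are illustrative:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1,599 wine quality scores (integers 3..8).
quality = rng.integers(3, 9, size=1599).astype(float)

# 80/20 train/holdout split.
train, holdout = quality[:1279], quality[1279:]

# Guess the training mean for every holdout row, then score with RMSE.
prediction = np.full_like(holdout, train.mean())
rmse = np.sqrt(np.mean((holdout - prediction) ** 2))
print(round(rmse, 3))
```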
&lt;p&gt;Once more, &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Keras provides the tools to create a linear regression model and a Dense Neural Network (DNN) model&lt;/a&gt;, both of which predict the quality of the wine based on the given features.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Keras detects that we now have eleven input features, versus the five for &lt;strong&gt;BUPA&lt;/strong&gt;. &lt;/p&gt;
&lt;/blockquote&gt;
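&lt;p&gt;For reference, a minimal sketch of the DNN that produces the summary output below: a &lt;strong&gt;Normalization&lt;/strong&gt; layer adapted to the eleven features, followed by two 64-unit hidden layers and a single output unit.  The random features stand in for the Wine Data Frame, and the layer API reflects current &lt;strong&gt;tf.keras&lt;/strong&gt; rather than the exact Keras 2.3 calls:&lt;/p&gt;

```python
import numpy as np
import tensorflow as tf

# Random stand-in for the eleven normalized wine features.
features = np.random.normal(size=(100, 11)).astype("float32")

# Normalization contributes the 23 non-trainable params in the summary
# (11 means + 11 variances + 1 count).
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(features)

dnn_model = tf.keras.Sequential([
    tf.keras.Input(shape=(11,)),
    normalizer,
    tf.keras.layers.Dense(64, activation="relu"),  # 11*64 + 64 = 768 params
    tf.keras.layers.Dense(64, activation="relu"),  # 64*64 + 64 = 4,160 params
    tf.keras.layers.Dense(1),                      # 64*1 + 1 = 65 params
])
dnn_model.compile(optimizer="adam", loss="mean_squared_error")
```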
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;dnn_model.summary&lt;span class="o"&gt;()&lt;/span&gt;
Model:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;sequential_1&amp;quot;&lt;/span&gt;
_________________________________________________________________
Layer&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                 &lt;/span&gt;Output&lt;span class="w"&gt; &lt;/span&gt;Shape&lt;span class="w"&gt;              &lt;/span&gt;Param&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;#   &lt;/span&gt;
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
normalization_5&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Normalizati&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;
_________________________________________________________________
dense_1&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Dense&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="m"&gt;768&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;
_________________________________________________________________
dense_2&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Dense&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="m"&gt;4160&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;
_________________________________________________________________
dense_3&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Dense&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;None,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt;                 &lt;/span&gt;&lt;span class="nv"&gt;65&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
Total&lt;span class="w"&gt; &lt;/span&gt;params:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;,016
Trainable&lt;span class="w"&gt; &lt;/span&gt;params:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;,993
Non-trainable&lt;span class="w"&gt; &lt;/span&gt;params:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The normalized training set yields the following results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The DNN blows the other two approaches out of the water.  &lt;/p&gt;
&lt;p&gt;In the spirit of the prior blog post, we reduce the eleven features to two, via PCA.  Keras reports that the dimensionality reduction &lt;strong&gt;increases&lt;/strong&gt; the RMSE for the linear model.&lt;/p&gt;
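&lt;p&gt;The reduction itself is a one-liner with scikit-learn.  Random data stands in for the normalized Wine features below; &lt;strong&gt;wine_features&lt;/strong&gt; is an illustrative name, not a variable from the original notebook:&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
wine_features = rng.normal(size=(1599, 11))  # stand-in for 11 normalized features

# Project the eleven features onto the top two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(wine_features)
print(reduced.shape)  # (1599, 2)
```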
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;One principal component (dimension) yields a poor fit for the regression line.  &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE:  The Wine data frame uses &lt;strong&gt;integers&lt;/strong&gt; for &lt;strong&gt;quality&lt;/strong&gt;.  For this reason we could also apply a &lt;strong&gt;classification&lt;/strong&gt; algorithm to predict wine quality.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Wine One Dim" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/16_Wine_One_Dim.png"&gt;&lt;/p&gt;
&lt;p&gt;The graph of two principal components indicates poor predictive performance.  We cannot draw a clean line that will predict the correct wine quality (depicted by the color and radius of the circles below).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Two Dims" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/17_Wine_Two_Dims.png"&gt;&lt;/p&gt;
&lt;p&gt;The 3D bar chart looks flat, which also indicates that we need more than two Principal Components.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Wine Bar Chart" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/18_Wine_Bar_Chart.png"&gt;&lt;/p&gt;
&lt;p&gt;How many Principal Components should we use?  The &lt;strong&gt;eigenvalues&lt;/strong&gt; record the variance for each &lt;strong&gt;eigenvector&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;print&lt;span class="o"&gt;(&lt;/span&gt;pca.explained_variance_&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.0807826&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.84947941&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.62211745&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.23466434&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.96610121&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.68122053
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.58218232&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.40963393&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.3487236&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.17406732&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If we eyeball the vector of &lt;strong&gt;eigenvalues&lt;/strong&gt;, we see that the first seven (7) or so principal components contain most of the variance.&lt;/p&gt;
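&lt;p&gt;We can make the eyeball check precise by computing the cumulative share of variance from the eigenvalues printed above:&lt;/p&gt;

```python
import numpy as np

# Eigenvalues reported by pca.explained_variance_ above.
eigenvalues = np.array([3.0807826, 1.84947941, 1.62211745, 1.23466434,
                        0.96610121, 0.68122053, 0.58218232, 0.40963393,
                        0.3487236, 0.17406732])

# Cumulative fraction of total variance captured by the first k components.
cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
print(cumulative.round(3))
```

&lt;p&gt;The first seven components capture roughly 91 percent of the total variance, which supports the choice of seven dimensions.&lt;/p&gt;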
&lt;p&gt;When we reduce the dimensionality of the data set to seven, and apply the two algorithms, we witness the following results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The dimensionality-reduced Linear Model wins.&lt;/p&gt;
&lt;p&gt;Can Google AutoML tables beat the dimensionality reduced Linear Model?  Let's find out!&lt;/p&gt;
&lt;h2&gt;Import the UCI Wine Data Set&lt;/h2&gt;
&lt;p&gt;Download the &lt;a href="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"&gt;Wine Data Set from UCI&lt;/a&gt; to your workstation and execute the following two actions.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Replace all semi-colons (&lt;strong&gt;;&lt;/strong&gt;) with commas (&lt;strong&gt;,&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Replace all spaces with underscores (&lt;strong&gt;_&lt;/strong&gt;)&lt;/li&gt;
&lt;/ol&gt;
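&lt;p&gt;The two substitutions take a few lines of Python.  A truncated two-column sample stands in for the full &lt;strong&gt;winequality-red.csv&lt;/strong&gt; download here:&lt;/p&gt;

```python
# Stand-in for the raw UCI download: semicolon-separated, with
# spaces in the header names (truncated to two columns here).
sample = '"fixed acidity";"volatile acidity"\n7.4;0.7\n7.8;0.88\n'
with open("winequality-red.csv", "w") as f:
    f.write(sample)

# 1. Replace semicolons with commas.  2. Replace spaces with underscores.
with open("winequality-red.csv") as f:
    text = f.read().replace(";", ",").replace(" ", "_")

with open("wine.csv", "w") as f:
    f.write(text)
```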
&lt;p&gt;See a snippet of &lt;strong&gt;wine.csv&lt;/strong&gt; below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
&lt;span class="m"&gt;7&lt;/span&gt;.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
&lt;span class="m"&gt;7&lt;/span&gt;.8,0.88,0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
&lt;span class="m"&gt;7&lt;/span&gt;.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
&lt;span class="m"&gt;11&lt;/span&gt;.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
&lt;span class="m"&gt;7&lt;/span&gt;.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
&lt;span class="m"&gt;7&lt;/span&gt;.4,0.66,0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
&lt;span class="m"&gt;7&lt;/span&gt;.3,0.65,0,1.2,0.065,15,21,0.9946,3.39,0.47,10,7

&lt;span class="w"&gt;               &lt;/span&gt;&amp;lt;snip&amp;gt;

&lt;span class="m"&gt;6&lt;/span&gt;.3,0.51,0.13,2.3,0.076,29,40,0.99574,3.42,0.75,11,6
&lt;span class="m"&gt;5&lt;/span&gt;.9,0.645,0.12,2,0.075,32,44,0.99547,3.57,0.71,10.2,5
&lt;span class="m"&gt;6&lt;/span&gt;,0.31,0.47,3.6,0.067,18,42,0.99549,3.39,0.66,11,6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Follow the process that we used to (attempt to) import the &lt;strong&gt;BUPA&lt;/strong&gt; data set above.  Create a new bucket and folder if desired.&lt;/p&gt;
&lt;p&gt;I created a bucket named &lt;strong&gt;wine-quality-data&lt;/strong&gt; and a folder named &lt;strong&gt;red&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Wine" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/19_Import_Wine.png"&gt;&lt;/p&gt;
&lt;p&gt;After we click &lt;strong&gt;import&lt;/strong&gt;, Google will suggest that we close the window.&lt;/p&gt;
&lt;p&gt;&lt;img alt="So Far So Good" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/20_So_Far_So_Good.png"&gt;&lt;/p&gt;
&lt;p&gt;After about forty-five minutes, Google sends an email that reports a successful import.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Email" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/21_Import_Email.png"&gt;&lt;/p&gt;
&lt;p&gt;With our imported data set, we can now train the model.&lt;/p&gt;
&lt;h2&gt;Train the Model&lt;/h2&gt;
&lt;p&gt;Click the &lt;strong&gt;Train&lt;/strong&gt; tab in the console.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Click Train Button" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/22_Click_Train_Button.png"&gt;&lt;/p&gt;
&lt;p&gt;Scroll down to our target variable &lt;strong&gt;quality&lt;/strong&gt; and change the data type from &lt;strong&gt;categorical&lt;/strong&gt; to &lt;strong&gt;numeric&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  In the spirit of our earlier efforts, we select &lt;strong&gt;numeric&lt;/strong&gt; to continue with the regression theme.  If we want a &lt;strong&gt;classification&lt;/strong&gt; model, then we can set &lt;strong&gt;data type&lt;/strong&gt; to &lt;strong&gt;categorical&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Quality Numeric" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/23_Quality_Numeric.png"&gt;&lt;/p&gt;
&lt;p&gt;Scroll to the top and set the &lt;strong&gt;target&lt;/strong&gt; variable to &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Target Quality" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/24_Target_Quality.png"&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;train model&lt;/strong&gt;.  We can limit the number of compute hours (and therefore cost) if desired.  I set the value to the maximum; our simple model will not consume these resources.  Click &lt;strong&gt;Train Model&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Train Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/25_Train_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The Google singularity then gets to work and creates the infrastructure needed to train our model.  We can close the browser.  Google will email us a notification once it finishes training the model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Progress" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/26_Training_Progress.png"&gt;&lt;/p&gt;
&lt;h2&gt;View Results&lt;/h2&gt;
&lt;p&gt;After a few hours, Google sends an email that notifies us of model completion.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Training Email" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/27_Training_Email.png"&gt;&lt;/p&gt;
&lt;p&gt;Navigate back to the &lt;strong&gt;Tables&lt;/strong&gt; service and click the &lt;strong&gt;Models&lt;/strong&gt; tab.  The GCP console presents the results.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Model Results" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/28_Model_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;In summary, the &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt; service yields the best results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AutoML Tables&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0.645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;0.706&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.735&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We achieved the best results with the least amount of work: Upload a CSV and click train!  &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Model Meta Data&lt;/h2&gt;
&lt;p&gt;The service provides feature importance.  Google reports that &lt;strong&gt;alcohol&lt;/strong&gt; drives &lt;strong&gt;quality&lt;/strong&gt; more than any other feature.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Feature Importance" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/29_Feature_Importance.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model Hyperparameters&lt;/strong&gt; under the &lt;strong&gt;Model&lt;/strong&gt; tab re-directs us to the GCP &lt;strong&gt;Operations Logging&lt;/strong&gt; console.  These logs include the different scenarios for each iteration.  Trial zero, for example, uses a Neural Network with sixteen (16) layers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hyper Nn L1" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/30_Hyper_Nn_L1.png"&gt;&lt;/p&gt;
&lt;p&gt;Trial fifteen uses a Gradient Boosted Decision Tree (GBDT).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hyper Gbdt" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/31_Hyper_Gbdt.png"&gt;&lt;/p&gt;
&lt;p&gt;The logs provide a cumbersome UI to investigate the trials.  Perhaps the &lt;strong&gt;GA&lt;/strong&gt; release will clean up the UI and present a friendlier dashboard.&lt;/p&gt;
&lt;h2&gt;Deploy the Model&lt;/h2&gt;
&lt;p&gt;Now that we have developed the model, we will deploy it for use.  The AutoML service provides one-click, no-code model deployment.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Test And Use --&amp;gt; Online Prediction --&amp;gt; Deploy Model&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Deploy Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/32_Deploy_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;Google once more deploys the model, and perhaps more importantly, the required infrastructure to enable model serving.&lt;/p&gt;
&lt;p&gt;Google emails an alert once the model deployment completes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Deployed Email" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/33_Deployed_Email.png"&gt;&lt;/p&gt;
&lt;h2&gt;Test the Model&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;online prediction&lt;/strong&gt; tab provides a web form to test the model.  &lt;/p&gt;
&lt;p&gt;&lt;img alt="Deployed Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/34_Deployed_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;Scroll down to the &lt;strong&gt;alcohol&lt;/strong&gt; field.  What score can we expect for a 160 proof bottle of wine?  In the US convention, 160 proof equates to eighty percent alcohol by volume, so enter the number eighty into the alcohol field and then click &lt;strong&gt;test&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Test The Model" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/35_Test_The_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;The model predicts our strong wine deserves a score of &lt;strong&gt;5.194&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Prediction Results" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/36_Prediction.png"&gt;&lt;/p&gt;
&lt;p&gt;The AutoML Tables Beta service also provides a REST API that lets machines submit prediction requests to the model.&lt;/p&gt;
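&lt;p&gt;As a rough sketch of what a machine client could send, the snippet below builds a prediction request body.  The endpoint URL and payload schema here are our assumptions for illustration; consult the AutoML Tables API reference for the authoritative request format.&lt;/p&gt;

```python
import json

# Hypothetical sketch: the endpoint URL and payload schema are assumptions
# for illustration, not the documented AutoML Tables request format.
ENDPOINT = "https://automl.googleapis.com/v1beta1/MODEL_NAME:predict"  # placeholder

# One row of feature values; alcohol set to 80 for the 160 proof experiment.
payload = {"payload": {"row": {"values": ["7.4", "0.70", "0.00", "1.9", "80.0"]}}}

body = json.dumps(payload)
print(body)

# A client would POST body to ENDPOINT with an OAuth 2.0 bearer token, e.g.
# requests.post(ENDPOINT, data=body,
#               headers={"Authorization": "Bearer " + token})
```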
&lt;h2&gt;Billing&lt;/h2&gt;
&lt;p&gt;The AutoML Tables Beta service costs significantly less than our &lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model&lt;/a&gt;.  We paid $28 for one and a half hours of compute time.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Billing Metering" src="https://john.soban.ski/images/Fast_And_Easy_Automl_Optimize/37_Billing.png"&gt;&lt;/p&gt;
&lt;p&gt;Google gave us a credit for the training, so we did not need to pay any out-of-pocket fee.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this blog post we test-drove the &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt; service.  The service did not accommodate our &lt;strong&gt;BUPA&lt;/strong&gt; data, so we needed to pivot and try another Data Set, the UCI &lt;strong&gt;Wine Quality&lt;/strong&gt; data set.&lt;/p&gt;
&lt;p&gt;We used &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Scikit Learn&lt;/strong&gt; and &lt;strong&gt;TensorFlow 2.3&lt;/strong&gt; to wrangle, explore, normalize, visualize and split the &lt;strong&gt;Wine Quality&lt;/strong&gt; data set.  We used &lt;strong&gt;Keras 2.3&lt;/strong&gt; to train a linear model and a DNN model and compared the results.  We then iterated on dimensionality reduction approaches, converging on a &lt;strong&gt;good-enough&lt;/strong&gt; number of features.  PCA provided the vehicle to reduce dimensionality.  The &lt;strong&gt;TensorFlow/Keras/Pandas&lt;/strong&gt; approach required domain knowledge of AI/ML concepts and also required familiarity with various Python libraries and methods.  In other words, the Python approach required considerable Math, Data Science and Software Development skills.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Google AutoML Tables Beta&lt;/strong&gt; service obviated the need for subject matter expertise.  We simply uploaded a CSV and clicked &lt;strong&gt;run&lt;/strong&gt;.  Google &lt;strong&gt;took care of the rest&lt;/strong&gt;.  The AutoML Tables Beta service, therefore, democratizes the power of AI/ML and puts the technology in the hands of non-technical business users.  I look forward to the &lt;strong&gt;General Availability (GA)&lt;/strong&gt; release of this service.&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Neural Networks"></category><category term="Machine Learning"></category></entry><entry><title>Fast &amp; Easy Regression w/ Keras and TensorFlow 2.3 (Part 2)</title><link href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html" rel="alternate"></link><published>2020-11-28T10:26:00-05:00</published><updated>2020-11-28T10:26:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-11-28:/fast-and-easy-regression-with-tensorflow-part-2.html</id><summary type="html">&lt;p&gt;In this demonstration we &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;continue&lt;/a&gt; to use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  Today we use Principal Component Analysis (PCA) to address over-fitting via dimensionality reduction.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TensorFlow and Keras 2.3 Logo" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/00_Tf_Keras_Logo.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: TensorFlow Core 2 …&lt;/p&gt;&lt;/blockquote&gt;</summary><content type="html">&lt;p&gt;In this demonstration we &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;continue&lt;/a&gt; to use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  Today we use Principal Component Analysis (PCA) to address over-fitting via dimensionality reduction.&lt;/p&gt;
&lt;p&gt;&lt;img alt="TensorFlow and Keras 2.3 Logo" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/00_Tf_Keras_Logo.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: TensorFlow Core 2.3 includes tf.keras, which provides the high level (high abstraction) Keras Application Programming Interface (API) that simplifies the command and control of TensorFlow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html"&gt;Last Month&lt;/a&gt; we executed the following activities:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Explore the data set&lt;/li&gt;
&lt;li&gt;Normalize the training data&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a Linear Model&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a DNN&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This month, we address the issue of over-fitting by using Principal Component Analysis (PCA) to reduce the dimensionality of the data set. We will: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Drop features (via PCA) to address over-fitting&lt;/li&gt;
&lt;li&gt;Revisit the Linear Model&lt;/li&gt;
&lt;li&gt;Revisit the DNN&lt;/li&gt;
&lt;li&gt;Compare, discuss and contextualize the results&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;1. Dimensionality Reduction&lt;/h2&gt;
&lt;p&gt;Model over-fitting inflates loss on unseen data.  Dimensionality reduction, or feature removal, mitigates model over-fitting.  We use Principal Component Analysis (PCA) to reduce the dimensionality.&lt;/p&gt;
&lt;p&gt;If you stick a magnet at each point in the data space, and then stick a telescoping iron bar at the origin, the magnets will pull the bar into position and stretch it.  The bar will wiggle a bit at first and then eventually settle into a static position.  The final direction and length of the bar represents a principal component.  We can map the higher dimensionality space to the principal component by connecting a string directly from each magnet to the bar.  Where each taut string meets the bar, we make a mark.  The marks represent the mapped vector space.&lt;/p&gt;
&lt;p&gt;If you want more information, George Dallas writes an excellent blog post that &lt;a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/"&gt;contains cartoons explaining PCA&lt;/a&gt; and I suggest you open the link in a new tab.&lt;/p&gt;
&lt;p&gt;You can either construct PCA from your Linear Algebra notes (I have mine from 1996 in a Marble Composition book) or just use a pre-built engine.  I use the package from Scikit Learn.&lt;/p&gt;
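&lt;p&gt;For intuition, the sketch below builds PCA by hand from the eigendecomposition of the covariance matrix (toy data and variable names of our own choosing) and checks the projection against the Scikit Learn engine.&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative sketch (our toy data, not the post's): PCA from the
# eigendecomposition of the covariance matrix, checked against Scikit Learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # 200 samples, 5 features

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, -1]                    # direction of maximum variance
scores_manual = Xc @ top                # project onto the principal component

scores_sklearn = PCA(n_components=1).fit_transform(X).ravel()

# The two projections agree up to an arbitrary global sign flip.
agree = np.allclose(np.abs(scores_manual), np.abs(scores_sklearn))
print(agree)
```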
&lt;h3&gt;Reduce Five Dimensions to One Dimension&lt;/h3&gt;
&lt;p&gt;The PCA workflow mirrors that of the ML models above.  Just set the number of desired components (dimensions) and pass the engine a data set.  We also pass a name for the &lt;strong&gt;Principal Component&lt;/strong&gt;.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.decomposition&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;fit_transform&lt;/strong&gt; method both extracts the Principal Components from the data set and then maps the data set to the lower dimensionality space.&lt;/p&gt;
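&lt;p&gt;A quick sanity check on toy data (the data and names here are ours) confirms that &lt;strong&gt;fit_transform&lt;/strong&gt; equals &lt;strong&gt;fit&lt;/strong&gt; followed by &lt;strong&gt;transform&lt;/strong&gt;.&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (illustrative): fit_transform equals fit followed by transform.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))

combined = PCA(n_components=1).fit_transform(X)

pca = PCA(n_components=1)
pca.fit(X)
separate = pca.transform(X)

same = np.allclose(combined, separate)
print(same)
```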
&lt;p&gt;Want to see all five dimensions mapped to a single one-dimensional vector?  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
     &lt;span class="n"&gt;princomp1&lt;/span&gt;
&lt;span class="mi"&gt;142&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;23.421539&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;32.402962&lt;/span&gt;
&lt;span class="mi"&gt;60&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;10.089154&lt;/span&gt;
&lt;span class="mi"&gt;339&lt;/span&gt;  &lt;span class="mf"&gt;24.724613&lt;/span&gt;
&lt;span class="mi"&gt;54&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;13.494720&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: The &lt;strong&gt;fit and map&lt;/strong&gt; example above preserves the index of the initial train data set.  We need to ensure that we maintain the index so that the label vectors properly align.  The &lt;strong&gt;index=train_features.index&lt;/strong&gt; argument preserves the original index during the PCA transform.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Take a look at the scale of the Principal Component vector above.  The head alone spans magnitudes from ten to over thirty.  That indicates that we forgot to normalize the data before we extracted the Principal Components.&lt;/p&gt;
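&lt;p&gt;A small sketch shows why normalization matters for PCA.  The data and names below are illustrative, with the Scikit Learn StandardScaler standing in for the TensorFlow normalizer: one large-scale feature hijacks the first component unless we normalize first.&lt;/p&gt;

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative sketch with made-up data; StandardScaler stands in for the
# post's TensorFlow normalizer.
rng = np.random.default_rng(42)
small = rng.normal(0.0, 1.0, size=(500, 4))   # four unit-scale features
big = rng.normal(0.0, 100.0, size=(500, 1))   # one feature ~100x larger
X = np.hstack([small, big])

raw_ratio = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
scaled_ratio = PCA(n_components=1).fit(
    StandardScaler().fit_transform(X)).explained_variance_ratio_[0]

print(raw_ratio)     # nearly 1.0: the large feature owns the first component
print(scaled_ratio)  # roughly 1/5: no single feature dominates after scaling
```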
&lt;p&gt;The following code configures one Principal Component (reduces five features to one), extracts the Component of the normalized data set, and then saves the PCA fit in a mapping vector.  We need to use this mapping vector to transform the &lt;strong&gt;test&lt;/strong&gt; (holdout) data set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Normalize before PCA, also save fit for test data  &lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We use the mapping vector to transform the normalized train features and save the results in a Pandas Data Frame.  Once more we preserve the index.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You now see the normalized features mapped to the one-dimensional Principal Component space.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
     &lt;span class="n"&gt;princomp1&lt;/span&gt;
&lt;span class="mi"&gt;142&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.416407&lt;/span&gt;
&lt;span class="mi"&gt;6&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.311242&lt;/span&gt;
&lt;span class="mi"&gt;60&lt;/span&gt;    &lt;span class="mf"&gt;0.209480&lt;/span&gt;
&lt;span class="mi"&gt;339&lt;/span&gt;   &lt;span class="mf"&gt;1.577983&lt;/span&gt;
&lt;span class="mi"&gt;54&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.013619&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A trendline over a scatter plot indicates whether the Principal Component correlates with the label.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The trendline does not indicate strong correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="PCA Scatter Trendline" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/01_Princomp1_Vs_Drinks_Trend.png"&gt;&lt;/p&gt;
&lt;h3&gt;Reduce Five Dimensions to Two Dimensions&lt;/h3&gt;
&lt;p&gt;A two-dimensional feature set allows us to graph the two Principal Components against our label (target) vector, &lt;strong&gt;drinks&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PCA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pca_train_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                     &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                     &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Seaborn provides only limited three-dimensional plots.  The following plot captures the relationship between &lt;strong&gt;drinks&lt;/strong&gt; and the two Principal Components.  The diameter of each circle indicates the number of drinks.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;hue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="PCA 1 and 2 vs Drinks Heatmap" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/02_Princomps_Vs_Drinks_Heat.png"&gt;&lt;/p&gt;
&lt;p&gt;Classic &lt;strong&gt;MATPLOTLIB&lt;/strong&gt; allows us to plot on three axes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;111&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;3d&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;x3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;z3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;dz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;

&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Principal Component 2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_zlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The height of the bars depicts the number of drinks.  The slope across the bars suggests we may have found some slight correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3D Bar Chart" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/03_Princomps_Vs_Drinks_Bar.png"&gt;&lt;/p&gt;
&lt;h2&gt;2. Linear Model w/ PCA&lt;/h2&gt;
&lt;p&gt;We already normalized our train dataset before we applied PCA, so we do not include the TensorFlow normalizer.  We use Keras to construct and compile our new linear model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# no need for normalizer&lt;/span&gt;
&lt;span class="n"&gt;linear_model_pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="n"&gt;linear_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                         &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pass the PCA-transformed, two-feature data set to the model, along with the original train labels vector, which contains the number of drinks.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#turn off loggs&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;#validation on 20% of the training&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;3.76&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;384&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.15&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.8&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Plot the loss across each epoch for the train and validate sets.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The train set MSE clocks in at over 10, while the validate set comes in under 8.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Linear PCA Loss" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/04_Linear_Model_Pca_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;In order to evaluate the model with the holdout set, we must first project the five-feature holdout set to two-dimensional space via the PCA mapping.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Project test features to Principal Components&lt;/span&gt;
&lt;span class="n"&gt;pca_test_features_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                                    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp1&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;princomp2&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The resulting holdout set now spans two (vs. five) dimensions.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_test_features_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;princomp1&lt;/span&gt;  &lt;span class="n"&gt;princomp2&lt;/span&gt;
&lt;span class="mi"&gt;9&lt;/span&gt;   &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.031826&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.390413&lt;/span&gt;
&lt;span class="mi"&gt;25&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.900648&lt;/span&gt;   &lt;span class="mf"&gt;0.331132&lt;/span&gt;
&lt;span class="mi"&gt;28&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.957798&lt;/span&gt;   &lt;span class="mf"&gt;1.973741&lt;/span&gt;
&lt;span class="mi"&gt;31&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.087801&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.594343&lt;/span&gt;
&lt;span class="mi"&gt;32&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.242004&lt;/span&gt;   &lt;span class="mf"&gt;0.572321&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;How did we do?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;test_results[&amp;#39;Linear Model w/ PCA&amp;#39;] = (linear_model_pca.evaluate(pca_test_features_df, test_labels))**0.5
print(test_results)
3/3 [==============================] - 0s 1ms/step - loss: 9.4360
{&amp;#39;Linear Model&amp;#39;: 3.217451704088136, &amp;#39;DNN&amp;#39;: 3.3038437219287813, &amp;#39;Linear Model w/ PCA&amp;#39;: 3.0718091272720853}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;PCA reduces the RMSE of the Linear model from 3.22 to 3.07, pretty darn good!  (We raise the evaluate() loss to the 0.5 power because the model reports MSE, and the square root of MSE gives RMSE.)&lt;/p&gt;
&lt;h2&gt;3. DNN with PCA-transformed Data&lt;/h2&gt;
&lt;p&gt;We use Keras to compile a DNN and, once more, we do not pass a normalization layer.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dnn_model_pca&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dnn_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pass the PCA mapped train features to the model and set validation proportion to 20%.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dnn_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pca_train_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#turn off loggs&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;#validation on 20% of the training&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;420&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.42&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.03&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;How do the results look?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The MSE for the validation set crosses above 8 after the 70th epoch or so.&lt;/p&gt;
&lt;p&gt;&lt;img alt="DNN Data Error" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/05_Dnn_Pca_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;We evaluate the DNN model with the transformed, two dimensional holdout set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DNN w/ PCA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dnn_model_pca&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca_test_features_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;==============================&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.0268&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.217451704088136&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DNN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.3038437219287813&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Linear Model w/ PCA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.0718091272720853&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DNN w/ PCA&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.166514259150867&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DNN w/ PCA reduces the RMSE from 3.30 to 3.17 vs. the original DNN.&lt;/p&gt;
&lt;h2&gt;4. Interpretation&lt;/h2&gt;
&lt;p&gt;The RMSEs for the four models range from 3.07 (lowest) to 3.30 (highest).  Does our model do a good job of predicting how many drinks a person consumes in a day?&lt;/p&gt;
&lt;p&gt;To answer that, consider the formula for Root Mean Squared Error (RMSE):&lt;/p&gt;
&lt;p&gt;&lt;img alt="RMSE" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/06_Rmse_Formula.png"&gt;&lt;/p&gt;
&lt;p&gt;We subtract the actual value from the estimated value for each observation, square the result to remove the negative sign, average the squared errors, and then take the square root.&lt;/p&gt;
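&lt;p&gt;That recipe reduces to three NumPy calls.  A quick sanity check with made-up numbers (illustrative only, not the BUPA data):&lt;/p&gt;

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
actual = np.array([1.0, 3.0, 5.0, 7.0])
predicted = np.array([2.0, 3.0, 4.0, 8.0])

# Subtract, square, average the squared errors, then take the square root
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)  # sqrt(0.75) ~= 0.866
```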
&lt;p&gt;Now, assume we just guess the mean for every observation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Guess Mean for all" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/07_Guess_Mean_For_All.png"&gt;&lt;/p&gt;
&lt;p&gt;If we substitute this guess vector into our RMSE formula, we get the formula for &lt;strong&gt;Standard Deviation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Standard Deviation Formula" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow_Part_2/08_Std_Formula.png"&gt;&lt;/p&gt;
&lt;p&gt;We therefore consider any RMSE that comes in under the Standard Deviation a victory.&lt;/p&gt;
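&lt;p&gt;We can verify the &lt;strong&gt;guess the mean&lt;/strong&gt; identity numerically.  The sketch below uses an illustrative label vector and NumPy's population (ddof=0) standard deviation:&lt;/p&gt;

```python
import numpy as np

# Any label vector will do; these values are illustrative
labels = np.array([0.5, 2.0, 3.5, 6.0, 8.0])

# RMSE when every prediction is simply the mean of the labels...
rmse_mean_guess = np.sqrt(np.mean((labels - labels.mean()) ** 2))

# ...matches the population (ddof=0) standard deviation exactly
print(np.isclose(rmse_mean_guess, labels.std()))  # True
```

&lt;p&gt;One nuance: Pandas defaults std() to ddof=1 while NumPy defaults to ddof=0, so the identity holds exactly against the population flavor; for a sample the two differ by a factor of sqrt(n/(n-1)), which shrinks as n grows.&lt;/p&gt;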
&lt;p&gt;Take a look at the Standard Deviation of the train data set:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;185&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="mf"&gt;3.4108545780181885&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;By this account all four models win.  Keep in mind, however, that &lt;strong&gt;in the wild&lt;/strong&gt; most test sets will include such a high volume of data that the sample &lt;strong&gt;STD&lt;/strong&gt; settles on the true population value, which leaves a model a stable (and unforgiving) baseline to beat.&lt;/p&gt;
&lt;p&gt;One last thing.  Assume a simple model where we just guess the &lt;strong&gt;mean&lt;/strong&gt; of the train data when predicting on the &lt;strong&gt;holdout&lt;/strong&gt; data.  How does this simple model perform?&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sq_er&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Guess Mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sq_er&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="n"&gt;test_results&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;215&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="n"&gt;Guess&lt;/span&gt; &lt;span class="n"&gt;Mean&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;: 3.029730661841211}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The "Guess Mean" approach out-performs all of the other models!&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Dims&lt;/th&gt;
&lt;th&gt;RMSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Guess Mean&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linear Model&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;3.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DNN&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3.17&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
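&lt;p&gt;To reproduce a ranking like the table above, we can sort the test_results dictionary by RMSE.  The sketch below copies the values printed earlier so it stands alone:&lt;/p&gt;

```python
# RMSE values copied from the evaluation output above
test_results = {
    'Linear Model': 3.217451704088136,
    'DNN': 3.3038437219287813,
    'Linear Model w/ PCA': 3.0718091272720853,
    'DNN w/ PCA': 3.166514259150867,
    'Guess Mean': 3.029730661841211,
}

# Rank the approaches, best (lowest RMSE) first
for name, rmse in sorted(test_results.items(), key=lambda kv: kv[1]):
    print(f'{name:20s} {rmse:.2f}')
```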
&lt;p&gt;In &lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;the next blog post&lt;/a&gt; we will investigate ways to tune the model, both its architecture and its hyper-parameters.&lt;/p&gt;
&lt;p&gt;If you enjoyed this blog post, please check out these related blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;Exploratory Factor Analysis (EFA) Workflow and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;EFA - The Math and Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Columb Energy (RCE) - An alternative to KNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Probabalistic Parzen Neural Networks (PNN) with cartoons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fastai-flask.html"&gt;Vision model w/ FAST AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model w/ Google AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Google AutoML Tables Beta&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="TensorFlow"></category><category term="Keras"></category></entry><entry><title>Fast &amp; Easy Regression w/ Keras and TensorFlow 2.3 (Part 1)</title><link href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow.html" rel="alternate"></link><published>2020-10-31T10:26:00-04:00</published><updated>2020-10-31T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-10-31:/fast-and-easy-regression-with-tensorflow.html</id><summary type="html">&lt;p&gt;In this demonstration we will use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  TensorFlow Core 2.3 includes tf.keras, which provides the high level (high abstraction) Keras Application Programming Interface …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this demonstration we will use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem.  TensorFlow Core 2.3 includes tf.keras, which provides the high level (high abstraction) Keras Application Programming Interface (API) for TensorFlow. Keras simplifies the command and control of TensorFlow.  The TensorFlow ecosystem also contains straightforward and simple vehicles for normalization and other common Machine Learning data preparation constructs.&lt;/p&gt;
&lt;p&gt;The following bulleted list captures the steps we will execute in this demonstration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Explore the data set&lt;/li&gt;
&lt;li&gt;Normalize the training data&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a Linear Model&lt;/li&gt;
&lt;li&gt;Build, Compile, Train and Evaluate a DNN&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;Next month&lt;/a&gt;, we will address the issue of over-fitting by using Principal Component Analysis (PCA) to reduce the dimensionality of the data set. In that blog post we will: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Drop features (via PCA) to address over-fitting&lt;/li&gt;
&lt;li&gt;Revisit the Linear Model&lt;/li&gt;
&lt;li&gt;Revisit the DNN&lt;/li&gt;
&lt;li&gt;Compare, discuss and contextualize the results&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;1. Explore the data set&lt;/h2&gt;
&lt;p&gt;This demo revisits the &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;BUPA Liver Disorders&lt;/a&gt; data set, a classic, &lt;strong&gt;tough&lt;/strong&gt; data set that I have explored in three prior blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Applying a Reduced Columb Energy (RCE) Neural Network to the Bupa Liver Disorders Data Set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;A Graphical introduction to Probabalistic Neural Networks (PNN)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/refactor-matlab-to-tidyverse.html"&gt;Refactoring Matlab Code to R Tidyverse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The dataset includes five biological features, a record of &lt;strong&gt;drinks per day&lt;/strong&gt; and an arbitrary &lt;strong&gt;selector&lt;/strong&gt; variable that the original data compilers used for their initial models.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mcv: mean corpuscular volume&lt;/li&gt;
&lt;li&gt;Four Chemical Markers&lt;ul&gt;
&lt;li&gt;alkphos: alkaline phosphatase&lt;/li&gt;
&lt;li&gt;sgpt: alanine aminotransferase&lt;/li&gt;
&lt;li&gt;sgot: aspartate aminotransferase&lt;/li&gt;
&lt;li&gt;gammagt: gamma-glutamyl transpeptidase&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;drinks: # of half-pint equivalents of alcohol per day&lt;/li&gt;
&lt;li&gt;selector:  field used to split data into two sets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;MCV&lt;/strong&gt; and the &lt;strong&gt;four chemical markers&lt;/strong&gt; provide the features for the model.  The model's label vector records &lt;strong&gt;drinks per day&lt;/strong&gt;.  We throw out the obsolete &lt;strong&gt;selector&lt;/strong&gt; feature. &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Our &lt;strong&gt;regression&lt;/strong&gt; problem seeks to &lt;strong&gt;predict&lt;/strong&gt; the number of alcohol servings a person &lt;strong&gt;drinks per day&lt;/strong&gt; (label) based on the recorded &lt;strong&gt;biological stats&lt;/strong&gt; (features).  &lt;/p&gt;
&lt;/blockquote&gt;
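&lt;p&gt;In Pandas terms, the feature/label split amounts to dropping one column.  A sketch with a toy, two-row stand-in for the real DataFrame (the variable names features and labels are illustrative; the real import follows below):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in with the same columns as the BUPA data (selector already dropped)
bupa_df = pd.DataFrame({
    'mcv': [85.0, 86.0], 'alkphos': [92.0, 54.0], 'sgpt': [45.0, 33.0],
    'sgot': [27.0, 16.0], 'gammagt': [31.0, 54.0], 'drinks': [0.0, 0.5],
})

features = bupa_df.drop('drinks', axis=1)  # mcv plus the four chemical markers
labels = bupa_df['drinks']                 # the regression target
print(features.shape)  # (2, 5)
```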
&lt;h3&gt;Import the Data&lt;/h3&gt;
&lt;p&gt;I prefer &lt;strong&gt;requests&lt;/strong&gt; over the low-level &lt;strong&gt;urllib3&lt;/strong&gt; to pull the data from &lt;strong&gt;UC Irvine&lt;/strong&gt;.  Once I retrieve the content, I stuff the data into a &lt;strong&gt;Pandas DataFrame&lt;/strong&gt; and immediately drop the &lt;strong&gt;selector&lt;/strong&gt; column into the bit bucket.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;https://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data&amp;#39;&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;span class="n"&gt;column_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="s1"&gt;&amp;#39;alkphos&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;gammagt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s1"&gt;&amp;#39;selector&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;bupa_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;utf-8&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                      &lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;selector&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
             &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DataFrame's &lt;strong&gt;head&lt;/strong&gt; method outputs the first few lines of the frame.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;mcv&lt;/th&gt;
&lt;th&gt;alkphos&lt;/th&gt;
&lt;th&gt;sgpt&lt;/th&gt;
&lt;th&gt;sgot&lt;/th&gt;
&lt;th&gt;gammagt&lt;/th&gt;
&lt;th&gt;drinks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;92&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Check for Correlation&lt;/h3&gt;
&lt;p&gt;We pick a feature at random, &lt;strong&gt;mcv&lt;/strong&gt;, and use a simple scatter plot to check for any obvious correlation between this feature and our &lt;strong&gt;target&lt;/strong&gt; variable, &lt;strong&gt;drinks&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;sns&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;rc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;figure.figsize&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;11.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.27&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;No obvious correlation jumps out in the scatter plot below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCV vs. Drinks Scatterplot" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/01_Mcv_Vs_Drinks.png"&gt;&lt;/p&gt;
&lt;p&gt;To be sure, we will use &lt;strong&gt;Seaborn&lt;/strong&gt; to plot the best fit trend line and error bands.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;regplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The graph depicts fat error bands and a near-horizontal trend line, which reflects little to no correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCV vs. Drinks Scatterplot with Trendline" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/02_Mcv_Vs_Drinks_W_Trend.png"&gt;&lt;/p&gt;
&lt;p&gt;We plot a Kernel Density Estimation (KDE) of the &lt;strong&gt;drinks&lt;/strong&gt; variable.  KDE plots estimate the density of a continuous random variable, in this case, &lt;strong&gt;drinks&lt;/strong&gt;.  Imagine a smooth histogram, or a histogram with really skinny bars.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
             &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;kde&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From the density plot we see that most people consume fewer than a couple of drinks per day.&lt;/p&gt;
&lt;p&gt;&lt;img alt="KDE of Drinks" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/03_Bupa_Drinks_Kde.png"&gt;&lt;/p&gt;
&lt;p&gt;A continuous bivariate joint density function captures the probability distribution of two random variables.  Imagine a top down view of the density plot above, with the density plot for &lt;strong&gt;MCV&lt;/strong&gt; mixed in.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;kde&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The near-circular shape shows the dearth of correlation between &lt;strong&gt;MCV&lt;/strong&gt; and &lt;strong&gt;Drinks&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="MCV vs. Drinks Bi-Variate Joint Density Plot" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/04_Mcv_Vs_Drinks_Corr.png"&gt;&lt;/p&gt;
&lt;p&gt;By contrast, observe two features with excellent correlation, &lt;strong&gt;SGPT&lt;/strong&gt; and &lt;strong&gt;SGOT&lt;/strong&gt;.  We will leverage this correlation when we apply &lt;strong&gt;dimensionality reduction&lt;/strong&gt; to our data set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;displot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;kde&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice the sharp, nearly 45-degree angle of the bi-variate density plot, which indicates strong correlation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="SGPT Vs. SGOT Bi-Variate Joint Density Plot" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/05_Sgpt_Vs_Sgot_Corr.png"&gt;&lt;/p&gt;
&lt;p&gt;We plot the correlation between &lt;strong&gt;drinks&lt;/strong&gt; and all features.  If one feature shows strong correlation with &lt;strong&gt;drinks&lt;/strong&gt;, then we can simply use that feature, throw out the rest and then take a nap.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;alkphos&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="s1"&gt;&amp;#39;gammagt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;drinks&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PairGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;x_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_vars&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;y_vars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y_vars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_offdiag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map_diag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;None of the features show strong (or any) correlation with &lt;strong&gt;drinks&lt;/strong&gt;, so we will need to proceed with Machine Learning approaches for our prediction model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Drinks vs. all features density plots" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/06_Drinks_Vs_Features.png"&gt;&lt;/p&gt;
&lt;h3&gt;Split the Data&lt;/h3&gt;
&lt;p&gt;We split the data into three buckets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Train - To train a model&lt;/li&gt;
&lt;li&gt;Validate - To tune the model&lt;/li&gt;
&lt;li&gt;Holdout (aka Test) - To test the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;strong&gt;holdout&lt;/strong&gt; data set surprises the model with completely unknown data, which helps predict expected real-world performance.  I use the term &lt;strong&gt;test&lt;/strong&gt; in the code below.  The train/test split partitions rows into different buckets.  The features/label split pops off the &lt;strong&gt;label&lt;/strong&gt; column into a separate vector.  TensorFlow accepts a &lt;strong&gt;DataFrame&lt;/strong&gt; for the &lt;strong&gt;features&lt;/strong&gt; matrix and a &lt;strong&gt;Series&lt;/strong&gt; for the label vector.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: We will further split the &lt;strong&gt;train&lt;/strong&gt; dataset into &lt;strong&gt;train and validate&lt;/strong&gt; sets when we train the model.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frac&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bupa_df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Remove the rows that correspond to the train DF&lt;/span&gt;

&lt;span class="n"&gt;train_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;test_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;train_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#The pop removes drinks from the fetures DF&lt;/span&gt;
&lt;span class="n"&gt;test_labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;drinks&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Take a quick look at the &lt;strong&gt;summary statistics&lt;/strong&gt; for the train dataset.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;()[[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;std&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;feature&lt;/th&gt;
&lt;th&gt;mean&lt;/th&gt;
&lt;th&gt;std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;mcv&lt;/td&gt;
&lt;td&gt;90.2&lt;/td&gt;
&lt;td&gt;4.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;alkphos&lt;/td&gt;
&lt;td&gt;70.0&lt;/td&gt;
&lt;td&gt;18.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sgpt&lt;/td&gt;
&lt;td&gt;30.6&lt;/td&gt;
&lt;td&gt;20.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sgot&lt;/td&gt;
&lt;td&gt;24.4&lt;/td&gt;
&lt;td&gt;10.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gammagt&lt;/td&gt;
&lt;td&gt;38.0&lt;/td&gt;
&lt;td&gt;37.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;drinks&lt;/td&gt;
&lt;td&gt;3.4&lt;/td&gt;
&lt;td&gt;3.4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Notice that for &lt;strong&gt;drinks&lt;/strong&gt;, our target (label),&lt;strong&gt; &amp;#956; = &amp;#x3C3; = 3.4&lt;/strong&gt;.&lt;/p&gt;
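&lt;p&gt;The fact that &amp;#956; = &amp;#x3C3; = 3.4 hands us a free baseline: a model that always predicts the mean scores an RMSE equal to the standard deviation, 3.4 drinks.  Any trained model must beat that number to earn its keep.  A minimal sketch (synthetic labels, not the real column) demonstrates the identity.&lt;/p&gt;

```python
# Hypothetical sketch: a constant "predict the mean" baseline.
# For such a predictor, RMSE equals the (population) standard deviation.
import numpy as np

rng = np.random.default_rng(2)
labels = rng.normal(3.4, 3.4, 1000)   # stand-in for the drinks column

baseline = labels.mean()              # always predict the mean
rmse = np.sqrt(np.mean((labels - baseline) ** 2))

print(round(rmse, 3), round(labels.std(), 3))
```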
&lt;p&gt;We already looked at the density plot for &lt;strong&gt;drinks&lt;/strong&gt;.  We now plot the &lt;strong&gt;histograms&lt;/strong&gt; of the features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that each feature encompasses a different range of values.  To comply with Machine Learning best practices, we will &lt;strong&gt;normalize&lt;/strong&gt; the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Raw Features" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/07_Features_Hist.png"&gt;&lt;/p&gt;
&lt;h2&gt;2. Normalize the data&lt;/h2&gt;
&lt;p&gt;We &lt;strong&gt;normalize&lt;/strong&gt; each feature to &lt;strong&gt;zero mean and unit variance&lt;/strong&gt;.  Most blogs describe the &lt;strong&gt;manual&lt;/strong&gt; normalization process.  TensorFlow 2.X, however, provides an &lt;strong&gt;experimental&lt;/strong&gt; normalization engine.&lt;/p&gt;
&lt;p&gt;Import the required packages.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;tf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;tensorflow.keras.layers.experimental&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create a &lt;strong&gt;normalizer&lt;/strong&gt; object.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Normalization&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Feed the normalizer engine our data, so the engine can learn each feature's mean and variance.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;normalizer&lt;/strong&gt; takes a data set as input, and its &lt;strong&gt;numpy()&lt;/strong&gt; method returns a matrix of normalized values.  We pass this &lt;strong&gt;numpy&lt;/strong&gt; matrix to a &lt;strong&gt;Pandas&lt;/strong&gt; DataFrame and then plot the new, normalized histograms.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mcv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;alkphos&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;sgpt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;sgot&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;gammagt&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;
   &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Much better!  The &lt;strong&gt;normalized&lt;/strong&gt; data cluster around &lt;strong&gt;zero&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Histogram of Normalized Features" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/08_Features_Norm_Hist.png"&gt;&lt;/p&gt;
&lt;h2&gt;3. Create a Linear Model&lt;/h2&gt;
&lt;p&gt;Keras makes life easy.  The following line of code creates a linear regression model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;units&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
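
&lt;p&gt;A &lt;strong&gt;Dense&lt;/strong&gt; layer with one unit computes &lt;strong&gt;y = w&amp;#183;x + b&lt;/strong&gt; over the five normalized features, which makes the model ordinary linear regression.  The hypothetical numpy sketch below shows the same functional form and recovers made-up weights with closed-form least squares.&lt;/p&gt;

```python
# Hypothetical sketch: the function a Dense(units=1) layer computes.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (345, 5))        # stand-in for normalized features
w_true = np.array([0.5, -1.2, 0.3, 0.9, -0.4])
b_true = 3.4
y = X @ w_true + b_true               # the linear form y = w.x + b

# Closed-form least squares recovers the same weights and bias
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef.round(2))
```

&lt;p&gt;Keras arrives at the same weights iteratively, with an optimizer instead of a closed form.&lt;/p&gt;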

&lt;p&gt;Every &lt;strong&gt;Machine Learning&lt;/strong&gt; course in history seems to fixate on &lt;strong&gt;Gradient Descent&lt;/strong&gt; for the first few weeks.  In this case, we do not use vanilla &lt;strong&gt;Gradient Descent&lt;/strong&gt; to optimize our model.  Instead, we use &lt;strong&gt;Adam&lt;/strong&gt;, an adaptive variant.  In addition, I set the loss function to &lt;strong&gt;Mean Square Error (MSE)&lt;/strong&gt;.  In practice, you should use &lt;strong&gt;Mean Absolute Error (MAE)&lt;/strong&gt;.  I use &lt;strong&gt;MSE&lt;/strong&gt;, however, in order to drive some interesting thought experiments in the final &lt;strong&gt;interpretations&lt;/strong&gt; section of next month's blog post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run through one hundred epochs to train the model.  We hold out one fifth of the training data to validate the model.  I use an NVIDIA Tesla K80 GPU, which keeps the wall-clock time to under three seconds.  A CPU will take about thirty seconds.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;#turn off loggs&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="c1"&gt;#validation on 20% of the training&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;3.85&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;312&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.16&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.83&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keras records the per-epoch loss values in a &lt;strong&gt;History&lt;/strong&gt; object.  The following function plots the loss curves for us to inspect.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;plt&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;val_loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
           &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;val_loss&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Epoch&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Error [Drinks]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now plot the training history.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Our loss on the train data set (blue line) lands at around nine (MSE), or a root mean square error (RMSE) of three.  This means that, on average, the model misses the true count by about three drinks per day.  We discuss the impacts of this RMSE in the final &lt;strong&gt;interpretations&lt;/strong&gt; section of next month's blog post.&lt;/p&gt;
&lt;p&gt;The validation set, however, fares better, with an MSE of under eight, and an RMSE of ~2.8.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Linear Model Error" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/09_Linear_Model_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;Good loss on the train and validate sets does not mean much.  Data Scientists can overfit a model to their train data, and an overfit model does not generalize well in &lt;strong&gt;the wild&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The proof of the pudding lies in the taste therein... only the error of the &lt;strong&gt;holdout (test)&lt;/strong&gt; set matters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;linear_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;==============================&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.3520&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.217451704088136&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;On the &lt;strong&gt;holdout&lt;/strong&gt; set, the linear model produces an RMSE of ~3.2.&lt;/p&gt;
&lt;h2&gt;4. Train a Dense Neural Network (DNN)&lt;/h2&gt;
&lt;p&gt;Keras lets us assemble a &lt;strong&gt;Dense Neural Network (DNN)&lt;/strong&gt; model layer by layer.  The following function will use Keras to build and compile our DNN model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_and_compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
      &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;relu&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mean_squared_error&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
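
&lt;p&gt;We can predict the parameter counts before Keras reports them: a dense layer holds &lt;strong&gt;inputs &amp;#215; units&lt;/strong&gt; weights plus one bias per unit, and the normalization layer stores a non-trainable mean and variance per feature plus a count.  A quick arithmetic check (mirroring the layer sizes above):&lt;/p&gt;

```python
# Predict the parameter counts that model.summary() will report.
def dense_params(inputs, units):
    # weights (inputs x units) plus one bias per unit
    return inputs * units + units

d1 = dense_params(5, 64)    # first hidden layer
d2 = dense_params(64, 64)   # second hidden layer
d3 = dense_params(64, 1)    # output layer
norm = 5 + 5 + 1            # per-feature mean and variance, plus a count

print(d1, d2, d3, d1 + d2 + d3 + norm)
```

&lt;p&gt;The non-trainable normalization statistics explain the gap between the total and trainable counts.&lt;/p&gt;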

&lt;p&gt;We pass the model a &lt;strong&gt;normalizer&lt;/strong&gt; (created above) to normalize the data before it hits the DNN.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dnn_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;build_and_compile_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Keras prints the model summary to the screen.  The model includes four layers: a normalization layer that accepts a &lt;strong&gt;five feature&lt;/strong&gt; data set, two &lt;strong&gt;64 neuron&lt;/strong&gt; dense layers, and a single-neuron output layer, which provides the prediction for &lt;strong&gt;number of drinks per day&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dnn_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;sequential_12&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;Layer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="n"&gt;Output&lt;/span&gt; &lt;span class="n"&gt;Shape&lt;/span&gt;              &lt;span class="n"&gt;Param&lt;/span&gt; &lt;span class="c1"&gt;#   &lt;/span&gt;
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
&lt;span class="n"&gt;normalization_1&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Normalizati&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="mi"&gt;11&lt;/span&gt;        
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;dense_20&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="mi"&gt;384&lt;/span&gt;       
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;dense_21&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="mi"&gt;4160&lt;/span&gt;      
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;span class="n"&gt;dense_22&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="mi"&gt;65&lt;/span&gt;        
&lt;span class="o"&gt;=================================================================&lt;/span&gt;
&lt;span class="n"&gt;Total&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;620&lt;/span&gt;
&lt;span class="n"&gt;Trainable&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;609&lt;/span&gt;
&lt;span class="n"&gt;Non&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;trainable&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;
&lt;span class="n"&gt;_________________________________________________________________&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
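&lt;p&gt;The parameter counts follow from the layer shapes: each &lt;strong&gt;Dense&lt;/strong&gt; layer holds one weight per input-unit pair plus one bias per unit, and the normalization layer stores a non-trainable mean and variance for each of the five features, plus one count scalar (11 total).  A quick sanity check (a sketch, not part of the original notebook):&lt;/p&gt;

```python
def dense_params(n_in, n_out):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_in * n_out + n_out

print(dense_params(5, 64))   # dense_20 -> 384
print(dense_params(64, 64))  # dense_21 -> 4160
print(dense_params(64, 1))   # dense_22 -> 65
```

&lt;p&gt;The three Dense layers plus the 11 normalization parameters account for the reported total of 4,620.&lt;/p&gt;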

&lt;p&gt;Train the &lt;strong&gt;DNN&lt;/strong&gt; and record the loss for the &lt;strong&gt;train&lt;/strong&gt; and &lt;strong&gt;validation&lt;/strong&gt; data sets (validation split set to 1/5).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dnn_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;validation_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CPU&lt;/span&gt; &lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="mf"&gt;4.01&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;468&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;4.48&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="n"&gt;Wall&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.09&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Plot the loss.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;plot_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
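&lt;p&gt;The post defines the &lt;strong&gt;plot_loss&lt;/strong&gt; helper earlier.  If you do not have it in scope, a minimal Matplotlib sketch looks like the following (the axis labels are assumptions for this data set):&lt;/p&gt;

```python
import matplotlib.pyplot as plt

def plot_loss(history):
    # plot the per-epoch train and validation loss curves
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Error [drinks/day]')
    plt.legend()
    plt.grid(True)
```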

&lt;p&gt;The train loss slopes down and the validation loss holds constant.&lt;/p&gt;
&lt;p&gt;&lt;img alt="DNN Error" src="https://john.soban.ski/images/Fast_And_Easy_Regression_With_Tensorflow/10_Dnn_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;evaluate&lt;/strong&gt; method checks the holdout (test) set.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;DNN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dnn_model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;==============================&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.9154&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Linear Model&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.217451704088136&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;DNN&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.3038437219287813&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The DNN model shows an RMSE of 3.3, worse than the Linear Model's 3.2.&lt;/p&gt;
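&lt;p&gt;Note that &lt;strong&gt;evaluate&lt;/strong&gt; returns the model's loss, evidently mean squared error here, since the code above raises it to the 0.5 power to report RMSE:&lt;/p&gt;

```python
import math

mse = 10.9154           # loss reported by dnn_model.evaluate
rmse = math.sqrt(mse)   # equivalent to mse ** 0.5
print(round(rmse, 4))   # 3.3038
```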
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this demonstration we first used the &lt;strong&gt;requests&lt;/strong&gt; package to pull a dataset directly off the UC Irvine website and stuff the data into a &lt;strong&gt;Pandas&lt;/strong&gt; data frame.  We explored the data using a combination of traditional analytics, &lt;strong&gt;Seaborn&lt;/strong&gt;, &lt;strong&gt;Matplotlib&lt;/strong&gt; and fundamentals of statistics.  We then used the &lt;strong&gt;experimental&lt;/strong&gt; TensorFlow normalizer to normalize our data set.  We also used &lt;strong&gt;TensorFlow&lt;/strong&gt; to create our &lt;strong&gt;Train&lt;/strong&gt;, &lt;strong&gt;Validate&lt;/strong&gt; and &lt;strong&gt;Holdout&lt;/strong&gt; data sets.  &lt;strong&gt;Keras&lt;/strong&gt; provided a vehicle to create both a &lt;strong&gt;linear model&lt;/strong&gt; and a &lt;strong&gt;Dense Neural Network (DNN)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The added complexity of the DNN reduced performance relative to the linear model, which points to &lt;strong&gt;over-fitting&lt;/strong&gt;.  We will address DNN over-fitting &lt;a href="https://john.soban.ski/fast-and-easy-regression-with-tensorflow-part-2.html"&gt;next month&lt;/a&gt;, when we use &lt;strong&gt;Principal Component Analysis (PCA)&lt;/strong&gt; to reduce the dimensionality of the data set and drop features.&lt;/p&gt;</content><category term="Data Science"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category><category term="TensorFlow"></category><category term="Keras"></category></entry><entry><title>Google AutoML Vision - Mods vs. Rockers Revisited!</title><link href="https://john.soban.ski/gcp-automl-vision.html" rel="alternate"></link><published>2020-04-25T12:12:00-04:00</published><updated>2020-04-25T12:12:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2020-04-25:/gcp-automl-vision.html</id><summary type="html">&lt;p&gt;FastAI provides Jupyter notebooks to wrangle data, train models, optimize models and then serve models.&lt;/p&gt;
&lt;p&gt;I recommended FastAI to my Data Scientist friends and they found the FastAI Jupyter layout and workflow both cumbersome and confusing.&lt;/p&gt;
&lt;p&gt;GCP provides the Google AutoML Vision service, an alternative to FastAI (and any roll-your-own …&lt;/p&gt;</summary><content type="html">&lt;p&gt;FastAI provides Jupyter notebooks to wrangle data, train models, optimize models and then serve models.&lt;/p&gt;
&lt;p&gt;I recommended FastAI to my Data Scientist friends and they found the FastAI Jupyter layout and workflow both cumbersome and confusing.&lt;/p&gt;
&lt;p&gt;GCP provides the Google AutoML Vision service, an alternative to FastAI (and any roll-your-own vision service, for that matter), which automates the tedious aspects of AI Vision efforts.&lt;/p&gt;
&lt;p&gt;AutoML Vision simplifies labeling and then automates training, optimization and serving of the model.&lt;/p&gt;
&lt;p&gt;GCP provides a &lt;strong&gt;no code&lt;/strong&gt; method to create, deploy and serve AI Vision models at scale!&lt;/p&gt;
&lt;p&gt;In this HOWTO we will accomplish the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a Google Cloud Storage Bucket&lt;/li&gt;
&lt;li&gt;Acquire and label data&lt;/li&gt;
&lt;li&gt;Train a Vision model that identifies &lt;strong&gt;Mods&lt;/strong&gt; vs. &lt;strong&gt;Rockers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Fix data labels via the GUI&lt;/li&gt;
&lt;li&gt;Re-train and tune a vision model&lt;/li&gt;
&lt;li&gt;Serve a Vision model&lt;/li&gt;
&lt;li&gt;Send our served model a test image&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Create a bucket&lt;/h2&gt;
&lt;p&gt;We must give the &lt;a href="https://cloud.google.com/vision/automl/docs"&gt;AutoML vision&lt;/a&gt; service some labeled data.&lt;/p&gt;
&lt;p&gt;We will create a &lt;a href="https://cloud.google.com/storage"&gt;Google Cloud Storage&lt;/a&gt; bucket for this purpose.&lt;/p&gt;
&lt;p&gt;You will upload two folders to this bucket, one for each class, &lt;strong&gt;mods&lt;/strong&gt; and &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You will also upload a line-delimited CSV file to this bucket that records the URI of each image in the bucket, followed by a label.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;gs://mods-rockers/mods/00000000.jpg,mods
gs://mods-rockers/mods/00000001.jpg,mods
gs://mods-rockers/mods/00000002.jpg,mods

&amp;lt;snip&amp;gt;

gs://mods-rockers/rockers/00000097.jpg,rockers
gs://mods-rockers/rockers/00000098.jpg,rockers
gs://mods-rockers/rockers/00000099.jpg,rockers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;IMPORTANT:  Ensure that you use a regional bucket in us-central1, with location type Region and storage class Standard.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you do not use the proper bucket configuration, you will receive the following error when you attempt to import your dataset.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import Fail" src="https://john.soban.ski/images/Gcp_Automl_Vision/00_Import_Fail_Wrong_Region.png"&gt;&lt;/p&gt;
&lt;p&gt;The following commands grant AutoML the permissions it needs and then create the bucket.&lt;/p&gt;
&lt;p&gt;From the cloudshell, export your &lt;strong&gt;USERNAME&lt;/strong&gt; as an environment variable.&lt;/p&gt;
&lt;p&gt;Be sure to enter your &lt;strong&gt;USERNAME&lt;/strong&gt; in the following command:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your&lt;span class="w"&gt; &lt;/span&gt;email&lt;span class="w"&gt; &lt;/span&gt;address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now grant your user the AutoML admin role and give the AutoML service account access to the project.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DEVSHELL_PROJECT_ID&lt;/span&gt;
sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;projects&lt;span class="w"&gt; &lt;/span&gt;add-iam-policy-binding&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--member&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;user:&lt;/span&gt;&lt;span class="nv"&gt;$USERNAME&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--role&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;roles/automl.admin&amp;quot;&lt;/span&gt;
Updated&lt;span class="w"&gt; &lt;/span&gt;IAM&lt;span class="w"&gt; &lt;/span&gt;policy&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;project&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;]&lt;/span&gt;.
bindings:
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;user:my@email.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/automl.admin
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:service-4011961642212@gcp-sa-automl.iam.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/automl.serviceAgent
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:service-4011961642212@compute-system.iam.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/compute.serviceAgent
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:4011961642212-compute@developer.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;serviceAccount:4011961642212@cloudservices.gserviceaccount.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/editor
-&lt;span class="w"&gt; &lt;/span&gt;members:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;user:smy@email.com
&lt;span class="w"&gt;  &lt;/span&gt;role:&lt;span class="w"&gt; &lt;/span&gt;roles/owner
etag:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;AxYhhFi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;
version:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$
&lt;span class="w"&gt; &lt;/span&gt;gcloud&lt;span class="w"&gt; &lt;/span&gt;projects&lt;span class="w"&gt; &lt;/span&gt;add-iam-policy-binding&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--member&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;serviceAccount:custom-vision@appspot.gserviceaccount.com&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;--role&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;roles/ml.admin&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now create the bucket:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;sobanski_htc@cloudshell:~&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;mods-rocker-project&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;gsutil&lt;span class="w"&gt; &lt;/span&gt;mb&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;standard&lt;span class="w"&gt; &lt;/span&gt;-l&lt;span class="w"&gt; &lt;/span&gt;us-central1&lt;span class="w"&gt; &lt;/span&gt;gs://&amp;lt;your-bucket-name&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;Get a dataset&lt;/h2&gt;
&lt;p&gt;If you do not have a labeled dataset, use the &lt;a href="https://john.soban.ski/fastai-flask.html"&gt;FastAI&lt;/a&gt; dataset notebook to quickly download a labeled dataset, separated by folder.&lt;/p&gt;
&lt;p&gt;If you do have a labeled dataset you can skip this section.&lt;/p&gt;
&lt;h3&gt;Launch AI Platform&lt;/h3&gt;
&lt;p&gt;Spin up an AI platform notebook for this task.&lt;/p&gt;
&lt;p&gt;Log into the Google Cloud Platform (GCP) console at &lt;a href="https://console.cloud.google.com"&gt;console.cloud.google.com&lt;/a&gt; [Non-referral link].&lt;/p&gt;
&lt;p&gt;Type &lt;strong&gt;notebooks&lt;/strong&gt; into the search bar, click &lt;strong&gt;AI Platform Notebooks&lt;/strong&gt; and then click &lt;strong&gt;Enable API&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enable_Notebooks" src="https://john.soban.ski/images/Gcp_Automl_Vision/01_Enable_Notebooks.png"&gt;&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;New Instance&lt;/strong&gt; and then select &lt;strong&gt;Python&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Create_Python_Notebook" src="https://john.soban.ski/images/Gcp_Automl_Vision/02_Create_Python_Notebook.png"&gt;&lt;/p&gt;
&lt;p&gt;Launch a terminal.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Launch_Terminal" src="https://john.soban.ski/images/Gcp_Automl_Vision/03_Launch_Terminal.png"&gt;&lt;/p&gt;
&lt;h3&gt;Install FastAI Course v3&lt;/h3&gt;
&lt;p&gt;From the terminal install the FastAI course v3.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/fastai/course-v3.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Install the required &lt;a href="https://john.soban.ski/fastai-flask.html"&gt;FastAI&lt;/a&gt; libs.&lt;/p&gt;
&lt;p&gt;Since this is an ephemeral notebook, you will not need to worry about virtual environments.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;fastai
Collecting&lt;span class="w"&gt; &lt;/span&gt;fastai
&lt;span class="w"&gt;  &lt;/span&gt;Downloading&lt;span class="w"&gt; &lt;/span&gt;fastai-1.0.60-py3-none-any.whl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;237&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kB&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;████████████████████████████████&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;237&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kB&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.6&lt;span class="w"&gt; &lt;/span&gt;MB/s&lt;span class="w"&gt; &lt;/span&gt;
Requirement&lt;span class="w"&gt; &lt;/span&gt;already&lt;span class="w"&gt; &lt;/span&gt;satisfied:&lt;span class="w"&gt; &lt;/span&gt;numpy&amp;gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.15&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/opt/conda/lib/python3.7/site-packages&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;fastai&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;.18.1&lt;span class="o"&gt;)&lt;/span&gt;

...

Building&lt;span class="w"&gt; &lt;/span&gt;wheels&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;collected&lt;span class="w"&gt; &lt;/span&gt;packages:&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3
&lt;span class="w"&gt;  &lt;/span&gt;Building&lt;span class="w"&gt; &lt;/span&gt;wheel&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;setup.py&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;...&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;Created&lt;span class="w"&gt; &lt;/span&gt;wheel&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia_ml_py3-7.352.0-py3-none-any.whl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;19189&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;42f79de382946ce4af88196dfdcf55cda496237f7db498bd2cc1cce3f788fba6
&lt;span class="w"&gt;  &lt;/span&gt;Stored&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;directory:&lt;span class="w"&gt; &lt;/span&gt;/home/jupyter/.cache/pip/wheels/df/99/da/c34f202dc8fd1dffd35e0ecf1a7d7f8374ca05fbcbaf974b83
Successfully&lt;span class="w"&gt; &lt;/span&gt;built&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3
Installing&lt;span class="w"&gt; &lt;/span&gt;collected&lt;span class="w"&gt; &lt;/span&gt;packages:&lt;span class="w"&gt; &lt;/span&gt;wasabi,&lt;span class="w"&gt; &lt;/span&gt;srsly,&lt;span class="w"&gt; &lt;/span&gt;murmurhash,&lt;span class="w"&gt; &lt;/span&gt;cymem,&lt;span class="w"&gt; &lt;/span&gt;plac,&lt;span class="w"&gt; &lt;/span&gt;preshed,&lt;span class="w"&gt; &lt;/span&gt;catalogue,&lt;span class="w"&gt; &lt;/span&gt;blis,&lt;span class="w"&gt; &lt;/span&gt;thinc,&lt;span class="w"&gt; &lt;/span&gt;spacy,&lt;span class="w"&gt; &lt;/span&gt;torch,&lt;span class="w"&gt; &lt;/span&gt;torchvision,&lt;span class="w"&gt; &lt;/span&gt;nvidia-ml-py3,&lt;span class="w"&gt; &lt;/span&gt;fastprogress,&lt;span class="w"&gt; &lt;/span&gt;fastai
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Download the images&lt;/h3&gt;
&lt;p&gt;From the GUI, Navigate to &lt;strong&gt;nbs --&amp;gt; dl1 --&amp;gt; lesson2-download.ipynb.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Select &lt;strong&gt;Edit --&amp;gt; Clear All Outputs&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Follow the instructions up to the &lt;strong&gt;Download images&lt;/strong&gt; section to create and upload your &lt;strong&gt;mods.csv&lt;/strong&gt; and &lt;strong&gt;rockers.csv&lt;/strong&gt; tables.&lt;/p&gt;
&lt;p&gt;These tables include URLs that point to images from each class.&lt;/p&gt;
&lt;p&gt;Once you are at the &lt;strong&gt;Download Images&lt;/strong&gt; section, replace the presented code with the code below.  &lt;/p&gt;
&lt;p&gt;This prevents you from needing to scroll up to the prior section.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;mods&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rockers&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;path&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;data/brighton_seafront&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;.csv&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;
    &lt;span class="n"&gt;download_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_pics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;classes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;verify_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I present the updated code in the following graphic.&lt;/p&gt;
&lt;p&gt;&lt;img alt="FastAI_Notebook" src="https://john.soban.ski/images/Gcp_Automl_Vision/04_FastAI_Notebook.png"&gt;&lt;/p&gt;
&lt;p&gt;Be sure to run the next &lt;strong&gt;verify_images&lt;/strong&gt; cell.&lt;/p&gt;
&lt;h2&gt;Label the data&lt;/h2&gt;
&lt;p&gt;At this point you should have two folders, one named &lt;strong&gt;mods&lt;/strong&gt; and one named &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you used the AI platform to create your data set folders, then change directories to &lt;strong&gt;brighton_seafront&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;course-v3/nbs/dl1/data/brighton_seafront/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Export the name of your Google Cloud Storage (GCS) bucket.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Do not use the bucket name &lt;strong&gt;mods-rockers&lt;/strong&gt; since I own that bucket.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;BUCKET_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mods-rockers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;brighton_seafront&lt;/strong&gt; folder contains two sub-directories, &lt;strong&gt;mods&lt;/strong&gt; and &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Create a CSV file that records the URI of each image, followed by its label.&lt;/p&gt;
&lt;p&gt;All of the images in the &lt;strong&gt;mods&lt;/strong&gt; directory will get the label &lt;strong&gt;mods&lt;/strong&gt; and all of the images in the &lt;strong&gt;rockers&lt;/strong&gt; directory will get the label &lt;strong&gt;rockers&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;ls&lt;span class="w"&gt; &lt;/span&gt;mods&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;/mods/&lt;span class="nv"&gt;$name&lt;/span&gt;,mods&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;name&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;ls&lt;span class="w"&gt; &lt;/span&gt;rockers&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;/rockers/&lt;span class="nv"&gt;$name&lt;/span&gt;,rockers&lt;span class="w"&gt; &lt;/span&gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
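&lt;p&gt;The shell loops above can also be sketched in Python.  The snippet below builds the same CSV rows from a directory tree; &lt;strong&gt;build_labels&lt;/strong&gt; and &lt;strong&gt;my-example-bucket&lt;/strong&gt; are illustrative names, not part of the FastAI or GCP tooling.&lt;/p&gt;

```python
from pathlib import Path
import tempfile

def build_labels(root, bucket, labels=("mods", "rockers")):
    """Return CSV rows of the form gs://bucket/label/file,label."""
    rows = []
    for label in labels:
        # Each sub-directory name doubles as the class label
        for image in sorted((Path(root) / label).iterdir()):
            rows.append(f"gs://{bucket}/{label}/{image.name},{label}")
    return rows

# Demo against a throwaway copy of the brighton_seafront layout
root = Path(tempfile.mkdtemp())
for label in ("mods", "rockers"):
    (root / label).mkdir()
    (root / label / "00000000.jpg").touch()

rows = build_labels(root, "my-example-bucket")
print("\n".join(rows))
```

&lt;p&gt;Point &lt;strong&gt;root&lt;/strong&gt; at your real &lt;strong&gt;brighton_seafront&lt;/strong&gt; directory and write the rows to &lt;strong&gt;labeled_data.csv&lt;/strong&gt; to reproduce the shell version.&lt;/p&gt;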

&lt;p&gt;The first few lines of the &lt;strong&gt;labeled_data.csv&lt;/strong&gt; file read:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;head&lt;span class="w"&gt; &lt;/span&gt;-n3&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv
gs://mods-rockers/mods/00000000.jpg,mods
gs://mods-rockers/mods/00000001.jpg,mods
gs://mods-rockers/mods/00000002.jpg,mods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;And the last few lines read:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;tail&lt;span class="w"&gt; &lt;/span&gt;-n3&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv
gs://mods-rockers/rockers/00000097.jpg,rockers
gs://mods-rockers/rockers/00000098.jpg,rockers
gs://mods-rockers/rockers/00000099.jpg,rockers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Copy &lt;strong&gt;labeled_data.csv&lt;/strong&gt;, the &lt;strong&gt;mods&lt;/strong&gt; folder, the &lt;strong&gt;rockers&lt;/strong&gt; folder and all of their contents to your GCS bucket.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;~/course-v3/nbs/dl1/data/brighton_seafront$&lt;span class="w"&gt; &lt;/span&gt;gsutil&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;labeled_data.csv&lt;span class="w"&gt; &lt;/span&gt;mods/&lt;span class="w"&gt; &lt;/span&gt;rockers/&lt;span class="w"&gt; &lt;/span&gt;gs://&lt;span class="nv"&gt;$BUCKET_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;At this point, be sure to destroy the AI Platform notebook so that you do not incur further charges.&lt;/p&gt;
&lt;h2&gt;Train the model&lt;/h2&gt;
&lt;h3&gt;Enable the API&lt;/h3&gt;
&lt;p&gt;Log into the Google Cloud Platform (GCP) console at &lt;a href="https://console.cloud.google.com"&gt;console.cloud.google.com&lt;/a&gt; [Once again, this is a non-affiliate link].  &lt;/p&gt;
&lt;p&gt;In the search bar, type &lt;strong&gt;Vision&lt;/strong&gt; and then click &lt;strong&gt;ENABLE AUTOML API&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enable_API" src="https://john.soban.ski/images/Gcp_Automl_Vision/05_Enable_API.png"&gt;&lt;/p&gt;
&lt;h3&gt;Upload your Dataset&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;Get Started --&amp;gt; New Dataset  --&amp;gt; Multi-label classification&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="AutoML Import Dataset" src="https://john.soban.ski/images/Gcp_Automl_Vision/06_AutoML_Import_Dataset.png"&gt;&lt;/p&gt;
&lt;p&gt;Under &lt;strong&gt;Select files to import&lt;/strong&gt;, select &lt;strong&gt;Select a CSV file on Cloud Storage&lt;/strong&gt; and then enter the URI for the &lt;strong&gt;labeled_data.csv&lt;/strong&gt; file on your &lt;strong&gt;bucket&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Select_Bucket_With_Labeled_CSV" src="https://john.soban.ski/images/Gcp_Automl_Vision/07_Select_Bucket_With_Labeled_CSV.png"&gt;&lt;/p&gt;
&lt;p&gt;The import will take several minutes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Import_Image" src="https://john.soban.ski/images/Gcp_Automl_Vision/08_Import_Image.png"&gt;&lt;/p&gt;
&lt;h3&gt;View Images&lt;/h3&gt;
&lt;p&gt;After the import completes, you will see your labeled images.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Successful_Import" src="https://john.soban.ski/images/Gcp_Automl_Vision/09_Successful_Import.png"&gt;&lt;/p&gt;
&lt;p&gt;A brief perusal of the images shows that some pictures (highlighted in red) include incorrect labels.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad_Labels" src="https://john.soban.ski/images/Gcp_Automl_Vision/10_Bad_Labels.png"&gt;&lt;/p&gt;
&lt;p&gt;For now, let's ignore the bad labels and see what happens.  &lt;/p&gt;
&lt;h2&gt;Train your Model&lt;/h2&gt;
&lt;p&gt;Select &lt;strong&gt;Start Training&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Start training bad labels" src="https://john.soban.ski/images/Gcp_Automl_Vision/11_Start_Training_Bad_Labels.png"&gt;&lt;/p&gt;
&lt;p&gt;The training will consume 16 GPU hours.&lt;/p&gt;
&lt;p&gt;Since GCP farms the training out in parallel, those 16 GPU hours complete in less than an hour of wall-clock time.&lt;/p&gt;
&lt;h2&gt;Evaluate your Model&lt;/h2&gt;
&lt;p&gt;After the training completes, click &lt;strong&gt;Evaluate&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;You will see that the model provides sub-90% precision and recall, as shown in the confusion matrix (highlighted in green).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad_Labels_Results" src="https://john.soban.ski/images/Gcp_Automl_Vision/12_Bad_Labels_Results.png"&gt;&lt;/p&gt;
&lt;p&gt;Drill down for more details and you will see that the false positives for &lt;strong&gt;mods&lt;/strong&gt; include two pictures of &lt;strong&gt;mods&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This points to a labeling problem.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad_Labels_Cause_Errors" src="https://john.soban.ski/images/Gcp_Automl_Vision/13_Bad_Labels_Cause_Errors.png"&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;NOTE: Upon second glance, the picture on the right depicts Teddy Boys.  Should I label Teddy Boys Mods, Rockers or delete the picture?  Answer in the comments below!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Fix Labels&lt;/h2&gt;
&lt;p&gt;Click &lt;strong&gt;images&lt;/strong&gt; and change the labels of the troublesome images (or just delete them if you're lazy right now).&lt;/p&gt;
&lt;p&gt;I have a &lt;strong&gt;rocker&lt;/strong&gt; motorcycle labeled &lt;strong&gt;mod&lt;/strong&gt; and a picture that includes both &lt;strong&gt;mods&lt;/strong&gt; and &lt;strong&gt;rockers&lt;/strong&gt; labeled as just &lt;strong&gt;mods&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Delete_Confusing_Ones" src="https://john.soban.ski/images/Gcp_Automl_Vision/14_Delete_Confusing_Ones.png"&gt;&lt;/p&gt;
&lt;p&gt;I like this picture: a bunch of rockers attempt to murder two helmet-less mods, who find it funny.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Both_Mod_and_Rocker" src="https://john.soban.ski/images/Gcp_Automl_Vision/15_Both_Mod_and_Rocker.png"&gt;&lt;/p&gt;
&lt;h2&gt;Re-train model&lt;/h2&gt;
&lt;p&gt;After we clean up the data and re-train, we see a perfect confusion matrix.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Much_Better" src="https://john.soban.ski/images/Gcp_Automl_Vision/16_Much_Better.png"&gt;&lt;/p&gt;
&lt;p&gt;Drilling down, we see that our model gave a &lt;strong&gt;mod&lt;/strong&gt; under arrest the &lt;strong&gt;rocker&lt;/strong&gt; label.&lt;/p&gt;
&lt;p&gt;&lt;img alt="One_Wrong" src="https://john.soban.ski/images/Gcp_Automl_Vision/17_One_Wrong.png"&gt;&lt;/p&gt;
&lt;h2&gt;Deploy the model&lt;/h2&gt;
&lt;p&gt;Unlike FastAI, the Google AI Platform provides one-click deployment of your model.&lt;/p&gt;
&lt;p&gt;Click &lt;strong&gt;Test &amp;amp; Use&lt;/strong&gt; and then &lt;strong&gt;Deploy Model&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;GCP takes several minutes to deploy the model.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Deploy_The_Model" src="https://john.soban.ski/images/Gcp_Automl_Vision/18_Deploy_The_Model.png"&gt;&lt;/p&gt;
&lt;p&gt;After you deploy the model, click the &lt;strong&gt;Upload Images&lt;/strong&gt; button and upload up to ten images.&lt;/p&gt;
&lt;p&gt;I upload a picture of myself at the park.&lt;/p&gt;
&lt;p&gt;The model reports, with 93% certainty, that I fall under the &lt;strong&gt;Mod&lt;/strong&gt; classification rather than &lt;strong&gt;Rocker&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Serve_Sobanski" src="https://john.soban.ski/images/Gcp_Automl_Vision/19_Serve_Sobanski.png"&gt;&lt;/p&gt;
&lt;p&gt;My paisley shirt and Italian sunglasses give credence to this, although I do have a &lt;strong&gt;Rocker&lt;/strong&gt; hair cut (styled with &lt;a href="https://en.wikipedia.org/wiki/Pomade"&gt;Royal Crown&lt;/a&gt;, no less).&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;GCP provides an AutoML Vision service that automates the manual FastAI tasks of training, optimizing and serving a vision model.&lt;/p&gt;
&lt;p&gt;AutoML Vision also provides an easy-to-use, intuitive labeling service.&lt;/p&gt;
&lt;p&gt;If you can get hold of labeled data, then I recommend the AutoML Vision service.&lt;/p&gt;
&lt;p&gt;Use the AutoML Vision service for serious tasks.  Google throws the kitchen sink at training and tuning.&lt;/p&gt;
&lt;p&gt;Each model consumed sixteen (16) hours of GPU time!  &lt;/p&gt;
&lt;p&gt;My FastAI model ran for two or three minutes, on one GPU.&lt;/p&gt;
&lt;p&gt;My two runs (32 GPU hours total) cost about $100.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Pricey" src="https://john.soban.ski/images/Gcp_Automl_Vision/20_Pricey.png"&gt;&lt;/p&gt;
&lt;p&gt;Google, however, gave me $176.00 to experiment with the model training and serving.&lt;/p&gt;
&lt;p&gt;From Google:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Free Trial!
You can try AutoML Vision Object Detection for free by using 40 free node hours each for training and online prediction, and 1 free node hour for batch prediction, per billing account. Your free node hours are issued right before you create your first model. For batch prediction, the free node hour is issued at the time of the first batch prediction is initiated. You have up to one year to use them.
Prices are listed in US Dollars (USD). If you pay in a currency other than USD, the prices listed in your currency on Cloud Platform SKUs apply.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I did not need to eat into the $300 in free credits Google provided when I signed up for GCP!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Free Goody" src="https://john.soban.ski/images/Gcp_Automl_Vision/21_Free_Goody.png"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="GCP"></category><category term="Data Science"></category><category term="Machine Learning"></category></entry><entry><title>FastAI x Flask - Mods vs. Rockers!</title><link href="https://john.soban.ski/fastai-flask.html" rel="alternate"></link><published>2019-07-31T22:26:00-04:00</published><updated>2019-07-31T22:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2019-07-31:/fastai-flask.html</id><summary type="html">&lt;p&gt;Fastai provides helper functions on top of Pytorch to help us wrangle, clean, and process data.  In this HOWTO we will accomplish the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy an AWS g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Compile and install NVIDIA drivers on our g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Use a Jupyter notebook to clean and organize image data …&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Fastai provides helper functions on top of Pytorch to help us wrangle, clean, and process data.  In this HOWTO we will accomplish the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy an AWS g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Compile and install NVIDIA drivers on our g3.8xlarge instance&lt;/li&gt;
&lt;li&gt;Use a Jupyter notebook to clean and organize image data&lt;/li&gt;
&lt;li&gt;Train a model that classifies Mods vs. Rockers&lt;/li&gt;
&lt;li&gt;Deploy an image classification web app using Flask&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Fastai delivers a series of videos and Jupyter notebooks that teach us how to quickly apply ML/AI techniques to real-world problems.  The Jupyter notebooks require Python 3 libraries and a GPU.  Crestle.ai provided a one-click deployment of the required environment.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Crestle Splash" src="https://john.soban.ski/images/Fastai_Flask/01_Crestle.png"&gt; &lt;/p&gt;
&lt;p&gt;You enter your credit card and then click "run."&lt;/p&gt;
&lt;p&gt;&lt;img alt="Enter your credit card" src="https://john.soban.ski/images/Fastai_Flask/02_Pay.png"&gt;&lt;/p&gt;
&lt;p&gt;Unfortunately, as of August 31st, 2019, Crestle no longer provides their service to the public.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;As the platform and its usage continues to evolve, we've decided to move Crestle.ai to an Enterprise model. This means that as of August 31, 2019, the service will no longer be available to individual participants.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While other platforms still provide this service to individual users, I decided that it would be a good exercise to deploy our own environment.&lt;/p&gt;
&lt;h2&gt;Deploy the FASTAI environment to AWS&lt;/h2&gt;
&lt;p&gt;In order to crunch the ML/AI data at a reasonable pace, I recommend a g3.8xlarge instance.  Select this instance type with the Amazon Linux 2 operating system.&lt;/p&gt;
&lt;h3&gt;Install the NVIDIA driver&lt;/h3&gt;
&lt;p&gt;The AWS GPU instances require a CUDA driver to enable Python to send jobs to the card.  &lt;/p&gt;
&lt;p&gt;The AWS &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html"&gt;documentation&lt;/a&gt; provides decent instructions on how to install this driver, and I summarize the steps below.&lt;/p&gt;
&lt;p&gt;The AWS Linux AMI provides the AWS Command Line Interface (CLI) and we use that CLI to fetch the latest NVIDIA driver.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;aws&lt;span class="w"&gt; &lt;/span&gt;s3&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;--recursive&lt;span class="w"&gt; &lt;/span&gt;s3://ec2-linux-nvidia-drivers/latest/&lt;span class="w"&gt; &lt;/span&gt;.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The driver requires certain development packages, so we will install them along with other useful server packages into our Operating System.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: Python3 requires installation of the AWS EPEL repository&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;amazon-linux-extras&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;epel
$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;yum&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;bzip2&lt;span class="w"&gt; &lt;/span&gt;gcc&lt;span class="w"&gt; &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;kernel-devel-*&lt;span class="w"&gt; &lt;/span&gt;python36&lt;span class="w"&gt; &lt;/span&gt;python36-dev*&lt;span class="w"&gt; &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Enter the following configurations into our &lt;strong&gt;modprobe&lt;/strong&gt; blacklist.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;vim&lt;span class="w"&gt; &lt;/span&gt;/etc/modprobe.d/blacklist.conf
blacklist&lt;span class="w"&gt; &lt;/span&gt;vga16fb
blacklist&lt;span class="w"&gt; &lt;/span&gt;nouveau
blacklist&lt;span class="w"&gt; &lt;/span&gt;rivafb
blacklist&lt;span class="w"&gt; &lt;/span&gt;nvidiafb
blacklist&lt;span class="w"&gt; &lt;/span&gt;rivatv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now edit our boot configuration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;vim&lt;span class="w"&gt; &lt;/span&gt;/etc/default/grub
&lt;span class="nv"&gt;GRUB_CMDLINE_LINUX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;rblacklist=nouveau&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Re-compile and install the boot configuration.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;grub2-mkconfig&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;/boot/grub2/grub.cfg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Edit the NVIDIA conf file to turn off GUI Features.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;vim&lt;span class="w"&gt; &lt;/span&gt;/etc/nvidia/gridd.conf
&lt;span class="nv"&gt;EnableUI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;FALSE
&lt;span class="nv"&gt;FeatureType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="nv"&gt;IgnoreSP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;TRUE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run the NVIDIA compiler and reboot.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;/bin/sh&lt;span class="w"&gt; &lt;/span&gt;./NVIDIA-Linux-x86_64-430.30-grid.run&lt;span class="w"&gt; &lt;/span&gt;
$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;reboot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You may see the following warning, just hit OK and proceed, since you will not need the GUI features of the driver.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;nvidia-installer&lt;span class="w"&gt; &lt;/span&gt;was&lt;span class="w"&gt; &lt;/span&gt;forced&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;guess&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;library&lt;span class="w"&gt; &lt;/span&gt;path&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib64&amp;#39;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;module&lt;span class="w"&gt; &lt;/span&gt;path
&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/usr/lib64/xorg/modules&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;these&lt;span class="w"&gt; &lt;/span&gt;paths&lt;span class="w"&gt; &lt;/span&gt;were&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;queryable&lt;span class="w"&gt; &lt;/span&gt;from&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;system.&lt;span class="w"&gt;  &lt;/span&gt;If&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;fails
&lt;span class="w"&gt;           &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;find&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;NVIDIA&lt;span class="w"&gt; &lt;/span&gt;X&lt;span class="w"&gt; &lt;/span&gt;driver&lt;span class="w"&gt; &lt;/span&gt;module,&lt;span class="w"&gt; &lt;/span&gt;please&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;pkg-config&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;utility&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;the
&lt;span class="w"&gt;           &lt;/span&gt;X.Org&lt;span class="w"&gt; &lt;/span&gt;SDK/development&lt;span class="w"&gt; &lt;/span&gt;package&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;distribution&lt;span class="w"&gt; &lt;/span&gt;and&lt;span class="w"&gt; &lt;/span&gt;reinstall&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;driver.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Once the server reboots, you can test to see if the driver installed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;nvidia-smi&lt;span class="w"&gt; &lt;/span&gt;-q&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;head&lt;/span&gt;

&lt;span class="o"&gt;==============&lt;/span&gt;NVSMI&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;LOG&lt;/span&gt;&lt;span class="o"&gt;==============&lt;/span&gt;

Timestamp&lt;span class="w"&gt;                           &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;Wed&lt;span class="w"&gt; &lt;/span&gt;Aug&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;00&lt;/span&gt;:07:40&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;
Driver&lt;span class="w"&gt; &lt;/span&gt;Version&lt;span class="w"&gt;                      &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;430&lt;/span&gt;.30
CUDA&lt;span class="w"&gt; &lt;/span&gt;Version&lt;span class="w"&gt;                        &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.2

Attached&lt;span class="w"&gt; &lt;/span&gt;GPUs&lt;span class="w"&gt;                       &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
GPU&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;00000000&lt;/span&gt;:00:1D.0
&lt;span class="w"&gt;    &lt;/span&gt;Product&lt;span class="w"&gt; &lt;/span&gt;Name&lt;span class="w"&gt;                    &lt;/span&gt;:&lt;span class="w"&gt; &lt;/span&gt;Tesla&lt;span class="w"&gt; &lt;/span&gt;M60
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We will use a Python virtual environment to check whether Python can use the GPU.&lt;/p&gt;
&lt;p&gt;FastAI recommends Conda to install the required libraries, and we will use Conda later.&lt;/p&gt;
&lt;p&gt;I find Conda, however, to be &lt;strong&gt;&lt;em&gt;extremely&lt;/em&gt;&lt;/strong&gt; slow, so for this quick &lt;strong&gt;hello world&lt;/strong&gt; we will use PIP.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;python3&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;venv&lt;span class="w"&gt; &lt;/span&gt;aws
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;aws/
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./bin/activate
&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-U&lt;span class="w"&gt; &lt;/span&gt;pip
&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;torch
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now run the Python interpreter and verify that Python can use the GPU.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;python
Python&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.6.8&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;default,&lt;span class="w"&gt; &lt;/span&gt;Apr&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2019&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;:02:35&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;GCC&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.8.5&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;20150623&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Red&lt;span class="w"&gt; &lt;/span&gt;Hat&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.8.5-36&lt;span class="o"&gt;)]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;linux
Type&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;help&amp;quot;&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;copyright&amp;quot;&lt;/span&gt;,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;credits&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;or&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;license&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;more&lt;span class="w"&gt; &lt;/span&gt;information.
&amp;gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;import&lt;span class="w"&gt; &lt;/span&gt;torch
&amp;gt;&amp;gt;&amp;gt;&lt;span class="w"&gt; &lt;/span&gt;torch.cuda.is_available&lt;span class="o"&gt;()&lt;/span&gt;
True
&amp;gt;&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Install the Conda libraries&lt;/h3&gt;
&lt;p&gt;If Python can use the GPU, deactivate the virtual environment to proceed.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;aws&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;deactivate
$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now install Conda.  You may want to visit the Anaconda website to see if they provide an updated version.&lt;/p&gt;
&lt;p&gt;Once installed, log out of your session and then log back in to add the &lt;strong&gt;&lt;em&gt;conda&lt;/em&gt;&lt;/strong&gt; command to your path.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;-XGET&lt;span class="w"&gt; &lt;/span&gt;https://repo.anaconda.com/archive/Anaconda2-2019.03-Linux-x86_64.sh&lt;span class="w"&gt; &lt;/span&gt;-O
$&lt;span class="w"&gt; &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;Anaconda2-2019.03-Linux-x86_64.sh
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With Conda installed, create and activate a Conda environment.  I named mine &lt;strong&gt;&lt;em&gt;boot&lt;/em&gt;&lt;/strong&gt; (For Bootstrap).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;conda&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;boot&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;.6.8&lt;span class="w"&gt; &lt;/span&gt;anaconda
$&lt;span class="w"&gt; &lt;/span&gt;conda&lt;span class="w"&gt; &lt;/span&gt;activate&lt;span class="w"&gt; &lt;/span&gt;boot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;FastAI requires Torch, and Torch requires OS-level development libraries.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;yum&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;libxml2-devel&lt;span class="w"&gt; &lt;/span&gt;libxslt1-dev&lt;span class="w"&gt; &lt;/span&gt;zlib1g-dev&lt;span class="w"&gt; &lt;/span&gt;g++
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, since I find Conda to be &lt;strong&gt;&lt;em&gt;painfully&lt;/em&gt;&lt;/strong&gt; slow, I will just install the packages via PIP (while in my conda environment).&lt;/p&gt;
&lt;p&gt;Feel free to use the &lt;strong&gt;&lt;em&gt;conda install&lt;/em&gt;&lt;/strong&gt; command if you prefer.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note:  If you choose to use Conda channels to install packages, ensure that you have the correct package name.  I, for example, commanded Conda to install &lt;strong&gt;pytorch&lt;/strong&gt; when the package is in fact named &lt;strong&gt;torch&lt;/strong&gt;.  As a result, Conda searched for several hours for a package named &lt;strong&gt;pytorch&lt;/strong&gt;, with no errors or warnings.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;fastai&lt;span class="w"&gt; &lt;/span&gt;torch&lt;span class="w"&gt; &lt;/span&gt;flask-bootstrap&lt;span class="w"&gt; &lt;/span&gt;flask-wtf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We now have everything needed to start the notebook.&lt;/p&gt;
&lt;h3&gt;Start the notebook&lt;/h3&gt;
&lt;p&gt;Use &lt;strong&gt;&lt;em&gt;git&lt;/em&gt;&lt;/strong&gt; to pull the FASTAI course to your server, navigate to the &lt;strong&gt;&lt;em&gt;dl1&lt;/em&gt;&lt;/strong&gt; directory and then start the server.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/fastai/course-v3.git
$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;course-v3/nbs/dl1/
$&lt;span class="w"&gt; &lt;/span&gt;jupyter&lt;span class="w"&gt; &lt;/span&gt;notebook&lt;span class="w"&gt; &lt;/span&gt;--ip&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Ensure that your EC2 instance's security group provides access to port 8888, and then access the server on this port via a web browser.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Notebook" src="https://john.soban.ski/images/Fastai_Flask/03_Jupyter_Notebook.png"&gt;&lt;/p&gt;
&lt;h3&gt;Use the notebook to clean the data&lt;/h3&gt;
&lt;p&gt;Launch the &lt;strong&gt;&lt;em&gt;lesson2-download.ipynb&lt;/em&gt;&lt;/strong&gt; notebook, which provides the data download and cleaning steps.&lt;/p&gt;
&lt;p&gt;The instructions are quite clear.  You use the Chrome developer tools to gather the URLs of pictures from Google images that match your classes.&lt;/p&gt;
&lt;p&gt;I ran these instructions and produced &lt;strong&gt;&lt;em&gt;mods.csv&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;rockers.csv&lt;/em&gt;&lt;/strong&gt; files.&lt;/p&gt;
&lt;p&gt;Once you have these files on your server, the steps get confusing, since the notebook expects you to scroll back for each section.&lt;/p&gt;
&lt;p&gt;To ease this confusion, I created separate cells for the &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;rockers&lt;/em&gt;&lt;/strong&gt; steps.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt; cells set folder to &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt;, file to &lt;strong&gt;&lt;em&gt;mods.csv&lt;/em&gt;&lt;/strong&gt;, and the parent folder to &lt;strong&gt;&lt;em&gt;data/brighton_seafront&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I also added a cell to download the files.&lt;/p&gt;
&lt;p&gt;The notebook directs us to upload the &lt;strong&gt;&lt;em&gt;mods.csv&lt;/em&gt;&lt;/strong&gt; file to the &lt;strong&gt;dl1&lt;/strong&gt; folder, which you do via the notebook UI.&lt;/p&gt;
&lt;p&gt;Be sure to upload this file before you execute the download command.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mods cells" src="https://john.soban.ski/images/Fastai_Flask/04_Mods_DL.png"&gt;&lt;/p&gt;
&lt;p&gt;I then created a new set of cells to set folder to &lt;strong&gt;&lt;em&gt;rockers&lt;/em&gt;&lt;/strong&gt;, file to &lt;strong&gt;&lt;em&gt;rockers.csv&lt;/em&gt;&lt;/strong&gt;, and the same parent folder.&lt;/p&gt;
&lt;p&gt;Once again, I added a cell to download the files and print the results.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Mods cells" src="https://john.soban.ski/images/Fastai_Flask/05_Rockers_DL.png"&gt;&lt;/p&gt;
&lt;p&gt;Run the cells until you get to the &lt;strong&gt;learn.recorder.plot()&lt;/strong&gt; step.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr_find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recorder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run the plot cell.  The resulting graph helps us select the learning rate boundaries.&lt;/p&gt;
&lt;p&gt;&lt;img alt="LR" src="https://john.soban.ski/images/Fastai_Flask/06_LR.png"&gt;&lt;/p&gt;
&lt;p&gt;Based on this image, we want to set the learning rate between &lt;strong&gt;&lt;em&gt;1e-4&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;1e-3&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_one_cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
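&lt;p&gt;The &lt;strong&gt;&lt;em&gt;slice(1e-4, 1e-3)&lt;/em&gt;&lt;/strong&gt; argument tells fastai to give the earliest layer group the smallest learning rate and the head the largest, spaced geometrically in between.  A sketch of that spacing, assuming three layer groups (the function name is my own, not fastai API):&lt;/p&gt;

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically space n_groups learning rates between lo and hi,
    mimicking how slice(lo, hi) spreads rates across layer groups."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1.0 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

lrs = spread_lrs(1e-4, 1e-3, 3)
print(lrs[0], lrs[-1])  # smallest rate for the early layers, largest for the head
```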

&lt;p&gt;Keep running the model until you get to the image cleaner step.&lt;/p&gt;
&lt;p&gt;As you can see, our data set includes images with incorrect labels.&lt;/p&gt;
&lt;p&gt;The image on the right, for example, shows a &lt;strong&gt;&lt;em&gt;mod&lt;/em&gt;&lt;/strong&gt;, not a &lt;strong&gt;&lt;em&gt;rocker&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bad Data" src="https://john.soban.ski/images/Fastai_Flask/07_Bad_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;Once you have used the widget to fix the data, go back to the following cell, uncomment it, and run it.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageDataBunch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;.&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_pct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cleaned.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;ds_tfms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;get_transforms&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;224&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imagenet_stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that this cell uses &lt;strong&gt;&lt;em&gt;cleaned.csv&lt;/em&gt;&lt;/strong&gt;, a file created by the image cleaning widget.&lt;/p&gt;
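&lt;p&gt;A quick sanity check on &lt;strong&gt;&lt;em&gt;cleaned.csv&lt;/em&gt;&lt;/strong&gt; confirms that the widget kept a sensible number of images per class.  The sketch below assumes the file holds a header row followed by &lt;strong&gt;&lt;em&gt;name,label&lt;/em&gt;&lt;/strong&gt; rows; the function name and sample rows are my own.&lt;/p&gt;

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the widget's cleaned.csv output.
CLEANED_CSV = """name,label
mods/00000001.jpg,mods
mods/00000002.jpg,mods
rockers/00000001.jpg,rockers
"""

def label_counts(csv_text):
    """Count the images that survive cleaning, per class label."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)                        # skip the header row
    return Counter(row[1] for row in reader if row)

counts = label_counts(CLEANED_CSV)
```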
&lt;p&gt;Run the next few cells.  One cell will show correctly labeled data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Samples" src="https://john.soban.ski/images/Fastai_Flask/08_Samples.png"&gt;&lt;/p&gt;
&lt;h3&gt;Re-run the learn process&lt;/h3&gt;
&lt;p&gt;Re-learn the model based on the new cleaned data set.&lt;/p&gt;
&lt;p&gt;Run the cells in order, and be sure to stop at the &lt;strong&gt;&lt;em&gt;Cleaning Up&lt;/em&gt;&lt;/strong&gt; section.&lt;/p&gt;
&lt;p&gt;The new model has a nice, low error rate.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Good Error" src="https://john.soban.ski/images/Fastai_Flask/09_Good_Error.png"&gt;&lt;/p&gt;
&lt;p&gt;The confusion matrix shows one miss.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Confuse" src="https://john.soban.ski/images/Fastai_Flask/10_Confuse.png"&gt;&lt;/p&gt;
&lt;h3&gt;Export the model&lt;/h3&gt;
&lt;p&gt;Skip the &lt;strong&gt;&lt;em&gt;Cleaning Up&lt;/em&gt;&lt;/strong&gt; section and proceed to the following cell.&lt;/p&gt;
&lt;p&gt;Once there, execute the cell.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="err"&gt;$&lt;/span&gt; &lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;export&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You now have a file named &lt;strong&gt;&lt;em&gt;export.pkl&lt;/em&gt;&lt;/strong&gt; (pickle) that includes the model.&lt;/p&gt;
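&lt;p&gt;The &lt;strong&gt;&lt;em&gt;export.pkl&lt;/em&gt;&lt;/strong&gt; file is an ordinary Python pickle of the learner, which means the Flask app can unpickle it once at startup rather than once per request.  A toy round trip that illustrates the idea (the class below is a stand-in of my own, not the fastai API):&lt;/p&gt;

```python
import io
import pickle

class ToyModel:
    """Stand-in for the fastai Learner that learn.export() pickles."""
    classes = ['mods', 'rockers']

buf = io.BytesIO()
pickle.dump(ToyModel(), buf)   # learn.export() does this to export.pkl
buf.seek(0)
model = pickle.load(buf)       # load_learner('.') does this at app startup
```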
&lt;p&gt;You can now save your work and shut down the Jupyter notebook.&lt;/p&gt;
&lt;h3&gt;Deploy the Flask App&lt;/h3&gt;
&lt;p&gt;Now that we have a model, we can create a Flask app that allows users to upload a picture and have the model classify the picture as a &lt;strong&gt;&lt;em&gt;mod&lt;/em&gt;&lt;/strong&gt; or a &lt;strong&gt;&lt;em&gt;rocker&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In your home directory, create and enter a directory named &lt;strong&gt;&lt;em&gt;brighton_seafront&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~
&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;brighton_seafront
&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;brighton_seafront
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In this directory, create the following file, named &lt;strong&gt;&lt;em&gt;application.py&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="ch"&gt;#!/usr/bin/env python&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;fastai.vision&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url_for&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask_bootstrap&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bootstrap&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask_wtf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FlaskForm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;flask_wtf.file&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileField&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;wtforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SubmitField&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;werkzeug&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;secure_filename&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;SECRET_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;a-bunch-of-secret-stuff&amp;#39;&lt;/span&gt;
    &lt;span class="n"&gt;BOOTSTRAP_SERVE_LOCAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UploadForm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FlaskForm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FileField&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;submit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SubmitField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;Submit&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;application&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vm"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Bootstrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;learn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_learner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@application&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;GET&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;POST&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;form&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UploadForm&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;validate_on_submit&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secure_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;uploads/&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;open_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;uploads/&lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;pred_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;pred_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;learn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;flash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;That picture belongs to &lt;/span&gt;&lt;span class="si"&gt;{}&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pred_class&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;success&amp;#39;&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;upload&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;upload.html&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;application&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;0.0.0.0&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
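&lt;p&gt;One caveat: the route above saves whatever file the user submits.  A hedged hardening step would check the extension against an allow-list before saving; the names below are my own, not part of Flask.&lt;/p&gt;

```python
# Hypothetical allow-list check to run before form.file.data.save(...).
ALLOWED_EXTENSIONS = {'jpg', 'jpeg', 'png', 'gif'}

def allowed_file(filename):
    """True when the filename carries an extension from the allow-list."""
    return ('.' in filename
            and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS)
```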

&lt;p&gt;Make two directories, one named &lt;strong&gt;&lt;em&gt;uploads&lt;/em&gt;&lt;/strong&gt; and one named &lt;strong&gt;&lt;em&gt;templates&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;uploads&lt;span class="w"&gt; &lt;/span&gt;templates
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Create the following file in the &lt;strong&gt;&lt;em&gt;templates&lt;/em&gt;&lt;/strong&gt; folder (i.e. &lt;strong&gt;&lt;em&gt;templates/upload.html&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;extends&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bootstrap/base.html&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bootstrap/wtf.html&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;wtf&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;container&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_flashed_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;with_categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="n"&gt;Categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;green&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blue&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;warning&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yellow&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;danger&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;red&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;--&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alert alert-{{ category }} alert-dismissible&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alert&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;button&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;close&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;dismiss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;alert&amp;quot;&lt;/span&gt; &lt;span class="n"&gt;aria&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;Close&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="n"&gt;aria&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;true&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;times&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;button&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;strong&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;strong&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
          &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endfor&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endwith&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="n"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;container&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;h3&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Please&lt;/span&gt; &lt;span class="n"&gt;upload&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;h3&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;wtf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quick_form&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;form&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endblock&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Copy &lt;strong&gt;&lt;em&gt;export.pkl&lt;/em&gt;&lt;/strong&gt; from your &lt;strong&gt;&lt;em&gt;course-v3&lt;/em&gt;&lt;/strong&gt; folder to the current &lt;strong&gt;&lt;em&gt;brighton_seafront&lt;/em&gt;&lt;/strong&gt; folder.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;$&lt;span class="w"&gt; &lt;/span&gt;cp&lt;span class="w"&gt; &lt;/span&gt;/home/ec2-user/course-v3/nbs/dl1/data/brighton_seafront/export.pkl&lt;span class="w"&gt; &lt;/span&gt;/home/ec2-user/brighton_seafront
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, start your Flask app.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;boot&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;brighton_seafront&lt;span class="o"&gt;]&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;./application.py
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Serving&lt;span class="w"&gt; &lt;/span&gt;Flask&lt;span class="w"&gt; &lt;/span&gt;app&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;application&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;lazy&lt;span class="w"&gt; &lt;/span&gt;loading&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Environment:&lt;span class="w"&gt; &lt;/span&gt;production
&lt;span class="w"&gt;   &lt;/span&gt;WARNING:&lt;span class="w"&gt; &lt;/span&gt;This&lt;span class="w"&gt; &lt;/span&gt;is&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;development&lt;span class="w"&gt; &lt;/span&gt;server.&lt;span class="w"&gt; &lt;/span&gt;Do&lt;span class="w"&gt; &lt;/span&gt;not&lt;span class="w"&gt; &lt;/span&gt;use&lt;span class="w"&gt; &lt;/span&gt;it&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;production&lt;span class="w"&gt; &lt;/span&gt;deployment.
&lt;span class="w"&gt;   &lt;/span&gt;Use&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;production&lt;span class="w"&gt; &lt;/span&gt;WSGI&lt;span class="w"&gt; &lt;/span&gt;server&lt;span class="w"&gt; &lt;/span&gt;instead.
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Debug&lt;span class="w"&gt; &lt;/span&gt;mode:&lt;span class="w"&gt; &lt;/span&gt;off
&lt;span class="w"&gt; &lt;/span&gt;*&lt;span class="w"&gt; &lt;/span&gt;Running&lt;span class="w"&gt; &lt;/span&gt;on&lt;span class="w"&gt; &lt;/span&gt;http://0.0.0.0:5000/&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;Press&lt;span class="w"&gt; &lt;/span&gt;CTRL+C&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;quit&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you navigate to the &lt;strong&gt;&lt;em&gt;/upload&lt;/em&gt;&lt;/strong&gt; path on port &lt;strong&gt;&lt;em&gt;5000&lt;/em&gt;&lt;/strong&gt;, you will see a screen that directs you to upload a picture for classification.&lt;/p&gt;
&lt;p&gt;&lt;img alt="App Upload" src="https://john.soban.ski/images/Fastai_Flask/11_App_Upload.png"&gt;&lt;/p&gt;
&lt;p&gt;I uploaded a picture of the young Glam Rocker Marc Bolan, which the photographer snapped during his mod phase.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Marc Bolan" src="https://john.soban.ski/images/Fastai_Flask/12_Marc_Bolan.png"&gt;&lt;/p&gt;
&lt;p&gt;After upload, the model successfully classified Marc as a mod!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Result" src="https://john.soban.ski/images/Fastai_Flask/13_Result.png"&gt;&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;We created a fun toy app that differentiates between &lt;strong&gt;&lt;em&gt;mods&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;rockers&lt;/em&gt;&lt;/strong&gt;.  Our application blocks, however, and can only accommodate one user at a time.  If asynchronous web applications interest you, take a look at Python &lt;a href="https://john.soban.ski/backup-aws-provided-elasticsearch-to-amazon-simple-storage-service.html"&gt;Celery&lt;/a&gt; or Amazon &lt;a href="https://john.soban.ski/deploy_an_advanced_elasticsearch_proxy_with_lambda.html"&gt;Lambda&lt;/a&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="Flask"></category><category term="HOWTO"></category></entry><entry><title>A Graphical Intro to Probabilistic Neural Networks (PNN)</title><link href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html" rel="alternate"></link><published>2018-10-30T23:47:00-04:00</published><updated>2018-10-30T23:47:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2018-10-30:/graphical_intro_to_probabilistic_neural_networks.html</id><summary type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Machine Learning engineers use Probabilistic Neural Networks (&lt;a href="https://en.wikipedia.org/wiki/Probabilistic_neural_network"&gt;PNN&lt;/a&gt;) for  classification and pattern recognition tasks.  PNN use a &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Parzen Window&lt;/a&gt; along with a non-negative kernel function to estimate the probability distribution function (&lt;a href="https://en.wikipedia.org/wiki/Probability_density_function"&gt;PDF&lt;/a&gt;) of each class.  The Parzen approach enables non-parametric estimation of the PDF.&lt;/p&gt;
&lt;p&gt;In this blog post I …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Machine Learning engineers use Probabilistic Neural Networks (&lt;a href="https://en.wikipedia.org/wiki/Probabilistic_neural_network"&gt;PNN&lt;/a&gt;) for  classification and pattern recognition tasks.  PNN use a &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Parzen Window&lt;/a&gt; along with a non-negative kernel function to estimate the probability distribution function (&lt;a href="https://en.wikipedia.org/wiki/Probability_density_function"&gt;PDF&lt;/a&gt;) of each class.  The Parzen approach enables non-parametric estimation of the PDF.&lt;/p&gt;
&lt;p&gt;In this blog post I will discuss the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What is a Parzen PNN?&lt;ul&gt;
&lt;li&gt;Animated example of the Parzen algorithm&lt;/li&gt;
&lt;li&gt;Animated example of a Parzen Neural Network &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Normalization of Training Data&lt;ul&gt;
&lt;li&gt;Trade several approaches&lt;/li&gt;
&lt;li&gt;Effectiveness of approaches - Parzen vs. Nearest Neighbor&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Reduced Coulomb Energy Networks&lt;ul&gt;
&lt;li&gt;Descriptive Animation&lt;/li&gt;
&lt;li&gt;Visualization of RCE on the normalization approach&lt;/li&gt;
&lt;li&gt;Benefits of Ambiguous Regions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RCE applied to the Bupa Liver disorders data set&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What is a Parzen PNN?&lt;/h2&gt;
&lt;p&gt;Mathworks provides a simple definition of a &lt;a href="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/11880/versions/1/previews/ParzenPNN/html/demo.html"&gt;Parzen Probabilistic Neural Network&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Parzen Probabilistic Neural Networks (PPNN) are a simple type of neural network used to classify data vectors. This [sic] classifiers are based on the Bayesian theory where the a posteriori probability density function (apo-pdf) is estimated from data using the Parzen window technique.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;PPNN allow a non-parametric approach to estimate the required Bayesian Classifier probabilities &lt;strong&gt;&lt;em&gt;P(x|w&lt;sub&gt;i&lt;/sub&gt;)&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;P(w&lt;sub&gt;i&lt;/sub&gt;)&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Bayes Classifier" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/00_Bayes_Classifier.png"&gt;&lt;/p&gt;
&lt;p&gt;In action, the PPNN mechanics are easy to follow.  The PPNN takes a test vector, computes its dot product with each hidden-layer weight vector, and then chooses the winning class based on the highest output value.  The next section includes an animated cartoon that shows the PPNN visually.&lt;/p&gt;
&lt;h3&gt;Animated example of the Parzen algorithm&lt;/h3&gt;
&lt;p&gt;Suppose you have three classes, and the following training data:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;th&gt;Var1&lt;/th&gt;
&lt;th&gt;Var2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Green&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Purple&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Purple&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Yellow&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;Green&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You now want to use a PPNN to classify the color of the observation &lt;strong&gt;&lt;em&gt;( Var1 = 0.75, Var2 = 0.25 )&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The cartoon below shows the weights as filled colored boxes.  In Column A, for example, weight one (WA1) is half full (e.g. 0.5) and weight two (WA2) is three quarters full (e.g. 0.75).  The animation shows the dot product of the test pattern &lt;strong&gt;&lt;em&gt;( X1 = 0.75, X2 = 0.25)&lt;/em&gt;&lt;/strong&gt; with the weight vectors, an activation function, and then the selection of the winner.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Parzen Cartoon" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/Parzen_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h3&gt;Animated example of a Parzen Neural Network&lt;/h3&gt;
&lt;p&gt;Now let's take a look at the classification approach using the familiar neural network diagram.  The input layer (bottom) includes our test pattern &lt;strong&gt;&lt;em&gt;( X1 = 0.75, X2 = 0.25)&lt;/em&gt;&lt;/strong&gt;; the hidden layer includes weight vectors assigned to classes based on the training patterns.  The PPNN then connects the hidden layer to the appropriate class in the output layer.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Parzen Cartoon" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/Parzen_Neural_Net_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h2&gt;Normalization of Training Data&lt;/h2&gt;
&lt;p&gt;The Mathworks &lt;a href="https://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/11880/versions/1/previews/ParzenPNN/html/demo.html"&gt;PPNN web page&lt;/a&gt; specifies that we must normalize both our weight vectors and training vectors.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The weights on the first [hidden] layer are trained as follows: each sample data is normalized so that its length becomes unitary, each sample data becomes a neuron with the normalized values as weights w.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This next section shows different approaches to normalize the training data.&lt;/p&gt;
&lt;h3&gt;Trade several approaches&lt;/h3&gt;
&lt;p&gt;I use the following data set for this trade study.  &lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;x&lt;/th&gt;
&lt;th&gt;y&lt;/th&gt;
&lt;th&gt;class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;X&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here is a plot of the training data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Original Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/01_Original.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note that in this toy example, we can set up a simple classifier via a vertical line at &lt;/em&gt;&lt;strong&gt;X = 2.25&lt;/strong&gt;&lt;em&gt; and just use the &lt;/em&gt;&lt;strong&gt;x&lt;/strong&gt;&lt;em&gt; values to decide.  Never mind that, though, since the point of this section is to illustrate different normalization techniques and then look at the effectiveness of different classification approaches.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;When we normalize over all the training data, you see that the &lt;strong&gt;&lt;em&gt;(x, y)&lt;/em&gt;&lt;/strong&gt; axes scale to &lt;strong&gt;&lt;em&gt;( 1, 1 )&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Normalized over all training data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/02_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;If we center the data and normalize, the scale goes from &lt;strong&gt;&lt;em&gt;-1&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;1&lt;/em&gt;&lt;/strong&gt; on both axes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Centered and normalized over all training data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/03_Centered_And_Normailzed_Over_All_Training.png"&gt;&lt;/p&gt;
&lt;p&gt;If we normalize to class-specific magnitude, it makes matters worse.  We no longer have clean separation of the classes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="04_Notmalized_To_Class_Specific_Magnitude" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/04_Notmalized_To_Class_Specific_Magnitude.png"&gt;&lt;/p&gt;
&lt;p&gt;If we normalize on a per-vector basis, we get built-in error.  Pattern &lt;strong&gt;&lt;em&gt;(0.75, 0.75)&lt;/em&gt;&lt;/strong&gt; now belongs to both Class &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt; and Class &lt;strong&gt;&lt;em&gt;+&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="05_Normalized_On_A_Per_Vector_Basis" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/05_Normalized_On_A_Per_Vector_Basis.png"&gt;&lt;/p&gt;
&lt;h3&gt;Effectiveness of approaches - Parzen vs. Nearest Neighbor&lt;/h3&gt;
&lt;p&gt;Now let's look at the effectiveness of PPNN vs. the &lt;a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm"&gt;&lt;em&gt;k&lt;/em&gt;-nearest neighbor&lt;/a&gt; (KNN) algorithm.  KNN provides another non-parametric method of classification.  Instead of using a kernel to estimate the parent PDF, it looks at the class labels of the &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; closest training points.  In the graphics below the gray regions depict Class One (&lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt;) and the white regions depict Class Two (&lt;strong&gt;&lt;em&gt;+&lt;/em&gt;&lt;/strong&gt;).&lt;/p&gt;
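&lt;p&gt;For reference, a bare-bones nearest-neighbor classifier (a sketch of the &lt;strong&gt;&lt;em&gt;k = 1&lt;/em&gt;&lt;/strong&gt; case, using the toy training set above) looks like this:&lt;/p&gt;

```python
import numpy as np

def nearest_neighbor(x, train, labels):
    """1-NN: label the query with the class of its closest training point."""
    train = np.asarray(train, dtype=float)
    dists = np.linalg.norm(train - np.asarray(x, dtype=float), axis=1)
    return labels[int(np.argmin(dists))]

train  = [[2.5, 2.5], [3.0, 1.0], [4.0, 2.0], [1.0, 1.0], [1.0, 2.0], [2.0, 2.5]]
labels = ['+', '+', '+', 'X', 'X', 'X']
print(nearest_neighbor([3.5, 1.5], train, labels))
```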
&lt;p&gt;First let's look at the case where we normalized each training pattern to class-specific magnitude.  If you recall, this approach appeared to look bad, scrunching the two classes close to each other.&lt;/p&gt;
&lt;p&gt;&lt;img alt="06_Case_1_Normalized_To_Class_Specific_Magnitude" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/06_Case_1_Normalized_To_Class_Specific_Magnitude.png"&gt;&lt;/p&gt;
&lt;p&gt;KNN, believe it or not, does a good job of classifying the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="07_Nearest_Neighbor_Normalized_By_Class" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/07_Nearest_Neighbor_Normalized_By_Class.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN fails, classifying all of Class 2 as Class 1.&lt;/p&gt;
&lt;p&gt;&lt;img alt="08_Parzen_Neural_Net_Normalized_By_Class" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/08_Parzen_Neural_Net_Normalized_By_Class.png"&gt;&lt;/p&gt;
&lt;p&gt;The second case scales the training data to &lt;strong&gt;&lt;em&gt;(0,1)&lt;/em&gt;&lt;/strong&gt; on both axes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="09_Case_2_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/09_Case_2_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;KNN handles the classification with ease.&lt;/p&gt;
&lt;p&gt;&lt;img alt="10_Nearest_Neighbor_Over_All_Samples_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/10_Nearest_Neighbor_Over_All_Samples_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN (using &lt;strong&gt;&lt;em&gt;&amp;#963; = &lt;sup&gt;1&lt;/sup&gt;/&lt;sub&gt;4&lt;/sub&gt;&lt;/em&gt;&lt;/strong&gt;) fails.  It allocates a tiny box region to Class 1, and classifies everything else to Class 2.&lt;/p&gt;
&lt;p&gt;&lt;img alt="11_Parzen_Neural_Net_Over_All_Samples_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/11_Parzen_Neural_Net_Over_All_Samples_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;Normalizing on a per-sample basis introduces built-in error.  Note again the overlap of the &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;+&lt;/em&gt;&lt;/strong&gt; at &lt;strong&gt;&lt;em&gt;( 0.75, 0.75)&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="12_Case_3_Normalized_On_A_Per_Sample_Basis" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/12_Case_3_Normalized_On_A_Per_Sample_Basis.png"&gt;&lt;/p&gt;
&lt;p&gt;The KNN of course takes a hit due to the built-in error.&lt;/p&gt;
&lt;p&gt;&lt;img alt="13_Nearest_Neighbor_Per_Sample_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/13_Nearest_Neighbor_Per_Sample_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN (using &lt;strong&gt;&lt;em&gt;&amp;#963; = &lt;sup&gt;1&lt;/sup&gt;/&lt;sub&gt;4&lt;/sub&gt;&lt;/em&gt;&lt;/strong&gt;) misses twice, once for the built-in error and once for a Class 1 observation.&lt;/p&gt;
&lt;p&gt;&lt;img alt="14_Parzen_Over_Per_Sample_Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/14_Parzen_Over_Per_Sample_Norm.png"&gt;&lt;/p&gt;
&lt;p&gt;The final normalization approach centers and normalizes the data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="15_Case_4_Centered_And_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/15_Case_4_Centered_And_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;The KNN handles this with aplomb.&lt;/p&gt;
&lt;p&gt;&lt;img alt="16_Nearest_Neighbor_Centered_And_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/16_Nearest_Neighbor_Centered_And_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;p&gt;The PPNN also correctly classifies all observations.&lt;/p&gt;
&lt;p&gt;&lt;img alt="17_Parzen_Centered_And_Normalized_Over_All_Training_Data" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/17_Parzen_Centered_And_Normalized_Over_All_Training_Data.png"&gt;&lt;/p&gt;
&lt;h2&gt;Reduced Coulomb Energy Networks&lt;/h2&gt;
&lt;p&gt;So far I have shown several normalization approaches and demonstrated the effectiveness of two non-parametric classification techniques, PPNN and KNN, on the normalized data.  Now I would like to describe a third non-parametric classification algorithm: the &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Coulomb Energy&lt;/a&gt; (RCE) net.  &lt;/p&gt;
&lt;p&gt;In summary, RCE provide the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rapid learning of class regions that are&lt;ul&gt;
&lt;li&gt;Complex&lt;/li&gt;
&lt;li&gt;Non-linear&lt;/li&gt;
&lt;li&gt;Disjoint&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;No local minima issues&lt;/li&gt;
&lt;li&gt;Performance knobs&lt;ul&gt;
&lt;li&gt;Trade training time vs. memory requirements&lt;/li&gt;
&lt;li&gt;Trade classifier complexity to training data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you would like more details, I encourage you to read my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;detailed investigation of RCE&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Descriptive Animation&lt;/h3&gt;
&lt;p&gt;This cartoon shows the simplicity of the RCE algorithm.  For each training point, the RCE algorithm creates a circular footprint with a radius equal to the distance of the nearest training point from the &lt;strong&gt;&lt;em&gt;other&lt;/em&gt;&lt;/strong&gt; class.  To prevent overlap, you can set a maximum radius for each training point.&lt;/p&gt;
&lt;p&gt;&lt;img alt="RCE Cartoon" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Cartoon.gif"&gt;&lt;/p&gt;
&lt;h3&gt;Visualization of RCE on the normalization approach&lt;/h3&gt;
&lt;p&gt;The following animation shows the classification footprints for the centered and normalized training data.  Note that dark gray represents class one, light gray represents class two and white indicates an "ambiguous region" (no class).&lt;/p&gt;
&lt;p&gt;&lt;img alt="Center Norm Lambda" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/Center_Norm_Lambda.gif"&gt;&lt;/p&gt;
&lt;p&gt;The next animation shows the RCE classification footprints on the non-centered all samples normalized training data.&lt;/p&gt;
&lt;p&gt;&lt;img alt="All Samples Norm" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/All_Samples_Norm.gif"&gt;&lt;/p&gt;
&lt;p&gt;Normalizing by class increases the extent of the ambiguous regions.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Norm by class" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Norm_Per_Class.png"&gt;&lt;/p&gt;
&lt;p&gt;Once more, the built-in error of the per-sample magnitude normalization approach results in a miss.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Norm by per sample" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/RCE_Norm_Per_Sample.png"&gt;&lt;/p&gt;
&lt;h3&gt;Benefits of Ambiguous Regions&lt;/h3&gt;
&lt;p&gt;RCE provides the benefit of ambiguous regions.  Ambiguous regions pinpoint areas that would provide useful training samples.  The data scientist can then execute observations in those regions to fill in the gaps.&lt;/p&gt;
&lt;p&gt;&lt;img alt="18_Good_Ambiguous_Regions" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/18_Good_Ambiguous_Regions.png"&gt;&lt;/p&gt;
&lt;p&gt;The following graphic shows how additional training observations filled in the ambiguity.&lt;/p&gt;
&lt;p&gt;&lt;img alt="19_Useful_Training_Samples" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/19_Useful_Training_Samples.png"&gt;&lt;/p&gt;
&lt;h2&gt;RCE applied to the Bupa Liver disorders data set&lt;/h2&gt;
&lt;p&gt;The final section summarizes my approaches to separating the training data I input into my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;detailed investigation of RCE&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For my investigation, I looked at the &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;BUPA Liver Disorders&lt;/a&gt; data set.&lt;/p&gt;
&lt;p&gt;The data includes six features and two classes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mcv: mean corpuscular volume&lt;/li&gt;
&lt;li&gt;Four Chemical Markers&lt;ul&gt;
&lt;li&gt;alkphos: alkaline phosphotase&lt;/li&gt;
&lt;li&gt;sgpt: alamine aminotransferase&lt;/li&gt;
&lt;li&gt;sgot: aspartate aminotransferase&lt;/li&gt;
&lt;li&gt;gammagt: gamma-glutamyl transpeptidase&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;drinks: # of half-pint equivalents of alcohol per day&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I then wrangled the data set in order to increase the success rate of my classification.  I used the following method:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Normalize the data&lt;/li&gt;
&lt;li&gt;Quantify separability using&lt;ul&gt;
&lt;li&gt;Divergence&lt;/li&gt;
&lt;li&gt;Bhattacharyya distance&lt;/li&gt;
&lt;li&gt;Scatter Matrices&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
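&lt;p&gt;As an aside, when a single feature is modeled as Gaussian within each class, the Bhattacharyya distance has a closed form.  The sketch below is my own illustration with hypothetical per-class summary statistics; a larger distance flags a more separable feature.&lt;/p&gt;

```python
import numpy as np

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians.

    Larger values mean the class-conditional densities overlap less,
    i.e. the feature separates the classes better."""
    term_mean = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    term_var = 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
    return term_mean + term_var

# Hypothetical per-class statistics for one candidate feature
d = bhattacharyya_gauss(0.0, 1.0, 2.0, 1.0)
```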
&lt;p&gt;For the two feature case, separation analysis showed the best feature combination for class detection includes &lt;strong&gt;&lt;em&gt;gamma-glutamyl&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;number of drinks&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Out of the box, you can see these two are poorly separable.&lt;/p&gt;
&lt;p&gt;&lt;img alt="20_Poor_Two_Feature_Separability" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/20_Poor_Two_Feature_Separability.png"&gt;&lt;/p&gt;
&lt;p&gt;For the three feature case, the scatter method (left) added &lt;strong&gt;&lt;em&gt;alkphos&lt;/em&gt;&lt;/strong&gt; to the mix, whereas divergence and Bhattacharyya added &lt;strong&gt;&lt;em&gt;sgpt&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="21_Three_Feature_Separability" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/21_Three_Feature_Separability.png"&gt;&lt;/p&gt;
&lt;p&gt;The following diagrams show the three dimensional separation approaches based on a normalized test set.  I used the training &lt;strong&gt;&lt;em&gt;&amp;#956;&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;&amp;#963;&lt;/em&gt;&lt;/strong&gt; to normalize the test set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="22_Further_Three_D_Separability_Approaches" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/22_Further_Three_D_Separability_Approaches.png"&gt;&lt;/p&gt;
&lt;p&gt;This graphic shows the same approach, only using the test set's &lt;strong&gt;&lt;em&gt;&amp;#956;&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;&amp;#963;&lt;/em&gt;&lt;/strong&gt; to normalize the test set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="23_Three_D_Separability_On_Train_Set" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/23_Three_D_Separability_On_Train_Set.png"&gt;&lt;/p&gt;
&lt;p&gt;The following graphic shows the classification footprints using a normalized, two feature (&lt;strong&gt;&lt;em&gt;gamma-glutamyl&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;number of drinks&lt;/em&gt;&lt;/strong&gt;) train and test set.&lt;/p&gt;
&lt;p&gt;&lt;img alt="24_Results_Of_Two_Feature_Two_Five" src="https://john.soban.ski/images/Graphical_Intro_To_Probabilistic_Neural_Networks/24_Results_Of_Two_Feature_Two_Five.png"&gt;&lt;/p&gt;
&lt;p&gt;For detailed results of my investigation, I encourage you to read my &lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;detailed investigation of RCE&lt;/a&gt; applied to the BUPA liver disorders data set.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I leave you with convenient bullet points summarizing the work we accomplished today.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Frame PNN as a simple series of steps&lt;ul&gt;
&lt;li&gt;Dot product (or distance)&lt;/li&gt;
&lt;li&gt;Non-linear transform&lt;/li&gt;
&lt;li&gt;Summation and voting&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Be cognizant of normalization approach&lt;/li&gt;
&lt;li&gt;Sometimes feature reduction yields classes with common patterns&lt;/li&gt;
&lt;li&gt;RCE rapidly learns class regions&lt;ul&gt;
&lt;li&gt;Complex&lt;/li&gt;
&lt;li&gt;Non-linear&lt;/li&gt;
&lt;li&gt;Disjoint&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;RCE can ID ambiguous regions&lt;ul&gt;
&lt;li&gt;ID regions of useful training patterns&lt;/li&gt;
&lt;li&gt;Avoids forcing a classification to a known class when unknown classes may exist&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you enjoyed this blog post, please check out these related blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;Exploratory Factor Analysis (EFA) Workflow and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;EFA - The Math and Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html"&gt;Reduced Columb Energy (RCE) - An alternative to KNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fastai-flask.html"&gt;Vision model w/ FAST AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model w/ Google AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Google AutoML Tables Beta&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><category term="Data Science"></category><category term="Octave"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>A Reduced Coulomb Energy (RCE) Neural Network Classifier</title><link href="https://john.soban.ski/reduced_coulomb_energy_neural_network_bupa.html" rel="alternate"></link><published>2018-08-27T10:26:00-04:00</published><updated>2018-08-27T10:26:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2018-08-27:/reduced_coulomb_energy_neural_network_bupa.html</id><summary type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I investigate the effectiveness of a Reduced Coulomb Energy (RCE) Neural Network on the classification of the University of California, Irvine (UCI) Bupa liver disorder data set.  I investigate seven (7) different versions of the data set, four  (4) un-coded and three (3) binary coded (to a higher dimensional …&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;I investigate the effectiveness of a Reduced Coulomb Energy (RCE) Neural Network on the classification of the University of California, Irvine (UCI) Bupa liver disorder data set.  I investigate seven (7) different versions of the data set, four (4) un-coded and three (3) binary coded (to a higher dimensional feature space) data sets applying various feature vector dimensionality reduction strategies.  Finally, for all seven (7) datasets I apply a feedback-tuning algorithm.  In summary, I achieve a best-case error rate of 20% and ambiguity of 31%, when I apply my feedback-tuning algorithm (using a learning rate of &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25) to the un-coded data set that reduces the feature vector dimensionality to half the original size.&lt;/p&gt;
&lt;h2&gt;Outline&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Background&lt;ul&gt;
&lt;li&gt;Why is this an important problem?&lt;/li&gt;
&lt;li&gt;What work has been done before?&lt;/li&gt;
&lt;li&gt;Brief discussion of RCE&lt;/li&gt;
&lt;li&gt;Benefits of RCE&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Methods&lt;/li&gt;
&lt;li&gt;Data&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusions&lt;/li&gt;
&lt;li&gt;Bibliography&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;h3&gt;Why is this an important problem?&lt;/h3&gt;
&lt;p&gt;A medical diagnosis contains a test pattern with features such as symptoms, patient history, and laboratory tests.  The doctor uses this test pattern to diagnose, or classify the patient.  Doctors and patients can benefit if the Doctor treats the diagnosis as a classification problem, and can arrive at a classification model with low error [&lt;a href="#Bojarczuk"&gt;Bojarczuk&lt;/a&gt; 27].  “Medical data often seem to contain a great number and uncertain or irrelevant features.  How to extract enough necessary and useful diagnostic rules used to be highly depended on the clinical experience  [&lt;a href="#Kahramanli"&gt;Kahramanli&lt;/a&gt; 9].”  I investigate if a RCE NN can extract enough necessary and useful diagnostic rules from the BUPA liver disorders data set, to reduce dependence on clinical experience, and instead put the intelligence in the pattern classification model. &lt;/p&gt;
&lt;h3&gt;What work has been done before?&lt;/h3&gt;
&lt;p&gt;Several mathematicians apply algorithms to the BUPA Liver Disorders dataset.  Goncalves minimizes error to 20.31% using the Inverted Hierarchical Neuro-Fuzzy BSP System (HNFB) [&lt;a href="#Goncalves"&gt;Goncalves&lt;/a&gt; 245].  Raicharoen and Lursinsap achieve an error rate of only 18.61% using critical support vectors (CSV) [&lt;a href="#Raicharoen"&gt;Raicharoen&lt;/a&gt; 2534].  Bagirov and Ugon achieve 10.14% error using their min-max separability algorithm [&lt;a href="#Bagirov"&gt;Bagirov&lt;/a&gt; 19].  Cordella classifies through genetic programming, where prototypes of the classes describe clusters of data samples and logical expressions establish conditions on feature values.  Cordella hits an error rate of 26.2% [&lt;a href="#Cordella"&gt;Cordella&lt;/a&gt; 732].  Kahramanli&#8217;s Opt-aiNET algorithm lowers the error to 5.2% [&lt;a href="#Kahramanli"&gt;Kahramanli&lt;/a&gt; 12].  I investigate the utility of applying a RCE net method to the Bupa liver disorders dataset.&lt;/p&gt;
&lt;h3&gt;Brief discussion of RCE&lt;/h3&gt;
&lt;p&gt;Pattern classification represents distinct classes through disjoint regions formed by feature space partitioning.  Most classifiers partition non-overlapping regions and map each of these to a class.  In RCE networks, however, a class may have one or more regions, and regions can overlap.  A RCE net contains three layers, the input, output and hidden.  The input layer contains one node for each of the features, totaling the feature vector dimension.  The output layer has one node for each class.  In the hidden layer, each node represents a prototype.  Each class connects to either one or a cluster of prototypes.  A RCE net contains two modes, learning and classification.  The learning mode executes feature space partitioning, adjusts connection weights between input and hidden layer, and reduces thresholds in hidden nodes to eliminate wrong activations.  The classification mode makes class membership decisions based on the prototypes and their influence fields.  Some regions may have multiple class affiliations, and the RCE net labels these regions “ambiguous.”  [&lt;a href="#Li"&gt;Li&lt;/a&gt; 847]&lt;/p&gt;
&lt;p&gt;&lt;img alt="Learn Lambda" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/01_A_Training_Point_Learns_Lambda.png"&gt;&lt;/p&gt;
&lt;p&gt;Figure 1 above illustrates part of the learning stage of an RCE net.  Consider two classes of data, each with three training samples.  The RCE net grows a sphere around a training point until it hits a training point of a different class.  The RCE net stores this radius, &lt;strong&gt;&amp;#955;&lt;/strong&gt;, for that training point.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lambda for three training points" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/02_Lambda_For_Three_Training_Points.png"&gt;&lt;/p&gt;
&lt;p&gt;Figure 2 above shows &lt;strong&gt;&amp;#955;&lt;/strong&gt; for the three training points of class two (marked by a “+”).&lt;/p&gt;
&lt;p&gt;Figure 3 below depicts &lt;strong&gt;&amp;#955;&lt;/strong&gt; for both classes, notice how they overlap.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Ambiguous Regions" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/03_Illustration_Of_Abiguous_Regions.png"&gt;&lt;/p&gt;
&lt;p&gt;Scofield defines ambiguous regions as "point sets in the state space of a system which are labeled with multiple class affiliations.  This can occur because the input space has not carried all of the features in the pattern environment, or because the pattern itself is not separable."  [&lt;a href="#Scofield"&gt;Scofield&lt;/a&gt; 5].  The RCE net reduces the overlapping region by setting a maximum &lt;strong&gt;&amp;#955;&lt;/strong&gt;, as shown in Figure 4 below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Reduce Max Lambda" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/04_Reduce_Max_Lambda_To_Reduce_Ambiguity.png"&gt;&lt;/p&gt;
&lt;p&gt;The overlapping, or ambiguous regions point to regions that provide useful training points.  In Figure 5 below, we show an RCE net with a large ambiguous region.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Areas of Interest" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/05_Ambiguous_Regions_Point_To_Areas_Of_Interest.png"&gt;&lt;/p&gt;
&lt;p&gt;Getting training samples from this ambiguous area helps to better define the nature of the feature space.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Key Training Points" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/06_Key_Training_Points_Reduce_Ambiguity.png"&gt;&lt;/p&gt;
&lt;p&gt;Once the training phase completes, the RCE net classifies the test points.  RCE nets belong to the family of exemplar neural net classifiers, which “perform classification based on the identity of the training examples, or exemplars, that are nearest to the input.  Exemplar nodes compute the weighted Euclidean distance between inputs and node centroids [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 49].”  RCE nets create hyper-spheres around training points.  The related hidden layer nodes have “high outputs only if the input is within a given radius of the node’s centroid.  Otherwise, node outputs are low.  The classification decision is the label attached to the nodes with high outputs [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 51].”  The RCE net classifies a region ambiguous in the case of no outputs or outputs from multiple classes.&lt;/p&gt;
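&lt;p&gt;Lippmann's description of the exemplar nodes can be sketched directly: a node outputs high only when the input falls inside its hyper-sphere, and the decision layer reports "ambiguous" when no class, or more than one class, fires.  The centroids, radii, and labels below are illustrative assumptions, not values from the BUPA experiment.&lt;/p&gt;

```python
import numpy as np

def rce_node_outputs(x, centroids, radii):
    """Exemplar-node outputs: high (1) only when x falls inside the
    node's hyper-sphere, otherwise low (0)."""
    x = np.asarray(x, dtype=float)
    dists = np.linalg.norm(np.asarray(centroids, dtype=float) - x, axis=1)
    return (np.asarray(radii) > dists).astype(int)

def rce_decision(x, centroids, radii, classes):
    """Label from firing nodes; 'ambiguous' when none or several classes fire."""
    outputs = rce_node_outputs(x, centroids, radii)
    fired = {c for c, on in zip(classes, outputs) if on}
    return fired.pop() if len(fired) == 1 else 'ambiguous'

centroids = [[0.0, 0.0], [2.0, 0.0]]
radii     = [1.0, 1.0]
classes   = ['one', 'two']
print(rce_decision([0.2, 0.0], centroids, radii, classes))
```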
&lt;h3&gt;Benefits of RCE&lt;/h3&gt;
&lt;p&gt;Lippmann summarizes the benefits of RCE:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This classifier is similar to a k-nearest neighbor classifier in that it adapts rapidly over time, but it typically requires many fewer exemplar nodes than a nearest neighbor classifier.  During adaptation, more nodes are recruited to generate more complex decision regions, and the size of hyper-spheres formed by existing nodes is modified.  Theoretical analyses and experiments with RCE classifiers demonstrate that they can form complex decision regions rapidly.  Experiments also demonstrated that they can be trained to solve Boolean mapping problems such as the symmetry and multiplexer problems more than an order of magnitude faster than back-propagation classifiers.  Finally, classifiers such as the RCE classifier require less memory than k-nearest-neighbor classifiers but adapt classifier structure over time using simple adaptation rules that recruit new nodes to match the complexity of the classifier to that of the training data [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 51].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Li writes that RCE nets perform rapid learning of class regions that are complex, non-linear and disjoint.  Li also writes, “RCE net has the advantage of fast learning, unlimited memory capacity, and no local minima problem” [&lt;a href="#Li"&gt;Li&lt;/a&gt; 846].&lt;/p&gt;
&lt;h2&gt;Methods&lt;/h2&gt;
&lt;p&gt;I solve the problem by creating a family of MatLab/Octave functions from the ground up, identifying the key features, and then running the reduced data set through my algorithms (see Appendix).  I then create a feedback approach and identify the ground rules that yield the lowest error.  If interested, you can download my Octave code from GitHub &lt;a href="https://github.com/hatdropper1977/bupa-rce-octave"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Inappropriate normalization presents the first roadblock in my investigation.  Normalizing between zero and one causes otherwise distinct training points from different classes to have the same magnitude.  For most classification algorithms, this creates “built-in error.”  For RCE nets, this results in “built-in ambiguity.”  I depict this in Figure 7 below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Normalized Zoom In" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/07_Normalized_Zoom_In.png"&gt;&lt;/p&gt;
&lt;p&gt;In addition, normalizing on a per-pattern basis yields the greatest error and ambiguity.  Normalizing over a class yields the next greatest.  Normalizing over all samples, to the range [-1, 1], yields the lowest error.  &lt;/p&gt;
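&lt;p&gt;A small sketch of the two extremes, assuming simple min/max scaling to [-1, 1] (the function names are mine, not from the Octave code):&lt;/p&gt;

```python
import numpy as np

def scale(a, lo, hi):
    """Linearly map values spanning [lo, hi] onto [-1, 1]."""
    return 2.0 * (a - lo) / (hi - lo) - 1.0

def per_pattern(X):
    """Scale each row by its own min/max.  Distinct patterns with the
    same shape collapse onto one vector, creating built-in ambiguity."""
    return scale(X, X.min(axis=1, keepdims=True), X.max(axis=1, keepdims=True))

def over_all(X):
    """Scale by the global min/max, preserving relative magnitudes."""
    return scale(X, X.min(), X.max())
```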
&lt;p&gt;In terms of feature reduction, I use several methods, including Fisher’s discriminant ratio, divergence, Bhattacharyya distance and scatter matrices to select feature subsets. &lt;/p&gt;
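&lt;p&gt;Of these, Fisher&amp;#8217;s discriminant ratio is the simplest to sketch.  Assuming the common per-feature form (m1-m2)^2 / (v1+v2), with hypothetical function names:&lt;/p&gt;

```python
import numpy as np

def fisher_ratio(class_a, class_b):
    """Per-feature Fisher discriminant ratio (m1 - m2)^2 / (v1 + v2).

    class_a, class_b: (n_samples, n_features) arrays, one per class.
    Larger values flag features that separate the two classes better.
    """
    m1, m2 = class_a.mean(axis=0), class_b.mean(axis=0)
    v1, v2 = class_a.var(axis=0), class_b.var(axis=0)
    return (m1 - m2) ** 2 / (v1 + v2)

def top_features(class_a, class_b, k):
    """Indices of the k highest-ratio features, best first."""
    return list(np.argsort(fisher_ratio(class_a, class_b))[::-1][:k])
```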
&lt;p&gt;I also run a series of analyses on binary coded Bupa data, increasing the feature dimension.  For example, I take the “mcv” feature and map it into five dimensions.  For mcv, I create five categories, for the ranges [0,64), [65,85), [85,90), [90,95), [95,200].  The value mcv=77, for example, maps to the five dimensional vector [0 1 0 0 0].  The value 92 becomes [0 0 0 1 0].  This creates four new input nodes for mcv, and four of the five are always zero for any given value of mcv.  I perform similar binary mapping for all the features in the BUPA data set, increasing the feature dimension from six to thirty-three (See Appendix)  [&lt;a href="#Kahramanli"&gt;Kahramanli&lt;/a&gt; 9].&lt;/p&gt;
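&lt;p&gt;The coding step amounts to one-hot binning.  A sketch using the mcv ranges quoted above (the edge list and names are mine):&lt;/p&gt;

```python
from bisect import bisect_right

# Interior edges for the five mcv ranges quoted in the text:
# [0,64), [65,85), [85,90), [90,95), [95,200]
MCV_EDGES = [65, 85, 90, 95]

def code_mcv(value):
    """One-hot encode an mcv reading into its five range categories."""
    vector = [0, 0, 0, 0, 0]
    vector[bisect_right(MCV_EDGES, value)] = 1
    return vector
```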
&lt;p&gt;Li presents two training approaches.  The first approach reduces the threshold &lt;strong&gt;&amp;#955;&lt;/strong&gt; of a hidden node “such that the wrong activations of this node is eliminated.  This process occurs when an input pattern activates some hidden nodes which are not committed to the same class as the input pattern [&lt;a href="#Li"&gt;Li&lt;/a&gt; 848].”  My feedback approach uses his second approach, which tunes the weights between the input and hidden nodes.  The RCE net commits each hidden node to an output node of one class.  If the net cannot correctly classify a known exemplar, “change weights between input nodes &amp;amp; hidden nodes until you activate this node [&lt;a href="#Li"&gt;Li&lt;/a&gt; 848].”  We must take care in changing the input to hidden weights of an exemplar classifier, since it “brings forth the potential of a training procedure whose error criterion is non-convergent [&lt;a href="#Hudak"&gt;Hudak&lt;/a&gt; 853].”  The nature of an exemplar classifier is such that changing the weights to one hidden node in order to activate it may throw off the balance of the system and cause other patterns to become incorrectly classified.&lt;/p&gt;
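&lt;p&gt;A hypothetical sketch of one feedback step, assuming a simple move-toward-the-exemplar update (the actual Octave tuning rule may differ):&lt;/p&gt;

```python
def tune_centroid(centroid, exemplar, eta):
    """One feedback step: nudge a hidden node's centroid toward a
    misclassified exemplar, w_new = w + eta * (x - w).

    This update rule is an assumed, simplified stand-in for Li's
    input-to-hidden weight tuning.  A small eta limits how far the node
    moves, so correctly classified patterns are less likely to fall
    outside its hypersphere (the non-convergence risk Hudak warns about).
    """
    return [w + eta * (x - w) for w, x in zip(centroid, exemplar)]
```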
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;I download the BUPA liver disorders database from the University of California, Irvine (UCI) &lt;a href="http://archive.ics.uci.edu/ml/machine-learning-databases/liver-disorders/bupa.data"&gt;machine learning repository&lt;/a&gt;.  The BUPA data has six features and two classes (one for alcohol related liver disorder, and one for alcohol unrelated liver disorder).  BUPA features include “mean corpuscular volume (mcv),” half-pint equivalents of alcohol per day (drinks) and four chemical markers:  (1) alkaline phosphotase (alkphos) (2) alamine aminotransferase (sgpt) (3) aspartate aminotransferase (sgot) and (4) gamma-glutamyl transpeptidase (gammagt).  Using Fisher discriminant, Scatter Matrices, Divergence and Bhattacharyya distance methods, I pare down the feature space to candidate feature subsets.  Lippmann writes, "features should contain information required to distinguish between classes, be insensitive to irrelevant variability in the input, and also be limited in number to permit efficient computation of discriminant functions and to limit the amount of training data required" [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 47].&lt;/p&gt;
&lt;p&gt;In summary, all methods picked parameters [2 5] for the two feature case, [2 5 6] for the three feature case, [2 3 5 6] for four features and [2 3 4 5 6] for five features.  For the coded data, the methods selected [19 20 21 23 29] for five features, [4 9 10 14 16 18 25 28 29 30] for ten and [1 2 3 4 7 9 10 16 19 21 25 28 29 31] for fifteen.  This shows that certain features correlate more with a class when they are coded to a certain range.  The following figures (Figure 8 &amp;amp; Figure 9) show two (2) and three (3) feature plots:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Poorly Separable" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/08_Poorly_Separable_Two_Dimensional_Feature_Vector.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Two Approaches" src="https://john.soban.ski/images/Reduced_Coulomb_Energy_Neural_Network_Bupa/09_3d_Plots_Of_Twp_Approaches.png"&gt;&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;In general, coding the data does not give us any gain, either in the “feedback-tuned” case or the “non-feedback-tuned” RCE net case.  All un-coded strategies yield lower error and ambiguity than the coded strategies, with one exception (when feedback tuned, the fifteen (15) feature coded data set yields lower error and ambiguity than the five (5) feature un-coded data set).  In all cases (coded and un-coded), however, paring the feature space down yields less error than using all features, with one exception.  For feedback tuning, the coded “all features” feature vector performs better than the coded ten (10) feature vector.&lt;/p&gt;
&lt;p&gt;Using no feedback, the results follow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4 Features, Un-coded: 22% error, 40% ambiguity&lt;/li&gt;
&lt;li&gt;5 Features, Un-coded: 18% error, 49% ambiguity&lt;/li&gt;
&lt;li&gt;3 Features, Un-coded: 22% error, 45% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Un-coded: 19% error, 49% ambiguity&lt;/li&gt;
&lt;li&gt;15 Features, Coded: 38% error, 40% ambiguity&lt;/li&gt;
&lt;li&gt;10 Features, Coded: 31% error, 48% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Coded: 44% error, 42% ambiguity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using “feedback-tuning,” the three (3) feature un-coded data set yields the lowest error and ambiguity.  For coded data, fifteen parameters yield the lowest error and ambiguity.  Due to the delicate nature of tuning in an exemplar neural net, feedback tuning efficiency relies heavily on the learning rate.  In my analysis I iterate different learning rate values (&lt;strong&gt;&amp;#951;&lt;/strong&gt;) ranging from 0.05 to 1.0, with a step of 0.05.  For each case (coded/un-coded and different feature vector dimensions), I iterate 200 times and select the lowest error/ ambiguity.  In the list below, I show &lt;strong&gt;&amp;#951;&lt;/strong&gt; that yields the lowest error/ ambiguity.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3 Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25: 20% error, 31% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.10: 18% error, 35% ambiguity&lt;/li&gt;
&lt;li&gt;4 Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.15: 17% error, 40% ambiguity&lt;/li&gt;
&lt;li&gt;15 Features, Coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25: 33% error, 25% ambiguity&lt;/li&gt;
&lt;li&gt;5 Features, Un-coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.50: 17% error, 42% ambiguity&lt;/li&gt;
&lt;li&gt;All Features, Coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.30: 36% error, 24% ambiguity&lt;/li&gt;
&lt;li&gt;10 Features, Coded, &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.70: 23% error, 39% ambiguity&lt;/li&gt;
&lt;/ul&gt;
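&lt;p&gt;The learning-rate sweep described above can be sketched as follows, where &lt;strong&gt;train_rce&lt;/strong&gt; and &lt;strong&gt;evaluate&lt;/strong&gt; are hypothetical stand-ins for the Octave routines in the appendix:&lt;/p&gt;

```python
def sweep_eta(train_rce, evaluate, steps=200):
    """Grid search eta over 0.05, 0.10, ..., 1.00.

    Trains one net per candidate eta for `steps` learning iterations,
    then min() picks the lowest error, breaking ties on ambiguity.
    Returns the winning (error, ambiguity, eta) triple.
    """
    candidates = []
    for i in range(1, 21):
        eta = round(i * 0.05, 2)
        net = train_rce(eta=eta, iterations=steps)
        error, ambiguity = evaluate(net)
        candidates.append((error, ambiguity, eta))
    return min(candidates)
```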
&lt;p&gt;Not surprisingly, feedback tuning yields lower error/ambiguity than the non-feedback-tuned case, with one exception.  The non-feedback-tuned four feature un-coded method actually yields lower error/ ambiguity than the feedback-tuned ten (10) feature coded method.  The coded ten (10) feature method produces the worst results, which indicates poor feature selection, since the coded five (5) feature method performs better in all cases.&lt;/p&gt;
&lt;p&gt;Now let’s look at the usefulness of my feedback strategy.  The table below shows the gain (or reduction in error/ambiguity).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Reduction in Error&lt;/th&gt;
&lt;th&gt;Reduction in Ambiguity&lt;/th&gt;
&lt;th&gt;Error (Gain)&lt;/th&gt;
&lt;th&gt;Ambiguity (Gain)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coded All Feats&lt;/td&gt;
&lt;td&gt;18.10%&lt;/td&gt;
&lt;td&gt;42.90%&lt;/td&gt;
&lt;td&gt;-0.87 dB&lt;/td&gt;
&lt;td&gt;-2.43 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coded 15 Feats&lt;/td&gt;
&lt;td&gt;13.20%&lt;/td&gt;
&lt;td&gt;37.50%&lt;/td&gt;
&lt;td&gt;-0.61 dB&lt;/td&gt;
&lt;td&gt;-2.04 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coded 10 Feats&lt;/td&gt;
&lt;td&gt;25.80%&lt;/td&gt;
&lt;td&gt;18.80%&lt;/td&gt;
&lt;td&gt;-1.3 dB&lt;/td&gt;
&lt;td&gt;-0.90 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded 3 Feats&lt;/td&gt;
&lt;td&gt;9.10%&lt;/td&gt;
&lt;td&gt;31.10%&lt;/td&gt;
&lt;td&gt;-0.41 dB&lt;/td&gt;
&lt;td&gt;-1.62 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded All Feats&lt;/td&gt;
&lt;td&gt;5.30%&lt;/td&gt;
&lt;td&gt;28.60%&lt;/td&gt;
&lt;td&gt;-0.23 dB&lt;/td&gt;
&lt;td&gt;-1.46 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded 4 Feats&lt;/td&gt;
&lt;td&gt;22.80%&lt;/td&gt;
&lt;td&gt;0.00%&lt;/td&gt;
&lt;td&gt;-1.12 dB&lt;/td&gt;
&lt;td&gt;0.00 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Un-coded 5 Feats&lt;/td&gt;
&lt;td&gt;5.60%&lt;/td&gt;
&lt;td&gt;14.30%&lt;/td&gt;
&lt;td&gt;-0.25 dB&lt;/td&gt;
&lt;td&gt;-0.67 dB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In all cases, my tuning algorithm helps reduce the error and/or ambiguity.  My tuning algorithm produces the most gain for the inferior performing “coded” data set.&lt;/p&gt;
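&lt;p&gt;The dB columns in the table appear to follow the convention gain = 10*log10(1 - reduction); a quick check against the first rows:&lt;/p&gt;

```python
import math

def gain_db(reduction):
    """Convert a fractional reduction into the table's dB gain.

    Assumes the convention gain = 10*log10(1 - reduction): an 18.1%
    error reduction leaves 81.9% of the original error, and
    10*log10(0.819) is about -0.87 dB, matching the first table row.
    """
    return 10.0 * math.log10(1.0 - reduction)
```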
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Lippmann and Hudak criticize RCE nets.  Lippmann writes that RCE nets “may require large amounts of memory and computation time [&lt;a href="#Lippmann"&gt;Lippmann&lt;/a&gt; 49].”  Hudak writes, “Viewing RCE as an incremental nearest-neighbor classifier with hyper-spheres leads to the conclusion that the hyper-spheres are not positively contributing to the performance of the classifier.  At best their presence is ineffectual, but their management during training entails a computational cost that is not justified by these findings [&lt;a href="#Hudak"&gt;Hudak&lt;/a&gt; 852].”&lt;/p&gt;
&lt;p&gt;I experience a “computational cost” during management of the RCE net training.  Iterating through 20 candidate values of &lt;strong&gt;&amp;#951;&lt;/strong&gt;, and then iterating 200 learning steps for each, takes tens of minutes.  Once I identify the proper weight tuning for the data, however, classification occurs in real time.  RCE also offers a benefit through its explicit ambiguity output.  Even Hudak writes, “hyper-spherical classifiers can recognize patterns from an unknown class as not belonging to any class known to the classifier.  If true, this would be an advantage over the nearest-neighbor classifier [&lt;a href="#Hudak"&gt;Hudak&lt;/a&gt; 853].”&lt;/p&gt;
&lt;p&gt;In conclusion, binary coding does not help reduce error/ambiguity.  Reducing the feature set on the un-coded data reduces error/ambiguity.  My feedback-tuning algorithm, while computationally expensive, reduces error/ambiguity in all cases, and it yields the greatest gain on the poorer-performing coded data set.  The best case reduces the un-coded feature vector to half its dimension, using the features “alkaline phosphotase,” “gamma-glutamyl transpeptidase” and “number of half-pint equivalents of alcoholic beverages drunk per day,” with feedback tuning at a learning rate of &lt;strong&gt;&amp;#951;&lt;/strong&gt;=0.25.  This yields an error of 20% and an ambiguity of 31%.&lt;/p&gt;
&lt;p&gt;If you enjoyed this blog post, please check out these related blog posts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;Exploratory Factor Analysis (EFA) Workflow and Interpretation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;EFA - The Math and Algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/graphical_intro_to_probabilistic_neural_networks.html"&gt;Probabalistic Parzen Neural Networks (PNN) with cartoons&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fastai-flask.html"&gt;Vision model w/ FAST AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/gcp-automl-vision.html"&gt;Vision model w/ Google AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://john.soban.ski/fast-and-easy-automl-optimize.html"&gt;Google AutoML Tables Beta&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bibliography&lt;/h2&gt;
&lt;p&gt;&lt;a name="Bagirov"&gt;&lt;/a&gt;Bagirov, A.M., Ugon, J. “Supervised Data Classification via Max-min Separability.”  Mathematical Modeling of Bio-systems Springer Berlin Heidelberg, 2008: 1-23&lt;/p&gt;
&lt;p&gt;&lt;a name="Bojarczuk"&gt;&lt;/a&gt;Bojarczuk, C.C., Lopes, H.S., Freitas, A.A. and Michalkiewicz, E. L.  “A constrained-syntax genetic programming system for discovering classification rules: application to medical data sets.”  Artificial Intelligence in Medicine 2004:27-48&lt;/p&gt;
&lt;p&gt;&lt;a name="Cooper"&gt;&lt;/a&gt;Cooper, Leon N.  “A Neural Model for Category Learning.”  Center for Neural Science and Department of Physics, Brown University Providence, R.I. 1982&lt;/p&gt;
&lt;p&gt;&lt;a name="Cordella"&gt;&lt;/a&gt;Cordella, L.P., De Stefano, C., Fontanella, F.  “A Novel Genetic Programming Based Approach for Classification Problems.”  Lecture Notes in Computer Science, Image Analysis and Processing ICIAP 2005: 727-734&lt;/p&gt;
&lt;p&gt;&lt;a name="Goncalves"&gt;&lt;/a&gt;Goncalves, L.B., Vellasco, M.M., Cavalcanti, M.A., Pacheco M.A.  “Inverted Hierarchical Neuro-Fuzzy BSP System:A Novel Neuro-Fuzzy Model for Pattern Classification and Rule Extraction in Databases.”  IEEE Transactions On Systems, Man And Cybernetics Part C: Applications And Reviews 2006: 236-248&lt;/p&gt;
&lt;p&gt;&lt;a name="Hudak"&gt;&lt;/a&gt;Hudak M.J.  “RCE Networks:  An Experimental Investigation.”  Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on Jul. 1991: 849-854&lt;/p&gt;
&lt;p&gt;&lt;a name="Kahramanli"&gt;&lt;/a&gt;Kahramanli, Humar, Allahverdi, Novruz  “Mining Classification Rules for Liver Disorders.”  International Journal of Mathematics and computers in simulation Issue 1, Volume 3: 2009&lt;/p&gt;
&lt;p&gt;&lt;a name="Li"&gt;&lt;/a&gt;Li, Wei  “Invariant Object Recognition Based on a Neural Network of Cascaded RCE nets.”  Neural Networks, 1990., 1990 IJCNN International Joint Conference on Jun 1990:17-21&lt;/p&gt;
&lt;p&gt;&lt;a name="Lippmann"&gt;&lt;/a&gt;Lippmann, Richard P.  “Pattern Classification Using Neural Networks.”  IEEE Communications Magazine Nov. 1989&lt;/p&gt;
&lt;p&gt;&lt;a name="Raicharoen"&gt;&lt;/a&gt;Raicharoen, T., Lursinsap, C.  “Critical Support Vector Machine Without Kernel Function.”  Neural information Processing, 9th International Conference on  (ICONIPOZ) 2002: 2532-2536&lt;/p&gt;
&lt;p&gt;&lt;a name="Roan"&gt;&lt;/a&gt;Roan, Sing-Ming  “Fuzzy RCE Neural Network.”  Fuzzy Systems, 1993., Second IEEE International Conference on 1993:629-634&lt;/p&gt;
&lt;p&gt;&lt;a name="Scofield"&gt;&lt;/a&gt;Scofield, Christopher L.  “Pattern class degeneracy in an unrestricted storage density memory.”  Nestor, Inc.  Providence, RI.  1988&lt;/p&gt;</content><category term="Data Science"></category><category term="Octave"></category><category term="RCE"></category><category term="Neural Networks"></category><category term="Machine Learning"></category><category term="Data Science"></category></entry><entry><title>Big Data Idol: The Math</title><link href="https://john.soban.ski/big-data-idol-the-math.html" rel="alternate"></link><published>2017-11-18T01:17:00-05:00</published><updated>2017-11-18T01:17:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-11-18:/big-data-idol-the-math.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Caution! Math Ahead&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;For the Math-phobic, I explain how I crunch the test results in a math-free, simple and focused blog post &lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I use math here, so this may be your last chance to escape! Still with me? Excellent!&lt;/p&gt;
&lt;p&gt;The bullets below outline the steps we take.  A flow …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Caution! Math Ahead&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;For the Math-phobic, I explain how I crunch the test results in a math-free, simple and focused blog post &lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I use math here, so this may be your last chance to escape! Still with me? Excellent!&lt;/p&gt;
&lt;p&gt;The bullets below outline the steps we take.  A flow diagram follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transform Elasticsearch Database to Comma Separated Variables (CSV)&lt;/li&gt;
&lt;li&gt;Load into &lt;a href="https://www.r-project.org/about.html"&gt;R&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Normalize data&lt;/li&gt;
&lt;li&gt;Exploratory Factor Analysis (EFA)&lt;ul&gt;
&lt;li&gt;Dimensionality Reduction (Iterate on &lt;em&gt;n&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Plot Data on Graphs&lt;/li&gt;
&lt;li&gt;Identify Factor Names&lt;/li&gt;
&lt;li&gt;Isolate Factor Weight Matrix&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Dot Product Answer Matrix with Weight Matrix&lt;/li&gt;
&lt;li&gt;Fit Factor Sums matrix Theoretical Models&lt;/li&gt;
&lt;li&gt;Plot Data&lt;/li&gt;
&lt;li&gt;Guess Distribution&lt;/li&gt;
&lt;li&gt;Fit, Graph, QQ-Plot&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img alt="Workflow" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/workflow.jpg"&gt;
 
&lt;strong&gt;Extract and Transform&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;I chose a NoSQL &lt;a href="https://www.elastic.co/"&gt;Elasticsearch&lt;/a&gt; document store (database) to hold all of the test results, metadata and identity information. In addition to private search services (e.g. auto-completion, 'did you mean?', etc.), Elasticsearch provides (massive) scalability and integration with a robust web based GUI named Kibana. Kibana provides trend plots, pie graphs, keyword searches and a host of other features.&lt;/p&gt;
&lt;p&gt;I need to extract data from the NoSQL document store and translate it into a structured format for R. I use the excellent Elasticsearch &lt;a href="https://elasticsearch-dsl.readthedocs.org/en/latest/"&gt;Domain Specific Language&lt;/a&gt; (DSL) python library to do just that. &lt;/p&gt;
&lt;p&gt;As I mentioned in &lt;a href="https://john.soban.ski/part-1-connect-ec2-to-the-amazon-elasticsearch-service.html"&gt;HOWTO-1&lt;/a&gt;, I must first serialize the data to JSON, in order to use the Amazon IAM service with the Amazon Elasticsearch service. When I roll up my sleeves and dive in, I notice the (trivial) Elasticsearch DSL "&lt;a href="https://elasticsearch-dsl.readthedocs.org/en/latest/search_dsl.html#pagination"&gt;scan&lt;/a&gt;" method requires a &lt;a href="http://elasticsearch-py.readthedocs.org/en/master/"&gt;low-level Elasticsearch client&lt;/a&gt; connection object to operate.&lt;/p&gt;
&lt;p&gt;In order to use REST/JSON calls, therefore, I need to scan "by hand." The official Elasticsearch &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#scroll-search-results"&gt;documents&lt;/a&gt; point us to the low level &lt;a href="http://elasticsearch-py.readthedocs.org/en/master/helpers.html"&gt;elasticsearch-py libraries&lt;/a&gt; but since elasticsearch-dsl extends these, they do not help with our problem of needing to serialize to JSON and pass via an extended AWSAuthConnection object. &lt;/p&gt;
&lt;p&gt;To scroll by hand, I first request the Elasticsearch API to set the search type to "scan" with a scroll duration of ten minutes. Elasticsearch responds with a scroll ID. I use the scroll ID to request the first batch of documents, and Elasticsearch responds with the documents and the current scroll_id (it may update). I then iterate until the process finishes.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;make_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;GET&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;/_search/scroll&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_scroll_id&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;scroll&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;10m&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;hits&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;#...process the documents&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;HOT TIP&lt;/strong&gt;: Dump an Entire Document Store&lt;/p&gt;
&lt;p&gt;If you connect to your ES AWS service via an IAM role
(AWSAuthConnection) then&lt;br&gt;
(1) Make a scan request to turn off sorting&lt;br&gt;
(2) Set an appropriate scroll duration (10 minutes)&lt;br&gt;
(3) Iterate through all of the documents with a scroll request&lt;br&gt;
  (a) On each iteration, pass the current scroll_id&lt;/p&gt;
&lt;p&gt;(If you connect to your AWS service via IP whitelisting then use your search object's scan method, e.g. s.scan())&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Elasticsearch returns all of the documents in a schemaless JSON format.&lt;/p&gt;
&lt;p&gt;For example:  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_index&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pilot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_type&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_id&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AVHfVLNTHootPMn5yhhf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_version&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;found&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_source&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q47&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q2&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q6&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q33&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q25&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We need to transform the schema-less responses into something structured. With Python, we can translate the dictionary &lt;strong&gt;{"q47":"y","q2":"y","q6":"y","q33":"y","q25":"y"}&lt;/strong&gt; into a table format for R to read.&lt;/p&gt;
&lt;p&gt;Note the arbitrary placement of the questions in the JSON response. We need simple logic to cycle through all fifty questions (q0, q1 ... q49) and test if they reside in the response. We could use a case statement with fifty individual tests, but instead I decided to use arrays. I create a simple list, "q", with string names for each question. I then create an array, "scorecard", with fifty zeros. If the loop discovers a match for a question, "q25" for example, it places a one (1) in that list position of scorecard. At the end I receive a table of results, one row per document, and a positional on/off for each "yes" in that document.&lt;/p&gt;
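&lt;p&gt;That scorecard logic amounts to the following simplified sketch (the gist holds the full dump script; the function name is mine):&lt;/p&gt;

```python
def to_row(doc_id, source, num_questions=50):
    """Flatten one Elasticsearch _source dict into a positional CSV row.

    source looks like {"q47": "y", "q2": "y", ...}; any question key
    absent from the dict counts as a "no" and stays 0.
    """
    scorecard = [0] * num_questions
    for i in range(num_questions):
        if 'q%d' % i in source:
            scorecard[i] = 1
    return ','.join([doc_id] + [str(v) for v in scorecard])
```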
&lt;p&gt;The dump script follows:  &lt;/p&gt;
&lt;p&gt;
&lt;script src="https://gist.github.com/hatdropper1977/5e646a58d74f34fb24bb.js"&gt;&lt;/script&gt;
&lt;/p&gt;

&lt;p&gt;The script transforms...  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_index&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pilot&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_type&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;test&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_id&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;AVHfVLNTHootPMn5yhhf&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_version&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;found&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="s2"&gt;&amp;quot;_source&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q47&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q2&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q6&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q33&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;q25&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;y&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;...into one row.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;AVHfVLNTHootPMn5yhhf,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Factor Analysis&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.statpower.net/"&gt;James H. Steiger&lt;/a&gt; writes an excellent white paper on using &lt;a href="https://www.google.com/search?q=Exploratory+Factor+Analysis+with+R+James+H.+Steiger"&gt;R for EFA&lt;/a&gt; titled "Exploratory Factor Analysis With R." I tried out several of the R libriaries to include &lt;strong&gt;&lt;em&gt;principal&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;factor.pa&lt;/em&gt;&lt;/strong&gt; in the &lt;a href="https://cran.r-project.org/web/packages/psych/index.html"&gt;psych&lt;/a&gt; and &lt;strong&gt;&lt;em&gt;princomp&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;factanal&lt;/em&gt;&lt;/strong&gt; in &lt;a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/00Index.html"&gt;stats&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The princomp library excited me at first because all of the vectors made perfect sense (in terms of grouping the questions by "yes" answers). When I applied the weight matrix to the initial answer matrix, however, I saw that 80% of the test takers fell into the first component. This concerned me until I realized that the first component held most of the variance, and that raw PCA would not be the appropriate tool to separate test takers into roughly equal groupings.&lt;/p&gt;
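&lt;p&gt;As a minimal sketch of that check (using a simulated stand-in for the real answer matrix, since the survey responses are not reproduced here), princomp reports how much of the total variance each component captures:&lt;/p&gt;

```r
# Toy stand-in for the real 0/1 answer matrix "bd"
# (1,000 test takers x 50 yes/no questions).
# This is simulated data, NOT the actual survey responses.
set.seed(1)
bd <- as.data.frame(matrix(rbinom(1000 * 50, 1, 0.3), ncol = 50))

# Raw PCA: the squared sdev of each component gives its share of the variance.
pc <- princomp(bd)
var_share <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var_share[1:3], 3))
```

&lt;p&gt;With the real data, the first component dominates this share, which is why raw PCA lumped most test takers together.&lt;/p&gt;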
&lt;p&gt;The hilariously named factanal library (huh huh, Beavis) accommodated my use case. I used the common factor analysis approach (vs. component analysis), which fits via Maximum Likelihood, and rotated with varimax.&lt;/p&gt;
&lt;p&gt;I first loaded the data and normalized the data set to the range [-1, 1].  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;bdpt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;read.csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;data.csv&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get rid of the ID&amp;#39;s&lt;/span&gt;
&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bdpt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;51&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Uncomment to change zeros to negative ones&lt;/span&gt;
&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-1&lt;/span&gt;

&lt;span class="c1"&gt;# Uncomment to sample&lt;/span&gt;
&lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ind&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I fit with Maximum Likelihood, select three (3) factors, and rotate with varimax.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;factanal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;varimax&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We can look at the loadings for each factor.  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;Loadings&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;Factor1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Factor2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Factor3&lt;/span&gt;
&lt;span class="n"&gt;tire&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.114&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.214&lt;/span&gt;
&lt;span class="n"&gt;rich&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.231&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.222&lt;/span&gt;
&lt;span class="n"&gt;dangerous&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.257&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.165&lt;/span&gt;
&lt;span class="n"&gt;fame&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.172&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.169&lt;/span&gt;
&lt;span class="n"&gt;hair&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.200&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.168&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.199&lt;/span&gt;
&lt;span class="n"&gt;bet&lt;/span&gt;
&lt;span class="n"&gt;taxes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.143&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.286&lt;/span&gt;
&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.141&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.303&lt;/span&gt;
&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.167&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.105&lt;/span&gt;
&lt;span class="n"&gt;losers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.139&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.290&lt;/span&gt;
&lt;span class="n"&gt;pretty_women&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.796&lt;/span&gt;
&lt;span class="n"&gt;pretty_men&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.676&lt;/span&gt;
&lt;span class="n"&gt;vhs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.184&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.186&lt;/span&gt;
&lt;span class="n"&gt;punch&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.362&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.164&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.109&lt;/span&gt;
&lt;span class="n"&gt;jerkface&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.380&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.151&lt;/span&gt;
&lt;span class="n"&gt;facebook&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.128&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.113&lt;/span&gt;
&lt;span class="n"&gt;champion&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.439&lt;/span&gt;
&lt;span class="n"&gt;God&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-0.300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.237&lt;/span&gt;
&lt;span class="n"&gt;speed&lt;/span&gt;
&lt;span class="n"&gt;advice&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.291&lt;/span&gt;
&lt;span class="n"&gt;mouse&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.229&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.169&lt;/span&gt;
&lt;span class="n"&gt;drunk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.406&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.165&lt;/span&gt;
&lt;span class="n"&gt;kleeb&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.267&lt;/span&gt;
&lt;span class="n"&gt;gravitate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.224&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.259&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.118&lt;/span&gt;
&lt;span class="n"&gt;no_plan&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.309&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.222&lt;/span&gt;
&lt;span class="n"&gt;drugs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.445&lt;/span&gt;
&lt;span class="n"&gt;fan&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.136&lt;/span&gt;
&lt;span class="n"&gt;work_hard&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.304&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.124&lt;/span&gt;
&lt;span class="n"&gt;potential&lt;/span&gt;
&lt;span class="n"&gt;intelligence&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.210&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.218&lt;/span&gt;
&lt;span class="n"&gt;wait_food&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.298&lt;/span&gt;
&lt;span class="n"&gt;bossy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.127&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.105&lt;/span&gt;
&lt;span class="n"&gt;world_against&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.264&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.156&lt;/span&gt;
&lt;span class="n"&gt;suffer_evil&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.409&lt;/span&gt;
&lt;span class="n"&gt;trust_cops&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;-0.245&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.276&lt;/span&gt;
&lt;span class="n"&gt;learning&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.408&lt;/span&gt;
&lt;span class="n"&gt;cult&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.383&lt;/span&gt;
&lt;span class="n"&gt;naked&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.328&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.135&lt;/span&gt;
&lt;span class="n"&gt;door&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.159&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.167&lt;/span&gt;
&lt;span class="n"&gt;grass&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.158&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.184&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.131&lt;/span&gt;
&lt;span class="n"&gt;paycheck&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.242&lt;/span&gt;
&lt;span class="n"&gt;fashion&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.155&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.266&lt;/span&gt;
&lt;span class="n"&gt;locks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.204&lt;/span&gt;
&lt;span class="n"&gt;love&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.229&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.110&lt;/span&gt;
&lt;span class="n"&gt;dogs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.362&lt;/span&gt;
&lt;span class="n"&gt;baby_corner&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.178&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.253&lt;/span&gt;
&lt;span class="n"&gt;listen&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.157&lt;/span&gt;
&lt;span class="n"&gt;transit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.124&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.333&lt;/span&gt;
&lt;span class="n"&gt;motorcycle&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.154&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.103&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.134&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The chart shows that "being drunk in public" and "not believing in God" correlates strongly with factor 1, "championing others" and "suffering evil vs. being evil" correlates with factor 2, and "men/women should be pretty" highly correlates with factor 3. I named these factors "hellraiser," "boy scout" and "celebrity."&lt;/p&gt;
&lt;p&gt;The code that follows shows the relative weights of each "question" for each factor. I provide 2d and 3d graphs. For more detail, click &lt;a href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# #################&lt;/span&gt;
&lt;span class="c1"&gt;# # Graph Factors #&lt;/span&gt;
&lt;span class="c1"&gt;# # in 2 and 3d #&lt;/span&gt;
&lt;span class="c1"&gt;# #################&lt;/span&gt;
&lt;span class="nf"&gt;layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;which.max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;darkgreen&amp;quot;&lt;/span&gt;
&lt;span class="n"&gt;s3d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;scatterplot3d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pcolor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;pch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;h&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lty.hplot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;angle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Hellraiser&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Boy Scout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;zlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Celebrity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;s3d.xyz&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s3d&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;xyz.convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3d.xyz&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;s3d.xyz&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;row.names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;topleft&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Factor Assignment&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Hellraiser&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Boy Scout&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Celebrity&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;red&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;blue&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="s"&gt;&amp;quot;darkgreen&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Boy Scout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser vs. Boy Scout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser vs. Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Boy Scout&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ylab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Boy Scout vs. Celebrity&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Crunch the Numbers&lt;/strong&gt;  &lt;/p&gt;
&lt;p&gt;For each test taker, I tally up their total factor weights based on how they answer each question. To process the data, I perform a simple dot product of the "User Answer Matrix" and the "User Factor Weight" matrix, which yields a "User Factor Sums Matrix." I then normalize the "User Factor Sums Matrix" and pull out zero values in order to try certain theoretical fits (such as Gamma).  &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;########################&lt;/span&gt;
&lt;span class="c1"&gt;# The number crunching #&lt;/span&gt;
&lt;span class="c1"&gt;########################&lt;/span&gt;

&lt;span class="c1"&gt;# Convert loadings to a weight matrix&lt;/span&gt;
&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;loadings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Dot product of answers with the weight matrix&lt;/span&gt;
&lt;span class="c1"&gt;# to get factor sums for each test taker&lt;/span&gt;
&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Normalize the test taker&amp;#39;s factor sums&lt;/span&gt;
&lt;span class="c1"&gt;# between 0 and 100&lt;/span&gt;
&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Get rid of the Zero values&lt;/span&gt;
&lt;span class="c1"&gt;# so we can fit to a Gamma&lt;/span&gt;
&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.01&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each user has a weight sum for each factor. I take these data points and try to fit them to a theoretical model, starting with the Gamma distribution. The following lines fit the data and then pull out just the shape and rate parameters for each of the three fits.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Figure out rate and scale for fit for each factor&lt;/span&gt;
&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fitdistr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;gamma&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Pull just shape and rate&lt;/span&gt;
&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The next example shows the code for the "Hellraiser" factor. I plot a &lt;a href="https://en.wikipedia.org/wiki/Kernel_density_estimation"&gt;Kernel Density Estimation (KDE)&lt;/a&gt; of the empirical data, then overlay points pulled from a Gamma density function with the "shape" and "rate" parameters found above. You will find the complete code for all three factors at the end of this blog post.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# &amp;quot;Hellraiser&amp;quot; pdf&lt;/span&gt;
&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;density&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;xlab&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Hellraiser Weight&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Kernel Density Estimation (Solid) vs. Gamma Fit (Dots)&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;lty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;topright&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;inset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;n&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;shape = &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))),&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;rate = &amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span 
class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;par&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;T&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;points&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;dgamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can see a good fit for "Hellraiser" and "Celebrity," but a poor fit for "Boy Scout."&lt;/p&gt;
&lt;p&gt;&lt;img alt="Hellraiser Gamma" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/hellraiser_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Celebrity Gamma" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/celebrity_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Boy Scout Gamma" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/boy_scout_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Hellraiser&lt;/span&gt;
&lt;span class="n"&gt;x.fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rgamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;x.empirical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;qqplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x.empirical&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x.fit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Sample vs. Theoretical for Hellraiser Fit&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;abline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;img alt="QQ-Plot Hellraiser" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_hellraiser-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="QQ-Plot Celebrity" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_celebrity-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="QQ-Plot Boyscout" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_boyscout_gamma-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;Again, the QQ plots show good fits for all the factors except "Boy Scout." With this guidance, I re-fit the "Boy Scout" data to a normal theoretical model.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Find mean and sd&lt;/span&gt;
&lt;span class="n"&gt;std_boyscout_fit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fitdistr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;normal&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The calculation produces a much better fit, as shown by the new "Boy Scout" overlay plot and QQ-Plot.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Normal Boyscout" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/normal_boyscout-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="QQ-Plot Boyscout Normal" src="https://john.soban.ski/images/Big_Data_Idol_The_Math/qq_boyscout_normal-300x242.png"&gt;&lt;/p&gt;
&lt;p&gt;We now know the factor weight sums for each user, and we use the fitted density functions to find out where each user stands compared to the other users. To compute a percentile, we send each user's weight sum vector through the appropriate theoretical model, with the appropriate density function parameters.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# Convert the normalized weights to percentile&lt;/span&gt;
&lt;span class="c1"&gt;# based on our fit model distributions&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nf"&gt;sapply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;seq_len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;pgamma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gamma_fit&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Replace the gamma fit for Boy Scout&lt;/span&gt;
&lt;span class="c1"&gt;# with the Standard&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;pnorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_answers&lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std_boyscout_fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;sd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;std_boyscout_fit&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;``&lt;/span&gt;

&lt;span class="n"&gt;We&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;on/off&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;logic.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;If&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lies&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;greater&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;than&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fiftieth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;we&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;that&lt;/span&gt;&lt;span 
class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;factor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;on.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;The&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;simple&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;logic&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;then&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gives&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;us&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;eight&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;types.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;We&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;assign&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;each&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;performing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;conversion&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;binary&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;decimal.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span 
class="n"&gt;dot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;between&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;three&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dimensional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;factor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;performs&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;conversion.&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;


&lt;span class="n"&gt;``&lt;/span&gt;`&lt;span class="n"&gt;R&lt;/span&gt;
&lt;span class="c1"&gt;# For all test takers, set all values below mean to zero (per factor)&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Set all values above mean to one (per factor)&lt;/span&gt;
&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# Map each test taker to one of seven classes based on their on/off values for each factor&lt;/span&gt;
&lt;span class="n"&gt;classifications&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentiles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
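&lt;p&gt;The dot product with [1,2,4] is just a binary-to-decimal conversion. Here is a minimal Python sketch of the same logic (the factor order follows the column order used above; the helper name is my own, for illustration):&lt;/p&gt;

```python
# On/off bits per factor, in column order: [Hellraiser, Boy Scout, Celebrity]
PLACE_VALUES = [1, 2, 4]  # binary place values

def classify(bits):
    # Dot product with [1, 2, 4] maps a 3-bit vector to a class from 0 to 7
    return sum(bit * value for bit, value in zip(bits, PLACE_VALUES))

print(classify([1, 0, 1]))  # Hellraiser + Celebrity on -> 1 + 4 = 5
print(classify([1, 1, 1]))  # all three factors on -> 7
```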

&lt;p&gt;When you take the test (after I batch process the results), you will receive your classification and percentiles for each of the three factors.&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="HOWTO"></category><category term="Data Science"></category></entry><entry><title>Big Data Idol: How I Crunched the Numbers</title><link href="https://john.soban.ski/big-data-idol-how-i-crunched-the-numbers.html" rel="alternate"></link><published>2017-10-14T12:20:00-04:00</published><updated>2017-10-14T12:20:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-10-14:/big-data-idol-how-i-crunched-the-numbers.html</id><summary type="html">&lt;p&gt;&lt;strong&gt;Do you have big data chops?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Quick, what do these three things have in common?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yankees, Giants, Rangers, Knicks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What about these?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Beatles, Monkees, Beach Boys&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do you have an answer for each? "New York," for example, for the first list and "Rock and Roll" for the second? If so …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Do you have big data chops?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Quick, what do these three things have in common?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Yankees, Giants, Rangers, Knicks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What about these?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Beatles, Monkees, Beach Boys&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do you have an answer for each? "New York," for example, for the first list and "Rock and Roll" for the second? If so, call yourself a "data scientist!"&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Computer Groups Users&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Big Data Personality Test follows a similar process. I feed 1 billion test results into a computer program that "learns" the data. The program looks for patterns and finds that, for example, people who answer "yes" to "question 1" also answer "yes" to questions 8, 9, and 31 and "no" to question 15. The program then separates test takers based on their answer patterns.&lt;/p&gt;
&lt;p&gt;Now answer the following question. You may find the question a little tricky so I drew a picture to help you "cheat." Think of a row on a "&lt;em&gt;Tic Tac Toe&lt;/em&gt;" board. You have three spaces and can place either an "X" or "O" on each space. How many unique rows (patterns) can you make from this?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Tic Tac Toe" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/tictac.png"&gt;&lt;/p&gt;
&lt;p&gt;As you can see from the picture, each row in a game of &lt;em&gt;Tic Tac Toe&lt;/em&gt; can only have one of eight possible patterns. Three spaces in a row, with an "X" or "O" on each space gives us &lt;em&gt;eight unique patterns&lt;/em&gt;. You will see this phenomenon in the next section, where I reduce each test taker to a "row" with three "factors" (i.e.&lt;em&gt;Tic Tac Toe&lt;/em&gt; spaces) that can be "on" or "off" (i.e. an "X" or "O").&lt;/p&gt;
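&lt;p&gt;The counting above checks out arithmetically: three spaces with two symbols each gives 2 x 2 x 2 = 8 patterns. A quick Python sketch (illustrative only, not part of the original analysis) enumerates them:&lt;/p&gt;

```python
from itertools import product

# Each of the three spaces holds an "X" or an "O"
patterns = ["".join(row) for row in product("XO", repeat=3)]

print(patterns)       # ['XXX', 'XXO', 'XOX', ..., 'OOO']
print(len(patterns))  # 2 * 2 * 2 = 8
```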
&lt;p&gt;I feed the 1 billion test responses into another computer program that reduces the "answer patterns" into three types (or "factors"). With this program, I reduced the possible number of unique test "answer patterns" from 1,130,000,000,000,000 to eight.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A Fun Game&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Are you still with me or are you bored? Here is a fun game. The "big data" program only identifies the patterns, I must name them.&lt;/p&gt;
&lt;p&gt;What would you name this spectrum?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Veg" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/veg.png"&gt;&lt;/p&gt;
&lt;p&gt;How about "The Vegetarian Spectrum"&lt;/p&gt;
&lt;p&gt;What about this one?&lt;/p&gt;
&lt;p&gt;&lt;img alt="Lib" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/lib.png"&gt;&lt;/p&gt;
&lt;p&gt;I'd name it "The Libertarian Spectrum"&lt;/p&gt;
&lt;p&gt;And finally, we have this one:&lt;/p&gt;
&lt;p&gt;&lt;img alt="NY" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/ny.png"&gt;&lt;/p&gt;
&lt;p&gt;I'd name it "New Yorker Spectrum"&lt;/p&gt;
&lt;p&gt;We call these three spectra &lt;strong&gt;independent&lt;/strong&gt;. If you're from New York, you might be a libertarian or a vegetarian, but one does not cause the other. (In reality, you might find correlation, but bear with me here. If you want specifics, then read my blog post about &lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;the Math&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;We can combine two of these spectra.  Let's combine, for example, the "Vegetarian" and "Libertarian" spectra. Then we have four types of people: "Libertarian Vegetarians," "Libertarian Meat Eaters," "Big Government Vegetarians" and "Big Government Meat Eaters."  Take a look at the chart below.&lt;/p&gt;
&lt;p&gt;&lt;img alt="2d" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/2d.png"&gt;&lt;/p&gt;
&lt;p&gt;We can still have a pretty picture in three dimensions if we overlay the third spectrum on the graph:&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/3d.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Name the Groups&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Enough of the example spectra; let's look at the real data.  I obtained the data from actual test takers on Facebook.  When I crunch the numbers on the test takers, machine learning outputs three groups, or &lt;strong&gt;factors&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor #1:&lt;/strong&gt; &lt;strong&gt;Drug&lt;/strong&gt; and &lt;strong&gt;alcohol&lt;/strong&gt; users who have been &lt;strong&gt;naked in public&lt;/strong&gt;, &lt;strong&gt;distrust cops&lt;/strong&gt; and &lt;strong&gt;do not believe&lt;/strong&gt; strongly in &lt;strong&gt;God&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor #2:&lt;/strong&gt; Leaders that &lt;strong&gt;champion others&lt;/strong&gt;, would &lt;strong&gt;martyr&lt;/strong&gt; themselves for an idea and believe they can &lt;strong&gt;learn from all setbacks&lt;/strong&gt;.  &lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor #3:&lt;/strong&gt; Strongly believe &lt;strong&gt;men and women should be attractive&lt;/strong&gt;, judge people on their &lt;strong&gt;fashion&lt;/strong&gt; and think in &lt;strong&gt;concrete, all or nothing terms&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;What would you name these factors?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Plot the Data&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I named #1 "Hellraisers", #2 "Boy Scouts" and #3 "Celebrities."&lt;/p&gt;
&lt;p&gt;I graphed the factors against each other.&lt;/p&gt;
&lt;p&gt;Take a look at Hellraiser vs. Boy Scout. If this graph confuses you, take a look at some of New York magazine's &lt;a href="https://nymag.com/nymag/culture/approvalmatrix/archive/"&gt;Approval Matrices&lt;/a&gt;. You will quickly get an idea of how the matrices work.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HR vs BS" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/hr_v_bs-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;Look at the "left/right" axis above. You can see, all the way to the right, that positive attitudes towards &lt;strong&gt;drinking&lt;/strong&gt; and &lt;strong&gt;drugs&lt;/strong&gt; define a test taker as a "Hellraiser," and negative attitudes (all the way to the left) about &lt;strong&gt;trusting cops&lt;/strong&gt; and &lt;strong&gt;belief in God&lt;/strong&gt; also define them as a "Hellraiser." On the "up/down" (Boy Scout) axis, &lt;strong&gt;championing others&lt;/strong&gt; and &lt;strong&gt;learning from failures&lt;/strong&gt; define the Boy Scout. Notice how "&lt;strong&gt;Champion&lt;/strong&gt;," high up on the "up/down" axis, falls right near the middle of the "left/right" axis. So attitudes about "&lt;strong&gt;championing others&lt;/strong&gt;" weigh into the "Boy Scout" judgement but do little for the "Hellraiser" judgement. Likewise, "&lt;strong&gt;drugs&lt;/strong&gt;" greatly defines the "Hellraiser" but does not weigh either way on the "Boy Scout" factor (look at how close it lies to the horizontal dotted line).&lt;/p&gt;
&lt;p&gt;On the "Hellraiser" vs. "Celebrity" scale, most of the "Hellraiser" qualities do not provide much weight to the "Celebrity" judgement. You can see that most of the Hellraiser factors lie close to the zero point (i.e. the dotted horizontal line) for the "Celebrity" spectrum. &lt;strong&gt;Pretty Men&lt;/strong&gt;, and &lt;strong&gt;Pretty Women&lt;/strong&gt;, however, hugely weigh in on the celebrity scale.&lt;/p&gt;
&lt;p&gt;&lt;img alt="HR vs C" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/hr_v_c-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;We witness the same distribution on "Boy Scout" vs. "Celebrity." Most "Boy Scout" traits do not weigh heavily on the "Celebrity" decision, and vice versa. You can see this by noticing how closely the words align with the x and y axes.&lt;/p&gt;
&lt;p&gt;&lt;img alt="BS vs C" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/bs_v_c-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;If you're interested, you can see the three factors plotted on a three-dimensional graph. The graph below shows the actual placements, based on the real-life test takers.&lt;/p&gt;
&lt;p&gt;&lt;img alt="3d-1" src="https://john.soban.ski/images/Big_Data_Idol_How_I_Crunched_the_Numbers/3d-1-1024x794.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Get the Results&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you take the test, the machine learning looks at your answers and matches them to the &lt;strong&gt;patterns&lt;/strong&gt; for each of the &lt;strong&gt;factors&lt;/strong&gt;. It places you on the &lt;strong&gt;spectrum&lt;/strong&gt; for each factor and, based on that placement, classifies you into one of the eight categories. Each factor &lt;strong&gt;switches on&lt;/strong&gt; or off depending on how you compare to the rest of the test takers. So, as more people take the test, the factors, weights, placements and classifications change.&lt;/p&gt;
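&lt;p&gt;To make the "switch on" idea concrete, here is a toy R sketch. The answers and weights below are invented for illustration (the production pipeline differs); the point is that a factor flips on only relative to the other test takers:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;# Toy data: 100 test takers, 3 questions, answers on a 1-5 scale
answers &amp;lt;- matrix(sample(1:5, 300, replace=TRUE), nrow=100)

# Invented weights for one factor, e.g. "Hellraiser"
weights &amp;lt;- c(0.8, 0.7, -0.6)

# Project every test taker onto the factor's spectrum
scores &amp;lt;- answers %*% weights

# The factor switches "on" relative to the rest of the test takers
hellraiser &amp;lt;- scores &amp;gt; median(scores)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Run the same comparison for all three factors and each test taker lands in one of the eight categories.&lt;/p&gt;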
&lt;p&gt;If you made it this far, you now have a solid understanding of &lt;a href="https://en.wikipedia.org/wiki/Exploratory_factor_analysis"&gt;Exploratory Factor Analysis (EFA)&lt;/a&gt;.  If you want more detail, then check out the &lt;a href="https://john.soban.ski/big-data-idol-the-math.html"&gt;math&lt;/a&gt;.&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="Data Science"></category></entry><entry><title>Let us now praise ugly code!</title><link href="https://john.soban.ski/let-us-now-praise-ugly-code.html" rel="alternate"></link><published>2017-07-15T01:18:00-04:00</published><updated>2017-07-15T01:18:00-04:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-07-15:/let-us-now-praise-ugly-code.html</id><summary type="html">&lt;p&gt;In this blog post I will revisit the first piece of code I wrote with the &lt;a href="https://www.r-project.org/about.html"&gt;R Programming language&lt;/a&gt;, back in the early part of this decade.&lt;/p&gt;
&lt;p&gt;Coming from an &lt;a href="https://octave.org/"&gt;Octave&lt;/a&gt;/&lt;a href="https://www.mathworks.com/products/matlab.html"&gt;MATLAB&lt;/a&gt; background, I really enjoyed the &lt;a href="https://en.wikipedia.org/wiki/Functional_programming"&gt;functional&lt;/a&gt; nature of R. I imagined flinging vectors into Matrices, collapsing them with …&lt;/p&gt;</summary><content type="html">&lt;p&gt;In this blog post I will revisit the first piece of code I wrote with the &lt;a href="https://www.r-project.org/about.html"&gt;R Programming language&lt;/a&gt;, back in the early part of this decade.&lt;/p&gt;
&lt;p&gt;Coming from an &lt;a href="https://octave.org/"&gt;Octave&lt;/a&gt;/&lt;a href="https://www.mathworks.com/products/matlab.html"&gt;MATLAB&lt;/a&gt; background, I really enjoyed the &lt;a href="https://en.wikipedia.org/wiki/Functional_programming"&gt;functional&lt;/a&gt; nature of R. I imagined flinging vectors into matrices and collapsing them with dot products, Tetris-like. I refused to write a single for loop... I framed everything as functions and maps. As I gained experience with R, I found &lt;a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"&gt;pipes&lt;/a&gt; and &lt;a href="https://cran.r-project.org/web/packages/tidyr/index.html"&gt;data wrangling&lt;/a&gt; libraries, but early on, my code was pretty ugly, as you will see shortly.&lt;/p&gt;
&lt;p&gt;I have a project that keeps track of comic books, their publishers, their prices and their customers. The model stores data in Excel and, to keep things readable, I use a columnar store. In this way, I can quickly add new entries to the table by adding columns.  Each column has an arbitrary number of rows. I know this might not be the best way to store data, but bear with me here. This blog looks at how we process that data, not how we store it. Besides, in the real world, you sometimes have no choice but to start with ugly data.&lt;/p&gt;
&lt;h3&gt;The Ugly Way...&lt;/h3&gt;
&lt;p&gt;Let us proceed. First, take a look at Titles:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Batman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Superman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Captain_Marvel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                 &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Youngblood&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spawn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                  &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spiderman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Iron_Man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Cable&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                   &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Doctor_Strange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                          &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;

&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;                     &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;                           &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Notice that a rotation doesn't really buy us anything. Instead of an arbitrary number of rows for each entry, a rotation gets us an arbitrary number of columns.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;t&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;
&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Batman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Superman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Captain_Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;
&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Youngblood&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spawn&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;
&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spiderman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Iron_Man&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Cable&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Doctor_Strange&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When I process &lt;strong&gt;&lt;em&gt;Titles.orig&lt;/em&gt;&lt;/strong&gt; in R, I first transform it to a key-value store. My approach relies on data frame index logic (commands inside the [] brackets).&lt;/p&gt;
&lt;p&gt;In my original approach, I create two vectors: one that repeats the column names several times, and another that un-packs (unlists) the data. When I put them together, I get key-value pairs (with some empties).&lt;/p&gt;
&lt;p&gt;My first vector repeats each column name &lt;em&gt;n&lt;/em&gt; times, with &lt;em&gt;n&lt;/em&gt; being the number of rows. Since the data frame has four rows, I repeat each column name four times. I first try the &lt;strong&gt;&lt;em&gt;rep()&lt;/em&gt;&lt;/strong&gt; function.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This attempt fails. I want it in the form: 'DC, DC, DC, DC, Image, Image,' etc.&lt;/p&gt;
&lt;p&gt;After a few Google searches, I find that &lt;strong&gt;&lt;em&gt;matrix()&lt;/em&gt;&lt;/strong&gt; allows us to stack rows, so I stuff the repeat statement into &lt;strong&gt;&lt;em&gt;matrix()&lt;/em&gt;&lt;/strong&gt;:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Close, but not quite what I need. I then add the &lt;em&gt;byrow&lt;/em&gt; flag:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;[,&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;From here, we convert to a vector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;DC&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Image&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Marvel&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, &lt;strong&gt;&lt;em&gt;as.vector()&lt;/em&gt;&lt;/strong&gt; works "down the columns" by default (which makes sense, since columns are vectors).&lt;/p&gt;
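&lt;p&gt;(As an aside, &lt;strong&gt;&lt;em&gt;rep()&lt;/em&gt;&lt;/strong&gt; accepts an &lt;em&gt;each&lt;/em&gt; argument that produces this ordering directly, with no matrix detour:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; rep(names(Titles), each=nrow(Titles))

 [1] "DC"     "DC"     "DC"     "DC"     "Image"  "Image"  "Image"  "Image"  "Marvel" "Marvel"
[11] "Marvel" "Marvel"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The ugly way, however, presses on.)&lt;/p&gt;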
&lt;p&gt;Let's move past the titles. To create a vector from our data, we need to &lt;strong&gt;&lt;em&gt;unlist()&lt;/em&gt;&lt;/strong&gt; the data first and then vectorize it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Batman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Superman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Captain_Marvel&amp;quot;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Youngblood&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;
&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spawn&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Spiderman&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Iron_Man&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Cable&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Doctor_Strange&amp;quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I bind these two vectors together as columns and then create a data frame.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;rep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                                             &lt;/span&gt;&lt;span class="n"&gt;nrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;nrow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;byrow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;T&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="w"&gt;                            &lt;/span&gt;&lt;span class="nf"&gt;as.vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;unlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;

&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;X1&lt;/span&gt;&lt;span class="w"&gt;             &lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;               &lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I then name the columns:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;publisher&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;title&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then I remove the empty rows. A lot of my early code follows this convention: I subset a data frame with index logic, using a comma to separate the row selector from the column selector. In the line below, I keep only the rows with a non-empty title and return all columns. This syntax can be confusing, since I reference the data frame &lt;strong&gt;&lt;em&gt;Titles&lt;/em&gt;&lt;/strong&gt; in three separate places.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;which&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),]&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;
&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
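&lt;p&gt;As an aside, base R also provides the &lt;strong&gt;&lt;em&gt;subset&lt;/em&gt;&lt;/strong&gt; function, which evaluates the condition inside the data frame and saves us from typing &lt;strong&gt;&lt;em&gt;Titles&lt;/em&gt;&lt;/strong&gt; three times. A quick sketch of the equivalent call (not part of my original session):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; # keep only rows where title is non-empty; all columns survive
&amp;gt; Titles &amp;lt;- subset(Titles, title != "")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;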

&lt;h3&gt;The Pretty Way...&lt;/h3&gt;
&lt;p&gt;Let's recap. We waded through nested hell to transform the columnar table into a key-value table, and then needed two more commands to name the data frame columns and remove the empty rows.&lt;/p&gt;
&lt;p&gt;With pipes (&lt;strong&gt;&lt;em&gt;dplyr&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;magrittr&lt;/em&gt;&lt;/strong&gt;) and &lt;strong&gt;&lt;em&gt;tidyr&lt;/em&gt;&lt;/strong&gt;, we can produce the &lt;strong&gt;&lt;em&gt;same result&lt;/em&gt;&lt;/strong&gt; with a &lt;strong&gt;&lt;em&gt;single piped expression&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;dplyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;magrittr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;library&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;tidyr&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles.orig&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;nzchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;DC&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Marvel&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To pipe a variable through a chain and then assign the result back to that same variable, we use the &lt;strong&gt;&lt;em&gt;%&amp;lt;&gt;%&lt;/em&gt;&lt;/strong&gt; compound assignment pipe from &lt;strong&gt;&lt;em&gt;magrittr&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Titles&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;lt;&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;publisher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;nzchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
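&lt;p&gt;Under the hood, the compound pipe just combines assignment with the regular pipe. The statement above behaves the same as this sketch:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; # equivalent long form of Titles %&amp;lt;&gt;% ...
&amp;gt; Titles &amp;lt;- Titles %&amp;gt;%
    gather(publisher, title) %&amp;gt;%
    filter(nzchar(title))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;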

&lt;h3&gt;More Pretty Code&lt;/h3&gt;
&lt;p&gt;Now we have a separate table of customers. This table follows a more traditional wide layout, with one column per customer, and we can add columns and rows as we see fit.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Batman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Superman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Captain_Marvel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Youngblood&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spawn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spiderman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Iron_Man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Cable&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                    &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Doctor_Strange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="w"&gt;                        &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;

&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
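&lt;p&gt;For example, if a new customer signs up (I use the hypothetical name &lt;strong&gt;&lt;em&gt;Nesmith&lt;/em&gt;&lt;/strong&gt; here for illustration), I can bolt on a column with a single assignment:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;&amp;gt; # one quantity per title (nine rows), in row order
&amp;gt; Customers$Nesmith &amp;lt;- c(1,0,2,0,0,1,0,0,3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;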

&lt;p&gt;Let's try the gather function on this table to see what we get. We want each row to contain the comic title, the customer name, and the quantity they want to purchase.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;suppressWarnings&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;              &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;As you can see, this is not what we want; with no column specification, &lt;strong&gt;&lt;em&gt;gather&lt;/em&gt;&lt;/strong&gt; collapses every column, including &lt;strong&gt;&lt;em&gt;title&lt;/em&gt;&lt;/strong&gt;. For correct results, we specify a start and end column:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This code has a drawback: I need to refactor it each time I add a new customer.&lt;/p&gt;
&lt;p&gt;To future-proof the code, we modify it as follows:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
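&lt;p&gt;Side note: tidyr 1.0 soft-deprecated &lt;strong&gt;&lt;em&gt;gather()&lt;/em&gt;&lt;/strong&gt; in favor of &lt;strong&gt;&lt;em&gt;pivot_longer()&lt;/em&gt;&lt;/strong&gt;. If you run a recent tidyr, a sketch of the equivalent call looks like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Customers %&amp;gt;%
    # Pivot every column except title into customer/qty pairs
    pivot_longer(-title, names_to = &amp;#39;customer&amp;#39;, values_to = &amp;#39;qty&amp;#39;) %&amp;gt;%
    head(12)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;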

&lt;p&gt;In a separate table I have prices for each title.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;data.frame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Batman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Superman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Captain_Marvel&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Youngblood&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spawn&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Spiderman&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Iron_Man&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Cable&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                                &lt;/span&gt;&lt;span class="s"&gt;&amp;#39;Doctor_Strange&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;                              &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;                    &lt;/span&gt;&lt;span class="n"&gt;stringsAsFactors&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;FALSE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;

&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
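&lt;p&gt;If you find the parallel &lt;strong&gt;&lt;em&gt;c()&lt;/em&gt;&lt;/strong&gt; vectors hard to eyeball, &lt;strong&gt;&lt;em&gt;tribble()&lt;/em&gt;&lt;/strong&gt; from the tibble package builds the same table row by row (note that it returns a tibble rather than a base data frame, which the rest of the pipeline handles fine):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Price &amp;lt;- tribble(
    ~title,            ~price,
    &amp;#39;Batman&amp;#39;,          1.95,
    &amp;#39;Superman&amp;#39;,        1.95,
    &amp;#39;Captain_Marvel&amp;#39;,  2.95,
    &amp;#39;Youngblood&amp;#39;,      2.95,
    &amp;#39;Spawn&amp;#39;,           1.75,
    &amp;#39;Spiderman&amp;#39;,       1.75,
    &amp;#39;Iron_Man&amp;#39;,        3.95,
    &amp;#39;Cable&amp;#39;,           3.95,
    &amp;#39;Doctor_Strange&amp;#39;,  1.95
  )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;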

&lt;p&gt;We can easily add a price column to Customers with the &lt;strong&gt;&lt;em&gt;merge()&lt;/em&gt;&lt;/strong&gt; function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Doctor_Strange&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Iron_Man&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Spawn&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="n"&gt;Spiderman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.75&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;Superman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Youngblood&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
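&lt;p&gt;Notice that &lt;strong&gt;&lt;em&gt;merge()&lt;/em&gt;&lt;/strong&gt; sorts its result by the join column, which is why the titles now appear alphabetically. If you want the same join while preserving the original row order, dplyr's &lt;strong&gt;&lt;em&gt;left_join()&lt;/em&gt;&lt;/strong&gt; does the trick:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Customers %&amp;gt;%
    # Same join on title, but rows keep their original order
    left_join(Price, by = &amp;#39;title&amp;#39;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;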

&lt;h3&gt;&lt;strong&gt;Pretty Showdown:  Hard vs. Easy&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;How do we find per-customer totals? I'll show a hard way and an easy way. Let's look at the &lt;strong&gt;pipe/dplyr/tidyr&lt;/strong&gt; method first.&lt;/p&gt;
&lt;p&gt;First, we narrow the table and merge it with Price:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, we add a fifth column that calculates the subtotal:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3.90&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3.90&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;Batman&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;1.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;9.75&lt;/span&gt;
&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;7.90&lt;/span&gt;
&lt;span class="m"&gt;7&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;
&lt;span class="m"&gt;8&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="n"&gt;Cable&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;3.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;span class="m"&gt;9&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;span class="m"&gt;11&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;
&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Captain_Marvel&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;2.95&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="m"&gt;0.00&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then, we sum the subtotal for each customer. We can achieve this with ease using the &lt;strong&gt;&lt;em&gt;group_by()&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;summarize()&lt;/em&gt;&lt;/strong&gt; functions:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="nf"&gt;ncol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;mutate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# A tibble: 4 x 2&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;chr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;dbl&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;42.45&lt;/span&gt;
&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;13.75&lt;/span&gt;
&lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;30.95&lt;/span&gt;
&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="m"&gt;17.50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
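&lt;p&gt;By default the summary column inherits the unwieldy name &lt;strong&gt;&lt;em&gt;sum(subtotal)&lt;/em&gt;&lt;/strong&gt;; naming it inside &lt;strong&gt;&lt;em&gt;summarize()&lt;/em&gt;&lt;/strong&gt; keeps downstream code cleaner:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt; Customers %&amp;gt;%
    gather(customer, qty, 2:ncol(Customers)) %&amp;gt;%
    merge(Price) %&amp;gt;%
    mutate(subtotal = qty * price) %&amp;gt;%
    group_by(customer) %&amp;gt;%
    # Name the result so later code can refer to it as total
    summarize(total = sum(subtotal))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;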

&lt;p&gt;POP quiz... did we just execute the hard or easy method to find the totals? I will show you the easy way next and you can decide for yourself. In short, we can solve this problem with simple linear algebra.&lt;/p&gt;
&lt;p&gt;We first create our vector:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then our matrix:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We do a simple dot product and we're done:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;13.75&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30.95&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;17.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;42.45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We could also do it in one line:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="o"&gt;$&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%*%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Customers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;%&amp;gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nf"&gt;as.matrix&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="n"&gt;Micky&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Mike&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Peter&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Davy&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;13.75&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;30.95&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;17.5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;42.45&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
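&lt;p&gt;To see the arithmetic that the &lt;code&gt;%*%&lt;/code&gt; operator performs, here is a minimal sketch in plain Python. Note that the prices and quantities below are made up for illustration; they are not the &lt;code&gt;Price&lt;/code&gt; and &lt;code&gt;Customers&lt;/code&gt; tables from this post.&lt;/p&gt;

```python
# Minimal sketch of the vector-matrix dot product shown above in R,
# written in plain Python.  NOTE: these prices and quantities are
# hypothetical -- they are not the Price/Customers tables from the post.
prices = [2.50, 4.00, 1.25]  # price of each menu item

# One row per item, one column per customer (Micky, Mike, Peter, Davy):
# quantities[i][j] = how many of item i customer j ordered.
quantities = [
    [1, 2, 0, 3],
    [2, 0, 1, 1],
    [0, 4, 2, 1],
]

customers = ["Micky", "Mike", "Peter", "Davy"]

# The dot product x %*% A: total_j = sum over i of price_i * quantity_ij.
totals = {
    name: sum(p * row[j] for p, row in zip(prices, quantities))
    for j, name in enumerate(customers)
}
print(totals)  # {'Micky': 10.5, 'Mike': 10.0, 'Peter': 6.5, 'Davy': 12.75}
```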

&lt;p&gt;My Octave/MATLAB experience led me to use linear algebra right out of the gate. Sometimes, even in the face of fancy new functions, it turns out I produce beautiful code on the first try.&lt;/p&gt;</content><category term="Data Science"></category><category term="R Programming"></category><category term="Data Science"></category></entry><entry><title>Why A "Big Data" Personality Test?</title><link href="https://john.soban.ski/why-a-big-data-personality-test.html" rel="alternate"></link><published>2017-02-18T02:10:00-05:00</published><updated>2017-02-18T02:10:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2017-02-18:/why-a-big-data-personality-test.html</id><summary type="html">&lt;p&gt;Why do we need yet another personality test?  Because, without "big data" technologies, online "personality tests" suffer these problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With most tests, we quickly see a pattern to the answers, and can easily steer the test to the outcome we want&lt;ul&gt;
&lt;li&gt;Do we really learn anything when we "cheat" this …&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;</summary><content type="html">&lt;p&gt;Why do we need yet another personality test?  Because, without "big data" technologies, online "personality tests" suffer these problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With most tests, we quickly see a pattern to the answers, and can easily steer the test to the outcome we want&lt;ul&gt;
&lt;li&gt;Do we really learn anything when we "cheat" this way?  (No)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Test writers like to invent their own cryptic definitions for commonly understood words&lt;ul&gt;
&lt;li&gt;See various definitions for Extrovert, Intuition, Conscientious, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The math to calculate the results must be kept simple (weighted sums), to reduce server overhead&lt;/li&gt;
&lt;li&gt;To keep in line with the simple math, a lot of these tests force an unnatural, robotic answer format&lt;ul&gt;
&lt;li&gt;Do I strongly or highly agree with the statement they just presented to me?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Freshlex, LLC presents a "big data" personality test, for charity.  Our test differs from traditional tests:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ours is fun!  A combination of serious, thought-provoking, and silly questions&lt;/li&gt;
&lt;li&gt;Random, arbitrary questions&lt;ul&gt;
&lt;li&gt;We won't see a pattern until enough people take the test and we crunch the numbers&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Simple "yes" or "no" answers... no "somewhat/ slightly/ strongly agree" garbage&lt;/li&gt;
&lt;li&gt;Since we do the "heavy lifting" on the back end, we can keep it fun and simple for the user&lt;/li&gt;
&lt;li&gt;Ours goes to charity&lt;ul&gt;
&lt;li&gt;We will be open about the expenses involved in this voyage.  If by some miracle we figure out how to make money off this, we'll &lt;a href="http://rmhcbaltimore.org/"&gt;donate it to charity&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thank you for reading this!&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="Data Science"></category></entry><entry><title>Facebook Ads Nets Me 7,000+ Clicks in 48 Hours</title><link href="https://john.soban.ski/facebook-ads-nets-me-7000-clicks-in-48-hours.html" rel="alternate"></link><published>2016-01-28T02:04:00-05:00</published><updated>2016-01-28T02:04:00-05:00</updated><author><name>john-sobanski</name></author><id>tag:john.soban.ski,2016-01-28:/facebook-ads-nets-me-7000-clicks-in-48-hours.html</id><summary type="html">&lt;p&gt;&lt;img alt="Kibana" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/Visualize-Kibana-4-1024x517.png"&gt;&lt;/p&gt;
&lt;p&gt;For my (post) masters project on machine learning and &lt;a href="https://john.soban.ski/part-1-connect-ec2-to-the-amazon-elasticsearch-service.html"&gt;big data infrastructure&lt;/a&gt; I thought it would be fun to acquire my own data set.  Last semester I traded available services and architected a scalable (big data) Internet facing survey infrastructure using a combination of &lt;a href="https://aws.amazon.com/"&gt;Amazon Web Services&lt;/a&gt; and &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt;.  I …&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;img alt="Kibana" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/Visualize-Kibana-4-1024x517.png"&gt;&lt;/p&gt;
&lt;p&gt;For my (post) master's project on machine learning and &lt;a href="https://john.soban.ski/part-1-connect-ec2-to-the-amazon-elasticsearch-service.html"&gt;big data infrastructure&lt;/a&gt;, I thought it would be fun to acquire my own data set.  Last semester I traded available services and architected a scalable (big data) Internet-facing survey infrastructure using a combination of &lt;a href="https://aws.amazon.com/"&gt;Amazon Web Services&lt;/a&gt; and &lt;a href="https://www.python.org/"&gt;Python&lt;/a&gt;.  I had my friends send me (sometimes silly) survey questions and put them on my site.  After an initial flurry of "friends of friends" taking the survey, my traffic died down to zero.&lt;/p&gt;
&lt;p&gt;I then decided to pull the trigger on web advertising and started by Google searching "how do I advertise on the Internet."&lt;/p&gt;
&lt;p&gt;I signed up for four of the most attractive options.  I won't bore you with the also-rans (although I should mention one Advertising service turned down my cold hard cash because my site "does not meet [their] partner ad networks' strict policies regarding site content, structure and navigability") since, as far as I can tell, Facebook's advertising service delivers the platonic ideal of perfection.  I will try really, really hard not to use cliches such as "knocks my socks off," "blows me away" or "combines to form Predaking and destroy the Autobots" when I communicate how well this service works.&lt;/p&gt;
&lt;p&gt;I'm sure if I (devote my nights and weekends and) master the art of Search Engine Optimization, A/B testing, and AdWords I would find fault with Facebook's service, but that is my point.  Facebook gave me, someone who has never advertised on the Web before, a Fisher-Price interface to quickly deploy a killer campaign with no thought or skill.  One of the other services I tried, in contrast, required me to install a "pixel" on my site, which is simple enough for me (once I figured out how to make it play nice with my &lt;a href="https://getbootstrap.com/"&gt;Bootstrapped&lt;/a&gt; &lt;a href="https://palletsprojects.com/p/jinja/"&gt;Jinja2&lt;/a&gt; template), but I could never see Al (from Al's auto-parts) pull that one off.  With Facebook's service, I could see a caveman mashing his keyboard with a club deploy an effective campaign.  I think that is why I like Facebook's service so much.  Any "Mom and Pop" could figure it out, and Facebook is smart enough to keep things moving to their sale.  Click, click, click, and close.  Genius.  The competitors put up huge roadblocks that required me to Stack-Exchange some questions, whereas Facebook takes the attitude of "Don't know what this means?  Don't worry about it, you can tune that knob later, let's move on..."&lt;/p&gt;
&lt;p&gt;To summarize the experience, you simply click "Create Ad," pick some demographics (they had one for "people who take personality tests"), create a "Call to Action" (a slogan to get people to click), and then upload a JPEG.  I didn't have any graphic art, so I Googled "Public Domain Cartoon Brain," uploaded that, and I was done.  The results?  86k+ impressions, 7k clicks to my site, and 4.75k of those taking the survey!  All for just $427.  I will let this run for a few more days, just to see what I can get.&lt;/p&gt;
&lt;p&gt;A quick note.  Of the 4.75k test takers, I had two skeptics who think I'm a front for some political entity or some other nefarious institution.  I assure you that I am putting together this data set for my (post) master's college project.  Also, &lt;a href="https://john.soban.ski/why-a-big-data-personality-test.html"&gt;as I said before&lt;/a&gt;, if by some miracle I make any money off this, I will donate it to charity!  For the time being, with the infrastructure and advertising, I'm a couple grand in the hole and foresee a couple grand more.  I am paying for this out of my own pocket.  Where do I get the money, you ask?  Easy... by selling my childhood on eBay!&lt;/p&gt;
&lt;p&gt;&lt;img alt="Selling Childhood Transformers" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/Screenshot_2016-01-25-15-30-50-640x1024.jpg"&gt;
 &lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;UPDATE:  I posted this blog entry at 2:04 AM yesterday.  Since then, the market opened and Facebook stock went up 15.5%.  I take full credit for this jump:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Stock Bump" src="https://john.soban.ski/images/Facebook_Ads_Nets_Me_7000_Clicks_in_48_Hours/fb-1024x956.png"&gt;&lt;/p&gt;</content><category term="Data Science"></category><category term="Big Data Personality Test"></category><category term="Data Science"></category></entry></feed>