Inspiration

I run a CRM consulting agency for home improvement businesses. Every day I watch outbound sales agents dial through lists of leads with no context, reading the same opening script over and over, hoping for a connection. The good agents build context themselves through tone, intuition, and follow-up notes. The average agent reads the script.

That gap between top-performing closers and everyone else isn't about effort. It's about information at the moment of the call. When an agent picks up the phone, they need three things: who this person is, how likely they are to convert, and what to lean into during the conversation. Most CRMs surface the first one. Almost none surface the other two.

Caller Companion is the public-data version of a briefing tool I've been building for my own agency's team. Forking the pattern onto a public banking dataset was a way to prove the product shape generalizes outside my own vertical.

What it does

Before every dial, the agent sees a one-screen brief:

  • A calibrated subscription probability for that customer
  • A confidence band (high, medium, low) so they know how much to trust the number
  • Two to three talking points translated from the model's reasoning into agent-readable language
  • The customer's profile, including age, occupation, prior campaign history, and contact preferences

The agent reads the brief, makes the call, and logs the outcome. The app shows them whether the model was right. Over a shift, agents calibrate their own instincts against the model's predictions and learn where to trust it and where to override it.

Each rep also has their own session history. Every completed shift logs date, calls made, model accuracy, and outcome breakdown. The pattern scales to teams. Same UI, different rep, separate history.

How we built it

The model. A LightGBM gradient-boosted classifier trained on the UCI Bank Marketing dataset: 41,188 customer records from a Portuguese bank's term deposit campaigns. Stratified 80/20 train/test split, 5-fold stratified cross-validation, class weighting for the 11% positive class. The most consequential decision: dropping the duration feature because it leaks post-call information. It correlates 0.405 with the target, but it's only known after the call ends. Including it would inflate every metric and produce a model that can't help an agent picking up the phone.
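The data preparation steps above can be sketched in plain Python. This is a minimal illustration on toy rows, not the project's training code: the helper names (`drop_leaky`, `stratified_split`, `positive_class_weight`) and the seed are assumptions, and the real pipeline presumably relied on library routines instead. The negative-to-positive ratio shown here is one common way to weight a rare class; the write-up does not say which weighting scheme was actually used.

```python
import random
from collections import defaultdict

# Features that are only known after the call ends (assumed set for this sketch).
LEAKY_FEATURES = {"duration"}


def drop_leaky(row):
    """Return a copy of the row without post-call features."""
    return {k: v for k, v in row.items() if k not in LEAKY_FEATURES}


def stratified_split(rows, label_key="y", test_frac=0.2, seed=42):
    """Split rows so each class keeps roughly the same share in train and test."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # shuffle a copy, leave the input untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def positive_class_weight(rows, label_key="y", positive="yes"):
    """Ratio of negatives to positives, a common weight for the rare class."""
    pos = sum(1 for r in rows if r[label_key] == positive)
    neg = len(rows) - pos
    return neg / pos


# Toy data: 11 positives out of 100, mirroring the roughly 11% positive rate.
rows = [
    {"age": 30 + i % 40, "duration": 100 + i, "y": "yes" if i < 11 else "no"}
    for i in range(100)
]
clean = [drop_leaky(r) for r in rows]
train, test = stratified_split(clean)
weight = positive_class_weight(train)
```

The point of dropping `duration` before the split is that no downstream step, including cross-validation, ever sees a value the agent would not have at dial time.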

The backend. FastAPI on Render. Loads the trained model on startup and pre-computes per-row SHAP-like feature contributions for the entire test set, so queue requests are constant-time. Three endpoints: customer queue, customer detail, outcome submission.
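The precompute-at-startup idea can be sketched without any web framework. This is a minimal sketch under stated assumptions: `ContributionCache`, `fake_explain`, and the customer ids are hypothetical names, and `fake_explain` is a stand-in for whatever routine produces the real per-row contributions.

```python
from typing import Callable, Dict


class ContributionCache:
    """Computes contributions once at startup, then serves them by dict lookup."""

    def __init__(self, rows: Dict[str, dict], explain: Callable[[dict], Dict[str, float]]):
        # All the expensive work happens here, before any request arrives.
        self._store = {cid: explain(row) for cid, row in rows.items()}

    def get(self, customer_id: str) -> Dict[str, float]:
        # A request is a single dictionary lookup, independent of model cost.
        return self._store[customer_id]


explain_calls = {"n": 0}  # counts how often the expensive routine runs


def fake_explain(row: dict) -> Dict[str, float]:
    """Stand-in for the real contribution routine; returns zeros per feature."""
    explain_calls["n"] += 1
    return {name: 0.0 for name in row}


test_rows = {
    "c1": {"age": 41, "job": "technician"},
    "c2": {"age": 29, "job": "student"},
}
cache = ContributionCache(test_rows, fake_explain)
calls_after_startup = explain_calls["n"]
detail = cache.get("c2")
calls_after_request = explain_calls["n"]
```

The trade-off is slower startup and more memory in exchange for predictable request latency, which suits a fixed test set that never changes while the service is running.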

The frontend. Next.js 14 with TypeScript and Tailwind, deployed to Vercel. Dark theme with a coral accent. State machine: identity picker, queue, active customer, outcome reveal, next. Identity persistence and session history in localStorage.

The talking points translator. A Python module that ranks each customer's feature contributions by magnitude, then translates the top two or three into agent-readable sentences. A hard override surfaces poutcome (previous campaign outcome) whenever it's success or failure, regardless of contribution rank. The agent always knows when they're calling a returning customer.
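A minimal sketch of that ranking-plus-override logic is shown here. The sentence templates, the `talking_points` function name, and the example contribution values are all hypothetical; the write-up does not show the real module's wording or numbers.

```python
from typing import Dict, List

# Hypothetical sentence templates, keyed by feature name.
TEMPLATES = {
    "age": "Age is pulling this prediction {direction}.",
    "job": "Occupation is pulling this prediction {direction}.",
    "contact": "Contact channel is pulling this prediction {direction}.",
    "month": "Time of year is pulling this prediction {direction}.",
}

# Hypothetical override lines for prior campaign outcome.
POUTCOME_LINES = {
    "success": "Returning customer: said yes in a previous campaign.",
    "failure": "Returning customer: said no in a previous campaign.",
}


def talking_points(contributions: Dict[str, float], customer: dict,
                   max_points: int = 3) -> List[str]:
    points: List[str] = []
    # Hard override: prior campaign outcome leads whenever it is set,
    # regardless of where its contribution ranks.
    outcome = customer.get("poutcome")
    if outcome in POUTCOME_LINES:
        points.append(POUTCOME_LINES[outcome])
    # Rank remaining features by absolute contribution, largest first.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    for feature, value in ranked:
        if len(points) >= max_points:
            break
        if feature == "poutcome" or feature not in TEMPLATES:
            continue
        direction = "up" if value > 0 else "down"
        points.append(TEMPLATES[feature].format(direction=direction))
    return points


example = {"age": 0.02, "job": -0.30, "contact": 0.10, "month": 0.05, "poutcome": 0.01}
returning = talking_points(example, {"poutcome": "success"})
first_time = talking_points(example, {"poutcome": "nonexistent"})
```

In this example `poutcome` has the smallest contribution, yet the returning-customer line still comes first in `returning`, which is the behavior the override is meant to guarantee.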

Challenges we ran into

Honest model performance. Without duration, ROC-AUC capped at 0.80. Four hyperparameter configurations during tuning all underperformed the conservative defaults. The dataset has a real ceiling and no amount of tuning was going to break it. I chose to report the honest number and explain why, rather than overfit my way to a flashier metric.

LightGBM ties the baseline on ROC-AUC. Logistic regression and LightGBM landed at 0.8008 each. The deciding factors were PR-AUC (0.4840 for LightGBM vs 0.4601 for logistic regression; higher is better) and Brier score (0.1312 vs 0.1616; lower is better). The product needs trustworthy probabilities, not just rankings, and LightGBM produces per-row feature contributions that translate cleanly into customer-specific talking points. The harder challenge was articulating why the simpler model would have been defensible too.
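A toy example shows how two models can tie on ranking and still differ on probability quality. The numbers below are made up for illustration and are not the project's results; `pairwise_auc` is a simple pair-counting version of ROC-AUC written for this sketch, and `brier_score` is the standard mean squared error between predicted probability and outcome.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def pairwise_auc(probs, outcomes):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the usual ROC-AUC definition.
    """
    positives = [p for p, y in zip(probs, outcomes) if y == 1]
    negatives = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


outcomes = [1, 0, 0, 0]
model_a = [0.6, 0.2, 0.1, 0.1]  # same ordering, probabilities close to reality
model_b = [0.9, 0.8, 0.7, 0.6]  # same ordering, probabilities far too high

auc_a, auc_b = pairwise_auc(model_a, outcomes), pairwise_auc(model_b, outcomes)
brier_a, brier_b = brier_score(model_a, outcomes), brier_score(model_b, outcomes)
```

Both toy models rank the one converter first, so their pairwise AUC is identical, but the Brier score for `model_b` is several times worse (about 0.375 against about 0.055). An agent reading "80%" off the screen for a customer who almost never converts is exactly the failure the Brier score catches and ROC-AUC does not.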

The rich-get-richer dynamic. Customers with poutcome == 'success' convert at 65%, but represent only 3.3% of the population. A naive model would point agents at the same returning customers over and over. The fix was a product-level intervention: surface poutcome explicitly in the talking points whenever it's set, so the agent knows when they're being pointed at a returning customer and can deprioritize.

Accomplishments that we're proud of

  • A working, deployed product. Live URL, real model, real predictions, end-to-end flow tested on phone and desktop.
  • A modeling story that's defensible top to bottom. Every feature decision has a reason. Every metric is honest. Every limitation is documented.
  • The talking points translator. Per-row feature contributions converted into sentences an actual sales rep could read and act on, with a hard override for prior campaign history.
  • The Responsible AI framing isn't an afterthought. The product is built around the principle that the agent stays in control. No auto-dial, no black box, no false certainty. Low-confidence predictions explicitly tell the agent to trust their own read.
  • Session history per rep, built to scale to teams without changing the UI.

What we learned

That the right model for a product isn't always the best model on paper. LightGBM and logistic regression tied on ROC-AUC. The choice came down to which one produced outputs the product layer could actually use. PR-AUC and Brier matter more than ROC-AUC when the agent is going to read a probability number off the screen and decide what to say next.

That the most important feature engineering decision is sometimes a feature deletion. Dropping duration made the model less accurate on paper and far more useful in deployment, because every remaining feature is known before the call starts.

That building the model is half the work. Translating its outputs into language a human can act on is the other half. The talking points translator was where the project went from "ML demo" to "actual product."


