AI-powered small business survival intelligence for Chicago's immigrant commercial corridors.
Built at WildHacks 2025.
Chicago has lost tens of thousands of small businesses over the past decade. The ones that close are rarely the ones that deserved to. They are family-owned restaurants, corner grocers, tailor shops and bakeries that hold neighborhoods together, and they close not because their owners gave up, but because nobody told them the warning signs were there until it was too late. The city has the data. Business license renewals, health inspection records, neighborhood income shifts, Google review velocity, rent pressure by ZIP code. None of it had ever been assembled into something a small business owner could actually use. We built Oasis because a shop owner on Devon Avenue or in Pilsen deserves the same quality of risk intelligence that a commercial real estate firm uses to evaluate a property portfolio, and they deserve it in their own language, right now, for free.
Oasis is a small business survival platform for Chicago. It ingests data from six public and commercial sources, trains a machine learning classifier to predict which businesses will close within 12 months, and surfaces that intelligence on an interactive 3D map where every business appears as a color-coded pulsing marker: red for high closure risk, amber for watch, green for stable.
Clicking any business opens a sliding consultant panel that shows the business's 12-month survival score, the three specific factors most responsible for that score, and an AI-generated diagnosis with a concrete action plan tailored to that business's situation. The consultation is fully multilingual: English, Hindi, Tamil, Mandarin, and Spanish. Business owners can type their question or speak it directly using voice input, and the AI response is read back to them in their language through text-to-speech. At the end of a session they can download a formatted PDF report with their score, risk factors, and action items, with all section headings and content rendered in their selected language.
Corridor-level summaries give city planners and community organizations a neighborhood-by-neighborhood view: total businesses at risk, the dominant risk factor driving closures in that area, and a narrative headline for corridors like Devon Avenue, Little Village, Argyle Street, Pilsen, and Chinatown.
The pipeline starts with six ingestion scripts. Business license data comes from the Chicago Data Portal Socrata API (dataset r5kz-chrr), filtered to seven corridor ZIP codes covering Devon Avenue, Chinatown, Little Village, Uptown, Pilsen, Albany Park, and West Town. We pull both active licenses (status AAC) and inactive ones (INQ, INAI, CANC, REV) and sample 500 of each to create a balanced training set. The Socrata API does not return coordinates in license records, so we built a separate geocoding script that batches business IDs in groups of 50, queries the same Socrata endpoint for latitude, longitude, ZIP, and neighborhood, and writes the results back into the feature CSV. Of the roughly 1,000 license records we start with, around 570 end up with clean mappable coordinates after deduplication.
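The labeling-and-sampling step described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the `license_status` column name and the `balanced_sample` helper are assumptions, and only the AAC-vs-everything-else split and the 500-per-class cap come from the source.

```python
import pandas as pd

# Statuses other than AAC (active) are treated as closed; up to 500 of each
# class are kept to form a balanced training set.
ACTIVE = {"AAC"}

def balanced_sample(licenses: pd.DataFrame, n_per_class: int = 500,
                    seed: int = 42) -> pd.DataFrame:
    closed = licenses[~licenses["license_status"].isin(ACTIVE)]
    active = licenses[licenses["license_status"].isin(ACTIVE)]
    # Cap at whichever class is smaller so the result stays balanced.
    k = min(n_per_class, len(closed), len(active))
    return pd.concat([
        closed.sample(k, random_state=seed),
        active.sample(k, random_state=seed),
    ]).reset_index(drop=True)
```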
Google Places data comes from the FindPlaceFromText endpoint using business name plus address as the query. We built a checkpointed parallel fetcher using Python's ThreadPoolExecutor with five workers to stay under the 10 QPS quota, with progress logged every 10 records and a CSV checkpoint written every 50 so a failed run can resume without re-paying for already-fetched records.
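The checkpointing-and-failure-safety pattern looks roughly like this. It is a sketch with a generic `fetch_one` callable standing in for the real Places request (all names here are hypothetical); the key property is that every input row yields exactly one output row, even on failure.

```python
import concurrent.futures as cf
import csv

# Every worker returns a dict with the same keys even when the request fails,
# so the output stays 1-to-1 with the input and downstream merges remain safe.
BASE = {"rating": None, "review_count": None, "price_level": None}

def safe_fetch(biz, fetch_one):
    try:
        return {**BASE, **fetch_one(biz)}
    except Exception:
        return dict(BASE)  # placeholder row; never shortens the output

def fetch_all(businesses, fetch_one, checkpoint="places_checkpoint.csv",
              workers=5, checkpoint_every=50):
    rows = []
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, which keeps rows aligned.
        for i, row in enumerate(pool.map(lambda b: safe_fetch(b, fetch_one),
                                         businesses), 1):
            rows.append(row)
            if i % checkpoint_every == 0:
                # Periodic CSV checkpoint so a failed run can resume.
                with open(checkpoint, "w", newline="") as f:
                    w = csv.DictWriter(f, fieldnames=list(BASE))
                    w.writeheader()
                    w.writerows(rows)
    return rows
```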
Census data comes from the ACS 5-Year Estimates API for 2018 and 2022. We pull four variables per ZIP: median household income (B19013), median gross rent (B25064), median home value (B25077), and a linguistic isolation count (B16002). The 2022 and 2018 values are then differenced as a percentage change to create delta features that capture whether a neighborhood is getting richer or poorer, and whether rent pressure is accelerating.
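The delta computation itself is simple; a sketch with illustrative column names (the `_2018`/`_2022` suffixes and `add_deltas` helper are assumptions, not the project's actual code):

```python
import pandas as pd

# Percentage change from the 2018 ACS vintage to the 2022 vintage for each
# ZIP-level variable, producing the delta features described above.
def add_deltas(df: pd.DataFrame,
               cols=("income", "rent", "home_value")) -> pd.DataFrame:
    out = df.copy()
    for c in cols:
        out[f"{c}_pct_change"] = (out[f"{c}_2022"] - out[f"{c}_2018"]) \
            / out[f"{c}_2018"] * 100
    return out
```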
The feature table joins all six sources into 15 model columns per business. License-derived features include days since the original license start date, a bucketed age category (under 1 year, 1 to 3, 3 to 5, over 5) to capture the nonlinear startup risk curve, a renewal count proxy, and a binary is_independent flag computed by counting how many businesses share the same DBA name. Google-derived features include the raw rating, total review count, price level, and a review velocity ratio computed as annualized reviews per day of operation. ZIP-level features include business density, median income, median rent, median home value, and the 2018-to-2022 percentage change for all three, plus the linguistic isolation score. Health inspection failures are aggregated at the ZIP level from Chicago CDPH records.
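Two of the derived features can be sketched in pandas. Column names here (`dba_name`, `review_count`, `days_open`) are assumptions for illustration; the logic follows the description above.

```python
import pandas as pd

def add_derived(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # A business counts as independent if its DBA name appears only once
    # across the dataset (i.e. it is not part of a chain).
    out["is_independent"] = (out.groupby("dba_name")["dba_name"]
                                .transform("size") == 1).astype(int)
    # Annualized review velocity: reviews per day of operation, scaled to a year.
    out["review_velocity"] = out["review_count"] / out["days_open"] * 365
    return out
```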
The target variable is binary: 1 if the license status is anything other than AAC, 0 if it is active. We use an XGBoost binary classifier and set scale_pos_weight to the ratio of negative to positive examples in the training set to handle class imbalance. Hyperparameters are tuned with RandomizedSearchCV over 20 iterations using 5-fold StratifiedKFold cross-validation, optimizing for ROC-AUC. The best model is evaluated on a held-out 20 percent test set. After training, a SHAP TreeExplainer is fit on the same model to generate per-prediction SHAP values, which identify the three features with the largest absolute impact on each individual business's score.
Two features were deliberately excluded from training even though they were present in the raw data: days_overdue_renewal and recent_closures_nearby. Both are computed from or correlated with the target label itself, which would have created data leakage and inflated apparent model performance.
Versioned artifacts (model.pkl, explainer.pkl, columns.pkl) are written to a timestamped directory and also copied to a latest artifacts folder so the inference path always loads the most recent trained version.
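The versioning step might look like this (directory layout and the `save_artifacts` helper are assumptions; only the timestamped-plus-latest scheme comes from the source):

```python
import pickle
import shutil
from datetime import datetime
from pathlib import Path

def save_artifacts(artifacts: dict, root: str = "artifacts") -> Path:
    # Write each artifact into a timestamped directory, then copy it into
    # "latest" so the inference path always loads the most recent version.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    versioned = Path(root) / stamp
    latest = Path(root) / "latest"
    versioned.mkdir(parents=True, exist_ok=True)
    latest.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        path = versioned / name
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        shutil.copy2(path, latest / name)
    return versioned
```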
The FastAPI server loads the full feature CSV at startup, runs all model predictions in a single vectorized batch call to XGBoost, and stores the results in a Python dict keyed by business ID. Every API response is served from that in-memory dict with O(1) lookup, so there is no per-request model inference cost.
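The precompute-and-cache pattern can be sketched with a stub predictor standing in for the XGBoost batch call (function names are hypothetical; in the real app this runs once at FastAPI startup and every endpoint reads from the dict):

```python
import pandas as pd

def build_score_cache(features: pd.DataFrame, predict_batch) -> dict:
    # One vectorized prediction over every row, done once at startup.
    probs = predict_batch(features.drop(columns=["business_id"]))
    return {bid: float(p) for bid, p in zip(features["business_id"], probs)}

def get_score(cache: dict, business_id: str):
    # O(1) dict lookup per request; no per-request model inference.
    return cache.get(business_id)
```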
The AI consultant endpoint builds a structured prompt with the business name, corridor, ZIP, survival score, risk level, and a semicolon-joined summary of the top SHAP driver labels, then sends it to Gemini 2.5 Flash with a strict JSON-only instruction. If Gemini fails or is unavailable, the request falls through to Groq Llama-3.1-8b-instant as a fallback. Both paths return the same response shape: a diagnosis paragraph, an array of 2 to 3 action objects (each with title, what, deadline, and contact), and a one-sentence summary. Responses for generic questions are cached per business and language so each business only costs tokens once per server session.
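The fallback-and-cache pattern reduces to a small loop over provider callables. This sketch uses stub callables in place of the real Gemini and Groq clients (all names hypothetical); the point is that both providers return the same shape, so the caller never knows which one answered.

```python
def consult(business_id, language, prompt, providers, cache):
    # Cache per (business, language) so each business costs tokens once.
    key = (business_id, language)
    if key in cache:
        return cache[key]
    last_err = None
    for call in providers:          # e.g. [call_gemini, call_groq]
        try:
            result = call(prompt)   # both must return the same JSON shape
            cache[key] = result
            return result
        except Exception as e:
            last_err = e            # fall through to the next provider
    raise RuntimeError("all providers failed") from last_err
```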
Voice transcription uses Groq Whisper (whisper-large-v3) via multipart form upload. TTS uses ElevenLabs with a different voice ID per language. The PDF report is built with ReportLab using NotoSans for Unicode coverage across Hindi, Tamil, Mandarin, Spanish, and English, with all section headings translated through a hardcoded lookup table for six languages.
The frontend is built in React with Vite and Tailwind. The landing page uses a Three.js starfield scene (7,000 instanced star particles) with a radial speed lines animation on entry. The main dashboard runs a persistent Three.js ambient layer behind the map: 600 instanced mesh particles that drift on sinusoidal paths, three orbital torus rings rotating at different speeds and tilts, a distorted sphere using MeshDistortMaterial, and a Sparkles component from drei, all lit by three colored point lights (emerald, cyan, violet). The map uses react-map-gl with Mapbox dark-v11 and adds a 3D fill-extrusion building layer with height-interpolated colors and atmospheric fog on load. The map viewport is masked with a radial gradient and four directional gradient fades so the edges dissolve into the dark background rather than having a hard border. The consultant panel slides in from the right with a spring animation from Framer Motion and shows a circular SVG progress ring for the survival score that animates from zero using a cubic ease curve.
The Socrata business license dataset does not include coordinates for most records. Chicago's open data portal stores latitude and longitude in a separate geographic layer tied to the same account numbers, but the schema is not well documented and the join logic is not obvious. We figured out that querying the same r5kz-chrr endpoint with an account_number IN clause and selecting latitude, longitude, zip_code, and neighborhood returns the most recent known location for each license. Batching 50 IDs per request kept us inside URL length limits while keeping the geocoding run under a few minutes.
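The batched query construction can be sketched like this. The endpoint and field names follow the r5kz-chrr description above; the `geocode_urls` helper itself is an illustration, not the project's actual script.

```python
from urllib.parse import urlencode

ENDPOINT = "https://data.cityofchicago.org/resource/r5kz-chrr.json"

def geocode_urls(account_numbers, batch_size=50):
    # Batching 50 IDs per request keeps each URL inside length limits.
    urls = []
    for i in range(0, len(account_numbers), batch_size):
        batch = account_numbers[i:i + batch_size]
        ids = ",".join(f"'{a}'" for a in batch)
        params = {
            "$select": "account_number,latitude,longitude,zip_code,neighborhood",
            "$where": f"account_number in({ids})",
        }
        urls.append(f"{ENDPOINT}?{urlencode(params)}")
    return urls
```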
Getting the survival score direction consistent across the entire stack was the most persistent bug. The model outputs a probability of closure, which we convert to a survival score where a low number means high risk. Keeping that inversion consistent between the Python computation, the FastAPI response, the React risk score ring, the map dot color thresholds, and the language in the AI prompt broke twice during the build when teammates were working on the same layer simultaneously and operating on different assumptions about which direction was which.
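The fix for that class of bug is to keep the inversion in exactly one function that every layer calls. A sketch (function names and the color thresholds are assumptions; the source only fixes the convention that low survival means high risk):

```python
def survival_score(p_close: float) -> int:
    """Model outputs P(closure); the survival score is the inverse, 0-100."""
    return round((1.0 - p_close) * 100)

def risk_band(score: int) -> str:
    # Single source of truth for the map dot colors and the panel copy.
    if score < 40:
        return "red"     # high closure risk
    if score < 70:
        return "amber"   # watch
    return "green"       # stable
```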
The Google Places fetcher started failing silently at scale because individual requests would time out without raising an exception, leaving NaN rows in the middle of the output. We fixed this by wrapping each thread's request in a try-except that returns a base dict with NaN values on any failure, which keeps the output length exactly 1-to-1 with the input length and makes downstream merges safe.
Prompt engineering for the multilingual consultant took more iteration than expected. Early versions occasionally returned markdown-fenced JSON, included extra commentary outside the JSON object, or produced inconsistent field names across languages. We added a regex stripping step for code fences and tightened the prompt to a one-line JSON schema with explicit field names and a hard instruction to return only valid JSON with no markdown.
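The fence-stripping step is a small regex pass before `json.loads`; a sketch (the `parse_llm_json` name is hypothetical):

```python
import json
import re

# Strip an optional leading ```json / ``` fence and a trailing ``` fence
# before parsing, so either provider's output becomes plain JSON.
FENCE = re.compile(r"^```(?:json)?\s*|\s*```$", re.MULTILINE)

def parse_llm_json(raw: str) -> dict:
    cleaned = FENCE.sub("", raw.strip()).strip()
    return json.loads(cleaned)
```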
ReportLab does not handle Unicode well with the default Helvetica font. Hindi, Tamil, and Mandarin section headings were rendering as blank or corrupt characters until we registered NotoSans-Regular and NotoSans-Bold as custom fonts and fell back to Helvetica only when the font files were not present on the system.
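The registration-with-fallback logic looks roughly like this (font file paths are assumptions; the function returns the regular and bold font names the report builder should use):

```python
import os

def register_report_fonts(font_dir: str = "fonts"):
    regular = os.path.join(font_dir, "NotoSans-Regular.ttf")
    bold = os.path.join(font_dir, "NotoSans-Bold.ttf")
    try:
        from reportlab.pdfbase import pdfmetrics
        from reportlab.pdfbase.ttfonts import TTFont
        if os.path.exists(regular) and os.path.exists(bold):
            # NotoSans covers Hindi, Tamil, and Mandarin glyphs.
            pdfmetrics.registerFont(TTFont("NotoSans", regular))
            pdfmetrics.registerFont(TTFont("NotoSans-Bold", bold))
            return "NotoSans", "NotoSans-Bold"
    except ImportError:
        pass
    # Latin-only fallback when the font files (or reportlab) are missing.
    return "Helvetica", "Helvetica-Bold"
```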
The full pipeline actually works end to end. Real Chicago business data flows through a trained XGBoost classifier, gets explained by SHAP values at the individual prediction level, drives a 3D map with hundreds of live markers, feeds a multilingual AI consultant backed by Gemini with a Groq fallback, and produces downloadable PDF reports rendered in six languages, all built within a hackathon timeline.
We are proud of the SHAP explainability layer specifically. The model does not just output a number. It tells the owner that their neighborhood's linguistic isolation score is the biggest factor, or that their annualized review velocity has dropped relative to their business age. That kind of specific, named signal is what makes the score actionable rather than just alarming.
The visual experience came together better than we expected. The 3D city with atmospheric fog, the dissolving map edges, the pulsing risk markers, and the glass consultant panel feel like something built by people who actually care about the communities being shown on that map.
The hardest part of a data product is not the model. It is the pipeline. Getting clean, joined, geocoded, deduplicated data that XGBoost can actually consume took longer than everything else combined. The model training itself, including hyperparameter search across 20 iterations with 5-fold CV, runs in a few minutes once the feature table is clean. The work to get there is everything.
SHAP values are more useful as a product feature than as a model debugging tool. Surfacing the top three SHAP contributors per business, labeled in plain language and with a direction, is what turns a single risk number into something a business owner can actually act on.
Keeping a multilingual AI prompt consistent across two different model providers requires a structured JSON instruction with an explicit schema, combined with regex cleanup on the response. That combination made Gemini and Groq interchangeable at the application layer without any response parsing changes.
We want to expand coverage to all active Chicago businesses, not just the 570 with clean coordinates in the current dataset. The geocoding pipeline is built and just needs to run at full scale. We want to add Google Calendar and lease expiration date inputs so the model can weight renewal window proximity more heavily when it is imminent. A neighborhood early warning feed for city aldermen and CDFI loan officers is on the roadmap, surfacing clusters of businesses crossing the high-risk threshold before an entire corridor tips. The longer-term vision is a white-label version of the platform that any city can deploy by pointing it at their own business license registry, census data, and inspection records and getting a live survival intelligence layer over their entire small business ecosystem, available to every owner for free.
| Layer | Stack |
|---|---|
| Backend | Python · FastAPI · Uvicorn |
| ML | XGBoost · SHAP · scikit-learn · pandas |
| AI / Voice | Google Gemini 2.5 Flash · Groq (Whisper + Llama) · ElevenLabs |
| PDF | ReportLab · fpdf2 · NotoSans |
| Frontend | React 18 · Vite · Tailwind CSS · Framer Motion · Three.js |
| Map | Mapbox GL JS · react-map-gl |
| Deployment | DigitalOcean (Docker Compose or App Platform) |
Create backend/.env by copying .env.example:
```
cp .env.example backend/.env
```

| Key | Where to get it |
|---|---|
| GEMINI_API_KEY | aistudio.google.com |
| OPENAI_API_KEY | platform.openai.com/api-keys |
| ELEVENLABS_API_KEY | elevenlabs.io → Profile → API Key |
| GROK_KEY | console.groq.com |
| CENSUS_API_KEY | api.census.gov/data/key_signup.html — free |
| MAPBOX_TOKEN | account.mapbox.com — free tier |
For Mapbox, create frontend/.env:
```
cp frontend/.env.example frontend/.env
# then add: VITE_MAPBOX_TOKEN=pk.eyJ1...
```

To run the backend:

```
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --port 8001
```

API docs: http://localhost:8001/docs
To run the frontend:

```
cd frontend
npm install
npm run dev
```

App: http://localhost:5173
To deploy with Docker:

```
cp backend/.env.example backend/.env  # fill in API keys
cp .env.example .env                  # fill in VITE_MAPBOX_TOKEN
docker compose up -d --build
```

Connect this repo to App Platform — it will detect .do/app.yaml automatically. Add secrets in the App Platform UI and it auto-deploys on every push to master.