Use this file as persistent context for LLMs, teammates, or future you. It explains the business domain, the hackathon goal, how datasets relate, and every column in the CSVs (verified against data_science/data/ headers and data_science/data/data_dictionary.csv).
Where to start
- Notebooks and CSV analysis (no Docker required): data_science/README.md
- Postgres + API + UI (Docker): web/README.md
Smadex is a mobile advertising platform, specifically a DSP (Demand-Side Platform). A DSP buys ad inventory on behalf of advertisers (brands that want installs, purchases, signups, etc.). When you see an ad inside a mobile app, a chain of systems decided in near real time whether to show it and which creative to show.
Public messaging often cites extreme scale (for example, millions of bid opportunities per second and sub-100ms decision windows). You do not need live auction logs for this hackathon; the provided CSVs are a fully synthetic teaching dataset.
| Role | Meaning |
|---|---|
| Advertiser | Pays to run ads to hit a business goal (install app, purchase, etc.). |
| Publisher | Owns the app or site where the ad appears; sells impressions. |
| DSP | Advertiser-side buying automation (bid, budget, creatives, targeting). Smadex is modeled as a DSP here. |
| SSP | Publisher-side selling automation. |
| Ad exchange | Liquidity layer where buy and sell meet (not represented as a table in this dataset). |
When an impression becomes available, eligible buyers evaluate a bid request (often includes coarse geo, device/OS, app category, etc.). Buyers respond with bids; a winner is chosen; the winning creative is served. This repository does not contain bid-level auction logs. It contains delivery and outcome statistics after the fact (spend, impressions, clicks, conversions, revenue), plus creative and campaign metadata.
| Object | Meaning |
|---|---|
| Creative | The actual ad unit (image/video/playable layout). Identified by creative_id. |
| Campaign | A structured flight with budget, dates, targeting, and an objective. Identified by campaign_id. |
| Advertiser | The company running campaigns. Identified by advertiser_id. |
| KPI | Definition | Notes |
|---|---|---|
| Impression | Ad was served (row count in daily fact table). | Use impressions column. |
| Click | User tapped the ad. | clicks. |
| Conversion | User completed the campaign objective (install, purchase, etc.). | conversions; meaning depends on campaigns.objective. |
| CTR | clicks / impressions |
Guard divide-by-zero. |
| CVR | conversions / clicks |
Guard divide-by-zero. |
| IPM | 1000 * conversions / impressions |
Installs (or conversions) per mille impressions. |
| ROAS | revenue_usd / spend_usd |
Above 1 means revenue exceeds spend on that slice. |
| CPA | spend_usd / conversions |
This dataset’s stated KPI focus (kpi_goal is CPA in campaigns). |
Fatigue means performance degrades over time for a creative that used to work: CTR/CVR/efficiency slip even if targeting is unchanged. Marketers want early warnings and actions (refresh creative, rotate variants, reallocate budget).
Build a product-style demo that helps a marketer answer:
- Which ads are working right now? — Performance explorer (slice metrics by time, geo, OS, format, etc.).
- Which ads are dying (fatigue)? — Fatigue detection (trends vs a baseline, not only a static leaderboard).
- What should I do next? — Recommendations / explainability (actions on creatives, not “what to bid on an auction slot” unless you bring external auction data).
Bonus differentiator: combine asset_file PNGs with performance to explain why patterns exist (vision model, human review UI, clustering, etc.).
Framing: this is not a perfect Kaggle-style leaderboard problem. Strong projects blend analysis + simple models + explainability + UX.
From data_science/data/README.md (canonical counts):
| Entity | Count |
|---|---|
| Advertisers | 36 |
| Campaigns | 180 |
| Creatives | 1,080 |
| Daily fact rows | 192,315 |
Uniformity quirk: every advertiser has 5 campaigns and every campaign has 6 creatives by design. Portfolio-size comparisons are meaningless; compare performance and creative traits instead.
advertisers.csv
└── advertiser_id ──< campaigns.csv
└── campaign_id ──< creatives.csv
└── creative_id ──< creative_daily_country_os_stats.csv
(also join daily rows by campaign_id for sanity)
Practical joins
campaigns.advertiser_id->advertisers.advertiser_idcreatives.campaign_id->campaigns.campaign_idcreative_daily_country_os_stats.creative_id->creatives.creative_idcreative_daily_country_os_stats.campaign_id->campaigns.campaign_id(redundant but useful)
Pre-aggregated convenience tables
creative_summary.csv= one row per creative: lifetime totals, decay features, synthetic labels (creative_status,fatigue_day,perf_score), plus creative metadata columns mirrored fromcreatives.csv.campaign_summary.csv= one row per campaign: lifetime totals + campaign metadata.
Authoritative column gloss: data_science/data/data_dictionary.csv (machine-readable). The sections below match actual CSV headers in this repo.
| Column | Type | Meaning |
|---|---|---|
advertiser_id |
int | Primary key. |
advertiser_name |
string | Synthetic company name. |
vertical |
string | Industry vertical (gaming, fintech, etc.). |
hq_region |
string | Synthetic HQ region (e.g. LATAM, North America). |
| Column | Type | Meaning |
|---|---|---|
campaign_id |
int | Primary key. |
advertiser_id |
int | FK to advertisers.advertiser_id. |
advertiser_name |
string | Denormalized name (matches advertiser). |
app_name |
string | Promoted app/product name. |
vertical |
string | Vertical. |
objective |
string | Conversion goal: install, purchase, signup, booking, order, etc. |
primary_theme |
string | Campaign-level creative angle. |
target_age_segment |
string | Age band targeted. |
target_os |
string | Android, iOS, or Both. |
countries |
string | Pipe-separated list, e.g. `CA |
start_date |
date | Campaign start. |
end_date |
date | Campaign end. |
daily_budget_usd |
float | Approximate daily budget. |
kpi_goal |
string | In this dataset: CPA focus. |
Rich metadata + engineered visual features (synthetic) for each creative. Includes path to a synthetic PNG.
Identity and delivery
| Column | Type | Meaning |
|---|---|---|
creative_id |
int | Primary key. |
campaign_id |
int | FK to campaigns.campaign_id. |
advertiser_name |
string | Denormalized. |
app_name |
string | Denormalized promoted app. |
vertical |
string | Denormalized vertical. |
format |
string | e.g. interstitial, rewarded_video, banner, native, playable. |
width, height |
int | Asset dimensions (pixels). |
language |
string | Creative copy language code. |
creative_launch_date |
string/date | First active date for the creative. |
asset_file |
string | Relative path to a synthetic PNG (bundle may or may not ship assets/ next to the CSVs; paths are still valid when images are present). |
Concept / copy
| Column | Type | Meaning |
|---|---|---|
theme |
string | Story/concept bucket. |
hook_type |
string | Attention hook category. |
cta_text |
string | Call to action text. |
headline |
string | Primary headline text in asset. |
subhead |
string | Secondary line. |
dominant_color |
string | Color family. |
emotional_tone |
string | Tone bucket. |
Visual scores (roughly 0-1 floats unless noted)
| Column | Type | Meaning |
|---|---|---|
duration_sec |
int | Video/interactive length; 0 for static. |
text_density |
float | Share of layout covered by text (0-1). |
copy_length_chars |
int | Approximate character count of copy. |
readability_score |
float | Estimated readability (0-1). |
brand_visibility_score |
float | Logo/brand prominence (0-1). |
clutter_score |
float | Busyness (0-1). |
novelty_score |
float | Originality (0-1). |
motion_score |
float | Motion intensity (0-1). |
faces_count |
int | Number of faces visible. |
product_count |
int | Product/food elements shown. |
Binary flags (0/1)
| Column | Meaning |
|---|---|
has_price |
Price or monetary offer visible. |
has_discount_badge |
Sale/discount badge visible. |
has_gameplay |
Gameplay footage visible. |
has_ugc_style |
UGC / creator-style layout. |
Grain: one row per (date, creative_id, country, os) (also carries campaign_id).
| Column | Type | Meaning |
|---|---|---|
date |
date | Observation date. |
campaign_id |
int | FK to campaigns. |
creative_id |
int | FK to creatives. |
country |
string | Delivery country code. |
os |
string | Android or iOS. |
days_since_launch |
int | Days since creative launch on that date (fatigue axis). |
impressions_last_7d |
int | Precomputed rolling impressions (see caveats below). |
spend_usd |
float | Spend attributed to this slice. |
impressions |
int | Served impressions for the slice. |
viewable_impressions |
int | Estimated viewable impressions. |
clicks |
int | Clicks. |
conversions |
int | Conversions (meaning per campaign objective). |
revenue_usd |
float | Attributed revenue. |
video_completions |
int | Completed video views (0 for static). |
Important analytical caveat: impressions_last_7d is a rolling field defined by the dataset authors. Do not blindly sum it across rows the same way you sum impressions; for rollups you usually aggregate impressions, spend_usd, etc., and compute your own rolling windows if needed.
Grain: one row per creative_id. Contains:
- Lifetime totals and overall rates (
overall_*). - First-week vs last-week decay features (
first_7d_*,last_7d_*,*_decay_pct). - Synthetic supervision-ish fields:
creative_status,fatigue_day,perf_score. - Creative metadata duplicated for modeling convenience (same themes as
creatives.csv).
IDs and labels
| Column | Meaning |
|---|---|
creative_id, campaign_id |
Keys. |
advertiser_name, app_name, vertical, format |
Denormalized context. |
creative_status |
top_performer, stable, fatigued, underperformer (synthetic). |
fatigue_day |
Populated for fatigued rows; blank otherwise (see quirks). |
perf_score |
Synthetic scalar score (0-1). |
Lifetime totals
| Column | Meaning |
|---|---|
total_days_active |
Active lifetime length (days). |
total_spend_usd, total_impressions, total_clicks, total_conversions, total_revenue_usd |
Rolled-up sums. |
overall_ctr, overall_cvr, overall_ipm, overall_roas |
Derived rates. |
Fatigue / momentum features
| Column | Meaning |
|---|---|
first_7d_ctr, last_7d_ctr, ctr_decay_pct |
CTR early vs late and relative change. |
first_7d_cvr, last_7d_cvr, cvr_decay_pct |
Same for CVR. |
peak_rolling_ctr_5 |
Peak short-window CTR early in life. |
Creative metadata columns (same families as creatives.csv)
Includes: width, height, language, creative_launch_date, theme, hook_type, cta_text, headline, subhead, dominant_color, emotional_tone, duration_sec, text_density, copy_length_chars, readability/brand/clutter/novelty/motion scores, counts, has_* flags, asset_file.
Volume helpers
| Column | Meaning |
|---|---|
peak_day_impressions |
Max impressions on a single day. |
first_7d_impressions, first_7d_clicks, first_7d_conversions |
First week raw counts. |
last_7d_impressions, last_7d_clicks, last_7d_conversions |
Last week raw counts. |
Grain: one row per campaign_id. Campaign metadata plus rolled-up performance.
| Column | Meaning |
|---|---|
campaign_id |
Key. |
total_spend_usd, total_impressions, total_clicks, total_conversions, total_revenue_usd |
Totals. |
advertiser_id, advertiser_name |
Advertiser linkage. |
app_name, vertical, objective, primary_theme, target_age_segment, target_os, countries |
Targeting/setup. |
start_date, end_date, daily_budget_usd, kpi_goal |
Flight + KPI. |
overall_ctr, overall_cvr, overall_roas |
Campaign-level rates. |
Row-wise mapping of (file_name, column_name, data_type, description). Treat it as the schema glossary for tools and LLMs.
From data_science/data/README.md and dataset design:
fatigue_dayonly exists forcreative_status == fatigued. Other statuses leave it blank by construction.- Uniform portfolio sizes make “who is biggest” questions meaningless; compare KPIs and creative features.
- Synthetic data: patterns are realistic for learning, but this is not a production benchmark.
- Hidden generator variables are not shipped; do not expect to invert the full generative process from public files alone.
Shorter guides with the same domain story in context:
- data_science/README.md — data science tree, each CSV explained, joins.
- web/README.md — Docker, Postgres tables vs CSVs, empty vs non-empty DB seeding behavior.
Layout:
data_science/— CSVs underdata_science/data/(andassets/), plus notebooks underdata_science/notebooks/. Use this tree for analysis and experiments without running the web stack.web/— Production-style stack:web/backend/(FastAPI:/api/performance/*, …),web/frontend/(React + Vite UI),web/docker-compose.yml, andweb/db/schema.sql(PostgreSQL DDL: one public table per seeded CSV indata_science/data/, excludingdata_dictionary.csv).
Web API data source
The API always uses PostgreSQL via DATABASE_URL. It does not read CSV files at request time. CSVs in data_science/data/ are the bulk seed source when the fact table is empty (see below).
Run locally (API + UI against Postgres)
# Terminal A — Postgres must already exist and be seeded (see bootstrap below)
cd web/backend && pip install -r requirements.txt && uvicorn main:app --reload --port 8000
# Terminal B
cd web/frontend && npm install && npm run devSeed Postgres from CSV (local, without full Docker stack)
From the repo root, with a venv that has web/backend dependencies installed:
export DATABASE_URL=postgresql+psycopg2://USER:PASS@localhost:5432/smadex
export IMPORT_DATA_DIR=/absolute/path/to/HACK_UPC_SMADEX/data_science/data
python web/backend/scripts/bootstrap_pg_from_csv.pyIf IMPORT_DATA_DIR is unset, the script defaults to data_science/data/ relative to the repository root. It applies web/db/schema.sql, then loads all CSV-backed tables when the fact table is empty, or backfills empty summary/dictionary tables when the fact table already has rows.
Run with Docker (Postgres + API + UI)
cd web && docker compose up --build
# UI: http://localhost:8080 (nginx proxies /api to backend)
# API: http://localhost:8000
# Postgres: localhost:5432 user/password/db: smadex / smadex / smadexCompose mounts ../data_science/data read-only at /import on the backend. On each backend start, scripts.ensure_db_seeded runs: if any seed table is missing it applies web/db/schema.sql; if creative_daily_country_os_stats has at least one row, the full truncate+reload is skipped and empty tables (e.g. new campaign_summary) are backfilled from CSV; if the fact table is empty, all seven seeded tables are truncated and the corresponding CSVs are loaded from /import (data_dictionary.csv is not loaded into Postgres).
Reset the database (force a full re-seed on next start): cd web && docker compose down -v then up again (removes the named volume smadex_pgdata).
When your team changes the product, append short dated notes here:
- Which KPI definitions you standardized (CTR denominators, CPA smoothing rules).
- Any new derived tables you materialize (daily creative totals, per-country baselines).
- Model cards (inputs, labels, leakage checks) if you add ML.
Keeping this README current is cheaper than re-explaining the domain in every chat.