preflightml is a lightweight machine learning dataset linter that runs sanity checks before training to catch silent data issues that commonly invalidate models.
It is designed to answer one question:
"Is this dataset safe to train a model on?"
Rather than producing large exploratory reports, preflightml surfaces only actionable problems, ranked by severity, so you can fix issues early and avoid misleading results.
Most ML failures are not caused by model choice. They are caused by data issues that don't crash code but quietly ruin results, such as:
- severe class imbalance
- duplicate or leaked rows
- invalid target columns
- constant or identifier features
- accidental target leakage
These problems often go unnoticed until after training, when metrics look suspicious or models fail in production.
preflightml is a pre-flight checklist for datasets, much as a linter is for code.

It is:
- a Python library with a clean public API
- a CLI tool for quick checks and CI usage
- deterministic and opinionated
- focused on ML-specific failure modes
It is not:
- an EDA or visualization tool
- a report generator with charts
- a model training framework
- a data auto-cleaner
preflightml does not modify data. It only reports risks.
If a check passes, nothing is printed. Only failures are surfaced.
All validations are rule-based and reproducible; there are no hidden heuristics.
How it works:
- Python checks detect issues
- structured output represents findings
- an optional small LLM converts findings into human explanations
No raw data is ever sent to an LLM. Only metadata, counts, and issue summaries are used.
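Concretely, a single finding might be represented as a small structured record like the one below; the field names and shape are illustrative assumptions, not the actual schema:

    # Hypothetical structured finding as a check might emit it.
    # Only metadata and counts appear here; never raw cell values.
    finding = {
        "check": "class_imbalance",
        "severity": "warning",
        "column": "label",
        "details": {"majority_class": 0, "majority_fraction": 0.93},
        "message": "Class 0 represents 93% of samples",
    }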
preflightml implements ~30 focused checks, grouped into the categories below. Most datasets will trigger only a few. (A sketch of one representative check follows the list.)
File and schema checks:
- unreadable or unsupported files
- empty datasets
- missing or duplicate column names
- fully empty columns
- row count sanity checks

Missing data and duplicates:
- missing values per column
- high missing-rate warnings
- exact duplicate rows
- near-duplicate row warnings

Target checks:
- target column existence
- missing values in target
- constant or invalid target
- incompatible target types

Class balance checks:
- class distribution analysis
- severe imbalance detection
- minority class sparsity warnings

Feature checks:
- constant or near-constant features
- identifier-like columns
- high-cardinality categorical features
- mixed or invalid data types

Leakage checks:
- suspicious feature–target correlations
- potential target leakage
- timestamp-based leakage risks

Reporting:
- severity levels (info / warning / critical)
- prioritized issue ordering
- concise dataset health summary
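To give a flavor of how these rule-based checks work, here is a minimal sketch of a severe-imbalance check; the function name, the 90% threshold, and the finding format are illustrative assumptions, not preflightml's actual internals:

    import pandas as pd

    def check_severe_imbalance(target: pd.Series, threshold: float = 0.9):
        """Flag targets where one class dominates (assumed 90% default)."""
        counts = target.value_counts(normalize=True, dropna=True)
        if counts.empty:
            return None
        majority_class = counts.index[0]
        fraction = counts.iloc[0]
        if fraction >= threshold:
            return {
                "check": "severe_imbalance",
                "severity": "warning",
                "message": f"Class {majority_class} represents {fraction:.0%} of samples",
            }
        return None  # the check passes, so nothing is reported

Because the logic is a plain function of value counts, the same input always yields the same finding, which is what makes the checks deterministic.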
Basic scan:
preflightml scan data.csv --target label
Verbose output:
preflightml scan data.csv --target label --verbose
Machine-readable output (for CI; a Python gate sketch follows these examples):
preflightml scan data.csv --target label --json
LLM-generated explanation:
preflightml scan data.csv --target label --explain
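In CI, the --json output can be consumed programmatically. The sketch below assumes the JSON is a list of issue objects that each carry a "severity" field; the actual schema may differ:

    # ci_gate.py: hypothetical CI gate built on `preflightml scan --json`.
    import json
    import subprocess
    import sys

    proc = subprocess.run(
        ["preflightml", "scan", "data.csv", "--target", "label", "--json"],
        capture_output=True, text=True, check=False,
    )

    findings = json.loads(proc.stdout)  # assumed: a list of issue objects

    # Fail the build on any critical finding (assumed "severity" field).
    criticals = [f for f in findings if f.get("severity") == "critical"]
    if criticals:
        print(f"{len(criticals)} critical data issue(s) found; aborting")
        sys.exit(1)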
preflightml can be used directly inside Python code or notebooks.
Typical use cases:
- data validation before training
- automated checks in pipelines
- unit testing datasets
- teaching ML data hygiene
The library exposes a small, stable API rather than requiring imports from internal modules.
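As a sketch of what notebook usage could look like, assuming a top-level `scan` entry point that mirrors the CLI and returns a report object (these names are assumptions, not the documented API):

    import pandas as pd
    import preflightml  # the package itself; `scan` below is assumed

    df = pd.read_csv("data.csv")

    # Hypothetical entry point mirroring `preflightml scan --target label`.
    report = preflightml.scan(df, target="label")

    # Assumed report shape: an iterable of findings with severity levels.
    for finding in report.findings:
        print(finding.severity, finding.message)

    # In a pipeline or unit test, fail fast on critical issues.
    assert not report.has_critical, "dataset failed pre-flight checks"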
preflightml can optionally pass its structured findings to a small language model to generate:
- a concise dataset health summary
- prioritized explanations of the most serious issues
- suggested next steps
Important guarantees:
- the LLM does not see raw data
- the LLM cannot invent issues
- all numbers and facts come from deterministic checks
- the LLM is used only as a narrator
If the LLM is disabled, preflightml still functions fully.
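For instance, the material handed to the model could look roughly like this; the structure is an assumption, shown only to illustrate that summaries and counts, never rows, cross the boundary:

    # Hypothetical metadata-only payload for the explanation step.
    llm_payload = {
        "dataset": {"rows": 50_000, "columns": 14},  # illustrative counts
        "issues": [
            {"severity": "critical",
             "message": 'Feature "total_spend_last_30d" is highly correlated with target'},
            {"severity": "warning",
             "message": "Class 0 represents 93% of samples"},
        ],
    }

The model's job is limited to turning records like these into readable prose; every number it can mention is already present in the payload.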
Example output:

    Dataset Health Check

    [CRITICAL] Target leakage suspected
      - Feature "total_spend_last_30d" is highly correlated with target

    [WARNING] Severe class imbalance
      - Class 0 represents 93% of samples

    [WARNING] Missing values detected
      - Column "income": 12% missing

    3 issues found (1 critical, 2 warnings)
preflightml is structured as a reusable Python library:
- checks are grouped by category
- the CLI is a thin wrapper around the core API
- output rendering is separated from validation logic
- all checks are unit-testable
This makes it suitable for both learning and real-world use.
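For example, a check written as a pure function over a Series or DataFrame can be unit-tested in a few lines of pytest-style code; this sketch reuses the hypothetical check_severe_imbalance from the earlier example:

    import pandas as pd
    # assumes check_severe_imbalance from the earlier sketch is in scope

    def test_skewed_target_triggers_warning():
        # 93 zeros vs. 7 ones: past the assumed 90% threshold
        target = pd.Series([0] * 93 + [1] * 7)
        finding = check_severe_imbalance(target)
        assert finding is not None
        assert finding["severity"] == "warning"

    def test_balanced_target_stays_silent():
        # 50/50 split: the check should return nothing at all
        assert check_severe_imbalance(pd.Series([0, 1] * 50)) is None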
Who it's for:
- students learning to do ML properly
- engineers validating datasets in pipelines
- researchers reproducing experiments
- anyone who wants to avoid silent data bugs
preflightml is intentionally narrow.
It does not try to be everything. It tries to be correct, quiet, and useful.
If your dataset is healthy, preflightml stays silent. If it is risky, preflightml speaks clearly.
MIT License.