
preflightml

preflightml is a lightweight machine learning dataset linter that runs sanity checks before training to catch silent data issues that commonly invalidate models.

It is designed to answer one question:

"Is this dataset safe to train a model on?"

Rather than producing large exploratory reports, preflightml surfaces only actionable problems, ranked by severity, so you can fix issues early and avoid misleading results.


Why preflightml exists

Most ML failures are not caused by model choice. They are caused by data issues that don't crash code but quietly ruin results, such as:

  • severe class imbalance
  • duplicate or leaked rows
  • invalid target columns
  • constant or identifier features
  • accidental target leakage

These problems often go unnoticed until after training, when metrics look suspicious or models fail in production.

preflightml is a pre-flight checklist for datasets, playing the same role a linter plays for code.


What preflightml is (and is not)

It is:

  • a Python library with a clean public API
  • a CLI tool for quick checks and CI usage
  • deterministic and opinionated
  • focused on ML-specific failure modes

It is not:

  • an EDA or visualization tool
  • a report generator with charts
  • a model training framework
  • a data auto-cleaner

preflightml does not modify data. It only reports risks.


Core design principles

1. Linter-style output

If a check passes, nothing is printed. Only failures are surfaced.

2. Deterministic checks

All validations are rule-based and reproducible, with no hidden heuristics.

3. Separation of concerns

  • Python checks detect issues
  • structured output represents findings
  • an optional small LLM converts findings into human explanations

4. Safe by default

No raw data is ever sent to an LLM. Only metadata, counts, and issue summaries are used.
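
To make these two principles concrete, a structured finding might look like the sketch below. The names and fields are hypothetical, not preflightml's actual schema; the point is that a finding carries only metadata such as column names, counts, and rates, never raw values.

# Hypothetical sketch of a structured finding -- illustrative only,
# not preflightml's actual schema.
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

@dataclass
class Finding:
    check: str          # e.g. "missing_values"
    severity: Severity
    message: str        # one-line human-readable summary
    details: dict = field(default_factory=dict)  # counts and rates only, never raw rows

finding = Finding(
    check="missing_values",
    severity=Severity.WARNING,
    message='Column "income": 12% missing',
    details={"column": "income", "missing_rate": 0.12},
)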


What preflightml checks

preflightml implements ~30 focused checks, grouped into categories. Most datasets will trigger only a few.

Dataset integrity

  • unreadable or unsupported files
  • empty datasets
  • missing or duplicate column names
  • fully empty columns
  • row count sanity checks

Missing & duplicate data

  • missing values per column
  • high missing-rate warnings
  • exact duplicate rows
  • near-duplicate row warnings
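
For example, the exact-duplicate check boils down to a few lines of pandas. This is a sketch of the idea, not preflightml's implementation:

# Sketch of an exact-duplicate row check (illustrative, not preflightml's code).
import pandas as pd

df = pd.read_csv("data.csv")
dup_count = int(df.duplicated().sum())  # rows identical to an earlier row
if dup_count > 0:
    print(f"[WARNING] {dup_count} exact duplicate rows "
          f"({dup_count / len(df):.1%} of the dataset)")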

Target validation

  • target column existence
  • missing values in target
  • constant or invalid target
  • incompatible target types

Class balance

  • class distribution analysis
  • severe imbalance detection
  • minority class sparsity warnings
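
The severe-imbalance check reduces to a simple rule over the class distribution. A minimal sketch, assuming a hypothetical 90% majority-share threshold (preflightml's actual threshold may differ):

# Sketch of a severe class-imbalance check (hypothetical 0.9 threshold).
import pandas as pd

df = pd.read_csv("data.csv")
shares = df["label"].value_counts(normalize=True)  # sorted descending
if shares.iloc[0] > 0.9:
    print(f"[WARNING] Severe class imbalance: class {shares.index[0]!r} "
          f"represents {shares.iloc[0]:.0%} of samples")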

Feature quality

  • constant or near-constant features
  • identifier-like columns
  • high-cardinality categorical features
  • mixed or invalid data types
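
Constant and identifier-like columns can both be detected from cardinality alone, as in this rough sketch of the idea:

# Sketch of constant and identifier-like feature checks (illustrative).
import pandas as pd

df = pd.read_csv("data.csv")
for col in df.columns:
    n_unique = df[col].nunique(dropna=True)
    if n_unique <= 1:
        print(f"[WARNING] Column {col!r} is constant and carries no signal")
    elif n_unique == len(df):
        print(f"[WARNING] Column {col!r} is unique per row; likely an identifier")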

Leakage & risk signals

  • suspicious feature–target correlations
  • potential target leakage
  • timestamp-based leakage risks
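
The correlation-based leakage signal rests on a simple observation: a feature that almost perfectly predicts the target on its own usually encodes the answer. A sketch with a hypothetical 0.95 cutoff, assuming a numeric target:

# Sketch of a correlation-based leakage signal (hypothetical 0.95 cutoff).
import pandas as pd

df = pd.read_csv("data.csv")
target = "label"  # assumes a numeric target for Pearson correlation
features = df.select_dtypes("number").drop(columns=[target], errors="ignore")
for col in features.columns:
    corr = features[col].corr(df[target])
    if pd.notna(corr) and abs(corr) > 0.95:
        print(f"[CRITICAL] Feature {col!r} correlates {corr:.2f} with the target; "
              f"possible leakage")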

Reporting

  • severity levels (info / warning / critical)
  • prioritized issue ordering
  • concise dataset health summary

CLI usage

Basic scan:

preflightml scan data.csv --target label

Verbose output:

preflightml scan data.csv --target label --verbose

Machine-readable output (for CI):

preflightml scan data.csv --target label --json

LLM-generated explanation:

preflightml scan data.csv --target label --explain
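
In CI, the --json flag makes findings machine-readable. The sketch below gates a build on critical issues; the JSON schema it assumes (a list of issue objects with a "severity" field) is hypothetical, so adapt it to the actual output:

# Sketch of a CI gate around `preflightml scan --json`.
# The assumed schema (a list of objects with a "severity" field) is
# hypothetical -- check the real JSON output before relying on this.
import json
import subprocess
import sys

result = subprocess.run(
    ["preflightml", "scan", "data.csv", "--target", "label", "--json"],
    capture_output=True, text=True,
)
issues = json.loads(result.stdout) if result.stdout.strip() else []
criticals = [i for i in issues if i.get("severity") == "critical"]
if criticals:
    print(f"{len(criticals)} critical data issue(s) found; failing the build")
    sys.exit(1)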

Python library usage

preflightml can be used directly inside Python code or notebooks.

Typical use cases:

  • data validation before training
  • automated checks in pipelines
  • unit testing datasets
  • teaching ML data hygiene

The library exposes a small, stable API rather than requiring imports from internal modules.
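
A usage sketch follows; the entry point name and the shape of the returned report are assumptions for illustration, not preflightml's documented API:

# Hypothetical usage sketch -- the `scan` entry point and the report's
# attributes are assumptions, not preflightml's documented API.
import preflightml

report = preflightml.scan("data.csv", target="label")
for issue in report.issues:
    print(issue.severity, issue.message)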


LLM-generated explanations (optional)

preflightml can optionally pass its structured findings to a small language model to generate:

  • a concise dataset health summary
  • prioritized explanations of the most serious issues
  • suggested next steps

Important guarantees:

  • the LLM does not see raw data
  • the LLM cannot invent issues
  • all numbers and facts come from deterministic checks
  • the LLM is used only as a narrator

If the LLM is disabled, preflightml still functions fully.
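
Concretely, the metadata-only guarantee means the payload handed to the LLM could look like the sketch below: check names, counts, and rates, but no cell values. The structure (and the dataset dimensions) shown here are illustrative, not preflightml's actual payload format:

# Illustrative sketch of a metadata-only LLM payload (not the actual format).
payload = {
    "dataset": {"rows": 10_000, "columns": 24},
    "issues": [
        {"check": "class_imbalance", "severity": "warning",
         "details": {"majority_class_share": 0.93}},
        {"check": "missing_values", "severity": "warning",
         "details": {"column": "income", "missing_rate": 0.12}},
    ],
    # No raw rows or cell values ever appear here.
}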


Example output (conceptual)

Dataset Health Check

[CRITICAL] Target leakage suspected
- Feature "total_spend_last_30d" is highly correlated with target

[WARNING] Severe class imbalance
- Class 0 represents 93% of samples

[WARNING] Missing values detected
- Column "income": 12% missing

3 issues found (1 critical, 2 warnings)

Project structure

preflightml is structured as a reusable Python library:

  • checks are grouped by category
  • the CLI is a thin wrapper around the core API
  • output rendering is separated from validation logic
  • all checks are unit-testable

This makes it suitable for both learning and real-world use.


Who this is for

  • students learning ML properly
  • engineers validating datasets in pipelines
  • researchers reproducing experiments
  • anyone who wants to avoid silent data bugs

Philosophy

preflightml is intentionally narrow.

It does not try to be everything. It tries to be correct, quiet, and useful.

If your dataset is healthy, preflightml stays silent. If it is risky, preflightml speaks clearly.


License

MIT License.
