
preflightml

preflightml is a lightweight machine learning dataset linter that runs sanity checks before training to catch silent data issues that commonly invalidate models.

It is designed to answer one question:

"Is this dataset safe to train a model on?"

Rather than producing large exploratory reports, preflightml surfaces only actionable problems, ranked by severity, so you can fix issues early and avoid misleading results.


Why preflightml exists

Most ML failures are not caused by model choice. They are caused by data issues that don't crash code but quietly ruin results, such as:

  • severe class imbalance
  • duplicate or leaked rows
  • invalid target columns
  • constant or identifier features
  • accidental target leakage

These problems often go unnoticed until after training, when metrics look suspicious or models fail in production.

preflightml is a pre-flight checklist for datasets, playing the same role a linter plays for code.


What preflightml is (and is not)

It is:

  • a Python library with a clean public API
  • a CLI tool for quick checks and CI usage
  • deterministic and opinionated
  • focused on ML-specific failure modes

It is not:

  • an EDA or visualization tool
  • a report generator with charts
  • a model training framework
  • a data auto-cleaner

preflightml does not modify data. It only reports risks.


Core design principles

1. Linter-style output

If a check passes, nothing is printed. Only failures are surfaced.

2. Deterministic checks

All validations are rule-based and reproducible, with no hidden heuristics.

3. Separation of concerns

  • Python checks detect issues
  • structured output represents findings
  • an optional small LLM converts findings into human explanations

4. Safe by default

No raw data is ever sent to an LLM. Only metadata, counts, and issue summaries are used.
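
To make these two principles concrete, a structured finding might look like the sketch below. The names and fields are hypothetical, not preflightml's actual schema; the point is that a finding carries only metadata such as column names, counts, and rates, never raw values.

# Hypothetical sketch of a structured finding -- illustrative only,
# not preflightml's actual schema.
from dataclasses import dataclass, field
from enum import Enum

class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

@dataclass
class Finding:
    check: str          # e.g. "missing_values"
    severity: Severity
    message: str        # one-line human-readable summary
    details: dict = field(default_factory=dict)  # counts and rates only, never raw rows

finding = Finding(
    check="missing_values",
    severity=Severity.WARNING,
    message='Column "income": 12% missing',
    details={"column": "income", "missing_rate": 0.12},
)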


What preflightml checks

preflightml implements ~30 focused checks, grouped into categories. Most datasets will trigger only a few.

Dataset integrity

  • unreadable or unsupported files
  • empty datasets
  • missing or duplicate column names
  • fully empty columns
  • row count sanity checks

Missing & duplicate data

  • missing values per column
  • high missing-rate warnings
  • exact duplicate rows
  • near-duplicate row warnings
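
For example, the exact-duplicate check boils down to a few lines of pandas. This is a sketch of the idea, not preflightml's implementation:

# Sketch of an exact-duplicate row check (illustrative, not preflightml's code).
import pandas as pd

df = pd.read_csv("data.csv")
dup_count = int(df.duplicated().sum())  # rows identical to an earlier row
if dup_count > 0:
    print(f"[WARNING] {dup_count} exact duplicate rows "
          f"({dup_count / len(df):.1%} of the dataset)")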

Target validation

  • target column existence
  • missing values in target
  • constant or invalid target
  • incompatible target types

Class balance

  • class distribution analysis
  • severe imbalance detection
  • minority class sparsity warnings
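
The severe-imbalance check reduces to a simple rule over the class distribution. A minimal sketch, assuming a hypothetical 90% majority-share threshold (preflightml's actual threshold may differ):

# Sketch of a severe class-imbalance check (hypothetical 0.9 threshold).
import pandas as pd

df = pd.read_csv("data.csv")
shares = df["label"].value_counts(normalize=True)  # sorted descending
if shares.iloc[0] > 0.9:
    print(f"[WARNING] Severe class imbalance: class {shares.index[0]!r} "
          f"represents {shares.iloc[0]:.0%} of samples")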

Feature quality

  • constant or near-constant features
  • identifier-like columns
  • high-cardinality categorical features
  • mixed or invalid data types
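
Constant and identifier-like columns can both be detected from cardinality alone, as in this rough sketch of the idea:

# Sketch of constant and identifier-like feature checks (illustrative).
import pandas as pd

df = pd.read_csv("data.csv")
for col in df.columns:
    n_unique = df[col].nunique(dropna=True)
    if n_unique <= 1:
        print(f"[WARNING] Column {col!r} is constant and carries no signal")
    elif n_unique == len(df):
        print(f"[WARNING] Column {col!r} is unique per row; likely an identifier")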

Leakage & risk signals

  • suspicious feature–target correlations
  • potential target leakage
  • timestamp-based leakage risks
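
The correlation-based leakage signal rests on a simple observation: a feature that almost perfectly predicts the target on its own usually encodes the answer. A sketch with a hypothetical 0.95 cutoff, assuming a numeric target:

# Sketch of a correlation-based leakage signal (hypothetical 0.95 cutoff).
import pandas as pd

df = pd.read_csv("data.csv")
target = "label"  # assumes a numeric target for Pearson correlation
features = df.select_dtypes("number").drop(columns=[target], errors="ignore")
for col in features.columns:
    corr = features[col].corr(df[target])
    if pd.notna(corr) and abs(corr) > 0.95:
        print(f"[CRITICAL] Feature {col!r} correlates {corr:.2f} with the target; "
              f"possible leakage")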

Reporting

  • severity levels (info / warning / critical)
  • prioritized issue ordering
  • concise dataset health summary

CLI usage

Basic scan:

preflightml scan data.csv --target label

Verbose output:

preflightml scan data.csv --target label --verbose

Machine-readable output (for CI):

preflightml scan data.csv --target label --json

LLM-generated explanation:

preflightml scan data.csv --target label --explain
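
In CI, the --json flag makes findings machine-readable. The sketch below gates a build on critical issues; the JSON schema it assumes (a list of issue objects with a "severity" field) is hypothetical, so adapt it to the actual output:

# Sketch of a CI gate around `preflightml scan --json`.
# The assumed schema (a list of objects with a "severity" field) is
# hypothetical -- check the real JSON output before relying on this.
import json
import subprocess
import sys

result = subprocess.run(
    ["preflightml", "scan", "data.csv", "--target", "label", "--json"],
    capture_output=True, text=True,
)
issues = json.loads(result.stdout) if result.stdout.strip() else []
criticals = [i for i in issues if i.get("severity") == "critical"]
if criticals:
    print(f"{len(criticals)} critical data issue(s) found; failing the build")
    sys.exit(1)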

Python library usage

preflightml can be used directly inside Python code or notebooks.

Typical use cases:

  • data validation before training
  • automated checks in pipelines
  • unit testing datasets
  • teaching ML data hygiene

The library exposes a small, stable API rather than requiring imports from internal modules.
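
A usage sketch follows; the entry point name and the shape of the returned report are assumptions for illustration, not preflightml's documented API:

# Hypothetical usage sketch -- the `scan` entry point and the report's
# attributes are assumptions, not preflightml's documented API.
import preflightml

report = preflightml.scan("data.csv", target="label")
for issue in report.issues:
    print(issue.severity, issue.message)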


LLM-generated explanations (optional)

preflightml can optionally pass its structured findings to a small language model to generate:

  • a concise dataset health summary
  • prioritized explanations of the most serious issues
  • suggested next steps

Important guarantees:

  • the LLM does not see raw data
  • the LLM cannot invent issues
  • all numbers and facts come from deterministic checks
  • the LLM is used only as a narrator

If the LLM is disabled, preflightml still functions fully.
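
Concretely, the metadata-only guarantee means the payload handed to the LLM could look like the sketch below: check names, counts, and rates, but no cell values. The structure (and the dataset dimensions) shown here are illustrative, not preflightml's actual payload format:

# Illustrative sketch of a metadata-only LLM payload (not the actual format).
payload = {
    "dataset": {"rows": 10_000, "columns": 24},
    "issues": [
        {"check": "class_imbalance", "severity": "warning",
         "details": {"majority_class_share": 0.93}},
        {"check": "missing_values", "severity": "warning",
         "details": {"column": "income", "missing_rate": 0.12}},
    ],
    # No raw rows or cell values ever appear here.
}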


Example output (conceptual)

Dataset Health Check

[CRITICAL] Target leakage suspected
- Feature "total_spend_last_30d" is highly correlated with target

[WARNING] Severe class imbalance
- Class 0 represents 93% of samples

[WARNING] Missing values detected
- Column "income": 12% missing

3 issues found (1 critical, 2 warnings)

Project structure

preflightml is structured as a reusable Python library:

  • checks are grouped by category
  • the CLI is a thin wrapper around the core API
  • output rendering is separated from validation logic
  • all checks are unit-testable

This makes it suitable for both learning and real-world use.


Who this is for

  • students learning ML properly
  • engineers validating datasets in pipelines
  • researchers reproducing experiments
  • anyone who wants to avoid silent data bugs

Philosophy

preflightml is intentionally narrow.

It does not try to be everything. It tries to be correct, quiet, and useful.

If your dataset is healthy, preflightml stays silent. If it is risky, preflightml speaks clearly.


License

MIT License.
