DataCheck

Open Source · Python · MIT · Updated 2026-03-15
Multi-dimensional data quality verification framework covering four quality dimensions: completeness, uniqueness, validity, and anomaly detection. Built-in IQR/Z-score anomaly detection and n-gram Jaccard approximate deduplication gate quality before data enters training.
Four-Dimensional Quality Model · Anomaly Detection · Approximate Deduplication

Quick Start

Install
pip install knowlyr-datacheck
Usage
from datacheck import DataChecker

checker = DataChecker()
report = checker.check_file("training_data.json")
MCP Tools

  • check_data_quality: check data file quality (supports JSON/JSONL/CSV)
  • validate_from_datarecipe: validate data using DataRecipe analysis results
  • compare_distributions: compare distributions across multiple data files (supports JSON/JSONL/CSV)
  • list_quality_rules: list all available quality check rules
  • infer_schema: infer a schema from a data file (field types, constraints, required fields)
  • fix_data: fix common data quality issues (dedup, trim whitespace, PII masking)
  • batch_check_directory: batch-check all data files in a directory (recursive scan, JSON/JSONL/CSV)
  • check_drift: detect distribution drift between two data files (numeric statistics, category distribution, text features)
  • check_leakage: detect data leakage between train and test sets (exact duplicates + token Jaccard near-duplicates)
  • check_bias: detect dataset bias (class imbalance, text length distribution, language distribution)
  • check_coverage: detect dataset coverage (field completeness, missing value ratios, unique value distribution)
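The near-duplicate checks above (check_leakage, approximate deduplication) are described as token/n-gram Jaccard similarity. As a rough illustration of that metric, here is a minimal character-n-gram sketch; the function names and the choice of character (rather than token) n-grams are illustrative assumptions, not DataCheck's API:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams for a string (lowercased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of two strings' n-gram sets, in [0, 1]."""
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    if not sa and not sb:
        return 1.0  # two empty strings are trivially identical
    return len(sa & sb) / len(sa | sb)

# Two records differing by one character score just below 1.0:
jaccard("the quick brown fox", "the quick brown fix")
```

A deduplicator would then flag pairs whose similarity exceeds some threshold (e.g. 0.8) as near-duplicates.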

Documentation

DataCheck

Multi-Dimensional Data Quality Validation
with Statistical Anomaly Detection

Automated quality validation for LLM training data — composable rules, IQR/Z-score anomaly detection, and auto-fix pipeline

Why DataCheck?

Training data quality is the hidden bottleneck of model performance. Overlooked format errors, hidden PII leaks, undetected duplicate samples — any single issue can amplify into systematic bias downstream.

Existing quality solutions are either one-off scripts (not reusable) or heavyweight platforms (expensive to deploy), and generally lack statistical anomaly detection and auto-fix capabilities.

DataCheck solves this with a composable rule engine that provides end-to-end data quality validation:

  • 9 Built-in Rules covering completeness, validity, privacy, and consistency
  • IQR / Z-score Dual-Method anomaly detection for numeric and text length outliers
  • LLM-Assisted Evaluation for instruction clarity and response relevance
  • Auto-Fix Pipeline — dedup, strip whitespace, PII redaction
  • Report Diff — quantify quality improvements before vs. after fixes

Get Started in 30 Seconds

# Install
pip install knowlyr-datacheck

# Check your data
knowlyr-datacheck check data.json

# Auto-fix issues
knowlyr-datacheck fix data.jsonl -o fixed.jsonl --strip-pii

# Compare before/after
knowlyr-datacheck diff report_v1.json report_v2.json
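Conceptually, a fix pass like the `fix ... --strip-pii` command above trims whitespace, redacts PII, and drops exact duplicates. A simplified sketch of that idea (the regex and function below are illustrative assumptions, not DataCheck's implementation):

```python
import re

# Simplified email pattern; real PII masking covers more identifier types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def fix_records(records):
    """Trim whitespace, redact email-like PII, drop exact duplicates."""
    seen, fixed = set(), []
    for rec in records:
        text = EMAIL.sub("[EMAIL]", rec.strip())
        if text not in seen:  # exact-duplicate removal after normalization
            seen.add(text)
            fixed.append(text)
    return fixed

fix_records(["  hi alice@example.com ", "hi alice@example.com", "bye"])
# → ["hi [EMAIL]", "bye"]
```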

Quality Pipeline

graph LR
    D["Data Files<br/>JSON / JSONL / CSV"] --> R["Rule Engine<br/>9 Rules + YAML Custom"]
    R --> A["Anomaly Detector<br/>IQR / Z-score"]
    A --> Rep["Quality Report<br/>MD / JSON / HTML"]
    Rep --> Fix["Auto Fix<br/>Dedup · PII · Trim"]
    Fix --> Diff["Report Diff<br/>Before vs After"]

    style R fill:#0969da,color:#fff,stroke:#0969da
    style A fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style Rep fill:#2da44e,color:#fff,stroke:#2da44e
    style Fix fill:#e5534b,color:#fff,stroke:#e5534b
    style D fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Diff fill:#1a1a2e,color:#e0e0e0,stroke:#444

Core Features

Composable Rule Engine

9 built-in rules with 4 preset rulesets (default, sft, preference, llm). Extend with YAML — no Python code needed:

rules:
  - field: instruction
    check: min_length
    value: 10
    severity: error
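To make the declarative form concrete, a rule like the YAML above could be evaluated roughly as follows; the `CHECKS` table, `apply_rule` function, and violation dict shape are assumptions for illustration, not DataCheck's internal API:

```python
# Hypothetical evaluation of a declarative min_length rule.
CHECKS = {
    "min_length": lambda value, threshold: len(str(value)) >= threshold,
}

def apply_rule(record, rule):
    """Return a violation dict if the record fails the rule, else None."""
    field_value = record.get(rule["field"], "")
    passed = CHECKS[rule["check"]](field_value, rule["value"])
    if not passed:
        return {"field": rule["field"], "check": rule["check"],
                "severity": rule["severity"]}
    return None

rule = {"field": "instruction", "check": "min_length",
        "value": 10, "severity": "error"}
apply_rule({"instruction": "short"}, rule)  # fails: fewer than 10 chars
```

New checks extend the `CHECKS` table, which is what lets YAML-only users compose rules without writing Python.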

Statistical Anomaly Detection

Pure Python, zero external dependencies. Automatically enabled when sample size $\geq 10$:

  • IQR Method: $\text{outlier}(x) \iff x < Q_1 - 1.5 \cdot \text{IQR} \;\lor\; x > Q_3 + 1.5 \cdot \text{IQR}$
  • Z-score Method: $\text{outlier}(x) \iff |z(x)| > 3$
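The two criteria above can be sketched in pure stdlib Python, mirroring the "enabled at sample size ≥ 10" rule. This is a minimal sketch with a simple index-based quartile approximation, not DataCheck's implementation:

```python
import statistics

def detect_outliers(values, min_samples=10, z_threshold=3.0):
    """Flag (index, value) pairs that are outliers by IQR or Z-score."""
    if len(values) < min_samples:  # too few samples for stable statistics
        return []
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]  # crude quartile approximation
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    outliers = []
    for i, x in enumerate(values):
        by_iqr = x < lo or x > hi
        by_z = stdev > 0 and abs(x - mean) / stdev > z_threshold
        if by_iqr or by_z:
            outliers.append((i, x))
    return outliers

detect_outliers([10, 11, 12, 10, 11, 12, 10, 11, 12, 100])
# → [(9, 100)]
```

Running both methods matters because they fail differently: IQR is robust to a single extreme value inflating the spread, while Z-score with a 3σ cutoff is stricter on roughly normal data.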

LLM-Assisted Quality Evaluation

Semantic-level quality checks beyond rule-based validation:

knowlyr-datacheck check data.json --ruleset llm

MCP Integration

11 MCP tools for seamless AI IDE integration — check, fix, diff, infer schema, and more, all from your editor.

Python SDK

from datacheck import DataChecker, QualityReport

checker = DataChecker()
result = checker.check_file("data.json")
report = QualityReport(result)
report.print_summary()

Ecosystem

DataCheck is part of the knowlyr data infrastructure:

| Layer | Project | Role |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset intelligence & trend analysis |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |

GitHub · PyPI

knowlyr — multi-dimensional data quality validation with statistical anomaly detection

Want to discuss this project? Reach out to:

Kai, Founder & CEO
林晓桐, AI Data Quality Specialist