Quick Start
pip install knowlyr-datacheck
from datacheck import DataChecker
checker = DataChecker()
report = checker.check_file("training_data.json")
check_data_quality
Check data file quality (supports JSON/JSONL/CSV)
validate_from_datarecipe
Validate data using DataRecipe analysis results
compare_distributions
Compare distributions across multiple data files (supports JSON/JSONL/CSV)
list_quality_rules
List all available quality check rules
infer_schema
Infer a schema from a data file (field types, constraints, required fields)
fix_data
Fix common data quality issues (dedup, trim whitespace, PII masking)
batch_check_directory
Batch-check the quality of all data files in a directory (recursive scan of JSON/JSONL/CSV files)
check_drift
Detect distribution drift between two data files (numerical statistics, category distribution, text features)
check_leakage
Detect data leakage between train and test sets (exact duplicates + token Jaccard near-duplicates)
check_bias
Detect dataset bias (class imbalance, text length distribution, language distribution)
check_coverage
Detect dataset coverage (field completeness, missing value ratios, unique value distribution)
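The `check_leakage` description above mentions exact duplicates plus token Jaccard near-duplicates. As a rough illustration of that idea (not DataCheck's actual implementation), the sketch below compares test samples against training samples. The function names, the regex tokenizer, and the threshold parameter are all assumptions made for this example.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a simple stand-in for a real tokenizer."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """|A & B| / |A | B|, defined here as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_leaks(train: list[str], test: list[str], threshold: float = 0.8):
    """Return (test_index, train_index, kind, score) for suspected leaks.

    Hypothetical helper: the threshold default is an assumption, not a
    documented DataCheck setting.
    """
    train_exact = {text: i for i, text in enumerate(train)}
    train_tokens = [tokens(text) for text in train]
    leaks = []
    for j, sample in enumerate(test):
        # Exact duplicates are cheap to detect with a dictionary lookup.
        if sample in train_exact:
            leaks.append((j, train_exact[sample], "exact", 1.0))
            continue
        # Otherwise report the first training sample above the threshold.
        sample_tokens = tokens(sample)
        for i, candidate in enumerate(train_tokens):
            score = jaccard(sample_tokens, candidate)
            if score >= threshold:
                leaks.append((j, i, "near", score))
                break
    return leaks
```

The pairwise loop is quadratic in the number of samples, so a sketch like this only suits small datasets; larger ones would typically need an index such as MinHash.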
DataCheck
Multi-Dimensional Data Quality Validation with Statistical Anomaly Detection
Automated quality validation for LLM training data: composable rules, IQR/Z-score anomaly detection, and an auto-fix pipeline
Why DataCheck?
Training data quality is the hidden bottleneck of model performance. Overlooked format errors, hidden PII leaks, undetected duplicate samples — any single issue can amplify into systematic bias downstream.
Existing quality solutions are either one-off scripts (not reusable) or heavyweight platforms (expensive to deploy), and generally lack statistical anomaly detection and auto-fix capabilities.
DataCheck solves this with a composable rule engine that provides end-to-end data quality validation:
- 9 Built-in Rules covering completeness, validity, privacy, and consistency
- IQR / Z-score Dual-Method anomaly detection for numeric and text length outliers
- LLM-Assisted Evaluation for instruction clarity and response relevance
- Auto-Fix Pipeline — dedup, strip whitespace, PII redaction
- Report Diff — quantify quality improvements before vs. after fixes
Get Started in 30 Seconds
pip install knowlyr-datacheck
# Check your data
knowlyr-datacheck check data.json
# Auto-fix issues
knowlyr-datacheck fix data.jsonl -o fixed.jsonl --strip-pii
# Compare before/after
knowlyr-datacheck diff report_v1.json report_v2.json
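The `fix` command above is described as deduplicating, trimming whitespace, and masking PII. The following is a minimal Python sketch of what such a pass could look like, not the tool's real code: `fix_records`, the e-mail regex, the `[EMAIL]` placeholder, and the order of the steps are all assumptions for illustration, and real PII masking would cover far more than e-mail addresses.

```python
import json
import re

# Assumption: a deliberately simple e-mail pattern used as the only PII rule.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def fix_records(records, strip_pii=False):
    """Trim string fields, optionally mask e-mails, then drop exact duplicates."""
    seen = set()
    fixed = []
    for record in records:
        cleaned = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip()
                if strip_pii:
                    value = EMAIL_RE.sub("[EMAIL]", value)
            cleaned[key] = value
        # A canonical JSON dump serves as a hashable fingerprint of the record.
        fingerprint = json.dumps(cleaned, sort_keys=True, default=str)
        if fingerprint not in seen:
            seen.add(fingerprint)
            fixed.append(cleaned)
    return fixed
```

Deduplicating after trimming and masking means records that differ only in surrounding whitespace collapse into one, which is usually the desired outcome for training data.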
Quality Pipeline
graph LR
D["Data Files<br/>JSON / JSONL / CSV"] --> R["Rule Engine<br/>9 Rules + YAML Custom"]
R --> A["Anomaly Detector<br/>IQR / Z-score"]
A --> Rep["Quality Report<br/>MD / JSON / HTML"]
Rep --> Fix["Auto Fix<br/>Dedup · PII · Trim"]
Fix --> Diff["Report Diff<br/>Before vs After"]
style R fill:#0969da,color:#fff,stroke:#0969da
style A fill:#8b5cf6,color:#fff,stroke:#8b5cf6
style Rep fill:#2da44e,color:#fff,stroke:#2da44e
style Fix fill:#e5534b,color:#fff,stroke:#e5534b
style D fill:#1a1a2e,color:#e0e0e0,stroke:#444
style Diff fill:#1a1a2e,color:#e0e0e0,stroke:#444
Core Features
Composable Rule Engine
9 built-in rules with 4 preset rulesets (default, sft, preference, llm). Extend with YAML — no Python code needed:
rules:
  - field: instruction
    check: min_length
    value: 10
    severity: error
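To make the semantics of a rule like this concrete, here is a small Python sketch of how a `min_length` check might be evaluated against records. It is an illustration under stated assumptions, not DataCheck's engine: the rule is written as a plain dict mirroring the YAML keys, `check_min_length` is a hypothetical helper, and treating a missing or non-string field as length 0 is an assumption.

```python
# The YAML rule above, expressed as the dict a YAML parser would produce.
rule = {
    "field": "instruction",
    "check": "min_length",
    "value": 10,
    "severity": "error",
}


def check_min_length(records, rule):
    """Return one issue per record whose field is shorter than rule['value'].

    Assumption: a missing or non-string field counts as length 0.
    """
    issues = []
    for index, record in enumerate(records):
        value = record.get(rule["field"])
        length = len(value) if isinstance(value, str) else 0
        if length < rule["value"]:
            issues.append(
                {
                    "index": index,
                    "field": rule["field"],
                    "check": rule["check"],
                    "severity": rule["severity"],
                    "actual_length": length,
                }
            )
    return issues
```

Calling `check_min_length(records, rule)` on a list of dicts returns the offending record indices along with the configured severity, which is the kind of information a quality report would aggregate.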
Statistical Anomaly Detection
Implemented in pure Python with zero external dependencies. Enabled automatically when the sample size is $\geq 10$:
- IQR Method: $\text{outlier}(x) \iff x < Q_1 - 1.5 \cdot \text{IQR} \;\lor\; x > Q_3 + 1.5 \cdot \text{IQR}$
- Z-score Method: $\text{outlier}(x) \iff |z(x)| > 3$
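The two criteria above can be sketched with only the Python standard library. This is an illustrative version rather than DataCheck's source: the function names are hypothetical, the quartiles come from `statistics.quantiles` with its default method, and $z(x) = (x - \mu)/\sigma$ is computed with the population standard deviation. Both choices are assumptions, and the library may compute quartiles or $\sigma$ differently.

```python
import statistics

MIN_SAMPLES = 10  # detection is skipped below this sample size


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    if len(values) < MIN_SAMPLES:
        return []
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    if len(values) < MIN_SAMPLES:
        return []
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std deviation
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in values if abs((x - mean) / std) > threshold]
```

One design note: with a population standard deviation, the largest possible $|z|$ in a sample of size $n$ is $\sqrt{n-1}$, so the Z-score rule cannot flag anything at exactly $n = 10$. The IQR rule has no such limit, which is one reason to run both methods.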
LLM-Assisted Quality Evaluation
Semantic-level quality checks beyond rule-based validation:
knowlyr-datacheck check data.json --ruleset llm
MCP Integration
11 MCP tools for seamless AI IDE integration — check, fix, diff, infer schema, and more, all from your editor.
Python SDK
from datacheck import DataChecker, QualityReport
checker = DataChecker()
result = checker.check_file("data.json")
report = QualityReport(result)
report.print_summary()
Ecosystem
DataCheck is part of the knowlyr data infrastructure:
| Layer | Project | Role |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset intelligence & trend analysis |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |