docs: extend roadmap with graph, migrate, parallel, and infer features #11

Merged

HelgeSverre merged 11 commits into main on Dec 26, 2025

Conversation
This commit significantly expands the project roadmap with four major new features based on competitive gap analysis.

Roadmap Updates (v1.11 - v2.1):
- v1.11.0: Graph command - FK dependency visualization and topological ordering
- v1.12.0: Query command - SQL-like row filtering (moved from v1.11.0)
- v1.13.0: Detect-PII - auto-suggest redaction config (moved from v1.12.0)
- v1.14.0: MSSQL support - fourth dialect (moved from v1.13.0)
- v1.15.0: Migration generation - NEW: generate ALTER statements from schema diffs
- v2.0.0: Parallel processing - NEW: multi-threaded performance (4-8x speedup)
- v2.1.0: Schema inference - NEW: reverse-engineer DDL from data-only dumps

New Feature Design Documents:

1. GRAPH_FEATURE.md (~8h effort)
   - Export FK dependency graphs (DOT, Mermaid, JSON)
   - Cycle detection using Tarjan's algorithm
   - Topological ordering for safe imports
   - Leverages existing schema graph infrastructure

2. MIGRATE_FEATURE.md (~40h effort)
   - Generate migration scripts from schema diffs
   - Breaking change detection and warnings
   - Rollback script generation
   - Multi-dialect support (MySQL, PostgreSQL, SQLite, MSSQL)
   - Extends existing diff command

3. PARALLEL_PROCESSING.md (~60h effort)
   - Multi-threaded processing across all commands
   - Target: 4x speedup on 8-core systems
   - Worker pool architecture with crossbeam
   - Memory-bounded parallel execution
   - Ordered output merging

4. SCHEMA_INFERENCE.md (~50h effort)
   - Infer CREATE TABLE from INSERT-only dumps
   - Support CSV and JSON input formats
   - Type inference with statistical analysis
   - Primary key and index detection
   - Foreign key inference (heuristic)

Key Strategic Decisions:
- Graph feature prioritized as v1.11.0 (user requested it first)
- Migration generation addresses a major competitive gap
- Parallel processing brings parity with mydumper
- Schema inference is a unique differentiator (no competitor does this well)

Updated dependency matrix and effort estimates for all features.

Revision: 3.0
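The design doc names Tarjan's algorithm for cycle detection; as a minimal sketch of the "topological ordering for safe imports" idea, Kahn's algorithm gives both the import order and cycle detection in one pass (a different technique than Tarjan's, chosen here for brevity; table names are illustrative, not from the codebase):

```rust
use std::collections::{BTreeMap, VecDeque};

/// Kahn's algorithm over a `table -> referenced parent tables` FK map.
/// Returns tables in an order safe for import (parents before children),
/// or None if the FK graph contains a cycle.
fn import_order(fks: &BTreeMap<&str, Vec<&str>>) -> Option<Vec<String>> {
    // in-degree = number of unresolved FK dependencies per table
    let mut indegree: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&table, parents) in fks {
        indegree.entry(table).or_insert(0);
        for &p in parents {
            indegree.entry(p).or_insert(0);
            *indegree.entry(table).or_insert(0) += 1;
            dependents.entry(p).or_default().push(table);
        }
    }
    // start with all tables that reference nothing (BTreeMap keeps this deterministic)
    let mut queue: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, &d)| d == 0)
        .map(|(&t, _)| t)
        .collect();
    let mut order = Vec::new();
    while let Some(t) = queue.pop_front() {
        order.push(t.to_string());
        if let Some(children) = dependents.get(t) {
            for &c in children {
                let d = indegree.get_mut(c).unwrap();
                *d -= 1;
                if *d == 0 {
                    queue.push_back(c);
                }
            }
        }
    }
    // if some tables were never emitted, the remaining edges form a cycle
    if order.len() == indegree.len() { Some(order) } else { None }
}

fn main() {
    let mut fks = BTreeMap::new();
    fks.insert("orders", vec!["users"]);
    fks.insert("order_items", vec!["orders", "products"]);
    fks.insert("users", vec![]);
    fks.insert("products", vec![]);
    println!("{:?}", import_order(&fks));
}
```

A `None` result is exactly the situation the graph command's cycle detection would surface to the user before a doomed import.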
…zation

Enhanced the graph feature design with a user-friendly diagram mode.

Simple Diagram Mode:
- New `diagram` command alias for `graph --simple`
- One-liner to visualize a database: `sql-splitter diagram dump.sql`
- Opens an interactive HTML diagram in the browser by default
- Auto-excludes system tables (migrations, schema_*, etc.)
- Supports direct rendering: .png, .svg, .pdf, .html

Features Added:
- Interactive HTML output with embedded Mermaid
  - No external dependencies
  - Hover to highlight, click to focus
  - Dark mode support, zoom & pan
- Auto-render to PNG/SVG via Graphviz
- Layout options (lr, tb, radial)
- Theme support (default, dark, minimal)
- --open flag to auto-open in browser
- --clean flag for automatic system table exclusion

Updated Components:
- Added diagram.rs CLI handler
- Added html.rs output writer
- Added render.rs for auto-rendering
- Added simplify.rs for table cleanup

Effort Update:
- Increased from 8h to 12h due to:
  - Interactive HTML output implementation
  - Simplified diagram command UX
  - Auto-rendering capabilities

This addresses the user's request for a simpler "make a database diagram" mode that doesn't require understanding advanced graph export formats.

Examples:

    sql-splitter diagram dump.sql                # Opens in browser
    sql-splitter diagram dump.sql -o schema.png  # Auto-render PNG
    sql-splitter diagram dump.sql --clean        # Exclude system tables
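The HTML output embeds Mermaid, so the core of the writer is just emitting `erDiagram` text from the FK list. A hedged sketch of that translation step (the tuple shape and `to_mermaid` name are illustrative, not the actual html.rs API):

```rust
/// Emit a minimal Mermaid erDiagram from (child_table, fk_column, parent_table)
/// tuples. `}o--||` encodes "many child rows reference exactly one parent row".
fn to_mermaid(fks: &[(&str, &str, &str)]) -> String {
    let mut out = String::from("erDiagram\n");
    for (child, col, parent) in fks {
        // the FK column name is used as the relationship label
        out.push_str(&format!("    {child} }}o--|| {parent} : \"{col}\"\n"));
    }
    out
}

fn main() {
    println!("{}", to_mermaid(&[("orders", "user_id", "users")]));
}
```

Dropping this text into an HTML page that loads the Mermaid runtime is what makes the dependency-free interactive output possible.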
Created comprehensive analysis of practical parallel implementation challenges, using the analyze command as a case study.

Key Findings:
- Analyze is I/O bound (600 MB/s disk), not CPU bound
- Single-file parallelization has limited benefit (1-2x on NVMe)
- Multi-file parallelization is simpler and more effective (8x)
- Compressed files are CPU-bound (4x speedup from parallel decompression)

Challenges Documented:
1. Sequential file reading - can't parallelize a single file handle
2. Statement boundary detection - semicolons inside strings
3. Memory explosion - chunking requires loading the entire file
4. Lock contention - naive shared state is slower than single-threaded
5. Compression complexity - most formats are sequential

Solutions Analyzed:
- File chunking (complex, limited benefit)
- Streaming with bounded queues (better memory, still I/O bound)
- Multi-file glob patterns (simple, highly effective)
- Compressed file optimization (good CPU-bound use case)

Recommendations:
- Skip single-file parallelism for the analyze command
- Implement multi-file glob parallelism instead
- Focus v2.0.0 parallel efforts on CPU-bound commands:
  * convert (type conversion)
  * redact (fake data generation)
  * validate (PK checking)

Includes code examples, benchmarks, and practical implementation strategies. This serves as a reference for parallel processing decision-making.
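Challenge 2 above is why naive byte-offset chunking breaks: a semicolon inside a string literal is not a statement boundary. A minimal sketch of quote-aware splitting (single quotes with `''` and `\'` escapes only; a real dump parser also needs double quotes, backticks, comments, and dollar-quoted bodies):

```rust
/// Split a SQL chunk on semicolons, ignoring semicolons inside
/// single-quoted string literals.
fn split_statements(sql: &str) -> Vec<String> {
    let mut stmts = Vec::new();
    let mut cur = String::new();
    let mut in_str = false;
    let mut chars = sql.chars().peekable();
    while let Some(c) = chars.next() {
        cur.push(c);
        match c {
            // a bare quote toggles string state; the '' escape toggles twice,
            // which lands back inside the string, so it needs no special case
            '\'' => in_str = !in_str,
            '\\' if in_str => {
                // consume the escaped char so \' does not toggle string state
                if let Some(n) = chars.next() {
                    cur.push(n);
                }
            }
            ';' if !in_str => {
                stmts.push(cur.trim().to_string());
                cur.clear();
            }
            _ => {}
        }
    }
    if !cur.trim().is_empty() {
        stmts.push(cur.trim().to_string());
    }
    stmts
}

fn main() {
    println!("{:?}", split_statements("INSERT INTO t VALUES ('a;b');SELECT 1;"));
}
```

The catch for parallelism is that this scan is inherently sequential: a worker dropped at an arbitrary byte offset cannot know whether it is inside a string without context from the start of the file.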
…ation

Created detailed analysis of parallelization strategies, challenges, and expected performance for all 9 major commands.

Command Analysis Summary:

HIGH PRIORITY (v2.0.0 MVP):
✅ Convert: 6x speedup, low complexity (CPU-bound type conversion)
✅ Redact: 5x speedup, low-med complexity (CPU-bound fake generation)
✅ Validate: 6x speedup, medium complexity (CPU-bound PK/FK checking)
✅ Merge: 7x speedup, low complexity (embarrassingly parallel)

MEDIUM PRIORITY (v2.1.0):
⚠️ Split: 3x speedup, medium complexity (I/O dependent, file handles)
✅ Diff: 5x speedup, medium complexity (CPU-bound hashing)

LOW PRIORITY (v2.2+):
❌ Sample: 1.5x speedup, high complexity (sequential FK dependencies)
❌ Shard: 1.5x speedup, high complexity (sequential FK dependencies)
❌ Analyze: 1.2x speedup, limited benefit (I/O bound)

Key Findings:
1. CPU-bound commands (convert, redact, validate) see 4-6x speedup
2. I/O-bound commands (analyze, sample, shard) see <2x speedup
3. Sequential FK dependency chains (sample, shard) are hard to parallelize
4. Multi-file operations (merge) are trivially parallel (7-8x)

Challenges Documented per Command:
- Convert: order preservation, COPY block handling
- Redact: deterministic RNG seeding, shuffle strategy
- Validate: cross-table FK dependencies, shared PK sets
- Split: file handle exhaustion, disk I/O contention
- Diff: memory for large tables (same as current)
- Sample/Shard: sequential FK chain traversal
- Merge: none (trivial)

Reusable Components Identified:
- Statement chunking utility
- Ordered output writer
- Thread pool abstraction
- Progress reporting for parallel operations

This analysis confirms the v2.0.0 effort estimate (~60h) and provides a detailed implementation roadmap for the parallel processing feature.
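The "ordered output writer" component above (and the order-preservation challenge for convert) can be sketched with std threads and a reorder buffer: workers finish chunks out of order, and the writer holds completed chunks until all earlier ones have been emitted. This is a hedged illustration only; the design doc calls for crossbeam's bounded channels, and `to_uppercase` stands in for real per-chunk work:

```rust
use std::collections::BTreeMap;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Fan chunks out to a worker pool, then merge results back in input order.
fn process_ordered(chunks: Vec<String>, workers: usize) -> Vec<String> {
    let (task_tx, task_rx) = mpsc::channel::<(usize, String)>();
    let task_rx = Arc::new(Mutex::new(task_rx));
    let (res_tx, res_rx) = mpsc::channel::<(usize, String)>();
    let total = chunks.len();
    // enqueue every chunk with its sequence number
    for (i, c) in chunks.into_iter().enumerate() {
        task_tx.send((i, c)).unwrap();
    }
    drop(task_tx); // workers see a closed channel once tasks run out
    let mut handles = Vec::new();
    for _ in 0..workers {
        let rx = Arc::clone(&task_rx);
        let tx = res_tx.clone();
        handles.push(thread::spawn(move || loop {
            let msg = rx.lock().unwrap().recv();
            match msg {
                // stand-in for the real CPU-bound transformation
                Ok((i, c)) => tx.send((i, c.to_uppercase())).unwrap(),
                Err(_) => break,
            }
        }));
    }
    drop(res_tx);
    // reorder buffer: emit chunk i only once all j < i have been emitted
    let mut pending = BTreeMap::new();
    let mut next = 0;
    let mut out = Vec::with_capacity(total);
    for (i, r) in res_rx {
        pending.insert(i, r);
        while let Some(r) = pending.remove(&next) {
            out.push(r);
            next += 1;
        }
    }
    for h in handles {
        h.join().unwrap();
    }
    out
}

fn main() {
    let chunks = vec!["a".to_string(), "b".to_string(), "c".to_string()];
    println!("{:?}", process_ordered(chunks, 4));
}
```

The memory bound mentioned in the doc comes from capping how many results `pending` may hold, which an unbounded std channel cannot enforce on its own — one reason crossbeam's bounded channels are the better fit.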
…opportunities

Expanded competitive analysis across 6 categories with 30+ competitors:
- Schema management (Liquibase, Flyway, Atlas, sqitch)
- Data quality (Great Expectations, dbt, datafold, soda-sql)
- Database optimization (pt-query-digest, pgBadger, EverSQL)
- Test data (Mockaroo, Snaplet, Synth, tonic.ai)
- ETL/Pipeline (dlt, Airbyte, Meltano)
- Documentation (SchemaSpy, tbls, Azimutt, DataHub)

15 New Feature Opportunities Identified:

TIER 1 - Highest Impact:
1. Schema Drift Detection (16h) - Catch unauthorized changes
2. Smart Index Recommendations (24h) - Performance optimization
3. Data Quality Profiling (32h) - Automated data validation
4. Data Lineage Tracking (40h) - Compliance & impact analysis
5. Change Data Capture (28h) - Event stream integration

TIER 2 - High Value, Lower Effort:
6. Schema Size Optimization (12h) - Storage cost reduction
7. Security Audit (20h) - Find vulnerabilities
8. Cost Estimation (8h) - Cloud pricing calculator
9. Compliance Check (24h) - GDPR/HIPAA validation

TIER 3 - Innovative:
10. AI-Powered Schema Suggestions (40h) - LLM-based optimization
11. Time-Travel Queries (32h) - Historical data access
12. Schema Evolution Tracking (28h) - Git-like versioning
13. Performance Simulation (48h) - Load testing
14. Natural Language Query (24h) - SQL generation from English
15. Schema Testing Framework (16h) - Automated quality checks

Recommended Roadmap Extension:
- v1.16: Recommendations & Drift (52h) - Quick wins, reuse infra
- v1.17: Quality & Profiling (40h) - Unique value proposition
- v1.18: Security & Compliance (44h) - Enterprise features
- v1.19: CDC & Events (28h) - Modern data stack integration

Market Positioning:
- "Complete Dump Toolkit" - beyond split/convert/anonymize
- Enterprise appeal: compliance, security, cost optimization
- Developer experience: recommendations, testing, quality
- AI differentiation: natural language, smart suggestions

Competitive Gaps:
✅ Dump-based analytics (no competitors)
✅ Offline compliance checking (unique)
✅ Multi-dialect optimization (first mover)
✅ AI-powered database tools (emerging space)

Strategic recommendation: position as a Swiss Army knife for SQL dumps, with an open-source core and potential premium features (AI, compliance).
…tem tools
Identified 15 strategic integrations that could dramatically extend sql-splitter
capabilities without rebuilding everything from scratch.
Integration Philosophy: "Be the glue, not the engine"
- Connect tools together, provide unified interface
- Leverage best-in-class tools rather than reimplementing
- Hide complexity, add value beyond basic wrapper
TIER 1 - High Impact Integrations:
1. DuckDB Integration (16h) ⭐⭐⭐⭐⭐
- Use as query engine for dumps (100x faster than naive filtering)
- Export to Parquet for data lakes
- Full SQL analytics without database setup
2. Atlas Integration (20h) ⭐⭐⭐⭐⭐
- Export schema as HCL (schema-as-code)
- Enable GitOps workflows
- Two-way bridge: Atlas HCL ↔ SQL dumps
3. Great Expectations (16h) ⭐⭐⭐⭐
- Auto-generate data quality expectations from schema
- Bootstrap testing framework
- Production data monitoring
4. Liquibase/Flyway (24h) ⭐⭐⭐⭐
- Generate migration changesets from diffs
- Integrate with existing CI/CD pipelines
- Support XML/YAML/SQL migration formats
5. dbt Integration (28h) ⭐⭐⭐⭐⭐
- Generate dbt project from dumps
- Auto-create tests from constraints
- Bootstrap data transformation workflows
TIER 2 - Utility Integrations:
6. pgBadger/pt-query-digest (16h)
- Combine schema + slow query analysis
- Recommend indexes based on actual queries
7. tbls/Schema Browser (32h)
- Interactive schema documentation
- Self-contained HTML with D3.js visualization
8. Faker Enhancement (8h)
- Custom provider support via YAML
- Extended locale support
9. Airbyte Connector (24h)
- SQL dumps as ELT source
- Access to 300+ destinations
10. Slow Query Analysis (16h)
- Parse pg_slow.log / mysql-slow.log
- Correlate with schema for recommendations
TIER 3 - Cloud/Platform:
11. Supabase/PlanetScale/Neon (20h/platform)
- Deploy dumps to cloud databases instantly
- API-based provisioning
12. GitHub Actions (12h)
- CI/CD automation
- Schema validation in PRs
13. Datadog/New Relic (16h)
- Schema monitoring and alerting
- Metrics export
14. Terraform Provider (32h)
- Infrastructure as code for schemas
- Drift detection and remediation
15. Graphviz/Mermaid Live Preview (4h)
- Watch mode with auto-reload
- Real-time diagram updates
Recommended Integration Roadmap:
v1.16 — Query & Analytics (28h)
- DuckDB integration (16h)
- Parquet export (12h)
v1.17 — Schema Management (44h)
- Atlas HCL export (20h)
- Liquibase changelog (24h)
v1.18 — Data Quality (16h)
- Great Expectations integration (16h)
v1.19 — Documentation (52h)
- Schema browser (32h)
- tbls format export (20h)
v2.2 — Platform Integrations (64h)
- dbt project generation (28h)
- GitHub Action (12h)
- Airbyte connector (24h)
High-Impact Quick Wins (<20h):
✅ DuckDB query engine (16h) - Instant SQL analytics
✅ Parquet export (12h) - Bridge to data lakes
✅ GitHub Action (12h) - CI/CD automation
✅ Great Expectations (16h) - Data quality testing
Integration Patterns:
- Wrapper Pattern: Shell out to external tool (quick, simple)
- Library Integration: Embed via FFI (more control, no external deps)
- API Integration: Cloud service calls (platform access)
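The Wrapper Pattern above can fail soft when the external tool is absent, which keeps the integrations optional. A minimal sketch using only std (`dot` from Graphviz is the illustrative target; the `run_tool` helper is hypothetical, not an existing sql-splitter API):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Shell out to an external tool, piping `stdin_data` in and capturing stdout.
/// Returns None if the binary is missing or exits non-zero, so callers can
/// degrade gracefully instead of hard-failing.
fn run_tool(bin: &str, args: &[&str], stdin_data: &str) -> Option<String> {
    let mut child = Command::new(bin)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn()
        .ok()?; // spawn fails when the tool is not installed
    // write the input, then drop the handle to close the pipe
    child.stdin.take()?.write_all(stdin_data.as_bytes()).ok()?;
    let out = child.wait_with_output().ok()?;
    out.status
        .success()
        .then(|| String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() {
    // e.g. render a DOT graph if Graphviz happens to be installed
    match run_tool("dot", &["-Tsvg"], "digraph { a -> b }") {
        Some(svg) => println!("rendered {} bytes of SVG", svg.len()),
        None => println!("graphviz not available, falling back to Mermaid text"),
    }
}
```

Keeping the subprocess boundary in one helper like this also makes the "low maintenance burden" criterion below testable: each integration is a command line plus an output parser, nothing more.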
Success Criteria:
- Users choose sql-splitter BECAUSE of the integration
- Hides complexity of integrated tool
- Provides value beyond basic wrapper
- Low maintenance burden
This positions sql-splitter as the universal adapter/glue for the
SQL ecosystem rather than trying to replace specialized tools.
…-generation and validation strategies
… migration testing
…and staging models from SQL dumps
… dbt strategies into unified vision