
docs: extend roadmap with graph, migrate, parallel, and infer features#11

Merged
HelgeSverre merged 11 commits into main from claude/review-roadmap-objWd
Dec 26, 2025

Conversation

@HelgeSverre
Owner

This commit significantly expands the project roadmap with four major new
features based on competitive gap analysis:

Roadmap Updates (v1.11 - v2.1):

  • v1.11.0: Graph command - FK dependency visualization and topological ordering
  • v1.12.0: Query command - SQL-like row filtering (moved from v1.11.0)
  • v1.13.0: Detect-PII - Auto-suggest redaction config (moved from v1.12.0)
  • v1.14.0: MSSQL support - Fourth dialect (moved from v1.13.0)
  • v1.15.0: Migration Generation - NEW: Generate ALTER statements from schema diffs
  • v2.0.0: Parallel Processing - NEW: Multi-threaded performance (4-8x speedup)
  • v2.1.0: Schema Inference - NEW: Reverse-engineer DDL from data-only dumps

New Feature Design Documents:

  1. GRAPH_FEATURE.md (~8h effort)

    • Export FK dependency graphs (DOT, Mermaid, JSON)
    • Cycle detection using Tarjan's algorithm
    • Topological ordering for safe imports
    • Leverages existing schema graph infrastructure
  2. MIGRATE_FEATURE.md (~40h effort)

    • Generate migration scripts from schema diffs
    • Breaking change detection and warnings
    • Rollback script generation
    • Multi-dialect support (MySQL, PostgreSQL, SQLite, MSSQL)
    • Extends existing diff command
  3. PARALLEL_PROCESSING.md (~60h effort)

    • Multi-threaded processing across all commands
    • Target: 4x speedup on 8-core systems
    • Worker pool architecture with crossbeam
    • Memory-bounded parallel execution
    • Ordered output merging
  4. SCHEMA_INFERENCE.md (~50h effort)

    • Infer CREATE TABLE from INSERT-only dumps
    • Support CSV and JSON input formats
    • Type inference with statistical analysis
    • Primary key and index detection
    • Foreign key inference (heuristic)
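
The topological ordering and cycle detection in the graph feature can be sketched with Kahn's algorithm (the design doc names Tarjan's for cycle detection; Kahn's produces the import order directly and reports cycles as a by-product). A stdlib-only sketch — `import_order` and the `(child, parent)` edge shape are illustrative, not the actual API:

```rust
use std::collections::{BTreeMap, VecDeque};

/// Order tables so referenced (parent) tables come first, making imports safe.
/// Edges are (child, parent) pairs: the child table has an FK to the parent.
/// Returns Err with the tables stuck in (or behind) a cycle if one exists.
pub fn import_order(
    tables: &[&str],
    fks: &[(&str, &str)],
) -> Result<Vec<String>, Vec<String>> {
    // in_degree counts unresolved parents per table (assumes every FK
    // endpoint appears in `tables`; a real implementation would validate).
    let mut in_degree: BTreeMap<&str, usize> = tables.iter().map(|t| (*t, 0usize)).collect();
    let mut children: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(child, parent) in fks {
        *in_degree.get_mut(child).unwrap() += 1;
        children.entry(parent).or_default().push(child);
    }
    // Start from tables with no parents, peel layer by layer.
    let mut queue: VecDeque<&str> = in_degree
        .iter()
        .filter(|&(_, d)| *d == 0)
        .map(|(t, _)| *t)
        .collect();
    let mut order = Vec::new();
    while let Some(t) = queue.pop_front() {
        order.push(t.to_string());
        for c in children.get(t).cloned().unwrap_or_default() {
            let d = in_degree.get_mut(c).unwrap();
            *d -= 1;
            if *d == 0 {
                queue.push_back(c);
            }
        }
    }
    if order.len() == tables.len() {
        Ok(order)
    } else {
        // Anything still with unresolved parents participates in a cycle.
        Err(in_degree
            .iter()
            .filter(|&(_, d)| *d > 0)
            .map(|(t, _)| t.to_string())
            .collect())
    }
}
```

Tables left with unresolved parents when the queue drains are exactly the ones a cycle warning would name.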

Key Strategic Decisions:

  • Graph feature prioritized as v1.11.0 (user requested first)
  • Migration generation addresses major competitive gap
  • Parallel processing brings parity with mydumper
  • Schema inference is unique differentiator (no competitor does this well)

Updated dependency matrix and effort estimates for all features.

Revision: 3.0

…zation

Enhanced the graph feature design with a user-friendly diagram mode:

Simple Diagram Mode:
- New `diagram` command alias for `graph --simple`
- One-liner to visualize database: `sql-splitter diagram dump.sql`
- Opens interactive HTML diagram in browser by default
- Auto-excludes system tables (migrations, schema_*, etc.)
- Supports direct rendering: .png, .svg, .pdf, .html

Features Added:
- Interactive HTML output with embedded Mermaid
  - No external dependencies
  - Hover to highlight, click to focus
  - Dark mode support, zoom & pan
- Auto-render to PNG/SVG via Graphviz
- Layout options (lr, tb, radial)
- Theme support (default, dark, minimal)
- --open flag to auto-open in browser
- --clean flag for automatic system table exclusion

Updated Components:
- Added diagram.rs CLI handler
- Added html.rs output writer
- Added render.rs for auto-rendering
- Added simplify.rs for table cleanup

Effort Update:
- Increased from 8h to 12h due to:
  - Interactive HTML output implementation
  - Simplified diagram command UX
  - Auto-rendering capabilities

This addresses the user's request for a simpler "make a database diagram"
mode that doesn't require understanding advanced graph export formats.

Examples:
  sql-splitter diagram dump.sql              # Opens in browser
  sql-splitter diagram dump.sql -o schema.png  # Auto-render PNG
  sql-splitter diagram dump.sql --clean       # Exclude system tables
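
Auto-rendering (`render.rs`) can follow a thin wrapper pattern: pick the Graphviz `-T` format from the output extension and shell out to `dot`. A sketch assuming Graphviz is installed; `dot_args` and `render` are hypothetical names, while `-T` and `-o` are real `dot` flags:

```rust
use std::path::Path;
use std::process::Command;

/// Map the output extension (.png/.svg/.pdf) to a `dot -T` format flag and
/// build the full argument list. Returns None for unsupported extensions,
/// where the caller would fall back to emitting raw DOT/HTML instead.
pub fn dot_args(dot_file: &str, out_file: &str) -> Option<Vec<String>> {
    let fmt = match Path::new(out_file).extension()?.to_str()? {
        "png" => "png",
        "svg" => "svg",
        "pdf" => "pdf",
        _ => return None,
    };
    Some(vec![
        format!("-T{fmt}"),
        "-o".into(),
        out_file.into(),
        dot_file.into(),
    ])
}

/// Shell out to `dot`; returns false if rendering fails or Graphviz is
/// missing (a real implementation would surface a helpful install hint).
pub fn render(dot_file: &str, out_file: &str) -> bool {
    match dot_args(dot_file, out_file) {
        Some(args) => Command::new("dot")
            .args(&args)
            .status()
            .map(|s| s.success())
            .unwrap_or(false),
        None => false,
    }
}
```

Keeping the argument construction separate from the process spawn makes the format mapping unit-testable without Graphviz present.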
Created comprehensive analysis of practical parallel implementation challenges
using the analyze command as a case study.

Key Findings:
- Analyze is I/O bound (600 MB/s disk), not CPU bound
- Single-file parallelization has limited benefit (1-2x on NVMe)
- Multi-file parallelization is simpler and more effective (8x)
- Compressed files are CPU-bound (4x speedup from parallel decompression)

Challenges Documented:
1. Sequential file reading - a single file handle can't be read in parallel
2. Statement boundary detection - semicolons in strings
3. Memory explosion - chunking requires loading entire file
4. Lock contention - naive shared state slower than single-threaded
5. Compression complexity - most formats are sequential

Solutions Analyzed:
- File chunking (complex, limited benefit)
- Streaming with bounded queues (better memory, still I/O bound)
- Multi-file glob patterns (simple, highly effective)
- Compressed file optimization (good CPU-bound use case)

Recommendations:
- Skip single-file parallel for analyze command
- Implement multi-file glob parallelism instead
- Focus v2.0.0 parallel efforts on CPU-bound commands:
  * convert (type conversion)
  * redact (fake data generation)
  * validate (PK checking)

Includes code examples, benchmarks, and practical implementation strategies.
This serves as reference for parallel processing decision-making.
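
Challenge 2 above (statement boundaries vs. semicolons in strings) can be illustrated with a quote-aware splitter. A simplified sketch handling only single-quoted strings and backslash escapes; real dumps also need comments, backticks/double quotes, and PostgreSQL dollar-quoting treated the same way:

```rust
/// Split a SQL buffer into statements on semicolons, ignoring semicolons
/// that appear inside single-quoted string literals.
pub fn split_statements(sql: &str) -> Vec<String> {
    let mut stmts = Vec::new();
    let mut cur = String::new();
    let mut in_string = false; // inside a '...' literal
    let mut escaped = false;   // previous char was a backslash inside a string
    for ch in sql.chars() {
        cur.push(ch);
        if escaped {
            escaped = false;
            continue;
        }
        match ch {
            '\\' if in_string => escaped = true,
            '\'' => in_string = !in_string,
            ';' if !in_string => {
                // Statement boundary: flush the accumulated text.
                let s = cur.trim().to_string();
                if !s.is_empty() {
                    stmts.push(s);
                }
                cur.clear();
            }
            _ => {}
        }
    }
    let tail = cur.trim();
    if !tail.is_empty() {
        stmts.push(tail.to_string());
    }
    stmts
}
```

Note that this is exactly the state a chunked parallel reader cannot reconstruct locally: a worker starting mid-file doesn't know whether its first byte is inside a string literal.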
…ation

Created detailed analysis of parallelization strategies, challenges, and
expected performance for all 9 major commands.

Command Analysis Summary:

HIGH PRIORITY (v2.0.0 MVP):
✅ Convert:   6x speedup, low complexity (CPU-bound type conversion)
✅ Redact:    5x speedup, low-med complexity (CPU-bound fake generation)
✅ Validate:  6x speedup, medium complexity (CPU-bound PK/FK checking)
✅ Merge:     7x speedup, low complexity (embarrassingly parallel)

MEDIUM PRIORITY (v2.1.0):
⚠️ Split:    3x speedup, medium complexity (I/O dependent, file handles)
✅ Diff:     5x speedup, medium complexity (CPU-bound hashing)

LOW PRIORITY (v2.2+):
❌ Sample:   1.5x speedup, high complexity (sequential FK dependencies)
❌ Shard:    1.5x speedup, high complexity (sequential FK dependencies)
❌ Analyze:  1.2x speedup, limited benefit (I/O bound)

Key Findings:

1. CPU-bound commands (convert, redact, validate) see 4-6x speedup
2. I/O-bound commands (analyze, sample, shard) see <2x speedup
3. Sequential FK dependency chains (sample, shard) hard to parallelize
4. Multi-file operations (merge) trivially parallel (7-8x)

Challenges Documented per Command:
- Convert: Order preservation, COPY block handling
- Redact: Deterministic RNG seeding, shuffle strategy
- Validate: Cross-table FK dependencies, shared PK sets
- Split: File handle exhaustion, disk I/O contention
- Diff: Memory for large tables (same as current)
- Sample/Shard: Sequential FK chain traversal
- Merge: None (trivial)

Reusable Components Identified:
- Statement chunking utility
- Ordered output writer
- Thread pool abstraction
- Progress reporting for parallel operations

This analysis confirms v2.0.0 effort estimate (~60h) and provides
detailed implementation roadmap for parallel processing feature.
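
The "ordered output writer" component above can be sketched with std threads and channels (the design doc proposes crossbeam; this stdlib version only shows the reordering idea, and `process_ordered` is a hypothetical name). Workers finish in any order; each result carries its input index so the writer can emit sequentially:

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Apply `f` to each chunk on its own thread, then merge results back into
/// input order. A sketch: a real worker pool would bound thread count and
/// queue depth to keep memory in check.
pub fn process_ordered<F>(chunks: Vec<String>, f: F) -> Vec<String>
where
    F: Fn(&str) -> String + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel::<(usize, String)>();
    let mut handles = Vec::new();
    for (i, chunk) in chunks.into_iter().enumerate() {
        let tx = tx.clone();
        let f = f.clone();
        handles.push(thread::spawn(move || {
            // Results arrive at the receiver in completion order, tagged
            // with the original index.
            tx.send((i, f(&chunk))).unwrap();
        }));
    }
    drop(tx); // close the channel so the receive loop below terminates

    // Buffer out-of-order results; flush whenever the next index is ready.
    let mut pending = BTreeMap::new();
    let mut next = 0usize;
    let mut out = Vec::new();
    for (i, res) in rx {
        pending.insert(i, res);
        while let Some(r) = pending.remove(&next) {
            out.push(r);
            next += 1;
        }
    }
    for h in handles {
        h.join().unwrap();
    }
    out
}
```

The `BTreeMap` buffer is what makes memory-bounded execution matter: if one slow chunk stalls, completed later chunks accumulate in `pending`, so a production pool would cap how far ahead workers may run.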
…opportunities

Expanded competitive analysis across 6 categories with 30+ competitors:
- Schema management (Liquibase, Flyway, Atlas, sqitch)
- Data quality (Great Expectations, dbt, datafold, soda-sql)
- Database optimization (pt-query-digest, pgBadger, EverSQL)
- Test data (Mockaroo, Snaplet, Synth, tonic.ai)
- ETL/Pipeline (dlt, Airbyte, Meltano)
- Documentation (SchemaSpy, tbls, Azimutt, DataHub)

15 New Feature Opportunities Identified:

TIER 1 - Highest Impact:
1. Schema Drift Detection (16h) - Catch unauthorized changes
2. Smart Index Recommendations (24h) - Performance optimization
3. Data Quality Profiling (32h) - Automated data validation
4. Data Lineage Tracking (40h) - Compliance & impact analysis
5. Change Data Capture (28h) - Event stream integration

TIER 2 - High Value, Lower Effort:
6. Schema Size Optimization (12h) - Storage cost reduction
7. Security Audit (20h) - Find vulnerabilities
8. Cost Estimation (8h) - Cloud pricing calculator
9. Compliance Check (24h) - GDPR/HIPAA validation

TIER 3 - Innovative:
10. AI-Powered Schema Suggestions (40h) - LLM-based optimization
11. Time-Travel Queries (32h) - Historical data access
12. Schema Evolution Tracking (28h) - Git-like versioning
13. Performance Simulation (48h) - Load testing
14. Natural Language Query (24h) - SQL generation from English
15. Schema Testing Framework (16h) - Automated quality checks

Recommended Roadmap Extension:
- v1.16: Recommendations & Drift (52h) - Quick wins, reuse infra
- v1.17: Quality & Profiling (40h) - Unique value proposition
- v1.18: Security & Compliance (44h) - Enterprise features
- v1.19: CDC & Events (28h) - Modern data stack integration

Market Positioning:
- "Complete Dump Toolkit" - beyond split/convert/anonymize
- Enterprise appeal: compliance, security, cost optimization
- Developer experience: recommendations, testing, quality
- AI differentiation: natural language, smart suggestions

Competitive Gaps:
✅ Dump-based analytics (no competitors)
✅ Offline compliance checking (unique)
✅ Multi-dialect optimization (first mover)
✅ AI-powered database tools (emerging space)

Strategic recommendation: Position as Swiss Army knife for SQL dumps,
with open-source core and potential premium features (AI, compliance).
…tem tools

Identified 15 strategic integrations that could dramatically extend sql-splitter
capabilities without rebuilding everything from scratch.

Integration Philosophy: "Be the glue, not the engine"
- Connect tools together, provide unified interface
- Leverage best-in-class tools rather than reimplementing
- Hide complexity, add value beyond basic wrapper

TIER 1 - High Impact Integrations:

1. DuckDB Integration (16h) ⭐⭐⭐⭐⭐
   - Use as query engine for dumps (100x faster than naive filtering)
   - Export to Parquet for data lakes
   - Full SQL analytics without database setup

2. Atlas Integration (20h) ⭐⭐⭐⭐⭐
   - Export schema as HCL (schema-as-code)
   - Enable GitOps workflows
   - Two-way bridge: Atlas HCL ↔ SQL dumps

3. Great Expectations (16h) ⭐⭐⭐⭐
   - Auto-generate data quality expectations from schema
   - Bootstrap testing framework
   - Production data monitoring

4. Liquibase/Flyway (24h) ⭐⭐⭐⭐
   - Generate migration changesets from diffs
   - Integrate with existing CI/CD pipelines
   - Support XML/YAML/SQL migration formats

5. dbt Integration (28h) ⭐⭐⭐⭐⭐
   - Generate dbt project from dumps
   - Auto-create tests from constraints
   - Bootstrap data transformation workflows

TIER 2 - Utility Integrations:

6. pgBadger/pt-query-digest (16h)
   - Combine schema + slow query analysis
   - Recommend indexes based on actual queries

7. tbls/Schema Browser (32h)
   - Interactive schema documentation
   - Self-contained HTML with D3.js visualization

8. Faker Enhancement (8h)
   - Custom provider support via YAML
   - Extended locale support

9. Airbyte Connector (24h)
   - SQL dumps as ELT source
   - Access to 300+ destinations

10. Slow Query Analysis (16h)
    - Parse pg_slow.log / mysql-slow.log
    - Correlate with schema for recommendations

TIER 3 - Cloud/Platform:

11. Supabase/PlanetScale/Neon (20h/platform)
    - Deploy dumps to cloud databases instantly
    - API-based provisioning

12. GitHub Actions (12h)
    - CI/CD automation
    - Schema validation in PRs

13. Datadog/New Relic (16h)
    - Schema monitoring and alerting
    - Metrics export

14. Terraform Provider (32h)
    - Infrastructure as code for schemas
    - Drift detection and remediation

15. Graphviz/Mermaid Live Preview (4h)
    - Watch mode with auto-reload
    - Real-time diagram updates

Recommended Integration Roadmap:

v1.16 — Query & Analytics (28h)
  - DuckDB integration (16h)
  - Parquet export (12h)

v1.17 — Schema Management (44h)
  - Atlas HCL export (20h)
  - Liquibase changelog (24h)

v1.18 — Data Quality (16h)
  - Great Expectations integration (16h)

v1.19 — Documentation (52h)
  - Schema browser (32h)
  - tbls format export (20h)

v2.2 — Platform Integrations (64h)
  - dbt project generation (28h)
  - GitHub Action (12h)
  - Airbyte connector (24h)

High-Impact Quick Wins (<20h):
✅ DuckDB query engine (16h) - Instant SQL analytics
✅ Parquet export (12h) - Bridge to data lakes
✅ GitHub Action (12h) - CI/CD automation
✅ Great Expectations (16h) - Data quality testing

Integration Patterns:
- Wrapper Pattern: Shell out to external tool (quick, simple)
- Library Integration: Embed via FFI (more control, no external deps)
- API Integration: Cloud service calls (platform access)

Success Criteria:
- Users choose sql-splitter BECAUSE of integration
- Hides complexity of integrated tool
- Provides value beyond basic wrapper
- Low maintenance burden

This positions sql-splitter as the universal adapter/glue for the
SQL ecosystem rather than trying to replace specialized tools.
@HelgeSverre HelgeSverre merged commit 4902135 into main Dec 26, 2025
4 of 5 checks passed
@HelgeSverre HelgeSverre deleted the claude/review-roadmap-objWd branch December 26, 2025 08:58
