Conversation

@jqnatividad
Collaborator

resolves #3101

Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for inferring and enforcing JSON Schema format constraints for email, hostname, IPv4, and IPv6 address columns in the schema command. When the new --strict-formats flag is enabled, the schema command automatically detects these common data formats and adds appropriate format constraints to the generated JSON Schema.

Key changes:

  • Added format detection for email, hostname, IPv4, and IPv6 addresses with proper precedence (most specific first)
  • Format constraints are only applied when ALL unique values in a column match the detected format
  • Comprehensive test coverage with 6 new tests covering happy paths, mixed values, and precedence rules
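
The two rules above (check the most specific format first, and apply a format only when every unique value matches) can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the precedence order shown, the `infer_format` name, and the simplified `is_email` and `is_hostname` checks are all assumptions made for the example (the PR itself uses the hostname-validator crate and its own `infer_format_from_values` function, whose internals are not shown here).

```rust
use std::collections::HashSet;
use std::net::{Ipv4Addr, Ipv6Addr};

// IP checks delegate to the standard library parsers.
fn is_ipv4(s: &str) -> bool {
    s.parse::<Ipv4Addr>().is_ok()
}

fn is_ipv6(s: &str) -> bool {
    s.parse::<Ipv6Addr>().is_ok()
}

// Simplified, hypothetical hostname check: dot-separated labels of
// ASCII letters, digits, and hyphens, with no leading or trailing hyphen.
fn is_hostname(s: &str) -> bool {
    !s.is_empty()
        && s.len() <= 253
        && s.split('.').all(|label| {
            !label.is_empty()
                && label.len() <= 63
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
}

// Simplified, hypothetical email check: non-empty local part without
// whitespace, followed by '@' and a hostname-like domain.
fn is_email(s: &str) -> bool {
    match s.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && !local.chars().any(char::is_whitespace)
                && is_hostname(domain)
        }
        None => false,
    }
}

// Returns a JSON Schema format name only if ALL unique values match it.
// Checks run in an assumed precedence order, most specific first, so a
// column of dotted quads is reported as "ipv4" even though each value
// also passes the looser hostname check.
fn infer_format(values: &[&str]) -> Option<&'static str> {
    let unique: HashSet<&str> = values.iter().copied().collect();
    if unique.is_empty() {
        return None;
    }
    let checks: [(&'static str, fn(&str) -> bool); 4] = [
        ("ipv4", is_ipv4),
        ("ipv6", is_ipv6),
        ("email", is_email),
        ("hostname", is_hostname),
    ];
    for (name, check) in checks {
        if unique.iter().all(|&v| check(v)) {
            return Some(name);
        }
    }
    None
}
```

In this sketch, `infer_format(&["192.168.1.1", "10.0.0.1"])` yields `Some("ipv4")`, while a column mixing an email address with free text yields `None`, so no format constraint would be emitted for it.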

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/cmd/schema.rs | Core implementation: adds --strict-formats flag, implements format detection functions (is_email, is_hostname, is_ipv4, is_ipv6), and infer_format_from_values logic with correct precedence ordering |
| src/util.rs | Adds flag_strict_formats field to SchemaArgs struct and initializes it in infer_polars_schema |
| src/cmd/tojsonl.rs | Initializes flag_strict_formats to false in SchemaArgs construction |
| src/cmd/sample.rs | Initializes flag_strict_formats to false in SchemaArgs construction |
| src/cmd/pivotp.rs | Initializes flag_strict_formats to false in SchemaArgs construction (two occurrences) |
| src/cmd/joinp.rs | Initializes flag_strict_formats to false in SchemaArgs construction |
| src/cmd/frequency.rs | Initializes flag_strict_formats to false in SchemaArgs construction |
| src/cmd/diff.rs | Initializes flag_strict_formats to false in SchemaArgs construction |
| tests/test_schema.rs | Adds 6 comprehensive tests covering email, hostname, IPv4, IPv6 formats, mixed values behavior, and IPv4/hostname precedence |
| Cargo.toml | Adds hostname-validator 1.1 dependency for hostname validation |
| Cargo.lock | Updates lock file with hostname-validator 1.1.1 |

@jqnatividad jqnatividad merged commit d280fe0 into master Nov 26, 2025
23 checks passed
@jqnatividad jqnatividad deleted the 3101-schema-infer-addl-jsonschema-predefined-formats branch November 26, 2025 19:33

Development

Successfully merging this pull request may close these issues.

schema: auto-infer additional predefined JSON Schema formats

2 participants