feat: moarstats add "xsd_subtype" Gregorian date data types inferencing with --xsd-gdate-scan having fast (default) and comprehensive modes
#3259
- fast mode: uses percentile samples to infer the Gregorian date type
- comprehensive mode: scans all records for a column to infer the Gregorian date type
Pull request overview
This pull request adds Gregorian date type inference capability to the moarstats command with two scanning modes: fast (uses percentile values) and comprehensive (uses min/max values). The feature adds an xsd_subtype field to identify XSD Gregorian date types (gYear, gYearMonth, gMonthDay, gDay, gMonth) with confidence indicators ("??" for fast mode, "?" for comprehensive mode).
Key Changes
- Introduces --xsd-gdate-scan flag with "fast" (default) and "comprehensive" modes
- Implements pattern-based detection for five Gregorian date types
- Adds automatic fallback from fast to comprehensive mode when percentiles are unavailable
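
The mode selection and fallback listed above could be sketched roughly as follows. This is a minimal illustration, not the code in src/cmd/moarstats.rs: the `GdateScanMode` enum, `parse`, and `effective_mode` are hypothetical names, and the case-insensitive matching is an assumption.

```rust
/// Hypothetical scan mode for Gregorian date inference (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GdateScanMode {
    Fast,
    Comprehensive,
}

impl GdateScanMode {
    /// Parse the value given to `--xsd-gdate-scan`.
    /// Assumption: matching is case-insensitive; any other value is an error.
    pub fn parse(value: &str) -> Result<Self, String> {
        match value.to_ascii_lowercase().as_str() {
            "fast" => Ok(Self::Fast),
            "comprehensive" => Ok(Self::Comprehensive),
            other => Err(format!("invalid --xsd-gdate-scan mode: {other}")),
        }
    }
}

/// Pick the mode actually used for one column. Fast mode needs percentile
/// values, so it falls back to comprehensive when none are available.
pub fn effective_mode(
    requested: GdateScanMode,
    percentiles: Option<&[String]>,
) -> GdateScanMode {
    match (requested, percentiles) {
        (GdateScanMode::Fast, Some(p)) if !p.is_empty() => GdateScanMode::Fast,
        _ => GdateScanMode::Comprehensive,
    }
}
```

With this shape, a column that has no percentile values is handled in comprehensive mode even when the user asked for fast mode.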
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/cmd/moarstats.rs | Implements Gregorian date type detection with detect_gregorian_date_type() function, adds parse_all_percentile_string_values() helper, integrates detection into infer_xsd_type(), and adds --xsd-gdate-scan command-line option |
| tests/test_moarstats.rs | Adds comprehensive test suite covering fast mode, comprehensive mode, Integer gYear detection, invalid mode handling, default behavior, and fallback scenarios; updates existing test expectations to reflect gYear detection for precinct field |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
but not doing leap year, which is overkill for this func
per GH Copilot review
implements #2858
As the gdate data type inference is heuristics-based, in fast mode the inferred type carries a double question mark suffix ("??"), since the inference is based only on the available percentile values.
In comprehensive mode, it carries a single question mark suffix ("?") to indicate higher confidence, since the inference is based on a comprehensive scan of all values in the column.
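
To make the suffix convention concrete, here is a small sketch. The function name `with_confidence_suffix` and its boolean parameter are hypothetical and only illustrate the rule described above.

```rust
/// Append the confidence marker to an inferred subtype.
/// `comprehensive` is a hypothetical flag: true when all values of the
/// column were scanned, false when only percentile values were used.
pub fn with_confidence_suffix(subtype: &str, comprehensive: bool) -> String {
    // "?" = higher confidence (comprehensive scan), "??" = fast mode.
    let suffix = if comprehensive { "?" } else { "??" };
    format!("{subtype}{suffix}")
}
```

For example, a column inferred as gYear would be reported as `gYear??` in fast mode and `gYear?` in comprehensive mode.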
@kulnor, reading the spec (https://www.w3.org/TR/xmlschema-2/#date), I went with my interpretation of the prescribed "lexical representation" of gMonthDay, gDay and gMonth, including the prescribed number of hyphen prefixes, but it seems odd...
Can you check?
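
For reference, as I read XML Schema Part 2 (Second Edition), the lexical forms are CCYY for gYear, CCYY-MM for gYearMonth, --MM-DD for gMonthDay, ---DD for gDay and --MM for gMonth, each with an optional timezone suffix. A minimal sketch of a classifier built on those forms follows. It is not the PR's detect_gregorian_date_type(); the name `classify_gregorian` is hypothetical, and it assumes four-digit non-negative years, no timezone suffix, and no leap-year or per-month day-count check.

```rust
/// Classify a string by XSD Gregorian lexical form.
/// Assumptions: no timezone suffix, four-digit non-negative years,
/// days only checked against 1..=31 (no leap-year or per-month check).
pub fn classify_gregorian(s: &str) -> Option<&'static str> {
    // Parse a fixed-width field made only of ASCII digits.
    fn num(field: &str, width: usize) -> Option<u32> {
        if field.len() == width && field.bytes().all(|b| b.is_ascii_digit()) {
            field.parse().ok()
        } else {
            None
        }
    }
    let month_ok = |m: u32| (1..=12).contains(&m);
    let day_ok = |d: u32| (1..=31).contains(&d);

    // gDay: ---DD (three leading hyphens)
    if let Some(rest) = s.strip_prefix("---") {
        return num(rest, 2).filter(|d| day_ok(*d)).map(|_| "gDay");
    }
    // gMonthDay: --MM-DD, gMonth: --MM (two leading hyphens)
    if let Some(rest) = s.strip_prefix("--") {
        return match rest.split_once('-') {
            Some((m, d)) => {
                let (m, d) = (num(m, 2)?, num(d, 2)?);
                (month_ok(m) && day_ok(d)).then_some("gMonthDay")
            }
            None => num(rest, 2).filter(|m| month_ok(*m)).map(|_| "gMonth"),
        };
    }
    // gYearMonth: CCYY-MM, gYear: CCYY (no leading hyphens)
    match s.split_once('-') {
        Some((y, m)) => {
            num(y, 4)?;
            num(m, 2).filter(|m| month_ok(*m)).map(|_| "gYearMonth")
        }
        None => num(s, 4).map(|_| "gYear"),
    }
}
```

Under these assumptions, `--05-17` classifies as gMonthDay, `---17` as gDay and `--05` as gMonth, while a full date such as `1999-05-17` is rejected.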