feat: describegpt - major refactor
#3143
jqnatividad
commented
Dec 1, 2025
- made data dictionary generation "neuro-symbolic"
- more robust Polars SQL generation
- more configurable parameters for dictionary and tags generation
- more verbose JSON output
as LLMs sometimes start comments in the middle of the line (which is actually OK for readability)
- also add SQL comment prefix to Attribution when generating SQL
…dictionary and tags
Pull request overview
This PR refactors the describegpt command to use a "neuro-symbolic" approach for data dictionary generation. The core improvement separates code-based dictionary generation (statistics, enumerations, examples) from LLM-generated content (labels and descriptions), making the system more robust and configurable.
Key Changes:
- Introduced code-based dictionary generation that parses stats/frequency CSVs and generates structured entries
- Added configurable parameters: `--num-examples` and `--truncate-str` for controlling dictionary output
- Implemented more verbose JSON output with metadata fields (enum_threshold, num_examples, truncate_str)
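To make the split between code-generated and LLM-generated content concrete, here is a minimal Rust sketch. It is not the code from `src/cmd/describegpt.rs`: the `DictEntry` struct, the helper names, and the rule that a limit of 0 disables truncation are all assumptions made for illustration. Only the ideas of a fixed number of examples and string truncation come from the `--num-examples` and `--truncate-str` parameters named above.

```rust
/// Hypothetical dictionary entry (not the PR's actual struct).
/// `name`, `data_type` and `examples` are computed from the data;
/// `label` and `description` are left empty for the LLM to fill in.
#[derive(Debug, Clone, PartialEq)]
struct DictEntry {
    name: String,
    data_type: String,
    examples: Vec<String>,
    label: Option<String>,
    description: Option<String>,
}

/// Truncate `value` to at most `max_chars` characters, appending an
/// ellipsis when something was cut. Treating 0 as "no truncation" is
/// an assumption of this sketch.
fn truncate_str(value: &str, max_chars: usize) -> String {
    if max_chars == 0 || value.chars().count() <= max_chars {
        return value.to_string();
    }
    let mut out: String = value.chars().take(max_chars).collect();
    out.push('…');
    out
}

/// Symbolic step: build an entry purely from code. `values_by_freq`
/// stands in for values read from a frequency table, most common first.
fn build_entry(
    name: &str,
    data_type: &str,
    values_by_freq: &[&str],
    num_examples: usize,
    max_chars: usize,
) -> DictEntry {
    DictEntry {
        name: name.to_string(),
        data_type: data_type.to_string(),
        examples: values_by_freq
            .iter()
            .take(num_examples)
            .map(|v| truncate_str(v, max_chars))
            .collect(),
        label: None,
        description: None,
    }
}

/// Neural step: merge the two text fields returned by the LLM into an
/// entry whose factual fields were already computed.
fn apply_llm_text(entry: &mut DictEntry, label: &str, description: &str) {
    entry.label = Some(label.to_string());
    entry.description = Some(description.to_string());
}
```

In this arrangement the LLM never touches the computed fields, so a bad completion can at worst produce a poor label or description.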
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/cmd/sqlp.rs | Enhanced SQL comment regex to handle whitespace-prefixed comments |
| src/cmd/describegpt.rs | Major refactor implementing neuro-symbolic dictionary generation with new parsing functions and data structures |
| resources/describegpt_defaults.toml | Simplified dictionary prompt to only request labels/descriptions, updated Polars SQL guidance |
| docs/nyc311-describegpt.md | Updated example output showing new dictionary format with additional metadata columns |
| docs/nyc311-describegpt.json | Updated JSON output structure with new field format and attribution metadata |
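The `src/cmd/sqlp.rs` row and the attribution change can be illustrated with a short sketch. The PR itself adjusts a regex; the version below deliberately swaps in plain `std` string methods so it stays dependency-free, and all three function names are hypothetical. It assumes only full-line `--` comments need handling and leaves trailing comments untouched.

```rust
/// Returns true when `line` is a SQL line comment, even if the `--`
/// is preceded by spaces or tabs (as LLM-generated SQL sometimes is).
fn is_sql_comment(line: &str) -> bool {
    line.trim_start().starts_with("--")
}

/// Drop comment-only lines from a SQL string and keep everything else.
/// Trailing comments after a statement are intentionally left alone.
fn strip_comment_lines(sql: &str) -> String {
    sql.lines()
        .filter(|line| !is_sql_comment(line))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Prefix every line of an attribution text with `-- ` so it can be
/// placed in a generated SQL file without breaking the query.
fn attribution_as_sql_comment(attribution: &str) -> String {
    attribution
        .lines()
        .map(|line| format!("-- {line}"))
        .collect::<Vec<_>>()
        .join("\n")
}
```

A check anchored at the very start of the line would miss the indented comment in a query such as `SELECT a,\n    -- note\n    b FROM t`, which is the case the whitespace-tolerant match is meant to cover.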
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>