
Commit 4db0d18

feat: describegpt default-prompt-file v4.0: expanded Data Dictionary
- added more columns to data dictionary:
  * Cardinality
  * Enumeration
  * Null Count
  * Examples
- added Generated by signature
1 parent 1433bf1 commit 4db0d18

File tree

1 file changed: +44 -35 lines changed


resources/describegpt_defaults.toml

Lines changed: 44 additions & 35 deletions
@@ -1,8 +1,8 @@
name = "QSV Default Prompt File"
description = "Default prompt file for qsv's describegpt command."
author = "QSV Team"
-version = "3.3.0"
-tokens = 5000
+version = "4.0.0"
+tokens = 10000
base_url = "https://api.openai.com/v1"
model = "openai/gpt-oss-20b"
timeout = 300
@@ -24,44 +24,55 @@ You are an expert library scientist with extensive expertise in Statistics, Data
You are also an expert on the DCAT-US 3 metadata specification (https://doi-do.github.io/dcat-us/).

When you are asked to generate a Data Dictionary, Description or Tags, use the provided Summary Statistics and
-Frequency Distribution to guide your response. They both describe the same Dataset.
+Frequency Distribution to guide your response. They both describe the same Dataset and are joined on the `field` column.

The provided Summary Statistics is a CSV file. Each record contains statistics for each Dataset field.
-For a detailed explanation of the Summary Statistics columns,
-see https://github.com/dathere/qsv/wiki/Supplemental#stats-command-output-explanation

-The provided Frequency Distribution is a CSV file with the following columns - field, value, count, percentage, rank.
-For each Dataset field, it lists the top {TOP_N} (or less if there are less than {TOP_N} unique values) most frequent unique values
-sorted in descending order, with the special value "Other (N)" indicating "other" unique values beyond the top {TOP_N}.
-The "(N)" in "Other (N)" indicates the count of "other" unique values.
+The provided Frequency Distribution is a CSV file with these columns - `field`, `value`, `count`, `percentage`, `rank`.
+For each Dataset field, it lists the top {TOP_N} most frequent unique values sorted in descending order,
+with the special value "Other (N)" indicating "Other" unique values beyond the top {TOP_N}.
+The "N" in "Other (N)" indicates the count of "Other" unique values. The "Other" category has a special rank of 0.

-The rank column is 1-based and is calculated based on the count of the values, with the most frequent having a rank of 1.
-In case of ties, the rank is calculated based on the "dense" rank-strategy (AKA "1223" ranking).
-The "Other" category has a special rank of 0.
+The Frequency Distribution's `rank` column is 1-based and is calculated based on the count of the values, with the
+most frequent having a rank of 1. In case of ties, `rank` is calculated based on the "dense" rank-strategy (AKA "1223" ranking).

-For Dataset fields with all unique values (i.e. cardinality is equal to the number of records), the value column is the
-special value "<ALL_UNIQUE>", the count column is the number of records, the percentage column is 100, and the rank column is 1.
+For Dataset fields with all unique values (i.e. cardinality is equal to the number of records), the Frequency Distribution's
+`value` column is the special value "<ALL_UNIQUE>"; `count` - the number of records; `percentage` - 100; and `rank` - 0.
"""

dictionary_prompt = """
Here are the columns for each field in a Data Dictionary:

-- Type: the data type of this column as indicated in the Summary Statistics below.
-- Label: a human-friendly label for this column
-- Description: a full description for this column (can be multiple sentences)
-
-Generate a Data Dictionary as aforementioned {JSON_ADD} where each field has Name, Type, Label, and Description
-(so four columns in total) based on the following Summary Statistics and Frequency Distribution data of the Dataset.
-
-Let's think step by step.
+- Name: `field` from Summary Statistics.
+- Type: `type` from Summary Statistics.
+- Label: a human-friendly label for this field
+- Description: a full description for this field (can be multiple sentences).
+- Cardinality: `cardinality` from Summary Statistics.
+- Enumeration: If `cardinality` > {TOP_N}, leave empty. Otherwise, if none of the corresponding unique values in the Frequency Distribution
+  have `rank` = 0, enumerate unique values for this field from the Frequency Distribution.
+- Null Count: `nullcount` from Summary Statistics.
+- Examples: At least 5 top values for this field based on the Frequency Distribution, in `count` descending order.
+  Include the Frequency Distribution `count` in parentheses after each value.
+  Set to "<ALL_UNIQUE>" if the field has Frequency Distribution `percentage` = 100.
+
+Generate a Data Dictionary as aforementioned {JSON_ADD} for ALL fields in the Dataset, where each field has
+Name,Type,Label,Description,Cardinality,Enumeration,Null Count,Examples (so eight columns in total)
+based on the Summary Statistics and Frequency Distribution.
+Always use the exact values from the Summary Statistics and Frequency Distribution data, never approximate them.
+
+Add a Footnote with the placeholder "{GENERATED_BY_SIGNATURE}". If generating JSON format,
+add the footnote as a separate key at the top level of the JSON object, otherwise add it
+at the bottom of the Data Dictionary.
+
+Let's think step by step, correcting yourself as needed.

---

-Summary Statistics:
+Summary Statistics (CSV):

{STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

{FREQUENCY}
"""
@@ -73,11 +84,11 @@ Let's think step by step.

---

-Summary Statistics:
+Summary Statistics (CSV):

{STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

{FREQUENCY}

@@ -109,11 +120,11 @@ Let's think step by step.

---

-Summary Statistics:
+Summary Statistics (CSV):

{STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

{FREQUENCY}"""

@@ -141,11 +152,11 @@ Return the SQL query as a SQL code block preceded by a newline.

---

-Summary Statistics:
+Summary Statistics (CSV):

{STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

{FREQUENCY}

@@ -172,11 +183,9 @@ polars_sql_guidance = """
- Use the Dataset's Summary Statistics, Frequency Distribution and Data Dictionary data to generate the SQL query
- Use {INPUT_TABLE_NAME} as the placeholder for the table name to query
- Column names with embedded spaces and special characters are case-sensitive and should be enclosed in double quotes
-- Do not use window expressions in aggregations
-- Do not use the following SQL functions which are not supported by Polars SQL: `age`, `current_date`, `current_timestamp`,
-  `date_bin`, `date_trunc`, `isfinite`, `justify_days`, `justify_hours`, `justify_minutes`, `localtime`, `localtimestamp`,
-  `make_interval`, `make_time`, `make_timestamp`, `make_timestamptz`, `now`, `timeofday`, `to_timestamp`,
-  `regexp_match`, `regexp_replace`, `regexp_substr`, `repeat`, `substring`, `format`, `datediff`
+- Only use SQL functions that are supported by Polars SQL.
+  Refer to https://github.com/pola-rs/polars/blob/e2818b3db9be5ec6b9abcc873bc4d2ab92861861/crates/polars-sql/src/functions.rs#L37-L755
+  Note that we have the "rank" and "list_eval" polars features enabled.
- `datepart`'s syntax is `date_part('part', date_column)` where part is one of: "year", "month", "week", "day", "hour", "minute", "second",
  "millisecond", "microsecond", "nanosecond", "epoch", "doy", "dow", "week", "timezone", "time"
- Always cast columns to date/datetime type before doing date operations
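To see how the last two guidance rules play out end to end, here is a hedged Rust sketch using the polars crate's SQL context. Assumptions: a recent polars release with the `lazy`, `sql`, and `dtype-date` features; the `_t_1` table name and the sample column names are illustrative stand-ins for whatever describegpt substitutes for {INPUT_TABLE_NAME}.

```rust
// Illustrative sketch only: runs a Polars SQL query that follows the guidance
// above -- cast to DATE before date operations, and use
// date_part('part', date_column) for extraction.
// Assumed Cargo dependency (version is an assumption):
// polars = { version = "0.4x", features = ["lazy", "sql", "dtype-date"] }
use polars::prelude::*;
use polars::sql::SQLContext;

fn main() -> PolarsResult<()> {
    // A stand-in table; the real query would target {INPUT_TABLE_NAME}.
    let df = df! {
        "event"      => ["a", "b", "c"],
        "event_date" => ["2024-01-15", "2024-02-20", "2024-02-28"],
    }?;

    let mut ctx = SQLContext::new();
    ctx.register("_t_1", df.lazy());

    // Cast the string column to DATE first, then extract the month.
    let out = ctx
        .execute(
            "SELECT event, date_part('month', CAST(event_date AS DATE)) AS month \
             FROM _t_1",
        )?
        .collect()?;
    println!("{out}");
    Ok(())
}
```

A query built from an unsupported function fails when Polars SQL plans it, which is presumably why the guidance now points at the authoritative function list in the polars source rather than maintaining a hand-written deny-list.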
