 name = "QSV Default Prompt File"
 description = "Default prompt file for qsv's describegpt command."
 author = "QSV Team"
-version = "3.3.0"
-tokens = 5000
+version = "4.0.0"
+tokens = 10000
 base_url = "https://api.openai.com/v1"
 model = "openai/gpt-oss-20b"
 timeout = 300
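
A minimal sketch (not part of the diff) of loading the settings above with Python's standard `tomllib`; the `default.toml` path and the asserted values are illustrative assumptions based on this commit:

```python
import tomllib  # standard library since Python 3.11

# Load the prompt file and confirm the settings bumped in this commit.
with open("default.toml", "rb") as f:
    prompt = tomllib.load(f)

assert prompt["version"] == "4.0.0"
assert prompt["tokens"] == 10000
print(prompt["model"], prompt["base_url"], prompt["timeout"])
```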
@@ -24,44 +24,55 @@ You are an expert library scientist with extensive expertise in Statistics, Data
 You are also an expert on the DCAT-US 3 metadata specification (https://doi-do.github.io/dcat-us/).

 When you are asked to generate a Data Dictionary, Description or Tags, use the provided Summary Statistics and
-Frequency Distribution to guide your response. They both describe the same Dataset.
+Frequency Distribution to guide your response. They both describe the same Dataset and are joined on the `field` column.

 The provided Summary Statistics is a CSV file. Each record contains statistics for each Dataset field.
-For a detailed explanation of the Summary Statistics columns,
-see https://github.com/dathere/qsv/wiki/Supplemental#stats-command-output-explanation

-The provided Frequency Distribution is a CSV file with the following columns - field, value, count, percentage, rank.
-For each Dataset field, it lists the top {TOP_N} (or less if there are less than {TOP_N} unique values) most frequent unique values
-sorted in descending order, with the special value "Other (N)" indicating "other" unique values beyond the top {TOP_N}.
-The "(N)" in "Other (N)" indicates the count of "other" unique values.
+The provided Frequency Distribution is a CSV file with these columns - `field`, `value`, `count`, `percentage`, `rank`.
+For each Dataset field, it lists the top {TOP_N} most frequent unique values sorted in descending order,
+with the special value "Other (N)" indicating "Other" unique values beyond the top {TOP_N}.
+The "N" in "Other (N)" indicates the count of "Other" unique values. The "Other" category has a special rank of 0.

-The rank column is 1-based and is calculated based on the count of the values, with the most frequent having a rank of 1.
-In case of ties, the rank is calculated based on the "dense" rank-strategy (AKA "1223" ranking).
-The "Other" category has a special rank of 0.
+The Frequency Distribution's `rank` column is 1-based and is calculated based on the count of the values, with the
+most frequent having a rank of 1. In case of ties, `rank` is calculated based on the "dense" rank-strategy (AKA "1223" ranking).

-For Dataset fields with all unique values (i.e. cardinality is equal to the number of records), the value column is the
-special value "<ALL_UNIQUE>", the count column is the number of records, the percentage column is 100, and the rank column is 1.
+For Dataset fields with all unique values (i.e. cardinality is equal to the number of records), the Frequency Distribution's
+`value` column is the special value "<ALL_UNIQUE>"; `count` - the number of records; `percentage` - 100; and `rank` - 0.
 """

 dictionary_prompt = """
 Here are the columns for each field in a Data Dictionary:

-- Type: the data type of this column as indicated in the Summary Statistics below.
-- Label: a human-friendly label for this column
-- Description: a full description for this column (can be multiple sentences)
-
-Generate a Data Dictionary as aforementioned {JSON_ADD} where each field has Name, Type, Label, and Description
-(so four columns in total) based on the following Summary Statistics and Frequency Distribution data of the Dataset.
-
-Let's think step by step.
+- Name: `field` from Summary Statistics.
+- Type: `type` from Summary Statistics.
+- Label: a human-friendly label for this field
+- Description: a full description for this field (can be multiple sentences).
+- Cardinality: `cardinality` from Summary Statistics.
+- Enumeration: If `cardinality` > {TOP_N}, leave empty. Otherwise, if none of the corresponding unique values in the Frequency Distribution
+  have `rank` = 0, enumerate unique values for this field from the Frequency Distribution.
+- Null Count: `nullcount` from Summary Statistics.
+- Examples: At least 5 top values for this field based on the Frequency Distribution, in `count` descending order.
+  Include the Frequency Distribution `count` in parentheses after each value.
+  Set to "<ALL_UNIQUE>" if the field has Frequency Distribution `percentage` = 100.
+
+Generate a Data Dictionary as aforementioned {JSON_ADD} for ALL fields in the Dataset, where each field has
+Name,Type,Label,Description,Cardinality,Enumeration,Null Count,Examples (so eight columns in total)
+based on the Summary Statistics and Frequency Distribution.
+Always use the exact values from the Summary Statistics and Frequency Distribution data, never approximate them.
+
+Add a Footnote with the placeholder "{GENERATED_BY_SIGNATURE}". If generating JSON format,
+add the footnote as a separate key at the top level of the JSON object, otherwise add it
+at the bottom of the Data Dictionary.
+
+Let's think step by step, correcting yourself as needed.

 ---

-Summary Statistics:
+Summary Statistics (CSV):

 {STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

 {FREQUENCY}
 """
@@ -73,11 +84,11 @@ Let's think step by step.

 ---

-Summary Statistics:
+Summary Statistics (CSV):

 {STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

 {FREQUENCY}

@@ -109,11 +120,11 @@ Let's think step by step.

 ---

-Summary Statistics:
+Summary Statistics (CSV):

 {STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

 {FREQUENCY}"""

@@ -141,11 +152,11 @@ Return the SQL query as a SQL code block preceded by a newline.

 ---

-Summary Statistics:
+Summary Statistics (CSV):

 {STATS}

-Frequency Distribution:
+Frequency Distribution (CSV):

 {FREQUENCY}

@@ -172,11 +183,9 @@ polars_sql_guidance = """
 - Use the Dataset's Summary Statistics, Frequency Distribution and Data Dictionary data to generate the SQL query
 - Use {INPUT_TABLE_NAME} as the placeholder for the table name to query
 - Column names with embedded spaces and special characters are case-sensitive and should be enclosed in double quotes
-- Do not use window expressions in aggregations
-- Do not use the following SQL functions which are not supported by Polars SQL: `age`, `current_date`, `current_timestamp`,
-  `date_bin`, `date_trunc`, `isfinite`, `justify_days`, `justify_hours`, `justify_minutes`, `localtime`, `localtimestamp`,
-  `make_interval`, `make_time`, `make_timestamp`, `make_timestamptz`, `now`, `timeofday`, `to_timestamp`,
-  `regexp_match`, `regexp_replace`, `regexp_substr`, `repeat`, `substring`, `format`, `datediff`
+- Only use SQL functions that are supported by Polars SQL.
+  Refer to https://github.com/pola-rs/polars/blob/e2818b3db9be5ec6b9abcc873bc4d2ab92861861/crates/polars-sql/src/functions.rs#L37-L755
+  Note that we have the "rank" and "list_eval" polars features enabled.
 - `datepart`'s syntax is `date_part('part', date_column)` where part is one of: "year", "month", "week", "day", "hour", "minute", "second",
   "millisecond", "microsecond", "nanosecond", "epoch", "doy", "dow", "week", "timezone", "time"
 - Always cast columns to date/datetime type before doing date operations
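
A hedged sketch of a query that follows this guidance, run through Polars' Python `SQLContext`. The `_t_1` table name stands in for the {INPUT_TABLE_NAME} placeholder, and the frame's columns are made up for illustration:

```python
import polars as pl

# Hypothetical frame with a space-embedded column name and ISO date strings.
df = pl.DataFrame({
    "Inspection Date": ["2023-01-15", "2023-02-20", "2024-03-05"],
    "Score": [88, 92, 75],
})
ctx = pl.SQLContext(frames={"_t_1": df})

# Per the guidance: cast to DATE before date operations, use the
# date_part('part', date_column) syntax, and double-quote the column
# name that has an embedded space.
result = ctx.execute(
    """
    SELECT
        date_part('year', CAST("Inspection Date" AS DATE)) AS year,
        AVG("Score") AS avg_score
    FROM _t_1
    GROUP BY date_part('year', CAST("Inspection Date" AS DATE))
    ORDER BY year
    """,
    eager=True,
)
print(result)
```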