feat: support schema file as command line arg#74
Conversation
… as a command line arg, rather than relying on schema inference from the CSV file.
| output: PathBuf, | ||
|
|
||
| /// Arrow schema to be applied to data in CSV (same format as written out by -p / -n). {n} | ||
| /// {n} |
There was a problem hiding this comment.
Not sure what you mean by "this", if its the {n} it was to try and get clap to generate newlines, I've removed from the text anyway
| /// {n} | ||
| /// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs {n} | ||
| /// {n} | ||
| /// Make sure to have the same number of columns in your schema file as are in your CSV! |
There was a problem hiding this comment.
This is a very long comment. Did you write it or copy it from arrow? If the former, please revise heavily and shorten.
There was a problem hiding this comment.
I wrote it. I've had another go, chopped out a lot of the text.
| /// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs {n} | ||
| /// {n} | ||
| /// Make sure to have the same number of columns in your schema file as are in your CSV! | ||
| #[clap(short = 's', long, parse(from_os_str), value_hint = ValueHint::AnyPath)] |
There was a problem hiding this comment.
Instead of a path, I wonder whether it's better to expect a string instead of a file. What do you think?
There was a problem hiding this comment.
Providing the schema as a string would be feasible, the down side is it could be unwieldy.
The process I employed to get to the defined schema I've been playing with was:
- Do a dry run with schema infer, cat output to local file
- Manually edit local schema file
- Run with schema read from local file
To receive the defined schema as a string that flow becomes:
- Do a dry run with schema infer, cat output to local file
- Manually edit local schema file
- Read file into env var
- Run with schema read from env var
I can't see anyone being keen to type the whole schema by hand at the command line... !
The defined schema for the file I've been playing with (only 5 columns) is:
{
"fields": [
{
"name": "plant_id",
"nullable": false,
"type": {
"name": "int",
"bitWidth": 32,
"isSigned": false
},
"children": []
},
{
"name": "analysis_year",
"nullable": false,
"type": {
"name": "int",
"bitWidth": 32,
"isSigned": false
},
"children": []
},
{
"name": "risk_attribute_value",
"nullable": true,
"type": {
"name": "floatingpoint",
"precision": "SINGLE"
},
"children": []
},
{
"name": "physical_risk_scenario_id",
"nullable": false,
"type": {
"name": "int",
"bitWidth": 8,
"isSigned": false
},
"children": []
},
{
"name": "physical_risk_type_id",
"nullable": false,
"type": {
"name": "int",
"bitWidth": 8,
"isSigned": false
},
"children": []
}
],
"metadata": {}
}
| _ => { | ||
| // infer schema from file contents | ||
| // NOTE: if max_read_records is zero then all cols are assumed to be "string" | ||
| match arrow::csv::reader::infer_file_schema( |
There was a problem hiding this comment.
I don't see any changes to the code where we used to infer the schema. Instead of copying the logic, please refactor the code so we reuse the logic.
There was a problem hiding this comment.
The logic has changed.
In the first match statement of fn main gets a schema by either:
- reading it in from the defined schema file, which is the path taken if a schema file is provided on the command line
- inferring it from the source csv file, using "arrow::csv::reader::infer_file_schema"
NOTE: the original code uses "builder.infer_schema(opts.max_read_records)", whereas the new code is using an associated function on the reader struct.
At this point we don't yet have a configured reader (or even a reader builder).
We do have a schema which we can give to the reader builder in a few lines time, which leads into your comment further down about re-using the existing logic...
| let schema_ref = Arc::new(schema); | ||
| let builder = ReaderBuilder::new() | ||
| .has_header(opts.header.unwrap_or(true)) | ||
| .with_delimiter(opts.delimiter as u8) | ||
| .with_schema(schema_ref); | ||
|
|
||
| let reader = builder.build(input)?; |
There was a problem hiding this comment.
can we reuse the existing logic?
There was a problem hiding this comment.
Given that the logic around getting a schema has changed such that we've already got a schema without needing a Reader or ReaderBuilder, this seems to be the cleanest (and closest to the original logic) approach.
If we've got a defined schema file, we've read it and built the schema without touching the source csv, the input File hasn't been touched and is ready to read.
If we don't have a defined schema file and the max records is a value that lets us infer a schema, we've inferred a schema using the infer method on the CsvReader struct (rather than using the ReaderBuilder), the input File has been read and wound back to start (exactly as in the original code).
If we don't have a defined schema file and the max records is zero, we have a schema that's all strings, the input File is ready to read (exactly as in the original code).
If I've missed something, please try and point it out to me again :)
…go defined dependencies by taking newest version of each dependency from across both change sets
Address conflicts of cargo defined dependencies between pull request update of parquet and arrow to v12 and dependabot update to lates 9.x
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
…hema support. Merge branch 'main' of https://github.com/andyredhead/csv2parquet into main
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
| output: PathBuf, | ||
|
|
||
| /// Arrow schema to be applied to data in CSV (same format as written out by -p / -n), prevents schema inference from running. | ||
| /// The structure of schema json is shown in the source of: DataType fn from(json: &Value -> Result<DataType> in: |
There was a problem hiding this comment.
Please fix grammar and make it easier to understand.
There was a problem hiding this comment.
Revised, grammar checked in MS Word and a couple of online services, no issues reported.
| /// Arrow schema to be applied to data in CSV (same format as written out by -p / -n), prevents schema inference from running. | ||
| /// The structure of schema json is shown in the source of: DataType fn from(json: &Value -> Result<DataType> in: | ||
| /// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs | ||
| /// Make sure your schema has the same number of columns as your CSV! |
There was a problem hiding this comment.
I don't think this is good here. It's an obvious statement and not usually what I expect to read in cli docs.
…omoritz-main2 Dependabot updates in primary repo
Dependabot updates in primary dependencies.
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
|
Thank you for the implementation and the patience with my reviews. Could you send PRs for the related repos now? Once they are in, I will make a release. |
|
No worries, I appreciate you tolerating my bumbling newbie efforts. |
As discussed in issue #73, here is a first cut at enabling a schema file to be supplied as a command line arg.
It relies entirely upon the json schema handling provided by arrow, which means there can be unexpected outcomes if mistakes are made while defining the schema...
The main problem will be that it seems the columns in the json schema are applied in numeric order to the columns in the CSV. If the json schema omits a column, the column definitions will get misaligned, meaning that columns may end up with the wrong heading or there are (poorly reported) incompatibilities between the data in the CSV file and the type expected for the column (i.e. trying to push a float into an int).
These problems can largely be avoided by using the schema inference process to generate an initial schema file, which can be edited prior to running a full convert.
As there is a pretty straight forward way of getting most cases to work with the functionality as implemented here, this seems like a good place to start. (it's good enough for my needs at the moment anyway :) )
Note: I wrote this against Arrow/Parquet 9.1.0 and got it working, then raised the version of those dependencies to 12.0.0 (latest release) and everything continued to just compile and work.