feat: support schema file as command line arg by andyredhead · Pull Request #74 · domoritz/csv2parquet

andyredhead · 2022-05-01T13:42:42Z

As discussed in issue #73, here is a first cut at enabling a schema file to be supplied as a command line arg.

It relies entirely upon the json schema handling provided by arrow, which means there can be unexpected outcomes if mistakes are made while defining the schema...

The main problem is that the fields in the JSON schema appear to be applied to the CSV columns by position. If the JSON schema omits a column, the column definitions become misaligned: columns may end up with the wrong heading, or there may be (poorly reported) incompatibilities between the data in the CSV file and the type expected for the column (e.g. trying to push a float into an int).
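To make the misalignment concrete, here is a minimal std-only sketch. It is not the code in this PR and does not use arrow; it simply assumes that schema fields are paired with CSV columns by position, as described above. The helper `check_column_alignment` and the sample column lists are hypothetical.

```rust
/// Hypothetical helper (not part of csv2parquet or arrow): pairs schema field
/// names with CSV header names by position and reports the first problem found.
fn check_column_alignment(schema_fields: &[&str], csv_header: &[&str]) -> Result<(), String> {
    // A differing count means at least one column definition is missing or extra.
    if schema_fields.len() != csv_header.len() {
        return Err(format!(
            "schema has {} fields but CSV has {} columns",
            schema_fields.len(),
            csv_header.len()
        ));
    }
    // Same count: compare names position by position.
    for (i, (field, column)) in schema_fields.iter().zip(csv_header.iter()).enumerate() {
        if field != column {
            return Err(format!(
                "column {}: schema says '{}' but CSV header says '{}'",
                i, field, column
            ));
        }
    }
    Ok(())
}

fn main() {
    let header = ["plant_id", "analysis_year", "risk_attribute_value"];
    // This schema omits "analysis_year", so every later field shifts left by one.
    let schema = ["plant_id", "risk_attribute_value"];
    println!("{:?}", check_column_alignment(&schema, &header));
}
```

A check along these lines, run before the conversion, would turn a silent shift of column definitions into an explicit error.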

These problems can largely be avoided by using the schema inference process to generate an initial schema file, which can be edited prior to running a full convert.

As there is a pretty straightforward way of getting most cases to work with the functionality as implemented here, this seems like a good place to start. (It's good enough for my needs at the moment anyway :) )

Note: I wrote this against Arrow/Parquet 9.1.0 and got it working, then raised the version of those dependencies to 12.0.0 (latest release) and everything continued to just compile and work.


domoritz · 2022-05-03T22:20:10Z

    output: PathBuf,

+    /// Arrow schema to be applied to data in CSV (same format as written out by -p / -n). {n}
+    /// {n}


why do we need this?

Not sure what you mean by "this". If it's the {n}, it was to try and get clap to generate newlines. I've removed it from the text anyway.

Yeah, I meant the {n}

domoritz · 2022-05-03T22:21:00Z

+    /// {n}
+    /// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs {n}
+    /// {n}
+    /// Make sure to have the same number of columns in your schema file as are in your CSV!


This is a very long comment. Did you write it or copy it from arrow? If the former, please revise heavily and shorten.

I wrote it. I've had another go, chopped out a lot of the text.

domoritz · 2022-05-03T22:21:47Z

+    /// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs {n}
+    /// {n}
+    /// Make sure to have the same number of columns in your schema file as are in your CSV!
+    #[clap(short = 's', long, parse(from_os_str), value_hint = ValueHint::AnyPath)]


Instead of a path, I wonder whether it's better to expect a string instead of a file. What do you think?

Providing the schema as a string would be feasible; the downside is that it could be unwieldy.

The process I employed to get to the defined schema I've been playing with was:

1. Do a dry run with schema infer, cat output to local file
2. Manually edit local schema file
3. Run with schema read from local file

To receive the defined schema as a string that flow becomes:

1. Do a dry run with schema infer, cat output to local file
2. Manually edit local schema file
3. Read file into env var
4. Run with schema read from env var

I can't see anyone being keen to type the whole schema by hand at the command line... !
The defined schema for the file I've been playing with (only 5 columns) is:

    {
      "fields": [
        { "name": "plant_id", "nullable": false, "type": { "name": "int", "bitWidth": 32, "isSigned": false }, "children": [] },
        { "name": "analysis_year", "nullable": false, "type": { "name": "int", "bitWidth": 32, "isSigned": false }, "children": [] },
        { "name": "risk_attribute_value", "nullable": true, "type": { "name": "floatingpoint", "precision": "SINGLE" }, "children": [] },
        { "name": "physical_risk_scenario_id", "nullable": false, "type": { "name": "int", "bitWidth": 8, "isSigned": false }, "children": [] },
        { "name": "physical_risk_type_id", "nullable": false, "type": { "name": "int", "bitWidth": 8, "isSigned": false }, "children": [] }
      ],
      "metadata": {}
    }
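For comparison, the two flows discussed above (schema from a file versus schema from an environment variable) could both be served by one small loader. This is only a sketch of the alternative being discussed, not what the PR implements. The function `load_schema_text`, the variable name `CSV2PARQUET_SCHEMA`, and the temp file name are all hypothetical.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical loader: returns the schema JSON text from a file when a path
/// is given, otherwise from the named environment variable.
fn load_schema_text(schema_path: Option<&Path>, env_var: &str) -> io::Result<String> {
    match schema_path {
        Some(path) => fs::read_to_string(path),
        None => env::var(env_var)
            .map_err(|e| io::Error::new(io::ErrorKind::NotFound, e.to_string())),
    }
}

fn main() -> io::Result<()> {
    // File flow: write a tiny schema to a temp file, then read it back.
    let path = env::temp_dir().join("example_schema.json");
    fs::write(&path, r#"{ "fields": [], "metadata": {} }"#)?;
    println!("from file: {}", load_schema_text(Some(&path), "CSV2PARQUET_SCHEMA")?);

    // Env var flow: succeeds only if the (hypothetical) variable is set.
    match load_schema_text(None, "CSV2PARQUET_SCHEMA") {
        Ok(text) => println!("from env: {}", text),
        Err(err) => println!("no schema in env: {}", err),
    }
    Ok(())
}
```

The file flow needs one fewer step for the user, which is the argument made above for keeping the flag as a path.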

domoritz · 2022-05-03T22:31:41Z

+        _ => {
+            // infer schema from file contents
+            // NOTE: if max_read_records is zero then all cols are assumed to be "string"
+            match arrow::csv::reader::infer_file_schema(


I don't see any changes to the code where we used to infer the schema. Instead of copying the logic, please refactor the code so we reuse the logic.

The logic has changed.

The first match statement in fn main now gets a schema by either:

- reading it from the defined schema file, which is the path taken if a schema file is provided on the command line
- inferring it from the source CSV file, using "arrow::csv::reader::infer_file_schema"

NOTE: the original code uses "builder.infer_schema(opts.max_read_records)", whereas the new code is using an associated function on the reader struct.

At this point we don't yet have a configured reader (or even a reader builder).

We do have a schema, which we can give to the reader builder a few lines later. That leads into your comment further down about reusing the existing logic...

domoritz · 2022-05-03T22:32:14Z

+    let schema_ref = Arc::new(schema);
+    let builder = ReaderBuilder::new()
+        .has_header(opts.header.unwrap_or(true))
+        .with_delimiter(opts.delimiter as u8)
+        .with_schema(schema_ref);
+
+    let reader = builder.build(input)?;


can we reuse the existing logic?

Given that the logic around getting a schema has changed such that we've already got a schema without needing a Reader or ReaderBuilder, this seems to be the cleanest (and closest to the original logic) approach.

If we've got a defined schema file, we've read it and built the schema without touching the source CSV, so the input File is untouched and ready to read.

If we don't have a defined schema file and the max records value lets us infer a schema, we've inferred one using the infer method on the CsvReader struct (rather than using the ReaderBuilder). The input File has been read and wound back to the start (exactly as in the original code).

If we don't have a defined schema file and the max records value is zero, we have a schema that's all strings, and the input File is ready to read (exactly as in the original code).
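The three cases above can be sketched as a single match. This is a simplified std-only illustration, not the code in src/main.rs: the enum `SchemaSource`, the function `choose_schema_source`, and the assumption that the max records option is an `Option<usize>` are all hypothetical.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for the three ways a schema can be obtained.
#[derive(Debug, PartialEq)]
enum SchemaSource {
    /// Schema read from a file; the CSV input has not been touched.
    DefinedFile(PathBuf),
    /// Schema inferred from the CSV; the input must be rewound afterwards.
    Inferred { max_records: Option<usize> },
    /// Max records is zero: every column is assumed to be a string.
    AllStrings,
}

/// Picks the schema source. A schema file always wins over inference.
fn choose_schema_source(
    schema_file: Option<PathBuf>,
    max_read_records: Option<usize>,
) -> SchemaSource {
    match (schema_file, max_read_records) {
        (Some(path), _) => SchemaSource::DefinedFile(path),
        (None, Some(0)) => SchemaSource::AllStrings,
        (None, max_records) => SchemaSource::Inferred { max_records },
    }
}

fn main() {
    println!("{:?}", choose_schema_source(Some(PathBuf::from("schema.json")), Some(100)));
    println!("{:?}", choose_schema_source(None, Some(0)));
    println!("{:?}", choose_schema_source(None, None));
}
```

In every branch the result is a schema that exists before any reader is built, which is why the reader builder can then be configured in one place with `with_schema`.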

If I've missed something, please try and point it out to me again :)


domoritz · 2022-05-08T13:24:23Z

    output: PathBuf,

+    /// Arrow schema to be applied to data in CSV (same format as written out by -p / -n), prevents schema inference from running.
+    /// The structure of schema json is shown in the source of: DataType fn from(json: &Value -> Result<DataType> in:


Please fix grammar and make it easier to understand.

Revised, grammar checked in MS Word and a couple of online services, no issues reported.

domoritz · 2022-05-08T13:25:31Z

+    /// Arrow schema to be applied to data in CSV (same format as written out by -p / -n), prevents schema inference from running.
+    /// The structure of schema json is shown in the source of: DataType fn from(json: &Value -> Result<DataType> in:
+    /// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs
+    /// Make sure your schema has the same number of columns as your CSV!


I don't think this is good here. It's an obvious statement and not usually what I expect to read in cli docs.


domoritz · 2022-05-13T04:07:40Z

Thank you for the implementation and the patience with my reviews. Could you send PRs for the related repos now? Once they are in, I will make a release.

andyredhead · 2022-05-13T20:24:29Z

No worries, I appreciate you tolerating my bumbling newbie efforts.
Yes, I'll take a look at the related repos over the next few days.

AndyRedhead-EC added 2 commits April 30, 2022 19:20

ignoring clion .idea folder

97ea196

Allow a schema file defined in the Arrow Schema format to be provided…

9340623

… as a command line arg, rather than relying on schema inference from the CSV file.

domoritz suggested changes May 3, 2022


AndyRedhead-EC and others added 7 commits May 4, 2022 18:23

Resolved merge conflicts with primary after dependabot updates to car…

dffbc5a

…go defined dependencies by taking newest version of each dependency from across both change sets

Merge branch 'domoritz-main' into main

4b8ecad

Address conflicts of cargo defined dependencies between pull request update of parquet and arrow to v12 and dependabot update to lates 9.x

Update src/main.rs

f896677

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

4cc7318

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

5003b59

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

updates following comments on pull request for defined schema

87b6265

Addressing first round comments on pull request for adding defined sc…

7721f5a

…hema support. Merge branch 'main' of https://github.com/andyredhead/csv2parquet into main

domoritz approved these changes May 6, 2022


andyredhead and others added 4 commits May 7, 2022 19:26

Update src/main.rs

9346f40

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

86c2266

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

5e521db

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

3b6c6bb

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

domoritz approved these changes May 8, 2022


AndyRedhead-EC added 3 commits May 11, 2022 18:47

Merge branch 'main' of https://github.com/domoritz/csv2parquet into d…

1ecdc0f

…omoritz-main2 Dependabot updates in primary repo

Merge branch 'domoritz-main2' into main

7c6ebc1

Dependabot updates in primary dependencies.

Simplify clap doco for schema file flag

3c3bc35

domoritz approved these changes May 11, 2022


andyredhead and others added 4 commits May 12, 2022 21:09

Update src/main.rs

d06fba7

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

117dbc3

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

Update src/main.rs

4738965

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

minor refactoring to field names

1e28a45

domoritz changed the title ~~Provide schema file as command line arg~~ feat: support schema file as command line arg May 13, 2022

domoritz merged commit 20ccc90 into domoritz:main May 13, 2022

This was referenced May 16, 2022

feat: support schema file as command line arg domoritz/json2parquet#71

Merged

feat: support schema file as command line arg domoritz/json2arrow#60

Merged

domoritz mentioned this pull request May 16, 2022

feat: support schema file as command line arg domoritz/csv2arrow#61

Merged
