Skip to content
This repository was archived by the owner on Feb 2, 2023. It is now read-only.

feat: support schema file as command line arg#74

Merged
domoritz merged 20 commits into
domoritz:mainfrom
andyredhead:main
May 13, 2022
Merged

feat: support schema file as command line arg#74
domoritz merged 20 commits into
domoritz:mainfrom
andyredhead:main

Conversation

@andyredhead

Copy link
Copy Markdown
Contributor

As discussed in issue #73, here is a first cut at enabling a schema file to be supplied as a command line arg.

It relies entirely upon the json schema handling provided by arrow, which means there can be unexpected outcomes if mistakes are made while defining the schema...

The main problem will be that it seems the columns in the json schema are applied in numeric order to the columns in the CSV. If the json schema omits a column, the column definitions will get misaligned, meaning that columns may end up with the wrong heading or there are (poorly reported) incompatibilities between the data in the CSV file and the type expected for the column (i.e. trying to push a float into an int).

These problems can largely be avoided by using the schema inference process to generate an initial schema file, which can be edited prior to running a full convert.

As there is a pretty straight forward way of getting most cases to work with the functionality as implemented here, this seems like a good place to start. (it's good enough for my needs at the moment anyway :) )

Note: I wrote this against Arrow/Parquet 9.1.0 and got it working, then raised the version of those dependencies to 12.0.0 (latest release) and everything continued to just compile and work.

… as a command line arg, rather than relying on schema inference from the CSV file.
Comment thread Cargo.toml Outdated
Comment thread src/main.rs Outdated
output: PathBuf,

/// Arrow schema to be applied to data in CSV (same format as written out by -p / -n). {n}
/// {n}

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by "this", if its the {n} it was to try and get clap to generate newlines, I've removed from the text anyway

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I meant the {n}

Comment thread src/main.rs Outdated
/// {n}
/// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs {n}
/// {n}
/// Make sure to have the same number of columns in your schema file as are in your CSV!

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very long comment. Did you write it or copy it from arrow? If the former, please revise heavily and shorten.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote it. I've had another go, chopped out a lot of the text.

Comment thread src/main.rs
/// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs {n}
/// {n}
/// Make sure to have the same number of columns in your schema file as are in your CSV!
#[clap(short = 's', long, parse(from_os_str), value_hint = ValueHint::AnyPath)]

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a path, I wonder whether it's better to expect a string instead of a file. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing the schema as a string would be feasible, the down side is it could be unwieldy.

The process I employed to get to the defined schema I've been playing with was:

  • Do a dry run with schema infer, cat output to local file
  • Manually edit local schema file
  • Run with schema read from local file

To receive the defined schema as a string that flow becomes:

  • Do a dry run with schema infer, cat output to local file
  • Manually edit local schema file
  • Read file into env var
  • Run with schema read from env var

I can't see anyone being keen to type the whole schema by hand at the command line... !
The defined schema for the file I've been playing with (only 5 columns) is:

{
  "fields": [
    {
      "name": "plant_id",
      "nullable": false,
      "type": {
        "name": "int",
        "bitWidth": 32,
        "isSigned": false
      },
      "children": []
    },
    {
      "name": "analysis_year",
      "nullable": false,
      "type": {
        "name": "int",
        "bitWidth": 32,
        "isSigned": false
      },
      "children": []
    },
    {
      "name": "risk_attribute_value",
      "nullable": true,
      "type": {
        "name": "floatingpoint",
        "precision": "SINGLE"
      },
      "children": []
    },
    {
      "name": "physical_risk_scenario_id",
      "nullable": false,
      "type": {
        "name": "int",
        "bitWidth": 8,
        "isSigned": false
      },
      "children": []
    },
    {
      "name": "physical_risk_type_id",
      "nullable": false,
      "type": {
        "name": "int",
        "bitWidth": 8,
        "isSigned": false
      },
      "children": []
    }
  ],
  "metadata": {}
}

Comment thread src/main.rs Outdated
Comment thread src/main.rs Outdated
Comment thread src/main.rs Outdated
Comment thread src/main.rs
_ => {
// infer schema from file contents
// NOTE: if max_read_records is zero then all cols are assumed to be "string"
match arrow::csv::reader::infer_file_schema(

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any changes to the code where we used to infer the schema. Instead of copying the logic, please refactor the code so we reuse the logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic has changed.

In the first match statement of fn main gets a schema by either:

  • reading it in from the defined schema file, which is the path taken if a schema file is provided on the command line
  • inferring it from the source csv file, using "arrow::csv::reader::infer_file_schema"

NOTE: the original code uses "builder.infer_schema(opts.max_read_records)", whereas the new code is using an associated function on the reader struct.

At this point we don't yet have a configured reader (or even a reader builder).

We do have a schema which we can give to the reader builder in a few lines time, which leads into your comment further down about re-using the existing logic...

Comment thread src/main.rs Outdated
Comment thread src/main.rs
Comment on lines +182 to +188
let schema_ref = Arc::new(schema);
let builder = ReaderBuilder::new()
.has_header(opts.header.unwrap_or(true))
.with_delimiter(opts.delimiter as u8)
.with_schema(schema_ref);

let reader = builder.build(input)?;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse the existing logic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the logic around getting a schema has changed such that we've already got a schema without needing a Reader or ReaderBuilder, this seems to be the cleanest (and closest to the original logic) approach.

If we've got a defined schema file, we've read it and built the schema without touching the source csv, the input File hasn't been touched and is ready to read.

If we don't have a defined schema file and the max records is a value that lets us infer a schema, we've inferred a schema using the infer method on the CsvReader struct (rather than using the ReaderBuilder), the input File has been read and wound back to start (exactly as in the original code).

If we don't have a defined schema file and the max records is zero, we have a schema that's all strings, the input File is ready to read (exactly as in the original code).

If I've missed something, please try and point it out to me again :)

AndyRedhead-EC and others added 7 commits May 4, 2022 18:23
…go defined dependencies by taking newest version of each dependency from across both change sets
Address conflicts of cargo defined dependencies between pull request update of parquet and arrow to v12 and dependabot update to lates 9.x
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Comment thread src/main.rs Outdated
Comment thread src/main.rs Outdated
Comment thread src/main.rs Outdated
Comment thread src/main.rs Outdated
andyredhead and others added 4 commits May 7, 2022 19:26
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Comment thread src/main.rs Outdated
output: PathBuf,

/// Arrow schema to be applied to data in CSV (same format as written out by -p / -n), prevents schema inference from running.
/// The structure of schema json is shown in the source of: DataType fn from(json: &Value -> Result<DataType> in:

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix grammar and make it easier to understand.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised, grammar checked in MS Word and a couple of online services, no issues reported.

Comment thread src/main.rs Outdated
/// Arrow schema to be applied to data in CSV (same format as written out by -p / -n), prevents schema inference from running.
/// The structure of schema json is shown in the source of: DataType fn from(json: &Value -> Result<DataType> in:
/// https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs
/// Make sure your schema has the same number of columns as your CSV!

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is good here. It's an obvious statement and not usually what I expect to read in cli docs.

Comment thread src/main.rs Outdated
Comment thread src/main.rs Outdated
Comment thread src/main.rs
Comment thread src/main.rs Outdated
andyredhead and others added 4 commits May 12, 2022 21:09
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
Co-authored-by: Dominik Moritz <domoritz@gmail.com>
@domoritz domoritz changed the title Provide schema file as command line arg feat: support schema file as command line arg May 13, 2022
@domoritz domoritz merged commit 20ccc90 into domoritz:main May 13, 2022
@domoritz

Copy link
Copy Markdown
Owner

Thank you for the implementation and the patience with my reviews. Could you send PRs for the related repos now? Once they are in, I will make a release.

@andyredhead

Copy link
Copy Markdown
Contributor Author

No worries, I appreciate you tolerating my bumbling newbie efforts.
Yes, I'll take a look at the related repos over the next few days.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants