Inconsistent result for DROPMALFORMED with pruned scan and unnecessarily casting all values when no fields are required #220

HyukjinKwon · 2015-12-28T02:36:47Z

https://github.com/databricks/spark-csv/issues/218 https://github.com/databricks/spark-csv/issues/219

In this PR, I made the pruned scan parse the values in all columns when DROPMALFORMED is enabled, and return only the required fields.

In addition, I changed the condition for the table scan. If the required columns are empty, the scan just produces empty rows.
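
To illustrate the intended behaviour, here is a minimal, Spark-free sketch in plain Scala. `ColType`, `Field`, `PrunedScanSketch` and `scan` are hypothetical names made up for this illustration; this is not the code in CsvRelation.scala. It assumes that "malformed" means either a wrong token count or a value that cannot be cast to its declared type. Whether malformed lines are still filtered when no columns are required is not stated above, so the sketch simply emits one empty row per line in that case.

```scala
import scala.util.Try

// Hypothetical stand-ins for the relation's schema types (illustration only).
sealed trait ColType
case object IntCol extends ColType
case object StrCol extends ColType
final case class Field(name: String, tpe: ColType)

object PrunedScanSketch {
  // Cast one token to its declared type; throws on a malformed value.
  private def cast(token: String, tpe: ColType): Any = tpe match {
    case IntCol => token.trim.toInt
    case StrCol => token
  }

  // DROPMALFORMED-style pruned scan: returns only the required columns,
  // in the requested order, and drops lines that fail to parse.
  def scan(
      schema: Seq[Field],
      lines: Seq[Array[String]],
      required: Seq[String]): Seq[Seq[Any]] =
    if (required.isEmpty) {
      // No fields requested: emit empty rows without casting anything.
      lines.map(_ => Seq.empty[Any])
    } else {
      val indexOf = schema.map(_.name).zipWithIndex.toMap
      // Resolved outside the Try so an unknown column name fails loudly.
      val requiredIdx = required.map(indexOf)
      lines.flatMap { tokens =>
        Try {
          require(tokens.length == schema.length, "token count does not match schema")
          // Cast every column, so a bad value in a column that was not
          // requested still marks the whole line as malformed.
          val all = schema.zip(tokens).map { case (field, token) => cast(token, field.tpe) }
          requiredIdx.map(all)
        }.toOption
      }
    }
}
```

With a schema of (year, make, price) and a line whose price is not a number, requesting only `make` still drops that line, so the pruned scan and the full table scan agree on which rows survive.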

codecov-io · 2015-12-28T02:44:58Z

Current coverage is `85.82%`

Merging #220 into master will increase coverage by +0.11% as of 5b06cea

@@            master    #220   diff @@
======================================
  Files           11      11       
  Stmts          497     501     +4
  Branches       146     148     +2
  Methods          0       0       
======================================
+ Hit            426     430     +4
  Partial          0       0       
  Missed          71      71

Review entire Coverage Diff as of 5b06cea

falaki · 2015-12-28T06:20:24Z

Thanks @HyukjinKwon for investigating this. Do you think we should also include FAILFAST mode?

HyukjinKwon · 2015-12-28T06:23:28Z

@falaki Ah, if you meant adding some logic so that FAILFAST mode works properly, it actually looks fine with FAILFAST mode, as it just compares the schema to the tokens (which are always all read, regardless of pruned scan or table scan).
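
A rough sketch of that check, as I understand it, in plain Scala. `FailFastSketch` and `checkLine` are hypothetical names for illustration, and the exception type and message are assumptions rather than a quote from the library.

```scala
object FailFastSketch {
  // FAILFAST-style check: only the number of tokens is compared with the
  // number of schema fields. All tokens of a line are always read, so the
  // result does not depend on which columns a pruned scan requested.
  def checkLine(schemaSize: Int, tokens: Seq[String], failFast: Boolean): Unit =
    if (failFast && tokens.length != schemaSize) {
      throw new RuntimeException(
        s"Malformed line in FAILFAST mode: ${tokens.mkString(",")}")
    }
}
```

Because the check never looks at the required columns, it behaves the same for the pruned scan and the table scan.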

falaki · 2015-12-28T06:30:26Z

src/main/scala/com/databricks/spark/csv/CsvRelation.scala

Why not simply use schemaFields?

HyukjinKwon · Dec 28, 2015

Because of the column order. The required columns can be in a different order from the original schema.
If you look through the code below, you will notice that I slice the row with take().
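
A small sketch of the ordering point, in plain Scala with column names as strings. `RequiredFirstSketch`, `safeOrder` and `project` are hypothetical names for illustration, not the identifiers used in the PR. The idea is to put the required columns first, append the remaining schema columns so that every value is still visited, and then keep only the leading required part with take().

```scala
object RequiredFirstSketch {
  // Required columns first (in the requested order), then the rest of the
  // schema, so that all columns are still parsed.
  def safeOrder(schema: Seq[String], required: Seq[String]): Seq[String] =
    required ++ schema.filterNot(c => required.contains(c))

  // Look up every column in the "safe" order, then slice off the leading
  // required values with take().
  def project(
      schema: Seq[String],
      tokens: Seq[String],
      required: Seq[String]): Seq[String] = {
    val byName = schema.zip(tokens).toMap
    val parsed = safeOrder(schema, required).map(byName)
    parsed.take(required.length)
  }
}
```

Iterating over the schema in its original order and calling take() would return the first columns of the schema instead of the requested ones, which is why the required columns have to come first.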

HyukjinKwon · 2015-12-28T06:32:03Z

Also, would you mind looking through the code, in particular the variable names (although it is trivial)?
I am not sure whether the naming is appropriate.

falaki · 2015-12-28T23:01:46Z

Thanks, this looks good to me.

HyukjinKwon added 2 commits December 28, 2015 10:43

Execute table scan only if needed.

d090664

Perform parsing all data when DROPMALFORMED parsing mode is enabled.

17f2ca9

falaki reviewed Dec 28, 2015
falaki closed this in 4a5ee81 Dec 28, 2015

HyukjinKwon mentioned this pull request Feb 12, 2016

[SPARK-13260][SQL] count(*) does not work with CSV data source apache/spark#11169

Closed
