Better handle rows that break across splits, and other small related fixes by srowen · Pull Request #400 · databricks/spark-xml

srowen · 2019-08-04T01:35:18Z

This attempts to address #398.
See also #399

The change is, I believe, explained in the comments below.

srowen

I'm not super happy about the hack here, but I could not find any other way around it. This is a potential correctness issue, so we need to do something. At least, I added more tests to exercise handling of rows that split across splits, and they still pass.
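To make the "rows that split across splits" problem concrete, here is a minimal sketch of the usual ownership rule for split-based readers: a split emits every record whose start tag begins inside its byte range, and reads past its own end to finish the last one. This is an illustration over an in-memory string, not the code in `XmlInputFormat.scala`; the names `SplitBoundarySketch`, `readSplit`, and `readAll`, and the `<row>` tag, are hypothetical.

```scala
object SplitBoundarySketch {
  // Hypothetical row tag, used only for this illustration.
  val startTag = "<row>"
  val endTag = "</row>"

  // Returns the records owned by the split [start, end): those whose start
  // tag begins at an offset in that range. A record may extend past `end`.
  def readSplit(data: String, start: Int, end: Int): Seq[String] = {
    val records = Seq.newBuilder[String]
    var pos = data.indexOf(startTag, start)
    while (pos >= 0 && pos < end) {
      val close = data.indexOf(endTag, pos)
      if (close < 0) {
        pos = -1 // unterminated record: stop without emitting it
      } else {
        val recordEnd = close + endTag.length
        records += data.substring(pos, recordEnd)
        pos = data.indexOf(startTag, recordEnd)
      }
    }
    records.result()
  }

  // Reads the whole input as consecutive fixed-size splits.
  def readAll(data: String, splitSize: Int): Seq[String] =
    (0 until data.length by splitSize).flatMap { s =>
      readSplit(data, s, math.min(s + splitSize, data.length))
    }

  def main(args: Array[String]): Unit = {
    val data = (1 to 5).map(i => s"<row>$i</row>").mkString
    // A split size smaller than one record forces every record across a boundary.
    readAll(data, 7).foreach(println)
  }
}
```

With this rule no record is dropped or duplicated regardless of where the boundaries fall, which is the property the added tests are meant to exercise. A real reader works on a byte stream rather than a string, so it also has to know its exact byte position, and that is where the difficulty discussed in this PR comes from.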

HyukjinKwon · 2019-08-05T08:16:28Z

src/main/scala/com/databricks/spark/xml/XmlInputFormat.scala

    reader = new InputStreamReader(in, charset)
+
+    if (codec == null) {
+      // Hack: in the uncompressed case (see more below), we must know how much the


Yeah, I don't like this hack either, but there seems to be no better way.
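The diff comment above is cut off, so the following is only an assumption about the kind of difficulty involved, not a reproduction of the hack. It illustrates one general property of `java.io.InputStreamReader`: the reader pulls bytes from the underlying stream ahead of the characters it has returned, so the stream's position does not say how many bytes have actually been processed. `ReaderBufferingSketch`, `CountingInputStream`, and `bytesPulledAfterOneChar` are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, FilterInputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object ReaderBufferingSketch {
  // Hypothetical helper: counts the bytes pulled from the wrapped stream.
  class CountingInputStream(in: InputStream) extends FilterInputStream(in) {
    var bytesRead: Long = 0L

    override def read(): Int = {
      val b = super.read()
      if (b >= 0) bytesRead += 1
      b
    }

    override def read(buf: Array[Byte], off: Int, len: Int): Int = {
      val n = super.read(buf, off, len)
      if (n > 0) bytesRead += n
      n
    }
  }

  // Reads a single character through an InputStreamReader and reports how
  // many bytes the reader pulled from the underlying stream to do so.
  def bytesPulledAfterOneChar(text: String): Long = {
    val counting = new CountingInputStream(
      new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)))
    val reader = new InputStreamReader(counting, StandardCharsets.UTF_8)
    reader.read() // consume exactly one character
    counting.bytesRead
  }

  def main(args: Array[String]): Unit = {
    val pulled = bytesPulledAfterOneChar("<row>1</row><row>2</row>")
    println(s"characters consumed: 1, bytes pulled from the stream: $pulled")
  }
}
```

Because the reader buffers internally, a position check against the end of a split that relies on the underlying stream can overshoot, which is one reason byte-accurate tracking behind a `Reader` tends to need a workaround.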

HyukjinKwon

Looks good if the style checks and tests pass.

codecov-io · 2019-08-05T13:43:19Z

Codecov Report

Merging #400 into master will decrease coverage by 0.05%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master     #400      +/-   ##
==========================================
- Coverage   87.78%   87.73%   -0.06%     
==========================================
  Files          14       14              
  Lines         745      758      +13     
  Branches       64       65       +1     
==========================================
+ Hits          654      665      +11     
- Misses         91       93       +2

Impacted Files	Coverage Δ
...cala/com/databricks/spark/xml/XmlInputFormat.scala	`92.53% <83.33%> (-0.86%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 41d0d17...9584634.

srowen · 2019-08-05T14:23:42Z

Thanks @HyukjinKwon - we may want to cut an 0.6.0 release for this, plus the inferSchema change. I may need to learn from you how to do it.

Better handle rows that break across splits, and other small related fixes

32b6d3e

srowen added the bug label Aug 4, 2019

srowen requested a review from HyukjinKwon August 4, 2019 01:35

srowen self-assigned this Aug 4, 2019

This was referenced Aug 4, 2019

Fix for data loss when input file partitioned through rowTag element #399

Closed

Records dropped on large files when partition breaks rowTag #398

Closed

srowen added this to the 0.5.1 milestone Aug 4, 2019

srowen commented Aug 4, 2019

Remove whitespace at end of line

438f71a

HyukjinKwon reviewed Aug 5, 2019

HyukjinKwon approved these changes Aug 5, 2019

More style fixes

9584634

HyukjinKwon merged commit 8bc9621 into databricks:master Aug 5, 2019

srowen modified the milestones: 0.5.1, 0.6.0 Aug 5, 2019

srowen deleted the Issue398 branch October 28, 2019 00:38

slim-naifar mentioned this pull request Jan 4, 2020

Controlling partitions while reading xml files #184

Closed

HyukjinKwon merged 3 commits into databricks:master from srowen:Issue398
