This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Fix for data loss when input file partitioned through rowTag element #399

Closed
jimenefe wants to merge 3 commits into databricks:master from onedot-data:master

Conversation

@jimenefe (Contributor) commented Aug 2, 2019

Fixes issue reported in #390

```diff
 package com.databricks.spark.xml

-import java.io.{IOException, InputStream, InputStreamReader, Reader}
+import java.io.{ IOException, InputStream, InputStreamReader, Reader }
```
Collaborator review comment:

(Could we revert the spacing changes and reordering?)


```scala
def getPosition: Long = initialPosition + offset

override def read(): Int = {
```
Collaborator review comment:

My concern here is that this isn't intercepting calls to read(byte[]) etc. How about adding CountingInputStream from Commons IO? It's a wrapper that's exactly meant to tell you how many bytes have been read through the input stream. I think that's what we need, added to the start position, to really know where in the file it has read.
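To illustrate the concern above, here is a minimal sketch of what a counting wrapper like Commons IO's CountingInputStream does: every read variant updates a single byte counter, so bulk reads via read(byte[]) are counted too. The class name and `main` demo here are illustrative, not the spark-xml code:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a byte-counting wrapper. Overriding only read() would miss
// bulk reads; overriding read(byte[], int, int) covers them, because
// FilterInputStream.read(byte[]) delegates to it.
class ByteCountingInputStream extends FilterInputStream {
    private long count = 0;

    ByteCountingInputStream(InputStream in) { super(in); }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    long getByteCount() { return count; }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes();
        ByteCountingInputStream cis =
            new ByteCountingInputStream(new ByteArrayInputStream(data));
        cis.read();            // 1 byte via read()
        cis.read(new byte[4]); // 4 bytes via read(byte[])
        System.out.println(cis.getByteCount()); // 5
    }
}
```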

@srowen (Collaborator) commented Aug 3, 2019

OK so I'm looking into this more today, and it's more complex than I thought.

First, there's a problem with tracking the number of chars read, which is what this does. The end and start offsets are in bytes, and so 1 char is not always 1 byte. That's easy enough to address another way, like CountingInputStream I think (or wrapping the InputStream the Reader reads; we do in fact only rely on .read() now, but that could change).
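The char-vs-byte mismatch is easy to demonstrate with a standalone example (illustrative, not spark-xml code): split offsets from Hadoop are byte positions, but a Reader counts chars, and any non-ASCII char occupies more than one UTF-8 byte.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteDemo {
    public static void main(String[] args) {
        // "déjà" is 4 chars, but 'é' and 'à' each take 2 bytes in UTF-8.
        String s = "déjà";
        int chars = s.length();                                // 4 chars
        int bytes = s.getBytes(StandardCharsets.UTF_8).length; // 6 bytes
        System.out.println(chars + " chars, " + bytes + " bytes");
    }
}
```

So tracking chars read and comparing against a byte-based end offset will overshoot or undershoot the split boundary on any non-ASCII input.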

I agree that using FilePosition is problematic because all paths here would end up with a buffered stream from Hadoop, and so the read position may not match exactly how much has been consumed. However, I returned to the source of LineRecordReader, which this is based on, and it does use the same mechanism for checking how much has been read. Sometimes.

It really uses this only for compressed input. For uncompressed input this is easy, as end and start make it entirely possible to know when one has finished the split by counting bytes read. For compressed input, it's not clear how many uncompressed bytes come out of the source bytes between start and end so this won't work.

For unsplittable compression, this won't matter. One file is one split always. Reading until the stream gives EOF is sufficient.

For splittable compression, this again won't work, and I'm still trying to work out how it handles this case. You can see the source checks the 'adjusted' start and end values but a) I don't think this turns them into somehow 'uncompressed' offsets, and b) it doesn't look like they are set at all anyway by the one implementation, for bzip2.

I think we'll need more tests in the end to check what happens on .gz and .bzip2 input, but that's TBD.

@srowen (Collaborator) commented Aug 3, 2019

Bad news: InputStreamReader itself buffers from the underlying bytes. So it throws off any attempt to know exactly how many bytes have been processed.

It appears possible to hack this and get some internal ByteBuffer from it, and we can subtract off the number of bytes buffered but not yet processed. I have a version of this change that works, but it's a hack. Yet I have no better ideas.
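The read-ahead behaviour described above can be observed directly: wrap the underlying stream in a byte counter, read one char through an InputStreamReader, and the counter shows far more than one byte consumed underneath. This is a standalone sketch (the `Counting` helper is hypothetical, not the spark-xml code):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Demonstrates that InputStreamReader pulls more bytes from the underlying
// stream than the number of chars consumed so far would suggest.
public class ReaderBufferingDemo {

    // Minimal byte-counting wrapper (illustrative helper).
    static class Counting extends FilterInputStream {
        long count = 0;
        Counting(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            int b = in.read();
            if (b != -1) count++;
            return b;
        }
        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100]; // 100 bytes of input
        java.util.Arrays.fill(data, (byte) 'x');
        Counting counting = new Counting(new ByteArrayInputStream(data));
        InputStreamReader reader =
            new InputStreamReader(counting, StandardCharsets.UTF_8);
        reader.read(); // consume a single char...
        // ...yet the reader has buffered many more bytes underneath.
        System.out.println("chars read: 1, bytes pulled: " + counting.count);
    }
}
```

That gap between chars consumed and bytes pulled is exactly what makes "how far into the split are we?" hard to answer from outside the Reader.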

I'm pretty sure there are problems with compressed input too, and will work on some tests to check their behavior here.

@srowen (Collaborator) commented Aug 4, 2019

Check out #400

HyukjinKwon pushed a commit that referenced this pull request Aug 5, 2019
…fixes (#400)

This attempts to address #398
See also #399

The change is, I believe, explained in the comments below.
@srowen (Collaborator) commented Aug 5, 2019

We merged the other change, but credit to you for finding the issue and test case, and proposing a fix that was directionally correct.

@srowen closed this Aug 5, 2019
@jimenefe (Contributor, Author) commented Aug 9, 2019

Thanks Sean, as long as the issue is fixed, I'm happy :-)
