This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Fix for data loss when input file partitioned through rowTag element #399

Closed
jimenefe wants to merge 3 commits into databricks:master from onedot-data:master

Conversation

@jimenefe (Contributor) commented Aug 2, 2019

Fixes issue reported in #390

```diff
 package com.databricks.spark.xml

-import java.io.{IOException, InputStream, InputStreamReader, Reader}
+import java.io.{ IOException, InputStream, InputStreamReader, Reader }
```
Collaborator review comment:

(Could we revert the spacing changes and reordering?)


```scala
def getPosition: Long = initialPosition + offset

override def read(): Int = {
```
Collaborator review comment:

My concern here is that this isn't intercepting calls to read(byte[]) etc. How about adding CountingInputStream from Commons IO? It's a wrapper that's exactly meant to tell you how many bytes have been read through the input stream. I think that's what we need, added to the start position, to really know where in the file it has read.
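To illustrate the concern above, here is a minimal sketch of what a counting wrapper like Commons IO's CountingInputStream does: every read variant updates a single byte counter, so bulk reads via read(byte[]) are counted too. The class name and `main` demo here are illustrative, not the spark-xml code:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a byte-counting wrapper. Overriding only read() would miss
// bulk reads; overriding read(byte[], int, int) covers them, because
// FilterInputStream.read(byte[]) delegates to it.
class ByteCountingInputStream extends FilterInputStream {
    private long count = 0;

    ByteCountingInputStream(InputStream in) { super(in); }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) count += n;
        return n;
    }

    long getByteCount() { return count; }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes();
        ByteCountingInputStream cis =
            new ByteCountingInputStream(new ByteArrayInputStream(data));
        cis.read();            // 1 byte via read()
        cis.read(new byte[4]); // 4 bytes via read(byte[])
        System.out.println(cis.getByteCount()); // 5
    }
}
```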

@srowen (Collaborator) commented Aug 3, 2019

OK so I'm looking into this more today, and it's more complex than I thought.

First, there's a problem with tracking the number of chars read, which is what this does. The end and start offsets are in bytes, and so 1 char is not always 1 byte. That's easy enough to address another way, like CountingInputStream I think (or wrapping the InputStream the Reader reads; we do in fact only rely on .read() now, but that could change).
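The char-vs-byte mismatch is easy to demonstrate with a standalone example (illustrative, not spark-xml code): split offsets from Hadoop are byte positions, but a Reader counts chars, and any non-ASCII char occupies more than one UTF-8 byte.

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteDemo {
    public static void main(String[] args) {
        // "déjà" is 4 chars, but 'é' and 'à' each take 2 bytes in UTF-8.
        String s = "déjà";
        int chars = s.length();                                // 4 chars
        int bytes = s.getBytes(StandardCharsets.UTF_8).length; // 6 bytes
        System.out.println(chars + " chars, " + bytes + " bytes");
    }
}
```

So tracking chars read and comparing against a byte-based end offset will overshoot or undershoot the split boundary on any non-ASCII input.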

I agree that using FilePosition is problematic because all paths here would end up with a buffered stream from Hadoop, and so the read position may not match exactly how much has been consumed. However, I returned to the source of LineRecordReader, which this is based on, and it does use the same mechanism for checking how much has been read. Sometimes.

It really uses this only for compressed input. For uncompressed input this is easy, as end and start make it entirely possible to know when one has finished the split by counting bytes read. For compressed input, it's not clear how many uncompressed bytes come out of the source bytes between start and end so this won't work.

For unsplittable compression, this won't matter. One file is one split always. Reading until the stream gives EOF is sufficient.

For splittable compression, this again won't work, and I'm still trying to work out how it handles this case. You can see the source checks the 'adjusted' start and end values but a) I don't think this turns them into somehow 'uncompressed' offsets, and b) it doesn't look like they are set at all anyway by the one implementation, for bzip2.

I think we'll need more tests in the end to check what happens on .gz and .bzip2 input, but that's TBD.

@srowen (Collaborator) commented Aug 3, 2019

Bad news: InputStreamReader itself buffers from the underlying bytes. So it throws off any attempt to know exactly how many bytes have been processed.

It appears possible to hack this and get some internal ByteBuffer from it, and we can subtract off the number of bytes buffered but not yet processed. I have a version of this change that works, but it's a hack. Yet I have no better ideas.
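The read-ahead behaviour described above can be observed directly: wrap the underlying stream in a byte counter, read one char through an InputStreamReader, and the counter shows far more than one byte consumed underneath. This is a standalone sketch (the `Counting` helper is hypothetical, not the spark-xml code):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Demonstrates that InputStreamReader pulls more bytes from the underlying
// stream than the number of chars consumed so far would suggest.
public class ReaderBufferingDemo {

    // Minimal byte-counting wrapper (illustrative helper).
    static class Counting extends FilterInputStream {
        long count = 0;
        Counting(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            int b = in.read();
            if (b != -1) count++;
            return b;
        }
        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100]; // 100 bytes of input
        java.util.Arrays.fill(data, (byte) 'x');
        Counting counting = new Counting(new ByteArrayInputStream(data));
        InputStreamReader reader =
            new InputStreamReader(counting, StandardCharsets.UTF_8);
        reader.read(); // consume a single char...
        // ...yet the reader has buffered many more bytes underneath.
        System.out.println("chars read: 1, bytes pulled: " + counting.count);
    }
}
```

That gap between chars consumed and bytes pulled is exactly what makes "how far into the split are we?" hard to answer from outside the Reader.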

I'm pretty sure there are problems with compressed input too, and will work on some tests to check their behavior here.

@srowen (Collaborator) commented Aug 4, 2019

Check out #400

HyukjinKwon pushed a commit that referenced this pull request Aug 5, 2019
…fixes (#400)

This attempts to address #398
See also #399

The change is, I believe, explained in the comments below.
@srowen (Collaborator) commented Aug 5, 2019

We merged the other change, but credit to you for finding the issue and test case, and proposing a fix that was directionally correct.

@srowen closed this Aug 5, 2019
@jimenefe (Contributor, Author) commented Aug 9, 2019

Thanks Sean, as long as the issue is fixed, I'm happy :-)
