Fix for data loss when input file partitioned through rowTag element #399

jimenefe wants to merge 3 commits into databricks:master
Conversation
```diff
 package com.databricks.spark.xml

-import java.io.{IOException, InputStream, InputStreamReader, Reader}
+import java.io.{ IOException, InputStream, InputStreamReader, Reader }
```
(Could we revert the spacing changes and reordering?)
```scala
def getPosition: Long = initialPosition + offset

// ...

override def read(): Int = {
```
My concern here is that this isn't intercepting calls to `read(byte[])` etc. How about adding `CountingInputStream` from Commons IO? It's a wrapper that's exactly meant to tell you how many bytes have been read through the input stream. I think that's what we need, added to the start position, to really know where in the file it has read.
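The kind of wrapper being suggested can be sketched with plain JDK classes. This is a hypothetical stand-in for Commons IO's `CountingInputStream`, written out to show why intercepting only `read()` is not enough: bulk reads bypass it.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ByteCountingDemo {
    // Minimal sketch of what a byte-counting wrapper provides: it counts
    // bytes as they pass through, intercepting BOTH read() and
    // read(byte[], int, int), unlike an override of read() alone.
    static class ByteCountingInputStream extends FilterInputStream {
        private long count = 0L;

        ByteCountingInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException {
            int b = super.read();
            if (b != -1) count++;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) count += n;
            return n;
        }

        long getByteCount() { return count; }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes("UTF-8");
        ByteCountingInputStream in =
            new ByteCountingInputStream(new ByteArrayInputStream(data));
        in.read();                     // single-byte read: counted
        in.read(new byte[16], 0, 16);  // bulk read: also counted
        System.out.println(in.getByteCount());  // 5
    }
}
```

In real code this count would be added to the split's start position to estimate the current byte offset in the file.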
OK so I'm looking into this more today, and it's more complex than I thought.

First, there's a problem with tracking the number of chars read, which is what this does. The end and start offsets are in bytes, and so 1 char is not always 1 byte. That's easy enough to address another way; I agree that using a byte-counting wrapper like `CountingInputStream` is the right direction.

It really uses this only for compressed input. For uncompressed input this is easy, as the byte offsets correspond directly to positions in the file.

For unsplittable compression, this won't matter. One file is always one split, so reading until the stream gives EOF is sufficient.

For splittable compression, this again won't work, and I'm still trying to work out how it handles this case. You can see the source checks the 'adjusted' start and end values, but a) I don't think this turns them into somehow 'uncompressed' offsets, and b) it doesn't look like they are set at all anyway by the one implementation, for bzip2.

I think we'll need more tests in the end to check what happens on .gz and .bzip2 input, but that's TBD.
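The char-versus-byte mismatch described above is easy to demonstrate. A count of chars read through a `Reader` cannot be compared against the split's byte-based start/end offsets once the input contains multi-byte UTF-8 characters:

```java
import java.nio.charset.StandardCharsets;

public class CharVsByteDemo {
    public static void main(String[] args) {
        // "\u00e9" ("é") is a single char but two bytes in UTF-8, so the
        // number of chars read and the byte offset in the file diverge.
        String s = "h\u00e9llo";
        System.out.println(s.length());                                 // 5 chars
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 6 bytes
    }
}
```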
Bad news: it appears possible to hack this and get at some internal position information, but I'm pretty sure there are problems with compressed input too, and will work on some tests to check their behavior here.
Check out #400
We merged the other change, but credit to you for finding the issue and test case, and proposing a fix that was directionally correct.
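For background, the usual Hadoop-style convention for records that span split boundaries can be sketched as a toy example. This is illustrative only, not the code merged in #400: every reader except the first discards the partial record at its start, and every reader may read past its end offset to finish the last record it started, so no record is lost or duplicated.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitConventionDemo {
    // Toy record reader over a separator-delimited string. `start`/`end`
    // model a byte-range split; records are terminated by `sep`.
    static List<String> readSplit(String data, int start, int end, char sep) {
        int pos;
        if (start == 0) {
            pos = 0;
        } else {
            // Not the first split: skip the (possibly whole) record in
            // progress at `start`; the previous reader already emitted it.
            int i = data.indexOf(sep, start);
            pos = (i == -1) ? data.length() : i + 1;
        }
        List<String> out = new ArrayList<>();
        // Emit every record that STARTS at or before `end`; the last one
        // may extend past the split boundary.
        while (pos < data.length() && pos <= end) {
            int i = data.indexOf(sep, pos);
            int recEnd = (i == -1) ? data.length() : i;
            out.add(data.substring(pos, recEnd));
            pos = recEnd + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String data = "aaa|bbbb|cc|ddd";
        // Two byte ranges whose boundary (offset 8) cuts "bbbb"'s record:
        List<String> all = new ArrayList<>(readSplit(data, 0, 8, '|'));
        all.addAll(readSplit(data, 8, data.length(), '|'));
        System.out.println(String.join(",", all));  // aaa,bbbb,cc,ddd
    }
}
```

The bug this PR reported is what happens when that convention is not applied to rowTag boundaries: the record straddling the split is silently dropped by one reader and skipped by the other.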
Thanks Sean, as long as the issue is fixed, I'm happy :-)
Fixes issue reported in #390