This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Data loss when input file partitioned through rowTag element #450

@PeterNmp

Description

Hi,

Thanks for all the effort put into this library!
We still seem to be having this issue related to #399 with 0.9.0 :(
We have large XML files, 10+ GB, with a format like this:

...
<SoundRecording>
...
</SoundRecording>
...
<Release>
...
</Release>
...
<ReleaseTransactions>
...
</ReleaseTransactions>

When I count the number of SoundRecording/Release/ReleaseTransactions elements directly in the files, the counts match (as they should), but processing the files like this:
spark.read.format("com.databricks.spark.xml").....option("rowTag","SoundRecording")
gives me different counts of SoundRecording/Release/ReleaseTransactions for some of the files processed.
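For reference, this is a minimal sketch of how the raw counts can be verified independently of Spark: stream the file once and count opening occurrences of each row tag. The tag names are taken from the layout above; the file path and helper name are hypothetical, and this deliberately avoids a full XML parse since a 10+ GB file may not fit in memory.

```python
import re

def count_row_tags(path, tags=("SoundRecording", "Release", "ReleaseTransactions")):
    """Stream the file line by line and count opening occurrences of each row tag.

    Longest tag names are tried first so that "Release" does not also
    swallow "ReleaseTransactions"; the [\\s>] anchor ensures we match a
    complete tag name, not a prefix, and closing tags (</Tag>) are skipped.
    """
    ordered = sorted(tags, key=len, reverse=True)
    pattern = re.compile(r"<(%s)[\s>]" % "|".join(ordered))
    counts = {tag: 0 for tag in tags}
    with open(path, encoding="utf-8") as f:
        for line in f:
            for match in pattern.finditer(line):
                counts[match.group(1)] += 1
    return counts
```

Comparing these counts against `df.count()` from three separate reads (one per `rowTag`) is one way to pin down which element types lose rows.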
