XML as input format #29822

@alexey-milovidov

Use case

  1. Stack Exchange archives: https://archive.org/details/stackexchange
  2. Wikipedia dumps.
  3. OpenStreetMap dumps.

Describe the solution you'd like

Add support for XML as input format.

The user should specify one of the three flavours with a setting:

  1. elements:
<table>
    <row>
        <column1_name>value</column1_name>
        <column2_name>value</column2_name>
        ...
    </row>
    ...
</table>
  2. attributes:
<table>
    <row column1_name="value" column2_name="value" ... />
    ...
</table>
  3. cells:
<table>
    <row>
        <cell name="column1_name">value</cell>
        <cell name="column2_name">value</cell>
        ...
    </row>
    ...
</table>

Additional settings should specify the path to the table (like /table),
the name of the row element,
and, for the cells variant, the name of the cell element and the name of the attribute that holds the column name.
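
To make the three flavours concrete, here is a minimal Python sketch that turns each row of a given layout into a column-to-value mapping. It uses the standard library's `xml.etree.ElementTree` purely for illustration (the proposal itself asks for something lighter than a full-featured parser). The function `read_rows` and its parameters `flavour`, `table_path`, `row_tag`, `cell_tag`, and `name_attr` are hypothetical stand-ins for the settings described above, not existing ClickHouse settings, and `table_path` is assumed to be relative to the root element.

```python
import xml.etree.ElementTree as ET


def read_rows(xml_text, flavour, table_path=".", row_tag="row",
              cell_tag="cell", name_attr="name"):
    """Yield one dict per row. All parameter names are hypothetical."""
    root = ET.fromstring(xml_text)
    # Assumption: "." means the root element itself is the table.
    table = root if table_path == "." else root.find(table_path)
    if table is None:
        raise ValueError(f"table not found at {table_path!r}")
    for row in table.findall(row_tag):
        if flavour == "elements":
            # Each child element is a column: tag = name, text = value.
            yield {child.tag: child.text or "" for child in row}
        elif flavour == "attributes":
            # Each attribute of the row element is a column.
            yield dict(row.attrib)
        elif flavour == "cells":
            # Each cell element carries the column name in an attribute.
            yield {cell.get(name_attr): cell.text or ""
                   for cell in row.findall(cell_tag)}
        else:
            raise ValueError(f"unknown flavour: {flavour!r}")


ELEMENTS = "<table><row><id>1</id><title>a &amp; b</title></row></table>"
ATTRIBUTES = '<table><row id="1" title="a &amp; b" /></table>'
CELLS = ('<table><row><cell name="id">1</cell>'
         '<cell name="title">a &amp; b</cell></row></table>')

for flavour, text in [("elements", ELEMENTS),
                      ("attributes", ATTRIBUTES),
                      ("cells", CELLS)]:
    print(flavour, list(read_rows(text, flavour)))
```

All three inputs describe the same logical row, so the flavour setting only changes where the column name and value are looked up, not the resulting data.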

The format should not use a full-featured XML parser, but should support:

  • decoding of XML entities;
  • optional BOM at the beginning of the file, including UTF-8 BOM;
  • UTF-16 and UTF-32 BE/LE encodings;
  • CDATA;
  • attributes in single and double quotes;
  • self-closing tags or separate closing tags;
  • skipping XML header and DOCTYPE;
  • processing of invalid (unescaped) characters.
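
Two of the requirements above, BOM handling and entity decoding, can be sketched without a full XML parser. The following Python sketch is one possible approach under stated assumptions, not the proposed implementation: the helper names `decode_with_bom` and `decode_entities` are hypothetical, input without a BOM is assumed to be UTF-8, and unknown or malformed entities are left untouched as one lenient way of tolerating invalid input.

```python
import codecs
import re

# Check UTF-32 before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM bytes (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip an optional BOM and decode with the encoding it implies."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # Assumption: no BOM means the default encoding.
    return data.decode(default)


# The five entities predefined by XML.
_PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}
_ENTITY = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def decode_entities(text: str) -> str:
    """Decode predefined and numeric entities in a single pass."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith("#x"):
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                return chr(int(body[1:]))
        except (ValueError, OverflowError):
            # Code point out of range: keep the original text.
            return match.group(0)
        # Unknown named entities are kept as-is (a lenient choice).
        return _PREDEFINED.get(body, match.group(0))
    return _ENTITY.sub(replace, text)


print(decode_with_bom(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")))
print(decode_entities("a &amp; b &lt;c&gt; &#65;&#x42; &unknown;"))
```

Because the substitution runs in a single pass, already-decoded output is never decoded a second time, so `&amp;lt;` becomes `&lt;` rather than `<`.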

Additional context

These settings should also control the XML output format that we already have.

Labels: comp-formats (Input/output formats: CSV/JSON/Parquet/ORC/Arrow/Protobuf/etc.), feature