{"id":27580,"date":"2025-05-22T09:53:03","date_gmt":"2025-05-22T09:53:03","guid":{"rendered":"https:\/\/sonra.io\/?p=27580"},"modified":"2026-02-24T12:17:16","modified_gmt":"2026-02-24T12:17:16","slug":"parse-xml-spark-databricks","status":"publish","type":"post","link":"https:\/\/sonra.io\/parse-xml-spark-databricks\/","title":{"rendered":"How to Parse XML in Spark and Databricks [2026 Guide]"},"content":{"rendered":"\n<p>You\u2019d think working with XML in Spark 4.0 or Databricks 14.3+ in 2026 would be easy. <strong>But it often isn\u2019t.<\/strong><\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<div class=\"note-block-content\"><p><strong>What works:<\/strong> Spark and Databricks can absolutely ingest XML and write it out to Delta tables, especially now with native XML support in newer runtimes.<\/p>\n<p>&nbsp;<\/p>\n<p>For simple files and flat structures, you can get pretty far with built-in options and the <a href=\"#post-27580-_hne1m2w3fz9t\">spark-xml lineage of features<\/a>.<\/p>\n<\/div>\n<\/div>\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<div class=\"note-block-content\"><p><strong>The problems show up when you do this at scale:<\/strong> deeply nested structures, inconsistent files, and schema changes turn \u201ca working demo\u201d into <a href=\"#post-27580-_5zmw8j2uy9oh\">a fragile pipeline that needs constant rewrites and babysitting<\/a>.<\/p>\n<p>If you\u2019re dealing with production ingestion across <a href=\"#post-27580-_raf0a7rdtakp\">evolving XML and XSD-heavy standards<\/a>, <a href=\"#post-27580-_iz5zldj0jdwk\">Flexter is a no-code way to convert XML into Delta<\/a>, without hand-maintained parsing logic.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>TL;DR: <\/strong>This blog post breaks down how to convert XML into Delta Tables using Spark and Databricks, covering both spark-xml-based workflows and native support in Spark 4.0 and Databricks Runtime 14.3 (or higher).<\/p>\n\n\n\n<p><strong>Keep reading, and you\u2019ll see:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <a href=\"#post-27580-_lbz6lyhuf99r\">fundamentals of converting XML to Spark and Databricks<\/a>.<\/li>\n\n\n\n<li>The <a href=\"#post-27580-_d2n99ff55r53\">constraints of flattening XML<\/a> with manual code approaches using spark-xml.<\/li>\n\n\n\n<li><a href=\"#post-27580-_labm59hczlfe\">Hands-on walkthroughs<\/a> using both Spark and Databricks Notebooks.<\/li>\n\n\n\n<li>A summary of <a href=\"#post-27580-_eq6dw6ojafiw\">the limitations I encountered in my testing<\/a>, which include errors in schema inference, XSD handling, and XML validation.<\/li>\n\n\n\n<li>Why Databricks\u2019 <a href=\"#post-27580-_5zmw8j2uy9oh\">Auto Loader<\/a> feature isn\u2019t the magic bullet it promises to be.<\/li>\n\n\n\n<li><a href=\"#post-27580-_iz5zldj0jdwk\">How Flexter can free up your team by automating:<\/a> <a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#product-overview\" target=\"_blank\" rel=\"noreferrer noopener\">schema discovery<\/a>, <a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#key-features-of-flexter-for-automated-xml-json-conversion\">documentation<\/a>, outputs <a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#key-features-of-flexter-for-automated-xml-json-conversion\" target=\"_blank\" rel=\"noreferrer noopener\">normalised tables<\/a> (not <a href=\"#post-27580-_d2n99ff55r53\">One Big Table<\/a>), and handling <a 
href=\"https:\/\/sonra.io\/flexter-faq-guide\/#key-features-of-flexter-for-automated-xml-json-conversion\" target=\"_blank\" rel=\"noreferrer noopener\">schema change without the constant rewrites<\/a>.<\/li>\n<\/ul>\n\n\n\n<p>Whether you&#8217;re exploring Spark and spark-xml features or setting up pipelines in Databricks, this guide will help you avoid common traps and build something <a href=\"#post-27580-_iz5zldj0jdwk\">that works beyond just one test case<\/a>.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Quick decision guide<\/h4>\n    \t<div class=\"note-block-content\"><p>I know you\u2019re busy; you know you\u2019re busy; we both know you don\u2019t have time to read my four workflows for converting your XML into Databricks.<\/p>\n<p><strong>Here\u2019s a quick decision guide to get the most out of this blog post in no time:<\/strong><\/p>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li><strong>Use Spark and spark-xml when\u2026<\/strong> you\u2019re doing a small, controlled XML job, a one-off analysis, or a demo. Expect manual flattening, handheld schema work, and a growing pile of \u201cquick fixes\u201d once the XML gets nested, inconsistent, or evolves.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li><strong>Use native XML in Databricks when\u2026<\/strong> you want the fastest \u201cit runs\u201d path inside a notebook for simple XML, and you\u2019re okay with the fact that the hard part still lands on you: modelling, flattening, validating, and dealing with real-world variability.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li><strong>Use Auto Loader when\u2026<\/strong> your main problem is incremental file ingestion (lots of files landing in cloud storage), and you want the plumbing handled. It\u2019s still not a magic \u201cXML to clean tables\u201d button: you\u2019ll be doing schema wrangling + flattening + evolution babysitting yourself.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li style=\"list-style-type: none\">\n<ul>\n<li><a href=\"https:\/\/sonra.io\/flexter-product-page\/\"><strong>Use Flexter when\u2026<\/strong><\/a> this is a production XML conversion project, not a science project. You want automation end-to-end, <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#metadata-management\">normalised relational tables (not OBT)<\/a>, schema evolution without constant rewrites, and <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#metadata-management\">metadata management<\/a> so changes don\u2019t silently wreck downstream consumers.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><a id=\"post-27580-_oof66cf3785a\"><\/a>Converting XML to Spark and Databricks<\/h2>\n\n\n\n<p>This section is your quick jumpstart. I\u2019ll walk you through the essentials and link you to other deep-dive posts where I get into the details of each specific sub-topic.<\/p>\n\n\n\n<p>If you\u2019re already comfortable with the basics, feel free to jump ahead to the section where I dig into <a href=\"#post-27580-_hne1m2w3fz9t\">XML handling features in Spark and Databricks:<\/a> what\u2019s new, what\u2019s improved, and what still needs work.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a id=\"post-27580-_lbz6lyhuf99r\"><\/a>Fundamentals of converting XML to Spark and Databricks<\/h3>\n\n\n\n<p><strong>So, what is XML? 
And what\u2019s the deal with XSD?<\/strong><\/p>\n\n\n\n<p>XML is one of those file formats that\u2019s everywhere, especially when it comes to <a href=\"https:\/\/sonra.io\/xml-converters-by-use-case-bidirectional\/#why-does-xml-matter\" target=\"_blank\" rel=\"noreferrer noopener\">exchanging data between systems<\/a>.<\/p>\n\n\n\n<p>If you\u2019re pulling data from government agencies, financial services, or healthcare systems, you\u2019ve probably seen more XML than you\u2019d like.<\/p>\n\n\n\n<p>But here\u2019s the thing: XML wasn\u2019t built to store files efficiently, especially considering large data volumes. <a href=\"https:\/\/sonra.io\/xml-converters-by-use-case-bidirectional\/#why-would-you-bother-converting-to-and-from-xml\" target=\"_blank\" rel=\"noreferrer noopener\">That\u2019s why we usually convert it to something more query-friendly<\/a>.<\/p>\n\n\n\n<p>Now, when dealing with XML, you\u2019ll often <a href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/#what-is-a-schema\" target=\"_blank\" rel=\"noreferrer noopener\">see XSD files alongside it<\/a>. Think of an XSD as the instruction manual. It tells you what\u2019s allowed in the XML file, what elements should be there, what types they are, how they relate to each other, and so on.<\/p>\n\n\n\n<p>Many industry standards are built using XSDs, like <a href=\"https:\/\/sonra.io\/library-xml-data-standards\/#healthcare\" target=\"_blank\" rel=\"noreferrer noopener\">HL7 in healthcare<\/a> or <a href=\"https:\/\/sonra.io\/library-xml-data-standards\/#insurance\" target=\"_blank\" rel=\"noreferrer noopener\">ACORD in insurance<\/a>. If you\u2019re curious, I\u2019ve put together <a href=\"https:\/\/sonra.io\/library-xml-data-standards\" target=\"_blank\" rel=\"noreferrer noopener\">a library of XML Data Standards<\/a> elsewhere in this blog.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>XSDs are supposed to be the easy button for XML. 
In \u201ctheory\u201d, you hand over the schema, validate records, and let the platform do the rest.<\/p>\n<p><strong>In practice, this is where Spark and Databricks pipelines get painful:<\/strong> you still end up translating XSD rules into parsing + flattening logic by hand, and advanced XSD patterns (like polymorphism via xsi:type) can turn into edge-case roulette.<\/p>\n<p>If you need this to work reliably in production, <a href=\"#post-27580-_iz5zldj0jdwk\">Flexter<\/a> supports <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#key-features-of-flexter-for-automated-xml-json-conversion\">the full XSD spec (including polymorphism \/ xsi:type)<\/a>, auto-creates the relational target model, and removes the need to manually translate XSD constraints into your pipeline code.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>What is Apache Spark?<\/strong><\/p>\n\n\n\n<p>If you\u2019re not already using Apache Spark, you\u2019ve probably heard <a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Spark\" target=\"_blank\" rel=\"noreferrer noopener\">the name thrown around in big data discussions.<\/a><\/p>\n\n\n\n<p>It\u2019s an open-source engine that can crunch massive datasets by distributing the work across a cluster of machines.<\/p>\n\n\n\n<p>Whether you\u2019re running batch jobs, real-time streaming, or even machine learning, <a href=\"https:\/\/medium.com\/@drishigupta\/an-in-depth-guide-to-apache-spark-features-components-and-applications-f0a82ef09e61\" target=\"_blank\" rel=\"noreferrer noopener\">Spark has tools for all of it<\/a>.<\/p>\n\n\n\n<p>I love how flexible it is; you can write <a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/2723372.2742797\" target=\"_blank\" rel=\"noreferrer noopener\">SQL queries<\/a>, use <a href=\"https:\/\/intellipaat.com\/blog\/tutorial\/spark-tutorial\/spark-dataframe\/\" target=\"_blank\" rel=\"noreferrer noopener\">DataFrames<\/a>, or drop into Python, Scala, or R depending on your comfort zone.<\/p>\n\n\n\n<p>Before the release of Spark 4.0 (e.g. with Spark 3.5), you had to work with libraries like <a href=\"https:\/\/github.com\/databricks\/spark-xml\" target=\"_blank\" rel=\"noreferrer noopener\">spark-xml<\/a>, which would have been your right-hand man when dealing with XML in Spark and Databricks.<\/p>\n\n\n\n<p>But now with Spark 4.0, you can directly work with <a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-44265\" target=\"_blank\" rel=\"noreferrer noopener\">native XML commands<\/a>, which are based on spark-xml.<\/p>\n\n\n\n<p>While this may seem like a cool new feature, as you\u2019ll find out later on, spark-xml-based approaches come with some limitations, especially with complex or deeply nested XML structures.<\/p>\n\n\n\n<p><strong>And what about Databricks?<\/strong><\/p>\n\n\n\n<p>Databricks is where Spark gets seriously user-friendly. 
It\u2019s a cloud-based platform built right on top of Spark, <a href=\"https:\/\/acmsocc.org\/2019\/slides\/socc19-slides-keynote-zaharia.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">but with some interesting extras<\/a>.<\/p>\n\n\n\n<p>It gives you collaborative notebooks (think: Google Docs for data), easy scheduling, auto-scaling compute, and a bunch of handy integrations with <a href=\"https:\/\/sonra.io\/convert-xml-aws-athena\/\" target=\"_blank\" rel=\"noreferrer noopener\">AWS<\/a>, <a href=\"https:\/\/www.researchgate.net\/profile\/Santosh-Kumar-Singu\/publication\/386875132_Designing_Scalable_Data_Engineering_Pipelines_Using_Azure_and_Databricks\/links\/675a2f7d951ca355613eab75\/Designing-Scalable-Data-Engineering-Pipelines-Using-Azure-and-Databricks.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Azure<\/a>, and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Google_Cloud_Platform\" target=\"_blank\" rel=\"noreferrer noopener\">GCP<\/a>. If Spark is the engine, Databricks is the smooth ride with heated seats.<\/p>\n\n\n\n<p>The big difference? Databricks takes away many setup headaches, so you can just focus on your data.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>An important part of using any tool is understanding its data lineage: how data moves, transforms, and ends up where it does.<\/p>\n<p>If you\u2019re getting started with Databricks, then it\u2019s good to know that while <a target=\"_blank\" href=\"https:\/\/docs.databricks.com\/aws\/en\/data-governance\/unity-catalog\/\">Unity Catalog<\/a> offers built-in lineage features like system tables, REST APIs, and visual diagrams, there are still some gaps.<\/p>\n<p>For example, it doesn\u2019t fully capture deeper multi-hop transformations or certain operations like UPDATE and DELETE.<\/p>\n<p>These limitations can make troubleshooting and auditing more difficult in complex pipelines. 
For a closer look at what\u2019s possible and where things fall short, <a target=\"_blank\" href=\"https:\/\/sonra.io\/databricks-data-lineage\/\">check out this article<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>What are Delta Tables (and Delta Lake)?<\/strong><\/p>\n\n\n\n<p>Suppose you\u2019re working in Databricks (or planning to), then you\u2019ll quickly come across Delta Tables.<\/p>\n\n\n\n<p>These are tables built using Delta Lake, which is kind of like a smart layer on top of regular Parquet files.<\/p>\n\n\n\n<p>Delta Lake adds:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.databricks.com\/glossary\/acid-transactions\" target=\"_blank\" rel=\"noreferrer noopener\">ACID transactions<\/a> (yep, just like in SQL databases),<\/li>\n\n\n\n<li>Schema enforcement,<\/li>\n\n\n\n<li>Time travel (you can query past versions of your data),<\/li>\n\n\n\n<li>And a tidy metadata layer to keep things organised.<\/li>\n<\/ul>\n\n\n\n<p>The key difference:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Delta Lake is the technology\/framework,<\/li>\n\n\n\n<li>Delta Table is the actual table format you use.<\/li>\n<\/ul>\n\n\n\n<p>I love that <a href=\"https:\/\/www.databricks.com\/blog\/2022\/06\/30\/open-sourcing-all-of-delta-lake.html\" target=\"_blank\" rel=\"noreferrer noopener\">Delta Lake is open-source, too<\/a>; you don\u2019t have to be on Databricks to use it (though it\u2019s definitely easiest there).<\/p>\n\n\n\n<p><strong>So, what does it mean to convert XML to Databricks or Spark?<\/strong><\/p>\n\n\n\n<p>What we\u2019re really talking about is converting your deeply nested XML files into formats these platforms are designed to handle, <a href=\"https:\/\/sonra.io\/convert-xml-with-spark-to-parquet\/\" target=\"_blank\" rel=\"noreferrer noopener\">namely Delta Tables or Parquet files.<\/a><\/p>\n\n\n\n<p>XML is inherently hierarchical: great for <a href=\"https:\/\/sonra.io\/csv-vs-json-vs-xml\/#xmls-pros-cons\" target=\"_blank\" rel=\"noreferrer noopener\">document storage and data exchange<\/a>, but not ideal if you want to work with analytics.<\/p>\n\n\n\n<p>Delta Tables and Parquet are tabular formats built for scalable querying, which you can use to <a href=\"https:\/\/sonra.io\/xml-to-database-converter\/#the-xml-to-database-fundamentals\" target=\"_blank\" rel=\"noreferrer noopener\">convert your complex XML into flat, analytics-ready data<\/a> optimised for large-scale processing.<\/p>\n\n\n\n<p>You may choose to convert to either Parquet or Delta Tables; Delta is ideal for Databricks with features like ACID transactions and versioning, but since it\u2019s built on top of Parquet, it\u2019s heavier.<\/p>\n\n\n\n<p>Parquet remains the more lightweight option and is widely supported across platforms like <a href=\"https:\/\/sonra.io\/take-the-pain-out-of-xml-processing-on-spark\/\" target=\"_blank\" rel=\"noreferrer noopener\">Spark<\/a>, <a href=\"https:\/\/sonra.io\/converting-xml-hive\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hive<\/a>, <a href=\"https:\/\/sonra.io\/iceberg-ahead-all-you-need-to-know-about-snowflakes-polaris-catalog\/\" target=\"_blank\" rel=\"noreferrer noopener\">Snowflake<\/a>, <a href=\"https:\/\/sonra.io\/xml-to-redshift-guide\/\" target=\"_blank\" rel=\"noreferrer noopener\">Redshift<\/a>, <a href=\"https:\/\/sonra.io\/convert-xml-aws-athena\/\" target=\"_blank\" rel=\"noreferrer noopener\">AWS Athena<\/a>, and more.<\/p>\n\n\n\n<p>Still not sure whether Delta or Parquet is right for your XML conversion? 
I\u2019ve put together a quick comparison to help you decide:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>\n<p><strong>Feature(s)<\/strong><\/p>\n<\/th><th>\n<p><strong>Delta Table<\/strong><\/p>\n<\/th><th>\n<p><strong>Parquet<\/strong><\/p>\n<\/th><\/tr><\/thead><tbody><tr><td>\n<p><strong>ACID Transactions &amp; Data Consistency<\/strong><\/p>\n<\/td><td>\n<p>Built-in support for ACID transactions; better suited for concurrent writes and updates.<\/p>\n<\/td><td>\n<p>No ACID support; requires external tooling or orchestration for data consistency.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Schema Evolution<\/strong><\/p>\n<\/td><td>\n<p>Supports schema enforcement and evolution out of the box.<\/p>\n<\/td><td>\n<p>Limited schema evolution; needs manual updates and schema tracking.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Versioning<\/strong><\/p>\n<\/td><td>\n<p>Allows querying previous versions of data (time-travel queries).<\/p>\n<\/td><td>\n<p>No native support for versioning.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Performance on Updates &amp; Deletes<\/strong><\/p>\n<\/td><td>\n<p>Optimised for upserts, deletes, and merges.<\/p>\n<\/td><td>\n<p>Not designed for row-level updates; best for immutable datasets.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Integration with Databricks Ecosystem<\/strong><\/p>\n<\/td><td>\n<p>First-class support in Databricks with enhanced UI and tooling.<\/p>\n<\/td><td>\n<p>Also supported, but lacks advanced features.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Storage &amp; Metadata Overhead<\/strong><\/p>\n<\/td><td>\n<p>Slightly higher due to transaction logs and metadata.<\/p>\n<\/td><td>\n<p>Lightweight; no extra storage overhead.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Interoperability with External Tools<\/strong><\/p>\n<\/td><td>\n<p>Requires Delta Lake support in other environments.<\/p>\n<\/td><td>\n<p>Widely supported by Hadoop, Spark, Presto, Trino, and other open data tools.<\/p>\n<\/td><\/tr><tr><td>\n<p><strong>Complexity of Setup outside Databricks<\/strong><\/p>\n<\/td><td>\n<p>Requires Delta Lake installation and setup if not using Databricks.<\/p>\n<\/td><td>\n<p>Works out of the box with most data tools.<\/p>\n<\/td><\/tr><\/tbody><\/table><\/figure>
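\n\n\n\n<p>To make that trade-off concrete, here\u2019s what the two write paths look like side by side. This is a minimal sketch, assuming <em>df<\/em> is an already-flattened DataFrame and the output paths are placeholders:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Same flattened DataFrame, two target formats\n\n# Parquet: lightweight and maximally portable\ndf.write.mode(\"overwrite\").parquet(\"\/output\/company_parquet\")\n\n# Delta: adds ACID transactions, schema enforcement, and time travel\ndf.write.format(\"delta\").mode(\"overwrite\").save(\"\/output\/company_delta\")\n<\/pre><\/div>\n\n\n\n<p>The API difference is tiny; what you\u2019re really choosing between is the operational behaviour (transaction log, versioning, enforcement) listed in the table above.<\/p>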
\n\n\n\n<p><strong>Why even bother converting XML to Spark or Databricks?<\/strong><\/p>\n\n\n\n<p>Honestly? Because working with raw XML at scale is a nightmare.<\/p>\n\n\n\n<p>XML is bulky, nested, and just not built for fast analytics.<\/p>\n\n\n\n<p>Spark and Databricks, on the other hand, are made for scale. They can tear through massive datasets once the data is in a format they understand.<\/p>\n\n\n\n<p>So what do we do? We convert the XML into Parquet or Delta Tables, which:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Speed up queries,<\/li>\n\n\n\n<li>Compress better (so you save on storage),<\/li>\n\n\n\n<li>Are way easier to use in modern ETL and analytics pipelines.<\/li>\n<\/ul>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>When you&#8217;re just starting out with Databricks, one of the most important things you can do is learn what your SQL is doing behind the scenes.<\/p>\n<p>Understanding how queries interact with your data (what\u2019s being read, written, or transformed) lays the groundwork for better debugging, auditing, and optimisation.<\/p>\n<p>Tools like <a target=\"_blank\" href=\"https:\/\/sonra.io\/flowhigh\/\">FlowHigh<\/a> can help by parsing SQL and giving you clearer insights into query behaviour. It\u2019s not just about using a tool: it\u2019s about building that deeper awareness early on.<\/p>\n<p>This article breaks it down really well: <a target=\"_blank\" href=\"https:\/\/sonra.io\/sql-parser-for-databricks-parsing-sql-for-table-audit-logging-and-much-more\/\">SQL parser for Databricks \u2013 Parsing SQL for table audit logging and much more<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>In short, here\u2019s what you should remember from this section:<\/strong> Whether you plan to use Parquet or Delta Tables, in Databricks or Spark, converting your XML is essential for performance, scalability, and compatibility with modern data processing workflows.<\/p>\n\n\n\n<p>XML isn\u2019t exactly efficient for big data tasks, so transforming it into columnar formats like Parquet or Delta helps you take full advantage of Spark\u2019s optimisation features.<\/p>\n\n\n\n<p>It\u2019s the difference between a messy, manual pipeline and a scalable, cost-efficient solution. Your project and your future self will thank you.<\/p>\n\n\n\n<p>And regarding the Delta Tables vs Parquet debate, keep in mind that you should:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Delta when you need transactional integrity, versioning, and schema enforcement: ideal for long-term, production-ready pipelines.<\/li>\n\n\n\n<li>Use Parquet when you need broad compatibility and want to skip the metadata overhead that comes with Delta.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><a id=\"post-27580-_hne1m2w3fz9t\"><\/a>Spark and Databricks&#8217; XML conversion features and capabilities<\/h3>\n\n\n\n<p>If you&#8217;ve ever wondered how Apache Spark, Databricks, and spark-xml fit into the picture and what their features are, you\u2019re in the right place.<\/p>\n\n\n\n<p>Since Databricks was built by the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Databricks#History\" target=\"_blank\" rel=\"noreferrer noopener\">original creators of Apache Spark<\/a>, it\u2019s no surprise that their XML conversion features <a href=\"https:\/\/www.bigeye.com\/blog\/a-brief-history-of-databricks\" target=\"_blank\" rel=\"noreferrer noopener\">share the same DNA<\/a>.<\/p>\n\n\n\n<p>This means that you\u2019ll enjoy more or less the same core features, whether you\u2019re working with Spark 3.5 and spark-xml, or the new native XML support in Databricks Runtime 14.3 and Spark 4.0.<\/p>\n\n\n\n<p><strong>Let\u2019s Dig Into the Nuts and Bolts: Spark and Databricks XML Features<\/strong><\/p>\n\n\n\n<p>The reality? 
While Spark and Databricks do give you control over how you read, structure, and write your XML data, that control comes at a cost.<\/p>\n\n\n\n<p>This means you\u2019ll get a decent set of commands and configuration options, but they\u2019re not always intuitive, especially when dealing with complex or inconsistent XML.<\/p>\n\n\n\n<p>And here\u2019s the downside: you\u2019ll likely need to get fairly hands-on, tweaking settings, adjusting parameters, and writing manual code just to get your data into a usable format.<\/p>\n\n\n\n<p>Here\u2019s a detailed breakdown (I\u2019ve put a combined example right after these lists); first up, you get a few options in <a href=\"https:\/\/spark.apache.org\/docs\/latest\/sql-data-sources-xml.html\" target=\"_blank\" rel=\"noreferrer noopener\">how you parse your XML<\/a>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>rowTag <\/strong>lets you tell Spark which XML element should act like a \u201crow\u201d when creating your DataFrame. Super handy if your XML isn\u2019t naturally flat.<\/li>\n\n\n\n<li>Then there\u2019s <strong>attributePrefix<\/strong>, which lets you tag attribute names with a prefix (like _attr_) so you don\u2019t confuse them with child elements when parsing. If you\u2019ve ever dealt with messy XML naming collisions, you know this feature is important.<\/li>\n\n\n\n<li><strong>rootTag <\/strong>lets you define what the top-level XML tag should be when you&#8217;re writing XML out.<\/li>\n\n\n\n<li><strong>valueTag <\/strong>gives you a clean way to capture the inner text of elements (not just their attributes), so you don\u2019t lose valuable data when parsing.<\/li>\n\n\n\n<li><strong>ignoreNamespace <\/strong>lets you ignore messy XML namespaces altogether if they\u2019re just getting in your way, making queries simpler.<\/li>\n<\/ul>\n\n\n\n<p>And there are a few more options for when you\u2019re working with <a href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/\" target=\"_blank\" rel=\"noreferrer noopener\">XSD<\/a> (and trust me, you probably are if you&#8217;re dealing with <a href=\"https:\/\/sonra.io\/library-xml-data-standards\/\" target=\"_blank\" rel=\"noreferrer noopener\">industry XML standards<\/a>):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You can infer a schema straight from your XML, without needing an XSD, by setting <strong>.option(&#8220;inferSchema&#8221;, &#8220;true&#8221;)<\/strong> when parsing; just be aware that this option <a href=\"https:\/\/stackoverflow.com\/questions\/42010638\/how-to-load-all-xml-files-from-a-hdfs-directory-using-spark-databricks-xml-parse\" target=\"_blank\" rel=\"noreferrer noopener\">isn\u2019t always trustworthy when processing large volumes of XML<\/a>.<\/li>\n\n\n\n<li>And if you\u2019ve got huge XML files to deal with, you can tweak <strong>samplingRatio <\/strong>to speed up the schema inference process without losing too much accuracy.<\/li>\n\n\n\n<li>If you have an XSD handy, you can validate your XML records as they&#8217;re read, catching bad or malformed data early instead of dealing with headaches later. For that, you may use the <strong>rowValidationXSDPath<\/strong> option.<\/li>\n\n\n\n<li>Plus, with the <strong>mode <\/strong>option, you can control whether Spark should drop, fail fast, or gently tolerate invalid records.<\/li>\n\n\n\n<li>And if you want to skip building a DataFrame schema manually (because who doesn\u2019t?), you can use <a href=\"https:\/\/docs.databricks.com\/aws\/en\/query\/formats\/xml\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>XSDToSchema<\/strong><\/a> to convert your XSD into a Spark SQL schema automatically.<\/li>\n<\/ul>
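\n\n\n\n<p>Here\u2019s a minimal sketch of how several of these options combine on a single read. It assumes the native XML reader (Spark 4.0 \/ Databricks Runtime 14.3+), and the file and XSD paths are placeholders rather than real locations:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Placeholder paths - swap in your own file and XSD locations\ndf = (\n\tspark.read.format(\"xml\")\n\t.option(\"rowTag\", \"Company\")              # element treated as one row\n\t.option(\"attributePrefix\", \"_attr_\")      # keep attributes distinct from child elements\n\t.option(\"samplingRatio\", \"0.1\")           # infer the schema from a 10% sample\n\t.option(\"rowValidationXSDPath\", \"\/path\/to\/schema.xsd\")  # validate each row against the XSD\n\t.option(\"mode\", \"DROPMALFORMED\")          # drop records that fail parsing or validation\n\t.load(\"\/path\/to\/files\/*.xml\")\n)\n<\/pre><\/div>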
\n\n\n\n<p><strong>Don\u2019t forget: <\/strong>Even though Spark 3.5, Spark 4.0, and Databricks Runtime 14.3 all promise XML support, there are some differences under the hood.<\/p>\n\n\n\n<p>If you\u2019re working with an older Spark version, like Spark 3.5, your initial configuration doesn\u2019t support XML out of the box.<\/p>\n\n\n\n<p>You\u2019re forced to rely on an external library like spark-xml, which means <a href=\"https:\/\/mvnrepository.com\/artifact\/com.databricks\/spark-xml_2.12\/0.18.0\" target=\"_blank\" rel=\"noreferrer noopener\">managing JAR files<\/a> and dealing with <a href=\"https:\/\/community.databricks.com\/t5\/data-engineering\/spark-xml-not-working-with-databricks-connect-and-pyspark\/td-p\/13802\" target=\"_blank\" rel=\"noreferrer noopener\">compatibility issues<\/a>.<\/p>\n\n\n\n<p>Only with Spark 4.0 and Databricks Runtime 14.3 does XML finally get built-in support, something that arguably should have been standard long ago.<\/p>\n\n\n\n<p>And if you&#8217;re using Databricks, starting from Runtime 14.3, you can load XML files using <a href=\"https:\/\/docs.databricks.com\/aws\/en\/ingestion\/cloud-object-storage\/auto-loader\/\" target=\"_blank\" rel=\"noreferrer noopener\">Auto Loader<\/a>; although \u201cno manual parsing\u201d sounds great on paper, there are still <a href=\"https:\/\/community.databricks.com\/t5\/data-engineering\/databricks-autoloader-is-getting-stuck-and-does-not-pass-to-the\/td-p\/31396\" target=\"_blank\" rel=\"noreferrer noopener\">a few details you\u2019ll want to watch out for<\/a>.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<div class=\"note-block-content\"><p><strong><em>Spoiler alert:<\/em><\/strong><em> As I show later on in this blog post, <\/em><a href=\"#post-27580-_5zmw8j2uy9oh\"><em>I couldn\u2019t get Auto Loader to automate my workflow.<\/em><\/a><\/p>\n<\/div>\n<\/div>\n\n\n<p>And while spark-xml has been a trusty sidekick for years, <a href=\"https:\/\/github.com\/databricks\/spark-xml\/releases\" target=\"_blank\" rel=\"noreferrer noopener\">its maintenance is winding down<\/a>, so if you\u2019re thinking long-term, it\u2019s smart to plan ahead.<\/p>\n\n\n\n<p>If you&#8217;re looking to future-proof your workflows, using native XML in Spark 4.0 or <a href=\"https:\/\/docs.databricks.com\/gcp\/en\/release-notes\/runtime\" target=\"_blank\" rel=\"noreferrer noopener\">Databricks 14.3+<\/a> is the clearest, safest path forward.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro Tip<\/h4>\n    \t<div class=\"note-block-content\"><p>Even if you\u2019ve mastered every feature I just walked you through, don\u2019t make the mistake of thinking Spark or Databricks can magically handle any XML or XSD file you toss their way.<\/p>\n<p>Deeply nested XML can break your workflow, XSD support doesn\u2019t fully capture schema constraints, and what about multi-file 
XSDs?<\/p>\n<p><a href=\"#post-27580-_labm59hczlfe\">Later on in this blog post<\/a>, I review all these errors that came up in my testing (and more).<\/p>\n<p>Let\u2019s just say things get messy. If you&#8217;re already itching for a shortcut, feel free to skip ahead; I\u2019ll be showing you an automated solution that handles all of this <a href=\"#post-27580-_iz5zldj0jdwk\"><em>without a single line of code.<\/em><\/a><\/p>\n<\/div>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><a id=\"post-27580-_d2n99ff55r53\"><\/a>How to convert XML to Delta Table with Spark 3.5 and spark-xml?<\/h2>\n\n\n\n<p>Alright, enough with theory and introductions; it\u2019s time to get real.<\/p>\n\n\n\n<p>In this section, I\u2019ll show you the steps to get your source XML into a Delta Table in Spark.<\/p>\n\n\n\n<p>Before I start, I assume you have already installed Spark and downloaded the JAR files for spark-xml and Delta Lake (if not, you may check <a href=\"https:\/\/iamholumeedey007.medium.com\/how-to-install-pyspark-on-your-local-machine-0fcd1c14d0bc\" target=\"_blank\" rel=\"noreferrer noopener\">this article<\/a>).<\/p>\n\n\n\n<p>For this workflow, I\u2019ve used <a href=\"https:\/\/spark.apache.org\/downloads.html\" target=\"_blank\" rel=\"noreferrer noopener\">Spark 3.5.5<\/a> (with <a href=\"https:\/\/www.oracle.com\/java\/technologies\/downloads\/#java8\" target=\"_blank\" rel=\"noreferrer noopener\">Java 8<\/a>), as well as <a href=\"https:\/\/mvnrepository.com\/artifact\/com.databricks\/spark-xml\" target=\"_blank\" rel=\"noreferrer noopener\">version 0.14.0 for spark-xml<\/a> in combination with <a href=\"https:\/\/mvnrepository.com\/artifact\/io.delta\/delta-core\" target=\"_blank\" rel=\"noreferrer noopener\">Delta Lake 2.1.0<\/a>.<\/p>\n\n\n\n<p>Oh, and do you want to follow along with my example? Grab my <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">simple XML test case<\/a>; it\u2019s the perfect starting point to see how this process works.<\/p>\n\n\n\n<p>It\u2019s one of my go-to test files from <a href=\"https:\/\/sonra.io\/xml-to-csv-converters-compared\/#xml-to-csv-conversion-test-cases\" target=\"_blank\" rel=\"noreferrer noopener\">a broader set I use to test various tools<\/a>.<\/p>\n\n\n\n<p>Ready? 
Let\u2019s go!<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<div class=\"note-block-content\"><p><em>This approach is fine for simple XML and demos, but if you need repeatable conversion across changing XMLs, jump to <\/em><a href=\"#post-27580-_iz5zldj0jdwk\"><em>the automated option (Flexter)<\/em><\/a><em>.<\/em><\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>Step 1: Set Up Your Python Environment and Install PySpark<\/strong><\/p>\n\n\n\n<p>Assuming you\u2019ve already got Spark configured (because we\u2019re past that, right?), now we\u2019re diving into setting up your development environment.<\/p>\n\n\n\n<p>To keep things simple, I\u2019m using Visual Studio Code, since it enables you to <a href=\"https:\/\/code.visualstudio.com\/docs\/python\/python-tutorial\" target=\"_blank\" rel=\"noreferrer noopener\">create the Python environment easily through their UI<\/a> and manage dependencies.<\/p>\n\n\n\n<p>Once that\u2019s set up, don\u2019t forget <a href=\"https:\/\/iamholumeedey007.medium.com\/how-to-install-pyspark-on-your-local-machine-0fcd1c14d0bc\" target=\"_blank\" rel=\"noreferrer noopener\">to install PySpark<\/a> (yes, another dependency) within your new environment.<\/p>\n\n\n\n<p><strong>Step 2: Configure your Spark session with the necessary libraries<\/strong><\/p>\n\n\n\n<p>To get our Spark session ready, you\u2019ll need to build it and provide links to the JAR files for spark-xml and delta-core.<\/p>\n\n\n\n<p>So, if the JAR files are in place, you should be able to set up your Spark session using the following code:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">from pyspark.sql import SparkSession\n\n# Path to the JAR files\nspark_jar_path = \"file:\/\/\/C:\/spark\/spark-3.5.5-bin-hadoop3\/jars\"\n\n# Start Spark session with `spark-xml` and `delta-core` JARs, and Delta extensions enabled\nspark = SparkSession.builder \\\n\t.appName(\"XML to Delta Table Conversion\") \\\n\t.config(\"spark.jars\", f\"{spark_jar_path}\/spark-xml_2.12-0.14.0.jar,{spark_jar_path}\/delta-core_2.12-2.1.0.jar\") \\\n\t.config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\") \\\n\t.config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\") \\\n\t.getOrCreate()\n<\/pre><\/div>\n\n\n\n<p>And please note that the Python variable <em>spark_jar_path<\/em> should point to the location where you installed Spark on your local machine and the respective <em>jars<\/em> subfolder.<\/p>\n\n\n\n<p><strong>Step 3: Read the XML file with spark-xml<\/strong><\/p>\n\n\n\n<p>Next up, it\u2019s time to bring that messy XML file into Spark.<\/p>\n\n\n\n<p>With the spark-xml library, you can tell Spark to interpret the XML schema and flatten it into a DataFrame.<\/p>\n\n\n\n<p>But we need to tell Spark what the root element is (which one should act as the \u201crow\u201d in the DataFrame).<\/p>\n\n\n\n<p>Please note that I\u2019ve placed the <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">simple XML test case<\/a> in the root folder of my project (at the same level as my script).<\/p>\n\n\n\n<p>Here\u2019s how the Python code for this step looks:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Path to the XML file\nxml_file_path = \"simplexmltocsv_1.xml\"\n\n# Read XML data into a DataFrame using 
spark-xml\ndf_company = spark.read.format(\"xml\") \\\n\t.option(\"rowTag\", \"Company\") \\\n\t.load(xml_file_path)\n<\/pre><\/div>\n\n\n\n<p>At this point, your<em> df_company<\/em> DataFrame contains a flattened version of your XML, so you&#8217;re almost ready to write it to a Delta Table, right?<\/p>\n\n\n\n<p><strong>Not quite<\/strong>. The spark-xml library handles the <strong>initial flattening<\/strong> by reading the XML and converting it into a DataFrame based on the row tag.<\/p>\n\n\n\n<p>But if your XML is more complex and has multiple levels of nesting (which, let\u2019s be honest, it probably does), you&#8217;ll need to handle the rest of the flattening manually.<\/p>\n\n\n\n<p>And for that, you&#8217;ll need to roll up your sleeves and dive into Step 4.<\/p>\n\n\n\n<p><strong>Step 4: Manually flatten the XML throughout its hierarchy<\/strong><\/p>\n\n\n\n<p>Sorry to say: Here\u2019s where you start to feel the manual pain, because you\u2019ll have to unpack the XML layers one by one, like peeling an onion.<\/p>\n\n\n\n<p>For that, we\u2019ll use PySpark\u2019s explode() command. Here\u2019s how the code looks for the first explode statement:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">from pyspark.sql.functions import col, explode\n\n# Explode statement 1: Exploding the Department &gt; Team\ndf_departments = df_company.select(\n\tcol(\"_name\").alias(\"company_name\"),  # Access the company name\n\tcol(\"Department._name\").alias(\"department_name\"),  # Access the department name\n\texplode(\"Department.Team\").alias(\"team\")  # Exploding the Team array inside Department\n)\n<\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>And then you\u2019ll need to keep exploding..<\/strong><\/p>\n\n\n\n<p>It\u2019s like a Russian doll of XML data, and you\u2019re stuck opening every little one by hand. Don\u2019t worry, you\u2019ll be <em>almost<\/em> done when you hit the last layer.<\/p>\n\n\n\n<p>Here\u2019s how the rest of the code looks for this step:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Explode statement 2: Exploding the Team &gt; Project\ndf_teams = df_departments.select(\n\t\"company_name\",\n\t\"department_name\",\n\tcol(\"team._name\").alias(\"team_name\"),  # Access team name after explosion\n\texplode(\"team.Project\").alias(\"project\")  # Exploding the Project array inside Team\n)\n\n# Explode statement 3: Exploding the Project &gt; Task\ndf_projects = df_teams.select(\n\t\"company_name\",\n\t\"department_name\",\n\t\"team_name\",\n\tcol(\"project._name\").alias(\"project_name\"),  # Access the project name after explosion\n\texplode(\"project.Task\").alias(\"task\")  # Exploding the Task array inside Project\n)\n\n# Explode statement 4: Exploding the Task &gt; Subtask\ndf_tasks = df_projects.select(\n\t\"company_name\",\n\t\"department_name\",\n\t\"team_name\",\n\t\"project_name\",\n\tcol(\"task._name\").alias(\"task_name\"),  # Access the task name\n\texplode(\"task.Subtask\").alias(\"subtask\")  # Exploding the Subtask array inside Task\n)\n\n# Explode statement 5: Final selection of required fields from Subtask\ndf_final = df_tasks.select(\n\t\"company_name\",\n\t\"department_name\",\n\t\"team_name\",\n\t\"project_name\",\n\t\"task_name\",\n\tcol(\"subtask._name\").alias(\"subtask_name\")  # Access the subtask name\n)\n<\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>And just like that, your data\u2019s in place. 
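<\/p>\n\n\n\n<p>By the way, you don\u2019t have to write five separate statements; the same flattening can be condensed into one chained expression. Here\u2019s a minimal sketch that is equivalent to the explode() chain above (same test file, same output, and unfortunately the same fragility when the structure changes):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Equivalent to explode statements 1-5, chained into a single expression\ndf_final = (\n\tdf_company\n\t.select(col(\"_name\").alias(\"company_name\"),\n\t\tcol(\"Department._name\").alias(\"department_name\"),\n\t\texplode(\"Department.Team\").alias(\"team\"))\n\t.select(\"company_name\", \"department_name\",\n\t\tcol(\"team._name\").alias(\"team_name\"),\n\t\texplode(\"team.Project\").alias(\"project\"))\n\t.select(\"company_name\", \"department_name\", \"team_name\",\n\t\tcol(\"project._name\").alias(\"project_name\"),\n\t\texplode(\"project.Task\").alias(\"task\"))\n\t.select(\"company_name\", \"department_name\", \"team_name\", \"project_name\",\n\t\tcol(\"task._name\").alias(\"task_name\"),\n\t\texplode(\"task.Subtask\").alias(\"subtask\"))\n\t.select(\"company_name\", \"department_name\", \"team_name\", \"project_name\",\n\t\t\"task_name\", col(\"subtask._name\").alias(\"subtask_name\"))\n)\n<\/pre><\/div>\n\n\n\n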
<p>But after all this manual work, is it still how you want it?<\/p>\n\n\n\n<p>If you&#8217;re working on a real-world project, most likely not, because you will still have to account for <a href=\"https:\/\/sonra.io\/convert-xml-to-csv\/#option-2-normalised-xml-conversion-to-csv\" target=\"_blank\" rel=\"noreferrer noopener\">key challenges regarding flattened data<\/a>.<\/p>\n\n\n\n<p>Here\u2019s the kicker: <a href=\"https:\/\/sonra.io\/convert-xml-with-spark-to-parquet\/#the-limitations-of-the-spark-xml-library\" target=\"_blank\" rel=\"noreferrer noopener\">schema evolution is a known weak spot in Spark<\/a>. So, if your XML structure changes (and it will), you\u2019ll be back at the keyboard tweaking your code.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>Yes, there are other approaches to flattening your XML, like using selectExpr() or withColumn().<\/p>\n<p>But let\u2019s be real: in the end, you\u2019re still left with a flattened table.<\/p>\n<p>The spark-xml library can help you parse and flatten, but it doesn&#8217;t go the extra mile to normalise your <a target=\"_blank\" href=\"https:\/\/sonra.io\/xml-to-database-converter\/#must-have-features-for-xml-to-sql-converters\">XML into an efficient relational schema<\/a>.<\/p>\n<p>You\u2019ll end up with a &#8220;one big table&#8221; (OBT) that&#8217;s only usable for simple scenarios. Once you go beyond three levels of nesting, it tends to become unusable.<\/p>\n<p>For better scalability and efficiency, <a target=\"_blank\" href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/#make-it-easy-with-flexter\">you\u2019ll need to explore other strategies for schema normalisation<\/a>.<\/p>\n<p>Or you may skip ahead to where I give you <a href=\"#post-27580-_iz5zldj0jdwk\">the answer to XML to Delta Table<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>Step 5: Save the flattened table to Delta<\/strong><\/p>\n\n\n\n<p>This is the part where your DataFrame finally gets saved as a Delta Table.<\/p>\n\n\n\n<p>You\u2019ve made it through the forest of explode(), and now it\u2019s time to store your work in a nice, organised Delta Table.<\/p>\n\n\n\n<p>Here\u2019s the code for that (since we\u2019re just testing, I saved it as a temporary view rather than writing out a permanent Delta Table):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Step 5.1 Register the DataFrame as a temporary view in the Spark catalog\ndf_final.createOrReplaceTempView(\"df_final_view\")\n<\/pre><\/div>\n\n\n\n<p>And then you should be able to query and see your data:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Step 5.2: Run a SQL query on the registered view\nsql_query = \"\"\"\n            \tSELECT\n                \tcompany_name,\n                \tdepartment_name,\n                \tteam_name,\n                \tproject_name,\n                \ttask_name,\n                \tsubtask_name\n            \tFROM df_final_view\n            \tWHERE department_name = 'Research and Development'\n        \t\"\"\"\n\n# Step 5.3: Execute the query using Spark SQL\nresult = spark.sql(sql_query)\n\n# Step 5.4: Show the result of the query\nresult.show(truncate=False)\n<\/pre><\/div>\n\n\n\n<p>If all goes well, you should get the following result:<\/p>\n\n\n\n<figure 
class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/project-management-table-for-tech-solutions-team-t-1.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>There you have it: your XML is now converted into a Delta Table.<\/p>\n\n\n\n<p>But hold up before you start celebrating. Let\u2019s be honest: this process is far from automated.<\/p>\n\n\n\n<p>You\u2019ve probably done your fair share of explode() gymnastics, and if the XML structure changes even a little, guess what? You\u2019re back to tweaking code.<\/p>\n\n\n\n<p>Spark-xml and Spark 3.5 aren\u2019t exactly <a href=\"#post-27580-_iz5zldj0jdwk\">the \u201cset it and forget it\u201d solution we all wish for<\/a>.<\/p>\n\n\n\n<p>So, what happens when you switch to Databricks (14.3 or later)? Keep reading to learn more in the next section!<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>Next up, I\u2019ll walk you through how to convert XML to Delta table using Databricks.<\/p>\n<p>With the recent full release of <a target=\"_blank\" href=\"https:\/\/spark.apache.org\/releases\/spark-release-4-0-0.html#xml\">Spark 4.0<\/a>, it&#8217;s worth noting that <a target=\"_blank\" href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-44265\">spark-xml is now integrated directly into Spark<\/a>.<\/p>\n<p>So, while the following workflow focuses specifically on using Databricks, my local testing suggests that converting XML to Delta with Spark 4.0 (outside of Databricks) follows similar steps.<\/p>\n<\/div>\n<\/div>\n\n\n<h2 class=\"wp-block-heading\"><a id=\"post-27580-_labm59hczlfe\"><\/a>How to convert XML to Delta in Databricks<\/h2>\n\n\n\n<p>Parsing XML in modern data platforms should be easy. 
After all, we\u2019ve had semi-structured data for decades.<\/p>\n\n\n\n<p>But as I\u2019ll show in this section, if you\u2019ve ever tried working with XML in Databricks, even with version 14.3 that includes native XML support, you\u2019ll quickly realise that \u201cmodern\u201d doesn\u2019t mean \u201cconvenient.\u201d<\/p>\n\n\n\n<p>Next, I\u2019ll walk through the exact steps I followed to convert <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">a simple XML file<\/a> into a Delta Table using Databricks Community Edition.<\/p>\n\n\n\n<p><strong>Step 1: Register for Databricks Community Edition<\/strong><\/p>\n\n\n\n<p>You may start by Googling \u201cDatabricks Community Edition\u201d and clicking through to <a href=\"https:\/\/community.cloud.databricks.com\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/community.cloud.databricks.com<\/a>.<\/p>\n\n\n\n<p>Instead of inserting your email in this first widget and clicking \u201cContinue with email\u201d, you should click <strong>Sign-up<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-community-edition-sign-in-page-with-sig.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Once on the next screen, follow the typical routine of entering your email, verifying through a code you will receive in your inbox, and logging in.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-community-edition-email-sign-up-form-fo.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Eventually, you\u2019ll land on the Databricks dashboard, which should look like this.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-showcasing-features-for-data.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Step 2: Set Up a Cluster<\/strong><\/p>\n\n\n\n<p>Next, we need to set up a compute cluster as the foundation where our XML conversions will run.<\/p>\n\n\n\n<p>From the Databricks dashboard, you may click \u201cNew\u201d as shown in the screenshot below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-highlighting-new-button-for-d.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>And from the menu that pops up in the upper left corner, you may select \u201cCluster\u201d:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-highlighting-the-cluster-crea.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>In the new webpage that pops up, you\u2019ll be asked to name your cluster and select a runtime version.<\/p>\n\n\n\n<p>The most important part is selecting Databricks runtime version 14.3 or higher, so that you get the native XML support. 
Hit \u201cCreate compute\u201d.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-for-creating-a-new-compute-cl.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Now wait. Databricks takes a few minutes to spin up the cluster.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<div class=\"note-block-content\"><p>While Databricks takes its sweet time spinning up your cluster (yes, it can take 3 to 5 minutes), why not make good use of the wait?<\/p>\n<p>I\u2019ve put together a separate post on <a target=\"_blank\" href=\"https:\/\/sonra.io\/9-critical-types-of-xml-tools-for-developers\/\">9 essential tools every developer working with XML should know about<\/a>.<\/p>\n<p>This resource is definitely worth a skim before diving deeper into XML parsing.<\/p>\n<\/div>\n<\/div>\n\n\n<p>To check whether your compute cluster has completed its setup, you may go to the main dashboard and, from the left side panel, click \u201cCompute\u201d, where you\u2019ll find a list of clusters.<\/p>\n\n\n\n<p>It should look like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-showing-compute-configuration.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><strong>Step 3: Upload the XML File<\/strong><\/p>\n\n\n\n<p>In the next step, let\u2019s try to upload <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">a simple XML test case<\/a> to the cluster you just created.<\/p>\n\n\n\n<p>Back at the dashboard, click \u201cNew\u201d again and select \u201cAdd or upload data\u201d.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-showing-data-upload-feature-f.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>You\u2019ll be redirected to a new page where you\u2019ll get a drag-and-drop interface for uploading your file. I highlight it with a red box in the next screenshot.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/upload-files-to-create-new-tables-in-databricks-wo.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Once your XML file is uploaded, Databricks assigns a filepath within its internal filesystem (<a href=\"https:\/\/docs.databricks.com\/aws\/en\/dbfs\/\" target=\"_blank\" rel=\"noreferrer noopener\">DBFS<\/a>). For example:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/create-new-table-in-databricks-upload-xml-file-fro.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>You should look for the green tick sign and the file&#8217;s location, which is displayed next to it.<\/p>\n\n\n\n<p>The rest of the options (i.e. \u201cCreate Table with UI\u201d and \u201cCreate Table in Notebook\u201d) should be ignored for now.<\/p>\n\n\n\n<p>In this case, the file&#8217;s location is <em>\u201c\/FileStore\/tables\/simpleXMLtocsv_2_-6.xml\u201d<\/em>. 
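<\/p>\n\n\n\n<p>If you ever lose track of where an upload landed, you can list the folder straight from a notebook cell. A quick sanity check, assuming the default upload location:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># List the DBFS upload folder to confirm the exact file path\ndisplay(dbutils.fs.ls(\"dbfs:\/FileStore\/tables\/\"))\n<\/pre><\/div>\n\n\n\n<p>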
Keep a note of the location as we\u2019ll use it to read the file from our Python script in the next few steps.<\/p>\n\n\n\n<p><strong>Step 4: Create a Notebook<\/strong><\/p>\n\n\n\n<p>A few more preparatory steps are needed before writing your XML to Databricks.<\/p>\n\n\n\n<p>In Step 4, you need to create a Notebook, which you\u2019ll use to write your Python scripts.<\/p>\n\n\n\n<p>Click \u201cNew\u201d and then \u201cNotebook\u201d and give it a name. Attach it to your running cluster.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-workspace-interface-highlighting-notebo.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Once created, you\u2019ll finally have a code environment to write your manual code. You may change the notebook name as I show below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/xml-to-delta-tables-using-spark-for-efficient-data.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>If the notebook is created successfully, it will also show up in your main dashboard: <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-interface-showcasing-recent-notebooks-f.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><strong>Step 5: Read the XML File with (native) spark-xml<\/strong><\/p>\n\n\n\n<p>In this step, we\u2019ll write our Python code to read and parse the source XML file in the Databricks notebook.<\/p>\n\n\n\n<p>Start with some imports in the first cell:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">from pyspark.sql.functions import col, explode<\/pre><\/div>\n\n\n\n<p>Then, in the next cell, you should specify the file path to the location of the XML in DBFS (as it occurs in Step 3). Then read the XML file with the following code:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Path to uploaded XML file\ninput_path = \"dbfs:\/FileStore\/tables\/simple_XML_test_cases\/simplexmltocsv_1.xml\"\ndf_company = spark.read.format(\"xml\") \\\n.option(\"rowTag\", \"Company\") \\\n.option(\"attributePrefix\", \"\") \\\n.load(input_path)<\/pre><\/div>\n\n\n\n<p><strong>Note:<\/strong> spark-xml requires you to know your rowTag ahead of time; there\u2019s no auto-detection or schema guessing. 
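<\/p>\n\n\n\n<p>Before you write any flattening logic, it\u2019s worth checking what Spark actually inferred, since every explode() in the next step depends on this structure:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Inspect the inferred schema before writing any explode() logic\ndf_company.printSchema()\n<\/pre><\/div>\n\n\n\n<p>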
If the XML is complex or inconsistent, expect parsing errors or missing data.<\/p>\n\n\n\n<p>As discussed earlier in the <a href=\"#post-27580-_hne1m2w3fz9t\">Databricks features section<\/a>, it is important to consider what the extra parameters of the command mean for parsing the XML:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>rowTag <\/strong>is a required parameter that specifies the XML element that should be treated as one row in the resulting DataFrame.<\/li>\n\n\n\n<li>We use <strong>attributePrefix <\/strong>to simplify column names in the resulting DataFrame.<\/li>\n<\/ul>\n\n\n\n<p><strong>Step 6: Flatten the XML with the explode() command<\/strong><\/p>\n\n\n\n<p>Now comes the real work: manually navigating the XML structure layer by layer with explode().<\/p>\n\n\n\n<p>In this step, you\u2019ll need to use PySpark\u2019s explode() command (as imported in Step 5) to navigate down the XML tree, unpacking arrays of child elements step by step so that you end up with a clean table.<\/p>\n\n\n\n<p>This process is called flattening the XML and <a href=\"https:\/\/sonra.io\/convert-xml-to-csv\/#option-2-normalised-xml-conversion-to-csv\" target=\"_blank\" rel=\"noreferrer noopener\">has several limitations<\/a> compared to normalisation, which is the <a href=\"https:\/\/www.youtube.com\/watch?v=NTMn4t-okMM&amp;ab_channel=SonraClips\" target=\"_blank\" rel=\"noreferrer noopener\">state-of-the-art approach to XML conversion<\/a>.<\/p>\n\n\n\n<p>Before getting started with this step, you\u2019ll need to know the structure of your source XML very well (<a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">the simple XML test case<\/a> for this workflow).<\/p>\n\n\n\n<p>Start with a basic selection to get company and department names:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">df_departments = df_company.select(\n\tcol(\"name\").alias(\"company_name\"),\n\tcol(\"Department.name\").alias(\"department_name\"),\n\tcol(\"Department.Team\").alias(\"teams\")  # We'll explode this in the next few commands\n)\n<\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Then keep exploding each nested level. 
Here\u2019s how the code looks for this manual approach:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">df_teams = df_departments.select(\n\t\"company_name\",\n\t\"department_name\",\n\texplode(\"teams\").alias(\"team\")\n)\n\ndf_projects = df_teams.select(\n\t\"company_name\",\n\t\"department_name\",\n\tcol(\"team.name\").alias(\"team_name\"),\n\texplode(\"team.Project\").alias(\"project\")\n)\n\ndf_tasks = df_projects.select(\n\t\"company_name\",\n\t\"department_name\",\n\t\"team_name\",\n\tcol(\"project.name\").alias(\"project_name\"),\n\texplode(\"project.Task\").alias(\"task\")\n)\n<\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>Each explode() gets you one level closer to a usable table.<\/p>\n\n\n\n<p>It\u2019s tedious, sure, but at least predictable (at least <a href=\"https:\/\/sonra.io\/convert-xml-with-spark-to-parquet\/#the-limitations-of-the-spark-xml-library\" target=\"_blank\" rel=\"noreferrer noopener\">until the source XML structure changes<\/a>, which in a real-world XML conversion project it eventually will).<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">df_subtasks = df_tasks.select(\n\t\"company_name\",\n\t\"department_name\",\n\t\"team_name\",\n\t\"project_name\",\n\tcol(\"task.name\").alias(\"task_name\"),\n\texplode(\"task.Subtask\").alias(\"subtask\")\n)\n\ndf_final = df_subtasks.select(\n\t\"company_name\",\n\t\"department_name\",\n\t\"team_name\",\n\t\"project_name\",\n\t\"task_name\",\n\tcol(\"subtask.name\").alias(\"subtask_name\")\n)\n<\/pre><\/div>\n\n\n\n<p><\/p>\n\n\n\n<p>By the end of Step 6, you\u2019ll have a flattened table in df_final. All that\u2019s left is to write df_final to a Delta table, as we do in Step 7 (I promise it\u2019s the last step).<\/p>\n\n\n\n<p><strong>Step 7: Save as Delta Table<\/strong><\/p>\n\n\n\n<p>To check your flattened table (i.e. DataFrame), display it from a cell of your notebook (display() is Databricks-specific; in plain Spark, use df_final.show()):<\/p>
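<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Render the flattened DataFrame as a table in the notebook\ndisplay(df_final)<\/pre><\/div>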
<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/data-table-showcasing-tech-project-tasks-for-effic-1.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>If you are happy with the result (despite the repeating\/redundant values), then you may write to a Delta Table as follows:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">df_final.write.format(\"delta\") \\\n\t.mode(\"overwrite\") \\\n\t.saveAsTable(\"company_full_subtasks\")\n<\/pre><\/div>\n\n\n\n<p>Now, to check your Delta Table in Databricks, navigate to the <strong>Catalog <\/strong>tab on the left, and check if your table appears. It should.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/data-management-interface-showcasing-table-catalog.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Double-click it, and you\u2019ll be able to view it in your browser:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><span class=\"m-posts__contentImg\"><img decoding=\"async\" src=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/databricks-schema-overview-with-sample-data-for-pr-1.png\" alt=\"\"\/><\/span><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Congrats! Your source XML data is now converted into a Delta Table in Databricks.<\/p>\n\n\n\n<p>But remember, if the XML changes, so does your manual code in Steps 5 and 6 (well, and every other step).<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>XML conversion projects aren\u2019t your average data task; they\u2019re a different beast entirely.<\/p>\n<p>Even seasoned data engineers can hit roadblocks without <a target=\"_blank\" href=\"https:\/\/sonra.io\/xml-the-6-factors-you-need-to-get-right-to-make-your-xml-conversion-project-a-success\/\">the right success factors<\/a> in place.<\/p>\n<p>If you\u2019re considering whether to tackle it in-house or outsource the conversion and free up your team for higher-impact work, you\u2019re not alone. Here\u2019s <a target=\"_blank\" href=\"https:\/\/sonra.io\/should-you-use-outsourced-xml-conversion-services\/\">a helpful resource to guide that decision<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>Verdict: <\/strong>Let\u2019s be real: this process works, but it\u2019s not elegant. When working with any real-world XML file (other than <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">the simple XML test case<\/a> I\u2019ve used here), you\u2019ll basically be hardcoding your way through the file by trial and error, with a lot of explode().<\/p>\n\n\n\n<p>This approach gets the job done for simple, well-structured XML, but things will break down fast with deeply nested or inconsistent files.<\/p>\n\n\n\n<p>Databricks offers a clean environment for testing small data flows, but it doesn\u2019t solve the complexities of XML parsing. At best, this is a workaround, not a solution built for XML.<\/p>\n\n\n\n<p>For anything more than a handful of static test files, <a href=\"#post-27580-_iz5zldj0jdwk\" target=\"_blank\" rel=\"noreferrer noopener\">you\u2019ll want something more robust<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><a id=\"post-27580-_raf0a7rdtakp\"><\/a>Limitations of converting XML on Spark and Databricks<\/h2>\n\n\n\n<p>I\u2019ve shown you how to convert XML with Spark and Databricks, but here\u2019s what I haven\u2019t told you: the caveats.<\/p>\n\n\n\n<p>During my testing, I uncovered some major limitations that could break your workflow when relying on spark-xml or newer tools like <a href=\"#post-27580-_5zmw8j2uy9oh\">Databricks\u2019 Auto Loader<\/a>.<\/p>\n\n\n\n<p>Below, I\u2019ll walk you through the hidden pitfalls you need to know before going all-in.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a id=\"post-27580-_eq6dw6ojafiw\"><\/a>Limitations of spark-xml-based approaches (Spark and Databricks)<\/h3>\n\n\n\n<p>Databricks 14.3 and Spark 4.0 now bundle spark-xml\u2019s functionality for out-of-the-box XML handling, but don\u2019t let the \u201cnative\u201d label fool you.<\/p>\n\n\n\n<p>Under the hood, the same old issues persist: manual parsing, fragile schema support, and the same flatten-it-all approach that crumbles under real-world XML.<\/p>\n\n\n\n<p>Here\u2019s what my testing <a href=\"https:\/\/sonra.io\/xml-to-csv-converters-compared\/#features-and-evaluation-criteria-xml-to-csv-converter\" target=\"_blank\" rel=\"noreferrer noopener\">with more complex test cases<\/a> revealed, and why you should think twice before relying on it in production.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Manual coding is unavoidable<\/strong><\/li>\n<\/ol>
\n\n\n\n<p>Spark-xml-based pipelines typically require <a href=\"https:\/\/sonra.io\/take-the-pain-out-of-xml-processing-on-spark\/\" target=\"_blank\" rel=\"noreferrer noopener\">heavy manual effort<\/a>.<\/p>\n\n\n\n<p>That means you have to write (and maintain) the code to parse XML elements and map values, handle exceptions, and stitch everything together.<\/p>\n\n\n\n<p>This adds overhead and increases the risk of bugs, particularly for data teams without deep XML experience.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead<\/h4>\n    \t<div class=\"note-block-content\"><p><a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-product-page\/\">Flexter<\/a> automates schema discovery and mapping so you\u2019re not hand-writing (and re-writing) parsing\/flattening logic every time the XML changes.<\/p>\n<p>It\u2019s configuration-driven and repeatable, <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#automation\">which is exactly what manual pipelines aren\u2019t<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li><strong>Flattening leads to One Big Table (OBT) syndrome<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Regardless of the method used, Spark flattens the hierarchical XML structure into a single table. This results in <a href=\"https:\/\/sonra.io\/convert-xml-to-csv\/?_gl=1*16v541x*_up*MQ..*_ga*MTk4MTAwMDYxNy4xNzM0MDg4MTgz*_ga_7H38LVR4Z5*MTczNDA4ODE4MS4xLjEuMTczNDA4ODE4MS4wLjAuMA..#option-2-normalised-xml-conversion-to-csv\" target=\"_blank\" rel=\"noreferrer noopener\">repeated values and inefficient queries<\/a>.<\/p>\n\n\n\n<p><a href=\"https:\/\/sonra.io\/flexter-product-brief-faq\/#data-format-processing-capabilities\" target=\"_blank\" rel=\"noreferrer noopener\">Normalisation<\/a>, where hierarchical branches are broken into multiple related tables with proper primary and foreign keys, isn\u2019t supported natively and must be implemented manually, if at all.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead<\/h4>\n    \t<div class=\"note-block-content\"><p>Instead of dumping everything into a single OBT, <a href=\"https:\/\/sonra.io\/flexter-book-a-demo\/\">Flexter<\/a> outputs <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#normalized-schemas\">normalised relational tables (with appropriate relationships)<\/a>, so queries stay sane, storage stays efficient, and you don\u2019t live inside explode() forever.<\/p>\n<\/div>\n<\/div>\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li><strong>Poor XSD support<\/strong><\/li>\n<\/ol>\n\n\n\n<p>According to my testing, the spark-xml library struggles to <a href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/#creating-sql-tables-with-an-xsd-to-database-schema-tool\" target=\"_blank\" rel=\"noreferrer noopener\">properly interpret XSDs<\/a>, especially when dealing with common features like namespace imports and polymorphism.<\/p>\n\n\n\n<p>In both cases, parsing either failed entirely or produced incomplete schemas, resulting in broken dataframes and missing structure.<\/p>\n\n\n\n<p>This lack of robust XSD support forces engineers to <a href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/#step-by-step-tutorial-converting-xsd-to-tables-in-a-database\" target=\"_blank\" rel=\"noreferrer noopener\">manually translate constraints into the target schema<\/a>, adding time and complexity.<\/p>
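<p>One built-in XSD hook worth knowing about is row-level validation via the rowValidationXSDPath option. A minimal sketch (the XSD path below is a hypothetical DBFS location): it checks each row against the XSD, but it does not use the XSD to build or constrain the target schema.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Validate each parsed row against an XSD; rows that fail are treated like\n# parse errors (handled according to the configured parse mode). The XSD is\n# NOT used to derive the schema.\ndf_validated = spark.read.format(\"xml\") \\\n\t.option(\"rowTag\", \"Company\") \\\n\t.option(\"rowValidationXSDPath\", \"dbfs:\/FileStore\/tables\/schemas\/company.xsd\") \\\n\t.load(\"dbfs:\/FileStore\/tables\/simple_XML_test_cases\/\")<\/pre><\/div>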
class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead?<\/h4>\n    \t<div class=\"note-block-content\"><p><a target=\"_blank\" href=\"https:\/\/docs.sonra.io\/flexter\/master\/docs\/?_gl=1*19sjprv*_gcl_au*MTE5NDY4Mzc5OS4xNzY0MDU1MTYy*_ga*OTA1NTkxMTA4LjE3NjQwNTUxNjI.*_ga_7H38LVR4Z5*czE3NzAxMTE1NTQkbzM0JGcxJHQxNzcwMTExNTU0JGo2MCRsMCRoMA..\">Flexter<\/a> supports full XSD-driven parsing, including patterns that commonly break Spark pipelines (like <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#flexter-automating-xml-json-conversion-for-enterprises\">polymorphism via xsi:type<\/a>), and it uses the XSD to build the target model without you manually translating constraints into code.<\/p>\n<\/div>\n<\/div>\n\n\n<ol class=\"wp-block-list\">\n<li><strong>No built-in schema evolution<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Handling changes in XML structure, such as updated cardinalities, new or missing XPaths, or datatype changes, requires manual intervention.<\/p>\n\n\n\n<p>My testing showed that even enabling the latest Databricks 14.3+ features couldn\u2019t handle schema shifts gracefully.<\/p>\n\n\n\n<p>When elements changed type or structure, Databricks and Spark failed to append new data to existing tables, causing ingestion to break.<\/p>\n\n\n\n<p>I had to intervene and rewrite the parsing and flattening logic manually.<\/p>\n\n\n\n<p>This made my workflow very hard to maintain; I don\u2019t even want to think about the long-term (e.g. what happens if my XML changes in two years from now).<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead?<\/h4>\n    \t<div class=\"note-block-content\"><p>Flexter <a target=\"_blank\" href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/\">detects schema changes and supports automated evolution<\/a> with version tracking and upgrade scripts, so you\u2019re not rebuilding the pipeline every time cardinalities, XPaths, or datatypes shift.<\/p>\n<\/div>\n<\/div>\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Fragile schema inference<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Although <a href=\"#post-27580-_raf0a7rdtakp\"><em>.option(&#8220;inferSchema&#8221;, &#8220;true&#8221;)<\/em><\/a> exists, it\u2019s unreliable across multiple files.<\/p>\n\n\n\n<p>If just one XML file differs in structure, missing an element or changing its nesting, Spark may infer an inconsistent or incorrect schema.<\/p>\n\n\n\n<p>This is a known issue that has been <a href=\"https:\/\/stackoverflow.com\/questions\/42010638\/how-to-load-all-xml-files-from-a-hdfs-directory-using-spark-databricks-xml-parse\" target=\"_blank\" rel=\"noreferrer noopener\">widely reported in community discussions<\/a>.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead?<\/h4>\n    \t<div class=\"note-block-content\"><p><a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-book-a-demo\/\">Flexter\u2019s<\/a> runtime schema checks + logging reduce silent failures and make ingestion runs diagnosable and restartable.<\/p>\n<p>One weird file <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#reliability-and-business-continuity\">shouldn\u2019t be allowed to derail your SLA like it\u2019s the main character<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>And please take note: <\/strong>Fragile schema inference isn\u2019t just \u201cannoying\u201d.<\/p>\n\n\n\n<p><strong>It\u2019s operational 
\n\n\n\n<ol class=\"wp-block-list\" start=\"6\">\n<li><strong>Handling edge cases becomes a hackathon<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Without robust XML support, I had to fall back on custom string logic, regex, or brittle transformation chains to handle XML edge cases.<\/p>\n\n\n\n<p>These solutions are hard to maintain and fail silently when unexpected structures appear.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead<\/h4>\n    \t<div class=\"note-block-content\"><p><a target=\"_blank\" href=\"https:\/\/docs.sonra.io\/flexter\/master\/docs\/?_gl=1*19sjprv*_gcl_au*MTE5NDY4Mzc5OS4xNzY0MDU1MTYy*_ga*OTA1NTkxMTA4LjE3NjQwNTUxNjI.*_ga_7H38LVR4Z5*czE3NzAxMTE1NTQkbzM0JGcxJHQxNzcwMTExNTU0JGo2MCRsMCRoMA..\">Flexter<\/a> is built to handle real-world XML edge cases without regex archaeology.<\/p>\n<p>You get <a target=\"_blank\" href=\"https:\/\/docs.sonra.io\/flexter\/master\/docs\/logging\/\">consistent parsing behaviour plus validation and error reporting<\/a>, instead of a brittle chain of transformations that fails quietly.<\/p>\n<\/div>\n<\/div>\n\n\n<ol class=\"wp-block-list\" start=\"7\">\n<li><strong>You still have to guess your way through schema design<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Even once data is parsed, transforming it into a usable, scalable schema in Spark and Databricks remains challenging.<\/p>\n\n\n\n<p>Nested structures, mixed content, and inconsistent XML fields turn schema design into a high-stakes guessing game, one that\u2019s easy to get wrong and hard to maintain.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead<\/h4>\n    \t<div class=\"note-block-content\"><p><a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-product-page\/\">Flexter<\/a> auto-generates <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#optimisation-algorithms\">an optimised target schema<\/a> (not just \u201cwhatever fell out of flattening\u201d), so you\u2019re not gambling on a table design that becomes unmaintainable the moment the XML grows.<\/p>\n<p>It can also <a target=\"_blank\" href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/#creating-sql-tables-with-an-xsd-to-database-schema-tool\">generate an optimised target schema based on your XSD<\/a>, accurately, and without the guesswork.<\/p>\n<\/div>\n<\/div>\n\n\n<ol class=\"wp-block-list\" start=\"8\">\n<li><strong>Lack of automated documentation<\/strong><\/li>\n<\/ol>\n\n\n\n<p>There\u2019s no out-of-the-box generation of Source-to-Target Mappings (STM), Entity-Relationship Diagrams (ERDs), or schema evolution tracking.<\/p>\n\n\n\n<p>Based on my test cases, this gap creates documentation debt that becomes a huge long-term liability.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead<\/h4>\n    \t<div class=\"note-block-content\"><p><a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-book-a-demo\/\">Flexter<\/a> auto-generates ERDs, STM mappings, lineage, and version diffs, and stores them in <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#metadata-management\">a metadata catalogue<\/a> for governance and change management.<\/p>\n<p>Translation: you don\u2019t pay interest on documentation debt forever.<\/p>\n<\/div>\n<\/div>
\n\n\n<ol class=\"wp-block-list\" start=\"9\">\n<li><strong>Integration &amp; workflow reality check<\/strong><\/li>\n<\/ol>\n\n\n\n<p>In production, you typically need to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Trigger conversions on a schedule or event (CLI\/API),<\/li>\n\n\n\n<li>Monitor runs (status, metrics, logs) and alert on failures,<\/li>\n\n\n\n<li>Orchestrate the whole thing (jobs, retries, backfills).<\/li>\n<\/ul>\n\n\n\n<p>Manual Spark\/Databricks notebook workflows don\u2019t give you that end-to-end.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>What Flexter does instead<\/h4>\n    \t<div class=\"note-block-content\"><p>Flexter supports <a target=\"_blank\" href=\"https:\/\/sonra.io\/flexter-faq-guide\/#integration-and-workflow-support\">API\/CLI-driven runs that are easy to plug into orchestrators<\/a> (for example, Databricks Jobs or Airflow), with run status and logging so failures are visible and retries\/backfills are practical.<\/p>\n<\/div>\n<\/div>\n\n\n<p><a id=\"post-27580-_5zmw8j2uy9oh\"><\/a><\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Eyeing AWS EMR as a workaround for Databricks in your XML pipeline?<\/h4>\n    \t<div class=\"note-block-content\"><p>It might feel like a smart move: more control, lower cost. But don\u2019t let advertisements mislead you.<\/p>\n<p>EMR still drags the same baggage: manual parsing, limited XSD support, and brittle schema inference. And if Redshift is your destination? The cracks show fast.<\/p>\n<p>I\u2019ve tested these scenarios myself and broken them all down so that you don\u2019t have to.<\/p>\n<p><a target=\"_blank\" href=\"https:\/\/sonra.io\/xml-to-redshift-guide\/\">Check out my full XML to Redshift guide<\/a> before you commit to another half-solution.<\/p>\n<\/div>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\">Understanding the Limitations of Databricks\u2019 Auto Loader<\/h3>\n\n\n\n<p>If the Auto Loader feature in Databricks has \u201csparked\u201d your interest (pun intended), you&#8217;re not alone.<\/p>\n\n\n\n<p>This feature by Databricks <a href=\"https:\/\/www.databricks.com\/blog\/announcing-simplified-xml-data-ingestion\" target=\"_blank\" rel=\"noreferrer noopener\">promises simplified XML data ingestion<\/a> from cloud storage into Delta Lake, with just a few lines of code.<\/p>\n\n\n\n<p>At this point, please note that <a href=\"https:\/\/docs.databricks.com\/aws\/en\/ingestion\/cloud-object-storage\/auto-loader\/\" target=\"_blank\" rel=\"noreferrer noopener\">Auto Loader<\/a> is a unique feature of Databricks that supports XML ingestion, starting from Databricks 14.3 and onwards. It is not available in Spark.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Pro tip<\/h4>\n    \t<div class=\"note-block-content\"><p>If you\u2019re doing this in production (multiple files, evolving XML, real SLAs), <a href=\"#post-27580-_iz5zldj0jdwk\">skip ahead to the automated option: Flexter.
<\/a><\/p>\n<p>Flexter is the difference between a pipeline and a permanent debugging hobby.<\/p>\n<\/div>\n<\/div>\n\n\n<p>While this feature sounds convenient, I tested it to see whether I could build an XML conversion workflow around a <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">simple XML test case<\/a>.<\/p>\n\n\n\n<p>Here\u2019s the workflow I tried in order to test Databricks\u2019 Auto Loader:<\/p>\n\n\n\n<p><strong>Steps 1 and 2: Create a Databricks Community account and set up a Cluster<\/strong><\/p>\n\n\n\n<p>These are exactly the same as when I tested <a href=\"#post-27580-_labm59hczlfe\">XML conversion with native support in Databricks<\/a>.<\/p>\n\n\n\n<p><strong>Step 3:<\/strong> <strong>Upload the source XML file to DBFS<\/strong><\/p>\n\n\n\n<p>In step three, there\u2019s a slight change compared to the previous testing; you\u2019ll need to create a subfolder in DBFS (called \u201csimple_XML_test_cases\u201d in my case) and then upload the <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">simple XML test case<\/a> there, as I show in Step 3 (<a href=\"#post-27580-_labm59hczlfe\">here<\/a>) of my previous workflow.<\/p>\n\n\n\n<p>This is because Auto Loader won\u2019t read single files; it monitors folders of files.<\/p>\n\n\n\n<p><strong>Step 4: Create a Databricks Notebook<\/strong><\/p>\n\n\n\n<p>This is exactly the same as I show in my previous <a href=\"#post-27580-_labm59hczlfe\">XML to Databricks workflow (Step 4)<\/a>.<\/p>\n\n\n\n<p><strong>Step 5: Reading and Parsing XML<\/strong><\/p>\n\n\n\n<p>This is where things started to get complicated (exponentially).<\/p>\n\n\n\n<p>In this step, I expected all the promises of Auto Loader to come to fruition on my monitor; instead, I was met with more (hidden) headaches.<\/p>\n\n\n\n<p>The first hidden headache is that Auto Loader always needs a schema, typically derived from an XSD, specified when reading the XML.<\/p>\n\n\n\n<p>Surprisingly, despite schema inference and schema evolution being claimed as the <a href=\"https:\/\/docs.databricks.com\/aws\/en\/ingestion\/cloud-object-storage\/auto-loader\/options\" target=\"_blank\" rel=\"noreferrer noopener\">core Auto Loader features<\/a>, I still had to consider providing an XSD.<\/p>\n\n\n\n<p>But in our <a href=\"https:\/\/sonra.io\/wp-content\/uploads\/2024\/10\/simplexmltocsv.zip\" target=\"_blank\" rel=\"noreferrer noopener\">simple XML test case<\/a>, which is just one of the simpler <a href=\"https:\/\/sonra.io\/xml-to-csv-converters-compared\/#features-and-evaluation-criteria-xml-to-csv-converter\" target=\"_blank\" rel=\"noreferrer noopener\">potential test cases in the real world<\/a>, I don\u2019t have an XSD!<\/p>\n\n\n\n<p>So what I had to do was use the spark-xml-based schema inference, as I did <a href=\"#post-27580-_labm59hczlfe\">in my other workflows<\/a>, and then provide the inferred schema to Auto Loader\u2019s readStream command.<\/p>\n\n\n\n<p>Here\u2019s how the code looks for Step 5.
First, define the imports and filepaths needed for this step:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">from pyspark.sql.functions import col, explode\n\n# Path to the monitored input folder, the single file used for schema inference,\n# and a location where Auto Loader can track schema information\ninput_path_1 = \"dbfs:\/FileStore\/tables\/simple_XML_test_cases\/\"\ninput_path_2 = \"dbfs:\/FileStore\/tables\/simple_XML_test_cases\/simplexmltocsv_1.xml\"\nschema_path = \"dbfs:\/FileStore\/tables\/simple_XML_schema\/\"\n<\/pre><\/div>\n\n\n\n<p>And then use the spark-xml-based parsing and schema inference options:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">df_schema = spark.read.format(\"xml\") \\\n\t.option(\"rowTag\", \"Company\") \\\n\t.option(\"attributePrefix\", \"\") \\\n\t.load(input_path_2)\n\nschema = df_schema.schema\n<\/pre><\/div>\n\n\n\n<p>And based on this derived schema, you can now read and parse the simplest of XML test cases as follows (well, actually, we\u2019re setting up the readStream, which we will trigger later on):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Read XML using Auto Loader, supplying the schema inferred above.\n# cloudFiles.schemaLocation is where Auto Loader tracks schema versions\n# (optional when an explicit schema is passed, as here).\ndf_stream = spark.readStream \\\n\t.format(\"cloudFiles\") \\\n\t.option(\"cloudFiles.format\", \"xml\") \\\n\t.option(\"cloudFiles.schemaLocation\", schema_path) \\\n\t.option(\"rowTag\", \"Company\") \\\n\t.option(\"attributePrefix\", \"\") \\\n\t.schema(schema) \\\n\t.load(input_path_1)\n\n# Preview schema (non-blocking)\ndf_stream.printSchema()\n<\/pre><\/div>\n\n\n\n<p>Now with readStream in place, we have a streaming DataFrame that continuously monitors input_path_1 for new XML files and parses each <em>&lt;Company&gt;<\/em> element into a structured initial table using the provided schema.<\/p>\n\n\n\n<p>Unfortunately, after Step 5, you still don\u2019t get a final flattened table to write to Delta Lake.<\/p>\n\n\n\n<p>You must still follow Step 6, as we did in our <a href=\"#post-27580-_labm59hczlfe\">XML to Databricks workflow<\/a>.<\/p>\n\n\n<div class=\"note-block\">\n\t<div class=\"note-block-icon\"><\/div>\n    \t<h4>Curious why Auto Loader insists on having a schema up front?<\/h4>\n    \t<div class=\"note-block-content\"><p>It\u2019s not just a quirk; Databricks explicitly explains this in their documentation, and you can <a target=\"_blank\" href=\"https:\/\/docs.databricks.com\/aws\/en\/query\/formats\/xml#schema-inference-and-evolution-in-auto-loader\">read more about it here<\/a>.<\/p>\n<\/div>\n<\/div>\n\n\n<p><strong>Step 6: Yes, you need to flatten the XML (again)<\/strong><\/p>\n\n\n\n<p>Auto Loader won\u2019t help normalise your XML into an efficient target schema. It also won\u2019t help you flatten it into One Big Table (OBT).<\/p>\n\n\n\n<p>In this step, you will have to write manual code to flatten the XML, exactly as we did in <strong>Step 6 <\/strong>of the <a href=\"#post-27580-_labm59hczlfe\">XML to Databricks workflow<\/a>.<\/p>\n\n\n\n<p>So you may head over there and see the manual code needed for this step. The only thing you need to change is \u201cdf_company\u201d to \u201cdf_stream\u201d in the first selection.<\/p>
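<p>Concretely, that first selection becomes the following (the rest of the explode chain, df_teams through df_final, stays exactly the same):<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Same flattening as before, but starting from the streaming DataFrame\ndf_departments = df_stream.select(\n\tcol(\"name\").alias(\"company_name\"),\n\tcol(\"Department.name\").alias(\"department_name\"),\n\tcol(\"Department.Team\").alias(\"teams\")\n)<\/pre><\/div>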
<p><strong>Step 7: Write the flattened XML to a Delta Table with a streaming service<\/strong><\/p>\n\n\n\n<p>You\u2019d think that after jumping through all these hoops, you could at least wrap things up by streaming your XML straight into a Delta table.<\/p>\n\n\n\n<p>Well, think again. Auto Loader throws in one last twist: it won\u2019t let you stream directly into an SQL-registered Delta table.<\/p>\n\n\n\n<p>Instead, you first have to write the streaming data to a Delta path using .start(output_path), and only then can you register it as a table with .saveAsTable() or a good old-fashioned CREATE TABLE statement.<\/p>\n\n\n\n<p>Because why make the last step easy?<\/p>\n\n\n\n<p>Here\u2019s how the code for this last step looks:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">from datetime import datetime\n\n# Generate unique timestamp-based paths and table name\ntimestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\noutput_path = f\"dbfs:\/FileStore\/tables\/output\/simple_XML_{timestamp}\"\ncheckpoint_path = f\"dbfs:\/FileStore\/tables\/checkpoints\/simple_XML_{timestamp}\"\n\n# Start streaming flattened data to Delta\nquery = (\n\tdf_final.writeStream\n    \t.format(\"delta\")\n    \t.outputMode(\"append\")\n    \t.option(\"checkpointLocation\", checkpoint_path)\n    \t.trigger(once=True)  # Run one micro-batch and stop\n    \t.start(output_path)\n)\n\nquery.awaitTermination()\n<\/pre><\/div>\n\n\n\n<p>In the code I\u2019m showing you above, I\u2019ve triggered the streaming service <strong>once <\/strong>to read, parse and flatten my XML test case.<\/p>\n\n\n\n<p>At this point, you&#8217;re writing the flattened XML data to a Delta table path in append mode using structured streaming, with a checkpoint for fault tolerance and a unique timestamp to organise the output.<\/p>\n\n\n\n<p>However, this is usually not what we mean when converting to a Delta table, because the data is simply written to a path and not registered as a table in the metastore.<\/p>\n\n\n\n<p>Without registration, you can\u2019t query it like a normal SQL table or manage it through the Databricks UI. That final step still needs to be done manually.<\/p>\n\n\n\n<p>To write your flattened XML to a Delta Table, you need one more step:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \">spark.read.format(\"delta\").load(output_path) \\\n\t.write.format(\"delta\") \\\n\t.mode(\"overwrite\") \\\n\t.saveAsTable(\"company_full_subtasks\")\n<\/pre><\/div>
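<p>Alternatively, the \u201cgood old-fashioned CREATE TABLE statement\u201d route registers the existing Delta path without re-reading and re-writing the data; a minimal sketch, reusing the output_path variable from above:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:default decode:true \"># Register the existing Delta location as an external table\nspark.sql(f\"\"\"\nCREATE TABLE IF NOT EXISTS company_full_subtasks\nUSING DELTA\nLOCATION '{output_path}'\n\"\"\")<\/pre><\/div>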
<p>At long last, after all that manual effort, your XML data is finally written to a Delta Table in Databricks (you should see the same result as in my <a href=\"#post-27580-_labm59hczlfe\">XML to Delta with Databricks workflow<\/a>).<\/p>\n\n\n\n<p>Time to relax? Well, maybe not.<\/p>\n\n\n\n<p>Because any change in your source XML\u2019s structure means you\u2019ll need to revisit and adapt <strong>Step 5<\/strong> all over again.<\/p>\n\n\n\n<p><strong>So, What\u2019s the Catch with Auto Loader?<\/strong><\/p>\n\n\n\n<p><strong>Let\u2019s be honest: <\/strong>while Databricks Auto Loader sounds like it should do all the heavy lifting for you when it comes to converting XML to Delta Tables\u2026 it mostly just watches you do the work.<\/p>\n\n\n\n<p><strong>Here\u2019s the reality:<\/strong> Auto Loader inherits all the manual flattening pain from the classic spark-xml-based approaches in Databricks 14.3 and Spark.<\/p>\n\n\n\n<p>Yes, you&#8217;re still responsible for converting that deeply nested XML into a flat, analytics-friendly One Big Table (OBT). No magic wand here.<\/p>\n\n\n\n<p><strong>But wait, Auto Loader doesn\u2019t just stop at not helping. It adds a few quirks of its own:<\/strong><\/p>\n\n\n\n<p><strong>Schema inference? Not really:<\/strong> Auto Loader takes a step back from spark-xml since it can&#8217;t reliably infer nested XML structure. I had to use spark.read in batch mode to figure out the schema, then hand it over to the stream.<\/p>\n\n\n\n<p><strong>No direct save to SQL tables:<\/strong> You can&#8217;t just stream your results into a SQL-friendly Delta table with .saveAsTable(). Instead, you stream to a raw Delta path and then register it manually, like a bureaucratic form you forgot to fill out.<\/p>\n\n\n\n<p><strong>Lazy execution strikes again:<\/strong> Transformations like explode() don\u2019t run until the stream starts writing. That means your code looks fine&#8230; until it doesn\u2019t, and you find out only after triggering the Spark job.<\/p>\n\n\n\n<p>And who knows what chaos might ensue if you over-explode your XML outside the safe sandbox of Databricks Community Edition.<\/p>\n\n\n\n<p>If it melts a few cores in the process, please don\u2019t send me the invoice.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><a id=\"post-27580-_iz5zldj0jdwk\"><\/a>How to free up your team with an automated XML to Databricks and Spark solution<\/h2>\n\n\n\n<p>If you\u2019ve followed the blog post up to here and tried <a href=\"#post-27580-_d2n99ff55r53\">to replicate my workflows<\/a>, you\u2019ve:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Written custom code to flatten nested XML structures.<\/li>\n\n\n\n<li>Dealt with fragile schema inference and manual XML flattening workarounds.<\/li>\n\n\n\n<li>Probably rebuilt parts of your pipeline every time your XML changed or you tried a different XML file.<\/li>\n\n\n\n<li>Watched Databricks\u2019 \u201cnative support\u201d fall short in real-world scenarios.<\/li>\n\n\n\n<li>And if you wanted to keep some documentation, you had to create your own Source-to-Target mappings from scratch.<\/li>\n<\/ul>\n\n\n\n<p>And that\u2019s just to convert your simple test cases, most likely with up to three levels of nesting.<\/p>\n\n\n\n<p>Now imagine scaling this to a real-world project with hundreds of XML files, frequent schema changes, and tight deadlines.<\/p>\n\n\n\n<p><strong>The problem?<\/strong><\/p>\n\n\n\n<p>Except for the case when you can <a href=\"https:\/\/sonra.io\/should-you-use-outsourced-xml-conversion-services\/\" target=\"_blank\" rel=\"noreferrer noopener\">hire a team of XML Conversion Experts<\/a>, manual XML to Delta (or Parquet) workflows don\u2019t scale.<\/p>\n\n\n\n<p>They\u2019re time-consuming, error-prone, and depend too heavily on custom logic that breaks
easily.<\/p>\n\n\n\n<p><strong>The Alternative: Automated XML to Delta Conversion<\/strong><\/p>\n\n\n\n<p>If only a dedicated, purpose-built XML to Databricks solution existed, right?<\/p>\n\n\n\n<p>What if I told you that it exists, <a href=\"https:\/\/sonra.io\/flexter-product-brief-faq\/\" target=\"_blank\" rel=\"noreferrer noopener\">it\u2019s called Flexter<\/a>, and you are just a few clicks away from converting your XML to Delta in Databricks?<\/p>\n\n\n\n<p><strong>Ready to Scale Your XML Conversion Workflow?<\/strong><\/p>\n\n\n\n<p><strong>Flexter Enterprise<\/strong> is the leading automated XML to Databricks conversion platform, and it outshines all other alternatives because:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#automation\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Automation (it requires zero code):<\/strong><\/a> just download Flexter, <a href=\"https:\/\/docs.sonra.io\/flexter\/master\/docs\/configuration-and-settings\/\" target=\"_blank\" rel=\"noreferrer noopener\">configure your Databricks connection<\/a>, and point it to your XML files or folders.<\/li>\n\n\n\n<li><a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#scalability\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Scalability:<\/strong><\/a> It scales effortlessly, converting XML of any size or volume and adjusting performance based on your infrastructure.<\/li>\n\n\n\n<li><strong>Versatility:<\/strong> <a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#versatility\" target=\"_blank\" rel=\"noreferrer noopener\">It fits any environment:<\/a> use it via CLI or API, or deploy it on-prem or in the cloud.
<\/li>\n\n\n\n<li><strong>It will take your XSD as input<\/strong> and automatically <a href=\"https:\/\/sonra.io\/xsd-to-database-schema-sql-tables\/\" target=\"_blank\" rel=\"noreferrer noopener\">build a relational target schema<\/a> in Databricks or Spark without requiring you to manually add any constraints.<\/li>\n\n\n\n<li><strong>It handles your XML\u2019s complexity with ease<\/strong>: Supports deeply nested XML, multi-file XSDs, and <a href=\"https:\/\/sonra.io\/library-xml-data-standards\/\" target=\"_blank\" rel=\"noreferrer noopener\">industry-standard formats<\/a>.<\/li>\n\n\n\n<li><strong>Normalised Schemas:<\/strong> Your data will not be flattened into One Big Table (OBT), <a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#normalized-schemas\" target=\"_blank\" rel=\"noreferrer noopener\">but will get normalised into multiple connected tables<\/a> for efficient storage and retrieval.<\/li>\n\n\n\n<li><strong>Optimisation Algorithms:<\/strong> <a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#optimisation-algorithms\" target=\"_blank\" rel=\"noreferrer noopener\">It optimises your target schema using state-of-the-art algorithms<\/a> that convert unnecessary hierarchies into simpler one-to-one relationships to improve query performance.<\/li>\n\n\n\n<li><a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#metadata-management\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Metadata Management:<\/strong><\/a> It delivers full documentation out of the box, including ER diagrams, Source-to-Target mappings, schema diffs, and a metadata catalogue.<\/li>\n\n\n\n<li><a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#integration-and-workflow-support\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>It supports your team\u2019s tech stack:<\/strong><\/a> Compatible with multiple OS environments and various data sources (FTP, SFTP, HTTPS, object storage, queues).<\/li>\n\n\n\n<li><a href=\"https:\/\/sonra.io\/flexter-faq-guide\/#reliability-and-business-continuity\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Reliability:<\/strong><\/a> It includes built-in monitoring and validation tools, enabling you to track errors and performance with confidence.<\/li>\n\n\n\n<li><strong>Once you\u2019re familiar with it,<\/strong> you can also convert your XML <a href=\"https:\/\/sonra.io\/flexter-product-brief-faq\/#flexter-capability-sheet\" target=\"_blank\" rel=\"noreferrer noopener\">to other file formats or databases<\/a>, such as CSV, TSV, Snowflake, MySQL, ORC, Avro, etc.<\/li>\n\n\n\n<li><strong>It\u2019s backed by expert support<\/strong>: From setup to troubleshooting, Flexter\u2019s team is ready to assist via email, phone, or support tickets.<\/li>\n<\/ul>\n\n\n\n<p>If my list hasn\u2019t convinced you yet, you can also <a href=\"https:\/\/sonra.io\/flexter-xml-to-sql-guide\/\" target=\"_blank\" rel=\"noreferrer noopener\">try Flexter Online for free to convert your real-world XML files<\/a> with just a drag and drop through your browser:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/sonra.io\/xml-to-csv-converter\/\" target=\"_blank\" rel=\"noreferrer noopener\">Try the Online XML to CSV Converter for FREE<\/a>,<\/li>\n\n\n\n<li><a
href=\"https:\/\/sonra.io\/xml-to-sql-database-converter\/\" target=\"_blank\" rel=\"noreferrer noopener\">Check the Online XML to Snowflake Converter for FREE<\/a>.<\/li>\n<\/ul>\n\n\n\n<p><strong>Convinced and don\u2019t want to waste another minute? See Flexter in action!<\/strong><\/p>\n\n\n\n<p><a href=\"https:\/\/sonra.io\/flexter-book-a-demo\/\" target=\"_blank\" rel=\"noreferrer noopener\">Book a call with Flexter Enterprise<\/a> and find out how it can transform your XML to Delta workflow, at scale.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>You\u2019d think working with XML in Spark 4.0 or Databricks 14.3+ in 2026 would be easy. But it often isn\u2019t. TL;DR: This blog post breaks down how to convert XML into Delta Tables using Spark and Databricks, covering both spark-xml-based workflows and native support in Spark 4.0 and Databricks Runtime 14.3 (or higher). Keep reading, &#8230;<\/p>\n","protected":false},"author":29,"featured_media":27627,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[6],"tags":[],"class_list":["post-27580","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-xml"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>How to Parse XML in Spark and Databricks [2026 Guide]<\/title>\n<meta name=\"description\" content=\"Master XML parsing in Spark and Databricks. Explore spark-xml vs. native features, schema inference, and converting XML to Delta Tables.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/sonra.io\/parse-xml-spark-databricks\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Parse XML in Spark and Databricks [2026 Guide]\" \/>\n<meta property=\"og:description\" content=\"Master XML parsing in Spark and Databricks. Explore spark-xml vs. native features, schema inference, and converting XML to Delta Tables.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/sonra.io\/parse-xml-spark-databricks\/\" \/>\n<meta property=\"og:site_name\" content=\"Sonra\" \/>\n<meta property=\"article:published_time\" content=\"2025-05-22T09:53:03+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-24T12:17:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Maciek\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Maciek\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"32 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/\"},\"author\":{\"name\":\"Maciek\",\"@id\":\"https:\\\/\\\/sonra.io\\\/#\\\/schema\\\/person\\\/f6961e781666bffd0142c5ccc300f219\"},\"headline\":\"How to Parse XML in Spark and Databricks [2026 Guide]\",\"datePublished\":\"2025-05-22T09:53:03+00:00\",\"dateModified\":\"2026-02-24T12:17:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/\"},\"wordCount\":6544,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/sonra.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg\",\"articleSection\":[\"XML\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/\",\"url\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/\",\"name\":\"How to Parse XML in Spark and Databricks [2026 Guide]\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/sonra.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg\",\"datePublished\":\"2025-05-22T09:53:03+00:00\",\"dateModified\":\"2026-02-24T12:17:16+00:00\",\"description\":\"Master XML parsing in Spark and Databricks. Explore spark-xml vs. 
native features, schema inference, and converting XML to Delta Tables.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#primaryimage\",\"url\":\"https:\\\/\\\/sonra.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg\",\"contentUrl\":\"https:\\\/\\\/sonra.io\\\/wp-content\\\/uploads\\\/2025\\\/05\\\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg\",\"width\":1920,\"height\":1080,\"caption\":\"XML parsing guide for Spark and Databricks data integration solutions-min\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/sonra.io\\\/parse-xml-spark-databricks\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/sonra.io\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"XML\",\"item\":\"https:\\\/\\\/sonra.io\\\/category\\\/xml\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"How to Parse XML in Spark and Databricks [2026 Guide]\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/sonra.io\\\/#website\",\"url\":\"https:\\\/\\\/sonra.io\\\/\",\"name\":\"Sonra\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/sonra.io\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/sonra.io\\\/#organization\",\"name\":\"Sonra\",\"alternateName\":\"Sonra.io\",\"url\":\"https:\\\/\\\/sonra.io\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/sonra.io\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/sonra.io\\\/wp-content\\\/uploads\\\/2015\\\/02\\\/sonra-logo-circle.png\",\"contentUrl\":\"https:\\\/\\\/sonra.io\\\/wp-content\\\/uploads\\\/2015\\\/02\\\/sonra-logo-circle.png\",\"width\":600,\"height\":600,\"caption\":\"Sonra\"},\"image\":{\"@id\":\"https:\\\/\\\/sonra.io\\\/#\\\/schema\\\/logo\\\/image\\\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/sonra.io\\\/#\\\/schema\\\/person\\\/f6961e781666bffd0142c5ccc300f219\",\"name\":\"Maciek\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f1f4c86b9824a4e747832130d9194894903c6c4d8171ae528624afefcabea1b1?s=96&d=https%3A%2F%2Fsonra.io%2Fwp-content%2Fuploads%2F2023%2F04%2FScreenshot_15-removebg-preview.png&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f1f4c86b9824a4e747832130d9194894903c6c4d8171ae528624afefcabea1b1?s=96&d=https%3A%2F%2Fsonra.io%2Fwp-content%2Fuploads%2F2023%2F04%2FScreenshot_15-removebg-preview.png&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/f1f4c86b9824a4e747832130d9194894903c6c4d8171ae528624afefcabea1b1?s=96&d=https%3A%2F%2Fsonra.io%2Fwp-content%2Fuploads%2F2023%2F04%2FScreenshot_15-removebg-preview.png&r=g\",\"caption\":\"Maciek\"},\"url\":\"https:\\\/\\\/sonra.io\\\/author\\\/maciek\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"How to Parse XML in Spark and Databricks [2026 Guide]","description":"Master XML parsing in Spark and Databricks. Explore spark-xml vs. native features, schema inference, and converting XML to Delta Tables.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/sonra.io\/parse-xml-spark-databricks\/","og_locale":"en_US","og_type":"article","og_title":"How to Parse XML in Spark and Databricks [2026 Guide]","og_description":"Master XML parsing in Spark and Databricks. Explore spark-xml vs. native features, schema inference, and converting XML to Delta Tables.","og_url":"https:\/\/sonra.io\/parse-xml-spark-databricks\/","og_site_name":"Sonra","article_published_time":"2025-05-22T09:53:03+00:00","article_modified_time":"2026-02-24T12:17:16+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg","type":"image\/jpeg"}],"author":"Maciek","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Maciek","Est. reading time":"32 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#article","isPartOf":{"@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/"},"author":{"name":"Maciek","@id":"https:\/\/sonra.io\/#\/schema\/person\/f6961e781666bffd0142c5ccc300f219"},"headline":"How to Parse XML in Spark and Databricks [2026 Guide]","datePublished":"2025-05-22T09:53:03+00:00","dateModified":"2026-02-24T12:17:16+00:00","mainEntityOfPage":{"@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/"},"wordCount":6544,"commentCount":0,"publisher":{"@id":"https:\/\/sonra.io\/#organization"},"image":{"@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#primaryimage"},"thumbnailUrl":"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg","articleSection":["XML"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/sonra.io\/parse-xml-spark-databricks\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/","url":"https:\/\/sonra.io\/parse-xml-spark-databricks\/","name":"How to Parse XML in Spark and Databricks [2026 Guide]","isPartOf":{"@id":"https:\/\/sonra.io\/#website"},"primaryImageOfPage":{"@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#primaryimage"},"image":{"@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#primaryimage"},"thumbnailUrl":"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg","datePublished":"2025-05-22T09:53:03+00:00","dateModified":"2026-02-24T12:17:16+00:00","description":"Master XML parsing in Spark and Databricks. Explore spark-xml vs. 
native features, schema inference, and converting XML to Delta Tables.","breadcrumb":{"@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/sonra.io\/parse-xml-spark-databricks\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#primaryimage","url":"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg","contentUrl":"https:\/\/sonra.io\/wp-content\/uploads\/2025\/05\/XML-parsing-guide-for-Spark-and-Databricks-data-integration-solutions-min.jpg","width":1920,"height":1080,"caption":"XML parsing guide for Spark and Databricks data integration solutions-min"},{"@type":"BreadcrumbList","@id":"https:\/\/sonra.io\/parse-xml-spark-databricks\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/sonra.io\/"},{"@type":"ListItem","position":2,"name":"XML","item":"https:\/\/sonra.io\/category\/xml\/"},{"@type":"ListItem","position":3,"name":"How to Parse XML in Spark and Databricks [2026 Guide]"}]},{"@type":"WebSite","@id":"https:\/\/sonra.io\/#website","url":"https:\/\/sonra.io\/","name":"Sonra","description":"","publisher":{"@id":"https:\/\/sonra.io\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/sonra.io\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/sonra.io\/#organization","name":"Sonra","alternateName":"Sonra.io","url":"https:\/\/sonra.io\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/sonra.io\/#\/schema\/logo\/image\/","url":"https:\/\/sonra.io\/wp-content\/uploads\/2015\/02\/sonra-logo-circle.png","contentUrl":"https:\/\/sonra.io\/wp-content\/uploads\/2015\/02\/sonra-logo-circle.png","width":600,"height":600,"caption":"Sonra"},"image":{"@id":"https:\/\/sonra.io\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/sonra.io\/#\/schema\/person\/f6961e781666bffd0142c5ccc300f219","name":"Maciek","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/f1f4c86b9824a4e747832130d9194894903c6c4d8171ae528624afefcabea1b1?s=96&d=https%3A%2F%2Fsonra.io%2Fwp-content%2Fuploads%2F2023%2F04%2FScreenshot_15-removebg-preview.png&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/f1f4c86b9824a4e747832130d9194894903c6c4d8171ae528624afefcabea1b1?s=96&d=https%3A%2F%2Fsonra.io%2Fwp-content%2Fuploads%2F2023%2F04%2FScreenshot_15-removebg-preview.png&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f1f4c86b9824a4e747832130d9194894903c6c4d8171ae528624afefcabea1b1?s=96&d=https%3A%2F%2Fsonra.io%2Fwp-content%2Fuploads%2F2023%2F04%2FScreenshot_15-removebg-preview.png&r=g","caption":"Maciek"},"url":"https:\/\/sonra.io\/author\/maciek\/"}]}},"_links":{"self":[{"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/posts\/27580","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/users\/29"}],"replies":[{"embeddable":true,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/comments?post=27580"}],"version-history":[{"count":12,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/posts\/27580\/rev
isions"}],"predecessor-version":[{"id":28151,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/posts\/27580\/revisions\/28151"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/media\/27627"}],"wp:attachment":[{"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/media?parent=27580"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/categories?post=27580"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sonra.io\/wp-json\/wp\/v2\/tags?post=27580"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}