
For a set of dataframes

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

to union all of them I do

df1.unionAll(df2).unionAll(df3)

Is there a more elegant and scalable way of doing this for any number of dataframes, for example from

Seq(df1, df2, df3) 

5 Answers

128

For pyspark you can do the following:

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1,df2,df3]
df = reduce(DataFrame.unionAll, dfs)

It's also worth noting that the columns must appear in the same order in all the dataframes in the list for this to work. If they don't, the union can silently produce incorrect results!

If you are using pyspark 2.3 or greater, you can use unionByName so you don't have to reorder the columns.
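To see the fold that reduce performs here, a plain-Python illustration may help (list concatenation stands in for DataFrame.unionAll, so no Spark is needed):

```python
from functools import reduce

# List concatenation stands in for the union operation to show the fold order:
# reduce(op, [a, b, c]) computes op(op(a, b), c), i.e. a left-to-right chain.
dfs = [[1, 2], [3, 4], [5, 6]]
combined = reduce(lambda a, b: a + b, dfs)
print(combined)  # [1, 2, 3, 4, 5, 6]
```

The same chaining happens with DataFrames: each call produces one combined result that is fed into the next union.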


6 Comments

Please remember the point mentioned in bold.
Using Python's reduce means that the operations don't occur in parallel though.. correct?
How can i add a parameter like allowMissingColumns=True?
DataFrame.unionAll is now deprecated. Use DataFrame.union instead
Wouldn't this be counterproductive to using spark as the reduce will write to disk?
74

The simplest solution is to reduce with union (unionAll in Spark < 2.0):

val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)

This is relatively concise and shouldn't move data out of off-heap storage, but it extends the lineage with each union, and plan analysis takes non-linear time, which can be a problem if you try to merge a large number of DataFrames.

You can also convert to RDDs and use SparkContext.union:

dfs match {
  case h :: Nil => Some(h)   // single DataFrame: nothing to union
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema
                   ))
  case Nil      => None      // empty input
}

It keeps the lineage short and the analysis cost low, but it is otherwise less efficient than merging DataFrames directly.

5 Comments

Thanks for all these approaches!
Is this as simple in scala ? What would it be ?
How would the equivalent of this code be in pySpark?
How is the performance if there are lots (say, more than 20) of DataFrames?
Also curious in performance for large number of DF
2

You can add parameters like allowMissingColumns by using reduce with lambda

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1, df2]
df = reduce(lambda x, y: x.unionByName(y, allowMissingColumns=True), dfs)
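
The lambda is needed because reduce always calls its function with exactly two positional arguments, so an extra keyword like allowMissingColumns has to be bound beforehand. A toy stand-in (merging dicts instead of DataFrames, with a hypothetical union_by_name helper) shows that a lambda and functools.partial are equivalent ways to do the binding:

```python
from functools import partial, reduce

# Toy stand-in for DataFrame.unionByName: merges two dicts, accepting the
# same keyword so we can demonstrate how it gets threaded through reduce.
def union_by_name(x, y, allowMissingColumns=False):
    return {**x, **y}

dfs = [{"a": 1}, {"b": 2}, {"c": 3}]

# reduce passes exactly two positional arguments, so the keyword must be
# bound first -- either with a lambda...
via_lambda = reduce(lambda x, y: union_by_name(x, y, allowMissingColumns=True), dfs)
# ...or with functools.partial:
via_partial = reduce(partial(union_by_name, allowMissingColumns=True), dfs)

print(via_lambda == via_partial)  # True
```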

1 Comment

I could not get why we use lambda here. I was doing it without and an error was thrown. Basically, how does the lambda make the difference? Can you please elaborate?
1

Under the hood, Spark flattens union expressions, so the union takes longer when it is done linearly.

The best solution would be for Spark to have a union function that supports multiple DataFrames.

But the following code might speed up the union of multiple DataFrames (or Datasets) somewhat.

  import scala.reflect.ClassTag
  import org.apache.spark.sql.Dataset

  def union[T : ClassTag](datasets: TraversableOnce[Dataset[T]]): Dataset[T] =
    binaryReduce[Dataset[T]](datasets, _ union _)

  // Reduce pairwise so the combine tree has O(log n) depth instead of O(n).
  def binaryReduce[T : ClassTag](ts: TraversableOnce[T], op: (T, T) => T): T = {
    if (ts.isEmpty) {
      throw new IllegalArgumentException("empty input")
    }
    val array = ts.toArray
    var size = array.length
    while (size > 1) {
      val newSize = (size + 1) / 2
      for (i <- 0 until newSize) {
        val index = i * 2
        val index2 = index + 1
        if (index2 >= size) {
          array(i) = array(index)   // odd element carried to the next round
        } else {
          array(i) = op(array(index), array(index2))
        }
      }
      size = newSize
    }
    array(0)
  }
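
For comparison, the same pairwise strategy can be sketched in plain Python (a hypothetical binary_reduce helper, not part of Spark; list concatenation stands in for the union operation):

```python
def binary_reduce(items, op):
    """Reduce pairwise so the combine tree has O(log n) depth instead of O(n)."""
    items = list(items)
    if not items:
        raise ValueError("empty input")
    while len(items) > 1:
        nxt = []
        for i in range(0, len(items), 2):
            if i + 1 < len(items):
                nxt.append(op(items[i], items[i + 1]))  # combine a pair
            else:
                nxt.append(items[i])  # odd element carried to the next round
        items = nxt
    return items[0]

print(binary_reduce([[1], [2], [3], [4], [5]], lambda a, b: a + b))
# [1, 2, 3, 4, 5]
```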


0

In case some dataframes have missing columns, one can use a partially applied function:

from functools import partial, reduce
from pyspark.sql import DataFrame

# Union dataframes by name (missing columns filled with null) 
union_by_name = partial(DataFrame.unionByName, allowMissingColumns=True)
df_output = reduce(union_by_name, [df1, df2, ...])

