PySpark is the Python API for Apache Spark, the wildly popular distributed data processing framework. The sort() function in PySpark allows for sorting PySpark DataFrames and RDDs based on one or more columns.
In this comprehensive guide, we will cover the following topics related to sort() in PySpark:
- What is the sort() function in PySpark?
- Sorting PySpark DataFrames
- Basic usage
- Sorting by multiple columns
- Specifying sort order
- Sorting PySpark RDDs
- Underlying Sort Algorithm in Spark
- Performance Considerations and Optimizations
- Comparison with Other Sorting Methods
- Advanced Sort Functionality
- Real-world Use Cases of sort()
What is the sort() Function in PySpark?
The sort() function in PySpark allows sorting the rows of a DataFrame or RDD based on one or more columns.
The syntax for sort() in PySpark DataFrames is:
df.sort(col1, col2, ..., ascending=True)
Where:
- df is the PySpark DataFrame
- col1, col2, etc. are the column names to sort by
- ascending determines sort order – True for ascending, False for descending
For PySpark RDDs, the syntax is:
rdd.sortBy(lambda x: x[0], ascending=False)
Where the lambda function defines the value to sort by.
Conceptually, the sort is performed on the DataFrame/RDD partitions, with the sort algorithm explained in detail below.
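Conceptually, this API mirrors Python's built-in sorted(): a key plus an optional direction flag. Here is a local sketch of that idea (an analogy only; Spark executes its sort distributed across partitions, not like this):

```python
# Local sketch: what df.sort("id") means, expressed with Python's sorted().
# This is only an analogy -- Spark performs the sort across partitions.
rows = [(2, "Alice"), (5, "Bob"), (3, "Charlie")]

ascending = sorted(rows, key=lambda r: r[0])                 # like df.sort("id")
descending = sorted(rows, key=lambda r: r[0], reverse=True)  # like df.sort("id", ascending=False)

print(ascending)   # [(2, 'Alice'), (3, 'Charlie'), (5, 'Bob')]
print(descending)  # [(5, 'Bob'), (3, 'Charlie'), (2, 'Alice')]
```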
Sorting PySpark DataFrames
Sorting DataFrames is a very common operation during analysis, used to arrange the data in some meaningful order.
Basic Usage
Sorting DataFrames by a single column is straightforward:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myapp').getOrCreate()
df = spark.createDataFrame([
    (2, 'Alice'),
    (5, 'Bob'),
    (3, 'Charlie')
], ["id", "name"])
df.show()
# +---+-------+
# | id| name|
# +---+-------+
# | 2| Alice|
# | 5| Bob|
# | 3|Charlie|
# +---+-------+
df.sort('id').show()
# +---+-------+
# | id| name|
# +---+-------+
# | 2| Alice|
# | 3|Charlie|
# | 5| Bob|
# +---+-------+
To sort in descending order, we pass ascending=False:
df.sort('id', ascending=False).show()
# +---+-------+
# | id| name|
# +---+-------+
# | 5| Bob|
# | 3|Charlie|
# | 2| Alice|
# +---+-------+
Sorting by Multiple Columns
To sort by multiple columns, simply pass a list of column names:
from pyspark.sql.functions import desc
df.sort(['name', 'id'], ascending=[True, False]).show()
# +---+-------+
# | id| name|
# +---+-------+
# | 2| Alice|
# | 5| Bob|
# | 3|Charlie|
# +---+-------+
# OR with desc for descending sort
df.sort(['name', desc('id')]).show()
# +---+-------+
# | id| name|
# +---+-------+
# | 2| Alice|
# | 5| Bob|
# | 3|Charlie|
# +---+-------+
This sorts by the name column first, and then by the id column in descending order.
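A multi-column sort with mixed directions can be sketched locally using Python's stable sort, applying the keys in reverse priority order (a local analogy, not Spark code; the duplicate 'Alice' row is added here to show the tie-break):

```python
# Local sketch of df.sort(["name", desc("id")]): name ascending, ties broken
# by id descending. Python's sort is stable, so we can apply the keys in
# reverse priority order: secondary key first, primary key last.
rows = [(2, "Alice"), (5, "Bob"), (3, "Charlie"), (7, "Alice")]

rows.sort(key=lambda r: r[0], reverse=True)  # secondary: id descending
rows.sort(key=lambda r: r[1])                # primary: name ascending

print(rows)  # [(7, 'Alice'), (2, 'Alice'), (5, 'Bob'), (3, 'Charlie')]
```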
Specifying Sort Order
Instead of Boolean flags, we can also specify the sort order explicitly using the asc and desc expressions:
from pyspark.sql.functions import asc, desc
df.sort(asc('name'), desc('id'))
This clears up any confusion about which column is sorted in which order.
Sorting PySpark RDDs
We can also sort Spark RDDs using the sortBy() transformation:
sc = spark.sparkContext
unsorted_rdd = sc.parallelize([
    ('Alice', 25),
    ('Bob', 20),
    ('Charlie', 30)
])
sorted_rdd = unsorted_rdd.sortBy(lambda x: x[1])
sorted_rdd.collect()
# [('Bob', 20), ('Alice', 25), ('Charlie', 30)]
The sortBy() transformation takes a lambda function to define the value to sort by. Here we've sorted by the second element of each tuple.
We can also specify ascending vs descending order:
sorted_rdd = unsorted_rdd.sortBy(lambda x: x[0], ascending=False)
sorted_rdd.collect()
# [('Charlie', 30), ('Bob', 20), ('Alice', 25)]
So Spark RDDs provide lower-level access to sorting functionality, while PySpark DataFrames benefit from additional optimization through the Catalyst engine, which we will cover next.
Underlying Sort Algorithm in Spark
Under the hood, Spark performs sorting using a shuffle operation across the partitions of the RDD or DataFrame.
The basic algorithm works as follows:
1. Generate a sort key for each element
This is done by applying the lambda function or evaluating the columns to sort on. This defines the sorting criteria.
2. Range partition the keys
Spark samples the data to estimate the key distribution and computes range boundaries, assigning each key to a target partition ID from 0 to numPartitions-1. (Hash partitioning would not work here: it scatters adjacent keys across partitions, so the results could not be concatenated in order.)

3. Repartition (shuffle) data to target partitions
The elements are sent across the network to the partition that owns their key's range.
4. Sort each partition locally
Within each partition, the elements are sorted using an algorithm like quicksort or Timsort. This produces sorted partition files.
5. Results are now globally sorted
Since the partitions themselves are sorted, the final result is a globally sorted DataFrame/RDD.

So Spark uses a common parallel processing pattern – distributed sort. Data is shuffled and sorted in parallel across partitions for efficiency.
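The five steps above can be sketched in plain Python. This is a simplified simulation of the distributed-sort pattern (sample boundaries, shuffle by range, sort locally, concatenate), not Spark's actual implementation:

```python
import bisect
import random

def distributed_sort(data, num_partitions):
    """Simplified sketch of Spark's distributed sort."""
    # 1. Sample the data to pick range boundaries (num_partitions - 1 of them).
    sample = sorted(random.sample(data, min(len(data), 20)))
    step = max(1, len(sample) // num_partitions)
    boundaries = sample[step::step][:num_partitions - 1]

    # 2-3. "Shuffle": route each element to the partition owning its key range.
    partitions = [[] for _ in range(num_partitions)]
    for x in data:
        partitions[bisect.bisect_left(boundaries, x)].append(x)

    # 4. Sort each partition locally (Spark uses Timsort/quicksort here).
    for p in partitions:
        p.sort()

    # 5. The partitions are range-ordered, so concatenating them
    #    yields a globally sorted result.
    return [x for p in partitions for x in p]

data = random.sample(range(1000), 100)
assert distributed_sort(data, 4) == sorted(data)
```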
Next, let's discuss the performance implications.
Performance Considerations and Optimizations
There are a few things to keep in mind when using sorting in Apache Spark:
- Sorting requires heavy shuffle operations – Shuffling large volumes of data across executors is expensive. Minimize unnecessary sorting on big datasets.
- Local sorting is fast, global sorting is expensive – Quicksort within a single partition is cheap, but coordinating a sort across all partitions is costly.
- If data is already sorted, avoid re-shuffling it – For example, if logs on S3 are already time-sorted, piggyback off that for free!
Some Spark configurations that influence sort performance:
- spark.sql.shuffle.partitions – Controls the number of partitions during a shuffle. More partitions mean more parallelism but smaller partitions.
- spark.sql.files.maxPartitionBytes – Limits partition size when reading files. Overly large partitions increase garbage-collection pressure during sorts.
Strategies to optimize Spark sorting:
- Increase partition count for more parallelism during sorting. But beware tiny partitions!
- Use DataFrames over RDDs – the Catalyst optimizer can apply optimizations that raw RDD transformations miss.
- Pre-sort data before saving to avoid re-sorting downstream, e.g. partitionBy before writing to Parquet.
- Tune the join strategy – sort-merge joins pre-sort their inputs, reducing later sorting.
Here's a comparison of a sort workload across various partition counts:
[Figure: total sort time vs. partition count. Source: Databricks]
More partitions speed up the total sort by allowing more parallelism, but watch out for too many small partitions.
Comparison with Other Sorting Functions
PySpark provides several ways to sort data:
- sort() – General-purpose sorting, as we've covered so far
- orderBy() – Alias for sort() in the DataFrame API and Spark SQL
- top() / takeOrdered() – RDD actions that return the top K elements
- rank() – Calculates ranking within a window/group
sort() vs orderBy()
In the DataFrame API, orderBy() is simply an alias for sort(): both compile to the same optimized physical sort plan, so there is no performance difference between them. (Don't confuse either with Python's built-in sorted(), which would run entirely on the driver after a collect().)
So when working with DataFrames, the choice is stylistic; orderBy() reads naturally to SQL users.
Top K Sorting
If you only need the top K rows by some criteria, avoid materializing a full global sort:
df.orderBy(desc("sales")).limit(5)  # Top 5 sales (DataFrame API)
rdd.top(5)  # Top 5 elements (RDD API)
Spark optimizes the orderBy-plus-limit pattern into a take-ordered operation, avoiding a costly full sort just to find extreme values.
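The top-K idea can be sketched locally with Python's heapq module, which maintains a bounded heap instead of sorting everything; this is the same principle behind the RDD top() and takeOrdered() actions:

```python
import heapq

sales = [120, 45, 300, 87, 210, 5, 990, 64]

# Top 3 values without fully sorting the list: a bounded heap does the
# work in O(n log k) instead of O(n log n) for a full sort.
top3 = heapq.nlargest(3, sales)
print(top3)  # [990, 300, 210]
```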
Window Ranking with rank()
To calculate ranking within groups, use built-in window functions like rank():
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
window = Window.partitionBy('dept').orderBy('salary')
df.withColumn('rank', rank().over(window))
So depending on use case, utilize the right sorting tool for the job!
Advanced Sort Functionality
There are some more advanced capabilities around sorting that deserve mention:
Sort with Null Ordering
Use the asc_nulls_first / asc_nulls_last / desc_nulls_first / desc_nulls_last variants to control where nulls appear in the sort order:
from pyspark.sql.functions import desc_nulls_last
df.orderBy(desc_nulls_last('col'))
# Nulls last in descending sort
Stable vs Unstable Sort
Spark's sort is not guaranteed to be stable: rows with equal keys may come out in any relative order. There is no stable flag; for a deterministic order, add a unique tie-breaking column to the sort keys:
df.orderBy("col1", "col2", "row_id")  # a unique row_id column breaks ties deterministically
External Sort
When data is too large to fit in memory, Spark spills sorted partitions to disk for later merge. This is called external sort.
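A minimal external sort can be sketched in plain Python: sort fixed-size chunks in memory, spill each sorted run to disk, then stream a k-way merge over the runs with heapq.merge (a simplified sketch of the idea, not Spark's implementation):

```python
import heapq
import os
import tempfile

def external_sort(values, chunk_size):
    """Sketch of an external sort: in-memory sort per chunk, spill runs
    to disk, then k-way merge the spilled runs."""
    run_files = []
    for i in range(0, len(values), chunk_size):
        chunk = sorted(values[i:i + chunk_size])  # in-memory sort of one chunk
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.write("\n".join(map(str, chunk)))       # spill the sorted run to disk
        f.seek(0)
        run_files.append(f)

    def read_run(f):
        for line in f:
            yield int(line)

    # heapq.merge streams the runs, never holding them all in memory at once.
    merged = list(heapq.merge(*(read_run(f) for f in run_files)))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged

data = [9, 1, 7, 3, 8, 2, 6, 4, 5, 0]
assert external_sort(data, 3) == sorted(data)
```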
Sorting in Streaming
In Structured Streaming, sorting is only supported with complete output mode, since maintaining a total order over unbounded data would require recomputing the order on every trigger.
Real-world Use Cases
Here are some common use cases where the sort() function shines in PySpark:
1. Sort columns before visualization:
retail_df.sort('transaction_date').show()
# Temporal sort for plotting time series
2. Rank users by high score:
game_df.sort(desc('player_score')).show()
# Rank users leaderboard descending by score
3. Join datasets on a key:
user_df.join(purchases_df, 'userId')
# Spark's sort-merge join sorts both sides on the join key internally;
# an explicit pre-sort with sort() does not skip the join's shuffle
4. Analyze top/bottom K rows
support_tickets.orderBy(desc('priority')).show(5)
# View top 5 high priority tickets without full sort
5. Remove duplicate rows:
clean_df = dirty_df.dropDuplicates(['id']).sort('id')
# dropDuplicates() handles deduplication on its own;
# sort afterwards if you want ordered output
And many more uses! Sorting enables better analysis and visualization.
So in summary, the sort() method provides a versatile way to rearrange PySpark data for various objectives. By judiciously using sorting, we can glean powerful insights!