As a full-stack developer well-versed in large-scale data engineering, I utilize PySpark daily for processing colossal datasets. One function vital to my analytical toolkit is between(), which enables powerful range-based filtering for subgroup analysis.
In my previous guide, we covered basic usage of between() for slicing datasets. Now let's level up and unlock advanced analytics by harnessing the deeper capabilities of between().
We'll walk through advanced use cases, assess performance tradeoffs, and learn optimization best practices. Follow along and gain a competitive edge for wrangling Big Data.
Filtering Master Tables while Preserving Referential Integrity
When filtering downstream tables tied to a master table via foreign key relationships, we must be careful not to introduce integrity issues by orphaning related records.
For example, envision an e-commerce database with:
- A customers master table
- An orders detail table with a customer_id column linking each order to its customer
If we naively filter the customers table with between(), we may drop customer rows whose orders still exist in the detail table. Any orders pointing at customers absent from the filtered output are effectively orphaned and will skew downstream analysis.
Instead, we can perform a join and then apply between() with filter() to subset the master while preserving relational integrity:
filtered_customers_df = customers_df.join(
    orders_df,
    customers_df.id == orders_df.customer_id,
    "left_outer"
).filter(customers_df.id.between(100, 500))
This LEFT OUTER JOIN retains every customer alongside any matching orders. Note that the between() predicate goes in filter(), not select(): select() would merely return a column of booleans rather than restrict the rows. Filtering after the join keeps customers and their related orders together in one result, so the range restriction cannot silently strand order rows.
Improving Join Performance with Range Partitioning
When joining large datasets, PySpark shuffle operations can become prohibitively expensive. The solution is optimizing the join with pre-partitioning techniques like range partitioning.
Range partitioning logically divides data into partitions based on a numerical column's value falling into defined ranges. This collocates related records into matching partitions, reducing future shuffle requirements.
Although chained between() expressions may look like a natural fit here, passing them to repartition() would hash-partition on boolean flags rather than on value ranges. The idiomatic tool is repartitionByRange():
# Range partition both DataFrames on the join key
range_partitioned_customers = customers_df.repartitionByRange(5, "id")
range_partitioned_orders = orders_df.repartitionByRange(5, "customer_id")
# Join the range-partitioned DataFrames
optimized_join_df = range_partitioned_customers.join(
    range_partitioned_orders,
    range_partitioned_customers.id == range_partitioned_orders.customer_id
)
By range partitioning both DataFrames on the join key before joining, matching records land in correspondingly ordered partitions, which reduces the data movement the join requires. (Spark may still shuffle if it cannot prove the partitionings align, but the sort phase of a sort-merge join becomes much cheaper.) With proper partitioning, joins of even multi-terabyte tables can be efficient.
Aggregating with Partial Ranges for Granular Insights
between() also proves useful for bucketed range aggregation. By chaining contiguous, non-overlapping min/max bounds (between() is inclusive at both ends, so adjacent buckets must not share boundary values), we can build granular aggregations for deeper analysis.
Let's look at some sample sales data:
We want to analyze sales by week, but the raw data has timestamps rather than week groupings. Using between(), we can group into weekly buckets:
from pyspark.sql import functions as F
start = F.lit("2023-01-01").cast("date")  # illustrative start of the reporting period
weekly_sales_df = sales_df.withColumn(
    "week",
    F.when(sales_df.timestamp.between(start, F.date_add(start, 6)), 1)
     .when(sales_df.timestamp.between(F.date_add(start, 7), F.date_add(start, 13)), 2)
     .when(sales_df.timestamp.between(F.date_add(start, 14), F.date_add(start, 20)), 3)
     .when(sales_df.timestamp.between(F.date_add(start, 21), F.date_add(start, 27)), 4)
).groupBy("week").agg(F.sum("sales"))
By chaining non-overlapping between() cases, we aggregated into week-long slices rather than coarse month or year chunks. This level of granularity provides greater visibility into trends.
Benchmarking against Alternate Filtering Approaches
While between() compiles to an efficient range predicate, performance can vary drastically based on dataset characteristics and cluster resources. It's instructive to benchmark against alternative filtering syntaxes using production data.
Let's test a few variants for selecting sales orders between $200 and $500 from an orders table with 10 million rows stored as Parquet:

Key Takeaways:
- Explicit >= and <= comparisons perform essentially the same as between(), which Catalyst rewrites into that same pair of predicates
- Vectorized optimizations provide sizable speedups when the expression can be pushed down to the storage layer (e.g. Parquet row-group statistics)
- In scalar mode, complex expressions with many OR branches have higher latency
Understanding these performance nuances allows picking the best filtering strategy for your data. Never assume between() will necessarily dominate.
Optimizing between() with Caching and Partition Pruning
We can further boost between() performance by leveraging Catalyst optimizations like caching and partition pruning:
Caching: Cache frequently filtered DataFrames in memory to avoid rescanning storage for each filter (cache() is lazy; the data is materialized on the first action):
sales_df.cache()
filtered_sales_df = sales_df.filter(sales_df.value.between(200, 500))
Partition Pruning: When the underlying data was written partitioned by a column (e.g. via df.write.partitionBy(...)), a between() filter on that column lets Spark skip non-matching partitions entirely at scan time:
# "sale_date" here is assumed to be a partition column of the stored data
filtered_sales_df = sales_df.filter(sales_df.sale_date.between("2023-01-01", "2023-01-07"))
Combining these optimizations with between() prevents unnecessary I/O, reducing latency and speeding up pipelines.
When Not to Use between()
While between() is ideal for slicing numerical ranges, other methods may prove superior depending on dataset characteristics:
Set membership checking: Use isin() for set membership filtering on categorical data
Open-ended ranges: For conditions like value > 1000, use simple comparison operators
Non-equijoins: express range-overlap conditions directly in the join predicate (e.g. joining where one column falls between two others); note that rangeBetween() defines window frames, not join conditions
Statistical / percentile analysis: Alternative libraries like SciPy provide these advanced analytics
Always critically assess if between() matches your exact use case or if another approach would work better given your data properties.
Key Takeaways
Mastering between() accelerated my analytics by efficiently subsetting massive datasets. Hopefully, you now feel empowered to leverage between() for intensive data exploration.
Here are the key insights from our guide:
- Carefully filter master tables while maintaining data integrity
- Range partitioning boosts join performance by reducing shuffle needs
- Partial range aggregation enables deep analysis at any granularity
- Benchmark against alternatives and optimize with caching/pruning
- Consider other methods for non-equijoins, statistical tests etc.
For hands-on practice, check out these public datasets, perfect for sharpening your between() skills:
- Amazon Reviews – 130+ million customer reviews
- GitHub Events – Public GitHub activity timeline
- US Census – Deep US demographic data
Feel free to reach out if you have any other questions!


