Describe the bug
I've built DataFusion Comet from commit f7f0bb1 for Spark 3.5.1. Memory usage keeps increasing when I repeatedly run the TPC-H benchmark script against a set of Parquet files, which were generated using https://github.com/databricks/spark-sql-perf with scale factor = 10. Memory usage can grow as high as 20GB, which seems problematic given the Spark and Comet configurations I'm using to run the benchmarks (see Additional context).

Using `jcmd <pid> VM.native_memory detail.diff | grep Unsafe -A 2`, I've noticed that the native memory allocated by `Unsafe_AllocateMemory0` keeps increasing. I'm not enabling off-heap memory, so the allocations should be initiated by the Arrow `RootAllocator`:
Initially after setting the baseline:

```
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
(malloc=870721KB type=Other +621478KB #6842866 +4937676)
--
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x0000000119017be0]
(malloc=8463KB type=Other -469KB #221 -3)
```

After 10 minutes:

```
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
(malloc=4349265KB type=Other +4100021KB #34671096 +32765906)
--
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x0000000119017be0]
(malloc=8449KB type=Other -483KB #217 -7)
```
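For reference, the full native memory tracking workflow I used looks roughly like this (`<pid>` stands in for the Spark driver's process id; NMT itself is enabled via the `-XX:NativeMemoryTracking=detail` JVM option shown in Additional context):

```shell
# Take a baseline snapshot once the benchmark is warmed up
jcmd <pid> VM.native_memory baseline

# Later, diff against the baseline and look at Unsafe allocations
jcmd <pid> VM.native_memory detail.diff | grep Unsafe -A 2
```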
The leaked memory was allocated by `CometArrowAllocator`. I've verified this by attaching a debugger to the Spark process and inspecting `CometArrowAllocator.getAllocatedMemory`:

I've also deliberately disabled AQE coalesce partitions because of #381. Although that issue is fixed, I kept it disabled to be safe. See the Additional context section for more details.
Steps to reproduce
Run the TPC-H benchmark script with `--iterations=100` and observe the RSS of the Spark process increase over time.
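A simple way to watch the RSS growth (`<pid>` stands in for the Spark driver's process id) is:

```shell
# Print the resident set size (in KB) every 10 seconds
while true; do
  ps -o rss= -p <pid>
  sleep 10
done
```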
Expected behavior
Memory usage should not increase over time.
Additional context
I'm simply running it locally with `master = local[4]`. Here are my test environment and Spark configurations:
Environment:
- Operating System: macOS 14.6.1, arch: Apple M1 Pro
- Apache Spark: 3.5.1
- DataFusion Comet: commit f7f0bb1
- JVM: 17.0.10 (Eclipse Adoptium)
Spark configurations:
```
spark.master local[4]
spark.driver.cores 4
spark.executor.cores 4
spark.driver.memory 4g
spark.executor.memory 4g
spark.comet.memory.overhead.factor 0.4
spark.jars /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
spark.driver.extraClassPath /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
spark.executor.extraClassPath /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.extensions org.apache.comet.CometSparkSessionExtensions
spark.comet.enabled true
spark.comet.exec.enabled true
spark.comet.exec.all.enabled true
spark.comet.explainFallback.enabled false
spark.comet.exec.shuffle.enabled true
spark.comet.exec.shuffle.mode auto
spark.shuffle.manager org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
# Disable AQE coalesce partitions
spark.sql.adaptive.enabled false
spark.sql.adaptive.coalescePartitions.enabled false
# Enable debugging and native memory tracking
spark.driver.extraJavaOptions -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -XX:NativeMemoryTracking=detail
```