Memory leaks when running the TPC-H benchmark repeatedly

### Describe the bug

I built DataFusion Comet at commit https://github.com/apache/datafusion-comet/commit/f7f0bb1ed68367b8d3e1c88010c1f943f480ea11 for Spark 3.5.1 and found that memory usage keeps increasing when I repeatedly run the [TPC-H benchmark script](https://github.com/apache/datafusion-benchmarks/blob/main/runners/datafusion-comet/tpcbench.py) against a set of Parquet files. The Parquet files were generated using https://github.com/databricks/spark-sql-perf with scale factor = 10. Memory usage can reach 20GB, which looks far too high for the Spark and Comet configurations I'm using to run the benchmarks (see **Additional context**).

![image](https://github.com/user-attachments/assets/2adfb671-d674-4753-8bcc-cbd272e15da0)


Using `jcmd <pid> VM.native_memory detail.diff | grep Unsafe -A 2`, I noticed that the native memory allocated by `Unsafe_AllocateMemory0` keeps increasing. I have not enabled off-heap memory, so these allocations should come from the Arrow `RootAllocator`:

Initially after setting the baseline:

```
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
                             (malloc=870721KB type=Other +621478KB #6842866 +4937676)
--
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x0000000119017be0]
                             (malloc=8463KB type=Other -469KB #221 -3)
```

After 10 minutes:

```
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
                             (malloc=4349265KB type=Other +4100021KB #34671096 +32765906)
--
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x0000000119017be0]
                             (malloc=8449KB type=Other -483KB #217 -7)
```
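To compare the two snapshots numerically, the summary lines can be parsed with a small script. This is only a sketch: the regular expression and the helper name `parse_unsafe_entries` are my own, written against the output format shown above, and the two sample strings are copied from the first call site in each snapshot.

```python
import re

# Matches an NMT call-site summary line such as:
# (malloc=870721KB type=Other +621478KB #6842866 +4937676)
NMT_SUMMARY = re.compile(
    r"\(malloc=(\d+)KB type=\w+(?: [+-]\d+KB)? #(\d+)(?: [+-]\d+)?\)"
)

def parse_unsafe_entries(nmt_text):
    """Return (malloc_kb, allocation_count) per Unsafe_AllocateMemory0 call site."""
    entries = []
    in_unsafe_block = False
    for line in nmt_text.splitlines():
        if "Unsafe_AllocateMemory0" in line:
            in_unsafe_block = True
            continue
        match = NMT_SUMMARY.search(line)
        if match and in_unsafe_block:
            entries.append((int(match.group(1)), int(match.group(2))))
            in_unsafe_block = False
    return entries

# First call site from each snapshot above.
BASELINE = """\
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
                             (malloc=870721KB type=Other +621478KB #6842866 +4937676)
"""
AFTER_10_MIN = """\
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
                             (malloc=4349265KB type=Other +4100021KB #34671096 +32765906)
"""

before_kb, before_count = parse_unsafe_entries(BASELINE)[0]
after_kb, after_count = parse_unsafe_entries(AFTER_10_MIN)[0]

growth_kb = after_kb - before_kb            # KB gained between snapshots
growth_count = after_count - before_count   # live allocations gained
avg_bytes = growth_kb * 1024 / growth_count # average size of a new allocation

print(f"growth: {growth_kb} KB over {growth_count} allocations, "
      f"about {avg_bytes:.0f} bytes each")
```

For these two snapshots the first call site grows by about 3.3 GiB in 10 minutes, spread over roughly 27.8 million live allocations of about 128 bytes each, which suggests many small buffers are never released rather than a few large ones.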

The leaked memory was allocated by the [`CometArrowAllocator`](https://github.com/apache/datafusion-comet/blob/33706125b8c7a7f347865c7fb38fede6aceb97e9/common/src/main/scala/org/apache/comet/package.scala#L35). I verified this by attaching a debugger to the Spark process and inspecting `CometArrowAllocator.getAllocatedMemory`:

![image](https://github.com/user-attachments/assets/3d0ddeb7-d6bd-4d97-8a57-44544fc1e19f)

I've also deliberately disabled AQE coalesce partitions because of this issue: https://github.com/apache/datafusion-comet/issues/381. Although it has been fixed, I kept it disabled to be safe. See the **Additional context** section for more details.

### Steps to reproduce

Run the [TPC-H benchmark script](https://github.com/apache/datafusion-benchmarks/blob/main/runners/datafusion-comet/tpcbench.py) with `--iterations=100` and observe the RSS of the Spark process increase over time.
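One way to observe the RSS over time is to poll `ps` while the benchmark runs. The helper names below (`rss_kb_via_ps`, `sample_rss`, `net_growth_kb`) are hypothetical and not part of the benchmark script; this is a minimal sketch that assumes a `ps` supporting `-o rss=` (macOS and Linux both do, reporting kilobytes).

```python
import subprocess
import time

def rss_kb_via_ps(pid):
    """Read the resident set size (in KB) of a process using `ps`."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def sample_rss(read_rss_kb, samples, interval_s):
    """Collect `samples` readings, sleeping `interval_s` seconds between them."""
    readings = []
    for i in range(samples):
        readings.append(read_rss_kb())
        if i < samples - 1:
            time.sleep(interval_s)
    return readings

def net_growth_kb(readings):
    """Difference between the last and the first reading."""
    return readings[-1] - readings[0]
```

For example, `sample_rss(lambda: rss_kb_via_ps(spark_pid), samples=60, interval_s=10)` collects ten minutes of readings, where `spark_pid` stands for the PID of the Spark driver process. A steadily positive `net_growth_kb` across iterations matches the behavior described in this issue.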

### Expected behavior

Memory usage should not increase over time.

### Additional context

I'm running it locally with `master = local[4]`. Here are my test environment and Spark configurations:

**Environment**:

* Operating System: macOS 14.6.1, arch: Apple M1 Pro
* Apache Spark: 3.5.1
* DataFusion Comet: commit https://github.com/apache/datafusion-comet/commit/f7f0bb1ed68367b8d3e1c88010c1f943f480ea11
* JVM: 17.0.10 (Eclipse Adoptium)

**Spark configurations**:

```
spark.master                     local[4]
spark.driver.cores               4
spark.executor.cores             4
spark.driver.memory              4g
spark.executor.memory            4g
spark.comet.memory.overhead.factor 0.4

spark.jars                     /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar

spark.driver.extraClassPath    /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
spark.executor.extraClassPath  /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar

spark.serializer          org.apache.spark.serializer.KryoSerializer

spark.sql.extensions         org.apache.comet.CometSparkSessionExtensions
spark.comet.enabled          true
spark.comet.exec.enabled     true
spark.comet.exec.all.enabled true
spark.comet.explainFallback.enabled false

spark.comet.exec.shuffle.enabled true
spark.comet.exec.shuffle.mode auto
spark.shuffle.manager org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager

# Disable AQE coalesce partitions
spark.sql.adaptive.enabled   false
spark.sql.adaptive.coalescePartitions.enabled  false

# Enable debugging and native memory tracking
spark.driver.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -XX:NativeMemoryTracking=detail
```

