Memory leaks when running the TPC-H benchmark repeatedly #884

@Kontinuation

Describe the bug

I built DataFusion Comet from commit f7f0bb1 for Spark 3.5.1. Memory usage keeps increasing when I repeatedly run the TPC-H benchmark script against a set of Parquet files, which were generated using https://github.com/databricks/spark-sql-perf with scale factor = 10. Memory usage can reach 20GB, which seems excessive given the Spark and Comet configurations I'm using to run the benchmarks (see Additional context).

image

Using jcmd <pid> VM.native_memory detail.diff | grep Unsafe -A 2, I noticed that the native memory allocated through Unsafe_AllocateMemory0 keeps increasing. I have not enabled off-heap memory, so these allocations should be initiated by the Arrow RootAllocator:

Initially after setting the baseline:

[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
                             (malloc=870721KB type=Other +621478KB #6842866 +4937676)
--
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x0000000119017be0]
                             (malloc=8463KB type=Other -469KB #221 -3)

After 10 minutes:

[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x000000011a0523b4]
                             (malloc=4349265KB type=Other +4100021KB #34671096 +32765906)
--
[0x00000001099c98a8] Unsafe_AllocateMemory0(JNIEnv_*, _jobject*, long)+0xcc
[0x0000000119017be0]
                             (malloc=8449KB type=Other -483KB #217 -7)
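
To compare the two snapshots above more precisely, the malloc summary lines can be parsed and subtracted. The following is a minimal sketch, not part of Comet or the JDK; the helper names (`parse_nmt_malloc`, `growth_bytes_per_allocation`) are made up for illustration, and the regular expression only targets the line shape shown in this report. It assumes NMT's KB means 1024 bytes.

```python
import re

# Matches the summary line NMT prints under each call site, for example:
#   (malloc=4349265KB type=Other +4100021KB #34671096 +32765906)
# The signed "diff" fields are treated as optional.
NMT_MALLOC = re.compile(
    r"\(malloc=(?P<size_kb>\d+)KB type=(?P<type>\w+)"
    r"(?: (?P<size_diff_kb>[+-]\d+)KB)?"
    r" #(?P<count>\d+)"
    r"(?: (?P<count_diff>[+-]\d+))?\)"
)


def parse_nmt_malloc(line):
    """Parse one NMT malloc summary line; return None if it does not match."""
    m = NMT_MALLOC.search(line)
    if m is None:
        return None
    return {
        "size_kb": int(m["size_kb"]),
        "size_diff_kb": int(m["size_diff_kb"] or 0),
        "count": int(m["count"]),
        "count_diff": int(m["count_diff"] or 0),
    }


def growth_bytes_per_allocation(earlier_line, later_line):
    """Average size in bytes of the allocations added between two snapshots."""
    earlier = parse_nmt_malloc(earlier_line)
    later = parse_nmt_malloc(later_line)
    new_allocations = later["count"] - earlier["count"]
    if new_allocations <= 0:
        return None
    new_bytes = (later["size_kb"] - earlier["size_kb"]) * 1024
    return new_bytes / new_allocations


# The two lines for the growing call site, copied from the output above.
baseline = "(malloc=870721KB type=Other +621478KB #6842866 +4937676)"
after_10_min = "(malloc=4349265KB type=Other +4100021KB #34671096 +32765906)"

if __name__ == "__main__":
    print(growth_bytes_per_allocation(baseline, after_10_min))
```

For the growing call site, the difference between the two snapshots works out to roughly 128 bytes per outstanding allocation, which suggests a very large number of small buffers rather than a few large ones.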

The leaked memory was allocated by the CometArrowAllocator. I verified this by attaching a debugger to the Spark process and inspecting CometArrowAllocator.getAllocatedMemory:

image

I also deliberately disabled AQE partition coalescing because of issue #381. Although that issue has been fixed, I kept it disabled to be safe. See the Additional context section for more details.

Steps to reproduce

Run the TPC-H benchmark script with --iterations=100 and observe the RSS of the Spark process increase over time.
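
One way to observe the RSS trend while the benchmark runs is to poll `ps` and fit a line through the readings. This is only a sketch under a few assumptions: `ps -o rss= -p <pid>` is available (it is on macOS and Linux and reports kilobytes), and the function names are illustrative rather than part of any benchmark tooling.

```python
import subprocess
import time


def read_rss_kb(pid):
    """Return the resident set size of `pid` in KB, as reported by ps."""
    result = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def sample_rss(pid, samples, interval_s, read=read_rss_kb):
    """Collect `samples` RSS readings, sleeping `interval_s` between them."""
    readings = []
    for i in range(samples):
        readings.append(read(pid))
        if i + 1 < samples:
            time.sleep(interval_s)
    return readings


def growth_kb_per_sample(readings):
    """Least-squares slope of the readings, in KB per sampling interval."""
    n = len(readings)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

For example, `growth_kb_per_sample(sample_rss(spark_pid, samples=60, interval_s=10))` gives the average growth per 10-second interval over ten minutes, where `spark_pid` is a placeholder for the PID of the Spark driver process. A slope that stays clearly positive across many benchmark iterations matches the behavior described in this report.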

Expected behavior

Memory usage should not increase over time.

Additional context

I'm running it locally with master = local[4]. Here are my test environment and Spark configurations:

Environment:

  • Operating System: macOS 14.6.1, arch: Apple M1 Pro
  • Apache Spark: 3.5.1
  • DataFusion Comet: commit f7f0bb1
  • JVM: 17.0.10 (Eclipse Adoptium)

Spark configurations:

spark.master                     local[4]
spark.driver.cores               4
spark.executor.cores             4
spark.driver.memory              4g
spark.executor.memory            4g
spark.comet.memory.overhead.factor 0.4

spark.jars                     /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar

spark.driver.extraClassPath    /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar
spark.executor.extraClassPath  /path/to/workspace/github/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.3.0-SNAPSHOT.jar

spark.serializer          org.apache.spark.serializer.KryoSerializer

spark.sql.extensions         org.apache.comet.CometSparkSessionExtensions
spark.comet.enabled          true
spark.comet.exec.enabled     true
spark.comet.exec.all.enabled true
spark.comet.explainFallback.enabled false

spark.comet.exec.shuffle.enabled true
spark.comet.exec.shuffle.mode auto
spark.shuffle.manager org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager

# Disable AQE entirely, including coalesce partitions
spark.sql.adaptive.enabled   false
spark.sql.adaptive.coalescePartitions.enabled  false

# Enable debugging and native memory tracking
spark.driver.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -XX:NativeMemoryTracking=detail
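
For reference, here is a rough estimate of the memory the configuration above would be expected to need. It is a sketch under stated assumptions, not a statement of how Comet accounts for memory: it assumes that in local mode the JVM heap is bounded by spark.driver.memory, and that Comet's native memory budget is spark.comet.memory.overhead.factor multiplied by spark.executor.memory when spark.comet.memoryOverhead is not set. JVM overhead such as metaspace and thread stacks is ignored, and `parse_size` is a hypothetical helper.

```python
GIB = 1024 ** 3

# Unit suffixes as used in the Spark settings above (e.g. "4g").
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '4g' or '512m' to bytes (hypothetical helper)."""
    text = text.strip().lower()
    return int(text[:-1]) * UNITS[text[-1]]


def expected_budget_bytes(driver_memory, executor_memory, overhead_factor):
    """Heap plus the assumed Comet overhead (factor * executor memory)."""
    heap = parse_size(driver_memory)
    comet_overhead = overhead_factor * parse_size(executor_memory)
    return heap + comet_overhead


if __name__ == "__main__":
    # Values taken from the configuration listed above.
    budget = expected_budget_bytes("4g", "4g", 0.4)
    print(f"expected budget: {budget / GIB:.1f} GiB")
```

Under these assumptions the budget is about 5.6 GiB, well below the roughly 20GB observed.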
