fix: Fix various memory leaks problems by Kontinuation · Pull Request #890 · apache/datafusion-comet

Kontinuation · 2024-08-29T17:13:58Z

Which issue does this PR close?

Closes #884.

Rationale for this change

Please refer to the comments in #884 for details.

What changes are included in this PR?

Fixes ArrowSchema and ArrowArray base structure leaks in JVM
Fixes AQE coalesce partition leaks by closing the ArrowStreamWriter

How are these changes tested?

It is pretty hard to add tests for this fix, so we tested it manually and are relying on existing tests to make sure that it does not break anything.

viirya · 2024-08-29T17:35:50Z

common/src/main/scala/org/apache/comet/vector/NativeUtil.scala

    (0 until batch.numCols()).foreach { index =>
-      batch.column(index) match {
-        case a: CometVector =>
-          val valueVector = a.getValueVector
-
-          val provider = if (valueVector.getField.getDictionary != null) {
-            a.getDictionaryProvider
-          } else {
-            null
-          }
-


Unrelated change.

This moves the sanity check for column types ahead of the actual construction of the Arrow C data structures, because it is tricky to release already constructed FFI data structures before raising the exception.
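The reasoning above can be illustrated with a minimal sketch. It is not the code in `NativeUtil.scala`; `FakeFfiStruct`, `ValidateThenExport`, and the `"comet"` type tag are hypothetical stand-ins. It only shows why checking every column before allocating anything means a failed check leaves nothing to release.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an FFI structure backed by off-heap memory.
final class FakeFfiStruct implements AutoCloseable {
    static int live = 0; // number of structures currently allocated

    FakeFfiStruct() { live++; }

    @Override
    public void close() { live--; }
}

final class ValidateThenExport {
    // Hypothetical rule: only columns tagged "comet" can be exported.
    static List<FakeFfiStruct> export(List<String> columnTypes) {
        // Pass 1: sanity checks only. Nothing has been allocated yet,
        // so throwing here cannot leak anything.
        for (String type : columnTypes) {
            if (!"comet".equals(type)) {
                throw new IllegalArgumentException("Unsupported column type: " + type);
            }
        }
        // Pass 2: allocate one structure per column.
        List<FakeFfiStruct> exported = new ArrayList<>();
        for (int i = 0; i < columnTypes.size(); i++) {
            exported.add(new FakeFfiStruct());
        }
        return exported;
    }
}
```

With the checks interleaved with allocation instead, a failure on a later column would have to unwind the structures already created for the earlier ones.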

viirya · 2024-08-29T17:36:57Z

common/src/main/scala/org/apache/comet/vector/NativeUtil.scala

+      exportedVectors += arrowSchema.memoryAddress()
    }

-    exportedVectors.toArray


You can return ExportedBatch without any above change.

If you don't like the restructuring that moves the sanity checks, I can revert it to the original control flow.

It makes sense, for the reason given in https://github.com/apache/datafusion-comet/pull/890/files#r1736808325

viirya · 2024-08-29T17:38:06Z

spark/src/main/java/org/apache/comet/CometBatchIterator.java

+    // The native executor should have moved the previous batch, it is safe for us to deallocate
+    // the ArrowSchema and ArrowArray base structures.


Not exactly. A native operator could possibly save batches internally.

Yes. Even if the native operator saves batches internally, the batch is still moved to the native operator before it is saved.
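The lifecycle described in this thread can be sketched as follows. This is a hedged illustration rather than the actual `CometBatchIterator`; `FakeExportedBatch` and `ClosingBatchIterator` are hypothetical names. The assumption, stated in the code comment under review, is that the consumer has already moved the previous batch by the time it asks for the next one.

```java
import java.util.Iterator;

// Hypothetical stand-in for the exported base structures of one batch.
final class FakeExportedBatch implements AutoCloseable {
    final int id;
    boolean closed = false;

    FakeExportedBatch(int id) { this.id = id; }

    @Override
    public void close() { closed = true; }
}

final class ClosingBatchIterator implements AutoCloseable {
    private final Iterator<FakeExportedBatch> input;
    private FakeExportedBatch previous; // batch handed out by the last next() call

    ClosingBatchIterator(Iterator<FakeExportedBatch> input) {
        this.input = input;
    }

    // Returns the next batch, or null when the input is exhausted.
    FakeExportedBatch next() {
        // Assumption: the consumer already took ownership of the previous
        // batch's contents, so only the base structures are left to free.
        closePrevious();
        if (!input.hasNext()) {
            return null;
        }
        previous = input.next();
        return previous;
    }

    @Override
    public void close() { closePrevious(); }

    private void closePrevious() {
        if (previous != null) {
            previous.close();
            previous = null;
        }
    }
}
```

If a consumer kept using the previous batch's base structures without moving them, this pattern would free them too early, which is the concern raised above.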

viirya · 2024-08-29T17:41:20Z

common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala

-
-      out.flush()
-      out.close()
+      writer.close()


Closing the writer will close the dictionary provider. If the dictionary arrays are shared across batches, you will close them and empty later batches. I remember we hit the issue before.

Hmm... I need to take a further look at this to fully understand whether this fix is correct.

This should be correct. The dictionary provider held by the writer contains copied vectors, so closing them does not interfere with the rest of the system.

I remember that closing it caused some errors in CI, for the reason I mentioned. Let's see whether CI passes.

CI passes on all commits of this PR. It has run 3 rounds with no problems.

Yea, maybe there were some other changes before. Anyway, it is good that the writer can be closed without issue.

viirya · 2024-08-29T18:21:03Z

common/src/main/scala/org/apache/comet/vector/ExportedBatch.scala

+  def close(): Unit = {
+    arrowSchemas.foreach { schema =>
+      val snapshot = schema.snapshot
+      if (snapshot.release != 0) schema.release()


I don't think we should call release here. The exported array should be released when the native side drops the imported array.

Yes. I did this in case the native side does not move the array away (by mistake, maybe). This could be removed if the native side always moves the array.

Yea, I took another look. Native side moved them.

viirya · 2024-08-29T18:21:47Z

common/src/main/scala/org/apache/comet/vector/ExportedBatch.scala

+    arrowArrays.foreach { array =>
+      val snapshot = array.snapshot
+      if (snapshot.release != 0) array.release()
+      array.close()


It makes sense to close the internal ArrowBuf of ArrowArray and ArrowSchema. Good catch.
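The difference between releasing and closing discussed in these two threads can be modeled with a small sketch. It is a hypothetical model, not the Arrow Java API: `FakeBaseStruct` and `moveOut` are invented names. It relies on the Arrow C Data Interface convention that a base structure whose release callback is null has already been released or moved.

```java
// Hypothetical model of an Arrow C Data Interface base structure.
final class FakeBaseStruct implements AutoCloseable {
    Runnable release;            // null means "already released or moved"
    boolean holderFreed = false; // tracks the memory holding the struct itself

    FakeBaseStruct(Runnable release) {
        this.release = release;
    }

    // Consumer side: take over the release callback and mark the source as moved.
    Runnable moveOut() {
        Runnable taken = release;
        release = null;
        return taken;
    }

    // Producer side: release only if nobody moved the contents,
    // then free the holder in either case.
    @Override
    public void close() {
        if (release != null) {
            Runnable r = release;
            release = null;
            r.run();
        }
        holderFreed = true;
    }
}
```

When the consumer always moves the contents, the guarded release never fires and `close()` only frees the holder, which matches the conclusion that the native side moved them.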

codecov-commenter · 2024-08-29T19:26:14Z

Codecov Report

Attention: Patch coverage is 23.25581% with 33 lines in your changes missing coverage. Please review.

Project coverage is 55.01%. Comparing base (9d8730d) to head (a0c46ee).
Report is 22 commits behind head on main.

Files with missing lines	Patch %	Lines
...ain/scala/org/apache/comet/vector/NativeUtil.scala	0.00%	25 Missing ⚠️
.../scala/org/apache/comet/vector/ExportedBatch.scala	0.00%	7 Missing ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #890      +/-   ##
============================================
- Coverage     55.16%   55.01%   -0.15%     
+ Complexity      857      854       -3     
============================================
  Files           109      110       +1     
  Lines         10542    10592      +50     
  Branches       2010     2020      +10     
============================================
+ Hits           5815     5827      +12     
- Misses         3714     3750      +36     
- Partials       1013     1015       +2

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

viirya · 2024-08-30T02:34:53Z

@Kontinuation I revised the current approach in #893. So the ArrowSchema and ArrowArray are allocated in native. By doing that, I think we don't need to release the JVM structures.

Kontinuation · 2024-08-30T02:45:15Z

> @Kontinuation I revised the current approach in #893. So the ArrowSchema and ArrowArray are allocated in native. By doing that, I think we don't need to release the JVM structures.

Yes. They are allocated by the native side in #893, so they should be released by the native side accordingly.

viirya · 2024-08-31T21:10:05Z

Thank you @Kontinuation

I will merge this first. And in #893, I will remove ExportedBatch as it allocates array/schema structures in native.

* Try to fix the JVM Unsafe memory leak issue
* Fixed leaks when AQE coalesce partitions is enabled
* Fixes according to reviewer's comments
* Update spark/src/main/java/org/apache/comet/CometBatchIterator.java

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

Kontinuation added 2 commits August 29, 2024 11:23

Try to fix the JVM Unsafe memory leak issue

a90f43a

Fixed leaks when AQE coalesce partitions is enabled

8657a82

viirya reviewed Aug 29, 2024

Fixes according to reviewer's comments

a0c46ee

Kontinuation force-pushed the try-fix-jvm-unsafe-mem-leak branch from 83230a9 to a0c46ee Compare August 29, 2024 18:37

viirya reviewed Aug 29, 2024

Update spark/src/main/java/org/apache/comet/CometBatchIterator.java

c6ec92c

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

Kontinuation marked this pull request as ready for review August 30, 2024 01:50

viirya approved these changes Aug 31, 2024

viirya merged commit 06bb321 into apache:main Aug 31, 2024
