Skip to content

fix: Fix various memory leaks problems#890

Merged
viirya merged 4 commits intoapache:mainfrom
Kontinuation:try-fix-jvm-unsafe-mem-leak
Aug 31, 2024
Merged

fix: Fix various memory leaks problems#890
viirya merged 4 commits intoapache:mainfrom
Kontinuation:try-fix-jvm-unsafe-mem-leak

Conversation

@Kontinuation
Copy link
Copy Markdown
Member

@Kontinuation Kontinuation commented Aug 29, 2024

Which issue does this PR close?

Closes #884.

Rationale for this change

Please refer to the comments in #884 for details.

What changes are included in this PR?

  • Fixes ArrowSchema and ArrowArray base structure leaks in JVM
  • Fixes AQE coalesce partition leaks by closing the ArrowStreamWriter

How are these changes tested?

It is pretty hard to add tests for this fix, so we manually tested this and relying on existing tests to make sure that it does not break anything.

Comment on lines 68 to -64
(0 until batch.numCols()).foreach { index =>
batch.column(index) match {
case a: CometVector =>
val valueVector = a.getValueVector

val provider = if (valueVector.getField.getDictionary != null) {
a.getDictionaryProvider
} else {
null
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for moving the sanity check for column types prior to the actual construction of the Arrow C data structure. It is tricky to release already constructed FFI data structures before raising the exception.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay.

exportedVectors += arrowSchema.memoryAddress()
}

exportedVectors.toArray
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can return ExportedBatch without any above change.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't like the restructuring that moves the sanity checks, I can revert it to the original control flow.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +52 to +53
// The native executor should have moved the previous batch, it is safe for us to deallocate
// the ArrowSchema and ArrowArray base structures.
Copy link
Copy Markdown
Member

@viirya viirya Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. A native operator could possibly save batches internally.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even though the native operator save batches internally, the batch would be moved to the native operator to be saved.


out.flush()
out.close()
writer.close()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closing the writer will close the dictionary provider. If the dictionary arrays are shared across batches, you will close them and empty later batches. I remember we hit the issue before.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I need to take further look at this to fully understand if this fix is correct or not.

Copy link
Copy Markdown
Member Author

@Kontinuation Kontinuation Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be correct. The dictionary provider held by the writer contains copied vectors, so closing them does not interfere with the rest parts of the system.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember closing it will cause some errors in CI due the reason I mentioned. Let's see if CI can pass or not.

Copy link
Copy Markdown
Member Author

@Kontinuation Kontinuation Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI passes on all commits of this PR, it has run 3 rounds with no problem.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, maybe there was some other changes before. Anyway it is good to close the writer without issue.

def close(): Unit = {
arrowSchemas.foreach { schema =>
val snapshot = schema.snapshot
if (snapshot.release != 0) schema.release()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should call release here. The release of exported array should be done when native side drops the imported array.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I did this in case the native side not move away the array (by mistake maybe). This could be removed if the native side always move the array.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I took another look. Native side moved them.

arrowArrays.foreach { array =>
val snapshot = array.snapshot
if (snapshot.release != 0) array.release()
array.close()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to close the internal ArrowBuf of ArrowArray and ArrowSchema. Good catch.

@Kontinuation Kontinuation force-pushed the try-fix-jvm-unsafe-mem-leak branch from 83230a9 to a0c46ee Compare August 29, 2024 18:37
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 23.25581% with 33 lines in your changes missing coverage. Please review.

Project coverage is 55.01%. Comparing base (9d8730d) to head (a0c46ee).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
...ain/scala/org/apache/comet/vector/NativeUtil.scala 0.00% 25 Missing ⚠️
.../scala/org/apache/comet/vector/ExportedBatch.scala 0.00% 7 Missing ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #890      +/-   ##
============================================
- Coverage     55.16%   55.01%   -0.15%     
+ Complexity      857      854       -3     
============================================
  Files           109      110       +1     
  Lines         10542    10592      +50     
  Branches       2010     2020      +10     
============================================
+ Hits           5815     5827      +12     
- Misses         3714     3750      +36     
- Partials       1013     1015       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
@Kontinuation Kontinuation marked this pull request as ready for review August 30, 2024 01:50
@viirya
Copy link
Copy Markdown
Member

viirya commented Aug 30, 2024

@Kontinuation I revised the current approach in #893. So the ArrowSchema and ArrowArray are allocated in native. By doing that, I think we don't need to release the JVM structures.

@Kontinuation
Copy link
Copy Markdown
Member Author

@Kontinuation I revised the current approach in #893. So the ArrowSchema and ArrowArray are allocated in native. By doing that, I think we don't need to release the JVM structures.

Yes. They are allocated by the native side in #893, so they should be released by the native side accordingly.

@viirya
Copy link
Copy Markdown
Member

viirya commented Aug 31, 2024

Thank you @Kontinuation

I will merge this first. And in #893, I will remove ExportedBatch as it allocates array/schema structures in native.

@viirya viirya merged commit 06bb321 into apache:main Aug 31, 2024
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
* Try to fix the JVM Unsafe memory leak issue

* Fixed leaks when AQE coalesce partitions is enabled

* Fixes according to reviewer's comments

* Update spark/src/main/java/org/apache/comet/CometBatchIterator.java

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leaks when running the TPC-H benchmark repeatedly

3 participants