
fix(engine): prevent GraalJS memory leak from polyglot Context accumulation#2768

Merged: kthoms merged 1 commit into main from issues/2761-memory-leak on Apr 22, 2026
@kthoms kthoms commented Apr 17, 2026

Summary

Fixes #2761 — progressive memory growth under sustained load with JavaScript script tasks (e.g. Spin/JSON transformations), leading to OutOfMemoryError and container restart.

The issue was introduced with Operaton's switch from Nashorn to GraalJS. Camunda 7.24 (Nashorn) did not have this problem.

Root Cause

Every script evaluation created two new GraalJS polyglot Contexts that were never closed:

  1. ScriptingEngines.createBindings() called scriptEngine.createBindings() → returns GraalJSBindings with a lazy polyglot Context A
  2. This was wrapped in ScriptBindings (which is not instanceof GraalJSBindings)
  3. During engine.eval(script, scriptBindings), GraalJS internally calls getOrCreateGraalJSBindings() which checks bindings instanceof GraalJSBindings
  4. ScriptBindings fails the check → GraalJS creates another new polyglot Context B
  5. Both Contexts leak — ~1 MB per script evaluation under concurrent load
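The five steps above can be modelled in a few lines. This is a toy sketch, not GraalJS or Operaton code: `EngineBindings`, `WrapperBindings`, and `getOrCreate` are hypothetical stand-ins for `GraalJSBindings`, `ScriptBindings`, and `getOrCreateGraalJSBindings()`, and the stand-in context is counted eagerly rather than created lazily.

```java
import javax.script.Bindings;
import javax.script.SimpleBindings;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model only: none of these classes are GraalJS or Operaton types.
final class ContextLeakSketch {

    /** Counts how many stand-in "polyglot contexts" have been created. */
    static final AtomicInteger CONTEXTS = new AtomicInteger();

    /** Stand-in for GraalJSBindings: each instance owns one context. */
    static class EngineBindings extends SimpleBindings {
        EngineBindings() {
            CONTEXTS.incrementAndGet();
        }
    }

    /** Stand-in for ScriptBindings: wraps engine bindings but is a different type. */
    static class WrapperBindings extends SimpleBindings {
        final Bindings wrapped;

        WrapperBindings(Bindings wrapped) {
            this.wrapped = wrapped;
        }
    }

    /** Stand-in for getOrCreateGraalJSBindings(): reuses only on a type match. */
    static EngineBindings getOrCreate(Bindings bindings) {
        if (bindings instanceof EngineBindings) {
            return (EngineBindings) bindings;
        }
        return new EngineBindings(); // type check failed: a second context appears
    }

    /** Returns how many contexts one simulated evaluation creates. */
    static int contextsPerEval(boolean wrap) {
        int before = CONTEXTS.get();
        Bindings bindings = new EngineBindings();      // context A
        if (wrap) {
            bindings = new WrapperBindings(bindings);  // hides the engine type
        }
        getOrCreate(bindings);                         // context B when wrapped
        return CONTEXTS.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("wrapped:   " + contextsPerEval(true));
        System.out.println("unwrapped: " + contextsPerEval(false));
    }
}
```

In this model the wrapped path creates two contexts per evaluation and the unwrapped path creates one, which mirrors the difference between the old and new behaviour described in this PR.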

GraalJS reports THREADING=null, so the engine treats it as non-cachable and creates a fresh script engine per evaluation. Nashorn reported THREADING=MULTITHREADED and was cached: single engine, single context, no leak.
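A cachability check of this kind can be sketched with the standard `javax.script` API. `ScriptEngineFactory.getParameter("THREADING")` is part of JSR-223; the helper names `isCachable` and `factoryReporting` are illustrative and are not taken from the Operaton code base.

```java
import javax.script.ScriptEngineFactory;
import java.lang.reflect.Proxy;

final class ThreadingCheckSketch {

    /** Treats an engine as cachable only if its factory declares a THREADING level. */
    static boolean isCachable(ScriptEngineFactory factory) {
        return factory.getParameter("THREADING") != null;
    }

    /** Demo helper: a factory stub whose getParameter(...) always returns the given value. */
    static ScriptEngineFactory factoryReporting(Object threading) {
        return (ScriptEngineFactory) Proxy.newProxyInstance(
            ThreadingCheckSketch.class.getClassLoader(),
            new Class<?>[] { ScriptEngineFactory.class },
            (proxy, method, args) ->
                "getParameter".equals(method.getName()) ? threading : null);
    }

    public static void main(String[] args) {
        // A factory reporting MULTITHREADED (as Nashorn did) is cachable.
        System.out.println(isCachable(factoryReporting("MULTITHREADED")));
        // A factory reporting null (as described for GraalJS) is not.
        System.out.println(isCachable(factoryReporting(null)));
    }
}
```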

Fix

Primary fix in ScriptingEngines.createBindings() for non-cachable engines (GraalJS):

  • Use the engine's default ENGINE_SCOPE bindings directly (which ARE GraalJSBindings)
  • Populate them eagerly with resolver values (process variables, execution context, beans)
  • Return them without wrapping in ScriptBindings

This ensures that the instanceof GraalJSBindings check in getOrCreateGraalJSBindings() passes, so GraalJS reuses the existing polyglot Context instead of creating a new one.
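The idea of populating the engine's own bindings instead of wrapping them can be sketched with plain `javax.script` types. This is a minimal illustration under assumptions: `populateEngineScope` is a hypothetical helper, and a `SimpleScriptContext` stands in for a real engine's context (with a real engine one would obtain the bindings via `engine.getBindings(ScriptContext.ENGINE_SCOPE)`).

```java
import javax.script.Bindings;
import javax.script.ScriptContext;
import javax.script.SimpleScriptContext;
import java.util.Map;

final class EagerBindingsSketch {

    /**
     * Copies already-resolved values into the context's own ENGINE_SCOPE
     * bindings and returns that same instance, so its concrete type survives.
     */
    static Bindings populateEngineScope(ScriptContext context, Map<String, ?> resolved) {
        Bindings engineScope = context.getBindings(ScriptContext.ENGINE_SCOPE);
        engineScope.putAll(resolved);
        return engineScope; // no wrapper object is introduced
    }

    public static void main(String[] args) {
        ScriptContext context = new SimpleScriptContext();
        Bindings bindings = populateEngineScope(context, Map.of("amount", 42));
        // The returned bindings are the context's own instance, not a copy.
        System.out.println(bindings == context.getBindings(ScriptContext.ENGINE_SCOPE));
        System.out.println(bindings.get("amount"));
    }
}
```

The trade-off is that values are resolved eagerly at binding creation time rather than lazily on lookup, which matches the "populate them eagerly" step in the fix description.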

Supplementary fixes:

  • ScriptingEnvironment: close non-cached engines (AutoCloseable) after evaluation
  • ExpressionEvaluationHandler (DMN): same close-after-eval pattern + guard compilation to cachable engines only
  • HttpResponseImpl: null out the HTTP response reference after reading to allow GC of buffered bodies
  • ServiceTaskConnectorActivityBehavior: close connector response in try-finally after output mapping
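The close-after-eval pattern shared by the first two bullets might look roughly like the sketch below. `closeIfNotCached` is a hypothetical helper; the PR's actual `closeNonCachedEngine()` may differ in signature and logs through `logClosingScriptEngineFailed()` rather than `System.err`.

```java
final class CloseAfterEvalSketch {

    /**
     * Closes the engine if it was not taken from a cache and implements
     * AutoCloseable. Returns true only if close() ran without an exception.
     */
    static boolean closeIfNotCached(Object engine, boolean cached) {
        if (cached || !(engine instanceof AutoCloseable)) {
            return false; // cached engines are reused; others have nothing to close
        }
        try {
            ((AutoCloseable) engine).close();
            return true;
        } catch (Exception e) {
            // A failed close must not mask the script result; report and continue.
            System.err.println("Closing script engine failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean[] closed = { false };
        AutoCloseable engine = () -> closed[0] = true; // stand-in for a script engine
        try {
            // ... evaluate the script with the engine here ...
        } finally {
            closeIfNotCached(engine, false);
        }
        System.out.println("closed = " + closed[0]);
    }
}
```

Calling the helper from a `finally` block means the engine is released even when evaluation throws.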

Load Test Results

A new qa/load-test module reproduces the issue with the exact process and conditions from the report.

| Scenario | Before | After |
| --- | --- | --- |
| Pure JS script tasks (50 users, 90s) | +893 MB → OOM | +38 MB ✅ |
| Full process (30 users, 60s, audit history) | OOM within minutes | +53 MB ✅ |
| Full process (100 users, 120s, no history) | OOM | +55 MB ✅ |

Memory trend between first and last quarter of sustained load:

  • Before: linear growth to OOM
  • After: −21 MB (heap usage actually decreases as GC reclaims memory)
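One plausible way to compute such a first-quarter versus last-quarter trend from periodic heap samples is sketched below. The exact calculation used by the load test is not shown in this PR description, so the averaging approach and the name `quarterTrend` are assumptions.

```java
import java.util.Arrays;

final class HeapTrendSketch {

    /** Arithmetic mean of samples[from, to). */
    static double mean(long[] samples, int from, int to) {
        return Arrays.stream(samples, from, to).average().orElse(0.0);
    }

    /**
     * Mean of the last quarter of the samples minus the mean of the first
     * quarter, in the same unit as the samples. Positive values indicate growth.
     */
    static double quarterTrend(long[] samples) {
        if (samples.length < 4) {
            throw new IllegalArgumentException("need at least 4 samples");
        }
        int quarter = samples.length / 4;
        double first = mean(samples, 0, quarter);
        double last = mean(samples, samples.length - quarter, samples.length);
        return last - first;
    }

    public static void main(String[] args) {
        long[] growingMb = { 100, 110, 120, 130, 140, 150, 160, 170 };
        long[] stableMb = { 200, 210, 190, 180, 185, 190, 180, 185 };
        System.out.println(quarterTrend(growingMb)); // positive: heap keeps growing
        System.out.println(quarterTrend(stableMb));  // negative: heap is shrinking
    }
}
```

Comparing quarter averages rather than single samples smooths out the sawtooth pattern that garbage collection produces in raw heap readings.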

Changes

Engine Core

  • ScriptingEngines: createBindingsForNonCachableEngine() uses engine's default GraalJSBindings; isNonCachableEngine() helper checks THREADING
  • ScriptingEnvironment: closeNonCachedEngine() closes non-cached AutoCloseable engines after eval
  • ScriptLogger: added logClosingScriptEngineFailed() log message

DMN Engine

  • ExpressionEvaluationHandler: close non-cached engines after eval; guard script compilation to cachable engines
  • DmnEngineLogger: added logClosingScriptEngineFailed() log message
  • ExpressionCachingTest: fixed mock to stub ScriptEngineFactory with THREADING=MULTITHREADED (required by the new cachability check)

HTTP Connector

  • HttpResponseImpl: null out httpResponse after collectResponseParameters() so the Apache HttpClient response body buffer is eligible for GC
  • HttpResponseTest: added tests for the null-out behaviour and the no-op Closeable fallback
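The release-after-read behaviour can be illustrated with a small stand-alone class. This is a sketch under assumptions, not the real `HttpResponseImpl`: the class name, the `collectParameters(String)` signature, and the use of a plain `Closeable` in place of the Apache HttpClient response are all hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class ReleaseAfterReadSketch implements AutoCloseable {

    /** Stand-in for the HTTP client's response object, which holds the body buffer. */
    private Closeable rawResponse;
    private final Map<String, Object> parameters = new HashMap<>();

    ReleaseAfterReadSketch(Closeable rawResponse) {
        this.rawResponse = rawResponse;
    }

    /** Copies what is needed out of the response, then closes and drops it. */
    void collectParameters(String body) {
        parameters.put("response", body);
        release();
    }

    Object getParameter(String name) {
        return parameters.get(name);
    }

    boolean holdsRawResponse() {
        return rawResponse != null;
    }

    /** Still safe to call after the response was read: it becomes a no-op. */
    @Override
    public void close() {
        release();
    }

    private void release() {
        Closeable toClose = rawResponse;
        rawResponse = null; // drop the reference so the buffer can be collected
        if (toClose == null) {
            return;
        }
        try {
            toClose.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] closeCalls = { 0 };
        ReleaseAfterReadSketch response = new ReleaseAfterReadSketch(() -> closeCalls[0]++);
        response.collectParameters("{\"ok\":true}");
        response.close(); // no second close on the underlying response
        System.out.println(closeCalls[0] + " close call(s), holds raw = " + response.holdsRawResponse());
    }
}
```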

Connect Plugin

  • ServiceTaskConnectorActivityBehavior: added try-finally to close CloseableConnectorResponse after output parameter mapping

Copilot AI review requested due to automatic review settings April 17, 2026 14:32
@kthoms kthoms self-assigned this Apr 17, 2026
@kthoms kthoms added the qa Tests, quality improvements and assurance label Apr 17, 2026
@kthoms kthoms added this to the 2.1.0 milestone Apr 17, 2026
@kthoms kthoms added the noteworthy Should be documented in the release notes label Apr 17, 2026

Copilot AI left a comment

Pull request overview

This PR addresses sustained-load memory growth/OOMs caused by GraalJS polyglot Context accumulation during script evaluation, and adds a dedicated QA load-test module to reproduce and validate the fix.

Changes:

  • Adjust GraalJS bindings creation and close non-cached script engines after evaluation (engine + DMN).
  • Improve connector/HTTP response resource cleanup to reduce retained buffers/resources.
  • Add qa/load-test module with processes, WireMock stubs, and a memory-trend load test reproducing #2761.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| qa/pom.xml | Adds the new qa/load-test module to the QA reactor. |
| qa/load-test/pom.xml | Defines the load-test module dependencies and failsafe profile execution. |
| qa/load-test/README.md | Documents how to run/configure the new load test module. |
| qa/load-test/src/test/java/org/operaton/bpm/qa/loadtest/LoadTestApplication.java | Minimal Spring Boot test application to host embedded Operaton for load tests. |
| qa/load-test/src/test/java/org/operaton/bpm/qa/loadtest/MemoryLeakLoadTest.java | Concurrent start-process load test with heap sampling assertions to detect leaks/regressions. |
| qa/load-test/src/test/resources/application.yml | Test app configuration (H2, plugins, history, thread limits). |
| qa/load-test/src/test/resources/processes/simple-process.bpmn | Minimal BPMN used as a baseline scenario. |
| qa/load-test/src/test/resources/processes/pure-js-process.bpmn | Script-only BPMN scenario exercising pure JS execution. |
| qa/load-test/src/test/resources/processes/script-only-process.bpmn | Script-only BPMN scenario using Spin JSON access patterns. |
| qa/load-test/src/test/resources/processes/http-only-process.bpmn | HTTP-connector-only BPMN scenario to isolate connector memory behavior. |
| qa/load-test/src/test/resources/processes/vale-antecipado-elegibility.bpmn | Full reproduction BPMN resembling the reported real-world process (scripts + HTTP + DMN). |
| qa/load-test/src/test/resources/processes/dmn_ValeAntecipado_Policy_Age.dmn | DMN decision used by the reproduction process. |
| qa/load-test/src/test/resources/processes/dmn_ValeAntecipado_Policy_Fee.dmn | DMN decision resource for the reproduction set. |
| qa/load-test/src/test/resources/processes/dmn_ValeAntecipado_Policy_TOJ.dmn | DMN decision used by the reproduction process. |
| engine/src/main/java/org/operaton/bpm/engine/impl/scripting/engine/ScriptingEngines.java | Special-case binding creation for non-cachable engines (GraalJS) to avoid extra polyglot contexts. |
| engine/src/main/java/org/operaton/bpm/engine/impl/scripting/env/ScriptingEnvironment.java | Closes non-cached AutoCloseable script engines after execution. |
| engine/src/main/java/org/operaton/bpm/engine/impl/scripting/ScriptLogger.java | Adds a log message for failures when closing a script engine. |
| engine-dmn/engine/src/main/java/org/operaton/bpm/dmn/engine/impl/evaluation/ExpressionEvaluationHandler.java | Closes non-cached engines after eval and guards compilation to cachable engines. |
| engine-dmn/engine/src/main/java/org/operaton/bpm/dmn/engine/impl/DmnEngineLogger.java | Adds a log message for failures when closing a DMN script engine. |
| engine-dmn/engine/src/test/java/org/operaton/bpm/dmn/engine/el/ExpressionCachingTest.java | Updates test stubbing for the new cachability check via THREADING. |
| connect/http-client/src/main/java/org/operaton/connect/httpclient/impl/HttpResponseImpl.java | Ensures the underlying HTTP response is closed and dereferenced after parameters are collected. |
| connect/http-client/src/test/java/org/operaton/connect/httpclient/HttpResponseTest.java | Adds tests ensuring the response remains closable in both “read” and “unread” cases. |
| engine-plugins/connect-plugin/src/main/java/org/operaton/connect/plugin/impl/ServiceTaskConnectorActivityBehavior.java | Closes connector responses after output parameter mapping to prevent resource retention. |

Comment thread qa/load-test/README.md
Comment thread qa/load-test/src/test/resources/processes/vale-antecipado-elegibility.bpmn Outdated
Comment on lines 59 to +65
      ConnectorResponse response = request.execute();
-     applyOutputParameters(execution, response);
+     try {
+       applyOutputParameters(execution, response);
+     } finally {
+       if (response instanceof CloseableConnectorResponse closeableResponse) {
+         closeableResponse.close();
+       }
Comment thread qa/load-test/src/test/java/org/operaton/bpm/qa/loadtest/MemoryLeakLoadTest.java Outdated
Comment thread qa/load-test/src/test/java/org/operaton/bpm/qa/loadtest/MemoryLeakLoadTest.java Outdated
Comment thread qa/load-test/pom.xml
Comment thread qa/load-test/README.md
Comment thread qa/load-test/src/test/resources/processes/dmn-policy-tenure.dmn
Comment thread qa/load-test/src/test/resources/processes/dmn-policy-age.dmn
Comment thread qa/load-test/src/test/resources/processes/credit-eligibility.bpmn
@kthoms kthoms force-pushed the issues/2761-memory-leak branch from 1419479 to 11652f4 Compare April 20, 2026 21:34
@kthoms kthoms added the backport:patch-release Should be ported back to the latest feature release branch. label Apr 21, 2026
@kthoms kthoms force-pushed the issues/2761-memory-leak branch 2 times, most recently from 48baff2 to e3c437b Compare April 22, 2026 02:11
…lation

Under sustained load with JavaScript script tasks, each script evaluation
created two new GraalJS polyglot Contexts that were never closed, causing
progressive memory growth until OutOfMemoryError.

Root cause: ScriptingEngines.createBindings() called
scriptEngine.createBindings() which returns GraalJSBindings with a lazy
polyglot Context, then wrapped it in ScriptBindings. During eval(),
GraalJS checks if ENGINE_SCOPE bindings are instanceof GraalJSBindings.
Since ScriptBindings is not GraalJSBindings, a second polyglot Context
was created. Neither Context was ever closed.

Fix: For non-cachable engines (THREADING=null, e.g., GraalJS), use the
engine's default ENGINE_SCOPE bindings directly (which ARE
GraalJSBindings), populated with resolver values. This ensures GraalJS
reuses the existing polyglot Context instead of creating a new one.

Additional fixes:
- Close non-cached script engines after evaluation in both the engine
  scripting environment and the DMN expression evaluation handler
- Release HTTP response reference after parameter collection in
  HttpResponseImpl to allow GC of buffered response bodies
- Close connector response after output parameter mapping in
  ServiceTaskConnectorActivityBehavior

Load test results (before/after fix):
- Pure JS process: 893 MB growth -> 38 MB growth (50 users, 90s)
- Full process: OOM crash -> stable at ~166 MB (30 users, 60s)

closes #2761

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kthoms kthoms force-pushed the issues/2761-memory-leak branch from e3c437b to bc3286a Compare April 22, 2026 02:19
@kthoms kthoms merged commit 58ff693 into main Apr 22, 2026
16 checks passed
@kthoms kthoms deleted the issues/2761-memory-leak branch April 22, 2026 04:35
kthoms added a commit that referenced this pull request Apr 22, 2026
…lation (#2768)

(cherry picked from commit 58ff693)
Labels

  • backport:patch-release: Should be ported back to the latest feature release branch.
  • noteworthy: Should be documented in the release notes
  • qa: Tests, quality improvements and assurance

Development

Successfully merging this pull request may close these issues.

Memory Leak Under Load (100 Concurrent Users via Locust) Causing Container Restart and JDBC Pool Exhaustion

4 participants