[Bug]: Large PIPELINE_OPTIONS can lead to command line args list too long errors in SDK containers #27839
Description
What happened?
A Dataflow customer passing a large number of --filesToStage options found that workers were unable to boot, failing with Java exited: fork/exec /opt/java/openjdk/bin/java: argument list too long.
After some investigation, it turned out that on Linux, environment variables count toward the kernel's command line length limit.
And the Beam Java container serializes the pipeline options as JSON into an environment variable:
https://github.com/apache/beam/blob/release-2.49.0/sdks/java/container/boot.go#L128
This also happens for Python:
beam/sdks/python/container/boot.go, line 206 (at 9080909):
os.Setenv("PIPELINE_OPTIONS", options)
Previous work to resolve this, focused on the Java class path, was done in #25582.
While that certainly helped, large pipeline options remain a problem.
The proposed fix, at least for Java, is to write another environment variable, PIPELINE_OPTIONS_FILE, containing the path to a file with a JSON-encoded copy of the pipeline options, similar to how we've handled the pathing jar.
The portable SDK harness should check for this environment variable and, if it is set, read the JSON pipeline options from that file; otherwise, fall back to the existing behavior.
This tolerates a slight mismatch between container versions and Beam versions for users who aren't affected by this issue.
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner