[Bug]: Large PIPELINE_OPTIONS can lead to command line args list too long errors in SDK containers #27839

@lostluck

What happened?

A Dataflow customer staging a large number of files via --filesToStage found that workers were unable to boot, failing with Java exited: fork/exec /opt/java/openjdk/bin/java: argument list too long.

Investigation showed that on Linux, environment variables count against the same size limit as command-line arguments when a process is exec'd:

https://stackoverflow.com/questions/28865473/setting-environment-variable-to-a-large-value-argument-list-too-long

The Beam Java SDK container serializes the pipeline options as JSON into an environment variable:

https://github.com/apache/beam/blob/release-2.49.0/sdks/java/container/boot.go#L128

The Python SDK container does the same, although no failures have been reported there yet:

os.Setenv("PIPELINE_OPTIONS", options)

Previous work on this problem, focused on the Java class path, was done in #25582.

While that certainly helped, large pipeline options remain a problem.

The proposed fix, for Java at least, is to set another environment variable, PIPELINE_OPTIONS_FILE, which holds the location of a file containing the JSON-encoded pipeline options, similar to what we've done for the pathing jar.

The portable SDK harness should check for this environment variable and, if it is set, read the JSON pipeline options from the file it points to. Otherwise, it should fall back to the existing behavior.

The fallback tolerates a slight mismatch between container versions and Beam versions for users who aren't experiencing this issue.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
