Skip to content

[Task]: Spark runner flatMap output should not be required to fit in the memory #23852

@JozoVilcek

Description

@JozoVilcek

What needs to happen?

Currently on Spark runner, if single processElement call produces multiple output elements, they all needs to fit in the memory [1]. This is problematic e.g. for ParquetIO, which instead of Source<> based reads uses DoFn and let reader from inside DoFn push all elements to the output. Similar happens with JdbcIO and was discussed here [2].

The goal is to overcome this constraint and allow to produce large output from DoFn on Spark runner.

[1] https://github.com/apache/beam/blob/v2.39.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkProcessContext.java#L125

[2] https://www.mail-archive.com/dev@beam.apache.org/msg16806.html

Issue Priority

Priority: 2

Issue Component

Component: runner-spark

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions