What needs to happen?
Currently on the Spark runner, if a single processElement call produces multiple output elements, they all need to fit in memory [1]. This is problematic e.g. for ParquetIO, which, instead of Source<>-based reads, uses a DoFn and lets the reader inside the DoFn push all elements to the output. A similar issue occurs with JdbcIO and was discussed here [2].
The goal is to remove this constraint and allow a DoFn to produce large output on the Spark runner.
[1] https://github.com/apache/beam/blob/v2.39.0/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkProcessContext.java#L125
[2] https://www.mail-archive.com/dev@beam.apache.org/msg16806.html
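To illustrate the constraint, here is a minimal, self-contained sketch (not Beam API; the class and method names are hypothetical) contrasting the eager approach, where all outputs of one call are buffered in a collection before downstream processing, with a lazy iterator that holds only one element at a time:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FanOutSketch {

    // Eager fan-out: analogous to buffering every output of a single
    // processElement call in memory before handing them downstream.
    // With a large fan-out, the whole buffer must fit in memory at once.
    static List<Integer> eagerFanOut(int input, int fanOut) {
        List<Integer> buffer = new ArrayList<>();
        for (int i = 0; i < fanOut; i++) {
            buffer.add(input + i); // all elements live in memory simultaneously
        }
        return buffer;
    }

    // Lazy fan-out: outputs are produced on demand, so only one element
    // needs to exist in memory at a time regardless of fan-out size.
    static Iterator<Integer> lazyFanOut(int input, int fanOut) {
        return new Iterator<Integer>() {
            int i = 0;
            public boolean hasNext() { return i < fanOut; }
            public Integer next() { return input + i++; }
        };
    }

    public static void main(String[] args) {
        // One input element fanning out into three outputs.
        List<Integer> eager = eagerFanOut(10, 3);
        System.out.println(eager); // [10, 11, 12]

        Iterator<Integer> lazy = lazyFanOut(10, 3);
        while (lazy.hasNext()) {
            System.out.println(lazy.next()); // consumed one at a time
        }
    }
}
```

The fix would amount to making the runner consume DoFn output in the second, iterator-like style instead of materializing it first.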
Issue Priority
Priority: 2
Issue Component
Component: runner-spark