[Feature Request][Go SDK]: Specialize post-GBK iteration to decode on demand, if not-reiterated. #22900

@lostluck

Description

What would you like to happen?

The Go SDK currently reads and decodes an entire element before ProcessElement can be called. This is fine for single small elements, but for KV<K, Iter<V>> elements (the post-GBK type) it can lead to extreme heap usage, depending on the runner's behaviour and the size of each Iter.

Instead of pre-decoding every V in the iterator before processing, we can decode each V as the user iterates through the stream and process it on demand. This moves both the burden of, and the control over, heap usage after a GBK to the user.
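To make the difference concrete, here is a minimal sketch contrasting the two strategies. It is not the SDK's implementation: `encodeInts`, `newStream`, `decodeAll`, `lazyIter`, and `sum` are hypothetical names, and a varint stream stands in for the encoded Iter<V>. Only the `func(*V) bool` iterator shape is borrowed from how Go SDK DoFns receive iterators.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// encodeInts writes each value as a varint; a stand-in for an encoded Iter<V>.
func encodeInts(vs []int64) []byte {
	var buf bytes.Buffer
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, v := range vs {
		n := binary.PutVarint(tmp, v)
		buf.Write(tmp[:n])
	}
	return buf.Bytes()
}

// newStream wraps encoded bytes in a reader (hypothetical helper).
func newStream(data []byte) io.ByteReader {
	return bytes.NewReader(data)
}

// decodeAll is the eager strategy: every value is decoded and held
// in memory before any processing can start.
func decodeAll(r io.ByteReader) ([]int64, error) {
	var out []int64
	for {
		v, err := binary.ReadVarint(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, v)
	}
}

// lazyIter is the on-demand strategy: one value is decoded per call,
// so only the current value needs to be live. It keeps no cache, which
// means the returned iterator cannot be re-iterated.
func lazyIter(r io.ByteReader) func(*int64) bool {
	return func(v *int64) bool {
		x, err := binary.ReadVarint(r)
		if err != nil {
			return false // end of stream (or a decode error, ignored in this sketch)
		}
		*v = x
		return true
	}
}

// sum consumes an iterator the way a user's ProcessElement might.
func sum(next func(*int64) bool) int64 {
	var total, v int64
	for next(&v) {
		total += v
	}
	return total
}

func main() {
	data := encodeInts([]int64{1, 2, 3, -4})
	eager, _ := decodeAll(newStream(data))
	fmt.Println(len(eager), sum(lazyIter(newStream(data)))) // 4 2
}
```

In the eager path the whole slice is live at once; in the lazy path the user decides what, if anything, to retain.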

Note that for this approach to be effective, we can't add any caching to allow re-iteration without re-decoding. This rules it out for general CoGBK types because of their typical implementation in Beam SDKs (in terms of a KV<K,Iter<colID+value>> coder), which requires the SDK to re-iterate through the values once per requested iterator in order to filter them as necessary. It also rules it out in a fused stage where a single GBK result PCollection is read by multiple PTransforms.
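A small sketch of why a cache-free stream cannot serve the CoGBK case. The `tagged` type and the `singlePass` and `column` helpers are hypothetical illustrations, not SDK types. Filtering one column drains the shared stream, so a second column sees nothing.

```go
package main

import "fmt"

// tagged is a hypothetical stand-in for one colID+value entry
// in a CoGBK-style value stream.
type tagged struct {
	col int
	val string
}

// singlePass returns an iterator that can be consumed only once,
// like a decode-on-demand stream with no cache behind it.
func singlePass(src []tagged) func(*tagged) bool {
	i := 0
	return func(t *tagged) bool {
		if i >= len(src) {
			return false
		}
		*t = src[i]
		i++
		return true
	}
}

// column drains the iterator and keeps only the values tagged with col.
func column(next func(*tagged) bool, col int) []string {
	var out []string
	var t tagged
	for next(&t) {
		if t.col == col {
			out = append(out, t.val)
		}
	}
	return out
}

func main() {
	stream := singlePass([]tagged{{0, "a"}, {1, "x"}, {0, "b"}})
	first := column(stream, 0)  // consumes the whole stream
	second := column(stream, 1) // nothing left to read
	fmt.Println(first, len(second)) // [a b] 0
}
```

Serving the second column would need either a cache of decoded values or a way to restart the stream, which is exactly what the on-demand approach gives up.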

However, such an approach does work for post lifted combines, reshuffles, and all single GBKs, which makes this a worthwhile optimization.

A prototype of this approach reduced pipeline RAM usage to between 50% and 33% of previous usage with int streams, with one user reporting as little as 5% of previous usage (though their decoded values were very large relative to the encoded data read in). The magnitude of the benefit will be strongly data dependent, but in all cases it should improve garbage collection behavior, with a commensurate reduction in GC overhead.

Issue Priority

Priority: 2

Issue Component

Component: sdk-go

Labels

P2, done & done, go, new feature, performance