Disable blob level Lineage metrics for FileSystems #32642
Abacn wants to merge 1 commit into apache:master
Thank you for the prompt changes. They look good to me as an immediate fix.
rohitsinha54
left a comment
I think one way to test both wildcard and sharded cases would be a dummy job that writes the numbers 1 to 100k to sharded files, one number per file, named by number (1.txt, 2.txt, ...), and then reads them back with a wildcard. WDYT?
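A rough local stand-in for the proposed test shape, using only the Python standard library rather than a Beam pipeline: it writes one number per file and reads everything back through a wildcard. The helper names `write_sharded` and `read_wildcard` are hypothetical, and the file count is reduced to 1,000 to keep the sketch quick; a real check would run as a Beam job against the target filesystem.

```python
import glob
import os
import tempfile


def write_sharded(directory, count):
    """Write the numbers 1..count, one per file, named 1.txt, 2.txt, ..."""
    for n in range(1, count + 1):
        with open(os.path.join(directory, f"{n}.txt"), "w") as f:
            f.write(f"{n}\n")


def read_wildcard(pattern):
    """Read every file matching the wildcard and collect the numbers found."""
    values = []
    for path in glob.glob(pattern):
        with open(path) as f:
            values.extend(int(line) for line in f if line.strip())
    return values


# Hypothetical small-scale run: 1,000 files instead of 100k.
with tempfile.TemporaryDirectory() as tmp:
    write_sharded(tmp, 1000)
    values = read_wildcard(os.path.join(tmp, "*.txt"))
```

Sorting `values` and comparing against the expected range confirms that every shard was written and matched by the wildcard.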
Tested with TextIOIT: write, then read, 100,000 files.
Dataflow job id: (missing). In comparison, on master the job gets stuck at the string set update (see also #32649); job id: (missing).
Superseded by #32662

The issue was introduced in #32090 (2.59.0 for Java) and #32430 (2.60.0 for Python).
Some pipelines read or write millions of files. Reporting lineage for each file produces very large stringset metrics, which pushes the job status response size past an internal limit and affects the visuals and functionality that rely on metrics (job progress, other user counters, etc.). Symptoms include a stalled progress bar for batch jobs and user counter increments that are incomplete or dropped.
Until Beam and/or the backend can handle, or guard against, a large number of metrics, this PR mitigates the issue by reporting only bucket-level Lineage.
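To illustrate why bucket-level reporting keeps the metric small, here is a minimal sketch in plain Python. It is not Beam's implementation: the helpers `bucket_level` and `lineage_entries` and the bucket name `my-bucket` are hypothetical, and a Python `set` stands in for the stringset metric.

```python
from urllib.parse import urlsplit


def bucket_level(path):
    """Collapse a blob path such as gs://bucket/dir/file.txt to gs://bucket.

    Paths without a scheme and bucket (e.g. local paths) are returned as-is.
    """
    parts = urlsplit(path)
    if not parts.scheme or not parts.netloc:
        return path
    return f"{parts.scheme}://{parts.netloc}"


def lineage_entries(paths, blob_level=False):
    """Return the set of lineage strings that would be reported (illustrative)."""
    if blob_level:
        return set(paths)
    return {bucket_level(p) for p in paths}


# Hypothetical workload: 100,000 output files in a single bucket.
paths = [f"gs://my-bucket/output/{i}.txt" for i in range(1, 100_001)]
blob_entries = lineage_entries(paths, blob_level=True)
bucket_entries = lineage_entries(paths)
print(len(blob_entries), len(bucket_entries))  # prints "100000 1"
```

With blob-level entries the set grows with the number of files, while bucket-level entries grow only with the number of distinct buckets, which is the trade-off this mitigation accepts.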