[BEAM-12513] Update initial sections of BPG for Go#15057
[BEAM-12513] Update initial sections of BPG for Go#15057lostluck merged 6 commits intoapache:masterfrom
Conversation
|
Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc. |
|
Definitely helped things to move them into section number prefixed files. The code is roughly in order as it appears in the BPG (some things are more naturally in the test file, rather than in the library file). The transform code is all tested against the Go Direct Runner (these simple cases are expected to work against it). Even with nothing the the into examples direcly tested, we now have 80%+ unit tested. This format will be maintained for subsequent PRs. I'll be branching from here and carrying on with the rest of the BPG, hopefully in smaller chunks. |
|
@lostluck just letting you know I got the notification and this is on my radar. Thank you for inviting me to look at this. |
| } | ||
|
|
||
| // Create the Pipeline object and root scope. | ||
| p, s := beam.NewPipelineWithRoot() |
There was a problem hiding this comment.
Suggested change. In this example when it renders on the documentation, the reader is able to know that the s is the scope variable. However, in other examples it isn't as clear.
p, s := beam.NewPipelineWithRoot()
linesPCol := beam.CreateList(s, lines)
to
pipeline, scope := beam.NewPipelineWithRoot()
linesPCol := beam.CreateList(scope, lines)
There was a problem hiding this comment.
It might be relevant to point out that using s as a variable name for the scope is the convention in the Go SDK. This is what the user will see if they read any other Go SDK examples or code. I'm not sure if the value of clarity outweighs sticking to convention, just wanted to mention that. What do you think?
There was a problem hiding this comment.
Agreed that the convention needed to be made explicit, and this was an opportunity to do it. Additional commentary added for what's happening in the call.
| // a PCollection<string> | ||
| func applyWordLen(s beam.Scope, words beam.PCollection) beam.PCollection { | ||
| // [START model_pardo_apply] | ||
| wordLengths := beam.ParDo(s, &ComputeWordLengthFn{}, words) |
There was a problem hiding this comment.
Suggested change:
wordLengths := beam.ParDo(s, &ComputeWordLengthFn{}, words)
to
wordLengths := beam.ParDo(scope, &ComputeWordLengthFn{}, words)
| {{< /highlight >}} | ||
|
|
||
| {{< highlight go >}} | ||
| [Second PCollection] := beam.ParDo(s, [First Transform], [Initial Input PCollection]) |
There was a problem hiding this comment.
Suggested change:
[Second PCollection] := beam.ParDo(s, [First Transform], [Initial Input PCollection])
to
[Second PCollection] := beam.ParDo(scope, [First Transform], [Initial Input PCollection])
This makes sense to me. Does this impact how it will be displayed in docs? E.g. using github snippets |
If one looked at the GoDoc for the snippet package, no that wouldn't display correctly at all. GoDoc doesn't take file ordering into account everything is sorted by top level name (so methods are grouped with their type as you'd expect). I don't know about github snippets. As for using those snippets in the beam website, the snippets are file origin agnostic, and copied at static site generation, so how they're stored in code doesn't matter. Overall, the code organization is largely cosmetic for the code itself, so the present approach of prefixing with BPG section number means the files will be in the order that match the section whose code they contain. Everything internally is magic comments for the code snippet puller. |
| } | ||
|
|
||
| // Create the Pipeline object and root scope. | ||
| p, s := beam.NewPipelineWithRoot() |
There was a problem hiding this comment.
It might be relevant to point out that using s as a variable name for the scope is the convention in the Go SDK. This is what the user will see if they read any other Go SDK examples or code. I'm not sure if the value of clarity outweighs sticking to convention, just wanted to mention that. What do you think?
| type ComputeWordLengthFn struct{} | ||
|
|
||
| // ProcessElement is the method to execute for each element. | ||
| func (fn *ComputeWordLengthFn) ProcessElement(word string, emit func(int)) { |
There was a problem hiding this comment.
As the first example of a ParDo, would it be better to go with returning outputs instead of using an emit function? I think there's a section somewhere in the programming guide for outputting multiple elements per input where we can introduce emit functions.
There was a problem hiding this comment.
I'm largely basing it on the Java approach which has the output context object, rather than the python one.
Generally users will need to know the arbitrary emitters, for more complex DoFns. The simple return case is a convenience but 1:1 is incredibly limiting. If users don't find it, they won't hurt themselves by not knowing it.
The section is outputting multiple PCollections, not multiple elements within a PCollection.
Similarly, that's why I didn't introduce it using a functional DoFn. It leads to users assuming closures can work for configuration, which can't work in Go at the moment. Structural DoFns are not the most convenient, but they are the most versatile. They prevent errors like closure functions, which are far harder to debug when they sometimes work and fail otherwise.
There was a problem hiding this comment.
It mostly comes down to if a user only learns the return convenience, they're stuck with 1:1. If they learn emitters, they can build off of that for 0 to 1, and multiple output PCollections by themselves.
With the return convenience, it needs to be explained that it's always the first output PCollection, in the case the return and the emitter are mixed.
|
PTAL |
|
R: @jrmccluskey |
jrmccluskey
left a comment
There was a problem hiding this comment.
Looks good mod the typo I spotted
Co-authored-by: Jack McCluskey <34928439+jrmccluskey@users.noreply.github.com>
This PR adds additional Go SDK content to the Beam Programming Guide, up to the end of section 4.2. It also corrects a few incidental display errors in earlier sections (incorrect use of span wrappings or paragraph tags).
Notably an additional example package is added to hold Go snippets for the generation of website.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunnercompliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.