[BEAM-12513] Update initial sections of BPG for Go by lostluck · Pull Request #15057 · apache/beam

lostluck · 2021-06-22T01:06:29Z

This PR adds additional Go SDK content to the Beam Programming Guide, up to the end of section 4.2. It also corrects a few incidental display errors in earlier sections (incorrect use of span wrappings or paragraph tags).

Notably an additional example package is added to hold Go snippets for the generation of website.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

`ValidatesRunner` compliance status (on master branch)

Lang	ULR	Samza	Twister2
Go	---	---	---
Java
Python	---	---	---
XLang		---	---

Examples testing status on various runners

Lang	ULR	Dataflow	Flink	Samza	Spark	Twister2
Go	---	---	---	---	---	---	---
Java	---		---	---	---	---	---
Python	---	---	---	---	---	---	---
XLang	---	---	---	---	---	---	---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go	Java	Python

Pre-Commit Tests Status (on master branch)

---	Java	Python	Go	Website	Whitespace	Typescript
Non-portable
Portable	---			---	---	---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI.

lostluck · 2021-06-22T01:11:20Z

R: @youngoli @tysonjh

Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc.

lostluck · 2021-06-22T22:11:23Z

Definitely helped things to move them into section number prefixed files. The code is roughly in order as it appears in the BPG (some things are more naturally in the test file, rather than in the library file). The transform code is all tested against the Go Direct Runner (these simple cases are expected to work against it). Even with nothing the the into examples direcly tested, we now have 80%+ unit tested.

This format will be maintained for subsequent PRs. I'll be branching from here and carrying on with the rest of the BPG, hopefully in smaller chunks.

lostluck · 2021-06-24T17:21:27Z

R: @damondouglas

damondouglas · 2021-06-24T18:32:31Z

@lostluck just letting you know I got the notification and this is on my radar. Thank you for inviting me to look at this.

sdks/go/examples/snippets/01_03intro.go

website/www/site/content/en/documentation/programming-guide.md

sdks/go/examples/snippets/01_03intro.go

damondouglas · 2021-06-25T00:56:52Z

sdks/go/examples/snippets/01_03intro.go

+	}
+
+	// Create the Pipeline object and root scope.
+	p, s := beam.NewPipelineWithRoot()


Suggested change. In this example when it renders on the documentation, the reader is able to know that the s is the scope variable. However, in other examples it isn't as clear.

p, s := beam.NewPipelineWithRoot() linesPCol := beam.CreateList(s, lines)

to

pipeline, scope := beam.NewPipelineWithRoot() linesPCol := beam.CreateList(scope, lines)

It might be relevant to point out that using s as a variable name for the scope is the convention in the Go SDK. This is what the user will see if they read any other Go SDK examples or code. I'm not sure if the value of clarity outweighs sticking to convention, just wanted to mention that. What do you think?

Agreed that the convention needed to be made explicit, and this was an opportunity to do it. Additional commentary added for what's happening in the call.

sdks/go/examples/snippets/04transforms.go

damondouglas · 2021-06-25T01:00:21Z

sdks/go/examples/snippets/04transforms.go

+// a PCollection<string>
+func applyWordLen(s beam.Scope, words beam.PCollection) beam.PCollection {
+	// [START model_pardo_apply]
+	wordLengths := beam.ParDo(s, &ComputeWordLengthFn{}, words)


Suggested change:

wordLengths := beam.ParDo(s, &ComputeWordLengthFn{}, words)

to

wordLengths := beam.ParDo(scope, &ComputeWordLengthFn{}, words)

website/www/site/content/en/documentation/programming-guide.md

damondouglas · 2021-06-25T01:05:32Z

website/www/site/content/en/documentation/programming-guide.md

 {{< /highlight >}}

+{{< highlight go >}}
+[Second PCollection] := beam.ParDo(s, [First Transform], [Initial Input PCollection])


Suggested change:

[Second PCollection] := beam.ParDo(s, [First Transform], [Initial Input PCollection])

to

[Second PCollection] := beam.ParDo(scope, [First Transform], [Initial Input PCollection])

website/www/site/content/en/documentation/programming-guide.md

tysonjh · 2021-06-25T20:55:47Z

R: @youngoli @tysonjh

Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc.

This makes sense to me. Does this impact how it will be displayed in docs? E.g. using github snippets

lostluck · 2021-06-25T21:14:27Z

R: @youngoli @tysonjh
Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc.

This makes sense to me. Does this impact how it will be displayed in docs? E.g. using github snippets

If one looked at the GoDoc for the snippet package, no that wouldn't display correctly at all. GoDoc doesn't take file ordering into account everything is sorted by top level name (so methods are grouped with their type as you'd expect). I don't know about github snippets.

As for using those snippets in the beam website, the snippets are file origin agnostic, and copied at static site generation, so how they're stored in code doesn't matter.

Overall, the code organization is largely cosmetic for the code itself, so the present approach of prefixing with BPG section number means the files will be in the order that match the section whose code they contain. Everything internally is magic comments for the code snippet puller.

youngoli · 2021-06-29T06:33:48Z

sdks/go/examples/snippets/01_03intro.go

+	}
+
+	// Create the Pipeline object and root scope.
+	p, s := beam.NewPipelineWithRoot()


It might be relevant to point out that using s as a variable name for the scope is the convention in the Go SDK. This is what the user will see if they read any other Go SDK examples or code. I'm not sure if the value of clarity outweighs sticking to convention, just wanted to mention that. What do you think?

website/www/site/content/en/documentation/programming-guide.md

youngoli · 2021-06-29T06:47:42Z

sdks/go/examples/snippets/04transforms.go

+type ComputeWordLengthFn struct{}
+
+// ProcessElement is the method to execute for each element.
+func (fn *ComputeWordLengthFn) ProcessElement(word string, emit func(int)) {


As the first example of a ParDo, would it be better to go with returning outputs instead of using an emit function? I think there's a section somewhere in the programming guide for outputting multiple elements per input where we can introduce emit functions.

I'm largely basing it on the Java approach which has the output context object, rather than the python one.

Generally users will need to know the arbitrary emitters, for more complex DoFns. The simple return case is a convenience but 1:1 is incredibly limiting. If users don't find it, they won't hurt themselves by not knowing it.

The section is outputting multiple PCollections, not multiple elements within a PCollection.

Similarly, that's why I didn't introduce it using a functional DoFn. It leads to users assuming closures can work for configuration, which can't work in Go at the moment. Structural DoFns are not the most convenient, but they are the most versatile. They prevent errors like closure functions, which are far harder to debug when they sometimes work and fail otherwise.

It mostly comes down to if a user only learns the return convenience, they're stuck with 1:1. If they learn emitters, they can build off of that for 0 to 1, and multiple output PCollections by themselves.

With the return convenience, it needs to be explained that it's always the first output PCollection, in the case the return and the emitter are mixed.

website/www/site/content/en/documentation/programming-guide.md

lostluck · 2021-07-02T00:22:41Z

PTAL

lostluck · 2021-07-02T00:23:53Z

R: @jrmccluskey

jrmccluskey

Looks good mod the typo I spotted

website/www/site/content/en/documentation/programming-guide.md

Co-authored-by: Jack McCluskey <34928439+jrmccluskey@users.noreply.github.com>

[BEAM-12513] Update initial sections of BPG for Go

37f1e6a

lostluck added 2 commits June 22, 2021 09:13

!fixup whitepsace lint

77469f1

[BEAM-12513] Reformat code to be easier to find.

f035a8b