Skip to content

[BEAM-12513] Update initial sections of BPG for Go#15057

Merged
lostluck merged 6 commits intoapache:masterfrom
lostluck:beam12513
Jul 2, 2021
Merged

[BEAM-12513] Update initial sections of BPG for Go#15057
lostluck merged 6 commits intoapache:masterfrom
lostluck:beam12513

Conversation

@lostluck
Copy link
Copy Markdown
Contributor

This PR adds additional Go SDK content to the Beam Programming Guide, up to the end of section 4.2. It also corrects a few incidental display errors in earlier sections (incorrect use of span wrappings or paragraph tags).

Notably an additional example package is added to hold Go snippets for the generation of website.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

ValidatesRunner compliance status (on master branch)

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@lostluck
Copy link
Copy Markdown
Contributor Author

R: @youngoli @tysonjh

Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc.

@lostluck
Copy link
Copy Markdown
Contributor Author

Definitely helped things to move them into section number prefixed files. The code is roughly in order as it appears in the BPG (some things are more naturally in the test file, rather than in the library file). The transform code is all tested against the Go Direct Runner (these simple cases are expected to work against it). Even with nothing the the into examples direcly tested, we now have 80%+ unit tested.

This format will be maintained for subsequent PRs. I'll be branching from here and carrying on with the rest of the BPG, hopefully in smaller chunks.

@lostluck
Copy link
Copy Markdown
Contributor Author

R: @damondouglas

@damondouglas
Copy link
Copy Markdown
Contributor

@lostluck just letting you know I got the notification and this is on my radar. Thank you for inviting me to look at this.

}

// Create the Pipeline object and root scope.
p, s := beam.NewPipelineWithRoot()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change. In this example when it renders on the documentation, the reader is able to know that the s is the scope variable. However, in other examples it isn't as clear.

p, s := beam.NewPipelineWithRoot()

linesPCol := beam.CreateList(s, lines)

to

pipeline, scope := beam.NewPipelineWithRoot()

linesPCol := beam.CreateList(scope, lines)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be relevant to point out that using s as a variable name for the scope is the convention in the Go SDK. This is what the user will see if they read any other Go SDK examples or code. I'm not sure if the value of clarity outweighs sticking to convention, just wanted to mention that. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that the convention needed to be made explicit, and this was an opportunity to do it. Additional commentary added for what's happening in the call.

// a PCollection<string>
func applyWordLen(s beam.Scope, words beam.PCollection) beam.PCollection {
// [START model_pardo_apply]
wordLengths := beam.ParDo(s, &ComputeWordLengthFn{}, words)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change:

wordLengths := beam.ParDo(s, &ComputeWordLengthFn{}, words)

to

wordLengths := beam.ParDo(scope, &ComputeWordLengthFn{}, words)

{{< /highlight >}}

{{< highlight go >}}
[Second PCollection] := beam.ParDo(s, [First Transform], [Initial Input PCollection])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change:

[Second PCollection] := beam.ParDo(s, [First Transform], [Initial Input PCollection])

to

[Second PCollection] := beam.ParDo(scope, [First Transform], [Initial Input PCollection])

@tysonjh
Copy link
Copy Markdown
Contributor

tysonjh commented Jun 25, 2021

R: @youngoli @tysonjh

Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc.

This makes sense to me. Does this impact how it will be displayed in docs? E.g. using github snippets

@lostluck
Copy link
Copy Markdown
Contributor Author

R: @youngoli @tysonjh
Question: What do you think of me consolidating the code example files a bit? I think they should probably not get larger than ~400 lines each, but the small ones are certainly too small. I'm thinking consolidating roughly on size so all the earlier sections would be in pipeline.go, transforms.go would cover section 4, etc.

This makes sense to me. Does this impact how it will be displayed in docs? E.g. using github snippets

If one looked at the GoDoc for the snippet package, no that wouldn't display correctly at all. GoDoc doesn't take file ordering into account everything is sorted by top level name (so methods are grouped with their type as you'd expect). I don't know about github snippets.

As for using those snippets in the beam website, the snippets are file origin agnostic, and copied at static site generation, so how they're stored in code doesn't matter.

Overall, the code organization is largely cosmetic for the code itself, so the present approach of prefixing with BPG section number means the files will be in the order that match the section whose code they contain. Everything internally is magic comments for the code snippet puller.

}

// Create the Pipeline object and root scope.
p, s := beam.NewPipelineWithRoot()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be relevant to point out that using s as a variable name for the scope is the convention in the Go SDK. This is what the user will see if they read any other Go SDK examples or code. I'm not sure if the value of clarity outweighs sticking to convention, just wanted to mention that. What do you think?

type ComputeWordLengthFn struct{}

// ProcessElement is the method to execute for each element.
func (fn *ComputeWordLengthFn) ProcessElement(word string, emit func(int)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the first example of a ParDo, would it be better to go with returning outputs instead of using an emit function? I think there's a section somewhere in the programming guide for outputting multiple elements per input where we can introduce emit functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm largely basing it on the Java approach which has the output context object, rather than the python one.

Generally users will need to know the arbitrary emitters, for more complex DoFns. The simple return case is a convenience but 1:1 is incredibly limiting. If users don't find it, they won't hurt themselves by not knowing it.

The section is outputting multiple PCollections, not multiple elements within a PCollection.

Similarly, that's why I didn't introduce it using a functional DoFn. It leads to users assuming closures can work for configuration, which can't work in Go at the moment. Structural DoFns are not the most convenient, but they are the most versatile. They prevent errors like closure functions, which are far harder to debug when they sometimes work and fail otherwise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mostly comes down to if a user only learns the return convenience, they're stuck with 1:1. If they learn emitters, they can build off of that for 0 to 1, and multiple output PCollections by themselves.

With the return convenience, it needs to be explained that it's always the first output PCollection, in the case the return and the emitter are mixed.

@lostluck
Copy link
Copy Markdown
Contributor Author

lostluck commented Jul 2, 2021

PTAL

@lostluck
Copy link
Copy Markdown
Contributor Author

lostluck commented Jul 2, 2021

R: @jrmccluskey

Copy link
Copy Markdown
Contributor

@jrmccluskey jrmccluskey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good mod the typo I spotted

Co-authored-by: Jack McCluskey <34928439+jrmccluskey@users.noreply.github.com>
@lostluck lostluck merged commit b76db4d into apache:master Jul 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants