Skip to content

YamlTemplate speed improvements#1405

Merged
copybara-service[bot] merged 1 commit intoGoogleCloudPlatform:mainfrom
Polber:jkinard/yaml-speed
Apr 18, 2024
Merged

YamlTemplate speed improvements#1405
copybara-service[bot] merged 1 commit intoGoogleCloudPlatform:mainfrom
Polber:jkinard/yaml-speed

Conversation

@Polber
Copy link
Copy Markdown
Contributor

@Polber Polber commented Mar 28, 2024

Create new Dockerfile based on Google-provided base template images that serves as a Distroless Xlang image for YAML templates.

@Polber Polber self-assigned this Mar 28, 2024
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 28, 2024

Codecov Report

❗ No coverage uploaded for pull request base (main@f35f806). Click here to learn what that means.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1405   +/-   ##
=======================================
  Coverage        ?   39.37%           
  Complexity      ?     2820           
=======================================
  Files           ?      754           
  Lines           ?    42286           
  Branches        ?     4550           
=======================================
  Hits            ?    16651           
  Misses          ?    24131           
  Partials        ?     1504           
Components Coverage Δ
spanner-templates 53.54% <ø> (?)
spanner-import-export 65.49% <ø> (?)
spanner-live-forward-migration 57.79% <ø> (?)
spanner-live-reverse-replication 39.01% <ø> (?)
spanner-bulk-migration 62.78% <ø> (?)
Files Coverage Δ
...a/com/google/cloud/teleport/metadata/Template.java 100.00% <ø> (ø)
...oud/teleport/plugin/PythonDockerfileGenerator.java 25.00% <100.00%> (ø)
...cloud/teleport/plugin/YamlDockerfileGenerator.java 88.46% <100.00%> (ø)

@Polber Polber force-pushed the jkinard/yaml-speed branch 2 times, most recently from 94c49f2 to f922a0b Compare April 1, 2024 18:32
@Polber Polber requested a review from damccorm April 1, 2024 20:29
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these libraries come from?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, it feels like we're taking in base container images, but making a lot of assumptions about how they'll be configured. What constraints do we have on our base images here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are being copied from Python base image which has a full install of python on it (with necessary licenses and libraries). I am copying the minimum shared c libraries that python uses to compile the graph since that is all that is done in the template launcher.

Ideally, there would be a staged distroless base image that has python, java and up-to-date libraries, but until that is possible, this provides significant speed up by copying just what is needed from both.

I don't know about "constraints" but the base image for python is only updated with the new version of beam, so nothing else should be changing other than the site-packages folder (which is not used by this anyway)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're assuming that this will be the Beam Python base image, why are we parameterizing this? Will we ever pass in a base image that isn't the beam base image? Can we just take the beam version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the python base image could theoretically be kept static since it is just used as a build env, but using the parameter keeps it inline with the other templates that also use it (i.e. python templates and java templates with python UDF support), so that ti only needs to be kept updated in one place.

The java base image could also theoretically be kept static, but having a parameter for it allows a user to use a custom java image (albeit this use-case may not exist in the wild), and the same argument could be made for the python image too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we make too many assumptions about the structure of the base image to allow arbitrary images. For example /tmp/dataflow-requirements-cache is not guaranteed to exist (and won't exist) in most images

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're also assuming that the java container doesn't have python already installed (may or may not be true). Generally, this dockerfile is fit pretty tightly to the images its building on, so I don't think this should be parameterized.

so that ti only needs to be kept updated in one place.

We could still keep the version in a single place next to the images (we could even keep image names stripped of version and version as separate vars since I imagine generally people just want to update the version)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Java container does not have Python installed, it is Java distroless, so it doesn't have much more than a Java runtime and the template files/launcher.

/tmp/dataflow-requirements-cache is guaranteed to exist on python base image because that is where the python wheels are being downloaded to (also documented in https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#supply-requirements-file)

I do get your point about it being tightly fitted, and while I think it would be nice that have a common definition so that they can stay in sync, I think there may be a few more iterations to this anyway, so I'll go ahead and hard code it for now.

@Polber Polber force-pushed the jkinard/yaml-speed branch from f922a0b to c677b83 Compare April 5, 2024 22:51
@Polber
Copy link
Copy Markdown
Contributor Author

Polber commented Apr 5, 2024

@damccorm Fixed relevant changes, and this should also fix the broken YamlTemplate IT's

@Polber Polber force-pushed the jkinard/yaml-speed branch 5 times, most recently from 22a6feb to c547013 Compare April 9, 2024 00:50
@Polber Polber requested a review from damccorm April 10, 2024 17:05
@Polber Polber force-pushed the jkinard/yaml-speed branch from c547013 to 9ddcbfa Compare April 11, 2024 22:28
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we make too many assumptions about the structure of the base image to allow arbitrary images. For example /tmp/dataflow-requirements-cache is not guaranteed to exist (and won't exist) in most images

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're also assuming that the java container doesn't have python already installed (may or may not be true). Generally, this dockerfile is fit pretty tightly to the images its building on, so I don't think this should be parameterized.

so that ti only needs to be kept updated in one place.

We could still keep the version in a single place next to the images (we could even keep image names stripped of version and version as separate vars since I imagine generally people just want to update the version)

@Polber Polber force-pushed the jkinard/yaml-speed branch 3 times, most recently from a405391 to 563a66a Compare April 12, 2024 16:33
@Polber Polber requested a review from damccorm April 12, 2024 16:35
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a test with an empty list here as well to test that it does the right thing

Polber added a commit to Polber/DataflowTemplates that referenced this pull request Apr 15, 2024
Signed-off-by: Jeffrey Kinard <jeff@thekinards.com>
@Polber Polber force-pushed the jkinard/yaml-speed branch from 563a66a to 4d893c4 Compare April 17, 2024 15:43
damccorm
damccorm previously approved these changes Apr 17, 2024
@damccorm damccorm added the Google LGTM Approval of a pull request to be merged into the repository label Apr 17, 2024
@Polber Polber force-pushed the jkinard/yaml-speed branch 2 times, most recently from 6607d1a to b2235e5 Compare April 17, 2024 19:01
Signed-off-by: Jeffrey Kinard <jeff@thekinards.com>
@Polber Polber force-pushed the jkinard/yaml-speed branch from b2235e5 to dd5a13f Compare April 17, 2024 19:01
Copy link
Copy Markdown
Contributor

@damccorm damccorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@copybara-service copybara-service bot merged commit 2b77b2f into GoogleCloudPlatform:main Apr 18, 2024
liferoad pushed a commit to aksharauke/DataflowTemplates that referenced this pull request Apr 18, 2024
Get consistent java-pr signal on main

Default to THROUGHPUT_BASED autoscaling for classic templates (GoogleCloudPlatform#1403)

* single commit off of master

* No-op comment change to kick copybara

* Revert "No-op comment change to kick copybara"

This reverts commit 928dc6b.

* Different no-op to kick copybara and avoid conflicts

* One more try to get copybara to merge

Support FLOAT32 type in v1 templates of Spanner. (GoogleCloudPlatform#1376)

Squashed commit of the following:

commit 1e21615
Merge: 5764ef8 27e7486
Author: Aravind Pedapudi <aravindp1510@gmail.com>
Date:   Thu Apr 4 21:35:22 2024 +0530

    Merge branch 'main' into float32

commit 5764ef8
Author: Aravind <aravindp1510@gmail.com>
Date:   Mon Apr 1 17:40:34 2024 +0530

    Support FLOAT32 type in v1 templates of Spanner.

    Squashed commit of the following:

    commit a678a75
    Merge: f9fe874 c9f1c66
    Author: Aravind Pedapudi <aravindp1510@gmail.com>
    Date:   Mon Apr 1 16:08:20 2024 +0530

        Merge branch 'main' into float32

    commit f9fe874
    Merge: e28335c 7449ae6
    Author: Aravind Pedapudi <aravindp1510@gmail.com>
    Date:   Thu Mar 28 10:50:24 2024 +0530

        Merge branch 'main' into float32

    commit e28335c
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Wed Mar 27 16:33:56 2024 +0530

        Fix integration test failures for TextImportPipelineIT

    commit 8480654
    Merge: c6364b7 afd1dc2
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Wed Mar 27 16:25:57 2024 +0530

        Merge branch 'float32' of github.com:arawind/DataflowTemplates into float32

    commit c6364b7
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Wed Mar 27 16:22:27 2024 +0530

        Add avro integration tests for FLOAT32 type in Spanner.

        This fixes the ExportPipelineIT and adds support for the
        ImportPipelineIT.

    commit afd1dc2
    Merge: eca6acf 4536653
    Author: Aravind Pedapudi <aravindp1510@gmail.com>
    Date:   Wed Mar 27 11:37:24 2024 +0530

        Merge branch 'main' into float32

    commit eca6acf
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Fri Mar 22 17:12:35 2024 +0530

        Address review comments

        Adds new tests in ExportPipelineIT and also removes the INT32 -> FLOAT32 parsing as we don't see any use for it.

    commit c1cd281
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Thu Mar 21 18:15:05 2024 +0530

        Fix Beam to handle float32 arrays and pg schema.

        This was missed in the earlier commits

    commit 5e936f4
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Thu Mar 21 13:32:01 2024 +0530

        Address review comments

    commit f4ab908
    Merge: a6aba30 1df7e0a
    Author: Aravind Pedapudi <aravindp1510@gmail.com>
    Date:   Thu Mar 21 12:56:09 2024 +0530

        Merge branch 'GoogleCloudPlatform:main' into float32

    commit a6aba30
    Merge: 66c488c f44f7ce
    Author: Aravind Pedapudi <aravindp1510@gmail.com>
    Date:   Wed Mar 20 16:21:11 2024 +0530

        Merge branch 'main' into float32

    commit 66c488c
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Wed Mar 20 15:49:31 2024 +0530

        Fix gax dependency failures in spanner integration tests.

        We were seeing this error in the spanner integration tests: `class com.google.cloud.spanner.v1.stub.SpannerStubSettings overrides final method com.google.api.gax.rpc.StubSettings.getEndpoint()Ljava/lang/String;`, which were triggered by the google-cloud-spanner BOM update to 6.61.
        We're fixing these errors by setting the gax version to the one used by Cloud Spanner client libs. This should be removed once Beam is updated to the latest version of gax (which may happen when Beam's dependency on Spanner is updated).

    commit 90c1fbc
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Tue Mar 19 19:17:03 2024 +0530

        Support FLOAT32 type in v1 spanner templates

    commit 51f75f0
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Tue Mar 19 19:10:06 2024 +0530

        Modify the local copy of Beam to support FLOAT32 type.

    commit f85fd00
    Author: Aravind <aravindp1510@gmail.com>
    Date:   Tue Mar 19 18:53:17 2024 +0530

        Update the version of google-cloud-spanner-bom to 6.61.0

Upgrade Beam version to 2.55.1

Specify venv

single commit off of master

No-op comment change to kick copybara

Revert "No-op comment change to kick copybara"

This reverts commit 928dc6b.

Different no-op to kick copybara and avoid conflicts

PR GoogleCloudPlatform#1376: Support FLOAT32 type in v1 templates of Spanner

Please approve this CL. It will be submitted automatically, and its GitHub pull request will be marked as merged.

Imported from GitHub PR GoogleCloudPlatform#1376

GoogleSQL:
* [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_01_22-9618753751624897563;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22)))
* [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_38_00-8680174928377892705;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22)))

PG:
* [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_47_07-2771163877050158940;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22)))
* [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_58_11-1643174307354839989;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22)))

GoogleSQL:
* [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_00_17_56-5509502875882896332?e=13802955&mods=monitoring_api_prod&project=span-cloud-testing)
* [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_02_50_39-14326646282971458157?e=13802955&mods=monitoring_api_prod&project=span-cloud-testing)

PG:
* [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_03_17_13-5127588554893435640;graphView=0?project=span-cloud-testing&e=13802955&mods=monitoring_api_prod&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22)))
* [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_03_09_55-9477941764979866789?e=13802955&mods=monitoring_api_prod&project=span-cloud-testing)

Copybara import of the project:

  - 357481bb0fdfba9c91fdde1c24ae4bbcaf5d2bf4 Support FLOAT32 type in v1 templates of Spanner. by Aravind <aravindp1510@gmail.com>

COPYBARA_INTEGRATE_REVIEW=GoogleCloudPlatform#1376 from arawind:float32 357481bb0fdfba9c91fdde1c24ae4bbcaf5d2bf4
PiperOrigin-RevId: 622232578

String change to match external

PiperOrigin-RevId: 622247573

Add schemaMapperInterface and implementations

Add namespace to params

spotless

Add user-agent string to Elasticsearch templates

Rename templateName to userAgent

Fix style

Capitalize header name

Add unit tests

Fix style

Bugfix for UI issue removing flex templates from drop down menu

Use beam version defined in the pom.xml to make sure we stay on the most current version. Also add a comment regarding the hashed file location for context.

Implement enum based experiment value provider

disable YamlTemplateIT's until GoogleCloudPlatform#1405 is merged

Signed-off-by: Jeffrey Kinard <jeff@thekinards.com>

Add smoke tests for the no UDF case for Xlang templates, abstract conversion functions, fix bug with flexContainerName.

A small simplification in Bulk Reader tests.

Allow manual triggering of PR workflow

Add python version to setup-env actions

Add parentName and parentTriggerValues attributes

Test that new attributes are present

Remove redundant line
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Google LGTM Approval of a pull request to be merged into the repository size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants