YamlTemplate speed improvements#1405
YamlTemplate speed improvements#1405copybara-service[bot] merged 1 commit intoGoogleCloudPlatform:mainfrom
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1405 +/- ##
=======================================
Coverage ? 39.37%
Complexity ? 2820
=======================================
Files ? 754
Lines ? 42286
Branches ? 4550
=======================================
Hits ? 16651
Misses ? 24131
Partials ? 1504
|
94c49f2 to
f922a0b
Compare
metadata/src/main/java/com/google/cloud/teleport/metadata/Template.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Where do these libraries come from?
There was a problem hiding this comment.
Generally, it feels like we're taking in base container images, but making a lot of assumptions about how they'll be configured. What constraints do we have on our base images here?
There was a problem hiding this comment.
They are being copied from Python base image which has a full install of python on it (with necessary licenses and libraries). I am copying the minimum shared c libraries that python uses to compile the graph since that is all that is done in the template launcher.
Ideally, there would be a staged distroless base image that has python, java and up-to-date libraries, but until that is possible, this provides significant speed up by copying just what is needed from both.
I don't know about "constraints" but the base image for python is only updated with the new version of beam, so nothing else should be changing other than the site-packages folder (which is not used by this anyway)
There was a problem hiding this comment.
If we're assuming that this will be the Beam Python base image, why are we parameterizing this? Will we ever pass in a base image that isn't the beam base image? Can we just take the beam version?
There was a problem hiding this comment.
I suppose the python base image could theoretically be kept static since it is just used as a build env, but using the parameter keeps it inline with the other templates that also use it (i.e. python templates and java templates with python UDF support), so that ti only needs to be kept updated in one place.
The java base image could also theoretically be kept static, but having a parameter for it allows a user to use a custom java image (albeit this use-case may not exist in the wild), and the same argument could be made for the python image too.
There was a problem hiding this comment.
I think we make too many assumptions about the structure of the base image to allow arbitrary images. For example /tmp/dataflow-requirements-cache is not guaranteed to exist (and won't exist) in most images
There was a problem hiding this comment.
We're also assuming that the java container doesn't have python already installed (may or may not be true). Generally, this dockerfile is fit pretty tightly to the images its building on, so I don't think this should be parameterized.
so that ti only needs to be kept updated in one place.
We could still keep the version in a single place next to the images (we could even keep image names stripped of version and version as separate vars since I imagine generally people just want to update the version)
There was a problem hiding this comment.
The Java container does not have Python installed, it is Java distroless, so it doesn't have much more than a Java runtime and the template files/launcher.
/tmp/dataflow-requirements-cache is guaranteed to exist on python base image because that is where the python wheels are being downloaded to (also documented in https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#supply-requirements-file)
I do get your point about it being tightly fitted, and while I think it would be nice that have a common definition so that they can stay in sync, I think there may be a few more iterations to this anyway, so I'll go ahead and hard code it for now.
...es-maven-plugin/src/main/java/com/google/cloud/teleport/plugin/maven/TemplatesStageMojo.java
Outdated
Show resolved
Hide resolved
f922a0b to
c677b83
Compare
|
@damccorm Fixed relevant changes, and this should also fix the broken YamlTemplate IT's |
22a6feb to
c547013
Compare
c547013 to
9ddcbfa
Compare
plugins/core-plugin/src/main/java/com/google/cloud/teleport/plugin/YamlDockerfileGenerator.java
Outdated
Show resolved
Hide resolved
plugins/core-plugin/src/main/java/com/google/cloud/teleport/plugin/model/ImageSpecMetadata.java
Outdated
Show resolved
Hide resolved
plugins/core-plugin/src/main/resources/Dockerfile-template-yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I think we make too many assumptions about the structure of the base image to allow arbitrary images. For example /tmp/dataflow-requirements-cache is not guaranteed to exist (and won't exist) in most images
There was a problem hiding this comment.
We're also assuming that the java container doesn't have python already installed (may or may not be true). Generally, this dockerfile is fit pretty tightly to the images its building on, so I don't think this should be parameterized.
so that ti only needs to be kept updated in one place.
We could still keep the version in a single place next to the images (we could even keep image names stripped of version and version as separate vars since I imagine generally people just want to update the version)
a405391 to
563a66a
Compare
metadata/src/main/java/com/google/cloud/teleport/metadata/Template.java
Outdated
Show resolved
Hide resolved
plugins/core-plugin/src/main/resources/Dockerfile-template-yaml
Outdated
Show resolved
Hide resolved
plugins/core-plugin/src/main/resources/Dockerfile-template-yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Would be good to have a test with an empty list here as well to test that it does the right thing
Signed-off-by: Jeffrey Kinard <jeff@thekinards.com>
563a66a to
4d893c4
Compare
.../core-plugin/src/test/java/com/google/cloud/teleport/plugin/YamlDockerfileGeneratorTest.java
Outdated
Show resolved
Hide resolved
6607d1a to
b2235e5
Compare
Signed-off-by: Jeffrey Kinard <jeff@thekinards.com>
b2235e5 to
dd5a13f
Compare
Get consistent java-pr signal on main Default to THROUGHPUT_BASED autoscaling for classic templates (GoogleCloudPlatform#1403) * single commit off of master * No-op comment change to kick copybara * Revert "No-op comment change to kick copybara" This reverts commit 928dc6b. * Different no-op to kick copybara and avoid conflicts * One more try to get copybara to merge Support FLOAT32 type in v1 templates of Spanner. (GoogleCloudPlatform#1376) Squashed commit of the following: commit 1e21615 Merge: 5764ef8 27e7486 Author: Aravind Pedapudi <aravindp1510@gmail.com> Date: Thu Apr 4 21:35:22 2024 +0530 Merge branch 'main' into float32 commit 5764ef8 Author: Aravind <aravindp1510@gmail.com> Date: Mon Apr 1 17:40:34 2024 +0530 Support FLOAT32 type in v1 templates of Spanner. Squashed commit of the following: commit a678a75 Merge: f9fe874 c9f1c66 Author: Aravind Pedapudi <aravindp1510@gmail.com> Date: Mon Apr 1 16:08:20 2024 +0530 Merge branch 'main' into float32 commit f9fe874 Merge: e28335c 7449ae6 Author: Aravind Pedapudi <aravindp1510@gmail.com> Date: Thu Mar 28 10:50:24 2024 +0530 Merge branch 'main' into float32 commit e28335c Author: Aravind <aravindp1510@gmail.com> Date: Wed Mar 27 16:33:56 2024 +0530 Fix integration test failures for TextImportPipelineIT commit 8480654 Merge: c6364b7 afd1dc2 Author: Aravind <aravindp1510@gmail.com> Date: Wed Mar 27 16:25:57 2024 +0530 Merge branch 'float32' of github.com:arawind/DataflowTemplates into float32 commit c6364b7 Author: Aravind <aravindp1510@gmail.com> Date: Wed Mar 27 16:22:27 2024 +0530 Add avro integration tests for FLOAT32 type in Spanner. This fixes the ExportPipelineIT and adds support for the ImportPipelineIT. commit afd1dc2 Merge: eca6acf 4536653 Author: Aravind Pedapudi <aravindp1510@gmail.com> Date: Wed Mar 27 11:37:24 2024 +0530 Merge branch 'main' into float32 commit eca6acf Author: Aravind <aravindp1510@gmail.com> Date: Fri Mar 22 17:12:35 2024 +0530 Address review comments Adds new tests in ExportPipelineIT and also removes the INT32 -> FLOAT32 parsing as we don't see any use for it. commit c1cd281 Author: Aravind <aravindp1510@gmail.com> Date: Thu Mar 21 18:15:05 2024 +0530 Fix Beam to handle float32 arrays and pg schema. This was missed in the earlier commits commit 5e936f4 Author: Aravind <aravindp1510@gmail.com> Date: Thu Mar 21 13:32:01 2024 +0530 Address review comments commit f4ab908 Merge: a6aba30 1df7e0a Author: Aravind Pedapudi <aravindp1510@gmail.com> Date: Thu Mar 21 12:56:09 2024 +0530 Merge branch 'GoogleCloudPlatform:main' into float32 commit a6aba30 Merge: 66c488c f44f7ce Author: Aravind Pedapudi <aravindp1510@gmail.com> Date: Wed Mar 20 16:21:11 2024 +0530 Merge branch 'main' into float32 commit 66c488c Author: Aravind <aravindp1510@gmail.com> Date: Wed Mar 20 15:49:31 2024 +0530 Fix gax dependency failures in spanner integration tests. We were seeing this error in the spanner integration tests: `class com.google.cloud.spanner.v1.stub.SpannerStubSettings overrides final method com.google.api.gax.rpc.StubSettings.getEndpoint()Ljava/lang/String;`, which were triggered by the google-cloud-spanner BOM update to 6.61. We're fixing these errors by setting the gax version to the one used by Cloud Spanner client libs. This should be removed once Beam is updated to the latest version of gax (which may happen when Beam's dependency on Spanner is updated). commit 90c1fbc Author: Aravind <aravindp1510@gmail.com> Date: Tue Mar 19 19:17:03 2024 +0530 Support FLOAT32 type in v1 spanner templates commit 51f75f0 Author: Aravind <aravindp1510@gmail.com> Date: Tue Mar 19 19:10:06 2024 +0530 Modify the local copy of Beam to support FLOAT32 type. commit f85fd00 Author: Aravind <aravindp1510@gmail.com> Date: Tue Mar 19 18:53:17 2024 +0530 Update the version of google-cloud-spanner-bom to 6.61.0 Upgrade Beam version to 2.55.1 Specify venv single commit off of master No-op comment change to kick copybara Revert "No-op comment change to kick copybara" This reverts commit 928dc6b. Different no-op to kick copybara and avoid conflicts PR GoogleCloudPlatform#1376: Support FLOAT32 type in v1 templates of Spanner Please approve this CL. It will be submitted automatically, and its GitHub pull request will be marked as merged. Imported from GitHub PR GoogleCloudPlatform#1376 GoogleSQL: * [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_01_22-9618753751624897563;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))) * [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_38_00-8680174928377892705;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))) PG: * [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_47_07-2771163877050158940;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))) * [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-21_05_58_11-1643174307354839989;graphView=0?project=span-cloud-testing&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))) GoogleSQL: * [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_00_17_56-5509502875882896332?e=13802955&mods=monitoring_api_prod&project=span-cloud-testing) * [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_02_50_39-14326646282971458157?e=13802955&mods=monitoring_api_prod&project=span-cloud-testing) PG: * [Export](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_03_17_13-5127588554893435640;graphView=0?project=span-cloud-testing&e=13802955&mods=monitoring_api_prod&pageState=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))) * [Import](https://console.cloud.google.com/dataflow/jobs/us-central1/2024-03-22_03_09_55-9477941764979866789?e=13802955&mods=monitoring_api_prod&project=span-cloud-testing) Copybara import of the project: - 357481bb0fdfba9c91fdde1c24ae4bbcaf5d2bf4 Support FLOAT32 type in v1 templates of Spanner. by Aravind <aravindp1510@gmail.com> COPYBARA_INTEGRATE_REVIEW=GoogleCloudPlatform#1376 from arawind:float32 357481bb0fdfba9c91fdde1c24ae4bbcaf5d2bf4 PiperOrigin-RevId: 622232578 String change to match external PiperOrigin-RevId: 622247573 Add schemaMapperInterface and implementations Add namespace to params spotless Add user-agent string to Elasticsearch templates Rename templateName to userAgent Fix style Capitalize header name Add unit tests Fix style Bugfix for UI issue removing flex templates from drop down menu Use beam version defined in the pom.xml to make sure we stay on the most current version. Also add a comment regarding the hashed file location for context. Implement enum based experiment value provider disable YamlTemplateIT's until GoogleCloudPlatform#1405 is merged Signed-off-by: Jeffrey Kinard <jeff@thekinards.com> Add smoke tests for the no UDF case for Xlang templates, abstract conversion functions, fix bug with flexContainerName. A small simplification in Bulk Reader tests. Allow manual triggering of PR workflow Add python version to setup-env actions Add parentName and parentTriggerValues attributes Test that new attributes are present Remove redundant line
Create new Dockerfile based on Google-provided base template images that serves as a Distroless Xlang image for YAML templates.