GH-39707: [Java] Enable local build cache for Maven/Java build #39708
lidavidm merged 7 commits into apache:main
Thank you @clayburn! This is an exciting improvement. For local build execution, would we have to change anything in our developer docs [1] on how to build Arrow Java in order to take advantage of this? [1] https://arrow.apache.org/docs/developers/java/building.html#building-java-modules |
|
@clayburn I have approved the CI runs. I assume that only builds on main will be able to upload data, due to the use of secrets? For similar things I have used a two-step process in the past, where the unprivileged workflow creates the data/artifacts and a second, privileged workflow that doesn't run any user code uploads them. Is this something Gradle could support? Or maybe data for each PR build would be a bit much ^^ |
@danepitkin - For caching, there is no action needed. The one stipulation that should be noted is that the local cache is machine local, so with some of your containerized builds, it would be local to the container itself without any special handling. The local cache by default is located at
For build scans, you won't see any changes unless you explicitly authenticate to https://ge.apache.org. Instructions to do so are here. All ASF committers can authenticate to ge.apache.org with LDAP credentials. Contributors cannot authenticate, so they won't see anything regarding build scans. The access key is either stored in
Build scans will only be produced for builds that run from the
The caching functionality will work without the secret, although the benefits may not be apparent in your CI builds since they are presumably running on ephemeral agents and the cache is machine local. This is where the remote cache will help more, which we hope to enable in the future. |
Can it be directed to another directory using an environment variable? In C++ we direct ccache data to a custom directory which is then persisted as Docker volumes. This is then used together with GHA caching to shorten CI times. |
I think this is plausible. I'll give it a shot and update back here. |
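A minimal sketch of how such a redirect might look, not what this PR implements. It assumes the Maven extension accepts a `gradle.cache.local.directory` system property for the local cache location (worth confirming against the extension documentation for the version in use). The variable `ARROW_JAVA_BUILD_CACHE_DIR` and the helper `cache_dir_flag` are hypothetical names chosen for illustration. The script only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: ARROW_JAVA_BUILD_CACHE_DIR and cache_dir_flag are hypothetical names.
# Assumption: the extension honours -Dgradle.cache.local.directory=<path>.

# Print the extra Maven flag when the variable is set; print nothing otherwise.
cache_dir_flag() {
  if [ -n "${ARROW_JAVA_BUILD_CACHE_DIR:-}" ]; then
    printf '%s\n' "-Dgradle.cache.local.directory=${ARROW_JAVA_BUILD_CACHE_DIR}"
  fi
}

# Show the command that would run (echo instead of invoking Maven).
# "install" is an illustrative goal, not taken from the discussion.
echo mvn clean install $(cache_dir_flag)
```

Pointing the variable at a directory that is mounted as a Docker volume would then let the cache outlive the container, in the same spirit as the ccache setup described above.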
|
@pitrou - Looking through the workflow, you are already caching the right directory. It's all of
I've pushed two changes to further optimize this:
|
.mvn/gradle-enterprise.xml
Outdated
ci/scripts/java_build.sh
Outdated
I suppose build caching means that the 'clean' here isn't actually a clean build?
Adding clean simply runs the clean lifecycle before the other requested lifecycles. We need this to safely write cache entries, so that we know which goal invocation produced particular files.
So this will clean the workspace, but it may restore cache entries to the workspace during execution rather than running the goals. It just depends on your definition of "clean" here. -DrerunGoals can be used to explicitly rerun goals in cases where that is desired.
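To make the two behaviours concrete, here is a small sketch. `maven_cmd` is a hypothetical helper that only prints a command line, and `install` is an illustrative goal; the `-DrerunGoals` flag is the one mentioned above.

```shell
#!/bin/sh
# Sketch only: maven_cmd is a hypothetical helper that prints a command line.

# No argument: clean first, then let the build cache restore outputs for goals
# whose inputs are unchanged.
# "rerun": add -DrerunGoals so goals are executed instead of being restored.
maven_cmd() {
  if [ "${1:-}" = "rerun" ]; then
    echo "mvn clean install -DrerunGoals"
  else
    echo "mvn clean install"
  fi
}

maven_cmd        # cached outputs may be restored after the clean
maven_cmd rerun  # every goal runs, regardless of cache contents
```

The second form is the one to reach for when a build must not reuse any cached output, for example when checking that the build itself still works end to end.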
|
@github-actions crossbow submit java |
|
Revision: aa0ffadb0e87ad24b464b6d43cbb6ac1028ecb2f Submitted crossbow builds: ursacomputing/crossbow @ actions-f432146f72 |
@clayburn If you could help me with doubts regarding local development, I would appreciate it:
I would like to identify/map what the new advantages will be for developers on their local machines. Thank you in advance for your support. |
danepitkin left a comment
Thanks for the contribution!
|
@lidavidm Done ✅ |
@davisusanibar - So sorry for losing track of this:
|
|
@github-actions crossbow submit -g java |
|
Revision: 988562f Submitted crossbow builds: ursacomputing/crossbow @ actions-be6617b977 |
.github/workflows/java.yml
Outdated
  ARCHERY_DOCKER_PASSWORD: ${{ secrets.DOCKERHUB_TOKEN }}
- run: archery docker run ${{ matrix.image }}
+ GRADLE_ENTERPRISE_ACCESS_KEY: ${{ secrets.GE_ACCESS_TOKEN }}
+ run: archery docker run ${{ matrix.image }} -e "GRADLE_ENTERPRISE_ACCESS_KEY=$GRADLE_ENTERPRISE_ACCESS_KEY" -e CI=true
Could you use archery docker run -e ... ${{ matrix.image }} style like we did in other places?
- run: archery docker run ${{ matrix.image }} -e "GRADLE_ENTERPRISE_ACCESS_KEY=$GRADLE_ENTERPRISE_ACCESS_KEY" -e CI=true
+ run: |
+   archery docker run \
+     -e CI=true \
+     -e "GRADLE_ENTERPRISE_ACCESS_KEY=$GRADLE_ENTERPRISE_ACCESS_KEY" \
+     ${{ matrix.image }}
.github/workflows/java_jni.yml
Outdated
  ARCHERY_DOCKER_PASSWORD: ${{ secrets.DOCKERHUB_TOKEN }}
- run: archery docker run conda-python-java-integration
+ GRADLE_ENTERPRISE_ACCESS_KEY: ${{ secrets.GE_ACCESS_TOKEN }}
+ run: archery docker run conda-python-java-integration -e "GRADLE_ENTERPRISE_ACCESS_KEY=$GRADLE_ENTERPRISE_ACCESS_KEY" -e CI=true
|
@danepitkin @vibhatha A potential CI improvement might be to have a post-run step that captures and uploads dump files like these so we can get to the bottom of this instability. |
|
Verification failure is unrelated. However, another improvement might be to run |
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 6e54b7b. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
Hmm, I might've found the culprit. DEBUG messages are now enabled in the build, causing a huge amount of logging. |
|
Do we want to revert, or do you think disabling the log is good enough? |
|
I think disabling DEBUG logs should be good enough. I'll try it out. |
…pache#39708)

### Rationale for this change

This change has two main benefits:

#### Enabling local build caching

Enabling local build cache can speed up builds that occur on the same machine by skipping the execution of certain deterministic goals, such as Java compilation, when no change has occurred.

In the future, [ge.apache.org](https://ge.apache.org) will support remote build caching as well so that results can be shared between builds (e.g. ephemeral CI build agents).

#### Enabling build scans

This change enables the publishing of build scans of the Apache Arrow project to the Develocity instance at [ge.apache.org](https://ge.apache.org/), hosted by the Apache Software Foundation and run in partnership between the ASF and Gradle. This Develocity instance has all features and extensions enabled and is freely available for use by the Apache Arrow project and all other Apache projects. Currently, Maven-built projects such as [Pulsar](https://ge.apache.org/scans?search.buildToolType=maven&search.rootProjectNames=*Pulsar*&search.timeZoneId=America%2FChicago), [IoTDB](https://ge.apache.org/scans?search.buildToolType=maven&search.rootProjectNames=*iotdb*&search.timeZoneId=America%2FChicago), [Ozone](https://ge.apache.org/scans?search.buildToolType=maven&search.rootProjectNames=*ozone*&search.timeZoneId=America%2FChicago), and others are using this instance.

On this Develocity instance, Apache Arrow will have access not only to all of the published build scans but other aggregate data features such as:

- Dashboards to view all historical build scans, along with performance trends over time
- Build failure analytics for enhanced investigation and diagnosis of build failures
- Test failure analytics to better understand trends and causes around slow, failing, and flaky tests

### What changes are included in this PR?

- Adds the [Develocity Maven Extension](https://docs.gradle.com/enterprise/maven-extension/) to enable local caching
- Publishes build scans to [ge.apache.org](https://ge.apache.org) from CI builds and authenticated local builds
- Adds the [Common Custom User Data Maven Extension](https://github.com/gradle/common-custom-user-data-maven-extension) to enhance build scans with more metadata

### Are these changes tested?

No (these changes are for the build and do not affect the code)

### Are there any user-facing changes?

No

* Closes: apache#39707

Authored-by: Clay Johnson <cjohnson@gradle.com>
Signed-off-by: David Li <li.davidm96@gmail.com>