Avoid extremely slow kokoro job finalization by moving workspace to a non-synced directory on kokoro workers.#28259

Merged
jtattermusch merged 8 commits into grpc:master from jtattermusch:altsrc_respawn on Dec 3, 2021

@jtattermusch
Contributor

@jtattermusch jtattermusch commented Dec 2, 2021

A workaround for b/74837748 (and b/143186860)

Background: once a kokoro job finishes, the content of the entire workspace (/tmpfs/src) is rsynced back to the kokoro agent (from where the artifacts matching the pattern are uploaded). The issue is that after the build finishes, the grpc workspace can be very large (tens of gigabytes), because 1. it has all the intermediate files produced during the build, 2. it has all the cloned submodules, and 3. many jobs perform multiple builds, each in a separate copy of the grpc repository, so 1 and 2 are included multiple times. In addition, the rsync from windows and macos workers can be quite slow.
As a result, a lot of time can be spent at the end of the job simply rsyncing the entire workspace (of which only a few files matching the pattern are used as artifacts, so 99.9% of the data is copied for no reason).
In the past, we tried to work around this problem by invoking delete_nonartifacts.sh at the end of selected jobs, which helps a little, but has a number of other problems:

  • even running delete_nonartifacts.sh can be quite slow (e.g. we recently discovered that for windows grpc_build_artifacts it takes tens of minutes :-( )
  • if the job times out and gets cancelled by kokoro, delete_nonartifacts.sh may clash with the rsync started by kokoro, which results in thousands of "file has vanished" warnings flooding the build log (and making the log very hard to read) - see b/143186860
  • it is easy to forget to invoke delete_nonartifacts.sh in the CI jobs, so this approach is hard to maintain. Also, when delete_nonartifacts.sh gets added to the end of the CI script, one must take care to correctly propagate the exit code of the actual test that ran previously (so this approach is error prone).
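To illustrate the exit-code pitfall from the last bullet, here is a minimal sketch of running a cleanup step after a test while still reporting the test's status. `run_then_cleanup` is a hypothetical helper written for this illustration; it is not a script from the grpc repository.

```shell
#!/usr/bin/env bash
# run_then_cleanup TEST_CMD CLEANUP_CMD
# Hypothetical helper (not from the grpc repo): runs the test command,
# then the cleanup command, and returns the exit code of the test.
run_then_cleanup() {
  local test_cmd="$1"
  local cleanup_cmd="$2"
  local test_status=0

  # Record the test's status instead of letting `set -e` abort the script
  # before the cleanup has a chance to run.
  bash -c "$test_cmd" || test_status=$?

  # A cleanup failure is ignored on purpose so it cannot mask the test result.
  bash -c "$cleanup_cmd" || true

  return "$test_status"
}
```

A CI script using this pattern would end with something like `run_then_cleanup "<test command>" "<cleanup command>"`; forgetting the status bookkeeping is exactly what makes a failing test look green.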

So this PR is taking a different approach.

  • at the very beginning of the CI script run, simply move the entire workspace (/tmpfs/src is used by kokoro) to a different location (I chose /tmpfs/altsrc) and restart the CI script there. Only /tmpfs/src is rsynced by kokoro, so any intermediate files produced under /tmpfs/altsrc won't cause any finalization overhead
  • care must be taken to make sure that all the files we actually want to keep (the artifacts produced by the build and the test reports that will be later displayed in the UI) are actually stored under the original location (/tmpfs/src), otherwise they won't be stored by kokoro.
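The move-and-restart idea can be sketched roughly as follows. This is an illustration under assumptions, not the script added by this PR: the function name `move_and_respawn` and the variables `ALTSRC_RESPAWNED` and `ORIGINAL_SRC_DIR` are hypothetical, and the directories are passed in as arguments rather than hard-coded.

```shell
#!/usr/bin/env bash
# move_and_respawn SRC_DIR ALT_DIR SCRIPT_RELPATH [ARGS...]
# Hypothetical sketch: moves the contents of SRC_DIR to ALT_DIR and
# re-executes the CI script (given relative to SRC_DIR) from the new location.
move_and_respawn() {
  local src_dir="$1" alt_dir="$2" script_rel="$3"
  shift 3

  # Guard variable (hypothetical name): the respawned script calls this
  # function again, and must not move and respawn a second time.
  if [ -n "${ALTSRC_RESPAWNED:-}" ]; then
    return 0
  fi

  mkdir -p "$alt_dir"
  # Move every top-level entry, including dotfiles, out of the synced directory.
  find "$src_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$alt_dir/" \;

  # Remember the original location so that reports and artifacts can still
  # be written there, then restart the script from the non-synced copy.
  export ALTSRC_RESPAWNED=1
  export ORIGINAL_SRC_DIR="$src_dir"
  cd "$alt_dir" || return 1
  exec bash "$script_rel" "$@"
}
```

In this sketch a CI script would call the function as its first step, for example with /tmpfs/src and /tmpfs/altsrc as the first two arguments. On the first run the call never returns because of `exec`; in the respawned run the guard makes it a no-op and the rest of the script continues from the new location.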

Most of the logic in this PR is:

  • a move_src_and_respawn script (for unix and windows) that can be easily started at the beginning of any CI script and it takes care of the dirty job of switching the current workspace.
  • making sure the test reports and artifacts end up in the right spot (note that the approach I chose makes sure that the reports end up in the right spot even if the test script crashes abruptly, and there are no "copy afterward" steps for test reports).
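One way to get reports into the synced location without a "copy afterward" step is to write them there directly. The sketch below assumes a hypothetical environment variable `REPORT_BASE_DIR` that points at the original workspace; the actual mechanism used by this PR may differ.

```shell
#!/usr/bin/env bash
# report_path RELATIVE_NAME
# Hypothetical helper: prints the path a test report should be written to.
# REPORT_BASE_DIR is an assumed variable pointing at the original (synced)
# workspace; when unset, reports stay under the current directory.
report_path() {
  local base="${REPORT_BASE_DIR:-$PWD}"
  local path="$base/reports/$1"
  # Create the target directory up front so the test can write immediately.
  mkdir -p "$(dirname "$path")"
  printf '%s\n' "$path"
}
```

Because each report is written straight to its final path, whatever was produced before an abrupt crash or a timeout is already where kokoro will look for it.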

This PR switches all the macos and windows CI jobs to use the "altsrc", since kokoro's rsync is much slower on macos and windows.
On linux I'm only switching some selected jobs to "altsrc", since 1. rsync is quite fast on linux and 2. most linux builds run under a docker container, which doesn't pollute the workspace with intermediate files (since the intermediate files are inside a container that is thrown away once the build finishes), so the workspace size on linux is generally smaller than on macos and windows.

@jtattermusch jtattermusch marked this pull request as ready for review December 3, 2021 09:03
@jtattermusch jtattermusch changed the title from "DO NOT MERGE: Altsrc respawn" to "Avoid extremely slow kokoro job finalization by moving workspace to a non-synced directory on kokoro workers." Dec 3, 2021
@jtattermusch
Contributor Author

Some evidence that the delete_nonartifacts.sh approach isn't working out well for large workspaces:

@lidizheng and I discovered just recently that the windows grpc_build_artifacts job actually spends almost an hour simply cleaning up the gigantic workspace it has produced:

T:\src\github\grpc>bash tools/internal_ci/helper_scripts/delete_nonartifacts.sh

++ dirname tools/internal_ci/helper_scripts/delete_nonartifacts.sh
+ cd tools/internal_ci/helper_scripts/../../..
+ find . -type f -not -iname '*sponge_log.*' -not -path './reports/*' -not -path './artifacts/*' -not -path './tools/internal_ci/*' -exec rm -f '{}' +

real	57m37.962s
user	0m30.797s
sys	3m9.490s

https://source.cloud.google.com/results/invocations/126715f4-fef8-43fc-9eb8-c0f48b3b03b5/log

@jtattermusch jtattermusch added the release notes: no Indicates if PR should not be in release notes label Dec 3, 2021
@jtattermusch
Contributor Author

Results:

  • I checked manually that the test xml reports are being displayed fine (I checked the UI for multiple jobs)
  • I checked manually that artifacts for build_artifact jobs are being stored (for linux and mac). TODO: storing artifact on windows
  • windows/grpc_build_artifacts job duration dropped from ~120min to 75min.

The Python Windows failure: https://source.cloud.google.com/results/invocations/aee90479-5069-49cc-bfb5-8363dc4c09dd is a timeout and probably unrelated (I checked the log and everything looks normal). The good news is that I'm not seeing the "file has vanished" messages flooding the log for this timed-out job.

Contributor

@veblush veblush left a comment

Awesome! I was about to question why windows build took ~20 minutes deleting itself but you already improved it.

Contributor

@lidizheng lidizheng left a comment

LGTM. Small comments. Thanks for doing this!

This PR might have some rollback risk, so I would recommend merging #28228 first.

@jtattermusch jtattermusch merged commit 3a024ea into grpc:master Dec 3, 2021
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Dec 3, 2021

Labels

bloat/none imported Specifies if the PR has been imported to the internal repository perf-change/none release notes: no Indicates if PR should not be in release notes
