Avoid extremely slow kokoro job finalization by moving workspace to a non-synced directory on kokoro workers. #28259
Some evidence: @lidizheng and I recently discovered that the windows grpc_build_artifacts job actually spends almost an hour https://source.cloud.google.com/results/invocations/126715f4-fef8-43fc-9eb8-c0f48b3b03b5/log
Results:
The Python Windows failure (https://source.cloud.google.com/results/invocations/aee90479-5069-49cc-bfb5-8363dc4c09dd) is a timeout and probably unrelated (I checked the log and everything looks normal). The good news is that I'm not seeing the "file has vanished" messages flooding the log for this timed-out job.
Adhoc distribtest run: https://fusion2.corp.google.com/invocations/ee99a99d-50ce-40bf-a8f0-821ed9470746/targets
veblush left a comment:
Awesome! I was about to question why windows build took ~20 minutes deleting itself but you already improved it.
A workaround for b/74837748 (and b/143186860)
Background: once a kokoro job finishes, the content of the entire workspace (/tmpfs/src) is rsynced back to the kokoro agent (from where the artifacts matching the pattern are uploaded). The issue is that after finishing the build, the grpc workspace can be very large (tens of gigabytes), because (1) it contains all the intermediate files produced during the build, (2) it contains all the cloned submodules, and (3) many jobs perform multiple builds, each in a separate copy of the grpc repository, so (1) and (2) are included multiple times. In addition, the rsync from windows and macos workers can be quite slow.
As a result, a lot of time can be spent at the end of the job, simply rsyncing the entire workspace (out of which only a few files matching the pattern will be used as artifacts and 99.9% of data is copied for no reason).
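To get a feel for how much of a workspace is actually needed as artifacts, something like the following could be run at the end of a build. This is only a rough sketch in Python: the function name `workspace_usage` and the `artifacts/*` pattern are hypothetical and are not taken from the grpc scripts or from kokoro's configuration.

```python
import fnmatch
import os
import tempfile


def workspace_usage(root, artifact_patterns):
    """Return (total_bytes, artifact_bytes) for regular files under root.

    Patterns are matched against '/'-separated paths relative to root.
    Note that fnmatch's '*' also matches '/', so 'artifacts/*' covers
    nested files as well.
    """
    total = 0
    artifact_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so nothing is counted twice
            size = os.path.getsize(path)
            total += size
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, p) for p in artifact_patterns):
                artifact_bytes += size
    return total, artifact_bytes


if __name__ == "__main__":
    # Tiny self-contained demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "artifacts"))
        os.makedirs(os.path.join(root, "build"))
        with open(os.path.join(root, "artifacts", "pkg.zip"), "wb") as f:
            f.write(b"a" * 10)
        with open(os.path.join(root, "build", "obj.o"), "wb") as f:
            f.write(b"b" * 990)
        total, kept = workspace_usage(root, ["artifacts/*"])
        print(f"{kept} of {total} bytes are artifacts")
```

Pointing this at a real workspace would show what fraction of the rsynced data is ever uploaded.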
In the past, we tried to work around this problem by invoking delete_nonartifacts.sh at the end of selected jobs, which helps a little bit, but has a number of other problems.
So this PR is taking a different approach.
- Move the workspace (/tmpfs/src is used by kokoro) to a different location (I chose /tmpfs/altsrc) and restart the CI script there. Only /tmpfs/src is rsynced by kokoro, so any intermediate files produced under /tmpfs/altsrc won't cause any finalization overhead.
- The artifacts need to be copied back to the original workspace (/tmpfs/src), otherwise they won't be stored by kokoro.

Most of the logic of this script is:
This PR switches all the macos and windows CI jobs to use the "altsrc", since kokoro's rsync is much slower on macos and windows.
On linux I'm only switching some selected jobs to "altsrc", since (1) rsync is quite fast on linux and (2) most linux builds run under a docker container, which doesn't pollute the workspace with intermediate files (the intermediate files are inside a container that is thrown away once the build finishes), so the workspace size on linux is generally smaller than on macos and windows.
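For illustration, the "altsrc" idea described above (run the build in a directory that is not rsynced, then copy only the artifacts back) could be sketched roughly as follows. This is not the script from this PR: it is written in Python rather than shell, the function name `run_in_alt_workspace` and the artifact patterns are hypothetical, and it copies the workspace instead of moving it.

```python
import fnmatch
import os
import shutil
import subprocess
import sys
import tempfile


def run_in_alt_workspace(src, altsrc, command, artifact_patterns):
    """Run command in a copy of src located at altsrc, then copy matching
    artifacts back into src. Returns (returncode, sorted copied paths)."""
    # Assumption: altsrc is disposable and lives outside src.
    if os.path.exists(altsrc):
        shutil.rmtree(altsrc)
    shutil.copytree(src, altsrc, symlinks=True)

    # Intermediate files produced by the build stay under altsrc.
    result = subprocess.run(command, cwd=altsrc)

    # Copy artifacts back even when the build failed, so logs and reports
    # that match the patterns are still available in the synced workspace.
    copied = []
    for dirpath, _dirnames, filenames in os.walk(altsrc):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, altsrc).replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, p) for p in artifact_patterns):
                dest = os.path.join(src, rel)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.copy2(path, dest)
                copied.append(rel)
    return result.returncode, sorted(copied)


if __name__ == "__main__":
    # Demo with throwaway directories standing in for /tmpfs/src and
    # /tmpfs/altsrc; the "build" is a one-line Python command.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.makedirs(src)
        build = [sys.executable, "-c",
                 "import os; os.makedirs('artifacts'); "
                 "open('artifacts/out.txt', 'w').write('ok'); "
                 "open('junk.o', 'w').write('x')"]
        code, copied = run_in_alt_workspace(
            src, os.path.join(tmp, "altsrc"), build, ["artifacts/*"])
        print(code, copied)
```

The point of the design is that the synced directory only ever receives the files matching the artifact patterns, so the final rsync stays small no matter how much the build writes under the alternative location.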