Skip to content

[Carry 39595] Tailor CI for ARM, skip legacy integration test.#39678

Merged
tiborvass merged 9 commits intomoby:masterfrom
thaJeztah:carry_39595_tailor_arm_ci
Oct 10, 2019
Merged

[Carry 39595] Tailor CI for ARM, skip legacy integration test.#39678
tiborvass merged 9 commits intomoby:masterfrom
thaJeztah:carry_39595_tailor_arm_ci

Conversation

@thaJeztah
Copy link
Copy Markdown
Member

@thaJeztah thaJeztah commented Aug 6, 2019

carry of #39595 Tailor CI for ARM, skip legacy integration test
closes #39595 Tailor CI for ARM, skip legacy integration test
fixes #39479 Jenkins CI for arm64

- What I did
This PR is trying to make a minimal test set (without legacy integration test cases) for ARM.
Introduce a new environment variable SKIP_LEGACY_INTEGRATION_TEST to control whether legacy integration test cases are ignored or not.

- How I did it

  • Introduce a new environment variable SKIP_LEGACY_INTEGRATION_TEST, default value is "no".
  • If it's set to "yes" or anything rather than "no", legacy integration suites will be skipped.
  • In arm CI, SKIP_LEGACY_INTEGRATION_TEST is set to "yes".
  • CI for other architecture's are not impacted.

- How to verify it
Run "make test-integration" on an ARM machine.

- Description for the changelog

See the discussion in issue: #39479
Backgroud: CI on arm (32 and 64 bit) architecture took too long time to finish. So the CI for it was stopped. Now I am trying to bring it back, and try to set a minimal set in it to save time.

- A picture of a cute animal (not mandatory but encouraged)

@thaJeztah
Copy link
Copy Markdown
Member Author

Whoop! That worked;

Screenshot 2019-08-06 at 15 42 24

Screenshot 2019-08-06 at 15 42 37

Screenshot 2019-08-06 at 15 42 31

@thaJeztah
Copy link
Copy Markdown
Member Author

thaJeztah commented Aug 6, 2019

Looks like performance on that machine isn't terrible (see it has lots of ram as well); let me try how long running everything takes (just for fun)

From the docker info output;

Kernel Version: 4.10.0-38-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: aarch64
CPUs: 96
Total Memory: 125.9GiB

We should probably decide if it should run on every PR (if we have a limited number of machines on ARM currently, might want to only run it on master by default)

AWS machines:

Kernel Version: 4.15.0-1044-aws
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: aarch64
CPUs: 4
Total Memory: 7.524GiB

@thaJeztah
Copy link
Copy Markdown
Member Author

I'm still a bit on the fence if we should name this arm64 instead of aarch64 (arm64 might be slightly clearer)

@michael2012z
Copy link
Copy Markdown

I vote for arm64, it sounds more understandable than aarch64.

@thaJeztah thaJeztah changed the title Tailor CI for ARM, skip legacy integration test. [Carry 39595] Tailor CI for ARM, skip legacy integration test. Aug 6, 2019
@thaJeztah
Copy link
Copy Markdown
Member Author

I vote for arm64, it sounds more understandable than aarch64.

Agreed; I think I'll change it (I'll let this one running for now, but can push an updated version after it has completed)

@thaJeztah
Copy link
Copy Markdown
Member Author

PowerPC here failed on https://ci.docker.com/public/job/moby/job/PR-39678/6/execution/node/208/log/

Which is tracked as a flaky test; #23516

[2019-08-06T15:03:11.839Z] ----------------------------------------------------------------------
[2019-08-06T15:03:11.839Z] FAIL: docker_api_swarm_node_test.go:75: DockerSwarmSuite.TestAPISwarmNodeDrainPause

I posted details in #23516 (comment)

@thaJeztah
Copy link
Copy Markdown
Member Author

OK, so running everything is indeed a bit too long

Screenshot 2019-08-06 at 20 04 17

@thaJeztah thaJeztah force-pushed the carry_39595_tailor_arm_ci branch from 571fec7 to 13f3032 Compare August 6, 2019 21:52
@thaJeztah
Copy link
Copy Markdown
Member Author

No idea what this Janky failure was; https://ci.docker.com/public/blue/organizations/jenkins/moby/detail/PR-39678/7/pipeline


[2019-08-06T22:30:11.181Z] ----------------------------------------------------------------------
[2019-08-06T22:30:11.181Z] FAIL: docker_cli_run_test.go:2459: DockerSuite.TestRunModeUTSHost
[2019-08-06T22:30:11.181Z] 
[2019-08-06T22:30:11.181Z] assertion failed: 
[2019-08-06T22:30:11.181Z] Command:  /usr/local/cli/docker run busybox readlink /proc/self/ns/uts
[2019-08-06T22:30:11.181Z] ExitCode: 125
[2019-08-06T22:30:11.181Z] Error:    exit status 125
[2019-08-06T22:30:11.181Z] Stdout:   
[2019-08-06T22:30:11.181Z] Stderr:   /usr/local/cli/docker: Error response from daemon: ttrpc: client shutting down: ttrpc: closed: unknown.
[2019-08-06T22:30:11.181Z] time="2019-08-06T22:30:10Z" level=error msg="error waiting for container: context canceled" 
[2019-08-06T22:30:11.181Z] 
[2019-08-06T22:30:11.181Z] 
[2019-08-06T22:30:11.181Z] Failures:
[2019-08-06T22:30:11.181Z] ExitCode was 125 expected 0
[2019-08-06T22:30:11.181Z] Expected no error

@thaJeztah thaJeztah force-pushed the carry_39595_tailor_arm_ci branch from 13f3032 to bf4feab Compare August 7, 2019 21:21
@thaJeztah
Copy link
Copy Markdown
Member Author

Removed the restriction to run on packet machines only; now will pick whatever aarch64 worker is available (packet, or auto-scaling
a1.xlarge arm64 machines on AWS)

@michael2012z
Copy link
Copy Markdown

I can't manage to reproduce the failure in Janky:
[2019-08-07T22:32:20.683Z] FAIL: docker_cli_swarm_test.go:1333: DockerSwarmSuite.TestSwarmClusterRotateUnlockKey

Similar failure was seen long ago in #34384 and #36696.

@thaJeztah
Copy link
Copy Markdown
Member Author

Yes, that test is still known to be flaky (TestSwarmClusterRotateUnlockKey)

@michael2012z
Copy link
Copy Markdown

Great, now everything looks good.

@thaJeztah
Copy link
Copy Markdown
Member Author

@psftw @andrewhsu PTAL

@michael2012z
Copy link
Copy Markdown

Hi, all
It's been silent for days on this PR. :)
Is it going to be merged?

@thaJeztah
Copy link
Copy Markdown
Member Author

I'll do one more rebase (and I think there were some changes in the other stages/platforms I'll have to sync)

I'll check for people to review this 👍

@thaJeztah thaJeztah force-pushed the carry_39595_tailor_arm_ci branch from 4ecc247 to 4be066f Compare September 21, 2019 14:30
@thaJeztah
Copy link
Copy Markdown
Member Author

rebased

@michael2012z
Copy link
Copy Markdown

Hi, @thaJeztah and other maintainers,
Regarding the test case TestTemplatedConfig failure, it passed with the early commits made on Aug/6, it was also OK with the rebase on Aug/8. It began to appear stably since the rebase on Aug/15. I am thinking if it was triggered by some PR merged during Aug/8 and Aug/15.
Maybe identifying the commit that triggered the problem (by "binary-searching" all the PRs during the 7 days) is an easier way to move forward.
I am willing to fix the problem, but it was not reproducible in my local tests. May I get some temporary permission to apply Jenkinsfile in my PR, then I can help?

michael2012z and others added 9 commits October 8, 2019 18:37
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Also switch aarch64 to use overlay2

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Pick whatever is available; packet worker, or auto-scaling
a1.xlarge arm64 machines on AWS

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah thaJeztah force-pushed the carry_39595_tailor_arm_ci branch from 4be066f to eda98ad Compare October 8, 2019 16:49
@michael2012z
Copy link
Copy Markdown

:), interesting, all test cases passed in this run.
Thank you for rebasing the code.

@thaJeztah
Copy link
Copy Markdown
Member Author

oh! I see I forgot to leave a comment (sorry!)

interesting; all I did was indeed rebase this PR (and sync some minor changes that were added in other stages). So either something was fixed in our codebase, or the arm machines were updated, and it was an issues with the machines themselves 🤔

Either way; happy to see this green. Let's see if I can get people to review this

@tiborvass @andrewhsu @kolyshkin PTAL

@thaJeztah
Copy link
Copy Markdown
Member Author

thaJeztah commented Oct 9, 2019

One thing I notice is that these machines are running Ubuntu 16.04 (other stages are running 18.04); something we may want to update in a follow up (I would need to check if there's 18.04 arm64 machines)

Copy link
Copy Markdown
Contributor

@andrewhsu andrewhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tiborvass tiborvass merged commit d31633b into moby:master Oct 10, 2019
@thaJeztah thaJeztah deleted the carry_39595_tailor_arm_ci branch October 10, 2019 18:22
@thaJeztah
Copy link
Copy Markdown
Member Author

Whoop, merged! Thank you for getting the ball rolling on this one, and for your patience @michael2012z

@michael2012z
Copy link
Copy Markdown

Thank you @thaJeztah for carrying my original PR, making a lot of improvement, and patient rebasing again and again.
And thanks to all other maintainers for reviewing and merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Jenkins CI for arm64

9 participants