This repository was archived by the owner on Jun 28, 2024. It is now read-only.

clean-up: clean stale lock files on bare-metal CI node #1720

Closed
Pennyzct wants to merge 1 commit into kata-containers:master from Pennyzct:lock_file

Conversation

@Pennyzct
Contributor

Hi~ guys.
I have encountered "dpkg frontend is locked by another process" a few times on the ARM CI node.
Sometimes, due to an unsteady network connection, apt/apt-get commands are terminated improperly. This leaves intermediate lock files behind, which can lead to an "Unable to lock" error on the next run.
So we need to clean stale lock files on the bare-metal CI node before each run.

Sometimes, due to unsteady network connection, apt/apt-get commands
have been terminated improperly. And it leaves intermediate lock
files, which may lead to 'Unable to lock' error in next run.
So you need to clean stale lock files on bare-metal CI node before
each run

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct
Contributor Author

/test

Contributor

@grahamwhaley left a comment


Hi @Pennyzct Just to check, are you sure it is network timeouts that are causing the issue?
I would have thought a timed-out apt etc. would probably clean up after itself quite well?
In the past we have seen package manager lock contention - but it was from the system-wide background package update polling code (auto updaters) - so we went and turned them off, and have not seen the issue since. For an example see:
https://github.com/kata-containers/ci/blob/master/deployment/packet/install_packet.yaml#L107-L125


info "Clean stale lock files"
# stale lock files are created when processes are not terminated properly
stale_lock_file_union=( "/var/lib/dpkg/lock" )
Contributor


It's slightly odd having this global var set here, but used in the other func. I'd probably either define it in the other func, or pass it through as an argument?
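The suggestion above could look something like this - a minimal sketch (the argument handling and demo paths are illustrative, not the PR's actual code) that passes the lock-file list into the function instead of relying on a global:

```shell
#!/bin/bash
# Sketch of the reviewer's suggestion: pass the lock-file list into the
# cleanup function as arguments rather than reading a global variable.
delete_stale_lock_files() {
    local lock_file
    for lock_file in "$@"; do
        # A real CI script should first verify no live process holds the
        # file (e.g. with fuser) before removing it.
        [ -e "$lock_file" ] && rm -f "$lock_file"
    done
    return 0
}

# Demo against a throwaway directory rather than /var/lib/dpkg.
tmpdir=$(mktemp -d)
touch "$tmpdir/lock" "$tmpdir/lock-frontend"
delete_stale_lock_files "$tmpdir/lock" "$tmpdir/lock-frontend"
rm -r "$tmpdir"
```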

stale_lock_file_union=( "/var/lib/dpkg/lock" )
delete_stale_lock_files
# reconfigure the packages
sudo dpkg --configure -a
Contributor


This is in a generic (non-distro-specific?) cleanup func, I think - should it be a distro-specific call?

@Pennyzct
Contributor Author

Pennyzct commented Jun 20, 2019

Hi~ @grahamwhaley
It was only my guess, I'm sorry. 😰
This error is pretty rare. It always comes with a hanging apt-get process.
And the error does not carry the same info every time. I have found a wiki about classifying and fixing those issues; you may want to have a look.
I'm not sure if it is the same thing you talked about. Totally confused.

@grahamwhaley
Contributor

@Pennyzct - probably before we start killing processes and nuking lock files then, we should diagnose what the actual problem is, and then work out if the solution is to, say, disable the auto updaters etc.

I would start by maybe adding some diagnostics in the error case, so try to capture the apt fail (in a TRAP function maybe), and then add something like

$ ps -ef | fgrep apt
$ sudo lsof /var/lib/apt/lists/
$ sudo lsof /var/lib/dpkg/
$ sudo lsof /var/cache/apt/archives/

I do suspect maybe your server has the auto update services running in the background, and occasionally they will be active when you try to process a PR, and they clash on the lock file. We have seen that before (both on the bare metal metrics machines and on the cloud instances iirc) - so, if you have not already deliberately disabled the auto updaters - then do that first...
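A rough sketch of that capture (the function name and the trap wiring are my own; the ps/lsof commands are the ones listed above):

```shell
#!/bin/bash
# Hypothetical sketch: on any apt failure, dump which processes are
# holding the apt/dpkg lock directories before the script exits.
dump_apt_lock_holders() {
    echo "apt command failed; capturing lock diagnostics:"
    ps -ef | grep '[a]pt' || true
    # sudo -n: fail rather than prompt if run non-interactively.
    sudo -n lsof /var/lib/apt/lists/ 2>/dev/null || true
    sudo -n lsof /var/lib/dpkg/ 2>/dev/null || true
    sudo -n lsof /var/cache/apt/archives/ 2>/dev/null || true
}

# Usage in the CI script:
#   trap dump_apt_lock_holders ERR
#   set -e
#   sudo apt-get update
```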

@Pennyzct
Contributor Author

Hi~ @grahamwhaley thanks three thousand times!!!!! 😍 It is the auto updates and auto upgrades running in the background, clashing with the manual apt runs.

root@arm-testing-1:~# systemctl status apt-daily.timer
● apt-daily.timer - Daily apt download activities
   Loaded: loaded (/lib/systemd/system/apt-daily.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Thu 2019-06-06 04:08:18 UTC; 2 weeks 0 days ago
  Trigger: Fri 2019-06-21 17:14:35 UTC; 15h left
root@arm-testing-1:~# systemctl status apt-daily-upgrade.timer
● apt-daily-upgrade.timer - Daily apt upgrade and clean activities
   Loaded: loaded (/lib/systemd/system/apt-daily-upgrade.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Thu 2019-06-06 04:08:18 UTC; 2 weeks 0 days ago
  Trigger: Fri 2019-06-21 06:36:33 UTC; 4h 48min left

So maybe what I need to do is check the auto updates and auto upgrades before each run on bare metal, with something like systemctl is-active apt-daily-upgrade.service and systemctl is-enabled apt-daily-upgrade.timer?? Wdyt? @grahamwhaley
As for the lock files: if this error occurs again with all auto updates and auto upgrades disabled, then I will deal with it.
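For reference, disabling the timers (the approach used on the other CI nodes) would look roughly like this on Ubuntu. The unit names match the systemctl output above; the rest is a sketch, not the final setup script:

```shell
#!/bin/bash
# Sketch: disable the apt auto-update timers so they cannot clash with
# CI runs (same approach as the ansible setup linked earlier).
for unit in apt-daily.timer apt-daily-upgrade.timer; do
    sudo systemctl stop "$unit"
    sudo systemctl disable "$unit"
done
# Also stop any update job that is already mid-flight.
sudo systemctl stop apt-daily.service apt-daily-upgrade.service
# Alternatively, the pre-run check proposed above:
#   systemctl is-active --quiet apt-daily-upgrade.service && echo "updater running"
```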

@grahamwhaley
Contributor

@Pennyzct :-) np - the first time we hit this it took us some days to work out what the problem was :-)
On all the other nodes/agents we have disabled the auto updates, as you can see in the ansible files used to set up the Packet metrics bare-metal machines here. I guess you may have to discuss that with your machine devops to see if that is an acceptable solution?
If not, then yes, you will have to try to find some dynamic way to ensure the background timer routines don't trigger and clash with the CI scripts! If triggering the auto-updates just before each CI run works, then great, that sounds like a solution.

@Pennyzct
Contributor Author

Hi~ @grahamwhaley For all our CI nodes in packet.net, I'm thinking we just manually disable them when setting up the bare-metal CI nodes for the Kata community - maybe as another item in "How to set up a bare-metal Jenkins slave node". ;)
For this PR, I'm gonna close it. If a similar error still occurs with all auto updates and auto upgrades disabled, then I will re-open it and deal with it.

@Pennyzct Pennyzct closed this Jun 24, 2019
@grahamwhaley
Contributor

np @Pennyzct . I have probably said before, but we have some ansible scripts for deploying the (metrics) bare metal slaves in Packet: https://github.com/kata-containers/ci/tree/master/deployment/packet
Maybe they will work or be adaptable for other Packet bare metal machines as well.

@Pennyzct
Contributor Author

Hi~ @vielmetti, could we use these ansible scripts for deploying the ARM bare-metal slaves in Packet.net?

@vielmetti

Yes @Pennyzct, the Ansible scripts mentioned above by @grahamwhaley should work on Packet for Arm servers, with only a change to the "plan" type to pick a server of the appropriate type.
