This repository was archived by the owner on Jun 28, 2024. It is now read-only.

clean-up: clean stale lock files on bare-metal CI node #1720

Closed
Pennyzct wants to merge 1 commit into kata-containers:master from Pennyzct:lock_file

Conversation

@Pennyzct
Contributor

Hi~ guys.
I have encountered "dpkg frontend is locked by another process" a few times on the ARM CI node.
Sometimes, due to an unsteady network connection, apt/apt-get commands are terminated improperly. This leaves intermediate lock files behind, which can lead to an "Unable to lock" error on the next run.
So we need to clean stale lock files on the bare-metal CI node before each run.

Sometimes, due to unsteady network connection, apt/apt-get commands
have been terminated improperly. And it leaves intermediate lock
files, which may lead to 'Unable to lock' error in next run.
So you need to clean stale lock files on bare-metal CI node before
each run

Signed-off-by: Penny Zheng <penny.zheng@arm.com>
@Pennyzct
Contributor Author

/test

Contributor

@grahamwhaley left a comment


Hi @Pennyzct Just to check, are you sure it is network timeouts that are causing the issue?
I would have thought a timed-out apt etc. would probably clean up after itself quite well?
In the past we have seen package manager lock contention - but it was from the system-wide background package update polling code (auto updaters) - so we went and turned them off, and have not seen the issue since. For an example see:
https://github.com/kata-containers/ci/blob/master/deployment/packet/install_packet.yaml#L107-L125


info "Clean stale lock files"
# stale lock files are created when processes are not terminated properly
stale_lock_file_union=( "/var/lib/dpkg/lock" )
Contributor


It's slightly odd having this global var set here, but used in the other func. I'd probably either define it in the other func, or pass it through as an argument?
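The suggestion above could look something like this - a minimal sketch (the argument handling and demo paths are illustrative, not the PR's actual code) that passes the lock-file list into the function instead of relying on a global:

```shell
#!/bin/bash
# Sketch of the reviewer's suggestion: pass the lock-file list into the
# cleanup function as arguments rather than reading a global variable.
delete_stale_lock_files() {
    local lock_file
    for lock_file in "$@"; do
        # A real CI script should first verify no live process holds the
        # file (e.g. with fuser) before removing it.
        [ -e "$lock_file" ] && rm -f "$lock_file"
    done
    return 0
}

# Demo against a throwaway directory rather than /var/lib/dpkg.
tmpdir=$(mktemp -d)
touch "$tmpdir/lock" "$tmpdir/lock-frontend"
delete_stale_lock_files "$tmpdir/lock" "$tmpdir/lock-frontend"
rm -r "$tmpdir"
```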

stale_lock_file_union=( "/var/lib/dpkg/lock" )
delete_stale_lock_files
# reconfigure the packages
sudo dpkg --configure -a
Contributor


This is in a generic (non-distro-specific?) cleanup func, I think - should it be a distro-specific call?

@Pennyzct
Contributor Author

Pennyzct commented Jun 20, 2019

Hi~ @grahamwhaley
It was only my guess, I'm sorry. 😰
This error is pretty rare. It always comes with a hanging apt-get process.
And the error does not carry the same info every time. I have found a wiki about classifying and fixing those issues; you may want to have a look.
I'm not sure if it is the same thing you talked about. Totally confused.

@grahamwhaley
Contributor

@Pennyzct - probably before we start killing processes and nuking lock files then, we should diagnose what the actual problem is, and then work out if the solution is to, say, disable the auto updaters etc.

I would start by maybe adding some diagnostics in the error case, so try to capture the apt fail (in a TRAP function maybe), and then add something like

$ ps -ef | fgrep apt
$ sudo lsof /var/lib/apt/lists/
$ sudo lsof /var/lib/dpkg/
$ sudo lsof /var/cache/apt/archives/

I do suspect maybe your server has the auto update services running in the background, and occasionally they will be active when you try to process a PR, and they clash on the lock file. We have seen that before (both on the bare metal metrics machines and on the cloud instances iirc) - so, if you have not already deliberately disabled the auto updaters - then do that first...
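A rough sketch of that capture (the function name and the trap wiring are my own; the ps/lsof commands are the ones listed above):

```shell
#!/bin/bash
# Hypothetical sketch: on any apt failure, dump which processes are
# holding the apt/dpkg lock directories before the script exits.
dump_apt_lock_holders() {
    echo "apt command failed; capturing lock diagnostics:"
    ps -ef | grep '[a]pt' || true
    # sudo -n: fail rather than prompt if run non-interactively.
    sudo -n lsof /var/lib/apt/lists/ 2>/dev/null || true
    sudo -n lsof /var/lib/dpkg/ 2>/dev/null || true
    sudo -n lsof /var/cache/apt/archives/ 2>/dev/null || true
}

# Usage in the CI script:
#   trap dump_apt_lock_holders ERR
#   set -e
#   sudo apt-get update
```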

@Pennyzct
Contributor Author

Hi~ @grahamwhaley thanks three thousand times!!!!! 😍 It is the auto updates and auto upgrades running in the background, clashing with the manual apt runs.

root@arm-testing-1:~# systemctl status apt-daily.timer
● apt-daily.timer - Daily apt download activities
   Loaded: loaded (/lib/systemd/system/apt-daily.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Thu 2019-06-06 04:08:18 UTC; 2 weeks 0 days ago
  Trigger: Fri 2019-06-21 17:14:35 UTC; 15h left
root@arm-testing-1:~# systemctl status apt-daily-upgrade.timer
● apt-daily-upgrade.timer - Daily apt upgrade and clean activities
   Loaded: loaded (/lib/systemd/system/apt-daily-upgrade.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Thu 2019-06-06 04:08:18 UTC; 2 weeks 0 days ago
  Trigger: Fri 2019-06-21 06:36:33 UTC; 4h 48min left

So maybe what I need to do is check the auto updates and auto upgrades before each run on bare metal, with something like systemctl is-active apt-daily-upgrade.service and systemctl is-enabled apt-daily-upgrade.timer?? Wdyt? @grahamwhaley
As for the lock files: if this error occurs again with all auto updates and auto upgrades disabled, then I will deal with it.
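For reference, disabling the timers (the approach used on the other CI nodes) would look roughly like this on Ubuntu. The unit names match the systemctl output above; the rest is a sketch, not the final setup script:

```shell
#!/bin/bash
# Sketch: disable the apt auto-update timers so they cannot clash with
# CI runs (same approach as the ansible setup linked earlier).
for unit in apt-daily.timer apt-daily-upgrade.timer; do
    sudo systemctl stop "$unit"
    sudo systemctl disable "$unit"
done
# Also stop any update job that is already mid-flight.
sudo systemctl stop apt-daily.service apt-daily-upgrade.service
# Alternatively, the pre-run check proposed above:
#   systemctl is-active --quiet apt-daily-upgrade.service && echo "updater running"
```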

@grahamwhaley
Contributor

@Pennyzct :-) np - the first time we hit this it took us some days to work out what the problem was :-)
On all the other nodes/agents we have disabled the auto updates, as you can see in the ansible files used to set up the Packet metrics bare-metal machines here. I guess you may have to discuss that with your machine devops to see if that is an acceptable solution?
If not, then yes, you will have to try to find some dynamic way to ensure the background timer routines don't trigger and clash with the CI scripts! If triggering the auto-updates just before each CI run works, then great, that sounds like a solution.

@Pennyzct
Contributor Author

Hi~ @grahamwhaley For all our CI nodes in packet.net, I'm thinking we just manually disable them when setting up the bare-metal CI nodes for the Kata community - maybe as another item in "How to set up a bare-metal Jenkins slave node". ;)
For this PR, I'm gonna close it. If a similar error still occurs with all auto updates and auto upgrades disabled, then I will re-open it and deal with it.

@Pennyzct Pennyzct closed this Jun 24, 2019
@grahamwhaley
Contributor

np @Pennyzct . I have probably said before, but we have some ansible scripts for deploying the (metrics) bare metal slaves in Packet: https://github.com/kata-containers/ci/tree/master/deployment/packet
Maybe they will work or be adaptable for other Packet bare metal machines as well.

@Pennyzct
Contributor Author

Hi~ @vielmetti, could we use these ansible scripts for deploying the ARM bare-metal slaves in Packet.net?

@vielmetti

Yes @Pennyzct, the Ansible scripts mentioned above by @grahamwhaley should work on Packet for Arm servers, with only a change to the "plan" type to pick a server of the appropriate type.
