clean-up: clean stale lock files on bare-metal CI node #1720
Pennyzct wants to merge 1 commit into kata-containers:master from
Conversation
Sometimes, due to an unsteady network connection, `apt`/`apt-get` commands are terminated improperly. This leaves behind intermediate lock files, which may lead to an `Unable to lock` error on the next run. So we need to clean stale lock files on the bare-metal CI node before each run. Signed-off-by: Penny Zheng <penny.zheng@arm.com>
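A minimal sketch of the kind of cleanup the PR describes (the lock-file list and the `fuser` guard here are my assumptions for illustration, not necessarily what the actual patch does):

```shell
# Hypothetical sketch: remove apt/dpkg lock files only when no
# process still holds them, so a live package run is left alone.
delete_stale_lock_files() {
    local f
    for f in "$@"; do
        [ -e "$f" ] || continue
        # fuser exits non-zero when no process has the file open
        if ! fuser "$f" >/dev/null 2>&1; then
            rm -f "$f"
        fi
    done
}

# On the CI node this would be run as root against the real paths,
# e.g. /var/lib/dpkg/lock and /var/lib/apt/lists/lock.
```

Guarding on `fuser` matters because an apt/dpkg process that is still running keeps the lock file open, and blindly deleting it would corrupt that run.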
/test
grahamwhaley left a comment
Hi @Pennyzct Just to check, are you sure it is network timeouts that are causing the issue?
I would have thought a timeout of an apt etc. would probably clean up after itself quite well?
In the past we have seen package manager lock contention - but it was from the system-wide background package update polling (the auto updaters) - so we went and turned them off, and have not seen the issues since. For an example see:
https://github.com/kata-containers/ci/blob/master/deployment/packet/install_packet.yaml#L107-L125
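For reference, the usual way those background pollers get switched off on Ubuntu is a drop-in like the following (these are the standard `APT::Periodic` keys; the exact mechanism used in the linked playbook may differ):

```
# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
```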
    info "Clean stale lock files"
    # stale lock files are created when processes are not terminated properly
    stale_lock_file_union=( "/var/lib/dpkg/lock" )
It's slightly odd having this global var set here, but used in the other func. I'd probably either define it in the other func, or pass it through as an argument?
    stale_lock_file_union=( "/var/lib/dpkg/lock" )
    delete_stale_lock_files
    # reconfigure the packages
    sudo dpkg --configure -a
This is in a generic (non-distro specific?) cleanup func I think - should it be a distro specific call?
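One way to address that comment (a sketch only; `reconfigure_packages` is a hypothetical helper, not something from the patch) is to dispatch on the `ID` field of `/etc/os-release`:

```shell
# Hypothetical sketch: make the reconfigure step distro-specific,
# since "dpkg --configure -a" only makes sense on Debian-family hosts.
reconfigure_packages() {
    local distro="$1"   # e.g. the ID field from /etc/os-release
    case "$distro" in
        ubuntu|debian)
            sudo dpkg --configure -a
            ;;
        *)
            : # rpm/pacman etc. recover differently; nothing to do here
            ;;
    esac
}
```

It could be called as `reconfigure_packages "$(. /etc/os-release; echo "$ID")"` from the generic cleanup path.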
Hi~ @grahamwhaley
@Pennyzct - probably before we start killing processes and nuking lock files, we should diagnose what the actual problem is, and then work out if the solution is to, say, disable the auto updaters etc. I would start by maybe adding some diagnostics in the error case, to try to capture what is holding the lock. I do suspect maybe your server has the auto update services running in the background, and occasionally they will be active when you try to process a PR, and they clash on the lock file. We have seen that before (both on the bare metal metrics machines and on the cloud instances iirc) - so, if you have not already deliberately disabled the auto updaters - then do that first...
Hi~ @grahamwhaley thanks three thousand times!!!!! 😍 It is the auto updates and auto upgrades running in the background, and they clashed with the manually run apt commands. So maybe what I need to do is check the auto updates and auto upgrades before each run on bare-metal, something like:
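A before-each-run check could look something like this (the timer names are the stock Ubuntu ones; whether this node actually uses them is an assumption):

```shell
# Hypothetical pre-run check: warn if the apt auto-update timers
# are still enabled on this node.
for t in apt-daily.timer apt-daily-upgrade.timer; do
    if systemctl is-enabled --quiet "$t" 2>/dev/null; then
        echo "WARNING: $t is enabled and may clash with CI apt runs"
    fi
done
```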
@Pennyzct :-) np - first time we hit this it took us some days to work out what the problem was :-)
Hi~ @grahamwhaley For all our CI nodes in packet.net, I'm thinking we just manually disable them when setting up these bare-metal CI nodes for the Kata community. Maybe another item in How to set-up-a-bare-metal-jenkins-slave-node. ;)
np @Pennyzct. I have probably said before, but we have some ansible scripts for deploying the (metrics) bare metal slaves in Packet: https://github.com/kata-containers/ci/tree/master/deployment/packet
Hi~ @vielmetti could we use the ansible scripts for deploying the ARM bare metal slaves in Packet.net?
Yes @Pennyzct , the Ansible scripts mentioned above by @grahamwhaley should work on Packet for Arm servers with only the change in the "plan" type to pick a server of an appropriate type. |
Hi~ guys.
I have encountered `dpkg frontend is locked by another process` a few times on the ARM CI node.
Sometimes, due to an unsteady network connection, `apt`/`apt-get` commands are terminated improperly, leaving intermediate lock files behind, which may lead to an `Unable to lock` error on the next run.
So we need to clean stale lock files on the bare-metal CI node before each run.