config-manager: Only use tedge-write after normal copy fails due to permissions and improve tedge-write logging#3069
Conversation
Codecov ReportAttention: Patch coverage is Additional details and impacted files📢 Thoughts on this report? Let us know! |
Robot Results
|
albinsuresh
left a comment
There was a problem hiding this comment.
LGTM, except for one minor unwarp issues.
crates/core/tedge_write/src/api.rs
Outdated
| let output = output.with_context(|| format!("failed to start process '{program}'"))?; | ||
|
|
||
| if !output.status.success() { | ||
| let stderr = String::from_utf8(output.stderr).expect("output should be utf-8"); |
There was a problem hiding this comment.
I'm assuming that we're using expect here as we know that the output of tedge-write process can't be non-UTF8.
crates/core/tedge_write/src/api.rs
Outdated
| if !output.status.success() { | ||
| let stderr = String::from_utf8(output.stderr).expect("output should be utf-8"); | ||
| let stderr = stderr.trim(); | ||
| let exit_code = output.status.code().unwrap(); |
There was a problem hiding this comment.
Here, isn't it better to be defenesive and account for the None status code case as well, as there is the theoretical possibility of the tedge-write process getting killed?
There was a problem hiding this comment.
Hmm, I think that tedge-write process being terminated by a signal is very unlikely, as it is spawned and joined by tedge-agent, but you're right, it's better for that edge case to be properly handled as well.
There was a problem hiding this comment.
First, as this as been an issue observed by a user, it would be good to have a issue giving the context (#3073). Notably, it is not obvious reading this, that the root cause is that configuration management is not working on a device where /usr/bin is readonly the path for tedge_write being hard coded.
Second, instead of an option to disable tedge_write, it would be better to have a setting telling where tedge_write has been installed.
didier-wenzek
left a comment
There was a problem hiding this comment.
We need to discuss what is the best option : tedge_write.disableor tedge_write.path.
|
During the call with @albinsuresh and @didier-wenzek we decided the following:
|
f9fc679 to
2884a0f
Compare
2884a0f to
e130de6
Compare
e130de6 to
febaea6
Compare
|
Side remark. It's a pity that the unit tests fail for a reason already highlighted by clippy: unused import: |
febaea6 to
9c7a381
Compare
9c7a381 to
ec8562c
Compare
ec8562c to
8f6bf66
Compare
8f6bf66 to
b89069c
Compare
b89069c to
2d755e5
Compare
| let error_message = format!("config-manager failed downloading a file: {err}",); | ||
| let err = | ||
| anyhow::Error::from(err).context("config-manager failed downloading a file"); | ||
| let error_message = format!("{err:#}"); |
There was a problem hiding this comment.
{err:#} on anyhow::Error prints all the lower level causes in error 1: error 2: error 3 format.
Put more details into warn and error logs when deploying configuration files using `tedge-write`: - before: "WARN: `sudo.enable` set to `true`, but sudo not found in $PATH" - after: "WARN: `sudo.enable` set to `true`, but sudo not found in $PATH, using tedge-write directly" - before: "ERROR: config-manager failed writing updated configuration file: Starting tedge-write process failed" - after: "ERROR: config-manager failed writing updated configuration file: failed to start process '/usr/bin/tedge-write': No such file or directory (os error 2)" - before: "ERROR: config-manager failed writing updated configuration file: sudo: /usr/bin/tedge-write: command not found" - after: "ERROR: config-manager failed writing updated configuration file: process '/usr/bin/sudo' returned non-zero exit code (1); stderr="sudo: /usr/bin/tedge-write: command not found"" Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
Configuration file deployment was changed to always do a regular copy first, and only use tedge-write if copy failes due to lacking permissions. Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
`Trigger config_update operation from another workflow` and `Trigger config_snapshot operation from another operation` were failing if started directly and not part of test suite. Because of the missing `Set Device Context` call, workflow toml files were being copied to the wrong device. Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
2d755e5 to
33d220d
Compare
| // TODO: source temporary file should be cleaned up automatically | ||
| let _ = std::fs::remove_file(from); |
There was a problem hiding this comment.
This has definitely to be done.
But not necessarily here. This file has been created and populated outside this method. It should be automatically cleaned by the caller. Not easy to implement with the current design, though. Can this be done once merged this refactoring
| /// - `src` doesn't exist | ||
| /// - user or group doesn't exist | ||
| /// - we have no write/execute permissions to the directory | ||
| fn move_file_atomic_set_permissions_if_doesnt_exist( |
There was a problem hiding this comment.
| fn move_file_atomic_set_permissions_if_doesnt_exist( | |
| fn move_file_preserving_target_permissions( |
Can we also move this API into tedge_utils so that it can be reused? After removing the dependency on FileEntry, of course. It could take the expected permissions instead.
There was a problem hiding this comment.
Yes, I have this done for a follow-up PR
| .tempfile_in(dest.parent().context("invalid path")?) | ||
| .with_context(|| format!("could not create temporary file at '{dest}'"))?; | ||
|
|
||
| std::io::copy(&mut src_file, &mut tempfile).context("failed to copy")?; |
There was a problem hiding this comment.
The function name implies move but we're doing a copy here. Can we attempt a file move at least when the source and target files are under the same parent directory?
There was a problem hiding this comment.
Yeah, move is probably a little misleading for now, but this function will be used by both config-manager directly and by tedge-write, which reads from stdin, so the function will be renamed to write_... in a follow-up PR
| tempfile | ||
| .as_file() | ||
| .set_permissions(std::fs::Permissions::from_mode(target_permissions.mode)) | ||
| .context("failed to set mode on the destination file")?; | ||
|
|
||
| fchown( | ||
| tempfile.as_file(), | ||
| Some(target_permissions.uid), | ||
| Some(target_permissions.gid), | ||
| ) | ||
| .context("failed to change ownership of the destination file")?; |
There was a problem hiding this comment.
Could have been done with tedge_utils::file::PermissionEntry::apply() as well, right?
There was a problem hiding this comment.
AFAIK PermissionEntry::apply() uses a chown, but since we have a file descriptor open and want all these operations finished before the final rename, I decided to use fchown, which should complete before the file descriptor is closed.
There is also one problem with PermissionEntry::apply() when used in this context:
Since we're doing an atomic write as either tedge or root, the temporary file will be owned by tedge or root but the target configuration file will likely be owned by some other user, so most likely we will always have to change file owner. We want the atomic write to look like a normal write so it has to preserve original ownership no matter what it is.
But it is technically possible for a file to have uid/gid of a user that does not exist on the system. Perhaps the file was created by a program coming from a package that set up its own user, and the package was since uninstalled and the user was removed. Still in this case we still want to preserve original permissions, so we have to use original uid/gid directly. But PermissionEntry::apply() uses a name, so if we wanted to use it here we'd need to look up the name of the owner of the file, which can fail.
So not using PermissionEntry::apply() avoids an extra failure mode.
| Ok(Permissions { uid, gid, mode }) | ||
| } | ||
|
|
||
| struct Permissions { |
There was a problem hiding this comment.
Why not reuse tedge_utils::file::PermissionEntry?
There was a problem hiding this comment.
We want to use uid/gid directly instead of names, since doing so avoids an extra failure mode.
gligorisaev
left a comment
There was a problem hiding this comment.
Checked the test and it is doing the expected verification
Proposed changes
Only use
tedge-writefor privilege elevation after normal copy fails due toPermissionDeniederror and tedge-write is enabledPut more details into warn and error logs when deploying configuration files using
tedge-write:Types of changes
Paste Link to the issue
Checklist
cargo fmtas mentioned in CODING_GUIDELINEScargo clippyas mentioned in CODING_GUIDELINESFurther comments