feat: support reloading workflows at runtime by didier-wenzek · Pull Request #3180 · thin-edge/thin-edge.io

didier-wenzek · 2024-10-09T16:57:20Z

Proposed changes

To support dynamic reloading of workflows without breaking a running workflow, the proposal is to:

- use a hash of the workflow definition file to distinguish workflow versions
- persist a copy of each version for a given operation
- make the copy when a command instance is triggered
- use reference counting to remove copies that are no longer in use.
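The four points above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the PR's implementation: `VersionStore`, `version_of` and all method names are hypothetical, the copies are kept in a map rather than on disk, and the standard library's `DefaultHasher` stands in for the SHA-256 digest so that the sketch needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the content hash used as a workflow version.
/// The PR computes a SHA-256 digest; std's hasher keeps this sketch dependency-free.
fn version_of(definition: &str) -> String {
    let mut hasher = DefaultHasher::new();
    definition.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical in-memory model of versioned workflow definitions.
#[derive(Default)]
struct VersionStore {
    /// operation -> (version, definition) as last loaded
    current: HashMap<String, (String, String)>,
    /// (operation, version) -> (copy of the definition, number of running instances)
    in_use: HashMap<(String, String), (String, usize)>,
}

impl VersionStore {
    /// (Re)load the definition of an operation and return its version.
    fn load(&mut self, operation: &str, definition: &str) -> String {
        let version = version_of(definition);
        self.current
            .insert(operation.to_string(), (version.clone(), definition.to_string()));
        version
    }

    /// A new command instance pins the latest version; the copy is made here.
    fn start_instance(&mut self, operation: &str) -> Option<String> {
        let (version, definition) = self.current.get(operation)?.clone();
        let entry = self
            .in_use
            .entry((operation.to_string(), version.clone()))
            .or_insert_with(|| (definition, 0));
        entry.1 += 1;
        Some(version)
    }

    /// When an instance terminates, drop the copy once nobody uses it anymore.
    fn finish_instance(&mut self, operation: &str, version: &str) {
        let key = (operation.to_string(), version.to_string());
        let remaining = match self.in_use.get_mut(&key) {
            Some(entry) => {
                entry.1 -= 1;
                entry.1
            }
            None => return,
        };
        if remaining == 0 {
            self.in_use.remove(&key);
        }
    }

    /// Definition pinned by running instances of the given version, if any.
    fn pinned_definition(&self, operation: &str, version: &str) -> Option<&str> {
        self.in_use
            .get(&(operation.to_string(), version.to_string()))
            .map(|(definition, _)| definition.as_str())
    }
}
```

In this sketch, reloading a definition while an instance is running leaves the pinned copy untouched; the copy disappears only when the last instance using it finishes.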

Plan:

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#3156

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

codecov · 2024-10-09T17:09:10Z

Codecov Report

Attention: Patch coverage is 41.17647% with 340 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ore/tedge_agent/src/operation_workflows/persist.rs	34.22%	210 Missing and 11 partials ⚠️
crates/core/tedge_api/src/workflow/supervisor.rs	38.60%	93 Missing and 4 partials ⚠️
.../core/tedge_agent/src/operation_workflows/actor.rs	45.45%	17 Missing and 1 partial ⚠️
...core/tedge_agent/src/operation_workflows/config.rs	0.00%	3 Missing ⚠️
...tes/core/tedge_agent/src/state_repository/state.rs	93.33%	0 Missing and 1 partial ⚠️

github-actions · 2024-10-09T17:30:46Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %	⏱️ Duration
520	0	2	520	100	1h39m10.086467s

didier-wenzek · 2024-10-10T07:53:52Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

+) -> Result<(OperationWorkflow, WorkflowVersion), anyhow::Error> {
+    let bytes = tokio::fs::read(path).await.context("Fail to read file")?;
+    let input = std::str::from_utf8(&bytes).context("Fail to extract UTF8 content")?;
+    let version = sha256::digest(input);


Using md5 should be enough here, as there are no crypto concerns.

didier-wenzek · 2024-10-22T08:43:06Z

crates/core/tedge_api/src/workflow/supervisor.rs

+        let Some(version) = &command_state.workflow_version() else {
+            return Err(WorkflowExecutionError::MissingVersion);
+        };


I have to revert this change, as it could cause issues if the agent is updated while an operation is pending. If the previous agent did not support command versions, the new agent should not reject the resumed operation: it must use the current version of the workflow.
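A minimal sketch of that fallback, assuming the persisted command state exposes its version as an optional string. The names `ResolvedVersion` and `resolve_version` are hypothetical and do not come from the PR.

```rust
/// Hypothetical outcome of choosing the workflow definition for a resumed command.
#[derive(Debug, PartialEq)]
enum ResolvedVersion<'a> {
    /// The command was started by an agent that recorded a workflow version.
    Pinned(&'a str),
    /// No version was recorded (e.g. the command predates the agent update):
    /// fall back to the current version instead of rejecting the command.
    Current(&'a str),
}

fn resolve_version<'a>(persisted: Option<&'a str>, current: &'a str) -> ResolvedVersion<'a> {
    match persisted {
        Some(version) => ResolvedVersion::Pinned(version),
        None => ResolvedVersion::Current(current),
    }
}
```

The point of the design is that a missing version is treated as a normal case rather than as an error such as `MissingVersion`.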

albinsuresh

LGTM.

albinsuresh · 2024-10-23T08:09:14Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

        let dir_path = &self.custom_workflows_dir.clone();
        if let Err(err) = self
-            .load_operation_workflows(WorkflowSource::UserDefined, dir_path)
+            .load_operation_workflows(WorkflowSource::UserDefined(dir_path))


Opinion: I understand that WorkflowSource being a generic type allows such usages, but this one, carrying the directory path, seems slightly overloaded. That said, we can still defend it, since the directory roots are also linked to the source type.

Bravo555 · 2024-10-23T11:34:28Z

tests/RobotFramework/tests/tedge_agent/workflows/dynamic_workflow_reloading.robot

+    ...    item="@version":"76e9afe834b4a7cadc9029670ba76745fcda73784f9e78c09f0c0416f7f58ad2"
+
+Recover Builtin Operation
+    ThinEdgeIO.File Should Exist    /etc/tedge/operations/software_list.toml


suggestion: the test fails if run by itself and not part of the test suite, because this file is created by the previous test case. Could we instead use Transfer to Device everywhere so that there are no dependencies between the test cases? I'd expect Transfer to Device to overwrite the file if it already exists, so it should be okay to use that?

I have mixed feelings here. On the one hand, you are correct: it would be handy to have independent tests. On the other hand, this test suite represents well a scenario where a user creates a workflow file and then iteratively updates it.

Concretely, replacing the File Should Exist assertion with a Transfer to Device command would make the test behave differently in the suite and in isolation. In the suite, it checks that a user can update a workflow; in isolation, it checks that a user can create a workflow (i.e. Update User-Defined Operation would do the same test as Create User-Defined Operation).

Bravo555

Some nits, but LGTM overall.

Bravo555 · 2024-10-23T11:46:40Z

tests/RobotFramework/tests/tedge_agent/workflows/dynamic_workflow_reloading.robot

+    ${workflow_log}    Execute Command    cat /var/log/tedge/agent/workflow-user-command-dyn-test-1.log
+    Should Contain
+    ...    ${workflow_log}
+    ...    item="@version":"37d0861e3038b34e8ab2ffe3257dd9372213ed5e17ba352e5028b0bf9762a089"


nit(non-blocking): actual SHA256 is an implementation detail, we don't need to compare full value, only that it changed between different versions of the workflow

Also if the toml file changes this value will have to be updated.

It's a bit of a nitpick, but a comment would help, because it's not obvious that this is the SHA256 hash of user-command-v1.toml or why we're comparing it.
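As an illustration of checking only that the version changed, here is a small sketch. It assumes the log contains the version as a `"@version":"…"` JSON field, as the test excerpt suggests; `extract_version` is a hypothetical helper and not part of the test suite, which is written in Robot Framework.

```rust
/// Hypothetical helper: return the value of the first `"@version":"…"` field in a log excerpt.
fn extract_version(log: &str) -> Option<&str> {
    let marker = "\"@version\":\"";
    let start = log.find(marker)? + marker.len();
    let len = log[start..].find('"')?;
    Some(&log[start..start + len])
}

/// True only when both logs carry a version and the two versions differ.
fn version_changed(log_before: &str, log_after: &str) -> bool {
    match (extract_version(log_before), extract_version(log_after)) {
        (Some(before), Some(after)) => before != after,
        _ => false,
    }
}
```

Such a check would not need updating when the TOML file changes, at the cost of no longer pinning the exact digest.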

Bravo555 · 2024-10-23T12:34:38Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

+use anyhow::Context;
+use camino::Utf8Path;
+use camino::Utf8PathBuf;
+use log::error;


Suggested change

-use log::error;
+use tracing::error;

Fixed 41f4208

Bravo555 · 2024-10-23T12:45:27Z

crates/core/tedge_agent/src/operation_workflows/persist.rs

thought: this module has quite a bit of functionality but doesn't have any unit tests - codecov reports 210 missed lines and 34.2% patch coverage (from other workflow-related tests)

I can only acknowledge that the unit test coverage is poor. However, this code is quite extensively tested by the system test suite added by this PR (Dynamic Workflow Reloading). I opted for system tests instead of unit tests because the features introduced by this module are heavily tied to the file system and inotify, as well as to sequences of user actions (adding, updating, and removing files while the agent is running or restarted). One place where unit tests can be improved is the tedge_api::workflow::supervisor module, which provides the in-memory representation of the uploaded workflow definitions.

…r engine

The WorkflowRepository acts as a facade to WorkflowSupervisor, adding all disk-related features: loading definitions from disk, caching definitions in use, and reloading definitions on changes.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

A workflow source is always used with complementary information: a file path or a workflow version. It therefore makes sense to pack that information within the WorkflowSource itself. This also highlights the corner case of the BuiltIn workflow, for which there is no complementary information.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
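The idea can be sketched with a generic enum. Only the names WorkflowSource, BuiltIn and UserDefined appear in this PR page; the `InUseCopy` variant and the `info` and `with_info` helpers below are assumptions made for illustration, not the crate's actual API.

```rust
/// Sketch of a workflow source carrying its complementary information `T`
/// (for instance a directory path or a workflow version).
#[derive(Debug, Clone, PartialEq)]
enum WorkflowSource<T> {
    /// Built-in workflows have no complementary information.
    BuiltIn,
    /// Definition provided by the user.
    UserDefined(T),
    /// Hypothetical variant: persisted copy used by a running command instance.
    InUseCopy(T),
}

impl<T> WorkflowSource<T> {
    /// Complementary information, if this source has any.
    fn info(&self) -> Option<&T> {
        match self {
            WorkflowSource::BuiltIn => None,
            WorkflowSource::UserDefined(info) | WorkflowSource::InUseCopy(info) => Some(info),
        }
    }

    /// Same kind of source, with different complementary information
    /// (e.g. going from a directory path to a version).
    fn with_info<U>(&self, info: U) -> WorkflowSource<U> {
        match self {
            WorkflowSource::BuiltIn => WorkflowSource::BuiltIn,
            WorkflowSource::UserDefined(_) => WorkflowSource::UserDefined(info),
            WorkflowSource::InUseCopy(_) => WorkflowSource::InUseCopy(info),
        }
    }
}
```

With this shape, a call such as `load_operation_workflows(WorkflowSource::UserDefined(dir_path))` passes the source kind and its directory together, and the compiler enforces that a BuiltIn source carries nothing.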

reubenmiller added the theme:workflows Theme: Workflow engine topics label Oct 10, 2024

didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from eb59902 to d0a5cde Compare October 15, 2024 07:26

didier-wenzek force-pushed the feat/load-operation-workflows-on-updates branch from d0a5cde to b0454bc Compare October 15, 2024 08:13

albinsuresh reviewed Oct 21, 2024

albinsuresh approved these changes Oct 23, 2024

Bravo555 reviewed Oct 23, 2024

Bravo555 approved these changes Oct 23, 2024

didier-wenzek added 16 commits October 23, 2024 15:52

Attach versions to operation workflows and commands

3b78959

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Move on-disk workflow representation in a sub-module

a83642f

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Make pub the logic used to check the agent state dir

bc928e8

This is an intermediate step, the aim being to use the same directory to persist a copy of the workflows currently used (i.e. for which there is a running operation instance).

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Group workflow loading logic in struct WorkflowRepository

446950d

For this first step the behavior is unchanged: the workflows are only loaded on start.

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Persist operation definition when a new instance is created

6cc005c

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Reload workflow definitions on file change using inotify

dca2f6f

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Unregister user-defined workflows which definitions are removed

30249b4

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Restore builtin definition when a user defined workflow is removed

434f8db

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Update capability messages on operation workflow updates

2f96370

Test reloading workflows at runtime

79fc269

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

A main workflow can update a sub-workflow before using it

1467f18

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

A new command instance must use the latest workflow version

651a0b7

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Support concurrent instances with difference versions

55aef04

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Remove copies of in-use workflow when no more used

d70209e

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
