tedge-agent should not block processing commands if one command is unacknowledged by the creator #3456
Description
Is your feature improvement request related to a problem? Please describe.
Workflows processed by tedge-agent currently wait for the workflow result to be cleared by the user/component which created it.
New command requests for the same workflow are queued until any previously in-use workflow (of the same type) has been cleared. This makes the tedge-agent fragile: a client that forgets to clear an existing command blocks other users from having their requests processed.
Problem 1: Badly behaving client which fails to acknowledge the result
- Client 1: Create workflow for firmware_update (cmd_id=1)
- tedge-agent: processes firmware_update (cmd_id=1), and sets the status to "successful" or "failed"
- Client 2: Create workflow for firmware_update (cmd_id=2)
- tedge-agent: waits until cmd_id=1 has been acknowledged (and cleared) by Client 1 before processing cmd_id=2. If Client 1 never acknowledges the result, the tedge-agent is blocked indefinitely
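The sequence above can be reproduced with a small toy model. This is only a sketch of the described behaviour, not the actual tedge-agent implementation: the class `BlockingAgent` and its methods `submit` and `clear` are hypothetical names, and command execution is reduced to setting a status.

```python
from collections import defaultdict, deque


class BlockingAgent:
    """Toy model of the behaviour described in this issue (hypothetical, not tedge-agent code)."""

    def __init__(self):
        self.uncleared = {}               # command type -> cmd_id whose result awaits clearing
        self.queued = defaultdict(deque)  # command type -> cmd_ids waiting to be processed
        self.status = {}                  # (command type, cmd_id) -> status string

    def submit(self, cmd_type, cmd_id):
        # A new command is held back while an earlier one of the same type is uncleared.
        if cmd_type in self.uncleared:
            self.queued[cmd_type].append(cmd_id)
            self.status[(cmd_type, cmd_id)] = "queued"
        else:
            self._process(cmd_type, cmd_id)

    def _process(self, cmd_type, cmd_id):
        # Execution is reduced to setting a terminal status.
        self.status[(cmd_type, cmd_id)] = "successful"
        self.uncleared[cmd_type] = cmd_id

    def clear(self, cmd_type, cmd_id):
        # Only the clearing of the uncleared command releases the next queued one.
        if self.uncleared.get(cmd_type) != cmd_id:
            return
        del self.uncleared[cmd_type]
        del self.status[(cmd_type, cmd_id)]
        if self.queued[cmd_type]:
            self._process(cmd_type, self.queued[cmd_type].popleft())


agent = BlockingAgent()
agent.submit("firmware_update", 1)           # Client 1
agent.submit("firmware_update", 2)           # Client 2
print(agent.status[("firmware_update", 2)])  # queued
agent.clear("firmware_update", 1)            # never happens with a badly behaving Client 1
print(agent.status[("firmware_update", 2)])  # successful
```

If the `clear` call is left out, cmd_id=2 stays in the `queued` state forever, which is the blocking described in Problem 1.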
Problem 2: tedge-agent fails to see acknowledgement
If some messages are lost (which is currently the case with mosquitto > 2.0.11, <= 2.0.21), the tedge-agent may never see the clearing of the command. It will then block any future commands until the tedge-agent is restarted (on startup it checks for any existing retained messages and reconciles any in-progress commands). Whilst this is mainly due to the existing mosquitto bug #2618, the problem would also exist if the MQTT server isn't configured with persistence.
Below shows the potentially problematic sequence:
- Client 1: Create workflow for firmware_update (cmd_id=1)
- tedge-agent: processes firmware_update (cmd_id=1), and sets the status to "successful" or "failed"
- tedge-agent: Gets disconnected from the MQTT broker (possibly due to the device being restarted)
- Client 1: Clears the cmd_id=1 message (tedge-agent is still disconnected)
- tedge-agent: Reconnects to the MQTT broker, but does not receive the message that cleared the retained command. It will therefore not process any future command of the same type until the tedge-agent is restarted or another client re-sends the clearing message (which is unlikely, as it would be difficult to find the correct message id)
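As noted above, a restart resolves the situation because the tedge-agent reconciles its state against the retained messages on startup. A minimal sketch of the same comparison, which could in principle also be run on reconnect, is shown below. It assumes the agent can obtain the set of command messages currently retained on the broker; the function `reconcile` and the `(type, cmd_id)` tuples are hypothetical and only illustrate the idea.

```python
def reconcile(awaiting_clear, retained_on_broker):
    """Keep only the commands that still exist as retained messages on the broker.

    Both arguments are sets of (command type, cmd_id) tuples (a hypothetical representation).
    """
    return {cmd for cmd in awaiting_clear if cmd in retained_on_broker}


# Agent state before the disconnect: cmd_id=1 has finished and awaits clearing.
awaiting_clear = {("firmware_update", 1)}

# Client 1 cleared cmd_id=1 while the agent was offline, so nothing is retained any more.
retained_on_broker = set()

awaiting_clear = reconcile(awaiting_clear, retained_on_broker)
print(awaiting_clear)  # set()
```

With an empty result, nothing is left to wait for, so the next firmware_update command could be processed without a restart.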
Describe the solution you'd like
The solution is open for discussion, but below are some questions to think about:
- Why should the tedge-agent care if an operation is acknowledged by the owner of the command? The tedge-agent should not clear the commands itself (at least those that didn't originate from the tedge-agent itself). The tedge-agent is responsible for doing the work, not for checking whether the work has been observed
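One possible direction, sketched below as a toy model only, is to let a command occupy the agent just while it is executing, and to treat clearing as pure bookkeeping done by the owner. The class `NonBlockingAgent` and its methods are hypothetical names used for discussion; this is not a proposal for the actual implementation.

```python
from collections import defaultdict, deque

TERMINAL = {"successful", "failed"}


class NonBlockingAgent:
    """Toy model: only an executing command blocks others of the same type (hypothetical)."""

    def __init__(self):
        self.executing = {}               # command type -> cmd_id currently being executed
        self.queued = defaultdict(deque)  # command type -> cmd_ids waiting to be processed
        self.status = {}                  # (command type, cmd_id) -> status string

    def submit(self, cmd_type, cmd_id):
        # Uncleared results of finished commands are ignored; only running work blocks.
        if cmd_type in self.executing:
            self.queued[cmd_type].append(cmd_id)
            self.status[(cmd_type, cmd_id)] = "queued"
        else:
            self._start(cmd_type, cmd_id)

    def _start(self, cmd_type, cmd_id):
        self.status[(cmd_type, cmd_id)] = "executing"
        self.executing[cmd_type] = cmd_id

    def finish(self, cmd_type, cmd_id, result="successful"):
        # Reaching a terminal status frees the agent for the next command straight away.
        if result not in TERMINAL:
            raise ValueError(f"not a terminal status: {result}")
        if self.executing.get(cmd_type) != cmd_id:
            return
        self.status[(cmd_type, cmd_id)] = result
        del self.executing[cmd_type]
        if self.queued[cmd_type]:
            self._start(cmd_type, self.queued[cmd_type].popleft())

    def clear(self, cmd_type, cmd_id):
        # Clearing is done by the owner and only removes the finished command's record.
        if self.status.get((cmd_type, cmd_id)) in TERMINAL:
            del self.status[(cmd_type, cmd_id)]


agent = NonBlockingAgent()
agent.submit("firmware_update", 1)           # Client 1
agent.finish("firmware_update", 1)           # result published, never cleared by Client 1
agent.submit("firmware_update", 2)           # Client 2
print(agent.status[("firmware_update", 2)])  # executing
```

In this model the agent never clears a command itself, and a missing or lost clearing message only leaves a stale result behind instead of blocking later commands.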
Describe alternatives you've considered
Additional context