Skip to content

feat: new tedge command to collect diagnostic information tedge diag collect#3608

Merged
rina23q merged 5 commits intothin-edge:mainfrom
rina23q:feature/3412/tedge-diag-collect
May 27, 2025
Merged

feat: new tedge command to collect diagnostic information tedge diag collect#3608
rina23q merged 5 commits intothin-edge:mainfrom
rina23q:feature/3412/tedge-diag-collect

Conversation

@rina23q
Copy link
Copy Markdown
Member

@rina23q rina23q commented May 12, 2025

Proposed changes

tedge diag and its sub-command collect helps users to get debugging information.

For the design and the contracts between the runner and the plugin APIs, refer to docs/src/references/diagnostic-plugin.md in this PR.

How to try

  1. Clone example plugins
# using ssh
git clone git@github.com:rina23q/tedge-diag-plugins.git

# using https
git clone https://github.com/rina23q/tedge-diag-plugins.git
  1. Give the path to the example plugins directory to --plugin-dir option and execute
tedge diag collect --plugin-dir <path>/tedge-diag-plugins/plugins

Todos in this PR

  • Improve UX; point 3 & 7 in feat: new tedge command to collect diagnostic information tedge diag collect #3608 (comment)
  • Support setting the plugin directories via tedge.toml (and also flags)
  • Support multiple plugin directories (process in order that it is configured)
  • Store tedge plugins included in the binary under /usr/share/tedge/diag-plugins (edit the nfpm.yaml file for the tedge binary)
  • Move the source code for the plugins under configuration/contrib/diag-plugins and add a template called template.ignore
  • When searching for plugins, ignore files with that end in .ignore

Follow-up

Note

  • It's tested with Ubuntu 24.04.1 with x86_64 architecture
  • Plugins are stored at https://github.com/rina23q/tedge-diag-plugins
  • The 07_mosquitto plugin may return an error for the case /var/log/mosquitto/mosquitto.log not readable by user

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#3412

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s. You can activate automatic signing by running just prepare-dev once)
  • I ran just format as mentioned in CODING_GUIDELINES
  • I used just check as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@rina23q rina23q had a problem deploying to Test Pull Request May 12, 2025 16:50 — with GitHub Actions Failure
@codecov
Copy link
Copy Markdown

codecov bot commented May 12, 2025

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...common/tedge_config/src/tedge_toml/tedge_config.rs 50.00% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rina23q rina23q had a problem deploying to Test Pull Request May 12, 2025 17:11 — with GitHub Actions Failure
Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice first working version

@reubenmiller reubenmiller added the theme:troubleshooting Theme: Troubleshooting and remote control label May 13, 2025
@reubenmiller
Copy link
Copy Markdown
Contributor

@rina23q It seems that the execution order of the plugins is not controlled, as the order keeps on changing, below shows the output from two different runs.

Run 1

$ sudo tedge diag collect --plugin-dir ./tedge-diag-plugins/plugins
Executing ./tedge-diag-plugins/plugins/02_os.sh... ✓
Executing ./tedge-diag-plugins/plugins/04_workflow.sh... ✓
Executing ./tedge-diag-plugins/plugins/07_mosquitto.sh... ✓
Executing ./tedge-diag-plugins/plugins/05_entities.sh... ✓
Executing ./tedge-diag-plugins/plugins/03_mqtt.sh... ✓
Executing ./tedge-diag-plugins/plugins/00_template.sh... ✓
Executing ./tedge-diag-plugins/plugins/01_tedge.sh... ✓
Executing ./tedge-diag-plugins/plugins/06_internal.sh... ✓
Total 8 executed: 8 completed, 0 failed, 0 skipped
Diagnostic information saved to /tmp/tedge-diag-2025-05-13T15:49:03.tar.gz

Run 2

$ sudo tedge diag collect --plugin-dir ./tedge-diag-plugins/plugins
Executing ./tedge-diag-plugins/plugins/00_template.sh... ✓
Executing ./tedge-diag-plugins/plugins/04_workflow.sh... ✓
Executing ./tedge-diag-plugins/plugins/03_mqtt.sh... ✓
Executing ./tedge-diag-plugins/plugins/07_mosquitto.sh... ✓
Executing ./tedge-diag-plugins/plugins/06_internal.sh... ✓
Executing ./tedge-diag-plugins/plugins/02_os.sh... ✓
Executing ./tedge-diag-plugins/plugins/01_tedge.sh... ✓
Executing ./tedge-diag-plugins/plugins/05_entities.sh... ✓
Total 8 executed: 8 completed, 0 failed, 0 skipped
Diagnostic information saved to /tmp/tedge-diag-2025-05-13T15:49:56.tar.gz

Expected

  • The plugins should be sorted alphabetically and then executed in order (serially)

@reubenmiller
Copy link
Copy Markdown
Contributor

@rina23q Overall the feature works nicely. It was able to guess how to use it by simply using the tab completion for the new command, and looking at the existing plugins (a really good sign that it has a solid existing UX).

Below are just some minor improvements/suggestions to improve the overall UX (mostly polish).

  1. Do we need to split the plugin's standard output and standard error into two different files? (e.g. out.log and err.log). Since the plugin can write the output to file, then the plugins might already write to an additional file. Having a combined out.log and err.log will just make it easier for the person reading these files to know where to look

  2. delete empty out.log and err.log files. Removing them helps the user to see if there is any useful information in them or not (if they can't see the file size)

  3. If plugin (which matches the given pattern) is not executable, then it would be useful to print a warning as this is most likely a mistake.

    WARN: Skipping plugin as it is not executable. file: ./tedge-diag-plugins/plugins/04_example.sh
  4. The temp directory should be cleaned up after the tarball is created. It might be useful to add a flag to control whether or not to keep the directory (helpful for debugging), or just to create the directory and not the tarball?

  5. folder and tarball are named differently, e.g. the folder has a timestamp with milliseconds, and the tarball only has second resolution. It is expected that both the folder and the tarball use the same naming convention (whole second resolution is enough in this case)

    $ ls -l /tmp/                        
    total 256
    drwxr-xr-x 10 root root   200 May 13 15:48 tedge-diag-2025-05-13T15:48:02.914068322Z
    -rw-r--r--  1 root root 24910 May 13 15:48 tedge-diag-2025-05-13T15:48:02.tar.gz
  6. Writing a list of which plugins that was executed and storing it in the output folder would be useful for someone analyzing the diagnostic logs on what was executed and what was skipped (as there might also be a bug in the plugin itself)

    For example something like below:

    file: /tmp/tedge-diag-2025-05-13T16:06:27/collection.log

    Executing ./tedge-diag-plugins/plugins/03_mqtt.sh... ✓
    Executing ./tedge-diag-plugins/plugins/02_os.sh... ✓
    Executing ./tedge-diag-plugins/plugins/02_reuben.sh... ✓
    ERROR: ./tedge-diag-plugins/plugins/02_reuben.sh failed with exit status: exit status: 1
    Executing ./tedge-diag-plugins/plugins/01_tedge.sh... ✓
    Executing ./tedge-diag-plugins/plugins/05_entities.sh... ✓
    Executing ./tedge-diag-plugins/plugins/06_internal.sh... ✓
    Executing ./tedge-diag-plugins/plugins/07_mosquitto.sh... ✓
    Executing ./tedge-diag-plugins/plugins/04_workflow.sh... ✓
    Executing ./tedge-diag-plugins/plugins/00_template.sh... ✓
    Total 9 executed: 8 completed, 1 failed, 0 skipped
  7. What is the default colour configured for a log message at the INFO level. At the moment it is looks like yellow which would indicate a warning rather than an informational message. But this could be my terminal colour settings and I don't have a warning message to compare.

    INFO: ./tedge-diag-plugins/plugins/02_reuben.sh is marked skipped
    

@rina23q rina23q had a problem deploying to Test Pull Request May 16, 2025 12:01 — with GitHub Actions Failure
@rina23q rina23q had a problem deploying to Test Pull Request May 16, 2025 18:14 — with GitHub Actions Failure
@rina23q rina23q force-pushed the feature/3412/tedge-diag-collect branch from a94113d to b8ddd6f Compare May 19, 2025 15:25
@rina23q rina23q had a problem deploying to Test Pull Request May 19, 2025 15:25 — with GitHub Actions Failure
@rina23q rina23q had a problem deploying to Test Pull Request May 19, 2025 20:13 — with GitHub Actions Failure
@rina23q rina23q had a problem deploying to Test Pull Request May 19, 2025 20:27 — with GitHub Actions Failure
@rina23q rina23q force-pushed the feature/3412/tedge-diag-collect branch from 5ba1b27 to e77cf27 Compare May 20, 2025 15:01
@rina23q rina23q had a problem deploying to Test Pull Request May 20, 2025 15:01 — with GitHub Actions Failure
@rina23q rina23q had a problem deploying to Test Pull Request May 20, 2025 16:15 — with GitHub Actions Failure
@rina23q
Copy link
Copy Markdown
Member Author

rina23q commented May 20, 2025

@reubenmiller
Thanks for a lot of feedback. Let me reply to your suggestions :)

The plugins should be sorted alphabetically and then executed in order (serially)

Done! Switched from HashSet to BTreeSet.

  1. Do we need to split the plugin's standard output and standard error into two different files?

No, and since now the code uses the same plugin APIs as software management and others, the output log file keeps both stdout and stderr together. I updated the specification markdown file accordingly.

  1. delete empty out.log and err.log files. Removing them helps the user to see if there is any useful information in them or not (if they can't see the file size)

Related to the point 1, since the plugin API outputs some more information other than stdout and stderr, empty file should not be created now.

  1. If plugin (which matches the given pattern) is not executable, then it would be useful to print a warning as this is most likely a mistake.

Changed to output a warning message as below. I'm still now searching for a better crate to address error/warn/info messages in a nice way, the output color and format may be changed in a later point.

WARN: Skipping plugin as it is not executable. file: "/home/rina/Work/thin-edge/tedge-diag-plugins/plugins/100_dummy.sh"
  1. The temp directory should be cleaned up after the tarball is created. It might be useful to add a flag to control whether or not to keep the directory (helpful for debugging), or just to create the directory and not the tarball?

I added a new option --clean-tmp-output. By default true. If false, the temporary directory won't be removed.

  1. folder and tarball are named differently, e.g. the folder has a timestamp with milliseconds, and the tarball only has second resolution. It is expected that both the folder and the tarball use the same naming convention (whole second resolution is enough in this case)

Previously, I used RFC3339 format (e.g. 1985-04-12T23:20:50.52Z), but this caused several problems.

  • . confuses the program which indicates an extension
  • : confuses tar command to decompress, as : is considered hostname
    So, I decided to use this custom format. [year]-[month]-[day]_[hour]-[minute]-[second]
drwxrwxr-x 11 rina rina     4096 May 20 17:08 tedge-diag-2025-05-20_15-08-42
-rw-rw-r--  1 rina rina   218946 May 20 17:08 tedge-diag-2025-05-20_15-08-42.tar.gz
  1. Writing a list of which plugins that was executed and storing it in the output folder would be useful for someone analyzing the diagnostic logs on what was executed and what was skipped (as there might also be a bug in the plugin itself)

I haven't done it yet. I'll put it in todo.

  1. What is the default colour configured for a log message at the INFO level. At the moment it is looks like yellow which would indicate a warning rather than an informational message. But this could be my terminal colour settings and I don't have a warning message to compare.

I think green or blue. Yellow was definitely wrong. Related to the point 3, it's still in todo to improve output messages.

@rina23q rina23q had a problem deploying to Test Pull Request May 21, 2025 12:22 — with GitHub Actions Failure
@rina23q rina23q had a problem deploying to Test Pull Request May 21, 2025 13:56 — with GitHub Actions Failure
@rina23q rina23q had a problem deploying to Test Pull Request May 21, 2025 14:26 — with GitHub Actions Failure
@rina23q rina23q temporarily deployed to Test Pull Request May 21, 2025 14:31 — with GitHub Actions Inactive
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented May 21, 2025

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
640 0 3 640 100 1h53m54.326138s

@rina23q rina23q marked this pull request as ready for review May 21, 2025 15:09
@rina23q
Copy link
Copy Markdown
Member Author

rina23q commented May 27, 2025

To track the progress of #3608 (review)

  • 1. Won't do here since it's the bug in the main branch
  • 2. 6df225c
  • 3. f8fdd36
  • 4. da6ecfa
  • 5. ae6ffb8
  • 6. The default value for both timeouts are already set to 60s and these values are displayed by --help.
      --timeout <TIMEOUT>
          Timeout for a graceful plugin shutdown
          
          [default: 60s]

      --forceful-timeout <FORCEFUL_TIMEOUT>
          Timeout for forced termination, starting after a graceful timeout expires
          
          [default: 60s]
  • 7. As well as 6., I think this one was fixed already. Just in case, I added a check no stdout was recorded in the Robot test. 4c466b2

Copy link
Copy Markdown
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Overall nice addition and works nicely.

The Security aspects and writing to stdout can be changed in follow up PRs

@reubenmiller reubenmiller changed the title feat: implement tedge diag collect feat: new command to collect diagnostic information tedge diag collect May 27, 2025
@reubenmiller reubenmiller added the theme:cli Theme: cli related topics label May 27, 2025
@reubenmiller reubenmiller changed the title feat: new command to collect diagnostic information tedge diag collect feat: new tedge command to collect diagnostic information tedge diag collect May 27, 2025
rina23q added 4 commits May 27, 2025 15:11
Signed-off-by: Rina Fujino <rina.fujino.23@gmail.com>
Signed-off-by: Rina Fujino <rina.fujino.23@gmail.com>
Signed-off-by: Rina Fujino <rina.fujino.23@gmail.com>
Signed-off-by: Rina Fujino <rina.fujino.23@gmail.com>
@rina23q rina23q force-pushed the feature/3412/tedge-diag-collect branch from 36fb79c to c98ab9c Compare May 27, 2025 15:12
@rina23q rina23q temporarily deployed to Test Pull Request May 27, 2025 15:12 — with GitHub Actions Inactive
@rina23q rina23q added this pull request to the merge queue May 27, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 27, 2025
@rina23q rina23q added this pull request to the merge queue May 27, 2025
Merged via the queue into thin-edge:main with commit 6fd20de May 27, 2025
34 checks passed
@rina23q rina23q deleted the feature/3412/tedge-diag-collect branch May 27, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme:cli Theme: cli related topics theme:troubleshooting Theme: Troubleshooting and remote control

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants