Make polling intervals in the ThermalMonitor class configurable #635

Merged
judyjoseph merged 4 commits into sonic-net:master from michaelc-nexthop:make_thermal_monitor_polling_interval_configurable
Oct 30, 2025

@michaelc-nexthop
Contributor

The default polling interval of 60s is quite high and can feel unresponsive. Vendors should be able to override this polling interval.

Description

Motivation and Context

How Has This Been Tested?

Additional Information (Optional)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gregoryboudreau
Contributor

Would it make more sense to do this via a config file, as was done for sensormond? https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-sensormond/scripts/sensormond#L485

@lotus-nexthop

Hi Gregory,

We currently have the config set by the existing pmon_daemon_control.json config file:
https://github.com/search?q=repo%3Asonic-net%2Fsonic-buildimage+path%3Apmon_daemon_control.json&type=code

Example usage of this new config:
https://github.com/sonic-net/sonic-buildimage/blob/30d2b7881971ad38fdc0b242f57c9efdb6f93693/device/nexthop/common/pmon_daemon_control.json#L5-L8

PR which adds new config flags to thermalctld:
#635

PR which sets new config flags using pmon_daemon_control.json:
https://github.com/sonic-net/sonic-buildimage/pull/23139/files

When we were working on this initially we thought pmon_daemon_control.json would be the most fitting place to put the configuration.

We hadn't seen your change to use platform_env.conf yet.

What are your thoughts on platform_env.conf vs pmon_daemon_control.json?

@gregoryboudreau
Contributor

I had originally tried the same approach, using the pmon daemon control file as a further input to the supervisord configuration: 9e91789#diff-195551a342dbd08efef0aa4f3e21a4f4e24e2ee87cadcc27070299e9fc4a0390R560. The general community sentiment, though, was to use platform_env.conf for platform-specific configurability (at least when this was discussed on the chassis call where we originally reviewed it).

@gregoryboudreau
Contributor

Chassisd also uses it for a timing-related configuration: https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-chassisd/scripts/chassisd#L312. It is probably best to keep it consistent across the different daemons. IMO you should join next week's chassis call and bring it up for discussion if you feel strongly about the pmon_daemon_control method.

@lotus-nexthop

Thanks Gregory, I will join next week and get more ideas before refactoring this.

thermal_control = ThermalControlDaemon()
parser = argparse.ArgumentParser()
parser.add_argument('--thermal-monitor-initial-interval', type=int, default=5)
parser.add_argument('--thermal-monitor-update-interval', type=int, default=60)
Contributor

I see @gregoryboudreau's comments earlier in this PR, and I have a similar comment: it is not possible to add platform-specific timer intervals when we start thermalctld in supervisord.conf: https://github.com/sonic-net/sonic-buildimage/blob/master/dockers/docker-platform-monitor/docker-pmon.supervisord.conf.j2.

We can define it in the platform.json file for that particular platform.

Contributor

I am of the opinion that we should keep this unified across daemons within PMON and put it in platform_env.conf, given that sensormond and chassisd already use it for timing-related configurations that differ from the default.

Thanks Judy and Gregory, will update these PRs soon to go with that feedback.

Contributor

We have updated the PR to get timing-related configurations from platform_env.conf, so that it is consistent with sensormond and chassisd. Unit tests were added as well.
Please take another look.
Thank you both Judy and Gregory!

Contributor

@judyjoseph judyjoseph Sep 30, 2025

Thanks! Sorry if this comment comes late, @lotus-nexthop.

platform_env.conf was originally intended for platform-wide configuration, e.g. https://github.com/sonic-net/sonic-buildimage/blob/master/device/arista/x86_64-arista_7060x6_16pe_384c_b/platform_env.conf, https://github.com/sonic-net/sonic-buildimage/blob/master/device/arista/x86_64-arista_7280cr3_32p4/platform_env.conf

I would prefer to put the timer configuration details in "pmon_daemon_control.json", where we can keep configs pertaining to any daemon.

Contributor

@louis-nexthop louis-nexthop Oct 1, 2025

I also +1 to making this change in pmon_daemon_control.json. We can do incremental improvement starting with sensormond.

I can revert this PR back to the original. Please note that it will need to be paired with sonic-net/sonic-buildimage#23139 as well for the full functionality.

Contributor

This PR is now reverted back to the original.

Please help review both this PR and sonic-net/sonic-buildimage#23139. Thank you both!

Contributor

@spilkey-cisco spilkey-cisco Oct 13, 2025

Another possibility is to have thermal-related daemons/threads use the same interval that thermal_manager specifies for the fan/thermal algorithm (or some scaled, smaller interval to ensure operational fluctuations do not delay temperature updates to the algorithm): https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-thermalctld/scripts/thermalctld#L837
Ideally this would also apply to xcvrd for DOM polling, syncd for ASIC temperature polling, and perhaps others I'm unaware of.

@lotus-nexthop lotus-nexthop Oct 28, 2025

Thanks for your feedback @spilkey-cisco

If we want to create a new relationship between ThermalControlDaemon and ThermalMonitor I'd prefer to do so in a future orthogonal PR.

The PR as it is now does not make changes to the existing design of thermalctld, but just provides a way to provide vendor specific configuration to the ThermalMonitor's constants.

@judyjoseph if the current contents of this PR look good, could we merge it? I am happy to make changes in follow-up PRs, or in this one if something is wrong.

This change would provide significant improvements on the thermal management on Nexthop's NH-4010, NH-4020, and NH-5010 SKUs.

EDIT: Sorry, to clarify: this change improves the responsiveness of the show commands for operators, but it doesn't relate to the actual thermal management, where fan speeds are adjusted based on temperature changes. That happens in ThermalControlDaemon, which is orthogonal to this change; this change affects only ThermalMonitor.

Contributor

Thanks, looks like a reasonable first step for thermal configurations!

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

k, v = line.split('=', 1)
k = k.strip()
v = v.strip()
if k == 'THERMALCTLD_THERMAL_MONITOR_INITIAL_INTERVAL':
Contributor

Can these be wrapped in a try/except? Not sure whether you want these to fall back to defaults or to take the daemon down on an incorrect config file.

Contributor

Thanks for raising this! I think it's better to fall back to the default values and let the daemon continue.

I added try/except and updated the unit tests as well. Thanks, Gregory!
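
For illustration, the fall-back-to-defaults behaviour agreed on in this thread could look roughly like the sketch below. It is not the code from this PR: the helper name `load_intervals`, the `DEFAULTS` table, the second key name `THERMALCTLD_THERMAL_MONITOR_UPDATE_INTERVAL`, and the rule that values must be positive integers are all assumptions made for the example. Only the `split('=', 1)` parsing and the `THERMALCTLD_THERMAL_MONITOR_INITIAL_INTERVAL` key appear in the diff quoted above.

```python
import logging

# Defaults mirror the values mentioned in this PR (5s initial, 60s update).
# The second key name is hypothetical; only the first appears in the diff.
DEFAULTS = {
    "THERMALCTLD_THERMAL_MONITOR_INITIAL_INTERVAL": 5,
    "THERMALCTLD_THERMAL_MONITOR_UPDATE_INTERVAL": 60,
}


def load_intervals(lines, defaults=DEFAULTS):
    """Parse KEY=VALUE lines; keep the default for any value that is invalid."""
    values = dict(defaults)
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        k, v = line.split("=", 1)
        k = k.strip()
        v = v.strip()
        if k not in values:
            continue  # Not a key this sketch knows about.
        try:
            parsed = int(v)
            if parsed <= 0:
                # Assumption: non-positive intervals are treated as invalid.
                raise ValueError("interval must be positive")
            values[k] = parsed
        except ValueError:
            # Fall back to the default and let the daemon continue.
            values[k] = defaults[k]
            logging.warning("Invalid value %r for %s; using default %s",
                            v, k, defaults[k])
    return values
```

With this shape, a malformed line costs only a warning in the log rather than taking the daemon down, which matches the preference stated in the reply above.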

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@judyjoseph judyjoseph left a comment

LGTM

@judyjoseph
Contributor

To be merged once this PR is merged: sonic-net/sonic-buildimage#23139

Contributor

@gregoryboudreau gregoryboudreau left a comment

It might be good to add a test case that mocks the args path to ensure it's working as expected; up to you.

@lotus-nexthop

lotus-nexthop commented Oct 29, 2025

Thanks @gregoryboudreau, will skip the test case for now.

We should merge this soon, as sonic-net/sonic-buildimage#23139 has been merged.

@lotus-nexthop

Hi @judyjoseph please merge this when you get a chance, thanks!

@louis-nexthop
Contributor

Hi @judyjoseph, thank you for your help merging this PR!

It broke the tests that were added recently (in #652), so I opened #700 for the fix.

Please take a look at #700 when you get a chance. Thank you!

@vvolam
Contributor

vvolam commented Oct 31, 2025

@judyjoseph any idea why this PR was merged successfully even with the test failures that are being fixed as part of #700?

@lotus-nexthop

I believe the tests on this PR ran and passed before the conflicting PR merged.

@vvolam
Contributor

vvolam commented Oct 31, 2025

Thank you for the clarification. Did the tests run after the latest merge and before this PR was merged? Just wondering whether we need any extra gatekeepers for PRs.

@lotus-nexthop

Hi @vvolam

I believe there's no requirement for the GitHub Actions checks to re-run when the target branch has changed since they last ran.

There is also no requirement for the target branch to be unchanged since GitHub Actions last ran.

One solution for this is to use GitHub merge queues:
https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/configuring-pull-request-merges/managing-a-merge-queue

@mssonicbld
Collaborator

Cherry-pick PR to 202505: #780

mssonicbld added a commit to mssonicbld/sonic-buildimage that referenced this pull request Mar 26, 2026

Platforms can now configure thermal monitor intervals in their pmon_daemon_control.json:
```json
{
    "thermalctld": {
        "thermal_monitor_initial_interval": 5,
        "thermal_monitor_update_interval": 30,
        "thermal_monitor_update_elapsed_threshold": 25
    }
}
```

Note this only affects the `ThermalMonitor` thread in the `thermalctld` daemon.
`ThermalMonitor`'s role is to poll fan and temperature sensors from hardware and publish information to redis.
These Redis values are used, for example, by `show platform temperature` and `show platform fan`.

Parameter Details

`thermal_monitor_initial_interval`
- Purpose: The initial time to wait before the first poll by `ThermalMonitor` on `thermalctld` startup.
- Default: 5 seconds

`thermal_monitor_update_interval`
- Purpose: The hardware is polled every `thermal_monitor_update_interval` seconds.
- Default: 60 seconds

`thermal_monitor_update_elapsed_threshold`
- Purpose: If it takes longer than `thermal_monitor_update_elapsed_threshold` seconds to poll the hardware (collecting information from all fans and temperature sensors), a warning is logged.
- Default: 30 seconds
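
As a rough illustration of how the three parameters interact, here is a minimal sketch of a polling loop. It is not the actual `ThermalMonitor` implementation: the class name `ThermalMonitorSketch`, the injected `poll`, `sleep`, `clock`, and `warn` callables, and the choice to sleep only for the remainder of the interval after a poll are assumptions made so the example is self-contained and testable.

```python
import time


class ThermalMonitorSketch:
    """Hypothetical stand-in showing how the three settings could be used."""

    def __init__(self, poll, initial_interval=5, update_interval=60,
                 elapsed_threshold=30, sleep=time.sleep,
                 clock=time.monotonic, warn=print):
        self.poll = poll                          # reads fans/sensors, publishes results
        self.initial_interval = initial_interval  # wait before the first poll
        self.update_interval = update_interval    # period between polls
        self.elapsed_threshold = elapsed_threshold  # slow-poll warning limit
        self.sleep = sleep
        self.clock = clock
        self.warn = warn

    def run(self, iterations):
        # Wait once at startup before the first poll.
        self.sleep(self.initial_interval)
        for _ in range(iterations):
            start = self.clock()
            self.poll()
            elapsed = self.clock() - start
            if elapsed > self.elapsed_threshold:
                self.warn("poll took %.1fs, threshold is %ss"
                          % (elapsed, self.elapsed_threshold))
            # Assumption: sleep only for what is left of the interval.
            self.sleep(max(0, self.update_interval - elapsed))
```

In this sketch, lowering `update_interval` shortens the delay before new readings are published, while `elapsed_threshold` only controls when a slow poll is reported.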

#### Why I did it
The default polling interval of 60s is quite high and feels unresponsive (e.g. an operator can remove a fan and wait nearly a minute for `show plat fan` to update).

#### How I did it
In sonic-net/sonic-platform-daemons#635 we made these intervals configurable.

This PR updates the jinja template to handle these new configuration options.
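
The mapping from JSON keys to daemon options can be sketched as follows. The real change lives in a Jinja template, which is not shown on this page, so this Python version is only a stand-in. The function names are hypothetical, and the flag `--thermal-monitor-update-elapsed-threshold` is an assumption inferred from the key name; only the first two flags and their defaults appear in the sonic-platform-daemons#635 diff.

```python
import argparse
import json

# JSON key -> command-line flag. The third flag name is assumed.
FLAG_FOR_KEY = {
    "thermal_monitor_initial_interval": "--thermal-monitor-initial-interval",
    "thermal_monitor_update_interval": "--thermal-monitor-update-interval",
    "thermal_monitor_update_elapsed_threshold":
        "--thermal-monitor-update-elapsed-threshold",
}


def build_thermalctld_args(control_json_text):
    """Turn the "thermalctld" section of the JSON into a list of CLI arguments."""
    config = json.loads(control_json_text).get("thermalctld", {})
    args = []
    for key, flag in FLAG_FOR_KEY.items():
        if key in config:
            args += [flag, str(config[key])]
    return args


def parse_thermalctld_args(argv):
    """Parse the flags on the daemon side; unset flags keep their defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--thermal-monitor-initial-interval', type=int, default=5)
    parser.add_argument('--thermal-monitor-update-interval', type=int, default=60)
    parser.add_argument('--thermal-monitor-update-elapsed-threshold',
                        type=int, default=30)
    return parser.parse_args(argv)
```

A platform that sets only `thermal_monitor_update_interval` would, under these assumptions, start the daemon with a single extra flag and keep the defaults for the other two values.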

It decreases the update interval from 60s -> 10s for NH-4010. I'm aiming for a balance of responsiveness without polling excessively.

Example usage of this feature:
https://github.com/nexthop-ai/private-sonic-buildimage/blob/master/device/nexthop/common/pmon_daemon_control.json

#### How to verify it
Verified on NH-4010 that `thermalctld` is being run with the expected options.

#### Which release branch to backport (provide reason below if selected)

- [ ] 202205
- [ ] 202211
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

#### Tested branch (Please provide the tested image version)

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog

#### Link to config_db schema for YANG module changes

Signed-off-by: Sonic Build Admin <sonicbld@microsoft.com>

#### A picture of a cute animal (not mandatory but encouraged)
@mssonicbld
Collaborator

Cherry-pick PR to 202505: #781

mssonicbld added a commit to sonic-net/sonic-buildimage that referenced this pull request Mar 26, 2026