[PSU & system health] Support PSU power threshold checking#1060
[PSU & system health] Support PSU power threshold checking#1060prgeor merged 23 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
|
@prgeor kinldy reminder to review and provide your feedback |
|
Conflicts resolved. |
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
|
@stephenxs please update the https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md with the new command line. |
|
@shyam77git can you review from chassis point of view? |
|
@prgeor the PR is approved and build passed, are you ok to merge? Thanks. |
prgeor
left a comment
There was a problem hiding this comment.
@stephenxs as discussed in the community can we have an agreement that the "warning threshold" is confusing because no warning is raised at "warning" threshold? How about renaming it to "warning-suppress" level?
|
|
||
| PSU number is stored in chassis table. Please refer to this [document](https://github.com/sonic-net/SONiC/blob/master/doc/pmon/pmon-enhancement-design.md), section 1.5.2. | ||
|
|
||
| PSU information is stored in PSU table: |
There was a problem hiding this comment.
is this table for all PSUs in the system?
| power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold | ||
| power_warning_threshold = 1*4.3DIGIT ; The power warning threshold | ||
| power_critical_threshold = 1*4.3DIGIT ; The power critical threshold |
There was a problem hiding this comment.
are these thresholds the same for all PSUs?
There was a problem hiding this comment.
For us, yes. But it's possible for other vendors to have different thresholds among all PSUs.
doc/psud/PSU_daemon_design.md
Outdated
| - `Not OK` which can be caused by: | ||
| - power is not good, which means the PSU is present but no power (Eg. the power is down or power cable is unplugged) | ||
| - `WARNING` which can be caused by: | ||
| - power exceeds the PSU's power threshold |
There was a problem hiding this comment.
which threshold? critical or warning?
There was a problem hiding this comment.
The critical threshold. Will fix it.
doc/psud/PSU_daemon_design.md
Outdated
| raise NotImplementedError | ||
| ... | ||
|
|
||
| def get_psu_power_warning_threshold(self) |
There was a problem hiding this comment.
can we have a set API for user to override?
There was a problem hiding this comment.
I do not suggest having an API to override the thresholds. As discussed before, it should be controlled by the platform vendor only because users do not have full knowledge to decide what the threshold should be. For example, in our system, the thresholds depend on temperatures as well.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
1ac1eff
Signed-off-by: Stephen Sun <stephens@nvidia.com>
| voltage_max_th = INT ; max threshold of the output voltage | ||
| voltage_min_th = INT ; min threshold of the output voltage | ||
| power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold | ||
| power_warning_threshold = 1*4.3DIGIT ; The power warning threshold |
There was a problem hiding this comment.
naming not changed to supress
| orchagent is not running | ||
| Hardware Fault | ||
| PSU 1 temp 85C and threshold is 70C | ||
| PSU 1 power (66.32w) exceeds thresholds (warning: 60.00w critical: 70.00w) |
There was a problem hiding this comment.
'thresholds' which threshold?
There was a problem hiding this comment.
it's hard to say which threshold is crossed because most times it exceeds the critical threshold but sometimes it can be in (warning_suppress_threshold, critical_threshold) because of hysteresis.
I think we can just put the value of warning_suppress_threshold here.
| power_warning_suppress_threshold = 1*4.3DIGIT ; The power warning threshold | ||
| power_critical_threshold = 1*4.3DIGIT ; The power critical threshold | ||
|
|
||
| Now psud only collect and update "presence" and "status" field. |
There was a problem hiding this comment.
we need one field in the DB which indicates the critical threshold warning is ACTIVE. this is needed for telemetry
There was a problem hiding this comment.
power_overload is the field to indicate whether it’s in the warning state
prgeor
left a comment
There was a problem hiding this comment.
@stephenxs its not clear how is the user notified? if syslog how many times syslog is raised?
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Fix comments
syslog. it is printed only once when the alarm is raised and cleared. |
Signed-off-by: Stephen Sun <stephens@nvidia.com>
doc/psud/PSU_daemon_design.md
Outdated
| 4. PSU's power was in range (warning threshold, critical threshold) and is greater than the critical threshold | ||
| 1. if warning was raised, no action expected | ||
| 2. if warning was not raised, a warning should be raised | ||
| 5. PSU's power was less than the warning threshold and is greater than the critical threshold: a warning should be raised |
There was a problem hiding this comment.
It's to test the case that the PSU power jumps from a value below the warning suppress threshold to a value above critical threshold.
| voltage = INT ; output voltage of the PSU | ||
| voltage_max_th = INT ; max threshold of the output voltage | ||
| voltage_min_th = INT ; min threshold of the output voltage | ||
| power_overload = "true" / "false" ; whether the PSU's power exceeds the threshold |
There was a problem hiding this comment.
let me check if this should be INT for telemetry
Signed-off-by: Stephen Sun <stephens@nvidia.com>
doc/psud/PSU_daemon_design.md
Outdated
| We use asymmetric thresholds between raising and clearing the alarm for the purpose of creating a hysteresis and avoiding alarm flapping. | ||
|
|
||
| - an alarm will be raised when a PSU's power is rising accross the critical threshold | ||
| - an alarm will be cleared when a PSU's power is dropping across the warning threshold |
There was a problem hiding this comment.
warning threshold -> 'warning suppress' threshold
doc/psud/PSU_daemon_design.md
Outdated
|
|
||
| 1. Retrieve the current power | ||
| 2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold | ||
| - If `current power` < `warning threshold` |
doc/psud/PSU_daemon_design.md
Outdated
| 2. If flag `PSU power exceeded threshold` is `true`, compare the current power against the warning threshold | ||
| - If `current power` < `warning threshold` | ||
| - Set `PSU power exceeded threshold` to `false` | ||
| - Message in NOTICE level should be logged: `PSU <x>: current power <power> is below the warning threshold <threshold>` where |
doc/psud/PSU_daemon_design.md
Outdated
| `psuutil` fetches the information via calling platform API directly. Both warning and critical thresholds will be exposed in the output of psuutil status. | ||
| The "WARNING" state is not exposed because psuutil is a one-time command instead of a daemon, which means it does not store state information. It fetches information via calling platform API so it can not distinguish the following status: | ||
|
|
||
| 1. The power exceeded the critical threshold but is in the range between the warning and critical thresholds, which means the alarm should be raised | ||
| 2. The power didn't exceed the critical threshold and exceeds the warning threshold, which means the alarm should not be raised | ||
|
|
There was a problem hiding this comment.
warning -> 'warning-suppress'
Signed-off-by: Stephen Sun <stephens@nvidia.com>
PRs
Support PSU power threshold checking
Noneis returned orNotImplementedis thrown by either API, the PSU power threshold checking will not be performed for the PSUSigned-off-by: Stephen Sun stephens@nvidia.com