HLD for diagnostic monitoring of CMIS based transceivers #1828
prgeor merged 20 commits into sonic-net:master
Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@qinchuanares - It will be great if you can help in reviewing this PR.
@Junchao-Mellanox @keboliu - It will be great if you can help in reviewing this PR.
### 4.2 Dynamic Diagnostic Information

The `DomInfoUpdateTask` thread is responsible for updating the dynamic diagnostic information for all the transceivers in the system. The `DomInfoUpdateTask` thread is triggered by a timer (`DOM_INFO_UPDATE_PERIOD_SECS`), which is set to 60 seconds by default. The `DomInfoUpdateTask` thread reads the diagnostic information from the transceiver and updates the relevant tables in `redis-db`.
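A minimal sketch of the timer-driven loop described above, assuming a plain Python thread. `PeriodicUpdateTask`, `update_fn`, and `period_secs` are hypothetical names used for illustration only; they are not the actual `DomInfoUpdateTask` implementation in xcvrd.

```python
import threading
import time

DOM_INFO_UPDATE_PERIOD_SECS = 60  # default period named in the HLD


class PeriodicUpdateTask:
    """Hypothetical stand-in for DomInfoUpdateTask: calls update_fn once per period."""

    def __init__(self, update_fn, period_secs=DOM_INFO_UPDATE_PERIOD_SECS):
        self._update_fn = update_fn
        self._period_secs = period_secs
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns False on timeout (time for another pass)
        # and True once stop() has been called, which ends the loop.
        while not self._stop_event.wait(self._period_secs):
            self._update_fn()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
```

Waiting on an event rather than sleeping lets the thread stop promptly instead of finishing a full 60-second sleep first.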
Curious if there is any plan for making this alarm logging interrupt based?
@prgeor Can you please help in clarifying this?
prgeor left a comment:
@mihirpat1 I don't see how the timestamp and count for each alarm will be updated so that the recorded times are as close as possible to the time the module actually reported the alarm.
3. Read the transceiver firmware information from the module and update the `TRANSCEIVER_FIRMWARE_INFO` table.
4. Read the transceiver DOM sensor data from the module and update the `TRANSCEIVER_DOM_SENSOR` table.
5. Read the transceiver DOM flag data from the module, record the timestamp, and update the `TRANSCEIVER_DOM_FLAG` table.
6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables.
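The comparison in step 6 could look roughly like the sketch below. It is illustrative only: plain dictionaries stand in for the `redis-db` tables, the flag name `temp_high_alarm` is a hypothetical example, and treating the first sample as a baseline (no change counted) is an assumption, not something the HLD states.

```python
import time


def analyze_flags(prev_flags, curr_flags, change_count, time_set, time_clear, now=None):
    """Compare two flag snapshots and update the bookkeeping dicts in place."""
    now = time.time() if now is None else now
    for name, curr in curr_flags.items():
        prev = prev_flags.get(name)
        if prev is None or prev == curr:
            continue  # first sample (assumed baseline) or no transition
        change_count[name] = change_count.get(name, 0) + 1
        if curr:
            time_set[name] = now    # flag went from clear to set
        else:
            time_clear[name] = now  # flag went from set to clear
```

Because the comparison runs once per polling period, the recorded set and clear times reflect when the change was observed, not necessarily when the module raised the flag.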
Should we also record the flag set/clear event to syslog?
@Junchao-Mellanox Sure, I plan to record the set/clear event to syslog at debug level. Logging it at notice level would make the syslog chatty.
7. Read the transceiver status data from the module and update the `TRANSCEIVER_STATUS` table.
8. If the transceiver supports VDM monitoring, perform the following steps:
   1. Freeze the statistics by calling the CMIS API and wait for `FreezeDone`. Once the statistics are frozen, record the timestamp and copy the VDM and PM statistics from the transceiver.
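The freeze-then-read sequence in step 8.1 can be sketched as follows. Everything here is hypothetical: `FakeModule` and its methods (`freeze_stats`, `is_freeze_done`, `read_vdm_and_pm`, `unfreeze_stats`) are illustrative stand-ins, not the CMIS API names used by SONiC. The timeout, the polling interval, and the unfreeze in `finally` are assumptions, and the returned BER value is a placeholder.

```python
import time


class FakeModule:
    """Hypothetical transceiver stand-in; real xcvrd API names may differ."""

    def __init__(self, polls_until_done=2):
        self._polls_left = polls_until_done
        self.frozen = False

    def freeze_stats(self):
        self.frozen = True

    def is_freeze_done(self):
        self._polls_left -= 1
        return self._polls_left <= 0

    def read_vdm_and_pm(self):
        return {"prefec_ber_avg": 1.2e-5}  # placeholder value

    def unfreeze_stats(self):
        self.frozen = False


def snapshot_vdm(module, timeout_secs=1.0, poll_secs=0.01):
    """Freeze, wait for FreezeDone, timestamp, copy statistics, then unfreeze."""
    module.freeze_stats()
    try:
        deadline = time.monotonic() + timeout_secs
        while not module.is_freeze_done():
            if time.monotonic() >= deadline:
                raise TimeoutError("FreezeDone not reported in time")
            time.sleep(poll_secs)
        timestamp = time.time()  # recorded only after the freeze completes
        return timestamp, module.read_vdm_and_pm()
    finally:
        module.unfreeze_stats()  # assumption: always release the freeze
```

Taking the timestamp after `FreezeDone` ties it to the frozen snapshot rather than to the moment the freeze was requested.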
Could you please specify which API will be used to "freeze" the statistics? I guess it will write something to the module EEPROM, and it is only applicable when the module is under software control.
@Junchao-Mellanox I have now updated the API name. Can you please elaborate on what you mean by "it is only applicable when module is under software control"?
----------- --------------- -------- -------- -------- --------

Example:
admin@sonic#show interfaces transceiver vdm flag Ethernet1
Is it possible to add a filter to get abnormal status only?
@prgeor Do we want to add a filter as part of the current enhancement?
@Junchao-Mellanox We now display the output after applying the filter. A --detail option has been added to let the user dump data for all lanes irrespective of whether a flag is set.
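As a rough illustration of the filtering behaviour described in this thread, here is a sketch. The row layout (`lane`, `flags`) and the function name `filter_flag_rows` are assumptions made for the example; the actual CLI code may structure its data differently.

```python
def filter_flag_rows(rows, detail=False):
    """Return all rows when detail is True, otherwise only rows with a set flag."""
    if detail:
        return list(rows)
    return [row for row in rows if any(row["flags"].values())]


# Hypothetical per-lane rows for one observable
rows = [
    {"lane": 1, "flags": {"high_alarm": False, "low_alarm": False}},
    {"lane": 2, "flags": {"high_alarm": True, "low_alarm": False}},
]
abnormal_only = filter_flag_rows(rows)            # default view: lane 2 only
all_lanes = filter_flag_rows(rows, detail=True)   # --detail view: both lanes
```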
**Tables Used for Flag Analysis:**

- `TRANSCEIVER_VDM_FLAG`: This table stores flags indicating the status of various VDM parameters.
- `TRANSCEIVER_VDM_FLAG_CHANGE_COUNT`: This table keeps a count of how many times each VDM flag has changed.
If xcvrd is stopped, will the change count be reset to 0? The same question applies to time set and time clear. Also, should we record the change count to syslog?
@Junchao-Mellanox Yes - the change count and the set/clear time in the redis-db tables will be removed when xcvrd is stopped. We can log the change count to syslog before deleting the table.
@prgeor I have now added
; Defines Transceiver PM information for a port
key = TRANSCEIVER_PM|ifname ; information of PM on port
; field = value
prefec_ber_avg = FLOAT ; prefec ber avg
We could expand this to media and host pre-FEC BER and uncorrectable frames, even if this is C-CMIS only.
Media BER and UC frames info comes from page 34h; host BER and UC frames come from page 3Ah.
/azp run
No pipelines are associated with this pull request.
@mihirpat1 Capture the FreezeDone and UnfreezeDone time value defined by SONiC
Mark this feature as done for the 202505 release since the HLD and all code PRs are merged.
@zhangyanzhao There is still pending development work for this HLD, so we will keep the feature open for now.
fixes #1885
This HLD will provide an overview of how SONiC reads and stores the various diagnostic parameters read from a CMIS based transceiver.
Proposed changes for existing tables in SONiC
TRANSCEIVER_STATUS to TRANSCEIVER_DOM_FLAG table.