
HLD for diagnostic monitoring of CMIS based transceivers #1828

Merged
prgeor merged 20 commits into sonic-net:master from mihirpat1:cmis_diagnostic_monitoring_hld on Feb 12, 2025

Conversation

@mihirpat1 (Contributor) commented Oct 11, 2024

This HLD will provide an overview of how SONiC reads and stores the various diagnostic parameters read from a CMIS based transceiver.

Proposed changes for existing tables in SONiC

  1. Removed C-CMIS specific VDM data from the DOM related tables and moved it to the VDM related tables.
  2. Not all VDM thresholds are currently shown.
  3. With the proposed changes, I am planning to change the name of existing fields to be in line with the CMIS spec.
  4. Moved txfault, txlos, txcdrlol, rxlos and rxcdrlol from TRANSCEIVER_STATUS to TRANSCEIVER_DOM_FLAG table.
Related PRs:

- Move DomInfoUpdateTask class to a separate file
- Add VDM and Status related cmis fields for onboarding xcvr diagnostic features
- Add diagnostic monitoring APIs for DOM, VDM and status access for CMIS modules
- [xcvrd] Enable periodic polling of VDM relevant data
- Create is_transceiver_vdm_supported API for CMIS transceivers
- Add pipeline check for missing init.py in sonic-xcvrd whl package
- [xcvrd] Skip VDM threshold DB update for flat memory transceivers
- [cmis] Separate Flag-Specific Fields for DOM and Status APIs
- [xcvrd] Re-organize transceiver DOM and STATUS tables
- Transceiver CLI changes to support DOM and STATUS table related changes

fixes #1885

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@mihirpat1 mihirpat1 requested a review from prgeor October 11, 2024 06:26
@mihirpat1
Contributor Author

@qinchuanares - It will be great if you can help in reviewing this PR.

@mihirpat1
Contributor Author

@Junchao-Mellanox @keboliu - It will be great if you can help in reviewing this PR.


### 4.2 Dynamic Diagnostic Information

The `DomInfoUpdateTask` thread is responsible for updating the dynamic diagnostic information for all the transceivers in the system. The `DomInfoUpdateTask` thread is triggered by a timer (`DOM_INFO_UPDATE_PERIOD_SECS`), which is set to 60 seconds by default. The `DomInfoUpdateTask` thread reads the diagnostic information from the transceiver and updates the relevant tables in `redis-db`.
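
A minimal sketch of a timer-driven update thread of the kind this paragraph describes. The class name `PeriodicTask` and the `update_fn` callback are hypothetical and are not taken from xcvrd; only the 60 second default comes from the text above.

```python
import threading

DOM_INFO_UPDATE_PERIOD_SECS = 60  # default period stated in the HLD text


class PeriodicTask(threading.Thread):
    """Hypothetical stand-in for a timer-driven update thread."""

    def __init__(self, update_fn, period_secs=DOM_INFO_UPDATE_PERIOD_SECS):
        super().__init__(daemon=True)
        self._update_fn = update_fn
        self._period_secs = period_secs
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout, so the body runs once per
        # period until stop() sets the event.
        while not self._stop_event.wait(self._period_secs):
            self._update_fn()

    def stop(self):
        self._stop_event.set()
```

Waiting on an event rather than sleeping lets the thread exit promptly when `stop()` is called, instead of finishing a full period first.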
Contributor

Curious if there is any plan for making this alarm logging interrupt based?

Contributor Author

@prgeor Can you please help in clarifying this?

Contributor

@prgeor prgeor left a comment

@mihirpat1 I don't see how the timestamp and count for each alarm will be updated so that these times are as close as possible to the actual reporting time by the module

3. Read the transceiver firmware information from the module and update the `TRANSCEIVER_FIRMWARE_INFO` table.
4. Read the transceiver DOM sensor data from the module and update the `TRANSCEIVER_DOM_SENSOR` table.
5. Read the transceiver DOM flag data from the module, record the timestamp, and update the `TRANSCEIVER_DOM_FLAG` table.
6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables.
Contributor

should we also record the flag set/clear event to syslog?

Contributor Author

@Junchao-Mellanox Sure, will plan to record the set/clear event to syslog as a debug. Making it as a notice will make the syslog chatty.
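
For illustration, the flag analysis in step 6 together with debug-level logging could look roughly like the sketch below. The function name, the dict-based stand-ins for the `*_CHANGE_COUNT`, `*_TIME_SET` and `*_TIME_CLEAR` tables, and the logger name are all hypothetical. Treating the first observation of a flag as "no transition" is also an assumption, not something the HLD states.

```python
import logging
import time

logger = logging.getLogger("xcvrd_sketch")  # hypothetical logger name


def analyze_flags(prev, curr, change_count, time_set, time_clear, now=None):
    """Update per-flag counters and timestamps from two flag snapshots.

    prev and curr map a flag name to a bool. The other three dicts stand in
    for the change-count, time-set and time-clear tables.
    """
    now = time.time() if now is None else now
    for name, value in curr.items():
        if name not in prev or prev[name] == value:
            continue  # first observation or no transition (assumption)
        change_count[name] = change_count.get(name, 0) + 1
        if value:
            time_set[name] = now
        else:
            time_clear[name] = now
        # Debug level keeps syslog quiet, as discussed in the reply above.
        logger.debug("flag %s %s", name, "set" if value else "cleared")
```

Passing `now` explicitly lets the caller reuse the timestamp recorded when the flags were read, rather than the time at which the analysis runs.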

6. Analyze the transceiver DOM flag data by comparing the current flag data with the previous flag data and update the `TRANSCEIVER_DOM_FLAG_CHANGE_COUNT`, `TRANSCEIVER_DOM_FLAG_TIME_SET`, and `TRANSCEIVER_DOM_FLAG_TIME_CLEAR` tables.
7. Read the transceiver status data from the module and update the `TRANSCEIVER_STATUS` table.
8. If the transceiver supports VDM monitoring, perform the following steps:
   1. Freeze the statistics by calling the CMIS API and wait for `FreezeDone`. Once the statistics are frozen, record the timestamp and copy the VDM and PM statistics from the transceiver.
Contributor

Could you please specify which API will be used to "freeze" the statistic? I guess it will write something to module EEPROM, and it is only applicable when module is under software control.

Contributor Author

@Junchao-Mellanox I have now updated the API name. Can you please elaborate on what you mean by "it is only applicable when module is under software control"?
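
As a rough sketch of the freeze, wait, timestamp and copy sequence in the step quoted above: the `api` object and its four method names (`freeze`, `freeze_done`, `read_stats`, `unfreeze`) are hypothetical and are not the actual CMIS API names in SONiC. The timeout value and the unconditional unfreeze at the end are also assumptions.

```python
import time


def read_frozen_stats(api, timeout_secs=1.0, poll_secs=0.01):
    """Freeze, wait for the done indication, timestamp, then copy the data.

    `api` is any object exposing freeze(), freeze_done(), read_stats() and
    unfreeze(); these method names are hypothetical.
    Returns (timestamp, stats), or None if the freeze did not complete.
    """
    api.freeze()
    try:
        deadline = time.monotonic() + timeout_secs
        while not api.freeze_done():
            if time.monotonic() >= deadline:
                return None
            time.sleep(poll_secs)
        # Record the timestamp only after the statistics are frozen.
        timestamp = time.time()
        return timestamp, dict(api.read_stats())
    finally:
        # Assumption: always release the freeze, even on timeout.
        api.unfreeze()
```

Taking the timestamp after the done indication, rather than before the freeze request, keeps it as close as possible to the moment the copied values were captured.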

----------- --------------- -------- -------- -------- --------

Example:
admin@sonic#show interfaces transceiver vdm flag Ethernet1
Contributor

is it possible to add a filter to get abnormal status only?

Contributor Author

@prgeor do we want to add filter as part of current enhancement?

Contributor Author

@Junchao-Mellanox We are now displaying the output after applying the filter. A --detail option has been added to allow user to dump data for all lanes irrespective of the flag being set.
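
The filtering behaviour described in this reply could be sketched as follows. The row shape (a dict with a `flags` mapping) and the function name are hypothetical; the real CLI code may structure this differently.

```python
def select_rows(rows, detail=False):
    """Return the rows to display.

    Each row is a dict with a 'flags' mapping of flag name -> bool (a
    hypothetical shape). Without detail, only rows with a set flag remain.
    """
    if detail:
        return list(rows)
    return [row for row in rows if any(row["flags"].values())]
```

With `detail=True` the function returns every row, which mirrors the `--detail` option dumping data for all lanes irrespective of the flag being set.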

**Tables Used for Flag Analysis:**

- `TRANSCEIVER_VDM_FLAG`: This table stores flags indicating the status of various VDM parameters.
- `TRANSCEIVER_VDM_FLAG_CHANGE_COUNT`: This table keeps a count of how many times each VDM flag has changed.
Contributor

If xcvrd stopped, will the change count be reset to 0? The same question to time set and time clear. Also, should we record the change count to syslog?

Contributor Author

@Junchao-Mellanox Yes - the change count and the set/clear time in redis-db tables will be removed when xcvrd is stopped. We can log the change count to syslog before deleting the table.
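
One way to picture the "log before deleting" behaviour mentioned in this reply is shown below. A plain dict stands in for the redis-db table, and the function name, logger name and log level are assumptions.

```python
import logging

logger = logging.getLogger("xcvrd_sketch")  # hypothetical logger name


def clear_change_counts(table):
    """Log each port's counts, then empty the table.

    `table` is a plain dict standing in for a redis-db table keyed by port.
    Returns the number of entries removed.
    """
    removed = 0
    for port in sorted(table):
        logger.info("%s change counts before removal: %s", port, table[port])
        removed += 1
    table.clear()
    return removed
```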

@mihirpat1
Contributor Author

> @mihirpat1 I don't see how the timestamp and count for each alarm will be updated so that these times are as close as possible to the actual reporting time by the module

@prgeor I have now added Diagnostic Information Update During Link Down Event section to address this scenario.

; Defines Transceiver PM information for a port
key = TRANSCEIVER_PM|ifname ; information of PM on port
; field = value
prefec_ber_avg = FLOAT ; prefec ber avg
Contributor

We could expand this to cover media and host pre-FEC BER and uncorrectable frames, even if this is C-CMIS only.

The media BER and uncorrectable-frame information comes from page 34h; the host BER and uncorrectable frames come from page 3Ah.
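
To make the key and field layout from the schema hunk above concrete, here is a small sketch that builds one `TRANSCEIVER_PM` entry. The helper name is hypothetical, and storing the value as a string is an assumption about how the field would be written to redis-db.

```python
def pm_entry(ifname, prefec_ber_avg):
    """Build the (key, fields) pair for one port's PM record."""
    key = "TRANSCEIVER_PM|{}".format(ifname)
    # Assumption: field values are stored as strings.
    fields = {"prefec_ber_avg": str(float(prefec_ber_avg))}
    return key, fields
```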

@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@mihirpat1
Contributor Author

@mihirpat1 Capture the FreezeDone and UnfreezeDone time value defined by SONiC

prgeor previously approved these changes Feb 8, 2025

@zhangyanzhao
Collaborator

mark this feature as done for 202505 release since HLD and all code PRs are merged.

@mihirpat1
Contributor Author

> mark this feature as done for 202505 release since HLD and all code PRs are merged.

@zhangyanzhao There is still pending development work for this HLD so will keep the feature open for now.

@mihirpat1
Contributor Author

fixes #1885

@mihirpat1 mihirpat1 mentioned this pull request Apr 23, 2025
Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

Per-Lane DOM data

8 participants