[Auto-Techsupport] Issues related to Multiple Cores crashing handled #1948

Merged
qiluo-msft merged 6 commits into sonic-net:master from vivekrnv:event_driv_ts_bug
Dec 6, 2021

@vivekrnv vivekrnv commented Nov 24, 2021

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

What I did

Issues seen when multiple processes crash and dump core in very quick succession:

  1. The rate_limit_interval is not honored. Previously, I found the most recently created techsupport dump using the glob pattern sonic_dump_*tar*, which does not match dumps that are still being generated, because in-progress dumps do not yet have the .tar.gz extension. I modified get_ts_dumps to search based on TS_ROOT, i.e. sonic_dump_*, instead.
  2. show auto-techsupport history does not show all the created dumps. Previously, I took the diff of the techsupport dumps present before and after the invocation and assigned that diff as the techsupport dump for the core. This approach is prone to a race condition, because more than one dump can appear in the diff during that interval.
    I avoided this by parsing the stdout returned by the show techsupport invocation.
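
To illustrate the first issue, here is a minimal sketch comparing the two glob patterns named above. It is not the code from this PR: the helper list_dumps, the file names, and the choice to model an in-progress dump as a directory without an extension are all assumptions made for the example.

```python
import glob
import os
import tempfile

# Both patterns are quoted from the description above.
OLD_PATTERN = "sonic_dump_*tar*"
NEW_PATTERN = "sonic_dump_*"


def list_dumps(dump_dir, pattern):
    """Hypothetical helper: sorted base names in dump_dir matching pattern."""
    paths = glob.glob(os.path.join(dump_dir, pattern))
    return sorted(os.path.basename(p) for p in paths)


def demo():
    with tempfile.TemporaryDirectory() as dump_dir:
        # A finished dump (made-up name).
        finished = os.path.join(dump_dir, "sonic_dump_host_20211124_100000.tar.gz")
        open(finished, "w").close()
        # Assumption: an in-progress dump is an entry without any extension.
        os.mkdir(os.path.join(dump_dir, "sonic_dump_host_20211124_100005"))
        return list_dumps(dump_dir, OLD_PATTERN), list_dumps(dump_dir, NEW_PATTERN)


old_matches, new_matches = demo()
print("old pattern:", old_matches)
print("new pattern:", new_matches)
```

Under these assumptions the old pattern reports only the finished archive, so a rate limit check based on it would not notice the dump that is still running, while the broader pattern reports both entries.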

How I did it

How to verify it

  1. Unit Tests
  2. Generate core dumps in very quick succession with the default rate limit interval. Only one entry should appear in the techsupport history.
  3. Set the global rate limit interval to 0 and generate cores in quick succession. A few entries should appear in the history.
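
The outcomes expected in steps 2 and 3 can be reasoned about with a small simulation of a rate limit check. This is only a sketch: should_invoke and simulate are hypothetical names, the 180-second value is an assumed stand-in for the default interval, and treating 0 as "no rate limit" is inferred from step 3 rather than taken from the code.

```python
def should_invoke(now, last_dump_time, rate_limit_interval):
    """Hypothetical check: may a new techsupport dump start at time `now`?"""
    # Assumption: an interval of 0 disables rate limiting.
    if rate_limit_interval <= 0 or last_dump_time is None:
        return True
    return (now - last_dump_time) >= rate_limit_interval


def simulate(core_times, rate_limit_interval):
    """Return the core timestamps (in seconds) that would trigger a dump."""
    last_dump_time = None
    triggered = []
    for t in core_times:
        if should_invoke(t, last_dump_time, rate_limit_interval):
            triggered.append(t)
            # The dump counts from the moment it starts, not when it finishes.
            last_dump_time = t
    return triggered


cores = [0, 1, 2, 3]  # four cores, one second apart
with_limit = simulate(cores, 180)  # 180 is an assumed interval
without_limit = simulate(cores, 0)
print(with_limit)
print(without_limit)
```

In this model only the first core triggers a dump when the interval is non-zero, and every core triggers one when the interval is 0. On a real device the number of entries in step 3 may be smaller, since each invocation takes time to run.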

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@vivekrnv
Contributor Author

@ganglyu @qiluo-msft Please help review

@vivekrnv
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

matches = re.findall(TS_PTRN, ts_stdout)
if matches:
    return matches[-1]
else:
Contributor

else is not necessary here.
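
A sketch of what the suggestion could look like, with the else branch replaced by a plain fall-through return. The function name parse_ts_dump_name, the regular expression assigned to TS_PTRN, the sample output, and the None fallback are all assumptions for illustration; the real pattern and fallback value are not shown in this conversation.

```python
import re

# Hypothetical pattern; the actual TS_PTRN is not shown in this thread.
TS_PTRN = r"sonic_dump_\S+\.tar\.gz"


def parse_ts_dump_name(ts_stdout):
    """Return the last dump name found in ts_stdout, or None if there is none."""
    matches = re.findall(TS_PTRN, ts_stdout)
    if matches:
        return matches[-1]
    # No else needed: the branch above has already returned.
    return None


# Made-up stdout resembling a techsupport run that prints the archive path.
sample_stdout = "Collecting logs\n/var/dump/sonic_dump_host_20211124_100000.tar.gz\n"
print(parse_ts_dump_name(sample_stdout))
print(parse_ts_dump_name("no dump was created"))
```

Because the name comes from the output of the specific invocation, it stays tied to that core even when other dumps appear in the dump directory at the same time.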

if "show techsupport --since '2 days ago'" in cmd_str:
    patcher.fs.create_file("/var/dump/sonic_dump_random3.tar.gz")
    return 0, "", ""
print(cmd_str)
Contributor

Is print used for debug?

Contributor Author

Yes, will remove.

@ganglyu ganglyu requested a review from qiluo-msft November 26, 2021 01:50
@vivekrnv vivekrnv requested a review from ganglyu November 30, 2021 19:39
@ganglyu ganglyu left a comment

lgtm

@qiluo-msft qiluo-msft merged commit 6de91af into sonic-net:master Dec 6, 2021
@vivekrnv vivekrnv deleted the event_driv_ts_bug branch December 6, 2021 17:00
judyjoseph pushed a commit that referenced this pull request Jan 9, 2022
…1948)
