Kdump usability and reliability improvements #6113

Merged
lguohan merged 4 commits into sonic-net:master from rajendra-dendukuri:kdump_improvements
Dec 10, 2020

Conversation

@rajendra-dendukuri
Contributor

  • Allow a platform-specific reboot script to be called after the crash kernel
    has finished copying the kernel vmcore
  • Kdump configurations stored and manipulated in ConfigDB are now processed
    by hostcfgd and applied asynchronously
  • Disable PCIe advanced features when running the crash kernel. This makes
    the crash kernel more reliable at creating a vmcore and rebooting
    afterwards
  • Allow the crash kernel to reboot if a panic occurs while it is generating
    a vmcore
  • Make the crash kernel use the SONiC-specific /usr/local/bin/reboot script
    instead of the Linux reboot command /sbin/reboot

Signed-off-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>

- Why I did it
Improve Kdump usability and reliability

- How I did it

  • Allow a platform-specific reboot script to be called after the crash kernel has finished copying the kernel vmcore
  • Kdump configurations stored and manipulated in ConfigDB are now processed by hostcfgd and applied asynchronously
  • Disable PCIe advanced features when running the crash kernel. This makes the crash kernel more reliable at creating a vmcore and rebooting afterwards
  • Allow the crash kernel to reboot if a panic occurs while it is generating a vmcore
  • Make the crash kernel use the SONiC-specific /usr/local/bin/reboot script instead of the Linux reboot command /sbin/reboot

- How to verify it
config kdump enable
echo c > /proc/sysrq-trigger
show kdump status
show kdump log 1
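
Before forcing a panic with the sysrq trigger, it can help to confirm that a crash kernel is actually loaded, since otherwise no vmcore will be captured. The sketch below is not part of this PR. It assumes the standard Linux sysfs flag /sys/kernel/kexec_crash_loaded, which reads 1 when a crash kernel is loaded; the function name crash_kernel_loaded is hypothetical.

```shell
#!/bin/sh
# Return 0 if the kernel reports a loaded crash (kdump) kernel.
# $1 = path of the flag file; defaults to the standard sysfs location.
# (Hypothetical helper, for illustration only.)
crash_kernel_loaded() {
    flag="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}

# Usage (commented out because it deliberately crashes the system):
# if crash_kernel_loaded; then
#     echo c > /proc/sysrq-trigger
# else
#     echo "crash kernel not loaded; check 'show kdump status'" >&2
# fi
```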

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006

- Description for the changelog

Kdump usability and reliability improvements

- A picture of a cute animal (not mandatory but encouraged)

#KDUMP_KEXEC_ARGS=""
#KDUMP_CMDLINE=""
-#KDUMP_CMDLINE_APPEND="irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0"
+KDUMP_CMDLINE_APPEND="irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 panic=10 debug hpet=disable pcie_port=compat pci=nommconf platform=__PLATFORM__"
Collaborator

Can you explain why PLATFORM=$platform is passed on the kernel cmdline? What is the use case here? I do not see this in the description.

Contributor Author

The PLATFORM string will be accessible to the reboot script. When the crash kernel reboots, it needs the $platform value in order to use any platform-specific reboot script, which is defined in /usr/share/sonic/device/$platform/platform_reboot.
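
The lookup described here can be sketched roughly as follows. This is an illustration, not the code in the PR: the helper names cmdline_value and platform_reboot_path are hypothetical, the platform name used in any example is made up, and the key is passed as an argument because the parameter name was still under discussion in this review (platform in the diff, sonic_platform as suggested).

```shell
#!/bin/sh
# Print the value of KEY=... from a kernel command line string.
# $1 = key (e.g. "sonic_platform"), $2 = command line text.
# Returns non-zero if the key is not present.
cmdline_value() {
    key="$1"
    # Word splitting of $2 is intentional: one word per kernel argument.
    for arg in $2; do
        case "$arg" in
            "$key"=*) echo "${arg#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Print the platform-specific reboot script path if it is executable.
# $1 = platform name, $2 = device root (defaults to the path named above).
platform_reboot_path() {
    script="${2:-/usr/share/sonic/device}/$1/platform_reboot"
    [ -x "$script" ] && echo "$script"
}

# Usage inside a reboot wrapper (commented out; it would reboot the box):
# platform="$(cmdline_value sonic_platform "$(cat /proc/cmdline)")" &&
#     script="$(platform_reboot_path "$platform")" &&
#     exec "$script"
# exec /sbin/reboot
```

Falling back to the generic reboot command when no platform script is found keeps the crash kernel from hanging on platforms that do not ship one.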

Collaborator

platform seems like a common keyword. Do you know whether the kernel already uses it as a parameter? Is there a chance of conflict? It would be better to use sonic_platform to avoid any potential conflict.

@lguohan
Collaborator

lguohan commented Dec 3, 2020

Can you move the hostcfgd changes into a separate PR? That one is for usability. We can merge it together with the sonic-utilities PR you posted.

If the capture kernel (the kdump kernel started with kexec) crashes
or gets stuck, the system should reboot. We therefore need to add
the "panic=X" option to the kernel command line. Without this option,
the system could get stuck and never reboot.
Collaborator

The patch description is not complete. You have added a few other options, such as debug, hpet, pcie_port, pci, and PLATFORM. Can you describe their purpose as well?

Contributor Author

@lguohan

After taking a second look at these two proposed patches, I feel we should not add patches to the kdump-tools package. Instead, these additional configurations should be appended to /etc/default/kdump-config as part of build_debian.sh. This will make the SONiC kdump customizations easier to manage. I will also add an appropriate description for the options added.

Comments?
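
A build-time step along these lines might look like the sketch below. It is an assumption about shape, not the code that landed in build_debian.sh: the function name append_kdump_cmdline is hypothetical, the option string is copied from the diff above with the parameter spelled sonic_platform as suggested in review, and the __PLATFORM__ placeholder is left unsubstituted because this conversation does not show how it gets filled in.

```shell
#!/bin/sh
# Append the crash-kernel command line options to a kdump-config file.
# $1 = path to the config file inside the image root filesystem.
# (Hypothetical helper, for illustration only.)
append_kdump_cmdline() {
    cfg="$1"
    # Idempotent: do nothing if an uncommented setting already exists.
    grep -q '^KDUMP_CMDLINE_APPEND=' "$cfg" 2>/dev/null && return 0
    cat >> "$cfg" <<'EOF'
# panic=10       reboot 10 seconds after a panic in the crash kernel itself
# debug          verbose kernel logging while the vmcore is captured
# hpet=disable   do not use the HPET timer
# pcie_port=compat pci=nommconf
#                disable PCIe advanced features (per the PR description)
# sonic_platform platform name made available to the reboot script
KDUMP_CMDLINE_APPEND="irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 panic=10 debug hpet=disable pcie_port=compat pci=nommconf sonic_platform=__PLATFORM__"
EOF
}

# Usage during image build (the path variable is an assumption):
# append_kdump_cmdline "$FILESYSTEM_ROOT/etc/default/kdump-config"
```

Keeping the options and their descriptions in one appended block means the customizations live in the build script rather than in a patch against the kdump-tools package.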

Collaborator

I agree with you. Okay to move to build_debian.sh.

Contributor Author

Agreed. I have renamed the parameter to sonic_platform.

@rajendra-dendukuri
Contributor Author

Can you move the hostcfgd changes into a separate PR? That one is for usability. We can merge it together with the sonic-utilities PR you posted.

Yes, that is a very good suggestion. I will make the recommended changes.

…system creation

Moved changes to hostcfgd to a different PR

Signed-off-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
@rajendra-dendukuri
Contributor Author

Can you move the hostcfgd changes into a separate PR? That one is for usability. We can merge it together with the sonic-utilities PR you posted.

Yes, that is a very good suggestion. I will make the recommended changes.

hostcfgd changes moved to #6122

rajendra-dendukuri and others added 2 commits December 4, 2020 11:28
…orm identifier string

Signed-off-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
@rajendra-dendukuri
Contributor Author

retest mellanox

@lguohan
Collaborator

lguohan commented Dec 8, 2020

retest mellanox please

@lguohan lguohan merged commit 31ce20a into sonic-net:master Dec 10, 2020