Skip to content

Fix SIGTERM can't terminate PubSub::listen issue by add cancellation token support.#606

Merged
liuh-80 merged 19 commits intosonic-net:masterfrom
liuh-80:dev/liuh/cancellation_token
May 19, 2022
Merged

Fix SIGTERM can't terminate PubSub::listen issue by add cancellation token support.#606
liuh-80 merged 19 commits intosonic-net:masterfrom
liuh-80:dev/liuh/cancellation_token

Conversation

@liuh-80
Copy link
Copy Markdown
Contributor

@liuh-80 liuh-80 commented Apr 19, 2022

Why I did it

There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
https://github.com/Azure/sonic-swss-common/issues/603

How I did it

Add following class:
1. CancellationToken: this class will help exist the infinite loops when SIGTERM or other signal happen.
2. SignalHandlerHelper: Provide a native signal handler.

How to verify it

  1. manually test.
  2. Pass all test case.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liuh-80 liuh-80 changed the title Add cancellation token Fix SIGTERM can't terminate PubSub::listen issue by add cancellation token support. Apr 19, 2022
@liuh-80 liuh-80 marked this pull request as ready for review April 20, 2022 08:41
Comment thread common/cancellationtoken.h Outdated
Comment thread common/cancellationtoken.cpp Outdated
Comment thread common/cancellationtoken.h Outdated
Comment thread common/signalhandlerhelper.cpp Outdated
Comment thread common/signalhandlerhelper.cpp Outdated

void SignalHandlerHelper::RegisterSignalHandler(int signalNumber)
{
signal(signalNumber, SignalHandlerHelper::OnSignal);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it override the signal handler defined by the application? In case of python, does it mean that the user defined signal handler in python won't be executed?

Copy link
Copy Markdown
Contributor Author

@liuh-80 liuh-80 Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because linux can only have 1 signal handler per-application.
If customer need define their signal handler, then customer also need cancel the token when signal happen in their handler, this class just a helper class.

Comment thread common/select.cpp Outdated
Comment thread common/select.cpp Outdated
@liuh-80 liuh-80 requested a review from qiluo-msft April 21, 2022 02:11
Comment thread common/configdb.h Outdated
Comment thread common/configdb.h Outdated
Comment thread common/select.cpp
@qiluo-msft
Copy link
Copy Markdown
Contributor

qiluo-msft commented Apr 22, 2022

ret = poll_descriptors(c, timeout);

Did you change old behavior?


In reply to: 1106076032


Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

Comment thread common/select.h Outdated
@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented Apr 22, 2022

ret = poll_descriptors(c, timeout);

Did you change old behavior?

Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

No, I didn't. when timeout happen, current code still will try read again.

@qiluo-msft
Copy link
Copy Markdown
Contributor

ret = poll_descriptors(c, timeout);

For non terminating signals (like siguser), original behavior is keeping while loop, but the new behavior is return SIGNALINTR. I think this is unexpected.


In reply to: 1106076032


Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented Apr 27, 2022

ret = poll_descriptors(c, timeout);

For non terminating signals (like siguser), original behavior is keeping while loop, but the new behavior is return SIGNALINTR. I think this is unexpected.

In reply to: 1106076032

Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

Will discussion offline, the reason is:

  1. without a customize signal handler we don't know which signal break the epoll_wait.
  2. When poll_descriptors break by siguser, it will return to python code and python code will re-run poll_descriptors method, so from the python code side, the behavior not change.

However. there are some UT failed, seems related with this change, will check reason.

Update: in latest code, only sigint can break the while loop, all other signal will not break the while loop.

@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented Apr 28, 2022

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented Apr 28, 2022

Add SignalHandlerHelper class back because:

  1. There are UT break because in last version any signal will break poll_descriptors() method.
  2. So the solution should only allow SIGTERM to break poll_descriptors().
  3. However, according to linux signal system, it's impossible to check which signal happening when there is no signal handler.

So we have to add this class back.

Also here is demo code in python3:

from swsscommon import swsscommon
swsscommon.SignalHandlerHelper.registerSignalHandler(swsscommon.SIGNAL_INT)
c=swsscommon.ConfigDBConnector()
c.subscribe('A', lambda a: None)
c.connect()
c.listen(None)
^C>>>

Comment thread common/pubsub.cpp Outdated
Comment thread common/pubsub.cpp
Comment thread common/pubsub.cpp
Comment thread common/pubsub.h Outdated
Comment thread common/signalhandlerhelper.cpp Outdated
Comment thread common/select.cpp Outdated
@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented May 11, 2022

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented May 11, 2022

Code coverage does not reach target, will add UT to cover new code.

liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this pull request May 11, 2022
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this pull request May 11, 2022
…ystemd (#2133)" (#2166)

- What I did
This reverts commit a5f55aa.

Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.

- How I did it
git revert a5f55aa

- How to verify it
Run tests
@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented May 16, 2022

@stepanblyschak, could you please check this PR again? all issue fixed.

Comment thread common/pubsub.cpp Outdated
Comment thread common/signalhandlerhelper.cpp Outdated
@qiluo-msft
Copy link
Copy Markdown
Contributor

ret = poll_descriptors(c, timeout);

Could you fix for non terminating signals (like siguser)?


In reply to: 1109033970


Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

@liuh-80
Copy link
Copy Markdown
Contributor Author

liuh-80 commented May 18, 2022

ret = poll_descriptors(c, timeout);

Could you fix for non terminating signals (like siguser)?

In reply to: 1109033970

Refers to: common/select.cpp:166 in 037283c. [](commit_id = 037283c, deletion_comment = False)

This already fixed, currently only sigint can break the while loop in select.cpp, all other signal will not break while loop.

@liuh-80 liuh-80 merged commit 7ae22be into sonic-net:master May 19, 2022
itamar-talmon pushed a commit to itamar-talmon/sonic-swss-common that referenced this pull request Jul 19, 2022
…token support. (sonic-net#606)

Why I did it
    There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
    sonic-net#603

How I did it
    Add following class:
    1. CancellationToken: this class will help exist the infinite loops when SIGTERM or other signal happen.
    2. SignalHandlerHelper: Provide a native signal handler.

How to verify it
   1. manually test.
   2. Pass all test case.
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this pull request Aug 3, 2023
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants