Auditbeat: Fixes for system/socket dataset #19033
Conversation
Pinging @elastic/siem (Team:SIEM)
Just a slightly deeper explanation, for posterity's sake.
So, it looks like this happens when the kernel pointers of sockets get reused after a missed inet_release call. Our old clean-up code failed to remove the socket from the socketLRU, but overwrote the lookup value in the underlying sockets map. When the reaper code for cleaning up old sockets then ran, it would reference the orphaned record in the socketLRU along with its kernel pointer, which now pointed to a new socket. As a result, the reference in the socketLRU would never get removed and would be evaluated again and again via the Peek call in the for loop. The fix works because any time we expire a socket we now also explicitly remove it from the socketLRU (to which a socket reference is always added when the socket is created) and mark it as closing.
Does that sound about right? If so, would it be possible to add a simple test for this condition? Basically flow --> missed inet_release --> reused kernel pointer uint64 --> make sure we have the old socket reference in a closing state?
The feature was using socket.closeTime as a reference for expiration, but this timestamp was only set once the socket was closed or expired, so it caused all sockets to expire every closeTimeout.
@andrewstucki while adding the test I found yet another problem. It wasn't dealing with socket timeouts properly, see the new commit. Can you have another look?
andrewstucki left a comment
painful 😬 thanks for tracking these down and adding the tests @adriansr
yep, the whole state.go should be rewritten from scratch :D
Fixes two problems with the system/socket dataset:
- A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed in kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.

Also fixes two other minor issues:
- A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicate flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high load.

(cherry picked from commit 665b67f)
@adriansr With auditbeat version 7.17.0 (amd64), libbeat 7.17.0, system/socket shows peak CPU usage of 100%+ and mean CPU usage of 40%+.
…9081)
Fixes two problems with the system/socket dataset:
- A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed in kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.

Also fixes two other minor issues:
- A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicate flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high load.

(cherry picked from commit 9555ff4)
What does this PR do?
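Fixes two problems with the system/socket dataset:
- A bug in the internal state of the socket dataset that led to an infinite loop on systems where the kernel aggressively reuses sockets (observed in kernel 2.6 / CentOS/RHEL 6.x).
- Socket expiration wasn't working as expected because it used an uninitialized timestamp: flows were expiring at every check.
Also fixes two other minor issues:
- A flow could be terminated twice by different code paths, leading to a wrong numFlows calculation and duplicate flows being indexed.
- Decoupled the status debug log and socket cleanup into separate goroutines so that logging is still performed under high load.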
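The double-termination issue named in the commit message (a flow terminated twice by different code paths, skewing numFlows and indexing duplicates) is the kind of problem an idempotent terminate step avoids. A minimal Go sketch, assuming hypothetical `flow` and `flowTable` types that are not the dataset's actual ones:

```go
package main

import "fmt"

// flow is a hypothetical, reduced flow record.
type flow struct {
	id         int
	terminated bool
}

// flowTable tracks the number of live flows and the flows queued for indexing.
type flowTable struct {
	numFlows int
	done     []*flow
}

// start registers a new live flow.
func (t *flowTable) start(id int) *flow {
	t.numFlows++
	return &flow{id: id}
}

// terminate ends a flow at most once. A second call from another code path
// is a no-op, so numFlows is decremented once and the flow is queued once.
func (t *flowTable) terminate(f *flow) bool {
	if f.terminated {
		return false
	}
	f.terminated = true
	t.numFlows--
	t.done = append(t.done, f)
	return true
}

func main() {
	tbl := &flowTable{}
	f := tbl.start(1)

	fmt.Println(tbl.terminate(f)) // first path, e.g. socket close
	fmt.Println(tbl.terminate(f)) // second path, e.g. expiration
	fmt.Println(tbl.numFlows, len(tbl.done))
}
```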
Why is it important?
It has been observed that the dataset would use 100% CPU and stop reporting events. During testing it was discovered that socket expiration, a new feature to prevent excessive memory usage, wasn't working as expected.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
The infinite loop is easy to trigger in RHEL 6.x by running: