Native Epoll consuming 100% of one CPU normal? #16236
-
We are seeing an issue with the native Epoll transport where CPU time is higher than with NIO, which interferes with some scaling logic we had, since the CPU readings are now higher than usual. To measure CPU we are using the MBeanServer APIs to read the relevant attributes. On average, without Epoll, the CPU was normally below 10% and the scaling logic worked as expected. With Epoll, the CPU numbers are above 10%, and after some investigation the cause appears to be that the Epoll thread is consuming 100% of a single CPU with no traffic happening at all, so on an 8-CPU machine the average shows around 12.5% (one CPU fully used out of eight available). Is this normal? There is essentially no traffic in the server since this is at startup, so I wouldn't expect the CPU usage it is reporting.

From what I could see, the main issue seems to be the epollBusyWait case here, where a busy wait escalates the CPU numbers; the KQueue and NIO transports don't have that busy-wait logic. Any comments would be appreciated. Similar issue: #5896, but I didn't really see a resolution there.

Edit: I was corrected with more info; the 10% CPU figure is really a function of the number of CPUs. A more precise way to describe the behavior is that the Netty Epoll thread is using 100% of a single CPU when there is no traffic happening. In a container workload, this shows up as the containers reporting massive CPU usage, which doesn't seem right. I reworded the discussion to match this.
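For context, here is a minimal sketch of how CPU load can be sampled through the platform MBeanServer, similar to what the scaling logic above might do. The original post does not say which attribute was read, so the use of `ProcessCpuLoad` on the `java.lang:type=OperatingSystem` MBean is an assumption.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class CpuProbe {
    // Read the process-wide CPU load via the platform MBeanServer.
    // "ProcessCpuLoad" is exposed by the com.sun.management
    // OperatingSystemMXBean on HotSpot; it returns a fraction in
    // [0, 1], or -1.0 when no sample is available yet.
    public static double processCpuLoad() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName os = new ObjectName("java.lang:type=OperatingSystem");
        return (double) server.getAttribute(os, "ProcessCpuLoad");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("process CPU load = " + processCpuLoad());
    }
}
```

Note that `ProcessCpuLoad` is normalized over all available CPUs, which is exactly why one fully spinning thread on an 8-CPU host reads as ~0.125.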
-
AFAIK no, and I don't remember the default select strategy ending up there (BUSY_WAIT). Reading the other issue, it ended with no actionable data points for us (the user switched to a new stack without using a proper profiler). If you can provide some flamegraph/profiling information, let's see if we can do something about it 🙏
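As a side note on select strategies: if the loop really were hitting BUSY_WAIT, one hedged workaround sketch would be to install a custom `SelectStrategyFactory` that always falls back to the blocking SELECT path when there are no tasks, so the epoll thread parks in epoll_wait instead of spinning. This is an illustrative fragment, not confirmed as the fix for this report.

```java
import io.netty.channel.SelectStrategy;
import io.netty.channel.SelectStrategyFactory;
import io.netty.channel.epoll.EpollEventLoopGroup;

// Sketch: a strategy that never busy-waits. When there are pending
// tasks it does a non-blocking poll (via the supplier); otherwise it
// asks the loop to block in the selector/epoll_wait.
public final class BlockingSelectStrategyFactory implements SelectStrategyFactory {
    @Override
    public SelectStrategy newSelectStrategy() {
        return (selectSupplier, hasTasks) ->
                hasTasks ? selectSupplier.get() : SelectStrategy.SELECT;
    }
}

// Usage (assumption: the two-arg EpollEventLoopGroup constructor taking
// a SelectStrategyFactory, available in Netty 4.1):
//   EpollEventLoopGroup group =
//           new EpollEventLoopGroup(1, new BlockingSelectStrategyFactory());
```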
-
When I originally opened that thread, I didn't have much experience and couldn't collect useful diagnostic data. At the time, we worked around the issue by switching from Netty to basic AIO sockets on CentOS 7 with Oracle JRE 8. Later, after moving to a new hosting provider (Ubuntu + Oracle JDK 11), we switched back to Netty and the issue no longer occurred. I was never able to pinpoint the exact cause of the epoll thread spinning. From what I recall, it mainly happened while idle; CPU usage under load was similar to NIO. As a temporary workaround, I added a small sleep in the event loop when there were no active connections.
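The "small sleep when idle" workaround above can be sketched as a generic idle-backoff loop. This is not Netty internals; `runIdleLoop` and its always-idle body are placeholders to show the shape of the technique.

```java
public class IdleBackoff {
    // Run a poll loop for roughly `millis` ms. Whenever an iteration
    // finds no work, back off with a 1 ms sleep instead of spinning,
    // which keeps idle CPU near zero at the cost of ~1 ms extra latency.
    public static long runIdleLoop(long millis) throws InterruptedException {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long iterations = 0;
        while (System.nanoTime() < deadline) {
            iterations++;
            boolean didWork = false; // placeholder: no traffic in this sketch
            if (!didWork) {
                Thread.sleep(1); // idle backoff instead of busy wait
            }
        }
        return iterations;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("iterations in 50 ms: " + runIdleLoop(50));
    }
}
```

With the sleep in place the loop does on the order of tens of iterations per 50 ms instead of millions, which is the difference the original poster observed between idle CPU with and without the workaround.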
-
I opened #16240 since this seems like an issue in the source. Will close this and continue the discussion there.