
Added option to do busy-wait on epoll#8267

Merged
normanmaurer merged 4 commits into netty:4.1 from merlimat:spin-epoll
Sep 28, 2018

Conversation

@merlimat (Contributor) commented Sep 6, 2018

Motivation:

Add an option (through a SelectStrategy return code) to have the Netty event loop thread busy-wait on epoll.

The reason for this change is to avoid the context switch cost that comes when the event loop thread is blocked on the epoll_wait() call.

On average, the context switch has a penalty of ~13usec.

This benefits both:

  • Latency when reading from a socket
  • Latency when scheduling tasks to be executed on the event loop thread

The tradeoff, when enabling this feature, is that the event loop thread will use 100% CPU, even when inactive.

Modification:

  • Added a SelectStrategy option to return BUSY_WAIT
  • The epoll loop will do an epoll_wait() with no timeout
  • Use the pause instruction to hint to the processor that we're in a busy loop

Result:

  • When enabled, minimizes impact of context switch in the critical path

@@ -0,0 +1,102 @@
/*
* Copyright 2014 The Netty Project
Review comment (Member): 2018

@normanmaurer (Member):

@merlimat just curious what you use this for... Do you try to minimise the latency of some app that is really network-heavy?

@merlimat (Contributor, Author) commented Sep 6, 2018

@normanmaurer main goal is to reduce context switches in the critical path. Currently there can be multiple tasks executed in the Netty event-loop. For example, when writing to a channel from a different thread we have to "jump" to the event loop (to retain ordering and avoid mutex contention).

By having the eventLoop thread spinning, when you submit the task it will be picked up immediately (generally in ~100ns) compared to waking up the thread from epoll_wait through the eventfd which will take ~10usec. The same latency reduction applies when reading something from a channel, if the thread is spinning it will "see" the message earlier.

While a single context switch is not the end of the world, in a complex application the number of these "thread jumps" and socket read chains can be significant and become a dominating factor in the overall request/response latency.

Of course, this is only really useful when trying to squeeze latency to the very minimum, at the expense of CPU usage (and max throughput, since the event loop would keep a CPU busy that could otherwise be used by other threads).

Also, this would typically be useful paired with CPU affinity. The effects of the two are:

  • Busy-wait brings down avg/median latency
  • CPU affinity lowers the 99pct (and above) long tails by removing intermittent jitter

@normanmaurer (Member):

@merlimat makes sense... do you have any benchmarks to share (just curious)?

}
return ready;
}
private static native int epollBusyWait0(int efd, long address, int len);
Review comment (Member): nit: add empty line above.

};
}

private BootstrapFactory<Bootstrap> clientSocket() {
Review comment (Member): nit: static?

return list;
}

private BootstrapFactory<ServerBootstrap> serverSocket() {
Review comment (Member): nit: static?

return new SelectStrategy() {
@Override
public int calculateStrategy(IntSupplier selectSupplier, boolean hasTasks)
throws Exception {
Review comment (Member): nit: remove throws ....

@merlimat (Contributor, Author) commented Sep 7, 2018

@normanmaurer Yes, I did some brief measurements. This is the latency for an RPC request/response application. There are additional thread switches involved (though those are already using a local queue and busy-waiting on it as well).

Conditions:

  • Client/server on same machine with TCP connection
  • Tested on a bare-metal node
  • Client sends 1K rps (size 1KB) and waits for response
  • Multiple request outstanding, responses in order
  • Latency measured in milliseconds

                        50pct   95pct   99pct   99.9pct  99.99pct  99.999pct  max
  Baseline              0.049   0.059   0.066   0.095    0.449    0.465      0.466
  Busy-Wait IO Threads  0.036   0.047   0.052   0.075    0.436    0.445      0.452
  Disable HyperThread   0.029   0.038   0.041   0.054    0.204    0.388      0.396
  CPU-Isolation         0.029   0.037   0.040   0.050    0.066    0.090      0.104

I have included disabling hyper-threading and CPU isolation as they are the next steps (and easily configured in Linux settings).
The remaining bulk of the median latency is, I think, related to SO_BUSY_POLL (#8268) being ineffective on the loopback interface.

This is a very simple example, but in a more complex app, there would be more thread jumps and possibly more RPCs involved.

@carl-mastrangelo (Member):

@merlimat Have you considered busy polling on reads or writes, rather than on polls? Even if poll does return something, you need to follow it up with a subsequent read, which is going to add a syscall of additional latency before you get your data. How many active sockets do you have, and are you trying to busy poll on writes or reads more?

@merlimat (Contributor, Author) commented Sep 8, 2018

@carl-mastrangelo It's not just the read or write. The Netty event loop is also used to execute tasks on the same connection's thread.

For example, to ensure ordering of operations and to avoid mutex contention, we typically do:

ctx.channel().eventLoop().execute(() -> {
    ctx.writeAndFlush(buffer); 
});

So the event loop blocks on the epoll_wait but also keeps an eye on the tasks that are posted, using the eventfd mechanism to wake up epoll_wait immediately when a task is posted.

Even if poll does return something, you need to follow it up with a subsequent read, which is going to add a syscall of additional latency before you get your data.

Sure, the read will involve a syscall, but the data will be available and it won't be blocking. The main goal here is to make sure the IO thread is never (or to the least possible extent) descheduled. The cost in terms of latency of the context switch (~10 usec) is much higher than the cost of the syscall.

Also, we always have multiple connections to manage, from 10s to 1000s. With the direct read() approach (which I don't think is currently possible through Netty), we'd need to spin on one thread per connection, while with epoll we can spin on one (or a few) IO threads and leave the other CPUs for the rest of the application.

@merlimat (Contributor, Author) commented Sep 8, 2018

@normanmaurer addressed comments

@carl-mastrangelo (Member):

@merlimat two things:

  1. I asked how many sockets you have because if you do have multiple, it might make more sense to dedicate a single thread to each one, one per core. Then you could do reads directly.
  2. On a recent PR I made (Don't re-arm timerfd each epoll_wait #7816) it was measured that changing the timerfd took about 4us, so I'll pick that as a baseline for syscall time. Why do you not care about the followup read, if you do care about the context switch? They are in the same order of magnitude.

@merlimat (Contributor, Author) commented Sep 17, 2018

@carl-mastrangelo

  1. I asked how many sockets you have because if you do have multiple, it might make more sense to dedicate a single thread to each one, one per core. Then you could do reads directly.

There is always more than 1 connection, from 10s to 100s or 1000s, so dedicating one core per connection is not really feasible.

  2. On a recent PR I made (Don't re-arm timerfd each epoll_wait #7816) it was measured that changing the timerfd took about 4us, so I'll pick that as a baseline for syscall time. Why do you not care about the followup read, if you do care about the context switch? They are in the same order of magnitude.

How did you measure that time? 4 usec for a syscall seems very high to me.

Just did some experiments and was seeing timings of < 0.8usec for epoll syscalls with no timeout (including JNI overhead).

@carl-mastrangelo (Member):

How did you measure that time? 4 usec for a syscall seems very high to me.

The latency win on the benchmark. Also, I frequently run ptrace with timing output, so I think 4 us is not uncommon for syscalls.

I guess I don't have anything really against this, I just don't believe it will be the solution you actually need.

@normanmaurer (Member):

@carl-mastrangelo @merlimat so do we want to pull this in or not? I am happy either way :)

@carl-mastrangelo (Member):

@normanmaurer I'll defer to you, but I can confidently say Google cannot use this (busy looping is extremely expensive, both in power and in the opportunity cost of running something else).

But, @merlimat may be willing to make the trade off.

@merlimat (Contributor, Author):

Sure, this setting is definitely not for general use, or meant to be a default.

We plan to give the option to enable busy wait in BookKeeper and Pulsar, for cases where one wants to maximize latency at expense of throughput and CPU consumption.

@netty-bot:

Can one of the admins verify this patch?

@normanmaurer (Member):

@netty-bot test this please

@normanmaurer normanmaurer added this to the 4.1.31.Final milestone Sep 28, 2018
@normanmaurer normanmaurer self-assigned this Sep 28, 2018
@normanmaurer normanmaurer merged commit 3a96e73 into netty:4.1 Sep 28, 2018
@normanmaurer (Member):

@merlimat thanks a lot!

@merlimat (Contributor, Author):

Thanks for getting this in!

@johnou (Contributor) commented Sep 30, 2018

@merlimat you mean minimise latency? ;)

@merlimat (Contributor, Author) commented Oct 1, 2018

Ouch. Or maybe “maximize the responsiveness” :)

merlimat added a commit to apache/bookkeeper that referenced this pull request Nov 6, 2018
### Motivation

 * Upgrade to latest Netty version which brings in perf improvements and some new features (eg: netty/netty#8267) 

 * Broke down the dependencies from `netty-all` into individual components, as discussed at #1755 (comment)



Reviewers: Ivan Kelly <ivank@apache.org>, Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1784 from merlimat/netty-4.1.31
@rohitsahay2000:

Will this solve #327?

@njhill njhill mentioned this pull request Jun 21, 2019
@VaughnVernon:

Hi all, we are looking for help applying the resolution you reached here to another project. Please see:

#327 (comment)
