Issue #1409: Added server side backpressure (@bug W-3651831@) #1410
Conversation
eolivelli
left a comment
Good idea.
I think that the changes on SortedLedgerStorage are important. Personally I am okay with those changes, but I would like to hear the community's opinion.
logId = logId + 1;
final JournalChannel.BufferedChannelBuilder bcBuilder;
if (getDelay > 0 || addDelay > 0 || flushDelay > 0) {
It would be better to use PowerMock rather than this kind of "explicit" instrumentation.
Recently we added a lot of test cases which tweak the journal.
Will do.
PowerMock in tests that extend BookKeeperClusterTestCase is a bit problematic, likely due to
powermock/powermock#822
Figured out a way to use just Mockito there; will update the PR soon.
 */
public class SortedLedgerStorage extends InterleavedLedgerStorage
        implements LedgerStorage, CacheCallback, SkipListFlusher {
public class SortedLedgerStorage
this is an important change: SortedLedgerStorage will not be an InterleavedLedgerStorage anymore
What are the implications of this change, @eolivelli?
@dlg99 can you please describe the motivation for this change?
This simplifies testability, i.e. the SlowSortedLedgerStorage implementation, reduces code duplication, etc. I first tried dealing with this without this change and it turned out ugly.
I am not worried about this, as there is no reason to code directly against InterleavedLedgerStorage instead of the CompactableLedgerStorage interface, which SortedLedgerStorage still implements. Tests still pass, though I had to change a couple of tests that cast SortedLedgerStorage to InterleavedLedgerStorage.
If there is some specific reason that I have missed why we should not change this, or a reason why we should ignore the interface and code against the concrete implementation, please let me know.
@dlg99 thank you for trying.
Go ahead with your change.
We already agreed at the community meeting that this is not a problem.
protected static final String PCBC_TIMEOUT_TIMER_NUM_TICKS = "pcbcTimeoutTimerNumTicks";
protected static final String TIMEOUT_TIMER_TICK_DURATION_MS = "timeoutTimerTickDurationMs";
protected static final String TIMEOUT_TIMER_NUM_TICKS = "timeoutTimerNumTicks";
protected static final String WAIT_TIMEOUT_ON_BACKPRESSURE = "waitTimeoutOnBackpressureMs";
I think it would be better to have two separate properties, one for the server and another for the client.
Having two properties with the same name will be confusing, and/or we will have to write docs to explain this fact.
will do
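For illustration only, the split being asked for would keep distinct property names per configuration; the renamed server-side key below is an assumption about how it could look, not necessarily what the final change uses.

```java
// Hypothetical sketch: separate keys so client and server timeouts cannot be confused.
// In ClientConfiguration (existing name from the diff above):
protected static final String WAIT_TIMEOUT_ON_BACKPRESSURE = "waitTimeoutOnBackpressureMs";
// In ServerConfiguration (assumed, illustrative name):
protected static final String WAIT_TIMEOUT_ON_RESPONSE_BACKPRESSURE = "waitTimeoutOnResponseBackpressureMs";
```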
retest this please
 * 0 to allow request to fail immediately
 * Default is -1 (disabled)
 */
public long getWaitTimeoutOnBackpressureMillis() {
If it is from the client, how does the client know whether the bookie is slow/dead/hung or whether it is something backpressure related? Also, how does this interact with the rest of the timeouts, say, the read/write timeout?
All the client sees is that the connection to the bookie still exists but data cannot be sent over it at the moment (or cannot be sent fast enough to clear the existing backlog), and Netty reports channel.isWritable() == false.
The rest is handled by client-side backpressure.
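As a minimal sketch of what this looks like from the client side (this is not the PR's client code; the handler name is made up), a Netty handler can observe the writability flips and let the client-side backpressure logic pause or resume sending:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical client-side handler: the connection stays open, but when the
// outbound buffer fills past Netty's high water mark, isWritable() becomes false.
public class WritabilityAwareHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        Channel ch = ctx.channel();
        if (!ch.isWritable()) {
            // Server is applying backpressure (or is just slow to drain):
            // stop submitting new requests until the channel becomes writable again.
        } else {
            // Buffer drained below the low water mark: resume sending.
        }
        super.channelWritabilityChanged(ctx);
    }
}
```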
 *
 * @return Max number of adds in progress.
 */
public int getMaxAddsInProgress() {
This sounds like a stat rather than config. Maybe change it to getMaxAddsInProgressLimit.
will do
 *
 * @return Max number of reads in progress.
 */
public int getMaxReadsInProgress() {
Same comment on the name as above. Adding "limit" or something like that will describe it better.
will do
if (size > (Integer.MAX_VALUE - 1) / 2) {
    // gives max of 2*1023MB for mem table (one being checkpointed and still writable).
    throw new IllegalArgumentException("skiplist size over " + ((Integer.MAX_VALUE - 1) / 2));
}
Another option is to cap the value at (Integer.MAX_VALUE - 1) / 2 with a warning.
I'll keep the explicit error (fail on start, fast detection) rather than silently changing the value.
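For reference, the alternative suggested above (clamp to the maximum and warn instead of failing) might look roughly like this; class, field and logger names are illustrative, not the actual EntryMemTable code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical "clamp with a warning" variant; the PR keeps the explicit error instead.
class SkipListSizeLimit {
    private static final Logger LOG = LoggerFactory.getLogger(SkipListSizeLimit.class);
    // (Integer.MAX_VALUE - 1) / 2 gives max of 2 * ~1023MB for the mem table
    // (one being checkpointed and one still writable).
    static final long MAX_SKIP_LIST_SIZE = (Integer.MAX_VALUE - 1) / 2;

    static long clamp(long requestedSize) {
        if (requestedSize > MAX_SKIP_LIST_SIZE) {
            LOG.warn("skipListSizeLimit {} exceeds maximum {}; clamping", requestedSize, MAX_SKIP_LIST_SIZE);
            return MAX_SKIP_LIST_SIZE;
        }
        return requestedSize;
    }
}
```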
this.addEntryBlockedStats = statsLogger.getOpStatsLogger(ADD_ENTRY_BLOCKED_WAIT);
this.readEntryBlockedStats = statsLogger.getOpStatsLogger(READ_ENTRY_BLOCKED_WAIT);
int maxAdds = serverCfg.getMaxAddsInProgress();
int maxAddsInProgressLimit?
will do
int maxAdds = serverCfg.getMaxAddsInProgress();
addsSemaphore = maxAdds > 0 ? new Semaphore(maxAdds, true) : null;
int maxReads = serverCfg.getMaxReadsInProgress();
Same comment
will do
final long throttlingStartTimeNanos = MathUtils.nowInNano();
channel.config().setAutoRead(false);
readsBlocked.incrementAndGet();
readsSemaphore.acquireUninterruptibly();
An info/debug message around this block would be helpful.
will add
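For context, here is a self-contained sketch of the pattern the block above follows, with the kind of debug message being asked for added. Names are simplified/hypothetical, and the real code also records blocked-time stats using throttlingStartTimeNanos.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;
import io.netty.channel.Channel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical, simplified version of the server-side read throttling path.
class ReadThrottle {
    private static final Logger LOG = LoggerFactory.getLogger(ReadThrottle.class);
    private final Semaphore readsSemaphore;
    private final AtomicLong readsBlocked = new AtomicLong();

    ReadThrottle(int maxReadsInProgressLimit) {
        this.readsSemaphore = new Semaphore(maxReadsInProgressLimit, true);
    }

    void onReadRequestStart(Channel channel) {
        if (!readsSemaphore.tryAcquire()) {
            channel.config().setAutoRead(false);      // stop Netty from pulling more requests
            readsBlocked.incrementAndGet();           // exposed as a "reads blocked" stat
            if (LOG.isDebugEnabled()) {
                LOG.debug("Too many reads in progress; blocking channel {} until a permit frees up", channel);
            }
            readsSemaphore.acquireUninterruptibly();  // wait for an in-flight read to complete
            channel.config().setAutoRead(true);       // resume pulling requests from the socket
        }
    }

    void onReadRequestFinish() {
        readsSemaphore.release();
    }
}
```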
final long throttlingStartTimeNanos = MathUtils.nowInNano();
channel.config().setAutoRead(false);
addsBlocked.incrementAndGet();
addsSemaphore.acquireUninterruptibly();
An info/debug message would be helpful here too.
will add
…ing of non-responsive client
…ime when test started. Trying to fix + add more logging
retest this please
sijie
left a comment
Great work! @dlg99
protected void onAddRequestStart(Channel channel) {
    if (addsSemaphore != null) {
        if (!addsSemaphore.tryAcquire()) {
nit:
Currently all the throttling is request-based. It might be good to do throttling based on bytes, because that is much easier to translate to bandwidth usage than request counts. #1367 is one ask for that.
That can be a future improvement.
@sijie This is partially addressed in the memtable flush scenario, but I wanted to keep the configuration consistent across read and write requests (as in: a requests-in-progress limit), and in the case of reads one does not know the size of the entry until the index is read.
We can think about byte-based configuration later, but I feel it is not going to be a high priority if backpressure is dynamic (as with this change) and reacts to actual throughput, rather than the rate-based throttling the client used to have.
@jvrao @eolivelli I made the changes to address your comments. Can you please take another look?
eolivelli
left a comment
+1
Great work
#shipit
jvrao
left a comment
LGTM
…pache#1410) (apache#167) (@bug W-3651831@) backpressure: server-side backpressure @Rev vjujjuri@ @Rev sijie@ @Rev eolivelli@
Descriptions of the changes in this PR: *Motivation* An unexpected behavior was observed when using sorted ledger storage with the single entry log manager in production: "the entry log files are rotated very frequently and small entry log files are produced". The problem was introduced by #1410. In the current entry logger, when a new entry log file is created, EntryLogger notifies its listeners that a new entry log file has been rotated via [`onRotateEntryLog`](https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/EntryLogManagerBase.java#L152). Before the change in #1410, `SortedLedgerStorage` inherited from `InterleavedLedgerStorage`, so when a new entry log file was rotated, `SortedLedgerStorage` was notified. However, after the change in #1410, `SortedLedgerStorage` doesn't inherit from `InterleavedLedgerStorage` anymore; the relationship was changed to composition, and `SortedLedgerStorage` is composed using an interleaved ledger storage. So the entry log listener contract was broken, and `SortedLedgerStorage` no longer receives any `onRotateEntryLog` notification. *Changes* When `SortedLedgerStorage` initializes, it passes its own entry log listener down to the interleaved ledger storage, so the entry logger can notify the right listener of entry log rotations. *Tests* Existing tests should cover most of the cases. Looking into how to add new test cases. Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Charan Reddy Guttapalem <reddycharan18@gmail.com>, Andrey Yegorov <None> This closes #1807 from sijie/fix_rotation_behavior
--- *Motivation* If one bookie is slow (not down, just slow), the BK client will ack to the user that the entries are written after the first 2 acks. In the meantime, it will keep waiting for the 3rd bookie to respond. If the bookie responds within the timeout, the entries can then be dropped from memory; otherwise the write will time out internally and be replayed to a new bookie. In the V3 request, we have [server-side backpressure](#1410) that influences the client-side behavior. We should apply the same changes to the V2 request. That would help this [issue](apache/pulsar#14861) to be resolved. *Modification* - Apply the change #1410 to the V2 protocol Descriptions of the changes in this PR:
Added server side backpressure handling and related unit tests.
Background:
BK’s writes happen in this order on the server side:
First, the ledger storage (Interleaved|Sorted), presumably non-blocking to some degree.
Second, the journal.
A request is finished when the journal's write is fsynced.
Three major moving parts on the server side need to be taken into account: the request-processing threads (requests in progress), the asynchronous ledger storage write path, and the response path back to the client.
Implementation:
Requests in progress (RIPs) are requests being processed by threads in the threadpool plus requests waiting for the next available thread. A RIP's lifetime runs from the moment the request is received/read by Netty to the moment the response for the request is sent.
The target limit of RIPs (a heuristic) is ((number of processing threads) * 2 + (expected max batch size on journal)), so each thread can have one request to process and the next one waiting.
It is assumed that we have enough memory to keep data for that many requests. It is impossible to estimate the size of a read request at the moment it is received anyway.
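As a purely illustrative worked example: with 8 request-processing threads and an expected maximum journal batch of about 500 adds, the heuristic gives 8 * 2 + 500 = 516 requests in progress as the limit.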
The limit is configured by setting the number of RIPs explicitly in the config, for the following reasons:
It is easier to experiment with different numbers, e.g. ((number of processing threads) + 2 * (expected max batch size on journal)) or simply (2 * (expected max batch size on journal)).
There is an option to run requests directly on the Netty thread, so there is no config parameter to base an initial value on, and Netty's defaults can change between versions.
It removes the need for an explicit enable/disable backpressure flag; setting the RIP limit to zero disables it instead.
Netty's autoread is paused when the limit is exceeded, to prevent Netty from pulling more data before we track it as a RIP.
Limit the number of requests in the asynchronous write path (LedgerStorage):
InterleavedLedgerStorage will naturally block if the write is slowed down by, e.g., fsync.
SortedLedgerStorage had a naive throttling implementation that blocked a request for 1ms if a checkpoint (memtable flush) was in progress. This is replaced with blocking until space in the memtable is available, as sketched below. The limit is set to 2 * (skipListSize), where skipListSize is the limit that triggers a memtable flush.
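A minimal sketch of the "block until space is available" behavior, assuming a simple byte counter; the actual EntryMemTable/SortedLedgerStorage implementation differs in detail:

```java
// Hypothetical helper: admits writes only while the two memtables
// (one flushing, one writable) together stay under 2 * skipListSizeLimit bytes.
class MemTableSpaceGate {
    private final long maxBytes;
    private long bytesInFlight = 0;

    MemTableSpaceGate(long skipListSizeLimit) {
        this.maxBytes = 2 * skipListSizeLimit;
    }

    // Add path: wait while both memtables are full.
    synchronized void reserve(long entrySize) throws InterruptedException {
        while (bytesInFlight + entrySize > maxBytes) {
            wait();                  // notified when a checkpoint (memtable flush) frees space
        }
        bytesInFlight += entrySize;
    }

    // Checkpoint path: a flushed snapshot frees its bytes.
    synchronized void release(long flushedBytes) {
        bytesInFlight -= flushedBytes;
        notifyAll();
    }
}
```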
Drop the response after a timeout (the client will not hear about that request) or close the channel (the disconnect notifies the client that responses to requests from that connection will never arrive), as sketched below.
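Roughly, the decision on the response path could be sketched as follows; the timeout semantics (-1 disabled, 0 act immediately, positive wait) come from the config javadoc above, but the method and the close-channel-on-timeout flag are illustrative, not the actual code:

```java
import java.util.concurrent.TimeUnit;
import io.netty.channel.Channel;

// Hypothetical sketch of what happens when a response cannot be flushed
// because the client's channel is not writable.
final class BackpressureResponsePolicy {
    static void onResponseBlocked(Channel channel, long blockedSinceNanos,
                                  long waitTimeoutOnBackpressureMillis, boolean closeChannelOnTimeout) {
        if (waitTimeoutOnBackpressureMillis < 0) {
            return; // -1: timeout handling disabled, keep waiting for the channel to drain
        }
        long blockedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - blockedSinceNanos);
        if (blockedMillis >= waitTimeoutOnBackpressureMillis) {
            if (closeChannelOnTimeout) {
                channel.close(); // disconnect: the client learns all pending requests on this connection failed
            }
            // otherwise simply drop the response; the client never hears about this request
        }
    }
}
```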
Master Issue: #1409