
Conversation

@thetumbled
Member

@thetumbled thetumbled commented Nov 8, 2022

Fixes #18388

Motivation

ConcurrentLongLongPairHashMap uses optimistic locking to read data in a Section. If another thread is concurrently removing data from the same Section and triggers the shrink process, the index calculated from the stale capacity can exceed the new array size, and an ArrayIndexOutOfBoundsException is thrown, which is not handled now. Eventually the connection with the client is closed.

There are same problems with org.apache.pulsar.common.util.collections.ConcurrentLongHashMap, which is referenced in many places.

Modifications

Use a local backup variable for the table reference, so the out-of-bounds access cannot happen.
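As a minimal standalone illustration (not the actual Section code, and `signSafeMod`/field names here are simplified assumptions), the idea is to snapshot the table reference once and derive both capacity and bucket from that snapshot. A concurrent rehash replaces `this.table` with a new array, but it can never shrink the array the snapshot points to, so the computed index stays in bounds:

```java
// Simplified sketch of the fix: snapshot the table reference once, then
// derive capacity and bucket from that snapshot only.
public class SnapshotDemo {
    // 4 longs per entry (key1, key2, value1, value2); capacity is a power of two
    static volatile long[] table = new long[4 * 16];

    static int signSafeMod(long n, int capacity) {
        // power-of-two modulo, scaled to 4 slots per entry
        return (int) (n & (capacity - 1)) << 2;
    }

    static long readFirstKey(long keyHash) {
        long[] localTable = table;              // local backup variable
        int capacity = localTable.length / 4;   // capacity derived from the snapshot
        int bucket = signSafeMod(keyHash, capacity);
        return localTable[bucket];              // always within localTable's bounds
    }

    public static void main(String[] args) {
        readFirstKey(42);
        table = new long[4 * 4];                // simulate a concurrent shrink
        readFirstKey(42);                       // still safe: indexes its own snapshot
        System.out.println("no ArrayIndexOutOfBoundsException");
    }
}
```

Without the snapshot, a reader that computes the bucket from one `this.table` and then indexes into a smaller, freshly rehashed `this.table` can go out of bounds.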

Verifying this change

  • Make sure that the change passes the CI checks.
    This change is already covered by existing tests.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: thetumbled#5

@github-actions

github-actions bot commented Nov 8, 2022

@thetumbled Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-label-missing doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Nov 8, 2022
@lhotari lhotari requested review from Jason918 and eolivelli November 8, 2022 12:46
@lhotari
Member

lhotari commented Nov 8, 2022

This seems to be a bug in the changes introduced by #14515. @lordcheng10 please review.

@lhotari
Member

lhotari commented Nov 8, 2022

@BewareMyPower
Contributor

But I'm still not sure about the following logic

                    // First try optimistic locking
                    long storedKey1 = table[bucket];
                    long storedKey2 = table[bucket + 1];
                    long storedValue1 = table[bucket + 2];
                    long storedValue2 = table[bucket + 3];

                    if (!acquiredLock && validate(stamp)) {
                        // The values we have read are consistent
                        if (key1 == storedKey1 && key2 == storedKey2) {
                            return new LongPair(storedValue1, storedValue2);

Even if storedKey1 and storedKey2 match the expected keys, are storedValue1 and storedValue2 guaranteed to be exactly the associated values? This PR covers the case where the table is shrinking, which makes the index invalid. However, could it happen that table[bucket + 2] and table[bucket + 3] have been modified by the time they are read? @lhotari @lordcheng10 @thetumbled

@thetumbled thetumbled force-pushed the fixbug_ArrayIndexOutOfBoundsException branch from 8290967 to 5c73403 Compare November 9, 2022 01:50
@thetumbled
Member Author

But I'm still not sure about the following logic

                    // First try optimistic locking
                    long storedKey1 = table[bucket];
                    long storedKey2 = table[bucket + 1];
                    long storedValue1 = table[bucket + 2];
                    long storedValue2 = table[bucket + 3];

                    if (!acquiredLock && validate(stamp)) {
                        // The values we have read are consistent
                        if (key1 == storedKey1 && key2 == storedKey2) {
                            return new LongPair(storedValue1, storedValue2);

Even if storedKey1 and storedKey2 match the expected keys, are storedValue1 and storedValue2 guaranteed to be exactly the associated values? This PR covers the case where the table is shrinking, which makes the index invalid. However, could it happen that table[bucket + 2] and table[bucket + 3] have been modified by the time they are read? @lhotari @lordcheng10 @thetumbled

If table[bucket + 2] and table[bucket + 3] are modified, some thread must have acquired the write lock, which makes validate(stamp) return false and falls back to the pessimistic read.
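The validate-then-fallback pattern above can be shown with a minimal standalone StampedLock example (an illustration only, not the Pulsar Section code): any writer takes the write lock, which invalidates every optimistic stamp issued before the write, so a reader whose unguarded reads may have raced a writer is forced onto the pessimistic path.

```java
import java.util.concurrent.locks.StampedLock;

// Minimal illustration of the optimistic-read pattern used in Section#get.
public class OptimisticReadDemo {
    private final StampedLock lock = new StampedLock();
    private long value1, value2;

    long[] read() {
        long stamp = lock.tryOptimisticRead();
        long v1 = value1, v2 = value2;       // unguarded reads
        if (!lock.validate(stamp)) {         // a writer slipped in: retry pessimistically
            stamp = lock.readLock();
            try {
                v1 = value1;
                v2 = value2;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return new long[]{v1, v2};
    }

    void write(long a, long b) {
        long stamp = lock.writeLock();       // invalidates outstanding optimistic stamps
        try {
            value1 = a;
            value2 = b;
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}
```

Note that validate only detects a concurrent write lock acquisition; it does not protect against indexing a stale array with a bucket computed from a new capacity, which is exactly the gap this PR closes.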

@labuladong
Contributor

Nice catch! And I think the ConcurrentLongLongPairHashMap.Section#forEach method may have the same problem. Could you help to confirm it?


@liangyepianzhou
Contributor

And ConcurrentLongLongPairHashMap.Section#remove has the same issue. Please also handle it.

@thetumbled
Member Author

thetumbled commented Nov 10, 2022

And ConcurrentLongLongPairHashMap.Section#remove has the same issue. Please also handle it.

I don't find that ConcurrentLongLongPairHashMap.Section#remove uses any optimistic read lock; it only takes the write lock.

@labuladong
Contributor

Yes, only tryOptimisticRead may cause this issue.

@liangyepianzhou
Contributor

Oh, my fault, there is a writeLock in ConcurrentLongLongPairHashMap.Section#remove.

@thetumbled
Member Author

Nice catch! And I think the ConcurrentLongLongPairHashMap.Section#forEach method may have the same problem. Could you help to confirm it?

I find that even if some thread acquires the write lock and triggers the shrink/expand process, the local variable table will not be updated, so the index used in the loop will not exceed the bounds.

@dragonls
Contributor

dragonls commented Nov 10, 2022

The ConcurrentLongLongPairHashMap comes from https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/util/collections/ConcurrentLongLongHashMap.java

Maybe the same bug exists in BookKeeper, or does it only occur when getting the stored value?

@labuladong
Contributor

labuladong commented Nov 10, 2022

the local variable table will not be updated.

@thetumbled That's not true. The table variable is a reference, not a copy, so its length may change.

@thetumbled
Member Author

thetumbled commented Nov 10, 2022

the local variable table will not be updated.

@thetumbled It's not true. table variable is a reference not a copy, so its length may change.

In the rehash method, the member this.table of Section is assigned a newly allocated long[] array, which is different from the old table array, so existing local references are unaffected.

        private void rehash(int newCapacity) {
            long[] newTable = new long[4 * newCapacity];
            ...
            table = newTable;
            ...
        }

For example:

    public static void main(String[] args) {
        int[] table = new int[]{1, 2};
        int[] localTable = table;
        System.out.println(localTable.length);
        // rehash table to new array.
        table = new int[]{3};
        System.out.println(localTable.length);
    }

the result is `2` printed twice, because localTable still references the original two-element array.

@labuladong
Contributor

Oh I see, thanks for your explanation.

@thetumbled
Member Author

The ConcurrentLongLongPairHashMap comes from https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/util/collections/ConcurrentLongLongHashMap.java

Maybe there is the same bug in bookkeeper, or it only occur in getting stored value?

It seems that the same bug exists in BookKeeper.

@labuladong
Contributor

labuladong commented Nov 10, 2022

In the rehash method, the member this.table of Section is assigned a newly allocated long[] array, which is different from the old table array, so existing local references are unaffected.

        private void rehash(int newCapacity) {
            long[] newTable = new long[4 * newCapacity];
            ...
            table = newTable;
            ...
        }

This inspires me. Can we fix the get issue in the same way?

            boolean acquiredLock = false;
            // add local variable here, so OutOfBound won't happen
            long[] table = this.table;
            // calculate table.length / 4 as capacity to avoid rehash changing capacity
            int bucket = signSafeMod(keyHash, table.length / 4);

            try {
                while (true) {
                    long storedKey1 = table[bucket];
                    long storedKey2 = table[bucket + 1];
                    long storedValue1 = table[bucket + 2];
                    long storedValue2 = table[bucket + 3];
                    
                    // compare table to ensure not rehash
                    if (!acquiredLock && validate(stamp) && table == this.table) {
                        if (key1 == storedKey1 && key2 == storedKey2) {
                            return new LongPair(storedValue1, storedValue2);
                        } else if (storedKey1 == EmptyKey) {
                            return null;
                        }
                    } else {
                        if (!acquiredLock) {
                            stamp = readLock();
                            acquiredLock = true;
                            // update local variable
                            table = this.table;
                            bucket = signSafeMod(keyHash, capacity);
                            storedKey1 = table[bucket];
                            storedKey2 = table[bucket + 1];
                            storedValue1 = table[bucket + 2];
                            storedValue2 = table[bucket + 3];
                        }

@codelipenghui
Contributor

/pulsarbot run-failure-checks

@thetumbled
Member Author

/pulsarbot run-failure-checks

@codelipenghui codelipenghui merged commit 2aa8c3b into apache:master Sep 15, 2023
@codelipenghui codelipenghui added type/bug The PR fixed a bug or issue reported a bug category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost area/broker and removed Stale labels Sep 15, 2023
poorbarcode pushed a commit that referenced this pull request Oct 7, 2023
shibd pushed a commit that referenced this pull request Oct 21, 2023

Labels

area/broker category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-2.11 cherry-picked/branch-3.0 cherry-picked/branch-3.1 doc-not-needed Your PR changes do not impact docs release/2.10.7 release/2.11.3 release/3.0.2 release/3.1.1 type/bug The PR fixed a bug or issue reported a bug


Development

Successfully merging this pull request may close these issues.

[Bug] ArrayIndexOutOfBoundsException caused by optimistic locking in ConcurrentLongLongPairHashMap implementation