Skip to content

Do not suppress exceptions in web disk (and fix retries for requests from web disk)#42800

Merged
kssenii merged 7 commits intoClickHouse:masterfrom
azat:disks/web-fix
Nov 20, 2022
Merged

Do not suppress exceptions in web disk (and fix retries for requests from web disk)#42800
kssenii merged 7 commits intoClickHouse:masterfrom
azat:disks/web-fix

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Oct 30, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Do not suppress exceptions in web disk. Fix retries for web disk.

Before this patch it is possible that web disk will suppress it and instead of got an exception during read you will get just zero records:

2022.10.28 15:18:30.739698 [ 10663 ] {} <Error> ReadWriteBufferFromHTTP: HTTP request to `http://127.0.0.1:8080/store/a3e/a3e65e1f-5fd4-47ed-9dbd-307f2586b52d/detached/.index` failed at try 1/1 with bytes read: 0/unknown. Error: Connection refused. (Current backoff wait is 100/1600 ms)
2022.10.28 15:18:30.841210 [ 10663 ] {} <Trace> DiskWeb: Cannot load disk metadata. Error: Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 22.11.1.1...

Here this exception received on initialization, from server context, i.e. w/o query context and that's why it hadn't throw, and eventually you will just zero records:

SELECT *
FROM data_from_web

Query id: ee544a5e-3c67-4fb4-8f14-f8e4a082b237

Ok.

0 rows in set. Elapsed: 0.019 sec.

Cc: @kssenii
Cc: @alexey-milovidov

@robot-ch-test-poll robot-ch-test-poll added the pr-bugfix Pull request with bugfix, not backported by default label Oct 30, 2022
@yakov-olkhovskiy yakov-olkhovskiy self-assigned this Oct 31, 2022
@azat azat marked this pull request as draft November 1, 2022 11:29
@azat azat force-pushed the disks/web-fix branch 2 times, most recently from dabb495 to bd1b77e Compare November 11, 2022 08:05
@azat azat marked this pull request as ready for review November 11, 2022 08:05
@kssenii kssenii self-assigned this Nov 11, 2022
azat added 6 commits November 12, 2022 12:37
Before this patch it is possible that web disk will suppress it and
instead of got an exception during read you will get just zero records:

    2022.10.28 15:18:30.739698 [ 10663 ] {} <Error> ReadWriteBufferFromHTTP: HTTP request to `http://127.0.0.1:8080/store/a3e/a3e65e1f-5fd4-47ed-9dbd-307f2586b52d/detached/.index` failed at try 1/1 with bytes read: 0/unknown. Error: Connection refused. (Current backoff wait is 100/1600 ms)
    2022.10.28 15:18:30.841210 [ 10663 ] {} <Trace> DiskWeb: Cannot load disk metadata. Error: Poco::Exception. Code: 1000, e.code() = 111, Connection refused (version 22.11.1.1...

Here this exception received on initialization, from server context,
i.e. w/o query context and that's why it hadn't throw, and eventually
you will just zero records:

    SELECT *
    FROM data_from_web

    Query id: ee544a5e-3c67-4fb4-8f14-f8e4a082b237

    Ok.

    0 rows in set. Elapsed: 0.019 sec.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@azat
Copy link
Copy Markdown
Member Author

azat commented Nov 18, 2022

Failures are unrelated.

AST fuzzer (asan) — Received signal 11

2022.11.15 16:27:42.137939 [ 425 ] {} <Fatal> BaseDaemon: 3.1. inlined from ./build_docker/../src/Analyzer/Passes/QueryAnalysisPass.cpp:1523: DB::(anonymous namespace)::QueryAnalyzer::collectScopeValidIdentifiersForTypoCorrection(DB::Identifier const&, DB::(anonymous namespace)::IdentifierResolveScope const&, bool, bool, bool, std::__1::unordered_set<DB::Identifier, std::__1::hash<DB::Identifier>, std::__1::equal_to<DB::Identifier>, std::__1::allocator<DB::Identifier>>&)

Performance Comparison Aarch64 [3/4] — 2 errors, 2 too long

questdb_sum_float64 | DB::Exception: Estimated query execution time (349.03809364490064 seconds) is too long. Maximum: 300. Estimated rows to process: 500000000: While executing Numbers. Stack trace:

Stress test (msan) — OOM killer (or signal 9) in clickhouse-server.log

Likely due to slow MergeTree parts removal

Details
2022-11-16 00:31:29 Thread 1 (Thread 0x7f99f5f2dd40 (LWP 756)):
2022-11-16 00:31:29 #0  0x00007f99f620cbff in ?? () from /lib/x86_64-linux-gnu/libc.so.6
2022-11-16 00:31:29 #1  0x000000000bf96598 in __msan::MsanDeallocate(__sanitizer::StackTrace*, void*) ()
2022-11-16 00:31:29 #2  0x000000000bf9a638 in free ()
2022-11-16 00:31:29 #3  0x00000000281609a7 in Allocator<false, false>::freeNoTrack (this=this@entry=0x7100000d3b48, buf=0x2f9786954000, buf@entry=0x7f9786954000, size=size@entry=1048639) at ../src/Common/Allocator.h:257
2022-11-16 00:31:29 #4  0x00000000281606c6 in Allocator<false, false>::free (this=0x7100000d3b48, buf=0x7f9786954000, size=1048639) at ../src/Common/Allocator.h:105
2022-11-16 00:31:29 #5  0x0000000028373be3 in DB::Memory<Allocator<false, false> >::dealloc (this=0x7100000d3b48) at ../src/IO/BufferWithOwnMemory.h:140
2022-11-16 00:31:29 #6  DB::Memory<Allocator<false, false> >::~Memory (this=0x7100000d3b48) at ../src/IO/BufferWithOwnMemory.h:54
2022-11-16 00:31:29 #7  DB::BufferWithOwnMemory<DB::SeekableReadBuffer>::~BufferWithOwnMemory (this=0x7100000d3b00) at ../src/IO/BufferWithOwnMemory.h:150
2022-11-16 00:31:29 #8  DB::ReadBufferFromFileBase::~ReadBufferFromFileBase (this=0x7100000d3b00) at ../src/IO/ReadBufferFromFileBase.cpp:25
2022-11-16 00:31:29 #9  0x000000003f09ab6e in DB::ReadBufferFromFileDescriptor::~ReadBufferFromFileDescriptor (this=0x7100000d3b00) at ../src/IO/ReadBufferFromFileDescriptor.h:14
2022-11-16 00:31:29 #10 DB::ReadBufferFromFilePReadWithDescriptorsCache::~ReadBufferFromFilePReadWithDescriptorsCache (this=0x7100000d3b00) at ../src/IO/ReadBufferFromFile.h:78
2022-11-16 00:31:29 #11 0x000000003f09ac0d in DB::ReadBufferFromFilePReadWithDescriptorsCache::~ReadBufferFromFilePReadWithDescriptorsCache (this=0x2f9786954000) at ../src/IO/ReadBufferFromFile.h:78
2022-11-16 00:31:29 #12 0x0000000040f80cec in std::__1::default_delete<DB::ReadBufferFromFileBase>::operator()[abi:v15003](DB::ReadBufferFromFileBase*) const (this=0x7ffff0a6da88, __ptr=0x7100000d3b00) at ../contrib/libcxx/include/__memory/unique_ptr.h:48
2022-11-16 00:31:29 #13 std::__1::unique_ptr<DB::ReadBufferFromFileBase, std::__1::default_delete<DB::ReadBufferFromFileBase> >::reset[abi:v15003](DB::ReadBufferFromFileBase*) (this=0x7ffff0a6da88, __p=0x0) at ../contrib/libcxx/include/__memory/unique_ptr.h:305
2022-11-16 00:31:29 #14 std::__1::unique_ptr<DB::ReadBufferFromFileBase, std::__1::default_delete<DB::ReadBufferFromFileBase> >::~unique_ptr[abi:v15003]() (this=0x7ffff0a6da88) at ../contrib/libcxx/include/__memory/unique_ptr.h:259
2022-11-16 00:31:29 #15 DB::UnlinkFileOperation::execute (this=0x70400049a140) at ../src/Disks/ObjectStorages/MetadataStorageFromDiskTransactionOperations.cpp:68
2022-11-16 00:31:29 #16 0x0000000040f8cafc in DB::UnlinkMetadataFileOperation::execute (this=0x70400069e8c0, metadata_lock=...) at ../src/Disks/ObjectStorages/MetadataStorageFromDiskTransactionOperations.cpp:322
2022-11-16 00:31:29 #17 0x0000000040f658f3 in DB::MetadataStorageFromDiskTransaction::commit (this=0x70500042c5d8) at ../src/Disks/ObjectStorages/MetadataStorageFromDisk.cpp:186
2022-11-16 00:31:29 #18 0x0000000040e151c6 in DB::DiskObjectStorageTransaction::commit (this=0x70700022aac8) at ../src/Disks/ObjectStorages/DiskObjectStorageTransaction.cpp:738
2022-11-16 00:31:29 #19 0x0000000040d93bc9 in DB::DiskObjectStorage::removeSharedFiles (this=<optimized out>, files=..., keep_all_batch_data=false, file_names_remove_metadata_only=...) at ../src/Disks/ObjectStorages/DiskObjectStorage.cpp:267
2022-11-16 00:31:29 #20 0x00000000474a7bcf in DB::DataPartStorageOnDisk::clearDirectory (this=this@entry=0x708000448c18, dir=..., can_remove_shared_data=true, names_not_to_remove=..., checksums=..., skip_directories=..., is_temp=<optimized out>, state=DB::MergeTreeDataPartState::Deleting, log=0x7040000ffd00, is_projection=<optimized out>) at ../src/Storages/MergeTree/DataPartStorageOnDisk.cpp:349
2022-11-16 00:31:29 #21 0x000000004749e640 in DB::DataPartStorageOnDisk::remove(std::__1::function<DB::CanRemoveDescription ()>&&, DB::MergeTreeDataPartChecksums const&, std::__1::list<DB::IDataPartStorage::ProjectionChecksums, std::__1::allocator<DB::IDataPartStorage::ProjectionChecksums> >, bool, DB::MergeTreeDataPartState, Poco::Logger*) (this=0x708000448c18, can_remove_callback=..., checksums=..., projections=..., is_temp=<optimized out>, state=state@entry=DB::MergeTreeDataPartState::Deleting, log=0x7040000ffd00) at ../src/Storages/MergeTree/DataPartStorageOnDisk.cpp:301
2022-11-16 00:31:29 #22 0x00000000475f4956 in DB::IMergeTreeDataPart::remove (this=0x719000533418) at ../src/Storages/MergeTree/IMergeTreeDataPart.cpp:1670
2022-11-16 00:31:29 #23 0x00000000477fa6ee in DB::MergeTreeData::clearPartsFromFilesystemImpl (this=this@entry=0x71e000de6020, parts_to_remove=..., part_names_succeed=<optimized out>, part_names_succeed@entry=0x7ffff0a6eab0) at ../src/Storages/MergeTree/MergeTreeData.cpp:1974
2022-11-16 00:31:29 #24 0x00000000477f8ca5 in DB::MergeTreeData::clearPartsFromFilesystem (this=this@entry=0x71e000de6020, parts=..., throw_on_error=true, parts_failed_to_delete=parts_failed_to_delete@entry=0x0) at ../src/Storages/MergeTree/MergeTreeData.cpp:1915
2022-11-16 00:31:29 #25 0x00000000477f847e in DB::MergeTreeData::clearOldPartsFromFilesystem (this=0x71e000de6020, force=true) at ../src/Storages/MergeTree/MergeTreeData.cpp:1882
2022-11-16 00:31:29 #26 0x00000000481f4a0d in DB::StorageMergeTree::shutdown (this=0x71e000de6020) at ../src/Storages/StorageMergeTree.cpp:200
2022-11-16 00:31:29 #27 0x0000000040ce4e52 in DB::IStorage::flushAndShutdown (this=0x71e000de6020) at ../src/Storages/IStorage.h:534
2022-11-16 00:31:29 #28 DB::DatabaseWithOwnTablesBase::shutdown (this=0x713000278318) at ../src/Databases/DatabasesCommon.cpp:293
2022-11-16 00:31:29 #29 0x0000000042443d22 in DB::DatabaseCatalog::shutdownImpl (this=0x71b000002300) at ../src/Interpreters/DatabaseCatalog.cpp:198
2022-11-16 00:31:29 #30 0x000000004245a47f in DB::DatabaseCatalog::shutdown () at ../src/Interpreters/DatabaseCatalog.cpp:685
2022-11-16 00:31:29 #31 0x000000004228ca10 in DB::ContextSharedPart::shutdown (this=0x723000003800) at ../src/Interpreters/Context.cpp:443
2022-11-16 00:31:29 #32 0x0000000042268312 in DB::Context::shutdown (this=<optimized out>) at ../src/Interpreters/Context.cpp:3049
2022-11-16 00:31:29 #33 0x00000000284af28f in DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_10::operator()() const (this=0x7ffff0a70ac0) at ../programs/server/Server.cpp:1494
2022-11-16 00:31:29 #34 BasicScopeGuard<DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_10>::invoke() (this=0x7ffff0a70ac0) at ../base/base/scope_guard.h:99
2022-11-16 00:31:29 #35 BasicScopeGuard<DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&)::$_10>::~BasicScopeGuard() (this=this@entry=0x7ffff0a70ac0) at ../base/base/scope_guard.h:48
2022-11-16 00:31:29 #36 0x000000002848e222 in DB::Server::main (this=0x7ffff0a70ef0) at ../programs/server/Server.cpp:1859
2022-11-16 00:31:29 #37 0x00000000545bc696 in Poco::Util::Application::run (this=0x7ffff0a70ef0) at ../contrib/poco/Util/src/Application.cpp:334
2022-11-16 00:31:29 #38 0x000000002844add9 in DB::Server::run (this=0x7ffff0a70ef0) at ../programs/server/Server.cpp:482
2022-11-16 00:31:29 #39 0x000000005461db5b in Poco::Util::ServerApplication::run (this=0x7ffff0a70ef0, argc=6, argv=0x70300001fd70) at ../contrib/poco/Util/src/ServerApplication.cpp:611
2022-11-16 00:31:29 #40 0x00000000284409ec in mainEntryClickHouseServer (argc=6, argv=0x70300001fd70) at ../programs/server/Server.cpp:188
2022-11-16 00:31:29 #41 0x000000000bff75a8 in main (argc_=<optimized out>, argv_=<optimized out>) at ../programs/main.cpp:482

Stress test (tsan) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

2022.11.16 06:11:58.093340 [ 327989 ] {} <Error> edba3962-7f71-411e-bee0-7c71fcb477b2::all_0_0_26_33(MutateFromLogEntryTask): virtual bool DB::ReplicatedMergeMutateTaskBase::executeStep(): Code: 47. DB::Exception: Missing columns: 'value3' while processing query: 'value1, value3, value5, value0, key, key, value0, value5, value3, value1', required columns: 'value1' 'value3' 'value5' 'value0' 'key', maybe you meant: ['value1','value0','value5','value0','key']. (UNKNOWN_IDENTIFIER), Stack trace (when copying this message, always include the lines below):

@azat azat changed the title Do not suppress exceptions in web disk Do not suppress exceptions in web disk (and fix retries for requests from web disk) Nov 18, 2022
@azat azat requested a review from kssenii November 20, 2022 12:43
@kssenii kssenii merged commit f0dbfbb into ClickHouse:master Nov 20, 2022
@azat azat deleted the disks/web-fix branch November 20, 2022 15:11
azat added a commit to azat/ClickHouse that referenced this pull request Nov 22, 2022
While dropping the table server scans all disk for existence of data to
remove, but in case of unavailable web disk it will fail, and also not
instantly but after 10 retries (by default, after ClickHouse#42800), that can take
~30 seconds.

But scanning read only disks does not makes any sense, so reorder
conditions to avoid this.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants