Skip to content

Keep track of the number of async tables loading jobs. If there are some running jobs, do not update tail_ptr in TransactionLog::removeOldEntries#82824

Merged
tuanpach merged 1 commit intoClickHouse:masterfrom
tuanpach:fix-issue-60406
Jul 9, 2025
Merged

Conversation

@tuanpach
Copy link
Copy Markdown
Member

@tuanpach tuanpach commented Jun 29, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Keep track of the number of async tables loading jobs. If there are some running jobs, do not update tail_ptr in TransactionLog::removeOldEntries

Because loadTableFromMetadataAsync is running asynchronously, it is possible that the outdated parts are loading while the tail_ptr is updated here. It might trigger assertion when the part create_csn is lower than tail_ptr. Refer: #60406

We keep track of asyncTablesLoadingJobNumber, and not update tail_ptr if there are running jobs.

Closes #60406
Closes #82718

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@tuanpach tuanpach added the can be tested Allows running workflows for external contributors label Jun 29, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jun 29, 2025

Workflow [PR], commit [349c461]

Summary:

job_name test_name status info comment
Build (fuzzers) error
Finish Workflow failure
python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py failure

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 29, 2025
@tuanpach tuanpach force-pushed the fix-issue-60406 branch 3 times, most recently from 14123dd to e441e0e Compare June 30, 2025 06:09
Comment on lines +653 to +663
void TransactionLog::increaseAsyncTablesLoadingJobNumber()
{
async_tables_loading_job_number.fetch_add(1, std::memory_order_relaxed);
}
void TransactionLog::decreaseAsyncTablesLoadingJobNumber()
{
async_tables_loading_job_number.fetch_sub(1, std::memory_order_relaxed);
}
Int64 TransactionLog::asyncTablesLoadingJobNumber()
{
return async_tables_loading_job_number.load(std::memory_order_relaxed);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why memory_order_relaxed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed memory_order_relaxed.
I thought it might help to optimize the implementation but it seems to have some problems.
In DatabaseOrdinary::loadTableFromMetadataAsync:

LoadTaskPtr DatabaseOrdinary::loadTableFromMetadataAsync(
    AsyncLoader & async_loader,
    LoadJobSet load_after,
    ContextMutablePtr local_context,
    const String & file_path,
    const QualifiedTableName & name,
    const ASTPtr & ast,
    LoadingStrictnessLevel mode)
{
    TransactionLog::increaseAsyncTablesLoadingJobNumber();
    std::scoped_lock lock(mutex);
    auto job = makeLoadJob(
        std::move(load_after),
        TablesLoaderBackgroundLoadPoolId,
        fmt::format("load table {}", name.getFullName()),
        [this, local_context, file_path, name, ast, mode](AsyncLoader &, const LoadJobPtr &)
        {
            SCOPE_EXIT(TransactionLog::decreaseAsyncTablesLoadingJobNumber(););
            loadTableFromMetadata(local_context, file_path, name, ast, mode);
        });

Even though TransactionLog::increaseAsyncTablesLoadingJobNumber() is called before std::scoped_lock lock(mutex);, it might be called after makeLoadJob.

@tavplubix tavplubix self-assigned this Jul 2, 2025
@tuanpach tuanpach force-pushed the fix-issue-60406 branch 3 times, most recently from ad2af29 to 19002cc Compare July 4, 2025 02:23
…ome running jobs, do not update `tail_ptr` in `TransactionLog::removeOldEntries`
@tuanpach tuanpach added this pull request to the merge queue Jul 9, 2025
Merged via the queue into ClickHouse:master with commit a6587e2 Jul 9, 2025
118 of 122 checks passed
@tuanpach tuanpach deleted the fix-issue-60406 branch July 9, 2025 01:10
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 9, 2025
@azat
Copy link
Copy Markdown
Member

azat commented Jul 23, 2025

@tuanpach looks like this fixes only #60406 and not #82718

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

4 participants