Skip to content

Always schedule direct IO to threadpool for pread_threadpool#85765

Merged
azat merged 1 commit intoClickHouse:masterfrom
korowa:flaky-threadpool-rdr-localfs
Aug 21, 2025
Merged

Always schedule direct IO to threadpool for pread_threadpool#85765
azat merged 1 commit intoClickHouse:masterfrom
korowa:flaky-threadpool-rdr-localfs

Conversation

@korowa
Copy link
Copy Markdown
Contributor

@korowa korowa commented Aug 17, 2025

Closes #85689

preadv2 with RWF_NOWAIT may perform reads from device and return data (not EAGAIN) when called on fd with O_DIRECT flag. This leads to data reading being executed in query thread, instead of threadpool which is intended for pread_threadpool.

This patch skips preadv2 / pagecache checking logic when direct io is requested, so that it always executed in threadpool.

Simple reproducer out of ClickHouse context
// "iotest.c"

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>
#include <assert.h>

#define ALIGNMENT 131072
#define BUF_SIZE ALIGNMENT

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        return 1;
    }
    const char *filename = argv[1];

    int fd = open(filename, O_RDONLY | O_DIRECT | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, ALIGNMENT, BUF_SIZE) != 0) {
        perror("posix_memalign");
        close(fd);
        return 1;
    }

    memset(buf, 0, BUF_SIZE);

    struct iovec iov = {
        .iov_base = buf,
        .iov_len = BUF_SIZE
    };

    ssize_t offset = 0;
    ssize_t ret = 0;

    while (ret >= 0)
    {
        ret = syscall(SYS_preadv2, fd, &iov, 1, offset, offset >> 32, RWF_NOWAIT);

        if (ret >= 0) {
            printf("preadv2 RWF_NOWAIT succeeded, read %zd bytes\n", ret);
        } else {
            if (errno == EAGAIN) {
                printf("preadv2 RWF_NOWAIT returned EAGAIN (data not in page cache)\n");
                return 1;
            } else {
                perror("preadv2");
                return 1;
            }
        }

        if (ret == 0)
        {
            break;
        }

        offset += ret;
    }

    free(buf);
    close(fd);

    return 0;
}
# Build
clang -o iotest iotest.c

# Create non-cached data file on ext4 fs
dd if=/dev/urandom of=/tmp/trash.dat oflag=direct bs=128K count=100

# Read the file (it'll succeed, while failing for fd opened w/o O_DIRECT).
# Running it with strace + blktrace shows successful preadv2 along with
# disk reads.
./iotest /tmp/trash.dat

(or added test which demonstrates this behavior)

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

cc @kssenii

@bharatnc bharatnc added the can be tested Allows running workflows for external contributors label Aug 17, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Aug 17, 2025

Workflow [PR], commit [d9c83ad]

Summary:

job_name test_name status info comment
Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel) failure
03573_json_keys_with_dots FAIL
02443_detach_attach_partition FAIL
03595_alter_drop_column_comment_if_exists FAIL
Stateless tests (amd_tsan, s3 storage, parallel) failure
01676_clickhouse_client_autocomplete FAIL
Integration tests (amd_binary, 4/5) failure
test_postgresql_replica_database_engine/test_0.py::test_table_schema_changes FAIL
Integration tests (arm_binary, distributed plan, 3/4) failure
test_global_overcommit_tracker/test.py::test_global_overcommit FAIL

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 17, 2025
@azat azat self-assigned this Aug 17, 2025
@azat azat marked this pull request as draft August 18, 2025 08:49
@korowa
Copy link
Copy Markdown
Contributor Author

korowa commented Aug 18, 2025

Marked as draft, as the description is not completely correct -- even in cases when there are only ThreadPoolReaderPageCacheHit profile events in the query (and there are no misses), the data is still read from disk (according to OSReadBytes), so may not be related to data presence in page cache.

@korowa korowa force-pushed the flaky-threadpool-rdr-localfs branch from 8419b6f to f358ed0 Compare August 19, 2025 11:15
@korowa korowa changed the title Fix flaky test_local_fs_threadpool_reader Always schedule direct IO to threadpool for pread_threadpool Aug 19, 2025
@korowa korowa force-pushed the flaky-threadpool-rdr-localfs branch from f358ed0 to 779fc19 Compare August 19, 2025 15:28
@korowa
Copy link
Copy Markdown
Contributor Author

korowa commented Aug 20, 2025

@korowa korowa marked this pull request as ready for review August 20, 2025 05:43
Copy link
Copy Markdown
Member

@azat azat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@korowa nice catch! Patch looks good to me, please take a look at my adjustments (mostly comments in the code)

@korowa korowa force-pushed the flaky-threadpool-rdr-localfs branch 2 times, most recently from 5412009 to ce955d5 Compare August 21, 2025 15:21
preadv2 with RWF_NOWAIT may perform reads from device and
return data (not EAGAIN) when called on fd with O_DIRECT
flag. This leads to data reading being executed in query
thread, instead of threadpool which is intended for
pread_threadpool.

This patch skips preadv2 / pagecache checking logic when
direct io is requested, so that it always executed in
threadpool.

Co-authored-by: Azat Khuzhin <a3at.mail@gmail.com>
@korowa korowa force-pushed the flaky-threadpool-rdr-localfs branch from ce955d5 to d9c83ad Compare August 21, 2025 15:36
@azat
Copy link
Copy Markdown
Member

azat commented Aug 21, 2025

CI issues either known or reported

@azat azat enabled auto-merge August 21, 2025 21:43
@azat azat added this pull request to the merge queue Aug 21, 2025
@azat
Copy link
Copy Markdown
Member

azat commented Aug 21, 2025

@korowa Thanks for keeping history clean!

Merged via the queue into ClickHouse:master with commit 582ea77 Aug 21, 2025
119 of 122 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test_threadpool_readers/test.py::test_local_fs_threadpool_reader is flaky

4 participants