Streaming inserts by Sasao4o · Pull Request #94509 · ClickHouse/ClickHouse

Sasao4o · 2026-01-17T23:55:33Z

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added input_format_max_block_wait_ms setting to emit data blocks by timeout and allowed processing of remaining data when an HTTP connection is closed unexpectedly.

Example use


```javascript
const http = require('http');

const CLICKHOUSE_HOST = 'localhost';
const CLICKHOUSE_PORT = 8123;

// Small limits and disabled parallel parsing make the timeout-based
// flush observable with only a few rows.
const QUERY =
  'INSERT INTO my_first_table ' +
  'SETTINGS input_format_max_block_size_bytes=20000, ' +
  'max_insert_block_size=100, ' +
  'input_format_parallel_parsing=0, ' +
  'input_format_max_block_wait_ms=3000 ' +
  'FORMAT JSONEachRow';

const INTERVAL_MS = 100; // delay between two rows

const N = parseInt(process.argv[2], 10);
if (Number.isNaN(N) || N <= 0) {
  console.error('Usage: node script.js <N>');
  process.exit(1);
}

console.log(`Sending ${N} rows initially, then keeping connection open...`);

const options = {
  hostname: CLICKHOUSE_HOST,
  port: CLICKHOUSE_PORT,
  path: `/?max_query_size=1000&query=${encodeURIComponent(QUERY)}`,
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Transfer-Encoding': 'chunked',
    'Connection': 'keep-alive',
  },
};

let request = null;

function createRequest() {
  request = http.request(options, (res) => {
    console.log(`Status: ${res.statusCode}`);

    let responseData = '';
    res.on('data', (chunk) => { responseData += chunk; });
    res.on('end', () => {
      console.log('Response ended:', responseData);
    });
  });

  request.on('error', (err) => {
    console.error('Request error:', err.message);
  });

  sendInitialRows();
}

async function sendInitialRows() {
  for (let i = 0; i < N; i++) {
    if (request.destroyed) break;

    const timestamp = Math.floor(Date.now() / 1000);
    const rowObj = {
      id: i + 1,
      message: 'hello',
      ts: timestamp,
      code: 3,
    };
    const row = JSON.stringify(rowObj) + '\n'; // one JSONEachRow line

    request.write(row);
    console.log(`[${new Date().toISOString()}] Sent: ${row.trim()}`);

    await new Promise((r) => setTimeout(r, INTERVAL_MS));
  }

  console.log(`Finished sending ${N} rows. Connection remains open.`);
  keepConnectionOpen();
}

// Keeps the process (and the open request) alive until SIGINT.
async function keepConnectionOpen() {
  while (true) {
    await new Promise((r) => setTimeout(r, 1000));
  }
}

process.on('SIGINT', () => {
  console.log('\nStopping stream...');
  if (request && !request.destroyed) {
    // Finish the chunked body before exiting so the last rows are sent.
    request.end(() => process.exit(0));
  } else {
    process.exit(0);
  }
});

createRequest();
```

This code demonstrates streaming inserts over HTTP and shows that ClickHouse can successfully insert records even when max_insert_block_size is not reached, by flushing data based on input_format_max_block_wait_ms. It also verifies that if the connection is unexpectedly closed (for example, due to a timeout), ClickHouse still correctly parses and inserts any remaining data instead of treating the situation as an error.

This closes #41439

Sasao4o · 2026-01-18T00:00:13Z

While trying to write a Python unit test for this behavior, I noticed that emitting chunks does not necessarily mean the data is immediately inserted into the table. Based on this, I am thinking of generating a number of records smaller than max_insert_block_size and verifying that SELECT count(*) returns a value greater than zero, and that after the timeout expires, all records are eventually inserted (using JSONEachRow).

Does this sound like a reasonable way to test this feature?

Note: max_query_size must be explicitly set and input_format_parallel_parsing must be disabled (input_format_parallel_parsing = 0) so that this behavior can be observed with a small number of rows.
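The settings named in this note can be kept in one place. The following is a minimal sketch, assuming two hypothetical helpers, `buildInsertQuery` and `buildInsertPath`, which are not part of the PR. They assemble an INSERT statement like the one in the example script and the URL path that carries `max_query_size`.

```javascript
// Hypothetical helper (not part of the PR): assembles an INSERT statement
// with a SETTINGS clause from a plain object.
function buildInsertQuery(table, settings, format) {
  const clause = Object.entries(settings)
    .map(([name, value]) => `${name}=${value}`)
    .join(', ');
  return `INSERT INTO ${table} SETTINGS ${clause} FORMAT ${format}`;
}

// Hypothetical helper: builds the HTTP path and passes max_query_size
// as a URL parameter, as the example script does.
function buildInsertPath(query, maxQuerySize) {
  return `/?max_query_size=${maxQuerySize}&query=${encodeURIComponent(query)}`;
}

const query = buildInsertQuery('my_first_table', {
  max_insert_block_size: 100,
  input_format_parallel_parsing: 0,
  input_format_max_block_wait_ms: 3000,
}, 'JSONEachRow');

console.log(query);
console.log(buildInsertPath(query, 1000));
```

The resulting path can be used as the `path` option of `http.request`, in the same way as in the example script.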

clickhouse-gh · 2026-01-18T08:02:59Z

Workflow [PR], commit [1d826a9]

Summary: ✅

Fgrtue

Hi @Sasao4o. Thank you for the contribution! The changes look good.

Regarding your questions on testing.

I am thinking of generating a number of records smaller than max_insert_block_size and verifying that SELECT count(*) returns a value greater than zero, and that after the timeout expires, all records are eventually inserted (using JSONEachRow).

This could be a way to go, yes. However, I would suggest writing a bash test, since it should be more straightforward. We need to test the two behaviors that you implemented:

1. The input format should be able to form blocks of data for INSERT not only by the threshold on the number of rows or bytes but also by timeout. We could check the number of inserted parts (using one of the system tables). The parts should be "cut" according to the time delimiter.
2. When the connection is unexpectedly closed, the server should parse and process the remaining data instead of treating it as an error. This could be tested similarly to the approach above, or probably even by using SELECT count().
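To reason about how many parts a timeout test should expect, the cutting rule can be modelled in a few lines. This is only a sketch under a simplified assumption: a block is emitted when it reaches the row limit, or when a row arrives at least `waitMs` after the first row of the block. The server may cut blocks differently, and one emitted block does not necessarily become exactly one part. The function name `simulateBlocks` is hypothetical.

```javascript
// Simplified model (an assumption, not the server's actual algorithm):
// a block is emitted when it reaches maxRows, or when a row arrives
// at least waitMs after the block's first row.
function simulateBlocks(arrivalTimesMs, maxRows, waitMs) {
  const blocks = [];
  let size = 0;
  let blockStart = 0;
  for (const t of arrivalTimesMs) {
    if (size > 0 && t - blockStart >= waitMs) {
      blocks.push(size); // cut by timeout
      size = 0;
    }
    if (size === 0) blockStart = t;
    size += 1;
    if (size === maxRows) {
      blocks.push(size); // cut by row count
      size = 0;
    }
  }
  if (size > 0) blocks.push(size); // remaining rows at end of stream
  return blocks;
}

// 50 rows sent every 100 ms, as in the example script (INTERVAL_MS = 100).
const arrivals = Array.from({ length: 50 }, (_, i) => i * 100);
const blocks = simulateBlocks(arrivals, 100, 3000);
console.log(blocks); // [ 30, 20 ] under this model
```

Under this model, 50 rows at 100 ms intervals with a 3000 ms wait give two blocks, so a test built this way would look for more than one part even though max_insert_block_size is never reached.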

Sasao4o · 2026-01-23T11:16:17Z

You are welcome, my pleasure.
Yes, the system.parts idea made the first test really easy. I was wondering how to query in the middle of streaming :D Thank you.

For the second test (the connection timeout one), I want to reduce this SLEEP 31 (the default) to something smaller, but the http_receive_timeout setting does not seem to have any effect on my HTTP requests. I also made sure that I set it in the correct place.

What do you think?

Sasao4o · 2026-01-26T12:58:46Z

Never mind, I changed the test to get out of this timeout mess. However, I found out that UNEXPECTED_END_OF_FILE is thrown when the connection is dropped in the middle of streaming, so I added it to the connection error codes.

Fgrtue · 2026-01-27T18:49:18Z

@Sasao4o some of the tests are failing. We should add some tags to your tests to prevent them from being added to some test suites. For example, no-async-insert is likely to be handy for you. Please take a look at other .sh tests and add that one.

Please also check the parallel tests; they are also giving wrong results. Just to be sure, try adding SYSTEM FLUSH LOGS <log_system_tables> before you query the table. If this doesn't help, you can try querying other system tables to check how many parts were inserted (see 03723_max_insert_block_size_bytes_http.sh as a reference).

In the drop test, we can try to increase the waiting time before killing the process. This might help exclude the latency problem if it is present.

If this doesn't help, it is likely that the tests are incompatible with some other test suites; in that case we can exclude those test suites using tags as well.
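The verification step suggested here (flush the logs, then count the inserted parts) can be sketched as a pair of statements. The helper names and the table name are hypothetical, and the query is an assumption about what such a check might look like; the actual tests may use a different system table or filter. `SYSTEM FLUSH LOGS` and `system.part_log` with its `NewPart` event type are existing ClickHouse features.

```javascript
// Hypothetical helper: SQL that counts parts created for a table,
// based on system.part_log (which must be flushed before it is queried).
function newPartsCountQuery(table) {
  const escaped = table.replace(/'/g, "\\'");
  return (
    'SELECT count() FROM system.part_log ' +
    `WHERE database = currentDatabase() AND table = '${escaped}' ` +
    "AND event_type = 'NewPart'"
  );
}

// Hypothetical helper: statements to run, in order, before comparing
// the result with the reference file.
function verificationStatements(table) {
  return ['SYSTEM FLUSH LOGS', newPartsCountQuery(table)];
}

console.log(verificationStatements('my_first_table').join(';\n'));
```

Flushing first matters because log tables are written asynchronously, so a count taken immediately after the insert may miss recent parts.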

Fgrtue · 2026-01-28T11:33:30Z

@Sasao4o it seems that we need to pull from master and update the settings history file: your setting should now be in the 26.2 release.

Sasao4o · 2026-01-28T11:45:57Z

Yeah, I was happy to find out that this was the reason for the test failure xd

Sasao4o · 2026-01-29T13:21:03Z

Hello @Fgrtue,
I ran the timeout tests many times locally using flaky-check, and I managed to reproduce the failure twice. After inspecting the system.parts logs table, I found that it may be a latency issue: sometimes it comes from the client side (the first part ends up larger than expected), and sometimes from the server side. Because of that, I increased the time on both sides.

For the drop-timeout test, I wasn’t able to reproduce the failure locally. I ran it many times with the parallel flag (-j) together with many other tests from the report, but it always succeeded. I also tried running it with the exact same settings, and it still succeeded. Do you think S3 could be the reason (more latency or something like that)?

Sasao4o · 2026-02-11T11:15:36Z

@Fgrtue Hello
Shouldn’t we add the following in IRowInputFormat.cpp?

if (params.in_transaction) throw

Otherwise, we won’t throw an error when inside a transaction, which could break atomicity. I was expecting 02435_rollback_cancelled_queries to fail. What do you think?

Fgrtue · 2026-02-11T11:30:38Z

Let's see if the test fails this time.
I assume it failed last time because we tried to implement the behavior for parallel parsing, so we shouldn't need anything extra.

Fgrtue · 2026-02-11T11:32:54Z

Just to clarify, in which case do you suggest that we should throw an error?

Sasao4o · 2026-02-11T11:39:51Z

If the INSERT is wrapped in BEGIN/END (transaction semantics), and the connection is dropped while we are shutting down gracefully, the rollback logic in executeQuery will not run.

At first I thought ClickHouse doesn't support transactions, but I found that it has them as an experimental feature. So should we take it into account?

Fgrtue · 2026-02-11T12:34:31Z

I see. Instead of propagating an exception in IRowInputFormat during graceful handling, let's rather throw an exception in the format factory when we are in a transaction and the input_format_max_block_wait_ms setting is non-zero. This would let users clearly see that they shouldn't use this setting within transactions. Let's also document that this setting cannot be used with transactions.

Sasao4o · 2026-02-11T13:42:39Z

I didn't mean input_format_max_block_wait_ms (I think it works well with transactions).

What I meant is what we are doing now:

if (connectionError) don't throw and emit chunk

With this, executeQuery doesn't know that there was a connection error, so it doesn't execute the rollback.

I tested this locally using our test with implicit_transaction = 1 added. I should have found 0 parts, but I found more than 0.

So the proposed solution is to pass is_transaction to IRowInputFormat (as it was before reverting) and use:

if (!transaction && connectionError) safe to emit because we are not in txn

I also think we will need to catch the exception in this layer so that we can roll back partially filled rows (the popBack() code).

Fgrtue · 2026-02-11T14:22:03Z

I discussed this with the team, and we think that what should be done is to add another setting that controls this functionality (the exception handling) AND the ability to use input_format_max_block_wait_ms.

So we introduce a setting input_format_connection_handling (or some other name), which can be either 0 or 1 and is set to 0 by default. For this setting, we document the behavior it controls: when the connection is unexpectedly closed, the remaining data is parsed and processed instead of the situation being treated as an error. Also, it should NOT be possible to set input_format_max_block_wait_ms to a non-zero value unless input_format_connection_handling is true.

Then, in the implementation, instead of

if (connectionError) don't throw and  emit chunk

we should have

if (input_format_connection_handling && connectionError) ...

It should be clearly mentioned that in case of a connection error we cannot guarantee that the blocks of data will be deduplicated. As for transactions, such behavior can be expected given the semantics that this setting introduces: some of the data will be inserted even in case of a connection error.
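The two rules proposed in this comment can be written down as a small model. This is a sketch for illustration only and not the C++ implementation: the function names, the returned labels, and the error message are hypothetical, and the setting name input_format_connection_handling is the one suggested in this comment.

```javascript
// Rule 1 (model): a non-zero input_format_max_block_wait_ms is only
// allowed when input_format_connection_handling is enabled.
function validateSettings(settings) {
  const waitMs = settings.input_format_max_block_wait_ms || 0;
  const handling = Boolean(settings.input_format_connection_handling);
  if (waitMs !== 0 && !handling) {
    // Hypothetical message, written for this sketch.
    throw new Error(
      'input_format_max_block_wait_ms requires input_format_connection_handling = 1'
    );
  }
}

// Rule 2 (model): on a connection error, emit the rows parsed so far only
// when the setting is enabled; otherwise propagate the error as before.
function onReadError(settings, connectionError) {
  if (settings.input_format_connection_handling && connectionError) {
    return 'emit_remaining_rows';
  }
  return 'throw';
}

console.log(onReadError({ input_format_connection_handling: 1 }, true));
console.log(onReadError({ input_format_connection_handling: 0 }, true));
```

With the setting left at its default of 0, the model keeps the old behavior: a connection error is thrown, so rollback logic in the caller can still run.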

src/Processors/Formats/IRowInputFormat.cpp

Fgrtue · 2026-02-12T09:02:14Z

tests/queries/0_stateless/03803_insert_on_connection_drop.sh

+sleep 8
+
+kill -9 $PIPELINE_PID 2>/dev/null
+
+wait $PIPELINE_PID 2>/dev/null
+
+
+sleep 1


In internal testing, this test fails because the result differs from the reference: it gets 0 parts instead of 1 part. Let's try to increase the duration of the sleep. The failing test suite is related to S3 in an ASan build, so this is likely a latency issue.

Fgrtue · 2026-03-17T10:32:14Z

The changes are behind a setting, so they are harmless for production. Also, before merging, I consulted with @tavplubix, who approved the changes.

Sasao4o added 4 commits January 17, 2026 23:54

allow timeout-based flush and treat connection errors as EOF for INSERT

96de3ae

Merge remote-tracking branch 'upstream/master' into streaming_inserts

c1d48eb

Fix Merge Conflicts

48838ad

Remove Redundant Function

3709ddc

alexey-milovidov added the can be tested Allows running workflows for external contributors label Jan 18, 2026

Merge branch 'master' into streaming_inserts

db37274

clickhouse-gh bot added the pr-feature Pull request with new product feature label Jan 18, 2026

Fgrtue self-assigned this Jan 19, 2026

Fgrtue reviewed Jan 21, 2026

Add Tests For Both The Timeout + Unexpected Connection Drop

a47ae79

Add UNEXPECTED_END_OF_FILE To Connection ErrorCodes + Change Test To …

ae073e7

…not depend on timeouts but connection drop

Merge branch 'master' into streaming_inserts

0132049

clickhouse-gh bot added pr-improvement Pull request with some product improvements and removed pr-feature Pull request with new product feature labels Jan 26, 2026

Fix Style Issue + Remove Logs

1e577b4

Add --no-async insert flag + depend on log table instead of system.pa…

d4d89bc

…rts table + increase waiting time to 3 sec before killing process

Merge remote-tracking branch 'upstream/master' into streaming_inserts

d67c5b3

Sasao4o added 2 commits January 28, 2026 13:46

Change SettingsChangeHistory From 26.1 to 26.2

f7dd11f

Make insert stream timeout more robust

f407f04

Sasao4o added 2 commits January 29, 2026 16:02

Merge remote-tracking branch 'upstream/master' into streaming_inserts

1b1c325

Add missing comma and fix git history

9fea3d9

Sasao4o force-pushed the streaming_inserts branch from 51f0010 to 9fea3d9 Compare January 29, 2026 14:06

Sasao4o and others added 8 commits February 11, 2026 18:05

Add New Setting To Control Using input_format_block_wait and To Contr…

587a072

…ol Flushing On Unexpectedly Closed Connection

Fix Typo + Missing class datamember

e4ee1c2

Remove Trailing Space

027b6ed

Fix

d8137d5

Merge branch 'master' into streaming_inserts

ac53664

Merge remote-tracking branch 'upstream/master' into streaming_inserts

bff52cb

Fix style

317547b

Fix Conflict In Merge

c563511

Fgrtue reviewed Feb 12, 2026

Sasao4o added 2 commits February 12, 2026 12:52

Move If Condition to FormatFactory + Increase Sleep Time to 15

6b887d8

Change To Bool

34aba64

clickhouse-gh bot added the manual approve Manual approve required to run CI label Feb 14, 2026

Merge branch 'master' into streaming_inserts

1d826a9

Fgrtue added this pull request to the merge queue Feb 20, 2026

Merged via the queue into ClickHouse:master with commit e77bae5 Feb 20, 2026
147 checks passed

robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 20, 2026

Fgrtue added a commit that referenced this pull request Feb 23, 2026

Updated tests introduced in #94509

9e4bd95

Fgrtue mentioned this pull request Feb 23, 2026

Fix flaky 03803_insert_stream_timeout_flush #97769

Closed

1 task

Fgrtue mentioned this pull request Mar 16, 2026

Added fix input_format_connection_handling #99595

Merged

1 task

Fgrtue added post-approved Approved, but after the PR is merged. labels Mar 17, 2026
