Skip to content

Improve waiting for specific line in logs in integration tests (fixes some flakiness on CI)#86030

Merged
azat merged 1 commit intoClickHouse:masterfrom
azat:better-logs-wait-integration-tests
Aug 23, 2025
Merged

Improve waiting for specific line in logs in integration tests (fixes some flakiness on CI)#86030
azat merged 1 commit intoClickHouse:masterfrom
azat:better-logs-wait-integration-tests

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Aug 22, 2025

Recently CI found the failure of test_postgresql_replica_database_engine 1, but, according to logs:

  • the grep executed before the log entry appeared in logs
  • the log entry appeared before timeout expired
logs
2025-08-21 16:45:14.632000 [ 683 ] DEBUG : Command:[docker exec roottestpostgresqlreplicadatabaseengine0-gw0-instance-1 bash -c timeout 30 tail -Fn100 "/var/log/clickhouse-server/clickhouse-server.log" | grep -Em 1 'Table postgresql_replica_2 is skipped from replication stream'] (cluster.py:159, run_and_check)
2025-08-21 16:45:44.679000 [ 683 ] DEBUG : Exitcode:1 (cluster.py:187, run_and_check)
2025.08.21 16:45:14.638793 [ 962 ] {BgSchPool::8d13df61-833b-4141-83e2-0e8e1c30d1c9} <Information> PostgreSQLReplicaConsumer(postgres_database): Table structure of the table postgresql_replica_2 changed (column value2 was not found), will mark it as skipped from replication. Please perform manual DETACH and ATTACH of the table to bring it back
2025.08.21 16:45:14.638813 [ 962 ] {BgSchPool::8d13df61-833b-4141-83e2-0e8e1c30d1c9} <Warning> PostgreSQLReplicaConsumer(postgres_database): Table postgresql_replica_2 is skipped from replication stream because its structure has changes. Please detach this table and reattach to resume the replication (relation id: 16583)

But the test was not succeeded in locating this line.

I don't see any other reason except for line buffer for this to happen, so let's disable it with stdbuf.

Also I've added some introspection with tee /dev/stderr, so that now we will see all entries that the grep received.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Fixes: #86008

Recently CI found the failure of test_postgresql_replica_database_engine
[1], but, according to logs:
- the grep executed **before** the log entry appeared in logs
- the log entry appeared **before** timeout expired

<details>

<summary>logs</summary>

```
2025-08-21 16:45:14.632000 [ 683 ] DEBUG : Command:[docker exec roottestpostgresqlreplicadatabaseengine0-gw0-instance-1 bash -c timeout 30 tail -Fn100 "/var/log/clickhouse-server/clickhouse-server.log" | grep -Em 1 'Table postgresql_replica_2 is skipped from replication stream'] (cluster.py:159, run_and_check)
2025-08-21 16:45:44.679000 [ 683 ] DEBUG : Exitcode:1 (cluster.py:187, run_and_check)
```

```
2025.08.21 16:45:14.638793 [ 962 ] {BgSchPool::8d13df61-833b-4141-83e2-0e8e1c30d1c9} <Information> PostgreSQLReplicaConsumer(postgres_database): Table structure of the table postgresql_replica_2 changed (column value2 was not found), will mark it as skipped from replication. Please perform manual DETACH and ATTACH of the table to bring it back
2025.08.21 16:45:14.638813 [ 962 ] {BgSchPool::8d13df61-833b-4141-83e2-0e8e1c30d1c9} <Warning> PostgreSQLReplicaConsumer(postgres_database): Table postgresql_replica_2 is skipped from replication stream because its structure has changes. Please detach this table and reattach to resume the replication (relation id: 16583)
```

</details>

But the test was not succeeded in locating this line.

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=85765&sha=d9c83add6087386351df45010de71b027af87a76&name_0=PR&name_1=Integration%20tests%20%28amd_binary%2C%204%2F5%29

I don't see any other reason except for line buffer for this to happen,
so let's disable it with `stdbuf`.

Also I've added some introspection with `tee /dev/stderr`, so that now
we will see all entries that the `grep` received.
@azat azat added the 🍃 green ci 🌿 Fixing flaky tests in CI label Aug 22, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Aug 22, 2025

Workflow [PR], commit [586450e]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 22, 2025
@bharatnc bharatnc self-assigned this Aug 23, 2025
@azat azat added this pull request to the merge queue Aug 23, 2025
Merged via the queue into ClickHouse:master with commit 2398fbf Aug 23, 2025
122 checks passed
@azat azat deleted the better-logs-wait-integration-tests branch August 23, 2025 10:42
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 23, 2025
azat added a commit to azat/ClickHouse that referenced this pull request Sep 26, 2025
… way)

CI found failuer again [1]:

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=87584&sha=97e95fa30f48eb6414c18ffc59e2c70c38a437d1&name_0=PR&name_1=Integration%20tests%20%28amd_binary%2C%204%2F5%29

And after added ClickHouse#86030, I can see that logs contains rows starting from:

    2025.09.26 16:33:49.268044 [ 965 ] {} <Trace> test_database.postgresql_replica_4 (938f7e82-5e0e-4f88-ad88-a5ac82b289c3): Trying to reserve 1.00 MiB using storage policy from min volume index 0

While the haystack was slightly earlier:

    2025.09.26 16:33:49.267221 [ 965 ] {BgSchPool::2fe592b6-5cb1-45f2-becc-27ceea4a4e98} <Warning> PostgreSQLReplicaConsumer(postgres_database): Table postgresql_replica_1 is skipped from replication stream because its structure has changes. Please detach this table and reattach to resume the replication (relation id: 16619)

This is a typical problem with test logging, let's simply increase
look_behind_lines, but not in the test, but globally, to avoid any
flakiness further, 100 or 10000 is not a big deal for `grep -F`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🍃 green ci 🌿 Fixing flaky tests in CI pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test_postgresql_replica_database_engine is flaky

3 participants