Skip to content

Inconsistent query results #135

@vikaasbv

Description

@vikaasbv

I was running some queries to verify the consistency of data after a load operation. Below are some observations

Clickhouse version details

ClickHouse client version 1.1.54022.
Connecting to database smsr at localhost:9000.
Connected to ClickHouse server version 1.1.54022.

TOTAL number of records loaded. This is consistent with the raw CSV data.

:) select count() from cdr;

SELECT count()
FROM cdr

┌───count()─┐
│ 527670245 │
└───────────┘

1 rows in set. Elapsed: 0.138 sec. Processed 527.67 million rows, 1.06 GB (3.83 billion rows/s., 7.65 GB/s.)

The number of records for each date. This too is consistent with the raw data.

:) select count(), merge_date from cdr group by merge_date;

SELECT
count(),
merge_date
FROM cdr
GROUP BY merge_date

┌───count()─┬─merge_date─┐
│ 205080000 │ 2015-01-21 │
│ 322590245 │ 2015-01-22 │
└───────────┴────────────┘

2 rows in set. Elapsed: 0.835 sec. Processed 527.67 million rows, 1.06 GB (631.70 million rows/s., 1.26 GB/s.)

How can the total number of distinct rows when grouped by the merge date field be more than the total records in the table?

:) select count(distinct id), merge_date from cdr group by merge_date;

SELECT
countDistinct(id),
merge_date
FROM cdr
GROUP BY merge_date

┌──uniq(id)─┬─merge_date─┐
│ 206247254 │ 2015-01-21 │
│ 324102175 │ 2015-01-22 │
└───────────┴────────────┘

2 rows in set. Elapsed: 21.991 sec. Processed 527.67 million rows, 3.17 GB (23.99 million rows/s., 143.97 MB/s.)

:) select 206247254 + 324102175, 206247254 + 324102175 - 527670245;

SELECT
206247254 + 324102175,
(206247254 + 324102175) - 527670245

┌─plus(206247254, 324102175)─┬─minus(plus(206247254, 324102175), 527670245)─┐
│ 530349429 │ 2679184 │
└────────────────────────────┴──────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.005 sec.

How can the total number of distinct column values be more than the total records in the table?

:) select count(distinct id), count(), count(distinct id) - count() from cdr;

SELECT
countDistinct(id),
count(),
countDistinct(id) - count()
FROM cdr

┌──uniq(id)─┬───count()─┬─minus(uniq(id), count())─┐
│ 531654647 │ 527670245 │ 3984402 │
└───────────┴───────────┴──────────────────────────┘

1 rows in set. Elapsed: 1.169 sec. Processed 527.67 million rows, 2.11 GB (451.55 million rows/s., 1.81 GB/s.)

Are there any approximations involved here? How can we get precision if it is so?

Attached a text file with the query outputs for readability.

clickhouse_inconsistent.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions