
Remove shuffle from sampling #228

Merged
Tmonster merged 3 commits into main from remove_shuffle_from_sampling on Jan 24, 2025

Conversation

@Tmonster (Owner)

Also adds a Nightly test to test across operating systems.

@Tmonster force-pushed the remove_shuffle_from_sampling branch from a8d4a8d to d364b4c on January 21, 2025 08:56
@Tmonster merged commit ed3d957 into main on Jan 24, 2025
14 of 31 checks passed
Tmonster pushed a commit that referenced this pull request on Sep 4, 2025:
…groups, and limiting vacuum operations for the last number of row groups (duckdb#18829)

When performing a checkpoint, we rewrite columns of row groups that have
had modifications. When performing insertions, all columns are modified,
thus all columns of a row group must be rewritten. As a result, any
insertion followed by a checkpoint triggers a full rewrite of the last
row group.

When dealing with wide tables or string columns with large strings,
rewriting a row group can be relatively expensive. This is particularly
the case when only small insertions have been made. For example,
inserting a single row followed by checkpointing can become expensive as
a result of triggering a rewrite of the last row group:

```sql
create table z as
select repeat(sha256(i::varchar), 10) as a,
       repeat(sha256((i**2)::varchar), 10) as b,
       repeat(sha256((i**3)::varchar), 10) as c
from range(100000) r(i);
CHECKPOINT;
.timer on
insert into z values ('a','b','c');
Run Time (s): real 0.006 user 0.004249 sys 0.001186
CHECKPOINT;
Run Time (s): real 0.611 user 1.163009 sys 0.054814
```

The checkpoint takes 0.6s, despite us inserting only a single row.

#### Appending a new row group

This PR solves this problem by appending a new row group when writing to
a table that has been persisted/checkpointed. As a result, we will no
longer modify the (already checkpointed) data. As a trade-off, we end up
with more (smaller) row groups.
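
A minimal sketch of how this can be observed, reusing the table `z` and the `pragma_storage_info` query that appears later in this PR; the counts in the comments are illustrative, not measured output:

```sql
-- After the single-row insert and checkpoint from the example above,
-- inspect the row groups of z. With this change, the already-checkpointed
-- 100K-row group stays untouched and the inserted row lands in a new,
-- small row group instead of forcing a rewrite.
select row_group_id, max(count)
from pragma_storage_info('z')
group by all
order by all;
-- expected shape (illustrative):
--   row_group_id 0 -> 100000 rows  (unchanged)
--   row_group_id 1 ->      1 row   (newly appended)
```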

#### Vacuuming
As part of vacuuming, we merge adjacent row groups if they fit within
the row group size. As part of this PR, we restrict merging for the
last few row groups. Otherwise, we would end up with the same problem.
We would create row groups like so:

```
100K rows
1 row
```

Then merge them back into a single row group with size `100K+1`,
effectively performing the same rewrite as before.

Instead, for the last row groups in a file, we need a *minimum
threshold* to merge. This is either the `row group size` itself, or 2X
the size of the first row group we are considering. We also always merge
row groups if the total size is `< 2048` rows (the vector size). This
exponential merging ensures we don't merge on every single insert, while
also ensuring we never have too many row groups.
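
As a rough sketch of the rule described above (the exact logic inside DuckDB may differ), the trailing row groups are merged once their combined row count reaches the smaller of the row group size and twice the first trailing group, while runs totalling fewer than 2048 rows are always merged. The column names and the `VALUES` row below are purely illustrative:

```sql
-- Illustrative only: evaluate the assumed merge condition for one run of
-- trailing row groups. total_rows is the combined count of the run,
-- first_group_rows the count of the first row group in that run, and
-- 122880 is used as the row group size.
select total_rows < 2048                                             -- always merge tiny runs
       or total_rows >= least(row_group_size, 2 * first_group_rows)  -- exponential threshold
       as should_merge
from (values (3000, 122880, 2000)) t(total_rows, row_group_size, first_group_rows);
-- a run of 2000 + 1000 rows returns false, matching the batch #30 snapshot
-- below where the two trailing row groups are kept separate.
```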

We can see this process in action when starting with 100K rows and
inserting 100 rows at a time, i.e. repeating the script below:

```sql
insert into z select 'a','b','c' from range(100);
checkpoint;
select row_group_id, max(count) from pragma_storage_info('z') group by all order by all;
```

```
-- insert batch #1
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │        100 │
└──────────────┴────────────┘

-- insert batch #30
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │       2000 │
│            2 │       1000 │
└──────────────┴────────────┘
-- insert batch #150
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │       8000 │
│            2 │       4000 │
│            3 │       2000 │
│            4 │       1000 │
└──────────────┴────────────┘
-- insert batch #228
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │      16000 │
│            2 │       4000 │
│            3 │       2000 │
│            4 │        800 │
└──────────────┴────────────┘
-- insert batch #229
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     122800 │
│            1 │        100 │
└──────────────┴────────────┘
```


### Performance

Running the above example again, the checkpoint now completes in `0.005`
seconds, instead of `0.6` seconds.

Note that the compaction still happens at some point, so while most
checkpoints become faster, not all of them do.

Running the above example 300X, here are the timings:

```sql
create table z as
select repeat(sha256(i::varchar), 10) as a,
       repeat(sha256((i**2)::varchar), 10) as b,
       repeat(sha256((i**3)::varchar), 10) as c
from range(100000) r(i);
-- 300X the below insertion
insert into z select 'a','b','c' from range(100);
CHECKPOINT;
```

|           Summary           | v1.3.2  |  New   |
|-----------------------------|---------|--------|
| Total Time                  | 128.82s | 1.00s  |
| Average Time Per Checkpoint | 0.56s   | 0.01s  |
| Max Checkpoint Time         | 0.631s  | 0.526s |

In the above example, there is still a single checkpoint that performs the
full compaction, and thus takes as long as every checkpoint did in the
original script.