@liaoxin01
Contributor

Proposed changes

  1. A tablet whose schema change failed reports its status as bad to FE, so that FE does not schedule it for read/write requests.
  2. Disable compaction while calculating the delete bitmap after converting historical data.
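
The first change can be pictured with a small model. This is only a sketch of the idea, not the Doris code: `TabletReport`, `build_report`, `schedulable_tablets`, and the `is_bad` field are hypothetical names chosen for illustration, and the real BE-to-FE report format is not shown in this conversation.

```python
from dataclasses import dataclass


@dataclass
class TabletReport:
    """Hypothetical per-tablet status that a BE sends to FE."""
    tablet_id: int
    is_bad: bool


def build_report(tablet_id: int, schema_change_failed: bool) -> TabletReport:
    # A tablet whose schema change failed is reported as bad.
    return TabletReport(tablet_id=tablet_id, is_bad=schema_change_failed)


def schedulable_tablets(reports: list[TabletReport]) -> list[int]:
    # FE side (assumed behavior): skip bad tablets for read/write requests.
    return [r.tablet_id for r in reports if not r.is_bad]


reports = [build_report(1001, False), build_report(1002, True)]
print(schedulable_tablets(reports))  # [1001]
```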

@liaoxin01
Contributor Author

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@liaoxin01
Contributor Author

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit e19ed8a7ace47d172619e3e265a1b6c73aa8557d, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17621	5194	5109	5109
q2	2022	162	148	148
q3	10527	1080	1181	1080
q4	10177	870	777	777
q5	8139	2977	2922	2922
q6	216	141	139	139
q7	907	568	507	507
q8	9258	2021	2025	2021
q9	6847	6434	6419	6419
q10	8234	3016	3046	3016
q11	432	218	216	216
q12	387	241	236	236
q13	18005	3637	3616	3616
q14	249	207	216	207
q15	581	540	528	528
q16	466	394	394	394
q17	956	510	514	510
q18	7405	6800	6705	6705
q19	1570	1480	1335	1335
q20	691	336	331	331
q21	2804	2379	2425	2379
q22	374	322	343	322
Total cold run time: 107868 ms
Total hot run time: 38917 ms
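
The bot does not label the four timing columns. The totals are consistent with reading them as one cold run, two hot runs, and the faster of the two hot runs, with "Total cold run time" summing the first column and "Total hot run time" summing the last. The sketch below checks that reading against the Round 1 rows; the column interpretation is an inference from the numbers, not something the report states.

```python
# Round 1 rows copied from the report: query name, then four timings in ms.
# Assumption: columns are cold run, hot run 1, hot run 2, best hot run.
ROUND_1 = """
q1 17621 5194 5109 5109
q2 2022 162 148 148
q3 10527 1080 1181 1080
q4 10177 870 777 777
q5 8139 2977 2922 2922
q6 216 141 139 139
q7 907 568 507 507
q8 9258 2021 2025 2021
q9 6847 6434 6419 6419
q10 8234 3016 3046 3016
q11 432 218 216 216
q12 387 241 236 236
q13 18005 3637 3616 3616
q14 249 207 216 207
q15 581 540 528 528
q16 466 394 394 394
q17 956 510 514 510
q18 7405 6800 6705 6705
q19 1570 1480 1335 1335
q20 691 336 331 331
q21 2804 2379 2425 2379
q22 374 322 343 322
"""


def parse(table):
    """Return a list of (query, [four timings]) tuples."""
    rows = []
    for line in table.strip().splitlines():
        name, *times = line.split()
        rows.append((name, [int(t) for t in times]))
    return rows


rows = parse(ROUND_1)
total_cold = sum(times[0] for _, times in rows)
total_hot = sum(times[3] for _, times in rows)
# The last column should equal the smaller of the two middle columns.
best_is_min = all(times[3] == min(times[1], times[2]) for _, times in rows)

print(total_cold, total_hot, best_is_min)  # 107868 38917 True
```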

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5171	5046	5052	5046
q2	335	236	256	236
q3	3338	3288	3278	3278
q4	2119	2034	1986	1986
q5	5815	5797	5776	5776
q6	218	130	128	128
q7	2310	1915	1943	1915
q8	3371	3437	3477	3437
q9	8850	8804	8713	8713
q10	3804	3846	3836	3836
q11	568	484	491	484
q12	802	642	656	642
q13	8140	3216	3158	3158
q14	292	270	265	265
q15	596	542	521	521
q16	571	492	497	492
q17	1936	1775	1754	1754
q18	8678	8322	8346	8322
q19	1628	1558	1622	1558
q20	2176	1971	1985	1971
q21	5585	5235	5252	5235
q22	569	560	500	500
Total cold run time: 66872 ms
Total hot run time: 59253 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.63% (8614/23517)
Line Coverage: 28.68% (70005/244078)
Region Coverage: 27.67% (36235/130956)
Branch Coverage: 24.37% (18515/75972)
Coverage Report: http://coverage.selectdb-in.cc/coverage/e19ed8a7ace47d172619e3e265a1b6c73aa8557d_e19ed8a7ace47d172619e3e265a1b6c73aa8557d/report/index.html

@doris-robot

TPC-DS test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpcds-tools

TPC-DS sf100 test result on commit e19ed8a7ace47d172619e3e265a1b6c73aa8557d, data reload: false

run tpcds-sf100 query with default conf and session variables
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	931	361	337	337
query2	6424	1900	1969	1900
query3	6649	210	207	207
query4	26289	22497	22514	22497
query5	5195	568	568	568
query6	286	185	190	185
query7	4571	291	279	279
query8	239	227	201	201
query9	8247	2643	2637	2637
query10	399	269	241	241
query11	16190	15723	15553	15553
query12	136	80	76	76
query13	1636	343	343	343
query14	11292	7088	7111	7088
query15	234	189	186	186
query16	6481	291	286	286
query17	1790	505	492	492
query18	1935	268	265	265
query19	277	144	141	141
query20	84	77	79	77
query21	181	97	97	97
query22	4946	4756	4789	4756
query23	31847	31177	30988	30988
query24	12215	2838	2800	2800
query25	590	359	348	348
query26	1685	142	147	142
query27	2892	277	284	277
query28	7153	1951	1945	1945
query29	2081	406	391	391
query30	292	139	147	139
query31	974	790	777	777
query32	93	66	56	56
query33	727	281	274	274
query34	876	458	436	436
query35	899	765	770	765
query36	1269	1118	1246	1118
query37	187	73	73	73
query38	3359	3314	3291	3291
query39	1322	1305	1278	1278
query40	310	91	94	91
query41	37	39	35	35
query42	98	95	90	90
query43	506	481	506	481
query44	1127	710	716	710
query45	200	182	188	182
query46	1075	643	640	640
query47	1715	1523	1558	1523
query48	352	265	254	254
query49	1200	321	327	321
query50	734	353	333	333
query51	5354	5300	5242	5242
query52	89	95	89	89
query53	218	150	152	150
query54	1361	564	575	564
query55	97	86	87	86
query56	212	199	201	199
query57	1055	953	960	953
query58	226	206	205	205
query59	2821	2677	2517	2517
query60	259	234	239	234
query61	93	87	89	87
query62	659	456	471	456
query63	165	150	144	144
query64	5903	1691	1724	1691
query65	3322	3247	3245	3245
query66	1311	339	334	334
query67	15701	15480	15010	15010
query68	12641	524	555	524
query69	520	252	254	252
query70	1702	1543	1445	1445
query71	495	237	234	234
query72	5716	3540	3701	3540
query73	2888	319	314	314
query74	7667	6400	6428	6400
query75	5270	2294	2263	2263
query76	6294	1164	1129	1129
query77	660	247	261	247
query78	9060	9012	8584	8584
query79	2113	503	499	499
query80	586	372	367	367
query81	457	208	207	207
query82	206	108	101	101
query83	161	138	139	138
query84	249	56	53	53
query85	935	290	283	283
query86	391	393	390	390
query87	3615	3360	3361	3360
query88	3393	2250	2252	2250
query89	340	280	260	260
query90	1846	209	216	209
query91	124	98	94	94
query92	59	53	57	53
query93	1704	444	493	444
query94	805	179	185	179
query95	471	418	398	398
query96	619	312	317	312
query97	4297	4139	4166	4139
query98	202	192	190	190
query99	1129	853	832	832
Total cold run time: 295688 ms
Total hot run time: 179066 ms

@doris-robot

(From new machine) TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 47.14 seconds
stream load tsv: 560 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.4 seconds inserted 10000000 Rows, about 352K ops/s
storage size: 17183830791 Bytes
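
The reported load rates can be reproduced from the byte counts and durations if "MB" is taken to mean 2^20 bytes and the result is rounded down. Both of those are assumptions inferred from the figures, not stated by the bot. A minimal check:

```python
MIB = 1024 * 1024  # assumption: the report's "MB" is 2**20 bytes

# (format, seconds, bytes loaded, reported MB/s), copied from the report.
LOADS = [
    ("tsv", 560, 74807831229, 127),
    ("json", 19, 2358488459, 118),
    ("orc", 66, 1101869774, 15),
    ("parquet", 32, 861443392, 25),
]


def throughput_mib_per_s(num_bytes, seconds):
    """Whole MiB per second, rounded down (assumed rounding rule)."""
    return num_bytes // (seconds * MIB)


computed = {name: throughput_mib_per_s(b, s) for name, s, b, _ in LOADS}
for name, _, _, reported in LOADS:
    print(name, computed[name], reported)

# "insert into select": rows per second, expressed in thousands.
insert_k_rows_per_s = int(10_000_000 / 28.4 / 1000)
print(insert_k_rows_per_s)  # 352
```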

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit e19ed8a7ace47d172619e3e265a1b6c73aa8557d, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5466	5152	5129	5129
q2	411	156	158	156
q3	1481	1249	1181	1181
q4	1092	802	781	781
q5	3165	3151	3065	3065
q6	227	144	130	130
q7	994	550	525	525
q8	2168	2211	2270	2211
q9	6742	6668	6654	6654
q10	3183	3133	3131	3131
q11	348	215	224	215
q12	378	242	238	238
q13	4390	3688	3641	3641
q14	243	215	221	215
q15	601	554	551	551
q16	457	406	412	406
q17	1057	546	524	524
q18	7072	6784	6825	6784
q19	1648	1547	1437	1437
q20	601	352	326	326
q21	2865	2368	2428	2368
q22	386	332	330	330
Total cold run time: 44975 ms
Total hot run time: 39998 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5157	5075	5117	5075
q2	343	260	257	257
q3	3394	3369	3328	3328
q4	2152	2055	2027	2027
q5	5984	5944	5925	5925
q6	229	124	123	123
q7	2411	1942	1976	1942
q8	3577	3671	3668	3668
q9	9042	9019	9020	9019
q10	3884	3917	3944	3917
q11	592	475	487	475
q12	822	650	636	636
q13	3880	3180	3187	3180
q14	318	270	286	270
q15	613	548	543	543
q16	572	507	498	498
q17	2020	1793	1821	1793
q18	8802	8455	8567	8455
q19	1746	1696	1708	1696
q20	2303	1987	1980	1980
q21	5646	5345	5313	5313
q22	628	478	508	478
Total cold run time: 64115 ms
Total hot run time: 60598 ms

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved label Jan 2, 2024
@github-actions
Contributor

github-actions bot commented Jan 2, 2024

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Jan 2, 2024

PR approved by anyone and no changes requested.

@dataroaring dataroaring merged commit 797238c into apache:master Jan 2, 2024
@wm1581066 wm1581066 added the p0_w label Jan 3, 2024
seawinde pushed a commit to seawinde/doris that referenced this pull request Jan 3, 2024
liaoxin01 added a commit to liaoxin01/doris that referenced this pull request Jan 5, 2024
liaoxin01 added a commit to liaoxin01/doris that referenced this pull request Jan 5, 2024
dataroaring pushed a commit to liaoxin01/doris that referenced this pull request Jan 5, 2024
dataroaring pushed a commit to liaoxin01/doris that referenced this pull request Jan 6, 2024
HappenLee pushed a commit to HappenLee/incubator-doris that referenced this pull request Jan 12, 2024
@liaoxin01 liaoxin01 deleted the fix_sc branch February 6, 2024 12:26
dataroaring pushed a commit that referenced this pull request Mar 13, 2025
…n=true` (#48399)

### What problem does this PR solve?

#39558 added a config that lets the shadow tablet run cumulative
compaction during schema change (SC) in cloud mode, to avoid the -235
error on the new tablet when there are a large number of loads. However,
this introduces a correctness problem on merge-on-write tables: some
rowsets' delete bitmaps are wrong if cumulative compactions run on the
new tablet while SC calculates delete bitmaps for incremental rowsets
after converting historical data.

This PR introduces a new compaction type, `STOP_TOKEN`, which fails all
existing compaction jobs and disallows any compaction on the tablet, and
changes the SC process as follows:
1. convert historical data
2. register a stop token on the new tablet to fail all existing
compaction jobs and disallow any compaction on the new tablet
3. calculate delete bitmaps for incremental rowsets without the lock
4. calculate delete bitmaps for incremental rowsets with the lock
5. commit the SC job and remove the stop token
----

ref: #29386
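
The five-step process described in the commit message can be illustrated with a toy model. This is a sketch only: `Tablet`, `CompactionRejected`, `register_stop_token`, `remove_stop_token`, and `run_schema_change` are hypothetical names, the steps are reduced to log entries, and error handling (for example, what happens to the stop token if SC fails) is left out because the message does not describe it.

```python
class CompactionRejected(Exception):
    """Raised when a compaction is attempted while a stop token is held."""


class Tablet:
    def __init__(self):
        self.stop_token = False
        self.compactions = {}  # job id -> "running" or "failed"

    def start_compaction(self, job_id):
        if self.stop_token:
            raise CompactionRejected(f"stop token held, rejecting {job_id}")
        self.compactions[job_id] = "running"

    def register_stop_token(self):
        # Fail every in-flight compaction and block new ones.
        self.stop_token = True
        for job_id in self.compactions:
            self.compactions[job_id] = "failed"

    def remove_stop_token(self):
        self.stop_token = False


def run_schema_change(new_tablet, log):
    log.append("convert historical data")                  # step 1
    new_tablet.register_stop_token()                       # step 2
    log.append("calculate delete bitmaps without lock")    # step 3
    log.append("calculate delete bitmaps with lock")       # step 4
    log.append("commit SC job")                            # step 5
    new_tablet.remove_stop_token()                         # step 5


tablet = Tablet()
tablet.start_compaction("cumu-1")   # compaction already running before SC
tablet.register_stop_token()
try:
    tablet.start_compaction("cumu-2")
    rejected = False
except CompactionRejected:
    rejected = True
print(tablet.compactions, rejected)  # {'cumu-1': 'failed'} True

log = []
run_schema_change(tablet, log)
tablet.start_compaction("cumu-3")   # allowed again once the token is removed
```

In this model the stop token guarantees that no compaction can rewrite rowsets on the new tablet between steps 2 and 5, which is the window in which the delete bitmaps are being calculated.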
bobhan1 added a commit to bobhan1/doris that referenced this pull request Mar 13, 2025
…n=true` (apache#48399)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…n=true` (apache#48399)

Labels

approved (indicates a PR has been approved by one committer), dev/2.0.4-merged, dev/3.0.0-merged, p0_w, reviewed

5 participants