@hust-hhb
Contributor

@hust-hhb hust-hhb commented Jan 10, 2025

For MoW (merge-on-write) tables, a schema change may encounter TXN_CONFLICT because two tablets try to modify the delete bitmap lock at the same time, which can cause the schema change to fail. This PR therefore adds a retry in the FE.
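
The retry described above can be pictured roughly as follows. This is only a minimal sketch in C++; the real retry lives in the FE, whose code is not shown in this conversation. `AlterResult`, `RetryOutcome`, and `run_with_conflict_retry` are hypothetical names invented for illustration, and the attempt limit is an assumed parameter rather than a value taken from the PR.

```cpp
#include <functional>

// Hypothetical result type; the real FE/BE status types are not shown here.
enum class AlterResult { kOk, kTxnConflict, kOtherError };

struct RetryOutcome {
    AlterResult result;  // result of the last attempt
    int attempts;        // how many times the task was actually run
};

// Re-run `task` while it reports a txn conflict, up to `max_attempts` times.
// Any other failure is returned immediately, since a retry would not help.
// If `max_attempts` is not positive, the task is never run and
// {kOtherError, 0} is returned.
inline RetryOutcome run_with_conflict_retry(const std::function<AlterResult()>& task,
                                            int max_attempts) {
    RetryOutcome outcome{AlterResult::kOtherError, 0};
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        outcome.result = task();
        outcome.attempts = attempt;
        if (outcome.result != AlterResult::kTxnConflict) {
            break;  // success or a non-retryable error
        }
    }
    return outcome;
}
```

The point of the design is that only the conflict case is retried, so unrelated failures still surface on the first attempt.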

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hust-hhb
Contributor Author

run buildall

BiteTheDDDDt
BiteTheDDDDt previously approved these changes Jan 10, 2025
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 10, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 39.34% (10252/26058)
Line Coverage: 30.52% (87402/286410)
Region Coverage: 29.57% (44552/150689)
Branch Coverage: 26.11% (22803/87328)
Coverage Report: http://coverage.selectdb-in.cc/coverage/d9535df4426ef1921c655445ec71e79aeb6a1efa_d9535df4426ef1921c655445ec71e79aeb6a1efa/report/index.html

@hust-hhb
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Jan 10, 2025
@doris-robot

TPC-H: Total hot run time: 32783 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 702e8c69bfb83d797b93e5b59cb66b7ff59363a1, data reload: false

------ Round 1 ----------------------------------
q1	17586	6168	6041	6041
q2	2062	308	172	172
q3	10405	1305	749	749
q4	10208	890	443	443
q5	7494	2203	2000	2000
q6	201	173	143	143
q7	893	748	618	618
q8	9261	1414	1234	1234
q9	5419	4872	4971	4872
q10	6754	2298	1853	1853
q11	483	291	264	264
q12	335	361	214	214
q13	17773	3698	3101	3101
q14	231	229	209	209
q15	553	523	489	489
q16	638	641	597	597
q17	583	864	327	327
q18	7081	6404	6424	6404
q19	1212	977	578	578
q20	324	332	193	193
q21	2813	2180	1970	1970
q22	361	328	312	312
Total cold run time: 102670 ms
Total hot run time: 32783 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6195	6285	6243	6243
q2	234	334	253	253
q3	2266	2642	2355	2355
q4	1430	1797	1335	1335
q5	4357	4818	5137	4818
q6	188	185	151	151
q7	2054	2000	1872	1872
q8	2673	2813	2716	2716
q9	7282	7275	7195	7195
q10	3140	3358	2840	2840
q11	599	530	489	489
q12	701	795	642	642
q13	3393	3926	3187	3187
q14	286	299	301	299
q15	565	506	499	499
q16	652	669	668	668
q17	1242	1741	1254	1254
q18	7799	7507	7003	7003
q19	802	1137	1051	1051
q20	1920	1997	1850	1850
q21	5527	5114	4935	4935
q22	595	604	544	544
Total cold run time: 53900 ms
Total hot run time: 52199 ms

@doris-robot

TPC-DS: Total hot run time: 188789 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 702e8c69bfb83d797b93e5b59cb66b7ff59363a1, data reload: false

query1	971	376	370	370
query2	6519	2420	2353	2353
query3	6717	215	219	215
query4	33825	23415	23304	23304
query5	4370	601	444	444
query6	304	206	213	206
query7	4629	495	310	310
query8	306	247	241	241
query9	9542	2668	2633	2633
query10	483	335	235	235
query11	18364	15155	15047	15047
query12	152	109	101	101
query13	1643	517	390	390
query14	10346	7047	7276	7047
query15	224	198	185	185
query16	8221	609	481	481
query17	1590	740	587	587
query18	2098	407	309	309
query19	221	191	160	160
query20	122	115	111	111
query21	242	117	101	101
query22	4197	4491	4241	4241
query23	34109	32941	33223	32941
query24	6484	2271	2238	2238
query25	478	446	380	380
query26	1204	300	152	152
query27	1960	457	326	326
query28	5326	2441	2414	2414
query29	682	542	410	410
query30	233	190	152	152
query31	976	853	785	785
query32	75	62	56	56
query33	519	358	290	290
query34	744	873	508	508
query35	785	819	718	718
query36	979	1048	945	945
query37	127	103	75	75
query38	4107	4062	4068	4062
query39	1453	1410	1400	1400
query40	213	109	99	99
query41	71	54	53	53
query42	127	100	99	99
query43	516	517	497	497
query44	1304	788	811	788
query45	185	175	168	168
query46	865	1032	658	658
query47	1830	1899	1775	1775
query48	376	390	312	312
query49	782	475	382	382
query50	646	647	402	402
query51	6769	6941	6833	6833
query52	101	103	87	87
query53	229	252	183	183
query54	478	481	413	413
query55	91	77	76	76
query56	248	291	246	246
query57	1187	1169	1109	1109
query58	251	233	238	233
query59	3221	3092	2920	2920
query60	291	287	281	281
query61	150	137	137	137
query62	840	768	734	734
query63	230	200	196	196
query64	4362	999	622	622
query65	3244	3193	3232	3193
query66	1053	426	337	337
query67	15859	15700	15597	15597
query68	8353	721	515	515
query69	467	286	263	263
query70	1206	1125	1136	1125
query71	428	294	264	264
query72	6172	3873	3842	3842
query73	664	772	354	354
query74	10307	8845	8814	8814
query75	4293	3218	2632	2632
query76	4186	1180	790	790
query77	779	361	274	274
query78	10841	9991	9468	9468
query79	3678	780	579	579
query80	681	582	434	434
query81	481	289	273	273
query82	604	153	127	127
query83	202	172	149	149
query84	285	97	76	76
query85	734	366	311	311
query86	348	318	308	308
query87	4487	4502	4305	4305
query88	4130	2158	2111	2111
query89	406	307	295	295
query90	1910	189	189	189
query91	140	138	106	106
query92	66	57	54	54
query93	1005	771	534	534
query94	666	395	309	309
query95	333	267	256	256
query96	493	595	275	275
query97	2918	2996	2830	2830
query98	224	203	197	197
query99	1655	1498	1377	1377
Total cold run time: 293988 ms
Total hot run time: 188789 ms

@doris-robot

ClickBench: Total hot run time: 31.59 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 702e8c69bfb83d797b93e5b59cb66b7ff59363a1, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.04
query3	0.23	0.08	0.06
query4	1.60	0.11	0.10
query5	0.44	0.43	0.41
query6	1.19	0.65	0.68
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.59	0.50	0.51
query10	0.56	0.55	0.56
query11	0.14	0.10	0.10
query12	0.14	0.12	0.11
query13	0.60	0.61	0.61
query14	2.73	2.88	2.75
query15	0.91	0.83	0.82
query16	0.38	0.38	0.38
query17	1.06	1.05	1.07
query18	0.23	0.22	0.22
query19	1.85	1.79	1.93
query20	0.02	0.02	0.01
query21	15.38	0.97	0.59
query22	0.74	1.01	0.67
query23	15.04	1.46	0.56
query24	3.08	1.18	1.04
query25	0.18	0.09	0.18
query26	0.36	0.14	0.13
query27	0.07	0.05	0.04
query28	13.56	1.57	1.05
query29	12.61	3.92	3.26
query30	0.25	0.10	0.06
query31	2.81	0.62	0.39
query32	3.23	0.55	0.47
query33	3.10	3.09	3.17
query34	16.86	5.14	4.64
query35	4.50	4.53	4.50
query36	0.66	0.49	0.48
query37	0.10	0.06	0.06
query38	0.04	0.04	0.04
query39	0.03	0.02	0.02
query40	0.16	0.14	0.13
query41	0.08	0.02	0.02
query42	0.04	0.03	0.02
query43	0.03	0.03	0.04
Total cold run time: 105.75 s
Total hot run time: 31.59 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 39.35% (10255/26059)
Line Coverage: 30.52% (87421/286420)
Region Coverage: 29.55% (44532/150692)
Branch Coverage: 26.11% (22798/87330)
Coverage Report: http://coverage.selectdb-in.cc/coverage/702e8c69bfb83d797b93e5b59cb66b7ff59363a1_702e8c69bfb83d797b93e5b59cb66b7ff59363a1/report/index.html

res->status().msg());
} else if (res->status().code() ==
MetaServiceCode::KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES) {
return Status::Error<ErrorCode::DELETE_BITMAP_LOCK_ERROR, false>(
Contributor

Why return this error for all RPCs? DELETE_BITMAP_LOCK_ERROR is only used for
delete bitmap lock related RPCs.

Contributor Author

Already removed; it now uses another way to decide. @zhannngchen
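
One way to read the reviewer's concern is that the conflict code should be translated only at call sites that deal with the delete bitmap lock, not in a helper shared by every RPC. The sketch below illustrates that idea only; it is not necessarily the approach the author ended up with, which this conversation does not show. `SketchMsCode`, `SketchBeCode`, and both functions are hypothetical stand-ins for the real `MetaServiceCode` and `ErrorCode` types.

```cpp
// Hypothetical stand-ins for the meta-service and BE error codes.
enum class SketchMsCode { OK, KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES, INTERNAL };
enum class SketchBeCode { OK, DELETE_BITMAP_LOCK_ERROR, INTERNAL_ERROR };

// Generic translation shared by every RPC: no delete-bitmap-specific mapping.
inline SketchBeCode translate_generic(SketchMsCode code) {
    return code == SketchMsCode::OK ? SketchBeCode::OK : SketchBeCode::INTERNAL_ERROR;
}

// Translation used only by RPCs that touch the delete bitmap lock.
// Only here does an exhausted txn-conflict retry become DELETE_BITMAP_LOCK_ERROR.
inline SketchBeCode translate_for_delete_bitmap_rpc(SketchMsCode code) {
    if (code == SketchMsCode::KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES) {
        return SketchBeCode::DELETE_BITMAP_LOCK_ERROR;
    }
    return translate_generic(code);
}
```

With this split, an unrelated RPC that hits the same meta-service code keeps its generic error instead of being reported as a lock failure.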

@hust-hhb
Contributor Author

run buildall

@hust-hhb
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32603 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 62422e911cf5e9279661b2737080f952233469a6, data reload: false

------ Round 1 ----------------------------------
q1	17567	6118	6031	6031
q2	2072	294	175	175
q3	10423	1207	762	762
q4	10250	885	428	428
q5	8444	2182	2012	2012
q6	211	180	147	147
q7	895	753	600	600
q8	9239	1364	1192	1192
q9	5175	4925	4852	4852
q10	6750	2291	1859	1859
q11	497	273	260	260
q12	337	371	222	222
q13	17761	3599	3100	3100
q14	236	242	213	213
q15	566	534	494	494
q16	626	627	585	585
q17	574	840	327	327
q18	6898	6457	6358	6358
q19	2529	958	552	552
q20	306	310	182	182
q21	2831	2124	1947	1947
q22	361	349	305	305
Total cold run time: 104548 ms
Total hot run time: 32603 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6303	6177	6230	6177
q2	241	337	228	228
q3	2268	2607	2306	2306
q4	1364	1792	1331	1331
q5	4342	4783	4766	4766
q6	186	176	137	137
q7	2045	1979	1782	1782
q8	2615	2808	2728	2728
q9	7251	7224	7320	7224
q10	3041	3296	2744	2744
q11	577	517	502	502
q12	739	804	629	629
q13	3418	3791	3326	3326
q14	290	317	281	281
q15	563	521	509	509
q16	664	694	640	640
q17	1260	1718	1284	1284
q18	7783	7500	7416	7416
q19	825	1022	1137	1022
q20	1959	2062	1892	1892
q21	5691	5155	4934	4934
q22	619	607	602	602
Total cold run time: 54044 ms
Total hot run time: 52460 ms

@doris-robot

TPC-DS: Total hot run time: 194536 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 62422e911cf5e9279661b2737080f952233469a6, data reload: false

query1	1318	942	962	942
query2	6370	2327	2379	2327
query3	10981	4846	4686	4686
query4	33235	23541	23238	23238
query5	3851	599	447	447
query6	297	184	177	177
query7	3968	489	305	305
query8	297	220	209	209
query9	9239	2646	2659	2646
query10	442	311	250	250
query11	18124	15387	15089	15089
query12	169	102	103	102
query13	1561	524	407	407
query14	9876	7393	6710	6710
query15	246	222	188	188
query16	7912	654	484	484
query17	1552	750	608	608
query18	2101	444	343	343
query19	223	185	170	170
query20	123	123	114	114
query21	212	128	108	108
query22	4663	4636	4460	4460
query23	34017	33303	34365	33303
query24	6399	2355	2311	2311
query25	462	454	395	395
query26	723	256	164	164
query27	2155	459	339	339
query28	5412	2473	2441	2441
query29	526	547	425	425
query30	208	191	186	186
query31	959	887	850	850
query32	74	57	60	57
query33	475	363	313	313
query34	766	862	506	506
query35	776	841	747	747
query36	1038	1042	947	947
query37	118	102	72	72
query38	4111	4015	4343	4015
query39	1492	1506	1445	1445
query40	201	116	98	98
query41	52	48	51	48
query42	124	109	109	109
query43	522	521	494	494
query44	1319	833	827	827
query45	189	174	170	170
query46	866	1039	656	656
query47	1950	1944	1895	1895
query48	391	402	326	326
query49	729	486	382	382
query50	648	662	398	398
query51	6959	7085	6968	6968
query52	100	100	93	93
query53	226	250	186	186
query54	491	510	417	417
query55	87	86	79	79
query56	260	272	252	252
query57	1195	1240	1151	1151
query58	249	248	236	236
query59	3275	3378	3158	3158
query60	265	261	255	255
query61	122	110	116	110
query62	827	836	742	742
query63	223	196	187	187
query64	3301	1030	692	692
query65	3243	3189	3248	3189
query66	810	407	310	310
query67	16446	15776	15357	15357
query68	8926	698	513	513
query69	483	292	264	264
query70	1266	1077	1135	1077
query71	434	280	255	255
query72	6430	3869	3821	3821
query73	664	753	355	355
query74	10609	9195	8989	8989
query75	4705	3149	2680	2680
query76	4606	1210	761	761
query77	882	419	280	280
query78	10010	10037	9309	9309
query79	3546	791	583	583
query80	737	526	436	436
query81	492	269	230	230
query82	641	154	120	120
query83	206	170	157	157
query84	284	101	71	71
query85	769	349	289	289
query86	368	306	284	284
query87	4574	4266	4423	4266
query88	4200	2142	2108	2108
query89	422	323	283	283
query90	1829	195	188	188
query91	133	130	104	104
query92	70	56	53	53
query93	1408	841	532	532
query94	643	368	274	274
query95	331	256	260	256
query96	478	603	275	275
query97	2821	2973	2832	2832
query98	215	200	203	200
query99	1655	1520	1380	1380
Total cold run time: 296378 ms
Total hot run time: 194536 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 40.35% (10512/26055)
Line Coverage: 31.11% (89107/286463)
Region Coverage: 30.21% (45546/150770)
Branch Coverage: 26.51% (23155/87352)
Coverage Report: http://coverage.selectdb-in.cc/coverage/62422e911cf5e9279661b2737080f952233469a6_62422e911cf5e9279661b2737080f952233469a6/report/index.html

@doris-robot

ClickBench: Total hot run time: 31.15 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 62422e911cf5e9279661b2737080f952233469a6, data reload: false

query1	0.03	0.03	0.04
query2	0.07	0.04	0.03
query3	0.24	0.07	0.08
query4	1.61	0.10	0.11
query5	0.42	0.41	0.41
query6	1.14	0.66	0.66
query7	0.02	0.02	0.02
query8	0.04	0.04	0.03
query9	0.58	0.48	0.51
query10	0.56	0.56	0.54
query11	0.15	0.11	0.11
query12	0.14	0.10	0.11
query13	0.60	0.61	0.59
query14	2.77	2.85	2.76
query15	0.88	0.82	0.84
query16	0.38	0.37	0.38
query17	1.06	1.05	1.00
query18	0.24	0.21	0.20
query19	1.86	1.90	1.96
query20	0.01	0.01	0.02
query21	15.37	0.95	0.59
query22	0.75	0.75	0.72
query23	15.27	1.49	0.52
query24	3.77	0.65	2.02
query25	0.21	0.08	0.07
query26	0.26	0.14	0.13
query27	0.06	0.04	0.05
query28	13.97	1.48	1.06
query29	12.59	4.00	3.33
query30	0.25	0.09	0.07
query31	2.83	0.58	0.38
query32	3.23	0.55	0.45
query33	3.13	3.10	3.15
query34	16.79	5.10	4.56
query35	4.51	4.48	4.47
query36	0.67	0.50	0.48
query37	0.10	0.06	0.06
query38	0.05	0.04	0.03
query39	0.03	0.02	0.02
query40	0.16	0.14	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 106.95 s
Total hot run time: 31.15 s

Contributor

@zhannngchen zhannngchen left a comment

LGTM

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 13, 2025
@dataroaring dataroaring added dev/3.0.x usercase Important user case type label labels Jan 14, 2025
Contributor

@dataroaring dataroaring left a comment

LGTM

@zhannngchen zhannngchen merged commit a1cac25 into apache:master Jan 14, 2025
26 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 14, 2025
…T in cloud mode (#46748)

zhannngchen pushed a commit that referenced this pull request Jan 17, 2025
… TXN_CONFILCT in cloud mode #46748 (#46955)

Cherry-picked from #46748

Co-authored-by: huanghaibin <huanghaibin@selectdb.com>
@gavinchou gavinchou mentioned this pull request Feb 18, 2025
lzyy2024 pushed a commit to lzyy2024/doris that referenced this pull request Feb 21, 2025
…T in cloud mode (apache#46748)

dataroaring pushed a commit that referenced this pull request Mar 26, 2025
### What problem does this PR solve?

A cloud heavy schema change (SC) job retries all of its alter tasks when it
encounters a `KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES` error in
`commit_tablet_job` (#46748). We should remove the stop token (#48399) in
MS for the SC job if it fails in `commit_tablet_job`; otherwise later
retries may fail to register a stop token (because the first stop token
won't expire within `config::lease_compaction_interval_seconds * 4 = 80s`),
and the schema change job will fail.

```
I20250318 15:40:15.851157  7677 task_worker_pool.cpp:423] successfully submit task|type=ALTER|signature=1742283174829
I20250318 15:40:31.346628  6496 task_worker_pool.cpp:1999] get alter table task, signature: 1742283174829
I20250318 15:40:31.346635  6496 task_worker_pool.cpp:281] start alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|mem_limit=10682972209
I20250318 15:40:31.350860  6496 cloud_schema_change_job.cpp:132] Begin to alter tablet. base_tablet_id=1742283170711, new_tablet_id=1742283174829, alter_version=2, job_id=1742283173457
I20250318 15:40:31.350906  6496 cloud_schema_change_job.cpp:226] Begin to convert historical rowsets for new_tablet from base_tablet. base_tablet=1742283170711, new_tablet=1742283174829, job_id=1742283173457
I20250318 15:40:31.350916  6496 cloud_schema_change_job.cpp:247] schema change type, sc_sorting: 0, sc_directly: 1, base_tablet=1742283170711, new_tablet=1742283174829
I20250318 15:40:31.382493  6496 segment_creator.cpp:308] tablet_id:1742283174829, flushing rowset_dir: , rowset_id:020000000000038f6644fd8945079d22209de0cad6c7e5b8, data size:73808, index size:3289
I20250318 15:40:31.385416  6496 cloud_schema_change_job.cpp:416] process mow table|new_tablet_id=1742283174829|out_rowset_size=1|start_calc_delete_bitmap_version=3|alter_version=2
I20250318 15:40:31.387535  6496 cloud_storage_engine.cpp:894] successfully register compaction stop token for tablet_id=1742283174829, delete_bitmap_lock_initiator=6632031443518271970
I20250318 15:40:31.388285  6496 cloud_schema_change_job.cpp:439] alter table for mow table, calculate delete bitmap of incremental rowsets without lock, version: 3-2 new_table_id: 1742283174829
I20250318 15:40:31.391326  6496 cloud_schema_change_job.cpp:460] alter table for mow table, calculate delete bitmap of incremental rowsets with lock, version: 3-2 new_tablet_id: 1742283174829
I20250318 15:40:31.392035  6496 cloud_storage_engine.cpp:915] successfully unregister compaction stop token for tablet_id=1742283174829, delete_bitmap_lock_initiator=6632031443518271970
W20250318 15:40:39.947554  6496 task_worker_pool.cpp:306] failed to alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|error=[DELETE_BITMAP_LOCK_ERROR]txn conflict when commit tablet job idx { table_id: 1742283165243 index_id: 1742283165244 partition_id: 1742283165242 tablet_id: 1742283170711 } schema_change { initiator: "172.20.56.12:9050" id: "1742283173457" new_tablet_idx { table_id: 1742283165243 index_id: 1742283173458 partition_id: 1742283165242 tablet_id: 1742283174829 } txn_ids: 610474243072 alter_version: 2 num_output_rowsets: 1 num_output_segments: 1 size_output_rowsets: 77097 num_output_rows: 611 output_versions: 2 output_cumulative_point: 2 delete_bitmap_lock_initiator: 6632031443518271970 index_size_output_rowsets: 3289 segment_size_output_rowsets: 73808 }
I20250318 15:40:46.204162  7677 task_worker_pool.cpp:423] successfully submit task|type=ALTER|signature=1742283174829
I20250318 15:41:07.487172  6496 task_worker_pool.cpp:1999] get alter table task, signature: 1742283174829
I20250318 15:41:07.487183  6496 task_worker_pool.cpp:281] start alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|mem_limit=10682972209
I20250318 15:41:07.489440  6496 cloud_schema_change_job.cpp:132] Begin to alter tablet. base_tablet_id=1742283170711, new_tablet_id=1742283174829, alter_version=2, job_id=1742283173457
I20250318 15:41:07.489511  6496 cloud_schema_change_job.cpp:226] Begin to convert historical rowsets for new_tablet from base_tablet. base_tablet=1742283170711, new_tablet=1742283174829, job_id=1742283173457
I20250318 15:41:07.489523  6496 cloud_schema_change_job.cpp:247] schema change type, sc_sorting: 0, sc_directly: 1, base_tablet=1742283170711, new_tablet=1742283174829
I20250318 15:41:07.490249  6496 cloud_schema_change_job.cpp:285] Rowset [2-2] has already existed in tablet 1742283174829
I20250318 15:41:07.490275  6496 cloud_schema_change_job.cpp:416] process mow table|new_tablet_id=1742283174829|out_rowset_size=1|start_calc_delete_bitmap_version=2|alter_version=2
W20250318 15:41:07.490864  6496 cloud_compaction_stop_token.cpp:89] failed to register compaction stop token|job_id=a018587a-c12f-4926-9d7e-514ff9d88457|delete_bitmap_lock_initiator=1847151139249560285|tablet_id=1742283174829|error=[INTERNAL_ERROR]failed to start tablet job: compactions are not allowed on tablet_id=1742283174829 currently, blocked by schema change job delete_bitmap_initiator=6632031443518271970
W20250318 15:41:07.490897  6496 task_worker_pool.cpp:306] failed to alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|error=[INTERNAL_ERROR]failed to start tablet job: compactions are not allowed on tablet_id=1742283174829 currently, blocked by schema change job delete_bitmap_initiator=6632031443518271970
```
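
The timing in the log above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a 20-second lease interval (inferred from the `* 4 = 80s` in the commit message) and assumes the token's lifetime is counted from the moment it was registered. All names are hypothetical and not taken from the Doris code base.

```cpp
#include <cstdint>

// Assumption: 20 s lease interval, inferred from "* 4 = 80s" in the commit message.
constexpr std::int64_t kLeaseIntervalSeconds = 20;
constexpr std::int64_t kStopTokenLifetimeSeconds = kLeaseIntervalSeconds * 4;

// Convert a wall-clock time of day to seconds since midnight.
inline std::int64_t seconds_of_day(int hour, int minute, int second) {
    return static_cast<std::int64_t>(hour) * 3600 + minute * 60 + second;
}

// True while a token registered at `registered_at` has not yet expired at `now`.
inline bool stop_token_still_blocks(std::int64_t registered_at, std::int64_t now) {
    return now - registered_at < kStopTokenLifetimeSeconds;
}
```

In the log, the first stop token is registered at 15:40:31 and the retried task tries to register a new one at 15:41:07, only 36 seconds later. Under the assumptions above that is well inside the 80-second window, which matches the "blocked by schema change job" failure.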
github-actions bot pushed a commit that referenced this pull request Mar 26, 2025
dataroaring pushed a commit that referenced this pull request May 13, 2025
…in cloud mode (#50705)

### What problem does this PR solve?

The fix from #46748 should also be applied to rollup tasks.
github-actions bot pushed a commit that referenced this pull request May 13, 2025
…in cloud mode (#50705)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…he#49275)

koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…in cloud mode (apache#50705)

Labels

approved Indicates a PR has been approved by one committer. dev/3.0.4-merged reviewed usercase Important user case type label

6 participants