@TsukiokaKogane (Contributor) commented Nov 6, 2023

Proposed changes

Currently, if a loading task fails, the retry keeps the failed task's progress while the total file number doubles.
The load job progress is initialized in BrokerLoadJob::createLoadingTask(), but the total file number is added again on every call to Coordinator::exec(), so the progress is not properly reset on retry.
For example, before this change the progress looked like this after a retry:

progress: 68.72% (68717/100000)
----------task-failed----------
progress: 34.38% (68753/200000)
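
The numbers above can be reproduced with a small sketch. This is only an illustration of the accumulation problem, not the Doris code: the `Progress` class, `run_attempt`, and `reset` are hypothetical names, and the actual change may restore the progress differently than by zeroing the counters.

```python
class Progress:
    """Hypothetical stand-in for a load job's progress counters."""

    def __init__(self):
        self.finished = 0
        self.total = 0

    def add_total(self, n):
        self.total += n

    def finish(self, n):
        self.finished += n

    def reset(self):
        # Assumed fix: clear both counters before a retry.
        self.finished = 0
        self.total = 0

    def __str__(self):
        pct = 100.0 * self.finished / self.total if self.total else 0.0
        return f"progress: {pct:.2f}% ({self.finished}/{self.total})"


def run_attempt(progress, total_files, files_done):
    # Mirrors the described behaviour: every execution adds the total again.
    progress.add_total(total_files)
    progress.finish(files_done)


# Without a reset, the retry stacks on top of the failed attempt.
buggy = Progress()
run_attempt(buggy, 100000, 68717)  # first attempt fails part-way
run_attempt(buggy, 100000, 36)     # retry has just started
print(buggy)

# With a reset before the retry, the counters start over.
fixed = Progress()
run_attempt(fixed, 100000, 68717)
fixed.reset()
run_attempt(fixed, 100000, 36)
print(fixed)
```

In the unreset case the finished count carries over (68717 + 36 = 68753) while the total becomes 200000, which matches the second progress line in the example.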

Issue Number: close #xxx

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@TsukiokaKogane changed the title from "[fix](load) restore load job progress before retry load task" to "[fix](load) restore load job progress before retry failed load task" Nov 6, 2023
@TsukiokaKogane
Contributor Author

run buildall

@doris-robot

(From new machine) TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 45.39 seconds
stream load tsv: 553 seconds loaded 74807831229 Bytes, about 129 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162100369 Bytes

@dataroaring
Contributor

Could you explain more in the commit message?

@TsukiokaKogane
Contributor Author

> Could you explain more in the commit message?

done

Contributor

@morningman left a comment

LGTM

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Nov 10, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@xiaokang
Contributor

@TsukiokaKogane can you submit a PR for branch-2.0?

@TsukiokaKogane
Contributor Author

> @TsukiokaKogane can you submit a PR for branch-2.0?

done #26802

Contributor

@xiaokang left a comment

LGTM

@xiaokang
Contributor

#26469

@TsukiokaKogane
Contributor Author

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 70781e169892f26d7a60ba4468f8f407fe8f3c8f, data reload: false

run tpch-sf100 query with default conf and session variables
q1	4876	4659	4688	4659
q2	361	172	158	158
q3	2032	1957	1939	1939
q4	1387	1262	1225	1225
q5	3981	3938	3984	3938
q6	254	128	128	128
q7	1410	882	885	882
q8	2738	2770	2743	2743
q9	9704	9647	9577	9577
q10	3452	3531	3529	3529
q11	381	254	240	240
q12	442	301	291	291
q13	4531	3785	3862	3785
q14	323	300	298	298
q15	580	544	519	519
q16	668	579	576	576
q17	1127	940	950	940
q18	7863	7318	7386	7318
q19	1663	1654	1680	1654
q20	567	290	306	290
q21	4330	3911	3888	3888
q22	472	363	370	363
Total cold run time: 53142 ms
Total hot run time: 48940 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4614	4540	4584	4540
q2	339	215	261	215
q3	4013	4004	3991	3991
q4	2676	2674	2686	2674
q5	9681	9719	9674	9674
q6	244	123	122	122
q7	2998	2470	2493	2470
q8	4473	4456	4478	4456
q9	13195	13041	13054	13041
q10	4092	4205	4197	4197
q11	786	633	698	633
q12	989	807	804	804
q13	4255	3603	3558	3558
q14	398	360	346	346
q15	578	532	523	523
q16	741	687	659	659
q17	3903	3868	3966	3868
q18	9698	9105	8947	8947
q19	1809	1774	1762	1762
q20	2384	2062	2065	2062
q21	8903	8610	8735	8610
q22	894	804	831	804
Total cold run time: 81663 ms
Total hot run time: 77956 ms

@doris-robot

(From new machine) TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 45.26 seconds
stream load tsv: 578 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 18 seconds loaded 2358488459 Bytes, about 124 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17101104696 Bytes

@yiguolei force-pushed the restore_fail_load_job_progress branch from 70781e1 to b24c066 December 30, 2023 12:58
@yiguolei
Contributor

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit b24c0661847e35977b3929ba8a106fabb04fc105, data reload: false

------ Round 1 ----------------------------------
q1	17667	5137	5125	5125
q2	2012	158	146	146
q3	10523	1126	1137	1126
q4	10177	762	844	762
q5	7757	2898	2795	2795
q6	216	132	138	132
q7	928	546	525	525
q8	9297	1989	2023	1989
q9	6768	6366	6365	6365
q10	8241	3031	2962	2962
q11	439	226	222	222
q12	390	230	227	227
q13	18005	3641	3603	3603
q14	252	215	205	205
q15	580	542	526	526
q16	444	396	414	396
q17	951	451	496	451
q18	7259	6623	6653	6623
q19	1581	1395	1359	1359
q20	741	354	358	354
q21	2796	2392	2391	2391
q22	369	335	332	332
Total cold run time: 107393 ms
Total hot run time: 38616 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5146	5134	5145	5134
q2	345	254	267	254
q3	3311	3250	3256	3250
q4	2128	2004	2003	2003
q5	5769	5747	5725	5725
q6	215	124	129	124
q7	2327	1900	1872	1872
q8	3381	3471	3467	3467
q9	8759	8699	8693	8693
q10	3776	3846	3808	3808
q11	565	494	489	489
q12	804	634	627	627
q13	8367	3161	3207	3161
q14	290	270	268	268
q15	594	547	530	530
q16	567	510	513	510
q17	1943	1794	1794	1794
q18	8616	8385	8267	8267
q19	1641	1616	1562	1562
q20	2207	1972	1944	1944
q21	5586	5328	5270	5270
q22	521	506	474	474
Total cold run time: 66858 ms
Total hot run time: 59226 ms

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit b24c0661847e35977b3929ba8a106fabb04fc105, data reload: false

run tpch-sf100 query with default conf and session variables
q1	5471	5146	5208	5146
q2	387	168	167	167
q3	1480	1197	1073	1073
q4	1098	820	810	810
q5	3111	3062	3088	3062
q6	227	134	136	134
q7	967	533	540	533
q8	2151	2290	2224	2224
q9	6709	6682	6657	6657
q10	3128	3098	3140	3098
q11	367	222	220	220
q12	383	238	237	237
q13	4379	3707	3694	3694
q14	249	220	220	220
q15	623	545	553	545
q16	452	399	407	399
q17	1047	536	522	522
q18	7054	6827	6828	6827
q19	1639	1684	1478	1478
q20	627	350	372	350
q21	2860	2556	2495	2495
q22	392	318	325	318
Total cold run time: 44801 ms
Total hot run time: 40209 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	5143	5096	5094	5094
q2	334	264	255	255
q3	3369	3330	3301	3301
q4	2164	1991	2025	1991
q5	5943	5914	5916	5914
q6	226	126	127	126
q7	2369	1926	1909	1909
q8	3571	3652	3649	3649
q9	9076	8976	8950	8950
q10	3861	3913	3947	3913
q11	587	474	490	474
q12	807	677	675	675
q13	3891	3227	3192	3192
q14	301	285	268	268
q15	628	559	551	551
q16	567	528	518	518
q17	2042	1794	1785	1785
q18	8744	8420	8370	8370
q19	1735	1709	1701	1701
q20	2281	2020	1976	1976
q21	5714	5384	5326	5326
q22	576	521	517	517
Total cold run time: 63929 ms
Total hot run time: 60455 ms

@doris-robot

TPC-DS test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpcds-tools

TPC-DS sf100 test result on commit b24c0661847e35977b3929ba8a106fabb04fc105, data reload: false

run tpcds-sf100 query with default conf and session variables
query1	934	357	342	342
query2	6440	1948	1996	1948
query3	6651	205	205	205
query4	27220	22564	22407	22407
query5	4777	515	540	515
query6	273	185	183	183
query7	4580	280	266	266
query8	227	211	197	197
query9	8175	2689	2804	2689
query10	417	261	229	229
query11	16253	15537	15682	15537
query12	132	79	77	77
query13	1627	348	336	336
query14	12003	7262	7118	7118
query15	255	196	197	196
query16	6523	287	277	277
query17	1832	509	493	493
query18	1942	274	266	266
query19	283	141	142	141
query20	86	78	78	78
query21	186	98	95	95
query22	4924	4886	4586	4586
query23	31914	31073	30966	30966
query24	11730	2817	2842	2817
query25	594	349	365	349
query26	1704	144	154	144
query27	2836	290	293	290
query28	7007	2025	2016	2016
query29	1859	412	410	410
query30	302	141	148	141
query31	987	782	768	768
query32	90	62	61	61
query33	736	280	270	270
query34	854	450	447	447
query35	904	778	797	778
query36	1334	1253	1316	1253
query37	104	79	74	74
query38	3376	3302	3246	3246
query39	1333	1292	1268	1268
query40	301	96	93	93
query41	37	36	38	36
query42	102	88	98	88
query43	547	514	481	481
query44	1059	726	743	726
query45	201	195	185	185
query46	1054	642	653	642
query47	1708	1512	1522	1512
query48	330	273	271	271
query49	1192	330	328	328
query50	735	359	333	333
query51	5520	5348	5325	5325
query52	98	85	89	85
query53	219	147	154	147
query54	1370	584	581	581
query55	97	88	90	88
query56	219	208	203	203
query57	1025	930	964	930
query58	227	209	210	209
query59	2785	2574	2541	2541
query60	255	247	237	237
query61	90	87	86	86
query62	659	449	455	449
query63	173	159	155	155
query64	5863	1771	1739	1739
query65	3356	3250	3270	3250
query66	1309	338	330	330
query67	15666	15196	15288	15196
query68	12862	518	544	518
query69	546	256	262	256
query70	1768	1517	1542	1517
query71	517	246	245	245
query72	5697	3592	3602	3592
query73	3047	318	325	318
query74	6923	6461	6492	6461
query75	5252	2327	2288	2288
query76	6336	1169	1182	1169
query77	664	293	282	282
query78	9108	8727	8636	8636
query79	1053	511	518	511
query80	586	385	385	385
query81	459	222	217	217
query82	214	106	104	104
query83	180	142	140	140
query84	249	54	58	54
query85	969	300	288	288
query86	400	382	372	372
query87	3594	3332	3390	3332
query88	3099	2310	2313	2310
query89	342	276	266	266
query90	1949	205	212	205
query91	122	97	95	95
query92	62	56	60	56
query93	1409	492	507	492
query94	906	198	190	190
query95	480	427	416	416
query96	621	325	318	318
query97	4296	4170	4184	4170
query98	213	203	191	191
query99	1070	821	814	814
Total cold run time: 294594 ms
Total hot run time: 179933 ms

@doris-robot

(From new machine) TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 46.8 seconds
stream load tsv: 563 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.1 seconds inserted 10000000 Rows, about 355K ops/s
storage size: 17183843825 Bytes

@dataroaring force-pushed the restore_fail_load_job_progress branch from b24c066 to 572156d June 27, 2024 06:44
@dataroaring
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 39612 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 572156dada6564a34d627c2c124feb5edbf106ee, data reload: false

------ Round 1 ----------------------------------
q1	18815	4351	4291	4291
q2	2017	193	196	193
q3	10476	1233	1099	1099
q4	10202	755	864	755
q5	7475	2630	2603	2603
q6	222	136	137	136
q7	958	597	620	597
q8	9234	2059	2032	2032
q9	8978	6444	6406	6406
q10	8974	3782	3659	3659
q11	477	246	242	242
q12	442	240	230	230
q13	17772	2964	2991	2964
q14	275	227	215	215
q15	514	495	474	474
q16	532	403	377	377
q17	943	721	621	621
q18	8045	7484	7360	7360
q19	3849	1513	1492	1492
q20	685	333	339	333
q21	4936	3193	3197	3193
q22	394	340	354	340
Total cold run time: 116215 ms
Total hot run time: 39612 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4367	4214	4215	4214
q2	372	270	267	267
q3	3002	2784	2771	2771
q4	1889	1569	1576	1569
q5	5247	5267	5248	5248
q6	217	129	125	125
q7	2113	1773	1769	1769
q8	3192	3339	3330	3330
q9	8320	8324	8352	8324
q10	3840	3701	3686	3686
q11	575	491	489	489
q12	765	618	601	601
q13	17489	2975	2957	2957
q14	295	262	260	260
q15	527	481	485	481
q16	461	406	419	406
q17	1753	1484	1451	1451
q18	7675	7539	7322	7322
q19	4230	1385	1597	1385
q20	2006	1791	1729	1729
q21	5107	4747	4701	4701
q22	596	554	562	554
Total cold run time: 74038 ms
Total hot run time: 53639 ms

@doris-robot

TPC-DS: Total hot run time: 170935 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 572156dada6564a34d627c2c124feb5edbf106ee, data reload: false

query1	931	399	371	371
query2	6450	2572	2471	2471
query3	6651	213	226	213
query4	19195	17429	17418	17418
query5	4276	502	498	498
query6	294	174	168	168
query7	4600	300	298	298
query8	309	309	302	302
query9	8477	2350	2337	2337
query10	587	296	275	275
query11	10450	10043	9979	9979
query12	130	83	81	81
query13	1649	367	366	366
query14	9462	7544	7153	7153
query15	261	197	190	190
query16	8104	270	268	268
query17	1885	580	539	539
query18	2050	287	264	264
query19	185	151	150	150
query20	87	80	85	80
query21	213	125	125	125
query22	4294	4136	4134	4134
query23	33624	33065	33097	33065
query24	12116	2835	2774	2774
query25	637	357	368	357
query26	1776	163	155	155
query27	2933	313	310	310
query28	7248	2036	2013	2013
query29	1020	621	603	603
query30	289	149	151	149
query31	946	741	761	741
query32	93	51	57	51
query33	770	287	293	287
query34	983	483	487	483
query35	755	638	622	622
query36	1114	924	958	924
query37	282	72	76	72
query38	2855	2726	2738	2726
query39	879	779	803	779
query40	273	163	124	124
query41	54	54	54	54
query42	124	102	99	99
query43	596	580	572	572
query44	1257	734	729	729
query45	191	164	167	164
query46	1069	729	697	697
query47	1853	1786	1775	1775
query48	374	299	294	294
query49	1195	413	411	411
query50	772	390	399	390
query51	6899	6802	6797	6797
query52	105	89	98	89
query53	369	293	304	293
query54	886	456	442	442
query55	73	80	72	72
query56	280	256	262	256
query57	1185	1012	1049	1012
query58	263	233	266	233
query59	3471	3433	3157	3157
query60	312	269	279	269
query61	91	87	92	87
query62	648	434	441	434
query63	319	294	294	294
query64	9842	2230	1772	1772
query65	3189	3080	3131	3080
query66	1377	333	327	327
query67	15438	14950	15039	14950
query68	6139	540	562	540
query69	600	515	407	407
query70	1181	1163	1149	1149
query71	445	283	277	277
query72	7595	2823	2577	2577
query73	758	333	331	331
query74	5817	5499	5519	5499
query75	3835	2664	2669	2664
query76	3691	937	937	937
query77	619	312	309	309
query78	10630	9906	9853	9853
query79	2056	517	504	504
query80	1129	468	469	468
query81	526	213	220	213
query82	730	104	115	104
query83	195	174	165	165
query84	270	86	85	85
query85	1435	287	271	271
query86	454	339	278	278
query87	3259	3096	3093	3093
query88	3451	2440	2441	2440
query89	481	385	387	385
query90	1815	187	192	187
query91	130	98	98	98
query92	61	54	51	51
query93	1803	506	497	497
query94	1204	189	188	188
query95	397	326	327	326
query96	581	272	282	272
query97	3196	3084	3063	3063
query98	226	208	203	203
query99	1396	837	877	837
Total cold run time: 277470 ms
Total hot run time: 170935 ms

@doris-robot

ClickBench: Total hot run time: 31.1 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 572156dada6564a34d627c2c124feb5edbf106ee, data reload: false

query1	0.05	0.03	0.03
query2	0.07	0.04	0.04
query3	0.23	0.04	0.05
query4	1.68	0.06	0.07
query5	0.50	0.49	0.48
query6	1.13	0.72	0.72
query7	0.02	0.01	0.01
query8	0.05	0.04	0.04
query9	0.54	0.50	0.49
query10	0.56	0.55	0.55
query11	0.15	0.12	0.12
query12	0.16	0.12	0.12
query13	0.60	0.58	0.59
query14	0.78	0.78	0.77
query15	0.85	0.83	0.81
query16	0.37	0.38	0.37
query17	1.02	0.96	0.97
query18	0.20	0.26	0.23
query19	1.91	1.71	1.69
query20	0.01	0.00	0.00
query21	15.47	0.75	0.67
query22	4.35	6.87	2.51
query23	18.25	1.40	1.28
query24	2.22	0.23	0.23
query25	0.16	0.08	0.08
query26	0.26	0.17	0.17
query27	0.08	0.07	0.08
query28	13.19	1.01	1.00
query29	12.61	3.33	3.26
query30	0.25	0.06	0.05
query31	2.86	0.40	0.38
query32	3.28	0.46	0.48
query33	2.91	2.88	2.94
query34	17.05	4.43	4.40
query35	4.55	4.49	4.49
query36	0.66	0.51	0.46
query37	0.18	0.16	0.15
query38	0.15	0.14	0.15
query39	0.04	0.03	0.03
query40	0.16	0.15	0.15
query41	0.10	0.05	0.05
query42	0.06	0.04	0.04
query43	0.04	0.04	0.05
Total cold run time: 109.76 s
Total hot run time: 31.1 s

@github-actions
Contributor

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and feel free to ask a maintainer to remove the Stale tag!

@github-actions bot added the Stale label Dec 25, 2024
@github-actions bot closed this Dec 26, 2024

Labels

approved, dev/2.0.3-merged, reviewed, Stale

6 participants