[Fix-9401] Master service dispatch retry log optimization #9955
WangJPLeo wants to merge 7 commits into apache:dev from
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #9955 +/- ##
============================================
+ Coverage 40.68% 40.73% +0.04%
- Complexity 4579 4589 +10
============================================
Files 834 834
Lines 33832 33879 +47
Branches 3745 3754 +9
============================================
+ Hits 13765 13800 +35
- Misses 18747 18755 +8
- Partials 1320 1324 +4
iTaskProcessor.action(TaskAction.RUN);

// pending task needs to be dispatched
if (task.getState().typeIsPending()) {
Hello, in the dispatchFailedTaskInstanceState2Pending() method you only save the pending status to the DB, not to WorkflowExecuteThread.taskInstanceMap. I am not sure whether the task fetched from taskInstanceMap here can actually be pending, or whether taskInstanceMap is updated somewhere else?
Thanks. The consideration here was that a task instance in the Pending state needs to be dispatched when a failover occurs; I re-ran the test and found that this is not the case. Interestingly, a historical problem was discovered along the way: when the Master service fails over, the already-executed task instance is re-created, so the new task instance resumes execution successfully while the old task instance is left unchanged and becomes garbage data. This problem will be written up as an issue and resolved in the next PR. Thank you again for your review.
@WangJPLeo we need to sync the doc too, see https://dolphinscheduler.apache.org/en-us/docs/dev/user_doc/architecture/design.html
if (dispatchFailedTask.getDispatchFailedRetryTimes() >= Constants.DEFAULT_MAX_RETRY_COUNT) {
    logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
    // business alarm
    continue;
}
These tasks need to continue to be submitted.
Yes, thanks. After 100 retries the business alarm fires, and then retries continue with a 100-second delay each time.
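The "alarm once, keep retrying" behaviour described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the class and method names are hypothetical. Using `==` instead of `>=` means the alarm fires exactly once, on the attempt that first reaches the limit, while the task stays in the retry loop.

```java
// Hypothetical sketch: fire the business alarm exactly once at the retry
// limit, without removing the task from the retry cycle.
public class DispatchAlarm {
    static final int DEFAULT_MAX_RETRY_COUNT = 100;

    // Returns true only on the attempt that first reaches the limit,
    // so the alarm is raised once rather than on every later retry.
    static boolean shouldAlarm(int dispatchFailedRetryTimes) {
        return dispatchFailedRetryTimes == DEFAULT_MAX_RETRY_COUNT;
    }
}
```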
Kudos, SonarCloud Quality Gate passed!
FYI @WangJPLeo @caishunfeng @lenboo I changed this PR's milestone from 300b1 to 300b2
ruanwenjun left a comment:
In fact, I don't think we need a dispatch-failed queue or a maxDispatchSize per task.
A dispatch failure is caused by a worker network error, which has nothing to do with the task itself. So the correct thing is to find the dispatch-failed worker and isolate it, rather than isolating the task.
And as I remember, in the current design, when a task dispatch fails it goes back to the task queue and is retried again; I think this is reasonable.
One thing we may need to do is split the task state Dispatch into Dispatching and Dispatched.
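The suggested state split could look like the following sketch. The enum and method names here are hypothetical and do not match DolphinScheduler's real ExecutionStatus; the point is that a worker acknowledgement, not the dispatch attempt itself, advances the state, so failed dispatches become detectable as tasks stuck in Dispatching.

```java
// Hypothetical sketch of splitting one "Dispatch" state into two.
public class TaskDispatchState {
    enum State { SUBMITTED, DISPATCHING, DISPATCHED, RUNNING }

    // A dispatch attempt moves SUBMITTED -> DISPATCHING.
    static State onDispatchAttempt(State s) {
        return s == State.SUBMITTED ? State.DISPATCHING : s;
    }

    // Only a worker ack moves DISPATCHING -> DISPATCHED, so a failed
    // dispatch leaves the task observable in DISPATCHING.
    static State onWorkerAck(State s) {
        return s == State.DISPATCHING ? State.DISPATCHED : s;
    }
}
```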
// service alarm when retries 100 times
if (dispatchFailedTask.getDispatchFailedRetryTimes() == Constants.DEFAULT_MAX_RETRY_COUNT) {
    logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
    // business alarm
}
// differentiate the queue to prevent high priority from affecting the execution of other tasks
taskPriorityDispatchFailedQueue.put(dispatchFailedTask);
So if one task exceeds the max retry times, will it still retry in your plan? I guess you set maxRetryTimes because you want to stop retrying?
Instead of stopping the retry, the retry interval increases with the number of retries. Once the maximum retry count is reached, each interval is 100s. This is described in the comment on line 222 (retry more than 100 times with 100 seconds delay each time).
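A minimal sketch of this retry gating, assuming illustrative names rather than the PR's exact code: the delay grows linearly with the retry count and is capped at 100 seconds once the maximum retry count is reached, so retries never stop, they just slow down.

```java
// Hypothetical sketch of the backoff described above.
public class DispatchRetryBackoff {
    static final int DEFAULT_MAX_RETRY_COUNT = 100;
    static final long BASE_DELAY_MS = 1000L;

    // Delay before the next retry: retryTimes * 1s, capped at 100s.
    static long retryDelayMs(int dispatchFailedRetryTimes) {
        return Math.min(dispatchFailedRetryTimes, DEFAULT_MAX_RETRY_COUNT) * BASE_DELAY_MS;
    }

    // A task may be re-dispatched only when enough time has passed
    // since the latest dispatch failure.
    static boolean canRetry(long lastDispatchFailedTimeMs, int retryTimes, long nowMs) {
        return nowMs - lastDispatchFailedTimeMs >= retryDelayMs(retryTimes);
    }
}
```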
TaskPriority dispatchFailedTaskPriority = taskPriorityDispatchFailedQueue.poll(Constants.SLEEP_TIME_MILLIS, TimeUnit.MILLISECONDS);
if (Objects.isNull(dispatchFailedTaskPriority)) {
    continue;
}
if (canRetry(dispatchFailedTaskPriority)) {
    dispatchFailedTaskPriority.setDispatchFailedRetryTimes(dispatchFailedTaskPriority.getDispatchFailedRetryTimes() + 1);
    taskPriorityQueue.put(dispatchFailedTaskPriority);
} else {
    taskPriorityDispatchFailedQueue.put(dispatchFailedTaskPriority);
}
It's not a good idea to deal with the dispatchFailedQueue in the normal process.
Most of the time the failedDispatchQueue is empty, so with the default dispatchTaskNum of 3 you add up to 3s of extra waiting to each dispatch round, since you have to poll the failedDispatchQueue.
This situation can occur; whether to traverse the dispatch-failed queue is decided by checking its size first.
If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master no longer spins in a dead loop.
Thank you @ruanwenjun. The Pending state of the workflow instance and the dispatch failure queue were added to solve three problems:
Based on these three points, I think your concern is addressed and the change is consistent with the current design.
When dispatch fails, the reason is that there is no worker or the worker's network is broken, so the correct thing is to stop consuming commands. Although a failure queue can mitigate the problem, it is essentially the same as the current solution: put the task back into the normal queue and sleep. If you put it into a failure queue, you need another thread to handle it; otherwise you may affect the normal process.
I agree it's necessary to use another thread to handle the failure queue and avoid the 3-second delay. But the biggest difference between the failure queue and the current solution is that, when higher-priority tasks keep failing, the highest-priority task will not be picked for dispatch ahead of everything else every time; and, relying on the failure queue, we can add further handling for these failed tasks, such as monitoring.
OK, using a single thread to consume the failure queue looks good to me. Since you add a new status, you also need to add it to NEED_FAILOVER_STATES, and I am not sure whether other places need updating. In the current code, adding a new status easily introduces new bugs.
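The agreed design, a dedicated thread draining the failure queue so the normal dispatch loop never blocks on an empty queue, could be sketched like this. The element type (`String`) and all names are illustrative stand-ins for the real TaskPriority queues.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a single consumer thread moves dispatch-failed
// tasks back to the normal queue, so polling the (usually empty) failure
// queue never delays the normal dispatch thread.
public class FailedQueueConsumer implements Runnable {
    private final LinkedBlockingQueue<String> failedQueue;
    private final LinkedBlockingQueue<String> normalQueue;
    volatile boolean running = true;

    FailedQueueConsumer(LinkedBlockingQueue<String> failedQueue,
                        LinkedBlockingQueue<String> normalQueue) {
        this.failedQueue = failedQueue;
        this.normalQueue = normalQueue;
    }

    @Override
    public void run() {
        while (running) {
            try {
                // Block on the failure queue only; timed poll lets the
                // thread notice the shutdown flag.
                String task = failedQueue.poll(100, TimeUnit.MILLISECONDS);
                if (task != null) {
                    normalQueue.put(task); // hand back for re-dispatch
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```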
OK, thanks for your suggestion. I will optimize the processing logic according to the discussion and check everywhere the new status is used.
Will close this PR due to no update for a long time.
Purpose of the pull request
Optimize output logging so that dispatch failures are visible to users.
Brief change log
After dispatch fails, ExecuteException is no longer thrown; the dispatch result is returned instead.
If the dispatch result is a failure, the current task instance is changed to the PENDING state, and the current timestamp is recorded as the latest dispatch failure time.
If dispatch fails, the task is put into the dispatch failure queue (isolated from the original queue) and waits to be taken out and dispatched again.
Retry timing: a retry happens when the interval since the latest dispatch failure time is greater than or equal to the specified time, and each retry increases the failed retry count.
If the server goes down, the failover server picks up workflow instances in the specified state and creates the command RECOVER TOLERANCE FAULT PROCESS.
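The change-log steps above can be sketched as one flow. This is a simplified illustration under assumed names (FailedTask, dispatchFailedQueue, workerAvailable), not the PR's actual classes: dispatch returns a result instead of throwing, and on failure the task is marked PENDING, the failure time is recorded, and the task enters the isolated failure queue.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the changed dispatch flow described in the change log.
public class DispatchFlow {
    static class FailedTask {
        final String taskId;
        long lastDispatchFailedTimeMs;
        String state = "RUNNING";
        FailedTask(String taskId) { this.taskId = taskId; }
    }

    final LinkedBlockingQueue<FailedTask> dispatchFailedQueue = new LinkedBlockingQueue<>();

    // Returns true when dispatch succeeded, false otherwise (no exception thrown).
    boolean dispatch(FailedTask task, boolean workerAvailable, long nowMs) {
        if (workerAvailable) {
            return true;
        }
        task.state = "PENDING";                // mark pending instead of failing hard
        task.lastDispatchFailedTimeMs = nowMs; // remember the latest failure time
        dispatchFailedQueue.offer(task);       // isolated queue for later re-dispatch
        return false;
    }
}
```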
close #9401
Verify this pull request
Manually verified the change by testing locally.