
[Fix-9401]Master service dispatch retry log optimization.#9955

Closed
WangJPLeo wants to merge 7 commits into apache:dev from WangJPLeo:jp-master_log-fix

Conversation

@WangJPLeo (Contributor) commented May 9, 2022

Purpose of the pull request

Optimize output logging so that dispatch failures are visible to users.

Brief change log

After dispatch fails, ExecuteException is no longer thrown; the dispatch result is returned instead;

If the dispatch result is a failure, the current task instance is changed to the PENDING state, and the current timestamp is recorded as the latest dispatch-failure time;

If dispatch fails, the task is put into a dispatch-failure queue (isolated from the original queue) and waits to be taken out and dispatched again;

A retry fires when the interval between the latest dispatch-failure time and the current time is greater than or equal to the specified delay, and each retry increments the failed-retry count;

If the server goes down, the failover server picks up workflow instances in the specified state and creates a RECOVER_TOLERANCE_FAULT_PROCESS command.
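The retry-timing rule above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the class, method names, and the base-delay constant are assumptions; only the shape (retry when the time since the latest dispatch failure reaches a delay that grows with the failed-retry count, capped at 100 s as discussed later in this thread) comes from the change log.

```java
// Hypothetical sketch of the retry-timing rule; names and constants are illustrative.
class DispatchRetryPolicy {

    static final long BASE_DELAY_MS = 1_000L;   // assumed base delay per failed retry
    static final long MAX_DELAY_MS = 100_000L;  // 100 s cap, per the discussion in this PR

    // Delay grows linearly with the failed-retry count, capped at MAX_DELAY_MS.
    static long retryDelayMs(int dispatchFailedRetryTimes) {
        long delay = BASE_DELAY_MS * Math.max(1, dispatchFailedRetryTimes);
        return Math.min(delay, MAX_DELAY_MS);
    }

    // A task is eligible for re-dispatch once the interval since its latest
    // dispatch-failure time is greater than or equal to the computed delay.
    static boolean canRetry(long lastDispatchFailedTimeMs, int dispatchFailedRetryTimes, long nowMs) {
        return nowMs - lastDispatchFailedTimeMs >= retryDelayMs(dispatchFailedRetryTimes);
    }
}
```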

close #9401

Verify this pull request

Manually verified the change by testing locally.

@codecov-commenter commented May 9, 2022

Codecov Report

Merging #9955 (806fb63) into dev (4c50bf0) will increase coverage by 0.04%.
The diff coverage is 60.37%.

@@             Coverage Diff              @@
##                dev    #9955      +/-   ##
============================================
+ Coverage     40.68%   40.73%   +0.04%     
- Complexity     4579     4589      +10     
============================================
  Files           834      834              
  Lines         33832    33879      +47     
  Branches       3745     3754       +9     
============================================
+ Hits          13765    13800      +35     
- Misses        18747    18755       +8     
- Partials       1320     1324       +4     
Impacted Files Coverage Δ
...rg/apache/dolphinscheduler/common/enums/Event.java 0.00% <0.00%> (ø)
...e/dolphinscheduler/common/enums/TaskStateType.java 0.00% <0.00%> (ø)
...ler/server/master/dispatch/ExecutorDispatcher.java 0.00% <0.00%> (ø)
...rver/master/processor/queue/TaskExecuteThread.java 0.00% <0.00%> (ø)
...er/server/master/runner/WorkflowExecuteThread.java 8.25% <ø> (ø)
...olphinscheduler/server/utils/DependentExecute.java 0.00% <ø> (ø)
...e/dolphinscheduler/service/queue/TaskPriority.java 47.76% <14.28%> (-3.91%) ⬇️
...ver/master/consumer/TaskPriorityQueueConsumer.java 33.33% <71.05%> (+33.33%) ⬆️
.../org/apache/dolphinscheduler/common/Constants.java 82.60% <100.00%> (+1.65%) ⬆️
...nscheduler/service/process/ProcessServiceImpl.java 30.90% <100.00%> (+0.05%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c50bf0...806fb63. Read the comment docs.

iTaskProcessor.action(TaskAction.RUN);

// pending task needs to be dispatched
if (task.getState().typeIsPending()){
Hello, in the dispatchFailedTaskInstanceState2Pending() method you only save the pending status to the DB, not to WorkflowExecuteThread.taskInstanceMap. I am not sure whether the task fetched from taskInstanceMap here can be pending, or whether taskInstanceMap is updated elsewhere?

@WangJPLeo (Contributor, Author) replied:

Thanks. The consideration here was that a task instance in the Pending state needs to be dispatched when a failover occurs; I re-tested and confirmed that this is not the case. Interestingly, a historical problem surfaced along the way: when the Master service fails over, the selected task instance is recreated. Because the already-executed task instance is re-created, the new task instance resumes and succeeds, while the old task instance is left unchanged and becomes garbage data. I will file this problem as an issue and resolve it in the next PR. Thanks again for your review.

@caishunfeng (Contributor) commented on:

if (dispatchFailedTask.getDispatchFailedRetryTimes() >= Constants.DEFAULT_MAX_RETRY_COUNT) {
    logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
    // business alarm
    continue;
These tasks need to continue to be submitted.

@WangJPLeo (Contributor, Author) replied:
Yes, thanks. After 100 retries the business alarm fires, and retries then continue with a 100 s delay each.
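A minimal sketch of the semantics described here (the class name and constant are illustrative stand-ins; Constants.DEFAULT_MAX_RETRY_COUNT is the real name quoted elsewhere in this thread): comparing with == rather than >= makes the alarm fire exactly once, at the 100th failure, while the task keeps being re-queued.

```java
// Illustrative sketch: alarm once at the threshold, never stop retrying.
class DispatchFailureAlarm {

    static final int MAX_RETRY_COUNT = 100; // stand-in for Constants.DEFAULT_MAX_RETRY_COUNT

    // '==' (not '>=') means the alarm fires once, exactly at the 100th failure;
    // the task itself continues to be re-queued on every failure.
    static boolean shouldAlarm(int dispatchFailedRetryTimes) {
        return dispatchFailedRetryTimes == MAX_RETRY_COUNT;
    }
}
```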


@sonarqubecloud commented:
Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 63.1%
Duplication: 0.0%

@zhongjiajie (Member) commented:

FYI @WangJPLeo @caishunfeng @lenboo I changed this PR's milestone from 3.0.0-beta-1 to 3.0.0-beta-2.

@devosend devosend modified the milestones: 3.0.0-beta-2, 3.0.0-beta-3 Jun 27, 2022
@WangJPLeo WangJPLeo requested a review from zhuangchong as a code owner July 1, 2022 03:29
@WangJPLeo WangJPLeo requested a review from ruanwenjun as a code owner July 1, 2022 03:29
@ruanwenjun (Member) left a comment:

In fact, I don't think we need a dispatch-failed queue or a maxDispatchSize per task.

A dispatch failure is caused by a worker network error, which has nothing to do with the task itself. So the correct fix is to find the failed worker and isolate it, rather than isolating the task.

And as I remember the current design, when a task dispatch fails it goes back to the task queue and is retried; I think this is reasonable.

One thing we may need to do is split the task state Dispatch into Dispatching and Dispatched.
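The split suggested here could look roughly like the sketch below. This is purely illustrative; the enum name, its constants, and the comments are assumptions, not DolphinScheduler's actual state model.

```java
// Hypothetical refinement of the dispatch-related task states.
enum TaskDispatchState {
    SUBMITTED,    // accepted by the master, not yet sent to a worker
    DISPATCHING,  // dispatch request in flight to the worker
    DISPATCHED,   // worker acknowledged receipt of the task
    PENDING       // dispatch failed; waiting in the retry queue
}
```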

Comment on lines +152 to +158
// service alarm when retries 100 times
if (dispatchFailedTask.getDispatchFailedRetryTimes() == Constants.DEFAULT_MAX_RETRY_COUNT) {
    logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
    // business alarm
}
// differentiate the queue to prevent high priority from affecting the execution of other tasks
taskPriorityDispatchFailedQueue.put(dispatchFailedTask);
@ruanwenjun (Member) commented Jul 2, 2022:

So if one task exceeds the max retry times, will it still be retried in your plan? I would guess you set maxRetryTimes because you want retries to stop?

@WangJPLeo (Contributor, Author) replied:

Instead of stopping the retry, the retry interval grows with the number of retries. Once the maximum retry count is reached, each interval is 100 s. This is described in the comment on line 222 (retry more than 100 times with a 100-second delay each time).

Comment on lines +177 to +187
TaskPriority dispatchFailedTaskPriority = taskPriorityDispatchFailedQueue.poll(Constants.SLEEP_TIME_MILLIS, TimeUnit.MILLISECONDS);
if (Objects.isNull(dispatchFailedTaskPriority)) {
    continue;
}
if (canRetry(dispatchFailedTaskPriority)) {
    dispatchFailedTaskPriority.setDispatchFailedRetryTimes(dispatchFailedTaskPriority.getDispatchFailedRetryTimes() + 1);
    taskPriorityQueue.put(dispatchFailedTaskPriority);
} else {
    taskPriorityDispatchFailedQueue.put(dispatchFailedTaskPriority);
}
@ruanwenjun (Member) commented Jul 2, 2022:

It's not a good idea to handle the dispatchFailedQueue in the normal process.
Most of the time the failedDispatchQueue is empty, so with the default dispatchTaskNum of 3 you add an extra 3 s wait to each dispatch round, since each consumer has to poll the failedDispatchQueue.

@WangJPLeo (Contributor, Author) replied:

This situation may occur; we can decide whether to poll by first checking the size of the dispatch-failure queue.
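The size check mentioned here can be sketched like this (an illustrative helper, not the PR's code): skip the timed poll entirely when the failure queue is empty, so the common no-failure case adds no blocking wait to the dispatch loop.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class FailedQueuePoller {

    // Poll the failure queue only when it is non-empty; an empty queue then
    // costs no blocking wait. Illustrative sketch of the size-check idea.
    static <T> T pollIfPresent(BlockingQueue<T> failedQueue, long timeoutMs)
            throws InterruptedException {
        if (failedQueue.isEmpty()) {
            return null; // skip the timed poll entirely
        }
        return failedQueue.poll(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```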

@ruanwenjun added the discussion and improvement labels Jul 2, 2022
@ruanwenjun (Member) commented:

If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.

@WangJPLeo (Contributor, Author) commented Jul 4, 2022:

If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.

In fact, I don't think we need to use a dispatch failed queue and set the maxDispatchSize for a task.

Dispatch failed due to the worker network error, this should have nothing to do with the task. So the correct thing is to find the dispatch failed worker, and separate it rather than separate the task.

And I remember in the current design when a task dispatch failed, it will go back to the task queue, and retry-retry, I think this is reasonable.
One possibly thing I think we may need to do is separate the task state Dispatch to Dispatching and Dispatched.

Thank you @ruanwenjun. The Pending state for the task instance and the dispatch-failure queue were added to solve three problems:

  1. The dispatch-failure status is shown to the user, who no longer has to dig through logs or guess the real status of the current task.
  2. The dispatch-failure queue avoids the head-of-line occupancy by high-priority tasks that occurred when failed tasks were put back into the main queue.
  3. It stops the useless log printing that loops endlessly after worker dispatch fails.

Based on these three points, I think your concerns are addressed, consistent with the current design.

@ruanwenjun (Member) replied:

If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.

In fact, I don't think we need to use a dispatch failed queue and set the maxDispatchSize for a task.
Dispatch failed due to the worker network error, this should have nothing to do with the task. So the correct thing is to find the dispatch failed worker, and separate it rather than separate the task.
And I remember in the current design when a task dispatch failed, it will go back to the task queue, and retry-retry, I think this is reasonable.
One possibly thing I think we may need to do is separate the task state Dispatch to Dispatching and Dispatched.

Thank you @ruanwenjun. The Pending state for the task instance and the dispatch-failure queue were added to solve three problems:

  1. The dispatch-failure status is shown to the user, who no longer has to dig through logs or guess the real status of the current task.

This is a good point; we can set the statuses dispatching and dispatched. By the way, this case may occur for all our statuses.

  2. The dispatch-failure queue avoids the head-of-line occupancy by high-priority tasks that occurred when failed tasks were put back into the main queue.

When dispatch fails, the reason is that there is no worker or the worker's network is broken, so the correct thing is to stop consuming commands. Using a failure queue can mitigate the problem, but it amounts to the same solution as the current plan: put the task back into the normal queue and sleep.

When you put tasks into a failure queue, you need another thread to handle it; otherwise you may slow down the normal process.
#9955 (comment)

  3. It stops the useless log printing that loops endlessly after worker dispatch fails.

This problem is fixed by the sleep.

@WangJPLeo (Contributor, Author) replied:

I think it is necessary to use another thread to handle the failure queue and the 3-second-delay risk. But the biggest difference between the failure queue and the current solution is this: when higher-priority tasks fail, the highest-priority task is no longer re-dispatched first every time, and the failure queue lets us do further processing on failed tasks, such as monitoring.

#9955 (comment)

@ruanwenjun (Member) commented Jul 4, 2022:

I think it is necessary to use another thread to handle the failure queue and the 3-second-delay risk. But the biggest difference between the failure queue and the current solution is this: when higher-priority tasks fail, the highest-priority task is no longer re-dispatched first every time, and the failure queue lets us do further processing on failed tasks, such as monitoring.

#9955 (comment)

OK, using a single thread to consume the failure queue looks good to me. And since you add a new status, you also need to add it to NEED_FAILOVER_STATES; I am not sure whether other places need updating as well.

In the current code, adding a new status easily introduces new bugs.
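The single consumer thread agreed on here could be sketched as below. This is an assumption-laden illustration (the class names, the String task type, and the 100 ms poll interval are mine, not the PR's): a dedicated daemon thread drains the failure queue and hands tasks back to the normal queue, so the normal dispatch path never blocks on the failure queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class DispatchFailedQueueConsumer extends Thread {

    private final BlockingQueue<String> failedQueue;
    private final BlockingQueue<String> taskQueue;
    private volatile boolean running = true;

    DispatchFailedQueueConsumer(BlockingQueue<String> failedQueue, BlockingQueue<String> taskQueue) {
        this.failedQueue = failedQueue;
        this.taskQueue = taskQueue;
        setDaemon(true); // don't keep the JVM alive for this background drain
    }

    @Override
    public void run() {
        while (running) {
            try {
                // timed poll so shutdown() is noticed within ~100 ms
                String task = failedQueue.poll(100, TimeUnit.MILLISECONDS);
                if (task != null) {
                    taskQueue.put(task); // hand the task back for normal dispatch
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void shutdown() {
        running = false;
    }
}
```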

@WangJPLeo (Contributor, Author) replied:

OK, thanks for your suggestion. I will optimize the processing logic according to the discussion and check everywhere the status is used.

@caishunfeng caishunfeng modified the milestones: 3.1.0, 3.0.2 Sep 27, 2022
@caishunfeng (Contributor) commented:

Will close this PR due to no update for a long time.

@caishunfeng caishunfeng closed this Feb 2, 2023

Labels

discussion, improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Master] Master frequent print logs

8 participants