Skip to content

[Fix-10842] Fix master/worker failover will cause status incorrect#10839

Merged
ruanwenjun merged 10 commits intoapache:devfrom
ruanwenjun:dev_wenjun_failover
Jul 9, 2022
Merged

[Fix-10842] Fix master/worker failover will cause status incorrect#10839
ruanwenjun merged 10 commits intoapache:devfrom
ruanwenjun:dev_wenjun_failover

Conversation

@ruanwenjun
Copy link
Copy Markdown
Member

@ruanwenjun ruanwenjun commented Jul 7, 2022

Purpose of the pull request

close #10842
Fix master failover may not update task status

Brief change log

  • Split master failover and worker failover
  • Fix worker failover may lost task due to host does update
  • Fix master failover may cause the success task rerun again
  • Add some failover log for troubleshooting

@SbloodyS SbloodyS added bug Something isn't working backend labels Jul 7, 2022
@SbloodyS SbloodyS added this to the 3.0.0-beta-3 milestone Jul 7, 2022
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_failover branch from a27366f to be47c1c Compare July 7, 2022 15:39
@ruanwenjun ruanwenjun marked this pull request as draft July 7, 2022 17:54
@ruanwenjun ruanwenjun changed the title Fix master failover will not update task instance status [Fix-10842]Fix master failover will not update task instance status Jul 8, 2022
@ruanwenjun ruanwenjun marked this pull request as ready for review July 8, 2022 14:28
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_failover branch from 3d0c3be to 9f8e2bb Compare July 8, 2022 14:33
@ruanwenjun ruanwenjun requested a review from SbloodyS July 8, 2022 15:43
@ruanwenjun
Copy link
Copy Markdown
Member Author

@caishunfeng This PR is ready to review, please take a look.

@ruanwenjun ruanwenjun changed the title [Fix-10842]Fix master failover will not update task instance status [Fix-10842] Fix master/worker failover will cause status incorrect Jul 8, 2022
@github-actions github-actions bot removed the backend label Jul 9, 2022
@ruanwenjun ruanwenjun requested a review from caishunfeng July 9, 2022 03:14
private CuringParamsService curingGlobalParamsService;

@Autowired
private ProcessService processService;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quote itself?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to use transaction in spring, otherwise we need to use SprintContext.getBean like before.

@ruanwenjun ruanwenjun requested a review from caishunfeng July 9, 2022 03:28
Copy link
Copy Markdown
Contributor

@caishunfeng caishunfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jul 9, 2022

Codecov Report

❌ Patch coverage is 32.92079% with 271 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.60%. Comparing base (4169801) to head (89ecf99).
⚠️ Report is 2066 commits behind head on dev.

Files with missing lines Patch % Lines
...er/master/processor/queue/TaskExecuteRunnable.java 0.00% 58 Missing ⚠️
.../server/master/runner/WorkflowExecuteRunnable.java 0.00% 58 Missing ⚠️
...r/server/master/service/WorkerFailoverService.java 50.45% 32 Missing and 23 partials ⚠️
...r/server/master/service/MasterFailoverService.java 61.90% 27 Missing and 13 partials ⚠️
...ver/master/consumer/TaskPriorityQueueConsumer.java 0.00% 20 Missing ⚠️
...nscheduler/service/process/ProcessServiceImpl.java 0.00% 19 Missing ⚠️
...e/dolphinscheduler/service/queue/TaskPriority.java 0.00% 7 Missing ⚠️
.../server/master/runner/StateWheelExecuteThread.java 0.00% 5 Missing ⚠️
...rver/master/dispatch/context/ExecutionContext.java 66.66% 2 Missing ⚠️
...server/master/runner/task/CommonTaskProcessor.java 0.00% 2 Missing ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##                dev   #10839      +/-   ##
============================================
- Coverage     40.64%   40.60%   -0.04%     
  Complexity     4827     4827              
============================================
  Files           913      915       +2     
  Lines         36358    36436      +78     
  Branches       4003     4001       -2     
============================================
+ Hits          14776    14794      +18     
- Misses        20102    20160      +58     
- Partials       1480     1482       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Jul 9, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 9 Code Smells

42.8% 42.8% Coverage
6.5% 6.5% Duplication

@ruanwenjun ruanwenjun merged commit 3f69ec8 into apache:dev Jul 9, 2022
@ruanwenjun ruanwenjun deleted the dev_wenjun_failover branch July 9, 2022 03:55
hstdream pushed a commit to hstdream/dolphinscheduler that referenced this pull request Jul 9, 2022
…pache#10839)

* Fix master failover will not update task instance status
* Add some failover log
* Fix worker failover will rerun task more than once
* Fix workflowInstance failover may rerun already success taskInstance
ruanwenjun added a commit to ruanwenjun/dolphinscheduler that referenced this pull request Jul 12, 2022
…correct issue (apache#17)

* [Fix-10842] Fix master/worker failover will cause status incorrect (apache#10839)

* Fix master failover will not update task instance status
* Add some failover log
* Fix worker failover will rerun task more than once
* Fix workflowInstance failover may rerun already success taskInstance

(cherry picked from commit 3f69ec8)

* [Fix-10854] Fix database restart may lost task instance status (apache#10866)

* Fix database update error doesn't rollback the task instance status

* Fix database error may cause workflow dead with running status

(cherry picked from commit f639a2e)
ruanwenjun added a commit that referenced this pull request Jul 19, 2022
…10839)

* Fix master failover will not update task instance status
* Add some failover log
* Fix worker failover will rerun task more than once
* Fix workflowInstance failover may rerun already success taskInstance

(cherry picked from commit 3f69ec8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Master] When worker down, some taskInstance has been dispatched may not be failovered.

4 participants