
Conversation

@zhaobing001
Contributor

@zhaobing001 zhaobing001 commented May 15, 2023

What changes were proposed in this pull request?

The container does not exit because the shuffle client is not closed.

Why are the changes needed?

For #715

1. The process does not exit after the map task or reduce task finishes executing. The reason is that ShuffleWriteClient owns a thread pool whose threads are not shut down when the task completes, so closing ShuffleWriteClient solves this problem.

2. How to reproduce this scenario?
Initialize a small cluster and submit an MR job whose requested resources exceed the total resources of the cluster.
All tasks complete execution but do not quit until the 60-second timeout (mapreduce.task.exit.timeout) elapses, after which the AppMaster asks the NodeManager to kill the corresponding container.

The NodeManager logs are as follows:
`2023-03-12 13:56:45,901 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1676901654399_1653119_m_000070_0: [2023-03-12 13:56:44.909]Container killed by the ApplicationMaster.
[2023-03-12 13:56:44.921]Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 45556 as user tc_infra for container container_e304_1676901654399_1653119_01_000072, result=success
[2023-03-12 13:56:44.985]Container killed on request. Exit code is 143
[2023-03-12 13:56:45.403]Container exited with a non-zero exit code 143.`
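The root cause can be illustrated with a minimal sketch. The class and method names below are illustrative, not the actual Uniffle API: a client whose worker pool uses non-daemon threads keeps the JVM alive after the task finishes, so the task's cleanup path must close the client to let the container exit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal stand-in for a shuffle write client. Its worker pool uses
// non-daemon threads, so an unclosed client prevents the container
// JVM from exiting even after the task's work is done.
public class SimpleShuffleWriteClient implements AutoCloseable {
    private final ExecutorService sendPool = Executors.newFixedThreadPool(2);

    public void sendShuffleData(Runnable task) {
        sendPool.execute(task);
    }

    public boolean isClosed() {
        return sendPool.isTerminated();
    }

    @Override
    public void close() {
        // The essence of the fix: shut the pool down when the task
        // completes so no non-daemon worker threads linger.
        sendPool.shutdown();
        try {
            sendPool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In the MR plugin the analogous call belongs in the collector's close path, so the client is released exactly when the task attempt finishes rather than waiting for the AppMaster's kill timeout.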

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

@zhaobing001 zhaobing001 changed the title Close shuffleclient to avoid resolving container without exiting The container does not exit because shuffleclient is not closed May 15, 2023
@jerqi jerqi changed the title The container does not exit because shuffleclient is not closed [#715] fix(mr): The container does not exit because shuffleclient is not closed May 15, 2023
@codecov-commenter

codecov-commenter commented May 15, 2023

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 57.26%. Comparing base (8c0c37e) to head (c027532).
Report is 775 commits behind head on master.

Files Patch % Lines
...rg/apache/hadoop/mapred/RssMapOutputCollector.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #882      +/-   ##
============================================
- Coverage     58.60%   57.26%   -1.34%     
- Complexity     1580     2172     +592     
============================================
  Files           194      310     +116     
  Lines         10871    13898    +3027     
  Branches        956     1278     +322     
============================================
+ Hits           6371     7959    +1588     
- Misses         4126     5507    +1381     
- Partials        374      432      +58     


@jerqi
Contributor

jerqi commented May 19, 2023

LGTM, thanks @zhaobing001, good catch! Merged to master.

@jerqi jerqi merged commit 762df54 into apache:master May 19, 2023
@jerqi
Contributor

jerqi commented May 19, 2023

Merged to branch-0.7.

jerqi pushed a commit that referenced this pull request May 19, 2023
…not closed (#882)


Co-authored-by: zhaobing <zhaobing@zhihu.com>