-
Notifications
You must be signed in to change notification settings - Fork 168
[#1119] improvement: Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER #1393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1393 +/- ##
============================================
+ Coverage 53.23% 54.26% +1.02%
Complexity 2720 2720
============================================
Files 419 402 -17
Lines 23994 21667 -2327
Branches 2051 2043 -8
============================================
- Hits 12774 11758 -1016
+ Misses 10427 9187 -1240
+ Partials 793 722 -71 ☔ View full report in Codecov by Sentry. |
…re_1119_new_2023_12_23 * feature_1119_BUFFER_LIMIT_OF_HUGE_PARTITION: add file for commit [apache#1189] feat(server): Add netty used direct memory size metric (apache#1363) format format format rename file rename [apache#960] fix(dashboard): simplify dependency and correct the startup script (apache#1347) [apache#1348] improvement(metrics): Unify tags generation for shuffle-server metrics reporter (apache#1349) [MINOR] chore: fix kubernetes ci pipeline (apache#1368) [MINOR] fix(spark): Fix NPE for ShuffleWriteClientImpl.unregisterShuffle (apache#1367) format code format code 添加堆外内存监控的代码
| ShuffleTaskInfo shuffleTaskInfo = shuffleTaskInfos.get(appId); | ||
| if (null == shuffleTaskInfo) { | ||
| return RequireBufferStatusCode.NO_REGISTER.statusCode(); | ||
| LOG.error("Find not registered app, appId: {}, shuffleId: {}", appId, shuffleId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about no such app is registered. appId: {}, shuffleId: {}
| appId, shuffleId, partitionId, partitionUsedDataSize)) { | ||
| ShuffleServerMetrics.counterTotalRequireBufferFailedForHugePartition.inc(); | ||
| return RequireBufferStatusCode.NO_BUFFER.statusCode(); | ||
| String errorMessage = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not throwing the NoBufferForHugePartitionException ?
| return RequireBufferStatusCode.NO_BUFFER.statusCode(); | ||
| String errorMessage = | ||
| String.format( | ||
| "No Buffer For Huge Partition, appId: %s, shuffleId: %s, partitionIds: %s, partitionUsedDataSize: %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about Huge partition is limited to writing. appId: %s, shuffleId: %s, partitionIds: %s, partitionUsedDataSize: %s ?
| requireBufferId = -1; | ||
| status = StatusCode.NO_REGISTER; | ||
| ShuffleServerMetrics.counterTotalRequireBufferFailed.inc(); | ||
| } catch (NoBufferException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to handle NoBufferForHugePartitionException, I don't see these code/.
zuston
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some tests for huge partition no buffer exception ?
Ok, I added sendDataAndRequireBufferTest() test case in ShuffleServerGrpcTest. Mainly: @zuston I added new test case. |
zuston
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope the memory message of being limited by huge partition should be shown in client side rather than only having one exception in server
| requireBufferId = -1; | ||
| status = StatusCode.NO_REGISTER; | ||
| ShuffleServerMetrics.counterTotalRequireBufferFailed.inc(); | ||
| } catch (NoBufferException | NoBufferForHugePartitionException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks weird, why not introducing new stateCode of NoBufferForHugePartitionException
b30aedc to
a94121c
Compare
| TIMEOUT(7), | ||
| ACCESS_DENIED(8), | ||
| INVALID_REQUEST(9), | ||
| NO_BUFFER_FOR_HUGE_PARTITION(3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should consider the compatility.
NO_BUFFER_FOR_HUGE_PARTITION(10)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. should be 11
a54636f to
096820f
Compare
096820f to
37b897c
Compare
37b897c to
0079888
Compare
| TIMEOUT(7), | ||
| ACCESS_DENIED(8), | ||
| INVALID_REQUEST(9), | ||
| NO_BUFFER_FOR_HUGE_PARTITION(3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. should be 11
| protected static final int JETTY_PORT_1 = 19998; | ||
| protected static final int JETTY_PORT_2 = 20040; | ||
| protected static final String COORDINATOR_QUORUM = LOCALHOST + ":" + COORDINATOR_PORT_1; | ||
| protected static final String SHUFFLE_SERVER_METRICS_URL = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need this?
| public static final String LOCAL_DISK_PATH_LABEL_ALL = "ALL"; | ||
| private static final String TOTAL_REQUIRE_BUFFER_FAILED = "total_require_buffer_failed"; | ||
| private static final String TOTAL_REQUIRE_BUFFER_FAILED_FOR_HUGE_PARTITION = | ||
| private static final String TOTAL_REQUIRE_BUFFER_FAILED_FOR_NO_REGISTER = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not related with this PR. Right?
| final int backOffBase = 2000; | ||
| while (rpcResponse.getStatus() == RssProtos.StatusCode.NO_BUFFER) { | ||
| while (rpcResponse.getStatus() == RssProtos.StatusCode.NO_BUFFER | ||
| || rpcResponse.getStatus() == RssProtos.StatusCode.NO_BUFFER_FOR_HUGE_PARTITION) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expose the concrete status code into log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @lifeSo
|
Could you reply all the comments from zuston? |
ad3106c to
0fd91b2
Compare
@jerqi I checked and find one place miss modify, and I corrected it. |
You should reply the comment instead of marking the comment resolved. The reply is the your solution. Maybe it will be very simple like |
54687ce to
94abd01
Compare
Update CoordinatorGrpcClient.java, remove un-necessary code 添加require buffer测试样例,加强原有测试样例 add NO_BUFFER_FOR_HUGE_PARTITION status code. add require_buffer_failed_for_huge_partition exception unit test. revert status code fix log msg add status code revert
94abd01 to
9085751
Compare
|
#1425 Merged. And close this. |


What changes were proposed in this pull request?
Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER
Why are the changes needed?
Fix: # #1119
Does this PR introduce any user-facing change?
No.
How was this patch tested?
test on prd and unit test case