Skip to content

Conversation

@lifeSo
Copy link
Collaborator

@lifeSo lifeSo commented Dec 23, 2023

What changes were proposed in this pull request?

Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER

Why are the changes needed?

Fix: # #1119

Does this PR introduce any user-facing change?

No.

How was this patch tested?

test on prd and unit test case

@codecov-commenter
Copy link

codecov-commenter commented Dec 23, 2023

Codecov Report

Attention: 46 lines in your changes are missing coverage. Please review.

Comparison is base (44fad8d) 53.23% compared to head (9085751) 54.26%.
Report is 10 commits behind head on master.

Files Patch % Lines
...pache/uniffle/server/ShuffleServerGrpcService.java 0.00% 17 Missing ⚠️
...ffle/client/impl/grpc/ShuffleServerGrpcClient.java 0.00% 9 Missing ⚠️
...he/uniffle/common/exception/NoBufferException.java 0.00% 6 Missing ⚠️
...n/exception/NoBufferForHugePartitionException.java 0.00% 6 Missing ⚠️
.../uniffle/common/exception/NoRegisterException.java 0.00% 6 Missing ⚠️
.../org/apache/uniffle/server/ShuffleTaskManager.java 81.81% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1393      +/-   ##
============================================
+ Coverage     53.23%   54.26%   +1.02%     
  Complexity     2720     2720              
============================================
  Files           419      402      -17     
  Lines         23994    21667    -2327     
  Branches       2051     2043       -8     
============================================
- Hits          12774    11758    -1016     
+ Misses        10427     9187    -1240     
+ Partials        793      722      -71     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lifeSo lifeSo changed the title [1119][Improvement] Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER [1119][Improvement] [For Test Now] Dec 23, 2023
@lifeSo lifeSo marked this pull request as draft December 26, 2023 03:40
@lifeSo lifeSo changed the title [1119][Improvement] [For Test Now] [1119] improvement: Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER Dec 26, 2023
…re_1119_new_2023_12_23

* feature_1119_BUFFER_LIMIT_OF_HUGE_PARTITION:
  add file for commit
  [apache#1189] feat(server): Add netty used direct memory size metric (apache#1363)
  format
  format
  format
  rename
  file rename
  [apache#960] fix(dashboard): simplify dependency and correct the startup script (apache#1347)
  [apache#1348] improvement(metrics): Unify tags generation for shuffle-server metrics reporter (apache#1349)
  [MINOR] chore: fix kubernetes ci pipeline (apache#1368)
  [MINOR] fix(spark): Fix NPE for ShuffleWriteClientImpl.unregisterShuffle (apache#1367)
  format code
  format code
  添加堆外内存监控的代码
@lifeSo lifeSo marked this pull request as ready for review December 26, 2023 13:48
@lifeSo lifeSo requested review from jerqi and zuston December 26, 2023 15:39
@lifeSo lifeSo changed the title [1119] improvement: Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER [#1119] improvement: Explicitly throw BUFFER_LIMIT_OF_HUGE_PARTITION instead of NO_BUFFER Dec 27, 2023
ShuffleTaskInfo shuffleTaskInfo = shuffleTaskInfos.get(appId);
if (null == shuffleTaskInfo) {
return RequireBufferStatusCode.NO_REGISTER.statusCode();
LOG.error("Find not registered app, appId: {}, shuffleId: {}", appId, shuffleId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about no such app is registered. appId: {}, shuffleId: {}

appId, shuffleId, partitionId, partitionUsedDataSize)) {
ShuffleServerMetrics.counterTotalRequireBufferFailedForHugePartition.inc();
return RequireBufferStatusCode.NO_BUFFER.statusCode();
String errorMessage =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not throwing the NoBufferForHugePartitionException ?

return RequireBufferStatusCode.NO_BUFFER.statusCode();
String errorMessage =
String.format(
"No Buffer For Huge Partition, appId: %s, shuffleId: %s, partitionIds: %s, partitionUsedDataSize: %s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about Huge partition is limited to writing. appId: %s, shuffleId: %s, partitionIds: %s, partitionUsedDataSize: %s ?

requireBufferId = -1;
status = StatusCode.NO_REGISTER;
ShuffleServerMetrics.counterTotalRequireBufferFailed.inc();
} catch (NoBufferException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to handle NoBufferForHugePartitionException, I don't see these code/.

Copy link
Member

@zuston zuston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some tests for huge partition no buffer exception ?

@lifeSo
Copy link
Collaborator Author

lifeSo commented Dec 30, 2023

Could you add some tests for huge partition no buffer exception ?

Ok, I added sendDataAndRequireBufferTest() test case in ShuffleServerGrpcTest. Mainly:
image

image

@zuston I added new test case.

Copy link
Member

@zuston zuston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the memory message of being limited by huge partition should be shown in client side rather than only having one exception in server

requireBufferId = -1;
status = StatusCode.NO_REGISTER;
ShuffleServerMetrics.counterTotalRequireBufferFailed.inc();
} catch (NoBufferException | NoBufferForHugePartitionException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks weird, why not introducing new stateCode of NoBufferForHugePartitionException

@lifeSo lifeSo force-pushed the feature_1119_new_2023_12_23 branch from b30aedc to a94121c Compare January 2, 2024 03:30
TIMEOUT(7),
ACCESS_DENIED(8),
INVALID_REQUEST(9),
NO_BUFFER_FOR_HUGE_PARTITION(3),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should consider the compatility.

NO_BUFFER_FOR_HUGE_PARTITION(10)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. should be 11

@lifeSo lifeSo force-pushed the feature_1119_new_2023_12_23 branch from a54636f to 096820f Compare January 2, 2024 07:31
@lifeSo lifeSo marked this pull request as draft January 2, 2024 07:40
@lifeSo lifeSo force-pushed the feature_1119_new_2023_12_23 branch from 096820f to 37b897c Compare January 2, 2024 07:58
@lifeSo lifeSo marked this pull request as ready for review January 2, 2024 07:59
@lifeSo lifeSo force-pushed the feature_1119_new_2023_12_23 branch from 37b897c to 0079888 Compare January 2, 2024 08:16
TIMEOUT(7),
ACCESS_DENIED(8),
INVALID_REQUEST(9),
NO_BUFFER_FOR_HUGE_PARTITION(3),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. should be 11

protected static final int JETTY_PORT_1 = 19998;
protected static final int JETTY_PORT_2 = 20040;
protected static final String COORDINATOR_QUORUM = LOCALHOST + ":" + COORDINATOR_PORT_1;
protected static final String SHUFFLE_SERVER_METRICS_URL =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need this?

public static final String LOCAL_DISK_PATH_LABEL_ALL = "ALL";
private static final String TOTAL_REQUIRE_BUFFER_FAILED = "total_require_buffer_failed";
private static final String TOTAL_REQUIRE_BUFFER_FAILED_FOR_HUGE_PARTITION =
private static final String TOTAL_REQUIRE_BUFFER_FAILED_FOR_NO_REGISTER =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not related with this PR. Right?

final int backOffBase = 2000;
while (rpcResponse.getStatus() == RssProtos.StatusCode.NO_BUFFER) {
while (rpcResponse.getStatus() == RssProtos.StatusCode.NO_BUFFER
|| rpcResponse.getStatus() == RssProtos.StatusCode.NO_BUFFER_FOR_HUGE_PARTITION) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expose the concrete status code into log message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @lifeSo

@lifeSo lifeSo requested review from jerqi and zuston January 3, 2024 08:26
@jerqi
Copy link
Contributor

jerqi commented Jan 3, 2024

Could you reply all the comments from zuston?

@lifeSo lifeSo force-pushed the feature_1119_new_2023_12_23 branch from ad3106c to 0fd91b2 Compare January 3, 2024 09:35
@lifeSo
Copy link
Collaborator Author

lifeSo commented Jan 3, 2024

**zuston ** reviewed Dec 28, 2023
View reviewed changes

Could you reply all the comments from zuston?

@jerqi I checked and find one place miss modify, and I corrected it.
I thin all comments is reply and corrected.

@jerqi
Copy link
Contributor

jerqi commented Jan 3, 2024

**zuston ** reviewed Dec 28, 2023
View reviewed changes

Could you reply all the comments from zuston?

@jerqi I checked and find one place miss modify, and I corrected it. I thin all comments is reply and corrected.

You should reply the comment instead of marking the comment resolved. The reply is the your solution. Maybe it will be very simple like fixed. Let reviewer mark the comment resolved.

@lifeSo lifeSo force-pushed the feature_1119_new_2023_12_23 branch from 54687ce to 94abd01 Compare January 7, 2024 10:13
Update CoordinatorGrpcClient.java, remove un-necessary code

添加require buffer测试样例,加强原有测试样例

add NO_BUFFER_FOR_HUGE_PARTITION status code.

add require_buffer_failed_for_huge_partition exception unit test.

revert status code

fix log msg add status code

revert
@zuston
Copy link
Member

zuston commented Jan 8, 2024

#1425 Merged. And close this.

@zuston zuston closed this Jan 8, 2024
@lifeSo lifeSo deleted the feature_1119_new_2023_12_23 branch January 8, 2024 03:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants