@Syspretor Syspretor commented Feb 20, 2025

As Fluid is increasingly adopted in production environments, the number of datasets within a cluster has reached the order of thousands. This places higher demands on the overall performance of the RuntimeController. The most critical metric is the time it takes for a dataset to become bound after creation, as this directly affects how quickly business workloads become ready. Extensive testing has shown that the time for a dataset to become bound increases exponentially with the number of datasets: in a cluster with thousands of datasets, the average time for a single dataset to become bound has reached 40 seconds. This PR therefore aims to optimize the performance of the RuntimeController in large-scale dataset clusters.

Testing has revealed that the main factor slowing the binding of a runtime and its dataset is the large amount of repeated enqueuing of runtimes/datasets. This keeps the reconcile workers continuously busy, leading to queue backlog and delays. There are two major sources of repeated enqueuing:

  1. All runtimes are enqueued by default at 90-second intervals. This action is intended to periodically sync the status of runtimes/datasets.
  2. Dataset updates lead to the corresponding runtime being enqueued as well.

This PR focuses on optimizing the controller's performance in handling runtimes by addressing these two issues. It introduces two configurable environment variables for the controller: FLUID_RUNTIME_RECONCILE_DURATION and FLUID_RUNTIME_RECONCILE_DURATION_OFFSET.

For runtimes that do not require immediate dataset cache status updates, FLUID_RUNTIME_RECONCILE_DURATION (in seconds, default is 90) can be used to control the default reconcile interval. Increasing this interval appropriately can help reduce the pressure on the reconcile queue.

env: 
- name: FLUID_RUNTIME_RECONCILE_DURATION
  value: "180"

When FLUID_RUNTIME_RECONCILE_DURATION is set to -1, runtimes such as ThinRuntime, which do not support reporting cache stats, are no longer reconciled periodically. Updates to the dataset, runtime, or runtime workload still trigger the runtime to be enqueued for reconciliation.
These changes are intended to significantly reduce queue pressure and improve performance.

env: 
- name: FLUID_RUNTIME_RECONCILE_DURATION
  value: "-1"
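
For illustration, here is a minimal Go sketch of how the controller can interpret this variable; the helper name getRequeueDuration and the exact fallback behavior are assumptions for this sketch, not necessarily the PR's actual code.

package utils

import (
	"os"
	"strconv"
	"time"
)

// getRequeueDuration reads FLUID_RUNTIME_RECONCILE_DURATION and reports
// whether periodic reconciliation is enabled and, if so, the requeue interval.
func getRequeueDuration() (enabled bool, d time.Duration) {
	val := os.Getenv("FLUID_RUNTIME_RECONCILE_DURATION")
	if val == "" {
		return true, 90 * time.Second // unset: fall back to the 90s default
	}
	seconds, err := strconv.Atoi(val)
	if err != nil {
		return true, 90 * time.Second // malformed value: keep the default
	}
	if seconds == -1 {
		return false, 0 // -1 disables periodic requeueing entirely
	}
	return true, time.Duration(seconds) * time.Second
}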

FLUID_RUNTIME_RECONCILE_DURATION_OFFSET is designed for scenarios where datasets/runtimes are created in batches. In such scenarios, all runtimes are enqueued simultaneously and, after processing, are re-enqueued after the same interval set by FLUID_RUNTIME_RECONCILE_DURATION, which leads to bursty and therefore suboptimal utilization of the reconcile workers. Configuring FLUID_RUNTIME_RECONCILE_DURATION_OFFSET together with FLUID_RUNTIME_RECONCILE_DURATION sets a random requeue interval within a specified range. For instance, the following configuration enqueues runtimes at a random interval between 80 and 160 seconds, staggering the re-enqueue times of batch-created datasets/runtimes and improving the utilization of reconcile workers.

env: 
- name: FLUID_RUNTIME_RECONCILE_DURATION
  value: "120"
- name: FLUID_RUNTIME_RECONCILE_DURATION_OFFSET
  value: "40"
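
For illustration, a minimal Go sketch of how such a randomized requeue interval can be derived from the two variables; the function name and the handling of a non-positive offset are assumptions, not necessarily the PR's exact code.

package utils

import (
	"math/rand"
	"time"
)

// randomRequeueDuration picks an interval uniformly from
// [duration-offset, duration+offset] seconds. With duration=120 and
// offset=40, the result lies between 80s and 160s.
func randomRequeueDuration(duration, offset int) time.Duration {
	if offset <= 0 {
		return time.Duration(duration) * time.Second
	}
	seconds := rand.Intn(2*offset+1) + duration - offset
	return time.Duration(seconds) * time.Second
}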

In addition, this PR optimizes the logic for updating a dataset's status during the sync process, preventing runtimes from being passively enqueued by frequent dataset updates.

Testing has shown that, after optimization, the time taken for a dataset to become bound in a cluster with thousands of datasets has been reduced from an average of over 40 seconds to just over 1 second. Furthermore, this performance is maintained even in larger cluster sizes.

@Syspretor Syspretor requested review from TrafalgarZZZ and cheyang and removed request for cheyang February 20, 2025 02:37
@Syspretor Syspretor force-pushed the enhancement/improve-runtime-controller-in-large-scale-scenarios branch from 50768b5 to bb2e883 on February 20, 2025 02:38
@Syspretor Syspretor force-pushed the enhancement/improve-runtime-controller-in-large-scale-scenarios branch 2 times, most recently from 3082e72 to 1f4e010 on February 21, 2025 01:58
Signed-off-by: jiuyu <guotongyu.gty@alibaba-inc.com>
@Syspretor Syspretor force-pushed the enhancement/improve-runtime-controller-in-large-scale-scenarios branch from 1f4e010 to bb0ebbd on February 21, 2025 02:12

@Syspretor Syspretor requested a review from cheyang February 21, 2025 02:36
return
}

duration, err := strconv.Atoi(RuntimeReconcileDurationEnvVal)

If this function is called frequently (such as during each Reconcile), repeatedly parsing the string wastes CPU. I suggest doing the parsing once, e.g. in an init function.
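
A minimal sketch of this suggestion, with illustrative names: the value is parsed once at package initialization and cached in a variable that later calls can read directly.

package utils

import (
	"os"
	"strconv"
)

// runtimeReconcileDuration caches the parsed value so that each Reconcile
// does not re-parse the environment variable.
var runtimeReconcileDuration = 90 // seconds, default

func init() {
	if v := os.Getenv("FLUID_RUNTIME_RECONCILE_DURATION"); v != "" {
		if d, err := strconv.Atoi(v); err == nil {
			runtimeReconcileDuration = d
		}
	}
}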

return finishTime.Sub(creationTime).Round(time.Second).String()
}

func GenerateRandomRequeueDurationFromEnv() (needReconcile bool, d time.Duration) {

GenerateRandomRequeueDurationFromEnv -> GenerateRandomRequeueDuration

return
}
r := rand.New(rand.NewSource(time.Now().UnixNano()))
randomDurationValue := (r.Intn(2*offset+1) + duration - offset)

Why not reuse the global math/rand instance directly? That would avoid repeatedly creating a rand.Source and rand.Rand.
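
Sketched with the package-level math/rand functions, which are safe for concurrent use and automatically seeded since Go 1.20, so no per-call rand.Source is needed (names follow the snippet above):

// Package-level rand; no new rand.Source / rand.Rand created per call.
randomDurationValue := rand.Intn(2*offset+1) + duration - offset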


cheyang commented Feb 21, 2025

@Syspretor Thank you for providing this useful solution. I think it would be better to also provide documentation for end users.

@cheyang cheyang left a comment

/lgtm
/approve

fluid-e2e-bot bot commented Feb 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheyang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cheyang cheyang merged commit 7e917c7 into fluid-cloudnative:master Feb 23, 2025
13 of 14 checks passed