Proposal: Enhancing Forgejo Actions with Autoscaling Capabilities #241

Open
opened 2024-11-21 16:13:02 +01:00 by cobak78 · 24 comments
Member

Hi everyone,

I wanted to share an idea we're exploring to improve the scalability and efficiency of Forgejo Actions runners, using KEDA in a kubernetes environment, and gather your thoughts.

Motivation

Our team is growing, and the workloads we run on Forgejo Actions are increasing. To meet these demands, we need to increase parallelization when running Actions jobs without unnecessarily blocking or over-provisioning resources. Moving to an autoscaling model will help us achieve this, ensuring that resources are only allocated when there are tasks to process.

Here are the key points:

  1. Metrics for waiting tasks
  • To enable an autoscaling model for Actions runners, we need Forgejo to provide a metric for the number of tasks waiting to be executed, so the new Forgejo autoscaler can read it and call a runner.
  • We are in conversations with KEDA project contributors to create a PR where you can define a Forgejo autoscaler based on these metrics.
  • These metrics should be accessible at the organization, user, and repository levels and will match how you configure the autoscaler.
  • We’re proposing to create new API endpoints that expose the metrics mentioned above.
  2. Runner lifecycle change
  • Currently, Forgejo runners operate as persistent daemons. To better integrate with KEDA job autoscaling, we propose changing the runner lifecycle to function as jobs instead of long-running processes.
  • We have already forked the current runner and added a new command that executes all pending tasks and exits.
  • This would allow runners to scale to 0 when no jobs are running, optimizing resource usage in dynamic environments.
  3. Benefits
  • These changes would enhance Forgejo Actions’ usability in Kubernetes, making it a more robust solution in cloud-native setups.
  • Autoscaling would provide a cost-efficient and flexible system, ensuring runners only consume resources when tasks are queued.
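
The run-to-completion lifecycle in point 2 could look roughly like the following sketch. This is a hypothetical Python illustration only: the actual forked runner is not shown here, and `fetch_task`/`execute` are placeholder names, not a real runner API.

```python
import time

def run_until_drained(fetch_task, execute, idle_timeout=30.0, poll_interval=2.0):
    """Run-to-completion loop for a job-style runner (illustrative sketch).

    fetch_task and execute stand in for the runner's real task-polling and
    execution logic. Once the queue stays empty for idle_timeout seconds,
    the process exits instead of daemonizing, so the pod terminates and
    the workload can scale back to zero.
    """
    executed = 0
    idle_since = time.monotonic()
    while True:
        task = fetch_task()
        if task is None:
            if time.monotonic() - idle_since >= idle_timeout:
                return executed  # queue drained: exit instead of waiting forever
            time.sleep(poll_interval)
            continue
        execute(task)
        executed += 1
        idle_since = time.monotonic()  # reset the idle clock after real work
```

The key design point is that an empty queue ends the process rather than putting it to sleep, which is what lets the autoscaler drive the replica count to zero.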

We're already working on a proof of concept that integrates all these points, and the first results are very promising.

Looking forward to hearing your thoughts and feedback.

Owner

There seems to be related work going on to allow for auto-scaling to happen in #5849. CC @wetneb in case you're interested.


Could you clarify what you mean by metric? I.e., is it a Prometheus one, or an endpoint that returns the number of pending tasks?

Member

Nice that you are also interested in this topic! I would very much welcome your thoughts on forgejo/forgejo#5849 indeed (thank you @Gusted).

Currently, forgejo runners operate as persistent daemons. To better integrate with KEDA job autoscaling we propose changing the runner lifecycle to function as jobs instead of long-running processes.

It's hard for me to understand how you could avoid having any persistent daemon: something needs to poll Forgejo regularly to check if jobs are available and provision the workers, right?

Author
Member

Could you clarify what you mean by metric? I.e., is it a Prometheus one, or an endpoint that returns the number of pending tasks?

It's hard for me to understand how you could avoid having any persistent daemon: something needs to poll Forgejo regularly to check if jobs are available and provision the workers, right?

The two comments are related. In this new idea, KEDA is responsible for gathering information about pending jobs. For this, we need that "metric" in the form KEDA requires, which is just the new endpoints. This is essentially a simple value indicating how many jobs are pending, similar to how it works with the autoscalers for GitHub or Azure. Here's the link (https://keda.sh/docs/2.16/scalers/) that explains how scalers work; in this case, there would be a specific one for Forgejo.
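
To make the "metric is just an endpoint" idea concrete, here is a small sketch. The response shape is an assumption for illustration (the real field names are whatever the new Forgejo endpoints end up exposing), and the second function mirrors the `ceil(queue length / target value)` calculation KEDA scalers typically use to pick a replica count:

```python
import json
import math

def pending_jobs_from_response(body: str) -> int:
    # Assumed response shape, e.g. {"count": 3}; the actual field names
    # are defined by the proposed Forgejo endpoints, not by this sketch.
    return int(json.loads(body)["count"])

def desired_runners(pending: int, target_per_runner: int = 1, max_runners: int = 10) -> int:
    # KEDA-style scaling decision: ceil(queue length / target value),
    # clamped to a maximum, with 0 pending jobs meaning "scale to zero".
    if pending <= 0:
        return 0
    return min(math.ceil(pending / target_per_runner), max_runners)
```

So with a target of one task per runner and 3 tasks waiting, KEDA would ask for 3 runner jobs, and for 0 when the queue is empty.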


Thanks a lot, I just wanted to know if it impacts me, and since it's not a Prometheus one, it doesn't.

Good luck with adding this then!

Author
Member

Hi, during the process of testing and experimenting with this proof of concept, I realized something: each execution in a pod with the runner is registered individually. This currently results in a list of all the pods that have been executed (one entry per pod).

What do you think would be more feasible: maintaining a shared volume between pods (using a PVC) to store the runner's config for reuse, which may create some parallelization problems, or creating an endpoint to deregister each runner after every execution?

Member

I would imagine that once a runner is registered, it's fine to share the config files among multiple instances of that runner. The only concurrency you'd have after that is taking a job, but normally the existing API endpoint should be safe for that (two instances shouldn't get assigned the same task).

Ideally, I would say the "runner" should be the single process that takes care of the auto-scaling and dispatching the jobs to the different machines it spawns. The process that you'd run on your different machines would only talk to their supervisor. That does mean adding more complexity on the runner side, but would reduce the traffic on Forgejo's side (only one process is polling Forgejo for tasks).
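
That supervisor idea could be sketched like this. It is a hypothetical illustration under stated assumptions: the names are invented, threads stand in for the machines the real supervisor would spawn, and a queue stands in for the dispatch mechanism.

```python
import queue
import threading

def supervise(poll_forgejo, dispatch, max_workers=4):
    """Single-poller supervisor sketch: one loop polls Forgejo for tasks
    and hands them to workers it spawned, so only one process generates
    polling traffic against Forgejo."""
    tasks = queue.Queue()

    def worker():
        while True:
            task = tasks.get()
            if task is None:   # sentinel: no more work, shut down
                return
            dispatch(task)     # e.g. provision a pod/VM and run the job there

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for task in poll_forgejo():  # the single polling loop
        tasks.put(task)
    for _ in threads:            # one shutdown sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
```

The trade-off is as described above: the supervisor gains scheduling complexity, but Forgejo only ever sees one polling client instead of one per worker.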

Author
Member

Thanks for the answer, I will explore how to share the runner config.
The first approach was that every time a job is detected, a runner should register and take that job. I'm leaning towards thinking that having registered autoscalers instead of registered runners would simplify the label and runner config in this case.

Owner

Would it be sensible to have a runner that can run jobs on remote instances rather than setting up and registering a runner on deployed machines?
Basically, allow a runner to connect to a remote machine via SSH and launch a job there, receiving the logs back via the same connection. A docker/podman socket can be shared via SSH.

Author
Member

After resolving some issues and moving the runner registration logic, I think we are ready to start with the PRs. However, I'm unsure about which repository I should submit the issue or the PRs for the new endpoints to.
Should they go directly to Gitea, similar to how the registration token generation endpoints were added in version 7 (https://github.com/go-gitea/gitea/pull/27144) and then brought into Forgejo later?

Thank you!

Owner
That is great news.

> Should it go directly to Gitea?

Please take a look at https://forgejo.org/compare-to-gitea/#should-i-submit-all-my-pull-requests-to-forgejo-or-are-there-changes-youd-rather-prefer-see-submitted-to-gitea

https://codeberg.org/forgejo/forgejo/actions/runs/51788/jobs/8#jobstep-5-239 this error is related.

--- FAIL: TestAPISearchActionJobs_GlobalRunner (0.44s)
    testlogger.go:405: 2025/01/09 08:10:29 ...les/storage/local.go:33:NewLocalStorage() [I] Creating new Local Storage at /workspace/forgejo/forgejo/tests/gitea-lfs-meta
    testlogger.go:405: 2025/01/09 08:10:29 ...eb/routing/logger.go:102:func1() [I] router: completed GET /user/login for test-mock:12345, 200 OK in 1.9ms @ auth/auth.go:144(auth.SignIn)
    testlogger.go:405: 2025/01/09 08:10:29 ...eb/routing/logger.go:102:func1() [I] router: completed POST /user/login for test-mock:12345, 303 See Other in 28.0ms @ auth/auth.go:177(auth.SignInPost)
    testlogger.go:405: 2025/01/09 08:10:29 ...eb/routing/logger.go:102:func1() [I] router: completed GET /user/settings/applications for test-mock:12345, 200 OK in 8.0ms @ setting/applications.go:24(setting.Applications)
    testlogger.go:405: 2025/01/09 08:10:29 ...eb/routing/logger.go:102:func1() [I] router: completed POST /user/settings/applications for test-mock:12345, 303 See Other in 8.1ms @ setting/applications.go:34(setting.ApplicationsPost)
    testlogger.go:405: 2025/01/09 08:10:29 ...eb/routing/logger.go:102:func1() [I] router: completed GET /user/settings/applications for test-mock:12345, 200 OK in 6.1ms @ setting/applications.go:24(setting.Applications)
    testlogger.go:405: 2025/01/09 08:10:29 ...eb/routing/logger.go:102:func1() [I] router: completed GET /api/v1/admin/runners/jobs?labels=ubuntu-latest for test-mock:12345, 200 OK in 6.8ms @ admin/runners.go:29(admin.SearchActionRunJobs)
    api_admin_actions_test.go:37: 
        	Error Trace:	/workspace/forgejo/forgejo/tests/integration/api_admin_actions_test.go:37
        	Error:      	"[0xc0014e2380 0xc0014e2460]" should have 1 item(s), but has 2
        	Test:       	TestAPISearchActionJobs_GlobalRunner

The Forgejo side of this is done and the runner side (https://code.forgejo.org/forgejo/runner/pulls/423) is close to being merged.

What remains to do is an addition to the documentation at https://codeberg.org/forgejo/docs. It can be terse because the Forgejo runner side of it is rather trivial.

What would be really useful is a showcase to explain how this can all fit together. Do you have plans to do that already?


@cobak78 gentle ping?

Author
Member

Sure, I will add the explanation to the documentation soon. Do you have any preference for where to publish the showcase?


Do you have any preference for where to publish the showcase?

Not really, as long as it is publicly accessible.


@cobak78 gentle ping?

Author
Member

Here is the PR with a new section on how to configure runners with KEDA autoscalers.
forgejo/docs#1073

I'm currently working on the KEDA PR to enable the new autoscaler, so it can't yet be tested with an official image.
Here is the PR if you want to track it down https://github.com/kedacore/keda/pull/6495

Author
Member

Also, I created a public Gist that I will keep updating with new configurations and use cases: https://gist.github.com/cobak78/22312ddafe2fd18a64aca16410b5b4f5


For the record, an insightful conversation about the relationship between KEDA and Kedify: forgejo/docs#1073 (comment)

Owner

Via forgejo/forgejo#7093 I got into reading https://github.com/go-gitea/gitea/issues/32461; it seems like this feature, if implemented, would help with autoscaling?

Author
Member

@Gusted wrote in #241 (comment):

Via forgejo/forgejo#7093 I got into reading https://github.com/go-gitea/gitea/issues/32461; it seems like this feature, if implemented, would help with autoscaling?

It seems they are trying to address the same scalability needs on runners, starting with an ephemeral runner execution, as I did in this forgejo runner PR. https://code.forgejo.org/forgejo/runner/pulls/423

I'll keep myself updated on their progress.


@cobak78 wrote in #241 (comment):

It seems they are trying to address the same scalability needs on runners, starting with an ephemeral runner execution, as I did in this forgejo runner PR. https://code.forgejo.org/forgejo/runner/pulls/423

I believe there is a slight (but important) difference. The ephemeral runner is removed from Gitea once it finishes running the job, mirroring how GitHub treats ephemeral runners. The assumption is that once a job is done running, the credentials for that runner have been exposed to the code running inside the job, and therefore cannot be reused.

It would be great if there was an `--ephemeral` flag when a runner is registered. After a job is run, regardless of outcome, the runner should be removed from Forgejo and the token invalidated.

I am looking at both Gitea and Forgejo in an effort to maybe add support for both in GARM (https://github.com/cloudbase/garm) once I get some cycles.
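
The lifecycle an `--ephemeral` flag would imply can be sketched as follows. The flag is only proposed above, and the function names here are hypothetical placeholders, not a real runner API:

```python
def run_ephemeral(register, take_job, run, deregister):
    """Ephemeral runner lifecycle sketch: register, run exactly one job,
    and always deregister afterwards, so credentials that may have been
    exposed to code running inside the job can never be reused."""
    token = register()
    try:
        job = take_job(token)
        return run(job)
    finally:
        deregister(token)  # runs regardless of outcome, invalidating the token
```

The `finally` block is the important part: even when the job fails or raises, the runner is still removed and its token invalidated.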


@cobak78 wrote in #241 (comment):

I'll keep myself updated on their progress.

If you're still interested to see how things are evolving, this is the PR in GARM for Gitea support in GARM: https://github.com/cloudbase/garm/pull/393.

The nice thing is that with slight adjustments, GARM can now spawn Gitea runners on any of the existing external providers (https://github.com/cloudbase/garm/?tab=readme-ov-file#installing-external-providers). Adding new providers is simple (the interface is really basic).

You can see it in action in this video (https://github.com/cloudbase/garm/issues/323#issuecomment-2881651926) of the first commit that worked. It's still a work in progress. I will try to add Forgejo as well once I finish with Gitea. I like the idea of having one simple interface where you can potentially manage runners for multiple forges.
