Proposal: Enhancing Forgejo Actions with Autoscaling Capabilities #241
Reference: forgejo/discussions#241
Hi everyone,
I wanted to share an idea we're exploring to improve the scalability and efficiency of Forgejo Actions runners using KEDA in a Kubernetes environment, and to gather your thoughts.
Motivation
Our team is growing, and the workloads we run on Forgejo Actions are increasing. To meet these demands, we need to increase parallelization when running Actions jobs without unnecessarily blocking or over-provisioning resources. Moving to an autoscaling model will help us achieve this, ensuring that resources are only allocated when there are tasks to process.
Here are the key points:
We're already working on a proof of concept that integrates all these points, and the first results are very promising.
Looking forward to hearing your thoughts and feedback.
There seems to be related work going on to allow for auto-scaling to happen in #5849. CC @wetneb in case you're interested.
Could you clarify what you mean by "metric"? I.e., is it a Prometheus metric, or an endpoint that returns the number of pending tasks?
Nice that you are also interested in this topic! I would very much welcome your thoughts on forgejo/forgejo#5849 indeed (thank you @Gusted).
It's hard for me to understand how you could avoid having any persistent daemon: something needs to poll Forgejo regularly to check if jobs are available and provision the workers, right?
The two comments are related. In this new idea, KEDA is responsible for gathering information about pending jobs. For that, we need the "metric" KEDA requires, which is just what the new endpoints provide: essentially a simple value indicating how many jobs are pending, similar to how the autoscalers for GitHub or Azure work. Here's the link (https://keda.sh/docs/2.16/scalers/) that explains how scalers work, and in this case there would be a specific one for Forgejo.
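To make the idea concrete, here is a minimal sketch of what such a setup could look like, using KEDA's generic `metrics-api` scaler to poll a pending-jobs count. The endpoint URL and the JSON field name are assumptions for illustration, not real Forgejo API paths:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: forgejo-runner-autoscaler
spec:
  scaleTargetRef:
    name: forgejo-runner          # Deployment running the runner pods
  minReplicaCount: 0              # scale to zero when nothing is pending
  maxReplicaCount: 10
  pollingInterval: 30             # seconds between metric polls
  triggers:
    - type: metrics-api
      metadata:
        # Hypothetical endpoint returning JSON such as {"count": 3}
        url: "https://forgejo.example.com/api/v1/actions/jobs/pending"
        valueLocation: "count"
        targetValue: "1"          # aim for one runner pod per pending job
```

A dedicated Forgejo scaler would replace the generic trigger with one that knows the real endpoint and authentication, but the shape of the `ScaledObject` stays the same.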
Thanks a lot, I just wanted to know if it impacts me, and since it's not a Prometheus metric, it doesn't.
Good luck with adding this then!
Hi, during the process of testing and experimenting with this proof of concept, I realized something: each execution in a pod with the runner is registered individually. This currently results in a list of all the pods that have been executed (one entry per pod).
What do you think would be more feasible: maintaining a shared volume between pods (using a PVC) to store the runner's config for reuse, which may create some parallelization problems, or creating an endpoint to deregister each runner after every execution?
I would imagine that once a runner is registered, it's fine to share the config files among multiple instances of that runner. The only concurrency you'd have after that is taking a job, but normally the existing API endpoint should be safe for that (two instances shouldn't get assigned the same task).
Ideally, I would say the "runner" should be the single process that takes care of the auto-scaling and dispatching the jobs to the different machines it spawns. The process that you'd run on your different machines would only talk to their supervisor. That does mean adding more complexity on the runner side, but would reduce the traffic on Forgejo's side (only one process is polling Forgejo for tasks).
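The shared-config option above can be sketched as a PVC mounted by every runner pod. All names, the image tag, and the mount path are hypothetical here; the one hard requirement is a `ReadWriteMany` access mode so several pods can mount the same volume:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: forgejo-runner-config
spec:
  accessModes: ["ReadWriteMany"]  # needed for concurrent mounts across pods
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: forgejo-runner
spec:
  replicas: 2
  selector:
    matchLabels: { app: forgejo-runner }
  template:
    metadata:
      labels: { app: forgejo-runner }
    spec:
      containers:
        - name: runner
          image: code.forgejo.org/forgejo/runner:latest  # example image
          volumeMounts:
            - name: runner-config
              mountPath: /data    # where the registration/config files live
      volumes:
        - name: runner-config
          persistentVolumeClaim:
            claimName: forgejo-runner-config
```

As noted above, the remaining concurrency concern is job assignment, which the existing API endpoint should already handle safely.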
Thanks for the answer, I will explore how to share the runner config.
The first approach was that every time a job is detected, a runner registers and takes that job. I'm leaning towards thinking that having registered autoscalers instead of registered runners would simplify label and runner configuration in this case.
Would it be sensible to have a runner that can run jobs on remote instances, rather than setting up and registering a runner on each deployed machine?
Basically, allow a runner to connect to a remote machine via SSH and launch a job there, receiving the logs back via the same connection. A docker/podman socket can be shared via SSH.
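A rough sketch of what sharing the Docker socket over SSH could look like, assuming OpenSSH 6.7+ (Unix-domain socket forwarding) and a Docker daemon on the remote build host; the user and host names are hypothetical:

```shell
# Forward the remote Docker socket to a local Unix socket over SSH.
ssh -nNT -L /tmp/remote-docker.sock:/var/run/docker.sock runner@build-host &

# Point the local Docker CLI at the forwarded socket; the job runs on the
# remote daemon and its logs stream back over the same SSH connection.
DOCKER_HOST=unix:///tmp/remote-docker.sock docker run --rm alpine echo ok
```

Recent Docker CLIs can also connect directly with `docker -H ssh://runner@build-host ...`, which avoids managing the forwarded socket by hand.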
After resolving some issues and moving the runner registration logic, I think we are ready to start with the PRs. However, I'm unsure which repository I should submit the issue and the PRs for the new endpoints to.
Should they go directly to Gitea, similar to how the registration token generation endpoints were added in version 7 (https://github.com/go-gitea/gitea/pull/27144) and then brought into Forgejo later?
Thank you!
That is great news.
Please take a look at https://forgejo.org/compare-to-gitea/#should-i-submit-all-my-pull-requests-to-forgejo-or-are-there-changes-youd-rather-prefer-see-submitted-to-gitea
https://codeberg.org/forgejo/forgejo/actions/runs/51788/jobs/8#jobstep-5-239 this error is related.
The Forgejo side of this is done and the runner side is close to being merged.
What remains to do is an addition to the documentation at https://codeberg.org/forgejo/docs. It can be terse because the Forgejo runner side of it is rather trivial.
What would be really useful is a showcase to explain how this can all fit together. Do you have plans to do that already?
@cobak78 gentle ping?
Sure, I will add the explanation to the documentation soon. Do you have any preference for where to publish the showcase?
Not really, as long as it is publicly accessible.
@cobak78 gentle ping?
Here is the PR with a new section on how to configure runners with keda autoscalers.
forgejo/docs#1073
I'm currently working on the KEDA PR to enable the new autoscaler, so it can't yet be tested with an official image.
Here is the PR if you want to track it down https://github.com/kedacore/keda/pull/6495
Also, I created a public Gist that I will keep updating with new configurations and use cases: https://gist.github.com/cobak78/22312ddafe2fd18a64aca16410b5b4f5
For the record, an insightful conversation about the relationship between Keda and Kedify forgejo/docs#1073 (comment)
Via forgejo/forgejo#7093 I got to reading https://github.com/go-gitea/gitea/issues/32461; it seems that if this feature were implemented, it would help with autoscaling?
@Gusted wrote in #241 (comment):
It seems they are trying to address the same scalability needs on runners, starting with ephemeral runner execution, as I did in this Forgejo runner PR: https://code.forgejo.org/forgejo/runner/pulls/423
I'll keep myself updated on their progress.
@cobak78 wrote in #241 (comment):
I believe there is a slight (but important) difference. The ephemeral runner is removed from gitea once it finishes running the job, mirroring how GitHub treats ephemeral runners. The assumption is that once a job is done running, the credentials for that runner have been exposed to the code running inside the job, and it is assumed that they cannot be reused.
It would be great if there was an `--ephemeral` flag when a runner is registered. After a job is run, regardless of outcome, the runner should be removed from Forgejo and the `token` invalidated. I am looking at both Gitea and Forgejo in an effort to maybe add support for both in GARM once I get some cycles.
@cobak78 wrote in #241 (comment):
If you're still interested in seeing how things are evolving, this is the PR for Gitea support in GARM: https://github.com/cloudbase/garm/pull/393.
The nice thing is that with slight adjustments, GARM can now spawn gitea runners on any of the existing external providers. Adding new providers is simple (the interface is really basic).
You can see it in action in this video of the first commit that worked. It's still a work in progress. I will try to add Forgejo as well once I finish with Gitea. I like the idea of having one simple interface where you can potentially manage runners for multiple forges.