Determining backend memory limit from container cgroup #588

Merged: lognaturel merged 1 commit into getodk:next from hotosm:build/configure-worker-count, Jan 25, 2024

Conversation

@spwoodcock (Contributor) commented Jan 16, 2024

Closes #577

* Original Comment *

Includes

  • Correctly export the WORKER_COUNT variable in start-odk.sh; otherwise the value is not respected (see screenshots).
    • This is because pm2.config.js reads process.env.WORKER_COUNT, but a plain assignment makes the variable available only to the bash shell, not to the pm2 child process; it must be exported for that.
  • Fixed the determination of the available memory:
    • Currently the vmstat command gets the total available memory on the host system.
    • To get the available memory from either the host machine OR a restricted container in docker compose / Kubernetes, we need to inspect the cgroup.
  • Added an optional environment variable WORKER_COUNT for the backend service.
    • If this value is provided, it overrides the automatically generated value for WORKER_COUNT.
    • Added the empty variable to .env.template so users know it's an option.
    • The rationale for adding this is twofold:
      • During development it's good to be able to specify 1 worker, even on a machine with large memory.
      • If running in Kubernetes, it is generally not good practice to spawn multiple workers per pod. Ideally 1 process/worker per pod allows for easier load management, autoscaling, etc.
  • Minor pedantic bash syntax fix: moved a then to the same line as if (in line with shellcheck and bashate linter recommendations).
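The export issue in the first bullet can be reproduced outside Central with a minimal shell sketch (illustrative only, not the actual start-odk.sh code): a plain assignment is visible in the current shell, but only an exported variable reaches child processes such as the pm2 command, and therefore process.env.

```shell
#!/bin/sh
# Minimal illustration of the export fix (not the actual start-odk.sh code).

WORKER_COUNT=4
# Plain assignment: the child shell below stands in for the pm2 command
# and does not see the variable in its environment.
sh -c 'echo "child sees: ${WORKER_COUNT:-unset}"'

# After export, child processes (and hence process.env in pm2.config.js)
# can read the value.
export WORKER_COUNT
sh -c 'echo "child sees: ${WORKER_COUNT:-unset}"'
```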

Screenshots

Before: 4 workers determined, but only one runs

Log:
four_pm2_workers_specified

System info:
one_pm2_process

After: 4 workers determined, 4 workers run

Log:
four_pm2_workers_correct

System info:
four_pm2_processes

What has been done to verify that this works as intended?

  • Tested inside Docker containers running on a Debian Bookworm host.
  • This has not been tested on Kubernetes, but I see no reason why the underlying approach would not work.
  • Also not tested on Docker Desktop or similar, but as those run containers inside a Linux VM, this should also work fine.

Why is this the best possible solution? Were any other approaches considered?

  • This is the most straightforward solution to determine the memory restrictions from a cgroup.
  • The official nginx image has a best practice approach that is quite convoluted, as it also involves determining other metrics.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

  • This will not affect users running on systems with ~1 GB of memory.
  • However, as it fixes the worker allocation, it will affect users with >1 GB of memory.
  • Users may find that ODK Central takes more memory to run than it did previously, due to correctly spawning 4 processes.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

  • It probably does require a docs update. I can make a PR.
  • It would be useful to document that the WORKER_COUNT environment variable is available.

Before submitting this PR, please make sure you have:

  • branched off and targeted the next branch OR only changed documentation/infrastructure (master is stable and used in production)
  • verified that any code or assets from external sources are properly credited in comments or that everything is internally sourced

* Updated Comment *

  • Decided not to go ahead with WORKER_COUNT env var configuration, as a better approach is running one process per container.
  • Container replicas can be configured via docker compose instead.
  • This PR includes the total memory check via container cgroup, which should work in containers run both via Docker and Kubernetes.
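The cgroup check described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the merged start-odk.sh diff: the two file paths are the standard cgroup v2 and v1 mount points, while the `workers_for_bytes` helper and its one-worker-per-512-MiB heuristic are hypothetical.

```shell
#!/bin/sh
# Sketch of a cgroup-aware memory check (illustrative, not the merged diff).
# Assumed heuristic: one worker per 512 MiB of available memory, minimum 1.

get_memory_limit_bytes() {
  if [ -r /sys/fs/cgroup/memory.max ]; then
    # cgroup v2: a byte count, or the literal string "max" when unlimited
    limit=$(cat /sys/fs/cgroup/memory.max)
  elif [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
    # cgroup v1: a huge sentinel value (~2^63) means unlimited
    limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
  else
    limit=max
  fi
  case "$limit" in
    max|9223372036854771712)
      # No container limit set: fall back to host memory (MemTotal is in kB)
      limit=$(( $(awk '/^MemTotal/ {print $2}' /proc/meminfo) * 1024 ))
      ;;
  esac
  echo "$limit"
}

workers_for_bytes() {
  workers=$(( $1 / (512 * 1024 * 1024) ))
  [ "$workers" -ge 1 ] || workers=1
  echo "$workers"
}

WORKER_COUNT=$(workers_for_bytes "$(get_memory_limit_bytes)")
export WORKER_COUNT   # exported so child processes (e.g. pm2) can read it
```

Inside a memory-limited container this reads the container's limit rather than the host's total, which is the behavioral difference from the previous vmstat-based check.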

@spwoodcock spwoodcock changed the base branch from master to next January 16, 2024 20:22
@spwoodcock (Contributor, Author)

Oops, branched off an old next commit. Will update 👍

@spwoodcock spwoodcock force-pushed the build/configure-worker-count branch from 042e9de to feefa39 Compare January 16, 2024 20:31
@lognaturel (Member)

Thanks so much for fixing the memory limit check!

I'm surprised to see the issue related to pm2 workers. I see 4 node processes on the 2 GB servers I checked. Was everything else stock when you identified that problem?

@spwoodcock (Contributor, Author) commented Jan 17, 2024

Hmm if that's the case, then further debugging is required!

I wonder how pm2.config.js would be able to determine 4 workers from process.env, as setting the variable without an export means it is not available in process.env.

I tested this as part of my dev setup for working on Central, using next branch for this repo and master branches for backend and frontend.

I use the Dockerfiles and compose file from this repo, with the certs set to upstream in the dotenv.

@spwoodcock (Contributor, Author)

Also, updated my original comment to explain the rationale for adding the WORKER_COUNT as a configurable environment variable. It's especially important if running in Kubernetes.

@spwoodcock (Contributor, Author) commented Jan 24, 2024

I just tested this on a temp cloud machine:

(screenshot)

The 4 workers started up. Now I'm puzzled why they don't start on my WSL Debian instance.

Either way, this functionality seems to work fine currently, so there is no concern that this will cause users to go from 1 --> 4 workers on their deployment.

Adding export to WORKER_COUNT fixed the issue on my WSL Debian machine and shouldn't have any other implications, so should I leave it in this PR, or remove it?

@lognaturel (Member)

That's very strange!

I'd prefer to split out the new env var, please:

  • I don't think WORKER_COUNT is the right name at that level. Maybe BACKEND_REPLICAS?
  • I'm not sure we want to continue running multiple processes in a container (see Simplify deployments/upgrades #347 (comment)). Once we expose an env variable, we have to keep supporting it, so I think we should first resolve that question.

Once that's split off, the mem calculation change is ready to merge.
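The one-process-per-container direction discussed here is typically expressed in docker compose with service replicas rather than a per-container worker count. A hedged sketch (the service name and replica count are illustrative, not taken from Central's actual compose file; recent Compose versions honor deploy.replicas outside of Swarm mode):

```yaml
# Illustrative compose fragment: scale by container instead of by pm2 worker.
services:
  service:          # hypothetical backend service name
    deploy:
      replicas: 4   # one node process per container replica
```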

@spwoodcock spwoodcock force-pushed the build/configure-worker-count branch from feefa39 to b786c92 Compare January 25, 2024 15:59
@spwoodcock spwoodcock changed the title Fix WORKER_COUNT configuration for pm2 server Determining backend memory limit from container cgroup Jan 25, 2024
@spwoodcock (Contributor, Author) commented Jan 25, 2024

Completely agree - PID 1 ideally shouldn't fork processes.

Done 👍

@lognaturel (Member)

Thank you! Good discussion all around. I'm glad you're liking the ideas at #347 (comment) 👍

@lognaturel lognaturel merged commit 7476a91 into getodk:next Jan 25, 2024
@spwoodcock spwoodcock deleted the build/configure-worker-count branch January 25, 2024 18:21
Successfully merging this pull request may close: Central sets WORKER_COUNT based on host, not container resources