Skip to content

Conversation

@csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Jun 20, 2022

ASAN and TSAN builds of the tests are consuming upwards of 15 and 20 GB of memory each. This is bricking the Github-hosted runners.

In this PR, we modify the flags we use in these runs to reduce memory consumption. These tweaks also allow us to re-introduce TSAN. We also re-introduce some changes that fix some leaks.

Also for future reference, here are the TSAN runtime flags and the ASAN runtime flags.

@csegarragonz csegarragonz changed the title Move tests to self-hosted GH runner Move sanitised tests to self-hosted GH runner Jun 21, 2022
services:
redis:
image: faasm/redis:0.8.12
ports:
Copy link
Collaborator Author

@csegarragonz csegarragonz Jun 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GHA service containers and the job container are connected with a docker-specific network, so there is no need to publish ports:

You don't need to configure any ports for service containers. By default, all containers that are part of the same Docker network expose all ports to each other, and no ports are exposed outside of the Docker network.

In addition, this was creating problems when running jobs in parallel with two self-hosted runners in the same machine.

@csegarragonz csegarragonz self-assigned this Jun 21, 2022
@csegarragonz csegarragonz added the ci Continuous integration label Jun 21, 2022
@csegarragonz csegarragonz changed the title Move sanitised tests to self-hosted GH runner Remove TSAN and ASAN runs Jun 21, 2022
@csegarragonz
Copy link
Collaborator Author

Pinging @eigenraven to pick their brains. Is there something we are completely overlooking here?

@eigenraven
Copy link
Collaborator

Hi! For TSAN, have you tried reducing history size? The memory used grows exponentially with the size parameter in the options. As for ASAN, I'm not sure why the sudden jump - did you change anything with memory management that could've caused more allocations? It usually doesn't have that high of an overhead, it mostly consumes virtual memory which is almost free.

@csegarragonz csegarragonz force-pushed the self-hosted-runner branch 5 times, most recently from 900a60c to 4e8286d Compare June 22, 2022 08:29
@csegarragonz
Copy link
Collaborator Author

@eigenraven thanks for the tips! For ASAN, as you say, I think there might be another problem. For TSAN, changing the history_size to 0 does not help. I am a bit puzzled by that, as I was expecting a big reduce in memory consumption. However, it seems that TSAN's runtime flags need to be provided as a space-spearated (and not colon-separated) list. So we may have been using the default value of 2 all this time.

@eigenraven
Copy link
Collaborator

@csegarragonz looking at the current failed run, there seems to be 1) a lot of useless tls access logging, 2) a lot of warnings about unreclaimed memory https://github.com/faasm/faasm/runs/7000258660?check_suite_focus=true#step%3A23%3A5773= - this looks like faasm logs rather than tsan logs

@eigenraven
Copy link
Collaborator

Looks like this check is failing repeatedly https://github.com/faasm/faasm/blob/main/src/wasm/WasmModule.cpp#L826-L833=. I remember having to modify the brk mechanism locally because it didn't work properly when the wasm module allocated memory itself via memory.grow, and led to a lot of memory leaks if that function was called, maybe that's also what's happening now?

@csegarragonz csegarragonz changed the title Remove TSAN and ASAN runs Re-introduce TSAN and configure ASAN Sep 30, 2022
Co-authored-by: Raven Szewczyk <git@kubasz.xyz>
@csegarragonz
Copy link
Collaborator Author

csegarragonz commented Oct 3, 2022

@eigenraven thanks for pointing out the typos. FYI, the flush_memory_ms really made the difference for TSAN. Is there anything else that u think needs amending?

Copy link
Collaborator

@eigenraven eigenraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, out of curiosity what are the network namespaces data races? All functions there seem to be properly guarded by mutexes so it's a bit unexpected

@csegarragonz
Copy link
Collaborator Author

Good question. I removed it and the tests pass as well. I can't remember why I included it in the first place...

@csegarragonz csegarragonz merged commit bdfda0e into main Oct 3, 2022
@csegarragonz csegarragonz deleted the self-hosted-runner branch October 3, 2022 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci Continuous integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants