Description
The current logic in GetCGroupMemoryUsage for calculating container memory usage from the GC's perspective may produce a significantly different value than popular container tools such as Docker and Kubernetes report (for example, for the test application below it is about 30% lower than expected).
This can result in containers unnecessarily getting OOM-killed because the GC cannot detect the high memory pressure.
Both Docker and Kubernetes use a different method: they take the total memory usage from memory.usage_in_bytes and subtract the total_inactive_file value from memory.stat.
Would it be possible to update the .NET implementation to use the same method?
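For illustration, a minimal sketch of that calculation on cgroup v1 (assuming the memory controller is mounted at the default /sys/fs/cgroup/memory path; this is not the runtime's actual GetCGroupMemoryUsage code):
using System;
using System.IO;
using System.Linq;
// Sketch: compute the container "working set" the way docker stats and the kubelet do on cgroup v1,
// i.e. memory.usage_in_bytes minus the total_inactive_file counter from memory.stat.
const string cgroupMemoryRoot = "/sys/fs/cgroup/memory"; // assumed default cgroup v1 mount point
long usageInBytes = long.Parse(
    File.ReadAllText(Path.Combine(cgroupMemoryRoot, "memory.usage_in_bytes")).Trim());
long totalInactiveFile = File.ReadLines(Path.Combine(cgroupMemoryRoot, "memory.stat"))
    .Select(line => line.Split(' '))
    .Where(parts => parts.Length == 2 && parts[0] == "total_inactive_file")
    .Select(parts => long.Parse(parts[1]))
    .FirstOrDefault();
long workingSetBytes = Math.Max(0L, usageInBytes - totalInactiveFile);
Console.WriteLine($"usage_in_bytes={usageInBytes}, total_inactive_file={totalInactiveFile}, workingSet={workingSetBytes}");
(On cgroup v2 the equivalent would be memory.current minus the inactive_file value from memory.stat.)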
Reproduction Steps
- Run dotnet new console -n MemoryLoadTest using the latest .NET SDK (6.0.101 at the moment) and add the following code to Program.cs:
var lockObj = new object();
var rnd = new Random();
var cache = new Dictionary<int, int[]>();
Console.WriteLine("Seconds\tMemoryLoadBytes\t\tcache.Count");
var runTimeInSeconds = int.Parse(Environment.GetEnvironmentVariable("RUN_TIME_IN_SECONDS")!);
// Start 1000 threads. Each thread adds a cache item every 5-10s
for (int i = 0; i < 1000; i++)
{
new Thread(() =>
{
while (true)
{
int sleepTime;
lock (lockObj)
{
sleepTime = rnd.Next(5000, 10000);
var cacheItem = new int[10 * 1024];
cache[cacheItem.GetHashCode()] = cacheItem;
}
Thread.Sleep(sleepTime);
}
})
{ IsBackground = true }.Start();
}
// Remove random item from cache every 10ms
for (int i = 1; i <= runTimeInSeconds * 100; i++)
{
lock (lockObj)
{
if (cache.Count > 0)
{
cache.Remove(cache.Keys.ElementAt(rnd.Next(cache.Count)));
}
if (i % 100 == 0)
{
Console.WriteLine($"{i / 100}\t{Format(GC.GetGCMemoryInfo().MemoryLoadBytes)}\t{cache.Count}");
}
}
Thread.Sleep(10);
}
Console.WriteLine("Finished successfully!");
string Format(long bytes) => $"{bytes} ({Math.Round((double)bytes / 1024 / 1024, 2)}MiB)";
- Create the following Dockerfile in the MemoryLoadTest folder:
FROM mcr.microsoft.com/dotnet/runtime:6.0 AS base
WORKDIR /app
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["MemoryLoadTest.csproj", "MemoryLoadTest/"]
RUN dotnet restore "MemoryLoadTest/MemoryLoadTest.csproj"
COPY . "MemoryLoadTest/"
WORKDIR "/src/MemoryLoadTest"
RUN dotnet build "MemoryLoadTest.csproj" -c Release -o /app/build
RUN dotnet tool install --tool-path /tools dotnet-trace
FROM build AS publish
RUN dotnet publish "MemoryLoadTest.csproj" -c Release -o /app/publish --os linux --self-contained true
FROM base AS final
WORKDIR /tools
COPY --from=build /tools .
WORKDIR /app
COPY --from=publish /app/publish .
ENV PATH="${PATH}:/tools"
# copy coreclr build from CORECLR_BUILD_PATH and start dotnet-trace, which will create gc.nettrace in [runtime repository root]/artifacts
ENTRYPOINT ["/bin/sh", "-c" , "if [ \"$CORECLR_BUILD_PATH\" != \"\" ] ; then echo \"copying coreclr from $CORECLR_BUILD_PATH ...\" && cp -r \"$CORECLR_BUILD_PATH/.\" /app/ && echo 'finished copying!' ; fi && dotnet-trace collect -o /runtime/artifacts/gc.nettrace --profile gc-collect --show-child-io -- dotnet MemoryLoadTest.dll"]-
From
MemoryLoadTestrundocker build -t memoryloadtest -f Dockerfile . -
(Optional) In a separate window run
docker statsto start monitoring containers. -
- Check out the latest main branch of the runtime repository.
- Build the runtime repository (replace [runtime repository root] with the local runtime repository path):
docker run --rm -v [runtime repository root]:/runtime -w /runtime mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-a50a721-20191120200116 ./build.sh -subset clr -configuration release -clang9
- Run the application (replace [runtime repository root] with the local runtime repository path):
docker run -m 165m --memory-swap 165m --name MemoryLoadTest --rm --env RUN_TIME_IN_SECONDS=20 -v [runtime repository root]:/runtime --env CORECLR_BUILD_PATH=/runtime/artifacts/bin/coreclr/Linux.x64.Release -it memoryloadtest
Observe that the application reaches the memory limit after about 13 seconds and gets OOM-killed. Also, MemoryLoadBytes is significantly lower than the container memory usage reported by docker stats:
Seconds MemoryLoadBytes cache.Count
1 81317068 (77.55MiB) 900
2 81317068 (77.55MiB) 800
3 81317068 (77.55MiB) 700
4 81317068 (77.55MiB) 600
5 81317068 (77.55MiB) 581
6 91697971 (87.45MiB) 685
7 102078873 (97.35MiB) 818
8 102078873 (97.35MiB) 912
9 112459776 (107.25MiB) 1000
10 112459776 (107.25MiB) 1001
11 122840678 (117.15MiB) 940
12 122840678 (117.15MiB) 933
13 122840678 (117.15MiB) 958
Trace completed.
Process exited with code '137'.
- Finally, check out the branch from PR align GC memory load calculation on Linux with Docker and Kubernetes #64128 and repeat steps 6-7. The application finishes successfully and MemoryLoadBytes is accurate:
Seconds MemoryLoadBytes cache.Count
1 115920076 (110.55MiB) 900
2 115920076 (110.55MiB) 800
3 115920076 (110.55MiB) 700
4 115920076 (110.55MiB) 600
5 115920076 (110.55MiB) 571
6 126300979 (120.45MiB) 680
7 126300979 (120.45MiB) 790
8 134951731 (128.7MiB) 905
9 145332633 (138.6MiB) 1003
10 145332633 (138.6MiB) 1008
11 164364288 (156.75MiB) 956
12 164364288 (156.75MiB) 950
13 155713536 (148.5MiB) 974
14 155713536 (148.5MiB) 1045
15 150523084 (143.55MiB) 1134
16 152253235 (145.2MiB) 1192
17 152253235 (145.2MiB) 1232
18 164364288 (156.75MiB) 1255
19 164364288 (156.75MiB) 1258
20 164364288 (156.75MiB) 1270
Finished successfully!
Trace completed.
Process exited with code '0'.
From the output and the gc.nettrace file generated in [runtime repository root]/artifacts for the second run, we can confirm that there was a full blocking collection due to low memory, which allowed the application to keep running without hitting the memory limit.
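To double-check the trace programmatically, here is a rough sketch using the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package (just one possible approach, not part of the repro itself) that lists each GC start event with its trigger reason:
using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Parsers.Clr;
// Sketch: read the EventPipe trace written by dotnet-trace and print every GC/Start event,
// so a blocking collection triggered by low memory can be spotted via its Reason field.
using var source = new EventPipeEventSource("gc.nettrace");
var clr = new ClrTraceEventParser(source);
clr.GCStart += (GCStartTraceData gc) =>
    Console.WriteLine($"GC #{gc.Count}: gen{gc.Depth}, type={gc.Type}, reason={gc.Reason}");
source.Process();
Opening the same file in PerfView (GCStats view) shows equivalent information.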
Expected behavior
The application finishes successfully and doesn't get OOM-killed.
Actual behavior
The application gets OOM-killed.
Regression?
Not a regression.
Known Workarounds
Playing with GCHighMemPercent or similar settings can provide a short-term workaround in some cases.
Configuration
.NET 5/6 running in a Linux container on Docker or Kubernetes.
Other information
No response