Bug #74156
CephFS in kernel client appears to be "leaking" folios
Description
Hello,
I am running a server that has a heavy read/write workload to a cephfs
file system. It is a VM.
Over time it appears that the non-cache usage of kernel dynamic memory
increases. The kernel seems to think the pages are reclaimable however
nothing appears to trigger the reclaim. This leads to workloads getting
killed via oomkiller.
smem -wp output:
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 88.21% 36.25% 51.96%
userspace memory 9.49% 0.15% 9.34%
free memory 2.30% 2.30% 0.00%
free -h output:
total used free shared buff/cache available
Mem: 31Gi 3.6Gi 500Mi 4.0Mi 11Gi 27Gi
Swap: 4.0Gi 179Mi 3.8Gi
Unmounting the file system has no effect on the used kernel dynamic memory.
Nor does dropping caches.
I have enabled allocation tracking and got the following while the issue was happening.
- smem -pw
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 80.46% 65.80% 14.66%
userspace memory 0.35% 0.16% 0.19%
free memory 19.19% 19.19% 0.00%
- sort -g /proc/allocinfo|tail|numfmt --to=iec
22M 5609 mm/memory.c:1190 func:folio_prealloc
23M 1932 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
24M 24135 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
27M 6693 mm/memory.c:1192 func:folio_prealloc
58M 14784 mm/page_ext.c:271 func:alloc_page_ext
258M 129 mm/khugepaged.c:1069 func:alloc_charge_folio
430M 770788 lib/xarray.c:378 func:xas_alloc
545M 36444 mm/slub.c:3059 func:alloc_slab_page
9.8G 2563617 mm/readahead.c:189 func:ractl_alloc_folio
20G 5164004 mm/filemap.c:2012 func:__filemap_get_folio
So I stopped the workload and dropped caches to confirm.
- echo 3 > /proc/sys/vm/drop_caches
- smem -pw
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 33.45% 0.09% 33.36%
userspace memory 0.36% 0.16% 0.19%
free memory 66.20% 66.20% 0.00%
- sort -g /proc/allocinfo|tail|numfmt --to=iec
12M 2987 mm/execmem.c:41 func:execmem_vmalloc
12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
13M 751 mm/slub.c:3061 func:alloc_slab_page
16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
18M 4355 mm/memory.c:1190 func:folio_prealloc
24M 6119 mm/memory.c:1192 func:folio_prealloc
58M 14784 mm/page_ext.c:271 func:alloc_page_ext
61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
79M 6726 mm/slub.c:3059 func:alloc_slab_page
11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
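Since allocation profiling reports raw byte counts, the per-call-site totals before and after a drop_caches cycle can be compared mechanically. A minimal sketch; `allocinfo_bytes` is a hypothetical helper that assumes the raw `<bytes> <calls> <file:line> func:<name>` layout of /proc/allocinfo (before numfmt is applied):

```shell
#!/bin/sh
# Sum the bytes attributed to call sites matching a pattern in a saved
# /proc/allocinfo snapshot. Hypothetical helper; assumes the raw
# "<bytes> <calls> <file:line> func:<name>" layout (no numfmt applied).
allocinfo_bytes() {
    awk -v site="$2" '$0 ~ site { sum += $1 } END { print sum + 0 }' "$1"
}

# Demo on a captured sample:
cat > /tmp/allocinfo.sample <<'EOF'
450560 110 mm/readahead.c:189 func:ractl_alloc_folio
1048576 256 mm/filemap.c:2012 func:__filemap_get_folio
2097152 512 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
EOF
allocinfo_bytes /tmp/allocinfo.sample 'netfs_write_begin'   # prints 2097152
```

Snapshotting /proc/allocinfo before and after `echo 3 > /proc/sys/vm/drop_caches` and diffing the sums per site shows which callers actually shrink.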
Reverting to the previous LTS (6.12) fixes the issue. After 24hrs of operation:
smem -wp output:
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 80.22% 79.32% 0.90%
userspace memory 10.48% 0.20% 10.28%
free memory 9.30% 9.30% 0.00%
I have tested 6.18 and 6.17, and am in the process of testing the 6.16 kernel; it appears to be affected also.
The reproducer is simple.
I have one VM. 32GB of ram and 16 cores. It has a cephfs filesystem mounted.
I have two rsync copies (rsync -a --progress ./source ./dest/) with the source and destination being different for both copies (four different folders), but all being on the same filesystem.
(I am moving two 5TB data sets from EC pools onto replicated pools as I am currently affected by #70390.)
But I have also replicated this with other large write-only workloads (downloading datasets from online sources, and unpacking large datasets out of archives). This was before I discovered the issue with squid-created OSDs.
The leak appears to be quite slow. I usually find I can confirm the issue is present after 6-9hrs of continuous data migration (it's running at an average of around 120MB/s)
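To catch growth this slow without babysitting the box, periodic snapshots can be logged and graphed afterwards. A minimal sketch; the log path and the selection of meminfo fields are arbitrary choices, and the allocinfo part only applies when CONFIG_MEM_ALLOC_PROFILING is enabled:

```shell
#!/bin/sh
# Append one timestamped memory snapshot to a log; run it from cron or
# a loop so the slow noncache growth can be plotted over hours.
LOG=${1:-/tmp/memwatch.log}
{
    date '+--- %F %T'
    grep -E 'MemFree|MemAvailable|SReclaimable|SUnreclaim' /proc/meminfo
    # Top allocation sites, when allocation profiling is available:
    if [ -r /proc/allocinfo ]; then
        sort -g /proc/allocinfo | tail -5
    fi
} >> "$LOG"
```

Running it every few minutes for a few hours makes the SUnreclaim/noncache trend obvious without waiting for the OOM.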
I originally emailed the kernel mailing list:
https://lkml.org/lkml/2025/11/10/309
And was referred here after being referred to the memory allocation tracking and getting a result there.
The ceph cluster has been upgraded multiple times during my attempts to find the issue. It started at 19.2.1 and was upgraded through 19.2.2, 19.2.3, and 20.2.1.
But I believe the issue is in the kernel client, so the cluster might not be important.
Updated by Malcolm Haak 3 months ago
I have a new faster reproducer.
I realized it leaks an amount per file. The initial workload I encountered this issue with was downloading datasets, which are lots of ~50MB files, in parallel.
I created a small VM. 2GB of ram, 16 cores.
I have two bash files.
The first has a loop that creates 32 50MB files with dd in parallel and waits for all files to finish.
The second calls the first script hundreds of times.
This crashes a vm in about 5 mins.
Updated by Malcolm Haak 3 months ago
repro_run.sh
#!/bin/bash
mkdir -p /mnt/ceph/repro
for i in $(seq 1 100);
do
./ddloop.sh $i
done
ddloop.sh
#!/bin/bash
for i in $(seq 1 32);
do
dd if=/dev/zero of=/mnt/ceph/repro/$i.$1 bs=1M count=60 &
done
wait
echo $1 complete
This is the reproducer. It assumes cephfs is mounted at /mnt/ceph
It does a decent job of replicating the workload I was running.
Thanks
Updated by Viacheslav Dubeyko 3 months ago
- Status changed from New to In Progress
I cannot reproduce the issue. The script has already been running for several hours and I don't see any memory leaks in the system. It looks like one important piece of the puzzle is missing. Which mount options do you have on your side? How have you mounted your CephFS instance?
Updated by Malcolm Haak 3 months ago
Mount options from /etc/fstab:
192.168.0.244:/ /mnt/ceph ceph rw,relatime,_netdev
Resulting mount line:
192.168.0.244:/ on /mnt/ceph type ceph (rw,relatime,secret=<hidden>,fsid=969a4eab-2826-4766-87e1-ecb18a7b5a13,acl,_netdev)
Kernels tested:
All Arch linux kernels from 6.12 - 6.18 as well as mainline kernels from 6.14 - 6.19-rc1
Other cluster details:
4 servers in the cluster
47 OSDs split somewhat evenly between the 4 hosts
10GbE on all hosts.
Ceph 20.2.0 currently used on all servers.
3 Nodes used as MON. 4 running MGR. 3 running MDS, only 1 active MDS.
7 pools. Mix of EC and replicated. Issue happens regardless of pool type.
Auto-scale enabled
Auto-balance also enabled.
Cluster was freshly created on 19.2.x with bluestore_elastic_shared_blobs = false
Just trying to get out ahead of any other questions you might have.
Updated by Malcolm Haak 3 months ago
Also just checking:
One of the files used when reproducing:
getfattr -n ceph.file.layout 145.12
# file: 145.12
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"
cephfs_data is a replicated pool.
How else can we instrument the kernel to figure out what's going on? It happens without fail. I have no NVMe/flash, so these are all slow disks.
I have set up a test VM with kdump (not a full memory dump, but I can do that) and enabled panic_on_oom. I'm happy to upload the kernel dumps to my Nextcloud for you to access.
If it helps I can build a test VM with any distro/kernel you would like and configure it the same.
Updated by Malcolm Haak 3 months ago
Just in case it's important:
VMs are running on Proxmox 9.0.11.
VMs have OVMF BIOS and the x86-64-v2-AES CPU type. VirtIO is used for network and 'local' disk.
Updated by Malcolm Haak 3 months ago
The crash after 5 minutes on the 2GB VM was not directly due to the bug. My apologies.
I forgot to update the ticket. The most recent run (which just finished) on a 2GB VM took 6hrs. It seems the amount of ram doesn't have a large impact, as once it starts it snowballs quickly. But getting it to start seems to take some time.
Also, I replicated the issue on a physical machine. It took 9hrs, but it has 32GB of ram and was only connected to the cluster via 1GbE.
I'm currently running the reproducer again, with kdump enabled. I will make the dump available once it crashes, in (I assume) 6-7hrs.
Updated by Malcolm Haak 3 months ago
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 87.45% 34.89% 52.56%
userspace memory 7.15% 1.08% 6.08%
free memory 5.40% 5.40% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.2G 491.2M 760.7M
userspace memory 103.1M 15.5M 87.5M
free memory 84.0M 84.0M 0
total used free shared buff/cache available
Mem: 1.4Gi 429Mi 111Mi 3.9Mi 483Mi 1.0Gi
Swap: 718Mi 22Mi 696Mi
#sort -g /proc/allocinfo|tail|numfmt --to=iec
8.4M 2660 kernel/fork.c:311 func:alloc_thread_stack_node
8.9M 9033 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
9.1M 573 mm/slub.c:3061 func:alloc_slab_page
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
16M 3937 mm/memory.c:1192 func:folio_prealloc
22M 38876 lib/xarray.c:378 func:xas_alloc
35M 8775 mm/readahead.c:189 func:ractl_alloc_folio
61M 15420 mm/memory.c:1190 func:folio_prealloc
108M 8333 mm/slub.c:3059 func:alloc_slab_page
970M 248277 mm/filemap.c:2012 func:__filemap_get_folio
My ceph cluster is busy doing quite a bit of remapping due to OSDs being re-created. It has slowed down the reproducer considerably.
This is after 12hrs of running. I'm going to wait for it to oom and collect the crash dump, as by that time most of the ram should be claimed by folios.
As you can see, a considerable amount of memory is being consumed by the noncache part. That value was around 100-120MB 11hrs ago. Swapping has started, so large amounts of it are already failing to reclaim.
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 86.68% 32.35% 54.33%
userspace memory 7.30% 1.25% 6.05%
free memory 6.02% 6.02% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.2G 461.3M 782.3M
userspace memory 105.2M 18.0M 87.2M
free memory 90.1M 90.1M 0
total used free shared buff/cache available
Mem: 1.4Gi 434Mi 95Mi 3.9Mi 481Mi 1.0Gi
Swap: 718Mi 23Mi 695Mi
Updated by Malcolm Haak 3 months ago
Sorry I prematurely hit send.
The second output was collected immediately after issuing a
sync;echo 3 >/proc/sys/vm/drop_caches
In the past, this is when I would have attempted to get some of that memory back by unmounting the filesystem, as my monitoring would be going nuts. As I mentioned above, this has no effect.
I've also, to try to figure out where the memory was being used, gone on an rmmod rampage; unloading the ceph/cephfs/netfs modules has no effect on the memory usage.
Anyway, I'll leave it running overnight and hopefully wake up to a 2GB crash dump.
Updated by Viacheslav Dubeyko 3 months ago
Malcolm Haak wrote in #note-6:
Also just checking:
One of the files used when reproducing:
getfattr -n ceph.file.layout 145.12
- file: 145.12
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"
cephfs_data is a replicated pool.
How else can we instrument the kernel to figure out what's going on as it happens without fail? I have no NVME/Flash so all slow disks.
I have setup at test VM with kdump. (not full memory dump. But I can do that) and enabled crash_on_oom. I'm happy to upload the kernel dumps to my nextcloud for you to access.
If it helps I can build a test VM with any distro/kernel you would like and configure it the same.
A ready-made VM with the correct environment for reproducing the issue will help a lot. Thanks in advance.
Updated by Viacheslav Dubeyko 3 months ago · Edited
#sort -g /proc/allocinfo|tail|numfmt --to=iec
8.4M 2660 kernel/fork.c:311 func:alloc_thread_stack_node
8.9M 9033 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
9.1M 573 mm/slub.c:3061 func:alloc_slab_page
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
16M 3937 mm/memory.c:1192 func:folio_prealloc
22M 38876 lib/xarray.c:378 func:xas_alloc
35M 8775 mm/readahead.c:189 func:ractl_alloc_folio
61M 15420 mm/memory.c:1190 func:folio_prealloc
108M 8333 mm/slub.c:3059 func:alloc_slab_page
970M 248277 mm/filemap.c:2012 func:__filemap_get_folio <-- This point looks interesting!!!
We have __filemap_get_folio() in the fill_readdir_cache() method [1]. Potentially, it could be the source of the issue. But, currently, I don't see how the issue could happen, because ceph_readdir_cache_release() [2] includes folio_release_kmap(), which includes folio_put() [3]. Potentially, somehow, the folio's reference counter can be increased unreasonably. But I cannot see right now how that could happen, and maybe my hypothesis is wrong here. Let me sleep on this and dive deeper into the code.
[1] https://elixir.bootlin.com/linux/v6.18/source/fs/ceph/inode.c#L1940
[2] https://elixir.bootlin.com/linux/v6.18/source/fs/ceph/inode.c#L1915
[3] https://elixir.bootlin.com/linux/v6.18/source/include/linux/highmem.h#L682
Updated by Malcolm Haak 3 months ago
Ok, I have a VM running my full kernel/client setup. I can make the drive for the VM available.
It did crash early this morning and I have the crash dump; however, I realized you'll need my kernel and debug symbols. Also, makedumpfile was called with -d 31, not -d 2. That's my fault, I should have checked the defaults on the dump.
I can make a VM available with a new dump file; I'll re-run everything and get a full dump of the memory. Did you want ssh access, or should I pack the whole thing up, VM with crash and all, and make it available?
I'd prefer not to post the URL for the download in the ticket, and I probably can't upload a several-GB file to the bug tracker, so how would you like me to get it to you?
The VM is expecting kvm, but otherwise nothing special. I'll reset all the passwords to something simple like ceph.
Otherwise, send me a pubkey and I can get that added and give you details of how to ssh in.
Updated by Viacheslav Dubeyko 3 months ago
Malcolm Haak wrote in #note-13:
Ok I have a VM running my full kernel/client setup. I can make the drive for the vm available?
It did crash early this morning and I have the crash dump however I realized you'll need my kernel and debug symbols. Also makedumpfile was called with -d 31 not -d 2. That's my fault I should have checked the defaults on the dump.
I can make a VM available with a new dump file, I'll re-run everything and get a full dump of the memory. Did you want ssh access or I can pack the whole thing up, vm with crash and all, and make it available.?
I'd prefer not to post the url for the download in the ticket. And I probably can't upload a several GB file to the bug tracker so how would you like me to get it to you?
The VM is expecting kvm, but otherwise nothing special. I'll reset all the passwords to something simple like ceph.
Otherwise, send me a pubkey and I can get that added and give you details of how to ssh in.
Let me spend some time on reproducing the issue on my side. I have some ideas for how to investigate the issue on my own. If that doesn't work, then I will ask you to provide some artifacts. Thanks.
Updated by Malcolm Haak 3 months ago
- File vmcore-dmesg.log vmcore-dmesg.log added
- File patch.patch patch.patch added
Oh also I forgot. I did a run with the patch suggested by David on the kernel mailing list.
It added tracing to __filemap_get_folio
[64793.828030] [ T382379] Memory allocations (profiling is currently turned on):
[64793.828047] [ T382379] 1.18 GiB 308269 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
[64793.828060] [ T382379] 76.2 MiB 6697 mm/slub.c:3059 func:alloc_slab_page
[64793.828070] [ T382379] 11.2 MiB 3001 mm/execmem.c:41 func:execmem_vmalloc
[64793.828078] [ T382379] 9.16 MiB 579 mm/slub.c:3061 func:alloc_slab_page
[64793.828082] [ T382379] 8.92 MiB 2851 kernel/fork.c:311 func:alloc_thread_stack_node
[64793.828091] [ T382379] 8.65 MiB 15533 lib/xarray.c:378 func:xas_alloc
[64793.828095] [ T382379] 7.95 MiB 2034 mm/readahead.c:189 func:ractl_alloc_folio
[64793.828099] [ T382379] 7.43 MiB 1901 mm/zsmalloc.c:237 func:alloc_zpdesc
[64793.828114] [ T382379] 7.07 MiB 1811 arch/x86/mm/pgtable.c:18 func:pte_alloc_one
[64793.828124] [ T382379] 4.23 MiB 1083 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
It seems every call to __filemap_get_folio that is slowly accumulating is coming from func:netfs_write_begin. I'm not sure if that confirms your theory or throws a spanner in the works.
I've included the dmesg from that run and the patch.
Hopefully that helps!
Updated by Viacheslav Dubeyko 3 months ago
Could you please share the kernel .config file that you used to compile the kernel? Potentially, my kernel does not have the necessary features to trigger the issue. Thanks.
Updated by Malcolm Haak 3 months ago
All my nodes are running an Arch kernel. Even the one I compiled is based on the Arch config available here:
https://aur.archlinux.org/cgit/aur.git/tree/config?h=linux-mainline
The only differences between this and my kernel are the addition of:
CONFIG_MEM_ALLOC_PROFILING=y
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
It is incredibly slow. I didn't see true signs of it happening with the dd workload until around hour 6-7, and as you can see above it wasn't until 12hrs that it really started to show through. It was failing faster with fewer CPU cores... I might test that, as I increased the cores to 32 to rebuild the kernel faster, and the host that replicates it the fastest (with a different workload, not the test one) only has 4 cpus. Weird timing issue under CPU load, perhaps?
Updated by Malcolm Haak 3 months ago
Ok two vCPUs and 2GB of ram. 3hrs of running the dd reproducer:
# sort -g /proc/allocinfo|tail|numfmt --to=iec
4.0M 2 mm/khugepaged.c:1069 func:alloc_charge_folio
4.1M 1049 mm/percpu.c:512 func:pcpu_mem_zalloc
4.3M 1087 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
4.6M 1157 mm/shmem.c:1870 func:shmem_alloc_folio
8.3M 2117 mm/memory.c:1190 func:folio_prealloc
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
20M 4896 mm/memory.c:1192 func:folio_prealloc
29M 4020 mm/slub.c:3059 func:alloc_slab_page
56M 14205 mm/readahead.c:189 func:ractl_alloc_folio
279M 71404 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
# sync; echo 3 > /proc/sys/vm/drop_caches;smem -wp;echo;smem -wk; echo; free -h
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 28.24% 1.12% 27.12%
userspace memory 5.94% 3.26% 2.67%
free memory 65.82% 65.82% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 407.6M 16.2M 391.4M
userspace memory 85.8M 47.1M 38.7M
free memory 950.1M 950.1M 0
total used free shared buff/cache available
Mem: 1.4Gi 284Mi 950Mi 4.5Mi 63Mi 1.1Gi
Swap: 721Mi 8.0Ki 721Mi
It's already showing up at this point. Most of that 391M will be unreclaimable (I'm guessing something close to 279MB). So I'm leaking ~90MB every hour.
It had just completed loop 101.
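The rate estimate above is just the delta over elapsed time; a throwaway sketch (`leak_rate_mb_per_hr` is a hypothetical helper taking two byte counts and the seconds between them):

```shell
#!/bin/sh
# Leak growth rate in MiB/hour from two byte-count samples.
# Hypothetical helper: leak_rate_mb_per_hr <bytes_t0> <bytes_t1> <seconds>
leak_rate_mb_per_hr() {
    awk -v b0="$1" -v b1="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (b1 - b0) / 1048576 / (s / 3600) }'
}

# ~279 MiB accumulated over the 3-hour run:
leak_rate_mb_per_hr 0 292552704 10800   # prints 93.0
```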
This is using an updated ddloop.sh:
#!/bin/bash
for i in $(seq 1 256);
do
dd if=/dev/zero of=/mnt/ceph/repro/$i.$1 bs=163864 count=410 &
done
wait
sync
for i in $(seq 1 256);
do
rm /mnt/ceph/repro/$i.$1 &
done
wait
echo $1 complete
I was aiming for misaligned writes and was also trying to replicate the "reassemble and remove source" behavior from the dataset tool (well, just the "remove source" part, anyway). Also, that tool does a force sync in between stages.
Anyway, it's well on its way to crashing and getting a much more complete crash dump. It's doing a -d 2, not -d 31.
So hopefully that should allow a full diagnosis of the issue.
Updated by Viacheslav Dubeyko 3 months ago
Frankly speaking, I don't quite follow what the definition of the issue is. What should I detect as the symptoms of the issue? How do you define them?
As far as I can see, currently I cannot detect any memory leaks. If I do these steps:
sync; echo 3 > /proc/sys/vm/drop_caches;smem -wp;echo;smem -wk; echo; free -h
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 28.70% 4.06% 24.64%
userspace memory 36.42% 18.15% 18.26%
free memory 34.88% 34.88% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 469.4M 71.2M 398.2M
userspace memory 588.5M 293.4M 295.1M
free memory 557.8M 557.8M 0
total used free shared buff/cache available
Mem: 1.6Gi 691Mi 557Mi 24Mi 366Mi 758Mi
Swap: 2.0Gi 21Mi 2.0Gi
cat /proc/meminfo
MemTotal:        1654452 kB
MemFree:          498560 kB
MemAvailable:     803040 kB
Buffers:            6820 kB
Cached:           434764 kB
SwapCached:         4032 kB
Active:           532060 kB
Inactive:         201756 kB
Active(anon):     275648 kB
Inactive(anon):    37784 kB
Active(file):     256412 kB
Inactive(file):   163972 kB
Unevictable:        7656 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        2075640 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:               432 kB
Writeback:             0 kB
AnonPages:        298424 kB
Mapped:           320312 kB
Shmem:             21320 kB
KReclaimable:      29016 kB
Slab:             288240 kB
SReclaimable:      29016 kB
SUnreclaim:       259224 kB
KernelStack:       11392 kB
PageTables:        13944 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2924372 kB
Committed_AS:    2371600 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       21380 kB
VmallocChunk:          0 kB
Percpu:             1512 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
Unaccepted:            0 kB
Balloon:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      178036 kB
DirectMap2M:     1918976 kB
DirectMap1G:           0 kB
/mnt/cephfs/repro1# dd if=/dev/urandom of=./test.0001 bs=1048576 count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 788.002 s, 1.3 MB/s
cat /proc/meminfo
MemTotal:        1654452 kB
MemFree:           71956 kB
MemAvailable:     936076 kB
Buffers:            1520 kB
Cached:           981652 kB
SwapCached:        13400 kB
Active:           246580 kB
Inactive:         868384 kB
Active(anon):      29952 kB
Inactive(anon):   114964 kB
Active(file):     216628 kB
Inactive(file):   753420 kB
Unevictable:        8176 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        1893800 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:               852 kB
Writeback:             0 kB
AnonPages:        136544 kB
Mapped:           132464 kB
Shmem:             13236 kB
KReclaimable:      48968 kB
Slab:             344432 kB
SReclaimable:      48968 kB
SUnreclaim:       295464 kB
KernelStack:       11360 kB
PageTables:        13860 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2924372 kB
Committed_AS:    2371344 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       21376 kB
VmallocChunk:          0 kB
Percpu:             1728 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
Unaccepted:            0 kB
Balloon:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      178036 kB
DirectMap2M:     1918976 kB
DirectMap1G:           0 kB
sync; echo 3 > /proc/sys/vm/drop_caches;smem -wp;echo;smem -wk; echo; free -h
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 27.57% 4.27% 23.31%
userspace memory 16.86% 8.20% 8.66%
free memory 55.57% 55.57% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 447.4M 68.8M 378.6M
userspace memory 272.4M 132.5M 139.8M
free memory 895.9M 895.9M 0
total used free shared buff/cache available
Mem: 1.6Gi 519Mi 893Mi 12Mi 203Mi 933Mi
Swap: 2.0Gi 198Mi 1.8Gi
cat /proc/meminfo
MemTotal:        1654452 kB
MemFree:          814832 kB
MemAvailable:     937028 kB
Buffers:            6308 kB
Cached:           244872 kB
SwapCached:        13420 kB
Active:           169368 kB
Inactive:         213988 kB
Active(anon):      29732 kB
Inactive(anon):   115048 kB
Active(file):     139636 kB
Inactive(file):    98940 kB
Unevictable:        8176 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        1893712 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:               440 kB
Writeback:             0 kB
AnonPages:        136412 kB
Mapped:           132528 kB
Shmem:             13236 kB
KReclaimable:      28060 kB
Slab:             309844 kB
SReclaimable:      28060 kB
SUnreclaim:       281784 kB
KernelStack:       11360 kB
PageTables:        13856 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2924372 kB
Committed_AS:    2371344 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       21376 kB
VmallocChunk:          0 kB
Percpu:             1728 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
Unaccepted:            0 kB
Balloon:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      178036 kB
DirectMap2M:     1918976 kB
DirectMap1G:           0 kB
I can see that memory has been returned to the free state.
If we suspect netfs_write_begin(), then I have checked the folios' reference counters. And, currently, I don't see any anomalies in this code. Probably, I could be missing something.
So, what is your definition of the problem/issue? What am I missing?
Thanks,
Slava.
Updated by Malcolm Haak 3 months ago
The issue is that folios accumulate and cannot be freed at all, by anyone or anything. They are stuck.
They are marked as available, but nothing can free them. When the memory usage has expanded to consume 90% of ram, you can unmount the filesystem, call sync, drop caches, and remove every module from the kernel, and the memory usage of said folios will remain at 90%. The machine will be swapping like crazy to have enough ram to function in; it will claim it has heaps of 'available' memory, but it can never free these pages to actually use them. Memory pressure can't get the "available" pages back.
Your single random one-shot dd does not replicate the issue in a way that is observable. That's one file. That is not the replication workload, which is why I provided the scripts. I've been very specific that whatever is happening is either not every file, or a very, very small amount per file. The replication workload creates thousands/millions of files for a reason.
Also, your statement suggests you fundamentally misunderstand the issue. On that box, right now:
kernel dynamic memory 447.4M 68.8M 378.6M
Try to get that 378MB back down to 100MB or even 200MB. Unmount the ceph filesystem and see how it doesn't change. It never changes; that ram is un-freeable by any mechanism in the kernel.
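The reclaim attempts being described can be batched into one command for anyone wanting to check their own box. A minimal sketch (needs root; drop_caches and compact_memory are the standard vm sysctls, and per the report none of them move the noncache number here):

```shell
#!/bin/sh
# Best-effort reclaim battery: flush dirty data, drop the
# page/dentry/inode caches, request compaction, then report what
# remains unreclaimed.
sync
{ echo 3 > /proc/sys/vm/drop_caches; }   2>/dev/null || true
{ echo 1 > /proc/sys/vm/compact_memory; } 2>/dev/null || true
grep -E 'MemFree|MemAvailable|SReclaimable|SUnreclaim' /proc/meminfo
```

On an affected client, SUnreclaim and the smem noncache figure stay put after this; on a healthy one they drop.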
I have uploaded another dmesg from the machine that crashed yesterday. All it was running was lots of streams of dd, then rm'ing the files, on loop. Why would that workload cause the computer to run out of available memory and crash? I can do that exact workload on ANY other filesystem, be it local, NFS, SMB, Lustre, BeeGFS, or MooseFS, and it will run forever, as it rm's all the files it creates. The "non-cache kernel dynamic memory" doesn't climb over time to 100% of system ram on any of those other filesystems. It does with cephfs, and has since kernel 6.15.
Look. I'm going to run my reproducer, which I will upload as attachments again today. I will disable panic_on_oom and then run it until the box hits 80% ram usage by "non-cache kernel dynamic memory", and then I will run any and all commands you want to see the output of, as well as provide any and all logs from it. Hell, I'll give you remote access to it so you can see the issue in full effect. Perhaps this is a language barrier, or perhaps you've not been waiting long enough for it to reproduce; I don't know, I don't care, I just want to give you all the information you want/need to see what I am seeing. Please understand I am not mad/frustrated with you if I come across that way. Any/all frustration is at my inability to effectively communicate the issue.
Updated by Malcolm Haak 3 months ago
- File ddloop.sh ddloop.sh added
- File repro_run.sh repro_run.sh added
Apologies, here are the scripts as I have been using.
Updated by Malcolm Haak 3 months ago
- File latest_dmesg.log latest_dmesg.log added
Ok, I ran the VM for ~11hrs straight, collecting output from time to time.
Pre-workload start
[root@kerneltest ~]# smem -wp;echo;smem -wk; echo; free -h
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 28.76% 20.14% 8.63%
userspace memory 5.71% 3.21% 2.50%
free memory 65.53% 65.53% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 415.7M 290.7M 125.0M
userspace memory 82.4M 46.3M 36.0M
free memory 945.4M 945.4M 0
total used free shared buff/cache available
Mem: 1.4Gi 309Mi 945Mi 4.5Mi 337Mi 1.1Gi
Swap: 721Mi 8.0Ki 721Mi
[root@kerneltest ~]# sort -g /proc/allocinfo|tail|numfmt --to=iec
4.3M 1087 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
4.6M 1157 mm/shmem.c:1870 func:shmem_alloc_folio
5.6M 30066 fs/dcache.c:1690 func:__d_alloc
9.7M 2458 mm/memory.c:1190 func:folio_prealloc
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
20M 5014 mm/memory.c:1192 func:folio_prealloc
23M 1920 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
23M 22840 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
55M 6728 mm/slub.c:3059 func:alloc_slab_page
295M 54777 mm/readahead.c:189 func:ractl_alloc_folio
uptime, 11 hours at start
Every 2.0s: smem -wp;echo;smem -wk; echo; free -h; echo; uptime        kerneltest: 11:43:34
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 83.53% 73.86% 9.67%
userspace memory 10.80% 3.23% 7.58%
free memory 5.67% 5.67% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.2G 1.0G 138.5M
userspace memory 155.9M 46.5M 109.3M
free memory 78.5M 78.5M 0
total used free shared buff/cache available
Mem: 1.4Gi 393Mi 75Mi 4.5Mi 1.1Gi 1.0Gi
Swap: 721Mi 8.0Ki 721Mi
11:43:35 up 12:11, 1 user, load average: 256.43, 227.67, 130.74
Every 2.0s: smem -wp;echo;smem -wk; echo; free -h; echo; uptime        kerneltest: 12:33:17
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 84.22% 70.27% 13.95%
userspace memory 10.81% 3.23% 7.58%
free memory 4.97% 4.97% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.2G 1011.5M 200.9M
userspace memory 156.0M 46.6M 109.4M
free memory 75.1M 75.1M 0
total used free shared buff/cache available
Mem: 1.4Gi 388Mi 73Mi 4.5Mi 1.0Gi 1.0Gi
Swap: 721Mi 8.0Ki 721Mi
12:33:18 up 13:00, 1 user, load average: 254.04, 253.54, 247.31
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 84.00% 63.02% 20.98%
userspace memory 10.74% 3.23% 7.51%
free memory 5.26% 5.26% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.2G 879.3M 303.8M
userspace memory 154.8M 46.6M 108.2M
free memory 105.6M 105.6M 0
total used free shared buff/cache available
Mem: 1.4Gi 390Mi 101Mi 4.5Mi 930Mi 1.0Gi
Swap: 721Mi 8.0Ki 721Mi
13:43:36 up 14:11, 1 user, load average: 256.11, 250.50, 249.63
Every 2.0s: smem -wp;echo;smem -wk; echo; free -h; echo; sort -g /proc/allocinfo|tail|numfmt --to=iec; echo; uptime        kerneltest: 15:47:27
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 83.43% 49.86% 33.57%
userspace memory 10.78% 3.23% 7.55%
free memory 5.79% 5.79% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.2G 728.1M 488.3M
userspace memory 155.6M 46.6M 109.0M
free memory 71.5M 71.5M 0
total used free shared buff/cache available
Mem: 1.4Gi 385Mi 109Mi 4.5Mi 747Mi 1.0Gi
Swap: 721Mi 8.0Ki 721Mi
7.5M 1903 arch/x86/mm/pgtable.c:18 func:pte_alloc_one
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
22M 38907 lib/xarray.c:378 func:xas_alloc
23M 1920 fs/xfs/xfs_buf.c:226 [xfs] func:xfs_buf_alloc_backing_mem
23M 22847 fs/xfs/xfs_icache.c:97 [xfs] func:xfs_inode_alloc
32M 8119 mm/memory.c:1192 func:folio_prealloc
71M 18040 mm/memory.c:1190 func:folio_prealloc
88M 10810 mm/slub.c:3059 func:alloc_slab_page
194M 46893 mm/readahead.c:189 func:ractl_alloc_folio
848M 216999 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
15:47:27 up 16:14, 1 user, load average: 255.76, 252.11, 250.47
smem -wp;echo;smem -wk; echo; free -h; echo; sort -g /proc/allocinfo|tail|numfmt --to=iec; echo; uptime
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 89.89% 14.32% 75.57%
userspace memory 3.77% 2.49% 1.28%
free memory 6.34% 6.34% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.3G 206.8M 1.1G
userspace memory 54.4M 36.0M 18.4M
free memory 91.5M 91.5M 0
total used free shared buff/cache available
Mem: 1.4Gi 262Mi 91Mi 4.1Mi 242Mi 1.2Gi
Swap: 721Mi 18Mi 702Mi
3.8M 960 mm/page_ext.c:271 func:alloc_page_ext
4.1M 1049 mm/percpu.c:512 func:pcpu_mem_zalloc
4.2M 1059 mm/shmem.c:1870 func:shmem_alloc_folio
4.3M 1087 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
5.9M 10485 lib/xarray.c:378 func:xas_alloc
6.2M 1566 mm/memory.c:4414 func:__alloc_swap_folio
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
43M 5821 mm/slub.c:3059 func:alloc_slab_page
96M 24360 mm/readahead.c:189 func:ractl_alloc_folio
1.1G 284870 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
22:03:22 up 22:30, 2 users, load average: 89.54, 185.80, 221.69
[root@kerneltest ~]# sync; echo 3 >/proc/sys/mem/drop_caches;smem -wp;echo;smem -wk; echo; free -h; echo; sort -g /proc/allocinfo|tail|numfmt --to=iec; echo; uptime
-bash: /proc/sys/mem/drop_caches: No such file or directory
Area Used Cache Noncache
firmware/hardware 0.00% 0.00% 0.00%
kernel image 0.00% 0.00% 0.00%
kernel dynamic memory 79.18% 4.88% 74.29%
userspace memory 3.78% 2.49% 1.28%
free memory 17.04% 17.04% 0.00%
Area Used Cache Noncache
firmware/hardware 0 0 0
kernel image 0 0 0
kernel dynamic memory 1.1G 70.5M 1.0G
userspace memory 54.5M 36.0M 18.5M
free memory 246.0M 246.0M 0
total used free shared buff/cache available
Mem: 1.4Gi 237Mi 246Mi 4.1Mi 106Mi 1.2Gi
Swap: 721Mi 18Mi 702Mi
3.7M 522 mm/slub.c:3061 func:alloc_slab_page
3.8M 960 mm/page_ext.c:271 func:alloc_page_ext
4.1M 1049 mm/percpu.c:512 func:pcpu_mem_zalloc
4.2M 1059 mm/shmem.c:1870 func:shmem_alloc_folio
4.3M 1087 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
6.3M 1592 mm/memory.c:4414 func:__alloc_swap_folio
12M 3001 mm/execmem.c:41 func:execmem_vmalloc
33M 4578 mm/slub.c:3059 func:alloc_slab_page
96M 24386 mm/readahead.c:189 func:ractl_alloc_folio
988M 252826 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
22:04:17 up 22:31, 2 users, load average: 35.75, 154.55, 208.94
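Note that the drop_caches path in the command above was mistyped (`/proc/sys/mem` rather than `/proc/sys/vm`), which is why the shell reported "No such file or directory", so the explicit cache drop never ran in that session. The correct invocation, as used earlier in this report, is roughly:

```shell
# Correct sysctl path: /proc/sys/vm/drop_caches (requires root).
# Writing 3 drops both the page cache and reclaimable slab objects.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
    echo "caches dropped"
else
    echo "skipped: /proc/sys/vm/drop_caches not writable (need root)"
fi
```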
[root@kerneltest ~]# cat /proc/meminfo
MemTotal: 1478080 kB
MemFree: 250216 kB
MemAvailable: 1233092 kB
Buffers: 0 kB
Cached: 101824 kB
SwapCached: 828 kB
Active: 653032 kB
Inactive: 468360 kB
Active(anon): 11276 kB
Inactive(anon): 1164 kB
Active(file): 641756 kB
Inactive(file): 467196 kB
Unevictable: 4000 kB
Mlocked: 0 kB
SwapTotal: 738812 kB
SwapFree: 719612 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 11620 kB
Mapped: 32840 kB
Shmem: 4236 kB
KReclaimable: 7292 kB
Slab: 43388 kB
SReclaimable: 7292 kB
SUnreclaim: 36096 kB
KernelStack: 2096 kB
PageTables: 2636 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 1477852 kB
Committed_AS: 129916 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 23948 kB
VmallocChunk: 0 kB
Percpu: 1008 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
Balloon: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 78412 kB
DirectMap2M: 2013184 kB
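As a cross-check on the meminfo dump above, the file LRU counters can be compared with Cached; the gap is close to the ~988M attributed to netfs_write_begin. A minimal sketch, with the kB values copied from the dump:

```python
# kB values copied from the /proc/meminfo dump above
active_file = 641756    # Active(file)
inactive_file = 467196  # Inactive(file)
cached = 101824         # Cached

file_lru = active_file + inactive_file  # total file-backed LRU pages
gap = file_lru - cached                 # file pages not reported as Cached

print(f"file LRU total: {file_lru} kB (~{file_lru / 2**20:.2f} GiB)")
print(f"gap vs Cached:  {gap} kB (~{gap / 1024:.0f} MiB)")
```

The roughly 984 MiB gap lines up with the 988M netfs_write_begin allocation still present in /proc/allocinfo after the cache drop.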
As you can see, each time the Non-Cache portion of "kernel dynamic memory" has grown, both as a percentage and in absolute size.
You can also see that after issuing a sync and drop_caches, roughly 1 GB of that memory is still in use. It's still technically "available", but nothing can actually free it.
There is also still 988M allocated at mm/filemap.c:2012 func:__filemap_get_folio, called from fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin, which I would normally expect to become freeable once the pages/folios were all marked clean after being flushed out to ceph.
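For reference, the allocation-site rankings above come from the pipeline `sort -g /proc/allocinfo | tail | numfmt --to=iec`. A minimal Python sketch of the same ranking, assuming the `<bytes> <calls> <site>` line format of /proc/allocinfo (the sample values are taken from the dump above):

```python
def top_alloc_sites(lines, n=3):
    """Return the n largest allocation sites as (bytes, calls, site)."""
    entries = []
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip the header line
        entries.append((int(parts[0]), int(parts[1]), parts[2]))
    entries.sort()
    return entries[-n:]

def iec(n):
    # IEC-style size formatting, floored to whole units
    # (numfmt --to=iec rounds instead of flooring)
    for unit in ("", "K", "M", "G", "T"):
        if n < 1024:
            return f"{n}{unit}"
        n //= 1024
    return f"{n}P"

sample = [
    "allocinfo - version: 1.0",
    "100663296 24360 mm/readahead.c:189 func:ractl_alloc_folio",
    "45088768 5821 mm/slub.c:3059 func:alloc_slab_page",
    "1181116006 284870 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin",
]
for size, calls, site in top_alloc_sites(sample):
    print(iec(size), calls, site)
```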
I have not yet attempted to reclaim that memory by unmounting the filesystem, removing kernel modules, or anything else. I can run any test, dump the current kernel memory, or whatever else you would like.
I have gdb installed and a full set of kernel debug symbols. I'm ready to do anything you would like done. I can even give you access to the incredibly slow VM if it would help.
As you will be able to see from dmesg, memory is not being reclaimed and memory pressure is through the roof, even though all that has been running is my reproducer, which just runs dd and then deletes the resulting file full of zeros: a net change of nothing.
Hopefully something in all this output is helpful, or at least something I can get out of this stuck VM will be.
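The reproducer described above can be sketched as a small loop. TARGET, SIZE_MB, and ITERATIONS are placeholder assumptions, scaled down so the sketch is harmless to run as-is; point TARGET at the cephfs mount and raise SIZE_MB to create real memory pressure:

```shell
# Hypothetical reproducer sketch: write a file full of zeros to the
# filesystem under test, sync, delete it, and repeat.
TARGET="${TARGET:-.}"            # e.g. TARGET=/mnt/cephfs
SIZE_MB="${SIZE_MB:-8}"          # use something much larger on the real VM
ITERATIONS="${ITERATIONS:-3}"

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    dd if=/dev/zero of="$TARGET/leaktest.tmp" bs=1M count="$SIZE_MB" 2>/dev/null
    sync
    rm -f "$TARGET/leaktest.tmp"
    i=$((i + 1))
done
echo "completed $i iterations"
```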
Updated by Viacheslav Dubeyko 3 months ago
I have been running your scripts for 24 hours already. The system continues to work, and I don't see any sign of memory leaks.
I think I need to have:
(1) the same kernel source code that you are using
(2) the same kernel configuration
(3) the same virtual machine + QEMU configuration
If you can prepare a simple VM image with a simple password, then I can download everything. You can share the link to the VM, the login, and the password in a private email. Would that work for you?
Thanks,
Slava.
Updated by Malcolm Haak 2 months ago
Sorry, I've been on holidays.
I can do that, but I'm not using anything unusual. It's literally stock Arch Linux installed with archinstall.
For the VM I'm using Proxmox with default settings. The only change is the UEFI BIOS; otherwise it's Proxmox defaults. I can dump the VM definition if that will help.
I still have the VM, and you are more than welcome to log into it. I'll just clone it; it has nothing of value in it, so I'll set the root password to root.
I'll have a downloadable image available shortly.
Updated by Viacheslav Dubeyko 2 months ago
Malcolm Haak wrote in #note-24:
Sorry, I've been on holidays.
I can do that, but I'm not using anything unusual. It's literally stock Arch Linux installed with archinstall.
For the VM I'm using Proxmox with default settings. The only change is the UEFI BIOS; otherwise it's Proxmox defaults. I can dump the VM definition if that will help.
I still have the VM, and you are more than welcome to log into it. I'll just clone it; it has nothing of value in it, so I'll set the root password to root.
I'll have a downloadable image available shortly.
It's OK. I am trying to relax too. :) Thanks a lot.
Updated by Viacheslav Dubeyko about 2 months ago
Ping... any chance of getting the VM so I can reproduce the issue?
Thanks,
Slava.
Updated by Malcolm Haak about 2 months ago
Apologies. I will get it packed up today and send you a download link directly.
Updated by Viacheslav Dubeyko 23 days ago
Should I close the ticket? I cannot reproduce the issue, and I haven't received any other means of reproducing it.
Updated by Malcolm Haak 20 days ago
Sorry, personal stuff happened. I needed to remove personal information from the VM.
I can still box up the VM, but it's literally the stock Arch Linux kernel.
Also, it's the Arch Linux "mainline" build: https://aur.archlinux.org/packages/linux-mainline
Updated by Viacheslav Dubeyko 20 days ago
Malcolm Haak wrote in #note-29:
Sorry, personal stuff happened. I needed to remove personal information from the VM.
I can still box up the VM, but it's literally the stock Arch Linux kernel.
Also, it's the Arch Linux "mainline" build: https://aur.archlinux.org/packages/linux-mainline
I need exactly the same VM in which you were able to reproduce the issue. You can simply create the VM from scratch without any personal details, and once you can reproduce the issue in it, share that VM with me.