Commit a98bbb2

Hou Tao authored and openunix committed
virtiofs: use pages instead of pointer for kernel direct IO
BugLink: https://bugs.launchpad.net/bugs/2101915

[ Upstream commit 4174867 ]

When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:

  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
  Modules linked in:
  CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:__alloc_pages+0x2bf/0x380
  ......
  Call Trace:
   <TASK>
   ? __warn+0x8e/0x150
   ? __alloc_pages+0x2bf/0x380
   __kmalloc_large_node+0x86/0x160
   __kmalloc+0x33c/0x480
   virtio_fs_enqueue_req+0x240/0x6d0
   virtio_fs_wake_pending_and_unlock+0x7f/0x190
   queue_request_and_unlock+0x55/0x60
   fuse_simple_request+0x152/0x2b0
   fuse_direct_io+0x5d2/0x8c0
   fuse_file_read_iter+0x121/0x160
   __kernel_read+0x151/0x2d0
   kernel_read+0x45/0x50
   kernel_read_file+0x1a9/0x2a0
   init_module_from_file+0x6a/0xe0
   idempotent_init_module+0x175/0x230
   __x64_sys_finit_module+0x5d/0xb0
   x64_sys_call+0x1c3/0x9e0
   do_syscall_64+0x3d/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   ......
   </TASK>
  ---[ end trace 0000000000000000 ]---

The warning is triggered as follows:

1) The finit_module() syscall handles the module insertion. It first
   invokes kernel_read_file() to read the content of the module.

2) kernel_read_file() allocates a 10MB buffer with vmalloc() and passes it
   to kernel_read(). kernel_read() constructs a kvec iter with
   iov_iter_kvec() and passes it to fuse_file_read_iter().

3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
   fuse_direct_io(). Currently the maximal read size for a kvec iter is
   limited only by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
   fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
   the size of the 10MB buffer in out_args[0] of a fuse request and passes
   the fuse request to virtio_fs_wake_pending_and_unlock().

4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
   queue the request. Because virtiofs needs DMA-able addresses,
   virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
   all fuse args, copies these args into the bounce buffer and passes the
   physical address of the bounce buffer to virtiofsd. The total length of
   the fuse args for this request is about 10MB, so copy_args_to_argbuf()
   invokes kmalloc() with a 10MB size parameter, which triggers the
   warning in __alloc_pages():

     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
             return NULL;

5) virtio_fs_enqueue_req() will retry the memory allocation in a kworker,
   but it won't help: kmalloc() will always return NULL due to the
   abnormal size, so finit_module() will hang forever.

A feasible solution is to limit the value of max_read for virtio-fs, so
that the length passed to kmalloc() is limited. However, that would also
affect the maximal read size for normal reads. A virtio-fs write initiated
from the kernel has a similar problem, and there is currently no way to
limit fc->max_write in the kernel.

So instead of limiting both max_read and max_write in the kernel,
introduce use_pages_for_kvec_io in fuse_conn and set it to true in
virtiofs. When use_pages_for_kvec_io is enabled, fuse will use pages
instead of a pointer to pass the KVEC_IO data.

After switching to pages for KVEC_IO data, these pages will be used for
DMA through virtio-fs. If these pages are backed by vmalloc(),
{flush|invalidate}_kernel_vmap_range() are necessary to flush or
invalidate the cache before the DMA operation. So add two new fields in
fuse_args_pages to record the base address of the vmalloc area and the
condition indicating whether invalidation is needed. Perform the flush in
fuse_get_user_pages() for write operations and the invalidation in
fuse_release_user_pages() for read operations.

It may seem necessary to introduce another field in fuse_conn to indicate
that these KVEC_IO pages are used for DMA. However, considering that
virtio-fs is currently the only user of use_pages_for_kvec_io, just reuse
use_pages_for_kvec_io to indicate that these pages will be used for DMA.

Fixes: a62a8ef ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[koichiroden: adjusted context due to missing commits:
 738adad ("fuse: Fix missing FOLL_PIN for direct-io")
 7dc4e97 ("fuse: introduce FUSE_PASSTHROUGH capability")]
CVE-2024-53219
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
(cherry picked from commit 6996dac)
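As a rough illustration of step 4 in the commit message, the sketch below computes the page order a 10MB kmalloc() would need and compares it with the allocator limit. It is a userspace model, not kernel code: the `sketch_` names are hypothetical, and the 4 KiB page size and `MAX_PAGE_ORDER` of 10 are assumptions that match a common x86-64 configuration but vary by architecture and config.

```c
#include <stddef.h>

/* Assumed values for a typical x86-64 build; both depend on configuration. */
#define SKETCH_PAGE_SHIFT     12   /* 4 KiB pages */
#define SKETCH_MAX_PAGE_ORDER 10   /* largest contiguous allocation: 2^10 pages */

/* Smallest order such that (1 << order) pages cover `size` bytes. */
static unsigned int sketch_get_order(size_t size)
{
	size_t page_size = (size_t)1 << SKETCH_PAGE_SHIFT;
	size_t pages = (size + page_size - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Returns 1 when `size` bytes would exceed the assumed order limit. */
static int sketch_order_too_large(size_t size)
{
	return sketch_get_order(size) > SKETCH_MAX_PAGE_ORDER;
}
```

Under these assumptions a 10MB request needs 2560 pages, which rounds up to order 12 and exceeds the limit, while a 4MB request fits exactly at order 10. This is why retrying the allocation cannot succeed.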
1 parent 70ab16a commit a98bbb2

3 files changed: 50 additions, 19 deletions

fs/fuse/file.c

Lines changed: 43 additions & 19 deletions
@@ -624,7 +624,7 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
 	args->out_args[0].size = count;
 }

-static void fuse_release_user_pages(struct fuse_args_pages *ap,
+static void fuse_release_user_pages(struct fuse_args_pages *ap, ssize_t nres,
 				    bool should_dirty)
 {
 	unsigned int i;
@@ -634,6 +634,9 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
 			set_page_dirty_lock(ap->pages[i]);
 		put_page(ap->pages[i]);
 	}
+
+	if (nres > 0 && ap->args.invalidate_vmap)
+		invalidate_kernel_vmap_range(ap->args.vmap_base, nres);
 }

 static void fuse_io_release(struct kref *kref)
@@ -732,25 +735,29 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
 	struct fuse_io_args *ia = container_of(args, typeof(*ia), ap.args);
 	struct fuse_io_priv *io = ia->io;
 	ssize_t pos = -1;
-
-	fuse_release_user_pages(&ia->ap, io->should_dirty);
+	size_t nres;

 	if (err) {
 		/* Nothing */
 	} else if (io->write) {
 		if (ia->write.out.size > ia->write.in.size) {
 			err = -EIO;
-		} else if (ia->write.in.size != ia->write.out.size) {
-			pos = ia->write.in.offset - io->offset +
-				ia->write.out.size;
+		} else {
+			nres = ia->write.out.size;
+			if (ia->write.in.size != ia->write.out.size)
+				pos = ia->write.in.offset - io->offset +
+				      ia->write.out.size;
 		}
 	} else {
 		u32 outsize = args->out_args[0].size;

+		nres = outsize;
 		if (ia->read.in.size != outsize)
 			pos = ia->read.in.offset - io->offset + outsize;
 	}

+	fuse_release_user_pages(&ia->ap, err ?: nres, io->should_dirty);
+
 	fuse_aio_complete(io, err, pos);
 	fuse_io_free(ia);
 }
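
The `err ?: nres` argument in this hunk is the GNU C conditional with an omitted middle operand: it passes the error when one is set and the transferred byte count otherwise. The following userspace sketch models that selection with a portable ternary. The struct and function names are hypothetical stand-ins, not FUSE types, and the model leaves out the `pos` bookkeeping.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical stand-in for the fields the completion handler reads. */
struct sketch_completion {
	int err;            /* request error, 0 on success */
	int is_write;       /* nonzero for a write request */
	uint32_t in_size;   /* bytes requested */
	uint32_t out_size;  /* bytes the server reports as transferred */
};

/* Value handed to the release helper: the error if any, else the byte count. */
static ssize_t sketch_release_nres(const struct sketch_completion *c)
{
	int err = c->err;
	size_t nres = 0;

	if (!err) {
		if (c->is_write && c->out_size > c->in_size)
			err = -EIO;     /* server claims more than was sent */
		else
			nres = c->out_size;
	}

	/* Portable spelling of the GNU C expression `err ?: nres`. */
	return err ? err : (ssize_t)nres;
}
```

Because the release helper only invalidates when `nres > 0`, a failed request (negative value) or an empty transfer skips the invalidation in this model.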
@@ -1366,24 +1373,37 @@ static inline size_t fuse_get_frag_size(const struct iov_iter *ii,

 static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 			       size_t *nbytesp, int write,
-			       unsigned int max_pages)
+			       unsigned int max_pages,
+			       bool use_pages_for_kvec_io)
 {
+	bool flush_or_invalidate = false;
 	size_t nbytes = 0;  /* # bytes already packed in req */
 	ssize_t ret = 0;

-	/* Special case for kernel I/O: can copy directly into the buffer */
+	/* Special case for kernel I/O: can copy directly into the buffer.
+	 * However if the implementation of fuse_conn requires pages instead of
+	 * pointer (e.g., virtio-fs), use iov_iter_extract_pages() instead.
+	 */
 	if (iov_iter_is_kvec(ii)) {
-		unsigned long user_addr = fuse_get_user_addr(ii);
-		size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
+		void *user_addr = (void *)fuse_get_user_addr(ii);

-		if (write)
-			ap->args.in_args[1].value = (void *) user_addr;
-		else
-			ap->args.out_args[0].value = (void *) user_addr;
+		if (!use_pages_for_kvec_io) {
+			size_t frag_size = fuse_get_frag_size(ii, *nbytesp);

-		iov_iter_advance(ii, frag_size);
-		*nbytesp = frag_size;
-		return 0;
+			if (write)
+				ap->args.in_args[1].value = user_addr;
+			else
+				ap->args.out_args[0].value = user_addr;
+
+			iov_iter_advance(ii, frag_size);
+			*nbytesp = frag_size;
+			return 0;
+		}
+
+		if (is_vmalloc_addr(user_addr)) {
+			ap->args.vmap_base = user_addr;
+			flush_or_invalidate = true;
+		}
 	}

 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
@@ -1409,6 +1429,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 			(PAGE_SIZE - ret) & (PAGE_SIZE - 1);
 	}

+	if (write && flush_or_invalidate)
+		flush_kernel_vmap_range(ap->args.vmap_base, nbytes);
+
+	ap->args.invalidate_vmap = !write && flush_or_invalidate;
 	ap->args.user_pages = true;
 	if (write)
 		ap->args.in_pages = true;
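
The two fuse_get_user_pages() hunks above amount to a small decision table for kvec iterators. The sketch below restates it as a pure function so the cases are easy to check. It is a simplified userspace model: `sketch_kvec_plan` and `sketch_plan_kvec_io` are hypothetical names, and the `is_vmalloc` flag stands in for the kernel's `is_vmalloc_addr()` check.

```c
#include <stdbool.h>

/* Hypothetical summary of what the kvec branch decides; not a kernel type. */
struct sketch_kvec_plan {
	bool pass_pointer;     /* hand the buffer address to the request directly */
	bool flush_before;     /* flush_kernel_vmap_range() before a write */
	bool invalidate_after; /* invalidate_kernel_vmap_range() after a read */
};

static struct sketch_kvec_plan sketch_plan_kvec_io(bool use_pages_for_kvec_io,
						   bool is_vmalloc, bool write)
{
	struct sketch_kvec_plan plan = { false, false, false };

	if (!use_pages_for_kvec_io) {
		/* Old behaviour: early return, no pages and no cache maintenance. */
		plan.pass_pointer = true;
		return plan;
	}

	/* Pages path: cache maintenance only matters for vmalloc-backed buffers. */
	plan.flush_before = write && is_vmalloc;
	plan.invalidate_after = !write && is_vmalloc;
	return plan;
}
```

In this model a vmalloc-backed read on virtio-fs, such as the module load from the commit message, takes the pages path and is marked for invalidation after the request completes.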
@@ -1476,7 +1500,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		size_t nbytes = min(count, nmax);

 		err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
-					  max_pages);
+					  max_pages, fc->use_pages_for_kvec_io);
 		if (err && !nbytes)
 			break;

@@ -1490,7 +1514,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		}

 		if (!io->async || nres < 0) {
-			fuse_release_user_pages(&ia->ap, io->should_dirty);
+			fuse_release_user_pages(&ia->ap, nres, io->should_dirty);
 			fuse_io_free(ia);
 		}
 		ia = NULL;

fs/fuse/fuse_i.h

Lines changed: 6 additions & 0 deletions
@@ -283,9 +283,12 @@ struct fuse_args {
 	bool page_replace:1;
 	bool may_block:1;
 	bool is_ext:1;
+	bool invalidate_vmap:1;
 	struct fuse_in_arg in_args[3];
 	struct fuse_arg out_args[2];
 	void (*end)(struct fuse_mount *fm, struct fuse_args *args, int error);
+	/* Used for kvec iter backed by vmalloc address */
+	void *vmap_base;
 };

 struct fuse_args_pages {
@@ -818,6 +821,9 @@ struct fuse_conn {
 	/* Is statx not implemented by fs? */
 	unsigned int no_statx:1;

+	/* Use pages instead of pointer for kernel I/O */
+	unsigned int use_pages_for_kvec_io:1;
+
 	/** The number of requests waiting for completion */
 	atomic_t num_waiting;

fs/fuse/virtio_fs.c

Lines changed: 1 addition & 0 deletions
@@ -1505,6 +1505,7 @@ static int virtio_fs_get_tree(struct fs_context *fsc)
 	fc->delete_stale = true;
 	fc->auto_submounts = true;
 	fc->sync_fs = true;
+	fc->use_pages_for_kvec_io = true;

 	/* Tell FUSE to split requests that exceed the virtqueue's size */
 	fc->max_pages_limit = min_t(unsigned int, fc->max_pages_limit,
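
With the flag set, kernel-initiated I/O on virtio-fs goes through the page loop in fuse_get_user_pages(), which stops once a request holds `max_pages` pages, so a large buffer is carried by several bounded requests instead of one 10MB bounce buffer. The sketch below estimates that request count. It is only an approximation: the function name is hypothetical, a 4 KiB page size and a page-aligned buffer are assumed, and the additional `max_read`/`max_write` limits in fuse_direct_io() are ignored.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */

/*
 * Number of requests needed to carry `len` bytes when each request may
 * hold at most `max_pages` pages. Assumes a page-aligned buffer.
 */
static size_t sketch_num_requests(size_t len, unsigned int max_pages)
{
	size_t pages = (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	if (max_pages == 0)
		return 0; /* invalid limit; nothing can be sent */
	return (pages + max_pages - 1) / max_pages;
}
```

For example, with a hypothetical limit of 256 pages per request, the 10MB module read (2560 pages) would be split into 10 requests of 1MB each.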
