www.infradead.org Git - users/willy/xarray.git/log
7 months ago Revert "io_uring/rsrc: simplify the bvec iter count calculation"
Keith Busch [Mon, 10 Mar 2025 18:48:25 +0000 (11:48 -0700)]
Revert "io_uring/rsrc: simplify the bvec iter count calculation"

This reverts commit 2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c.

The kernel registered bvecs do use the iov_iter_advance() API, so we
can't rely on this simplification anymore.

Fixes: 27cb27b6d5ea40 ("io_uring: add support for kernel registered bvecs")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250310184825.569371-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: improve test usability
Ming Lei [Mon, 3 Mar 2025 12:43:21 +0000 (20:43 +0800)]
selftests: ublk: improve test usability

Add UBLK_TEST_QUIET so that only the test result (PASS/SKIP/FAIL) is
printed.

Also always run from the test script's own directory, so the same test
script can be started from any other working directory.

This makes it much easier to reuse the test source code and scripts in
other projects (liburing, blktests, ...).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-12-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: add stress test for covering IO vs. killing ublk server
Ming Lei [Mon, 3 Mar 2025 12:43:20 +0000 (20:43 +0800)]
selftests: ublk: add stress test for covering IO vs. killing ublk server

Add stress_test_01, which runs IO while killing the ublk server, so that
the io_uring exit & cancel code paths are covered, along with ublk's
cancel code path.

IO buffer lifetime in particular is a major concern for ublk zero copy,
and the added test verifies that this area works as expected.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: add one stress test for covering IO vs. removing device
Ming Lei [Mon, 3 Mar 2025 12:43:19 +0000 (20:43 +0800)]
selftests: ublk: add one stress test for covering IO vs. removing device

Add stress_test_01, which runs IO while removing the device, to verify
that ublk device removal works as expected while heavy IO workloads are
in progress.

null, loop and loop/zc are covered in this test.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
Ming Lei [Mon, 3 Mar 2025 12:43:18 +0000 (20:43 +0800)]
selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests

Load the ublk_drv module in _prep_test(), and unload it in
_cleanup_test(), so that tests always run in a consistent state.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: move zero copy feature check into _add_ublk_dev()
Ming Lei [Mon, 3 Mar 2025 12:43:17 +0000 (20:43 +0800)]
selftests: ublk: move zero copy feature check into _add_ublk_dev()

Move the zero copy feature check into _add_ublk_dev(), since more tests
that need to cover zero copy are coming.

A check function, _check_add_dev(), then has to be added to deal with
cleanup, since _add_ublk_dev() runs in a sub-shell and exiting from it
does not exit the calling shell.

Also always return an error code from _add_ublk_dev().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: don't pass ${dev_id} to _cleanup_test()
Ming Lei [Mon, 3 Mar 2025 12:43:16 +0000 (20:43 +0800)]
selftests: ublk: don't pass ${dev_id} to _cleanup_test()

A single test may create more than one device, so simply remove all
ublk devices in _cleanup_test() and drop the ${dev_id} argument of
_cleanup_test().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: support shellcheck and fix all warning
Ming Lei [Mon, 3 Mar 2025 12:43:15 +0000 (20:43 +0800)]
selftests: ublk: support shellcheck and fix all warning

Add shellcheck support and fix all warnings.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250303124324.3563605-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: fix parsing '-a' argument
Ming Lei [Mon, 3 Mar 2025 12:43:14 +0000 (20:43 +0800)]
selftests: ublk: fix parsing '-a' argument

The '-a' option does not take a value, so fix the parsing by handling
it together with '-z'.

Fixes: bedc9cbc5f97 ("selftests: ublk: add ublk zero copy test")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250303124324.3563605-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: add --foreground command line
Ming Lei [Mon, 3 Mar 2025 12:43:13 +0000 (20:43 +0800)]
selftests: ublk: add --foreground command line

Add a --foreground command line option to help with debugging.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250303124324.3563605-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: fix build failure
Ming Lei [Mon, 3 Mar 2025 12:43:12 +0000 (20:43 +0800)]
selftests: ublk: fix build failure

Fix the following build failure:

ublk//file_backed.c: In function ‘backing_file_tgt_init’:
ublk//file_backed.c:28:42: error: ‘O_DIRECT’ undeclared (first use in this function); did you mean ‘O_DIRECTORY’?
   28 |                 fd = open(file, O_RDWR | O_DIRECT);
      |                                          ^~~~~~~~
      |                                          O_DIRECTORY

which shows up when trying to reuse this utility for liburing tests.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250303124324.3563605-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: make ublk_stop_io_daemon() more reliable
Ming Lei [Mon, 3 Mar 2025 12:43:11 +0000 (20:43 +0800)]
selftests: ublk: make ublk_stop_io_daemon() more reliable

Improve ublk_stop_io_daemon() in the following ways:

- don't wait if ->ublksrv_pid becomes -1, which means that the disk
has been stopped

- don't wait if the ublk char device no longer exists, so we can avoid
relying on inotify to wait until the char device is closed

This may also reduce the time taken by the delete command considerably.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250303124324.3563605-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring: Remove unused declaration io_alloc_async_data()
Yue Haibing [Wed, 5 Mar 2025 01:34:54 +0000 (09:34 +0800)]
io_uring: Remove unused declaration io_alloc_async_data()

Commit ef623a647f42 ("io_uring: Move old async data allocation helper
to header") left behind this unused declaration.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20250305013454.3635021-1-yuehaibing@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring: introduce io_cache_free() helper
Caleb Sander Mateos [Tue, 4 Mar 2025 19:48:12 +0000 (12:48 -0700)]
io_uring: introduce io_cache_free() helper

Add a helper function io_cache_free() that returns an allocation to an
io_alloc_cache, falling back on kfree() if the io_alloc_cache is full.
This is the inverse of io_cache_alloc(), which takes an allocation from
an io_alloc_cache and falls back on kmalloc() if the cache is empty.

Convert 4 callers to use the helper.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
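
The alloc/free pairing described in the commit above can be illustrated with a small userspace sketch. This is not the kernel implementation: `demo_alloc_cache`, `DEMO_CACHE_MAX`, `demo_cache_alloc()` and `demo_cache_free()` are hypothetical names, malloc()/free() stand in for kmalloc()/kfree(), and the boolean return value exists only so the fallback path can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_CACHE_MAX 2	/* illustrative capacity, not a kernel value */

/* Hypothetical stand-in for an io_alloc_cache: a bounded stack of
 * previously freed objects that can be handed out again. */
struct demo_alloc_cache {
	void *entries[DEMO_CACHE_MAX];
	unsigned int nr;
	size_t elem_size;
};

/* Take an object from the cache, falling back on malloc() when empty. */
void *demo_cache_alloc(struct demo_alloc_cache *cache)
{
	assert(cache != NULL);
	if (cache->nr > 0)
		return cache->entries[--cache->nr];
	return malloc(cache->elem_size);
}

/* Return an object to the cache, falling back on free() when full.
 * Returns true if the object was cached, false if it was freed. */
bool demo_cache_free(struct demo_alloc_cache *cache, void *obj)
{
	assert(cache != NULL);
	if (cache->nr < DEMO_CACHE_MAX) {
		cache->entries[cache->nr++] = obj;
		return true;
	}
	free(obj);
	return false;
}
```

Keeping the fallback inside the helper means callers no longer need to open-code the "cache full" branch, which appears to be the point of converting the four callers.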
7 months ago io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node()
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:14 +0000 (16:59 -0700)]
io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node()

io_rsrc_node's of type IORING_RSRC_FILE always have a file attached
immediately after they are allocated. IORING_RSRC_BUFFER nodes won't be
returned from io_sqe_buffer_register()/io_buffer_register_bvec() until
they have an io_mapped_ubuf attached.

So remove the checks for a NULL file/buffer in io_free_rsrc_node().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: avoid NULL node check on io_sqe_buffer_register() failure
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:13 +0000 (16:59 -0700)]
io_uring/rsrc: avoid NULL node check on io_sqe_buffer_register() failure

The done: label is only reachable if node is non-NULL. So don't bother
checking, just call io_free_node().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:12 +0000 (16:59 -0700)]
io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure

io_sqe_buffer_register() currently calls io_put_rsrc_node() if it fails
to fully set up the io_rsrc_node. io_put_rsrc_node() is more involved
than necessary, since we already know the reference count will reach 0
and no io_mapped_ubuf has been attached to the node yet.

So just call io_free_node() to release the node's memory. This also
avoids the need to temporarily set the node's buf pointer to NULL.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: free io_rsrc_node using kfree()
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:11 +0000 (16:59 -0700)]
io_uring/rsrc: free io_rsrc_node using kfree()

io_rsrc_node_alloc() calls io_cache_alloc(), which uses kmalloc() to
allocate the node. So it can be freed with kfree() instead of kvfree().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: split out io_free_node() helper
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:10 +0000 (16:59 -0700)]
io_uring/rsrc: split out io_free_node() helper

Split the freeing of the io_rsrc_node from io_free_rsrc_node(), for use
with nodes that haven't been fully initialized.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: include io_uring_types.h in rsrc.h
Caleb Sander Mateos [Sat, 1 Mar 2025 18:36:11 +0000 (11:36 -0700)]
io_uring/rsrc: include io_uring_types.h in rsrc.h

io_uring/rsrc.h uses several types from include/linux/io_uring_types.h.
Include io_uring_types.h explicitly in rsrc.h to avoid depending on
users of rsrc.h including io_uring_types.h first.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250301183612.937529-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago ublk: don't cast registered buffer index to int
Caleb Sander Mateos [Sat, 1 Mar 2025 19:03:16 +0000 (12:03 -0700)]
ublk: don't cast registered buffer index to int

io_buffer_register_bvec() takes index as an unsigned int argument, but
ublk_register_io_buf() casts ub_cmd->addr (a u64) to int. Remove the
misleading cast and instead pass index as an unsigned value to
ublk_register_io_buf() and ublk_unregister_io_buf().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250301190317.950208-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/nop: use io_find_buf_node()
Caleb Sander Mateos [Sat, 1 Mar 2025 00:16:08 +0000 (17:16 -0700)]
io_uring/nop: use io_find_buf_node()

Call io_find_buf_node() to avoid duplicating it in io_nop().

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: declare io_find_buf_node() in header file
Caleb Sander Mateos [Sat, 1 Mar 2025 00:16:07 +0000 (17:16 -0700)]
io_uring/rsrc: declare io_find_buf_node() in header file

Declare io_find_buf_node() in io_uring/rsrc.h so it can be called from
other files.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-1-csander@purestorage.com
[axboe: keep the inline for local hot path usage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/ublk: report error when unregister operation fails
Caleb Sander Mateos [Fri, 28 Feb 2025 23:14:31 +0000 (16:14 -0700)]
io_uring/ublk: report error when unregister operation fails

Indicate to userspace applications if a UBLK_IO_UNREGISTER_IO_BUF
command specifies an invalid buffer index by returning an error code.
Return -EINVAL if no buffer is registered with the given index, and
-EBUSY if the registered buffer is not a kernel bvec.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228231432.642417-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
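
A minimal sketch of the error reporting described in the commit above, under stated assumptions: `demo_buf_node`, `demo_buf_table` and `demo_unregister_buf()` are hypothetical userspace stand-ins rather than the kernel's types or functions, and treating an out-of-range index as "no buffer registered" is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registered-buffer node: only records whether the buffer
 * was registered by the kernel as a bvec. */
struct demo_buf_node {
	bool is_kernel_bvec;
};

/* Hypothetical buffer table: an array of slots, NULL when empty. */
struct demo_buf_table {
	struct demo_buf_node **nodes;
	unsigned int nr;
};

/* Unregister the buffer at index, reporting why it failed:
 *   -EINVAL  no buffer is registered at that index
 *   -EBUSY   the registered buffer is not a kernel bvec
 *   0        the slot was cleared */
int demo_unregister_buf(struct demo_buf_table *table, unsigned int index)
{
	assert(table != NULL);
	if (index >= table->nr || table->nodes[index] == NULL)
		return -EINVAL;
	if (!table->nodes[index]->is_kernel_bvec)
		return -EBUSY;
	table->nodes[index] = NULL;
	return 0;
}
```

With distinct return codes, a userspace server can tell a wrong index apart from an attempt to remove a buffer it does not own through this path.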
7 months ago io_uring: convert cmd_to_io_kiocb() macro to function
Caleb Sander Mateos [Fri, 28 Feb 2025 23:03:04 +0000 (16:03 -0700)]
io_uring: convert cmd_to_io_kiocb() macro to function

The cmd_to_io_kiocb() macro applies a pointer cast to its input without
parenthesizing it. Currently all inputs are variable names, so this has
the intended effect. But since casts have relatively high precedence,
the macro would apply the cast to the wrong value if the input was a
pointer addition, for example.

Turn the macro into a static inline function to ensure the pointer cast
is applied to the full input value.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228230305.630885-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
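
The precedence hazard described in the commit above can be reproduced outside the kernel. The sketch below uses hypothetical stand-in types (`demo_cmd`, `demo_kiocb`) and names; it only shows how an unparenthesized cast in a macro binds to the first operand of a pointer addition, whereas a function evaluates the whole argument first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct io_uring_cmd and struct io_kiocb. */
struct demo_cmd {
	int opcode;
};

struct demo_kiocb {
	struct demo_cmd cmd;	/* first member, so the addresses coincide */
	int tag;
};

/* Macro form: the argument is not parenthesized, so the cast applies
 * only to the first operand of whatever expression is passed in. */
#define DEMO_CMD_TO_KIOCB(c) ((struct demo_kiocb *)c)

/* Function form: the argument is fully evaluated before the cast. */
static inline struct demo_kiocb *demo_cmd_to_kiocb(struct demo_cmd *c)
{
	return (struct demo_kiocb *)c;
}

/* Byte distance from base when "base + 1" goes through the macro, which
 * expands to ((struct demo_kiocb *)base + 1). */
ptrdiff_t demo_macro_step(struct demo_cmd *base)
{
	assert(base != NULL);
	return (char *)DEMO_CMD_TO_KIOCB(base + 1) - (char *)base;
}

/* Byte distance from base when "base + 1" goes through the function,
 * which computes (struct demo_kiocb *)(base + 1). */
ptrdiff_t demo_func_step(struct demo_cmd *base)
{
	assert(base != NULL);
	return (char *)demo_cmd_to_kiocb(base + 1) - (char *)base;
}
```

In this sketch the macro advances by the size of the outer structure and the function by the size of the inner one, so the two forms disagree as soon as the argument is anything other than a plain variable name.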
7 months ago io_uring/uring_cmd: specify io_uring_cmd_import_fixed() pointer type
Caleb Sander Mateos [Fri, 28 Feb 2025 22:15:13 +0000 (15:15 -0700)]
io_uring/uring_cmd: specify io_uring_cmd_import_fixed() pointer type

io_uring_cmd_import_fixed() takes a struct io_uring_cmd *, but the type
of the ioucmd parameter is void *. Make the pointer type explicit so the
compiler can type check it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228221514.604350-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: use rq_data_dir() to compute bvec dir
Caleb Sander Mateos [Fri, 28 Feb 2025 22:30:56 +0000 (15:30 -0700)]
io_uring/rsrc: use rq_data_dir() to compute bvec dir

The macro rq_data_dir() already computes a request's data direction.
Use it in place of the if-else to set imu->dir.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228223057.615284-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
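
As a rough illustration of replacing an if-else with a direction helper, here is a userspace sketch. Everything in it is an assumption made for the example: the `DEMO_*` constants, the "low bit of op means write" rule, and the idea that the stored direction is a bit mask derived by shifting; the commit message itself does not spell out how imu->dir is encoded.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encodings, for illustration only. */
enum { DEMO_READ = 0, DEMO_WRITE = 1 };
enum {
	DEMO_IMU_DEST   = 1 << DEMO_READ,	/* buffer is written into */
	DEMO_IMU_SOURCE = 1 << DEMO_WRITE,	/* buffer is read from */
};

/* Hypothetical request: the low bit of op marks a write. */
struct demo_request {
	unsigned int op;
};

/* Stand-in for a helper like rq_data_dir(): yields DEMO_READ/DEMO_WRITE. */
static inline int demo_data_dir(const struct demo_request *rq)
{
	return (rq->op & 1) ? DEMO_WRITE : DEMO_READ;
}

/* Open-coded form: branch on the request type. */
int demo_dir_if_else(const struct demo_request *rq)
{
	assert(rq != NULL);
	if (rq->op & 1)
		return DEMO_IMU_SOURCE;
	return DEMO_IMU_DEST;
}

/* Helper-based form: derive the mask from the data direction. */
int demo_dir_shift(const struct demo_request *rq)
{
	assert(rq != NULL);
	return 1 << demo_data_dir(rq);
}
```

Both functions return the same mask for reads and writes, which is the property that lets the branch be dropped.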
7 months ago selftests: ublk: add ublk zero copy test
Ming Lei [Fri, 28 Feb 2025 16:19:16 +0000 (00:19 +0800)]
selftests: ublk: add ublk zero copy test

Enable zero copy on the file backed target, and add one fio test
covering write verify plus another test for mkfs/mount/umount.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228161919.2869102-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: add file backed ublk
Ming Lei [Fri, 28 Feb 2025 16:19:15 +0000 (00:19 +0800)]
selftests: ublk: add file backed ublk

Add the file backed ublk target code, and add one fio test covering
write verify plus another test for mkfs/mount/umount.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228161919.2869102-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago selftests: ublk: add kernel selftests for ublk
Ming Lei [Fri, 28 Feb 2025 16:19:14 +0000 (00:19 +0800)]
selftests: ublk: add kernel selftests for ublk

Both the ublk driver and its userspace heavily depend on the io_uring
subsystem, and tools/testing/selftests/ should be the best place for
holding these cross-subsystem tests.

Add a basic read/write IO test over the ublk null disk to make sure ublk
is working.

More tests will be added.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228161919.2869102-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring: cache nodes and mapped buffers
Keith Busch [Thu, 27 Feb 2025 22:39:16 +0000 (14:39 -0800)]
io_uring: cache nodes and mapped buffers

Frequent alloc/free cycles on these are pretty costly. Use an io cache
to reuse these buffers more efficiently.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-7-kbusch@meta.com
[axboe: fix imu leak]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago ublk: zc register/unregister bvec
Keith Busch [Thu, 27 Feb 2025 22:39:15 +0000 (14:39 -0800)]
ublk: zc register/unregister bvec

Provide new operations for the user to request mapping an active request
to an io_uring instance's buf_table. The user has to provide the index
at which it wants to install the buffer.

A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-6-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
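
The reference counting rule in the commit above (a request cannot complete while it is installed in a ring's buf_table) can be sketched in userspace as follows. The `demo_req` type and the `demo_*` functions are hypothetical and single-threaded; they model only the lifetime rule, not ublk's actual structures or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: completes only when the last reference drops. */
struct demo_req {
	unsigned int refs;
	bool done;
};

/* Drop one reference; the request completes on the final put. */
static void demo_req_put(struct demo_req *rq)
{
	assert(rq != NULL && rq->refs > 0);
	if (--rq->refs == 0)
		rq->done = true;
}

/* Installing the request's buffer in a table takes a reference. */
void demo_register_buf(struct demo_req *rq)
{
	assert(rq != NULL);
	rq->refs++;
}

/* Removing the buffer from the table drops that reference. */
void demo_unregister_buf_ref(struct demo_req *rq)
{
	demo_req_put(rq);
}

/* The server finishing the IO drops the initial reference. */
void demo_complete(struct demo_req *rq)
{
	demo_req_put(rq);
}
```

If the server completes the IO while the buffer is still registered, the request stays alive until the matching unregister, so users of the registered buffer never see freed pages.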
7 months ago io_uring: add support for kernel registered bvecs
Keith Busch [Thu, 27 Feb 2025 22:39:14 +0000 (14:39 -0800)]
io_uring: add support for kernel registered bvecs

Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.

User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago nvme: map uring_cmd data even if address is 0
Xinyu Zhang [Thu, 27 Feb 2025 22:39:13 +0000 (14:39 -0800)]
nvme: map uring_cmd data even if address is 0

When using kernel registered bvec fixed buffers, the "address" is
actually the offset into the bvec rather than a userspace address.
Therefore it can be 0.

We can skip checking whether the address is NULL before mapping
uring_cmd data. A bad userspace address will be handled properly later
when the user buffer is imported.

With this patch, we will be able to use the kernel registered bvec fixed
buffers in io_uring NVMe passthru with ublk zero-copy support.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
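
To make the "address is really an offset" point in the commit above concrete, here is a hedged userspace sketch. `demo_kbuf` and `demo_import_range()` are hypothetical; the sketch shows why a zero value must be accepted once the field is interpreted as an offset, and that validity is decided by a bounds check at import time rather than by a NULL test. The choice of -EFAULT is an assumption of the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical registered buffer: only its total length matters here. */
struct demo_kbuf {
	size_t len;
};

/* Validate an (offset, len) range against a registered buffer.
 * An offset of 0 is a perfectly valid start of the buffer, so there is
 * deliberately no "offset == 0" rejection. */
int demo_import_range(const struct demo_kbuf *buf, uint64_t offset, size_t len)
{
	assert(buf != NULL);
	if (offset > buf->len || len > buf->len - offset)
		return -EFAULT;
	return 0;
}
```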
7 months ago io_uring/rw: move fixed buffer import to issue path
Keith Busch [Thu, 27 Feb 2025 22:39:12 +0000 (14:39 -0800)]
io_uring/rw: move fixed buffer import to issue path

Registered buffers may depend on a linked command, which makes the prep
path too early to import. Move to the issue path when the node is
actually needed like all the other users of fixed buffers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-3-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rw: move buffer_select outside generic prep
Keith Busch [Thu, 27 Feb 2025 22:39:11 +0000 (14:39 -0800)]
io_uring/rw: move buffer_select outside generic prep

Cleans up the generic rw prep to not require the do_import flag. Use a
different prep function for callers that might need buffer select.

Based-on-a-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-2-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: fix build warning for !CONFIG_COMPAT
Arnd Bergmann [Thu, 27 Feb 2025 13:20:09 +0000 (14:20 +0100)]
io_uring/net: fix build warning for !CONFIG_COMPAT

A code rework resulted in an uninitialized return code when COMPAT
mode is disabled:

io_uring/net.c:722:6: error: variable 'ret' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
  722 |         if (io_is_compat(req->ctx)) {
      |             ^~~~~~~~~~~~~~~~~~~~~~
io_uring/net.c:736:15: note: uninitialized use occurs here
  736 |         if (unlikely(ret))
      |                      ^~~

Since io_is_compat() turns into a compile-time 'false', the #ifdef
here is completely unnecessary, and removing it avoids the warning.

Fixes: 51e158d40589 ("io_uring/net: unify *mshot_prep calls with compat")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20250227132018.1111094-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
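
The pattern behind the fix above, relying on a compile-time constant instead of an #ifdef so that both branches stay visible to the compiler, can be sketched like this. `DEMO_COMPAT`, `demo_ctx`, `demo_is_compat()` and the prep helpers are hypothetical names for illustration; the assumption is that the real helper folds to a constant false when compat support is disabled, as the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef DEMO_COMPAT
#define DEMO_COMPAT 0	/* stands in for a disabled config option */
#endif

struct demo_ctx {
	bool compat;
};

/* Folds to a constant false when DEMO_COMPAT is 0, so the compiler can
 * discard the compat branch without any preprocessor conditionals. */
static inline bool demo_is_compat(const struct demo_ctx *ctx)
{
	return DEMO_COMPAT && ctx->compat;
}

static int demo_prep_compat(void) { return 32; }
static int demo_prep_native(void) { return 64; }

/* Every path assigns ret, whichever way the option is set, so there is
 * no configuration in which ret is read uninitialized. */
int demo_prep(const struct demo_ctx *ctx)
{
	int ret;

	assert(ctx != NULL);
	if (demo_is_compat(ctx))
		ret = demo_prep_compat();
	else
		ret = demo_prep_native();
	return ret;
}
```

Had the else branch been wrapped in an #ifdef that compiles it out, the compiler would see a path on which ret is never assigned, which is the shape of the warning quoted in the commit.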
7 months ago io_uring: rearrange opdef flags by use pattern
Pavel Begunkov [Wed, 26 Feb 2025 20:46:34 +0000 (20:46 +0000)]
io_uring: rearrange opdef flags by use pattern

Keep all flags that we use in the generic req init path close together.
That saves a load for x86 because apparently some compilers prefer
reading single bytes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef03b6ce4a0c2a5234cd4037fa07e9e4902dcc9e.1740602793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: extract iovec import into a helper
Pavel Begunkov [Wed, 26 Feb 2025 11:41:21 +0000 (11:41 +0000)]
io_uring/net: extract iovec import into a helper

Deduplicate iovec imports between compat and !compat by introducing a
helper function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a5f8c526f6732c4249a7fa0213b49e1a3ecccf0.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: unify *mshot_prep calls with compat
Pavel Begunkov [Wed, 26 Feb 2025 11:41:20 +0000 (11:41 +0000)]
io_uring/net: unify *mshot_prep calls with compat

Instead of duplicating an io_recvmsg_mshot_prep() call in the compat
path, let the common code handle it. For that, copy necessary compat
fields into struct user_msghdr. Note, it zeroes user_msghdr to be on the
safe side as compat is not that interesting and overhead shouldn't be
high.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94e62386dec570f83b4a4270a46ac60bc415fb71.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: derive iovec storage later
Pavel Begunkov [Wed, 26 Feb 2025 11:41:19 +0000 (11:41 +0000)]
io_uring/net: derive iovec storage later

Don't read free_iov until right before we need it to import the iovec.
The only place that uses it before that is provided buffer selection,
but it only serves as temporary storage and iovec content is not reused
afterwards, so use a local variable for that.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bfa7d74c33e37860a724f4e0e96660c25cd4c02.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: verify msghdr before copying iovec
Pavel Begunkov [Wed, 26 Feb 2025 11:41:18 +0000 (11:41 +0000)]
io_uring/net: verify msghdr before copying iovec

Normally, net/ would verify msghdr before importing iovec, for example
see copy_msghdr_from_user(), which is further assumed by __copy_msghdr()
validating msg->msg_iovlen.

io_uring does it in reverse order, which is fine, but it'll be more
convenient to flip it so that the iovec business is done at the end and
can eventually be nicely pulled out of the msghdr parsing section and
thought of as a separate step. That also makes structure accesses more
localised, which should be better for caches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cd35dc1b48d4e6e31f59ae7304c037fbe8a3fd3d.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: isolate msghdr copying code
Pavel Begunkov [Wed, 26 Feb 2025 11:41:17 +0000 (11:41 +0000)]
io_uring/net: isolate msghdr copying code

The user access section in io_msg_copy_hdr() is overextended by covering
selected buffers. It's hard to work with and prone to errors. Limit the
section to msghdr import only, let selected buffers do a separate
copy_from_user() call, and then move the section into its own function.
This should be fine: selected buffer single shots are not important, for
multishots the overhead should be non-existent, and it's not that
expensive overall.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d3eb1f81c8cfbea9f1aa57dab90c472d2aa6e371.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: simplify compat selbuf iov parsing
Pavel Begunkov [Wed, 26 Feb 2025 11:41:16 +0000 (11:41 +0000)]
io_uring/net: simplify compat selbuf iov parsing

Use copy_from_user() instead of open coded access_ok() + get_user();
that's simpler and we don't care about compat that much.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e51f9c323a3cd4ad7c8da656559bdf6237f052fb.1740569495.git.asml.silence@gmail.com
[axboe: fold in bogus < 0 check for tmp_iov.iov_len]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: remove unnecessary REQ_F_NEED_CLEANUP
Pavel Begunkov [Wed, 26 Feb 2025 11:41:15 +0000 (11:41 +0000)]
io_uring/net: remove unnecessary REQ_F_NEED_CLEANUP

REQ_F_NEED_CLEANUP in io_recvmsg_prep_setup() and in io_sendmsg_setup()
are relics of the past and don't do anything useful; the flag should be,
and is, set earlier on iovec and async_data allocation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aedc3141c1fc027128a4503656cfd686a6980ef.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago Merge branch 'io_uring-6.14' into for-6.15/io_uring
Jens Axboe [Thu, 27 Feb 2025 14:18:01 +0000 (07:18 -0700)]
Merge branch 'io_uring-6.14' into for-6.15/io_uring

Merge mainline fixes into 6.15 branch, as upcoming patches depend on
fixes that went into the 6.14 mainline branch.

* io_uring-6.14:
  io_uring/net: save msg_control for compat
  io_uring/rw: clean up mshot forced sync mode
  io_uring/rw: move ki_complete init into prep
  io_uring/rw: don't directly use ki_complete
  io_uring/rw: forbid multishot async reads
  io_uring/rsrc: remove unused constants
  io_uring: fix spelling error in uapi io_uring.h
  io_uring: prevent opcode speculation
  io-wq: backoff when retrying worker creation

7 months ago io_uring: combine buffer lookup and import
Pavel Begunkov [Mon, 24 Feb 2025 21:31:10 +0000 (13:31 -0800)]
io_uring: combine buffer lookup and import

Registered buffers are currently imported in two steps: first we look up
a rsrc node and then use it to set up the iterator. The first part is
usually done at the prep stage, and import happens whenever it's needed.
As we want to defer binding to a node so that it works with linked
requests, combine both steps into a single helper.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()
Pavel Begunkov [Mon, 24 Feb 2025 21:31:09 +0000 (13:31 -0800)]
io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()

io_uring_cmd_import_fixed() will need to know the io_uring execution
state in following commits; for now just pass issue_flags into it
without actually using it.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/net: reuse req->buf_index for sendzc
Pavel Begunkov [Mon, 24 Feb 2025 21:31:08 +0000 (13:31 -0800)]
io_uring/net: reuse req->buf_index for sendzc

There is already a field in io_kiocb that can store a registered buffer
index. Use that instead of stashing the value into struct io_sr_msg.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/nop: reuse req->buf_index
Keith Busch [Mon, 24 Feb 2025 21:31:07 +0000 (13:31 -0800)]
io_uring/nop: reuse req->buf_index

There is already a field in io_kiocb that can store a registered buffer
index. Use that instead of stashing the value into struct io_nop.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rsrc: remove redundant check for valid imu
Keith Busch [Mon, 24 Feb 2025 21:31:06 +0000 (13:31 -0800)]
io_uring/rsrc: remove redundant check for valid imu

The only caller of io_buffer_unmap() already checks that the node's buf
is not NULL, so there is no need to check again.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months ago io_uring/rw: open code io_prep_rw_setup()
Pavel Begunkov [Mon, 24 Feb 2025 19:45:06 +0000 (19:45 +0000)]
io_uring/rw: open code io_prep_rw_setup()

Open code io_prep_rw_setup() into its only caller, as it doesn't provide
any meaningful abstraction anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/61ba72e2d46119db71f27ab908018e6a6cd6c064.1740425922.git.asml.silence@gmail.com
[axboe: fold in 'ret' being unused fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/net: save msg_control for compat
Pavel Begunkov [Tue, 25 Feb 2025 15:59:02 +0000 (15:59 +0000)]
io_uring/net: save msg_control for compat

Match the compat part of io_sendmsg_copy_hdr() with its counterpart and
save msg_control.

Fixes: c55978024d123 ("io_uring/net: move receive multishot out of the generic msghdr path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8418821fe83d3b64350ad2b3c0303e9b732bbd.1740498502.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/rw: extract helper for iovec import
Pavel Begunkov [Mon, 24 Feb 2025 19:45:05 +0000 (19:45 +0000)]
io_uring/rw: extract helper for iovec import

Split a helper out of __io_import_rw_buffer() that handles vectored
buffers. I'll need it for registered vectored buffers, but it also looks
cleaner, especially with parameters being properly named.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/075470cfb24be38709d946815f35ec846d966f41.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/rw: rename io_import_iovec()
Pavel Begunkov [Mon, 24 Feb 2025 19:45:04 +0000 (19:45 +0000)]
io_uring/rw: rename io_import_iovec()

io_import_iovec() is not limited to iovecs but also imports buffers for
normal reads and selected buffers, rename it for clarity.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91cea59340b61a8f52dc7b8e720274577a25188c.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/rw: allocate async data in io_prep_rw()
Pavel Begunkov [Mon, 24 Feb 2025 19:45:03 +0000 (19:45 +0000)]
io_uring/rw: allocate async data in io_prep_rw()

rw always allocates async_data, so instead of doing that deeper in prep
calls inside of io_prep_rw_setup(), be a bit more explicit and do that
early on in io_prep_rw().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5ead621051bc3374d1e8d96f816454906a6afd71.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring: make io_poll_issue() sturdier
Pavel Begunkov [Sun, 23 Feb 2025 17:22:31 +0000 (17:22 +0000)]
io_uring: make io_poll_issue() sturdier

io_poll_issue() forwards the call to io_issue_sqe() and thus inherits
some of the handling. That's not particularly failure resistant, as for
example returning an innocent-looking IOU_OK from a multishot issue
will lead to severe bugs.

Reimplement io_poll_issue() without io_issue_sqe()'s request completion
logic. Remove extra checks as we know that req->file is already set,
linked timeouts are armed, and iopoll is not supported. Also cover it
with warnings for now.

The patch should be useful by itself, but it's also preparing the
codebase for other future clean ups.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3096d7b1026d9a52426a598bdfc8d9d324555545.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/net: canonise accept mshot handling
Pavel Begunkov [Sun, 23 Feb 2025 17:22:30 +0000 (17:22 +0000)]
io_uring/net: canonise accept mshot handling

Use a more recognisable pattern for mshot accept: first try to post an
mshot cqe if needed, and then do the terminating handling.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/daf5c0df7e2966deb0a115021c065fc6161a52d7.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/net: fix accept multishot handling
Pavel Begunkov [Sun, 23 Feb 2025 17:22:29 +0000 (17:22 +0000)]
io_uring/net: fix accept multishot handling

REQ_F_APOLL_MULTISHOT doesn't guarantee it's executed from the multishot
context, so a multishot accept may get executed inline, fail
io_req_post_cqe(), and ask the core code to kill the request with
-ECANCELED by returning IOU_STOP_MULTISHOT even when a socket has been
accepted and installed.

Cc: stable@vger.kernel.org
Fixes: 390ed29b5e425 ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51c6deb01feaa78b08565ca8f24843c017f5bc80.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/net: use io_is_compat()
Pavel Begunkov [Mon, 24 Feb 2025 12:42:24 +0000 (12:42 +0000)]
io_uring/net: use io_is_compat()

Use io_is_compat() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/fff93d9d08243284c5db5d546be766a82e85c130.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 months agoio_uring/waitid: use io_is_compat()
Pavel Begunkov [Mon, 24 Feb 2025 12:42:23 +0000 (12:42 +0000)]
io_uring/waitid: use io_is_compat()

Use io_is_compat() for consistency.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/28c5b5f1f1bf7f4d18869dafe6e4147ce1bbf0f5.1740400452.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/20250224172337.2009871-1-csander@purestorage.com
[axboe: fold in improvement from Caleb, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rw: shrink io_iov_compat_buffer_select_prep
Pavel Begunkov [Mon, 24 Feb 2025 12:42:22 +0000 (12:42 +0000)]
io_uring/rw: shrink io_iov_compat_buffer_select_prep

Compat performance is not important and simplicity is more appreciated.
Let's not be smart about it and use simpler copy_from_user() instead of
access + __get_user pair.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b334a3a5040efa424ded58e4d8a6ef2554324266.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rw: compile out compat param passing
Pavel Begunkov [Mon, 24 Feb 2025 12:42:21 +0000 (12:42 +0000)]
io_uring/rw: compile out compat param passing

Even when COMPAT is compiled out, we still have to pass
ctx->compat to __import_iovec(). Replace the read through an
indirection with a constant when the kernel doesn't support compat.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/2819df9c8533c36b46d7baccbb317a0ec89da6cd.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/cmd: optimise !CONFIG_COMPAT flags setting
Pavel Begunkov [Mon, 24 Feb 2025 12:42:20 +0000 (12:42 +0000)]
io_uring/cmd: optimise !CONFIG_COMPAT flags setting

Use io_is_compat() to avoid extra overhead in io_uring_cmd() for flag
setting when compat is compiled out.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/f4d74c62d7cbddc386c0a9138ecd2b2ed6d3f146.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: introduce io_is_compat()
Pavel Begunkov [Mon, 24 Feb 2025 12:42:19 +0000 (12:42 +0000)]
io_uring: introduce io_is_compat()

A preparation patch adding a simple helper for gauging the compat state.
It'll help us to optimise and compile out more code in the following
commits.
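
A minimal userspace sketch of what such a helper could look like, not the kernel code itself. DEMO_CONFIG_COMPAT, struct demo_ring_ctx and demo_is_compat() are hypothetical stand-ins chosen for illustration. The point is that a build-time constant on the left of && lets the compiler fold the check to false and drop the load of the field.

```c
#include <stdbool.h>

/* Hypothetical build-time switch standing in for CONFIG_COMPAT. */
#ifndef DEMO_CONFIG_COMPAT
#define DEMO_CONFIG_COMPAT 0
#endif

/* Stand-in for the ring context; only the field the helper reads. */
struct demo_ring_ctx {
	bool compat;
};

/*
 * With compat compiled out the left operand is the constant 0, so the
 * whole expression folds to false and ctx->compat is never read.
 */
static inline bool demo_is_compat(const struct demo_ring_ctx *ctx)
{
	return DEMO_CONFIG_COMPAT && ctx->compat;
}
```

Callers can then write `if (demo_is_compat(ctx))` everywhere and the branch disappears in builds without compat support.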

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/1a87a640265196a67bc38300128e0bfd7839ab1f.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rw: clean up mshot forced sync mode
Pavel Begunkov [Wed, 19 Feb 2025 01:33:40 +0000 (01:33 +0000)]
io_uring/rw: clean up mshot forced sync mode

Move code forcing synchronous execution of multishot read requests out
of the more generic __io_read().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ad7b928c776d1ad59addb9fff64ef2d1fc474d5.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rw: move ki_complete init into prep
Pavel Begunkov [Wed, 19 Feb 2025 01:33:39 +0000 (01:33 +0000)]
io_uring/rw: move ki_complete init into prep

Initialise ki_complete during request prep stage, we'll depend on it not
being reset during issue in the following patch.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/817624086bd5f0448b08c80623399919fda82f34.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rw: don't directly use ki_complete
Pavel Begunkov [Wed, 19 Feb 2025 01:33:38 +0000 (01:33 +0000)]
io_uring/rw: don't directly use ki_complete

We want to avoid checking ->ki_complete directly in the io_uring
completion path. Fortunately we have only two callbacks, the selection
of which depends on the ring constant flags, i.e. IOPOLL, so use that
to infer the function.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4eb4bdab8cbcf5bc87083f7047edc81e920ab83c.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rw: forbid multishot async reads
Pavel Begunkov [Wed, 19 Feb 2025 01:33:37 +0000 (01:33 +0000)]
io_uring/rw: forbid multishot async reads

At the moment we can't sanely handle queuing an async request from a
multishot context, so disable them. It shouldn't matter as pollable
files / sockets don't normally do async.

Patching it in __io_read() is not the cleanest way, but it's simpler
than other options, so let's fix it there and clean up on top.

Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d51732c125159d17db4fe16f51ec41b936973f8.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rsrc: remove unused constants
Caleb Sander Mateos [Wed, 19 Feb 2025 03:34:43 +0000 (20:34 -0700)]
io_uring/rsrc: remove unused constants

IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86
("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the
rsrc_node_cache.

IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused
since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring
io_rsrc_node list") removed the separate tag table for registered nodes.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: fix spelling error in uapi io_uring.h
Jens Axboe [Tue, 18 Feb 2025 23:47:40 +0000 (16:47 -0700)]
io_uring: fix spelling error in uapi io_uring.h

This is obviously not that important, but when changes are synced back
from the kernel to liburing, the codespell CI ends up erroring because
of this misspelling. Let's just correct it and avoid this biting us
again on an import.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: use lockless_cq flag in io_req_complete_post()
Caleb Sander Mateos [Wed, 12 Feb 2025 00:51:18 +0000 (17:51 -0700)]
io_uring: use lockless_cq flag in io_req_complete_post()

io_uring_create() computes ctx->lockless_cq as:
ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)

So use it to simplify that expression in io_req_complete_post().
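
As a rough illustration only: the flag is derived once at ring creation, and later code can test the cached flag instead of repeating the expression. All demo_ names and DEMO_SETUP_IOPOLL below are assumptions made for this sketch; only the expression itself comes from the message above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for IORING_SETUP_IOPOLL. */
#define DEMO_SETUP_IOPOLL (1U << 0)

struct demo_ctx {
	unsigned int flags;
	bool task_complete;
	bool lockless_cq;
};

/* Compute the derived flag once, when the ring is created. */
static void demo_ctx_init(struct demo_ctx *ctx, unsigned int flags,
			  bool task_complete)
{
	ctx->flags = flags;
	ctx->task_complete = task_complete;
	ctx->lockless_cq = task_complete || (flags & DEMO_SETUP_IOPOLL);
}

/* Later checks read the cached flag rather than the two-part expression. */
static bool demo_cq_is_lockless(const struct demo_ctx *ctx)
{
	return ctx->lockless_cq;
}
```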

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250212005119.3433005-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: pass struct io_tw_state by value
Caleb Sander Mateos [Mon, 17 Feb 2025 02:25:05 +0000 (19:25 -0700)]
io_uring: pass struct io_tw_state by value

8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.

So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.
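
A small sketch of the idea, assuming GNU C (empty structs and `= {}` are extensions accepted by GCC and Clang, not ISO C before C23). The demo_ names are hypothetical and only mirror the shape of the token described above.

```c
#include <stddef.h>

/* Empty struct: a GNU C extension; its size is 0 in C mode. */
struct demo_tw_state {
};

/* One alias, so the representation can be changed in a single place. */
typedef struct demo_tw_state demo_tw_token_t;

/*
 * The token carries no data. It exists only so that callers must be in
 * a context that was handed one, i.e. task work in this sketch.
 */
static void demo_tw_callback(int *counter, demo_tw_token_t tw)
{
	(void)tw;
	(*counter)++;
}

/* Creating and passing the token needs no stores or loads. */
static void demo_run_task_work(int *counter)
{
	demo_tw_token_t tw = {};

	demo_tw_callback(counter, tw);
}
```

Because the argument occupies no storage, a by-value parameter of this type is expected to cost nothing at the call site, whereas a pointer parameter still has to be materialised in a register.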

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: introduce type alias for io_tw_state
Caleb Sander Mateos [Mon, 17 Feb 2025 02:25:04 +0000 (19:25 -0700)]
io_uring: introduce type alias for io_tw_state

In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.

Also add a comment to struct io_tw_state to explain its purpose.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/rsrc: avoid NULL check in io_put_rsrc_node()
Caleb Sander Mateos [Sun, 16 Feb 2025 22:58:59 +0000 (15:58 -0700)]
io_uring/rsrc: avoid NULL check in io_put_rsrc_node()

Most callers of io_put_rsrc_node() already check that node is non-NULL:
- io_rsrc_data_free()
- io_sqe_buffer_register()
- io_reset_rsrc_node()
- io_req_put_rsrc_nodes() (REQ_F_BUF_NODE indicates non-NULL buf_node)

Only io_splice_cleanup() can call io_put_rsrc_node() with a NULL node.
So move the NULL check there.
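
The refactoring pattern can be sketched as follows. This is not the kernel code; struct demo_node and both functions are hypothetical and only show the check moving from the callee into the single caller that can hold a NULL pointer.

```c
#include <stddef.h>

struct demo_node {
	int refs;
};

/* Callee: assumes a valid node, so the common path has no branch. */
static void demo_put_node(struct demo_node *node)
{
	node->refs--;
}

/* The one caller that may legitimately hold NULL does the check itself. */
static void demo_splice_cleanup(struct demo_node *node)
{
	if (node)
		demo_put_node(node);
}
```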

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250216225900.1075446-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: pass ctx instead of req to io_init_req_drain()
Caleb Sander Mateos [Wed, 12 Feb 2025 16:48:05 +0000 (09:48 -0700)]
io_uring: pass ctx instead of req to io_init_req_drain()

io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.

Drop "req" from the function name since it operates on the ctx rather
than a specific req.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: use IO_REQ_LINK_FLAGS more
Caleb Sander Mateos [Tue, 11 Feb 2025 20:19:56 +0000 (13:19 -0700)]
io_uring: use IO_REQ_LINK_FLAGS more

Replace the 2 instances of REQ_F_LINK | REQ_F_HARDLINK with
the more commonly used IO_REQ_LINK_FLAGS.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250211202002.3316324-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/net: improve recv bundles
Jens Axboe [Sat, 8 Feb 2025 17:50:34 +0000 (10:50 -0700)]
io_uring/net: improve recv bundles

Current recv bundles are only supported for multishot receives, and
additionally they also always post at least 2 CQEs if more data is
available than what a buffer will hold. This happens because the initial
bundle recv will do a single buffer, and then do the rest of what is in
the socket as a followup receive. As shown in a test program, if 1k
buffers are available and 32k is available to receive in the socket,
you'd get the following completions:

bundle=1, mshot=0
cqe res 1024
cqe res 1024
[...]
cqe res 1024

bundle=1, mshot=1
cqe res 1024
cqe res 31744

where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 &&
mshot=1 will post a 1k completion and then a 31k completion.

To support bundle recv without multishot, it's possible to simply retry
the recv immediately and post a single completion, rather than split it
into two completions. With the below patch, the same test looks as
follows:

bundle=1, mshot=0
cqe res 32768

bundle=1, mshot=1
cqe res 32768

where mshot=0 works fine for bundles, and both of them post just a
single 32k completion rather than split it into separate completions.
Posting fewer completions is always a nice win, and not needing
multishot for proper bundle efficiency is nice for cases that can't
necessarily use multishot.
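
The completion counts quoted above can be reproduced with a toy model. It is only arithmetic on the numbers in this message (1k buffers, 32k pending); it does not model buffer selection or the socket, and demo_bundle_cqes() with its parameters is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>

static size_t demo_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Fill res[] with the CQE result sizes for draining `avail` bytes using
 * buffers of `buf_len` bytes, and return how many CQEs were posted.
 * retry_inline models the patched behaviour (retry and post once).
 */
static size_t demo_bundle_cqes(size_t buf_len, size_t avail, bool mshot,
			       bool retry_inline, size_t *res, size_t max)
{
	size_t n = 0;

	if (!buf_len || !avail || !max)
		return 0;

	if (retry_inline) {
		/* Patched: everything ends up in a single completion. */
		res[n++] = avail;
		return n;
	}

	if (mshot) {
		/* Old multishot: one buffer first, then the rest. */
		size_t first = demo_min(buf_len, avail);

		res[n++] = first;
		if (avail > first && n < max)
			res[n++] = avail - first;
		return n;
	}

	/* Old single shot: one buffer per completion. */
	while (avail && n < max) {
		size_t chunk = demo_min(buf_len, avail);

		res[n++] = chunk;
		avail -= chunk;
	}
	return n;
}
```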

Reported-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/184f9f92-a682-4205-a15d-89e18f664502@kernel.dk
Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/waitid: use generic io_cancel_remove() helper
Jens Axboe [Wed, 5 Feb 2025 20:16:29 +0000 (13:16 -0700)]
io_uring/waitid: use generic io_cancel_remove() helper

Don't implement our own loop rolling and checking, just use the generic
helper to find and cancel requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/futex: use generic io_cancel_remove() helper
Jens Axboe [Wed, 5 Feb 2025 20:15:57 +0000 (13:15 -0700)]
io_uring/futex: use generic io_cancel_remove() helper

Don't implement our own loop rolling and checking, just use the generic
helper to find and cancel requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/cancel: add generic cancel helper
Jens Axboe [Wed, 5 Feb 2025 20:13:58 +0000 (13:13 -0700)]
io_uring/cancel: add generic cancel helper

Any opcode that is cancelable ends up defining its own cancel helper
for finding and canceling a specific request. Add a generic helper that
can be used for this purpose.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/waitid: convert to io_cancel_remove_all()
Jens Axboe [Wed, 5 Feb 2025 19:52:46 +0000 (12:52 -0700)]
io_uring/waitid: convert to io_cancel_remove_all()

Use the generic helper for cancelations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/futex: convert to io_cancel_remove_all()
Jens Axboe [Wed, 5 Feb 2025 19:51:26 +0000 (12:51 -0700)]
io_uring/futex: convert to io_cancel_remove_all()

Use the generic helper for cancelations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/cancel: add generic remove_all helper
Jens Axboe [Wed, 5 Feb 2025 19:48:56 +0000 (12:48 -0700)]
io_uring/cancel: add generic remove_all helper

Any opcode that is cancelable ends up defining its own remove all
helper, which iterates the pending list and cancels matches. Add a
generic helper for it, which can be used by them.
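
One way such a helper might be shaped, as a userspace sketch over a singly linked list. The real helper's signature and list type are not shown in this message, so everything named demo_ here is an assumption: the generic part walks the pending list and unlinks matches, and the opcode supplies only a cancel callback.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_req {
	struct demo_req *next;
	int owner;
	bool cancelled;
};

/* Opcode-specific part: returns true if the request was cancelled. */
typedef bool (*demo_cancel_fn)(struct demo_req *req);

/*
 * Generic part: iterate the pending list, cancel every request that
 * matches `owner` (or all of them if cancel_all), and unlink it.
 * Returns true if at least one request was cancelled.
 */
static bool demo_cancel_remove_all(struct demo_req **list, int owner,
				   bool cancel_all, demo_cancel_fn cancel)
{
	struct demo_req **pp = list;
	bool found = false;

	while (*pp) {
		struct demo_req *req = *pp;

		if ((cancel_all || req->owner == owner) && cancel(req)) {
			*pp = req->next;	/* unlink the cancelled entry */
			req->next = NULL;
			found = true;
		} else {
			pp = &req->next;
		}
	}
	return found;
}

/* Example callback an opcode could pass in. */
static bool demo_mark_cancelled(struct demo_req *req)
{
	req->cancelled = true;
	return true;
}
```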

Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: uninline __io_put_kbufs
Pavel Begunkov [Wed, 5 Feb 2025 11:36:49 +0000 (11:36 +0000)]
io_uring/kbuf: uninline __io_put_kbufs

__io_put_kbufs() and other helper functions are too large to be inlined,
compilers would normally refuse to do so. Uninline it and move together
with io_kbuf_commit into kbuf.c.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dade7f55ad590e811aff83b1ec55c9c04e17b2b.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: introduce io_kbuf_drop_legacy()
Pavel Begunkov [Wed, 5 Feb 2025 11:36:48 +0000 (11:36 +0000)]
io_uring/kbuf: introduce io_kbuf_drop_legacy()

io_kbuf_drop() is only used for legacy provided buffers, and so
__io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the
dead branch out of __io_put_kbuf_list(), rename it into
io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: open code __io_put_kbuf()
Pavel Begunkov [Wed, 5 Feb 2025 11:36:47 +0000 (11:36 +0000)]
io_uring/kbuf: open code __io_put_kbuf()

__io_put_kbuf() is a trivial wrapper, open code it into
__io_put_kbufs().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dc17380272b48d56c95992c6f9eaacd5546e1d3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: remove legacy kbuf caching
Pavel Begunkov [Wed, 5 Feb 2025 11:36:46 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf caching

Remove all struct io_buffer caches. It makes it a fair bit simpler.
Apart from killing a bunch of lines and juggling between lists,
__io_put_kbuf_list() doesn't need ->completion_lock locking now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: simplify __io_put_kbuf
Pavel Begunkov [Wed, 5 Feb 2025 11:36:45 +0000 (11:36 +0000)]
io_uring/kbuf: simplify __io_put_kbuf

As a preparation step remove an optimisation from __io_put_kbuf() trying
to use the locked cache. With that __io_put_kbuf_list() is only used
with ->io_buffers_comp, and we remove the explicit list argument.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b7f1394ec4afc7f96b35a61f5992e27c49fd067.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: move locking into io_kbuf_drop()
Pavel Begunkov [Wed, 5 Feb 2025 11:36:44 +0000 (11:36 +0000)]
io_uring/kbuf: move locking into io_kbuf_drop()

Move the burden of locking out of the caller into io_kbuf_drop(), that
will help with further refactoring.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/530f0cf1f06963029399f819a9a58b1a34bebef3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: remove legacy kbuf kmem cache
Pavel Begunkov [Wed, 5 Feb 2025 11:36:43 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf kmem cache

Remove the kmem cache used by legacy provided buffers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/kbuf: remove legacy kbuf bulk allocation
Pavel Begunkov [Wed, 5 Feb 2025 11:36:42 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf bulk allocation

Legacy provided buffers are slow and discouraged in favour of the ring
variant. Remove the bulk allocation to keep it simpler as we don't care
about performance.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a064d70370e590efed8076e9501ae4cfc20fe0ca.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: sanitise ring params earlier
Pavel Begunkov [Fri, 31 Jan 2025 17:31:03 +0000 (17:31 +0000)]
io_uring: sanitise ring params earlier

Do all struct io_uring_params validation early on before allocating the
context. That makes initialisation easier, especially by having fewer
places where we need to care about partial de-initialisation.
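
A sketch of the ordering this describes, with hypothetical names and limits (demo_params, DEMO_MAX_ENTRIES, DEMO_VALID_FLAGS are not the kernel's). Validation runs before any allocation, so the early error paths have nothing to undo.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical limits for this sketch only. */
#define DEMO_MAX_ENTRIES 32768U
#define DEMO_VALID_FLAGS 0x3U

struct demo_params {
	unsigned int sq_entries;
	unsigned int flags;
};

struct demo_ctx {
	unsigned int sq_entries;
};

/* Pure check: touches no state, allocates nothing. */
static int demo_validate(const struct demo_params *p)
{
	if (p->flags & ~DEMO_VALID_FLAGS)
		return -EINVAL;
	if (!p->sq_entries || p->sq_entries > DEMO_MAX_ENTRIES)
		return -EINVAL;
	return 0;
}

static int demo_ring_create(const struct demo_params *p,
			    struct demo_ctx **out)
{
	struct demo_ctx *ctx;
	int ret;

	/* Fail here and there is no partial state to tear down. */
	ret = demo_validate(p);
	if (ret)
		return ret;

	ctx = calloc(1, sizeof(*ctx));
	if (!ctx)
		return -ENOMEM;

	ctx->sq_entries = p->sq_entries;
	*out = ctx;
	return 0;
}
```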

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/363ba90b83ff78eefdc88b60e1b2c4a39d182247.1738344646.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: check for iowq alloc_workqueue failure
Pavel Begunkov [Fri, 31 Jan 2025 17:28:21 +0000 (17:28 +0000)]
io_uring: check for iowq alloc_workqueue failure

alloc_workqueue() can fail even during init in io_uring_init(), check
the result and panic if anything went wrong.

Fixes: 73eaa2b583493 ("io_uring: use private workqueue for exit work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.1738343083.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring: deduplicate caches deallocation
Pavel Begunkov [Fri, 31 Jan 2025 17:27:02 +0000 (17:27 +0000)]
io_uring: deduplicate caches deallocation

Add a function that frees all ring caches since we already have two
spots repeating the same thing and it's easy to miss it and change only
one of them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b6b0125677c58bdff99eda91ab320137406e8562.1738342562.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/io-wq: pass io_wq to io_get_next_work()
Max Kellermann [Tue, 28 Jan 2025 13:39:25 +0000 (14:39 +0100)]
io_uring/io-wq: pass io_wq to io_get_next_work()

The only caller has already determined this pointer, so let's skip
the redundant dereference.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-7-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/io-wq: do not use bogus hash value
Max Kellermann [Tue, 28 Jan 2025 13:39:24 +0000 (14:39 +0100)]
io_uring/io-wq: do not use bogus hash value

Previously, the `hash` variable was initialized with `-1` and only
updated by io_get_next_work() if the current work was hashed.  Commit
60cf46ae6054 ("io-wq: hash dependent work") changed this to always
call io_get_work_hash() even if the work was not hashed.  This caused
the `hash != -1U` check to always be true, adding some overhead for
the `hash->wait` code.

This patch fixes the regression by checking the `IO_WQ_WORK_HASHED`
flag.
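
The regression and the fix can be illustrated with a simplified model. The flag value, the shift and the function names are assumptions for the sketch; the only parts taken from the message are the `hash != -1U` comparison and the switch to testing the hashed flag.

```c
#include <stdbool.h>

/* Hypothetical layout: low bit marks hashed work, top bits hold the hash. */
#define DEMO_WORK_HASHED	(1U << 0)
#define DEMO_HASH_SHIFT		24

struct demo_work {
	unsigned int flags;
};

static unsigned int demo_get_work_hash(unsigned int work_flags)
{
	return work_flags >> DEMO_HASH_SHIFT;
}

/*
 * Regressed logic: the hash is always computed, so it can never equal
 * the -1U sentinel and the check is true even for unhashed work.
 */
static bool demo_uses_hash_wait_old(const struct demo_work *work)
{
	unsigned int hash = demo_get_work_hash(work->flags);

	return hash != -1U;
}

/* Fixed logic: ask the flag whether the work is hashed at all. */
static bool demo_uses_hash_wait_new(const struct demo_work *work)
{
	return (work->flags & DEMO_WORK_HASHED) != 0;
}
```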

Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`:

    38.55%     -1.57%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.86%     -0.72%  [kernel.kallsyms]  [k] io_worker_handle_work
     0.10%     +0.67%  [kernel.kallsyms]  [k] put_prev_entity
     1.96%     +0.59%  [kernel.kallsyms]  [k] io_nop_prep
     3.31%     -0.51%  [kernel.kallsyms]  [k] try_to_wake_up
     7.18%     -0.47%  [kernel.kallsyms]  [k] io_wq_free_work

Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-6-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/io-wq: cache work->flags in variable
Max Kellermann [Tue, 28 Jan 2025 13:39:23 +0000 (14:39 +0100)]
io_uring/io-wq: cache work->flags in variable

This eliminates several redundant atomic reads and therefore reduces
the duration the surrounding spinlocks are held.
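
The pattern, sketched with C11 atomics in userspace rather than the kernel's atomic helpers. The flag names and struct layout are hypothetical. The flags word is loaded once into a local and every later test uses that copy.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch. */
#define DEMO_WORK_HASHED	(1U << 0)
#define DEMO_WORK_CANCEL	(1U << 1)

struct demo_work {
	atomic_uint flags;
};

struct demo_work_info {
	bool hashed;
	bool cancel;
};

static struct demo_work_info demo_classify(struct demo_work *work)
{
	/* One atomic load; the tests below reuse the local copy. */
	const unsigned int work_flags = atomic_load(&work->flags);
	struct demo_work_info info;

	info.hashed = (work_flags & DEMO_WORK_HASHED) != 0;
	info.cancel = (work_flags & DEMO_WORK_CANCEL) != 0;
	return info;
}
```

Note that the local copy is a snapshot: if another thread changes the flags afterwards, the function keeps working with the value it read, which is the trade-off this kind of caching accepts.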

In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:

io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:

    38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
     2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
     3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
     6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work

HTTP server, static file:

    42.79%     -2.77%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     2.08%     +0.23%  [kernel.kallsyms]     [k] io_wq_submit_work
     1.19%     +0.20%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.46%     +0.15%  [kernel.kallsyms]     [k] ep_poll_callback
     1.80%     +0.15%  [kernel.kallsyms]     [k] io_worker_handle_work

HTTP server, PHP:

    35.03%     -1.80%  [kernel.kallsyms]     [k] queued_spin_lock_slowpath
     0.84%     +0.21%  [kernel.kallsyms]     [k] amd_iommu_iotlb_sync_map
     1.39%     +0.12%  [kernel.kallsyms]     [k] _copy_to_iter
     0.21%     +0.10%  [kernel.kallsyms]     [k] update_sd_lb_stats

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-5-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/io-wq: move worker lists to struct io_wq_acct
Max Kellermann [Tue, 28 Jan 2025 13:39:22 +0000 (14:39 +0100)]
io_uring/io-wq: move worker lists to struct io_wq_acct

Have separate linked lists for bounded and unbounded workers.  This
way, io_acct_activate_free_worker() sees only workers relevant to it
and doesn't need to skip irrelevant ones.  This speeds up the
linked list traversal (under acct->lock).

The `io_wq.lock` field is moved to `io_wq_acct.workers_lock`.  It did
not actually protect "access to elements below", that is, not all of
them; it only protected access to the worker lists.  By having two
locks instead of one, contention on this lock is reduced.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-4-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agoio_uring/io-wq: add io_worker.acct pointer
Max Kellermann [Tue, 28 Jan 2025 13:39:21 +0000 (14:39 +0100)]
io_uring/io-wq: add io_worker.acct pointer

This replaces the `IO_WORKER_F_BOUND` flag.  All code that checks this
flag is not interested in knowing whether this is a "bound" worker;
all it does with this flag is determine the `io_wq_acct` pointer.  At
the cost of an extra pointer field, we can eliminate some fragile
pointer arithmetic.  In turn, the `create_index` and `index` fields
are not needed anymore.
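
A simplified picture of the change, using made-up types: instead of storing a "bound" flag and recomputing which accounting slot it maps to on every use, the worker records a pointer to its accounting struct once at creation.

```c
/* Hypothetical two-slot accounting table, as a stand-in. */
enum {
	DEMO_ACCT_BOUND,
	DEMO_ACCT_UNBOUND,
	DEMO_ACCT_NR,
};

struct demo_acct {
	unsigned int nr_workers;
};

struct demo_wq {
	struct demo_acct acct[DEMO_ACCT_NR];
};

struct demo_worker {
	struct demo_wq *wq;
	struct demo_acct *acct;	/* resolved once, at creation */
};

static void demo_worker_init(struct demo_worker *worker,
			     struct demo_wq *wq, int bound)
{
	worker->wq = wq;
	worker->acct = &wq->acct[bound ? DEMO_ACCT_BOUND : DEMO_ACCT_UNBOUND];
	worker->acct->nr_workers++;
}

/* Every later lookup is a plain field read, no flag test or indexing. */
static struct demo_acct *demo_worker_acct(const struct demo_worker *worker)
{
	return worker->acct;
}
```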

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-3-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>