Jens Axboe [Fri, 14 Mar 2025 16:19:01 +0000 (10:19 -0600)]
io_uring: enable toggle of iowait usage when waiting on CQEs
By default, io_uring marks a waiting task as being in iowait, if it's
sleeping waiting on events and there are pending requests. This isn't
necessarily always useful, and may be confusing on non-storage setups
where iowait isn't expected. It can also cause extra power usage, by
preventing the CPU from entering lower sleep states.
This adds a new enter flag, IORING_ENTER_NO_IOWAIT. If set, then
io_uring will not account the sleeping task as being in iowait. If the
kernel supports this feature, then it will be marked by having the
IORING_FEAT_NO_IOWAIT feature flag set.
As the kernel currently does not support separating the iowait
accounting and CPU frequency boosting, the IORING_ENTER_NO_IOWAIT
controls both of these at the same time. In the future, if those do end
up being split, then it'd be possible to control them separately.
However, it seems more likely that the kernel will decouple iowait and
CPU frequency boosting anyway.
Jens Axboe [Mon, 10 Mar 2025 20:01:49 +0000 (14:01 -0600)]
io_uring/kbuf: enable bundles for incrementally consumed buffers
The original support for incrementally consumed buffers didn't allow it
to be used with bundles, with the assumption being that incremental
buffers are generally larger, and hence there's less of a need to
support it.
But that assumption may not be correct - it's perfectly viable to use
smaller buffers with incremental consumption, and there may be valid
reasons for an application or framework to do so.
As there's really no need to explicitly disable bundles with
incrementally consumed buffers, allow it. This actually makes the peek
side cheaper and simpler, with the completion side basically the same,
just needing to iterate for the consumed length.
Reported-by: Norman Maurer <norman_maurer@apple.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 3 Mar 2025 12:43:19 +0000 (20:43 +0800)]
selftests: ublk: add one stress test for covering IO vs. removing device
Add stress_test_01, which runs IO while removing the device, to verify
that ublk device removal works as expected while heavy IO workloads are
in progress.
Ming Lei [Mon, 3 Mar 2025 12:43:17 +0000 (20:43 +0800)]
selftests: ublk: move zero copy feature check into _add_ublk_dev()
Move the zero copy feature check into _add_ublk_dev(), since we will
have more tests that need to cover zero copy.
A check function, _check_add_dev(), has to be added to deal with
cleanup, since _add_ublk_dev() runs in a sub-shell and can't exit to
the parent shell.
Meanwhile, always return an error code from _add_ublk_dev().
Ming Lei [Mon, 3 Mar 2025 12:43:16 +0000 (20:43 +0800)]
selftests: ublk: don't pass ${dev_id} to _cleanup_test()
More devices can be created in a single test, so simply remove all
ublk devices in _cleanup_test(), and remove the ${dev_id} argument
of _cleanup_test().
Ming Lei [Mon, 3 Mar 2025 12:43:12 +0000 (20:43 +0800)]
selftests: ublk: fix build failure
Fixes the following build failure:
ublk//file_backed.c: In function ‘backing_file_tgt_init’:
ublk//file_backed.c:28:42: error: ‘O_DIRECT’ undeclared (first use in this function); did you mean ‘O_DIRECTORY’?
28 | fd = open(file, O_RDWR | O_DIRECT);
| ^~~~~~~~
| O_DIRECTORY
when trying to reuse this same utility for the liburing test.
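O_DIRECT is a GNU extension, so a likely shape of the fix (an assumption; the patch itself is not quoted here) is defining _GNU_SOURCE before the first include:

    /* _GNU_SOURCE must be defined before any header is included for
     * glibc to expose O_DIRECT in <fcntl.h> */
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>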
Add a helper function io_cache_free() that returns an allocation to an
io_alloc_cache, falling back on kfree() if the io_alloc_cache is full.
This is the inverse of io_cache_alloc(), which takes an allocation from
an io_alloc_cache and falls back on kmalloc() if the cache is empty.
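A minimal sketch of such a helper, assuming the existing io_alloc_cache_put() returns false when the cache is full:

    /* sketch: try to return the object to the cache, free it otherwise */
    static inline void io_cache_free(struct io_alloc_cache *cache, void *obj)
    {
        if (!io_alloc_cache_put(cache, obj))
            kfree(obj);
    }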
io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node()
io_rsrc_nodes of type IORING_RSRC_FILE always have a file attached
immediately after they are allocated. IORING_RSRC_BUFFER nodes won't be
returned from io_sqe_buffer_register()/io_buffer_register_bvec() until
they have an io_mapped_ubuf attached.
So remove the checks for a NULL file/buffer in io_free_rsrc_node().
io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure
io_sqe_buffer_register() currently calls io_put_rsrc_node() if it fails
to fully set up the io_rsrc_node. io_put_rsrc_node() is more involved
than necessary, since we already know the reference count will reach 0
and no io_mapped_ubuf has been attached to the node yet.
So just call io_free_node() to release the node's memory. This also
avoids the need to temporarily set the node's buf pointer to NULL.
io_uring/rsrc.h uses several types from include/linux/io_uring_types.h.
Include io_uring_types.h explicitly in rsrc.h to avoid depending on
users of rsrc.h including io_uring_types.h first.
io_buffer_register_bvec() takes index as an unsigned int argument, but
ublk_register_io_buf() casts ub_cmd->addr (a u64) to int. Remove the
misleading cast and instead pass index as an unsigned value to
ublk_register_io_buf() and ublk_unregister_io_buf().
io_uring/ublk: report error when unregister operation fails
Indicate to userspace applications if a UBLK_IO_UNREGISTER_IO_BUF
command specifies an invalid buffer index by returning an error code.
Return -EINVAL if no buffer is registered with the given index, and
-EBUSY if the registered buffer is not a kernel bvec.
io_uring: convert cmd_to_io_kiocb() macro to function
The cmd_to_io_kiocb() macro applies a pointer cast to its input without
parenthesizing it. Currently all inputs are variable names, so this has
the intended effect. But since casts have relatively high precedence,
the macro would apply the cast to the wrong value if the input was a
pointer addition, for example.
Turn the macro into a static inline function to ensure the pointer cast
is applied to the full input value.
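A sketch of the conversion (the exact macro body is inferred from the description above):

    /* before: the cast binds tighter than, say, pointer addition in 'ptr' */
    #define cmd_to_io_kiocb(ptr)	((struct io_kiocb *) ptr)

    /* after: the argument is evaluated in full before the conversion */
    static inline struct io_kiocb *cmd_to_io_kiocb(void *ptr)
    {
        return ptr;
    }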
io_uring/uring_cmd: specify io_uring_cmd_import_fixed() pointer type
io_uring_cmd_import_fixed() takes a struct io_uring_cmd *, but the type
of the ioucmd parameter is void *. Make the pointer type explicit so the
compiler can type check it.
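Sketched signature change (the surrounding parameter list is an assumption; a later patch in this series adds issue_flags):

    /* before: any pointer is silently accepted as ioucmd */
    int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
                                  struct iov_iter *iter, void *ioucmd);

    /* after: the compiler rejects anything but a struct io_uring_cmd * */
    int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
                                  struct iov_iter *iter,
                                  struct io_uring_cmd *ioucmd);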
Ming Lei [Fri, 28 Feb 2025 16:19:14 +0000 (00:19 +0800)]
selftests: ublk: add kernel selftests for ublk
Both the ublk driver and its userspace heavily depend on the io_uring
subsystem, and tools/testing/selftests/ should be the best place for
holding these cross-subsystem tests.
Add a basic read/write IO test over the ublk null disk, and make sure
ublk works.
Keith Busch [Thu, 27 Feb 2025 22:39:15 +0000 (14:39 -0800)]
ublk: zc register/unregister bvec
Provide new operations for the user to request mapping an active request
to an io_uring instance's buf_table. The user has to provide the index
at which it wants to install the buffer.
A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.
Keith Busch [Thu, 27 Feb 2025 22:39:14 +0000 (14:39 -0800)]
io_uring: add support for kernel registered bvecs
Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.
User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.
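From userspace, that could look like the following hedged liburing sketch (the slot count is arbitrary):

    #include <liburing.h>

    /* sketch: register an empty (sparse) fixed buffer table so the
     * kernel has slots to install its own bvecs into for zero-copy IO */
    static int setup_zc_table(struct io_uring *ring)
    {
        return io_uring_register_buffers_sparse(ring, 16);
    }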
Xinyu Zhang [Thu, 27 Feb 2025 22:39:13 +0000 (14:39 -0800)]
nvme: map uring_cmd data even if address is 0
When using kernel registered bvec fixed buffers, the "address" is
actually the offset into the bvec rather than userspace address.
Therefore it can be 0.
We can skip checking whether the address is NULL before mapping
uring_cmd data. A bad userspace address will be handled properly later,
when the user buffer is imported.
With this patch, we will be able to use the kernel registered bvec fixed
buffers in io_uring NVMe passthru with ublk zero-copy support.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 27 Feb 2025 22:39:12 +0000 (14:39 -0800)]
io_uring/rw: move fixed buffer import to issue path
Registered buffers may depend on a linked command, which makes the prep
path too early to import them. Move the import to the issue path, where
the node is actually needed, like all the other users of fixed buffers.
Arnd Bergmann [Thu, 27 Feb 2025 13:20:09 +0000 (14:20 +0100)]
io_uring/net: fix build warning for !CONFIG_COMPAT
A code rework resulted in an uninitialized return code when COMPAT
mode is disabled:
io_uring/net.c:722:6: error: variable 'ret' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
722 | if (io_is_compat(req->ctx)) {
| ^~~~~~~~~~~~~~~~~~~~~~
io_uring/net.c:736:15: note: uninitialized use occurs here
736 | if (unlikely(ret))
| ^~~
Since io_is_compat() turns into a compile-time 'false', the #ifdef
here is completely unnecessary, and removing it avoids the warning.
Pavel Begunkov [Wed, 26 Feb 2025 20:46:34 +0000 (20:46 +0000)]
io_uring: rearrange opdef flags by use pattern
Keep all flags that we use in the generic req init path close together.
That saves a load for x86 because apparently some compilers prefer
reading single bytes.
Pavel Begunkov [Wed, 26 Feb 2025 11:41:20 +0000 (11:41 +0000)]
io_uring/net: unify *mshot_prep calls with compat
Instead of duplicating an io_recvmsg_mshot_prep() call in the compat
path, let the common code handle it. For that, copy necessary compat
fields into struct user_msghdr. Note: it zeroes the user_msghdr to be
on the safe side, as compat is not that interesting and the overhead
shouldn't be high.
Pavel Begunkov [Wed, 26 Feb 2025 11:41:19 +0000 (11:41 +0000)]
io_uring/net: derive iovec storage later
Don't read free_iov until right before we need it to import the iovec.
The only place that uses it before that is provided buffer selection,
but it only serves as temporary storage and iovec content is not reused
afterwards, so use a local variable for that.
Pavel Begunkov [Wed, 26 Feb 2025 11:41:18 +0000 (11:41 +0000)]
io_uring/net: verify msghdr before copying iovec
Normally, net/ would verify the msghdr before importing the iovec; see
for example copy_msghdr_from_user(), which is further assumed by
__copy_msghdr() validating msg->msg_iovlen.
io_uring does it in reverse order, which is fine, but it'll be more
convenient to flip it so that the iovec business is done at the end and
can eventually be pulled out of the msghdr parsing section and thought
of as a separate step. That also makes structure accesses more
localised, which should be better for caches.
Pavel Begunkov [Wed, 26 Feb 2025 11:41:17 +0000 (11:41 +0000)]
io_uring/net: isolate msghdr copying code
The user access section in io_msg_copy_hdr() is overextended by covering
selected buffers. It's hard to work with and prone to errors. Limit the
section to msghdr import only; selected buffers will do a separate
copy_from_user() call, and then move that into its own function. This
should be fine: single-shot selected buffers are not performance
critical, for multishots the overhead should be non-existent, and it's
not that expensive overall.
The REQ_F_NEED_CLEANUP settings in io_recvmsg_prep_setup() and in
io_sendmsg_setup() are relics of the past and don't do anything useful:
the flag is already set earlier, on iovec and async_data allocation.
Pavel Begunkov [Mon, 24 Feb 2025 21:31:10 +0000 (13:31 -0800)]
io_uring: combine buffer lookup and import
Registered buffers are currently imported in two steps: first we look up
an rsrc node, and then use it to set up the iterator. The first part is
usually done at the prep stage, and import happens whenever it's needed.
As we want to defer binding to a node so that it works with linked
requests, combine both steps into a single helper.
Pavel Begunkov [Mon, 24 Feb 2025 21:31:09 +0000 (13:31 -0800)]
io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()
io_uring_cmd_import_fixed() will need to know the io_uring execution
state in the following commits; for now, just pass issue_flags into it
without actually using it.
Pavel Begunkov [Mon, 24 Feb 2025 21:31:08 +0000 (13:31 -0800)]
io_uring/net: reuse req->buf_index for sendzc
There is already a field in io_kiocb that can store a registered buffer
index, use that instead of stashing the value into struct io_sr_msg.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 19:45:05 +0000 (19:45 +0000)]
io_uring/rw: extract helper for iovec import
Split a helper out of __io_import_rw_buffer() that handles vectored
buffers. I'll need it for registered vectored buffers, but it also looks
cleaner, especially with parameters being properly named.
Pavel Begunkov [Mon, 24 Feb 2025 19:45:03 +0000 (19:45 +0000)]
io_uring/rw: allocate async data in io_prep_rw()
rw always allocates async_data, so instead of doing that deeper in prep
calls inside of io_prep_rw_setup(), be a bit more explicit and do that
early on in io_prep_rw().
Pavel Begunkov [Sun, 23 Feb 2025 17:22:31 +0000 (17:22 +0000)]
io_uring: make io_poll_issue() sturdier
io_poll_issue() forwards the call to io_issue_sqe() and thus inherits
some of the handling. That's not particularly failure resistant, as for
example returning an innocent-looking IOU_OK from a multishot issue
will lead to severe bugs.
Reimplement io_poll_issue() without io_issue_sqe()'s request completion
logic. Remove extra checks as we know that req->file is already set,
linked timeouts are armed, and iopoll is not supported. Also cover it
with warnings for now.
The patch should be useful by itself, but it's also preparing the
codebase for other future clean ups.
Pavel Begunkov [Sun, 23 Feb 2025 17:22:29 +0000 (17:22 +0000)]
io_uring/net: fix accept multishot handling
REQ_F_APOLL_MULTISHOT doesn't guarantee it's executed from the multishot
context, so a multishot accept may get executed inline, fail
io_req_post_cqe(), and ask the core code to kill the request with
-ECANCELED by returning IOU_STOP_MULTISHOT even when a socket has been
accepted and installed.
Compat performance is not important and simplicity is more appreciated.
Let's not be smart about it and use the simpler copy_from_user() instead
of an access_ok() + __get_user() pair.
Pavel Begunkov [Mon, 24 Feb 2025 12:42:21 +0000 (12:42 +0000)]
io_uring/rw: compile out compat param passing
Even when COMPAT is compiled out, we still have to pass
ctx->compat to __import_iovec(). Replace the read with an indirection
that resolves to a constant when the kernel doesn't support compat.
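A sketch of the indirection, consistent with the io_is_compat() helper mentioned earlier in this log:

    /* sketch: with !CONFIG_COMPAT this folds to a compile-time false,
     * so the ctx->compat load disappears entirely */
    static inline bool io_is_compat(struct io_ring_ctx *ctx)
    {
        return IS_ENABLED(CONFIG_COMPAT) && unlikely(ctx->compat);
    }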
Pavel Begunkov [Wed, 19 Feb 2025 01:33:38 +0000 (01:33 +0000)]
io_uring/rw: don't directly use ki_complete
We want to avoid checking ->ki_complete directly in the io_uring
completion path. Fortunately we have only two callbacks, and the
selection between them depends on the constant ring flags, i.e. IOPOLL,
so use that to infer the function.
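A hedged sketch of the inference (io_complete_rw() and io_complete_rw_iopoll() are the rw completion handlers; the wrapper name and call site are assumptions):

    /* sketch: choose the completion callback from the constant ring
     * flags instead of reading ->ki_complete from the kiocb */
    static void io_req_rw_complete_cb(struct io_kiocb *req, long res)
    {
        struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);

        if (req->ctx->flags & IORING_SETUP_IOPOLL)
            io_complete_rw_iopoll(&rw->kiocb, res);
        else
            io_complete_rw(&rw->kiocb, res);
    }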
Pavel Begunkov [Wed, 19 Feb 2025 01:33:37 +0000 (01:33 +0000)]
io_uring/rw: forbid multishot async reads
At the moment we can't sanely handle queuing an async request from a
multishot context, so disable them. It shouldn't matter as pollable
files / sockets don't normally do async.
Patching it in __io_read() is not the cleanest way, but it's simpler
than other options, so let's fix it there and clean up on top.
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86
("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the
rsrc_node_cache.
IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused
since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring
io_rsrc_node list") removed the separate tag table for registered nodes.
Jens Axboe [Tue, 18 Feb 2025 23:47:40 +0000 (16:47 -0700)]
io_uring: fix spelling error in uapi io_uring.h
This is obviously not that important, but when changes are synced back
from the kernel to liburing, the codespell CI ends up erroring because
of this misspelling. Let's just correct it and avoid this biting us
again on an import.
8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.
So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.
In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.
Also add a comment to struct io_tw_state to explain its purpose.
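A sketch of the alias as described (the later by-value change would then touch only this typedef):

    /* io_tw_state is empty; it exists only to prove the caller is in a
     * task work context */
    struct io_tw_state {
    };

    /* one name for the token type: switching from pointer to by-value
     * later doesn't touch the many functions that just forward it */
    typedef struct io_tw_state *io_tw_token_t;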
io_uring: pass ctx instead of req to io_init_req_drain()
io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.
Drop "req" from the function name since it operates on the ctx rather
than a specific req.
Jens Axboe [Sat, 8 Feb 2025 17:50:34 +0000 (10:50 -0700)]
io_uring/net: improve recv bundles
Current recv bundles are only supported for multishot receives, and
additionally they also always post at least 2 CQEs if more data is
available than what a buffer will hold. This happens because the initial
bundle recv will do a single buffer, and then do the rest of what is in
the socket as a followup receive. As shown in a test program, if 1k
buffers are available and 32k is available to receive in the socket,
you'd get the following completions:
bundle=1, mshot=0
cqe res 1024
cqe res 1024
[...]
cqe res 1024
bundle=1, mshot=1
cqe res 1024
cqe res 31744
where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 &&
mshot=1 will post a 1k completion and then a 31k completion.
To support bundle recv without multishot, it's possible to simply retry
the recv immediately and post a single completion, rather than split it
into two completions. With the below patch, the same test looks as
follows:
bundle=1, mshot=0
cqe res 32768
bundle=1, mshot=1
cqe res 32768
where mshot=0 works fine for bundles, and both of them post just a
single 32k completion rather than split it into separate completions.
Posting fewer completions is always a nice win, and not needing
multishot for proper bundle efficiency is nice for cases that can't
necessarily use multishot.
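For reference, a hedged liburing sketch of arming a single-shot bundle recv (the buffer group id is arbitrary; buffers must already be provided under it):

    #include <liburing.h>

    /* sketch: a single-shot recv with bundles enabled; buffers are
     * picked from a provided buffer ring registered as group 7 */
    static void arm_bundle_recv(struct io_uring *ring, int sockfd)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = 7;
        sqe->ioprio |= IORING_RECVSEND_BUNDLE;
    }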
Jens Axboe [Wed, 5 Feb 2025 20:13:58 +0000 (13:13 -0700)]
io_uring/cancel: add generic cancel helper
Any opcode that is cancelable ends up defining its own cancel helper
for finding and canceling a specific request. Add a generic helper that
can be used for this purpose.
Jens Axboe [Wed, 5 Feb 2025 19:48:56 +0000 (12:48 -0700)]
io_uring/cancel: add generic remove_all helper
Any opcode that is cancelable ends up defining its own remove all
helper, which iterates the pending list and cancels matches. Add a
generic helper for it, which can be used by them.
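A sketch of the generic shape (all names here are assumptions, not the patch's actual helper):

    /* sketch: walk an opcode's pending list and cancel every request
     * the matcher accepts; returns whether anything was found */
    static bool cancel_remove_all(struct hlist_head *list,
                                  bool (*match)(struct io_kiocb *req),
                                  bool (*cancel)(struct io_kiocb *req))
    {
        struct io_kiocb *req;
        struct hlist_node *tmp;
        bool found = false;

        hlist_for_each_entry_safe(req, tmp, list, hash_node) {
            if (!match(req))
                continue;
            found = true;
            cancel(req);
        }
        return found;
    }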
Pavel Begunkov [Wed, 5 Feb 2025 11:36:49 +0000 (11:36 +0000)]
io_uring/kbuf: uninline __io_put_kbufs
__io_put_kbufs() and other helper functions are too large to be inlined;
compilers would normally refuse to do so. Uninline them and move them,
together with io_kbuf_commit(), into kbuf.c.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>