Jens Axboe [Tue, 5 Mar 2024 20:10:04 +0000 (13:10 -0700)]
io_uring/net: support bundles for send
If IORING_OP_SEND is used with provided buffers, the caller may also
set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea
is that an application can fill outgoing buffers in a provided buffer
group, and then arm a single send that will service them all. Once
there are no more buffers to send, or if the requested length has
been sent, the request posts a single completion for all the buffers.
This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming
in a separate patch. However, this patch does do a lot of the prep
work that makes wiring up the sendmsg variant pretty trivial. They
share the prep side.
Jens Axboe [Tue, 5 Mar 2024 14:31:52 +0000 (07:31 -0700)]
io_uring/kbuf: add helpers for getting/peeking multiple buffers
Our provided buffer interface only allows selection of a single buffer.
Add an API that allows getting/peeking multiple buffers at the same time.
This is only implemented for the ring provided buffers. It could be added
for the legacy provided buffers as well, but since it's strongly
encouraged to use the new interface, let's keep it simpler and just
provide it for the new API. The legacy interface will always just select
a single buffer.
There are two new main functions:
io_buffers_select(), which selects up as many buffers as it can. The
caller supplies the iovec array, and io_buffers_select() may allocate a
bigger array if the 'out_len' being passed in is non-zero and bigger
than what fits in the provided iovec. Buffers grabbed with this helper
are permanently assigned.
io_buffers_peek(), which works like io_buffers_select(), except they can
be recycled, if needed. Callers using either of these functions should
call io_put_kbufs() rather than io_put_kbuf() at completion time. The
peek interface must be called with the ctx locked from peek to
completion.
This add a bit state for the request:
- REQ_F_BUFFERS_COMMIT, which means that the the buffers have been
peeked and should be committed to the buffer ring head when they are
put as part of completion. Prior to this, req->buf_list was cleared to
NULL when committed.
Jens Axboe [Mon, 19 Feb 2024 17:46:44 +0000 (10:46 -0700)]
io_uring/net: add provided buffer support for IORING_OP_SEND
It's pretty trivial to wire up provided buffer support for the send
side, just like how it's done the receive side. This enables setting up
a buffer ring that an application can use to push pending sends to,
and then have a send pick a buffer from that ring.
One of the challenges with async IO and networking sends is that you
can get into reordering conditions if you have more than one inflight
at the same time. Consider the following scenario where everything is
fine:
1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, completes successfully, posts CQE
5) sendB is issued, completes successfully, posts CQE
All is fine. Requests are always issued in-order, and both complete
inline as most sends do.
However, if we're flooding socket1 with sends, the following could
also result from the same sequence:
1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, completes successfully, posts CQE
7) sendA is retried, completes successfully, posts CQE
Now we've sent sendB before sendA, which can make things unhappy. If
both sendA and sendB had been using provided buffers, then it would look
as follows instead:
1) App queues dataA for sendA, queues sendA for socket1
2) App queues dataB for sendB queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, picks first buffer (dataA), completes successfully,
posts CQE (which says "I sent dataA")
7) sendA is retried, picks first buffer (dataB), completes successfully,
posts CQE (which says "I sent dataB")
Now we've sent the data in order, and everybody is happy.
It's worth noting that this also opens the door for supporting multishot
sends, as provided buffers would be a prerequisite for that. Those can
trigger either when new buffers are added to the outgoing ring, or (if
stalled due to lack of space) when space frees up in the socket.
Jens Axboe [Sun, 25 Feb 2024 19:52:39 +0000 (12:52 -0700)]
io_uring/net: add generic multishot retry helper
This is just moving io_recv_prep_retry() higher up so it can get used
for sends as well, and rename it to be generically useful for both
sends and receives.
A previous commit removed the checking on whether or not it was possible
to retry a request, since it's now possible to retry any of them. This
would previously have caused the request to have been ended with an error,
but now the retry condition can simply get lost instead.
Cleanup the retry handling and always just punt it to task_work, which
will queue it with io-wq appropriately.
Reported-by: Changhui Zhong <czhong@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <axboe@kernel.dk>
io-wq: Drop intermediate step between pending list and active work
next_work is only used to make the work visible for
cancellation. Instead, we can just directly write to cur_work before
dropping the acct_lock and avoid the extra hop.
Commit 361aee450c6e ("io-wq: add intermediate work step between pending
list and active work") closed a race between a cancellation and the work
being removed from the wq for execution. To ensure the request is
always reachable by the cancellation, we need to move it within the wq
lock, which also synchronizes the cancellation. But commit 42abc95f05bf ("io-wq: decouple work_list protection from the big
wqe->lock") replaced the wq lock here and accidentally reintroduced the
race by releasing the acct_lock too early.
In other words:
worker | cancellation
work = io_get_next_work() |
raw_spin_unlock(&acct->lock); |
|
| io_acct_cancel_pending_work
| io_wq_worker_cancel()
worker->next_work = work
Using acct_lock is still enough since we synchronize on it on
io_acct_cancel_pending_work.
Fixes: 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240416021054.3940-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
1) The command type does something on the prep side that triggers an
audit call.
2) The thread hasn't done any operations before this that triggered
an audit call inside ->issue(), where we have audit_uring_entry()
and audit_uring_exit().
Work around this by issuing a blanket NOP operation before the SQPOLL
does anything.
Pavel Begunkov [Mon, 15 Apr 2024 12:50:13 +0000 (13:50 +0100)]
io_uring/notif: shrink account_pages to u32
->account_pages is the number of pages we account against the user
derived from unsigned len, it definitely fits into unsigned, which saves
some space in struct io_notif_data.
io_uring: ensure overflow entries are dropped when ring is exiting
A previous consolidation cleanup missed handling the case where the ring
is dying, and __io_cqring_overflow_flush() doesn't flush entries if the
CQ ring is already full. This is fine for the normal CQE overflow
flushing, but if the ring is going away, we need to flush everything,
even if it means simply freeing the overflown entries.
Pavel Begunkov [Wed, 10 Apr 2024 01:26:54 +0000 (02:26 +0100)]
io_uring: always lock __io_cqring_overflow_flush
Conditional locking is never great, in case of
__io_cqring_overflow_flush(), which is a slow path, it's not justified.
Don't handle IOPOLL separately, always grab uring_lock for overflow
flushing.
Pavel Begunkov [Wed, 10 Apr 2024 01:26:52 +0000 (02:26 +0100)]
io_uring: remove extra SQPOLL overflow flush
c1edbf5f081be ("io_uring: flag SQPOLL busy condition to userspace")
added an extra overflowed CQE flush in the SQPOLL submission path due to
backpressure, which was later removed. Remove the flush and let
io_cqring_wait() / iopoll handle it.
Pavel Begunkov [Tue, 9 Apr 2024 21:05:53 +0000 (14:05 -0700)]
io_uring: separate header for exported net bits
We're exporting some io_uring bits to networking, e.g. for implementing
a net callback for io_uring cmds, but we don't want to expose more than
needed. Add a separate header for networking.
Pavel Begunkov [Sun, 7 Apr 2024 23:54:55 +0000 (00:54 +0100)]
io_uring/net: merge ubuf sendzc callbacks
Splitting io_tx_ubuf_callback_ext from io_tx_ubuf_callback is a pre
mature optimisation that doesn't give us much. Merge the functions into
one and reclaim some simplicity back.
Pavel Begunkov [Fri, 5 Apr 2024 15:50:05 +0000 (16:50 +0100)]
io_uring: remove io_req_put_rsrc_locked()
io_req_put_rsrc_locked() is a weird shim function around
io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding
->uring_lock, so we can just use it directly.
Pavel Begunkov [Fri, 5 Apr 2024 15:50:04 +0000 (16:50 +0100)]
io_uring: remove async request cache
io_req_complete_post() was a sole user of ->locked_free_list, but
since we just gutted the function, the cache is not used anymore and
can be removed.
->locked_free_list served as an asynhronous counterpart of the main
request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq.
Now they're all forced to be completed into the main cache directly,
off of the normal completion path or via io_free_req().
Pavel Begunkov [Fri, 5 Apr 2024 15:50:03 +0000 (16:50 +0100)]
io_uring: turn implicit assumptions into a warning
io_req_complete_post() is now io-wq only and shouldn't be used outside
of it, i.e. it relies that io-wq holds a ref for the request as
explained in a comment below. Let's add a warning to enforce the
assumption and make sure nobody would try to do anything weird.
Ming Lei [Fri, 5 Apr 2024 15:50:02 +0000 (16:50 +0100)]
io_uring: kill dead code in io_req_complete_post
Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"),
io_req_complete_post() is only called from io-wq submit work, where the
request reference is guaranteed to be grabbed and won't drop to zero
in io_req_complete_post().
Kill the dead code, meantime add req_ref_put() to put the reference.
Jens Axboe [Fri, 29 Mar 2024 23:19:45 +0000 (17:19 -0600)]
io_uring: fix warnings on shadow variables
There are a few of those:
io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow]
170 | struct file *f = io_file_from_index(&ctx->file_table, i);
| ^
io_uring/fdinfo.c:53:67: note: previous declaration is here
53 | __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
| ^
io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow]
187 | struct io_uring_task *tctx = node->task->io_uring;
| ^
io_uring/cancel.c:166:31: note: previous declaration is here
166 | struct io_uring_task *tctx,
| ^
io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow]
371 | struct io_uring_task *tctx = node->task->io_uring;
| ^
io_uring/register.c:312:24: note: previous declaration is here
312 | struct io_uring_task *tctx = NULL;
| ^
and a simple cleanup gets rid of them. For the fdinfo case, make a
distinction between the file being passed in (for the ring), and the
registered files we iterate. For the other two cases, just get rid of
shadowed variable, there's no reason to have a new one.
Jens Axboe [Wed, 27 Mar 2024 20:59:09 +0000 (14:59 -0600)]
io_uring: move mapping/allocation helpers to a separate file
Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.
Jens Axboe [Wed, 13 Mar 2024 02:24:21 +0000 (20:24 -0600)]
io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_page() and have it Just Work.
This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which otherwise protects buffer_lists from being
torn down, and it's not safe to grab from mmap context that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, lookup the buffer_list under RCU, as the the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (eg if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.
Jens Axboe [Tue, 12 Mar 2024 16:42:27 +0000 (10:42 -0600)]
io_uring/kbuf: vmap pinned buffer ring
This avoids needing to care about HIGHMEM, and it makes the buffer
indexing easier as both ring provided buffer methods are now virtually
mapped in a contigious fashion.
Jens Axboe [Wed, 13 Mar 2024 15:56:14 +0000 (09:56 -0600)]
io_uring: get rid of remap_pfn_range() for mapping rings/sqes
Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.
If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.
This just covers the rings/sqes, the other remaining user of the mmap
remap_pfn_range() user will be converted separately. Once that is done,
we can kill the old alloc/free code.
Gabriel Krisman Bertazi [Thu, 28 Mar 2024 21:09:35 +0000 (17:09 -0400)]
io_uring: Avoid anonymous enums in io_uring uapi
While valid C, anonymous enums confuse Cython (Python to C translator),
as reported by Ritesh (YoSTEALTH) [1] . Since people rely on it when
building against liburing and we want to keep this header in sync with
the library version, let's name the existing enums in the uapi header.
Jens Axboe [Tue, 26 Mar 2024 00:53:33 +0000 (18:53 -0600)]
io_uring: use the right type for work_llist empty check
io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
not an io_wq_work_list, it's a struct llist_head. They both have
->first as head-of-list, and it turns out the checks are identical. But
be proper and use the right helper.
Fixes: dac6a0eae793 ("io_uring: ensure iopoll runs local task work as well") Signed-off-by: Jens Axboe <axboe@kernel.dk>
Joel Granados [Thu, 28 Mar 2024 15:57:53 +0000 (16:57 +0100)]
io_uring: Remove the now superfluous sentinel elements from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
Remove sentinel element from kernel_io_uring_disabled_table
Jens Axboe [Wed, 27 Mar 2024 04:13:41 +0000 (22:13 -0600)]
io_uring: re-arrange Makefile order
The object list is a bit of a mess, with core and opcode files mixed in.
Re-arrange it so that we have the core bits first, and then opcode
specific files after that.
Jens Axboe [Tue, 26 Mar 2024 01:07:22 +0000 (19:07 -0600)]
io_uring: refill request cache in memory order
The allocator will generally return memory in order, but
__io_alloc_req_refill() then adds them to a stack and we'll extract them
in the opposite order. This obviously isn't a huge deal, but:
1) it makes debugging easier when they are in order
2) keeping them in-order is the right thing to do
3) reduces the code for adding them to the stack
Jens Axboe [Wed, 20 Mar 2024 21:19:44 +0000 (15:19 -0600)]
io_uring/alloc_cache: switch to array based caching
Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that it cheaper to manage.
Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.
Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.
Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.
This reduces the overhead of the recycled allocations, and it reduces
the amount of code code needed to support recycling to about half of
what it currently is.
Jens Axboe [Wed, 20 Mar 2024 21:23:47 +0000 (15:23 -0600)]
io_uring/uring_cmd: defer SQE copying until it's needed
The previous commit turned on async data for uring_cmd, and did the
basic conversion of setting everything up on the prep side. However, for
a lot of use cases, -EIOCBQUEUED will get returned on issue, as the
operation got successfully queued. For that case, a persistent SQE isn't
needed, as it's just used for issue.
Unless execution goes async immediately, defer copying the double SQE
until it's necessary.
This greatly reduces the overhead of such commands, as evidenced by
a perf diff from before and after this change:
where the prep side drops from 10.60% to ~2%, which is more expected.
Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back
to where it was before the async command prep.
Jens Axboe [Tue, 19 Mar 2024 02:37:22 +0000 (20:37 -0600)]
io_uring/net: move connect to always using async data
While doing that, get rid of io_async_connect and just use the generic
io_async_msghdr. Both of them have a struct sockaddr_storage in there,
and while io_async_msghdr is bigger, if the same type can be used then
the netmsg_cache can get reused for connect as well.
Jens Axboe [Mon, 18 Mar 2024 22:31:44 +0000 (16:31 -0600)]
io_uring/rw: add iovec recycling
Let the io_async_rw hold on to the iovec and reuse it, rather than always
allocate and free them.
Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.
While doing so, shrink io_async_rw by getting rid of the bigger embedded
fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1.
This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction.
Jens Axboe [Sat, 23 Mar 2024 02:41:18 +0000 (20:41 -0600)]
io_uring/rw: cleanup retry path
We no longer need to gate a potential retry on whether or not the
context matches our original task, as all read/write operations have
been fully prepared upfront. This means there's never any re-import
needed, and hence we can always retry requests.
Jens Axboe [Mon, 18 Mar 2024 22:13:01 +0000 (16:13 -0600)]
io_uring/rw: always setup io_async_rw for read/write requests
read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.
Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.
This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.
In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.
Jens Axboe [Sat, 16 Mar 2024 21:33:53 +0000 (15:33 -0600)]
io_uring/net: add iovec recycling
Right now the io_async_msghdr is recycled to avoid the overhead of
allocating+freeing it for every request. But the iovec is not included,
hence that will be allocated and freed for each transfer regardless.
This commit enables recyling of the iovec between io_async_msghdr
recycles. This avoids alloc+free for each one if an iovec is used, and
on top of that, it extends the cache hot nature of msg to the iovec as
well.
Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.
The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte
saving (or ~23% smaller), as the fast_iovec entry is dropped from 8
entries to a single entry. There's no point keeping a big fast iovec
entry, if iovecs aren't being allocated and freed continually.
Jens Axboe [Mon, 18 Mar 2024 16:07:37 +0000 (10:07 -0600)]
io_uring: kill io_msg_alloc_async_prep()
We now ONLY call io_msg_alloc_async() from inside prep handling, which
is always locked. No need for this helper anymore, or the check in
io_msg_alloc_async() on whether the ring is locked or not.
Jens Axboe [Mon, 18 Mar 2024 14:09:47 +0000 (08:09 -0600)]
io_uring/net: get rid of ->prep_async() for send side
Move the io_async_msghdr out of the issue path and into prep handling,
e it's now done unconditionally and hence does not need to be part
of the issue path. This means any usage of io_sendrecv_prep_async() and
io_sendmsg_prep_async(), and hence the forced async setup path is now
unified with the normal prep setup.
Jens Axboe [Mon, 18 Mar 2024 13:36:03 +0000 (07:36 -0600)]
io_uring/net: get rid of ->prep_async() for receive side
Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This reduces the footprint of the multishot fast
path of multiple invocations of ->issue() per prep, and also means that
using ->prep_async() can be dropped for recvmsg asthis is now done via
setup on the prep side.
io_uring/net: always set kmsg->msg.msg_control_user before issue
We currently set this separately for async/sync entry, but let's just
move it to a generic pre-issue spot and eliminate the difference
between the two.
Jens Axboe [Sat, 16 Mar 2024 23:26:09 +0000 (17:26 -0600)]
io_uring/net: always setup an io_async_msghdr
Rather than use an on-stack one and then need to allocate and copy if
async execution is required, always grab one upfront. This should be
very cheap, and potentially even have cache hotness benefits for
back-to-back send/recv requests.
For any recv type of request, this is probably a good choice in general,
as it's expected that no data is available initially. For send this is
not necessarily the case, as space in the socket buffer is expected to
be available. However, getting a cached io_async_msghdr is very cheap,
and as it should be cache hot, probably the difference here is neglible,
if any.
A nice side benefit is that io_setup_async_msg can get killed
completely, which has some nasty iovec manipulation code.
Jens Axboe [Tue, 5 Mar 2024 16:34:21 +0000 (09:34 -0700)]
io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr
No functional changes in this patch, just in preparation for carrying
more state then what is being done now, if necessary. While unifying
some of this code, add a generic send setup prep handler that they can
both use.
This gets rid of some manual msghdr and sockaddr on the stack, and makes
it look a bit more like the sendmsg/recvmsg variants. Going forward, more
can get unified on top.
Jens Axboe [Sun, 17 Mar 2024 00:23:44 +0000 (18:23 -0600)]
io_uring/alloc_cache: shrink default max entries from 512 to 128
In practice, we just need to recycle a few elements for (by far) most
use cases. Shrink the total size down from 512 to 128, which should be
more than plenty.
Jens Axboe [Sat, 16 Mar 2024 17:10:21 +0000 (11:10 -0600)]
io_uring: remove timeout/poll specific cancelations
For historical reasons these were special cased, as they were the only
ones that needed cancelation. But now we handle cancelations generally,
and hence there's no need to check for these in
io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles
both these and the rest as well.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:34 +0000 (22:00 +0000)]
io_uring: refactor io_req_complete_post()
Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests
to task_work, it's much cleaner and should normally happen. We couldn't
do it before because there was a possibility of looping in
complete_post() -> tw -> complete_post() -> ...
Also, unexport the function and inline __io_req_complete_post().
Pavel Begunkov [Mon, 18 Mar 2024 22:00:33 +0000 (22:00 +0000)]
io_uring: remove current check from complete_post
task_work execution is now always locked, and we shouldn't get into
io_req_complete_post() from them. That means that complete_post() is
always called out of the original task context and we don't even need to
check current.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:32 +0000 (22:00 +0000)]
io_uring: get rid of intermediate aux cqe caches
io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose
completion lock/flush batching.
DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:31 +0000 (22:00 +0000)]
io_uring: refactor io_fill_cqe_req_aux
The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.
Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:30 +0000 (22:00 +0000)]
io_uring: remove struct io_tw_state::locked
ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:29 +0000 (22:00 +0000)]
io_uring: force tw ctx locking
We can run normal task_work without locking the ctx, however we try to
lock anyway and most handlers prefer or require it locked. It might have
been interesting to multi-submitter ring with high contention completing
async read/write requests via task_work, however that will still need to
go through io_req_complete_post() and potentially take the lock for
rsrc node putting or some other case.
In other words, it's hard to care about it, so alawys force the locking.
The case described would also because of various io_uring caches.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:28 +0000 (22:00 +0000)]
io_uring/rw: avoid punting to io-wq directly
kiocb_done() should care to specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.
Jens Axboe [Tue, 19 Mar 2024 23:10:50 +0000 (17:10 -0600)]
nvme/io_uring: use helper for polled completions
NVMe is making up issue_flags, which is a no-no in general, and to make
matters worse, they are completely the wrong ones. For a pure polled
request, which it does check for, we're already inside the
ctx->uring_lock when the completions are run off io_do_iopoll(). Hence
the correct flag would be '0' rather than IO_URING_F_UNLOCKED.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 18 Mar 2024 22:00:25 +0000 (22:00 +0000)]
io_uring/cmd: fix tw <-> issue_flags conversion
!IO_URING_F_UNLOCKED does not translate to availability of the deferred
completion infra, IO_URING_F_COMPLETE_DEFER does, that what we should
pass and look for to use io_req_complete_defer() and other variants.
Luckily, it's not a real problem as two wrongs actually made it right,
at least as far as io_uring_cmd_work() goes.
Pavel Begunkov [Mon, 18 Mar 2024 22:00:24 +0000 (22:00 +0000)]
io_uring/cmd: kill one issue_flags to tw conversion
io_uring cmd converts struct io_tw_state to issue_flags and later back
to io_tw_state, it's awfully ill-fated, not to mention that intermediate
issue_flags state is not correct.
Get rid of the last conversion, drag through tw everything that came
with IO_URING_F_UNLOCKED, and replace io_req_complete_defer() with a
direct call to io_req_complete_defer(), at least for the time being.
Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysfs fix from Al Viro:
"Get rid of lockdep false positives around sysfs/overlayfs
syzbot has uncovered a class of lockdep false positives for setups
with sysfs being one of the backing layers in overlayfs. The root
cause is that of->mutex allocated when opening a sysfs file read-only
(which overlayfs might do) is confused with of->mutex of a file opened
writable (held in write to sysfs file, which overlayfs won't do).
Assigning them separate lockdep classes fixes that bunch and it's
obviously safe"
* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kernfs: annotate different lockdep class for of->mutex of writable files
Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Follow up fixes for the BHI mitigations code
- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
expected
- Work around an APIC emulation bug when the kernel is built with Clang
and run as a SEV guest
- Follow up x86 topology fixes
* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
x86/cpu/amd: Make the NODEID_MSR union actually work
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
x86/bugs: Fix BHI handling of RRSBA
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
x86/bugs: Fix BHI documentation
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
x86/bugs: Fix return type of spectre_bhi_state()
x86/apic: Force native_apic_mem_read() to use the MOV instruction
Merge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"
* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y
Merge tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Fix a bug in the GIC irqchip driver"
* tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix VSYNC referencing an unmapped VPE on GIC v4.1
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio bugfixes from Michael Tsirkin:
"Some small, obvious (in hindsight) bugfixes:
- new ioctl in vhost-vdpa has a wrong # - not too late to fix
- vhost has apparently been lacking an smp_rmb() - due to code
duplication :( The duplication will be fixed in the next merge
cycle, this is a minimal fix
- an error message in vhost talks about guest moving used index -
which of course never happens, guest only ever moves the available
index
- i2c-virtio didn't set the driver owner so it did not get refcounted
correctly"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: correct misleading printing information
vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
virtio: store owner from modules with register_virtio_driver()
vhost: Add smp_rmb() in vhost_enable_notify()
vhost: Add smp_rmb() in vhost_vq_avail_empty()
Merge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix up swiotlb buffer padding even more (Petr Tesarik)
- fix for partial dma_sync on swiotlb (Michael Kelley)
- swiotlb debugfs fix (Dexuan Cui)
* tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
swiotlb: fix swiotlb_bounce() to do partial sync's correctly
swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
Amir Goldstein [Fri, 5 Apr 2024 14:56:35 +0000 (17:56 +0300)]
kernfs: annotate different lockdep class for of->mutex of writable files
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.
To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.
Reported-by: syzbot+9a5b0ced8b1bfb238b56@syzkaller.appspotmail.com Fixes: 0fedefd4c4e3 ("kernfs: sysfs: support custom llseek method for sysfs entries") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Add the mask_port_map parameter to the ahci driver. This is a
follow-up to the recent snafu with the ASMedia controller and its
virtual port hidding port-multiplier devices. As ASMedia confirmed
that there is no way to determine if these slow-to-probe virtual
ports are actually representing the ports of a port-multiplier
devices, this new parameter allow masking ports to significantly
speed up probing during system boot, resulting in shorter boot times.
- A fix for an incorrect handling of a port unlock in
ata_scsi_dev_rescan().
- Allow command duration limits to be detected for ACS-4 devices are
there are such devices out in the field.
* tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: Allow command duration limits detection for ACS-4 drives
ata: libata-scsi: Fix ata_scsi_dev_rescan() error path
ata: ahci: Add mask_port_map module parameter
Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix for oops in cifs_get_fattr of deleted files
- fix for the remote open counter going negative in some directory
lease cases
- fix for mkfifo to instantiate dentry to avoid possible crash
- important fix to allow handling key rotation for mount and remount
(ie cases that are becoming more common when password that was used
for the mount will expire soon but will be replaced by new password)
* tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix broken reconnect when password changing on the server by allowing password rotation
smb: client: instantiate when creating SFU files
smb3: fix Open files on server counter going negative
smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
Igor Pylypiv [Thu, 11 Apr 2024 20:12:24 +0000 (20:12 +0000)]
ata: libata-core: Allow command duration limits detection for ACS-4 drives
Even though the command duration limits (CDL) feature was first added
in ACS-5 (major version 12), there are some ACS-4 (major version 11)
drives that implement CDL as well.
IDENTIFY_DEVICE, SUPPORTED_CAPABILITIES, and CURRENT_SETTINGS log pages
are mandatory in the ACS-4 standard so it should be safe to read these
log pages on older drives implementing the ACS-4 standard.
Fixes: 62e4a60e0cdb ("scsi: ata: libata: Detect support for command duration limits") Cc: stable@vger.kernel.org Signed-off-by: Igor Pylypiv <ipylypiv@google.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Commit 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume")
incorrectly handles failures of scsi_resume_device() in
ata_scsi_dev_rescan(), leading to a double call to
spin_unlock_irqrestore() to unlock a device port. Fix this by redefining
the goto labels used in case of errors and only unlock the port
scsi_scan_mutex when scsi_resume_device() fails.
Bug found with the Smatch static checker warning:
drivers/ata/libata-scsi.c:4774 ata_scsi_dev_rescan()
error: double unlocked 'ap->lock' (orig line 4757)
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org>
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Fix the TLBI RANGE operand calculation causing live migration under
KVM/arm64 to miss dirty pages due to stale TLB entries"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: tlb: Fix TLBI RANGE operand
Merge tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"The device tree changes this time are all for NXP i.MX platforms,
addressing issues with clocks and regulators on i.MX7 and i.MX8.
The old OMAP2 based Nokia N8x0 tablet get a couple of code fixes for
regressions that came in.
The ARM SCMI and FF-A firmware interfaces get a couple of minor bug
fixes.
A regression fix for RISC-V cache management addresses a problem with
probe order on Sifive cores"
* tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
MAINTAINERS: Change Krzysztof Kozlowski's email address
arm64: dts: imx8qm-ss-dma: fix can lpcg indices
arm64: dts: imx8-ss-dma: fix can lpcg indices
arm64: dts: imx8-ss-dma: fix adc lpcg indices
arm64: dts: imx8-ss-dma: fix pwm lpcg indices
arm64: dts: imx8-ss-dma: fix spi lpcg indices
arm64: dts: imx8-ss-conn: fix usb lpcg indices
arm64: dts: imx8-ss-lsio: fix pwm lpcg indices
ARM: dts: imx7s-warp: Pass OV2680 link-frequencies
ARM: dts: imx7-mba7: Use 'no-mmc' property
arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order
arm64: dts: freescale: imx8mp-venice-gw73xx-2x: fix USB vbus regulator
arm64: dts: freescale: imx8mp-venice-gw72xx-2x: fix USB vbus regulator
cache: sifive_ccache: Partially convert to a platform driver
firmware: arm_scmi: Make raw debugfs entries non-seekable
firmware: arm_scmi: Fix wrong fastchannel initialization
firmware: arm_ffa: Fix the partition ID check in ffa_notification_info_get()
ARM: OMAP2+: fix USB regression on Nokia N8x0
mmc: omap: restore original power up/down steps
mmc: omap: fix deferred probe
...
Merge tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Intel VT-d Fixes:
- Allocate local memory for PRQ page
- Fix WARN_ON in iommu probe path
- Fix wrong use of pasid config
- AMD IOMMU Fixes:
- Lock inversion fix
- Log message severity fix
- Disable SNP when v2 page-tables are used
- Mediatek driver:
- Fix module autoloading
* tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Change log message severity
iommu/vt-d: Fix WARN_ON in iommu probe path
iommu/vt-d: Allocate local memory for page request queue
iommu/vt-d: Fix wrong use of pasid config
iommu: mtk: fix module autoloading
iommu/amd: Do not enable SNP when V2 page table is enabled
iommu/amd: Fix possible irq lock inversion dependency issue
Merge tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Revert a quirk that prevented Secondary Bus Reset for LSI / Agere
FW643.
We thought the device was broken, but the reset does work correctly
on other platforms, and the reset avoids leaking data out of VMs
(Bjorn Helgaas)
- Update MAINTAINERS to reflect that Gustavo Pimentel is no longer
reachable (Manivannan Sadhasivam)
* tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Revert "PCI: Mark LSI FW643 to avoid bus reset"
MAINTAINERS: Drop Gustavo Pimentel as PCI DWC Maintainer
* tag 'block-6.9-20240412' of git://git.kernel.dk/linux:
block: fix that blk_time_get_ns() doesn't update time after schedule
block: allow device to have both virt_boundary_mask and max segment size
block: fix q->blkg_list corruption during disk rebind
blk-iocost: avoid out of bounds shift
raid1: fix use-after-free for original bio in raid1_write_request()