Pavel Begunkov [Tue, 13 Apr 2021 01:58:45 +0000 (02:58 +0100)]
io_uring: skip futile iopoll iterations
The only way to get out of io_iopoll_getevents() and continue iterating
is to have an empty iopoll_list, otherwise the main loop would just exit.
So, instead of the unlock-every-8th-iteration heuristic, unlock based on
whether iopoll_list is empty.
Also, as no one can add new requests to iopoll_list while
io_iopoll_check() holds uring_lock, it's useless to spin with the list
empty, so return in that case.
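A minimal sketch of the reworked loop, assuming the names used above
(the exact control flow in fs/io_uring.c may differ):

mutex_lock(&ctx->uring_lock);
do {
        /*
         * No new requests can be queued to iopoll_list while we hold
         * uring_lock. If the list went empty, drop the lock briefly so
         * submitters can get in; if it is still empty after retaking
         * it, spinning is futile: stop.
         */
        if (list_empty(&ctx->iopoll_list)) {
                mutex_unlock(&ctx->uring_lock);
                mutex_lock(&ctx->uring_lock);
                if (list_empty(&ctx->iopoll_list))
                        break;
        }
        ret = io_iopoll_getevents(ctx, &nr_events, min);
} while (ret >= 0 && nr_events < min);
mutex_unlock(&ctx->uring_lock);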
Pavel Begunkov [Tue, 13 Apr 2021 01:58:44 +0000 (02:58 +0100)]
io_uring: don't fail overflow on in_idle
As CQE overflows are now untied from requests and so don't hold any
ref, we don't need to handle exiting/exec'ing cases there anymore.
Moreover, it's much nicer with regard to userspace to save overflowed
CQEs whenever possible, so remove failing on in_idle.
Pavel Begunkov [Tue, 13 Apr 2021 01:58:42 +0000 (02:58 +0100)]
io_uring: refactor hrtimer_try_to_cancel uses
Don't save return values of hrtimer_try_to_cancel() in a variable, but
use them right away. In general it's safer not to have an intermediate
variable, which may be reused and passed out wrongly; here it can
simply be contracted out. Also clean up io_timeout_extract().
Pavel Begunkov [Tue, 13 Apr 2021 01:58:38 +0000 (02:58 +0100)]
io_uring: fix leaking reg files on exit
If io_sqe_files_unregister() fails in io_rsrc_ref_quiesce(), it will
fail to do the unregister, leaving files referenced. And that may well
happen because of a stray signal or just because quiesce does
allocations inside.
In io_ring_ctx_free() do an unsafe version of unregister instead, as
it's guaranteed to have no requests by that point and so quiesce is
useless.
* for-5.13/drivers:
lightnvm: deprecate OCSSD support and schedule it for removal in Linux 5.15
lightnvm: remove duplicate include in lightnvm.h
lightnvm: return the correct return value
lightnvm: use kobj_to_dev()
Christoph Hellwig [Tue, 13 Apr 2021 10:52:57 +0000 (10:52 +0000)]
lightnvm: deprecate OCSSD support and schedule it for removal in Linux 5.15
Lightnvm was an innovative idea to expose more low-level control over
SSDs, but it failed to get properly standardized and remains a
non-standardized extension to NVMe that requires vendor-specific quirks
for a few now mostly obsolete SSD devices. The standardized ZNS command
set for NVMe has taken over a lot of its approaches and allows for
fully standardized operation.
Remove the Linux code to support open-channel SSDs, as the few
production deployments of the above-mentioned SSDs use userspace driver
stacks instead of the fairly limited Linux support.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:40 +0000 (01:46 +0100)]
io_uring: return back safer resurrect
This is a revert of the revert of "io_uring: wait potential ->release()
on resurrect", which added a helper so that resurrecting refs doesn't
race with the completion reinit; the original patch was removed because
of a strange bug with no clear root cause or link to it.
The patch has since been improved: instead of synchronize_rcu(), just
wait_for_completion(), because we're at 0 refs and it will happen very
shortly. Specifically, use the non-interruptible version to ignore all
pending signals that may have ended a prior interruptible wait.
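The underlying pattern, as a hedged sketch (the percpu_ref API is real;
the data/done names are assumptions):

percpu_ref_kill(&data->refs);           /* refs drop to 0, ->release() fires */
wait_for_completion(&data->done);       /* non-interruptible: signals ignored */
percpu_ref_resurrect(&data->refs);      /* safe: no ->release() still racing */
reinit_completion(&data->done);         /* rearm for the next quiesce */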
Pavel Begunkov [Sun, 11 Apr 2021 00:46:39 +0000 (01:46 +0100)]
io_uring: improve hardlink code generation
req_set_fail_links()'s condition checking is bulky. Even though it's
always in a slow path, it's inlined and generates lots of extra code,
so simplify it by moving the HARDLINK check into the helpers that kill
linked requests.
text data bss dec hex filename
before: 79318 12330 8 91656 16608 ./fs/io_uring.o
after: 79126 12330 8 91464 16548 ./fs/io_uring.o
Pavel Begunkov [Sun, 11 Apr 2021 00:46:38 +0000 (01:46 +0100)]
io_uring: improve sqo stop
Set IO_SQ_THREAD_SHOULD_STOP before taking the sqd lock, so the sqpoll
task sees it earlier. Not a problem otherwise, as it will stop
eventually either way. Also check the invariant that it's stopped only
once.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:37 +0000 (01:46 +0100)]
io_uring: split file table from rsrc nodes
We don't need to store file tables in rsrc nodes; for now it's easier
to handle the tables non-generically, so move the file tables into the
context. A nice side effect is one less pointer dereference when
initialising a request with a fixed file.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:36 +0000 (01:46 +0100)]
io_uring: cleanup buffer register
In preparation for more changes, do a little cleanup of
io_sqe_buffers_register(): move all args/invariant checking into it
from io_buffers_map_alloc(), because the split is confusing, and clean
up the loop a bit more.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:34 +0000 (01:46 +0100)]
io_uring: simplify io_rsrc_data refcounting
We don't take many references on struct io_rsrc_data, only one per
io_rsrc_node, so using percpu refs is overkill. Use an atomic ref
instead, which is much simpler.
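A hedged sketch of the result (field names assumed):

struct io_rsrc_data {
        atomic_t refs;                  /* was a percpu_ref; one per node */
        struct completion done;
};

static void io_rsrc_data_put(struct io_rsrc_data *data)
{
        if (atomic_dec_and_test(&data->refs))
                complete(&data->done);  /* last node gone, quiesce done */
}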
block: add queue_to_disk() to get gendisk from request_queue
Sometimes we need to get the corresponding gendisk from a
request_queue. It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").
So if only the request_queue is given, we need to get its corresponding
gendisk to reach the private data stored in that gendisk.
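A sketch of how such a helper can look, assuming the queue's kobject is
parented to the disk's device (as it is for a registered gendisk):

static inline struct gendisk *queue_to_disk(struct request_queue *q)
{
        return dev_to_disk(kobj_to_dev(q->kobj.parent));
}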
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Max Gurtovoy [Mon, 12 Apr 2021 09:55:23 +0000 (09:55 +0000)]
null_blk: add option for managing virtual boundary
This enables changing the virtual boundary of null_blk devices. Until
now, null_blk devices didn't have any restriction on the scatter/gather
elements received from the block layer. Add a module parameter and a
configfs option that control the virtual boundary. This makes it
possible to test the efficiency of the block layer bounce buffer when
an application sends discontiguous IO to the device.
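A hedged sketch of the wiring (the parameter name and default are
assumptions; blk_queue_virt_boundary() is the standard block helper):

static bool g_virt_boundary;
module_param_named(virt_boundary, g_virt_boundary, bool, 0444);
MODULE_PARM_DESC(virt_boundary, "Require a virtual boundary for the device. Default: False");

/* at device setup time */
if (dev->virt_boundary)
        blk_queue_virt_boundary(nullb->q, PAGE_SIZE - 1);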
io_uring: provide io_resubmit_prep() stub for !CONFIG_BLOCK
Randy reports the following error when CONFIG_BLOCK is not set:
../fs/io_uring.c: In function ‘kiocb_done’:
../fs/io_uring.c:2766:7: error: implicit declaration of function ‘io_resubmit_prep’; did you mean ‘io_put_req’? [-Werror=implicit-function-declaration]
if (io_resubmit_prep(req)) {
Provide a dummy stub for io_resubmit_prep() like we do for
io_rw_should_reissue(), which also helps remove an ifdef sequence from
io_complete_rw_iopoll().
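The stub side of the pattern, sketched (placement next to the
CONFIG_BLOCK implementation is an assumption):

#else /* !CONFIG_BLOCK */
static bool io_resubmit_prep(struct io_kiocb *req)
{
        return false;   /* reissue is never possible without the block layer */
}
#endif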
Fixes: 8c130827f417 ("io_uring: don't alter iopoll reissue fail ret code")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.13/libata: (53 commits)
pata_ixp4xx_cf: Fix unsigned comparison with less than zero
ata: ahci_tegra: call tegra_powergate_power_off only when PM domain is not present
ata: ahci_tegra: Add AHCI support for Tegra186
dt-binding: ata: tegra: Add dt-binding documentation for Tegra186
dt-bindings: ata: tegra: Convert binding documentation to YAML
pata_legacy: Add `probe_mask' parameter like with ide-generic
pata_platform: Document `pio_mask' module parameter
pata_legacy: Properly document module parameters
ata: ahci: ceva: Updated code by using dev_err_probe()
ata: ahci: Disable SXS for Hisilicon Kunpeng920
ata: libahci_platform: fix IRQ check
sata_mv: add IRQ checks
ata: pata_acpi: Fix some incorrect function param descriptions
ata: libata-acpi: Fix function name and provide description for 'prev_gtf'
ata: sata_mv: Fix misnaming of 'mv_bmdma_stop()'
ata: pata_cs5530: Fix misspelling of 'cs5530_init_one()'s 'pdev' param
ata: pata_legacy: Repair a couple kernel-doc problems
ata: ata_generic: Fix misspelling of 'ata_generic_init_one()'
ata: pata_opti: Fix spelling issue of 'val' in 'opti_write_reg()'
ata: pata_sl82c105: Fix potential doc-rot
...
Junlin Yang [Fri, 9 Apr 2021 13:54:26 +0000 (21:54 +0800)]
pata_ixp4xx_cf: Fix unsigned comparison with less than zero
The return value of platform_get_irq() is an int that can be a negative
error code, but it was being assigned to an unsigned int variable
'irq'. Make 'irq' an int, and move the declaration to keep the code
format consistent.
./drivers/ata/pata_ixp4xx_cf.c:168:5-8:
WARNING: Unsigned expression compared with zero: irq > 0
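The shape of the fix, as a sketch (the consumer of 'irq' in the probe
path is an assumption):

int irq;                                /* was: unsigned int irq */

irq = platform_get_irq(pdev, 0);
if (irq > 0)                            /* meaningful now that irq is signed */
        irq_set_irq_type(irq, IRQ_TYPE_EDGE_RISING);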
In file included from /builds/linux/include/linux/scatterlist.h:9,
from /builds/linux/include/linux/dma-mapping.h:10,
from /builds/linux/drivers/cdrom/gdrom.c:16:
/builds/linux/drivers/cdrom/gdrom.c: In function 'gdrom_readdisk_dma':
/builds/linux/drivers/cdrom/gdrom.c:586:61: error: 'rq' undeclared
(first use in this function)
586 | __raw_writel(page_to_phys(bio_page(req->bio)) + bio_offset(rq->bio),
| ^~
Pavel Begunkov [Sun, 11 Apr 2021 00:46:33 +0000 (01:46 +0100)]
io_uring: optimise fill_event() by inlining
There are three cases where we care most about the performance of
io_cqring_fill_event() -- flushing inline completions, iopoll and
io_req_complete_post(). Inline a hot part of fill_event() into them.
All others are not as important and we don't want to bloat the binary
for them, so add a noinline version of the function for all other use
cases.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:32 +0000 (01:46 +0100)]
io_uring: always pass cflags into fill_event()
A simple preparation patch inlining io_cqring_fill_event(), whose only
role was to pass cflags=0 into the actual fill-event helper. It helps
to keep the number of related helpers sane in the following patches.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:31 +0000 (01:46 +0100)]
io_uring: optimise non-eventfd post-event
Eventfd is not the canonical way of using io_uring, so annotate
io_should_trigger_evfd() with a likely/unlikely hint to improve code
generation for the non-eventfd branch.
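A hedged sketch of the hint at the posting site (exact placement is an
assumption):

/* favour the common case: most rings have no eventfd registered */
if (unlikely(io_should_trigger_evfd(ctx)))
        eventfd_signal(ctx->cq_ev_fd, 1);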
Pavel Begunkov [Sun, 11 Apr 2021 00:46:29 +0000 (01:46 +0100)]
io_uring: enable inline completion for more cases
Take advantage of delayed/inline completion flushing and pass the right
issue flags for completion of the open, open2, fadvise and poll remove
opcodes. All others either already use the flags or are always punted
and never executed inline.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:27 +0000 (01:46 +0100)]
io_uring: unify files and task cancel
Now __io_uring_cancel() and __io_uring_files_cancel() are very similar
and mostly differ in how we count requests. Merge them and let
tctx_inflight() handle the counting.
Pavel Begunkov [Sun, 11 Apr 2021 00:46:26 +0000 (01:46 +0100)]
io_uring: track inflight requests through counter
Instead of keeping requests in an inflight_list, just track them with a
per-tctx atomic counter. Apart from being much easier and more
consistent with task cancel, it frees ->inflight_entry from being shared
between iopoll and cancellation tracking, so less headache for us.
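A hedged sketch of the counter-based tracking (the field and flag names
are assumptions):

static inline void io_req_track_inflight(struct io_kiocb *req)
{
        if (!(req->flags & REQ_F_INFLIGHT)) {
                req->flags |= REQ_F_INFLIGHT;
                atomic_inc(&current->io_uring->inflight_tracked);
        }
}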
Pavel Begunkov [Fri, 9 Apr 2021 08:13:19 +0000 (09:13 +0100)]
io_uring: clean up io_poll_task_func()
io_poll_complete() always fills an event (even an overflowed one), so we
should always do io_cqring_ev_posted() afterwards. And that's what
currently happens, because the second EPOLLONESHOT check is always true;
it can't return !done for oneshots.
Remove that branching, it's much easier to read.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Peter Zijlstra [Thu, 8 Apr 2021 09:44:50 +0000 (11:44 +0200)]
io-wq: Fix io_wq_worker_affinity()
Do not include private headers and do not frob in internals.
On top of that, while the previous code restores the affinity, it
doesn't ensure the task actually moves there if it was running,
leading to the fun situation that it can be observed running outside
of its allowed mask for potentially significant time.
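A hedged sketch of a fixed callback using the proper scheduler API
(set_cpus_allowed_ptr() both sets the mask and migrates a running
task; the wqe/node naming is an assumption):

static bool io_wq_worker_affinity(struct io_worker *worker, void *data)
{
        set_cpus_allowed_ptr(worker->task, cpumask_of_node(worker->wqe->node));
        return false;
}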
io_uring: don't attempt re-add of multishot poll request if racing
We currently allow racy updates to multishot requests, but we can end up
double adding the poll request if both completion and update do it.
Ensure that we skip the re-add on the update side if someone else is
completing it.
Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 1 Apr 2021 14:44:04 +0000 (15:44 +0100)]
io_uring: encapsulate fixed files into struct
Add struct io_fixed_file representing a single registered file, first to
hide the ugly struct file **, which may be misleading, and secondly to
retype it to unsigned long, as the conversions to and from file * for
handling and masking the FFS_* flags are getting nasty.
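A hedged sketch of the wrapper (the helper and mask names are
assumptions):

struct io_fixed_file {
        /* file * with the low FFS_* flag bits stuffed in */
        unsigned long file_ptr;
};

static inline struct file *io_fixed_file_ptr(struct io_fixed_file *f)
{
        /* strip the FFS_* flag bits (mask name assumed) */
        return (struct file *)(f->file_ptr & ~FFS_FLAG_BITS);
}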
Pavel Begunkov [Thu, 1 Apr 2021 14:44:03 +0000 (15:44 +0100)]
io_uring: refactor file tables alloc/free
Introduce a helper io_free_file_tables() doing all the cleanup; there
are several places where it's currently hand-coded. Also move all
allocations into io_sqe_alloc_file_tables() and rename it, so that all
of it is in one place.
Pavel Begunkov [Thu, 1 Apr 2021 14:44:01 +0000 (15:44 +0100)]
io_uring: set proper FFS* flags on reg file update
Set the FFS_* flags (e.g. FFS_ASYNC_READ) not only at initial
registration but also on registered file updates. Not a bug, but
without it updated files miss out on the optimisation.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:59 +0000 (15:43 +0100)]
io_uring: put link timeout req consistently
Don't put the linked timeout req in io_async_find_and_cancel() but do
it in io_link_timeout_fn(), so we have only one point for that and
won't have to handle it differently as now (put vs put_deferred). Btw,
improve io_async_find_and_cancel()'s locking a bit.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:55 +0000 (15:43 +0100)]
io_uring: store reg buffer end instead of length
It's a bit more convenient for us to store a registered buffer's end
address instead of its length, see struct io_mapped_ubuf, as it allows
us to not recompute it every time.
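A hedged sketch of the struct change (field names assumed):

struct io_mapped_ubuf {
        u64             ubuf;           /* start of the registered buffer */
        u64             ubuf_end;       /* end address stored, not a length */
        struct bio_vec  *bvec;
        unsigned int    nr_bvecs;
};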
Pavel Begunkov [Thu, 1 Apr 2021 14:43:54 +0000 (15:43 +0100)]
io_uring: improve import_fixed overflow checks
Replace a hand-coded overflow check with a specialised function.
Compilers are smart enough to generate identical binary either way
(i.e. check the carry bit), but it's more foolproof and conveys the
intention better.
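A hedged sketch, assuming the specialised function is
check_add_overflow() from <linux/overflow.h>:

u64 buf_end;

/* replaces: if (buf_addr + len < buf_addr) return -EFAULT; */
if (check_add_overflow(buf_addr, (u64)len, &buf_end))
        return -EFAULT;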
Pavel Begunkov [Thu, 1 Apr 2021 14:43:50 +0000 (15:43 +0100)]
io_uring: combine lock/unlock sections on exit
io_ring_exit_work() already does an uring_lock lock/unlock, so there is
no need to repeat it for the lock-waiting trick in io_ring_ctx_free().
Move the waiting, with its comments and spinlocking, into
io_ring_exit_work().
Pavel Begunkov [Thu, 1 Apr 2021 14:43:48 +0000 (15:43 +0100)]
io_uring: remove useless is_dying check on quiesce
rsrc_data refs should always be valid for potential submitters, and
io_rsrc_ref_quiesce() restores them before unlocking, so the
percpu_ref_is_dying() check in io_sqe_files_unregister() does nothing
and is misleading. Concurrent quiesce is prevented by
struct io_rsrc_data::quiesce.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:47 +0000 (15:43 +0100)]
io_uring: reuse io_rsrc_node_destroy()
Reuse io_rsrc_node_destroy() in __io_rsrc_put_work(). Also move it to a
more appropriate place -- to the other node routines, and remove forward
declaration.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:46 +0000 (15:43 +0100)]
io_uring: ctx-wide rsrc nodes
If we're going to ever support multiple types of resources we need
shared rsrc nodes to not bloat requests, that is implemented in this
patch. It also gives a nicer API and saves one pointer dereference
in io_req_set_rsrc_node().
We may say that all requests bound to a resource belong to one and only
one rsrc node, and considering that nodes are removed and recycled
strictly in order, this separates requests into generations, where a
generation changes on each node switch (i.e. io_rsrc_node_switch()).
The API is simple: io_rsrc_node_switch() switches to a new generation if
needed, and also optionally kills a passed-in io_rsrc_data. Each call to
io_rsrc_node_switch() has to be preceded by
io_rsrc_node_switch_start(). The start function is idempotent and does
not necessarily have to be followed by a switch.
One difference is that once a node has been set, the ctx will always
retain a valid rsrc node, even after unregister. That may be a nuisance
at the moment, but it makes much sense for multiple types of resources.
Another change is that nodes are bound to/associated with an
io_rsrc_data later, just before killing (i.e. switching).
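The resulting usage pattern, sketched under the assumption of these
signatures:

/* idempotent: ensure a fresh node can be switched to later; may fail */
ret = io_rsrc_node_switch_start(ctx);
if (ret)
        return ret;

/* ... perform the registration changes under uring_lock ... */

/* bump the generation; optionally start killing data_to_kill's node */
io_rsrc_node_switch(ctx, data_to_kill);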
Pavel Begunkov [Thu, 1 Apr 2021 14:43:44 +0000 (15:43 +0100)]
io_uring: move rsrc_put callback into io_rsrc_data
io_rsrc_node's callback operates only on a single io_rsrc_data and only
with its resources, so the rsrc_put() callback is actually a property of
io_rsrc_data. Move it there; it makes the code much nicer.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:43 +0000 (15:43 +0100)]
io_uring: encapsulate rsrc node manipulations
io_rsrc_node_get() and io_rsrc_node_set() are always used together;
merge them into one so most users don't even see io_rsrc_node and don't
need to care about it.
Doing so helped catch io_sqe_files_register() inferring the rsrc data
argument differently for get and set; not a problem, but a good sign.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:42 +0000 (15:43 +0100)]
io_uring: use rsrc prealloc infra for files reg
Keep it consistent with update and use io_rsrc_node_prealloc() +
io_rsrc_node_get() in io_sqe_files_register() as well. That will be
used in future patches, is not as error prone, and allows us to
deduplicate rsrc_node init.
Pavel Begunkov [Thu, 1 Apr 2021 14:43:41 +0000 (15:43 +0100)]
io_uring: simplify io_rsrc_node_ref_zero
Replace queue_delayed_work() with mod_delayed_work() in
io_rsrc_node_ref_zero(), as the latter can also modify an already
pending work, and clean the function up further for better readability.
io-wq: cancel task_work on exit only targeting the current 'wq'
By using task_work_cancel(), we're potentially canceling task_work
that isn't related to this specific io_wq. Use the newly added
task_work_cancel_match() to ensure that we only remove and cancel work
items that are specific to this io_wq.
Fixes: 685fe7feedb9 ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task_work: add helper for more targeted task_work canceling
The only exported helper we have right now is task_work_cancel(), which
cancels any task_work from a given task where the func member matches
the queued work item. This is a bit too coarse for some use cases. Add
a task_work_cancel_match() that allows callers to more specifically
target individual work items, matching on more than just the callback
function. task_work_cancel() can be trivially implemented on top of
that, hence do so.
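A hedged sketch of the helper and the reimplementation on top of it
(signatures assumed):

struct callback_head *
task_work_cancel_match(struct task_struct *task,
                       bool (*match)(struct callback_head *, void *data),
                       void *data);

static bool task_work_func_match(struct callback_head *cb, void *data)
{
        return cb->func == data;
}

struct callback_head *
task_work_cancel(struct task_struct *task, task_work_func_t func)
{
        return task_work_cancel_match(task, task_work_func_match, func);
}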
Jens Axboe [Wed, 31 Mar 2021 15:03:03 +0000 (09:03 -0600)]
io_uring: fix race around poll update and poll triggering
Joakim reports that in some conditions he sees a multishot poll request
being canceled, and that it coincides with getting -EALREADY on
modification. As part of the poll update procedure, there's a small window
where the request is marked as canceled, and if this coincides with the
event actually triggering, then we can get a spurious -ECANCELED and
termination of the multishot request.
Don't mark the poll request as being canceled for update. We also don't
care if we race on removal unless it's a one-shot request; we can safely
update in either case.
Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.13/drivers: (73 commits)
bcache: fix a regression of code compiling failure in debug.c
bcache: Use 64-bit arithmetic instead of 32-bit
md: bcache: Trivial typo fixes in the file journal.c
md: bcache: avoid -Wempty-body warnings
bcache: use NULL instead of using plain integer as pointer
bcache: remove PTR_CACHE
bcache: reduce redundant code in bch_cached_dev_run()
md: split mddev_find
md: factor out a mddev_find_locked helper from mddev_find
md: md_open returns -EBUSY when entering racing area
drbd: use DEFINE_SPINLOCK() for spinlock
swim3: support highmem
floppy: always use the track buffer
swim: don't call blk_queue_bounce_limit
gdrom: support highmem
block: drbd: drbd_nl: Demote half-complete kernel-doc headers
block: xen-blkfront: Demote kernel-doc abuses
block: drbd: drbd_receiver: Demote less than half complete kernel-doc header
block: drbd: drbd_main: Fix a bunch of function documentation discrepancies
block: drbd: drbd_nl: Make conversion to 'enum drbd_ret_code' explicit
...
* for-5.13/block: (31 commits)
block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration
block: remove disk_part_iter
block: simplify diskstats_show
block: simplify show_partition
block: simplify printk_all_partitions
block: simplify partition_overlaps
block: simplify partition removal
block: take bd_mutex around delete_partitions in del_gendisk
block: refactor blk_drop_partitions
block: move more syncing and invalidation to delete_partition
block: remove invalidate_partition
dasd: use bdev_disk_changed instead of blk_drop_partitions
blk-zoned: Remove the definition of blk_zone_start()
blk-mq: set default elevator as deadline in case of hctx shared tagset
block: stop calling blk_queue_bounce for passthrough requests
block: refactor the bounce buffering code
block: remove BLK_BOUNCE_ISA support
scsi: remove the unchecked_isa_dma flag
advansys: remove ISA support
BusLogic: reject broken old firmware that requires ISA-style bounce buffering
...
Jens Axboe [Thu, 25 Mar 2021 16:21:35 +0000 (10:21 -0600)]
io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE
Now that any worker is attached to the original task as a thread,
accounting of CPU time is directly attributed to the original task as
well. This means that we no longer have to restrict SQPOLL to needing
elevated privileges, as it's really no different from the task just
spawning a busy-looping thread in userspace.
Jens Axboe [Mon, 8 Mar 2021 16:37:51 +0000 (09:37 -0700)]
io-wq: eliminate the need for a manager thread
io-wq relies on a manager thread to create/fork new workers, as needed.
But there's really no strong need for it anymore. We have the following
cases that fork a new worker:
1) Work queue. This is done from the task itself always, and it's trivial
to create a worker off that path, if needed.
2) All workers have gone to sleep, and we have more work. This is called
off the sched out path. For this case, use a task_work item to queue
a fork-worker operation.
3) Hashed work completion. Don't think we need to do anything off this
case. If need be, it could just use approach 2 as well.
Part of this change is incrementing the running worker count before the
fork, to avoid cases where we observe that we need a worker, queue
creation of one, and then new work comes in and we fork yet another one.
That last queue operation should have waited for the previous worker to
come up; it's quite possible we don't even need it. Hence move the
running-count update from after the fork to before it, to handle that
case more efficiently.
Jens Axboe [Wed, 17 Mar 2021 14:37:41 +0000 (08:37 -0600)]
io_uring: allow events and user_data update of running poll requests
This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and
IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are
masked into sqe->len. If set, the POLL_ADD will have the following
behavior:
- sqe->addr must contain the user_data of the poll request that
needs to be modified. This field is otherwise invalid for a POLL_ADD
command.
- If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the
new mask for the existing poll request. There are no checks for whether
these are identical or not, if a matching poll request is found, then it
is re-armed with the new mask.
- If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new
user_data for the existing poll request.
A POLL_ADD with any of these flags set may complete with any of the
following results:
1) 0, which means that we successfully found the existing poll request
specified, and performed the re-arm procedure. Any error from that
re-arm will be exposed as a completion event for that original poll
request, not for the update request.
2) -ENOENT, if no existing poll request was found with the given
user_data.
3) -EALREADY, if the existing poll request was already in the process of
being removed/canceled/completing.
4) -EACCES, if an attempt was made to modify an internal poll request
(e.g. not one originally issued as IORING_OP_POLL_ADD).
The usual -EINVAL cases apply as well, if any invalid fields are set
in the sqe for this command type.
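A hedged userspace sketch of issuing an update with raw SQE fields,
following the layout described above (io_uring_get_sqe() is the
standard liburing helper; the old/new user_data values are
placeholders):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

memset(sqe, 0, sizeof(*sqe));
sqe->opcode      = IORING_OP_POLL_ADD;
sqe->addr        = old_user_data;       /* which poll request to modify */
sqe->len         = IORING_POLL_UPDATE_EVENTS | IORING_POLL_UPDATE_USER_DATA;
sqe->poll_events = POLLIN;              /* the new event mask */
sqe->off         = new_user_data;       /* the new user_data */
sqe->user_data   = update_tag;          /* completes with 0/-ENOENT/-EALREADY */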
Jens Axboe [Tue, 23 Feb 2021 05:08:01 +0000 (22:08 -0700)]
io_uring: add multishot mode for IORING_OP_POLL_ADD
The default io_uring poll mode is one-shot, where once the event triggers,
the poll command is completed and won't trigger any further events. If
we're doing repeated polling on the same file or socket, then it can be
more efficient to do multishot, where we keep triggering whenever the
event becomes true.
This deviates from the usual norm of having one CQE per SQE submitted. Add
a CQE flag, IORING_CQE_F_MORE, which tells the application to expect
further completion events from the submitted SQE. Right now the only user
of this is POLL_ADD in multishot mode.
Since sqe->poll_events is using the space that we normally use for adding
flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot
mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An
application should expect more CQEs for the specified SQE if the CQE is
flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an
error will terminate the poll request, in which case the flag will be
cleared.
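A hedged sketch of the consumer-side contract (raw fields as described
above; ring setup and error handling elided):

/* arm a multishot poll request */
sqe->opcode      = IORING_OP_POLL_ADD;
sqe->fd          = sockfd;
sqe->poll_events = POLLIN;
sqe->len         = IORING_POLL_ADD_MULTI;

/* on each completion */
if (!(cqe->flags & IORING_CQE_F_MORE)) {
        /* terminated by error or cancelation: re-arm if still wanted */
}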
Pavel Begunkov [Tue, 23 Feb 2021 12:40:22 +0000 (12:40 +0000)]
io_uring: allocate memory for overflowed CQEs
Instead of using a request itself for overflowed CQE stashing, allocate a
separate entry. The disadvantage is that the allocation may fail and it
will be accounted as lost (see rings->cq_overflow), so we lose reliability
in case of memory pressure if the application is driving the CQ ring
into overflow. However, it opens the way for multiple CQEs per SQE and
even for generating SQE-less CQEs.
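A hedged sketch of the separate overflow entry (the struct name is an
assumption; the GFP flags come from the note below):

struct io_overflow_cqe {
        struct io_uring_cqe cqe;        /* stashed copy of the event */
        struct list_head list;          /* linked off the ctx overflow list */
};

/* the allocation may fail, in which case the CQE is only counted as lost */
ocqe = kmalloc(sizeof(*ocqe), GFP_ATOMIC | __GFP_ACCOUNT);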
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: use GFP_ATOMIC | __GFP_ACCOUNT]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 19 Mar 2021 20:06:24 +0000 (14:06 -0600)]
io_uring: mask in error/nval/hangup consistently for poll
Instead of masking these in as part of regular POLL_ADD prep, do it in
io_init_poll_iocb(), and include NVAL as that's generally unmaskable,
and RDHUP alongside the HUP that is already set.
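A hedged sketch (the mask name is an assumption; demangle_poll() is the
standard kernel helper for converting userspace poll bits):

/* events userspace can't meaningfully mask out */
#define IO_POLL_UNMASK  (EPOLLERR | EPOLLHUP | EPOLLNVAL | EPOLLRDHUP)

poll->events = demangle_poll(sqe_events) | IO_POLL_UNMASK;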
Pavel Begunkov [Mon, 22 Mar 2021 01:58:34 +0000 (01:58 +0000)]
io_uring: optimise rw complete error handling
Expect read/write to succeed and create a hot path for this case; in
particular, hide all error handling with resubmission behind a single
check against the desired result.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>