Jens Axboe [Wed, 29 Dec 2021 17:51:11 +0000 (09:51 -0800)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
nvme: add 'iopolicy' module parameter
nvme: drop unused variable ctrl in nvme_setup_cmd
nvme: increment request genctr on completion
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Jens Axboe [Wed, 29 Dec 2021 17:50:50 +0000 (09:50 -0800)]
Merge tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme into for-5.17/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.17
- increment request genctr on completion (Keith Busch, Geliang Tang)
- add an 'iopolicy' module parameter (Hannes Reinecke)
- print out valid arguments when reading from /dev/nvme-fabrics
(Hannes Reinecke)"
* tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme:
nvme: add 'iopolicy' module parameter
nvme: drop unused variable ctrl in nvme_setup_cmd
nvme: increment request genctr on completion
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Pavel Begunkov [Wed, 15 Dec 2021 22:08:49 +0000 (22:08 +0000)]
io_uring: single shot poll removal optimisation
We don't need to keep polling a oneshot request once we've got the
desired mask in io_poll_wake(). task_work will clean it up correctly,
but as we already hold the wq spinlock, we can remove ourselves and save
on additional spinlocking in io_poll_remove_entries().
Pavel Begunkov [Wed, 15 Dec 2021 22:08:48 +0000 (22:08 +0000)]
io_uring: poll rework
It's not possible to go forward with the current state of io_uring
polling; we need simpler and more straightforward synchronisation.
There are a lot of problems with how it is at the moment, including
missing events on rewait.
The main idea here is to introduce a notion of request ownership while
polling: no one but the owner can modify any part of struct io_kiocb
other than ->poll_refs, which protects us against all sorts of races.
The main user of such exclusivity is the poll task_work handler, so
before queueing a tw one should have/acquire ownership, which will be
handed off to the tw handler.
The other user is __io_arm_poll_handler(), which does the initial poll
arming. It starts by taking ownership, so tw handlers won't be run until
it's released later in the function after vfs_poll. Note: this also
prevents races in __io_queue_proc().
Poll wake/etc. may not be able to get ownership, in which case they need
to increase the poll refcount; the task_work should notice it and retry
if necessary, see io_poll_check_events().
There is also the IO_POLL_CANCEL_FLAG flag to signal that we want to kill
the request.
It makes cancellations more reliable, enables double multishot polling,
fixes double poll rewait, fixes missing poll events and fixes another
bunch of races.
Even though it adds some overhead for the new refcounting, there are a
couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be 1-2%
of all apoll requests), it doesn't add atomics and removes a
spin_lock/unlock pair.
- works well with multishots, we don't remove from the queue and re-add
to it for each new poll event.
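The ownership scheme described in this commit message can be modelled in userspace with C11 atomics. This is only a sketch under assumptions: the sk_ names, the bit layout of the counter and the single-threaded event handling are illustrative and are not taken from the patch itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed layout: the top bit is a cancel flag, the rest is a refcount. */
#define SK_POLL_CANCEL_FLAG (1u << 31)
#define SK_POLL_REF_MASK    (SK_POLL_CANCEL_FLAG - 1u)

struct sk_poll_req {
    atomic_uint poll_refs;   /* the only field non-owners may touch */
    int events_handled;      /* owner-only state */
};

void sk_poll_init(struct sk_poll_req *req)
{
    atomic_init(&req->poll_refs, 0u);
    req->events_handled = 0;
}

/* Whoever bumps the count from zero becomes the owner. */
bool sk_poll_get_ownership(struct sk_poll_req *req)
{
    return (atomic_fetch_add(&req->poll_refs, 1u) & SK_POLL_REF_MASK) == 0;
}

/* Anyone may ask for the request to be killed without owning it. */
void sk_poll_mark_cancelled(struct sk_poll_req *req)
{
    atomic_fetch_or(&req->poll_refs, SK_POLL_CANCEL_FLAG);
}

/*
 * Run by the owner. Handles events, then drops every reference it has
 * seen. If a non-owner bumped the count in the meantime, go around again.
 * Returns false if the request was cancelled; ownership is kept in that
 * case so the caller can tear the request down.
 */
bool sk_poll_check_events(struct sk_poll_req *req)
{
    unsigned int seen, left;

    do {
        seen = atomic_load(&req->poll_refs);
        if (seen & SK_POLL_CANCEL_FLAG)
            return false;
        req->events_handled++;
        seen &= SK_POLL_REF_MASK;
        left = atomic_fetch_sub(&req->poll_refs, seen) - seen;
    } while (left & SK_POLL_REF_MASK);

    return true;
}
```

A waker that fails sk_poll_get_ownership() has still left its increment behind, which is what makes the owner loop once more instead of missing the event.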
Pavel Begunkov [Wed, 15 Dec 2021 22:08:47 +0000 (22:08 +0000)]
io_uring: kill poll linking optimisation
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.
Jens Axboe [Fri, 24 Dec 2021 14:44:10 +0000 (07:44 -0700)]
Merge branch 'for-5.17/io_uring-xattr' into for-next
* for-5.17/io_uring-xattr:
io_uring: add fgetxattr and getxattr support
io_uring: add fsetxattr and setxattr support
fs: split off do_getxattr from getxattr
fs: split off setxattr_copy and do_setxattr function from setxattr
fs: split off do_user_path_at_empty from user_path_at_empty()
Stefan Roesch [Thu, 23 Dec 2021 23:51:20 +0000 (15:51 -0800)]
fs: split off setxattr_copy and do_setxattr function from setxattr
This splits off the setup part of the setxattr
function into its own dedicated function called
setxattr_copy. In addition, it exposes a
new function called do_setxattr for making the
setxattr call.
This makes it possible to call these two functions
from io_uring in the processing of an xattr request.
Ming Lei [Fri, 24 Dec 2021 01:08:31 +0000 (09:08 +0800)]
block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0
It isn't correct to set set->nr_maps to 3 if g_poll_queues is > 0, since
we can change it via configfs for a null_blk device created there. So
only set it to 3 if the active poll_queues count is > 0.
Fixes the divide-by-zero exception reported by Shinichiro.
Fixes: 2bfdbe8b7ebd ("null_blk: allow zero poll queues")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20211224010831.1521805-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lukas Bulwahn [Thu, 23 Dec 2021 12:53:00 +0000 (13:53 +0100)]
block: drop needless assignment in set_task_ioprio()
Commit 5fc11eebb4a9 ("block: open code create_task_io_context in
set_task_ioprio") introduces a needless assignment
'ioc = task->io_context', as the local variable ioc is not further
used before returning.
Even after the further fix, commit a957b61254a7 ("block: fix error in
handling dead task for ioprio setting"), the assignment still remains
needless.
Drop this needless assignment in set_task_ioprio().
This code smell was identified with 'make clang-analyzer'.
Fixes: 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211223125300.20691-1-lukas.bulwahn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 20 Dec 2021 12:51:45 +0000 (13:51 +0100)]
nvme: add 'iopolicy' module parameter
While the 'iopolicy' sysfs attribute can be set at runtime, most
storage arrays prefer to use the 'round-robin' iopolicy by default.
We can use udev rules to set this, but this is getting rather unwieldy
for rebranded arrays as we would have to update the udev rules
anytime a new array shows up, leading to the same mess we currently
have in multipathd for configuring the RDAC arrays.
Hence this patch adds a module parameter 'iopolicy' to allow the
admin to switch the default, and to do away with the need for a
udev rule here.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
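A module parameter like this has to map a string onto a policy and reject anything else. The following is a userspace sketch of that mapping, not the driver's code: the function and enum names are hypothetical, and 'numa' as the second accepted value is an assumption, since the commit message only names 'round-robin'.

```c
#include <errno.h>
#include <string.h>

enum sk_iopolicy {
    SK_IOPOLICY_NUMA,   /* assumed name of the other policy */
    SK_IOPOLICY_RR,     /* 'round-robin', as named in the commit message */
};

/* Hypothetical parser: returns 0 and stores the policy, or -EINVAL. */
int sk_parse_iopolicy(const char *val, enum sk_iopolicy *out)
{
    if (!val)
        return -EINVAL;
    if (strcmp(val, "numa") == 0)
        *out = SK_IOPOLICY_NUMA;
    else if (strcmp(val, "round-robin") == 0)
        *out = SK_IOPOLICY_RR;
    else
        return -EINVAL;
    return 0;
}
```

Rejecting unknown strings up front means a typo in the parameter is reported once at load time rather than silently leaving the old default in place.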
Geliang Tang [Wed, 22 Dec 2021 09:32:44 +0000 (17:32 +0800)]
nvme: drop unused variable ctrl in nvme_setup_cmd
The variable 'ctrl' became useless since the code using it was dropped
from nvme_setup_cmd() in the commit 292ddf67bbd5 ("nvme: increment
request genctr on completion"). Fix it to get rid of this compilation
warning in the nvme-5.17 branch:
drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
^~~~
Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Mon, 13 Dec 2021 17:08:47 +0000 (09:08 -0800)]
nvme: increment request genctr on completion
The nvme request generation counter is intended to catch duplicate
completions. Incrementing the counter on submission means duplicates can
only be caught if the request tag is reallocated and dispatched prior to
the driver observing the corrupted CQE. Incrementing on completion
removes this window, making it possible to detect duplicate completions
in consecutive entries.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
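The effect of bumping the generation counter on completion can be shown with a small userspace model. This is a sketch under assumptions: the sk_ names are hypothetical, and the split of the 16-bit command id into a 4-bit generation and a 12-bit tag is assumed for illustration rather than taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_TAG_BITS    12      /* assumed: low bits carry the tag */
#define SK_GENCTR_MASK 0xfu    /* assumed: 4-bit generation counter */

struct sk_req {
    uint16_t tag;
    uint8_t genctr;
};

/* Command id as it would be placed in the submitted command. */
uint16_t sk_make_cid(const struct sk_req *rq)
{
    return (uint16_t)(((rq->genctr & SK_GENCTR_MASK) << SK_TAG_BITS) | rq->tag);
}

/*
 * Accept a completion only if its generation matches, then bump the
 * counter so that a second copy of the same CQE no longer matches.
 */
bool sk_complete(struct sk_req *rq, uint16_t cid)
{
    uint8_t gen = (uint8_t)((cid >> SK_TAG_BITS) & SK_GENCTR_MASK);

    if (gen != (rq->genctr & SK_GENCTR_MASK))
        return false;
    rq->genctr++;
    return true;
}
```

Because the increment happens in sk_complete(), the duplicate is rejected even when it arrives in the very next completion entry, before the tag has been reused.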
Hannes Reinecke [Tue, 7 Dec 2021 13:55:49 +0000 (14:55 +0100)]
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Currently applications have a hard time figuring out which
nvme-over-fabrics arguments are supported for any given kernel;
the ioctl will return an error code on failure, and the application
has to guess whether this was due to an invalid argument or due
to a connection or controller error.
With this patch applications can read a list of supported
arguments by simply reading from /dev/nvme-fabrics, allowing
them to validate the connection string.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
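From an application's point of view, this amounts to a plain read of the device node. The helper below is a minimal sketch with a hypothetical name; the format of the returned list is not described in the commit message, so the text is returned as-is and no parsing is attempted.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reads up to len - 1 bytes from path into buf and NUL-terminates it.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t sk_read_supported_args(const char *path, char *buf, size_t len)
{
    size_t total = 0;
    ssize_t n = 0;
    int fd;

    if (len == 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while (total < len - 1) {
        n = read(fd, buf + total, len - 1 - total);
        if (n <= 0)
            break;              /* end of data or read error */
        total += (size_t)n;
    }
    close(fd);
    if (n < 0)
        return -1;
    buf[total] = '\0';
    return (ssize_t)total;
}
```

An application would call sk_read_supported_args("/dev/nvme-fabrics", buf, sizeof(buf)) and check its connection string against the result before attempting to connect.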
Randy Dunlap [Wed, 22 Dec 2021 21:15:32 +0000 (13:15 -0800)]
bio.h: fix kernel-doc warnings
Fix all kernel-doc warnings in <linux/bio.h>:
include/linux/bio.h:136: warning: Function parameter or member 'nbytes' not described in 'bio_advance'
include/linux/bio.h:136: warning: Excess function parameter 'bytes' description in 'bio_advance'
include/linux/bio.h:391: warning: No description found for return value of 'bio_next_split'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20211222211532.24060-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 21 Dec 2021 19:16:27 +0000 (12:16 -0700)]
Merge branch 'for-5.17/io_uring-getdents64' into for-next
* for-5.17/io_uring-getdents64:
io_uring: add support for getdents64
fs: split off vfs_getdents function of getdents64 syscall
fs: add offset parameter to iterate_dir function
Jens Axboe [Tue, 21 Dec 2021 19:16:17 +0000 (12:16 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
block: check minor range in device_add_disk()
block: use "unsigned long" for blk_validate_block_size().
block: fix error unwinding in device_add_disk
block: call blk_exit_queue() before freeing q->stats
block: fix error in handling dead task for ioprio setting
blk-mq: check quiesce state before queue_rqs
blktrace: switch trace spinlock to a raw spinlock
message because such request is treated as if ioctl(fd, LOOP_CTL_ADD, 0)
due to MINORMASK == 1048575. Verify that all minor numbers for that device
fit in the minor range.
Tetsuo Handa [Sat, 18 Dec 2021 09:41:56 +0000 (18:41 +0900)]
block: use "unsigned long" for blk_validate_block_size().
Since lo_simple_ioctl(LOOP_SET_BLOCK_SIZE) and ioctl(NBD_SET_BLKSIZE) pass
user-controlled "unsigned long arg" to blk_validate_block_size(),
"unsigned long" should be used for validation.
Christoph Hellwig [Tue, 21 Dec 2021 16:18:51 +0000 (17:18 +0100)]
block: fix error unwinding in device_add_disk
Once device_add is called, disk->ev will be freed by disk_release, so we
should not free it twice. Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.
Jens Axboe [Tue, 21 Dec 2021 03:32:24 +0000 (20:32 -0700)]
block: fix error in handling dead task for ioprio setting
Don't combine the task-exiting and "already have io_context" cases; we
need to just abort if the task is marked as dead. Return -ESRCH, which
is the documented value for ioprio_set() if the specified task could not
be found.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: syzbot+8836466a79f4175961b0@syzkaller.appspotmail.com
Fixes: 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Mon, 20 Dec 2021 20:59:19 +0000 (12:59 -0800)]
blk-mq: check quiesce state before queue_rqs
The low level drivers don't expect to see new requests after a
successful quiesce completes. Check the queue quiesce state within the
rcu protected area prior to calling the driver's queue_rqs().
Wander Lairson Costa [Mon, 20 Dec 2021 19:28:27 +0000 (16:28 -0300)]
blktrace: switch trace spinlock to a raw spinlock
The running_trace_lock protects running_trace_list and is acquired
within the tracepoint, which implies disabled preemption. A spinlock_t
typed lock cannot be acquired with preemption disabled on PREEMPT_RT
because it becomes a sleeping lock.
The runtime of the tracepoint depends on the number of entries in
running_trace_list and has no limit. The blk-tracer is considered debug
code and higher latencies here are okay.
Jens Axboe [Fri, 17 Dec 2021 16:51:05 +0000 (09:51 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
block: remove the rsxx driver
rsxx: Drop PCI legacy power management
mtip32xx: convert to generic power management
mtip32xx: remove pointless drvdata lookups
mtip32xx: remove pointless drvdata checking
drbd: Use struct_group() to zero algs
loop: make autoclear operation asynchronous
null_blk: cast command status to integer
pktdvd: stop using bdi congestion framework.
Jens Axboe [Fri, 17 Dec 2021 16:51:03 +0000 (09:51 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: (23 commits)
block: only build the icq tracking code when needed
block: fold create_task_io_context into ioc_find_get_icq
block: open code create_task_io_context in set_task_ioprio
block: fold get_task_io_context into set_task_ioprio
block: move set_task_ioprio to blk-ioc.c
block: cleanup ioc_clear_queue
block: refactor put_io_context
block: remove the NULL ioc check in put_io_context
block: refactor put_iocontext_active
block: simplify struct io_context refcounting
block: remove the nr_task field from struct io_context
nvme: add support for mq_ops->queue_rqs()
nvme: separate command prep and issue
nvme: split command copy into a helper
block: add mq_ops->queue_rqs hook
block: use singly linked list for bio cache
block: add completion handler for fast path
block: make queue stat accounting a reference
bdev: Improve lookup_bdev documentation
mtd_blkdevs: don't scan partitions for plain mtdblock
...
Jens Axboe [Fri, 17 Dec 2021 16:51:00 +0000 (09:51 -0700)]
Merge branch 'for-5.17/io_uring' into for-next
* for-5.17/io_uring:
io_uring: code clean for some ctx usage
io_uring: batch completion in prior_task_list
io_uring: split io_req_complete_post() and add a helper
io_uring: add helper for task work execution code
io_uring: add a priority tw list for irq completion work
io-wq: add helper to merge two wq_lists
Christoph Hellwig [Thu, 9 Dec 2021 06:31:26 +0000 (07:31 +0100)]
block: cleanup ioc_clear_queue
Fold __ioc_clear_queue into ioc_clear_queue and switch to always
use plain _irq locking instead of the more expensive _irqsave that
is not needed here.
Christoph Hellwig [Thu, 16 Dec 2021 08:42:44 +0000 (09:42 +0100)]
block: remove the rsxx driver
This driver was for rare and short-lived high-end enterprise hardware
and hasn't been maintained since 2014, which also means it never got
converted to use blk-mq.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 18 Nov 2021 15:37:30 +0000 (08:37 -0700)]
nvme: add support for mq_ops->queue_rqs()
This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.
If errors are encountered, leave the affected requests in the passed-in
list. The block layer will then handle them individually.
This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 3 Dec 2021 13:48:53 +0000 (06:48 -0700)]
block: add mq_ops->queue_rqs hook
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this; if not, the usual one-by-one path is used.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 1 Dec 2021 22:01:51 +0000 (15:01 -0700)]
block: add completion handler for fast path
The batched completions only deal with non-partial requests anyway,
and they don't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 15 Dec 2021 00:23:05 +0000 (17:23 -0700)]
block: make queue stat accounting a reference
kyber turns on IO statistics when it is loaded on a queue, which means
that even if kyber is then later unloaded, we're still stuck with stats
enabled on the queue.
Change the accounting-enabled state from a bool to an int, and pair the
enable call with the equivalent disable call. This ensures that stats get
turned off again appropriately.
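The difference between a flag and a paired counter can be seen in a few lines. This is a minimal userspace sketch with hypothetical names; it leaves out the locking a real queue would need around the counter.

```c
#include <stdbool.h>

struct sk_queue_stats {
    int accounting;   /* number of users that currently need stats */
};

/* Each user that needs stats takes one reference. */
void sk_stats_enable(struct sk_queue_stats *qs)
{
    qs->accounting++;
}

/* Each enable must be paired with exactly one disable. */
void sk_stats_disable(struct sk_queue_stats *qs)
{
    if (qs->accounting > 0)
        qs->accounting--;
}

/* Stats stay on for as long as at least one user holds a reference. */
bool sk_stats_enabled(const struct sk_queue_stats *qs)
{
    return qs->accounting > 0;
}
```

With a plain bool, the first user to unload could not tell whether another user still needed stats, so the only safe choice was to leave them on forever.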
Vaibhav Gupta [Wed, 8 Dec 2021 19:24:48 +0000 (13:24 -0600)]
mtip32xx: convert to generic power management
Convert mtip32xx from legacy PCI power management to the generic power
management framework.
Previously, mtip32xx used legacy PCI power management, where
mtip_pci_suspend() and mtip_pci_resume() were responsible for both
device-specific things and generic PCI things:
Bjorn Helgaas [Wed, 8 Dec 2021 19:24:47 +0000 (13:24 -0600)]
mtip32xx: remove pointless drvdata lookups
Previously we passed a struct pci_dev * to mtip_check_surprise_removal(),
which immediately looked up the driver_data. But all callers already have
the driver_data pointer, so just pass it directly and skip the extra
lookup. No functional change intended.
Kees Cook [Thu, 18 Nov 2021 20:37:12 +0000 (12:37 -0800)]
drbd: Use struct_group() to zero algs
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add a struct_group() for the algs so that memset() can correctly reason
about the size.
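The idea behind the grouping can be reproduced in userspace with an anonymous union. The macro below is a simplified stand-in for the kernel helper, and the struct layout, field names and array size are assumptions made for illustration only.

```c
#include <string.h>

/*
 * Simplified stand-in: declares the members twice inside a union, once
 * anonymously (so p->field still works) and once as a named struct
 * (so the whole group has an address and a size).
 */
#define sk_struct_group(NAME, ...) \
    union { \
        struct { __VA_ARGS__ }; \
        struct { __VA_ARGS__ } NAME; \
    }

#define SK_ALG_NAME_MAX 64   /* assumed size */

struct sk_params {
    int rate;
    sk_struct_group(algs,
        char verify_alg[SK_ALG_NAME_MAX];
        char csums_alg[SK_ALG_NAME_MAX];
    );
    int after_algs;
};

/* One memset over the named group; the bounds are visible to the compiler. */
void sk_clear_algs(struct sk_params *p)
{
    memset(&p->algs, 0, sizeof(p->algs));
}
```

The destination of the memset() is now a single object of known size, instead of the first field of a run that the call deliberately writes past.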
Tetsuo Handa [Mon, 13 Dec 2021 12:55:27 +0000 (21:55 +0900)]
loop: make autoclear operation asynchronous
syzbot is reporting a circular locking problem at __loop_clr_fd() [1],
because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
kworker") calls destroy_workqueue() with disk->open_mutex held.
This circular dependency cannot be broken unless we call __loop_clr_fd()
without holding disk->open_mutex. Therefore, defer __loop_clr_fd() from
lo_release() to a WQ context.
Jens Axboe [Fri, 10 Dec 2021 23:32:44 +0000 (16:32 -0700)]
null_blk: cast command status to integer
kernel test robot reports that sparse now triggers a warning on null_blk:
>> drivers/block/null_blk/main.c:1577:55: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int ioerror @@ got restricted blk_status_t [usertype] error @@
drivers/block/null_blk/main.c:1577:55: sparse: expected int ioerror
drivers/block/null_blk/main.c:1577:55: sparse: got restricted blk_status_t [usertype] error
because blk_mq_add_to_batch() takes an integer instead of a blk_status_t.
Just cast this to an integer to silence it, null_blk is the odd one out
here since the command status is the "right" type. If we change the
function type, then we'll have to do that for other callers too (existing
and future ones).
NeilBrown [Fri, 10 Dec 2021 04:31:56 +0000 (21:31 -0700)]
pktdvd: stop using bdi congestion framework.
The bdi congestion framework isn't widely used and should be
deprecated.
pktdvd makes use of it to track congestion, but this can be done
entirely internally to pktdvd, so it doesn't need to use the framework.
So introduce a "congested" flag. When waiting for bio_queue_size to
drop, set this flag and use a var_waitqueue() to wait for it. When
bio_queue_size does drop and this flag is set, clear the flag and call
wake_up_var().
We don't use a wait_var_event macro for the waiting as we need to set
the flag and drop the spinlock before calling schedule(), and while that
is possible with __wait_var_event(), the result is not easy to read.
Hao Xu [Wed, 8 Dec 2021 05:21:25 +0000 (13:21 +0800)]
io_uring: batch completion in prior_task_list
In previous patches, we gathered some tw with io_req_task_complete() as
the callback in prior_task_list. Let's complete them in a batch in the
case where we cannot grab the uring lock. In this way, we batch the
req_complete_post path.
Hao Xu [Tue, 7 Dec 2021 09:39:48 +0000 (17:39 +0800)]
io_uring: add a priority tw list for irq completion work
Now we have a lot of task_work users, and some of them just complete a
req and generate a cqe. Let's put that work on a new tw list which has a
higher priority, so that it can be handled quickly. This reduces average
req latency and lets users issue the next round of sqes earlier.
An illustrative case:
original timeline:
submit_sqe-->irq-->add completion task_work
-->run heavy work0~n-->run completion task_work
new timeline:
submit_sqe-->irq-->add completion task_work
-->run completion task_work-->run heavy work0~n
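The reordering shown in the two timelines can be modelled with two singly linked lists that are drained in a fixed order. This is a userspace sketch with hypothetical sk_ names; only the prior_task_list name is borrowed from the series, and none of the locking or batching is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

struct sk_tw {
    int id;
    struct sk_tw *next;
};

struct sk_tw_list {
    struct sk_tw *first, *last;
};

struct sk_tctx {
    struct sk_tw_list prior_task_list;  /* completion-only work */
    struct sk_tw_list task_list;        /* everything else */
};

static void sk_list_add_tail(struct sk_tw_list *l, struct sk_tw *tw)
{
    tw->next = NULL;
    if (l->last)
        l->last->next = tw;
    else
        l->first = tw;
    l->last = tw;
}

/* Queue work on the priority list or on the normal list. */
void sk_tw_add(struct sk_tctx *t, struct sk_tw *tw, bool priority)
{
    sk_list_add_tail(priority ? &t->prior_task_list : &t->task_list, tw);
}

/* Drains the priority list first, then the normal one; records run order. */
size_t sk_tw_run(struct sk_tctx *t, int *order, size_t cap)
{
    struct sk_tw_list *lists[2] = { &t->prior_task_list, &t->task_list };
    size_t n = 0;

    for (int i = 0; i < 2; i++) {
        for (struct sk_tw *tw = lists[i]->first; tw && n < cap; tw = tw->next)
            order[n++] = tw->id;
        lists[i]->first = lists[i]->last = NULL;
    }
    return n;
}
```

Completion work queued after the heavy items still runs before them, which is the "new timeline" above.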
Limitation: this optimization only helps when submission and reaping
happen in different threads. Otherwise we have to submit new sqes after
returning to userspace anyway, so the order of TWs doesn't matter.
Tested this patch (and the following ones) by manually replacing
__io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct
'heavy' task works. Then tested with fio:
John Garry [Mon, 6 Dec 2021 12:49:50 +0000 (20:49 +0800)]
blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags
Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees
when using a megaraid SAS RAID card since moving to shared tags [0].
Previously, when shared tags were a shared sbitmap, this function was less
than optimal since we would iterate through all tags for all hctxs,
yet only ever match up to tagset-depth rqs.
Since the change to shared tags, things are even less efficient if we have
parallel callers of blk_mq_queue_tag_busy_iter(). This is because in
bt_iter() -> blk_mq_find_and_get_req() there would be more contention on
accessing each request ref and tags->lock since they are now shared among
all HW queues.
Optimise by having separate calls to bt_for_each() for when we're using
shared tags. In this case no longer pass a hctx, as it is no longer
relevant, and teach bt_iter() about this.
Ming suggested something along the lines of this change, apart from a
different implementation.
John Garry [Mon, 6 Dec 2021 12:49:49 +0000 (20:49 +0800)]
blk-mq: Delete busy_iter_fn
Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete
busy_iter_fn to reduce duplication.
It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is
less specific.
However, busy_tag_iter_fn is used in many different parts of the tree,
unlike busy_iter_fn, which is only used in block/. So just take the
straightforward path now, so that we can rename treewide later.
John Garry [Mon, 6 Dec 2021 12:49:48 +0000 (20:49 +0800)]
blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument
The only user of the busy_iter_fn blk_mq_hw_ctx argument is
blk_mq_rq_inflight().
Function blk_mq_rq_inflight() uses the hctx to find the associated request
queue to match against the request. However this same check is already
done in caller bt_iter(), so drop this check.
With that change there are no more users of busy_iter_fn blk_mq_hw_ctx
argument, so drop the argument.
Jens Axboe [Mon, 6 Dec 2021 16:41:50 +0000 (09:41 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk-mq: don't run might_sleep() if the operation needn't blocking
Ming Lei [Mon, 6 Dec 2021 03:33:50 +0000 (11:33 +0800)]
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk_mq_run_dispatch_ops() is defined as a macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add a local variable
to hold the request queue.
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c9267 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
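The hazard is the usual one with function-like macros: an argument that appears twice in the macro body is evaluated twice. The sketch below uses hypothetical types and a stand-in macro, with a counter in place of the real lock/unlock pair, to show how the second evaluation can land on a different queue.

```c
#include <stddef.h>

struct sk_queue { int lock_depth; };
struct sk_request { struct sk_queue *q; struct sk_request *next; };
struct sk_plug { struct sk_request *mq_list; };

/* 'q' appears twice, so the expression passed in is evaluated twice. */
#define sk_run_dispatch_ops(q, ops) \
    do { \
        (q)->lock_depth++; \
        ops; \
        (q)->lock_depth--; \
    } while (0)

/* Stands in for dispatching: pops the first request off the plug list. */
void sk_pop_first(struct sk_plug *plug)
{
    plug->mq_list = plug->mq_list->next;
}

/* Buggy: the "unlock" side re-reads plug->mq_list after it has changed. */
void sk_dispatch_buggy(struct sk_plug *plug)
{
    sk_run_dispatch_ops(plug->mq_list->q, sk_pop_first(plug));
}

/* Fixed: latch the queue in a local before the list is modified. */
void sk_dispatch_fixed(struct sk_plug *plug)
{
    struct sk_queue *q = plug->mq_list->q;

    sk_run_dispatch_ops(q, sk_pop_first(plug));
}
```

In the buggy variant the first queue is left "locked" and the second one is "unlocked" without ever having been locked; with a list of one entry the second evaluation would dereference NULL instead.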
Ming Lei [Mon, 6 Dec 2021 11:12:13 +0000 (19:12 +0800)]
blk-mq: don't run might_sleep() if the operation needn't blocking
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.
Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 5 Dec 2021 14:37:59 +0000 (14:37 +0000)]
io_uring: tweak iopoll CQE_SKIP event counting
When iopolling, userspace specifies the minimum number of "events" it
expects. Previously, we had one CQE per request, so the definition of
an "event" was unequivocal, but that's no longer the case with
REQ_F_CQE_SKIP.
Currently it counts the number of completed requests; replace that with
the number of posted CQEs. This allows users of the "one CQE per link"
scheme to wait for all N links in a single syscall, which is not
possible without the patch and requires extra context switches.
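The two ways of counting can be compared directly. This is a small userspace sketch with hypothetical names, where a boolean stands in for REQ_F_CQE_SKIP on each completed request.

```c
#include <stdbool.h>
#include <stddef.h>

struct sk_completion {
    bool cqe_skip;   /* stands in for REQ_F_CQE_SKIP: no CQE is posted */
};

/* Old accounting: every completed request is one "event". */
size_t sk_events_by_request(const struct sk_completion *c, size_t n)
{
    (void)c;
    return n;
}

/* New accounting: only completions that actually post a CQE count. */
size_t sk_events_by_cqe(const struct sk_completion *c, size_t n)
{
    size_t posted = 0;

    for (size_t i = 0; i < n; i++)
        if (!c[i].cqe_skip)
            posted++;
    return posted;
}
```

For three links of two requests each, with one CQE per link, the old count reaches a minimum of 3 after only three requests have completed, at which point fewer than three CQEs exist; the new count reaches 3 exactly when all three CQEs have been posted.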
Jens Axboe [Fri, 3 Dec 2021 21:51:46 +0000 (14:51 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
blk-mq: run dispatch lock once in case of issuing from list
blk-mq: pass request queue to blk_mq_run_dispatch_ops
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
blk-mq: remove hctx_lock and hctx_unlock
block: switch to atomic_t for request references
block: move direct_IO into our own read_iter handler
mm: move filemap_range_needs_writeback() into header
Ming Lei [Fri, 3 Dec 2021 13:15:34 +0000 (21:15 +0800)]
blk-mq: run dispatch lock once in case of issuing from list
It isn't necessary to call blk_mq_run_dispatch_ops() for each single
request that is issued directly; it is enough to call it once when
issuing from the whole list.
Ming Lei [Fri, 3 Dec 2021 13:15:32 +0000 (21:15 +0800)]
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
In the BLK_MQ_F_BLOCKING case, a per-hctx srcu is used to protect the
dispatch critical area. However, this srcu instance sits at the end of
the hctx and often occupies a standalone cacheline, which is often cold.
Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable, which is allocated from the heap instead of
being embedded; srcu->srcu_idx is only read, not written, in
srcu_read_lock(). So it doesn't matter whether the srcu structure lives
in the hctx or the request queue.
So switch to a per-request-queue srcu for protecting dispatch. This
simplifies quiesce a lot, not to mention that quiesce is always done
request-queue wide.
Jens Axboe [Thu, 14 Oct 2021 20:39:59 +0000 (14:39 -0600)]
block: switch to atomic_t for request references
refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.
This borrows that same implementation, which in turn is based on the
mm implementation from Linus.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
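The pattern referred to here, a plain atomic counter plus a cheap sanity check, can be sketched with C11 atomics. The names are hypothetical and the 127 margin in the check is an assumption for illustration; a real implementation would warn when the check fires rather than just return it.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sk_req {
    atomic_int ref;
};

void sk_ref_init(struct sk_req *r, int v)
{
    atomic_init(&r->ref, v);
}

/*
 * True when the count is zero or has gone slightly negative (an
 * underflow): as an unsigned value it is then within 127 of wrapping.
 */
bool sk_ref_zero_or_close_to_overflow(struct sk_req *r)
{
    return (unsigned int)atomic_load(&r->ref) + 127u <= 127u;
}

/* Take a reference unless the count has already dropped to zero. */
bool sk_ref_inc_not_zero(struct sk_req *r)
{
    int old = atomic_load(&r->ref);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->ref, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when this was the last one. */
bool sk_ref_put_and_test(struct sk_req *r)
{
    return atomic_fetch_sub(&r->ref, 1) == 1;
}
```

The fast paths are single atomic operations; the sanity check is a separate load that can be used for a one-off warning instead of being paid for on every get and put.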
Jens Axboe [Thu, 28 Oct 2021 14:57:09 +0000 (08:57 -0600)]
block: move direct_IO into our own read_iter handler
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just
set it up ourselves and call our own handler. This avoids an indirect call
for O_DIRECT.
Ming Lei [Fri, 3 Dec 2021 08:17:03 +0000 (16:17 +0800)]
block: null_blk: batched complete poll requests
Complete poll requests via blk_mq_add_to_batch() and
blk_mq_end_request_batch(), so that we can cover the batched completion
code path by running null_blk tests.
Meanwhile, this shows a ~14% IOPS boost on 't/io_uring /dev/nullb0'
in my test.
Jens Axboe [Fri, 3 Dec 2021 13:34:00 +0000 (06:34 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
floppy: Add max size check for user space request
floppy: Fix hang in watchdog when disk is ejected
null_blk: allow zero poll queues
Xiongwei Song [Tue, 16 Nov 2021 13:10:33 +0000 (21:10 +0800)]
floppy: Add max size check for user space request
We need to check the maximum size of a request from user space before
allocating pages. If the request size exceeds the limit, return -EINVAL.
This check avoids the warning below from the page allocator.