www.infradead.org Git - users/hch/block.git/log

block: deprecate autoloading based on dev_t

Make the legacy dev_t based autoloading optional and add a deprecation
warning. This kind of autoloading has ceased to be useful about 20 years
ago.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Merge branch 'for-5.17/drivers' into for-next

* for-5.17/drivers:
  nvme: add 'iopolicy' module parameter
  nvme: drop unused variable ctrl in nvme_setup_cmd
  nvme: increment request genctr on completion
  nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics

Merge tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme into for-5.17/drivers

Pull NVMe updates from Christoph:

"nvme updates for Linux 5.17

- increment request genctr on completion (Keith Busch, Geliang Tang)
- add a 'iopolicy' module parameter (Hannes Reinecke)
- print out valid arguments when reading from /dev/nvme-fabrics
   (Hannes Reinecke)"

* tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme:
  nvme: add 'iopolicy' module parameter
  nvme: drop unused variable ctrl in nvme_setup_cmd
  nvme: increment request genctr on completion
  nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics

Merge branch 'for-5.17/io_uring' into for-next

* for-5.17/io_uring:
  io_uring: use completion batching for poll rem/upd
  io_uring: single shot poll removal optimisation
  io_uring: poll rework
  io_uring: kill poll linking optimisation
  io_uring: move common poll bits
  io_uring: refactor poll update
  io_uring: remove double poll on poll update

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use completion batching for poll rem/upd

Use __io_req_complete() in io_poll_update(), so we can utilise
completion batching for both update/remove request and the poll
we're killing (if any).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2bdc6c5abd9e9b80f09b86d8823eb1c780362cd.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: single shot poll removal optimisation

We don't need to poll oneshot request if we've got a desired mask in
io_poll_wake(), task_work will clean it up correctly, but as we already
hold a wq spinlock, we can remove ourselves and save on additional
spinlocking in io_poll_remove_entries().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee170a344a18c9ef36b554d806c64caadfd61c31.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: poll rework

It's not possible to go forward with the current state of io_uring
polling, we need a more straightforward and easier synchronisation.
There are a lot of problems with how it is at the moment, including
missing events on rewait.

The main idea here is to introduce a notion of request ownership while
polling, no one but the owner can modify any part but ->poll_refs of
struct io_kiocb, that grants us protection against all sorts of races.

Main users of such exclusivity are poll task_work handler, so before
queueing a tw one should have/acquire ownership, which will be handed
off to the tw handler.
The other user is __io_arm_poll_handler() do initial poll arming. It
starts taking the ownership, so tw handlers won't be run until it's
released later in the function after vfs_poll. note: also prevents
races in __io_queue_proc().
Poll wake/etc. may not be able to get ownership, then they need to
increase the poll refcount and the task_work should notice it and retry
if necessary, see io_poll_check_events().
There is also IO_POLL_CANCEL_FLAG flag to notify that we want to kill
request.

It makes cancellations more reliable, enables double multishot polling,
fixes double poll rewait, fixes missing poll events and fixes another
bunch of races.

Even though it adds some overhead for new refcounting, and there are a
couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be 1-2%
  of all apoll requests), it removes it doesn't add atomics and removes
  spin_lock/unlock pair.
- works well with multishots, we don't do remove from queue / add to
  queue for each new poll event.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b652927c77ed9580ea4330ac5612f0e0848c946.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill poll linking optimisation

With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move common poll bits

Move some poll helpers/etc up, we'll need them there shortly

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c5c3dba24c86aad5cd389a54a8c7412e6a0621d.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refactor poll update

Clean up io_poll_update() and unify cancellation paths for remove and
update.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove double poll on poll update

Before updating a poll request we should remove it from poll queues,
including the double poll entry.

Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac39e7f80152613603b8a6cc29a2b6063ac2434f.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/io_uring-xattr' into for-next

* for-5.17/io_uring-xattr:
  io_uring: add fgetxattr and getxattr support
  io_uring: add fsetxattr and setxattr support
  fs: split off do_getxattr from getxattr
  fs: split off setxattr_copy and do_setxattr function from setxattr
  fs: split off do_user_path_at_empty from user_path_at_empty()

io_uring: add fgetxattr and getxattr support

This adds support to io_uring for the fgetxattr and getxattr API.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20211223235123.4092764-6-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add fsetxattr and setxattr support

This adds support to io_uring for the fsetxattr and setxattr API.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20211223235123.4092764-5-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: split off do_getxattr from getxattr

This splits off do_getxattr function from the getxattr
function. This will allow io_uring to call it from its
io worker.

Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20211223235123.4092764-4-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: split off setxattr_copy and do_setxattr function from setxattr

This splits of the setup part of the function
setxattr in its own dedicated function called
setxattr_copy. In addition it also exposes a
new function called do_setxattr for making the
setxattr call.

This makes it possible to call these two functions
from io_uring in the processing of an xattr request.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20211223235123.4092764-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: split off do_user_path_at_empty from user_path_at_empty()

This splits off a do_user_path_at_empty function from the
user_path_at_empty_function. This is required so it can be
called from io_uring.

Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20211223235123.4092764-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/drivers' into for-next

* for-5.17/drivers:
block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0

block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0

It isn't correct to set set->nr_maps as 3 if g_poll_queues is > 0 since
we can change it via configfs for null_blk device created there, so only
set it as 3 if active poll_queues is > 0.

Fixes divide zero exception reported by Shinichiro.

Fixes: 2bfdbe8b7ebd ("null_blk: allow zero poll queues")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20211224010831.1521805-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/block' into for-next

* for-5.17/block:
  block: drop needless assignment in set_task_ioprio()
  block: remove unnecessary trailing '\'
  bio.h: fix kernel-doc warnings

block: drop needless assignment in set_task_ioprio()

Commit 5fc11eebb4a9 ("block: open code create_task_io_context in
set_task_ioprio") introduces a needless assignment
'ioc = task->io_context', as the local variable ioc is not further
used before returning.

Even after the further fix, commit a957b61254a7 ("block: fix error in
handling dead task for ioprio setting"), the assignment still remains
needless.

Drop this needless assignment in set_task_ioprio().

This code smell was identified with 'make clang-analyzer'.

Fixes: 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211223125300.20691-1-lukas.bulwahn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: add 'iopolicy' module parameter

While the 'iopolicy' sysfs attribute can be set at runtime, most
storage arrays prefer to use the 'round-robin' iopolicy per default.
We can use udev rules to set this, but is getting rather unwieldy
for rebranded arrays as we would have to update the udev rules
anytime a new array shows up, leading to the same mess we currently
have in multipathd for configuring the RDAC arrays.

Hence this patch adds a module parameter 'iopolicy' to allow the
admin to switch the default, and to do away with the need for a
udev rule here.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: drop unused variable ctrl in nvme_setup_cmd

The variable 'ctrl' became useless since the code using it was dropped
from nvme_setup_cmd() in the commit 292ddf67bbd5 ("nvme: increment
request genctr on completion"). Fix it to get rid of this compilation
warning in the nvme-5.17 branch:

drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
^~~~

Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: increment request genctr on completion

The nvme request generation counter is intended to catch duplicate
completions. Incrementing the counter on submission means duplicates can
only be caught if the request tag is reallocated and dispatched prior to
the driver observing the corrupted CQE. Incrementing on completion
removes this window, making it possible to detect duplicate completions
in consecutive entries.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics

Currently applications have a hard time figuring out which
nvme-over-fabrics arguments are supported for any given kernel;
the ioctl will return an error code on failure, and the application
has to guess whether this was due to an invalid argument or due
to a connection or controller error.
With this patch applications can read a list of supported
arguments by simply reading from /dev/nvme-fabrics, allowing
them to validate the connection string.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

block: remove unnecessary trailing '\'

While harmless, the blank line is certainly not intended to be part of
the rq_list_for_each() macro. Remove it.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211222215239.1768164-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bio.h: fix kernel-doc warnings

Fix all kernel-doc warnings in <linux/bio.h>:

include/linux/bio.h:136: warning: Function parameter or member 'nbytes' not described in 'bio_advance'
include/linux/bio.h:136: warning: Excess function parameter 'bytes' description in 'bio_advance'
include/linux/bio.h:391: warning: No description found for return value of 'bio_next_split'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20211222211532.24060-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/io_uring-getdents64' into for-next

* for-5.17/io_uring-getdents64:
  io_uring: add support for getdents64
  fs: split off vfs_getdents function of getdents64 syscall
  fs: add offset parameter to iterate_dir function

Merge branch 'for-5.17/block' into for-next

* for-5.17/block:
  block: check minor range in device_add_disk()
  block: use "unsigned long" for blk_validate_block_size().
  block: fix error unwinding in device_add_disk
  block: call blk_exit_queue() before freeing q->stats
  block: fix error in handling dead task for ioprio setting
  blk-mq: blk-mq: check quiesce state before queue_rqs
  blktrace: switch trace spinlock to a raw spinlock

io_uring: add support for getdents64

This adds support for getdents64 to io_uring.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211221164004.119663-4-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: split off vfs_getdents function of getdents64 syscall

This splits off the vfs_getdents function from the getdents64 system
call. This allows io_uring to call the vfs_getdents function.

Signed-off-by: Stefan Roesch <shr@fb.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20211221164004.119663-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: add offset parameter to iterate_dir function

This change adds the offset parameter to the iterate_dir
function. The offset paramater allows the caller to specify
the offset position.

This change is required to support getdents in io_uring.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20211221164004.119663-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: check minor range in device_add_disk()

ioctl(fd, LOOP_CTL_ADD, 1048576) causes

sysfs: cannot create duplicate filename '/dev/block/7:0'

message because such request is treated as if ioctl(fd, LOOP_CTL_ADD, 0)
due to MINORMASK == 1048575. Verify that all minor numbers for that device
fit in the minor range.

Reported-by: wangyangbo <wangyangbo@uniontech.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/b1b19379-23ee-5379-0eb5-94bf5f79f1b4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: use "unsigned long" for blk_validate_block_size().

Since lo_simple_ioctl(LOOP_SET_BLOCK_SIZE) and ioctl(NBD_SET_BLKSIZE) pass
user-controlled "unsigned long arg" to blk_validate_block_size(),
"unsigned long" should be used for validation.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/9ecbf057-4375-c2db-ab53-e4cc0dff953d@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: fix error unwinding in device_add_disk

One device_add is called disk->ev will be freed by disk_release, so we
should free it twice. Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.

Based on an earlier patch from Tetsuo Handa.

Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Fixes: 83cbce9574462c6b ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221161851.788424-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: call blk_exit_queue() before freeing q->stats

blk_stat_disable_accounting() is added in commit 68497092bde9
("block: make queue stat accounting a reference"), and called in
kyber_exit_sched().

So we have to free q->stats after elevator is unloaded from
blk_exit_queue() in blk_release_queue(). Otherwise kernel panic
is caused.

Fixes: 68497092bde9 ("block: make queue stat accounting a reference")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211221040436.1333880-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: fix error in handling dead task for ioprio setting

Don't combine the task exiting and "already have io_context" case, we
need to just abort if the task is marked as dead. Return -ESRCH, which
is the documented value for ioprio_set() if the specified task could not
be found.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: syzbot+8836466a79f4175961b0@syzkaller.appspotmail.com
Fixes: 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: blk-mq: check quiesce state before queue_rqs

The low level drivers don't expect to see new requests after a
successful quiesce completes. Check the queue quiesce state within the
rcu protected area prior to calling the driver's queue_rqs().

Fixes: 3c67d44de787 ("block: add mq_ops->queue_rqs hook")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211220205919.180191-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blktrace: switch trace spinlock to a raw spinlock

The running_trace_lock protects running_trace_list and is acquired
within the tracepoint which implies disabled preemption. The spinlock_t
typed lock can not be acquired with disabled preemption on PREEMPT_RT
because it becomes a sleeping lock.
The runtime of the tracepoint depends on the number of entries in
running_trace_list and has no limit. The blk-tracer is considered debug
code and higher latencies here are okay.

Make running_trace_lock a raw_spinlock_t.

Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Link: https://lore.kernel.org/r/20211220192827.38297-1-wander@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/drivers' into for-next

* for-5.17/drivers:
  block: remove the rsxx driver
  rsxx: Drop PCI legacy power management
  mtip32xx: convert to generic power management
  mtip32xx: remove pointless drvdata lookups
  mtip32xx: remove pointless drvdata checking
  drbd: Use struct_group() to zero algs
  loop: make autoclear operation asynchronous
  null_blk: cast command status to integer
  pktdvd: stop using bdi congestion framework.

Merge branch 'for-5.17/block' into for-next

* for-5.17/block: (23 commits)
  block: only build the icq tracking code when needed
  block: fold create_task_io_context into ioc_find_get_icq
  block: open code create_task_io_context in set_task_ioprio
  block: fold get_task_io_context into set_task_ioprio
  block: move set_task_ioprio to blk-ioc.c
  block: cleanup ioc_clear_queue
  block: refactor put_io_context
  block: remove the NULL ioc check in put_io_context
  block: refactor put_iocontext_active
  block: simplify struct io_context refcounting
  block: remove the nr_task field from struct io_context
  nvme: add support for mq_ops->queue_rqs()
  nvme: separate command prep and issue
  nvme: split command copy into a helper
  block: add mq_ops->queue_rqs hook
  block: use singly linked list for bio cache
  block: add completion handler for fast path
  block: make queue stat accounting a reference
  bdev: Improve lookup_bdev documentation
  mtd_blkdevs: don't scan partitions for plain mtdblock
  ...

Merge branch 'for-5.17/io_uring' into for-next

* for-5.17/io_uring:
  io_uring: code clean for some ctx usage
  io_uring: batch completion in prior_task_list
  io_uring: split io_req_complete_post() and add a helper
  io_uring: add helper for task work execution code
  io_uring: add a priority tw list for irq completion work
  io-wq: add helper to merge two wq_lists

block: only build the icq tracking code when needed

Only bfq needs to code to track icq, so make it conditional.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: fold create_task_io_context into ioc_find_get_icq

Fold create_task_io_context into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: open code create_task_io_context in set_task_ioprio

The flow in set_task_ioprio can be simplified by simply open coding
create_task_io_context, which removes a refcount roundtrip on the I/O
context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: fold get_task_io_context into set_task_ioprio

Fold get_task_io_context into its only caller, and simplify the code
as no reference to the I/O context is required to just set the ioprio
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: move set_task_ioprio to blk-ioc.c

Keep set_task_ioprio with the other low-level code that accesses the
io_context structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: cleanup ioc_clear_queue

Fold __ioc_clear_queue into ioc_clear_queue and switch to always
use plain _irq locking instead of the more expensive _irqsave that
is not needed here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: refactor put_io_context

Move the code to delay freeing the icqs into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove the NULL ioc check in put_io_context

No caller passes in a NULL pointer, so remove the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: refactor put_iocontext_active

Factor out a ioc_exit_icqs helper to tear down the icqs and the fold
the rest of put_iocontext_active into exit_io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: simplify struct io_context refcounting

Don't hold a reference to ->refcount for each active reference, but
just one for all active references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove the nr_task field from struct io_context

Nothing ever looks at ->nr_tasks, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove the rsxx driver

This driver was for rare and shortlived high end enterprise hardware
and hasn't been maintained since 2014, which also means it never got
converted to use blk-mq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: add support for mq_ops->queue_rqs()

This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.

If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.

This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: separate command prep and issue

Add a nvme_prep_rq() helper to setup a command, and nvme_queue_rq() is
adapted to use this helper.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: split command copy into a helper

We'll need it for batched submit as well. Since we now have a copy
helper, get rid of the nvme_submit_cmd() wrapper.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: add mq_ops->queue_rqs hook

If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this, if not the usual one-by-one path is used.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: use singly linked list for bio cache

Pointless to maintain a head/tail for the list, as we never need to
access the tail. Entries are always LIFO for cache hotness reasons.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: add completion handler for fast path

The batched completions only deal with non-partial requests anyway,
and it doesn't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: make queue stat accounting a reference

kyber turns on IO statistics when it is loaded on a queue, which means
that even if kyber is then later unloaded, we're still stuck with stats
enabled on the queue.

Change the account enabled from a bool to an int, and pair the enable call
with the equivalent disable call. This ensures that stats gets turned off
again appropriately.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

rsxx: Drop PCI legacy power management

The rsxx driver doesn't support device suspend, so remove
rsxx_pci_suspend(), the legacy PCI .suspend() method, completely.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20211208192449.146076-5-helgaas@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtip32xx: convert to generic power management

Convert mtip32xx from legacy PCI power management to the generic power
management framework.

Previously, mtip32xx used legacy PCI power management, where
mtip_pci_suspend() and mtip_pci_resume() were responsible for both
device-specific things and generic PCI things:

  mtip_pci_suspend
    mtip_block_suspend(dd)              <-- device-specific
    pci_save_state(pdev)                <-- generic PCI
    pci_set_power_state(pdev, pci_choose_state(pdev, state))

  mtip_pci_resume
    pci_set_power_state(PCI_D0)         <-- generic PCI
    pci_restore_state(pdev)             <-- generic PCI
    pcim_enable_device(pdev)            <-- generic PCI
    pci_set_master(pdev)                <-- generic PCI
    mtip_block_resume(dd)               <-- device-specific

With generic power management, the PCI bus PM methods do the generic PCI
things, and the driver needs only the device-specific part, i.e.,

  suspend_devices_and_enter
    dpm_suspend_start(PMSG_SUSPEND)
      pci_pm_suspend                    # PCI bus .suspend() method
        mtip_pci_suspend                # dev->driver->pm->suspend
          mtip_block_suspend            <-- device-specific
    suspend_enter
      dpm_suspend_noirq(PMSG_SUSPEND)
        pci_pm_suspend_noirq            # PCI bus .suspend_noirq() method
          pci_save_state                <-- generic PCI
          pci_prepare_to_sleep          <-- generic PCI
            pci_set_power_state
    ...
    dpm_resume_end(PMSG_RESUME)
      pci_pm_resume                     # PCI bus .resume() method
        pci_restore_standard_config
          pci_set_power_state(PCI_D0)   <-- generic PCI
          pci_restore_state             <-- generic PCI
        mtip_pci_resume                 # dev->driver->pm->resume
          mtip_block_resume             <-- device-specific

[bhelgaas: commit log]

Link: https://lore.kernel.org/r/20210114115423.52414-2-vaibhavgupta40@gmail.com
Signed-off-by: Vaibhav Gupta <vaibhavgupta40@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20211208192449.146076-4-helgaas@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtip32xx: remove pointless drvdata lookups

Previously we passed a struct pci_dev * to mtip_check_surprise_removal(),
which immediately looked up the driver_data. But all callers already have
the driver_data pointer, so just pass it directly and skip the extra
lookup. No functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20211208192449.146076-3-helgaas@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtip32xx: remove pointless drvdata checking

The .suspend() and .resume() methods are only called after the .probe()
method (mtip_pci_probe()) has set the drvdata and returned success.

Therefore, if we get to mtip_pci_suspend() or mtip_pci_resume(), the
drvdata must be valid. Drop the unnecessary checking.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20211208192449.146076-2-helgaas@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: code clean for some ctx usage

There are some functions doing ctx = req->ctx while still using
req->ctx, update those places.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211214055904.61772-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

drbd: Use struct_group() to zero algs

In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.

Add a struct_group() for the algs so that memset() can correctly reason
about the size.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20211118203712.1288866-1-keescook@chromium.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

loop: make autoclear operation asynchronous

syzbot is reporting circular locking problem at __loop_clr_fd() [1], for
commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
is calling destroy_workqueue() with disk->open_mutex held.

This circular dependency cannot be broken unless we call __loop_clr_fd()
without holding disk->open_mutex. Therefore, defer __loop_clr_fd() from
lo_release() to a WQ context.

Link: https://syzkaller.appspot.com/bug?extid=643e4ce4b6ad1347d372
Reported-by: syzbot <syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Tested-by: syzbot+643e4ce4b6ad1347d372@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1ed7df28-ebd6-71fb-70e5-1c2972e05ddb@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bdev: Improve lookup_bdev documentation

Add a Context section and rewrite the rest to be clearer.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Link: https://lore.kernel.org/r/20211213171113.3097631-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtd_blkdevs: don't scan partitions for plain mtdblock

mtdblock / mtdblock_ro set part_bits to 0 and thus nevever scanned
partitions. Restore that behavior by setting the GENHD_FL_NO_PART flag.

Fixes: 1ebe2e5f9d68e94c ("block: remove GENHD_FL_EXT_DEVT")
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20211206070409.2836165-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

null_blk: cast command status to integer

kernel test robot reports that sparse now triggers a warning on null_blk:

>> drivers/block/null_blk/main.c:1577:55: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected int ioerror @@     got restricted blk_status_t [usertype] error @@
   drivers/block/null_blk/main.c:1577:55: sparse:     expected int ioerror
   drivers/block/null_blk/main.c:1577:55: sparse:     got restricted blk_status_t [usertype] error

because blk_mq_add_to_batch() takes an integer instead of a blk_status_t.
Just cast this to an integer to silence it, null_blk is the odd one out
here since the command status is the "right" type. If we change the
function type, then we'll have do that for other callers too (existing and
future ones).

Fixes: 2385ebf38f94 ("block: null_blk: batched complete poll requests")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

pktdvd: stop using bdi congestion framework.

The bdi congestion framework isn't widely used and should be
deprecated.

pktdvd makes use of it to track congestion, but this can be done
entirely internally to pktdvd, so it doesn't need to use the framework.

So introduce a "congested" flag. When waiting for bio_queue_size to
drop, set this flag and a var_waitqueue() to wait for it. When
bio_queue_size does drop and this flag is set, clear the flag and call
wake_up_var().

We don't use a wait_var_event macro for the waiting as we need to set
the flag and drop the spinlock before calling schedule() and while that
is possible with __wait_var_event(), result is not easy to read.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/163910843527.9928.857338663717630212@noble.neil.brown.name
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: batch completion in prior_task_list

In previous patches, we have already gathered some tw with
io_req_task_complete() as callback in prior_task_list, let's complete
them in batch while we cannot grab uring lock. In this way, we batch
the req_complete_post path.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211208052125.351587-1-haoxu@linux.alibaba.com
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: split io_req_complete_post() and add a helper

Split io_req_complete_post(), this is a prep for the next patch.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-5-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add helper for task work execution code

Add a helper for task work execution code. We will use it later.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-4-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add a priority tw list for irq completion work

Now we have a lot of task_work users, some are just to complete a req
and generate a cqe. Let's put the work to a new tw list which has a
higher priority, so that it can be handled quickly and thus to reduce
avg req latency and users can issue next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization is only for those that submission and
reaping process are in different threads. Otherwise anyhow we have to
submit new sqes after returning to userspace, then the order of TWs
doesn't matter.

Tested this patch(and the following ones) by manually replace
__io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
'heavy' task works. Then test with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth.
The peak IOPS for this patch is 710K, while the old one is 665K.
For avg latency, difference shows when iodepth grow:
depth and avg latency(usec):
depth      new          old
1        7.05         7.10
2        8.47         8.60
4        10.42        10.42
8        13.78        13.22
16       27.41        24.33
32       49.40        53.08
64       102.53       103.36
128      196.98       205.61
256      372.99       414.88
         512      747.23       791.30
         1024     1472.59      1538.72
         2048     3153.49      3329.01
         4096     6387.86      6682.54
         8192     12150.25     12774.14
         16384    23085.58     26044.71

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211207093951.247840-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io-wq: add helper to merge two wq_lists

add a helper to merge two wq_lists, it will be useful in the next
patches.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags

Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees
using megaraid SAS RAID card since moving to shared tags [0].

Previously, when shared tags was shared sbitmap, this function was less
than optimum since we would iter through all tags for all hctx's,
yet only ever match upto tagset depth number of rqs.

Since the change to shared tags, things are even less efficient if we have
parallel callers of blk_mq_queue_tag_busy_iter(). This is because in
bt_iter() -> blk_mq_find_and_get_req() there would be more contention on
accessing each request ref and tags->lock since they are now shared among
all HW queues.

Optimise by having separate calls to bt_for_each() for when we're using
shared tags. In this case no longer pass a hctx, as it is no longer
relevant, and teach bt_iter() about this.

Ming suggested something along the lines of this change, apart from a
different implementation.

[0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reported-and-tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: e155b0c238b2 ("blk-mq: Use shared tags for shared sbitmap support")
Link: https://lore.kernel.org/r/1638794990-137490-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: Delete busy_iter_fn

Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete
busy_iter_fn to reduce duplication.

It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is
less specific.

However busy_tag_iter_fn is used in many different parts of the tree,
unlike busy_iter_fn which is just use in block/, so just take the
straightforward path now, so that we could rename later treewide.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument

The only user of blk_mq_hw_ctx blk_mq_hw_ctx argument is
blk_mq_rq_inflight().

Function blk_mq_rq_inflight() uses the hctx to find the associated request
queue to match against the request. However this same check is already
done in caller bt_iter(), so drop this check.

With that change there are no more users of busy_iter_fn blk_mq_hw_ctx
argument, so drop the argument.

Reviewed-by Hannes Reinecke <hare@suse.de>

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/block' into for-next

* for-5.17/block:
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk-mq: don't run might_sleep() if the operation needn't blocking

blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()

blk_mq_run_dispatch_ops() is defined as one macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add one local variable
for holding request queue.

Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c9267 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: don't run might_sleep() if the operation needn't blocking

The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.

Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/io_uring' into for-next

* for-5.17/io_uring:
  io_uring: reuse io_req_task_complete for timeouts
  io_uring: tweak iopoll CQE_SKIP event counting
  io_uring: simplify selected buf handling
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()

io_uring: reuse io_req_task_complete for timeouts

With kbuf unification io_req_task_complete() is now a generic function,
use it for timeout's tw completions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7142fa3cbaf3a4140d59bcba45cbe168cf40fac2.1638714983.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: tweak iopoll CQE_SKIP event counting

When iopolling the userspace specifies the minimum number of "events" it
expects. Previously, we had one CQE per request, so the definition of
an "event" was unequivocal, but that's not more the case anymore with
REQ_F_CQE_SKIP.

Currently it counts the number of completed requests, replace it with
the number of posted CQEs. This allows users of the "one CQE per link"
scheme to wait for all N links in a single syscall, which is not
possible without the patch and requires extra context switches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5a965c4d2249827392037bbd0186f87fea49c55.1638714983.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: simplify selected buf handling

As selected buffers are now stored in a separate field in a request, get
rid of rw/recv specific helpers and simplify the code.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bd4a866d8d91b044f748c40efff9e4eacd07536e.1638714983.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: move up io_put_kbuf() and io_put_rw_kbuf()

Move them up to avoid explicit declaration. We will use them in later
patches.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3631243d6fc4a79bbba0cd62597fc8cd5be95924.1638714983.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/block' into for-next

* for-5.17/block:
  blk-mq: run dispatch lock once in case of issuing from list
  blk-mq: pass request queue to blk_mq_run_dispatch_ops
  blk-mq: move srcu from blk_mq_hw_ctx to request_queue
  blk-mq: remove hctx_lock and hctx_unlock
  block: switch to atomic_t for request references
  block: move direct_IO into our own read_iter handler
  mm: move filemap_range_needs_writeback() into header

blk-mq: run dispatch lock once in case of issuing from list

It isn't necessary to call blk_mq_run_dispatch_ops() once for issuing
single request directly, and enough to do it one time when issuing from
whole list.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: pass request queue to blk_mq_run_dispatch_ops

We have switched to allocate srcu into request queue, so it is fine
to pass request queue to blk_mq_run_dispatch_ops().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: move srcu from blk_mq_hw_ctx to request_queue

In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch
critical area. However, this srcu instance stays at the end of hctx, and
it often takes standalone cacheline, often cold.

Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable which is allocated from heap instead of
being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It
doesn't matter if srcu structure stays in hctx or request queue.

So switch to per-request-queue srcu for protecting dispatch, and this
way simplifies quiesce a lot, not mention quiesce is always done on the
request queue wide.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: remove hctx_lock and hctx_unlock

Remove hctx_lock and hctx_unlock, and add one helper of
blk_mq_run_dispatch_ops() to run code block defined in dispatch_ops
with rcu/srcu read held.

Compared with hctx_lock()/hctx_unlock():

1) remove 2 branch to 1, so we just need to check
(hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops

2) srcu_idx needn't to be touched in case of non-blocking

3) might_sleep_if() can be moved to the blocking branch

Also put the added blk_mq_run_dispatch_ops() in private header, so that
the following patch can use it out of blk-mq.c.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: switch to atomic_t for request references

refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.

This borrows that same implementation, which in turn is based on the
mm implementation from Linus.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: move direct_IO into our own read_iter handler

Don't call into generic_file_read_iter() if we know it's O_DIRECT, just
set it up ourselves and call our own handler. This avoids an indirect call
for O_DIRECT.

Fall back to filemap_read() if we fail.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: move filemap_range_needs_writeback() into header

No functional changes in this patch, just in preparation for efficiently
calling this light function from the block O_DIRECT handling.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/drivers' into for-next

* for-5.17/drivers:
block: null_blk: batched complete poll requests

block: null_blk: batched complete poll requests

Complete poll requests via blk_mq_add_to_batch() and
blk_mq_end_request_batch(), so that we can cover batched complete
code path by running null_blk test.

Meantime this way shows ~14% IOPS boost on 't/io_uring /dev/nullb0'
in my test.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203081703.3506020-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'for-5.17/drivers' into for-next

* for-5.17/drivers:
  floppy: Add max size check for user space request
  floppy: Fix hang in watchdog when disk is ejected
  null_blk: allow zero poll queues

floppy: Add max size check for user space request

We need to check the max request size that is from user space before
allocating pages. If the request size exceeds the limit, return -EINVAL.
This check can avoid the warning below from page allocator.

WARNING: CPU: 3 PID: 16525 at mm/page_alloc.c:5344 current_gfp_context include/linux/sched/mm.h:195 [inline]
WARNING: CPU: 3 PID: 16525 at mm/page_alloc.c:5344 __alloc_pages+0x45d/0x500 mm/page_alloc.c:5356
Modules linked in:
CPU: 3 PID: 16525 Comm: syz-executor.3 Not tainted 5.15.0-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:__alloc_pages+0x45d/0x500 mm/page_alloc.c:5344
Code: be c9 00 00 00 48 c7 c7 20 4a 97 89 c6 05 62 32 a7 0b 01 e8 74 9a 42 07 e9 6a ff ff ff 0f 0b e9 a0 fd ff ff 40 80 e5 3f eb 88 <0f> 0b e9 18 ff ff ff 4c 89 ef 44 89 e6 45 31 ed e8 1e 76 ff ff e9
RSP: 0018:ffffc90023b87850 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 1ffff92004770f0b RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000033 RDI: 0000000000010cc1
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffffffff81bb4686 R11: 0000000000000001 R12: ffffffff902c1960
R13: 0000000000000033 R14: 0000000000000000 R15: ffff88804cf64a30
FS: 0000000000000000(0000) GS:ffff88802cd00000(0063) knlGS:00000000f44b4b40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 000000002c921000 CR3: 000000004f507000 CR4: 0000000000150ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
__get_free_pages+0x8/0x40 mm/page_alloc.c:5418
raw_cmd_copyin drivers/block/floppy.c:3113 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3160 [inline]
fd_locked_ioctl+0x12e5/0x2820 drivers/block/floppy.c:3528
fd_ioctl drivers/block/floppy.c:3555 [inline]
fd_compat_ioctl+0x891/0x1b60 drivers/block/floppy.c:3869
compat_blkdev_ioctl+0x3b8/0x810 block/ioctl.c:662
__do_compat_sys_ioctl+0x1c7/0x290 fs/ioctl.c:972
do_syscall_32_irqs_on arch/x86/entry/common.c:112 [inline]
__do_fast_syscall_32+0x65/0xf0 arch/x86/entry/common.c:178
do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:203
entry_SYSENTER_compat_after_hwframe+0x4d/0x5c

Reported-by: syzbot+23a02c7df2cf2bc93fa2@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20211116131033.27685-1-sxwjean@me.com
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>