Christoph Hellwig [Tue, 11 Jan 2022 09:08:49 +0000 (10:08 +0100)]
block: remove blk_needs_flush_plug
blk_needs_flush_plug forgets to check the callbacks list and is one of
a few reasons why blkdev.h needs to pull in sched.h. Remove it and just
make blk_flush_plug check if there is a plug before calling out of line,
which gets us 90% of the advantages without poking into details in the
header.
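A minimal sketch of the resulting helper shape (an approximation of the approach described above, not necessarily the exact upstream code):

  /* in blkdev.h: only call out of line if there is actually a plug */
  static inline void blk_flush_plug(struct blk_plug *plug, bool async)
  {
          if (plug)
                  __blk_flush_plug(plug, async);
  }

The NULL check stays inline, so the common no-plug case costs nothing, while the plug internals stay in block/ instead of the header.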
Eric Biggers [Thu, 9 Dec 2021 00:38:29 +0000 (16:38 -0800)]
docs: sysfs-block: fill in missing documentation from queue-sysfs.rst
sysfs documentation is supposed to go in Documentation/ABI/.
However, /sys/block/<disk>/queue/* are documented in
Documentation/block/queue-sysfs.rst, and sometimes redundantly in
Documentation/ABI/stable/sysfs-block too.
Let's consolidate this documentation into Documentation/ABI/.
Therefore, copy the relevant docs from queue-sysfs.rst into sysfs-block.
This primarily means adding the 25 missing files that were documented in
queue-sysfs.rst only, as well as mentioning the RO/RW status of files.
Documentation/ABI/ requires "Date" and "Contact" fields. For the Date
fields, I used the date of the commit which added support for each file.
For the "Contact" fields, I used linux-block.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211209003833.6396-5-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Thu, 9 Dec 2021 00:38:27 +0000 (16:38 -0800)]
docs: sysfs-block: sort alphabetically
Sort the documentation for the files alphabetically by file path so that
there is a logical order and it's clear where to add new files.
With two small exceptions, this patch doesn't change the documentation
itself and just reorders it:
- In /sys/block/<disk>/<part>/stat, I replaced <part> with <partition>
to be consistent with the other files.
- The description for /sys/block/<disk>/<part>/stat referred to another
file "above", which I reworded.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211209003833.6396-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Eric Biggers [Thu, 9 Dec 2021 00:38:26 +0000 (16:38 -0800)]
docs: sysfs-block: move to stable directory
The block layer sysfs ABI is widely used by userspace software and is
considered stable.
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211209003833.6396-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 4 Jan 2022 13:42:23 +0000 (21:42 +0800)]
block: don't protect submit_bio_checks by q_usage_counter
Commit cc9c884dd7f4 ("block: call submit_bio_checks under q_usage_counter") uses q_usage_counter to protect submit_bio_checks, to avoid IO after the disk has been deleted by del_gendisk().
It turns out the protection isn't necessary, because once blk_mq_freeze_queue_wait() in del_gendisk() returns:
blk_mq_freeze_queue_wait() in del_gendisk() returns:
1) all in-flight IO has completed
2) all new IO will be failed in __bio_queue_enter(), because q_usage_counter is dead and GD_DEAD is set
3) both the disk and the request queue instance are safe, since the caller of submit_bio() guarantees that the disk can't be closed.
Since submit_bio_checks() no longer needs the protection of q_usage_counter, we can move it before the calls to blk_mq_submit_bio() and ->submit_bio(). With this change we no longer throttle the queue while holding an allocated request, so no precise driver tag or request is wasted on throttling. Meanwhile, the bio checks are unified for both bio-based and request-based drivers.
Pavel Begunkov [Sun, 9 Jan 2022 00:53:22 +0000 (00:53 +0000)]
io_uring: fix not released cached task refs
tctx_task_work() may get run after io_uring cancellation, and then there will be no one to put the cached tctx task refs that may have been added back by tw handlers using the inline completion infrastructure. Call io_uring_drop_tctx_refs() at the end of the main tw handler to release them.
Jens Axboe [Thu, 6 Jan 2022 19:37:04 +0000 (12:37 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
md: use default_groups in kobj_type
md: Move alloc/free acct bioset in to personality
lib/raid6: Use strict priority ranking for pq gen() benchmarking
lib/raid6: skip benchmark of non-chosen xor_syndrome functions
md: fix spelling of "its"
md: raid456 add nowait support
md: raid10 add nowait support
md: raid1 add nowait support
md: add support for REQ_NOWAIT
md: drop queue limitation for RAID1 and RAID10
md/raid5: play nice with PREEMPT_RT
Jens Axboe [Thu, 6 Jan 2022 19:36:04 +0000 (12:36 -0700)]
Merge branch 'md-next' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.17/drivers
Pull MD updates from Song:
"The major changes are:
- REQ_NOWAIT support, by Vishal Verma
- raid6 benchmark optimization, by Dirk Müller
- Fix for acct bioset, by Xiao Ni
- Clean up max_queued_requests, by Mariusz Tkaczyk
- PREEMPT_RT optimization, by Davidlohr Bueso
- Use default_groups in kobj_type, by Greg Kroah-Hartman"
* 'md-next' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/song/md:
md: use default_groups in kobj_type
md: Move alloc/free acct bioset in to personality
lib/raid6: Use strict priority ranking for pq gen() benchmarking
lib/raid6: skip benchmark of non-chosen xor_syndrome functions
md: fix spelling of "its"
md: raid456 add nowait support
md: raid10 add nowait support
md: raid1 add nowait support
md: add support for REQ_NOWAIT
md: drop queue limitation for RAID1 and RAID10
md/raid5: play nice with PREEMPT_RT
Greg Kroah-Hartman [Thu, 6 Jan 2022 10:03:35 +0000 (11:03 +0100)]
md: use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a kobj_type: through the default_attrs field, and through the default_groups field. Move the md rdev sysfs code to use the default_groups field, which has been the preferred way since commit aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type"), so that we can soon get rid of the obsolete default_attrs field.
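The conversion follows the usual pattern; a hedged sketch (the attribute names here are illustrative, not necessarily md's exact ones):

  static struct attribute *rdev_default_attrs[] = {
          &rdev_state_attr.attr,
          &rdev_errors_attr.attr,
          NULL,
  };
  ATTRIBUTE_GROUPS(rdev_default);         /* emits rdev_default_groups[] */

  static struct kobj_type rdev_ktype = {
          .release        = rdev_free,
          .sysfs_ops      = &rdev_sysfs_ops,
          .default_groups = rdev_default_groups,  /* was .default_attrs */
  };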
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Song Liu <song@kernel.org>
Xiao Ni [Fri, 10 Dec 2021 09:31:15 +0000 (17:31 +0800)]
md: Move alloc/free acct bioset in to personality
The acct bioset is only needed for raid0 and raid5, so md_run only allocates it for those levels. However, this does not cover personality takeover, which may leave the bioset uninitialized, for example via the following repro steps:
Fix this by moving the alloc/free of the acct bioset into pers->run and pers->free.
While we are at it, properly handle md_integrity_register() errors in raid0_run().
Fixes: daee2024715d (md: check level before create and exit io_acct_set)
Cc: stable@vger.kernel.org
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Dirk Müller [Wed, 5 Jan 2022 16:38:47 +0000 (17:38 +0100)]
lib/raid6: Use strict priority ranking for pq gen() benchmarking
On x86_64, currently 3 variants of AVX512, 3 variants of AVX2 and 3 variants of SSE2 are benchmarked on initialization, taking between 144-153 jiffies. Testing across a hardware pool of various generations of Intel CPUs, I could not find a single case where SSE2 won over AVX2 or AVX512. There are cases where AVX2 wins over AVX512, however.
Change "prefer" into an integer priority field (similar to how recov selection works) so that more than one ranking level is available; this is backwards compatible with the existing behavior.
Give the AVX2/512 variants higher priority than SSE2 in order to skip SSE benchmarking when AVX is available. In an AVX2/x86_64/HZ=250 case this saves on the order of 200 ms of initialization time.
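A hedged sketch of the resulting selection loop (field and helper names approximate; bench_gen_syndrome() is a hypothetical stand-in for the benchmarking step):

  const struct raid6_calls *const *algo;
  const struct raid6_calls *best = NULL;
  unsigned long perf, bestperf = 0;

  for (algo = raid6_algos; *algo; algo++) {
          if ((*algo)->valid && !(*algo)->valid())
                  continue;
          /* a strictly lower-priority variant can never win once a
           * higher-priority one is usable, so skip its benchmark */
          if (best && (*algo)->priority < best->priority)
                  continue;
          perf = bench_gen_syndrome(*algo);       /* hypothetical helper */
          if (!best || (*algo)->priority > best->priority ||
              perf > bestperf) {
                  best = *algo;
                  bestperf = perf;
          }
  }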
Signed-off-by: Dirk Müller <dmueller@suse.de>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <song@kernel.org>
Dirk Müller [Wed, 5 Jan 2022 16:38:46 +0000 (17:38 +0100)]
lib/raid6: skip benchmark of non-chosen xor_syndrome functions
In commit fe5cbc6e06c7 ("md/raid6 algorithms: delta syndrome functions"), xor_syndrome() benchmarking was also added to the raid6_choose_gen() function. However, the results of that benchmarking were intentionally discarded and did not influence the choice: the xor_syndrome() variant belonging to the best-performing gen_syndrome() was picked regardless.
Reduce the runtime of raid6_choose_gen() without changing its outcome by only benchmarking the xor_syndrome() of the best gen_syndrome() variant.
For an HZ=250 x86_64 system with AVX2 and without AVX512, this removes 5 out of 6 xor() benchmarks, saving 340 ms of raid6 initialization time.
Signed-off-by: Dirk Müller <dmueller@suse.de>
Signed-off-by: Song Liu <song@kernel.org>
Randy Dunlap [Sun, 26 Dec 2021 02:24:11 +0000 (18:24 -0800)]
md: fix spelling of "its"
Use the possessive "its" instead of the contraction "it's"
in printed messages.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Song Liu <song@kernel.org>
Vishal Verma [Tue, 21 Dec 2021 20:06:21 +0000 (20:06 +0000)]
md: raid10 add nowait support
This adds nowait support to the RAID10 driver, very similar to the raid1 driver changes. It makes the RAID10 driver return EAGAIN in situations where it would otherwise wait, e.g.:
- waiting for the barrier,
- a reshape operation,
- a discard operation.
wait_barrier() and regular_request_wait() are modified to return bool so that barrier waits can report failure. They return true if the wait was performed or was not required, and false if a wait was required but skipped in order to support nowait.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Song Liu <song@kernel.org>
Vishal Verma [Tue, 21 Dec 2021 20:06:20 +0000 (20:06 +0000)]
md: raid1 add nowait support
This adds nowait support to the RAID1 driver. It makes the RAID1 driver return EAGAIN in situations where it would otherwise wait, e.g.:
- waiting for the barrier.
wait_barrier() is modified to return bool so that barrier waits can report failure. It returns true if the wait was performed or was not required, and false if a wait was required but skipped in order to support nowait.
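A hedged sketch of that calling convention (the predicate and the wait helper are hypothetical names, not the driver's exact code):

  /* true: proceed; false: we would have had to block */
  static bool wait_barrier(struct r1conf *conf, bool nowait)
  {
          bool ret = true;

          spin_lock_irq(&conf->resync_lock);
          if (barrier_is_raised(conf)) {          /* hypothetical predicate */
                  if (nowait)
                          ret = false;
                  else
                          wait_for_barrier_drop(conf);    /* hypothetical */
          }
          spin_unlock_irq(&conf->resync_lock);
          return ret;
  }

  /* caller, in the make_request path */
  if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
          bio_wouldblock_error(bio);      /* ends the bio with BLK_STS_AGAIN */
          return;
  }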
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Song Liu <song@kernel.org>
Vishal Verma [Tue, 21 Dec 2021 20:06:19 +0000 (20:06 +0000)]
md: add support for REQ_NOWAIT
Commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support for checking whether a given bdev supports handling of REQ_NOWAIT. Since then, commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable it for linear target") added REQ_NOWAIT support for dm. This takes a similar approach to incorporate REQ_NOWAIT for md-based bios.
This patch was tested using the t/io_uring tool that ships with fio. An NVMe drive was partitioned into 2 partitions and a simple RAID 0 configuration /dev/md0 was created:

  md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
        937423872 blocks super 1.2 512k chunks
Before the patch, we can see an iou-wrk-38397 io worker thread being created, which happens when io_uring sees that the underlying device (/dev/md0 in this case) doesn't support nowait. After applying this patch, no io worker thread is created, which indicates that io_uring saw that the underlying device does support nowait. This is the exact behaviour observed on a dm device, which also supports nowait.
For all the other raid personalities except raid0, we would need to adapt the pieces involving their make_request functions in order for them to correctly handle REQ_NOWAIT.
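A hedged sketch of the flag propagation, following the dm-style approach mentioned above (the helper name is hypothetical):

  /* advertise nowait support only if every member rdev supports it */
  static bool md_supports_nowait(struct mddev *mddev)
  {
          struct md_rdev *rdev;

          rdev_for_each(rdev, mddev)
                  if (!blk_queue_nowait(bdev_get_queue(rdev->bdev)))
                          return false;
          return true;
  }

  if (md_supports_nowait(mddev))
          blk_queue_flag_set(QUEUE_FLAG_NOWAIT, mddev->queue);

With the flag set, the block core lets REQ_NOWAIT bios through, and the personality's make_request path must then fail them with bio_wouldblock_error() instead of sleeping.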
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Song Liu <song@kernel.org>
Mariusz Tkaczyk [Fri, 17 Dec 2021 09:29:55 +0000 (10:29 +0100)]
md: drop queue limitation for RAID1 and RAID10
As suggested by Neil Brown[1], this limitation seems to be deprecated.
With plugging in use, writes are processed behind the raid thread and conf->pending_count is not increased, so the limitation only takes effect when the caller doesn't use plugging.
It can be avoided, and often is (with plugging). There are no reports of the queue growing to an enormous size, so remove the queue limitation for non-plugged IO too.
Davidlohr Bueso [Tue, 16 Nov 2021 01:23:17 +0000 (17:23 -0800)]
md/raid5: play nice with PREEMPT_RT
raid_run_ops() relies on implicitly disabled preemption for its percpu ops, although this is really about CPU locality. This breaks RT semantics, as it can take regular (and thus sleeping) spinlocks, such as stripe_lock.
Add a local_lock such that non-RT behavior does not change and continues to just map to preempt_disable()/preempt_enable(), while RT stays happy, as the region will use a per-CPU spinlock and thus be preemptible while still guaranteeing CPU locality.
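A hedged sketch of the pattern (the struct layout is approximate):

  #include <linux/local_lock.h>

  struct raid5_percpu {
          local_lock_t lock;      /* guards this CPU's scribble buffers */
          /* per-CPU working memory follows */
  };

  /* !PREEMPT_RT: compiles down to preempt_disable()/preempt_enable();
   * PREEMPT_RT: a per-CPU spinlock, so the section stays preemptible
   * while CPU locality is still guaranteed */
  local_lock(&conf->percpu->lock);
  /* run the percpu ops */
  local_unlock(&conf->percpu->lock);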
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
Greg Kroah-Hartman [Tue, 4 Jan 2022 16:29:47 +0000 (17:29 +0100)]
block/rnbd-clt-sysfs: use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a kobj_type: through the default_attrs field, and through the default_groups field. Move the rnbd controller sysfs code to use the default_groups field, which has been the preferred way since commit aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type"), so that we can soon get rid of the obsolete default_attrs field.
Cc: "Md. Haris Iqbal" <haris.iqbal@ionos.com> Cc: Jack Wang <jinpu.wang@ionos.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> Link: https://lore.kernel.org/r/20220104162947.1320936-1-gregkh@linuxfoundation.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 5 Jan 2022 19:26:19 +0000 (12:26 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
pktcdvd: convert to use attribute groups
block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0
nvme: add 'iopolicy' module parameter
nvme: drop unused variable ctrl in nvme_setup_cmd
nvme: increment request genctr on completion
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Keith Busch [Wed, 5 Jan 2022 17:05:18 +0000 (09:05 -0800)]
nvme-pci: fix queue_rqs list splitting
If command prep fails, current handling will orphan subsequent requests
in the list. Consider a simple example:
rqlist = [ 1 -> 2 ]
When prep for request '1' fails, it will be appended to the
'requeue_list', leaving request '2' disconnected from the original
rqlist and no longer tracked. Meanwhile, rqlist is still pointing to the
failed request '1' and will attempt to submit the unprepped command.
Fix this by updating the rqlist accordingly using the request list
helper functions.
Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-5-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Wed, 5 Jan 2022 17:05:17 +0000 (09:05 -0800)]
block: introduce rq_list_move
When iterating a list, a particular request may need to be moved for
special handling. Provide a helper function to achieve that so drivers
don't need to reimplement rqlist manipulation.
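A hedged sketch of what such a helper looks like (an approximation, not guaranteed to match the upstream code exactly):

  /*
   * Move @rq from the @src list to the @dst list. @prev is the request
   * preceding @rq in @src, or NULL if @rq is the head of @src.
   */
  static inline void rq_list_move(struct request **src, struct request **dst,
                                  struct request *rq, struct request *prev)
  {
          if (prev)
                  prev->rq_next = rq->rq_next;
          else
                  *src = rq->rq_next;
          rq_list_add(dst, rq);
  }

This is the kind of helper the nvme-pci fix above uses to shunt a failed request onto the requeue_list without orphaning the rest of the rqlist.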
Greg Kroah-Hartman [Mon, 3 Jan 2022 16:24:08 +0000 (17:24 +0100)]
pktcdvd: convert to use attribute groups
There is no need to create kobject children of the pktcdvd device just to display a subdirectory name. Instead, use a named attribute group, which removes the extra kobjects and also fixes the userspace race where the device is created but tools like libudev cannot see the attributes, because they think the subdirectories are some other sort of device.
Jens Axboe [Wed, 29 Dec 2021 17:50:50 +0000 (09:50 -0800)]
Merge tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme into for-5.17/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.17
- increment request genctr on completion (Keith Busch, Geliang Tang)
- add an 'iopolicy' module parameter (Hannes Reinecke)
- print out valid arguments when reading from /dev/nvme-fabrics
(Hannes Reinecke)"
* tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme:
nvme: add 'iopolicy' module parameter
nvme: drop unused variable ctrl in nvme_setup_cmd
nvme: increment request genctr on completion
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Pavel Begunkov [Wed, 15 Dec 2021 22:08:49 +0000 (22:08 +0000)]
io_uring: single shot poll removal optimisation
We don't need to keep polling a oneshot request once we've got the desired mask in io_poll_wake(); task_work will clean it up correctly. But as we already hold the wq spinlock there, we can remove ourselves right away and save the additional spinlocking in io_poll_remove_entries().
Pavel Begunkov [Wed, 15 Dec 2021 22:08:48 +0000 (22:08 +0000)]
io_uring: poll rework
It's not possible to go forward with the current state of io_uring polling; we need more straightforward and easier synchronisation. There are a lot of problems with how it is at the moment, including missing events on rewait.
The main idea here is to introduce a notion of request ownership while polling: no one but the owner can modify any part of struct io_kiocb except ->poll_refs, which grants us protection against all sorts of races. The main user of such exclusivity is the poll task_work handler, so before queueing a tw one should have/acquire ownership, which will be handed off to the tw handler.
The other user is __io_arm_poll_handler(), which does the initial poll arming. It starts by taking the ownership, so tw handlers won't run until it is released later in the function, after vfs_poll(). Note: this also prevents races in __io_queue_proc().
Poll wake/etc. may not be able to get ownership; in that case they increase the poll refcount, and the task_work should notice it and retry if necessary, see io_poll_check_events(). There is also an IO_POLL_CANCEL_FLAG flag to notify that we want to kill the request.
It makes cancellations more reliable, enables double multishot polling,
fixes double poll rewait, fixes missing poll events and fixes another
bunch of races.
Even though it adds some overhead for the new refcounting, there are a couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be 1-2% of all apoll requests), it doesn't add atomics and removes a spin_lock/unlock pair.
- works well with multishots, we don't do remove from queue / add to
queue for each new poll event.
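A hedged sketch of the ownership primitive (the mask layout and helper name are approximate):

  #define IO_POLL_CANCEL_FLAG     BIT(31)
  #define IO_POLL_REF_MASK        GENMASK(30, 0)

  /*
   * True if we transitioned 0 -> 1 and now own the request; otherwise
   * the current owner will observe the elevated count in
   * io_poll_check_events() and retry on our behalf.
   */
  static bool io_poll_get_ownership(struct io_kiocb *req)
  {
          return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
  }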
Pavel Begunkov [Wed, 15 Dec 2021 22:08:47 +0000 (22:08 +0000)]
io_uring: kill poll linking optimisation
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.
Ming Lei [Fri, 24 Dec 2021 01:08:31 +0000 (09:08 +0800)]
block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0
It isn't correct to set set->nr_maps to 3 whenever g_poll_queues is > 0, since the value can be changed via configfs for null_blk devices created there; only set it to 3 if the active poll_queues is > 0.
This fixes a divide-by-zero exception reported by Shinichiro.
Fixes: 2bfdbe8b7ebd ("null_blk: allow zero poll queues")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20211224010831.1521805-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lukas Bulwahn [Thu, 23 Dec 2021 12:53:00 +0000 (13:53 +0100)]
block: drop needless assignment in set_task_ioprio()
Commit 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio") introduces a needless assignment 'ioc = task->io_context', as the local variable ioc is not used again before returning.
Even after the further fix, commit a957b61254a7 ("block: fix error in
handling dead task for ioprio setting"), the assignment still remains
needless.
Drop this needless assignment in set_task_ioprio().
This code smell was identified with 'make clang-analyzer'.
Fixes: 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211223125300.20691-1-lukas.bulwahn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 20 Dec 2021 12:51:45 +0000 (13:51 +0100)]
nvme: add 'iopolicy' module parameter
While the 'iopolicy' sysfs attribute can be set at runtime, most storage arrays prefer to use the 'round-robin' iopolicy by default. We can use udev rules to set this, but that is getting rather unwieldy for rebranded arrays, as we would have to update the udev rules every time a new array shows up, leading to the same mess we currently have in multipathd for configuring the RDAC arrays.
Hence this patch adds a module parameter 'iopolicy' to allow the admin to switch the default, and to do away with the need for a udev rule here.
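Assuming the parameter lives in the nvme-core module (where the multipath iopolicy handling resides), the default would then be switchable with something like:

  nvme_core.iopolicy=round-robin

on the kernel command line, or the equivalent modprobe option.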
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Geliang Tang [Wed, 22 Dec 2021 09:32:44 +0000 (17:32 +0800)]
nvme: drop unused variable ctrl in nvme_setup_cmd
The variable 'ctrl' became useless once the code using it was dropped from nvme_setup_cmd() in commit 292ddf67bbd5 ("nvme: increment request genctr on completion"). Remove it to get rid of this compilation warning in the nvme-5.17 branch:
  drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
  drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
    struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
                      ^~~~
Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Mon, 13 Dec 2021 17:08:47 +0000 (09:08 -0800)]
nvme: increment request genctr on completion
The nvme request generation counter is intended to catch duplicate
completions. Incrementing the counter on submission means duplicates can
only be caught if the request tag is reallocated and dispatched prior to
the driver observing the corrupted CQE. Incrementing on completion
removes this window, making it possible to detect duplicate completions
in consecutive entries.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Tue, 7 Dec 2021 13:55:49 +0000 (14:55 +0100)]
nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
Currently applications have a hard time figuring out which
nvme-over-fabrics arguments are supported for any given kernel;
the ioctl will return an error code on failure, and the application
has to guess whether this was due to an invalid argument or due
to a connection or controller error.
With this patch applications can read a list of supported
arguments by simply reading from /dev/nvme-fabrics, allowing
them to validate the connection string.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Randy Dunlap [Wed, 22 Dec 2021 21:15:32 +0000 (13:15 -0800)]
bio.h: fix kernel-doc warnings
Fix all kernel-doc warnings in <linux/bio.h>:
include/linux/bio.h:136: warning: Function parameter or member 'nbytes' not described in 'bio_advance'
include/linux/bio.h:136: warning: Excess function parameter 'bytes' description in 'bio_advance'
include/linux/bio.h:391: warning: No description found for return value of 'bio_next_split'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20211222211532.24060-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 21 Dec 2021 19:16:17 +0000 (12:16 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
block: check minor range in device_add_disk()
block: use "unsigned long" for blk_validate_block_size().
block: fix error unwinding in device_add_disk
block: call blk_exit_queue() before freeing q->stats
block: fix error in handling dead task for ioprio setting
blk-mq: check quiesce state before queue_rqs
blktrace: switch trace spinlock to a raw spinlock
Tetsuo Handa
block: check minor range in device_add_disk()
ioctl(fd, LOOP_CTL_ADD, 1048576) triggers a "sysfs: cannot create duplicate filename" message, because such a request is treated as if ioctl(fd, LOOP_CTL_ADD, 0), due to MINORMASK == 1048575. Verify that all minor numbers for that device fit in the minor range.
Tetsuo Handa [Sat, 18 Dec 2021 09:41:56 +0000 (18:41 +0900)]
block: use "unsigned long" for blk_validate_block_size().
Since lo_simple_ioctl(LOOP_SET_BLOCK_SIZE) and ioctl(NBD_SET_BLKSIZE) pass
user-controlled "unsigned long arg" to blk_validate_block_size(),
"unsigned long" should be used for validation.
Christoph Hellwig [Tue, 21 Dec 2021 16:18:51 +0000 (17:18 +0100)]
block: fix error unwinding in device_add_disk
Once device_add() is called, disk->ev will be freed by disk_release(), so we must not free it twice. Fix this by allocating disk->ev after device_add(), so that the extra local unwinding can be removed entirely.
Jens Axboe [Tue, 21 Dec 2021 03:32:24 +0000 (20:32 -0700)]
block: fix error in handling dead task for ioprio setting
Don't combine the task exiting and "already have io_context" case, we
need to just abort if the task is marked as dead. Return -ESRCH, which
is the documented value for ioprio_set() if the specified task could not
be found.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: syzbot+8836466a79f4175961b0@syzkaller.appspotmail.com
Fixes: 5fc11eebb4a9 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Mon, 20 Dec 2021 20:59:19 +0000 (12:59 -0800)]
blk-mq: check quiesce state before queue_rqs
The low level drivers don't expect to see new requests after a
successful quiesce completes. Check the queue quiesce state within the
rcu protected area prior to calling the driver's queue_rqs().
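A hedged sketch of the check (the exact placement and surrounding locking are approximate):

  rcu_read_lock();
  /* a queue_rqs() call must not slip in after a quiesce completed */
  if (q->mq_ops->queue_rqs && !blk_queue_quiesced(q))
          q->mq_ops->queue_rqs(&plug->mq_list);
  rcu_read_unlock();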
Wander Lairson Costa [Mon, 20 Dec 2021 19:28:27 +0000 (16:28 -0300)]
blktrace: switch trace spinlock to a raw spinlock
The running_trace_lock protects running_trace_list and is acquired within the tracepoint, which implies disabled preemption. A spinlock_t typed lock cannot be acquired with preemption disabled on PREEMPT_RT, because there it becomes a sleeping lock.
The runtime of the tracepoint depends on the number of entries in running_trace_list and has no limit. The blk-tracer is considered debug code, and higher latencies here are okay.
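The conversion itself is mechanical; a hedged sketch:

  /* a raw_spinlock_t stays a spinning lock even on PREEMPT_RT */
  static DEFINE_RAW_SPINLOCK(running_trace_lock);

  unsigned long flags;

  raw_spin_lock_irqsave(&running_trace_lock, flags);
  /* walk running_trace_list */
  raw_spin_unlock_irqrestore(&running_trace_lock, flags);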
Jens Axboe [Fri, 17 Dec 2021 16:51:05 +0000 (09:51 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
block: remove the rsxx driver
rsxx: Drop PCI legacy power management
mtip32xx: convert to generic power management
mtip32xx: remove pointless drvdata lookups
mtip32xx: remove pointless drvdata checking
drbd: Use struct_group() to zero algs
loop: make autoclear operation asynchronous
null_blk: cast command status to integer
pktdvd: stop using bdi congestion framework.
Jens Axboe [Fri, 17 Dec 2021 16:51:03 +0000 (09:51 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: (23 commits)
block: only build the icq tracking code when needed
block: fold create_task_io_context into ioc_find_get_icq
block: open code create_task_io_context in set_task_ioprio
block: fold get_task_io_context into set_task_ioprio
block: move set_task_ioprio to blk-ioc.c
block: cleanup ioc_clear_queue
block: refactor put_io_context
block: remove the NULL ioc check in put_io_context
block: refactor put_iocontext_active
block: simplify struct io_context refcounting
block: remove the nr_task field from struct io_context
nvme: add support for mq_ops->queue_rqs()
nvme: separate command prep and issue
nvme: split command copy into a helper
block: add mq_ops->queue_rqs hook
block: use singly linked list for bio cache
block: add completion handler for fast path
block: make queue stat accounting a reference
bdev: Improve lookup_bdev documentation
mtd_blkdevs: don't scan partitions for plain mtdblock
...
Jens Axboe [Fri, 17 Dec 2021 16:51:00 +0000 (09:51 -0700)]
Merge branch 'for-5.17/io_uring' into for-next
* for-5.17/io_uring:
io_uring: code clean for some ctx usage
io_uring: batch completion in prior_task_list
io_uring: split io_req_complete_post() and add a helper
io_uring: add helper for task work execution code
io_uring: add a priority tw list for irq completion work
io-wq: add helper to merge two wq_lists
Christoph Hellwig [Thu, 9 Dec 2021 06:31:26 +0000 (07:31 +0100)]
block: cleanup ioc_clear_queue
Fold __ioc_clear_queue into ioc_clear_queue and switch to always
use plain _irq locking instead of the more expensive _irqsave that
is not needed here.
Christoph Hellwig [Thu, 16 Dec 2021 08:42:44 +0000 (09:42 +0100)]
block: remove the rsxx driver
This driver was for rare and short-lived high-end enterprise hardware and hasn't been maintained since 2014, which also means it never got converted to use blk-mq.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 18 Nov 2021 15:37:30 +0000 (08:37 -0700)]
nvme: add support for mq_ops->queue_rqs()
This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.
If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.
This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 3 Dec 2021 13:48:53 +0000 (06:48 -0700)]
block: add mq_ops->queue_rqs hook
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this, if not the usual one-by-one path is used.
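A hedged sketch of the hook (its position within struct blk_mq_ops is approximate):

  struct blk_mq_ops {
          /* existing callbacks elided */

          /*
           * Optional: called with the whole plug list; requests the
           * driver cannot submit are left on the list and fall back
           * to the regular one-by-one ->queue_rq() path.
           */
          void (*queue_rqs)(struct request **rqlist);
  };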
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 1 Dec 2021 22:01:51 +0000 (15:01 -0700)]
block: add completion handler for fast path
The batched completions only deal with non-partial requests anyway,
and it doesn't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 15 Dec 2021 00:23:05 +0000 (17:23 -0700)]
block: make queue stat accounting a reference
kyber turns on IO statistics when it is loaded on a queue, which means that even if kyber is later unloaded, we're still stuck with stats enabled on the queue.
Change the accounting enable from a bool to an int reference count, and pair each enable call with an equivalent disable call. This ensures that stats get turned off again appropriately.
Vaibhav Gupta [Wed, 8 Dec 2021 19:24:48 +0000 (13:24 -0600)]
mtip32xx: convert to generic power management
Convert mtip32xx from legacy PCI power management to the generic power
management framework.
Previously, mtip32xx used legacy PCI power management, where mtip_pci_suspend() and mtip_pci_resume() were responsible for both device-specific things and generic PCI things.
Bjorn Helgaas [Wed, 8 Dec 2021 19:24:47 +0000 (13:24 -0600)]
mtip32xx: remove pointless drvdata lookups
Previously we passed a struct pci_dev * to mtip_check_surprise_removal(),
which immediately looked up the driver_data. But all callers already have
the driver_data pointer, so just pass it directly and skip the extra
lookup. No functional change intended.
Kees Cook [Thu, 18 Nov 2021 20:37:12 +0000 (12:37 -0800)]
drbd: Use struct_group() to zero algs
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add a struct_group() for the algs so that memset() can correctly reason
about the size.
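A hedged sketch of the struct_group() pattern (the field names are illustrative, not drbd's exact layout):

  #include <linux/stddef.h>       /* struct_group() */

  struct example_packet {
          u32 protocol;
          struct_group(algs,
                  char verify_alg[64];
                  char csums_alg[64];
          );
  };

  /* given struct example_packet *p: zeroes exactly the grouped members,
   * so FORTIFY_SOURCE can verify the write stays inside the group
   * instead of flagging a cross-field memset */
  memset(&p->algs, 0, sizeof(p->algs));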
Tetsuo Handa [Mon, 13 Dec 2021 12:55:27 +0000 (21:55 +0900)]
loop: make autoclear operation asynchronous
syzbot is reporting a circular locking problem at __loop_clr_fd() [1], because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker") calls destroy_workqueue() with disk->open_mutex held.
This circular dependency cannot be broken unless we call __loop_clr_fd() without holding disk->open_mutex. Therefore, defer __loop_clr_fd() from lo_release() to a WQ context.
Jens Axboe [Fri, 10 Dec 2021 23:32:44 +0000 (16:32 -0700)]
null_blk: cast command status to integer
kernel test robot reports that sparse now triggers a warning on null_blk:
>> drivers/block/null_blk/main.c:1577:55: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int ioerror @@ got restricted blk_status_t [usertype] error @@
drivers/block/null_blk/main.c:1577:55: sparse: expected int ioerror
drivers/block/null_blk/main.c:1577:55: sparse: got restricted blk_status_t [usertype] error
because blk_mq_add_to_batch() takes an integer instead of a blk_status_t. Just cast this to an integer to silence it; null_blk is the odd one out here, since its command status is the "right" type. If we changed the function type, then we'd have to do that for the other callers too (existing and future ones).
NeilBrown [Fri, 10 Dec 2021 04:31:56 +0000 (21:31 -0700)]
pktdvd: stop using bdi congestion framework.
The bdi congestion framework isn't widely used and should be
deprecated.
pktdvd makes use of it to track congestion, but this can be done
entirely internally to pktdvd, so it doesn't need to use the framework.
So introduce a "congested" flag. When waiting for bio_queue_size to
drop, set this flag and a var_waitqueue() to wait for it. When
bio_queue_size does drop and this flag is set, clear the flag and call
wake_up_var().
We don't use a wait_var_event macro for the waiting as we need to set
the flag and drop the spinlock before calling schedule() and while that
is possible with __wait_var_event(), result is not easy to read.
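A hedged sketch of both sides of that handshake (the waiter is simplified to the wait_var_event() macro here; as noted above, the actual patch open-codes __wait_var_event()):

  /* submitter: throttle while the bio queue is too long */
  spin_lock(&pd->lock);
  if (pd->bio_queue_size > pd->write_congestion_on) {
          pd->congested = true;
          spin_unlock(&pd->lock);
          wait_var_event(&pd->congested, !READ_ONCE(pd->congested));
  } else {
          spin_unlock(&pd->lock);
  }

  /* queue-drain side: below the low watermark, wake the waiters */
  if (pd->congested && pd->bio_queue_size <= pd->write_congestion_off) {
          pd->congested = false;
          wake_up_var(&pd->congested);
  }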
Hao Xu [Wed, 8 Dec 2021 05:21:25 +0000 (13:21 +0800)]
io_uring: batch completion in prior_task_list
In previous patches, we have already gathered some tw with io_req_task_complete() as the callback in prior_task_list; let's complete them in batch when we cannot grab the uring lock. In this way, we batch the req_complete_post path.