Jens Axboe [Mon, 6 Dec 2021 16:41:50 +0000 (09:41 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk-mq: don't run might_sleep() if the operation needn't blocking
Ming Lei [Mon, 6 Dec 2021 03:33:50 +0000 (11:33 +0800)]
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk_mq_run_dispatch_ops() is defined as a macro, and plug->mq_list
may change while 'dispatch_ops' runs, so add a local variable
to hold the request queue.
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c9267 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 6 Dec 2021 11:12:13 +0000 (19:12 +0800)]
blk-mq: don't run might_sleep() if the operation needn't blocking
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.
Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
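Taken together, the two fixes above shape the dispatch macro roughly as
follows. This is a condensed sketch, not the exact upstream macro: the
request queue is evaluated once into a local variable before
'dispatch_ops' can change plug->mq_list, and might_sleep() only runs
when the caller may actually block.

  #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
  do {                                                            \
          struct request_queue *q__ = (q); /* evaluate once */    \
          if (!blk_queue_has_srcu(q__)) {                         \
                  rcu_read_lock();                                \
                  (dispatch_ops);                                 \
                  rcu_read_unlock();                              \
          } else {                                                \
                  int srcu_idx;                                   \
                                                                  \
                  if (check_sleep)                                \
                          might_sleep(); /* blocking hctx only */ \
                  srcu_idx = srcu_read_lock(q__->srcu);           \
                  (dispatch_ops);                                 \
                  srcu_read_unlock(q__->srcu, srcu_idx);          \
          }                                                       \
  } while (0)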
Pavel Begunkov [Sun, 5 Dec 2021 14:37:59 +0000 (14:37 +0000)]
io_uring: tweak iopoll CQE_SKIP event counting
When iopolling, userspace specifies the minimum number of "events" it
expects. Previously we had one CQE per request, so the definition of
an "event" was unequivocal, but that is no longer the case with
REQ_F_CQE_SKIP.
Currently the code counts the number of completed requests; replace
that with the number of posted CQEs. This allows users of the "one CQE
per link" scheme to wait for all N links in a single syscall, which is
not possible without this patch and requires extra context switches.
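In the iopoll completion loop this boils down to counting at CQE-post
time, sketched below with simplified names:

  /* count posted CQEs, not completed requests */
  list_for_each_entry(req, &done, inflight_entry) {
          if (req->flags & REQ_F_CQE_SKIP)
                  continue;       /* completes without posting a CQE */

          __io_fill_cqe(ctx, req->user_data, req->result, req->cflags);
          nr_events++;    /* compared against the user's minimum */
  }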
Jens Axboe [Fri, 3 Dec 2021 21:51:46 +0000 (14:51 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
blk-mq: run dispatch lock once in case of issuing from list
blk-mq: pass request queue to blk_mq_run_dispatch_ops
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
blk-mq: remove hctx_lock and hctx_unlock
block: switch to atomic_t for request references
block: move direct_IO into our own read_iter handler
mm: move filemap_range_needs_writeback() into header
Ming Lei [Fri, 3 Dec 2021 13:15:34 +0000 (21:15 +0800)]
blk-mq: run dispatch lock once in case of issuing from list
When issuing requests directly from a list, it isn't necessary to call
blk_mq_run_dispatch_ops() once per single request; it is enough to do
it one time for the whole list.
Ming Lei [Fri, 3 Dec 2021 13:15:32 +0000 (21:15 +0800)]
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
In case of BLK_MQ_F_BLOCKING, a per-hctx srcu instance is used to
protect the dispatch critical section. However, this srcu instance sits
at the end of the hctx and often occupies a standalone, frequently cold
cacheline.
Inside srcu_read_lock() and srcu_read_unlock(), writes always go to an
indirect percpu variable that is allocated from the heap rather than
embedded, and srcu->srcu_idx is only read in srcu_read_lock(). So it
doesn't matter whether the srcu structure lives in the hctx or in the
request queue.
Switch to a per-request-queue srcu for protecting dispatch. This
simplifies quiesce a lot, not to mention that quiesce is always done
request-queue wide anyway.
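In outline (condensed from the series), the srcu_struct becomes a
flexible-array tail of the request queue, allocated only for
BLK_MQ_F_BLOCKING devices, and waiting for quiesce reduces to a single
queue-wide check:

  struct request_queue {
          /* ... */
          struct srcu_struct      srcu[]; /* allocated only if blocking */
  };

  void blk_mq_wait_quiesce_done(struct request_queue *q)
  {
          if (blk_queue_has_srcu(q))
                  synchronize_srcu(q->srcu);
          else
                  synchronize_rcu();
  }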
Jens Axboe [Thu, 14 Oct 2021 20:39:59 +0000 (14:39 -0600)]
block: switch to atomic_t for request references
refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.
This borrows that same implementation, which in turn is based on the
mm implementation from Linus.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
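The borrowed pattern looks roughly like this: plain atomic_t operations
plus a cheap sanity check that the count is neither zero nor close to
overflow, instead of refcount_t's saturation logic:

  #define req_ref_zero_or_close_to_overflow(req)  \
          ((unsigned int) atomic_read(&(req)->ref) + 127u <= 127u)

  static inline bool req_ref_inc_not_zero(struct request *req)
  {
          return atomic_inc_not_zero(&req->ref);
  }

  static inline bool req_ref_put_and_test(struct request *req)
  {
          WARN_ON_ONCE(req_ref_zero_or_close_to_overflow(req));
          return atomic_dec_and_test(&req->ref);
  }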
Jens Axboe [Thu, 28 Oct 2021 14:57:09 +0000 (08:57 -0600)]
block: move direct_IO into our own read_iter handler
Don't call into generic_file_read_iter() if we know it's O_DIRECT;
just set it up ourselves and call our own handler. This avoids an
indirect call for O_DIRECT.
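Schematically (the real handler also falls back to buffered reads on
short or failed O_DIRECT IO; blkdev_direct_read is an assumed name):

  static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
  {
          if (iocb->ki_flags & IOCB_DIRECT)
                  return blkdev_direct_read(iocb, to); /* no indirect call */

          return filemap_read(iocb, to, 0);
  }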
Ming Lei [Fri, 3 Dec 2021 08:17:03 +0000 (16:17 +0800)]
block: null_blk: batched complete poll requests
Complete poll requests via blk_mq_add_to_batch() and
blk_mq_end_request_batch(), so that the batched completion code path is
covered by running the null_blk tests.
Meanwhile, this shows a ~14% IOPS boost on 't/io_uring /dev/nullb0' in
my test.
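The pattern in a ->poll handler, sketched (null_fetch_completed is an
assumed helper; error handling trimmed):

  static int null_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
  {
          struct nullb_cmd *cmd;
          int nr = 0;

          while ((cmd = null_fetch_completed(hctx)) != NULL) {
                  /* fall back to one-by-one completion if batching is off */
                  if (!blk_mq_add_to_batch(cmd->rq, iob, cmd->error,
                                           blk_mq_end_request_batch))
                          blk_mq_end_request(cmd->rq, cmd->error);
                  nr++;
          }
          return nr;
  }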
Jens Axboe [Fri, 3 Dec 2021 13:34:00 +0000 (06:34 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
floppy: Add max size check for user space request
floppy: Fix hang in watchdog when disk is ejected
null_blk: allow zero poll queues
Xiongwei Song [Tue, 16 Nov 2021 13:10:33 +0000 (21:10 +0800)]
floppy: Add max size check for user space request
We need to check the maximum request size coming from user space before
allocating pages. If the request size exceeds the limit, return -EINVAL.
This check avoids a warning from the page allocator.
Tasos Sahanidis
floppy: Fix hang in watchdog when disk is ejected
When the watchdog detects a disk change, it calls cancel_activity(),
which in turn tries to cancel the fd_timer delayed work.
In the above scenario, fd_timer_fn is set to fd_watchdog(), meaning
it is trying to cancel its own work.
This results in a hang as cancel_delayed_work_sync() is waiting for the
watchdog (itself) to return, which never happens.
This can be reproduced relatively consistently by attempting to read a
broken floppy, and ejecting it while IO is being attempted and retried.
To resolve this, this patch calls cancel_delayed_work() instead, which
cancels the work without waiting for the watchdog to return and finish.
Before this regression was introduced, the code in this section used
del_timer(), and not del_timer_sync() to delete the watchdog timer.
Link: https://lore.kernel.org/r/399e486c-6540-db27-76aa-7a271b061f76@tasossah.com
Fixes: 070ad7e793dc ("floppy: convert to delayed work and single-thread wq")
Signed-off-by: Tasos Sahanidis <tasos@tasossah.com>
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
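Condensed, the fixed path looks like:

  static void cancel_activity(void)
  {
          do_floppy = NULL;
          /* fd_timer may currently be running fd_watchdog(), i.e. our
           * own caller; a _sync cancel would wait for it to return and
           * hang, so cancel without waiting. */
          cancel_delayed_work(&fd_timer);
          cancel_work_sync(&floppy_work);
  }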
Jens Axboe [Fri, 3 Dec 2021 02:39:19 +0000 (19:39 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
block: fix double bio queue when merging in cached request path
block: get rid of useless goto and label in blk_mq_get_new_requests()
Jens Axboe [Thu, 2 Dec 2021 19:43:46 +0000 (12:43 -0700)]
block: fix double bio queue when merging in cached request path
When we attempt to merge off the cached request path, we return NULL
if successful. This makes the caller believe that it should allocate
a new request, and hence we end up with the bio both merged and
associated with a new request. This, predictably, leads to all sorts of
crashes.
Pass in a pointer to the bio pointer, and clear it for the merge case.
Then the caller knows that the bio is already queued, and no new requests
need to get allocated.
Fixes: 5b13bc8a3fd5 ("blk-mq: cleanup request allocation")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
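The resulting caller-side shape is roughly:

  rq = blk_mq_get_cached_request(q, plug, &bio, nsegs);
  if (!rq) {
          if (!bio)
                  return; /* bio was merged, must not be queued again */
          rq = blk_mq_get_new_requests(q, plug, bio, nsegs);
          if (!rq)
                  return;
  }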
Ye Bin [Mon, 29 Nov 2021 01:26:59 +0000 (09:26 +0800)]
block: Fix fsync always failed if once failed
We ran tests with fault injection based on v4.19; after testing for
some time we found that sync /dev/sda always failed.
[root@localhost] sync /dev/sda
sync: error syncing '/dev/sda': Input/output error
Commit 8d6996630c03 introduced 'fq->rq_status'; this field is only
updated while the 'flush_rq' reference count is non-zero. If a flush
request fails once, the error code is recorded in 'fq->rq_status', and
if there is then no further chance to update 'fq->rq_status', every
subsequent fsync will fail.
To address this issue, reset 'fq->rq_status' after returning the error
code to the upper layer.
Fixes: 8d6996630c03 ("block: fix null pointer dereference in blk_mq_rq_timed_out()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
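The reset is essentially a one-time consume of the saved status in the
flush completion path, sketched:

  /* in flush_end_io(), after picking up the saved error: */
  if (fq->rq_status != BLK_STS_OK) {
          error = fq->rq_status;
          fq->rq_status = BLK_STS_OK;     /* don't poison later flushes */
  }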
Jens Axboe [Mon, 29 Nov 2021 13:42:05 +0000 (06:42 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: (72 commits)
scsi: remove the gendisk argument to scsi_ioctl
block: remove the gendisk argument to blk_execute_rq
block: remove the ->rq_disk field in struct request
block: don't check ->rq_disk in merges
mtd_blkdevs: remove the sector out of range check in do_blktrans_request
block: Remove redundant initialization of variable ret
block: simplify ioc_lookup_icq
block: simplify ioc_create_icq
block: return the io_context from create_task_io_context
block: use alloc_io_context in __copy_io
block: factor out a alloc_io_context helper
block: remove get_io_context_active
block: move the remaining elv.icq handling to the I/O scheduler
block: move blk_mq_sched_assign_ioc to blk-ioc.c
block: mark put_io_context_active static
Revert "block: Provide blk_mq_sched_get_icq()"
bfq: use bfq_bic_lookup in bfq_limit_depth
bfq: simplify bfq_bic_lookup
fork: move copy_io to block/blk-ioc.c
RDMA/qib: rename copy_io to qib_copy_io
...
Tetsuo Handa [Wed, 24 Nov 2021 10:47:40 +0000 (19:47 +0900)]
loop: don't hold lo_mutex during __loop_clr_fd()
syzbot is reporting a circular locking problem at __loop_clr_fd() [1]:
commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
kworker") calls destroy_workqueue() with lo->lo_mutex held.
Since all functions where lo->lo_state matters already check
lo->lo_state with lo->lo_mutex held (in order to avoid racing with e.g.
ioctl(LOOP_CTL_REMOVE)), and __loop_clr_fd() can be called from either
ioctl(LOOP_CLR_FD) or close() but never both, lo->lo_state == Lo_rundown
serves as an exclusive lock for __loop_clr_fd(). Therefore, hold
lo->lo_mutex inside __loop_clr_fd() only when asserting/updating
lo->lo_state.
Since ioctl(LOOP_CLR_FD) depends on lo->lo_state == Lo_bound, a valid
lo->lo_backing_file must have been assigned by ioctl(LOOP_SET_FD) or
ioctl(LOOP_CONFIGURE). Thus, we can remove the lo->lo_backing_file test,
and convert __loop_clr_fd() into a void function.
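Heavily condensed, the resulting locking pattern is:

  static void __loop_clr_fd(struct loop_device *lo)
  {
          struct file *filp;

          /* Lo_rundown is the exclusive "lock"; take lo_mutex only to
           * assert/update lo_state, never across destroy_workqueue() */
          mutex_lock(&lo->lo_mutex);
          BUG_ON(lo->lo_state != Lo_rundown);
          filp = lo->lo_backing_file;
          mutex_unlock(&lo->lo_mutex);

          destroy_workqueue(lo->workqueue);       /* no lo_mutex held here */

          mutex_lock(&lo->lo_mutex);
          lo->lo_state = Lo_unbound;
          mutex_unlock(&lo->lo_mutex);

          fput(filp);
  }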
Christoph Hellwig [Fri, 26 Nov 2021 12:18:01 +0000 (13:18 +0100)]
block: remove the gendisk argument to blk_execute_rq
Remove the gendisk argument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now. Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:15 +0000 (12:58 +0100)]
block: return the io_context from create_task_io_context
Grab a reference to the newly allocated or existing io_context in
create_task_io_context and return it. This simplifies the callers and
removes the need for double lookups.
Christoph Hellwig [Fri, 26 Nov 2021 11:58:14 +0000 (12:58 +0100)]
block: use alloc_io_context in __copy_io
In __copy_io we know that the newly allocated task_struct does not have
an I/O context yet and is not exiting. So just allocate the I/O context
struct and install it directly. There is no need to lock the task
either, as it is just being created.
Christoph Hellwig [Fri, 26 Nov 2021 11:58:10 +0000 (12:58 +0100)]
block: move blk_mq_sched_assign_ioc to blk-ioc.c
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static. Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.
Jan Kara [Thu, 25 Nov 2021 13:36:41 +0000 (14:36 +0100)]
bfq: Do not let waker requests skip proper accounting
Commit 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which adds a
waker's requests directly to the dispatch list. The rationale was that
completing the waker's IO is needed to get more IO for the current
queue. Although this rationale is valid, there is a hole in it. The
waker does not necessarily serve IO only for the current queue, and its
current IO may not be needed for the current queue to make progress.
Furthermore, injecting IO like this completely bypasses any service
accounting within bfq, so we do not properly track how much service the
waker's queue is getting, or whether the waker is actually doing any IO.
Depending on the conditions this can result in the waker getting too
much or too little service.
Despite the processes having very different IO priorities, they get the
same amount of service. The reason is that bfq identifies these
processes as having a waker-wakee relationship and, once that happens,
IO from the fastwriter gets injected during the slowwriter's time slice.
As a result bfq is not aware that the fastwriter has any IO to do and
constantly schedules only the slowwriter's queue. Thus the fastwriter is
forced to compete with the slowwriter's IO all the time instead of
getting its share of time based on IO priority.
Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
Fixes: 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:40 +0000 (14:36 +0100)]
bfq: Log waker detections
Waker-wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.
Jan Kara [Thu, 25 Nov 2021 13:36:39 +0000 (14:36 +0100)]
bfq: Provide helper to generate bfqq name
Instead of having a helper that formats the bfqq pid, provide a helper
to generate the full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.
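A plausible shape for such a helper (simplified from the actual patch):

  static void bfq_bfqq_name(struct bfq_queue *bfqq, char *str, int len)
  {
          /* e.g. "bfq1234S" for the sync queue of pid 1234 */
          snprintf(str, len, "bfq%d%c", bfqq->pid,
                   bfq_bfqq_sync(bfqq) ? 'S' : 'A');
  }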
Jan Kara [Thu, 25 Nov 2021 13:36:38 +0000 (14:36 +0100)]
bfq: Limit waker detection in time
Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A, meaning that completing B's IO is needed for A to make
progress, and we generally stop separating A's and B's IO. This logic
is useful to avoid unnecessary idling, and thus throughput loss, for
cases where the workload needs to switch e.g. between the process and
the journaling thread doing IO. However, the detection heuristic tends
to give frequent false positives when A and B are fighting for IO
bandwidth and other processes aren't doing much IO, as we are then
basically doomed to eventually accumulate three occurrences of a
situation where one process starts issuing requests after the other has
completed some IO. To reduce these false positives, also cancel the
waker detection if we didn't accumulate three detected wakeups within a
given timeout. The rationale is that if wakeups are really rare, the
pointless idling doesn't hurt throughput that much anyway.
This significantly reduces false waker detection for such contending
workloads.
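The timeout check, sketched with assumed field and constant names: a
waker is only confirmed if all three detections land within the window.

  if (bfqq->num_waker_detections == 1)
          bfqq->waker_detection_started = now_ns; /* first hit: open window */

  if (bfqq->num_waker_detections == 3 &&
      now_ns - bfqq->waker_detection_started <= max_waker_window_ns)
          bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq; /* confirmed */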
Jan Kara [Thu, 25 Nov 2021 13:36:37 +0000 (14:36 +0100)]
bfq: Limit number of requests consumed by each cgroup
When cgroup IO scheduling is used with BFQ, it does not really provide
service differentiation if the cgroup drives a big IO depth. That
happens, for example, with writeback, which asynchronously submits lots
of IO, but it can happen with AIO as well. The problem is that if we
have two cgroups that submit IO with different weights, the cgroup with
the higher weight properly gets more IO time and is able to dispatch
more IO. However, this causes the lower weight cgroup to accumulate
more requests inside BFQ, and eventually the lower weight cgroup
consumes most of the IO scheduler tags. At that point the higher weight
cgroup stops getting better service, as it is mostly blocked waiting
for a scheduler tag while its queues inside BFQ are empty, and thus the
lower weight cgroup gets served.
Check how many requests the submitting cgroup has allocated in
bfq_limit_depth(), and if it consumes more requests than what would
correspond to its weight, limit the available depth to 1 so that the
cgroup cannot consume many more requests. With this limitation the
higher weight cgroup gets proper service even with writeback.
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
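Sketched with assumed names (the real code derives the limit from the
stored bitmap depth shift and the entity weights up the hierarchy):

  /* in bfq_limit_depth(), for the bfqq's owning entities: */
  for_each_entity(entity) {
          unsigned int limit = depth * entity->weight / total_weight;

          if (entity->allocated >= limit) {
                  /* over its weighted share: throttle further allocs */
                  data->shallow_depth = 1;
                  break;
          }
  }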
Jan Kara [Thu, 25 Nov 2021 13:36:36 +0000 (14:36 +0100)]
bfq: Store full bitmap depth in bfq_data
Store the bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting the number of
available request tags for a cgroup.
Jan Kara [Thu, 25 Nov 2021 13:36:35 +0000 (14:36 +0100)]
bfq: Track number of allocated requests in bfq_entity
When we want to limit the number of requests used by each bfqq and also
by each cgroup, we need to track the number of requests used by each
cgroup as well. So track the number of allocated requests for each
bfq_entity.
Jan Kara [Thu, 25 Nov 2021 13:36:34 +0000 (14:36 +0100)]
block: Provide blk_mq_sched_get_icq()
Currently we look up the ICQ only after the request is allocated.
However BFQ will want to decide how many scheduler tags it allows a
given bfq queue (effectively a process) to consume based on cgroup
weight. So provide a function blk_mq_sched_get_icq() so that BFQ can
look up the ICQ earlier.
Sebastian Andrzej Siewior [Mon, 25 Oct 2021 07:06:58 +0000 (09:06 +0200)]
mmc: core: Use blk_mq_complete_request_direct().
The completion callback for the sdhci-pci device is invoked from a
kworker.
I couldn't identify in which context mmc_blk_mq_req_done() is invoked,
but the remaining callers are invoked from preemptible context. Here it
would make sense to complete the request directly instead of scheduling
ksoftirqd for its completion.
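Condensed sketch of the change (helper names approximate):

  static void mmc_blk_mq_req_done(struct mmc_request *mrq)
  {
          struct mmc_queue_req *mqrq =
                  container_of(mrq, struct mmc_queue_req, brq.mrq);
          struct request *req = mmc_queue_req_to_req(mqrq);

          /* we run in a kworker, i.e. preemptible context: complete the
           * request right here instead of bouncing via ksoftirqd */
          blk_mq_complete_request_direct(req, mmc_blk_mq_complete_rq);
  }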
Eric Biggers [Wed, 24 Nov 2021 01:37:33 +0000 (17:37 -0800)]
blk-crypto: remove blk_crypto_unregister()
This function is trivial and is only used in one place. Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.
Christoph Hellwig [Wed, 24 Nov 2021 06:28:56 +0000 (07:28 +0100)]
blk-mq: cleanup request allocation
Refactor the request allocation so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self-contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.
There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.
Christoph Hellwig [Tue, 23 Nov 2021 16:04:41 +0000 (17:04 +0100)]
blk-mq: simplify the plug handling in blk_mq_submit_bio
blk_mq_submit_bio has two different plug cases, one that uses full
plugging and one that uses limited plugging.
The limited plugging case is only used for a corner case that does
not matter in real life:
- no ->commit_rqs (so not NVMe)
- no shared tags (so not SCSI)
- not rotational (so no old disk or floppy driver)
- must have multiple queues (so no eMMC)
Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.
Christoph Hellwig [Mon, 22 Nov 2021 13:06:23 +0000 (14:06 +0100)]
block: don't set GENHD_FL_NO_PART for hidden gendisks
Hidden gendisks can't be opened using blkdev_get_*, so we can't really
reach any of the partition scanning paths or partitioning ioctls except
for the initial partition scan from add_disk.
Christoph Hellwig [Mon, 22 Nov 2021 13:06:22 +0000 (14:06 +0100)]
block: remove GENHD_FL_EXT_DEVT
All modern drivers can support extra partitions using the extended
dev_t. In fact except for the ioctl method drivers never even see
partitions in normal operation.
So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.
Christoph Hellwig [Mon, 22 Nov 2021 13:06:20 +0000 (14:06 +0100)]
mmc: don't set GENHD_FL_SUPPRESS_PARTITION_INFO
This manually reverts 07b652cdbec3 ("mmc: card: Don't show eMMC RPMB and
BOOT areas in /proc/partitions"). Based on the commit description that
change was purely cosmetic. mmc is the last driver that sets this
flag and thus prevents it from being removed.
Christoph Hellwig [Mon, 22 Nov 2021 13:06:19 +0000 (14:06 +0100)]
null_blk: don't suppress partitioning information
This manually reverts commit 27290b469051 ("null_blk: suppress invalid
partition info"). The message in that commit log can't appear, as the
flag is never checked during probing, and there is no good reason to
treat null_blk specially in /proc/partitions.
Christoph Hellwig [Mon, 22 Nov 2021 13:06:15 +0000 (14:06 +0100)]
block: remove a dead check in show_partition
disk_max_parts never returns 0, given that ->minors must be non-zero
for devices not using the extended dev_t, and disk_max_parts always
returns DISK_MAX_PARTS for devices that do use it.
Christoph Hellwig [Mon, 22 Nov 2021 13:06:14 +0000 (14:06 +0100)]
block: remove GENHD_FL_CD
GENHD_FL_CD marks a gendisk as a vaguely CD-ROM like device.
Besides being used internally inside of sunvdc.c and xen-blkfront, it
is used by xen-blkback as a hint to claim a device exported to a
guest is a CD-ROM like device. Just check for disk->cdi instead
which is the right indicator for "real" CD-ROM or DVD drivers. This
will miss the paravirtualized guest drivers, but those make little
sense to report anyway.