Christoph Hellwig [Tue, 21 Dec 2021 10:03:39 +0000 (11:03 +0100)]
block: fix error unwinding in device_add_disk
Once device_add is called, disk->ev will be freed by disk_release, so we
should not free it twice. Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.
Based on an earlier patch from Tetsuo Handa.
Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com> Fixes: 83cbce9574462c6b ("block: add error handling for device_add_disk / add_disk") Signed-off-by: Christoph Hellwig <hch@lst.de>
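The ownership rule behind this fix can be sketched in user-space C. This is only an illustration: struct disk, disk_release() and add_disk_sketch() are hypothetical stand-ins, not the kernel code. Once the simulated registration has happened, the release callback is the only place that frees ev, so allocating ev after registration leaves nothing for the error path to free by hand.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a gendisk; not the kernel structure. */
struct disk {
    int registered;   /* set once the simulated device_add() succeeded */
    void *ev;         /* event state, owned by release after registration */
};

int ev_frees;         /* counts how often ev was freed, for illustration */

/* Simulated release callback: the single place that frees ev. */
void disk_release(struct disk *d)
{
    if (d->ev) {
        free(d->ev);
        d->ev = NULL;
        ev_frees++;
    }
}

/*
 * Register first, allocate ev afterwards. Any failure after registration
 * is unwound only through disk_release(), so ev is freed at most once.
 */
int add_disk_sketch(struct disk *d, int fail_late)
{
    d->registered = 1;            /* stands in for device_add() */
    d->ev = malloc(16);
    if (!d->ev || fail_late) {
        disk_release(d);          /* no separate free(d->ev) here */
        d->registered = 0;
        return -1;
    }
    return 0;
}
```

With this ordering, a late failure goes through exactly one free of ev, which is the property the commit message describes.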
Jens Axboe [Fri, 17 Dec 2021 16:51:05 +0000 (09:51 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
block: remove the rsxx driver
rsxx: Drop PCI legacy power management
mtip32xx: convert to generic power management
mtip32xx: remove pointless drvdata lookups
mtip32xx: remove pointless drvdata checking
drbd: Use struct_group() to zero algs
loop: make autoclear operation asynchronous
null_blk: cast command status to integer
pktdvd: stop using bdi congestion framework.
Jens Axboe [Fri, 17 Dec 2021 16:51:03 +0000 (09:51 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: (23 commits)
block: only build the icq tracking code when needed
block: fold create_task_io_context into ioc_find_get_icq
block: open code create_task_io_context in set_task_ioprio
block: fold get_task_io_context into set_task_ioprio
block: move set_task_ioprio to blk-ioc.c
block: cleanup ioc_clear_queue
block: refactor put_io_context
block: remove the NULL ioc check in put_io_context
block: refactor put_iocontext_active
block: simplify struct io_context refcounting
block: remove the nr_task field from struct io_context
nvme: add support for mq_ops->queue_rqs()
nvme: separate command prep and issue
nvme: split command copy into a helper
block: add mq_ops->queue_rqs hook
block: use singly linked list for bio cache
block: add completion handler for fast path
block: make queue stat accounting a reference
bdev: Improve lookup_bdev documentation
mtd_blkdevs: don't scan partitions for plain mtdblock
...
Jens Axboe [Fri, 17 Dec 2021 16:51:00 +0000 (09:51 -0700)]
Merge branch 'for-5.17/io_uring' into for-next
* for-5.17/io_uring:
io_uring: code clean for some ctx usage
io_uring: batch completion in prior_task_list
io_uring: split io_req_complete_post() and add a helper
io_uring: add helper for task work execution code
io_uring: add a priority tw list for irq completion work
io-wq: add helper to merge two wq_lists
Christoph Hellwig [Thu, 9 Dec 2021 06:31:26 +0000 (07:31 +0100)]
block: cleanup ioc_clear_queue
Fold __ioc_clear_queue into ioc_clear_queue and switch to always
use plain _irq locking instead of the more expensive _irqsave that
is not needed here.
Christoph Hellwig [Thu, 16 Dec 2021 08:42:44 +0000 (09:42 +0100)]
block: remove the rsxx driver
This driver was for rare and short-lived high-end enterprise hardware
and hasn't been maintained since 2014, which also means it never got
converted to use blk-mq.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 18 Nov 2021 15:37:30 +0000 (08:37 -0700)]
nvme: add support for mq_ops->queue_rqs()
This enables the block layer to send us a full plug list of requests
that need submitting. The block layer guarantees that they all belong
to the same queue, but we do have to check the hardware queue mapping
for each request.
If errors are encountered, leave them in the passed in list. Then the
block layer will handle them individually.
This is good for about a 4% improvement in peak performance, taking us
from 9.6M to 10M IOPS/core.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 3 Dec 2021 13:48:53 +0000 (06:48 -0700)]
block: add mq_ops->queue_rqs hook
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this; if not, the usual one-by-one path is used.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
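A minimal user-space sketch of the optional-hook pattern described above. The types and the demo_* driver functions are hypothetical and much simpler than the real blk-mq structures; the sketch only shows the control flow: offer the whole list to the driver if it provides a batch hook, then push whatever is left through the one-by-one path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request and ops table; not the blk-mq types. */
struct request {
    struct request *next;
    int hw_queue;     /* hardware queue this request maps to */
    int issued;       /* 0 = not issued, 1 = single path, 2 = batch path */
};

struct mq_ops {
    void (*queue_rq)(struct request *rq);       /* mandatory */
    void (*queue_rqs)(struct request **list);   /* optional batch hook */
};

/* Hand the plug list to the driver in one go, then fall back per request. */
void flush_plug_list(const struct mq_ops *ops, struct request **list)
{
    if (ops->queue_rqs)
        ops->queue_rqs(list);

    /* Anything still on the list takes the usual one-by-one path. */
    while (*list) {
        struct request *rq = *list;

        *list = rq->next;
        rq->next = NULL;
        ops->queue_rq(rq);
    }
}

/* Example driver: issues a single request. */
void demo_queue_rq(struct request *rq)
{
    rq->issued = 1;
}

/* Example batch hook: takes requests for hw queue 0, leaves the rest. */
void demo_queue_rqs(struct request **list)
{
    struct request **pp = list;

    while (*pp) {
        struct request *rq = *pp;

        if (rq->hw_queue == 0) {
            *pp = rq->next;       /* unlink and issue via the batch path */
            rq->next = NULL;
            rq->issued = 2;
        } else {
            pp = &rq->next;       /* leave it for the caller */
        }
    }
}
```

Leaving unhandled requests on the passed-in list keeps the hook optional: a driver that sets only queue_rq still works unchanged.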
Jens Axboe [Wed, 1 Dec 2021 22:01:51 +0000 (15:01 -0700)]
block: add completion handler for fast path
The batched completions only deal with non-partial requests anyway,
and they don't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 15 Dec 2021 00:23:05 +0000 (17:23 -0700)]
block: make queue stat accounting a reference
kyber turns on IO statistics when it is loaded on a queue, which means
that even if kyber is then later unloaded, we're still stuck with stats
enabled on the queue.
Change the accounting-enabled state from a bool to an int, and pair the
enable call with the equivalent disable call. This ensures that stats get
turned off again appropriately.
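The bool-to-counter change can be sketched as follows. The struct and function names are hypothetical, chosen only for illustration; the point is that each user pairs one enable with one disable, and stats stay on only while at least one user remains.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue with a counted "stats enabled" state. */
struct queue {
    int stats_users;   /* was a bool; now counts outstanding enable calls */
};

void stats_enable(struct queue *q)
{
    q->stats_users++;
}

void stats_disable(struct queue *q)
{
    assert(q->stats_users > 0);   /* every disable must pair with an enable */
    q->stats_users--;
}

bool stats_enabled(const struct queue *q)
{
    return q->stats_users > 0;
}
```

With a plain bool, the first user to unload would either switch stats off for everyone or, as described above, never switch them off at all.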
Vaibhav Gupta [Wed, 8 Dec 2021 19:24:48 +0000 (13:24 -0600)]
mtip32xx: convert to generic power management
Convert mtip32xx from legacy PCI power management to the generic power
management framework.
Previously, mtip32xx used legacy PCI power management, where
mtip_pci_suspend() and mtip_pci_resume() were responsible for both
device-specific things and generic PCI things:
Bjorn Helgaas [Wed, 8 Dec 2021 19:24:47 +0000 (13:24 -0600)]
mtip32xx: remove pointless drvdata lookups
Previously we passed a struct pci_dev * to mtip_check_surprise_removal(),
which immediately looked up the driver_data. But all callers already have
the driver_data pointer, so just pass it directly and skip the extra
lookup. No functional change intended.
Kees Cook [Thu, 18 Nov 2021 20:37:12 +0000 (12:37 -0800)]
drbd: Use struct_group() to zero algs
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add a struct_group() for the algs so that memset() can correctly reason
about the size.
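A simplified user-space imitation of the idea, assuming a C11 compiler with anonymous struct and union support. STRUCT_GROUP below is a reduced stand-in written for this sketch, not the kernel's struct_group() macro, and the struct and field names are illustrative only.

```c
#include <assert.h>
#include <string.h>

/*
 * Reduced imitation of a grouping macro: the members are reachable both
 * directly and through a named sub-struct whose sizeof covers exactly
 * the grouped fields.
 */
#define STRUCT_GROUP(NAME, ...) \
    union { \
        struct { __VA_ARGS__ }; \
        struct { __VA_ARGS__ } NAME; \
    }

/* Hypothetical config structure for illustration. */
struct net_conf_sketch {
    int timeout;
    STRUCT_GROUP(algs,
        char verify_alg[16];
        char csum_alg[16];
    );
    int flags;
};

/* Zero the grouped fields with one memset bounded by the group's size. */
void clear_algs(struct net_conf_sketch *c)
{
    memset(&c->algs, 0, sizeof(c->algs));
}
```

Because the memset targets a single named object, a bounds checker can see that the write stays inside it instead of spilling from one field into its neighbors.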
Tetsuo Handa [Mon, 13 Dec 2021 12:55:27 +0000 (21:55 +0900)]
loop: make autoclear operation asynchronous
syzbot is reporting a circular locking problem at __loop_clr_fd() [1],
because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
kworker") calls destroy_workqueue() with disk->open_mutex held.
This circular dependency cannot be broken unless we call __loop_clr_fd()
without holding disk->open_mutex. Therefore, defer __loop_clr_fd() from
lo_release() to a WQ context.
Jens Axboe [Fri, 10 Dec 2021 23:32:44 +0000 (16:32 -0700)]
null_blk: cast command status to integer
kernel test robot reports that sparse now triggers a warning on null_blk:
>> drivers/block/null_blk/main.c:1577:55: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int ioerror @@ got restricted blk_status_t [usertype] error @@
drivers/block/null_blk/main.c:1577:55: sparse: expected int ioerror
drivers/block/null_blk/main.c:1577:55: sparse: got restricted blk_status_t [usertype] error
because blk_mq_add_to_batch() takes an integer instead of a blk_status_t.
Just cast this to an integer to silence it. null_blk is the odd one out
here since the command status is the "right" type. If we change the
function type, then we'll have to do that for other callers too (existing
and future ones).
NeilBrown [Fri, 10 Dec 2021 04:31:56 +0000 (21:31 -0700)]
pktdvd: stop using bdi congestion framework.
The bdi congestion framework isn't widely used and should be
deprecated.
pktdvd makes use of it to track congestion, but this can be done
entirely internally to pktdvd, so it doesn't need to use the framework.
So introduce a "congested" flag. When waiting for bio_queue_size to
drop, set this flag and use a var_waitqueue() to wait for it. When
bio_queue_size does drop and this flag is set, clear the flag and call
wake_up_var().
We don't use a wait_var_event macro for the waiting as we need to set
the flag and drop the spinlock before calling schedule(), and while that
is possible with __wait_var_event(), the result is not easy to read.
Hao Xu [Wed, 8 Dec 2021 05:21:25 +0000 (13:21 +0800)]
io_uring: batch completion in prior_task_list
In previous patches, we gathered some tw with io_req_task_complete()
as the callback in prior_task_list. Complete them in a batch when we
cannot grab the uring lock. In this way, we batch the req_complete_post
path.
Hao Xu [Tue, 7 Dec 2021 09:39:48 +0000 (17:39 +0800)]
io_uring: add a priority tw list for irq completion work
Now we have a lot of task_work users, and some only complete a req
and generate a cqe. Put that work on a new tw list with a higher
priority, so that it is handled quickly. This reduces average req
latency and lets users issue the next round of sqes earlier.
An explanatory case:
original timeline:
submit_sqe-->irq-->add completion task_work
-->run heavy work0~n-->run completion task_work
new timeline:
submit_sqe-->irq-->add completion task_work
-->run completion task_work-->run heavy work0~n
Limitation: this optimization only helps when the submission and
reaping processes run in different threads. Otherwise we have to
submit new sqes after returning to userspace anyway, so the order of
TWs doesn't matter.
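The reordering in the timelines above can be sketched with two FIFO lists. All names here are hypothetical and the real io_uring code differs; the sketch only shows that completion work queued after heavy work still runs first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task-work item; not the io_uring structure. */
struct tw {
    struct tw *next;
    int id;
};

/* Two FIFO lists: completion work goes to prio, everything else to norm. */
struct tw_ctx {
    struct tw *prio_head, **prio_tail;
    struct tw *norm_head, **norm_tail;
};

void tw_init(struct tw_ctx *c)
{
    c->prio_head = c->norm_head = NULL;
    c->prio_tail = &c->prio_head;
    c->norm_tail = &c->norm_head;
}

/* Append to the tail of the list chosen by is_completion. */
void tw_add(struct tw_ctx *c, struct tw *w, int is_completion)
{
    w->next = NULL;
    if (is_completion) {
        *c->prio_tail = w;
        c->prio_tail = &w->next;
    } else {
        *c->norm_tail = w;
        c->norm_tail = &w->next;
    }
}

/* Run all priority items first, then the rest; record the order in out[]. */
size_t tw_run(struct tw_ctx *c, int *out)
{
    size_t n = 0;
    struct tw *w;

    for (w = c->prio_head; w; w = w->next)
        out[n++] = w->id;
    for (w = c->norm_head; w; w = w->next)
        out[n++] = w->id;
    tw_init(c);                   /* both lists are now empty */
    return n;
}
```

Queuing two heavy items and then one completion item yields the completion item first in out[], matching the new timeline.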
Tested this patch (and the following ones) by manually replacing
__io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct
'heavy' task works. Then test with fio:
John Garry [Mon, 6 Dec 2021 12:49:50 +0000 (20:49 +0800)]
blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags
Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees
using megaraid SAS RAID card since moving to shared tags [0].
Previously, when shared tags was shared sbitmap, this function was less
than optimal since we would iterate through all tags for all hctx's,
yet only ever match up to tagset depth number of rqs.
Since the change to shared tags, things are even less efficient if we have
parallel callers of blk_mq_queue_tag_busy_iter(). This is because in
bt_iter() -> blk_mq_find_and_get_req() there would be more contention on
accessing each request ref and tags->lock since they are now shared among
all HW queues.
Optimise by having separate calls to bt_for_each() for when we're using
shared tags. In this case no longer pass a hctx, as it is no longer
relevant, and teach bt_iter() about this.
Ming suggested something along the lines of this change, apart from a
different implementation.
John Garry [Mon, 6 Dec 2021 12:49:49 +0000 (20:49 +0800)]
blk-mq: Delete busy_iter_fn
Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete
busy_iter_fn to reduce duplication.
It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is
less specific.
However, busy_tag_iter_fn is used in many different parts of the tree,
unlike busy_iter_fn, which is only used in block/, so just take the
straightforward path now, so that we could rename later treewide.
John Garry [Mon, 6 Dec 2021 12:49:48 +0000 (20:49 +0800)]
blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument
The only user of the busy_iter_fn blk_mq_hw_ctx argument is
blk_mq_rq_inflight().
Function blk_mq_rq_inflight() uses the hctx to find the associated request
queue to match against the request. However this same check is already
done in caller bt_iter(), so drop this check.
With that change there are no more users of busy_iter_fn blk_mq_hw_ctx
argument, so drop the argument.
Jens Axboe [Mon, 6 Dec 2021 16:41:50 +0000 (09:41 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk-mq: don't run might_sleep() if the operation needn't blocking
Ming Lei [Mon, 6 Dec 2021 03:33:50 +0000 (11:33 +0800)]
blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk_mq_run_dispatch_ops() is defined as a macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add a local variable
to hold the request queue.
Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 4cafe86c9267 ("blk-mq: run dispatch lock once in case of issuing from list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
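The hazard can be reproduced with a toy macro. Everything below is a hypothetical simplification: the macros and types are written for this sketch, and the "lock" is just a counter. The buggy macro re-evaluates its queue argument after the operation has changed plug->mq_list, so lock and unlock hit different queues; the fixed macro evaluates the argument once into a local.

```c
#include <assert.h>

/* Hypothetical, simplified types; "locked" stands in for a lock count. */
struct queue { int locked; };
struct request { struct queue *q; };
struct plug { struct request *mq_list; };

/* Buggy: Q is evaluated again after OPS may have changed what it names. */
#define RUN_OPS_BUGGY(Q, OPS) do { \
        (Q)->locked++;             \
        OPS;                       \
        (Q)->locked--;             \
    } while (0)

/* Fixed: evaluate Q once and use the saved pointer for lock and unlock. */
#define RUN_OPS(Q, OPS) do {           \
        struct queue *saved_q = (Q);   \
        saved_q->locked++;             \
        OPS;                           \
        saved_q->locked--;             \
    } while (0)

/* Returns the lock balance left on the first queue after dispatch. */
int dispatch_buggy(struct plug *plug, struct request *next)
{
    struct queue *first = plug->mq_list->q;

    RUN_OPS_BUGGY(plug->mq_list->q, plug->mq_list = next);
    return first->locked;
}

int dispatch_fixed(struct plug *plug, struct request *next)
{
    struct queue *first = plug->mq_list->q;

    RUN_OPS(plug->mq_list->q, plug->mq_list = next);
    return first->locked;
}
```

In the buggy variant the first queue stays "locked" and the second is "unlocked" without ever being locked; the fixed variant leaves both balanced.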
Ming Lei [Mon, 6 Dec 2021 11:12:13 +0000 (19:12 +0800)]
blk-mq: don't run might_sleep() if the operation needn't blocking
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.
Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 5 Dec 2021 14:37:59 +0000 (14:37 +0000)]
io_uring: tweak iopoll CQE_SKIP event counting
When iopolling, userspace specifies the minimum number of "events" it
expects. Previously, we had one CQE per request, so the definition of
an "event" was unequivocal, but that's no longer the case with
REQ_F_CQE_SKIP.
Currently it counts the number of completed requests. Replace that with
the number of posted CQEs. This allows users of the "one CQE per link"
scheme to wait for all N links in a single syscall, which is not
possible without the patch and requires extra context switches.
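The difference between the two ways of counting can be sketched as follows. The struct and function names are hypothetical; skip_cqe merely models a request marked with REQ_F_CQE_SKIP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completed request; skip_cqe models REQ_F_CQE_SKIP. */
struct req {
    bool skip_cqe;
};

/* Old accounting: every completed request counts as one event. */
size_t count_completed(const struct req *reqs, size_t n)
{
    (void)reqs;
    return n;
}

/* New accounting: only requests that actually post a CQE count. */
size_t count_posted_cqes(const struct req *reqs, size_t n)
{
    size_t events = 0;

    for (size_t i = 0; i < n; i++)
        if (!reqs[i].skip_cqe)
            events++;
    return events;
}
```

For two links of three requests each, where only the last request of a link posts a CQE, the old count reaches 2 after two requests of the first link, while the new count reaches 2 only once both links have finished.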
Jens Axboe [Fri, 3 Dec 2021 21:51:46 +0000 (14:51 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
blk-mq: run dispatch lock once in case of issuing from list
blk-mq: pass request queue to blk_mq_run_dispatch_ops
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
blk-mq: remove hctx_lock and hctx_unlock
block: switch to atomic_t for request references
block: move direct_IO into our own read_iter handler
mm: move filemap_range_needs_writeback() into header
Ming Lei [Fri, 3 Dec 2021 13:15:34 +0000 (21:15 +0800)]
blk-mq: run dispatch lock once in case of issuing from list
It isn't necessary to call blk_mq_run_dispatch_ops() once for each
request that is issued directly; it is enough to call it once when
issuing from the whole list.
Ming Lei [Fri, 3 Dec 2021 13:15:32 +0000 (21:15 +0800)]
blk-mq: move srcu from blk_mq_hw_ctx to request_queue
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect the
dispatch critical area. However, this srcu instance sits at the end of
hctx, and it often takes a standalone cacheline, which is often cold.
Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable, which is allocated from the heap instead
of being embedded, and srcu->srcu_idx is read only in srcu_read_lock().
It doesn't matter if the srcu structure stays in hctx or request queue.
So switch to per-request-queue srcu for protecting dispatch. This
simplifies quiesce a lot, not to mention that quiesce is always done
request-queue wide.
Jens Axboe [Thu, 14 Oct 2021 20:39:59 +0000 (14:39 -0600)]
block: switch to atomic_t for request references
refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.
This borrows that same implementation, which in turn is based on the
mm implementation from Linus.
Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
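A user-space sketch of a plain atomic reference count with a cheap sanity check, written with C11 atomics. It is only loosely modeled on the approach described above; the function names and the 127 margin are assumptions made for this sketch, not a copy of the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request with a plain atomic reference count. */
struct request {
    atomic_int ref;
};

/* True if the count is zero or within 127 of wrapping around. */
bool ref_zero_or_close_to_overflow(struct request *rq)
{
    return (unsigned int)atomic_load(&rq->ref) + 127u <= 127u;
}

/* Take a reference unless the count already dropped to zero. */
bool ref_inc_not_zero(struct request *rq)
{
    int old = atomic_load(&rq->ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&rq->ref, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the last one is gone. */
bool ref_put_and_test(struct request *rq)
{
    assert(!ref_zero_or_close_to_overflow(rq));   /* catch underflow early */
    return atomic_fetch_sub(&rq->ref, 1) == 1;
}
```

The single unsigned comparison catches both a zero count and values that have wrapped below zero, which keeps the check cheap compared with a fully saturating counter.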
Jens Axboe [Thu, 28 Oct 2021 14:57:09 +0000 (08:57 -0600)]
block: move direct_IO into our own read_iter handler
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just
set it up ourselves and call our own handler. This avoids an indirect call
for O_DIRECT.
Ming Lei [Fri, 3 Dec 2021 08:17:03 +0000 (16:17 +0800)]
block: null_blk: batched complete poll requests
Complete poll requests via blk_mq_add_to_batch() and
blk_mq_end_request_batch(), so that we can cover the batched completion
code path by running the null_blk test.
Meanwhile, this shows a ~14% IOPS boost on 't/io_uring /dev/nullb0'
in my test.
Jens Axboe [Fri, 3 Dec 2021 13:34:00 +0000 (06:34 -0700)]
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
floppy: Add max size check for user space request
floppy: Fix hang in watchdog when disk is ejected
null_blk: allow zero poll queues
Xiongwei Song [Tue, 16 Nov 2021 13:10:33 +0000 (21:10 +0800)]
floppy: Add max size check for user space request
We need to check the max request size that comes from user space before
allocating pages. If the request size exceeds the limit, return -EINVAL.
This check can avoid the warning below from the page allocator.
When the watchdog detects a disk change, it calls cancel_activity(),
which in turn tries to cancel the fd_timer delayed work.
In the above scenario, fd_timer_fn is set to fd_watchdog(), meaning
it is trying to cancel its own work.
This results in a hang as cancel_delayed_work_sync() is waiting for the
watchdog (itself) to return, which never happens.
This can be reproduced relatively consistently by attempting to read a
broken floppy, and ejecting it while IO is being attempted and retried.
To resolve this, this patch calls cancel_delayed_work() instead, which
cancels the work without waiting for the watchdog to return and finish.
Before this regression was introduced, the code in this section used
del_timer(), and not del_timer_sync() to delete the watchdog timer.
Link: https://lore.kernel.org/r/399e486c-6540-db27-76aa-7a271b061f76@tasossah.com Fixes: 070ad7e793dc ("floppy: convert to delayed work and single-thread wq") Signed-off-by: Tasos Sahanidis <tasos@tasossah.com> Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 3 Dec 2021 02:39:19 +0000 (19:39 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block:
block: fix double bio queue when merging in cached request path
block: get rid of useless goto and label in blk_mq_get_new_requests()
Jens Axboe [Thu, 2 Dec 2021 19:43:46 +0000 (12:43 -0700)]
block: fix double bio queue when merging in cached request path
When we attempt to merge off the cached request path, we return NULL
if successful. This makes the caller believe that it should allocate
a new request, and hence we end up with the bio both merged and associated
with a new request. This, predictably, leads to all sorts of crashes.
Pass in a pointer to the bio pointer, and clear it for the merge case.
Then the caller knows that the bio is already queued, and no new requests
need to get allocated.
Fixes: 5b13bc8a3fd5 ("blk-mq: cleanup request allocation") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
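The "pointer to the bio pointer" convention can be sketched like this. The types, the merge rule in try_merge() and the function names are hypothetical simplifications; only the signalling idea comes from the description above: on a successful merge the callee clears the caller's bio pointer, so a NULL return no longer has to mean "allocate a new request".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified types. */
struct bio { int sectors; };
struct request { int sectors; };

/* Illustrative merge rule: merge while the request stays at 8 sectors or less. */
static bool try_merge(struct request *cached, const struct bio *bio)
{
    if (cached->sectors + bio->sectors > 8)
        return false;
    cached->sectors += bio->sectors;
    return true;
}

/*
 * Returns the cached request to use, or NULL. On a successful merge the
 * bio is already queued, so *bio is cleared to tell the caller.
 */
struct request *get_cached_request(struct request *cached, struct bio **bio)
{
    if (cached && try_merge(cached, *bio)) {
        *bio = NULL;
        return NULL;
    }
    return cached;
}

/* Caller side: true only if a new request really has to be allocated. */
bool needs_new_request(struct request *cached, struct bio *bio)
{
    struct request *rq = get_cached_request(cached, &bio);

    if (!bio)
        return false;     /* merged: the bio is already queued */
    return rq == NULL;    /* no usable cached request */
}
```

Without the cleared pointer, the merged case and the "no cached request" case both look like a NULL return, which is the ambiguity that led to the bio being queued twice.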
Ye Bin [Mon, 29 Nov 2021 01:26:59 +0000 (09:26 +0800)]
block: Fix fsync always failed if once failed
We ran fault-injection tests based on v4.19, and after testing for some
time we found that sync /dev/sda always failed.
[root@localhost] sync /dev/sda
sync: error syncing '/dev/sda': Input/output error
Commit 8d6996630c03 introduced 'fq->rq_status', which is only updated
when the 'flush_rq' reference count isn't zero. Once a flush request
fails, the error code is recorded in 'fq->rq_status'. If nothing updates
'fq->rq_status' afterwards, every subsequent fsync will fail.
To address this issue, reset 'fq->rq_status' after returning the error
code to the upper layer.
Fixes: 8d6996630c03("block: fix null pointer dereference in blk_mq_rq_timed_out()") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
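The sticky-error behavior and its fix can be sketched in a few lines. The struct and function names are hypothetical; rq_status mirrors the field named in the description, and -5 stands in for an I/O error code.

```c
#include <assert.h>

/* Hypothetical flush queue; rq_status mirrors the field named in the text. */
struct flush_queue {
    int rq_status;    /* 0 on success, negative errno-style code on error */
};

/* Buggy variant: the recorded error is reported but never cleared. */
int flush_end_sticky(struct flush_queue *fq)
{
    return fq->rq_status;
}

/* Fixed variant: report the error once, then reset it for the next flush. */
int flush_end_reset(struct flush_queue *fq)
{
    int status = fq->rq_status;

    fq->rq_status = 0;
    return status;
}
```

With the sticky variant, one failed flush makes every later call report the same error; with the reset variant the error is delivered once and the next flush starts clean.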
Jens Axboe [Mon, 29 Nov 2021 13:42:05 +0000 (06:42 -0700)]
Merge branch 'for-5.17/block' into for-next
* for-5.17/block: (72 commits)
scsi: remove the gendisk argument to scsi_ioctl
block: remove the gendisk argument to blk_execute_rq
block: remove the ->rq_disk field in struct request
block: don't check ->rq_disk in merges
mtd_blkdevs: remove the sector out of range check in do_blktrans_request
block: Remove redundant initialization of variable ret
block: simplify ioc_lookup_icq
block: simplify ioc_create_icq
block: return the io_context from create_task_io_context
block: use alloc_io_context in __copy_io
block: factor out a alloc_io_context helper
block: remove get_io_context_active
block: move the remaining elv.icq handling to the I/O scheduler
block: move blk_mq_sched_assign_ioc to blk-ioc.c
block: mark put_io_context_active static
Revert "block: Provide blk_mq_sched_get_icq()"
bfq: use bfq_bic_lookup in bfq_limit_depth
bfq: simplify bfq_bic_lookup
fork: move copy_io to block/blk-ioc.c
RDMA/qib: rename copy_io to qib_copy_io
...
Tetsuo Handa [Wed, 24 Nov 2021 10:47:40 +0000 (19:47 +0900)]
loop: don't hold lo_mutex during __loop_clr_fd()
syzbot is reporting a circular locking problem at __loop_clr_fd() [1],
because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
kworker") calls destroy_workqueue() with lo->lo_mutex held.
Since all functions where lo->lo_state matters are already checking
lo->lo_state with lo->lo_mutex held (in order to avoid racing with e.g.
ioctl(LOOP_CTL_REMOVE)), and __loop_clr_fd() can be called from either
ioctl(LOOP_CLR_FD) xor close(), lo->lo_state == Lo_rundown is considered
as an exclusive lock for __loop_clr_fd(). Therefore, hold lo->lo_mutex
inside __loop_clr_fd() only when asserting/updating lo->lo_state.
Since ioctl(LOOP_CLR_FD) depends on lo->lo_state == Lo_bound, a valid
lo->lo_backing_file must have been assigned by ioctl(LOOP_SET_FD) or
ioctl(LOOP_CONFIGURE). Thus, we can remove lo->lo_backing_file test,
and convert __loop_clr_fd() into a void function.
Christoph Hellwig [Fri, 26 Nov 2021 12:18:01 +0000 (13:18 +0100)]
block: remove the gendisk argument to blk_execute_rq
Remove the gendisk argument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now. Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 26 Nov 2021 11:58:15 +0000 (12:58 +0100)]
block: return the io_context from create_task_io_context
Grab a reference to the newly allocated or existing io_context in
create_task_io_context and return it. This simplifies the callers and
removes the need for double lookups.
Christoph Hellwig [Fri, 26 Nov 2021 11:58:14 +0000 (12:58 +0100)]
block: use alloc_io_context in __copy_io
In __copy_io we know that the newly allocated task_struct does not have
an I/O context yet and is not exiting. So just allocate the I/O context
struct and install it directly. There is no need to lock the task
either as it is just being created.
Christoph Hellwig [Fri, 26 Nov 2021 11:58:10 +0000 (12:58 +0100)]
block: move blk_mq_sched_assign_ioc to blk-ioc.c
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static. Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.
Jan Kara [Thu, 25 Nov 2021 13:36:41 +0000 (14:36 +0100)]
bfq: Do not let waker requests skip proper accounting
Commit 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue, and maybe its
current IO is not needed for the current queue to make progress.
Furthermore, injecting IO like this completely bypasses any service
accounting within bfq, and thus we do not properly track how much service
the waker's queue is getting or that the waker is actually doing any IO.
Depending on the conditions, this can result in the waker getting too
much or too little service.
Although the processes have very different IO priorities, they get the
same amount of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.
Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
Fixes: 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in dispatch list") Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 25 Nov 2021 13:36:40 +0000 (14:36 +0100)]
bfq: Log waker detections
Waker-wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.