If the floppy_alloc_disk() failed, disks of current drive will not be set,
thus the lastest allocated set->tag cannot be freed in the error handling
path. A simple call graph shown as below:
->out_put_disk:
for (drive = 0; drive < N_DRIVE; drive++)
if (!disks[drive][0]) # the last disks is not set and loop break
break;
blk_mq_free_tag_set() # the latest allocated set->tag leaked
Fix this problem by free the set->tag of current drive before jump to
error handling path.
Cc: stable@vger.kernel.org Fixes: 302cfee15029 ("floppy: use a separate gendisk for each media format") Signed-off-by: Yuan Can <yuancan@huawei.com>
[efremov: added stable list, changed title] Signed-off-by: Denis Efremov <efremov@linux.com>
Greg Kroah-Hartman [Sat, 3 Dec 2022 14:07:47 +0000 (15:07 +0100)]
block: remove devnode callback from struct block_device_operations
With the removal of the pktcdvd driver, there are no in-kernel users of
the devnode callback in struct block_device_operations, so it can be
safely removed. If it is needed for new block drivers in the future, it
can be brought back.
Greg Kroah-Hartman [Fri, 2 Dec 2022 18:27:58 +0000 (19:27 +0100)]
pktcdvd: remove driver.
Way back in 2016 in commit 5a8b187c61e9 ("pktcdvd: mark as unmaintained
and deprecated") this driver was marked as "will be removed soon". 5
years seems long enough to have it stick around after that, so finally
remove the thing now.
Reported-by: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Maier <balagi@justmail.de> Cc: Peter Osterlund <petero2@telia.com> Cc: linux-block@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20221202182758.1339039-1-gregkh@linuxfoundation.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 29 Nov 2022 13:32:53 +0000 (14:32 +0100)]
md: remove lock_bdev / unlock_bdev
These wrappers for blkdev_get / blkdev_put just horribly confuse the
code with their odd naming. Remove them and improve the error unwinding
in md_import_device with the now folded code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
Yang Li [Fri, 2 Dec 2022 01:17:13 +0000 (09:17 +0800)]
blk-cgroup: Fix some kernel-doc comments
Make the description of @gendisk to @disk in blkcg_schedule_throttle()
to clear the below warnings:
block/blk-cgroup.c:1850: warning: Function parameter or member 'disk' not described in 'blkcg_schedule_throttle'
block/blk-cgroup.c:1850: warning: Excess function parameter 'gendisk' description in 'blkcg_schedule_throttle'
Shin'ichiro Kawasaki [Thu, 1 Dec 2022 06:10:36 +0000 (15:10 +0900)]
null_blk: support read-only and offline zone conditions
In zoned mode, zones with write pointers can have conditions "read-only"
or "offline". In read-only condition, zones can not be written. In
offline condition, the zones can be neither written nor read. These
conditions are intended for zones with media failures, then it is
difficult to set those conditions to zones on real devices.
To test handling of zones in the conditions, add a feature to null_blk
to set up zones in read-only or offline condition. Add new configuration
attributes "zone_readonly" and "zone_offline". Write a sector to the
attribute files to specify the target zone to set the zone conditions.
For example, following command lines do it:
When the specified zones are already in read-only or offline condition,
normal empty condition is restored to the zones. These condition changes
can be done only after the null_blk device get powered, since status
area of each zone is not yet allocated before power-on.
Also improve zone condition checks to inhibit all commands for zones in
offline conditions. In same manner, inhibit write and zone management
commands for zones in read-only condition.
Christoph Böhmwalder [Thu, 1 Dec 2022 11:03:48 +0000 (12:03 +0100)]
drbd: introduce dynamic debug
Incorporate as many out-of-tree changes as possible without changing the
genl API.
Over the years, we restructured this several times, and also changed the
log format.
One breaking change is that DRBD 9 gained "implicit options", like a
connection name. This cannot be replayed here without changing the API,
so save it for later.
Originally-from: Andreas Gruenbacher <agruen@linbit.com>
Originally-from: Philipp Reisner <philipp.reisner@linbit.com>
Originally-from: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20221201110349.1282687-4-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Tue, 18 Oct 2022 12:19:31 +0000 (20:19 +0800)]
blk-iocost: Remove vrate member in struct ioc_now
If we trace vtime_base_rate instead of vtime_rate, there is nowhere
which accesses now->vrate except function ioc_now using now->vrate locally.
Just remove it.
Kemeng Shi [Tue, 18 Oct 2022 12:19:30 +0000 (20:19 +0800)]
blk-iocost: Trace vtime_base_rate instead of vtime_rate
Since commit ac33e91e2daca ("blk-iocost: implement vtime loss
compensation") rename original vtime_rate to vtime_base_rate
and current vtime_rate is original vtime_rate with compensation.
The current rate showed in tracepoint is mixed with vtime_rate
and vtime_base_rate:
1) In function ioc_adjust_base_vrate, the first trace_iocost_ioc_vrate_adj
shows vtime_rate, the second trace_iocost_ioc_vrate_adj shows
vtime_base_rate.
2) In function iocg_activate shows vtime_rate by calling
TRACE_IOCG_PATH(iocg_activate...
3) In function ioc_check_iocgs shows vtime_rate by calling
TRACE_IOCG_PATH(iocg_idle...
Trace vtime_base_rate instead of vtime_rate as:
1) Before commit ac33e91e2daca ("blk-iocost: implement vtime loss
compensation"), the traced rate is without compensation, so still
show rate without compensation.
2) The vtime_base_rate is more stable while vtime_rate heavily depends on
excess budeget on current period which may change abruptly in next period.
Kemeng Shi [Tue, 18 Oct 2022 12:19:29 +0000 (20:19 +0800)]
blk-iocost: Reset vtime_base_rate in ioc_refresh_params
Since commit ac33e91e2daca("blk-iocost: implement vtime loss compensation")
split vtime_rate into vtime_rate and vtime_base_rate, we need reset both
vtime_base_rate and vtime_rate when device parameters are refreshed.
If vtime_base_rate is no reset here, vtime_rate will be overwritten with
old vtime_base_rate soon in ioc_refresh_vrate.
Jan Kara [Wed, 30 Nov 2022 17:56:53 +0000 (18:56 +0100)]
block: Do not reread partition table on exclusively open device
Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in
blk_drop_partitions") we allow rereading of partition table although
there are users of the block device. This has an undesirable consequence
that e.g. if sda and sdb are assembled to a RAID1 device md0 with
partitions, BLKRRPART ioctl on sda will rescan partition table and
create sda1 device. This partition device under a raid device confuses
some programs (such as libstorage-ng used for initial partitioning for
distribution installation) leading to failures.
Fix the problem refusing to rescan partitions if there is another user
that has the block device exclusively open.
Pankaj Raghav [Wed, 30 Nov 2022 12:30:03 +0000 (13:30 +0100)]
virtio-blk: replace ida_simple[get|remove] with ida_[alloc_range|free]
ida_simple[get|remove] are deprecated, and are just wrappers to
ida_[alloc_range|free]. Replace ida_simple[get|remove] with their
corresponding counterparts.
Christoph Hellwig [Mon, 14 Nov 2022 04:26:37 +0000 (05:26 +0100)]
block: mark blk_put_queue as potentially blocking
We can't just say that the last reference release may block, as any
reference dropped could be the last one. So move the might_sleep() from
blk_free_queue to blk_put_queue and update the documentation.
Christoph Hellwig [Mon, 14 Nov 2022 04:26:36 +0000 (05:26 +0100)]
block: untangle request_queue refcounting from sysfs
The kobject embedded into the request_queue is used for the queue
directory in sysfs, but that is a child of the gendisks directory and is
intimately tied to it. Move this kobject to the gendisk and use a
refcount_t in the request_queue for the actual request_queue refcounting
that is completely unrelated to the device model.
Christoph Hellwig [Mon, 14 Nov 2022 04:26:35 +0000 (05:26 +0100)]
block: fix error unwinding in blk_register_queue
blk_register_queue fails to handle errors from blk_mq_sysfs_register,
leaks various resources on errors and accidentally sets queue refs percpu
refcount to percpu mode on kobject_add failure. Fix all that by
properly unwinding on errors.
We've had a few reports on this causing a crash at boot time, because
of a reference issue. While this problem seemginly did exist before
the patch and needs solving separately, this patch makes it a lot
easier to trigger.
Jens Axboe [Tue, 29 Nov 2022 13:54:57 +0000 (06:54 -0700)]
Merge tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme into for-6.2/block
Pull NVMe updates from Christoph:
"nvme updates for Linux 6.2
- support some passthrough commands without CAP_SYS_ADMIN
(Kanchan Joshi)
- refactor PCIe probing and reset (Christoph Hellwig)
- various fabrics authentication fixes and improvements (Sagi Grimberg)
- avoid fallback to sequential scan due to transient issues
(Uday Shankar)
- implement support for the DEAC bit in Write Zeroes (Christoph Hellwig)
- allow overriding the IEEE OUI and firmware revision in configfs for
nvmet (Aleksandr Miloserdov)
- force reconnect when number of queue changes in nvmet (Daniel Wagner)
- minor fixes and improvements (Uros Bizjak, Joel Granados,
Sagi Grimberg, Christoph Hellwig, Christophe JAILLET)"
* tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme: (45 commits)
nvmet: expose firmware revision to configfs
nvmet: expose IEEE OUI to configfs
nvme: rename the queue quiescing helpers
nvmet: fix a memory leak in nvmet_auth_set_key
nvme: return err on nvme_init_non_mdts_limits fail
nvme: avoid fallback to sequential scan due to transient issues
nvme-rdma: stop auth work after tearing down queues in error recovery
nvme-tcp: stop auth work after tearing down queues in error recovery
nvme-auth: have dhchap_auth_work wait for queues auth to complete
nvme-auth: remove redundant auth_work flush
nvme-auth: convert dhchap_auth_list to an array
nvme-auth: check chap ctrl_key once constructed
nvme-auth: no need to reset chap contexts on re-authentication
nvme-auth: remove redundant deallocations
nvme-auth: clear sensitive info right after authentication completes
nvme-auth: guarantee dhchap buffers under memory pressure
nvme-auth: don't keep long lived 4k dhchap buffer
nvme-auth: remove redundant if statement
nvme-auth: don't override ctrl keys before validation
nvme-auth: don't ignore key generation failures when initializing ctrl keys
...
Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the
"s" plural) to more correctly reflect the fact that this function tests
a single request, not multiple requests.
As 'blk_mq_register_hctx' may already add some objects when failed halfway,
but there isn't do fallback, caller don't know which objects add failed.
To solve above issue just do fallback when add objects failed halfway in
'blk_mq_register_hctx'.
As after dd6f7f17bf58 commit move '__elevator_get(qe->type)' before set
'qe->type', so will lead to access wild pointer.
To solve above issue get 'qe->type' after set 'qe->type'.
Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com
Fixes:dd6f7f17bf58("block: add proper helpers for elevator_type module refcount management") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wang ShaoBo [Thu, 24 Nov 2022 01:58:17 +0000 (09:58 +0800)]
drbd: destroy workqueue when drbd device was freed
A submitter workqueue is dynamically allocated by init_submitter()
called by drbd_create_device(), we should destroy it when this
device is not needed or destroyed.
Wang ShaoBo [Thu, 24 Nov 2022 01:58:16 +0000 (09:58 +0800)]
drbd: remove call to memset before free device/resource/connection
This revert c2258ffc56f2 ("drbd: poison free'd device, resource and
connection structs"), add memset is odd here for debugging, there are
some methods to accurately show what happened, such as kdump.
Damien Le Moal [Thu, 24 Nov 2022 02:12:08 +0000 (11:12 +0900)]
block: mq-deadline: Do not break sequential write streams to zoned HDDs
mq-deadline ensures an in order dispatching of write requests to zoned
block devices using a per zone lock (a bit). This implies that for any
purely sequential write workload, the drive is exercised most of the
time at a maximum queue depth of one.
However, when such sequential write workload crosses a zone boundary
(when sequentially writing multiple contiguous zones), zone write
locking may prevent the last write to one zone to be issued (as the
previous write is still being executed) but allow the first write to the
following zone to be issued (as that zone is not yet being writen and
not locked). This result in an out of order delivery of the sequential
write commands to the device every time a zone boundary is crossed.
While such behavior does not break the sequential write constraint of
zoned block devices (and does not generate any write error), some zoned
hard-disks react badly to seeing these out of order writes, resulting in
lower write throughput.
This problem can be addressed by always dispatching the first request
of a stream of sequential write requests, regardless of the zones
targeted by these sequential writes. To do so, the function
deadline_skip_seq_writes() is introduced and used in
deadline_next_request() to select the next write command to issue if the
target device is an HDD (blk_queue_nonrot() being false).
deadline_fifo_request() is modified using the new
deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
requests in the fifo list that have a preceding request in lba order
that is sequential.
With this fix, a sequential write workload executed with the following
fio command:
Damien Le Moal [Thu, 24 Nov 2022 02:12:07 +0000 (11:12 +0900)]
block: mq-deadline: Fix dd_finish_request() for zoned devices
dd_finish_request() tests if the per prio fifo_list is not empty to
determine if request dispatching must be restarted for handling blocked
write requests to zoned devices with a call to
blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
2 problems:
1) Only the priority level of the completed request is considered.
However, writes to a zone may be blocked due to other writes to the
same zone using a different priority level. While this is unlikely to
happen in practice, as writing a zone with different IO priorirites
does not make sense, nothing in the code prevents this from
happening.
2) The use of list_empty() is dangerous as dd_finish_request() does not
take dd->lock and may run concurrently with the insert and dispatch
code.
Fix these 2 problems by testing the write fifo list of all priority
levels using the new helper dd_has_write_work(), and by testing each
fifo list using list_empty_careful().
Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support") Cc: <stable@vger.kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Böhmwalder [Tue, 22 Nov 2022 13:43:01 +0000 (14:43 +0100)]
drbd: use consistent license
DRBD currently has a mix of GPL-2.0 and GPL-2.0-or-later SPDX license
identifiers. We have decided to stick with GPL 2.0 only, so consistently
use that identifier.
Shin'ichiro Kawasaki [Tue, 22 Nov 2022 08:49:17 +0000 (17:49 +0900)]
block: fix missing nr_hw_queues update in blk_mq_realloc_tag_set_tags
The commit ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags")
cleaned up the function blk_mq_realloc_tag_set_tags. After this change,
the function does not update nr_hw_queues of struct blk_mq_tag_set when
new nr_hw_queues value is smaller than original. This results in failure
of queue number change of block devices. To avoid the failure, add the
missing nr_hw_queues update.
Christoph Hellwig [Mon, 14 Nov 2022 04:29:44 +0000 (05:29 +0100)]
blk-crypto: move internal only declarations to blk-crypto-internal.h
blk_crypto_get_keyslot, blk_crypto_put_keyslot, __blk_crypto_evict_key
and __blk_crypto_cfg_supported are only used internally by the
blk-crypto code, so move the out of blk-crypto-profile.h, which is
included by drivers that supply blk-crypto functionality.
Christoph Hellwig [Mon, 14 Nov 2022 04:29:43 +0000 (05:29 +0100)]
blk-crypto: add a blk_crypto_config_supported_natively helper
Add a blk_crypto_config_supported_natively helper that wraps
__blk_crypto_cfg_supported to retrieve the crypto_profile from the
request queue. With this fscrypt can stop including
blk-crypto-profile.h and rely on the public consumer interface in
blk-crypto.h.
Christoph Hellwig [Mon, 14 Nov 2022 04:29:42 +0000 (05:29 +0100)]
blk-crypto: don't use struct request_queue for public interfaces
Switch all public blk-crypto interfaces to use struct block_device
arguments to specify the device they operate on instead of th
request_queue, which is a block layer implementation detail.
Christoph Hellwig [Tue, 15 Nov 2022 10:22:14 +0000 (11:22 +0100)]
nvme: rename the queue quiescing helpers
Naming the nvme helpers that wrap the block quiesce functionality
_start/_stop is rather confusing. Switch to using the quiesce naming
used by the block layer instead.
Waiman Long [Sat, 5 Nov 2022 00:59:02 +0000 (20:59 -0400)]
blk-cgroup: Flush stats at blkgs destruction path
As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which is
called when the blkcg reference count reaches 0. This circular dependency
will prevent blkcg from being freed until some other events cause
cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
function to flush stats for a given css and cpu and call it at the blkgs
destruction path, blkcg_destroy_blkgs(), whenever there are still some
pending stats to be flushed. This will ensure that blkcg reference
count can reach 0 ASAP.
Waiman Long [Sat, 5 Nov 2022 00:59:01 +0000 (20:59 -0400)]
blk-cgroup: Optimize blkcg_rstat_flush()
For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupt is disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases leading to hard lockup messages.
As it is likely that not all the percpu blkg_iostat_set's has been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.
The blkg_iostat_set is added to a lockless list on the update side
in blk_cgroup_bio_start(). It is removed from the lockless list when
flushed in blkcg_rstat_flush(). Due to racing, it is possible that
blk_iostat_set's in the lockless list may have no new IO stats to be
flushed, but that is OK.
To protect against destruction of blkg, a percpu reference is gotten
when putting into the lockless list and put back when removed.
When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, out of the 2051 calls to
cgroup_rstat_flush() after bootup, 1788 of the calls were exited
immediately because of empty lockless list. After an all-cpu kernel
build, the ratio became 6295424/6340513. That was more than 99%.
Waiman Long [Sat, 5 Nov 2022 00:59:00 +0000 (20:59 -0400)]
blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
Simplify error handling code by returning this error directly instead
of setting an intermediate "ret" variable.
Yu Kuai [Tue, 15 Nov 2022 14:10:54 +0000 (22:10 +0800)]
block: don't allow a disk link holder to itself
After creating a dm device, then user can reload such dm with itself,
and dead loop will be triggered because dm keep looking up to itself.
Test procedures:
1) dmsetup create test --table "xxx sda", assume dm-0 is created
2) dmsetup suspend test
3) dmsetup reload test --table "xxx dm-0"
4) dmsetup resume test
Yu Kuai [Tue, 15 Nov 2022 14:10:53 +0000 (22:10 +0800)]
block: store the holder kobject in bd_holder_disk
We hold a reference to the holder kobject for each bd_holder_disk,
so to make the code a bit more robust, use a reference to it instead
of the block_device. As long as no one clears ->bd_holder_dir in
before freeing the disk, this isn't strictly required, but it does
make the code more clear and more robust.
Yu Kuai [Tue, 15 Nov 2022 14:10:52 +0000 (22:10 +0800)]
block: fix use after free for bd_holder_dir
Currently, the caller of bd_link_disk_holer() get 'bdev' by
blkdev_get_by_dev(), which will look up 'bdev' by inode number 'dev'.
Howerver, it's possible that del_gendisk() can be called currently, and
'bd_holder_dir' can be freed before bd_link_disk_holer() access it, thus
use after free is triggered.
Christoph Hellwig [Tue, 15 Nov 2022 14:10:50 +0000 (22:10 +0800)]
dm: track per-add_disk holder relations in DM
dm is a bit special in that it opens the underlying devices. Commit 89f871af1b26 ("dm: delay registering the gendisk") tried to accommodate
that by allowing to add the holder to the list before add_gendisk and
then just add them to sysfs once add_disk is called. But that leads to
really odd lifetime problems and error handling problems as we can't
know the state of the kobjects and don't unwind properly. To fix this
switch to just registering all existing table_devices with the holder
code right after add_disk, and remove them before calling del_gendisk.
Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 15 Nov 2022 14:10:48 +0000 (22:10 +0800)]
dm: cleanup close_table_device
Take the list unlink and free into close_table_device so that no half
torn down table_devices exist. Also remove the check for a NULL bdev
as that can't happen - open_table_device never adds a table_device to
the list that does not have a valid block_device.
Christoph Hellwig [Tue, 15 Nov 2022 14:10:47 +0000 (22:10 +0800)]
dm: cleanup open_table_device
Move all the logic for allocation the table_device and linking it into
the list into the open_table_device. This keeps the code tidy and
ensures that the table_devices only exist in fully initialized state.
Christoph Hellwig [Tue, 15 Nov 2022 14:10:46 +0000 (22:10 +0800)]
dm: remove free_table_devices
free_table_devices just warns and frees all table_device structures when
the target removal did not remove them. This should never happen, but
if it did, just freeing the structure without deleting them from the
list or cleaning up the resources would not help at all. So just WARN on
a non-empty list instead.
Christoph Hellwig [Tue, 15 Nov 2022 14:10:45 +0000 (22:10 +0800)]
block: clear ->slave_dir when dropping the main slave_dir reference
Zero out the pointer to ->slave_dir so that the holder code doesn't
incorrectly treat the object as alive when add_disk failed or after
del_gendisk was called.
Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20221115141054.1051801-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gabriel Krisman Bertazi [Tue, 15 Nov 2022 22:45:53 +0000 (17:45 -0500)]
sbitmap: Try each queue to wake up at least one waiter
Jan reported the new algorithm as merged might be problematic if the
queue being awaken becomes empty between the waitqueue_active inside
sbq_wake_ptr check and the wake up. If that happens, wake_up_nr will
not wake up any waiter and we loose too many wake ups. In order to
guarantee progress, we need to wake up at least one waiter here, if
there are any. This now requires trying to wake up from every queue.
Instead of walking through all the queues with sbq_wake_ptr, this call
moves the wake up inside that function. In a previous version of the
patch, I found that updating wake_index several times when walking
through queues had a measurable overhead. This ensures we only update
it once, at the end.
Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221115224553.23594-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gabriel Krisman Bertazi [Tue, 15 Nov 2022 22:45:52 +0000 (17:45 -0500)]
wait: Return number of exclusive waiters awaken
Sbitmap code will need to know how many waiters were actually woken for
its batched wakeups implementation. Return the number of woken
exclusive waiters from __wake_up() to facilitate that.
Gabriel Krisman Bertazi [Tue, 15 Nov 2022 22:45:51 +0000 (17:45 -0500)]
sbitmap: Advance the queue index before waking up a queue
When a queue is awaken, the wake_index written by sbq_wake_ptr currently
keeps pointing to the same queue. On the next wake up, it will thus
retry the same queue, which is unfair to other queues, and can lead to
starvation. This patch, moves the index update to happen before the
queue is returned, such that it will now try a different queue first on
the next wake up, improving fairness.
Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20221115224553.23594-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 16 Nov 2022 13:20:35 +0000 (14:20 +0100)]
block: remove blkdev_writepages
While the block device code should switch to implementing
->writepages instead of ->writepage eventually, the current
implementation is entirely pointless as it does the same looping over
->writepage as the generic code if no ->writepages is present.
Remove blkdev_writepages so that we can eventually unexport
generic_writepages.
Pavel Begunkov [Wed, 2 Nov 2022 15:18:23 +0000 (15:18 +0000)]
bio: shrink max number of pcpu cached bios
The downside of the bio pcpu cache is that bios of a cpu will be never
freed unless there is new I/O issued from that cpu. We currently keep
max 512 bios, which feels too much, half it.
Pavel Begunkov [Wed, 2 Nov 2022 15:18:22 +0000 (15:18 +0000)]
bio: add pcpu caching for non-polling bio_put
This patch extends REQ_ALLOC_CACHE to IRQ completions, whenever
currently it's only limited to iopoll. Instead of guarding the list with
irq toggling on alloc, which is expensive, it keeps an additional
irq-safe list from which bios are spliced in batches to ammortise
overhead. On the put side it toggles irqs, but in many cases they're
already disabled and so cheap.
Pavel Begunkov [Wed, 2 Nov 2022 15:18:20 +0000 (15:18 +0000)]
bio: don't rob starving biosets of bios
Biosets keep a mempool, so as long as requests complete we can always
can allocate and have forward progress. Percpu bio caches break that
assumptions as we may complete into the cache of one CPU and after try
and fail to allocate with another CPU. We also can't grab from another
CPU's cache without tricky sync.
If we're allocating with a bio while the mempool is undersaturated,
remove REQ_ALLOC_CACHE flag, so on put it will go straight to mempool.
It might try to free into mempool more requests than required, but
assuming than there is no memory starvation in the system it'll
stabilise and never hit that path.
Pavel Begunkov [Wed, 2 Nov 2022 15:18:19 +0000 (15:18 +0000)]
mempool: introduce mempool_is_saturated
Introduce a helper mempool_is_saturated(), which tells if the mempool is
under-filled or not. We need it to figure out whether it should be
freed right into the mempool or could be cached with top level caches.
Uday Shankar [Tue, 15 Nov 2022 00:23:59 +0000 (17:23 -0700)]
nvme: avoid fallback to sequential scan due to transient issues
Currently, if nvme_scan_ns_list fails, nvme_scan_work will fall back to
a sequential scan. nvme_scan_ns_list can fail for a variety of reasons,
e.g. a transient transport issue, and the resulting sequential scan can
be extremely expensive on controllers reporting an NN value close to the
maximum allowed (> 4 billion). Avoid sequential scans wherever possible
by only falling back to them in two cases:
- When the NVMe version supported (VS) value reported by the device is
older than NVME_VS(1, 1, 0), before which support of Identify NS List
not required.
- When the Identify NS List command fails with the DNR bit set in the
status. This is to accommodate (non-compliant) devices which report a
VS value which implies support for Identify NS List, but nevertheless
do not support the command. Such devices will most likely fail the
command with the DNR bit set.
The third case is when the device claims support for Identify NS List
but the command fails with DNR not set. In such cases, fallback to
sequential scan is potentially expensive and likely unnecessary, as a
retry of the list scan should succeed. So this change skips the fallback
in this third case.
Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:24 +0000 (13:24 +0200)]
nvme-rdma: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work
running, and it involves I/O commands. Given the controller is tearing
down there is no chance for the I/O to complete other than timing out
which may unnecessarily take a full io timeout.
So first tear down the queues, fail/cancel all inflight I/O (including
potentially authentication) and only then stop authentication. This
ensures that failover is not stalled due to blocked authentication I/O.
Sagi Grimberg [Sun, 13 Nov 2022 11:24:23 +0000 (13:24 +0200)]
nvme-tcp: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work
running, and it involves I/O commands. Given the controller is tearing
down there is no chance for the I/O to complete other than timing out
which may unnecessarily take a full io timeout.
So first tear down the queues, fail/cancel all inflight I/O (including
potentially authentication) and only then stop authentication. This
ensures that failover is not stalled due to blocked authentication I/O.
Sagi Grimberg [Sun, 13 Nov 2022 11:24:22 +0000 (13:24 +0200)]
nvme-auth: have dhchap_auth_work wait for queues auth to complete
It triggered the queue authentication work elements in parallel, but
the ctrl authentication work itself completes when all of them
completes. Hence wait for queues auth completions.
This also makes nvme_auth_stop simply a sync cancel of ctrl
dhchap_auth_work.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:21 +0000 (13:24 +0200)]
nvme-auth: remove redundant auth_work flush
only ctrl deletion calls nvme_auth_free, which was stopped prior in the
teardown stage, so there is no possibility that it should ever run when
nvme_auth_free is called. As a result, we can remove a local chap pointer
variable.
Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:20 +0000 (13:24 +0200)]
nvme-auth: convert dhchap_auth_list to an array
We know exactly how many dhchap contexts we will need, there is no need
to hold a list that we need to protect with a mutex. Convert to
a dynamically allocated array. And dhchap_context access state is
maintained by the chap itself.
Make dhchap_auth_mutex protect only the ctrl host_key and ctrl_key
in a fine-grained lock such that there is no long lasting acquisition
of the lock and no need to take/release this lock when flushing
authentication works.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:18 +0000 (13:24 +0200)]
nvme-auth: check chap ctrl_key once constructed
ctrl ctrl_key member may be overwritten from a sysfs context driven
by the user. Once a queue local copy was created, use that instead
to minimize checks on a shared resource.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:13 +0000 (13:24 +0200)]
nvme-auth: don't keep long lived 4k dhchap buffer
dhchap structure is per-queue, it is wasteful to keep it for the entire
lifetime of the queue. Allocate it dynamically and get rid of it after
authentication. We don't need kzalloc because all accessors are clearing
it before writing to it.
Also, remove redundant chap buf_size which is always 4096, use a define
instead.
Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>