Guoqing Jiang [Sat, 13 Feb 2021 00:49:59 +0000 (01:49 +0100)]
md: don't unregister sync_thread with reconfig_mutex held
Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.
And it could cause deadlock problem for raid5 as follows:
1. process A tried to reap sync thread with reconfig_mutex held after echo
idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
to hold reconfig_mutex.
More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <song@kernel.org>
Jens Axboe [Fri, 20 May 2022 12:56:32 +0000 (06:56 -0600)]
Merge tag 'nvme-5.19-2022-05-19' of git://git.infradead.org/nvme into for-5.19/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.19
- set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni)
- add support for TP4084 - Time-to-Ready Enhancements (me)"
* tag 'nvme-5.19-2022-05-19' of git://git.infradead.org/nvme:
nvme: set non-mdts limits in nvme_scan_work
nvme: add support for TP4084 - Time-to-Ready Enhancements
Chaitanya Kulkarni [Wed, 18 May 2022 15:51:38 +0000 (08:51 -0700)]
nvme: set non-mdts limits in nvme_scan_work
In current implementation we set the non-mdts limits by calling
nvme_init_non_mdts_limits() from nvme_init_ctrl_finish().
This also tries to set the limits for the discovery controller which
has no I/O queues resulting in the warning message reported by the
nvme_log_error() when running blktest nvme/002: -
[ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47
[ 2005.192223] loop: module loaded
[ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0
[ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997
[ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998
[ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999
[ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn.
*[ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR*
[ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery"
[ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after
we verify that I/O queues are created since that is a converging point
for each transport where these limits are actually used.
1. FC :
nvme_fc_create_association()
...
nvme_fc_create_io_queues(ctrl);
...
nvme_start_ctrl()
nvme_scan_queue()
nvme_scan_work()
Christoph Hellwig [Mon, 16 May 2022 13:09:21 +0000 (15:09 +0200)]
nvme: add support for TP4084 - Time-to-Ready Enhancements
Add support for using longer timeouts during controller initialization
and letting the controller come up with namespaces that are not ready
for I/O yet. We skip these not ready namespaces during scanning and
only bring them online once anoter scan is kicked off by the AEN that
is set when the NRDY bit gets set in the I/O Command Set Independent
Identify Namespace Data Structure. This asynchronous probing avoids
blocking the kernel boot when controllers take a very long time to
recover after unclean shutdowns (up to minutes).
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
Jens Axboe [Wed, 18 May 2022 12:28:04 +0000 (06:28 -0600)]
Merge tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme into for-5.19/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.19
- tighten the PCI presence check (Stefan Roese):
- fix a potential NULL pointer dereference in an error path
(Kyle Miller Smith)
- fix interpretation of the DMRSL field (Tom Yan)
- relax the data transfer alignment (Keith Busch)
- verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni)
- misc cleanups (Chaitanya Kulkarni, me)"
* tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme:
nvme: split the enum used for various register constants
nvme-fabrics: add a request timeout helper
nvme-pci: harden drive presence detect in nvme_dev_disable()
nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
nvme: mark internal passthru request RQF_QUIET
nvme: remove unneeded include from constants file
nvme: add missing status values to verbose logging
nvme: set dma alignment to dword
nvme: fix interpretation of DMRSL
Xie Yongji [Tue, 22 Mar 2022 08:06:39 +0000 (16:06 +0800)]
nbd: Fix hung on disconnect request if socket is closed before
When userspace closes the socket before sending a disconnect
request, the following I/O requests will be blocked in
wait_for_reconnect() until dead timeout. This will cause the
following disconnect request also hung on blk_mq_quiesce_queue().
That means we have no way to disconnect a nbd device if there
are some I/O requests waiting for reconnecting until dead timeout.
It's not expected. So let's wake up the thread waiting for
reconnecting directly when a disconnect request is sent.
Chaitanya Kulkarni [Wed, 30 Mar 2022 09:40:32 +0000 (02:40 -0700)]
nvme-fabrics: add a request timeout helper
The RDAMA and TCP transport both complete the timed out request in the
same manner and hence code is duplicated. Add and use the helper
nvmf_complete_timed_out_request() to remove the duplicate code.
Analysis: When nvme_dev_disable() is called because of this PCIe hotplug
event, pci_is_enabled() is still true. And accessing the NVMe drive
which is currently not available as it's in reboot process causes this
"synchronous external abort" on this ARM64 platform.
This patch adds the pci_device_is_present() check as well, which returns
false in this "Card not present" hot-plug case. With this change, the
NVMe driver does not try to access the NVMe registers any more and the
FW update finishes without any problems.
Signed-off-by: Stefan Roese <sr@denx.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags
In nvme_alloc_admin_tags, the admin_q can be set to an error (typically
-ENOMEM) if the blk_mq_init_queue call fails to set up the queue, which
is checked immediately after the call. However, when we return the error
message up the stack, to nvme_reset_work the error takes us to
nvme_remove_dead_ctrl()
nvme_dev_disable()
nvme_suspend_queue(&dev->queues[0]).
Here, we only check that the admin_q is non-NULL, rather than not
an error or NULL, and begin quiescing a queue that never existed, leading
to bad / NULL pointer dereference.
Signed-off-by: Kyle Smith <kyles@hpe.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Mark the internal passthru request quiet so that we can skip the verbose
error message from nvme_log_error() in nvme_end_req() completion path,
this will be consistent with what we have in __nvme_submit_sync_cmd().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Wed, 4 May 2022 18:43:25 +0000 (11:43 -0700)]
nvme: set dma alignment to dword
The nvme specification only requires qword alignment for segment
descriptors, and the driver already guarantees that. The spec has always
allowed user data to be dword aligned, which is what the queue's
attribute is for, so relax the alignment requirement to that value.
While we could allow byte alignment for some controllers when using
SGLs, we still need to support PRP, and that only allows dword.
Fixes: 3b2a1ebceba3 ("nvme: set dma alignment to qword") Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Tue, 19 Apr 2022 06:33:01 +0000 (08:33 +0200)]
loop: add a SPDX header
The copyright statement says:
"Redistribution of this file is permitted under the GNU General Public
License." and was added by Ted in 1993, at which point GPLv2 only
was the default Linux license.
Damien Le Moal [Wed, 20 Apr 2022 00:57:18 +0000 (09:57 +0900)]
block: null_blk: Improve device creation with configfs
Currently, the directory name used to create a nullb device through
sysfs is not used as the device name, potentially causing headaches for
users if devices are already created through the modprobe operation
withe the nr_device module parameter not set to 0. E.g. a user can do
"mkdir /sys/kernel/config/nullb/nullb0" to create a nullb device even
though /dev/nullb0 was already created by modprobe. In this case, the
configfs nullb device will be named nullb1, causing confusion for the
user.
Simplify this by using the configfs directory name as the nullb device
name, always, unless another nullb device is already using the same
name. E.g. if modprobe created nullb0, then:
To implement this, the function null_find_dev_by_name() is added to
check for the existence of a nullb device with the name used for a new
configfs device directory. nullb_group_make_item() uses this new
function to check if the directory name can be used as the disk name.
Finally, null_add_dev() is modified to use the device config item name
as the disk name for a new nullb device created using configfs.
The naming of devices created though modprobe remains unchanged.
Of note is that it is possible for a user to create through configfs a
nullb device with the same name as an existing device. E.g.
$ mkdir /sys/kernel/config/nullb/null
will successfully create the nullb device named "null" but this block
device will however not appear under /dev/ since /dev/null already
exists.
Damien Le Moal [Wed, 20 Apr 2022 00:57:17 +0000 (09:57 +0900)]
block: null_blk: Cleanup messages
Use the pr_fmt() macro to prefix all null_blk pr_xxx() messages with
"null_blk:" to clarify which module is printing the messages. Also add
a pr_info() message in null_add_dev() to print the name of a newly
created disk.
Damien Le Moal [Wed, 20 Apr 2022 00:57:16 +0000 (09:57 +0900)]
block: null_blk: Cleanup device creation and deletion
Introduce the null_create_dev() and null_destroy_dev() helper functions
to respectivel create nullb devices on modprobe and destroy them on
rmmod. The null_destroy_dev() helper avoids duplicated code in the
null_init() and null_exit() functions for deleting devices.
Christoph Hellwig [Mon, 18 Apr 2022 04:53:14 +0000 (06:53 +0200)]
xen-blkback: use bdev_discard_alignment
Use bdev_discard_alignment to calculate the correct discard alignment
offset even for partitions instead of just looking at the queue limit.
Also switch to use bdev_discard_granularity to get rid of the last direct
queue reference in xen_blkbk_discard.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-12-hch@lst.de
[axboe: fold in 'q' removal as it's now unused] Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:13 +0000 (06:53 +0200)]
rnbd-srv: use bdev_discard_alignment
Use bdev_discard_alignment to calculate the correct discard alignment
offset even for partitions instead of just looking at the queue limit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:11 +0000 (06:53 +0200)]
loop: remove a spurious clear of discard_alignment
The loop driver never sets a discard_alignment, so it also doens't need
to clear it to zero.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:10 +0000 (06:53 +0200)]
dasd: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to PAGE_SIZE while the discard granularity is the block size
that is smaller or the same as PAGE_SIZE as done by dasd is mostly
harmless but also useless.
Christoph Hellwig [Mon, 18 Apr 2022 04:53:09 +0000 (06:53 +0200)]
raid5: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by raid5 is mostly
harmless but also useless.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:08 +0000 (06:53 +0200)]
dm-zoned: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by dm-zoned is mostly
harmless but also useless.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:07 +0000 (06:53 +0200)]
virtio_blk: fix the discard_granularity and discard_alignment queue limits
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
On the other hand the discard_sector_alignment from the virtio 1.1 looks
similar to what Linux uses as discard granularity (even if not very well
described):
"discard_sector_alignment can be used by OS when splitting a request
based on alignment. "
And at least qemu does set it to the discard granularity.
So stop setting the discard_alignment and use the virtio
discard_sector_alignment to set the discard granularity.
Fixes: 1f23816b8eb8 ("virtio_blk: add discard and write zeroes support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:06 +0000 (06:53 +0200)]
null_blk: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by null_blk is mostly
harmless but also useless.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 18 Apr 2022 04:53:05 +0000 (06:53 +0200)]
nbd: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by nbd is mostly harmless
but also useless.
Christoph Hellwig [Mon, 18 Apr 2022 04:53:04 +0000 (06:53 +0200)]
ubd: don't set the discard_alignment queue limit
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by ubd is mostly harmless
but also useless.
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.19/drivers
Pull MD updates from Song:
"1. Improve annotation in raid5 code, by Logan Gunthorpe.
2. Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk.
3. Other small fixes/cleanups."
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: Replace role magic numbers with defined constants
md/raid0: Ignore RAID0 layout if the second zone has only one device
md/raid5: Annotate functions that hold device_lock with __must_hold
md/raid5-ppl: Annotate with rcu_dereference_protected()
md/raid5: Annotate rdev/replacement access when mddev_lock is held
md/raid5: Annotate rdev/replacement accesses when nr_pending is elevated
md/raid5: Add __rcu annotation to struct disk_info
md/raid5: Un-nest struct raid5_percpu definition
md/raid5: Cleanup setup_conf() error returns
md: replace deprecated strlcpy & remove duplicated line
md/bitmap: don't set sb values if can't pass sanity check
md: fix an incorrect NULL check in md_reload_sb
md: fix an incorrect NULL check in does_sb_need_changing
raid5: introduce MD_BROKEN
md: Set MD_BROKEN for RAID1 and RAID10
null-blk: save memory footprint for struct nullb_cmd
Total 16 bytes can be saved in two ways:
1) The field 'bio' will only be used in bio based mode, and the field
'rq' will only be used in mq mode. Since they won't be used in the
same time, declare a union for them.
2) The field 'bool fake_timeout' can be placed in the hole after the
field 'error'.
David Sloan [Thu, 21 Apr 2022 19:45:58 +0000 (13:45 -0600)]
md: Replace role magic numbers with defined constants
There are several instances where magic numbers are used in md.c instead
of the defined constants in md_p.h. This patch set improves code
readability by replacing all occurrences of 0xffff, 0xfffe, and 0xfffd when
relating to md roles with their equivalent defined constant.
Signed-off-by: David Sloan <david.sloan@eideticom.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>
md/raid0: Ignore RAID0 layout if the second zone has only one device
The RAID0 layout is irrelevant if all members have the same size so the
array has only one zone. It is *also* irrelevant if the array has two
zones and the second zone has only one device, for example if the array
has two members of different sizes.
So in that case it makes sense to allow assembly even when the layout is
undefined, like what is done when the array has only one zone.
Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org> Signed-off-by: Song Liu <song@kernel.org>
md/raid5: Annotate functions that hold device_lock with __must_hold
A handful of functions note the device_lock must be held with a comment
but this is not comprehensive. Many other functions hold the lock when
taken so add an __must_hold() to each call to annotate when the lock is
held.
This makes it a bit easier to analyse device_lock.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
md/raid5-ppl: Annotate with rcu_dereference_protected()
To suppress the last remaining sparse warnings about accessing
rdev, add rcu_dereference_protected calls to a couple places
in raid5-ppl. All of these places are called under raid5_run and
therefore are occurring before the array has started and is thus
safe.
There's no sensible check to do for the second argument of
rcu_dereference_protected() so a comment is added instead.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
md/raid5: Annotate rdev/replacement access when mddev_lock is held
The mddev_lock should be held during raid5_remove_disk() which is when
the rdev/replacement pointers are modified. So any access to these
pointers marked __rcu should be safe whenever the mddev_lock is held.
There are numerous such access that currently produce sparse warnings.
Add a helper function, rdev_mdlock_deref() that wraps
rcu_dereference_protected() in all these instances.
This annotation fixes a number of sparse warnings.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
md/raid5: Add __rcu annotation to struct disk_info
rdev and replacement are protected in some circumstances with
rcu_dereference and synchronize_rcu (in raid5_remove_disk()). However,
they were not annotated with __rcu so a sparse warning is emitted for
every rcu_dereference() call.
Add the __rcu annotation and fix up the initialization with
RCU_INIT_POINTER, all pointer modifications with rcu_assign_pointer(),
a few cases where the pointer value is tested with rcu_access_pointer()
and one case where READ_ONCE() is used instead of rcu_dereference(),
a case in print_raid5_conf() that should have rcu_dereference() and
rcu_read_[un]lock() calls.
Additional sparse issues will be fixed up in further commits.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
md: replace deprecated strlcpy & remove duplicated line
This commit includes two topics:
1> replace deprecated strlcpy
change strlcpy to strscpy for strlcpy is marked as deprecated in
Documentation/process/deprecated.rst
2> remove duplicated strlcpy line
in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the
history:
- commit cf921cc19cf7 ("Add node recovery callbacks") introduced the first
usage of strlcpy().
- commit b97e92574c0b ("Use separate bitmaps for each nodes in the cluster")
introduced the second strlcpy(). this time, the two strlcpy() are same,
we can remove anyone safely.
- commit d3b178adb3a3 ("md: Skip cluster setup for dm-raid") added dm-raid
special handling. And the "nodes" value is the key of this patch. but
from this patch, strlcpy() which was introduced by b97e92574c0bf
become necessary.
- commit 3c462c880b52 ("md: Increment version for clustered bitmaps") used
clustered major version to only handle in clustered env. this patch
could look a polishment for clustered code logic.
md/bitmap: don't set sb values if can't pass sanity check
If bitmap area contains invalid data, kernel will crash then mdadm
triggers "Segmentation fault".
This is cluster-md speical bug. In non-clustered env, mdadm will
handle broken metadata case. In clustered array, only kernel space
handles bitmap slot info. But even this bug only happened in clustered
env, current sanity check is wrong, the code should be changed.
In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't
block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T()
trigger "divide error".
The bug is here:
if (!rdev || rdev->desc_nr != nr) {
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each_rcu(), so it is incorrect to assume that the
iterator value will be NULL if the list is empty or no element
found (In fact, it will be a bogus pointer to an invalid struct
object containing the HEAD). Otherwise it will bypass the check
and lead to invalid memory access passing the check.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'pdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org Fixes: 70bcecdb1534 ("md-cluster: Improve md_reload_sb to be less error prone") Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
md: fix an incorrect NULL check in does_sb_need_changing
The bug is here:
if (!rdev)
The list iterator value 'rdev' will *always* be set and non-NULL
by rdev_for_each(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element found.
Otherwise it will bypass the NULL check and lead to invalid memory
access passing the check.
To fix the bug, use a new variable 'iter' as the list iterator,
while using the original variable 'rdev' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org Fixes: 2aa82191ac36 ("md-cluster: Perform a lazy update") Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Song Liu <song@kernel.org>
Mariusz Tkaczyk [Tue, 22 Mar 2022 15:23:39 +0000 (16:23 +0100)]
raid5: introduce MD_BROKEN
Raid456 module had allowed to achieve failed state. It was fixed by fb73b357fb9 ("raid5: block failing device if raid will be failed").
This fix introduces a bug, now if raid5 fails during IO, it may result
with a hung task without completion. Faulty flag on the device is
necessary to process all requests and is checked many times, mainly in
analyze_stripe().
Allow to set faulty on drive again and set MD_BROKEN if raid is failed.
As a result, this level is allowed to achieve failed state again, but
communication with userspace (via -EBUSY status) will be preserved.
This restores possibility to fail array via #mdadm --set-faulty command
and will be fixed by additional verification on mdadm side.
Cc: stable@vger.kernel.org Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
Mariusz Tkaczyk [Tue, 22 Mar 2022 15:23:38 +0000 (16:23 +0100)]
md: Set MD_BROKEN for RAID1 and RAID10
There is no direct mechanism to determine raid failure outside
personality. It is done by checking rdev->flags after executing
md_error(). If "faulty" flag is not set then -EBUSY is returned to
userspace. -EBUSY means that array will be failed after drive removal.
Mdadm has special routine to handle the array failure and it is executed
if -EBUSY is returned by md.
There are at least two known reasons to not consider this mechanism
as correct:
1. drive can be removed even if array will be failed[1].
2. -EBUSY seems to be wrong status. Array is not busy, but removal
process cannot proceed safe.
-EBUSY expectation cannot be removed without breaking compatibility
with userspace. In this patch first issue is resolved by adding support
for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
next commit.
The idea is to set the MD_BROKEN if we are sure that raid is in failed
state now. This is done in each error_handler(). In md_error() MD_BROKEN
flag is checked. If is set, then -EBUSY is returned to userspace.
As in previous commit, it causes that #mdadm --set-faulty is able to
fail array. Previously proposed workaround is valid if optional
functionality[1] is disabled.
[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
RAID1/RAID10.")
Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
Christoph Hellwig [Wed, 30 Mar 2022 05:29:17 +0000 (07:29 +0200)]
loop: don't destroy lo->workqueue in __loop_clr_fd
There is no need to destroy the workqueue when clearing unbinding
a loop device from a backing file. Not doing so on the other hand
avoid creating a complex lock dependency chain involving the global
system_transition_mutex.
Based on a patch from Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>.
Reported-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: syzbot+6479585dfd4dedd3f7e1@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20220330052917.2566582-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 30 Mar 2022 05:29:16 +0000 (07:29 +0200)]
loop: remove lo_refcount and avoid lo_mutex in ->open / ->release
lo_refcount counts how many openers a loop device has, but that count
is already provided by the block layer in the bd_openers field of the
whole-disk block_device. Remove lo_refcount and allow opens to
succeed even on devices beeing deleted - now that ->free_disk is
implemented we can handle that race gracefull and all I/O on it will
just fail. Similarly there is a small race window now where
loop_control_remove does not synchronize the delete vs the remove
due do bd_openers not being under lo_mutex protection, but we can
handle that just as gracefully.
Tetsuo Handa [Wed, 30 Mar 2022 05:29:15 +0000 (07:29 +0200)]
loop: avoid loop_validate_mutex/lo_mutex in ->release
Since ->release is called with disk->open_mutex held, and __loop_clr_fd()
from lo_release() is called via ->release when disk_openers() == 0, we are
guaranteed that "struct file" which will be passed to loop_validate_file()
via fget() cannot be the loop device __loop_clr_fd(lo, true) will clear.
Thus, there is no need to hold loop_validate_mutex from __loop_clr_fd()
if release == true.
When I made commit 3ce6e1f662a91097 ("loop: reintroduce global lock for
safe loop_validate_file() traversal"), I wrote "It is acceptable for
loop_validate_file() to succeed, for actual clear operation has not started
yet.". But now I came to feel why it is acceptable to succeed.
It seems that the loop driver was added in Linux 1.3.68, and
if (lo->lo_refcnt > 1)
return -EBUSY;
check in loop_clr_fd() was there from the beginning. The intent of this
check was unclear. But now I think that current
disk_openers(lo->lo_disk) > 1
form is there for three reasons.
(1) Avoid I/O errors when some process which opens and reads from this
loop device in response to uevent notification (e.g. systemd-udevd),
as described in commit a1ecac3b0656a682 ("loop: Make explicit loop
device destruction lazy"). This opener is short-lived because it is
likely that the file descriptor used by that process is closed soon.
(2) Avoid I/O errors caused by underlying layer of stacked loop devices
(i.e. ioctl(some_loop_fd, LOOP_SET_FD, other_loop_fd)) being suddenly
disappeared. This opener is long-lived because this reference is
associated with not a file descriptor but lo->lo_backing_file.
(3) Avoid I/O errors caused by underlying layer of mounted loop device
(i.e. mount(some_loop_device, some_mount_point)) being suddenly
disappeared. This opener is long-lived because this reference is
associated with not a file descriptor but mount.
While race in (1) might be acceptable, (2) and (3) should be checked
racelessly. That is, make sure that __loop_clr_fd() will not run if
loop_validate_file() succeeds, by doing refcount check with global lock
held when explicit loop device destruction is requested.
As a result of no longer waiting for lo->lo_mutex after setting Lo_rundown,
we can remove pointless BUG_ON(lo->lo_state != Lo_rundown) check.
Christoph Hellwig [Wed, 30 Mar 2022 05:29:14 +0000 (07:29 +0200)]
loop: suppress uevents while reconfiguring the device
Currently, udev change event is generated for a loop device before the
device is ready for IO. Due to serialization on lo->lo_mutex in
lo_open() this does not matter because anybody is able to open the
device and do IO only after the configuration is finished. However this
synchronization in lo_open() is going away so make sure userspace
reacting to the change event will see the new device state by generating
the event only when the device is setup.
Christoph Hellwig [Wed, 30 Mar 2022 05:29:13 +0000 (07:29 +0200)]
loop: implement ->free_disk
Ensure that the lo_device which is stored in the gendisk private
data is valid until the gendisk is freed. Currently the loop driver
uses a lot of effort to make sure a device is not freed when it is
still in use, but to to fix a potential deadlock this will be relaxed
a bit soon.
Christoph Hellwig [Wed, 30 Mar 2022 05:29:10 +0000 (07:29 +0200)]
loop: remove the racy bd_inode->i_mapping->nrpages asserts
Nothing prevents a file system or userspace opener of the block device
from redirtying the page right afte sync_blockdev returned. Fortunately
data in the page cache during a block device change is mostly harmless
anyway.
Christoph Hellwig [Wed, 30 Mar 2022 05:29:09 +0000 (07:29 +0200)]
loop: initialize the worker tracking fields once
There is no need to reinitialize idle_worker_list, worker_tree and timer
every time a loop device is configured. Just initialize them once at
allocation time.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220330052917.2566582-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 30 Mar 2022 05:29:07 +0000 (07:29 +0200)]
block: turn bdev->bd_openers into an atomic_t
All manipulation of bd_openers is under disk->open_mutex and will remain
so for the foreseeable future. But at least one place reads it without
the lock (blkdev_get) and there are more to be added. So make sure the
compiler does not do turn the increments and decrements into non-atomic
sequences by using an atomic_t.
Christoph Hellwig [Wed, 30 Mar 2022 05:29:06 +0000 (07:29 +0200)]
block: add a disk_openers helper
Add a helper that returns the openers for a given gendisk to avoid having
drivers poke into disk->part0 to get at this information in a somewhat
cumbersome way.
Christoph Hellwig [Wed, 30 Mar 2022 05:29:03 +0000 (07:29 +0200)]
nbd: use the correct block_device in nbd_bdev_reset
The bdev parameter to ->ioctl contains the block device that the ioctl
is called on, which can be the partition. But the openers check in
nbd_bdev_reset really needs to check use the whole device, so switch to
using that.
Instead of invoking a synchronize_rcu() to free a pointer
after a grace period we can directly make use of new API
that does the same but in more efficient way.
TO: Jens Axboe <axboe@kernel.dk>
TO: Philipp Reisner <philipp.reisner@linbit.com>
TO: Jason Gunthorpe <jgg@nvidia.com>
TO: drbd-dev@lists.linbit.com
TO: linux-block@vger.kernel.org Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20220406190715.1938174-7-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: drbd: drbd_receiver: Remove redundant assignment to err
Variable err is set to '-EIO' but this value is never read as
it is overwritten or not used later on, hence it is a redundant
assignment and can be removed.
Clean up the following clang-analyzer warning:
drivers/block/drbd/drbd_receiver.c:3955:5: warning: Value stored to
'err' is never read [clang-analyzer-deadcode.DeadStores].
drivers/block/drbd/drbd_main.c:3676:22: warning: initialized field overwritten [-Woverride-init]
Remove the first one since it was already ignored by the compiler
and reorder the list to match the enum definition. As P_ZEROES had
no entry, add that one instead.
Fixes: 036b17eaab93 ("drbd: Receiving part for the PROTOCOL_UPDATE packet") Fixes: f31e583aa2c2 ("drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20220406190715.1938174-2-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 15 Apr 2022 04:52:58 +0000 (06:52 +0200)]
direct-io: remove random prefetches
Randomly poking into block device internals for manual prefetches isn't
exactly a very maintainable thing to do. And none of the performance
critical direct I/O implementations still use this library function
anyway, so just drop it.
Christoph Hellwig [Fri, 15 Apr 2022 04:52:57 +0000 (06:52 +0200)]
block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD
Secure erase is a very different operation from discard in that it is
a data integrity operation vs hint. Fully split the limits and helper
infrastructure to make the separation more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2] Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Chao Yu <chao@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 15 Apr 2022 04:52:55 +0000 (06:52 +0200)]
block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.
The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 15 Apr 2022 04:52:54 +0000 (06:52 +0200)]
block: add a bdev_max_discard_sectors helper
Add a helper to query the number of sectors support per each discard bio
based on the block device and use this helper to stop various places from
poking into the request_queue to see if discard is supported and if so how
much. This mirrors what is done e.g. for write zeroes as well.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-24-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 15 Apr 2022 04:52:50 +0000 (06:52 +0200)]
block: remove queue_discard_alignment
Just use bdev_alignment_offset in disk_discard_alignment_show instead.
That helpers is the same except for an always false branch that doesn't
matter in this slow path.
Christoph Hellwig [Fri, 15 Apr 2022 04:52:47 +0000 (06:52 +0200)]
block: use bdev_alignment_offset in part_alignment_offset_show
Replace the open coded offset calculation with the proper helper.
This is an ABI change in that the -1 for a misaligned partition is
properly propagated, which can be considered a bug fix and matches
what is done on the whole device.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Fri, 15 Apr 2022 04:52:46 +0000 (06:52 +0200)]
block: add a bdev_max_zone_append_sectors helper
Add a helper to check the max supported sectors for zone append based on
the block_device instead of having to poke into the block layer internal
request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>