Michael Schmitz [Sun, 24 Oct 2021 00:20:13 +0000 (13:20 +1300)]
block: ataflop: more blk-mq refactoring fixes
As it turns out, my earlier patch in commit 86d46fdaa12a (block:
ataflop: fix breakage introduced at blk-mq refactoring) was
incomplete. This patch fixes any remaining issues found during
more testing and code review.
Requests exceeding 4 k are handled in 4k segments but
__blk_mq_end_request() is never called on these (still
sectors outstanding on the request). With redo_fd_request()
removed, there is no provision to kick off processing of the
next segment, causing requests exceeding 4k to hang. (By
setting /sys/block/fd0/queue/max_sectors_k <= 4 as workaround,
this behaviour can be avoided).
Instead of reintroducing redo_fd_request(), requeue the remainder
of the request by calling blk_mq_requeue_request() on incomplete
requests (i.e. when blk_update_request() still returns true), and
rely on the block layer to queue the residual as new request.
Both error handling and formatting needs to release the
ST-DMA lock, so call finish_fdc() on these (this was previously
handled by redo_fd_request()). finish_fdc() may be called
legitimately without the ST-DMA lock held - make sure we only
release the lock if we actually held it. In a similar way,
early exit due to errors in ataflop_queue_rq() must release
the lock.
After minor errors, fd_error sets up to recalibrate the drive
but never re-runs the current operation (another task handled by
redo_fd_request() before). Call do_fd_action() to get the next
steps (seek, retry read/write) underway.
Christoph Hellwig [Tue, 19 Oct 2021 07:56:39 +0000 (09:56 +0200)]
block: remove support for cryptoloop and the xor transfer
Support for cyrptoloop has been officially marked broken and deprecated
in favor of dm-crypt (which supports the same broken algorithms if
needed) in Linux 2.6.4 (released in March 2004), and support for it has
been entirely removed from losetup in util-linux 2.23 (released in April
2013). The XOR transfer has never been more than a toy to demonstrate
the transfer in the bad old times of crypto export restrictions.
Remove them as they have some nasty interactions with loop device life
times due to the iteration over all loop devices in
loop_unregister_transfer.
Luis Chamberlain [Fri, 15 Oct 2021 23:30:24 +0000 (16:30 -0700)]
xen-blkfront: add error handling support for add_disk()
We never checked for errors on device_add_disk() as this function
returned void. Now that this is fixed, use the shiny new error
handling. The function xlvbd_alloc_gendisk() typically does the
unwinding on error on allocating the disk and creating the tag,
but since all that error handling was stuffed inside
xlvbd_alloc_gendisk() we must repeat the tag free'ing as well.
We set the info->rq to NULL to ensure blkif_free() doesn't crash
on blk_mq_stop_hw_queues() on device_add_disk() error as the queue
will be long gone by then.
Luis Chamberlain [Fri, 15 Oct 2021 23:30:22 +0000 (16:30 -0700)]
dm: add add_disk() error handling
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
There are two calls to dm_setup_md_queue() which can fail then,
one on dm_early_create() and we can easily see that the error path
there calls dm_destroy in the error path. The other use case is on
the ioctl table_load case. If that fails userspace needs to call
the DM_DEV_REMOVE_CMD to cleanup the state - similar to any other
failure.
Jens Axboe [Thu, 21 Oct 2021 14:25:54 +0000 (08:25 -0600)]
Merge tag 'nvme-5.16-2021-10-21' of git://git.infradead.org/nvme into for-5.16/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.16
- fix a multipath partition scanning deadlock (Hannes Reinecke)
- generate uevent once a multipath namespace is operational again
(Hannes Reinecke)
- support unique discovery controller NQNs (Hannes Reinecke)
- fix use-after-free when a port is removed (Israel Rukshin)
- clear shadow doorbell memory on resets (Keith Busch)
- use struct_size (Len Baker)
- add error handling support for add_disk (Luis Chamberlain)
- limit the maximal queue size for RDMA controllers (Max Gurtovoy)
- use a few more symbolic names (Max Gurtovoy)
- fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
- add support for ->map_queues on FC (Saurav Kashyap)"
* tag 'nvme-5.16-2021-10-21' of git://git.infradead.org/nvme: (23 commits)
nvmet: use struct_size over open coded arithmetic
nvme: drop scan_lock and always kick requeue list when removing namespaces
nvme-pci: clear shadow doorbell memory on resets
nvme-rdma: fix error code in nvme_rdma_setup_ctrl
nvme-multipath: add error handling support for add_disk()
nvmet: use macro definitions for setting cmic value
nvmet: use macro definition for setting nmic value
nvme: display correct subsystem NQN
nvme: Add connect option 'discovery'
nvme: expose subsystem type in sysfs attribute 'subsystype'
nvmet: set 'CNTRLTYPE' in the identify controller data
nvmet: add nvmet_is_disc_subsys() helper
nvme: add CNTRLTYPE definitions for 'identify controller'
nvmet: make discovery NQN configurable
nvmet-rdma: implement get_max_queue_size controller op
nvmet: add get_max_queue_size op for controllers
nvme-rdma: limit the maximal queue size for RDMA controllers
nvmet-tcp: fix use-after-free when a port is removed
nvmet-rdma: fix use-after-free when a port is removed
nvmet: fix use-after-free when a port is removed
...
Len Baker [Sun, 17 Oct 2021 09:56:50 +0000 (11:56 +0200)]
nvmet: use struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.
In this case this is not actually dynamic size: all the operands
involved in the calculation are constant values. However it is better to
refactor this anyway, just to keep the open-coded math idiom out of
code.
So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kmalloc() function.
This code was detected with the help of Coccinelle and audited and fixed
manually.
As we're now properly ordering the namespace list there is no need to
hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore.
And we always need to kick the requeue list as the path will be marked
as unusable and I/O will be requeued _without_ a current path.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Thu, 14 Oct 2021 16:45:42 +0000 (09:45 -0700)]
nvme-pci: clear shadow doorbell memory on resets
The host memory doorbell and event buffers need to be initialized on
each reset so the driver doesn't observe stale values from the previous
instantiation.
Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: John Levon <john.levon@nutanix.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Sun, 17 Oct 2021 08:58:16 +0000 (11:58 +0300)]
nvme-rdma: fix error code in nvme_rdma_setup_ctrl
In case that icdoff is not zero or mandatory keyed sgls are not
supported by the NVMe/RDMA target, we'll go to error flow but we'll
return 0 to the caller. Fix it by returning an appropriate error code.
Fixes: c66e2998c8ca ("nvme-rdma: centralize controller setup sequence") Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Luis Chamberlain [Fri, 15 Oct 2021 23:52:08 +0000 (16:52 -0700)]
nvme-multipath: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
Since we now can tell for sure when a disk was added, move
setting the bit NVME_NSHEAD_DISK_LIVE only when we did
add the disk successfully.
Nothing to do here as the cleanup is done elsewhere. We take
care and use test_and_set_bit() because it is protects against
two nvme paths simultaneously calling device_add_disk() on the
same namespace head.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Wed, 22 Sep 2021 06:35:25 +0000 (08:35 +0200)]
nvme: display correct subsystem NQN
With discovery controllers supporting unique subsystem NQNs the
actual subsystem NQN might be different from that one passed in
via the connect args. So add a helper to display the resulting
subsystem NQN.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Wed, 22 Sep 2021 06:35:24 +0000 (08:35 +0200)]
nvme: Add connect option 'discovery'
Add a connect option 'discovery' to specify that the connection
should be made to a discovery controller, not a normal I/O controller.
With discovery controllers supporting unique subsystem NQNs we
cannot easily distinguish by the subsystem NQN if this should be
a discovery connection, but we need this information to blank out
options not supported by discovery controllers.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Wed, 22 Sep 2021 06:35:23 +0000 (08:35 +0200)]
nvme: expose subsystem type in sysfs attribute 'subsystype'
With unique discovery controller NQNs we cannot distinguish the
subsystem type by the NQN alone, but need to check the subsystem
type, too.
So expose the subsystem type in a new sysfs attribute 'subsystype'.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Wed, 22 Sep 2021 21:55:37 +0000 (00:55 +0300)]
nvmet-rdma: implement get_max_queue_size controller op
Limit the maximal queue size for RDMA controllers. Today, the target
reports a limit of 1024 and this limit isn't valid for some of the RDMA
based controllers. For now, limit RDMA transport to 128 entries (the
max queue depth configured for Linux NVMe/RDMA host).
Future general solution should use RDMA/core API to calculate this size
according to device capabilities and number of WRs needed per NVMe IO
request.
Reported-by: Mark Ruijter <mruijter@primelogic.nl> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Wed, 22 Sep 2021 21:55:36 +0000 (00:55 +0300)]
nvmet: add get_max_queue_size op for controllers
Some transports, such as RDMA, would like to set the queue size
according to device/port/ctrl characteristics. Add a new nvmet transport
op that is called during ctrl initialization. This will not effect
transports that don't implement this option.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Wed, 22 Sep 2021 21:55:35 +0000 (00:55 +0300)]
nvme-rdma: limit the maximal queue size for RDMA controllers
Corrent limit of 1024 isn't valid for some of the RDMA based ctrls. In
case the target expose a cap of larger amount of entries (e.g. 1024),
the initiator may fail to create a QP with this size. Thus limit to a
value that works for all RDMA adapters.
Future general solution should use RDMA/core API to calculate this size
according to device capabilities and number of WRs needed per NVMe IO
request.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Israel Rukshin [Wed, 6 Oct 2021 08:09:45 +0000 (08:09 +0000)]
nvmet-tcp: fix use-after-free when a port is removed
When removing a port, all its controllers are being removed, but there
are queues on the port that doesn't belong to any controller (during
connection time). This causes a use-after-free bug for any command
that dereferences req->port (like in nvmet_alloc_ctrl). Those queues
should be destroyed before freeing the port via configfs. Destroy
the remaining queues after the accept_work was cancelled guarantees
that no new queue will be created.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Israel Rukshin [Wed, 6 Oct 2021 08:09:44 +0000 (08:09 +0000)]
nvmet-rdma: fix use-after-free when a port is removed
When removing a port, all its controllers are being removed, but there
are queues on the port that doesn't belong to any controller (during
connection time). This causes a use-after-free bug for any command
that dereferences req->port (like in nvmet_alloc_ctrl). Those queues
should be destroyed before freeing the port via configfs. Destroy the
remaining queues after the RDMA-CM was destroyed guarantees that no
new queue will be created.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Israel Rukshin [Wed, 6 Oct 2021 08:09:43 +0000 (08:09 +0000)]
nvmet: fix use-after-free when a port is removed
When a port is removed through configfs, any connected controllers
are starting teardown flow asynchronously and can still send commands.
This causes a use-after-free bug for any command that dereferences
req->port (like in nvmet_parse_io_cmd).
To fix this, wait for all the teardown scheduled works to complete
(like release_work at rdma/tcp drivers). This ensures there are no
active controllers when the port is eventually removed.
Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Saurav Kashyap [Mon, 23 Aug 2021 12:56:48 +0000 (05:56 -0700)]
nvme-fc: add support for ->map_queues
NVMe FC don't have support for map queues, unlike the PCI, RDMA and TCP
transports. Add a ->map_queues callout for the LLDDs to provide such
functionality.
Hannes Reinecke [Wed, 6 Oct 2021 09:13:13 +0000 (11:13 +0200)]
nvme: generate uevent once a multipath namespace is operational again
When fast_io_fail_tmo is set I/O will be aborted while recovery is
still ongoing. This causes MD to set the namespace to failed, and
no futher I/O will be submitted to that namespace.
However, once the recovery succeeds and the namespace becomes
operational again the NVMe subsystem doesn't send a notification,
so MD cannot automatically reinstate operation and requires
manual interaction.
This patch will send a KOBJ_CHANGE uevent per multipathed namespace
once the underlying controller transitions to LIVE, allowing an automatic
MD reassembly with these udev rules:
UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID})
MD_DEVNAME=${UUID_LINK##*/}
export $(${MDADM} --detail --export /dev/${MD_DEVNAME})
if [ -z "${MD_METADATA}" ] ; then
exit 1
fi
if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then
echo "${MD_DEVNAME}: array not degraded, nothing to do"
exit 0
fi
MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state)
if [ ${MD_STATE} != "clean" ] ; then
echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add"
exit 1
fi
MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE"
if [ ${!MD_VARNAME} = "spare" ] ; then
${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME}
fi
Changes to v2:
- Add udev rules example to description
Changes to v1:
- use disk_uevent() as suggested by hch
Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
But cached_dev won't be unlinked from c->cached_devs list until we call
following list_move(&dc->list, &uncached_devices),
so previous fix in 'commit 46010141da6677b81cc77f9b47f8ac62bd1cbfd3
("bcache: recal cached_dev_sectors on detach")' is wrong, now we move
it to its right place.
Chao Yu [Wed, 20 Oct 2021 14:38:07 +0000 (22:38 +0800)]
bcache: fix error info in register_bcache()
In register_bcache(), there are several cases we didn't set
correct error info (return value and/or error message):
- if kzalloc() fails, it needs to return ENOMEM and print
"cannot allocate memory";
- if register_cache() fails, it's better to propagate its
return value rather than using default EINVAL.
Coly Li [Wed, 20 Oct 2021 14:38:06 +0000 (22:38 +0800)]
bcache: reserve never used bits from bkey.high
There sre 3 bits in member high of struct bkey are never used, and no
plan to support them in future,
- HEADER_SIZE, start at bit 58, length 2 bits
- KEY_PINNED, start at bit 55, length 1 bit
No any kernel code, or user space tool references or accesses the three
bits. Therefore it is possible and feasible to reserve the valuable bits
from bkey.high. They can be used in future for other purpose.
Stefan Haberland [Wed, 20 Oct 2021 11:51:24 +0000 (13:51 +0200)]
s390/dasd: fix possibly missed path verification
__dasd_device_check_path_events() calls the discipline path event handler.
This handler can leave the 'to be verified pathmask' populated for an
additional verification.
There is a race window where the worker has finished before
dasd_path_clear_all_verify() is called which resets the tbvpm.
Due to this there could be outstanding path verifications missed.
Fix by clearing the pathmasks before calling the handler and add them
again in case of an error.
Stefan Haberland [Wed, 20 Oct 2021 11:51:23 +0000 (13:51 +0200)]
s390/dasd: fix missing path conf_data after failed allocation
dasd_eckd_path_available_action() does a memory allocation to store
the per path configuration data permanently.
In the unlikely case that this allocation fails there is no conf_data
stored for the corresponding path.
This is OK since this is not necessary for an operational path but some
features like control unit initiated reconfiguration (CUIR) do not work.
To fix this add the path to the 'to be verified pathmask' again and
schedule the handler again.
Stefan Haberland [Wed, 20 Oct 2021 11:51:22 +0000 (13:51 +0200)]
s390/dasd: summarize dasd configuration data in a separate structure
Summarize the dasd configuration data in a separate structure so that
functions that need temporary config data do not need to allocate the
whole eckd_private structure.
Stefan Haberland [Wed, 20 Oct 2021 11:51:21 +0000 (13:51 +0200)]
s390/dasd: move dasd_eckd_read_fc_security
dasd_eckd_read_conf is called multiple times during device setup but the
fc_security feature needs to be read only once. So move it into the calling
function.
Heiko Carstens [Wed, 20 Oct 2021 11:51:19 +0000 (13:51 +0200)]
s390/dasd: fix kernel doc comment
Fix this:
drivers/s390/block/dasd_ioctl.c:666: warning:
Function parameter or member 'disk' not described in 'dasd_biodasdinfo'
drivers/s390/block/dasd_ioctl.c:666: warning:
Function parameter or member 'info' not described in 'dasd_biodasdinfo'
Reproduce this issue as follows:
1. nbd-server 8000 /tmp/disk
2. nbd-client localhost 8000 /dev/nbd1
3. cat /sys/block/nbd1/pid
Then trigger use-after-free in pid_show.
Reason is after do step '2', nbd-client progress is already exit. So
it's task_struct already freed.
To solve this issue, revert part of 6521d39a64b3's modify and remove
useless 'recv_task' member of nbd_device.
Michael Schmitz [Tue, 19 Oct 2021 06:13:21 +0000 (19:13 +1300)]
block: ataflop: fix breakage introduced at blk-mq refactoring
Refactoring of the Atari floppy driver when converting to blk-mq
has broken the state machine in not-so-subtle ways:
finish_fdc() must be called when operations on the floppy device
have completed. This is crucial in order to relase the ST-DMA
lock, which protects against concurrent access to the ST-DMA
controller by other drivers (some DMA related, most just related
to device register access - broken beyond compare, I know).
When rewriting the driver's old do_request() function, the fact
that finish_fdc() was called only when all queued requests had
completed appears to have been overlooked. Instead, the new
request function calls finish_fdc() immediately after the last
request has been queued. finish_fdc() executes a dummy seek after
most requests, and this overwrites the state machine's interrupt
hander that was set up to wait for completion of the read/write
request just prior. To make matters worse, finish_fdc() is called
before device interrupts are re-enabled, making certain that the
read/write interupt is missed.
Shifting the finish_fdc() call into the read/write request
completion handler ensures the driver waits for the request to
actually complete. With a queue depth of 2, we won't see long
request sequences, so calling finish_fdc() unconditionally just
adds a little overhead for the dummy seeks, and keeps the code
simple.
While we're at it, kill ataflop_commit_rqs() which does nothing
but run finish_fdc() unconditionally, again likely wiping out an
in-flight request.
There is a problem that nbd_handle_reply() might access freed request:
1) At first, a normal io is submitted and completed with scheduler:
internel_tag = blk_mq_get_tag -> get tag from sched_tags
blk_mq_rq_ctx_init
sched_tags->rq[internel_tag] = sched_tag->static_rq[internel_tag]
...
blk_mq_get_driver_tag
__blk_mq_get_driver_tag -> get tag from tags
tags->rq[tag] = sched_tag->static_rq[internel_tag]
So, both tags->rq[tag] and sched_tags->rq[internel_tag] are pointing
to the request: sched_tags->static_rq[internal_tag]. Even if the
io is finished.
2) nbd server send a reply with random tag directly:
commit 6a468d5990ec ("nbd: don't start req until after the dead
connection logic") move blk_mq_start_request() from nbd_queue_rq()
to nbd_handle_cmd() to skip starting request if the connection is
dead. However, request is still started in other error paths.
Currently, blk_mq_end_request() will be called immediately if
nbd_queue_rq() failed, thus start request in such situation is
useless. So remove blk_mq_start_request() from error paths in
nbd_handle_cmd().
nbd: make sure request completion won't concurrent
commit cddce0116058 ("nbd: Aovid double completion of a request")
try to fix that nbd_clear_que() and recv_work() can complete a
request concurrently. However, the problem still exists:
t1 t2 t3
nbd_disconnect_and_put
flush_workqueue
recv_work
blk_mq_complete_request
blk_mq_complete_request_remote -> this is true
WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)
blk_mq_raise_softirq
blk_done_softirq
blk_complete_reqs
nbd_complete_rq
blk_mq_end_request
blk_mq_free_request
WRITE_ONCE(rq->state, MQ_RQ_IDLE)
nbd_clear_que
blk_mq_tagset_busy_iter
nbd_clear_req
__blk_mq_free_request
blk_mq_put_tag
blk_mq_complete_request -> complete again
There are three places where request can be completed in nbd:
recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they
all hold cmd->lock before completing the request, it's easy to
avoid the problem by setting and checking a cmd flag.
nbd: don't handle response without a corresponding request message
While handling a response message from server, nbd_read_stat() will
try to get request by tag, and then complete the request. However,
this is problematic if nbd haven't sent a corresponding request
message:
Xiao Ni [Wed, 13 Oct 2021 14:59:33 +0000 (22:59 +0800)]
md: update superblock after changing rdev flags in state_store
When the in memory flag is changed, we need to persist the change in the
rdev superblock flags. This is needed for "writemostly" and "failfast".
Reviewed-by: Li Feng <fengli@smartx.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Mon, 4 Oct 2021 15:34:48 +0000 (23:34 +0800)]
md/raid1: only allocate write behind bio for WriteMostly device
Commit 6607cd319b6b91bff94e90f798a61c031650b514 ("raid1: ensure write
behind bio has less than BIO_MAX_VECS sectors") tried to guarantee the
size of behind bio is not bigger than BIO_MAX_VECS sectors.
Unfortunately the same calltrace still could happen since an array could
enable write-behind without write mostly device.
To match the manpage of mdadm (which says "write-behind is only attempted
on drives marked as write-mostly"), we need to check WriteMostly flag to
avoid such unexpected behavior.
Christoph Hellwig [Wed, 1 Sep 2021 11:38:32 +0000 (13:38 +0200)]
md: extend disks_mutex coverage
disks_mutex is intended to serialize md_alloc. Extended it to also cover
the kobject_uevent call and getting the sysfs dirent to help reducing
error handling complexity.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 1 Sep 2021 11:38:31 +0000 (13:38 +0200)]
md: add the bitmap group to the default groups for the md kobject
Replace the deprecated default_attrs with the default_groups mechanism,
and add the always visible bitmap group to the groups created add
kobject_add time.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Wed, 1 Sep 2021 11:38:30 +0000 (13:38 +0200)]
md: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
We just do the unwinding of what was not done before, and are
sure to unlock prior to bailing.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 2 Oct 2021 01:23:26 +0000 (19:23 -0600)]
swim3: add missing major.h include
swim3 got this through blkdev.h previously, but blkdev.h is not including
it anymore. Include it specifically for the driver, otherwise FLOPPY_MAJOR
is undefined and breaks the compile on PPC if swim3 is configured.
Fixes: b81e0c2372e6 ("block: drop unused includes in <linux/genhd.h>") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Mon, 27 Sep 2021 22:03:01 +0000 (15:03 -0700)]
block/ataflop: provide a helper for cleanup up an atari disk
Instead of using two separate code paths for cleaning up an atari disk,
use one. We take the more careful approach to check for *all* disk
types, as is done on exit. The init path didn't have that check as
the alternative disk types are only probed for later, they are not
initialized by default.
Luis Chamberlain [Mon, 27 Sep 2021 22:03:00 +0000 (15:03 -0700)]
block/ataflop: add registration bool before calling del_gendisk()
The ataflop assumes del_gendisk() is safe to call, this is only
true because add_disk() does not return a failure, but that will
change soon. And so, before we get to adding error handling for
that case, let's make sure we keep track of which disks actually
get registered. Then we use this to only call del_gendisk for them.
Luis Chamberlain [Mon, 27 Sep 2021 22:02:58 +0000 (15:02 -0700)]
swim: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
Since we have a caller to do our unwinding for the disk,
and this is already dealt with safely we can re-use our
existing error path goto label which already deals with
the cleanup.
Luis Chamberlain [Mon, 27 Sep 2021 22:02:57 +0000 (15:02 -0700)]
swim: add a floppy registration bool which triggers del_gendisk()
Instead of calling del_gendisk() on exit alone, let's add
a registration bool to the floppy disk state, this way this can
be done on the shared caller, swim_cleanup_floppy_disk().
This will be more useful in subsequent patches. Right now, this
just shuffles functionality out to a helper in a safe way.
Luis Chamberlain [Mon, 27 Sep 2021 22:02:56 +0000 (15:02 -0700)]
swim: add helper for disk cleanup
Disk cleanup can be shared between exit and bringup. Use a
helper to do the work required. The only functional change at
this point is we're being overly paraoid on exit to check for
a null disk as well now, and this should be safe.
We'll later expand on this, this change just makes subsequent
changes easier to read.
Luis Chamberlain [Mon, 27 Sep 2021 22:02:54 +0000 (15:02 -0700)]
amiflop: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling. The caller for fd_alloc_disk() deals with
the rest of the cleanup like the tag.
Luis Chamberlain [Mon, 27 Sep 2021 22:02:52 +0000 (15:02 -0700)]
floppy: fix calling platform_device_unregister() on invalid drives
platform_device_unregister() should only be called when
a respective platform_device_register() is called. However
the floppy driver currently allows failures when registring
a drive and a bail out could easily cause an invalid call
to platform_device_unregister() where it was not intended.
Fix this by adding a bool to keep track of when the platform
device was registered for a drive.
This does not fix any known panic / bug. This issue was found
through code inspection while preparing the driver to use the
up and coming support for device_add_disk() error handling.
From what I can tell from code inspection, chances of this
ever happening should be insanely small, perhaps OOM.
Luis Chamberlain [Mon, 27 Sep 2021 22:02:50 +0000 (15:02 -0700)]
floppy: fix add_disk() assumption on exit due to new developments
After the patch titled "floppy: use blk_mq_alloc_disk and
blk_cleanup_disk" the floppy driver was modified to allocate
the blk_mq_alloc_disk() which allocates the disk with the
queue. This is further clarified later with the patch titled
"block: remove alloc_disk and alloc_disk_node". This clarifies
that:
Most drivers should use and have been converted to use
blk_alloc_disk and blk_mq_alloc_disk. Only the scsi
ULPs and dasd still allocate a disk separately from the
request_queue so don't bother with convenience macros for
something that should not see significant new users and
remove these wrappers.
And then we have the patch titled, "block: hold a request_queue
reference for the lifetime of struct gendisk" which ensures
that a queue is *always* present for sure during the entire
lifetime of a disk.
In the floppy driver's case then the disk always comes with the
queue. So even if even if the queue was cleaned up on exit, putting
the disk *is* still required, and likewise, blk_cleanup_queue() on
a null queue should not happen now as disk->queue is valid from
disk allocation time on.
Automatic backport code scrapers should hopefully not cherry pick
this patch as a stable fix candidate without full due dilligence to
ensure all the work done on the block layer to make this happen is
merged first.
Luis Chamberlain [Mon, 27 Sep 2021 22:01:55 +0000 (15:01 -0700)]
block/sx8: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
A completion is used to notify the initial probe what is
happening and so we must defer error handling on completion.
Do this by remembering the error and using the shared cleanup
function.
The tags are shared and so are hanlded later for the
driver already.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>