Sagi Grimberg [Sun, 13 Nov 2022 11:24:23 +0000 (13:24 +0200)]
nvme-tcp: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work
running, and it involves I/O commands. Given the controller is tearing
down there is no chance for the I/O to complete other than timing out
which may unnecessarily take a full io timeout.
So first tear down the queues, fail/cancel all inflight I/O (including
potentially authentication) and only then stop authentication. This
ensures that failover is not stalled due to blocked authentication I/O.
Sagi Grimberg [Sun, 13 Nov 2022 11:24:22 +0000 (13:24 +0200)]
nvme-auth: have dhchap_auth_work wait for queues auth to complete
It triggered the queue authentication work elements in parallel, but
the ctrl authentication work itself completes when all of them
completes. Hence wait for queues auth completions.
This also makes nvme_auth_stop simply a sync cancel of ctrl
dhchap_auth_work.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:21 +0000 (13:24 +0200)]
nvme-auth: remove redundant auth_work flush
only ctrl deletion calls nvme_auth_free, which was stopped prior in the
teardown stage, so there is no possibility that it should ever run when
nvme_auth_free is called. As a result, we can remove a local chap pointer
variable.
Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:20 +0000 (13:24 +0200)]
nvme-auth: convert dhchap_auth_list to an array
We know exactly how many dhchap contexts we will need, there is no need
to hold a list that we need to protect with a mutex. Convert to
a dynamically allocated array. And dhchap_context access state is
maintained by the chap itself.
Make dhchap_auth_mutex protect only the ctrl host_key and ctrl_key
in a fine-grained lock such that there is no long lasting acquisition
of the lock and no need to take/release this lock when flushing
authentication works.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:18 +0000 (13:24 +0200)]
nvme-auth: check chap ctrl_key once constructed
ctrl ctrl_key member may be overwritten from a sysfs context driven
by the user. Once a queue local copy was created, use that instead
to minimize checks on a shared resource.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:13 +0000 (13:24 +0200)]
nvme-auth: don't keep long lived 4k dhchap buffer
dhchap structure is per-queue, it is wasteful to keep it for the entire
lifetime of the queue. Allocate it dynamically and get rid of it after
authentication. We don't need kzalloc because all accessors are clearing
it before writing to it.
Also, remove redundant chap buf_size which is always 4096, use a define
instead.
Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 13 Nov 2022 11:24:05 +0000 (13:24 +0200)]
nvme-auth: rename __nvme_auth_[reset|free] to nvme_auth[reset|free]_dhchap
nvme_auth_[reset|free] operate on the controller while
__nvme_auth_[reset|free] operate on a chap struct (which maps to a queue
context). Rename it for clarity.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 13 Nov 2022 11:07:04 +0000 (12:07 +0100)]
nvme-pci: split the initial probe from the rest path
nvme_reset_work is a little fragile as it needs to handle both resetting
a live controller and initializing one during probe. Split out the initial
probe and open code it in nvme_probe and leave nvme_reset_work to just do
the live controller reset.
This fixes a recently introduced bug where nvme_dev_disable causes a NULL
pointer dereferences in blk_mq_quiesce_tagset because the tagset pointer
is not set when the reset state is entered directly from the new state.
The separate probe code can skip the reset state and probe directly and
fixes this.
To make sure the system isn't single threaded on enabling nvme
controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver
structure so that the driver core probes in parallel.
Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") Reported-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
Christoph Hellwig [Tue, 8 Nov 2022 09:56:49 +0000 (10:56 +0100)]
nvme-pci: simplify nvme_dbbuf_dma_alloc
Move the OACS check and the error checking into nvme_dbbuf_dma_alloc so
that an upcoming second caller doesn't have to duplicate this boilerplate
code.
Christoph Hellwig [Tue, 8 Nov 2022 08:44:00 +0000 (09:44 +0100)]
nvme-pci: factor out a nvme_pci_alloc_dev helper
Add a helper that allocates the nvme_dev structure up to the point where
we can call nvme_init_ctrl. This pairs with the free_ctrl method and can
thus be used to cleanup the teardown path and make it more symmetric.
Note that this now calls nvme_init_ctrl a lot earlier during probing,
which also means the per-controller character device shows up earlier.
Due to the controller state no commnds can be send on it, but it might
make sense to delay the cdev registration until nvme_init_ctrl_finish.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
Christoph Hellwig [Tue, 8 Nov 2022 08:11:13 +0000 (09:11 +0100)]
nvme-pci: move more teardown work to nvme_remove
nvme_dbbuf_dma_free frees dma coherent memory, so it must not be called
after ->remove has returned. Fortunately there is no way to use it
after shutdown as no more I/O is possible so it can be moved. Similarly
the iod_mempool can't be used for a device kept alive after shutdown, so
move it next to freeing the PRP pools.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
Christoph Hellwig [Thu, 27 Oct 2022 09:34:13 +0000 (02:34 -0700)]
nvme: simplify transport specific device attribute handling
Allow the transport driver to override the attribute groups for the
control device, so that the PCIe driver doesn't manually have to add a
group after device creation and keep track of it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
Christoph Hellwig [Tue, 8 Nov 2022 14:48:27 +0000 (15:48 +0100)]
nvme: move OPAL setup from PCIe to core
Nothing about the TCG Opal support is PCIe transport specific, so move it
to the core code. For this nvme_init_ctrl_finish grows a new
was_suspended argument that allows the transport driver to tell the OPAL
code if the controller came out of a suspend cycle.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: James Smart <jsmart2021@gmail.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
Christoph Hellwig [Tue, 8 Nov 2022 14:46:45 +0000 (15:46 +0100)]
nvme: don't call nvme_init_ctrl_finish from nvme_passthru_end
nvme_passthrough_end can race with a reset, which can lead to
racing stores to the cels xarray as well as further shengians
with upcoming more complicated initialization.
So drop the call and just log that the controller capabilities
might have changed and a reset could be required to use the new
controller capabilities.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
Christoph Hellwig [Sun, 30 Oct 2022 15:50:15 +0000 (16:50 +0100)]
nvme: implement the DEAC bit for the Write Zeroes command
While the specification allows devices to either deallocate data
or to actually write zeroes on any Write Zeroes command, many SSDs
only do the sensible thing and deallocate data when the DEAC bit
is specific. Set it when it is supported and the caller doesn't
explicitly opt out of deallocation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Kanchan Joshi [Tue, 1 Nov 2022 09:43:07 +0000 (10:43 +0100)]
nvme: identify-namespace without CAP_SYS_ADMIN
Allow all identify-namespace variants (CNS 00h, 05h and 08h) without
requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is
needed to form IO commands for passthrough interface.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
In the example above, ng0n1 appears as if it may allow unprivileged
read/write operation but it does not and behaves same as ng0n2.
This patch implements a shift from CAP_SYS_ADMIN to more fine-granular
control for io-commands.
If CAP_SYS_ADMIN is present, nothing else is checked as before.
Otherwise, following rules are in place
- any admin-cmd is not allowed
- vendor-specific and fabric commmand are not allowed
- io-commands that can write are allowed if matching FMODE_WRITE
permission is present
- io-commands that read are allowed
Add a helper nvme_cmd_allowed that implements above policy.
Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed
for any decision making.
Since file open mode is counted for any approval/denial, change at
various places to keep file-mode information handy.
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Christophe JAILLET [Sun, 2 Oct 2022 09:59:45 +0000 (11:59 +0200)]
nvme-fc: improve memory usage in nvme_fc_rcv_ls_req()
sizeof( struct nvmefc_ls_rcv_op ) = 64
sizeof( union nvmefc_ls_requests ) = 1024
sizeof( union nvmefc_ls_responses ) = 128
So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when
kzalloc() is called.
Because of the way memory allocations are performed, 2048 bytes are
allocated. So about 800 bytes are wasted for each request.
Switch to 3 distinct memory allocations, in order to:
- save these 800 bytes
- avoid zeroing this extra memory
- make sure that memory is properly aligned in case of DMA access
("fc_dma_map_single(lsop->rspbuf)" just a few lines below)
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 25 Oct 2022 15:50:08 +0000 (17:50 +0200)]
nvmet: force reconnect when number of queue changes
In order to test queue number changes we need to make sure that the
host reconnects. Because only when the host disconnects from the
target the number of queues are allowed to change according the spec.
The initial idea was to disable and re-enable the ports and have the
host wait until the KATO timer expires, triggering error
recovery. Though the host would see a DNR reply when trying to
reconnect. Because of the DNR bit the connection is dropped
completely. There is no point in trying to reconnect with the same
parameters according the spec.
We can force to reconnect the host is by deleting all controllers. The
host will observe any newly posted request to fail and thus starts the
error recovery but this time without the DNR bit set.
Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Uros Bizjak [Thu, 20 Oct 2022 15:35:40 +0000 (17:35 +0200)]
nvmet: use try_cmpxchg in nvmet_update_sq_head
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
nvmet_update_sq_head. x86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.
Note that the value from *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.
Jens Axboe [Mon, 14 Nov 2022 19:57:50 +0000 (12:57 -0700)]
Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block
Pull MD fixes from Song.
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid1: stop mdx_raid1 thread when raid1 array run failed
md/raid5: use bdev_write_cache instead of open coding it
md: fix a crash in mempool_free
md/raid0, raid10: Don't set discard sectors for request queue
md/bitmap: Fix bitmap chunk size overflow issues
md: introduce md_ro_state
md: factor out __md_set_array_info()
lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE
raid5-cache: use try_cmpxchg in r5l_wake_reclaim
drivers/md/md-bitmap: check the return value of md_bitmap_get_counter()
Jiang Li [Mon, 7 Nov 2022 14:16:59 +0000 (22:16 +0800)]
md/raid1: stop mdx_raid1 thread when raid1 array run failed
fail run raid1 array when we assemble array with the inactive disk only,
but the mdx_raid1 thread were not stop, Even if the associated resources
have been released. it will caused a NULL dereference when we do poweroff.
Mikulas Patocka [Fri, 4 Nov 2022 13:53:38 +0000 (09:53 -0400)]
md: fix a crash in mempool_free
There's a crash in mempool_free when running the lvm test
shell/lvchange-rebuild-raid.sh.
The reason for the crash is this:
* super_written calls atomic_dec_and_test(&mddev->pending_writes) and
wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
and bio_put(bio).
* so, the process that waited on sb_wait and that is woken up is racing
with bio_put(bio).
* if the process wins the race, it calls bioset_exit before bio_put(bio)
is executed.
* bio_put(bio) attempts to free a bio into a destroyed bio set - causing
a crash in mempool_free.
We fix this bug by moving bio_put before atomic_dec_and_test.
We also move rdev_dec_pending before atomic_dec_and_test as suggested by
Neil Brown.
The function md_end_flush has a similar bug - we must call bio_put before
we decrement the number of in-progress bios.
Xiao Ni [Wed, 2 Nov 2022 02:07:30 +0000 (10:07 +0800)]
md/raid0, raid10: Don't set discard sectors for request queue
It should use disk_stack_limits to get a proper max_discard_sectors
rather than setting a value by stack drivers.
And there is a bug. If all member disks are rotational devices,
raid0/raid10 set max_discard_sectors. So the member devices are
not ssd/nvme, but raid0/raid10 export the wrong value. It reports
warning messages in function __blkdev_issue_discard when mkfs.xfs
like this:
Signed-off-by: Xiao Ni <xni@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Song Liu <song@kernel.org>
Florian-Ewald Mueller [Tue, 25 Oct 2022 07:37:05 +0000 (09:37 +0200)]
md/bitmap: Fix bitmap chunk size overflow issues
- limit bitmap chunk size internal u64 variable to values not overflowing
the u32 bitmap superblock structure variable stored on persistent media
- assign bitmap chunk size internal u64 variable from unsigned values to
avoid possible sign extension artifacts when assigning from a s32 value
The bug has been there since at least kernel 4.0.
Steps to reproduce it:
1: mdadm -C /dev/mdx -l 1 --bitmap=internal --bitmap-chunk=256M -e 1.2
-n2 /dev/rnbd1 /dev/rnbd2
2 resize member device rnbd1 and rnbd2 to 8 TB
3 mdadm --grow /dev/mdx --size=max
The bitmap_chunksize will overflow without patch.
Cc: stable@vger.kernel.org Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> Signed-off-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Song Liu <song@kernel.org>
Giulio Benetti [Wed, 19 Oct 2022 16:04:07 +0000 (18:04 +0200)]
lib/raid6: drop RAID6_USE_EMPTY_ZERO_PAGE
RAID6_USE_EMPTY_ZERO_PAGE is unused and hardcoded to 0, so let's drop it.
Signed-off-by: Giulio Benetti <giulio.benetti@benettiengineering.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
Uros Bizjak [Thu, 20 Oct 2022 15:51:04 +0000 (17:51 +0200)]
raid5-cache: use try_cmpxchg in r5l_wake_reclaim
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
r5l_wake_reclaim. 86 CMPXCHG instruction returns success in ZF flag, so
this change saves a compare after cmpxchg (and related move instruction in
front of cmpxchg).
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.
Note that the value from *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.
No functional change intended.
Cc: Song Liu <song@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
Gabriel Krisman Bertazi [Sat, 5 Nov 2022 23:10:55 +0000 (19:10 -0400)]
sbitmap: Use single per-bitmap counting to wake up queued tags
sbitmap suffers from code complexity, as demonstrated by recent fixes,
and eventual lost wake ups on nested I/O completion. The later happens,
from what I understand, due to the non-atomic nature of the updates to
wait_cnt, which needs to be subtracted and eventually reset when equal
to zero. This two step process can eventually miss an update when a
nested completion happens to interrupt the CPU in between the wait_cnt
updates. This is very hard to fix, as shown by the recent changes to
this code.
The code complexity arises mostly from the corner cases to avoid missed
wakes in this scenario. In addition, the handling of wake_batch
recalculation plus the synchronization with sbq_queue_wake_up is
non-trivial.
This patchset implements the idea originally proposed by Jan [1], which
removes the need for the two-step updates of wait_cnt. This is done by
tracking the number of completions and wakeups in always increasing,
per-bitmap counters. Instead of having to reset the wait_cnt when it
reaches zero, we simply keep counting, and attempt to wake up N threads
in a single wait queue whenever there is enough space for a batch.
Waking up less than batch_wake shouldn't be a problem, because we
haven't changed the conditions for wake up, and the existing batch
calculation guarantees at least enough remaining completions to wake up
a batch for each queue at any time.
Performance-wise, one should expect very similar performance to the
original algorithm for the case where there is no queueing. In both the
old algorithm and this implementation, the first thing is to check
ws_active, which bails out if there is no queueing to be managed. In the
new code, we took care to avoid accounting completions and wakeups when
there is no queueing, to not pay the cost of atomic operations
unnecessarily, since it doesn't skew the numbers.
For more interesting cases, where there is queueing, we need to take
into account the cross-communication of the atomic operations. I've
been benchmarking by running parallel fio jobs against a single hctx
nullb in different hardware queue depth scenarios, and verifying both
IOPS and queueing.
Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
varying only the hardware queue length per test.
The following is a similar experiment, ran against a nullb with a single
bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
parallel fio jobs operating on the same device
It has also survived blktests and a 12h-stress run against nullb. I also
ran the code against nvme and a scsi SSD, and I didn't observe
performance regression in those. If there are other tests you think I
should run, please let me know and I will follow up with results.
Cc: Hugh Dickins <hughd@google.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Liu Song <liusong@linux.alibaba.com> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20221105231055.25953-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Khazhismel Kumykov [Tue, 8 Nov 2022 18:10:29 +0000 (10:10 -0800)]
bfq: fix waker_bfqq inconsistency crash
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
but woken_list_node still being hashed. This would happen when
bfq_init_rq() expects a brand new allocated queue to be returned from
bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
without resetting woken_list_node. Since we can always return oom_bfqq
when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful.
Christoph Böhmwalder [Wed, 9 Nov 2022 13:34:53 +0000 (14:34 +0100)]
drbd: Store op in drbd_peer_request
(Sort of) cherry-picked from the out-of-tree drbd9 branch. Original
commit message by Joel Colledge:
This simplifies drbd_submit_peer_request by removing most of the
arguments. It also makes the treatment of the op better aligned with
that in struct bio.
Determine fault_type dynamically using information which is already
available instead of passing it in as a parameter.
Note: The opf in receive_rs_deallocated was changed from
REQ_OP_WRITE_ZEROES to REQ_OP_DISCARD. This was required in the
out-of-tree module, and does not matter in-tree. The opf is ignored
anyway in drbd_submit_peer_request, since the discard/zero-out is
decided by the EE_TRIM flag.
Philipp Reisner [Wed, 9 Nov 2022 13:34:52 +0000 (14:34 +0100)]
drbd: disable discard support if granularity > max
The discard_granularity describes the minimum unit of a discard.
If that is larger than the maximal discard size, we need to disable
discards completely.
Christoph Böhmwalder [Wed, 9 Nov 2022 13:34:51 +0000 (14:34 +0100)]
drbd: use blk_queue_max_discard_sectors helper
We currently only set q->limits.max_discard_sectors, but that is not
enough. Another field, max_hw_discard_sectors, was introduced in
commit 0034af036554 ("block: make /sys/block/<dev>/queue/discard_max_bytes
writeable").
The difference is that max_discard_sectors can be changed from user
space via sysfs, while max_hw_discard_sectors is the "hardware" upper
limit.
So use this helper, which sets both.
This is also a fixup for commit 998e9cbcd615 ("drbd: cleanup
decide_on_discard_support"): if discards are not supported, that does
not necessarily mean we also want to disable write_zeroes.
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:16 +0000 (11:41 -0600)]
ABI: sysfs-bus-pci: add documentation for p2pmem allocate
Add documentation for the p2pmem/allocate binary file which allows
for allocating p2pmem buffers in userspace for passing to drivers
that support them. (Currently only O_DIRECT to NVMe devices.)
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:15 +0000 (11:41 -0600)]
PCI/P2PDMA: Allow userspace VMA allocations through sysfs
Create a sysfs bin attribute called "allocate" under the existing
"p2pmem" group. The only allowable operation on this file is the mmap()
call.
When mmap() is called on this attribute, the kernel allocates a chunk of
memory from the genalloc and inserts the pages into the VMA. The
dev_pagemap .page_free callback will indicate when these pages are no
longer used and they will be put back into the genalloc.
On device unbind, remove the sysfs file before the memremap_pages are
cleaned up. This ensures unmap_mapping_range() is called on the files
inode and no new mappings can be created.
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:14 +0000 (11:41 -0600)]
block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables the NVMe passthru requests to
use P2PDMA pages.
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:13 +0000 (11:41 -0600)]
block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap based filesystems
and direct to block devices.
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:12 +0000 (11:41 -0600)]
lib/scatterlist: add check when merging zone device pages
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true either if
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.
Factor out the check for page mergability into a pages_are_mergable()
helper and add a check with zone_device_pages_are_mergeable().
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:11 +0000 (11:41 -0600)]
block: add check when merging zone device pages
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true either if
both pages are not zone device pages or both pages are zone device
pages with the same pgmap.
Add a helper to determine if zone device pages are mergeable and use
this helper in page_is_mergeable().
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:09 +0000 (11:41 -0600)]
mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error in try_grab_page() or
try_grab_folio().
The check is safe to do before taking the reference to the page in both
cases seeing the page should be protected by either the appropriate
ptl or mmap_lock; or the gup fast guarantees preventing TLB flushes.
try_grab_folio() has one call site that WARNs on failure and cannot
actually deal with the failure of this function (it seems it will
get into an infinite loop). Expand the comment there to document a
couple more conditions on why it will not fail.
FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set. This is to copy
fsdax until pgmap refcounts are fixed (see the link below for more
information).
Logan Gunthorpe [Fri, 21 Oct 2022 17:41:08 +0000 (11:41 -0600)]
mm: allow multiple error returns in try_grab_page()
In order to add checks for P2PDMA memory into try_grab_page(), expand
the error return from a bool to an int/error code. Update all the
callsites handle change in usage.
Also remove the WARN_ON_ONCE() call at the callsites seeing there
already is a WARN_ON_ONCE() inside the function if it fails.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221021174116.7200-2-logang@deltatee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chao Leng [Tue, 1 Nov 2022 15:00:50 +0000 (16:00 +0100)]
nvme: use blk_mq_[un]quiesce_tagset
All controller namespaces share the same tagset, so we can use this
interface which does the optimal operation for parallel quiesce based on
the tagset type(e.g. blocking tagsets and non-blocking tagsets).
nvme connect_q should not be quiesced when quiesce tagset, so set the
QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.
Currently we use NVME_NS_STOPPED to ensure pairing quiescing and
unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
In addition, we never really quiesce a single namespace. It is a better
choice to move the flag from ns to ctrl.
Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of prep patches] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Chao Leng [Tue, 1 Nov 2022 15:00:49 +0000 (16:00 +0100)]
blk-mq: add tagset quiesce interface
Drivers that have shared tagsets may need to quiesce potentially a lot
of request queues that all share a single tagset (e.g. nvme). Add an
interface to quiesce all the queues on a given tagset. This interface is
useful because it can speedup the quiesce by doing it in parallel.
Because some queues should not need to be quiesced (e.g. the nvme
connect_q) when quiescing the tagset, introduce a
QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to
ski quiescing a particular queue.
Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: simplify for the per-tag_set srcu_struct] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Nov 2022 15:00:48 +0000 (16:00 +0100)]
blk-mq: pass a tagset to blk_mq_wait_quiesce_done
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Nov 2022 15:00:46 +0000 (16:00 +0100)]
blk-mq: skip non-mq queues in blk_mq_quiesce_queue
For submit_bio based queues there is no (S)RCU critical section during
I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done,
so skip doing any synchronization. No non-mq driver should be calling
this, but for now we have core callers that unconditionally call into it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 1 Nov 2022 15:00:43 +0000 (16:00 +0100)]
nvme: split nvme_kill_queues
nvme_kill_queues does two things:
1) mark the gendisk of all namespaces dead
2) unquiesce all I/O queues
These used to be be intertwined due to block layer issues, but aren't
any more. So move the unquiscing of the I/O queues into the callers,
and rename the rest of the function to the now more descriptive
nvme_mark_namespaces_dead.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:42 +0000 (16:00 +0100)]
nvme: don't unquiesce the admin queue in nvme_kill_queues
None of the callers of nvme_kill_queues needs it to unquiesce the
admin queues, as all of them already do it themselves:
1) nvme_reset_work explicit call nvme_start_admin_queue toward the
beginning of the function. The extra call to nvme_start_admin_queue
in nvme_reset_work this won't do anything as
NVME_CTRL_ADMIN_Q_STOPPED will already be cleared.
2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at
the very beginning of the function if the PCIe device was not present,
which is the precondition for the call to nvme_kill_queues.
nvme_dev_disable already calls nvme_start_admin_queue toward the
end of the function when the shutdown flag is set to true, so the
admin queue is already enabled at this point.
3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver,
which will end up in nvme_remove, which calls nvme_dev_disable with
the shutdown flag. This case will call nvme_start_admin_queue a bit
later than before.
4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl
above.
5) nvme_remove_namespaces only calls nvme_kill_queues when the
controller is in the DEAD state. That can only happen in the PCIe
driver, and only from nvme_remove. See item 2) above for the
conditions there.
So it is safe to just remove the call to nvme_start_admin_queue in
nvme_kill_queues without replacement.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:40 +0000 (16:00 +0100)]
nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces
The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthrough_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:39 +0000 (16:00 +0100)]
nvme: don't remove namespaces in nvme_passthru_end
The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more. Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:38 +0000 (16:00 +0100)]
nvme-pci: refactor the tagset handling in nvme_reset_work
The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted. Refactor it with a two high-level
conditionals for first probe vs reset and I/O queues vs no I/O queues
to make the code flow more clear.
Christoph Hellwig [Tue, 1 Nov 2022 15:00:37 +0000 (16:00 +0100)]
block: set the disk capacity to 0 in blk_mark_disk_dead
nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later. Move it to the
common code.
This also removes the comment about the ordering from nvme, as bd_mutex
not only is gone entirely, but also hasn't been used for locking updates
to the disk size long before that, and thus the ordering requirement
documented there doesn't apply any more.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kemeng Shi [Tue, 18 Oct 2022 11:12:40 +0000 (19:12 +0800)]
block: Replace struct rq_depth with unsigned int in struct iolatency_grp
We only need a max queue depth for every iolatency to limit the inflight io
number. Replace struct rq_depth with unsigned int to simplfy "struct
iolatency_grp" and save memory.
Kemeng Shi [Tue, 18 Oct 2022 11:12:39 +0000 (19:12 +0800)]
block: Correct comment for scale_cookie_change
Default queue depth of iolatency_grp is unlimited, so we scale down
quickly(once by half) in scale_cookie_change. Remove the "subtract
1/16th" part which is not the truth and add the actual way we
scale down.
Kemeng Shi [Tue, 18 Oct 2022 11:12:38 +0000 (19:12 +0800)]
block: Remove redundant parent blkcg_gp check in check_scale_change
Function blkcg_iolatency_throttle will make sure blkg->parent is not
NULL before calls check_scale_change. And function check_scale_change
is only called in blkcg_iolatency_throttle.
Christoph Hellwig [Sun, 30 Oct 2022 10:07:14 +0000 (11:07 +0100)]
block: split elevator_switch
Split an elevator_disable helper from elevator_switch for the case where
we want to switch to no scheduler at all. This includes removing the
pointless elevator_switch_mq helper and removing the switch to no
schedule logic from blk_mq_init_sched.
Christoph Hellwig [Sun, 30 Oct 2022 10:07:11 +0000 (11:07 +0100)]
block: cleanup the variable naming in elv_iosched_store
Use eq for the elevator_queue as done elsewhere. This frees e to be used
for the loop iterator instead of the odd __ prefix. In addition rename
elv to cur to make it more clear it is the currently selected elevator.
Christoph Hellwig [Sun, 30 Oct 2022 10:07:09 +0000 (11:07 +0100)]
block: cleanup elevator_get
Do the request_module and repeated lookup in the only caller that cares,
pick a saner name that explains where are actually doing a lookup and
use a sane calling conventions that passes the queue first.
block, bfq: do not idle if only one group is activated
Now that root group is counted into 'num_groups_with_pending_reqs',
'num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario(). Thus change the condition to '> 1'.
On the other hand, this change can enable concurrent sync io if only
one group is activated.