Christoph Hellwig [Wed, 7 May 2025 12:04:39 +0000 (14:04 +0200)]
xfs: simplify xfs_buf_submit_bio
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it and use bio_add_vmalloc
to insulate xfs from the details of adding vmalloc memory to a bio.
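A minimal sketch of the conversion (illustrative, not the actual xfs hunk):

/* Before: open-coded page conversion for a direct-mapped buffer. */
__bio_add_page(bio, virt_to_page(data), len, offset_in_page(data));

/* After: the helper hides the virt_to_page()/offset_in_page() details
 * and, for a single direct-mapped segment, cannot fail.
 */
bio_add_virt_nofail(bio, data, len);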
Christoph Hellwig [Wed, 7 May 2025 12:04:38 +0000 (14:04 +0200)]
dm-integrity: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it, and do the same for the
similar pattern using bio_add_page for adding the first segment after
a bio allocation as that can't fail either.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:36 +0000 (14:04 +0200)]
PM: hibernate: split and simplify hib_submit_io
Split hib_submit_io into a sync and async version. The sync version is
a small wrapper around bdev_rw_virt which implements all the logic to
add a kernel direct mapping range to a bio and synchronously submits it,
while the async version is slightly simplified by using
bio_add_virt_nofail to add the single range.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:35 +0000 (14:04 +0200)]
zonefs: use bdev_rw_virt in zonefs_read_super
Switch zonefs_read_super to allocate the superblock buffer using kmalloc,
which falls back to the page allocator for PAGE_SIZE allocations but
gives us a kernel virtual address, and then use bdev_rw_virt to perform
the synchronous read into it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
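A hedged sketch of the resulting pattern (variable names and error
handling are illustrative, not the exact zonefs code):

buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!buf)
	return -ENOMEM;

/* Synchronously read the superblock into the kernel virtual address;
 * bdev_rw_virt() builds and submits the bio internally.
 */
ret = bdev_rw_virt(sb->s_bdev, 0, buf, PAGE_SIZE, REQ_OP_READ);
if (ret)
	goto out_free;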
Christoph Hellwig [Wed, 7 May 2025 12:04:34 +0000 (14:04 +0200)]
gfs2: use bdev_rw_virt in gfs2_read_super
Switch gfs2_read_super to allocate the superblock buffer using kmalloc,
which falls back to the page allocator for PAGE_SIZE allocations but
gives us a kernel virtual address, and then use bdev_rw_virt to perform
the synchronous read into it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:33 +0000 (14:04 +0200)]
rnbd-srv: use bio_add_virt_nofail
Use the bio_add_virt_nofail helper to add a single kernel virtual address
to a bio, as that can't fail.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:32 +0000 (14:04 +0200)]
bcache: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:31 +0000 (14:04 +0200)]
block: simplify bio_map_kern
Rewrite bio_map_kern using the new bio_add_* helpers and drop the
kerneldoc comment that is superfluous for an internal helper.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:30 +0000 (14:04 +0200)]
block: pass the operation to bio_{map,copy}_kern
That way the bio can be allocated with the right operation already
set and there is no need to pass the separate 'reading' argument.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:29 +0000 (14:04 +0200)]
block: remove the q argument from blk_rq_map_kern
Remove the q argument from blk_rq_map_kern and the internal helpers
called by it as the queue can trivially be derived from the request.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 7 May 2025 12:04:28 +0000 (14:04 +0200)]
block: add bio_add_vmalloc helpers
Add a helper to add a vmalloc region to a bio, abstracting away the
vmalloc addresses from the underlying pages and another one wrapping
it for the simple case where all data fits into a single bio.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
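A rough usage sketch combining bio_add_vmalloc with bio_add_max_vecs
from the next entry (signatures inferred from the commit messages; treat
them as assumptions):

/* Size the bio for the vmalloc buffer, then add the whole range.
 * bio_add_vmalloc() walks the buffer page by page internally, so the
 * caller never sees that vmalloc memory is only virtually contiguous.
 */
bio = bio_alloc(bdev, bio_add_max_vecs(vbuf, len), REQ_OP_WRITE,
		GFP_KERNEL);
if (!bio_add_vmalloc(bio, vbuf, len)) {
	/* The data didn't fit into a single bio. */
	bio_put(bio);
	return -EIO;
}
submit_bio(bio);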
Christoph Hellwig [Wed, 7 May 2025 12:04:27 +0000 (14:04 +0200)]
block: add a bio_add_max_vecs helper
Add a helper to check how many bio_vecs are needed to add a kernel
virtual address range to a bio, accounting for the always contiguous
direct mapping and for vmalloc mappings that usually need a bio_vec
per page-sized chunk.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
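Conceptually the helper computes something like the following sketch
(not the verbatim implementation):

static unsigned int bio_add_max_vecs_sketch(void *vaddr, unsigned int len)
{
	/* The kernel direct mapping is physically contiguous, so a
	 * single bio_vec always suffices; a vmalloc range may need one
	 * bio_vec for each page-sized chunk it touches.
	 */
	if (is_vmalloc_addr(vaddr))
		return DIV_ROUND_UP(offset_in_page(vaddr) + len, PAGE_SIZE);
	return 1;
}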
Christoph Hellwig [Wed, 7 May 2025 12:04:26 +0000 (14:04 +0200)]
block: add a bdev_rw_virt helper
Add a helper to perform synchronous I/O on a kernel direct map range.
Currently this is implemented in various places, usually in not very
efficient ways, so provide a generic helper instead.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
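Roughly what such a helper boils down to (an illustrative sketch
assuming the signature shown; the buffer must come from the kernel
direct mapping, not vmalloc):

int bdev_rw_virt_sketch(struct block_device *bdev, sector_t sector,
			void *data, size_t len, enum req_op op)
{
	struct bio_vec bv;
	struct bio bio;
	int error;

	/* One on-stack bio with a single bio_vec covers any physically
	 * contiguous direct-mapped buffer.
	 */
	bio_init(&bio, bdev, &bv, 1, op);
	bio.bi_iter.bi_sector = sector;
	bio_add_virt_nofail(&bio, data, len);
	error = submit_bio_wait(&bio);
	bio_uninit(&bio);
	return error;
}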
Christoph Hellwig [Wed, 7 May 2025 12:04:25 +0000 (14:04 +0200)]
block: add a bio_add_virt_nofail helper
Add a helper to add a directly mapped kernel virtual address to a
bio so that callers don't have to convert to pages or folios.
For now only the _nofail variant is provided as that is what all the
obvious callers want.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 7 May 2025 09:25:37 +0000 (17:25 +0800)]
block: fix warning on 'make htmldocs'
Fix the following warning when running 'make htmldocs':
WARNING: include/linux/blk-mq.h:532 struct member 'update_nr_hwq_lock' not described in 'blk_mq_tag_set'
Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20250507163220.00141d77@canb.auug.org.au/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250507092537.3009112-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kanchan Joshi [Tue, 6 May 2025 18:48:43 +0000 (00:18 +0530)]
nvme: fix incorrect sizeof
The plid array, head->plids, is meant to store placement IDs, each of
type u16. But its size has been incorrectly calculated, as the size of
the pointer is being used instead of the size of the object it points
to.
Use sizeof(*head->plids) in kcalloc so that we don't allocate extra memory.
Fixes: 38e8397dde63 ("nvme: use fdp streams if write stream is provided") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
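The bug and the fix, sketched (head->plids holds u16 placement IDs per
the commit message; nr_plids stands in for the actual count variable):

/* Before: sizeof(head->plids) is the size of the pointer (8 bytes on
 * 64-bit), so this over-allocates by 4x for u16 elements.
 */
head->plids = kcalloc(nr_plids, sizeof(head->plids), GFP_KERNEL);

/* After: sizeof(*head->plids) is the element size (2 bytes). */
head->plids = kcalloc(nr_plids, sizeof(*head->plids), GFP_KERNEL);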
write_stream_granularity is set to max(info->runs, U32_MAX), which means
that any RUNS value less than 2^32 becomes U32_MAX, and any larger
value is silently truncated to an unsigned int.
Use min() instead to provide the correct semantics, capping RUNS values
at U32_MAX.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer") Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250506175413.1936110-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
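Sketched, with the surrounding names assumed from the commit text:

/* Before: max() forces any RUNS value below 2^32 up to U32_MAX, and
 * anything larger is silently truncated when stored in 32 bits:
 *
 *	write_stream_granularity = max(info->runs, U32_MAX);
 *
 * After: min() caps RUNS at U32_MAX and preserves smaller values.
 */
write_stream_granularity = min(info->runs, U32_MAX);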
Keith Busch [Tue, 6 May 2025 12:17:31 +0000 (17:47 +0530)]
nvme: register fdp parameters with the block layer
Register the device data placement limits if supported. This is just
registering the limits with the block layer. Nothing beyond reporting
these attributes is happening in this patch.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-11-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Tue, 6 May 2025 12:17:27 +0000 (17:47 +0530)]
io_uring: enable per-io write streams
Allow userspace to pass a per-I/O write stream in the SQE:
__u8 write_stream;
The __u8 type matches the size the filesystems and block layer support.
Applications can query the supported values from the block device's
max_write_streams sysfs attribute. Unsupported values are ignored by
file operations that do not support write streams or rejected with an
error by those that support them.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
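A hedged userspace sketch of setting the field on a raw SQE (liburing
used for setup; the stream value is illustrative):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_write(sqe, fd, buf, len, offset);
/* Ask the device to place this write on stream 1; file operations
 * without write stream support ignore or reject the value.
 */
sqe->write_stream = 1;
io_uring_submit(&ring);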
Christoph Hellwig [Tue, 6 May 2025 12:17:23 +0000 (17:47 +0530)]
block: add a bi_write_stream field
Add the ability to pass a write stream for placement control in the bio.
The new field fits in an existing hole, so does not change the size of
the struct.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-3-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 6 May 2025 12:17:22 +0000 (17:47 +0530)]
fs: add a write stream field to the kiocb
Prepare for io_uring passthrough of write streams. The write stream
field in the kiocb structure fits into an existing 2-byte hole, so its
size is not changed.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Johannes Thumshirn [Tue, 6 May 2025 11:27:30 +0000 (13:27 +0200)]
block: only update request sector if needed
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.
At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this
superfluously updates the sector for regular writes to a zoned block
device.
Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criteria is met, update the sector in the
bio upon completion.
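A sketch of the resulting completion-path check (condition reconstructed
from the commit message, not the verbatim diff):

if (req_op(rq) == REQ_OP_ZONE_APPEND ||
    bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
	bio->bi_iter.bi_sector = rq->__sector;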
Ming Lei [Mon, 5 May 2025 14:18:03 +0000 (22:18 +0800)]
block: move wbt_enable_default() out of queue freezing from sched ->exit()
The scheduler's ->exit() is called with the queue frozen and the elevator
lock held, but wbt_enable_default() can't be called with the queue frozen,
otherwise the following lockdep warning is triggered:
Ming Lei [Mon, 5 May 2025 14:18:02 +0000 (22:18 +0800)]
block: move hctx cpuhp add/del out of queue freezing
Move hctx cpuhp add/del out of queue freezing so that the freeze lock is
not connected with the cpuhp locks, avoiding the lockdep warning.
This is safe because neither operation needs the queue to be frozen and a
scheduler switch isn't allowed, for the same reason that hctx debugfs/sysfs
registration was moved out of queue freezing.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-25-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:18:01 +0000 (22:18 +0800)]
block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxs
Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before
the request queue is added to tagset list, so the two won't run concurrently
with blk_mq_update_nr_hw_queues().
Since the two functions are only called from queue initialization or
blk_mq_update_nr_hw_queues(), an elevator switch can't happen.
So remove the ->elevator_lock uses from the two functions.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-24-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Registering/un-registering hctx debugfs/sysfs does not require the queue
to be frozen:
- hctx sysfs attributes' show() are drained when removing the kobject, and
there isn't a store() implementation for hctx sysfs attributes
- debugfs entry read() is drained too when removing the debugfs directory,
and there isn't a write() implementation for hctx debugfs either
- so it is safe to register/unregister hctx sysfs/debugfs without
freezing the queue, because the code paths change nothing and we just
need to keep hctx alive
Ming Lei [Mon, 5 May 2025 14:17:59 +0000 (22:17 +0800)]
block: move elv_register[unregister]_queue out of elevator_lock
Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing,
so we can kill many lockdep warnings.
elv_register[unregister]_queue() is serialized, and just deals with
sysfs/debugfs things, so it need not be done with the queue frozen:
- when it is called from adding disk, elevator switch isn't possible
because ->queue_kobj isn't added yet
- when it is called from deleting disk, disable_elv_switch() is
responsible for preventing new elevator switch and draining old
elevator switch.
- when it is called from blk_mq_update_nr_hw_queues(), adding/removing
disk and elevator switch can't be allowed or in-progress
With this change, the elevator's ->exit() is called before
elv_unregister_queue(), so the user may call into ->show()/store() of the
elevator's sysfs attributes; this issue is covered by adding `ELEVATOR_FLAG_DYING`.
For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock
held by the debugfs code, and hctx->sched_tags is updated with
->elevator_lock held too, so there isn't such an issue.
Ming Lei [Mon, 5 May 2025 14:17:58 +0000 (22:17 +0800)]
block: add new helper for disabling elevator switch when deleting disk
Add new helper disable_elv_switch() and new flag QUEUE_FLAG_NO_ELV_SWITCH
for disabling elevator switch before deleting disk:
- originally the flag QUEUE_FLAG_REGISTERED was added to prevent elevator
switch while removing the disk, but this flag has been used widely for
other purposes, so add one new flag for disabling elevator switch only
- for avoiding deadlock risk, we have to move elevator queue
register/unregister out of the elevator lock and queue freeze, which will be
done in the next patch. However, this adds a small race window between elevator
switch and deleting ->queue_kobj, in which elevator queue register/unregister
could run concurrently. The added helper will be used to avoid that race
in the following patch.
- drain in-progress elevator switch before deleting disk
Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250505141805.2751237-21-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:57 +0000 (22:17 +0800)]
block: fail to show/store elevator sysfs attribute if elevator is dying
Prepare for moving elv_register[unregister]_queue out of elevator_lock
& queue freezing: we may have to call elv_unregister_queue() after the
elevator's ->exit() is called, which leaves a small window for the user to
call into ->show()/store(), and a use-after-free can be caused.
Fail to show/store an elevator sysfs attribute if the elevator is dying, by
adding a new flag, ELEVATOR_FLAG_DYING, which is protected by the
elevator's ->sysfs_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250505141805.2751237-20-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
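A hedged sketch of the guarded show path (structure simplified; the
store side is symmetric):

static ssize_t elv_attr_show_sketch(struct elevator_queue *e,
				    struct elv_fs_entry *entry, char *page)
{
	ssize_t error = -ENODEV;

	mutex_lock(&e->sysfs_lock);
	/* Once the flag is set, the elevator's ->exit() may already have
	 * run, so calling ->show()/->store() would be a use-after-free.
	 */
	if (!test_bit(ELEVATOR_FLAG_DYING, &e->flags))
		error = entry->show(e, page);
	mutex_unlock(&e->sysfs_lock);
	return error;
}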
Ming Lei [Mon, 5 May 2025 14:17:55 +0000 (22:17 +0800)]
block: pass elevator_queue to elv_register_queue & unregister_queue
Pass elevator_queue reference to elv_register_queue() & elv_unregister_queue().
No functional change; prepare for moving the two out of the elevator
lock & queue freezing. Once we need to store the old & new elevator
queues in a `struct elv_change_ctx` instance, the two can co-exist
for a short while, so we have to pass the exact elevator_queue instance
to elv_register_queue & unregister_queue.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-18-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:54 +0000 (22:17 +0800)]
block: unifying elevator change
Elevator change is a well-defined behavior:
- tear down current elevator if it exists
- setup new elevator
It is supposed to cover any case of changing the elevator with a single
internal API, typically the following cases:
- setup default elevator in add_disk()
- switch to none in del_disk()
- reset elevator in blk_mq_update_nr_hw_queues()
- switch elevator in sysfs `store` elevator attribute
This patch uses elevator_change() to cover all above cases:
- every elevator switch is serialized against the others: add_disk/del_disk/
store elevator is serialized already, blk_mq_update_nr_hw_queues() uses
srcu for syncing with the other three cases
- for both add_disk()/del_disk(), queue freeze works in atomic mode
or the queue has already been frozen, so the freeze in elevator_change()
won't add extra delay
- `struct elev_change_ctx` instance holds any info for changing elevator
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-17-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:53 +0000 (22:17 +0800)]
block: add `struct elv_change_ctx` for unifying elevator change
Add `struct elv_change_ctx` and prepare for unifying elevator change via
elevator_change(). This way, any input & output parameters can
be provided & observed in the top-level helper.
This helps to move kobject add/delete & debugfs register/unregister
out of ->elevator_lock & queue freezing.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-16-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:52 +0000 (22:17 +0800)]
block: move queue freezing & elevator_lock into elevator_change()
Move queue freezing & elevator_lock into elevator_change(), and prepare
for using elevator_change() to set up & tear down the default elevator
too.
Also add lockdep_assert_held() in __elevator_change() because either
read or write lock is required for changing elevator.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-15-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:51 +0000 (22:17 +0800)]
block: simplify elevator reattachment for updating nr_hw_queues
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and the elevator data
depends on it, so the elevator has to be reattached; call elevator_switch()
to force the attachment.
Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to
reattach the elevator, since an elevator switch isn't likely while
blk_mq_update_nr_hw_queues() is running. This removes the current
switch-to-none and switch-back code.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:50 +0000 (22:17 +0800)]
block: move blk_queue_registered() check into elv_iosched_store()
Move blk_queue_registered() check into elv_iosched_store() and prepare
for using elevator_change() to cover any kind of elevator change in
adding/deleting disk and updating nr_hw_queues.
Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-13-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:46 +0000 (22:17 +0800)]
block: prevent adding/deleting disk during updating nr_hw_queues
Both the adding and deleting disk code paths are readers of `nr_hw_queues`,
so we can't allow them in progress while updating nr_hw_queues; a kernel
panic and KASAN warnings have been reported in [1].
Prevent adding/deleting disk during the update of nr_hw_queues by adding a
rw_semaphore to the tagset: the write lock is grabbed in
blk_mq_update_nr_hw_queues(), and the read lock is acquired when
adding/deleting a disk.
Also mark a GFP_NOIO allocation scope for adding/deleting disk because
blk_mq_update_nr_hw_queues() is part of some drivers' error handlers.
This avoids a lot of trouble.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
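A sketch of the locking scheme (the member name update_nr_hwq_lock comes
from the htmldocs fix above; the surrounding code is illustrative):

/* Writer: updating nr_hw_queues excludes adding/deleting disks. */
down_write(&set->update_nr_hwq_lock);
__blk_mq_update_nr_hw_queues(set, nr_hw_queues);
up_write(&set->update_nr_hwq_lock);

/* Readers: add_disk()/del_gendisk() take the lock shared, inside a
 * GFP_NOIO allocation scope because the writer may be part of a
 * driver's error handler.
 */
down_read(&set->update_nr_hwq_lock);
memflags = memalloc_noio_save();
/* ... add or delete the disk ... */
memalloc_noio_restore(memflags);
up_read(&set->update_nr_hwq_lock);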
Ming Lei [Mon, 5 May 2025 14:17:44 +0000 (22:17 +0800)]
block: move sched debugfs register into elevator_register_queue
sched debugfs shares the same lifetime with the scheduler's kobject, and
the same lock (the elevator lock), so move sched debugfs register/unregister
into elevator_register_queue() and elevator_unregister_queue().
Then blk_mq_debugfs_register() is no longer needed for registering sched
debugfs.
Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:43 +0000 (22:17 +0800)]
block: add two helpers for registering/un-registering sched debugfs
Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up
sched init/exit code a bit.
The order of registering & unregistering debugfs for sched & sched_hctx is
changed a bit, but it is safe because sched & sched_hctx are guaranteed to
be ready when exported via debugfs.
Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:42 +0000 (22:17 +0800)]
block: use q->elevator with ->elevator_lock held in elv_iosched_show()
Use q->elevator with ->elevator_lock held in elv_iosched_show(), since
the local cached elevator reference may become stale after getting
->elevator_lock.
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 5 May 2025 14:17:40 +0000 (22:17 +0800)]
block: move ELEVATOR_FLAG_DISABLE_WBT to a request queue flag
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is
in use. The flag is set in BFQ's init(), and cleared in BFQ's exit().
Make it a request queue flag so that we can avoid dealing with the elevator
switch race. It also isn't graceful to check a scheduler flag in
wbt_enable_default().
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every ublk I/O command except UBLK_IO_FETCH_REQ checks that the ublk_io
has UBLK_IO_FLAG_OWNED_BY_SRV set. Consolidate the separate checks into
a single one in __ublk_ch_uring_cmd(), analogous to those for
UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_NEED_GET_DATA.
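A sketch of the consolidated check (placement per the commit message;
the exact code may differ):

/* In __ublk_ch_uring_cmd(): every opcode except FETCH_REQ requires
 * that the server currently owns this I/O.
 */
if (_IOC_NR(cmd_op) != UBLK_IO_FETCH_REQ &&
    !(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
	goto out;	/* fails with -EINVAL */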
Zizhi Wo [Thu, 17 Apr 2025 13:20:54 +0000 (21:20 +0800)]
blk-throttle: Add an additional overflow check to the calls of calculate_bytes/io_allowed
Now the tg->[bytes/io]_disp type is signed, and calculate_bytes/io_allowed
return type is unsigned. Even if the bps/iops limit is not set to max, the
return value of the function may still exceed INT_MAX or LLONG_MAX, which
can cause overflow in outer variables. In such cases, we can add additional
checks accordingly.
And in throtl_trim_slice(), if the BPS/IOPS limit is set to max, there's
no need to call calculate_bytes/io_allowed(). Introduce the helper
functions throtl_trim_bps/iops to simplify the process. For cases where
the calculated trim value exceeds INT_MAX (causing an overflow), we reset
tg->[bytes/io]_disp to zero and return the original tg->[bytes/io]_disp,
because that is the size actually trimmed.
Zizhi Wo [Thu, 17 Apr 2025 13:20:53 +0000 (21:20 +0800)]
blk-throttle: Delete unnecessary carryover-related fields from throtl_grp
We no longer need carryover_[bytes/ios] in tg, so remove them. The
related comments about carryover in tg are merged into those for
[bytes/io]_disp, and other related comments are modified.
Zizhi Wo [Thu, 17 Apr 2025 13:20:52 +0000 (21:20 +0800)]
blk-throttle: Fix wrong tg->[bytes/io]_disp update in __tg_update_carryover()
In commit 6cc477c36875 ("blk-throttle: carry over directly"), the carryover
bytes/ios were carried over to [bytes/io]_disp. However, the update mechanism
has some issues.
In __tg_update_carryover(), we calculate "bytes" and "ios" to represent the
carryover, but the computation when updating [bytes/io]_disp is incorrect.
And if sq->nr_queued is empty, we may not update tg->[bytes/io]_disp to
0 in tg_update_carryover(); we should set it to 0 in the non-carryover case.
This patch fixes the issue.
Fixes: 6cc477c36875 ("blk-throttle: carry over directly") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250417132054.2866409-2-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Building kublk currently fails (with a "could not find linux/ublk_cmd.h"
error message) if kernel headers are not installed in a system-global
location (i.e. somewhere in the compiler's default include search path).
This failure is unnecessary, as make kselftest installs kernel headers
in the build tree - kublk's build just isn't looking for them properly.
There is an include path in kublk's CFLAGS which is probably intended to
find the kernel headers installed in the build tree; fix it so that it
can actually find them.
This introduces some macro redefinition issues between glibc-provided
headers and kernel headers; fix those by eliminating one include in
kublk.
selftests: ublk: kublk: build with -Werror iff WERROR!=0
Compiler warnings can catch bugs at compile time; thus, heeding them is
usually a good idea. Turn warnings into errors by default for the kublk
build so that anyone making changes is forced to heed them. Compiler
warnings can also sometimes produce annoying false positives, so provide
a flag WERROR that the developer can use as follows to have the build
and selftest runs go through even if there are warnings:
Christoph Hellwig [Mon, 5 May 2025 08:11:22 +0000 (10:11 +0200)]
scsi: make ppa depend on !HIGHMEM
This is one of the last drivers depending on the block layer bounce
buffering code. Restrict it to run on non-highmem configs so that the
bounce buffering code can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:21 +0000 (10:11 +0200)]
scsi: make imm depend on !HIGHMEM
This is one of the last drivers depending on the block layer bounce
buffering code. Restrict it to run on non-highmem configs so that the
bounce buffering code can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:20 +0000 (10:11 +0200)]
scsi: make aha152x depend on !HIGHMEM
This is one of the last drivers depending on the block layer bounce
buffering code. Restrict it to run on non-highmem configs so that the
bounce buffering code can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
A ublk_io is converted to a request in several places in the I/O path by
using blk_mq_tag_to_rq() to look up the (qid, tag) on the ublk device's
tagset. This involves a bunch of dereferences and a tag bounds check.
To make this conversion cheaper, store the request pointer in ublk_io.
Overlap this storage with the io_uring_cmd pointer. This is safe because
the io_uring_cmd pointer is only valid if UBLK_IO_FLAG_ACTIVE is set on
the ublk_io, the request pointer is valid if UBLK_IO_FLAG_OWNED_BY_SRV is
set, and these flags are mutually exclusive.
ublk: check UBLK_IO_FLAG_OWNED_BY_SRV in ublk_abort_queue()
ublk_abort_queue() currently checks whether the UBLK_IO_FLAG_ACTIVE flag
is cleared to tell whether to abort each ublk_io in the queue. But it's
possible for a ublk_io to not be ACTIVE but also not have a request in
flight, such as when no fetch request has yet been submitted for a tag
or when a fetch request is cancelled. So ublk_abort_queue() must
additionally check for an inflight request.
Simplify this code by checking for UBLK_IO_FLAG_OWNED_BY_SRV instead,
which indicates precisely whether a request is currently inflight.
ublk: don't call ublk_dispatch_req() for NEED_GET_DATA
ublk_dispatch_req() currently handles 3 different cases: incoming ublk
requests that don't need to wait for a data buffer, incoming requests
that do need to wait for a buffer, and resuming those requests once the
buffer is provided. But the call site that provides a data buffer
(UBLK_IO_NEED_GET_DATA) is separate from those for incoming requests.
So simplify the function by splitting the UBLK_IO_NEED_GET_DATA case
into its own function ublk_get_data(). This avoids several redundant
checks in the UBLK_IO_NEED_GET_DATA case, and streamlines the incoming
request cases.
Don't call ublk_fill_io_cmd() for UBLK_IO_NEED_GET_DATA, as it's no
longer necessary to set io->cmd or the UBLK_IO_FLAG_ACTIVE flag for
ublk_dispatch_req().
Since UBLK_IO_NEED_GET_DATA no longer relies on ublk_dispatch_req()
calling io_uring_cmd_done(), return the UBLK_IO_RES_OK status directly
from the ->uring_cmd() handler. If ublk_start_io() fails, don't complete
the UBLK_IO_NEED_GET_DATA command, matching the existing behavior.
ublk: don't log uring_cmd cmd_op in ublk_dispatch_req()
cmd_op is either UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ,
or UBLK_U_IO_NEED_GET_DATA. Which one isn't particularly interesting
and is already recorded by the log line in __ublk_ch_uring_cmd().
Move the logic for the UBLK_IO_COMMIT_AND_FETCH_REQ opcode into its own
function. This also allows us to mark ublk_queue pointers as const for
that operation, which can help prevent data races since we may allow
concurrent operation on one ublk_queue in the future. Also open code
ublk_commit_completion in ublk_commit_and_fetch to reduce the number of
parameters/avoid a redundant lookup.
[Restore __ublk_ch_uring_cmd() req variable used in commit d6aa0c178bf8
("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA")]
Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250430225234.2676781-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: avoid hctx spinlock for plug with multiple queues
blk_mq_flush_plug_list() has a fast path if all requests in the plug
are destined for the same request_queue. It calls ->queue_rqs() with the
whole batch of requests, falling back on ->queue_rq() for any requests
not handled by ->queue_rqs(). However, if the requests are destined for
multiple queues, blk_mq_flush_plug_list() has a slow path that calls
blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx.
Each queue's requests are inserted into the hctx's dispatch list under a
spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the
dispatch list (taking the spinlock again), and finally
blk_mq_dispatch_rq_list() calls ->queue_rq() on each request.
Acquiring the hctx spinlock twice and calling ->queue_rq() instead of
->queue_rqs() makes the slow path significantly more expensive. Thus,
batching more requests into a single plug (e.g. io_uring_enter syscall)
can counterintuitively hurt performance by causing the plug to span
multiple queues. We have observed 2-3% of CPU time spent acquiring the
hctx spinlock alone on workloads issuing requests to multiple NVMe
devices in the same io_uring SQE batches.
Add a medium path in blk_mq_flush_plug_list() for plugs that don't have
elevators or come from a schedule, but do span multiple queues. Filter
the requests by queue and call ->queue_rqs()/->queue_rq() on the list of
requests destined for each request_queue.
With this change, we no longer see any CPU time spent in _raw_spin_lock
from blk_mq_flush_plug_list and throughput increases accordingly.
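A hedged sketch of the medium path (list primitives and the helper from
the following entries; details simplified):

while (!rq_list_empty(&plug->mq_list)) {
	struct request_queue *this_q = rq_list_peek(&plug->mq_list)->q;
	struct rq_list matched = {}, rest = {};
	unsigned int depth = 0;
	struct request *rq;

	/* Peel off every request destined for this_q. */
	while ((rq = rq_list_pop(&plug->mq_list))) {
		if (rq->q == this_q) {
			rq_list_add_tail(&matched, rq);
			depth++;
		} else {
			rq_list_add_tail(&rest, rq);
		}
	}
	plug->mq_list = rest;

	/* ->queue_rqs() with a ->queue_rq() fallback, without taking
	 * the hctx dispatch-list spinlock.
	 */
	blk_mq_dispatch_queue_requests(&matched, depth);
}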
block: factor out blk_mq_dispatch_queue_requests() helper
Factor out the logic from blk_mq_flush_plug_list() that calls
->queue_rqs() with a fallback to ->queue_rq() into a helper function
blk_mq_dispatch_queue_requests(). This is in preparation for using this
code with other lists of requests.
block: take rq_list instead of plug in dispatch functions
blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and
blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its
mq_list. Pass the struct rq_list * instead in preparation for calling
them with other lists of requests.
Drop "plug" from the function names as they are no longer plug-specific.
Damien Le Moal [Mon, 7 Apr 2025 07:52:22 +0000 (16:52 +0900)]
Documentation: Document the new zoned loop block device driver
Introduce the zoned_loop.rst documentation file under
admin-guide/blockdev to document the zoned loop block device driver.
An overview of the driver is provided and its usage to create and delete
zoned devices is described.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250407075222.170336-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Mon, 7 Apr 2025 07:52:21 +0000 (16:52 +0900)]
block: new zoned loop block device driver
The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of allowing
emulating large zoned devices without requiring the same amount of
memory as the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.
This initial implementation is simple and does not support zone resource
limits. That is, a zoned loop block device's limits for the maximum number
of open zones and the maximum number of active zones are always 0.
This driver can be either compiled in-kernel or as a module, named
"zloop". Compilation of this driver depends on the block layer support
for zoned block device (CONFIG_BLK_DEV_ZONED must be set).
Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:
$ echo "add [options]" > /dev/zloop-control
The options available for the "add" operation can be listed by reading
the zloop-control device file:
The options available allow controlling the zoned device's total
capacity, zone size, zone capacity of sequential zones, total number
of conventional zones, base directory for the zone backing files, number
of I/O queues, and the maximum queue depth of the I/O queues.
Deleting a device is done using the "remove" command:
$ echo "remove id=0" > /dev/zloop-control
This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.
Co-developed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250407075222.170336-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Fri, 25 Apr 2025 09:34:34 +0000 (11:34 +0200)]
nvmet-auth: always free derived key data
After calling nvme_auth_derive_tls_psk() we need to free the resulting
psk data, as either TLS is disabled (and we don't need the data anyway)
or the psk data is copied into the resulting key (and can be freed, too).
Fixes: fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation") Reported-by: Yi Zhang <yi.zhang@redhat.com> Suggested-by: Maurizio Lombardi <mlombard@bsdbackstore.eu> Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Wed, 23 Apr 2025 06:06:21 +0000 (16:06 +1000)]
nvmet-tcp: don't restore null sk_state_change
queue->state_change is set as part of nvmet_tcp_set_queue_sock(), but if
the TCP connection isn't established when nvmet_tcp_set_queue_sock() is
called then queue->state_change isn't set and sock->sk->sk_state_change
isn't replaced.
As such we don't need to restore sock->sk->sk_state_change if
queue->state_change is NULL.
This avoids NULL pointer dereferences such as this:
Alistair Francis [Tue, 29 Apr 2025 22:23:47 +0000 (08:23 +1000)]
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TARGET_TCP_TLS is enabled. Without this the code compiles,
but does not actually work unless something else enables CONFIG_TLS.
Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall") Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Tue, 29 Apr 2025 22:40:25 +0000 (08:40 +1000)]
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does
not actually work unless something else enables CONFIG_TLS.
Fixes: be8e82caa68 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Michael Liang [Tue, 29 Apr 2025 16:42:01 +0000 (10:42 -0600)]
nvme-tcp: fix premature queue removal and I/O failover
This patch addresses a data corruption issue observed in nvme-tcp during
testing.
In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.
However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC’s TX queue). From the
user-space app's perspective, the buffer associated with the I/O is
considered completed since they're acked on the different path and may
be reused for new I/O requests.
Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.
We can reproduce this data corruption by injecting delay on one path and
triggering an I/O timeout.
To prevent this issue, this change ensures that all inflight
transmissions are fully completed from the host's perspective before
returning from queue stop. To handle concurrent I/O timeouts from multiple
namespaces under the same controller, always wait in queue stop
regardless of queue's state.
This aligns with the behavior of queue stopping in other NVMe fabric
transports.
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Michael Liang <mliang@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user
reports and hardware analysis:
- NVME_QUIRK_NO_DEEPEST_PS:
liaozw reported the problem to me and worked around it with
nvme_core.default_ps_max_latency_us=0, so add the quirk.
I also found some reports in the following link.
- NVME_QUIRK_BROKEN_MSI:
after getting the lspci output from Jack Rio, I think that the disk
also has the broken-MSI issue described in commit d5887dc6b6c0
("nvme-pci: Add quirk for broken MSIs"), as Sean said in the link,
which matches the MSI Count=1/32 and MSI-X Count=17.
Log:
lspci -nn | grep -i memory
03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01)
lspci -v -d 15b7:5009
03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD
Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10
Memory at fe800000 (64-bit, non-prefetchable) [size=16K]
Memory at fe804000 (64-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 3
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [c0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [300] Secondary PCI Express
Capabilities: [900] L1 PM Substates
Kernel driver in use: nvme
dmesg | grep nvme
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.542430] nvme nvme0: pci function 0000:03:00.0
[ 0.560426] nvme nvme0: allocated 32 MiB host memory buffer.
[ 0.562491] nvme nvme0: 16/0/0 default/read/poll queues
[ 0.567764] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
[ 6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none.
[ 6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none.
[ 7.125419] Adding 16777212k swap on /dev/nvme0n1p8. Priority:-2 extents:1 across:16777212k SS
[ 7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 8.036932] nvme nvme0: using unchecked data buffer
[ 8.096023] block nvme0n1: No UUID available providing old NGUID
Keith Busch [Thu, 24 Apr 2025 17:18:01 +0000 (10:18 -0700)]
nvme-pci: fix queue unquiesce check on slot_reset
A zero return means the reset was successfully scheduled. We don't want
to unquiesce the queues while the reset_work is pending, as that will
just flush out requeued requests to a failed completion.
Fixes: 71a5bb153be104 ("nvme: ensure disabling pairs with unquiesce") Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Tue, 29 Apr 2025 02:29:39 +0000 (10:29 +0800)]
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
__ublk_check_and_get_req() is only called from ublk_check_and_get_req()
and ublk_register_io_buf(), and the same check is already covered at the
two call sites.
So remove the check from __ublk_check_and_get_req().
Ming Lei [Tue, 29 Apr 2025 02:29:38 +0000 (10:29 +0800)]
ublk: enhance check for register/unregister io buffer command
The simple check of UBLK_IO_FLAG_OWNED_BY_SRV can easily avoid an incorrect
io buffer register/unregister, so check it before starting to
register/unregister the io buffer.
Also only allow io buffer register/unregister uring_cmd in case of
UBLK_F_SUPPORT_ZERO_COPY.
Also mark the 'ublk_queue *' argument of ublk_register_io_buf as const.