Christoph Hellwig [Wed, 13 Nov 2024 08:45:36 +0000 (09:45 +0100)]
btrfs: validate queue limits
Call blk_validate_limits on the queue limits used for zone append
splitting so that calculated values get filled in and any stacking
conflicts get cought.
Without this there isn't a max_zone_append_sectors limits as of commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors").
Fixes: 559218d43ec9 ("block: pre-calculate max_zone_append_sectors") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 13 Nov 2024 08:45:35 +0000 (09:45 +0100)]
block: export blk_validate_limits
While block drivers do the validation as part of committing them to the
queue, users that use the limit outside of a block device context have
to validate the limits and fill in the calculated values as well.
So far btrfs is the only user of queue limits without a block device,
and it has gotten away with that more or less by accident. But with
commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors")
this became fatal for setups that have small max zone append size,
as it won't be limited now.
Export blk_validate_limits so that it can be called directly from btrfs.
Jens Axboe [Wed, 13 Nov 2024 17:43:11 +0000 (10:43 -0700)]
Merge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/block
Pull NVMe updates from Keith:
"nvme updates for Linux 6.13
- Use uring_cmd helper (Pavel)
- Host Memory Buffer allocation enhancements (Christoph)
- Target persistent reservation support (Guixin)
- Persistent reservation tracing (Guixen)
- NVMe 2.1 specification support (Keith)
- Rotational Meta Support (Matias, Wang, Keith)
- Volatile cache detection enhancment (Guixen)"
* tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme: (22 commits)
nvmet: add tracing of reservation commands
nvme: parse reservation commands's action and rtype to string
nvmet: report ns's vwc not present
nvme: check ns's volatile write cache not present
nvme: add rotational support
nvme: use command set independent id ns if available
nvmet: support for csi identify ns
nvmet: implement rotational media information log
nvmet: implement endurance groups
nvmet: declare 2.1 version compliance
nvmet: implement crto property
nvmet: implement supported features log
nvmet: implement supported log pages
nvmet: implement active command set ns list
nvmet: implement id ns for nvm command set
nvmet: support reservation feature
nvme: add reservation command's defines
nvme-core: remove repeated wq flags
nvmet: make nvmet_wq visible in sysfs
nvme-pci: use dma_alloc_noncontigous if possible
...
Guixin Liu [Mon, 14 Oct 2024 10:14:58 +0000 (18:14 +0800)]
nvmet: add tracing of reservation commands
Add tracing of reservation commands, including register, acquire,
release and report, and also parse the action and rtype to string
to make the trace log more human-readable.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Guixin Liu [Wed, 13 Nov 2024 10:18:39 +0000 (18:18 +0800)]
nvmet: report ns's vwc not present
Currently, we report that controller has vwc even though the ns may
not have vwc. Report ns's vwc not present when not buffered_io or
backdev doesn't have vwc.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Tue, 12 Nov 2024 17:00:39 +0000 (18:00 +0100)]
block: remove the ioprio field from struct request
The request ioprio is only initialized from the first attached bio,
so requests without a bio already never set it. Directly use the
bio field instead.
Guixin Liu [Mon, 4 Nov 2024 08:55:00 +0000 (16:55 +0800)]
nvme: check ns's volatile write cache not present
When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE
flag should not be set when registering the block device, regardless
of whether the controller supports VWC.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Wang Yugui [Thu, 10 Oct 2024 12:39:50 +0000 (14:39 +0200)]
nvme: add rotational support
Rotational devices, such as hard-drives, can be detected using
the rotational bit in the namespace independent identify namespace
data structure. Make the bit visible to the block layer through the
rotational queue setting.
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Matias Bjørling [Tue, 8 Oct 2024 14:55:02 +0000 (16:55 +0200)]
nvme: use command set independent id ns if available
The NVMe 2.0 specification adds an independent identify namespace
data structure that contains generic attributes that apply to all
namespace types. Some attributes carry over from the NVM command set
identify namespace data structure, and others are new.
Currently, the data structure only considered when CRIMS is enabled or
when the namespace type is key-value.
However, the independent namespace data structure is mandatory for
devices that implement features from the 2.0+ specification. Therefore,
we can check this data structure first. If unavailable, retrieve the
generic attributes from the NVM command set identify namespace data
structure.
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Keith Busch [Fri, 1 Nov 2024 21:46:01 +0000 (14:46 -0700)]
nvmet: implement endurance groups
Most of the returned information is just stubbed data. The target must
support these in order to report rotational media. Since this driver
doesn't know any better, each namespace is its own endurance group with
the engid value matching the nsid.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Guixin Liu [Wed, 6 Nov 2024 07:34:46 +0000 (15:34 +0800)]
nvmet: support reservation feature
This patch implements the reservation feature, including:
1. reservation register(register, unregister and replace).
2. reservation acquire(acquire, preempt, preempt and abort).
3. reservation release(release and clear).
4. reservation report.
5. set feature and get feature of reservation notify mask.
6. get log page of reservation event.
Not supported:
1. persistent reservation through power loss.
Test cases:
Use nvme-cli and fio to test all implemented sub features:
1. use nvme resv-register to register host a registrant or
unregister or replace a new key.
2. use nvme resv-acquire to set host to the holder, and use fio
to send read and write io in all reservation type. And also
test preempt and "preempt and abort".
3. use nvme resv-report to show all registrants and reservation
status.
4. use nvme resv-release to release all registrants.
5. use nvme get-log to get events generated by the preceding
operations.
In addition, make reservation configurable, one can set ns to
support reservation before enable ns. The default of resv_enable
is false.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Fri, 8 Nov 2024 15:46:51 +0000 (16:46 +0100)]
block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Christoph Hellwig [Mon, 4 Nov 2024 06:26:31 +0000 (07:26 +0100)]
block: lift bio_is_zone_append to bio.h
Make bio_is_zone_append globally available, because file systems need
to use to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 4 Nov 2024 06:26:29 +0000 (07:26 +0100)]
block: take chunk_sectors into account in bio_split_write_zeroes
For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors. For other uses like the
internally RAIDed NVMe devices it is probably at least useful.
Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes. Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.
Fixes: 885fa13f6559 ("block: implement splitting of REQ_OP_WRITE_ZEROES bios") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241104062647.91160-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
John Garry [Mon, 11 Nov 2024 11:21:45 +0000 (11:21 +0000)]
block: Rework bio_split() return value
Instead of returning an inconclusive value of NULL for an error in calling
bio_split(), return a ERR_PTR() always.
Also remove the BUG_ON() calls, and WARN_ON_ONCE() instead. Indeed, since
almost all callers don't check the return code from bio_split(), we'll
crash anyway (for those failures).
Fix up the only user which checks bio_split() return code today (directly
or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache
code does check the return code in cached_dev_cache_miss() ->
bio_next_split() -> bio_split(), but only to see if there was a split, so
there would be no change in behaviour here (when returning a ERR_PTR()).
and UBLK_MAX_QUEUE_DEPTH is 4096 and part of UAPI, so 'max_cmd_buf_size'
is always page aligned in 4K page size kernel. However, it isn't true in
64K page size kernel.
Fixes the issue by always rounding up 'max_cmd_buf_size' with PAGE_SIZE.
In case of an early failure in dasd_init, dasd_proc_init is never
called and /proc/dasd* files are never created. That can happen, for
example, if an incompatible or incorrect argument is provided to the
dasd_mod.dasd= kernel parameter.
However, the attempted removal of /proc/dasd* files causes 8 warnings
and backtraces in this case. 4 on the error path within dasd_init and
4 when the dasd module is unloaded. Notice the "removing permanent
/proc entry 'devices'" message that is caused by the dasd_proc_exit
function trying to remove /proc/devices instead of /proc/dasd/devices
since dasd_proc_root_entry is NULL and /proc/devices is indeed
permanent. Example:
------------[ cut here ]------------
removing permanent /proc entry 'devices'
WARNING: CPU: 6 PID: 557 at fs/proc/generic.c:701 remove_proc_entry+0x22e/0x240
While the cause is a user failure, the dasd module should handle the
situation more gracefully. One of the simplest solutions is to make
removal of the /proc/dasd* entries idempotent.
Jens Axboe [Sun, 10 Nov 2024 03:06:13 +0000 (20:06 -0700)]
Merge tag 'md-6.13-20241107' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.13/block
Pull raid5 fix from Song.
* tag 'md-6.13-20241107' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
MAINTAINERS: Make Yu Kuai co-maintainer of md/raid subsystem
md/raid5: Wait sync io to finish before changing group cnt
Li Wang [Sat, 9 Nov 2024 02:27:44 +0000 (10:27 +0800)]
loop: fix type of block size
PAGE_SIZE may be 64K, and the max block size can be PAGE_SIZE, so any
variable for holding block size can't be defined as 'unsigned short'.
Unfortunately commit 473516b36193 ("loop: use the atomic queue limits
update API") passes 'bsize' with type of 'unsigned short' to
loop_reconfigure_limits(), and causes LTP/ioctl_loop06 test failure:
12 ioctl_loop06.c:76: TINFO: Using LOOP_SET_BLOCK_SIZE with arg > PAGE_SIZE
13 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
...
18 ioctl_loop06.c:76: TINFO: Using LOOP_CONFIGURE with block_size > PAGE_SIZE
19 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
Fixes the issue by defining 'block size' variable with 'unsigned int', which is
aligned with block layer's definition.
(improve commit log & add fixes tag)
Fixes: 473516b36193 ("loop: use the atomic queue limits update API") Cc: John Garry <john.g.garry@oracle.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Li Wang <liwang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241109022744.1126003-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Xiao Ni [Wed, 6 Nov 2024 09:51:24 +0000 (17:51 +0800)]
md/raid5: Wait sync io to finish before changing group cnt
One customer reports a bug: raid5 is hung when changing thread cnt
while resync is running. The stripes are all in conf->handle_list
and new threads can't handle them.
Commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()") removes
pers->quiesce from mddev_suspend/resume. Before this patch, mddev_suspend
needs to wait for all ios including sync io to finish. Now it's used
to only wait normal io.
Fix this by calling raid5_quiesce from raid5_store_group_thread_cnt
directly to wait all sync requests to finish before changing the group
cnt.
Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()") Cc: stable@vger.kernel.org Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241106095124.74577-1-xni@redhat.com Signed-off-by: Song Liu <song@kernel.org>
Ming Lei [Thu, 31 Oct 2024 13:37:20 +0000 (21:37 +0800)]
block: don't verify IO lock for freeze/unfreeze in elevator_init_mq()
elevator_init_mq() is only called at the entry of add_disk_fwnode() when
disk IO isn't allowed yet.
So not verify io lock(q->io_lockdep_map) for freeze & unfreeze in
elevator_init_mq().
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Lai Yi <yi1.lai@linux.intel.com> Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Thu, 31 Oct 2024 13:37:19 +0000 (21:37 +0800)]
block: always verify unfreeze lock on the owner task
commit f1be1788a32e ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done the outmost freeze and
unfreeze. This way is actually not correct because q->mq_freeze_depth
still may drop to zero on other task instead of the freeze owner task.
Fix this issue by always verifying the last unfreeze lock on the owner
task context, and make sure both the outmost freeze & unfreeze are
verified in the current task.
Damien Le Moal [Thu, 7 Nov 2024 06:43:00 +0000 (15:43 +0900)]
block: Add a public bdev_zone_is_seq() helper
Turn the private disk_zone_is_conv() function in blk-zoned.c into a
public and documented bdev_zone_is_seq() helper with the inverse
polarity of the original function, also adding a check for non-zoned
devices so that all file systems can use the helper, even with a regular
block device.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 7 Nov 2024 06:42:59 +0000 (15:42 +0900)]
block: RCU protect disk->conv_zones_bitmap
Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.
disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.
disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
zhangguopeng [Thu, 7 Nov 2024 10:42:58 +0000 (18:42 +0800)]
block: Replace sprintf() with sysfs_emit()
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
Guixin Liu [Wed, 6 Nov 2024 07:34:45 +0000 (15:34 +0800)]
nvme: add reservation command's defines
This is a preparation patch for NVMeOF target reservation
commands implantation.
Add the defines of reservation command, such as reservation log
and sub operations.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Tested-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Jens Axboe [Wed, 6 Nov 2024 14:55:19 +0000 (07:55 -0700)]
Merge tag 'md-6.13-20241105' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.13/block
Pull MD changes from Song:
"1. Enhance handling of faulty and blocked devices, by Yu Kuai.
2. raid5-ppl atomic improvement, by Uros Bizjak.
3. md-bitmap fix, by Yuan Can."
* tag 'md-6.13-20241105' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/md-bitmap: Add missing destroy_work_on_stack()
md/raid5: don't set Faulty rdev for blocked_rdev
md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()
md/raid1: don't wait for Faulty rdev in wait_blocked_rdev()
md/raid1: factor out helper to handle blocked rdev from raid1_write_request()
md: don't record new badblocks for faulty rdev
md: don't wait faulty rdev in md_wait_for_blocked_rdev()
md: add a new helper rdev_blocked()
md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit()
Philipp Stanner [Wed, 6 Nov 2024 14:52:50 +0000 (15:52 +0100)]
mtip32xx: Replace deprecated PCI functions
pcim_iomap_table() and pcim_request_regions() have been deprecated in
commit e354bb84a4c1 ("PCI: Deprecate pcim_iomap_table(),
pcim_iomap_regions_request_all()") and commit d140f80f60358 ("PCI:
Deprecate pcim_iomap_regions() in favor of pcim_iomap_region()"),
respectively.
Yuan Can [Tue, 5 Nov 2024 13:01:05 +0000 (21:01 +0800)]
md/md-bitmap: Add missing destroy_work_on_stack()
This commit add missed destroy_work_on_stack() operations for
unplug_work.work in bitmap_unplug_async().
Fixes: a022325ab970 ("md/md-bitmap: add a new helper to unplug bitmap asynchrously") Cc: stable@vger.kernel.org Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241105130105.127336-1-yuancan@huawei.com Signed-off-by: Song Liu <song@kernel.org>
Yu Kuai [Thu, 31 Oct 2024 03:31:14 +0000 (11:31 +0800)]
md/raid5: don't set Faulty rdev for blocked_rdev
Faulty rdev should never be accessed anymore, hence there is no point to
wait for bad block to be acknowledged in this case while handling write
request.
Yu Kuai [Thu, 31 Oct 2024 03:31:13 +0000 (11:31 +0800)]
md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()
Faulty rdev should never be accessed anymore, hence there is no point to
wait for bad block to be acknowledged in this case while handling write
request.
Yu Kuai [Thu, 31 Oct 2024 03:31:12 +0000 (11:31 +0800)]
md/raid1: don't wait for Faulty rdev in wait_blocked_rdev()
Faulty rdev should never be accessed anymore, hence there is no point to
wait for bad block to be acknowledged in this case while handling write
request.
Yu Kuai [Thu, 31 Oct 2024 03:31:11 +0000 (11:31 +0800)]
md/raid1: factor out helper to handle blocked rdev from raid1_write_request()
Currently raid1 is preparing IO for underlying disks while checking if
any disk is blocked, if so allocated resources must be released, then
waiting for rdev to be unblocked and try to prepare IO again.
Make code cleaner by checking blocked rdev first, it doesn't matter if
rdev is blocked while issuing IO, the IO will wait for rdev to be
unblocked or not.
Yu Kuai [Thu, 31 Oct 2024 03:31:10 +0000 (11:31 +0800)]
md: don't record new badblocks for faulty rdev
Faulty will be checked before issuing IO to the rdev, however, rdev can
be faulty at any time, hence it's possible that rdev_set_badblocks()
will be called for faulty rdev. In this case, mddev->sb_flags will be
set and some other path can be blocked by updating super block.
Since faulty rdev will not be accesed anymore, there is no need to
record new babblocks for faulty rdev and forcing updating super block.
Noted this is not a bugfix, just prevent updating superblock in some
corner cases, and will help to slice a bug related to external
metadata[1], testing also shows that devices are removed faster in the
case IO error.
Yu Kuai [Thu, 31 Oct 2024 03:31:09 +0000 (11:31 +0800)]
md: don't wait faulty rdev in md_wait_for_blocked_rdev()
md_wait_for_blocked_rdev() is called for write IO while rdev is
blocked, howerver, rdev can be faulty after choosing this rdev to write,
and faulty rdev should never be accessed anymore, hence there is no point
to wait for faulty rdev to be unblocked.
Yu Kuai [Thu, 31 Oct 2024 03:31:08 +0000 (11:31 +0800)]
md: add a new helper rdev_blocked()
The helper will be used in later patches for raid1/raid10/raid5, the
difference is that Faulty rdev with unacknowledged bad block will not
be considered blocked.
Uros Bizjak [Mon, 7 Oct 2024 08:48:04 +0000 (10:48 +0200)]
md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit()
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref)
to use optimized implementation and ease register pressure around
the primitive for targets that implement optimized variant.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Link: https://lore.kernel.org/r/20241007084831.48067-1-ubizjak@gmail.com Signed-off-by: Song Liu <song@kernel.org>
Guixin Liu [Thu, 31 Oct 2024 02:27:20 +0000 (10:27 +0800)]
nvmet: make nvmet_wq visible in sysfs
In some complex scenarios, we deploy multiple tasks on a single machine
(hybrid deployment), such as Docker containers for function computation
(background processing), real-time tasks, monitoring, event handling,
and management, along with an NVMe target server.
Each of these components is restricted to its own CPU cores to prevent
mutual interference and ensure strict isolation. To achieve this level
of isolation for nvmet_wq we need to use sysfs tunables such as
cpumask that are currently not accessible.
Add WQ_SYSFS flag to alloc_workqueue() when creating nvmet_wq so
workqueue tunables are exported in the userspace via sysfs.
with this patch :-
nvme (nvme-6.13) # ls /sys/devices/virtual/workqueue/nvmet-wq/
affinity_scope affinity_strict cpumask max_active nice per_cpu
power subsystem uevent
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Fri, 1 Nov 2024 04:40:05 +0000 (05:40 +0100)]
nvme-pci: use dma_alloc_noncontigous if possible
Use dma_alloc_noncontigous to allocate a single IOVA-contigous segment
when backed by an IOMMU. This allow to easily use bigger segments and
avoids running into segment limits if we can avoid it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Fri, 1 Nov 2024 04:40:04 +0000 (05:40 +0100)]
nvme-pci: fix freeing of the HMB descriptor table
The HMB descriptor table is sized to the maximum number of descriptors
that could be used for a given device, but __nvme_alloc_host_mem could
break out of the loop earlier on memory allocation failure and end up
using less descriptors than planned for, which leads to an incorrect
size passed to dma_free_coherent.
In practice this was not showing up because the number of descriptors
tends to be low and the dma coherent allocator always allocates and
frees at least a page.
Fixes: 87ad72a59a38 ("nvme-pci: implement host memory buffer support") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Christoph Hellwig [Mon, 4 Nov 2024 07:39:32 +0000 (08:39 +0100)]
block: pre-calculate max_zone_append_sectors
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper. This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.
Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.
Christoph Hellwig [Mon, 4 Nov 2024 07:39:31 +0000 (08:39 +0100)]
block: remove the max_zone_append_sectors check in blk_revalidate_disk_zones
With the lock layer zone append emulation, we are now always setting a
max_zone_append_sectors value for zoned devices and this check can't
ever trigger.
Ming Lei [Sat, 2 Nov 2024 01:42:11 +0000 (09:42 +0800)]
lib/iov_iter: fix bvec iterator setup
.bi_size of bvec iterator should be initialized as real max size for
walking, and .bi_bvec_done just counts how many bytes need to be
skipped in the 1st bvec, so .bi_size isn't related with .bi_bvec_done.
This patch fixes bvec iterator initialization, and the inner `size`
check isn't needed any more, so revert Eric Dumazet's commit 7bc802acf193 ("iov-iter: do not return more bytes than requested in
iov_iter_extract_bvec_pages()").
Cc: Eric Dumazet <edumazet@google.com> Fixes: e4e535bff2bc ("iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages") Reported-by: syzbot+71abe7ab2b70bca770fd@syzkaller.appspotmail.com Tested-by: syzbot+71abe7ab2b70bca770fd@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 30 Oct 2024 05:18:52 +0000 (06:18 +0100)]
block: remove bio_add_zone_append_page
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.
All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.
Christoph Hellwig [Wed, 30 Oct 2024 05:18:51 +0000 (06:18 +0100)]
block: remove zone append special casing from the direct I/O path
This code is unused, and all future zoned file systems should follow
the btrfs lead of splitting the bios themselves to the zoned limits
in the I/O submission handler, because if they didn't they would be
hit by commit ed9832bc08db ("block: introduce folio awareness and add
a bigger size from folio") breaking this code when the zone append
limit (that is usually the max_hw_sectors limit) is smaller than the
largest possible folio size.
Keith Busch [Wed, 16 Oct 2024 20:13:09 +0000 (13:13 -0700)]
blk-integrity: remove seed for user mapped buffers
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.
loop_init() is calling loop_add() after __register_blkdev() succeeds and
is ignoring disk_add() failure from loop_add(), for loop_add() failure
is not fatal and successfully created disks are already visible to
bdev_open().
brd_init() is currently calling brd_alloc() before __register_blkdev()
succeeds and is releasing successfully created disks when brd_init()
returns an error. This can cause UAF for the latter two case:
case 1:
T1:
modprobe brd
brd_init
brd_alloc(0) // success
add_disk
disk_scan_partitions
bdev_file_open_by_dev // alloc file
fput // won't free until back to userspace
brd_alloc(1) // failed since mem alloc error inject
// error path for modprobe will release code segment
// back to userspace
__fput
blkdev_release
bdev_release
blkdev_put_whole
bdev->bd_disk->fops->release // fops is freed now, UAF!
case 2:
T1: T2:
modprobe brd
brd_init
brd_alloc(0) // success
open(/dev/ram0)
brd_alloc(1) // fail
// error path for modprobe
Ming Lei [Thu, 24 Oct 2024 05:00:15 +0000 (07:00 +0200)]
iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages
The iov_iter_extract_pages interface allows to return physically
discontiguous pages, as long as all but the first and last page
in the array are page aligned and page size. Rewrite
iov_iter_extract_bvec_pages to take advantage of that instead of only
returning ranges of physically contiguous pages.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: minor cleanups, new commit log] Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241024050021.627350-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 25 Oct 2024 00:37:20 +0000 (08:37 +0800)]
block: model freeze & enter queue as lock for supporting lockdep
Recently we got several deadlock report[1][2][3] caused by
blk_mq_freeze_queue and blk_enter_queue().
Turns out the two are just like acquiring read/write lock, so model them
as read/write lock for supporting lockdep:
1) model q->q_usage_counter as two locks(io and queue lock)
- queue lock covers sync with blk_enter_queue()
- io lock covers sync with bio_enter_queue()
2) make the lockdep class/key as per-queue:
- different subsystem has very different lock use pattern, shared lock
class causes false positive easily
- freeze_queue degrades to no lock in case that disk state becomes DEAD
because bio_enter_queue() won't be blocked any more
- freeze_queue degrades to no lock in case that request queue becomes dying
because blk_enter_queue() won't be blocked any more
3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is exclusive lock, so dependency with blk_enter_queue() is covered
- it is trylock because blk_mq_freeze_queue() are allowed to run
concurrently
4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed
- dependency with blk_mq_freeze_queue() is covered
- blk_queue_exit() is often called from other contexts(such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(), this way still covered cases as many as possible
With lockdep support, such kind of reports may be reported asap and
needn't wait until the real deadlock is triggered.
For example, lockdep report can be triggered in the report[3] with this
patch applied.
[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
Ming Lei [Fri, 25 Oct 2024 00:37:18 +0000 (08:37 +0800)]
blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs
Add non_owner variant of start_freeze/unfreeze queue APIs, so that the
caller knows that what they are doing, and we can skip lockdep support
for non_owner variant in per-call level.
Prepare for supporting lockdep for freezing/unfreezing queue.
Bart Van Assche [Wed, 23 Oct 2024 20:28:50 +0000 (13:28 -0700)]
blk-mq: Unexport blk_mq_flush_busy_ctxs()
Commit a6088845c2bf ("block: kyber: make kyber more friendly with merging")
removed the only blk_mq_flush_busy_ctxs() call from outside the block layer
core. Hence unexport blk_mq_flush_busy_ctxs().
Bart Van Assche [Tue, 22 Oct 2024 18:16:17 +0000 (11:16 -0700)]
blk-mq: Make blk_mq_quiesce_tagset() hold the tag list mutex less long
Make sure that the tag_list_lock mutex is not held any longer than
necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
is called concurrently from more than one thread. This function is used
by the NVMe core and also by the UFS driver.
Reported-by: Peter Wang <peter.wang@mediatek.com> Cc: Chao Leng <lengchao@huawei.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: stable@vger.kernel.org Fixes: 414dd48e882c ("blk-mq: add tagset quiesce interface") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241022181617.2716173-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Muchun Song [Mon, 21 Oct 2024 08:52:51 +0000 (16:52 +0800)]
block: remove redundant explicit memory barrier from rq_qos waiter and waker
The memory barriers in list_del_init_careful() and list_empty_careful()
in pairs already handle the proper ordering between data.got_token
and data.wq.entry. So remove the redundant explicit barriers. And also
change a "break" statement to "return" to avoid redundant calling of
finish_wait().
Pavel Begunkov [Fri, 18 Oct 2024 16:16:37 +0000 (17:16 +0100)]
nvme: use helpers to access io_uring cmd space
Command implementations shouldn't be directly looking into io_uring_cmd
to carve free space. Use an io_uring helper, which will also do build
time size sanitisation.
Li Lingfeng [Sat, 17 Aug 2024 07:11:08 +0000 (15:11 +0800)]
block: flush all throttled bios when deleting the cgroup
When a process migrates to another cgroup and the original cgroup is deleted,
the restrictions of throttled bios cannot be removed. If the restrictions
are set too low, it will take a long time to complete these bios.
Refer to the process of deleting a disk to remove the restrictions and
issue bios when deleting the cgroup.
This makes difference on the behavior of throttled bios:
Before: the limit of the throttled bios can't be changed and the bios will
complete under this limit;
Now: the limit will be canceled and the throttled bios will be flushed
immediately.
Muchun Song [Mon, 14 Oct 2024 09:29:34 +0000 (17:29 +0800)]
block: fix ordering between checking BLK_MQ_S_STOPPED request adding
Supposing first scenario with a virtio_blk driver.
CPU0 CPU1
blk_mq_try_issue_directly()
__blk_mq_issue_directly()
q->mq_ops->queue_rq()
virtio_queue_rq()
blk_mq_stop_hw_queue()
virtblk_done()
blk_mq_request_bypass_insert() 1) store
blk_mq_start_stopped_hw_queue()
clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending())
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
return
__blk_mq_sched_dispatch_requests()
Supposing another scenario.
CPU0 CPU1
blk_mq_requeue_work()
blk_mq_insert_request() 1) store
virtblk_done()
blk_mq_start_stopped_hw_queue()
blk_mq_run_hw_queues() clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
continue
blk_mq_run_hw_queue()
Both scenarios are similar, the full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4) to make sure that either
CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list.
Otherwise, either CPU will not rerun the hardware queue causing
starvation of the request.
The easy way to fix it is to add the essential full memory barrier into
helper of blk_mq_hctx_stopped(). In order to not affect the fast path
(hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only slow path needs to care about
missing of dispatching the request to the low-level device driver.
Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism") Cc: stable@vger.kernel.org Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Muchun Song [Mon, 14 Oct 2024 09:29:33 +0000 (17:29 +0800)]
block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding
Supposing the following scenario.
CPU0 CPU1
blk_mq_insert_request() 1) store
blk_mq_unquiesce_queue()
blk_queue_flag_clear() 3) store
blk_mq_run_hw_queues()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_run_hw_queue()
if (blk_queue_quiesced()) 2) load
return
blk_mq_sched_dispatch_requests()
The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED
is cleared or CPU1 sees dispatch list or setting of bitmap of software
queue. Otherwise, either CPU will not rerun the hardware queue causing
starvation.
So the first solution is to 1) add a pair of memory barrier to fix the
problem, another solution is to 2) use hctx->queue->queue_lock to
synchronize QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since
memory barrier is not easy to be maintained.
Fixes: f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue") Cc: stable@vger.kernel.org Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241014092934.53630-3-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
After CPU0 has marked the queue as stopped, CPU1 will see the queue is
stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the interrupt of completion of request, so it will run the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 complete
blk_mq_run_hw_queue(), CPU1 just puts the request to the same hardware
queue and returns. It misses dispatching a request. Fix it by running
the hardware queue explicitly. And blk_mq_request_issue_directly()
should handle a similar situation. Fix it as well.
Fixes: d964f04a8fde ("blk-mq: fix direct issue") Cc: stable@vger.kernel.org Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241014092934.53630-2-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
The main purpose of this patch is cleanup.
The throtl_adjusted_limit function was removed after
commit bf20ab538c81 ("blk-throttle: remove
CONFIG_BLK_DEV_THROTTLING_LOW"), so the problem of not being
able to scale after setting bps or iops to 1 will not occur.
So revert this commit that bps/iops can be set to 1.
Julia Lawall [Sun, 13 Oct 2024 20:16:56 +0000 (22:16 +0200)]
block: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed and since
commit 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"),
it is not necessary to use call_rcu when the callback only performs
kmem_cache_free. Use kfree_rcu() directly.
Greg Joyce [Thu, 29 Aug 2024 17:56:11 +0000 (12:56 -0500)]
block: sed-opal: add ioctl IOC_OPAL_SET_SID_PW
After a SED drive is provisioned, there is no way to change the SID
password via the ioctl() interface. A new ioctl IOC_OPAL_SET_SID_PW
will allow the password to be changed. The valid current password is
required.
Uday Shankar [Mon, 7 Oct 2024 18:24:17 +0000 (12:24 -0600)]
ublk: support device recovery without I/O queueing
ublk currently supports the following behaviors on ublk server exit:
A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue
and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:
1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands
The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:
default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2
The behavior A + 2 is currently unsupported. Add support for this
behavior under the new flag combination
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_FAIL_IO.
Uday Shankar [Mon, 7 Oct 2024 18:24:16 +0000 (12:24 -0600)]
ublk: merge stop_work and quiesce_work
Save some lines by merging stop_work and quiesce_work into nosrv_work,
which looks at the recovery flags and does the right thing when the "no
ublk server" condition is detected.
Uday Shankar [Mon, 7 Oct 2024 18:24:15 +0000 (12:24 -0600)]
ublk: refactor recovery configuration flag helpers
ublk currently supports the following behaviors on ublk server exit:
A: outstanding I/Os get errors, subsequently issued I/Os get errors
B: outstanding I/Os get errors, subsequently issued I/Os queue
C: outstanding I/Os get reissued, subsequently issued I/Os queue
and the following behaviors for recovery of preexisting block devices by
a future incarnation of the ublk server:
1: ublk devices stopped on ublk server exit (no recovery possible)
2: ublk devices are recoverable using start/end_recovery commands
The userspace interface allows selection of combinations of these
behaviors using flags specified at device creation time, namely:
default behavior: A + 1
UBLK_F_USER_RECOVERY: B + 2
UBLK_F_USER_RECOVERY|UBLK_F_USER_RECOVERY_REISSUE: C + 2
We can't easily change the userspace interface to allow independent
selection of one of {A, B, C} and one of {1, 2}, but we can refactor the
internal helpers which test for the flags. Replace the existing helpers
with the following set:
ublk_nosrv_should_reissue_outstanding: tests for behavior C
ublk_nosrv_[dev_]should_queue_io: tests for behavior B
ublk_nosrv_should_stop_dev: tests for behavior 1
Uday Shankar [Mon, 7 Oct 2024 18:24:14 +0000 (12:24 -0600)]
ublk: check recovery flags for validity
Setting UBLK_F_USER_RECOVERY_REISSUE without also setting
UBLK_F_USER_RECOVERY is currently silently equivalent to not setting any
recovery flags at all, even though that's obviously not intended. Check
for this case and fail add_dev (with a paranoid warning to aid debugging
any program which might rely on the old behavior) with EINVAL if it is
detected.
Keith Busch [Mon, 7 Oct 2024 15:32:35 +0000 (08:32 -0700)]
block: enable passthrough command statistics
Applications using the passthrough interfaces for IO want to continue
seeing the disk stats. These requests had been fenced off from this
block layer feature. While the block layer doesn't necessarily know what
a passthrough command does, we do know the data size and direction,
which is enough to account for the command's stats.
Since tracking these has the potential to produce unexpected results,
the passthrough stats are locked behind a new queue flag that needs to
be enabled with the /sys/block/<dev>/queue/iostats_passthrough
attribute.
Christoph Hellwig [Tue, 8 Oct 2024 05:08:41 +0000 (07:08 +0200)]
block: return void from the queue_sysfs_entry load_module method
Requesting a module either succeeds or does nothing, return an error from
this method does not make sense.
Also move the load_module after the store method in the struct
declaration to keep the important show and store methods together.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Link: https://lore.kernel.org/r/20241008050841.104602-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Konstantin Khlebnikov [Sat, 5 Oct 2024 00:13:43 +0000 (17:13 -0700)]
block: add partition uuid into uevent as "PARTUUID"
Both most common formats have uuid in addition to partition name:
GPT: standard uuid xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
DOS: 4 byte disk signature and 1 byte partition xxxxxxxx-xx
Tools from util-linux use the same notation for them.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
[dianders: rebased to modern kernels] Signed-off-by: Douglas Anderson <dianders@google.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241004171340.v2.1.I938c91d10e454e841fdf5d64499a8ae8514dc004@changeid Signed-off-by: Jens Axboe <axboe@kernel.dk>