]> www.infradead.org Git - users/hch/misc.git/log
users/hch/misc.git
5 months agoRFC: block: allow write streams on partitions block-write-streams
Christoph Hellwig [Mon, 11 Nov 2024 10:03:46 +0000 (11:03 +0100)]
RFC: block: allow write streams on partitions

By default assign all write streams to partition 1, and add a hack
sysfs files that distributes them all equally.

This is implemented by storing the number of per-partition write
streams in struct block device, as well as the offset to the global
ones, and then remapping the write streams in the I/O submission
path.

The sysfs is hacky and undocumented, better suggestions welcome
from actual users of write stream on partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agonvme: enable FDP support
Christoph Hellwig [Tue, 19 Nov 2024 10:25:45 +0000 (11:25 +0100)]
nvme: enable FDP support

Wire up the block level write streams to the NVMe Flexible Data Placement
(FDP) feature as ratified in TP 4146a.

Based on code from Kanchan Joshi <joshi.k@samsung.com>,
Hui Qi <hui81.qi@samsung.com>, Nitesh Shetty <nj.shetty@samsung.com> and
Keith Busch <kbusch@kernel.org>, but a lot of it has been rewritten to
fit the block layer write stream infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agonvme.h: add FDP definitions
Christoph Hellwig [Tue, 19 Nov 2024 10:29:49 +0000 (11:29 +0100)]
nvme.h: add FDP definitions

Add the config feature result, config log page, and management receive
commands needed for FDP.

Partially based on a patch from Kanchan Joshi <joshi.k@samsung.com>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agonvme: add a nvme_get_log_lsi helper
Christoph Hellwig [Tue, 19 Nov 2024 09:39:28 +0000 (10:39 +0100)]
nvme: add a nvme_get_log_lsi helper

For log pages that need to pass in a LSI value, while at the same time
not touching all the existing nvme_get_log callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agonvme: pass a void pointer to nvme_get/set_features for the result
Christoph Hellwig [Tue, 19 Nov 2024 09:13:32 +0000 (10:13 +0100)]
nvme: pass a void pointer to nvme_get/set_features for the result

That allows passing in structures instead of the u32 result, and thus
reduce the amount of bit shifting and masking required to parse the
result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agonvme: store the endurance group id in struct nvme_ns_head
Christoph Hellwig [Tue, 19 Nov 2024 10:25:11 +0000 (11:25 +0100)]
nvme: store the endurance group id in struct nvme_ns_head

The FDP code needs this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoblock: expose write streams for block device nodes
Christoph Hellwig [Mon, 11 Nov 2024 08:05:28 +0000 (09:05 +0100)]
block: expose write streams for block device nodes

Export statx information about the number and granularity of write
streams, use the per-kiocb write hint and map temperature hints
to write streams (which is a bit questionable, but this shows how it is
done).

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoblock: introduce a write_stream_granularity queue limit
Christoph Hellwig [Mon, 11 Nov 2024 08:34:34 +0000 (09:34 +0100)]
block: introduce a write_stream_granularity queue limit

Export the granularity that write streams should be discarded with,
as it is essential for making good use of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoblock: introduce max_write_streams queue limit
Keith Busch [Fri, 8 Nov 2024 19:36:22 +0000 (11:36 -0800)]
block: introduce max_write_streams queue limit

Drivers with hardware that support write streams need a way to export how
many are available so applications can generically query this.

Note: compared to Keith's origina version this does not automatically
stack the limit.  There is no good way to generically stack them.  For
mirroring or striping just mirroring the write streams will work, but
for everything more complex the stacking drive actually needs to manage
them.

Signed-off-by: Keith Busch <kbusch@kernel.org>
[hch: renamed from max_write_hints to max_write_streams]
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoblock: add a bi_write_stream field
Christoph Hellwig [Tue, 19 Nov 2024 08:31:26 +0000 (09:31 +0100)]
block: add a bi_write_stream field

Add the ability to pass a write stream for placement control in the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoblock: req->bio is always set in the merge code
Christoph Hellwig [Tue, 19 Nov 2024 08:31:01 +0000 (09:31 +0100)]
block: req->bio is always set in the merge code

As smatch, which is a lot smarter than me noticed.  So remove the checks
for it, and condense these checks a bit including the comments stating
the obvious.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoblock: don't bother checking the data direction for merges
Christoph Hellwig [Tue, 19 Nov 2024 08:30:07 +0000 (09:30 +0100)]
block: don't bother checking the data direction for merges

Because it already is encoded in the opcode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoio_uring: enable passing a per-io write stream
Kanchan Joshi [Fri, 8 Nov 2024 19:36:26 +0000 (11:36 -0800)]
io_uring: enable passing a per-io write stream

Allow userspace to pass a per-I/O write stream in the SQE:

__u16 write_stream;

Application can query the supported values from the statx
max_write_streams field.  Unsupported values are ignored by
file operations that do not support write streams or rejected
with an error by those that support them.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
[hch: s/write_hints/write_streams/g]
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agofs: add a write stream field to the kiocb
Christoph Hellwig [Mon, 11 Nov 2024 10:05:14 +0000 (11:05 +0100)]
fs: add a write stream field to the kiocb

Prepare for io_uring passthrough of write streams.

The write stream field in the kiocb structure fits into an existing
2-byte hole, so its size is not changed.

Based on a patch from Keith Busch <kbusch@kernel.org>

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agofs: add write stream information to statx
Keith Busch [Fri, 8 Nov 2024 19:36:23 +0000 (11:36 -0800)]
fs: add write stream information to statx

Add new statx field to report the maximum number of write streams
supported and the granularity for them.

Signed-off-by: Keith Busch <kbusch@kernel.org>
[hch: s/write_hint/write_stream/g, add granularity]
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Tue, 19 Nov 2024 01:32:07 +0000 (18:32 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  block: fix uaf for flush rq while iterating tags

5 months agoblock: fix uaf for flush rq while iterating tags
Yu Kuai [Mon, 4 Nov 2024 11:00:05 +0000 (19:00 +0800)]
block: fix uaf for flush rq while iterating tags

blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
atomic mode after del_gendisk"), hence for disk like scsi, following
blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
cause following uaf that is found by our syzkaller for v6.6:

==================================================================
BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909

CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
Workqueue: kblockd blk_mq_timeout_work
Call Trace:

__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
print_report+0x3e/0x70 mm/kasan/report.c:475
kasan_report+0xb8/0xf0 mm/kasan/report.c:588
blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
bt_iter block/blk-mq-tag.c:288 [inline]
__sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
process_scheduled_works kernel/workqueue.c:2704 [inline]
worker_thread+0x804/0xe40 kernel/workqueue.c:2785
kthread+0x346/0x450 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293

Allocated by task 942:
kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
__kasan_kmalloc mm/kasan/common.c:383 [inline]
__kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
kasan_kmalloc include/linux/kasan.h:198 [inline]
__do_kmalloc_node mm/slab_common.c:1007 [inline]
__kmalloc_node+0x69/0x170 mm/slab_common.c:1014
kmalloc_node include/linux/slab.h:620 [inline]
kzalloc_node include/linux/slab.h:732 [inline]
blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
__scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
call_write_iter include/linux/fs.h:2083 [inline]
new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
vfs_write+0x76c/0xb00 fs/read_write.c:586
ksys_write+0x127/0x250 fs/read_write.c:639
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x78/0xe2

Freed by task 244687:
kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1815 [inline]
slab_free_freelist_hook mm/slub.c:1841 [inline]
slab_free mm/slub.c:3807 [inline]
__kmem_cache_free+0xe4/0x520 mm/slub.c:3820
blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
kobject_cleanup+0x136/0x410 lib/kobject.c:689
kobject_release lib/kobject.c:720 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x119/0x140 lib/kobject.c:737
blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
blk_free_queue block/blk-core.c:298 [inline]
blk_put_queue+0xe2/0x180 block/blk-core.c:314
blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
process_scheduled_works kernel/workqueue.c:2704 [inline]
worker_thread+0x804/0xe40 kernel/workqueue.c:2785
kthread+0x346/0x450 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293

Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in
blk_register_queue() from initialization path, hence it's safe not to
clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already
make sure that queue should only be registered once, there is no need
to test the flag as well.

Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]")
Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241104110005.1412161-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Mon, 18 Nov 2024 21:55:14 +0000 (14:55 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  blk-settings: round down io_opt to physical_block_size

5 months agoblk-settings: round down io_opt to physical_block_size
Mikulas Patocka [Mon, 18 Nov 2024 14:52:50 +0000 (15:52 +0100)]
blk-settings: round down io_opt to physical_block_size

There was a bug report [1] where the user got a warning alignment
inconsistency. The user has optimal I/O 16776704 (0xFFFE00) and physical
block size 4096. Note that the optimal I/O size may be set by the DMA
engines or SCSI controllers and they have no knowledge about the disks
attached to them, so the situation with optimal I/O not aligned to
physical block size may happen.

This commit makes blk_validate_limits round down optimal I/O size to the
physical block size of the block device.

Closes: https://lore.kernel.org/dm-devel/1426ad71-79b4-4062-b2bf-84278be66a5d@redhat.com/T/ [1]
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors")
Cc: stable@vger.kernel.org # v6.11+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/3dc0014b-9690-dc38-81c9-4a316a2d4fb2@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/io_uring' into for-next
Jens Axboe [Mon, 18 Nov 2024 16:11:04 +0000 (09:11 -0700)]
Merge branch 'for-6.13/io_uring' into for-next

* for-6.13/io_uring:
  io_uring: protect register tracing
  io_uring: remove io_uring_cqwait_reg_arg

5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Mon, 18 Nov 2024 16:11:01 +0000 (09:11 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  rust: block: simplify Result<()> in validate_block_size return

5 months agoio_uring: protect register tracing
Pavel Begunkov [Mon, 18 Nov 2024 15:14:50 +0000 (15:14 +0000)]
io_uring: protect register tracing

Syz reports:

BUG: KCSAN: data-race in __se_sys_io_uring_register / io_sqe_files_register

read-write to 0xffff8881021940b8 of 4 bytes by task 5923 on cpu 1:
 io_sqe_files_register+0x2c4/0x3b0 io_uring/rsrc.c:713
 __io_uring_register io_uring/register.c:403 [inline]
 __do_sys_io_uring_register io_uring/register.c:611 [inline]
 __se_sys_io_uring_register+0x8d0/0x1280 io_uring/register.c:591
 __x64_sys_io_uring_register+0x55/0x70 io_uring/register.c:591
 x64_sys_call+0x202/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:428
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff8881021940b8 of 4 bytes by task 5924 on cpu 0:
 __do_sys_io_uring_register io_uring/register.c:613 [inline]
 __se_sys_io_uring_register+0xe4a/0x1280 io_uring/register.c:591
 __x64_sys_io_uring_register+0x55/0x70 io_uring/register.c:591
 x64_sys_call+0x202/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:428
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Which should be due to reading the table size after unlock. We don't
care much as it's just to print it in trace, but we might as well do it
under the lock.

Reported-by: syzbot+5a486fef3de40e0d8c76@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8233af2886a37b57f79e444e3db88fcfda1817ac.1731942203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoio_uring: remove io_uring_cqwait_reg_arg
Pavel Begunkov [Mon, 18 Nov 2024 15:14:34 +0000 (15:14 +0000)]
io_uring: remove io_uring_cqwait_reg_arg

A separate wait argument registration API was removed, also delete
leftover uapi definitions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/143b6a53591badac23632d3e6fa3e5db4b342ee2.1731942445.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agorust: block: simplify Result<()> in validate_block_size return
Manas [Mon, 18 Nov 2024 14:36:58 +0000 (20:06 +0530)]
rust: block: simplify Result<()> in validate_block_size return

`Result` is used in place of `Result<()>` because the default type
parameters are unit `()` and `Error` types, which are automatically
inferred. Thus keep the usage consistent throughout codebase.

Suggested-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://github.com/Rust-for-Linux/linux/issues/1128
Signed-off-by: Manas <manas18244@iiitd.ac.in>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://lore.kernel.org/r/20241118-simplify-result-v3-1-6b1566a77eab@iiitd.ac.in
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/io_uring' into for-next
Jens Axboe [Sun, 17 Nov 2024 16:01:43 +0000 (09:01 -0700)]
Merge branch 'for-6.13/io_uring' into for-next

* for-6.13/io_uring:
  io_uring/region: fix error codes after failed vmap

5 months agoio_uring/region: fix error codes after failed vmap
Pavel Begunkov [Sun, 17 Nov 2024 00:38:33 +0000 (00:38 +0000)]
io_uring/region: fix error codes after failed vmap

io_create_region() jumps after a vmap failure without setting the return
code, it could be 0 or just uninitialised.

Fixes: dfbbfbf191878 ("io_uring: introduce concept of memory regions")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0abac19dbf81c061cffaa9534a2471ed5460ad3e.1731803848.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Fri, 15 Nov 2024 19:38:05 +0000 (12:38 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  MAINTAINERS: Update git tree for mdraid subsystem
  md/raid5: Increase r5conf.cache_name size

5 months agoMerge tag 'md-6.13-20241115' of https://git.kernel.org/pub/scm/linux/kernel/git/mdrai...
Jens Axboe [Fri, 15 Nov 2024 19:37:33 +0000 (12:37 -0700)]
Merge tag 'md-6.13-20241115' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.13/block

Pull MD fixes from Song:

"This set contains a fix for a W=1 warning, by John Garry, and a
 MAINTAINERS update."

* tag 'md-6.13-20241115' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  MAINTAINERS: Update git tree for mdraid subsystem
  md/raid5: Increase r5conf.cache_name size

5 months agoMerge branch 'for-6.13/io_uring' into for-next
Jens Axboe [Fri, 15 Nov 2024 19:29:09 +0000 (12:29 -0700)]
Merge branch 'for-6.13/io_uring' into for-next

* for-6.13/io_uring:
  io_uring: restore back registered wait arguments
  io_uring: add memory region registration
  io_uring: introduce concept of memory regions
  io_uring: temporarily disable registered waits
  io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
  io_uring: fortify io_pin_pages with a warning
  switch io_msg_ring() to CLASS(fd)

5 months agoio_uring: restore back registered wait arguments
Pavel Begunkov [Fri, 15 Nov 2024 16:54:43 +0000 (16:54 +0000)]
io_uring: restore back registered wait arguments

Now we've got a more generic region registration API, place
IORING_ENTER_EXT_ARG_REG and re-enable it.

First, the user has to register a region with the
IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a
ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races
with already running waiters. With that we should have stable constant
values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no
READ_ONCE required.

The other API difference is that we're now passing byte offsets instead
of indexes. The user _must_ align all offsets / pointers to the native
word size, failing to do so might but not necessarily has to lead to a
failure usually returned as -EFAULT. liburing will be hiding this
details from users.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/81822c1b4ffbe8ad391b4f9ad1564def0d26d990.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMAINTAINERS: Update git tree for mdraid subsystem
Song Liu [Fri, 15 Nov 2024 18:25:53 +0000 (10:25 -0800)]
MAINTAINERS: Update git tree for mdraid subsystem

Moving the official git tree to the MDRAID Group account.

Signed-off-by: Song Liu <song@kernel.org>
5 months agoio_uring: add memory region registration
Pavel Begunkov [Fri, 15 Nov 2024 16:54:42 +0000 (16:54 +0000)]
io_uring: add memory region registration

Regions will serve multiple purposes. First, with it we can decouple
ring/etc. object creation from registration / mapping of the memory they
will be placed in. We already have hacks that allow to put both SQ and
CQ into the same huge page, in the future we should be able to:

region = create_region(io_ring);
create_pbuf_ring(io_uring, region, offset=0);
create_pbuf_ring(io_uring, region, offset=N);

The second use case is efficiently passing parameters. The following
patch enables back on top of regions IORING_ENTER_EXT_ARG_REG, which
optimises wait arguments. It'll also be useful for request arguments
replacing iovecs, msghdr, etc. pointers. Eventually it would also be
handy for BPF as well if it comes to fruition.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0798cf3a14fad19cfc96fc9feca5f3e11481691d.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoio_uring: introduce concept of memory regions
Pavel Begunkov [Fri, 15 Nov 2024 16:54:41 +0000 (16:54 +0000)]
io_uring: introduce concept of memory regions

We've got a good number of mappings we share with the userspace, that
includes the main rings, provided buffer rings, upcoming rings for
zerocopy rx and more. All of them duplicate user argument parsing and
some internal details as well (page pinnning, huge page optimisations,
mmap'ing, etc.)

Introduce a notion of regions. For userspace for now it's just a new
structure called struct io_uring_region_desc which is supposed to
parameterise all such mapping / queue creations. A region either
represents a user provided chunk of memory, in which case the user_addr
field should point to it, or a request for the kernel to allocate the
memory, in which case the user would need to mmap it after using the
offset returned in the mmap_offset field. With a uniform userspace API
we can avoid additional boiler plate code and apply future optimisation
to all of them at once.

Internally, there is a new structure struct io_mapped_region holding all
relevant runtime information and some helpers to work with it. This
patch limits it to user provided regions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0e6fe25818dfbaebd1bd90b870a6cac503fe1a24.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoio_uring: temporarily disable registered waits
Pavel Begunkov [Fri, 15 Nov 2024 16:54:40 +0000 (16:54 +0000)]
io_uring: temporarily disable registered waits

Disable wait argument registration as it'll be replaced with a more
generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing
in a few commits so leave it be.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70b1d1d218c41ba77a76d1789c8641dab0b0563e.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoio_uring: disable ENTER_EXT_ARG_REG for IOPOLL
Pavel Begunkov [Fri, 15 Nov 2024 16:54:39 +0000 (16:54 +0000)]
io_uring: disable ENTER_EXT_ARG_REG for IOPOLL

IOPOLL doesn't use the extended arguments, no need for it to support
IORING_ENTER_EXT_ARG_REG. Let's disable it for IOPOLL, if anything it
leaves more space for future extensions.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a35ecd919dbdc17bd5b7932273e317832c531b45.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoio_uring: fortify io_pin_pages with a warning
Pavel Begunkov [Fri, 15 Nov 2024 16:54:38 +0000 (16:54 +0000)]
io_uring: fortify io_pin_pages with a warning

We're a bit too frivolous with types of nr_pages arguments, converting
it to long and back to int, passing an unsigned int pointer as an int
pointer and so on. Shouldn't cause any problem but should be carefully
reviewed, but until then let's add a WARN_ON_ONCE check to be more
confident callers don't pass poorely checked arguents.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d48e0c097cbd90fb47acaddb6c247596510d8cfc.1731689588.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoswitch io_msg_ring() to CLASS(fd)
Al Viro [Fri, 15 Nov 2024 03:49:02 +0000 (03:49 +0000)]
switch io_msg_ring() to CLASS(fd)

Use CLASS(fd) to get the file for sync message ring requests, rather
than open-code the file retrieval dance.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20241115034902.GP3387508@ZenIV
[axboe: make a more coherent commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Fri, 15 Nov 2024 15:46:15 +0000 (08:46 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  block: make struct rq_list available for !CONFIG_BLOCK

5 months agoblock: make struct rq_list available for !CONFIG_BLOCK
Jens Axboe [Fri, 15 Nov 2024 14:14:03 +0000 (07:14 -0700)]
block: make struct rq_list available for !CONFIG_BLOCK

A previous commit changed how requests are linked in the plug structure,
but unlike the previous method, it uses a new type for it rather than
struct request. The latter is available even for !CONFIG_BLOCK, while
struct rq_list is now. Move it outside CONFIG_BLOCK.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Fixes: a3396b99990d ("block: add a rq_list type")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Wed, 13 Nov 2024 21:03:15 +0000 (14:03 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  block/genhd: use seq_put_decimal_ull for diskstats decimal values

5 months agoblock/genhd: use seq_put_decimal_ull for diskstats decimal values
David Wang [Fri, 8 Nov 2024 05:45:00 +0000 (13:45 +0800)]
block/genhd: use seq_put_decimal_ull for diskstats decimal values

seq_printf is costly. For each block device, 19 decimal values are
yielded in /proc/diskstats via seq_printf. On a system with 16 logical
block devices, profiling for open/read/close sequences shows seq_printf
took ~75% samples of diskstats_show:

diskstats_show(92.626% 2269372/2450040)
    seq_printf(76.026% 1725313/2269372)
vsnprintf(99.163% 1710866/1725313)
    format_decode(26.597% 455040/1710866)
    number(19.554% 334542/1710866)
    memcpy_orig(4.183% 71570/1710866)
...
srso_return_thunk(0.009% 148/1725313)
    part_stat_read_all(8.030% 182236/2269372)

One million rounds of open/read/close /proc/diskstats takes:

real 0m37.687s
user 0m0.264s
sys 0m32.911s
On average, each sequence tooks ~0.032ms

With this patch, most decimal values are yield via seq_put_decimal_ull,
performance is significantly improved:

real 0m20.792s
user 0m0.316s
sys 0m20.463s
On average, each sequence tooks ~0.020ms, a ~37.5% improvement.

Signed-off-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/r/20241108054500.4251-1-00107082@163.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Wed, 13 Nov 2024 19:05:30 +0000 (12:05 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block:
  block: don't reorder requests in blk_mq_add_to_batch
  block: don't reorder requests in blk_add_rq_to_plug
  block: add a rq_list type
  block: remove rq_list_move
  virtio_blk: reverse request order in virtio_queue_rqs
  nvme-pci: reverse request order in nvme_queue_rqs
  btrfs: validate queue limits
  block: export blk_validate_limits

5 months agoblock: don't reorder requests in blk_mq_add_to_batch
Christoph Hellwig [Wed, 13 Nov 2024 15:20:46 +0000 (16:20 +0100)]
block: don't reorder requests in blk_mq_add_to_batch

LIFO ordering for batched completions is a bit unexpected and also
defeats some merging optimizations in e.g. the XFS buffered write
code.  Now that we can easily add the request to the tail of the list
do that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: don't reorder requests in blk_add_rq_to_plug
Christoph Hellwig [Wed, 13 Nov 2024 15:20:45 +0000 (16:20 +0100)]
block: don't reorder requests in blk_add_rq_to_plug

Add requests to the tail of the list instead of the front so that they
are queued up in submission order.

Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: add a rq_list type
Christoph Hellwig [Wed, 13 Nov 2024 15:20:44 +0000 (16:20 +0100)]
block: add a rq_list type

Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers.  Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: remove rq_list_move
Christoph Hellwig [Wed, 13 Nov 2024 15:20:43 +0000 (16:20 +0100)]
block: remove rq_list_move

Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agovirtio_blk: reverse request order in virtio_queue_rqs
Christoph Hellwig [Wed, 13 Nov 2024 15:20:42 +0000 (16:20 +0100)]
virtio_blk: reverse request order in virtio_queue_rqs

blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern
especially in rotational devices. Fix this by rewriting virtio_queue_rqs
so that it always pops the requests from the passed in request list, and
then adds them to the head of a local submit list. This actually
simplifies the code a bit as it removes the complicated list splicing,
at the cost of extra updates of the rq_next pointer. As that should be
cache hot anyway it should be an easy price to pay.

Fixes: 0e9911fa768f ("virtio-blk: support mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agonvme-pci: reverse request order in nvme_queue_rqs
Christoph Hellwig [Wed, 13 Nov 2024 15:20:41 +0000 (16:20 +0100)]
nvme-pci: reverse request order in nvme_queue_rqs

blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices.  Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list.  This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer.  As that should be cache hot
anyway it should be an easy price to pay.

Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agobtrfs: validate queue limits
Christoph Hellwig [Wed, 13 Nov 2024 08:45:36 +0000 (09:45 +0100)]
btrfs: validate queue limits

Call blk_validate_limits on the queue limits used for zone append
splitting so that calculated values get filled in and any stacking
conflicts get cought.

Without this there isn't a max_zone_append_sectors limits as of commit
559218d43ec9 ("block: pre-calculate max_zone_append_sectors").

Fixes: 559218d43ec9 ("block: pre-calculate max_zone_append_sectors")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: export blk_validate_limits
Christoph Hellwig [Wed, 13 Nov 2024 08:45:35 +0000 (09:45 +0100)]
block: export blk_validate_limits

While block drivers do the validation as part of committing them to the
queue, users that use the limit outside of a block device context have
to validate the limits and fill in the calculated values as well.

So far btrfs is the only user of queue limits without a block device,
and it has gotten away with that more or less by accident.  But with
commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors")
this became fatal for setups that have small max zone append size,
as it won't be limited now.

Export blk_validate_limits so that it can be called directly from btrfs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241113084541.34315-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Wed, 13 Nov 2024 17:43:49 +0000 (10:43 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block: (22 commits)
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  nvmet: support for csi identify ns
  nvmet: implement rotational media information log
  nvmet: implement endurance groups
  nvmet: declare 2.1 version compliance
  nvmet: implement crto property
  nvmet: implement supported features log
  nvmet: implement supported log pages
  nvmet: implement active command set ns list
  nvmet: implement id ns for nvm command set
  nvmet: support reservation feature
  nvme: add reservation command's defines
  nvme-core: remove repeated wq flags
  nvmet: make nvmet_wq visible in sysfs
  nvme-pci: use dma_alloc_noncontigous if possible
  ...

5 months agoMerge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/block
Jens Axboe [Wed, 13 Nov 2024 17:43:11 +0000 (10:43 -0700)]
Merge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.13

 - Use uring_cmd helper (Pavel)
 - Host Memory Buffer allocation enhancements (Christoph)
 - Target persistent reservation support (Guixin)
 - Persistent reservation tracing (Guixen)
 - NVMe 2.1 specification support (Keith)
 - Rotational Meta Support (Matias, Wang, Keith)
 - Volatile cache detection enhancment (Guixen)"

* tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme: (22 commits)
  nvmet: add tracing of reservation commands
  nvme: parse reservation commands's action and rtype to string
  nvmet: report ns's vwc not present
  nvme: check ns's volatile write cache not present
  nvme: add rotational support
  nvme: use command set independent id ns if available
  nvmet: support for csi identify ns
  nvmet: implement rotational media information log
  nvmet: implement endurance groups
  nvmet: declare 2.1 version compliance
  nvmet: implement crto property
  nvmet: implement supported features log
  nvmet: implement supported log pages
  nvmet: implement active command set ns list
  nvmet: implement id ns for nvm command set
  nvmet: support reservation feature
  nvme: add reservation command's defines
  nvme-core: remove repeated wq flags
  nvmet: make nvmet_wq visible in sysfs
  nvme-pci: use dma_alloc_noncontigous if possible
  ...

5 months agonvmet: add tracing of reservation commands
Guixin Liu [Mon, 14 Oct 2024 10:14:58 +0000 (18:14 +0800)]
nvmet: add tracing of reservation commands

Add tracing of reservation commands, including register, acquire,
release and report, and also parse the action and rtype to string
to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvme: parse reservation commands's action and rtype to string
Guixin Liu [Mon, 14 Oct 2024 10:14:57 +0000 (18:14 +0800)]
nvme: parse reservation commands's action and rtype to string

Parse reservation commands's action(including rrega, racqa and rrela)
and rtype to string to make the trace log more human-readable.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: report ns's vwc not present
Guixin Liu [Wed, 13 Nov 2024 10:18:39 +0000 (18:18 +0800)]
nvmet: report ns's vwc not present

Currently, we report that controller has vwc even though the ns may
not have vwc. Report ns's vwc not present when not buffered_io or
backdev doesn't have vwc.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agoMerge branch 'for-6.13/block' into for-next
Jens Axboe [Wed, 13 Nov 2024 14:39:12 +0000 (07:39 -0700)]
Merge branch 'for-6.13/block' into for-next

* for-6.13/block: (87 commits)
  block: remove the ioprio field from struct request
  block: remove the write_hint field from struct request
  nvme-multipath: don't bother clearing max_hw_zone_append_sectors
  block: pre-calculate max_zone_append_sectors
  block: lift bio_is_zone_append to bio.h
  block: fix bio_split_rw_at to take zone_write_granularity into account
  block: take chunk_sectors into account in bio_split_write_zeroes
  md/raid10: Handle bio_split() errors
  md/raid1: Handle bio_split() errors
  md/raid0: Handle bio_split() errors
  block: Handle bio_split() errors in bio_submit_split()
  block: Error an attempt to split an atomic write in bio_split()
  block: Rework bio_split() return value
  ublk: fix ublk_ch_mmap() for 64K page size
  s390/dasd: Fix typo in comment
  s390/dasd: fix redundant /proc/dasd* entries removal
  loop: fix type of block size
  MAINTAINERS: Make Yu Kuai co-maintainer of md/raid subsystem
  md/raid5: Wait sync io to finish before changing group cnt
  block: don't verify IO lock for freeze/unfreeze in elevator_init_mq()
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge branch 'for-6.13/io_uring' into for-next
Jens Axboe [Wed, 13 Nov 2024 14:38:35 +0000 (07:38 -0700)]
Merge branch 'for-6.13/io_uring' into for-next

* for-6.13/io_uring: (71 commits)
  io_uring: fix invalid hybrid polling ctx leaks
  io_uring/uring_cmd: fix buffer index retrieval
  io_uring/rsrc: add & apply io_req_assign_buf_node()
  io_uring/rsrc: remove '->ctx_ptr' of 'struct io_rsrc_node'
  io_uring/rsrc: pass 'struct io_ring_ctx' reference to rsrc helpers
  io_uring: avoid normal tw intermediate fallback
  io_uring/napi: add static napi tracking strategy
  io_uring/napi: clean up __io_napi_do_busy_loop
  io_uring/napi: Use lock guards
  io_uring/napi: improve __io_napi_add
  io_uring/napi: fix io_napi_entry RCU accesses
  io_uring/napi: protect concurrent io_napi_entry timeout accesses
  io_uring: prevent speculating sq_array indexing
  io_uring: move struct io_kiocb from task_struct to io_uring_task
  io_uring: remove task ref helpers
  io_uring: move cancelations to be io_uring_task based
  io_uring/rsrc: split io_kiocb node type assignments
  io_uring/rsrc: encode node type and ctx together
  io_uring: add support for hybrid IOPOLL
  io_uring/rsrc: allow cloning with node replacements
  ...

5 months agoio_uring: fix invalid hybrid polling ctx leaks
Pavel Begunkov [Wed, 13 Nov 2024 03:26:01 +0000 (03:26 +0000)]
io_uring: fix invalid hybrid polling ctx leaks

It has already allocated the ctx by the point where it checks the hybrid
poll configuration, plain return leaks the memory.

Fixes: 01ee194d1aba1 ("io_uring: add support for hybrid IOPOLL")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/b57f2608088020501d352fcdeebdb949e281d65b.1731468230.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Wed, 13 Nov 2024 00:39:34 +0000 (16:39 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fix from Michael Tsirkin:
 "A last minute mlx5 bugfix"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa/mlx5: Fix PA offset with unaligned starting iotlb map

5 months agomd/raid5: Increase r5conf.cache_name size
John Garry [Tue, 12 Nov 2024 16:10:18 +0000 (16:10 +0000)]
md/raid5: Increase r5conf.cache_name size

For compiling with W=1, the following warning can be seen:

drivers/md/raid5.c: In function ‘setup_conf’:
drivers/md/raid5.c:2423:12: error: ‘%s’ directive output may be truncated writing up to 31 bytes into a region of size between 16 and 26 [-Werror=format-truncation=]
    "raid%d-%s", conf->level, mdname(conf->mddev));
            ^~
drivers/md/raid5.c:2422:3: note: ‘snprintf’ output between 7 and 48 bytes into a destination of size 32
   snprintf(conf->cache_name[0], namelen,
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "raid%d-%s", conf->level, mdname(conf->mddev));
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

Increase the array size to avoid this warning.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241112161019.4154616-2-john.g.garry@oracle.com
Signed-off-by: Song Liu <song@kernel.org>
5 months agovdpa/mlx5: Fix PA offset with unaligned starting iotlb map
Si-Wei Liu [Mon, 21 Oct 2024 13:40:39 +0000 (16:40 +0300)]
vdpa/mlx5: Fix PA offset with unaligned starting iotlb map

When calculating the physical address range based on the iotlb and mr
[start,end) ranges, the offset of mr->start relative to map->start
is not taken into account. This leads to some incorrect and duplicate
mappings.

For the case when mr->start < map->start the code is already correct:
the range in [mr->start, map->start) was handled by a different
iteration.

Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Cc: stable@vger.kernel.org
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Message-Id: <20241021134040.975221-2-dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
5 months agoblock: remove the ioprio field from struct request
Christoph Hellwig [Tue, 12 Nov 2024 17:00:39 +0000 (18:00 +0100)]
block: remove the ioprio field from struct request

The request ioprio is only initialized from the first attached bio,
so requests without a bio already never set it.  Directly use the
bio field instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: remove the write_hint field from struct request
Christoph Hellwig [Tue, 12 Nov 2024 17:00:38 +0000 (18:00 +0100)]
block: remove the write_hint field from struct request

The write_hint is only used for read/write requests, which must have a
bio attached to them.  Just use the bio field instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Tue, 12 Nov 2024 21:35:13 +0000 (13:35 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 and selftests fixes.

  x86:

   - When emulating a guest TLB flush for a nested guest, flush vpid01,
     not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if
     L2 and L1 are sharing VPID '0' (from L1's perspective).

   - Fix a bug in the SNP initialization flow where KVM would return '0'
     to userspace instead of -errno on failure.

   - Move the Intel PT virtualization (i.e. outputting host trace to
     host buffer and guest trace to guest buffer) behind CONFIG_BROKEN.

   - Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START

   - Fix a bug where KVM fails to inject an interrupt from the IRR after
     KVM_SET_LAPIC.

  Selftests:

   - Increase the timeout for the memslot performance selftest to avoid
     false failures on arm64 and nested x86 platforms.

   - Fix a goof in the guest_memfd selftest where a for-loop initialized
     a bit mask to zero instead of BIT(0).

   - Disable strict aliasing when building KVM selftests to prevent the
     compiler from treating things like "u64 *" to "uint64_t *" cases as
     undefined behavior, which can lead to nasty, hard to debug
     failures.

   - Force -march=x86-64-v2 for KVM x86 selftests if and only if the
     uarch is supported by the compiler.

   - Fix broken compilation of kvm selftests after a header sync in
     tools/"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
  KVM: x86: Unconditionally set irr_pending when updating APICv state
  kvm: svm: Fix gctx page leak on invalid inputs
  KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB
  KVM: SVM: Propagate error from snp_guest_req_init() to userspace
  KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled
  KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported
  KVM: selftests: Disable strict aliasing
  KVM: selftests: fix unintentional noop test in guest_memfd_test.c
  KVM: selftests: memslot_perf_test: increase guest sync timeout

5 months agoMerge tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 12 Nov 2024 21:21:07 +0000 (13:21 -0800)]
Merge tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - fix warnings about duplicate slab cache names

* tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-cache: fix warnings about duplicate slab caches
  dm-bufio: fix warnings about duplicate slab caches

5 months agoMerge tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar...
Linus Torvalds [Tue, 12 Nov 2024 21:06:31 +0000 (13:06 -0800)]
Merge tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity fixes from Mimi Zohar:
 "One bug fix, one performance improvement, and the use of
  static_assert:

   - The bug fix addresses "only a cosmetic change" commit, which didn't
     take into account the original 'ima' template definition.

  - The performance improvement limits the atomic_read()"

* tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Use static_assert() to check struct sizes
  evm: stop avoidably reading i_writecount in evm_file_release
  ima: fix buffer overrun in ima_eventdigest_init_common

5 months agoMerge tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic...
Linus Torvalds [Tue, 12 Nov 2024 21:01:09 +0000 (13:01 -0800)]
Merge tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock fixes from Mickaël Salaün:
 "This fixes issues in the Landlock's sandboxer sample and
  documentation, slightly refactors helpers (required for ongoing patch
  series), and improve/fix a feature merged in v6.12 (signal and
  abstract UNIX socket scoping)"

* tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Optimize scope enforcement
  landlock: Refactor network access mask management
  landlock: Refactor filesystem access mask management
  samples/landlock: Clarify option parsing behaviour
  samples/landlock: Refactor help message
  samples/landlock: Fix port parsing in sandboxer
  landlock: Fix grammar issues in documentation
  landlock: Improve documentation of previous limitations

5 months agoMerge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Mon, 11 Nov 2024 22:09:57 +0000 (14:09 -0800)]
Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - The fair sched class currently has a bug where its balance() returns
   true telling the sched core that it has tasks to run but then NULL
   from pick_task(). This makes sched core call sched_ext's pick_task()
   without preceding balance() which can lead to stalls in partial mode.

   For now, work around by detecting the condition and forcing the CPU
   to go through another scheduling cycle.

 - Add a missing newline to an error message and fix drgn introspection
   tool which went out of sync.

* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
  sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
  sched_ext: Add a missing newline at the end of an error message

5 months agonvme: check ns's volatile write cache not present
Guixin Liu [Mon, 4 Nov 2024 08:55:00 +0000 (16:55 +0800)]
nvme: check ns's volatile write cache not present

When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE
flag should not be set when registering the block device, regardless
of whether the controller supports VWC.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvme: add rotational support
Wang Yugui [Thu, 10 Oct 2024 12:39:50 +0000 (14:39 +0200)]
nvme: add rotational support

Rotational devices, such as hard-drives, can be detected using
the rotational bit in the namespace independent identify namespace
data structure. Make the bit visible to the block layer through the
rotational queue setting.

Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvme: use command set independent id ns if available
Matias Bjørling [Tue, 8 Oct 2024 14:55:02 +0000 (16:55 +0200)]
nvme: use command set independent id ns if available

The NVMe 2.0 specification adds an independent identify namespace
data structure that contains generic attributes that apply to all
namespace types. Some attributes carry over from the NVM command set
identify namespace data structure, and others are new.

Currently, the data structure only considered when CRIMS is enabled or
when the namespace type is key-value.

However, the independent namespace data structure is mandatory for
devices that implement features from the 2.0+ specification. Therefore,
we can check this data structure first. If unavailable, retrieve the
generic attributes from the NVM command set identify namespace data
structure.

Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: support for csi identify ns
Keith Busch [Fri, 1 Nov 2024 19:29:40 +0000 (12:29 -0700)]
nvmet: support for csi identify ns

Implements reporting the I/O Command Set Independent Identify Namespace
command.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement rotational media information log
Keith Busch [Fri, 1 Nov 2024 20:48:47 +0000 (13:48 -0700)]
nvmet: implement rotational media information log

Most of the information is stubbed. Supporting these commands is a
requirement for supporting rotational media.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement endurance groups
Keith Busch [Fri, 1 Nov 2024 21:46:01 +0000 (14:46 -0700)]
nvmet: implement endurance groups

Most of the returned information is just stubbed data. The target must
support these in order to report rotational media. Since this driver
doesn't know any better, each namespace is its own endurance group with
the engid value matching the nsid.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: declare 2.1 version compliance
Keith Busch [Mon, 4 Nov 2024 22:29:45 +0000 (14:29 -0800)]
nvmet: declare 2.1 version compliance

The target driver implements all the mandatory logs, identifications,
features, and properties up to nvme sepcification 2.1.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement crto property
Keith Busch [Mon, 4 Nov 2024 22:17:59 +0000 (14:17 -0800)]
nvmet: implement crto property

This property is required for nvme 2.1. The target only supports ready
with media, so this is just the same value as CAP.TO.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement supported features log
Keith Busch [Mon, 4 Nov 2024 22:07:42 +0000 (14:07 -0800)]
nvmet: implement supported features log

This log is required for nvme 2.1.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement supported log pages
Keith Busch [Mon, 4 Nov 2024 22:00:14 +0000 (14:00 -0800)]
nvmet: implement supported log pages

This log is required for nvme 2.1.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement active command set ns list
Keith Busch [Mon, 4 Nov 2024 21:24:36 +0000 (13:24 -0800)]
nvmet: implement active command set ns list

This is required for nvme 2.1 for targets that support multiple command
sets. We support NVM and ZNS, so are required to support this
identification.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: implement id ns for nvm command set
Keith Busch [Mon, 4 Nov 2024 15:27:44 +0000 (07:27 -0800)]
nvmet: implement id ns for nvm command set

We don't report anything here, but it's a mandatory identification for
nvme 2.1.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agonvmet: support reservation feature
Guixin Liu [Wed, 6 Nov 2024 07:34:46 +0000 (15:34 +0800)]
nvmet: support reservation feature

This patch implements the reservation feature, including:
  1. reservation register(register, unregister and replace).
  2. reservation acquire(acquire, preempt, preempt and abort).
  3. reservation release(release and clear).
  4. reservation report.
  5. set feature and get feature of reservation notify mask.
  6. get log page of reservation event.

Not supported:
  1. persistent reservation through power loss.

Test cases:
  Use nvme-cli and fio to test all implemented sub features:
  1. use nvme resv-register to register host a registrant or
     unregister or replace a new key.
  2. use nvme resv-acquire to set host to the holder, and use fio
     to send read and write io in all reservation type. And also
     test preempt and "preempt and abort".
  3. use nvme resv-report to show all registrants and reservation
     status.
  4. use nvme resv-release to release all registrants.
  5. use nvme get-log to get events generated by the preceding
     operations.

In addition, make reservation configurable, one can set ns to
support reservation before enable ns. The default of resv_enable
is false.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Mon, 11 Nov 2024 17:06:17 +0000 (09:06 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
 "Several small bugfixes all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa/mlx5: Fix error path during device add
  vp_vdpa: fix id_table array not null terminated error
  virtio_pci: Fix admin vq cleanup by using correct info pointer
  vDPA/ifcvf: Fix pci_read_config_byte() return code handling
  Fix typo in vringh_test.c
  vdpa: solidrun: Fix UB bug with devres
  vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans

5 months agonvme-multipath: don't bother clearing max_hw_zone_append_sectors
Christoph Hellwig [Fri, 8 Nov 2024 15:46:52 +0000 (16:46 +0100)]
nvme-multipath: don't bother clearing max_hw_zone_append_sectors

The limits stacking now properly zeroes it if at least one of the
underlying limits clears it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: pre-calculate max_zone_append_sectors
Christoph Hellwig [Fri, 8 Nov 2024 15:46:51 +0000 (16:46 +0100)]
block: pre-calculate max_zone_append_sectors

max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using queue_limits_max_zone_append_sectors helper.  This not
only adds (tiny) extra overhead to the I/O path, but also can be easily
forgotten in file system code.

Add a new max_hw_zone_append_sectors value to queue_limits which is
set by the driver, and calculate max_zone_append_sectors from that and
the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated to fix this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: lift bio_is_zone_append to bio.h
Christoph Hellwig [Mon, 4 Nov 2024 06:26:31 +0000 (07:26 +0100)]
block: lift bio_is_zone_append to bio.h

Make bio_is_zone_append globally available, because file systems need
to use to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: fix bio_split_rw_at to take zone_write_granularity into account
Christoph Hellwig [Mon, 4 Nov 2024 06:26:30 +0000 (07:26 +0100)]
block: fix bio_split_rw_at to take zone_write_granularity into account

Otherwise it can create unaligned writes on zoned devices.

Fixes: a805a4fa4fa3 ("block: introduce zone_write_granularity limit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241104062647.91160-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: take chunk_sectors into account in bio_split_write_zeroes
Christoph Hellwig [Mon, 4 Nov 2024 06:26:29 +0000 (07:26 +0100)]
block: take chunk_sectors into account in bio_split_write_zeroes

For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors.  For other uses like the
internally RAIDed NVMe devices it is probably at least useful.

Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes.  Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.

Fixes: 885fa13f6559 ("block: implement splitting of REQ_OP_WRITE_ZEROES bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20241104062647.91160-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agodm-cache: fix warnings about duplicate slab caches
Mikulas Patocka [Mon, 11 Nov 2024 15:51:02 +0000 (16:51 +0100)]
dm-cache: fix warnings about duplicate slab caches

The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache
code.

The dm-cache code allocates a slab cache for each device. This commit
changes it to allocate just one slab cache in the module init function.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
5 months agodm-bufio: fix warnings about duplicate slab caches
Mikulas Patocka [Mon, 11 Nov 2024 15:48:18 +0000 (16:48 +0100)]
dm-bufio: fix warnings about duplicate slab caches

The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio
code. The dm-bufio code allocates a slab cache with each client. It is
not possible to preallocate the caches in the module init function
because the size of auxiliary per-buffer data is not known at this point.

So, this commit changes dm-bufio so that it appends a unique atomic value
to the cache name, to avoid the warnings.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
5 months agomd/raid10: Handle bio_split() errors
John Garry [Mon, 11 Nov 2024 11:21:50 +0000 (11:21 +0000)]
md/raid10: Handle bio_split() errors

Add proper bio_split() error handling. For any error, call
raid_end_bio_io() and return. Except for discard, where we end the bio
directly.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-7-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agomd/raid1: Handle bio_split() errors
John Garry [Mon, 11 Nov 2024 11:21:49 +0000 (11:21 +0000)]
md/raid1: Handle bio_split() errors

Add proper bio_split() error handling. For any error, call
raid_end_bio_io() and return.

For the case of an in the write path, we need to undo the increment in
the rdev pending count and NULLify the r1_bio->bios[] pointers.

For read path failure, we need to undo rdev pending count increment from
the earlier read_balance() call.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-6-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agomd/raid0: Handle bio_split() errors
John Garry [Mon, 11 Nov 2024 11:21:48 +0000 (11:21 +0000)]
md/raid0: Handle bio_split() errors

Add proper bio_split() error handling. For any error, set bi_status, end
the bio, and return.

Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: Handle bio_split() errors in bio_submit_split()
John Garry [Mon, 11 Nov 2024 11:21:47 +0000 (11:21 +0000)]
block: Handle bio_split() errors in bio_submit_split()

bio_split() may error, so check this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: Error an attempt to split an atomic write in bio_split()
John Garry [Mon, 11 Nov 2024 11:21:46 +0000 (11:21 +0000)]
block: Error an attempt to split an atomic write in bio_split()

This is disallowed.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoblock: Rework bio_split() return value
John Garry [Mon, 11 Nov 2024 11:21:45 +0000 (11:21 +0000)]
block: Rework bio_split() return value

Instead of returning an inconclusive value of NULL for an error in calling
bio_split(), return a ERR_PTR() always.

Also remove the BUG_ON() calls, and WARN_ON_ONCE() instead. Indeed, since
almost all callers don't check the return code from bio_split(), we'll
crash anyway (for those failures).

Fix up the only user which checks bio_split() return code today (directly
or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache
code does check the return code in cached_dev_cache_miss() ->
bio_next_split() -> bio_split(), but only to see if there was a split, so
there would be no change in behaviour here (when returning a ERR_PTR()).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoublk: fix ublk_ch_mmap() for 64K page size
Ming Lei [Mon, 11 Nov 2024 11:07:18 +0000 (19:07 +0800)]
ublk: fix ublk_ch_mmap() for 64K page size

In ublk_ch_mmap(), queue id is calculated in the following way:

(vma->vm_pgoff << PAGE_SHIFT) / `max_cmd_buf_size`

'max_cmd_buf_size' is equal to

`UBLK_MAX_QUEUE_DEPTH * sizeof(struct ublksrv_io_desc)`

and UBLK_MAX_QUEUE_DEPTH is 4096 and part of UAPI, so 'max_cmd_buf_size'
is always page aligned in 4K page size kernel. However, it isn't true in
64K page size kernel.

Fixes the issue by always rounding up 'max_cmd_buf_size' with PAGE_SIZE.

Cc: stable@vger.kernel.org
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241111110718.1394001-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoio_uring/uring_cmd: fix buffer index retrieval
Ming Lei [Mon, 11 Nov 2024 10:13:18 +0000 (18:13 +0800)]
io_uring/uring_cmd: fix buffer index retrieval

Add back buffer index retrieval for IORING_URING_CMD_FIXED.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Fixes: b54a14041ee6 ("io_uring/rsrc: add io_rsrc_node_lookup() helper")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Tested-by: Guangwu Zhang <guazhang@redhat.com>
Link: https://lore.kernel.org/r/20241111101318.1387557-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 months agoLinux 6.12-rc7
Linus Torvalds [Sun, 10 Nov 2024 22:19:35 +0000 (14:19 -0800)]
Linux 6.12-rc7

5 months agoMerge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 10 Nov 2024 22:16:28 +0000 (14:16 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
 "A handful of Qualcomm clk driver fixes:

   - Correct flags for X Elite USB MP GDSC and pcie pipediv2 clocks

   - Fix alpha PLL post_div mask for the cases where width is not
     specified

   - Avoid hangs in the SM8350 video driver (venus) by setting HW_CTRL
     trigger feature on the video clocks"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags
  clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2 clocks
  clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set
  clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for vcodec GDSCs

5 months agoMerge tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sun, 10 Nov 2024 22:13:05 +0000 (14:13 -0800)]
Merge tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "i2c-host fixes for v6.12-rc7 (from Andi):

   - Fix designware incorrect behavior when concluding a transmission

   - Fix Mule multiplexer error value evaluation"

* tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set
  i2c: muxes: Fix return value check in mule_i2c_mux_probe()