www.infradead.org Git - users/jedix/linux-maple.git/log

]> www.infradead.org Git - users/jedix/linux-maple.git/log

projects / users / jedix / linux-maple.git / log

Yu Kuai [Mon, 9 Sep 2024 13:41:51 +0000 (21:41 +0800)]

block, bfq: remove bfq_log_bfqg()

It's not used, hence can be removed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 9 Sep 2024 13:41:50 +0000 (21:41 +0800)]

block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()

Because bfq_put_cooperator() is always followed by
bfq_release_process_ref().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 9 Sep 2024 13:41:49 +0000 (21:41 +0800)]

block, bfq: fix procress reference leakage for bfqq in merge chain

Original state:

        Process 1       Process 2       Process 3       Process 4
         (BIC1)          (BIC2)          (BIC3)          (BIC4)
          Λ                |               |               |
           \--------------\ \-------------\ \-------------\|
                           V               V               V
          bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
    ref    0               1               2               4

After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:

Without the patch:

        Process 1       Process 2       Process 3       Process 4
         (BIC1)          (BIC2)          (BIC3)          (BIC4)
          Λ                |               |               |
           \------------------------------\ \-------------\|
                                           V               V
          bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
    ref    0               0               2               4

bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;

With the patch:

          -------------------------------------------
          |                                         |
        Process 1       Process 2       Process 3   |   Process 4
         (BIC1)          (BIC2)          (BIC3)     |    (BIC4)
                           |               |        |      |
                            \-------------\ \-------------\|
                                           V               V
          bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
    ref    0               0               2               4

IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.

Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.

Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 9 Sep 2024 13:41:48 +0000 (21:41 +0800)]

block, bfq: fix uaf for accessing waker_bfqq after splitting

After commit 42c306ed7233 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.

Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.

Fixes: 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Tue, 3 Sep 2024 13:51:49 +0000 (21:51 +0800)]

blk-throttle: support prioritized processing of metadata

Currently, blk-throttle handle all IO fifo, hence if data IO is
throttled and then meta IO is dispatched, the meta IO will have to wait
for the data IO, causing priority inversion problems.

This patch support to handle metadata first and then pay debt while
throttling data.

Test script: use cgroup v1 to throttle root cgroup, then create new
dir and file while write back is throttled

test() {
  mkdir /mnt/test/xxx
  touch /mnt/test/xxx/1
  sync /mnt/test/xxx
  sync /mnt/test/xxx
}

mkfs.ext4 -F /dev/nvme0n1 -E lazy_itable_init=0,lazy_journal_init=0
mount /dev/nvme0n1 /mnt/test

echo "259:0 $((1024*1024))" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
dd if=/dev/zero of=/mnt/test/foo1 bs=16M count=1 conv=fdatasync status=none &
sleep 4

time test
echo "259:0 0" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device

sleep 1
umount /dev/nvme0n1

Test result: time cost for creating new dir and file
before this patch:  14s
after this patch:   0.1s

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240903135149.271857-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Tue, 3 Sep 2024 13:51:48 +0000 (21:51 +0800)]

blk-throttle: remove last_low_overflow_time

last_low_overflow_time is not used anymore after commit bf20ab538c81
("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW").

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240903135149.271857-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Mikhail Lobanov [Mon, 9 Sep 2024 13:37:36 +0000 (09:37 -0400)]

drbd: Add NULL check for net_conf to prevent dereference in state validation

If the net_conf pointer is NULL and the code attempts to access its
fields without a check, it will lead to a null pointer dereference.
Add a NULL check before dereferencing the pointer.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 44ed167da748 ("drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf")
Cc: stable@vger.kernel.org
Signed-off-by: Mikhail Lobanov <m.lobanov@rosalinux.ru>
Link: https://lore.kernel.org/r/20240909133740.84297-1-m.lobanov@rosalinux.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Keith Busch [Fri, 6 Sep 2024 19:45:40 +0000 (12:45 -0700)]

blk-mq: add missing unplug trace event

The single-queue optimized list flush doesn't have an unplug trace event
to pair with the plug event. Add one.

In the unlikely event an error occurs and falls back to the less
optimized plug flush path, it's possible a 2nd unplug trace event will
be logged, but it will show the remainig count that weren't previously
handled.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240906194540.3719642-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Li Zetao [Sat, 7 Sep 2024 03:40:46 +0000 (11:40 +0800)]

mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()

Since the debugfs_create_dir() never returns a null pointer, checking
the return value for a null pointer is redundant. Since
debugfs_create_file() can deal with a ERR_PTR() style pointer, drop
the check. Since mtip_hw_debugfs_init does not pay attention to the
return value, its return type can be changed to void.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20240907034046.3595268-1-lizetao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Jens Axboe [Fri, 6 Sep 2024 20:43:16 +0000 (14:43 -0600)]

Merge tag 'md-6.12-20240906' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.12/block

Pull MD updates from Song:

"This patch, by Xiao Ni, adds a sysfs entry 'new_level'."

* tag 'md-6.12-20240906' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: Add new_level sysfs interface

commit | commitdiff | tree

Jens Axboe [Fri, 6 Sep 2024 20:42:33 +0000 (14:42 -0600)]

Merge tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme into for-6.12/block

Pull NVMe updates from Keith:

"nvme updates for Linux 6.12

- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)"

* tag 'nvme-6.12-2024-09-06' of git://git.infradead.org/nvme:
  nvme: fix metadata handling in nvme-passthrough
  nvme: rename apptag and appmask to lbat and lbatm
  nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data
  nvme-target: do not check authentication status for admin commands twice
  nvmet-auth: allow to clear DH-HMAC-CHAP keys
  nvme-sysfs: add 'tls_keyring' attribute
  nvme-sysfs: add 'tls_configured_key' sysfs attribute
  nvme: split off TLS sysfs attributes into a separate group
  nvme: add a newline to the 'tls_key' sysfs attribute
  nvme-tcp: check for invalidated or revoked key
  nvme-tcp: sanitize TLS key handling
  nvme-keyring: restrict match length for version '1' identifiers
  nvme_core: scan namespaces asynchronously

commit | commitdiff | tree

Xiao Ni [Wed, 4 Sep 2024 23:54:53 +0000 (07:54 +0800)]

md: Add new_level sysfs interface

Now reshape supports two ways: with backup file or without backup file.
For the situation without backup file, it needs to change data offset.
It doesn't need systemd service mdadm-grow-continue. So it can finish
the reshape job in one process environment. It can know the new level
from mdadm --grow command and can change to new level after reshape
finishes.

For the situation with backup file, it needs systemd service
mdadm-grow-continue to monitor reshape progress. So there are two process
envolved. One is mdadm --grow command whick kicks off reshape and wakes
up mdadm-grow-continue service. The second process is the service, which
doesn't know the new level from the first process.

In kernel space mddev->new_level is used to record the new level when
doing reshape. This patch adds a new interface to help mdadm update
new_level and sync it to metadata. Then mdadm-grow-continue can read the
right new_level.

Commit log revised by Song Liu. Please refer to the link for more details.

Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20240904235453.99120-1-xni@redhat.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Fri, 6 Sep 2024 14:14:45 +0000 (16:14 +0200)]

zram: Shrink zram_table_entry::flags.

The zram_table_entry::flags member is of type long and uses 8 bytes on a
64bit architecture. With a PAGE_SIZE of 256KiB we have PAGE_SHIFT of 18
which in turn leads to __NR_ZRAM_PAGEFLAGS = 27. This still fits in an
ordinary integer.
By reducing the size of `flags' to four bytes, the size of the struct
goes back to 16 bytes. The padding between the lock and ac_time (if
enabled) is also gone.

Make zram_table_entry::flags an unsigned int and update the build test
to reflect the change.

Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-4-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Sebastian Andrzej Siewior [Fri, 6 Sep 2024 14:14:44 +0000 (16:14 +0200)]

zram: Remove ZRAM_LOCK

The ZRAM_LOCK was used for locking and after the addition of spinlock_t
the bit set and cleared but there no reader of it.

Remove the ZRAM_LOCK bit.

Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Mike Galbraith [Fri, 6 Sep 2024 14:14:43 +0000 (16:14 +0200)]

zram: Replace bit spinlocks with a spinlock_t.

The bit spinlock disables preemption. The spinlock_t lock becomes a sleeping
lock on PREEMPT_RT and it can not be acquired in this context. In this locked
section, zs_free() acquires a zs_pool::lock, and there is access to
zram::wb_limit_lock.

Add a spinlock_t for locking. Keep the set/ clear ZRAM_LOCK bit after
the lock has been acquired/ dropped. The size of struct zram_table_entry
increases by 4 bytes due to lock and additional 4 bytes padding with
CONFIG_ZRAM_TRACK_ENTRY_ACTIME enabled.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20240906141520.730009-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Wouter Verhelst [Mon, 12 Aug 2024 13:20:42 +0000 (15:20 +0200)]

nbd: correct the maximum value for discard sectors

The version of the NBD protocol implemented by the kernel driver
currently has a 32 bit field for length values. As the NBD protocol uses
bytes as a unit of length, length values larger than 2^32 bytes cannot
be expressed.

Update the max_hw_discard_sectors field to match that.

Signed-off-by: Wouter Verhelst <w@uter.be>
Fixes: 268283244c0f ("nbd: use the atomic queue limits API in nbd_set_size")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Eric Blake <eblake@redhat.Com>
Link: https://lore.kernel.org/r/20240812133032.115134-8-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Wouter Verhelst [Mon, 12 Aug 2024 13:20:40 +0000 (15:20 +0200)]

nbd: nbd_bg_flags_show: add NBD_FLAG_ROTATIONAL

Also handle NBD_FLAG_ROTATIONAL in our debug helper function

Signed-off-by: Wouter Verhelst <w@uter.be>
Cc: Eric Blake <eblake@redhat.Com>
Link: https://lore.kernel.org/r/20240812133032.115134-6-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Wouter Verhelst [Mon, 12 Aug 2024 13:20:37 +0000 (15:20 +0200)]

nbd: implement the WRITE_ZEROES command

The NBD protocol defines a message for zeroing out a region of an export

Add support to the kernel driver for that message.

Signed-off-by: Wouter Verhelst <w@uter.be>
Cc: Eric Blake <eblake@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240812133032.115134-3-w@uter.be
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Fri, 6 Sep 2024 10:21:53 +0000 (18:21 +0800)]

MAINTAINERS: Move the BFQ io scheduler to Odd Fixes state

BFQ has been lacking active maintenance for approximately two years, and it
was recently transitioned to the Orphan state. However, there are still
many users, I have decided to step forward and assume the role of
maintainer to ensure continued support and development.

While I may not be the one with the most extensive knowledge of BFQ's
internals, I have been actively involved in its development since 2021.
Moreover, our team continues to rigorously test BFQ in downstream kernels,
ensuring it's stability and performance. Despite my confidence to maintain
BFQ, I believe it is prudent to classify its state as "Odd Fixes" to
accurately reflect my relatively new position as the maintainer.

By assuming this responsibility, I am committed to providing the necessary
support and addressing any issues that may arise with BFQ. As time
progresses, we will reassess the situation and determine the appropriate
state.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240906102153.612997-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Jens Axboe [Thu, 5 Sep 2024 19:47:06 +0000 (13:47 -0600)]

Merge tag 'md-6.12-20240905' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.12/block

Pull MD fix from Song:

"This patch, from Mateusz Kusiak, improves the information reported in
/proc/mdstat."

* tag 'md-6.12-20240905' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: Report failed arrays as broken in mdstat

commit | commitdiff | tree

Mateusz Kusiak [Tue, 3 Sep 2024 14:29:49 +0000 (16:29 +0200)]

md: Report failed arrays as broken in mdstat

Depending on if array has personality, it is either reported as active or
inactive. This patch adds third status "broken" for arrays with
personality that became inoperative. The reason is end users tend to
assume that "active" indicates array is operational.

Add "broken" state for inoperative arrays with personality and refactor
the code.

Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com>
Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Alexey Dobriyan [Tue, 3 Sep 2024 19:48:19 +0000 (22:48 +0300)]

block: fix integer overflow in BLKSECDISCARD

I independently rediscovered

commit 22d24a544b0d49bbcbd61c8c0eaf77d3c9297155
block: fix overflow in blk_ioctl_discard()

but for secure erase.

Same problem:

uint64_t r[2] = {512, 18446744073709551104ULL};
ioctl(fd, BLKSECDISCARD, r);

will enter near infinite loop inside blkdev_issue_secure_erase():

a.out: attempt to access beyond end of device
loop0: rw=5, sector=3399043073, nr_sectors = 1024 limit=2048
bio_check_eod: 3286214 callbacks suppressed

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://lore.kernel.org/r/9e64057f-650a-46d1-b9f7-34af391536ef@p183
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Alvaro Parker [Tue, 3 Sep 2024 17:22:14 +0000 (13:22 -0400)]

block: fix comment to use set_current_state

The explanatory comment used `set_task_state` instead of
`set_current_state` which is the function actually used in the code.

Signed-off-by: Alvaro Parker <alparkerdf@gmail.com>
Link: https://lore.kernel.org/r/20240903172214.520086-1-alparkerdf@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Jens Axboe [Tue, 3 Sep 2024 15:52:43 +0000 (09:52 -0600)]

MAINTAINERS: move the BFQ io scheduler to orphan state

Nobody is maintaining this code, and it just falls under the umbrella
of block layer code. But at least mark it as such, in case anyone wants
to care more deeply about it and assume the responsibility of doing so.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 2 Sep 2024 13:03:29 +0000 (21:03 +0800)]

block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move()

Instead of open coding it, there are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 2 Sep 2024 13:03:28 +0000 (21:03 +0800)]

block, bfq: don't break merge chain in bfq_split_bfqq()

Consider the following scenario:

    Process 1       Process 2       Process 3       Process 4
     (BIC1)          (BIC2)          (BIC3)          (BIC4)
      Λ               |               |                |
       \-------------\ \-------------\ \--------------\|
                      V               V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref    0              1               2                4

If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.

Expected result: caller will allocate a new bfqq for BIC1

    Process 1       Process 2       Process 3       Process 4
     (BIC1)          (BIC2)          (BIC3)          (BIC4)
                      |               |                |
                       \-------------\ \--------------\|
                                      V                V
      bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref    0              0               1                3

Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.

Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 2 Sep 2024 13:03:27 +0000 (21:03 +0800)]

block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()

Consider the following merge chain:

Process 1       Process 2       Process 3 Process 4
(BIC1)          (BIC2)          (BIC3) (BIC4)
  Λ                |               |               |
   \--------------\ \-------------\ \-------------\|
                   V               V    V
  bfqq1--------->bfqq2---------->bfqq3----------->bfqq4

IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.

Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().

Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Mon, 2 Sep 2024 13:03:26 +0000 (21:03 +0800)]

block, bfq: fix possible UAF for bfqq->bic with merge chain

1) initial state, three tasks:

Process 1       Process 2 Process 3
(BIC1)          (BIC2) (BIC3)
  |  Λ            |  Λ   |  Λ
  |  |            |  |   |  |
  V  |            V  |   V  |
  bfqq1           bfqq2   bfqq3
process ref:    1     1     1

2) bfqq1 merged to bfqq2:

Process 1       Process 2 Process 3
(BIC1)          (BIC2) (BIC3)
  |               |   |  Λ
  \--------------\|   |  |
                  V   V  |
  bfqq1--------->bfqq2   bfqq3
process ref:    0     2     1

3) bfqq2 merged to bfqq3:

Process 1       Process 2 Process 3
(BIC1)          (BIC2) (BIC3)
here -> Λ                |   |
  \--------------\ \-------------\|
                  V   V
  bfqq1--------->bfqq2---------->bfqq3
process ref:    0     1     3

In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then
get bfqq3 through merge chain, and finially handle IO by bfqq3.
Howerver, current code will think bfqq2 is owned by BIC1, like initial
state, and set bfqq2->bic to BIC1.

bfq_insert_request
-> by Process 1
bfqq = bfq_init_rq(rq)
  bfqq = bfq_get_bfqq_handle_split
   bfqq = bic_to_bfqq
   -> get bfqq2 from BIC1
bfqq->ref++
rq->elv.priv[0] = bic
rq->elv.priv[1] = bfqq
if (bfqq_process_refs(bfqq) == 1)
  bfqq->bic = bic
  -> record BIC1 to bfqq2

  __bfq_insert_request
   new_bfqq = bfq_setup_cooperator
   -> get bfqq3 from bfqq2->new_bfqq
   bfqq_request_freed(bfqq)
   new_bfqq->ref++
   rq->elv.priv[1] = new_bfqq
   -> handle IO by bfqq3

Fix the problem by checking bfqq is from merge chain fist. And this
might fix a following problem reported by our syzkaller(unreproducible):

==================================================================
BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595

CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G             L     6.6.0-07439-gba2303cacfda #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x10d/0x610 mm/kasan/report.c:475
kasan_report+0x8e/0xc0 mm/kasan/report.c:588
bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
bfq_init_rq block/bfq-iosched.c:6876 [inline]
bfq_insert_request block/bfq-iosched.c:6254 [inline]
bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
</TASK>

Allocated by task 20776:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:763 [inline]
slab_alloc_node mm/slub.c:3458 [inline]
kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
ioc_create_icq block/blk-ioc.c:370 [inline]
ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
__blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
__submit_bio+0x331/0x6f0 block/blk-core.c:624
__submit_bio_noacct_mq block/blk-core.c:703 [inline]
submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

Freed by task 946:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1815 [inline]
slab_free_freelist_hook mm/slub.c:1841 [inline]
slab_free mm/slub.c:3786 [inline]
kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
__do_softirq+0x1b0/0x7a2 kernel/softirq.c:553

Last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

Second to last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305

The buggy address belongs to the object at ffff888123839d68
which belongs to the cache bfq_io_cq of size 1360
The buggy address is located 336 bytes inside of
freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)

The buggy address belongs to the physical page:
page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Ming Lei [Fri, 30 Aug 2024 03:41:45 +0000 (11:41 +0800)]

nbd: fix race between timeout and normal completion

If request timetout is handled by nbd_requeue_cmd(), normal completion
has to be stopped for avoiding to complete this requeued request, other
use-after-free can be triggered.

Fix the race by clearing NBD_CMD_INFLIGHT in nbd_requeue_cmd(), meantime
make sure that cmd->lock is grabbed for clearing the flag and the
requeue.

Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Fixes: 2895f1831e91 ("nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240830034145.1827742-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Puranjay Mohan [Thu, 29 Aug 2024 13:32:17 +0000 (13:32 +0000)]

nvme: fix metadata handling in nvme-passthrough

On an NVMe namespace that does not support metadata, it is possible to
send an IO command with metadata through io-passthru. This allows issues
like [1] to trigger in the completion code path.
nvme_map_user_request() doesn't check if the namespace supports metadata
before sending it forward. It also allows admin commands with metadata to
be processed as it ignores metadata when bdev == NULL and may report
success.

Reject an IO command with metadata when the NVMe namespace doesn't
support it and reject an admin command if it has metadata.

[1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Puranjay Mohan <pjy@amazon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Jens Axboe [Thu, 29 Aug 2024 19:37:43 +0000 (13:37 -0600)]

Merge tag 'md-6.12-20240829' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.12/block

Pull MD updates from Song:

"Major changes in this set are:

1. md-bitmap refactoring, by Yu Kuai;
2. raid5 performance optimization, by Artur Paszkiewicz;
3. Other small fixes, by Yu Kuai and Chen Ni."

* tag 'md-6.12-20240829' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (49 commits)
  md/raid5: rename wait_for_overlap to wait_for_reshape
  md/raid5: only add to wq if reshape is in progress
  md/raid5: use wait_on_bit() for R5_Overlap
  md: Remove flush handling
  md/md-bitmap: make in memory structure internal
  md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
  md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
  md/md-bitmap: merge md_bitmap_free() into bitmap_operations
  md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
  md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
  md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
  md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
  md/md-bitmap: pass in mddev directly for md_bitmap_resize()
  md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
  md/md-bitmap: merge bitmap_unplug() into bitmap_operations
  md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
  md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
  md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
  md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
  md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
  ...

commit | commitdiff | tree

Song Liu [Thu, 29 Aug 2024 18:22:13 +0000 (11:22 -0700)]

Merge branch 'md-6.12-raid5-opt' into md-6.12

From Artur:

The wait_for_overlap wait queue is currently used in two cases, which
are not really related:
- waiting for actual overlapping bios, which uses R5_Overlap bit,
- waiting for events related to reshape.

Handling every write request in raid5_make_request() involves adding to
and removing from this wait queue, which uses a spinlock. With fast
storage and multiple submitting threads the contention on this lock is
noticeable.

This patch series aims to resolve this by separating the two cases
mentioned above and using this wait queue only when reshape is in
progress.

The results when testing 4k random writes on raid5 with null_blk
(8 jobs, qd=64, group_thread_cnt=8):
before: 463k IOPS
after:  523k IOPS

The improvement is not huge with this series alone but it is just one of
the bottlenecks. When applied onto some other changes I'm working on, it
allowed to go from 845k IOPS to 975k IOPS on the same test.

* md-6.12-raid5-opt:
  md/raid5: rename wait_for_overlap to wait_for_reshape
  md/raid5: only add to wq if reshape is in progress
  md/raid5: use wait_on_bit() for R5_Overlap

Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Artur Paszkiewicz [Tue, 27 Aug 2024 15:35:36 +0000 (17:35 +0200)]

md/raid5: rename wait_for_overlap to wait_for_reshape

The only remaining uses of wait_for_overlap are related to reshape so
rename it accordingly.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20240827153536.6743-4-artur.paszkiewicz@intel.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Artur Paszkiewicz [Tue, 27 Aug 2024 15:35:35 +0000 (17:35 +0200)]

md/raid5: only add to wq if reshape is in progress

Now that actual overlaps are not handled on the wait_for_overlap wq
anymore, the remaining cases when we wait on this wq are limited to
reshape. If reshape is not in progress, don't add to the wq in
raid5_make_request() because add_wait_queue() / remove_wait_queue()
operations take a spinlock and cause noticeable contention when multiple
threads are submitting requests to the mddev.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20240827153536.6743-3-artur.paszkiewicz@intel.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Artur Paszkiewicz [Tue, 27 Aug 2024 15:35:34 +0000 (17:35 +0200)]

md/raid5: use wait_on_bit() for R5_Overlap

Convert uses of wait_for_overlap wait queue with R5_Overlap bit to
wait_on_bit() / wake_up_bit().

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20240827153536.6743-2-artur.paszkiewicz@intel.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Christoph Hellwig [Mon, 26 Aug 2024 17:37:57 +0000 (19:37 +0200)]

block: don't use bio_split_rw on misc operations

bio_split_rw is designed to split read and write bios with a payload.
Currently it is called by __bio_split_to_limits for all operations not
explicitly list, which works because bio_may_need_split explicitly checks
for bi_vcnt == 1 and thus skips the bypass if there is no payload and
bio_for_each_bvec loop will never execute it's body if bi_size is 0.

But all this is hard to understand, fragile and wasted pointless cycles.
Switch __bio_split_to_limits to only call bio_split_rw for READ and
WRITE command and don't attempt any kind split for operation that do not
require splitting.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Christoph Hellwig [Mon, 26 Aug 2024 17:37:56 +0000 (19:37 +0200)]

block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limits

Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in
__bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND
bios do not adhere to the soft max_limits value but instead use their
own capped version of max_hw_sectors, leading to incorrect splits that
later blow up in bio_split.

We still need the bio_split_rw logic to count nr_segs for blk-mq code,
so add a new wrapper that passes in the right limit, and turns any bio
that would need a split into an error as an additional debugging aid.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Christoph Hellwig [Mon, 26 Aug 2024 17:37:55 +0000 (19:37 +0200)]

block: constify the lim argument to queue_limits_max_zone_append_sectors

queue_limits_max_zone_append_sectors doesn't change the lim argument,
so mark it as const.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Christoph Hellwig [Mon, 26 Aug 2024 17:37:54 +0000 (19:37 +0200)]

block: rework bio splitting

The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.

Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead have the various bio_split_* helpers directly
submit the potentially split bios.

To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Song Liu [Wed, 28 Aug 2024 21:55:57 +0000 (14:55 -0700)]

Merge branch 'md-6.12-bitmap' into md-6.12

From Yu Kuai (with minor changes by Song Liu):

The background is that currently bitmap is using a global spin_lock,
causing lock contention and huge IO performance degradation for all raid
levels.

However, it's impossible to implement a new lock free bitmap with
current situation that md-bitmap exposes the internal implementation
with lots of exported apis. Hence bitmap_operations is invented, to
describe bitmap core implementation, and a new bitmap can be introduced
with a new bitmap_operations, we only need to switch to the new one
during initialization.

And with this we can build bitmap as kernel module, but that's not
our concern for now.

This version was tested with mdadm tests and lvm2 tests. This set does
not introduce new errors in these tests.

* md-6.12-bitmap: (42 commits)
  md/md-bitmap: make in memory structure internal
  md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations
  md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations
  md/md-bitmap: merge md_bitmap_free() into bitmap_operations
  md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations
  md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.
  md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations
  md/md-bitmap: merge md_bitmap_resize() into bitmap_operations
  md/md-bitmap: pass in mddev directly for md_bitmap_resize()
  md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations
  md/md-bitmap: merge bitmap_unplug() into bitmap_operations
  md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()
  md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations
  md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations
  md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations
  md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations
  md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()
  md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations
  md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations
  md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations
  ...

Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Md Haris Iqbal [Fri, 9 Aug 2024 13:53:46 +0000 (15:53 +0200)]

block/rnbd-srv: Add sanity check and remove redundant assignment

The bio->bi_iter.bi_size is updated when bio_add_page() is called. So we
do not need to assign msg->bi_size again to it, since its redudant and
can also be harmful. Instead we can use it to add a sanity check, which
checks the locally calculated bi_size, with the one sent in msg.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://lore.kernel.org/r/20240809135346.978320-1-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Yu Kuai [Tue, 27 Aug 2024 11:06:16 +0000 (19:06 +0800)]

md: Remove flush handling

For flush request, md has a special flush handling to merge concurrent
flush request into single one, however, the whole mechanism is based on
a disk level spin_lock 'mddev->lock'. And fsync can be called quite
often in some user cases, for consequence, spin lock from IO fast path can
cause performance degradation.

Fortunately, the block layer already has flush handling to merge
concurrent flush request, and it only acquires hctx level spin lock. (see
details in blk-flush.c)

This patch removes the flush handling in md, and converts to use general
block layer flush handling in underlying disks.

Flush test for 4 nvme raid10:
start 128 threads to do fsync 100000 times, on arm64, see how long it
takes.

Test script:
void* thread_func(void* arg) {
    int fd = *(int*)arg;
    for (int i = 0; i < FSYNC_COUNT; i++) {
        fsync(fd);
    }
    return NULL;
}

int main() {
    int fd = open("/dev/md0", O_RDWR);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    pthread_t threads[THREADS];
    struct timeval start, end;

    gettimeofday(&start, NULL);

    for (int i = 0; i < THREADS; i++) {
        pthread_create(&threads[i], NULL, thread_func, &fd);
    }

    for (int i = 0; i < THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    gettimeofday(&end, NULL);

    close(fd);

    long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec);
    printf("Elapsed time: %lld microseconds\n", elapsed);

    return 0;
}

Test result: about 10 times faster:
Before this patch: 50943374 microseconds
After this patch:  5096347  microseconds

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:52 +0000 (15:44 +0800)]

md/md-bitmap: make in memory structure internal

Now that struct bitmap_page and bitmap is not used externally anymore,
move them from md-bitmap.h to md-bitmap.c (expect that dm-raid is still
using define marco 'COUNTER_MAX').

Also fix some checkpatch warnings.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-43-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:51 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-42-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:50 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-41-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:49 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_free() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
o invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-40-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:48 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations

o that the implementation won't be exposed, and it'll be possible
o invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-39-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:47 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation.

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-38-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:46 +0000 (15:44 +0800)]

md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-37-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:45 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_resize() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-36-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:44 +0000 (15:44 +0800)]

md/md-bitmap: pass in mddev directly for md_bitmap_resize()

And move the condition "if (mddev->bitmap)" into md_bitmap_resize() as
well, on the one hand make code cleaner, on the other hand try not to
access bitmap directly.

Since we are here, also change the parameter 'init' from int to bool.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-35-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:43 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-34-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:42 +0000 (15:44 +0800)]

md/md-bitmap: merge bitmap_unplug() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-33-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:41 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug()

Add a parameter 'bool sync' to distinguish them, and
md_bitmap_unplug_async() won't be exported anymore, hence
bitmap_operations only need one op to cover them.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:40 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-31-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:39 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-30-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:38 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-29-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:37 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-28-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:36 +0000 (15:44 +0800)]

md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync()

For internal callers, aborted are always set to false, while for
external callers, aborted are always set to true.

Hence there is no need to always pass in true for exported api.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-27-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:35 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.

Also fix lots of code style.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-26-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:34 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible. And change the type
of 'success' and 'behind' from int to bool.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-25-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:33 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible. And change the type
of 'behind' from int to bool.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-24-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:32 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.

And while we're here, also fix coding style for bitmap_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:31 +0000 (15:44 +0800)]

md/md-bitmap: merge bitmap_write_all() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Also change the parameter from bitmap to mddev, to avoid access
bitmap outside md-bitmap.c as much as possible.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:30 +0000 (15:44 +0800)]

md/md-bitmap: remove md_bitmap_setallbits()

md_bitmap_setallbits() is not used, hence can be removed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-21-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:29 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_status() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:28 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:27 +0000 (15:44 +0800)]

md/md-bitmap: make md_bitmap_print_sb() internal

md_bitmap_print_sb() is only used inside md-bitmap.c, hence make it
static, also rename it to bitmap_print_sb.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-18-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:26 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_flush() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-17-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:25 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-16-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:24 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_load() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-15-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:23 +0000 (15:44 +0800)]

md/md-bitmap: merge md_bitmap_create() into bitmap_operations

So that the implementation won't be exposed, and it'll be possible
to invent a new bitmap by replacing bitmap_operations.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-14-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:22 +0000 (15:44 +0800)]

md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load()

Other than internal api get_bitmap_from_slot(), all other places will
set returned bitmap to mddev->bitmap. So move the setting of
mddev->bitmap into md_bitmap_create() to simplify code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:21 +0000 (15:44 +0800)]

md/md-bitmap: introduce struct bitmap_operations

The structure is empty for now, and will be used in later patches to
merge in bitmap operations, so that bitmap implementation won't be
exposed.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:20 +0000 (15:44 +0800)]

md/md-bitmap: add a new helper md_bitmap_set_pages()

Currently md-cluster will set bitmap->counts.pages directly, add a
helper to do this to avoid dereferencing bitmap directly.

Noted that after this patch bitmap is not dereferenced directly anymore
and following patches will move the structure inside md-bitmap.c.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-11-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:19 +0000 (15:44 +0800)]

md/md-cluster: use helper md_bitmap_get_stats() to get pages in resize_bitmaps()

Use the existed helper instead of open coding it, avoid dereferencing
bitmap directly to prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-10-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:18 +0000 (15:44 +0800)]

md/md-bitmap: add 'behind_writes' and 'behind_wait' into struct md_bitmap_stats

There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.

Also fix following checkpatch warning by using wq_has_sleeper().

WARNING: waitqueue_active without comment

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-9-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:17 +0000 (15:44 +0800)]

md/md-bitmap: add 'file_pages' into struct md_bitmap_stats

There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:16 +0000 (15:44 +0800)]

md/md-bitmap: add 'sync_size' into struct md_bitmap_stats

To avoid dereferencing bitmap directly in md-cluster to prepare
inventing a new bitmap.

BTW, also fix following checkpatch warnings:

WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-7-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:15 +0000 (15:44 +0800)]

md/md-cluster: fix spares warnings for __le64

drivers/md/md-cluster.c:1220:22: warning: incorrect type in assignment (different base types)
drivers/md/md-cluster.c:1220:22:    expected unsigned long my_sync_size
drivers/md/md-cluster.c:1220:22:    got restricted __le64 [usertype] sync_size
drivers/md/md-cluster.c:1252:35: warning: incorrect type in assignment (different base types)
drivers/md/md-cluster.c:1252:35:    expected unsigned long sync_size
drivers/md/md-cluster.c:1252:35:    got restricted __le64 [usertype] sync_size
drivers/md/md-cluster.c:1253:41: warning: restricted __le64 degrades to integer

Fix the warnings by using le64_to_cpu() to convet __le64 to integer.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:14 +0000 (15:44 +0800)]

md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats

Also add a new helper to get events_cleared to avoid dereferencing
bitmap directly to prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:13 +0000 (15:44 +0800)]

md: use new helper md_bitmap_get_stats() in update_array_info()

There are no functional changes, avoid dereferencing bitmap directly to
prepare inventing a new bitmap.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-4-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:12 +0000 (15:44 +0800)]

md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats()

There are no functional changes, and the new helper will be used in
multiple places in following patches to avoid dereferencing bitmap
directly.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Mon, 26 Aug 2024 07:44:11 +0000 (15:44 +0800)]

md/raid1: use md_bitmap_wait_behind_writes() in raid1_read_request()

Use the existed helper instead of open coding it to make the code cleaner.
There are no functional changes, and also avoid dereferencing bitmap
directly to prepare inventing a new bitmap.

Noted that this patch also export md_bitmap_wait_behind_writes(), which
is necessary for now, and the exported api will be removed in following
patches to convert bitmap apis into ops.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240826074452.1490072-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Thu, 1 Aug 2024 13:30:08 +0000 (21:30 +0800)]

md/raid1: Clean up local variable 'b' from raid1_read_request()

The local variable will only be used onced, in the error path that
read_balance() failed to find a valid rdev to read. Since now the rdev
is ensured can't be removed from conf while IO is still pending,
remove the local variable and dereference rdev directly.

Since we're here, also remove an extra empty line, and unnecessary
type conversion from sector_t(u64) to unsigned long long.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801133008.459998-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yu Kuai [Thu, 1 Aug 2024 12:47:46 +0000 (20:47 +0800)]

md: Don't flush sync_work in md_write_start()

Because flush sync_work may trigger mddev_suspend() if there are spares,
and this should never be done in IO path because mddev_suspend() is used
to wait for IO.

This problem is found by code review.

Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: stable@vger.kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>

commit | commitdiff | tree

Yang Ruibin [Tue, 27 Aug 2024 02:27:40 +0000 (10:27 +0800)]

pktcdvd: remove unnecessary debugfs_create_dir() error check

Remove the debugfs_create_dir() error check. It's safe to pass in error
pointers to the debugfs API, hence the user isn't supposed to include
error checking of the return values.

Signed-off-by: Yang Ruibin <11162571@vivo.com>
Link: https://lore.kernel.org/r/20240827022741.3410294-1-11162571@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Konstantin Ovsepian [Thu, 22 Aug 2024 15:41:36 +0000 (08:41 -0700)]

blk_iocost: fix more out of bound shifts

Recently running UBSAN caught few out of bound shifts in the
ioc_forgive_debts() function:

UBSAN: shift-out-of-bounds in block/blk-iocost.c:2142:38
shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long
long')
...
UBSAN: shift-out-of-bounds in block/blk-iocost.c:2144:30
shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long
long')
...
Call Trace:
<IRQ>
dump_stack_lvl+0xca/0x130
__ubsan_handle_shift_out_of_bounds+0x22c/0x280
? __lock_acquire+0x6441/0x7c10
ioc_timer_fn+0x6cec/0x7750
? blk_iocost_init+0x720/0x720
? call_timer_fn+0x5d/0x470
call_timer_fn+0xfa/0x470
? blk_iocost_init+0x720/0x720
__run_timer_base+0x519/0x700
...

Actual impact of this issue was not identified but I propose to fix the
undefined behaviour.
The proposed fix to prevent those out of bound shifts consist of
precalculating exponent before using it the shift operations by taking
min value from the actual exponent and maximum possible number of bits.

Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Konstantin Ovsepian <ovs@ovs.to>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240822154137.2627818-1-ovs@ovs.to
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Anuj Gupta [Mon, 26 Aug 2024 10:39:43 +0000 (16:09 +0530)]

nvme: rename apptag and appmask to lbat and lbatm

Rename apptag and appmask to lbat and lbatm so that it matches the field
names used in NVMe spec.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Niklas Cassel [Thu, 15 Aug 2024 20:11:31 +0000 (22:11 +0200)]

nvme-rdma: send cntlid in the RDMA_CM_REQUEST Private Data

When sending a RDMA_CM_REQUEST, the NVMe RDMA Transport Specification
allows you to populate the cntlid field in the RDMA_CM_REQUEST Private
Data.

The cntlid is returned by the target on completion of the first
RDMA_CM_REQUEST command (which creates the admin queue).

The cntlid field can then be populated by the host when the I/O queues
are created (using additional RDMA_CM_REQUEST commands), such that the
target can perform extra validation for additional RDMA_CM_REQUEST
commands.

This additional error code and error message is also added, such that
nvme_rdma_cm_msg() will display the proper error message if the target
fails the RDMA_CM_REQUEST command because of this extra validation.

Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Christophe JAILLET [Sun, 25 Aug 2024 16:22:23 +0000 (18:22 +0200)]

drbd: Remove an unused field in struct drbd_device

'next_barrier_nr' is not used in this driver. Remove it.

It was already part of the original commit b411b3637fa7 ("The DRBD driver")
Apparently, it has never been used.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/d5322ef88d1d6f544963ee277cb0b427da8dceef.1724602922.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:26 +0000 (14:02 +0200)]

nvme-target: do not check authentication status for admin commands twice

nvmet_check_ctrl_status() checks the authentication status, so
we don't need to do that prior to calling it.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:25 +0000 (14:02 +0200)]

nvmet-auth: allow to clear DH-HMAC-CHAP keys

As we can set DH-HMAC-CHAP keys, we should also be
able to unset them.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:24 +0000 (14:02 +0200)]

nvme-sysfs: add 'tls_keyring' attribute

Add a 'tls_keyring' attribute to display the contents of the
--keyring option from the connect string. Adding this attribute
allows us to recreate the original connect string from sysfs
settings.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:23 +0000 (14:02 +0200)]

nvme-sysfs: add 'tls_configured_key' sysfs attribute

There is a difference between the negotiated TLS key (which is
always present for a TLS encrypted connection) and the configured
TLS key (which is specified with the --tls_key command line option).
To differentate between these two add a new sysfs attribute
'tls_configured_key' to hold the specified on the command line.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:22 +0000 (14:02 +0200)]

nvme: split off TLS sysfs attributes into a separate group

Split off TLS sysfs attributes into a separate group to improve
readability and to keep all TLS related handling in one section.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:21 +0000 (14:02 +0200)]

nvme: add a newline to the 'tls_key' sysfs attribute

Print a newline for easier userspace handling.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:20 +0000 (14:02 +0200)]

nvme-tcp: check for invalidated or revoked key

key_lookup() will always return a key, even if that key is revoked
or invalidated. So check for invalid keys before continuing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:19 +0000 (14:02 +0200)]

nvme-tcp: sanitize TLS key handling

There is a difference between TLS configured (ie the user has
provisioned/requested a key) and TLS enabled (ie the connection
is encrypted with TLS). This becomes important for secure concatenation,
where the initial authentication is run on an unencrypted connection
(ie with TLS configured, but not enabled), and then the queue is reset to
run over TLS (ie TLS configured _and_ enabled).
So to differentiate between those two states store the generated
key in opts->tls_key (as we're using the same TLS key for all queues),
the key serial of the resulting TLS handshake in ctrl->tls_pskid
(to signal that TLS on the admin queue is enabled), and a simple
flag for the queues to indicated that TLS has been enabled.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

commit | commitdiff | tree

Hannes Reinecke [Mon, 22 Jul 2024 12:02:18 +0000 (14:02 +0200)]

nvme-keyring: restrict match length for version '1' identifiers

TP8018 introduced a new TLS PSK identifier version (version 1), which appended
a PSK hash value to the existing identifier (cf NVMe TCP specification v1.1,
section 3.6.1.3 'TLS PSK and PSK Identity Derivation').
An original (version 0) identifier has the form:

NVMe0<type><hmac> <hostnqn> <subsysnqn>

and a version 1 identifier has the form:

NVMe1<type><hmac> <hostnqn> <subsysnqn> <hash>

This patch modifies the lookup algorthm to compare only the first part
of the identifier (excluding the hash value) to handle both version 0 and
version 1 identifiers.
And the spec declares 'version 0' identifiers obsolete, so the lookup
algorithm is modified to prever v1 identifiers.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

Unnamed repository; edit this file 'description' to name the repository.

RSS Atom