users/hch/xfs.git
3 weeks ago FOLD: keep the code for stale buffers in a single place in xfs_buf_item_release (shutdown-deadlock-split)
Christoph Hellwig [Fri, 23 May 2025 05:01:29 +0000 (07:01 +0200)]
FOLD: keep the code for stale buffers in a single place in xfs_buf_item_release

Keep the stale buffer handling together to make it easier to explain and
easier to follow.

3 weeks ago FOLD: drop the superfluous return at the end of xfs_buf_item_finish_stale
Christoph Hellwig [Fri, 23 May 2025 04:53:15 +0000 (06:53 +0200)]
FOLD: drop the superfluous return at the end of xfs_buf_item_finish_stale

Signed-off-by: Christoph Hellwig <hch@lst.de>
3 weeks ago xfs: fix unmount hang with unflushable inodes stuck in the AIL
Dave Chinner [Fri, 23 May 2025 04:47:07 +0000 (06:47 +0200)]
xfs: fix unmount hang with unflushable inodes stuck in the AIL

"The most effective debugging tool is still careful
thought, coupled with judiciously placed print statements."

- Brian Kernighan

I've been chasing an occasional unmount hang when running
check-parallel for a couple of months now:

[95964.140623] Call Trace:
[95964.144641]  __schedule+0x699/0xb70
[95964.154003]  schedule+0x64/0xd0
[95964.156851]  xfs_ail_push_all_sync+0x9b/0xf0
[95964.164816]  xfs_unmount_flush_inodes+0x41/0x70
[95964.168698]  xfs_unmountfs+0x7f/0x170
[95964.171846]  xfs_fs_put_super+0x3b/0x90
[95964.175216]  generic_shutdown_super+0x77/0x160
[95964.178060]  kill_block_super+0x1b/0x40
[95964.180553]  xfs_kill_sb+0x12/0x30
[95964.182796]  deactivate_locked_super+0x38/0x100
[95964.185735]  deactivate_super+0x41/0x50
[95964.188245]  cleanup_mnt+0x9f/0x160
[95964.190519]  __cleanup_mnt+0x12/0x20
[95964.192899]  task_work_run+0x89/0xb0
[95964.195221]  resume_user_mode_work+0x4f/0x60
[95964.197931]  syscall_exit_to_user_mode+0x76/0xb0
[95964.201003]  do_syscall_64+0x74/0x130

$ pstree -N mnt |grep umount
     |-check-parallel---nsexec---run_test.sh---753---umount

It always seems to be generic/753 that triggers this, and repeating
a quick group test run triggers it every 10-15 iterations. Hence it
generally triggers once every 30-40 minutes of test time. Just
running generic/753 by itself or concurrently with a limited group
of tests doesn't reproduce this issue at all.

Tracing on a hung system shows the AIL repeating every 50ms a log
force followed by an attempt to push pinned, aborted inodes from the
AIL (trimmed for brevity):

 xfs_log_force:   lsn 0x1c caller xfsaild+0x18e
 xfs_log_force:   lsn 0x0 caller xlog_cil_flush+0xbd
 xfs_log_force:   lsn 0x1c caller xfs_log_force+0x77
 xfs_ail_pinned:  lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_ail_pinned:  lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
 ....

The inode log items are marked as aborted, which means that either:

a) a transaction commit has occurred, seen an error or shutdown, and
called xfs_trans_free_items() to abort the items. This should happen
before any pinning of log items occurs.

or

b) a dirty transaction has been cancelled. This should also happen
before any pinning of log items occurs.

or

c) AIL insertion at journal IO completion is marked as aborted. In
this case, the log item is pinned by the CIL until journal IO
completes and hence needs to be unpinned. This is then done after
the ->iop_committed() callback is run, so the pin count should be
balanced correctly.

There are no other cases where XFS_LI_ABORTED is set for inodes, so
it's not at all obvious why the item is pinned in the AIL.  I
added tracing to indicate why the inode item push is returning a
XFS_ITEM_PINNED value:

 xfs_log_force:         lsn 0x5e caller xfsaild+0x18e
 xfs_log_force:         lsn 0x0 caller xlog_cil_flush+0xbd
 xfs_log_force:         lsn 0x5e caller xfs_log_force+0x77
 xfs_inode_push_stale:  ino 0xc4 count 0 pincount 0 iflags 0x86 caller xfsaild+0x432
 xfs_ail_pinned:        lip 0xffff8882a20d5900 lsn 1/40853 type XFS_LI_INODE flags IN_AIL|ABORTED
 xfs_inode_push_stale:  ino 0xc1 count 0 pincount 0 iflags 0x86 caller xfsaild+0x432
 xfs_ail_pinned:        lip 0xffff8882a20d5518 lsn 1/40853 type XFS_LI_INODE flags IN_AIL

The inode flags are XFS_ISTALE | XFS_IRECLAIMABLE | XFS_IFLUSHING,
and this means xfs_inode_push_item() is triggering this code:

        if (!bp || (ip->i_flags & XFS_ISTALE)) {
                /*
                 * Inode item/buffer is being aborted due to cluster
                 * buffer deletion. Trigger a log force to have that operation
                 * completed and items removed from the AIL before the next push
                 * attempt.
                 */
>>>>>>          trace_xfs_inode_push_stale(ip, _RET_IP_);
                return XFS_ITEM_PINNED;
        }

XFS_IFLUSHING is set, so there should have been a buffer IO
completion that cleared that and removed the inode from the AIL.
Inode cluster freeing marks the inode XFS_ISTALE | XFS_IFLUSHING,
so this is a valid state for the inode to be in.

The inode cluster buffer is not in the AIL, so these inodes should
have been removed from the AIL when the inode cluster buffer was
aborted and unpinned.

However, I'm unclear on how the XFS_LI_ABORTED state is getting set
- there are a couple of places this can occur, and both should be
triggering the inode cluster buffer to remove the attached inodes
from the AIL and drop them. More tracing....

... and that was unexpected:

    xfsaild/dm-3-1747912 [047] ...1.   604.344253: xfs_inode_push_pinned: dev 251:3 ino 0x2020082 count 0 pincount 10 iflags 0x4 caller xfsaild+0x432
    xfsaild/dm-3-1747912 [047] ...1.   604.344253: xfs_ail_pinned:       dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL
   kworker/u259:14-391   [019] .....   604.366776: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL
   kworker/16:1H-1600438 [016] .....   604.366802: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
             <...>-382   [021] .....   604.366849: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
   kworker/16:1H-1600438 [016] .....   604.366866: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
   kworker/16:1H-1600438 [016] .....   604.366969: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
   kworker/16:1H-1600438 [016] .....   604.367005: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
 kworker/u258:32-1245394 [021] .....   604.367054: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
  kworker/u259:9-1580996 [023] .....   604.367109: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
             <...>-356   [028] .....   604.367163: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
           <...>-1245384 [028] .....   604.367213: xlog_ail_insert_abort: dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED
    xfsaild/dm-3-1747912 [047] ...1.   604.396155: xfs_inode_push_stale: dev 251:3 ino 0x2020082 count 0 pincount 0 iflags 0x86 caller xfsaild+0x432
    xfsaild/dm-3-1747912 [047] ...1.   604.396156: xfs_ail_pinned:       dev 251:3 lip 0xffff88907a7a9838 lsn 1/4199 type XFS_LI_INODE flags IN_AIL|ABORTED

There are *10* log item aborts for the inode in question in about
500us. It's one for every pin count, so at least that adds up, but
we can only have 4 checkpoints in flight at once because we only
have 4 workers, right?

Ah - we are limited to *building* 4 checkpoints at once. As soon as
the worker commits a checkpoint and releases the iclog, it can start
working on the next checkpoint.  So we've got 4 checkpoint
completions doing insert aborts from journal IO completion
(kworker/16:1H-1600438). There are 3 traces that are clearly unbound
CIL push kworkers (kworker/u259:14-391 and the like). There are also
3 that are truncated names, but the -XXX is very close to the PIDs
of the other unbound push kworkers, so that's probably what they
were.

This means that during a shutdown we clearly have racing calls to
xlog_cil_committed() rather than having them completely serialised
by journal IO completion. This means that we may still have a
transaction commit in flight that holds a reference to a
xfs_buf_log_item (BLI) after CIL insertion. e.g. a synchronous
transaction will flush the CIL before the transaction is torn down.

The concurrent CIL push then aborts its insertion and drops the
commit/AIL reference to the BLI. This can leave the transaction
commit context with the last reference to the BLI.

The abort of the inode buffer is supposed to abort all the inodes
attached to it on journal IO completion. This is done by the
xfs_buf_item_unpin() call, but if the last unpin doesn't drop the
last reference to the BLI, it doesn't complete the stale inodes
attached to it - it leaves that for the last reference.

Then when the last reference is released from transaction context
(xfs_trans_commit() or xfs_trans_cancel()), we end up here:

xfs_trans_free_items()
  ->iop_release
    xfs_buf_item_release
      xfs_buf_item_put
        if (XFS_LI_ABORTED)
          xfs_trans_ail_delete
        xfs_buf_item_relse()

And we do not process the inode objects attached to the buffer, even
though we've checked if this is an aborted log item on last release.

Hence in this case, it would seem that we've released a stale inode
buffer with stale inodes attached and in the AIL and haven't aborted
the inodes....

OK, let's add an assert:

ASSERT(list_empty(&bip->bli_buf->b_li_list));

to the abort case in xfs_buf_item_put()....
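
For reference, that lands roughly here; a hedged sketch of the tail
of xfs_buf_item_put(), not the verbatim diff:

        if (aborted) {
                /* nothing may still be attached on final teardown */
                ASSERT(list_empty(&bip->bli_buf->b_li_list));
                xfs_trans_ail_delete(lip, 0);
        }
        xfs_buf_item_relse(bip->bli_buf);
        return true;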

 XFS: Assertion failed: list_empty(&bip->bli_buf->b_li_list), file: fs/xfs/xfs_buf_item.c, line: 561
 ------------[ cut here ]------------
 kernel BUG at fs/xfs/xfs_message.c:102!
 Oops: invalid opcode: 0000 [#1] SMP NOPTI
 CPU: 12 UID: 0 PID: 816468 Comm: kworker/12:53 Not tainted 6.15.0-rc5-dgc+ #326 PREEMPT(full)
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Workqueue: xfs-inodegc/dm-1 xfs_inodegc_worker
 RIP: 0010:assfail+0x3a/0x40
 Code: 89 f1 48 89 fe 48 c7 c7 cf c7 ed 82 48 c7 c2 86 56 e8 82 e8 c8 fc ff ff 80 3d d1 d3 50 03 01 74 09 0f 0b 5d c3 cc cc cc cc cc <0f> 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
 RSP: 0018:ffffc900184e3b40 EFLAGS: 00010246
 RAX: a79ae1118c7fdd00 RBX: ffff88812b04ba80 RCX: a79ae1118c7fdd00
 RDX: ffffc900184e3a08 RSI: 000000000000000a RDI: ffffffff82edc7cf
 RBP: ffffc900184e3b40 R08: 0000000000000000 R09: 000000000000000a
 R10: 0000000000000000 R11: 0000000000000021 R12: 0000000000000024
 R13: 0000000000000000 R14: ffff88816a67a750 R15: 000000000000006c
 FS:  0000000000000000(0000) GS:ffff88889a78a000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fbe8536d3dc CR3: 00000008471e7000 CR4: 0000000000350ef0
 Call Trace:
  <TASK>
  xfs_buf_item_release+0x22c/0x280
  xfs_trans_free_items+0x94/0x140
  __xfs_trans_commit+0x1e4/0x320
  xfs_trans_roll+0x4d/0xe0
  xfs_defer_trans_roll+0x83/0x160
  xfs_defer_finish_noroll+0x1d1/0x440
  xfs_trans_commit+0x46/0x90
  xfs_inactive_ifree+0x1b6/0x230
  xfs_inactive+0x30e/0x3b0
  xfs_inodegc_worker+0xaa/0x180
  process_scheduled_works+0x1d6/0x400
  worker_thread+0x202/0x2e0
  kthread+0x20c/0x240
  ret_from_fork+0x3e/0x50
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Modules linked in:
 ---[ end trace 0000000000000000 ]---

And there's the smoking gun - the transaction commit holds the last
reference to the BLI that still has stale inodes attached to the
buffer.

-----

The fix for this is not immediately clear. Let's talk to the dog for
a bit...

We first need to observe that xfs_buf_item_release() must be
called with the buffer still locked, as one of its functions is to
unlock the buffer after the transaction is committed. Hence we can
run IO completion/failure routines that require the buffer to be
locked from xfs_buf_item_release() context if necessary.

However, the locking for stale buffers is different and quite
complicated. Stale buffers cover freed metadata extents, so we can't
allow them to be accessed for anything until the metadata related to
the freeing operation is on stable storage. i.e. the buffer has to
remain stale and locked until journal IO completion, not just
transaction commit completion. This is what allows metadata freeing
transactions to run asynchronously and avoid needing a log force
during xfs_trans_commit() context....

This means that the buffer lock context needs to be passed to the
last BLI reference. For stale buffers, this is normally the last BLI
unpin on journal IO completion. The unpin then processes the stale
buffer completion and releases the buffer lock.  However, if the
unpin from journal IO completion does not hold the last reference,
there -must- still be a transaction context that references the BLI,
and so the buffer must remain locked until that transaction context
is torn down.

IOWs, there is an inherent race between journal IO completion and
transaction freeing dropping the last reference to a stale buffer.

In normal runtime situations, this isn't a problem because
we can't use the reallocated block until the buffer lock has been
dropped. Hence we serialise on transaction commit completion rather
than journal IO completion.

However, in the case of stale inode buffers, this race condition
implies we failed to clean up stale inodes attached to the buffer
at journal IO completion.

This race condition generally doesn't occur because the BLI has
already been logged multiple times for unlinked inode list
modifications prior to the invalidation. Hence at inode cluster
freeing time it often has an elevated pin count because it is in the
CIL (and potentially other checkpoints in flight) already. This
results in the transaction commit context almost always dropping
its reference first due to how long CIL checkpoint submission and
completion takes.

Further, xfs_buf_item_release() has an ASSERT that checks the
transaction context cleanup is not dropping the last reference to a
stale buffer -except- in the case where the BLI has been aborted:

        /*
         * Unref the item and unlock the buffer unless held or stale. Stale
         * buffers remain locked until final unpin unless the bli is freed by
         * the unref call. The latter implies shutdown because buffer
         * invalidation dirties the bli and transaction.
         */
        released = xfs_buf_item_put(bip);
        if (hold || (stale && !released))
                return;
>>>>>   ASSERT(!stale || aborted);
        xfs_buf_relse(bp);

I've never seen that ASSERT trip at runtime for stale && !aborted
buffers, so it seems pretty clear that the unpin race only manifests
during shutdown conditions when abort conditions are asserted.

That's part of the problem - this code acknowledges that a race
condition can occur during shutdown, but then asserts that it is
benign. This may once have been true, but we've attached inodes
being flushed to the buffer for a long time under the guise that
buffer IO completion or stale buffer teardown will also complete or
abort attached inodes appropriately.

The debug ASSERT I added above catches this exact condition - a
stale buffer being released from transaction context with stale
inodes still attached to it.

Hence the way we are processing the last BLI reference in
xfs_buf_item_release() needs fixing. xfs_buf_item_put() is
needed, but it should not be doing any handling of dirty/stale
state. There are only 3 callers, and two of them explicitly only
pass clean BLIs to xfs_buf_item_put() to remove them from a
transaction context and do not check the return value.

These cases do not care if the item is in the AIL or not; buffers
that are in the AIL on shutdown will be cleared from the AIL by
pushing them. They will get queued for IO, then the IO will get
failed and IO completion will remove them from the AIL. Hence
these transaction removal cases do not need to care about whether
the item is aborted or not - they just need to check if it is in the
AIL or not. This state is protected by the buffer lock the
caller holds...

Hence I think that xfs_buf_item_release needs to look an awful lot
like xfs_buf_item_unpin()....
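
Roughly like this, assuming xfs_buf_item_put() is reduced to a pure
reference count drop and the stale inode/buffer teardown is lifted
into the xfs_buf_item_finish_stale() helper factored out earlier in
this branch (a sketch only - the final code may differ):

STATIC void
xfs_buf_item_release(
        struct xfs_log_item     *lip)
{
        struct xfs_buf_log_item *bip = BUF_ITEM(lip);
        struct xfs_buf          *bp = bip->bli_buf;
        bool                    hold = bip->bli_flags & XFS_BLI_HOLD;
        bool                    stale = bip->bli_flags & XFS_BLI_STALE;

        trace_xfs_buf_item_release(bip);

        /*
         * Not the last reference: stale buffers must stay locked
         * until the final unpin, exactly as in xfs_buf_item_unpin().
         */
        if (!xfs_buf_item_put(bip)) {
                if (!hold && !stale)
                        xfs_buf_relse(bp);
                return;
        }

        /*
         * Last reference: abort and remove any stale inodes still
         * attached to the buffer before the buffer lock is dropped.
         */
        if (stale)
                xfs_buf_item_finish_stale(bip);
        if (!hold)
                xfs_buf_relse(bp);
}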

<hack hack hack>

generic/753 again:

 XFS: Assertion failed: !(bip->bli_flags & XFS_BLI_DIRTY), file: fs/xfs/xfs_buf_item.c, line: 616
  xfs_buf_item_put+0x97/0xf0
  xfs_trans_brelse+0x9b/0x1d0
  xfs_btree_del_cursor+0x2f/0xc0
  xfs_alloc_ag_vextent_near+0x359/0x470
  xfs_alloc_vextent_near_bno+0xbc/0x180
  xfs_ialloc_ag_alloc+0x6dd/0x7a0
  xfs_dialloc+0x38e/0x8e0
  xfs_create+0x1d4/0x430
  xfs_generic_create+0x141/0x3e0
  xfs_vn_mkdir+0x1a/0x30
  vfs_mkdir+0xe6/0x190
  do_mkdirat+0xaa/0x260
  __x64_sys_mkdir+0x2b/0x40
  x64_sys_call+0x228f/0x2f60
  do_syscall_64+0x68/0x130

But wait, there's more!

This is trying to free a buffer that is not dirty in the transaction
(XFS_LI_DIRTY on the log item is not set), but the BLI is marked
dirty and isn't in the AIL. That, I think, is another shutdown
race where an item is in a committing checkpoint (not locked, has a
bli reference) at the same time as something reads and then releases
the buffer. A shutdown occurs and the committing checkpoint
completes, aborts the log item instead of inserting it into the AIL,
and because it does not hold the last bli reference it then does
nothing else.

At this point, the only remaining bli reference is owned by the
btree cursor, the BLI is dirty and not stale, and we call
xfs_trans_brelse() -> xfs_buf_item_put() to drop the buffer which
then would free the BLI directly, even though it is dirty. That
triggers the above ASSERT.

So, in a shutdown state, we can be freeing dirty BLIs from
xfs_buf_item_put() via xfs_trans_brelse() and xfs_trans_bdetach().
Yes, the existing code handles this case by considering shutdown
state as "aborted", but that's one of the confusions that masks the
original bug from the xfs_buf_item_release() path - the buffer may
be considered "aborted" in the shutdown case, but we still may have
to do cleanup work on it regardless of whether it is in the AIL or
not.

The question is whether we can get this state occurring with stale
inode buffers that have attached stale inodes. It can't happen
with xfs_trans_brelse(), as it keeps the buffer attached to the
transaction if XFS_BLI_STALE is set. xfs_trans_bdetach() also has
an assert that would fire if it was called on a XFS_BLI_DIRTY
or XFS_BLI_STALE buffer. Hence I don't think that we can get
races with stale buffers here, and so all that needs doing is
changing the assert....
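
i.e. something along these lines (a sketch; the exact shutdown test
may differ in the final patch):

        /* dirty BLIs may be freed here once the log has shut down */
        ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY) ||
               xlog_is_shutdown(lip->li_log));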

Ok, it has survived check-parallel for an hour, so I think that the
problem is fixed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
3 weeks ago xfs: move some functionality up in xfs_buf_item.c
Dave Chinner [Fri, 23 May 2025 04:46:06 +0000 (06:46 +0200)]
xfs: move some functionality up in xfs_buf_item.c

This also changes the argument passed to xfs_buf_item_relse and
marks it static.

hch: split from a larger patch by Dave, needs a commit log

3 weeks ago xfs: factor out a xfs_buf_item_finish_stale helper
Dave Chinner [Fri, 23 May 2025 04:40:41 +0000 (06:40 +0200)]
xfs: factor out a xfs_buf_item_finish_stale helper

hch: split from a larger patch by Dave, needs a proper commit log

3 weeks ago xfs: improve inode item tracing
Dave Chinner [Fri, 23 May 2025 04:38:08 +0000 (06:38 +0200)]
xfs: improve inode item tracing

hch: split from a larger patch from Dave, needs a proper commit log.

3 weeks ago xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
Dave Chinner [Thu, 15 May 2025 02:42:09 +0000 (12:42 +1000)]
xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock

Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK
-> IFLUSHING, except for the last inode in the cluster that is
triggering the free. In that case, the lock order is ILOCK ->
cluster buffer -> IFLUSHING.

xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING,
so this can safely run concurrently with xfs_ifree_cluster().

xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this
cannot race with xfs_ifree_cluster() so being in a different order
will not trigger a deadlock.

xfs_reclaim_inode() during a filesystem shutdown uses ILOCK ->
IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and
this deadlocks against xfs_ifree_cluster() like so:

 sysrq: Show Blocked State
 task:kworker/10:37   state:D stack:12560 pid:276182 tgid:276182 ppid:2      flags:0x00004000
 Workqueue: xfs-inodegc/dm-3 xfs_inodegc_worker
 Call Trace:
  <TASK>
  __schedule+0x650/0xb10
  schedule+0x6d/0xf0
  schedule_timeout+0x8b/0x180
  schedule_timeout_uninterruptible+0x1e/0x30
  xfs_ifree+0x326/0x730
  xfs_inactive_ifree+0xcb/0x230
  xfs_inactive+0x2c8/0x380
  xfs_inodegc_worker+0xaa/0x180
  process_scheduled_works+0x1d4/0x400
  worker_thread+0x234/0x2e0
  kthread+0x147/0x170
  ret_from_fork+0x3e/0x50
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 task:fsync-tester    state:D stack:12160 pid:2255943 tgid:2255943 ppid:3988702 flags:0x00004006
 Call Trace:
  <TASK>
  __schedule+0x650/0xb10
  schedule+0x6d/0xf0
  schedule_timeout+0x31/0x180
  __down_common+0xbe/0x1f0
  __down+0x1d/0x30
  down+0x48/0x50
  xfs_buf_lock+0x3d/0xe0
  xfs_iflush_shutdown_abort+0x51/0x1e0
  xfs_icwalk_ag+0x386/0x690
  xfs_reclaim_inodes_nr+0x114/0x160
  xfs_fs_free_cached_objects+0x19/0x20
  super_cache_scan+0x17b/0x1a0
  do_shrink_slab+0x180/0x350
  shrink_slab+0xf8/0x430
  drop_slab+0x97/0xf0
  drop_caches_sysctl_handler+0x59/0xc0
  proc_sys_call_handler+0x189/0x280
  proc_sys_write+0x13/0x20
  vfs_write+0x33d/0x3f0
  ksys_write+0x7c/0xf0
  __x64_sys_write+0x1b/0x30
  x64_sys_call+0x271d/0x2ee0
  do_syscall_64+0x68/0x130
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE
and XFS_IFLUSHING are serialised through to journal IO completion
by the cluster buffer lock being held.

There are quite a few asserts in the code that check that XFS_ISTALE
does not occur out of sync with buffer locking (e.g. in
xfs_iflush_cluster). There's also a dependency on the inode log item
being removed from the buffer before XFS_IFLUSHING is cleared, also
with asserts that trigger on this.

Further, we don't have a requirement for the inode to be locked when
completing or aborting inode flushing because all the inode state
updates are serialised by holding the cluster buffer lock across the
IO to completion.

We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and
skip the inode, because there is no guarantee that the inode will be
reclaimed. Hence it *must* be marked XFS_ISTALE regardless of
whether reclaim is preparing to free that inode. Similarly, we can't
check for IFLUSHING before locking the inode because that would
result in dirty inodes not being marked with ISTALE in the event of
racing with XFS_IRECLAIM.

Hence we have to address this issue from the xfs_reclaim_inode()
side. It is clear that we cannot hold the inode locked here when
calling xfs_iflush_shutdown_abort() because it is the inode->buffer
lock order that causes the deadlock against xfs_ifree_cluster().

Hence we need to drop the ILOCK before aborting the inode in the
shutdown case. Once we've aborted the inode, we can grab the ILOCK
again and then immediately reclaim it as it is now guaranteed to be
clean.
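
In code form, the reordering looks roughly like this (a sketch of the
xfs_reclaim_inode() shutdown path as described above, not the
verbatim patch):

        if (xlog_is_shutdown(ip->i_mount->m_log)) {
                /*
                 * Drop the ILOCK first: xfs_iflush_shutdown_abort()
                 * takes the cluster buffer lock, and buffer -> ILOCK
                 * is the lock order used by xfs_ifree_cluster().
                 */
                xfs_iunlock(ip, XFS_ILOCK_EXCL);
                xfs_iflush_shutdown_abort(ip);
                xfs_ilock(ip, XFS_ILOCK_EXCL);
                /* the inode is now clean and can be reclaimed */
                goto reclaim;
        }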

Note that dropping the ILOCK in xfs_reclaim_inode() means that it
can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in
this state. This is safe because we have left the XFS_IFLUSHING flag
on the inode and so xfs_ifree_mark_inode_stale() will simply set
XFS_ISTALE and move to the next inode. An ASSERT check in this path
needs to be tweaked to take into account this new shutdown
interaction.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
4 weeks ago Merge branch 'xfs-6.15-fixes' into for-next
Carlos Maiolino [Wed, 14 May 2025 17:13:07 +0000 (19:13 +0200)]
Merge branch 'xfs-6.15-fixes' into for-next

 Conflicts:
fs/xfs/xfs_super.c

Caused by patches:
xfs: Fail remount with noattr2 on a v5 with v4 enabled

xfs: allow sysadmins to specify a maximum atomic write limit at
     mount time

Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: add inode to zone caching for data placement
Hans Holmberg [Wed, 14 May 2025 10:50:37 +0000 (10:50 +0000)]
xfs: add inode to zone caching for data placement

Placing data from the same file in the same zone is a great heuristic
for reducing write amplification and we do this already - but only
for sequential writes.

To support placing data in the same way for random writes, reuse the
xfs mru cache to map inodes to open zones on first write. If a mapping
is present, use the open zone for data placement for this file until
the zone is full.
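
A rough sketch of the lookup side of this (the m_zone_cache field and
the item type are illustrative names, not necessarily what the patch
uses):

        struct xfs_mru_cache_elem       *mru;

        /* reuse the open zone this inode last wrote to, if any */
        mru = xfs_mru_cache_lookup(mp->m_zone_cache, ip->i_ino);
        if (mru) {
                oz = container_of(mru, struct xfs_zone_cache_item,
                                mru)->oz;
                xfs_mru_cache_done(mp->m_zone_cache);
                return oz;      /* keep placing this file's data here */
        }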

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: free the item in xfs_mru_cache_insert on failure
Christoph Hellwig [Wed, 14 May 2025 10:50:37 +0000 (10:50 +0000)]
xfs: free the item in xfs_mru_cache_insert on failure

Call the provided free_func when xfs_mru_cache_insert fails, as
that's what the callers need to do anyway.
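
A caller-side sketch of what this enables - no separate error-path
free is needed any more (names illustrative):

        error = xfs_mru_cache_insert(mp->m_zone_cache, ip->i_ino,
                        &item->mru);
        if (error)
                return error;   /* item already freed via free_func */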

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: Fix comment on xfs_trans_ail_update_bulk()
Carlos Maiolino [Mon, 12 May 2025 11:42:56 +0000 (13:42 +0200)]
xfs: Fix comment on xfs_trans_ail_update_bulk()

This function doesn't take the AIL lock, but should be called
with the AIL lock held. Also (hopefully) simplify the comment.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: Fix a comment on xfs_ail_delete
Carlos Maiolino [Mon, 12 May 2025 11:42:55 +0000 (13:42 +0200)]
xfs: Fix a comment on xfs_ail_delete

It doesn't return anything.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Chandan Babu R <chandanbabu@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: Fail remount with noattr2 on a v5 with v4 enabled
Nirjhar Roy (IBM) [Mon, 12 May 2025 16:30:32 +0000 (22:00 +0530)]
xfs: Fail remount with noattr2 on a v5 with v4 enabled

Bug: When we compile the kernel with CONFIG_XFS_SUPPORT_V4=y,
remount with "-o remount,noattr2" on a v5 XFS does not
fail explicitly.

Reproduction:
mkfs.xfs -f /dev/loop0
mount /dev/loop0 /mnt/scratch
mount -o remount,noattr2 /dev/loop0 /mnt/scratch

However, with CONFIG_XFS_SUPPORT_V4=n, the remount
correctly fails explicitly. This is because of the way the
following two functions are defined:

static inline bool xfs_has_attr2(struct xfs_mount *mp)
{
        return !IS_ENABLED(CONFIG_XFS_SUPPORT_V4) ||
               (mp->m_features & XFS_FEAT_ATTR2);
}

static inline bool xfs_has_noattr2(const struct xfs_mount *mp)
{
        return mp->m_features & XFS_FEAT_NOATTR2;
}

xfs_has_attr2() always returns true when CONFIG_XFS_SUPPORT_V4=n,
and hence the following if condition in
xfs_fs_validate_params() triggers and returns -EINVAL:

/*
 * We have not read the superblock at this point, so only the attr2
 * mount option can set the attr2 feature by this stage.
 */
if (xfs_has_attr2(mp) && xfs_has_noattr2(mp)) {
        xfs_warn(mp, "attr2 and noattr2 cannot both be specified.");
        return -EINVAL;
}

With CONFIG_XFS_SUPPORT_V4=y, xfs_has_attr2() returns false here
(only the noattr2 mount option has been parsed at this stage, so
XFS_FEAT_ATTR2 is not set) and hence no error is returned.

Fix: Check whether the existing mount has crc enabled (i.e., it is
of type v5 and has attr2 enabled) and the remount specifies noattr2;
if so, return -EINVAL.
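
A sketch of the kind of check this adds to the remount path (the
exact placement and message are approximate):

        /* V5 filesystems always have attr2; refuse to turn it off */
        if (xfs_has_crc(mp) && xfs_has_noattr2(new_mp)) {
                xfs_warn(mp,
        "noattr2 cannot be specified on a V5 (CRC-enabled) filesystem");
                return -EINVAL;
        }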

I have tested xfs/{189,539} in fstests with v4
and v5 XFS with both CONFIG_XFS_SUPPORT_V4=y/n and
they both behave as expected.

This patch also fixes remount from noattr2 -> attr2 (on a v4 xfs).

Related discussion in [1]

[1] https://lore.kernel.org/all/Z65o6nWxT00MaUrW@dread.disaster.area/

Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: fix zoned GC data corruption due to wrong bv_offset
Christoph Hellwig [Mon, 12 May 2025 14:43:05 +0000 (16:43 +0200)]
xfs: fix zoned GC data corruption due to wrong bv_offset

xfs_zone_gc_write_chunk writes out the data buffer read in earlier using
the same bio, and currently looks at bv_offset for the offset into the
scratch folio for that.  But commit 26064d3e2b4d ("block: fix adding
folio to bio") changed how bv_page and bv_offset are calculated for
adding larger folios, breaking this fragile logic.

Switch to extracting the full physical address from the old bio_vec,
and calculate the offset into the folio from that instead.
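
i.e. roughly the following, using the generic bvec_phys() helper
(a sketch; variable names are illustrative):

        /*
         * The folio is physically contiguous, so the full physical
         * address of the bvec gives the offset into the scratch
         * folio regardless of how bv_page/bv_offset were split up.
         */
        offset = bvec_phys(&bv) -
                 page_to_phys(folio_page(scratch_folio, 0));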

This fixes data corruption during garbage collection with heavy rocksdb
workloads.  Thanks to Hans for tracking down the culprit commit during
long bisection sessions.

Fixes: 26064d3e2b4d ("block: fix adding folio to bio")
Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Tested-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: free up mp->m_free[0].count in error case
Wengang Wang [Mon, 5 May 2025 23:35:49 +0000 (16:35 -0700)]
xfs: free up mp->m_free[0].count in error case

In xfs_init_percpu_counters(), memory for mp->m_free[0].count wasn't freed
in the error case. Free it up in this patch.
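
i.e. the error unwind gains the matching destroy (sketch):

        percpu_counter_destroy(&mp->m_free[0].count);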

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Fixes: 712bae96631852 ("xfs: generalize the freespace and reserved blocks handling")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: remove the EXPERIMENTAL warning for pNFS
Christoph Hellwig [Wed, 14 May 2025 04:44:20 +0000 (06:44 +0200)]
xfs: remove the EXPERIMENTAL warning for pNFS

The pNFS layout support has been around for 10 years without major
issues, drop the EXPERIMENTAL warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: remove some EXPERIMENTAL warnings
Darrick J. Wong [Tue, 13 May 2025 14:35:29 +0000 (07:35 -0700)]
xfs: remove some EXPERIMENTAL warnings

Online fsck was finished a year ago, in Linux 6.10.  The exchange-range
syscall and parent pointers were merged in the same cycle.  None of
these have encountered any serious errors in the year that they've been
in the kernel (or the many many years they've been under development) so
let's drop the shouty warnings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago Merge branch 'atomic_writes-6.16' into xfs-6.16-merge
Carlos Maiolino [Wed, 14 May 2025 10:38:53 +0000 (12:38 +0200)]
Merge branch 'atomic_writes-6.16' into xfs-6.16-merge

Required update due to conflict with patch:
xfs: stop using set_blocksize

 Conflicts:
fs/xfs/xfs_buf.c

Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: Remove deprecated xfs_bufd sysctl parameters
Zizhi Wo [Tue, 6 May 2025 01:15:40 +0000 (09:15 +0800)]
xfs: Remove deprecated xfs_bufd sysctl parameters

Commit 64af7a6ea5a4 ("xfs: remove deprecated sysctls") removed the
deprecated xfsbufd-related sysctl interface, but forgot to delete the
corresponding parameters: "xfs_buf_timer" and "xfs_buf_age".

This patch removes those parameters and makes no other changes.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago xfs: stop using set_blocksize
Darrick J. Wong [Wed, 23 Apr 2025 19:54:13 +0000 (12:54 -0700)]
xfs: stop using set_blocksize

XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data and use that instead.  No more truncating the
pagecache because XFS does not use it or care about it.
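
The replacement boils down to something like this (a sketch using the
new block layer helper pulled in by the merge below):

        /* check the blocksize without touching the bdev pagecache */
        error = bdev_validate_blocksize(btp->bt_bdev, sectorsize);
        if (error)
                return error;
        /* write back any dirty pagecache the bdev may still have */
        sync_blockdev(btp->bt_bdev);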

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 weeks ago Merge branch 'block-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe...
Carlos Maiolino [Wed, 14 May 2025 10:20:57 +0000 (12:20 +0200)]
Merge branch 'block-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into xfs-6.16-merge

Merging block tree into XFS because of some dependencies like
bdev_validate_blocksize()

Signed-off-by: Carlos Maiolino <cem@kernel.org>
5 weeks ago block: always allocate integrity buffer when required
Keith Busch [Fri, 9 May 2025 15:38:02 +0000 (08:38 -0700)]
block: always allocate integrity buffer when required

Many nvme metadata formats can not strip or generate the metadata on the
controller side. For these formats, a host provided integrity buffer is
mandatory even if it isn't checked.

The block integrity read_verify and write_generate attributes prevent
allocating the metadata buffer, but we need it when the format requires
it, otherwise reads and writes will be rejected by the driver with IO
errors.

Assume the integrity buffer can be offloaded to the controller if the
metadata size is the same as the protection information size. Otherwise
provide an unchecked host buffer when the read verify or write
generation attributes are disabled. This fixes the following nvme
warning:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 371 at drivers/nvme/host/core.c:1036 nvme_setup_rw+0x122/0x210
 ...
 RIP: 0010:nvme_setup_rw+0x122/0x210
 ...
 Call Trace:
  <TASK>
  nvme_setup_cmd+0x1b4/0x280
  nvme_queue_rqs+0xc4/0x1f0 [nvme]
  blk_mq_dispatch_queue_requests+0x24a/0x430
  blk_mq_flush_plug_list+0x50/0x140
  __blk_flush_plug+0xc1/0x100
  __submit_bio+0x1c1/0x360
  ? submit_bio_noacct_nocheck+0x2d6/0x3c0
  submit_bio_noacct_nocheck+0x2d6/0x3c0
  ? submit_bio_noacct+0x47/0x4c0
  submit_bio_wait+0x48/0xa0
  __blkdev_direct_IO_simple+0xee/0x210
  ? current_time+0x1d/0x100
  ? current_time+0x1d/0x100
  ? __bio_clone+0xb0/0xb0
  blkdev_read_iter+0xbb/0x140
  vfs_read+0x239/0x310
  ksys_read+0x58/0xc0
  do_syscall_64+0x6c/0x180
  entry_SYSCALL_64_after_hwframe+0x4b/0x53

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250509153802.3482493-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 weeks ago Merge tag 'atomic-writes-6.16_2025-05-07' of https://git.kernel.org/pub/scm/linux...
Carlos Maiolino [Fri, 9 May 2025 07:15:39 +0000 (09:15 +0200)]
Merge tag 'atomic-writes-6.16_2025-05-07' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into atomic_writes

large atomic writes for xfs [v12.1]

Currently atomic write support for xfs is limited to writing a single
block as we have no way to guarantee alignment and that the write covers
a single extent.

This series introduces a software-based method for issuing atomic
writes.

The software-based method is used as a fallback for when attempting to
issue an atomic write over misaligned or multiple extents.

For xfs, this support is based on reflink CoW support.

The basic idea of this CoW method is to alloc a range in the CoW fork,
write the data, and atomically update the mapping.

Initial mysql performance testing has shown this method to perform ok.
However, there we are only using 16K atomic writes (and 4K block size),
so typically - and thankfully - this software fallback method won't be
used often.

For other FSes which want large atomic writes and don't support CoW, I
think that they can follow the example in [0].

Catherine is currently working on further xfstests for this feature,
which we hope to share soon.

About 17/17, maybe it can be omitted as there is no strong demand to have
it included.

Based on bfecc4091e07 (xfs/next-rc, xfs/for-next) xfs: allow ro mounts
if rtdev or logdev are read-only

[0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/

Differences to v12:
- add more review tags

Differences to v11:
- split "xfs: ignore ..." patch
- inline sync_blockdev() in xfs_alloc_buftarg() (Christoph)
- fix xfs_calc_rtgroup_awu_max() for 0 block count (Darrick)
- Add RB tag from Christoph (thanks!)

Differences to v10:
- add "xfs: only call xfs_setsize_buftarg once ..." by Darrick
- symbol renames in "xfs: ignore HW which cannot..." by Darrick

Differences to v9:
- rework "ignore HW which cannot .." patch by Darrick
- Ensure power-of-2 max always for unit min/max when no HW support

With a bit of luck, this should all go splendidly.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
5 weeks ago Merge tag 'nvme-6.15-2025-05-08' of git://git.infradead.org/nvme into block-6.15
Jens Axboe [Thu, 8 May 2025 15:08:23 +0000 (09:08 -0600)]
Merge tag 'nvme-6.15-2025-05-08' of git://git.infradead.org/nvme into block-6.15

Pull NVMe fix from Christoph:

"nvme fixes for linux 6.15

 - unblock ctrl state transition for firmware update (Daniel Wagner)"

* tag 'nvme-6.15-2025-05-08' of git://git.infradead.org/nvme:
  nvme: unblock ctrl state transition for firmware update

5 weeks ago block: remove test of incorrect io priority level
Aaron Lu [Thu, 8 May 2025 08:30:36 +0000 (16:30 +0800)]
block: remove test of incorrect io priority level

Ever since commit eca2040972b4 ("scsi: block: ioprio: Clean up interface
definition"), the macro IOPRIO_PRIO_LEVEL() will mask the level value to
something between 0 and 7 so necessarily, level will always be lower than
IOPRIO_NR_LEVELS(8).

Remove this obsolete check.
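
For reference, the masking that makes the check dead code (from
include/uapi/linux/ioprio.h after that cleanup):

        #define IOPRIO_LEVEL_MASK       (IOPRIO_NR_LEVELS - 1)
        #define IOPRIO_PRIO_LEVEL(ioprio) ((ioprio) & IOPRIO_LEVEL_MASK)

so a level obtained via IOPRIO_PRIO_LEVEL() is always lower than
IOPRIO_NR_LEVELS.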

Reported-by: Kexin Wei <ys.weikexin@h3c.com>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250508083018.GA769554@bytedance
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 weeks ago xfs: allow sysadmins to specify a maximum atomic write limit at mount time
Darrick J. Wong [Wed, 7 May 2025 21:18:34 +0000 (14:18 -0700)]
xfs: allow sysadmins to specify a maximum atomic write limit at mount time

Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.

The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size).  We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.

The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has for the past few commits.  Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: update atomic write limits
John Garry [Wed, 7 May 2025 21:18:33 +0000 (14:18 -0700)]
xfs: update atomic write limits

Update the limits returned from xfs_get_atomic_write_{min, max, max_opt}().

No reflink support always means no CoW-based atomic writes.

For updating xfs_get_atomic_write_min(), we support blocksize only and that
depends on HW or reflink support.

For updating xfs_get_atomic_write_max(): with no reflink, we are limited
to blocksize, and only if there is HW support. Otherwise we are limited
to the combined limit in mp->m_atomic_write_unit_max.
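
In code, that rule reads roughly like this (a sketch;
xfs_inode_can_hw_atomic_write() is the helper renamed earlier in this
series):

        /* no reflink support always means no CoW-based atomic writes */
        if (!xfs_has_reflink(mp))
                return xfs_inode_can_hw_atomic_write(ip) ?
                                mp->m_sb.sb_blocksize : 0;
        return mp->m_atomic_write_unit_max;     /* combined limit */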

For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by
the bdev atomic write limit. If xfs_get_atomic_write_max() does not report
> 1x blocksize, then just continue to report 0 as before.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: add xfs_calc_atomic_write_unit_max()
John Garry [Wed, 7 May 2025 21:18:32 +0000 (14:18 -0700)]
xfs: add xfs_calc_atomic_write_unit_max()

Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.

The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

Function xfs_atomic_write_logitems() is added to find the limit on the
number of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism.  In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.

[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: add xfs_file_dio_write_atomic()
John Garry [Wed, 7 May 2025 21:18:31 +0000 (14:18 -0700)]
xfs: add xfs_file_dio_write_atomic()

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

Now HW offload will not be required to support atomic writes and is
an optional feature.

CoW-based atomic writes will be supported with out-of-place writes and
atomic extent remapping.

Either mode of operation may be used for an atomic write, depending on the
circumstances.

The preferred method is HW offload as it will be faster. If HW offload is
not available then we always use the CoW-based method.  If HW offload is
available but not possible to use, then again we use the CoW-based method.

Even if available, HW offload would not be possible when the write
length exceeds the HW offload limit, the write spans multiple extents,
disk blocks are unaligned, etc.

Apart from the write exceeding the HW offload limit, other conditions for
HW offload usage can only be detected in the iomap handling for the write.
As such, we use a fallback method to issue the write if we detect in the
->iomap_begin() handler that HW offload is not possible. The special
code -ENOPROTOOPT is returned from ->iomap_begin() to signal that HW
offload is not possible.
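
The resulting submission flow is roughly (a sketch; the CoW ops name
follows the xfs_atomic_write_cow_iomap_begin() patch in this series):

        ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
                        &xfs_dio_write_ops, flags, NULL, 0);
        if (ret == -ENOPROTOOPT)        /* HW offload not possible */
                ret = iomap_dio_rw(iocb, from,
                                &xfs_atomic_write_cow_iomap_ops,
                                &xfs_dio_write_ops, flags, NULL, 0);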

In other words, atomic writes are supported on any filesystem that can
perform out of place write remapping atomically (i.e. reflink) up to
some fairly large size.  If the conditions are right (a single correctly
aligned overwrite mapping) then the filesystem will use any available
hardware support to avoid the filesystem metadata updates.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: commit CoW-based atomic writes atomically
John Garry [Wed, 7 May 2025 21:18:31 +0000 (14:18 -0700)]
xfs: commit CoW-based atomic writes atomically

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the amount of log intent items which can be
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which means only 16 possible extent unmaps, which is quite
small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
John Garry [Wed, 7 May 2025 21:18:30 +0000 (14:18 -0700)]
xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()

For when large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.

Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
  sparse or mixed-mapping extents
- the write length is greater than HW offload can support
- no hardware support at all

In those cases, we need to fall back to the CoW-based atomic write mode.
For this, report the special code -ENOPROTOOPT to inform the caller that
the HW offload-based method is not possible.

In addition to the occasions mentioned, if the write covers an unallocated
range, we again judge that we need to rely on the CoW-based method when we
would need to allocate anything more than 1x block. This is because if we
allocate fewer blocks than are required for the write, then again the HW
offload-based method would not be possible. So we are taking a pessimistic
approach to writes covering unallocated space.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: various cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: add xfs_atomic_write_cow_iomap_begin()
John Garry [Wed, 7 May 2025 21:18:29 +0000 (14:18 -0700)]
xfs: add xfs_atomic_write_cow_iomap_begin()

For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork
support.

Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create
staging mappings in the CoW fork for atomic write updates.

The general steps in the function are as follows:
- find extent mapping in the CoW fork for the FS block range being written
- if part or full extent is found, proceed to process found extent
- if no extent found, map in new blocks to the CoW fork
- convert unwritten blocks in extent if required
- update iomap extent mapping and return

The bulk of this function is quite similar to the processing in
xfs_reflink_allocate_cow(), where we try to find an extent mapping; if
none exists, then allocate a new extent in the CoW fork, convert unwritten
blocks, and return a mapping.

Performance testing has shown the XFS_ILOCK_EXCL locking to be quite
a bottleneck, so this is an area which could be optimised in future.

Christoph Hellwig contributed almost all of the code in
xfs_atomic_write_cow_iomap_begin().

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add a new xfs_can_sw_atomic_write to convey intent better]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: refine atomic write size check in xfs_file_write_iter()
John Garry [Wed, 7 May 2025 21:18:28 +0000 (14:18 -0700)]
xfs: refine atomic write size check in xfs_file_write_iter()

Currently the size of atomic write allowed is fixed at the blocksize.

To start to lift this restriction, partly refactor
xfs_report_atomic_write() into helpers -
xfs_get_atomic_write_{min, max}() - and use those helpers to find the
per-inode atomic write limits and check according to that.

Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and
just return 0 since large atomics aren't supported yet.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: refactor xfs_reflink_end_cow_extent()
John Garry [Wed, 7 May 2025 21:18:28 +0000 (14:18 -0700)]
xfs: refactor xfs_reflink_end_cow_extent()

Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in future for when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: allow block allocator to take an alignment hint
John Garry [Wed, 7 May 2025 21:18:27 +0000 (14:18 -0700)]
xfs: allow block allocator to take an alignment hint

Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.

This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: ignore HW which cannot atomic write a single block
Darrick J. Wong [Wed, 7 May 2025 21:18:26 +0000 (14:18 -0700)]
xfs: ignore HW which cannot atomic write a single block

Currently only HW which can write at least 1x block is supported.

For supporting atomic writes > 1x block, a CoW-based method will also be
used and this will not be restricted to using HW which can write >= 1x
block.

However for deciding if HW-based atomic writes can be used, we need to
start adding checks for write length < HW min, which complicates the
code.  Indeed, a statx field similar to unit_max_opt should also be
added for this minimum, which is undesirable.

HW which can only write > 1x blocks would be uncommon and quite weird,
so let's just not support it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 weeks ago xfs: add helpers to compute transaction reservation for finishing intent items
Darrick J. Wong [Wed, 7 May 2025 21:18:25 +0000 (14:18 -0700)]
xfs: add helpers to compute transaction reservation for finishing intent items

In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions.  These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: add helpers to compute log item overhead
Darrick J. Wong [Wed, 7 May 2025 21:18:25 +0000 (14:18 -0700)]
xfs: add helpers to compute log item overhead

Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log.  These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: separate out setting buftarg atomic writes limits
Darrick J. Wong [Wed, 7 May 2025 21:18:24 +0000 (14:18 -0700)]
xfs: separate out setting buftarg atomic writes limits

Separate out setting buftarg atomic writes limits into a dedicated
function, xfs_configure_buftarg_atomic_writes(), to keep the specific
functionality self-contained.

For naming consistency, rename xfs_setsize_buftarg() ->
xfs_configure_buftarg().

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: separate out from patch "xfs: ignore HW which ..."]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 weeks ago xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomic_write()
John Garry [Wed, 7 May 2025 21:18:23 +0000 (14:18 -0700)]
xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomic_write()

In future we will want to be able to check if specifically HW offload-based
atomic writes are possible, so rename xfs_inode_can_atomicwrite() ->
xfs_inode_can_hw_atomicwrite().

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: add an underscore to be consistent with everything else]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
5 weeks ago xfs: only call xfs_setsize_buftarg once per buffer target
Darrick J. Wong [Wed, 7 May 2025 21:18:22 +0000 (14:18 -0700)]
xfs: only call xfs_setsize_buftarg once per buffer target

It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the
block device LBA size because we don't need to ask the block layer to
validate a geometry number that it provided us.  Instead, set the
preliminary bt_meta_sector* fields to the LBA size in preparation for
reading the primary super.

However, we still want to flush and invalidate the pagecache for all
three block devices before we start reading metadata from those devices,
so call sync_blockdev() per bdev in xfs_alloc_buftarg().

This will enable a subsequent patch to validate hw atomic write geometry
against the filesystem geometry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
[jpg: call sync_blockdev() from xfs_alloc_buftarg()]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 weeks ago fs: add atomic write unit max opt to statx
John Garry [Wed, 7 May 2025 21:18:21 +0000 (14:18 -0700)]
fs: add atomic write unit max opt to statx

XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.

Specifically, a new method of operation based on FS atomic extent remapping
will be supported in addition to the current HW offload-based method.

The FS method will generally perform appreciably slower than the
HW-offload method. However, the FS method will typically be able to
contribute to achieving a larger atomic write unit max limit.

XFS will support a hybrid mode, where the HW offload method is used when
possible, i.e. when the length of the write is supported, and FS-based
atomic writes are used otherwise.

As such, there is an atomic write length at which the user may experience
appreciably slower performance.

Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.

When zero, it means that there is no such performance boundary.

Mask STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
already report 0 in this field (from the zeroing in cp_statx()). Furthermore,
those older kernels don't support large atomic writes - apart from block
fops, but there the performance is consistent for atomic writes in the
range [unit min, unit max].
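
A minimal userspace sketch of querying the new field (assuming kernel
headers that define STATX_WRITE_ATOMIC and the new struct statx member):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/stat.h>

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc < 2)
			return 1;
		if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx))
			return 1;
		/* 0 means no performance boundary below unit max */
		printf("atomic write unit max opt: %u\n",
		       stx.stx_atomic_write_unit_max_opt);
		return 0;
	}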

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
6 weeks agonvme: unblock ctrl state transition for firmware update
Daniel Wagner [Fri, 2 May 2025 08:58:00 +0000 (10:58 +0200)]
nvme: unblock ctrl state transition for firmware update

The original nvme subsystem design didn't have a CONNECTING state; the
state machine allowed transitions from RESETTING to LIVE directly.

With the introduction of nvme fabrics the CONNECTING state was
introduced. Over time nvme-pci started to use the CONNECTING state as
well.

Eventually, a bug fix for nvme-fc started to depend on the fact that the
only valid transition to LIVE was from CONNECTING. However, this change
didn't update the firmware update handler, which still depended on the
RESETTING to LIVE transition.

The simplest way to address this for the time being is to switch into the
CONNECTING state before going to the LIVE state.
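
A minimal sketch of the resulting transition sequence in the firmware
update path (not the literal patch; error handling elided):

	/* detour through CONNECTING so the transition to LIVE stays valid */
	nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING);
	nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE);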

Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Closes: https://lore.kernel.org/all/0134ea15-8d5f-41f7-9e9a-d7e6d82accaa@roeck-us.net
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
6 weeks agoblock: only update request sector if needed
Johannes Thumshirn [Tue, 6 May 2025 11:27:30 +0000 (13:27 +0200)]
block: only update request sector if needed

In case of a ZONE APPEND write, regardless of native ZONE APPEND or the
emulation layer in the zone write plugging code, the sector the data got
written to by the device needs to be updated in the bio.

At the moment, this is done for every native ZONE APPEND write and every
request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this
superfluously updates the sector for regular writes to a zoned block
device.

Check if a bio is a native ZONE APPEND write or if the bio is flagged as
'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging
code handles the ZONE APPEND and translates it into a regular write and
back. Only if one of these two criteria is met, update the sector in the
bio upon completion.
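
A minimal sketch of the completion-time check this describes (hedged;
`req` is the completed request and `bio` one of its bios):

	if (bio_op(bio) == REQ_OP_ZONE_APPEND ||
	    bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
		bio->bi_iter.bi_sector = blk_rq_pos(req);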

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 weeks agoloop: Add sanity check for read/write_iter
Lizhi Xu [Mon, 28 Apr 2025 14:36:26 +0000 (22:36 +0800)]
loop: Add sanity check for read/write_iter

Some file systems, such as selinuxfs in this issue, do not support
read_iter/write_iter. So before calling them, first confirm that the
interface is supported, and only then call it.

This is relevant in that vfs_iter_read/write have the check, and the
removal of their use allowed syzbot to hit this issue.
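
A minimal sketch of the sanity check (hedged; `file` stands for the loop
device's backing file):

	if (!file->f_op->read_iter || !file->f_op->write_iter)
		return -EINVAL;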

Fixes: f2fed441c69b ("loop: stop using vfs_iter__{read,write} for buffered I/O")
Reported-by: syzbot+6af973a3b8dfd2faefdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6af973a3b8dfd2faefdc
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250428143626.3318717-1-lizhi.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 weeks agoxfs: don't assume perags are initialised when trimming AGs
Dave Chinner [Wed, 30 Apr 2025 23:27:24 +0000 (09:27 +1000)]
xfs: don't assume perags are initialised when trimming AGs

When running fstrim immediately after mounting a V4 filesystem,
the fstrim fails to trim all the free space in the filesystem. It
only trims the first extent in the by-size free space tree in each
AG and then returns. If a second fstrim is then run, it runs
correctly and the entire free space in the filesystem is iterated
and discarded correctly.

The problem lies in the setup of the trim cursor - it assumes that
pag->pagf_longest is valid without either reading the AGF first or
checking if xfs_perag_initialised_agf(pag) is true or not.

As a result, when a filesystem is mounted without reading the AGF
(e.g. a clean mount on a v4 filesystem) and the first operation is an
fstrim call, pag->pagf_longest is zero and so the free extent search
starts at the wrong end of the by-size btree and exits after
discarding the first record in the tree.

Fix this by deferring the initialisation of tcur->count to after
we have locked the AGF and guaranteed that the perag is properly
initialised. We trigger this on tcur->count == 0 after locking the
AGF, as this will only occur on the first call to
xfs_trim_gather_extents() for each AG. If we need to iterate,
tcur->count will be set to the length of the record we need to
restart at, so we can use this to ensure we only sample a valid
pag->pagf_longest value for the iteration.
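
A minimal sketch of the deferred initialisation (not the literal patch;
variable names and error handling are illustrative):

	error = xfs_alloc_read_agf(pag, tp, 0, &agbp);
	if (error)
		return error;
	/* first call for this AG: the perag is now known to be initialised */
	if (!tcur->count)
		tcur->count = pag->pagf_longest;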

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Fixes: 89cfa899608f ("xfs: reduce AGF hold times during fstrim operations")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 weeks agoMerge tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme into block-6.15
Jens Axboe [Thu, 1 May 2025 13:56:02 +0000 (07:56 -0600)]
Merge tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme into block-6.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 6.15

 - fix queue unquiesce check on PCI slot_reset (Keith Busch)
 - fix premature queue removal and I/O failover in nvme-tcp
   (Michael Liang)
 - don't restore null sk_state_change (Alistair Francis)
 - select CONFIG_TLS where needed (Alistair Francis)
 - always free derived key data (Hannes Reinecke)
 - more quirks (Wentao Guan)"

* tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme:
  nvmet-auth: always free derived key data
  nvmet-tcp: don't restore null sk_state_change
  nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
  nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
  nvme-tcp: fix premature queue removal and I/O failover
  nvme-pci: add quirks for WDC Blue SN550 15b7:5009
  nvme-pci: add quirks for device 126f:1001
  nvme-pci: fix queue unquiesce check on slot_reset

6 weeks agoxfs: allow ro mounts if rtdev or logdev are read-only
Hans Holmberg [Wed, 30 Apr 2025 08:35:34 +0000 (08:35 +0000)]
xfs: allow ro mounts if rtdev or logdev are read-only

Allow read-only mounts on rtdevs and logdevs that are marked as
read-only and make sure those mounts can't be remounted read-write.

Use the sb_open_mode helper to make sure that we don't try to open
devices with write access enabled for read-only mounts.
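
A minimal sketch of deriving the open mode from the mount flags (hedged;
the holder arguments are illustrative):

	bdev_file = bdev_file_open_by_path(path, sb_open_mode(sb->s_flags),
					   sb, &fs_holder_ops);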

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 weeks agonvmet-auth: always free derived key data
Hannes Reinecke [Fri, 25 Apr 2025 09:34:34 +0000 (11:34 +0200)]
nvmet-auth: always free derived key data

After calling nvme_auth_derive_tls_psk() we need to free the resulting
psk data, as either TLS is disabled (and we don't need the data anyway)
or the psk data has been copied into the resulting key (and can be freed,
too).

Fixes: fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Suggested-by: Maurizio Lombardi <mlombard@bsdbackstore.eu>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 weeks agonvmet-tcp: don't restore null sk_state_change
Alistair Francis [Wed, 23 Apr 2025 06:06:21 +0000 (16:06 +1000)]
nvmet-tcp: don't restore null sk_state_change

queue->state_change is set as part of nvmet_tcp_set_queue_sock(), but if
the TCP connection isn't established when nvmet_tcp_set_queue_sock() is
called then queue->state_change isn't set and sock->sk->sk_state_change
isn't replaced.

As such we don't need to restore sock->sk->sk_state_change if
queue->state_change is NULL.
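
A minimal sketch of the guarded restore:

	/* only restore it if nvmet_tcp_set_queue_sock() replaced it */
	if (queue->state_change)
		sock->sk->sk_state_change = queue->state_change;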

This avoids NULL pointer dereferences such as this:

[  286.462026][    C0] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  286.462814][    C0] #PF: supervisor instruction fetch in kernel mode
[  286.463796][    C0] #PF: error_code(0x0010) - not-present page
[  286.464392][    C0] PGD 8000000140620067 P4D 8000000140620067 PUD 114201067 PMD 0
[  286.465086][    C0] Oops: Oops: 0010 [#1] SMP KASAN PTI
[  286.465559][    C0] CPU: 0 UID: 0 PID: 1628 Comm: nvme Not tainted 6.15.0-rc2+ #11 PREEMPT(voluntary)
[  286.466393][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[  286.467147][    C0] RIP: 0010:0x0
[  286.467420][    C0] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[  286.467977][    C0] RSP: 0018:ffff8883ae008580 EFLAGS: 00010246
[  286.468425][    C0] RAX: 0000000000000000 RBX: ffff88813fd34100 RCX: ffffffffa386cc43
[  286.469019][    C0] RDX: 1ffff11027fa68b6 RSI: 0000000000000008 RDI: ffff88813fd34100
[  286.469545][    C0] RBP: ffff88813fd34160 R08: 0000000000000000 R09: ffffed1027fa682c
[  286.470072][    C0] R10: ffff88813fd34167 R11: 0000000000000000 R12: ffff88813fd344c3
[  286.470585][    C0] R13: ffff88813fd34112 R14: ffff88813fd34aec R15: ffff888132cdd268
[  286.471070][    C0] FS:  00007fe3c04c7d80(0000) GS:ffff88840743f000(0000) knlGS:0000000000000000
[  286.471644][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  286.472543][    C0] CR2: ffffffffffffffd6 CR3: 000000012daca000 CR4: 00000000000006f0
[  286.473500][    C0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  286.474467][    C0] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400
[  286.475453][    C0] Call Trace:
[  286.476102][    C0]  <IRQ>
[  286.476719][    C0]  tcp_fin+0x2bb/0x440
[  286.477429][    C0]  tcp_data_queue+0x190f/0x4e60
[  286.478174][    C0]  ? __build_skb_around+0x234/0x330
[  286.478940][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.479659][    C0]  ? __pfx_tcp_data_queue+0x10/0x10
[  286.480431][    C0]  ? tcp_try_undo_loss+0x640/0x6c0
[  286.481196][    C0]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[  286.482046][    C0]  ? kvm_clock_get_cycles+0x14/0x30
[  286.482769][    C0]  ? ktime_get+0x66/0x150
[  286.483433][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.484146][    C0]  tcp_rcv_established+0x6e4/0x2050
[  286.484857][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.485523][    C0]  ? ipv4_dst_check+0x160/0x2b0
[  286.486203][    C0]  ? __pfx_tcp_rcv_established+0x10/0x10
[  286.486917][    C0]  ? lock_release+0x217/0x2c0
[  286.487595][    C0]  tcp_v4_do_rcv+0x4d6/0x9b0
[  286.488279][    C0]  tcp_v4_rcv+0x2af8/0x3e30
[  286.488904][    C0]  ? raw_local_deliver+0x51b/0xad0
[  286.489551][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.490198][    C0]  ? __pfx_tcp_v4_rcv+0x10/0x10
[  286.490813][    C0]  ? __pfx_raw_local_deliver+0x10/0x10
[  286.491487][    C0]  ? __pfx_nf_confirm+0x10/0x10 [nf_conntrack]
[  286.492275][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.492900][    C0]  ip_protocol_deliver_rcu+0x8f/0x370
[  286.493579][    C0]  ip_local_deliver_finish+0x297/0x420
[  286.494268][    C0]  ip_local_deliver+0x168/0x430
[  286.494867][    C0]  ? __pfx_ip_local_deliver+0x10/0x10
[  286.495498][    C0]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[  286.496204][    C0]  ? ip_rcv_finish_core+0x19a/0x1f20
[  286.496806][    C0]  ? lock_release+0x217/0x2c0
[  286.497414][    C0]  ip_rcv+0x455/0x6e0
[  286.497945][    C0]  ? __pfx_ip_rcv+0x10/0x10
[  286.498550][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.499137][    C0]  ? __pfx_ip_rcv_finish+0x10/0x10
[  286.499763][    C0]  ? lock_release+0x217/0x2c0
[  286.500327][    C0]  ? dl_scaled_delta_exec+0xd1/0x2c0
[  286.500922][    C0]  ? __pfx_ip_rcv+0x10/0x10
[  286.501480][    C0]  __netif_receive_skb_one_core+0x166/0x1b0
[  286.502173][    C0]  ? __pfx___netif_receive_skb_one_core+0x10/0x10
[  286.502903][    C0]  ? lock_acquire+0x2b2/0x310
[  286.503487][    C0]  ? process_backlog+0x372/0x1350
[  286.504087][    C0]  ? lock_release+0x217/0x2c0
[  286.504642][    C0]  process_backlog+0x3b9/0x1350
[  286.505214][    C0]  ? process_backlog+0x372/0x1350
[  286.505779][    C0]  __napi_poll.constprop.0+0xa6/0x490
[  286.506363][    C0]  net_rx_action+0x92e/0xe10
[  286.506889][    C0]  ? __pfx_net_rx_action+0x10/0x10
[  286.507437][    C0]  ? timerqueue_add+0x1f0/0x320
[  286.507977][    C0]  ? sched_clock_cpu+0x68/0x540
[  286.508492][    C0]  ? lock_acquire+0x2b2/0x310
[  286.509043][    C0]  ? kvm_sched_clock_read+0xd/0x20
[  286.509607][    C0]  ? handle_softirqs+0x1aa/0x7d0
[  286.510187][    C0]  handle_softirqs+0x1f2/0x7d0
[  286.510754][    C0]  ? __pfx_handle_softirqs+0x10/0x10
[  286.511348][    C0]  ? irqtime_account_irq+0x181/0x290
[  286.511937][    C0]  ? __dev_queue_xmit+0x85d/0x3450
[  286.512510][    C0]  do_softirq.part.0+0x89/0xc0
[  286.513100][    C0]  </IRQ>
[  286.513548][    C0]  <TASK>
[  286.513953][    C0]  __local_bh_enable_ip+0x112/0x140
[  286.514522][    C0]  ? __dev_queue_xmit+0x85d/0x3450
[  286.515072][    C0]  __dev_queue_xmit+0x872/0x3450
[  286.515619][    C0]  ? nft_do_chain+0xe16/0x15b0 [nf_tables]
[  286.516252][    C0]  ? __pfx___dev_queue_xmit+0x10/0x10
[  286.516817][    C0]  ? selinux_ip_postroute+0x43c/0xc50
[  286.517433][    C0]  ? __pfx_selinux_ip_postroute+0x10/0x10
[  286.518061][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.518606][    C0]  ? ip_output+0x164/0x4a0
[  286.519149][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.519671][    C0]  ? ip_finish_output2+0x17d5/0x1fb0
[  286.520258][    C0]  ip_finish_output2+0xb4b/0x1fb0
[  286.520787][    C0]  ? __pfx_ip_finish_output2+0x10/0x10
[  286.521355][    C0]  ? __ip_finish_output+0x15d/0x750
[  286.521890][    C0]  ip_output+0x164/0x4a0
[  286.522372][    C0]  ? __pfx_ip_output+0x10/0x10
[  286.522872][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.523402][    C0]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[  286.524031][    C0]  ? __pfx_ip_finish_output+0x10/0x10
[  286.524605][    C0]  ? __ip_queue_xmit+0x999/0x2260
[  286.525200][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.525744][    C0]  ? ipv4_dst_check+0x16a/0x2b0
[  286.526279][    C0]  ? lock_release+0x217/0x2c0
[  286.526793][    C0]  __ip_queue_xmit+0x1883/0x2260
[  286.527324][    C0]  ? __skb_clone+0x54c/0x730
[  286.527827][    C0]  __tcp_transmit_skb+0x209b/0x37a0
[  286.528374][    C0]  ? __pfx___tcp_transmit_skb+0x10/0x10
[  286.528952][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.529472][    C0]  ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[  286.530152][    C0]  ? trace_hardirqs_on+0x12/0x120
[  286.530691][    C0]  tcp_write_xmit+0xb81/0x88b0
[  286.531224][    C0]  ? mod_memcg_state+0x4d/0x60
[  286.531736][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.532253][    C0]  __tcp_push_pending_frames+0x90/0x320
[  286.532826][    C0]  tcp_send_fin+0x141/0xb50
[  286.533352][    C0]  ? __pfx_tcp_send_fin+0x10/0x10
[  286.533908][    C0]  ? __local_bh_enable_ip+0xab/0x140
[  286.534495][    C0]  inet_shutdown+0x243/0x320
[  286.535077][    C0]  nvme_tcp_alloc_queue+0xb3b/0x2590 [nvme_tcp]
[  286.535709][    C0]  ? do_raw_spin_lock+0x129/0x260
[  286.536314][    C0]  ? __pfx_nvme_tcp_alloc_queue+0x10/0x10 [nvme_tcp]
[  286.536996][    C0]  ? do_raw_spin_unlock+0x54/0x1e0
[  286.537550][    C0]  ? _raw_spin_unlock+0x29/0x50
[  286.538127][    C0]  ? do_raw_spin_lock+0x129/0x260
[  286.538664][    C0]  ? __pfx_do_raw_spin_lock+0x10/0x10
[  286.539249][    C0]  ? nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp]
[  286.539892][    C0]  ? __wake_up+0x40/0x60
[  286.540392][    C0]  nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp]
[  286.541047][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.541589][    C0]  nvme_tcp_setup_ctrl+0x8b/0x7a0 [nvme_tcp]
[  286.542254][    C0]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[  286.542887][    C0]  ? __pfx_nvme_tcp_setup_ctrl+0x10/0x10 [nvme_tcp]
[  286.543568][    C0]  ? trace_hardirqs_on+0x12/0x120
[  286.544166][    C0]  ? _raw_spin_unlock_irqrestore+0x35/0x60
[  286.544792][    C0]  ? nvme_change_ctrl_state+0x196/0x2e0 [nvme_core]
[  286.545477][    C0]  nvme_tcp_create_ctrl+0x839/0xb90 [nvme_tcp]
[  286.546126][    C0]  nvmf_dev_write+0x3db/0x7e0 [nvme_fabrics]
[  286.546775][    C0]  ? rw_verify_area+0x69/0x520
[  286.547334][    C0]  vfs_write+0x218/0xe90
[  286.547854][    C0]  ? do_syscall_64+0x9f/0x190
[  286.548408][    C0]  ? trace_hardirqs_on_prepare+0xdb/0x120
[  286.549037][    C0]  ? syscall_exit_to_user_mode+0x93/0x280
[  286.549659][    C0]  ? __pfx_vfs_write+0x10/0x10
[  286.550259][    C0]  ? do_syscall_64+0x9f/0x190
[  286.550840][    C0]  ? syscall_exit_to_user_mode+0x8e/0x280
[  286.551516][    C0]  ? trace_hardirqs_on_prepare+0xdb/0x120
[  286.552180][    C0]  ? syscall_exit_to_user_mode+0x93/0x280
[  286.552834][    C0]  ? ksys_read+0xf5/0x1c0
[  286.553386][    C0]  ? __pfx_ksys_read+0x10/0x10
[  286.553964][    C0]  ksys_write+0xf5/0x1c0
[  286.554499][    C0]  ? __pfx_ksys_write+0x10/0x10
[  286.555072][    C0]  ? trace_hardirqs_on_prepare+0xdb/0x120
[  286.555698][    C0]  ? syscall_exit_to_user_mode+0x93/0x280
[  286.556319][    C0]  ? do_syscall_64+0x54/0x190
[  286.556866][    C0]  do_syscall_64+0x93/0x190
[  286.557420][    C0]  ? rcu_read_unlock+0x17/0x60
[  286.557986][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.558526][    C0]  ? lock_release+0x217/0x2c0
[  286.559087][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.559659][    C0]  ? count_memcg_events.constprop.0+0x4a/0x60
[  286.560476][    C0]  ? exc_page_fault+0x7a/0x110
[  286.561064][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.561647][    C0]  ? lock_release+0x217/0x2c0
[  286.562257][    C0]  ? do_user_addr_fault+0x171/0xa00
[  286.562839][    C0]  ? do_user_addr_fault+0x4a2/0xa00
[  286.563453][    C0]  ? irqentry_exit_to_user_mode+0x84/0x270
[  286.564112][    C0]  ? rcu_is_watching+0x11/0xb0
[  286.564677][    C0]  ? irqentry_exit_to_user_mode+0x84/0x270
[  286.565317][    C0]  ? trace_hardirqs_on_prepare+0xdb/0x120
[  286.565922][    C0]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  286.566542][    C0] RIP: 0033:0x7fe3c05e6504
[  286.567102][    C0] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[  286.568931][    C0] RSP: 002b:00007fff76444f58 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[  286.569807][    C0] RAX: ffffffffffffffda RBX: 000000003b40d930 RCX: 00007fe3c05e6504
[  286.570621][    C0] RDX: 00000000000000cf RSI: 000000003b40d930 RDI: 0000000000000003
[  286.571443][    C0] RBP: 0000000000000003 R08: 00000000000000cf R09: 000000003b40d930
[  286.572246][    C0] R10: 0000000000000000 R11: 0000000000000202 R12: 000000003b40cd60
[  286.573069][    C0] R13: 00000000000000cf R14: 00007fe3c07417f8 R15: 00007fe3c073502e
[  286.573886][    C0]  </TASK>

Closes: https://lore.kernel.org/linux-nvme/5hdonndzoqa265oq3bj6iarwtfk5dewxxjtbjvn5uqnwclpwt6@a2n6w3taxxex/
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 weeks agonvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
Alistair Francis [Tue, 29 Apr 2025 22:23:47 +0000 (08:23 +1000)]
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS

Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TARGET_TCP_TLS is enabled. Without this the code compiles,
but does not actually work unless something else enables CONFIG_TLS.

Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 weeks agonvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
Alistair Francis [Tue, 29 Apr 2025 22:40:25 +0000 (08:40 +1000)]
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS

Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does
not actually work unless something else enables CONFIG_TLS.

Fixes: be8e82caa68 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
6 weeks agonvme-tcp: fix premature queue removal and I/O failover
Michael Liang [Tue, 29 Apr 2025 16:42:01 +0000 (10:42 -0600)]
nvme-tcp: fix premature queue removal and I/O failover

This patch addresses a data corruption issue observed in nvme-tcp during
testing.

In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.

However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC's TX queue). From the
user-space application's perspective, the buffer associated with the I/O
is considered completed, since the I/O was acknowledged on the different
path, and may be reused for new I/O requests.

Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.

We can reproduce this data corruption by injecting a delay on one path
and triggering an I/O timeout.

To prevent this issue, this change ensures that all inflight
transmissions are fully completed from the host's perspective before
returning from queue stop. To handle concurrent I/O timeouts from multiple
namespaces under the same controller, always wait in queue stop
regardless of the queue's state.

This aligns with the behavior of queue stopping in other NVMe fabric
transports.

Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
7 weeks agonvme-pci: add quirks for WDC Blue SN550 15b7:5009
Wentao Guan [Thu, 24 Apr 2025 02:40:10 +0000 (10:40 +0800)]
nvme-pci: add quirks for WDC Blue SN550 15b7:5009

Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user
reports and hardware analysis:

 - NVME_QUIRK_NO_DEEPEST_PS:
liaozw reported the problem to me and worked around it with
nvme_core.default_ps_max_latency_us=0, so add the quirk.
I also found some reports in the links below.

 - NVME_QUIRK_BROKEN_MSI:
based on the lspci output from Jack Rio, I think that the disk also
needs NVME_QUIRK_BROKEN_MSI, as described in commit d5887dc6b6c0
("nvme-pci: Add quirk for broken MSIs") and as sean said in the linked
thread, which matches the MSI 1/32 and MSI-X 17 capabilities.
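
A minimal sketch of the corresponding nvme_id_table entry:

	{ PCI_DEVICE(0x15b7, 0x5009),	/* WDC Blue SN550 */
		.driver_data = NVME_QUIRK_NO_DEEPEST_PS |
				NVME_QUIRK_BROKEN_MSI, },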

Log:
lspci -nn | grep -i memory
03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01)
lspci -v -d 15b7:5009
03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD
        Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10
        Memory at fe800000 (64-bit, non-prefetchable) [size=16K]
        Memory at fe804000 (64-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
        Capabilities: [c0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [1b8] Latency Tolerance Reporting
        Capabilities: [300] Secondary PCI Express
        Capabilities: [900] L1 PM Substates
        Kernel driver in use: nvme
dmesg | grep nvme
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[    0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[    0.542430] nvme nvme0: pci function 0000:03:00.0
[    0.560426] nvme nvme0: allocated 32 MiB host memory buffer.
[    0.562491] nvme nvme0: 16/0/0 default/read/poll queues
[    0.567764]  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
[    6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none.
[    6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none.
[    7.125419] Adding 16777212k swap on /dev/nvme0n1p8.  Priority:-2 extents:1 across:16777212k SS
[    7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none.
[    7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none.
[    8.036932] nvme nvme0: using unchecked data buffer
[    8.096023] block nvme0n1: No UUID available providing old NGUID

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5887dc6b6c054d0da3cd053afc15b7be1f45ff6
Link: https://lore.kernel.org/all/20240422162822.3539156-1-sean.anderson@linux.dev/
Reported-by: liaozw <hedgehog-002@163.com>
Closes: https://bbs.deepin.org.cn/post/286300
Reported-by: rugk <rugk+github@posteo.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=208123
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
7 weeks agonvme-pci: add quirks for device 126f:1001
Wentao Guan [Tue, 22 Apr 2025 12:17:25 +0000 (20:17 +0800)]
nvme-pci: add quirks for device 126f:1001

This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:1001].

It is similar to commit e89086c43f05 ("drivers/nvme: Add quirks for
device 126f:2262").

The difference, based on the dmesg output below, is the additional use of
NVME_QUIRK_IGNORE_DEV_SUBNQN.

dmesg | grep -i nvme0:
  nvme nvme0: pci function 0000:01:00.0
  nvme nvme0: missing or invalid SUBNQN field.
  nvme nvme0: 12/0/0 default/read/poll queues

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e89086c43f0500bc7c4ce225495b73b8ce234c1f
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
7 weeks agonvme-pci: fix queue unquiesce check on slot_reset
Keith Busch [Thu, 24 Apr 2025 17:18:01 +0000 (10:18 -0700)]
nvme-pci: fix queue unquiesce check on slot_reset

A zero return means the reset was successfully scheduled. We don't want
to unquiesce the queues while the reset_work is pending, as that will
just flush out requeued requests to a failed completion.

Fixes: 71a5bb153be104 ("nvme: ensure disabling pairs with unquiesce")
Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
7 weeks agoublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
Ming Lei [Tue, 29 Apr 2025 02:29:39 +0000 (10:29 +0800)]
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req

__ublk_check_and_get_req() is only called from ublk_check_and_get_req()
and ublk_register_io_buf(), the same check has been covered in the two
calling sites.

So remove the check from __ublk_check_and_get_req().

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoublk: enhance check for register/unregister io buffer command
Ming Lei [Tue, 29 Apr 2025 02:29:38 +0000 (10:29 +0800)]
ublk: enhance check for register/unregister io buffer command

The simple check of UBLK_IO_FLAG_OWNED_BY_SRV can easily avoid incorrect
io buffer register/unregister, so check it before starting to
register/unregister the io buffer.

Also only allow io buffer register/unregister uring_cmd in case of
UBLK_F_SUPPORT_ZERO_COPY.

Also mark argument 'ublk_queue *' of ublk_register_io_buf as const.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoublk: decouple zero copy from user copy
Ming Lei [Tue, 29 Apr 2025 02:29:37 +0000 (10:29 +0800)]
ublk: decouple zero copy from user copy

UBLK_F_USER_COPY and UBLK_F_SUPPORT_ZERO_COPY are two different
features, and shouldn't be coupled together.

Commit 1f6540e2aabb ("ublk: zc register/unregister bvec") enables
user copy automatically in case of UBLK_F_SUPPORT_ZERO_COPY, which
isn't correct.

So decouple zero copy from user copy, and use independent helpers to
check each one.
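
One plausible shape for the independent helpers (hypothetical names,
assuming the queue carries the feature flags):

	static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
	{
		return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
	}

	static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
	{
		return ubq->flags & UBLK_F_USER_COPY;
	}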

Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoselftests: ublk: fix UBLK_F_NEED_GET_DATA
Ming Lei [Tue, 29 Apr 2025 02:29:36 +0000 (10:29 +0800)]
selftests: ublk: fix UBLK_F_NEED_GET_DATA

Commit 57e13a2e8cd2 ("selftests: ublk: support user recovery") starts to
support UBLK_F_NEED_GET_DATA for covering the recovery feature; however,
the ublk utility implementation isn't done correctly.

Fix it by supporting UBLK_F_NEED_GET_DATA correctly.

Also add test generic_07 for covering UBLK_F_NEED_GET_DATA.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoxfs: stop using set_blocksize
Darrick J. Wong [Wed, 23 Apr 2025 19:54:13 +0000 (12:54 -0700)]
xfs: stop using set_blocksize

XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data and use that instead.  No more truncating the
pagecache because XFS does not use it or care about it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
7 weeks agoMerge remote-tracking branch 'linux-block/block-6.15' into xfs tree
Carlos Maiolino [Mon, 28 Apr 2025 09:32:06 +0000 (11:32 +0200)]
Merge remote-tracking branch 'linux-block/block-6.15' into xfs tree

We need two patches inside linux-block tree as dependencies of the patch
which will follow this merge.

Specifically, we need:

block: fix race between set_blocksize and read paths
block: hoist block size validation code to a separate function

Signed-off-by: Carlos Maiolino <cem@kernel.org>
7 weeks agoublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
Ming Lei [Fri, 25 Apr 2025 01:37:40 +0000 (09:37 +0800)]
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd

ublk_cancel_cmd() calls io_uring_cmd_done() to complete the uring_cmd, but
we may have already scheduled task work via io_uring_cmd_complete_in_task()
for dispatching the request, and then a kernel crash can be triggered.

Fix it by not trying to cancel the command if the ublk block request has
been started.

Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Reported-by: Jared Holzman <jholzman@nvidia.com>
Tested-by: Jared Holzman <jholzman@nvidia.com>
Closes: https://lore.kernel.org/linux-block/d2179120-171b-47ba-b664-23242981ef19@nvidia.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
Ming Lei [Fri, 25 Apr 2025 01:37:39 +0000 (09:37 +0800)]
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA

We call io_uring_cmd_complete_in_task() to schedule task_work for handling
UBLK_U_IO_NEED_GET_DATA.

This is really not necessary because the current context is already the
ublk queue context, so call ublk_dispatch_req() directly to handle
UBLK_U_IO_NEED_GET_DATA.

Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Tested-by: Jared Holzman <jholzman@nvidia.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: don't autoload drivers on blk-cgroup configuration
Christoph Hellwig [Wed, 23 Apr 2025 05:37:42 +0000 (07:37 +0200)]
block: don't autoload drivers on blk-cgroup configuration

Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes an already existing device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: don't autoload drivers on stat
Christoph Hellwig [Wed, 23 Apr 2025 05:37:41 +0000 (07:37 +0200)]
block: don't autoload drivers on stat

blkdev_get_no_open can trigger the legacy autoload of block drivers.  A
simple stat of a block device has not historically done that, so disable
this behavior again.

Fixes: 9abcfbd235f5 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: remove the backing_inode variable in bdev_statx
Christoph Hellwig [Wed, 23 Apr 2025 05:37:40 +0000 (07:37 +0200)]
block: remove the backing_inode variable in bdev_statx

backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: move blkdev_{get,put} _no_open prototypes out of blkdev.h
Christoph Hellwig [Wed, 23 Apr 2025 05:37:39 +0000 (07:37 +0200)]
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h

These are only to be used by block internal code.  Remove the comment
as we grew more users due to reworking block device node opening.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: never reduce ra_pages in blk_apply_bdi_limits
Christoph Hellwig [Thu, 24 Apr 2025 08:25:21 +0000 (10:25 +0200)]
block: never reduce ra_pages in blk_apply_bdi_limits

When the user increases the read-ahead size through sysfs, this value
currently gets lost if the device is reprobed, including on a resume
from suspend.

As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.

This restores the pre-atomic queue limit behavior in the sd driver, as
sd did not use blk_queue_io_opt and thus never updated the read-ahead
size to the value based on the optimal I/O size, but it changes behavior
for all other drivers. As the new behavior seems useful and sd is the
driver for which the read-ahead size tweaks are most useful, that seems
like a worthwhile trade-off.

Fixes: 804e498e0496 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoselftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
Uday Shankar [Wed, 23 Apr 2025 21:29:03 +0000 (15:29 -0600)]
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils

Some distributions, such as centos stream 9, still have a version of
coreutils which does not yet support the %Hr and %Lr formats for stat(1)
[1, 2]. Running ublk selftests on these distributions results in the
following error in tests that use the _get_disk_dev_t helper:

line 23: ?r: syntax error: operand expected (error token is "?r")

To better accommodate older distributions, rewrite _get_disk_dev_t to
use the much older %t and %T formats for stat instead.

[1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114
[2] https://pkgs.org/download/coreutils

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoMerge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15
Jens Axboe [Thu, 24 Apr 2025 12:27:54 +0000 (06:27 -0600)]
Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15

Pull NVMe fix from Christoph:

"nvme fixes for Linux 6.15

 - fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)"

* tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme:
  nvmet: fix out-of-bounds access in nvmet_enable_port

7 weeks agoselftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
Ming Lei [Mon, 21 Apr 2025 23:59:42 +0000 (07:59 +0800)]
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'

'delay_us' shouldn't be added to 'struct dev_ctx' since now it is
handled by per-target command line & 'struct fault_inject_ctx'.

So remove it.

Fixes: 81586652bb1f ("selftests: ublk: add generic_06 for covering fault inject")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoselftests: ublk: fix recover test
Ming Lei [Mon, 21 Apr 2025 23:59:41 +0000 (07:59 +0800)]
selftests: ublk: fix recover test

When adding recovery test:

- 'break' is missed for handling '-g' argument

- test name of test_generic_05.sh is wrong

So fix the two.

Fixes: 57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: hoist block size validation code to a separate function
Darrick J. Wong [Wed, 23 Apr 2025 19:53:57 +0000 (12:53 -0700)]
block: hoist block size validation code to a separate function

Hoist the block size validation code to bdev_validate_blocksize so that
we can call it from filesystems that don't care about the bdev pagecache
manipulations of set_blocksize.
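
A minimal sketch of a caller that wants only the checks (hedged):

	error = bdev_validate_blocksize(bdev, blocksize);
	if (error)
		return error;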

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 weeks agoblock: fix race between set_blocksize and read paths
Darrick J. Wong [Wed, 23 Apr 2025 19:53:42 +0000 (12:53 -0700)]
block: fix race between set_blocksize and read paths

With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.

Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device.  The read call can create an order-0 folio to
read the first 4096 bytes from the disk.  But then udev is preempted.

Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device.  The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.

Now udev resumes, still holding the order-0 folio it allocated.  It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio.  Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev.  We then submit the bio with a
NULL block device and crash.

Therefore, truncate the page cache after flushing but before updating
i_blksize.  However, that's not enough -- we also need to lock out file
IO and page faults during the update.  Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.

I don't know if this is the correct fix, but xfs/259 found it.
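
A minimal sketch of the ordering this implies for set_blocksize()
(hedged; `inode` is the bdev inode, locks shown in exclusive mode):

	inode_lock(inode);
	filemap_invalidate_lock(inode->i_mapping);

	sync_blockdev(bdev);				/* flush dirty data */
	truncate_inode_pages(inode->i_mapping, 0);	/* then drop stale folios */
	inode->i_blkbits = blksize_bits(size);		/* only now update i_blksize */

	filemap_invalidate_unlock(inode->i_mapping);
	inode_unlock(inode);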

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 weeks agoxfs: remove duplicate Zoned Filesystems sections in admin-guide
Hans Holmberg [Tue, 22 Apr 2025 11:50:07 +0000 (11:50 +0000)]
xfs: remove duplicate Zoned Filesystems sections in admin-guide

Remove the duplicated section and while at it, turn spaces into tabs.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Fixes: c7b67ddc3c99 ("xfs: document zoned rt specifics in admin-guide")
Signed-off-by: Carlos Maiolino <cem@kernel.org>
8 weeks agoXFS: fix zoned gc threshold math for 32-bit arches
Carlos Maiolino [Tue, 22 Apr 2025 12:54:54 +0000 (14:54 +0200)]
XFS: fix zoned gc threshold math for 32-bit arches

xfs_zoned_need_gc makes use of mult_frac() to calculate the threshold
for triggering the zoned garbage collector, but it turns out mult_frac()
doesn't work properly with 64-bit data types, and this caused build
failures on some 32-bit architectures.

Fix this by essentially open-coding mult_frac() in a 64-bit friendly
way.

Notice we don't need to bother with counters underflow here because
xfs_estimate_freecounter() will always return a positive value, as it
leverages percpu_counter_read_positive to read such counters.
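
A minimal sketch of the open-coded form (assuming u64 x and 32-bit
num/den), mirroring mult_frac()'s (x / d) * n + (x % d) * n / d split
while staying 64-bit safe:

	u64 quot = div_u64(x, den);
	u64 rem = x - quot * den;

	threshold = quot * num + div_u64(rem * num, den);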

Fixes: 845abeb1f06a ("xfs: add tunable threshold parameter for triggering zone GC")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202504181233.F7D9Atra-lkp@intel.com/
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
8 weeks agonvmet: fix out-of-bounds access in nvmet_enable_port
Richard Weinberger [Fri, 18 Apr 2025 08:02:50 +0000 (10:02 +0200)]
nvmet: fix out-of-bounds access in nvmet_enable_port

When trying to enable a port that has no transport configured yet,
nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports
array, causing an out-of-bounds access:

[  106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da
[  106.058719] Read of size 8 at addr ffffffff89dafa58 by task ln/632
[...]
[  106.076026] nvmet: transport type 255 not supported

Since commit 200adac75888, NVMF_TRTYPE_MAX is the default state as configured by
nvmet_ports_make().
Avoid this by checking for NVMF_TRTYPE_MAX before proceeding.
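
A minimal sketch of the guard at the top of nvmet_enable_port():

	if (port->disc_addr.trtype == NVMF_TRTYPE_MAX)
		return -EINVAL;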

Fixes: 200adac75888 ("nvme: Add PCI transport type")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2 months agoMerge tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme into block-6.15
Jens Axboe [Thu, 17 Apr 2025 12:18:49 +0000 (06:18 -0600)]
Merge tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme into block-6.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 6.15

 - fix scan failure for non-ANA multipath controllers (Hannes Reinecke)
 - fix multipath sysfs links creation for some cases (Hannes Reinecke)
 - PCIe endpoint fixes (Damien Le Moal)
 - use NULL instead of 0 in the auth code (Damien Le Moal)"

* tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme:
  nvmet: pci-epf: cleanup link state management
  nvmet: pci-epf: clear CC and CSTS when disabling the controller
  nvmet: pci-epf: always fully initialize completion entries
  nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()
  nvme-multipath: sysfs links may not be created for devices
  nvme: fixup scan failure for non-ANA multipath controllers

2 months agoxfs: document zoned rt specifics in admin-guide
Hans Holmberg [Wed, 9 Apr 2025 12:39:56 +0000 (12:39 +0000)]
xfs: document zoned rt specifics in admin-guide

Document the lifetime, nolifetime and max_open_zones mount options
added for zoned rt file systems.

Also add documentation describing the max_open_zones sysfs attribute
exposed in /sys/fs/xfs/<dev>/zoned/

Fixes: 4e4d52075577 ("xfs: add the zoned space allocator")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2 months agoMerge tag 'md-6.15-20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdrai...
Jens Axboe [Thu, 17 Apr 2025 03:08:28 +0000 (21:08 -0600)]
Merge tag 'md-6.15-20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into block-6.15

Pull MD fixes from Yu:

"- fix raid10 missing discard IO accounting (Yu Kuai)
 - fix bitmap stats for bitmap file (Zheng Qixing)
 - fix oops while reading all member disks failed during check/repair
   (Meir Elisha)"

* tag 'md-6.15-20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid1: Add check for missing source disk in process_checks()
  md/md-bitmap: fix stats collection for external bitmaps
  md/raid10: fix missing discard IO accounting

2 months agoselftests: ublk: add generic_06 for covering fault inject
Uday Shankar [Wed, 16 Apr 2025 03:54:42 +0000 (11:54 +0800)]
selftests: ublk: add generic_06 for covering fault inject

Add one simple fault inject target, and verify that an application using a
ublk device sees an I/O error quickly after the ublk server dies.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: simplify aborting ublk request
Ming Lei [Wed, 16 Apr 2025 03:54:41 +0000 (11:54 +0800)]
ublk: simplify aborting ublk request

Now ublk_abort_queue() is moved to the ublk char device release handler;
meantime our request queue is "quiesced" because either ->canceling was
set from the uring_cmd cancel function or all IOs are inflight and can't
be completed by the ublk server. Things become much easier:

- all uring_cmd are done, so we needn't mark io as UBLK_IO_FLAG_ABORTED
for handling completion from uring_cmd

- the ublk char device is closed, so no one can hold an IO request reference
any more, and we can simply complete the request or requeue it for ublk_nosrv_should_reissue_outstanding.

Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: remove __ublk_quiesce_dev()
Ming Lei [Wed, 16 Apr 2025 03:54:40 +0000 (11:54 +0800)]
ublk: remove __ublk_quiesce_dev()

Remove __ublk_quiesce_dev() and open code updating the device state to
QUIESCED.

We don't need to drain inflight requests in __ublk_quiesce_dev() any more,
because all inflight requests are aborted in the ublk char device release
handler.

Also we don't need to set ->canceling in __ublk_quiesce_dev() any more,
because it is done unconditionally now in ublk_ch_release().

Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: improve detection and handling of ublk server exit
Uday Shankar [Wed, 16 Apr 2025 03:54:39 +0000 (11:54 +0800)]
ublk: improve detection and handling of ublk server exit

There are currently two ways in which ublk server exit is detected by
ublk_drv:

1. uring_cmd cancellation. If there are any outstanding uring_cmds which
   have not been completed to the ublk server when it exits, io_uring
   calls the uring_cmd callback with a special cancellation flag as the
   issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
   "saturated queue" case, when all I/Os for a given queue are in the
   ublk server, and therefore there are no outstanding uring_cmds to
   cancel when the ublk server exits.

There are a couple of issues with this approach:

- It is complex and inelegant to have two methods to detect the same
  condition
- The second method detects ublk server exit only after a long delay
  (~30s, the default timeout assigned by the block layer). This delays
  the nosrv behavior from kicking in and potential subsequent recovery
  of the device.

The second issue is brought to light with the new test_generic_06, which
will be added in a following patch. It fails before this fix:

selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_06 : [FAIL]

Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:

- This one place can handle both saturated and unsaturated queues. Thus,
  it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
  needed if we want to relax restrictions in the driver like letting
  only one thread serve each queue

There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.

With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:

selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: move device reset into ublk_ch_release()
Ming Lei [Wed, 16 Apr 2025 03:54:38 +0000 (11:54 +0800)]
ublk: move device reset into ublk_ch_release()

ublk_ch_release() is called after the ublk char device is closed, when all
uring_cmd are done, so it is perfectly fine to move the ublk device reset
from ublk_ctrl_start_recovery() to ublk_ch_release().

This avoids holding the exiting daemon's task_struct for too long.

However, reset of the following ublk IO flags has to be moved until ublk
io_uring queues are ready:

- ubq->canceling

For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device
is recovered

- ubq->fail_io

For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is
recovered

- ublk_io->flags

For preventing using io->cmd

With this way, recovery is simplified a lot.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
Ming Lei [Wed, 16 Apr 2025 03:54:37 +0000 (11:54 +0800)]
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io

Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping the request
queue quiesced. This way is fragile because the queue quiesce crosses
syscalls or process contexts.

Switch to relying on ubq->canceling for dealing with
ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose
during io_uring context exit, and it can be reused before recovering too.
In ublk_queue_rq(), the request will be added to the requeue list without
kicking off requeue in case of ubq->canceling, and finally the requests on
the requeue list will be dispatched from either ublk_stop_dev() or
ublk_ctrl_end_recovery().

Meanwhile, we have to move the reset of ubq->canceling from
ublk_ctrl_start_recovery() to ublk_ctrl_end_recovery(), when IO handling
can be recovered completely.

Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
in the same context.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416035444.99569-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: add ublk_force_abort_dev()
Ming Lei [Wed, 16 Apr 2025 03:54:36 +0000 (11:54 +0800)]
ublk: add ublk_force_abort_dev()

Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io()
in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in
single function.

Meanwhile, don't change the device state to QUIESCED any more, since the
disk is going to be removed soon.

Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoublk: properly serialize all FETCH_REQs
Uday Shankar [Wed, 16 Apr 2025 03:54:35 +0000 (11:54 +0800)]
ublk: properly serialize all FETCH_REQs

Most uring_cmds issued against ublk character devices are serialized
because each command affects only one queue, and there is an early check
which only allows a single task (the queue's ubq_daemon) to issue
uring_cmds against that queue. However, this mechanism does not work for
FETCH_REQs, since they are expected before ubq_daemon is set. Since
FETCH_REQs are only used at initialization and not in the fast path,
serialize them using the per-ublk-device mutex. This fixes a number of
data races that were previously possible if a badly behaved ublk server
decided to issue multiple FETCH_REQs against the same qid/tag
concurrently.
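
A minimal sketch of the serialization (hedged; `ublk_fetch` is a
hypothetical name for the FETCH_REQ handling that now runs under the
per-device mutex):

	mutex_lock(&ub->mutex);
	ret = ublk_fetch(cmd, ubq, io, ub);
	mutex_unlock(&ub->mutex);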

Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: move creating UBLK_TMP into _prep_test()
Ming Lei [Sat, 12 Apr 2025 02:30:29 +0000 (10:30 +0800)]
selftests: ublk: move creating UBLK_TMP into _prep_test()

A test may exit early before calling _prep_test(), because of a missing
program or a missing required feature, and in that case $UBLK_TMP isn't
cleaned up.

Fix this by moving the creation of $UBLK_TMP into _prep_test(); any resource
created from _prep_test() onwards is cleaned up by _cleanup_test().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-14-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: add test_stress_05.sh
Ming Lei [Sat, 12 Apr 2025 02:30:28 +0000 (10:30 +0800)]
selftests: ublk: add test_stress_05.sh

Add test_stress_05.sh to cover removing a device with recovery
enabled.

An IO hang has been observed with the following patch:

https://lore.kernel.org/linux-block/20250403-ublk_timeout-v3-1-aa09f76c7451@purestorage.com/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-13-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: support user recovery
Ming Lei [Sat, 12 Apr 2025 02:30:27 +0000 (10:30 +0800)]
selftests: ublk: support user recovery

Add support for the user recovery feature.

Meanwhile add user recovery tests: generic_04 and generic_05 (zero copy).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-12-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: support target specific command line
Ming Lei [Sat, 12 Apr 2025 02:30:26 +0000 (10:30 +0800)]
selftests: ublk: support target specific command line

Support target-specific command lines, making the related command line
handling code cleaner and more readable.

This also helps with adding new features.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: increase max nr_queues and queue depth
Ming Lei [Sat, 12 Apr 2025 02:30:25 +0000 (10:30 +0800)]
selftests: ublk: increase max nr_queues and queue depth

Increase the max nr_queues to 32 and the max queue depth to 1024.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: set queue pthread's cpu affinity
Ming Lei [Sat, 12 Apr 2025 02:30:24 +0000 (10:30 +0800)]
selftests: ublk: set queue pthread's cpu affinity

On NUMA machines, ublk IO performance is very sensitive to the queue
pthread's affinity setting.

Retrieve the queue's affinity mask and pick its first CPU as the queue
thread's scheduling affinity; a single-CPU task affinity has been observed
to give stable, good performance when the client application is placed on
a suitable CPU.

Dump this info when adding a ublk device. Use shared memory to communicate
the queue's tid between the parent and the daemon.
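
A self-contained sketch of the two mechanisms, pinning a queue pthread to
the first allowed CPU and publishing its tid through anonymous shared
memory (names are illustrative and error handling is omitted):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* one tid slot per queue, mapped before forking the daemon so both
     * parent and daemon see the same memory */
    static int *q_tids;

    static void setup_tid_shmem(int nr_queues)
    {
            q_tids = mmap(NULL, nr_queues * sizeof(int),
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }

    static void pin_queue_thread(int q_id, const cpu_set_t *allowed)
    {
            cpu_set_t one;
            int cpu;

            /* select the first CPU from the queue's affinity mask */
            for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                    if (CPU_ISSET(cpu, allowed))
                            break;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            pthread_setaffinity_np(pthread_self(), sizeof(one), &one);

            /* publish this queue thread's tid to the parent */
            q_tids[q_id] = (int)syscall(SYS_gettid);
    }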

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
Ming Lei [Sat, 12 Apr 2025 02:30:23 +0000 (10:30 +0800)]
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN

Setting up the ring this way has been observed to be more efficient with a
fast NVMe backing file.
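
With liburing, the setup looks roughly like this (the queue depth is an
arbitrary example value):

    #include <liburing.h>

    struct io_uring ring;

    /* IORING_SETUP_DEFER_TASKRUN requires IORING_SETUP_SINGLE_ISSUER;
     * completions are then only processed when the submitting thread
     * asks for them, which cuts per-IO overhead */
    int ret = io_uring_queue_init(128, &ring,
                                  IORING_SETUP_SINGLE_ISSUER |
                                  IORING_SETUP_DEFER_TASKRUN);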

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: add two stress tests for zero copy feature
Ming Lei [Sat, 12 Apr 2025 02:30:22 +0000 (10:30 +0800)]
selftests: ublk: add two stress tests for zero copy feature

Add stress_03 & stress_04 to cover the zero copy feature.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: run stress tests in parallel
Ming Lei [Sat, 12 Apr 2025 02:30:21 +0000 (10:30 +0800)]
selftests: ublk: run stress tests in parallel

Run the stress tests in parallel; meanwhile add a local shell function to
simplify the two stress tests.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: make sure _add_ublk_dev can return in sub-shell
Ming Lei [Sat, 12 Apr 2025 02:30:20 +0000 (10:30 +0800)]
selftests: ublk: make sure _add_ublk_dev can return in sub-shell

Detach the ublk daemon from the starting process completely via a double
fork and by clearing its process group, so that `_add_ublk_dev` can return
from a sub-shell.

This makes it friendlier to write shell test scripts that add and recover
ublk devices.

This prepares for running ublk tests in parallel.
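
The classic double-fork detach looks roughly like this in C (a
self-contained sketch, not the test program's actual code; error handling
omitted):

    #include <stdlib.h>
    #include <unistd.h>

    /* detach the daemon so the caller's sub-shell has no child to wait on */
    static void detach_daemon(void)
    {
            if (fork() > 0)
                    exit(0);        /* first parent returns to the shell */
            setsid();               /* new session: leave the old process group */
            if (fork() > 0)
                    exit(0);        /* second parent exits; child re-parents to init */
            /* fully detached: run the ublk daemon loop here */
    }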

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 months agoselftests: ublk: cleanup backfile automatically
Ming Lei [Sat, 12 Apr 2025 02:30:19 +0000 (10:30 +0800)]
selftests: ublk: cleanup backfile automatically

Use the global array $UBLK_BACKFILES to store all backing file names, then
clean them up automatically.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>