Christoph Hellwig [Mon, 14 Jul 2025 05:37:24 +0000 (07:37 +0200)]
xfs: factor out a xlog_write_space_advance helper
Add a new xlog_write_space_advance that returns the current place in the
iclog that data is written to, and advances the various counters by the
amount taken from xlog_write_iovec, and also use it xlog_write_partial,
which open codes the counter adjustments, but misses the asserts.
Christoph Hellwig [Mon, 14 Jul 2025 05:33:41 +0000 (07:33 +0200)]
xfs: improve the calling convention for the xlog_write helpers
The xlog_write chain passes around the same seven variables that are
often passed by reference. Add a xlog_write_data structure to contain
them to improve code generation and readability.
This change reduces the generated code size by almost 200 bytes for my
x86_64 build:
$ size fs/xfs/xfs_log.o*
text data bss dec hex filename
26330 1292 8 27630 6bee fs/xfs/xfs_log.o
26158 1292 8 27458 6b42 fs/xfs/xfs_log.o.old
Christoph Hellwig [Mon, 14 Jul 2025 05:33:13 +0000 (07:33 +0200)]
xfs: regularize iclog space accounting in xlog_write_partial
When xlog_write_partial splits a log region over multiple iclogs, it
has to include the continuation ophder in the length requested for the
new iclog. Currently is simply adds that to the request, which makes
the accounting of the used space below look slightly different from the
other users of iclog space that decrement it. To prepare for more code
sharing, adding the ophdr size to the len variable before the call to
xlog_write_get_more_iclog_space and then decrement it later.
This changes the contents of len when xlog_write_get_more_iclog_space
returns an error, but as nothing looks at len in that case the
difference doesn't matter.
Christoph Hellwig [Fri, 23 May 2025 03:20:09 +0000 (05:20 +0200)]
xfs: improve the ->iop_format interface
Export a higher level interface to format log items. The xlog_format_buf
structure is hidden inside xfs_log_cil.c and only accessed using two
helpers (and a wrapper build on top), hiding details of log iovecs from
the log items. This also allows simply using an index into lv_iovecp
instead of keeping a cursor vec.
Christoph Hellwig [Thu, 22 May 2025 03:30:37 +0000 (05:30 +0200)]
xfs: set lv_bytes in xlog_write_one_vec
lv_bytes is mostly just use by the CIL code, but has crept into the
low-level log writing code to decide on a full or partial iclog
write. Ensure it is valid even for the special log writes that don't
go through the CIL by initializing it in xlog_write_one_vec.
Christoph Hellwig [Sun, 18 May 2025 10:59:55 +0000 (12:59 +0200)]
xfs: don't pass the old lv to xfs_cil_prepare_item
By the time xfs_cil_prepare_item is called, the old lv is still pointed
to by the log item. Take it from there instead of spreading the old lv
logic over xlog_cil_insert_format_items and xfs_cil_prepare_item.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
xfs: refactor xfs_btree_diff_two_ptrs() to take advantage of cmp_int()
Use cmp_int() to yield the result of a three-way-comparison instead of
performing subtractions with extra casts. Thus also rename the function
to make its name clearer in purpose.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs: use a proper variable name and type for storing a comparison result
Perhaps that's just my silly imagination but 'diff' doesn't look good for
the name of a variable to hold a result of a three-way-comparison
(-1, 0, 1) which is what ->cmp_key_with_cur() does. It implies to contain
an actual difference between the two integer variables but that's not true
anymore after recent refactoring.
Declaring it as int64_t is also misleading now. Plain integer type is
more than enough.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs: refactor cmp_key_with_cur routines to take advantage of cmp_int()
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_key_with_cur routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs: refactor cmp_two_keys routines to take advantage of cmp_int()
The net value of these functions is to determine the result of a
three-way-comparison between operands of the same type.
Simplify the code using cmp_int() to eliminate potential errors with
opencoded casts and subtractions. This also means we can change the return
value type of cmp_two_keys routines from int64_t to int and make the
interface a bit clearer.
Found by Linux Verification Center (linuxtesting.org).
Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
One may think that diff_two_keys routines are used to compute the actual
difference between the arguments but they return a result of a
three-way-comparison of the passed operands. So it looks more appropriate
to denote them as cmp_two_keys.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Fri, 23 May 2025 12:31:28 +0000 (14:31 +0200)]
xfs: remove the bt_bdev_file buftarg field
And use bt_file for both bdev and shmem backed buftargs.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 7 Jul 2025 12:53:16 +0000 (14:53 +0200)]
xfs: rename the bt_bdev_* buftarg fields
The extra bdev_ is weird, so drop it. Also improve the comment to make
it clear these are the hardware limits.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 7 Jul 2025 12:53:15 +0000 (14:53 +0200)]
xfs: refactor xfs_calc_atomic_write_unit_max
This function and the helpers used by it duplicate the same logic for AGs
and RTGs. Use the xfs_group_type enum to unify both variants.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 7 Jul 2025 12:53:14 +0000 (14:53 +0200)]
xfs: add a xfs_group_type_buftarg helper
Generalize the xfs_group_type helper in the discard code to return a buftarg
and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Mon, 7 Jul 2025 12:53:12 +0000 (14:53 +0200)]
xfs: clean up the initial read logic in xfs_readsb
The initial sb read is always for a device logical block size buffer.
The device logical block size is provided in the bt_logical_sectorsize in
struct buftarg, so use that instead of the confusingly named
xfs_getsize_buftarg buffer that reads it from the bdev.
Update the comments surrounding the code to better describe what is going
on.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Pranav Tyagi [Tue, 17 Jun 2025 13:14:46 +0000 (18:44 +0530)]
xfs: replace strncpy with memcpy in xattr listing
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent().
The length is known and a null byte is added manually.
No functional change intended.
Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Youling Tang [Mon, 30 Jun 2025 01:11:48 +0000 (09:11 +0800)]
xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags mask
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in
XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains
by explicitly showing this flag in the supported flags mask.
Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition
has no functional modifications.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
It always seems to be generic/753 that triggers this, and repeating
a quick group test run triggers it every 10-15 iterations. Hence it
generally triggers once up every 30-40 minutes of test time. just
running generic/753 by itself or concurrently with a limited group
of tests doesn't reproduce this issue at all.
Tracing on a hung system shows the AIL repeating every 50ms a log
force followed by an attempt to push pinned, aborted inodes from the
AIL (trimmed for brevity):
The inode log items are marked as aborted, which means that either:
a) a transaction commit has occurred, seen an error or shutdown, and
called xfs_trans_free_items() to abort the items. This should happen
before any pinning of log items occurs.
or
b) a dirty transaction has been cancelled. This should also happen
before any pinning of log items occurs.
or
c) AIL insertion at journal IO completion is marked as aborted. In
this case, the log item is pinned by the CIL until journal IO
completes and hence needs to be unpinned. This is then done after
the ->iop_committed() callback is run, so the pin count should be
balanced correctly.
Yet none of these seemed to be occurring. Further tracing indicated
this:
d) Shutdown during CIL pushing resulting in log item completion
being called from checkpoint abort processing. Items are unpinned
and released without serialisation against each other, journal IO
completion or transaction commit completion.
In this case, we may still have a transaction commit in flight that
holds a reference to a xfs_buf_log_item (BLI) after CIL insertion.
e.g. a synchronous transaction will flush the CIL before the
transaction is torn down. The concurrent CIL push then aborts
insertion it and drops the commit/AIL reference to the BLI. This can
leave the transaction commit context with the last reference to the
BLI which is dropped here:
xfs_trans_free_items()
->iop_release
xfs_buf_item_release
xfs_buf_item_put
if (XFS_LI_ABORTED)
xfs_trans_ail_delete
xfs_buf_item_relse()
Unlike the journal completion ->iop_unpin path, this path does not
run stale buffer completion process when it drops the last
reference, hence leaving the stale inodes attached to the buffer
sitting the AIL. There are no other references to those inodes, so
there is no other mechanism to remove them from the AIL. Hence
unmount hangs.
The buffer lock context for stale buffers is passed to the last BLI
reference. This is normally the last BLI unpin on journal IO
completion. The unpin then processes the stale buffer completion and
releases the buffer lock. However, if the final unpin from journal
IO completion (or CIL push abort) does not hold the last reference
to the BLI, there -must- still be a transaction context that
references the BLI, and so that context must perform the stale
buffer completion processing before the buffer is unlocked and the
BLI torn down.
The fix for this is to rework the xfs_buf_item_relse() path to run
stale buffer completion processing if it drops the last reference to
the BLI. We still hold the buffer locked, so the buffer owner and
lock context is the same as if we passed the BLI and buffer to the
->iop_unpin() context to finish stale process on journal commit.
However, we have to be careful here. In a shutdown state, we can be
freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse()
and xfs_trans_bdetach(). The existing code handles this case by
considering shutdown state as "aborted", but in doing so
largely masks the failure to clean up stale BLI state from the
xfs_buf_item_relse() path. i.e regardless of the shutdown state and
whether the item is in the AIL, we must finish the stale buffer
cleanup if we are are dropping the last BLI reference from the
->iop_relse path in transaction commit context.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Wed, 25 Jun 2025 22:48:59 +0000 (08:48 +1000)]
xfs: factor out stale buffer item completion
The stale buffer item completion handling is currently only done
from BLI unpinning. We need to perform this function from where-ever
the last reference to the BLI is dropped, so first we need to
factor this code out into a helper.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Wed, 25 Jun 2025 22:48:58 +0000 (08:48 +1000)]
xfs: rearrange code in xfs_buf_item.c
The code to initialise, release and free items is all the way down
the bottom of the file. Upcoming fixes need to these functions
earlier in the file, so move them to the top.
There is one code change in this move - the parameter to
xfs_buf_item_relse() is changed from the xfs_buf to the
xfs_buf_log_item - the thing that the function is releasing.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Wed, 25 Jun 2025 22:48:57 +0000 (08:48 +1000)]
xfs: add tracepoints for stale pinned inode state debug
I needed more insight into how stale inodes were getting stuck on
the AIL after a forced shutdown when running fsstress. These are the
tracepoints I added for that purpose.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The CIL push has seen the deadlock, so it has aborted the push and
is running CIL checkpoint completion to abort all the items in the
checkpoint. This calls ->iop_unpin(remove = true) to clean up the
log items in the checkpoint.
When a buffer log item is unpined like this, it needs to lock the
buffer to run io completion to correctly fail the buffer and run all
the required completions to fail attached log items as well. In this
case, the attempt to lock the buffer on unpin is hanging because the
buffer is already locked.
I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE
handling changes I was testing, so I went looking for
pin events on HOLD buffers and unpin events on locked buffer. That
isolated this one buffer with these two events:
Yup, another test run by check-parallel is running drop_caches
concurrently and the dquot shrinker for the hung filesystem is
running. That's trying to flush a dirty dquot from reclaim context,
and it waiting on a log force to complete. xfs_qm_dqflush is called
with the dquot buffer held locked, and so we've called
xfs_log_force() with that buffer locked.
Now the log force is waiting for a workqueue flush to complete, and
that workqueue flush is waiting of CIL checkpoint processing to
finish.
The CIL checkpoint processing is aborting all the log items it has,
and that requires locking aborted buffers to cancel them.
Now, normally this isn't a problem if we are issuing a log force
to unpin an object, because the ->iop_unpin() method wakes pin
waiters first. That results in the pin waiter finishing off whatever
it was doing, dropping the lock and then xfs_buf_item_unpin() can
lock the buffer and fail it.
However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not
the dquot buffer unpin event, and so it never gets woken and so does
not drop the buffer lock.
Inodes do not have this problem, as they can only be written from
one spot (->iop_push) whilst dquots can be written from multiple
places (memory reclaim, ->iop_push, xfs_dq_dqpurge, and quotacheck).
The reason that the dquot buffer has an attached buffer log item is
that it has been recently allocated. Initialisation of the dquot
buffer logs the buffer directly, thereby pinning it in memory. We
then modify the dquot in a separate operation, and have memory
reclaim racing with a shutdown and we trigger this deadlock.
check-parallel reproduces this reliably on 1kB FSB filesystems with
quota enabled because it does all of these things concurrently
without having to explicitly write tests to exercise these corner
case conditions.
xfs_qm_dquot_logitem_push() doesn't have this deadlock because it
checks if the dquot is pinned before locking the dquot buffer and
skipping it if it is pinned. This means the xfs_qm_dqunpin_wait()
log force in xfs_qm_dqflush() never triggers and we unlock the
buffer safely allowing a concurrent shutdown to fail the buffer
appropriately.
xfs_qm_dqpurge() could have this problem as it is called from
quotacheck and we might have allocated dquot buffers when recording
the quota updates. This can be fixed by calling
xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we
hold the dquot locked, nothing will be able to add to the pin count
between the unpin_wait and the dqflush callout, so this now makes
xfs_qm_dqpurge() safe against this race.
xfs_qm_dquot_isolate() can also be fixed this same way but, quite
frankly, we shouldn't be doing IO in memory reclaim context. If the
dquot is pinned or dirty, simply rotate it and let memory reclaim
come back to it later, same as we do for inodes.
This then gets rid of the nasty issue in xfs_qm_flush_one() where
quotacheck writeback races with memory reclaim flushing the dquots.
We can lift xfs_qm_dqunpin_wait() up into this code, then get rid of
the "can't get the dqflush lock" buffer write to cycle the dqlfush
lock and enable it to be flushed again. checking if the dquot is
pinned and returning -EAGAIN so that the dquot walk will revisit the
dquot again later.
Finally, with xfs_qm_dqunpin_wait() lifted into all the callers,
we can remove it from the xfs_qm_dqflush() code.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Wed, 25 Jun 2025 22:48:55 +0000 (08:48 +1000)]
xfs: catch stale AGF/AGF metadata
There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:
Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. Writes
are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.
If this happens at the same time as memory pressure is occuring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.
Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.
This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.
Then other operations on that inode hang because the task was killed
whilst holding inode locks. e.g:
Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.
Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.
Fix this by replacing the asserts with a corruption detection check
and a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Dave Chinner [Wed, 25 Jun 2025 22:48:54 +0000 (08:48 +1000)]
xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK
-> IFLUSHING, except for the last inode in the cluster that is
triggering the free. In that case, the lock order is ILOCK ->
cluster buffer -> IFLUSHING.
xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING,
so this can safely run concurrently with xfs_ifree_cluster().
xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this
cannot race with xfs_ifree_cluster() so being in a different order
will not trigger a deadlock.
xfs_reclaim_inode() during a filesystem shutdown uses ILOCK ->
IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and
this deadlocks against xfs_ifree_cluster() like so:
We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE
and XFS_IFLUSHING are serialised through to journal IO completion
by the cluster buffer lock being held.
There's quite a few asserts in the code that check that XFS_ISTALE
does not occur out of sync with buffer locking (e.g. in
xfs_iflush_cluster). There's also a dependency on the inode log item
being removed from the buffer before XFS_IFLUSHING is cleared, also
with asserts that trigger on this.
Further, we don't have a requirement for the inode to be locked when
completing or aborting inode flushing because all the inode state
updates are serialised by holding the cluster buffer lock across the
IO to completion.
We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and
skip the inode, because there is no guarantee that the inode will be
reclaimed. Hence it *must* be marked XFS_ISTALE regardless of
whether reclaim is preparing to free that inode. Similarly, we can't
check for IFLUSHING before locking the inode because that would
result in dirty inodes not being marked with ISTALE in the event of
racing with XFS_IRECLAIM.
Hence we have to address this issue from the xfs_reclaim_inode()
side. It is clear that we cannot hold the inode locked here when
calling xfs_iflush_shutdown_abort() because it is the inode->buffer
lock order that causes the deadlock against xfs_ifree_cluster().
Hence we need to drop the ILOCK before aborting the inode in the
shutdown case. Once we've aborted the inode, we can grab the ILOCK
again and then immediately reclaim it as it is now guaranteed to be
clean.
Note that dropping the ILOCK in xfs_reclaim_inode() means that it
can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in
this state. This is safe because we have left the XFS_IFLUSHING flag
on the inode and so xfs_ifree_mark_inode_stale() will simply set
XFS_ISTALE and move to the next inode. An ASSERT check in this path
needs to be tweaked to take into account this new shutdown
interaction.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The difference between TRACE_EVENT() and DECLARE_EVENT_CLASS() is that
TRACE_EVENT() also creates an event with the class name.
Switch xfs_xattr_class over to being a class and not an event as it is not
called directly, and that event with the class name takes up unnecessary
memory.
Fixes: e47dcf113ae3 ("xfs: repair extended attributes") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:58 +0000 (13:51 -0400)]
xfs: only create event xfs_file_compat_ioctl when CONFIG_COMPAT is configure
The trace event xfs_file_compat_ioctl is only used when CONFIG_COMPAT is
configured in the build. As trace events can take up to 5K in memory for
text and meta data regardless if they are used, they should not be created
when unused. Add #ifdef CONFIG_COMPAT around the event so that it is only
created when that is configured.
Fixes: cca28fb83d9e6 ("xfs: split xfs_itrace_entry") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:57 +0000 (13:51 -0400)]
xfs: remove usused xfs_end_io_direct events
When the use of iomap_dio_rw was added, the calls to the trace events
xfs_end_io_direct_unwritten and xfs_end_io_direct_append were removed but
those trace events were not. As trace events can take up to 5K in memory
for text and meta data regardless if they are used or not, they should not
be created when not used. Remove the unused events.
Fixes: acdda3aae146 ("xfs: use iomap_dio_rw") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:56 +0000 (13:51 -0400)]
xfs: remove unused event xfs_pagecache_inval
When the function xfs_flushinval_pages() was removed, it removed the only
caller to the trace event xfs_pagecache_inval. As trace events can take up
to 5K of memory in text and meta data each regardless if they are used or
not, they should not be created when unused. Remove the unused event.
Fixes: fb59581404ab ("xfs: remove xfs_flushinval_pages") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:55 +0000 (13:51 -0400)]
xfs: remove unused event xfs_alloc_near_nominleft
When the function xfs_alloc_space_available() was restructured, it removed
the only calls to the trace event xfs_alloc_near_nominleft. As trace
events take up to 5K of memory for text and meta data for each event, they
should not be created when not used. Remove this unused event.
Fixes: 54fee133ad59 ("xfs: adjust allocation length in xfs_alloc_space_available") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:54 +0000 (13:51 -0400)]
xfs: remove unused event xfs_alloc_near_error
Trace events take up to 5K of memory in text and meta data regardless if
they are used or not. The call to the event xfs_alloc_near_error was
removed when the cursor data structure allocation was introduced. Remove
it as it is no longer used and is just wasting memory.
Fixes: f5e7dbea1e3e ("xfs: introduce allocation cursor data structure") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:53 +0000 (13:51 -0400)]
xfs: remove unused event xfs_attr_node_removename
When xfs_attri_remove_iter() was removed, so was the call to the trace
event xfs_attr_node_removename. As trace events can take up to 5K in
memory for text and meta data regardless if they are used or not, they
should not be created when unused. Remove the unused event.
Fixes: 59782a236b622 ("xfs: remove xfs_attri_remove_iter") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:52 +0000 (13:51 -0400)]
xfs: remove unused xfs_attr events
Trace events can take up to 5K in memory for text and meta data per event
regardless if they are used or not, so they should not be defined when not
used. The events xfs_attr_fillstate and xfs_attr_refillstate are only
called in code that is #ifdef out and exists only for future reference.
Remove these unused events. If the code is needed again, then git history
can recover what the events were.
Suggested-by: Christoph Hellwig <hch@lst.de> Fixes: 59782a236b622 ("xfs: remove xfs_attri_remove_iter") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the function xfs_attr_rmtval_set() was removed, the call to the
corresponding trace event was also removed but the trace event itself was
not. As trace events can take up to 5K of memory in text and meta data
regardless if they are used or not they should not be created when not
used. Remove the unused trace event.
Fixes: 0e6acf29db6f ("xfs: Remove xfs_attr_rmtval_set") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the clone/dedupe_file_rang common functions were refactored, it
removed the calls to the xfs_reflink_compare_extents and
xfs_reflink_compare_extents_error events. As each event can take up to 5K
in memory for text and meta data regardless if they are used or not, they
should not be created if they are not used. Remove these unused events.
Fixes: 876bec6f9bbf ("vfs: refactor clone/dedupe_file_range common functions") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:49 +0000 (13:51 -0400)]
xfs: remove unused event xfs_ioctl_clone
The trace event xfs_ioctl_clone was added but never used. As trace events
can take up to 5K of memory in text and meta data regardless if they are
used or not, remove the unused trace event.
Fixes: 53aa1c34f4eb ("xfs: define tracepoints for reflink activities") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Steven Rostedt [Mon, 16 Jun 2025 17:51:48 +0000 (13:51 -0400)]
xfs: remove unused event xlog_iclog_want_sync
The trace event xlog_iclog_want_sync was added but never used. As trace
events can take up around 5K of memory in text and meta data regardless if
they are used or not, remove this unused event.
Fixes: 956f6daa84bf ("xfs: add iclog state trace events") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
When the function xfs_attri_remove_iter was removed, it did not remove the
trace event that it called. As a trace event can take up to 5K of memory for
text and meta data regardless of if it is used or not, remove this unused trace
event.
Fixes: 59782a236b62 ("xfs: remove xfs_attri_remove_iter") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Darrick J. Wong [Thu, 12 Jun 2025 17:51:12 +0000 (10:51 -0700)]
xfs: actually use the xfs_growfs_check_rtgeom tracepoint
We created a new tracepoint but forgot to put it in. Fix that.
Cc: rostedt@goodmis.org Cc: stable@vger.kernel.org # v6.14 Fixes: 59a57acbce282d ("xfs: check that the rtrmapbt maxlevels doesn't increase when growing fs") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250612131021.114e6ec8@batman.local.home/ Signed-off-by: Carlos Maiolino <cem@kernel.org>
Markus Elfring [Tue, 10 Jun 2025 12:50:07 +0000 (14:50 +0200)]
xfs: Improve error handling in xfs_mru_cache_create()
Simplify error handling in this function implementation.
* Delete unnecessary pointer checks and variable assignments.
* Omit a redundant function call.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 5 Jun 2025 06:16:30 +0000 (08:16 +0200)]
xfs: move xfs_submit_zoned_bio a bit
Commit f3e2e53823b9 ("xfs: add inode to zone caching for data placement")
add the new code right between xfs_submit_zoned_bio and
xfs_zone_alloc_and_submit which implement the main zoned write path.
Move xfs_submit_zoned_bio down to keep it together again.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 5 Jun 2025 06:16:29 +0000 (08:16 +0200)]
xfs: use xfs_readonly_buftarg in xfs_remount_rw
Use xfs_readonly_buftarg instead of open coding it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 5 Jun 2025 06:16:28 +0000 (08:16 +0200)]
xfs: remove NULL pointer checks in xfs_mru_cache_insert
Remove the check for a NULL mru or mru->list in xfs_mru_cache_insert
as this API misused lead to a direct NULL pointer dereference on first
use and is not user triggerable. As a smatch run by Dan points out
with the recent cleanup it would otherwise try to free the object we
just determined to be NULL for this impossible to reach case.
Fixes: 70b95cb86513 ("xfs: free the item in xfs_mru_cache_insert on failure") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Christoph Hellwig [Thu, 5 Jun 2025 06:16:27 +0000 (08:16 +0200)]
xfs: check for shutdown before going to sleep in xfs_select_zone
Ensure the file system hasn't been shut down before waiting for a free
zone to become available, because that won't happen on a shut down
file system. Without this processes can occasionally get stuck in
the allocator wait loop when racing with a file system shutdown.
This sporadically happens when running generic/388 or generic/475.
Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Linus Torvalds [Sun, 15 Jun 2025 16:14:27 +0000 (09:14 -0700)]
Merge tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Move warnings about linux/export.h from W=1 to W=2
- Fix structure type overrides in gendwarfksyms
* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
gendwarfksyms: Fix structure type overrides
kbuild: move warnings about linux/export.h from W=1 to W=2
Sami Tolvanen [Sat, 14 Jun 2025 00:55:33 +0000 (00:55 +0000)]
gendwarfksyms: Fix structure type overrides
As we always iterate through the entire die_map when expanding
type strings, recursively processing referenced types in
type_expand_child() is not actually necessary. Furthermore,
the type_string kABI rule added in commit c9083467f7b9
("gendwarfksyms: Add a kABI rule to override type strings") can
fail to override type strings for structures due to a missing
kabi_get_type_string() check in this function.
Fix the issue by dropping the unnecessary recursion and moving
the override check to type_expand(). Note that symbol versions
are otherwise unchanged with this patch.
Fixes: c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings") Reported-by: Giuliano Procida <gprocida@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Linus Torvalds [Sat, 14 Jun 2025 16:25:22 +0000 (09:25 -0700)]
Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a deadlock on queue freeze with zoned writes
- Fix for zoned append emulation
- Two bio folio fixes, for sparsemem and for very large folios
- Fix for a performance regression introduced in 6.13 when plug
insertion was changed
- Fix for NVMe passthrough handling for polled IO
- Document the ublk auto registration feature
- loop lockdep warning fix
* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze
Linus Torvalds [Sat, 14 Jun 2025 15:44:54 +0000 (08:44 -0700)]
Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for a race between SQPOLL exit and fdinfo reading.
It's slim and I was only able to reproduce this with an artificial
delay in the kernel. Followup sparse fix as well to unify the access
to ->thread.
- Fix for multiple buffer peeking, avoiding truncation if possible.
- Run local task_work for IOPOLL reaping when the ring is exiting.
This currently isn't done due to an assumption that polled IO will
never need task_work, but a fix on the block side is going to change
that.
* tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux:
io_uring: run local task_work from ring exit IOPOLL reaping
io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
io_uring: consistently use rcu semantics with sqpoll thread
io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()
Linus Torvalds [Sat, 14 Jun 2025 15:18:09 +0000 (08:18 -0700)]
Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues
or aren't considered necessary for -stable kernels. Only 4 are for MM"
* tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: add mmap_prepare() compatibility layer for nested file systems
init: fix build warnings about export.h
MAINTAINERS: add Barry as a THP reviewer
drivers/rapidio/rio_cm.c: prevent possible heap overwrite
mm: close theoretical race where stale TLB entries could linger
mm/vma: reset VMA iterator on commit_merge() OOM failure
docs: proc: update VmFlags documentation in smaps
scatterlist: fix extraneous '@'-sign kernel-doc notation
selftests/mm: skip failed memfd setups in gup_longterm
Linus Torvalds [Fri, 13 Jun 2025 23:49:39 +0000 (16:49 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"All fixes for drivers.
The core change in the error handler is simply to translate an ALUA
specific sense code into a retry the ALUA components can handle and
won't impact any other devices"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: error: alua: I/O errors for ALUA state transitions
scsi: storvsc: Increase the timeouts to storvsc_timeout
scsi: s390: zfcp: Ensure synchronous unit_add
scsi: iscsi: Fix incorrect error path labels for flashnode operations
scsi: mvsas: Fix typos in per-phy comments and SAS cmd port registers
scsi: core: ufs: Fix a hang in the error handler
Linus Torvalds [Fri, 13 Jun 2025 23:27:27 +0000 (16:27 -0700)]
Merge tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Quiet week, only two pull requests came my way, xe has a couple of
fixes and then a bunch of fixes across the board, vc4 probably fixes
the biggest problem:
vc4:
- Fix infinite EPROBE_DEFER loop in vc4 probing
amdxdna:
- Fix amdxdna firmware size
meson:
- modesetting fixes
sitronix:
- Kconfig fix for st7171-i2c
dma-buf:
- Fix -EBUSY WARN_ON_ONCE in dma-buf
udmabuf:
- Use dma_sync_sgtable_for_cpu in udmabuf
xe:
- Fix regression disallowing 64K SVM migration
- Use a bounce buffer for WA BB"
* tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/lrc: Use a temporary buffer for WA BB
udmabuf: use sgtable-based scatterlist wrappers
dma-buf: fix compare in WARN_ON_ONCE
drm/sitronix: st7571-i2c: Select VIDEOMODE_HELPERS
drm/meson: fix more rounding issues with 59.94Hz modes
drm/meson: use vclk_freq instead of pixel_freq in debug print
drm/meson: fix debug log statement when setting the HDMI clocks
drm/vc4: fix infinite EPROBE_DEFER loop
drm/xe/svm: Fix regression disallowing 64K SVM migration
accel/amdxdna: Fix incorrect PSP firmware size
Jens Axboe [Fri, 13 Jun 2025 21:24:53 +0000 (15:24 -0600)]
io_uring: run local task_work from ring exit IOPOLL reaping
In preparation for needing to shift NVMe passthrough to always use
task_work for polled IO completions, ensure that those are suitably
run at exit time. See commit:
9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work")
Jens Axboe [Fri, 13 Jun 2025 19:37:41 +0000 (13:37 -0600)]
nvme: always punt polled uring_cmd end_io work to task_work
Currently NVMe uring_cmd completions will complete locally, if they are
polled. This is done because those completions are always invoked from
task context. And while that is true, there's no guarantee that it's
invoked under the right ring context, or even task. If someone does
NVMe passthrough via multiple threads and with a limited number of
poll queues, then ringA may find completions from ringB. For that case,
completing the request may not be sound.
Always just punt the passthrough completions via task_work, which will
redirect the completion, if needed.
Cc: stable@vger.kernel.org Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 13 Jun 2025 20:39:15 +0000 (13:39 -0700)]
Merge tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix an ACPI APEI error injection driver failure that started to
occur after switching it over to using a faux device, address an EC
driver issue related to invalid ECDT tables, clean up the usage of
mwait_idle_with_hints() in the ACPI PAD driver, add a new IRQ override
quirk, and fix a NULL pointer dereference related to nosmp:
- Update the faux device handling code in the driver core and address
an ACPI APEI error injection driver failure that started to occur
after switching it over to using a faux device on top of that (Dan
Williams)
- Update data types of variables passed as arguments to
mwait_idle_with_hints() in the ACPI PAD (processor aggregator
device) driver to match the function definition after recent
changes (Uros Bizjak)
- Fix a NULL pointer dereference in the ACPI CPPC library that occurs
when nosmp is passed to the kernel in the command line (Yunhui Cui)
- Ignore ECDT tables with an invalid ID string to prevent using an
incorrect GPE for signaling events on some systems (Armin Wolf)
- Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan)"
* tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: resource: Use IRQ override on MACHENIKE 16P
ACPI: EC: Ignore ECDT tables with an invalid ID string
ACPI: CPPC: Fix NULL pointer dereference when nosmp is used
ACPI: PAD: Update arguments of mwait_idle_with_hints()
ACPI: APEI: EINJ: Do not fail einj_init() on faux_device_create() failure
driver core: faux: Quiet probe failures
driver core: faux: Suppress bind attributes
Linus Torvalds [Fri, 13 Jun 2025 20:27:41 +0000 (13:27 -0700)]
Merge tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix the cpupower utility installation, fix up the recently added
Rust abstractions for cpufreq and OPP, restore the x86 update
eliminating mwait_play_dead_cpuid_hint() that has been reverted during
the 6.16 merge window along with preventing the failure caused by it
from happening, and clean up mwait_idle_with_hints() usage in
intel_idle:
- Implement CpuId Rust abstraction and use it to fix doctest failure
related to the recently introduced cpumask abstraction (Viresh
Kumar)
- Do minor cleanups in the `# Safety` sections for cpufreq
abstractions added recently (Viresh Kumar)
- Unbreak cpupower systemd service units installation on some systems
by adding a unitdir variable for specifying the location to install
them (Francesco Poli)
- Eliminate mwait_play_dead_cpuid_hint() again after reverting its
elimination during the 6.16 merge window due to a problem with
handling "dead" SMT siblings, but this time prevent leaving them in
C1 after initialization by taking them online and back offline when
a proper cpuidle driver for the platform has been registered
(Rafael Wysocki)
- Update data types of variables passed as arguments to
mwait_idle_with_hints() to match the function definition after
recent changes (Uros Bizjak)"
* tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
rust: cpu: Add CpuId::current() to retrieve current CPU ID
rust: Use CpuId in place of raw CPU numbers
rust: cpu: Introduce CpuId abstraction
intel_idle: Update arguments of mwait_idle_with_hints()
cpufreq: Convert `/// SAFETY` lines to `# Safety` sections
cpupower: split unitdir from libdir in Makefile
Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
ACPI: processor: Rescan "dead" SMT siblings during initialization
intel_idle: Rescan "dead" SMT siblings during initialization
x86/smp: PM/hibernate: Split arch_resume_nosmt()
intel_idle: Use subsys_initcall_sync() for initialization
Rafael J. Wysocki [Fri, 13 Jun 2025 19:55:30 +0000 (21:55 +0200)]
Merge branches 'acpi-pad', 'acpi-cppc', 'acpi-ec' and 'acpi-resource'
Merge assorted ACPI updates for 6.16-rc2:
- Update data types of variables passed as arguments to
mwait_idle_with_hints() in the ACPI PAD (processor aggregator device)
driver to match the function definition after recent changes (Uros
Bizjak).
- Fix a NULL pointer dereference in the ACPI CPPC library that occurs
when nosmp is passed to the kernel in the command line (Yunhui Cui).
- Ignore ECDT tables with an invalid ID string to prevent using an
incorrect GPE for signaling events on some systems (Armin Wolf).
- Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan).
* acpi-pad:
ACPI: PAD: Update arguments of mwait_idle_with_hints()
* acpi-cppc:
ACPI: CPPC: Fix NULL pointer dereference when nosmp is used
* acpi-ec:
ACPI: EC: Ignore ECDT tables with an invalid ID string
* acpi-resource:
ACPI: resource: Use IRQ override on MACHENIKE 16P
Rafael J. Wysocki [Fri, 13 Jun 2025 19:28:07 +0000 (21:28 +0200)]
Merge branch 'pm-cpuidle'
Merge cpuidle updates for 6.16-rc2:
- Update data types of variables passed as arguments to
mwait_idle_with_hints() to match the function definition
after recent changes (Uros Bizjak).
- Eliminate mwait_play_dead_cpuid_hint() again after reverting its
elimination during the merge window due to a problem with handling
"dead" SMT siblings, but this time prevent leaving them in C1 after
initialization by taking them online and back offline when a proper
cpuidle driver for the platform has been registered (Rafael Wysocki).
* pm-cpuidle:
intel_idle: Update arguments of mwait_idle_with_hints()
Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
ACPI: processor: Rescan "dead" SMT siblings during initialization
intel_idle: Rescan "dead" SMT siblings during initialization
x86/smp: PM/hibernate: Split arch_resume_nosmt()
intel_idle: Use subsys_initcall_sync() for initialization
Linus Torvalds [Fri, 13 Jun 2025 18:01:44 +0000 (11:01 -0700)]
Merge tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A collection of driver specific fixes, most minor apart from the OMAP
ones which disable some recent performance optimisations in some
non-standard cases where we could start driving the bus incorrectly.
The change to the stm32-ospi driver to use the newer reset APIs is a
fix for interactions with other IP sharing the same reset line in some
SoCs"
* tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine
spi: stm32-ospi: clean up on error in probe()
spi: stm32-ospi: Make usage of reset_control_acquire/release() API
spi: offload: check offload ops existence before disabling the trigger
spi: spi-pci1xxxx: Fix error code in probe
spi: loongson: Fix build warnings about export.h
spi: omap2-mcspi: Disable multi-mode when the previous message kept CS asserted
spi: omap2-mcspi: Disable multi mode when CS should be kept asserted after message
Linus Torvalds [Fri, 13 Jun 2025 18:00:19 +0000 (11:00 -0700)]
Merge tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One minor fix for a leak in the DT parsing code in the max20086 driver"
* tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: max20086: Fix refcount leak in max20086_parse_regulators_dt()
Oleg Nesterov [Fri, 13 Jun 2025 17:26:50 +0000 (19:26 +0200)]
posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del()
If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().
If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.
Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.
This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.
Linus Torvalds [Fri, 13 Jun 2025 17:51:11 +0000 (10:51 -0700)]
Merge tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Do not free "head" variable in filter_free_subsystem_filters()
The first error path jumps to "free_now" label but first frees the
newly allocated "head" variable. But the "free_now" code checks this
variable, and if it is not NULL, it will iterate the list. As this
list variable was already initialized, the "free_now" code will not
do anything as it is empty. But freeing it will cause a UAF bug.
The error path should simply jump to the "free_now" label and leave
the "head" variable alone.
* tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Do not free "head" on error path of filter_free_subsystem_filters()
Linus Torvalds [Fri, 13 Jun 2025 17:05:31 +0000 (10:05 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Rework of system register accessors for system registers that are
directly writen to memory, so that sanitisation of the in-memory
value happens at the correct time (after the read, or before the
write). For convenience, RMW-style accessors are also provided.
- Multiple fixes for the so-called "arch-timer-edge-cases' selftest,
which was always broken.
x86:
- Make KVM_PRE_FAULT_MEMORY stricter for TDX, allowing userspace to
pass only the "untouched" addresses and flipping the shared/private
bit in the implementation.
- Disable SEV-SNP support on initialization failure
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY
KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
KVM: SEV: Disable SEV-SNP support on initialization failure
KVM: arm64: selftests: Determine effective counter width in arch_timer_edge_cases
KVM: arm64: selftests: Fix xVAL init in arch_timer_edge_cases
KVM: arm64: selftests: Fix thread migration in arch_timer_edge_cases
KVM: arm64: selftests: Fix help text for arch_timer_edge_cases
KVM: arm64: Make __vcpu_sys_reg() a pure rvalue operand
KVM: arm64: Don't use __vcpu_sys_reg() to get the address of a sysreg
KVM: arm64: Add RMW specific sysreg accessor
KVM: arm64: Add assignment-specific sysreg accessor
Jens Axboe [Fri, 13 Jun 2025 17:01:49 +0000 (11:01 -0600)]
io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
If peeking a bunch of buffers, normally io_ring_buffers_peek() will
truncate the end buffer. This isn't optimal as presumably more data will
be arriving later, and hence it's better to stop with the last full
buffer rather than truncate the end buffer.
Cc: stable@vger.kernel.org Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers") Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 13 Jun 2025 16:49:07 +0000 (09:49 -0700)]
Merge tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"As usual, highlighting the ones users have been noticing:
- Fix a small issue with has_case_insensitive not being propagated on
snapshot creation; this led to fsck errors, which we're harmless
because we're not using this flag yet (it's for overlayfs +
casefolding).
- Log the error being corrected in the journal when we're doing fsck
repair: this was one of the "lessons learned" from the i_nlink 0 ->
subvolume deletion bug, where reconstructing what had happened by
analyzing the journal was a bit more difficult than it needed to
be.
- Don't schedule btree node scan to run in the superblock: this fixes
a regression from the 6.16 recovery passes rework, and let to it
running unnecessarily.
The real issue here is that we don't have online, "self healing"
style topology repair yet: topology repair currently has to run
before we go RW, which means that we may schedule it unnecessarily
after a transient error. This will be fixed in the future.
- We now track, in btree node flags, the reason it was scheduled to
be rewritten. We discovered a deadlock in recovery when many btree
nodes need to be rewritten because they're degraded: fully fixing
this will take some work but it's now easier to see what's going
on.
For the bug report where this came up, a device had been kicked RO
due to transient errors: manually setting it back to RW was
sufficient to allow recovery to succeed.
- Mark a few more fsck errors as autofix: as a reminder to users,
please do keep reporting cases where something needs to be repaired
and is not repaired automatically (i.e. cases where -o fix_errors
or fsck -y is required).
- rcu_pending.c now works with PREEMPT_RT
- 'bcachefs device add', then umount, then remount wasn't working -
we now emit a uevent so that the new device's new superblock is
correctly picked up
- Assorted repair fixes: btree node scan will no longer incorrectly
update sb->version_min,
- Assorted syzbot fixes"
* tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs: (23 commits)
bcachefs: Don't trace should_be_locked unless changing
bcachefs: Ensure that snapshot creation propagates has_case_insensitive
bcachefs: Print devices we're mounting on multi device filesystems
bcachefs: Don't trust sb->nr_devices in members_to_text()
bcachefs: Fix version checks in validate_bset()
bcachefs: ioctl: avoid stack overflow warning
bcachefs: Don't pass trans to fsck_err() in gc_accounting_done
bcachefs: Fix leak in bch2_fs_recovery() error path
bcachefs: Fix rcu_pending for PREEMPT_RT
bcachefs: Fix downgrade_table_extra()
bcachefs: Don't put rhashtable on stack
bcachefs: Make sure opts.read_only gets propagated back to VFS
bcachefs: Fix possible console lock involved deadlock
bcachefs: mark more errors autofix
bcachefs: Don't persistently run scan_for_btree_nodes
bcachefs: Read error message now prints if self healing
bcachefs: Only run 'increase_depth' for keys from btree node csan
bcachefs: Mark need_discard_freespace_key_bad autofix
bcachefs: Update /dev/disk/by-uuid on device add
bcachefs: Add more flags to btree nodes for rewrite reason
...
Jason Gunthorpe [Tue, 3 Jun 2025 19:14:45 +0000 (16:14 -0300)]
iommu/tegra: Fix incorrect size calculation
This driver uses a mixture of ways to get the size of a PTE,
tegra_smmu_set_pde() did it as sizeof(*pd) which became wrong when pd
switched to a struct tegra_pd.
Switch pd back to a u32* in tegra_smmu_set_pde() so the sizeof(*pd)
returns 4.
Fixes: 50568f87d1e2 ("iommu/terga: Do not use struct page as the handle for as->pd memory") Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt> Closes: https://lore.kernel.org/all/62e7f7fe-6200-4e4f-ad42-d58ad272baa6@tecnico.ulisboa.pt/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Thierry Reding <treding@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Tested-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt> Link: https://lore.kernel.org/r/0-v1-da7b8b3d57eb+ce-iommu_terga_sizeof_jgg@nvidia.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Matthew Wilcox (Oracle) [Thu, 12 Jun 2025 14:42:53 +0000 (15:42 +0100)]
block: Fix bvec_set_folio() for very large folios
Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if
we attempt to add a folio that is larger than 4GB, we'll silently
truncate the offset and len. Widen the parameters to size_t, assert
that the length is less than 4GB and set the first page that contains
the interesting data rather than the first page of the folio.
Matthew Wilcox (Oracle) [Thu, 12 Jun 2025 14:41:25 +0000 (15:41 +0100)]
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
It is possible for physically contiguous folios to have discontiguous
struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not.
This is correctly handled by folio_page_idx(), so remove this open-coded
implementation.
Lorenzo Stoakes [Mon, 9 Jun 2025 16:57:49 +0000 (17:57 +0100)]
mm: add mmap_prepare() compatibility layer for nested file systems
Nested file systems, that is those which invoke call_mmap() within their
own f_op->mmap() handlers, may encounter underlying file systems which
provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83
("mm: introduce new .mmap_prepare() file callback").
We have a chicken-and-egg scenario here - until all file systems are
converted to using .mmap_prepare(), we cannot convert these nested
handlers, as we can't call f_op->mmap from an .mmap_prepare() hook.
So we have to do it the other way round - invoke the .mmap_prepare() hook
from an .mmap() one.
in order to do so, we need to convert VMA state into a struct vm_area_desc
descriptor, invoking the underlying file system's f_op->mmap_prepare()
callback passing a pointer to this, and then setting VMA state accordingly
and safely.
This patch achieves this via the compat_vma_mmap_prepare() function, which
we invoke from call_mmap() if f_op->mmap_prepare() is specified in the
passed in file pointer.
We place the fundamental logic into mm/vma.h where VMA manipulation
belongs. We also update the VMA userland tests to accommodate the
changes.
The compat_vma_mmap_prepare() function and its associated machinery is
temporary, and will be removed once the conversion of file systems is
complete.
We carefully place this code so it can be used with CONFIG_MMU and also
with cutting edge nommu silicon.
[akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build]
[lorenzo.stoakes@oracle.com: remove unused declarations] Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bharath SM [Wed, 11 Jun 2025 11:29:02 +0000 (16:59 +0530)]
smb: improve directory cache reuse for readdir operations
Currently, cached directory contents were not reused across subsequent
'ls' operations because the cache validity check relied on comparing
the ctx pointer, which changes with each readdir invocation. As a
result, the cached dir entries was not marked as valid and the cache was
not utilized for subsequent 'ls' operations.
This change uses the file pointer, which remains consistent across all
readdir calls for a given directory instance, to associate and validate
the cache. As a result, cached directory contents can now be
correctly reused, improving performance for repeated directory listings.
Performance gains with local windows SMB server:
Without the patch and default actimeo=1:
1000 directory enumeration operations on dir with 10k files took 135.0s
With this patch and actimeo=0:
1000 directory enumeration operations on dir with 10k files took just 5.1s
Signed-off-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
Paulo Alcantara [Thu, 12 Jun 2025 15:45:04 +0000 (12:45 -0300)]
smb: client: fix perf regression with deferred closes
Customer reported that one of their applications started failing to
open files with STATUS_INSUFFICIENT_RESOURCES due to NetApp server
hitting the maximum number of opens to same file that it would allow
for a single client connection.
It turned out the client was failing to reuse open handles with
deferred closes because matching ->f_flags directly without masking
off O_CREAT|O_EXCL|O_TRUNC bits first broke the comparision and then
client ended up with thousands of deferred closes to same file. Those
bits are already satisfied on the original open, so no need to check
them against existing open handles.