Christoph Hellwig [Wed, 31 Jul 2024 17:20:05 +0000 (10:20 -0700)]
xfs: push down unsigned char casts
To get us of casting hell, push down the casts to the unsigned char
to the low-level hash code and the on-disk format definitions. The
hash code is the only place that actually cares about the signedness.
The on-disk format would be cleaner by not using it, but using it
is a correct headsup to worry about the signedness.
Note that right now this only really matters for userspace, as the
kernel forces -funsigned-char for all architectures.
Darrick J. Wong [Tue, 9 Jul 2024 14:43:42 +0000 (07:43 -0700)]
xfs: fix file_path handling in tracepoints
Since file_path() takes the output buffer as one of its arguments, we
might as well have it format directly into the tracepoint's char array
instead of wasting stack space.
Fixes: 3934e8ebb7cc6 ("xfs: create a big array data structure") Fixes: 5076a6040ca16 ("xfs: support in-memory buffer cache targets") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202403290419.HPcyvqZu-lkp@intel.com/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Tue, 16 Jul 2024 21:50:25 +0000 (14:50 -0700)]
xfs: fix rtalloc rotoring when delalloc is in use
If we're trying to allocate real space for a delalloc reservation at
offset 0, we should use the rotor to spread files across the rt volume.
Switch the rtalloc to use the XFS_ALLOC_INITIAL_USER_DATA flag that
is set for any write at startoff to make it match the behavior for
the main data device.
Based on a patch from Darrick J. Wong.
Fixes: 6a94b1acda7e ("xfs: reinstate delalloc for RT inodes (if sb_rextsize == 1)") Reported-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Long Li [Tue, 16 Jul 2024 21:50:24 +0000 (14:50 -0700)]
xfs: get rid of xfs_ag_resv_rmapbt_alloc
The pag in xfs_ag_resv_rmapbt_alloc() is already held when the struct
xfs_btree_cur is initialized in xfs_rmapbt_init_cursor(), so there is no
need to get pag again.
On the other hand, in xfs_rmapbt_free_block(), the similar function
xfs_ag_resv_rmapbt_free() was removed in commit 92a005448f6f ("xfs: get
rid of unnecessary xfs_perag_{get,put} pairs"), xfs_ag_resv_rmapbt_alloc()
was left because scrub used it, but now scrub has removed it. Therefore,
we could get rid of xfs_ag_resv_rmapbt_alloc() just like the rmap free
block, make the code cleaner.
Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:23 +0000 (14:50 -0700)]
xfs: skip flushing log items during push
The AIL pushing code spends a huge amount of time skipping over
items that are already marked as flushing. It is not uncommon to
see hundreds of thousands of items skipped every second due to inode
clustering marking all the inodes in a cluster as flushing when the
first one is flushed.
However, to discover an item is already flushing and should be
skipped we have to call the iop_push() method for it to try to flush
the item. For inodes (where this matters most), we have to first
check that inode is flushable first.
We can optimise this overhead away by tracking whether the log item
is flushing internally. This allows xfsaild_push() to check the log
item directly for flushing state and immediately skip the log item.
Whilst this doesn't remove the CPU cache misses for loading the log
item, it does avoid the overhead of an indirect function call
and the cache misses involved in accessing inode and
backing cluster buffer structures to determine flushing state. When
trying to flush hundreds of thousands of inodes each second, this
CPU overhead saving adds up quickly.
It's so noticeable that the biggest issue with pushing on the AIL on
fast storage becomes the 10ms back-off wait when we hit enough
pinned buffers to break out of the push loop but not enough for the
AIL pushing to be considered stuck. This limits the xfsaild to about
70% total CPU usage, and on fast storage this isn't enough to keep
the storage 100% busy.
The xfsaild will block on IO submission on slow storage and so is
self throttling - it does not need a backoff in the case where we
are really just breaking out of the walk to submit the IO we have
gathered.
Further with no backoff we don't need to gather huge delwri lists to
mitigate the impact of backoffs, so we can submit IO more frequently
and reduce the time log items spend in flushing state by breaking
out of the item push loop once we've gathered enough IO to batch
submission effectively.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:23 +0000 (14:50 -0700)]
xfs: grant heads track byte counts, not LSNs
The grant heads in the log track the space reserved in the log for
running transactions. They do this by tracking how far ahead of the
tail that the reservation has reached, and the units for doing this
are {cycle,bytes} for the reserve head rather than {cycle,blocks}
which are normal used by LSNs.
This is annoyingly complex because we have to split, crack and
combined these tuples for any calculation we do to determine log
space and targets. This is computationally expensive as well as
difficult to do atomically and locklessly, as well as limiting the
size of the log to 2^32 bytes.
Really, though, all the grant heads are tracking is how much space
is currently available for use in the log. We can track this as a
simply byte count - we just don't care what the actual physical
location in the log the head and tail are at, just how much space we
have remaining before the head and tail overlap.
So, convert the grant heads to track the byte reservations that are
active rather than the current (cycle, offset) tuples. This means an
empty log has zero bytes consumed, and a full log is when the
reservations reach the size of the log minus the space consumed by
the AIL.
This greatly simplifies the accounting and checks for whether there
is space available. We no longer need to crack or combine LSNs to
determine how much space the log has left, nor do we need to look at
the head or tail of the log to determine how close to full we are.
There is, however, a complexity that needs to be handled. We know
how much space is being tracked in the AIL now via log->l_tail_space
and the log tickets track active reservations and return the unused
portions to the grant heads when ungranted. Unfortunately, we don't
track the used portion of the grant, so when we transfer log items
from the CIL to the AIL, the space accounted to the grant heads is
transferred to the log tail space. Hence when we move the AIL head
forwards on item insert, we have to remove that space from the grant
heads.
We also remove the xlog_verify_grant_tail() debug function as it is
no longer useful. The check it performs has been racy since delayed
logging was introduced, but now it is clearly only detecting false
positives so remove it.
The result of this substantially simpler accounting algorithm is an
increase in sustained transaction rate from ~1.3 million
transactions/s to ~1.9 million transactions/s with no increase in
CPU usage. We also remove the 32 bit space limitation on the grant
heads, which will allow us to increase the journal size beyond 2GB
in future.
Note that this renames the sysfs files exposing the log grant space
now that the values are exported in bytes. This allows xfstests
to auto-detect the old or new ABI.
[hch: move xlog_grant_sub_space out of line,
update the xlog_grant_{add,sub}_space prototypes,
rename the sysfs files to allow auto-detection in xfstests]
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:22 +0000 (14:50 -0700)]
xfs: pass the full grant head to accounting functions
Because we are going to need them soon. API change only, no logic
changes.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:21 +0000 (14:50 -0700)]
xfs: track log space pinned by the AIL
Currently we track space used in the log by grant heads.
These store the reserved space as a physical log location and
combine both space reserved for future use with space already used in
the log in a single variable. The amount of space consumed in the
log is then calculated as the distance between the log tail and
the grant head.
The problem with tracking the grant head as a physical location
comes from the fact that it tracks both log cycle count and offset
into the log in bytes in a single 64 bit variable. because the cycle
count on disk is a 32 bit number, this also limits the offset into
the log to 32 bits. ANd because that is in bytes, we are limited to
being able to track only 2GB of log space in the grant head.
Hence to support larger physical logs, we need to track used space
differently in the grant head. We no longer use the grant head for
guiding AIL pushing, so the only thing it is now used for is
determining if we've run out of reservation space via the
calculation in xlog_space_left().
What we really need to do is move the grant heads away from tracking
physical space in the log. The issue here is that space consumed in
the log is not directly tracked by the current mechanism - the
space consumed in the log by grant head reservations gets returned
to the free pool by the tail of the log moving forward. i.e. the
space isn't directly tracked or calculated, but the used grant space
gets "freed" as the physical limits of the log are updated without
actually needing to update the grant heads.
Hence to move away from implicit, zero-update log space tracking we
need to explicitly track the amount of physical space the log
actually consumes separately to the in-memory reservations for
operations that will be committed to the journal. Luckily, we
already track the information we need to calculate this in the AIL
itself.
That is, the space currently consumed by the journal is the maximum
LSN that the AIL has seen minus the current log tail. As we update
both of these items dynamically as the head and tail of the log
moves, we always know exactly how much space the journal consumes.
This means that we also know exactly how much space the currently
active reservations require, and exactly how much free space we have
remaining for new reservations to be made. Most importantly, we know
what these spaces are indepedently of the physical locations of
the head and tail of the log.
Hence by separating out the physical space consumed by the journal,
we can now track reservations in the grant heads purely as a byte
count, and the log can be considered full when the tail space +
reservation space exceeds the size of the log. This means we can use
the full 64 bits of grant head space for reservation space,
completely removing the 32 bit byte count limitation on log size
that they impose.
Hence the first step in this conversion is to track and update the
"log tail space" every time the AIL tail or maximum seen LSN
changes.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:20 +0000 (14:50 -0700)]
xfs: collapse xlog_state_set_callback in caller
The function is called from a single place, and it isn't just
setting the iclog state to XLOG_STATE_CALLBACK - it can mark iclogs
clean, which moves them to states after CALLBACK. Hence the function
is now badly named, and should just be folded into the caller where
the iclog completion logic makes a whole lot more sense.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:19 +0000 (14:50 -0700)]
xfs: l_last_sync_lsn is really AIL state
The current implementation of xlog_assign_tail_lsn() assumes that
when the AIL is empty, the log tail matches the LSN of the last
written commit record. This is recorded in xlog_state_set_callback()
as log->l_last_sync_lsn when the iclog state changes to
XLOG_STATE_CALLBACK. This change is then immediately followed by
running the callbacks on the iclog which then insert the log items
into the AIL at the "commit lsn" of that checkpoint.
The AIL tracks log items via the start record LSN of the checkpoint,
not the commit record LSN. This is because we can pipeline multiple
checkpoints, and so the start record of checkpoint N+1 can be
written before the commit record of checkpoint N. i.e:
start N commit N
+-------------+------------+----------------+
start N+1 commit N+1
The tail of the log cannot be moved to the LSN of commit N when all
the items of that checkpoint are written back, because then the
start record for N+1 is no longer in the active portion of the log
and recovery will fail/corrupt the filesystem.
Hence when all the log items in checkpoint N are written back, the
tail of the log most now only move as far forwards as the start LSN
of checkpoint N+1.
Hence we cannot use the maximum start record LSN the AIL sees as a
replacement the pointer to the current head of the on-disk log
records. However, we currently only use the l_last_sync_lsn when the
AIL is empty - when there is no start LSN remaining, the tail of the
log moves to the LSN of the last commit record as this is where
recovery needs to start searching for recoverable records. THe next
checkpoint will have a start record LSN that is higher than
l_last_sync_lsn, and so everything still works correctly when new
checkpoints are written to an otherwise empty log.
l_last_sync_lsn is an atomic variable because it is currently
updated when an iclog with callbacks attached moves to the CALLBACK
state. While we hold the icloglock at this point, we don't hold the
AIL lock. When we assign the log tail, we hold the AIL lock, not the
icloglock because we have to look up the AIL. Hence it is an atomic
variable so it's not bound to a specific lock context.
However, the iclog callbacks are only used for CIL checkpoints. We
don't use callbacks with unmount record writes, so the
l_last_sync_lsn variable only gets updated when we are processing
CIL checkpoint callbacks. And those callbacks run under AIL lock
contexts, not icloglock context. The CIL checkpoint already knows
what the LSN of the iclog the commit record was written to (obtained
when written into the iclog before submission) and so we can update
the l_last_sync_lsn under the AIL lock in this callback. No other
iclog callbacks will run until the currently executing one
completes, and hence we can update the l_last_sync_lsn under the AIL
lock safely.
This means l_last_sync_lsn can move to the AIL as the "ail_head_lsn"
and it can be used to replace the atomic l_last_sync_lsn in the
iclog code. This makes tracking the log tail belong entirely to the
AIL, rather than being smeared across log, iclog and AIL state and
locking.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:18 +0000 (14:50 -0700)]
xfs: ensure log tail is always up to date
Whenever we write an iclog, we call xlog_assign_tail_lsn() to update
the current tail before we write it into the iclog header. This
means we have to take the AIL lock on every iclog write just to
check if the tail of the log has moved.
This doesn't avoid races with log tail updates - the log tail could
move immediately after we assign the tail to the iclog header and
hence by the time the iclog reaches stable storage the tail LSN has
moved forward in memory. Hence the log tail LSN in the iclog header
is really just a point in time snapshot of the current state of the
AIL.
With this in mind, if we simply update the in memory log->l_tail_lsn
every time it changes in the AIL, there is no need to update the in
memory value when we are writing it into an iclog - it will already
be up-to-date in memory and checking the AIL again will not change
this. Hence xlog_state_release_iclog() does not need to check the
AIL to update the tail lsn and can just sample it directly without
needing to take the AIL lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:17 +0000 (14:50 -0700)]
xfs: background AIL push should target physical space
Currently the AIL attempts to keep 25% of the "log space" free,
where the current used space is tracked by the reserve grant head.
That is, it tracks both physical space used plus the amount reserved
by transactions in progress.
When we start tail pushing, we are trying to make space for new
reservations by writing back older metadata and the log is generally
physically full of dirty metadata, and reservations for modifications
in flight take up whatever space the AIL can physically free up.
Hence we don't really need to take into account the reservation
space that has been used - we just need to keep the log tail moving
as fast as we can to free up space for more reservations to be made.
We know exactly how much physical space the journal is consuming in
the AIL (i.e. max LSN - min LSN) so we can base push thresholds
directly on this state rather than have to look at grant head
reservations to determine how much to physically push out of the
log.
This also allows code that needs to know if log items in the current
transaction need to be pushed or re-logged to simply sample the
current target - they don't need to calculate the current target
themselves. This avoids the need for any locking when doing such
checks.
Further, moving to a physical target means we don't need "push all
until empty semantics" like were introduced in the previous patch.
We can now test and clear the "push all" as a one-shot command to
set the target to the current head of the AIL. This allows the
xfsaild to maximise the use of log space right up to the point where
conditions indicate that the xfsaild is not keeping up with load and
it needs to work harder, and as soon as those constraints go away
(i.e. external code no longer needs everything pushed) the xfsaild
will return to maintaining the normal 25% free space thresholds.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:16 +0000 (14:50 -0700)]
xfs: AIL doesn't need manual pushing
We have a mechanism that checks the amount of log space remaining
available every time we make a transaction reservation. If the
amount of space is below a threshold (25% free) we push on the AIL
to tell it to do more work. To do this, we end up calculating the
LSN that the AIL needs to push to on every reservation and updating
the push target for the AIL with that new target LSN.
This is silly and expensive. The AIL is perfectly capable of
calculating the push target itself, and it will always be running
when the AIL contains objects.
What the target does is determine if the AIL needs to do
any work before it goes back to sleep. If we haven't run out of
reservation space or memory (or some other push all trigger), it
will simply go back to sleep for a while if there is more than 25%
of the journal space free without doing anything.
If there are items in the AIL at a lower LSN than the target, it
will try to push up to the target or to the point of getting stuck
before going back to sleep and trying again soon after.`
Hence we can modify the AIL to calculate it's own 25% push target
before it starts a push using the same reserve grant head based
calculation as is currently used, and remove all the places where we
ask the AIL to push to a new 25% free target. We can also drop the
minimum free space size of 256BBs from the calculation because the
25% of a minimum sized log is *always going to be larger than
256BBs.
This does still require a manual push in certain circumstances.
These circumstances arise when the AIL is not full, but the
reservation grants consume the entire of the free space in the log.
In this case, we still need to push on the AIL to free up space, so
when we hit this condition (i.e. reservation going to sleep to wait
on log space) we do a single push to tell the AIL it should empty
itself. This will keep the AIL moving as new reservations come in
and want more space, rather than keep queuing them and having to
push the AIL repeatedly.
The reason for using the "push all" when grant space runs out is
that we can run out of grant space when there is more than 25% of
the log free. Small logs are notorious for this, and we have a hack
in the log callback code (xlog_state_set_callback()) where we push
the AIL because the *head* moved) to ensure that we kick the AIL
when we consume space in it because that can push us over the "less
than 25% available" available that starts tail pushing back up
again.
Hence when we run out of grant space and are going to sleep, we have
to consider that the grant space may be consuming almost all the log
space and there is almost nothing in the AIL. In this situation, the
AIL pins the tail and moving the tail forwards is the only way the
grant space will come available, so we have to force the AIL to push
everything to guarantee grant space will eventually be returned.
Hence triggering a "push all" just before sleeping removes all the
nasty corner cases we have in other parts of the code that work
around the "we didn't ask the AIL to push enough to free grant
space" condition that leads to log space hangs...
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Dave Chinner [Tue, 16 Jul 2024 21:50:15 +0000 (14:50 -0700)]
xfs: move and rename xfs_trans_committed_bulk
Ever since the CIL and delayed logging was introduced,
xfs_trans_committed_bulk() has been a purely CIL checkpoint
completion function and not a transaction commit completion
function. Now that we are adding log specific updates to this
function, it really does not have anything to do with the
transaction subsystem - it is really log and log item level
functionality.
This should be part of the CIL code as it is the callback
that moves log items from the CIL checkpoint to the AIL. Move it
and rename it to xlog_cil_ail_insert().
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Zizhi Wo [Tue, 16 Jul 2024 21:50:14 +0000 (14:50 -0700)]
xfs: Avoid races with cnt_btree lastrec updates
A concurrent file creation and little writing could unexpectedly return
-ENOSPC error since there is a race window that the allocator could get
the wrong agf->agf_longest.
Write file process steps:
1) Find the entry that best meets the conditions, then calculate the start
address and length of the remaining part of the entry after allocation.
2) Delete this entry and update the -current- agf->agf_longest.
3) Insert the remaining unused parts of this entry based on the
calculations in 1), and update the agf->agf_longest again if necessary.
Create file process steps:
1) Check whether there are free inodes in the inode chunk.
2) If there is no free inode, check whether there has space for creating
inode chunks, perform the no-lock judgment first.
3) If the judgment succeeds, the judgment is performed again with agf lock
held. Otherwire, an error is returned directly.
If the write process is in step 2) but not go to 3) yet, the create file
process goes to 2) at this time, it may be mistaken for no space,
resulting in the file system still has space but the file creation fails.
We have sent two different commits to the community in order to fix this
problem[1][2]. Unfortunately, both solutions have flaws. In [2], I
discussed with Dave and Darrick, realized that a better solution to this
problem requires the "last cnt record tracking" to be ripped out of the
generic btree code. And surprisingly, Dave directly provided his fix code.
This patch includes appropriate modifications based on his tmp-code to
address this issue.
The entire fix can be roughly divided into two parts:
1) Delete the code related to lastrec-update in the generic btree code.
2) Place the process of updating longest freespace with cntbt separately
to the end of the cntbt modifications. Move the cursor to the rightmost
firstly, and update the longest free extent based on the record.
Note that we can not update the longest with xfs_alloc_get_rec() after
find the longest record, as xfs_verify_agbno() may not pass because
pag->block_count is updated on the outside. Therefore, use
xfs_btree_get_rec() as a replacement.
Darrick J. Wong [Wed, 29 May 2024 04:13:05 +0000 (21:13 -0700)]
xfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c
Move the code that adds the incore xfs_refcount_update_item deferred
work data to a transaction live with the CUI log item code. This means
that the refcount code no longer has to know about the inner workings of
the CUI log items.
As a consequence, we can get rid of the _{get,put}_group helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:13:05 +0000 (21:13 -0700)]
xfs: simplify usage of the rcur local variable in xfs_refcount_finish_one
Only update rcur when we know the final *pcur value.
Inspired-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:13:04 +0000 (21:13 -0700)]
xfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one
In xfs_refcount_finish_one we know the cursor is non-zero when calling
xfs_refcount_finish_one_cleanup and we pass a 0 error variable. This
means xfs_refcount_finish_one_cleanup is just doing a
xfs_btree_del_cursor.
Open code that and move xfs_refcount_finish_one_cleanup to
fs/xfs/xfs_refcount_item.c.
Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:13:03 +0000 (21:13 -0700)]
xfs: add a ci_entry helper
Add a helper to translate from the item list head to the
refcount_intent_item structure and use it so shorten assignments and
avoid the need for extra local variables.
Inspired-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:13:01 +0000 (21:13 -0700)]
xfs: pass btree cursors to refcount btree tracepoints
Prepare the rest of refcount btree tracepoints for use with realtime
reflink by making them take the btree cursor object as a parameter.
This will save us a lot of trouble later on.
Remove the xfs_refcount_recover_extent tracepoint since it's already
covered by other refcount tracepoints.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:13:01 +0000 (21:13 -0700)]
xfs: create specialized classes for refcount tracepoints
The only user of the "ag" tracepoint event classes is the refcount
btree, so rename them to make that obvious and make them take the btree
cursor to simplify the arguments. This will save us a lot of trouble
later on.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:11:45 +0000 (21:11 -0700)]
xfs: move xfs_rmap_update_defer_add to xfs_rmap_item.c
Move the code that adds the incore xfs_rmap_update_item deferred work
data to a transaction to live with the RUI log item code. This means
that the rmap code no longer has to know about the inner workings of the
RUI log items.
As a consequence, we can get rid of the _{get,put}_group helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Wed, 29 May 2024 04:11:45 +0000 (21:11 -0700)]
xfs: simplify usage of the rcur local variable in xfs_rmap_finish_one
Only update rcur when we know the final *pcur value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 29 May 2024 04:11:44 +0000 (21:11 -0700)]
xfs: don't bother calling xfs_rmap_finish_one_cleanup in xfs_rmap_finish_one
In xfs_rmap_finish_one we known the cursor is non-zero when calling
xfs_rmap_finish_one_cleanup and we pass a 0 error variable. This means
xfs_rmap_finish_one_cleanup is just doing a xfs_btree_del_cursor.
Open code that and move xfs_rmap_finish_one_cleanup to
fs/xfs/xfs_rmap_item.c.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor porting changes] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 29 May 2024 04:11:43 +0000 (21:11 -0700)]
xfs: add a ri_entry helper
Add a helper to translate from the item list head to the
rmap_intent_item structure and use it so shorten assignments
and avoid the need for extra local variables.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:11:41 +0000 (21:11 -0700)]
xfs: pass btree cursors to rmap btree tracepoints
Prepare the rmap btree tracepoints for use with realtime rmap btrees by
making them take the btree cursor object as a parameter. This will save
us a lot of trouble later on.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:11:41 +0000 (21:11 -0700)]
xfs: give rmap btree cursor error tracepoints their own class
Create a new tracepoint class for btree-related errors, then convert all
the rmap tracepoints to use it. Also fix the one tracepoint that was
abusing the old class by making it a separate tracepoint.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:11:39 +0000 (21:11 -0700)]
xfs: move xfs_extent_free_defer_add to xfs_extfree_item.c
Move the code that adds the incore xfs_extent_free_item deferred work
data to a transaction to live with the EFI log item code. This means
that the allocator code no longer has to know about the inner workings
of the EFI log items.
As a consequence, we can get rid of the _{get,put}_group helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Wed, 29 May 2024 04:11:37 +0000 (21:11 -0700)]
xfs: remove duplicate asserts in xfs_defer_extent_free
The bno/len verification is already done by the calls to
xfs_verify_rtbext / xfs_verify_fsbext, and reporting a corruption error
seem like the better handling than tripping an assert anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 29 May 2024 04:11:36 +0000 (21:11 -0700)]
xfs: add a xefi_entry helper
Add a helper to translate from the item list head to the
xfs_extent_free_item structure and use it so shorten assignments
and avoid the need for extra local variables.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 29 May 2024 04:11:35 +0000 (21:11 -0700)]
xfs: pass the fsbno to xfs_perag_intent_get
All callers of xfs_perag_intent_get have a fsbno and need boilerplate
code to turn that into an agno. Just pass the fsbno to
xfs_perag_intent_get and look up the agno there.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:10:51 +0000 (21:10 -0700)]
xfs: don't use the incore struct xfs_sb for offsets into struct xfs_dsb
Currently, the XFS_SB_CRC_OFF macro uses the incore superblock struct
(xfs_sb) to compute the address of sb_crc within the ondisk superblock
struct (xfs_dsb). This is a landmine if we ever change the layout of
the incore superblock (as we're about to do), so redefine the macro
to use xfs_dsb to compute the layout of xfs_dsb.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:50 +0000 (21:10 -0700)]
xfs: move dirent update hooks to xfs_dir2.c
Move the directory entry update hook code to xfs_dir2 so that it is
mostly consolidated with the higher level directory functions. Retain
the exports so that online fsck can still send notifications through the
hooks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:50 +0000 (21:10 -0700)]
xfs: create libxfs helper to rename two directory entries
Create a new libxfs function to rename two directory entries. The
upcoming metadata directory feature will need this to replace a metadata
inode directory entry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:49 +0000 (21:10 -0700)]
xfs: create libxfs helper to exchange two directory entries
Create a new libxfs function to exchange two directory entries.
The upcoming metadata directory feature will need this to replace a
metadata inode directory entry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:49 +0000 (21:10 -0700)]
xfs: create libxfs helper to remove an existing inode/name from a directory
Create a new libxfs function to remove a (name, inode) entry from a
directory. The upcoming metadata directory feature will need this to
create a metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:48 +0000 (21:10 -0700)]
xfs: create libxfs helper to link an existing inode into a directory
Create a new libxfs function to link an existing inode into a directory.
The upcoming metadata directory feature will need this to create a
metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:47 +0000 (21:10 -0700)]
xfs: create libxfs helper to link a new inode into a directory
Create a new libxfs function to link a newly created inode into a
directory. The upcoming metadata directory feature will need this to
create a metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 18 Jun 2024 19:56:54 +0000 (12:56 -0700)]
xfs: separate the icreate logic around INIT_XATTRS
INIT_XATTRS is overloaded here -- it's set during the creat process when
we think that we're immediately going to set some ACL xattrs to save
time. However, it's also used by the parent pointers code to enable the
attr fork in preparation to receive ppptr xattrs. This results in
xfs_has_parent() branches scattered around the codebase to turn on
INIT_XATTRS.
Linkable files are created far more commonly than unlinkable temporary
files or directory tree roots, so we should centralize this logic in
xfs_inode_init. For the three callers that don't want parent pointers
(online repiar tempfiles, unlinkable tempfiles, rootdir creation) we
provide an UNLINKABLE flag to skip attr fork initialization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:45 +0000 (21:10 -0700)]
xfs: wrap inode creation dqalloc calls
Create a helper that calls dqalloc to allocate and grab a reference to
dquots for the user, group, and project ids listed in an icreate
structure. This simplifies the creat-related dqalloc callsites
scattered around the code base.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:45 +0000 (21:10 -0700)]
xfs: push xfs_icreate_args creation out of xfs_create*
Move the initialization of the xfs_icreate_args structure out of
xfs_create and xfs_create_tempfile into their callers so that we can set
the new inode's attributes in one place and pass that through instead of
open coding the collection of attributes all over the code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:44 +0000 (21:10 -0700)]
xfs: split new inode creation into two pieces
There are two parts to initializing a newly allocated inode: setting up
the incore structures, and initializing the new inode core based on the
parent inode and the current user's environment. The initialization
code is not specific to the kernel, so we would like to share that with
userspace by hoisting it to libxfs. Therefore, split xfs_icreate into
separate functions to prepare for the next few patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:43 +0000 (21:10 -0700)]
xfs: implement atime updates in xfs_trans_ichgtime
Enable xfs_trans_ichgtime to change the inode access time so that we can
use this function to set inode times when allocating inodes instead of
open-coding it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Wed, 29 May 2024 04:10:42 +0000 (21:10 -0700)]
xfs: pack icreate initialization parameters into a separate structure
Callers that want to create an inode currently pass all possible file
attribute values for the new inode into xfs_init_new_inode as ten
separate parameters. This causes two code maintenance issues: first, we
have large multi-line call sites which programmers must read carefully
to make sure they did not accidentally invert a value. Second, all
three file id parameters must be passed separately to the quota
functions; any discrepancy results in quota count errors.
Clean this up by creating a new icreate_args structure to hold all this
information, some helpers to initialize them properly, and make the
callers pass this structure through to the creation function, whose name
we shorten to xfs_icreate. This eliminates the issues, enables us to
keep the inode init code in sync with userspace via libxfs, and is
needed for future metadata directory tree management.
(A subsequent cleanup will also fix the quota alloc calls.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 17 Jun 2024 18:39:54 +0000 (11:39 -0700)]
xfs: use consistent uid/gid when grabbing dquots for inodes
I noticed that callers of xfs_qm_vop_dqalloc use the following code to
compute the anticipated uid of the new file:
mapped_fsuid(idmap, &init_user_ns);
whereas the VFS uses a slightly different computation for actually
assigning i_uid:
mapped_fsuid(idmap, i_user_ns(inode));
Technically, these are not the same things. According to Christian
Brauner, the only time that inode->i_sb->s_user_ns != &init_user_ns is
when the filesystem was mounted in a new mount namespace by an
unpriviledged user. XFS does not allow this, which is why we've never
seen bug reports about quotas being incorrect or the uid checks in
xfs_qm_vop_create_dqattach tripping debug assertions.
However, this /is/ a logic bomb, so let's make the code consistent.
This inode log item recovery failing the dinode verifier after
replaying the contents of the inode log item into the ondisk inode.
Looking back into what the kernel was doing at the time of the fs
shutdown, a thread was in the middle of running a series of
transactions, each of which committed changes to the inode.
At some point in the middle of that chain, an invalid (at least
according to the verifier) change was committed. Had the filesystem not
shut down in the middle of the chain, a subsequent transaction would
have corrected the invalid state and nobody would have noticed. But
that's not what happened here. Instead, the invalid inode state was
committed to the ondisk log, so log recovery tripped over it.
The actual defect here was an overzealous inode verifier, which was
fixed in a separate patch. This patch adds some transaction precommit
functions for CONFIG_XFS_DEBUG=y mode so that we can detect these kinds
of transient errors at transaction commit time, where it's much easier
to find the root cause.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 16 Jul 2024 21:48:03 +0000 (14:48 -0700)]
xfs: enable FITRIM on the realtime device
Implement FITRIM for the realtime device by pretending that it's
"space" immediately after the data device. We have to hold the
rtbitmap ILOCK while the discard operations are ongoing because there's
no busy extent tracking for the rt volume to prevent reallocations.
Cc: Konst Mayer <cdlscpmv@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
xfs: Remove header files which are included more than once
Following warning is reported, so remove these duplicated header
including:
./fs/xfs/libxfs/xfs_trans_resv.c: xfs_da_format.h is included more than once.
./fs/xfs/scrub/quota_repair.c: xfs_format.h is included more than once.
./fs/xfs/xfs_handle.c: xfs_da_btree.h is included more than once.
./fs/xfs/xfs_qm_bhv.c: xfs_mount.h is included more than once.
./fs/xfs/xfs_trace.c: xfs_bmap.h is included more than once.
This is just a clean code, no logic changed.
Signed-off-by: Wenchao Hao <haowenchao22@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Tue, 16 Jul 2024 21:48:01 +0000 (14:48 -0700)]
xfs: fold xfs_ilock_for_write_fault into xfs_write_fault
Now that the page fault handler has been refactored, the only caller
of xfs_ilock_for_write_fault is simple enough and calls it
unconditionally. Fold the logic and expand the comments explaining it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Tue, 16 Jul 2024 21:48:00 +0000 (14:48 -0700)]
xfs: always take XFS_MMAPLOCK shared in xfs_dax_read_fault
After the previous refactoring, xfs_dax_fault is now never used for write
faults, so don't bother with the xfs_ilock_for_write_fault logic to
protect against writes when remapping is in progress.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Christoph Hellwig [Tue, 16 Jul 2024 21:47:57 +0000 (14:47 -0700)]
xfs: move the dio write relocking out of xfs_ilock_for_iomap
About half of xfs_ilock_for_iomap deals with a special case for direct
I/O writes to COW files that need to take the ilock exclusively. Move
this code into the one callers that cares and simplify
xfs_ilock_for_iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
lei lu [Tue, 16 Jul 2024 21:47:56 +0000 (14:47 -0700)]
xfs: don't walk off the end of a directory data block
This adds sanity checks for xfs_dir2_data_unused and xfs_dir2_data_entry
to make sure don't stray beyond valid memory region. Before patching, the
loop simply checks that the start offset of the dup and dep is within the
range. So in a crafted image, if last entry is xfs_dir2_data_unused, we
can change dup->length to dup->length-1 and leave 1 byte of space. In the
next traversal, this space will be considered as dup or dep. We may
encounter an out of bound read when accessing the fixed members.
In the patch, we make sure that the remaining bytes large enough to hold
an unused entry before accessing xfs_dir2_data_unused and
xfs_dir2_data_unused is XFS_DIR2_DATA_ALIGN byte aligned. We also make
sure that the remaining bytes large enough to hold a dirent with a
single-byte name before accessing xfs_dir2_data_entry.
Signed-off-by: lei lu <llfamsec@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
lei lu [Tue, 16 Jul 2024 21:47:55 +0000 (14:47 -0700)]
xfs: add bounds checking to xlog_recover_process_data
There is a lack of verification of the space occupied by fixed members
of xlog_op_header in the xlog_recover_process_data.
We can create a crafted image to trigger an out of bounds read by
following these steps:
1) Mount an image of xfs, and do some file operations to leave records
2) Before umounting, copy the image for subsequent steps to simulate
abnormal exit. Because umount will ensure that tail_blk and
head_blk are the same, which will result in the inability to enter
xlog_recover_process_data
3) Write a tool to parse and modify the copied image in step 2
4) Make the end of the xlog_op_header entries only 1 byte away from
xlog_rec_header->h_size
5) xlog_rec_header->h_num_logops++
6) Modify xlog_rec_header->h_crc
Fix:
Add a check to make sure there is sufficient space to access fixed members
of xlog_op_header.
Signed-off-by: lei lu <llfamsec@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
John Garry [Tue, 16 Jul 2024 21:47:54 +0000 (14:47 -0700)]
xfs: Fix xfs_prepare_shift() range for RT
The RT extent range must be considered in the xfs_flush_unmap_range() call
to stabilize the boundary.
This code change is originally from Dave Chinner.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
John Garry [Tue, 16 Jul 2024 21:47:53 +0000 (14:47 -0700)]
xfs: Fix xfs_flush_unmap_range() range for RT
Currently xfs_flush_unmap_range() does unmap for a full RT extent range,
which we also want to ensure is clean and idle.
This code change is originally from Dave Chinner.
Reviewed-by: Christoph Hellwig <hch@lst.de>4 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Currently AGFL blocks can be filled from the following three sources:
- allocbt free blocks, as in xfs_allocbt_free_block();
- rmapbt free blocks, as in xfs_rmapbt_free_block();
- refilled from freespace btrees, as in xfs_alloc_fix_freelist().
Originally, allocbt free blocks would be marked as stale only when they
put back in the general free space pool as Dave mentioned on IRC, "we
don't stale AGF metadata btree blocks when they are returned to the
AGFL .. but once they get put back in the general free space pool, we
have to make sure the buffers are marked stale as the next user of
those blocks might be user data...."
However, after commit ca250b1b3d71 ("xfs: invalidate allocbt blocks
moved to the free list") and commit edfd9dd54921 ("xfs: move buffer
invalidation to xfs_btree_free_block"), even allocbt / bmapbt free
blocks will be invalidated immediately since they may fail to pass
V5 format validation on writeback even writeback to free space would be
safe.
IOWs, IMHO currently there is actually no difference of free blocks
between AGFL freespace pool and the general free space pool. So let's
avoid extra redundant AGFL buffer invalidation, since otherwise we're
currently facing unnecessary xfs_log_force() due to xfs_trans_binval()
again on buffers already marked as stale before as below:
xfs_log_force() will take tens of milliseconds with AGF buffer locked.
It becomes an unnecessary long latency especially on our PMEM devices
with FSDAX enabled and fsops like xfs_reflink_find_shared() at the same
time are stuck due to the same AGF lock. Removing the double
invalidation on the AGFL blocks does not make this issue go away, but
this patch fixes for our workloads in reality and it should also work
by the code analysis.
Note that I'm not sure I need to remove another redundant one in
xfs_alloc_ag_vextent_small() since it's unrelated to our workloads.
Also fstests are passed with this patch.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Merge tag 'kbuild-fixes-v6.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Make scripts/ld-version.sh robust against the latest LLD
- Fix warnings in rpm-pkg with device tree support
- Fix warnings in fortify tests with KASAN
* tag 'kbuild-fixes-v6.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
fortify: fix warnings in fortify tests with KASAN
kbuild: rpm-pkg: avoid the warnings with dtb's listed twice
kbuild: Make ld-version.sh more robust against version string changes
When a software KASAN mode is enabled, the fortify tests emit warnings
on some architectures.
For example, for ARCH=arm, the combination of CONFIG_FORTIFY_SOURCE=y
and CONFIG_KASAN=y produces the following warnings:
TEST lib/test_fortify/read_overflow-memchr.log
warning: unsafe memchr() usage lacked '__read_overflow' warning in lib/test_fortify/read_overflow-memchr.c
TEST lib/test_fortify/read_overflow-memchr_inv.log
warning: unsafe memchr_inv() usage lacked '__read_overflow' symbol in lib/test_fortify/read_overflow-memchr_inv.c
TEST lib/test_fortify/read_overflow-memcmp.log
warning: unsafe memcmp() usage lacked '__read_overflow' warning in lib/test_fortify/read_overflow-memcmp.c
TEST lib/test_fortify/read_overflow-memscan.log
warning: unsafe memscan() usage lacked '__read_overflow' symbol in lib/test_fortify/read_overflow-memscan.c
TEST lib/test_fortify/read_overflow2-memcmp.log
warning: unsafe memcmp() usage lacked '__read_overflow2' warning in lib/test_fortify/read_overflow2-memcmp.c
[ more and more similar warnings... ]
Commit 9c2d1328f88a ("kbuild: provide reasonable defaults for tool
coverage") removed KASAN flags from non-kernel objects by default.
It was an intended behavior because lib/test_fortify/*.c are unit
tests that are not linked to the kernel.
As it turns out, some architectures require -fsanitize=kernel-(hw)address
to define __SANITIZE_ADDRESS__ for the fortify tests.
Without __SANITIZE_ADDRESS__ defined, arch/arm/include/asm/string.h
defines __NO_FORTIFY, thus excluding <linux/fortify-string.h>.
This issue does not occur on x86 thanks to commit 4ec4190be4cf
("kasan, x86: don't rename memintrinsics in uninstrumented files"),
but there are still some architectures that define __NO_FORTIFY
in such a situation.
Set KASAN_SANITIZE=y explicitly to the fortify tests.
Fixes: 9c2d1328f88a ("kbuild: provide reasonable defaults for tool coverage") Reported-by: Arnd Bergmann <arnd@arndb.de> Closes: https://lore.kernel.org/all/0e8dee26-41cc-41ae-9493-10cd1a8e3268@app.fastmail.com/ Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
kbuild: rpm-pkg: avoid the warnings with dtb's listed twice
After 8d1001f7bdd0 (kbuild: rpm-pkg: fix build error with CONFIG_MODULES=n),
the following warning "warning: File listed twice: *.dtb" is appearing for
every dtb file that is included.
The reason is that the commented commit already adds the folder
/lib/modules/%{KERNELRELEASE} in kernel.list file so the folder
/lib/modules/%{KERNELRELEASE}/dtb is no longer necessary, just remove it.
because the trailing comma is included in the patch level part of the
expression. While [1] has been partially reverted in [2] to avoid this
breakage (as it impacts the configuration stage and it is present in all
LTS branches), it would be good to make ld-version.sh more robust
against such miniscule changes like this one.
Use POSIX shell parameter expansion [3] to remove the largest suffix
after just numbers and periods, replacing of the current removal of
everything after a hyphen. ld-version.sh continues to work for a number
of distributions (Arch Linux, Debian, and Fedora) and the kernel.org
toolchains and no longer errors on a version of ld.lld with [1].
Merge tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix a performance regression when measuring the CPU time of a thread
(clock_gettime(CLOCK_THREAD_CPUTIME_ID,...)) due to the addition of
PSI IRQ time accounting in the hotpath
- Fix a task_struct leak due to missing to decrement the refcount when
the task is enqueued before the timer which is supposed to do that,
expires
- Revert an attempt to expedite detaching of movable tasks, as finding
those could become very costly. Turns out the original issue wasn't
even hit by anyone
* tag 'sched_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
sched/deadline: Fix task_struct reference leak
Revert "sched/fair: Make sure to try to detach at least one movable task"
Merge tag 'x86_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Make sure TF is cleared before calling other functions (BHI
mitigation in this case) in the SYSENTER compat handler, as
otherwise it will warn about being in single-step mode
* tag 'x86_urgent_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bhi: Avoid warning in #DB handler due to BHI mitigation
Merge tag 'i2c-for-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Fixes for the I2C testunit, the Renesas R-Car driver and some
MAINTAINERS corrections"
* tag 'i2c-for-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: testunit: avoid re-issued work after read message
i2c: rcar: ensure Gen3+ reset does not disturb local targets
i2c: mark HostNotify target address as used
i2c: testunit: correct Kconfig description
MAINTAINERS: VIRTIO I2C loses a maintainer, gains a reviewer
MAINTAINERS: delete entries for Thor Thayer
i2c: rcar: clear NO_RXDMA flag after resetting
i2c: rcar: bring hardware to known state when probing
Steve French [Tue, 9 Jul 2024 23:07:35 +0000 (18:07 -0500)]
cifs: fix setting SecurityFlags to true
If you try to set /proc/fs/cifs/SecurityFlags to 1 it
will set them to CIFSSEC_MUST_NTLMV2 which no longer is
relevant (the less secure ones like lanman have been removed
from cifs.ko) and is also missing some flags (like for
signing and encryption) and can even cause mount to fail,
so change this to set it to Kerberos in this case.
Also change the description of the SecurityFlags to remove mention
of flags which are no longer supported.
Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Wolfram Sang [Sat, 13 Jul 2024 08:50:55 +0000 (10:50 +0200)]
Merge tag 'i2c-host-fixes-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
This tag includes three fixes for the Renesas R-Car driver:
1. Ensures the device is in a known state after probing.
2. Allows clearing the NO_RXDMA flag after a reset.
3. Forces a reset before any transfer on Gen3+ platforms to
prevent disruption of the configuration during parallel
transfers.
Merge tag 'net-6.10-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull more networking fixes from Jakub Kicinski:
"A quick follow up to yesterday's pull. We got a regressions report for
the bnxt patch as soon as it got to your tree. The ethtool fix is also
good to have, although it's an older regression.
Current release - regressions:
- eth: bnxt_en: fix crash in bnxt_get_max_rss_ctx_ring() on older HW
when user tries to decrease the ring count
Previous releases - regressions:
- ethtool: fix RSS setting, accept "no change" setting if the driver
doesn't support the new features
- eth: i40e: remove needless retries of NVM update, don't wait 20min
when we know the firmware update won't succeed"
* tag 'net-6.10-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
bnxt_en: Fix crash in bnxt_get_max_rss_ctx_ring()
octeontx2-af: fix issue with IPv4 match for RSS
octeontx2-af: fix issue with IPv6 ext match for RSS
octeontx2-af: fix detection of IP layer
octeontx2-af: fix a issue with cpt_lf_alloc mailbox
octeontx2-af: replace cpt slot with lf id on reg write
i40e: fix: remove needless retries of NVM update
net: ethtool: Fix RSS setting
bp->rss_ctx_list is not initialized if the chip or firmware does not
support multiple RSS contexts. Fix it by adding a check in
bnxt_get_max_rss_ctx_ring() before proceeding to reference
bp->rss_ctx_list.
Merge tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fix a regression in extent map shrinker behaviour.
In the past weeks we got reports from users that there are huge
latency spikes or freezes. This was bisected to newly added shrinker
of extent maps (it was added to fix a build up of the structures in
memory).
I'm assuming that the freezes would happen to many users after release
so I'd like to get it merged now so it's in 6.10. Although the diff
size is not small the changes are relatively straightforward, the
reporters verified the fixes and we did testing on our side.
The fixes:
- adjust behaviour under memory pressure and check lock or scheduling
conditions, bail out if needed
- synchronize tracking of the scanning progress so inode ranges are
not skipped or work duplicated
- do a delayed iput when scanning a root so evicting an inode does
not slow things down in case of lots of dirty data, also fix
lockdep warning, a deadlock could happen when writing the dirty
data would need to start a transaction"
* tag 'for-6.10-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid races when tracking progress for extent map shrinking
btrfs: stop extent map shrinker if reschedule is needed
btrfs: use delayed iput during extent map shrinking
Merge tag 'pmdomain-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain fix from Ulf Hansson:
- qcom: Skip retention level for rpmhpd's
* tag 'pmdomain-v6.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: qcom: rpmhpd: Skip retention level for Power Domains