www.infradead.org Git - users/hch/xfsprogs.git/log
12 months agoxfs: get rid of xfs_ag_resv_rmapbt_alloc
Long Li [Tue, 16 Jul 2024 21:57:26 +0000 (14:57 -0700)]
xfs: get rid of xfs_ag_resv_rmapbt_alloc

Source kernel commit: 49cdc4e834e46d7c11a91d7adcfa04f56d19efaf

The pag in xfs_ag_resv_rmapbt_alloc() is already held when the struct
xfs_btree_cur is initialized in xfs_rmapbt_init_cursor(), so there is no
need to get pag again.

On the other hand, the similar function xfs_ag_resv_rmapbt_free(), used in
xfs_rmapbt_free_block(), was removed in commit 92a005448f6f ("xfs: get
rid of unnecessary xfs_perag_{get,put} pairs"). xfs_ag_resv_rmapbt_alloc()
was left in place because scrub used it, but scrub no longer does.
Therefore, we can get rid of xfs_ag_resv_rmapbt_alloc() just as was done
for the rmap free block path, which makes the code cleaner.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months agoxfs: background AIL push should target physical space
Dave Chinner [Tue, 16 Jul 2024 21:57:26 +0000 (14:57 -0700)]
xfs: background AIL push should target physical space

Source kernel commit: b50b4c49d8d79af05ac3bb3587f58589713139cc

Currently the AIL attempts to keep 25% of the "log space" free,
where the current used space is tracked by the reserve grant head.
That is, it tracks both physical space used plus the amount reserved
by transactions in progress.

When we start tail pushing, we are trying to make space for new
reservations by writing back older metadata and the log is generally
physically full of dirty metadata, and reservations for modifications
in flight take up whatever space the AIL can physically free up.

Hence we don't really need to take into account the reservation
space that has been used - we just need to keep the log tail moving
as fast as we can to free up space for more reservations to be made.
We know exactly how much physical space the journal is consuming in
the AIL (i.e. max LSN - min LSN) so we can base push thresholds
directly on this state rather than have to look at grant head
reservations to determine how much to physically push out of the
log.
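
The physical-space threshold described above can be sketched as follows.
This is a minimal model, not the kernel implementation: the struct and
function names are hypothetical, and LSNs are treated as plain,
monotonically increasing block counts so that no cycle/wrap handling is
needed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the state the AIL already knows. */
struct ail_model {
	uint64_t log_blocks;	/* total size of the journal */
	uint64_t min_lsn;	/* oldest item in the AIL (the log tail) */
	uint64_t max_lsn;	/* newest item in the AIL */
};

/*
 * Return the LSN the AIL should push up to so that at least 25% of the
 * journal is physically free, or 0 if no push is needed.
 */
static uint64_t ail_push_target(const struct ail_model *ail)
{
	uint64_t used, max_used;

	assert(ail->max_lsn >= ail->min_lsn);

	used = ail->max_lsn - ail->min_lsn;		/* max LSN - min LSN */
	max_used = ail->log_blocks - ail->log_blocks / 4;	/* 75% of the log */

	if (used <= max_used)
		return 0;			/* already 25% or more free */
	return ail->max_lsn - max_used;		/* move the tail up to here */
}
```

With a 1000 block log, a tail at 100 and a head at 1000, the sketch asks
for a push up to LSN 250, which leaves exactly 250 blocks free. No grant
head state is consulted.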

This also allows code that needs to know if log items in the current
transaction need to be pushed or re-logged to simply sample the
current target - they don't need to calculate the current target
themselves. This avoids the need for any locking when doing such
checks.

Further, moving to a physical target means we don't need the "push all
until empty" semantics that were introduced in the previous patch.
We can now test and clear the "push all" as a one-shot command to
set the target to the current head of the AIL. This allows the
xfsaild to maximise the use of log space right up to the point where
conditions indicate that the xfsaild is not keeping up with load and
it needs to work harder, and as soon as those constraints go away
(i.e. external code no longer needs everything pushed) the xfsaild
will return to maintaining the normal 25% free space thresholds.
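
The one-shot "test and clear" behaviour can be illustrated with C11
atomics. This is only a sketch under assumptions: the flag name, the
struct and the precomputed threshold_target field are hypothetical and
do not reflect the kernel's actual data structures.

```c
#include <stdatomic.h>
#include <stdint.h>

#define AIL_PUSH_ALL	(1u << 0)	/* hypothetical "push everything" request */

struct ail_state {
	atomic_uint	flags;
	uint64_t	max_lsn;		/* current head of the AIL */
	uint64_t	threshold_target;	/* target from the 25% free rule */
};

/* Pick the push target for the next pass of the push thread. */
static uint64_t ail_pick_target(struct ail_state *ail)
{
	/* Atomically clear the request bit and learn whether it was set. */
	unsigned int old = atomic_fetch_and(&ail->flags, ~AIL_PUSH_ALL);

	if (old & AIL_PUSH_ALL)
		return ail->max_lsn;		/* one-shot: push to the head */
	return ail->threshold_target;		/* normal 25% free behaviour */
}
```

Because the bit is cleared by the same operation that tests it, the next
call falls back to the threshold target unless someone has requested
another "push all" in the meantime.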

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months agoxfs: AIL doesn't need manual pushing
Dave Chinner [Tue, 16 Jul 2024 21:57:26 +0000 (14:57 -0700)]
xfs: AIL doesn't need manual pushing

Source kernel commit: 9adf40249e6cfd7231c2973bb305f6c20902bfd9

We have a mechanism that checks the amount of log space remaining
available every time we make a transaction reservation. If the
amount of space is below a threshold (25% free) we push on the AIL
to tell it to do more work. To do this, we end up calculating the
LSN that the AIL needs to push to on every reservation and updating
the push target for the AIL with that new target LSN.

This is silly and expensive. The AIL is perfectly capable of
calculating the push target itself, and it will always be running
when the AIL contains objects.

What the target does is determine if the AIL needs to do
any work before it goes back to sleep. If we haven't run out of
reservation space or memory (or some other push all trigger), it
will simply go back to sleep for a while if there is more than 25%
of the journal space free without doing anything.

If there are items in the AIL at a lower LSN than the target, it
will try to push up to the target or to the point of getting stuck
before going back to sleep and trying again soon after.

Hence we can modify the AIL to calculate its own 25% push target
before it starts a push using the same reserve grant head based
calculation as is currently used, and remove all the places where we
ask the AIL to push to a new 25% free target. We can also drop the
minimum free space size of 256BBs from the calculation because the
25% of a minimum sized log is *always* going to be larger than
256BBs.

This does still require a manual push in certain circumstances.
These circumstances arise when the AIL is not full, but the
reservation grants consume the entirety of the free space in the log.
In this case, we still need to push on the AIL to free up space, so
when we hit this condition (i.e. reservation going to sleep to wait
on log space) we do a single push to tell the AIL it should empty
itself. This will keep the AIL moving as new reservations come in
and want more space, rather than keep queuing them and having to
push the AIL repeatedly.

The reason for using the "push all" when grant space runs out is
that we can run out of grant space when there is more than 25% of
the log free. Small logs are notorious for this, and we have a hack
in the log callback code (xlog_state_set_callback(), where we push
the AIL because the *head* moved) to ensure that we kick the AIL
when we consume space in it, because that can push us over the "less
than 25% available" threshold that starts tail pushing back up
again.

Hence when we run out of grant space and are going to sleep, we have
to consider that the grant space may be consuming almost all the log
space and there is almost nothing in the AIL. In this situation, the
AIL pins the tail and moving the tail forwards is the only way the
grant space will come available, so we have to force the AIL to push
everything to guarantee grant space will eventually be returned.
Hence triggering a "push all" just before sleeping removes all the
nasty corner cases we have in other parts of the code that work
around the "we didn't ask the AIL to push enough to free grant
space" condition that leads to log space hangs...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months agoFrom 94a0333b9212a114d19096a77903f76d0d5bca26 Mon Sep 17 00:00:00 2001
Zizhi Wo [Mon, 1 Jul 2024 06:02:36 +0000 (14:02 +0800)]
From 94a0333b9212a114d19096a77903f76d0d5bca26 Mon Sep 17 00:00:00 2001
Subject: xfs: Avoid races with cnt_btree lastrec updates

Concurrent file creation and small writes could unexpectedly return an
-ENOSPC error because there is a race window in which the allocator can
read the wrong agf->agf_longest.

Write file process steps:
1) Find the entry that best meets the conditions, then calculate the start
   address and length of the remaining part of the entry after allocation.
2) Delete this entry and update the -current- agf->agf_longest.
3) Insert the remaining unused parts of this entry based on the
   calculations in 1), and update the agf->agf_longest again if necessary.

Create file process steps:
1) Check whether there are free inodes in the inode chunk.
2) If there is no free inode, check whether there is space for creating
   inode chunks, performing the check without the lock first.
3) If the unlocked check succeeds, the check is performed again with the
   agf lock held. Otherwise, an error is returned directly.

If the write process has completed step 2) but not yet reached step 3) when
the create file process reaches its step 2), the creator may wrongly
conclude that there is no space, so file creation fails even though the
file system still has free space.

We have sent two different patches to the community in order to fix this
problem[1][2]. Unfortunately, both solutions have flaws. While discussing
[2] with Dave and Darrick, I realized that a better solution to this
problem requires the "last cnt record tracking" to be ripped out of the
generic btree code. And surprisingly, Dave directly provided his fix code.
This patch includes appropriate modifications based on his tmp-code to
address this issue.

The entire fix can be roughly divided into two parts:
1) Delete the code related to lastrec-update in the generic btree code.
2) Move the update of the longest free extent out of the individual cntbt
   operations to the end of the cntbt modifications. First move the cursor
   to the rightmost record, then update the longest free extent from it.
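
The effect of deferring the update can be sketched with a toy model. The
names below are hypothetical and an array scan stands in for "look at
the rightmost record of the by-size btree"; the point is only that the
cached longest value is written once, after both the delete and the
reinsert, so no intermediate value is ever published.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EXTENTS 8

/* Hypothetical stand-in for the by-size free space btree plus agf_longest. */
struct freesp_model {
	uint32_t len[MAX_EXTENTS];	/* free extent lengths, unsorted */
	size_t   count;
	uint32_t longest;		/* cached value, like agf->agf_longest */
};

/* Recompute the cached longest extent from the current records. */
static void update_longest(struct freesp_model *fs)
{
	uint32_t longest = 0;

	for (size_t i = 0; i < fs->count; i++)
		if (fs->len[i] > longest)
			longest = fs->len[i];
	fs->longest = longest;
}

/* Allocate 'want' blocks from the front of extent 'idx'. */
static void alloc_from_extent(struct freesp_model *fs, size_t idx,
			      uint32_t want)
{
	uint32_t remainder;

	assert(idx < fs->count && want <= fs->len[idx]);
	remainder = fs->len[idx] - want;

	/* Delete the record; fs->longest is deliberately not touched here. */
	fs->len[idx] = fs->len[--fs->count];

	/* Insert the unused remainder, if any. */
	if (remainder)
		fs->len[fs->count++] = remainder;

	/* Single update once all record modifications are done. */
	update_longest(fs);
}
```

Starting from extents of 100 and 8 blocks, allocating 10 blocks from the
100 block extent takes the cached longest value from 100 straight to 90.
Updating it at the delete step instead would have briefly published 8,
which is the value an unlocked reader could trip over.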

Note that we cannot update the longest with xfs_alloc_get_rec() after
finding the longest record, as xfs_verify_agbno() may not pass because
pag->block_count is updated on the outside. Therefore, use
xfs_btree_get_rec() as a replacement.

[1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
[2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com

Reported-by: Ye Bin <yebin10@huawei.com>

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months agoxfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c
Darrick J. Wong [Wed, 3 Jul 2024 21:21:43 +0000 (14:21 -0700)]
xfs: move xfs_refcount_update_defer_add to xfs_refcount_item.c

Move the code that adds the incore xfs_refcount_update_item deferred
work data to a transaction to live with the CUI log item code.  This means
that the refcount code no longer has to know about the inner workings of
the CUI log items.

As a consequence, we can get rid of the _{get,put}_group helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: simplify usage of the rcur local variable in xfs_refcount_finish_one
Darrick J. Wong [Wed, 3 Jul 2024 21:21:42 +0000 (14:21 -0700)]
xfs: simplify usage of the rcur local variable in xfs_refcount_finish_one

Only update rcur when we know the final *pcur value.

Inspired-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one
Darrick J. Wong [Wed, 3 Jul 2024 21:21:42 +0000 (14:21 -0700)]
xfs: don't bother calling xfs_refcount_finish_one_cleanup in xfs_refcount_finish_one

In xfs_refcount_finish_one we know the cursor is non-zero when calling
xfs_refcount_finish_one_cleanup and we pass a 0 error variable.  This
means xfs_refcount_finish_one_cleanup is just doing a
xfs_btree_del_cursor.

Open code that and move xfs_refcount_finish_one_cleanup to
fs/xfs/xfs_refcount_item.c.

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: reuse xfs_refcount_update_cancel_item
Darrick J. Wong [Wed, 3 Jul 2024 21:21:42 +0000 (14:21 -0700)]
xfs: reuse xfs_refcount_update_cancel_item

Reuse xfs_refcount_update_cancel_item to put the AG/RTG and free the
item in a few places that currently open code the logic.

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: add a ci_entry helper
Darrick J. Wong [Wed, 3 Jul 2024 21:21:42 +0000 (14:21 -0700)]
xfs: add a ci_entry helper

Add a helper to translate from the item list head to the
refcount_intent_item structure and use it to shorten assignments and
avoid the need for extra local variables.
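
A helper of this kind is usually a thin wrapper around container_of().
The sketch below is self-contained userspace C: the list_head and
container_of definitions are minimal stand-ins for the kernel's, and the
fields of the intent structure are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's list_head and container_of(). */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down intent item; field names are assumptions. */
struct refcount_intent_item {
	struct list_head	ri_list;
	unsigned long long	ri_startblock;
	unsigned int		ri_blockcount;
};

/* Translate from the embedded list head back to the containing item. */
static inline struct refcount_intent_item *ci_entry(const struct list_head *e)
{
	assert(e != NULL);
	return container_of(e, struct refcount_intent_item, ri_list);
}
```

A caller that is handed a struct list_head pointer can then write
ci_entry(item)->ri_startblock directly instead of declaring a local
variable just to hold the converted pointer.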

Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: clean up refcount log intent item tracepoint callsites
Darrick J. Wong [Wed, 3 Jul 2024 21:21:42 +0000 (14:21 -0700)]
xfs: clean up refcount log intent item tracepoint callsites

Pass the incore refcount intent structure to the tracepoints instead of
open-coding the argument passing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: prepare refcount btree tracepoints for widening
Darrick J. Wong [Wed, 3 Jul 2024 21:21:41 +0000 (14:21 -0700)]
xfs: prepare refcount btree tracepoints for widening

Prepare the rest of refcount btree tracepoints for use with realtime
reflink by making them take the btree cursor object as a parameter.
This will save us a lot of trouble later on.

Remove the xfs_refcount_recover_extent tracepoint since it's already
covered by other refcount tracepoints.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: create specialized classes for refcount tracepoints
Darrick J. Wong [Wed, 3 Jul 2024 21:21:41 +0000 (14:21 -0700)]
xfs: create specialized classes for refcount tracepoints

The only user of the "ag" tracepoint event classes is the refcount
btree, so rename them to make that obvious and make them take the btree
cursor to simplify the arguments.  This will save us a lot of trouble
later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: give refcount btree cursor error tracepoints their own class
Darrick J. Wong [Wed, 3 Jul 2024 21:21:41 +0000 (14:21 -0700)]
xfs: give refcount btree cursor error tracepoints their own class

Convert all the refcount tracepoints to use the btree error tracepoint
class.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: move xfs_rmap_update_defer_add to xfs_rmap_item.c
Darrick J. Wong [Wed, 3 Jul 2024 21:21:41 +0000 (14:21 -0700)]
xfs: move xfs_rmap_update_defer_add to xfs_rmap_item.c

Move the code that adds the incore xfs_rmap_update_item deferred work
data to a transaction to live with the RUI log item code.  This means that
the rmap code no longer has to know about the inner workings of the RUI
log items.

As a consequence, we can get rid of the _{get,put}_group helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: simplify usage of the rcur local variable in xfs_rmap_finish_one
Christoph Hellwig [Wed, 3 Jul 2024 21:21:41 +0000 (14:21 -0700)]
xfs: simplify usage of the rcur local variable in xfs_rmap_finish_one

Only update rcur when we know the final *pcur value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: don't bother calling xfs_rmap_finish_one_cleanup in xfs_rmap_finish_one
Christoph Hellwig [Wed, 3 Jul 2024 21:21:40 +0000 (14:21 -0700)]
xfs: don't bother calling xfs_rmap_finish_one_cleanup in xfs_rmap_finish_one

In xfs_rmap_finish_one we know the cursor is non-zero when calling
xfs_rmap_finish_one_cleanup and we pass a 0 error variable.  This means
xfs_rmap_finish_one_cleanup is just doing a xfs_btree_del_cursor.

Open code that and move xfs_rmap_finish_one_cleanup to
fs/xfs/xfs_rmap_item.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor porting changes]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: reuse xfs_rmap_update_cancel_item
Christoph Hellwig [Wed, 3 Jul 2024 21:21:40 +0000 (14:21 -0700)]
xfs: reuse xfs_rmap_update_cancel_item

Reuse xfs_rmap_update_cancel_item to put the AG/RTG and free the item in
a few places that currently open code the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: add a ri_entry helper
Christoph Hellwig [Wed, 3 Jul 2024 21:21:40 +0000 (14:21 -0700)]
xfs: add a ri_entry helper

Add a helper to translate from the item list head to the
rmap_intent_item structure and use it to shorten assignments
and avoid the need for extra local variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: clean up rmap log intent item tracepoint callsites
Darrick J. Wong [Wed, 3 Jul 2024 21:21:40 +0000 (14:21 -0700)]
xfs: clean up rmap log intent item tracepoint callsites

Pass the incore rmap structure to the tracepoints instead of open-coding
the argument passing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: prepare rmap btree tracepoints for widening
Darrick J. Wong [Wed, 3 Jul 2024 21:21:39 +0000 (14:21 -0700)]
xfs: prepare rmap btree tracepoints for widening

Prepare the rmap btree tracepoints for use with realtime rmap btrees by
making them take the btree cursor object as a parameter.  This will save
us a lot of trouble later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: give rmap btree cursor error tracepoints their own class
Darrick J. Wong [Wed, 3 Jul 2024 21:21:39 +0000 (14:21 -0700)]
xfs: give rmap btree cursor error tracepoints their own class

Create a new tracepoint class for btree-related errors, then convert all
the rmap tracepoints to use it.  Also fix the one tracepoint that was
abusing the old class by making it a separate tracepoint.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: move xfs_extent_free_defer_add to xfs_extfree_item.c
Darrick J. Wong [Wed, 3 Jul 2024 21:21:39 +0000 (14:21 -0700)]
xfs: move xfs_extent_free_defer_add to xfs_extfree_item.c

Move the code that adds the incore xfs_extent_free_item deferred work
data to a transaction to live with the EFI log item code.  This means that
the allocator code no longer has to know about the inner workings of the
EFI log items.

As a consequence, we can get rid of the _{get,put}_group helpers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: remove xfs_defer_agfl_block
Christoph Hellwig [Wed, 3 Jul 2024 21:21:39 +0000 (14:21 -0700)]
xfs: remove xfs_defer_agfl_block

xfs_free_extent_later can handle the extra AGFL special casing with
very little extra logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: remove duplicate asserts in xfs_defer_extent_free
Christoph Hellwig [Wed, 3 Jul 2024 21:21:39 +0000 (14:21 -0700)]
xfs: remove duplicate asserts in xfs_defer_extent_free

The bno/len verification is already done by the calls to
xfs_verify_rtbext / xfs_verify_fsbext, and reporting a corruption error
seems like the better handling than tripping an assert anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: reuse xfs_extent_free_cancel_item
Christoph Hellwig [Wed, 3 Jul 2024 21:21:38 +0000 (14:21 -0700)]
xfs: reuse xfs_extent_free_cancel_item

Reuse xfs_extent_free_cancel_item to put the AG/RTG and free the item in
a few places that currently open code the logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: add a xefi_entry helper
Christoph Hellwig [Wed, 3 Jul 2024 21:21:38 +0000 (14:21 -0700)]
xfs: add a xefi_entry helper

Add a helper to translate from the item list head to the
xfs_extent_free_item structure and use it to shorten assignments
and avoid the need for extra local variables.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: pass the fsbno to xfs_perag_intent_get
Christoph Hellwig [Wed, 3 Jul 2024 21:21:38 +0000 (14:21 -0700)]
xfs: pass the fsbno to xfs_perag_intent_get

All callers of xfs_perag_intent_get have a fsbno and need boilerplate
code to turn that into an agno.  Just pass the fsbno to
xfs_perag_intent_get and look up the agno there.
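
The shape of this change can be sketched as below. The types and names
are hypothetical models, and the shift-based conversion assumes that the
AG number lives in the high bits of a filesystem block number, above
log2 of the AG size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mount model: only the fields the conversion needs. */
struct mount_model {
	uint8_t  agblklog;	/* log2 of the AG size in blocks, rounded up */
	uint32_t agcount;	/* number of allocation groups */
};

struct perag_model {
	uint32_t agno;
};

/* The boilerplate every caller previously had to open-code. */
static uint32_t fsb_to_agno(const struct mount_model *mp, uint64_t fsbno)
{
	return (uint32_t)(fsbno >> mp->agblklog);
}

/* Take the fsbno directly and look up the AG number inside the helper. */
static int perag_intent_get(const struct mount_model *mp, uint64_t fsbno,
			    struct perag_model *pag)
{
	uint32_t agno;

	assert(mp != NULL && pag != NULL);
	agno = fsb_to_agno(mp, fsbno);
	if (agno >= mp->agcount)
		return -1;	/* block is beyond the last AG */
	pag->agno = agno;
	return 0;
}
```

Callers now pass the block number they already have, and the conversion
exists in exactly one place.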

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: convert "skip_discard" to a proper flags bitset
Darrick J. Wong [Wed, 3 Jul 2024 21:21:38 +0000 (14:21 -0700)]
xfs: convert "skip_discard" to a proper flags bitset

Convert the boolean to skip discard on free into a proper flags field so
that we can add more flags in the next patch.
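
A boolean-to-bitset conversion of this kind typically looks like the
sketch below. The macro and function names are hypothetical; only the
"skip discard" meaning comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names for the extent freeing call. */
#define FREE_EXTENT_SKIP_DISCARD	(1U << 0)	/* don't discard freed blocks */
#define FREE_EXTENT_ALL_FLAGS		(FREE_EXTENT_SKIP_DISCARD)

/* Return true if the freed extent should be discarded afterwards. */
static bool should_discard(unsigned int free_flags)
{
	/* Reject bits nobody has defined yet. */
	assert(!(free_flags & ~FREE_EXTENT_ALL_FLAGS));

	return !(free_flags & FREE_EXTENT_SKIP_DISCARD);
}
```

A later patch can add another bit to the mask without changing any
function signature, which a bool parameter would not allow.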

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: clean up extent free log intent item tracepoint callsites
Darrick J. Wong [Wed, 3 Jul 2024 21:21:38 +0000 (14:21 -0700)]
xfs: clean up extent free log intent item tracepoint callsites

Pass the incore EFI structure to the tracepoints instead of open-coding
the argument passing.  This cleans up the call sites a bit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs_repair: use library functions for orphanage creation
Darrick J. Wong [Wed, 3 Jul 2024 21:21:37 +0000 (14:21 -0700)]
xfs_repair: use library functions for orphanage creation

Use new library functions to create lost+found.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs_repair: use library functions to reset root/rbm/rsum inodes
Darrick J. Wong [Wed, 3 Jul 2024 21:21:37 +0000 (14:21 -0700)]
xfs_repair: use library functions to reset root/rbm/rsum inodes

Use the iroot reset function to reset root inodes instead of open-coding
the reset routine.  While we're at it, fix a longstanding memory leak if
the inode being reset actually had an xattr fork full of mappings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs_db: port the iunlink command to use the libxfs iunlink function
Darrick J. Wong [Wed, 3 Jul 2024 21:21:37 +0000 (14:21 -0700)]
xfs_db: port the iunlink command to use the libxfs iunlink function

Now that we've ported the kernel's iunlink code to userspace, adapt the
debugger command to use it instead of duplicating the logic.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: don't use the incore struct xfs_sb for offsets into struct xfs_dsb
Darrick J. Wong [Wed, 3 Jul 2024 21:21:37 +0000 (14:21 -0700)]
xfs: don't use the incore struct xfs_sb for offsets into struct xfs_dsb

Currently, the XFS_SB_CRC_OFF macro uses the incore superblock struct
(xfs_sb) to compute the address of sb_crc within the ondisk superblock
struct (xfs_dsb).  This is a landmine if we ever change the layout of
the incore superblock (as we're about to do), so redefine the macro
to use xfs_dsb to compute the layout of xfs_dsb.
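
The hazard is easy to reproduce with two toy structures. Both layouts
below are hypothetical and far smaller than the real superblocks; they
only show that an offset taken from the incore struct stops matching the
ondisk struct as soon as the incore layout changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ondisk layout: fixed by the disk format. */
struct dsb_model {
	uint32_t magicnum;
	uint32_t blocksize;
	uint32_t crc;
};

/* Hypothetical incore layout: free to grow incore-only fields. */
struct sb_model {
	uint32_t magicnum;
	uint64_t incore_only;	/* shifts every field that follows */
	uint32_t blocksize;
	uint32_t crc;
};

/* Fragile: derives an ondisk offset from the incore struct. */
#define SB_CRC_OFF_FRAGILE	offsetof(struct sb_model, crc)

/* Robust: derives the ondisk offset from the ondisk struct. */
#define SB_CRC_OFF		offsetof(struct dsb_model, crc)

static_assert(offsetof(struct dsb_model, crc) == 8,
	      "ondisk crc offset must not move");
```

In this model the two macros disagree, so a checksum routine using the
fragile one would read and write the wrong bytes of the ondisk buffer.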

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: move dirent update hooks to xfs_dir2.c
Darrick J. Wong [Wed, 3 Jul 2024 21:21:37 +0000 (14:21 -0700)]
xfs: move dirent update hooks to xfs_dir2.c

Move the directory entry update hook code to xfs_dir2 so that it is
mostly consolidated with the higher level directory functions.  Retain
the exports so that online fsck can still send notifications through the
hooks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: create libxfs helper to rename two directory entries
Darrick J. Wong [Wed, 3 Jul 2024 21:21:36 +0000 (14:21 -0700)]
xfs: create libxfs helper to rename two directory entries

Create a new libxfs function to rename two directory entries.  The
upcoming metadata directory feature will need this to replace a metadata
inode directory entry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: create libxfs helper to exchange two directory entries
Darrick J. Wong [Wed, 3 Jul 2024 21:21:36 +0000 (14:21 -0700)]
xfs: create libxfs helper to exchange two directory entries

Create a new libxfs function to exchange two directory entries.
The upcoming metadata directory feature will need this to replace a
metadata inode directory entry.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: create libxfs helper to remove an existing inode/name from a directory
Darrick J. Wong [Wed, 3 Jul 2024 21:21:36 +0000 (14:21 -0700)]
xfs: create libxfs helper to remove an existing inode/name from a directory

Create a new libxfs function to remove a (name, inode) entry from a
directory.  The upcoming metadata directory feature will need this to
create a metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: hoist inode free function to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:36 +0000 (14:21 -0700)]
xfs: hoist inode free function to libxfs

Create a libxfs helper function that marks an inode free on disk.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: create libxfs helper to link an existing inode into a directory
Darrick J. Wong [Wed, 3 Jul 2024 21:21:36 +0000 (14:21 -0700)]
xfs: create libxfs helper to link an existing inode into a directory

Create a new libxfs function to link an existing inode into a directory.
The upcoming metadata directory feature will need this to create a
metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: create libxfs helper to link a new inode into a directory
Darrick J. Wong [Wed, 3 Jul 2024 21:21:35 +0000 (14:21 -0700)]
xfs: create libxfs helper to link a new inode into a directory

Create a new libxfs function to link a newly created inode into a
directory.  The upcoming metadata directory feature will need this to
create a metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: separate the icreate logic around INIT_XATTRS
Darrick J. Wong [Wed, 3 Jul 2024 21:21:35 +0000 (14:21 -0700)]
xfs: separate the icreate logic around INIT_XATTRS

INIT_XATTRS is overloaded here -- it's set during the creat process when
we think that we're immediately going to set some ACL xattrs to save
time.  However, it's also used by the parent pointers code to enable the
attr fork in preparation to receive pptr xattrs.  This results in
xfs_has_parent() branches scattered around the codebase to turn on
INIT_XATTRS.

Linkable files are created far more commonly than unlinkable temporary
files or directory tree roots, so we should centralize this logic in
xfs_inode_init.  For the three callers that don't want parent pointers
(online repair tempfiles, unlinkable tempfiles, rootdir creation) we
provide an UNLINKABLE flag to skip attr fork initialization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: hoist xfs_{bump,drop}link to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:35 +0000 (14:21 -0700)]
xfs: hoist xfs_{bump,drop}link to libxfs

Move xfs_bumplink and xfs_droplink to libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: hoist xfs_iunlink to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:35 +0000 (14:21 -0700)]
xfs: hoist xfs_iunlink to libxfs

Move xfs_iunlink and xfs_iunlink_remove to libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: hoist new inode initialization functions to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:34 +0000 (14:21 -0700)]
xfs: hoist new inode initialization functions to libxfs

Move all the code that initializes a new inode's attributes from the
icreate_args structure and the parent directory into libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: implement get_random_u32
Darrick J. Wong [Wed, 3 Jul 2024 21:21:34 +0000 (14:21 -0700)]
libxfs: implement get_random_u32

Actually query the kernel for some random bytes instead of returning
zero, if that's possible.  The most noticeable effect of this is that
mkfs will now create the rtbitmap file, the rtsummary file, and children
of the root directory with a nonzero generation.  Apparently xfsdump
requires that the root directory have a generation number of zero.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: remove libxfs_dir_ialloc
Darrick J. Wong [Wed, 3 Jul 2024 21:21:34 +0000 (14:21 -0700)]
libxfs: remove libxfs_dir_ialloc

This function no longer exists in the kernel, and it's not really needed
in userspace either.  There are two users of it: repair and mkfs.
xfs_repair and xfs_db do not have useful cred and fsxattr structures so
they can call libxfs_dialloc and libxfs_icreate directly.  For mkfs
we'll move the guts of libxfs_dir_ialloc into proto.c as a creatproto
function that handles setting user/group ids, and move struct cred to
mkfs since it's now the only user.

This gets us ready to hoist the rest of the inode initialization code to
libxfs for metadata directories.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: backport inode init code from the kernel
Darrick J. Wong [Wed, 3 Jul 2024 21:21:34 +0000 (14:21 -0700)]
libxfs: backport inode init code from the kernel

Reorganize the userspace inode initialization code to more closely
resemble its kernel counterpart.  This is preparation to hoist the
initialization routines to libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: split new inode creation into two pieces
Darrick J. Wong [Wed, 3 Jul 2024 21:21:34 +0000 (14:21 -0700)]
xfs: split new inode creation into two pieces

There are two parts to initializing a newly allocated inode: setting up
the incore structures, and initializing the new inode core based on the
parent inode and the current user's environment.  The initialization
code is not specific to the kernel, so we would like to share that with
userspace by hoisting it to libxfs.  Therefore, split xfs_icreate into
separate functions to prepare for the next few patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: pass flags2 from parent to child when creating files
Darrick J. Wong [Wed, 3 Jul 2024 21:21:33 +0000 (14:21 -0700)]
libxfs: pass flags2 from parent to child when creating files

When mkfs creates a new file as a child of an existing directory, we
should propagate the flags2 field from parent to child like the kernel
does.  This ensures that mkfs propagates cowextsize hints properly when
protofiles are in use.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: when creating a file in a directory, set the project id based on the parent
Darrick J. Wong [Wed, 3 Jul 2024 21:21:33 +0000 (14:21 -0700)]
libxfs: when creating a file in a directory, set the project id based on the parent

When we're creating a file as a child of an existing directory, use
xfs_get_initial_prid to have the child inherit the project id of the
directory if the directory has PROJINHERIT set, just like the kernel
does.  This fixes mkfs project id propagation with -d projinherit=X when
protofiles are in use.
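
The inheritance rule can be sketched as follows. The struct, the flag
value and the function name are stand-ins chosen for illustration; they
are assumptions, not the definitions used by libxfs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIFLAG_PROJINHERIT	(1u << 9)	/* hypothetical bit position */
#define PROJID_DEFAULT		0u

/* Hypothetical model of the parent directory inode. */
struct dir_inode_model {
	uint16_t diflags;
	uint32_t projid;
};

/* Choose the project id for a new child of directory 'dp'. */
static uint32_t get_initial_prid(const struct dir_inode_model *dp)
{
	assert(dp != NULL);

	if (dp->diflags & DIFLAG_PROJINHERIT)
		return dp->projid;	/* child inherits from the parent */
	return PROJID_DEFAULT;
}
```

A child created in a directory with the inherit flag set and project id
42 gets project id 42; without the flag it gets the default.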

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: set access time when creating files
Darrick J. Wong [Wed, 3 Jul 2024 21:21:33 +0000 (14:21 -0700)]
libxfs: set access time when creating files

Set the access time on files that we're creating, to match the behavior
of the kernel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agolibxfs: rearrange libxfs_trans_ichgtime call when creating inodes
Darrick J. Wong [Wed, 3 Jul 2024 21:21:33 +0000 (14:21 -0700)]
libxfs: rearrange libxfs_trans_ichgtime call when creating inodes

Rearrange the libxfs_trans_ichgtime call in libxfs_ialloc so that we
call it once with the flags we want.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: implement atime updates in xfs_trans_ichgtime
Darrick J. Wong [Wed, 3 Jul 2024 21:21:33 +0000 (14:21 -0700)]
xfs: implement atime updates in xfs_trans_ichgtime

Enable xfs_trans_ichgtime to change the inode access time so that we can
use this function to set inode times when allocating inodes instead of
open-coding it.
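
A flag-driven timestamp update of this kind can be sketched as below.
The flag names and values and the inode_times struct are hypothetical;
the sketch only shows how one call with the right flags replaces several
open-coded assignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; each bit selects one timestamp to update. */
#define ICHGTIME_MOD	(1u << 0)	/* data modification time */
#define ICHGTIME_CHG	(1u << 1)	/* inode change time */
#define ICHGTIME_CREATE	(1u << 2)	/* creation time */
#define ICHGTIME_ACCESS	(1u << 3)	/* access time */

struct inode_times {
	int64_t atime, mtime, ctime, crtime;
};

/* Set every timestamp selected by 'flags' to 'now'. */
static void ichgtime(struct inode_times *t, unsigned int flags, int64_t now)
{
	assert(t != NULL);

	if (flags & ICHGTIME_MOD)
		t->mtime = now;
	if (flags & ICHGTIME_CHG)
		t->ctime = now;
	if (flags & ICHGTIME_CREATE)
		t->crtime = now;
	if (flags & ICHGTIME_ACCESS)
		t->atime = now;
}
```

Inode allocation can then pass all four flags in a single call, while
ordinary updates keep passing only the bits they need.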

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months agoxfs: pack icreate initialization parameters into a separate structure
Darrick J. Wong [Wed, 3 Jul 2024 21:21:32 +0000 (14:21 -0700)]
xfs: pack icreate initialization parameters into a separate structure

Callers that want to create an inode currently pass all possible file
attribute values for the new inode into xfs_init_new_inode as ten
separate parameters.  This causes two code maintenance issues: first, we
have large multi-line call sites which programmers must read carefully
to make sure they did not accidentally invert a value.  Second, all
three file id parameters must be passed separately to the quota
functions; any discrepancy results in quota count errors.

Clean this up by creating a new icreate_args structure to hold all this
information, some helpers to initialize them properly, and make the
callers pass this structure through to the creation function, whose name
we shorten to xfs_icreate.  This eliminates the issues, enables us to
keep the inode init code in sync with userspace via libxfs, and is
needed for future metadata directory tree management.

(A subsequent cleanup will also fix the quota alloc calls.)
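
To illustrate the design choice, here is a small Python sketch of
packing creation parameters into one object.  The field names and both
helper functions are hypothetical and do not mirror the real
icreate_args layout; the point is only that the quota path and the
creation path read the same values from the same place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IcreateArgs:
    """Hypothetical bundle of the attributes a new inode starts with."""
    uid: int
    gid: int
    prid: int
    mode: int
    nlink: int = 1
    rdev: int = 0


def quota_ids(args: IcreateArgs) -> tuple:
    """Ids the quota code would charge for this creation."""
    return (args.uid, args.gid, args.prid)


def icreate(args: IcreateArgs) -> dict:
    """Build a toy inode record from the same argument bundle."""
    return {
        "uid": args.uid,
        "gid": args.gid,
        "prid": args.prid,
        "mode": args.mode,
        "nlink": args.nlink,
        "rdev": args.rdev,
    }
```

Because both helpers take the same frozen object, a caller cannot pass
one set of ids to quota and a different set to the inode initializer.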

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago libxfs: pass IGET flags through to xfs_iread
Darrick J. Wong [Wed, 3 Jul 2024 21:21:32 +0000 (14:21 -0700)]
libxfs: pass IGET flags through to xfs_iread

Change the lock_flags parameter to iget_flags so that we can supply
XFS_IGET_ flags in future patches.  All callers of libxfs_iget and
libxfs_trans_iget pass zero for this parameter and there are no inode
locks in xfsprogs, so there's no behavior change here.

Port the kernel's version of the xfs_inode_from_disk callsite.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago libxfs: put all the inode functions in a single file
Darrick J. Wong [Wed, 3 Jul 2024 21:21:32 +0000 (14:21 -0700)]
libxfs: put all the inode functions in a single file

Move all the inode functions into a single source code file.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs: hoist project id get/set functions to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:32 +0000 (14:21 -0700)]
xfs: hoist project id get/set functions to libxfs

Move the project id get and set functions into libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs: hoist inode flag conversion functions to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:31 +0000 (14:21 -0700)]
xfs: hoist inode flag conversion functions to libxfs

Hoist the inode flag conversion functions into libxfs so that we can
keep them in sync.  Do this by creating a new xfs_inode_util.c file in
libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs: hoist extent size helpers to libxfs
Darrick J. Wong [Wed, 3 Jul 2024 21:21:31 +0000 (14:21 -0700)]
xfs: hoist extent size helpers to libxfs

Move the extent size helpers to xfs_bmap.c in libxfs since they're used
there already.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs: Remove header files which are included more than once
Wenchao Hao [Tue, 16 Jul 2024 21:54:14 +0000 (14:54 -0700)]
xfs: Remove header files which are included more than once

Source kernel commit: a330cae8a7147890262b06e1aa13db048e3b130f

The following warnings are reported, so remove the duplicated header
includes:

./fs/xfs/libxfs/xfs_trans_resv.c: xfs_da_format.h is included more than once.
./fs/xfs/scrub/quota_repair.c: xfs_format.h is included more than once.
./fs/xfs/xfs_handle.c: xfs_da_btree.h is included more than once.
./fs/xfs/xfs_qm_bhv.c: xfs_mount.h is included more than once.
./fs/xfs/xfs_trace.c: xfs_bmap.h is included more than once.

This is just a code cleanup; no logic is changed.

Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months ago xfs: don't walk off the end of a directory data block
lei lu [Fri, 14 Jun 2024 02:22:53 +0000 (10:22 +0800)]
xfs: don't walk off the end of a directory data block

Source kernel commit: 0c7fcdb6d06cdf8b19b57c17605215b06afa864a

This adds sanity checks for xfs_dir2_data_unused and xfs_dir2_data_entry
to make sure we don't stray beyond the valid memory region. Before this
patch, the loop only checked that the start offset of the dup and dep is
within the range. So in a crafted image, if the last entry is an
xfs_dir2_data_unused, we can change dup->length to dup->length-1 and
leave 1 byte of space. In the next traversal, this space will be treated
as a dup or dep, and we may encounter an out-of-bounds read when
accessing the fixed members.

In the patch, we make sure that the remaining bytes are large enough to
hold an unused entry before accessing xfs_dir2_data_unused, and that
xfs_dir2_data_unused is XFS_DIR2_DATA_ALIGN byte aligned. We also make
sure that the remaining bytes are large enough to hold a dirent with a
single-byte name before accessing xfs_dir2_data_entry.
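
The checks above can be modeled with a short Python walker over a
simplified record layout.  Everything here is an assumption made for
illustration (the 6-byte unused header, the entry size formula, the
function names); it is not the on-disk format parser, only a sketch of
"verify the remaining space before reading the fixed members".

```python
import struct

ALIGN = 8            # stands in for XFS_DIR2_DATA_ALIGN
FREE_TAG = 0xFFFF    # marker at the start of an unused region
UNUSED_MIN = 6       # simplified: freetag (2) + length (2) + trailing tag (2)


def entry_size(namelen: int) -> int:
    """Simplified dirent size: inumber (8) + namelen (1) + name + tag (2)."""
    raw = 8 + 1 + namelen + 2
    return (raw + ALIGN - 1) // ALIGN * ALIGN


def walk_data_block(block: bytes, start: int = 0) -> list:
    """Walk records in `block`; raise ValueError instead of reading past the end."""
    end = len(block)
    off = start
    found = []
    while off < end:
        remaining = end - off
        # Too little room even for the fixed members of an unused record.
        if remaining < UNUSED_MIN:
            raise ValueError(f"offset {off}: only {remaining} bytes left")
        freetag, length = struct.unpack_from(">HH", block, off)
        if freetag == FREE_TAG:
            # The unused length must be aligned and stay inside the block.
            if length == 0 or length % ALIGN or length > remaining:
                raise ValueError(f"offset {off}: bad unused length {length}")
            found.append(("unused", off, length))
            off += length
            continue
        # Need room for at least a dirent with a single-byte name.
        if remaining < entry_size(1):
            raise ValueError(f"offset {off}: no room for a dirent")
        namelen = block[off + 8]
        size = entry_size(namelen)
        if namelen == 0 or size > remaining:
            raise ValueError(f"offset {off}: bad name length {namelen}")
        found.append(("entry", off, size))
        off += size
    return found
```

In this model, shrinking an unused length by one byte is rejected by the
alignment check, and a stray tail shorter than the fixed header is
rejected before any field is read from it.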

Signed-off-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months ago xfs: avoid redundant AGFL buffer invalidation
Gao Xiang [Tue, 28 May 2024 04:12:39 +0000 (12:12 +0800)]
xfs: avoid redundant AGFL buffer invalidation

Source kernel commit: d40c2865bdbbbba6418436b0a877daebe1d7c63e

Currently AGFL blocks can be filled from the following three sources:
 - allocbt free blocks, as in xfs_allocbt_free_block();
 - rmapbt free blocks, as in xfs_rmapbt_free_block();
 - refilled from freespace btrees, as in xfs_alloc_fix_freelist().

Originally, allocbt free blocks would be marked as stale only when they
were put back in the general free space pool.  As Dave mentioned on IRC,
"we don't stale AGF metadata btree blocks when they are returned to the
AGFL .. but once they get put back in the general free space pool, we
have to make sure the buffers are marked stale as the next user of
those blocks might be user data...."

However, after commit ca250b1b3d71 ("xfs: invalidate allocbt blocks
moved to the free list") and commit edfd9dd54921 ("xfs: move buffer
invalidation to xfs_btree_free_block"), even allocbt / bmapbt free
blocks will be invalidated immediately, since they may fail to pass
V5 format validation on writeback even though writeback to free space
would be safe.

IOWs, IMHO there is currently no difference between free blocks in the
AGFL freespace pool and those in the general free space pool.  So let's
avoid the redundant AGFL buffer invalidation, since otherwise we face an
unnecessary xfs_log_force() due to xfs_trans_binval() being called again
on buffers that were already marked stale, as shown below:

[  333.507469] Call Trace:
[  333.507862]  xfs_buf_find+0x371/0x6a0       <- xfs_buf_lock
[  333.508451]  xfs_buf_get_map+0x3f/0x230
[  333.509062]  xfs_trans_get_buf_map+0x11a/0x280
[  333.509751]  xfs_free_agfl_block+0xa1/0xd0
[  333.510403]  xfs_agfl_free_finish_item+0x16e/0x1d0
[  333.511157]  xfs_defer_finish_noroll+0x1ef/0x5c0
[  333.511871]  xfs_defer_finish+0xc/0xa0
[  333.512471]  xfs_itruncate_extents_flags+0x18a/0x5e0
[  333.513253]  xfs_inactive_truncate+0xb8/0x130
[  333.513930]  xfs_inactive+0x223/0x270

xfs_log_force() will take tens of milliseconds with the AGF buffer
locked.  This becomes an unnecessarily long latency, especially on our
PMEM devices with FSDAX enabled, where fsops like
xfs_reflink_find_shared() running at the same time are stuck on the same
AGF lock.  Removing the double invalidation on the AGFL blocks does not
make this issue go away, but this patch fixes it for our workloads in
practice, and by code analysis it should also work.

Note that I'm not sure whether I need to remove another redundant one in
xfs_alloc_ag_vextent_small(), since it's unrelated to our workloads.
fstests also pass with this patch.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
12 months ago debian: create a new package for automatic self-healing
Darrick J. Wong [Thu, 11 Jul 2024 22:59:42 +0000 (15:59 -0700)]
debian: create a new package for automatic self-healing

Create a new package for people who explicitly want self-healing turned
on by default for XFS.  This package is named xfsprogs-self-healing.

Note: This introduces a new "install-selfheal" target to install only
the files needed for enabling online fsck by default.  Other
distributions should take note of the new target if they choose to
create a package for enabling autonomous self healing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_scrub: use the self_healing fsproperty to select mode
Darrick J. Wong [Mon, 29 Jul 2024 18:02:51 +0000 (11:02 -0700)]
xfs_scrub: use the self_healing fsproperty to select mode

Now that we can set properties on xfs filesystems, make the xfs_scrub
background service query the self_healing property to figure out which
mode (dry run, optimize, repair, none) it should use.
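
A minimal Python sketch of selecting a mode from a property value.  The
four modes come from the description above; the property value strings,
the fallback default, and the function name are assumptions, since the
accepted values are not spelled out here.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    NONE = "none"
    DRY_RUN = "dry run"
    OPTIMIZE = "optimize"
    REPAIR = "repair"


# Hypothetical property values; the real strings may differ.
PROPERTY_TO_MODE = {
    "none": Mode.NONE,
    "check": Mode.DRY_RUN,
    "optimize": Mode.OPTIMIZE,
    "repair": Mode.REPAIR,
}


def select_mode(value: Optional[str], default: Mode = Mode.DRY_RUN) -> Mode:
    """Map a self_healing property value to a scrub mode.

    An unset or unrecognized value falls back to `default`; that
    fallback policy is an assumption of this sketch.
    """
    if value is None:
        return default
    return PROPERTY_TO_MODE.get(value.strip().lower(), default)
```

A background service built this way never has to guess at the
sysadmin's intent: it reads one property and acts on the mapped mode.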

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago misc: shift install targets
Darrick J. Wong [Wed, 3 Jul 2024 21:25:59 +0000 (14:25 -0700)]
misc: shift install targets

Modify each Makefile so that "install-pkg" installs the main package
contents, and "install" just invokes "install-pkg".  We'll need this
indirection for the next patch where we add an install-selfheal target
to build the xfsprogs-self-healing package but will still want 'make
install' to install everything on a developer's workstation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago mkfs: set self_healing property
Darrick J. Wong [Fri, 26 Jul 2024 18:31:47 +0000 (11:31 -0700)]
mkfs: set self_healing property

Add a new mkfs option so that sysadmins can control the background
scrubbing behavior of filesystems from the start.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_scrub: allow sysadmin to control background scrubs
Darrick J. Wong [Fri, 26 Jul 2024 05:47:58 +0000 (22:47 -0700)]
xfs_scrub: allow sysadmin to control background scrubs

Define a "self_healing" filesystem property so that sysadmins can
indicate their preferences for background online fsck.  Add an extended
option to xfs_scrub so that it selects the operation mode from the self
healing fs property.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago libfrog: define a self_healing filesystem property
Darrick J. Wong [Fri, 26 Jul 2024 20:32:43 +0000 (13:32 -0700)]
libfrog: define a self_healing filesystem property

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_property: add a new tool to administer fs properties
Darrick J. Wong [Fri, 26 Jul 2024 22:09:28 +0000 (15:09 -0700)]
xfs_property: add a new tool to administer fs properties

Create a tool to list, get, set, and remove filesystem properties.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_db: add a command to list xattrs
Darrick J. Wong [Fri, 26 Jul 2024 21:32:56 +0000 (14:32 -0700)]
xfs_db: add a command to list xattrs

Add a command to list extended attributes from xfs_db.  We'll need this
later to manage the fs properties when unmounted.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago libxfs: pass a transaction context through listxattr
Darrick J. Wong [Fri, 26 Jul 2024 21:15:49 +0000 (14:15 -0700)]
libxfs: pass a transaction context through listxattr

Pass a transaction context so that a new caller can walk the attr names
and query the values all in one go without deadlocking on nested buffer
access.

While we're at it, make the existing xfs_repair callers try to use
empty transactions so that we don't deadlock on cycles in the xattr
structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago libxfs: hoist listxattr from xfs_repair
Darrick J. Wong [Fri, 26 Jul 2024 21:08:19 +0000 (14:08 -0700)]
libxfs: hoist listxattr from xfs_repair

Hoist the listxattr code from xfs_repair so that we can use it in
xfs_db.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_db: improve getting and setting extended attributes
Darrick J. Wong [Sat, 27 Jul 2024 00:38:11 +0000 (17:38 -0700)]
xfs_db: improve getting and setting extended attributes

Add an attr_get command to retrieve the value of an xattr from a file;
and extend the attr_set command to allow passing of string values.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_spaceman: edit filesystem properties
Darrick J. Wong [Thu, 25 Jul 2024 18:46:48 +0000 (11:46 -0700)]
xfs_spaceman: edit filesystem properties

Add some new subcommands to xfs_spaceman so that we can examine
filesystem properties.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago libfrog: support editing filesystem property sets
Darrick J. Wong [Fri, 26 Jul 2024 17:37:30 +0000 (10:37 -0700)]
libfrog: support editing filesystem property sets

Add some library functions so that spaceman and scrub can share the same
code to edit and retrieve filesystem properties.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
12 months ago xfs_repair: allow symlinks with short remote targets
Darrick J. Wong [Wed, 3 Jul 2024 21:21:31 +0000 (14:21 -0700)]
xfs_repair: allow symlinks with short remote targets

Symbolic links can have extended attributes.  If the attr fork consumes
enough space in the inode record, a shortform symlink can become a
remote symlink.  However, if we delete those extended attributes, the
target is not moved back into the inode core.

IOWs, we can end up with a symlink inode that looks like this:

core.magic = 0x494e
core.mode = 0120777
core.version = 3
core.format = 2 (extents)
core.nlinkv2 = 1
core.nextents = 1
core.size = 297
core.nblocks = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,12,1,0]

This is a symbolic link with a 297-byte target stored in a disk block,
which is to say this is a symlink with a remote target.  The forkoff is
0, which is to say that there are 512 - 176 == 336 bytes in the inode core
to store the data fork.
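
The arithmetic can be checked with a few lines of Python.  The inode
size and core size are the numbers from the example above; the helper
names are made up for this sketch, and treating forkoff as a count of
8-byte units reflects how the attr fork offset is encoded.

```python
INODE_SIZE = 512        # bytes, from the example above
INODE_CORE_SIZE = 176   # bytes, from the example above


def data_fork_space(forkoff: int) -> int:
    """Bytes available to the data fork inside the inode.

    forkoff counts 8-byte units; zero means there is no attr fork, so
    the data fork gets everything after the inode core.
    """
    if forkoff == 0:
        return INODE_SIZE - INODE_CORE_SIZE
    return forkoff * 8


def target_fits_inline(target_len: int, forkoff: int) -> bool:
    """Would a symlink target of this length fit in the inode?"""
    return target_len <= data_fork_space(forkoff)
```

For the inode shown, the 297-byte target would fit in the 336 bytes
available once the attr fork is gone, yet it still lives in a remote
block.  With an attr fork present (a nonzero forkoff), the same target
may not fit, which is how the symlink became remote in the first place.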

Prior to kernel commit 1eb70f54c445f, the kernel was ok with this
arrangement, but the change to symlink validation in that patch now
produces corruption errors on filesystems written by older kernels that
are not otherwise inconsistent.  Those changes were inspired by reports
of illegal memory accesses, which I think were a result of making data
fork access decisions based on symlink di_size and not on di_format.

Unfortunately, for a very long time xfs_repair has flagged these inodes
as being corrupt, even though the kernel has historically been willing
to read and write symlinks with these properties.  Resolve the conflict
by adjusting the xfs_repair corruption tests to allow extents format.
This change matches the kernel patch "xfs: allow symlinks with short
remote targets".

While we're at it, fix a lurking bad symlink fork access.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: try spot repairs of metadata items to make scrub progress
Darrick J. Wong [Wed, 3 Jul 2024 21:21:31 +0000 (14:21 -0700)]
xfs_scrub: try spot repairs of metadata items to make scrub progress

Now that we've enabled scrub dependency barriers, it's possible that a
scrub_item_check call will return with some of the scrub items still in
NEEDSCHECK state.  If, for example, scrub type B depends on scrub type
A being clean and A is not clean, B will still be in NEEDSCHECK state.

In order to make as much scanning progress as possible during phase 2
and phase 3, allow ourselves to try some spot repairs in the hopes that
it will enable us to make progress towards at least scanning the whole
metadata item.  If we can't make any forward progress, we'll queue the
scrub item for repair in phase 4, which means that anything still in
NEEDSCHECK state becomes CORRUPT state.  (At worst, the NEEDSCHECK item
will actually be clean by phase 4, and xfs_scrub will report that it
didn't need any work after all.)
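
A simplified Python model of the flow described above: check items in
dependency order, attempt spot repairs while they unblock further
checks, then hand anything unresolved to the later repair phase.  The
state names follow the text; the function names, the callback
signatures, and the policy of treating a successful spot repair as
clean are assumptions of this sketch, not xfs_scrub's implementation.

```python
NEEDSCHECK, CLEAN, CORRUPT = "NEEDSCHECK", "CLEAN", "CORRUPT"


def check_items(order, deps, scrub_fn, state):
    """Scrub every NEEDSCHECK item whose dependencies are all clean."""
    for t in order:
        if state[t] != NEEDSCHECK:
            continue
        if any(state[d] != CLEAN for d in deps.get(t, ())):
            continue  # blocked: stays in NEEDSCHECK
        state[t] = CLEAN if scrub_fn(t) else CORRUPT


def scan_with_spot_repairs(order, deps, scrub_fn, repair_fn):
    """Return the final state of each item after checks and spot repairs."""
    state = {t: NEEDSCHECK for t in order}
    check_items(order, deps, scrub_fn, state)
    progress = True
    while progress:
        progress = False
        for t in order:
            # repair_fn returns True when the spot repair succeeded.
            if state[t] == CORRUPT and repair_fn(t):
                state[t] = CLEAN
                progress = True
        if progress:
            # A repair may have unblocked dependent scrub types.
            check_items(order, deps, scrub_fn, state)
    # Anything never checked is queued for the repair phase as corrupt.
    for t in order:
        if state[t] == NEEDSCHECK:
            state[t] = CORRUPT
    return state
```

If type B depends on type A and A cannot be repaired on the spot, B is
never scrubbed here and ends up queued as corrupt, which matches the
worst case noted above where the item may turn out to be clean later.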

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: use scrub barriers to reduce kernel calls
Darrick J. Wong [Wed, 3 Jul 2024 21:21:31 +0000 (14:21 -0700)]
xfs_scrub: use scrub barriers to reduce kernel calls

Use scrub barriers so that we can submit a single scrub request for a
bunch of things, and have the kernel stop midway through if it finds
anything broken.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: vectorize repair calls
Darrick J. Wong [Wed, 3 Jul 2024 21:21:30 +0000 (14:21 -0700)]
xfs_scrub: vectorize repair calls

Use the new vectorized scrub kernel calls to reduce the overhead of
performing repairs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: vectorize scrub calls
Darrick J. Wong [Wed, 3 Jul 2024 21:21:30 +0000 (14:21 -0700)]
xfs_scrub: vectorize scrub calls

Use the new vectorized kernel scrub calls to reduce the overhead of
checking metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: convert scrub and repair epilogues to use xfs_scrub_vec
Darrick J. Wong [Wed, 3 Jul 2024 21:21:30 +0000 (14:21 -0700)]
xfs_scrub: convert scrub and repair epilogues to use xfs_scrub_vec

Convert the scrub and repair epilogue code to pass around xfs_scrub_vecs
as we prepare for vectorized operation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: split the repair epilogue code into a separate function
Darrick J. Wong [Wed, 3 Jul 2024 21:21:30 +0000 (14:21 -0700)]
xfs_scrub: split the repair epilogue code into a separate function

Move all the code that updates the internal state in response to a
repair ioctl() call completion into a separate function.  This will help
with vectorizing repair calls later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: split the scrub epilogue code into a separate function
Darrick J. Wong [Wed, 3 Jul 2024 21:21:29 +0000 (14:21 -0700)]
xfs_scrub: split the scrub epilogue code into a separate function

Move all the code that updates the internal state in response to a scrub
ioctl() call completion into a separate function.  This will help with
vectorizing scrub calls later on.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_io: support vectored scrub
Darrick J. Wong [Wed, 3 Jul 2024 21:21:29 +0000 (14:21 -0700)]
xfs_io: support vectored scrub

Create a new scrubv command to xfs_io to support the vectored scrub
ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago libfrog: support vectored scrub
Darrick J. Wong [Wed, 3 Jul 2024 21:21:29 +0000 (14:21 -0700)]
libfrog: support vectored scrub

Enhance libfrog to support performing vectored metadata scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago man: document vectored scrub mode
Darrick J. Wong [Wed, 3 Jul 2024 21:21:29 +0000 (14:21 -0700)]
man: document vectored scrub mode

Add a manpage to document XFS_IOC_SCRUBV_METADATA.  From the kernel
patch:

Introduce a variant on XFS_SCRUB_METADATA that allows for a vectored
mode.  The caller specifies the principal metadata object that they want
to scrub (allocation group, inode, etc.) once, followed by an array of
scrub types they want called on that object.  The kernel runs the scrub
operations and writes the output flags and errno code to the
corresponding array element.

A new pseudo scrub type BARRIER is introduced to force the kernel to
return to userspace if any corruptions have been found when scrubbing
the previous scrub types in the array.  This enables userspace to
schedule, for example, the sequence:

 1. data fork
 2. barrier
 3. directory

If the data fork scrub is clean, then the kernel will perform the
directory scrub.  If not, the barrier in 2 will exit back to userspace.
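
The barrier behaviour can be sketched in Python as a loop over the
vector array.  ScrubVec, its fields, and the use of ECANCELED to mark
the barrier that stopped the run are assumptions for illustration; the
actual ioctl structure and return codes are documented in the manpage.

```python
import errno
from dataclasses import dataclass

BARRIER = "barrier"  # stands in for the pseudo scrub type


@dataclass
class ScrubVec:
    """Hypothetical per-type slot: input type, output flags and errno."""
    type: str
    corrupt: bool = False   # stands in for the output flags
    ret: int = 0
    ran: bool = False


def scrubv(vecs, scrub_fn):
    """Run each scrub type in order; stop at a barrier if anything before it failed.

    scrub_fn(type) returns True when it finds corruption.
    """
    for i, v in enumerate(vecs):
        if v.type == BARRIER:
            if any(p.corrupt or p.ret for p in vecs[:i]):
                v.ret = -errno.ECANCELED
                break  # return to the caller without running the rest
            continue
        v.ran = True
        v.corrupt = scrub_fn(v.type)
    return vecs
```

For the sequence data fork, barrier, directory: a corrupt data fork
leaves the directory slot untouched so the caller can repair first,
while a clean data fork lets the directory scrub run in the same call.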

The alternative would have been an interface where userspace passes a
pointer to an empty buffer, and the kernel formats that with
xfs_scrub_vecs that tell userspace what it scrubbed and what the outcome
was.  With that the kernel would have to communicate that the buffer
needed to have been at least X size, even though for our cases
XFS_SCRUB_TYPE_NR + 2 would always be enough.

Compared to that, this design keeps all the dependency policy and
ordering logic in userspace where it already resides instead of
duplicating it in the kernel. The downside of that is that it needs the
barrier logic.

When running fstests in "rebuild all metadata after each test" mode, I
observed a 10% reduction in runtime due to fewer transitions across the
system call boundary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: defer phase5 file scans if dirloop fails
Darrick J. Wong [Wed, 3 Jul 2024 21:21:29 +0000 (14:21 -0700)]
xfs_scrub: defer phase5 file scans if dirloop fails

If we cannot fix dirloop problems during the initial phase 5 inode scan,
defer them until later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: detect and repair directory tree corruptions
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
xfs_scrub: detect and repair directory tree corruptions

Now that we have online fsck for directory tree structure problems, we
need to find a place to call it.  The scanner requires that parent
pointers are enabled, that directory link counts are correct, and that
every directory entry has a corresponding parent pointer.  Therefore, we
can only run it after phase 4 fixes every file, and phase 5 resets the
link counts.

In other words, we call it as part of the phase 5 file scan that we do
to warn about weird looking file names.  This has the added benefit that
opening the directory by handle is less likely to fail if there are
loops in the directory structure.  For now, only plumb in enough to try
to fix directory tree problems right away; the next patch will make
phase 5 retry the dirloop scanner until the problems are fixed or we
stop making forward progress.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_scrub: fix erroring out of check_inode_names
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
xfs_scrub: fix erroring out of check_inode_names

The early exit logic in this function is a bit suboptimal -- we don't
need to close the @fd if we haven't even opened it, and since all errors
are fatal, we don't need to bump the progress counter.  The logic in
this function is about to get more involved due to the addition of the
directory tree structure checker, so clean up these warts.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_spaceman: report directory tree corruption in the health information
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
xfs_spaceman: report directory tree corruption in the health information

Report directories that are the source of corruption in the directory
tree.  While we're at it, add the documentation updates for the new
reporting flags and scrub type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago libfrog: add directory tree structure scrubber to scrub library
Darrick J. Wong [Wed, 3 Jul 2024 21:21:28 +0000 (14:21 -0700)]
libfrog: add directory tree structure scrubber to scrub library

Make it so that scrub clients can detect corruptions within the
directory tree structure itself.  Update the documentation for the scrub
ioctl to mention this new functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: wipe ondisk parent pointers when there are none
Darrick J. Wong [Wed, 3 Jul 2024 21:21:27 +0000 (14:21 -0700)]
xfs_repair: wipe ondisk parent pointers when there are none

Erase all the parent pointers when there aren't any found by the
directory entry scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: update ondisk parent pointer records
Darrick J. Wong [Wed, 3 Jul 2024 21:21:27 +0000 (14:21 -0700)]
xfs_repair: update ondisk parent pointer records

Update the ondisk parent pointer records as necessary.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: dump garbage parent pointer attributes
Darrick J. Wong [Wed, 3 Jul 2024 21:21:27 +0000 (14:21 -0700)]
xfs_repair: dump garbage parent pointer attributes

Delete xattrs that have ATTR_PARENT set but are so garbage that they
clearly aren't parent pointers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: check parent pointers
Darrick J. Wong [Wed, 3 Jul 2024 21:21:27 +0000 (14:21 -0700)]
xfs_repair: check parent pointers

Use the parent pointer index that we constructed in the previous patch
to check that each file's parent pointer records exactly match the
directory entries that we recorded while walking directory entries.
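
Conceptually, the check is a comparison of two record sets for each
file.  The Python sketch below assumes each record is a
(parent inode number, name) tuple and uses a hypothetical function
name; xfs_repair's actual record format and matching code differ.

```python
def check_parent_pointers(expected, found):
    """Compare the pptrs implied by dirents with those stored on the file.

    Returns (missing, extra): records the directory scan expects but the
    file lacks, and records the file has that no dirent accounts for.
    """
    exp, fnd = set(expected), set(found)
    missing = sorted(exp - fnd)
    extra = sorted(fnd - exp)
    return missing, extra
```

A file passes only when both lists are empty; missing records would
need to be added and extra ones removed for an exact match.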

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: deduplicate strings stored in string blob
Darrick J. Wong [Wed, 3 Jul 2024 21:21:27 +0000 (14:21 -0700)]
xfs_repair: deduplicate strings stored in string blob

Reduce the memory requirements of the string blob structure by
deduplicating the strings stored within.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: move the global dirent name store to a separate object
Darrick J. Wong [Wed, 3 Jul 2024 21:21:26 +0000 (14:21 -0700)]
xfs_repair: move the global dirent name store to a separate object

Abstract the main parent pointer dirent names xfblob object into a
separate data structure to hide implementation details.

The goals here are (a) reduce memory usage when we can by deduplicating
dirent names that exist in multiple directories; and (b) provide a
unique id for each name in the system so that sorting incore parent
pointer records can be done in a stable manner.  Fast stable sorting of
records is required for the dirent <-> pptr matching algorithm.
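
Both goals can be illustrated with a small interning table in Python.
The NameStore class and its methods are hypothetical and keep
everything in memory; they only show how one stored copy per distinct
name also yields a unique id that can serve as a stable sort key.

```python
class NameStore:
    """Hypothetical deduplicating store: one copy and one id per name."""

    def __init__(self):
        self._ids = {}     # name -> id
        self._names = []   # id -> name

    def store(self, name: bytes) -> int:
        """Return the id for `name`, adding it only if it is new."""
        nid = self._ids.get(name)
        if nid is None:
            nid = len(self._names)
            self._names.append(name)
            self._ids[name] = nid
        return nid

    def load(self, nid: int) -> bytes:
        """Return the name stored under `nid`."""
        return self._names[nid]

    def __len__(self) -> int:
        return len(self._names)
```

Records of the form (child inode, parent inode, name id) can then be
sorted as plain integer tuples, so identical names always compare equal
without comparing the strings themselves.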

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: build a parent pointer index
Darrick J. Wong [Wed, 3 Jul 2024 21:21:26 +0000 (14:21 -0700)]
xfs_repair: build a parent pointer index

When we're walking directories during phase 6, build an index of parent
pointers that we expect to find.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: junk duplicate hashtab entries when processing sf dirents
Darrick J. Wong [Wed, 3 Jul 2024 21:21:26 +0000 (14:21 -0700)]
xfs_repair: junk duplicate hashtab entries when processing sf dirents

dir_hash_add() adds the passed-in dirent to the directory hashtab even
if there's already a duplicate.  Therefore, if we detect a duplicate or
a garbage entry while processing a shortform directory's entries, we
need to junk the newly added entry, just like we do when processing
directory data blocks.

This will become particularly relevant in the next patch, where we
generate a master index of parent pointers from the non-junked hashtab
entries of each directory that phase6 scans.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
12 months ago xfs_repair: add parent pointers when messing with /lost+found
Darrick J. Wong [Wed, 3 Jul 2024 21:21:26 +0000 (14:21 -0700)]
xfs_repair: add parent pointers when messing with /lost+found

Make sure that /lost+found gets created with parent pointers, and
that lost children being put in there get new parent pointers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>