www.infradead.org Git - users/hch/xfsprogs.git/log
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: report shutdown events through healthmon

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: report metadata health events through healthmon

Set up a metadata health event hook so that we can send events to
userspace as we collect information.  The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: create event queuing, formatting, and discovery infrastructure

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.
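
As a rough illustration of the three pieces listed above (a compact event record, a way to notice dropped events, and a formatter), here is a toy userspace model in Python.  It is not the kernel implementation; the class name, the queue limit, and every field name are hypothetical.

```python
import json
from collections import deque


class EventQueue:
    """Toy bounded queue that counts the events it had to drop."""

    def __init__(self, max_events):
        self.max_events = max_events
        self.events = deque()
        self.lost = 0  # events dropped since the last read

    def push(self, kind, **fields):
        """Queue one event; return False if it had to be dropped."""
        if len(self.events) >= self.max_events:
            self.lost += 1  # queue full: remember that we lost one
            return False
        self.events.append({"type": kind, **fields})
        return True

    def pop_formatted(self):
        """Return the next event as a JSON string, reporting losses first."""
        if self.lost:
            lost, self.lost = self.lost, 0
            return json.dumps({"type": "lost", "count": lost})
        if not self.events:
            return None
        return json.dumps(self.events.popleft())
```

Reporting the loss before the surviving events lets a reader know its picture of the filesystem is incomplete before it acts on anything else.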

Here, we've chosen json to export information to userspace.  The
structured key-value nature of json gives us enormous flexibility to
modify the schema of what we'll send to userspace because we can add new
keys at any time.  Userspace can use whatever json parsers are available
to consume the events and will not be confused by keys they don't
recognize.
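
A minimal sketch of what "not confused by keys they don't recognize" could look like on the consumer side.  The key names below are made up for illustration and are not the actual event schema.

```python
import json

# Hypothetical keys that this particular reader understands.
KNOWN_KEYS = ("type", "domain")


def parse_event(line):
    """Parse one JSON event, keeping only the keys this reader knows."""
    event = json.loads(line)
    return {key: event[key] for key in KNOWN_KEYS if key in event}


# An event from an older kernel and one from a newer kernel that grew a key.
old = parse_event('{"type": "shutdown", "domain": "fs"}')
new = parse_event('{"type": "shutdown", "domain": "fs", "added_later": 42}')
```

Both calls yield the same dictionary, so a reader written against the older schema keeps working when new keys appear.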

Note that we do NOT allow sending json back to the kernel, nor is there
any intent to do that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: create a special file to pass filesystem health to userspace

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
xfs: create hooks for monitoring health updates

Create hooks for monitoring health events.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Dave Chinner [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
spaceman: move inodes with hardlinks

When an inode to be moved to a different AG has multiple hard links,
we need to "move" all the hard links, too. To do this, we need to
create temporary hardlinks to the new file, and then use rename
exchange to swap all the hardlinks that point to the old inode
with new hardlinks that point to the new inode.

We already know that an inode has hard links via the path discovery,
and we can check it against the link count that is reported for the
inode before we start building the link farm.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Dave Chinner [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
spaceman: relocate the contents of an AG

Shrinking a filesystem needs to first remove all the active user
data and metadata from the AGs that are going to be lopped off the
filesystem. Before we can do this, we have to relocate this
information to a region of the filesystem that is going to be
retained.

We have a function to move an inode and all its related
information to a specific AG, we have functions to find the
owners of all the information in an AG, and we can find their paths.
This gives us all the information we need to relocate all the
objects in an AG we are going to remove via shrinking.

First, we scan the AG to be emptied to find the inodes that need to
be relocated, then we scan the directory structure to find all the
paths to those inodes that need to be moved. Then we iterate over
all the inodes to be moved, attempting to move them to the lowest
numbered AGs.

When the destination AG fills up, we'll get ENOSPC from
the moving code and this is a trigger to bump the destination AG and
retry the move. If we haven't moved all the inodes and their data by
the time the destination reaches the source AG, then the entire
operation will fail with ENOSPC - there is not enough room in the
filesystem to empty the selected AG in preparation for a shrink.
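
The retry policy described above can be modelled in a few lines of Python.  This is a sketch of the control flow only, not the spaceman code: relocate() and the move_inode callback are hypothetical names, and the callback is assumed to raise OSError with ENOSPC when the destination AG is full.

```python
import errno


def relocate(inodes, source_ag, move_inode):
    """Move every inode out of source_ag, filling the lowest AGs first.

    move_inode(ino, dest_ag) is a caller-supplied (hypothetical) callback
    that raises OSError(ENOSPC) when dest_ag has no room left.
    """
    dest_ag = 0
    for ino in inodes:
        while True:
            if dest_ag >= source_ag:
                # The destination caught up with the source: nowhere to go.
                raise OSError(errno.ENOSPC, "not enough room to empty the AG")
            try:
                move_inode(ino, dest_ag)
                break
            except OSError as err:
                if err.errno != errno.ENOSPC:
                    raise
                dest_ag += 1  # this AG is full; bump it and retry the move
```

Note that the destination AG never moves backwards, so space freed in a lower AG during the run is not reused by this simple policy.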

This, once again, is not intended as an optimal or even guaranteed
way of emptying an AG for shrink. It simply provides the basic
algorithm and mechanisms we need to perform a shrink operation.
Improvements and optimisations will come in time, but we can't get
to an optimal solution without first having basic functionality in
place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
xfs_spaceman: port relocation structure to 32-bit systems

We can't use the radix tree to store relocation information on 32-bit
systems because unsigned longs are not large enough to hold 64-bit
inodes.  Use an avl64 tree instead.
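
To see why a 32-bit index is not enough, here is a small Python model of storing an inode number in an unsigned long of a given width.  The helper name is hypothetical, and the sample inode number is just an arbitrary value above 2^32.

```python
def as_unsigned_long(value, bits):
    """Model storing value in a C unsigned long that is `bits` wide."""
    return value & ((1 << bits) - 1)


ino = 47244651475                   # a 64-bit inode number
key32 = as_unsigned_long(ino, 32)   # what a 32-bit tree index would keep
key64 = as_unsigned_long(ino, 64)   # what a 64-bit tree index would keep
```

On a 32-bit index the upper bits are silently dropped, so this inode and a different inode whose number equals the truncated value would land on the same key.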

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
xfs_spaceman: wrap radix tree accesses in find_owner.c

Wrap the raw radix tree accesses here so that we can provide an
alternate implementation on platforms where radix tree indices cannot
store a full 64-bit inode number.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Dave Chinner [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
spaceman: find owners of space in an AG

Before we can move inodes for a shrink operation, we have to find
all the inodes that own space in the AG(s) we want to empty.

This implementation uses FS_IOC_GETFSMAP on the assumption that
filesystems to be shrunk have reverse mapping enabled as it is the
only way to identify inode related metadata that userspace is unable
to see or influence (e.g. BMBT blocks) that may be located in the
specific AG. We can use GETFSMAP to identify both inodes to be moved
(via XFS_FMR_OWN_INODES records) and inodes with just data and/or
metadata to be moved.

Once we have identified all the inodes to be moved, we have to
map them to paths so that we can use renameat2() to exchange the
directory entries pointing at the moved inode atomically. We also
need to record inodes with hard links and all of the paths to the
inode so that hard links can be recreated appropriately.

This requires a directory tree walk to discover the paths (until
parent pointers are a thing). Hence for filesystems that aren't
reverse mapping enabled, we can eventually use this pass to discover
inodes with visible data and metadata that need to be moved.
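
The path discovery pass boils down to a tree walk that records every directory entry pointing at an inode of interest.  The Python sketch below shows the idea with plain os.walk() and lstat(); it is not how spaceman implements it, both function names are hypothetical, and it assumes the walk stays on a single filesystem.

```python
import os


def paths_by_inode(root, wanted):
    """Walk root and return {inode number: [every path linking to it]}."""
    found = {ino: [] for ino in wanted}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_ino in found:
                found[st.st_ino].append(path)
    return found


def all_links_found(paths):
    """For a regular file: did we find as many paths as its link count?"""
    return bool(paths) and os.lstat(paths[0]).st_nlink == len(paths)
```

Comparing the number of discovered paths against st_nlink is a cheap way to notice that a hard link lives outside the part of the tree that was walked.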

As we resolve the paths to the inodes to be moved, output the
information to stdout so that it can be acted upon by other
utilities. This results in a command that acts similarly to find but
with a physical location filter rather than an inode metadata
filter.

Again, this is not meant to be an optimal implementation. It
shouldn't suck, but there is plenty of scope for performance
optimisation, especially with a multithreaded and/or async directory
traversal/parent pointer path resolution process to hide access
latencies.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Dave Chinner [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
spaceman: physically move a regular inode

To be able to shrink a filesystem, we need to be able to physically
move an inode and all its data and metadata from its current
location to a new AG.  Add a command to spaceman to allow an inode
to be moved to a new AG.

This new command is not intended to be a perfect solution. I am not
trying to handle atomic movement of open files - this is intended to
be run as a maintenance operation on an idle filesystem. If root
filesystems are the target, then this should be run via a rescue
environment that is not executing directly on the root fs. With
those caveats in place, we can do the entire inode move as a set of
non-destructive operations finalised by an atomic inode swap
without needing any special kernel support.

To ensure we move metadata such as BMBT blocks even if we don't need
to move data, we clone the data to a new inode that we've allocated
in the destination AG. This will result in new bmbt blocks being
allocated in the new location even though the data is not copied.
Attributes need to be copied one at a time from the original inode.

If data needs to be moved, then we use fallocate(UNSHARE) to create
a private copy of the range of data that needs to be moved in the
new inode. This will be allocated in the destination AG by normal
allocation policy.

Once the new inode has been finalised, use RENAME_EXCHANGE to swap
it into place and unlink the original inode to free up all the
resources it still pins.

There are many optimisations still possible to speed this up, but
the goal here is "functional" rather than "optimal". Performance can
be optimised once all the parts for an "empty the tail of the
filesystem before shrink" operation are implemented and solidly
tested.

This functionality has been smoke tested by creating a 32MB data
file with 4k extents and several hundred attributes:

$ cat test.sh
fname=/mnt/scratch/foo
xfs_io -f -c "pwrite 0 32m" -c sync $fname
for (( i=0; i < 4096 ; i++ )); do
	xfs_io -c "fpunch $((i * 8))k 4k" $fname
done

for (( i=0; i < 100 ; i++ )); do
	setfattr -n user.blah.$i.$i.blah -v blah.$i.$i.blah $fname
	setfattr -n user.foo.$i.$i.foo -v $i.cantbele.$i.ve.$i.tsnotbutter $fname
done
for (( i=0; i < 100 ; i++ )); do
	setfattr -n security.baz.$i.$i.baz -v wotchul$i$iookinat $fname
done

xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
xfs_spaceman -c "move_inode -a 22" /mnt/scratch/foo
xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
$

and the output looks something like:

$ sudo ./test.sh
....
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 133
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          hole                                      8
   1: [8..15]:         208..215           0 (208..215)           8 000000
   2: [16..23]:        hole                                      8
   3: [24..31]:        224..231           0 (224..231)           8 000000
....
8189: [65512..65519]:  65712..65719       0 (65712..65719)       8 000000
8190: [65520..65527]:  hole                                      8
8191: [65528..65535]:  65728..65735       0 (65728..65735)       8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          392..399           0 (392..399)           8 000000
   1: [8..15]:         408..415           0 (408..415)           8 000000
   2: [16..23]:        424..431           0 (424..431)           8 000000
   3: [24..31]:        456..463           0 (456..463)           8 000000
move mnt /mnt/scratch, path /mnt/scratch/foo, agno 22
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 47244651475
....
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE               AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          hole                                              8
   1: [8..15]:         47244763192..47244763199  22 (123112..123119)     8 000000
   2: [16..23]:        hole                                              8
   3: [24..31]:        47244763208..47244763215  22 (123128..123135)     8 000000
....
8189: [65512..65519]:  47244828808..47244828815  22 (188728..188735)     8 000000
8190: [65520..65527]:  hole                                              8
8191: [65528..65535]:  47244828824..47244828831  22 (188744..188751)     8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE               AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          47244763176..47244763183  22 (123096..123103)     8 000000
$
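
The numbers in this output can be sanity-checked with a little arithmetic.  XFS inode numbers carry the AG number in their upper bits; the width of the lower part depends on the filesystem geometry, and the 31 bits used below are an assumption inferred from this particular sample, not a constant.  The helper name is hypothetical.

```python
AGINO_BITS = 31  # assumption: inferred from this sample, varies with geometry


def ino_to_agno(ino, agino_bits=AGINO_BITS):
    """Extract the AG number from the high bits of an inode number."""
    return ino >> agino_bits


old_ino, new_ino = 133, 47244651475

# One extent from the post-move bmap output: its absolute start block and
# its start block relative to the AG that holds it.
block, ag_offset, agno = 47244763192, 123112, 22
ag_size = (block - ag_offset) // agno  # implied AG size in bmap units
```

With these assumptions the inode moved from AG 0 to AG 22, and the extents listed for AG 22 agree with one another on where that AG starts.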

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
xfs_spaceman: implement clearing free space

First attempt at evacuating all the used blocks from part of a
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
xfs_db: get and put blocks on the AGFL

Add a new xfs_db command to let people add and remove blocks from an
AGFL.  This isn't really related to rmap btree reconstruction, other
than enabling debugging code to mess around with the AGFL to exercise
various odd scenarios.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs_io: support using XFS_IOC_MAP_FREESP to map free space

Add a command to call XFS_IOC_MAP_FREESP.  This is experimental code to
see if we can build a free space defragmenter out of this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs: add an ioctl to map free space into a file

Add a new ioctl to map free physical space into a file, at the same file
offset as if the file were a sparse image of the physical device backing
the filesystem.  The intent here is to use this to prototype a free
space defragmentation tool.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs_io: dump reference count information

Dump refcount info from the kernel so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs: export reference count information to userspace

Export refcount info to userspace so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs_io: enhance the aginfo command to control the noalloc flag

Augment the aginfo command to be able to set and clear the noalloc
state for an AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: apply noalloc mode to inode allocations too

Don't allow inode allocations from this group if it's marked noalloc.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: enable userspace to hide an AG from allocation

Add an administrative interface so that userspace can hide an allocation
group from block allocation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: create a noalloc mode for allocation groups

Create a new noalloc state for the per-AG structure that will disable
block allocation in this AG.  We accomplish this by subtracting from
fdblocks all the free blocks in this AG, hiding those blocks from the
allocator, and preventing freed blocks from updating fdblocks until
we're ready to lift noalloc mode.

Note that we reduce the free block count of the filesystem so that we
can prevent transactions from entering the allocator looking for "free"
space that we've turned off incore.
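
The accounting can be pictured with a toy model in Python.  This only mirrors the description above (subtract the AG's free blocks from the global counter, keep freed blocks out of the counter while noalloc is set, add everything back when it is lifted); the class and method names are hypothetical and there is no locking or per-transaction reservation here.

```python
class FreeSpace:
    """Toy model: a global free-block counter plus per-AG free counts."""

    def __init__(self, ag_free):
        self.ag_free = dict(ag_free)            # agno -> free blocks
        self.noalloc = set()                    # AGs hidden from allocation
        self.fdblocks = sum(ag_free.values())   # filesystem-wide free count

    def set_noalloc(self, agno):
        if agno in self.noalloc:
            return
        self.noalloc.add(agno)
        self.fdblocks -= self.ag_free[agno]     # hide this AG's free blocks

    def clear_noalloc(self, agno):
        if agno not in self.noalloc:
            return
        self.noalloc.discard(agno)
        self.fdblocks += self.ag_free[agno]     # includes blocks freed meanwhile

    def free_blocks(self, agno, count):
        self.ag_free[agno] += count
        if agno not in self.noalloc:
            self.fdblocks += count              # noalloc AGs leave fdblocks alone
```

Because the global counter never includes the hidden AG's blocks, a caller that checks the counter before allocating cannot be promised space that only exists in that AG.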

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: track deferred ops statistics

Track some basic statistics on how hard we're pushing the defer ops.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs_quota: report warning limits for realtime space quotas

Report the number of warnings that a user will get for exceeding the
soft limit of a realtime volume.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
mkfs: enable reflink with realtime extent sizes > 1

Allow creation of filesystems with reflink enabled and realtime extent
size larger than 1 block.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs: fix integer overflow when validating extent size hints

Both file extent size hints are stored as 32-bit quantities, in units of
filesystem blocks.  As part of validating the hints, we convert these
quantities to bytes to ensure that the hint is congruent with the file's
allocation size.

The maximum possible hint value is 2097151 (aka XFS_MAX_BMBT_EXTLEN).
If the file allocation unit is larger than 2048, the unit conversion
will exceed 32 bits in size, which overflows the uint32_t used to store
the value used in the comparison.  This isn't a problem for files on the
data device since the hint will always be a multiple of the block size.
However, this is a problem for realtime files because the rtextent size
can be any integer number of fs blocks, and truncation of upper bits
changes the outcome of division.

Eliminate the overflow by performing the congruency check in units of
blocks, not bytes.  Otherwise, we get errors like this:

$ truncate -s 500T /tmp/a
$ mkfs.xfs -f -N /tmp/a -d extszinherit=2097151,rtinherit=1 -r extsize=28k
illegal extent size hint 2097151, must be less than 2097151 and a multiple of 7.
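
The arithmetic behind that error message can be reproduced in Python by masking to 32 bits.  This models the calculation only, not the actual validation code, and it assumes 4096-byte filesystem blocks, which is consistent with a 28k rt extent being reported as "a multiple of 7".

```python
U32_MASK = 0xFFFFFFFF

hint_blocks = 2097151   # extszinherit from the mkfs command above
rtext_blocks = 7        # 28k rt extent, assuming 4096-byte blocks
block_size = 4096       # assumption, see the note above

# Congruency check in blocks: no unit conversion, so nothing can overflow.
aligned_in_blocks = hint_blocks % rtext_blocks == 0

# Congruency check in bytes, with the hint truncated as a uint32_t would be.
hint_bytes = hint_blocks * block_size
truncated = hint_bytes & U32_MASK
aligned_in_bytes = truncated % (rtext_blocks * block_size) == 0
```

The hint really is a multiple of 7 blocks, but once the byte value loses its upper bits the remainder is no longer zero, which is why the check has to be done in units of blocks.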

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs: enable extent size hints for CoW when rtextsize > 1

CoW extent size hints are not allowed on filesystems that have large
realtime extents because we only want to perform the minimum required
amount of write-around (aka write amplification) for shared extents.

On filesystems where rtextsize > 1, allocations can only be done in
units of full rt extents, which means that we can only map an entire rt
extent's worth of blocks into the data fork.  Hole punch requests become
conversions to unwritten if the request isn't aligned properly.

Because a copy-write fundamentally requires remapping, this means that
we also can only do copy-writes of a full rt extent.  This is too
expensive for large hint sizes, since it's all or nothing.
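
To make the "all or nothing" cost concrete, here is a small Python sketch that rounds a write out to whole rt extents.  The function name is hypothetical and all quantities are in filesystem blocks.

```python
def cow_range(offset, length, rtextsize):
    """Round a write (in fs blocks) out to whole rt extents.

    Returns (start, blocks_to_copy).
    """
    start = (offset // rtextsize) * rtextsize
    end = -(-(offset + length) // rtextsize) * rtextsize  # round up
    return start, end - start


# A one-block write into a shared extent on a filesystem with rtextsize = 7.
start, copied = cow_range(offset=10, length=1, rtextsize=7)
```

Here a single dirty block drags the whole 7-block rt extent along with it, and a write that straddles an extent boundary costs two full extents.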

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:49 +0000 (14:19 -0700)]
mkfs: enable reflink on the realtime device

Allow the creation of filesystems with both reflink and realtime volumes
enabled.  For now we don't support a realtime extent size > 1.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:49 +0000 (14:19 -0700)]
mkfs: validate CoW extent size hint when rtinherit is set

Extent size hints exist to nudge the behavior of the file data block
allocator towards trying to make aligned allocations.  Therefore, it
doesn't make sense to allow a hint that isn't a multiple of the
fundamental allocation unit for a given file.

This means that if the sysadmin is formatting with rtinherit set on the
root dir, validate_cowextsize_hint needs to check the hint value on a
simulated realtime file to make sure that it's correct.  This hasn't
been necessary in the past since one cannot have a CoW hint without a
reflink filesystem, and we previously didn't allow rt reflink
filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:48 +0000 (14:19 -0700)]
xfs_logprint: report realtime CUIs

Decode the CUI format just enough to report if a CUI targets the
realtime device or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:26 +0000 (14:19 -0700)]
xfs_repair: allow sysadmins to add realtime reflink

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the realtime reference count btree, and therefore reflink on
realtime volumes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: validate CoW extent size hint on rtinherit directories

XFS allows a sysadmin to change the rt extent size when adding a rt
section to a filesystem after formatting.  If there are any directories
with both a cowextsize hint and rtinherit set, the hint could become
misaligned with the new rextsize.  Offer to fix the problem if we're in
modify mode and the verifier didn't trip.  If we're in dry run mode,
we let the kernel fix it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: allow realtime files to have the reflink flag set

Now that we allow reflink on the realtime volume, allow that combination
of inode flags if the feature's enabled.  Note that we now allow inodes
to have rtinherit even if there's no realtime volume, since the kernel
has never restricted that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: rebuild the realtime refcount btree

Use the collected reference count information to rebuild the btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: reject unwritten shared extents

We don't allow sharing of unwritten extents, which means that repair
should reject an unwritten extent if someone else has already claimed
the space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: compute refcount data for the realtime groups

At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: find and mark the rtrefcountbt inode

Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_repair: use realtime refcount btree data to check block types

Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_repair: allow CoW staging extents in the realtime rmap records

Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports them.  As far as reporting
leftover staging extents, we'll report them when we scan the rt refcount
btree, in a future patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_spaceman: report health of the realtime refcount btree

Report the health of the realtime reference count btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_db: copy the realtime refcount btree

Copy the realtime refcountbt when we're metadumping the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:22 +0000 (14:19 -0700)]
xfs_db: support the realtime refcountbt

Wire up various parts of xfs_db for realtime refcount support.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:22 +0000 (14:19 -0700)]
xfs_db: display the realtime refcount btree contents

Implement all the code we need to dump rtrefcountbt contents, starting
from the root inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:22 +0000 (14:19 -0700)]
libfrog: enable scrubbing of the realtime refcount data

Add a new entry so that we can scrub the rtrefcountbt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:21 +0000 (14:19 -0700)]
fixup

11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:21 +0000 (14:19 -0700)]
xfs: scrub the metadir path of rt refcount btree files

Source kernel commit: 08745bdf226a413246fc4edb2947985804dbcb86

Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata
directory tree path to the refcount btree file for each rt group.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:21 +0000 (14:19 -0700)]
xfs: scrub the realtime refcount btree

Source kernel commit: 844d7f8755a67b01391da92b99a5342c8b2b83f4

Add code to scrub realtime refcount btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:11:05 +0000 (14:11 -0700)]
xfs: report realtime refcount btree corruption errors to the health system

Whenever we encounter corrupt realtime refcount btree blocks, we should
report that to the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: enable extent size hints for CoW operations

Wire up the copy-on-write extent size hint for realtime files, and
connect it to the rt allocator so that we avoid fragmentation on rt
filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint

The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint.  Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.

Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files

Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files.  This apparently was done to disable
delayed allocation for realtime files.

However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files.  In this case, the
logic will incorrectly send the write through the delalloc write path.

Fix this by adjusting the logic slightly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: recover CoW leftovers in the realtime volume

Scan the realtime refcount tree at mount time to get rid of leftover
CoW staging extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: allow inodes to have the realtime and reflink flags

Now that we can share blocks between realtime files, allow this
combination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: compute rtrmap btree max levels when reflink enabled

Compute the maximum possible height of the realtime rmap btree when
reflink is enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: update rmap to allow cow staging extents in the rt rmap

Don't error out on CoW staging extent records when realtime reflink is
enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: create routine to allocate and initialize a realtime refcount btree inode

Source kernel commit: 0066145ac851fd746ed22e523c3b60062e94c250

Create a library routine to allocate and initialize an empty realtime
refcountbt inode.  We'll use this for growfs, mkfs, and repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago xfs: wire up realtime refcount btree cursors
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: wire up realtime refcount btree cursors

Source kernel commit: fb0ac941a3e35fe16375f89d8d817e2790aeab35

Wire up realtime refcount btree cursors wherever they're needed
throughout the code base.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago xfs: wire up a new inode fork type for the realtime refcount
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: wire up a new inode fork type for the realtime refcount

Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: add metadata reservations for realtime refcount btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: add metadata reservations for realtime refcount btree

Reserve some free blocks so that we will always have enough free blocks
in the data volume to handle expansion of the realtime refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: add realtime refcount btree inode to metadata directory
Darrick J. Wong [Mon, 12 Aug 2024 21:10:50 +0000 (14:10 -0700)]
xfs: add realtime refcount btree inode to metadata directory

Add a metadir path to select the realtime refcount btree inode and load
it at mount time.  The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: add a realtime flag to the refcount update log redo items
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: add a realtime flag to the refcount update log redo items

Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt.  We'll
wire up the actual refcount code later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: prepare refcount functions to deal with rtrefcountbt
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: prepare refcount functions to deal with rtrefcountbt

Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions.  Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same
high-level code.

Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert it all at once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: add realtime refcount btree operations
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: add realtime refcount btree operations

Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are constrained neither
to the free space nor to any particular AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: realtime refcount btree transaction reservations
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: realtime refcount btree transaction reservations

Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents.  We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: define the on-disk realtime refcount btree format
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: define the on-disk realtime refcount btree format

Start filling out the rtrefcount btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate refcount btree blocks. This prepares the way for connecting
the btree operations implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: namespace the maximum length/refcount symbols
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: namespace the maximum length/refcount symbols

Actually namespace these variables properly, so that readers can tell
that this is an XFS symbol, and that it's for the refcount
functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: introduce realtime refcount btree definitions
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: introduce realtime refcount btree definitions

Add new realtime refcount btree definitions. The realtime refcount btree
will be rooted in a hidden inode, but has its own shape and therefore
needs mostly separate types of its own.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago mkfs: use file write helper to populate files
Darrick J. Wong [Mon, 12 Aug 2024 21:19:15 +0000 (14:19 -0700)]
mkfs: use file write helper to populate files

Use the file write helper to write files into the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago libxfs: resync libxfs_alloc_file_space interface with the kernel
Darrick J. Wong [Mon, 12 Aug 2024 21:19:15 +0000 (14:19 -0700)]
libxfs: resync libxfs_alloc_file_space interface with the kernel

Make the userspace xfs_alloc_file_space behave (more or less) like the
kernel version, at least as far as the interface goes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago mkfs: create the realtime rmap inode
Darrick J. Wong [Mon, 12 Aug 2024 21:19:15 +0000 (14:19 -0700)]
mkfs: create the realtime rmap inode

Create a realtime rmapbt inode if we format the fs with realtime
and rmap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_logprint: report realtime RUIs
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_logprint: report realtime RUIs

Decode the RUI format just enough to report if an RUI targets the
realtime device or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: allow sysadmins to add realtime reverse mapping indexes
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: allow sysadmins to add realtime reverse mapping indexes

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index for realtime volumes.  This
is needed for online fsck.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: reserve per-AG space while rebuilding rt metadata
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: reserve per-AG space while rebuilding rt metadata

Realtime metadata btrees can consume quite a bit of space on a full
filesystem.  Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata.  This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: rebuild the bmap btree for realtime files
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: rebuild the bmap btree for realtime files

Use the realtime rmap btree information to rebuild an inode's data fork
when appropriate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: check for global free space concerns with default btree slack levels
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check for global free space concerns with default btree slack levels

It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full.  If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space.  The
solution to this is to pack the new blocks completely full if we fear
running out of space.

Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.

Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
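
The global free-space decision described in the commit above can be sketched roughly as follows.  This is a simplified model under stated assumptions, not the xfs_repair code: it counts leaf blocks only, treats every btree alike, and all structure and function names are hypothetical.  It estimates how many blocks all btrees would need if rebuilt at the default 75% fill and reports whether that total exceeds the filesystem-wide free block count, in which case the rebuild would pack blocks completely full instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-btree estimate: record count and block capacity. */
struct sketch_btree_est {
	uint64_t	nr_records;
	unsigned int	recs_per_block;
};

/* Leaf blocks needed to hold nr_records at the given fill percentage. */
static uint64_t
sketch_blocks_needed(uint64_t nr_records, unsigned int recs_per_block,
		     unsigned int fill_pct)
{
	uint64_t	per_block;

	per_block = ((uint64_t)recs_per_block * fill_pct) / 100;
	if (per_block == 0)
		per_block = 1;	/* always store at least one record */

	return (nr_records + per_block - 1) / per_block;
}

/*
 * Return true if rebuilding every btree at 75% fill would need more
 * blocks than are free in the whole filesystem, meaning the caller
 * should pack new btree blocks 100% full.
 */
static bool
sketch_need_max_packing(const struct sketch_btree_est *trees, size_t nr,
			uint64_t free_blocks)
{
	uint64_t	total = 0;
	size_t		i;

	for (i = 0; i < nr; i++)
		total += sketch_blocks_needed(trees[i].nr_records,
					      trees[i].recs_per_block, 75);

	return total > free_blocks;
}
```

For example, two trees with 1000 and 300 records at 100 records per block need 14 + 4 = 18 leaf blocks at 75% fill but only 10 + 3 = 13 when packed full, which is the margin the override is trying to recover.
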
11 months ago xfs_repair: rebuild the realtime rmap btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: rebuild the realtime rmap btree

Rebuild the realtime rmap btree file from the reverse mapping records we
gathered from walking the inodes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: always check realtime file mappings against incore info
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: always check realtime file mappings against incore info

Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3).  As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.

Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state.  The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork.  We already
do this for regular data files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
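
The check/update split described in the commit above follows a common two-step pattern, sketched below under simplifying assumptions.  The state map is modelled as a flat array with two states, and every name here is hypothetical; the real incore state map in xfs_repair tracks more states and is organized differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-state model of the incore block state map. */
enum sketch_state {
	SKETCH_FREE = 0,
	SKETCH_INUSE,
};

/*
 * Step 1: check a mapping against the state map without modifying it.
 * Returns false if the range is out of bounds or any block in it is
 * already claimed, so the caller can decide to zap the fork.
 */
static bool
sketch_check_mapping(const enum sketch_state *map, size_t map_len,
		     size_t start, size_t len)
{
	size_t	i;

	if (start > map_len || len > map_len - start)
		return false;

	for (i = start; i < start + len; i++) {
		if (map[i] != SKETCH_FREE)
			return false;
	}
	return true;
}

/*
 * Step 2: record the mapping in the state map.  Only call this after
 * sketch_check_mapping() succeeded and the fork is being kept.
 */
static void
sketch_claim_mapping(enum sketch_state *map, size_t start, size_t len)
{
	size_t	i;

	for (i = start; i < start + len; i++)
		map[i] = SKETCH_INUSE;
}
```

Keeping the check free of side effects means a bad mapping found halfway through a fork leaves the state map untouched, so the fork can be cleared without first undoing partial updates.
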
11 months ago xfs_repair: check existing realtime rmapbt entries against observed rmaps
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: find and mark the rtrmapbt inodes
Darrick J. Wong [Mon, 12 Aug 2024 21:19:12 +0000 (14:19 -0700)]
xfs_repair: find and mark the rtrmapbt inodes

Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: refactor realtime inode check
Darrick J. Wong [Mon, 12 Aug 2024 21:19:12 +0000 (14:19 -0700)]
xfs_repair: refactor realtime inode check

Refactor the realtime bitmap and summary checks into a helper function.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: collect realtime reverse-mapping data for refcount/rmap tree rebuilding
Darrick J. Wong [Mon, 12 Aug 2024 21:19:12 +0000 (14:19 -0700)]
xfs_repair: collect realtime reverse-mapping data for refcount/rmap tree rebuilding

Collect reverse-mapping data for realtime files so that we can later
check and rebuild the reference count tree and the reverse mapping
tree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: create a new set of incore rmap information for rt groups
Darrick J. Wong [Mon, 12 Aug 2024 21:19:12 +0000 (14:19 -0700)]
xfs_repair: create a new set of incore rmap information for rt groups

Create a parallel set of "xfs_ag_rmap" structures to cache information
about reverse mappings for the realtime groups.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: use realtime rmap btree data to check block types
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_repair: use realtime rmap btree data to check block types

Use the realtime rmap btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_repair: flag suspect long-format btree blocks
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_repair: flag suspect long-format btree blocks

Pass a "suspect" counter through scan_lbtree just like we do for
short-format btree blocks, and increment its value when we encounter
blocks with bad CRCs or outright corruption.  This ensures that repair
actually catches bmbt blocks with bad CRCs or other verifier errors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
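
The idea of threading a "suspect" counter through a scan can be illustrated with the minimal sketch below.  It is not the scan_lbtree code: the block structure, its two flags, and the function name are hypothetical stand-ins for the CRC and verifier results that a real scan would compute per block.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block results of a CRC check and a verifier run. */
struct sketch_block {
	bool	crc_ok;
	bool	verifier_ok;
};

/*
 * Walk the blocks and bump the caller's counter for every block that
 * failed either check.  The caller owns the counter, so one counter can
 * accumulate across several scans of the same fork.
 */
static void
sketch_scan_blocks(const struct sketch_block *blocks, size_t nr,
		   int *suspect)
{
	size_t	i;

	for (i = 0; i < nr; i++) {
		if (!blocks[i].crc_ok || !blocks[i].verifier_ok)
			(*suspect)++;
	}
}
```

A nonzero counter after the walk tells the caller that the structure should not be trusted as-is, even if the walk itself ran to completion.
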
11 months ago xfs_spaceman: report health status of the realtime rmap btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_spaceman: report health status of the realtime rmap btree

Add reporting of the rt rmap btree health to spaceman.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago libfrog: enable scrubbing of the realtime rmap
Darrick J. Wong [Mon, 12 Aug 2024 21:19:10 +0000 (14:19 -0700)]
libfrog: enable scrubbing of the realtime rmap

Add a new entry so that we can scrub the rtrmapbt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_db: make fsmap query the realtime reverse mapping tree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:10 +0000 (14:19 -0700)]
xfs_db: make fsmap query the realtime reverse mapping tree

Extend the 'fsmap' debugger command to support querying the realtime
rmap btree via a new -r argument.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_db: copy the realtime rmap btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:10 +0000 (14:19 -0700)]
xfs_db: copy the realtime rmap btree

Copy the realtime rmapbt when we're metadumping the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_db: support the realtime rmapbt
Darrick J. Wong [Mon, 12 Aug 2024 21:19:10 +0000 (14:19 -0700)]
xfs_db: support the realtime rmapbt

Wire up various parts of xfs_db for realtime rmap support.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_db: display the realtime rmap btree contents
Darrick J. Wong [Mon, 12 Aug 2024 21:19:09 +0000 (14:19 -0700)]
xfs_db: display the realtime rmap btree contents

Implement all the code we need to dump rtrmapbt contents, starting
from the root inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs_db: don't abort when bmapping on a non-extents/bmbt fork
Darrick J. Wong [Wed, 7 Aug 2024 22:54:35 +0000 (15:54 -0700)]
xfs_db: don't abort when bmapping on a non-extents/bmbt fork

We're going to introduce new fork formats, so let's fix the problem that
xfs_db's bmap command aborts when the fork format isn't one of the
existing ones.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: hook live realtime rmap operations during a repair operation
Darrick J. Wong [Mon, 12 Aug 2024 21:19:09 +0000 (14:19 -0700)]
xfs: hook live realtime rmap operations during a repair operation

Source kernel commit: 95ca3a8b151f34e4084aeade83ef25893a41f37e

Hook the regular realtime rmap code when an rtrmapbt repair operation is
running so that we can unlock the AGF buffer to scan the filesystem and
keep the in-memory btree up to date during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago xfs: create a shadow rmap btree during realtime rmap repair
Darrick J. Wong [Mon, 12 Aug 2024 21:19:09 +0000 (14:19 -0700)]
xfs: create a shadow rmap btree during realtime rmap repair

Create an in-memory btree of rmap records instead of an array.  This
enables us to do live record collection instead of freezing the fs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago xfs: online repair of the realtime rmap btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:09 +0000 (14:19 -0700)]
xfs: online repair of the realtime rmap btree

Source kernel commit: f813af307d62d4c4d620a358bbd406f89ffdeca2

Repair the realtime rmap btree while mounted.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago fixup
Christoph Hellwig [Mon, 12 Aug 2024 21:19:08 +0000 (14:19 -0700)]
fixup

11 months ago xfs: scrub the metadir path of rt rmap btree files
Darrick J. Wong [Mon, 12 Aug 2024 21:06:21 +0000 (14:06 -0700)]
xfs: scrub the metadir path of rt rmap btree files

Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata
directory tree path to the rmap btree file for each rt group.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
11 months ago fixup
Christoph Hellwig [Mon, 12 Aug 2024 21:19:08 +0000 (14:19 -0700)]
fixup

11 months ago xfs: scrub the realtime rmapbt
Darrick J. Wong [Mon, 12 Aug 2024 21:19:08 +0000 (14:19 -0700)]
xfs: scrub the realtime rmapbt

Source kernel commit: 15b31f2d8b71d1e775e9f1fa3cf4d740fa4e917f

Check the realtime reverse mapping btree against the rtbitmap, and
modify the rtbitmap scrub to check against the rtrmapbt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
11 months ago fixup
Christoph Hellwig [Mon, 12 Aug 2024 21:19:07 +0000 (14:19 -0700)]
fixup

11 months ago xfs: allow queued realtime intents to drain before scrubbing
Darrick J. Wong [Mon, 12 Aug 2024 21:19:07 +0000 (14:19 -0700)]
xfs: allow queued realtime intents to drain before scrubbing

Source kernel commit: 77b81645574605ea0c0199ec32fc4a9cdc87bc53

When a writer thread executes a chain of log intent items for the
realtime volume, the ILOCKs taken during each step are for each rt
metadata file, not the entire rt volume itself.  Although scrub takes
all rt metadata ILOCKs, this isn't sufficient to guard against scrub
checking the rt volume while that writer thread is in the middle of
finishing a chain because there's no higher level locking primitive
guarding the realtime volume.

When there's a collision, cross-referencing between data structures
(e.g. rtrmapbt and rtrefcountbt) yields false corruption events; if
repair is running, this results in incorrect repairs, which is
catastrophic.

Fix this by adding to the mount structure the same drain that we use to
protect scrub against concurrent AG updates, but this time for the
realtime volume.

[Contains a few cleanups from hch]

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
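
The drain described in the commit above can be pictured as a counter of in-flight intent items that scrub must see reach zero before it cross-references realtime metadata.  The sketch below shows only that bookkeeping, using C11 atomics; it is an assumption-laden illustration with hypothetical names, and it omits the sleeping and wakeup that a real drain needs so that scrub can wait instead of polling.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical drain: counts intent items queued but not yet finished. */
struct sketch_drain {
	atomic_uint	pending;
};

static void
sketch_drain_init(struct sketch_drain *d)
{
	atomic_init(&d->pending, 0);
}

/* Writer side: call when an intent item is queued for the rt volume. */
static void
sketch_intent_queued(struct sketch_drain *d)
{
	atomic_fetch_add(&d->pending, 1);
}

/* Writer side: call when that intent item has been finished. */
static void
sketch_intent_finished(struct sketch_drain *d)
{
	atomic_fetch_sub(&d->pending, 1);
}

/*
 * Scrub side: true only when no intent chain is in flight, which is
 * when cross-referencing rt data structures is safe in this model.
 */
static bool
sketch_drain_idle(struct sketch_drain *d)
{
	return atomic_load(&d->pending) == 0;
}
```

In this model a scrubber that finds the drain busy would drop its locks and retry later, rather than report the half-updated state as corruption.
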