www.infradead.org Git - users/hch/xfsprogs.git/log
10 months ago repair: stop tracking duplicate RT extents with rtgroups [rtg-repair-cleanups]
Christoph Hellwig [Thu, 15 Aug 2024 07:43:48 +0000 (09:43 +0200)]
repair: stop tracking duplicate RT extents with rtgroups

Nothing ever looks them up, so don't bother with tracking them by
overloading the AG numbers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months ago repair: use a separate bmaps array for real time groups
Christoph Hellwig [Thu, 15 Aug 2024 07:18:08 +0000 (09:18 +0200)]
repair: use a separate bmaps array for real time groups

Stop pretending RTGs are high numbered AGs and just use separate
structures instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months ago repair: add a real per-AG bitmap abstraction
Christoph Hellwig [Thu, 15 Aug 2024 06:44:30 +0000 (08:44 +0200)]
repair: add a real per-AG bitmap abstraction

Add a struct bmap that contains the btree root and the lock, and provide
helpers for locking instead of directly poking into the data structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months ago repair: simplify rt_lock handling
Christoph Hellwig [Thu, 15 Aug 2024 06:52:04 +0000 (08:52 +0200)]
repair: simplify rt_lock handling

No need to cacheline align rt_lock if we move it next to the data
it protects.  Also reduce the critical section to just where those
data structures are accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months ago repair: remove rmap handling in process_rt_rec
Christoph Hellwig [Thu, 15 Aug 2024 06:53:29 +0000 (08:53 +0200)]
repair: remove rmap handling in process_rt_rec

rtrmap is only supported with RT groups, which don't use this path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months ago db: simplify rtinode checks
Christoph Hellwig [Wed, 14 Aug 2024 08:20:49 +0000 (10:20 +0200)]
db: simplify rtinode checks

Stop discovering the RT inodes and just look at di_metatype instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months ago xfs_repair: allow adding rmapbt to reflink filesystems
Darrick J. Wong [Wed, 7 Aug 2024 22:54:59 +0000 (15:54 -0700)]
xfs_repair: allow adding rmapbt to reflink filesystems

New debugging knob so that I can upgrade a filesystem to have rmap
btrees even if reflink was already enabled.  We cannot easily precompute
the space requirements, so this is dangerous.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_repair: skip free space checks when upgrading
Darrick J. Wong [Wed, 7 Aug 2024 22:54:59 +0000 (15:54 -0700)]
xfs_repair: skip free space checks when upgrading

Add a debug knob to disable the free space checks when upgrading a
system.  This is extremely risky and will cause severe tire damage!!!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: upgrade filesystem features
Darrick J. Wong [Wed, 7 Aug 2024 22:54:58 +0000 (15:54 -0700)]
xfs: upgrade filesystem features

Add the ability to upgrade *some* filesystem features.  Note that you'll
have to run online fsck immediately afterwards to build metadata!

XXX DO NOT MERGE

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago debian: enable xfs_scrubbed on the root filesystem by default
Darrick J. Wong [Wed, 7 Aug 2024 22:54:58 +0000 (15:54 -0700)]
debian: enable xfs_scrubbed on the root filesystem by default

Now that we're finished building autonomous repair, enable the service
on the root filesystem by default.  The root filesystem is mounted by
the initrd prior to starting systemd, which is why the udev rule cannot
autostart the service for the root filesystem.

dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible.  Use a fugly shim for this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: use the autofsck fsproperty to select mode
Darrick J. Wong [Wed, 7 Aug 2024 22:54:58 +0000 (15:54 -0700)]
xfs_scrubbed: use the autofsck fsproperty to select mode

Make the xfs_scrubbed background service query the autofsck filesystem
property to figure out which operating mode it should use.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: don't start service if kernel support unavailable
Darrick J. Wong [Wed, 7 Aug 2024 22:54:58 +0000 (15:54 -0700)]
xfs_scrubbed: don't start service if kernel support unavailable

Use ExecCondition= in the system service to check if kernel support for
the health monitor is available.  If not, we don't want to run the
service, have it fail, and generate a bunch of silly log messages.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: create a background monitoring service
Darrick J. Wong [Wed, 7 Aug 2024 22:54:58 +0000 (15:54 -0700)]
xfs_scrubbed: create a background monitoring service

Create a systemd service and activate it automatically.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago builddefs: refactor udev directory specification
Darrick J. Wong [Wed, 7 Aug 2024 22:54:57 +0000 (15:54 -0700)]
builddefs: refactor udev directory specification

Refactor the code that finds the udev rules directory to detect the
location of the parent udev directory instead.  IOWs, we go from:

UDEV_RULE_DIR=/foo/bar/rules.d

to:

UDEV_DIR=/foo/bar
UDEV_RULE_DIR=/foo/bar/rules.d

This is needed by the next patch, which adds a helper script.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: use getparents to look up file names
Darrick J. Wong [Wed, 7 Aug 2024 22:54:57 +0000 (15:54 -0700)]
xfs_scrubbed: use getparents to look up file names

If the kernel tells us about something that happened to a file, use the
GETPARENTS ioctl to try to look up the path to that file for more
ergonomic reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: check for fs features needed for effective repairs
Darrick J. Wong [Wed, 7 Aug 2024 22:54:57 +0000 (15:54 -0700)]
xfs_scrubbed: check for fs features needed for effective repairs

Online repair relies heavily on back references such as reverse mappings
and directory parent pointers to add redundancy to the filesystem.
Check for these two features and whine a bit if they are missing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: enable repairing filesystems
Darrick J. Wong [Wed, 7 Aug 2024 22:54:57 +0000 (15:54 -0700)]
xfs_scrubbed: enable repairing filesystems

Make it so that our health monitoring daemon can initiate repairs.
Repairs can take a while to run, so we don't actually want to be
doing that work in the event thread because the kernel queue can drop
events if userspace doesn't respond in time.

Therefore, create a subprocess executor to run the repairs in the
background, and do the repairs from there.  The subprocess executor is
similar in concept to what a libfrog workqueue does, but the workers do
not share address space, which eliminates GIL contention.
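
The split described above (a fast event thread plus background workers in separate address spaces) can be sketched roughly as follows. This is only an illustration and not the xfs_scrubbed code: the event dictionary, its `repair_argv` key, and the function names are hypothetical, and the sketch swaps in a thread pool whose threads do nothing but wait on child processes, so the repair work itself never runs in the daemon's interpreter.

```python
import queue
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor


def run_repair(argv):
    """Run one repair command in a child process and return its exit status."""
    return subprocess.run(argv, check=False).returncode


def event_loop(events, pool, results):
    """Drain events quickly; hand the slow repair work to the pool."""
    while True:
        event = events.get()
        if event is None:  # sentinel: stop listening
            break
        # Hypothetical mapping from an event to a repair command line.
        results.append(pool.submit(run_repair, event["repair_argv"]))


def demo():
    events = queue.Queue()
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        listener = threading.Thread(target=event_loop,
                                    args=(events, pool, results))
        listener.start()
        # Stand-in for a real repair tool: a child Python that exits 0.
        events.put({"repair_argv": [sys.executable, "-c", "pass"]})
        events.put(None)
        listener.join()
        return [future.result() for future in results]
```

The event loop never blocks on a repair; it only enqueues work, which is the property the commit message is after.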

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: check events against schema
Darrick J. Wong [Wed, 7 Aug 2024 22:54:56 +0000 (15:54 -0700)]
xfs_scrubbed: check events against schema

Validate that the event objects that we get from the kernel actually
obey the schema that the kernel publishes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_scrubbed: create daemon to listen for health events
Darrick J. Wong [Wed, 7 Aug 2024 22:54:56 +0000 (15:54 -0700)]
xfs_scrubbed: create daemon to listen for health events

Create a daemon program that can listen for and log health events.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_io: monitor filesystem health events
Darrick J. Wong [Wed, 7 Aug 2024 22:54:56 +0000 (15:54 -0700)]
xfs_io: monitor filesystem health events

Create a subcommand to monitor for health events generated by the kernel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: report file io errors through healthmon
Darrick J. Wong [Wed, 7 Aug 2024 22:54:56 +0000 (15:54 -0700)]
xfs: report file io errors through healthmon

Set up a file io error event hook so that we can send events about read
errors, writeback errors, and directio errors to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: report media errors through healthmon
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: report media errors through healthmon

Now that we have hooks to report media errors, connect this to the
health monitor as well.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: report shutdown events through healthmon
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: report shutdown events through healthmon

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: report metadata health events through healthmon
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: report metadata health events through healthmon

Set up a metadata health event hook so that we can send events to
userspace as we collect information.  The unmount hook severs the weak
reference between the health monitor and the filesystem it's monitoring;
when this happens, we stop reporting events because there's no longer
any point.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: create event queuing, formatting, and discovery infrastructure
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: create event queuing, formatting, and discovery infrastructure

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.

Here, we've chosen json to export information to userspace.  The
structured key-value nature of json gives us enormous flexibility to
modify the schema of what we'll send to userspace because we can add new
keys at any time.  Userspace can use whatever json parsers are available
to consume the events and will not be confused by keys they don't
recognize.

Note that we do NOT allow sending json back to the kernel, nor is there
any intent to do that.
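
As a rough illustration of the forward-compatibility argument, a userspace consumer might pick out only the keys it understands and ignore the rest. The key names below are hypothetical and are not taken from the schema the kernel publishes.

```python
import json

# Hypothetical key names; the real event schema is not shown in this log.
KNOWN_KEYS = ("type", "domain", "time_ns")


def parse_event(line):
    """Decode one event and keep only the keys this consumer understands."""
    obj = json.loads(line)
    if not isinstance(obj, dict):
        raise ValueError("event is not a JSON object")
    return {key: obj[key] for key in KNOWN_KEYS if key in obj}


old = parse_event('{"type": "sick", "domain": "inode"}')
new = parse_event('{"type": "sick", "domain": "inode", "future_key": 42}')
```

Both calls yield the same dictionary, so a key added by a newer kernel does not disturb an older consumer.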

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: create a special file to pass filesystem health to userspace
Darrick J. Wong [Wed, 7 Aug 2024 22:54:55 +0000 (15:54 -0700)]
xfs: create a special file to pass filesystem health to userspace

Create an ioctl that installs a file descriptor backed by an anon_inode
file that will convey filesystem health events to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: create hooks for monitoring health updates
Darrick J. Wong [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
xfs: create hooks for monitoring health updates

Create hooks for monitoring health events.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago spaceman: move inodes with hardlinks
Dave Chinner [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
spaceman: move inodes with hardlinks

When an inode to be moved to a different AG has multiple hard links,
we need to "move" all the hard links, too. To do this, we need to
create temporary hardlinks to the new file, and then use rename
exchange to swap all the hardlinks that point to the old inode
with new hardlinks that point to the new inode.

We already know that an inode has hard links via the path discovery,
and we can check it against the link count that is reported for the
inode before we start building the link farm.
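
The link-count cross-check mentioned above could look something like the following sketch. It is not the spaceman code; the function name is made up, and it only shows the idea of comparing the number of discovered paths with the `st_nlink` value reported for the inode.

```python
import os
import tempfile


def paths_cover_all_links(paths):
    """True if the discovered paths account for every hard link of one inode."""
    stats = [os.lstat(p) for p in paths]
    first = stats[0]
    same_inode = all((s.st_dev, s.st_ino) == (first.st_dev, first.st_ino)
                     for s in stats)
    return same_inode and first.st_nlink == len(set(paths))


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        a = os.path.join(tmpdir, "a")
        b = os.path.join(tmpdir, "b")
        with open(a, "w") as f:
            f.write("data")
        os.link(a, b)  # second hard link to the same inode
        # Knowing only one of the two paths is not enough to move the inode.
        return paths_cover_all_links([a]), paths_cover_all_links([a, b])
```

If the check fails, the path discovery missed a link and building the link farm would leave a name pointing at the old inode.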

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago spaceman: relocate the contents of an AG
Dave Chinner [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
spaceman: relocate the contents of an AG

Shrinking a filesystem needs to first remove all the active user
data and metadata from the AGs that are going to be lopped off the
filesystem. Before we can do this, we have to relocate this
information to a region of the filesystem that is going to be
retained.

We have a function to move an inode and all its related
information to a specific AG, we have functions to find the
owners of all the information in an AG and we can find their paths.
This gives us all the information we need to relocate all the
objects in an AG we are going to remove via shrinking.

Firstly we scan the AG to be emptied to find the inodes that need to
be relocated, then we scan the directory structure to find all the
paths to those inodes that need to be moved. Then we iterate over
all the inodes to be moved attempting to move them to the lowest
numbered AGs.

When the destination AG fills up, we'll get ENOSPC from
the moving code and this is a trigger to bump the destination AG and
retry the move. If we haven't moved all the inodes and their data by
the time the destination reaches the source AG, then the entire
operation will fail with ENOSPC - there is not enough room in the
filesystem to empty the selected AG in preparation for a shrink.
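
The destination-bumping loop described above can be sketched as follows. This is a toy model under stated assumptions rather than the spaceman implementation: `move_inode` is a hypothetical callback that raises `OSError(ENOSPC)` when the destination AG is full, and `make_fake_mover` is a stand-in that models each AG as holding a fixed number of inodes.

```python
import errno


def relocate(inodes, source_ag, move_inode):
    """Move every inode out of source_ag, filling the lowest AGs first."""
    dest_ag = 0
    for ino in inodes:
        while True:
            if dest_ag >= source_ag:
                # The destination reached the AG being emptied: give up.
                raise OSError(errno.ENOSPC,
                              "no room to empty AG %d" % source_ag)
            try:
                move_inode(ino, dest_ag)
                break
            except OSError as e:
                if e.errno != errno.ENOSPC:
                    raise
                dest_ag += 1  # destination is full: bump it and retry


def make_fake_mover(capacity):
    """Toy stand-in for the real move: AG n can take capacity[n] inodes."""
    placed = {}

    def move_inode(ino, ag):
        if capacity[ag] <= 0:
            raise OSError(errno.ENOSPC, "AG full")
        capacity[ag] -= 1
        placed[ino] = ag

    return move_inode, placed


move, placed = make_fake_mover([1, 2, 0])
relocate([10, 11, 12], source_ag=3, move_inode=move)
# placed == {10: 0, 11: 1, 12: 1}
```

Note that the destination only ever moves forward, so a retry never revisits an AG that has already reported ENOSPC.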

This, once again, is not intended as an optimal or even guaranteed
way of emptying an AG for shrink. It simply provides the basic
algorithm and mechanisms we need to perform a shrink operation.
Improvements and optimisations will come in time, but we can't get
to an optimal solution without first having basic functionality in
place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_spaceman: port relocation structure to 32-bit systems
Darrick J. Wong [Wed, 7 Aug 2024 22:54:54 +0000 (15:54 -0700)]
xfs_spaceman: port relocation structure to 32-bit systems

We can't use the radix tree to store relocation information on 32-bit
systems because unsigned longs are not large enough to hold 64-bit
inodes.  Use an avl64 tree instead.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_spaceman: wrap radix tree accesses in find_owner.c
Darrick J. Wong [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
xfs_spaceman: wrap radix tree accesses in find_owner.c

Wrap the raw radix tree accesses here so that we can provide an
alternate implementation on platforms where radix tree indices cannot
store a full 64-bit inode number.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago spaceman: find owners of space in an AG
Dave Chinner [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
spaceman: find owners of space in an AG

Before we can move inodes for a shrink operation, we have to find
all the inodes that own space in the AG(s) we want to empty.

This implementation uses FS_IOC_GETFSMAP on the assumption that
filesystems to be shrunk have reverse mapping enabled as it is the
only way to identify inode related metadata that userspace is unable
to see or influence (e.g. BMBT blocks) that may be located in the
specific AG. We can use GETFSMAP to identify both inodes to be moved
(via XFS_FMR_OWN_INODES records) and inodes with just data and/or
metadata to be moved.

Once we have identified all the inodes to be moved, we have to
map them to paths so that we can use renameat2() to exchange the
directory entries pointing at the moved inode atomically. We also
need to record inodes with hard links and all of the paths to the
inode so that hard links can be recreated appropriately.

This requires a directory tree walk to discover the paths (until
parent pointers are a thing). Hence for filesystems that aren't
reverse mapping enabled, we can eventually use this pass to discover
inodes with visible data and metadata that need to be moved.

As we resolve the paths to the inodes to be moved, output the
information to stdout so that it can be acted upon by other
utilities. This results in a command that acts similar to find but
with a physical location filter rather than an inode metadata
filter.

Again, this is not meant to be an optimal implementation. It
shouldn't suck, but there is plenty of scope for performance
optimisation, especially with a multithreaded and/or async directory
traversal/parent pointer path resolution process to hide access
latencies.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago spaceman: physically move a regular inode
Dave Chinner [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
spaceman: physically move a regular inode

To be able to shrink a filesystem, we need to be able to physically
move an inode and all its data and metadata from its current
location to a new AG.  Add a command to spaceman to allow an inode
to be moved to a new AG.

This new command is not intended to be a perfect solution. I am not
trying to handle atomic movement of open files - this is intended to
be run as a maintenance operation on an idle filesystem. If root
filesystems are the target, then this should be run via a rescue
environment that is not executing directly on the root fs. With
those caveats in place, we can do the entire inode move as a set of
non-destructive operations finalised by an atomic inode swap
without needing any special kernel support.

To ensure we move metadata such as BMBT blocks even if we don't need
to move data, we clone the data to a new inode that we've allocated
in the destination AG. This will result in new bmbt blocks being
allocated in the new location even though the data is not copied.
Attributes need to be copied one at a time from the original inode.

If data needs to be moved, then we use fallocate(UNSHARE) to create
a private copy of the range of data that needs to be moved in the
new inode. This will be allocated in the destination AG by normal
allocation policy.

Once the new inode has been finalised, use RENAME_EXCHANGE to swap
it into place and unlink the original inode to free up all the
resources it still pins.

There are many optimisations still possible to speed this up, but
the goal here is "functional" rather than "optimal". Performance can
be optimised once all the parts for an "empty the tail of the
filesystem before shrink" operation are implemented and solidly
tested.

This functionality has been smoke tested by creating a 32MB data
file with 4k extents and several hundred attributes:

$ cat test.sh
fname=/mnt/scratch/foo
xfs_io -f -c "pwrite 0 32m" -c sync $fname
for (( i=0; i < 4096 ; i++ )); do
    xfs_io -c "fpunch $((i * 8))k 4k" $fname
done

for (( i=0; i < 100 ; i++ )); do
    setfattr -n user.blah.$i.$i.blah -v blah.$i.$i.blah $fname
    setfattr -n user.foo.$i.$i.foo -v $i.cantbele.$i.ve.$i.tsnotbutter $fname
done
for (( i=0; i < 100 ; i++ )); do
    setfattr -n security.baz.$i.$i.baz -v wotchul$i$iookinat $fname
done

xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
xfs_spaceman -c "move_inode -a 22" /mnt/scratch/foo
xfs_io -c stat -c "bmap -vp" -c "bmap -avp" $fname
$

and the output looks something like:

$ sudo ./test.sh
....
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 133
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          hole                                      8
   1: [8..15]:         208..215           0 (208..215)           8 000000
   2: [16..23]:        hole                                      8
   3: [24..31]:        224..231           0 (224..231)           8 000000
....
8189: [65512..65519]:  65712..65719       0 (65712..65719)       8 000000
8190: [65520..65527]:  hole                                      8
8191: [65528..65535]:  65728..65735       0 (65728..65735)       8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          392..399           0 (392..399)           8 000000
   1: [8..15]:         408..415           0 (408..415)           8 000000
   2: [16..23]:        424..431           0 (424..431)           8 000000
   3: [24..31]:        456..463           0 (456..463)           8 000000
move mnt /mnt/scratch, path /mnt/scratch/foo, agno 22
fd.path = "/mnt/scratch/foo"
fd.flags = non-sync,non-direct,read-write
stat.ino = 47244651475
....
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE               AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          hole                                              8
   1: [8..15]:         47244763192..47244763199  22 (123112..123119)     8 000000
   2: [16..23]:        hole                                              8
   3: [24..31]:        47244763208..47244763215  22 (123128..123135)     8 000000
....
8189: [65512..65519]:  47244828808..47244828815  22 (188728..188735)     8 000000
8190: [65520..65527]:  hole                                              8
8191: [65528..65535]:  47244828824..47244828831  22 (188744..188751)     8 000000
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE               AG AG-OFFSET        TOTAL FLAGS
   0: [0..7]:          47244763176..47244763183  22 (123096..123103)     8 000000
$
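
One way to sanity-check the post-move output is to relate the BLOCK-RANGE and AG-OFFSET columns. The sketch below assumes that an AG's first block sits at `agno * ag_size` (all AGs the same size) and that both columns use the same units; the numbers are copied from the rows above, while the variable names are made up for illustration.

```python
# (agno, BLOCK-RANGE start, AG-OFFSET start) from the post-move output.
rows = [
    (22, 47244763192, 123112),
    (22, 47244763208, 123128),
    (22, 47244828808, 188728),
    (22, 47244828824, 188744),
    (22, 47244763176, 123096),
]

# Every row should agree on where AG 22 begins.
ag_starts = {block - ag_off for _, block, ag_off in rows}
consistent = len(ag_starts) == 1

ag_start = min(ag_starts)
ag_size, remainder = divmod(ag_start, 22)
```

All five rows give the same AG start, 47244640080, which divides evenly by 22 into an AG size of 2147483640 units. If those units are 512-byte blocks, that is just under 1 TiB per AG.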

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_spaceman: implement clearing free space
Darrick J. Wong [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
xfs_spaceman: implement clearing free space

First attempt at evacuating all the used blocks from part of a
filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_db: get and put blocks on the AGFL
Darrick J. Wong [Wed, 7 Aug 2024 22:54:53 +0000 (15:54 -0700)]
xfs_db: get and put blocks on the AGFL

Add a new xfs_db command to let people add and remove blocks from an
AGFL.  This isn't really related to rmap btree reconstruction, other
than enabling debugging code to mess around with the AGFL to exercise
various odd scenarios.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_io: support using XFS_IOC_MAP_FREESP to map free space
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs_io: support using XFS_IOC_MAP_FREESP to map free space

Add a command to call XFS_IOC_MAP_FREESP.  This is experimental code to
see if we can build a free space defragmenter out of this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: add an ioctl to map free space into a file
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs: add an ioctl to map free space into a file

Add a new ioctl to map free physical space into a file, at the same file
offset as if the file were a sparse image of the physical device backing
the filesystem.  The intent here is to use this to prototype a free
space defragmentation tool.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_io: dump reference count information
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs_io: dump reference count information

Dump refcount info from the kernel so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: export reference count information to userspace
Darrick J. Wong [Wed, 7 Aug 2024 22:54:52 +0000 (15:54 -0700)]
xfs: export reference count information to userspace

Export refcount info to userspace so we can prototype a sharing-aware
defrag/fs rearranging tool.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_io: enhance the aginfo command to control the noalloc flag
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs_io: enhance the aginfo command to control the noalloc flag

Augment the aginfo command to be able to set and clear the noalloc
state for an AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: apply noalloc mode to inode allocations too
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: apply noalloc mode to inode allocations too

Don't allow inode allocations from this group if it's marked noalloc.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: enable userspace to hide an AG from allocation
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: enable userspace to hide an AG from allocation

Add an administrative interface so that userspace can hide an allocation
group from block allocation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: create a noalloc mode for allocation groups
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: create a noalloc mode for allocation groups

Create a new noalloc state for the per-AG structure that will disable
block allocation in this AG.  We accomplish this by subtracting from
fdblocks all the free blocks in this AG, hiding those blocks from the
allocator, and preventing freed blocks from updating fdblocks until
we're ready to lift noalloc mode.

Note that we reduce the free block count of the filesystem so that we
can prevent transactions from entering the allocator looking for "free"
space that we've turned off incore.
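
The accounting described above can be modelled with a few counters. This is a toy sketch, not the kernel's per-AG structure: the class and method names are invented, and only the bookkeeping of `fdblocks` against per-AG free counts is shown.

```python
class ToyFs:
    """Toy model of the free-block accounting; not the kernel structures."""

    def __init__(self, ag_free):
        self.ag_free = list(ag_free)           # free blocks per AG
        self.noalloc = [False] * len(ag_free)
        self.fdblocks = sum(ag_free)           # filesystem-wide free count

    def set_noalloc(self, agno):
        if not self.noalloc[agno]:
            self.noalloc[agno] = True
            self.fdblocks -= self.ag_free[agno]  # hide this AG's free space

    def clear_noalloc(self, agno):
        if self.noalloc[agno]:
            self.noalloc[agno] = False
            # Give back everything, including blocks freed in the meantime.
            self.fdblocks += self.ag_free[agno]

    def free_blocks(self, agno, count):
        self.ag_free[agno] += count
        if not self.noalloc[agno]:
            self.fdblocks += count  # skipped while noalloc is set


fs = ToyFs([100, 50])
fs.set_noalloc(1)
after_set = fs.fdblocks      # 100: AG 1's 50 free blocks are hidden
fs.free_blocks(1, 10)
after_free = fs.fdblocks     # still 100: the freed blocks stay hidden
fs.clear_noalloc(1)
after_clear = fs.fdblocks    # 160: 100 + 50 + 10
```

Because `fdblocks` already excludes the hidden AG, a transaction that checks the global free count cannot reserve space that only exists in a noalloc AG.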

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: track deferred ops statistics
Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: track deferred ops statistics

Track some basic statistics on how hard we're pushing the defer ops.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_quota: report warning limits for realtime space quotas
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs_quota: report warning limits for realtime space quotas

Report the number of warnings that a user will get for exceeding the
soft limit of a realtime volume.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago mkfs: enable reflink with realtime extent sizes > 1
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
mkfs: enable reflink with realtime extent sizes > 1

Allow creation of filesystems with reflink enabled and realtime extent
size larger than 1 block.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: fix integer overflow when validating extent size hints
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs: fix integer overflow when validating extent size hints

Both file extent size hints are stored as 32-bit quantities, in units of
filesystem blocks.  As part of validating the hints, we convert these
quantities to bytes to ensure that the hint is congruent with the file's
allocation size.

The maximum possible hint value is 2097151 (aka XFS_MAX_BMBT_EXTLEN).
If the file allocation unit is larger than 2048, the unit conversion
will exceed 32 bits in size, which overflows the uint32_t used to store
the value used in the comparison.  This isn't a problem for files on the
data device since the hint will always be a multiple of the block size.
However, this is a problem for realtime files because the rtextent size
can be any integer number of fs blocks, and truncation of upper bits
changes the outcome of division.

Eliminate the overflow by performing the congruency check in units of
blocks, not bytes.  Otherwise, we get errors like this:

$ truncate -s 500T /tmp/a
$ mkfs.xfs -f -N /tmp/a -d extszinherit=2097151,rtinherit=1 -r extsize=28k
illegal extent size hint 2097151, must be less than 2097151 and a multiple of 7.
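
The failure can be reproduced with plain arithmetic. The sketch below emulates a `uint32_t` by masking to 32 bits and assumes a 4096-byte block size (so the 28k rt extent is 7 blocks); the function names are illustrative and are not the libxfs validator.

```python
U32_MASK = 0xFFFFFFFF


def aligned_bytes_u32(hint_blocks, blocksize, alloc_unit_blocks):
    """Buggy form: convert to bytes, truncating to 32 bits like uint32_t."""
    hint_bytes = (hint_blocks * blocksize) & U32_MASK
    return hint_bytes % (alloc_unit_blocks * blocksize) == 0


def aligned_blocks(hint_blocks, alloc_unit_blocks):
    """Fixed form: compare in units of filesystem blocks."""
    return hint_blocks % alloc_unit_blocks == 0


HINT = 2097151      # maximum hint value from the commit message
BLOCKSIZE = 4096    # assumed block size
RTEXT_BLOCKS = 7    # 28k rt extent / 4k blocks

buggy = aligned_bytes_u32(HINT, BLOCKSIZE, RTEXT_BLOCKS)   # False
fixed = aligned_blocks(HINT, RTEXT_BLOCKS)                 # True
```

2097151 is an exact multiple of 7, so the hint is valid. The byte conversion wraps to 4294963200, which leaves a remainder of 12288 when divided by 28672, and that remainder is what makes the truncated check reject the hint.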

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs: enable extent size hints for CoW when rtextsize > 1
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs: enable extent size hints for CoW when rtextsize > 1

CoW extent size hints are not allowed on filesystems that have large
realtime extents because we only want to perform the minimum required
amount of write-around (aka write amplification) for shared extents.

On filesystems where rtextsize > 1, allocations can only be done in
units of full rt extents, which means that we can only map an entire rt
extent's worth of blocks into the data fork.  Hole punch requests become
conversions to unwritten if the request isn't aligned properly.

Because a copy-write fundamentally requires remapping, this means that
we also can only do copy-writes of a full rt extent.  This is too
expensive for large hint sizes, since it's all or nothing.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago mkfs: enable reflink on the realtime device
Darrick J. Wong [Mon, 12 Aug 2024 21:19:49 +0000 (14:19 -0700)]
mkfs: enable reflink on the realtime device

Allow the creation of filesystems with both reflink and realtime volumes
enabled.  For now we don't support a realtime extent size > 1.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago mkfs: validate CoW extent size hint when rtinherit is set
Darrick J. Wong [Mon, 12 Aug 2024 21:19:49 +0000 (14:19 -0700)]
mkfs: validate CoW extent size hint when rtinherit is set

Extent size hints exist to nudge the behavior of the file data block
allocator towards trying to make aligned allocations.  Therefore, it
doesn't make sense to allow a hint that isn't a multiple of the
fundamental allocation unit for a given file.

This means that if the sysadmin is formatting with rtinherit set on the
root dir, validate_cowextsize_hint needs to check the hint value on a
simulated realtime file to make sure that it's correct.  This hasn't
been necessary in the past since one cannot have a CoW hint without a
reflink filesystem, and we previously didn't allow rt reflink
filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_logprint: report realtime CUIs
Darrick J. Wong [Mon, 12 Aug 2024 21:19:48 +0000 (14:19 -0700)]
xfs_logprint: report realtime CUIs

Decode the CUI format just enough to report if a CUI targets the
realtime device or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_repair: allow sysadmins to add realtime reflink
Darrick J. Wong [Mon, 12 Aug 2024 21:19:26 +0000 (14:19 -0700)]
xfs_repair: allow sysadmins to add realtime reflink

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the realtime reference count btree, and therefore reflink on
realtime volumes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_repair: validate CoW extent size hint on rtinherit directories
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: validate CoW extent size hint on rtinherit directories

XFS allows a sysadmin to change the rt extent size when adding a rt
section to a filesystem after formatting.  If there are any directories
with both a cowextsize hint and rtinherit set, the hint could become
misaligned with the new rextsize.  Offer to fix the problem if we're in
modify mode and the verifier didn't trip.  If we're in dry run mode,
we let the kernel fix it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_repair: allow realtime files to have the reflink flag set
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: allow realtime files to have the reflink flag set

Now that we allow reflink on the realtime volume, allow that combination
of inode flags if the feature's enabled.  Note that we now allow inodes
to have rtinherit even if there's no realtime volume, since the kernel
has never restricted that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_repair: rebuild the realtime refcount btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: rebuild the realtime refcount btree

Use the collected reference count information to rebuild the btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months ago xfs_repair: reject unwritten shared extents
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: reject unwritten shared extents

We don't allow sharing of unwritten extents, which means that repair
should reject an unwritten extent if someone else has already claimed
the space.
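
The rule can be sketched roughly as below.  This is only an illustration:
the state enum, the mapping structure and mapping_ok() are hypothetical
names, and the per-block array stands in for repair's incore state map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block ownership state. */
enum blk_state { BLK_FREE, BLK_OWNED };

/* Hypothetical file mapping: [start, start + len) in blocks. */
struct mapping {
	size_t	start;
	size_t	len;
	bool	unwritten;
};

/* Returns true if the mapping may be kept, false if it must be rejected. */
static bool
mapping_ok(const enum blk_state *map, size_t nblocks,
	   const struct mapping *m, bool reflink)
{
	size_t	i;

	/* Reject empty or out-of-range mappings outright. */
	if (m->len == 0 || m->start >= nblocks || m->len > nblocks - m->start)
		return false;

	for (i = m->start; i < m->start + m->len; i++) {
		if (map[i] == BLK_FREE)
			continue;
		/*
		 * Someone else already claimed this block.  Sharing is only
		 * acceptable on a reflink filesystem, and never for an
		 * unwritten extent.
		 */
		if (!reflink || m->unwritten)
			return false;
	}
	return true;
}
```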

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: check existing realtime refcountbt entries against observed refcounts
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: compute refcount data for the realtime groups
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: compute refcount data for the realtime groups

At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.
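
To give a feel for how reference counts fall out of rmap records, here is a
simplified sweep over extent boundaries.  It is a sketch only and is not
the algorithm repair uses; the struct and function names are hypothetical,
and only regions referenced by two or more owners are emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct rmap { uint64_t start; uint64_t len; };	/* one owner's extent */
struct refc { uint64_t start; uint64_t len; uint32_t count; };
struct event { uint64_t pos; int delta; };	/* +1 = start, -1 = end */

static int
cmp_event(const void *a, const void *b)
{
	const struct event *x = a, *y = b;

	if (x->pos != y->pos)
		return x->pos < y->pos ? -1 : 1;
	return x->delta - y->delta;	/* ends sort before starts */
}

/* Returns the number of records written to out, or -1 on failure. */
static int
compute_refcounts(const struct rmap *rm, size_t n, struct refc *out,
		  size_t max_out)
{
	struct event	*ev;
	uint64_t	prev = 0;
	size_t		i, nout = 0;
	int		count = 0;

	if (n == 0)
		return 0;
	ev = malloc(2 * n * sizeof(*ev));
	if (!ev)
		return -1;
	for (i = 0; i < n; i++) {
		ev[2 * i] = (struct event){ rm[i].start, 1 };
		ev[2 * i + 1] = (struct event){ rm[i].start + rm[i].len, -1 };
	}
	qsort(ev, 2 * n, sizeof(*ev), cmp_event);

	for (i = 0; i < 2 * n; i++) {
		/* [prev, pos) had "count" owners; record it if shared. */
		if (count >= 2 && ev[i].pos > prev) {
			struct refc *last = nout ? &out[nout - 1] : NULL;

			if (last && last->count == (uint32_t)count &&
			    last->start + last->len == prev) {
				/* Adjacent, same refcount: merge. */
				last->len += ev[i].pos - prev;
			} else if (nout == max_out) {
				free(ev);
				return -1;
			} else {
				out[nout++] = (struct refc){ prev,
					ev[i].pos - prev, (uint32_t)count };
			}
		}
		count += ev[i].delta;
		prev = ev[i].pos;
	}
	free(ev);
	return (int)nout;
}
```

For rmaps covering [0,10), [5,15) and [8,12), this yields three records:
[5,8) with refcount 2, [8,10) with refcount 3, and [10,12) with refcount 2.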

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: find and mark the rtrefcountbt inode
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: find and mark the rtrefcountbt inode

Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: use realtime refcount btree data to check block types
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_repair: use realtime refcount btree data to check block types

Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: allow CoW staging extents in the realtime rmap records
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_repair: allow CoW staging extents in the realtime rmap records

Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports reflink.  As far as reporting
leftover staging extents, we'll report them when we scan the rt refcount
btree, in a future patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_spaceman: report health of the realtime refcount btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_spaceman: report health of the realtime refcount btree

Report the health of the realtime reference count btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_db: copy the realtime refcount btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_db: copy the realtime refcount btree

Copy the realtime refcountbt when we're metadumping the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_db: support the realtime refcountbt
Darrick J. Wong [Mon, 12 Aug 2024 21:19:22 +0000 (14:19 -0700)]
xfs_db: support the realtime refcountbt

Wire up various parts of xfs_db for realtime refcount support.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_db: display the realtime refcount btree contents
Darrick J. Wong [Mon, 12 Aug 2024 21:19:22 +0000 (14:19 -0700)]
xfs_db: display the realtime refcount btree contents

Implement all the code we need to dump rtrefcountbt contents, starting
from the root inode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agolibfrog: enable scrubbing of the realtime refcount data
Darrick J. Wong [Mon, 12 Aug 2024 21:19:22 +0000 (14:19 -0700)]
libfrog: enable scrubbing of the realtime refcount data

Add a new entry so that we can scrub the rtrefcountbt.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agofixup
Darrick J. Wong [Mon, 12 Aug 2024 21:19:21 +0000 (14:19 -0700)]
fixup

10 months agoxfs: scrub the metadir path of rt refcount btree files
Darrick J. Wong [Mon, 12 Aug 2024 21:19:21 +0000 (14:19 -0700)]
xfs: scrub the metadir path of rt refcount btree files

Source kernel commit: 08745bdf226a413246fc4edb2947985804dbcb86

Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata
directory tree path to the refcount btree file for each rt group.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months agoxfs: scrub the realtime refcount btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:21 +0000 (14:19 -0700)]
xfs: scrub the realtime refcount btree

Source kernel commit: 844d7f8755a67b01391da92b99a5342c8b2b83f4

Add code to scrub realtime refcount btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months agoxfs: report realtime refcount btree corruption errors to the health system
Darrick J. Wong [Mon, 12 Aug 2024 21:11:05 +0000 (14:11 -0700)]
xfs: report realtime refcount btree corruption errors to the health system

Whenever we encounter corrupt realtime refcount btree blocks, we should
report that to the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: enable extent size hints for CoW operations
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: enable extent size hints for CoW operations

Wire up the copy-on-write extent size hint for realtime files, and
connect it to the rt allocator so that we avoid fragmentation on rt
filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: apply rt extent alignment constraints to CoW extsize hint
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint

The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint.  Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.

Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files

Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files.  This apparently was done to disable
delayed allocation for realtime files.

However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files.  In this case, the
logic will incorrectly send the write through the delalloc write path.

Fix this by adjusting the logic slightly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: recover CoW leftovers in the realtime volume
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: recover CoW leftovers in the realtime volume

Scan the realtime refcount tree at mount time to get rid of leftover
CoW staging extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: allow inodes to have the realtime and reflink flags
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: allow inodes to have the realtime and reflink flags

Now that we can share blocks between realtime files, allow this
combination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: compute rtrmap btree max levels when reflink enabled
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: compute rtrmap btree max levels when reflink enabled

Compute the maximum possible height of the realtime rmap btree when
reflink is enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: update rmap to allow cow staging extents in the rt rmap
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: update rmap to allow cow staging extents in the rt rmap

Don't error out on CoW staging extent records when realtime reflink is
enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: create routine to allocate and initialize a realtime refcount btree inode
Darrick J. Wong [Mon, 12 Aug 2024 21:19:19 +0000 (14:19 -0700)]
xfs: create routine to allocate and initialize a realtime refcount btree inode

Source kernel commit: 0066145ac851fd746ed22e523c3b60062e94c250

Create a library routine to allocate and initialize an empty realtime
refcountbt inode.  We'll use this for growfs, mkfs, and repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months agoxfs: wire up realtime refcount btree cursors
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: wire up realtime refcount btree cursors

Source kernel commit: fb0ac941a3e35fe16375f89d8d817e2790aeab35

Wire up realtime refcount btree cursors wherever they're needed
throughout the code base.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
10 months agoxfs: wire up a new inode fork type for the realtime refcount
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: wire up a new inode fork type for the realtime refcount

Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: add metadata reservations for realtime refcount btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: add metadata reservations for realtime refcount btree

Reserve some free blocks so that we will always have enough free blocks
in the data volume to handle expansion of the realtime refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: add realtime refcount btree inode to metadata directory
Darrick J. Wong [Mon, 12 Aug 2024 21:10:50 +0000 (14:10 -0700)]
xfs: add realtime refcount btree inode to metadata directory

Add a metadir path to select the realtime refcount btree inode and load
it at mount time.  The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: add a realtime flag to the refcount update log redo items
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: add a realtime flag to the refcount update log redo items

Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt.  We'll
wire up the actual refcount code later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: prepare refcount functions to deal with rtrefcountbt
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: prepare refcount functions to deal with rtrefcountbt

Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions.  Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.

Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert it all at once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: add realtime refcount btree operations
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: add realtime refcount btree operations

Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are neither constrained
to the free space nor any particular AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: realtime refcount btree transaction reservations
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: realtime refcount btree transaction reservations

Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents.  We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: define the on-disk realtime refcount btree format
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: define the on-disk realtime refcount btree format

Start filling out the rtrefcount btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate refcount btree blocks. This prepares the way for connecting
the btree operations implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: namespace the maximum length/refcount symbols
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: namespace the maximum length/refcount symbols

Actually namespace these variables properly, so that readers can tell
that this is an XFS symbol, and that it's for the refcount
functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs: introduce realtime refcount btree definitions
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: introduce realtime refcount btree definitions

Add new realtime refcount btree definitions. The realtime refcount btree
will be rooted from a hidden inode, but has its own shape and therefore
needs to have most of its own separate types.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agomkfs: use file write helper to populate files
Darrick J. Wong [Mon, 12 Aug 2024 21:19:15 +0000 (14:19 -0700)]
mkfs: use file write helper to populate files

Use the file write helper to write files into the filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agolibxfs: resync libxfs_alloc_file_space interface with the kernel
Darrick J. Wong [Mon, 12 Aug 2024 21:19:15 +0000 (14:19 -0700)]
libxfs: resync libxfs_alloc_file_space interface with the kernel

Make the userspace xfs_alloc_file_space behave (more or less) like the
kernel version, at least as far as the interface goes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agomkfs: create the realtime rmap inode
Darrick J. Wong [Mon, 12 Aug 2024 21:19:15 +0000 (14:19 -0700)]
mkfs: create the realtime rmap inode

Create a realtime rmapbt inode if we format the fs with realtime
and rmap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_logprint: report realtime RUIs
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_logprint: report realtime RUIs

Decode the RUI format just enough to report if an RUI targets the
realtime device or not.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: allow sysadmins to add realtime reverse mapping indexes
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: allow sysadmins to add realtime reverse mapping indexes

Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index for realtime volumes.  This
is needed for online fsck.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: reserve per-AG space while rebuilding rt metadata
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: reserve per-AG space while rebuilding rt metadata

Realtime metadata btrees can consume quite a bit of space on a full
filesystem.  Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata.  This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: rebuild the bmap btree for realtime files
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: rebuild the bmap btree for realtime files

Use the realtime rmap btree information to rebuild an inode's data fork
when appropriate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: check for global free space concerns with default btree slack levels
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check for global free space concerns with default btree slack levels

It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full.  If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space.  The
solution to this is to pack the new blocks completely full if we fear
running out of space.

Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.

Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.
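
The global check can be sketched roughly as follows.  This is a simplified
illustration, not the repair code: struct btree_est and both functions are
hypothetical, and only leaf blocks are counted, whereas a real estimate
would also have to account for interior btree levels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-btree estimate input. */
struct btree_est {
	uint64_t	nrecs;		/* records to store */
	uint32_t	maxrecs;	/* records that fit in a full block */
};

/* Leaf blocks needed when each block is filled to fill_pct percent. */
static uint64_t
leaf_blocks(uint64_t nrecs, uint32_t maxrecs, uint32_t fill_pct)
{
	uint64_t	per_block;

	assert(maxrecs > 0 && fill_pct > 0 && fill_pct <= 100);
	per_block = (uint64_t)maxrecs * fill_pct / 100;
	if (per_block == 0)
		per_block = 1;
	return (nrecs + per_block - 1) / per_block;
}

/*
 * Sum the space every btree would need at the default 75% fill and
 * compare that against the global free space.  Returns true if the
 * btrees should be packed completely full instead.
 */
static bool
need_packed_btrees(const struct btree_est *bt, size_t n, uint64_t free_blocks)
{
	uint64_t	total = 0;
	size_t		i;

	for (i = 0; i < n; i++)
		total += leaf_blocks(bt[i].nrecs, bt[i].maxrecs, 75);
	return total > free_blocks;
}
```

With two btrees of 300 records each and 100 records per block, the 75%
layout needs 8 leaf blocks while the packed layout needs 6, so 7 free
blocks would trigger maximal packing.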

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: rebuild the realtime rmap btree
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: rebuild the realtime rmap btree

Rebuild the realtime rmap btree file from the reverse mapping records we
gathered from walking the inodes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: always check realtime file mappings against incore info
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: always check realtime file mappings against incore info

Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3).  As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.

Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state.  The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork.  We already
do this for regular data files.
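
The check-then-update split might look roughly like this.  The names below
are hypothetical and the byte array merely stands in for the incore rt
state map.  The point is that a fork that fails the check leaves the state
untouched, so it can be zapped without any cleanup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { RT_FREE = 0, RT_INUSE = 1 };	/* hypothetical rt extent states */

struct rt_map { size_t start; size_t len; };	/* in rt extents */

/* Step 1: check one mapping without modifying the state map. */
static bool
rt_mapping_check(const uint8_t *state, size_t nrext, const struct rt_map *m)
{
	size_t	i;

	if (m->len == 0 || m->start >= nrext || m->len > nrext - m->start)
		return false;
	for (i = m->start; i < m->start + m->len; i++)
		if (state[i] != RT_FREE)
			return false;
	return true;
}

/* Step 2: record one mapping in the state map. */
static void
rt_mapping_commit(uint8_t *state, const struct rt_map *m)
{
	size_t	i;

	for (i = m->start; i < m->start + m->len; i++)
		state[i] = RT_INUSE;
}

/*
 * Check every mapping of a fork first, and only update the state if the
 * whole fork is kept.  Returns false if the fork should be zapped.
 * (Overlaps between mappings of the same fork are not detected here.)
 */
static bool
rt_fork_process(uint8_t *state, size_t nrext, const struct rt_map *maps,
		size_t nmaps)
{
	size_t	i;

	for (i = 0; i < nmaps; i++)
		if (!rt_mapping_check(state, nrext, &maps[i]))
			return false;
	for (i = 0; i < nmaps; i++)
		rt_mapping_commit(state, &maps[i]);
	return true;
}
```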

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
10 months agoxfs_repair: check existing realtime rmapbt entries against observed rmaps
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>