Darrick J. Wong [Wed, 7 Aug 2024 22:54:51 +0000 (15:54 -0700)]
xfs: create a noalloc mode for allocation groups
Create a new noalloc state for the per-AG structure that will disable
block allocation in this AG. We accomplish this by subtracting from
fdblocks all the free blocks in this AG, hiding those blocks from the
allocator, and preventing freed blocks from updating fdblocks until
we're ready to lift noalloc mode.
Note that we reduce the free block count of the filesystem so that we
can prevent transactions from entering the allocator looking for "free"
space that we've turned off incore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs: fix integer overflow when validating extent size hints
Both file extent size hints are stored as 32-bit quantities, in units of
filesystem blocks. As part of validating the hints, we convert these
quantities to bytes to ensure that the hint is congruent with the file's
allocation size.
The maximum possible hint value is 2097151 (aka XFS_MAX_BMBT_EXTLEN).
If the file allocation unit is larger than 2048, the unit conversion
will exceed 32 bits in size, which overflows the uint32_t used to store
the value used in the comparison. This isn't a problem for files on the
data device since the hint will always be a multiple of the block size.
However, this is a problem for realtime files because the rtextent size
can be any integer number of fs blocks, and truncation of upper bits
changes the outcome of division.
Eliminate the overflow by performing the congruency check in units of
blocks, not bytes. Otherwise, we get errors like this:
$ truncate -s 500T /tmp/a
$ mkfs.xfs -f -N /tmp/a -d extszinherit=2097151,rtinherit=1 -r extsize=28k
illegal extent size hint 2097151, must be less than 2097151 and a multiple of 7.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 22:54:50 +0000 (15:54 -0700)]
xfs: enable extent size hints for CoW when rtextsize > 1
CoW extent size hints are not allowed on filesystems that have large
realtime extents because we only want to perform the minimum required
amount of write-around (aka write amplification) for shared extents.
On filesystems where rtextsize > 1, allocations can only be done in
units of full rt extents, which means that we can only map an entire rt
extent's worth of blocks into the data fork. Hole punch requests become
conversions to unwritten if the request isn't aligned properly.
Because a copy-write fundamentally requires remapping, this means that
we also can only do copy-writes of a full rt extent. This is too
expensive for large hint sizes, since it's all or nothing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:49 +0000 (14:19 -0700)]
mkfs: validate CoW extent size hint when rtinherit is set
Extent size hints exist to nudge the behavior of the file data block
allocator towards trying to make aligned allocations. Therefore, it
doesn't make sense to allow a hint that isn't a multiple of the
fundamental allocation unit for a given file.
This means that if the sysadmin is formatting with rtinherit set on the
root dir, validate_cowextsize_hint needs to check the hint value on a
simulated realtime file to make sure that it's correct. This hasn't
been necessary in the past since one cannot have a CoW hint without a
reflink filesystem, and we previously didn't allow rt reflink
filesystems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:26 +0000 (14:19 -0700)]
xfs_repair: allow sysadmins to add realtime reflink
Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the realtime reference count btree, and therefore reflink on
realtime volumes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: validate CoW extent size hint on rtinherit directories
XFS allows a sysadmin to change the rt extent size when adding a rt
section to a filesystem after formatting. If there are any directories
with both a cowextsize hint and rtinherit set, the hint could become
misaligned with the new rextsize. Offer to fix the problem if we're in
modify mode and the verifier didn't trip. If we're in dry run mode,
we let the kernel fix it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:25 +0000 (14:19 -0700)]
xfs_repair: allow realtime files to have the reflink flag set
Now that we allow reflink on the realtime volume, allow that combination
of inode flags if the feature's enabled. Note that we now allow inodes
to have rtinherit even if there's no realtime volume, since the kernel
has never restricted that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: compute refcount data for the realtime groups
At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:24 +0000 (14:19 -0700)]
xfs_repair: find and mark the rtrefcountbt inode
Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_repair: use realtime refcount btree data to check block types
Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:23 +0000 (14:19 -0700)]
xfs_repair: allow CoW staging extents in the realtime rmap records
Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports. As far as reporting
leftover staging extents, we'll report them when we scan the rt refcount
btree, in a future patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:20 +0000 (14:19 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files. This apparently was done to disable
delayed allocation for realtime files.
However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files. In this case, the
logic will incorrectly send the write through the delalloc write path.
Fix this by adjusting the logic slightly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:18 +0000 (14:19 -0700)]
xfs: wire up a new inode fork type for the realtime refcount
Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:10:50 +0000 (14:10 -0700)]
xfs: add realtime refcount btree inode to metadata directory
Add a metadir path to select the realtime refcount btree inode and load
it at mount time. The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: add a realtime flag to the refcount update log redo items
Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt. We'll
wire up the actual refcount code later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: prepare refcount functions to deal with rtrefcountbt
Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions. Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.
Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert it all at once.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:17 +0000 (14:19 -0700)]
xfs: add realtime refcount btree operations
Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are neither constrained
to the free space nor any particular AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:16 +0000 (14:19 -0700)]
xfs: define the on-disk realtime refcount btree format
Start filling out the rtrefcount btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate refcount btree blocks. This prepares the way for connecting
the btree operations implementation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Add new realtime refcount btree definitions. The realtime refcount btree
will be rooted from a hidden inode, but has its own shape and therefore
needs to have most of its own separate types.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: allow sysadmins to add realtime reverse mapping indexes
Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index for realtime volumes. This
is needed for online fsck.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:14 +0000 (14:19 -0700)]
xfs_repair: reserve per-AG space while rebuilding rt metadata
Realtime metadata btrees can consume quite a bit of space on a full
filesystem. Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata. This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check for global free space concerns with default btree slack levels
It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full. If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space. The
solution to this is to pack the new blocks completely full if we fear
running out of space.
Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.
Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: always check realtime file mappings against incore info
Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3). As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.
Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state. The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork. We already
do this for regular data files.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:12 +0000 (14:19 -0700)]
xfs_repair: find and mark the rtrmapbt inodes
Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_repair: use realtime rmap btree data to check block types
Use the realtime rmap btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_repair: flag suspect long-format btree blocks
Pass a "suspect" counter through scan_lbtree just like we do for
short-format btree blocks, and increment its value when we encounter
blocks with bad CRCs or outright corruption. This makes it so that
repair actually catches bmbt blocks with bad crcs or other verifier
errors.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 22:54:35 +0000 (15:54 -0700)]
xfs_db: don't abort when bmapping on a non-extents/bmbt fork
We're going to introduce new fork formats, so let's fix the problem that
xfs_db's bmap command aborts when the fork format isn't one of the
existing ones.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Hook the regular realtime rmap code when an rtrmapbt repair operation is
running so that we can unlock the AGF buffer to scan the filesystem and
keep the in-memory btree up to date during the scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
When a writer thread executes a chain of log intent items for the
realtime volume, the ILOCKs taken during each step are for each rt
metadata file, not the entire rt volume itself. Although scrub takes
all rt metadata ILOCKs, this isn't sufficient to guard against scrub
checking the rt volume while that writer thread is in the middle of
finishing a chain because there's no higher level locking primitive
guarding the realtime volume.
When there's a collision, cross-referencing between data structures
(e.g. rtrmapbt and rtrefcountbt) yields false corruption events; if
repair is running, this results in incorrect repairs, which is
catastrophic.
Fix this by adding to the mount structure the same drain that we use to
protect scrub against concurrent AG updates, but this time for the
realtime volume.
[Contains a few cleanups from hch]
Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Connect the map and unmap reverse-mapping operations to the realtime
rmapbt via the deferred operation callbacks. This enables us to
perform rmap operations against the correct btree.
[Contains a minor bugfix from hch]
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:06 +0000 (14:19 -0700)]
xfs: wire up a new inode fork type for the realtime rmap
Plumb in the pieces we need to embed the root of the realtime rmap
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 20:57:50 +0000 (13:57 -0700)]
xfs: add realtime reverse map inode to metadata directory
Add a metadir path to select the realtime rmap btree inode and load
it at mount time. The rtrmapbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:05 +0000 (14:19 -0700)]
xfs: add a realtime flag to the rmap update log redo items
Extend the rmap update (RUI) log items with a new realtime flag that
indicates that the updates apply against the realtime rmapbt. We'll
wire up the actual rmap code later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:05 +0000 (14:19 -0700)]
xfs: prepare rmap functions to deal with rtrmapbt
Prepare the high-level rmap functions to deal with the new realtime
rmapbt and its slightly different conventions. Provide the ability
to talk to either rmapbt or rtrmapbt formats from the same high
level code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:05 +0000 (14:19 -0700)]
xfs: add realtime rmap btree operations
Implement the generic btree operations needed to manipulate rtrmap
btree blocks. This is different from the regular rmapbt in that we
allocate space from the filesystem at large, and are neither
constrained to the free space nor any particular AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrmapbt to add the record and a second
split in the regular rmapbt to record the rtrmapbt split.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Start filling out the rtrmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate rmap btree blocks. This prepares the way for connecting the
btree operations implementation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:04 +0000 (14:19 -0700)]
xfs: introduce realtime rmap btree definitions
Add new realtime rmap btree definitions. The realtime rmap btree will
be rooted from a hidden inode, but has its own shape and therefore
needs to have most of its own separate types.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>