Darrick J. Wong [Mon, 12 Aug 2024 21:19:13 +0000 (14:19 -0700)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:12 +0000 (14:19 -0700)]
xfs_repair: find and mark the rtrmapbt inodes
Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_repair: use realtime rmap btree data to check block types
Use the realtime rmap btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:11 +0000 (14:19 -0700)]
xfs_repair: flag suspect long-format btree blocks
Pass a "suspect" counter through scan_lbtree just like we do for
short-format btree blocks, and increment its value when we encounter
blocks with bad CRCs or outright corruption. This makes it so that
repair actually catches bmbt blocks with bad crcs or other verifier
errors.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 22:54:35 +0000 (15:54 -0700)]
xfs_db: don't abort when bmapping on a non-extents/bmbt fork
We're going to introduce new fork formats, so let's fix the problem that
xfs_db's bmap command aborts when the fork format isn't one of the
existing ones.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Hook the regular realtime rmap code when an rtrmapbt repair operation is
running so that we can unlock the AGF buffer to scan the filesystem and
keep the in-memory btree up to date during the scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
When a writer thread executes a chain of log intent items for the
realtime volume, the ILOCKs taken during each step are for each rt
metadata file, not the entire rt volume itself. Although scrub takes
all rt metadata ILOCKs, this isn't sufficient to guard against scrub
checking the rt volume while that writer thread is in the middle of
finishing a chain because there's no higher level locking primitive
guarding the realtime volume.
When there's a collision, cross-referencing between data structures
(e.g. rtrmapbt and rtrefcountbt) yields false corruption events; if
repair is running, this results in incorrect repairs, which is
catastrophic.
Fix this by adding to the mount structure the same drain that we use to
protect scrub against concurrent AG updates, but this time for the
realtime volume.
[Contains a few cleanups from hch]
Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Connect the map and unmap reverse-mapping operations to the realtime
rmapbt via the deferred operation callbacks. This enables us to
perform rmap operations against the correct btree.
[Contains a minor bugfix from hch]
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:06 +0000 (14:19 -0700)]
xfs: wire up a new inode fork type for the realtime rmap
Plumb in the pieces we need to embed the root of the realtime rmap
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 20:57:50 +0000 (13:57 -0700)]
xfs: add realtime reverse map inode to metadata directory
Add a metadir path to select the realtime rmap btree inode and load
it at mount time. The rtrmapbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:05 +0000 (14:19 -0700)]
xfs: add a realtime flag to the rmap update log redo items
Extend the rmap update (RUI) log items with a new realtime flag that
indicates that the updates apply against the realtime rmapbt. We'll
wire up the actual rmap code later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:05 +0000 (14:19 -0700)]
xfs: prepare rmap functions to deal with rtrmapbt
Prepare the high-level rmap functions to deal with the new realtime
rmapbt and its slightly different conventions. Provide the ability
to talk to either rmapbt or rtrmapbt formats from the same high
level code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:05 +0000 (14:19 -0700)]
xfs: add realtime rmap btree operations
Implement the generic btree operations needed to manipulate rtrmap
btree blocks. This is different from the regular rmapbt in that we
allocate space from the filesystem at large, and are neither
constrained to the free space nor any particular AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrmapbt to add the record and a second
split in the regular rmapbt to record the rtrmapbt split.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Start filling out the rtrmap btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate rmap btree blocks. This prepares the way for connecting the
btree operations implementation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:04 +0000 (14:19 -0700)]
xfs: introduce realtime rmap btree definitions
Add new realtime rmap btree definitions. The realtime rmap btree will
be rooted from a hidden inode, but has its own shape and therefore
needs to have most of its own separate types.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:03 +0000 (14:19 -0700)]
xfs: simplify the xfs_rmap_{alloc,free}_extent calling conventions
Simplify the calling conventions by allowing callers to pass a fsbno
(xfs_fsblock_t) directly into these functions, since we're just going to
set it in a struct anyway.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:02 +0000 (14:19 -0700)]
xfs: update btree keys correctly when _insrec splits an inode root block
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Fixes: 2c813ad66a72 ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:02 +0000 (14:19 -0700)]
xfs: support storing records in the inode core root
Add the necessary flags and code so that we can support storing leaf
records in the inode root block of a btree. This hasn't been necessary
before, but the realtime rmapbt will need to be able to do this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:02 +0000 (14:19 -0700)]
xfs: hoist the node iroot update code out of xfs_btree_kill_iroot
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node child into the root block
to a separate function. Remove some unnecessary conditionals and clean
up a few function calls in the new function. Note that this change
reorders the ->free_block call with respect to the change in bc_nlevels
to make it easier to support inode root leaf blocks in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:02 +0000 (14:19 -0700)]
xfs: hoist the node iroot update code out of xfs_btree_new_iroot
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node root into a child block
to a separate function. Note that the new function explicitly computes
the keys of the new child block and stores that in the root block; while
the bmap btree could rely on leaving the key alone, realtime rmap needs
to set the new high key.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:01 +0000 (14:19 -0700)]
xfs: generalize the btree root reallocation function
In preparation for storing realtime rmap btree roots in an inode fork,
make xfs_iroot_realloc take an ops structure that takes care of all the
btree-specific geometry pieces.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:01 +0000 (14:19 -0700)]
xfs: standardize the btree maxrecs function parameters
Standardize the parameters in xfs_{alloc,bm,ino,rmap,refcount}bt_maxrecs
so that we have consistent calling conventions. This doesn't affect the
kernel that much, but enables us to clean up userspace a bit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:01 +0000 (14:19 -0700)]
xfs: move the zero records logic into xfs_bmap_broot_space_calc
The bmap btree cannot ever have zero records in an incore btree block.
If the number of records drops to zero, that means we're converting the
fork to extents format and are trying to remove the tree. This logic
won't hold for the future realtime rmap btree, so move the logic into
the bmbt code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:00 +0000 (14:19 -0700)]
xfs: hoist the code that moves the incore inode fork broot memory
Whenever we change the size of the memory buffer holding an inode fork
btree root block, we have to copy the contents over. Refactor all this
into a single function that handles both, in preparation for making
xfs_iroot_realloc more generic.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:00 +0000 (14:19 -0700)]
xfs: fix a sloppy memory handling bug in xfs_iroot_realloc
While refactoring code, I noticed that when xfs_iroot_realloc tries to
shrink a bmbt root block, it allocates a smaller new block and then
copies "records" and pointers to the new block. However, bmbt root
blocks cannot ever be leaves, which means that it's not technically
correct to copy records. We /should/ be copying keys.
Note that this has never resulted in actual memory corruption because
sizeof(bmbt_rec) == (sizeof(bmbt_key) + sizeof(bmbt_ptr)). However,
this will no longer be true when we start adding realtime rmap stuff,
so fix this now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:00 +0000 (14:19 -0700)]
xfs: refactor creation of bmap btree roots
Now that we've created inode fork helpers to allocate and free btree
roots, create a new bmap btree helper to create a new bmbt root, and
refactor the extents <-> btree conversion functions to use our new
helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:19:00 +0000 (14:19 -0700)]
xfs: refactor the allocation and freeing of incore inode fork btree roots
Refactor the code that allocates and freese the incore inode fork btree
roots. This will help us disentangle some of the weird logic when we're
creating and tearing down inode-based btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 12 Aug 2024 21:18:59 +0000 (14:18 -0700)]
xfs: replace shouty XFS_BM{BT,DR} macros
Replace all the shouty bmap btree and bmap disk root macros with actual
functions, and fix a type handling error in the xattr code that the
macros previously didn't care about.
Darrick J. Wong [Wed, 7 Aug 2024 23:04:46 +0000 (16:04 -0700)]
mkfs: add headers to realtime bitmap blocks
When the rtgroups feature is enabled, format rtbitmap blocks with the
appropriate block headers. libxfs takes care of the actual writing for
us, so all we have to do is ensure that the bitmap is the correct size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 23:04:45 +0000 (16:04 -0700)]
xfs_scrub: trim realtime volumes too
On the kernel side, the XFS realtime groups patchset added support for
FITRIM of the realtime volume. This support doesn't actually require
there to be any realtime groups, so teach scrub to run through the whole
region.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Run the rtgroup metapath scrubber during phase 5 to ensure that any
rtgroup metadata files are still connected to the metadir tree after
we've pruned any bad links.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 23:04:45 +0000 (16:04 -0700)]
xfs_scrub: scrub realtime allocation group metadata
Scan realtime group metadata as part of phase 2, just like we do for AG
metadata. For pre-rtgroup filesystems, pretend that this is a "rtgroup
0" scrub request because the kernel expects that. Replace the old
cond_wait code with a scrub barrier because they're equivalent for two
items that cannot be scrubbed in parallel.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 23:04:42 +0000 (16:04 -0700)]
xfs_db: metadump realtime devices
Teach the metadump device to dump the filesystem metadata of a realtime
device to the metadump file. Currently, this is limited to the realtime
superblock.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 7 Aug 2024 23:04:42 +0000 (16:04 -0700)]
xfs_db: metadump metadir rt bitmap and summary files
Don't skip dumping the data fork for regular files that are marked as
metadata inodes. This catches rtbitmap and summary inodes on rtgroup
enabled file systems where their inode numbers aren't recorded in the
superblock.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 23:04:41 +0000 (16:04 -0700)]
xfs_db: support dumping realtime superblocks
Allow debugging of realtime superblocks, and add the relevant fields in
the fs superblock that point us at the existence and location of the rt
supers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Tue, 13 Aug 2024 19:27:56 +0000 (12:27 -0700)]
xfs_repair: find and clobber rtgroup bitmap and summary files
On a rtgroups filesystem, if the rtgroups bitmap or summary files are
garbage, we need to clear the dinode and update the incore bitmap so
that we don't bother to check the old rt freespace metadata.
However, we regenerate the entire rt metadata directory tree during
phase 6. If the bitmap and summary files are ok, we still want to clear
the dinode, but we can still use the incore inode to check the old
freespace contents. Split the clear_dinode function into two pieces,
one that merely zeroes the inode, and the old clear_dinode now turns off
checking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 7 Aug 2024 23:04:40 +0000 (16:04 -0700)]
xfs_repair: support realtime groups
Make repair aware of multiple rtgroups. This now uses the same code as the
AG-based data device for block usage tracking instead of the less optimal
AVL trees and bitmaps used for the traditonal RT device.
Note this is still a bit hacky at the moment by just going beyond the AG
arrays and not fully supporting the unknown state for RT allocation yet.
All this should be fixable.
Large parts of the code are based on patches from Darrick J. Wong.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Improve the reporting of discrepancies in the realtime bitmap and
summary files by creating a separate helper function that will pinpoint
the exact (word) locations of mismatches. This will help developers to
diagnose problems with the rtgroups feature and users to figure out
exactly what's bad in a filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 23:04:39 +0000 (16:04 -0700)]
libxfs: implement some sanity checking for enormous rgcount
Similar to what we do for suspiciously large sb_agcount values, if
someone tries to get libxfs to load a filesystem with a very large
realtime group count, let's do some basic checks of the rt device to
see if it's really that large. If the read fails, only load the first
rtgroup and warn the user.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 22:54:37 +0000 (15:54 -0700)]
libxfs: dirty buffers should be marked uptodate too
I started fuzz-testing the realtime rmap feature with a very large
number of realtime allocation groups. There were so many rt groups that
repair had to rebuild /realtime in the metadata directory tree, and that
directory was big enough to spur the creation of a block format
directory.
Unfortunately, repair then walks both directory trees to look for
unconnceted files. This part of phase 6 emits CRC errors on the newly
created buffers for the /realtime directory, declares the directory to
be garbage, and moves all the rt rmap inodes to /lost+found, resulting
in a corrupt fs.
Poking around in gdb, I noticed that the buffer contents were indeed
zero, and that UPTODATE was not set. This was very strange, until I
added a watch on bp->b_flags to watch for accesses. It turns out that
xfs_repair's prefetch code will _get a buffer and zero the contents if
UPTODATE is not set.
The directory tree code in libxfs will also _get a buffer, initialize
it, and log it to the coordinating transaction, which in this case is
the transactions used to reconnect the rmap btree inodes to /realtime.
At no point does any of that code ever set UPTODATE on the buffer, which
is why prefetch zaps the contents.
Hence change both buffer dirtying functions to set UPTODATE, since a
dirty buffer is by definition at least as recent as whatever's on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make the allocator rtgroup aware by either picking a specific group if
there is a hint, or loop over all groups otherwise. A simple rotor is
provided to pick the placement for initial allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>