Darrick J. Wong [Wed, 3 Jul 2024 21:22:15 +0000 (14:22 -0700)]
xfs: update btree keys correctly when _insrec splits an inode root block
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Fixes: 2c813ad66a72 ("xfs: support btrees with overlapping intervals for keys") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:11:32 +0000 (21:11 -0700)]
xfs: support storing records in the inode core root
Add the necessary flags and code so that we can support storing leaf
records in the inode root block of a btree. This hasn't been necessary
before, but the realtime rmapbt will need to be able to do this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:14 +0000 (14:22 -0700)]
xfs: hoist the node iroot update code out of xfs_btree_kill_iroot
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node child into the root block
to a separate function. Remove some unnecessary conditionals and clean
up a few function calls in the new function. Note that this change
reorders the ->free_block call with respect to the change in bc_nlevels
to make it easier to support inode root leaf blocks in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:14 +0000 (14:22 -0700)]
xfs: hoist the node iroot update code out of xfs_btree_new_iroot
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node root into a child block
to a separate function. Note that the new function explicitly computes
the keys of the new child block and stores that in the root block; while
the bmap btree could rely on leaving the key alone, realtime rmap needs
to set the new high key.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:13 +0000 (14:22 -0700)]
xfs: generalize the btree root reallocation function
In preparation for storing realtime rmap btree roots in an inode fork,
make xfs_iroot_realloc take an ops structure that takes care of all the
btree-specific geometry pieces.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:13 +0000 (14:22 -0700)]
xfs: standardize the btree maxrecs function parameters
Standardize the parameters in xfs_{alloc,bm,ino,rmap,refcount}bt_maxrecs
so that we have consistent calling conventions. This doesn't affect the
kernel that much, but enables us to clean up userspace a bit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:13 +0000 (14:22 -0700)]
xfs: move the zero records logic into xfs_bmap_broot_space_calc
The bmap btree cannot ever have zero records in an incore btree block.
If the number of records drops to zero, that means we're converting the
fork to extents format and are trying to remove the tree. This logic
won't hold for the future realtime rmap btree, so move the logic into
the bmbt code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:13 +0000 (14:22 -0700)]
xfs: hoist the code that moves the incore inode fork broot memory
Whenever we change the size of the memory buffer holding an inode fork
btree root block, we have to copy the contents over. Refactor all this
into a single function that handles both, in preparation for making
xfs_iroot_realloc more generic.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:12 +0000 (14:22 -0700)]
xfs: fix a sloppy memory handling bug in xfs_iroot_realloc
While refactoring code, I noticed that when xfs_iroot_realloc tries to
shrink a bmbt root block, it allocates a smaller new block and then
copies "records" and pointers to the new block. However, bmbt root
blocks cannot ever be leaves, which means that it's not technically
correct to copy records. We /should/ be copying keys.
Note that this has never resulted in actual memory corruption because
sizeof(bmbt_rec) == (sizeof(bmbt_key) + sizeof(bmbt_ptr)). However,
this will no longer be true when we start adding realtime rmap stuff,
so fix this now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:12 +0000 (14:22 -0700)]
xfs: refactor creation of bmap btree roots
Now that we've created inode fork helpers to allocate and free btree
roots, create a new bmap btree helper to create a new bmbt root, and
refactor the extents <-> btree conversion functions to use our new
helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:12 +0000 (14:22 -0700)]
xfs: refactor the allocation and freeing of incore inode fork btree roots
Refactor the code that allocates and freese the incore inode fork btree
roots. This will help us disentangle some of the weird logic when we're
creating and tearing down inode-based btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:12 +0000 (14:22 -0700)]
xfs: replace shouty XFS_BM{BT,DR} macros
Replace all the shouty bmap btree and bmap disk root macros with actual
functions, and fix a type handling error in the xattr code that the
macros previously didn't care about.
Darrick J. Wong [Wed, 3 Jul 2024 21:22:11 +0000 (14:22 -0700)]
mkfs: add headers to realtime bitmap blocks
When the rtgroups feature is enabled, format rtbitmap blocks with the
appropriate block headers. libxfs takes care of the actual writing for
us, so all we have to do is ensure that the bitmap is the correct size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:11 +0000 (14:22 -0700)]
xfs_scrub: trim realtime volumes too
On the kernel side, the XFS realtime groups patchset added support for
FITRIM of the realtime volume. This support doesn't actually require
there to be any realtime groups, so teach scrub to run through the whole
region.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:08 +0000 (14:22 -0700)]
xfs_db: metadump realtime devices
Teach the metadump device to dump the filesystem metadata of a realtime
device to the metadump file. Currently, this is limited to the realtime
superblock.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Sat, 3 Aug 2024 08:28:42 +0000 (10:28 +0200)]
xfs_db: metadump metadir rt bitmap and summary files
Don't skip dumping the data fork for regular files that are marked as
metadata inodes. This catches rtbitmap and summary inodes on rtgroup
enabled file systems where their inode numbers aren't recorded in the
superblock.
Darrick J. Wong [Wed, 3 Jul 2024 21:22:06 +0000 (14:22 -0700)]
xfs_db: support dumping realtime superblocks
Allow debugging of realtime superblocks, and add the relevant fields in
the fs superblock that point us at the existence and location of the rt
supers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 5 Aug 2024 18:19:36 +0000 (11:19 -0700)]
xfs_repair: support realtime groups
Make repair aware of multiple rtgroups. This now uses the same code as the
AG-based data device for block usage tracking instead of the less optimal
AVL trees and bitmaps used for the traditonal RT device.
Note this is still a bit hacky at the moment by just going beyond the AG
arrays and not fully supporting the unknown state for RT allocation yet.
All this should be fixable.
Large parts of the code are based on patches from Darrick J. Wong.
Improve the reporting of discrepancies in the realtime bitmap and
summary files by creating a separate helper function that will pinpoint
the exact (word) locations of mismatches. This will help developers to
diagnose problems with the rtgroups feature and users to figure out
exactly what's bad in a filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:05 +0000 (14:22 -0700)]
libxfs: implement some sanity checking for enormous rgcount
Similar to what we do for suspiciously large sb_agcount values, if
someone tries to get libxfs to load a filesystem with a very large
realtime group count, let's do some basic checks of the rt device to
see if it's really that large. If the read fails, only load the first
rtgroup and warn the user.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make the allocator rtgroup aware by either picking a specific group if
there is a hint, or loop over all groups otherwise. A simple rotor is
provided to pick the placement for initial allocations.
When rmap is enabled, XFS expects a certain order of operations, which
is: 1) remove the file mapping, 2) remove the reverse mapping, and then
3) free the blocks. When reflink is enabled, XFS replaces (3) with a
deferred refcount decrement operation that can schedule freeing the
blocks if that was the last refcount.
For realtime files, xfs_bmap_del_extent_real tries to do 1 and 3 in the
same transaction, which will break both rmap and reflink unless we
switch it to use realtime EFIs. Both rmap and reflink depend on the
rtgroups feature, so let's turn on EFIs for all rtgroups filesystems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
A handful of fstests expect to be able to test what happens when extent
free intents fail to actually free the extent. Now that we're
supporting EFIs for realtime extents, add to xfs_rtfree_extent the same
injection point that exists in the regular extent freeing code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Teach the EFI mechanism how to free realtime extents. We're going to
need this to enforce proper ordering of operations when we enable
realtime rmap.
Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
ops for rt extents. This keeps the ondisk artifacts and processing code
completely separate between the rt and non-rt cases. Hopefully this
will make it easier to debug filesystem problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Make the bmap intent items take an active reference to the rtgroup
containing the space that is being mapped or unmapped. We will need
this functionality once we start enabling rmap and reflink on the rt
volume.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, the ondisk realtime summary file counters are accessed in
units of 32-bit words. There's no endian translation of the contents of
this file, which means that the Bad Things Happen(tm) if you go from
(say) x86 to powerpc. Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file. Encode the summary
information in big endian format, like most of the rest of the
filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Currently, the ondisk realtime bitmap file is accessed in units of
32-bit words. There's no endian translation of the contents of this
file, which means that the Bad Things Happen(tm) if you go from (say)
x86 to powerpc. Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Make the free rt extent count a part of the lazy sb counters when the
realtime groups feature is enabled. This is possible because the patch
to recompute frextents from the rtbitmap during log recovery predates
the code adding rtgroup support, hence we know that the value will
always be correct during runtime.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Except for the rt superblock, realtime groups do not store any metadata
at the start (or end) of the group. There is nothing to prevent the
bmap code from merging allocations from multiple groups into a single
bmap record. Add a helper to check for this case.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: massage the commit message after pulling this into rtgroups] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Every time we update parts of the primary filesystem superblock that are
echoed in the rt superblock, we must update the rt super. Avoid
changing the log to support logging to the rt device by using ordered
buffers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Define the ondisk format of realtime group metadata, and a superblock
for realtime volumes. rt supers are protected by a separate rocompat
bit so that we can leave them off if the rt device is zoned.
Add a xfs_sb_version_hasrtgroups so that xfs_repair knows how to zero
the tail of superblocks.
For rt group enabled file systems there is a separate bitmap and summary
file for each group and thus the number of bitmap and summary blocks
needs to be calculated differently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t
type encode the RT extent number relative to the rtgroup. The biggest
part of this to clearly distinguish between the relative extent number
that gets masked when converting from a global block number and length
values that just have a factor applied to them when converting from
file system blocks.
Make xfs_rtsummary_blockcount take all the required information from
the mount structure and return the number of summary levels from it
as well. This cleans up many of the callers and prepares for making the
rtsummary files per-rtgroup where they need to look at different value.
This means we recalculate some values in some callers, but as all these
calculations are outside the fast path and cheap that seems like a price
worth paying.
Rename the existing xfs_rtbitmap_blockcount to
xfs_rtbitmap_blockcount_len and add a new xfs_rtbitmap_blockcount wrapper
around it that takes the number of extents from the mount structure.
This will simplify the move to per-rtgroup bitmaps as those will need to
pass in the number of extents per rtgroup instead.
Move the pointers to the RT bitmap and summary inodes as well as the
summary cache to the rtgroups structure to prepare for having a
separate bitmap and summary inodes for each rtgroup.
Code using the inodes now needs to operate on a rtgroup. Where easily
possible such code is converted to iterate over all rtgroups, else
rtgroup 0 (the only one that can currently exist) is hardcoded.
Add a dynamic lockdep class key for rtgroup inodes. This will enable
lockdep to deduce inconsistencies in the rtgroup metadata ILOCK locking
order. Each class can have 8 subclasses, and for now we will only have
2 inodes per group. This enables rtgroup order and inode order checks
when nesting ILOCKs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Define helper functions to lock all metadata inodes related to a
realtime group. There's not much to look at now, but this will become
important when we add per-rtgroup metadata files and online fsck code
for them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Create an incore object that will contain information about a realtime
allocation group. This will eventually enable us to shard the realtime
section in a similar manner to how we shard the data section, but for
now just a single object for the entire RT subvolume is created.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
Track the RT summary file size in blocks, just like the RT bitmap
file. While we have users of both units, blocks are used slightly
more often and this matches the bitmap file for consistency.
xfs_rtbitmap_wordcount and xfs_rtsummary_wordcount are currently unused,
so remove them to simplify refactoring other rtbitmap helpers. They
can be added back or simply open coded when actually needed.
Christoph Hellwig [Wed, 3 Jul 2024 21:21:59 +0000 (14:21 -0700)]
xfs: simplify xfs_rtalloc_query_range
There isn't much of a good reason to pass the xfs_rtalloc_rec structures
that describe extents to xfs_rtalloc_query_range as we really just want
a lower and upper bound xfs_rtxnum_t. Pass the rtxnum directly and
simply the interface.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>