Darrick J. Wong [Thu, 15 Aug 2024 18:56:36 +0000 (11:56 -0700)]
xfs: update btree keys correctly when _insrec splits an inode root block
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Fixes: 2c813ad66a72 ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:48:39 +0000 (11:48 -0700)]
xfs: support storing records in the inode core root
Add the necessary flags and code so that we can support storing leaf
records in the inode root block of a btree. This hasn't been necessary
before, but the realtime rmapbt will need to be able to do this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:48:38 +0000 (11:48 -0700)]
xfs: hoist the node iroot update code out of xfs_btree_kill_iroot
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node child into the root block
to a separate function. Remove some unnecessary conditionals and clean
up a few function calls in the new function. Note that this change
reorders the ->free_block call with respect to the change in bc_nlevels
to make it easier to support inode root leaf blocks in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:48:38 +0000 (11:48 -0700)]
xfs: hoist the node iroot update code out of xfs_btree_new_iroot
In preparation for allowing records in an inode btree root, hoist the
code that copies keyptrs from an existing node root into a child block
to a separate function. Note that the new function explicitly computes
the keys of the new child block and stores that in the root block; while
the bmap btree could rely on leaving the key alone, realtime rmap needs
to set the new high key.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Fri, 30 Aug 2024 17:47:34 +0000 (10:47 -0700)]
xfs: tidy up xfs_bmap_broot_realloc a bit
Hoist out the code that migrates broot pointers during a resize
operation to avoid code duplication and streamline the caller. Also
use the correct bmbt pointer type for the sizeof operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Fri, 30 Aug 2024 03:08:34 +0000 (20:08 -0700)]
xfs: make xfs_iroot_realloc take the new numrecs instead of deltas
Change the calling signature of xfs_iroot_realloc to take the ifork and
the new number of records in the btree block, not a diff against the
current number. This will make the callsites easier to understand.
Note that this function is misnamed because it is very specific to the
single type of inode-rooted btree supported. This will be addressed in
a subsequent patch.
Return the new btree root to reduce the amount of code clutter.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:48:30 +0000 (11:48 -0700)]
xfs: refactor the inode fork memory allocation functions
Hoist the code that allocates, frees, and reallocates if_broot into a
single xfs_iroot_krealloc function. Eventually we're going to push
xfs_iroot_realloc into the btree ops structure to handle multiple
inode-rooted btrees, but first let's separate out the bits that should
stay in xfs_inode_fork.c.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 7 Aug 2024 22:54:28 +0000 (15:54 -0700)]
xfs: enable metadata directory feature
Enable the metadata directory feature. With this feature, all metadata
inodes are placed in the metadata directory, and the only inumbers in
the superblock are the roots of the two directory trees.
The RT device is now sharded into a number of rtgroups, where 0 rtgroups
mean that no RT extents are supported, and the traditional XFS stub RT
bitmap and summary inodes don't exist. A single rtgroup gives roughly
identical behavior to the traditional RT setup, but now with checksummed
and self identifying free space metadata.
For quota, the quota options are read from the superblock unless
explicitly overridden via mount options.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 22 Aug 2024 16:43:53 +0000 (09:43 -0700)]
mkfs: add quota flags when setting up filesystem
If we're creating a metadir filesystem, the quota accounting and
enforcement flags persist until the sysadmin changes them. Add a means
to specify those qflags at format time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 22 Aug 2024 16:43:34 +0000 (09:43 -0700)]
xfs_repair: refactor quota inumber handling
In preparation for putting quota files in the metadata directory tree,
refactor repair's quota inumber handling to use its own variables
instead of the xfs_mount's.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 22 Aug 2024 16:00:00 +0000 (09:00 -0700)]
xfs: use metadir for quota inodes
Store the quota inodes in the /quota metadata directory if metadir is
enabled. This enables us to stop using the sb_[ugp]uotino fields in the
superblock. From this point on, all metadata files will be children of
the metadata directory tree root.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:32 +0000 (11:56 -0700)]
mkfs: add headers to realtime bitmap blocks
When the rtgroups feature is enabled, format rtbitmap blocks with the
appropriate block headers. libxfs takes care of the actual writing for
us, so all we have to do is ensure that the bitmap is the correct size.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:32 +0000 (11:56 -0700)]
xfs_scrub: trim realtime volumes too
On the kernel side, the XFS realtime groups patchset added support for
FITRIM of the realtime volume. This support doesn't actually require
there to be any realtime groups, so teach scrub to run through the whole
region.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Run the rtgroup metapath scrubber during phase 5 to ensure that any
rtgroup metadata files are still connected to the metadir tree after
we've pruned any bad links.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:31 +0000 (11:56 -0700)]
xfs_scrub: scrub realtime allocation group metadata
Scan realtime group metadata as part of phase 2, just like we do for AG
metadata. For pre-rtgroup filesystems, pretend that this is a "rtgroup
0" scrub request because the kernel expects that. Replace the old
cond_wait code with a scrub barrier because they're equivalent for two
items that cannot be scrubbed in parallel.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 26 Sep 2024 20:39:07 +0000 (13:39 -0700)]
xfs_db: report rt group and block number in the bmap command
The bmap command does not report startblocks for realtime files
correctly. If rtgroups are enabled, we need to use the appropriate
functions to crack the startblock into rtgroup and block numbers; if
not, then we need to report a linear address and not try to report a
group number.
Fix both of these issues.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:29 +0000 (11:56 -0700)]
xfs_db: metadump realtime devices
Teach the metadump device to dump the filesystem metadata of a realtime
device to the metadump file. Currently, this is limited to the realtime
superblock.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Thu, 15 Aug 2024 18:56:28 +0000 (11:56 -0700)]
xfs_db: metadump metadir rt bitmap and summary files
Don't skip dumping the data fork for regular files that are marked as
metadata inodes. This catches rtbitmap and summary inodes on rtgroup
enabled file systems where their inode numbers aren't recorded in the
superblock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:26 +0000 (11:56 -0700)]
xfs_repair: find and clobber rtgroup bitmap and summary files
On a rtgroups filesystem, if the rtgroups bitmap or summary files are
garbage, we need to clear the dinode and update the incore bitmap so
that we don't bother to check the old rt freespace metadata.
However, we regenerate the entire rt metadata directory tree during
phase 6. If the bitmap and summary files are ok, we still want to clear
the dinode, but we can still use the incore inode to check the old
freespace contents. Split the clear_dinode function into two pieces,
one that merely zeroes the inode, and the old clear_dinode now turns off
checking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Thu, 15 Aug 2024 18:56:26 +0000 (11:56 -0700)]
xfs_repair: support realtime groups
Make repair aware of multiple rtgroups. This now uses the same code as the
AG-based data device for block usage tracking instead of the less optimal
AVL trees and bitmaps used for the traditional RT device.
Note this is still a bit hacky at the moment by just going beyond the AG
arrays and not fully supporting the unknown state for RT allocation yet.
The next patch will clean this up.
All this should be fixable.
Large parts of the code are based on patches from Darrick J. Wong.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Thu, 15 Aug 2024 19:36:16 +0000 (12:36 -0700)]
xfs_repair: simplify rt_lock handling
No need to cacheline align rt_lock if we move it next to the data
it protects. Also reduce the critical section to just where those
data structures are accessed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Improve the reporting of discrepancies in the realtime bitmap and
summary files by creating a separate helper function that will pinpoint
the exact (word) locations of mismatches. This will help developers to
diagnose problems with the rtgroups feature and users to figure out
exactly what's bad in a filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:25 +0000 (11:56 -0700)]
libxfs: implement some sanity checking for enormous rgcount
Similar to what we do for suspiciously large sb_agcount values, if
someone tries to get libxfs to load a filesystem with a very large
realtime group count, let's do some basic checks of the rt device to
see if it's really that large. If the read fails, only load the first
rtgroup and warn the user.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 2 Oct 2024 15:52:07 +0000 (08:52 -0700)]
xfs: implement busy extent tracking for rtgroups
For rtgroups filesystems, track newly freed (rt) space through the log
until the rt EFIs have been committed to disk. This way we ensure that
space cannot be reused until all traces of the old owner are gone.
As a fringe benefit, we now support -o discard on the realtime device.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Tue, 8 Oct 2024 23:41:12 +0000 (16:41 -0700)]
xfs: move the min and max group block numbers to xfs_group
Move the min and max agblock numbers to the generic xfs_group structure
so that we can start building validators for extents within an rtgroup.
While we're at it, use check_add_overflow for the extent length
computation because that has much better overflow checking.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
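A minimal userspace sketch of the overflow-checked extent validation described above. The struct and function names are hypothetical, and the GCC/Clang builtin __builtin_add_overflow stands in for the kernel's check_add_overflow macro; this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group bounds, loosely modelled on the commit text. */
struct group_bounds {
	uint32_t min_block;	/* first block usable by non-static metadata */
	uint32_t blocks;	/* number of blocks in the group */
};

/* Return true if [start, start + len) lies entirely in the usable range. */
static bool extent_in_group(const struct group_bounds *gb,
			    uint32_t start, uint32_t len)
{
	uint32_t end;

	assert(gb->min_block <= gb->blocks);

	if (len == 0)
		return false;
	if (start < gb->min_block)
		return false;
	/* Compute the last block of the extent, rejecting wraparound. */
	if (__builtin_add_overflow(start, len - 1, &end))
		return false;
	return end < gb->blocks;
}
```

Without the overflow check, a huge len could wrap end around to a small value and pass the final comparison.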
Darrick J. Wong [Wed, 9 Oct 2024 00:43:56 +0000 (17:43 -0700)]
xfs: fix minor bug in xfs_verify_agbno
There's a minor bug in xfs_verify_agbno -- min_block ought to be the
first agblock number in the AG that can be used by non-static metadata.
Unfortunately, we set it to the last agblock of the static metadata.
Fortunately this works due to the <= check, but this isn't technically
correct.
Instead, change the check to < and set it to the next agblock past the
static metadata. This hasn't been an issue up to now, but we're going
to move these things into the generic group struct, and this will cause
problems with rtgroups, where min_block can be zero for an rtgroup.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
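To illustrate why the comparison matters, here is a small sketch with hypothetical helper names (not the kernel functions). The old form treats the bound as the last static metadata block and rejects it with <=; the new form treats min_block as the first usable block and rejects with <. The two agree for AGs, but only the new form can express a group with no static metadata, where min_block is zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old convention: last_static is the last block of static metadata. */
static bool verify_bno_old(uint32_t bno, uint32_t last_static, uint32_t blocks)
{
	assert(blocks > 0);
	if (bno >= blocks)
		return false;
	if (bno <= last_static)
		return false;
	return true;
}

/* New convention: min_block is the first block available for use. */
static bool verify_bno_new(uint32_t bno, uint32_t min_block, uint32_t blocks)
{
	assert(blocks > 0);
	if (bno >= blocks)
		return false;
	if (bno < min_block)
		return false;
	return true;
}
```

With static metadata in blocks 0 through 3, verify_bno_old(bno, 3, n) and verify_bno_new(bno, 4, n) give the same answers. For a group whose block 0 is usable, verify_bno_new(0, 0, n) accepts it, while the old check has no bound value that would.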
Christoph Hellwig [Thu, 3 Oct 2024 20:12:08 +0000 (13:12 -0700)]
xfs: move the group geometry into struct xfs_groups
Add/move the blocks, blklog and blkmask fields to the generic groups
structure so that code can work with AGs and RTGs by just using the
right index into the array.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 30 Sep 2024 20:49:00 +0000 (13:49 -0700)]
xfs: make xfs_rtblock_t a segmented address like xfs_fsblock_t
Now that we've finished adding allocation groups to the realtime volume,
let's make the file block mapping address (xfs_rtblock_t) a segmented
value just like we do on the data device. This means that group number
and block number conversions can be done with shifting and masking
instead of integer division.
While in theory we could continue caching the rgno shift value in
m_rgblklog, the fact that we now always use the shift value means that
we have an opportunity to increase the redundancy of the rt geometry by
storing it in the ondisk superblock and adding more sb verifier code.
Reuse the space vacated by sb_bad_feature2 to store the rgblklog value.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
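The segmented addressing idea can be sketched as follows. This is an illustration under assumptions, not the kernel implementation: struct rt_geom and the rt_* helpers are hypothetical names, and blklog plays the role that the commit assigns to the rgblklog value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: each group spans at most 2^blklog blocks. */
struct rt_geom {
	uint8_t  blklog;	/* log2 of the group size, rounded up */
	uint64_t blkmask;	/* (1 << blklog) - 1 */
};

static struct rt_geom rt_geom_init(uint8_t blklog)
{
	assert(blklog > 0 && blklog < 64);
	struct rt_geom g = { blklog, (UINT64_C(1) << blklog) - 1 };
	return g;
}

/* Pack (group number, block within group) into one segmented address. */
static uint64_t rt_pack(const struct rt_geom *g, uint32_t rgno, uint64_t rgbno)
{
	assert(rgbno <= g->blkmask);
	return ((uint64_t)rgno << g->blklog) | rgbno;
}

/* Group number: a shift instead of a 64-bit division. */
static uint32_t rt_group(const struct rt_geom *g, uint64_t rtbno)
{
	return (uint32_t)(rtbno >> g->blklog);
}

/* Block within the group: a mask instead of a 64-bit modulo. */
static uint64_t rt_block(const struct rt_geom *g, uint64_t rtbno)
{
	return rtbno & g->blkmask;
}
```

The trade-off is that addresses are no longer contiguous when the group size is not a power of two: the range between the end of one group and the next power-of-two boundary is simply unused.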
Darrick J. Wong [Mon, 30 Sep 2024 20:47:22 +0000 (13:47 -0700)]
xfs: create helpers to deal with rounding xfs_filblks_t to rtx boundaries
We're about to segment xfs_rtblock_t addresses, so we must create
type-specific helpers to do rt extent rounding of file mapping block
lengths because the rtb helpers soon will not do the right thing there.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 30 Sep 2024 20:43:12 +0000 (13:43 -0700)]
xfs: create helpers to deal with rounding xfs_fileoff_t to rtx boundaries
We're about to segment xfs_rtblock_t addresses, so we must create
type-specific helpers to do rt extent rounding of file block offsets
because the rtb helpers soon will not do the right thing there.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
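A rough sketch of what rounding a file block offset to rt extent boundaries looks like. The function names are hypothetical; rextsize is assumed to be the rt extent size in blocks and is not assumed to be a power of two, so the sketch uses modulo rather than masking. File offsets are linear values, which is why they can be rounded this way even once rt block addresses are segmented.

```c
#include <assert.h>
#include <stdint.h>

/* Round a file block offset down to the start of its rt extent. */
static uint64_t fileoff_rounddown_rtx(uint64_t off, uint32_t rextsize)
{
	assert(rextsize > 0);
	return off - (off % rextsize);
}

/*
 * Round a file block offset up to the next rt extent boundary.
 * Wraparound near UINT64_MAX is not handled in this sketch.
 */
static uint64_t fileoff_roundup_rtx(uint64_t off, uint32_t rextsize)
{
	assert(rextsize > 0);
	uint64_t rem = off % rextsize;

	return rem ? off + (rextsize - rem) : off;
}
```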
Darrick J. Wong [Mon, 23 Sep 2024 20:41:39 +0000 (13:41 -0700)]
xfs: mask off the rtbitmap and summary inodes when metadir in use
Set the rtbitmap and summary file inumbers to NULLFSINO in the
superblock and make sure they're zeroed whenever we write the superblock
to disk, to mimic mkfs behavior.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Mon, 23 Sep 2024 20:41:35 +0000 (13:41 -0700)]
xfs: make the RT allocator rtgroup aware
Make the allocator rtgroup aware by either picking a specific group if
there is a hint, or loop over all groups otherwise. A simple rotor is
provided to pick the placement for initial allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
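The group selection policy described above might look roughly like this. Everything here is a hypothetical simplification: the per-group free block array stands in for a real allocation attempt, NO_HINT and pick_group are invented names, and the choice to try only the hinted group (with no fallback) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NO_HINT UINT32_MAX	/* hypothetical "no preferred group" marker */

/*
 * Pick a group that can satisfy a request for 'want' blocks.
 * Returns the group number, or -1 if no candidate group fits.
 */
static int64_t pick_group(const uint64_t *free_blocks, uint32_t rgcount,
			  uint32_t hint, uint64_t want, uint32_t *rotor)
{
	assert(rgcount > 0);

	/* With a hint, only the hinted group is considered. */
	if (hint != NO_HINT) {
		uint32_t rgno = hint % rgcount;

		return free_blocks[rgno] >= want ? (int64_t)rgno : -1;
	}

	/* Otherwise start at the rotor and advance it for the next caller. */
	uint32_t start = *rotor % rgcount;

	*rotor = (start + 1) % rgcount;
	for (uint32_t i = 0; i < rgcount; i++) {
		uint32_t rgno = (start + i) % rgcount;

		if (free_blocks[rgno] >= want)
			return (int64_t)rgno;
	}
	return -1;
}
```

Advancing the rotor on every unhinted call spreads initial allocations across groups instead of filling the first group before touching the others.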
Darrick J. Wong [Mon, 23 Sep 2024 20:41:33 +0000 (13:41 -0700)]
xfs: use realtime EFI to free extents when rtgroups are enabled
When rmap is enabled, XFS expects a certain order of operations, which
is: 1) remove the file mapping, 2) remove the reverse mapping, and then
3) free the blocks. When reflink is enabled, XFS replaces (3) with a
deferred refcount decrement operation that can schedule freeing the
blocks if that was the last refcount.
For realtime files, xfs_bmap_del_extent_real tries to do 1 and 3 in the
same transaction, which will break both rmap and reflink unless we
switch it to use realtime EFIs. Both rmap and reflink depend on the
rtgroups feature, so let's turn on EFIs for all rtgroups filesystems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 23 Sep 2024 20:41:32 +0000 (13:41 -0700)]
xfs: support error injection when freeing rt extents
A handful of fstests expect to be able to test what happens when extent
free intents fail to actually free the extent. Now that we're
supporting EFIs for realtime extents, add to xfs_rtfree_extent the same
injection point that exists in the regular extent freeing code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 1 Oct 2024 23:02:10 +0000 (16:02 -0700)]
xfs: support logging EFIs for realtime extents
Teach the EFI mechanism how to free realtime extents. We're going to
need this to enforce proper ordering of operations when we enable
realtime rmap.
Declare a new log intent item type (XFS_LI_EFI_RT) and a separate defer
ops for rt extents. This keeps the ondisk artifacts and processing code
completely separate between the rt and non-rt cases. Hopefully this
will make it easier to debug filesystem problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 23 Sep 2024 20:41:28 +0000 (13:41 -0700)]
xfs: encode the rtsummary in big endian format
Currently, the ondisk realtime summary file counters are accessed in
units of 32-bit words. There's no endian translation of the contents of
this file, which means that Bad Things Happen(tm) if you go from
(say) x86 to powerpc. Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file. Encode the summary
information in big endian format, like most of the rest of the
filesystem.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
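As an illustration of what a fixed ondisk endianness buys, the sketch below reads and writes 32-bit counters in a byte buffer in big endian order regardless of the host CPU. The rtsum_set and rtsum_get names are hypothetical; the kernel uses its own be32 conversion helpers rather than open-coded shifts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store counter 'idx' in big endian byte order. */
static void rtsum_set(uint8_t *buf, size_t nwords, size_t idx, uint32_t v)
{
	assert(idx < nwords);
	uint8_t *p = buf + idx * 4;

	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Load counter 'idx', converting from big endian to host order. */
static uint32_t rtsum_get(const uint8_t *buf, size_t nwords, size_t idx)
{
	assert(idx < nwords);
	const uint8_t *p = buf + idx * 4;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Because the byte layout is defined by the code and not by the CPU, a buffer written on a little endian machine decodes to the same counters on a big endian one.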
Darrick J. Wong [Mon, 23 Sep 2024 20:41:27 +0000 (13:41 -0700)]
xfs: encode the rtbitmap in big endian format
Currently, the ondisk realtime bitmap file is accessed in units of
32-bit words. There's no endian translation of the contents of this
file, which means that Bad Things Happen(tm) if you go from (say)
x86 to powerpc. Since we have a new feature flag, let's take the
opportunity to enforce an endianness on the file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 23 Sep 2024 20:41:23 +0000 (13:41 -0700)]
xfs: add frextents to the lazysbcounters when rtgroups enabled
Make the free rt extent count a part of the lazy sb counters when the
realtime groups feature is enabled. This is possible because the patch
to recompute frextents from the rtbitmap during log recovery predates
the code adding rtgroup support, hence we know that the value will
always be correct during runtime.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Except for the rt superblock, realtime groups do not store any metadata
at the start (or end) of the group. There is nothing to prevent the
bmap code from merging allocations from multiple groups into a single
bmap record. Add a helper to check for this case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: massage the commit message after pulling this into rtgroups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Tue, 1 Oct 2024 23:14:51 +0000 (16:14 -0700)]
xfs: update realtime super every time we update the primary fs super
Every time we update parts of the primary filesystem superblock that are
echoed in the rt superblock, we must update the rt super. Avoid
changing the log to support logging to the rt device by using ordered
buffers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 30 Sep 2024 18:35:46 +0000 (11:35 -0700)]
xfs: define the format of rt groups
Define the ondisk format of realtime group metadata, and a superblock
for realtime volumes. rt supers are conditionally enabled by a
predicate function so that they can be disabled if we ever implement
zoned storage support for the realtime volume.
For rt group enabled file systems there is a separate bitmap and summary
file for each group and thus the number of bitmap and summary blocks
needs to be calculated differently.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 15 Aug 2024 18:56:17 +0000 (11:56 -0700)]
libxfs: use correct rtx count to block count conversion
Fix a place where we use the wrong conversion functions to convert
between a number of rt extents and a number of rt blocks. This isn't
really necessary since userspace cannot allocate rt extents, but let's
not leave a logic bomb.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 23 Sep 2024 20:41:17 +0000 (13:41 -0700)]
xfs: make RT extent numbers relative to the rtgroup
To prepare for adding per-rtgroup bitmap files, make the xfs_rtxnum_t
type encode the RT extent number relative to the rtgroup. The biggest
part of this is to clearly distinguish between the relative extent number
that gets masked when converting from a global block number and length
values that just have a factor applied to them when converting from
file system blocks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 23 Sep 2024 20:41:16 +0000 (13:41 -0700)]
xfs: refactor xfs_rtsummary_blockcount
Make xfs_rtsummary_blockcount take all the required information from
the mount structure and return the number of summary levels from it
as well. This cleans up many of the callers and prepares for making the
rtsummary files per-rtgroup where they need to look at different values.
This means we recalculate some values in some callers, but all these
calculations are outside the fast path and cheap, so this seems like a
price worth paying.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 23 Sep 2024 20:41:16 +0000 (13:41 -0700)]
xfs: refactor xfs_rtbitmap_blockcount
Rename the existing xfs_rtbitmap_blockcount to
xfs_rtbitmap_blockcount_len and add a new xfs_rtbitmap_blockcount wrapper
around it that takes the number of extents from the mount structure.
This will simplify the move to per-rtgroup bitmaps as those will need to
pass in the number of extents per rtgroup instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
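The shape of this refactoring can be sketched as a length-taking helper plus a thin wrapper. The names and the mount_geom structure below are hypothetical stand-ins, and the arithmetic assumes a plain bitmap with one bit per rt extent and no per-block headers, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the mount geometry. */
struct mount_geom {
	uint32_t blocksize;	/* filesystem block size in bytes */
	uint64_t rextents;	/* total number of rt extents */
};

/* Bitmap blocks needed to track 'rtextents' extents, one bit each. */
static uint64_t rtbitmap_blockcount_len(const struct mount_geom *mp,
					uint64_t rtextents)
{
	assert(mp->blocksize > 0);
	uint64_t bits_per_block = (uint64_t)mp->blocksize * 8;

	/* Round up without risking overflow in the addition. */
	return rtextents / bits_per_block +
	       (rtextents % bits_per_block != 0);
}

/* Wrapper: take the extent count from the mount structure. */
static uint64_t rtbitmap_blockcount(const struct mount_geom *mp)
{
	return rtbitmap_blockcount_len(mp, mp->rextents);
}
```

A per-rtgroup caller would call the _len variant directly with the number of extents in one group, while existing callers keep using the wrapper.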