Christoph Hellwig [Thu, 18 Jul 2024 13:22:29 +0000 (15:22 +0200)]
repair: move rt file block allocation to fill_rtino
Move the block allocation to fill_rtino to be next to writing to the
files. That also makes it clear that there is no need to zero the
blocks as we are writing data to all of them. If repair is interrupted
before it finishes writing the blocks, it will detect a mismatch on the
next run and regenerate the bitmaps anyway, and a random pattern in
the remaining blocks isn't any worse than zeros.
What really should happen is to write the data to a new file first and
link it in when done, but that's a totally separate project.
Christoph Hellwig [Wed, 17 Jul 2024 12:32:39 +0000 (14:32 +0200)]
repair: refactor generate_rtinfo
Move the allocation of the computed values into generate_rtinfo, and thus
make the variables holding them private in rt.c, and clean up a few
formatting nits.
Christoph Hellwig [Thu, 18 Jul 2024 07:08:24 +0000 (09:08 +0200)]
repair: use libxfs_trans_get_buf in fill_rtino
The buffer gets entirely rewritten, no need to read it from disk.
Also pass 0 instead of 1 for the flags, which doesn't change a thing
given that the flags are entirely ignored in userspace, but it looks
less weird now.
Christoph Hellwig [Mon, 15 Jul 2024 13:34:45 +0000 (15:34 +0200)]
repair: create a common helper to fill the bitmap and summary inodes
fill_rbmino and fill_rsumino are almost identical. Merge them into a
single common helper in rt.c, next to the code that computes the values
that are written to the files.
Christoph Hellwig [Thu, 18 Jul 2024 07:16:29 +0000 (09:16 +0200)]
mkfs: move more code into create_sb_metadata_file
Move more boilerplate for creating the two possible files into
create_sb_metadata_file, and turn the actual bitmap/summary specific code
into callbacks.
Christoph Hellwig [Sun, 14 Jul 2024 16:29:28 +0000 (18:29 +0200)]
mkfs: use xfs_rtfile_initialize_blocks
Use the new libxfs helper for initializing the rtbitmap/summary files
for rtgroup-enabled file systems. Also skip the zeroing of the blocks
for rtgroup file systems as we'll overwrite every block instantly.
Christoph Hellwig [Tue, 16 Jul 2024 12:35:56 +0000 (14:35 +0200)]
metadump: refactor inode processing
The code to dump inodes in metadump is rather convoluted because it tries
to determine the inode "type" early and pass it down, which also leads to
dumping the metadata inodes twice as we only determine their actual type
in the second pass and not as part of the AGI scan.
Switch to passing down the mount and dinode as far as possible, and then
use a helper to determine the type only when it actually is needed. With
that we can simply dump the data for the metadata inodes as part of the
normal AGI scan and remove the second pass.
Christoph Hellwig [Mon, 15 Jul 2024 15:30:30 +0000 (17:30 +0200)]
repair: btree blocks are never on the RT subvolume
scan_bmapbt tries to track btree blocks in the RT duplicate extent
AVL tree if the inode has the realtime flag set. Given that the
RT subvolume is only ever used for file data this is incorrect.
Christoph Hellwig [Sat, 13 Jul 2024 07:53:08 +0000 (09:53 +0200)]
db: refactor per-RTG inode handling
Simplify the helpers dealing with the per-RTG inodes by making them
work on arrays indexed by the XFS_RTG_ types instead of duplicating
the logic for the rmap vs refcount inodes.
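
A minimal sketch of the "array indexed by type" idea, not the actual xfs_db
code. All names here (toy_rtg_inode, toy_rtgroup, toy_rtg_find_type) are
made up for illustration; the real index is the XFS_RTG_ enum mentioned
above. It shows how one loop over the array can replace separate rmap and
refcount code paths:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the XFS_RTG_ inode types. */
enum toy_rtg_inode {
	TOY_RTG_RMAP,
	TOY_RTG_REFCOUNT,
	TOY_RTG_MAX,
};

/* One slot per inode type instead of one named field per type. */
struct toy_rtgroup {
	uint64_t	inodes[TOY_RTG_MAX];
};

/* Per-type data lives in a table indexed by the same enum. */
static const char *const toy_rtg_names[TOY_RTG_MAX] = {
	[TOY_RTG_RMAP]		= "rmap",
	[TOY_RTG_REFCOUNT]	= "refcount",
};

const char *toy_rtg_inode_name(enum toy_rtg_inode type)
{
	assert((unsigned int)type < TOY_RTG_MAX);
	return toy_rtg_names[type];
}

/* Return the type of inode number ino in this group, or -1 if none. */
int toy_rtg_find_type(const struct toy_rtgroup *rtg, uint64_t ino)
{
	int type;

	for (type = 0; type < TOY_RTG_MAX; type++) {
		if (rtg->inodes[type] == ino)
			return type;
	}
	return -1;
}
```

Adding another per-RTG inode type would then only need a new enum value and
a new table entry in this sketch, with no new helper.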
Christoph Hellwig [Wed, 17 Jul 2024 06:39:40 +0000 (08:39 +0200)]
repair: refactor ensure_rtgroup_file
Refactor ensure_rtgroup_file so that re-population of the inodes is
handled by the callers, and by using the generic xfs_rtginode_create
helper instead of open coding it.
Christoph Hellwig [Sat, 13 Jul 2024 06:52:40 +0000 (08:52 +0200)]
repair: refactor per-RTG inode handling
Simplify the helpers dealing with the per-RTG inodes by making them
work on arrays indexed by the XFS_RTG_ types instead of duplicating
the logic for the rmap vs refcount inodes.
Christoph Hellwig [Wed, 17 Jul 2024 06:36:53 +0000 (08:36 +0200)]
repair: load per-RTG inodes early and keep them around
Simplify the code that deals with the per-RTG inodes by not only looking
up the inode numbers in discover_rtgroup_inodes, but also reading the
inodes into memory there, and keeping them around in the rtg_inodes
pointers in the xfs_rtgroup structure, similar to what the kernel does.
- an enum to index into them
- turn the two inode pointers in the rtg structure into an array
indexed by the above enum
- add a xfs_rtginode_ops structure to abstract the differences between
them into a descriptive format
This allows consolidating a fair bit of code into libxfs that is
shared between the different callers in the kernel, mkfs,
and repair. This will be even more useful if we move the bitmap
and summary to be per RTG.
Switch the imeta lookup / create / link APIs to work on [parent inode,
pathname component] pairs instead of the multi-level xfs_imeta_path
structure. This simplifies the code a lot, and makes the dependencies
on the parent directory (there is only one right one) more clear.
Note that the metapath code looks a bit off with this now, we might want
to try the same unlink/relink trick as we do for the leaf files there.
Christoph Hellwig [Wed, 10 Jul 2024 07:35:23 +0000 (09:35 +0200)]
repair: factor out a ensure_rtgroup_file helper
The code to relink/rebuild the rtrmap and rtrefcount inodes is
almost entirely duplicated. Add a common helper for them, which
will also be useful if/when we add more per-rtg inodes.
Christoph Hellwig [Wed, 10 Jul 2024 07:27:44 +0000 (09:27 +0200)]
repair: don't reset fork format on link
The inode fork formats won't change with an unlink/link cycle, so don't
reset them for the rtgroup inodes in phase6. If the inode was corrupted
we won't even get here and will re-create the inode anyway.
Darrick J. Wong [Wed, 3 Jul 2024 21:22:36 +0000 (14:22 -0700)]
mkfs: validate CoW extent size hint when rtinherit is set
Extent size hints exist to nudge the behavior of the file data block
allocator towards trying to make aligned allocations. Therefore, it
doesn't make sense to allow a hint that isn't a multiple of the
fundamental allocation unit for a given file.
This means that if the sysadmin is formatting with rtinherit set on the
root dir, validate_cowextsize_hint needs to check the hint value on a
simulated realtime file to make sure that it's correct. This hasn't
been necessary in the past since one cannot have a CoW hint without a
reflink filesystem, and we previously didn't allow rt reflink
filesystems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
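
The alignment rule described above can be sketched as follows. This is an
illustration under stated assumptions, not the mkfs implementation: the
function names and parameters are hypothetical, hints and rt extent sizes
are both expressed in filesystem blocks, and a regular file is assumed to
have an allocation unit of one block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a CoW extent size hint is a multiple of
 * the file's fundamental allocation unit.  Both values are in blocks.
 */
bool toy_cowextsize_hint_ok(uint32_t hint_blocks, uint32_t alloc_unit)
{
	assert(alloc_unit > 0);

	/* Zero means "no hint", which is always acceptable. */
	if (hint_blocks == 0)
		return true;

	return hint_blocks % alloc_unit == 0;
}

/*
 * If rtinherit is set on the root directory, check the hint against a
 * simulated realtime file, whose allocation unit is the rt extent size.
 * Otherwise the allocation unit is assumed to be a single block.
 */
bool toy_validate_hint(uint32_t hint_blocks, bool rtinherit,
		       uint32_t rextsize_blocks)
{
	uint32_t unit = rtinherit ? rextsize_blocks : 1;

	return toy_cowextsize_hint_ok(hint_blocks, unit);
}
```

With an rt extent size of 4 blocks, a hint of 8 passes and a hint of 6 is
rejected when rtinherit is set; without rtinherit the hint of 6 is fine.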
Darrick J. Wong [Wed, 3 Jul 2024 21:22:35 +0000 (14:22 -0700)]
xfs_repair: allow sysadmins to add realtime reflink
Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the realtime reference count btree, and therefore reflink on
realtime volumes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:35 +0000 (14:22 -0700)]
xfs_repair: validate CoW extent size hint on rtinherit directories
XFS allows a sysadmin to change the rt extent size when adding a rt
section to a filesystem after formatting. If there are any directories
with both a cowextsize hint and rtinherit set, the hint could become
misaligned with the new rextsize. Offer to fix the problem if we're in
modify mode and the verifier didn't trip. If we're in dry run mode,
we let the kernel fix it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:35 +0000 (14:22 -0700)]
xfs_repair: allow realtime files to have the reflink flag set
Now that we allow reflink on the realtime volume, allow that combination
of inode flags if the feature's enabled. Note that we now allow inodes
to have rtinherit even if there's no realtime volume, since the kernel
has never restricted that.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:34 +0000 (14:22 -0700)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:34 +0000 (14:22 -0700)]
xfs_repair: compute refcount data for the realtime groups
At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:34 +0000 (14:22 -0700)]
xfs_repair: find and mark the rtrefcountbt inode
Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:34 +0000 (14:22 -0700)]
xfs_repair: use realtime refcount btree data to check block types
Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:33 +0000 (14:22 -0700)]
xfs_repair: allow CoW staging extents in the realtime rmap records
Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports them. As far as reporting
leftover staging extents, we'll report them when we scan the rt refcount
btree, in a future patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:33 +0000 (14:22 -0700)]
xfs_db: support rudimentary checks of the rtrefcount btree
Perform some fairly superficial checks of the rtrefcount btree. We'll
do more sophisticated checks in xfs_repair, but provide enough of
a spot-check here that we can do simple things.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:31 +0000 (14:22 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:31 +0000 (14:22 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files. This apparently was done to disable
delayed allocation for realtime files.
However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files. In this case, the
logic will incorrectly send the write through the delalloc write path.
Fix this by adjusting the logic slightly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:29 +0000 (14:22 -0700)]
xfs: wire up a new inode fork type for the realtime refcount
Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:28 +0000 (14:22 -0700)]
xfs: add realtime refcount btree inode to metadata directory
Add a metadir path to select the realtime refcount btree inode and load
it at mount time. The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:28 +0000 (14:22 -0700)]
xfs: add a realtime flag to the refcount update log redo items
Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt. We'll
wire up the actual refcount code later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:28 +0000 (14:22 -0700)]
xfs: prepare refcount functions to deal with rtrefcountbt
Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions. Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.
Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert it all at once.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:28 +0000 (14:22 -0700)]
xfs: add realtime refcount btree operations
Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are neither constrained
to the free space nor any particular AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:27 +0000 (14:22 -0700)]
xfs: define the on-disk realtime refcount btree format
Start filling out the rtrefcount btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate refcount btree blocks. This prepares the way for connecting
the btree operations implementation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Add new realtime refcount btree definitions. The realtime refcount btree
will be rooted from a hidden inode, but has its own shape and therefore
needs to have most of its own separate types.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:25 +0000 (14:22 -0700)]
xfs_repair: allow sysadmins to add realtime reverse mapping indexes
Allow the sysadmin to use xfs_repair to upgrade an existing filesystem
to support the reverse mapping btree index for realtime volumes. This
is needed for online fsck.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:25 +0000 (14:22 -0700)]
xfs_repair: reserve per-AG space while rebuilding rt metadata
Realtime metadata btrees can consume quite a bit of space on a full
filesystem. Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata. This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:25 +0000 (14:22 -0700)]
xfs_repair: check for global free space concerns with default btree slack levels
It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full. If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space. The
solution to this is to pack the new blocks completely full if we fear
running out of space.
Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.
Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
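
A rough sketch of the global decision described above, heavily simplified
and not the repair code: it counts only leaf blocks, assumes every btree
has the same number of records per block, and all names are hypothetical.
It sums the space needed at the default 75% fill and compares that against
the global free space:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Leaf blocks needed to hold nr_records when each block is filled to
 * fill_pct percent of recs_per_block.  Interior levels are ignored.
 */
uint64_t toy_leaf_blocks(uint64_t nr_records, uint32_t recs_per_block,
			 uint32_t fill_pct)
{
	uint64_t per_block;

	assert(recs_per_block > 0);
	assert(fill_pct > 0 && fill_pct <= 100);

	per_block = (uint64_t)recs_per_block * fill_pct / 100;
	if (per_block == 0)
		per_block = 1;

	/* Round up: a partially filled last block still costs a block. */
	return (nr_records + per_block - 1) / per_block;
}

/*
 * Sum the space needed at the default slack level over all btrees (AG
 * and inode-rooted alike) and pack maximally if it does not fit in the
 * global free space.
 */
bool toy_should_pack_btrees(const uint64_t *nr_records, size_t nr_btrees,
			    uint32_t recs_per_block,
			    uint64_t global_free_blocks)
{
	uint64_t needed = 0;
	size_t i;

	for (i = 0; i < nr_btrees; i++)
		needed += toy_leaf_blocks(nr_records[i], recs_per_block, 75);

	return needed > global_free_blocks;
}
```

For two btrees of 300 records each at 100 records per block, the 75% layout
needs 8 blocks in total, so packing is chosen only when fewer than 8 blocks
are free globally.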
Darrick J. Wong [Wed, 3 Jul 2024 21:22:24 +0000 (14:22 -0700)]
xfs_repair: always check realtime file mappings against incore info
Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3). As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.
Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state. The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork. We already
do this for regular data files.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
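
The check/update split can be illustrated with a toy state map. This is a
sketch under simplifying assumptions, not process_rt_rec_state itself: the
state map is a small fixed array, there are only two states, and all names
are invented. Overlaps between extents of the same fork are not detected
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS	16

enum toy_state { TOY_FREE = 0, TOY_INUSE };

struct toy_statemap {
	unsigned char	state[TOY_NBLOCKS];
};

struct toy_extent {
	unsigned int	start;
	unsigned int	len;
};

/* Check only: does this mapping conflict with the incore state? */
bool toy_check_mapping(const struct toy_statemap *m,
		       unsigned int start, unsigned int len)
{
	unsigned int i;

	if (len == 0 || start >= TOY_NBLOCKS || len > TOY_NBLOCKS - start)
		return false;
	for (i = start; i < start + len; i++) {
		if (m->state[i] != TOY_FREE)
			return false;
	}
	return true;
}

/* Update only: record the mapping in the incore state. */
void toy_update_mapping(struct toy_statemap *m,
			unsigned int start, unsigned int len)
{
	unsigned int i;

	assert(start < TOY_NBLOCKS && len <= TOY_NBLOCKS - start);
	for (i = start; i < start + len; i++)
		m->state[i] = TOY_INUSE;
}

/*
 * Returns false (caller zaps the fork) without touching the map if any
 * mapping is bad; otherwise commits all mappings and returns true.
 */
bool toy_process_fork(struct toy_statemap *m,
		      const struct toy_extent *ext, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!toy_check_mapping(m, ext[i].start, ext[i].len))
			return false;
	}
	for (i = 0; i < nr; i++)
		toy_update_mapping(m, ext[i].start, ext[i].len);
	return true;
}
```

Because the check pass never modifies the map, a fork that gets zapped
leaves no stale claims behind for the later crosslink checks.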
Darrick J. Wong [Wed, 3 Jul 2024 21:22:24 +0000 (14:22 -0700)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps
Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:24 +0000 (14:22 -0700)]
xfs_repair: find and mark the rtrmapbt inodes
Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:23 +0000 (14:22 -0700)]
xfs_repair: use realtime rmap btree data to check block types
Use the realtime rmap btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:22 +0000 (14:22 -0700)]
xfs_repair: flag suspect long-format btree blocks
Pass a "suspect" counter through scan_lbtree just like we do for
short-format btree blocks, and increment its value when we encounter
blocks with bad CRCs or outright corruption. This makes it so that
repair actually catches bmbt blocks with bad crcs or other verifier
errors.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 3 Jul 2024 21:22:22 +0000 (14:22 -0700)]
libxfs: dirty buffers should be marked uptodate too
I started fuzz-testing the realtime rmap feature with a very large
number of realtime allocation groups. There were so many rt groups that
repair had to rebuild /realtime in the metadata directory tree, and that
directory was big enough to spur the creation of a block format
directory.
Unfortunately, repair then walks both directory trees to look for
unconnected files. This part of phase 6 emits CRC errors on the newly
created buffers for the /realtime directory, declares the directory to
be garbage, and moves all the rt rmap inodes to /lost+found, resulting
in a corrupt fs.
Poking around in gdb, I noticed that the buffer contents were indeed
zero, and that UPTODATE was not set. This was very strange, until I
added a watch on bp->b_flags to watch for accesses. It turns out that
xfs_repair's prefetch code will _get a buffer and zero the contents if
UPTODATE is not set.
The directory tree code in libxfs will also _get a buffer, initialize
it, and log it to the coordinating transaction, which in this case is
the transactions used to reconnect the rmap btree inodes to /realtime.
At no point does any of that code ever set UPTODATE on the buffer, which
is why prefetch zaps the contents.
Hence change both buffer dirtying functions to set UPTODATE, since a
dirty buffer is by definition at least as recent as whatever's on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
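
The failure mode and the fix can be modelled with a toy buffer. This is not
the libxfs buffer cache; the struct, flag names, and functions are all
hypothetical stand-ins. The set_uptodate parameter exists only so that the
old and new behaviour can be compared side by side:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_UPTODATE	(1u << 0)
#define TOY_DIRTY	(1u << 1)

struct toy_buf {
	unsigned int	flags;
	char		data[16];
};

/*
 * Models the prefetch-style "get" described above: contents that are
 * not known to be current are zeroed.
 */
void toy_get(struct toy_buf *bp)
{
	assert(bp != NULL);
	if (!(bp->flags & TOY_UPTODATE))
		memset(bp->data, 0, sizeof(bp->data));
}

/*
 * Models a buffer dirtying function.  With set_uptodate false this is
 * the old behaviour; with true it is the fixed behaviour, since a dirty
 * buffer is at least as recent as whatever is on disk.
 */
void toy_mark_dirty(struct toy_buf *bp, bool set_uptodate)
{
	assert(bp != NULL);
	bp->flags |= TOY_DIRTY;
	if (set_uptodate)
		bp->flags |= TOY_UPTODATE;
}
```

In this model, a buffer that is initialized, dirtied the old way, and then
passed to toy_get loses its contents, while one dirtied the new way keeps
them.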
Darrick J. Wong [Wed, 3 Jul 2024 21:22:22 +0000 (14:22 -0700)]
xfs_scrub: retest metadata across scrub groups after a repair
Certain types of metadata have dependencies that cross scrub groups.
For example, after repairing the part of the realtime bitmap corresponding to
a realtime group, we potentially need to rebuild the realtime summary to
reflect the new bitmap contents. The rtsummary is a separate scrub group
(metafiles) from the rgbitmap (rtgroup), which means that the rtsummary
repairs must be tracked by a separate scrub_item.
Create the necessary dependency table and code to make these kinds of
cross-group validations possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
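
One possible shape for such a dependency table, as a sketch only. The enum
values, struct, and function below are hypothetical and are not the
xfs_scrub data structures; the single table entry is the rgbitmap to
rtsummary example given above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scrub types and the groups they belong to. */
enum toy_scrub_type { TOY_RGBITMAP, TOY_RTSUMMARY, TOY_NR_TYPES };
enum toy_scrub_group { TOY_GROUP_RTGROUP, TOY_GROUP_METAFILES };

static const enum toy_scrub_group toy_group_of[TOY_NR_TYPES] = {
	[TOY_RGBITMAP]	= TOY_GROUP_RTGROUP,
	[TOY_RTSUMMARY]	= TOY_GROUP_METAFILES,
};

/* "If 'repaired' was fixed, 'retest' must be checked again." */
struct toy_dep {
	enum toy_scrub_type	repaired;
	enum toy_scrub_type	retest;
};

static const struct toy_dep toy_deps[] = {
	{ TOY_RGBITMAP, TOY_RTSUMMARY },
};

/*
 * Given a bitmask of repaired types, return a bitmask of the types in
 * other scrub groups that need to be retested.
 */
unsigned int toy_retest_mask(unsigned int repaired_mask)
{
	unsigned int retest = 0;
	size_t i;

	assert(repaired_mask < (1u << TOY_NR_TYPES));

	for (i = 0; i < sizeof(toy_deps) / sizeof(toy_deps[0]); i++) {
		const struct toy_dep *d = &toy_deps[i];

		if (!(repaired_mask & (1u << d->repaired)))
			continue;
		if (toy_group_of[d->repaired] != toy_group_of[d->retest])
			retest |= 1u << d->retest;
	}
	return retest;
}
```

The caller would then schedule the returned types on a separate scrub item
for the other group, which is the part this sketch leaves out.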
Darrick J. Wong [Wed, 3 Jul 2024 21:22:20 +0000 (14:22 -0700)]
xfs_db: support rudimentary checks of the rtrmap btree
Perform some fairly superficial checks of the rtrmap btree. We'll
do more sophisticated checks in xfs_repair, but provide enough of
a spot-check here that we can do simple things.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>