www.infradead.org Git - users/hch/xfsprogs.git/log
3 months ago xfs: implement zoned garbage collection
Christoph Hellwig [Tue, 8 Apr 2025 07:16:12 +0000 (09:16 +0200)]
xfs: implement zoned garbage collection

Source kernel commit: 080d01c41d44f0993f2c235a6bfdb681f0a66be6

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To find the least used zone, a simple set of 10 buckets
based on the amount of space used in each zone is used.  To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.
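
As a rough illustration of the bucket scheme described above, the following Python sketch sorts zones into 10 buckets by used space and picks a victim from the lowest populated bucket.  The function names, the bucket formula and the choice to skip fully empty zones are assumptions made for this example, not the kernel implementation.

```python
# Illustrative sketch only: the names and the bucket formula are assumptions.
NR_BUCKETS = 10  # "a simple set of 10 buckets" from the commit message


def bucket_for(used_blocks, zone_blocks):
    """Map a zone's used space onto buckets 0..9 (0 holds the emptiest zones)."""
    return min(NR_BUCKETS - 1, used_blocks * NR_BUCKETS // zone_blocks)


def pick_victim(used_by_zone, zone_blocks):
    """Return a zone from the lowest populated bucket, or None if none qualifies."""
    buckets = [[] for _ in range(NR_BUCKETS)]
    for zone, used in sorted(used_by_zone.items()):
        if used > 0:  # fully empty zones only need a reset, not GC
            buckets[bucket_for(used, zone_blocks)].append(zone)
    for bucket in buckets:
        if bucket:
            return bucket[0]
    return None
```

Bucketing avoids keeping every zone in a fully sorted structure; any zone in the lowest populated bucket is a reasonable victim.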

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed from the one recorded at the start of
the GC cycle and doesn't update the mapping if it has changed.
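
The two ideas in this paragraph, sorting by inode and logical offset and re-checking the mapping at completion, could be sketched as follows.  The record layout, the dictionary standing in for the file mapping, and all names are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class RmapRec(NamedTuple):
    owner: int    # inode number (hypothetical field names)
    offset: int   # logical file offset, in blocks
    start: int    # block the data occupied when the GC cycle started
    length: int


def gc_order(recs):
    """Sort records so that each file's data is rewritten contiguously."""
    return sorted(recs, key=lambda r: (r.owner, r.offset))


def gc_complete(current_map, rec, new_start):
    """Remap only if the file still points at the block recorded at GC start."""
    key = (rec.owner, rec.offset)
    if current_map.get(key) != rec.start:
        return False  # mapping changed while GC I/O was in flight: drop the copy
    current_map[key] = new_start
    return True
```

Because no lock is held across the I/O, the comparison at completion time is what keeps a concurrent overwrite from being replaced with stale data.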

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago FIXUP: xfs: add support for zoned space reservations
Christoph Hellwig [Fri, 20 Dec 2024 03:49:37 +0000 (19:49 -0800)]
FIXUP: xfs: add support for zoned space reservations

Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: add support for zoned space reservations
Christoph Hellwig [Tue, 8 Apr 2025 07:15:06 +0000 (09:15 +0200)]
xfs: add support for zoned space reservations

Source kernel commit: 0bb2193056b5969e4148fc0909e89a5362da873e

For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock held
can deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available they wake GC and wait for it.  This is
done using a list of on-stack reservations to ensure fairness.
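
A minimal sketch of the fairness idea, assuming a single counter of available blocks and a FIFO list of waiters.  The class and method names are invented for illustration; the real code sleeps and is woken by GC rather than returning a flag.

```python
from collections import deque


class ZoneSpace:
    """Toy model of an 'available blocks' counter with FIFO waiters."""

    def __init__(self, available):
        self.available = available
        self.waiters = deque()  # pending reservation sizes, oldest first

    def reserve(self, blocks):
        """Take blocks now, or queue up; never jump ahead of older waiters."""
        if not self.waiters and self.available >= blocks:
            self.available -= blocks
            return True
        self.waiters.append(blocks)  # the caller would wake GC and sleep here
        return False

    def gc_freed(self, blocks):
        """GC made blocks available: satisfy waiters strictly in order."""
        self.available += blocks
        granted = []
        while self.waiters and self.available >= self.waiters[0]:
            need = self.waiters.popleft()
            self.available -= need
            granted.append(need)
        return granted
```

Serving the queue strictly in order means a large reservation cannot be starved by a stream of small ones.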

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: add the zoned space allocator
Christoph Hellwig [Tue, 8 Apr 2025 07:15:00 +0000 (09:15 +0200)]
xfs: add the zoned space allocator

Source kernel commit: 4e4d52075577707f8393e3fc74c1ef79ca1d3ce6

For zoned RT devices space is always allocated at the write pointer, that
is, right after the last written block, and only recorded on I/O completion.

The actual allocation algorithm is very simple and just involves
picking a good zone, preferably the one used for the last write to the
inode.  As the number of zones that can be written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible from the iomap dio and buffered writeback bio submission
helpers just before submitting the bio.

Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and is
pointed to by the xfs_rtgroup.  Because zoned file systems don't have
an rsum cache, the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme used
for that is very similar to the reflink COW end I/O path.
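
The split between picking a location at submission and recording it at completion might look like this toy model.  The class, its fields and the error handling are assumptions for the example and do not mirror the kernel structures.

```python
class OpenZone:
    """Toy open zone: space is handed out only at the write pointer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.write_pointer = 0  # next block to be written
        self.recorded = []      # (start, length) pairs, added at I/O completion

    def submit_write(self, length):
        """Pick the location just before submitting the I/O."""
        if self.write_pointer + length > self.capacity:
            raise ValueError("zone full, another open zone is needed")
        start = self.write_pointer
        self.write_pointer += length
        return start

    def complete_write(self, start, length):
        """Only now does the allocation become a recorded mapping."""
        self.recorded.append((start, length))
```

Between the two calls the blocks are consumed in the zone but not yet visible in any mapping.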

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago FIXUP: xfs: parse and validate hardware zone information
Christoph Hellwig [Fri, 20 Dec 2024 03:49:36 +0000 (19:49 -0800)]
FIXUP: xfs: parse and validate hardware zone information

Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: parse and validate hardware zone information
Christoph Hellwig [Tue, 8 Apr 2025 07:13:01 +0000 (09:13 +0200)]
xfs: parse and validate hardware zone information

Source kernel commit: 720c2d58348329ce57bfa7ecef93ee0c9bf4b405

Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
3 months ago xfs: disable sb_frextents for zoned file systems
Christoph Hellwig [Tue, 8 Apr 2025 07:10:55 +0000 (09:10 +0200)]
xfs: disable sb_frextents for zoned file systems

Source kernel commit: 1d319ac6fe1bd6364c5fc6e285ac47b117aed117

Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information.  Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: export zoned geometry via XFS_FSOP_GEOM
Christoph Hellwig [Tue, 8 Apr 2025 07:10:49 +0000 (09:10 +0200)]
xfs: export zoned geometry via XFS_FSOP_GEOM

Source kernel commit: 1fd8159e7ca41203798b6f65efaf1724eb318cd4

Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago FIXUP: xfs: allow internal RT devices for zoned mode
Christoph Hellwig [Fri, 20 Dec 2024 03:49:36 +0000 (19:49 -0800)]
FIXUP: xfs: allow internal RT devices for zoned mode

Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: allow internal RT devices for zoned mode
Christoph Hellwig [Tue, 8 Apr 2025 07:09:44 +0000 (09:09 +0200)]
xfs: allow internal RT devices for zoned mode

Source kernel commit: bdc03eb5f98f6f1ae4bd5e020d1582a23efb7799

Allow creating an RT subvolume on the same device as the main data
device.  This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago FIXUP: xfs: define the zoned on-disk format
Christoph Hellwig [Fri, 20 Dec 2024 03:49:36 +0000 (19:49 -0800)]
FIXUP: xfs: define the zoned on-disk format

3 months ago xfs: define the zoned on-disk format
Christoph Hellwig [Tue, 8 Apr 2025 07:09:00 +0000 (09:09 +0200)]
xfs: define the zoned on-disk format

Source kernel commit: 2167eaabe2fadde24cb8f1dafbec64da1d2ed2f5

Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are a few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, thus the
/rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
sb_rbmblocks superblock field must be cleared to zero.

2) there is a new superblock field that specifies the start of an
internal RT section.  This allows supporting SMR HDDs that have random
writable space at the beginning which is used for the XFS data device
(which really is the metadata device for this configuration), directly
followed by an RT device on the same block device.  While something
similar could be achieved using dm-linear, just having a single device
directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
overprovisioning) that is never used for user capacity, but allows GC
to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that we
can persistently track the total amount of rtblocks currently used in
a RT group.  There is no data structure other than the rmap that
tracks used space in an RT group, and this counter is used to decide
when a RT group has been entirely emptied, and to select one that
is relatively empty if garbage collection needs to be performed.
While this counter could be tracked entirely in memory and rebuilt
from the rmap at mount time, that would lead to very long mount times
with the large number of RT groups implied by the number of hardware
zones especially on SMR hard drives with 256MB zone sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: add a xfs_rtrmap_highest_rgbno helper
Christoph Hellwig [Tue, 8 Apr 2025 07:08:53 +0000 (09:08 +0200)]
xfs: add a xfs_rtrmap_highest_rgbno helper

Source kernel commit: aacde95a37160b1462e46e0fd0cc7fd70e3bf1cc

Add a helper to find the last offset mapped in the rtrmap.  This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.
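
Conceptually, the write position on a device without a hardware write pointer can be derived from the highest mapped block.  The sketch below models rmap records as plain (start, length) tuples; that representation and the function names are assumptions for illustration only.

```python
def highest_rgbno(rmap_records):
    """Return the highest mapped block in the group, or None if nothing is mapped."""
    ends = [start + length - 1 for start, length in rmap_records]
    return max(ends) if ends else None


def resume_write_pointer(rmap_records):
    """Emulated write pointer: the block right after the last mapped one."""
    last = highest_rgbno(rmap_records)
    return 0 if last is None else last + 1
```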

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
Christoph Hellwig [Tue, 8 Apr 2025 07:08:31 +0000 (09:08 +0200)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

Source kernel commit: f42c652434de5e26e02798bf6a0c2a4a8627196b

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation.  To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c
Christoph Hellwig [Tue, 8 Apr 2025 07:07:42 +0000 (09:07 +0200)]
xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c

Source kernel commit: 7c879c8275c0505c551f0fc6c152299c8d11f756

Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.  Move it to xfs_iomap.c
next to its two callers.

Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: add a rtg_blocks helper
Christoph Hellwig [Tue, 8 Apr 2025 07:07:07 +0000 (09:07 +0200)]
xfs: add a rtg_blocks helper

Source kernel commit: 012482b3308a49a84c2a7df08218dd4ad081e1da

Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: reduce metafile reservations
Christoph Hellwig [Tue, 8 Apr 2025 07:06:53 +0000 (09:06 +0200)]
xfs: reduce metafile reservations

Source kernel commit: 272e20bb24dc895375ccc18a82596a7259b5a652

There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen.  Reserve at most 1/4th of the data device blocks, which is
still a heuristic.
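
The cap described above amounts to a simple minimum.  A sketch with hypothetical parameter names:

```python
def metafile_reservation(worst_case_blocks, data_device_blocks):
    """Reserve the worst case, but never more than 1/4 of the data device."""
    return min(worst_case_blocks, data_device_blocks // 4)
```

For a 2000-block data device, a worst case of 1000 blocks is capped at 500, while a worst case of 100 blocks is reserved in full.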

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago FIXUP: xfs: make metabtree reservations global
Christoph Hellwig [Fri, 17 Jan 2025 09:39:30 +0000 (10:39 +0100)]
FIXUP: xfs: make metabtree reservations global

3 months ago xfs: make metabtree reservations global
Christoph Hellwig [Tue, 8 Apr 2025 07:05:47 +0000 (09:05 +0200)]
xfs: make metabtree reservations global

Source kernel commit: 1df8d75030b787a9fae270b59b93eef809dd2011

Currently each metabtree inode has its own space reservation to ensure
it can be expanded to the maximum size, mirroring what is done for the
AG-based btrees.  But unlike the AG-based btrees the metabtree inodes
aren't restricted to allocate from a single AG but can use free space
from the entire file system.  And unlike AG-based btrees where the
required reservation shrinks with the available free space due to this,
the metabtree reservations for the rtrmap and rtreflink trees are not
bound in any way by the data device free space as they track RT extent
allocations.  This is not very efficient as it requires a large number
of blocks to be set aside that can't be used at all by other btrees.

Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space, which now also removes the
overloading of the i_nblocks field for metabtree inodes, which would
create problems if metabtree inodes ever had a big enough xattr fork
to require xattr blocks outside the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago FIXUP: xfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Fri, 20 Dec 2024 03:48:51 +0000 (19:48 -0800)]
FIXUP: xfs: generalize the freespace and reserved blocks handling

Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Tue, 8 Apr 2025 07:02:27 +0000 (09:02 +0200)]
xfs: generalize the freespace and reserved blocks handling

Source kernel commit: 712bae96631852c1a1822ee4f57a08ccd843358b

xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.

Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters.  This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code.  Both these additions will be needed for the zoned
allocator.
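
The shape of such an abstraction, an enum that designates the counter plus wrappers that hide the storage, could be sketched like this.  The enum members and method names are invented for the example and plain integers stand in for percpu counters.

```python
from enum import Enum


class FreeCounter(Enum):
    BLOCKS = "blocks"        # hypothetical names, not the kernel's enum
    RTEXTENTS = "rtextents"


class FreeSpace:
    """Wrappers that hide which underlying counter is touched."""

    def __init__(self, initial):
        self._counters = {c: initial.get(c, 0) for c in FreeCounter}

    def add(self, counter, delta):
        self._counters[counter] += delta

    def dec(self, counter, delta):
        """Return False (an ENOSPC analogue) instead of going negative."""
        if self._counters[counter] < delta:
            return False
        self._counters[counter] -= delta
        return True

    def estimate(self, counter):
        return self._counters[counter]
```

Adding another counter then only requires a new enum member; the callers and wrappers stay unchanged.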

Also switch the flooring of the frextents counter to 0 in statfs for the
rtinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs_repair: handling a block with bad crc, bad uuid, and bad magic number needs fixing
Bill O'Donnell [Fri, 21 Mar 2025 22:05:35 +0000 (17:05 -0500)]
xfs_repair: handling a block with bad crc, bad uuid, and bad magic number needs fixing

In certain cases, if a block is so messed up that crc, uuid and magic
number are all bad, we need to not only detect in phase3 but fix it
properly in phase6. In the current code, the mechanism doesn't work
in that it only pays attention to one of the parameters.

Note: in this case, the nlink inode link count drops to 1, but
re-running xfs_repair fixes it back to 2. This is a side effect that
should probably be handled in update_inode_nlinks() with a separate patch.
Regardless, running xfs_repair twice, with this patch applied
fixes the issue. Recognize that this patch is a fix for xfs v5.

Signed-off-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
v2: remove superfluous needmagic logic
v3: clarify the description
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
3 months ago xfs: Use abs_diff instead of XFS_ABSDIFF
Matthew Wilcox (Oracle) [Fri, 21 Mar 2025 16:31:15 +0000 (09:31 -0700)]
xfs: Use abs_diff instead of XFS_ABSDIFF

Source kernel commit: ca3ac4bf4dc307cea5781dccccf41c1d14c2f82f

We have had a central definition for this function since 2023, used by
a number of different parts of the kernel.
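
For readers unfamiliar with the helper, the idea is to subtract the smaller value from the larger so that the result never goes negative, which matters for unsigned types.  A Python rendering of that idea (not the kernel macro itself), with a 64-bit mask to show what a naive unsigned subtraction would yield:

```python
def abs_diff(a, b):
    """Absolute difference computed without ever producing a negative value."""
    return a - b if a > b else b - a


def naive_u64_sub(a, b):
    """What 'a - b' yields in 64-bit unsigned arithmetic (wraps around)."""
    return (a - b) & 0xFFFFFFFFFFFFFFFF
```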

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
3 months ago xfs_repair: fix stupid argument error in verify_inode_chunk
Darrick J. Wong [Fri, 21 Mar 2025 16:32:17 +0000 (09:32 -0700)]
xfs_repair: fix stupid argument error in verify_inode_chunk

An arm64 VM running fstests with 64k fsblock size blew up the test
filesystem when the OOM killer whacked xfs_repair as it was rebuilding a
sample filesystem.  A subsequent attempt by fstests to repair the
filesystem printed stuff like this:

inode rec for ino 39144576 (1/5590144) overlaps existing rec (start 1/5590144)
inode rec for ino 39144640 (1/5590208) overlaps existing rec (start 1/5590208)

followed by a lot of errors such as:

cannot read agbno (1/5590208), disk block 734257664
xfs_repair: error - read only 0 of 65536 bytes

Here we're feeding per-AG inode numbers into a block reading function as
if it were a per-AG block number.  This is wrong by a factor of 128x so
we read past the end of the filesystem.  Worse yet, the buffer cache
fills up memory and thus the second repair process is also OOM killed.
The filesystem is not fixed.
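
The 128x factor is consistent with 512-byte inodes in a 64k block; the inode size is an assumption here, since the message does not state it.  Under that assumption the conversion that should have been applied looks like:

```python
FSBLOCK_SIZE = 65536   # 64k fsblock size, from the report above
INODE_SIZE = 512       # assumption: not stated in the commit message
INODES_PER_BLOCK = FSBLOCK_SIZE // INODE_SIZE


def agino_to_agbno(agino):
    """Convert a per-AG inode number to the per-AG block that holds it."""
    return agino // INODES_PER_BLOCK
```

With these values, per-AG inode 5590144 from the first message lives in per-AG block 43673, far below the number that was passed to the block reading function.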

Cc: linux-xfs@vger.kernel.org # v3.1.8
Fixes: 0553a94f522c17 ("repair: kill check_inode_block")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs_repair: fix infinite loop in longform_dir2_entry_check*
Darrick J. Wong [Fri, 21 Mar 2025 16:32:02 +0000 (09:32 -0700)]
xfs_repair: fix infinite loop in longform_dir2_entry_check*

If someone corrupts the data fork of a directory to have a bmap record
whose br_startoff only has bits set in the upper 32 bits, the code will
suffer an integer overflow when assigning the 64-bit next_da_bno to the
32-bit da_bno.  This leads to an infinite loop.
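
The truncation can be reproduced with a mask.  This sketch only models the 64-bit to 32-bit assignment; the variable names follow the message, everything else is illustrative.

```python
def to_u32(value):
    """Model assigning a 64-bit value to a 32-bit unsigned variable."""
    return value & 0xFFFFFFFF


# A startoff with only upper-32 bits set truncates to zero, so a loop that
# advances via da_bno = next_da_bno starts over at block 0 every time.
next_da_bno = 1 << 32
da_bno = to_u32(next_da_bno)
```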

Found by fuzzing xfs/812 with u3.bmx[0].startoff = firstbit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs_repair: fix crash in reset_rt_metadir_inodes
Darrick J. Wong [Fri, 21 Mar 2025 16:31:46 +0000 (09:31 -0700)]
xfs_repair: fix crash in reset_rt_metadir_inodes

I observed that xfs_repair -n segfaults during xfs/812 after corrupting
the /rtgroups metadir inode because mp->m_rtdirip isn't loaded.  Fix the
crash and print a warning about the missing inode.

Cc: linux-xfs@vger.kernel.org # v6.13.0
Fixes: 7c541c90fd77a2 ("xfs_repair: support realtime groups")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs_repair: don't recreate /quota metadir if there are no quota inodes
Darrick J. Wong [Fri, 21 Mar 2025 16:31:31 +0000 (09:31 -0700)]
xfs_repair: don't recreate /quota metadir if there are no quota inodes

If repair does not discover even a single quota file, then don't have it
try to create a /quota metadir to hold them.  This avoids pointless
repair failures on quota-less filesystems that are nearly full.

Found via generic/558 on a zoned=1 filesystem.

Cc: linux-xfs@vger.kernel.org # v6.13.0
Fixes: b790ab2a303d58 ("xfs_repair: support quota inodes in the metadata directory")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
3 months ago xfs_repair: fix wording of error message about leftover CoW blocks on the rt device
Darrick J. Wong [Mon, 24 Mar 2025 17:09:51 +0000 (10:09 -0700)]
xfs_repair: fix wording of error message about leftover CoW blocks on the rt device

Fix the wording so the user knows it's the rt cow staging extents that
were lost.

Fixes: a9b8f0134594d0 ("xfs_repair: use realtime refcount btree data to check block types")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
3 months ago xfs_io: Add cachestat syscall support
Ritesh Harjani (IBM) [Sun, 16 Mar 2025 18:45:29 +0000 (00:15 +0530)]
xfs_io: Add cachestat syscall support

This adds a -c "cachestat off len" command which uses the cachestat() syscall
[1]. It can provide the following pagecache details for a file:

- no. of cached pages,
- no. of dirty pages,
- no. of pages marked for writeback,
- no. of evicted pages,
- no. of recently evicted pages

[1]: https://lore.kernel.org/all/20230503013608.2431726-3-nphamcs@gmail.com/T/#u

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
[aalbersh remove [] from command arguments help]

3 months ago xfs_io: Add RWF_DONTCACHE support to preadv2
Ritesh Harjani (IBM) [Sat, 15 Mar 2025 08:20:13 +0000 (13:50 +0530)]
xfs_io: Add RWF_DONTCACHE support to preadv2

Add per-io RWF_DONTCACHE support flag to preadv2().
This enables xfs_io to perform uncached buffered-io reads.

e.g. xfs_io -c "pread -U -V 1 0 16K" /mnt/f1

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
3 months ago xfs_io: Add RWF_DONTCACHE support to pwritev2
Ritesh Harjani (IBM) [Sat, 15 Mar 2025 08:20:12 +0000 (13:50 +0530)]
xfs_io: Add RWF_DONTCACHE support to pwritev2

Add per-io RWF_DONTCACHE support flag to pwritev2().
This enables xfs_io to perform uncached buffered-io writes.

e.g. xfs_io -fc "pwrite -U -V 1 0 16K" /mnt/f1

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
3 months ago xfs_io: Add support for preadv2
Ritesh Harjani (IBM) [Sat, 15 Mar 2025 08:20:11 +0000 (13:50 +0530)]
xfs_io: Add support for preadv2

This patch adds support for preadv2() to xfs_io.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
3 months ago make: remove the .extradep file in libxfs on "make clean"
Theodore Ts'o [Wed, 19 Feb 2025 16:05:00 +0000 (11:05 -0500)]
make: remove the .extradep file in libxfs on "make clean"

Commit 6e1d3517d108 ("libxfs: test compiling public headers with a C++
compiler") will create the .extradep file.  This can cause future
builds to fail if the header files in $(DESTDIR) no longer exist.

Fix this by removing .extradep (along with files like .ltdep) on a
"make clean".

Fixes: 6e1d3517d108 ("libxfs: test compiling public headers with a C++ compiler")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
4 months ago xfs_{admin,repair},man5: tell the user to mount with nouuid for snapshots
Darrick J. Wong [Fri, 7 Mar 2025 17:55:01 +0000 (09:55 -0800)]
xfs_{admin,repair},man5: tell the user to mount with nouuid for snapshots

Augment the messaging in xfs_admin and xfs_repair to advise the user to
replay a dirty log on a snapshotted filesystem by mounting with nouuid
if the origin filesystem is still mounted.  A user accidentally zapped
the log when trying to mount a backup snapshot because the instructions
we gave them weren't sufficient.

Reported-by: Kjetil Torgrim Homme <kjetilho@ifi.uio.no>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
[add missing whitespace in 'the nouuid option.If you are']

4 months ago gitignore: ignore a few newly generated files
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:35 +0000 (15:50 +0100)]
gitignore: ignore a few newly generated files

These files are generated from corresponding *.in templates.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago libxfs-apply: drop Cc: to stable release list
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:34 +0000 (15:50 +0100)]
libxfs-apply: drop Cc: to stable release list

These Cc: tags are intended for kernel commits which need to be
backported to stable kernels. Maintainers of stable kernel aren't
interested in xfsprogs syncs.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago release.sh: add -f to generate for-next update email
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:33 +0000 (15:50 +0100)]
release.sh: add -f to generate for-next update email

Add --for-next/-f to generate ANNOUNCE email for for-next branch
update. This doesn't require new commit/tarball/tags, so skip it.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago release.sh: generate ANNOUNCE email
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:32 +0000 (15:50 +0100)]
release.sh: generate ANNOUNCE email

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago git-contributors: make revspec required and shebang fix
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:31 +0000 (15:50 +0100)]
git-contributors: make revspec required and shebang fix

Without a default value the script will show help instead of just hanging
while waiting for input on stdin.

Fix the shebang for systems where python lives somewhere other than
/usr/bin.

Cut leading delimiter from the final CC string.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago git-contributors: better handling of hash mark/multiple emails
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:30 +0000 (15:50 +0100)]
git-contributors: better handling of hash mark/multiple emails

Better handling of hash marks, tags with multiple emails, and
unquoted names in emails. See comments in the script.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago Add git-contributors script to notify about merges
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:29 +0000 (15:50 +0100)]
Add git-contributors script to notify about merges

Add a python script used to collect emails over all changes merged in
the next release.

CC: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago release.sh: update version files make commit optional
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:28 +0000 (15:50 +0100)]
release.sh: update version files make commit optional

Based on ./VERSION, the script updates all other files. For
./doc/changelog, the script asks the maintainer to fill it in manually,
as not all changes go into the changelog.

The --no-commit|-n flag is handy when something got into the version commit
and needs to be changed manually. Then ./release.sh -c will use the fixed
history.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago release.sh: add --kup to upload release tarball to kernel.org
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:27 +0000 (15:50 +0100)]
release.sh: add --kup to upload release tarball to kernel.org

Add kup support so that the maintainer can push the newly formed
release tarballs to kernel.org.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago release.sh: add signing and fix outdated commands
Andrey Albershteyn [Wed, 26 Feb 2025 14:50:26 +0000 (15:50 +0100)]
release.sh: add signing and fix outdated commands

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
4 months ago xfsprogs: Fix mismatched return type of filesize()
Pavel Reichl [Fri, 21 Feb 2025 18:57:57 +0000 (19:57 +0100)]
xfsprogs: Fix mismatched return type of filesize()

The function filesize() was declared with a return type of 'long' but
defined with 'off_t'. This mismatch caused build issues due to type
incompatibility.

This commit updates the declaration to match the definition, ensuring
consistency and preventing potential compilation errors.

Fixes: 73fb78e5ee8 ("mkfs: support copying in large or sparse files")
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cem@kernel.org>
Fixes: 73fb78e5ee8 ("mkfs: support copying in large or sparse files")
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
5 months ago libxfs-apply: allow stgit users to force-apply a patch
Darrick J. Wong [Thu, 20 Feb 2025 16:49:33 +0000 (08:49 -0800)]
libxfs-apply: allow stgit users to force-apply a patch

Currently, libxfs-apply handles merge conflicts in the auto-backported
patches in a somewhat unfriendly way -- either it applies completely
cleanly, or the user has to ^Z, find the raw diff file in /tmp, apply it
by hand, resume the process, and then tell it to skip the patch.

This is annoying, and I've long worked around that by using my handy
stg-force-import script that imports the patch with --reject, undoes the
partially-complete diff, uses patch(1) to import as much of the diff as
possible, and then starts an editor so the caller can clean up the rest.

When patches are fuzzy, patch(1) is /much/ less strict about applying
changes than stg-import.  Since Carlos sent in his own workaround for
guilt, I figured I might as well port stg-force-import into libxfs-apply
and contribute that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
5 months ago libxfs-apply: fix stgit detection
Andrey Albershteyn [Thu, 20 Feb 2025 17:56:01 +0000 (18:56 +0100)]
libxfs-apply: fix stgit detection

stgit top doesn't seem to return 0 if a stack is created for a branch
but no patches are applied. The exit code is 2, the same as when no
'stgit init' was run.

Replace top with log which always has at least "initialize" action.

Stacked Git 2.4.12

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
5 months ago xfs_io: don't fail FS_IOC_FSGETXATTR on filesystems that lack support
Anthony Iliopoulos [Sat, 22 Feb 2025 15:08:32 +0000 (16:08 +0100)]
xfs_io: don't fail FS_IOC_FSGETXATTR on filesystems that lack support

Not all filesystems implement the FS_IOC_FSGETXATTR ioctl, and in those
cases -ENOTTY will be returned. There is no need to return with an error
when this happens, so just silently return.
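
The error-handling pattern described above, treating ENOTTY as "not supported" rather than as a failure, could be sketched as follows.  The ioctl itself is passed in as a callable so the example stays self-contained; that indirection and the function name are assumptions made for illustration.

```python
import errno


def get_fsxattr(do_ioctl):
    """Run the ioctl; return None quietly if the filesystem lacks support."""
    try:
        return do_ioctl()
    except OSError as err:
        if err.errno == errno.ENOTTY:
            return None  # ioctl not implemented here: not an error
        raise            # anything else is still reported
```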

Without this, fstest generic/169 fails on NFS that doesn't implement the
fileattr_get inode operation.

Fixes: e6b48f451a5d ("xfs_io: allow foreign FSes to show FS_IOC_FSGETXATTR details")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
5 months ago configure: additionally get icu-uc from pkg-config
Alyssa Ross [Fri, 14 Feb 2025 08:45:10 +0000 (09:45 +0100)]
configure: additionally get icu-uc from pkg-config

Upstream libicu changed its pkgconfig files[0] in version 76 to require
callers to call out to each .pc file they need for the libraries they
want to link against.  This apparently reduces overlinking, at a cost of
needing the world to fix themselves up.

This patch fixes the following build error with icu 76, also seen by
Fedora[1]:

    /bin/ld: unicrash.o: undefined reference to symbol 'uiter_setString_76'
    /bin/ld: /lib/libicuuc.so.76: error adding symbols: DSO missing from command line
    collect2: error: ld returned 1 exit status
    make[2]: *** [../include/buildrules:65: xfs_scrub] Error 1
    make[1]: *** [include/buildrules:35: scrub] Error 2

Link: https://github.com/unicode-org/icu/commit/199bc827021ffdb43b6579d68e5eecf54c7f6f56
Link: https://src.fedoraproject.org/rpms/xfsprogs/c/624b0fdf7b2a31c1a34787b04e791eee47c97340
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
5 months agoxfs_scrub: use the display mountpoint for reporting file corruptions
Darrick J. Wong [Mon, 24 Feb 2025 18:22:08 +0000 (10:22 -0800)]
xfs_scrub: use the display mountpoint for reporting file corruptions

In systemd service mode, we make systemd bind-mount the target
mountpoint onto /tmp/scrub (/tmp is private to the service) so that
updates to the global mountpoint in the shared mount namespace don't
propagate into our service container and vice versa, and pass the path
to the bind mount to xfs_scrub via -M.  This solves races such as
unmounting of the target mount point after service container creation
but before process invocation that result in the wrong filesystem being
scanned.

IOWs, to scrub /usr, systemd runs "xfs_scrub -M /tmp/scrub /usr".
Pretend that /usr is a separate filesystem.

However, when xfs_scrub snapshots the handle of /tmp/scrub, libhandle
remembers that /tmp/scrub is the beginning of the path, not the pathname
that we want to use for reporting (/usr).  This means that
handle_to_path returns /tmp/scrub and not /usr as well, with the
unfortunate result that file corruptions are reported with the pathnames
in the xfs_scrub@ service container, not the global ones.

Put another way, xfs_scrub should complain that /usr/bin/X is corrupt,
not /tmp/scrub/bin/X.

Therefore, modify scrub_render_ino_descr to manipulate the path buffer
during error reporting so that the user always gets the mountpoint
passed in, even if someone tells us to use another path for the actual
open() call in phase 1.
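
The path fixup can be pictured as a prefix swap.  The sketch below is
not the scrub_render_ino_descr code; it is a hypothetical helper that
only shows the intended mapping from the path used for open() to the
path shown to the user.

```python
def render_display_path(path, actual_mnt, display_mnt):
    """Swap a leading actual_mnt for display_mnt in path.

    actual_mnt is the path really opened (e.g. /tmp/scrub) and
    display_mnt is the mountpoint the user asked about (e.g. /usr).
    """
    actual = actual_mnt.rstrip("/") or "/"
    display = display_mnt.rstrip("/") or "/"
    if path == actual:
        return display
    # Match whole path components only, so /tmp/scrubber is left alone.
    prefix = actual if actual.endswith("/") else actual + "/"
    if path.startswith(prefix):
        rest = path[len(prefix):]
        if display.endswith("/"):
            return display + rest
        return display + "/" + rest
    return path  # not under the scrubbed mount; leave untouched
```

With this mapping, render_display_path("/tmp/scrub/bin/X", "/tmp/scrub",
"/usr") yields "/usr/bin/X".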

Cc: <linux-xfs@vger.kernel.org> # v6.10.0
Fixes: 9a8b09762f9a52 ("xfs_scrub: use parent pointers when possible to report file operations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_scrub: don't warn about zero width joiner control characters
Darrick J. Wong [Mon, 24 Feb 2025 18:22:08 +0000 (10:22 -0800)]
xfs_scrub: don't warn about zero width joiner control characters

The Unicode code point for "zero width joiners" (aka 0x200D) is used to
hint to renderers that a sequence of simple code points should be
combined into a more complex rendering.  This is how compound emoji such
as "wounded heart" are composed out of "heart" and "bandaid"; and how
complex glyphs are rendered in Malayalam.

Emoji in filenames are a supported usecase, so stop warning about the
mere existence of ZWJ.  We already warn about ZWJ that are used to
produce confusingly rendered names in a single namespace, so we're not
losing any robustness here.
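
To illustrate the remaining check, here is a deliberately simplified
sketch: a ZWJ is only interesting when removing it makes two names in
the same directory collide.  xfs_scrub's real confusable-name detection
is more involved; the function below is a hypothetical stand-in that
strips U+200D and nothing else.

```python
from collections import defaultdict

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def zwj_collisions(names):
    """Group names that differ only by ZWJ placement.

    Returns a list of groups (lists of names); a lone name containing
    ZWJ, such as a compound emoji, is not reported.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.replace(ZWJ, "")].append(name)
    return [g for g in groups.values()
            if len(g) > 1 and any(ZWJ in n for n in g)]
```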

Cc: <linux-xfs@vger.kernel.org> # v6.10.0
Fixes: d43362c78e3e37 ("xfs_scrub: store bad flags with the name entry")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_scrub: fix buffer overflow in string_escape
Darrick J. Wong [Mon, 24 Feb 2025 18:22:08 +0000 (10:22 -0800)]
xfs_scrub: fix buffer overflow in string_escape

Need to allocate one more byte for the null terminator, just in case the
/entire/ input string consists of non-printable bytes e.g. emoji.
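
The arithmetic behind the fix, as a sketch: assuming each non-printable
byte is rendered as a four-character \xNN escape, the worst case is four
output bytes per input byte, plus one for the terminator.  The Python
below only models the sizing; the helper names are hypothetical and the
printable test is an ASCII approximation of isprint().

```python
def escape_name(raw: bytes) -> bytes:
    """Escape non-printable bytes as \\xNN, leaving printable ASCII alone."""
    out = bytearray()
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(b)
        else:
            out += b"\\x%02x" % b  # 4 output bytes per input byte
    return bytes(out)


def escape_buffer_size(raw: bytes) -> int:
    """Worst-case C buffer size: 4 bytes per input byte plus the NUL."""
    return 4 * len(raw) + 1
```

For a name made up of a single four-byte UTF-8 emoji, the escaped form
is 16 bytes long, so a buffer of 4 * 4 bytes leaves no room for the
terminator.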

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: 396cd0223598bb ("xfs_scrub: warn about suspicious characters in directory/xattr names")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: add command to copy directory trees out of filesystems
Darrick J. Wong [Mon, 24 Feb 2025 18:22:08 +0000 (10:22 -0800)]
xfs_db: add command to copy directory trees out of filesystems

Ahead of deprecating V4 support in the kernel, let's give people a way
to extract their files from a filesystem without needing to mount.  The
libxfs code won't be removed from the kernel until 2030 and xfsprogs
effectively builds with XFS_SUPPORT_V4=y so that'll give us five years
of releases for archaeologists to draw from.  Also, doing this in
userspace gives people a way to recover files in an unprivileged
container for better safety.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: make listdir more generally useful
Darrick J. Wong [Mon, 24 Feb 2025 18:22:07 +0000 (10:22 -0800)]
xfs_db: make listdir more generally useful

Enhance the current directory entry iteration code in xfs_db to be more
generally useful by allowing callers to pass around a transaction, a
callback function, and a private pointer.  This will be used in the next
patch to iterate directories when we want to copy their contents out of
the filesystem into a directory.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
5 months agoxfs_db: use an empty transaction to try to prevent livelocks in path_navigate
Darrick J. Wong [Mon, 24 Feb 2025 18:22:07 +0000 (10:22 -0800)]
xfs_db: use an empty transaction to try to prevent livelocks in path_navigate

A couple of patches from now we're going to reuse the path_walk code in
a new xfs_db subcommand that tries to recover directory trees from
old/damaged filesystems.  Let's pass around an empty transaction to try
to avoid livelocks on malicious/broken metadata.  This is not
completely foolproof, but it's quick enough for most purposes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
5 months agoxfs_db: pass const pointers when we're not modifying them
Darrick J. Wong [Mon, 24 Feb 2025 18:22:07 +0000 (10:22 -0800)]
xfs_db: pass const pointers when we're not modifying them

Pass a const pointer to path_walk since we don't actually modify the
contents.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
5 months agomkfs: enable reflink on the realtime device
Darrick J. Wong [Mon, 24 Feb 2025 18:22:07 +0000 (10:22 -0800)]
mkfs: enable reflink on the realtime device

Allow the creation of filesystems with both reflink and realtime volumes
enabled.  For now we don't support a realtime extent size > 1.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agomkfs: validate CoW extent size hint when rtinherit is set
Darrick J. Wong [Mon, 24 Feb 2025 18:22:07 +0000 (10:22 -0800)]
mkfs: validate CoW extent size hint when rtinherit is set

Extent size hints exist to nudge the behavior of the file data block
allocator towards trying to make aligned allocations.  Therefore, it
doesn't make sense to allow a hint that isn't a multiple of the
fundamental allocation unit for a given file.

This means that if the sysadmin is formatting with rtinherit set on the
root dir, validate_cowextsize_hint needs to check the hint value on a
simulated realtime file to make sure that it's correct.  This hasn't
been necessary in the past since one cannot have a CoW hint without a
reflink filesystem, and we previously didn't allow rt reflink
filesystems.
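
The rule being enforced can be sketched as follows, assuming the hint
and the realtime extent size are both expressed in filesystem blocks.
This is not the validate_cowextsize_hint code; the names are
hypothetical.

```python
def alloc_unit_blocks(is_realtime, rextsize_blocks):
    """Fundamental allocation unit for a file, in fs blocks."""
    return rextsize_blocks if is_realtime else 1


def cowextsize_hint_ok(hint_blocks, is_realtime, rextsize_blocks):
    """A hint is acceptable if it is unset (0) or a multiple of the
    file's allocation unit."""
    if hint_blocks == 0:
        return True
    return hint_blocks % alloc_unit_blocks(is_realtime, rextsize_blocks) == 0
```

With rtinherit set on the root directory, new files land on the
realtime device, so the hint has to be checked with is_realtime=True.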

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_logprint: report realtime CUIs
Darrick J. Wong [Mon, 24 Feb 2025 18:22:06 +0000 (10:22 -0800)]
xfs_logprint: report realtime CUIs

Decode the CUI format just enough to report if a CUI targets the
realtime device or not.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: validate CoW extent size hint on rtinherit directories
Darrick J. Wong [Mon, 24 Feb 2025 18:22:06 +0000 (10:22 -0800)]
xfs_repair: validate CoW extent size hint on rtinherit directories

XFS allows a sysadmin to change the rt extent size when adding a rt
section to a filesystem after formatting.  If there are any directories
with both a cowextsize hint and rtinherit set, the hint could become
misaligned with the new rextsize.  Offer to fix the problem if we're in
modify mode and the verifier didn't trip.  If we're in dry run mode,
we let the kernel fix it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: allow realtime files to have the reflink flag set
Darrick J. Wong [Mon, 24 Feb 2025 18:22:06 +0000 (10:22 -0800)]
xfs_repair: allow realtime files to have the reflink flag set

Now that we allow reflink on the realtime volume, allow that combination
of inode flags if the feature's enabled.  Note that we now allow inodes
to have rtinherit even if there's no realtime volume, since the kernel
has never restricted that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: rebuild the realtime refcount btree
Darrick J. Wong [Mon, 24 Feb 2025 18:22:06 +0000 (10:22 -0800)]
xfs_repair: rebuild the realtime refcount btree

Use the collected reference count information to rebuild the btree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: reject unwritten shared extents
Darrick J. Wong [Mon, 24 Feb 2025 18:22:06 +0000 (10:22 -0800)]
xfs_repair: reject unwritten shared extents

We don't allow sharing of unwritten extents, which means that repair
should reject an unwritten extent if someone else has already claimed
the space.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: check existing realtime refcountbt entries against observed refcounts
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: compute refcount data for the realtime groups
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: compute refcount data for the realtime groups

At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.
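
Conceptually, deriving refcounts from rmaps is a sweep over extent
boundaries: every block range covered by two or more reverse mappings
gets a refcount record.  The sketch below shows that idea on plain
(start, length) tuples; it is a simplification and not the algorithm
xfs_repair actually implements.

```python
def refcounts_from_rmaps(rmaps):
    """Compute shared ranges from reverse mappings.

    rmaps: iterable of (start_block, length) tuples.
    Returns a list of (start_block, length, refcount) for every range
    with refcount >= 2, in ascending order.
    """
    events = []
    for start, length in rmaps:
        events.append((start, 1))            # mapping begins
        events.append((start + length, -1))  # mapping ends
    events.sort()

    out = []
    depth = 0    # number of mappings covering the current position
    prev = None  # previous boundary
    for pos, delta in events:
        if prev is not None and pos > prev and depth >= 2:
            last = out[-1] if out else None
            if last and last[0] + last[1] == prev and last[2] == depth:
                # Contiguous with the same refcount: extend the record.
                out[-1] = (last[0], last[1] + pos - prev, depth)
            else:
                out.append((prev, pos - prev, depth))
        depth += delta
        prev = pos
    return out
```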

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: find and mark the rtrefcountbt inode
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: find and mark the rtrefcountbt inode

Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: use realtime refcount btree data to check block types
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: use realtime refcount btree data to check block types

Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: allow CoW staging extents in the realtime rmap records
Darrick J. Wong [Mon, 24 Feb 2025 18:22:05 +0000 (10:22 -0800)]
xfs_repair: allow CoW staging extents in the realtime rmap records

Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports reflink.  As far as
reporting leftover staging extents, we'll report them when we scan the
rt refcount btree, in a future patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_spaceman: report health of the realtime refcount btree
Darrick J. Wong [Mon, 24 Feb 2025 18:22:04 +0000 (10:22 -0800)]
xfs_spaceman: report health of the realtime refcount btree

Report the health of the realtime reference count btree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: add rtrefcount reservations to the rgresv command
Darrick J. Wong [Mon, 24 Feb 2025 18:22:04 +0000 (10:22 -0800)]
xfs_db: add rtrefcount reservations to the rgresv command

Report rt refcount btree reservations in the rgresv subcommand output.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: copy the realtime refcount btree
Darrick J. Wong [Mon, 24 Feb 2025 18:22:04 +0000 (10:22 -0800)]
xfs_db: copy the realtime refcount btree

Copy the realtime refcountbt when we're metadumping the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: support the realtime refcountbt
Darrick J. Wong [Mon, 24 Feb 2025 18:22:04 +0000 (10:22 -0800)]
xfs_db: support the realtime refcountbt

Wire up various parts of xfs_db for realtime refcount support so that we
can dump the rt refcount btree contents.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: display the realtime refcount btree contents
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
xfs_db: display the realtime refcount btree contents

Implement all the code we need to dump rtrefcountbt contents, starting
from the inode root.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoman: document userspace API changes due to rt reflink
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
man: document userspace API changes due to rt reflink

Update documentation to describe userspace ABI changes made for realtime
reflink support.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agolibfrog: enable scrubbing of the realtime refcount data
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
libfrog: enable scrubbing of the realtime refcount data

Add a new entry so that we can scrub the rtrefcountbt and its metadata
directory tree path.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agolibxfs: apply rt extent alignment constraints to CoW extsize hint
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
libxfs: apply rt extent alignment constraints to CoW extsize hint

The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint.  Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.

Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agolibxfs: add a realtime flag to the refcount update log redo items
Darrick J. Wong [Mon, 24 Feb 2025 18:22:03 +0000 (10:22 -0800)]
libxfs: add a realtime flag to the refcount update log redo items

Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt.  We'll
wire up the actual refcount code later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agolibxfs: compute the rt refcount btree maxlevels during initialization
Darrick J. Wong [Mon, 24 Feb 2025 18:22:02 +0000 (10:22 -0800)]
libxfs: compute the rt refcount btree maxlevels during initialization

Compute max rt refcount btree height information when we set up libxfs.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agomkfs: create the realtime rmap inode
Darrick J. Wong [Mon, 24 Feb 2025 18:22:02 +0000 (10:22 -0800)]
mkfs: create the realtime rmap inode

Create a realtime rmapbt inode if we format the fs with realtime
and rmap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_logprint: report realtime RUIs
Darrick J. Wong [Mon, 24 Feb 2025 18:22:02 +0000 (10:22 -0800)]
xfs_logprint: report realtime RUIs

Decode the RUI format just enough to report if an RUI targets the
realtime device or not.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: reserve per-AG space while rebuilding rt metadata
Darrick J. Wong [Mon, 24 Feb 2025 18:22:02 +0000 (10:22 -0800)]
xfs_repair: reserve per-AG space while rebuilding rt metadata

Realtime metadata btrees can consume quite a bit of space on a full
filesystem.  Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata.  This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: rebuild the bmap btree for realtime files
Darrick J. Wong [Mon, 24 Feb 2025 18:22:02 +0000 (10:22 -0800)]
xfs_repair: rebuild the bmap btree for realtime files

Use the realtime rmap btree information to rebuild an inode's data fork
when appropriate.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: check for global free space concerns with default btree slack levels
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: check for global free space concerns with default btree slack levels

It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full.  If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space.  The
solution to this is to pack the new blocks completely full if we fear
running out of space.

Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.

Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.
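
A back-of-the-envelope model of that decision, assuming a btree is
described only by its record count and per-block capacities: estimate
the blocks every btree would need at 75% fill, sum them, and pack at
100% if the total exceeds the global free space.  The function names
and the tuple layout are hypothetical.

```python
import math


def btree_blocks(nr_records, leaf_cap, node_cap, fill=0.75):
    """Estimate the blocks needed for a btree at the given fill factor."""
    if nr_records == 0:
        return 1  # an empty root block
    per_leaf = max(1, int(leaf_cap * fill))
    per_node = max(2, int(node_cap * fill))
    level = math.ceil(nr_records / per_leaf)  # leaf blocks
    total = level
    while level > 1:
        level = math.ceil(level / per_node)   # next level up
        total += level
    return total


def should_pack_full(btrees, free_blocks):
    """btrees: list of (nr_records, leaf_cap, node_cap) tuples.

    Returns True if rebuilding everything at 75% would not fit in
    free_blocks, meaning the rebuild should pack blocks completely.
    """
    need = sum(btree_blocks(*bt, fill=0.75) for bt in btrees)
    return need > free_blocks
```

For example, 1000 records with 100 records per block need 15 blocks at
75% fill but only 11 at 100%.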

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: rebuild the realtime rmap btree
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: rebuild the realtime rmap btree

Rebuild the realtime rmap btree file from the reverse mapping records we
gathered from walking the inodes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: always check realtime file mappings against incore info
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: always check realtime file mappings against incore info

Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3).  As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.

Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state.  The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork.  We already
do this for regular data files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: check existing realtime rmapbt entries against observed rmaps
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: find and mark the rtrmapbt inodes
Darrick J. Wong [Mon, 24 Feb 2025 18:22:01 +0000 (10:22 -0800)]
xfs_repair: find and mark the rtrmapbt inodes

Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: refactor realtime inode check
Darrick J. Wong [Mon, 24 Feb 2025 18:22:00 +0000 (10:22 -0800)]
xfs_repair: refactor realtime inode check

Refactor the realtime bitmap and summary checks into a helper function.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: create a new set of incore rmap information for rt groups
Darrick J. Wong [Mon, 24 Feb 2025 18:22:00 +0000 (10:22 -0800)]
xfs_repair: create a new set of incore rmap information for rt groups

Create a parallel set of "xfs_ag_rmap" structures to cache information
about reverse mappings for the realtime groups.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: use realtime rmap btree data to check block types
Darrick J. Wong [Mon, 24 Feb 2025 18:22:00 +0000 (10:22 -0800)]
xfs_repair: use realtime rmap btree data to check block types

Use the realtime rmap btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: flag suspect long-format btree blocks
Darrick J. Wong [Mon, 24 Feb 2025 18:22:00 +0000 (10:22 -0800)]
xfs_repair: flag suspect long-format btree blocks

Pass a "suspect" counter through scan_lbtree just like we do for
short-format btree blocks, and increment its value when we encounter
blocks with bad CRCs or outright corruption.  This makes it so that
repair actually catches bmbt blocks with bad crcs or other verifier
errors.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_repair: tidy up rmap_diffkeys
Darrick J. Wong [Mon, 24 Feb 2025 18:21:59 +0000 (10:21 -0800)]
xfs_repair: tidy up rmap_diffkeys

Tidy up the comparison code in this function to match the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_spaceman: report health status of the realtime rmap btree
Darrick J. Wong [Mon, 24 Feb 2025 18:21:59 +0000 (10:21 -0800)]
xfs_spaceman: report health status of the realtime rmap btree

Add reporting of the rt rmap btree health to spaceman.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: add an rgresv command
Darrick J. Wong [Mon, 24 Feb 2025 18:21:59 +0000 (10:21 -0800)]
xfs_db: add an rgresv command

Create a command to dump rtgroup btree space reservations.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: make fsmap query the realtime reverse mapping tree
Darrick J. Wong [Mon, 24 Feb 2025 18:21:59 +0000 (10:21 -0800)]
xfs_db: make fsmap query the realtime reverse mapping tree

Extend the 'fsmap' debugger command to support querying the realtime
rmap btree via a new -r argument.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: copy the realtime rmap btree
Darrick J. Wong [Mon, 24 Feb 2025 18:21:58 +0000 (10:21 -0800)]
xfs_db: copy the realtime rmap btree

Copy the realtime rmapbt when we're metadumping the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: support the realtime rmapbt
Darrick J. Wong [Mon, 24 Feb 2025 18:21:58 +0000 (10:21 -0800)]
xfs_db: support the realtime rmapbt

Wire up various parts of xfs_db for realtime rmap support so that we can
dump the btree contents.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: display the realtime rmap btree contents
Darrick J. Wong [Mon, 24 Feb 2025 18:21:58 +0000 (10:21 -0800)]
xfs_db: display the realtime rmap btree contents

Implement all the code we need to dump rtrmapbt contents, starting
from the inode root.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: don't abort when bmapping on a non-extents/bmbt fork
Darrick J. Wong [Mon, 24 Feb 2025 18:21:58 +0000 (10:21 -0800)]
xfs_db: don't abort when bmapping on a non-extents/bmbt fork

We're going to introduce new fork formats, so let's fix the problem that
xfs_db's bmap command aborts when the fork format isn't one of the
existing ones.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
5 months agoxfs_db: compute average btree height
Darrick J. Wong [Mon, 24 Feb 2025 18:21:58 +0000 (10:21 -0800)]
xfs_db: compute average btree height

Compute the btree height assuming that the blocks are 75% full.
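
As an illustration of the estimate, under the assumption that every
block holds 75% of its capacity: count the leaf blocks, then keep
dividing by the per-node fanout until one block is left.  The function
and its parameters are hypothetical, not the xfs_db implementation.

```python
import math


def btree_height(nr_records, leaf_cap, node_cap, fill=0.75):
    """Estimate btree height when blocks are filled to `fill`."""
    per_leaf = max(1, int(leaf_cap * fill))
    per_node = max(2, int(node_cap * fill))
    blocks = max(1, math.ceil(nr_records / per_leaf))
    height = 1
    while blocks > 1:
        blocks = math.ceil(blocks / per_node)
        height += 1
    return height
```

With 100 entries per block, one million records give a height of 4 at
75% fill and 3 when blocks are completely full.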

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>