www.infradead.org Git - users/hch/xfsprogs.git/log
users/hch/xfsprogs.git
7 months agoxfs_mkfs: limit user capacity to explicitly requested size xfs-zoned-2025-01-08
Christoph Hellwig [Thu, 12 Dec 2024 06:54:55 +0000 (07:54 +0100)]
xfs_mkfs: limit user capacity to explicitly requested size

When rounding up the size specified on the command line to the zone
capacity, the extra blocks are currently added to the user capacity.
Switch to adding them to the reserved blocks instead so that the
user capacity exactly matches what the user requested.

Signed-off-by: Christoph Hellwig <hch@lst.de>
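
The rounding described in the commit above can be illustrated with a minimal sketch. It assumes the requested size and the zone capacity are both given in file system blocks and it ignores the separate overprovisioning zones; the function and key names are hypothetical and are not mkfs.xfs identifiers.

```python
def split_requested_size(requested_blocks, zone_capacity, reserved_blocks=0):
    """Round a requested size up to whole zones (hypothetical helper).

    The blocks gained by rounding are added to the reserved pool, so the
    user capacity stays exactly at the requested value.
    """
    if requested_blocks <= 0 or zone_capacity <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: number of zones needed to back the request.
    nr_zones = -(-requested_blocks // zone_capacity)
    extra = nr_zones * zone_capacity - requested_blocks
    return {
        "nr_zones": nr_zones,
        "user_blocks": requested_blocks,
        "reserved_blocks": reserved_blocks + extra,
    }
```

For a request of 1000 blocks on zones of 256 blocks this yields four zones, with the 24 rounding blocks counted as reserved rather than as user capacity.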
7 months agoxfs_mkfs: factor out a adjust_nr_zones helper
Christoph Hellwig [Thu, 12 Dec 2024 06:49:49 +0000 (07:49 +0100)]
xfs_mkfs: factor out a adjust_nr_zones helper

Split the code to add the OP zones to the command line size into a
separate helper.  The logic is already pretty complex in a complex
function and will become even more so.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_growfs: support internal RT devices
Christoph Hellwig [Sat, 23 Nov 2024 08:27:35 +0000 (09:27 +0100)]
xfs_growfs: support internal RT devices

Allow RT growfs when rtstart is set in the geometry, and adjust the
queried size for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_mdrestore: support internal RT devices
Christoph Hellwig [Mon, 18 Nov 2024 16:35:30 +0000 (17:35 +0100)]
xfs_mdrestore: support internal RT devices

Calculate the size properly for internal RT devices and skip restoring
to the external one for this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_db: support metadump of internal rtdev configurations
Christoph Hellwig [Mon, 18 Nov 2024 16:32:52 +0000 (17:32 +0100)]
xfs_db: support metadump of internal rtdev configurations

Check for an internal RT device and force a v2 format without
setting the realtime_data flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_scrub: handle internal RT devices
Christoph Hellwig [Thu, 14 Nov 2024 09:30:03 +0000 (10:30 +0100)]
xfs_scrub: handle internal RT devices

Handle the synthetic fmr_device values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_scrub: cleanup fsmap keys initialization
Christoph Hellwig [Thu, 14 Nov 2024 09:28:33 +0000 (10:28 +0100)]
xfs_scrub: cleanup fsmap keys initialization

Use the good old array notation instead of pointer arithmetic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_scrub: support internal RT sections
Christoph Hellwig [Thu, 31 Oct 2024 06:18:14 +0000 (07:18 +0100)]
xfs_scrub: support internal RT sections

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_spaceman: handle internal RT devices
Christoph Hellwig [Thu, 14 Nov 2024 09:27:26 +0000 (10:27 +0100)]
xfs_spaceman: handle internal RT devices

Handle the synthetic fmr_device values for fsmap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_io: handle internal RT devices in fsmap output
Christoph Hellwig [Thu, 14 Nov 2024 08:21:37 +0000 (09:21 +0100)]
xfs_io: handle internal RT devices in fsmap output

Deal with the synthetic fmr_device values and the rt device offset when
calculating RG numbers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_io: don't re-query geometry information in fsmap_f
Christoph Hellwig [Thu, 14 Nov 2024 08:19:25 +0000 (09:19 +0100)]
xfs_io: don't re-query geometry information in fsmap_f

But use the information stored in "file".

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_io: don't re-query fs_path information in fsmap_f
Christoph Hellwig [Thu, 14 Nov 2024 08:18:27 +0000 (09:18 +0100)]
xfs_io: don't re-query fs_path information in fsmap_f

But reuse the information stashed in "file".

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_io: correctly report RGs with internal rt dev in bmap output
Christoph Hellwig [Thu, 14 Nov 2024 10:21:00 +0000 (11:21 +0100)]
xfs_io: correctly report RGs with internal rt dev in bmap output

Apply the proper offset.  Somehow this made gcc complain about
possible overflowing abuf, so increase the size for that as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoman: document XFS_FSOP_GEOM_FLAGS_ZONED
Christoph Hellwig [Fri, 29 Nov 2024 09:12:42 +0000 (10:12 +0100)]
man: document XFS_FSOP_GEOM_FLAGS_ZONED

Document the new zoned feature flag and the two new fields added
with it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agolibfrog: report the zoned geometry
Christoph Hellwig [Wed, 13 Nov 2024 08:33:56 +0000 (09:33 +0100)]
libfrog: report the zoned geometry

Also fix up to report all the zoned information in a separate line,
which also helps with alignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_mkfs: document the new zoned options in the man page
Christoph Hellwig [Fri, 29 Nov 2024 09:33:30 +0000 (10:33 +0100)]
xfs_mkfs: document the new zoned options in the man page

Add documentation for the zoned file system specific options.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_mkfs: reflink conflicts with zoned file systems for now
Christoph Hellwig [Fri, 16 Aug 2024 18:23:12 +0000 (20:23 +0200)]
xfs_mkfs: reflink conflicts with zoned file systems for now

Until GC is enhanced to not unshare reflinked blocks, we had better
prohibit this combination.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_mkfs: default to rtinherit=1 for zoned file systems
Christoph Hellwig [Mon, 11 Nov 2024 05:18:00 +0000 (06:18 +0100)]
xfs_mkfs: default to rtinherit=1 for zoned file systems

Zoned file systems are intended to use sequential write required zones
(or areas treated as such) for data, and the main data device only for
metadata.  rtinherit=1 is the way to achieve that, so enable it by
default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_mkfs: calculate zone overprovisioning when specifying size
Hans Holmberg [Tue, 10 Sep 2024 08:55:02 +0000 (08:55 +0000)]
xfs_mkfs: calculate zone overprovisioning when specifying size

When size is specified for zoned file systems, calculate the required
over provisioning to back the requested capacity.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_mkfs: support creating zoned file systems
Christoph Hellwig [Tue, 8 Oct 2024 07:49:32 +0000 (09:49 +0200)]
xfs_mkfs: support creating zoned file systems

Default to using all sequential write required zones for the RT device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_repair: validate rt groups vs reported hardware zones
Christoph Hellwig [Tue, 26 Nov 2024 07:35:31 +0000 (08:35 +0100)]
xfs_repair: validate rt groups vs reported hardware zones

Run a report zones ioctl, and verify the rt group state vs the
reported hardware zone state.  Note that there is no way to actually
fix up any discrepancies here, as that would be rather scary without
having transactions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_repair: fix the RT device check in process_dinode_int
Christoph Hellwig [Thu, 31 Oct 2024 04:07:30 +0000 (05:07 +0100)]
xfs_repair: fix the RT device check in process_dinode_int

Don't look at the variable for the rtname command line option, but
the actual file system geometry.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs_repair: support repairing zoned file systems
Christoph Hellwig [Tue, 8 Oct 2024 07:47:20 +0000 (09:47 +0200)]
xfs_repair: support repairing zoned file systems

Not really much to do here.  Mostly ignore the validation and
regeneration of the bitmap and summary inodes.  Eventually this
could grow a bit of validation of the hardware zone state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agolibfrog: report the zoned flag
Christoph Hellwig [Thu, 24 Oct 2024 08:56:44 +0000 (10:56 +0200)]
libfrog: report the zoned flag

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoFIXUP: xfs: support zone gaps
Christoph Hellwig [Mon, 18 Nov 2024 05:46:20 +0000 (06:46 +0100)]
FIXUP: xfs: support zone gaps

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: support zone gaps
Christoph Hellwig [Fri, 6 Dec 2024 10:27:09 +0000 (19:27 +0900)]
xfs: support zone gaps

Source kernel commit: 42e8f63a1eda25f4ee2574b8191f7534a3954375

Zoned devices can have gaps between the usable capacity of a zone and the
end of the zone in the LBA/daddr address space.  In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us.  In this case the sparse FSB/RTB address space maps 1:1
to the device address space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
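
As a rough illustration of the address mapping described above, the sketch below converts a (group, block-in-group) pair to a device block number. It assumes the sparse address space uses a power-of-two stride per group; the function name and the `has_gaps` flag are hypothetical, not libxfs identifiers.

```python
def rt_block_to_device_block(rgno, rgbno, zone_capacity, has_gaps):
    """Map a block inside an RT group to a device block (illustrative only).

    Assumption: the sparse FSB/RTB address space gives every group a
    power-of-two stride that is at least as large as the zone capacity.
    """
    if not 0 <= rgbno < zone_capacity:
        raise ValueError("rgbno lies outside the usable zone capacity")
    stride = 1 << (zone_capacity - 1).bit_length()  # next power of two
    sparse = rgno * stride + rgbno
    if has_gaps:
        # The device already pads each zone to the stride, so the sparse
        # address maps 1:1 to the device address.
        return sparse
    # Without gaps the groups are packed back to back on the device.
    return rgno * zone_capacity + rgbno
```

With a capacity of 3 blocks the stride is 4, so block 1 of group 2 is device block 9 on a device with gaps and device block 7 on one without.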
7 months agoxfs: enable the zoned RT device feature
Christoph Hellwig [Fri, 6 Dec 2024 10:25:34 +0000 (19:25 +0900)]
xfs: enable the zoned RT device feature

Source kernel commit: dd0cba0a359cba24d61b8b3b48d2020742ecac40

Enable the zoned RT device directory feature.  With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks.  This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: enable fsmap reporting for internal RT devices
Christoph Hellwig [Fri, 6 Dec 2024 10:22:12 +0000 (19:22 +0900)]
xfs: enable fsmap reporting for internal RT devices

Source kernel commit: 647ef1b678020b9431ec8f30d8318272b523ea03

File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs.  To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: implement zoned garbage collection
Christoph Hellwig [Fri, 6 Dec 2024 10:21:53 +0000 (19:21 +0900)]
xfs: implement zoned garbage collection

Source kernel commit: f8e8f198c983d5738b3ea3bc812515537901b992

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To empty zones, the rmap is walked to find the
owners and the data is read and then written to the new place.

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed from the one recorded at the start of
the GC cycle and doesn't update the mapping if it has changed.

Note: selection of garbage collection victims is extremely simple at the
moment and will probably see additional near term improvements.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
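
The three ideas in the commit message above (pick the least used zone, relocate in inode/offset order, skip the remap if the mapping changed) can be sketched as follows. This is a simplified model under stated assumptions, not the kernel implementation; all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RmapRecord:
    inode: int    # owner of the extent
    offset: int   # logical offset within the file
    start: int    # first block inside the zone
    length: int   # number of blocks


def pick_victim(used_blocks_by_zone: Dict[int, int]) -> Optional[int]:
    """Return the zone with the fewest used blocks, skipping empty zones."""
    candidates = {z: u for z, u in used_blocks_by_zone.items() if u > 0}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def relocation_order(records: List[RmapRecord]) -> List[RmapRecord]:
    """Sort by owner and logical offset so each file is rewritten contiguously."""
    return sorted(records, key=lambda r: (r.inode, r.offset))


def finish_move(current_mapping, recorded_mapping, new_mapping):
    """Only install the new mapping if nobody changed it during the copy."""
    if current_mapping != recorded_mapping:
        return current_mapping
    return new_mapping
```

Sorting by `(inode, offset)` is what turns interleaved parallel writes into contiguous per-file runs when the data is written to its new location.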
7 months agoFIXUP: xfs: add support for zoned space reservations
Christoph Hellwig [Mon, 18 Nov 2024 05:57:52 +0000 (06:57 +0100)]
FIXUP: xfs: add support for zoned space reservations

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: add support for zoned space reservations
Christoph Hellwig [Fri, 6 Dec 2024 10:21:10 +0000 (19:21 +0900)]
xfs: add support for zoned space reservations

Source kernel commit: e664c681e622774e932446e2c19c71bd4ecf0808

For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock can
deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available it wakes and waits for GC.  This is
done using a list of on-stack reservations to ensure fairness.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
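
A toy model of the fair reservation scheme described above: blocks are taken from an "available" counter, and callers that cannot be satisfied queue up in arrival order until GC frees space. The class and method names are hypothetical, and the real code blocks the caller instead of returning a flag.

```python
from collections import deque


class AvailablePool:
    """Simplified stand-in for a counter of blocks writable without GC."""

    def __init__(self, available):
        self.available = available
        self.waiters = deque()  # (ticket, blocks) in arrival order

    def reserve(self, ticket, blocks):
        """Take blocks now, or queue behind earlier waiters."""
        # Never overtake a queued waiter, even if this request would fit.
        if not self.waiters and self.available >= blocks:
            self.available -= blocks
            return True
        self.waiters.append((ticket, blocks))
        return False  # the caller would wake GC and sleep here

    def gc_freed(self, blocks):
        """GC made blocks available; satisfy waiters strictly in order."""
        self.available += blocks
        granted = []
        while self.waiters and self.waiters[0][1] <= self.available:
            ticket, need = self.waiters.popleft()
            self.available -= need
            granted.append(ticket)
        return granted
```

Because the reservation is taken before any lock that GC also needs, a waiter in this model never holds anything GC depends on, which is the deadlock-avoidance point made in the commit message.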
7 months agoxfs: add the zoned space allocator
Christoph Hellwig [Fri, 6 Dec 2024 10:20:46 +0000 (19:20 +0900)]
xfs: add the zoned space allocator

Source kernel commit: f115b375056165e97708d4c09179908687bbbc87

For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.

The actual allocation algorithm is very simple and just involves
picking a good zone, preferably the one used for the last write to the
inode.  As the number of zones that can be written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible from the iomap dio and buffered writeback bio submission
helpers just before submitting the bio.

Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup.  Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme used
for that is very similar to the reflink COW end I/O path.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
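
The allocation policy described above can be sketched in a few lines: space always comes from the write pointer of an open zone, and the zone last used by the inode is tried first. This is an illustrative model only; `OpenZone`, `allocate` and the bookkeeping dictionary are hypothetical names.

```python
class OpenZone:
    """A zone that is currently open for writing (simplified)."""

    def __init__(self, zone_no, capacity):
        self.zone_no = zone_no
        self.capacity = capacity
        self.write_pointer = 0

    def free_blocks(self):
        return self.capacity - self.write_pointer


def allocate(open_zones, last_zone_by_inode, inode, blocks):
    """Return (zone_no, start_block, count) or None if no open zone has room."""
    preferred = last_zone_by_inode.get(inode)
    # Stable sort: the zone last used by this inode comes first.
    ordered = sorted(open_zones, key=lambda z: z.zone_no != preferred)
    for zone in ordered:
        if zone.free_blocks() > 0:
            count = min(blocks, zone.free_blocks())
            start = zone.write_pointer  # always allocate at the write pointer
            zone.write_pointer += count
            last_zone_by_inode[inode] = zone.zone_no
            return zone.zone_no, start, count
    return None
```

A short allocation (fewer blocks than requested) is returned when the chosen zone runs out of room; the caller in this model would simply call `allocate` again for the remainder.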
7 months agolibxfs: wire up xfs_zones.c to the build system
Christoph Hellwig [Sun, 24 Nov 2024 08:22:56 +0000 (09:22 +0100)]
libxfs: wire up xfs_zones.c to the build system

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: parse and validate hardware zone information
Christoph Hellwig [Mon, 18 Nov 2024 05:40:04 +0000 (06:40 +0100)]
xfs: parse and validate hardware zone information

Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
[hch: manual libxfs import because running it through libxfs-apply
 crashes git]
Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: disable sb_frextents for zoned file systems
Christoph Hellwig [Fri, 6 Dec 2024 10:18:21 +0000 (19:18 +0900)]
xfs: disable sb_frextents for zoned file systems

Source kernel commit: c2213f89785d5af01654add5ac42a334622a0189

Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information.  Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: export zoned geometry via XFS_FSOP_GEOM
Christoph Hellwig [Fri, 6 Dec 2024 10:17:33 +0000 (19:17 +0900)]
xfs: export zoned geometry via XFS_FSOP_GEOM

Source kernel commit: c0239f67c6bd8f0c5f128c1ba2cf166cf195a2f2

Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoFIXUP: xfs: allow internal RT devices for zoned mode
Christoph Hellwig [Mon, 18 Nov 2024 05:45:27 +0000 (06:45 +0100)]
FIXUP: xfs: allow internal RT devices for zoned mode

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: allow internal RT devices for zoned mode
Christoph Hellwig [Fri, 6 Dec 2024 10:16:20 +0000 (19:16 +0900)]
xfs: allow internal RT devices for zoned mode

Source kernel commit: 0985ca5d0c32c69d9f136bff5c16a0ce7b293381

Allow creating an RT subvolume on the same device as the main data
device.  This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoFIXUP: xfs: define the zoned on-disk format
Christoph Hellwig [Mon, 18 Nov 2024 05:50:48 +0000 (06:50 +0100)]
FIXUP: xfs: define the zoned on-disk format

7 months agoxfs: define the zoned on-disk format
Christoph Hellwig [Fri, 6 Dec 2024 10:15:27 +0000 (19:15 +0900)]
xfs: define the zoned on-disk format

Source kernel commit: 9d59b6280b1a2a4164520342f7085c4617e99e36

Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are a few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, and thus the sb_rbmblocks
superblock field must be cleared to zero

2) there is a new superblock field that specifies the start of an
internal RT section.  This allows supporting SMR HDDs that have random
writable space at the beginning which is used for the XFS data device
(which really is the metadata device for this configuration), directly
followed by a RT device on the same block device.  While something
similar could be achieved using dm-linear, just having a single device
directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
overprovisioning) that is never used for user capacity, but allows GC
to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that we
can persistently track the total amount of bytes currently used in
a RT group.  There is no data structure other than the rmap that
tracks used space in an RT group, and this counter is used to decide
when a RT group has been entirely emptied, and to select one that
is relatively empty if garbage collection needs to be performed.
While this counter could be tracked entirely in memory and rebuilt
from the rmap at mount time, that would lead to very long mount times
with the large number of RT groups implied by the number of hardware
zones especially on SMR hard drives with 256MB zone sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: add a xfs_rtrmap_first_unwritten_rgbno helper
Christoph Hellwig [Fri, 6 Dec 2024 10:14:50 +0000 (19:14 +0900)]
xfs: add a xfs_rtrmap_first_unwritten_rgbno helper

Source kernel commit: 69350313320023b5aa495c9c0b5b1e70a881ce8d

Add a helper to find the last offset mapped in the rtrmap.  This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
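
Conceptually, the helper described above finds the end of the highest mapped extent in a group, which is where sequential writing can resume when there is no hardware write pointer. A minimal sketch, assuming the reverse mappings are available as `(start, length)` pairs; the Python function is illustrative and only borrows its name from the commit title.

```python
def first_unwritten_rgbno(rmap_records):
    """Return the first block after the last mapped extent in a group.

    rmap_records is an iterable of (start_rgbno, length) pairs; an empty
    group yields 0, i.e. writing starts at the beginning.
    """
    return max((start + length for start, length in rmap_records), default=0)
```

Holes below the highest extent do not matter here: a sequentially written group can only continue after the last block that was ever written.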
7 months agoxfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
Christoph Hellwig [Fri, 6 Dec 2024 10:14:15 +0000 (19:14 +0900)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

Source kernel commit: 7ae4302cd4c25cda6dc8f87703558aee6bc24ac3

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation.  To support that pass the
bflags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoFIXUP: xfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Tue, 31 Oct 2023 07:52:32 +0000 (08:52 +0100)]
FIXUP: xfs: generalize the freespace and reserved blocks handling

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Fri, 6 Dec 2024 10:12:49 +0000 (19:12 +0900)]
xfs: generalize the freespace and reserved blocks handling

Source kernel commit: 2910b8fc8f3f3ad38c7200d8d6b7531213182053

The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in and special cased.

Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.

Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_count helpers
directly.  This also switches the flooring of the frextents counter
to 0 in statfs for the rtinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
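
The shape of the refactoring described above, with one array slot per counter and accessor helpers instead of direct counter access, can be modelled roughly like this. The enum, class and method names are hypothetical and do not correspond to the kernel identifiers.

```python
from enum import IntEnum


class FreeCounter(IntEnum):
    BLOCKS = 0      # data device blocks
    RTEXTENTS = 1   # realtime extents


class FreeSpace:
    """Free and reserved counters kept in arrays indexed by FreeCounter."""

    def __init__(self):
        self.free = [0] * len(FreeCounter)
        self.reserved = [0] * len(FreeCounter)

    def add(self, which, delta):
        # The in-memory counter may transiently go negative in this model.
        self.free[which] += delta

    def set_reserved(self, which, blocks):
        self.reserved[which] = blocks

    def statfs_free(self, which):
        # Report the free count minus the reserved pool, floored at zero.
        return max(self.free[which] - self.reserved[which], 0)
```

Indexing both arrays by the same enum is what lets a reserved pool exist for RT extents without duplicating the block-counter code paths.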
7 months agoxfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c
Christoph Hellwig [Mon, 18 Nov 2024 05:29:23 +0000 (06:29 +0100)]
xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c

Source kernel commit: 7ae3b7c5b938095251cd70ddbda5e5729cd5f8aa

Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.  Move it to xfs_iomap.c
next to the two callers.

Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: add a rtg_blocks helper
Christoph Hellwig [Fri, 6 Dec 2024 09:55:47 +0000 (18:55 +0900)]
xfs: add a rtg_blocks helper

Source kernel commit: 29b6c4726fa24986a2be2b288f942a41d98aaa7a

Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoFIXUP: xfs: constify feature checks
Christoph Hellwig [Tue, 5 Nov 2024 08:39:48 +0000 (09:39 +0100)]
FIXUP: xfs: constify feature checks

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoxfs: constify feature checks
Christoph Hellwig [Wed, 23 Oct 2024 11:24:48 +0000 (13:24 +0200)]
xfs: constify feature checks

Source kernel commit: 52a3f6d90a651c3eb14a4d7845934888d8eb9089

We'll need to call them on a const structure in growfs in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agoman: document rgextents geom field
Christoph Hellwig [Fri, 29 Nov 2024 09:11:32 +0000 (10:11 +0100)]
man: document rgextents geom field

Document the new rgextents geom field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
7 months agomkfs: small rgcount man page fixup
Christoph Hellwig [Fri, 29 Nov 2024 09:33:44 +0000 (10:33 +0100)]
mkfs: small rgcount man page fixup

All the other options that require a value spell that out; do the same
for the rgcount option.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agomkfs: enable reflink on the realtime device
Darrick J. Wong [Thu, 21 Nov 2024 00:25:03 +0000 (16:25 -0800)]
mkfs: enable reflink on the realtime device

Allow the creation of filesystems with both reflink and realtime volumes
enabled.  For now we don't support a realtime extent size > 1.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agomkfs: validate CoW extent size hint when rtinherit is set
Darrick J. Wong [Thu, 21 Nov 2024 00:25:03 +0000 (16:25 -0800)]
mkfs: validate CoW extent size hint when rtinherit is set

Extent size hints exist to nudge the behavior of the file data block
allocator towards trying to make aligned allocations.  Therefore, it
doesn't make sense to allow a hint that isn't a multiple of the
fundamental allocation unit for a given file.

This means that if the sysadmin is formatting with rtinherit set on the
root dir, validate_cowextsize_hint needs to check the hint value on a
simulated realtime file to make sure that it's correct.  This hasn't
been necessary in the past since one cannot have a CoW hint without a
reflink filesystem, and we previously didn't allow rt reflink
filesystems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
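
The alignment rule argued for above boils down to a modulo check against the allocation unit of the file that will inherit the hint. A minimal sketch, assuming all sizes are in file system blocks; the function name is hypothetical and the real mkfs validation applies further limits that are not modelled here.

```python
def cowextsize_hint_is_aligned(hint_blocks, rtinherit, rextsize_blocks):
    """Check that a CoW extent size hint is a multiple of the allocation unit.

    With rtinherit set, new files land on the realtime device, so the unit
    is the realtime extent size; otherwise it is a single block.
    """
    unit = rextsize_blocks if rtinherit else 1
    return hint_blocks % unit == 0
```

For example, a hint of 12 blocks is acceptable for data device files but is rejected once rtinherit is set with a realtime extent size of 8 blocks.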
8 months agoxfs_logprint: report realtime CUIs
Darrick J. Wong [Thu, 21 Nov 2024 00:25:03 +0000 (16:25 -0800)]
xfs_logprint: report realtime CUIs

Decode the CUI format just enough to report if a CUI targets the
realtime device or not.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: validate CoW extent size hint on rtinherit directories
Darrick J. Wong [Thu, 21 Nov 2024 00:25:03 +0000 (16:25 -0800)]
xfs_repair: validate CoW extent size hint on rtinherit directories

XFS allows a sysadmin to change the rt extent size when adding a rt
section to a filesystem after formatting.  If there are any directories
with both a cowextsize hint and rtinherit set, the hint could become
misaligned with the new rextsize.  Offer to fix the problem if we're in
modify mode and the verifier didn't trip.  If we're in dry run mode,
we let the kernel fix it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: allow realtime files to have the reflink flag set
Darrick J. Wong [Thu, 21 Nov 2024 00:25:02 +0000 (16:25 -0800)]
xfs_repair: allow realtime files to have the reflink flag set

Now that we allow reflink on the realtime volume, allow that combination
of inode flags if the feature's enabled.  Note that we now allow inodes
to have rtinherit even if there's no realtime volume, since the kernel
has never restricted that.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: rebuild the realtime refcount btree
Darrick J. Wong [Thu, 21 Nov 2024 00:25:02 +0000 (16:25 -0800)]
xfs_repair: rebuild the realtime refcount btree

Use the collected reference count information to rebuild the btree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: reject unwritten shared extents
Darrick J. Wong [Thu, 21 Nov 2024 00:25:02 +0000 (16:25 -0800)]
xfs_repair: reject unwritten shared extents

We don't allow sharing of unwritten extents, which means that repair
should reject an unwritten extent if someone else has already claimed
the space.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: check existing realtime refcountbt entries against observed refcounts
Darrick J. Wong [Thu, 21 Nov 2024 00:25:02 +0000 (16:25 -0800)]
xfs_repair: check existing realtime refcountbt entries against observed refcounts

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime refcount
btree (particularly if we're in -n mode) to detect rtrefcountbt
problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: compute refcount data for the realtime groups
Darrick J. Wong [Thu, 21 Nov 2024 00:25:02 +0000 (16:25 -0800)]
xfs_repair: compute refcount data for the realtime groups

At the end of phase 4, compute reference count information for realtime
groups from the realtime rmap information collected, just like we do for
AGs in the data section.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
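
One way to derive reference counts from reverse mappings is a sweep over extent start and end points, emitting a record wherever two or more mappings overlap. The sketch below shows that generic technique on `(start, length)` pairs; it is not the xfs_repair implementation and the function name is hypothetical.

```python
def refcounts_from_rmaps(rmaps):
    """Return [(start, length, refcount)] for every range mapped 2+ times."""
    events = []
    for start, length in rmaps:
        events.append((start, 1))            # a mapping begins
        events.append((start + length, -1))  # a mapping ends
    events.sort()

    result = []
    depth = 0    # number of mappings covering the current position
    prev = None  # position of the previous event
    for pos, delta in events:
        if prev is not None and pos > prev and depth >= 2:
            last = result[-1] if result else None
            if last and last[0] + last[1] == prev and last[2] == depth:
                # Same refcount continues: extend the previous record.
                result[-1] = (last[0], last[1] + pos - prev, depth)
            else:
                result.append((prev, pos - prev, depth))
        depth += delta
        prev = pos
    return result
```

Two mappings covering blocks 0-9 and 5-14 share blocks 5-9, so the sweep yields a single record `(5, 5, 2)`.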
8 months agoxfs_repair: find and mark the rtrefcountbt inode
Darrick J. Wong [Thu, 21 Nov 2024 00:25:01 +0000 (16:25 -0800)]
xfs_repair: find and mark the rtrefcountbt inode

Make sure that we find the realtime refcountbt inode and mark it
appropriately, just in case we find a rogue inode claiming to
be an rtrefcount, or just plain garbage in the superblock field.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: use realtime refcount btree data to check block types
Darrick J. Wong [Thu, 21 Nov 2024 00:25:01 +0000 (16:25 -0800)]
xfs_repair: use realtime refcount btree data to check block types

Use the realtime refcount btree to pre-populate the block type information
so that when repair iterates the primary metadata, we can confirm the
block type.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: allow CoW staging extents in the realtime rmap records
Darrick J. Wong [Thu, 21 Nov 2024 00:25:01 +0000 (16:25 -0800)]
xfs_repair: allow CoW staging extents in the realtime rmap records

Don't flag the rt rmap btree as having errors if there are CoW staging
extent records in it and the filesystem supports reflink.  As far as
reporting leftover staging extents, we'll report them when we scan the
rt refcount btree, in a future patch.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_spaceman: report health of the realtime refcount btree
Darrick J. Wong [Thu, 21 Nov 2024 00:25:01 +0000 (16:25 -0800)]
xfs_spaceman: report health of the realtime refcount btree

Report the health of the realtime reference count btree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_db: copy the realtime refcount btree
Darrick J. Wong [Thu, 21 Nov 2024 00:25:01 +0000 (16:25 -0800)]
xfs_db: copy the realtime refcount btree

Copy the realtime refcountbt when we're metadumping the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_db: support the realtime refcountbt
Darrick J. Wong [Thu, 21 Nov 2024 00:25:00 +0000 (16:25 -0800)]
xfs_db: support the realtime refcountbt

Wire up various parts of xfs_db for realtime refcount support so that we
can dump the rt refcount btree contents.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_db: display the realtime refcount btree contents
Darrick J. Wong [Thu, 21 Nov 2024 00:25:00 +0000 (16:25 -0800)]
xfs_db: display the realtime refcount btree contents

Implement all the code we need to dump rtrefcountbt contents, starting
from the inode root.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoman: document userspace API changes due to rt reflink
Darrick J. Wong [Thu, 21 Nov 2024 00:25:00 +0000 (16:25 -0800)]
man: document userspace API changes due to rt reflink

Update documentation to describe userspace ABI changes made for realtime
reflink support.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agolibfrog: enable scrubbing of the realtime refcount data
Darrick J. Wong [Thu, 21 Nov 2024 00:25:00 +0000 (16:25 -0800)]
libfrog: enable scrubbing of the realtime refcount data

Add a new entry so that we can scrub the rtrefcountbt and its metadata
directory tree path.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: scrub the metadir path of rt refcount btree files
Darrick J. Wong [Thu, 21 Nov 2024 00:25:00 +0000 (16:25 -0800)]
xfs: scrub the metadir path of rt refcount btree files

Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata
directory tree path to the refcount btree file for each rt group.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: scrub the realtime refcount btree
Darrick J. Wong [Thu, 21 Nov 2024 00:24:59 +0000 (16:24 -0800)]
xfs: scrub the realtime refcount btree

Source kernel commit: 844d7f8755a67b01391da92b99a5342c8b2b83f4

Add code to scrub realtime refcount btrees.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: report realtime refcount btree corruption errors to the health system
Darrick J. Wong [Thu, 21 Nov 2024 00:24:59 +0000 (16:24 -0800)]
xfs: report realtime refcount btree corruption errors to the health system

Whenever we encounter corrupt realtime refcount btree blocks, we should
report that to the health monitoring system for later reporting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: enable extent size hints for CoW operations
Darrick J. Wong [Thu, 21 Nov 2024 00:24:59 +0000 (16:24 -0800)]
xfs: enable extent size hints for CoW operations

Wire up the copy-on-write extent size hint for realtime files, and
connect it to the rt allocator so that we avoid fragmentation on rt
filesystems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: apply rt extent alignment constraints to CoW extsize hint
Darrick J. Wong [Thu, 21 Nov 2024 00:24:59 +0000 (16:24 -0800)]
xfs: apply rt extent alignment constraints to CoW extsize hint

The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint.  Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.

Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
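
As a rough illustration of the alignment rule described above, the sketch below checks that both the regular and the CoW extent size hints are multiples of the realtime extent size. The function names and the "zero means no hint" convention are assumptions made for this example; this is not the xfsprogs inode validator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the xfsprogs implementation: return true if an
 * extent size hint (in filesystem blocks) is a multiple of the realtime
 * extent size (also in blocks).  A hint of zero is assumed to mean
 * "no hint set" and is therefore trivially aligned.
 */
bool hint_is_rextsize_aligned(uint32_t hint_blocks, uint32_t rextsize)
{
	assert(rextsize > 0);	/* a zero rextsize would make the check meaningless */
	return hint_blocks % rextsize == 0;
}

/*
 * Apply the same alignment check to both hints of a realtime file, which
 * is the policy the commit message describes.
 */
bool rt_hints_valid(uint32_t extsize, uint32_t cowextsize, uint32_t rextsize)
{
	return hint_is_rextsize_aligned(extsize, rextsize) &&
	       hint_is_rextsize_aligned(cowextsize, rextsize);
}
```

With rextsize = 4, the hints (8, 16) pass, while a CoW hint of 6 fails because it is not a multiple of 4.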
8 months agoxfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Darrick J. Wong [Thu, 21 Nov 2024 00:24:58 +0000 (16:24 -0800)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files

Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files.  This apparently was done to disable
delayed allocation for realtime files.

However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files.  In this case, the
logic will incorrectly send the write through the delalloc write path.

Fix this by adjusting the logic slightly.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: recover CoW leftovers in the realtime volume
Darrick J. Wong [Thu, 21 Nov 2024 00:24:58 +0000 (16:24 -0800)]
xfs: recover CoW leftovers in the realtime volume

Scan the realtime refcount tree at mount time to get rid of leftover
CoW staging extents.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: allow inodes to have the realtime and reflink flags
Darrick J. Wong [Thu, 21 Nov 2024 00:24:58 +0000 (16:24 -0800)]
xfs: allow inodes to have the realtime and reflink flags

Now that we can share blocks between realtime files, allow this
combination.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: compute rtrmap btree max levels when reflink enabled
Darrick J. Wong [Thu, 21 Nov 2024 00:24:58 +0000 (16:24 -0800)]
xfs: compute rtrmap btree max levels when reflink enabled

Compute the maximum possible height of the realtime rmap btree when
reflink is enabled.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
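
For readers unfamiliar with how a worst-case btree height is usually derived, here is a minimal sketch, assuming every block holds only its minimum number of records. The function name and parameters are hypothetical, and the sketch ignores any special sizing of an inode-rooted btree root; it is not the routine this commit adds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: worst-case height of a btree that must index nrecs
 * records when every leaf holds only leaf_minrecs records and every
 * interior node holds only node_minrecs pointers.
 */
unsigned int btree_max_levels(uint64_t nrecs, unsigned int leaf_minrecs,
			      unsigned int node_minrecs)
{
	uint64_t blocks;
	unsigned int levels = 1;

	assert(leaf_minrecs > 0);
	assert(node_minrecs > 1);	/* fan-out of 1 would never converge */

	/* Number of leaf blocks, rounded up. */
	blocks = (nrecs + leaf_minrecs - 1) / leaf_minrecs;

	/* Add one level for each layer of interior nodes above the leaves. */
	while (blocks > 1) {
		blocks = (blocks + node_minrecs - 1) / node_minrecs;
		levels++;
	}
	return levels;
}
```

For example, 1000 records with a minimum of 10 records per leaf and 10 pointers per node need 100 leaves, then 10 nodes, then 1 root, for a height of 3.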
8 months agoxfs: update rmap to allow cow staging extents in the rt rmap
Darrick J. Wong [Thu, 21 Nov 2024 00:24:58 +0000 (16:24 -0800)]
xfs: update rmap to allow cow staging extents in the rt rmap

Don't error out on CoW staging extent records when realtime reflink is
enabled.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: create routine to allocate and initialize a realtime refcount btree inode
Darrick J. Wong [Thu, 21 Nov 2024 00:24:57 +0000 (16:24 -0800)]
xfs: create routine to allocate and initialize a realtime refcount btree inode

Create a library routine to allocate and initialize an empty realtime
refcountbt inode.  We'll use this for growfs, mkfs, and repair.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: wire up realtime refcount btree cursors
Darrick J. Wong [Thu, 21 Nov 2024 00:20:55 +0000 (16:20 -0800)]
xfs: wire up realtime refcount btree cursors

Wire up realtime refcount btree cursors wherever they're needed
throughout the code base.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: wire up a new metafile type for the realtime refcount
Darrick J. Wong [Thu, 21 Nov 2024 00:24:57 +0000 (16:24 -0800)]
xfs: wire up a new metafile type for the realtime refcount

Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with metafile type and on-disk
interpretation functions.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: add metadata reservations for realtime refcount btree
Darrick J. Wong [Thu, 21 Nov 2024 00:24:57 +0000 (16:24 -0800)]
xfs: add metadata reservations for realtime refcount btree

Reserve some free blocks so that we will always have enough free blocks
in the data volume to handle expansion of the realtime refcount btree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: add realtime refcount btree inode to metadata directory
Darrick J. Wong [Thu, 21 Nov 2024 00:20:52 +0000 (16:20 -0800)]
xfs: add realtime refcount btree inode to metadata directory

Add a metadir path to select the realtime refcount btree inode and load
it at mount time.  The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: add a realtime flag to the refcount update log redo items
Darrick J. Wong [Thu, 21 Nov 2024 00:24:56 +0000 (16:24 -0800)]
xfs: add a realtime flag to the refcount update log redo items

Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt.  We'll
wire up the actual refcount code later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: prepare refcount functions to deal with rtrefcountbt
Darrick J. Wong [Thu, 21 Nov 2024 00:24:56 +0000 (16:24 -0800)]
xfs: prepare refcount functions to deal with rtrefcountbt

Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions.  Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.

Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert them all at once.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: add realtime refcount btree operations
Darrick J. Wong [Thu, 21 Nov 2024 00:24:56 +0000 (16:24 -0800)]
xfs: add realtime refcount btree operations

Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are neither constrained
to the free space nor any particular AG.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: realtime refcount btree transaction reservations
Darrick J. Wong [Thu, 21 Nov 2024 00:24:56 +0000 (16:24 -0800)]
xfs: realtime refcount btree transaction reservations

Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents.  We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: introduce realtime refcount btree ondisk definitions
Darrick J. Wong [Thu, 21 Nov 2024 00:24:56 +0000 (16:24 -0800)]
xfs: introduce realtime refcount btree ondisk definitions

Add the ondisk structure definitions for realtime refcount btrees. The
realtime refcount btree will be rooted from a hidden inode so it needs
to have a separate btree block magic and pointer format.

Next, add everything needed to read, write and manipulate refcount btree
blocks. This prepares the way for connecting the btree operations
implementation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs: namespace the maximum length/refcount symbols
Darrick J. Wong [Thu, 21 Nov 2024 00:24:55 +0000 (16:24 -0800)]
xfs: namespace the maximum length/refcount symbols

Actually namespace these variables properly, so that readers can tell
that this is an XFS symbol, and that it's for the refcount
functionality.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agomkfs: create the realtime rmap inode
Darrick J. Wong [Thu, 21 Nov 2024 00:24:55 +0000 (16:24 -0800)]
mkfs: create the realtime rmap inode

Create a realtime rmapbt inode if we format the fs with realtime
and rmap.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_logprint: report realtime RUIs
Darrick J. Wong [Thu, 21 Nov 2024 00:24:55 +0000 (16:24 -0800)]
xfs_logprint: report realtime RUIs

Decode the RUI format just enough to report if an RUI targets the
realtime device or not.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: reserve per-AG space while rebuilding rt metadata
Darrick J. Wong [Thu, 21 Nov 2024 00:24:55 +0000 (16:24 -0800)]
xfs_repair: reserve per-AG space while rebuilding rt metadata

Realtime metadata btrees can consume quite a bit of space on a full
filesystem.  Since the metadata are just regular files, we need to
make the per-AG reservations to avoid overfilling any of the AGs while
rebuilding metadata.  This avoids the situation where a filesystem comes
straight from repair and immediately trips over not having enough space
in an AG.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: rebuild the bmap btree for realtime files
Darrick J. Wong [Thu, 21 Nov 2024 00:24:55 +0000 (16:24 -0800)]
xfs_repair: rebuild the bmap btree for realtime files

Use the realtime rmap btree information to rebuild an inode's data fork
when appropriate.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: check for global free space concerns with default btree slack levels
Darrick J. Wong [Thu, 21 Nov 2024 00:24:54 +0000 (16:24 -0800)]
xfs_repair: check for global free space concerns with default btree slack levels

It's possible that before repair was started, the filesystem might have
been nearly full, and its metadata btree blocks could all have been
nearly full.  If we then rebuild the btrees with blocks that are only
75% full, that expansion might be enough to run out of free space.  The
solution to this is to pack the new blocks completely full if we fear
running out of space.

Previously, we only had to check and decide that on a per-AG basis.
However, now that XFS can have filesystems with metadata btrees rooted
in inodes, we have a global free space concern because there might be
enough space in each AG to regenerate the AG btrees at 75%, but that
might not leave enough space to regenerate the inode btrees, even if we
fill those blocks to 100%.

Hence we need to precompute the worst case space usage for all btrees in
the filesystem and compare /that/ against the global free space to
decide if we're going to pack the btrees maximally to conserve space.
That decision can override the per-AG determination.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
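
The arithmetic behind the decision above can be sketched as follows. This is a simplified illustration that counts leaf blocks only; the function names, the single maxrecs value shared by all btrees, and the per-btree record counts are assumptions for the example, not xfs_repair's actual bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: leaf blocks needed to store nrecs records when each
 * block can hold maxrecs records but is only filled to fill_pct percent.
 */
uint64_t leaf_blocks_needed(uint64_t nrecs, uint32_t maxrecs,
			    uint32_t fill_pct)
{
	uint64_t per_block = (uint64_t)maxrecs * fill_pct / 100;

	assert(per_block > 0);
	return (nrecs + per_block - 1) / per_block;	/* round up */
}

/*
 * Sum the worst-case usage of every btree at the default 75% fill and
 * compare it against the global free space.  Returns true if the rebuild
 * should pack blocks completely full instead.
 */
bool should_pack_full(const uint64_t *nrecs, size_t nr_btrees,
		      uint32_t maxrecs, uint64_t free_blocks)
{
	uint64_t total = 0;

	for (size_t i = 0; i < nr_btrees; i++)
		total += leaf_blocks_needed(nrecs[i], maxrecs, 75);
	return total > free_blocks;
}
```

With two btrees of 300 records each and 100 records per block, a 75% fill needs 4 blocks per btree (8 in total) whereas a 100% fill needs 3 (6 in total), so a filesystem with only 7 free blocks would have to pack maximally.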
8 months agoxfs_repair: rebuild the realtime rmap btree
Darrick J. Wong [Thu, 21 Nov 2024 00:24:54 +0000 (16:24 -0800)]
xfs_repair: rebuild the realtime rmap btree

Rebuild the realtime rmap btree file from the reverse mapping records we
gathered from walking the inodes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: always check realtime file mappings against incore info
Darrick J. Wong [Thu, 21 Nov 2024 00:24:54 +0000 (16:24 -0800)]
xfs_repair: always check realtime file mappings against incore info

Curiously, the xfs_repair code that processes data fork mappings of
realtime files doesn't actually compare the mappings against the incore
state map during the !check_dups phase (aka phase 3).  As a result, we
lose the opportunity to clear damaged realtime data forks before we get
to crosslinked file checking in phase 4, which results in ondisk
metadata errors calling do_error, which aborts repair.

Split the process_rt_rec_state code into two functions: one to check the
mapping, and another to update the incore state.  The first one can be
called to help us decide if we're going to zap the fork, and the second
one updates the incore state if we decide to keep the fork.  We already
do this for regular data files.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
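
The check/update split described above can be illustrated with a toy state map. The two-state enum, the array-based map, and both function names are hypothetical simplifications for this sketch; xfs_repair's real incore state tracking is more elaborate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy per-extent state; the real state map has more states than this. */
enum ext_state { EXT_FREE, EXT_INUSE };

/*
 * Check only: return true if [start, start + len) lies inside the map and
 * none of its extents has been claimed yet.  The map is not modified, so
 * the caller can still decide to zap the fork.
 */
bool rt_mapping_ok(const enum ext_state *map, size_t nr,
		   size_t start, size_t len)
{
	if (start > nr || len > nr - start)
		return false;
	for (size_t i = start; i < start + len; i++)
		if (map[i] != EXT_FREE)
			return false;
	return true;
}

/*
 * Update only: record the mapping in the incore state.  Intended to be
 * called once the caller has decided to keep the fork.
 */
void rt_mapping_claim(enum ext_state *map, size_t nr,
		      size_t start, size_t len)
{
	assert(rt_mapping_ok(map, nr, start, len));
	for (size_t i = start; i < start + len; i++)
		map[i] = EXT_INUSE;
}
```

Keeping the check free of side effects means a fork with one bad mapping can be rejected without leaving half of its extents marked in use.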
8 months agoxfs_repair: check existing realtime rmapbt entries against observed rmaps
Darrick J. Wong [Thu, 21 Nov 2024 00:24:54 +0000 (16:24 -0800)]
xfs_repair: check existing realtime rmapbt entries against observed rmaps

Once we've finished collecting reverse mapping observations from the
metadata scan, check those observations against the realtime rmap btree
(particularly if we're in -n mode) to detect rtrmapbt problems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: find and mark the rtrmapbt inodes
Darrick J. Wong [Thu, 21 Nov 2024 00:24:53 +0000 (16:24 -0800)]
xfs_repair: find and mark the rtrmapbt inodes

Make sure that we find the realtime rmapbt inodes and mark them
appropriately, just in case we find a rogue inode claiming to be an
rtrmap, or garbage in the metadata directory tree.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: refactor realtime inode check
Darrick J. Wong [Thu, 21 Nov 2024 00:24:53 +0000 (16:24 -0800)]
xfs_repair: refactor realtime inode check

Refactor the realtime bitmap and summary checks into a helper function.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
8 months agoxfs_repair: create a new set of incore rmap information for rt groups
Darrick J. Wong [Thu, 21 Nov 2024 00:24:53 +0000 (16:24 -0800)]
xfs_repair: create a new set of incore rmap information for rt groups

Create a parallel set of "xfs_ag_rmap" structures to cache information
about reverse mappings for the realtime groups.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>