Hans Holmberg [Fri, 7 Jun 2024 08:31:43 +0000 (08:31 +0000)]
xfs: add data placement info to mount stats
Add per-rtg active refs, lifetime hint and data separation score, plus
an aggregate data separation score, to the mount stats output
to aid debugging and analysis.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Christoph Hellwig [Mon, 22 Jul 2024 13:31:28 +0000 (06:31 -0700)]
xfs: support xrep_require_rtext_inuse on zoned file systems
Space usage is tracked by the rmap, which is already cross-referenced
separately. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
Christoph Hellwig [Fri, 10 May 2024 06:43:29 +0000 (08:43 +0200)]
xfs: support xchk_xref_is_used_rt_space on zoned file systems
Space usage is tracked by the rmap, which is already cross-referenced
separately. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
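A rough userspace sketch of such a write pointer sanity check, not the
kernel code. The struct and function names are hypothetical, and it
assumes offsets are counted in blocks from the start of the zone:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for per-zone state; not the kernel's structure. */
struct zone_state {
	uint64_t capacity;      /* usable blocks in the zone */
	uint64_t write_pointer; /* offset of the next block to be written */
};

/*
 * Return true if [off, off + len) lies entirely below the write pointer,
 * i.e. the range could plausibly hold written data.
 */
static bool range_below_write_pointer(const struct zone_state *z,
				      uint64_t off, uint64_t len)
{
	assert(z->write_pointer <= z->capacity);
	if (len == 0 || off >= z->write_pointer)
		return false;
	return len <= z->write_pointer - off;
}
```

A range that crosses the write pointer fails the check even if it starts
below it, which is the cheap cross-check described above.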
Christoph Hellwig [Sun, 12 May 2024 05:39:45 +0000 (07:39 +0200)]
xfs: disable sb_frextents scrub/repair for zoned file systems
Zoned file systems not only don't use the frextents counter, but the
in-memory percpu counter also includes reservations taken before even
allocating delalloc extent records, so it will never match the per-zone
used information.
Christoph Hellwig [Wed, 24 Jul 2024 14:20:34 +0000 (07:20 -0700)]
xfs: ensure we have blocks available before taking the iolock
With the last patch we have the infrastructure in place to have space
reservations before taking the iolock and thus avoid the GC deadlock
in generic/269. But right now it will happily take space that has
been freed in a used zone that would still require GC. Add a new
RTAVAILABLE counter of blocks that are actually directly available to
be written into in addition to the classic free counter. Only allow
a write to go ahead if it has blocks available to write, and otherwise
wait for GC. This also requires tweaking the need GC condition a
bit as we now always need to GC if someone is waiting for space.
Because GC always allocates from the reserved pool that gets replenished
first we can also do away with the ratio to favor it.
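The split between "free" and "available" can be modelled roughly as
follows. This is a simplified single-threaded sketch under my own
assumptions: the names are hypothetical, and the real code uses percpu
counters and a reserved pool that are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters; the patch calls the new one RTAVAILABLE. */
struct space_counters {
	int64_t free;      /* blocks not referenced by any file */
	int64_t available; /* free blocks that can be written without GC */
};

enum reserve_result { RESERVE_OK, RESERVE_WAIT_FOR_GC, RESERVE_ENOSPC };

static enum reserve_result reserve_blocks(struct space_counters *c,
					  int64_t count)
{
	assert(count > 0);
	if (c->free < count)
		return RESERVE_ENOSPC;      /* no amount of GC would help */
	if (c->available < count)
		return RESERVE_WAIT_FOR_GC; /* space exists, but needs GC first */
	c->free -= count;
	c->available -= count;
	return RESERVE_OK;
}

/* Model GC reclaiming freed blocks from a used zone: they become writable. */
static void gc_reclaim(struct space_counters *c, int64_t count)
{
	assert(count >= 0 && c->available + count <= c->free);
	c->available += count;
}
```

A writer that gets RESERVE_WAIT_FOR_GC would sleep until gc_reclaim()
has run and then retry, which is why a waiter always has to trigger GC.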
Christoph Hellwig [Mon, 22 Jul 2024 13:29:10 +0000 (06:29 -0700)]
xfs: reserve blocks before taking the iolock
For zoned file systems we reserve the free extents before taking the
iolock so that if we have to force garbage collection it happens before we
take the iolock. This is done because GC has to take the iolock after it
moved data to a new place, and this could otherwise deadlock.
This unfortunately has to exclude block zeroing, as for truncate we are
called with the iolock (aka i_rwsem) already held. As zeroing is always
only for a single block at a time, or up to two in total for a syscall in
the case of free_file_range, we deal with that by just stealing the block,
but failing the allocation if we'd have to wait for GC (this will only
be implemented in the following patch).
Christoph Hellwig [Wed, 24 Jul 2024 14:10:51 +0000 (07:10 -0700)]
xfs: support zoned RT devices
WARNING: this is early prototype code.
The zoned allocator works by handing out data blocks to the direct or
buffered write code at the place where XFS currently does block
allocations. It does not actually insert them into the bmap extent tree
at this time, but only after I/O completion when we know the block number.
The zoned allocator works on any kind of device, including conventional
devices or conventional zones by having a crude write pointer emulation.
For zoned devices, active zone management is fully supported, as is
zone capacity < zone size.
The two major limitations are:
- there is no support for unwritten extents and thus persistent
file preallocations from fallocate(). This is inherent to an
always out of place write scheme as there is no way to persistently
preallocate blocks for an indefinite number of overwrites
- because the metadata blocks and data blocks are on different
devices you can run out of space for metadata while having plenty
of space for data and vice versa. This is inherent to a scheme
where we use different devices or pools for each.
Thanks to Hans Holmberg <hans.holmberg@wdc.com> for lots of fixes
and improvements.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
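The crude write pointer emulation mentioned above can be pictured with
a sketch like the following. It is illustrative only: all names are
hypothetical, and it assumes a zone is described by a start block, a
capacity and a write pointer counted in blocks.

```c
#include <assert.h>
#include <stdint.h>

#define NO_BLOCK UINT64_MAX

/* Hypothetical emulated zone; capacity may be smaller than the zone size. */
struct emulated_zone {
	uint64_t start;         /* first block of the zone */
	uint64_t capacity;      /* writable blocks in the zone */
	uint64_t write_pointer; /* offset of the next write within the zone */
};

/*
 * Hand out 'len' blocks at the current write pointer and advance it.
 * Returns the first block number, or NO_BLOCK if the zone cannot fit
 * the write. The caller only learns the block number from this return
 * value, mirroring how the mapping is recorded after I/O completion.
 */
static uint64_t zone_append(struct emulated_zone *z, uint64_t len)
{
	assert(len > 0 && z->write_pointer <= z->capacity);
	if (len > z->capacity - z->write_pointer)
		return NO_BLOCK;
	uint64_t block = z->start + z->write_pointer;
	z->write_pointer += len;
	return block;
}
```

Because blocks are only ever handed out at the write pointer, an
overwrite of file data always lands in a new place, which is what rules
out persistent preallocation.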
Christoph Hellwig [Mon, 22 Jul 2024 13:24:04 +0000 (06:24 -0700)]
xfs: don't do speculative preallocations for zoned RT allocations
XFS does extensive speculative delalloc preallocation to ensure it can
allocate large contiguous regions on disk. For the zone write path this
doesn't make sense as we can always just convert what is being written
anyway, and allocating more delalloc space also means that we take more
blocks out of the pool that we have reserved space for before.
Christoph Hellwig [Sun, 25 Feb 2024 05:57:49 +0000 (06:57 +0100)]
xfs: add a helper to check if an inode sits on a zoned device
Add a xfs_is_zoned_inode helper that returns true if an inode has the
RT flag set and the file system is zoned. This will be used to key
off zoned allocator behavior.
Make xfs_is_always_cow_inode return true for zoned inodes as we always
need to write out of place on zoned devices.
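In spirit the two predicates look something like this. The structures
below are mock stand-ins, not the real XFS inode and mount, and the
always_cow field is an assumed placeholder for whatever else forces
out of place writes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-ins for the mount and inode; field names are hypothetical. */
struct sketch_mount {
	bool zoned;      /* file system uses the zoned allocator */
	bool always_cow; /* some other reason to always write out of place */
};

struct sketch_inode {
	const struct sketch_mount *mount;
	bool realtime;   /* inode has the RT flag set */
};

static bool sketch_is_zoned_inode(const struct sketch_inode *ip)
{
	assert(ip != NULL && ip->mount != NULL);
	return ip->realtime && ip->mount->zoned;
}

/* Zoned inodes are always COW; other inodes only if the mount says so. */
static bool sketch_is_always_cow_inode(const struct sketch_inode *ip)
{
	return sketch_is_zoned_inode(ip) || ip->mount->always_cow;
}
```

Note that a non-RT inode on a zoned file system is not a zoned inode in
this model, since its data does not live on the zoned device.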
Christoph Hellwig [Fri, 27 Oct 2023 07:58:24 +0000 (09:58 +0200)]
xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.
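To illustrate why checking only the total length is not enough, here is
a userspace sketch that checks every segment length. The function name
is hypothetical, it assumes a power-of-two block size, and the kernel
works on an iov_iter rather than a raw iovec array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Return true only if every segment length is a multiple of blocksize.
 * A vector whose lengths merely sum to a block multiple is rejected.
 */
static bool iovec_lengths_aligned(const struct iovec *iov, int iovcnt,
				  size_t blocksize)
{
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	for (int i = 0; i < iovcnt; i++) {
		if (iov[i].iov_len & (blocksize - 1))
			return false;
	}
	return true;
}
```

With a 4096 byte block size, segments of 512 and 3584 bytes add up to
one block but fail this check, and such a write would have to take the
unaligned path.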
Christoph Hellwig [Fri, 31 May 2024 09:27:32 +0000 (11:27 +0200)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
bflags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.
Christoph Hellwig [Tue, 17 Oct 2023 07:15:13 +0000 (09:15 +0200)]
xfs: pass an optional startblock argument to xfs_reflink_end_cow
With the upcoming zoned allocator, extents won't be converted from
delalloc to real allocations before submitting the I/O, as we'll only
know the actual block number on I/O completion. Add a new argument to
xfs_reflink_end_cow that can pass in the startblock we got from the
I/O completion handler, and convert the completed range from delalloc
to a written extent covering that block range.
Christoph Hellwig [Fri, 31 May 2024 09:25:00 +0000 (11:25 +0200)]
xfs: factor out a __xfs_reflink_end_cow_extent helper
This will allow reusing the code for zoned direct I/O completions, where
the out of place write allocation is passed as private data with the
I/O and not stored in the COW fork.
Christoph Hellwig [Mon, 22 Jul 2024 13:20:19 +0000 (06:20 -0700)]
xfs: generalize the freespace and reserved blocks handling
The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in and special cased.
Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.
Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_count helpers
directly. This also switches the flooring of the frextents counter
to 0 in statfs for the rtinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.
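The shape of the change can be sketched like this. Everything here is a
hypothetical simplification: plain integers stand in for the percpu
counters, and the enum and helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One index per kind of free space, instead of one named field each. */
enum sketch_counter { CTR_FDBLOCKS, CTR_FREXTENTS, CTR_NR };

struct sketch_space {
	int64_t  free[CTR_NR];    /* may be transiently negative */
	uint64_t resblks[CTR_NR]; /* reserved pool, now also for RT extents */
};

/* All readers go through a helper rather than touching the array. */
static int64_t sketch_free_read(const struct sketch_space *sp,
				enum sketch_counter ctr)
{
	assert(ctr < CTR_NR);
	return sp->free[ctr];
}

/* statfs-style reporting: never show a negative amount of free space. */
static uint64_t sketch_free_floored(const struct sketch_space *sp,
				    enum sketch_counter ctr)
{
	int64_t v = sketch_free_read(sp, ctr);

	return v < 0 ? 0 : (uint64_t)v;
}
```

Indexing by counter type lets the same code paths serve both the block
and the RT extent case without passing the counter in separately.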
Christoph Hellwig [Wed, 14 Feb 2024 14:17:15 +0000 (15:17 +0100)]
xfs: convert rtgroup lookup to an xarray
The xarray is the modern replacement for the radix-tree. It is simpler
to use, especially when using marks to find specific entries, something
we will heavily use for the zone allocator implementation.
Christoph Hellwig [Wed, 24 Apr 2024 06:53:45 +0000 (08:53 +0200)]
iomap: pass private data to iomap_file_buffered_write_punch_delalloc
Allow the file system to pass private data which then gets passed on
to the punch callback. Also move the iomap_punch_t typedef to
iomap.h so it doesn't have to be open coded in the declaration.
Christoph Hellwig [Mon, 22 Jul 2024 13:18:17 +0000 (06:18 -0700)]
iomap: pass private data to iomap_page_mkwrite
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Fri, 19 Apr 2024 06:41:28 +0000 (08:41 +0200)]
iomap: pass private data to iomap_file_buffered_write
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Fri, 24 Nov 2023 09:45:35 +0000 (10:45 +0100)]
iomap: optionally use ioends for direct I/O
struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.
For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.
So instead of reinventing the wheel, reuse the existing infrastructure.
Christoph Hellwig [Wed, 14 Feb 2024 14:09:44 +0000 (15:09 +0100)]
iomap: support IOMAP_F_ZONE_APPEND for buffered I/O
Add support for Zone Append commands to the iomap buffer writeback code.
This involves selecting the right block layer operation and using the
right helper to add data to the bio, as well as not creating chained
bios inside the iomap for zone append as they could be written
non-contiguously and adjusting the sector based merge criteria.
Christoph Hellwig [Fri, 20 Oct 2023 13:36:03 +0000 (15:36 +0200)]
iomap: make iomap_sector Zone Append aware
Zone Append commands always point to the zone start sector. Change
the iomap_sector() helper to not adjust the start block for the position
in the iomap range for Zone Append iomaps.
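A mock of the resulting logic, using a stand-in structure and flag value
rather than the real struct iomap and IOMAP_F_ZONE_APPEND definitions,
and assuming 512 byte sectors:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT  9
#define SKETCH_F_ZONE_APPEND 0x1u /* placeholder value, not the kernel's */

/* Mock mapping: 'addr' is the disk address in bytes for file 'offset'. */
struct sketch_iomap {
	uint64_t addr;
	int64_t  offset;
	unsigned int flags;
};

static uint64_t sketch_iomap_sector(const struct sketch_iomap *m, int64_t pos)
{
	assert(pos >= m->offset);
	/* Zone Append always targets the zone start; the device picks the spot. */
	if (m->flags & SKETCH_F_ZONE_APPEND)
		return m->addr >> SKETCH_SECTOR_SHIFT;
	return (m->addr + (uint64_t)(pos - m->offset)) >> SKETCH_SECTOR_SHIFT;
}
```

For a regular mapping the sector moves with the file position, while
for a Zone Append mapping it stays at the start address regardless of
pos.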
Christoph Hellwig [Sun, 5 Nov 2023 05:40:52 +0000 (06:40 +0100)]
iomap: reinstate IOMAP_F_ZONE_APPEND support
Add back the support for using Zone Append in the iomap direct I/O code
that was removed a while ago as we'll use it for the XFS zoned device
support.
This is essentially a revert of commit 8e81aa16a421 ("iomap: remove
IOMAP_F_ZONE_APPEND") with an additional comment describing the flag now
that all the other IOMAP_F_* flags have a nice description.
Christoph Hellwig [Fri, 13 Oct 2023 06:09:39 +0000 (08:09 +0200)]
iomap: wait for writeback before allocating new blocks
This means we are actually forced to allocate new delalloc space for the
new dirtier instead of reusing one that is currently being used for
writeback.
Darrick J. Wong [Wed, 29 May 2024 04:13:29 +0000 (21:13 -0700)]
xfs: check for shared rt extents when rebuilding rt file's data fork
When we're rebuilding the data fork of a realtime file, we need to
cross-reference each mapping with the rt refcount btree to ensure that
the reflink flag is set if there are any shared extents found.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:27 +0000 (21:13 -0700)]
xfs: walk the rt reference count tree when rebuilding rmap
When we're rebuilding the data device rmap, if we encounter a "refcount"
format fork, we have to walk the (realtime) refcount btree inode to
build the appropriate mappings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:26 +0000 (21:13 -0700)]
xfs: check new rtbitmap records against rt refcount btree
When we're rebuilding the realtime bitmap, check the proposed free
extents against the rt refcount btree to make sure we don't commit any
grievous errors.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:25 +0000 (21:13 -0700)]
xfs: don't flag quota rt block usage on rtreflink filesystems
Quota space usage is allowed to exceed the size of the physical storage
when reflink is enabled. Now that we have reflink for the realtime
volume, apply this same logic to the rtb repair logic.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:24 +0000 (21:13 -0700)]
xfs: detect and repair misaligned rtinherit directory cowextsize hints
If we encounter a directory that has been configured to pass on a CoW
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review and/or turn it off because that is a
misconfiguration.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
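The check and the "turn it off" repair amount to something like the
sketch below. The structure and names are hypothetical, and both sizes
are assumed to be expressed in the same unit (fs blocks):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the directory flags relevant to this check. */
struct sketch_dir {
	bool     rtinherit;      /* new files are created as realtime files */
	bool     has_cowextsize; /* a CoW extent size hint is set */
	uint32_t cowextsize;     /* hint value */
};

static bool sketch_cowextsize_misaligned(const struct sketch_dir *d,
					 uint32_t rextsize)
{
	assert(rextsize > 0);
	return d->rtinherit && d->has_cowextsize &&
	       d->cowextsize % rextsize != 0;
}

/* Repair by dropping the hint. Returns true if anything was changed. */
static bool sketch_repair_cowextsize(struct sketch_dir *d, uint32_t rextsize)
{
	if (!sketch_cowextsize_misaligned(d, rextsize))
		return false;
	d->has_cowextsize = false;
	d->cowextsize = 0;
	return true;
}
```

A hint on a directory without the rtinherit flag is left alone here,
since it would never be passed on to a realtime file.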
Darrick J. Wong [Wed, 29 May 2024 04:13:23 +0000 (21:13 -0700)]
xfs: check reference counts of gaps between rt refcount records
If there's a gap between records in the rt refcount btree, we ought to
cross-reference the gap with the rtrmap records to make sure that there
aren't any overlapping records for a region that doesn't have any shared
ownership.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:20 +0000 (21:13 -0700)]
xfs: add realtime refcount btree when adding rt volume
If we're adding enough space to the realtime section to require the
creation of new realtime groups, create the rt refcount btree inode
before we start adding the space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:20 +0000 (21:13 -0700)]
xfs: check that the rtrefcount maxlevels doesn't increase when growing fs
The size of filesystem transaction reservations depends on the maximum
height (maxlevels) of the realtime btrees. Since we don't want a grow
operation to increase the reservation size enough that we'll fail the
minimum log size checks on the next mount, constrain growfs operations
if they would cause an increase in the rt refcount btree maxlevels.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
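A simplified model of such a constraint: compute the worst-case btree
height before and after the grow and refuse the operation if it would
increase. The helper names and the way the record counts are obtained
are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t sketch_howmany(uint64_t x, uint64_t y)
{
	return (x + y - 1) / y;
}

/*
 * Worst-case height of a btree holding 'records' records when every
 * leaf holds only leaf_minrecs records and every node only node_minrecs.
 */
static unsigned int sketch_btree_height(uint64_t records,
					unsigned int leaf_minrecs,
					unsigned int node_minrecs)
{
	assert(leaf_minrecs > 0 && node_minrecs > 1);
	uint64_t blocks = sketch_howmany(records, leaf_minrecs);
	unsigned int height = 1;

	while (blocks > 1) {
		blocks = sketch_howmany(blocks, node_minrecs);
		height++;
	}
	return height;
}

/* Allow the grow only if the worst-case height does not increase. */
static bool sketch_growfs_allowed(uint64_t old_records, uint64_t new_records,
				  unsigned int leaf_minrecs,
				  unsigned int node_minrecs)
{
	return sketch_btree_height(new_records, leaf_minrecs, node_minrecs) <=
	       sketch_btree_height(old_records, leaf_minrecs, node_minrecs);
}
```

Keeping the height fixed keeps the transaction reservations, and hence
the minimum log size, unchanged across the grow.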
Darrick J. Wong [Wed, 29 May 2024 04:13:19 +0000 (21:13 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:18 +0000 (21:13 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files. This apparently was done to disable
delayed allocation for realtime files.
However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files. In this case, the
logic will incorrectly send the write through the delalloc write path.
Fix this by adjusting the logic slightly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:16 +0000 (21:13 -0700)]
xfs: enable CoW for realtime data
Update our write paths to support copy on write on the rt volume. This
works in more or less the same way as it does on the data device, with
the major exception that we never do delalloc on the rt volume.
Because we consider unwritten CoW fork staging extents to be incore
quota reservation, we update xfs_quota_reserve_blkres to support this
case. Though xfs doesn't allow rt and quota together, the change is
trivial and we shouldn't leave a logic bomb here.
While we're at it, add a missing xfs_mod_delalloc call when we remove
delalloc block reservation from the inode. This is largely irrelevant
since realtime files do not use delalloc, but we want to avoid leaving
logic bombs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Mon, 1 Jul 2024 21:26:06 +0000 (14:26 -0700)]
From: Christoph Hellwig <hch@lst.de>
xfs: refactor xfs_reflink_find_shared
Move lookup of the perag structure from the callers into the helpers,
and return the offset into the extent of the shared region instead of
the block number that needs post-processing. This prepares the
callsites for the creation of an rt-specific variant in the next patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: port to the middle of the rtreflink series for cleanliness]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:12 +0000 (21:13 -0700)]
xfs: wire up a new inode fork type for the realtime refcount
Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:11 +0000 (21:13 -0700)]
xfs: add realtime refcount btree inode to metadata directory
Add a metadir path to select the realtime refcount btree inode and load
it at mount time. The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:10 +0000 (21:13 -0700)]
xfs: add a realtime flag to the refcount update log redo items
Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt. We'll
wire up the actual refcount code later.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:09 +0000 (21:13 -0700)]
xfs: prepare refcount functions to deal with rtrefcountbt
Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions. Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.
Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert it all at once.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:09 +0000 (21:13 -0700)]
xfs: add realtime refcount btree operations
Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are neither constrained
to the free space nor any particular AG.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents. We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:07 +0000 (21:13 -0700)]
xfs: define the on-disk realtime refcount btree format
Start filling out the rtrefcount btree implementation. Start with the
on-disk btree format; add everything needed to read, write and
manipulate refcount btree blocks. This prepares the way for connecting
the btree operations implementation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Add new realtime refcount btree definitions. The realtime refcount btree
will be rooted from a hidden inode, but has its own shape and therefore
needs to have most of its own separate types.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:13:06 +0000 (21:13 -0700)]
xfs: prepare refcount btree cursor tracepoints for realtime
Rework the refcount btree cursor tracepoints in preparation to handle the
realtime refcount btree cursor. Mostly this involves renaming the field to
"refcbno" and extracting the group number from the cursor when possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:12:59 +0000 (21:12 -0700)]
xfs: hook live realtime rmap operations during a repair operation
Hook the regular realtime rmap code when an rtrmapbt repair operation is
running so that we can unlock the AGF buffer to scan the filesystem and
keep the in-memory btree up to date during the scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:12:57 +0000 (21:12 -0700)]
xfs: support repairing metadata btrees rooted in metadir inodes
Adapt the repair code so that we can stage a new btree in the data fork
area of a metadir inode and reap the old blocks. We already have nearly
all of the infrastructure; the only parts that were missing were the
metadata inode reservation handling.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:12:56 +0000 (21:12 -0700)]
xfs: repair rmap btree inodes
Teach the inode repair code how to deal with realtime rmap btree inodes
that won't load properly. This is most likely moot since the filesystem
generally won't mount without the rtrmapbt inodes being usable, but
we'll add this for completeness.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:12:55 +0000 (21:12 -0700)]
xfs: walk the rt reverse mapping tree when rebuilding rmap
When we're rebuilding the data device rmap, if we encounter an "rmap"
format fork, we have to walk the (realtime) rmap btree inode to build
the appropriate mappings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:12:53 +0000 (21:12 -0700)]
xfs: scan rt rmap when we're doing an intense rmap check of bmbt mappings
Teach the bmbt scrubber how to perform a comprehensive check that the
rmapbt does not contain /any/ mappings that are not described by bmbt
records when it's dealing with a realtime file.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:12:51 +0000 (21:12 -0700)]
xfs: allow queued realtime intents to drain before scrubbing
When a writer thread executes a chain of log intent items for the
realtime volume, the ILOCKs taken during each step are for each rt
metadata file, not the entire rt volume itself. Although scrub takes
all rt metadata ILOCKs, this isn't sufficient to guard against scrub
checking the rt volume while that writer thread is in the middle of
finishing a chain because there's no higher level locking primitive
guarding the realtime volume.
When there's a collision, cross-referencing between data structures
(e.g. rtrmapbt and rtrefcountbt) yields false corruption events; if
repair is running, this results in incorrect repairs, which is
catastrophic.
Fix this by adding to the mount structure the same drain that we use to
protect scrub against concurrent AG updates, but this time for the
realtime volume.
[Contains a few cleanups from hch]
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
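Conceptually a drain is a count of intent items still in flight that
scrub must see reach zero before it looks at the volume. The following
is a userspace sketch with C11 atomics and hypothetical names. It only
models the counting, not the sleeping and wakeup that the kernel needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical drain: number of queued but unfinished intent items. */
struct sketch_drain {
	atomic_int pending;
};

/* Called when an intent item targeting the rt volume is queued. */
static void sketch_drain_grab(struct sketch_drain *d)
{
	atomic_fetch_add(&d->pending, 1);
}

/* Called when that intent item has been finished. */
static void sketch_drain_release(struct sketch_drain *d)
{
	int old = atomic_fetch_sub(&d->pending, 1);

	assert(old > 0);
	(void)old;
}

/* Scrub may only cross-reference rt metadata while this returns false. */
static bool sketch_drain_busy(struct sketch_drain *d)
{
	return atomic_load(&d->pending) > 0;
}
```

A scrubber would take the rt metadata ILOCKs, check the drain, and if
it is busy drop the locks and wait for the chain to finish before
retrying.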
Darrick J. Wong [Wed, 29 May 2024 04:12:40 +0000 (21:12 -0700)]
xfs: fix scrub tracepoints when inode-rooted btrees are involved
Fix a minor mistake in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled. The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:11:57 +0000 (21:11 -0700)]
xfs: add realtime rmap btree when adding rt volume
If we're adding enough space to the realtime section to require the
creation of new realtime groups, create the rt rmap btree inode before
we start adding the space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Wed, 29 May 2024 04:11:57 +0000 (21:11 -0700)]
xfs: check that the rtrmapbt maxlevels doesn't increase when growing fs
The size of filesystem transaction reservations depends on the maximum
height (maxlevels) of the realtime btrees. Since we don't want a grow
operation to increase the reservation size enough that we'll fail the
minimum log size checks on the next mount, constrain growfs operations
if they would cause an increase in those maxlevels.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>