Christoph Hellwig [Sat, 19 Oct 2024 12:23:05 +0000 (14:23 +0200)]
xfs: shut the file system down on corrupted used counter
If a free is trying to free more blocks than the used counter the file
system is clearly corrupted so shut it down. Keep the debug only assert
to follow the (good?) old XFS tradition of panicking on corruption for
debug builds.
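The check described above can be modelled in plain C. This is only a minimal userspace sketch, not the kernel code: the `sketch_used_zone` structure, the `shut_down` flag and the `sketch_zone_free` helper are hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-zone accounting state. */
struct sketch_used_zone {
	uint64_t used;      /* blocks currently accounted as used */
	bool     shut_down; /* set once corruption has been detected */
};

/*
 * Free `count` blocks. Returns 0 on success. If the request is larger
 * than the used counter, flag the shutdown and return -1 instead of
 * letting the unsigned counter underflow.
 */
static int sketch_zone_free(struct sketch_used_zone *z, uint64_t count)
{
	if (count > z->used) {
		z->shut_down = true;
		return -1;
	}
	z->used -= count;
	return 0;
}
```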
Christoph Hellwig [Fri, 18 Oct 2024 04:55:12 +0000 (06:55 +0200)]
xfs: don't clear XFS_RTG_RECLAIMABLE at the start of GC
It turns out GC can skip inodes when iget would have to get into retry
loops. When we skip an inode during GC of a zone that means we can't
empty the zone, and essentially leak the space in it until another block
is freed in that zone and XFS_RTG_RECLAIMABLE is set again.
Stop clearing XFS_RTG_RECLAIMABLE as freeing of the final block already
does that and GC is single threaded anyway.
Christoph Hellwig [Sun, 13 Oct 2024 08:26:47 +0000 (10:26 +0200)]
xfs: delete the COW fork delalloc extent when submitting writeback
->map_blocks is the place the actual disk blocks are allocated. To ensure
truncate (or any other xfs_reflink_cancel_cow_blocks caller) can't remove
a delalloc extent for which we've already allocated on-disk blocks, remove
it as soon as we've committed to allocating disk blocks, making sure there either
is a delalloc extent or an in-flight I/O.
It turns out this also nicely simplifies the end I/O handler to be the same
for buffered vs direct I/O.
On the other hand it undoes some of the work to harvest the delalloc blocks
for the end I/O handler. At the same time it removes the double accounting
and thus makes the behavior the same as the conventional COW end I/O handler,
which so far has worked. We might want to handle the block reservations a
little more intelligently, preferably in a way common to the zoned and COW
end I/O handlers.
Christoph Hellwig [Sat, 12 Oct 2024 05:55:24 +0000 (07:55 +0200)]
xfs: handle truncated delalloc reservations in xfs_zoned_map_blocks
Truncate could have removed some or all of the delalloc reservation
covering the folio range. Adjust xfs_zoned_map_blocks to not
write back such ranges, because we don't have a block reservation
covering it.
Christoph Hellwig [Tue, 8 Oct 2024 07:20:29 +0000 (09:20 +0200)]
xfs: use the indirect block reservation in the zoned end I/O handler
Use the indirect block reservation from the COW fork delalloc extent to
reduce the blocks needed for the zoned end I/O handler so that we
don't run out of the reserved blocks pool too easily.
Christoph Hellwig [Mon, 7 Oct 2024 08:22:27 +0000 (10:22 +0200)]
xfs: allow harvesting indirect block reservation in xfs_bmap_del_extent_delay
The zoned end I/O path converts delalloc to real extents as part of
moving them from the COW to the data fork. Because of that it never
calls xfs_bmap_add_extent_delay_real, which is the usual place that
uses the indirect block reservations. Instead of letting the
reservation go to waste and stressing the reserved block pool, allow
harvesting the reservation when deleting the delayed extent and
donating it to the transaction chain used for remapping it into the data
fork.
Christoph Hellwig [Sun, 6 Oct 2024 14:28:06 +0000 (16:28 +0200)]
xfs: pass a xfs_daddr_t for the new block to xfs_zoned_end_io
That's what both callers have at hand, so do the conversion in a single
place. Also use the correct helper to prepare for the segmented
addressing coming to rtgroups.
Christoph Hellwig [Fri, 13 Sep 2024 08:11:08 +0000 (10:11 +0200)]
xfs: split the zoned from the COW end I/O path
While the high level logic of the zoned end I/O handler is the same
for reflink COW end I/O, the current implementation is a maze of
special cases, especially because zoned direct I/O is not recorded
in the COW fork.
Fork the code into a separate implementation for the zoned end I/O
handler, and revert the reflink end I/O code to the old state.
Note that the zoned end I/O handler works a bit differently from the
previous version, as it is driven by the passed in ranges and only
then looks up the COW fork extent for the buffered I/O case. This
allows sharing more code between the buffered and direct I/O cases.
Christoph Hellwig [Sat, 28 Sep 2024 05:38:59 +0000 (07:38 +0200)]
xfs: gracefully handle running out of block reservations
It turns out there is a nasty three way race (see the newly added comment
in the code) that can steal blocks from the reservation taken at the
beginning of a write. There isn't really much to do about it, so turn
it into a short write and warn about it.
Christoph Hellwig [Tue, 24 Sep 2024 05:24:59 +0000 (07:24 +0200)]
xfs: fix cleanup in xfs_zone_gc_data_alloc
Also clean up the last item.
Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202409150207.qmhcRNcV-lkp@intel.com/ Signed-off-by: Christoph Hellwig <hch@lst.de>
Hans Holmberg [Mon, 23 Sep 2024 16:23:41 +0000 (16:23 +0000)]
xfs: fix max length for zoned gc allocations on conventional devs
We can not rely on max append size when the backing device is not a
zoned block device as this attribute will be zero, causing gc
allocations to always fail.
Instead, use the max segment size which should provide a reasonable
maximum size for the gc allocations.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
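A rough model of the limit selection follows. It assumes the max segment size is used only when the append limit is zero, which the commit message does not spell out; `sketch_limits` and `sketch_gc_max_len` are hypothetical names, not block layer API.

```c
#include <stdint.h>

/* Hypothetical subset of a device's queue limits, in bytes. */
struct sketch_limits {
	uint32_t max_zone_append_bytes; /* zero on non-zoned devices */
	uint32_t max_segment_bytes;
};

/* Pick the maximum length for a single GC allocation. */
static uint32_t sketch_gc_max_len(const struct sketch_limits *lim)
{
	/* Assumption: only fall back when no append limit is advertised. */
	if (lim->max_zone_append_bytes)
		return lim->max_zone_append_bytes;
	return lim->max_segment_bytes;
}
```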
Hans Holmberg [Fri, 7 Jun 2024 08:31:43 +0000 (08:31 +0000)]
xfs: add data placement info to mount stats
Add per-rtg active refs, life time hint and data separation score and
an aggregate data separation score as output to the mount stats
to aid debugging and analysis.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Christoph Hellwig [Mon, 22 Jul 2024 13:31:28 +0000 (06:31 -0700)]
xfs: support xrep_require_rtext_inuse on zoned file systems
Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
Christoph Hellwig [Thu, 15 Aug 2024 16:12:33 +0000 (18:12 +0200)]
xfs: support xchk_xref_is_used_rt_space on zoned file systems
Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
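The write pointer sanity check mentioned in this commit and the previous one might look roughly like the sketch below. The `sketch_wp_zone` layout (a zone start plus a write pointer relative to it, both in file system blocks) is an assumption made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical zone description; all values are in file system blocks. */
struct sketch_wp_zone {
	uint64_t start;         /* first block of the zone */
	uint64_t write_pointer; /* blocks written so far, relative to start */
};

/* True if [block, block + len) lies entirely below the write pointer. */
static bool sketch_extent_is_written(const struct sketch_wp_zone *z,
				     uint64_t block, uint64_t len)
{
	if (block < z->start)
		return false;
	return (block - z->start) + len <= z->write_pointer;
}
```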
Christoph Hellwig [Tue, 10 Sep 2024 05:03:47 +0000 (08:03 +0300)]
xfs: support zoned RT devices
WARNING: this is early prototype code.
The zoned allocator works by handing out data blocks to the direct or
buffered write code at the place where XFS currently does block
allocations. It does not actually insert them into the bmap extent tree
at this time, but only after I/O completion when we know the block number.
The zoned allocator works on any kind of device, including conventional
devices or conventional zones by having a crude write pointer emulation.
For zoned devices active zone management is fully supported, as is
zone capacity < zone size.
The two major limitations are:
- there is no support for unwritten extents and thus persistent
file preallocations from fallocate(). This is inherent to an
always out of place write scheme as there is no way to persistently
preallocate blocks for an indefinite number of overwrites
- because the metadata blocks and data blocks are on different
devices you can run out of space for metadata while having plenty
of space for data and vice versa. This is inherent to a scheme
where we use different devices or pools for each.
For zoned file systems we reserve the free extents before taking the
ilock so that if we have to force garbage collection it happens before we
take the iolock. This is done because GC has to take the iolock after it
moved data to a new place, and this could otherwise deadlock.
This unfortunately has to exclude block zeroing, as for truncate we are
called with the iolock (aka i_rwsem) already held. As zeroing is always
only for a single block at a time, or up to two total for a syscall in
the case of free_file_range, we deal with that by just stealing the block,
but failing the allocation if we'd have to wait for GC.
Add a new RTAVAILABLE counter of blocks that are actually directly
available to be written into in addition to the classic free counter.
Only allow a write to go ahead if it has blocks available to write, and
otherwise wait for GC. This also requires tweaking the need GC condition a
bit as we now always need to GC if someone is waiting for space.
Thanks to Hans Holmberg <hans.holmberg@wdc.com> for lots of fixes
and improvements.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
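The relationship between the classic free counter and the new available counter can be sketched as below. Everything here is a simplified single-threaded model with hypothetical names; in particular, returning a hard out-of-space result when the request exceeds the free counter is an assumption of the sketch, not something the commit message states.

```c
#include <stdint.h>

/* Hypothetical pair of counters, both in blocks. */
struct sketch_space {
	int64_t free;      /* blocks not used by live data */
	int64_t available; /* free blocks that can be written right now */
};

enum sketch_reserve_result {
	SKETCH_RESERVED, /* the write may go ahead */
	SKETCH_WAIT_GC,  /* space exists, but GC must reclaim it first */
	SKETCH_NO_SPACE, /* not even GC could satisfy the request */
};

static enum sketch_reserve_result
sketch_reserve(struct sketch_space *s, int64_t count)
{
	if (count > s->free)
		return SKETCH_NO_SPACE;
	if (count > s->available)
		return SKETCH_WAIT_GC;
	s->free -= count;
	s->available -= count;
	return SKETCH_RESERVED;
}
```

A caller that gets `SKETCH_WAIT_GC` would sleep until garbage collection has moved blocks from "free" to "available" and then retry.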
Christoph Hellwig [Sun, 12 May 2024 05:39:45 +0000 (07:39 +0200)]
xfs: disable sb_frextents scrub/repair for zoned file systems
Zoned file systems not only don't use the frextents counter, but the
in-memory percpu counter also includes reservations taken before even
allocating delalloc extent records, so it will never match the per-zone
used information.
Christoph Hellwig [Tue, 10 Sep 2024 05:00:17 +0000 (08:00 +0300)]
xfs: don't zero post-EOF blocks on write and truncate up for zoned file systems
Zoned file systems don't leave blocks past the last allocated block
around ever, so don't bother with a zeroing operation for these
non-existent blocks. This avoids having to take a space reservation
for these operations.
Christoph Hellwig [Sun, 25 Feb 2024 05:57:49 +0000 (06:57 +0100)]
xfs: add a helper to check if an inode sits on a zoned device
Add a xfs_is_zoned_inode helper that returns true if an inode has the
RT flag set and the file system is zoned. This will be used to key
off zoned allocator behavior.
Make xfs_is_always_cow_inode return true for zoned inodes as we always
need to write out of place on zoned devices.
Christoph Hellwig [Fri, 27 Oct 2023 07:58:24 +0000 (09:58 +0200)]
xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.
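A per-segment alignment check could be sketched as follows. `sketch_iovec` is a hypothetical stand-in that only carries a length, and the function is a userspace model rather than the actual xfs_file_dio_write logic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one iovec segment; only the length matters. */
struct sketch_iovec {
	size_t len;
};

/*
 * True only if the file position and every individual segment length are
 * multiples of the block size. Checking the total length alone would miss
 * segments that are misaligned but happen to sum to an aligned size.
 */
static bool sketch_dio_is_aligned(uint64_t pos, const struct sketch_iovec *iov,
				  size_t nr_segs, size_t blocksize)
{
	size_t i;

	if (pos % blocksize)
		return false;
	for (i = 0; i < nr_segs; i++) {
		if (iov[i].len % blocksize)
			return false;
	}
	return true;
}
```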
Christoph Hellwig [Tue, 10 Sep 2024 04:58:17 +0000 (07:58 +0300)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
bflags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.
Christoph Hellwig [Tue, 17 Oct 2023 07:15:13 +0000 (09:15 +0200)]
xfs: pass an optional startblock argument to xfs_reflink_end_cow
With the upcoming zoned allocator, extents won't be converted from
delalloc to real allocations before submitting the I/O, as we'll only
know the actual block number on I/O completion. Add a new argument to
xfs_reflink_end_cow that can pass in the startblock we got from the
I/O completion handler, and convert the completed range from delalloc
to a written extent covering that block range.
Christoph Hellwig [Fri, 31 May 2024 09:25:00 +0000 (11:25 +0200)]
xfs: factor out a __xfs_reflink_end_cow_extent helper
This will allow reusing the code for zoned direct I/O completions, where
the out of place write allocation is passed as private data with the
I/O and not stored in the COW fork.
Christoph Hellwig [Thu, 15 Aug 2024 15:19:39 +0000 (17:19 +0200)]
xfs: generalize the freespace and reserved blocks handling
The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in and special cased.
Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.
Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_count helpers
directly. This also switches the flooring of the frextents counter
to 0 in statfs for the rtinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.
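The idea of indexing both the free counters and the reserved pools by counter type can be illustrated with plain integers instead of percpu counters. All names below are hypothetical and the fallback policy is a simplification for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter index, one entry per kind of free space. */
enum sketch_free_counter {
	SKETCH_FREE_BLOCKS,
	SKETCH_FREE_RTEXTENTS,
	SKETCH_FREE_NR,
};

struct sketch_mount {
	int64_t free[SKETCH_FREE_NR];     /* may transiently go negative */
	int64_t reserved[SKETCH_FREE_NR]; /* per-counter reserved pool */
};

/* Accessor that hides the array and floors the estimate at zero. */
static int64_t sketch_estimate_free(const struct sketch_mount *mp,
				    enum sketch_free_counter ctr)
{
	return mp->free[ctr] > 0 ? mp->free[ctr] : 0;
}

/* Take `delta` units, optionally dipping into the reserved pool. */
static int sketch_dec_free(struct sketch_mount *mp,
			   enum sketch_free_counter ctr, int64_t delta,
			   bool use_reserved)
{
	if (mp->free[ctr] >= delta) {
		mp->free[ctr] -= delta;
		return 0;
	}
	if (use_reserved && mp->reserved[ctr] >= delta) {
		mp->reserved[ctr] -= delta;
		return 0;
	}
	return -1;
}
```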
Christoph Hellwig [Sat, 17 Aug 2024 06:57:51 +0000 (08:57 +0200)]
xfs: ensure st_blocks never goes to zero during COW writes
COW writes remove the amount overwritten either directly for delalloc
reservations, or in earlier deferred transactions than adding the new
amount back in the bmap map transaction. This means st_blocks on an
inode where all data is overwritten using the COW path can temporarily
show a 0 st_blocks. This can easily be reproduced with the pending
zoned device support where all writes use this path and trips the
check in generic/615, but could also happen on a reflink file without
that.
Fix this by temporarily adding the pending blocks to be mapped to
i_delayed_blks while the item is queued.
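The accounting problem and the fix can be traced with a toy model. It assumes that the reported block count is the sum of the mapped blocks and the delayed blocks; the structure and helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical inode accounting, in file system blocks. */
struct sketch_inode {
	uint64_t nblocks;      /* blocks mapped in the data fork */
	uint64_t delayed_blks; /* blocks reserved but not yet mapped */
};

/* Assumed reporting rule: mapped plus delayed blocks. */
static uint64_t sketch_st_blocks(const struct sketch_inode *ip)
{
	return ip->nblocks + ip->delayed_blks;
}

/* Queue a remap: account the pending blocks as delayed right away. */
static void sketch_queue_remap(struct sketch_inode *ip, uint64_t len)
{
	ip->delayed_blks += len;
}

/* The old mapping is removed in an earlier transaction. */
static void sketch_unmap_old(struct sketch_inode *ip, uint64_t len)
{
	ip->nblocks -= len;
}

/* The remap finishes: move the blocks from delayed to mapped. */
static void sketch_finish_remap(struct sketch_inode *ip, uint64_t len)
{
	ip->delayed_blks -= len;
	ip->nblocks += len;
}
```

Without the call to `sketch_queue_remap`, the reported value would drop to zero between `sketch_unmap_old` and the final mapping.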
Christoph Hellwig [Fri, 16 Aug 2024 16:49:13 +0000 (18:49 +0200)]
iomap: pass private data to iomap_truncate_page
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Fri, 16 Aug 2024 16:48:16 +0000 (18:48 +0200)]
iomap: pass private data to iomap_zero_range
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Tue, 10 Sep 2024 04:57:21 +0000 (07:57 +0300)]
iomap: pass private data to iomap_page_mkwrite
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Tue, 10 Sep 2024 04:56:53 +0000 (07:56 +0300)]
iomap: pass private data to iomap_file_buffered_write
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Fri, 24 Nov 2023 09:45:35 +0000 (10:45 +0100)]
iomap: optionally use ioends for direct I/O
struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.
For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.
So instead of reinventing the wheel, reuse the existing infrastructure.
Christoph Hellwig [Wed, 14 Feb 2024 14:09:44 +0000 (15:09 +0100)]
iomap: support IOMAP_F_ZONE_APPEND for buffered I/O
Add support for Zone Append commands to the iomap buffer writeback code.
This involves selecting the right block layer operation and using the
right helper to add data to the bio, as well as not creating chained
bios inside the iomap for zone append as they could be written
non-contiguously and adjusting the sector based merge criteria.
Christoph Hellwig [Fri, 20 Oct 2023 13:36:03 +0000 (15:36 +0200)]
iomap: make iomap_sector Zone Append aware
Zone Append commands always point to the zone start sector. Change
the iomap_sector() helper to not adjust the start block for the position
in the iomap range for Zone Append iomaps.
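The change to the sector calculation can be sketched as below. The `sketch_iomap` structure is a cut-down hypothetical stand-in with a boolean in place of the real flag, and a 512-byte sector size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9 /* assumes 512-byte sectors */

/* Hypothetical cut-down mapping; all values are in bytes. */
struct sketch_iomap {
	uint64_t addr;        /* disk address of the mapping start */
	uint64_t offset;      /* file offset of the mapping start */
	bool     zone_append; /* stands in for the Zone Append flag */
};

/* Sector to submit the I/O for file position `pos` to. */
static uint64_t sketch_iomap_sector(const struct sketch_iomap *m, uint64_t pos)
{
	/* Zone Append always targets the zone start; the device picks the spot. */
	if (m->zone_append)
		return m->addr >> SKETCH_SECTOR_SHIFT;
	return (m->addr + pos - m->offset) >> SKETCH_SECTOR_SHIFT;
}
```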
Christoph Hellwig [Sun, 5 Nov 2023 05:40:52 +0000 (06:40 +0100)]
iomap: reinstate IOMAP_F_ZONE_APPEND support
Add back the support for using Zone Append in the iomap direct I/O code
that was removed a while ago as we'll use it for the XFS zoned device
support.
This is essentially a revert of commit 8e81aa16a421 ("iomap: remove
IOMAP_F_ZONE_APPEND") with an additional comment describing the flag now
that all the other IOMAP_F_* flags have a nice description.
Christoph Hellwig [Fri, 13 Oct 2023 06:09:39 +0000 (08:09 +0200)]
iomap: wait for writeback before allocating new blocks
This means we are actually forced to allocate new delalloc space for the
new dirtier instead of reusing one that is currently being used for
writeback.
Christoph Hellwig [Mon, 26 Aug 2024 06:16:23 +0000 (08:16 +0200)]
xfs: punch delalloc extents from the COW fork for COW writes
When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.
Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 05:32:22 +0000 (07:32 +0200)]
xfs: set IOMAP_F_SHARED for all COW fork allocations
Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for COW
fork allocations even if they don't overlap existing data fork extents,
which will allow the iomap_end callback to detect if it has to punch
stale delalloc blocks from the COW fork instead of the data fork. It
also means we sample the sequence counter for both the data and the COW
fork when writing to the COW fork, which ensures we properly revalidate
when only COW fork changes happen.
This is essentially a revert of commit 72a048c1056a ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.
Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 05:04:01 +0000 (07:04 +0200)]
xfs: share more code in xfs_buffered_write_iomap_begin
Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 06:16:10 +0000 (08:16 +0200)]
xfs: support the COW fork in xfs_bmap_punch_delalloc_range
xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up, prepare for that by adding support for the COW
fork in xfs_bmap_punch_delalloc_range.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Tue, 3 Sep 2024 08:24:34 +0000 (11:24 +0300)]
xfs: take XFS_MMAPLOCK_EXCL in xfs_file_write_zero_eof
xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that
is actually the right thing, as an error in the iomap zeroing code will
also take the invalidate_lock to clean up, but to fix that deadlock we
need a consistent locking pattern first.
The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read
pagefaults, which isn't really needed here, but also not actively
harmful.
Christoph Hellwig [Tue, 3 Sep 2024 08:19:33 +0000 (11:19 +0300)]
iomap: remove the iomap_file_buffered_write_punch_delalloc return value
iomap_file_buffered_write_punch_delalloc can only return errors if either
the ->punch callback returned an error, or if someone changed the API of
mapping_seek_hole_data to return a negative error code that is not
-ENXIO.
As the only instance of ->punch never returns an error, and such an error
would be fatal anyway, remove the entire error propagation and don't
return an error code from iomap_file_buffered_write_punch_delalloc.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 04:47:04 +0000 (06:47 +0200)]
iomap: pass flags to iomap_file_buffered_write_punch_delalloc
To fix short write error handling, we'll need to figure out what operation
iomap_file_buffered_write_punch_delalloc is called for. Pass the flags
argument on to it, and reorder the argument list to match that of
->iomap_end so that the compiler only has to add the new punch argument
to the end of it instead of reshuffling the registers.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 05:40:31 +0000 (07:40 +0200)]
iomap: improve shared block detection in iomap_unshare_iter
Currently iomap_unshare_iter relies on the IOMAP_F_SHARED flag to detect
blocks to unshare. This is reasonable, but IOMAP_F_SHARED is also useful
for the file system to do internal book keeping for out of place writes.
XFS used to do that, until it got removed in commit 72a048c1056a
("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write")
because unshare would incorrectly unshare such blocks.
Add an extra safeguard by checking the explicitly provided srcmap instead
of the fallback to the iomap for valid data, as that catches the case
where we'd just copy from the same place we'd write to easily, allowing
us to reinstate setting IOMAP_F_SHARED for all XFS writes that go to the
COW fork.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 04:47:50 +0000 (06:47 +0200)]
iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
When direct I/O completion invalidates the page cache it holds neither the
i_rwsem nor the invalidate_lock so it can be racing with
iomap_write_delalloc_release. If the search for the end of the region that
contains data returns the start offset we hit such a race and just need to
look for the end of the newly created hole instead.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Tue, 20 Aug 2024 16:13:09 +0000 (18:13 +0200)]
xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
variant fails to drop into the lowmode last resort allocator, and
thus can sometimes fail allocations for which the caller has a
transaction block reservation.
Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Wed, 4 Sep 2024 04:18:52 +0000 (07:18 +0300)]
xfs: don't ifdef around the exact minlen allocations
Exact minlen allocations only exist as an error injection tool for debug
builds. Currently this is implemented using ifdefs, which means the code
isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test
coverage by always building the code and use the compiler's dead code
elimination to remove it from the generated binary instead.
The only downside is that the new bitfield is unconditionally added to
struct xfs_alloc_args now.
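The general technique of replacing an ifdef with a constant condition looks roughly like this. `SKETCH_DEBUG`, `SKETCH_DEBUG_ENABLED` and the argument structure are hypothetical; the sketch only shows why the code is always compiled yet can be dropped by the optimizer when the constant is 0.

```c
#include <stdbool.h>

/* Hypothetical build switch folded into a compile-time constant. */
#ifdef SKETCH_DEBUG
#define SKETCH_DEBUG_ENABLED 1
#else
#define SKETCH_DEBUG_ENABLED 0
#endif

/* Hypothetical allocation arguments; the flag exists in every build. */
struct sketch_alloc_args {
	unsigned int minlen;
	unsigned int maxlen;
	bool         exact_minlen; /* only honoured in debug builds */
};

static unsigned int sketch_pick_len(const struct sketch_alloc_args *args)
{
	/*
	 * Always parsed and type checked. When the constant is 0 the
	 * condition is statically false and the branch can be eliminated.
	 */
	if (SKETCH_DEBUG_ENABLED && args->exact_minlen)
		return args->minlen;
	return args->maxlen;
}
```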
Christoph Hellwig [Mon, 2 Sep 2024 08:01:51 +0000 (11:01 +0300)]
xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return
-ENOSPC both for an actual failure to allocate a disk block, but also
to signal the caller to convert the format of the attr fork. Use magic
1 to ask for the conversion here as well.
Note that unlike the similar issue in xfs_attr3_leaf_split, this one was
only found by code review.
Christoph Hellwig [Tue, 20 Aug 2024 04:02:40 +0000 (06:02 +0200)]
xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs_attr3_leaf_split propagates the need for an extra btree split as
-ENOSPC to its only caller, but the same return value can also be
returned from xfs_da_grow_inode when it fails to find free space.
Distinguish the two cases by returning 1 for the extra split case instead
of overloading -ENOSPC.
This can be triggered relatively easily with the pending realtime group
support and a file system with a lot of small zones that use metadata
space on the main device. In this case about every 5th to 10th run of
xfs/538 runs into the following assert:
ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC);
in xfs_attr3_leaf_split caused by an allocation failure. Note that
the allocation failure is caused by another bug that will be fixed
subsequently, but this commit at least sorts out the error handling.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
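The return value convention can be modelled as follows. Both functions are hypothetical stand-ins that take the outcome as boolean inputs; they only show how a positive 1 keeps "another split is needed" apart from a genuine -ENOSPC.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical split step: 0 on success, 1 if the caller must split
 * again, -ENOSPC only if no block could be allocated.
 */
static int sketch_leaf_split(bool have_free_block, bool entry_fits)
{
	if (!have_free_block)
		return -ENOSPC;
	if (!entry_fits)
		return 1;
	return 0;
}

/* Hypothetical caller: number of splits performed, or a negative errno. */
static int sketch_add_entry(bool have_free_block, bool entry_fits)
{
	int ret = sketch_leaf_split(have_free_block, entry_fits);

	if (ret < 0)
		return ret; /* real failure, propagate unchanged */
	return ret == 1 ? 2 : 1;
}
```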
Christoph Hellwig [Tue, 20 Aug 2024 04:25:21 +0000 (06:25 +0200)]
xfs: return bool from xfs_attr3_leaf_add
xfs_attr3_leaf_add only has two potential return values, indicating if the
entry could be added or not. Replace the errno return with a bool so that
ENOSPC from it can't easily be confused with a real ENOSPC.
Remove the return value from the xfs_attr3_leaf_add_work helper entirely,
as it always returns 0.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Christoph Hellwig [Mon, 26 Aug 2024 15:40:42 +0000 (17:40 +0200)]
block: don't use bio_split_rw on misc operations
bio_split_rw is designed to split read and write bios with a payload.
Currently it is called by __bio_split_to_limits for all operations not
explicitly listed, which works because bio_may_need_split explicitly checks
for bi_vcnt == 1 and thus skips the bypass if there is no payload and
the bio_for_each_bvec loop will never execute its body if bi_size is 0.
But all this is hard to understand, fragile and wastes cycles pointlessly.
Switch __bio_split_to_limits to only call bio_split_rw for READ and
WRITE commands and don't attempt any kind of split for operations that do not
require splitting.
Christoph Hellwig [Mon, 26 Aug 2024 14:33:03 +0000 (16:33 +0200)]
block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limits
Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in
__bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND
bios do not adhere to the soft max_limits value but instead use their
own capped version of max_hw_sectors, leading to incorrect splits that
later blow up in bio_split.
We still need the bio_split_rw logic to count nr_segs for blk-mq code,
so add a new wrapper that passes in the right limit, and turns any bio
that would need a split into an error as an additional debugging aid.
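Taken together with the previous commit, the per-operation dispatch could be modelled like this. The operation enum, the limits structure and the return convention are all hypothetical simplifications; real splitting also has to consider segments and alignment, which the sketch ignores.

```c
/* Hypothetical operation types and limits, sizes in sectors. */
enum sketch_op {
	SKETCH_OP_READ,
	SKETCH_OP_WRITE,
	SKETCH_OP_ZONE_APPEND,
	SKETCH_OP_FLUSH,
};

struct sketch_split_limits {
	unsigned int max_sectors;
	unsigned int max_zone_append_sectors;
};

/*
 * Returns the number of sectors to keep in the first bio, 0 if no split
 * is needed, or -1 if the bio is invalid for its operation.
 */
static int sketch_split_sectors(enum sketch_op op, unsigned int sectors,
				const struct sketch_split_limits *lim)
{
	switch (op) {
	case SKETCH_OP_READ:
	case SKETCH_OP_WRITE:
		return sectors > lim->max_sectors ? (int)lim->max_sectors : 0;
	case SKETCH_OP_ZONE_APPEND:
		/* Has its own limit and must never be split: error out. */
		return sectors > lim->max_zone_append_sectors ? -1 : 0;
	default:
		/* Operations without a payload never need splitting. */
		return 0;
	}
}
```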
Christoph Hellwig [Mon, 26 Aug 2024 13:45:40 +0000 (15:45 +0200)]
block: rework bio splitting
The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.
Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead having the various bio_split_* helpers directly
submit the potentially split bios.
To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.
Darrick J. Wong [Thu, 15 Aug 2024 18:49:48 +0000 (11:49 -0700)]
xfs: check for shared rt extents when rebuilding rt file's data fork
When we're rebuilding the data fork of a realtime file, we need to
cross-reference each mapping with the rt refcount btree to ensure that
the reflink flag is set if there are any shared extents found.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:45 +0000 (11:49 -0700)]
xfs: walk the rt reference count tree when rebuilding rmap
When we're rebuilding the data device rmap, if we encounter a "refcount"
format fork, we have to walk the (realtime) refcount btree inode to
build the appropriate mappings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:44 +0000 (11:49 -0700)]
xfs: check new rtbitmap records against rt refcount btree
When we're rebuilding the realtime bitmap, check the proposed free
extents against the rt refcount btree to make sure we don't commit any
grievous errors.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:43 +0000 (11:49 -0700)]
xfs: don't flag quota rt block usage on rtreflink filesystems
Quota space usage is allowed to exceed the size of the physical storage
when reflink is enabled. Now that we have reflink for the realtime
volume, apply this same logic to the rtb repair logic.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:41 +0000 (11:49 -0700)]
xfs: detect and repair misaligned rtinherit directory cowextsize hints
If we encounter a directory that has been configured to pass on a CoW
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review and/or turn it off because that is a
misconfiguration.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
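The misalignment test itself is a simple modulo check, sketched below. The helper name is hypothetical, both sizes are taken to be in file system blocks, and treating a zero hint or an rt extent size of one block as always aligned is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Both sizes are assumed to be in file system blocks. */
static bool sketch_cowextsize_is_misaligned(uint32_t cowextsize,
					    uint32_t rextsize)
{
	/* No hint set, or single-block rt extents: nothing to misalign. */
	if (cowextsize == 0 || rextsize <= 1)
		return false;
	return cowextsize % rextsize != 0;
}
```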
Darrick J. Wong [Thu, 15 Aug 2024 18:49:40 +0000 (11:49 -0700)]
xfs: check reference counts of gaps between rt refcount records
If there's a gap between records in the rt refcount btree, we ought to
cross-reference the gap with the rtrmap records to make sure that there
aren't any overlapping records for a region that doesn't have any shared
ownership.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:35 +0000 (11:49 -0700)]
xfs: check that the rtrefcount maxlevels doesn't increase when growing fs
The size of filesystem transaction reservations depends on the maximum
height (maxlevels) of the realtime btrees. Since we don't want a grow
operation to increase the reservation size enough that we'll fail the
minimum log size checks on the next mount, constrain growfs operations
if they would cause an increase in the rt refcount btree maxlevels.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
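How a grow operation could be gated on btree height is sketched below. The height calculation is a generic worst-case estimate for a btree with fixed fanout, and all names and parameters are hypothetical; the sketch assumes `leaf_recs >= 1` and `node_ptrs >= 2`.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Levels needed to index `records` records with `leaf_recs` records per
 * leaf block and `node_ptrs` pointers per node block.
 */
static unsigned int sketch_btree_levels(uint64_t records,
					unsigned int leaf_recs,
					unsigned int node_ptrs)
{
	uint64_t blocks = (records + leaf_recs - 1) / leaf_recs;
	unsigned int levels = 1;

	while (blocks > 1) {
		blocks = (blocks + node_ptrs - 1) / node_ptrs;
		levels++;
	}
	return levels;
}

/* Refuse a grow that would make the btree taller than before. */
static bool sketch_growfs_allowed(uint64_t old_records, uint64_t new_records,
				  unsigned int leaf_recs,
				  unsigned int node_ptrs)
{
	return sketch_btree_levels(new_records, leaf_recs, node_ptrs) <=
	       sketch_btree_levels(old_records, leaf_recs, node_ptrs);
}
```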
Darrick J. Wong [Thu, 15 Aug 2024 18:49:33 +0000 (11:49 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint
The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.
Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:33 +0000 (11:49 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files. This apparently was done to disable
delayed allocation for realtime files.
However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files. In this case, the
logic will incorrectly send the write through the delalloc write path.
Fix this by adjusting the logic slightly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Darrick J. Wong [Thu, 15 Aug 2024 18:49:29 +0000 (11:49 -0700)]
xfs: enable CoW for realtime data
Update our write paths to support copy on write on the rt volume. This
works in more or less the same way as it does on the data device, with
the major exception that we never do delalloc on the rt volume.
Because we consider unwritten CoW fork staging extents to be incore
quota reservation, we update xfs_quota_reserve_blkres to support this
case. Though xfs doesn't allow rt and quota together, the change is
trivial and we shouldn't leave a logic bomb here.
While we're at it, add a missing xfs_mod_delalloc call when we remove
delalloc block reservation from the inode. This is largely irrelevant
since realtime files do not use delalloc, but we want to avoid leaving
logic bombs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>