www.infradead.org Git - users/hch/xfs.git/log

xfs: shut the file system down on corrupted used counter

If a free operation tries to free more blocks than the used counter
holds, the file system is clearly corrupted, so shut it down. Keep the
debug-only assert to follow the (good?) old XFS tradition of panicking
on corruption for debug builds.

Signed-off-by: Christoph Hellwig <hch@lst.de>
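
As a rough illustration of the check described in the commit above, the
following userspace sketch refuses to underflow a used counter and marks the
state as shut down instead. The structure and function names are hypothetical
and are not the XFS ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for per-zone state; not the XFS structures. */
struct zone_sketch {
	uint64_t used;      /* blocks currently accounted as used */
	bool     shut_down; /* set once corruption has been detected */
};

/*
 * Free nr blocks from the zone.  Returns 0 on success, or -1 after
 * marking the zone shut down when the request exceeds the used counter.
 */
int zone_free_blocks(struct zone_sketch *z, uint64_t nr)
{
	if (nr > z->used) {
		/* The counter would underflow: treat as corruption, leave it alone. */
		z->shut_down = true;
		return -1;
	}
	z->used -= nr;
	return 0;
}
```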

xfs: don't clear XFS_RTG_RECLAIMABLE at the start of GC

It turns out GC can skip inodes when iget would have to get into retry
loops. When we skip an inode during GC of a zone that means we can't
empty the zone, and essentially leak the space in it until another block
is freed in that zone and XFS_RTG_RECLAIMABLE is set again.

Stop clearing XFS_RTG_RECLAIMABLE as freeing of the final block already
does that and GC is single threaded anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: fix freed block account at mount time

If there are fewer used blocks than the write pointer in a zone that is
open at mount time, the difference needs to be accounted to the
reclaimable block counter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
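
The accounting rule in the commit above can be sketched as simple arithmetic.
This is only an illustration under the assumption that both values are block
counts within one zone; the helper name is made up for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: blocks below the write pointer that are no longer
 * used are counted as reclaimable.  Returns 0 if nothing has been freed.
 */
uint64_t zone_reclaimable_at_mount(uint64_t write_pointer, uint64_t used)
{
	if (used >= write_pointer)
		return 0;
	return write_pointer - used;
}
```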

xfs: delete the COW fork delalloc extent when submitting writeback

->map_blocks is the place the actual disk blocks are allocated.  To ensure
truncate (or any other xfs_reflink_cancel_cow_blocks caller) can't remove
a delalloc extent for which we've already allocated on-disk blocks, remove
it as soon as we've committed to allocating disk blocks, making sure there either
is a delalloc extent or an in-flight I/O.

It turns out this also nicely simplifies the end I/O handler to be the same
for buffered vs direct I/O.

On the other hand it undoes some of the work to harvest the delalloc blocks
for the end I/O handler.  At the same time it removes the double accounting
and thus makes the behavior the same as the conventional COW end I/O handler,
which so far has worked.  We might want to handle the block reservations a
little more intelligently, preferably in a way common to the zoned and COW
end I/O handlers.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: handle truncated delalloc reservations in xfs_zoned_map_blocks

Truncate could have removed some or all of the delalloc reservation
covering the folio range. Adjust xfs_zoned_map_blocks to not
write back such ranges, because we don't have a block reservation
covering it.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: assert that the written blocks aren't larger than the writepoint

This happened to me due to messed up daddr conversions in the rebase
branch. Add an assert to catch this.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: don't start new GC work on shut down file systems

Without this we can busy loop in gcd using up 100% CPU after the file
system has been shut down. This can be triggered by repeated runs
of xfs/548.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add a CONFIG_XFS_RT ifdef for the zoned end I/O handler

Else !XFS_RT builds fail.

Note that it's probably time to reorg the code a bit in the near future
to move this out of xfs_reflink.c where it doesn't really belong anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: use the indirect block reservation in the zoned end I/O handler

Use the indirect block reservation from the COW fork delalloc extent to
reduce the blocks needed for the zoned end I/O handler so that we
don't run out of the reserved blocks pool too easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: allow harvesting indirect block reservation in xfs_bmap_del_extent_delay

The zoned end I/O path converts delalloc to real extents as part of
moving them from the COW to the data fork. Because of that it never
calls xfs_bmap_add_extent_delay_real, which is the usual place that
uses the indirect block reservations. Instead of letting the
reservation go to waste and stressing the reserved block pool, allow
harvesting the reservation when deleting the delayed extent and
donate it to the transaction chain used for remapping it into the data
fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: pass a xfs_daddr_t for the new block to xfs_zoned_end_io

That's what both callers have at hand, so do the conversion in a single
place. Also use the correct helper to prepare for the segmented
addressing coming to rtgroups.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: store the new block as xfs_daddr_t in struct xfs_gc_bio

We're doing I/O on the sector number, and with the upcoming xfs_rtgroup
refactoring it will be less work to generate that directly from the rgbno.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: split the zoned from the COW end I/O path

While the high level logic of the zoned end I/O handler is the same
for reflink COW end I/O, the current implementation is a maze of
special cases, especially because zoned direct I/O is not recorded
in the COW fork.

Fork the code into a separate implementation for the zoned end I/O
handler, and revert the reflink end I/O code to the old state.

Note that the zoned end I/O handler works a bit differently from the
previous version, as it is driven by the passed in ranges and only
then looks up the COW fork extent for the buffered I/O case. This
allows sharing more code between the buffered and direct I/O cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: fix xfs_bmap_punch_delalloc_range for !zoned

ac can be non-NULL when called from xfs_free_file_space. Turn the zoned
assert into a real if.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: gracefully handle running out of block reservations

It turns out there is a nasty three way race (see the newly added comment
in the code) that can steal blocks from the reservation taken at the
beginning of a write. There isn't really much to do about it, so turn
it into a short write and warn about it.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: fix cleanup in xfs_zone_gc_data_alloc

Also clean up the last item.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202409150207.qmhcRNcV-lkp@intel.com/
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: fix max length for zoned gc allocations on conventional devs

We cannot rely on the max append size when the backing device is not a
zoned block device, as this attribute will be zero, causing GC
allocations to always fail.

Instead, use the max segment size, which should provide a reasonable
maximum size for the GC allocations.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
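
A minimal sketch of the fallback described in the commit above. The limits
structure and its fields are hypothetical stand-ins, and both values are
assumed to be in the same unit (bytes here), which may not match the real
queue limits.

```c
#include <stdint.h>

/* Hypothetical subset of block device limits, both in bytes. */
struct limits_sketch {
	uint64_t max_zone_append_bytes; /* zero on non-zoned devices */
	uint64_t max_segment_bytes;
};

/* Pick the maximum length for a single GC allocation. */
uint64_t gc_max_alloc_bytes(const struct limits_sketch *lim)
{
	/* A zero append limit means the device is not zoned: fall back. */
	if (lim->max_zone_append_bytes)
		return lim->max_zone_append_bytes;
	return lim->max_segment_bytes;
}
```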

xfs: add data placement info to mount stats

Add per-rtg active refs, life time hint and data separation score and
an aggregate data separation score as output to the mount stats
to aid debugging and analysis.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>

xfs: first stab at write life time data placement

Add a file write life time data placement allocation scheme that aims to
minimize fragmentation by doing two things:

a) Separate file data into different zones when possible.
b) Colocate file data of similar life times when feasible.

To get the best results, average file sizes should align with average
zone capacity.

Benchmarked with RocksDB using leveled compaction, observing a ~10%
throughput improvement for overwrite workloads at 80% file system
utilization.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>

xfs: add plumbing and mount option for write life time hints

Add a mount option and some plumbing for enabling usage
of file write life time hints.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: support xrep_require_rtext_inuse on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: support xchk_xref_is_used_rt_space on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: support zoned RT devices

WARNING: this is early prototype code.

The zoned allocator works by handing out data blocks to the direct or
buffered write code at the place where XFS currently does block
allocations.  It does not actually insert them into the bmap extent tree
at this time, but only after I/O completion when we know the block number.

The zoned allocator works on any kind of device, including conventional
devices or conventional zones, by having a crude write pointer emulation.
For zoned devices active zone management is fully supported, as is
zone capacity < zone size.

The two major limitations are:

- there is no support for unwritten extents and thus persistent
   file preallocations from fallocate().  This is inherent to an
   always out of place write scheme as there is no way to persistently
   preallocate blocks for an indefinite number of overwrites
- because the metadata blocks and data blocks are on different
   devices you can run out of space for metadata while having plenty
   of space for data and vice versa.  This is inherent to a scheme
   where we use different devices or pools for each.

For zoned file systems we reserve the free extents before taking the
ilock so that if we have to force garbage collection it happens before we
take the iolock.  This is done because GC has to take the iolock after it
moved data to a new place, and this could otherwise deadlock.

This unfortunately has to exclude block zeroing, as for truncate we are
called with the iolock (aka i_rwsem) already held.  As zeroing is always
only for a single block at a time, or up to two total for a syscall in
the case of free_file_range, we deal with that by just stealing the block,
but failing the allocation if we'd have to wait for GC.

Add a new RTAVAILABLE counter of blocks that are actually directly
available to be written into in addition to the classic free counter.
Only allow a write to go ahead if it has blocks available to write, and
otherwise wait for GC.  This also requires tweaking the need GC condition a
bit as we now always need to GC if someone is waiting for space.

Thanks to Hans Holmberg <hans.holmberg@wdc.com> for lots of fixes
and improvements.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
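
The RTAVAILABLE gating described in the commit above might be pictured with
the following sketch. It assumes, purely for illustration, that "available"
never exceeds "free" and that a reservation takes from both counters; the
names and the three-way result are hypothetical and not taken from the XFS
code.

```c
#include <stdint.h>

/* Hypothetical pair of counters, both in blocks. */
struct space_sketch {
	uint64_t free;      /* classic free counter */
	uint64_t available; /* blocks that can be written right now, without GC */
};

enum reserve_result {
	RESERVE_OK,          /* the write may go ahead */
	RESERVE_WAIT_FOR_GC, /* space exists, but GC must make it writable first */
	RESERVE_NOSPC,       /* not enough free space at all */
};

/* Try to reserve nr blocks for a write; counters change only on success. */
enum reserve_result reserve_blocks(struct space_sketch *s, uint64_t nr)
{
	if (nr > s->free)
		return RESERVE_NOSPC;
	if (nr > s->available)
		return RESERVE_WAIT_FOR_GC;
	s->free -= nr;
	s->available -= nr;
	return RESERVE_OK;
}
```

A caller that gets RESERVE_WAIT_FOR_GC would register as a waiter, which is
why the "need GC" condition has to take waiters into account.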

xfs: allow COW forks on zoned file systems in xchk_bmap

Zoned file systems can have COW forks even without reflinks.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: disable sb_frextents scrub/repair for zoned file systems

Zoned file systems not only don't use the frextents counter, but the
in-memory percpu counter also includes reservations taken before even
allocating delalloc extent records, so it will never match the per-zone
used information.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: don't zero post-EOF blocks on write and truncate up for zoned file systems

Zoned file systems never leave blocks past the last allocated block
around, so don't bother with a zeroing operation for these
non-existent blocks. This avoids having to take a space reservation
for these operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: add a helper to check if an inode sits on a zoned device

Add a xfs_is_zoned_inode helper that returns true if an inode has the
RT flag set and the file system is zoned. This will be used to key
off zoned allocator behavior.

Make xfs_is_always_cow_inode return true for zoned inodes as we always
need to write out of place on zoned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
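
The two predicates described in the commit above could look roughly like
this in a standalone sketch. The structures are simplified stand-ins, and the
other_always_cow parameter is a hypothetical placeholder for whatever
conditions already made an inode always-COW before this change.

```c
#include <stdbool.h>

/* Hypothetical, simplified mount and inode state. */
struct mount_sketch {
	bool zoned;    /* file system uses the zoned allocator */
};

struct inode_sketch {
	const struct mount_sketch *mount;
	bool realtime; /* inode has the RT flag set */
};

/* True if the inode has the RT flag and lives on a zoned file system. */
bool is_zoned_inode(const struct inode_sketch *ip)
{
	return ip->realtime && ip->mount->zoned;
}

/* Zoned inodes must always be written out of place. */
bool is_always_cow_inode(const struct inode_sketch *ip, bool other_always_cow)
{
	return is_zoned_inode(ip) || other_always_cow;
}
```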

xfs: add an incompat feature bit for zoned RT devices

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write

For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
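
To illustrate why checking only the total length is not enough, here is a
userspace sketch that checks every iovec segment. It uses the POSIX
struct iovec rather than the kernel iov_iter, and the function name is
invented for the example; block_size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Return true only if the base address and the length of every segment
 * are multiples of block_size (which must be a power of two).
 */
bool iov_segments_block_aligned(const struct iovec *iov, int iovcnt,
				size_t block_size)
{
	uintptr_t mask = (uintptr_t)block_size - 1;

	for (int i = 0; i < iovcnt; i++) {
		if (((uintptr_t)iov[i].iov_base | iov[i].iov_len) & mask)
			return false;
	}
	return true;
}
```

Two 2048-byte segments add up to one aligned 4096-byte write, yet each
segment on its own is unaligned, which is the case the per-segment check
catches.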

xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
bflags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: pass an optional startblock argument to xfs_reflink_end_cow

With the upcoming zoned allocator, extents won't be converted from
delalloc to real allocations before submitting the I/O, as we'll only
know the actual block number on I/O completion. Add a new argument to
xfs_reflink_end_cow that can pass in the startblock we got from the
I/O completion handler, and convert the completed range from delalloc
to a written extent covering that block range.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: factor out a __xfs_reflink_end_cow_extent helper

This will allow reusing the code for zoned direct I/O completions, where
the out of place write allocation is passed as private data with the
I/O and not stored in the COW fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: skip always_cow inodes in xfs_reflink_trim_around_shared

xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree. Always_cow inodes don't have that tree, so don't
bother.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: preserve RT reservations across remounts

Introduce a reservation setting for rt devices so that zoned GC
reservations are preserved over remount ro/rw cycles.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: generalize the freespace and reserved blocks handling

The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in and special cased.

Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.

Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_counter helpers
directly. This also switches the flooring of the frextents counter
to 0 in statfs for the rtinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
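
The idea of indexing the free space and reserved block counters by type,
and going through helpers instead of touching the counters directly, can be
sketched as follows. This is a plain userspace illustration with made-up
names and ordinary integers in place of per-cpu counters.

```c
#include <stdint.h>

/* Hypothetical counter indices; one slot per kind of space. */
enum free_counter_sketch {
	FC_BLOCKS,    /* data device blocks */
	FC_RTEXTENTS, /* realtime extents */
	FC_NR,
};

struct mount_counters_sketch {
	int64_t  free[FC_NR];     /* may transiently read as negative */
	uint64_t reserved[FC_NR]; /* per-type reserved pool size */
};

/* Raw read of one counter. */
int64_t freecounter_read(const struct mount_counters_sketch *mc,
			 enum free_counter_sketch ctr)
{
	return mc->free[ctr];
}

/* Read floored at zero, as a statfs-style reporter would want. */
uint64_t freecounter_read_floored(const struct mount_counters_sketch *mc,
				  enum free_counter_sketch ctr)
{
	int64_t v = mc->free[ctr];

	return v < 0 ? 0 : (uint64_t)v;
}

/* Space left for ordinary users after setting aside the reserved pool. */
uint64_t freecounter_avail(const struct mount_counters_sketch *mc,
			   enum free_counter_sketch ctr)
{
	uint64_t v = freecounter_read_floored(mc, ctr);

	return v > mc->reserved[ctr] ? v - mc->reserved[ctr] : 0;
}
```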

xfs: ensure st_blocks never goes to zero during COW writes

COW writes remove the amount overwritten either directly for delalloc
reservations, or in earlier deferred transactions than adding the new
amount back in the bmap map transaction. This means st_blocks on an
inode where all data is overwritten using the COW path can temporarily
show a 0 st_blocks. This can easily be reproduced with the pending
zoned device support where all writes use this path and trips the
check in generic/615, but could also happen on a reflink file without
that.

Fix this by temporarily adding the pending blocks to be mapped to
i_delayed_blks while the item is queued.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: pass private data to iomap_truncate_page

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: pass private data to iomap_zero_range

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: pass private data to iomap_page_mkwrite

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: pass private data to iomap_file_buffered_write

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: optionally use ioends for direct I/O

struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.

For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.

So instead of reinventing the wheel, reuse the existing infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: don't merge ioends with mismatching fs private flag

If the file system set its private flag on one ioend but not the other,
we had better not merge the two.

Signed-off-by: Christoph Hellwig <hch@lst.de>
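
A small sketch of the merge rule from the commit above. The ioend structure
and the flag name are hypothetical simplifications; the real merge criteria
include more conditions than file contiguity.

```c
#include <stdbool.h>
#include <stdint.h>

#define IOEND_F_PRIVATE_SKETCH (1u << 0) /* hypothetical fs-private flag */

/* Hypothetical, simplified ioend: a byte range plus flags. */
struct ioend_sketch {
	uint64_t     offset;
	uint64_t     size;
	unsigned int flags;
};

/* Two ioends may merge only if contiguous and agreeing on the private flag. */
bool ioend_can_merge(const struct ioend_sketch *ioend,
		     const struct ioend_sketch *next)
{
	if ((ioend->flags ^ next->flags) & IOEND_F_PRIVATE_SKETCH)
		return false;
	return ioend->offset + ioend->size == next->offset;
}
```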

iomap: export iomap_submit_ioend

This will let file systems submit the current ioend from ->map_blocks
to free resources for additional allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: support IOMAP_F_ZONE_APPEND for buffered I/O

Add support for Zone Append commands to the iomap buffer writeback code.
This involves selecting the right block layer operation and using the
right helper to add data to the bio, as well as not creating chained
bios inside the iomap for zone append as they could be written
non-contiguously and adjusting the sector based merge criteria.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: make iomap_sector Zone Append aware

Zone Append commands always point to the zone start sector. Change
the iomap_sector() helper to not adjust the start block for the position
in the iomap range for Zone Append iomaps.

Signed-off-by: Christoph Hellwig <hch@lst.de>
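
The change to the sector calculation described above can be sketched like
this. The mapping structure, flag name, and helper name are hypothetical
stand-ins for the iomap ones, and a 512-byte sector is assumed.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT  9         /* 512-byte sectors */
#define MAP_F_ZONE_APPEND    (1u << 0) /* hypothetical flag */

/* Hypothetical, simplified mapping: disk address and file offset in bytes. */
struct map_sketch {
	uint64_t     addr;   /* disk address of the start of the mapping */
	uint64_t     offset; /* file offset of the start of the mapping */
	unsigned int flags;
};

/* Sector to submit I/O to for file position pos inside the mapping. */
uint64_t map_sector(const struct map_sketch *m, uint64_t pos)
{
	/* Zone Append always targets the zone start; the device picks the spot. */
	if (m->flags & MAP_F_ZONE_APPEND)
		return m->addr >> SKETCH_SECTOR_SHIFT;
	return (m->addr + pos - m->offset) >> SKETCH_SECTOR_SHIFT;
}
```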

iomap: reinstate IOMAP_F_ZONE_APPEND support

Add back the support for using Zone Append in the iomap direct I/O code
that was removed a while ago as we'll use it for the XFS zoned device
support.

This is essentially a revert of commit 8e81aa16a421 ("iomap: remove
IOMAP_F_ZONE_APPEND") with an additional comment describing the flag now
that all the other IOMAP_F_* flags have a nice description.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: wait for writeback before allocating new blocks

This means we are actually forced to allocate new delalloc space for the
new dirtier instead of reusing one that is currently being used for
writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: factor out a iomap_last_written_block helper

Split out a piece of logic from iomap_file_buffered_write_punch_delalloc
that is useful for all iomap_end implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>

TEMP: nvme-pci: disable async probe

This keeps getting my ZNS vs ZNS drivers reordered a bit and is annoying
for testing.

Not-really-signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: punch delalloc extents from the COW fork for COW writes

When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.

Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: set IOMAP_F_SHARED for all COW fork allocations

Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for
COW fork allocations even if they don't overlap existing data fork extents,
which will allow the iomap_end callback to detect if it has to punch
stale delalloc blocks from the COW fork instead of the data fork. It
also means we sample the sequence counter for both the data and the COW
fork when writing to the COW fork, which ensures we properly revalidate
when only COW fork changes happen.

This is essentially a revert of commit 72a048c1056a ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.

Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: share more code in xfs_buffered_write_iomap_begin

Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: support the COW fork in xfs_bmap_punch_delalloc_range

xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up. Prepare for that by adding support for the COW
fork in xfs_bmap_punch_delalloc_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

iomap: zeroing already holds invalidate_lock

All callers of iomap_zero_range already hold invalidate_lock, so we can't
take it again in iomap_file_buffered_write_punch_delalloc.

Use the passed in flags argument to detect if we're called from a zeroing
operation and don't take the lock again in this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof

xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that
is actually the right thing, as an error in the iomap zeroing code will
also take the invalidate_lock to clean up, but to fix that deadlock we
need a consistent locking pattern first.

The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read
pagefaults, which isn't really needed here, but also not actively
harmful.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: factor out a xfs_file_write_zero_eof helper

Split a helper from xfs_file_write_checks that just deals with the
post-EOF zeroing to keep the code readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: remove the iomap_file_buffered_write_punch_delalloc return value

iomap_file_buffered_write_punch_delalloc can only return errors if either
the ->punch callback returned an error, or if someone changed the API of
mapping_seek_hole_data to return a negative error code that is not
-ENXIO.

As the only instance of ->punch never returns an error, and such an error
would be fatal anyway, remove the entire error propagation and don't
return an error code from iomap_file_buffered_write_punch_delalloc.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

iomap: pass the iomap to the punch callback

XFS will need to look at the flags in the iomap structure, so pass it
down all the way to the callback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

iomap: pass flags to iomap_file_buffered_write_punch_delalloc

To fix short write error handling, we'll need to figure out what operation
iomap_file_buffered_write_punch_delalloc is called for. Pass the flags
argument on to it, and reorder the argument list to match that of
->iomap_end so that the compiler only has to add the new punch argument
to the end of it instead of reshuffling the registers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

iomap: improve shared block detection in iomap_unshare_iter

Currently iomap_unshare_iter relies on the IOMAP_F_SHARED flag to detect
blocks to unshare. This is reasonable, but IOMAP_F_SHARED is also useful
for the file system to do internal book keeping for out of place writes.
XFS used to do that, until it got removed in commit 72a048c1056a
("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write")
because unshare would incorrectly unshare such blocks.

Add an extra safeguard by checking the explicitly provided srcmap instead
of the fallback to the iomap for valid data, as that easily catches the
case where we'd just copy from the same place we'd write to. This allows
reinstating the setting of IOMAP_F_SHARED for all XFS writes that go to the
COW fork.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release

When a direct I/O completion invalidates the page cache it holds neither the
i_rwsem nor the invalidate_lock, so it can race with
iomap_write_delalloc_release. If the search for the end of the region that
contains data returns the start offset, we hit such a race and just need to
look for the end of the newly created hole instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc

Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
variant fails to drop into the lowmode last resort allocator, and
thus can sometimes fail allocations for which the caller has a
transaction block reservation.

Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc

xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in
xfs_bmap_btalloc. Switch to call it from xfs_bmap_btalloc after
doing the basic setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: don't ifdef around the exact minlen allocations

Exact minlen allocations only exist as an error injection tool for debug
builds. Currently this is implemented using ifdefs, which means the code
isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test
coverage by always building the code and use the compiler's dead code
elimination to remove it from the generated binary instead.

The only downside is that the new bitfield is unconditionally added to
struct xfs_alloc_args now.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate

Userdata and metadata allocations end up in the same allocation helpers.
Remove the separate xfs_bmap_alloc_userdata function to make this more
clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname

Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return
-ENOSPC both for an actual failure to allocate a disk block, but also
to signal the caller to convert the format of the attr fork. Use magic
1 to ask for the conversion here as well.

Note that unlike the similar issue in xfs_attr3_leaf_split, this one was
only found by code review.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split

xfs_attr3_leaf_split propagates the need for an extra btree split as
-ENOSPC to its only caller, but the same return value can also be
returned from xfs_da_grow_inode when it fails to find free space.

Distinguish the two cases by returning 1 for the extra split case instead
of overloading -ENOSPC.

This can be triggered relatively easily with the pending realtime group
support and a file system with a lot of small zones that use metadata
space on the main device. In this case about every 5th to 10th run of
xfs/538 runs into the following assert:

ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC);

in xfs_attr3_leaf_split caused by an allocation failure. Note that
the allocation failure is caused by another bug that will be fixed
subsequently, but this commit at least sorts out the error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
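
The return value convention from the commit above (0 for success, 1 for
"an extra split is needed", a negative errno for a real failure) can be
illustrated with a standalone sketch. The function names and boolean inputs
are invented for the example and do not model the actual attr code.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical split step.  Returns:
 *    0       the entry fit after the split
 *    1       the caller must perform an extra split
 *   -ENOSPC  a block could not be allocated (a genuine failure)
 */
int leaf_split_sketch(bool have_free_space, bool fits_after_split)
{
	if (!have_free_space)
		return -ENOSPC;
	if (!fits_after_split)
		return 1;
	return 0;
}

/* Caller: only negative values are errors; 1 selects the extra-split path. */
int add_name_sketch(bool have_free_space, bool fits_after_split,
		    bool *did_extra_split)
{
	int error = leaf_split_sketch(have_free_space, fits_after_split);

	*did_extra_split = false;
	if (error < 0)
		return error;
	if (error == 1)
		*did_extra_split = true;
	return 0;
}
```

With -ENOSPC overloaded for both cases, the caller in this sketch could not
tell an allocation failure from a request for more work.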

xfs: return bool from xfs_attr3_leaf_add

xfs_attr3_leaf_add only has two potential return values, indicating if the
entry could be added or not. Replace the errno return with a bool so that
ENOSPC from it can't easily be confused with a real ENOSPC.

Remove the return value from the xfs_attr3_leaf_add_work helper entirely,
as it always returns 0.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname

xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and
merging the two will simplify a following error handling fix.

To facilitate this move the remote block state save/restore helpers up in
the file so that they don't need forward declarations now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

block: don't use bio_split_rw on misc operations

bio_split_rw is designed to split read and write bios with a payload.
Currently it is called by __bio_split_to_limits for all operations not
explicitly listed, which works because bio_may_need_split explicitly checks
for bi_vcnt == 1 and thus skips the bypass if there is no payload, and the
bio_for_each_bvec loop will never execute its body if bi_size is 0.

But all this is hard to understand, fragile, and wastes cycles.
Switch __bio_split_to_limits to only call bio_split_rw for READ and
WRITE commands and don't attempt any kind of split for operations that
do not require splitting.

Signed-off-by: Christoph Hellwig <hch@lst.de>

block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limits

Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in
__bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND
bios do not adhere to the soft max_limits value but instead use their
own capped version of max_hw_sectors, leading to incorrect splits that
later blow up in bio_split.

We still need the bio_split_rw logic to count nr_segs for blk-mq code,
so add a new wrapper that passes in the right limit, and turns any bio
that would need a split into an error as an additional debugging aid.

Signed-off-by: Christoph Hellwig <hch@lst.de>

block: constify the lim argument to queue_limits_max_zone_append_sectors

queue_limits_max_zone_append_sectors doesn't change the lim argument,
so mark it as const.

Signed-off-by: Christoph Hellwig <hch@lst.de>

block: rework bio splitting

The current setup with bio_may_exceed_limit and __bio_split_to_limits
is a bit of a mess.

Change it so that __bio_split_to_limits does all the work and is just
a variant of bio_split_to_limits that returns nr_segs. This is done
by inlining it and instead having the various bio_split_* helpers directly
submit the potentially split bios.

To support btrfs, the rw version has a lower level helper split out
that just returns the offset to split. This turns out to nicely clean
up the btrfs flow as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: enable realtime reflink

Enable reflink for realtime devices, sort of.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: fix CoW forks for realtime files

Port the copy on write fork repair to realtime files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: check for shared rt extents when rebuilding rt file's data fork

When we're rebuilding the data fork of a realtime file, we need to
cross-reference each mapping with the rt refcount btree to ensure that
the reflink flag is set if there are any shared extents found.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: repair inodes that have a refcount btree in the data fork

Plumb knowledge of refcount btrees into the inode core repair code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: online repair of the realtime refcount btree

Port the data device's refcount btree repair code to the realtime
refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: capture realtime CoW staging extents when rebuilding rt rmapbt

Walk the realtime refcount btree to find the CoW staging extents when
we're rebuilding the realtime rmap btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: walk the rt reference count tree when rebuilding rmap

When we're rebuilding the data device rmap, if we encounter a "refcount"
format fork, we have to walk the (realtime) refcount btree inode to
build the appropriate mappings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: check new rtbitmap records against rt refcount btree

When we're rebuilding the realtime bitmap, check the proposed free
extents against the rt refcount btree to make sure we don't commit any
grievous errors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: don't flag quota rt block usage on rtreflink filesystems

Quota space usage is allowed to exceed the size of the physical storage
when reflink is enabled. Now that we have reflink for the realtime
volume, apply this same logic to the rtb repair logic.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: scrub the metadir path of rt refcount btree files

Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata
directory tree path to the refcount btree file for each rt group.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: detect and repair misaligned rtinherit directory cowextsize hints

If we encounter a directory that has been configured to pass on a CoW
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review and/or turn it off because that is a
misconfiguration.
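
The check described above can be sketched roughly as follows. The
toy_dir_hints struct and both function names are hypothetical and only
model the alignment rule; clearing the hint is one of the two options
the text mentions, chosen here for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a directory's inherited hints. */
struct toy_dir_hints {
	bool     rtinherit;  /* new files are created on the rt volume */
	uint32_t cowextsize; /* CoW extent size hint in fs blocks, 0 = unset */
};

/* A hint is misaligned if it is not an integer multiple of rextsize. */
static bool toy_cowextsize_misaligned(const struct toy_dir_hints *d,
				      uint32_t rextsize)
{
	if (!d->rtinherit || d->cowextsize == 0 || rextsize == 0)
		return false;
	return d->cowextsize % rextsize != 0;
}

/* Turn off a misaligned hint; returns true if the directory changed. */
static bool toy_repair_cowextsize(struct toy_dir_hints *d, uint32_t rextsize)
{
	if (!toy_cowextsize_misaligned(d, rextsize))
		return false;
	d->cowextsize = 0;
	return true;
}
```

With an rt extent size of 4 blocks, a hint of 6 would be flagged and
cleared, while a hint of 8 would be left alone.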

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: allow dquot rt block count to exceed rt blocks on reflink fs

Update the quota scrubber to allow dquots where the realtime block count
exceeds the block count of the rt volume if reflink is enabled.
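
A minimal sketch of the relaxed check, assuming the scrubber previously
flagged any dquot whose rt block count exceeded the volume size. The
function name and parameters are hypothetical; the rationale in the
comment (shared blocks being charged to each owner) is my reading of
why the limit no longer holds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the dquot's rt block count looks suspicious.
 * With reflink, a shared block may be charged to more than one owner,
 * so the total can legitimately exceed the size of the rt volume.
 */
static bool toy_dquot_rtb_suspect(uint64_t dq_rtbcount, uint64_t rt_blocks,
				  bool has_reflink)
{
	if (has_reflink)
		return false;
	return dq_rtbcount > rt_blocks;
}
```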

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: check reference counts of gaps between rt refcount records

If there's a gap between records in the rt refcount btree, we ought to
cross-reference the gap with the rtrmap records to make sure that there
aren't any overlapping records for a region that doesn't have any shared
ownership.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: allow overlapping rtrmapbt records for shared data extents

Allow overlapping realtime reverse mapping records if they both describe
shared data extents and the fs supports reflink on the realtime volume.
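
The rule above might be modelled like this. The toy_rmap struct and the
"shareable" flag are hypothetical simplifications of a reverse mapping
record, used only to show when an overlap is tolerated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified reverse mapping record. */
struct toy_rmap {
	uint64_t start;     /* first block of the mapping */
	uint64_t len;       /* length in blocks */
	bool     shareable; /* file data extent that reflink may share */
};

/* Two half-open ranges [start, start + len) overlap. */
static bool toy_rmap_overlaps(const struct toy_rmap *a,
			      const struct toy_rmap *b)
{
	return a->start < b->start + b->len && b->start < a->start + a->len;
}

/*
 * Overlap is acceptable only if both records describe shareable data
 * extents and reflink is supported on the realtime volume.
 */
static bool toy_rmap_pair_ok(const struct toy_rmap *a,
			     const struct toy_rmap *b, bool rt_reflink)
{
	if (!toy_rmap_overlaps(a, b))
		return true;
	return rt_reflink && a->shareable && b->shareable;
}
```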

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: cross-reference checks with the rt refcount btree

Use the realtime refcount btree to implement cross-reference checks in
other data structures.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: scrub the realtime refcount btree

Add code to scrub realtime refcount btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: report realtime refcount btree corruption errors to the health system

Whenever we encounter corrupt realtime refcount btree blocks, we should
report that to the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: check that the rtrefcount maxlevels doesn't increase when growing fs

The size of filesystem transaction reservations depends on the maximum
height (maxlevels) of the realtime btrees. Since we don't want a grow
operation to increase the reservation size enough that we'll fail the
minimum log size checks on the next mount, constrain growfs operations
if they would cause an increase in the rt refcount btree maxlevels.
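
To make the constraint concrete, here is a small sketch of how a btree
height can be derived from a worst-case record count, and how a grow
could be rejected when that height would increase. All names are
hypothetical, the per-block record counts are illustrative parameters,
and the -EINVAL return value is an assumption for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Height of a btree holding `records` records when each leaf holds at
 * least `leaf_recs` records and each node at least `node_ptrs` pointers.
 */
static unsigned int toy_btree_levels(uint64_t records, uint64_t leaf_recs,
				     uint64_t node_ptrs)
{
	uint64_t blocks;
	unsigned int levels = 1;

	assert(leaf_recs >= 1 && node_ptrs >= 2);

	/* Number of leaf blocks, rounded up. */
	blocks = (records + leaf_recs - 1) / leaf_recs;
	while (blocks > 1) {
		/* Each extra level indexes the blocks of the level below. */
		blocks = (blocks + node_ptrs - 1) / node_ptrs;
		levels++;
	}
	return levels;
}

/* Reject a grow that would make the tree taller than it can be today. */
static int toy_growfs_check(uint64_t old_records, uint64_t new_records,
			    uint64_t leaf_recs, uint64_t node_ptrs)
{
	if (toy_btree_levels(new_records, leaf_recs, node_ptrs) >
	    toy_btree_levels(old_records, leaf_recs, node_ptrs))
		return -EINVAL; /* assumed error code for this sketch */
	return 0;
}
```

With 4 records per leaf and 4 pointers per node, 16 records fit in two
levels but 17 need three, so growing from 16 to 17 worst-case records
would be refused in this model.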

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: enable extent size hints for CoW operations

Wire up the copy-on-write extent size hint for realtime files, and
connect it to the rt allocator so that we avoid fragmentation on rt
filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: apply rt extent alignment constraints to CoW extsize hint

The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint. Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.

Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files

Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files. This apparently was done to disable
delayed allocation for realtime files.

However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files. In this case, the
logic will incorrectly send the write through the delalloc write path.

Fix this by adjusting the logic slightly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: recover CoW leftovers in the realtime volume

Scan the realtime refcount tree at mount time to get rid of leftover
CoW staging extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: allow inodes to have the realtime and reflink flags

Now that we can share blocks between realtime files, allow this
combination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: enable sharing of realtime file blocks

Update the remapping routines to be able to handle realtime files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: enable CoW for realtime data

Update our write paths to support copy on write on the rt volume.  This
works in more or less the same way as it does on the data device, with
the major exception that we never do delalloc on the rt volume.

Because we consider unwritten CoW fork staging extents to be incore
quota reservation, we update xfs_quota_reserve_blkres to support this
case.  Though xfs doesn't allow rt and quota together, the change is
trivial and we shouldn't leave a logic bomb here.

While we're at it, add a missing xfs_mod_delalloc call when we remove
delalloc block reservation from the inode.  This is largely irrelevant
since realtime files do not use delalloc, but we want to avoid leaving
logic bombs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: refactor reflink quota updates

Hoist all quota updates for reflink into a helper function, since things
are about to become more complicated.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

xfs: compute rtrmap btree max levels when reflink enabled

Compute the maximum possible height of the realtime rmap btree when
reflink is enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>