www.infradead.org Git - users/hch/xfs.git/log
8 months ago xfs: fix the type for sb_rtreserved xfs-zoned-2024-12-09
Christoph Hellwig [Mon, 2 Dec 2024 02:29:57 +0000 (11:29 +0900)]
xfs: fix the type for sb_rtreserved

xfs_extlen_t is a 32-bit type, not matching the 64-bit on-disk
value.  Use the most fitting 64-bit type.

Noticed by the size checking macros on x86 which doesn't naturally
align 64-bit fields.
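
As a rough illustration of why the narrower type is a problem, the sketch
below stores a 64-bit on-disk value through a 32-bit and through a 64-bit
in-memory type.  The demo_* names are hypothetical stand-ins and not the
actual XFS typedefs or superblock fields.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real XFS definitions. */
typedef uint32_t demo_extlen_t;    /* 32 bits wide, like xfs_extlen_t */
typedef uint64_t demo_blockcnt_t;  /* wide enough for a 64-bit on-disk value */

/* Storing through the 32-bit type silently drops the upper 32 bits. */
static inline demo_extlen_t demo_store_narrow(uint64_t ondisk)
{
	return (demo_extlen_t)ondisk;
}

/* Storing through a 64-bit type preserves the full value. */
static inline demo_blockcnt_t demo_store_wide(uint64_t ondisk)
{
	return ondisk;
}
```

Any value above 4G blocks loses information in the narrow variant, which
is the kind of mismatch a size check on the structure can flag.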

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: don't issue discards when not supported
Christoph Hellwig [Fri, 29 Nov 2024 09:00:29 +0000 (10:00 +0100)]
xfs: don't issue discards when not supported

Fix reset on conventional zones when the device does not support
discard.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: more %u printing
Christoph Hellwig [Fri, 29 Nov 2024 07:42:55 +0000 (08:42 +0100)]
xfs: more %u printing

Preemptively, before Damien finds them all :)

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: print max_open_zones using %u
Christoph Hellwig [Fri, 29 Nov 2024 07:42:13 +0000 (08:42 +0100)]
xfs: print max_open_zones using %u

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: fix has_daddr_gaps comment
Christoph Hellwig [Fri, 29 Nov 2024 07:40:10 +0000 (08:40 +0100)]
xfs: fix has_daddr_gaps comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: improve the xfs_gc_bio comment
Christoph Hellwig [Fri, 29 Nov 2024 07:36:04 +0000 (08:36 +0100)]
xfs: improve the xfs_gc_bio comment

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: fix a comment typo in xfs_zoned_buffered_write_iomap_begin
Christoph Hellwig [Fri, 29 Nov 2024 07:29:27 +0000 (08:29 +0100)]
xfs: fix a comment typo in xfs_zoned_buffered_write_iomap_begin

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: explain the reflink vs GC issue a bit better
Christoph Hellwig [Fri, 29 Nov 2024 07:24:03 +0000 (08:24 +0100)]
xfs: explain the reflink vs GC issue a bit better

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: fix comments in xfs_init_zone
Christoph Hellwig [Fri, 29 Nov 2024 07:19:08 +0000 (08:19 +0100)]
xfs: fix comments in xfs_init_zone

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: better documentation for xfs_zoned_default_resblks
Christoph Hellwig [Fri, 29 Nov 2024 07:17:43 +0000 (08:17 +0100)]
xfs: better documentation for xfs_zoned_default_resblks

Explain the two counters in more detail, and use a switch statement to
make it more clear what counters are affected.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: improve a comment in xfs_zone_validate
Christoph Hellwig [Fri, 29 Nov 2024 07:10:56 +0000 (08:10 +0100)]
xfs: improve a comment in xfs_zone_validate

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: use %u for printing the rgno in xfs_zones.c
Christoph Hellwig [Fri, 29 Nov 2024 07:09:39 +0000 (08:09 +0100)]
xfs: use %u for printing the rgno in xfs_zones.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: document XFS_GC_CHUNK_SIZE
Christoph Hellwig [Fri, 29 Nov 2024 06:51:54 +0000 (07:51 +0100)]
xfs: document XFS_GC_CHUNK_SIZE

And use the SZ_1M helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: bypass the GC reservation queue for reserved allocations
Christoph Hellwig [Thu, 28 Nov 2024 16:26:00 +0000 (17:26 +0100)]
xfs: bypass the GC reservation queue for reserved allocations

Directly go to the counter for reserved blocks.  Otherwise a truncate
that needs to zero the last block can easily fail with ENOSPC when
other threads are waiting for GC.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: skip zoned without used blocks in xfs_zone_reclaim_pick
Christoph Hellwig [Fri, 29 Nov 2024 05:36:43 +0000 (06:36 +0100)]
xfs: skip zoned without used blocks in xfs_zone_reclaim_pick

These are just waiting for a zone reset.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago Revert "xfs: simplify GC scratch buf management"
Christoph Hellwig [Thu, 28 Nov 2024 17:58:50 +0000 (18:58 +0100)]
Revert "xfs: simplify GC scratch buf management"

This reverts commit 06a313a2085bef3cdb47cda229d025821f3e7fd8.

Something in the accounting was off, leading GC tests to occasionally not
finish.  Revert this for now until it can be done properly or we can come
up with an even better scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: remove rtxlen conversions in the zoned code
Christoph Hellwig [Thu, 28 Nov 2024 04:02:44 +0000 (05:02 +0100)]
xfs: remove rtxlen conversions in the zoned code

The zone allocator fundamentally can't support larger allocation sizes
because we don't support unwritten extents.  So don't bother with the
conversions and instead add a comment explaining that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: init the zone_alloc_ctx on-stack
Christoph Hellwig [Thu, 28 Nov 2024 03:54:15 +0000 (04:54 +0100)]
xfs: init the zone_alloc_ctx on-stack

Require the structure to be zeroed in the callers so that we can assert
that xfs_zoned_space_reserve is called exactly once for a context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: document struct xfs_zone_reservation
Christoph Hellwig [Thu, 28 Nov 2024 04:09:13 +0000 (05:09 +0100)]
xfs: document struct xfs_zone_reservation

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: tighten up the superblock verifier for zoned file systems
Christoph Hellwig [Thu, 28 Nov 2024 04:44:26 +0000 (05:44 +0100)]
xfs: tighten up the superblock verifier for zoned file systems

Check that rtextsize is 1, and sanity check the rtstart and rtreserved
values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: use fsblock units for sb_rtstart
Christoph Hellwig [Thu, 28 Nov 2024 06:38:14 +0000 (07:38 +0100)]
xfs: use fsblock units for sb_rtstart

Darrick was a little unhappy with the daddr, so convert to fsblocks
instead.  For the kernel this is only a bit annoying in fsmap,
and mkfs becomes a little more hacky, but overall this doesn't make
much of a difference while removing the need to validate that the
value is fsblock aligned.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: document the GC data structures
Christoph Hellwig [Wed, 27 Nov 2024 16:20:00 +0000 (17:20 +0100)]
xfs: document the GC data structures

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: document the struct xfs_open_zone fields
Christoph Hellwig [Wed, 27 Nov 2024 16:12:46 +0000 (17:12 +0100)]
xfs: document the struct xfs_open_zone fields

Add a few comments explaining the fields in struct xfs_open_zone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: document the xfs_zone_info fields
Christoph Hellwig [Wed, 27 Nov 2024 16:05:41 +0000 (17:05 +0100)]
xfs: document the xfs_zone_info fields

And move them around a bit to keep related fields together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: simplify GC scratch buf management
Christoph Hellwig [Wed, 27 Nov 2024 15:17:46 +0000 (16:17 +0100)]
xfs: simplify GC scratch buf management

Now that the GC chunks are processed in order, there isn't really any need
for the bank switching, and we can have a simple ring buffer with head and
tail pointers.  This allows allocating only a single 1MB folio instead of
two.
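
A minimal sketch of the head/tail idea, assuming chunks complete strictly
in order.  The struct and function names are hypothetical and this is not
the kernel code (which this series later reverted); a chunk that crosses
the end of the buffer is simply treated modulo the size here, whereas real
code would have to split it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical byte-granular ring over one scratch buffer. */
struct demo_scratch_ring {
	size_t size;  /* total bytes in the scratch buffer */
	size_t head;  /* offset where the next chunk is placed */
	size_t tail;  /* offset of the oldest chunk still in flight */
	size_t used;  /* bytes currently in flight */
};

/* Claim len bytes at the head; returns false if they do not fit. */
static bool demo_ring_claim(struct demo_scratch_ring *r, size_t len,
			    size_t *off)
{
	if (len > r->size - r->used)
		return false;
	*off = r->head;
	r->head = (r->head + len) % r->size;
	r->used += len;
	return true;
}

/* Release the oldest len bytes; only valid for in-order completion. */
static void demo_ring_release(struct demo_scratch_ring *r, size_t len)
{
	r->tail = (r->tail + len) % r->size;
	r->used -= len;
}
```

Keeping an explicit used count avoids the usual ambiguity between a full
and an empty ring when head equals tail.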

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: reduce the scope for the rtg variable in xfs_mount_zones
Christoph Hellwig [Wed, 27 Nov 2024 09:30:08 +0000 (10:30 +0100)]
xfs: reduce the scope for the rtg variable in xfs_mount_zones

Only needed in the conventional device branch, move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: split out a xfs_calc_open_zones helper
Christoph Hellwig [Wed, 27 Nov 2024 09:29:31 +0000 (10:29 +0100)]
xfs: split out a xfs_calc_open_zones helper

Move the code to calculate the number of open zones out of
xfs_mount_zones into its own helper.  The flow also changes a bit to be
more clear, but it should not change behavior.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: move the zone specific fields out of struct xfs_mount
Christoph Hellwig [Wed, 27 Nov 2024 07:34:00 +0000 (08:34 +0100)]
xfs: move the zone specific fields out of struct xfs_mount

Split them into a dynamically allocated xfs_zone_info structure similar
to the quotainfo one, which is pointed to by the mount structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add casts to xfs_zoned_default_resblks
Christoph Hellwig [Wed, 27 Nov 2024 08:02:26 +0000 (09:02 +0100)]
xfs: add casts to xfs_zoned_default_resblks

Ensure the return value doesn't overflow unsigned long for those poor
souls using giant file systems on 32-bit systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: cleanup the freecounter abstraction a bit
Christoph Hellwig [Wed, 27 Nov 2024 07:36:43 +0000 (08:36 +0100)]
xfs: cleanup the freecounter abstraction a bit

Give the enum a name, give its values prefixes, and add a
xfs_set_freecounter helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: sb_rtstart is in sectors
Christoph Hellwig [Wed, 27 Nov 2024 07:26:52 +0000 (08:26 +0100)]
xfs: sb_rtstart is in sectors

So use xfs_daddr_t for it in the in-memory superblock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: remove xfs_rtglock_zoned_adjust
Christoph Hellwig [Wed, 27 Nov 2024 07:16:46 +0000 (08:16 +0100)]
xfs: remove xfs_rtglock_zoned_adjust

Just skip locking the bitmap and summary inodes for zoned file systems,
but still require the rmap flag to be explicitly set.  Except for the
extfree_item that was just fixed, nothing relied on it anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: refactor xfs_rtextent_free_finish_item
Christoph Hellwig [Wed, 27 Nov 2024 07:08:39 +0000 (08:08 +0100)]
xfs: refactor xfs_rtextent_free_finish_item

Refactor the code so that it does the proper rmap locking for the
zoned case instead of relying on xfs_rtglock_zoned_adjust which
is about to go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: drop the IOMAP_UNSHARE check in xfs_zoned_buffered_write_iomap_begin
Christoph Hellwig [Wed, 27 Nov 2024 05:12:19 +0000 (06:12 +0100)]
xfs: drop the IOMAP_UNSHARE check in xfs_zoned_buffered_write_iomap_begin

We already assert that it isn't set at the start of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: fix spelling in xfs_zones.h
Christoph Hellwig [Wed, 27 Nov 2024 04:55:03 +0000 (05:55 +0100)]
xfs: fix spelling in xfs_zones.h

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: fix the type of wp_fsb in xfs_zone_validate_wp
Christoph Hellwig [Wed, 27 Nov 2024 04:48:21 +0000 (05:48 +0100)]
xfs: fix the type of wp_fsb in xfs_zone_validate_wp

wp_fsb isn't an offset into a file, but a raw FSB-unit block number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: fix the message logged in xfs_rt_check_size
Christoph Hellwig [Wed, 27 Nov 2024 04:30:47 +0000 (05:30 +0100)]
xfs: fix the message logged in xfs_rt_check_size

As pointed out by Darrick on the list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago iomap: add include guards to internal.h
Christoph Hellwig [Wed, 27 Nov 2024 04:25:23 +0000 (05:25 +0100)]
iomap: add include guards to internal.h

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago iomap: fix a iomap.h typo
Christoph Hellwig [Wed, 27 Nov 2024 04:22:18 +0000 (05:22 +0100)]
iomap: fix a iomap.h typo

s/ppend/append/

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add a comment for XFS_SB_FEAT_INCOMPAT_ZONE_GAPS
Christoph Hellwig [Tue, 26 Nov 2024 10:08:36 +0000 (11:08 +0100)]
xfs: add a comment for XFS_SB_FEAT_INCOMPAT_ZONE_GAPS

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: export zone stats in /proc/*/mountstats
Hans Holmberg [Sun, 17 Nov 2024 06:22:06 +0000 (07:22 +0100)]
xfs: export zone stats in /proc/*/mountstats

Add the per-zone life time hint and the used block distribution
for fully written zones, grouping reclaimable zones in fixed-percentage
buckets spanning 0..9%, 10..19% and full zones as 100% used as well as a
few statistics about the zone allocator and open and reclaimable zones
in /proc/*/mountstats.

This gives good insight into data fragmentation and data placement
success rate.
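
The bucketing described above can be sketched as follows.  This is an
illustration only: the function name, the bucket count constant and the
use of raw block counts are assumptions, not the code that produces the
mountstats output.

```c
#include <stdint.h>

#define DEMO_NR_BUCKETS 10  /* 0..9%, 10..19%, ..., 90..99% */

/*
 * Map a zone's used block count to a 10%-wide bucket index.
 * Returns DEMO_NR_BUCKETS for a completely full zone, which is
 * counted separately as 100% used.
 */
static unsigned int demo_used_bucket(uint64_t used, uint64_t capacity)
{
	if (used >= capacity)
		return DEMO_NR_BUCKETS;
	return (unsigned int)(used * DEMO_NR_BUCKETS / capacity);
}
```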

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: wire up the show_stats super operation
Christoph Hellwig [Sun, 17 Nov 2024 05:24:19 +0000 (06:24 +0100)]
xfs: wire up the show_stats super operation

The show_stats operation allows a file system to dump plain text
statistics on a per-mount basis into /proc/*/mountstats.  Wire up a no-op
version which will grow useful information for zoned file systems later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: support write life time based data placement
Hans Holmberg [Sun, 24 Nov 2024 13:36:55 +0000 (14:36 +0100)]
xfs: support write life time based data placement

Add a file write life time data placement allocation scheme that aims to
minimize fragmentation and thereby to do two things:

 a) separate file data into different zones when possible.
 b) colocate file data of similar life times when feasible.

To get best results, average file sizes should align with the zone
capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl.

For RocksDB using leveled compaction, the lifetime hints can improve
throughput for overwrite workloads at 80% file system utilization by
~10%.

Lifetime hints can be disabled using the nolifetime mount option.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add a max_open_zones mount option
Christoph Hellwig [Sun, 17 Nov 2024 07:05:16 +0000 (08:05 +0100)]
xfs: add a max_open_zones mount option

Allow limiting the number of open zones used below that exported by the
device.  This is required to tune the number of write streams when zoned
RT devices are used on conventional devices, and can be useful on zoned
devices that support a very large number of open zones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: support zone gaps
Christoph Hellwig [Sun, 17 Nov 2024 08:07:41 +0000 (09:07 +0100)]
xfs: support zone gaps

Zoned devices can have gaps between the usable capacity of a zone and the
end of the zone in the LBA/daddr address space.  In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us.  In this case the sparse FSB/RTB address space maps 1:1
to the device address space.
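
To make the two layouts concrete, here is a small sketch of the address
translation in units of file system blocks.  The struct, its field names
and the dense packing rule for devices without gaps are assumptions made
for illustration; only the 1:1 case is stated in the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry description; not an XFS structure. */
struct demo_geom {
	uint32_t rg_blklog;  /* log2 of the power-of-2 group stride */
	uint64_t rg_blocks;  /* usable blocks per group (zone capacity) */
	bool     has_gaps;   /* the device leaves gaps after the capacity */
};

/* Translate a sparse RT block number into a block offset on the device. */
static uint64_t demo_rtb_to_devblock(const struct demo_geom *g,
				     uint64_t rtbno)
{
	uint64_t rgno = rtbno >> g->rg_blklog;
	uint64_t rgbno = rtbno & ((UINT64_C(1) << g->rg_blklog) - 1);

	if (g->has_gaps)
		return rtbno;  /* sparse address space maps 1:1 */
	return rgno * g->rg_blocks + rgbno;  /* groups packed back to back */
}
```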

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: enable the zoned RT device feature
Christoph Hellwig [Sun, 17 Nov 2024 07:36:47 +0000 (08:36 +0100)]
xfs: enable the zoned RT device feature

Enable the zoned RT device directory feature.  With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks.  This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: disable rt quotas for zoned file systems
Christoph Hellwig [Tue, 22 Oct 2024 12:16:44 +0000 (14:16 +0200)]
xfs: disable rt quotas for zoned file systems

They'll need a little more work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: disable reflink for zoned file systems
Christoph Hellwig [Sun, 17 Nov 2024 09:28:33 +0000 (10:28 +0100)]
xfs: disable reflink for zoned file systems

While the zoned on-disk format supports reflinks, the GC code currently
always unshares reflinks when moving blocks to new zones, thus making the
feature unusable.  Disable reflinks until the GC code is refcount aware.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: enabled fsmap reporting for internal RT devices
Christoph Hellwig [Wed, 13 Nov 2024 05:51:55 +0000 (06:51 +0100)]
xfs: enabled fsmap reporting for internal RT devices

File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs.  To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: support xrep_require_rtext_inuse on zoned file systems
Christoph Hellwig [Mon, 22 Jul 2024 13:31:28 +0000 (06:31 -0700)]
xfs: support xrep_require_rtext_inuse on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: support xchk_xref_is_used_rt_space on zoned file systems
Christoph Hellwig [Sun, 17 Nov 2024 06:35:44 +0000 (07:35 +0100)]
xfs: support xchk_xref_is_used_rt_space on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: allow COW forks on zoned file systems in xchk_bmap
Christoph Hellwig [Fri, 10 May 2024 06:51:02 +0000 (08:51 +0200)]
xfs: allow COW forks on zoned file systems in xchk_bmap

Zoned file systems can have COW forks even without reflinks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: support growfs on zoned file systems
Christoph Hellwig [Sun, 17 Nov 2024 09:27:24 +0000 (10:27 +0100)]
xfs: support growfs on zoned file systems

Replace the inner loop growing one RT bitmap block at a time with
one just modifying the superblock counters for growing an entire
zone (aka RTG).  The big restriction is that, just like at mkfs time,
only an RT extent size of a single FSB is allowed, and the file system
capacity needs to be aligned to the zone size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: hide reserved RT blocks from statfs
Christoph Hellwig [Sun, 17 Nov 2024 07:06:55 +0000 (08:06 +0100)]
xfs: hide reserved RT blocks from statfs

File systems with a zoned RT device have a large number of reserved
blocks that are required for garbage collection, and which can't be
filled with user data.  Exclude them from the available blocks reported
through stat(v)fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item
Christoph Hellwig [Sun, 17 Nov 2024 07:19:22 +0000 (08:19 +0100)]
xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item

Make xfs_rtextent_free_finish_item call into the zoned allocator to free
blocks on zoned RT devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: implement direct writes to zoned RT devices
Christoph Hellwig [Sun, 17 Nov 2024 07:25:26 +0000 (08:25 +0100)]
xfs: implement direct writes to zoned RT devices

Direct writes to zoned RT devices are extremely simple.  After taking the
block reservation before acquiring the iolock, the iomap direct I/O
calls into ->iomap_begin which will return a fake iomap allowing writes
up to the entire requested range.  The actual block allocation is then done
from the submit_io handler using code shared with the buffered I/O path.

The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize
the embedded ioend, which allows reusing the existing ioend based buffered
I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: implement buffered writes to zoned RT devices
Christoph Hellwig [Sun, 24 Nov 2024 12:49:53 +0000 (13:49 +0100)]
xfs: implement buffered writes to zoned RT devices

Implement buffered writes including page faults and block zeroing for
zoned RT devices.  Buffered writes to zoned RT devices are split into
three phases:

 1) a reservation for the worst case data block usage is taken before
    acquiring the iolock.  When not enough space is available this kicks
    off garbage collection, and when there still is not enough space
    available the block reservation is reduced to the amount of space
    available, which will force a short write.
 2) with the iolock held, the generic iomap buffered write code is
    called, which through the iomap_begin operation usually just inserts
    delalloc extents for the range in a single iteration.  Only for
    overwrites of existing data that are not block aligned, or for
    zeroing operations, is the existing extent mapping read to fill out
    the srcmap and to figure out if zeroing is required.
 3) the ->map_blocks callback to the generic iomap writeback code
    calls into the zoned space allocator to actually allocate on-disk
    space for the range before kicking off the writeback.
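
The reduction of the reservation in phase 1 can be sketched like this.
It is a simplified, single-threaded illustration with a hypothetical
function name; waking and waiting for garbage collection is not modelled.

```c
#include <stdint.h>

/*
 * Take up to "want" blocks from the pool of available blocks.  Returns
 * the number of blocks actually reserved.  A result smaller than "want"
 * corresponds to the reduced reservation that forces a short write.
 */
static uint64_t demo_reserve_blocks(uint64_t *available, uint64_t want)
{
	uint64_t got = want < *available ? want : *available;

	*available -= got;
	return got;
}
```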

For block zeroing from truncate, ->setattr is called with the iolock
(aka i_rwsem) already held, so a hacky deviation from the above
scheme is needed.  In this case the space reservation is called with
the iolock held, but is required not to block and can dip into the
reserved block pool.  This can lead to -ENOSPC when truncating a
file, which is unfortunate.  But fixing the calling conventions in
the VFS is probably much easier with code requiring it already in
mainline.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: implement zoned garbage collection
Christoph Hellwig [Sun, 24 Nov 2024 13:06:42 +0000 (14:06 +0100)]
xfs: implement zoned garbage collection

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To empty zones, the rmap is walked to find the
owners and the data is read and then written to the new place.
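
As an illustration of the "least used zone" policy, a victim could be
picked as in the sketch below.  The names are hypothetical, and skipping
zones without used blocks follows the later "skip zoned without used
blocks" change in this log rather than this commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-zone state; only the used block count matters here. */
struct demo_zone {
	uint64_t used;  /* blocks still referenced by files */
};

/* Return the index of the least used zone that still holds data, or -1. */
static int demo_pick_victim(const struct demo_zone *zones, size_t nr)
{
	int victim = -1;

	for (size_t i = 0; i < nr; i++) {
		if (zones[i].used == 0)
			continue;  /* already empty: only needs a reset */
		if (victim < 0 || zones[i].used < zones[victim].used)
			victim = (int)i;
	}
	return victim;
}
```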

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed from the one recorded at the start of
the GC cycle and doesn't update the mapping if it has changed.
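
The completion-time check amounts to a compare-then-update step, roughly
as sketched here.  The structure and helpers are hypothetical and ignore
locking; in the kernel the comparison would have to happen under the
appropriate inode locks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified extent mapping. */
struct demo_mapping {
	uint64_t file_offset;  /* logical offset in the file, in blocks */
	uint64_t start_block;  /* physical start block */
	uint64_t len;          /* length in blocks */
};

static bool demo_mapping_unchanged(const struct demo_mapping *recorded,
				   const struct demo_mapping *current)
{
	return recorded->file_offset == current->file_offset &&
	       recorded->start_block == current->start_block &&
	       recorded->len == current->len;
}

/*
 * GC completion: point the file at the relocated blocks only if nobody
 * changed the mapping since GC recorded it.  Returns true if remapped.
 */
static bool demo_gc_end_io(const struct demo_mapping *recorded,
			   struct demo_mapping *current, uint64_t new_start)
{
	if (!demo_mapping_unchanged(recorded, current))
		return false;  /* raced with a write; drop the GC copy */
	current->start_block = new_start;
	return true;
}
```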

Note: selection of garbage collection victims is extremely simple at the
moment and will probably see additional near term improvements.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add the zoned space allocator
Christoph Hellwig [Sun, 17 Nov 2024 08:57:30 +0000 (09:57 +0100)]
xfs: add the zoned space allocator

For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.

The actual allocation algorithm is very simple and just involves
picking a good zone, preferably the one used for the last write to the
inode.  Because the number of zones that can be written to at the same
time is often limited by the hardware, this is done as late as possible
from the iomap dio and buffered writeback bio submission helpers.
Because the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup.  Because zoned file systems don't have a
rsum cache the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme
used for that is very similar to the reflink COW end I/O path.
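
Allocation at the write pointer can be pictured with the following
sketch.  The structure and function are hypothetical simplifications
without locking, and the recording of the allocation at I/O completion
is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical open zone state. */
struct demo_open_zone {
	uint64_t capacity;       /* writable blocks in the zone */
	uint64_t write_pointer;  /* next block to be written */
};

/*
 * Hand out up to "want" blocks starting at the write pointer.  Returns
 * false if the zone is full; otherwise *start and *len describe the
 * range, which may be shorter than requested.
 */
static bool demo_zone_alloc(struct demo_open_zone *oz, uint64_t want,
			    uint64_t *start, uint64_t *len)
{
	uint64_t left = oz->capacity - oz->write_pointer;

	if (left == 0)
		return false;
	*start = oz->write_pointer;
	*len = want < left ? want : left;
	oz->write_pointer += *len;
	return true;
}
```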

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add support for zoned space reservations
Christoph Hellwig [Sun, 17 Nov 2024 05:45:45 +0000 (06:45 +0100)]
xfs: add support for zoned space reservations

For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock held
can deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available they wake GC and wait for it.  This is
done using a list of on-stack reservations to ensure fairness.
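
The fairness aspect can be illustrated with a single-threaded model of
the waiter list.  The names and the array representation are assumptions
for the example; the real code uses a linked list of on-stack
reservations and sleeping waiters.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queued reservation, in arrival order. */
struct demo_reservation {
	uint64_t want;     /* blocks the waiter needs */
	bool     granted;  /* set once the blocks have been handed over */
};

/*
 * Hand out newly available blocks strictly in arrival order.  Stopping
 * at the first waiter that cannot be satisfied keeps small late requests
 * from starving a large early one.  Returns the blocks left over.
 */
static uint64_t demo_wake_reservations(struct demo_reservation *queue,
				       size_t nr, uint64_t available)
{
	for (size_t i = 0; i < nr; i++) {
		if (queue[i].granted)
			continue;
		if (queue[i].want > available)
			break;
		available -= queue[i].want;
		queue[i].granted = true;
	}
	return available;
}
```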

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add support for parsing and validating blk_zone structures
Christoph Hellwig [Sun, 17 Nov 2024 05:28:10 +0000 (06:28 +0100)]
xfs: add support for parsing and validating blk_zone structures

Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file
Christoph Hellwig [Sun, 17 Nov 2024 07:02:04 +0000 (08:02 +0100)]
xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file

The zoned allocator never performs speculative preallocations, so don't
bother queueing up zoned inodes here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes
Christoph Hellwig [Thu, 21 Nov 2024 07:50:23 +0000 (08:50 +0100)]
xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes

There are no EOF blocks, so avoid the pointless roundtrip through the
ilock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: disable FITRIM for zoned RT devices
Christoph Hellwig [Sun, 17 Nov 2024 05:23:16 +0000 (06:23 +0100)]
xfs: disable FITRIM for zoned RT devices

The zoned allocator unconditionally issues zone resets or discards after
emptying an entire zone, so supporting FITRIM for a zoned RT device is
not useful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: disable sb_frextents for zoned file systems
Christoph Hellwig [Sun, 12 May 2024 05:39:45 +0000 (07:39 +0200)]
xfs: disable sb_frextents for zoned file systems

Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information.  Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: export zoned geometry via XFS_FSOP_GEOM
Christoph Hellwig [Thu, 24 Oct 2024 08:57:41 +0000 (10:57 +0200)]
xfs: export zoned geometry via XFS_FSOP_GEOM

Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: don't allow growfs of the data device with internal RT device
Christoph Hellwig [Fri, 22 Nov 2024 06:26:43 +0000 (07:26 +0100)]
xfs: don't allow growfs of the data device with internal RT device

Because the RT blocks follow right after.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: allow internal RT devices for zoned mode
Christoph Hellwig [Sun, 17 Nov 2024 07:53:10 +0000 (08:53 +0100)]
xfs: allow internal RT devices for zoned mode

Allow creating an RT subvolume on the same device as the main data
device.  This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: define the zoned on-disk format
Christoph Hellwig [Sun, 17 Nov 2024 09:23:18 +0000 (10:23 +0100)]
xfs: define the zoned on-disk format

Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are a few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, and thus the sb_bmblocks
   superblock field must be cleared to zero

2) there is a new superblock field that specifies the start of an
   internal RT section.  This allows supporting SMR HDDs that have random
   writable space at the beginning which is used for the XFS data device
   (which really is the metadata device for this configuration), directly
   followed by a RT device on the same block device.  While something
   similar could be achieved using dm-linear, just having a single device
   directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
   overprovisioning) that is never used for user capacity, but allows GC
   to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that I
   can persistently track the total amount of bytes currently used in
   a RT group.  There is no data structure other than the rmap that
   tracks used space in an RT group, and this counter is used to decide
   when a RT group has been entirely emptied, and to select one that
   is relatively empty if garbage collection needs to be performed.
   While this counter could be tracked entirely in memory and rebuilt
   from the rmap at mount time, that would lead to very long mount
   times with the large number of RT groups required by the typical
   hardware zone size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: add a xfs_rtrmap_first_unwritten_rgbno helper
Christoph Hellwig [Sun, 17 Nov 2024 04:48:45 +0000 (05:48 +0100)]
xfs: add a xfs_rtrmap_first_unwritten_rgbno helper

Add a helper to find the last offset mapped in the rtrmap.  This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
Christoph Hellwig [Tue, 10 Sep 2024 04:58:17 +0000 (07:58 +0300)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation.  To support that pass the
bflags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
Christoph Hellwig [Fri, 27 Oct 2023 07:58:24 +0000 (09:58 +0200)]
xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write

For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.
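
A userspace approximation of such a per-segment check is sketched below.
It only looks at segment lengths and uses a hypothetical function name;
the kernel operates on an iov_iter and the exact conditions it checks may
differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Return true if every segment length is a multiple of block_size, so
 * that no segment boundary falls inside a file system block.
 * block_size must be a power of two.
 */
static bool demo_segments_block_aligned(const struct iovec *iov, int nr,
					size_t block_size)
{
	size_t bits = 0;

	for (int i = 0; i < nr; i++)
		bits |= iov[i].iov_len;
	return (bits & (block_size - 1)) == 0;
}
```

Note that checking only the total length is not enough: two segments of
512 and 3584 bytes add up to one 4096 byte block but still split it.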

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months ago xfs: report the correct dio alignment for COW inodes
Christoph Hellwig [Tue, 13 Aug 2024 06:22:40 +0000 (08:22 +0200)]
xfs: report the correct dio alignment for COW inodes

For I/O to reflinked blocks we always need to write an entire new
file system block, and the code enforces the file system block alignment
for the entire file if it has any reflinked blocks.

Unfortunately the reported dio alignment can only report a single value
for reads and writes, so unless we want to trigger these
read-modify-write cycles all the time, we need to increase both limits.

Without this, zoned xfs triggers the warnings about failed page cache
invalidation in kiocb_invalidate_post_direct_write all the time when
running generic/551 on a 512 byte sector device, and eventually fails
the test due to miscompares.

Hopefully we can add a separate read alignment to statx eventually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: skip always_cow inodes in xfs_reflink_trim_around_shared
Christoph Hellwig [Sat, 14 Oct 2023 05:50:35 +0000 (07:50 +0200)]
xfs: skip always_cow inodes in xfs_reflink_trim_around_shared

xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree.  Always_cow inodes don't have that tree, so don't
bother.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: preserve RT reservations across remounts
Hans Holmberg [Tue, 9 Jan 2024 18:01:15 +0000 (19:01 +0100)]
xfs: preserve RT reservations across remounts

Introduce a reservation setting for rt devices so that zoned GC
reservations are preserved over remount ro/rw cycles.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Thu, 10 Oct 2024 05:27:50 +0000 (07:27 +0200)]
xfs: generalize the freespace and reserved blocks handling

The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in and special cased.

Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.

Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_count helpers
directly.  This also switches the flooring of the frextents counter
to 0 in statfs for the rtinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.
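
As a rough sketch of the idea, the counters can be kept in arrays indexed
by an enum and read through one helper.  All names below are hypothetical
and plain integers stand in for the per-cpu counters; this is not the
actual XFS code.

```c
#include <stdint.h>

/* Hypothetical names, chosen for illustration only. */
enum example_free_counter {
	EXAMPLE_FC_BLOCKS,	/* free data device blocks */
	EXAMPLE_FC_RTEXTENTS,	/* free RT extents */
	EXAMPLE_FC_NR,
};

struct example_counters {
	int64_t		free[EXAMPLE_FC_NR];	 /* may transiently go negative */
	uint64_t	reserved[EXAMPLE_FC_NR]; /* reserved pool per counter */
};

/* Single accessor instead of poking at individual counters. */
static int64_t example_free_read(const struct example_counters *c,
		enum example_free_counter ctr)
{
	return c->free[ctr];
}

/* Value to report from statfs: floored at zero. */
static uint64_t example_free_for_statfs(const struct example_counters *c,
		enum example_free_counter ctr)
{
	int64_t v = example_free_read(c, ctr);

	return v > 0 ? (uint64_t)v : 0;
}
```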

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c
Christoph Hellwig [Sun, 17 Nov 2024 09:22:50 +0000 (10:22 +0100)]
xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c

Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.  Move it to xfs_iomap.c
closer to the two callers.

Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: cleanup xfs_getfsmap_rtdev_rmapbt_query
Christoph Hellwig [Thu, 14 Nov 2024 07:54:15 +0000 (08:54 +0100)]
xfs: cleanup xfs_getfsmap_rtdev_rmapbt_query

Move the last entry case out of xfs_getfsmap_rtdev_rmapbt_query into
the caller that actually needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: add a rtg_refcount helper
Christoph Hellwig [Sat, 16 Nov 2024 05:48:40 +0000 (06:48 +0100)]
xfs: add a rtg_refcount helper

Shortcut the long expression to find the refcount inode from the rtg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: add a rtg_rmap helper
Christoph Hellwig [Sat, 16 Nov 2024 05:48:09 +0000 (06:48 +0100)]
xfs: add a rtg_rmap helper

Shortcut the long expression to find the rmap inode from the rtg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: add a rtg_summary helper
Christoph Hellwig [Sat, 16 Nov 2024 05:47:34 +0000 (06:47 +0100)]
xfs: add a rtg_summary helper

Shortcut the long expression to find the summary inode from the rtg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: add a rtg_bitmap helper
Christoph Hellwig [Sun, 17 Nov 2024 04:11:56 +0000 (05:11 +0100)]
xfs: add a rtg_bitmap helper

Shortcut the long expression to find the bitmap inode from the rtg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: add a rtg_blocks helper
Christoph Hellwig [Sun, 17 Nov 2024 04:11:38 +0000 (05:11 +0100)]
xfs: add a rtg_blocks helper

Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay
Christoph Hellwig [Sun, 17 Nov 2024 09:15:09 +0000 (10:15 +0100)]
xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay

xfs_bmap_add_extent_hole_delay works entirely on delalloc extents, for
which xfs_bmap_same_rtgroup doesn't make sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: factor out a xfs_rt_check_size helper
Christoph Hellwig [Tue, 30 Jul 2024 23:42:42 +0000 (16:42 -0700)]
xfs: factor out a xfs_rt_check_size helper

Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs.  This also adds the mount
time overflow check to growfs and improves the error messages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: constify feature checks
Christoph Hellwig [Tue, 5 Nov 2024 07:39:56 +0000 (08:39 +0100)]
xfs: constify feature checks

We'll need to call them on a const structure in growfs in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoxfs: simplify sector number calculation in xfs_zero_extent
Christoph Hellwig [Tue, 30 Jul 2024 23:15:43 +0000 (16:15 -0700)]
xfs: simplify sector number calculation in xfs_zero_extent

xfs_zero_extent does some really odd gymnastics to calculate the block
layer sector numbers passed to blkdev_issue_zeroout.  This is because it
used to call sb_issue_zeroout and the calculations in that helper got
open coded here in the rather misleadingly named commit 3dc29161070a
("dax: use sb_issue_zerout instead of calling dax_clear_sectors").
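
For reference, converting a file system block number into 512 byte block
layer sectors comes down to a single shift.  The sketch below uses
hypothetical names and assumes the block size is a power of two of at
least 512 bytes; it is not the XFS code itself.

```c
#include <stdint.h>

/* 512 byte block layer sectors; the macro name is illustrative only. */
#define EXAMPLE_SECTOR_SHIFT	9

/*
 * Convert a file system block number to a 512 byte sector number.
 * blocklog is log2 of the file system block size and must be >= 9.
 */
static uint64_t example_fsb_to_sector(uint64_t fsbno, unsigned int blocklog)
{
	return fsbno << (blocklog - EXAMPLE_SECTOR_SHIFT);
}
```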

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: pass private data to iomap_truncate_page
Christoph Hellwig [Fri, 16 Aug 2024 16:49:13 +0000 (18:49 +0200)]
iomap: pass private data to iomap_truncate_page

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
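
The pattern is the usual opaque context pointer.  A minimal standalone
sketch with made-up types (these are not the real iomap structures or
signatures) could look like this:

```c
#include <stddef.h>

/* Made-up stand-in for an iterator that carries a private pointer. */
struct example_iter {
	long long	pos;
	void		*private;	/* owned and interpreted by the caller */
};

typedef int (*example_begin_fn)(struct example_iter *iter);

/* The helper only forwards the private pointer, it never looks at it. */
static int example_truncate_page(long long pos, example_begin_fn begin,
		void *private)
{
	struct example_iter iter = { .pos = pos, .private = private };

	return begin(&iter);
}

/* File system side: hypothetical per-call context. */
struct example_ctx {
	int		calls;
	long long	last_pos;
};

static int example_begin(struct example_iter *iter)
{
	struct example_ctx *ctx = iter->private;

	ctx->calls++;
	ctx->last_pos = iter->pos;
	return 0;
}
```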

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: pass private data to iomap_zero_range
Christoph Hellwig [Fri, 16 Aug 2024 16:48:16 +0000 (18:48 +0200)]
iomap: pass private data to iomap_zero_range

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: pass private data to iomap_page_mkwrite
Christoph Hellwig [Tue, 10 Sep 2024 04:57:21 +0000 (07:57 +0300)]
iomap: pass private data to iomap_page_mkwrite

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: optionally use ioends for direct I/O
Christoph Hellwig [Sun, 24 Nov 2024 13:00:00 +0000 (14:00 +0100)]
iomap: optionally use ioends for direct I/O

struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.

For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.

So instead of reinventing the wheel, reuse the existing infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: split bios to zone append limits in the submission handlers
Christoph Hellwig [Sun, 24 Nov 2024 12:54:37 +0000 (13:54 +0100)]
iomap: split bios to zone append limits in the submission handlers

Provide helpers for file systems to split bios in the direct I/O and
writeback I/O submission handlers.

This follows btrfs' lead and doesn't try to build bios to hardware limits
for zone append commands, but instead build them as normal unconstrained
bios and split them to the hardware limits in the I/O submission handler.
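
Conceptually the split is just chopping one large I/O into pieces no
larger than the device limit.  The following standalone sketch shows that
with hypothetical types and names; the real helpers operate on struct bio
and are not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one piece of a split I/O, in sectors. */
struct example_chunk {
	uint64_t	start;
	uint64_t	len;
};

/*
 * Split [start, start + len) into pieces of at most max_len sectors.
 * Returns the number of pieces stored in out, at most max_chunks.
 */
static size_t example_split_io(uint64_t start, uint64_t len, uint64_t max_len,
		struct example_chunk *out, size_t max_chunks)
{
	size_t n = 0;

	if (max_len == 0)
		return 0;

	while (len > 0 && n < max_chunks) {
		uint64_t this_len = len < max_len ? len : max_len;

		out[n].start = start;
		out[n].len = this_len;
		n++;
		start += this_len;
		len -= this_len;
	}
	return n;
}
```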

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: add a IOMAP_F_ZONE_APPEND flag
Christoph Hellwig [Sun, 5 Nov 2023 05:40:52 +0000 (06:40 +0100)]
iomap: add a IOMAP_F_ZONE_APPEND flag

This doesn't do much - just always returns the start block number for each
iomap instead of increasing it.  This is because we'll keep building bios
unconstrained by the hardware limits and just split them in the file system
submission handler.

Maybe we should find another name for it, because it might be useful for
btrfs compressed bio submissions as well, but I can't come up with a
good one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: simplify io_flags and io_type in struct iomap_ioend
Christoph Hellwig [Sun, 24 Nov 2024 12:53:36 +0000 (13:53 +0100)]
iomap: simplify io_flags and io_type in struct iomap_ioend

The ioend fields for distinct types of I/O are a bit complicated.
Consolidate them into a single io_flag field with its own flags
decoupled from the iomap flags.  This also prepares for adding a new
flag that is unrelated to both of the iomap namespaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: allow the file system to submit the writeback bios
Christoph Hellwig [Tue, 5 Nov 2024 07:33:10 +0000 (08:33 +0100)]
iomap: allow the file system to submit the writeback bios

Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio.  This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs.  To make this easier also pass the writeback
context to the method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoiomap: drop an obsolete comment in iomap_dio_bio_iter
Christoph Hellwig [Tue, 5 Nov 2024 08:25:28 +0000 (09:25 +0100)]
iomap: drop an obsolete comment in iomap_dio_bio_iter

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agovirtio_blk: reverse request order in virtio_queue_rqs
Christoph Hellwig [Wed, 13 Nov 2024 15:20:42 +0000 (16:20 +0100)]
virtio_blk: reverse request order in virtio_queue_rqs

blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern
especially in rotational devices. Fix this by rewriting virtio_queue_rqs
so that it always pops the requests from the passed in request list, and
then adds them to the head of a local submit list. This actually
simplifies the code a bit as it removes the complicated list splicing,
at the cost of extra updates of the rq_next pointer. As that should be
cache hot anyway it should be an easy price to pay.
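
The pop-and-push-to-head approach reverses a singly linked list, which is
what restores the original submission order.  A standalone sketch with a
made-up request type (not the blk-mq structures) follows:

```c
#include <stddef.h>

/* Made-up request type with a singly linked next pointer. */
struct example_rq {
	int			tag;
	struct example_rq	*next;
};

/* Remove and return the first request of the list, or NULL if empty. */
static struct example_rq *example_pop(struct example_rq **list)
{
	struct example_rq *rq = *list;

	if (rq) {
		*list = rq->next;
		rq->next = NULL;
	}
	return rq;
}

/*
 * Pop every request off the passed in list and add it to the head of a
 * local submit list.  The result is the input in reverse order.
 */
static struct example_rq *example_reverse(struct example_rq **list)
{
	struct example_rq *submit = NULL, *rq;

	while ((rq = example_pop(list)) != NULL) {
		rq->next = submit;
		submit = rq;
	}
	return submit;
}
```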

Fixes: 0e9911fa768f ("virtio-blk: support mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 months agonvme-pci: reverse request order in nvme_queue_rqs
Christoph Hellwig [Wed, 13 Nov 2024 15:20:41 +0000 (16:20 +0100)]
nvme-pci: reverse request order in nvme_queue_rqs

blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices.  Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list.  This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer.  As that should be cache hot
anyway it should be an easy price to pay.

Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoblock: take chunk_sectors into account in bio_split_write_zeroes
Christoph Hellwig [Sat, 2 Nov 2024 06:04:18 +0000 (07:04 +0100)]
block: take chunk_sectors into account in bio_split_write_zeroes

For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors.  For other uses like the
internally RAIDed NVMe devices it is probably at least useful.

Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes.  Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.
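
The boundary rule can be illustrated with a small standalone function.
The names are hypothetical and this is not the block layer's
get_max_io_size; it only shows capping an I/O so that it does not cross
the next chunk_sectors boundary.

```c
#include <stdint.h>

/*
 * Return how many sectors an I/O starting at sector may cover: at most
 * limit, and never across the next chunk boundary.  A chunk_sectors of
 * zero means there is no boundary to respect.
 */
static uint64_t example_max_sectors(uint64_t sector, uint64_t limit,
		uint64_t chunk_sectors)
{
	uint64_t to_boundary;

	if (chunk_sectors == 0)
		return limit;

	to_boundary = chunk_sectors - (sector % chunk_sectors);
	return to_boundary < limit ? to_boundary : limit;
}
```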

Signed-off-by: Christoph Hellwig <hch@lst.de>
8 months agoblock: lift bio_is_zone_append to bio.h
Christoph Hellwig [Thu, 31 Oct 2024 14:09:05 +0000 (15:09 +0100)]
block: lift bio_is_zone_append to bio.h

Make bio_is_zone_append globally available, because file systems need
to use it to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.

Signed-off-by: Christoph Hellwig <hch@lst.de>