Christoph Hellwig [Thu, 28 Nov 2024 16:26:00 +0000 (17:26 +0100)]
xfs: bypass the GC reservation queue for reserved allocations
Directly go to the counter for reserved blocks. Otherwise a truncate
that needs to zero the last block can easily fail with ENOSPC when
other threads are waiting for GC.
Something in the accounting was off, leading GC tests to occasionally not
finish. Revert this for now until it can be done properly or we can come
up with an even better scheme.
Christoph Hellwig [Thu, 28 Nov 2024 04:02:44 +0000 (05:02 +0100)]
xfs: remove rtxlen conversions in the zoned code
The zone allocator fundamentally can't support allocation sizes larger
than a single file system block because we don't support unwritten
extents. So don't bother with the rtxlen conversions and instead add a
comment explaining that.
Christoph Hellwig [Thu, 28 Nov 2024 06:38:14 +0000 (07:38 +0100)]
xfs: use fsblock units for sb_rtstart
Darrick was a little unhappy with the daddr, so convert to fsblocks
instead. For the kernel this is only a bit annoying in fsmap,
and mkfs becomes a little more hacky, but overall this doesn't make
much of a difference while removing the need to validate that the
value is fsblock aligned.
Christoph Hellwig [Wed, 27 Nov 2024 15:17:46 +0000 (16:17 +0100)]
xfs: simplify GC scratch buf management
Now that the GC chunks are processed in order, there isn't really any need
for the bank switching, and we can have a simple ring buffer with head and
tail pointers. This allows allocating only a single 1MB folio instead of
two.
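As a rough illustration of the head/tail scheme (not the actual XFS code;
all names, sizes and the wrap handling below are invented for this
description):

/*
 * Rough sketch of a head/tail ring buffer over a single scratch buffer.
 * Purely illustrative: names, sizes and the wrap handling are made up and
 * do not match the XFS GC code.
 */
#include <stdint.h>
#include <string.h>

#define GC_SCRATCH_SIZE	(1024 * 1024)	/* a single 1MB scratch buffer */

struct gc_scratch {
	uint8_t  buf[GC_SCRATCH_SIZE];
	uint32_t head;			/* next offset to fill with read data */
	uint32_t tail;			/* next offset to drain by writing */
};

/* Space left for staging more GC read data. */
static uint32_t gc_scratch_space(const struct gc_scratch *s)
{
	return GC_SCRATCH_SIZE - (s->head - s->tail);
}

/* Stage @len bytes of read data; the caller checked gc_scratch_space(). */
static void gc_scratch_fill(struct gc_scratch *s, const void *data, uint32_t len)
{
	uint32_t off = s->head % GC_SCRATCH_SIZE;
	uint32_t first = len < GC_SCRATCH_SIZE - off ? len : GC_SCRATCH_SIZE - off;

	memcpy(s->buf + off, data, first);
	memcpy(s->buf, (const uint8_t *)data + first, len - first);
	s->head += len;
}

/* Retire @len bytes once they have been written to their new zone. */
static void gc_scratch_drain(struct gc_scratch *s, uint32_t len)
{
	s->tail += len;
}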
Christoph Hellwig [Wed, 27 Nov 2024 09:29:31 +0000 (10:29 +0100)]
xfs: split out a xfs_calc_open_zones helper
Move the code to calculate the number of open zones out of
xfs_mount_zones into its own helper. The flow also changes a bit to be
more clear, but it should not change behavior.
Christoph Hellwig [Wed, 27 Nov 2024 07:16:46 +0000 (08:16 +0100)]
xfs: remove xfs_rtglock_zoned_adjust
Just skip locking the bitmap and summary inodes for zoned file systems,
but still require the rmap flag to be explicitly set. Except for the
extfree_item case just fixed, nothing relied on it anymore.
Hans Holmberg [Sun, 17 Nov 2024 06:22:06 +0000 (07:22 +0100)]
xfs: export zone stats in /proc/*/mountstats
Add the per-zone life time hint and the used block distribution for
fully written zones to /proc/*/mountstats, grouping reclaimable zones
into fixed-percentage buckets (0..9%, 10..19%, and so on, with completely
full zones counted as 100% used), as well as a few statistics about the
zone allocator and the number of open and reclaimable zones.
This gives good insight into data fragmentation and data placement
success rate.
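As a rough illustration of the bucketing (the helper below is invented
for this description and does not match the kernel code), fully written
zones could be classified like this:

#include <stdint.h>

/* Buckets 0..9 cover 0..9%, 10..19%, ... 90..99%; bucket 10 means 100% used. */
static unsigned int zone_used_bucket(uint64_t used_blocks, uint64_t zone_blocks)
{
	if (used_blocks >= zone_blocks)
		return 10;	/* completely full zone, reported as 100% used */
	return (unsigned int)(used_blocks * 10 / zone_blocks);
}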
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 17 Nov 2024 05:24:19 +0000 (06:24 +0100)]
xfs: wire up the show_stats super operation
The show_stats super operation allows a file system to dump plain text
statistics on a per-mount basis into /proc/*/mountstats. Wire up a no-op version
which will grow useful information for zoned file systems later.
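The hook being wired up is the existing show_stats member of struct
super_operations; a minimal no-op version could look roughly like the
sketch below (the function name is a guess, only the super_operations
member and its signature come from the existing VFS interface):

#include <linux/fs.h>
#include <linux/seq_file.h>

static int xfs_fs_show_stats(struct seq_file *m, struct dentry *root)
{
	/* Zoned file system statistics will be emitted here later. */
	return 0;
}

static const struct super_operations xfs_super_operations = {
	/* ... existing operations ... */
	.show_stats		= xfs_fs_show_stats,
};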
Hans Holmberg [Sun, 24 Nov 2024 13:36:55 +0000 (14:36 +0100)]
xfs: support write life time based data placement
Add a file write life time based data placement allocation scheme that
aims to minimize fragmentation by doing two things:
a) separate file data into different zones when possible.
b) colocate file data of similar life times when feasible.
To get best results, average file sizes should align with the zone
capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl.
For RocksDB using leveled compaction, the lifetime hints improve
throughput for overwrite workloads at 80% file system utilization
by ~10%.
Lifetime hints can be disabled using the nolifetime mount option.
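The hints consumed by this scheme are the standard per-file write life
time hints; as a usage example (plain userspace code, independent of this
patch), an application can tag a file like this:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		(1024 + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	3
#endif

int main(int argc, char **argv)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("F_SET_RW_HINT");
		return 1;
	}
	/* Subsequent writes through fd carry the SHORT life time hint. */
	close(fd);
	return 0;
}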
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 17 Nov 2024 07:05:16 +0000 (08:05 +0100)]
xfs: add a max_open_zones mount option
Allow limiting the number of open zones used to a value below that exported by the
device. This is required to tune the number of write streams when zoned
RT devices are used on conventional devices, and can be useful on zoned
devices that support a very large number of open zones.
Christoph Hellwig [Sun, 17 Nov 2024 08:07:41 +0000 (09:07 +0100)]
xfs: support zone gaps
Zoned devices can have gaps between the usable capacity of a zone and the
end of the zone in the LBA/daddr address space. In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us. In this case the sparse FSB/RTB address space maps 1:1
to the device address space.
Christoph Hellwig [Sun, 17 Nov 2024 07:36:47 +0000 (08:36 +0100)]
xfs: enable the zoned RT device feature
Enable the zoned RT device feature. With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks. This perfectly maps to zoned devices, but can also be
used on conventional block devices.
Christoph Hellwig [Sun, 17 Nov 2024 09:28:33 +0000 (10:28 +0100)]
xfs: disable reflink for zoned file systems
While the zoned on-disk format supports reflinks, the GC code currently
always unshares reflinks when moving blocks to new zones, thus making the
feature unusable. Disable reflinks until the GC code is refcount aware.
Christoph Hellwig [Wed, 13 Nov 2024 05:51:55 +0000 (06:51 +0100)]
xfs: enable fsmap reporting for internal RT devices
File systems with internal RT devices are a bit odd in that we need
to report both AGs and RGs. To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.
The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.
Christoph Hellwig [Mon, 22 Jul 2024 13:31:28 +0000 (06:31 -0700)]
xfs: support xrep_require_rtext_inuse on zoned file systems
Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
Christoph Hellwig [Sun, 17 Nov 2024 06:35:44 +0000 (07:35 +0100)]
xfs: support xchk_xref_is_used_rt_space on zoned file systems
Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
Christoph Hellwig [Sun, 17 Nov 2024 09:27:24 +0000 (10:27 +0100)]
xfs: support growfs on zoned file systems
Replace the inner loop growing one RT bitmap block at a time with
one just modifying the superblock counters for growing an entire
zone (aka RTG). The big restriction is that, just like at mkfs time, only
a RT extent size of a single FSB is allowed, and the file system
capacity needs to be aligned to the zone size.
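A simplified sketch of the growfs restriction described above (all
structure and field names below are invented for illustration):

#include <stdint.h>

struct zoned_rt_geom {
	uint64_t zone_blocks;	/* fsblocks per zone (== per RT group) */
	uint64_t rt_blocks;	/* current RT blocks */
	uint64_t rt_groups;	/* current number of RT groups */
};

static int grow_zoned_rt(struct zoned_rt_geom *g, uint64_t new_rt_blocks)
{
	if (new_rt_blocks <= g->rt_blocks)
		return -1;			/* shrinking is not supported */
	if (new_rt_blocks % g->zone_blocks)
		return -1;			/* must be zone size aligned */

	/* No per-block bitmap to grow: just update the counters. */
	g->rt_groups = new_rt_blocks / g->zone_blocks;
	g->rt_blocks = new_rt_blocks;
	return 0;
}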
Christoph Hellwig [Sun, 17 Nov 2024 07:06:55 +0000 (08:06 +0100)]
xfs: hide reserved RT blocks from statfs
File systems with a zoned RT device have a large number of reserved
blocks that are required for garbage collection, and which can't be
filled with user data. Exclude them from the available blocks reported
through stat(v)fs.
Christoph Hellwig [Sun, 17 Nov 2024 07:25:26 +0000 (08:25 +0100)]
xfs: implement direct writes to zoned RT devices
Direct writes to zoned RT devices are extremely simple. After taking the
block reservation before acquiring the iolock, the iomap direct I/O
calls into ->iomap_begin which will return a fake iomap allowing writes
up to the entire requested range. The actual block allocation is then done
from the submit_io handler using code shared with the buffered I/O path.
The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize
the embedded ioend, which allows reusing the existing ioend based buffered
I/O completion path.
Christoph Hellwig [Sun, 24 Nov 2024 12:49:53 +0000 (13:49 +0100)]
xfs: implement buffered writes to zoned RT devices
Implement buffered writes including page faults and block zeroing for
zoned RT devices. Buffered writes to zoned RT devices are split into
three phases:
1) a reservation for the worst case data block usage is taken before
acquiring the iolock. When not enough space is available this kicks
off garbage collection, and if there still is not enough space
available the block reservation is reduced to the amount of space
that is available, which will force a short write (see the sketch
after this list)
2) with the iolock held, the generic iomap buffered write code is
called, which through the iomap_begin operation usually just inserts
delalloc extents for the range in a single iteration. Only for
overwrites of existing data that are not block aligned, or for zeroing
operations, is the existing extent mapping read to fill out the srcmap
and to figure out if zeroing is required.
3) the ->map_blocks callback to the generic iomap writeback code
calls into the zoned space allocator to actually allocate on-disk
space for the range before kicking off the writeback.
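A rough sketch of the phase 1 fallback, reduced to a single counter.  All
names are invented and the locking and wakeup details of the real code
are omitted:

#include <stdint.h>

struct zoned_space {
	uint64_t available;	/* blocks that can be written without GC */
};

/* Stand-in for kicking garbage collection and waiting for progress. */
static void kick_and_wait_for_gc(struct zoned_space *zs)
{
	(void)zs;
}

/*
 * Reserve up to @wanted blocks before taking the iolock.  Returns the
 * number of blocks actually reserved; anything less than @wanted forces a
 * short write later on.
 */
static uint64_t reserve_zoned_blocks(struct zoned_space *zs, uint64_t wanted)
{
	if (zs->available < wanted) {
		kick_and_wait_for_gc(zs);
		if (zs->available < wanted)
			wanted = zs->available;		/* short write */
	}
	zs->available -= wanted;
	return wanted;
}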
For block zeroing from truncate, ->setattr is called with the iolock
(aka i_rwsem) already held, so a hacky deviation from the above
scheme is needed. In this case the space reservation is done with
the iolock held, but is required not to block and can dip into the
reserved block pool. This can lead to -ENOSPC when truncating a
file, which is unfortunate. But fixing the calling conventions in
the VFS is probably much easier with code requiring it already in
mainline.
Christoph Hellwig [Sun, 24 Nov 2024 13:06:42 +0000 (14:06 +0100)]
xfs: implement zoned garbage collection
RT groups on a zoned file system need to be completely empty before their
space can be reused. This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.
Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied. To empty zones, the rmap is walked to find the
owners and the data is read and then written to the new place.
To automatically defragment files the rmap records are sorted by inode
and logical offset. This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
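A minimal sketch of the sort described above, using an invented record
layout and plain qsort rather than the kernel's sorting helpers:

#include <stdint.h>
#include <stdlib.h>

struct gc_rmap_rec {
	uint64_t ino;		/* owning inode number */
	uint64_t offset;	/* logical file offset */
	uint64_t rgbno;		/* physical block in the victim zone */
	uint64_t len;		/* length in blocks */
};

static int gc_rmap_cmp(const void *a, const void *b)
{
	const struct gc_rmap_rec *ra = a, *rb = b;

	if (ra->ino != rb->ino)
		return ra->ino < rb->ino ? -1 : 1;
	if (ra->offset != rb->offset)
		return ra->offset < rb->offset ? -1 : 1;
	return 0;
}

static void gc_sort_records(struct gc_rmap_rec *recs, size_t nr)
{
	qsort(recs, nr, sizeof(*recs), gc_rmap_cmp);
}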
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O. Instead the I/O completion handler checks
that the mapping hasn't changed from the one recorded at the start of
the GC cycle and doesn't update the mapping if it has.
Note: selection of garbage collection victims is extremely simple at the
moment and will probably see additional near term improvements.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 17 Nov 2024 08:57:30 +0000 (09:57 +0100)]
xfs: add the zoned space allocator
For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block, and is only recorded on I/O
completion.
Because of that, the actual allocation algorithm is very simple and just
involves picking a good zone - preferably the one used for the last
write to the inode. Because the number of zones that can be written to at
the same time is often limited by the hardware, this is done as late as
possible from the iomap dio and buffered writeback bio submission
helpers. Because the writers already took a reservation before
acquiring the iolock, space will always be readily available if an
open zone slot is available. A new structure is used to track
these open zones, and is pointed to by the xfs_rtgroup. Because
zoned file systems don't have a rsum cache the space for that pointer
can be reused.
Allocations are only recorded at I/O completion time. The scheme
used for that is very similar to the reflink COW end I/O path.
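A minimal sketch of the zone selection described above - prefer the zone
used for the inode's last write if it still has room, otherwise fall back
to any open zone with space.  All structures below are invented for
illustration:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct open_zone {
	uint64_t write_pointer;	/* next block to write in this zone */
	uint64_t capacity;	/* usable blocks in this zone */
};

struct open_zone_set {
	struct open_zone *zones;
	unsigned int nr_open;
};

static bool zone_has_room(const struct open_zone *oz, uint64_t count)
{
	return oz->capacity - oz->write_pointer >= count;
}

static struct open_zone *pick_zone(struct open_zone_set *set,
		struct open_zone *last_used, uint64_t count)
{
	unsigned int i;

	/* Keep data of one inode together if the last zone still has room. */
	if (last_used && zone_has_room(last_used, count))
		return last_used;

	for (i = 0; i < set->nr_open; i++)
		if (zone_has_room(&set->zones[i], count))
			return &set->zones[i];

	return NULL;	/* the caller has to open a new zone or wait */
}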
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 17 Nov 2024 05:45:45 +0000 (06:45 +0100)]
xfs: add support for zoned space reservations
For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers. This means waiting for garbage collection with the iolock can
deadlock.
To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection. The new helpers try to take these available blocks, and
if there aren't enough available they wake and wait for GC. This is
done using a list of on-stack reservations to ensure fairness.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 17 Nov 2024 05:28:10 +0000 (06:28 +0100)]
xfs: add support for parsing and validating blk_zone structures
Add support to validate and parse reported hardware zone state.
Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 17 Nov 2024 05:23:16 +0000 (06:23 +0100)]
xfs: disable FITRIM for zoned RT devices
The zoned allocator unconditionally issues zone resets or discards after
emptying an entire zone, so supporting FITRIM for a zoned RT device is
not useful.
Christoph Hellwig [Sun, 12 May 2024 05:39:45 +0000 (07:39 +0200)]
xfs: disable sb_frextents for zoned file systems
Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information. Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.
Christoph Hellwig [Sun, 17 Nov 2024 07:53:10 +0000 (08:53 +0100)]
xfs: allow internal RT devices for zoned mode
Allow creating an RT subvolume on the same device as the main data
device. This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.
Christoph Hellwig [Sun, 17 Nov 2024 09:23:18 +0000 (10:23 +0100)]
xfs: define the zoned on-disk format
Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data). There are a few minor but important changes, which are indicated
by a new incompat flag:
1) there are no bitmap and summary inodes, and thus the sb_rbmblocks
superblock field must be cleared to zero
2) there is a new superblock field that specifies the start of an
internal RT section. This allows supporting SMR HDDs that have random
writable space at the beginning which is used for the XFS data device
(which really is the metadata device for this configuration), directly
followed by a RT device on the same block device. While something
similar could be achieved using dm-linear, having a single device
directly consumed by XFS makes handling the file system a lot easier.
3) Another superblock field that tracks the amount of reserved space (or
overprovisioning) that is never used for user capacity, but allows GC
to run more smoothly.
4) an overlay of the cowextsize field for the rtrmap inode so that I
can persistently track the total amount of bytes currently used in
a RT group. There is no data structure other than the rmap that
tracks used space in an RT group, and this counter is used to decide
when a RT group has been entirely emptied, and to select one that
is relatively empty if garbage collection needs to be performed.
While this counter could be tracked entirely in memory and rebuilt
from the rmap at mount time, that would lead to very long mount
times with the large number of RT groups required by the typical
hardware zone size.
Christoph Hellwig [Sun, 17 Nov 2024 04:48:45 +0000 (05:48 +0100)]
xfs: add a xfs_rtrmap_first_unwritten_rgbno helper
Add a helper to find the last offset mapped in the rtrmap. This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.
Christoph Hellwig [Tue, 10 Sep 2024 04:58:17 +0000 (07:58 +0300)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.
Christoph Hellwig [Fri, 27 Oct 2023 07:58:24 +0000 (09:58 +0200)]
xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.
Christoph Hellwig [Tue, 13 Aug 2024 06:22:40 +0000 (08:22 +0200)]
xfs: report the correct dio alignment for COW inodes
For I/O to reflinked blocks we always need to write an entire new
file system block, and the code enforces the file system block alignment
for the entire file if it has any reflinked blocks.
Unfortunately the reported dio alignment can only report a single value
for reads and writes, so unless we want to trigger these read-modify
write cycles all the time, we need to increase both limits.
Without this zoned xfs triggers the warnings about failed page cache
invalidation in kiocb_invalidate_post_direct_write all the time when
running generic/551 on a 512 byte sector device, and eventually fails
the test due to miscompares.
Hopefully we can add a separate read alignment to statx eventually.
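For reference, applications can query the advertised limits through the
existing statx STATX_DIOALIGN interface (the example assumes kernel and
libc headers new enough to define STATX_DIOALIGN and the stx_dio_*
fields):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2 ||
	    statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx) != 0) {
		perror("statx");
		return 1;
	}
	printf("dio memory alignment: %u\n", stx.stx_dio_mem_align);
	printf("dio offset alignment: %u\n", stx.stx_dio_offset_align);
	return 0;
}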
Christoph Hellwig [Thu, 10 Oct 2024 05:27:50 +0000 (07:27 +0200)]
xfs: generalize the freespace and reserved blocks handling
The main handling of the incore per-cpu freespace counters is already
done in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter to operate on is passed in and special cased.
Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.
Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_counter helpers
directly. This also switches the flooring of the frextents counter
to 0 in statfs for the rthinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.
Christoph Hellwig [Sun, 17 Nov 2024 09:22:50 +0000 (10:22 +0100)]
xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c
Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs. Move it to xfs_iomap.c
next to its two callers.
Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.
Christoph Hellwig [Tue, 30 Jul 2024 23:42:42 +0000 (16:42 -0700)]
xfs: factor out a xfs_rt_check_size helper
Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs. This also adds the mount
time overflow check to growfs and improves the error messages.
Christoph Hellwig [Tue, 30 Jul 2024 23:15:43 +0000 (16:15 -0700)]
xfs: simplify sector number calculation in xfs_zero_extent
xfs_zero_extent does some really odd gymnastics to calculate the block
layer sector numbers passed to blkdev_issue_zeroout. This is because it
used to call sb_issue_zeroout and the calculations in that helper got
open coded here in the rather misleadingly named commit 3dc29161070a
("dax: use sb_issue_zerout instead of calling dax_clear_sectors").
Christoph Hellwig [Fri, 16 Aug 2024 16:49:13 +0000 (18:49 +0200)]
iomap: pass private data to iomap_truncate_page
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Fri, 16 Aug 2024 16:48:16 +0000 (18:48 +0200)]
iomap: pass private data to iomap_zero_range
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Tue, 10 Sep 2024 04:57:21 +0000 (07:57 +0300)]
iomap: pass private data to iomap_page_mkwrite
Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.
Christoph Hellwig [Sun, 24 Nov 2024 13:00:00 +0000 (14:00 +0100)]
iomap: optionally use ioends for direct I/O
struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.
For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.
So instead of reinventing the wheel, reuse the existing infrastructure.
Christoph Hellwig [Sun, 24 Nov 2024 12:54:37 +0000 (13:54 +0100)]
iomap: split bios to zone append limits in the submission handlers
Provide helpers for file systems to split bios in the direct I/O and
writeback I/O submission handlers.
This follows btrfs' lead and doesn't try to build bios to hardware limits
for zone append commands, but instead builds them as normal unconstrained
bios and splits them to the hardware limits in the I/O submission handler.
Christoph Hellwig [Sun, 5 Nov 2023 05:40:52 +0000 (06:40 +0100)]
iomap: add a IOMAP_F_ZONE_APPEND flag
This doesn't do much - it just always returns the start block number for
each iomap instead of increasing it. This is because we'll keep building
bios unconstrained by the hardware limits and just split them in the file
system submission handler.
Maybe we should find another name for it, because it might be useful for
btrfs compressed bio submissions as well, but I can't come up with a
good one.
Christoph Hellwig [Sun, 24 Nov 2024 12:53:36 +0000 (13:53 +0100)]
iomap: simplify io_flags and io_type in struct iomap_ioend
The ioend fields for distinct types of I/O are a bit complicated.
Consolidate them into a single io_flag field with its own flags
decoupled from the iomap flags. This also prepares for adding a new
flag that is unrelated to both of the iomap namespaces.
Christoph Hellwig [Tue, 5 Nov 2024 07:33:10 +0000 (08:33 +0100)]
iomap: allow the file system to submit the writeback bios
Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio. This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs. To make this easier also pass the writeback
context to the method.
Christoph Hellwig [Wed, 13 Nov 2024 15:20:42 +0000 (16:20 +0100)]
virtio_blk: reverse request order in virtio_queue_rqs
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern
especially in rotational devices. Fix this by rewriting virtio_queue_rqs
so that it always pops the requests from the passed in request list, and
then adds them to the head of a local submit list. This actually
simplifies the code a bit as it removes the complicated list splicing,
at the cost of extra updates of the rq_next pointer. As that should be
cache hot anyway it should be an easy price to pay.
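The reordering itself is plain singly linked list manipulation: the
incoming plug list is already in reverse submission order, so popping
from its head and pushing each request onto the head of a local list
restores the original order.  A generic sketch (not the actual block
layer rq_list helpers):

#include <stddef.h>

struct rq {
	struct rq *next;
	int tag;
};

/* Pop the first entry of a singly linked list. */
static struct rq *rq_pop(struct rq **list)
{
	struct rq *rq = *list;

	if (rq)
		*list = rq->next;
	return rq;
}

/* Push an entry onto the head of a singly linked list. */
static void rq_push_head(struct rq **list, struct rq *rq)
{
	rq->next = *list;
	*list = rq;
}

/* Build a submit list in original submission order from the reversed input. */
static struct rq *build_submit_list(struct rq **reversed)
{
	struct rq *submit = NULL;
	struct rq *rq;

	while ((rq = rq_pop(reversed)))
		rq_push_head(&submit, rq);
	return submit;
}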
Christoph Hellwig [Wed, 13 Nov 2024 15:20:41 +0000 (16:20 +0100)]
nvme-pci: reverse request order in nvme_queue_rqs
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list. This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer. As that should be cache hot
anyway it should be an easy price to pay.
Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sat, 2 Nov 2024 06:04:18 +0000 (07:04 +0100)]
block: take chunk_sectors into account in bio_split_write_zeroes
For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors. For other uses, like the
internally RAIDed NVMe devices, it is probably at least useful.
Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes. Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.
Christoph Hellwig [Thu, 31 Oct 2024 14:09:05 +0000 (15:09 +0100)]
block: lift bio_is_zone_append to bio.h
Make bio_is_zone_append globally available, because file systems need
to use it to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.
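For reference, such a helper boils down to a check along the following
lines (a simplified sketch from memory, not copied from the tree; see
include/linux/bio.h for the authoritative definition):

#include <linux/bio.h>

static inline bool bio_is_zone_append_sketch(struct bio *bio)
{
	if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED))
		return false;
	/* Real zone append, or a write the block layer emulates as one. */
	return bio_op(bio) == REQ_OP_ZONE_APPEND ||
	       bio_flagged(bio, BIO_EMULATES_ZONE_APPEND);
}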