www.infradead.org Git - users/hch/xfs.git/log
users/hch/xfs.git
5 months ago  xfs: support write stream separation  [xfs-zoned-streams]
Christoph Hellwig [Fri, 1 Nov 2024 04:51:08 +0000 (05:51 +0100)]
xfs: support write stream separation

Allow picking a write stream ID per "active zone" equivalent on
conventional devices.  The only complicated part is stealing yet
another time stamp on the rmap inode to store the write stream
ID so we can restart after a remount without de-synchronizing the
software write pointer and the hardware equivalent.  Due to the
lack of a block layer API to query or resync our write pointer
this still can happen on power fail or a kernel crash
unfortunately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  nvme-multipath: set BLK_FEAT_PLACEMENT_HINTS in nvme_mpath_alloc_disk
Christoph Hellwig [Sat, 2 Nov 2024 09:30:07 +0000 (10:30 +0100)]
nvme-multipath: set BLK_FEAT_PLACEMENT_HINTS in nvme_mpath_alloc_disk

Otherwise shared namespaces will never set the feature.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  scsi: set permanent stream count in block limits
Keith Busch [Tue, 29 Oct 2024 15:19:22 +0000 (08:19 -0700)]
scsi: set permanent stream count in block limits

The block limits export the number of write hints, so set this limit if
the device reports support for the lifetime hints. Not only does this
inform the user of which hints are possible, it also allows scsi devices
supporting the feature to utilize the full range through raw block
device direct-io.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  nvme: enable FDP support
Kanchan Joshi [Tue, 29 Oct 2024 15:19:21 +0000 (08:19 -0700)]
nvme: enable FDP support

Flexible Data Placement (FDP), as ratified in TP 4146a, allows the host
to control the placement of logical blocks so as to reduce the SSD WAF.
Userspace can send the write hint information using io_uring or fcntl.

Fetch the placement-identifiers if the device supports FDP. The incoming
write-hint is mapped to a placement-identifier, which in turn is set in
the DSPEC field of the write command.
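
A minimal userspace sketch of the hint to placement-identifier mapping
described above.  The table layout, the name hint_to_placement_id, the
NO_PLACEMENT_ID sentinel and the convention that hint 0 means "no hint"
are assumptions made for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel returned when no placement identifier applies (hypothetical). */
#define NO_PLACEMENT_ID UINT16_MAX

/*
 * Map a 1-based write hint to one of the placement identifiers fetched
 * from the device.  Hint 0 is treated as "no hint", and hints beyond the
 * table are rejected rather than wrapped.  All names are illustrative.
 */
static uint16_t hint_to_placement_id(const uint16_t *plids, size_t nr_plids,
				     uint16_t write_hint)
{
	assert(plids != NULL || nr_plids == 0);

	if (write_hint == 0 || write_hint > nr_plids)
		return NO_PLACEMENT_ID;
	return plids[write_hint - 1];
}
```

In this sketch the caller would place the returned identifier in the
write command only when it is not NO_PLACEMENT_ID.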

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Hui Qi <hui81.qi@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  block: export placement hint feature
Keith Busch [Tue, 29 Oct 2024 15:19:20 +0000 (08:19 -0700)]
block: export placement hint feature

Add a feature flag for devices that support generic placement hints in
write commands. This is in contrast to data lifetime hints.

Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  io_uring: enable per-io hinting capability
Kanchan Joshi [Tue, 29 Oct 2024 15:19:19 +0000 (08:19 -0700)]
io_uring: enable per-io hinting capability

With F_SET_RW_HINT fcntl, a user can set a hint on the file inode, and
all the subsequent writes on the file pass that hint value down. This
can be limiting for block devices as all the writes will be tagged with
only one lifetime hint value. Concurrent writes (with different hint
values) are hard to manage. Per-IO hinting solves that problem.

Allow userspace to pass additional metadata in the SQE.

__u16 write_hint;

If the hint is provided, filesystems may optionally use it. A filesystem
may ignore this field if it does not support per-io hints, or if the
value is invalid for its backing storage. Just like the inode hints,
requesting values that are not supported by the hardware is not an
error.
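
A small sketch of how such an optional hint could be resolved.  The
function name, the fallback order (per-IO hint first, then the inode
hint) and the use of 0 for "no hint" are assumptions for illustration,
not the in-kernel logic.

```c
#include <stdint.h>

/*
 * Pick the hint to attach to a write.  A nonzero per-I/O hint that the
 * device supports wins; otherwise fall back to the inode hint, and drop
 * that too if it is out of range.  An unsupported value is silently
 * ignored instead of being reported as an error.
 */
static uint16_t effective_write_hint(uint16_t io_hint, uint16_t inode_hint,
				     uint16_t max_hints)
{
	if (io_hint != 0 && io_hint <= max_hints)
		return io_hint;
	if (inode_hint <= max_hints)
		return inode_hint;
	return 0;
}
```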

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  block, fs: add write hint to kiocb
Keith Busch [Tue, 29 Oct 2024 15:19:18 +0000 (08:19 -0700)]
block, fs: add write hint to kiocb

This prepares for sources other than the inode to provide a write hint.
The block layer will use it for direct IO if the requested hint is
within the block device's allowed hints.

Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  block: allow ability to limit partition write hints
Keith Busch [Tue, 29 Oct 2024 15:19:17 +0000 (08:19 -0700)]
block: allow ability to limit partition write hints

When multiple partitions are used, you may want to enforce different
subsets of the available write hints for each partition. Provide a
bitmap attribute of the available write hints, and allow an admin to
write a different mask to set the partition's allowed write hints.
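
As a rough illustration of checking a hint against such a bitmap, the
sketch below assumes a 64-bit mask in which bit (hint - 1) marks a
1-based hint as allowed.  The mask width, the bit numbering and the
function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Test whether a 1-based write hint is permitted by a partition's
 * bitmap.  Hint 0 ("no hint") and hints past the mask width are
 * reported as not allowed.
 */
static bool partition_allows_hint(uint64_t allowed_mask, unsigned int hint)
{
	if (hint == 0 || hint > 64)
		return false;
	return (allowed_mask >> (hint - 1)) & 1;
}
```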

Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  statx: add write hint information
Keith Busch [Tue, 29 Oct 2024 15:19:16 +0000 (08:19 -0700)]
statx: add write hint information

If requested on a raw block device, report the maximum write hint the
block device supports.

Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  block: introduce max_write_hints queue limit
Keith Busch [Tue, 29 Oct 2024 15:19:15 +0000 (08:19 -0700)]
block: introduce max_write_hints queue limit

Drivers with hardware that supports write streams need a way to export how
many are available so applications can generically query this.

Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  block: use generic u16 for write hints
Keith Busch [Tue, 29 Oct 2024 15:19:14 +0000 (08:19 -0700)]
block: use generic u16 for write hints

This is still backwards compatible with lifetime hints. It just doesn't
constrain the hints to that definition. Using this type doesn't change
the size of either bio or request.

Signed-off-by: Keith Busch <kbusch@kernel.org>
5 months ago  xfs: disable rt quotas for zoned file systems
Christoph Hellwig [Tue, 22 Oct 2024 12:16:44 +0000 (14:16 +0200)]
xfs: disable rt quotas for zoned file systems

They'll need a little more work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support an internal zoned rtdev
Christoph Hellwig [Tue, 5 Nov 2024 08:29:14 +0000 (09:29 +0100)]
xfs: support an internal zoned rtdev

Allow creating an RT subvolume on the same device as the main data
device.  This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.  One day we should also support the log
on sequential write required zones, but that is not supported here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support growfs on zoned file systems
Christoph Hellwig [Wed, 23 Oct 2024 07:01:12 +0000 (09:01 +0200)]
xfs: support growfs on zoned file systems

Replace the inner loop growing one RT bitmap block at a time with
one just modifying the superblock counters for growing an entire
zone (aka RTG).  The big restriction is that, just like at mkfs time, only
an RT extent size of a single FSB is allowed, and the file system
capacity needs to be aligned to the zone size.
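
The two restrictions can be expressed as a simple check.  This is a
sketch with hypothetical parameter names, not the growfs code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check the two growfs restrictions named above: the RT extent size is
 * a single file system block, and the new capacity is a whole number of
 * zones.  All sizes are in file system blocks.
 */
static bool zoned_growfs_allowed(uint64_t new_rblocks, uint32_t rextsize_fsb,
				 uint64_t zone_blocks)
{
	if (zone_blocks == 0)
		return false;
	if (rextsize_fsb != 1)
		return false;
	return new_rblocks % zone_blocks == 0;
}
```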

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: add data placement info to mount stats
Hans Holmberg [Sun, 6 Oct 2024 05:04:42 +0000 (07:04 +0200)]
xfs: add data placement info to mount stats

Add per-rtg active refs, life time hint and data separation score and
an aggregate data separation score as output to the mount stats
to aid debugging and analysis.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
5 months ago  xfs: support write life time based data placement
Hans Holmberg [Tue, 5 Nov 2024 07:51:29 +0000 (08:51 +0100)]
xfs: support write life time based data placement

Add a file write life time data placement allocation scheme that aims
to minimize fragmentation by doing two things:

a) Separate file data into different zones when possible.
b) Colocate file data of similar life times when feasible.

To get best results, average file sizes should align with average
zone capacity.

Benchmarked with RocksDB using leveled compaction, observing ~10%
throughput improvement for overwrite workloads at 80% file system
utilization.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
5 months ago  xfs: add plumbing and mount option for write life time hints
Hans Holmberg [Sun, 6 Oct 2024 05:03:32 +0000 (07:03 +0200)]
xfs: add plumbing and mount option for write life time hints

Add a mount option and some plumbing for enabling usage
of file write life time hints.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support xrep_require_rtext_inuse on zoned file systems
Christoph Hellwig [Mon, 22 Jul 2024 13:31:28 +0000 (06:31 -0700)]
xfs: support xrep_require_rtext_inuse on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.
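
A sketch of that sanity check, assuming the write pointer is kept as the
number of blocks already written in the zone.  The function and
parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A block can only be in use if it lies inside the zone and below the
 * zone's write pointer.  write_pointer counts the blocks written so far,
 * so valid offsets are 0 .. write_pointer - 1.
 */
static bool block_may_be_in_use(uint64_t zone_start, uint64_t write_pointer,
				uint64_t bno)
{
	if (bno < zone_start)
		return false;
	return bno - zone_start < write_pointer;
}
```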

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support xchk_xref_is_used_rt_space on zoned file systems
Christoph Hellwig [Thu, 15 Aug 2024 16:12:33 +0000 (18:12 +0200)]
xfs: support xchk_xref_is_used_rt_space on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support zone gaps
Christoph Hellwig [Fri, 18 Oct 2024 14:19:15 +0000 (16:19 +0200)]
xfs: support zone gaps

Zoned devices can have gaps beyond the usable capacity of a zone, up to the
end of the zone in the LBA/daddr address space.  In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us.  In this case the sparse FSB/RTB address space maps 1:1
to the device address space.
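
To illustrate the sparse address space, the sketch below assumes a
power-of-2 zone size expressed as a shift and a capacity that may be
smaller than the zone size.  The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Geometry shared by all zones, in file system blocks (illustrative). */
struct zone_geom {
	unsigned int	size_shift;	/* zone size is 1 << size_shift */
	uint64_t	capacity;	/* usable blocks, <= zone size */
};

/* Index of the zone that a sparse block number falls into. */
static uint64_t rtbno_to_zone(const struct zone_geom *g, uint64_t rtbno)
{
	assert(g->size_shift < 64);
	return rtbno >> g->size_shift;
}

/* True if the block sits in the usable part of its zone, not in the gap. */
static bool rtbno_is_usable(const struct zone_geom *g, uint64_t rtbno)
{
	assert(g->size_shift < 64);
	uint64_t off = rtbno & ((UINT64_C(1) << g->size_shift) - 1);

	return off < g->capacity;
}
```

With a zone size of 16 blocks and a capacity of 12, blocks 12 to 15 of
every zone form the gap, and the block number needs no further
translation to find the device address.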

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support zoned RT devices
Christoph Hellwig [Tue, 5 Nov 2024 08:27:11 +0000 (09:27 +0100)]
xfs: support zoned RT devices

WARNING: this is early prototype code.

The zoned allocator works by handing out data blocks to the direct or
buffered write code at the place where XFS currently does block
allocations.  It does not actually insert them into the bmap extent tree
at this time, but only after I/O completion when we know the block number.

The zoned allocator works on any kind of device, including conventional
devices or conventional zones by having a crude write pointer emulation.
For zoned devices active zone management is fully supported, as is
zone capacity < zone size.

The two major limitations are:

 - there is no support for unwritten extents and thus persistent
   file preallocations from fallocate().  This is inherent to an
   always out of place write scheme as there is no way to persistently
   preallocate blocks for an indefinite number of overwrites
 - because the metadata blocks and data blocks are on different
   devices you can run out of space for metadata while having plenty
   of space for data and vice versa.  This is inherent to a scheme
   where we use different devices or pools for each.

For zoned file systems we reserve the free extents before taking the
ilock so that if we have to force garbage collection it happens before we
take the iolock.  This is done because GC has to take the iolock after it
moved data to a new place, and this could otherwise deadlock.

This unfortunately has to exclude block zeroing, as for truncate we are
called with the iolock (aka i_rwsem) already held.  As zeroing is always
only for a single block at a time, or up to two total for a syscall in
the case of free_file_range, we deal with that by just stealing the block,
but failing the allocation if we'd have to wait for GC.

Add a new RTAVAILABLE counter of blocks that are actually directly
available to be written into in addition to the classic free counter.
Only allow a write to go ahead if it has blocks available to write, and
otherwise wait for GC.  This also requires tweaking the need GC condition a
bit as we now always need to GC if someone is waiting for space.
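
A simplified, single-threaded sketch of the two-counter idea.  The
struct, the enum and the assumption that "available" never exceeds
"free" are illustrative and do not reflect the real percpu counters.

```c
#include <stdint.h>

/* Illustrative counters: free space vs. space writable right now. */
struct rt_counters {
	uint64_t free;		/* classic free counter */
	uint64_t available;	/* blocks that can be written without GC */
};

enum reserve_result {
	RESERVE_OK,		/* reservation taken from both counters */
	RESERVE_WAIT_GC,	/* space exists, but GC must reclaim it first */
	RESERVE_NOSPC,		/* not enough free space at all */
};

/* Only let a write go ahead if it has blocks available to write into. */
static enum reserve_result reserve_blocks(struct rt_counters *c, uint64_t count)
{
	if (count > c->free)
		return RESERVE_NOSPC;
	if (count > c->available)
		return RESERVE_WAIT_GC;
	c->free -= count;
	c->available -= count;
	return RESERVE_OK;
}
```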

Thanks to Hans Holmberg <hans.holmberg@wdc.com> for lots of fixes
and improvements.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: report a XFS_FSOP_GEOM_FLAGS_ZONED in the file system geometry
Christoph Hellwig [Thu, 24 Oct 2024 08:57:41 +0000 (10:57 +0200)]
xfs: report a XFS_FSOP_GEOM_FLAGS_ZONED in the file system geometry

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: allow COW forks on zoned file systems in xchk_bmap
Christoph Hellwig [Fri, 10 May 2024 06:51:02 +0000 (08:51 +0200)]
xfs: allow COW forks on zoned file systems in xchk_bmap

Zoned file systems can have COW forks even without reflinks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: disable sb_frextents scrub/repair for zoned file systems
Christoph Hellwig [Sun, 12 May 2024 05:39:45 +0000 (07:39 +0200)]
xfs: disable sb_frextents scrub/repair for zoned file systems

Zoned file systems not only don't use the frextents counter, but the
in-memory percpu counter also includes reservations taken before even
allocating delalloc extent records, so it will never match the per-zone
used information.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: add a helper to check if an inode sits on a zoned device
Christoph Hellwig [Sun, 6 Oct 2024 04:30:30 +0000 (06:30 +0200)]
xfs: add a helper to check if an inode sits on a zoned device

Add a xfs_is_zoned_inode helper that returns true if an inode has the
RT flag set and the file system is zoned.  This will be used to key
off zoned allocator behavior.

Make xfs_is_always_cow_inode return true for zoned inodes as we always
need to write out of place on zoned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: add an incompat feature bit for zoned RT devices
Christoph Hellwig [Fri, 23 Aug 2024 14:27:52 +0000 (16:27 +0200)]
xfs: add an incompat feature bit for zoned RT devices

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
Christoph Hellwig [Fri, 27 Oct 2023 07:58:24 +0000 (09:58 +0200)]
xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write

For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.
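
A userspace sketch of a per-segment check.  It only looks at segment
lengths against a caller-supplied block size; the function name is
hypothetical and the kernel check may consider more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Return true only if every iovec segment is a multiple of the block
 * size.  Checking just the total length is not enough: two segments of
 * half a block each add up to an aligned total but are not individually
 * aligned.
 */
static bool iov_segments_aligned(const struct iovec *iov, int iovcnt,
				 size_t blocksize)
{
	assert(blocksize != 0);

	for (int i = 0; i < iovcnt; i++) {
		if (iov[i].iov_len % blocksize)
			return false;
	}
	return true;
}
```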

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
Christoph Hellwig [Tue, 10 Sep 2024 04:58:17 +0000 (07:58 +0300)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation.  To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: skip always_cow inodes in xfs_reflink_trim_around_shared
Christoph Hellwig [Sat, 14 Oct 2023 05:50:35 +0000 (07:50 +0200)]
xfs: skip always_cow inodes in xfs_reflink_trim_around_shared

xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree.  Always_cow inodes don't have that tree, so don't
bother.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: preserve RT reservations across remounts
Hans Holmberg [Tue, 9 Jan 2024 18:01:15 +0000 (19:01 +0100)]
xfs: preserve RT reservations across remounts

Introduce a reservation setting for rt devices so that zoned GC
reservations are preserved over remount ro/rw cycles.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Thu, 10 Oct 2024 05:27:50 +0000 (07:27 +0200)]
xfs: generalize the freespace and reserved blocks handling

The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in and special cased.

Replace both the percpu counters and the resblks counters with arrays,
so that reserved RT extents can be supported, which will be
needed for garbage collection on zoned devices.

Use helpers to access the freespace counters everywhere instead of
poking through the abstraction by using the percpu_count helpers
directly.  This also switches the flooring of the frextents counter
to 0 in statfs for the rtinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: use the proper conversion helpers in xfs_rt_check_size
Christoph Hellwig [Sat, 26 Oct 2024 13:24:38 +0000 (15:24 +0200)]
xfs: use the proper conversion helpers in xfs_rt_check_size

Use the proper helpers to deal with sparse rtbno encoding.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: factor out a xfs_rt_check_size helper
Christoph Hellwig [Tue, 30 Jul 2024 23:42:42 +0000 (16:42 -0700)]
xfs: factor out a xfs_rt_check_size helper

Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs.  This also adds the mount
time overflow check to growfs and improves the error messages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: constify feature checks
Christoph Hellwig [Tue, 5 Nov 2024 07:39:56 +0000 (08:39 +0100)]
xfs: constify feature checks

We'll need to call them on a const structure in growfs in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: simplify sector number calculation in xfs_zero_extent
Christoph Hellwig [Tue, 30 Jul 2024 23:15:43 +0000 (16:15 -0700)]
xfs: simplify sector number calculation in xfs_zero_extent

xfs_zero_extent does some really odd gymnastics to calculate the block
layer sector numbers passed to blkdev_issue_zeroout.  This is because it
used to call sb_issue_zeroout and the calculations in that helper got
open coded here in the rather misleadingly named commit 3dc29161070a
("dax: use sb_issue_zerout instead of calling dax_clear_sectors").

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: pass private data to iomap_truncate_page
Christoph Hellwig [Fri, 16 Aug 2024 16:49:13 +0000 (18:49 +0200)]
iomap: pass private data to iomap_truncate_page

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: pass private data to iomap_zero_range
Christoph Hellwig [Fri, 16 Aug 2024 16:48:16 +0000 (18:48 +0200)]
iomap: pass private data to iomap_zero_range

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: pass private data to iomap_page_mkwrite
Christoph Hellwig [Tue, 10 Sep 2024 04:57:21 +0000 (07:57 +0300)]
iomap: pass private data to iomap_page_mkwrite

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: optionally use ioends for direct I/O
Christoph Hellwig [Tue, 5 Nov 2024 08:14:33 +0000 (09:14 +0100)]
iomap: optionally use ioends for direct I/O

struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.

For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.

So instead of reinventing the wheel, reuse the existing infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: split bios to zone append limits in the submission handlers
Christoph Hellwig [Tue, 5 Nov 2024 08:09:21 +0000 (09:09 +0100)]
iomap: split bios to zone append limits in the submission handlers

Provide helpers for file systems to split bios in the direct I/O and
writeback I/O submission handlers.

This follows btrfs' lead and doesn't try to build bios to hardware limits
for zone append commands, but instead builds them as normal unconstrained
bios and splits them to the hardware limits in the I/O submission handler.
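
The splitting step can be pictured with plain lengths instead of bios.
This is a sketch with hypothetical names; it says nothing about the
actual bio splitting helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Length of the next piece to submit, capped at the zone append limit. */
static uint64_t next_split_len(uint64_t remaining, uint64_t max_append)
{
	assert(max_append > 0);
	return remaining < max_append ? remaining : max_append;
}

/* Number of pieces an unconstrained I/O of 'total' units is split into. */
static uint64_t count_splits(uint64_t total, uint64_t max_append)
{
	uint64_t n = 0;

	while (total > 0) {
		total -= next_split_len(total, max_append);
		n++;
	}
	return n;
}
```

For example, an I/O of 10 units with a limit of 4 is submitted as pieces
of 4, 4 and 2.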

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: add a IOMAP_F_ZONE_APPEND flag
Christoph Hellwig [Sun, 5 Nov 2023 05:40:52 +0000 (06:40 +0100)]
iomap: add a IOMAP_F_ZONE_APPEND flag

This doesn't do much - just always returns the start block number for each
iomap instead of increasing it.  This is because we'll keep building bios
unconstrained by the hardware limits and just split them in the file system
submission handler.

Maybe we should find another name for it, because it might be useful for
btrfs compressed bio submissions as well, but I can't come up with a
good one.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: allow the file system to submit the writeback bios
Christoph Hellwig [Tue, 5 Nov 2024 07:33:10 +0000 (08:33 +0100)]
iomap: allow the file system to submit the writeback bios

Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio.  This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs.  To make this easier also pass the writeback
context to the method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: wait for writeback before allocating new blocks
Christoph Hellwig [Fri, 13 Oct 2023 06:09:39 +0000 (08:09 +0200)]
iomap: wait for writeback before allocating new blocks

This means we are actually forced to allocate new delalloc space for the
new dirtier instead of reusing one that is currently being used for
writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: don't merge ioends with mismatching fs private flag
Christoph Hellwig [Fri, 24 Nov 2023 18:51:38 +0000 (19:51 +0100)]
iomap: don't merge ioends with mismatching fs private flag

If the file system set its private flag on one ioend but not the other
we'd better not merge the two.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  iomap: drop an obsolete comment in iomap_dio_bio_iter
Christoph Hellwig [Tue, 5 Nov 2024 08:25:28 +0000 (09:25 +0100)]
iomap: drop an obsolete comment in iomap_dio_bio_iter

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  block: take chunk_sectors into account in bio_split_write_zeroes
Christoph Hellwig [Sat, 2 Nov 2024 06:04:18 +0000 (07:04 +0100)]
block: take chunk_sectors into account in bio_split_write_zeroes

For zoned devices, write zeroes must be split at the zone boundary
which is represented as chunk_sectors.  For other uses like the
internally RAIDed NVMe devices it is probably at least useful.
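
A sketch of capping a write zeroes request at the next chunk boundary.
The helper names are hypothetical and this is not the get_max_io_size
implementation; a chunk size of 0 is taken to mean "no boundary".

```c
#include <assert.h>
#include <stdint.h>

/* Sectors left before the next chunk (zone) boundary. */
static uint64_t sectors_to_chunk_boundary(uint64_t sector,
					  uint64_t chunk_sectors)
{
	assert(chunk_sectors != 0);
	return chunk_sectors - (sector % chunk_sectors);
}

/*
 * Largest write zeroes request that may start at 'sector': bounded by
 * the request size, the device limit and, if set, the chunk boundary.
 */
static uint64_t max_write_zeroes_len(uint64_t sector, uint64_t nr_sectors,
				     uint64_t limit, uint64_t chunk_sectors)
{
	uint64_t max = limit;

	if (chunk_sectors != 0) {
		uint64_t left = sectors_to_chunk_boundary(sector, chunk_sectors);

		if (left < max)
			max = left;
	}
	return nr_sectors < max ? nr_sectors : max;
}
```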

Enhance get_max_io_size to know about write zeroes and use it in
bio_split_write_zeroes.  Also add a comment about the seemingly
nonsensical zero max_write_zeroes limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  block: lift bio_is_zone_append to bio.h
Christoph Hellwig [Thu, 31 Oct 2024 14:09:05 +0000 (15:09 +0100)]
block: lift bio_is_zone_append to bio.h

Make bio_is_zone_append globally available, because file systems need
to use it to check for a zone append bio in their end_io handlers to deal
with the block layer emulation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  block: fix bio_split_rw_at to take zone_write_granularity into account
Christoph Hellwig [Thu, 31 Oct 2024 13:16:37 +0000 (14:16 +0100)]
block: fix bio_split_rw_at to take zone_write_granularity into account

Otherwise it can create unaligned writes on zoned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  block: Add a public bdev_zone_is_seq() helper
Damien Le Moal [Fri, 1 Nov 2024 01:33:52 +0000 (10:33 +0900)]
block: Add a public bdev_zone_is_seq() helper

Turn the private disk_zone_is_conv() function in blk-zoned.c into a
public and documented bdev_zone_is_seq() helper with the inverse
polarity of the original function, also adding a check for non-zoned
devices so that all file systems can use the helper, even with a regular
block device.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  block: RCU protect disk->conv_zones_bitmap
Damien Le Moal [Fri, 1 Nov 2024 01:33:51 +0000 (10:33 +0900)]
block: RCU protect disk->conv_zones_bitmap

Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.

disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.

disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  block: add a bdev_limits helper
Christoph Hellwig [Tue, 29 Oct 2024 08:50:56 +0000 (09:50 +0100)]
block: add a bdev_limits helper

Add a helper to get the queue_limits from the bdev without having to
poke into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  TEMP: nvme-pci: disable async probe
Christoph Hellwig [Sun, 15 Oct 2023 07:26:20 +0000 (09:26 +0200)]
TEMP: nvme-pci: disable async probe

This keeps getting my ZNS vs ZNS drivers reordered a bit and is annoying
for testing.

Not-really-signed-off-by: Christoph Hellwig <hch@lst.de>
5 months ago  xfs: enable realtime reflink
Darrick J. Wong [Tue, 15 Oct 2024 19:40:44 +0000 (12:40 -0700)]
xfs: enable realtime reflink

Enable reflink for realtime devices, sort of.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: fix CoW forks for realtime files
Darrick J. Wong [Tue, 15 Oct 2024 19:40:43 +0000 (12:40 -0700)]
xfs: fix CoW forks for realtime files

Port the copy on write fork repair to realtime files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: check for shared rt extents when rebuilding rt file's data fork
Darrick J. Wong [Tue, 15 Oct 2024 19:40:42 +0000 (12:40 -0700)]
xfs: check for shared rt extents when rebuilding rt file's data fork

When we're rebuilding the data fork of a realtime file, we need to
cross-reference each mapping with the rt refcount btree to ensure that
the reflink flag is set if there are any shared extents found.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: repair inodes that have a refcount btree in the data fork
Darrick J. Wong [Tue, 15 Oct 2024 19:40:42 +0000 (12:40 -0700)]
xfs: repair inodes that have a refcount btree in the data fork

Plumb knowledge of refcount btrees into the inode core repair code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: online repair of the realtime refcount btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:41 +0000 (12:40 -0700)]
xfs: online repair of the realtime refcount btree

Port the data device's refcount btree repair code to the realtime
refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: capture realtime CoW staging extents when rebuilding rt rmapbt
Darrick J. Wong [Tue, 15 Oct 2024 19:40:40 +0000 (12:40 -0700)]
xfs: capture realtime CoW staging extents when rebuilding rt rmapbt

Walk the realtime refcount btree to find the CoW staging extents when
we're rebuilding the realtime rmap btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: walk the rt reference count tree when rebuilding rmap
Darrick J. Wong [Tue, 15 Oct 2024 19:40:39 +0000 (12:40 -0700)]
xfs: walk the rt reference count tree when rebuilding rmap

When we're rebuilding the data device rmap, if we encounter a "refcount"
format fork, we have to walk the (realtime) refcount btree inode to
build the appropriate mappings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: check new rtbitmap records against rt refcount btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:39 +0000 (12:40 -0700)]
xfs: check new rtbitmap records against rt refcount btree

When we're rebuilding the realtime bitmap, check the proposed free
extents against the rt refcount btree to make sure we don't commit any
grievous errors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: don't flag quota rt block usage on rtreflink filesystems
Darrick J. Wong [Tue, 15 Oct 2024 19:40:38 +0000 (12:40 -0700)]
xfs: don't flag quota rt block usage on rtreflink filesystems

Quota space usage is allowed to exceed the size of the physical storage
when reflink is enabled.  Now that we have reflink for the realtime
volume, apply this same logic to the rtb repair logic.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: scrub the metadir path of rt refcount btree files
Darrick J. Wong [Tue, 15 Oct 2024 19:40:37 +0000 (12:40 -0700)]
xfs: scrub the metadir path of rt refcount btree files

Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata
directory tree path to the refcount btree file for each rt group.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: detect and repair misaligned rtinherit directory cowextsize hints
Darrick J. Wong [Tue, 15 Oct 2024 19:40:36 +0000 (12:40 -0700)]
xfs: detect and repair misaligned rtinherit directory cowextsize hints

If we encounter a directory that has been configured to pass on a CoW
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review and/or turn it off because that is a
misconfiguration.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: allow dquot rt block count to exceed rt blocks on reflink fs
Darrick J. Wong [Tue, 15 Oct 2024 19:40:35 +0000 (12:40 -0700)]
xfs: allow dquot rt block count to exceed rt blocks on reflink fs

Update the quota scrubber to allow dquots where the realtime block count
exceeds the block count of the rt volume if reflink is enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: check reference counts of gaps between rt refcount records
Darrick J. Wong [Tue, 15 Oct 2024 19:40:35 +0000 (12:40 -0700)]
xfs: check reference counts of gaps between rt refcount records

If there's a gap between records in the rt refcount btree, we ought to
cross-reference the gap with the rtrmap records to make sure that there
aren't any overlapping records for a region that doesn't have any shared
ownership.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: allow overlapping rtrmapbt records for shared data extents
Darrick J. Wong [Tue, 15 Oct 2024 19:40:34 +0000 (12:40 -0700)]
xfs: allow overlapping rtrmapbt records for shared data extents

Allow overlapping realtime reverse mapping records if they both describe
shared data extents and the fs supports reflink on the realtime volume.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago  xfs: cross-reference checks with the rt refcount btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:33 +0000 (12:40 -0700)]
xfs: cross-reference checks with the rt refcount btree

Use the realtime refcount btree to implement cross-reference checks in
other data structures.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: scrub the realtime refcount btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:32 +0000 (12:40 -0700)]
xfs: scrub the realtime refcount btree

Add code to scrub realtime refcount btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: report realtime refcount btree corruption errors to the health system
Darrick J. Wong [Tue, 15 Oct 2024 19:40:32 +0000 (12:40 -0700)]
xfs: report realtime refcount btree corruption errors to the health system

Whenever we encounter corrupt realtime refcount btree blocks, we should
record that in the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: check that the rtrefcount maxlevels doesn't increase when growing fs
Darrick J. Wong [Tue, 15 Oct 2024 19:40:31 +0000 (12:40 -0700)]
xfs: check that the rtrefcount maxlevels doesn't increase when growing fs

The size of filesystem transaction reservations depends on the maximum
height (maxlevels) of the realtime btrees.  Since we don't want a grow
operation to increase the reservation size enough that we'll fail the
minimum log size checks on the next mount, constrain growfs operations
if they would cause an increase in the rt refcount btree maxlevels.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
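
To make the growfs constraint in the commit above concrete, the sketch below models the worst-case btree height and the resulting check. It assumes one record per rt block in the worst case and uses hypothetical names (`btree_max_levels`, `grow_allowed`) and hypothetical minimum-records parameters; the kernel computes maxlevels from its own geometry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: worst-case height of a btree holding nrecords
 * records when every leaf holds only leaf_minrecs records and every
 * interior node holds only node_minrecs pointers.
 */
static unsigned int btree_max_levels(uint64_t nrecords,
				     unsigned int leaf_minrecs,
				     unsigned int node_minrecs)
{
	uint64_t blocks;
	unsigned int levels = 1;

	assert(leaf_minrecs > 0 && node_minrecs > 1);

	/* Leaf blocks needed, rounded up; an empty tree still has a root. */
	blocks = (nrecords + leaf_minrecs - 1) / leaf_minrecs;
	if (blocks == 0)
		blocks = 1;

	/* Add one level per layer of interior nodes until one block remains. */
	while (blocks > 1) {
		blocks = (blocks + node_minrecs - 1) / node_minrecs;
		levels++;
	}
	return levels;
}

/*
 * Refuse a grow if the larger rt volume could need a taller btree,
 * since that would enlarge the transaction reservations.
 */
static bool grow_allowed(uint64_t old_rt_blocks, uint64_t new_rt_blocks,
			 unsigned int leaf_minrecs, unsigned int node_minrecs)
{
	return btree_max_levels(new_rt_blocks, leaf_minrecs, node_minrecs) <=
	       btree_max_levels(old_rt_blocks, leaf_minrecs, node_minrecs);
}
```

With four records per leaf and four pointers per node, growing from 5 to 16 blocks keeps the height at two levels, while growing from 16 to 17 would need a third level and is rejected.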
5 months ago xfs: enable extent size hints for CoW operations
Darrick J. Wong [Tue, 15 Oct 2024 19:40:30 +0000 (12:40 -0700)]
xfs: enable extent size hints for CoW operations

Wire up the copy-on-write extent size hint for realtime files, and
connect it to the rt allocator so that we avoid fragmentation on rt
filesystems.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: apply rt extent alignment constraints to CoW extsize hint
Darrick J. Wong [Tue, 15 Oct 2024 19:40:29 +0000 (12:40 -0700)]
xfs: apply rt extent alignment constraints to CoW extsize hint

The copy-on-write extent size hint is subject to the same alignment
constraints as the regular extent size hint.  Since we're in the process
of adding reflink (and therefore CoW) to the realtime device, we must
apply the same scattered rextsize alignment validation strategies to
both hints to deal with the possibility of rextsize changing.

Therefore, fix the inode validator to perform rextsize alignment checks
on regular realtime files, and to remove misaligned directory hints.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files
Darrick J. Wong [Tue, 15 Oct 2024 19:40:29 +0000 (12:40 -0700)]
xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files

Currently, we (ab)use xfs_get_extsz_hint so that it always returns a
nonzero value for realtime files.  This apparently was done to disable
delayed allocation for realtime files.

However, once we enable realtime reflink, we can also turn on the
alwayscow flag to force CoW writes to realtime files.  In this case, the
logic will incorrectly send the write through the delalloc write path.

Fix this by adjusting the logic slightly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: recover CoW leftovers in the realtime volume
Darrick J. Wong [Tue, 15 Oct 2024 19:40:28 +0000 (12:40 -0700)]
xfs: recover CoW leftovers in the realtime volume

Scan the realtime refcount tree at mount time to get rid of leftover
CoW staging extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: allow inodes to have the realtime and reflink flags
Darrick J. Wong [Tue, 15 Oct 2024 19:40:27 +0000 (12:40 -0700)]
xfs: allow inodes to have the realtime and reflink flags

Now that we can share blocks between realtime files, allow this
combination.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: enable sharing of realtime file blocks
Darrick J. Wong [Tue, 15 Oct 2024 19:40:26 +0000 (12:40 -0700)]
xfs: enable sharing of realtime file blocks

Update the remapping routines to be able to handle realtime files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: enable CoW for realtime data
Darrick J. Wong [Tue, 15 Oct 2024 19:40:25 +0000 (12:40 -0700)]
xfs: enable CoW for realtime data

Update our write paths to support copy on write on the rt volume.  This
works in more or less the same way as it does on the data device, with
the major exception that we never do delalloc on the rt volume.

Because we consider unwritten CoW fork staging extents to be incore
quota reservation, we update xfs_quota_reserve_blkres to support this
case.  Though xfs doesn't allow rt and quota together, the change is
trivial and we shouldn't leave a logic bomb here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: refactor reflink quota updates
Darrick J. Wong [Tue, 15 Oct 2024 19:40:25 +0000 (12:40 -0700)]
xfs: refactor reflink quota updates

Hoist all quota updates for reflink into a helper function, since things
are about to become more complicated.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: compute rtrmap btree max levels when reflink enabled
Darrick J. Wong [Tue, 15 Oct 2024 19:40:24 +0000 (12:40 -0700)]
xfs: compute rtrmap btree max levels when reflink enabled

Compute the maximum possible height of the realtime rmap btree when
reflink is enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: update rmap to allow cow staging extents in the rt rmap
Darrick J. Wong [Tue, 15 Oct 2024 19:40:23 +0000 (12:40 -0700)]
xfs: update rmap to allow cow staging extents in the rt rmap

Don't error out on CoW staging extent records when realtime reflink is
enabled.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: create routine to allocate and initialize a realtime refcount btree inode
Darrick J. Wong [Tue, 15 Oct 2024 19:40:22 +0000 (12:40 -0700)]
xfs: create routine to allocate and initialize a realtime refcount btree inode

Create a library routine to allocate and initialize an empty realtime
refcountbt inode.  We'll use this for growfs, mkfs, and repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: wire up realtime refcount btree cursors
Darrick J. Wong [Tue, 15 Oct 2024 19:40:22 +0000 (12:40 -0700)]
xfs: wire up realtime refcount btree cursors

Wire up realtime refcount btree cursors wherever they're needed
throughout the code base.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: refactor xfs_reflink_find_shared
Darrick J. Wong [Tue, 15 Oct 2024 19:40:21 +0000 (12:40 -0700)]
xfs: refactor xfs_reflink_find_shared

Move lookup of the perag structure from the callers into the helpers,
and return the offset into the extent of the shared region instead of
the block number that needs post-processing.  This prepares the
callsites for the creation of an rt-specific variant in the next patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: port to the middle of the rtreflink series for cleanliness]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: wire up a new inode fork type for the realtime refcount
Darrick J. Wong [Tue, 15 Oct 2024 19:40:20 +0000 (12:40 -0700)]
xfs: wire up a new inode fork type for the realtime refcount

Plumb in the pieces we need to embed the root of the realtime refcount
btree in an inode's data fork, complete with new fork type and
on-disk interpretation functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: add metadata reservations for realtime refcount btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:19 +0000 (12:40 -0700)]
xfs: add metadata reservations for realtime refcount btree

Reserve some free blocks so that we will always have enough free blocks
in the data volume to handle expansion of the realtime refcount btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: add realtime refcount btree inode to metadata directory
Darrick J. Wong [Tue, 15 Oct 2024 19:40:19 +0000 (12:40 -0700)]
xfs: add realtime refcount btree inode to metadata directory

Add a metadir path to select the realtime refcount btree inode and load
it at mount time.  The rtrefcountbt inode will have a unique extent format
code, which means that we also have to update the inode validation and
flush routines to look for it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: add realtime refcount btree block detection to log recovery
Darrick J. Wong [Tue, 15 Oct 2024 19:40:18 +0000 (12:40 -0700)]
xfs: add realtime refcount btree block detection to log recovery

Identify rt refcount btree blocks in the log correctly so that we can
validate them during log recovery.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: support recovering refcount intent items targetting realtime extents
Darrick J. Wong [Tue, 15 Oct 2024 19:40:17 +0000 (12:40 -0700)]
xfs: support recovering refcount intent items targetting realtime extents

Now that we have reflink on the realtime device, refcount intent items
have to support remapping extents on the realtime volume.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: add a realtime flag to the refcount update log redo items
Darrick J. Wong [Tue, 15 Oct 2024 19:40:16 +0000 (12:40 -0700)]
xfs: add a realtime flag to the refcount update log redo items

Extend the refcount update (CUI) log items with a new realtime flag that
indicates that the updates apply against the realtime refcountbt.  We'll
wire up the actual refcount code later.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: prepare refcount functions to deal with rtrefcountbt
Darrick J. Wong [Tue, 15 Oct 2024 19:40:15 +0000 (12:40 -0700)]
xfs: prepare refcount functions to deal with rtrefcountbt

Prepare the high-level refcount functions to deal with the new realtime
refcountbt and its slightly different conventions.  Provide the ability
to talk to either refcountbt or rtrefcountbt formats from the same high
level code.

Note that we leave the _recover_cow_leftovers functions for a separate
patch so that we can convert them all at once.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: add realtime refcount btree operations
Darrick J. Wong [Tue, 15 Oct 2024 19:40:15 +0000 (12:40 -0700)]
xfs: add realtime refcount btree operations

Implement the generic btree operations needed to manipulate rtrefcount
btree blocks. This is different from the regular refcountbt in that we
allocate space from the filesystem at large, and are constrained neither
to the free space nor to any particular AG.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: realtime refcount btree transaction reservations
Darrick J. Wong [Tue, 15 Oct 2024 19:40:14 +0000 (12:40 -0700)]
xfs: realtime refcount btree transaction reservations

Make sure that there's enough log reservation to handle mapping
and unmapping realtime extents.  We have to reserve enough space
to handle a split in the rtrefcountbt to add the record and a second
split in the regular refcountbt to record the rtrefcountbt split.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: introduce realtime refcount btree ondisk definitions
Darrick J. Wong [Tue, 15 Oct 2024 19:40:13 +0000 (12:40 -0700)]
xfs: introduce realtime refcount btree ondisk definitions

Add the ondisk structure definitions for realtime refcount btrees. The
realtime refcount btree will be rooted from a hidden inode so it needs
to have a separate btree block magic and pointer format.

Next, add everything needed to read, write and manipulate refcount btree
blocks. This prepares the way for connecting the btree operations
implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: namespace the maximum length/refcount symbols
Darrick J. Wong [Tue, 15 Oct 2024 19:40:12 +0000 (12:40 -0700)]
xfs: namespace the maximum length/refcount symbols

Actually namespace these variables properly, so that readers can tell
that this is an XFS symbol, and that it's for the refcount
functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: prepare refcount btree cursor tracepoints for realtime
Darrick J. Wong [Tue, 15 Oct 2024 19:40:11 +0000 (12:40 -0700)]
xfs: prepare refcount btree cursor tracepoints for realtime

Rework the refcount btree cursor tracepoints in preparation to handle the
realtime refcount btree cursor.  Mostly this involves renaming the field to
"refcbno" and extracting the group number from the cursor when possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: enable realtime rmap btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:11 +0000 (12:40 -0700)]
xfs: enable realtime rmap btree

Permit mounting filesystems with realtime rmap btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: hook live realtime rmap operations during a repair operation
Darrick J. Wong [Tue, 15 Oct 2024 19:40:10 +0000 (12:40 -0700)]
xfs: hook live realtime rmap operations during a repair operation

Hook the regular realtime rmap code when an rtrmapbt repair operation is
running so that we can unlock the AGF buffer to scan the filesystem and
keep the in-memory btree up to date during the scan.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: create a shadow rmap btree during realtime rmap repair
Darrick J. Wong [Tue, 15 Oct 2024 19:40:09 +0000 (12:40 -0700)]
xfs: create a shadow rmap btree during realtime rmap repair

Create an in-memory btree of rmap records instead of an array.  This
enables us to do live record collection instead of freezing the fs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: online repair of the realtime rmap btree
Darrick J. Wong [Tue, 15 Oct 2024 19:40:08 +0000 (12:40 -0700)]
xfs: online repair of the realtime rmap btree

Repair the realtime rmap btree while mounted.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
5 months ago xfs: support repairing metadata btrees rooted in metadir inodes
Darrick J. Wong [Tue, 15 Oct 2024 19:40:07 +0000 (12:40 -0700)]
xfs: support repairing metadata btrees rooted in metadir inodes

Adapt the repair code so that we can stage a new btree in the data fork
area of a metadir inode and reap the old blocks.  We already have nearly
all of the infrastructure; the only part that was missing was the
metadata inode reservation handling.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>