www.infradead.org Git - users/hch/xfs.git/log

LOCAL: add verbose kernel messages

xfs: export max_open_zones in sysfs

Add a zoned group with an attribute for the maximum number of open zones.
This allows querying the open zones for data placement tests, or also
for placement aware applications that are in control of the entire
file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: contain more sysfs code in xfs_sysfs.c

Extend the error sysfs initialization helper to include the neighbouring
attributes as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: export zone stats in /proc/*/mountstats

Add the per-zone life time hint and the used block distribution
for fully written zones, grouping reclaimable zones in fixed-percentage
buckets spanning 0..9%, 10..19% and so on, with full zones reported as
100% used, as well as a few statistics about the zone allocator and open
and reclaimable zones in /proc/*/mountstats.

This gives good insight into data fragmentation and data placement
success rate.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: wire up the show_stats super operation

The show_stats operation allows a file system to dump plain text statistics
on a per-mount basis into /proc/*/mountstats. Wire up a no-op version
which will grow useful information for zoned file systems later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support write life time based data placement

Add a file write life time data placement allocation scheme that aims to
minimize fragmentation and thereby to do two things:

a) separate file data to different zones when possible.
b) colocate file data of similar life times when feasible.

To get best results, average file sizes should align with the zone
capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl.

This improvement in data placement efficiency reduces the number of
blocks requiring relocation by GC, and thus decreases overall write
amplification. The impact on performance varies depending on how full
the file system is.

For RocksDB using leveled compaction, the lifetime hints can improve
throughput for overwrite workloads at 80% file system utilization by
~10%, but for lower file system utilization there won't be as much
benefit in application performance as there is less need for garbage
collection to start with.

Lifetime hints can be disabled using the nolifetime mount option.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: add a max_open_zones mount option

Allow limiting the number of open zones used below that exported by the
device. This is required to tune the number of write streams when zoned
RT devices are used on conventional devices, and can be useful on zoned
devices that support a very large number of open zones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support zone gaps

Zoned devices can have gaps between the usable capacity of a zone and the
end of the zone in the LBA/daddr address space. In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us. In this case the sparse FSB/RTB address space maps 1:1
to the device address space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: enable the zoned RT device feature

Enable the zoned RT device directory feature. With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks. This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: disable rt quotas for zoned file systems

They'll need a little more work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: disable reflink for zoned file systems

While the zoned on-disk format supports reflinks, the GC code currently
always unshares reflinks when moving blocks to new zones, thus making the
feature unusable. Disable reflinks until the GC code is refcount aware.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: enable fsmap reporting for internal RT devices

File systems with internal RT devices are a bit odd in that we need
to report AGs and RGs. To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support xrep_require_rtext_inuse on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support xchk_xref_is_used_rt_space on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced. But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: allow COW forks on zoned file systems in xchk_bmap

Zoned file systems can have COW forks even without reflinks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support growfs on zoned file systems

Replace the inner loop growing one RT bitmap block at a time with
one just modifying the superblock counters for growing an entire
zone (aka RTG). The big restriction is that, just like at mkfs time, only
an RT extent size of a single FSB is allowed, and the file system
capacity needs to be aligned to the zone size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: hide reserved RT blocks from statfs

File systems with a zoned RT device have a large number of reserved
blocks that are required for garbage collection, and which can't be
filled with user data. Exclude them from the available blocks reported
through stat(v)fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item

Make xfs_rtextent_free_finish_item call into the zoned allocator to free
blocks on zoned RT devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: implement direct writes to zoned RT devices

Direct writes to zoned RT devices are extremely simple. After taking the
block reservation before acquiring the iolock, the iomap direct I/O calls
into ->iomap_begin which will return a "fake" iomap for the entire
requested range. The actual block allocation is then done from the
submit_io handler using code shared with the buffered I/O path.

The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize
the embedded ioend, which allows reusing the existing ioend based buffered
I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: implement buffered writes to zoned RT devices

Implement buffered writes including page faults and block zeroing for
zoned RT devices.  Buffered writes to zoned RT devices are split into
three phases:

1) a reservation for the worst case data block usage is taken before
    acquiring the iolock.  When there are enough free blocks but not
    enough available ones, garbage collection is kicked off to free the
    space before continuing with the write.  If there isn't enough
    freeable space, the block reservation is reduced and a short write
    will happen as expected by normal Linux write semantics.
2) with the iolock held, the generic iomap buffered write code is
    called, which through the iomap_begin operation usually just inserts
    delalloc extents for the range in a single iteration.  Only for
    overwrites of existing data that are not block aligned, or zeroing
    operations the existing extent mapping is read to fill out the srcmap
    and to figure out if zeroing is required.
3) the ->map_blocks callback to the generic iomap writeback code
    calls into the zoned space allocator to actually allocate on-disk
    space for the range before kicking off the writeback.

Note that because all writes are out of place, truncate or hole punches
that are not aligned to block size boundaries need to allocate space.
For block zeroing from truncate, ->setattr is called with the iolock
(aka i_rwsem) already held, so a hacky deviation from the above
scheme is needed.  In this case the space reservation is taken with
the iolock held, but is required not to block and can dip into the
reserved block pool.  This can lead to -ENOSPC when truncating a
file, which is unfortunate.  But fixing the calling conventions in
the VFS is probably much easier with code requiring it already in
mainline.

Similarly, because all writes are out of place, the zoned allocator can't
support unwritten extents and thus the FALLOC_FL_ALLOCATE_RANGE mode of
fallocate.  Other fallocate modes that would reserve space but don't
need to in order to provide proper semantics do work, but do not reserve
space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: implement zoned garbage collection

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To find a zone to empty, a simple set of 10
buckets based on the amount of space used in the zone is used.  To empty
zones, the rmap is walked to find the owners and the data is read and then
written to the new place.
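
As an illustration only (this is not the kernel code), the bucket based
victim selection could be modelled in Python as below.  The function
names, the dict layout and the tie-break inside a bucket are assumptions
made for this sketch; only the use of 10 buckets keyed on used space
comes from the description above.

```python
NR_BUCKETS = 10  # the commit message mentions a simple set of 10 buckets


def bucket_of(used_blocks, zone_blocks):
    """Map a zone's used space to a bucket index in 0..NR_BUCKETS-1."""
    return min(used_blocks * NR_BUCKETS // zone_blocks, NR_BUCKETS - 1)


def pick_victim(zones):
    """Pick a zone from the lowest non-empty bucket.

    zones maps a zone number to (used_blocks, zone_blocks).  Fully empty
    zones need no GC and are skipped.  Returns None if nothing qualifies.
    """
    buckets = [[] for _ in range(NR_BUCKETS)]
    for zno, (used, total) in sorted(zones.items()):
        if used > 0:
            buckets[bucket_of(used, total)].append(zno)
    for bucket in buckets:
        if bucket:
            return bucket[0]
    return None
```

Note that bucketing is deliberately coarse: two zones in the same bucket
are treated as equally good victims, which avoids keeping a fully sorted
list of zones.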

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it has changed.
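
The two ideas in this paragraph, sorting the rmap records and rechecking
the mapping at I/O completion, can be sketched roughly as follows.  The
record layout and the dict standing in for the extent map are
hypothetical; the real code works on rmap btree records and the inode
fork.

```python
from collections import namedtuple

# Hypothetical record layout: owner inode, logical file offset, source block.
Rmap = namedtuple("Rmap", "ino offset old_block")


def gc_order(records):
    """Sort records so each file's blocks are rewritten contiguously."""
    return sorted(records, key=lambda r: (r.ino, r.offset))


def gc_complete(mapping, rec, new_block):
    """Install new_block only if the mapping still points at old_block.

    mapping is a dict keyed by (ino, offset).  Returns True when updated.
    """
    key = (rec.ino, rec.offset)
    if mapping.get(key) != rec.old_block:
        return False  # the file changed during GC: leave the mapping alone
    mapping[key] = new_block
    return True
```

A record whose mapping was overwritten or removed while the GC I/O was in
flight is simply skipped, which is what makes it safe to not hold the
iolock during the I/O in this model.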

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: add support for zoned space reservations

For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock can
deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available they wake GC and wait for it.  This is
done using a list of on-stack reservations to ensure fairness.
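
One possible reading of the fairness scheme is a strict first-come,
first-served queue in front of the available counter.  The
single-threaded toy model below assumes exactly that; the class and
method names are made up for the example and the real helpers sleep
instead of returning False.

```python
from collections import deque


class ZoneReservations:
    """Toy model of an 'available blocks' counter with FIFO waiters."""

    def __init__(self, available):
        self.available = available
        self.waiters = deque()   # (name, blocks) in arrival order
        self.granted = []        # names in the order they were satisfied

    def reserve(self, name, blocks):
        """Grant immediately if possible and nobody is queued, else queue."""
        if not self.waiters and blocks <= self.available:
            self.available -= blocks
            self.granted.append(name)
            return True
        self.waiters.append((name, blocks))
        return False  # a real caller would wake GC and sleep here

    def gc_freed(self, blocks):
        """GC made blocks available: serve waiters strictly in order."""
        self.available += blocks
        while self.waiters and self.waiters[0][1] <= self.available:
            name, want = self.waiters.popleft()
            self.available -= want
            self.granted.append(name)
```

In this model a small request that arrives behind a large one has to
wait its turn, so a stream of small writers cannot starve a big one.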

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: add the zoned space allocator

For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.

The actual allocation algorithm is very simple and just involves
picking a good zone, preferably the one used for the last write to the
inode.  As the number of zones that can be written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible from the iomap dio and buffered writeback bio submission
helpers just before submitting the bio.

Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup.  Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme used
for that is very similar to the reflink COW end I/O path.
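
A rough model of "allocate at the write pointer, record at completion"
might look like this.  OpenZone and its fields are hypothetical names
for the sketch, and zone selection and locking are left out entirely.

```python
class OpenZone:
    """Toy zone: blocks are handed out at the write pointer only."""

    def __init__(self, start, capacity):
        self.start = start
        self.capacity = capacity
        self.write_pointer = 0   # blocks issued so far
        self.recorded = []       # (file_offset, block, count), set at completion

    def allocate(self, count):
        """Return (first_block, granted); granted may be short or zero."""
        granted = min(count, self.capacity - self.write_pointer)
        first = self.start + self.write_pointer
        self.write_pointer += granted
        return first, granted

    def io_done(self, file_offset, first, count):
        """Record the mapping only once the write has completed."""
        self.recorded.append((file_offset, first, count))
```

A short grant tells the caller to continue in another open zone, and
nothing shows up in the recorded mappings until io_done runs.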

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: parse and validate hardware zone information

Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file

The zoned allocator never performs speculative preallocations, so don't
bother queueing up zoned inodes here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes

Zoned file systems require out of place writes and thus can't support
post-EOF speculative preallocations. Avoid the pointless ilock critical
section to find out that none can be freed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: disable FITRIM for zoned RT devices

The zoned allocator unconditionally issues zone resets or discards after
emptying an entire zone, so supporting FITRIM for a zoned RT device is
not useful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: disable sb_frextents for zoned file systems

Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information. Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: export zoned geometry via XFS_FSOP_GEOM

Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: allow internal RT devices for zoned mode

Allow creating an RT subvolume on the same device as the main data
device. This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: define the zoned on-disk format

Zoned file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are a few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, thus the
   /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
   sb_rbmblocks superblock field must be cleared to zero.

2) there is a new superblock field that specifies the start of an
   internal RT section.  This allows supporting SMR HDDs that have random
   writable space at the beginning which is used for the XFS data device
   (which really is the metadata device for this configuration), directly
   followed by a RT device on the same block device.  While something
   similar could be achieved using dm-linear just having a single device
   directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
   overprovisioning) that is never used for user capacity, but allows GC
   to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that we
   can persistently track the total amount of rtblocks currently used in
   a RT group.  There is no data structure other than the rmap that
   tracks used space in an RT group, and this counter is used to decide
   when a RT group has been entirely emptied, and to select one that
   is relatively empty if garbage collection needs to be performed.
   While this counter could be tracked entirely in memory and rebuilt
   from the rmap at mount time, that would lead to very long mount times
   with the large number of RT groups implied by the number of hardware
   zones especially on SMR hard drives with 256MB zone sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: add a xfs_rtrmap_highest_rgbno helper

Add a helper to find the last offset mapped in the rtrmap. This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation. To support that pass the
flags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write

For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block. The existing always_cow sysctl based code doesn't catch this
because nothing enforces that blocks aren't rewritten, but for zoned XFS
on sequential write required zones this is a hard error.
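
The difference between checking the total and checking each segment can
be shown with a small sketch.  It only looks at segment lengths and the
file position, which is a simplification assumed for the example; the
function name is hypothetical.

```python
def dio_write_is_aligned(pos, segment_lengths, block_size):
    """True if the file position and every segment length are block aligned."""
    if pos % block_size:
        return False
    return all(length % block_size == 0 for length in segment_lengths)
```

With a 4096 byte block size, segments of 2048 and 6144 bytes add up to
an aligned 8192 bytes but are still rejected, because either segment
could end up in its own I/O.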

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: skip always_cow inodes in xfs_reflink_trim_around_shared

xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree. Always_cow inodes don't have that tree, so don't
bother.

For the existing always_cow code this is a minor optimization. For
the upcoming zoned code that can do COW without the rtreflink code it
avoids triggering a NULL pointer dereference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c

Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs. Move it to xfs_iomap.c,
closer to its two callers.

Note that the rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: add a rtg_blocks helper

Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: factor out a xfs_rt_check_size helper

Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs. This also adds the mount
time overflow check to growfs and improves the error messages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: reduce metafile reservations

There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen. Reserve at most 1/4th of the data device blocks, which is
still a heuristic.
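
Expressed as arithmetic, the cap is a simple minimum.  The helper name
is hypothetical and the rounding (integer division) is an assumption of
this sketch rather than something stated above.

```python
def capped_metafile_reservation(worst_case_blocks, data_device_blocks):
    """Clamp the worst-case reservation to a quarter of the data device."""
    return min(worst_case_blocks, data_device_blocks // 4)
```

For a data device of 2000 blocks, a worst case of 1000 blocks is cut
down to 500, while a worst case of 100 blocks is left alone.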

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: make metabtree reservations global

Currently each metabtree inode has its own space reservation to ensure
it can be expanded to the maximum size, mirroring what is done for the
AG-based btrees.  But unlike the AG-based btrees the metabtree inodes
aren't restricted to allocate from a single AG but can use free space
from the entire file system.  And unlike AG-based btrees where the
required reservation shrinks with the available free space due to this,
the metabtree reservations for the rtrmap and rtreflink trees are not
bound in any way by the data device free space as they track RT extent
allocations.  This is not very efficient as it requires a large number
of blocks to be set aside that can't be used at all by other btrees.

Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space, which now also removes the
overloading of the i_nblocks field for metabtree inodes, which would
create problems if metabtree inodes ever had a big enough xattr fork
to require xattr blocks outside the inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: fixup the metabtree reservation in xrep_reap_metadir_fsblocks

All callers of xrep_reap_metadir_fsblocks need to fix up the metabtree
reservation, otherwise they'd leave the reservations in an incoherent
state. Move the call to xrep_reset_metafile_resv into
xrep_reap_metadir_fsblocks so it always is taken care of, and remove
now superfluous helper functions in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: trace in-memory freecounter reservations

Add two tracepoints when the freecounter dips into the reserved pool
and when it is entirely out of space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: support reserved blocks for the rt extent counter

The zoned space allocator will need reserved RT extents for garbage
collection and zeroing of partial blocks. Move the resblks related
fields into the freecounter array so that they can be used for all
counters.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: generalize the freespace and reserved blocks handling

xfs_{add,dec}_freecounter already handles the block and RT extent
percpu counters, but it currently hardcodes the passed in counter.

Add a freecounter abstraction that uses an enum to designate the counter
and add wrappers that hide the actual percpu_counters. This will allow
expanding the reserved block handling to the RT extent counter in the
next step, and also prepares for adding yet another such counter that
can share the code. Both these additions will be needed for the zoned
allocator.

Also switch the flooring of the frextents counter to 0 in statfs for the
rtinherit case to a manual min_t call to match the handling of the
fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: reflow xfs_dec_freecounter

Let the successful allocation be the main path through the function
with exception handling in branches to make the code easier to
follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: pass private data to iomap_truncate_page

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: pass private data to iomap_zero_range

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: pass private data to iomap_page_mkwrite

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: add a io_private field to struct iomap_ioend

Add a private data field to struct iomap_ioend so that the file system
can attach information to it. Zoned XFS will use this for a pointer to
the open zone.

Signed-off-by: Christoph Hellwig <hch@lst.de>

iomap: optionally use ioends for direct I/O

struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
and defer them to user context for completion in a very efficient way.

For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.

So instead of reinventing the wheel, reuse the existing infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: factor out a iomap_dio_done helper

Split out the struct iomap_dio level final completion from
iomap_dio_bio_end_io into a helper to clean up the code and make it
reusable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: move common ioend code to ioend.c

This code will be reused for direct I/O soon, so split it out of
buffered-io.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: split bios to zone append limits in the submission handlers

Provide helpers for file systems to split bios in the direct I/O and
writeback I/O submission handlers. The split ioends are chained to
the parent ioend so that only the parent ioend originally generated
by the iomap layer will be processed after all the chained off children
have completed. This is based on the block layer bio chaining that has
supported a similar mechanism for a long time.

This follows btrfs' lead and doesn't try to build bios to hardware limits
for zone append commands, but instead build them as normal unconstrained
bios and split them to the hardware limits in the I/O submission handler.
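
The split-and-chain scheme can be modelled in a few lines.  This is a
sketch under simplifying assumptions: the names are hypothetical, the
parent is modelled as a plain counter, and the real code splits bios and
chains ioends rather than tuples.

```python
def split_to_limit(offset, length, max_append):
    """Split one write into (offset, length) chunks of at most max_append."""
    if max_append <= 0:
        raise ValueError("max_append must be positive")
    chunks = []
    while length > 0:
        step = min(length, max_append)
        chunks.append((offset, step))
        offset += step
        length -= step
    return chunks


class ParentIoend:
    """Completes only after every split-off chunk has completed."""

    def __init__(self, nr_chunks):
        self.remaining = nr_chunks
        self.done = False

    def chunk_done(self):
        self.remaining -= 1
        if self.remaining == 0:
            self.done = True
```

Building the large I/O first and splitting late keeps the hardware limit
out of the code that assembles the write.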

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: add a IOMAP_F_ANON_WRITE flag

Add a IOMAP_F_ANON_WRITE flag that indicates that the write I/O does not
have a target block assigned to it yet at iomap time and the file system
will do that in the bio submission handler, splitting the I/O as needed.

This is used to implement Zone Append based I/O for zoned XFS, where
splitting writes to the hardware limits and assigning a zone to them
happens just before sending the I/O off to the block layer, but could
also be useful for other things like compressed I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: simplify io_flags and io_type in struct iomap_ioend

The ioend fields for distinct types of I/O are a bit complicated.
Consolidate them into a single io_flags field with its own flags
decoupled from the iomap flags. This also prepares for adding a new
flag that is unrelated to both of the iomap namespaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

iomap: allow the file system to submit the writeback bios

Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio. This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs. To make this easier also pass the writeback
context to the method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

Documentation: Document the new zoned loop block device driver

Introduce the zoned_loop.rst documentation file under
admin-guide/blockdev to document the zoned loop block device driver.
An overview of the driver is provided and its usage to create and delete
zoned devices is described.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

block: new zoned loop block device driver

The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of allowing
emulating large zoned devices without requiring the same amount of
memory as the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.

This initial implementation is simple and does not support zone resource
limits. That is, the limits of a zoned loop block device for the maximum
number of open zones and the maximum number of active zones are always 0.

This driver can be either compiled in-kernel or as a module, named
"zloop". Compilation of this driver depends on the block layer support
for zoned block device (CONFIG_BLK_DEV_ZONED must be set).

Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:

  $ echo "add [options]" > /dev/zloop-control

The options available for the "add" operation can be listed by reading
the zloop-control device file:

  $ cat /dev/zloop-control
  add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
  remove id=%d

The options available allow controlling the zoned device total
capacity, zone size, zone capacity of sequential zones, total number
of conventional zones, base directory for the zones backing file, number
of I/O queues and the maximum queue depth of I/O queues.

Deleting a device is done using the "remove" command:

  $ echo "remove id=0" > /dev/zloop-control
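
For scripted test setups, the command strings could be assembled with a
small helper such as the one below.  It is only a convenience sketch:
the helper names are hypothetical, the option names are copied from the
"add" line shown above, and the result still has to be written to
/dev/zloop-control by the caller.

```python
# Option names copied from the "add" line printed by /dev/zloop-control.
ADD_OPTIONS = ("id", "capacity_mb", "zone_size_mb", "zone_capacity_mb",
               "conv_zones", "base_dir", "nr_queues", "queue_depth")


def zloop_add_command(**options):
    """Build an "add" command string from keyword options."""
    unknown = set(options) - set(ADD_OPTIONS)
    if unknown:
        raise ValueError("unknown zloop options: %s" % sorted(unknown))
    parts = ["%s=%s" % (name, options[name])
             for name in ADD_OPTIONS if name in options]
    return ("add " + ",".join(parts)).rstrip()


def zloop_remove_command(dev_id):
    """Build the "remove" command string for one device id."""
    return "remove id=%d" % dev_id
```

Options are always emitted in the order listed by the control device,
regardless of the order in which they were passed in.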

This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.

Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>

TEMP: nvme-pci: disable async probe

This keeps getting my ZNS vs ZNS drivers reordered a bit and is annoying
for testing.

Not-really-signed-off-by: Christoph Hellwig <hch@lst.de>

xfs: flush inodegc before swapon

Fix the brand new xfstest that tries to swapon on a recently unshared
file and use the chance to document the other bit of magic in this
function.

The big comment is taken from a mailinglist post by Dave Chinner.

Fixes: 5e672cd69f0a53 ("xfs: introduce xfs_inodegc_push()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate

Match the method name and the naming convention for address_space
operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Do not allow norecovery mount with quotacheck

Mounting a filesystem that requires quota state changing will generate a
transaction.

We already check for a read-only device; we should do that for
norecovery too.

A quotacheck on a norecovery mount, and with the right log size, will cause
the mount process to hang on:

[<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs]
[<0>] xlog_grant_head_check+0x112/0x180 [xfs]
[<0>] xfs_log_reserve+0xe3/0x260 [xfs]
[<0>] xfs_trans_reserve+0x179/0x250 [xfs]
[<0>] xfs_trans_alloc+0x101/0x260 [xfs]
[<0>] xfs_sync_sb+0x3f/0x80 [xfs]
[<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs]
[<0>] xfs_mountfs+0x7ad/0xc20 [xfs]
[<0>] xfs_fs_fill_super+0x762/0xa50 [xfs]
[<0>] get_tree_bdev_flags+0x131/0x1d0
[<0>] vfs_get_tree+0x26/0xd0
[<0>] vfs_cmd_create+0x59/0xe0
[<0>] __do_sys_fsconfig+0x4e3/0x6b0
[<0>] do_syscall_64+0x82/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is caused by a transaction running with bogus initialized head/tail.

I initially hit this while running generic/050, with random log
sizes, but I managed to reproduce it reliably here with the steps
below:

mkfs.xfs -f -lsize=1025M -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, \
    -i sparse=1 /dev/vdb2 > /dev/null
mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt
xfs_io -x -c 'shutdown -f' /mnt
umount /mnt
mount -o ro,norecovery,usrquota,grpquota,prjquota /dev/vdb2 /mnt

The last mount hangs.

As we add yet another validation if quota state is changing, this also
adds a new helper named xfs_qm_validate_state_change(), factoring the
quota state changes out of xfs_qm_newmount() to reduce cluttering
within it.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: do not check NEEDSREPAIR if ro,norecovery mount.

If there is corruption on the filesystem and xfs_repair
fails to repair it, the last resort for getting the data
is to use a norecovery,ro mount. But if NEEDSREPAIR is
set, the filesystem cannot be mounted. The flag must be
cleared manually using xfs_db to get access to what is
left of the corrupted fs.

Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix data fork format filtering during inode repair

Coverity noticed that xrep_dinode_bad_metabt_fork never runs because
XFS_DINODE_FMT_META_BTREE is always filtered out in the mode selection
switch of xrep_dinode_check_dfork.

Metadata btrees are allowed only in the data forks of regular files, so
add this case explicitly. I guess this got fubard during a refactoring
prior to 6.13 and I didn't notice until now. :/

Coverity-id: 1617714
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n

I received a report from the release engineering side of the house that
xfs_scrub without the -n flag (aka fix it mode) would try to fix a
broken filesystem even on a kernel that doesn't have online repair built
into it:

# xfs_scrub -dTvn /mnt/test
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Phase 1: Find filesystem geometry.
/mnt/test: using 1 threads to scrub.
Phase 1: Memory used: 132k/0k (108k/25k), time:  0.00/ 0.00/ 0.00s
<snip>
Phase 4: Repair filesystem.
<snip>
Info: /mnt/test/some/victimdir directory entries: Attempting repair. (repair.c line 351)
Corruption: /mnt/test/some/victimdir directory entries: Repair unsuccessful; offline repair required. (repair.c line 204)

Source: https://blogs.oracle.com/linux/post/xfs-online-filesystem-repair

It is strange that xfs_scrub doesn't refuse to run, because the kernel
is supposed to return EOPNOTSUPP if we actually needed to run a repair,
and xfs_io's repair subcommand will perror that.  And yet:

# xfs_io -x -c 'repair probe' /mnt/test
#

The first problem is commit dcb660f9222fd9 (4.15) which should have had
xchk_probe set the CORRUPT OFLAG so that any of the repair machinery
will get called at all.

It turns out that some refactoring that happened in the 6.6-6.8 era
broke the operation of this corner case.  What we *really* want to
happen is that all the predicates that would steer xfs_scrub_metadata()
towards calling xrep_attempt() should function the same way that they do
when repair is compiled in; and then xrep_attempt gets to return the
fatal EOPNOTSUPP error code that causes the probe to fail.

Instead, commit 8336a64eb75cba (6.6) started the failwhale swimming by
hoisting OFLAG checking logic into a helper whose non-repair stub always
returns false, causing scrub to return "repair not needed" when in fact
the repair is not supported.  Prior to that commit, the oflag checking
that was open-coded in scrub.c worked correctly.

Similarly, in commit 4bdfd7d15747b1 (6.8) we hoisted the IFLAG_REPAIR
and ALREADY_FIXED logic into a helper whose non-repair stub always
returns false, so we never enter the if test body that would have called
xrep_attempt, let alone fail to decode the OFLAGs correctly.

The final insult (yes, we're doing The Naked Gun now) is commit
48a72f60861f79 (6.8) in which we hoisted the "are we going to try a
repair?" predicate into yet another function with a non-repair stub
that always returns false.

Fix xchk_probe to trigger xrep_probe if repair is enabled, or return
EOPNOTSUPP directly if it is not.  For all the other scrub types, we
need to fix the header predicates so that the ->repair functions (which
are all xrep_notsupported) get called to return EOPNOTSUPP.  Commit
48a72 is tagged here because the scrub code prior to LTS 6.12 is
incomplete and not worth patching.
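
The failure mode described above, a compiled-out stub that always answers
"no repair needed" and so hides the EOPNOTSUPP error, can be sketched in
isolation. This is a simplified illustration, not the kernel code: struct
scrub_req and every function name below are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical request state; not the kernel's struct xfs_scrub. */
struct scrub_req {
	bool repair_requested;	/* caller asked for a repair */
	bool corrupt;		/* scrub flagged the metadata */
};

/* With repair compiled out, every repair hook reports "not supported". */
int repair_notsupported(const struct scrub_req *req)
{
	assert(req);
	return -EOPNOTSUPP;
}

/* Buggy stub: always "no repair needed", so the hook is never reached. */
bool should_repair_stub_buggy(const struct scrub_req *req)
{
	(void)req;
	return false;
}

/* Fixed predicate: gives the same answer a repair-enabled build would. */
bool should_repair_fixed(const struct scrub_req *req)
{
	return req->repair_requested && req->corrupt;
}

/* Returns 0 ("nothing to do") even though a repair was needed. */
int scrub_buggy(const struct scrub_req *req)
{
	return should_repair_stub_buggy(req) ? repair_notsupported(req) : 0;
}

/* Reaches the hook, so the caller sees -EOPNOTSUPP. */
int scrub_fixed(const struct scrub_req *req)
{
	return should_repair_fixed(req) ? repair_notsupported(req) : 0;
}
```

With a request that asks for repair of corrupt metadata, scrub_buggy()
reports success while scrub_fixed() surfaces the unsupported-operation error.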

Reported-by: David Flynn <david.flynn@oracle.com>
Cc: <stable@vger.kernel.org> # v6.8
Fixes: 8336a64eb75c ("xfs: don't complain about unfixed metadata when repairs were injected")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Linux 6.14-rc2

Merge tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- Suppress false-positive -Wformat-{overflow,truncation}-non-kprintf
   warnings regardless of the W= option

- Avoid CONFIG_TRIM_UNUSED_KSYMS dropping symbols passed to symbol_get()

- Fix a build regression of the Debian linux-headers package

* tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: install-extmod-build: add missing quotation marks for CC variable
  kbuild: fix misspelling in scripts/Makefile.lib
  kbuild: keep symbols for symbol_get() even with CONFIG_TRIM_UNUSED_KSYMS
  scripts/Makefile.extrawarn: Do not show clang's non-kprintf warnings at W=1

Merge tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
"Fix a recently introduced kernel crash due to a NULL pointer
dereference during system-wide suspend (Rafael Wysocki)"

* tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: core: Restrict power.set_active propagation

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Correctly clean the BSS to the PoC before allowing EL2 to access it
     on nVHE/hVHE/protected configurations

   - Propagate ownership of debug registers in protected mode after the
     rework that landed in 6.14-rc1

   - Stop pretending that we can run the protected mode without a GICv3
     being present on the host

   - Fix a use-after-free situation that can occur if a vcpu fails to
     initialise the NV shadow S2 MMU contexts

   - Always evaluate the need to arm a background timer for fully
     emulated guest timers

   - Fix the emulation of EL1 timers in the absence of FEAT_ECV

   - Correctly handle the EL2 virtual timer, especially when HCR_EL2.E2H==0

  s390:

   - move some of the guest page table (gmap) logic into KVM itself,
     inching towards the final goal of completely removing gmap from the
     non-kvm memory management code.

     As an initial set of cleanups, move some code from mm/gmap into kvm
     and start using __kvm_faultin_pfn() to fault-in pages as needed;
     but especially stop abusing page->index and page->lru to aid in the
     pgdesc conversion.

  x86:

   - Add missing check in the fix to defer starting the huge page
     recovery vhost_task

   - SRSO_USER_KERNEL_NO does not need SYNTHESIZED_F"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (31 commits)
  KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking
  KVM: remove kvm_arch_post_init_vm
  KVM: selftests: Fix spelling mistake "initally" -> "initially"
  kvm: x86: SRSO_USER_KERNEL_NO is not synthesized
  KVM: arm64: timer: Don't adjust the EL2 virtual timer offset
  KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECV
  KVM: arm64: timer: Always evaluate the need for a soft timer
  KVM: arm64: Fix nested S2 MMU structures reallocation
  KVM: arm64: Fail protected mode init if no vgic hardware is present
  KVM: arm64: Flush/sync debug state in protected mode
  KVM: s390: selftests: Streamline uc_skey test to issue iske after sske
  KVM: s390: remove the last user of page->index
  KVM: s390: move PGSTE softbits
  KVM: s390: remove useless page->index usage
  KVM: s390: move gmap_shadow_pgt_lookup() into kvm
  KVM: s390: stop using lists to keep track of used dat tables
  KVM: s390: stop using page->index for non-shadow gmaps
  KVM: s390: move some gmap shadowing functions away from mm/gmap.c
  KVM: s390: get rid of gmap_translate()
  KVM: s390: get rid of gmap_fault()
  ...

PM: sleep: core: Restrict power.set_active propagation

Commit 3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of
parents and children") exposed an issue related to simple_pm_bus_pm_ops
that uses pm_runtime_force_suspend() and pm_runtime_force_resume() as
bus type PM callbacks for the noirq phases of system-wide suspend and
resume.

The problem is that pm_runtime_force_suspend() does not distinguish
runtime-suspended devices from devices for which runtime PM has never
been enabled, so if it sees a device with runtime PM status set to
RPM_ACTIVE, it will assume that runtime PM is enabled for that device
and so it will attempt to suspend it with the help of its runtime PM
callbacks which may not be ready for that. As it turns out, this
causes simple_pm_bus_runtime_suspend() to crash due to a NULL pointer
dereference.

Another problem related to the above commit and simple_pm_bus_pm_ops is
that setting runtime PM status of a device handled by the latter to
RPM_ACTIVE will actually prevent it from being resumed because
pm_runtime_force_resume() only resumes devices with runtime PM status
set to RPM_SUSPENDED.

To mitigate these issues, do not allow power.set_active to propagate
beyond the parent of the device with DPM_FLAG_SMART_SUSPEND set that
will need to be resumed, which should be a sufficient stop-gap for the
time being, but they will need to be properly addressed in the future
because in general during system-wide resume it is necessary to resume
all devices in a dependency chain in which at least one device is going
to be resumed.

Fixes: 3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of parents and children")
Closes: https://lore.kernel.org/linux-pm/1c2433d4-7e0f-4395-b841-b8eac7c25651@nvidia.com/
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/6137505.lOV4Wx5bFT@rjwysocki.net

Merge tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:
"Address a KUnit stack initialization regression that got tickled on
  m68k, and solve a Clang (v14 and earlier) bug found by 0day:

   - Fix stackinit KUnit regression on m68k

   - Use ARRAY_SIZE() for memtostr*()/strtomem*()"

* tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  string.h: Use ARRAY_SIZE() for memtostr*()/strtomem*()
  compiler.h: Introduce __must_be_byte_array()
  compiler.h: Move C string helpers into C-only kernel section
  stackinit: Fix comment for test_small_end
  stackinit: Keep selftest union size small on m68k

Merge tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fix from Kees Cook:
"This is really a work-around for x86_64 having grown a syscall to
  implement uretprobe, which has caused problems since v6.11.

  This may change in the future, but for now, this fixes the unintended
  seccomp filtering when uretprobe switched away from traps, and does so
  with something that should be easy to backport.

   - Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
     Birger)"

* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: validate uretprobe syscall passes through seccomp
  seccomp: passthrough uretprobe systemcall without filtering

Merge tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fix from Kees Cook:
"This is an alpha-specific fix, but since it touched ELF I was asked to
  carry it.

   - alpha/elf: Fix misc/setarch test of util-linux by removing 32bit
     support (Eric W. Biederman)"

* tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  alpha/elf: Fix misc/setarch test of util-linux by removing 32bit support

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"A number of fairly small fixes, mostly in drivers but two in the core
  to change a retry for depopulation (a trendy new hdd thing that
  reorganizes blocks away from failing elements) and one to fix a GFP_
  annotation to avoid a lock dependency (the third core patch is all in
  testing)"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: qla1280: Fix kernel oops when debug level > 2
  scsi: ufs: core: Fix error return with query response
  scsi: storvsc: Set correct data length for sending SCSI command without payload
  scsi: ufs: core: Fix use-after free in init error and remove paths
  scsi: core: Do not retry I/Os during depopulation
  scsi: core: Use GFP_NOIO to avoid circular locking dependency
  scsi: ufs: Fix toggling of clk_gating.state when clock gating is not allowed
  scsi: ufs: core: Ensure clk_gating.lock is used only after initialization
  scsi: ufs: core: Simplify temperature exception event handling
  scsi: target: core: Add line break to status show
  scsi: ufs: core: Fix the HIGH/LOW_TEMP Bit Definitions
  scsi: core: Add passthrough tests for success and no failure definitions

Merge tag 'i2c-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c reverts from Wolfram Sang:
"It turned out the new mechanism for handling created devices does not
  handle all muxing cases.

  Revert the changes to give a proper solution more time"

* tag 'i2c-for-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  Revert "i2c: Replace list-based mechanism for handling auto-detected clients"
  Revert "i2c: Replace list-based mechanism for handling userspace-created clients"

Merge tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux

Pull rust fixes from Miguel Ojeda:

- Do not export KASAN ODR symbols to avoid gendwarfksyms warnings

- Fix future Rust 1.86.0 (to be released 2025-04-03) x86_64 builds

- Clean future Rust 1.86.0 (to be released 2025-04-03) warning

- Fix future GCC 15 (to be released in a few months) builds

- Fix `rusttest` target in macOS

* tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux:
  x86: rust: set rustc-abi=x86-softfloat on rustc>=1.86.0
  rust: kbuild: do not export generated KASAN ODR symbols
  rust: kbuild: add -fzero-init-padding-bits to bindgen_skip_cflags
  rust: init: use explicit ABI to clean warning in future compilers
  rust: kbuild: use host dylib naming in rusttestlib-kernel

Merge tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fix from Steven Rostedt:
"Function graph fix of notrace functions.

  When the function graph tracer was restructured to use the global
  section of the meta data in the shadow stack, the bit logic was
  changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
  the mask that tells if the function graph tracer is currently in the
  "notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit set.

  But when the code was restructured, the TRACE_GRAPH_NOTRACE_BIT was
  used when it should have been the TRACE_GRAPH_NOTRACE mask. This made
  notrace not work properly"

* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT

Merge tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Ingo Molnar:
"Fix a build regression on GCC 15 builds, caused by GCC changing the
  default C version that is overridden in the main Makefile but not in
  the x86 boot code Makefile"

* tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Use '-std=gnu11' to fix build with GCC 15

Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
"Fix a PREEMPT_RT bug in the clocksource verification code that caused
  false positive warnings.

  Also fix a timer migration setup bug when new CPUs are added"

* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix off-by-one root mis-connection
  clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context

Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
  SCHED_WARN_ON() on certain workloads. The bug is believed to be
  (accidentally) self-correcting, hence no behavioral side effects are
  expected.

  Also print se.slice in debug output, since this value can now be set
  via the syscall ABI and can be useful to track"

* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Provide slice length for fair tasks
  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

Merge tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
"Another followup fix for the procps genirq output formatting
regression caused by an optimization"

* tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Remove leading space from irq_chip::irq_print_chip() callbacks

Merge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
"Fix a dangling pointer bug in the futex code used by the uring code.

  It isn't causing problems at the moment due to uring ABI limitations
  leaving it essentially unused in current usages, but is a good idea to
  fix nevertheless"

* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Pass in task to futex_queue()

fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT

The code was restructured so that the function graph notrace logic, which
skips tracing a function and all of its children, works by setting a
NOTRACE flag when the function that is not to be traced is hit.

There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.

For example:

# cd /sys/kernel/tracing
# echo set_track_prepare stack_trace_save  > set_graph_notrace
# echo function_graph > current_tracer
# cat trace
[..]
0)               |                          __slab_free() {
0)               |                            free_to_partial_list() {
0)               |                                  arch_stack_walk() {
0)               |                                    __unwind_start() {
0)   0.501 us    |                                      get_stack_info();

A trace without the filter looks like:

# echo > set_graph_notrace
# cat trace
0)               |                            free_to_partial_list() {
0)               |                              set_track_prepare() {
0)               |                                stack_trace_save() {
0)               |                                  arch_stack_walk() {
0)               |                                    __unwind_start() {

With the filter working, the trace should look like:

# cat trace
0)               |                            free_to_partial_list() {
0)               |                              _raw_spin_lock_irqsave() {
0)   0.350 us    |                                preempt_count_add();
0)   0.351 us    |                                do_raw_spin_lock();
0)   2.440 us    |                              }
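
The bit-number versus mask mix-up described above can be sketched in
isolation. The bit value here is a placeholder chosen for illustration and
the helper names are hypothetical; only the relationship between the two
macros is taken from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit number; the kernel's actual value may differ. */
enum { TRACE_GRAPH_NOTRACE_BIT = 3 };

/* The mask is the value with that single bit set. */
#define TRACE_GRAPH_NOTRACE (1UL << TRACE_GRAPH_NOTRACE_BIT)

/* The bit number and the mask are different values. */
static_assert(TRACE_GRAPH_NOTRACE != TRACE_GRAPH_NOTRACE_BIT,
	      "bit number and mask must not be confused");

/* Buggy: tests the flags word against the bit number (3 == 0b011). */
bool in_notrace_buggy(unsigned long flags)
{
	return (flags & TRACE_GRAPH_NOTRACE_BIT) != 0;
}

/* Fixed: tests the flags word against the mask (1 << 3 == 0b1000). */
bool in_notrace_fixed(unsigned long flags)
{
	return (flags & TRACE_GRAPH_NOTRACE) != 0;
}
```

With these placeholder values, the buggy check misses a flags word that has
only the NOTRACE bit set, and it fires on unrelated low bits instead.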

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home
Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

kbuild: Move -Wenum-enum-conversion to W=2

-Wenum-enum-conversion was strengthened in clang-19 to warn for C, which
caused the kernel to move it to W=1 in commit 75b5ab134bb5 ("kbuild:
Move -Wenum-{compare-conditional,enum-conversion} into W=1") because
there were numerous instances that would break builds with -Werror.
Unfortunately, this is not a full solution, as more and more developers,
subsystems, and distributors are building with W=1 as well, so they
continue to see the numerous instances of this warning.

Since the move to W=1, there have not been many new instances that have
appeared through various build reports and the ones that have appeared
seem to be following similar existing patterns, suggesting that most
instances of this warning will not be real issues. The only alternatives
for silencing this warning are adding casts (which is generally seen as
an ugly practice) or refactoring the enums to macro defines or a unified
enum (which may be undesirable because of type safety in other parts of
the code).

Move the warning to W=2, where warnings that occur frequently but may be
relevant should reside.
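
The two silencing alternatives mentioned above can be sketched as follows.
The enums and function names are invented for illustration and do not come
from the kernel; the expression that would draw the warning is shown only in
a comment.

```c
#include <assert.h>

/* Hypothetical enums, invented for illustration only. */
enum chan_width { CHAN_WIDTH_20 = 0x1, CHAN_WIDTH_40 = 0x2 };
enum chan_flag { CHAN_FLAG_HT = 0x10, CHAN_FLAG_VHT = 0x20 };

/* The two value ranges are meant to be combined, so they must not overlap. */
static_assert(((unsigned int)CHAN_WIDTH_40 & (unsigned int)CHAN_FLAG_HT) == 0,
	      "width and flag bits must not overlap");

/*
 * The kind of expression the warning targets mixes both enum types
 * directly:
 *
 *	return width | flag;
 *
 * Alternative 1: explicit casts, which silence it at the cost of noise.
 */
unsigned int pack_with_casts(enum chan_width width, enum chan_flag flag)
{
	return (unsigned int)width | (unsigned int)flag;
}

/* Alternative 2: one unified enum, which gives up the separate types. */
enum chan_bits {
	CHAN_BITS_WIDTH_20 = 0x1,
	CHAN_BITS_WIDTH_40 = 0x2,
	CHAN_BITS_FLAG_HT = 0x10,
	CHAN_BITS_FLAG_VHT = 0x20,
};

unsigned int pack_unified(enum chan_bits width, enum chan_bits flag)
{
	return width | flag;
}
```

Both variants produce the same packed value; the second one no longer lets
the compiler tell a width apart from a flag, which is the type-safety cost
the commit message refers to.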

Cc: stable@vger.kernel.org
Fixes: 75b5ab134bb5 ("kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1")
Link: https://lore.kernel.org/ZwRA9SOcOjjLJcpi@google.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Three DFS fixes: DFS mount fix, fix for noisy log msg and one to
   remove some unused code

- SMB3 Lease fix

* tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: change lease epoch type from unsigned int to __u16
  smb: client: get rid of kstrdup() in get_ses_refpath()
  smb: client: fix noisy when tree connecting to DFS interlink targets
  smb: client: don't trust DFSREF_STORAGE_SERVER bit

Merge tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Just regular drm fixes, amdgpu, xe and i915 mostly, but a few
  scattered fixes. I think one of the i915 fixes fixes some build combos
  that Guenter was seeing.

  amdgpu:
   - Add new tiling flag for DCC write compress disable
   - Add BO metadata flag for DCC
   - Fix potential out of bounds access in display
   - Seamless boot fix
   - CONFIG_FRAME_WARN fix
   - PSR1 fix

  xe:
   - OA uAPI related fixes
   - Fix SRIOV migration initialization
   - Restore devcoredump to a sane state

  i915:
   - Fix the build error with clamp after WARN_ON on gcc 13.x+
   - HDCP related fixes
   - PMU fix zero delta busyness issue
   - Fix page cleanup on DMA remap failure
   - Drop 64bpp YUV formats from ICL+ SDR planes
   - GuC log related fix
   - DisplayPort related fixes

  ivpu:
   - Fix error handling

  komeda:
   - add return check

  zynqmp:
   - fix locking in DP code

  ast:
   - fix AST DP timeout

  cec:
   - fix broken CEC adapter check"

* tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel: (29 commits)
  drm/i915/dp: Fix potential infinite loop in 128b/132b SST
  Revert "drm/amd/display: Use HW lock mgr for PSR1"
  drm/amd/display: Respect user's CONFIG_FRAME_WARN more for dml files
  accel/amdxdna: Add MODULE_FIRMWARE() declarations
  drm/i915/dp: Iterate DSC BPP from high to low on all platforms
  drm/xe: Fix and re-enable xe_print_blob_ascii85()
  drm/xe/devcoredump: Move exec queue snapshot to Contexts section
  drm/xe/oa: Set stream->pollin in xe_oa_buffer_check_unlocked
  drm/xe/pf: Fix migration initialization
  drm/xe/oa: Preserve oa_ctrl unused bits
  drm/amd/display: Fix seamless boot sequence
  drm/amd/display: Fix out-of-bound accesses
  drm/amdgpu: add a BO metadata flag to disable write compression for Vulkan
  drm/i915/backlight: Return immediately when scale() finds invalid parameters
  drm/i915/dp: Return min bpc supported by source instead of 0
  drm/i915/dp: fix the Adaptive sync Operation mode for SDP
  drm/i915/guc: Debug print LRC state entries only if the context is pinned
  drm/i915: Drop 64bpp YUV formats from ICL+ SDR planes
  drm/i915: Fix page cleanup on DMA remap failure
  drm/i915/pmu: Fix zero delta busyness issue
  ...

Merge tag 'stable/for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft

Pull ibft fixes from Konrad Rzeszutek Wilk:
"Two tiny fixes to IBFT code: one for Kconfig and another for IPv6"

* tag 'stable/for-linus-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
  iscsi_ibft: Fix UBSAN shift-out-of-bounds warning in ibft_attr_show_nic()
  firmware: iscsi_ibft: fix ISCSI_IBFT Kconfig entry

Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- MD pull request via Song:
      - fix an error handling path for md-linear

- NVMe pull request via Keith:
      - Connection fixes for fibre channel transport (Daniel)
      - Endian fixes (Keith, Christoph)
      - Cleanup fix for host memory buffer (Francis)
      - Platform specific power quirks (Georg)
      - Target memory leak (Sagi)
      - Use appropriate controller state accessor (Daniel)

- Fixup for a regression introduced last week, where sunvdc wasn't
   updated for an API change, causing compilation failures on sparc64.

* tag 'block-6.14-20250207' of git://git.kernel.dk/linux:
  drivers/block/sunvdc.c: update the correct AIP call
  md: Fix linear_set_limits()
  nvme-fc: use ctrl state getter
  nvme: make nvme_tls_attrs_group static
  nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
  nvmet: the result field in nvmet_alloc_ctrl_args is little endian
  nvmet: fix a memory leak in controller identify
  nvme-fc: do not ignore connectivity loss during connecting
  nvme: handle connectivity loss in nvme_set_queue_count
  nvme-fc: go straight to connecting state when initializing
  nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
  nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
  nvme-pci: remove redundant dma frees in hmb
  nvmet: fix rw control endian access

kbuild: install-extmod-build: add missing quotation marks for CC variable

While attempting to build Debian packages with CC="ccache gcc", I
saw the following error as builddeb builds linux-headers-$KERNELVERSION:

make HOSTCC=ccache gcc VPATH= srcroot=. -f ./scripts/Makefile.build obj=debian/linux-headers-6.14.0-rc1/usr/src/linux-headers-6.14.0-rc1/scripts
make[6]: *** No rule to make target 'gcc'. Stop.

Upon investigation, it seems that one instance of the $(CC) variable reference
in ./scripts/package/install-extmod-build was missing quotation marks,
causing the above error.

Add the missing quotation marks around $(CC) to fix the build.

Fixes: 5f73e7d0386d ("kbuild: refactor cross-compiling linux-headers package")
Co-developed-by: Mingcong Bai <jeffbai@aosc.io>
Signed-off-by: Mingcong Bai <jeffbai@aosc.io>
Tested-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

Merge tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix a handful of issues in the amd-pstate driver, the airoha
  cpufreq driver build, a (recently added) possible NULL pointer
  dereference in the cpufreq code and a possible memory leak in the
  power capping subsystem:

   - Fix cpufreq_policy reference counting and prevent max_perf from
     going above the current limit in amd-pstate, and drop a redundant
     goto label from it (Dhananjay Ugwekar)

   - Prevent the per-policy boost_enabled flag in amd-pstate from
     getting out of sync with the actual state after boot failures
     (Lifeng Zheng)

   - Fix a recently added possible NULL pointer dereference in the
     cpufreq core (Aboorva Devarajan)

   - Fix a build issue related to CONFIG_OF and COMPILE_TEST
     dependencies in the airoha cpufreq driver (Arnd Bergmann)

   - Fix a possible memory leak in the power capping subsystem (Joe
     Hattori)"

* tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq/amd-pstate: Fix cpufreq_policy ref counting
  cpufreq: prevent NULL dereference in cpufreq_online()
  cpufreq: airoha: modify CONFIG_OF dependency
  cpufreq/amd-pstate: Fix max_perf updation with schedutil
  cpufreq/amd-pstate: Remove the goto label in amd_pstate_update_limits
  cpufreq/amd-pstate: Fix per-policy boost flag incorrect when fail
  powercap: call put_device() on an error path in powercap_register_control_type()

Merge tag 'acpi-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These fix three assorted issues, including one recent regression:

   - Add an ACPI IRQ override quirk for Eluktronics MECH-17 to make the
     internal keyboard work (Gannon Kolding)

   - Make acpi_data_prop_read() reflect the OF counterpart behavior in
     error cases (Andy Shevchenko)

   - Remove recently added strict ACPI PRM handler address checks that
     prevented PRM from working on some platforms in the field (Aubrey
     Li)"

* tag 'acpi-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PRM: Remove unnecessary strict handler address checks
  ACPI: resource: IRQ override for Eluktronics MECH-17
  ACPI: property: Fix return value for nval == 0 in acpi_data_prop_read()

Merge tag 'gpio-fixes-for-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:

- fix interrupt support in gpio-pca953x

- fix configfs attribute locking in gpio-sim

- limit the visibility of the GPIO_GRGPIO Kconfig symbol to OF systems
   only

- update MAINTAINERS

* tag 'gpio-fixes-for-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  MAINTAINERS: Use my kernel.org address for ACPI GPIO work
  gpio: GPIO_GRGPIO should depend on OF
  gpio: sim: lock hog configfs items if present
  gpio: pca953x: Improve interrupt support

Merge tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

- Fix fsnotify FMODE_NONOTIFY* handling.

   This also disables fsnotify on all pseudo files by default apart from
   very select exceptions. This carries a regression risk so we need to
   watch out and adapt accordingly. However, it is overall a significant
   improvement over the current status quo where every rando file can
   get fsnotify enabled.

- Clean up and simplify lockref_init() after recent lockref changes.

- Fix vboxsf build with gcc-15.

- Add an assert into inode_set_cached_link() to catch corrupt links.

- Allow users to also use an empty string check to detect whether a
   given mount option string was empty or not.

- Fix how security options were appended to statmount()'s ->mnt_opt
   field.

- Fix statmount() selftests to always check the returned mask.

- Fix uninitialized value in vfs_statx_path().

- Fix pidfs_ioctl() sanity checks to guard against ioctl() overloading
   and preserve extensibility.

* tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: sanity check the length passed to inode_set_cached_link()
  pidfs: improve ioctl handling
  fsnotify: disable pre-content and permission events by default
  selftests: always check mask returned by statmount(2)
  fsnotify: disable notification by default for all pseudo files
  fs: fix adding security options to statmount.mnt_opt
  fsnotify: use accessor to set FMODE_NONOTIFY_*
  lockref: remove count argument of lockref_init
  gfs2: switch to lockref_init(..., 1)
  gfs2: use lockref_init for gl_lockref
  statmount: let unset strings be empty
  vboxsf: fix building with GCC 15
  fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()

Merge tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
"Nothing major, things continue to be fairly quiet over here.

   - add a SubmittingPatches to clarify that patches submitted for
     bcachefs do, in fact, need to be tested

   - discard path now correctly issues journal flushes when needed; this
     fixes performance issues when the filesystem is nearly full and
     we're bottlenecked on copygc

   - fix a bug that could cause the pending rebalance work accounting to
     be off when devices are being onlined/offlined; users should report
     if they are still seeing this

   - and a few more trivial ones"

* tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs:
  bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
  bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit()
  bcachefs: Fix discard path journal flushing
  bcachefs: fix deadlock in journal_entry_open()
  bcachefs: fix incorrect pointer check in __bch2_subvolume_delete()
  bcachefs docs: SubmittingPatches.rst

MAINTAINERS: Remove myself

I no longer have any faith left in the kernel development process or
community management approach.

Apple/ARM platform development will continue downstream. If I feel like
sending some patches upstream in the future myself for whatever subtree,
I may, or I may not. Anyone who feels like fighting the upstreaming
fight themselves is welcome to do so.

Signed-off-by: Hector Martin <marcan@marcan.st>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MAINTAINERS: Move Pavel to kernel.org address

I need to filter my emails better; switch to my pavel@kernel.org address
to help with that.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'md-6.14-20250206' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into block-6.14

Pull MD fix from Song:

"This patch, by Bart Van Assche, fixes an error handling path for
md-linear."

* tag 'md-6.14-20250206' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md: Fix linear_set_limits()

Merge branches 'acpi-property' and 'acpi-resource'

Merge a new ACPI IRQ override quirk for Eluktronics MECH-17 (Gannon
Kolding) and an acpi_data_prop_read() fix making it reflect the OF
counterpart behavior in error cases (Andy Shevchenko).

* acpi-property:
ACPI: property: Fix return value for nval == 0 in acpi_data_prop_read()

* acpi-resource:
ACPI: resource: IRQ override for Eluktronics MECH-17

Merge branch 'pm-powercap'

Fix a possible memory leak in the power capping subsystem (Joe Hattori).

* pm-powercap:
powercap: call put_device() on an error path in powercap_register_control_type()

vfs: sanity check the length passed to inode_set_cached_link()

This costs a strlen() call when instantiating a symlink.

Preferably it would be hidden behind VFS_WARN_ON (or compatible), but
there is no such facility at the moment. With the facility in place the
call can be patched out in production kernels.

In the meantime, since the cost is being paid unconditionally, use the
result to fix up the bad caller.

This is not expected to persist in the long run (tm).

Sample splat:
bad length passed for symlink [/tmp/syz-imagegen43743633/file0/file0] (got 131109, expected 37)
[rest of WARN blurb goes here]
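
For illustration only, the check described in this message can be sketched
in user-space C. This is not the kernel code: checked_link_len() is a
hypothetical helper, and the mismatch is reported with fprintf() instead of
the kernel's WARN machinery. The sketch pays for one strlen() call, reports
a bad length in the style of the sample splat, and returns the corrected
length so the caller's mistake does not propagate.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical user-space stand-in for the sanity check; it is NOT the
 * kernel's inode_set_cached_link(). Returns the length that should be
 * cached for the NUL-terminated string `link`.
 */
size_t checked_link_len(const char *link, size_t claimed_len)
{
	size_t actual_len;

	assert(link != NULL);
	actual_len = strlen(link);	/* the cost that is paid unconditionally */

	if (actual_len != claimed_len) {
		/* Same wording as the sample splat: "got" is the caller's value. */
		fprintf(stderr,
			"bad length passed for symlink [%s] (got %zu, expected %zu)\n",
			link, claimed_len, actual_len);
		return actual_len;	/* fix up the bad caller */
	}

	return claimed_len;
}
```

Called with the path and length from the sample splat, the helper would
report the mismatch on stderr and return 37, the real string length.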

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250204213207.337980-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>