]> www.infradead.org Git - users/hch/xfs.git/log
users/hch/xfs.git
2 months agoxfs: always freeze the fs when checking fs counters xfs-freecounter-dbg
Christoph Hellwig [Tue, 4 Feb 2025 14:48:04 +0000 (15:48 +0100)]
xfs: always freeze the fs when checking fs counters

Otherwise we can't guarantee that the delalloc and free space counters
are in sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: assert that no modifications happen to block counters on frozen sbs
Christoph Hellwig [Tue, 4 Feb 2025 06:38:29 +0000 (07:38 +0100)]
xfs: assert that no modifications happen to block counters on frozen sbs

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agofs: update s_writers.frozen before calling ->unfreeze_fs
Christoph Hellwig [Tue, 4 Feb 2025 14:36:46 +0000 (15:36 +0100)]
fs: update s_writers.frozen before calling ->unfreeze_fs

That way fіle systems can add asserts that ensure certain activity
don't happen on a completely frozen (SB_FREEZE_COMPLETE) superblock
while still allowing for said activity during unfreeze.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: export max_open_zones in sysfs xfs-zoned-rebase
Christoph Hellwig [Tue, 21 Jan 2025 05:45:22 +0000 (06:45 +0100)]
xfs: export max_open_zones in sysfs

Add a zoned group with an attribute for the maximum number of open zones.
This allows querying the open zones for data placement tests, or also
for placement aware applications that are in control of the entire
file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: contain more sysfs code in xfs_sysfs.c
Christoph Hellwig [Tue, 21 Jan 2025 05:30:27 +0000 (06:30 +0100)]
xfs: contain more sysfs code in xfs_sysfs.c

Extend the error sysfs initialization helper to include the neighbouring
attributes as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: export zone stats in /proc/*/mountstats
Hans Holmberg [Sat, 30 Nov 2024 04:53:49 +0000 (05:53 +0100)]
xfs: export zone stats in /proc/*/mountstats

Add the per-zone life time hint and the used block distribution
for fully written zones, grouping reclaimable zones in fixed-percentage
buckets spanning 0..9%, 10..19% and full zones as 100% used as well as a
few statistics about the zone allocator and open and reclaimable zones
in /proc/*/mountstats.

This gives good insight into data fragmentation and data placement
success rate.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: wire up the show_stats super operation
Christoph Hellwig [Sun, 17 Nov 2024 05:24:19 +0000 (06:24 +0100)]
xfs: wire up the show_stats super operation

The show_stats option allows a file system to dump plain text statistic
on a per-mount basis into /proc/*/mountstats.  Wire up a no-op version
which will grow useful information for zoned file systems later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: support write life time based data placement
Hans Holmberg [Fri, 31 Jan 2025 08:39:36 +0000 (09:39 +0100)]
xfs: support write life time based data placement

Add a file write life time data placement allocation scheme that aims to
minimize fragmentation and thereby to do two things:

 a) separate file data to different zones when possible.
 b) colocate file data of similar life times when feasible.

To get best results, average file sizes should align with the zone
capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl.

For RocksDB using leveled compaction, the lifetime hints can improve
throughput for overwrite workloads at 80% file system utilization by
~10%.

Lifetime hints can be disabled using the nolifetime mount option.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: add a max_open_zones mount option
Christoph Hellwig [Sun, 17 Nov 2024 07:05:16 +0000 (08:05 +0100)]
xfs: add a max_open_zones mount option

Allow limiting the number of open zones used below that exported by the
device.  This is required to tune the number of write streams when zoned
RT devices are used on conventional devices, and can be useful on zoned
devices that support a very large number of open zones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: support zone gaps
Christoph Hellwig [Sat, 21 Dec 2024 09:58:24 +0000 (09:58 +0000)]
xfs: support zone gaps

Zoned devices can have gaps beyond the usable capacity of a zone and the
end in the LBA/daddr address space.  In other words, the hardware
equivalent to the RT groups already takes care of the power of 2
alignment for us.  In this case the sparse FSB/RTB address space maps 1:1
to the device address space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: enable the zoned RT device feature
Christoph Hellwig [Sun, 17 Nov 2024 07:36:47 +0000 (08:36 +0100)]
xfs: enable the zoned RT device feature

Enable the zoned RT device directory feature.  With this feature, RT
groups are written sequentially and always emptied before rewriting
the blocks.  This perfectly maps to zoned devices, but can also be
used on conventional block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: disable rt quotas for zoned file systems
Christoph Hellwig [Tue, 22 Oct 2024 12:16:44 +0000 (14:16 +0200)]
xfs: disable rt quotas for zoned file systems

They'll need a little more work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: disable reflink for zoned file systems
Christoph Hellwig [Sat, 21 Dec 2024 08:41:49 +0000 (08:41 +0000)]
xfs: disable reflink for zoned file systems

While the zoned on-disk format supports reflinks, the GC code currently
always unshares reflinks when moving blocks to new zones, thus making the
feature unusuable.  Disable reflinks until the GC code is refcount aware.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: enable fsmap reporting for internal RT devices
Christoph Hellwig [Wed, 13 Nov 2024 05:51:55 +0000 (06:51 +0100)]
xfs: enable fsmap reporting for internal RT devices

File system with internal RT devices are a bit odd in that we need
to report AGs and RGs.  To make this happen use separate synthetic
fmr_device values for the different sections instead of the dev_t
mapping used by other XFS configurations.

The data device is reported as file system metadata before the
start of the RGs for the synthetic RT fmr_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: support xrep_require_rtext_inuse on zoned file systems
Christoph Hellwig [Mon, 22 Jul 2024 13:31:28 +0000 (06:31 -0700)]
xfs: support xrep_require_rtext_inuse on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: support xchk_xref_is_used_rt_space on zoned file systems
Christoph Hellwig [Sun, 17 Nov 2024 06:35:44 +0000 (07:35 +0100)]
xfs: support xchk_xref_is_used_rt_space on zoned file systems

Space usage is tracked by the rmap, which already is separately
cross-referenced.  But on top of that we have the write pointer and can
do a basic sanity check here that the block is not beyond the write
pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: allow COW forks on zoned file systems in xchk_bmap
Christoph Hellwig [Fri, 10 May 2024 06:51:02 +0000 (08:51 +0200)]
xfs: allow COW forks on zoned file systems in xchk_bmap

Zoned file systems can have COW forks even without reflinks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: support growfs on zoned file systems
Christoph Hellwig [Sun, 17 Nov 2024 09:27:24 +0000 (10:27 +0100)]
xfs: support growfs on zoned file systems

Replace the inner loop growing one RT bitmap block at a time with
one just modifying the superblock counters for growing an entire
zone (aka RTG).  The big restriction is just like at mkfs time only
a RT extent size of a single FSB is allowed, and the file system
capacity needs to be aligned to the zone size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: hide reserved RT blocks from statfs
Christoph Hellwig [Mon, 16 Dec 2024 07:14:42 +0000 (08:14 +0100)]
xfs: hide reserved RT blocks from statfs

File systems with a zoned RT device have a large number of reserved
blocks that are required for garbage collection, and which can't be
filled with user data.  Exclude them from the available blocks reported
through stat(v)fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: wire up zoned block freeing in xfs_rtextent_free_finish_item
Christoph Hellwig [Sun, 17 Nov 2024 07:19:22 +0000 (08:19 +0100)]
xfs: wire up zoned block freeing in xfs_rtextent_free_finish_item

Make xfs_rtextent_free_finish_item call into the zoned allocator to free
blocks on zoned RT devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: implement direct writes to zoned RT devices
Christoph Hellwig [Mon, 27 Jan 2025 14:35:00 +0000 (15:35 +0100)]
xfs: implement direct writes to zoned RT devices

Direct writes to zoned RT devices are extremely simple.  After taking the
block reservation before acquiring the iolock, the iomap direct I/O calls
into ->iomap_begin which will return a "fake" iomap for the entire
requested range.  The actual block allocation is then done from the
submit_io handler using code shared with the buffered I/O path.

The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize
the embedded ioend, which allows reusing the existing ioend based buffered
I/O completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: implement buffered writes to zoned RT devices
Christoph Hellwig [Sat, 30 Nov 2024 03:56:08 +0000 (04:56 +0100)]
xfs: implement buffered writes to zoned RT devices

Implement buffered writes including page faults and block zeroing for
zoned RT devices.  Buffered writes to zoned RT devices are split into
three phases:

 1) a reservation for the worst case data block usage is taken before
    acquiring the iolock.  When there are enough free blocks but not
    enough available one, garbage collection is kicked of to free the
    space before continuing with the write.  If there isn't enough
    freeable space, the block reservation is reduced and a short write
    will happen as expected by normal Linux write semantics.
 2) with the iolock held, the generic iomap buffered write code is
    called, which through the iomap_begin operation usually just inserts
    delalloc extents for the range in a single iteration.  Only for
    overwrites of existing data that are not block aligned, or zeroing
    operations the existing extent mapping is read to fill out the srcmap
    and to figure out if zeroing is required.
 3) the ->map_blocks callback to the generic iomap writeback code
    calls into the zoned space allocator to actually allocate on-disk
    space for the range before kicking of the writeback.

Note that because all writes are out of place, truncate or hole punches
that are not aligned to block size boundaries need to allocate space.
For block zeroing from truncate, ->setattr is called with the iolock
(aka i_rwsem) already held, so a hacky deviation from the above
scheme is needed.  In this case the space reservations is called with
the iolock held, but is required not to block and can dip into the
reserved block pool.  This can lead to -ENOSPC when truncating a
file, which is unfortunate.  But fixing the calling conventions in
the VFS is probably much easier with code requiring it already in
mainline.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: implement zoned garbage collection
Christoph Hellwig [Fri, 31 Jan 2025 06:59:59 +0000 (07:59 +0100)]
xfs: implement zoned garbage collection

RT groups on a zoned file system need to be completely empty before their
space can be reused.  This means that partially empty groups need to be
emptied entirely to free up space if no entirely free groups are
available.

Add a garbage collection thread that moves all data out of the least used
zone when not enough free zones are available, and which resets all zones
that have been emptied.  To find empty zone a simple set of 10 buckets
based on the amount of space used in the zone is used.  To empty zones,
the rmap is walked to find the owners and the data is read and then
written to the new place.

To automatically defragment files the rmap records are sorted by inode
and logical offset.  This means defragmentation of parallel writes into
a single zone happens automatically when performing garbage collection.
Because holding the iolock over the entire GC cycle would inject very
noticeable latency for other accesses to the inodes, the iolock is not
taken while performing I/O.  Instead the I/O completion handler checks
that the mapping hasn't changed over the one recorded at the start of
the GC cycle and doesn't update the mapping if it change.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: add support for zoned space reservations
Christoph Hellwig [Fri, 31 Jan 2025 07:02:50 +0000 (08:02 +0100)]
xfs: add support for zoned space reservations

For zoned file systems garbage collection (GC) has to take the iolock
and mmaplock after moving data to a new place to synchronize with
readers.  This means waiting for garbage collection with the iolock can
deadlock.

To avoid this, the worst case required blocks have to be reserved before
taking the iolock, which is done using a new RTAVAILABLE counter that
tracks blocks that are free to write into and don't require garbage
collection.  The new helpers try to take these available blocks, and
if there aren't enough available it wakes and waits for GC.  This is
done using a list of on-stack reservations to ensure fairness.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: add the zoned space allocator
Christoph Hellwig [Fri, 31 Jan 2025 06:27:26 +0000 (07:27 +0100)]
xfs: add the zoned space allocator

For zoned RT devices space is always allocated at the write pointer, that
is right after the last written block and only recorded on I/O completion.

Because the actual allocation algorithm is very simple and just involves
picking a good zone - preferably the one used for the last write to the
inode.  As the number of zones that can written at the same time is
usually limited by the hardware, selecting a zone is done as late as
possible from the iomap dio and buffered writeback bio submissions
helpers just before submitting the bio.

Given that the writers already took a reservation before acquiring the
iolock, space will always be readily available if an open zone slot is
available.  A new structure is used to track these open zones, and
pointed to by the xfs_rtgroup.  Because zoned file systems don't have
a rsum cache the space for that pointer can be reused.

Allocations are only recorded at I/O completion time.  The scheme used
for that is very similar to the reflink COW end I/O path.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: parse and validate hardware zone information
Christoph Hellwig [Sun, 17 Nov 2024 05:28:10 +0000 (06:28 +0100)]
xfs: parse and validate hardware zone information

Add support to validate and parse reported hardware zone state.

Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file
Christoph Hellwig [Sun, 17 Nov 2024 07:02:04 +0000 (08:02 +0100)]
xfs: skip zoned RT inodes in xfs_inodegc_want_queue_rt_file

The zoned allocator never performs speculative preallocations, so don't
bother queueing up zoned inodes here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes
Christoph Hellwig [Thu, 21 Nov 2024 07:50:23 +0000 (08:50 +0100)]
xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodes

Zoned file systems require out of place writes and thus can't support
post-EOF speculative preallocations.  Avoid the pointless ilock critical
section to find out that none can be freed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: disable FITRIM for zoned RT devices
Christoph Hellwig [Sun, 17 Nov 2024 05:23:16 +0000 (06:23 +0100)]
xfs: disable FITRIM for zoned RT devices

The zoned allocator unconditionally issues zone resets or discards after
emptying an entire zone, so supporting FITRIM for a zoned RT device is
not useful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: disable sb_frextents for zoned file systems
Christoph Hellwig [Sat, 30 Nov 2024 04:59:54 +0000 (05:59 +0100)]
xfs: disable sb_frextents for zoned file systems

Zoned file systems not only don't use the global frextents counter, but
for them the in-memory percpu counter also includes reservations taken
before even allocating delalloc extent records, so it will never match
the per-zone used information.  Disable all updates and verification of
the sb counter for zoned file systems as it isn't useful for them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: export zoned geometry via XFS_FSOP_GEOM
Christoph Hellwig [Thu, 24 Oct 2024 08:57:41 +0000 (10:57 +0200)]
xfs: export zoned geometry via XFS_FSOP_GEOM

Export the zoned geometry information so that userspace can query it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: allow internal RT devices for zoned mode
Christoph Hellwig [Sun, 17 Nov 2024 07:53:10 +0000 (08:53 +0100)]
xfs: allow internal RT devices for zoned mode

Allow creating an RT subvolume on the same device as the main data
device.  This is mostly used for SMR HDDs where the conventional zones
are used for the data device and the sequential write required zones
for the zoned RT section.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: define the zoned on-disk format
Christoph Hellwig [Sat, 30 Nov 2024 03:48:33 +0000 (04:48 +0100)]
xfs: define the zoned on-disk format

Zone file systems reuse the basic RT group enabled XFS file system
structure to support a mode where each RT group is always written from
start to end and then reset for reuse (after moving out any remaining
data).  There are few minor but important changes, which are indicated
by a new incompat flag:

1) there are no bitmap and summary inodes, thus the
   /rtgroups/{rgno}.{bitmap,summary} metadir files do not exist and the
   sb_rbmblocks superblock field must be cleared to zero.

2) there is a new superblock field that specifies the start of an
   internal RT section.  This allows supporting SMR HDDs that have random
   writable space at the beginning which is used for the XFS data device
   (which really is the metadata device for this configuration), directly
   followed by a RT device on the same block device.  While something
   similar could be achieved using dm-linear just having a single device
   directly consumed by XFS makes handling the file systems a lot easier.

3) Another superblock field that tracks the amount of reserved space (or
   overprovisioning) that is never used for user capacity, but allows GC
   to run more smoothly.

4) an overlay of the cowextsize field for the rtrmap inode so that we
   can persistently track the total amount of rtblocks currently used in
   a RT group.  There is no data structure other than the rmap that
   tracks used space in an RT group, and this counter is used to decide
   when a RT group has been entirely emptied, and to select one that
   is relatively empty if garbage collection needs to be performed.
   While this counter could be tracked entirely in memory and rebuilt
   from the rmap at mount time, that would lead to very long mount times
   with the large number of RT groups implied by the number of hardware
   zones especially on SMR hard drives with 256MB zone sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: add a xfs_rtrmap_highest_rgbno helper
Christoph Hellwig [Sat, 21 Dec 2024 08:40:17 +0000 (08:40 +0000)]
xfs: add a xfs_rtrmap_highest_rgbno helper

Add a helper to find the last offset mapped in the rtrmap.  This will be
used by the zoned code to find out where to start writing again on
conventional devices without hardware zone support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay
Christoph Hellwig [Tue, 10 Sep 2024 04:58:17 +0000 (07:58 +0300)]
xfs: support XFS_BMAPI_REMAP in xfs_bmap_del_extent_delay

The zone allocator wants to be able to remove a delalloc mapping in the
COW fork while keeping the block reservation.  To support that pass the
blags argument down to xfs_bmap_del_extent_delay and support the
XFS_BMAPI_REMAP flag to keep the reservation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: reduce metafile reservations
Christoph Hellwig [Fri, 17 Jan 2025 05:55:08 +0000 (06:55 +0100)]
xfs: reduce metafile reservations

There is no point in reserving more space than actually available
on the data device for the worst case scenario that is unlikely to
happen.  Reserve at most 1/4th of the data device blocks, which is
still a heuristic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: make metabtree reservations global
Christoph Hellwig [Mon, 6 Jan 2025 13:49:40 +0000 (14:49 +0100)]
xfs: make metabtree reservations global

Currently each metabtree inode has it's own space reservation to ensure
it can be expanded to the maximum size.  This is not very efficient as it
requires a large number of blocks to be set aside that can't be used at
all by other btrees.

Switch to a model that uses a global pool instead in preparation for
reducing the amount of reserved space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: trace in-memory freecounters
Christoph Hellwig [Fri, 31 Jan 2025 06:24:44 +0000 (07:24 +0100)]
xfs: trace in-memory freecounters

Add two tracepoints when the freecounter dips into the reserved pool
and when it is entirely out of space.

This requires moving enum xfs_free_counter to xfs_types.h to avoid
build failures in various sources files include xfs_trace.h, but
that's probably the right place for it anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: reflow xfs_dec_freecounter
Christoph Hellwig [Fri, 31 Jan 2025 06:23:35 +0000 (07:23 +0100)]
xfs: reflow xfs_dec_freecounter

Let the successful allocation be the main path through the function
with exception handling in branches to make the code easier to
follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: preserve RT reservations across remounts
Hans Holmberg [Sat, 30 Nov 2024 04:39:47 +0000 (05:39 +0100)]
xfs: preserve RT reservations across remounts

Introduce a reservation setting for rt devices so that zoned GC
reservations are preserved over remount ro/rw cycles.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: generalize the freespace and reserved blocks handling
Christoph Hellwig [Mon, 16 Dec 2024 07:10:35 +0000 (08:10 +0100)]
xfs: generalize the freespace and reserved blocks handling

The main handling of the incore per-cpu freespace counters is already
handled in xfs_mod_freecounter for both the block and RT extent cases,
but the actual counter is passed in an special cases.

Replace both the percpu counters and the resblks counters with arrays,
so that support reserved RT extents can be supported, which will be
needed for garbarge collection on zoned devices.

Use helpers to access the freespace counters everywhere intead of
poking through the abstraction by using the percpu_count helpers
directly.  This also switches the flooring of the frextents counter
to 0 in statfs for the rthinherit case to a manual min_t call to match
the handling of the fdblocks counter for normal file systems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: refine the unaligned check for always COW inodes in xfs_file_dio_write
Christoph Hellwig [Fri, 27 Oct 2023 07:58:24 +0000 (09:58 +0200)]
xfs: refine the unaligned check for always COW inodes in xfs_file_dio_write

For always COW inodes we also must check the alignment of each individual
iovec segment, as they could end up with different I/Os due to the way
bio_iov_iter_get_pages works, and we'd then overwrite an already written
block.  The existing always_cow sysctl based code doesn't catch this
because nothing enforces that blocks aren't rewritten, but for zoned XFS
on sequential write required zones this is a hard error.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: skip always_cow inodes in xfs_reflink_trim_around_shared
Christoph Hellwig [Sat, 14 Oct 2023 05:50:35 +0000 (07:50 +0200)]
xfs: skip always_cow inodes in xfs_reflink_trim_around_shared

xfs_reflink_trim_around_shared tries to find shared blocks in the
refcount btree.  Always_cow inodes don't have that tree, so don't
bother.

For the existing always_cow code this is a minor optimization.  For
the upcoming zoned code that can do COW without the rtreflink code it
avoids triggering a NULL pointer dereference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c
Christoph Hellwig [Sun, 17 Nov 2024 09:22:50 +0000 (10:22 +0100)]
xfs: move xfs_bmapi_reserve_delalloc to xfs_iomap.c

Delalloc reservations are not supported in userspace, and thus it doesn't
make sense to share this helper with xfsprogs.c.  Move it to xfs_iomap.c
toward the two callers.

Note that there rest of the delalloc handling should probably eventually
also move out of xfs_bmap.c, but that will require a bit more surgery.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoxfs: add a rtg_blocks helper
Christoph Hellwig [Sat, 21 Dec 2024 08:38:28 +0000 (08:38 +0000)]
xfs: add a rtg_blocks helper

Shortcut dereferencing the xg_block_count field in the generic group
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoxfs: factor out a xfs_rt_check_size helper
Christoph Hellwig [Tue, 30 Jul 2024 23:42:42 +0000 (16:42 -0700)]
xfs: factor out a xfs_rt_check_size helper

Add a helper to check that the last block of a RT device is readable
to share the code between mount and growfs.  This also adds the mount
time overflow check to growfs and improves the error messages.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: pass private data to iomap_truncate_page
Christoph Hellwig [Fri, 16 Aug 2024 16:49:13 +0000 (18:49 +0200)]
iomap: pass private data to iomap_truncate_page

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: pass private data to iomap_zero_range
Christoph Hellwig [Fri, 16 Aug 2024 16:48:16 +0000 (18:48 +0200)]
iomap: pass private data to iomap_zero_range

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: pass private data to iomap_page_mkwrite
Christoph Hellwig [Sat, 30 Nov 2024 03:43:35 +0000 (04:43 +0100)]
iomap: pass private data to iomap_page_mkwrite

Allow the file system to pass private data which can be used by the
iomap_begin and iomap_end methods through the private pointer in the
iomap_iter structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: add a io_private field to struct iomap_ioend
Christoph Hellwig [Tue, 7 Jan 2025 07:07:15 +0000 (08:07 +0100)]
iomap: add a io_private field to struct iomap_ioend

Add a private data field to struct iomap_ioend so that the file system
can attach information to it.  Zoned XFS will use this for a pointer to
the open zone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoiomap: optionally use ioends for direct I/O
Christoph Hellwig [Wed, 8 Jan 2025 18:36:49 +0000 (19:36 +0100)]
iomap: optionally use ioends for direct I/O

struct iomap_ioend currently tracks outstanding buffered writes and has
some really nice code in core iomap and XFS to merge contiguous I/Os
an defer them to userspace for completion in a very efficient way.

For zoned writes we'll also need a per-bio user context completion to
record the written blocks, and the infrastructure for that would look
basically like the ioend handling for buffered I/O.

So instead of reinventing the wheel, reuse the existing infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: factor out a iomap_dio_done helper
Christoph Hellwig [Mon, 16 Dec 2024 05:39:10 +0000 (06:39 +0100)]
iomap: factor out a iomap_dio_done helper

Split out the struct iomap-dio level final completion from
iomap_dio_bio_end_io into a helper to clean up the code and make it
reusable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: move common ioend code to ioend.c
Christoph Hellwig [Mon, 16 Dec 2024 05:32:35 +0000 (06:32 +0100)]
iomap: move common ioend code to ioend.c

This code will be reused for direct I/O soon, so split it out of
buffered-io.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: split bios to zone append limits in the submission handlers
Christoph Hellwig [Mon, 27 Jan 2025 14:33:47 +0000 (15:33 +0100)]
iomap: split bios to zone append limits in the submission handlers

Provide helpers for file systems to split bios in the direct I/O and
writeback I/O submission handlers.  The split ioends are chained to
the parent ioend so that only the parent ioend originally generated
by the iomap layer will be processed after all the chained off children
have completed.  This is based on the block layer bio chaining that has
supported a similar mechanism for a long time.

This Follows btrfs' lead and don't try to build bios to hardware limits
for zone append commands, but instead build them as normal unconstrained
bios and split them to the hardware limits in the I/O submission handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: add a IOMAP_F_ANON_WRITE flag
Christoph Hellwig [Sun, 5 Nov 2023 05:40:52 +0000 (06:40 +0100)]
iomap: add a IOMAP_F_ANON_WRITE flag

Add a IOMAP_F_ANON_WRITE flag that indicates that the write I/O does not
have a target block assigned to it yet at iomap time and the file system
will do that in the bio submission handler, splitting the I/O as needed.

This is used to implement Zone Append based I/O for zoned XFS, where
splitting writes to the hardware limits and assigning a zone to them
happens just before sending the I/O off to the block layer, but could
also be useful for other things like compressed I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: simplify io_flags and io_type in struct iomap_ioend
Christoph Hellwig [Sun, 24 Nov 2024 12:53:36 +0000 (13:53 +0100)]
iomap: simplify io_flags and io_type in struct iomap_ioend

The ioend fields for distinct types of I/O are a bit complicated.
Consolidate them into a single io_flag field with it's own flags
decoupled from the iomap flags.  This also prepares for adding a new
flag that is unrelated to both of the iomap namespaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoiomap: allow the file system to submit the writeback bios
Christoph Hellwig [Tue, 5 Nov 2024 07:33:10 +0000 (08:33 +0100)]
iomap: allow the file system to submit the writeback bios

Change ->prepare_ioend to ->submit_ioend and require file systems that
implement it to submit the bio.  This is needed for file systems that
do their own work on the bios before submitting them to the block layer
like btrfs or zoned xfs.  To make this easier also pass the writeback
context to the method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2 months agoTEMP: nvme-pci: disable async probe
Christoph Hellwig [Sun, 15 Oct 2023 07:26:20 +0000 (09:26 +0200)]
TEMP: nvme-pci: disable async probe

This keeps getting my ZNS vs ZNS drivers reordered a bit and is annoying
for testing.

Not-really-signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoDocumentation: Document the new zoned loop block device driver
Damien Le Moal [Fri, 31 Jan 2025 06:41:46 +0000 (15:41 +0900)]
Documentation: Document the new zoned loop block device driver

Introduce the zoned_loop.rst documentation file under
admin-guide/blockdev to document the zoned loop block device driver.
An overview of the driver is provided and its usage to create and delete
zoned devices described.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2 months agoblock: new zoned loop block device driver
Damien Le Moal [Fri, 31 Jan 2025 06:41:45 +0000 (15:41 +0900)]
block: new zoned loop block device driver

The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of allowing
emulating large zoned devices without requiring the same amount of
memory as the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.

This initial implementation is simple and does not support zone resource
limits. That is, a zoned loop block device limits for the maximum number
of open zones and maximum number of active zones is always 0.

This driver can be either compiled in-kernel or as a module, named
"zloop". Compilation of this driver depends on the block layer support
for zoned block device (CONFIG_BLK_DEV_ZONED must be set).

Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:

  $ echo "add [options]" > /dev/zloop-control

The options available for the "add" operation cat be listed by reading
the zloop-control device file:

  $ cat /dev/zloop-control
  add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
  remove id=%d

The options available allow controlling the zoned device total
capacity, zone size, zone capactity of sequential zones, total number
of conventional zones, base directory for the zones backing file, number
of I/O queues and the maximum queue depth of I/O queues.

Deleting a device is done using the "remove" command:

  $ echo "remove id=0" > /dev/zloop-control

This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.

Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2 months agoxfs: remove xfs_buf_cache.bc_lock
Christoph Hellwig [Sun, 12 Jan 2025 06:00:04 +0000 (07:00 +0100)]
xfs: remove xfs_buf_cache.bc_lock

xfs_buf_cache.bc_lock serializes adding buffers to and removing them from
the hashtable.  But as the rhashtable code already uses fine grained
internal locking for inserts and removals the extra protection isn't
actually required.

It also happens to fix a lock order inversion vs b_lock added by the
recent lookup race fix.

Fixes: ee10f6fcdb96 ("xfs: fix buffer lookup vs release race")
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2 months agoxfs: don't call remap_verify_area with sb write protection held
Christoph Hellwig [Sun, 19 Jan 2025 06:02:36 +0000 (07:02 +0100)]
xfs: don't call remap_verify_area with sb write protection held

The XFS_IOC_EXCHANGE_RANGE ioctl with the XFS_EXCHANGE_RANGE_TO_EOF flag
operates on a range bounded by the end of the file.  This means the
actual amount of blocks exchanged is derived from the inode size, which
is only stable with the IOLOCK (i_rwsem) held.  Do that, it currently
calls remap_verify_area from inside the sb write protection which nests
outside the IOLOCK.  But this makes fsnotify_file_area_perm which is
called from remap_verify_area unhappy when the kernel is built with
lockdep and the recently added CONFIG_FANOTIFY_ACCESS_PERMISSIONS
option.

Fix this by always calling remap_verify_area before taking the write
protection, and passing a 0 size to remap_verify_area similar to
the FICLONE/FICLONERANGE ioctls when they are asked to clone until
the file end.

(Note: the size argument gets passed to fsnotify_file_area_perm, but
then isn't actually used there).

Fixes: 9a64d9b3109d ("xfs: introduce new file range exchange ioctl")
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 months agoLinux 6.14-rc1
Linus Torvalds [Sun, 2 Feb 2025 23:39:26 +0000 (15:39 -0800)]
Linux 6.14-rc1

2 months agoMerge tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 2 Feb 2025 18:49:13 +0000 (10:49 -0800)]
Merge tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux

Pull turbostat updates from Len Brown:

 - Fix regression that affinitized forked child in one-shot mode.

 - Harden one-shot mode against hotplug online/offline

 - Enable RAPL SysWatt column by default

 - Add initial PTL, CWF platform support

 - Harden initial PMT code in response to early use

 - Enable first built-in PMT counter: CWF c1e residency

 - Refuse to run on unsupported platforms without --force, to encourage
   updating to a version that supports the system, and to avoid
   no-so-useful measurement results

* tag 'turbostat-2025.02.02' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (25 commits)
  tools/power turbostat: version 2025.02.02
  tools/power turbostat: Add CPU%c1e BIC for CWF
  tools/power turbostat: Harden one-shot mode against cpu offline
  tools/power turbostat: Fix forked child affinity regression
  tools/power turbostat: Add tcore clock PMT type
  tools/power turbostat: version 2025.01.14
  tools/power turbostat: Allow adding PMT counters directly by sysfs path
  tools/power turbostat: Allow mapping multiple PMT files with the same GUID
  tools/power turbostat: Add PMT directory iterator helper
  tools/power turbostat: Extend PMT identification with a sequence number
  tools/power turbostat: Return default value for unmapped PMT domains
  tools/power turbostat: Check for non-zero value when MSR probing
  tools/power turbostat: Enhance turbostat self-performance visibility
  tools/power turbostat: Add fixed RAPL PSYS divisor for SPR
  tools/power turbostat: Fix PMT mmaped file size rounding
  tools/power turbostat: Remove SysWatt from DISABLED_BY_DEFAULT
  tools/power turbostat: Add an NMI column
  tools/power turbostat: add Busy% to "show idle"
  tools/power turbostat: Introduce --force parameter
  tools/power turbostat: Improve --help output
  ...

2 months agoMerge tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubi...
Linus Torvalds [Sun, 2 Feb 2025 18:40:27 +0000 (10:40 -0800)]
Merge tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux

Pull sh updates from John Paul Adrian Glaubitz:
 "Fixes and improvements for sh:

   - replace seq_printf() with the more efficient
     seq_put_decimal_ull_width() to increase performance when stress
     reading /proc/interrupts (David Wang)

   - migrate sh to the generic rule for built-in DTB to help avoid race
     conditions during parallel builds which can occur because Kbuild
     decends into arch/*/boot/dts twice (Masahiro Yamada)

   - replace select with imply in the board Kconfig for enabling
     hardware with complex dependencies. This addresses warnings which
     were reported by the kernel test robot (Geert Uytterhoeven)"

* tag 'sh-for-v6.14-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
  sh: boards: Use imply to enable hardware with complex dependencies
  sh: Migrate to the generic rule for built-in DTB
  sh: irq: Use seq_put_decimal_ull_width() for decimal values

2 months agotools/power turbostat: version 2025.02.02
Len Brown [Sun, 2 Feb 2025 16:43:02 +0000 (10:43 -0600)]
tools/power turbostat: version 2025.02.02

Summary of Changes since 2024.11.30:

Fix regression in 2023.11.07 that affinitized forked child
in one-shot mode.

Harden one-shot mode against hotplug online/offline

Enable RAPL SysWatt column by default.

Add initial PTL, CWF platform support.

Harden initial PMT code in response to early use.

Enable first built-in PMT counter: CWF c1e residency

Refuse to run on unsupported platforms without --force,
to encourage updating to a version that supports the system,
and to avoid no-so-useful measurement results.

Signed-off-by: Len Brown <len.brown@intel.com>
2 months agoMerge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sat, 1 Feb 2025 23:07:56 +0000 (15:07 -0800)]
Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull misc vfs cleanups from Al Viro:
 "Two unrelated patches - one is a removal of long-obsolete include in
  overlayfs (it used to need fs/internal.h, but the extern it wanted has
  been moved back to include/linux/namei.h) and another introduces
  convenience helper constructing struct qstr by a NUL-terminated
  string"

* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  add a string-to-qstr constructor
  fs/overlayfs/namei.c: get rid of include ../internal.h

2 months agoMerge tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Linus Torvalds [Sat, 1 Feb 2025 22:54:33 +0000 (14:54 -0800)]
Merge tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS fix from Thomas Bogendoerfer:
 "Revert commit breaking sysv ipc for o32 ABI"

* tag 'mips_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  Revert "mips: fix shmctl/semctl/msgctl syscall for o32"

2 months agoMerge tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 1 Feb 2025 19:30:41 +0000 (11:30 -0800)]
Merge tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull more smb client updates from Steve French:

   - various updates for special file handling: symlink handling,
     support for creating sockets, cleanups, new mount options (e.g. to
     allow disabling using reparse points for them, and to allow
     overriding the way symlinks are saved), and fixes to error paths

   - fix for kerberos mounts (allow IAKerb)

   - SMB1 fix for stat and for setting SACL (auditing)

   - fix an incorrect error code mapping

   - cleanups"

* tag 'v6.14-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: (21 commits)
  cifs: Fix parsing native symlinks directory/file type
  cifs: update internal version number
  cifs: Add support for creating WSL-style symlinks
  smb3: add support for IAKerb
  cifs: Fix struct FILE_ALL_INFO
  cifs: Add support for creating NFS-style symlinks
  cifs: Add support for creating native Windows sockets
  cifs: Add mount option -o reparse=none
  cifs: Add mount option -o symlink= for choosing symlink create type
  cifs: Fix creating and resolving absolute NT-style symlinks
  cifs: Simplify reparse point check in cifs_query_path_info() function
  cifs: Remove symlink member from cifs_open_info_data union
  cifs: Update description about ACL permissions
  cifs: Rename struct reparse_posix_data to reparse_nfs_data_buffer and move to common/smb2pdu.h
  cifs: Remove struct reparse_posix_data from struct cifs_open_info_data
  cifs: Remove unicode parameter from parse_reparse_point() function
  cifs: Fix getting and setting SACLs over SMB1
  cifs: Remove intermediate object of failed create SFU call
  cifs: Validate EAs for WSL reparse points
  cifs: Change translation of STATUS_PRIVILEGE_NOT_HELD to -EPERM
  ...

2 months agoMerge tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 1 Feb 2025 18:04:29 +0000 (10:04 -0800)]
Merge tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull debugfs fix from Greg KH:
 "Here is a single debugfs fix from Al to resolve a reported regression
  in the driver-core tree. It has been reported to fix the issue"

* tag 'driver-core-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  debugfs: Fix the missing initializations in __debugfs_file_get()

2 months agoMerge tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 1 Feb 2025 17:49:20 +0000 (09:49 -0800)]
Merge tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes. 8 are cc:stable and the remainder address post-6.13
  issues. 13 are for MM and 8 are for non-MM.

  All are singletons, please see the changelogs for details"

* tag 'mm-hotfixes-stable-2025-02-01-03-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  MAINTAINERS: include linux-mm for xarray maintenance
  revert "xarray: port tests to kunit"
  MAINTAINERS: add lib/test_xarray.c
  mailmap, MAINTAINERS, docs: update Carlos's email address
  mm/hugetlb: fix hugepage allocation for interleaved memory nodes
  mm: gup: fix infinite loop within __get_longterm_locked
  mm, swap: fix reclaim offset calculation error during allocation
  .mailmap: update email address for Christopher Obbard
  kfence: skip __GFP_THISNODE allocations on NUMA systems
  nilfs2: fix possible int overflows in nilfs_fiemap()
  mm: compaction: use the proper flag to determine watermarks
  kernel: be more careful about dup_mmap() failures and uprobe registering
  mm/fake-numa: handle cases with no SRAT info
  mm: kmemleak: fix upper boundary check for physical address objects
  mailmap: add an entry for Hamza Mahfooz
  MAINTAINERS: mailmap: update Yosry Ahmed's email address
  scripts/gdb: fix aarch64 userspace detection in get_current_task
  mm/vmscan: accumulate nr_demoted for accurate demotion statistics
  ocfs2: fix incorrect CPU endianness conversion causing mount failure
  mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc()
  ...

2 months agoMerge tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Sat, 1 Feb 2025 17:15:01 +0000 (09:15 -0800)]
Merge tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fix from Mauro Carvalho Chehab:
 "A revert for a regression in the uvcvideo driver"

* tag 'media/v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  Revert "media: uvcvideo: Require entities to have a non-zero unique ID"

2 months agoMAINTAINERS: include linux-mm for xarray maintenance
Andrew Morton [Fri, 31 Jan 2025 00:16:20 +0000 (16:16 -0800)]
MAINTAINERS: include linux-mm for xarray maintenance

MM developers have an interest in the xarray code.

Cc: David Gow <davidgow@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agorevert "xarray: port tests to kunit"
Andrew Morton [Fri, 31 Jan 2025 00:09:20 +0000 (16:09 -0800)]
revert "xarray: port tests to kunit"

Revert c7bb5cf9fc4e ("xarray: port tests to kunit").  It broke the build
when compiing the xarray userspace test harness code.

Reported-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Closes: https://lkml.kernel.org/r/07cf896e-adf8-414f-a629-a808fc26014a@oracle.com
Cc: David Gow <davidgow@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tamir Duberstein <tamird@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoMAINTAINERS: add lib/test_xarray.c
Tamir Duberstein [Wed, 29 Jan 2025 21:13:49 +0000 (16:13 -0500)]
MAINTAINERS: add lib/test_xarray.c

Ensure test-only changes are sent to the relevant maintainer.

Link: https://lkml.kernel.org/r/20250129-xarray-test-maintainer-v1-1-482e31f30f47@gmail.com
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Cc: Mattew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomailmap, MAINTAINERS, docs: update Carlos's email address
Carlos Bilbao [Thu, 30 Jan 2025 01:22:44 +0000 (19:22 -0600)]
mailmap, MAINTAINERS, docs: update Carlos's email address

Update .mailmap to reflect my new (and final) primary email address,
carlos.bilbao@kernel.org.  Also update contact information in files
Documentation/translations/sp_SP/index.rst and MAINTAINERS.

Link: https://lkml.kernel.org/r/20250130012248.1196208-1-carlos.bilbao@kernel.org
Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Cc: Carlos Bilbao <bilbao@vt.edu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mattew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/hugetlb: fix hugepage allocation for interleaved memory nodes
Ritesh Harjani (IBM) [Sat, 11 Jan 2025 11:06:55 +0000 (16:36 +0530)]
mm/hugetlb: fix hugepage allocation for interleaved memory nodes

gather_bootmem_prealloc() assumes the start nid as 0 and size as
num_node_state(N_MEMORY).  That means in case if memory attached numa
nodes are interleaved, then gather_bootmem_prealloc_parallel() will fail
to scan few of these nodes.

Since memory attached numa nodes can be interleaved in any fashion, hence
ensure that the current code checks for all numa node ids
(.size = nr_node_ids). Let's still keep max_threads as N_MEMORY, so that
it can distributes all nr_node_ids among the these many no. threads.

e.g. qemu cmdline
========================
numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
mem_cmd="-object memory-backend-ram,id=mem1,size=16G"

w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
==========================
~ # cat /proc/meminfo  |grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:               0 kB

with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
===========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       2
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         2097152 kB

Link: https://lkml.kernel.org/r/f8d8dad3a5471d284f54185f65d575a6aaab692b.1736592534.git.ritesh.list@gmail.com
Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reported-by: Pavithra Prakash <pavrampu@linux.ibm.com>
Suggested-by: Muchun Song <muchun.song@linux.dev>
Tested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Gang Li <gang.li@linux.dev>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: gup: fix infinite loop within __get_longterm_locked
Zhaoyang Huang [Tue, 21 Jan 2025 02:01:59 +0000 (10:01 +0800)]
mm: gup: fix infinite loop within __get_longterm_locked

We can run into an infinite loop in __get_longterm_locked() when
collect_longterm_unpinnable_folios() finds only folios that are isolated
from the LRU or were never added to the LRU.  This can happen when all
folios to be pinned are never added to the LRU, for example when
vm_ops->fault allocated pages using cma_alloc() and never added them to
the LRU.

Fix it by simply taking a look at the list in the single caller, to see if
anything was added.

[zhaoyang.huang@unisoc.com: move definition of local]
Link: https://lkml.kernel.org/r/20250122012604.3654667-1-zhaoyang.huang@unisoc.com
Link: https://lkml.kernel.org/r/20250121020159.3636477-1-zhaoyang.huang@unisoc.com
Fixes: 67e139b02d99 ("mm/gup.c: refactor check_and_migrate_movable_pages()")
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Aijun Sun <aijun.sun@unisoc.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm, swap: fix reclaim offset calculation error during allocation
Kairui Song [Thu, 30 Jan 2025 11:51:31 +0000 (19:51 +0800)]
mm, swap: fix reclaim offset calculation error during allocation

There is a code error that will cause the swap entry allocator to reclaim
and check the whole cluster with an unexpected tail offset instead of the
part that needs to be reclaimed.  This may cause corruption of the swap
map, so fix it.

Link: https://lkml.kernel.org/r/20250130115131.37777-1-ryncsn@gmail.com
Fixes: 3b644773eefd ("mm, swap: reduce contention on device lock")
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Chris Li <chrisl@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months ago.mailmap: update email address for Christopher Obbard
Christopher Obbard [Wed, 22 Jan 2025 12:04:27 +0000 (12:04 +0000)]
.mailmap: update email address for Christopher Obbard

Update my email address.

Link: https://lkml.kernel.org/r/20250122-wip-obbardc-update-email-v2-1-12bde6b79ad0@linaro.org
Signed-off-by: Christopher Obbard <christopher.obbard@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agokfence: skip __GFP_THISNODE allocations on NUMA systems
Marco Elver [Fri, 24 Jan 2025 12:01:38 +0000 (13:01 +0100)]
kfence: skip __GFP_THISNODE allocations on NUMA systems

On NUMA systems, __GFP_THISNODE indicates that an allocation _must_ be on
a particular node, and failure to allocate on the desired node will result
in a failed allocation.

Skip __GFP_THISNODE allocations if we are running on a NUMA system, since
KFENCE can't guarantee which node its pool pages are allocated on.

Link: https://lkml.kernel.org/r/20250124120145.410066-1-elver@google.com
Fixes: 236e9f153852 ("kfence: skip all GFP_ZONEMASK allocations")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Chistoph Lameter <cl@linux.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agonilfs2: fix possible int overflows in nilfs_fiemap()
Nikita Zhandarovich [Fri, 24 Jan 2025 22:20:53 +0000 (07:20 +0900)]
nilfs2: fix possible int overflows in nilfs_fiemap()

Since nilfs_bmap_lookup_contig() in nilfs_fiemap() calculates its result
by being prepared to go through potentially maxblocks == INT_MAX blocks,
the value in n may experience an overflow caused by left shift of blkbits.

While it is extremely unlikely to occur, play it safe and cast right hand
expression to wider type to mitigate the issue.

Found by Linux Verification Center (linuxtesting.org) with static analysis
tool SVACE.

Link: https://lkml.kernel.org/r/20250124222133.5323-1-konishi.ryusuke@gmail.com
Fixes: 622daaff0a89 ("nilfs2: fiemap support")
Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: compaction: use the proper flag to determine watermarks
yangge [Sat, 25 Jan 2025 06:53:57 +0000 (14:53 +0800)]
mm: compaction: use the proper flag to determine watermarks

There are 4 NUMA nodes on my machine, and each NUMA node has 32GB of
memory.  I have configured 16GB of CMA memory on each NUMA node, and
starting a 32GB virtual machine with device passthrough is extremely slow,
taking almost an hour.

Long term GUP cannot allocate memory from CMA area, so a maximum of 16 GB
of no-CMA memory on a NUMA node can be used as virtual machine memory.
There is 16GB of free CMA memory on a NUMA node, which is sufficient to
pass the order-0 watermark check, causing the __compaction_suitable()
function to consistently return true.

For costly allocations, if the __compaction_suitable() function always
returns true, it causes the __alloc_pages_slowpath() function to fail to
exit at the appropriate point.  This prevents timely fallback to
allocating memory on other nodes, ultimately resulting in excessively long
virtual machine startup times.

Call trace:
__alloc_pages_slowpath
    if (compact_result == COMPACT_SKIPPED ||
        compact_result == COMPACT_DEFERRED)
        goto nopage; // should exit __alloc_pages_slowpath() from here

We could use the real unmovable allocation context to have
__zone_watermark_unusable_free() subtract CMA pages, and thus we won't
pass the order-0 check anymore once the non-CMA part is exhausted.  There
is some risk that in some different scenario the compaction could in fact
migrate pages from the exhausted non-CMA part of the zone to the CMA part
and succeed, and we'll skip it instead.  But only __GFP_NORETRY
allocations should be affected in the immediate "goto nopage" when
compaction is skipped, others will attempt with DEF_COMPACT_PRIORITY
anyway and won't fail without trying to compact-migrate the non-CMA
pageblocks into CMA pageblocks first, so it should be fine.

After this fix, it only takes a few tens of seconds to start a 32GB
virtual machine with device passthrough functionality.

Link: https://lore.kernel.org/lkml/1736335854-548-1-git-send-email-yangge1116@126.com/
Link: https://lkml.kernel.org/r/1737788037-8439-1-git-send-email-yangge1116@126.com
Signed-off-by: yangge <yangge1116@126.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agokernel: be more careful about dup_mmap() failures and uprobe registering
Liam R. Howlett [Mon, 27 Jan 2025 17:02:21 +0000 (12:02 -0500)]
kernel: be more careful about dup_mmap() failures and uprobe registering

If a memory allocation fails during dup_mmap(), the maple tree can be left
in an unsafe state for other iterators besides the exit path.  All the
locks are dropped before the exit_mmap() call (in mm/mmap.c), but the
incomplete mm_struct can be reached through (at least) the rmap finding
the vmas which have a pointer back to the mm_struct.

Up to this point, there have been no issues with being able to find an
mm_struct that was only partially initialised.  Syzbot was able to make
the incomplete mm_struct fail with recent forking changes, so it has been
proven unsafe to use the mm_struct that hasn't been initialised, as
referenced in the link below.

Although 8ac662f5da19f ("fork: avoid inappropriate uprobe access to
invalid mm") fixed the uprobe access, it does not completely remove the
race.

This patch sets the MMF_OOM_SKIP to avoid the iteration of the vmas on the
oom side (even though this is extremely unlikely to be selected as an oom
victim in the race window), and sets MMF_UNSTABLE to avoid other potential
users from using a partially initialised mm_struct.

When registering vmas for uprobe, skip the vmas in an mm that is marked
unstable.  Modifying a vma in an unstable mm may cause issues if the mm
isn't fully initialised.

Link: https://lore.kernel.org/all/6756d273.050a0220.2477f.003d.GAE@google.com/
Link: https://lkml.kernel.org/r/20250127170221.1761366-1-Liam.Howlett@oracle.com
Fixes: d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/fake-numa: handle cases with no SRAT info
Bruno Faccini [Mon, 27 Jan 2025 17:16:23 +0000 (09:16 -0800)]
mm/fake-numa: handle cases with no SRAT info

Handle more gracefully cases where no SRAT information is available, like
in VMs with no Numa support, and allow fake-numa configuration to complete
successfully in these cases

Link: https://lkml.kernel.org/r/20250127171623.1523171-1-bfaccini@nvidia.com
Fixes: 63db8170bf34 (“mm/fake-numa: allow later numa node hotplug”)
Signed-off-by: Bruno Faccini <bfaccini@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm: kmemleak: fix upper boundary check for physical address objects
Catalin Marinas [Mon, 27 Jan 2025 18:42:33 +0000 (18:42 +0000)]
mm: kmemleak: fix upper boundary check for physical address objects

Memblock allocations are registered by kmemleak separately, based on their
physical address.  During the scanning stage, it checks whether an object
is within the min_low_pfn and max_low_pfn boundaries and ignores it
otherwise.

With the recent addition of __percpu pointer leak detection (commit
6c99d4eb7c5e ("kmemleak: enable tracking for percpu pointers")), kmemleak
started reporting leaks in setup_zone_pageset() and
setup_per_cpu_pageset().  These were caused by the node_data[0] object
(initialised in alloc_node_data()) ending on the PFN_PHYS(max_low_pfn)
boundary.  The non-strict upper boundary check introduced by commit
84c326299191 ("mm: kmemleak: check physical address when scan") causes the
pg_data_t object to be ignored (not scanned) and the __percpu pointers it
contains to be reported as leaks.

Make the max_low_pfn upper boundary check strict when deciding whether to
ignore a physical address object and not scan it.

Link: https://lkml.kernel.org/r/20250127184233.2974311-1-catalin.marinas@arm.com
Fixes: 84c326299191 ("mm: kmemleak: check physical address when scan")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Cc: Patrick Wang <patrick.wang.shcn@gmail.com>
Cc: <stable@vger.kernel.org> [6.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomailmap: add an entry for Hamza Mahfooz
Hamza Mahfooz [Mon, 20 Jan 2025 20:56:59 +0000 (15:56 -0500)]
mailmap: add an entry for Hamza Mahfooz

Map my previous work email to my current one.

Link: https://lkml.kernel.org/r/20250120205659.139027-1-hamzamahfooz@linux.microsoft.com
Signed-off-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hans verkuil <hverkuil@xs4all.nl>
Cc: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoMAINTAINERS: mailmap: update Yosry Ahmed's email address
Yosry Ahmed [Thu, 23 Jan 2025 23:13:44 +0000 (23:13 +0000)]
MAINTAINERS: mailmap: update Yosry Ahmed's email address

Moving to a linux.dev email address.

Link: https://lkml.kernel.org/r/20250123231344.817358-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoscripts/gdb: fix aarch64 userspace detection in get_current_task
Jan Kiszka [Fri, 10 Jan 2025 10:36:33 +0000 (11:36 +0100)]
scripts/gdb: fix aarch64 userspace detection in get_current_task

At least recent gdb releases (seen with 14.2) return SP_EL0 as signed long
which lets the right-shift always return 0.

Link: https://lkml.kernel.org/r/dcd2fabc-9131-4b48-8419-6444e2d67454@siemens.com
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/vmscan: accumulate nr_demoted for accurate demotion statistics
Li Zhijian [Fri, 10 Jan 2025 12:21:32 +0000 (20:21 +0800)]
mm/vmscan: accumulate nr_demoted for accurate demotion statistics

In shrink_folio_list(), demote_folio_list() can be called 2 times.
Currently stat->nr_demoted will only store the last nr_demoted( the later
nr_demoted is always zero, the former nr_demoted will get lost), as a
result number of demoted pages is not accurate.

Accumulate the nr_demoted count across multiple calls to
demote_folio_list(), ensuring accurate reporting of demotion statistics.

[lizhijian@fujitsu.com: introduce local nr_demoted to fix nr_reclaimed double counting]
Link: https://lkml.kernel.org/r/20250111015253.425693-1-lizhijian@fujitsu.com
Link: https://lkml.kernel.org/r/20250110122133.423481-1-lizhijian@fujitsu.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Acked-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agoocfs2: fix incorrect CPU endianness conversion causing mount failure
Heming Zhao [Tue, 21 Jan 2025 11:22:03 +0000 (19:22 +0800)]
ocfs2: fix incorrect CPU endianness conversion causing mount failure

Commit 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()")
introduced a regression bug.  The blksz_bits value is already converted to
CPU endian in the previous code; therefore, the code shouldn't use
le32_to_cpu() anymore.

Link: https://lkml.kernel.org/r/20250121112204.12834-1-heming.zhao@suse.com
Fixes: 23aab037106d ("ocfs2: fix UBSAN warning in ocfs2_verify_volume()")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc()
Hyeonggon Yoo [Mon, 27 Jan 2025 23:16:31 +0000 (08:16 +0900)]
mm/zsmalloc: add __maybe_unused attribute for is_first_zpdesc()

Commit c1b3bb73d55e ("mm/zsmalloc: use zpdesc in
trylock_zspage()/lock_zspage()") introduces is_first_zpdesc() function.
However, the function is only used when CONFIG_DEBUG_VM=y.

When building with LLVM=1 and W=1 option, the following warning is
generated:
  $ make -j12 W=1 LLVM=1 mm/zsmalloc.o
  mm/zsmalloc.c:455:20: error: function 'is_first_zpdesc' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
    455 | static inline bool is_first_zpdesc(struct zpdesc *zpdesc)
        |                    ^~~~~~~~~~~~~~~
  1 error generated.

Fix the warning by adding __maybe_unused attribute to the function.
No functional change intended.

Link: https://lkml.kernel.org/r/20250127231631.4363-1-42.hyeyoo@gmail.com
Fixes: c1b3bb73d55e ("mm/zsmalloc: use zpdesc in trylock_zspage()/lock_zspage()")
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501240958.4ILzuBrH-lkp@intel.com/
Cc: Alex Shi <alexs@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agomm/vmscan: fix hard LOCKUP in function isolate_lru_folios
liuye [Tue, 19 Nov 2024 06:08:42 +0000 (14:08 +0800)]
mm/vmscan: fix hard LOCKUP in function isolate_lru_folios

This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim.  If the LRU mostly contains ineligible folios this may trigger
watchdog.

watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
_raw_spin_lock_irqsave+0x31/0x40
folio_lruvec_lock_irqsave+0x5f/0x90
folio_batch_move_lru+0x91/0x150
lru_add_drain_per_cpu+0x1c/0x40
process_one_work+0x17d/0x350
worker_thread+0x27b/0x3a0
kthread+0xe8/0x120
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1b/0x30

lruvec->lru_lock owner:

PID: 2865     TASK: ffff888139214d40  CPU: 40   COMMAND: "kswapd0"
 #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
    [exception RIP: isolate_lru_folios+403]
    RIP: ffffffffa597df53  RSP: ffffc90006fb7c28  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffffc90006fb7c60  RCX: ffffea04a2196f88
    RDX: ffffc90006fb7c60  RSI: ffffc90006fb7c60  RDI: ffffea04a2197048
    RBP: ffff88812cbd3010   R8: ffffea04a2197008   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000001  R12: ffffea04a2197008
    R13: ffffea04a2197048  R14: ffffc90006fb7de8  R15: 0000000003e3e937
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    <NMI exception stack>
 #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
crash>

Scenario:
User processe are requesting a large amount of memory and keep page active.
Then a module continuously requests memory from ZONE_DMA32 area.
Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached.
However pages in the LRU(active_anon) list are mostly from
the ZONE_NORMAL area.

Reproduce:
Terminal 1: Construct to continuously increase pages active(anon).
mkdir /tmp/memory
mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
dd if=/dev/zero of=/tmp/memory/block bs=4M
tail /tmp/memory/block

Terminal 2:
vmstat -a 1
active will increase.
procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0   0 1445623076 45898836 83646008    0    0     0
 1  0   0 1445623076 43450228 86094616    0    0     0
 1  0   0 1445623076 41003480 88541364    0    0     0
 1  0   0 1445623076 38557088 90987756    0    0     0
 1  0   0 1445623076 36109688 93435156    0    0     0
 1  0   0 1445619552 33663256 95881632    0    0     0
 1  0   0 1445619804 31217140 98327792    0    0     0
 1  0   0 1445619804 28769988 100774944    0    0     0
 1  0   0 1445619804 26322348 103222584    0    0     0
 1  0   0 1445619804 23875592 105669340    0    0     0

cat /proc/meminfo | head
Active(anon) increase.
MemTotal:       1579941036 kB
MemFree:        1445618500 kB
MemAvailable:   1453013224 kB
Buffers:            6516 kB
Cached:         128653956 kB
SwapCached:            0 kB
Active:         118110812 kB
Inactive:       11436620 kB
Active(anon):   115345744 kB
Inactive(anon):   945292 kB

When the Active(anon) is 115345744 kB, insmod module triggers
the ZONE_DMA32 watermark.

perf record -e vmscan:mm_vmscan_lru_isolate -aR
perf script
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
nr_skipped=2 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
nr_skipped=29 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon

See nr_scanned=28835844.
28835844 * 4k = 115343376KB approximately equal to 115345744 kB.

If increase Active(anon) to 1000G then insmod module triggers
the ZONE_DMA32 watermark. hard lockup will occur.

In my device nr_scanned = 0000000003e3e937 when hard lockup.
Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB.

   [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
    ffffc90006fb7c300000000000000020 0000000000000000
    ffffc90006fb7c40ffffc90006fb7d40 ffff88812cbd3000
    ffffc90006fb7c50ffffc90006fb7d30 0000000106fb7de8
    ffffc90006fb7c60ffffea04a2197008 ffffea0006ed4a48
    ffffc90006fb7c700000000000000000 0000000000000000
    ffffc90006fb7c800000000000000000 0000000000000000
    ffffc90006fb7c900000000000000000 0000000000000000
    ffffc90006fb7ca00000000000000000 0000000003e3e937
    ffffc90006fb7cb00000000000000000 0000000000000000
    ffffc90006fb7cc08d7c0b56b7874b00 ffff88812cbd3000

About the Fixes:
Why did it take eight years to be discovered?

The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.

If the memory is not large enough, or if the usage design of ZONE_DMA32
area memory is reasonable, this problem is difficult to detect.

notes:
The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL,
but other suitable scenarios may also trigger the problem.

Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn
Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: liuye <liuye@kylinos.cn>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 months agosh: boards: Use imply to enable hardware with complex dependencies
Geert Uytterhoeven [Fri, 24 Jan 2025 08:39:19 +0000 (09:39 +0100)]
sh: boards: Use imply to enable hardware with complex dependencies

If CONFIG_I2C=n:

    WARNING: unmet direct dependencies detected for SND_SOC_AK4642
      Depends on [n]: SOUND [=y] && SND [=y] && SND_SOC [=y] && I2C [=n]
      Selected by [y]:
      - SH_7724_SOLUTION_ENGINE [=y] && CPU_SUBTYPE_SH7724 [=y] && SND_SIMPLE_CARD [=y]

    WARNING: unmet direct dependencies detected for SND_SOC_DA7210
      Depends on [n]: SOUND [=y] && SND [=y] && SND_SOC [=y] && SND_SOC_I2C_AND_SPI [=n]
      Selected by [y]:
      - SH_ECOVEC [=y] && CPU_SUBTYPE_SH7724 [=y] && SND_SIMPLE_CARD [=y]

Fix this by replacing select by imply, instead of adding a dependency on
I2C.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501240836.OvXqmANX-lkp@intel.com/
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2 months agosh: Migrate to the generic rule for built-in DTB
Masahiro Yamada [Sun, 22 Dec 2024 00:32:07 +0000 (09:32 +0900)]
sh: Migrate to the generic rule for built-in DTB

Commit 654102df2ac2 ("kbuild: add generic support for built-in
boot DTBs") introduced generic support for built-in DTBs.

Select GENERIC_BUILTIN_DTB when built-in DTB support is enabled.

To keep consistency across architectures, this commit also renames
CONFIG_USE_BUILTIN_DTB to CONFIG_BUILTIN_DTB, and
CONFIG_BUILTIN_DTB_SOURCE to CONFIG_BUILTIN_DTB_NAME.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2 months agosh: irq: Use seq_put_decimal_ull_width() for decimal values
David Wang [Sat, 30 Nov 2024 13:49:09 +0000 (21:49 +0800)]
sh: irq: Use seq_put_decimal_ull_width() for decimal values

On a system with n CPUs and m interrupts, there will be n*m decimal
values yielded via seq_printf(.."%10u "..) which has significant costs
parsing format string and is less efficient than seq_put_decimal_ull_width().
Stress reading /proc/interrupts indicates ~30% performance improvement with
this patch.

Signed-off-by: David Wang <00107082@163.com>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
2 months agoMerge tag 'for-linus-hexagon-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 1 Feb 2025 04:11:24 +0000 (20:11 -0800)]
Merge tag 'for-linus-hexagon-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/bcain/linux

Pull hexagon updates from Brian Cain:

 - Move kernel prototypes out of uapi header to internal header

 - Fix to address an unbalanced spinlock

 - Miscellaneous patches to fix static checks

 - Update bcain@quicinc.com->brian.cain@oss.qualcomm.com

* tag 'for-linus-hexagon-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/bcain/linux:
  MAINTAINERS: Update my email address
  hexagon: Fix unbalanced spinlock in die()
  hexagon: Fix warning comparing pointer to 0
  hexagon: Move kernel prototypes out of uapi/asm/setup.h header
  hexagon: time: Remove redundant null check for resource
  hexagon: fix using plain integer as NULL pointer warning in cmpxchg

2 months agoRemove stale generated 'genheaders' file
Linus Torvalds [Sat, 1 Feb 2025 03:49:17 +0000 (19:49 -0800)]
Remove stale generated 'genheaders' file

This bogus stale file was added in commit 101971298be2 ("riscv: add a
warning when physical memory address overflows").  It's the old location
for what is now 'security/selinux/genheaders'.

It looks like it got incorrectly committed back when that file was in
the old location, and then rebasing kept the bogus file alive.

Reported-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/linux-riscv/20250201020003.GA77370@sol.localdomain/
Fixes: 101971298be2 ("riscv: add a warning when physical memory address overflows")
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 months agoMerge tag 'AT_EXECVE_CHECK-v6.14-rc1-fix1' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 1 Feb 2025 01:12:31 +0000 (17:12 -0800)]
Merge tag 'AT_EXECVE_CHECK-v6.14-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull AT_EXECVE_CHECK selftest fix from Kees Cook:
 "Fixes the AT_EXECVE_CHECK selftests which didn't run on old versions
  of glibc"

* tag 'AT_EXECVE_CHECK-v6.14-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests: Handle old glibc without execveat(2)

2 months agoMerge tag 'hardening-v6.14-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 1 Feb 2025 01:10:26 +0000 (17:10 -0800)]
Merge tag 'hardening-v6.14-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening fixes from Kees Cook:
 "This is a fix for the soon to be released GCC 15 which has regressed
  its initialization of unions when performing explicit initialization
  (i.e. a general problem, not specifically a hardening problem; we're
  just carrying the fix).

  Details in the final patch, Acked by Masahiro, with updated selftests
  to validate the fix"

* tag 'hardening-v6.14-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kbuild: Use -fzero-init-padding-bits=all
  stackinit: Add union initialization to selftests
  stackinit: Add old-style zero-init syntax to struct tests