www.infradead.org Git - users/jedix/linux-maple.git/log

spanning_ascend fixes

fix gaps by moving sibling check earlier
fix node_finalise by making cp->end +1, was skipping last node
Continue if left and right meet (l_wr_mas->mas->noded == r_wr_mas..

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

split destination off by one

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

gap fix for non leaf

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

mt_validate() detect root of size 0

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

node_copy(): try to fix max of nodes

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

node_copy setting wrong parent slot

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

mt_pivots incorrect for maple_copy

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

spanning_data_calc() one off error

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

readability of d increment and gap fix

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

fix max when rewinding

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: Create ma_leaf_max_gap() to not depend on maple state

Pass through the min, max, pivots, and slots so that a maple state is not needed.

This is neede dlater to avoid the min and max extraction from the struct which isn't always available.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

tests are failing

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: Increase spanning store testing and rcu cleanup

Add more spanning store tests and validate the tree after the tests.

When rcu tests fail, many fail and dump the tree. Add a lock to only
allow the first to dump the tree while the rest exit.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

debug and warn on

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

node_finalise was a mess

meta needs to be set regardless for arange

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

height fix

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

ascend cleanup

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

node finalise new root\n

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

height fix

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

fix new node location

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

new root broken

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

fix whitespace

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

drop wr_l_mas and l_mas from mas_wr_spanning_store

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

drop mast from spanning rebalance

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

big hunk of code, drop big node from spanning store

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

add cp.pivot to ma_pivots

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

cp.min and cp.max fixes

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_copy: Increase slots to 3

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

fix destination max setting

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

add dummy loop

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

stop using maple_subtree_state

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

abstract split dest and src

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

abstract data calc from mas_wr_spanning_rebalance()

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

runs through

probably an issue with the end slot checking in mas_wr_spanning_rebalance()
Probably an issue ith the split calculation for 3 nodes

But I'll kick myself later when I didn't take the time to fix them now.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

broken progress

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: Make ma_wr_states reliable for reuse in spanning store

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

broken count

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: inline mas_spanning_rebalance() into mas_wr_spanning_rebalance()

Copy the contents of mas_spanning_rebalance() into
mas_wr_spanning_rebalance(), in preparation of removing initial big node
use.

No functional changes intended.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: Remove unnecessary assignment of orig_l index

The index value is already a copy of the maple state so there is no need
to set it again.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: Extract use of big node from mas_wr_spanning_store()

Isolate big node to use in its own function.

No functional changes intended.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

maple_tree: Move mas_spanning_rebalance loop to function

Move the loop over the tree levels to its own function.

No intended functional changes.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

drivers/base: move memory_block_add_nid() into the caller

Now the node id only needs to be set for early memory, so move
memory_block_add_nid() into the caller and rename it into
memory_block_add_nid_early(). This allows us to further simplify the code
by dropping the 'context' argument to
do_register_memory_block_under_node().

Link: https://lkml.kernel.org/r/20250729064637.51662-4-hare@kernel.org
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory_hotplug: activate node before adding new memory blocks

The sysfs attributes for memory blocks require the node ID to be set and
initialized, so move the node activation before adding new memory blocks.
This also has the nice side effect that the BUG_ON() can be converted into
a WARN_ON() as we now can handle registration errors.

Link: https://lkml.kernel.org/r/20250729064637.51662-3-hare@kernel.org
Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/base/memory: add node id parameter to add_memory_block()

We have some udev rules trying to read the sysfs attribute 'valid_zones'
during an memory 'add' event, causing a crash in zone_for_pfn_range().
Debugging found that mem->nid was set to NUMA_NO_NODE, which crashed in
NODE_DATA(nid).  Further analysis revealed that we're running into a race
with udev event processing: add_memory_resource() has this function calls:

1) __try_online_node()
2) arch_add_memory()
3) create_memory_block_devices()
  -> calls device_register() -> memory 'add' event
4) node_set_online()/__register_one_node()
  -> calls device_register() -> node 'add' event
5) register_memory_blocks_under_node()
  -> sets mem->nid

Which, to the uninitated, is ... weird ...

Why do we try to online the node in 1), but only register the node in 4)
_after_ we have created the memory blocks in 3) ?  And why do we set the
'nid' value in 5), when the uevent (which might need to see the correct
'nid' value) is sent out in 3) ?  There must be a reason, I'm sure ...

So here's a small patchset to fixup uevent ordering.  The first patch adds
a 'nid' parameter to add_memory_blocks() (to avoid mem->nid being
initialized with NUMA_NO_NODE), and the second patch reshuffles the code
in add_memory_resource() to fully initialize the node prior to calling
create_memory_block_devices() so that the node is valid at that time and
uevent processing will see correct values in sysfs.

This patch (of 3):

Add a 'nid' parameter to add_memory_block() to initialize the memory block
with the correct node id.

Link: https://lkml.kernel.org/r/20250729064637.51662-1-hare@kernel.org
Link: https://lkml.kernel.org/r/20250729064637.51662-2-hare@kernel.org
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mpage: convert do_mpage_readpage() to return int type

The return value of do_mpage_readpage() is arg->bio, which is already set
in the arg structure. Returning it again is redundant.

This patch changes the return type to int and always returns 0 since the
caller does not care about the return value.

Link: https://lkml.kernel.org/r/20250812072225.181798-3-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mpage: terminate read-ahead on read error

For exFAT filesystems with 4MB read_ahead_size, removing the storage
device during read operations can delay EIO error reporting by several
minutes. This occurs because the read-ahead implementation in mpage
doesn't handle errors.

Another reason for the delay is that the filesystem requires metadata to
issue file read request. When the storage device is removed, the metadata
buffers are invalidated, causing mpage to repeatedly attempt to fetch
metadata during each get_block call.

The original purpose of this patch is terminate read ahead when we fail to
get metadata, to make the patch more generic, implement it by checking
folio status, instead of checking the return of get_block().

Link: https://lkml.kernel.org/r/20250812072225.181798-1-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/filemap: Skip non-uptodate folio if there are available folios

When reading data exceeding the maximum IO size, the operation is split
into multiple IO requests, but the data isn't immediately copied to
userspace after each IO completion.

For example, when reading 2560k data from a device with 1280k maximum IO
size, the following sequence occurs:

1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. wait the next 1280k
5. copy 8 pages to user buffer
6. copy 20 folios(64k) to user buffer

The 8 pages in step 5 are copied after the second 1280k completes(step 4)
due to waiting for a non-uptodate folio in filemap_update_page. We can
copy the 8 pages before the second 1280k completes(step 4) to reduce the
latency of this read operation.

After applying the patch, these 8 pages will be copied before the next IO
completes:

1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. copy 8 pages to user buffer
5. wait the next 1280k
6. copy 20 folios(64k) to user buffer

This patch drops a setting of IOCB_NOWAIT for AIO, which is fine because
filemap_read will set it again for AIO.

The final solution provided by Matthew Wilcox:
Link: https://lore.kernel.org/linux-fsdevel/aIDy076Sxt544qja@casper.infradead.org/
Link: https://lkml.kernel.org/r/20250728083952.75518-3-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/filemap: do not use is_partially_uptodate for entire folio

Patch series "Tiny optimization for large read operations".

This series contains two patches,

1. Skip calling is_partially_uptodate for entire folio to save time, I
   have reviewed the mpage and iomap implementations and didn't spot any
   issues, but this change likely needs more thorough review.

2. Skip calling filemap_uptodate if there are ready folios in the
   batch, This might save a few milliseconds in practice, but I didn't
   observe measurable improvements in my tests.

This patch (of 2):

When a folio is marked as non-uptodate, it means the folio contains some
non-uptodate data.  Therefore, calling is_partially_uptodate() to recheck
the entire folio is redundant.

If all data in a folio is actually up-to-date but the folio lacks the
uptodate flag, it will still be treated as non-uptodate in many other
places.  Thus, there should be no special case handling for filemap.

Link: https://lkml.kernel.org/r/20250728083952.75518-1-chizhiling@163.com
Link: https://lkml.kernel.org/r/20250728083952.75518-2-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/rmap: use folio_large_nr_pages() when we are sure it is a large folio

Non-large folio is handled at the beginning, so it is a large folio for
sure.

Use folio_large_nr_pages() here like elsewhere.

Link: https://lkml.kernel.org/r/20250817032647.29147-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/rmap: not necessary to mask off FOLIO_PAGES_MAPPED

At this point, we are in an if branch conditional on (nr <
ENTIRELY_MAPPED), and FOLIO_PAGES_MAPPED is equal to (ENTIRELY_MAPPED -
1). This means the upper bits are already cleared.

It is not necessary to mask it off.

Link: https://lkml.kernel.org/r/20250817032647.29147-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm/uffd: refactor non-composite global vars into struct

Refactor macros and non-composite global variable definitions into a
struct that is defined at the start of a test and is passed around
instead of relying on global vars.

Link: https://lkml.kernel.org/r/20250817065211.855-1-ujwal.kundur@gmail.com
Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/tests/core-kunit: add damos_commit_filter test

Add unit test to verify that damos_commmit_filter() change dest value
well.

Link: https://lkml.kernel.org/r/20250817021348.570692-1-ekffu200098@gmail.com
Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix duplicate accounting of free pages in should_reclaim_retry()

In the zone_reclaimable_pages() function, if the page counts for
NR_ZONE_INACTIVE_FILE, NR_ZONE_ACTIVE_FILE, NR_ZONE_INACTIVE_ANON, and
NR_ZONE_ACTIVE_ANON are all zero, the function returns the number of free
pages as the result.

In this case, when should_reclaim_retry() calculates reclaimable pages, it
will inadvertently double-count the free pages in its accounting.

static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                     struct alloc_context *ac, int alloc_flags,
                     bool did_some_progress, int *no_progress_loops)
{
        ...
                available = reclaimable = zone_reclaimable_pages(zone);
                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

Link: https://lkml.kernel.org/r/20250812070210.1624218-1-liuqiqi@kylinos.cn
Fixes: 6aaced5abd32 ("mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()")
Signed-off-by: liuqiqi <liuqiqi@kylinos.cn>
Reviewed-by: Ye Liu <liuye@kylinos.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: add folio_is_pci_p2pdma()

Reimplement is_pci_p2pdma_page() in terms of folio_is_pci_p2pdma(). Moves
the page_folio() call from inside page_pgmap() to is_pci_p2pdma_page().
This removes a page_folio() call from try_grab_folio() which already has a
folio and can pass it in.

Link: https://lkml.kernel.org/r/20250805172307.1302730-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: reimplement folio_is_fsdax()

For callers of folio_is_fsdax(), we save a folio->page->folio conversion.
Callers of is_fsdax_page() simply move the conversion of page->folio from
the implementation of page_pgmap() to is_fsdax_page().

Link: https://lkml.kernel.org/r/20250805172307.1302730-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: reimplement folio_is_device_coherent()

For callers of folio_is_device_coherent(), we save a folio->page->folio
conversion. Callers of is_device_coherent_page() simply move the
conversion of page->folio from the implementation of page_pgmap() to
is_device_coherent_page().

Link: https://lkml.kernel.org/r/20250805172307.1302730-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: reimplement folio_is_device_private()

For callers of folio_is_device_private(), we save a folio->page->folio
conversion. Callers of is_device_private_page() simply move the
conversion of page->folio from the implementation of page_pgmap() to
is_device_private_page().

Link: https://lkml.kernel.org/r/20250805172307.1302730-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: introduce memdesc_is_zone_device()

Remove the conversion from folio to page in folio_is_zone_device() by
introducing memdesc_is_zone_device() which takes a memdesc_flags_t from
either a page or a folio.

Link: https://lkml.kernel.org/r/20250805172307.1302730-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab: use memdesc_nid()

We no longer need to convert from slab to folio to get the nid, we can ask
memdesc_nid() for the nid directly.

Link: https://lkml.kernel.org/r/20250805172307.1302730-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab: use memdesc_flags_t

The slab flags are memdesc flags and contain the same information in the
upper bits as the other memdescs (like node ID).

Link: https://lkml.kernel.org/r/20250805172307.1302730-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: introduce memdesc_zonenum()

Remove a conversion from folio to page by passing the folio->flags (which
are a copy of the page->flags) to the new memdesc_zonenum() function.

Link: https://lkml.kernel.org/r/20250805172307.1302730-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: introduce memdesc_nid()

Remove a conversion from folio to page by passing the folio->flags (which
are a copy of the page->flags) to the new memdesc_nid() function.

Link: https://lkml.kernel.org/r/20250805172307.1302730-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: convert page_to_section() to memdesc_section()

Pass in the memdesc_flags_t instead of a pointer to the page. This will
allow us to remove a few conversions to struct page in upcoming patches.

Link: https://lkml.kernel.org/r/20250805172307.1302730-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: introduce memdesc_flags_t

Patch series "Add and use memdesc_flags_t".

At some point struct page will be separated from struct slab and struct
folio.  This is a step towards that by introducing a type for the 'flags'
word of all three structures.  This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits.  That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.

There's going to be some unusual merge problems with this as some odd bits
of the kernel decide they want to print out the flags value or something
similar by writing page->flags and now they'll need to write page->flags.f
instead.  That's most of the churn here.  Maybe we should be removing
these things from the debug output?

This patch (of 11):

Wrap the unsigned long flags in a typedef.  In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.

Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/Kconfig: make DAMON_STAT_ENABLED_DEFAULT depend on DAMON_STAT

The DAMON_STAT_ENABLED_DEFAULT option is strongly tied to DAMON_STAT
option -- enabling it alone is meaningless. This patch makes
DAMON_STAT_ENABLED_DEFAULT depend on DAMON_STAT, ensuring functional
consistency.

Link: https://lkml.kernel.org/r/20250815092110.811757-1-lienze@kylinos.cn
Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: prctl: introduce tests for disabling THPs except for madvise

The test will set the global system THP setting to never, madvise or
always depending on the fixture variant and the 2M setting to inherit
before it starts (and reset to original at teardown).  The fixture setup
will also test if PR_SET_THP_DISABLE prctl call can be made with
PR_THP_DISABLE_EXCEPT_ADVISED and skip if it fails.

This tests if the process can:
- successfully get the policy to disable THPs expect for madvise.
- get hugepages only on MADV_HUGE and MADV_COLLAPSE if the global policy
  is madvise/always and only with MADV_COLLAPSE if the global policy is
  never.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
  - MADV_COLLAPSE when policy is set to never.
  - MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
  - always when policy is set to "always".
- never get a THP with MADV_NOHUGEPAGE.
- repeat the above tests in a forked process to make sure  the policy is
  carried across forks.

Test results:
./prctl_thp_disable
TAP version 13
1..12
ok 1 prctl_thp_disable_completely.never.nofork
ok 2 prctl_thp_disable_completely.never.fork
ok 3 prctl_thp_disable_completely.madvise.nofork
ok 4 prctl_thp_disable_completely.madvise.fork
ok 5 prctl_thp_disable_completely.always.nofork
ok 6 prctl_thp_disable_completely.always.fork
ok 7 prctl_thp_disable_except_madvise.never.nofork
ok 8 prctl_thp_disable_except_madvise.never.fork
ok 9 prctl_thp_disable_except_madvise.madvise.nofork
ok 10 prctl_thp_disable_except_madvise.madvise.fork
ok 11 prctl_thp_disable_except_madvise.always.nofork
ok 12 prctl_thp_disable_except_madvise.always.fork

Link: https://lkml.kernel.org/r/20250815135549.130506-8-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests: prctl: introduce tests for disabling THPs completely

The test will set the global system THP setting to never, madvise or
always depending on the fixture variant and the 2M setting to inherit
before it starts (and reset to original at teardown).  The fixture setup
will also test if PR_SET_THP_DISABLE prctl call can be made to disable all
THPs and skip if it fails.

This tests if the process can:
- successfully get the policy to disable THPs completely.
- never get a hugepage when the THPs are completely disabled
  with the prctl, including with MADV_HUGE and MADV_COLLAPSE.
- successfully reset the policy of the process.
- after reset, only get hugepages with:
  - MADV_COLLAPSE when policy is set to never.
  - MADV_HUGE and MADV_COLLAPSE when policy is set to madvise.
  - always when policy is set to "always".
- never get a THP with MADV_NOHUGEPAGE.
- repeat the above tests in a forked process to make sure
  the policy is carried across forks.

Link: https://lkml.kernel.org/r/20250815135549.130506-7-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftest/mm: extract sz2ord function into vm_util.h

The function already has 2 uses and will have a 3rd one in prctl
selftests. The pagesize argument is added into the function, as it's not
a global variable anymore. No functional change intended with this patch.

Link: https://lkml.kernel.org/r/20250815135549.130506-6-usamaarif642@gmail.com
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

docs: transhuge: document process level THP controls

This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of prctl
calls as well the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED flag for
the PR_SET_THP_DISABLE prctl call.

Link: https://lkml.kernel.org/r/20250815135549.130506-5-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED

Let's allow for making MADV_COLLAPSE succeed on areas that neither have
VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled unless explicitly
advised (PR_THP_DISABLE_EXCEPT_ADVISED).

MADV_COLLAPSE is a clear advice that we want to collapse.

Note that we still respect the VM_NOHUGEPAGE flag, just like
MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED,
including for shmem.

Link: https://lkml.kernel.org/r/20250815135549.130506-4-usamaarif642@gmail.com
Co-developed-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/huge_memory: convert "tva_flags" to "enum tva_type"

When determining which THP orders are eligible for a VMA mapping, we have
previously specified tva_flags, however it turns out it is really not
necessary to treat these as flags.

Rather, we distinguish between distinct modes.

The only case where we previously combined flags was with
TVA_ENFORCE_SYSFS, but we can avoid this by observing that this is the
default, except for MADV_COLLAPSE or an edge cases in
collapse_pte_mapped_thp() and hugepage_vma_revalidate(), and adding a mode
specifically for this case - TVA_FORCED_COLLAPSE.

We have:
* smaps handling for showing "THPeligible"
* Pagefault handling
* khugepaged handling
* Forced collapse handling: primarily MADV_COLLAPSE, but also for
  an edge case in collapse_pte_mapped_thp()

Disregarding the edge cases, we only want to ignore sysfs settings only
when we are forcing a collapse through MADV_COLLAPSE, otherwise we want to
enforce it, hence this patch does the following flag to enum conversions:

* TVA_SMAPS | TVA_ENFORCE_SYSFS -> TVA_SMAPS
* TVA_IN_PF | TVA_ENFORCE_SYSFS -> TVA_PAGEFAULT
* TVA_ENFORCE_SYSFS             -> TVA_KHUGEPAGED
* 0                             -> TVA_FORCED_COLLAPSE

With this change, we immediately know if we are in the forced collapse
case, which will be valuable next.

Link: https://lkml.kernel.org/r/20250815135549.130506-3-usamaarif642@gmail.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

prctl: extend PR_SET_THP_DISABLE to optionally exclude VM_HUGEPAGE

Patch series "prctl: extend PR_SET_THP_DISABLE to only provide THPs when
advised", v5.

This will allow individual processes to opt-out of THP = "always" into THP
= "madvise", without affecting other workloads on the system.  This has
been extensively discussed on the mailing list and has been summarized
very well by David in the first patch which also includes the links to
alternatives, please refer to the first patch commit message for the
motivation for this series.

Patch 1 adds the PR_THP_DISABLE_EXCEPT_ADVISED flag to implement this,
along with the MMF changes.

Patch 2 is a cleanup patch for tva_flags that will allow the forced
collapse case to be transmitted to vma_thp_disabled (which is done in
patch 3).

Patch 4 adds documentation for PR_SET_THP_DISABLE/PR_GET_THP_DISABLE.

Patches 6-7 implement the selftests for PR_SET_THP_DISABLE for completely
disabling THPs (old behaviour) and only enabling it at advise
(PR_THP_DISABLE_EXCEPT_ADVISED).

This patch (of 7):

People want to make use of more THPs, for example, moving from the "never"
system policy to "madvise", or from "madvise" to "always".

While this is great news for every THP desperately waiting to get
allocated out there, apparently there are some workloads that require a
bit of care during that transition: individual processes may need to
opt-out from this behavior for various reasons, and this should be
permitted without needing to make all other workloads on the system
similarly opt-out.

The following scenarios are imaginable:

(1) Switch from "none" system policy to "madvise"/"always", but keep THPs
    disabled for selected workloads.

(2) Stay at "none" system policy, but enable THPs for selected
    workloads, making only these workloads use the "madvise" or "always"
    policy.

(3) Switch from "madvise" system policy to "always", but keep the
    "madvise" policy for selected workloads: allocate THPs only when
    advised.

(4) Stay at "madvise" system policy, but enable THPs even when not advised
    for selected workloads -- "always" policy.

Once can emulate (2) through (1), by setting the system policy to
"madvise"/"always" while disabling THPs for all processes that don't want
THPs.  It requires configuring all workloads, but that is a user-space
problem to sort out.

(4) can be emulated through (3) in a similar way.

Back when (1) was relevant in the past, as people started enabling THPs,
we added PR_SET_THP_DISABLE, so relevant workloads that were not ready yet
(i.e., used by Redis) were able to just disable THPs completely.  Redis
still implements the option to use this interface to disable THPs
completely.

With PR_SET_THP_DISABLE, we added a way to force-disable THPs for a
workload -- a process, including fork+exec'ed process hierarchy.  That
essentially made us support (1): simply disable THPs for all workloads
that are not ready for THPs yet, while still enabling THPs system-wide.

The quest for handling (3) and (4) started, but current approaches
(completely new prctl, options to set other policies per process,
alternatives to prctl -- mctrl, cgroup handling) don't look particularly
promising.  Likely, the future will use bpf or something similar to
implement better policies, in particular to also make better decisions
about THP sizes to use, but this will certainly take a while as that work
just started.

Long story short: a simple enable/disable is not really suitable for the
future, so we're not willing to add completely new toggles.

While we could emulate (3)+(4) through (1)+(2) by simply disabling THPs
completely for these processes, this is a step backwards, because these
processes can no longer allocate THPs in regions where THPs were
explicitly advised: regions flagged as VM_HUGEPAGE.  Apparently, that
imposes a problem for relevant workloads, because "not THPs" is certainly
worse than "THPs only when advised".

Could we simply relax PR_SET_THP_DISABLE, to "disable THPs unless not
explicitly advised by the app through MAD_HUGEPAGE"?  *maybe*, but this
would change the documented semantics quite a bit, and the versatility to
use it for debugging purposes, so I am not 100% sure that is what we want
-- although it would certainly be much easier.

So instead, as an easy way forward for (3) and (4), add an option to
make PR_SET_THP_DISABLE disable *less* THPs for a process.

In essence, this patch:

(A) Adds PR_THP_DISABLE_EXCEPT_ADVISED, to be used as a flag in arg3
    of prctl(PR_SET_THP_DISABLE) when disabling THPs (arg2 != 0).

    prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED).

(B) Makes prctl(PR_GET_THP_DISABLE) return 3 if
    PR_THP_DISABLE_EXCEPT_ADVISED was set while disabling.

    Previously, it would return 1 if THPs were disabled completely. Now
    it returns the set flags as well: 3 if PR_THP_DISABLE_EXCEPT_ADVISED
    was set.

(C) Renames MMF_DISABLE_THP to MMF_DISABLE_THP_COMPLETELY, to express
    the semantics clearly.

    Fortunately, there are only two instances outside of prctl() code.

(D) Adds MMF_DISABLE_THP_EXCEPT_ADVISED to express "no THP except for VMAs
    with VM_HUGEPAGE" -- essentially "thp=madvise" behavior

    Fortunately, we only have to extend vma_thp_disabled().

(E) Indicates "THP_enabled: 0" in /proc/pid/status only if THPs are
    disabled completely

    Only indicating that THPs are disabled when they are really disabled
    completely, not only partially.

    For now, we don't add another interface to obtained whether THPs
    are disabled partially (PR_THP_DISABLE_EXCEPT_ADVISED was set). If
    ever required, we could add a new entry.

The documented semantics in the man page for PR_SET_THP_DISABLE "is
inherited by a child created via fork(2) and is preserved across
execve(2)" is maintained.  This behavior, for example, allows for
disabling THPs for a workload through the launching process (e.g., systemd
where we fork() a helper process to then exec()).

For now, MADV_COLLAPSE will *fail* in regions without VM_HUGEPAGE and
VM_NOHUGEPAGE.  As MADV_COLLAPSE is a clear advise that user space thinks
a THP is a good idea, we'll enable that separately next (requiring a bit
of cleanup first).

There is currently not way to prevent that a process will not issue
PR_SET_THP_DISABLE itself to re-enable THP.  There are not really known
users for re-enabling it, and it's against the purpose of the original
interface.  So if ever required, we could investigate just forbidding to
re-enable them, or make this somehow configurable.

Link: https://lkml.kernel.org/r/20250815135549.130506-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20250815135549.130506-2-usamaarif642@gmail.com
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yafang <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: readahead: improve mmap_miss heuristic for concurrent faults

If two or more threads of an application faulting on the same folio, the
mmap_miss counter can be decreased multiple times.  It breaks the
mmap_miss heuristic and keeps the readahead enabled even under extreme
levels of memory pressure.

It happens often if file folios backing a multi-threaded application are
getting evicted and re-faulted.

Fix it by skipping decreasing mmap_miss if the folio is locked.

This change was evaluated on several hundred thousands hosts in Google's
production over a couple of weeks.  The number of containers being stuck
in a vicious reclaim cycle for a long time was reduced several fold
(~10-20x), as well as the overall fleet-wide cpu time spent in direct
memory reclaim was meaningfully reduced.  No regressions were observed.

Link: https://lkml.kernel.org/r/20250815183224.62007-1-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: skip hugepage-mremap test if userfaultfd unavailable

Gracefully skip test if userfaultfd is not supported (ENOSYS) or not
permitted (EPERM), instead of failing. This avoids misleading failures
with clear skip messages.

--------------
Before Patch
--------------
~ running ./hugepage-mremap
...
~ Bail out! userfaultfd: Function not implemented
~ Planned tests != run tests (1 != 0)
~ Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
~ [FAIL]
not ok 4 hugepage-mremap # exit=1

--------------
After Patch
--------------
~ running ./hugepage-mremap
...
~ ok 2 # SKIP userfaultfd is not supported/not enabled.
~ 1 skipped test(s) detected.
~ Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0
~ [SKIP]
ok 4 hugepage-mremap # SKIP

Link: https://lkml.kernel.org/r/20250816040113.760010-8-aboorvad@linux.ibm.com
Co-developed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: skip thuge-gen test if system is not setup properly

Make thuge-gen skip instead of fail when it can't run due to system
settings. If shmmax is too small or no 1G huge pages are available, the
test now prints a warning and is marked as skipped.

-------------------
Before Patch:
-------------------
~ running ./thuge-gen
~ Bail out! Please do echo 262144 > /proc/sys/kernel/shmmax
~ Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
~ [FAIL]
not ok 28 thuge-gen ~ exit=1

-------------------
After Patch:
-------------------
~ running ./thuge-gen
~ ~ WARNING: shmmax is too small to run this test.
~ ~ Please run the following command to increase shmmax:
~ ~ echo 262144 > /proc/sys/kernel/shmmax
~ 1..0 ~ SKIP Test skipped due to insufficient shmmax value.
~ [SKIP]
ok 29 thuge-gen ~ SKIP

Link: https://lkml.kernel.org/r/20250816040113.760010-7-aboorvad@linux.ibm.com
Co-developed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: fix child process exit codes in ksm_functional_tests

In ksm_functional_tests, test_child_ksm() returned negative values to
indicate errors.  However, when passed to exit(), these were interpreted
as large unsigned values (e.g, -2 became 254), leading to incorrect
handling in the parent process.  As a result, some tests appeared to be
skipped or silently failed.

This patch changes test_child_ksm() to return positive error codes (1, 2,
3) and updates test_child_ksm_err() to interpret them correctly.
Additionally, test_prctl_fork_exec() now uses exit(4) after a failed
execv() to clearly signal exec failures.  This ensures the parent
accurately detects and reports child process failures.

--------------
Before patch:
--------------
- [RUN] test_unmerge
ok 1 Pages were unmerged
...
- [RUN] test_prctl_fork
- No pages got merged
- [RUN] test_prctl_fork_exec
ok 7 PR_SET_MEMORY_MERGE value is inherited
...
Bail out! 1 out of 8 tests failed
- Planned tests != run tests (9 != 8)
- Totals: pass:7 fail:1 xfail:0 xpass:0 skip:0 error:0

--------------
After patch:
--------------
- [RUN] test_unmerge
ok 1 Pages were unmerged
...
- [RUN] test_prctl_fork
- No pages got merged
not ok 7 Merge in child failed
- [RUN] test_prctl_fork_exec
ok 8 PR_SET_MEMORY_MERGE value is inherited
...
Bail out! 2 out of 9 tests failed
- Totals: pass:7 fail:2 xfail:0 xpass:0 skip:0 error:0

Link: https://lkml.kernel.org/r/20250816040113.760010-6-aboorvad@linux.ibm.com
Fixes: 6c47de3be3a0 ("selftest/mm: ksm_functional_tests: extend test case for ksm fork/exec")
Co-developed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/selftests: fix split_huge_page_test failure on systems with 64KB page size

The split_huge_page_test fails on systems with a 64KB base page size.
This is because the order of a 2MB huge page is different:

On 64KB systems, the order is 5.

On 4KB systems, it's 9.

The test currently assumes a maximum huge page order of 9, which is only
valid for 4KB base page systems. On systems with 64KB pages, attempting
to split huge pages beyond their actual order (5) causes the test to fail.

In this patch, we calculate the huge page order based on the system's base
page size. With this change, the tests now run successfully on both 64KB
and 4KB page size systems.

Link: https://lkml.kernel.org/r/20250816040113.760010-5-aboorvad@linux.ibm.com
Fixes: fa6c02315f74 ("mm: huge_memory: a new debugfs interface for splitting THP tests")
Co-developed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftest/mm: fix ksm_funtional_test failures

This patch fixes 2 issues.

1) After fork() in test_prctl_fork, the child process uses the file
   descriptors from the parent process to read ksm_stat and
   ksm_merging_pages.  This results in incorrect values being read (parent
   process ksm_stat and ksm_merging_pages will be read in child), causing
   the test to fail.

   This patch calls init_global_file_handles() in the child process to
   ensure that the current process's file descriptors are used to read
   ksm_stat and ksm_merging_pages.

2) All tests currently call ksm_merge to trigger page merging.  To
   ensure the system remains in a consistent state for subsequent tests,
   it is better to call ksm_unmerge during the test cleanup phase

   In the test_prctl_fork test, after a fork(), reading
   ksm_merging_pages in the child process returns a non-zero value because
   a previous test performed a merge, and the child's memory state is
   inherited from the parent.

   Although the child process calls ksm_unmerge, the ksm_merging_pages
   counter in the parent is reset to zero, while the child's counter
   remains unchanged.  This discrepancy causes the test to fail.

   To avoid this issue, each test should call ksm_unmerge during
   cleanup to ensure the counter is reset and the system is in a clean
   state for subsequent tests.

execv argument is an array of pointers to null-terminated strings.  In
this patch we also added NULL in the execv argument.

Link: https://lkml.kernel.org/r/20250816040113.760010-4-aboorvad@linux.ibm.com
Fixes: 6c47de3be3a0 ("selftest/mm: ksm_functional_tests: extend test case for ksm fork/exec")
Co-developed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: add support to test 4PB VA on PPC64

PowerPC64 supports a 4PB virtual address space, but this test was
previously limited to 512TB. This patch extends the coverage up to the
full 4PB VA range on PowerPC64.

Memory from 0 to 128TB is allocated without an address hint, while
allocations from 128TB to 4PB use a hint address.

Link: https://lkml.kernel.org/r/20250816040113.760010-3-aboorvad@linux.ibm.com
Co-developed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/selftests: fix incorrect pointer being passed to mark_range()

Patch series "selftests/mm: Fix false positives and skip unsupported
tests", v4.

This patch series addresses false positives in the generic mm selftests
and skips tests that cannot run correctly due to missing features or
system limitations.

This patch (of 7):

In main(), the high address is stored in hptr, but for mark_range(), the
address passed is ptr, not hptr. Fixed this by changing ptr[i] to hptr[i]
in mark_range() function call.

Link: https://lkml.kernel.org/r/20250816040113.760010-1-aboorvad@linux.ibm.com
Link: https://lkml.kernel.org/r/20250816040113.760010-2-aboorvad@linux.ibm.com
Fixes: b2a79f62133a ("selftests/mm: virtual_address_range: unmap chunks after validation")
Co-developed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/oom_kill: only delay OOM reaper for processes using robust futexes

The OOM reaper can quickly reap a process's memory when the system
encounters OOM, helping the system recover.  Without the OOM reaper, if a
process frozen by cgroup v1 is OOM killed, the victims' memory cannot be
freed, and the system stays in a poor state.  Even if the process is not
frozen by cgroup v1, reaping victims' memory is still meaningful, because
having one more process working speeds up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs.  It would cause the waiters to block indefinitely.
To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
The OOM reaper now rarely runs since many killed processes exit within 2
seconds.

Because robust futex users are few, it is unreasonable to delay OOM reap
for all victims.  For processes that do not hold robust futexes, the OOM
reaper should not be delayed and for processes holding robust futexes, the
OOM reaper must still be delayed to prevent the waiters to block
indefinitely [1].

Link: https://lkml.kernel.org/r/20250814135555.17493-3-zhongjinji@honor.com
Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Cc: Andre Almeida <andrealmeid@igalia.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

futex: introduce function process_has_robust_futex()

Patch series "mm/oom_kill: Only delay OOM reaper for processes using
robust futexes", v4.

The OOM reaper quickly reclaims a process's memory when the system hits
OOM, helping the system recover.  Without the OOM reaper, if a process
frozen by cgroup v1 is OOM killed, the victim's memory cannot be freed,
leaving the system in a poor state.  Even if the process is not frozen by
cgroup v1, reclaiming victims' memory remains important, as having one
more process working speeds up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs.  This can cause the waiters to block indefinitely
[1].

To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
Since many killed processes exit within 2 seconds, the OOM reaper rarely
runs after this delay.  However, robust futex users are few, so delaying
OOM reap for all victims is unnecessary.

If each thread's robust_list in a process is NULL, the process holds no
robust futexes.  For such processes, the OOM reaper should not be delayed.
For processes holding robust futexes, to avoid issue [1], the OOM reaper
must still be delayed.

Patch 1 introduces process_has_robust_futex() to detect whether a process
uses robust futexes.  Patch 2 delays the OOM reaper only for processes
holding robust futexes, improving OOM reaper performance.  Patch 3 makes
the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
to reduce PTE lock contention caused by unmapping the same vma.

This patch (of 3):

When the holders of robust futexes are OOM killed but the waiters on
robust futexes are still alive, the robust futexes might be reaped before
futex_cleanup() runs.  This can cause the waiters to block indefinitely
[1].  To prevent this issue, the OOM reaper's work is delayed by 2 seconds
[1].  However, the OOM reaper now rarely runs since many killed processes
exit within 2 seconds.

Because robust futex users are few, delay the reaper's execution only for
processes holding robust futexes to improve the performance of the OOM
reaper.

Introduce the function process_has_robust_futex() to detect whether a
process uses robust futexes.  If each thread's robust_list in a process is
NULL, it means the process holds no robust futexes.  Conversely, it means
the process holds robust futexes.

Link: https://lkml.kernel.org/r/20250814135555.17493-1-zhongjinji@honor.com
Link: https://lkml.kernel.org/r/20250814135555.17493-2-zhongjinji@honor.com
Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
Signed-off-by: zhongjinji <zhongjinji@honor.com>
Cc: Andre Almeida <andrealmeid@igalia.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Joel Savitz <jsavitz@redhat.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

powerpc: mm: support page table check

On creation and clearing of a page table mapping, instrument such calls by
invoking page_table_check_pte_set and page_table_check_pte_clear
respectively. These calls serve as a sanity check against illegal
mappings.

Enable ARCH_SUPPORTS_PAGE_TABLE_CHECK for all platforms.

See also:

riscv support in commit 3fee229a8eb9 ("riscv/mm: enable
ARCH_SUPPORTS_PAGE_TABLE_CHECK")
arm64 in commit 42b2547137f5 ("arm64/mm: enable
ARCH_SUPPORTS_PAGE_TABLE_CHECK")
x86_64 in commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
check")

[ajd@linux.ibm.com: rebase]
Link: https://lkml.kernel.org/r/20250813062614.51759-14-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

powerpc: mm: use set_pte_at_unchecked() for internal usages

In the new set_ptes() API, set_pte_at() (a special case of set_ptes()) is
intended to be instrumented by the page table check facility.  There are
however several other routines that constitute the API for setting page
table entries, including set_pmd_at() among others.  Such routines are
themselves implemented in terms of set_ptes_at().

A future patch providing support for page table checking on powerpc must
take care to avoid duplicate calls to page_table_check_p{te,md,ud}_set().
Allow for assignment of pte entries without instrumentation through the
set_pte_at_unchecked() routine introduced in this patch.

Cause API-facing routines that call set_pte_at() to instead call
set_pte_at_unchecked(), which will remain uninstrumented by page table
check.  set_ptes() is itself implemented by calls to __set_pte_at(), so
this eliminates redundant code.

[ajd@linux.ibm.com: don't change to unchecked for early boot/kernel mappings]
Link: https://lkml.kernel.org/r/20250813062614.51759-13-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

powerpc: mm: implement *_user_accessible_page() for ptes

Page table checking depends on architectures providing an implementation
of p{te,md,ud}_user_accessible_page. With refactorisations made on
powerpc/mm, the pte_access_permitted() and similar methods verify whether
a userland page is accessible with the required permissions.

Since page table checking is the only user of
p{te,md,ud}_user_accessible_page(), implement these for all platforms,
using some of the same preliminary checks taken by pte_access_permitted()
on that platform.

Since commit 8e9bd41e4ce1 ("powerpc/nohash: Replace pte_user() by
pte_read()") pte_user() is no longer required to be present on all
platforms as it may be equivalent to or implied by pte_read(). Hence
implementations of pte_user_accessible_page() are specialised.

[ajd@linux.ibm.com: rebase and fix commit message]
Link: https://lkml.kernel.org/r/20250813062614.51759-12-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

powerpc: mm: add pud_pfn() stub

The page table check feature requires that pud_pfn() be defined on each
consuming architecture. Since only 64-bit, Book3S platforms allow for
hugepages at this upper level, and since the calling code is gated by a
call to pud_user_accessible_page(), which will return zero, include this
stub as a BUILD_BUG().

Link: https://lkml.kernel.org/r/20250813062614.51759-11-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: provide address parameter to p{te,md,ud}_user_accessible_page()

On several powerpc platforms, a page table entry may not imply whether the
relevant mapping is for userspace or kernelspace. Instead, such platforms
infer this by the address which is being accessed.

Add an additional address argument to each of these routines in order to
provide support for page table check on powerpc.

[ajd@linux.ibm.com: rebase on arm64 changes]
Link: https://lkml.kernel.org/r/20250813062614.51759-10-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_table_check: reinstate address parameter in [__]page_table_check_pte_clear()

This reverts commit aa232204c468 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pte_clear").

Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.

[ajd@linux.ibm.com: rebase, fix additional occurrence and loop handling]
Link: https://lkml.kernel.org/r/20250813062614.51759-9-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_table_check: reinstate address parameter in [__]page_table_check_pmd_clear()

This reverts commit 1831414cd729 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pmd_clear").

Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.

[ajd@linux.ibm.com: rebase on arm64 changes]
Link: https://lkml.kernel.org/r/20250813062614.51759-8-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_table_check: reinstate address parameter in [__]page_table_check_pud_clear()

This reverts commit 931c38e16499 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pud_clear").

Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.

[ajd@linux.ibm.com: rebase on arm64 changes]
Link: https://lkml.kernel.org/r/20250813062614.51759-7-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_table_check: provide addr parameter to page_table_check_ptes_set()

To provide support for powerpc platforms, provide an addr parameter to the
__page_table_check_ptes_set() and page_table_check_ptes_set() routines.
This parameter is needed on some powerpc platforms which do not encode
whether a mapping is for user or kernel in the pte. On such platforms,
this can be inferred from the addr parameter.

[ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message]
Link: https://lkml.kernel.org/r/20250813062614.51759-6-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_table_check: reinstate address parameter in [__]page_table_check_pmd[s]_set()

This reverts commit a3b837130b58 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pmd_set").

Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.

Apply this to __page_table_check_pmds_set(), page_table_check_pmd_set(),
and the page_table_check_pmd_set() wrapper macro.

[ajd@linux.ibm.com: rebase on arm64 + riscv changes, update commit message]
Link: https://lkml.kernel.org/r/20250813062614.51759-5-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_table_check: reinstate address parameter in [__]page_table_check_pud[s]_set()

This reverts commit 6d144436d954 ("mm/page_table_check: remove unused
parameter in [__]page_table_check_pud_set").

Reinstate previously unused parameters for the purpose of supporting
powerpc platforms, as many do not encode user/kernel ownership of the page
in the pte, but instead in the address of the access.

Apply this to __page_table_check_puds_set(), page_table_check_puds_set()
and the page_table_check_pud_set() wrapper macro.

[ajd@linux.ibm.com: rebase on riscv + arm64 changes, update commit message]
Link: https://lkml.kernel.org/r/20250813062614.51759-4-ajd@linux.ibm.com
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: Ingo Molnar <mingo@kernel.org> [x86]
Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> [riscv]
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arm64/mm: add addr parameter to __ptep_get_and_clear_anysz()

To provide support for page table check on powerpc, we need to reinstate
the address parameter in several functions, including
page_table_check_{pte,pmd,pud}_clear().

In preparation for this, add the addr parameter to arm64's
__ptep_get_and_clear_anysz() and change its callsites accordingly.

Link: https://lkml.kernel.org/r/20250813062614.51759-3-ajd@linux.ibm.com
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Rohan McLure <rmclure@linux.ibm.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

arm64/mm: add addr parameter to __set_ptes_anysz()

Patch series "Support page table check on PowerPC", v16.

Support page table check on all PowerPC platforms.  This works by
serialising assignments, reassignments and clears of page table entries at
each level in order to ensure that anonymous mappings have at most one
writable consumer, and likewise that file-backed mappings are not
simultaneously also anonymous mappings.

In order to support this infrastructure, a number of stubs must be defined
for all powerpc platforms.  Additionally, separate set_pte_at() and
set_pte_at_unchecked(), to allow for internal, uninstrumented mappings.

On some PowerPC platforms, implementing
{pte,pmd,pud}_user_accessible_page() requires the address.  We revert
previous changes that removed the address parameter from various
interfaces, and add it to some other interfaces, in order to allow this.

(This series was initially written by Rohan McLure, who has left IBM and
is no longer working on powerpc.)

This patch (of 13):

To provide support for page table check on powerpc, we need to reinstate
the address parameter in several functions, including
page_table_check_{ptes,pmds,puds}_set().

In preparation for this, add the addr parameter to arm64's
__set_ptes_anysz() and change its callsites accordingly.

Link: https://lkml.kernel.org/r/20250813062614.51759-1-ajd@linux.ibm.com
Link: https://lkml.kernel.org/r/20250813062614.51759-2-ajd@linux.ibm.com
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Rohan McLure <rmclure@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-filemap-align-last_index-to-folio-size-fix

fix overflow on 32-bit

iocb->ki_pos is loff_t (long long) while pgoff_t is unsigned long and
this overflow seems to happen in practice, resulting in last_index being
before index.

Link: https://lkml.kernel.org/r/yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va
Signed-off-by: Klara Modin <klarasmodin@gmail.com>
Cc: Chi Zhiling <chizhiling@kylinos.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Youling Tang <tangyouling@kylinos.cn>
Cc: Youling Tang <youling.tang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/filemap: align last_index to folio size

On XFS systems with pagesize=4K, blocksize=16K, and
CONFIG_TRANSPARENT_HUGEPAGE enabled, We observed the following readahead
behaviors:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=64k count=1
# ./tools/mm/page-types -r -L -f  /mnt/xfs/test
foffset offset flags
0 136d4c __RU_l_________H______t_________________F_1
1 136d4d __RU_l__________T_____t_________________F_1
2 136d4e __RU_l__________T_____t_________________F_1
3 136d4f __RU_l__________T_____t_________________F_1
...
c 136bb8 __RU_l_________H______t_________________F_1
d 136bb9 __RU_l__________T_____t_________________F_1
e 136bba __RU_l__________T_____t_________________F_1
f 136bbb __RU_l__________T_____t_________________F_1   <-- first read
10 13c2cc ___U_l_________H______t______________I__F_1   <-- readahead flag
11 13c2cd ___U_l__________T_____t______________I__F_1
12 13c2ce ___U_l__________T_____t______________I__F_1
13 13c2cf ___U_l__________T_____t______________I__F_1
...
1c 1405d4 ___U_l_________H______t_________________F_1
1d 1405d5 ___U_l__________T_____t_________________F_1
1e 1405d6 ___U_l__________T_____t_________________F_1
1f 1405d7 ___U_l__________T_____t_________________F_1
[ra_size = 32, req_count = 16, async_size = 16]

# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=60k count=1
# ./page-types -r -L -f  /mnt/xfs/test
foffset offset flags
0 136048 __RU_l_________H______t_________________F_1
...
c 110a40 __RU_l_________H______t_________________F_1
d 110a41 __RU_l__________T_____t_________________F_1
e 110a42 __RU_l__________T_____t_________________F_1   <-- first read
f 110a43 __RU_l__________T_____t_________________F_1   <-- first readahead flag
10 13e7a8 ___U_l_________H______t_________________F_1
...
20 137a00 ___U_l_________H______t_______P______I__F_1   <-- second readahead flag (20 - 2f)
21 137a01 ___U_l__________T_____t_______P______I__F_1
...
3f 10d4af ___U_l__________T_____t_______P_________F_1
[first readahead: ra_size = 32, req_count = 15, async_size = 17]

When reading 64k data (same for 61-63k range, where last_index is
page-aligned in filemap_get_pages()), 128k readahead is triggered via
page_cache_sync_ra() and the PG_readahead flag is set on the next folio
(the one containing 0x10 page).

When reading 60k data, 128k readahead is also triggered via
page_cache_sync_ra().  However, in this case the readahead flag is set on
the 0xf page.  Although the requested read size (req_count) is 60k, the
actual read will be aligned to folio size (64k), which triggers the
readahead flag and initiates asynchronous readahead via
page_cache_async_ra().  This results in two readahead operations totaling
256k.

The root cause is that when the requested size is smaller than the actual
read size (due to folio alignment), it triggers asynchronous readahead.
By changing last_index alignment from page size to folio size, we ensure
the requested size matches the actual read size, preventing the case where
a single read operation triggers two readahead operations.

After applying the patch:
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test of=/dev/null bs=60k count=1
# ./page-types -r -L -f  /mnt/xfs/test
foffset offset flags
0 136d4c __RU_l_________H______t_________________F_1
1 136d4d __RU_l__________T_____t_________________F_1
2 136d4e __RU_l__________T_____t_________________F_1
3 136d4f __RU_l__________T_____t_________________F_1
...
c 136bb8 __RU_l_________H______t_________________F_1
d 136bb9 __RU_l__________T_____t_________________F_1
e 136bba __RU_l__________T_____t_________________F_1   <-- first read
f 136bbb __RU_l__________T_____t_________________F_1
10 13c2cc ___U_l_________H______t______________I__F_1   <-- readahead flag
11 13c2cd ___U_l__________T_____t______________I__F_1
12 13c2ce ___U_l__________T_____t______________I__F_1
13 13c2cf ___U_l__________T_____t______________I__F_1
...
1c 1405d4 ___U_l_________H______t_________________F_1
1d 1405d5 ___U_l__________T_____t_________________F_1
1e 1405d6 ___U_l__________T_____t_________________F_1
1f 1405d7 ___U_l__________T_____t_________________F_1
[ra_size = 32, req_count = 16, async_size = 16]

The same phenomenon will occur when reading from 49k to 64k.  Set the
readahead flag to the next folio.

Because the minimum order of folio in address_space equals the block size
(at least in xfs and bcachefs that already support bs > ps), having
request_count aligned to block size will not cause overread.

Link: https://lkml.kernel.org/r/20250711055509.91587-1-youling.tang@linux.dev
Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Youling Tang <youling.tang@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Klara Modin <klarasmodin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: remove redundant pcp->free_count initialization in per_cpu_pages_init()

In per_cpu_pages_init(), pcp->free_count is explicitly initialized to 0,
but this is redundant because the entire struct is already zeroed by
memset(pcp, 0, sizeof(*pcp)).

Link: https://lkml.kernel.org/r/20250814071828.12036-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix typos in VMA comments

Fix the following typos in VMA-related files:
1. "operationr" -> "operation" in mm/vma.h
2. "initialisaing" -> "initializing" in mm/vma_init.c

Link: https://lkml.kernel.org/r/20250814073800.13617-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>