Set variables after declaration, order declaration by length of lines.
Remove the height variable as it is seldom used.
Don't memcpy size 0.
change type of slots size to void __rcu *
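The note about zero-sized copies can be illustrated with a small guard. This is a minimal sketch, not code from the patch: copy_slots() is a hypothetical helper, and the motivation assumed here is that ISO C leaves memcpy() undefined for invalid pointers even when the length is 0.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): copy 'count' slot pointers,
 * skipping the memcpy() call entirely when there is nothing to copy.
 */
size_t copy_slots(void **dst, void *const *src, size_t count)
{
	if (!count)
		return 0;

	memcpy(dst, src, count * sizeof(*src));
	return count;
}
```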
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
spanning_leaf_init: make spanning leaf safe for all rebalances
Getting the right side of the operations content from the slot means
that the rebalance can pass the same wr_mas into this function to
correctly set up the new leaf entries.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Separate wr_split_store and wr_rebalance store type code path
The split and rebalance store types both go through the same function
that uses the big node. Separate the code paths so that each can be
updated independently.
No functional change intended.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
fix gaps by moving sibling check earlier
fix node_finalise by making cp->end +1, was skipping last node
Continue if left and right meet (l_wr_mas->mas->node == r_wr_mas..
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Hannes Reinecke [Tue, 29 Jul 2025 06:46:36 +0000 (08:46 +0200)]
drivers/base: move memory_block_add_nid() into the caller
Now the node id only needs to be set for early memory, so move
memory_block_add_nid() into the caller and rename it into
memory_block_add_nid_early(). This allows us to further simplify the code
by dropping the 'context' argument to
do_register_memory_block_under_node().
Link: https://lkml.kernel.org/r/20250729064637.51662-4-hare@kernel.org
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hannes Reinecke [Tue, 29 Jul 2025 06:46:35 +0000 (08:46 +0200)]
mm/memory_hotplug: activate node before adding new memory blocks
The sysfs attributes for memory blocks require the node ID to be set and
initialized, so move the node activation before adding new memory blocks.
This also has the nice side effect that the BUG_ON() can be converted into
a WARN_ON() as we now can handle registration errors.
Link: https://lkml.kernel.org/r/20250729064637.51662-3-hare@kernel.org
Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hannes Reinecke [Tue, 29 Jul 2025 06:46:34 +0000 (08:46 +0200)]
drivers/base/memory: add node id parameter to add_memory_block()
We have some udev rules trying to read the sysfs attribute 'valid_zones'
during a memory 'add' event, causing a crash in zone_for_pfn_range().
Debugging found that mem->nid was set to NUMA_NO_NODE, which crashed in
NODE_DATA(nid). Further analysis revealed that we're running into a race
with udev event processing: add_memory_resource() makes these function calls:
Why do we try to online the node in 1), but only register the node in 4)
_after_ we have created the memory blocks in 3) ? And why do we set the
'nid' value in 5), when the uevent (which might need to see the correct
'nid' value) is sent out in 3) ? There must be a reason, I'm sure ...
So here's a small patchset to fix up uevent ordering. The first patch adds
a 'nid' parameter to add_memory_block() (to avoid mem->nid being
initialized with NUMA_NO_NODE), and the second patch reshuffles the code
in add_memory_resource() to fully initialize the node prior to calling
create_memory_block_devices() so that the node is valid at that time and
uevent processing will see correct values in sysfs.
This patch (of 3):
Add a 'nid' parameter to add_memory_block() to initialize the memory block
with the correct node id.
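The idea can be sketched in userspace C. This is a simplified illustration, not the kernel code: struct memory_block_sketch and add_memory_block_sketch() are hypothetical stand-ins, and the only point shown is that the node id is stored before the block becomes visible to anyone else.

```c
#include <stdlib.h>

#define NUMA_NO_NODE	(-1)

/* Simplified, hypothetical stand-in for the kernel's struct memory_block. */
struct memory_block_sketch {
	unsigned long block_id;
	int nid;
};

/*
 * The node id is passed in and stored at creation time, so a reader that
 * looks at the block right after it is announced never sees NUMA_NO_NODE.
 */
struct memory_block_sketch *add_memory_block_sketch(unsigned long block_id,
						    int nid)
{
	struct memory_block_sketch *mem = calloc(1, sizeof(*mem));

	if (!mem)
		return NULL;

	mem->block_id = block_id;
	mem->nid = nid;
	return mem;
}
```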
Chi Zhiling [Tue, 12 Aug 2025 07:22:25 +0000 (15:22 +0800)]
mpage: convert do_mpage_readpage() to return int type
The return value of do_mpage_readpage() is arg->bio, which is already set
in the arg structure. Returning it again is redundant.
This patch changes the return type to int and always returns 0 since the
caller does not care about the return value.
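A minimal sketch of the resulting calling convention, using hypothetical simplified types rather than the real struct bio and mpage argument structure: the bio is reachable only through the argument structure, and the function returns a plain 0.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real bio and mpage argument types. */
struct bio_sketch {
	int id;
};

struct mpage_args_sketch {
	struct bio_sketch *bio;
};

/*
 * The updated bio is stored in args->bio only; the return value no
 * longer duplicates it and is always 0 in this sketch.
 */
int read_one_sketch(struct mpage_args_sketch *args, struct bio_sketch *next)
{
	args->bio = next;
	return 0;
}
```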
Link: https://lkml.kernel.org/r/20250812072225.181798-3-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chi Zhiling [Tue, 12 Aug 2025 07:22:23 +0000 (15:22 +0800)]
mpage: terminate read-ahead on read error
For exFAT filesystems with 4MB read_ahead_size, removing the storage
device during read operations can delay EIO error reporting by several
minutes. This occurs because the read-ahead implementation in mpage
doesn't handle errors.
Another reason for the delay is that the filesystem requires metadata to
issue a file read request. When the storage device is removed, the metadata
buffers are invalidated, causing mpage to repeatedly attempt to fetch
metadata during each get_block call.
The original purpose of this patch was to terminate read-ahead when we fail
to get metadata. To make the patch more generic, it is implemented by
checking the folio status instead of the return value of get_block().
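The control flow can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: the types are hypothetical, and the failure test (a folio that is neither locked for pending I/O nor uptodate) is one plausible folio status check assumed for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified folio state. */
struct ra_folio_sketch {
	bool locked;	/* I/O still pending */
	bool uptodate;	/* data is valid */
};

/* Assumed failure test: unlocked but not uptodate means the read failed. */
bool folio_read_failed(const struct ra_folio_sketch *folio)
{
	return !folio->locked && !folio->uptodate;
}

/*
 * Walk the read-ahead window and stop at the first failed folio instead
 * of attempting every remaining one. Returns the number of attempts.
 */
size_t readahead_sketch(const struct ra_folio_sketch *folios, size_t nr)
{
	size_t attempts = 0;

	for (size_t i = 0; i < nr; i++) {
		attempts++;	/* the read for folios[i] would be issued here */
		if (folio_read_failed(&folios[i]))
			break;
	}
	return attempts;
}
```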
Link: https://lkml.kernel.org/r/20250812072225.181798-1-chizhiling@163.com
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sungjong Seo <sj1557.seo@samsung.com>
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chi Zhiling [Mon, 28 Jul 2025 08:39:52 +0000 (16:39 +0800)]
mm/filemap: Skip non-uptodate folio if there are available folios
When reading data exceeding the maximum IO size, the operation is split
into multiple IO requests, but the data isn't immediately copied to
userspace after each IO completion.
For example, when reading 2560k data from a device with 1280k maximum IO
size, the following sequence occurs:
1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. wait for the next 1280k
5. copy 8 pages to user buffer
6. copy 20 folios (64k) to user buffer
The 8 pages in step 5 are copied after the second 1280k completes (step 4)
due to waiting for a non-uptodate folio in filemap_update_page. We can
copy the 8 pages before the second 1280k completes (step 4) to reduce the
latency of this read operation.
After applying the patch, these 8 pages will be copied before the next IO
completes:
1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. copy 8 pages to user buffer
5. wait for the next 1280k
6. copy 20 folios (64k) to user buffer
This patch drops a setting of IOCB_NOWAIT for AIO, which is fine because
filemap_read will set it again for AIO.
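The reordering described above can be modelled with a small sketch. It is not the filemap code: the batch is reduced to an array of uptodate flags and folios_to_copy_now() is a hypothetical helper. It shows only the decision to hand back folios that are already ready rather than block on the first one that is not.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Given a batch of folios (modelled as uptodate flags), return how many
 * leading folios can be copied right away. *must_wait is set only when
 * the very first folio is not ready, i.e. nothing can be copied yet.
 */
size_t folios_to_copy_now(const bool *uptodate, size_t nr, bool *must_wait)
{
	size_t ready = 0;

	while (ready < nr && uptodate[ready])
		ready++;

	*must_wait = (ready == 0 && nr > 0);
	return ready;
}
```

With a batch of two ready folios followed by one that is still being read, the sketch reports two folios to copy and no wait, which corresponds to copying the 8 pages before waiting for the next I/O.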
Chi Zhiling [Mon, 28 Jul 2025 08:39:51 +0000 (16:39 +0800)]
mm/filemap: do not use is_partially_uptodate for entire folio
Patch series "Tiny optimization for large read operations".
This series contains two patches:
1. Skip calling is_partially_uptodate for an entire folio to save time. I
have reviewed the mpage and iomap implementations and didn't spot any
issues, but this change likely needs more thorough review.
2. Skip calling filemap_uptodate if there are ready folios in the
batch. This might save a few milliseconds in practice, but I didn't
observe measurable improvements in my tests.
This patch (of 2):
When a folio is marked as non-uptodate, it means the folio contains some
non-uptodate data. Therefore, calling is_partially_uptodate() to recheck
the entire folio is redundant.
If all data in a folio is actually up-to-date but the folio lacks the
uptodate flag, it will still be treated as non-uptodate in many other
places. Thus, there should be no special case handling for filemap.
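A sketch of the resulting decision, using a hypothetical helper rather than the filemap code: the partial check is worth making only when the folio lacks the uptodate flag and the requested range covers less than the whole folio.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether a partial-uptodate check makes
 * sense for a read of 'len' bytes at 'offset' within a folio.
 */
bool need_partial_check(size_t folio_size, size_t offset, size_t len,
			bool folio_uptodate)
{
	if (folio_uptodate)
		return false;	/* nothing to recheck */

	/* Whole-folio range on a non-uptodate folio: the recheck is redundant. */
	if (offset == 0 && len >= folio_size)
		return false;

	return true;
}
```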
Wei Yang [Sun, 17 Aug 2025 03:26:46 +0000 (03:26 +0000)]
mm/rmap: not necessary to mask off FOLIO_PAGES_MAPPED
At this point, we are in an if branch conditional on (nr <
ENTIRELY_MAPPED), and FOLIO_PAGES_MAPPED is equal to (ENTIRELY_MAPPED -
1). This means the upper bits are already cleared.
It is not necessary to mask it off.
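The arithmetic can be checked with a short sketch. The numeric value of ENTIRELY_MAPPED below is an assumption made for the example; the argument only needs it to be a power of two and nr to be non-negative.

```c
/* Assumed value for illustration; any power of two behaves the same. */
#define ENTIRELY_MAPPED		0x800000u
#define FOLIO_PAGES_MAPPED	(ENTIRELY_MAPPED - 1)

/* The masking step that the commit describes as unnecessary. */
unsigned int mask_pages_mapped(unsigned int nr)
{
	return nr & FOLIO_PAGES_MAPPED;
}
```

For every nr below ENTIRELY_MAPPED, all bits of nr lie inside the mask, so mask_pages_mapped(nr) == nr. The mask changes the value only when the ENTIRELY_MAPPED bit or a higher bit is set, which the enclosing branch already excludes.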
Link: https://lkml.kernel.org/r/20250817032647.29147-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ujwal Kundur [Sun, 17 Aug 2025 06:52:11 +0000 (12:22 +0530)]
selftests/mm/uffd: refactor non-composite global vars into struct
Refactor macros and non-composite global variable definitions into a
struct that is defined at the start of a test and is passed around
instead of relying on global vars.
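The shape of the refactoring can be sketched as below. The struct and field names are hypothetical and chosen for illustration only; the selftest's actual definitions may differ.

```c
/*
 * Hypothetical options struct: values that used to be file-scope globals
 * are grouped here and passed explicitly to the code that needs them.
 */
struct uffd_opts_sketch {
	unsigned long nr_pages;
	unsigned long page_size;
	int uffd;
};

/* A helper takes the struct instead of reading global variables. */
unsigned long area_bytes(const struct uffd_opts_sketch *opts)
{
	return opts->nr_pages * opts->page_size;
}
```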
Link: https://lkml.kernel.org/r/20250817065211.855-1-ujwal.kundur@gmail.com
Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
liuqiqi [Tue, 12 Aug 2025 07:02:10 +0000 (15:02 +0800)]
mm: fix duplicate accounting of free pages in should_reclaim_retry()
In the zone_reclaimable_pages() function, if the page counts for
NR_ZONE_INACTIVE_FILE, NR_ZONE_ACTIVE_FILE, NR_ZONE_INACTIVE_ANON, and
NR_ZONE_ACTIVE_ANON are all zero, the function returns the number of free
pages as the result.
In this case, when should_reclaim_retry() calculates reclaimable pages, it
will inadvertently double-count the free pages in its accounting.
	static inline bool
	should_reclaim_retry(gfp_t gfp_mask, unsigned order,
			     struct alloc_context *ac, int alloc_flags,
			     bool did_some_progress, int *no_progress_loops)
	{
		...
		available = reclaimable = zone_reclaimable_pages(zone);
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
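The double counting can be reproduced with a small userspace model. This is a sketch with hypothetical simplified types, not the kernel implementation; it mirrors only the fallback described above and the addition of free pages by the caller.

```c
/* Hypothetical, simplified per-zone counters. */
struct zone_sketch {
	unsigned long inactive_file;
	unsigned long active_file;
	unsigned long inactive_anon;
	unsigned long active_anon;
	unsigned long free;
};

/* Models the fallback: with no LRU pages at all, report the free pages. */
unsigned long zone_reclaimable_pages_sketch(const struct zone_sketch *z)
{
	unsigned long nr = z->inactive_file + z->active_file +
			   z->inactive_anon + z->active_anon;

	if (nr == 0)
		nr = z->free;
	return nr;
}

/* Models the caller, which adds the free pages on top. */
unsigned long available_pages_sketch(const struct zone_sketch *z)
{
	unsigned long available = zone_reclaimable_pages_sketch(z);

	available += z->free;
	return available;
}
```

For a zone with empty LRU lists and 100 free pages, the model reports 200 available pages, twice the real number. With any non-zero LRU count the fallback is not taken and the sum is correct.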
Link: https://lkml.kernel.org/r/20250812070210.1624218-1-liuqiqi@kylinos.cn
Fixes: 6aaced5abd32 ("mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()")
Signed-off-by: liuqiqi <liuqiqi@kylinos.cn>
Reviewed-by: Ye Liu <liuye@kylinos.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:23:01 +0000 (18:23 +0100)]
mm: add folio_is_pci_p2pdma()
Reimplement is_pci_p2pdma_page() in terms of folio_is_pci_p2pdma(). This
moves the page_folio() call from inside page_pgmap() to
is_pci_p2pdma_page(). It also removes a page_folio() call from
try_grab_folio(), which already has a folio and can pass it in.
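The layering can be sketched with simplified types. Everything below is a hypothetical model, not the kernel definitions: the real helpers also depend on kernel configuration and zone-device checks that are left out here.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the pgmap type carried by a folio. */
enum pgmap_type_sketch {
	PGMAP_OTHER,
	PGMAP_PCI_P2PDMA,
};

struct p2p_folio_sketch {
	enum pgmap_type_sketch type;
};

struct p2p_page_sketch {
	struct p2p_folio_sketch *folio;	/* models page_folio(page) */
};

/* Folio-based test: callers that already hold a folio use this directly. */
bool folio_is_pci_p2pdma_sketch(const struct p2p_folio_sketch *folio)
{
	return folio->type == PGMAP_PCI_P2PDMA;
}

/* Page-based wrapper: does the page-to-folio conversion once, then delegates. */
bool is_pci_p2pdma_page_sketch(const struct p2p_page_sketch *page)
{
	return folio_is_pci_p2pdma_sketch(page->folio);
}
```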
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:23:00 +0000 (18:23 +0100)]
mm: reimplement folio_is_fsdax()
For callers of folio_is_fsdax(), we save a folio->page->folio conversion.
Callers of is_fsdax_page() simply move the conversion of page->folio from
the implementation of page_pgmap() to is_fsdax_page().
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:22:59 +0000 (18:22 +0100)]
mm: reimplement folio_is_device_coherent()
For callers of folio_is_device_coherent(), we save a folio->page->folio
conversion. Callers of is_device_coherent_page() simply move the
conversion of page->folio from the implementation of page_pgmap() to
is_device_coherent_page().
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:22:58 +0000 (18:22 +0100)]
mm: reimplement folio_is_device_private()
For callers of folio_is_device_private(), we save a folio->page->folio
conversion. Callers of is_device_private_page() simply move the
conversion of page->folio from the implementation of page_pgmap() to
is_device_private_page().