Qi Zheng [Tue, 25 Feb 2025 03:45:56 +0000 (11:45 +0800)]
mm: pgtable: remove tlb_remove_page_ptdesc()
The tlb_remove_ptdesc()/tlb_remove_table() is specially designed for page
table pages, and now all architectures have been converted to use it to
remove page table pages. So let's remove tlb_remove_page_ptdesc(), it
currently has no users and should not be used for page table pages.
Link: https://lkml.kernel.org/r/3df04c8494339073b71be4acb2d92e108ecd1b60.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 25 Feb 2025 03:45:54 +0000 (11:45 +0800)]
riscv: pgtable: unconditionally use tlb_remove_ptdesc()
To support fast gup, the commit 69be3fb111e7 ("riscv: enable
MMU_GATHER_RCU_TABLE_FREE for SMP && MMU") did the following:
1) use tlb_remove_page_ptdesc() for those platforms which use IPI to
perform TLB shootdown
2) use tlb_remove_ptdesc() for those platforms which use SBI to perform
TLB shootdown
The tlb_remove_page_ptdesc() is the wrapper of the tlb_remove_page(). By
design, the tlb_remove_page() should be used to remove a normal page from
a page table entry, and should not be used for page table pages.
The tlb_remove_ptdesc() is the wrapper of the tlb_remove_table(), which is
designed specifically for freeing page table pages. If the
CONFIG_MMU_GATHER_TABLE_FREE is enabled, the tlb_remove_table() will use
semi RCU to free page table pages, that is:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
If the CONFIG_MMU_GATHER_TABLE_FREE is disabled, the tlb_remove_table()
will fall back to pagetable_dtor() + tlb_remove_page().
For case 1), since we need to perform TLB shootdown before freeing the
page table page, the local_irq_save() in fast gup can block the freeing
and protect the fast gup page walker. Therefore we can ensure safety by
just using tlb_remove_page_ptdesc(). In addition, we can also the
tlb_remove_ptdesc()/tlb_remove_table() to achieve it, and it doesn't
matter whether CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected. And in
theory, the performance of freeing pages asynchronously via RCU will not
be lower than synchronous free.
For case 2), since local_irq_save() only disable S-privilege IPI irq but
not M-privilege's, which is used by the SBI implementation to perform TLB
shootdown, so we must select CONFIG_MMU_GATHER_RCU_TABLE_FREE and use
tlb_remove_ptdesc() to ensure safety. The riscv selects this config for
SMP && MMU, the CONFIG_RISCV_SBI is dependent on MMU. Therefore, only the
UP system may have the situation where CONFIG_MMU_GATHER_RCU_TABLE_FREE is
disabled but CONFIG_RISCV_SBI is enabled. But there is no freeing vs fast
gup race in the UP system.
So, in summary, we can use tlb_remove_ptdesc() to support fast gup in all
cases, and this interface is specifically designed for page table pages.
So let's use it unconditionally.
Link: https://lkml.kernel.org/r/9025595e895515515c95e48db54b29afa489c41d.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 25 Feb 2025 03:45:53 +0000 (11:45 +0800)]
mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
Now, the nine architectures of csky, hexagon, loongarch, m68k, mips,
nios2, openrisc, sh and um do not select CONFIG_MMU_GATHER_RCU_TABLE_FREE,
and just call pagetable_dtor() + tlb_remove_page_ptdesc() (the wrapper of
tlb_remove_page()). This is the same as the implementation of
tlb_remove_{ptdesc|table}() under !CONFIG_MMU_GATHER_TABLE_FREE, so
convert these architectures to use tlb_remove_ptdesc().
The ultimate goal is to make the architecture only use tlb_remove_ptdesc()
or tlb_remove_table() for page table pages.
Link: https://lkml.kernel.org/r/19db3e8673b67bad2f1df1ab37f1c89d99eacfea.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 25 Feb 2025 03:45:52 +0000 (11:45 +0800)]
mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
All callers of tlb_remove_ptdesc() pass it a pointer of struct ptdesc, so
let's change the pt parameter from void * to struct ptdesc * to perform a
type safety check.
Link: https://lkml.kernel.org/r/60bb44299cf2d731df6592e446e7f694054d0dbe.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Tue, 25 Feb 2025 03:45:51 +0000 (11:45 +0800)]
mm: pgtable: make generic tlb_remove_table() use struct ptdesc
Patch series "remove tlb_remove_page_ptdesc()", v2.
As suggested by Peter Zijlstra below [1], this series aims to remove
tlb_remove_page_ptdesc().
: Fundamentally tlb_remove_page() is about removing *pages* as from a PTE,
: there should not be a page-table anywhere near here *ever*.
:
: Yes, some architectures use tlb_remove_page() for page-tables too, but
: that is more or less an implementation detail that can be fixed.
After this series, all architectures use tlb_remove_table() or
tlb_remove_ptdesc() to remove the page table pages. In the future, once
all architectures using tlb_remove_table() have also converted to using
struct ptdesc (eg. powerpc), it may be possible to use only
tlb_remove_ptdesc().
Now only arm will call tlb_remove_ptdesc()/tlb_remove_table() when
CONFIG_MMU_GATHER_TABLE_FREE is disabled. In this case, the type of the
table parameter is actually struct ptdesc * instead of struct page *.
Since struct ptdesc still overlaps with struct page and has not been
separated from it, forcing the table parameter to struct page * will not
cause any problems at this time. But this is definitely incorrect and
needs to be fixed. So just like the generic __tlb_remove_table(), let
generic tlb_remove_table() use struct ptdesc by default when
CONFIG_MMU_GATHER_TABLE_FREE is disabled.
Link: https://lkml.kernel.org/r/cover.1740454179.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5be8c3ab7bd68510bf0db4cf84010f4dfe372917.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Thu, 13 Feb 2025 16:13:52 +0000 (11:13 -0500)]
x86/mm: make MMU_GATHER_RCU_TABLE_FREE unconditional
Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
paravirt, and not when running on bare metal.
There is no real good reason to do things differently for
each setup. Make them all the same.
Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:
- on bare metal, by blocking IRQs, which block TLB flush IPIs
- on paravirt, with MMU_GATHER_RCU_TABLE_FREE
This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.
Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
Using RCU synchronization between page table freeing and get_user_pages_fast()
allows bare metal to also do TLB flushing while interrupts are disabled.
Various places in the mm do still block IRQs or disable preemption
as an implicit way to block RCU frees.
That makes it safe to use INVLPGB on AMD CPUs.
Link: https://lore.kernel.org/r/20250213161423.449435-2-riel@surriel.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Manali Shukla <Manali.Shukla@amd.com> Tested-by: Brendan Jackman <jackmanb@google.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jinjiang Tu [Tue, 25 Feb 2025 14:19:33 +0000 (22:19 +0800)]
mm/hugetlb: fix set_max_huge_pages() when there are surplus pages
In set_max_huge_pages(), min_count should mean the acquired persistent
huge pages, but it contains surplus huge pages. It will lead to failing
to free free huge pages for a node.
Steps to reproduce:
1) create 5 huge pages in Node 0
2) run a program to use all the huge pages
3) create 5 huge pages in Node 1
4) echo 0 > nr_hugepages for Node 1 to free the huge pages
The result:
Node 0 Node 1
Total 5 5
Free 0 5
Surp 5 5
With this patch, step 4) destroys the 5 huge pages in Node 1
The result with this patch:
Node 0 Node 1
Total 5 0
Free 0 0
Surp 5 0
Link: https://lkml.kernel.org/r/20250225141933.3852667-1-tujinjiang@huawei.com Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Tue, 25 Feb 2025 18:45:09 +0000 (18:45 +0000)]
mm/page_alloc: warn on nr_reserved_highatomic underflow
As documented in the comment this underflow should not happen. The
locking has indeed changed here since the comment was written, see the
migratetype hygiene patches[0]. However, those changes made the locking
_safer_, so the underflow _really_ shouldn't happen now. So upgrade the
comment to a warning.
Christoph Hellwig [Mon, 24 Feb 2025 16:30:33 +0000 (08:30 -0800)]
vmalloc: drop Christoph from Reviewers
I haven't been doing as much review as I should. As part of reducing my
inbox flow drop me from the official Reviewers. I might still chime in on
patches occasionally.
Kairui Song [Mon, 24 Feb 2025 18:02:12 +0000 (02:02 +0800)]
mm, swap: simplify folio swap allocation
With slot cache gone, clean up the allocation helpers even more.
folio_alloc_swap will be the only entry for allocation and adding the
folio to swap cache (except suspend), making it opposite of
folio_free_swap.
Link: https://lkml.kernel.org/r/20250224180212.22802-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 24 Feb 2025 18:02:11 +0000 (02:02 +0800)]
mm, swap: remove swap slot cache
Slot cache is no longer needed now, removing it and all related code.
- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs),
16 test runs for each case, measuring the total throughput:
Before (KB/s) (stdev) After (KB/s) (stdev)
Random (4K): 424907.60 (24410.78) 414745.92 (34554.78)
Random (64K): 163308.82 (11635.72) 167314.50 (18434.99)
Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16)
The performance changes are below noise level.
- Build linux kernel with make -j96, using 4K folio with 1.5G memory
cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs,
12 test runs, measuring the system time:
Before (s) (stdev) After (s) (stdev)
make -j96 (4K): 6445.69 (61.95) 6408.80 (69.46)
make -j96 (64K): 6841.71 (409.04) 6437.99 (435.55)
Similar to above, 64k mTHP case showed a slight improvement.
Link: https://lkml.kernel.org/r/20250224180212.22802-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 24 Feb 2025 18:02:10 +0000 (02:02 +0800)]
mm, swap: use percpu cluster as allocation fast path
Current allocation workflow first traverses the plist with a global lock
held, after choosing a device, it uses the percpu cluster on that swap
device. This commit moves the percpu cluster variable out of being tied
to individual swap devices, making it a global percpu variable, and will
be used directly for allocation as a fast path.
The global percpu cluster variable will never point to a HDD device, and
allocations on a HDD device are still globally serialized.
This improves the allocator performance and prepares for removal of the
slot cache in later commits. There shouldn't be much observable behavior
change, except one thing: this changes how swap device allocation rotation
works.
Currently, each allocation will rotate the plist, and because of the
existence of slot cache (one order 0 allocation usually returns 64
entries), swap devices of the same priority are rotated for every 64 order
0 entries consumed. High order allocations are different, they will
bypass the slot cache, and so swap device is rotated for every 16K, 32K,
or up to 2M allocation.
The rotation rule was never clearly defined or documented, it was changed
several times without mentioning.
After this commit, and once slot cache is gone in later commits, swap
device rotation will happen for every consumed cluster. Ideally non-HDD
devices will be rotated if 2M space has been consumed for each order.
Fragmented clusters will rotate the device faster, which seems OK. HDD
devices is rotated for every allocation regardless of the allocation
order, which should be OK too and trivial.
This commit also slightly changes allocation behaviour for slot cache.
The new added cluster allocation fast path may allocate entries from
different device to the slot cache, this is not observable from user
space, only impact performance very slightly, and slot cache will be just
gone in next commit, so this can be ignored.
Link: https://lkml.kernel.org/r/20250224180212.22802-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 24 Feb 2025 18:02:09 +0000 (02:02 +0800)]
mm, swap: don't update the counter up-front
The counter update before allocation design was useful to avoid
unnecessary scan when device is full, so it will abort early if the
counter indicates the device is full. But that is an uncommon case, and
now scanning of a full device is very fast, so the up-front update is not
helpful any more.
Remove it and simplify the slot allocation logic.
Link: https://lkml.kernel.org/r/20250224180212.22802-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 24 Feb 2025 18:02:08 +0000 (02:02 +0800)]
mm, swap: avoid redundant swap device pinning
Currently __read_swap_cache_async() has get/put_swap_device() calls to
increase/decrease a swap device reference to prevent swapoff. While some
of its callers have already held the swap device reference, e.g in
do_swap_page() and shmem_swapin_folio() where __read_swap_cache_async()
will finally called. Now there are only two callers not holding a swap
device reference, so make them hold a reference instead. And drop the
get/put_swap_device calls in __read_swap_cache_async. This should reduce
the overhead for swap in during page fault slightly.
Link: https://lkml.kernel.org/r/20250224180212.22802-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 24 Feb 2025 18:02:07 +0000 (02:02 +0800)]
mm, swap: drop the flag TTRS_DIRECT
This flag exists temporarily to allow the allocator to bypass the slot
cache during freeing, so reclaiming one slot will free the slot
immediately.
But now we have already removed slot cache usage on freeing, so this flag
has no effect now.
Link: https://lkml.kernel.org/r/20250224180212.22802-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 24 Feb 2025 18:02:06 +0000 (02:02 +0800)]
mm, swap: avoid reclaiming irrelevant swap cache
Patch series "mm, swap: remove swap slot cache", v2.
Slot cache was initially introduced by commit 67afa38e012e ("mm/swap: add
cache for swap slots allocation") to reduce the lock contention of
si->lock.
Previous series "mm, swap: rework of swap allocator locks" [1] removed
swap slot cache for freeing path as freeing path no longer touches
si->lock in most cased. Allocation path also have slight to none
contention on si->lock since that series, but slot cache still helps to
reduce other overheads, like counters and the plist.
This series removes the slot cache from allocation path too, by using the
cluster as allocation fast path and also reduce other overheads.
Now slot cache is completely gone, the code is much simplified without
obvious feature or performance change, also clean up related workaround.
Also this should avoid other potential issues, e.g. the long pinning of
swap slots: swap slot cache pins swap slots with HAS_CACHE, causing
reclaim or allocation fail to use these slots on scanning.
The only behavior change is the swap device allocation rotation mechanism,
as explained in the patch "mm, swap: use percpu cluster as allocation fast
path".
Test results are looking good after deleting the swap slot cache:
- vm-scalability with: `usemem --init-time -O -y -x -R -31 1G`,
12G memory cgroup using simulated pmem as SWAP (32G pmem, 32 CPUs),
16 test runs for each case, measuring the total throughput:
Before (KB/s) (stdev) After (KB/s) (stdev)
Random (4K): 424907.60 (24410.78) 414745.92 (34554.78)
Random (64K): 163308.82 (11635.72) 167314.50 (18434.99)
Sequential (4K, !-R): 6150056.79 (103205.90) 6321469.06 (115878.16)
- Build linux kernel with make -j96, using 4K folio with 1.5G memory
cgroup limit and 64K folio with 2G memory cgroup limit, on top of tmpfs,
12 test runs, measuring the system time:
Before (s) (stdev) After (s) (stdev)
make -j96 (4K): 6445.69 (61.95) 6408.80 (69.46)
make -j96 (64K): 6841.71 (409.04) 6437.99 (435.55)
The performance is unchanged, slightly better in some cases.
Swap allocator will do swap cache reclaim to recycle HAS_CACHE slots for
allocation. It initiates the reclaim from the offset to be reclaimed and
looks up the corresponding folio. The lookup process is lockless, so it's
possible the folio will be removed from the swap cache and given a
different swap entry before the reclaim locks the folio. If it happens,
the reclaim will end up reclaiming an irrelevant folio, and return wrong
return value.
This shouldn't cause any problem with correctness or stability, but it is
indeed confusing and unexpected, and will increase fragmentation, decrease
performance.
Fix this by checking whether the folio is still pointing to the offset the
allocator want to reclaim before reclaiming it.
Link: https://lkml.kernel.org/r/20250224180212.22802-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250224180212.22802-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jane Chu [Mon, 24 Feb 2025 21:14:45 +0000 (14:14 -0700)]
mm: make page_mapped_in_vma() hugetlb walk aware
When a process consumes a UE in a page, the memory failure handler
attempts to collect information for a potential SIGBUS. If the page is an
anonymous page, page_mapped_in_vma(page, vma) is invoked in order to
1. retrieve the vaddr from the process' address space,
2. verify that the vaddr is indeed mapped to the poisoned page,
where 'page' is the precise small page with UE.
It's been observed that when injecting poison to a non-head subpage of an
anonymous hugetlb page, no SIGBUS shows up, while injecting to the head
page produces a SIGBUS. The cause is that, though hugetlb_walk() returns
a valid pmd entry (on x86), but check_pte() detects mismatch between the
head page per the pmd and the input subpage. Thus the vaddr is considered
not mapped to the subpage and the process is not collected for SIGBUS
purpose. This is the calling stack:
check_pte() header says that it
"check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte"
but practically works only if pvmw->pfn is the head page pfn at pvmw->pte.
Hindsight acknowledging that some pvmw->pte could point to a hugepage of
some sort such that it makes sense to make check_pte() work for hugepage.
Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Tue, 25 Feb 2025 00:08:25 +0000 (19:08 -0500)]
mm: page_alloc: remove remnants of unlocked migratetype updates
The freelist hygiene patches made migratetype accesses fully protected
under the zone->lock. Remove remnants of handling the race conditions
that existed before from the MIGRATE_HIGHATOMIC code.
Link: https://lkml.kernel.org/r/20250225001023.1494422-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Tue, 25 Feb 2025 00:08:24 +0000 (19:08 -0500)]
mm: page_alloc: don't steal single pages from biggest buddy
The fallback code searches for the biggest buddy first in an attempt to
steal the whole block and encourage type grouping down the line.
The approach used to be this:
- Non-movable requests will split the largest buddy and steal the
remainder. This splits up contiguity, but it allows subsequent
requests of this type to fall back into adjacent space.
- Movable requests go and look for the smallest buddy instead. The
thinking is that movable requests can be compacted, so grouping is
less important than retaining contiguity.
c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
conversion") enforces freelist type hygiene, which restricts stealing to
either claiming the whole block or just taking the requested chunk; no
additional pages or buddy remainders can be stolen any more.
The patch mishandled when to switch to finding the smallest buddy in that
new reality. As a result, it may steal the exact request size, but from
the biggest buddy. This causes fracturing for no good reason.
Fix this by committing to the new behavior: either steal the whole block,
or fall back to the smallest buddy.
Remove single-page stealing from steal_suitable_fallback(). Rename it to
try_to_steal_block() to make the intentions clear. If this fails, always
fall back to the smallest buddy.
The following is from 4 runs of mmtest's thpchallenge. "Pollute" is
single page fallback, "steal" is conversion of a partially used block.
The numbers for free block conversions (omitted) are comparable.
vanilla patched
@pollute[unmovable from reclaimable]: 27 106
@pollute[unmovable from movable]: 82 46
@pollute[reclaimable from unmovable]: 256 83
@pollute[reclaimable from movable]: 46 8
@pollute[movable from unmovable]: 4841 868
@pollute[movable from reclaimable]: 5278 12568
@steal[unmovable from reclaimable]: 11 12
@steal[unmovable from movable]: 113 49
@steal[reclaimable from unmovable]: 19 34
@steal[reclaimable from movable]: 47 21
@steal[movable from unmovable]: 250 183
@steal[movable from reclaimable]: 81 93
The allocator appears to do a better job at keeping stealing and polluting
to the first fallback preference. As a result, the numbers for "from
movable" - the least preferred fallback option, and most detrimental to
compactability - are down across the board.
Link: https://lkml.kernel.org/r/20250225001023.1494422-2-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 21 Feb 2025 12:05:23 +0000 (12:05 +0000)]
tools/selftests: add guard region test for /proc/$pid/pagemap
Add a test to the guard region self tests to assert that the
/proc/$pid/pagemap information now made availabile to the user correctly
identifies and reports guard regions.
As a part of this change, update vm_util.h to add the new bit (note there
is no header file in the kernel where this is exposed, the user is
expected to provide their own mask) and utilise the helper functions there
for pagemap functionality.
Link: https://lkml.kernel.org/r/164feb0a43ae72650e6b20c3910213f469566311.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 21 Feb 2025 12:05:22 +0000 (12:05 +0000)]
fs/proc/task_mmu: add guard region bit to pagemap
Patch series "fs/proc/task_mmu: add guard region bit to pagemap".
Currently there is no means of determining whether a given page in a
mapping range is designated a guard region (as installed via madvise()
using the MADV_GUARD_INSTALL flag).
This is generally not an issue, but in some instances users may wish to
determine whether this is the case.
This series adds this ability via /proc/$pid/pagemap, updates the
documentation and adds a self test to assert that this functions
correctly.
This patch (of 2):
Currently there is no means by which users can determine whether a given
page in memory is in fact a guard region, that is having had the
MADV_GUARD_INSTALL madvise() flag applied to it.
This is intentional, as to provide this information in VMA metadata would
contradict the intent of the feature (providing a means to change fault
behaviour at a page table level rather than a VMA level), and would
require VMA metadata operations to scan page tables, which is
unacceptable.
In many cases, users have no need to reflect and determine what regions
have been designated guard regions, as it is the user who has established
them in the first place.
But in some instances, such as monitoring software, or software that
relies upon being able to ascertain the nature of mappings within a remote
process for instance, it becomes useful to be able to determine which
pages have the guard region marker applied.
This patch makes use of an unused pagemap bit (58) to provide this
information.
This patch updates the documentation at the same time as making the change
such that the implementation of the feature and the documentation of it
are tied together.
Link: https://lkml.kernel.org/r/cover.1740139449.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/521d99c08b975fb06a1e7201e971cc24d68196d1.1740139449.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kemeng Shi [Sat, 22 Feb 2025 16:08:48 +0000 (00:08 +0800)]
mm, swap: remove setting SWAP_MAP_BAD for discard cluster
Before alloc from a cluster, we will aqcuire cluster's lock and make sure
it is usable by cluster_is_usable(), so there is no need to set
SWAP_MAP_BAD for cluster to be discarded.
Brendan Jackman [Fri, 21 Feb 2025 18:25:42 +0000 (18:25 +0000)]
selftests/mm: skip uffd-stress if userfaultfd not available
It's pretty obvious that the test wouldn't work if you don't have the
feature enabled. But, it's still useful to SKIP instead of failing so the
reader can immediately tell that this is the reason why.
Link: https://lkml.kernel.org/r/20250221-mm-selftests-v2-3-28c4d66383c5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Fri, 21 Feb 2025 18:25:41 +0000 (18:25 +0000)]
selftests/mm: fix assumption that sudo is present
If we are root, sudo isn't needed. If we are not root, we need sudo, so
skip the test if it isn't present.
We already do this for on-fault-limit, but this uses separate
infrastructure since that is specifically for sudo-ing to the nobody user.
Note this ptrace_skip configuration still fails if that file doesn't
exist, but in that case the test is still fine, so this just prints an
error but doesn't break anything. I suspect that's probably deliberate.
Link: https://lkml.kernel.org/r/20250221-mm-selftests-v2-2-28c4d66383c5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Shuah Khan (Samsung OSG) <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Fri, 21 Feb 2025 18:25:40 +0000 (18:25 +0000)]
selftests/mm: report errno when things fail in gup_longterm
Patch series "selftests/mm: Some cleanups from trying to run them", v2.
I never had much luck running mm selftests so I spent a couple of hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like all
of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate() fails,
but I don't have time to dig deep enough into it to actually make them
pass - I observed these issues on both 9p and virtiofs. Probably it
requires digging into the filesystem implementation
(An alternative might just be to mount a tmpfs in the test script).
I am also seeing some failures to allocate hugetlb pages in uffd-mp-mremap
that I have not had time to fully understand, you can see those here:
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:25 +0000 (11:03 +0900)]
zram: permit reclaim in zstd custom allocator
When configured with pre-trained compression/decompression dictionary
support, zstd requires custom memory allocator, which it calls internally
from compression()/decompression() routines. That means allocation from
atomic context (either under entry spin-lock, or per-CPU local-lock or
both). Now, with non-atomic zram read()/write(), those limitations are
relaxed and we can allow direct and indirect reclaim.
Link: https://lkml.kernel.org/r/20250303022425.285971-17-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:24 +0000 (11:03 +0900)]
zram: switch to new zsmalloc object mapping API
Use new read/write zsmalloc object API. For cases when RO mapped object
spans two physical pages (requires temp buffer) compression streams now
carry around one extra physical page.
Link: https://lkml.kernel.org/r/20250303022425.285971-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:23 +0000 (11:03 +0900)]
zsmalloc: introduce new object mapping API
Current object mapping API is a little cumbersome. First, it's
inconsistent, sometimes it returns with page-faults disabled and sometimes
with page-faults enabled. Second, and most importantly, it enforces
atomicity restrictions on its users. zs_map_object() has to return a
liner object address which is not always possible because some objects
span multiple physical (non-contiguous) pages. For such objects zsmalloc
uses a per-CPU buffer to which object's data is copied before a pointer to
that per-CPU buffer is returned back to the caller. This leads to
another, final, issue - extra memcpy(). Since the caller gets a pointer
to per-CPU buffer it can memcpy() data only to that buffer, and during
zs_unmap_object() zsmalloc will memcpy() from that per-CPU buffer to
physical pages that object in question spans across.
New API splits functions by access mode:
- zs_obj_read_begin(handle, local_copy)
Returns a pointer to handle memory. For objects that span two
physical pages a local_copy buffer is used to store object's
data before the address is returned to the caller. Otherwise
the object's page is kmap_local mapped directly.
- zs_obj_read_end(handle, buf)
Unmaps the page if it was kmap_local mapped by zs_obj_read_begin().
- zs_obj_write(handle, buf, len)
Copies len-bytes from compression buffer to handle memory
(takes care of objects that span two pages). This does not
need any additional (e.g. per-CPU) buffers and writes the data
directly to zsmalloc pool pages.
In terms of performance, on a synthetic and completely reproducible
test that allocates fixed number of objects of fixed sizes and
iterates over those objects, first mapping in RO then in RW mode:
Difference at 95.0% confidence
-1.03219e+08 +/- 55308.7
-27.9705% +/- 0.0149878%
(Student's t, pooled s = 58864.4)
The old API will stay around until the remaining users switch to the new
one. After that we'll also remove zsmalloc per-CPU buffer and CPU hotplug
handling.
The split of map(RO) and map(WO) into read_{begin/end}/write is suggested
by Yosry Ahmed.
Link: https://lkml.kernel.org/r/20250303022425.285971-15-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:22 +0000 (11:03 +0900)]
zsmalloc: sleepable zspage reader-lock
In order to implement preemptible object mapping we need a zspage lock
that satisfies several preconditions:
- it should be reader-write type of a lock
- it should be possible to hold it from any context, but also being
preemptible if the context allows it
- we never sleep while acquiring but can sleep while holding in read
mode
An rwsemaphore doesn't suffice, due to atomicity requirements, rwlock
doesn't satisfy due to reader-preemptability requirement. It's also worth
to mention, that per-zspage rwsem is a little too memory heavy (we can
easily have double digits megabytes used only on rwsemaphores).
Switch over from rwlock_t to a atomic_t-based implementation of a
reader-writer semaphore that satisfies all of the preconditions.
The spin-lock based zspage_lock is suggested by Hillf Danton.
Link: https://lkml.kernel.org/r/20250303022425.285971-14-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:21 +0000 (11:03 +0900)]
zsmalloc: rename pool lock
The old name comes from the times when the pool did not have compaction
(defragmentation). Rename it to ->lock because these days it synchronizes
not only migration.
Link: https://lkml.kernel.org/r/20250303022425.285971-13-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:20 +0000 (11:03 +0900)]
zram: move post-processing target allocation
Allocate post-processing target in place_pp_slot(). This simplifies
scan_slots_for_writeback() and scan_slots_for_recompress() loops because
we don't need to track pps pointer state anymore. Previously we have to
explicitly NULL the point if it has been added to a post-processing bucket
or re-use previously allocated pointer otherwise and make sure we don't
leak the memory in the end.
We are also fine doing GFP_NOIO allocation, as post-processing can be
called under memory pressure so we better pick as many slots as we can as
soon as we can and start post-processing them, possibly saving the memory.
Allocation failure there is not fatal, we will post-process whatever we
put into the buckets on previous iterations.
Link: https://lkml.kernel.org/r/20250303022425.285971-12-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:19 +0000 (11:03 +0900)]
zram: rework recompression loop
This reworks recompression loop handling:
- set a rule that stream-put NULLs the stream pointer If the loop
returns with a non-NULL stream then it's a successful recompression,
otherwise the stream should always be NULL.
- do not count the number of recompressions Mark object as
incompressible as soon as the algorithm with the highest priority failed
to compress that object.
- count compression errors as resource usage Even if compression has
failed, we still need to bump num_recomp_pages counter.
Link: https://lkml.kernel.org/r/20250303022425.285971-11-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:18 +0000 (11:03 +0900)]
zram: filter out recomp targets based on priority
Do no select for post processing slots that are already compressed with
same or higher priority compression algorithm.
This should save some memory, as previously we would still put those
entries into corresponding post-processing buckets and filter them out
later in recompress_slot().
Link: https://lkml.kernel.org/r/20250303022425.285971-10-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:14 +0000 (11:03 +0900)]
zram: remove second stage of handle allocation
Previously zram write() was atomic which required us to pass
__GFP_KSWAPD_RECLAIM to zsmalloc handle allocation on a fast path and
attempt a slow path allocation (with recompression) if the fast path
failed.
Since we are not in atomic context anymore we can permit direct reclaim
during handle allocation, and hence can have a single allocation path.
There is no slow path anymore so we don't unlock per-CPU stream (and don't
lose compressed data) which means that there is no need to do
recompression now (which should reduce CPU and battery usage).
Link: https://lkml.kernel.org/r/20250303022425.285971-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:11 +0000 (11:03 +0900)]
zram: permit preemption with active compression stream
Currently, per-CPU stream access is done from a non-preemptible (atomic)
section, which imposes the same atomicity requirements on compression
backends as entry spin-lock, and makes it impossible to use algorithms
that can schedule/wait/sleep during compression and decompression.
Switch to preemptible per-CPU model, similar to the one used in zswap.
Instead of a per-CPU local lock, each stream carries a mutex which is
locked throughout entire time zram uses it for compression or
decompression, so that cpu-dead event waits for zram to stop using a
particular per-CPU stream and release it.
Link: https://lkml.kernel.org/r/20250303022425.285971-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Mon, 3 Mar 2025 02:03:10 +0000 (11:03 +0900)]
zram: sleepable entry locking
Patch series "zsmalloc/zram: there be preemption", v10.
Currently zram runs compression and decompression in non-preemptible
sections, e.g.
zcomp_stream_get() // grabs CPU local lock
zcomp_compress()
or
zram_slot_lock() // grabs entry spin-lock
zcomp_stream_get() // grabs CPU local lock
zs_map_object() // grabs rwlock and CPU local lock
zcomp_decompress()
Potentially a little troublesome for a number of reasons.
For instance, this makes it impossible to use async compression algorithms
or/and H/W compression algorithms, which can wait for OP completion or
resource availability. This also restricts what compression algorithms
can do internally, for example, zstd can allocate internal state memory
for C/D dictionaries:
Not to mention that the system can be configured to maximize compression
ratio at a cost of CPU/HW time (e.g. lz4hc or deflate with very high
compression level) so zram can stay in non-preemptible section (even under
spin-lock or/and rwlock) for an extended period of time. Aside from
compression algorithms, this also restricts what zram can do. One
particular example is zram_write_page() zsmalloc handle allocation, which
has an optimistic allocation (disallowing direct reclaim) and a
pessimistic fallback path, which then forces zram to compress the page one
more time.
This series changes zram to not directly impose atomicity restrictions on
compression algorithms (and on itself), which makes zram write() fully
preemptible; zram read(), sadly, is not always preemptible yet. There are
still indirect atomicity restrictions imposed by zsmalloc(). One notable
example is object mapping API, which returns with: a) local CPU lock held
b) zspage rwlock held
First, zsmalloc's zspage lock is converted from rwlock to a special type
of RW-lookalike look with some extra guarantees/features. Second, a new
handle mapping is introduced which doesn't use per-CPU buffers (and hence
no local CPU lock), does fewer memcpy() calls, but requires users to
provide a pointer to temp buffer for object copy-in (when needed). Third,
zram is converted to the new zsmalloc mapping API and thus zram read()
becomes preemptible.
This patch (of 19):
Concurrent modifications of meta table entries is now handled by per-entry
spin-lock. This has a number of shortcomings.
First, this imposes atomic requirements on compression backends. zram can
call both zcomp_compress() and zcomp_decompress() under entry spin-lock,
which implies that we can use only compression algorithms that don't
schedule/sleep/wait during compression and decompression. This, for
instance, makes it impossible to use some of the ASYNC compression
algorithms (H/W compression, etc.) implementations.
Second, this can potentially trigger watchdogs. For example, entry
re-compression with secondary algorithms is performed under entry
spin-lock. Given that we chain secondary compression algorithms and that
some of them can be configured for best compression ratio (and worst
compression speed) zram can stay under spin-lock for quite some time.
Having a per-entry mutex (or, for instance, a rw-semaphore) significantly
increases sizeof() of each entry and hence the meta table. Therefore
entry locking returns back to bit locking, as before, however, this time
also preempt-rt friendly, because if waits-on-bit instead of
spinning-on-bit. Lock owners are also now permitted to schedule, which is
a first step on the path of making zram non-atomic.
Liu Ye [Wed, 12 Feb 2025 02:58:42 +0000 (10:58 +0800)]
mm/folio_queue: delete __folio_order and use folio_order directly
__folio_order is the same as folio_order, remove __folio_order and then
just include mm.h and use folio_order directly.
Link: https://lkml.kernel.org/r/20250212025843.80283-2-liuye@kylinos.cn Signed-off-by: Liu Ye <liuye@kylinos.cn> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Colin Ian King [Wed, 19 Feb 2025 08:36:07 +0000 (08:36 +0000)]
mm/mincore: improve performance by adding an unlikely hint
Adding an unlikely() hint on the masked start comparison error return path
improves run-time performance of the mincore system call.
Benchmarking on an i9-12900 shows an improvement of 7ns on mincore calls
on a 256KB mmap'd region where 50% of the pages we resident. Improvement
was from ~970 ns down to 963 ns, so a small ~0.7% improvement.
Results based on running 20 tests with turbo disabled (to reduce clock
freq turbo changes), with 10 second run per test and comparing the number
of mincores calls per second. The % standard deviation of the 20 tests
was ~0.10%, so results are reliable.
SeongJae Park [Wed, 19 Feb 2025 22:01:46 +0000 (14:01 -0800)]
Docs/mm/damon/design: document unmapped DAMOS filter type
Document availability and meaning of unmapped DAMOS filter type on design
document. Since introduction of the type requires no additional user ABI,
usage and ABI document need no update.
SeongJae Park [Wed, 19 Feb 2025 22:01:45 +0000 (14:01 -0800)]
mm/damon: implement a new DAMOS filter type for unmapped pages
Patch series "mm/damon: introduce DAMOS filter type for unmapped pages".
User decides whether their memory will be mapped or unmapped. It implies
that the two types of memory can have different characteristics and
management requirements. Provide the DAMON-observaibility DAMOS-operation
capability for the different types by introducing a new DAMOS filter type
for unmapped pages.
This patch (of 2):
Implement yet another DAMOS filter type for unmapped pages on DAMON kernel
API, and add support of it from the physical address space DAMON
operations set (paddr). Since it is for only unmapped pages, support from
the virtual address spaces DAMON operations set (vaddr) is not required.
Zi Yan [Wed, 26 Feb 2025 21:08:54 +0000 (16:08 -0500)]
mm/shmem: use xas_try_split() in shmem_split_large_entry()
During shmem_split_large_entry(), large swap entries are covering n slots
and an order-0 folio needs to be inserted.
Instead of splitting all n slots, only the 1 slot covered by the folio
need to be split and the remaining n-1 shadow entries can be retained with
orders ranging from 0 to n-1. This method only requires
(n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
(n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
xas_split_alloc() + xas_split() one.
For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
is 6), 1 xa_node is needed instead of 8.
xas_try_split_min_order() is used to reduce the number of calls to
xas_try_split() during split.
Link: https://lkml.kernel.org/r/20250226210854.2045816-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Mattew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:08:53 +0000 (16:08 -0500)]
mm/filemap: use xas_try_split() in __filemap_add_folio()
Patch series "Minimize xa_node allocation during xarry split", v3.
When splitting a multi-index entry in XArray from order-n to order-m,
existing xas_split_alloc()+xas_split() approach requires 2^(n %
XA_CHUNK_SHIFT) xa_node allocations. But its callers,
__filemap_add_folio() and shmem_split_large_entry(), use at most 1
xa_node.
To minimize xa_node allocation and remove the limitation of no split from
order-12 (or above) to order-0 (or anything between 0 and 5)[1],
xas_try_split() was added[2], which allocates (n / XA_CHUNK_SHIFT - m /
XA_CHUNK_SHIFT) xa_node. It is used for non-uniform folio split, but can
be used by __filemap_add_folio() and shmem_split_large_entry().
xas_split_alloc() and xas_split() split an order-9 to order-0:
xas_try_split() is designed to be called iteratively with n = m + 1.
xas_try_split_mini_order() is added to minmize the number of calls to
xas_try_split() by telling the caller the next minimal order to split to
instead of n - 1. Splitting order-n to order-m when m= l * XA_CHUNK_SHIFT
does not require xa_node allocation and requires 1 xa_node when n=l *
XA_CHUNK_SHIFT and m = n - 1, so it is OK to use xas_try_split() with n >
m + 1 when no new xa_node is needed.
During __filemap_add_folio(), a shadow entry is covering n slots and a
folio covers m slots with m < n is to be added. Instead of splitting all
n slots, only the m slots covered by the folio need to be split and the
remaining n-m shadow entries can be retained with orders ranging from m to
n-1. This method only requires
(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)
new xa_nodes instead of
(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))
new xa_nodes, compared to the original xas_split_alloc() + xas_split()
one. For example, to insert an order-0 folio when an order-9 shadow entry
is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of
8.
xas_try_split_min_order() is introduced to reduce the number of calls to
xas_try_split() during split.
Link: https://lkml.kernel.org/r/20250226210854.2045816-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250226210854.2045816-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mattew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:31 +0000 (16:00 -0500)]
selftests/mm: add tests for folio_split(), buddy allocator like split
It splits page cache folios to orders from 0 to 8 at different in-folio
offset.
Link: https://lkml.kernel.org/r/20250226210032.2044041-9-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Sun, 2 Mar 2025 03:34:24 +0000 (22:34 -0500)]
mm/truncate: make sure folio2 is large and has the same mapping after lock
It is possible that folio2 no longer belongs to the original mapping.
Link: https://lkml.kernel.org/r/56EBE3B6-99EA-470E-B2B3-92C9C13032DF@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:30 +0000 (16:00 -0500)]
mm/truncate: use buddy allocator like folio split for truncate operation
Instead of splitting the large folio uniformly during truncation, try to
use buddy allocator like split at the start of truncation range to
minimize the number of resulting folios if it is supported.
try_folio_split() is introduced to use folio_split() if supported and fall
back to uniform split otherwise.
For example, to truncate a order-4 folio
[0, 1, 2, 3, 4, 5, ..., 15]
between [3, 10] (inclusive), folio_split() splits the folio to
[0,1], [2], [3], [4..7], [8..15] and [3], [4..7] can be dropped and
[8..15] is kept with zeros in [8..10], then another folio_split() is
done at 10, so [8..10] can be dropped.
One possible optimization is to make folio_split() to split a folio based
on a given range, like [3..10] above. But that complicates folio_split(),
so it will be investigated when necessary.
Link: https://lkml.kernel.org/r/20250226210032.2044041-8-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:29 +0000 (16:00 -0500)]
mm/huge_memory: add folio_split() to debugfs testing interface
This allows to test folio_split() by specifying an additional in folio
page offset parameter to split_huge_page debugfs interface.
Link: https://lkml.kernel.org/r/20250226210032.2044041-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:28 +0000 (16:00 -0500)]
mm/huge_memory: remove the old, unused __split_huge_page()
Now split_huge_page_to_list_to_order() uses the new backend split code in
__folio_split_without_mapping(), the old __split_huge_page() and
__split_huge_page_tail() can be removed.
Link: https://lkml.kernel.org/r/20250226210032.2044041-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:27 +0000 (16:00 -0500)]
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if user wants to free the
3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
since anon folio does not support order-1 yet.
-----------------------------------------------------------------
| | | | | | | | |
|O-0|O-0|O-0|O-0| O-2 |...| O-7 | O-8 |
| | | | | | | | |
-----------------------------------------------------------------
It generates fewer folios (i.e., 11 or 10) than existing page split
approach, which splits the order-9 to 512 order-0 folios. It also reduces
the number of new xa_node needed during a pagecache folio split from 8 to
1, potentially decreasing the folio split failure rate due to memory
constraints.
folio_split() and existing split_huge_page_to_list_to_order() share the
folio unmapping and remapping code in __folio_split() and the common
backend split code in __split_unmapped_folio() using uniform_split
variable to distinguish their operations.
uniform_split_supported() and non_uniform_split_supported() are added to
factor out check code and will be used outside __folio_split() in the
following commit.
Link: https://lkml.kernel.org/r/20250226210032.2044041-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:26 +0000 (16:00 -0500)]
mm/huge_memory: move folio split common code to __folio_split()
This is a preparation patch for folio_split().
In the upcoming patch folio_split() will share folio unmapping and
remapping code with split_huge_page_to_list_to_order(), so move the code
to a common function __folio_split() first.
Link: https://lkml.kernel.org/r/20250226210032.2044041-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:25 +0000 (16:00 -0500)]
mm/huge_memory: add two new (not yet used) functions for folio_split()
This is a preparation patch, both added functions are not used yet.
The added __split_unmapped_folio() is able to split a folio with its
mapping removed in two manners: 1) uniform split (the existing way), and
2) buddy allocator like split.
The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __split_unmapped_folio() calls it once to split the
given folio to the new order. For buddy allocator split,
__split_unmapped_folio() calls it (folio_order - new_order) times and each
time splits the folio containing the given page to one lower order.
Link: https://lkml.kernel.org/r/20250226210032.2044041-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 26 Feb 2025 21:00:24 +0000 (16:00 -0500)]
xarray: add xas_try_split() to split a multi-index entry
Patch series "Buddy allocator like (or non-uniform) folio split", v9.
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from a order-n folio to order-m with m < n. It reduces
1. the total number of after-split folios from 2^(n-m) to n-m+1;
2. the amount of memory needed for multi-index xarray split from 2^(n/6-m/6) to
n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keep more large folios after a split from all order-m folios to
order-(n-1) to order-m folios.
For example, to split an order-9 to order-0, folio split generates 10 (or
11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2
order-0 folios (or 4 order-0 for anonymous memory) instead of 512 order-0
folios.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split() can
support both uniform split and buddy allocator like (or non-uniform)
split. All existing split_huge_page*() users can be gradually converted
to use folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
This patch (of 8):
A preparation patch for non-uniform folio split, which always split a
folio into half iteratively, and minimal xarray entry split.
Currently, xas_split_alloc() and xas_split() always split all slots from a
multi-index entry. They cost the same number of xa_node as the
to-be-split slots. For example, to split an order-9 entry, which takes
2^(9-6)=8 slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8
xa_node are needed. Instead xas_try_split() is intended to be used
iteratively to split the order-9 entry into 2 order-8 entries, then split
one order-8 entry, based on the given index, to 2 order-7 entries, ...,
and split one order-1 entry to 2 order-0 entries. When splitting the
order-6 entry and a new xa_node is needed, xas_try_split() will try to
allocate one if possible. As a result, xas_try_split() would only need
one xa_node instead of 8.
When a new xa_node is needed during the split, xas_try_split() can try to
allocate one but no more. -ENOMEM will be return if a node cannot be
allocated. -EINVAL will be return if a sibling node is split or cascade
split happens, where two or more new nodes are needed, and these are not
supported by xas_try_split().
xas_split_alloc() and xas_split() split an order-9 to order-0:
Link: https://lkml.kernel.org/r/20250226210032.2044041-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250226210032.2044041-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:28 +0000 (18:29 +0000)]
mm/hugetlb: move hugetlb CMA code in to its own file
hugetlb.c contained a number of CONFIG_CMA ifdefs, and the code inside
them was large enough to merit being in its own file, so move it, cleaning
up things a bit.
Hide some direct variable access behind functions to accommodate the move.
No functional change intended.
Link: https://lkml.kernel.org/r/20250228182928.2645936-28-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:27 +0000 (18:29 +0000)]
mm/hugetlb: enable bootmem allocation from CMA areas
If hugetlb_cma_only is enabled, we know that hugetlb pages can only be
allocated from CMA. Now that there is an interface to do early
reservations from a CMA area (returning memblock memory), it can be used
to allocate hugetlb pages from CMA.
This also allows for doing pre-HVO on these pages (if enabled).
Make sure to initialize the page structures and associated data correctly.
Create a flag to signal that a hugetlb page has been allocated from CMA
to make things a little easier.
Some configurations of powerpc have a special hugetlb bootmem allocator,
so introduce a boolean arch_specific_huge_bootmem_alloc that returns true
if such an allocator is present. In that case, CMA bootmem allocations
can't be used, so check that function before trying.
Link: https://lkml.kernel.org/r/20250228182928.2645936-27-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:26 +0000 (18:29 +0000)]
mm/hugetlb: add hugetlb_cma_only cmdline option
Add an option to force hugetlb gigantic pages to be allocated using CMA
only (if hugetlb_cma is enabled). This avoids a fallback to allocation
from the rest of system memory if the CMA allocation fails. This makes
the size of hugetlb_cma a hard upper boundary for gigantic hugetlb page
allocations.
This is useful because, with a large CMA area, the kernel's unmovable
allocations will have less room to work with and it is undesirable for new
hugetlb gigantic page allocations to be done from that remaining area. It
will eat in to the space available for unmovable allocations, leading to
unwanted system behavior (OOMs because the kernel fails to do unmovable
allocations).
So, with this enabled, an administrator can force a hard upper bound for
runtime gigantic page allocations, and have more predictable system
behavior.
Link: https://lkml.kernel.org/r/20250228182928.2645936-26-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:25 +0000 (18:29 +0000)]
mm/cma: introduce interface for early reservations
It can be desirable to reserve memory in a CMA area before it is
activated, early in boot. Such reservations would effectively be memblock
allocations, but they can be returned to the CMA area later. This
functionality can be used to allow hugetlb bootmem allocations from a
hugetlb CMA area.
A new interface, cma_reserve_early is introduced. This allows for
pageblock-aligned reservations. These reservations are skipped during the
initial handoff of pages in a CMA area to the buddy allocator. The caller
is responsible for making sure that the page structures are set up, and
that the migrate type is set correctly, as with other memblock allocations
that stick around. If the CMA area fails to activate (because it
intersects with multiple zones), the reserved memory is not given to the
buddy allocator, the caller needs to take care of that.
Link: https://lkml.kernel.org/r/20250228182928.2645936-25-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:24 +0000 (18:29 +0000)]
mm/cma: introduce a cma validate function
Define a function to check if a CMA area is valid, which means: do its
ranges not cross any zone boundaries. Store the result in the newly
created flags for each CMA area, so that multiple calls are dealt with.
This allows for checking the validity of a CMA area early, which is needed
later in order to be able to allocate hugetlb bootmem pages from it with
pre-HVO.
Link: https://lkml.kernel.org/r/20250228182928.2645936-24-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:23 +0000 (18:29 +0000)]
mm/cma: simplify zone intersection check
cma_activate_area walks all pages in the area, checking their zone
individually to see if the area resides in more than one zone.
Make this a little more efficient by using the recently introduced
pfn_range_intersects_zones() function. Store the NUMA node id (if any) in
the cma structure to facilitate this.
Link: https://lkml.kernel.org/r/20250228182928.2645936-23-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:22 +0000 (18:29 +0000)]
x86/mm: set ARCH_WANT_HUGETLB_VMEMMAP_PREINIT
Now that hugetlb bootmem pages are allocated earlier, and available for
section preinit (HVO-style), set ARCH_WANT_HUGETLB_VMEMMAP_PREINIT for
x86_64, so that is can be done.
This enables pre-HVO on x86_64.
Link: https://lkml.kernel.org/r/20250228182928.2645936-22-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:21 +0000 (18:29 +0000)]
x86/setup: call hugetlb_bootmem_alloc early
Call hugetlb_bootmem_allloc in an earlier spot in setup, after
hugelb_cma_reserve. This will make vmemmap preinit of the sections
covered by the allocated hugetlb pages possible.
Link: https://lkml.kernel.org/r/20250228182928.2645936-21-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:20 +0000 (18:29 +0000)]
mm/hugetlb: do pre-HVO for bootmem allocated pages
For large systems, the overhead of vmemmap pages for hugetlb is
substantial. It's about 1.5% of memory, which is about 45G for a 3T
system. If you want to configure most of that system for hugetlb (e.g.
to use as backing memory for VMs), there is a chance of running out of
memory on boot, even though you know that the 45G will become available
later.
To avoid this scenario, and since it's a waste to first allocate and then
free that 45G during boot, do pre-HVO for hugetlb bootmem allocated pages
('gigantic' pages).
pre-HVO is done by adding functions that are called from
sparse_init_nid_early and sparse_init_nid_late. The first is called
before memmap allocation, so it takes care of allocating memmap HVO-style.
The second verifies that all bootmem pages look good, specifically it
checks that they do not intersect with multiple zones. This can only be
done from sparse_init_nid_late path, when zones have been initialized.
The hugetlb page size must be aligned to the section size, and aligned to
the size of memory described by the number of page structures contained in
one PMD (since pre-HVO is not prepared to split PMDs). This should be
true for most 'gigantic' pages, it is for 1G pages on x86, where both of
these alignment requirements are 128M.
This will only have an effect if hugetlb_bootmem_alloc was called early in
boot. If not, it won't do anything, and HVO for bootmem hugetlb pages
works as before.
Link: https://lkml.kernel.org/r/20250228182928.2645936-20-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make the hugetlb_vmemmap_restore_folios definition inline for the
!CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP case, so that including this file in
files other than hugetlb_vmemmap.c will work.
Link: https://lkml.kernel.org/r/20250228182928.2645936-19-fvdl@google.com Fixes: cfb8c75099db ("hugetlb: perform vmemmap restoration on a list of pages") Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:18 +0000 (18:29 +0000)]
mm/hugetlb: add pre-HVO framework
Define flags for pre-HVOed bootmem hugetlb pages, and act on them.
The most important flag is the HVO flag, signalling that a bootmem
allocated gigantic page has already been HVO-ed. If this flag is seen by
the hugetlb bootmem gather code, the page is marked as HVO optimized. The
HVO code will then not try to optimize it again. Instead, it will just
map the tail page mirror pages read-only, completing the HVO steps.
No functional change, as nothing sets the flags yet.
Link: https://lkml.kernel.org/r/20250228182928.2645936-18-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:17 +0000 (18:29 +0000)]
mm/hugetlb: move huge_boot_pages list init to hugetlb_bootmem_alloc
Instead of initializing the per-node hugetlb bootmem pages list from the
alloc function, we can now do it in a somewhat cleaner way, since there is
an explicit hugetlb_bootmem_alloc function. Initialize the lists there.
Link: https://lkml.kernel.org/r/20250228182928.2645936-17-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:16 +0000 (18:29 +0000)]
mm/hugetlb: deal with multiple calls to hugetlb_bootmem_alloc
Architectures that want pre-HVO of hugetlb vmemmap pages will need to call
hugetlb_bootmem_alloc from an earlier spot in boot (before sparse_init).
To facilitate some architectures doing this, protect hugetlb_bootmem_alloc
against multiple calls.
Also provide a helper function to check if it's been called, so that the
early HVO code, to be added later, can see if there is anything to do.
Link: https://lkml.kernel.org/r/20250228182928.2645936-16-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The populate and undo functions are expected to be used in early init,
from the sparse_init_nid_early() function. The wrprotect function is to
be used, potentially, later.
To implement these functions, mostly re-use the existing compound pages
vmemmap logic used by DAX. vmemmap_populate_address has its argument
changed a bit in this commit: the page structure passed in to be reused in
the mapping is replaced by a PFN and a flag. The flag indicates whether
an extra ref should be taken on the vmemmap page containing the head page
structure. Taking the ref is appropriate to for DAX / ZONE_DEVICE, but
not for HugeTLB HVO.
The HugeTLB vmemmap optimization maps tail page structure pages read-only.
The vmemmap_wrprotect_hvo function that does this is implemented
separately, because it cannot be guaranteed that reserved page structures
will not be write accessed during memory initialization. Even with
CONFIG_DEFERRED_STRUCT_PAGE_INIT, they might still be written to (if they
are at the bottom of a zone). So, vmemmap_populate_hvo leaves the tail
page structure pages RW initially, and then later during initialization,
after memmap init is fully done, vmemmap_wrprotect_hvo must be called to
finish the job.
Subsequent commits will use these functions for early HugeTLB HVO.
Link: https://lkml.kernel.org/r/20250228182928.2645936-15-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:14 +0000 (18:29 +0000)]
mm/hugetlb: check bootmem pages for zone intersections
Bootmem hugetlb pages are allocated using memblock, which isn't (and
mostly can't be) aware of zones.
So, they may end up crossing zone boundaries. This would create
confusion, a hugetlb page that is part of multiple zones is bad. Worse,
HVO might then end up stealthily re-assigning pages to a different zone
when a hugetlb page is freed, since the tail page structures beyond the
first vmemmap page would inherit the zone of the first page structures.
While the chance of this happening is low, you can definitely create a
configuration where this happens (especially using ZONE_MOVABLE).
To avoid this issue, check if bootmem hugetlb pages intersect with
multiple zones during the gather phase, and discard them, handing them to
the page allocator, if they do. Record the number of invalid bootmem
pages per node and subtract them from the number of available pages at the
end, making it easier to do these checks in multiple places later on.
Link: https://lkml.kernel.org/r/20250228182928.2645936-14-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:13 +0000 (18:29 +0000)]
mm: define __init_reserved_page_zone function
Sometimes page structs must be unconditionally initialized as reserved,
regardless of DEFERRED_STRUCT_PAGE_INIT.
Define a function, __init_reserved_page_zone, containing code that already
did all of the work in init_reserved_page, and make it available for use.
Link: https://lkml.kernel.org/r/20250228182928.2645936-13-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:12 +0000 (18:29 +0000)]
mm/hugetlb: set migratetype for bootmem folios
The pageblocks that back memblock allocated hugetlb folios might not have
the migrate type set, in the CONFIG_DEFERRED_STRUCT_PAGE_INIT case.
memblock allocated hugetlb folios might be given to the buddy allocator
eventually (if nr_hugepages is lowered), so make sure that the migrate
type for the pageblocks contained in them is set when initializing them.
Set it to the default that memmap init also uses (MIGRATE_MOVABLE).
Link: https://lkml.kernel.org/r/20250228182928.2645936-12-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:11 +0000 (18:29 +0000)]
mm/sparse: allow for alternate vmemmap section init at boot
Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized.
They are called sparse_vmemmap_init_nid_early and
sparse_vmemmap_init_nid_late, respectively.
This allows for mm subsystems to add calls to initialize memmap and page
structures in a specific way, if using SPARSEMEM_VMEMMAP. Specifically,
hugetlb can pre-HVO bootmem allocated pages that way, so that no time and
resources are wasted on allocating vmemmap pages, only to free them later
(and possibly unnecessarily running the system out of memory in the
process).
Refactor some code and export a few convenience functions for external
use.
In sparse_init_nid, skip any sections that are already initialized, e.g.
they have been initialized by sparse_vmemmap_init_nid_early already.
The hugetlb code to use these functions will be added in a later commit.
Export section_map_size, as any alternate memmap init code will want to
use it.
The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT,
which is selected if an architecture-specific option,
ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set. In the future, if other
subsystems want to do preinit too, they can do it in a similar fashion.
The internal config option is there because a section flag is used, and
the number of flags available is architecture-dependent (see mmzone.h).
Architecures can decide if there is room for the flag when enabling
options that select SPARSEMEM_VMEMMAP_PREINIT.
Fortunately, as of right now, all sparse vmemmap using architectures do
have room.
Link: https://lkml.kernel.org/r/20250228182928.2645936-11-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If other mm code wants to use this function for early memmap inialization
(on the platforms that have it), it should be made available properly, not
just unconditionally in mm.h
Make this function available for such cases.
Link: https://lkml.kernel.org/r/20250228182928.2645936-10-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:09 +0000 (18:29 +0000)]
x86/mm: make register_page_bootmem_memmap handle PTE mappings
register_page_bootmem_memmap expects that vmemmap pages handed to it are
PMD-mapped, and that the number of pages to call get_page_bootmem on is
PMD-aligned.
This is currently a correct assumption, but will no longer be true once
pre-HVO of hugetlb pages is implemented.
Make it handle PTE-mapped vmemmap pages and a nr_pages argument that is
not necessarily PAGES_PER_SECTION.
Link: https://lkml.kernel.org/r/20250228182928.2645936-9-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:08 +0000 (18:29 +0000)]
mm/hugetlb: convert cmdline parameters from setup to early
Convert the cmdline parameters (hugepagesz, hugepages, default_hugepagesz
and hugetlb_free_vmemmap) to early parameters.
Since parse_early_param might run before MMU setups on some platforms
(powerpc), validation of huge page sizes as specified in command line
parameters would fail. So instead, for the hstate-related values, just
record the them and parse them on demand, from hugetlb_bootmem_alloc.
The allocation of hugetlb bootmem pages is now done in
hugetlb_bootmem_alloc, which is called explicitly at the start of
mm_core_init(). core_initcall would be too late, as that happens with
memblock already torn down.
This change will allow earlier allocation and initialization of bootmem
hugetlb pages later on.
No functional change intended.
Link: https://lkml.kernel.org/r/20250228182928.2645936-8-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:07 +0000 (18:29 +0000)]
mm/hugetlb: use online nodes for bootmem allocation
Later commits will move hugetlb bootmem allocation to earlier in init,
when N_MEMORY has not yet been set on nodes. Use online nodes instead.
At most, this wastes just a few cycles once during boot (and most likely
none).
Link: https://lkml.kernel.org/r/20250228182928.2645936-7-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:06 +0000 (18:29 +0000)]
mm/hugetlb: remove redundant __ClearPageReserved
In hugetlb_folio_init_tail_vmemmap, the reserved flag is cleared for the
tail page just before it is zeroed out, which is redundant. Remove the
__ClearPageReserved call.
Link: https://lkml.kernel.org/r/20250228182928.2645936-6-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frank van der Linden [Fri, 28 Feb 2025 18:29:05 +0000 (18:29 +0000)]
mm, hugetlb: use cma_declare_contiguous_multi
hugetlb_cma is fine with using multiple CMA ranges, as long as it can get
its gigantic pages allocated from them. So, use
cma_declare_contiguous_multi to allow for multiple ranges, increasing the
chances of getting what we want on systems with gaps in physical memory.
Link: https://lkml.kernel.org/r/20250228182928.2645936-5-fvdl@google.com Signed-off-by: Frank van der Linden <fvdl@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>