David Hildenbrand [Mon, 1 Sep 2025 15:03:27 +0000 (17:03 +0200)]
mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
Let's reject them early, which in turn makes folio_alloc_gigantic() reject
them properly.
To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
and calculate MAX_FOLIO_NR_PAGES based on that.
While at it, let's just make the order a "const unsigned order".
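A minimal sketch of the idea, assuming the constant names from this message (the real check sits inside alloc_contig_range_noprof(); the helper below is purely illustrative):
  /* derive the page count from the new order constant */
  #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)

  /* reject ranges no folio could ever span, before doing any work */
  static bool contig_range_too_large(unsigned long start_pfn, unsigned long end_pfn)
  {
      return end_pfn - start_pfn > MAX_FOLIO_NR_PAGES;
  }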
Link: https://lkml.kernel.org/r/20250901150359.867252-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Mon, 1 Sep 2025 15:03:22 +0000 (17:03 +0200)]
mm: stop making SPARSEMEM_VMEMMAP user-selectable
Patch series "mm: remove nth_page()", v2.
As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.
To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.
While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.
So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to then go from (++pfn)->page, which is rather nasty.
Likely, many people have no idea when nth_page() is required and when it
might be dropped.
We refer to such problematic PFN ranges as "non-contiguous pages". If we
only deal with "contiguous pages", there is no need for nth_page().
Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).
So let's handle all that, add sanity checks, and remove nth_page().
Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #13 : disallow folios to have non-contiguous pages
Patch #14 -> #20 : remove nth_page() usage within folios
Patch #22 : disallow CMA allocations of non-contiguous pages
Patch #23 -> #33 : sanity-check + remove nth_page() usage within SG entry
Patch #34 : sanity-check + remove nth_page() usage in
unpin_user_page_range_dirty_lock()
Patch #35 : remove nth_page() in kfence
Patch #36 : adjust stale comment regarding nth_page
Patch #37 : mm: remove nth_page()
A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so kudos to them.
This patch (of 37):
In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit, SPARSEMEM_VMEMMAP is
considered too costly and is consequently not supported.
However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user from disabling VMEMMAP: just like we already do for
arm64, s390 and x86.
So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without
SPARSEMEM_VMEMMAP.
This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc. All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).
This is a preparation for not supporting
(1) folio sizes that exceed a single memory section
(2) CMA allocations of non-contiguous page ranges
in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
the possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly failing).
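For context, a condensed view of nth_page()'s historical definition in include/linux/mm.h, which this series sets out to remove:
  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
  /* memmap is allocated per section: "page + n" may leave the memmap chunk */
  #define nth_page(page, n)    pfn_to_page(page_to_pfn((page)) + (n))
  #else
  /* memmap is virtually contiguous: plain pointer arithmetic is fine */
  #define nth_page(page, n)    ((page) + (n))
  #endif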
Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Cc: Alex Dubov <oakad@yahoo.com> Cc: Alex Willamson <alex.williamson@redhat.com> Cc: Bart van Assche <bvanassche@acm.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Brett Creeley <brett.creeley@amd.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Lameter (Ampere) <cl@gentwo.org> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ingo Molnar <mingo@redhat.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: "Martin K. 
Petersen" <martin.petersen@oracle.com> Cc: Maxim Levitky <maximlevitsky@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Niklas Cassel <cassel@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kaushlendra Kumar [Sat, 30 Aug 2025 17:20:22 +0000 (22:50 +0530)]
tools/mm/slabinfo: fix access to null terminator at string boundary
The current code incorrectly accesses buffer[strlen(buffer)], which points
to the null terminator ('\0') at the end of the string rather than the
last character of the content: valid string content ends at index
strlen(buffer)-1.
Fix by:
1. Declaring strlen() result variable at function scope
2. Adding bounds check (len > 0) to handle empty strings
3. Using buffer[len-1] to correctly access the last character before
the null terminator
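A sketch of the described pattern (the trailing-newline check is an assumption for illustration; the real slabinfo.c context may differ):
  size_t len = strlen(buffer);

  /* only inspect the last character if the string is non-empty */
  if (len > 0 && buffer[len - 1] == '\n')
      buffer[len - 1] = '\0';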
Andrew Morton [Sun, 31 Aug 2025 19:29:57 +0000 (12:29 -0700)]
memfd: move MFD_ALL_FLAGS definition to memfd.h
It's not part of the UAPI, but putting it here is better from a
maintainability POV. Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Inside the small stack_not_used() function there are several ifdefs for
the stack growing-up vs. regular versions. Instead, just implement the
function twice: once for growing-up stacks and once for the regular case.
Add comments like /* !CONFIG_DEBUG_STACK_USAGE */ to clarify what the
ifdefs are doing.
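A sketch of the resulting shape, based on the long-standing stack_not_used() logic (exact placement and surrounding ifdefs may differ):
  #ifdef CONFIG_STACK_GROWSUP
  static unsigned long stack_not_used(struct task_struct *p)
  {
      unsigned long *n = end_of_stack(p);

      do {    /* walk down over the canary and unused (zero) words */
          n--;
      } while (!*n);

      return (unsigned long)end_of_stack(p) - (unsigned long)n;
  }
  #else /* !CONFIG_STACK_GROWSUP */
  static unsigned long stack_not_used(struct task_struct *p)
  {
      unsigned long *n = end_of_stack(p);

      do {    /* walk up over the canary and unused (zero) words */
          n++;
      } while (!*n);

      return (unsigned long)n - (unsigned long)end_of_stack(p);
  }
  #endif /* CONFIG_STACK_GROWSUP */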
[linus.walleij@linaro.org: rebased, function moved elsewhere in the kernel] Link: https://lkml.kernel.org/r/20250829-fork-cleanups-for-dynstack-v1-2-3bbaadce1f00@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-13-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pasha Tatashin [Fri, 29 Aug 2025 11:44:40 +0000 (13:44 +0200)]
fork: check charging success before zeroing stack
Patch series "mm: task_stack: Stack handling cleanups".
These are some small cleanups for the fork code that were split off from
Pasha's dynamic stack patch series; they are generally nice on their own,
so let's propose them for merging.
This patch (of 2):
There is no need to zero a cached stack if the memcg charge fails, so
move the charging attempt before the memset operation.
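A sketch of the reorder (the wrapper and helper names approximate kernel/fork.c and are not the literal diff):
  static int reuse_cached_stack(struct task_struct *tsk, void *stack,
                                struct vm_struct *vm)
  {
      /* charge first: no point zeroing a stack we may have to give back */
      if (memcg_charge_kernel_stack(vm))
          return -ENOMEM;

      /* only clear the cached stack once the charge has succeeded */
      memset(stack, 0, THREAD_SIZE);
      tsk->stack = stack;
      return 0;
  }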
Link: https://lkml.kernel.org/r/20250829-fork-cleanups-for-dynstack-v1-0-3bbaadce1f00@linaro.org Link: https://lkml.kernel.org/r/20250829-fork-cleanups-for-dynstack-v1-1-3bbaadce1f00@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-6-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ujwal Kundur [Fri, 29 Aug 2025 15:56:00 +0000 (21:26 +0530)]
selftests/mm/uffd: refactor non-composite global vars into struct
Refactor macros and non-composite global variable definitions into a
struct that is defined at the start of a test and is passed around instead
of relying on global vars.
Link: https://lkml.kernel.org/r/20250829155600.2000-1-ujwal.kundur@gmail.com Signed-off-by: Ujwal Kundur <ujwal.kundur@gmail.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 29 Aug 2025 16:15:27 +0000 (17:15 +0100)]
mm: remove unused zpool layer
With zswap using zsmalloc directly, there are no more in-tree users of
this code. Remove it.
With zpool gone, zsmalloc is now always a simple dependency and no
longer something the user needs to configure. Hide CONFIG_ZSMALLOC
from the user and have zswap and zram pull it in as needed.
Link: https://lkml.kernel.org/r/20250829162212.208258-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 29 Aug 2025 16:15:26 +0000 (17:15 +0100)]
mm: zswap: interact directly with zsmalloc
Patch series "mm: remove zpool".
zpool is an indirection layer for zswap to switch between multiple
allocator backends at runtime. Since 6.15, zsmalloc is the only allocator
left in-tree, so there is no point in keeping zpool around.
This patch (of 3):
zswap goes through the zpool layer to enable runtime-switching of
allocator backends for compressed data. However, since zbud and z3fold
were removed in 6.15, zsmalloc has been the only option available.
As such, the zpool indirection is unnecessary. Make zswap deal with
zsmalloc directly. This is comparable to zram, which also directly
interacts with zsmalloc and has never supported a different backend.
Note that this does not preclude future improvements and experiments with
different allocation strategies. Should it become necessary, it's
possible to provide an alternate implementation for the zsmalloc API,
selectable at compile time. However, zsmalloc is also rather mature and
feature rich, with years of widespread production exposure; it's
encouraged to make incremental improvements rather than fork it.
In any case, the complexity of runtime pluggability seems excessive and
unjustified at this time. Switch zswap to zsmalloc to remove the last
user of the zpool API.
Liam R. Howlett [Thu, 28 Aug 2025 00:30:23 +0000 (20:30 -0400)]
maple_tree: testing fix for spanning store on 32b
32 bit nodes have a larger branching factor. This affects the required
value to cause a height change. Update the spanning store height test to
work for both 64 and 32 bit nodes.
Link: https://lkml.kernel.org/r/20250828003023.418966-3-Liam.Howlett@oracle.com Fixes: f9d3a963fef4 ("maple_tree: use height and depth consistently") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 28 Aug 2025 00:30:22 +0000 (20:30 -0400)]
maple_tree: fix testing for 32 bit builds
Patch series "maple_tree: Fix testing for 32bit compiles".
The maple tree test suite supports 32bit builds, which use 32bit nodes
and index/last values. Some tests have too-large values and must be
skipped, while others depend on certain actions causing the tree to be
altered in another measurable way (such as the height decreasing or
increasing).
Two tests were added that broke 32bit testing, either with compile
warnings or failures. These fixes restore the tests to working order.
Building the 32bit version can be done on a 32bit platform, or by using a
command like: BUILD=32 make clean maple
This patch (of 2):
Some tests are invalid on 32bit due to the size of the index and last.
Making those tests depend on the correct build flags stops compile
complaints.
Max Kellermann [Thu, 28 Aug 2025 08:48:20 +0000 (10:48 +0200)]
huge_mm.h: disallow is_huge_zero_folio(NULL)
Calling is_huge_zero_folio(NULL) should not be legal - it makes no sense,
and a different (theoretical) implementation may dereference the pointer.
But currently, lacking any explicit documentation, this call is possible.
But if somebody really passes NULL, the function should not return true -
this isn't the huge zero folio after all! However, if the
`huge_zero_folio` hasn't been allocated yet, it's NULL, and
is_huge_zero_folio(NULL) just happens to return true, which is a lie.
This weird side effect prevented me from reproducing a kernel crash that
occurred when the elements of a folio_batch were NULL - since
folios_put_refs() skips huge zero folios, this sometimes causes a crash,
but sometimes does not. For debugging, it is better to reveal such bugs
reliably and not hide them behind random preconditions like "has the huge
zero folio already been created?"
To improve detection of such bugs, David Hildenbrand suggested adding a
VM_WARN_ON_ONCE().
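A sketch of the resulting check in huge_mm.h:
  static inline bool is_huge_zero_folio(const struct folio *folio)
  {
      /* NULL is a caller bug; it is never the huge zero folio */
      VM_WARN_ON_ONCE(!folio);

      return READ_ONCE(huge_zero_folio) == folio;
  }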
Link: https://lkml.kernel.org/r/20250828084820.570118-1-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Thu, 28 Aug 2025 09:16:18 +0000 (09:16 +0000)]
mm/page_alloc: find_large_buddy() from start_pfn aligned order
To find a large buddy, we iterate pfn alignments from order 0 up to
MAX_PAGE_ORDER. But while the order is below start_pfn's own alignment
order, aligning yields the same pfn, and we do the same check again.
Iterate from start_pfn's alignment order instead to reduce the duplicated
work.
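A sketch of the starting-order computation (names assumed, not the literal diff): every order below start_pfn's own alignment order yields the same aligned pfn, so the search can begin there:
  static unsigned int buddy_start_order(unsigned long start_pfn)
  {
      if (!start_pfn)
          return MAX_PAGE_ORDER;    /* pfn 0 is aligned to every order */

      /* __ffs() counts trailing zero bits, i.e. the alignment order */
      return min_t(unsigned int, __ffs(start_pfn), MAX_PAGE_ORDER);
  }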
Brendan Jackman [Thu, 28 Aug 2025 12:28:01 +0000 (12:28 +0000)]
tools: testing: use existing atomic.h for vma/maple tests
The shared userspace logic used for unit-testing maple tree and VMA code
currently has its own replacements for atomics helpers. This is not
needed as the necessary APIs already have userspace implementations in the
tools tree. Switching over to that allows deleting a bit of code.
Note that the implementation is different; while the version being deleted
here is implemented using liburcu, the existing version in tools uses
either x86 asm or compiler builtins. It's assumed that both are equally
likely to be correct.
The tools tree's version of atomic_t is a struct type while the version
being deleted was just a typedef of an integer. This means it's no longer
valid to call __sync_bool_compare_and_swap() directly on it. One option
would be to just peek into the struct and call it on the field, but it
seems a little cleaner to just use the corresponding atomic.h API, which
has been added recently. The fake mapping_map_writable() is now copied
from the real one.
Link: https://lkml.kernel.org/r/20250828-b4-vma-no-atomic-h-v2-4-02d146a58ed2@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Thu, 28 Aug 2025 12:27:59 +0000 (12:27 +0000)]
tools: testing: allow importing arch headers in shared.mk
There is an arch/ tree under tools that contains some useful stuff. To
make it available, add it to the -I flags. This requires $(SRCARCH),
which is provided by Makefile.arch, so include that.
There still aren't that many headers so also just smush all of them into
SHARED_DEPS instead of starting to do any header dependency hocus pocus.
Link: https://lkml.kernel.org/r/20250828-b4-vma-no-atomic-h-v2-2-02d146a58ed2@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Thu, 28 Aug 2025 12:27:58 +0000 (12:27 +0000)]
tools/include: implement a couple of atomic_t ops
Patch series "tools: testing: Use existing atomic.h for vma/maple tests",
v2.
De-duplicating this lets us delete a bit of code.
Ulterior motive: I'm working on a new set of the userspace-based unit
tests, which will need the atomics API too. That would involve even more
duplication, so while the win in this patchset alone is very minimal, it
looks a lot more significant with my other WIP patchset.
I've tested these commands:
make -C tools/testing/vma -j
tools/testing/vma/vma
make -C tools/testing/radix-tree -j
tools/testing/radix-tree/maple
Note the EXTRA_CFLAGS patch is actually orthogonal, let me know if you'd
prefer I send it separately.
This patch (of 4):
The VMA tests need an operation equivalent to atomic_inc_unless_negative()
to implement a fake mapping_map_writable(). Adding it will enable them to
switch to the shared atomic headers and simplify that fake implementation.
In order to add that, also add atomic_try_cmpxchg() which can be used to
implement it. This is copied from Documentation/atomic_t.txt. Then,
implement atomic_inc_unless_negative() itself based on the
raw_atomic_dec_unless_positive() in
include/linux/atomic/atomic-arch-fallback.h.
There's no present need for a highly-optimised version of this (nor any
reason to think this implementation is sub-optimal on x86) so just
implement this with generic C, no x86-specifics.
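A sketch of that generic-C fallback pattern, mirroring the style of the in-kernel raw_atomic fallbacks:
  static inline bool atomic_inc_unless_negative(atomic_t *v)
  {
      int c = atomic_read(v);

      do {
          /* refuse to increment once the value has gone negative */
          if (c < 0)
              return false;
      } while (!atomic_try_cmpxchg(v, &c, c + 1));

      return true;
  }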
SeongJae Park [Thu, 28 Aug 2025 17:12:36 +0000 (10:12 -0700)]
mm/damon/paddr: support addr_unit for MIGRATE_{HOT,COLD}
Add addr_unit support to the DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD
action handling in the DAMOS operations implementation for the physical
address space.
Link: https://lkml.kernel.org/r/20250828171242.59810-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Thu, 28 Aug 2025 17:12:35 +0000 (10:12 -0700)]
mm/damon/paddr: support addr_unit for DAMOS_LRU_[DE]PRIO
Add addr_unit support to the DAMOS_LRU_PRIO and DAMOS_LRU_DEPRIO action
handling in the DAMOS operations implementation for the physical address
space.
Link: https://lkml.kernel.org/r/20250828171242.59810-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: ze zuo <zuoze1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Thu, 28 Aug 2025 17:12:32 +0000 (10:12 -0700)]
mm/damon/core: add damon_ctx->addr_unit
Patch series "mm/damon: support ARM32 with LPAE", v3.
Previously, DAMON's physical address space monitoring only supported
memory ranges below 4GB on LPAE-enabled systems. This was due to the use
of 'unsigned long' in 'struct damon_addr_range', which is 32-bit on ARM32
even with LPAE enabled[1].
To add DAMON support for ARM32 with LPAE enabled, a new core layer
parameter called 'addr_unit' was introduced[2]. The operations set layer
can translate a core layer address to the real address by multiplying the
core layer address by the parameter value. Support for the parameter is
up to each operations set implementation, though. For example, operations
set implementations for virtual address spaces can simply ignore the
parameter. Add the support to paddr, the DAMON operations set
implementation for the physical address space, since we have a clear use
case for it.
This patch (of 11):
In some cases, some of the real addresses handled by the underlying
operations set cannot be handled by DAMON, since it uses only 'unsigned
long' as the address type. Using DAMON for physical address space
monitoring of 32 bit ARM devices with large physical address extension
(LPAE) is one example[1].
Add a parameter named 'addr_unit' to the core layer to help such cases.
DAMON core API callers can set it as the scale factor that the operations
set will use to translate the core layer's addresses into real addresses,
by multiplying the core layer address by the parameter value. Support for
the parameter is up to each operations set layer. The support from the
physical address space operations set (paddr) will be added in the
following commits.
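A sketch of the translation an operations set would perform (the helper is hypothetical; addr_unit is the parameter added here):
  /* translate a core-layer address to a real (physical) address */
  static phys_addr_t damon_pa(unsigned long core_addr, unsigned long addr_unit)
  {
      return (phys_addr_t)core_addr * addr_unit;
  }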
enum pageblock_bits defines the meaning of the pageblock bits. Currently
PB_migratetype_bits says the lowest 3 bits represent the migratetype, and
the definitions of PB_migrate_end/MIGRATETYPE_MASK rely on it via magic
computation.
Remove the definitions of PB_migratetype_bits/PB_migrate_end. Use
PB_migrate_[0|1|2] to represent the lowest bits for the migratetype; then
we can simplify the related definitions.
Also, MIGRATETYPE_AND_ISO_MASK is MIGRATETYPE_MASK plus the isolation
bit; using MIGRATETYPE_MASK in the definition of MIGRATETYPE_AND_ISO_MASK
looks cleaner.
No functional change intended.
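A sketch of the resulting definitions, using the bit names from this message (the isolation bit name and the remaining bits are assumptions):
  enum pageblock_bits {
      PB_migrate_0,
      PB_migrate_1,
      PB_migrate_2,        /* 3 bits encode the migratetype */
      PB_migrate_isolate,  /* assumed name for the isolation bit */
      __NR_PAGEBLOCK_BITS
  };

  #define MIGRATETYPE_MASK \
      (BIT(PB_migrate_0) | BIT(PB_migrate_1) | BIT(PB_migrate_2))
  #define MIGRATETYPE_AND_ISO_MASK \
      (MIGRATETYPE_MASK | BIT(PB_migrate_isolate))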
Link: https://lkml.kernel.org/r/20250827070105.16864-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Boris Burkov [Thu, 21 Aug 2025 21:55:37 +0000 (14:55 -0700)]
btrfs: set AS_KERNEL_FILE on the btree_inode
extent_buffers are global and shared, so their pages should not belong to
any particular cgroup (currently they are charged to whichever cgroup
happens to allocate the extent_buffer).
Btrfs tree operations should not arbitrarily block on cgroup reclaim or
have the shared extent_buffer pages on a cgroup's reclaim lists.
Link: https://lkml.kernel.org/r/2ee99832619a3fdfe80bf4dc9760278662d2d746.1755812945.git.boris@bur.io Signed-off-by: Boris Burkov <boris@bur.io> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: syzbot@syzkaller.appspotmail.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qu Wenruo <wqu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Boris Burkov [Thu, 21 Aug 2025 21:55:36 +0000 (14:55 -0700)]
mm: add vmstat for kernel_file pages
Kernel file pages are tricky to track because they are indistinguishable
from files whose usage is accounted to the root cgroup.
To maintain good accounting, introduce a vmstat counter tracking kernel
file pages.
Confirmed that these work as expected at a high level by mounting a btrfs
using AS_KERNEL_FILE for metadata pages, and seeing the counter rise with
fs usage then go back to a minimal level after drop_caches and finally
down to 0 after unmounting the fs.
Link: https://lkml.kernel.org/r/08ff633e3a005ed5f7691bfd9f58a5df8e474339.1755812945.git.boris@bur.io Signed-off-by: Boris Burkov <boris@bur.io> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: syzbot@syzkaller.appspotmail.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qu Wenruo <wqu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Boris Burkov [Thu, 21 Aug 2025 21:55:35 +0000 (14:55 -0700)]
mm/filemap: add AS_KERNEL_FILE
Patch series "introduce kernel file mapped folios", v4.
Btrfs currently tracks its metadata pages in the page cache, using a fake
inode (fs_info->btree_inode) with offsets corresponding to where the
metadata is stored in the filesystem's full logical address space.
A consequence of this is that when btrfs uses filemap_add_folio(), this
usage is charged to the cgroup of whichever task happens to be running at
the time. These folios don't belong to any particular user cgroup, so I
don't think it makes much sense for them to be charged in that way. Some
negative consequences as a result:
- A task can be holding some important btrfs locks, then need to lookup
some metadata and go into reclaim, extending the duration it holds
that lock for, and unfairly pushing its own reclaim pain onto other
cgroups.
- If that cgroup goes into reclaim, it might reclaim these folios that a
different, non-reclaiming cgroup might need soon. This is naturally
offset by LRU reclaim, but still.
We have two options for how to manage such file pages:
1. charge them to the root cgroup.
2. don't charge them to any cgroup at all.
Option 2 breaks the invariant that every mapped page has a cgroup. That
is workable, but unnecessarily risky. Therefore, go with option 1.
A very similar proposal to use the root cgroup was previously made by Qu,
where he eventually proposed the idea of setting it per address_space.
This makes good sense for the btrfs use case, as the behavior should
apply to all uses of the address_space, not to select allocations. I.e.,
if someone adds another filemap_add_folio() call using btrfs's
btree_inode, we would almost certainly want to account that to the root
cgroup as well.
This patch (of 3):
Add the flag AS_KERNEL_FILE to the address_space to indicate that this
mapping's memory is exempt from the usual memcg accounting.
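A sketch following the existing AS_* flag conventions in pagemap.h (the helper name is an assumption):
  static inline void mapping_set_kernel_file(struct address_space *mapping)
  {
      /* folios in this mapping are exempt from per-task memcg charging */
      set_bit(AS_KERNEL_FILE, &mapping->flags);
  }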
Miaohe Lin [Tue, 26 Aug 2025 03:09:55 +0000 (11:09 +0800)]
Revert "hugetlb: make hugetlb depends on SYSFS or SYSCTL"
Commit f8142cf94d47 ("hugetlb: make hugetlb depends on SYSFS or SYSCTL")
added dependency on SYSFS or SYSCTL but hugetlb can be used without SYSFS
or SYSCTL. So this dependency is wrong and should be removed.
For users with CONFIG_SYSFS or CONFIG_SYSCTL enabled, there should be no
difference. For users with both CONFIG_SYSFS and CONFIG_SYSCTL undefined,
hugetlb still works perfectly well through the cmdline, except for a
possible kismet warning[1] when selecting CONFIG_HUGETLBFS.
IMHO, it might not be worth a backport.
Dev Jain [Tue, 9 Sep 2025 06:15:30 +0000 (11:45 +0530)]
selftests/mm/uffd-stress: tighten the constraint on free hugepages needed before the test
The test requires at least 2 * (bytes/page_size) hugepages, since we
require an identical number of hugepages for the src and dst locations.
Fix this.
Along with the above, as explained in the patch "selftests/mm/uffd-stress:
Make test operate on less hugetlb memory", the racy nature of the test
requires that we have some extra hugepages left beyond what is strictly
required. Therefore, tighten this constraint.
Link: https://lkml.kernel.org/r/20250909061531.57272-3-dev.jain@arm.com Fixes: 5a6aa60d1823 ("selftests/mm: skip uffd hugetlb tests with insufficient hugepages") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Tue, 9 Sep 2025 06:15:29 +0000 (11:45 +0530)]
selftests/mm/uffd-stress: make test operate on less hugetlb memory
Patch series "selftests/mm: uffd-stress fixes", v2.
This patchset ensures that the number of hugepages is correctly set in the
system so that the uffd-stress test does not fail due to the racy nature
of the test. Patch 1 changes the hugepage constraint in the
run_vmtests.sh script, whereas patch 2 changes the constraint in the test
itself.
This patch (of 2):
We observed uffd-stress selftest failures on arm64 and intermittent
failures on x86 too.
For this particular case, the number of free hugepages from run_vmtests.sh
will be 128, and the test will allocate 64 hugepages in the source
location. The stress() function will start spawning threads which will
operate on the destination location, triggering uffd-operations like
UFFDIO_COPY from src to dst, which means that we will require 64 more
hugepages for the dst location.
Let us observe the locking_thread() function. It will lock the mutex kept
at dst, triggering uffd-copy. Suppose that 127 (64 for src and 63 for
dst) hugepages have been reserved. In case of BOUNCE_RANDOM, it may
happen that two threads trying to lock the mutex at dst do so at the same
hugepage number. If one thread succeeds in reserving the last hugepage,
then the other thread may fail in alloc_hugetlb_folio(), returning
-ENOMEM. I confirmed that this is indeed the case with a hacky debugging
patch.
This code path gets triggered, indicating that the PMD at which one
thread is trying to map a hugepage gets filled by a racing thread.
Therefore, instead of using freepgs to compute the amount of memory, use
freepgs - (min(32, nr_cpus) - 1), so that the test still has some extra
hugepages to use. The adjustment is a function of min(32, nr_cpus) - the
value of nr_parallel in the test - because in the worst case, nr_parallel
number of threads will try to map a hugepage on the same PMD, one will win
the allocation race, and the other nr_parallel - 1 threads will fail, so
we need extra nr_parallel - 1 hugepages to satisfy this request. Note
that, in case the adjusted value underflows, there is a check in the test
itself for the number of free hugepages, which will fail:
get_free_hugepages() < bytes / page_size. A negative value will be passed
on to bytes, which is of type size_t, so the RHS becomes a large value and
the check fails; thus we are safe.
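The adjustment, expressed as arithmetic (variable names assumed; the real computation lives in run_vmtests.sh):
  static unsigned long usable_hugepages(unsigned long freepgs,
                                        unsigned long nr_cpus)
  {
      /* nr_parallel mirrors the test's min(32, nr_cpus) thread count */
      unsigned long nr_parallel = nr_cpus < 32 ? nr_cpus : 32;

      /* leave room for the worst-case nr_parallel - 1 losing racers */
      return freepgs - (nr_parallel - 1);
  }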
Link: https://lkml.kernel.org/r/20250909061531.57272-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250909061531.57272-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Tue, 26 Aug 2025 14:06:54 +0000 (14:06 +0000)]
mm/page_alloc: harmonize should_compact_retry() type
Currently order is signed in one version of the function and unsigned in
the other. Tidy that up.
In page_alloc.c, order is unsigned in the vast majority of cases. But,
there is a cluster of exceptions in compaction-related code (probably
stemming from the fact that compact_control.order is signed). So, prefer
local consistency and make this one signed too.
Link: https://lkml.kernel.org/r/20250826-cleanup-should_compact_retry-v1-1-d2ca89727fcf@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Tue, 26 Aug 2025 15:13:44 +0000 (15:13 +0000)]
maple_tree: fix MAPLE_PARENT_RANGE32 and parent pointer docs
MAPLE_PARENT_RANGE32 should be 0x02 as a 32 bit node is indicated by the
bit pattern 0b010 which is the hex value 0x02. There are no users
currently, so there is no associated bug with this wrong value.
Fix typo Note -> Node and replace x with b to indicate binary values.
Link: https://lkml.kernel.org/r/20250826151344.403286-1-sidhartha.kumar@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pratyush Yadav [Tue, 26 Aug 2025 12:38:16 +0000 (14:38 +0200)]
kho: make sure kho_scratch argument is fully consumed
When specifying fixed-size scratch areas, the parser only parses the
three scratch sizes and ignores the rest of the argument. This means the
argument can have arbitrary bogus trailing characters.
For example, "kho_scratch=256M,512M,512Mfoobar" parses successfully.
It is generally a good idea to parse arguments as strictly as possible.
In addition, if bogus trailing characters are allowed in the kho_scratch
argument, it is possible that some people might end up using them and
later extensions to the argument format will cause unexpected breakages.
Make sure the argument is fully consumed after all three scratch sizes are
parsed. With this change, the bogus argument
"kho_scratch=256M,512M,512Mfoobar" results in:
[ 0.000000] Malformed early option 'kho_scratch'
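A sketch of the strict parse (memparse() is the usual helper for size suffixes; the names and structure here are assumptions, not the literal code):
  static int __init kho_parse_scratch(const char *arg)
  {
      char *p = (char *)arg;
      unsigned long long size[3];
      int i;

      for (i = 0; i < 3; i++) {
          size[i] = memparse(p, &p);	/* sizes would be stored here */
          if (i < 2 && *p++ != ',')
              return -EINVAL;
      }

      /* new: the argument must be fully consumed */
      if (*p != '\0')
          return -EINVAL;    /* rejects "512Mfoobar"-style trailers */

      return 0;
  }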
Link: https://lkml.kernel.org/r/20250826123817.64681-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wander Lairson Costa [Mon, 25 Aug 2025 12:59:26 +0000 (09:59 -0300)]
kmem/tracing: add kmem name to kmem_cache_alloc tracepoint
The kmem_cache_free tracepoint includes a "name" field, which allows for
easy identification and filtering of specific kmem caches. However, the
kmem_cache_alloc tracepoint lacks this field, making it difficult to pair
corresponding alloc and free events for analysis.
Add the "name" field to kmem_cache_alloc to enable consistent tracking and
correlation of kmem alloc and free events.
Link: https://lkml.kernel.org/r/20250825125927.59816-1-wander@redhat.com Signed-off-by: Wander Lairson Costa <wander@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 25 Aug 2025 16:37:21 +0000 (00:37 +0800)]
mm/page-writeback: drop usage of folio_index
folio_index is only needed for mixed usage of the page cache and swap
cache. The remaining three callers in page-writeback are for page cache
tag marking. The swap cache space doesn't use tags (it explicitly sets
mapping_set_no_writeback_tags), so use folio->index directly here.
I Viswanath [Mon, 25 Aug 2025 17:06:43 +0000 (22:36 +0530)]
selftests/mm: use calloc instead of malloc in pagemap_ioctl.c
As per Documentation/process/deprecated.rst, dynamic size calculations
should not be performed in memory allocator arguments due to possible
overflows.
Replace malloc with calloc to avoid open-ended arithmetic and prevent
possible overflows.
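The pattern, in general form (names hypothetical):
  /* before: open-ended multiplication in the allocator argument can overflow */
  ptr = malloc(nmemb * size);

  /* after: calloc checks the nmemb * size multiplication for overflow */
  ptr = calloc(nmemb, size);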
Link: https://lkml.kernel.org/r/20250825170643.63174-1-viswanathiyyappan@gmail.com Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Hildenbrand <david@redhat.com>
Reviewed by: Donet Tom <donettom@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Donet Tom [Fri, 22 Aug 2025 08:48:45 +0000 (14:18 +0530)]
drivers/base/node: handle error properly in register_one_node()
If register_node() returns an error, it is not handled correctly.
The function will proceed further and try to register CPUs under the
node, which is not correct.
So, in this patch, if register_node() returns an error, we return
immediately from the function.
Link: https://lkml.kernel.org/r/20250822084845.19219-1-donettom@linux.ibm.com Fixes: 76b67ed9dce6 ("[PATCH] node hotplug: register cpu: remove node struct") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Fri, 22 Aug 2025 02:57:32 +0000 (02:57 +0000)]
mm/khugepaged: use list_xxx() helper to improve readability
In general, khugepaged_scan_mm_slot() iterates the khugepaged_scan.mm_head
list to get a mm_struct to collapse memory for.
Using the list_xxx() helpers makes the list iteration operation more
obvious.
No functional change.
Link: https://lkml.kernel.org/r/20250822025732.9025-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bala-Vignesh-Reddy [Thu, 21 Aug 2025 10:11:59 +0000 (15:41 +0530)]
selftests: centralise maybe-unused definition in kselftest.h
Several selftests subdirectories duplicated the __maybe_unused define,
leading to redundant code. Move it to the kselftest.h header and remove
the other definitions.
This addresses the duplication noted in the proc-pid-vm warning fix.
Link: https://lkml.kernel.org/r/20250821101159.2238-1-reddybalavignesh9979@gmail.com Signed-off-by: Bala-Vignesh-Reddy <reddybalavignesh9979@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Link:https://lore.kernel.org/lkml/20250820143954.33d95635e504e94df01930d0@linux-foundation.org/ Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Mickal Salan <mic@digikod.net> [landlock] Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Usama Arif [Thu, 21 Aug 2025 15:00:38 +0000 (16:00 +0100)]
mm/huge_memory: remove enforce_sysfs from __thp_vma_allowable_orders
Using forced_collapse directly is clearer and enforce_sysfs is not really
needed.
Link: https://lkml.kernel.org/r/20250821150038.2025521-1-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Thu, 21 Aug 2025 13:29:47 +0000 (13:29 +0000)]
mm: remove is_migrate_highatomic()
There are 3 potential reasons for is_migrate_*() helpers:
1. They represent higher-level attributes of migratetypes, like
is_migrate_movable()
2. They are ifdef'd, like is_migrate_isolate().
3. For consistency with an is_migrate_*_page() helper, also like
is_migrate_isolate().
It looks like is_migrate_highatomic() was for case 3, but that was
removed in commit e0932b6c1f94 ("mm: page_alloc: consolidate free page
accounting").
So remove the indirection and go back to a simple comparison.
Link: https://lkml.kernel.org/r/20250821-is-migrate-highatomic-v1-1-ddb6e5d7c566@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nhat Pham [Wed, 20 Aug 2025 18:15:47 +0000 (11:15 -0700)]
mm/zswap: reduce the size of the compression buffer to a single page
Reduce the compression buffer size from 2 * PAGE_SIZE to only one page, as
the compression output (in the success case) should not exceed the length
of the input.
In the past, Chengming tried to reduce the compression buffer size, but
ran into issues with the LZO algorithm (see [2]). Herbert Xu reported
that the issue has been fixed (see [3]). Now we should have the guarantee
that compressors' output should not exceed one page in the success case,
and the algorithm will just report failure otherwise.
With this patch, we save one page per cpu (per compression algorithm).
Currently, we have no way to distinguish a kernel stack page from an
unidentified page. Being able to track this information can be beneficial
for optimizing kernel memory usage (e.g., analyzing fragmentation,
location, etc.). Knowing a page is being used for a kernel stack gives us
more insight into pages that are certainly immovable and important to
kernel functionality.
Add a new pagetype, and tag pages alongside the kernel stack accounting.
Also, ensure the type is dumped to /proc/kpageflags and the page-types
tool can find it.
Link: https://lkml.kernel.org/r/20250820202029.1909925-1-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Steven Rostedt [Thu, 12 Jun 2025 14:03:13 +0000 (10:03 -0400)]
mm, x86/mm: move creating the tlb_flush event back to x86 code
Commit e73ad5ff2f76 ("mm, x86/mm: Make the batched unmap TLB flush API
more generic") moved the trace_tlb_flush out of mm/rmap.c and back into
x86 specific architecture, but it kept the include to the events/tlb.h
file, even though it didn't use that event.
Then another commit came in, added more events to the mm/rmap.c file, and
moved the #define CREATE_TRACE_POINTS from the x86-specific code to the
generic mm/rmap.c file to create both the tlb_flush tracepoint and the new
tracepoints.
But since the tlb_flush tracepoint is x86-specific only, this now creates
that tracepoint for all other architectures as well, wasting approximately
5K of text and metadata that will not be used.
Remove the events/tlb.h include from mm/rmap.c and add the
CREATE_TRACE_POINTS define back to the x86 code.
Link: https://lkml.kernel.org/r/20250612100313.3b9a8b80@batman.local.home Fixes: e73ad5ff2f76 ("mm, x86/mm: Make the batched unmap TLB flush API more generic") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Mon, 18 Aug 2025 06:10:09 +0000 (08:10 +0200)]
bcachefs: stop using write_cache_pages
Stop using the obsolete write_cache_pages and use writeback_iter directly.
This basically just open-codes write_cache_pages without the indirect
call, but there are probably ways to structure the code even more nicely
as a follow-on.
Link: https://lkml.kernel.org/r/20250818061017.1526853-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Mon, 18 Aug 2025 06:10:10 +0000 (08:10 +0200)]
mm: remove write_cache_pages
No users left.
Link: https://lkml.kernel.org/r/20250818061017.1526853-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Tue, 19 Aug 2025 08:00:47 +0000 (08:00 +0000)]
selftests/mm: test that rmap behaves as expected
As David suggested, we currently don't have a high level test case to
verify the behavior of rmap. This patch introduces verification of rmap
via migration.
The general idea is that if we migrate one shared page between processes,
this should be reflected in all related processes. Otherwise, we have a
problem in rmap.
Link: https://lkml.kernel.org/r/20250819080047.10063-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Tue, 19 Aug 2025 08:00:46 +0000 (08:00 +0000)]
selftests/mm: put general ksm operation into vm_util
Patch series "test that rmap behaves as expected", v4.
As David suggested, we currently don't have a high level test case to
verify the behavior of rmap. This patch set introduces verification of
rmap via migration.
Patch 1 is a preparation to move ksm related operations into vm_util.
Patch 2 is the new test case for rmap.
Baokun Li [Tue, 19 Aug 2025 06:18:03 +0000 (14:18 +0800)]
tmpfs: preserve SB_I_VERSION on remount
Now tmpfs enables i_version by default and does not modify it. But
SB_I_VERSION can also be modified via sb_flags, and reconfigure_super()
always overwrites the existing flags with the latest ones. This means
that if tmpfs is remounted without specifying iversion, the default
i_version will be unexpectedly disabled.
To ensure iversion remains enabled, SB_I_VERSION is now always set in
fc->sb_flags in shmem_init_fs_context(), instead of in sb->s_flags in
shmem_fill_super().
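The gist of the fix (shmem_init_fs_context() is the real function; its body is reduced here to the relevant line):
  static int shmem_init_fs_context(struct fs_context *fc)
  {
      /* set on the fs_context so reconfigure_super() preserves it */
      fc->sb_flags |= SB_I_VERSION;

      /* ... the existing initialisation is unchanged ... */
      return 0;
  }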
Link: https://lkml.kernel.org/r/20250819061803.1496443-1-libaokun@huaweicloud.com Fixes: 36f05cab0a2c ("tmpfs: add support for an i_version counter") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Mon, 18 Aug 2025 18:46:22 +0000 (14:46 -0400)]
selftests/mm: check after-split folio orders in split_huge_page_test
Instead of just checking the existence of PMD folios before and after the
folio split tests, use check_folio_orders() to check the after-split folio
orders.
The split ranges in split_thp_in_pagecache_to_order_at() are changed to
[addr, addr + pagesize) for every pmd_pagesize. This prevents folios
within the range from being split multiple times, since the debugfs split
function always performs splits with a pagesize step over a given range.
The following tests are not changed:
1. split_pte_mapped_thp: the test already uses kpageflags to check;
2. split_file_backed_thp: no vaddr available.
Link: https://lkml.kernel.org/r/20250818184622.1521620-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: wang lian <lianux.mm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The helper gathers folio order statistics for the folios within a virtual
address range and checks them against a given order list. It aims to
provide a more precise folio order check instead of just checking the
existence of PMD folios.
The helper will be used in the upcoming commit.
Link: https://lkml.kernel.org/r/20250818184622.1521620-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: wang lian <lianux.mm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Mon, 18 Aug 2025 18:46:20 +0000 (14:46 -0400)]
selftests/mm: reimplement is_backed_by_thp() with more precise check
and rename it to is_backed_by_folio().
is_backed_by_folio() checks whether the given vaddr is backed by a folio
of a given order. It does so by:
1. getting the pfn of the vaddr;
2. checking kpageflags of the pfn;
if order is greater than 0:
3. checking kpageflags of the head pfn;
4. checking kpageflags of all tail pfns.
pmd_order is added to split_huge_page_test.c and replaces max_order.
Link: https://lkml.kernel.org/r/20250818184622.1521620-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: wang lian <lianux.mm@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Mon, 18 Aug 2025 18:46:19 +0000 (14:46 -0400)]
selftests/mm: mark all functions static in split_huge_page_test.c
All functions are only used within the file.
Link: https://lkml.kernel.org/r/20250818184622.1521620-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: wang lian <lianux.mm@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Mon, 18 Aug 2025 18:46:18 +0000 (14:46 -0400)]
mm/huge_memory: add new_order and offset to split_huge_pages*() pr_debug
Patch series "Better split_huge_page_test result check", v5.
This patchset uses kpageflags to get after-split folio orders for a better
split_huge_page_test result check[1]. The added
gather_after_split_folio_orders() scans through a VPN range and collects
the numbers of folios at different orders.
check_after_split_folio_orders() compares the result of
gather_after_split_folio_orders() to a given list of numbers of different
orders.
This patchset also adds the new order and the in-folio offset to the split
huge page debugfs interface's pr_debug() output.
This patch (of 5):
The new order and in-folio offset are useful information when debugging
split huge page tests.
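A hedged sketch of the added context (the exact format string is an
assumption):

	pr_debug("split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx], new_order: %u, in_folio_offset: %ld\n",
		 pid, vaddr_start, vaddr_end, new_order, in_folio_offset);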
Link: https://lkml.kernel.org/r/20250818184622.1521620-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250818184622.1521620-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: wang lian <lianux.mm@gmail.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li RongQing [Thu, 14 Aug 2025 10:23:33 +0000 (18:23 +0800)]
mm/hugetlb: early exit from hugetlb_pages_alloc_boot() when max_huge_pages=0
Optimize hugetlb_pages_alloc_boot() to return immediately when
max_huge_pages is 0, avoiding unnecessary CPU cycles and the below log
message when hugepages aren't configured in the kernel command line.
[ 3.702280] HugeTLB: allocation took 0ms with hugepage_allocation_threads=32
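A minimal sketch of the early exit, assuming the current shape of the
function in mm/hugetlb.c:

	static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
	{
		/* no hugepages= on the command line: nothing to allocate, nothing to log */
		if (!h->max_huge_pages)
			return 0;
		/* ... existing multi-threaded boot-time allocation ... */
	}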
Link: https://lkml.kernel.org/r/20250814102333.4428-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chi Zhiling [Mon, 28 Jul 2025 08:39:52 +0000 (16:39 +0800)]
mm/filemap: skip non-uptodate folio if there are available folios
When reading data exceeding the maximum IO size, the operation is split
into multiple IO requests, but the data isn't immediately copied to
userspace after each IO completion.
For example, when reading 2560k data from a device with 1280k maximum IO
size, the following sequence occurs:
1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. wait the next 1280k
5. copy 8 pages to user buffer
6. copy 20 folios (64k each) to user buffer
The 8 pages in step 5 are copied only after the second 1280k completes
(step 4), due to waiting for a non-uptodate folio in filemap_update_page().
We can copy those 8 pages before the second 1280k completes to reduce the
latency of this read operation.
After applying the patch, these 8 pages will be copied before the next IO
completes:
1. read 1280k
2. copy 41 pages and issue read ahead for next 1280k
3. copy 31 pages to user buffer
4. copy 8 pages to user buffer
5. wait the next 1280k
6. copy 20 folios (64k each) to user buffer
This patch drops one setting of IOCB_NOWAIT for AIO, which is fine because
filemap_read() will set it again for AIO.
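A hedged sketch of the idea, not the exact diff: in filemap_get_pages(),
when the last folio in the batch is not uptodate but earlier folios are
ready, trim it off and let the caller copy the ready folios instead of
waiting:

	folio = fbatch->folios[folio_batch_count(fbatch) - 1];
	if (!folio_test_uptodate(folio)) {
		if (folio_batch_count(fbatch) > 1) {
			fbatch->nr--;	/* hand back the ready folios first */
			goto out;
		}
		err = filemap_update_page(iocb, mapping, count, folio,
					  need_uptodate);
	}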
Chi Zhiling [Mon, 28 Jul 2025 08:39:51 +0000 (16:39 +0800)]
mm/filemap: do not use is_partially_uptodate for entire folio
Patch series "Tiny optimization for large read operations".
This series contains two patches:
1. Skip calling is_partially_uptodate() for an entire folio to save time.
I have reviewed the mpage and iomap implementations and didn't spot any
issues, but this change likely needs more thorough review.
2. Skip calling filemap_update_page() if there are ready folios in the
batch. This might save a few milliseconds in practice, but I didn't
observe measurable improvements in my tests.
This patch (of 2):
When a folio is not marked uptodate, it is treated as containing some
non-uptodate data. Therefore, calling is_partially_uptodate() to recheck
the entire folio is redundant.
If all data in a folio is actually up-to-date but the folio lacks the
uptodate flag, it will still be treated as non-uptodate in many other
places. Thus, there should be no special-case handling for filemap.
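A hedged sketch of the check, assuming it lands in filemap_range_uptodate():
only consult ->is_partially_uptodate() when the read does not cover the
whole folio; a fully covered folio without the uptodate flag is simply
treated as not uptodate:

	if (!mapping->a_ops->is_partially_uptodate)
		return false;
	/* a range covering the entire folio has nothing "partial" to check */
	if (pos <= folio_pos(folio) &&
	    pos + count >= folio_pos(folio) + folio_size(folio))
		return false;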
Wei Yang [Sun, 17 Aug 2025 03:26:46 +0000 (03:26 +0000)]
mm/rmap: not necessary to mask off FOLIO_PAGES_MAPPED
At this point, we are in an if branch conditional on (nr <
ENTIRELY_MAPPED), and FOLIO_PAGES_MAPPED is equal to (ENTIRELY_MAPPED -
1). This means the upper bits of nr are already clear, so it is not
necessary to mask them off.
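A short illustration of why the mask is a no-op (the concrete value is
from mm/internal.h, shown for illustration):

	#define ENTIRELY_MAPPED		0x800000
	#define FOLIO_PAGES_MAPPED	(ENTIRELY_MAPPED - 1)	/* 0x7fffff */

	/* inside the branch: nr < 0x800000, i.e. nr uses only the low
	 * 23 bits, so (nr & FOLIO_PAGES_MAPPED) == nr */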
Link: https://lkml.kernel.org/r/20250817032647.29147-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
liuqiqi [Tue, 12 Aug 2025 07:02:10 +0000 (15:02 +0800)]
mm: fix duplicate accounting of free pages in should_reclaim_retry()
In the zone_reclaimable_pages() function, if the page counts for
NR_ZONE_INACTIVE_FILE, NR_ZONE_ACTIVE_FILE, NR_ZONE_INACTIVE_ANON, and
NR_ZONE_ACTIVE_ANON are all zero, the function returns the number of free
pages as the result.
In this case, when should_reclaim_retry() calculates reclaimable pages, it
will inadvertently double-count the free pages in its accounting.
	static inline bool
	should_reclaim_retry(gfp_t gfp_mask, unsigned order,
			     struct alloc_context *ac, int alloc_flags,
			     bool did_some_progress, int *no_progress_loops)
	{
		...
		available = reclaimable = zone_reclaimable_pages(zone);
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
This may result in an increase in the number of retries of
__alloc_pages_slowpath(), causing increased kswapd load.
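An illustration of the double count, assuming all four LRU counters are
zero and the zone has nr_free free pages:

	available = reclaimable = zone_reclaimable_pages(zone);
						/* falls back to nr_free */
	available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
						/* now 2 * nr_free       */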
Link: https://lkml.kernel.org/r/20250812070210.1624218-1-liuqiqi@kylinos.cn Fixes: 6aaced5abd32 ("mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()") Signed-off-by: liuqiqi <liuqiqi@kylinos.cn> Reviewed-by: Ye Liu <liuye@kylinos.cn> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:23:01 +0000 (18:23 +0100)]
mm: add folio_is_pci_p2pdma()
Reimplement is_pci_p2pdma_page() in terms of folio_is_pci_p2pdma(). This
moves the page_folio() call from inside page_pgmap() to is_pci_p2pdma_page().
This removes a page_folio() call from try_grab_folio() which already has a
folio and can pass it in.
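A hedged sketch of the pattern (folio->pgmap and the exact checks are
assumptions based on the description above); the fsdax, device-coherent
and device-private entries below follow the same pattern:

	static inline bool folio_is_pci_p2pdma(const struct folio *folio)
	{
		return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
		       folio_is_zone_device(folio) &&
		       folio->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
	}

	static inline bool is_pci_p2pdma_page(const struct page *page)
	{
		/* the page->folio conversion now happens here, not in page_pgmap() */
		return folio_is_pci_p2pdma(page_folio(page));
	}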
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:23:00 +0000 (18:23 +0100)]
mm: reimplement folio_is_fsdax()
For callers of folio_is_fsdax(), this saves a folio->page->folio conversion.
For callers of is_fsdax_page(), the page->folio conversion simply moves from
the implementation of page_pgmap() into is_fsdax_page().
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:22:59 +0000 (18:22 +0100)]
mm: reimplement folio_is_device_coherent()
For callers of folio_is_device_coherent(), this saves a folio->page->folio
conversion. For callers of is_device_coherent_page(), the page->folio
conversion simply moves from the implementation of page_pgmap() into
is_device_coherent_page().
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:22:58 +0000 (18:22 +0100)]
mm: reimplement folio_is_device_private()
For callers of folio_is_device_private(), this saves a folio->page->folio
conversion. For callers of is_device_private_page(), the page->folio
conversion simply moves from the implementation of page_pgmap() into
is_device_private_page().
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:22:57 +0000 (18:22 +0100)]
mm: introduce memdesc_is_zone_device()
Remove the conversion from folio to page in folio_is_zone_device() by
introducing memdesc_is_zone_device() which takes a memdesc_flags_t from
either a page or a folio.
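A hedged sketch, assuming the zone number is encoded in the upper flag
bits as usual:

	static inline bool memdesc_is_zone_device(memdesc_flags_t mdf)
	{
	#ifdef CONFIG_ZONE_DEVICE
		return ((mdf.f >> ZONES_PGSHIFT) & ZONES_MASK) == ZONE_DEVICE;
	#else
		return false;
	#endif
	}

	static inline bool folio_is_zone_device(const struct folio *folio)
	{
		return memdesc_is_zone_device(folio->flags);	/* no page conversion */
	}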
Nicola Vetrini [Mon, 25 Aug 2025 21:42:45 +0000 (23:42 +0200)]
mips: fix compilation error
The following build error occurs on mips build configurations
(32r2el_defconfig and similar ones):
	./arch/mips/include/asm/cacheflush.h:42:34: error: passing argument 2 of `set_bit'
	    from incompatible pointer type [-Werror=incompatible-pointer-types]
	   42 |         set_bit(PG_dcache_dirty, &(folio)->flags)
	      |                                  ^~~~~~~~~~~~~~~
	      |                                  |
	      |                                  memdesc_flags_t *
This is due to changes introduced by
commit 30f45bf18d55 ("mm: introduce memdesc_flags_t"), which did not update
these usage sites.
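The fix, sketched (the macro is the one at the error site in
arch/mips/include/asm/cacheflush.h; the one-line change follows the
series' flags.f convention and is an assumption):

	#define folio_set_dcache_dirty(folio)	\
		set_bit(PG_dcache_dirty, &(folio)->flags.f)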
Matthew Wilcox (Oracle) [Mon, 18 Aug 2025 12:42:22 +0000 (08:42 -0400)]
mm-introduce-memdesc_flags_t-fix
s/flags/flags.f/ in several architectures
Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Tue, 5 Aug 2025 17:22:51 +0000 (18:22 +0100)]
mm: introduce memdesc_flags_t
Patch series "Add and use memdesc_flags_t".
At some point struct page will be separated from struct slab and struct
folio. This is a step towards that by introducing a type for the 'flags'
word of all three structures. This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits. That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.
There are going to be some unusual merge problems with this, as some odd
bits of the kernel decide they want to print out the flags value or
something similar by writing page->flags, and now they'll need to write
page->flags.f instead. That's most of the churn here. Maybe we should
remove these things from the debug output?
This patch (of 11):
Wrap the unsigned long flags in a typedef. In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.
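A minimal sketch of the typedef described above:

	typedef struct {
		unsigned long f;
	} memdesc_flags_t;

Call sites that used to write page->flags now write page->flags.f, which
is exactly the churn mentioned above.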
Enze Li [Fri, 15 Aug 2025 09:21:10 +0000 (17:21 +0800)]
mm/damon/Kconfig: make DAMON_STAT_ENABLED_DEFAULT depend on DAMON_STAT
The DAMON_STAT_ENABLED_DEFAULT option is strongly tied to the DAMON_STAT
option -- enabling it alone is meaningless. This patch makes
DAMON_STAT_ENABLED_DEFAULT depend on DAMON_STAT, ensuring functional
consistency.
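A hedged sketch of the resulting Kconfig stanza (the surrounding option
text is an assumption; only the depends line is new):

	config DAMON_STAT_ENABLED_DEFAULT
		bool "Enable DAMON_STAT by default"
		depends on DAMON_STAT
		default DAMON_STAT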
Link: https://lkml.kernel.org/r/20250815092110.811757-1-lienze@kylinos.cn Fixes: 369c415e6073 ("mm/damon: introduce DAMON_STAT module") Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>