From f6a09e6800936c6c9ba5667ac3efc18feb8f3a2f Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Fri, 14 Mar 2025 15:37:57 -0600 Subject: [PATCH 01/16] xen: balloon: update the NR_BALLOON_PAGES state Update the NR_BALLOON_PAGES counter when pages are added to or removed from the Xen balloon. Link: https://lkml.kernel.org/r/20250314213757.244258-5-npache@redhat.com Signed-off-by: Nico Pache Reviewed-by: Juergen Gross Cc: Alexander Atanasov Cc: Chengming Zhou Cc: David Hildenbrand Cc: Dexuan Cui Cc: Haiyang Zhang Cc: Johannes Weiner Cc: Juegren Gross Cc: Kanchana P Sridhar Cc: K. Y. Srinivasan Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Muchun Song Cc: Nhat Pham Cc: Oleksandr Tyshchenko Cc: Roman Gushchin Cc: Shakeel Butt Cc: Stefano Stabellini Cc: Wei Liu Cc: Michael Kelley Signed-off-by: Andrew Morton --- drivers/xen/balloon.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 163f7f1d70f1..65d4e7fa1eb8 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -157,6 +157,8 @@ static void balloon_append(struct page *page) list_add(&page->lru, &ballooned_pages); balloon_stats.balloon_low++; } + inc_node_page_state(page, NR_BALLOON_PAGES); + wake_up(&balloon_wq); } @@ -179,6 +181,8 @@ static struct page *balloon_retrieve(bool require_lowmem) balloon_stats.balloon_low--; __ClearPageOffline(page); + dec_node_page_state(page, NR_BALLOON_PAGES); + return page; } -- 2.51.0 From 9f171d94be80d80067c6445e53b00d098150391e Mon Sep 17 00:00:00 2001 From: Jiwen Qi Date: Sat, 15 Mar 2025 21:13:17 +0000 Subject: [PATCH 02/16] docs/mm: Physical Memory: Populate the "Zones" section Briefly describe what zones are and the fields of struct zone. Link: https://lkml.kernel.org/r/20250315211317.27612-1-jiwen7.qi@gmail.com Signed-off-by: Jiwen Qi Acked-by: Mike Rapoport (Microsoft) Cc: Bagas Sanjaya Cc: Jonathan Corbet Cc: Randy Dunlap Signed-off-by: Andrew Morton --- Documentation/mm/physical_memory.rst | 266 ++++++++++++++++++++++++++- 1 file changed, 264 insertions(+), 2 deletions(-) diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst index 71fd4a6acf42..d3ac106e6b14 100644 --- a/Documentation/mm/physical_memory.rst +++ b/Documentation/mm/physical_memory.rst @@ -338,10 +338,272 @@ Statistics Zones ===== +As we have mentioned, each zone in memory is described by a ``struct zone`` +which is an element of the ``node_zones`` array of the node it belongs to. +``struct zone`` is the core data structure of the page allocator. A zone +represents a range of physical memory and may have holes. + +The page allocator uses the GFP flags, see :ref:`mm-api-gfp-flags`, specified by +a memory allocation to determine the highest zone in a node from which the +memory allocation can allocate memory. The page allocator first allocates memory +from that zone, if the page allocator can't allocate the requested amount of +memory from the zone, it will allocate memory from the next lower zone in the +node, the process continues up to and including the lowest zone. For example, if +a node contains ``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` and the +highest zone of a memory allocation is ``ZONE_MOVABLE``, the order of the zones +from which the page allocator allocates memory is ``ZONE_MOVABLE`` > +``ZONE_NORMAL`` > ``ZONE_DMA32``. + +At runtime, free pages in a zone are in the Per-CPU Pagesets (PCP) or free areas +of the zone. The Per-CPU Pagesets are a vital mechanism in the kernel's memory +management system. 
By handling most frequent allocations and frees locally on +each CPU, the Per-CPU Pagesets improve performance and scalability, especially +on systems with many cores. The page allocator in the kernel employs a two-step +strategy for memory allocation, starting with the Per-CPU Pagesets before +falling back to the buddy allocator. Pages are transferred between the Per-CPU +Pagesets and the global free areas (managed by the buddy allocator) in batches. +This minimizes the overhead of frequent interactions with the global buddy +allocator. + +Architecture specific code calls free_area_init() to initializes zones. + +Zone structure +-------------- +The zones structure ``struct zone`` is defined in ``include/linux/mmzone.h``. +Here we briefly describe fields of this structure: -.. admonition:: Stub +General +~~~~~~~ - This section is incomplete. Please list and describe the appropriate fields. +``_watermark`` + The watermarks for this zone. When the amount of free pages in a zone is below + the min watermark, boosting is ignored, an allocation may trigger direct + reclaim and direct compaction, it is also used to throttle direct reclaim. + When the amount of free pages in a zone is below the low watermark, kswapd is + woken up. When the amount of free pages in a zone is above the high watermark, + kswapd stops reclaiming (a zone is balanced) when the + ``NUMA_BALANCING_MEMORY_TIERING`` bit of ``sysctl_numa_balancing_mode`` is not + set. The promo watermark is used for memory tiering and NUMA balancing. When + the amount of free pages in a zone is above the promo watermark, kswapd stops + reclaiming when the ``NUMA_BALANCING_MEMORY_TIERING`` bit of + ``sysctl_numa_balancing_mode`` is set. The watermarks are set by + ``__setup_per_zone_wmarks()``. The min watermark is calculated according to + ``vm.min_free_kbytes`` sysctl. The other three watermarks are set according + to the distance between two watermarks. The distance itself is calculated + taking ``vm.watermark_scale_factor`` sysctl into account. + +``watermark_boost`` + The number of pages which are used to boost watermarks to increase reclaim + pressure to reduce the likelihood of future fallbacks and wake kswapd now + as the node may be balanced overall and kswapd will not wake naturally. + +``nr_reserved_highatomic`` + The number of pages which are reserved for high-order atomic allocations. + +``nr_free_highatomic`` + The number of free pages in reserved highatomic pageblocks + +``lowmem_reserve`` + The array of the amounts of the memory reserved in this zone for memory + allocations. For example, if the highest zone a memory allocation can + allocate memory from is ``ZONE_MOVABLE``, the amount of memory reserved in + this zone for this allocation is ``lowmem_reserve[ZONE_MOVABLE]`` when + attempting to allocate memory from this zone. This is a mechanism the page + allocator uses to prevent allocations which could use ``highmem`` from using + too much ``lowmem``. For some specialised workloads on ``highmem`` machines, + it is dangerous for the kernel to allow process memory to be allocated from + the ``lowmem`` zone. This is because that memory could then be pinned via the + ``mlock()`` system call, or by unavailability of swapspace. + ``vm.lowmem_reserve_ratio`` sysctl determines how aggressive the kernel is in + defending these lower zones. This array is recalculated by + ``setup_per_zone_lowmem_reserve()`` at runtime if ``vm.lowmem_reserve_ratio`` + sysctl changes. + +``node`` + The index of the node this zone belongs to. 
Available only when + ``CONFIG_NUMA`` is enabled because there is only one zone in a UMA system. + +``zone_pgdat`` + Pointer to the ``struct pglist_data`` of the node this zone belongs to. + +``per_cpu_pageset`` + Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by + ``setup_zone_pageset()``. By handling most frequent allocations and frees + locally on each CPU, PCP improves performance and scalability on systems with + many cores. + +``pageset_high_min`` + Copied to the ``high_min`` of the Per-CPU Pagesets for faster access. + +``pageset_high_max`` + Copied to the ``high_max`` of the Per-CPU Pagesets for faster access. + +``pageset_batch`` + Copied to the ``batch`` of the Per-CPU Pagesets for faster access. The + ``batch``, ``high_min`` and ``high_max`` of the Per-CPU Pagesets are used to + calculate the number of elements the Per-CPU Pagesets obtain from the buddy + allocator under a single hold of the lock for efficiency. They are also used + to decide if the Per-CPU Pagesets return pages to the buddy allocator in page + free process. + +``pageblock_flags`` + The pointer to the flags for the pageblocks in the zone (see + ``include/linux/pageblock-flags.h`` for flags list). The memory is allocated + in ``setup_usemap()``. Each pageblock occupies ``NR_PAGEBLOCK_BITS`` bits. + Defined only when ``CONFIG_FLATMEM`` is enabled. The flags is stored in + ``mem_section`` when ``CONFIG_SPARSEMEM`` is enabled. + +``zone_start_pfn`` + The start pfn of the zone. It is initialized by + ``calculate_node_totalpages()``. + +``managed_pages`` + The present pages managed by the buddy system, which is calculated as: + ``managed_pages`` = ``present_pages`` - ``reserved_pages``, ``reserved_pages`` + includes pages allocated by the memblock allocator. It should be used by page + allocator and vm scanner to calculate all kinds of watermarks and thresholds. + It is accessed using ``atomic_long_xxx()`` functions. It is initialized in + ``free_area_init_core()`` and then is reinitialized when memblock allocator + frees pages into buddy system. + +``spanned_pages`` + The total pages spanned by the zone, including holes, which is calculated as: + ``spanned_pages`` = ``zone_end_pfn`` - ``zone_start_pfn``. It is initialized + by ``calculate_node_totalpages()``. + +``present_pages`` + The physical pages existing within the zone, which is calculated as: + ``present_pages`` = ``spanned_pages`` - ``absent_pages`` (pages in holes). It + may be used by memory hotplug or memory power management logic to figure out + unmanaged pages by checking (``present_pages`` - ``managed_pages``). Write + access to ``present_pages`` at runtime should be protected by + ``mem_hotplug_begin/done()``. Any reader who can't tolerant drift of + ``present_pages`` should use ``get_online_mems()`` to get a stable value. It + is initialized by ``calculate_node_totalpages()``. + +``present_early_pages`` + The present pages existing within the zone located on memory available since + early boot, excluding hotplugged memory. Defined only when + ``CONFIG_MEMORY_HOTPLUG`` is enabled and initialized by + ``calculate_node_totalpages()``. + +``cma_pages`` + The pages reserved for CMA use. These pages behave like ``ZONE_MOVABLE`` when + they are not used for CMA. Defined only when ``CONFIG_CMA`` is enabled. + +``name`` + The name of the zone. It is a pointer to the corresponding element of + the ``zone_names`` array. + +``nr_isolate_pageblock`` + Number of isolated pageblocks. 
It is used to solve incorrect freepage counting + problem due to racy retrieving migratetype of pageblock. Protected by + ``zone->lock``. Defined only when ``CONFIG_MEMORY_ISOLATION`` is enabled. + +``span_seqlock`` + The seqlock to protect ``zone_start_pfn`` and ``spanned_pages``. It is a + seqlock because it has to be read outside of ``zone->lock``, and it is done in + the main allocator path. However, the seqlock is written quite infrequently. + Defined only when ``CONFIG_MEMORY_HOTPLUG`` is enabled. + +``initialized`` + The flag indicating if the zone is initialized. Set by + ``init_currently_empty_zone()`` during boot. + +``free_area`` + The array of free areas, where each element corresponds to a specific order + which is a power of two. The buddy allocator uses this structure to manage + free memory efficiently. When allocating, it tries to find the smallest + sufficient block, if the smallest sufficient block is larger than the + requested size, it will be recursively split into the next smaller blocks + until the required size is reached. When a page is freed, it may be merged + with its buddy to form a larger block. It is initialized by + ``zone_init_free_lists()``. + +``unaccepted_pages`` + The list of pages to be accepted. All pages on the list are ``MAX_PAGE_ORDER``. + Defined only when ``CONFIG_UNACCEPTED_MEMORY`` is enabled. + +``flags`` + The zone flags. The least three bits are used and defined by + ``enum zone_flags``. ``ZONE_BOOSTED_WATERMARK`` (bit 0): zone recently boosted + watermarks. Cleared when kswapd is woken. ``ZONE_RECLAIM_ACTIVE`` (bit 1): + kswapd may be scanning the zone. ``ZONE_BELOW_HIGH`` (bit 2): zone is below + high watermark. + +``lock`` + The main lock that protects the internal data structures of the page allocator + specific to the zone, especially protects ``free_area``. + +``percpu_drift_mark`` + When free pages are below this point, additional steps are taken when reading + the number of free pages to avoid per-cpu counter drift allowing watermarks + to be breached. It is updated in ``refresh_zone_stat_thresholds()``. + +Compaction control +~~~~~~~~~~~~~~~~~~ + +``compact_cached_free_pfn`` + The PFN where compaction free scanner should start in the next scan. + +``compact_cached_migrate_pfn`` + The PFNs where compaction migration scanner should start in the next scan. + This array has two elements: the first one is used in ``MIGRATE_ASYNC`` mode, + and the other one is used in ``MIGRATE_SYNC`` mode. + +``compact_init_migrate_pfn`` + The initial migration PFN which is initialized to 0 at boot time, and to the + first pageblock with migratable pages in the zone after a full compaction + finishes. It is used to check if a scan is a whole zone scan or not. + +``compact_init_free_pfn`` + The initial free PFN which is initialized to 0 at boot time and to the last + pageblock with free ``MIGRATE_MOVABLE`` pages in the zone. It is used to check + if it is the start of a scan. + +``compact_considered`` + The number of compactions attempted since last failure. It is reset in + ``defer_compaction()`` when a compaction fails to result in a page allocation + success. It is increased by 1 in ``compaction_deferred()`` when a compaction + should be skipped. ``compaction_deferred()`` is called before + ``compact_zone()`` is called, ``compaction_defer_reset()`` is called when + ``compact_zone()`` returns ``COMPACT_SUCCESS``, ``defer_compaction()`` is + called when ``compact_zone()`` returns ``COMPACT_PARTIAL_SKIPPED`` or + ``COMPACT_COMPLETE``. 
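+
+  A rough sketch of this bookkeeping (an illustration only; the helper
+  name is invented and this is not the exact kernel code) looks like::
+
+      /* Skip compaction while the post-failure backoff window is open. */
+      static bool zone_compaction_deferred(struct zone *zone)
+      {
+              unsigned long limit = 1UL << zone->compact_defer_shift;
+
+              return ++zone->compact_considered < limit;
+      }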
+ +``compact_defer_shift`` + The number of compactions skipped before trying again is + ``1< Date: Mon, 17 Mar 2025 17:36:14 +0100 Subject: [PATCH 03/16] fork: use __vmalloc_node() for stack allocation Replace __vmalloc_node_range() by __vmalloc_node(). The last variant requires less parameters and it uses exactly the same arguments which are partly now hidden inside __vmalloc_node(). This change does not change any functionality. It makes the code a bit simpler. Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Acked-by: Michal Hocko Cc: Christian Brauner Cc: Oleg Nesterov Cc: Tejun Heo Signed-off-by: Andrew Morton --- kernel/fork.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index f9cf0f056eb6..83cb82643817 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -311,11 +311,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node) * so memcg accounting is performed manually on assigning/releasing * stacks to tasks. Drop __GFP_ACCOUNT. */ - stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, - VMALLOC_START, VMALLOC_END, + stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP & ~__GFP_ACCOUNT, - PAGE_KERNEL, - 0, node, __builtin_return_address(0)); + node, __builtin_return_address(0)); if (!stack) return -ENOMEM; -- 2.51.0 From d8a866c766ebe34b4371d42f4a3edca200f5e645 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Mon, 17 Mar 2025 10:20:34 +0000 Subject: [PATCH 04/16] selftests/mm: add commentary about 9pfs bugs As discussed here: https://lore.kernel.org/lkml/Z9RRkL1hom48z3Tt@google.com/ This code could benefit from some more commentary. To avoid needing to comment the same thing in multiple places (I guess more of these SKIPs will need to be added over time, for now I am only like 20% of the way through Project Run run_vmtests.sh Successfully), add a dummy "skip tests for this specific reason" function that basically just serves as a hook to hang comments on. Link: https://lkml.kernel.org/r/20250317-9pfs-comments-v1-1-9ac96043e146@google.com Signed-off-by: Brendan Jackman Cc: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/gup_longterm.c | 6 +----- tools/testing/selftests/mm/map_populate.c | 8 +++----- tools/testing/selftests/mm/vm_util.h | 18 ++++++++++++++++++ 3 files changed, 22 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c index 03271442aae5..21595b20bbc3 100644 --- a/tools/testing/selftests/mm/gup_longterm.c +++ b/tools/testing/selftests/mm/gup_longterm.c @@ -97,11 +97,7 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared) if (ftruncate(fd, size)) { if (errno == ENOENT) { - /* - * This can happen if the file has been unlinked and the - * filesystem doesn't support truncating unlinked files. 
- */ - ksft_test_result_skip("ftruncate() failed with ENOENT\n"); + skip_test_dodgy_fs("ftruncate()"); } else { ksft_test_result_fail("ftruncate() failed (%s)\n", strerror(errno)); } diff --git a/tools/testing/selftests/mm/map_populate.c b/tools/testing/selftests/mm/map_populate.c index 433e54fb634f..9df2636c829b 100644 --- a/tools/testing/selftests/mm/map_populate.c +++ b/tools/testing/selftests/mm/map_populate.c @@ -18,6 +18,8 @@ #include #include "../kselftest.h" +#include "vm_util.h" + #define MMAP_SZ 4096 #define BUG_ON(condition, description) \ @@ -88,11 +90,7 @@ int main(int argc, char **argv) ret = ftruncate(fileno(ftmp), MMAP_SZ); if (ret < 0 && errno == ENOENT) { - /* - * This probably means tmpfile() made a file on a filesystem - * that doesn't handle temporary files the way we want. - */ - ksft_exit_skip("ftruncate(fileno(tmpfile())) gave ENOENT, weird filesystem?\n"); + skip_test_dodgy_fs("ftruncate()"); } BUG_ON(ret, "ftruncate()"); diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 0e629586556b..6effafdc4d8a 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -5,6 +5,7 @@ #include #include /* ffsl() */ #include /* _SC_PAGESIZE */ +#include "../kselftest.h" #define BIT_ULL(nr) (1ULL << (nr)) #define PM_SOFT_DIRTY BIT_ULL(55) @@ -32,6 +33,23 @@ static inline unsigned int pshift(void) return __page_shift; } +/* + * Plan 9 FS has bugs (at least on QEMU) where certain operations fail with + * ENOENT on unlinked files. See + * https://gitlab.com/qemu-project/qemu/-/issues/103 for some info about such + * bugs. There are rumours of NFS implementations with similar bugs. + * + * Ideally, tests should just detect filesystems known to have such issues and + * bail early. But 9pfs has the additional "feature" that it causes fstatfs to + * pass through the f_type field from the host filesystem. To avoid having to + * scrape /proc/mounts or some other hackery, tests can call this function when + * it seems such a bug might have been encountered. + */ +static inline void skip_test_dodgy_fs(const char *op_name) +{ + ksft_test_result_skip("%s failed with ENOENT. Filesystem might be buggy (9pfs?)\n", op_name); +} + uint64_t pagemap_get_entry(int fd, char *start); bool pagemap_is_softdirty(int fd, char *start); bool pagemap_is_swapped(int fd, char *start); -- 2.51.0 From 0bfd4586855cf6919d025b5914be212fd4cd27b5 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Mon, 17 Mar 2025 17:04:03 -0600 Subject: [PATCH 05/16] MM documentation: add "Unaccepted" meminfo entry Commit dcdfdd40fa82 ("mm: Add support for unaccepted memory") added a entry to meminfo but did not document it in the proc.rst file. This counter tracks the amount of "Unaccepted" guest memory for some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP. Add the missing entry in the documentation. Link: https://lkml.kernel.org/r/20250317230403.79632-1-npache@redhat.com Signed-off-by: Nico Pache Acked-by: Kirill A. Shutemov Acked-by: David Hildenbrand Cc: Andrii Nakryiko Cc: Catalin Marinas Cc: Jeff Xu Cc: Jonathan Corbet Cc: Pasha Tatashin Cc: Suren Baghdasaryan Cc: xu xin Signed-off-by: Andrew Morton --- Documentation/filesystems/proc.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 80f33bce5fe1..f97692b31a2d 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1081,6 +1081,7 @@ Example output. 
You may not have all of these fields. FilePmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB + Unaccepted: 0 kB Balloon: 0 kB HugePages_Total: 0 HugePages_Free: 0 @@ -1256,6 +1257,8 @@ CmaTotal Memory reserved for the Contiguous Memory Allocator (CMA) CmaFree Free remaining memory in the CMA reserves +Unaccepted + Memory that has not been accepted by the guest Balloon Memory returned to Host by VM Balloon Drivers HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb -- 2.51.0 From 98c183a4fccf3b855fa64005a8b6d892570dfd66 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Tue, 18 Mar 2025 18:33:01 -0700 Subject: [PATCH 06/16] fs/dax: don't disassociate zero page entries Prior to commit 38607c62b34b ("fs/dax: properly refcount fs dax pages") dax_associate_entry() and dax_disassociate_entry() would implicitly skip zero and empty dax entries using the for_each_mapped_pfn() macro. The use of compound ZONE_DEVICE folios removed the need for this macro and so it was removed, leading dax_folio_put() to be called on zero pages. This lead to the below warning. To fix this explicitly skip zero and empty entries in dax_associate/disassociate_entry(). [ 27.536963] ------------[ cut here ]------------ [ 27.537674] WARNING: CPU: 11 PID: 874 at fs/dax.c:415 dax_folio_put.isra.0+0x10d/0x170 [ 27.538844] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm [ 27.539732] CPU: 11 UID: 0 PID: 874 Comm: ctl_prefault Tainted: G W 6.14.0-rc2+ #1104 [ 27.541093] Tainted: [W]=WARN [ 27.541549] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 27.543197] RIP: 0010:dax_folio_put.isra.0+0x10d/0x170 [ 27.543970] Code: 20 48 85 c0 0f 84 29 ff ff ff 48 83 e8 01 48 89 47 20 0f 84 1b ff ff ff 48 83 c4 10 5b 5d 41 5c c3 cc cc4 [ 27.546723] RSP: 0000:ffff961e4102fae0 EFLAGS: 00010002 [ 27.547505] RAX: 0000000000000001 RBX: ffffc9cce4e18000 RCX: 0000000000000009 [ 27.548564] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8a2a7badca40 [ 27.549630] RBP: ffffc9cce4e18000 R08: 0000000000009ffb R09: 00000000ffffdfff [ 27.550691] R10: 00000000ffffdfff R11: ffffffffa4e823a0 R12: 0000000000000000 [ 27.551748] R13: 0000000000000000 R14: 0000000010f10005 R15: 0000000000000004 [ 27.552819] FS: 00007f5f539d74c0(0000) GS:ffff8a2a7bac0000(0000) knlGS:0000000000000000 [ 27.554015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 27.554873] CR2: 00007f5f52e00000 CR3: 0000000909340000 CR4: 00000000000006f0 [ 27.555938] Call Trace: [ 27.556318] [ 27.556650] ? __warn+0x91/0x190 [ 27.557146] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.557824] ? report_bug+0x164/0x190 [ 27.558378] ? handle_bug+0x54/0x90 [ 27.558898] ? exc_invalid_op+0x17/0x70 [ 27.559489] ? asm_exc_invalid_op+0x1a/0x20 [ 27.560125] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.560808] dax_insert_entry+0x1e1/0x420 [ 27.561419] dax_fault_iter+0x252/0x860 [ 27.561995] dax_iomap_pmd_fault+0x23c/0x4a0 [ 27.562651] ext4_dax_huge_fault+0x1e2/0x450 [ 27.563296] __handle_mm_fault+0x6c8/0x12b0 [ 27.563920] ? do_user_addr_fault+0x1ca/0x670 [ 27.564577] ? 
lock_vma_under_rcu+0x178/0x3b0 [ 27.565235] handle_mm_fault+0xe5/0x290 [ 27.565816] do_user_addr_fault+0x208/0x670 [ 27.566446] exc_page_fault+0x6d/0x230 [ 27.567008] asm_exc_page_fault+0x26/0x30 [ 27.567610] RIP: 0033:0x7f5f543bcb4f [ 27.568152] Code: 45 f0 48 8b 45 f0 48 8b 4d f8 48 03 41 18 48 89 45 e8 48 8b 45 f0 48 3b 45 e8 0f 83 97 00 00 00 48 8b 458 [ 27.570895] RSP: 002b:00007ffc2d774460 EFLAGS: 00010287 [ 27.571672] RAX: 00007f5f52e00000 RBX: 0000000000200000 RCX: 000055760153fc00 [ 27.572731] RDX: 0000000000000000 RSI: 0000557601542a20 RDI: 000055760153fc00 [ 27.573787] RBP: 00007ffc2d774460 R08: 0000000000000000 R09: 0000000000000073 [ 27.574840] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffc2d77534b [ 27.575897] R13: 00007ffc2d774aa0 R14: 0000000000800000 R15: 0000000000800000 [ 27.576961] [ 27.577301] irq event stamp: 13394 [ 27.577810] hardirqs last enabled at (13393): [] flush_tlb_mm_range+0x1c0/0x220 [ 27.579138] hardirqs last disabled at (13394): [] _raw_spin_lock_irq+0x47/0x50 [ 27.580428] softirqs last enabled at (12530): [] xs_tcp_send_request+0x22a/0x2e0 [ 27.581762] softirqs last disabled at (12528): [] release_sock+0x1d/0xb0 [ 27.582986] ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250319013301.369822-1-apopple@nvidia.com Signed-off-by: Alistair Popple Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503102229.122fbd6c-lkp@intel.com Cc: Dan Williams Cc: Alison Schofield Cc: David Hildenbrand Cc: Balbir Singh Cc: "Darrick J. Wong" Cc: Dave Hansen Cc: Jason Gunthorpe Cc: John Hubbard Signed-off-by: Andrew Morton --- fs/dax.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index cf96f3dd4e5f..464c2badc135 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -445,6 +445,9 @@ static void dax_associate_entry(void *entry, struct address_space *mapping, unsigned long size = dax_entry_size(entry), index; struct folio *folio = dax_to_folio(entry); + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) + return; + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return; @@ -473,6 +476,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return; + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) + return; + dax_folio_put(folio); } @@ -1074,8 +1080,7 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, void *old; dax_disassociate_entry(entry, mapping, false); - if (!(flags & DAX_ZERO_PAGE)) - dax_associate_entry(new_entry, mapping, vmf->vma, + dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address, shared); /* -- 2.51.0 From 3b23a44f1f196741596616082e759f7f3a400e78 Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Tue, 18 Mar 2025 11:30:28 -0700 Subject: [PATCH 07/16] mm/damon: implement a new DAMOS filter type for active pages Patch series "mm/damon: introduce DAMOS filter type for active pages". The memory reclaim algorithm categorizes pages into active and inactive lists, separately for file and anon pages. The system's performance relies heavily on the (relative and absolute) accuracy of this categorization. This patch series add a new DAMOS filter for pages' activeness, giving us visibility into the access frequency of the pages on each list. This insight can help us diagnose issues with the active-inactive balancing dynamics, and make decisions to optimize reclaim efficiency and memory utilization. 
For instance, we might decide to enable DAMON_LRU_SORT, if we find that there are pages on the active list that are infrequently accessed, or less frequently accessed than pages on the inactive list. This patch (of 2): Implement a DAMOS filter type for active pages on DAMON kernel API, and add support of it from the physical address space DAMON operations set (paddr). Link: https://lkml.kernel.org/r/20250318183029.2062917-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20250318183029.2062917-2-nphamcs@gmail.com Signed-off-by: Nhat Pham Suggested-by: SeongJae Park Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 ++ mm/damon/paddr.c | 3 +++ mm/damon/sysfs-schemes.c | 1 + 3 files changed, 6 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 3db4f77261f5..47e36e6ea203 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -334,6 +334,7 @@ struct damos_stat { /** * enum damos_filter_type - Type of memory for &struct damos_filter * @DAMOS_FILTER_TYPE_ANON: Anonymous pages. + * @DAMOS_FILTER_TYPE_ACTIVE: Active pages. * @DAMOS_FILTER_TYPE_MEMCG: Specific memcg's pages. * @DAMOS_FILTER_TYPE_YOUNG: Recently accessed pages. * @DAMOS_FILTER_TYPE_HUGEPAGE_SIZE: Page is part of a hugepage. @@ -355,6 +356,7 @@ struct damos_stat { */ enum damos_filter_type { DAMOS_FILTER_TYPE_ANON, + DAMOS_FILTER_TYPE_ACTIVE, DAMOS_FILTER_TYPE_MEMCG, DAMOS_FILTER_TYPE_YOUNG, DAMOS_FILTER_TYPE_HUGEPAGE_SIZE, diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index b08847ef9b81..1b70d3f36046 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -217,6 +217,9 @@ static bool damos_pa_filter_match(struct damos_filter *filter, case DAMOS_FILTER_TYPE_ANON: matched = folio_test_anon(folio); break; + case DAMOS_FILTER_TYPE_ACTIVE: + matched = folio_test_active(folio); + break; case DAMOS_FILTER_TYPE_MEMCG: rcu_read_lock(); memcg = folio_memcg_check(folio); diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 5023f2b690d6..23b562df0839 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -344,6 +344,7 @@ static struct damon_sysfs_scheme_filter *damon_sysfs_scheme_filter_alloc( /* Should match with enum damos_filter_type */ static const char * const damon_sysfs_scheme_filter_type_strs[] = { "anon", + "active", "memcg", "young", "hugepage_size", -- 2.51.0 From af96c610c6fdcd3a765f8a158293fe208a205ac2 Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Tue, 18 Mar 2025 11:30:29 -0700 Subject: [PATCH 08/16] docs/mm/damon/design: document active DAMOS filter type Document availability and meaning of "active" DAMOS filter type on design document. Since introduction of the type requires no additional user ABI, usage and ABI document need no update. Link: https://lkml.kernel.org/r/20250318183029.2062917-3-nphamcs@gmail.com Signed-off-by: Nhat Pham Suggested-by: SeongJae Park Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- Documentation/mm/damon/design.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index aae3a691ee69..f12d33749329 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -656,6 +656,8 @@ Below ``type`` of filters are currently supported. - Operations layer handled, supported by only ``paddr`` operations set. - anon - Applied to pages that containing data that not stored in files. + - active + - Applied to active pages. - memcg - Applied to pages that belonging to a given cgroup. 
- young -- 2.51.0 From 735b3f7e773bd09d459537562754debd1f8e816b Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Tue, 18 Mar 2025 17:43:40 +0000 Subject: [PATCH 09/16] selftests/mm: uffd-unit-tests support for hugepages > 2M uffd-unit-tests uses a memory area with a fixed 32M size. Then it calculates the number of pages by dividing by page_size, which itself is either the base page size or the PMD huge page size depending on the test config. For the latter, we end up with nr_pages=1 for arm64 16K base pages, and nr_pages=0 for 64K base pages. This doesn't end well. So let's make the 32M size a floor and also ensure that we have at least 2 pages given the PMD size. With this change, the tests pass on arm64 64K base page size configuration. Link: https://lkml.kernel.org/r/20250318174343.243631-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: Peter Xu Acked-by: Rafael Aquini Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-unit-tests.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index 24ea82ee2231..e8fd9011c2a3 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -26,6 +26,8 @@ #define ALIGN_UP(x, align_to) \ ((__typeof__(x))((((unsigned long)(x)) + ((align_to)-1)) & ~((align_to)-1))) +#define MAX(a, b) (((a) > (b)) ? (a) : (b)) + struct mem_type { const char *name; unsigned int mem_flag; @@ -196,7 +198,8 @@ uffd_setup_environment(uffd_test_args_t *args, uffd_test_case_t *test, else page_size = psize(); - nr_pages = UFFD_TEST_MEM_SIZE / page_size; + /* Ensure we have at least 2 pages */ + nr_pages = MAX(UFFD_TEST_MEM_SIZE, page_size * 2) / page_size; /* TODO: remove this global var.. it's so ugly */ nr_parallel = 1; -- 2.51.0 From a2c6f9c3cafac02a48db83714f4b62fee2508bc3 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Tue, 18 Mar 2025 17:43:41 +0000 Subject: [PATCH 10/16] selftests/mm: speed up split_huge_page_test create_pagecache_thp_and_fd() was previously writing a file sized at twice the PMD size by making a per-byte write syscall. This was quite slow when the PMD size is 4M, but completely intolerable for 32M (PMD size for arm64's 16K page size), and 512M (PMD size for arm64's 64K page size). The byte pattern has a 256 byte period, so let's create a 1K buffer and fill it with exactly 4 periods. Then we can write the buffer as many times as is required to fill the file. This makes things much more tolerable. The test now passes for 16K page size. It still fails for 64K page size because MAX_PAGECACHE_ORDER is too small for 512M folio size (I think). 
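For the record, the PMD sizes quoted here follow from the page table
geometry: assuming 8-byte page table descriptors, as on arm64, one PMD
maps (PAGE_SIZE / 8) base pages:

    16K pages: (16384 / 8) * 16K = 2048 * 16K =  32M
    64K pages: (65536 / 8) * 64K = 8192 * 64K = 512M

Note also that the replacement is byte-for-byte equivalent to the old
loop: the pattern has a 256 byte period and 256 divides the 1K buffer
size, so offset i of the file still receives (i % 256).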
Link: https://lkml.kernel.org/r/20250318174343.243631-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: Peter Xu Acked-by: Rafael Aquini Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/split_huge_page_test.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index 719c5e2a6624..aa7400ed0e99 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -5,6 +5,7 @@ */ #define _GNU_SOURCE +#include #include #include #include @@ -398,6 +399,7 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, { size_t i; int dummy = 0; + unsigned char buf[1024]; srand(time(NULL)); @@ -405,11 +407,12 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, if (*fd == -1) ksft_exit_fail_msg("Failed to create a file at %s\n", testfile); - for (i = 0; i < fd_size; i++) { - unsigned char byte = (unsigned char)i; + assert(fd_size % sizeof(buf) == 0); + for (i = 0; i < sizeof(buf); i++) + buf[i] = (unsigned char)i; + for (i = 0; i < fd_size; i += sizeof(buf)) + write(*fd, buf, sizeof(buf)); - write(*fd, &byte, sizeof(byte)); - } close(*fd); sync(); *fd = open("/proc/sys/vm/drop_caches", O_WRONLY); -- 2.51.0 From e452872b40e3f1fb92adf0d573a0a6a7c9f6ce22 Mon Sep 17 00:00:00 2001 From: Hao Jia Date: Tue, 18 Mar 2025 15:58:32 +0800 Subject: [PATCH 11/16] mm: vmscan: split proactive reclaim statistics from direct reclaim statistics MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Patch series "Adding Proactive Memory Reclaim Statistics". These two patches are related to proactive memory reclaim. Patch 1 Split proactive reclaim statistics from direct reclaim counters and introduces new counters: pgsteal_proactive, pgdemote_proactive, and pgscan_proactive. Patch 2 Adds pswpin and pswpout items to the cgroup-v2 documentation. This patch (of 2): In proactive memory reclaim scenarios, it is necessary to accurately track proactive reclaim statistics to dynamically adjust the frequency and amount of memory being reclaimed proactively. Currently, proactive reclaim is included in direct reclaim statistics, which can make these direct reclaim statistics misleading. Therefore, separate proactive reclaim memory from the direct reclaim counters by introducing new counters: pgsteal_proactive, pgdemote_proactive, and pgscan_proactive, to avoid confusion with direct reclaim. Link: https://lkml.kernel.org/r/20250318075833.90615-1-jiahao.kernel@gmail.com Link: https://lkml.kernel.org/r/20250318075833.90615-2-jiahao.kernel@gmail.com Signed-off-by: Hao Jia Acked-by: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 9 +++++++ include/linux/mmzone.h | 1 + include/linux/vm_event_item.h | 2 ++ mm/memcontrol.c | 7 +++++ mm/vmscan.c | 35 ++++++++++++++----------- mm/vmstat.c | 3 +++ 6 files changed, 42 insertions(+), 15 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f8a894a16307..d7624e500610 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1576,6 +1576,9 @@ The following nested keys are defined. 
pgscan_khugepaged (npn) Amount of scanned pages by khugepaged (in an inactive LRU list) + pgscan_proactive (npn) + Amount of scanned pages proactively (in an inactive LRU list) + pgsteal_kswapd (npn) Amount of reclaimed pages by kswapd @@ -1585,6 +1588,9 @@ The following nested keys are defined. pgsteal_khugepaged (npn) Amount of reclaimed pages by khugepaged + pgsteal_proactive (npn) + Amount of reclaimed pages proactively + pgfault (npn) Total number of page faults incurred @@ -1662,6 +1668,9 @@ The following nested keys are defined. pgdemote_khugepaged Number of pages demoted by khugepaged. + pgdemote_proactive + Number of pages demoted by proactively. + hugetlb Amount of memory used by hugetlb pages. This metric only shows up if hugetlb usage is accounted for in memory.current (i.e. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a9db0fbd2b94..f7fe1126dc75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -221,6 +221,7 @@ enum node_stat_item { PGDEMOTE_KSWAPD, PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, + PGDEMOTE_PROACTIVE, #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index f70d0958095c..f11b6fa9c5b3 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -41,9 +41,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGSTEAL_KSWAPD, PGSTEAL_DIRECT, PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, PGSCAN_DIRECT_THROTTLE, PGSCAN_ANON, PGSCAN_FILE, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 57cf5a6c279c..40c07b8699ae 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -315,6 +315,7 @@ static const unsigned int memcg_node_stat_items[] = { PGDEMOTE_KSWAPD, PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, + PGDEMOTE_PROACTIVE, #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif @@ -431,9 +432,11 @@ static const unsigned int memcg_vm_event_stat[] = { PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, PGSTEAL_KSWAPD, PGSTEAL_DIRECT, PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, PGFAULT, PGMAJFAULT, PGREFILL, @@ -1394,6 +1397,7 @@ static const struct memory_stat memory_stats[] = { { "pgdemote_kswapd", PGDEMOTE_KSWAPD }, { "pgdemote_direct", PGDEMOTE_DIRECT }, { "pgdemote_khugepaged", PGDEMOTE_KHUGEPAGED }, + { "pgdemote_proactive", PGDEMOTE_PROACTIVE }, #ifdef CONFIG_NUMA_BALANCING { "pgpromote_success", PGPROMOTE_SUCCESS }, #endif @@ -1436,6 +1440,7 @@ static int memcg_page_state_output_unit(int item) case PGDEMOTE_KSWAPD: case PGDEMOTE_DIRECT: case PGDEMOTE_KHUGEPAGED: + case PGDEMOTE_PROACTIVE: #ifdef CONFIG_NUMA_BALANCING case PGPROMOTE_SUCCESS: #endif @@ -1509,10 +1514,12 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) seq_buf_printf(s, "pgscan %lu\n", memcg_events(memcg, PGSCAN_KSWAPD) + memcg_events(memcg, PGSCAN_DIRECT) + + memcg_events(memcg, PGSCAN_PROACTIVE) + memcg_events(memcg, PGSCAN_KHUGEPAGED)); seq_buf_printf(s, "pgsteal %lu\n", memcg_events(memcg, PGSTEAL_KSWAPD) + memcg_events(memcg, PGSTEAL_DIRECT) + + memcg_events(memcg, PGSTEAL_PROACTIVE) + memcg_events(memcg, PGSTEAL_KHUGEPAGED)); for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { diff --git a/mm/vmscan.c b/mm/vmscan.c index b5c7dfc2b189..98e6ac82e428 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -456,21 +456,26 @@ void drop_slab(void) } while ((freed >> shift++) > 1); } -static int reclaimer_offset(void) +#define CHECK_RECLAIMER_OFFSET(type) \ + do { \ + 
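+		/*							\
+		 * The PGSTEAL_*, PGSCAN_* and PGDEMOTE_* counters	\
+		 * must keep the same relative ordering so that a	\
+		 * single reclaimer offset can index all three		\
+		 * families; these checks enforce that at compile	\
+		 * time.						\
+		 */							\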
BUILD_BUG_ON(PGSTEAL_##type - PGSTEAL_KSWAPD != \ + PGDEMOTE_##type - PGDEMOTE_KSWAPD); \ + BUILD_BUG_ON(PGSTEAL_##type - PGSTEAL_KSWAPD != \ + PGSCAN_##type - PGSCAN_KSWAPD); \ + } while (0) + +static int reclaimer_offset(struct scan_control *sc) { - BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD != - PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD); - BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD != - PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD); - BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD != - PGSCAN_DIRECT - PGSCAN_KSWAPD); - BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD != - PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD); + CHECK_RECLAIMER_OFFSET(DIRECT); + CHECK_RECLAIMER_OFFSET(KHUGEPAGED); + CHECK_RECLAIMER_OFFSET(PROACTIVE); if (current_is_kswapd()) return 0; if (current_is_khugepaged()) return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD; + if (sc->proactive) + return PGSTEAL_PROACTIVE - PGSTEAL_KSWAPD; return PGSTEAL_DIRECT - PGSTEAL_KSWAPD; } @@ -2008,7 +2013,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, &nr_scanned, sc, lru); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); - item = PGSCAN_KSWAPD + reclaimer_offset(); + item = PGSCAN_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_scanned); __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); @@ -2024,10 +2029,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, spin_lock_irq(&lruvec->lru_lock); move_folios_to_lru(lruvec, &folio_list); - __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(), + __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), stat.nr_demoted); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - item = PGSTEAL_KSWAPD + reclaimer_offset(); + item = PGSTEAL_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); @@ -4571,7 +4576,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, break; } - item = PGSCAN_KSWAPD + reclaimer_offset(); + item = PGSCAN_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) { __count_vm_events(item, isolated); __count_vm_events(PGREFILL, sorted); @@ -4721,10 +4726,10 @@ retry: reset_batch_size(walk); } - __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(), + __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), stat.nr_demoted); - item = PGSTEAL_KSWAPD + reclaimer_offset(); + item = PGSTEAL_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, reclaimed); __count_memcg_events(memcg, item, reclaimed); diff --git a/mm/vmstat.c b/mm/vmstat.c index ae0e4259ac23..ab5c840941f3 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1274,6 +1274,7 @@ const char * const vmstat_text[] = { "pgdemote_kswapd", "pgdemote_direct", "pgdemote_khugepaged", + "pgdemote_proactive", #ifdef CONFIG_HUGETLB_PAGE "nr_hugetlb", #endif @@ -1309,9 +1310,11 @@ const char * const vmstat_text[] = { "pgsteal_kswapd", "pgsteal_direct", "pgsteal_khugepaged", + "pgsteal_proactive", "pgscan_kswapd", "pgscan_direct", "pgscan_khugepaged", + "pgscan_proactive", "pgscan_direct_throttle", "pgscan_anon", "pgscan_file", -- 2.51.0 From 4c8bc7c4e3fbbdf07a879429c34a78f31d9894d4 Mon Sep 17 00:00:00 2001 From: Hao Jia Date: Tue, 18 Mar 2025 15:58:33 +0800 Subject: [PATCH 12/16] cgroup: docs: add pswpin and pswpout items in cgroup v2 doc MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit The commit 15ff4d409e1a 
("mm/memcontrol: add per-memcg pgpgin/pswpin counter") introduced the pswpin and pswpout items in the memory.stat of cgroup v2. Therefore, update them accordingly in the cgroup-v2 documentation. Link: https://lkml.kernel.org/r/20250318075833.90615-3-jiahao.kernel@gmail.com Fixes: 15ff4d409e1a ("mm/memcontrol: add per-memcg pgpgin/pswpin counter") Signed-off-by: Hao Jia Acked-by: Tejun Heo Acked-by: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index d7624e500610..ba11a4d7321b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1561,6 +1561,12 @@ The following nested keys are defined. workingset_nodereclaim Number of times a shadow node has been reclaimed + pswpin (npn) + Number of pages swapped into memory + + pswpout (npn) + Number of pages swapped out of memory + pgscan (npn) Amount of scanned pages (in an inactive LRU list) -- 2.51.0 From 5f5ee52d4f58605330b09851273d6e56aaadd29e Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Tue, 18 Mar 2025 16:39:38 +0800 Subject: [PATCH 13/16] mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper Patch series "mm/vmscan: don't try to reclaim hwpoison folio". Fix a bug during memory reclaim if folio is hwpoisoned. This patch (of 2): Introduce helper folio_contain_hwpoisoned_page() to check if the entire folio is hwpoisoned or it contains hwpoisoned pages. Link: https://lkml.kernel.org/r/20250318083939.987651-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250318083939.987651-2-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu Acked-by: Miaohe Lin Cc: David Hildenbrand Cc: Kefeng Wang Cc: Nanyong Sun Cc: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 6 ++++++ mm/memory_hotplug.c | 3 +-- mm/shmem.c | 3 +-- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index f26ce54c7aa4..68f4b188fc6b 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -1098,6 +1098,12 @@ static inline bool is_page_hwpoison(const struct page *page) return folio_test_hugetlb(folio) && PageHWPoison(&folio->page); } +static inline bool folio_contain_hwpoisoned_page(struct folio *folio) +{ + return folio_test_hwpoison(folio) || + (folio_test_large(folio) && folio_test_has_hwpoisoned(folio)); +} + bool is_free_buddy_page(const struct page *page); PAGEFLAG(Isolated, isolated, PF_ANY); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 16cf9e17077e..75401866fb76 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1828,8 +1828,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) if (unlikely(page_folio(page) != folio)) goto put_folio; - if (folio_test_hwpoison(folio) || - (folio_test_large(folio) && folio_test_has_hwpoisoned(folio))) { + if (folio_contain_hwpoisoned_page(folio)) { if (WARN_ON(folio_test_lru(folio))) folio_isolate_lru(folio); if (folio_mapped(folio)) { diff --git a/mm/shmem.c b/mm/shmem.c index 7b738d8d6581..17f27d92c664 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3290,8 +3290,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, if (ret) return ret; - if (folio_test_hwpoison(folio) || - (folio_test_large(folio) && 
folio_test_has_hwpoisoned(folio))) { + if (folio_contain_hwpoisoned_page(folio)) { folio_unlock(folio); folio_put(folio); return -EIO; -- 2.51.0 From 1b0449544c6482179ac84530b61fc192a6527bfd Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Tue, 18 Mar 2025 16:39:39 +0800 Subject: [PATCH 14/16] mm/vmscan: don't try to reclaim hwpoison folio Syzkaller reports a bug as follows: Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000 Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e memcg:ffff0000dd6d9000 anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff) raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9 raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000 page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio)) ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:184! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3 Hardware name: linux,dummy-virt (DT) pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : add_to_swap+0xbc/0x158 lr : add_to_swap+0xbc/0x158 sp : ffff800087f37340 x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780 x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0 x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4 x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000 x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000 x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001 x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000 Call trace: add_to_swap+0xbc/0x158 shrink_folio_list+0x12ac/0x2648 shrink_inactive_list+0x318/0x948 shrink_lruvec+0x450/0x720 shrink_node_memcgs+0x280/0x4a8 shrink_node+0x128/0x978 balance_pgdat+0x4f0/0xb20 kswapd+0x228/0x438 kthread+0x214/0x230 ret_from_fork+0x10/0x20 I can reproduce this issue with the following steps: 1) When a dirty swapcache page is isolated by reclaim process and the page isn't locked, inject memory failure for the page. me_swapcache_dirty() clears uptodate flag and tries to delete from lru, but fails. Reclaim process will put the hwpoisoned page back to lru. 2) The process that maps the hwpoisoned page exits, the page is deleted the page will never be freed and will be in the lru forever. 3) If we trigger a reclaim again and tries to reclaim the page, add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is cleared. To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap it in shrink_folio_list(), otherwise the folio will fail to be unmaped by hwpoison_user_mappings() since the folio isn't in lru list. 
Link: https://lkml.kernel.org/r/20250318083939.987651-3-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu Acked-by: Miaohe Lin Cc: David Hildenbrand Cc: Kefeng Wang Cc: Nanyong Sun Cc: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton --- mm/vmscan.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 98e6ac82e428..2b2ab386cab5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1127,6 +1127,13 @@ retry: if (!folio_trylock(folio)) goto keep; + if (folio_contain_hwpoisoned_page(folio)) { + unmap_poisoned_folio(folio, folio_pfn(folio), false); + folio_unlock(folio); + folio_put(folio); + continue; + } + VM_BUG_ON_FOLIO(folio_test_active(folio), folio); nr_pages = folio_nr_pages(folio); -- 2.51.0 From d893aca973c315ee985e1f8220fcd239c0ab7d19 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 19 Mar 2025 14:23:37 +0200 Subject: [PATCH 15/16] x86/mm: restore early initialization of high_memory for 32-bits Kernel test robot reports the following crash on 32-bit system with HIGHMEM and DEBUG_VIRTUAL: [ 0.056128][ T0] kernel BUG at arch/x86/mm/physaddr.c:77! PANIC: early exception 0x06 IP 60:c116539d error 0 cr2 0x0 [ 0.056916][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc4-00010-ga4dbe5c71817 #1 [ 0.057570][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.058299][ T0] EIP: __phys_addr (arch/x86/mm/physaddr.c:77) [ 0.058633][ T0] Code: 00 74 33 89 f0 e8 d3 8b 2e 00 89 c3 0f b6 d0 b8 58 bb 4b c5 31 c9 6a 00 e8 70 f5 15 00 83 c4 04 84 db 74 25 ff 05 78 de 5d c5 <0f> 0b b8 c8 91 ea c4 e8 e7 6e ea ff b8 58 bb 4b c5 31 d2 31 c9 6a All code [ 0.060017][ T0] EAX: 00000000 EBX: c61f7001 ECX: 00000000 EDX: 00000000 [ 0.060519][ T0] ESI: c61f7000 EDI: 061f7000 EBP: c4e31f04 ESP: c61f7000 [ 0.061016][ T0] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: cff4 EFLAGS: 00210002 [ 0.061560][ T0] CR0: 80050033 CR2: 00000000 CR3: 059fc000 CR4: 00000090 [ 0.062060][ T0] Call Trace: [ 0.062288][ T0] ? show_regs (arch/x86/kernel/dumpstack.c:478) [ 0.062588][ T0] ? early_fixup_exception (arch/x86/include/asm/nospec-branch.h:595) [ 0.062968][ T0] ? early_idt_handler_common (arch/x86/kernel/head_32.S:352) [ 0.063360][ T0] ? __phys_addr (arch/x86/mm/physaddr.c:77) [ 0.063677][ T0] ? one_page_table_init (arch/x86/mm/init_32.c:100) [ 0.064037][ T0] ? page_table_range_init (arch/x86/mm/init_32.c:227) [ 0.064411][ T0] ? permanent_kmaps_init (include/linux/pgtable.h:191 include/linux/pgtable.h:196 arch/x86/mm/init_32.c:395) [ 0.064814][ T0] ? paging_init (arch/x86/mm/init_32.c:677) [ 0.065118][ T0] ? native_pagetable_init (arch/x86/mm/init_32.c:481) [ 0.065503][ T0] ? setup_arch (arch/x86/kernel/setup.c:1131) [ 0.065819][ T0] ? start_kernel (include/linux/jump_label.h:267 init/main.c:920) [ 0.066143][ T0] ? i386_start_kernel (arch/x86/kernel/head32.c:79) [ 0.066501][ T0] ? startup_32_smp (arch/x86/kernel/head_32.S:292) The crash happens because commit e120d1bc12da ("arch, mm: set high_memory in free_area_init()") moved initialization of high_memory after __vmalloc_start_set and with high_memory still set to 0 any address passes is_vmalloc_addr() check. Restore early initialization of high_memory on 32-bit systems in initmem_init(). 
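To spell out the failure mode (a simplified reading; the macro below is
quoted roughly, from arch/x86/include/asm/pgtable_32_areas.h):

	#define VMALLOC_START	((unsigned long)high_memory + VMALLOC_OFFSET)

With high_memory still 0 at that point, VMALLOC_START collapses to
VMALLOC_OFFSET, so practically every direct-map address satisfies
is_vmalloc_addr() and the DEBUG_VIRTUAL assertion in __phys_addr()
fires, as seen in the trace above.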
Link: https://lkml.kernel.org/r/20250319122337.1538924-1-rppt@kernel.org Fixes: e120d1bc12da ("arch, mm: set high_memory in free_area_init()") Signed-off-by: Mike Rapoport (Microsoft) Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503191442.112e954f-lkp@intel.com Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- arch/x86/mm/init_32.c | 3 +++ arch/x86/mm/numa_32.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 95b2758b4e4d..f69d2436d780 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -626,6 +626,9 @@ void __init initmem_init(void) highstart_pfn = max_low_pfn; printk(KERN_NOTICE "%ldMB HIGHMEM available.\n", pages_to_mb(highend_pfn - highstart_pfn)); + high_memory = (void *) __va(highstart_pfn * PAGE_SIZE - 1) + 1; +#else + high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1; #endif memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c index 442ef3facff0..65fda406e6f2 100644 --- a/arch/x86/mm/numa_32.c +++ b/arch/x86/mm/numa_32.c @@ -41,6 +41,9 @@ void __init initmem_init(void) highstart_pfn = max_low_pfn; printk(KERN_NOTICE "%ldMB HIGHMEM available.\n", pages_to_mb(highend_pfn - highstart_pfn)); + high_memory = (void *) __va(highstart_pfn * PAGE_SIZE - 1) + 1; +#else + high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1; #endif printk(KERN_NOTICE "%ldMB LOWMEM available.\n", pages_to_mb(max_low_pfn)); -- 2.51.0 From 0a1e082b64ccce165e7307a7b49d22b2504f9d1f Mon Sep 17 00:00:00 2001 From: Liu Ye Date: Wed, 19 Mar 2025 17:17:26 +0800 Subject: [PATCH 16/16] mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() The `movable` variable is always used when `CONFIG_TRANSPARENT_HUGEPAGE` is enabled, so the `__maybe_unused` attribute is not necessary. This patch removes it and keeps the variable declaration within the `#ifdef` block for better clarity. Link: https://lkml.kernel.org/r/20250319091726.401158-1-liuyerd@163.com Signed-off-by: Liu Ye Signed-off-by: Andrew Morton --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a6d060eea638..0c01998cb3a0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -509,9 +509,9 @@ out: static inline unsigned int order_to_pindex(int migratetype, int order) { - bool __maybe_unused movable; #ifdef CONFIG_TRANSPARENT_HUGEPAGE + bool movable; if (order > PAGE_ALLOC_COSTLY_ORDER) { VM_BUG_ON(order != HPAGE_PMD_ORDER); -- 2.51.0