From f6a09e6800936c6c9ba5667ac3efc18feb8f3a2f Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Fri, 14 Mar 2025 15:37:57 -0600 Subject: [PATCH 01/16] xen: balloon: update the NR_BALLOON_PAGES state Update the NR_BALLOON_PAGES counter when pages are added to or removed from the Xen balloon. Link: https://lkml.kernel.org/r/20250314213757.244258-5-npache@redhat.com Signed-off-by: Nico Pache Reviewed-by: Juergen Gross Cc: Alexander Atanasov Cc: Chengming Zhou Cc: David Hildenbrand Cc: Dexuan Cui Cc: Haiyang Zhang Cc: Johannes Weiner Cc: Juegren Gross Cc: Kanchana P Sridhar Cc: K. Y. Srinivasan Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Muchun Song Cc: Nhat Pham Cc: Oleksandr Tyshchenko Cc: Roman Gushchin Cc: Shakeel Butt Cc: Stefano Stabellini Cc: Wei Liu Cc: Michael Kelley Signed-off-by: Andrew Morton --- drivers/xen/balloon.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 163f7f1d70f1..65d4e7fa1eb8 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -157,6 +157,8 @@ static void balloon_append(struct page *page) list_add(&page->lru, &ballooned_pages); balloon_stats.balloon_low++; } + inc_node_page_state(page, NR_BALLOON_PAGES); + wake_up(&balloon_wq); } @@ -179,6 +181,8 @@ static struct page *balloon_retrieve(bool require_lowmem) balloon_stats.balloon_low--; __ClearPageOffline(page); + dec_node_page_state(page, NR_BALLOON_PAGES); + return page; } -- 2.51.0 From 9f171d94be80d80067c6445e53b00d098150391e Mon Sep 17 00:00:00 2001 From: Jiwen Qi Date: Sat, 15 Mar 2025 21:13:17 +0000 Subject: [PATCH 02/16] docs/mm: Physical Memory: Populate the "Zones" section Briefly describe what zones are and the fields of struct zone. Link: https://lkml.kernel.org/r/20250315211317.27612-1-jiwen7.qi@gmail.com Signed-off-by: Jiwen Qi Acked-by: Mike Rapoport (Microsoft) Cc: Bagas Sanjaya Cc: Jonathan Corbet Cc: Randy Dunlap Signed-off-by: Andrew Morton --- Documentation/mm/physical_memory.rst | 266 ++++++++++++++++++++++++++- 1 file changed, 264 insertions(+), 2 deletions(-) diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst index 71fd4a6acf42..d3ac106e6b14 100644 --- a/Documentation/mm/physical_memory.rst +++ b/Documentation/mm/physical_memory.rst @@ -338,10 +338,272 @@ Statistics Zones ===== +As we have mentioned, each zone in memory is described by a ``struct zone`` +which is an element of the ``node_zones`` array of the node it belongs to. +``struct zone`` is the core data structure of the page allocator. A zone +represents a range of physical memory and may have holes. + +The page allocator uses the GFP flags, see :ref:`mm-api-gfp-flags`, specified by +a memory allocation to determine the highest zone in a node from which the +memory allocation can allocate memory. The page allocator first allocates memory +from that zone, if the page allocator can't allocate the requested amount of +memory from the zone, it will allocate memory from the next lower zone in the +node, the process continues up to and including the lowest zone. For example, if +a node contains ``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` and the +highest zone of a memory allocation is ``ZONE_MOVABLE``, the order of the zones +from which the page allocator allocates memory is ``ZONE_MOVABLE`` > +``ZONE_NORMAL`` > ``ZONE_DMA32``. + +At runtime, free pages in a zone are in the Per-CPU Pagesets (PCP) or free areas +of the zone. The Per-CPU Pagesets are a vital mechanism in the kernel's memory +management system. 
By handling most frequent allocations and frees locally on +each CPU, the Per-CPU Pagesets improve performance and scalability, especially +on systems with many cores. The page allocator in the kernel employs a two-step +strategy for memory allocation, starting with the Per-CPU Pagesets before +falling back to the buddy allocator. Pages are transferred between the Per-CPU +Pagesets and the global free areas (managed by the buddy allocator) in batches. +This minimizes the overhead of frequent interactions with the global buddy +allocator. + +Architecture specific code calls free_area_init() to initializes zones. + +Zone structure +-------------- +The zones structure ``struct zone`` is defined in ``include/linux/mmzone.h``. +Here we briefly describe fields of this structure: -.. admonition:: Stub +General +~~~~~~~ - This section is incomplete. Please list and describe the appropriate fields. +``_watermark`` + The watermarks for this zone. When the amount of free pages in a zone is below + the min watermark, boosting is ignored, an allocation may trigger direct + reclaim and direct compaction, it is also used to throttle direct reclaim. + When the amount of free pages in a zone is below the low watermark, kswapd is + woken up. When the amount of free pages in a zone is above the high watermark, + kswapd stops reclaiming (a zone is balanced) when the + ``NUMA_BALANCING_MEMORY_TIERING`` bit of ``sysctl_numa_balancing_mode`` is not + set. The promo watermark is used for memory tiering and NUMA balancing. When + the amount of free pages in a zone is above the promo watermark, kswapd stops + reclaiming when the ``NUMA_BALANCING_MEMORY_TIERING`` bit of + ``sysctl_numa_balancing_mode`` is set. The watermarks are set by + ``__setup_per_zone_wmarks()``. The min watermark is calculated according to + ``vm.min_free_kbytes`` sysctl. The other three watermarks are set according + to the distance between two watermarks. The distance itself is calculated + taking ``vm.watermark_scale_factor`` sysctl into account. + +``watermark_boost`` + The number of pages which are used to boost watermarks to increase reclaim + pressure to reduce the likelihood of future fallbacks and wake kswapd now + as the node may be balanced overall and kswapd will not wake naturally. + +``nr_reserved_highatomic`` + The number of pages which are reserved for high-order atomic allocations. + +``nr_free_highatomic`` + The number of free pages in reserved highatomic pageblocks + +``lowmem_reserve`` + The array of the amounts of the memory reserved in this zone for memory + allocations. For example, if the highest zone a memory allocation can + allocate memory from is ``ZONE_MOVABLE``, the amount of memory reserved in + this zone for this allocation is ``lowmem_reserve[ZONE_MOVABLE]`` when + attempting to allocate memory from this zone. This is a mechanism the page + allocator uses to prevent allocations which could use ``highmem`` from using + too much ``lowmem``. For some specialised workloads on ``highmem`` machines, + it is dangerous for the kernel to allow process memory to be allocated from + the ``lowmem`` zone. This is because that memory could then be pinned via the + ``mlock()`` system call, or by unavailability of swapspace. + ``vm.lowmem_reserve_ratio`` sysctl determines how aggressive the kernel is in + defending these lower zones. This array is recalculated by + ``setup_per_zone_lowmem_reserve()`` at runtime if ``vm.lowmem_reserve_ratio`` + sysctl changes. + +``node`` + The index of the node this zone belongs to. 
Available only when + ``CONFIG_NUMA`` is enabled because there is only one zone in a UMA system. + +``zone_pgdat`` + Pointer to the ``struct pglist_data`` of the node this zone belongs to. + +``per_cpu_pageset`` + Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by + ``setup_zone_pageset()``. By handling most frequent allocations and frees + locally on each CPU, PCP improves performance and scalability on systems with + many cores. + +``pageset_high_min`` + Copied to the ``high_min`` of the Per-CPU Pagesets for faster access. + +``pageset_high_max`` + Copied to the ``high_max`` of the Per-CPU Pagesets for faster access. + +``pageset_batch`` + Copied to the ``batch`` of the Per-CPU Pagesets for faster access. The + ``batch``, ``high_min`` and ``high_max`` of the Per-CPU Pagesets are used to + calculate the number of elements the Per-CPU Pagesets obtain from the buddy + allocator under a single hold of the lock for efficiency. They are also used + to decide if the Per-CPU Pagesets return pages to the buddy allocator in page + free process. + +``pageblock_flags`` + The pointer to the flags for the pageblocks in the zone (see + ``include/linux/pageblock-flags.h`` for flags list). The memory is allocated + in ``setup_usemap()``. Each pageblock occupies ``NR_PAGEBLOCK_BITS`` bits. + Defined only when ``CONFIG_FLATMEM`` is enabled. The flags is stored in + ``mem_section`` when ``CONFIG_SPARSEMEM`` is enabled. + +``zone_start_pfn`` + The start pfn of the zone. It is initialized by + ``calculate_node_totalpages()``. + +``managed_pages`` + The present pages managed by the buddy system, which is calculated as: + ``managed_pages`` = ``present_pages`` - ``reserved_pages``, ``reserved_pages`` + includes pages allocated by the memblock allocator. It should be used by page + allocator and vm scanner to calculate all kinds of watermarks and thresholds. + It is accessed using ``atomic_long_xxx()`` functions. It is initialized in + ``free_area_init_core()`` and then is reinitialized when memblock allocator + frees pages into buddy system. + +``spanned_pages`` + The total pages spanned by the zone, including holes, which is calculated as: + ``spanned_pages`` = ``zone_end_pfn`` - ``zone_start_pfn``. It is initialized + by ``calculate_node_totalpages()``. + +``present_pages`` + The physical pages existing within the zone, which is calculated as: + ``present_pages`` = ``spanned_pages`` - ``absent_pages`` (pages in holes). It + may be used by memory hotplug or memory power management logic to figure out + unmanaged pages by checking (``present_pages`` - ``managed_pages``). Write + access to ``present_pages`` at runtime should be protected by + ``mem_hotplug_begin/done()``. Any reader who can't tolerant drift of + ``present_pages`` should use ``get_online_mems()`` to get a stable value. It + is initialized by ``calculate_node_totalpages()``. + +``present_early_pages`` + The present pages existing within the zone located on memory available since + early boot, excluding hotplugged memory. Defined only when + ``CONFIG_MEMORY_HOTPLUG`` is enabled and initialized by + ``calculate_node_totalpages()``. + +``cma_pages`` + The pages reserved for CMA use. These pages behave like ``ZONE_MOVABLE`` when + they are not used for CMA. Defined only when ``CONFIG_CMA`` is enabled. + +``name`` + The name of the zone. It is a pointer to the corresponding element of + the ``zone_names`` array. + +``nr_isolate_pageblock`` + Number of isolated pageblocks. 
It is used to solve incorrect freepage counting + problem due to racy retrieving migratetype of pageblock. Protected by + ``zone->lock``. Defined only when ``CONFIG_MEMORY_ISOLATION`` is enabled. + +``span_seqlock`` + The seqlock to protect ``zone_start_pfn`` and ``spanned_pages``. It is a + seqlock because it has to be read outside of ``zone->lock``, and it is done in + the main allocator path. However, the seqlock is written quite infrequently. + Defined only when ``CONFIG_MEMORY_HOTPLUG`` is enabled. + +``initialized`` + The flag indicating if the zone is initialized. Set by + ``init_currently_empty_zone()`` during boot. + +``free_area`` + The array of free areas, where each element corresponds to a specific order + which is a power of two. The buddy allocator uses this structure to manage + free memory efficiently. When allocating, it tries to find the smallest + sufficient block, if the smallest sufficient block is larger than the + requested size, it will be recursively split into the next smaller blocks + until the required size is reached. When a page is freed, it may be merged + with its buddy to form a larger block. It is initialized by + ``zone_init_free_lists()``. + +``unaccepted_pages`` + The list of pages to be accepted. All pages on the list are ``MAX_PAGE_ORDER``. + Defined only when ``CONFIG_UNACCEPTED_MEMORY`` is enabled. + +``flags`` + The zone flags. The least three bits are used and defined by + ``enum zone_flags``. ``ZONE_BOOSTED_WATERMARK`` (bit 0): zone recently boosted + watermarks. Cleared when kswapd is woken. ``ZONE_RECLAIM_ACTIVE`` (bit 1): + kswapd may be scanning the zone. ``ZONE_BELOW_HIGH`` (bit 2): zone is below + high watermark. + +``lock`` + The main lock that protects the internal data structures of the page allocator + specific to the zone, especially protects ``free_area``. + +``percpu_drift_mark`` + When free pages are below this point, additional steps are taken when reading + the number of free pages to avoid per-cpu counter drift allowing watermarks + to be breached. It is updated in ``refresh_zone_stat_thresholds()``. + +Compaction control +~~~~~~~~~~~~~~~~~~ + +``compact_cached_free_pfn`` + The PFN where compaction free scanner should start in the next scan. + +``compact_cached_migrate_pfn`` + The PFNs where compaction migration scanner should start in the next scan. + This array has two elements: the first one is used in ``MIGRATE_ASYNC`` mode, + and the other one is used in ``MIGRATE_SYNC`` mode. + +``compact_init_migrate_pfn`` + The initial migration PFN which is initialized to 0 at boot time, and to the + first pageblock with migratable pages in the zone after a full compaction + finishes. It is used to check if a scan is a whole zone scan or not. + +``compact_init_free_pfn`` + The initial free PFN which is initialized to 0 at boot time and to the last + pageblock with free ``MIGRATE_MOVABLE`` pages in the zone. It is used to check + if it is the start of a scan. + +``compact_considered`` + The number of compactions attempted since last failure. It is reset in + ``defer_compaction()`` when a compaction fails to result in a page allocation + success. It is increased by 1 in ``compaction_deferred()`` when a compaction + should be skipped. ``compaction_deferred()`` is called before + ``compact_zone()`` is called, ``compaction_defer_reset()`` is called when + ``compact_zone()`` returns ``COMPACT_SUCCESS``, ``defer_compaction()`` is + called when ``compact_zone()`` returns ``COMPACT_PARTIAL_SKIPPED`` or + ``COMPACT_COMPLETE``. 
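+
+  A rough sketch of this bookkeeping (an illustration only; the helper
+  name is invented and this is not the exact kernel code) looks like::
+
+      /* Skip compaction while the post-failure backoff window is open. */
+      static bool zone_compaction_deferred(struct zone *zone)
+      {
+              unsigned long limit = 1UL << zone->compact_defer_shift;
+
+              return ++zone->compact_considered < limit;
+      }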
+ +``compact_defer_shift`` + The number of compactions skipped before trying again is + ``1< Date: Mon, 17 Mar 2025 17:36:14 +0100 Subject: [PATCH 03/16] fork: use __vmalloc_node() for stack allocation Replace __vmalloc_node_range() by __vmalloc_node(). The last variant requires less parameters and it uses exactly the same arguments which are partly now hidden inside __vmalloc_node(). This change does not change any functionality. It makes the code a bit simpler. Link: https://lkml.kernel.org/r/20250317163614.166502-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) Acked-by: Michal Hocko Cc: Christian Brauner Cc: Oleg Nesterov Cc: Tejun Heo Signed-off-by: Andrew Morton --- kernel/fork.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index f9cf0f056eb6..83cb82643817 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -311,11 +311,9 @@ static int alloc_thread_stack_node(struct task_struct *tsk, int node) * so memcg accounting is performed manually on assigning/releasing * stacks to tasks. Drop __GFP_ACCOUNT. */ - stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN, - VMALLOC_START, VMALLOC_END, + stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP & ~__GFP_ACCOUNT, - PAGE_KERNEL, - 0, node, __builtin_return_address(0)); + node, __builtin_return_address(0)); if (!stack) return -ENOMEM; -- 2.51.0 From d8a866c766ebe34b4371d42f4a3edca200f5e645 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Mon, 17 Mar 2025 10:20:34 +0000 Subject: [PATCH 04/16] selftests/mm: add commentary about 9pfs bugs As discussed here: https://lore.kernel.org/lkml/Z9RRkL1hom48z3Tt@google.com/ This code could benefit from some more commentary. To avoid needing to comment the same thing in multiple places (I guess more of these SKIPs will need to be added over time, for now I am only like 20% of the way through Project Run run_vmtests.sh Successfully), add a dummy "skip tests for this specific reason" function that basically just serves as a hook to hang comments on. Link: https://lkml.kernel.org/r/20250317-9pfs-comments-v1-1-9ac96043e146@google.com Signed-off-by: Brendan Jackman Cc: David Hildenbrand Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/gup_longterm.c | 6 +----- tools/testing/selftests/mm/map_populate.c | 8 +++----- tools/testing/selftests/mm/vm_util.h | 18 ++++++++++++++++++ 3 files changed, 22 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c index 03271442aae5..21595b20bbc3 100644 --- a/tools/testing/selftests/mm/gup_longterm.c +++ b/tools/testing/selftests/mm/gup_longterm.c @@ -97,11 +97,7 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared) if (ftruncate(fd, size)) { if (errno == ENOENT) { - /* - * This can happen if the file has been unlinked and the - * filesystem doesn't support truncating unlinked files. 
- */ - ksft_test_result_skip("ftruncate() failed with ENOENT\n"); + skip_test_dodgy_fs("ftruncate()"); } else { ksft_test_result_fail("ftruncate() failed (%s)\n", strerror(errno)); } diff --git a/tools/testing/selftests/mm/map_populate.c b/tools/testing/selftests/mm/map_populate.c index 433e54fb634f..9df2636c829b 100644 --- a/tools/testing/selftests/mm/map_populate.c +++ b/tools/testing/selftests/mm/map_populate.c @@ -18,6 +18,8 @@ #include #include "../kselftest.h" +#include "vm_util.h" + #define MMAP_SZ 4096 #define BUG_ON(condition, description) \ @@ -88,11 +90,7 @@ int main(int argc, char **argv) ret = ftruncate(fileno(ftmp), MMAP_SZ); if (ret < 0 && errno == ENOENT) { - /* - * This probably means tmpfile() made a file on a filesystem - * that doesn't handle temporary files the way we want. - */ - ksft_exit_skip("ftruncate(fileno(tmpfile())) gave ENOENT, weird filesystem?\n"); + skip_test_dodgy_fs("ftruncate()"); } BUG_ON(ret, "ftruncate()"); diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h index 0e629586556b..6effafdc4d8a 100644 --- a/tools/testing/selftests/mm/vm_util.h +++ b/tools/testing/selftests/mm/vm_util.h @@ -5,6 +5,7 @@ #include #include /* ffsl() */ #include /* _SC_PAGESIZE */ +#include "../kselftest.h" #define BIT_ULL(nr) (1ULL << (nr)) #define PM_SOFT_DIRTY BIT_ULL(55) @@ -32,6 +33,23 @@ static inline unsigned int pshift(void) return __page_shift; } +/* + * Plan 9 FS has bugs (at least on QEMU) where certain operations fail with + * ENOENT on unlinked files. See + * https://gitlab.com/qemu-project/qemu/-/issues/103 for some info about such + * bugs. There are rumours of NFS implementations with similar bugs. + * + * Ideally, tests should just detect filesystems known to have such issues and + * bail early. But 9pfs has the additional "feature" that it causes fstatfs to + * pass through the f_type field from the host filesystem. To avoid having to + * scrape /proc/mounts or some other hackery, tests can call this function when + * it seems such a bug might have been encountered. + */ +static inline void skip_test_dodgy_fs(const char *op_name) +{ + ksft_test_result_skip("%s failed with ENOENT. Filesystem might be buggy (9pfs?)\n", op_name); +} + uint64_t pagemap_get_entry(int fd, char *start); bool pagemap_is_softdirty(int fd, char *start); bool pagemap_is_swapped(int fd, char *start); -- 2.51.0 From 0bfd4586855cf6919d025b5914be212fd4cd27b5 Mon Sep 17 00:00:00 2001 From: Nico Pache Date: Mon, 17 Mar 2025 17:04:03 -0600 Subject: [PATCH 05/16] MM documentation: add "Unaccepted" meminfo entry Commit dcdfdd40fa82 ("mm: Add support for unaccepted memory") added a entry to meminfo but did not document it in the proc.rst file. This counter tracks the amount of "Unaccepted" guest memory for some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP. Add the missing entry in the documentation. Link: https://lkml.kernel.org/r/20250317230403.79632-1-npache@redhat.com Signed-off-by: Nico Pache Acked-by: Kirill A. Shutemov Acked-by: David Hildenbrand Cc: Andrii Nakryiko Cc: Catalin Marinas Cc: Jeff Xu Cc: Jonathan Corbet Cc: Pasha Tatashin Cc: Suren Baghdasaryan Cc: xu xin Signed-off-by: Andrew Morton --- Documentation/filesystems/proc.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 80f33bce5fe1..f97692b31a2d 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1081,6 +1081,7 @@ Example output. 
You may not have all of these fields. FilePmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB + Unaccepted: 0 kB Balloon: 0 kB HugePages_Total: 0 HugePages_Free: 0 @@ -1256,6 +1257,8 @@ CmaTotal Memory reserved for the Contiguous Memory Allocator (CMA) CmaFree Free remaining memory in the CMA reserves +Unaccepted + Memory that has not been accepted by the guest Balloon Memory returned to Host by VM Balloon Drivers HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb -- 2.51.0 From 98c183a4fccf3b855fa64005a8b6d892570dfd66 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Tue, 18 Mar 2025 18:33:01 -0700 Subject: [PATCH 06/16] fs/dax: don't disassociate zero page entries Prior to commit 38607c62b34b ("fs/dax: properly refcount fs dax pages") dax_associate_entry() and dax_disassociate_entry() would implicitly skip zero and empty dax entries using the for_each_mapped_pfn() macro. The use of compound ZONE_DEVICE folios removed the need for this macro and so it was removed, leading dax_folio_put() to be called on zero pages. This lead to the below warning. To fix this explicitly skip zero and empty entries in dax_associate/disassociate_entry(). [ 27.536963] ------------[ cut here ]------------ [ 27.537674] WARNING: CPU: 11 PID: 874 at fs/dax.c:415 dax_folio_put.isra.0+0x10d/0x170 [ 27.538844] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm [ 27.539732] CPU: 11 UID: 0 PID: 874 Comm: ctl_prefault Tainted: G W 6.14.0-rc2+ #1104 [ 27.541093] Tainted: [W]=WARN [ 27.541549] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 27.543197] RIP: 0010:dax_folio_put.isra.0+0x10d/0x170 [ 27.543970] Code: 20 48 85 c0 0f 84 29 ff ff ff 48 83 e8 01 48 89 47 20 0f 84 1b ff ff ff 48 83 c4 10 5b 5d 41 5c c3 cc cc4 [ 27.546723] RSP: 0000:ffff961e4102fae0 EFLAGS: 00010002 [ 27.547505] RAX: 0000000000000001 RBX: ffffc9cce4e18000 RCX: 0000000000000009 [ 27.548564] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8a2a7badca40 [ 27.549630] RBP: ffffc9cce4e18000 R08: 0000000000009ffb R09: 00000000ffffdfff [ 27.550691] R10: 00000000ffffdfff R11: ffffffffa4e823a0 R12: 0000000000000000 [ 27.551748] R13: 0000000000000000 R14: 0000000010f10005 R15: 0000000000000004 [ 27.552819] FS: 00007f5f539d74c0(0000) GS:ffff8a2a7bac0000(0000) knlGS:0000000000000000 [ 27.554015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 27.554873] CR2: 00007f5f52e00000 CR3: 0000000909340000 CR4: 00000000000006f0 [ 27.555938] Call Trace: [ 27.556318] [ 27.556650] ? __warn+0x91/0x190 [ 27.557146] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.557824] ? report_bug+0x164/0x190 [ 27.558378] ? handle_bug+0x54/0x90 [ 27.558898] ? exc_invalid_op+0x17/0x70 [ 27.559489] ? asm_exc_invalid_op+0x1a/0x20 [ 27.560125] ? dax_folio_put.isra.0+0x10d/0x170 [ 27.560808] dax_insert_entry+0x1e1/0x420 [ 27.561419] dax_fault_iter+0x252/0x860 [ 27.561995] dax_iomap_pmd_fault+0x23c/0x4a0 [ 27.562651] ext4_dax_huge_fault+0x1e2/0x450 [ 27.563296] __handle_mm_fault+0x6c8/0x12b0 [ 27.563920] ? do_user_addr_fault+0x1ca/0x670 [ 27.564577] ? 
lock_vma_under_rcu+0x178/0x3b0 [ 27.565235] handle_mm_fault+0xe5/0x290 [ 27.565816] do_user_addr_fault+0x208/0x670 [ 27.566446] exc_page_fault+0x6d/0x230 [ 27.567008] asm_exc_page_fault+0x26/0x30 [ 27.567610] RIP: 0033:0x7f5f543bcb4f [ 27.568152] Code: 45 f0 48 8b 45 f0 48 8b 4d f8 48 03 41 18 48 89 45 e8 48 8b 45 f0 48 3b 45 e8 0f 83 97 00 00 00 48 8b 458 [ 27.570895] RSP: 002b:00007ffc2d774460 EFLAGS: 00010287 [ 27.571672] RAX: 00007f5f52e00000 RBX: 0000000000200000 RCX: 000055760153fc00 [ 27.572731] RDX: 0000000000000000 RSI: 0000557601542a20 RDI: 000055760153fc00 [ 27.573787] RBP: 00007ffc2d774460 R08: 0000000000000000 R09: 0000000000000073 [ 27.574840] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffc2d77534b [ 27.575897] R13: 00007ffc2d774aa0 R14: 0000000000800000 R15: 0000000000800000 [ 27.576961] [ 27.577301] irq event stamp: 13394 [ 27.577810] hardirqs last enabled at (13393): [] flush_tlb_mm_range+0x1c0/0x220 [ 27.579138] hardirqs last disabled at (13394): [] _raw_spin_lock_irq+0x47/0x50 [ 27.580428] softirqs last enabled at (12530): [] xs_tcp_send_request+0x22a/0x2e0 [ 27.581762] softirqs last disabled at (12528): [] release_sock+0x1d/0xb0 [ 27.582986] ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250319013301.369822-1-apopple@nvidia.com Signed-off-by: Alistair Popple Fixes: 38607c62b34b ("fs/dax: properly refcount fs dax pages") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503102229.122fbd6c-lkp@intel.com Cc: Dan Williams Cc: Alison Schofield Cc: David Hildenbrand Cc: Balbir Singh Cc: "Darrick J. Wong" Cc: Dave Hansen Cc: Jason Gunthorpe Cc: John Hubbard Signed-off-by: Andrew Morton --- fs/dax.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index cf96f3dd4e5f..464c2badc135 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -445,6 +445,9 @@ static void dax_associate_entry(void *entry, struct address_space *mapping, unsigned long size = dax_entry_size(entry), index; struct folio *folio = dax_to_folio(entry); + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) + return; + if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return; @@ -473,6 +476,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) return; + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) + return; + dax_folio_put(folio); } @@ -1074,8 +1080,7 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf, void *old; dax_disassociate_entry(entry, mapping, false); - if (!(flags & DAX_ZERO_PAGE)) - dax_associate_entry(new_entry, mapping, vmf->vma, + dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address, shared); /* -- 2.51.0 From 3b23a44f1f196741596616082e759f7f3a400e78 Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Tue, 18 Mar 2025 11:30:28 -0700 Subject: [PATCH 07/16] mm/damon: implement a new DAMOS filter type for active pages Patch series "mm/damon: introduce DAMOS filter type for active pages". The memory reclaim algorithm categorizes pages into active and inactive lists, separately for file and anon pages. The system's performance relies heavily on the (relative and absolute) accuracy of this categorization. This patch series add a new DAMOS filter for pages' activeness, giving us visibility into the access frequency of the pages on each list. This insight can help us diagnose issues with the active-inactive balancing dynamics, and make decisions to optimize reclaim efficiency and memory utilization. 
For instance, we might decide to enable DAMON_LRU_SORT, if we find that there are pages on the active list that are infrequently accessed, or less frequently accessed than pages on the inactive list. This patch (of 2): Implement a DAMOS filter type for active pages on DAMON kernel API, and add support of it from the physical address space DAMON operations set (paddr). Link: https://lkml.kernel.org/r/20250318183029.2062917-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20250318183029.2062917-2-nphamcs@gmail.com Signed-off-by: Nhat Pham Suggested-by: SeongJae Park Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 ++ mm/damon/paddr.c | 3 +++ mm/damon/sysfs-schemes.c | 1 + 3 files changed, 6 insertions(+) diff --git a/include/linux/damon.h b/include/linux/damon.h index 3db4f77261f5..47e36e6ea203 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -334,6 +334,7 @@ struct damos_stat { /** * enum damos_filter_type - Type of memory for &struct damos_filter * @DAMOS_FILTER_TYPE_ANON: Anonymous pages. + * @DAMOS_FILTER_TYPE_ACTIVE: Active pages. * @DAMOS_FILTER_TYPE_MEMCG: Specific memcg's pages. * @DAMOS_FILTER_TYPE_YOUNG: Recently accessed pages. * @DAMOS_FILTER_TYPE_HUGEPAGE_SIZE: Page is part of a hugepage. @@ -355,6 +356,7 @@ struct damos_stat { */ enum damos_filter_type { DAMOS_FILTER_TYPE_ANON, + DAMOS_FILTER_TYPE_ACTIVE, DAMOS_FILTER_TYPE_MEMCG, DAMOS_FILTER_TYPE_YOUNG, DAMOS_FILTER_TYPE_HUGEPAGE_SIZE, diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index b08847ef9b81..1b70d3f36046 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -217,6 +217,9 @@ static bool damos_pa_filter_match(struct damos_filter *filter, case DAMOS_FILTER_TYPE_ANON: matched = folio_test_anon(folio); break; + case DAMOS_FILTER_TYPE_ACTIVE: + matched = folio_test_active(folio); + break; case DAMOS_FILTER_TYPE_MEMCG: rcu_read_lock(); memcg = folio_memcg_check(folio); diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c index 5023f2b690d6..23b562df0839 100644 --- a/mm/damon/sysfs-schemes.c +++ b/mm/damon/sysfs-schemes.c @@ -344,6 +344,7 @@ static struct damon_sysfs_scheme_filter *damon_sysfs_scheme_filter_alloc( /* Should match with enum damos_filter_type */ static const char * const damon_sysfs_scheme_filter_type_strs[] = { "anon", + "active", "memcg", "young", "hugepage_size", -- 2.51.0 From af96c610c6fdcd3a765f8a158293fe208a205ac2 Mon Sep 17 00:00:00 2001 From: Nhat Pham Date: Tue, 18 Mar 2025 11:30:29 -0700 Subject: [PATCH 08/16] docs/mm/damon/design: document active DAMOS filter type Document availability and meaning of "active" DAMOS filter type on design document. Since introduction of the type requires no additional user ABI, usage and ABI document need no update. Link: https://lkml.kernel.org/r/20250318183029.2062917-3-nphamcs@gmail.com Signed-off-by: Nhat Pham Suggested-by: SeongJae Park Reviewed-by: SeongJae Park Signed-off-by: Andrew Morton --- Documentation/mm/damon/design.rst | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst index aae3a691ee69..f12d33749329 100644 --- a/Documentation/mm/damon/design.rst +++ b/Documentation/mm/damon/design.rst @@ -656,6 +656,8 @@ Below ``type`` of filters are currently supported. - Operations layer handled, supported by only ``paddr`` operations set. - anon - Applied to pages that containing data that not stored in files. + - active + - Applied to active pages. - memcg - Applied to pages that belonging to a given cgroup. 
- young -- 2.51.0 From 735b3f7e773bd09d459537562754debd1f8e816b Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Tue, 18 Mar 2025 17:43:40 +0000 Subject: [PATCH 09/16] selftests/mm: uffd-unit-tests support for hugepages > 2M uffd-unit-tests uses a memory area with a fixed 32M size. Then it calculates the number of pages by dividing by page_size, which itself is either the base page size or the PMD huge page size depending on the test config. For the latter, we end up with nr_pages=1 for arm64 16K base pages, and nr_pages=0 for 64K base pages. This doesn't end well. So let's make the 32M size a floor and also ensure that we have at least 2 pages given the PMD size. With this change, the tests pass on arm64 64K base page size configuration. Link: https://lkml.kernel.org/r/20250318174343.243631-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: Peter Xu Acked-by: Rafael Aquini Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/uffd-unit-tests.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c index 24ea82ee2231..e8fd9011c2a3 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -26,6 +26,8 @@ #define ALIGN_UP(x, align_to) \ ((__typeof__(x))((((unsigned long)(x)) + ((align_to)-1)) & ~((align_to)-1))) +#define MAX(a, b) (((a) > (b)) ? (a) : (b)) + struct mem_type { const char *name; unsigned int mem_flag; @@ -196,7 +198,8 @@ uffd_setup_environment(uffd_test_args_t *args, uffd_test_case_t *test, else page_size = psize(); - nr_pages = UFFD_TEST_MEM_SIZE / page_size; + /* Ensure we have at least 2 pages */ + nr_pages = MAX(UFFD_TEST_MEM_SIZE, page_size * 2) / page_size; /* TODO: remove this global var.. it's so ugly */ nr_parallel = 1; -- 2.51.0 From a2c6f9c3cafac02a48db83714f4b62fee2508bc3 Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Tue, 18 Mar 2025 17:43:41 +0000 Subject: [PATCH 10/16] selftests/mm: speed up split_huge_page_test create_pagecache_thp_and_fd() was previously writing a file sized at twice the PMD size by making a per-byte write syscall. This was quite slow when the PMD size is 4M, but completely intolerable for 32M (PMD size for arm64's 16K page size), and 512M (PMD size for arm64's 64K page size). The byte pattern has a 256 byte period, so let's create a 1K buffer and fill it with exactly 4 periods. Then we can write the buffer as many times as is required to fill the file. This makes things much more tolerable. The test now passes for 16K page size. It still fails for 64K page size because MAX_PAGECACHE_ORDER is too small for 512M folio size (I think). 
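For the record, the PMD sizes quoted here follow from the page table
geometry: assuming 8-byte page table descriptors, as on arm64, one PMD
maps (PAGE_SIZE / 8) base pages:

    16K pages: (16384 / 8) * 16K = 2048 * 16K =  32M
    64K pages: (65536 / 8) * 64K = 8192 * 64K = 512M

Note also that the replacement is byte-for-byte equivalent to the old
loop: the pattern has a 256 byte period and 256 divides the 1K buffer
size, so offset i of the file still receives (i % 256).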
Link: https://lkml.kernel.org/r/20250318174343.243631-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts Acked-by: Peter Xu Acked-by: Rafael Aquini Cc: Shuah Khan Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/split_huge_page_test.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c index 719c5e2a6624..aa7400ed0e99 100644 --- a/tools/testing/selftests/mm/split_huge_page_test.c +++ b/tools/testing/selftests/mm/split_huge_page_test.c @@ -5,6 +5,7 @@ */ #define _GNU_SOURCE +#include #include #include #include @@ -398,6 +399,7 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, { size_t i; int dummy = 0; + unsigned char buf[1024]; srand(time(NULL)); @@ -405,11 +407,12 @@ int create_pagecache_thp_and_fd(const char *testfile, size_t fd_size, int *fd, if (*fd == -1) ksft_exit_fail_msg("Failed to create a file at %s\n", testfile); - for (i = 0; i < fd_size; i++) { - unsigned char byte = (unsigned char)i; + assert(fd_size % sizeof(buf) == 0); + for (i = 0; i < sizeof(buf); i++) + buf[i] = (unsigned char)i; + for (i = 0; i < fd_size; i += sizeof(buf)) + write(*fd, buf, sizeof(buf)); - write(*fd, &byte, sizeof(byte)); - } close(*fd); sync(); *fd = open("/proc/sys/vm/drop_caches", O_WRONLY); -- 2.51.0 From e452872b40e3f1fb92adf0d573a0a6a7c9f6ce22 Mon Sep 17 00:00:00 2001 From: Hao Jia Date: Tue, 18 Mar 2025 15:58:32 +0800 Subject: [PATCH 11/16] mm: vmscan: split proactive reclaim statistics from direct reclaim statistics MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Patch series "Adding Proactive Memory Reclaim Statistics". These two patches are related to proactive memory reclaim. Patch 1 Split proactive reclaim statistics from direct reclaim counters and introduces new counters: pgsteal_proactive, pgdemote_proactive, and pgscan_proactive. Patch 2 Adds pswpin and pswpout items to the cgroup-v2 documentation. This patch (of 2): In proactive memory reclaim scenarios, it is necessary to accurately track proactive reclaim statistics to dynamically adjust the frequency and amount of memory being reclaimed proactively. Currently, proactive reclaim is included in direct reclaim statistics, which can make these direct reclaim statistics misleading. Therefore, separate proactive reclaim memory from the direct reclaim counters by introducing new counters: pgsteal_proactive, pgdemote_proactive, and pgscan_proactive, to avoid confusion with direct reclaim. Link: https://lkml.kernel.org/r/20250318075833.90615-1-jiahao.kernel@gmail.com Link: https://lkml.kernel.org/r/20250318075833.90615-2-jiahao.kernel@gmail.com Signed-off-by: Hao Jia Acked-by: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Cc: Tejun Heo Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 9 +++++++ include/linux/mmzone.h | 1 + include/linux/vm_event_item.h | 2 ++ mm/memcontrol.c | 7 +++++ mm/vmscan.c | 35 ++++++++++++++----------- mm/vmstat.c | 3 +++ 6 files changed, 42 insertions(+), 15 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f8a894a16307..d7624e500610 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1576,6 +1576,9 @@ The following nested keys are defined. 
pgscan_khugepaged (npn) Amount of scanned pages by khugepaged (in an inactive LRU list) + pgscan_proactive (npn) + Amount of scanned pages proactively (in an inactive LRU list) + pgsteal_kswapd (npn) Amount of reclaimed pages by kswapd @@ -1585,6 +1588,9 @@ The following nested keys are defined. pgsteal_khugepaged (npn) Amount of reclaimed pages by khugepaged + pgsteal_proactive (npn) + Amount of reclaimed pages proactively + pgfault (npn) Total number of page faults incurred @@ -1662,6 +1668,9 @@ The following nested keys are defined. pgdemote_khugepaged Number of pages demoted by khugepaged. + pgdemote_proactive + Number of pages demoted by proactively. + hugetlb Amount of memory used by hugetlb pages. This metric only shows up if hugetlb usage is accounted for in memory.current (i.e. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index a9db0fbd2b94..f7fe1126dc75 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -221,6 +221,7 @@ enum node_stat_item { PGDEMOTE_KSWAPD, PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, + PGDEMOTE_PROACTIVE, #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index f70d0958095c..f11b6fa9c5b3 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -41,9 +41,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGSTEAL_KSWAPD, PGSTEAL_DIRECT, PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, PGSCAN_DIRECT_THROTTLE, PGSCAN_ANON, PGSCAN_FILE, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 57cf5a6c279c..40c07b8699ae 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -315,6 +315,7 @@ static const unsigned int memcg_node_stat_items[] = { PGDEMOTE_KSWAPD, PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, + PGDEMOTE_PROACTIVE, #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif @@ -431,9 +432,11 @@ static const unsigned int memcg_vm_event_stat[] = { PGSCAN_KSWAPD, PGSCAN_DIRECT, PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, PGSTEAL_KSWAPD, PGSTEAL_DIRECT, PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, PGFAULT, PGMAJFAULT, PGREFILL, @@ -1394,6 +1397,7 @@ static const struct memory_stat memory_stats[] = { { "pgdemote_kswapd", PGDEMOTE_KSWAPD }, { "pgdemote_direct", PGDEMOTE_DIRECT }, { "pgdemote_khugepaged", PGDEMOTE_KHUGEPAGED }, + { "pgdemote_proactive", PGDEMOTE_PROACTIVE }, #ifdef CONFIG_NUMA_BALANCING { "pgpromote_success", PGPROMOTE_SUCCESS }, #endif @@ -1436,6 +1440,7 @@ static int memcg_page_state_output_unit(int item) case PGDEMOTE_KSWAPD: case PGDEMOTE_DIRECT: case PGDEMOTE_KHUGEPAGED: + case PGDEMOTE_PROACTIVE: #ifdef CONFIG_NUMA_BALANCING case PGPROMOTE_SUCCESS: #endif @@ -1509,10 +1514,12 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) seq_buf_printf(s, "pgscan %lu\n", memcg_events(memcg, PGSCAN_KSWAPD) + memcg_events(memcg, PGSCAN_DIRECT) + + memcg_events(memcg, PGSCAN_PROACTIVE) + memcg_events(memcg, PGSCAN_KHUGEPAGED)); seq_buf_printf(s, "pgsteal %lu\n", memcg_events(memcg, PGSTEAL_KSWAPD) + memcg_events(memcg, PGSTEAL_DIRECT) + + memcg_events(memcg, PGSTEAL_PROACTIVE) + memcg_events(memcg, PGSTEAL_KHUGEPAGED)); for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { diff --git a/mm/vmscan.c b/mm/vmscan.c index b5c7dfc2b189..98e6ac82e428 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -456,21 +456,26 @@ void drop_slab(void) } while ((freed >> shift++) > 1); } -static int reclaimer_offset(void) +#define CHECK_RECLAIMER_OFFSET(type) \ + do { \ + 
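+		/*							\
+		 * The PGSTEAL_*, PGSCAN_* and PGDEMOTE_* counters	\
+		 * must keep the same relative ordering so that a	\
+		 * single reclaimer offset can index all three		\
+		 * families; these checks enforce that at compile	\
+		 * time.						\
+		 */							\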
BUILD_BUG_ON(PGSTEAL_##type - PGSTEAL_KSWAPD != \ + PGDEMOTE_##type - PGDEMOTE_KSWAPD); \ + BUILD_BUG_ON(PGSTEAL_##type - PGSTEAL_KSWAPD != \ + PGSCAN_##type - PGSCAN_KSWAPD); \ + } while (0) + +static int reclaimer_offset(struct scan_control *sc) { - BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD != - PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD); - BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD != - PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD); - BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD != - PGSCAN_DIRECT - PGSCAN_KSWAPD); - BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD != - PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD); + CHECK_RECLAIMER_OFFSET(DIRECT); + CHECK_RECLAIMER_OFFSET(KHUGEPAGED); + CHECK_RECLAIMER_OFFSET(PROACTIVE); if (current_is_kswapd()) return 0; if (current_is_khugepaged()) return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD; + if (sc->proactive) + return PGSTEAL_PROACTIVE - PGSTEAL_KSWAPD; return PGSTEAL_DIRECT - PGSTEAL_KSWAPD; } @@ -2008,7 +2013,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, &nr_scanned, sc, lru); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); - item = PGSCAN_KSWAPD + reclaimer_offset(); + item = PGSCAN_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_scanned); __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); @@ -2024,10 +2029,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, spin_lock_irq(&lruvec->lru_lock); move_folios_to_lru(lruvec, &folio_list); - __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(), + __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), stat.nr_demoted); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); - item = PGSTEAL_KSWAPD + reclaimer_offset(); + item = PGSTEAL_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, nr_reclaimed); __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); @@ -4571,7 +4576,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc, break; } - item = PGSCAN_KSWAPD + reclaimer_offset(); + item = PGSCAN_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) { __count_vm_events(item, isolated); __count_vm_events(PGREFILL, sorted); @@ -4721,10 +4726,10 @@ retry: reset_batch_size(walk); } - __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(), + __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc), stat.nr_demoted); - item = PGSTEAL_KSWAPD + reclaimer_offset(); + item = PGSTEAL_KSWAPD + reclaimer_offset(sc); if (!cgroup_reclaim(sc)) __count_vm_events(item, reclaimed); __count_memcg_events(memcg, item, reclaimed); diff --git a/mm/vmstat.c b/mm/vmstat.c index ae0e4259ac23..ab5c840941f3 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1274,6 +1274,7 @@ const char * const vmstat_text[] = { "pgdemote_kswapd", "pgdemote_direct", "pgdemote_khugepaged", + "pgdemote_proactive", #ifdef CONFIG_HUGETLB_PAGE "nr_hugetlb", #endif @@ -1309,9 +1310,11 @@ const char * const vmstat_text[] = { "pgsteal_kswapd", "pgsteal_direct", "pgsteal_khugepaged", + "pgsteal_proactive", "pgscan_kswapd", "pgscan_direct", "pgscan_khugepaged", + "pgscan_proactive", "pgscan_direct_throttle", "pgscan_anon", "pgscan_file", -- 2.51.0 From 4c8bc7c4e3fbbdf07a879429c34a78f31d9894d4 Mon Sep 17 00:00:00 2001 From: Hao Jia Date: Tue, 18 Mar 2025 15:58:33 +0800 Subject: [PATCH 12/16] cgroup: docs: add pswpin and pswpout items in cgroup v2 doc MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit The commit 15ff4d409e1a 
("mm/memcontrol: add per-memcg pgpgin/pswpin counter") introduced the pswpin and pswpout items in the memory.stat of cgroup v2. Therefore, update them accordingly in the cgroup-v2 documentation. Link: https://lkml.kernel.org/r/20250318075833.90615-3-jiahao.kernel@gmail.com Fixes: 15ff4d409e1a ("mm/memcontrol: add per-memcg pgpgin/pswpin counter") Signed-off-by: Hao Jia Acked-by: Tejun Heo Acked-by: Johannes Weiner Cc: Jonathan Corbet Cc: Michal Hocko Cc: Michal Koutný Cc: Muchun Song Cc: Roman Gushchin Cc: Shakeel Butt Signed-off-by: Andrew Morton --- Documentation/admin-guide/cgroup-v2.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index d7624e500610..ba11a4d7321b 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1561,6 +1561,12 @@ The following nested keys are defined. workingset_nodereclaim Number of times a shadow node has been reclaimed + pswpin (npn) + Number of pages swapped into memory + + pswpout (npn) + Number of pages swapped out of memory + pgscan (npn) Amount of scanned pages (in an inactive LRU list) -- 2.51.0 From 5f5ee52d4f58605330b09851273d6e56aaadd29e Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Tue, 18 Mar 2025 16:39:38 +0800 Subject: [PATCH 13/16] mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper Patch series "mm/vmscan: don't try to reclaim hwpoison folio". Fix a bug during memory reclaim if folio is hwpoisoned. This patch (of 2): Introduce helper folio_contain_hwpoisoned_page() to check if the entire folio is hwpoisoned or it contains hwpoisoned pages. Link: https://lkml.kernel.org/r/20250318083939.987651-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250318083939.987651-2-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu Acked-by: Miaohe Lin Cc: David Hildenbrand Cc: Kefeng Wang Cc: Nanyong Sun Cc: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton --- include/linux/page-flags.h | 6 ++++++ mm/memory_hotplug.c | 3 +-- mm/shmem.c | 3 +-- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index f26ce54c7aa4..68f4b188fc6b 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -1098,6 +1098,12 @@ static inline bool is_page_hwpoison(const struct page *page) return folio_test_hugetlb(folio) && PageHWPoison(&folio->page); } +static inline bool folio_contain_hwpoisoned_page(struct folio *folio) +{ + return folio_test_hwpoison(folio) || + (folio_test_large(folio) && folio_test_has_hwpoisoned(folio)); +} + bool is_free_buddy_page(const struct page *page); PAGEFLAG(Isolated, isolated, PF_ANY); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 16cf9e17077e..75401866fb76 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1828,8 +1828,7 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) if (unlikely(page_folio(page) != folio)) goto put_folio; - if (folio_test_hwpoison(folio) || - (folio_test_large(folio) && folio_test_has_hwpoisoned(folio))) { + if (folio_contain_hwpoisoned_page(folio)) { if (WARN_ON(folio_test_lru(folio))) folio_isolate_lru(folio); if (folio_mapped(folio)) { diff --git a/mm/shmem.c b/mm/shmem.c index 7b738d8d6581..17f27d92c664 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3290,8 +3290,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, if (ret) return ret; - if (folio_test_hwpoison(folio) || - (folio_test_large(folio) && 
folio_test_has_hwpoisoned(folio))) { + if (folio_contain_hwpoisoned_page(folio)) { folio_unlock(folio); folio_put(folio); return -EIO; -- 2.51.0 From 1b0449544c6482179ac84530b61fc192a6527bfd Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Tue, 18 Mar 2025 16:39:39 +0800 Subject: [PATCH 14/16] mm/vmscan: don't try to reclaim hwpoison folio Syzkaller reports a bug as follows: Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000 Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e memcg:ffff0000dd6d9000 anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff) raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9 raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000 page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio)) ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:184! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3 Hardware name: linux,dummy-virt (DT) pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : add_to_swap+0xbc/0x158 lr : add_to_swap+0xbc/0x158 sp : ffff800087f37340 x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780 x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0 x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4 x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000 x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000 x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001 x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000 Call trace: add_to_swap+0xbc/0x158 shrink_folio_list+0x12ac/0x2648 shrink_inactive_list+0x318/0x948 shrink_lruvec+0x450/0x720 shrink_node_memcgs+0x280/0x4a8 shrink_node+0x128/0x978 balance_pgdat+0x4f0/0xb20 kswapd+0x228/0x438 kthread+0x214/0x230 ret_from_fork+0x10/0x20 I can reproduce this issue with the following steps: 1) When a dirty swapcache page is isolated by reclaim process and the page isn't locked, inject memory failure for the page. me_swapcache_dirty() clears uptodate flag and tries to delete from lru, but fails. Reclaim process will put the hwpoisoned page back to lru. 2) The process that maps the hwpoisoned page exits, the page is deleted the page will never be freed and will be in the lru forever. 3) If we trigger a reclaim again and tries to reclaim the page, add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is cleared. To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap it in shrink_folio_list(), otherwise the folio will fail to be unmaped by hwpoison_user_mappings() since the folio isn't in lru list. 
Link: https://lkml.kernel.org/r/20250318083939.987651-3-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu Acked-by: Miaohe Lin Cc: David Hildenbrand Cc: Kefeng Wang Cc: Nanyong Sun Cc: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton --- mm/vmscan.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 98e6ac82e428..2b2ab386cab5 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1127,6 +1127,13 @@ retry: if (!folio_trylock(folio)) goto keep; + if (folio_contain_hwpoisoned_page(folio)) { + unmap_poisoned_folio(folio, folio_pfn(folio), false); + folio_unlock(folio); + folio_put(folio); + continue; + } + VM_BUG_ON_FOLIO(folio_test_active(folio), folio); nr_pages = folio_nr_pages(folio); -- 2.51.0 From d893aca973c315ee985e1f8220fcd239c0ab7d19 Mon Sep 17 00:00:00 2001 From: "Mike Rapoport (Microsoft)" Date: Wed, 19 Mar 2025 14:23:37 +0200 Subject: [PATCH 15/16] x86/mm: restore early initialization of high_memory for 32-bits Kernel test robot reports the following crash on 32-bit system with HIGHMEM and DEBUG_VIRTUAL: [ 0.056128][ T0] kernel BUG at arch/x86/mm/physaddr.c:77! PANIC: early exception 0x06 IP 60:c116539d error 0 cr2 0x0 [ 0.056916][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc4-00010-ga4dbe5c71817 #1 [ 0.057570][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.058299][ T0] EIP: __phys_addr (arch/x86/mm/physaddr.c:77) [ 0.058633][ T0] Code: 00 74 33 89 f0 e8 d3 8b 2e 00 89 c3 0f b6 d0 b8 58 bb 4b c5 31 c9 6a 00 e8 70 f5 15 00 83 c4 04 84 db 74 25 ff 05 78 de 5d c5 <0f> 0b b8 c8 91 ea c4 e8 e7 6e ea ff b8 58 bb 4b c5 31 d2 31 c9 6a All code [ 0.060017][ T0] EAX: 00000000 EBX: c61f7001 ECX: 00000000 EDX: 00000000 [ 0.060519][ T0] ESI: c61f7000 EDI: 061f7000 EBP: c4e31f04 ESP: c61f7000 [ 0.061016][ T0] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: cff4 EFLAGS: 00210002 [ 0.061560][ T0] CR0: 80050033 CR2: 00000000 CR3: 059fc000 CR4: 00000090 [ 0.062060][ T0] Call Trace: [ 0.062288][ T0] ? show_regs (arch/x86/kernel/dumpstack.c:478) [ 0.062588][ T0] ? early_fixup_exception (arch/x86/include/asm/nospec-branch.h:595) [ 0.062968][ T0] ? early_idt_handler_common (arch/x86/kernel/head_32.S:352) [ 0.063360][ T0] ? __phys_addr (arch/x86/mm/physaddr.c:77) [ 0.063677][ T0] ? one_page_table_init (arch/x86/mm/init_32.c:100) [ 0.064037][ T0] ? page_table_range_init (arch/x86/mm/init_32.c:227) [ 0.064411][ T0] ? permanent_kmaps_init (include/linux/pgtable.h:191 include/linux/pgtable.h:196 arch/x86/mm/init_32.c:395) [ 0.064814][ T0] ? paging_init (arch/x86/mm/init_32.c:677) [ 0.065118][ T0] ? native_pagetable_init (arch/x86/mm/init_32.c:481) [ 0.065503][ T0] ? setup_arch (arch/x86/kernel/setup.c:1131) [ 0.065819][ T0] ? start_kernel (include/linux/jump_label.h:267 init/main.c:920) [ 0.066143][ T0] ? i386_start_kernel (arch/x86/kernel/head32.c:79) [ 0.066501][ T0] ? startup_32_smp (arch/x86/kernel/head_32.S:292) The crash happens because commit e120d1bc12da ("arch, mm: set high_memory in free_area_init()") moved initialization of high_memory after __vmalloc_start_set and with high_memory still set to 0 any address passes is_vmalloc_addr() check. Restore early initialization of high_memory on 32-bit systems in initmem_init(). 
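To spell out the failure mode (a simplified reading; the macro below is
quoted roughly, from arch/x86/include/asm/pgtable_32_areas.h):

	#define VMALLOC_START	((unsigned long)high_memory + VMALLOC_OFFSET)

With high_memory still 0 at that point, VMALLOC_START collapses to
VMALLOC_OFFSET, so practically every direct-map address satisfies
is_vmalloc_addr() and the DEBUG_VIRTUAL assertion in __phys_addr()
fires, as seen in the trace above.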
Link: https://lkml.kernel.org/r/20250319122337.1538924-1-rppt@kernel.org Fixes: e120d1bc12da ("arch, mm: set high_memory in free_area_init()") Signed-off-by: Mike Rapoport (Microsoft) Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503191442.112e954f-lkp@intel.com Cc: Andy Lutomirski Cc: Borislav Betkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Thomas Gleinxer Signed-off-by: Andrew Morton --- arch/x86/mm/init_32.c | 3 +++ arch/x86/mm/numa_32.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index 95b2758b4e4d..f69d2436d780 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -626,6 +626,9 @@ void __init initmem_init(void) highstart_pfn = max_low_pfn; printk(KERN_NOTICE "%ldMB HIGHMEM available.\n", pages_to_mb(highend_pfn - highstart_pfn)); + high_memory = (void *) __va(highstart_pfn * PAGE_SIZE - 1) + 1; +#else + high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1; #endif memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c index 442ef3facff0..65fda406e6f2 100644 --- a/arch/x86/mm/numa_32.c +++ b/arch/x86/mm/numa_32.c @@ -41,6 +41,9 @@ void __init initmem_init(void) highstart_pfn = max_low_pfn; printk(KERN_NOTICE "%ldMB HIGHMEM available.\n", pages_to_mb(highend_pfn - highstart_pfn)); + high_memory = (void *) __va(highstart_pfn * PAGE_SIZE - 1) + 1; +#else + high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1; #endif printk(KERN_NOTICE "%ldMB LOWMEM available.\n", pages_to_mb(max_low_pfn)); -- 2.51.0 From 0a1e082b64ccce165e7307a7b49d22b2504f9d1f Mon Sep 17 00:00:00 2001 From: Liu Ye Date: Wed, 19 Mar 2025 17:17:26 +0800 Subject: [PATCH 16/16] mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() The `movable` variable is always used when `CONFIG_TRANSPARENT_HUGEPAGE` is enabled, so the `__maybe_unused` attribute is not necessary. This patch removes it and keeps the variable declaration within the `#ifdef` block for better clarity. Link: https://lkml.kernel.org/r/20250319091726.401158-1-liuyerd@163.com Signed-off-by: Liu Ye Signed-off-by: Andrew Morton --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a6d060eea638..0c01998cb3a0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -509,9 +509,9 @@ out: static inline unsigned int order_to_pindex(int migratetype, int order) { - bool __maybe_unused movable; #ifdef CONFIG_TRANSPARENT_HUGEPAGE + bool movable; if (order > PAGE_ALLOC_COSTLY_ORDER) { VM_BUG_ON(order != HPAGE_PMD_ORDER); -- 2.51.0