After swapping out, we perform a swap-in operation. If we first read and
then write, we encounter a major fault in do_swap_page for reading, along
with additional minor faults in do_wp_page for writing. However, the
latter appears to be unnecessary and inefficient. Instead, we can
directly reuse in do_swap_page and completely eliminate the need for
do_wp_page.
This patch achieves that optimization specifically for exclusive folios.
The following microbenchmark demonstrates the significant reduction in
minor faults.
w/o patch,
/ # ~/a.out
minor faults:512 major faults:512
w/ patch,
/ # ~/a.out
minor faults:0 major faults:512
Minor faults decrease to 0!
Link: https://lkml.kernel.org/r/20240602004502.26895-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jonathan Cameron [Wed, 5 Jun 2024 08:20:49 +0000 (11:20 +0300)]
mm/memory_hotplug: drop memblock_phys_free() call in try_remove_memory()
The call for memblock_phys_free() in try_remove_memory() does not balance
any call to memblock_alloc() (or memblock_reserve() for that matter).
There are no memblock_reserve() calls in mm/memory_hotplug.c, no memblock
allocations possible after mm_core_init(), and even if memblock_add_node()
called from add_memory_resource() would need to allocate memory, that
memory would ba allocated from slab.
The patch f9126ab9241f ("memory-hotplug: fix wrong edge when hot add a new
node") that introduced that call to memblock_free() does not provide
adequate description why that was required and tinkering with memblock in
the context of memory hotplug on x86 seems bogus because x86 never kept
memblock after boot anyway.
Drop memblock_phys_free() call in try_remove_memory().
[rppt@kernel.org: rewrite the commit message] Link: https://lkml.kernel.org/r/20240605082049.973242-1-rppt@kernel.org Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 11 Jun 2024 10:11:10 +0000 (18:11 +0800)]
mm: shmem: add mTHP counters for anonymous shmem
Add mTHP counters for anonymous shmem.
[baolin.wang@linux.alibaba.com: update Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/d86e2e7f-4141-432b-b2ba-c6691f36ef0b@linux.alibaba.com Link: https://lkml.kernel.org/r/4fd9e467d49ae4a747e428bcd821c7d13125ae67.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 11 Jun 2024 10:11:09 +0000 (18:11 +0800)]
mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
Although the top-level hugepage allocation can be turned off, anonymous
shmem can still use mTHP by configuring the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled'.
Therefore, add alignment for mTHP size to provide a suitable alignment
address in shmem_get_unmapped_area().
Link: https://lkml.kernel.org/r/0c549b57cf7db07503af692d8546ecfad0fcce52.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Lance Yang <ioworker0@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 11 Jun 2024 10:11:08 +0000 (18:11 +0800)]
mm: shmem: add mTHP support for anonymous shmem
Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
can allow THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, the anonymous shmem will ignore the anonymous mTHP rule
configured through the sysfs interface, and can only use the PMD-mapped
THP, that is not reasonable. Users expect to apply the mTHP rule for all
anonymous pages, including the anonymous shmem, in order to enjoy the
benefits of mTHP. For example, lower latency than PMD-mapped THP, smaller
memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture to
reduce TLB miss etc. In addition, the mTHP interfaces can be extended to
support all shmem/tmpfs scenarios in the future, especially for the shmem
mmap() case.
The primary strategy is similar to supporting anonymous mTHP. Introduce a
new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
additional "inherit" option and dropping the testing options 'force' and
'deny'. By default all sizes will be set to "never" except PMD size,
which is set to "inherit". This ensures backward compatibility with the
anonymous shmem enabled of the top level, meanwhile also allows
independent control of anonymous shmem enabled for each mTHP.
Link: https://lkml.kernel.org/r/65796c1e72e51e15f3410195b5c2d5b6c160d411.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 11 Jun 2024 10:11:07 +0000 (18:11 +0800)]
mm: shmem: add multi-size THP sysfs interface for anonymous shmem
To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
directory for each mTHP to control whether shmem is enabled for that mTHP,
with a value similar to the top level 'shmem_enabled', which can be set
to: "always", "inherit (to inherit the top level setting)", "within_size",
"advise", "never". An 'inherit' option is added to ensure compatibility
with these global settings, and the options 'force' and 'deny' are
dropped, which are rather testing artifacts from the old ages.
By default, PMD-sized hugepages have enabled="inherit" and all other
hugepage sizes have enabled="never" for
'/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
In addition, if top level value is 'force', then only PMD-sized hugepages
have enabled="inherit", otherwise configuration will be failed and vice
versa. That means now we will avoid using non-PMD sized THP to override
the global huge allocation.
[baolin.wang@linux.alibaba.com: fix transhuge.rst indentation] Link: https://lkml.kernel.org/r/b189d815-998b-4dfd-ba89-218ff51313f8@linux.alibaba.com
[akpm@linux-foundation.org: reflow transhuge.rst addition to 80 cols]
[baolin.wang@linux.alibaba.com: move huge_shmem_orders_lock under CONFIG_SYSFS] Link: https://lkml.kernel.org/r/eb34da66-7f12-44f3-a39e-2bcc90c33354@linux.alibaba.com
[akpm@linux-foundation.org: huge_memory.c needs mm_types.h] Link: https://lkml.kernel.org/r/ffddfa8b3cb4266ff963099ab78cfd7184c57ac7.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 11 Jun 2024 10:11:06 +0000 (18:11 +0800)]
mm: shmem: add THP validation for PMD-mapped THP related statistics
In order to extend support for mTHP, add THP validation for PMD-mapped THP
related statistics to avoid statistical confusion.
Link: https://lkml.kernel.org/r/c4b04cbd51e6951cc2436a87be8eaa4a1516faec.1718090413.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 11 Jun 2024 10:11:05 +0000 (18:11 +0800)]
mm: memory: extend finish_fault() to support large folio
Patch series "add mTHP support for anonymous shmem", v5.
Anonymous pages have already been supported for multi-size (mTHP)
allocation through commit 19eaf44954df, that can allow THP to be
configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
However, the anonymous shmem will ignore the anonymous mTHP rule
configured through the sysfs interface, and can only use the PMD-mapped
THP, that is not reasonable. Many implement anonymous page sharing
through mmap(MAP_SHARED | MAP_ANONYMOUS), especially in database usage
scenarios, therefore, users expect to apply an unified mTHP strategy for
anonymous pages, also including the anonymous shared pages, in order to
enjoy the benefits of mTHP. For example, lower latency than PMD-mapped
THP, smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM
architecture to reduce TLB miss etc.
As discussed in the bi-weekly MM meeting[1], the mTHP controls should
control all of shmem, not only anonymous shmem, but support will be added
iteratively. Therefore, this patch set starts with support for anonymous
shmem.
The primary strategy is similar to supporting anonymous mTHP. Introduce a
new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which can have almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
additional "inherit" option and dropping the testing options 'force' and
'deny'. By default all sizes will be set to "never" except PMD size,
which is set to "inherit". This ensures backward compatibility with the
anonymous shmem enabled of the top level, meanwhile also allows
independent control of anonymous shmem enabled for each mTHP.
Use the page fault latency tool to measure the performance of 1G anonymous shmem
with 32 threads on my machine environment with: ARM64 Architecture, 32 cores,
125G memory:
base: mm-unstable
user-time sys_time faults_per_sec_per_cpu faults_per_sec
0.04s 3.10s 83516.416 2669684.890
From the data above, it is observed that the patchset has a minimal impact
when mTHP is not enabled (some fluctuations observed during testing).
When enabling 64K mTHP, there is a significant improvement of the page
fault latency.
Add large folio mapping establishment support for finish_fault() as a
preparation, to support multi-size THP allocation of anonymous shmem pages
in the following patches.
Keep the same behavior (per-page fault) for non-anon shmem to avoid
inflating the RSS unintentionally, and we can discuss what size of mapping
to build when extending mTHP to control non-anon shmem in the future.
Matthew Wilcox (Oracle) [Fri, 31 May 2024 03:29:25 +0000 (04:29 +0100)]
mm/memory-failure: stop setting the folio error flag
Nobody checks the error flag any more, so setting it accomplishes nothing.
Remove the obsolete parts of this comment; it hasn't been true since
errseq_t was used to track writeback errors in 2017.
Link: https://lkml.kernel.org/r/20240531032938.2712870-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Huang Ying [Fri, 31 May 2024 08:12:30 +0000 (16:12 +0800)]
mm,swap: simplify VMA based swap readahead window calculation
Replace PFNs with addresses in readahead window calculation. This
simplified the logic and reduce the code line number.
No functionality change is expected.
Link: https://lkml.kernel.org/r/20240531081230.310128-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Huang Ying [Fri, 31 May 2024 08:12:29 +0000 (16:12 +0800)]
mm,swap: remove struct vma_swap_readahead
When VMA based swap readahead is introduced in commit ec560175c0b6 ("mm,
swap: VMA based swap readahead"), "struct vma_swap_readahead" is defined
to describe the readahead window. Because we wanted to save the PTE
entries in the struct at that time. But after commit 4f8fcf4ced0b
("mm/swap: swap_vma_readahead() do the pte_offset_map()"), we no longer
save PTE entries in the struct. The size of the struct becomes so small,
that it's better to use the fields of the struct directly. This can
simplify the code to improve the code readability. The line number of
source code reduces too.
No functionality change is expected in this patch.
Link: https://lkml.kernel.org/r/20240531081230.310128-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Huang Ying [Fri, 31 May 2024 08:12:28 +0000 (16:12 +0800)]
mm,swap: fix a theoretical underflow in readahead window calculation
Patch series "mm,swap: cleanup VMA based swap readahead window
calculation".
When VMA based swap readahead is introduced in commit ec560175c0b6 ("mm,
swap: VMA based swap readahead"), "struct vma_swap_readahead" is defined
to describe the readahead window. Because we wanted to save the PTE
entries in the struct at that time. But after commit 4f8fcf4ced0b
("mm/swap: swap_vma_readahead() do the pte_offset_map()"), we no longer
save PTE entries in the struct. The size of the struct becomes so small,
that it's better to use the fields of the struct directly. This can
simplify the code to improve the code readability. The line number of
source code reduces too.
A theoretical underflow issue and some related code cleanup is done in the
series too.
This patch (of 3):
In swap readahead window calculation, if the fault PFN is smaller than the
readahead window size, underflow may occurs. This is only possible in
theory, because the start of the virtual address space will not be used
for anonymous pages in practice. Even if underflow occurs, there will be
no functional bugs. In the worst cases, some swap entries may be swapped
in incorrectly and some pages may be allocate on the wrong nodes.
Anyway, we still needs to fix the issue via some underflow checking.
Link: https://lkml.kernel.org/r/20240531081230.310128-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20240531081230.310128-2-ying.huang@intel.com Fixes: ec560175c0b6 ("mm, swap: VMA based swap readahead") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Fri, 31 May 2024 12:41:44 +0000 (18:11 +0530)]
mm: sparse: consistently use _nr
Consistently name the return variable with an _nr suffix, whenever calling
pfn_to_section_nr(), to avoid confusion with a (struct mem_section *).
Link: https://lkml.kernel.org/r/20240531124144.240399-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 27 May 2024 04:45:23 +0000 (06:45 +0200)]
arch/x86: do not explicitly clear Reserved flag in free_pagetable
In free_pagetable() we use the non-atomic version for clearing the
PageReserved bit from the page. free_pagetable() will either call
free_reserved_page() or put_page_bootmem(), which will eventually end up
calling free_reserved_page(), and in there we already clear the
PageReserved flag.
Link: https://lkml.kernel.org/r/20240527044523.29207-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 27 May 2024 15:48:55 +0000 (11:48 -0400)]
mm: drop leftover comment references to pxx_huge()
pxx_huge() has been removed in recent commit 9636f055dae1 ("mm/treewide:
remove pXd_huge()"), however there are still three comments referencing
the API that got overlooked. Remove them.
Link: https://lkml.kernel.org/r/20240527154855.528816-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brian Johannesmeyer [Tue, 28 May 2024 10:48:07 +0000 (12:48 +0200)]
kmsan: introduce test_unpoison_memory()
Add a regression test to ensure that kmsan_unpoison_memory() works the
same as an unpoisoning operation added by the instrumentation.
The test has two subtests: one that checks the instrumentation, and one
that checks kmsan_unpoison_memory(). Each subtest initializes the first
byte of a 4-byte buffer, then checks that the other 3 bytes are
uninitialized.
Uros Bizjak [Tue, 28 May 2024 14:43:14 +0000 (16:43 +0200)]
mm/vmalloc: use __this_cpu_try_cmpxchg() in preload_this_cpu_lock()
Use __this_cpu_try_cmpxchg() instead of __this_cpu_cmpxchg (*ptr, old,
new) == old in preload_this_cpu_lock(). x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.
Shakeel Butt [Tue, 28 May 2024 16:40:50 +0000 (09:40 -0700)]
memcg: rearrange fields of mem_cgroup_per_node
Kernel test robot reported [1] performance regression for will-it-scale
test suite's page_fault2 test case for the commit 70a64b7919cb ("memcg:
dynamically allocate lruvec_stats"). After inspection it seems like the
commit has unintentionally introduced false cache sharing.
After the commit the fields of mem_cgroup_per_node which get read on the
performance critical path share the cacheline with the fields which get
updated often on LRU page allocations or deallocations. This has caused
contention on that cacheline and the workloads which manipulates a lot of
LRU pages are regressed as reported by the test report.
The solution is to rearrange the fields of mem_cgroup_per_node such that
the false sharing is eliminated. Let's move all the read only pointers at
the start of the struct, followed by memcg-v1 only fields and at the end
fields which get updated often.
Experiment setup: Ran fallocate1, fallocate2, page_fault1, page_fault2 and
page_fault3 from the will-it-scale test suite inside a three level memcg
with /tmp mounted as tmpfs on two different machines, one a single numa
node and the other one, two node machine.
Sidhartha Kumar [Thu, 30 May 2024 17:14:27 +0000 (10:14 -0700)]
mm/hugetlb: mm/memory_hotplug: use a folio in scan_movable_pages()
By using a folio in scan_movable_pages() we convert the last user of the
page-based hugetlb information macro functions to the folio version.
After this conversion, we can safely remove the page-based definitions
from include/linux/hugetlb.h.
Link: https://lkml.kernel.org/r/20240530171427.242018-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chuanhua Han [Wed, 29 May 2024 08:28:24 +0000 (20:28 +1200)]
mm: swap: entirely map large folios found in swapcache
When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages page
faults. This patch opts to map the entire large folio at once to minimize
page faults. Additionally, redundant checks and early exits for ARM64 MTE
restoring are removed.
Link: https://lkml.kernel.org/r/20240529082824.150954-7-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chuanhua Han [Wed, 29 May 2024 08:28:23 +0000 (20:28 +1200)]
mm: swap: make should_try_to_free_swap() support large-folio
The function should_try_to_free_swap() operates under the assumption that
swap-in always occurs at the normal page granularity, i.e.,
folio_nr_pages() = 1. However, in reality, for large folios,
add_to_swap_cache() will invoke folio_ref_add(folio, nr). To accommodate
large folio swap-in, this patch eliminates this assumption.
Link: https://lkml.kernel.org/r/20240529082824.150954-6-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Wed, 29 May 2024 08:28:22 +0000 (20:28 +1200)]
mm: introduce arch_do_swap_page_nr() which allows restore metadata for nr pages
Should do_swap_page() have the capability to directly map a large folio,
metadata restoration becomes necessary for a specified number of pages
denoted as nr. It's important to highlight that metadata restoration is
solely required by the SPARC platform, which, however, does not enable
THP_SWAP. Consequently, in the present kernel configuration, there exists
no practical scenario where users necessitate the restoration of nr
metadata. Platforms implementing THP_SWAP might invoke this function with
nr values exceeding 1, subsequent to do_swap_page() successfully mapping
an entire large folio. Nonetheless, their arch_do_swap_page_nr()
functions remain empty.
Link: https://lkml.kernel.org/r/20240529082824.150954-5-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Wed, 29 May 2024 08:28:21 +0000 (20:28 +1200)]
mm: introduce pte_move_swp_offset() helper which can move offset bidirectionally
There could arise a necessity to obtain the first pte_t from a swap pte_t
located in the middle. For instance, this may occur within the context of
do_swap_page(), where a page fault can potentially occur in any PTE of a
large folio. To address this, the following patch introduces
pte_move_swp_offset(), a function capable of bidirectional movement by a
specified delta argument. Consequently, pte_next_swp_offset() will
directly invoke it with delta = 1.
Link: https://lkml.kernel.org/r/20240529082824.150954-4-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Wed, 29 May 2024 08:28:20 +0000 (20:28 +1200)]
mm: remove the implementation of swap_free() and always use swap_free_nr()
To streamline maintenance efforts, we propose removing the implementation
of swap_free(). Instead, we can simply invoke swap_free_nr() with nr set
to 1. swap_free_nr() is designed with a bitmap consisting of only one
long, resulting in overhead that can be ignored for cases where nr equals
1.
A prime candidate for leveraging swap_free_nr() lies within
kernel/power/swap.c. Implementing this change facilitates the adoption of
batch processing for hibernation.
Link: https://lkml.kernel.org/r/20240529082824.150954-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <len.brown@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Gao Xiang <xiang@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chuanhua Han [Wed, 29 May 2024 08:28:19 +0000 (20:28 +1200)]
mm: swap: introduce swap_free_nr() for batched swap_free()
Patch series "large folios swap-in: handle refault cases first", v5.
This patchset is extracted from the large folio swapin series[1],
primarily addressing the handling of scenarios involving large folios in
the swap cache. Currently, it is particularly focused on addressing the
refaulting of mTHP, which is still undergoing reclamation. This approach
aims to streamline code review and expedite the integration of this
segment into the MM tree.
It relies on Ryan's swap-out series[2], leveraging the helper function
swap_pte_batch() introduced by that series.
Presently, do_swap_page only encounters a large folio in the swap cache
before the large folio is released by vmscan. However, the code should
remain equally useful once we support large folio swap-in via
swapin_readahead(). This approach can effectively reduce page faults and
eliminate most redundant checks and early exits for MTE restoration in
recent MTE patchset[3].
The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead() will
be split into separate patch sets and sent at a later time.
While swapping in a large folio, we need to free swaps related to the
whole folio. To avoid frequently acquiring and releasing swap locks, it
is better to introduce an API for batched free. Furthermore, this new
function, swap_free_nr(), is designed to efficiently handle various
scenarios for releasing a specified number, nr, of swap entries.
Link: https://lkml.kernel.org/r/20240529082824.150954-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240529082824.150954-2-21cnbao@gmail.com Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yosry Ahmed [Mon, 6 May 2024 21:13:33 +0000 (21:13 +0000)]
mm: rmap: abstract updating per-node and per-memcg stats
A lot of intricacies go into updating the stats when adding or removing
mappings: which stat index to use and which function. Abstract this away
into a new static helper in rmap.c, __folio_mod_stat().
This adds an unnecessary call to folio_test_anon() in
__folio_add_anon_rmap() and __folio_add_file_rmap(). However, the folio
struct should already be in the cache at this point, so it shouldn't cause
any noticeable overhead.
Yosry Ahmed [Fri, 24 May 2024 03:38:18 +0000 (03:38 +0000)]
mm: zswap: make same_filled functions folio-friendly
A variable name 'page' is used in zswap_is_folio_same_filled() and
zswap_fill_page() to point at the kmapped data in a folio. Use 'data'
instead to avoid confusion and stop it from showing up when searching
for 'page' references in mm/zswap.c.
While we are at it, move the kmap/kunmap calls into zswap_fill_page(),
make it take in a folio, and rename it to zswap_fill_folio().
Link: https://lkml.kernel.org/r/20240524033819.1953587-4-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yosry Ahmed [Fri, 24 May 2024 03:38:16 +0000 (03:38 +0000)]
mm: zswap: use sg_set_folio() in zswap_{compress/decompress}()
Patch series "mm: zswap: trivial folio conversions".
Some trivial folio conversions in zswap code.
This patch (of 3):
sg_set_folio() is equivalent to sg_set_page() for order-0 folios, which
are the only ones supported by zswap. Now zswap_decompress() can take in
a folio directly.
Kefeng Wang [Fri, 24 May 2024 05:28:43 +0000 (13:28 +0800)]
mm: remove MIGRATE_SYNC_NO_COPY mode
Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY")
introduce a new MIGRATE_SYNC_NO_COPY mode to allow to offload the copy to
a device DMA engine, which is only used __migrate_device_pages() to decide
whether or not copy the old page, and the MIGRATE_SYNC_NO_COPY mode only
set in hmm, as the MIGRATE_SYNC_NO_COPY set is removed by previous
cleanup, it seems that we could remove the unnecessary
MIGRATE_SYNC_NO_COPY.
Link: https://lkml.kernel.org/r/20240524052843.182275-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 24 May 2024 05:28:42 +0000 (13:28 +0800)]
mm: migrate: remove migrate_folio_extra()
migrate_folio_extra() is only called in migrate.c now, convert it a static
function and take a new src_private argument which could be shared by
migrate_folio() and filemap_migrate_folio() to simplify code a bit.
Link: https://lkml.kernel.org/r/20240524052843.182275-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 24 May 2024 05:28:41 +0000 (13:28 +0800)]
mm: migrate_device: unify migrate folio for MIGRATE_SYNC_NO_COPY
The __migrate_device_pages() won't copy page so MIGRATE_SYNC_NO_COPY
passed into migrate_folio()/migrate_folio_extra(), actually a easy way is
just to call folio_migrate_mapping()/folio_migrate_flags(), converting it
to unify and simplify the migrate device pages, which also remove the only
call for MIGRATE_SYNC_NO_COPY.
Link: https://lkml.kernel.org/r/20240524052843.182275-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 24 May 2024 05:28:39 +0000 (13:28 +0800)]
mm: migrate: simplify __buffer_migrate_folio()
Patch series "mm: cleanup MIGRATE_SYNC_NO_COPY mode".
Commit 2916ecc0f9d4 ("mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY")
introduce a new MIGRATE_SYNC_NO_COPY mode to allow to offload the copy to
a device DMA engine, which is only used __migrate_device_pages() to decide
whether or not copy the old page, and the MIGRATE_SYNC_NO_COPY mode only
used in hmm, a easy way is just to call the folio_migrate_mapping() and
folio_migrate_flags(), which help to remove the MIGRATE_SYNC_NO_COPY mode.
This patch (of 5):
Use filemap_migrate_folio() helper to simplify __buffer_migrate_folio().
Link: https://lkml.kernel.org/r/20240524052843.182275-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240524052843.182275-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 24 May 2024 05:36:18 +0000 (13:36 +0800)]
rmap: remove DEFINE_PAGE_VMA_WALK()
This are no users since commit 40d707f33db5 ("mm/ksm: use folio in
write_protect_page"), so remove DEFINE_PAGE_VMA_WALK().
Link: https://lkml.kernel.org/r/20240524053618.208895-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 24 May 2024 01:49:50 +0000 (09:49 +0800)]
mm: memcontrol: remove page_memcg()
The page_memcg() only called by mod_memcg_page_state(), so squash it to
cleanup page_memcg().
Link: https://lkml.kernel.org/r/20240524014950.187805-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yifei Li [Mon, 13 May 2024 07:58:30 +0000 (15:58 +0800)]
mm/memory-failure: use helper llist_for_each_entry()
Change the llist_for_each_entry_safe function to the llist_for_each_entry
function and delete the next variable. Because the linked list is not
modified,the llist_for_each_entry_safe function is not required. No
functional changes are intended.
Link: https://lkml.kernel.org/r/20240513075830.2611-1-liyifei28@huawei.com Signed-off-by: Yifei Li <liyifei28@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Donet Tom [Thu, 23 May 2024 06:39:05 +0000 (01:39 -0500)]
selftest: mm: Test if hugepage does not get leaked during __bio_release_pages()
Commit 1b151e2435fc ("block: Remove special-casing of compound pages")
caused a change in behaviour when releasing the pages if the buffer does
not start at the beginning of the page. This was because the calculation
of the number of pages to release was incorrect. This was fixed by commit 38b43539d64b ("block: Fix page refcounts for unaligned buffers in
__bio_release_pages()").
We pin the user buffer during direct I/O writes. If this buffer is a
hugepage, bio_release_page() will unpin it and decrement all references
and pin counts at ->bi_end_io. However, if any references to the hugepage
remain post-I/O, the hugepage will not be freed upon unmap, leading to a
memory leak.
This patch verifies that a hugepage, used as a user buffer for DIO
operations, is correctly freed upon unmapping, regardless of whether the
offsets are aligned or unaligned w.r.t page boundary.
Test Result Fail Scenario (Without the fix)
--------------------------------------------------------
[]# ./hugetlb_dio
TAP version 13
1..4
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 1 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 2 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 3 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 6
not ok 4 : Huge pages not freed!
Totals: pass:3 fail:1 xfail:0 xpass:0 skip:0 error:0
Test Result PASS Scenario (With the fix)
---------------------------------------------------------
[]#./hugetlb_dio
TAP version 13
1..4
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 1 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 2 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 3 : Huge pages freed successfully !
No. Free pages before allocation : 7
No. Free pages after munmap : 7
ok 4 : Huge pages freed successfully !
Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[donettom@linux.ibm.com: address review comments from Muhammad] Link: https://lkml.kernel.org/r/20240604132801.23377-1-donettom@linux.ibm.com
[donettom@linux.ibm.com: add this test to run_vmtests.sh] Link: https://lkml.kernel.org/r/20240607182000.6494-1-donettom@linux.ibm.com Link: https://lkml.kernel.org/r/20240523063905.3173-1-donettom@linux.ibm.com Fixes: 38b43539d64b ("block: Fix page refcounts for unaligned buffers in __bio_release_pages()") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tony Battersby <tonyb@cybernetics.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mateusz Guzik [Tue, 21 May 2024 23:43:21 +0000 (01:43 +0200)]
mm: batch unlink_file_vma calls in free_pgd_range
Execs of dynamically linked binaries at 20-ish cores are bottlenecked on
the i_mmap_rwsem semaphore, while the biggest singular contributor is
free_pgd_range inducing the lock acquire back-to-back for all consecutive
mappings of a given file.
Tracing the count of said acquires while building the kernel shows:
[1, 2) 799579 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 3) 0 | |
[3, 4) 3009 | |
[4, 5) 3009 | |
[5, 6) 326442 |@@@@@@@@@@@@@@@@@@@@@ |
So in particular there were 326442 opportunities to coalesce 5 acquires
into 1.
Doing so increases execs per second by 4% (~50k to ~52k) when running
the benchmark linked below.
The lock remains the main bottleneck, I have not looked at other spots
yet.
Bench can be found here:
http://apollo.backplane.com/DFlyMisc/doexec.c
$ cc -O2 -o shared-doexec doexec.c
$ ./shared-doexec $(nproc)
Note this particular test makes sure binaries are separate, but the
loader is shared.
Jane Chu [Fri, 24 May 2024 21:53:06 +0000 (15:53 -0600)]
mm/memory-failure: send SIGBUS in the event of thp split fail
While handling hwpoison in a THP page, it is possible that
try_to_split_thp_page() fails. For example, when the THP page has been
RDMA pinned. At this point, the kernel cannot isolate the poisoned THP
page, all it could do is to send a SIGBUS to the user process with
meaningful payload to give user-level recovery a chance.
Link: https://lkml.kernel.org/r/20240524215306.2705454-6-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jane Chu [Fri, 24 May 2024 21:53:05 +0000 (15:53 -0600)]
mm/memory-failure: move hwpoison_filter() higher up
Move hwpoison_filter() higher up as there is no need to spend a lot cycles
only to find out later that the page is supposed to be skipped from
hwpoison handling.
Link: https://lkml.kernel.org/r/20240524215306.2705454-5-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Added two explicit MF_MSG messages describing failure in
get_hwpoison_page. Attemped to document the definition of various action
names, and made a few adjustment to the action_result() calls.
Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jane Chu [Fri, 24 May 2024 21:53:03 +0000 (15:53 -0600)]
mm/madvise: add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)
The soft hwpoison injector via madvise(MADV_HWPOISON) operates in a
synchrous way in a sense, the injector is also a process under test, and
should it have the poisoned page mapped in its address space, it should
get killed as much as in a real UE situation. Doing so align with what
the madvise(2) man page says: " "This operation may result in the calling
process receiving a SIGBUS and the page being unmapped."
Link: https://lkml.kernel.org/r/20240524215306.2705454-3-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <oalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jane Chu [Fri, 24 May 2024 21:53:02 +0000 (15:53 -0600)]
mm/memory-failure: try to send SIGBUS even if unmap failed
Patch series "Enhance soft hwpoison handling and injection", v4.
This series is aimed at the following enhancements:
- Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave
more like as if a real UE occurred. Because the other two injectors
such as hwpoison-inject and the 'einj' on x86 can't, and it seems to me
we need a better simulation to real UE scenario.
- For years, if the kernel is unable to unmap a hwpoisoned page, it send
a SIGKILL instead of SIGBUS to prevent user process from potentially
accessing the page again. But in doing so, the user process also lose
important information: vaddr, for recovery. Fortunately, the kernel
already has code to kill process re-accessing a hwpoisoned page, so
remove the '!unmap_success' check.
- Right now, if a thp page under GUP longterm pin is hwpoisoned, and
kernel cannot split the thp page, memory-failure simply ignores the UE
and returns. That's not ideal, it could deliver a SIGBUS with useful
information for userspace recovery.
This patch (of 5):
For years when it comes down to kill a process due to hwpoison, a SIGBUS
is delivered only if unmap has been successful. Otherwise, a SIGKILL is
delivered. And the reason for that is to prevent the involved process
from accessing the hwpoisoned page again.
Since then a lot has changed, a hwpoisoned page is marked and upon being
re-accessed, the memory-failure handler invokes kill_accessing_process()
to kill the process immediately. So let's take out the '!unmap_success'
factor and try to deliver SIGBUS if possible.
Bang Li [Wed, 22 May 2024 06:12:04 +0000 (14:12 +0800)]
mm: use update_mmu_tlb_range() to simplify code
Let us simplify the code by update_mmu_tlb_range().
Link: https://lkml.kernel.org/r/20240522061204.117421-4-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bang Li [Wed, 22 May 2024 06:12:03 +0000 (14:12 +0800)]
mm: implement update_mmu_tlb() using update_mmu_tlb_range()
Let's make update_mmu_tlb() simply a generic wrapper around
update_mmu_tlb_range(). Only the latter can now be overridden by the
architecture. We can now remove __HAVE_ARCH_UPDATE_MMU_TLB as well.
Link: https://lkml.kernel.org/r/20240522061204.117421-3-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bang Li [Wed, 22 May 2024 06:12:02 +0000 (14:12 +0800)]
mm: add update_mmu_tlb_range()
Patch series "Add update_mmu_tlb_range() to simplify code", v4.
This series of commits mainly adds the update_mmu_tlb_range() to batch
update tlb in an address range and implement update_mmu_tlb() using
update_mmu_tlb_range().
After commit 19eaf44954df ("mm: thp: support allocation of anonymous
multi-size THP"), We may need to batch update tlb of a certain address
range by calling update_mmu_tlb() in a loop. Using the
update_mmu_tlb_range(), we can simplify the code and possibly reduce the
execution of some unnecessary code in some architectures.
This patch (of 3):
Add update_mmu_tlb_range(), we can batch update tlb of an address range.
Link: https://lkml.kernel.org/r/20240522061204.117421-1-libang.li@antgroup.com Link: https://lkml.kernel.org/r/20240522061204.117421-2-libang.li@antgroup.com Signed-off-by: Bang Li <libang.li@antgroup.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Wed, 22 May 2024 07:04:35 +0000 (12:34 +0530)]
selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing
Post FEAT_LPA2, the Aarch64 Linux kernel extends higher address support to
4K and 16K translation granules. To support testing this out, we need to
do away with static initialization of page size, while still maintaining
the nice array of testcases; this can be achieved by initializing and
populating the array as a stack variable, and filling in the page size and
hugepage size at runtime.
Link: https://lkml.kernel.org/r/20240522070435.773918-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Wed, 22 May 2024 07:04:34 +0000 (12:34 +0530)]
selftests/mm: va_high_addr_switch: reduce test noise
Patch series "Restructure va_high_addr_switch".
The va_high_addr_switch memory selftest tests out some corner cases
related to allocation and page/hugepage faulting around the switch
boundary. Currently, the page size and hugepage size have been statically
defined. Post FEAT_LPA2, the Aarch64 Linux kernel adds support for 4k and
16k translation granules on higher addresses; we restructure the test to
support the same. In addition, we avoid invocation of the binary twice,
in the shell script, to reduce test noise.
This patch (of 2):
When invoking the binary with "--run-hugetlb" flag, the testcases
involving the base page are anyways going to be run. Therefore, remove
duplication by invoking the binary only once.
David Hildenbrand [Wed, 22 May 2024 12:57:13 +0000 (14:57 +0200)]
mm/rmap: sanity check that zeropages are not passed to RMAP
Using insert_page() we might have previously ended up passing the zeropage
into rmap code. Make sure that won't happen again.
Note that we won't check the huge zeropage for now, which might still end
up in RMAP code.
Link: https://lkml.kernel.org/r/20240522125713.775114-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Wed, 22 May 2024 12:57:12 +0000 (14:57 +0200)]
mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()
For now we only get the (small) zeropage mapped to user space in four
cases (excluding VM_PFNMAP mappings, such as /proc/vmstat):
(1) Read page faults in anonymous VMAs (MAP_PRIVATE|MAP_ANON):
do_anonymous_page() will not refcount it and map it pte_mkspecial()
(2) UFFDIO_ZEROPAGE on anonymous VMA or COW mapping of shmem
(MAP_PRIVATE). mfill_atomic_pte_zeropage() will not refcount it and
map it pte_mkspecial().
(3) KSM in mergeable VMA (anonymous VMA or COW mapping).
cmp_and_merge_page() will not refcount it and map it
pte_mkspecial().
(4) FSDAX as an optimization for holes.
vmf_insert_mixed()->__vm_insert_mixed() might end up calling
insert_page() without CONFIG_ARCH_HAS_PTE_SPECIAL, refcounting the
zeropage and not mapping it pte_mkspecial(). With
CONFIG_ARCH_HAS_PTE_SPECIAL, we'll call insert_pfn() where we will
not refcount it and map it pte_mkspecial().
In case (4), we might not have VM_MIXEDMAP set: while fs/fuse/dax.c sets
VM_MIXEDMAP, we removed it for ext4 fsdax in commit e1fb4a086495 ("dax:
remove VM_MIXEDMAP for fsdax and device dax") and for XFS in commit e1fb4a086495 ("dax: remove VM_MIXEDMAP for fsdax and device dax").
Without CONFIG_ARCH_HAS_PTE_SPECIAL and with VM_MIXEDMAP, vm_normal_page()
would currently return the zeropage. We'll refcount the zeropage when
mapping and when unmapping.
Without CONFIG_ARCH_HAS_PTE_SPECIAL and without VM_MIXEDMAP,
vm_normal_page() would currently refuse to return the zeropage. So we'd
refcount it when mapping but not when unmapping it ... do we have fsdax
without CONFIG_ARCH_HAS_PTE_SPECIAL in practice? Hard to tell.
Independent of that, we should never refcount the zeropage when we might
be holding that reference for a long time, because even without an
accounting imbalance we might overflow the refcount. As there is interest
in using the zeropage also in other VM_MIXEDMAP mappings, let's add clean
support for that in the cases where it makes sense:
(A) Never refcount the zeropage when mapping it:
In insert_page(), special-case the zeropage, do not refcount it, and use
pte_mkspecial(). Don't involve insert_pfn(), adjusting insert_page()
looks cleaner than branching off to insert_pfn().
(B) Never refcount the zeropage when unmapping it:
In vm_normal_page(), also don't return the zeropage in a VM_MIXEDMAP
mapping without CONFIG_ARCH_HAS_PTE_SPECIAL. Add a VM_WARN_ON_ONCE()
sanity check if we'd ever return the zeropage, which could happen if
someone forgets to set pte_mkspecial() when mapping the zeropage.
Document that.
(C) Allow the zeropage only where reasonable
s390x never wants the zeropage in some processes running legacy KVM guests
that make use of storage keys. So disallow that.
Further, using the zeropage in COW mappings is unproblematic (just what we
do for other COW mappings), because FAULT_FLAG_UNSHARE can just unshare it
and GUP with FOLL_LONGTERM would work as expected.
Similarly, mappings that can never have writable PTEs (implying no write
faults) are also not problematic, because nothing could end up mapping the
PTE writable by mistake later. But in case we could have writable PTEs,
we'll only allow the zeropage in FSDAX VMAs, that are incompatible with
GUP and are blocked there completely.
We'll always require the zeropage to be mapped with pte_special().
GUP-fast will reject the zeropage that way, but GUP-slow will allow it.
(Note that GUP does not refcount the zeropage with FOLL_PIN, because there
were issues with overflowing the refcount in the past).
Add sanity checks to can_change_pte_writable() and wp_page_reuse(), to
catch early during testing if we'd ever find a zeropage unexpectedly in
code that wants to upgrade write permissions.
Convert the BUG_ON in vm_mixed_ok() to an ordinary check and simply fail
with VM_FAULT_SIGBUS, like we do for other sanity checks. Drop the stale
comment regarding reserved pages from insert_page().
Note that:
* we won't mess with VM_PFNMAP mappings for now. remap_pfn_range() and
vmf_insert_pfn() would allow the zeropage in some cases and
not refcount it.
* vmf_insert_pfn*() will reject the zeropage in VM_MIXEDMAP
mappings and we'll leave that alone for now. People can simply use
one of the other interfaces.
* we won't bother with the huge zeropage for now. It's never
PTE-mapped and also GUP does not special-case it yet.
Link: https://lkml.kernel.org/r/20240522125713.775114-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Wed, 22 May 2024 12:57:11 +0000 (14:57 +0200)]
mm/memory: move page_count() check into validate_page_before_insert()
Patch series "mm/memory: cleanly support zeropage in vm_insert_page*(),
vm_map_pages*() and vmf_insert_mixed()", v2.
There is interest in mapping zeropages via vm_insert_pages() [1] into
MAP_SHARED mappings.
For now, we only get zeropages in MAP_SHARED mappings via
vmf_insert_mixed() from FSDAX code, and I think it's a bit shaky in some
cases because we refcount the zeropage when mapping it but not necessarily
always when unmapping it ... and we should actually never refcount it.
It's all a bit tricky, especially how zeropages in MAP_SHARED mappings
interact with GUP (FOLL_LONGTERM), mprotect(), write-faults and s390x
forbidding the shared zeropage (rewrite [2] s now upstream).
This series tries to take the careful approach of only allowing the
zeropage where it is likely safe to use (which should cover the existing
FSDAX use case and [1]), preventing that it could accidentally get mapped
writable during a write fault, mprotect() etc, and preventing issues with
FOLL_LONGTERM in the future with other users.
Tested with a patch from Vincent that uses the zeropage in context of
[1].
We'll now also cover the case where insert_page() is called from
__vm_insert_mixed(), which sounds like the right thing to do.
Link: https://lkml.kernel.org/r/20240522125713.775114-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Mon, 20 May 2024 22:44:07 +0000 (15:44 -0700)]
mm/hugetlb: remove {Set,Clear}Hpage macros
All users have been converted to use the folio version of these macros, we
can safely remove the page based interface.
Link: https://lkml.kernel.org/r/20240520224407.110062-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:53 +0000 (01:58 +0800)]
mm/swap: reduce swap cache search space
Currently we use one swap_address_space for every 64M chunk to reduce lock
contention, this is like having a set of smaller swap files inside one
swap device. But when doing swap cache look up or insert, we are still
using the offset of the whole large swap device. This is OK for
correctness, as the offset (key) is unique.
But Xarray is specially optimized for small indexes, it creates the radix
tree levels lazily to be just enough to fit the largest key stored in one
Xarray. So we are wasting tree nodes unnecessarily.
For 64M chunk it should only take at most 3 levels to contain everything.
But if we are using the offset from the whole swap device, the offset
(key) value will be way beyond 64M, and so will the tree level.
Optimize this by using a new helper swap_cache_index to get a swap entry's
unique offset in its own 64M swap_address_space.
I see a ~1% performance gain in benchmark and actual workload with high
memory pressure.
Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
test result is similar but the improvement is smaller if
SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap
cache):
Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps
After (0.8% faster):
94834.91 qps
Radix tree slab usage is also very slightly lower.
Link: https://lkml.kernel.org/r/20240521175854.96038-12-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:52 +0000 (01:58 +0800)]
mm: drop page_index and simplify folio_index
There are two helpers for retrieving the index within address space for
mixed usage of swap cache and page cache:
- page_index
- folio_index
This commit drops page_index, as we have eliminated all users, and
converts folio_index's helper __page_file_index to use folio to avoid the
page conversion.
Link: https://lkml.kernel.org/r/20240521175854.96038-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:51 +0000 (01:58 +0800)]
mm: remove page_file_offset and folio_file_pos
These two helpers were useful for mixed usage of swap cache and page
cache, which help retrieve the corresponding file or swap device offset of
a page or folio.
They were introduced in commit f981c5950fa8 ("mm: methods for teaching
filesystems about PG_swapcache pages") and used in commit d56b4ddf7781
("nfs: teach the NFS client how to treat PG_swapcache pages"), suppose to
be used with direct_IO for swap over fs.
But after commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it for
reads from SWP_FS_OPS swap-space"), swap with direct_IO is no more, and
swap cache mapping is never exposed to fs.
Now we have dropped all users of page_file_offset and folio_file_pos, so
they can be deleted.
Link: https://lkml.kernel.org/r/20240521175854.96038-10-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:50 +0000 (01:58 +0800)]
mm/swap: get the swap device offset directly
folio_file_pos and page_file_offset are for mixed usage of swap cache and
page cache, it can't be page cache here, so introduce a new helper to get
the swap offset in swap device directly.
Need to include swapops.h in mm/swap.h to ensure swp_offset is always
defined before use.
Link: https://lkml.kernel.org/r/20240521175854.96038-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These two are used through DEFINE_NFS_FOLIO_EVENT and
DEFINE_NFS_FOLIO_EVENT_DONE, which defined following events:
- trace_nfs_aop_readpage{_done}: only called by nfs_read_folio
- trace_nfs_writeback_folio: only called by nfs_wb_folio
- trace_nfs_invalidate_folio: only called by nfs_invalidate_folio
- trace_nfs_launder_folio_done: only called by nfs_launder_folio
None of them could possibly be used on swap cache folio,
nfs_read_folio only called by:
.write_begin -> nfs_read_folio
.read_folio
Also, seeing from the swap side, swap_rw is now the only interface calling
into fs, the offset info is always in iocb.ki_pos now.
So we can remove all these folio_file_pos call safely.
Link: https://lkml.kernel.org/r/20240521175854.96038-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:48 +0000 (01:58 +0800)]
netfs: drop usage of folio_file_pos
folio_file_pos is only needed for mixed usage of page cache and swap
cache, for pure page cache usage, the caller can just use folio_pos
instead.
It can't be a swap cache page here. Swap mapping may only call into fs
through swap_rw and that is not supported for netfs. So just drop it and
use folio_pos instead.
Link: https://lkml.kernel.org/r/20240521175854.96038-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:47 +0000 (01:58 +0800)]
afs: drop usage of folio_file_pos
folio_file_pos is only needed for mixed usage of page cache and swap
cache, for pure page cache usage, the caller can just use folio_pos
instead.
It can't be a swap cache page here. Swap mapping may only call into fs
through swap_rw and that is not supported for afs. So just drop it and
use folio_pos instead.
Link: https://lkml.kernel.org/r/20240521175854.96038-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: David Howells <dhowells@redhat.com> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:46 +0000 (01:58 +0800)]
NFS: remove nfs_page_lengthg and usage of page_index
This function is no longer used after commit 4fa7a717b432 ("NFS: Fix up
nfs_vm_page_mkwrite() for folios"), all users have been converted to use
folio instead, just delete it to remove usage of page_index.
Link: https://lkml.kernel.org/r/20240521175854.96038-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:45 +0000 (01:58 +0800)]
ceph: drop usage of page_index
page_index is needed for mixed usage of page cache and swap cache, for
pure page cache usage, the caller can just use page->index instead.
It can't be a swap cache page here, so just drop it.
Link: https://lkml.kernel.org/r/20240521175854.96038-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 21 May 2024 17:58:44 +0000 (01:58 +0800)]
nilfs2: drop usage of page_index
Patch series "mm/swap: clean up and optimize swap cache index", v6.
Currently we use one swap_address_space for every 64M chunk to reduce lock
contention, this is like having a set of smaller files inside a swap
device. But when doing swap cache look up or insert, we are still using
the offset of the whole large swap device. This is OK for correctness, as
the offset (key) is unique.
But Xarray is specially optimized for small indexes, it creates the redix
tree levels lazily to be just enough to fit the largest key stored in one
Xarray. So we are wasting tree nodes unnecessarily.
For 64M chunk it should only take at most 3 level to contain everything.
But if we are using the offset from the whole swap device, the offset
(key) value will be way beyond 64M, and so will the tree level.
Optimize this by reduce the swap cache search space into 64M scope.
Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
test result is similar but the improvement is smaller if
SWP_SYNCHRONOUS_IO is enabled, as swap out path can never skip swap
cache):
Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps
After (+0.8% faster):
94834.91 qps
There is alse a very slight drop of radix tree node slab usage:
Before: 303952K
After: 302224K
For this series:
There are multiple places that expect mixed type of pages (page cache or
swap cache), eg. migration, huge memory split; There are four helpers
for that:
To keep the code clean and compatible, this series first cleaned up usage
of them.
page_file_offset and folio_file_pos are historical helpes that can be
simply dropped after clean up. And page_index can be all converted to
folio_index or folio->index.
Then introduce two new helpers swap_cache_index and swap_dev_pos for swap.
Replace swp_offset with swap_cache_index when used to retrieve folio from
swap cache, and use swap_dev_pos when needed to retrieve the device
position of a swap entry. This way, swap_cache_index can return the
optimized value with no compatibility issue.
The result is better performance and reduced LOC.
Idealy, in the future, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT from
14 to 12: Default Xarray chunk offset is 6, so we have 3 level trees
instead of 2 level trees just for 2 extra bits. But swap cache is based
on address_space struct, with 4 times more metadata sparsely distributed
in memory it waste more cacheline, the performance gain from this series
is almost canceled according to my test. So first, just have a cleaner
seperation of offsets and smaller search space.
This patch (of 10):
page_index is only for mixed usage of page cache and swap cache, for pure
page cache usage, the caller can just use page->index instead.
It can't be a swap cache page here (being part of buffer head), so just
drop it. And while we are at it, optimize the code by retrieving the
offset of the buffer head within the folio directly using bh_offset, and
get rid of the loop and usage of page helpers.
Link: https://lkml.kernel.org/r/20240521175854.96038-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240521175854.96038-3-ryncsn@gmail.com Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kemeng Shi [Tue, 14 May 2024 12:52:48 +0000 (20:52 +0800)]
writeback: add general function domain_dirty_avail to calculate dirty and avail of domain
Add general function domain_dirty_avail to calculate dirty and avail for
either dirty limit or background writeback in either global domain or wb
domain.
Kemeng Shi [Tue, 14 May 2024 12:52:47 +0000 (20:52 +0800)]
writeback: factor out wb_bg_dirty_limits to remove repeated code
Patch series "Add helper functions to remove repeated code and improve
readability of cgroup writeback", v2.
This series adds a lot of helpers to remove repeated code between domain
and wb; dirty limit and dirty background; global domain and wb domain.
The helpers also improve readability. More details can be found in the
respective patches.
A simple domain hierarchy is tested:
global domain (> 20G)
|
cgroup domain1(10G)
|
wb1
|
fio
Test steps:
/* make it easy to observe */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 3000 > /proc/sys/vm/dirty_writeback_centisecs
/* run fio to generate dirty pages */
fio -name test -filename=/bdi1/file -size=xxx -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
When fio size is 1G, the wb is in freerun state and dirty pages are only
written back when dirty inode is expired after 30 seconds. When fio size
is 2G, the dirty pages keep being written back and bandwidth of fio is
limited.
This patch (of 8):
Similar to wb_dirty_limits which calculates dirty and thresh of wb,
wb_bg_dirty_limits calculates background dirty and background thresh of
wb. With wb_bg_dirty_limits, we could remove repeated code in
wb_over_bg_thresh.
Shakeel Butt [Wed, 29 May 2024 15:49:11 +0000 (08:49 -0700)]
mm: vmscan: reset sc->priority on retry
The commit 6be5e186fd65 ("mm: vmscan: restore incremental cgroup
iteration") added a retry reclaim heuristic to iterate all the cgroups
before returning an unsuccessful reclaim but missed to reset the
sc->priority. Let's fix it.
Link: https://lkml.kernel.org/r/20240529154911.3008025-1-shakeel.butt@linux.dev Fixes: 6be5e186fd65 ("mm: vmscan: restore incremental cgroup iteration") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com Tested-by: syzbot+17416257cb95200cba44@syzkaller.appspotmail.com Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Tue, 14 May 2024 20:26:41 +0000 (16:26 -0400)]
mm: vmscan: restore incremental cgroup iteration
Currently, reclaim always walks the entire cgroup tree in order to ensure
fairness between groups. While overreclaim is limited in shrink_lruvec(),
many of our systems have a sizable number of active groups, and an even
bigger number of idle cgroups with cache left behind by previous jobs; the
mere act of walking all these cgroups can impose significant latency on
direct reclaimers.
In the past, we've used a save-and-restore iterator that enabled
incremental tree walks over multiple reclaim invocations. This ensured
fairness, while keeping the work of individual reclaimers small.
However, in edge cases with a lot of reclaim concurrency, individual
reclaimers would sometimes not see enough of the cgroup tree to make
forward progress and (prematurely) declare OOM. Consequently we switched
to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not share cgroup
iteration between reclaimers").
To address the latency problem without bringing back the premature OOM
issue, reinstate the shared iteration, but with a restart condition to do
the full walk in the OOM case - similar to what we do for memory.low
enforcement and active page protection.
In the worst case, we do one more full tree walk before declaring
OOM. But the vast majority of direct reclaim scans can then finish
much quicker, while fairness across the tree is maintained:
- Before this patch, we observed that direct reclaim always takes more
than 100us and most direct reclaim time is spent in reclaim cycles
lasting between 1ms and 1 second. Almost 40% of direct reclaim time
was spent on reclaim cycles exceeding 100ms.
- With this patch, almost all page reclaim cycles last less than 10ms,
and a good amount of direct page reclaim finishes in under 100us. No
page reclaim cycles lasting over 100ms were observed anymore.
The shared iterator state is maintaned inside the target cgroup, so
fair and incremental walks are performed during both global reclaim
and cgroup limit reclaim of complex subtrees.
Link: https://lkml.kernel.org/r/20240514202641.2821494-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Facebook Kernel Team <kernel-team@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ran Xiaokai [Wed, 15 May 2024 02:47:54 +0000 (10:47 +0800)]
mm/huge_memory: mark racy access onhuge_anon_orders_always
huge_anon_orders_always is accessed lockless, it is better to use the
READ_ONCE() wrapper. This is not fixing any visible bug, hopefully this
can cease some KCSAN complains in the future. Also do that for
huge_anon_orders_madvise.
Link: https://lkml.kernel.org/r/20240515104754889HqrahFPePOIE1UlANHVAh@zte.com.cn Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lu Zhongjun <lu.zhongjun@zte.com.cn> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Thu, 16 May 2024 08:10:35 +0000 (10:10 +0200)]
mm/hugetlb: drop node_alloc_noretry from alloc_fresh_hugetlb_folio
Since commit d67e32f26713 ("hugetlb: restructure pool allocations"), the
parameter node_alloc_noretry from alloc_fresh_hugetlb_folio() is not used,
so drop it.
Link: https://lkml.kernel.org/r/20240516081035.5651-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Illia Ostapyshyn [Fri, 17 May 2024 09:13:48 +0000 (11:13 +0200)]
mm/vmscan: update stale references to shrink_page_list
Commit 49fd9b6df54e ("mm/vmscan: fix a lot of comments") renamed
shrink_page_list() to shrink_folio_list(). Fix up the remaining
references to the old name in comments and documentation.
Link: https://lkml.kernel.org/r/20240517091348.1185566-1-illia@yshyn.com Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sun, 30 Jun 2024 21:32:24 +0000 (14:32 -0700)]
Merge tag 'ata-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Niklas Cassel:
- Add NOLPM quirk for for all Crucial BX SSD1 models.
Considering that we now have had bug reports for 3 different BX SSD1
variants from Crucial with the same product name, make the quirk more
inclusive, to catch more device models from the same generation.
- Fix a trivial NULL pointer dereference in the error path for
ata_host_release().
- Create a ata_port_free(), so that we don't miss freeing ata_port
struct members when freeing a struct ata_port.
- Fix a trivial double free in the error path for ata_host_alloc().
- Ensure that we remove the libata "remapped NVMe device count" sysfs
entry on .probe() error.
* tag 'ata-6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci: Clean up sysfs file on error
ata: libata-core: Fix double free on error
ata,scsi: libata-core: Do not leak memory for ata_port struct members
ata: libata-core: Fix null pointer dereference on error
ata: libata-core: Add ATA_HORKAGE_NOLPM for all Crucial BX SSD1 models
Niklas Cassel [Sat, 29 Jun 2024 12:42:14 +0000 (14:42 +0200)]
ata: ahci: Clean up sysfs file on error
.probe() (ahci_init_one()) calls sysfs_add_file_to_group(), however,
if probe() fails after this call, we currently never call
sysfs_remove_file_from_group().
(The sysfs_remove_file_from_group() call in .remove() (ahci_remove_one())
does not help, as .remove() is not called on .probe() error.)
Thus, if probe() fails after the sysfs_add_file_to_group() call, the next
time we insmod the module we will get:
Niklas Cassel [Sat, 29 Jun 2024 12:42:13 +0000 (14:42 +0200)]
ata: libata-core: Fix double free on error
If e.g. the ata_port_alloc() call in ata_host_alloc() fails, we will jump
to the err_out label, which will call devres_release_group().
devres_release_group() will trigger a call to ata_host_release().
ata_host_release() calls kfree(host), so executing the kfree(host) in
ata_host_alloc() will lead to a double free:
Niklas Cassel [Sat, 29 Jun 2024 12:42:12 +0000 (14:42 +0200)]
ata,scsi: libata-core: Do not leak memory for ata_port struct members
libsas is currently not freeing all the struct ata_port struct members,
e.g. ncq_sense_buf for a driver supporting Command Duration Limits (CDL).
Add a function, ata_port_free(), that is used to free a ata_port,
including its struct members. It makes sense to keep the code related to
freeing a ata_port in its own function, which will also free all the
struct members of struct ata_port.
Do not access ata_port struct members unconditionally.
Fixes: 633273a3ed1c ("libata-pmp: hook PMP support and enable it") Cc: stable@vger.kernel.org Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240629124210.181537-7-cassel@kernel.org Signed-off-by: Niklas Cassel <cassel@kernel.org>
Linus Torvalds [Sun, 30 Jun 2024 17:00:01 +0000 (10:00 -0700)]
Merge tag 'kbuild-fixes-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Remove the executable bit from installed DTB files
- Escape $ in subshell execution in the debian-orig target
- Fix RPM builds with CONFIG_MODULES=n
- Fix xconfig with the O= option
- Fix scripts_gdb with the O= option
* tag 'kbuild-fixes-v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: scripts/gdb: bring the "abspath" back
kbuild: Use $(obj)/%.cc to fix host C++ module builds
kbuild: rpm-pkg: fix build error with CONFIG_MODULES=n
kbuild: Fix build target deb-pkg: ln: failed to create hard link
kbuild: doc: Update default INSTALL_MOD_DIR from extra to updates
kbuild: Install dtb files as 0644 in Makefile.dtbinst
Linus Torvalds [Wed, 26 Jun 2024 00:50:04 +0000 (17:50 -0700)]
x86-32: fix cmpxchg8b_emu build error with clang
The kernel test robot reported that clang no longer compiles the 32-bit
x86 kernel in some configurations due to commit 95ece48165c1
("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}()
functions").
The build fails with
arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available
and the reason seems to be that not only does the cmpxchg8b instruction
need four fixed registers (EDX:EAX and ECX:EBX), with the emulation
fallback the inline asm also wants a fifth fixed register for the
address (it uses %esi for that, but that's just a software convention
with cmpxchg8b_emu).
Avoiding using another pointer input to the asm (and just forcing it to
use the "0(%esi)" addressing that we end up requiring for the sw
fallback) seems to fix the issue.