mm/mprotect: Fix do_mprotect_pkey() return on error
When the loop over the VMA is terminated early due to an error, the
return code could be overwritten with ENOMEM. Fix this by only setting
the error code when it is not set already.
Fixes: 2286a6914c77 ("mm: change mprotect_fixup to vma iterator") Cc: <stable@vger.kernel.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
mm/mmap: Change do_vmi_align_munmap() for maple tree iterator changes
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave. Update the expected
behaviour to map now since the change will work currently.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Fri, 10 Feb 2023 21:50:08 +0000 (16:50 -0500)]
maple_tree: Remove unnecessary check from mas_destroy()
mas_destroy currently checks if mas->node is MAS_START prior to calling
mas_start(), but this is unnecessary as mas_start() will do nothing if
the node is anything but MAS_START.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Thu, 2 Feb 2023 16:16:24 +0000 (11:16 -0500)]
maple_tree: Add __init and __exit to test module
The test functions are not needed after the module is removed, so mark
them as such. Add __exit to the module removal function. Some other
variables have been marked as const static as well.
Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Thu, 1 Dec 2022 16:48:12 +0000 (11:48 -0500)]
maple_tree: Make test code work without debug enabled
The test code is less useful without debug, but can still do general
validations. Define mt_dump(), mas_dump() and mas_wr_dump() as a noop
if debug is not enabled and document it in the test module information
that more information can be obtained with another kernel config option.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Return error on mte_pivots() out of range
Rename mte_pivots() to mas_pivots() and pass through the ma_state to set
the error code to -EIO when the offset is out of range for the node
type. Change the WARN_ON() to MAS_WARN_ON() to log the maple state.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Use MAS_WR_BUG_ON() in mas_store_prealloc()
mas_store_prealloc() should never fail, but if it does due to internal
tree issues then get as much debug information as possible prior to
crashing the kernel.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Use MAS_BUG_ON() instead of MT_BUG_ON() to get the maple state
information. In the unlikely even of a tree height of > 31, try to increase
the probability of useful information being logged.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 31 Aug 2022 19:55:15 +0000 (15:55 -0400)]
maple_tree: Convert debug code to use MT_WARN_ON() and MAS_WARN_ON()
Using MT_WARN_ON() allows for the removal of if statements before
logging. Using MAS_WARN_ON() will provide more information when issues
are encountered.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Tue, 15 Nov 2022 14:29:33 +0000 (09:29 -0500)]
maple_tree: Clean up mas_parent_enum()
mas_parent_enum() is a simple wrapper for mte_parent_enum() which is
only called from that wrapper. Remove the wrapper and inline
mte_parent_enum() into mas_parent_enum().
At the same time, clean up the bit masking of the root pointer since it
cannot be set by the time the bit masking occurs. Change the check on
the root bit to a WARN_ON(), and fix the verification code to not
trigger the WARN_ON() before checking if the node is root.
Reported-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Jiaqi Yan [Mon, 27 Mar 2023 21:15:48 +0000 (14:15 -0700)]
mm/khugepaged: recover from poisoned file-backed memory
Make collapse_file roll back when copying pages failed. More concretely:
- extract copying operations into a separate loop
- postpone the updates for nr_none until both scanning and copying
succeeded
- postpone joining small xarray entries until both scanning and copying
succeeded
- postpone the update operations to NR_XXX_THPS until both scanning and
copying succeeded
- for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but
copying failed
Tested manually:
0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk.
1. Start a two-thread application. Each thread allocates a chunk of
non-huge memory buffer from /mnt/ramdisk.
2. Pick 4 random buffer address (2 in each thread) and inject
uncorrectable memory errors at physical addresses.
3. Signal both threads to make their memory buffer collapsible, i.e.
calling madvise(MADV_HUGEPAGE).
4. Wait and then check kernel log: khugepaged is able to recover from
poisoned pages by skipping them.
5. Signal both threads to inspect their buffer contents and make sure no
data corruption.
Link: https://lkml.kernel.org/r/20230327211548.462509-4-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tong Tiangen <tongtiangen@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiaqi Yan [Mon, 27 Mar 2023 21:15:47 +0000 (14:15 -0700)]
mm/hwpoison: introduce copy_mc_highpage
Similar to how copy_mc_user_highpage is implemented for copy_user_highpage
on #MC supported architecture, introduce the #MC handled version of
copy_highpage.
This helper has immediate usage when khugepaged wants to copy file-backed
memory pages and tolerate #MC.
Link: https://lkml.kernel.org/r/20230327211548.462509-3-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tong Tiangen <tongtiangen@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiaqi Yan [Mon, 27 Mar 2023 21:15:46 +0000 (14:15 -0700)]
mm/khugepaged: recover from poisoned anonymous memory
Patch series "Memory poison recovery in khugepaged collapsing", v11.
Problem
=======
Memory DIMMs are subject to multi-bit flips, i.e. memory errors. As
memory size and density increase, the chances of and number of memory
errors increase. The increasing size and density of server RAM in the
data center and cloud have shown increased uncorrectable memory errors.
There are already mechanisms in the kernel to recover from uncorrectable
memory errors. This series of patches provides the recovery mechanism for
the particular kernel agent khugepaged when it collapses memory pages.
Impact
======
The main reason we chose to make khugepaged collapsing tolerant of memory
failures was its high possibility of accessing poisoned memory while
performing functionally optional compaction actions. Standard
applications typically don't have strict requirements on the size of its
pages. So they are given 4K pages by the kernel. The kernel is able to
improve application performance by either
1) giving applications 2M pages to begin with, or
2) collapsing 4K pages into 2M pages when possible.
This collapsing operation is done by khugepaged, a kernel agent that is
constantly scanning memory. When collapsing 4K pages into a 2M page, it
must copy the data from the 4K pages into a physically contiguous 2M page.
Therefore, as long as there exists one poisoned cache line in collapsible
4K pages, khugepaged will eventually access it. The current impact to
users is a machine check exception triggered kernel panic. However,
khugepaged’s compaction operations are not functionally required kernel
actions. Therefore making khugepaged tolerant to poisoned memory will
greatly improve user experience.
This patch series is for cases where khugepaged is the first guy that
detects the memory errors on the poisoned pages. IOW, the pages are not
known to have memory errors when khugepaged collapsing gets to them. In
our observation, this happens frequently when the huge page ratio of the
system is relatively low, which is fairly common in virtual machines
running on cloud.
Solution
========
As stated before, it is less desirable to crash the system only because
khugepaged accesses poisoned pages while it is collapsing 4K pages. The
high level idea of this patch series is to skip the group of pages
(usually 512 4K-size pages) once khugepaged finds one of them is poisoned,
as these pages have become ineligible to be collapsed.
We are also careful to unwind operations khuagepaged has performed before
it detects memory failures. For example, before copying and collapsing a
group of anonymous pages into a huge page, the source pages will be
isolated and their page table is unlinked from their PMD. These
operations need to be undone in order to ensure these pages are not
changed/lost from the perspective of other threads (both user and kernel
space). As for file backed memory pages, there already exists a rollback
case. This patch just extends it so that khugepaged also correctly rolls
back when it fails to copy poisoned 4K pages.
This patch (of 11):
Make __collapse_huge_page_copy return whether copying anonymous pages
succeeded, and make collapse_huge_page handle the return status.
Break existing PTE scan loop into two for-loops. The first loop copies
source pages into target huge page, and can fail gracefully when running
into memory errors in source pages. If copying all pages succeeds, the
second loop releases and clears up these normal pages. Otherwise, the
second loop rolls back the page table and page states by:
- re-establishing the original PTEs-to-PMD connection.
- releasing source pages back to their LRU list.
Tested manually:
0. Enable khugepaged on system under test.
1. Start a two-thread application. Each thread allocates a chunk of
non-huge anonymous memory buffer.
2. Pick 4 random buffer locations (2 in each thread) and inject
uncorrectable memory errors at corresponding physical addresses.
3. Signal both threads to make their memory buffer collapsible, i.e.
calling madvise(MADV_HUGEPAGE).
4. Wait and check kernel log: khugepaged is able to recover from poisoned
pages and skips collapsing them.
5. Signal both threads to inspect their buffer contents and make sure no
data corruption.
Link: https://lkml.kernel.org/r/20230327211548.462509-1-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20230327211548.462509-2-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tong Tiangen <tongtiangen@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhen Lei [Mon, 27 Mar 2023 03:41:49 +0000 (11:41 +0800)]
kmsan: fix a stale comment in kmsan_save_stack_with_flags()
After commit 446ec83805dd ("mm/page_alloc: use might_alloc()") and commit 84172f4bb752 ("mm/page_alloc: combine __alloc_pages and
__alloc_pages_nodemask"), the comment is no longer accurate. Flag
'__GFP_DIRECT_RECLAIM' is clear enough on its own, so remove the comment
rather than update it.
Link: https://lkml.kernel.org/r/20230327034149.942-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The local_lock_irqsave() is invoked in put_cpu_partial() and happens in
IPI context, due to the CONFIG_PROVE_RAW_LOCK_NESTING=y (the
LD_WAIT_CONFIG not equal to LD_WAIT_SPIN), so acquire local_lock in IPI
context will trigger above calltrace.
This commit therefore moves qlist_free_all() from hard-irq context to task
context.
Link: https://lkml.kernel.org/r/20230327120019.1027640-1-qiang1.zhang@intel.com Signed-off-by: Zqiang <qiang1.zhang@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 27 Mar 2023 15:10:50 +0000 (16:10 +0100)]
hugetlb: remove PageHeadHuge()
Sidhartha Kumar removed the last caller of PageHeadHuge(), so we can now
remove it and make folio_test_hugetlb() the real implementation. Add
kernel-doc for folio_test_hugetlb().
Link: https://lkml.kernel.org/r/20230327151050.1787744-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A global vmap_blocks-xarray array can be contented under heavy usage of
the vm_map_ram()/vm_unmap_ram() APIs. The lock_stat shows that a
"vmap_blocks.xa_lock" lock is a second in a top-list when it comes to
contentions:
Matthew Wilcox (Oracle) [Mon, 27 Mar 2023 17:45:15 +0000 (18:45 +0100)]
mm: hold the RCU read lock over calls to ->map_pages
Prevent filesystems from doing things which sleep in their map_pages
method. This is in preparation for a pagefault path protected only by
RCU.
Link: https://lkml.kernel.org/r/20230327174515.1811532-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 27 Mar 2023 17:45:14 +0000 (18:45 +0100)]
afs: split afs_pagecache_valid() out of afs_validate()
For the map_pages() method, we need a test that does not sleep. The page
fault handler will continue to call the fault() method where we can sleep
and do the full revalidation there.
Link: https://lkml.kernel.org/r/20230327174515.1811532-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 27 Mar 2023 17:45:13 +0000 (18:45 +0100)]
xfs: remove xfs_filemap_map_pages() wrapper
Patch series "Prevent ->map_pages from sleeping", v2.
In preparation for a larger patch series which will handle (some, easy)
page faults protected only by RCU, change the two filesystems which have
sleeping locks to not take them and hold the RCU lock around calls to
->map_page to prevent other filesystems from adding sleeping locks.
This patch (of 3):
XFS doesn't actually need to be holding the XFS_MMAPLOCK_SHARED to do
this. filemap_map_pages() cannot bring new folios into the page cache
and the folio lock is taken during filemap_map_pages() which provides
sufficient protection against a truncation or hole punch.
Mike Rapoport (IBM) [Sat, 25 Mar 2023 06:08:26 +0000 (09:08 +0300)]
sh: drop ranges for definition of ARCH_FORCE_MAX_ORDER
untweak ARCH_FORCE_MAX_ORDER's `range'
Link: https://lkml.kernel.org/r/20230325060828.2662773-13-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20230325060828.2662773-12-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Fri, 24 Mar 2023 05:22:29 +0000 (08:22 +0300)]
powerpc: drop ranges for definition of ARCH_FORCE_MAX_ORDER
PowerPC defines ranges for ARCH_FORCE_MAX_ORDER some of which are insanely
allowing MAX_ORDER up to 63, which implies maximal contiguous allocation
size of 2^63 pages.
Drop bogus definitions of ranges for ARCH_FORCE_MAX_ORDER and leave it a
simple integer with sensible defaults.
Users that *really* need to change the value of ARCH_FORCE_MAX_ORDER will
be able to do so but they won't be mislead by the bogus ranges.
Link: https://lkml.kernel.org/r/20230324052233.2654090-11-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rich Felker <dalias@libc.org> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Fri, 24 Mar 2023 05:22:23 +0000 (08:22 +0300)]
csky: drop ARCH_FORCE_MAX_ORDER
The default value of ARCH_FORCE_MAX_ORDER matches the generic default
defined in the MM code, the architecture does not support huge pages, so
there is no need to keep ARCH_FORCE_MAX_ORDER option available.
Drop it.
Link: https://lkml.kernel.org/r/20230324052233.2654090-5-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rich Felker <dalias@libc.org> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20230325060828.2662773-4-rppt@kernel.org Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Fri, 24 Mar 2023 05:22:21 +0000 (08:22 +0300)]
arm64: drop ranges in definition of ARCH_FORCE_MAX_ORDER
It is not a good idea to change fundamental parameters of core memory
management. Having predefined ranges suggests that the values within
those ranges are sensible, but one has to *really* understand implications
of changing MAX_ORDER before actually amending it and ranges don't help
here.
Drop ranges in definition of ARCH_FORCE_MAX_ORDER and make its prompt
visible only if EXPERT=y
Link: https://lkml.kernel.org/r/20230324052233.2654090-3-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rich Felker <dalias@libc.org> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Fri, 24 Mar 2023 05:22:20 +0000 (08:22 +0300)]
arm: reword ARCH_FORCE_MAX_ORDER prompt and help text
Patch series "arch,mm: cleanup Kconfig entries for ARCH_FORCE_MAX_ORDER",
v3.
Several architectures have ARCH_FORCE_MAX_ORDER in their Kconfig and
they all have wrong and misleading prompt and help text for this option.
Besides, some define insane limits for possible values of
ARCH_FORCE_MAX_ORDER, some carefully define ranges only for a subset of
possible configurations, some make this option configurable by users for no
good reason.
This set updates the prompt and help text everywhere and does its best to
update actual definitions of ranges where applicable.
kbuild generated a bunch of false positives because it assigns -1 to
ARCH_FORCE_MAX_ORDER, hopefully this will be fixed soon.
This patch (of 14):
The prompt and help text of ARCH_FORCE_MAX_ORDER are not even close to
describe this configuration option.
Update both to actually describe what this option does.
Link: https://lkml.kernel.org/r/20230325060828.2662773-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20230324052233.2654090-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20230324052233.2654090-2-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rich Felker <dalias@libc.org> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thomas WeiĂźschuh [Fri, 24 Mar 2023 15:35:27 +0000 (15:35 +0000)]
mm/damon/sysfs: make more kobj_type structures constant
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the
driver core allows the usage of const struct kobj_type.
Take advantage of this to constify the structure definition to prevent
modification at runtime.
These structures were not constified in commit e56397e8c40d
("mm/damon/sysfs: make kobj_type structures constant") as they didn't
exist when that patch was written.
Chaitanya S Prakash [Thu, 23 Mar 2023 06:01:21 +0000 (11:31 +0530)]
selftests/mm: set overcommit_policy as OVERCOMMIT_ALWAYS
The kernel's default behaviour is to obstruct the allocation of high
virtual address as it handles memory overcommit in a heuristic manner.
Setting the parameter as OVERCOMMIT_ALWAYS, ensures kernel isn't
susceptible to the availability of a platform's physical memory when
denying a memory allocation request.
Chaitanya S Prakash [Thu, 23 Mar 2023 06:01:20 +0000 (11:31 +0530)]
selftests/mm: change NR_CHUNKS_HIGH for aarch64
Although there is a provision for 52 bit VA on arm64 platform, it remains
unutilised and higher addresses are not allocated. In order to
accommodate 4PB [2^52] virtual address space where supported,
NR_CHUNKS_HIGH is changed accordingly.
Array holding addresses is changed from static allocation to dynamic
allocation to accommodate its voluminous nature which otherwise might
overflow the stack.
Chaitanya S Prakash [Thu, 23 Mar 2023 06:01:19 +0000 (11:31 +0530)]
selftests/mm: change MAP_CHUNK_SIZE
Patch series "selftests: Fix virtual address range for arm64", v2.
When the virtual address range selftest is run on arm64 and x86 platforms,
it is observed that both the low and high VA range iterations are skipped
when the MAP_CHUNK_SIZE is set to 16GB. The MAP_CHUNK_SIZE is changed to
1GB to resolve this issue, following which support for arm64 platform is
added by changing the NR_CHUNKS_HIGH for aarch64 to accommodate up to 4PB
of virtual address space allocation requests. Dynamic memory allocation
of array holding addresses is introduced to prevent overflow of the stack.
Finally, the overcommit_policy is set as OVERCOMMIT_ALWAYS to prevent the
kernel from denying a memory allocation request based on a platform's
physical memory availability.
This patch (of 3):
mmap() fails to allocate 16GB virtual space chunk, skipping both low and
high VA range iterations. Hence, reduce MAP_CHUNK_SIZE to 1GB and update
relevant macros as required.
Wenchao Hao [Thu, 23 Mar 2023 11:41:36 +0000 (19:41 +0800)]
trace: cma: remove unnecessary event class cma_alloc_class
After commit cb6c33d4dc09 ("cma: tracing: print alloc result in
trace_cma_alloc_finish"), cma_alloc_class has only one event which is
cma_alloc_busy_retry. So we can remove the cma_alloc_class.
Link: https://lkml.kernel.org/r/20230323114136.177677-1-haowenchao2@huawei.com Signed-off-by: Wenchao Hao <haowenchao2@huawei.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Hongxiang Lou <louhongxiang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tomas Krcka [Thu, 23 Mar 2023 17:43:49 +0000 (17:43 +0000)]
mm: be less noisy during memory hotplug
Turn a pr_info() into a pr_debug() to prevent dmesg spamming on systems
where memory hotplug is a frequent operation.
Link: https://lkml.kernel.org/r/20230323174349.35990-1-krckatom@amazon.de Signed-off-by: Tomas Krcka <krckatom@amazon.de> Suggested-by: Jan H. Schönherr <jschoenh@amazon.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 22 Mar 2023 20:19:00 +0000 (20:19 +0000)]
mm/mmap/vma_merge: init cleanup, be explicit about the non-mergeable case
Rather than setting err = -1 and only resetting if we hit merge cases,
explicitly check the non-mergeable case to make it abundantly clear that
we only proceed with the rest if something is mergeable, default err to 0
and only update if an error might occur.
Move the merge_prev, merge_next cases closer to the logic determining
curr, next and reorder initial variables so they are more logically
grouped.
This has no functional impact.
Link: https://lkml.kernel.org/r/99259fbc6403e80e270e1cc4612abbc8620b121b.1679516210.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vernon Yang <vernon2gm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously, vma was an uninitialised variable which was only definitely
assigned as a result of the logic covering all possible input cases - for
it to have remained uninitialised, prev would have to be NULL, and next
would _have_ to be mergeable.
The value of res defaults to NULL, so we can neatly eliminate the
assignment to res and vma in the if (prev) block and ensure that both res
and vma are both explicitly assigned, by just setting both to prev.
In addition we add an explanation as to under what circumstances both
might change, and since we absolutely do rely on addr == curr->vm_start
should curr exist, assert that this is the case.
Link: https://lkml.kernel.org/r/83938bed24422cbe5954bbf491341674becfe567.1679516210.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vernon Yang <vernon2gm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 22 Mar 2023 20:18:58 +0000 (20:18 +0000)]
mm/mmap/vma_merge: fold curr, next assignment logic
Use find_vma_intersection() and vma_lookup() to both simplify the logic
and to fold the end == next->vm_start condition into one block.
This groups all of the simple range checks together and establishes the
invariant that, if prev, curr or next are non-NULL then their positions
are as expected.
This has no functional impact.
Link: https://lkml.kernel.org/r/c6d960641b4ba58fa6ad3d07bf68c27d847963c8.1679516210.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vernon Yang <vernon2gm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 21 Mar 2023 20:45:55 +0000 (20:45 +0000)]
mm/mmap/vma_merge: further improve prev/next VMA naming
Patch series "further cleanup of vma_merge()", v2.
Following on from Vlastimil Babka's patch series "cleanup vma_merge() and
improve mergeability tests" which was in turn based on Liam's prior
cleanups, this patch series introduces changes discussed in review of
Vlastimil's series and goes further in attempting to make the logic as
clear as possible.
Nearly all of this should have absolutely no functional impact, however it
does add a singular VM_WARN_ON() case.
With many thanks to Vernon for helping kick start the discussion around
simplification - abstract use of vma did indeed turn out not to be
necessary - and to Liam for his excellent suggestions which greatly
simplified things.
This patch (of 4):
Previously the ASCII diagram above vma_merge() and the accompanying
variable naming was rather confusing, however recent efforts by Liam
Howlett and Vlastimil Babka have significantly improved matters.
This patch goes a little further - replacing 'X' with 'N' which feels a
lot more natural and replacing what was 'N' with 'C' which stands for
'concurrent' VMA.
No word quite describes a VMA that has coincident start as the input span,
concurrent, abbreviated to 'curr' (and which can be thought of also as
'current') however fits intuitions well alongside prev and next.
Stephen Rothwell [Tue, 28 Mar 2023 23:25:14 +0000 (16:25 -0700)]
mm: vmalloc: fix sparc64 warning
This fixes this warning from a sparc64 defconfig build:
In file included from /home/sfr/next/next/include/linux/wait.h:11,
from /home/sfr/next/next/include/linux/swait.h:8,
from /home/sfr/next/next/include/linux/completion.h:12,
from /home/sfr/next/next/include/linux/mm_types.h:14,
from /home/sfr/next/next/include/linux/uio.h:10,
from /home/sfr/next/next/include/linux/vmalloc.h:12,
from /home/sfr/next/next/include/asm-generic/io.h:994,
from /home/sfr/next/next/arch/sparc/include/asm/io.h:22,
from /home/sfr/next/next/arch/sparc/vdso/vclock_gettime.c:18:
/home/sfr/next/next/arch/sparc/include/asm/current.h:18:30: warning: call-clobbered register used for global register variable
18 | register struct task_struct *current asm("g4");
| ^~~~~~~
Link: https://lkml.kernel.org/r/20230320144721.663280c3@canb.auug.org.au Fixes: 4e29dd9708cb ("mm: vmalloc: convert vread() to vread_iter()") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 22 Mar 2023 18:57:04 +0000 (18:57 +0000)]
mm: vmalloc: convert vread() to vread_iter()
Having previously laid the foundation for converting vread() to an
iterator function, pull the trigger and do so.
This patch attempts to provide minimal refactoring and to reflect the
existing logic as best we can, for example we continue to zero portions of
memory not read, as before.
Overall, there should be no functional difference other than a performance
improvement in /proc/kcore access to vmalloc regions.
Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
we dispense with it, and try to write to user memory optimistically but
with faults disabled via copy_page_to_iter_nofault(). We already have
preemption disabled by holding a spin lock. We continue faulting in until
the operation is complete.
Additionally, we must account for the fact that at any point a copy may
fail (most likely due to a fault not being able to occur), we exit
indicating fewer bytes retrieved than expected.
Link: https://lkml.kernel.org/r/941f88bc5ab928e6656e1e2593b91bf0f8c81e1b.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 22 Mar 2023 18:57:03 +0000 (18:57 +0000)]
iov_iter: add copy_page_to_iter_nofault()
Provide a means to copy a page to user space from an iterator, aborting if
a page fault would occur. This supports compound pages, but may be passed
a tail page with an offset extending further into the compound page, so we
cannot pass a folio.
This allows for this function to be called from atomic context and _try_
to user pages if they are faulted in, aborting if not.
The function does not use _copy_to_iter() in order to not specify
might_fault(), this is similar to copy_page_from_iter_atomic().
This is being added in order that an iteratable form of vread() can be
implemented while holding spinlocks.
Link: https://lkml.kernel.org/r/19734729defb0f498a76bdec1bef3ac48a3af3e8.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 22 Mar 2023 18:57:02 +0000 (18:57 +0000)]
fs/proc/kcore: convert read_kcore() to read_kcore_iter()
For the time being we still use a bounce buffer for vread(), however in
the next patch we will convert this to interact directly with the iterator
and eliminate the bounce buffer altogether.
Link: https://lkml.kernel.org/r/ebe12c8d70eebd71f487d80095605f3ad0d1489c.1679511146.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Baoquan He <bhe@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 22 Mar 2023 18:57:01 +0000 (18:57 +0000)]
fs/proc/kcore: avoid bounce buffer for ktext data
Patch series "convert read_kcore(), vread() to use iterators", v8.
While reviewing Baoquan's recent changes to permit vread() access to
vm_map_ram regions of vmalloc allocations, Willy pointed out [1] that it
would be nice to refactor vread() as a whole, since its only user is
read_kcore() and the existing form of vread() necessitates the use of a
bounce buffer.
This patch series does exactly that, as well as adjusting how we read the
kernel text section to avoid the use of a bounce buffer in this case as
well.
This has been tested against the test case which motivated Baoquan's
changes in the first place [2] which continues to function correctly, as
do the vmalloc self tests.
This patch (of 4):
Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
introduced the use of a bounce buffer to retrieve kernel text data for
/proc/kcore in order to avoid failures arising from hardened user copies
enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object().
We can avoid doing this if instead of copy_to_user() we use
_copy_to_user() which bypasses the hardening check. This is more
efficient than using a bounce buffer and simplifies the code.
We do so as part an overall effort to eliminate bounce buffer usage in the
function with an eye to converting it an iterator read.
This function no longer exists, however the prot != vma->vm_page_prot case
discussion has been retained and moved to vmf_insert_pfn_prot() so refer
to this instead.
Link: https://lkml.kernel.org/r/db403b3622b94a87bd93528cc1d6b44ae88adcdd.1678661628.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Sun, 12 Mar 2023 23:40:14 +0000 (23:40 +0000)]
mm: remove vmf_insert_pfn_xxx_prot() for huge page-table entries
This functionality's sole user, the drm ttm module, removed support for it
in commit 0d979509539e ("drm/ttm: remove ttm_bo_vm_insert_huge()") as the
whole approach is currently unworkable without a PMD/PUD special bit and
updates to GUP.
Link: https://lkml.kernel.org/r/604c2ad79659d4b8a6e3e1611c6219d5d3233988.1678661628.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Sun, 12 Mar 2023 23:40:13 +0000 (23:40 +0000)]
mm: remove unused vmf_insert_mixed_prot()
Patch series "Remove drm/ttm-specific mm changes".
Functionality was added specifically for the DRM TTM driver to support
mapping memory for VM_MIXEDMAP VMAs with customised protection flags,
however this has now been rolled back as issues were found with this
approach.
This series removes the mm changes too, retaining some of the useful
comments.
This patch (of 3):
The sole user of vmf_insert_mixed_prot(), the drm ttm module, stopped
using this in commit f91142c62161 ("drm/ttm: nuke VM_MIXEDMAP on BO
mappings v3") citing use of VM_MIXEDMAP in this case being terribly
broken.
Remove this now-dead code and references to it, but retain the useful
description of the prot != vma->vm_page_prot case, moving it to
vmf_insert_pfn_prot() instead.
Link: https://lkml.kernel.org/r/cover.1678661628.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/a069644388e6f1593a7020d15840e6fc9f39bcaf.1678661628.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tomas Mudrunka [Tue, 21 Mar 2023 10:34:30 +0000 (11:34 +0100)]
mm/memtest: add results of early memtest to /proc/meminfo
Currently the memtest results were only presented in dmesg.
When running a large fleet of devices without ECC RAM it's currently not
easy to do bulk monitoring for memory corruption. You have to parse
dmesg, but that's a ring buffer so the error might disappear after some
time. In general I do not consider dmesg to be a great API to query RAM
status.
In several companies I've seen such errors remain undetected and cause
issues for way too long. So I think it makes sense to provide a
monitoring API, so that we can safely detect and act upon them.
This adds /proc/meminfo entry which can be easily used by scripts.
Mike Rapoport (IBM) [Tue, 21 Mar 2023 17:05:09 +0000 (19:05 +0200)]
init,mm: fold late call to page_ext_init() to page_alloc_init_late()
When deferred initialization of struct pages is enabled, page_ext_init()
must be called after all the deferred initialization is done, but there is
no point to keep it a separate call from kernel_init_freeable() right
after page_alloc_init_late().
Fold the call to page_ext_init() into page_alloc_init_late() and localize
deferred_struct_pages variable.
Link: https://lkml.kernel.org/r/20230321170513.2401534-11-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Tue, 21 Mar 2023 17:05:03 +0000 (19:05 +0200)]
mm: handle hashdist initialization in mm/mm_init.c
The hashdist variable must be initialized before the first call to
alloc_large_system_hash() and free_area_init() looks like a better place
for it than page_alloc_init().
Move hashdist handling to mm/mm_init.c
Link: https://lkml.kernel.org/r/20230321170513.2401534-5-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Tue, 21 Mar 2023 17:05:02 +0000 (19:05 +0200)]
mm: move most of core MM initialization to mm/mm_init.c
The bulk of memory management initialization code is spread all over
mm/page_alloc.c and makes navigating through page allocator functionality
difficult.
Move most of the functions marked __init and __meminit to mm/mm_init.c to
make it better localized and allow some more spare room before
mm/page_alloc.c reaches 10k lines.
No functional changes.
Link: https://lkml.kernel.org/r/20230321170513.2401534-4-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Tue, 21 Mar 2023 17:05:00 +0000 (19:05 +0200)]
mips: fix comment about pgtable_init()
Patch series "mm: move core MM initialization to mm/mm_init.c", v2.
This set moves most of the core MM initialization to mm/mm_init.c.
This largely includes free_area_init() and its helpers, functions used at
boot time, mm_init() from init/main.c and some of the functions it calls.
Aside from gaining some more space before mm/page_alloc.c hits 10k lines,
this makes mm/page_alloc.c to be mostly about buddy allocator and moves
the init code out of the way, which IMO improves maintainability.
Besides, this allows to move a couple of declarations out of include/linux
and make them private to mm/.
And as an added bonus there a slight decrease in vmlinux size. For
tinyconfig and defconfig on x86 I've got
tinyconfig:
text data bss dec hex filename
853206 289376 12001282342710 23bf36 a/vmlinux
853198 289344 12001282342670 23bf0e b/vmlinux
Lorenzo Stoakes [Tue, 21 Mar 2023 18:09:55 +0000 (18:09 +0000)]
MAINTAINERS: add Lorenzo as vmalloc reviewer
I have recently been involved in both reviewing and submitting patches to
the vmalloc code in mm and would be willing and happy to help out with
review going forward if it would be helpful!
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:45 +0000 (15:03 -0300)]
vmstat: add pcp remote node draining via cpu_vm_stats_fold
Large NUMA systems might have significant portions of system memory to be
trapped in pcp queues. The number of pcp is determined by the number of
processors and nodes in a system. A system with 4 processors and 2 nodes
has 8 pcps which is okay. But a system with 1024 processors and 512 nodes
has 512k pcps with a high potential for large amount of memory being
caught in them.
Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case, where
vmstat_shepherd will perform the aging and draining via cpu_vm_stats_fold.
Link: https://lkml.kernel.org/r/20230320180745.858515310@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:43 +0000 (15:03 -0300)]
mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
With a task that busy loops on a given CPU, the kworker interruption to
execute vmstat_update is undesired and may exceed latency thresholds for
certain applications.
switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.
To fix this, now that the counters are modified via cmpxchg both CPU
locally (via the account functions), and remotely (via cpu_vm_stats_fold),
its possible to switch vmstat_shepherd to perform the per-CPU vmstats
folding remotely.
Link: https://lkml.kernel.org/r/20230320180745.807656081@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:40 +0000 (15:03 -0300)]
mm/vmstat: switch counter modification to cmpxchg
In preparation for switching vmstat shepherd to flush per-CPU counters
remotely, switch the __{mod,inc,dec} functions that modify the counters to
use cmpxchg.
To facilitate reviewing, functions are ordered in the text file, as:
To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
For the single_page_alloc_free test, which does
/** Loop to measure **/
for (i = 0; i < rec->loops; i++) {
my_page = alloc_page(gfp_mask);
if (unlikely(my_page == NULL))
return 0;
__free_page(my_page);
}
Unit is cycles.
Vanilla Patched Diff
115.25 117 1.4%
Link: https://lkml.kernel.org/r/20230320180745.733575720@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:37 +0000 (15:03 -0300)]
this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
Goal is to have vmstat_shepherd to transfer from per-CPU counters to
global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.
Following the kernel convention for cmpxchg/cmpxchg_local, change x86's
this_cpu_cmpxchg_ helpers to be atomic. and add this_cpu_cmpxchg_local_
helpers which are not atomic.
Link: https://lkml.kernel.org/r/20230320180745.658574087@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>