www.infradead.org Git - nvme.git/log
nvme.git
9 months ago  kmsan: fix kmsan_copy_to_user() on arches with overlapping address spaces
Ilya Leoshkevich [Fri, 21 Jun 2024 11:34:50 +0000 (13:34 +0200)]
kmsan: fix kmsan_copy_to_user() on arches with overlapping address spaces

Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap.  Assume that we are handling user memory access in this
case.
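
A minimal sketch of the idea; the helper name and the preprocessor guard
below are hypothetical stand-ins for the actual architecture plumbing:

 /*
  * Sketch only: when kernel and user address spaces overlap, a pointer
  * value alone cannot distinguish the two, so treat the access as a user
  * access instead of comparing against TASK_SIZE.
  */
 static bool kmsan_access_is_user(const void *ptr)
 {
 #ifdef ARCH_USER_KERNEL_OVERLAP         /* hypothetical guard */
         return true;
 #else
         return (unsigned long)ptr < TASK_SIZE;
 #endif
 }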

Link: https://lkml.kernel.org/r/20240621113706.315500-7-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reported-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  kmsan: fix is_bad_asm_addr() on arches with overlapping address spaces
Ilya Leoshkevich [Fri, 21 Jun 2024 11:34:49 +0000 (13:34 +0200)]
kmsan: fix is_bad_asm_addr() on arches with overlapping address spaces

Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap.  Skip the comparison when this is the case.

Link: https://lkml.kernel.org/r/20240621113706.315500-6-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  kmsan: increase the maximum store size to 4096
Ilya Leoshkevich [Fri, 21 Jun 2024 11:34:48 +0000 (13:34 +0200)]
kmsan: increase the maximum store size to 4096

The inline assembly block in s390's chsc() stores that much.

Link: https://lkml.kernel.org/r/20240621113706.315500-5-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  kmsan: disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
Ilya Leoshkevich [Fri, 21 Jun 2024 11:34:47 +0000 (13:34 +0200)]
kmsan: disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled

KMSAN relies on memblock returning all available pages to it (see
kmsan_memblock_free_pages()).  It partitions these pages into 3
categories: pages available to the buddy allocator, shadow pages and
origin pages.  This partitioning is static.

If new pages appear after kmsan_init_runtime(), it is considered an error.
DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as incompatible with
KMSAN.

Link: https://lkml.kernel.org/r/20240621113706.315500-4-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  kmsan: make the tests compatible with kmsan.panic=1
Ilya Leoshkevich [Fri, 21 Jun 2024 11:34:46 +0000 (13:34 +0200)]
kmsan: make the tests compatible with kmsan.panic=1

It's useful to have both the tests and kmsan.panic=1 during development, but
right now the warnings that the tests cause lead to kernel panics.

Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.
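
A sketch of one way to do this with KUnit suite hooks; the flag name
panic_on_kmsan and the helper names are assumptions for illustration, not
taken from the actual patch:

 #include <kunit/test.h>

 extern bool panic_on_kmsan;     /* assumed name of the kmsan.panic knob */

 static bool kmsan_saved_panic;

 static int kmsan_test_suite_init(struct kunit_suite *suite)
 {
         kmsan_saved_panic = panic_on_kmsan;
         panic_on_kmsan = false;         /* keep expected warnings from panicking */
         return 0;
 }

 static void kmsan_test_suite_exit(struct kunit_suite *suite)
 {
         panic_on_kmsan = kmsan_saved_panic;
 }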

Link: https://lkml.kernel.org/r/20240621113706.315500-3-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  ftrace: unpoison ftrace_regs in ftrace_ops_list_func()
Ilya Leoshkevich [Fri, 21 Jun 2024 11:34:45 +0000 (13:34 +0200)]
ftrace: unpoison ftrace_regs in ftrace_ops_list_func()

Patch series "kmsan: Enable on s390", v7.

Architectures use assembly code to initialize ftrace_regs and call
ftrace_ops_list_func().  Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on ftrace_ops_list_func() entry.  This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the architecture-specific assembly code and always
unpoisoning ftrace_regs in ftrace_ops_list_func.

The fact that the issue has not been encountered on x86_64 so far is only
an accident: the assembly-allocated ftrace_regs was overlapping a stale,
partially unpoisoned stack frame.  Poisoning stack frames before returns [1]
makes the issue appear on x86_64 as well.

[1] https://github.com/iii-i/llvm-project/commits/msan-poison-allocas-before-returning-2024-06-12/
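
A minimal sketch of the approach (not the exact upstream hunk): mark the
whole assembly-initialized ftrace_regs as initialized on entry to the
common list handler.

 #include <linux/ftrace.h>
 #include <linux/kmsan-checks.h>

 /* Sketch: trust the arch asm and unpoison fregs before dispatching. */
 static void ftrace_ops_list_func_sketch(unsigned long ip, unsigned long parent_ip,
                                         struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
         kmsan_unpoison_memory(fregs, sizeof(*fregs));
         /* ... iterate the registered ftrace_ops as before ... */
 }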

Link: https://lkml.kernel.org/r/20240621113706.315500-1-iii@linux.ibm.com
Link: https://lkml.kernel.org/r/20240621113706.315500-2-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  Docs/mm/damon/maintainer-profile: document DAMON community meetups
SeongJae Park [Fri, 21 Jun 2024 16:36:26 +0000 (09:36 -0700)]
Docs/mm/damon/maintainer-profile: document DAMON community meetups

The DAMON bi-weekly community meetup series has continued since 2022-08-15
for community members who prefer synchronous chat over asynchronous mails.
Recently I got some feedback about the series from a few people.  They
told me the series is helpful for understanding the project and
participating in the development, but that its visibility could be further
improved.  Based on that, I started sending a meeting reminder for every
occurrence.  For people who don't subscribe to the mailing list, however,
adding an announcement to the official document could be helpful.
Document the series on the DAMON maintainer's profile for that purpose.

Link: https://lkml.kernel.org/r/20240621163626.74815-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  Docs/mm/damon/maintainer-profile: introduce HacKerMaiL
SeongJae Park [Fri, 21 Jun 2024 16:36:25 +0000 (09:36 -0700)]
Docs/mm/damon/maintainer-profile: introduce HacKerMaiL

Patch series "Docs/mm/damon/maintaier-profile: document a mailing tool and
community meetup series", v2.

There is a mailing tool that is developed and maintained by the DAMON
maintainer, aiming to support the DAMON community.  There is also a DAMON
community meetup series.  Both are known to have room for improvement in
terms of their visibility.  Document both on the maintainer's profile
document.

This patch (of 2):

Since DAMON was merged into the mainline, I have periodically received
questions about DAMON's mailing-list-based workflow.  The workflow is no
different from the normal, well-documented ones, but it is also true that
it is not always easy and familiar for everyone.

I personally overcame it by developing and using a simple tool named
HacKerMaiL (hkml)[1].  Based on my experience, I believe it is mature
enough to be used for simple workflows like that of DAMON.  In fact, some
DAMON contributors and Linux kernel developers other than myself have told
me they are using the tool.

As the DAMON maintainer, I also believe that helping new DAMON community
members onboard to the workflow is one of the most important parts of my
responsibilities.  For that reason, the tool was announced[2] as supporting
the DAMON community.  To further increase the visibility of that fact,
document the tool and the support plan on the DAMON maintainer's profile.

[1] https://github.com/damonitor/hackermail
[2] https://github.com/damonitor/hackermail/commit/3909dad91301

Link: https://lkml.kernel.org/r/20240621163626.74815-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240621163626.74815-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: read page_type using READ_ONCE
David Hildenbrand [Fri, 31 May 2024 12:56:16 +0000 (14:56 +0200)]
mm: read page_type using READ_ONCE

KCSAN complains about possible data races: while we check for a page_type
-- for example for sanity checks -- we might concurrently modify the
mapcount that overlays page_type.

Let's use READ_ONCE to avoid load tearing (shouldn't make a difference)
and to make KCSAN happy.

Likely, we might also want to use WRITE_ONCE for the writer side of
page_type, if KCSAN ever complains about that.  But we'll not mess with
that for now.

Note: nothing should really be broken besides wrong KCSAN complaints.  The
sanity check that triggers this was added in commit 68f0320824fa
("mm/rmap: convert folio_add_file_rmap_range() into
folio_add_file_rmap_[pte|ptes|pmd]()").  Even before that, similar races
were likely possible, ever since we added page_type in commit
6e292b9be7f4 ("mm: split page_type out from _mapcount").
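
A sketch of the pattern being applied; the helper is illustrative, the
point is the READ_ONCE() on the overlaid field:

 #include <linux/mm_types.h>

 /*
  * page_type overlays the atomically-updated mapcount, so read it once to
  * avoid load tearing while a concurrent mapcount update is in flight.
  */
 static inline bool page_type_equals(const struct page *page, unsigned int type)
 {
         unsigned int page_type = READ_ONCE(page->page_type);

         return page_type == type;
 }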

Link: https://lkml.kernel.org/r/20240531125616.2850153-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-lkp@intel.com
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: memory: rename pages_per_huge_page to nr_pages
Kefeng Wang [Tue, 18 Jun 2024 09:12:42 +0000 (17:12 +0800)]
mm: memory: rename pages_per_huge_page to nr_pages

Since the callers are converted to use nr_pages naming, use it inside too.

Link: https://lkml.kernel.org/r/20240618091242.2140164-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: memory: improve copy_user_large_folio()
Kefeng Wang [Tue, 18 Jun 2024 09:12:41 +0000 (17:12 +0800)]
mm: memory: improve copy_user_large_folio()

Use nr_pages instead of pages_per_huge_page and move the address alignment
from copy_user_large_folio() into the callers since it is only needed when
we don't know which address will be accessed.

Link: https://lkml.kernel.org/r/20240618091242.2140164-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: memory: use folio in struct copy_subpage_arg
Kefeng Wang [Tue, 18 Jun 2024 09:12:40 +0000 (17:12 +0800)]
mm: memory: use folio in struct copy_subpage_arg

Directly use folio in struct copy_subpage_arg.

Link: https://lkml.kernel.org/r/20240618091242.2140164-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: memory: convert clear_huge_page() to folio_zero_user()
Kefeng Wang [Tue, 18 Jun 2024 09:12:39 +0000 (17:12 +0800)]
mm: memory: convert clear_huge_page() to folio_zero_user()

Patch series "mm: improve clear and copy user folio", v2.

Some folio conversions.  An improvement is to move address alignment into
the caller as it is only needed if we don't know which address will be
accessed when clearing/copying user folios.

This patch (of 4):

Replace clear_huge_page() with folio_zero_user(), which takes a folio
instead of a page.  Get the number of pages directly via folio_nr_pages()
to remove the pages_per_huge_page argument.  Furthermore, move the address
alignment from folio_zero_user() to the callers, since the alignment
is only needed when we don't know which address will be accessed.
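
A sketch of the resulting call pattern at a caller, assuming the
post-conversion signature folio_zero_user(folio, addr_hint):

 #include <linux/mm.h>

 /* The caller supplies the faulting address as a hint; the helper derives
  * the page count via folio_nr_pages() internally. */
 static void zero_new_user_folio(struct folio *folio, unsigned long fault_addr)
 {
         folio_zero_user(folio, fault_addr);
 }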

Link: https://lkml.kernel.org/r/20240618091242.2140164-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240618091242.2140164-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/page_alloc: reword the comment of buddy_merge_likely()
Wei Yang [Wed, 19 Jun 2024 01:06:12 +0000 (01:06 +0000)]
mm/page_alloc: reword the comment of buddy_merge_likely()

For a page with order O, we are checking its order (O + 1) buddy.  If it
is free, we would like to put it at the tail and expect it to be merged
into a page with order (O + 2).

Reword the comment to reflect it.

Link: https://lkml.kernel.org/r/20240619010612.20740-4-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/page_alloc: fix a typo in comment about GFP flag
Wei Yang [Wed, 19 Jun 2024 01:06:11 +0000 (01:06 +0000)]
mm/page_alloc: fix a typo in comment about GFP flag

The GFP flag used to choose the zonelist is __GFP_THISNODE.

Let's change it to exactly what it should be.

Link: https://lkml.kernel.org/r/20240619010612.20740-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/mm_init.c: move build check on MAX_ZONELISTS out of ifdef
Wei Yang [Wed, 19 Jun 2024 01:06:10 +0000 (01:06 +0000)]
mm/mm_init.c: move build check on MAX_ZONELISTS out of ifdef

The current build check on MAX_ZONELISTS is wrapped in CONFIG_DEBUG_MEMORY_INIT,
so it may not be exercised in all configurations.

Let's move it out to a more general place.

Link: https://lkml.kernel.org/r/20240619010612.20740-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/sparse: nr_pages won't be 0
Wei Yang [Wed, 19 Jun 2024 01:06:09 +0000 (01:06 +0000)]
mm/sparse: nr_pages won't be 0

The function subsection_map_init() is only used in free_area_init(), in
the loop of for_each_mem_pfn_range(), and we are sure that in each
iteration of for_each_mem_pfn_range(), start_pfn < end_pfn.

So nr_pages cannot be 0 and we can remove the check.

Link: https://lkml.kernel.org/r/20240619010612.20740-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/memory-failure: refactor log format in unpoison_memory
Jiaqi Yan [Wed, 19 Jun 2024 06:33:55 +0000 (06:33 +0000)]
mm/memory-failure: refactor log format in unpoison_memory

Logs from memory_failure and other memory-failure.c code follow the
format:

  "Memory failure: 0x{pfn}: ${lower_case_message}"

Convert the logs in unpoison_memory to follow similar format:

  "Unpoison: 0x${pfn}: ${lower_case_message}"

For example (from local test):
  [ 1331.938397] Unpoison: 0x144bc8: page was already unpoisoned

No functional change in this commit.
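
A sketch of a log call following the new format (the wrapper name is
illustrative):

 #include <linux/printk.h>

 static void report_already_unpoisoned(unsigned long pfn)
 {
         /* "Unpoison: 0x${pfn}: ${lower_case_message}" */
         pr_info("Unpoison: %#lx: page was already unpoisoned\n", pfn);
 }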

Link: https://lkml.kernel.org/r/20240619063355.171313-1-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/Kconfig: mention arm64 in DEFAULT_MMAP_MIN_ADDR symbol help text
Javier Martinez Canillas [Wed, 19 Jun 2024 08:30:38 +0000 (10:30 +0200)]
mm/Kconfig: mention arm64 in DEFAULT_MMAP_MIN_ADDR symbol help text

Currently ppc64 and x86 are mentioned as architectures where a 65536 value
is reasonable but arm64 isn't listed and it is also a 64-bit architecture.

The help text says that for "arm" the value should be no higher than 32768
but it's only talking about 32-bit ARM.  Adding arm64 to the above list
can make this more clear and avoid confusing users who may think that the
32k limit would also apply to 64-bit ARM.

Link: https://lkml.kernel.org/r/20240619083047.114613-1-javierm@redhat.com
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Cc: Brian Masney <bmasney@redhat.com>
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Maxime Ripard <mripard@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  maple_tree: modified return type of mas_wr_store_entry()
JaeJoon Jung [Fri, 14 Jun 2024 09:24:28 +0000 (18:24 +0900)]
maple_tree: modified return type of mas_wr_store_entry()

Since the return value of mas_wr_store_entry() is not used,
the return type can be changed to void.
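
The resulting declaration, roughly (parameter type as used by the maple
tree write state):

 /* No caller consumed the old return value, so the function can be void. */
 static void mas_wr_store_entry(struct ma_wr_state *wr_mas);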

Link: https://lkml.kernel.org/r/20240614092428.29491-1-rgbi3307@gmail.com
Signed-off-by: JaeJoon Jung <rgbi3307@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()
Barry Song [Mon, 17 Jun 2024 23:11:37 +0000 (11:11 +1200)]
mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

The folio_test_anon(folio)==false case has been relocated to
folio_add_new_anon_rmap().  Additionally, four other callers consistently
pass anonymous folios.

stack 1:
remove_migration_pmd
   -> folio_add_anon_rmap_pmd
     -> __folio_add_anon_rmap

stack 2:
__split_huge_pmd_locked
   -> folio_add_anon_rmap_ptes
      -> __folio_add_anon_rmap

stack 3:
remove_migration_pmd
   -> folio_add_anon_rmap_pmd
      -> __folio_add_anon_rmap (RMAP_LEVEL_PMD)

stack 4:
try_to_merge_one_page
   -> replace_page
     -> folio_add_anon_rmap_pte
       -> __folio_add_anon_rmap

__folio_add_anon_rmap() only needs to handle the
folio_test_anon(folio)==true case now, so we can remove the
!folio_test_anon(folio) path within __folio_add_anon_rmap().

Link: https://lkml.kernel.org/r/20240617231137.80726-4-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Shuai Yuan <yuanshuai@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false
Barry Song [Mon, 17 Jun 2024 23:11:36 +0000 (11:11 +1200)]
mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false

For the !folio_test_anon(folio) case, we can now invoke
folio_add_new_anon_rmap() with the rmap flags set to either EXCLUSIVE or
non-EXCLUSIVE.  This action will suppress the VM_WARN_ON_FOLIO check
within __folio_add_anon_rmap() while initiating the process of bringing up
mTHP swapin.

 static __always_inline void __folio_add_anon_rmap(struct folio *folio,
                 struct page *page, int nr_pages, struct vm_area_struct *vma,
                 unsigned long address, rmap_t flags, enum rmap_level level)
 {
         ...
         if (unlikely(!folio_test_anon(folio))) {
                 VM_WARN_ON_FOLIO(folio_test_large(folio) &&
                                  level != RMAP_LEVEL_PMD, folio);
         }
         ...
 }

It also improves the code's readability.  Currently, all new anonymous
folios calling folio_add_anon_rmap_ptes() are order-0.  This ensures that
new folios cannot be partially exclusive; they are either entirely
exclusive or entirely shared.

A useful comment from Hugh's fix:

: Commit "mm: use folio_add_new_anon_rmap() if folio_test_anon(folio)==
: false" has extended folio_add_new_anon_rmap() to use on non-exclusive
: folios, already visible to others in swap cache and on LRU.
:
: That renders its non-atomic __folio_set_swapbacked() unsafe: it risks
: overwriting concurrent atomic operations on folio->flags, losing bits
: added or restoring bits cleared.  Since it's only used in this risky way
: when folio_test_locked and !folio_test_anon, many such races are excluded;
: but, for example, isolations by folio_test_clear_lru() are vulnerable, and
: setting or clearing active.
:
: It could just use the atomic folio_set_swapbacked(); but this function
: does try to avoid atomics where it can, so use a branch instead: just
: avoid setting swapbacked when it is already set, that is good enough.
: (Swapbacked is normally stable once set: lazyfree can undo it, but only
: later, when found anon in a page table.)
:
: This fixes a lot of instability under compaction and swapping loads:
: assorted "Bad page"s, VM_BUG_ON_FOLIO()s, apparently even page double
: frees - though I've not worked out what races could lead to the latter.

[akpm@linux-foundation.org: comment fixes, per David and akpm]
[v-songbaohua@oppo.com: lock the folio to avoid race]
Link: https://lkml.kernel.org/r/20240622032002.53033-1-21cnbao@gmail.com
[hughd@google.com: folio_add_new_anon_rmap() careful __folio_set_swapbacked()]
Link: https://lkml.kernel.org/r/f3599b1d-8323-0dc5-e9e0-fdb3cfc3dd5a@google.com
Link: https://lkml.kernel.org/r/20240617231137.80726-3-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Shuai Yuan <yuanshuai@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: extend rmap flags arguments for folio_add_new_anon_rmap
Barry Song [Mon, 17 Jun 2024 23:11:35 +0000 (11:11 +1200)]
mm: extend rmap flags arguments for folio_add_new_anon_rmap

Patch series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()", v2.

This patchset is preparatory work for mTHP swapin.

folio_add_new_anon_rmap() assumes that new anon rmaps are always
exclusive.  However, this assumption doesn’t hold true for cases like
do_swap_page(), where a new anon might be added to the swapcache and is
not necessarily exclusive.

The patchset extends the rmap flags to allow folio_add_new_anon_rmap() to
handle both exclusive and non-exclusive new anon folios.  The
do_swap_page() function is updated to use this extended API with rmap
flags.  Consequently, all new anon folios now consistently use
folio_add_new_anon_rmap().  The special case for !folio_test_anon() in
__folio_add_anon_rmap() can be safely removed.

In conclusion, new anon folios always use folio_add_new_anon_rmap(),
regardless of exclusivity.  Old anon folios continue to use
__folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and
folio_add_anon_rmap_ptes().

This patch (of 3):

In the case of a swap-in, a new anonymous folio is not necessarily
exclusive.  This patch updates the rmap flags to allow a new anonymous
folio to be treated as either exclusive or non-exclusive.  To maintain the
existing behavior, we always use EXCLUSIVE as the default setting.
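
A sketch of the extended call as a swap-in caller might use it, with
RMAP_NONE / RMAP_EXCLUSIVE as the flag values; the wrapper name is
illustrative:

 #include <linux/rmap.h>

 /* New anon folios always go through folio_add_new_anon_rmap(), now with
  * an explicit exclusivity flag. */
 static void add_swapin_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
                                  unsigned long address, bool exclusive)
 {
         folio_add_new_anon_rmap(folio, vma, address,
                                 exclusive ? RMAP_EXCLUSIVE : RMAP_NONE);
 }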

[akpm@linux-foundation.org: cleanup and constifications per David and akpm]
[v-songbaohua@oppo.com: fix missing doc for flags of folio_add_new_anon_rmap()]
Link: https://lkml.kernel.org/r/20240619210641.62542-1-21cnbao@gmail.com
[v-songbaohua@oppo.com: enhance doc for extend rmap flags arguments for folio_add_new_anon_rmap]
Link: https://lkml.kernel.org/r/20240622030256.43775-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240617231137.80726-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240617231137.80726-2-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Shuai Yuan <yuanshuai@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  vmalloc: modify the alloc_vmap_area() error message for better diagnostics
Shubhang Kaushik OS [Mon, 10 Jun 2024 17:22:58 +0000 (17:22 +0000)]
vmalloc: modify the alloc_vmap_area() error message for better diagnostics

'vmap allocation for size %lu failed: use vmalloc=<size> to increase size'
The above warning is seen when vmap allocation exhausts the restricted
virtual memory range.

This message is misleading because 'vmalloc=' is supported on arm32 and x86
platforms but is not a valid kernel parameter on a number of other
platforms (in particular it's not supported on arm64, alpha, loongarch,
arc, csky, hexagon, microblaze, mips, nios2, openrisc, parisc, m68k,
powerpc, riscv, sh, um, xtensa, s390, sparc).  With the update, the output
gets modified to include the function parameters along with the start and
end of the virtual memory range allowed.
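
A minimal sketch of how such a range-aware message might be emitted; the
function and parameter names are illustrative:

 #include <linux/printk.h>

 static void warn_vmap_range_exhausted(const char *caller, unsigned long size,
                                       unsigned long vstart, unsigned long vend)
 {
         pr_warn("%s for size %lu failed: Address range restricted between 0x%lx - 0x%lx\n",
                 caller, size, vstart, vend);
 }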

The warning message after fix on kernel version 6.10.0-rc1+:

vmalloc_node_range for size 33619968 failed: Address range restricted between 0xffff800082640000 - 0xffff800084650000

Backtrace with the misleading error message:

vmap allocation for size 33619968 failed: use vmalloc=<size> to increase size
insmod: vmalloc error: size 33554432, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 46 PID: 1977 Comm: insmod Tainted: G            E      6.10.0-rc1+ #79
Hardware name: INGRASYS Yushan Server iSystem TEMP-S000141176+10/Yushan Motherboard, BIOS 2.10.20230517 (SCP: xxx) yyyy/mm/dd
Call trace:
dump_backtrace+0xa0/0x128
show_stack+0x20/0x38
dump_stack_lvl+0x78/0x90
dump_stack+0x18/0x28
warn_alloc+0x12c/0x1b8
__vmalloc_node_range_noprof+0x28c/0x7e0
custom_init+0xb4/0xfff8 [test_driver]
do_one_initcall+0x60/0x290
do_init_module+0x68/0x250
load_module+0x236c/0x2428
init_module_from_file+0x8c/0xd8
__arm64_sys_finit_module+0x1b4/0x388
invoke_syscall+0x78/0x108
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x130
el0t_64_sync_handler+0x100/0x130
el0t_64_sync+0x190/0x198

[Shubhang@os.amperecomputing.com: v5]
Link: https://lkml.kernel.org/r/CH2PR01MB5894B0182EA0B28DF2EFB916F5C72@CH2PR01MB5894.prod.exchangelabs.com
Link: https://lkml.kernel.org/r/MN2PR01MB59025CC02D1D29516527A693F5C62@MN2PR01MB5902.prod.exchangelabs.com
Signed-off-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Xiongwei Song <xiongwei.song@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when...
David Hildenbrand [Fri, 7 Jun 2024 09:09:38 +0000 (11:09 +0200)]
mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages when offlining

We currently have a hack for virtio-mem in place to handle memory
offlining with PageOffline pages for which we already adjusted the managed
page count.

Let's enlighten memory offlining code so we can get rid of that hack, and
document the situation.

Link: https://lkml.kernel.org/r/20240607090939.89524-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Marco Elver <elver@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of...
David Hildenbrand [Fri, 7 Jun 2024 09:09:37 +0000 (11:09 +0200)]
mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()

We currently initialize the memmap such that PG_reserved is set and the
refcount of the page is 1.  In virtio-mem code, we have to manually clear
that PG_reserved flag to make memory offlining with partially hotplugged
memory blocks possible: has_unmovable_pages() would otherwise bail out on
such pages.

We want to avoid PG_reserved where possible and move to typed pages
instead.  Further, we want to enlighten memory offlining code more
about PG_offline: offline pages in an online memory section.  One example
is handling managed page count adjustments in a cleaner way during memory
offlining.

So let's initialize the pages with PG_offline instead of PG_reserved.
generic_online_page()->__free_pages_core() will now clear that flag before
handing that memory to the buddy.

Note that the page refcount is still 1 and would forbid offlining of such
memory except when special care is taken during GOING_OFFLINE, as currently
only implemented by virtio-mem.

With this change, we can now get non-PageReserved() pages in the XEN
balloon list.  From what I can tell, that can already happen via
decrease_reservation(), so that should be fine.

HV-balloon should not really observe a change: partial online memory
blocks still cannot get surprise-offlined, because the refcount of these
PageOffline() pages is 1.

Update virtio-mem, HV-balloon and XEN-balloon code to be aware that
hotplugged pages are now PageOffline() instead of PageReserved() before
they are handed over to the buddy.

We'll leave the ZONE_DEVICE case alone for now.

Note that self-hosted vmemmap pages will no longer be marked as
reserved.  This matches ordinary vmemmap pages allocated from the buddy
during memory hotplug.  Now, really only vmemmap pages allocated from
memblock during early boot will be marked reserved.  Existing
PageReserved() checks seem to be handling all relevant cases correctly
even after this change.
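
A sketch of the new convention; the wrapper functions are illustrative,
the page-type helpers are the ones generated for the Offline type:

 #include <linux/page-flags.h>

 /* Hotplugged !ZONE_DEVICE memmap pages start out PageOffline()... */
 static void init_hotplugged_memmap_page(struct page *page)
 {
         __SetPageOffline(page);         /* instead of marking it reserved */
 }

 /* ...and the type is cleared only when the page goes to the buddy. */
 static void hand_page_to_buddy(struct page *page)
 {
         __ClearPageOffline(page);
 }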

Link: https://lkml.kernel.org/r/20240607090939.89524-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de> [generic memory-hotplug bits]
Cc: Alexander Potapenko <glider@google.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Marco Elver <elver@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: pass meminit_context to __free_pages_core()
David Hildenbrand [Fri, 7 Jun 2024 09:09:36 +0000 (11:09 +0200)]
mm: pass meminit_context to __free_pages_core()

Patch series "mm/memory_hotplug: use PageOffline() instead of
PageReserved() for !ZONE_DEVICE".

This can be considered a long-overdue follow-up to some parts of [1].
The patches are based on [2], but they are not strictly required -- just
makes it clearer why we can use adjust_managed_page_count() for memory
hotplug without going into details about highmem.

We stop initializing pages with PageReserved() in memory hotplug code --
except when dealing with ZONE_DEVICE for now.  Instead, we use
PageOffline(): all pages are initialized to PageOffline() when onlining a
memory section, and only the ones actually getting exposed to the
system/page allocator will get PageOffline cleared.

This way, we enlighten memory hotplug more about PageOffline() pages and
can cleanup some hacks we have in virtio-mem code.

What about ZONE_DEVICE?  PageOffline() is wrong, but we might just stop
using PageReserved() for them later by simply checking for
is_zone_device_page() at suitable places.  That will be a separate patch
set / proposal.

This primarily affects virtio-mem, HV-balloon and XEN balloon. I only
briefly tested with virtio-mem, which benefits most from these cleanups.

[1] https://lore.kernel.org/all/20191024120938.11237-1-david@redhat.com/
[2] https://lkml.kernel.org/r/20240607083711.62833-1-david@redhat.com

This patch (of 3):

In preparation for further changes, let's teach __free_pages_core() about
the differences of memory hotplug handling.

Move the memory hotplug specific handling from generic_online_page() to
__free_pages_core(), use adjust_managed_page_count() on the memory hotplug
path, and spell out why memory freed via memblock cannot currently use
adjust_managed_page_count().
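
A sketch of the direction described above, showing only the accounting
decision (the actual freeing path and function body are omitted):

 #include <linux/mm.h>
 #include <linux/mmzone.h>

 void __free_pages_core_sketch(struct page *page, unsigned int order,
                               enum meminit_context context)
 {
         if (context == MEMINIT_HOTPLUG)
                 adjust_managed_page_count(page, 1L << order);
         /* memblock-freed early pages keep their existing accounting path */
 }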

[david@redhat.com: add missed CONFIG_DEFERRED_STRUCT_PAGE_INIT]
Link: https://lkml.kernel.org/r/b72e6efd-fb0a-459c-b1a0-88a98e5b19e2@redhat.com
[david@redhat.com: fix up the memblock comment, per Oscar]
Link: https://lkml.kernel.org/r/2ed64218-7f3b-4302-a5dc-27f060654fe2@redhat.com
[david@redhat.com: add the parameter name also in the declaration]
Link: https://lkml.kernel.org/r/ca575956-f0dd-4fb9-a307-6b7621681ed9@redhat.com
Link: https://lkml.kernel.org/r/20240607090939.89524-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240607090939.89524-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Marco Elver <elver@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: remove page_mkclean()
Kefeng Wang [Tue, 4 Jun 2024 11:48:22 +0000 (19:48 +0800)]
mm: remove page_mkclean()

There are no more users of page_mkclean(), remove it and update the
document and comment.

Link: https://lkml.kernel.org/r/20240604114822.2089819-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  fb_defio: use a folio in fb_deferred_io_work()
Kefeng Wang [Tue, 4 Jun 2024 11:48:21 +0000 (19:48 +0800)]
fb_defio: use a folio in fb_deferred_io_work()

Replace three calls to compound_head() with one, which removes the last
caller of page_mkclean().

Link: https://lkml.kernel.org/r/20240604114822.2089819-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: remove page_maybe_dma_pinned()
Kefeng Wang [Tue, 4 Jun 2024 11:48:20 +0000 (19:48 +0800)]
mm: remove page_maybe_dma_pinned()

After the last user of page_maybe_dma_pinned() is converted to
folio_maybe_dma_pinned(), remove page_maybe_dma_pinned() and update the
document and comment.

[wangkefeng.wang@huawei.com: fix pin_user_pages.rst underlining]
Link: https://lkml.kernel.org/r/61b256c7-4989-44ec-83db-f34a1bd4be2d@huawei.com
Link: https://lkml.kernel.org/r/20240604114822.2089819-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  fs/proc/task_mmu: use folio API in pte_is_pinned()
Kefeng Wang [Tue, 4 Jun 2024 11:48:19 +0000 (19:48 +0800)]
fs/proc/task_mmu: use folio API in pte_is_pinned()

Patch series "mm: remove page_maybe_dma_pinned() and page_mkclean()".

Most page_maybe_dma_pinned() and page_mkclean() callers have been
converted to the folio equivalents; after two more conversions,
remove them and update the comment and documentation.

This patch (of 4):

Convert to use vm_normal_folio() and folio_maybe_dma_pinned() API, which
helps to remove page_maybe_dma_pinned() in the subsequent change.
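
A sketch of the folio-based check (simplified; locking and error handling
omitted):

 #include <linux/mm.h>

 static bool pte_folio_maybe_pinned(struct vm_area_struct *vma,
                                    unsigned long addr, pte_t pte)
 {
         struct folio *folio = vm_normal_folio(vma, addr, pte);

         return folio && folio_maybe_dma_pinned(folio);
 }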

Link: https://lkml.kernel.org/r/20240604114822.2089819-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20240604114822.2089819-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Helge Deller <deller@gmx.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/mm_init: initialize page->_mapcount directly in __init_single_page()
David Hildenbrand [Wed, 29 May 2024 11:19:04 +0000 (13:19 +0200)]
mm/mm_init: initialize page->_mapcount directly in __init_single_page()

Let's simply reinitialize the page->_mapcount directly.  We can now get
rid of page_mapcount_reset().
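
For reference, a sketch of what this amounts to; the helper name is
illustrative, page_mapcount_reset() was essentially this one atomic_set():

 #include <linux/mm_types.h>

 static inline void __init_page_mapcount(struct page *page)
 {
         atomic_set(&page->_mapcount, -1);       /* -1 means "unmapped" */
 }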

Link: https://lkml.kernel.org/r/20240529111904.2069608-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads]
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/filemap: reinitialize folio->_mapcount directly
David Hildenbrand [Wed, 29 May 2024 11:19:03 +0000 (13:19 +0200)]
mm/filemap: reinitialize folio->_mapcount directly

Let's get rid of the page_mapcount_reset() call and simply reinitialize
folio->_mapcount directly.

Link: https://lkml.kernel.org/r/20240529111904.2069608-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads]
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/page_alloc: clear PageBuddy using __ClearPageBuddy() for bad pages
David Hildenbrand [Wed, 29 May 2024 11:19:02 +0000 (13:19 +0200)]
mm/page_alloc: clear PageBuddy using __ClearPageBuddy() for bad pages

Let's stop using page_mapcount_reset() and clear PageBuddy using
__ClearPageBuddy() instead.

Link: https://lkml.kernel.org/r/20240529111904.2069608-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads]
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/zsmalloc: use a proper page type
David Hildenbrand [Wed, 29 May 2024 11:19:01 +0000 (13:19 +0200)]
mm/zsmalloc: use a proper page type

Let's clean it up: use a proper page type and store our data (offset into
a page) in the lower 16 bit as documented.

We won't be able to support 256 KiB base pages, which is acceptable.
Teach Kconfig to handle that cleanly using a new CONFIG_HAVE_ZSMALLOC.

Based on this, we should do a proper "struct zsdesc" conversion, as
proposed in [1].

This removes the last _mapcount/page_type offender.

[1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/

Link: https://lkml.kernel.org/r/20240529111904.2069608-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads]
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: allow reuse of the lower 16 bit of the page type with an actual type
David Hildenbrand [Wed, 29 May 2024 11:19:00 +0000 (13:19 +0200)]
mm: allow reuse of the lower 16 bit of the page type with an actual type

As long as the owner sets a page type first, we can allow reuse of the
lower 16 bit: sufficient to store an offset into a 64 KiB page, which is
the maximum base page size in *common* configurations (ignoring the 256
KiB variant).  Restrict it to the head page.

We'll use that for zsmalloc next, to set a proper type while still reusing
that field to store information (offset into a base page) that cannot go
elsewhere for now.

Let's reserve the lower 16 bit for that purpose and for catching mapcount
underflows, and let's reduce PAGE_TYPE_BASE to a single bit.

Note that the mapcount would still have to overflow quite a lot before it
would actually indicate a valid page type.

Start handing out the type bits from highest to lowest, to make it clearer
how many bits for types we have left.  Out of 15 bit we can use for types,
we currently use 6.  If we run out of bits before we have better typing
(e.g., memdesc), we can always investigate storing a value instead [1].

[1] https://lore.kernel.org/all/00ba1dff-7c05-46e8-b0d9-a78ac1cfc198@redhat.com/

[akpm@linux-foundation.org: fix PG_hugetlb typo, per David]
Link: https://lkml.kernel.org/r/20240529111904.2069608-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads]
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm: update _mapcount and page_type documentation
David Hildenbrand [Wed, 29 May 2024 11:18:59 +0000 (13:18 +0200)]
mm: update _mapcount and page_type documentation

Patch series "mm: page_type, zsmalloc and page_mapcount_reset()", v2.

Wanting to remove the remaining abuser of _mapcount/page_type along with
page_mapcount_reset(), I stumbled over zsmalloc, which is yet to be
converted away from "struct page" [1].

Unfortunately, we cannot stop using the page_type field in zsmalloc code
completely for its own purposes.  All other fields in "struct page" are
used one way or the other.  Could we simply store a 2-byte offset value at
the beginning of each page?  Likely, but that will require a bit more
work; and once we have memdesc we might want to move the offset in there
(struct zsalloc?) again.

...  but we can limit the abuse to 16 bit, glue it to a page type that
must be set, and document it.  page_has_type() will always successfully
indicate such zsmalloc pages, and such zsmalloc pages only.

We lose zsmalloc support for PAGE_SIZE > 64KB, which should be tolerable.
We could use more bits from the page type, but 16 bit sounds like a good
idea for now.

So clarify the _mapcount/page_type documentation, use a proper page_type
for zsmalloc, and remove page_mapcount_reset().

[1] https://lore.kernel.org/all/20231130101242.2590384-1-42.hyeyoo@gmail.com/

This patch (of 6):

Let's make it clearer that _mapcount must no longer be used for own
purposes, and how _mapcount and page_type behaves nowadays (also in the
context of hugetlb folios, which are typed folios that will be mapped to
user space).

Move the documentation regarding "-1" over from page_mapcount_reset(),
which we will remove next.  Move "page_type" before "mapcount", to make it
clearer what typed folios are.

Link: https://lkml.kernel.org/r/20240529111904.2069608-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240529111904.2069608-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram/zsmalloc workloads]
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  selftests/mm: remove local __NR_* definitions
John Hubbard [Tue, 18 Jun 2024 02:24:22 +0000 (19:24 -0700)]
selftests/mm: remove local __NR_* definitions

This continues the work on getting the selftests to build without
requiring people to first run "make headers" [1].

Now that the system call numbers are in the correct, checked-in locations
in the kernel tree (./tools/include/uapi/asm/unistd*.h), make sure that
the mm selftests include that file (indirectly).

Doing so provides guaranteed definitions at build time, so remove all of
the checks for "ifdef __NR_xxx" in the mm selftests, because they will
always be true (defined).

[1] commit e076eaca5906 ("selftests: break the dependency upon local
header files")

Link: https://lkml.kernel.org/r/20240618022422.804305-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jeff Xu <jeffxu@chromium.org>
Cc: Andrei Vagin <avagin@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  mm/huge_memory.c: fix used-uninitialized
Andrew Morton [Tue, 25 Jun 2024 21:51:36 +0000 (14:51 -0700)]
mm/huge_memory.c: fix used-uninitialized

Fix a used-uninitialized warning for `page'.

Fixes: dce7d10be4bb ("mm/madvise: optimize lazyfreeing with mTHP in madvise_free")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406260514.SLhNM9kQ-lkp@intel.com
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  nilfs2: fix incorrect inode allocation from reserved inodes
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:35 +0000 (14:11 +0900)]
nilfs2: fix incorrect inode allocation from reserved inodes

If the bitmap block that manages the inode allocation status is corrupted,
nilfs_ifile_create_inode() may allocate a new inode from the reserved
inode area where it should not be allocated.

The previous fix, commit d325dc6eb763 ("nilfs2: fix use-after-free bug of
struct nilfs_root"), addressed the problem that reserved inodes with inode
numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
bitmap corruption.  However, the start number of non-reserved inodes is
read from the super block and may change, in which case inode allocation
may occur from the extended reserved inode area.

If that happens, access to that inode will cause an IO error, causing the
file system to degrade to an error state.

Fix this potential issue by adding a wraparound option to the common
metadata object allocation routine and by modifying
nilfs_ifile_create_inode() to disable the option so that it only allocates
inodes with inode numbers greater than or equal to the inode number read
in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
inodes.

Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  nilfs2: add missing check for inode numbers on directory entries
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:34 +0000 (14:11 +0900)]
nilfs2: add missing check for inode numbers on directory entries

Syzbot reported that mounting and unmounting a specific pattern of
corrupted nilfs2 filesystem images causes a use-after-free of metadata
file inodes, which triggers a kernel bug in lru_add_fn().

As Jan Kara pointed out, this is because the link count of a metadata file
gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
tries to delete that inode (ifile inode in this case).

The inconsistency occurs because directories containing the inode numbers
of these metadata files, which should not be visible in the namespace, are
read without checking.

Fix this issue by treating the inode numbers of these internal files as
errors in the sanity check helper when reading directory folios/pages.

Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
analysis.

Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months ago  nilfs2: fix inode number range checks
Ryusuke Konishi [Sun, 23 Jun 2024 05:11:33 +0000 (14:11 +0900)]
nilfs2: fix inode number range checks

Patch series "nilfs2: fix potential issues related to reserved inodes".

This series fixes one use-after-free issue reported by syzbot, caused by
nilfs2's internal inode being exposed in the namespace on a corrupted
filesystem, and a couple of flaws that cause problems if the starting
number of non-reserved inodes written in the on-disk super block is
intentionally (or corruptly) changed from its default value.

This patch (of 3):

In the current implementation of nilfs2, "nilfs->ns_first_ino", which
gives the first non-reserved inode number, is read from the superblock,
but its lower limit is not checked.

As a result, if a number that overlaps with the inode number range of
reserved inodes such as the root directory or metadata files is set in the
super block parameter, the inode number test macros (NILFS_MDT_INODE and
NILFS_VALID_INODE) will not function properly.

In addition, these test macros use left bit-shift calculations with
the inode number as the shift count via the BIT macro, but the result of a
shift calculation that exceeds the bit width of an integer is undefined in
the C specification, so if "ns_first_ino" is set to a large value other
than the default value NILFS_USER_INO (=11), the macros may potentially
malfunction depending on the environment.

Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
by preventing bit shifts equal to or greater than the NILFS_USER_INO
constant in the inode number test macros.

Also, change the type of "ns_first_ino" from signed integer to unsigned
integer to avoid the need for type casting in comparisons such as the
lower bound check introduced this time.
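
As a rough sketch of the shift guard (the mask value and helper name below
are purely illustrative, not the real NILFS_MDT_INODE/NILFS_VALID_INODE
definitions), the BIT() shift is only evaluated when the inode number is
below NILFS_USER_INO, so the shift count can never reach the type width:

  /* Illustrative stand-in for the guarded inode-number test. */
  #define SKETCH_MDT_INO_MASK   (BIT(3) | BIT(4) | BIT(5) | BIT(6))

  static inline bool sketch_ino_is_mdt(unsigned long ino)
  {
          return ino < NILFS_USER_INO && (BIT(ino) & SKETCH_MDT_INO_MASK);
  }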

Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: avoid overflows in dirty throttling logic
Jan Kara [Fri, 21 Jun 2024 14:42:38 +0000 (16:42 +0200)]
mm: avoid overflows in dirty throttling logic

The dirty throttling logic is interspersed with assumptions that dirty
limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
fit into 64-bits).  If limits end up being larger, we will hit overflows,
possible divisions by 0 etc.  Fix these problems by never allowing so
large dirty limits as they have dubious practical value anyway.  For
dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
so large limits.  For dirty_ratio / dirty_background_ratio it isn't so
simple as the dirty limit is computed from the amount of available memory
which can change due to memory hotplug etc.  So when converting dirty
limits from ratios to numbers of pages, we just don't allow the result to
exceed UINT_MAX.

This is a root-only triggerable problem which occurs when the operator
sets dirty limits to >16 TB.
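
A sketch of the clamping idea when converting a ratio to a page count (the
function and variable names are illustrative; the real change lives in the
dirty throttling code):

  /* Illustrative: never let a ratio-derived dirty limit exceed UINT_MAX
   * pages, so later multiplications still fit into 64 bits. */
  static unsigned long sketch_ratio_to_pages(unsigned long ratio,
                                             unsigned long available_pages)
  {
          u64 thresh = div_u64((u64)ratio * available_pages, 100);

          return min_t(u64, thresh, UINT_MAX);
  }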

Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-By: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agoRevert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
Jan Kara [Fri, 21 Jun 2024 14:42:37 +0000 (16:42 +0200)]
Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"

Patch series "mm: Avoid possible overflows in dirty throttling".

Dirty throttling logic assumes dirty limits in page units fit into
32-bits.  This patch series makes sure this is true (see patch 2/2 for
more details).

This patch (of 2):

This reverts commit 9319b647902cbd5cc884ac08a8a6d54ce111fc78.

The commit is broken in several ways.  Firstly, the removed (u64) cast
from the multiplication will introduce a multiplication overflow on 32-bit
archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
default settings with 4GB of RAM will trigger this).  Secondly, the
div64_u64() is unnecessarily expensive on 32-bit archs.  We have
div64_ul() in case we want to be safe & cheap.  Thirdly, if dirty
thresholds are larger than 1<<32 pages, then dirty balancing is going to
blow up in many other spectacular ways anyway so trying to fix one
possible overflow is just moot.
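
The pattern the revert restores is roughly the following (variable names
are illustrative, not the exact wb_dirty_limits() code):

  unsigned long wb_bg_thresh;

  /* Illustrative: widen to u64 before the multiply so it cannot overflow
   * on 32-bit, and divide with the cheaper div64_ul(). */
  wb_bg_thresh = thresh ? div64_ul((u64)wb_thresh * bg_thresh, thresh) : 0;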

Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz
Fixes: 9319b647902c ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-By: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: optimize the redundant loop of mm_update_owner_next()
Jinliang Zheng [Thu, 20 Jun 2024 12:21:24 +0000 (20:21 +0800)]
mm: optimize the redundant loop of mm_update_owner_next()

When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.
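
One plausible shape of such an early exit, sketched below as an
illustration of the idea rather than the exact hunk: stop scanning the
task list once the mm user count shows no other task can still be using
this mm.

  /* Illustrative: inside the for_each_process() walk, give up early when
   * nobody else can possibly own this mm anymore. */
  if (atomic_read(&mm->mm_users) <= 1)
          break;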

Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agokhugepaged: simplify the allocation of slab caches
Hongfu Li [Tue, 18 Jun 2024 01:45:17 +0000 (09:45 +0800)]
khugepaged: simplify the allocation of slab caches

Use the new KMEM_CACHE() macro instead of calling kmem_cache_create()
directly to simplify the creation of SLAB caches.
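
The conversion follows the usual KMEM_CACHE() pattern, shown here for a
hypothetical struct rather than the exact khugepaged hunk:

  /* Before: name, size and alignment spelled out by hand. */
  cache = kmem_cache_create("foo_slot", sizeof(struct foo_slot),
                            __alignof__(struct foo_slot), 0, NULL);

  /* After: KMEM_CACHE() derives all of that from the struct type. */
  cache = KMEM_CACHE(foo_slot, 0);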

Link: https://lkml.kernel.org/r/20240618014517.25954-1-lihongfu@kylinos.cn
Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: ksm: drop KSM_KMEM_CACHE()
Kefeng Wang [Tue, 18 Jun 2024 08:12:01 +0000 (16:12 +0800)]
mm: ksm: drop KSM_KMEM_CACHE()

After commit 21fbd59136e0 ("ksm: add the ksm prefix to the names of the
ksm private structures"), we could directly use KMEM_CACHE().

Link: https://lkml.kernel.org/r/20240618081201.134985-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/lru_sort: remove unnecessary online tuning handling code
SeongJae Park [Tue, 18 Jun 2024 18:18:09 +0000 (11:18 -0700)]
mm/damon/lru_sort: remove unnecessary online tuning handling code

DAMON_LRU_SORT contains code for handling edge cases of online DAMON
parameters update.  It is no longer necessary since damon_commit_ctx()
takes care of those cases.  Remove the unnecessary code.

Link: https://lkml.kernel.org/r/20240618181809.82078-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/lru_sort: use damon_commit_ctx()
SeongJae Park [Tue, 18 Jun 2024 18:18:08 +0000 (11:18 -0700)]
mm/damon/lru_sort: use damon_commit_ctx()

DAMON_LRU_SORT manually manipulates the DAMON context struct for online
parameters update.  Since the struct contains not only input parameters
but also internal status and operation results, it is not that simple.
Indeed, we found and fixed a few bugs in the code.  Now DAMON core layer
provides a function for the usage, namely damon_commit_ctx().  Replace the
manual manipulation logic with the function.  The core layer function
could have its own bugs, but this change removes a source of bugs.

Link: https://lkml.kernel.org/r/20240618181809.82078-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/reclaim: remove unnecessary code for online tuning
SeongJae Park [Tue, 18 Jun 2024 18:18:07 +0000 (11:18 -0700)]
mm/damon/reclaim: remove unnecessary code for online tuning

DAMON_RECLAIM contains code for handling edge cases of online DAMON
parameters update.  It is no longer necessary since damon_commit_ctx()
takes care of those cases.  Remove the unnecessary code.

Link: https://lkml.kernel.org/r/20240618181809.82078-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/reclaim: use damon_commit_ctx()
SeongJae Park [Tue, 18 Jun 2024 18:18:06 +0000 (11:18 -0700)]
mm/damon/reclaim: use damon_commit_ctx()

DAMON_RECLAIM manually manipulates the DAMON context struct for online
parameters update.  Since the struct contains not only input parameters
but also internal status and operation results, it is not that simple.
Indeed, we found and fixed a few bugs in the code.  Now DAMON core layer
provides a function for the usage, namely damon_commit_ctx().  Replace the
manual manipulation logic with the function.  The core layer function
could have its own bugs, but this change removes a source of bugs.

Link: https://lkml.kernel.org/r/20240618181809.82078-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs-schemes: rename *_set_{schemes,scheme_filters,quota_score,schemes}()
SeongJae Park [Tue, 18 Jun 2024 18:18:05 +0000 (11:18 -0700)]
mm/damon/sysfs-schemes: rename *_set_{schemes,scheme_filters,quota_score,schemes}()

The functions were for updating DAMON structs that may or may not be
partially populated.  Hence they were not only adding items, but also
removing unnecessary items and updating items in-place.  A previous commit
has changed the functions to assume the structs are not partially
populated, and to only add items.  Make the names better explain the
behavior.

Link: https://lkml.kernel.org/r/20240618181809.82078-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs-schemes: remove unnecessary online tuning handling code
SeongJae Park [Tue, 18 Jun 2024 18:18:04 +0000 (11:18 -0700)]
mm/damon/sysfs-schemes: remove unnecessary online tuning handling code

damon/sysfs-schemes.c contains code for handling edge cases of online
DAMON parameters update.  That logic is no longer necessary since
damon_commit_ctx() and damon_commit_quota_goals() take care of those
cases.  Remove the unnecessary code.

Link: https://lkml.kernel.org/r/20240618181809.82078-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs: rename damon_sysfs_set_targets() to ...add_targets()
SeongJae Park [Tue, 18 Jun 2024 18:18:03 +0000 (11:18 -0700)]
mm/damon/sysfs: rename damon_sysfs_set_targets() to ...add_targets()

The function was for updating DAMON structs that may or may not be
partially populated.  Hence it was not only adding items, but also
removing unnecessary items and updating items in-place.  A previous commit
has changed the function to assume the structs are not partially
populated, and to only add items.  Make the function name better explain
the behavior.

Link: https://lkml.kernel.org/r/20240618181809.82078-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs: remove unnecessary online tuning handling code
SeongJae Park [Tue, 18 Jun 2024 18:18:02 +0000 (11:18 -0700)]
mm/damon/sysfs: remove unnecessary online tuning handling code

damon/sysfs.c contains code for handling edge cases of online DAMON
parameters update.  It is no longer necessary since damon_commit_ctx()
takes care of those cases.  Remove the unnecessary code.

Link: https://lkml.kernel.org/r/20240618181809.82078-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs-schemes: use damos_commit_quota_goals()
SeongJae Park [Tue, 18 Jun 2024 18:18:01 +0000 (11:18 -0700)]
mm/damon/sysfs-schemes: use damos_commit_quota_goals()

DAMON_SYSFS manually manipulates the DAMOS quota structs for online quota
goals parameter update.  Since the struct contains not only input
parameters but also internal status and operation results, it is not that
simple.  Now DAMON core layer provides a function for the usage, namely
damon_commit_quota_goals().  Replace the manual manipulation logic with
the function.  The core layer function could have its own bugs, but this
change removes a source of bugs.

Link: https://lkml.kernel.org/r/20240618181809.82078-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs: use damon_commit_ctx()
SeongJae Park [Tue, 18 Jun 2024 18:18:00 +0000 (11:18 -0700)]
mm/damon/sysfs: use damon_commit_ctx()

DAMON_SYSFS manually manipulates DAMON context structs for online
parameters update.  Since the struct contains not only input parameters
but also internal status and operation results, it is not that simple.
Indeed, we found and fixed a few bugs in the code.  Now DAMON core layer
provides a function for the usage, namely damon_commit_ctx().  Replace the
manual manipulation logic with the function.  The core layer function
could have its own bugs, but this change removes a source of bugs.

Link: https://lkml.kernel.org/r/20240618181809.82078-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/core: implement DAMON context commit function
SeongJae Park [Tue, 18 Jun 2024 18:17:59 +0000 (11:17 -0700)]
mm/damon/core: implement DAMON context commit function

Implement functions for supporting online DAMON context level parameters
update.  The function receives two DAMON context structs.  One is the
struct that is currently being used by a kdamond and is therefore the one
to be updated.  The other one contains the parameters to be applied to the
first one.  The function applies the new parameters to the destination
struct while keeping/updating the internal status and operation results.
The function should be called from a DAMON context-update-safe place, like
DAMON callbacks.
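
A rough usage sketch, assuming the signature is along the lines of
damon_commit_ctx(running_ctx, new_ctx); the wrapper name below is made up
and error handling is omitted:

  /* Illustrative: apply user-supplied parameters onto a live kdamond
   * context from a context-update-safe point such as a DAMON callback. */
  static int sketch_apply_new_params(struct damon_ctx *running,
                                     struct damon_ctx *new_params)
  {
          /* "running" keeps its internal status and operation results;
           * only the parameters of "new_params" are applied to it. */
          return damon_commit_ctx(running, new_params);
  }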

Link: https://lkml.kernel.org/r/20240618181809.82078-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/core: implement DAMOS quota goals online commit function
SeongJae Park [Tue, 18 Jun 2024 18:17:58 +0000 (11:17 -0700)]
mm/damon/core: implement DAMOS quota goals online commit function

Patch series "mm/damon: introduce DAMON parameters online commit function".

DAMON context struct (damon_ctx) contains user requests (parameters),
internal status, and operation results.  For flexible usages, DAMON API
users are encouraged to manually manipulate the struct.  That works well
for simple use cases.  However, it has turned out that it is not that
simple, at least for online parameters update.  It is easy to forget
properly maintaining internal status and operation results.  Also, such
manual manipulation for online tuning is implemented multiple times in
DAMON API users including the DAMON sysfs interface, DAMON_RECLAIM and
DAMON_LRU_SORT.  As a result, we have multiple sources of bugs for the
same problem.  Actually we found and fixed a few bugs from online
parameter updating of DAMON API users.

Implement a function for online DAMON parameters update in core layer, and
replace DAMON API users' manual manipulation code for the use case.  The
core layer function could still have bugs, but this change reduces the
source of bugs for the problem to one place.

This patch (of 12):

Implement functions for supporting online DAMOS quota goals parameters
update.  The function receives two DAMOS quota structs.  One is the struct
that is currently being used by a kdamond and is therefore the one to be
updated.  The other one contains the parameters to be applied to the first
one.  The function applies the new parameters to the destination struct
while keeping/updating the internal status.  The function should be called
from a parameters-update-safe place, like DAMON callbacks.

Link: https://lkml.kernel.org/r/20240618181809.82078-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240618181809.82078-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: memcontrol: add VM_BUG_ON_FOLIO() to catch lru folio in mem_cgroup_migrate()
Baolin Wang [Fri, 14 Jun 2024 01:07:42 +0000 (09:07 +0800)]
mm: memcontrol: add VM_BUG_ON_FOLIO() to catch lru folio in mem_cgroup_migrate()

mem_cgroup_migrate() will clear the memcg data of the old folio,
therefore, the callers must make sure the old folio is no longer on the
LRU list, otherwise the old folio can not get the correct lruvec object
without the memcg data, which could lead to potential problems [1].

Thus adding a VM_BUG_ON_FOLIO() to catch this issue.
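
Conceptually the added assertion is a one-liner near the top of
mem_cgroup_migrate(), roughly as sketched here:

  /* The old folio must already have been removed from the LRU by the
   * caller, otherwise its lruvec can no longer be resolved correctly. */
  VM_BUG_ON_FOLIO(folio_test_lru(old), old);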

[1] https://lore.kernel.org/all/5ab860d8ee987955e917748f9d6da525d3b52690.1718326003.git.baolin.wang@linux.alibaba.com/

Link: https://lkml.kernel.org/r/66d181c41b7ced35dbd39ffd3f5774a11aef266a.1718327124.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agoDocs/damon: document damos_migrate_{hot,cold}
Honggyu Kim [Fri, 14 Jun 2024 03:00:09 +0000 (12:00 +0900)]
Docs/damon: document damos_migrate_{hot,cold}

This patch adds DAMON documentation for the "migrate_hot" and
"migrate_cold" actions in both the usage and design documents, as well as
for a new "target_nid" knob to set the migration target node.

[sj@kernel.org: trivial fixups for DAMOS_MIGRATE_{HOT,COLD} documentation]
Link: https://lkml.kernel.org/r/20240618213630.84846-2-sj@kernel.org
Link: https://lkml.kernel.org/r/20240614030010.751-8-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion
Hyeongtak Ji [Fri, 14 Jun 2024 03:00:08 +0000 (12:00 +0900)]
mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

This patch introduces the DAMOS_MIGRATE_HOT action, which is similar to
DAMOS_MIGRATE_COLD, but prioritizes hot pages.

It migrates pages inside the given region to the 'target_nid' NUMA node
specified via sysfs.

Here is an example usage of the 'migrate_hot' action.

  $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
  $ cat contexts/<N>/schemes/<N>/action
  migrate_hot
  $ echo 0 > contexts/<N>/schemes/<N>/target_nid
  $ echo commit > state
  $ numactl -p 2 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID             Node 0 Node 1 Node 2 Total
  --------------  ------ ------ ------ -----
  701 (hot_cold)     501      0    601  1101

Link: https://lkml.kernel.org/r/20240614030010.751-7-honggyu.kim@sk.com
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
Honggyu Kim [Fri, 14 Jun 2024 03:00:07 +0000 (12:00 +0900)]
mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

This patch introduces the DAMOS_MIGRATE_COLD action, which is similar to
DAMOS_PAGEOUT, but migrates folios to the 'target_nid' given via sysfs
instead of swapping them out.

The 'target_nid' sysfs knob informs the migration target node ID.

Here is an example usage of the 'migrate_cold' action.

  $ cd /sys/kernel/mm/damon/admin/kdamonds/<N>
  $ cat contexts/<N>/schemes/<N>/action
  migrate_cold
  $ echo 2 > contexts/<N>/schemes/<N>/target_nid
  $ echo commit > state
  $ numactl -p 0 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID             Node 0 Node 1 Node 2 Total
  --------------  ------ ------ ------ -----
  701 (hot_cold)     501      0    601  1101

Since there are some routines in common with pageout, many functions
share similar logic between pageout and migrate cold.

damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list().

Link: https://lkml.kernel.org/r/20240614030010.751-6-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/migrate: add MR_DAMON to migrate_reason
Honggyu Kim [Fri, 14 Jun 2024 03:00:06 +0000 (12:00 +0900)]
mm/migrate: add MR_DAMON to migrate_reason

The current patch series introduces DAMON based migration across NUMA
nodes so it'd be better to have a new migrate_reason in trace events.

Link: https://lkml.kernel.org/r/20240614030010.751-5-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/damon/sysfs-schemes: add target_nid on sysfs-schemes
Hyeongtak Ji [Fri, 14 Jun 2024 03:00:05 +0000 (12:00 +0900)]
mm/damon/sysfs-schemes: add target_nid on sysfs-schemes

This patch adds target_nid under
  /sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/

The 'target_nid' can be used as the destination node for DAMOS actions
such as DAMOS_MIGRATE_{HOT,COLD} in the follow up patches.

[sj@kernel.org: document target_nid file]
Link: https://lkml.kernel.org/r/20240618213630.84846-3-sj@kernel.org
Link: https://lkml.kernel.org/r/20240614030010.751-4-honggyu.kim@sk.com
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: rename alloc_demote_folio to alloc_migrate_folio
Honggyu Kim [Fri, 14 Jun 2024 03:00:04 +0000 (12:00 +0900)]
mm: rename alloc_demote_folio to alloc_migrate_folio

alloc_demote_folio() can also be used for general migration including
both demotion and promotion, so it'd be better to rename it from
alloc_demote_folio to alloc_migrate_folio.

Link: https://lkml.kernel.org/r/20240614030010.751-3-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: make alloc_demote_folio externally invokable for migration
Honggyu Kim [Fri, 14 Jun 2024 03:00:03 +0000 (12:00 +0900)]
mm: make alloc_demote_folio externally invokable for migration

Patch series "DAMON based tiered memory management for CXL memory", v6.

Introduction
============

With the advent of CXL/PCIe attached DRAM, which will be referred to
simply as CXL memory in this cover letter, some systems are becoming more
heterogeneous, having memory systems with different latency and bandwidth
characteristics.  They are usually handled as different NUMA nodes in
separate memory tiers, and CXL memory is used as a slow tier because of
its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on the memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, causing
unexpected performance degradation.  Moreover, the memory access patterns
can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON(Data Access MONitor) framework and its DAMOS(DAMON-based Operation
Schemes) can be useful features for monitoring and migrating pages.  DAMOS
provides multiple actions based on DAMON monitoring results and it can be
used for proactive reclaim, which means swapping cold pages out with
DAMOS_PAGEOUT action, but it doesn't support migration actions such as
demotion and promotion between tiered memory nodes.

This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers.  This prevents hot pages from being stuck on slow tiers, which
would degrade performance, and lets cold pages be proactively demoted to
slow tiers so that the system has a better chance of allocating more hot
pages to fast tiers.

The DAMON provides various tuning knobs but we found that the proactive
demotion for cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 11% to 3~5% when the system
runs under high memory pressure on its fast tier DRAM nodes.

DAMON configuration
===================

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but some rough idea is better to be shared to explain the
evaluation result.

DAMON provides many knobs for fine tuning, but its configuration file is
generated by HMSDK[3].  It includes a gen_config.py script that generates
a json file with the full config of DAMON knobs, and it creates multiple
kdamonds for each NUMA node when DAMON is enabled so that it can run
hot/cold based migration for tiered memory.

Evaluation Workload
===================

The performance evaluation is done with redis[4], which is a widely used
in-memory database and the memory access patterns are generated via
YCSB[5].  We have measured two different workloads with zipfian and latest
distributions but their configs are slightly modified to make memory usage
higher and execution time longer for better evaluation.

The idea of evaluation using these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of a
single workload.  The default memory allocation policy places pages on the
fast tier DRAM node first, then allocates newly created pages to the slow
tier CXL node when the DRAM node has insufficient free space.  Once the
page allocation is done, those pages never move between NUMA nodes.  That
is not true when numa balancing is used, but that is outside the scope of
this DAMON based tiered memory management support.

If the working set of redis fits fully into the DRAM node, then redis
will access the fast DRAM only.  Since DRAM-only performance is faster
than partially accessing CXL memory in slow tiers, this environment is not
useful for evaluating this patch series.

To make the pages of redis be distributed across the fast DRAM node and
the slow CXL node so we can evaluate our migrate_{hot,cold} actions, we
pre-allocate some cold memory externally using mmap and memset before
launching redis-server.  We assumed that there is a sufficient amount of
cold memory in datacenters, as the TMO[6] and TPP[7] papers mention.

The evaluation sequence is as follows.

1. Turn on DAMON with DAMOS_MIGRATE_COLD action for DRAM node and
   DAMOS_MIGRATE_HOT action for CXL node.  It demotes cold pages on DRAM
   node and promotes hot pages on CXL node in a regular interval.
2. Allocate a huge block of cold memory by calling mmap and memset on
   the fast tier DRAM node, then make the process sleep so that the fast
   tier has insufficient space for redis-server.
3. Launch redis-server and load the prebaked snapshot image, dump.rdb.
   The redis-server consumes 52GB of anon pages and 33GB of file pages,
   but due to the cold memory allocated in step 2, it fails to allocate
   the entire memory of redis-server on the fast tier DRAM node, so the
   remainder is allocated on the slow tier CXL node.  The ratio of
   DRAM:CXL depends on the size of the pre-allocated cold memory.
4. Run YCSB to generate a zipfian or latest distribution of memory
   accesses to redis-server, then measure its execution time when it's
   completed.
5. Repeat step 4 over 50 times to measure the average execution time for
   each run.
6. Increase the cold memory size, then go back to step 2.

Each test at step 4 took about a minute, so repeating it 50 times took
about 1 hour for each cold memory size, which ranged from 440GB to 500GB
in 10GB increments for each evaluation.  So it took more than 10 hours for
both the zipfian and latest workloads to get the entire evaluation
results.  Repeating the same test set multiple times doesn't show much
difference, so I think it is enough to make the result reliable.

Evaluation Results
==================

All the result values are normalized to the DRAM-only execution time
because the workload cannot be faster than DRAM-only unless it hits the
peak bandwidth, and our redis test doesn't go beyond the bandwidth limit.

So the DRAM-only execution time is the ideal result, unaffected by the
performance gap between DRAM and CXL.  The NUMA node environment is as
follows.

  node0 - local DRAM, 512GB with a CPU socket (fast tier)
  node1 - disabled
  node2 - CXL DRAM, 96GB, no CPU attached (slow tier)

The following is the result of generating zipfian distribution to
redis-server and the numbers are averaged by 50 times of execution.

  1. YCSB zipfian distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  ====================+================================================+=========
                      |       cold memory occupied by mmap and memset  |
                      |   0G  440G  450G  460G  470G  480G  490G  500G |
  ====================+================================================+=========
  Execution time normalized to DRAM-only values                        | GEOMEAN
  --------------------+------------------------------------------------+---------
  DRAM-only           | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only            | 1.19     -     -     -     -     -     -     - | 1.19
  default             |    -  1.00  1.05  1.08  1.12  1.14  1.18  1.18 | 1.11
  DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07 *1.05 | 1.04
  DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06 *1.06 | 1.05
  ====================+================================================+=========
  CXL usage of redis-server in GB                                      | AVERAGE
  --------------------+------------------------------------------------+---------
  DRAM-only           |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only            | 51.4     -     -     -     -     -     -     - | 51.4
  default             |    -   0.6  10.6  20.5  30.5  40.5  47.6  50.4 | 28.7
  DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
  DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
  ====================+================================================+=========

Each test result is based on the execution environment as follows.

  DRAM-only:           redis-server uses only local DRAM memory.
  CXL-only:            redis-server uses only CXL memory.
  default:             default memory policy(MPOL_DEFAULT).
                       numa balancing disabled.
  DAMON tiered:        DAMON enabled with DAMOS_MIGRATE_COLD for DRAM
                       nodes and DAMOS_MIGRATE_HOT for CXL nodes.
  DAMON lazy:          same as DAMON tiered, but turn on DAMON just
                       before making memory access request via YCSB.

The above result shows that the "default" execution time goes up as the
size of cold memory is increased from 440G to 500G because the more cold
memory is used, the more CXL memory is used for the target redis workload,
and this increases the execution time.

However, "DAMON tiered" and the other DAMON results show less slowdown
because the DAMOS_MIGRATE_COLD action on the DRAM node proactively demotes
pre-allocated cold memory to the CXL node, and this freed space on DRAM
increases the chance of allocating hot or warm pages of redis-server to
the fast DRAM node.  Moreover, the DAMOS_MIGRATE_HOT action on the CXL
node also actively promotes hot pages of redis-server to the DRAM node.

As a result, more memory of redis-server stays on the DRAM node compared
to the "default" memory policy, and this yields the performance
improvement.

Please note that the result numbers of "DAMON tiered" and "DAMON lazy" at
500G are marked with * stars, which means their test results are replaced
with reproduced tests that didn't have OOM issue.

That was needed because the test processes sometimes hit OOM when DRAM has
insufficient space.  DAMOS_MIGRATE_HOT doesn't kick reclaim but just
gives up migration when there is not enough space on the DRAM side.  The
problem happens when there is competition between normal allocation and
migration: if the migration is done before the normal allocation, the
completely unrelated normal allocation can trigger reclaim, which incurs
OOM.

Because of this issue, I have also tested more cases with the
"demotion_enabled" flag enabled so that such reclaim doesn't trigger OOM,
but just demotes reclaimed pages.  The following test results show more
tests, marked with "kswapd".

  2. YCSB zipfian distribution read only workload (with demotion_enabled true)
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  ====================+================================================+=========
                      |       cold memory occupied by mmap and memset  |
                      |   0G  440G  450G  460G  470G  480G  490G  500G |
  ====================+================================================+=========
  Execution time normalized to DRAM-only values                        | GEOMEAN
  --------------------+------------------------------------------------+---------
  DAMON tiered        |    -  1.03  1.03  1.03  1.03  1.03  1.07  1.05 | 1.04
  DAMON lazy          |    -  1.04  1.03  1.04  1.05  1.06  1.06  1.06 | 1.05
  DAMON tiered kswapd |    -  1.03  1.03  1.03  1.03  1.02  1.02  1.03 | 1.03
  DAMON lazy kswapd   |    -  1.04  1.04  1.04  1.03  1.05  1.04  1.05 | 1.04
  ====================+================================================+=========
  CXL usage of redis-server in GB                                      | AVERAGE
  --------------------+------------------------------------------------+---------
  DAMON tiered        |    -   0.6   0.5   0.4   0.7   0.8   7.1   5.6 |  2.2
  DAMON lazy          |    -   0.5   3.0   4.5   5.4   6.4   9.4   9.1 |  5.5
  DAMON tiered kswapd |    -   0.0   0.0   0.4   0.5   0.1   0.8   1.0 |  0.4
  DAMON lazy kswapd   |    -   4.2   4.6   5.3   1.7   6.8   8.1   5.8 |  5.2
  ====================+================================================+=========

Each test result is based on the execution environment as follows.

  DAMON tiered:        same as before
  DAMON lazy:          same as before
  DAMON tiered kswapd: same as DAMON tiered, but turn on
                       /sys/kernel/mm/numa/demotion_enabled so that
                       kswapd or direct reclaim does demotion.
  DAMON lazy kswapd:   same as DAMON lazy, but turn on
                       /sys/kernel/mm/numa/demotion_enabled so that
                       kswapd or direct reclaim does demotion.

The "DAMON tiered kswapd" and "DAMON lazy kswapd" didn't trigger OOM at
all unlike other tests because kswapd and direct reclaim from DRAM node
can demote reclaimed pages to CXL node independently from DAMON actions
and their results are slightly better than without having
"demotion_enabled".

In summary, the evaluation results show that DAMON memory management with
DAMOS_MIGRATE_{HOT,COLD} actions reduces the performance slowdown compared
to the "default" memory policy from 11% to 3~5% when the system runs with
high memory pressure on its fast tier DRAM nodes.

Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
tiered memory systems run more efficiently under high memory pressures.

This patch (of 7):

alloc_demote_folio() can be used outside of vmscan.c, so it'd be better
to remove the static keyword from it.

Link: https://lkml.kernel.org/r/20240614030010.751-1-honggyu.kim@sk.com
Link: https://lkml.kernel.org/r/20240614030010.751-2-honggyu.kim@sk.com
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Gregory Price <gregory.price@memverge.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/mm_init.c: simplify logic of deferred_[init|free]_pages
Wei Yang [Wed, 12 Jun 2024 02:04:21 +0000 (02:04 +0000)]
mm/mm_init.c: simplify logic of deferred_[init|free]_pages

The functions deferred_[init|free]_pages are only used in
deferred_init_maxorder(), which makes sure the range to init/free is
within MAX_ORDER_NR_PAGES in size.

With this knowledge, we can simplify these two functions, since

  * only the first pfn could be IS_MAX_ORDER_ALIGNED()

Also, since the range passed to deferred_[init|free]_pages always comes
from memblock.memory, for which we have already allocated memmap to cover
it, pfn_valid() always returns true, so we can remove the related check.

[richard.weiyang@gmail.com: adjust function declaration indention per David]
Link: https://lkml.kernel.org/r/20240613114525.27528-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20240612020421.31975-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: correct comment in me_swapcache_dirty
Miaohe Lin [Wed, 12 Jun 2024 07:18:35 +0000 (15:18 +0800)]
mm/memory-failure: correct comment in me_swapcache_dirty

Dirty swap cache page could live both in page table (not page cache) and
swap cache when freshly swapped in.  Correct comment.

Link: https://lkml.kernel.org/r/20240612071835.157004-14-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: remove obsolete comment in kill_proc()
Miaohe Lin [Wed, 12 Jun 2024 07:18:34 +0000 (15:18 +0800)]
mm/memory-failure: remove obsolete comment in kill_proc()

When the user sets SIGBUS to SIG_IGN, it won't cause a loop now.  For an
action-required MCE error, SIGBUS cannot be blocked.  Also, when a
hwpoisoned page is re-accessed, kill_accessing_process() will be called to
kill the process.

Link: https://lkml.kernel.org/r/20240612071835.157004-13-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: fix comment of get_hwpoison_page()
Miaohe Lin [Wed, 12 Jun 2024 07:18:33 +0000 (15:18 +0800)]
mm/memory-failure: fix comment of get_hwpoison_page()

When the return value is 0, it could also mean the page is a free hugetlb
page or a free buddy page.  Fix the corresponding comment.

Link: https://lkml.kernel.org/r/20240612071835.157004-12-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: move some function declarations into internal.h
Miaohe Lin [Wed, 12 Jun 2024 07:18:32 +0000 (15:18 +0800)]
mm/memory-failure: move some function declarations into internal.h

There are some functions only used inside mm.  Move them into internal.h.
No functional change intended.

Link: https://lkml.kernel.org/r/20240612071835.157004-11-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202405251049.hxjwX7zO-lkp@intel.com/
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: remove obsolete comment in unpoison_memory()
Miaohe Lin [Wed, 12 Jun 2024 07:18:31 +0000 (15:18 +0800)]
mm/memory-failure: remove obsolete comment in unpoison_memory()

Since commit 130d4df57390 ("mm/sl[au]b: rearrange struct slab fields to
allow larger rcu_head"), folio->_mapcount is not overloaded with SLAB.
Update corresponding comment.

Link: https://lkml.kernel.org/r/20240612071835.157004-10-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: use helper macro task_pid_nr()
Miaohe Lin [Wed, 12 Jun 2024 07:18:30 +0000 (15:18 +0800)]
mm/memory-failure: use helper macro task_pid_nr()

Use helper macro task_pid_nr() to get the pid of a task.  No functional
change intended.

Link: https://lkml.kernel.org/r/20240612071835.157004-9-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: don't export hwpoison_filter() when !CONFIG_HWPOISON_INJECT
Miaohe Lin [Wed, 12 Jun 2024 07:18:29 +0000 (15:18 +0800)]
mm/memory-failure: don't export hwpoison_filter() when !CONFIG_HWPOISON_INJECT

When CONFIG_HWPOISON_INJECT is not enabled, there is no user of
hwpoison_filter() outside memory-failure.c.  So there is no need to export
it in that case.
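
The shape of the change is roughly the following (the exact preprocessor
guard is illustrative):

  int hwpoison_filter(struct page *p)
  {
          /* ... existing filtering logic ... */
          return 0;
  }
  #if IS_ENABLED(CONFIG_HWPOISON_INJECT)
  /* Only the hwpoison-inject code outside memory-failure needs this. */
  EXPORT_SYMBOL_GPL(hwpoison_filter);
  #endif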

Link: https://lkml.kernel.org/r/20240612071835.157004-8-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406070136.hGQwVbsv-lkp@intel.com/
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: remove confusing initialization to count
Miaohe Lin [Wed, 12 Jun 2024 07:18:28 +0000 (15:18 +0800)]
mm/memory-failure: remove confusing initialization to count

It's meaningless and confusing to init local variable count to 1.  Remove
it.  No functional change intended.

Link: https://lkml.kernel.org/r/20240612071835.157004-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: remove unneeded empty string
Miaohe Lin [Wed, 12 Jun 2024 07:18:27 +0000 (15:18 +0800)]
mm/memory-failure: remove unneeded empty string

Remove unneeded empty string in definition of macro pr_fmt.  No functional
change intended.
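
In other words, a definition of this form loses its redundant "" (the
prefix string shown is a placeholder, not necessarily the actual one):

  /* Before: a needless empty string concatenated into the format. */
  #define pr_fmt(fmt) "" "some-prefix: " fmt

  /* After */
  #define pr_fmt(fmt) "some-prefix: " fmt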

Link: https://lkml.kernel.org/r/20240612071835.157004-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: save some page_folio() calls
Miaohe Lin [Wed, 12 Jun 2024 07:18:26 +0000 (15:18 +0800)]
mm/memory-failure: save some page_folio() calls

Use local variable folio directly to save a page_folio() call.  Also use
folio_mapped() to save more page_folio() calls.  No functional change
intended.

Link: https://lkml.kernel.org/r/20240612071835.157004-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: add macro GET_PAGE_MAX_RETRY_NUM
Miaohe Lin [Wed, 12 Jun 2024 07:18:25 +0000 (15:18 +0800)]
mm/memory-failure: add macro GET_PAGE_MAX_RETRY_NUM

Add helper macro GET_PAGE_MAX_RETRY_NUM to replace magic number 3.  No
functional change intended.
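
The cleanup is of this shape (the retry loop shown is schematic, with an
illustrative loop counter name):

  #define GET_PAGE_MAX_RETRY_NUM	3

  /* Before: if (pass++ < 3) goto try_again; */
  if (pass++ < GET_PAGE_MAX_RETRY_NUM)
          goto try_again;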

Link: https://lkml.kernel.org/r/20240612071835.157004-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: remove MF_MSG_SLAB
Miaohe Lin [Wed, 12 Jun 2024 07:18:24 +0000 (15:18 +0800)]
mm/memory-failure: remove MF_MSG_SLAB

Since commit 46df8e73a4a3 ("mm: free up PG_slab"), MF_MSG_SLAB becomes
unused.  Remove it.  No functional change intended.

Link: https://lkml.kernel.org/r/20240612071835.157004-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory-failure: simplify put_ref_page()
Miaohe Lin [Wed, 12 Jun 2024 07:18:23 +0000 (15:18 +0800)]
mm/memory-failure: simplify put_ref_page()

Patch series "Some cleanups for memory-failure", v3.

This series contains a few cleanup patches to avoid exporting an unused
function, add a helper macro, fix some obsolete comments, and so on.  More
details can be found in the respective changelogs.

This patch (of 13):

Remove the unneeded page != NULL check.  pfn_to_page() won't return NULL.
No functional change intended.
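
The simplification is roughly of this shape (a sketch; the surrounding
code is elided):

  static void put_ref_page(unsigned long pfn, int flags)
  {
          struct page *page = pfn_to_page(pfn);

          /* pfn_to_page() never returns NULL, so no "if (!page)" check. */
          if (flags & MF_COUNT_INCREASED)
                  put_page(page);
  }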

Link: https://lkml.kernel.org/r/20240612071835.157004-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20240612071835.157004-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/hugetlb: guard dequeue_hugetlb_folio_nodemask against NUMA_NO_NODE uses
Oscar Salvador [Wed, 12 Jun 2024 08:29:36 +0000 (10:29 +0200)]
mm/hugetlb: guard dequeue_hugetlb_folio_nodemask against NUMA_NO_NODE uses

dequeue_hugetlb_folio_nodemask() expects a preferred node from which to
get the hugetlb page.  It does not expect, though, users to pass
NUMA_NO_NODE; otherwise we will get garbage when trying to get the
zonelist from that node.  All current users are careful enough not to pass
NUMA_NO_NODE, but it opens the door for new users to get this wrong since
it is not documented [0].

Guard against this by getting the local nid if NUMA_NO_NODE was passed.
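
The guard amounts to a couple of lines at the top of
dequeue_hugetlb_folio_nodemask(), roughly as sketched here:

  /* Fall back to the local node when the caller expressed no preference,
   * so the zonelist lookup below never sees NUMA_NO_NODE. */
  if (nid == NUMA_NO_NODE)
          nid = numa_node_id();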

[0] https://lore.kernel.org/linux-mm/0000000000004f12bb061a9acf07@google.com/

Closes: https://lore.kernel.org/linux-mm/0000000000004f12bb061a9acf07@google.com/
Link: https://lkml.kernel.org/r/20240612082936.10867-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: syzbot+569ed13f4054f271087b@syzkaller.appspotmail.com
Tested-by: syzbot+569ed13f4054f271087b@syzkaller.appspotmail.com
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/hugetlb_cgroup: switch to the new cftypes
Xiu Jianfeng [Wed, 12 Jun 2024 09:24:09 +0000 (09:24 +0000)]
mm/hugetlb_cgroup: switch to the new cftypes

The previous patch has already reconstructed the cftype attributes based
on the templates and saved them in dfl_cftypes and legacy_cftypes.  Now
remove the old procedure and switch to the new cftypes.

Link: https://lkml.kernel.org/r/20240612092409.2027592-4-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/hugetlb_cgroup: prepare cftypes based on template
Xiu Jianfeng [Wed, 12 Jun 2024 09:24:08 +0000 (09:24 +0000)]
mm/hugetlb_cgroup: prepare cftypes based on template

Unlike other cgroup subsystems, the hugetlb cgroup does not provide a
static array of cftype that explicitly shows the properties, handling
functions, etc., of each file.  Instead, it dynamically creates the cftype
attributes based on the hstate during the startup procedure.  This reduces
the readability of the code.

To fix this issue, introduce two templates of cftypes, and rebuild the
attributes according to the hstate to make it ready to be added to cgroup
framework.

Link: https://lkml.kernel.org/r/20240612092409.2027592-3-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: kernel test robot <oliver.sang@intel.com>
From: Xiu Jianfeng <xiujianfeng@huawei.com>
Subject: mm/hugetlb_cgroup: register lockdep key for cftype
Date: Tue, 18 Jun 2024 07:19:22 +0000

When CONFIG_DEBUG_LOCK_ALLOC is enabled, the following commands can
trigger a bug,

mount -t cgroup2 none /sys/fs/cgroup
cd /sys/fs/cgroup
echo "+hugetlb" > cgroup.subtree_control

The log is as below:

BUG: key ffff8880046d88d8 has not been registered!
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 3 PID: 226 at kernel/locking/lockdep.c:4945 lockdep_init_map_type+0x185/0x220
Modules linked in:
CPU: 3 PID: 226 Comm: bash Not tainted 6.10.0-rc4-next-20240617-g76db4c64526c #544
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:lockdep_init_map_type+0x185/0x220
Code: 00 85 c0 0f 84 6c ff ff ff 8b 3d 6a d1 85 01 85 ff 0f 85 5e ff ff ff 48 c7 c6 21 99 4a 82 48 c7 c7 60 29 49 82 e8 3b 2e f5
RSP: 0018:ffffc9000083fc30 EFLAGS: 00000282
RAX: 0000000000000000 RBX: ffffffff828dd820 RCX: 0000000000000027
RDX: ffff88803cd9cac8 RSI: 0000000000000001 RDI: ffff88803cd9cac0
RBP: ffff88800674fbb0 R08: ffffffff828ce248 R09: 00000000ffffefff
R10: ffffffff8285e260 R11: ffffffff828b8eb8 R12: ffff8880046d88d8
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880067281c0
FS:  00007f68601ea740(0000) GS:ffff88803cd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005614f3ebc740 CR3: 000000000773a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ? __warn+0x77/0xd0
 ? lockdep_init_map_type+0x185/0x220
 ? report_bug+0x189/0x1a0
 ? handle_bug+0x3c/0x70
 ? exc_invalid_op+0x18/0x70
 ? asm_exc_invalid_op+0x1a/0x20
 ? lockdep_init_map_type+0x185/0x220
 __kernfs_create_file+0x79/0x100
 cgroup_addrm_files+0x163/0x380
 ? find_held_lock+0x2b/0x80
 ? find_held_lock+0x2b/0x80
 ? find_held_lock+0x2b/0x80
 css_populate_dir+0x73/0x180
 cgroup_apply_control_enable+0x12f/0x3a0
 cgroup_subtree_control_write+0x30b/0x440
 kernfs_fop_write_iter+0x13a/0x1f0
 vfs_write+0x341/0x450
 ksys_write+0x64/0xe0
 do_syscall_64+0x4b/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f68602d9833
Code: 8b 15 61 26 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 08
RSP: 002b:00007fff9bbdf8e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f68602d9833
RDX: 0000000000000009 RSI: 00005614f3ebc740 RDI: 0000000000000001
RBP: 00005614f3ebc740 R08: 000000000000000a R09: 0000000000000008
R10: 00005614f3db6ba0 R11: 0000000000000246 R12: 0000000000000009
R13: 00007f68603bd6a0 R14: 0000000000000009 R15: 00007f68603b8880

For lockdep, there is a sanity check in lockdep_init_map_type(): the
lock-class key must either have been allocated statically or must have
been registered as a dynamic key.  However, the commit e18df2889ff9
("mm/hugetlb_cgroup: prepare cftypes based on template") has changed the
cftypes from statically allocated objects to dynamically allocated
objects, so the cft->lockdep_key must be registered proactively.
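
Registering the key looks roughly like this, done once for each
runtime-generated cftype (a sketch):

  #ifdef CONFIG_DEBUG_LOCK_ALLOC
          /* cftypes are now allocated at runtime, so their lock-class
           * keys must be registered as dynamic lockdep keys. */
          lockdep_register_key(&cft->lockdep_key);
  #endif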

[xiujianfeng@huawei.com: fix BUG()]
Link: https://lkml.kernel.org/r/20240619015527.2212698-1-xiujianfeng@huawei.com
Link: https://lkml.kernel.org/r/20240618071922.2127289-1-xiujianfeng@huawei.com
Link: https://lore.kernel.org/all/602186b3-5ce3-41b3-90a3-134792cc2a48@samsung.com/
Fixes: e18df2889ff9 ("mm/hugetlb_cgroup: prepare cftypes based on template")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202406181046.8d8b2492-oliver.sang@intel.com
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: SeongJae Park <sj@kernel.org>
Closes: https://lore.kernel.org/20240618233608.400367-1-sj@kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/hugetlb_cgroup: identify the legacy using cgroup_subsys_on_dfl()
Xiu Jianfeng [Wed, 12 Jun 2024 09:24:07 +0000 (09:24 +0000)]
mm/hugetlb_cgroup: identify the legacy using cgroup_subsys_on_dfl()

Patch series "mm/hugetlb_cgroup: rework on cftypes", v3.

This patchset provides an intuitive view of the control files through
static templates of cftypes.  This improves the readability of the code.

This patch (of 3):

Currently the numa_stat file encodes 1 into .private using the macro
MEMFILE_PRIVATE() to identify the legacy hierarchy.  Actually, we can use
cgroup_subsys_on_dfl() instead.  This is helpful for handling .private in
the
static templates in the next patch.
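
Sketched, the check becomes a direct query of the hierarchy type instead
of decoding a flag from .private:

  /* Illustrative: true on the legacy (v1) hierarchy, without encoding
   * anything into cft->private. */
  bool legacy = !cgroup_subsys_on_dfl(hugetlb_cgrp_subsys);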

Link: https://lkml.kernel.org/r/20240612092409.2027592-1-xiujianfeng@huawei.com
Link: https://lkml.kernel.org/r/20240612092409.2027592-2-xiujianfeng@huawei.com
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: report per-page metadata information
Sourav Panda [Wed, 5 Jun 2024 22:27:51 +0000 (22:27 +0000)]
mm: report per-page metadata information

Today, we do not have any observability of per-page metadata and how much
it takes away from the machine capacity.  Thus, we want to describe the
amount of memory that is going towards per-page metadata, which can vary
depending on build configuration, machine architecture, and system use.

This patch adds 2 fields to /proc/vmstat that can be used as shown below:

Accounting per-page metadata allocated by boot-allocator:
/proc/vmstat:nr_memmap_boot * PAGE_SIZE

Accounting per-page metadata allocated by buddy-allocator:
/proc/vmstat:nr_memmap * PAGE_SIZE

Accounting total per-page metadata allocated on the machine:
(/proc/vmstat:nr_memmap_boot +
 /proc/vmstat:nr_memmap) * PAGE_SIZE
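
As a hedged userspace example (not part of the patch; minimal error
handling), the totals above can be computed like this:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* Sum nr_memmap and nr_memmap_boot, then scale by the page size. */
          FILE *f = fopen("/proc/vmstat", "r");
          long page_size = sysconf(_SC_PAGESIZE);
          unsigned long long pages = 0, val;
          char key[64];

          if (!f)
                  return 1;
          while (fscanf(f, "%63s %llu", key, &val) == 2)
                  if (!strcmp(key, "nr_memmap") || !strcmp(key, "nr_memmap_boot"))
                          pages += val;
          fclose(f);
          printf("per-page metadata: %llu bytes\n", pages * page_size);
          return 0;
  }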

Utility for userspace:

Observability: Describe the amount of memory overhead that is going to
per-page metadata on the system at any given time since this overhead is
not currently observable.

Debugging: Tracking the changes in, or the absolute value of, struct page
memory can help detect anomalies, as it can be correlated with other
metrics on the machine (e.g., memtotal, number of huge pages, etc.).

page_ext overheads: Some kernel features, such as page_owner and
page_table_check, that use page_ext can be optionally enabled via kernel
parameters.  Having the total per-page metadata information helps users
precisely measure the impact.  Furthermore, page-metadata metrics will
reflect the amount of struct page memory relinquished (or overhead
reduced) when hugetlbfs pages are reserved, which will vary depending on
whether hugetlb vmemmap optimization is enabled or not.

For background and results see:
lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com

Link: https://lkml.kernel.org/r/20240605222751.1406125-1-souravpanda@google.com
Signed-off-by: Sourav Panda <souravpanda@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Chen Linxuan <chenlinxuan@uniontech.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tomas Mudrunka <tomas.mudrunka@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agoselftests/mm: guard defines from shm
Edward Liaw [Wed, 5 Jun 2024 22:36:35 +0000 (22:36 +0000)]
selftests/mm: guard defines from shm

thuge-gen.c defines SHM_HUGE_* macros that are provided by the uapi since
4.14.  These macros get redefined when compiling with Android's bionic
because its sys/shm.h will import the uapi definitions.

However, if linux/shm.h is included when building with glibc, it will
clash with sys/shm.h on some struct definitions:

  /usr/include/linux/shm.h:26:8: error: redefinition of ‘struct shmid_ds’
     26 | struct shmid_ds {
        |        ^~~~~~~~
  In file included from /usr/include/x86_64-linux-gnu/bits/shm.h:45,
                   from /usr/include/x86_64-linux-gnu/sys/shm.h:30:
  /usr/include/x86_64-linux-gnu/bits/types/struct_shmid_ds.h:24:8: note: originally defined here
     24 | struct shmid_ds
        |        ^~~~~~~~

For now, guard the SHM_HUGE_* defines with ifndef to prevent redefinition
warnings on Android bionic.
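
A hedged sketch of the guard pattern (values follow the uapi encoding
with SHM_HUGE_SHIFT = 26; the exact set of macros guarded in thuge-gen.c
may differ):

  /* Keep local fallbacks from clashing with definitions already pulled
   * in through the libc headers (e.g. bionic's sys/shm.h). */
  #ifndef SHM_HUGE_SHIFT
  #define SHM_HUGE_SHIFT 26
  #endif
  #ifndef SHM_HUGE_MASK
  #define SHM_HUGE_MASK  0x3f
  #endif
  #ifndef SHM_HUGE_2MB
  #define SHM_HUGE_2MB   (21 << SHM_HUGE_SHIFT)
  #endif
  #ifndef SHM_HUGE_1GB
  #define SHM_HUGE_1GB   (30 << SHM_HUGE_SHIFT)
  #endif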

Link: https://lkml.kernel.org/r/20240605223637.1374969-3-edliaw@google.com
Signed-off-by: Edward Liaw <edliaw@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agoselftests/mm: include linux/mman.h
Edward Liaw [Wed, 5 Jun 2024 22:36:34 +0000 (22:36 +0000)]
selftests/mm: include linux/mman.h

thuge-gen defines MAP_HUGE_* macros that have been provided by
linux/mman.h since 4.15.  Remove the macros and include linux/mman.h
instead.
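
For illustration (a hedged sketch, not the test's actual code), the uapi
header provides the MAP_HUGE_* encodings directly:

  #include <linux/mman.h>        /* MAP_HUGE_2MB, MAP_HUGE_1GB, ... */
  #include <sys/mman.h>
  #include <stddef.h>

  /* Map one 2 MiB huge page without re-defining the encoding locally. */
  static void *map_one_2mb_page(void)
  {
          return mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                      -1, 0);
  }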

Link: https://lkml.kernel.org/r/20240605223637.1374969-2-edliaw@google.com
Signed-off-by: Edward Liaw <edliaw@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/memory_hotplug: prevent accessing by index=-1
Anastasia Belova [Thu, 6 Jun 2024 08:06:59 +0000 (11:06 +0300)]
mm/memory_hotplug: prevent accessing by index=-1

nid may be equal to NUMA_NO_NODE (-1).  Prevent accessing the node_data
array with an invalid index by checking nid first.
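
The pattern being enforced, shown as a hedged illustration only (not the
exact hunk in memory_hotplug.c):

  /* Only index node_data[] / NODE_DATA() when nid refers to a real node. */
  if (nid != NUMA_NO_NODE) {
          pg_data_t *pgdat = NODE_DATA(nid);

          /* ... per-node state may be read from pgdat here ... */
  }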

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Link: https://lkml.kernel.org/r/20240606080659.18525-1-abelova@astralinux.ru
Fixes: e83a437faa62 ("mm/memory_hotplug: introduce "auto-movable" online policy")
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/mlock: implement folio_mlock_step() using folio_pte_batch()
Lance Yang [Tue, 11 Jun 2024 01:04:18 +0000 (09:04 +0800)]
mm/mlock: implement folio_mlock_step() using folio_pte_batch()

Let's make folio_mlock_step() a simple wrapper around folio_pte_batch(),
which will greatly reduce the cost of ptep_get() when scanning a range of
contiguous PTEs (cont-PTEs).
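
A hedged sketch of the wrapper, assuming folio_pte_batch()'s in-tree
signature (PTE count, batching flags, optional out-pointers):

  static inline unsigned int folio_mlock_step(struct folio *folio,
                  pte_t *pte, unsigned long addr, unsigned long end)
  {
          const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
          unsigned int count = (end - addr) >> PAGE_SHIFT;
          pte_t ptent = ptep_get(pte);

          if (!folio_test_large(folio))
                  return 1;

          /* One ptep_get() above; folio_pte_batch() walks the rest. */
          return folio_pte_batch(folio, addr, pte, ptent, count, flags,
                                 NULL, NULL, NULL);
  }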

Link: https://lkml.kernel.org/r/20240611010418.70797-1-ioworker0@gmail.com
Signed-off-by: Lance Yang <ioworker0@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Barry Song <21cnbao@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: Bang Li <libang.li@antgroup.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: zswap: handle incorrect attempts to load large folios
Yosry Ahmed [Tue, 11 Jun 2024 02:45:16 +0000 (02:45 +0000)]
mm: zswap: handle incorrect attempts to load large folios

Zswap does not support storing or loading large folios.  Until proper
support is added, attempts to load large folios from zswap are a bug.

For example, if a swapin fault observes that contiguous PTEs are pointing
to contiguous swap entries and tries to swap them in as a large folio,
swap_read_folio() will pass in a large folio to zswap_load(), but
zswap_load() will only effectively load the first page in the folio.  If
the first page is not in zswap, the folio will be read from disk, even
though other pages may be in zswap.

In both cases, this will lead to silent data corruption.  Proper support
needs to be added before large folio swapins and zswap can work together.

Looking at the callers of swap_read_folio(), it seems the folios are
allocated either by __read_swap_cache_async() or by do_swap_page() in the
SWP_SYNCHRONOUS_IO path.  Both allocate order-0 folios, so everything is
fine for now.

However, there is ongoing work to add support for large folio swapins [1].
To make sure new development does not break zswap (or get broken by
zswap), add minimal handling of incorrect loads of large folios to zswap.
First, move the call to folio_mark_uptodate() inside zswap_load().

If a large folio load is attempted, and zswap was ever enabled on the
system, return 'true' without calling folio_mark_uptodate().  This will
prevent the folio from being read from disk, and will emit an IO error
because the folio is not uptodate (e.g.  do_swap_page() will return
VM_FAULT_SIGBUS).  It may not be a reliable recovery in all cases, but it
is better than nothing.
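
A hedged sketch of the resulting entry checks in zswap_load() (heavily
simplified; the real function continues with the xarray lookup and
decompression):

  bool zswap_load(struct folio *folio)
  {
          if (zswap_never_enabled())
                  return false;

          /*
           * Large folios are not supported yet: returning true without
           * marking the folio uptodate prevents a read from disk and
           * surfaces an IO error instead of silent data corruption.
           */
          if (WARN_ON_ONCE(folio_test_large(folio)))
                  return true;

          /* ... look up and decompress the entry; return false on a miss ... */

          folio_mark_uptodate(folio);     /* now done inside zswap_load() */
          return true;
  }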

This was tested by hacking the allocation in __read_swap_cache_async() to
use order 2 and __GFP_COMP.

In the future, to handle this correctly, the swapin code should:

(a) Fall back to order-0 swapins if zswap was ever used on the
    machine, because compressed pages remain in zswap after it is
    disabled.

(b) Add proper support to swapin large folios from zswap (fully or
    partially).

Probably start with (a), then follow up with (b).

[1]https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/

Link: https://lkml.kernel.org/r/20240611024516.1375191-3-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Barry Song <baohua@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: zswap: add zswap_never_enabled()
Yosry Ahmed [Tue, 11 Jun 2024 02:45:15 +0000 (02:45 +0000)]
mm: zswap: add zswap_never_enabled()

Add zswap_never_enabled() to skip the xarray lookup in zswap_load() if
zswap was never enabled on the system.  It is implemented using static
branches for efficiency, as enabling zswap should be a rare event.  This
could shave some cycles off zswap_load() when CONFIG_ZSWAP is used but
zswap is never enabled.
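
A hedged sketch of the static-branch approach (the key name and the spot
where it is flipped are assumptions):

  #include <linux/jump_label.h>

  /* Flipped once, the first time zswap is enabled; never cleared again. */
  static DEFINE_STATIC_KEY_FALSE(zswap_ever_enabled);

  bool zswap_never_enabled(void)
  {
          return !static_branch_unlikely(&zswap_ever_enabled);
  }

  /* Hypothetical hook on the enable path (e.g. the module param setter). */
  static void zswap_note_enabled(void)
  {
          static_branch_enable(&zswap_ever_enabled);
  }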

However, the real motivation behind this patch is two-fold:
- Incoming large folio swapin work will need to fallback to order-0
  folios if zswap was ever enabled, because any part of the folio could be
  in zswap, until proper handling of large folios with zswap is added.

- A warning and recovery attempt will be added in a following change in
  case the above is not done correctly.  Zswap will fail the read if the
  folio is large and zswap was ever enabled.

Expose zswap_never_enabled() in the header for the swapin work to use
it later.

[yosryahmed@google.com: expose zswap_never_enabled() in the header]
Link: https://lkml.kernel.org/r/Zmjf0Dr8s9xSW41X@google.com
Link: https://lkml.kernel.org/r/20240611024516.1375191-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: zswap: rename is_zswap_enabled() to zswap_is_enabled()
Yosry Ahmed [Tue, 11 Jun 2024 02:45:14 +0000 (02:45 +0000)]
mm: zswap: rename is_zswap_enabled() to zswap_is_enabled()

In preparation for introducing a similar function, rename
is_zswap_enabled() to use zswap_* prefix like other zswap functions.

Link: https://lkml.kernel.org/r/20240611024516.1375191-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/mm_init.c: print mem_init info after defer_init is done
Wei Yang [Tue, 11 Jun 2024 14:52:23 +0000 (14:52 +0000)]
mm/mm_init.c: print mem_init info after defer_init is done

Current call flow looks like this:

start_kernel
  mm_core_init
    mem_init
    mem_init_print_info
  rest_init
    kernel_init
      kernel_init_freeable
        page_alloc_init_late
          deferred_init_memmap

With CONFIG_DEFERRED_STRUCT_PAGE_INIT, at the time mem_init_print_info()
is called, not all pages have been initialized and freed to the buddy
allocator yet.

This has one issue:

  * nr_free_pages() only accounts for part of the free pages in the
    system, which is not what we expect.

Let's print the mem info after defer_init is done.

This would also help with changing the totalram_pages accounting, since
we plan to move the accounting into __free_pages_core().

Link: https://lkml.kernel.org/r/20240611145223.16872-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/sparse: use MEMBLOCK_ALLOC_ACCESSIBLE enum instead of 0
Leesoo Ahn [Mon, 10 Jun 2024 15:15:28 +0000 (00:15 +0900)]
mm/sparse: use MEMBLOCK_ALLOC_ACCESSIBLE enum instead of 0

Setting the 'limit' variable to 0 might seem to mean "no limit".  But in
the memblock API, 0 actually stands for the 'MEMBLOCK_ALLOC_ACCESSIBLE'
enum, which caps the end of the physical address range at
'memblock.current_limit'.  This could be confusing.

Use the enum instead of 0 to make it clear.
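
A hedged before/after illustration (not the exact sparse.c call site):

  /* Before: a bare 0 that silently means MEMBLOCK_ALLOC_ACCESSIBLE. */
  buf = memblock_alloc_try_nid_raw(size, PAGE_SIZE, addr, 0, nid);

  /* After: the named enum documents that the allocation is capped at
   * memblock.current_limit rather than being truly unlimited. */
  buf = memblock_alloc_try_nid_raw(size, PAGE_SIZE, addr,
                                   MEMBLOCK_ALLOC_ACCESSIBLE, nid);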

Link: https://lkml.kernel.org/r/20240610151528.943680-1-lsahn@wewakecorp.com
Signed-off-by: Leesoo Ahn <lsahn@ooseel.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/vmscan: avoid split lazyfree THP during shrink_folio_list()
Lance Yang [Fri, 14 Jun 2024 01:51:38 +0000 (09:51 +0800)]
mm/vmscan: avoid split lazyfree THP during shrink_folio_list()

When the user no longer requires the pages, they would use
madvise(MADV_FREE) to mark the pages as lazy free.  Subsequently, they
typically would not re-write to that memory again.

During memory reclaim, if we detect that the large folio and its PMD are
both still marked as clean and there are no unexpected references (such as
GUP), we can just discard the memory lazily, improving the efficiency of
memory reclamation in this case.
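
As a hedged illustration of the condition only (not the actual hunk;
helper names are taken from existing rmap/huge-page code, and the
reference check is simplified):

  /* Discard a clean, PMD-mapped lazyfree THP without splitting it. */
  if (!pmd_dirty(pmdval) && !folio_test_dirty(folio) &&
      !folio_maybe_dma_pinned(folio)) {
          pmdp_huge_clear_flush(vma, addr, pvmw.pmd);  /* drop the PMD map */
          folio_remove_rmap_pmd(folio, &folio->page, vma);
          folio_put(folio);                            /* reclaimed lazily */
  }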

On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):

--------------------------------------------
|     Old       |      New       |  Change  |
--------------------------------------------
|   0.683426    |    0.049197    |  -92.80% |
--------------------------------------------

[ioworker0@gmail.com: minor changes per David]
Link: https://lkml.kernel.org/r/20240622100057.3352-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20240614015138.31461-4-ioworker0@gmail.com
Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zi Yan <ziy@nvidia.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Fangrui Song <maskray@google.com>
Cc: Jeff Xie <xiehuan09@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/rmap: integrate PMD-mapped folio splitting into pagewalk loop
Lance Yang [Fri, 14 Jun 2024 01:51:37 +0000 (09:51 +0800)]
mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

In preparation for supporting try_to_unmap_one() unmapping PMD-mapped
folios, start the pagewalk first, then call split_huge_pmd_address() to
split the folio.

Link: https://lkml.kernel.org/r/20240614015138.31461-3-ioworker0@gmail.com
Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Fangrui Song <maskray@google.com>
Cc: Jeff Xie <xiehuan09@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm/rmap: remove duplicated exit code in pagewalk loop
Lance Yang [Fri, 14 Jun 2024 01:51:36 +0000 (09:51 +0800)]
mm/rmap: remove duplicated exit code in pagewalk loop

Patch series "Reclaim lazyfree THP without splitting", v8.

This series adds support for reclaiming PMD-mapped THP marked as lazyfree
without needing to first split the large folio via
split_huge_pmd_address().

When the user no longer requires the pages, they would use
madvise(MADV_FREE) to mark the pages as lazy free.  Subsequently, they
typically would not re-write to that memory again.

During memory reclaim, if we detect that the large folio and its PMD are
both still marked as clean and there are no unexpected references (such as
GUP), we can just discard the memory lazily, improving the efficiency of
memory reclamation in this case.

Performance Testing
===================

On an Intel i5 CPU, reclaiming 1GiB of lazyfree THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):

--------------------------------------------
|     Old       |      New       |  Change  |
--------------------------------------------
|   0.683426    |    0.049197    |  -92.80% |
--------------------------------------------

This patch (of 3):

Introduce the labels walk_done and walk_abort as exit points to eliminate
duplicated exit code in the pagewalk loop.
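
The shape of the change, as a hedged sketch (the conditions shown are
placeholders):

  while (page_vma_mapped_walk(&pvmw)) {
          /* ... per-PTE / per-PMD work ... */

          if (failure_condition)          /* placeholder */
                  goto walk_abort;        /* was: ret = false; done; break */
          if (finished_with_this_vma)     /* placeholder */
                  goto walk_done;         /* was: done; break */
          continue;

  walk_abort:
          ret = false;
  walk_done:
          page_vma_mapped_walk_done(&pvmw);
          break;
  }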

Link: https://lkml.kernel.org/r/20240614015138.31461-1-ioworker0@gmail.com
Link: https://lkml.kernel.org/r/20240614015138.31461-2-ioworker0@gmail.com
Signed-off-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Cc: Bang Li <libang.li@antgroup.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Jeff Xie <xiehuan09@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agomm: do not start/end writeback for pages stored in zswap
Usama Arif [Mon, 10 Jun 2024 14:30:37 +0000 (15:30 +0100)]
mm: do not start/end writeback for pages stored in zswap

Most of the work done in folio_start_writeback() is reversed in
folio_end_writeback().  For example, NR_WRITEBACK and
NR_ZONE_WRITE_PENDING are incremented in start_writeback and decremented
in end_writeback.  Calling end_writeback immediately after
start_writeback (separated by folio_unlock) cancels the effect of most of
the work done in start, hence the pair can be removed.

There is some extra work done in folio_end_writeback(); however, it is
incorrect or not applicable to zswap:
- folio_end_writeback incorrectly increments the NR_WRITTEN counter,
  even though the pages aren't written to disk, hence this change
  corrects this behaviour.
- folio_end_writeback calls folio_rotate_reclaimable, but that only
  makes sense for asynchronously written-back pages, while zswap pages
  are reclaimed synchronously.

Link: https://lkml.kernel.org/r/20240612100109.1616626-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20240610143037.812955-1-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 months agoselftests/mm: use asm volatile to not optimize mmap read variable
Pankaj Raghav [Thu, 6 Jun 2024 20:36:19 +0000 (20:36 +0000)]
selftests/mm: use asm volatile to not optimize mmap read variable

create_pagecache_thp_and_fd() in split_huge_page_test.c uses the variable
dummy to read from the mmapped region.

However, this test was skipped even on XFS, which has large folio
support.  The issue was that the compiler (gcc 13.2.0) was optimizing out
the dummy variable, and therefore no huge page was created in the page
cache.

Use the asm volatile() trick to force the compiler not to optimize out
the loop where we read from the mmapped addr.  This is similar to what is
done in other tests (cow.c, etc.).

As the variable is now used in the asm statement, remove the unused
attribute.
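
A hedged sketch of the pattern (names follow the test's context; the
empty asm with a "+r" constraint is the same kind of trick used in
cow.c):

  /* Touch one byte per page; the asm makes 'dummy' count as used, so the
   * loads cannot be optimized away. */
  for (size_t i = 0; i < fd_size; i += pmd_pagesize) {
          char dummy = *((char *)addr + i);

          asm volatile("" : "+r" (dummy));
  }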

Link: https://lkml.kernel.org/r/20240606203619.677276-1-kernel@pankajraghav.com
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>