Miaohe Lin [Sat, 18 Jun 2022 09:05:27 +0000 (17:05 +0800)]
mm/madvise: minor cleanup for swapin_walk_pmd_entry()
Passing index to pte_offset_map_lock() directly so the below calculation
can be avoided. Rename orig_pte to ptep as it's not changed. Also use
helper is_swap_pte() to improve the readability. No functional change
intended.
Muchun Song [Thu, 16 Jun 2022 03:38:46 +0000 (11:38 +0800)]
mm: hugetlb: remove minimum_order variable
commit 641844f5616d ("mm/hugetlb: introduce minimum hugepage order") fixed
a static checker warning and introduced a global variable minimum_order to
fix the warning. However, the local variable in
dissolve_free_huge_pages() can be initialized to
huge_page_order(&default_hstate) to fix the warning.
Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Fri, 17 Jun 2022 13:56:50 +0000 (21:56 +0800)]
mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory
For now, the feature of hugetlb_free_vmemmap is not compatible with the
feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes
precedence over memory_hotplug.memmap_on_memory. However, someone wants
to make memory_hotplug.memmap_on_memory takes precedence over
hugetlb_free_vmemmap since memmap_on_memory makes it more likely to
succeed memory hotplug in close-to-OOM situations. So the decision of
making hugetlb_free_vmemmap take precedence is not wise and elegant.
The proper approach is to have hugetlb_vmemmap.c do the check whether the
section which the HugeTLB pages belong to can be optimized. If the
section's vmemmap pages are allocated from the added memory block itself,
hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do
the optimization. Then both kernel parameters are compatible. So this
patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap
pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page
can be optimized.
Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20220620110616.12056-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Fri, 17 Jun 2022 13:56:49 +0000 (21:56 +0800)]
mm: memory_hotplug: enumerate all supported section flags
Patch series "make hugetlb_optimize_vmemmap compatible with
memmap_on_memory", v3.
This series makes hugetlb_optimize_vmemmap compatible with
memmap_on_memory.
This patch (of 2):
We are almost running out of section flags, only one bit is available in
the worst case (powerpc with 256k pages). However, there are still some
free bits (in ->section_mem_map) on other architectures (e.g. x86_64 has
10 bits available, arm64 has 8 bits available with worst case of 64K
pages). We have hard coded those numbers in code, it is inconvenient to
use those bits on other architectures except powerpc. So transfer those
section flags to enumeration to make it easy to add new section flags in
the future. Also, move SECTION_TAINT_ZONE_DEVICE into the scope of
CONFIG_ZONE_DEVICE to save a bit on non-zone-device case.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:15 +0000 (18:50 +0100)]
mm/swap: convert __put_compound_page() to __folio_put_large()
All the callers now have a folio, so pass it in. This doesn't
save any text, but it does save a call to compound_head() as
folio_test_hugetlb() does not contain a call like PageHuge() does.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:12 +0000 (18:50 +0100)]
mm/swap: convert put_pages_list to use folios
Pages linked through the LRU list cannot be tail pages as ->compound_head
is in a union with one of the words of the list_head, and they cannot
be ZONE_DEVICE pages as ->pgmap is in a union with the same word.
Saves 60 bytes of text by removing a call to page_is_fake_head().
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:11 +0000 (18:50 +0100)]
mm/swap: convert release_pages to use a folio internally
This function was already calling compound_head(), but now it can
cache the result of calling compound_head() and avoid calling it again.
Saves 299 bytes of text by avoiding various calls to compound_page()
and avoiding checks of PageTail.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:08 +0000 (18:50 +0100)]
mm/swap: pull the CPU conditional out of __lru_add_drain_all()
The function is too long, so pull this complicated conditional out into
cpu_needs_drain(). This ends up shrinking the text by 14 bytes,
by allowing GCC to cache the result of calling per_cpu() instead of
relocating each lookup individually.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:06 +0000 (18:50 +0100)]
mm/swap: convert activate_page to a folio_batch
Rename it to just 'activate', saving 696 bytes of text from removals
of compound_page() and the pagevec_lru_move_fn() infrastructure.
Inline need_activate_page_drain() into its only caller.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:02 +0000 (18:50 +0100)]
mm/swap: convert lru_add to a folio_batch
When adding folios to the LRU for the first time, the LRU flag will
already be clear, so skip the test-and-clear part of moving from one
LRU to another.
Removes 285 bytes from kernel text, mostly due to removing
__pagevec_lru_add().
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:49:59 +0000 (18:49 +0100)]
mm: add folios_put()
Patch series "Convert the swap code to be more folio-based".
There's still more to do with the swap code, but this reaps a lot of the
folio benefit. More than 4kB of kernel text saved (with the UEK7 kernel
config). I don't know how much that's going to translate into CPU
savings, but some of those compound_head() calls are on every page free,
so it should be noticable. It might even be noticable just from an
I-cache consumption perspective.
This patch (of 22):
This is just a wrapper around release_pages() for now. Place the
prototype in mm.h along with folio_put() and folio_put_refs().
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 15:42:44 +0000 (16:42 +0100)]
mm/vmscan: convert reclaim_clean_pages_from_list() to folios
Patch series "nvert much of vmscan to folios"
vmscan always operates on folios since it puts the pages on the LRU list.
Switching all of these functions from pages to folios saves 1483 bytes of
text from removing all the baggage around calling compound_page() and
similar functions.
This patch (of 5):
This is a straightforward conversion which removes several hidden calls
to compound_head, saving 330 bytes of kernel text.
Kuan-Ying Lee [Wed, 15 Jun 2022 06:22:18 +0000 (14:22 +0800)]
kasan: separate double free case from invalid free
Currently, KASAN describes all invalid-free/double-free bugs as
"double-free or invalid-free". This is ambiguous.
KASAN should report "double-free" when a double-free is a more likely
cause (the address points to the start of an object) and report
"invalid-free" otherwise [1].
Yang Shi [Thu, 16 Jun 2022 17:48:40 +0000 (10:48 -0700)]
doc: proc: fix the description to THPeligible
The THPeligible bit shows 1 if and only if the VMA is eligible for
allocating THP and the THP is also PMD mappable. Some misaligned file
VMAs may be eligible for allocating THP but the THP can't be mapped by
PMD. Make this more explicitly to avoid ambiguity.
Link: https://lkml.kernel.org/r/20220616174840.1202070-8-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:39 +0000 (10:48 -0700)]
mm: khugepaged: reorg some khugepaged helpers
The khugepaged_{enabled|always|req_madv} are not khugepaged only anymore,
move them to huge_mm.h and rename to hugepage_flags_xxx, and remove
khugepaged_req_madv due to no users.
Also move khugepaged_defrag to khugepaged.c since its only caller is in
that file, it doesn't have to be in a header file.
Link: https://lkml.kernel.org/r/20220616174840.1202070-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:38 +0000 (10:48 -0700)]
mm: thp: kill __transhuge_page_enabled()
The page fault path checks THP eligibility with __transhuge_page_enabled()
which does the similar thing as hugepage_vma_check(), so use
hugepage_vma_check() instead.
However page fault allows DAX and !anon_vma cases, so added a new flag,
in_pf, to hugepage_vma_check() to make page fault work correctly.
The in_pf flag is also used to skip shmem and file THP for page fault
since shmem handles THP in its own shmem_fault() and file THP allocation
on fault is not supported yet.
Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only
caller now, it is not necessary to have a helper function.
Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Thu, 23 Jun 2022 00:02:45 +0000 (17:02 -0700)]
mm-thp-kill-transparent_hugepage_active-fix-fix
add comment to vdso check
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 22 Jun 2022 00:51:42 +0000 (17:51 -0700)]
mm-thp-kill-transparent_hugepage_active-fix
check vma->vm_mm, per Zach
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:37 +0000 (10:48 -0700)]
mm: thp: kill transparent_hugepage_active()
The transparent_hugepage_active() was introduced to show THP eligibility
bit in smaps in proc, smaps is the only user. But it actually does the
similar check as hugepage_vma_check() which is used by khugepaged. We
definitely don't have to maintain two similar checks, so kill
transparent_hugepage_active().
This patch also fixed the wrong behavior for VM_NO_KHUGEPAGED vmas.
Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it
is not only for khugepaged anymore.
Link: https://lkml.kernel.org/r/20220616174840.1202070-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:36 +0000 (10:48 -0700)]
mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate
The hugepage_vma_revalidate() needs to check if the vma is still anonymous
vma or not since the address may be unmapped then remapped to file before
khugepaged reaquired the mmap_lock.
The old comment is not quite helpful, elaborate this with better comment.
Link: https://lkml.kernel.org/r/20220616174840.1202070-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:35 +0000 (10:48 -0700)]
mm: thp: consolidate vma size check to transhuge_vma_suitable
There are couple of places that check whether the vma size is ok for THP
or whether address fits, they are open coded and duplicate, use
transhuge_vma_suitable() to do the job by passing in (vma->end -
HPAGE_PMD_SIZE).
Move vma size check into hugepage_vma_check(). This will make
khugepaged_enter() is as same as khugepaged_enter_vma(). There is just
one caller for khugepaged_enter(), replace it to khugepaged_enter_vma()
and remove khugepaged_enter().
Link: https://lkml.kernel.org/r/20220616174840.1202070-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:34 +0000 (10:48 -0700)]
mm: khugepaged: check THP flag in hugepage_vma_check()
Patch series "Cleanup transhuge_xxx helpers", v5.
This series is the follow-up of the discussion about cleaning up
transhuge_xxx helpers at
https://lore.kernel.org/linux-mm/627a71f8-e879-69a5-ceb3-fc8d29d2f7f1@suse.cz/.
THP has a bunch of helpers that do VMA sanity check for different paths,
they do the similar checks for the most callsites and have a lot duplicate
codes. And it is confusing what helpers should be used at what
conditions.
This series reorganized and cleaned up the code so that we could
consolidate all the checks into hugepage_vma_check().
The transhuge_vma_enabled(), transparent_hugepage_active() and
__transparent_hugepage_enabled() are killed by this series.
This patch (of 7):
Currently the THP flag check in hugepage_vma_check() will fallthrough if
the flag is NEVER and VM_HUGEPAGE is set. This is not a problem for now
since all the callers have the flag checked before or can't be invoked if
the flag is NEVER.
However, the following patch will call hugepage_vma_check() in more
places, for example, page fault, so this flag must be checked in
hugepge_vma_check().
Liam Howlett [Wed, 15 Jun 2022 17:40:58 +0000 (17:40 +0000)]
mm/mlock: drop dead code in count_mm_mlocked_page_nr()
The check for mm being null has never been needed since the only caller
has always passed in current->mm. Remove the check from
count_mm_mlocked_page_nr().
David Hildenbrand [Tue, 14 Jun 2022 09:36:29 +0000 (11:36 +0200)]
mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection
Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
can try mapping anonymous pages in a private writable mapping writable if
they are exclusive, the PTE is already dirty, and no special handling
applies. Mapping the anonymous page writable is essentially the same
thing the write fault handler would do in this case.
Special handling is required for uffd-wp and softdirty tracking, so take
care of that properly. Also, leave PROT_NONE handling alone for now; in
the future, we could similarly extend the logic in do_numa_page() or use
pte_mk_savedwrite() here.
While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
performance, it should also be a valuable optimization for uffd-wp, when
un-protecting.
This has been previously suggested by Peter Collingbourne in [1], relevant
in the context of the Scudo memory allocator, before we had
PageAnonExclusive.
This commit doesn't add the same handling for PMDs (i.e., anonymous THP,
anonymous hugetlb); benchmark results from Andrea indicate that there are
minor performance gains, so it's might still be valuable to streamline
that logic for all anonymous pages in the future.
As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename it
to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually
happening.
Link: https://lkml.kernel.org/r/20220614093629.76309-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Collingbourne <pcc@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Edward Liaw [Mon, 13 Jun 2022 23:33:21 +0000 (23:33 +0000)]
userfaultfd: selftests: infinite loop in faulting_process
On Android this test is getting stuck in an infinite loop due to
indeterminate behavior:
The local variables steps and signalled were being reset to 1 and 0
respectively after every jump back to sigsetjmp by siglongjmp in the
signal handler. The test was incrementing them and expecting them to
retain their incremented values. The documentation for siglongjmp says:
All accessible objects have values as of the time sigsetjmp() was called,
except that the values of objects of automatic storage duration which are
local to the function containing the invocation of the corresponding
sigsetjmp() which do not have volatile-qualified type and which are
changed between the sigsetjmp() invocation and siglongjmp() call are
indeterminate.
Tagging steps and signalled with volatile enabled the test to pass.
Link: https://lkml.kernel.org/r/20220613233321.431282-1-edliaw@google.com Signed-off-by: Edward Liaw <edliaw@google.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Wed, 1 Jun 2022 21:09:51 +0000 (14:09 -0700)]
selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
This new mode was recently added to the userfaultfd selftest. We want to
exercise both userfaultfd(2) as well as /dev/userfaultfd, so add both test
cases to the script.
Link: https://lkml.kernel.org/r/20220601210951.3916598-7-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Wed, 1 Jun 2022 21:09:50 +0000 (14:09 -0700)]
userfaultfd: selftests: make /dev/userfaultfd testing configurable
Instead of always testing both userfaultfd(2) and /dev/userfaultfd, let
the user choose which to test.
As with other test features, change the behavior based on a new command
line flag. Introduce the idea of "test mods", which are generic (not
specific to a test type) modifications to the behavior of the test. This
is sort of borrowed from this RFC patch series [1], but simplified a bit.
The benefit is, in "typical" configurations this test is somewhat slow
(say, 30sec or something). Testing both clearly doubles it, so it may not
always be desirable, as users are likely to use one or the other, but
never both, in the "real world".
Axel Rasmussen [Wed, 1 Jun 2022 21:09:47 +0000 (14:09 -0700)]
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
Link: https://lkml.kernel.org/r/20220601210951.3916598-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Wed, 1 Jun 2022 21:09:46 +0000 (14:09 -0700)]
selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
Patch series "userfaultfd: add /dev/userfaultfd for fine grained access
control", v3.
This patch (of 6):
This not being included was just a simple oversight. There are certain
features (like minor fault support) which are only enabled on shared
mappings, so without including hugetlb_shared we actually lose a
significant amount of test coverage.
Link: https://lkml.kernel.org/r/20220601210951.3916598-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220601210951.3916598-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 13 Jun 2022 19:23:00 +0000 (19:23 +0000)]
mm/damon: introduce DAMON-based LRU-lists Sorting
Users can do data access-aware LRU-lists sorting using 'LRU_PRIO' and
'LRU_DEPRIO' DAMOS actions. However, finding best parameters including
the hotness/coldness thresholds, CPU quota, and watermarks could be
challenging for some users. To make the scheme easy to be used without
complex tuning for common situations, this commit implements a static
kernel module called 'DAMON_LRU_SORT' using the 'LRU_PRIO' and
'LRU_DEPRIO' DAMOS actions.
It proactively sorts LRU-lists using DAMON with conservatively chosen
default values of the parameters. That is, the module under its default
parameters will make no harm for common situations but provide some level
of efficiency improvements for systems having clear hot/cold access
pattern under a level of memory pressure while consuming only a limited
small portion of CPU time.
SeongJae Park [Mon, 13 Jun 2022 19:22:58 +0000 (19:22 +0000)]
mm/damon/schemes: add 'LRU_DEPRIO' action
This commit adds a new DAMON-based operation scheme action called
'LRU_DEPRIO' for physical address space. The action deprioritizes pages
in the memory area of the target access pattern on their LRU lists. This
is hence supposed to be used for rarely accessed (cold) memory regions so
that cold pages could be more likely reclaimed first under memory
pressure. Internally, it simply calls 'lru_deactivate()'.
Using this with 'LRU_PRIO' action for hot pages, users can proactively
sort LRU lists based on the access pattern. That is, it can make the LRU
lists somewhat more trustworthy source of access temperature. As a
result, efficiency of LRU-lists based mechanisms including the reclamation
target selection could be improved.
SeongJae Park [Mon, 13 Jun 2022 19:22:56 +0000 (19:22 +0000)]
mm/damon/schemes: add 'LRU_PRIO' DAMOS action
This commit adds a new DAMOS action called 'LRU_PRIO' for the physical
address space. The action prioritizes pages in the memory regions of the
user-specified target access pattern on their LRU lists. This is hence
supposed to be used for frequently accessed (hot) memory regions so that
hot pages could be more likely protected under memory pressure.
Internally, it simply calls 'mark_page_accessed()'.
SeongJae Park [Mon, 13 Jun 2022 19:22:55 +0000 (19:22 +0000)]
mm/damon/paddr: use a separate function for 'DAMOS_PAGEOUT' handling
This commit moves code for 'DAMOS_PAGEOUT' handling of the physical
address space monitoring operations set to a separate function so that its
caller, 'damon_pa_apply_scheme()', can be more easily extended for
additional DAMOS actions later.
SeongJae Park [Mon, 13 Jun 2022 19:22:53 +0000 (19:22 +0000)]
mm/damon/dbgfs: add and use mappings between 'schemes' action inputs and 'damos_action' values
Patch series "Extend DAMOS for Proactive LRU-lists Sorting".
Introduction
============
In short, this patchset 1) extends DAMON-based Operation Schemes (DAMOS)
for low overhead data access pattern based LRU-lists sorting, and 2)
implements a static kernel module for easy use of conservatively-tuned
version of that using the extended DAMOS capability.
Background
----------
As page-granularity access checking overhead could be significant on huge
systems, LRU lists are normally not proactively sorted but partially and
reactively sorted for special events including specific user requests,
system calls and memory pressure. As a result, LRU lists are sometimes
not so perfectly prepared to be used as a trustworthy access pattern
source for some situations including reclamation target pages selection
under sudden memory pressure.
Because DAMON can identify access patterns of best-effort accuracy while
inducing only user-specified range of overhead, using DAMON for Proactive
LRU-lists Sorting (PLRUS) could be helpful for this situation. The idea
is quite simple. Find hot pages and cold pages using DAMON, and
prioritize hot pages while deprioritizing cold pages on their LRU-lists.
This patchset extends DAMON to support such schemes by introducing a
couple of new DAMOS actions for prioritizing and deprioritizing memory
regions of specific access patterns on their LRU-lists. In detail, this
patchset simply uses 'mark_page_accessed()' and 'deactivate_page()'
functions for prioritization and deprioritization of pages on their LRU
lists, respectively.
To make the scheme easy to use without complex tuning for common
situations, this patchset further implements a static kernel module called
'DAMON_LRU_SORT' using the extended DAMOS functionality. It proactively
sorts LRU-lists using DAMON with conservatively chosen default
hotness/coldness thresholds and small CPU usage quota limit. That is, the
module under its default parameters will make no harm for common situation
but provide some level of benefit for systems having clear hot/cold access
pattern under only memory pressure while consuming only limited small
portion of CPU time.
Related Works
-------------
Proactive reclamation is well known to be helpful for reducing non-optimal
reclamation target selection caused performance drops. However, proactive
reclamation is not a best option for some cases, because it could incur
additional I/O. For an example, it could be prohitive for systems using
storage devices that total number of writes is limited, or cloud block
storages that charges every I/O.
Some proactive reclamation approaches[1,2] induce a level of memory
pressure using memcg files or swappiness while monitoring PSI. As
reclamation target selection is still relying on the original LRU-lists
mechanism, using DAMON-based proactive reclamation before inducing the
proactive reclamation could allow more memory saving with same level of
performance overhead, or less performance overhead with same level of
memory saving.
In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page
faults reduction, and 3.74% speedup under memory pressure.
Setup
-----
To show the effect of PLRUS, I run PARSEC3 and SPLASH-2X benchmarks under
below variant systems and measure a few metrics including the runtime of
each workload, number of system-wide major page faults, and system-wide
memory PSI (some).
- orig: v5.18-rc4 based mm-unstable kernel + this patchset, but no DAMON scheme
applied.
- mprs: Same to 'orig' but artificial memory pressure is induced.
- plrus: Same to 'mprs' but a radically tuned PLRUS scheme is applied to the
entire physical address space of the system.
For the artificial memory pressure, I set 'memory.limit_in_bytes' to 75%
of the running workload's peak RSS, wait 1 second, remove the pressure by
setting it to 200% of the peak RSS, wait 10 seconds, and repeat the
procedure until the workload finishes[1]. I use zram based swap device.
The tests are automated[2].
To show effect of PLRUS on the PARSEC3/SPLASH-2X workloads which runs for
no long time, we use radically tuned version of PLRUS. The version asks
DAMON to do the proactive LRU-lists sorting as below.
1. Find any memory regions shown some accesses (approximately >=20 accesses per
100 sampling) and prioritize pages of the regions on their LRU lists using
up to 2% CPU time. Under the CPU time limit, prioritize regions having
higher access frequency and kept the access frequency longer first.
2. Find any memory regions shown no access for at least >=5 seconds and
deprioritize pages of the rgions on their LRU lists using up to 2% CPU time.
Under the CPU time limit, deprioritize regions that not accessed for longer
time first.
Results
-------
I repeat the tests 25 times and calculate average of the measured numbers.
The results are as below:
The first row is for legend. The first cell shows the metric that the
following cells of the row shows. Second, third, and fourth cells show
the metrics under the configs shown at the first row of the cell, and the
fifth cell shows the metric under 'plrus' divided by the metric under
'mprs'. Second row shows the averaged runtime of the workloads in
seconds. Third row shows the number of system-wide major page faults
while the test was ongoing. Fourth row shows the system-wide memory
pressure stall for some processes in microseconds while the test was
ongoing.
In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page
faults reduction, and 3.74% speedup under memory pressure. We also
confirmed the CPU usage of kdamond was 2.61% of single CPU, which is below
4% as expected.
Sequence of Patches
===================
The first and second patch cleans up DAMON debugfs interface and
DAMOS_PAGEOUT handling code of physical address space monitoring
operations implementation for easier extension of the code.
The thrid and fourth patches implement a new DAMOS action called
'lru_prio', which prioritizes pages under memory regions which have a
user-specified access pattern, and document it, respectively. The fifth
and sixth patches implement yet another new DAMOS action called
'lru_deprio', which deprioritizes pages under memory regions which have a
user-specified access pattern, and document it, respectively.
The seventh patch implements a static kernel module called
'damon_lru_sort', which utilizes the DAMON-based proactive LRU-lists
sorting under conservatively chosen default parameter. Finally, the
eighth patch documents 'damon_lru_sort'.
This patch (of 8):
DAMON debugfs interface assumes users will write 'damos_action' value
directly to the 'schemes' file. This makes adding new 'damos_action' in
the middle of its definition breaks the backward compatibility of DAMON
debugfs interface, as values of some 'damos_action' could be changed. To
mitigate the situation, this commit adds mappings between the user inputs
and 'damos_action' value and makes DAMON debugfs code uses those.
Miaohe Lin [Sat, 11 Jun 2022 02:13:52 +0000 (10:13 +0800)]
mm/page_alloc: minor clean up for memmap_init_compound()
Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
compound_pincount_ptr is stored at first tail page now. So we should call
prep_compound_head() after the first tail page is initialized to take
advantage of the likelihood of that tail struct page being cached given
that we will read them right after in prep_compound_head().
Miaohe Lin [Wed, 8 Jun 2022 14:14:32 +0000 (22:14 +0800)]
mm/vmscan: don't try to reclaim freed folios
If folios were freed from under us, there's no need to reclaim them. Skip
these folios to save lots of cpu cycles and avoid possible unnecessary
disk I/O.
Miaohe Lin [Wed, 8 Jun 2022 14:40:31 +0000 (22:40 +0800)]
mm/swap: remove swap_cache_info statistics
swap_cache_info are not statistics that could be easily used to tune
system performance because they are not easily accessile. Also they can't
provide really useful info when OOM occurs. Remove these statistics can
also help mitigate unneeded global swap_cache_info cacheline contention.
Link: https://lkml.kernel.org/r/20220608144031.829-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Wed, 8 Jun 2022 14:40:30 +0000 (22:40 +0800)]
mm/swapfile: fix possible data races of inuse_pages
si->inuse_pages could still be accessed concurrently now. The plain reads
outside si->lock critical section, i.e. swap_show and si_swapinfo, which
results in data races. READ_ONCE and WRITE_ONCE is used to fix such data
races. Note these data races should be ok because they're just used for
showing swap info.
mm/vmalloc: extend __find_vmap_area() with one more argument
__find_vmap_area() finds a "vmap_area" based on passed address. It scan
the specific "vmap_area_root" rb-tree. Extend the function with one extra
argument, so any tree can be specified where the search has to be done.
There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20220607093449.3100-5-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmalloc: initialize VA's list node after unlink
A vmap_area can travel between different places. For example
attached/detached to/from different rb-trees. In order to prevent fancy
bugs, initialize a VA's list node after it is removed from the list, so it
pairs with VA's rb_node which is also initialized.
There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20220607093449.3100-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmalloc: extend __alloc_vmap_area() with extra arguments
It implies that __alloc_vmap_area() allocates only from the global vmap
space, therefore a list-head and rb-tree, which represent a free vmap
space, are not passed as parameters to this function and are accessed
directly from this function.
Extend the __alloc_vmap_area() and other dependent functions to have a
possibility to allocate from different trees making an interface common
and not specific.
There is no functional change as a result of this patch.
Link: https://lkml.kernel.org/r/20220607093449.3100-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmalloc: make link_va()/unlink_va() common to different rb_root
Patch series "Reduce a vmalloc internal lock contention preparation work".
This small serias is preparation work to implement per-cpu vmalloc
allocation in order to reduce a high internal lock contention. This
series does not introduce any functional changes, it is only about
preparation.
This patch (of 5):
Currently link_va() and unlik_va(), in order to figure out a tree type,
compares a passed root value with a global free_vmap_area_root variable to
distinguish the augmented rb-tree from a regular one. It is hard coded
since such functions can manipulate only with specific
"free_vmap_area_root" tree that represents a global free vmap space.
Make it common by introducing "_augment" versions of both internal
functions, so it is possible to deal with different trees.
There is no functional change as a result of this patch.
Shiyang Ruan [Fri, 3 Jun 2022 05:37:38 +0000 (13:37 +0800)]
xfs: add dax dedupe support
Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files who
are going to be deduped. After that, call compare range function only
when files are both DAX or not.
Link: https://lkml.kernel.org/r/20220603053738.1218681-15-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:37 +0000 (13:37 +0800)]
xfs: support CoW in fsdax mode
In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After that, new allocated extents needs to be remapped to the file. So,
add a CoW identification in ->iomap_begin(), and implement ->iomap_end()
to do the remapping work.
Link: https://lkml.kernel.org/r/20220603053738.1218681-14-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:36 +0000 (13:37 +0800)]
fsdax: dedup file range to use a compare function
With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar with vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.
Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:35 +0000 (13:37 +0800)]
fsdax: add dax_iomap_cow_copy() for dax zero
Punch hole on a reflinked file needs dax_iomap_cow_copy() too. Otherwise,
data in not aligned area will be not correct. So, add the CoW operation
for not aligned case in dax_memzero().
Link: https://lkml.kernel.org/r/20220603053738.1218681-12-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:34 +0000 (13:37 +0800)]
fsdax: replace mmap entry in case of CoW
Replace the existing entry to the newly allocated one in case of CoW.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as writeprotected. This helps us snapshots so new write pagefaults
after snapshots trigger a CoW.
Link: https://lkml.kernel.org/r/20220603053738.1218681-11-ruansy.fnst@fujitsu.com Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:33 +0000 (13:37 +0800)]
fsdax: introduce dax_iomap_cow_copy()
In the case where the iomap is a write operation and iomap is not equal to
srcmap after iomap_begin, we consider it is a CoW operation.
In this case, the destination (iomap->addr) points to a newly allocated
extent. It is needed to copy the data from srcmap to the extent. In
theory, it is better to copy the head and tail ranges which is outside of
the non-aligned area instead of copying the whole aligned range. But in
dax page fault, it will always be an aligned range. So copy the whole
range in this case.
Link: https://lkml.kernel.org/r/20220603053738.1218681-10-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:32 +0000 (13:37 +0800)]
fsdax: output address in dax_iomap_pfn() and rename it
Add address output in dax_iomap_pfn() in order to perform a memcpy() in
CoW case. Since this function both output address and pfn, rename it to
dax_iomap_direct_access().
Link: https://lkml.kernel.org/r/20220603053738.1218681-9-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:31 +0000 (13:37 +0800)]
fsdax: set a CoW flag when associate reflink mappings
Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file
mappings. In this case, since the dax-rmap has already took the
responsibility to look up for shared files by given dax page, the
page->mapping is no longer to used for rmap but for marking that this dax
page is shared. And to make sure disassociation works fine, we use
page->index as refcount, and clear page->mapping to the initial state when
page->index is decreased to 0.
With the help of this new flag, it is able to distinguish normal case and
CoW case, and keep the warning in normal case.
Link: https://lkml.kernel.org/r/20220603053738.1218681-8-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:30 +0000 (13:37 +0800)]
xfs: implement ->notify_failure() for XFS
Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.
If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data. For now all we do is
kill processes with that file mapped into their address spaces, but future
patches could actually do something about corrupt metadata.
After that, the memory failure needs to notify the processes who are using
those files.
Link: https://lkml.kernel.org/r/20220603053738.1218681-7-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:29 +0000 (13:37 +0800)]
mm: introduce mf_dax_kill_procs() for fsdax case
This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing a
DAX mapping. It is intended to be called by the file systems as part of
the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.
Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:28 +0000 (13:37 +0800)]
fsdax: introduce dax_lock_mapping_entry()
The current dax_lock_page() locks dax entry by obtaining mapping and index
in page. To support 1-to-N RMAP in NVDIMM, we need a new function to lock
a specific dax entry corresponding to this file's mapping,index. And
output the page corresponding to the specific dax entry for caller use.
Link: https://lkml.kernel.org/r/20220603053738.1218681-5-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:27 +0000 (13:37 +0800)]
pagemap,pmem: introduce ->memory_failure()
When memory-failure occurs, we call this function which is implemented by
each kind of devices. For the fsdax case, pmem device driver implements
it. Pmem device driver will find out the filesystem in which the
corrupted page located in.
With dax_holder notify support, we are able to notify the memory failure
from pmem driver to upper layers. If there is something not support in
the notify routine, memory_failure will fall back to the generic hanlder.
Link: https://lkml.kernel.org/r/20220603053738.1218681-4-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:26 +0000 (13:37 +0800)]
mm: factor helpers for memory_failure_dev_pagemap
memory_failure_dev_pagemap code is a bit complex before introduce RMAP
feature for fsdax. So it is needed to factor some helper functions to
simplify these code.
Link: https://lkml.kernel.org/r/20220603053738.1218681-3-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shiyang Ruan [Fri, 3 Jun 2022 05:37:25 +0000 (13:37 +0800)]
dax: introduce holder for dax_device
Patch series "v14 fsdax-rmap + v11 fsdax-reflink", v2.
The patchset fsdax-rmap is aimed to support shared pages tracking for
fsdax.
It moves owner tracking from dax_assocaite_entry() to pmem device driver,
by introducing an interface ->memory_failure() for struct pagemap. This
interface is called by memory_failure() in mm, and implemented by pmem
device.
Then call holder operations to find the filesystem which the corrupted
data located in, and call filesystem handler to track files or metadata
associated with this page.
Finally we are able to try to fix the corrupted data in filesystem and do
other necessary processing, such as killing processes who are using the
files affected.
The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure() => pmem_pgmap_memory_failure()
| dax_holder_notify_failure() =>
| dax_device->holder_ops->notify_failure() =>
| - xfs_dax_notify_failure()
| |* xfs_dax_notify_failure()
| |--------------------------
| | xfs_rmap_query_range()
| | xfs_dax_failure_fn()
| | * corrupted on metadata
| | try to recover data, call xfs_force_shutdown()
| | * corrupted on file data
| | try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
|mf_generic_kill_procs()
The patchset fsdax-reflink attempts to add CoW support for fsdax, and
takes XFS, which has both reflink and fsdax features, as an example.
One of the key mechanisms needed to be implemented in fsdax is CoW. Copy
the data from srcmap before we actually write data to the destination
iomap. And we just copy range in which data won't be changed.
Another mechanism is range comparison. In page cache case, readpage() is
used to load data on disk to page cache in order to be able to compare
data. In fsdax case, readpage() does not work. So, we need another
compare data with direct access support.
With the two mechanisms implemented in fsdax, we are able to make reflink
and fsdax work together in XFS.
This patch (of 14):
To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation. This holder is used to
remember who is using this dax_device:
- When it is the backend of a filesystem, the holder will be the
instance of this filesystem.
- When this pmem device is one of the targets in a mapped device, the
holder will be this mapped device. In this case, the mapped device
has its own dax_device and it will follow the first rule. So that we
can finally track to the filesystem we needed.
The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.
Link: https://lkml.kernel.org/r/20220603053738.1218681-1-ruansy.fnst@fujitsu.com Link: https://lkml.kernel.org/r/20220603053738.1218681-2-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.wiliams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ERROR: space required before the open parenthesis '('
#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+ while(getline(&line, &len, fp) != -1) {
ERROR: space required after that ',' (ctx:VxV)
#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+ char *first = strtok(line,"- ");
^
ERROR: space required after that ',' (ctx:VxV)
#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+ char *second = strtok(NULL,"- ");
^
WARNING: Missing a blank line after declarations
#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+ void *second_val = (void *) strtol(second, NULL, 16);
+ if (first_val == start && second_val == start + 3 * page_size) {
total: 3 errors, 3 warnings, 113 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Matěna [Fri, 3 Jun 2022 14:57:19 +0000 (16:57 +0200)]
mm: add merging after mremap resize
When mremap call results in expansion, it might be possible to merge the
VMA with the next VMA which might become adjacent. This patch adds
vma_merge call after the expansion is done to try and merge.
Link: https://lkml.kernel.org/r/20220603145719.1012094-3-matenajakub@gmail.com Signed-off-by: Jakub Matěna <matenajakub@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Matěna [Fri, 3 Jun 2022 14:57:18 +0000 (16:57 +0200)]
mm: refactor of vma_merge()
Patch series "Refactor of vma_merge and new merge call", v4.
I am currently working on my master's thesis trying to increase number of
merges of VMAs currently failing because of page offset incompatibility
and difference in their anon_vmas. The following refactor and added merge
call included in this series is just two smaller upgrades I created along
the way.
This patch (of 2):
Refactor vma_merge() to make it shorter and more understandable. Main
change is the elimination of code duplicity in the case of merge next
check. This is done by first doing checks and caching the results before
executing the merge itself. The variable 'area' is divided into 'mid' and
'res' as previously it was used for two purposes, as the middle VMA
between prev and next and also as the result of the merge itself. Exit
paths are also unified.
Link: https://lkml.kernel.org/r/20220603145719.1012094-1-matenajakub@gmail.com Link: https://lkml.kernel.org/r/20220603145719.1012094-2-matenajakub@gmail.com Signed-off-by: Jakub Matěna <matenajakub@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Rik van Riel <riel@surriel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 2 Jun 2022 05:06:31 +0000 (14:06 +0900)]
mm, hwpoison: enable memory error handling on 1GB hugepage
Now error handling code is prepared, so remove the blocking code and
enable memory error handling on 1GB hugepage.
Link: https://lkml.kernel.org/r/20220602050631.771414-6-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 2 Jun 2022 05:06:30 +0000 (14:06 +0900)]
mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
Currently if memory_failure() (modified to remove blocking code) is called
on a page in some 1GB hugepage, memory error handling returns failure and
the raw error page gets into undesirable state. The impact is small in
production systems (just leaked single 4kB page), but this limits the test
efficiency because unpoison doesn't work for it. So we can no longer
create 1GB hugepage on the 1GB physical address range with such hwpoison
pages, that could be an issue in testing on small systems.
When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the hugepage is broken
down into raw error page and order is 0:
if (unlikely(PageHWPoison(page)) && !order) {
...
return false;
}
Then, the page is not sent to buddy and the page refcount is left 0.
Originally this check is supposed to work when the error page is freed
from page_handle_poison() (that is called from soft-offline), but now we
are opening another path to call it, so the callers of
__page_handle_poison() need to handle the case by considering the return
value 0 as success. Then page refcount for hwpoison is properly
incremented and now unpoison works.
Link: https://lkml.kernel.org/r/20220602050631.771414-5-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 2 Jun 2022 05:06:29 +0000 (14:06 +0900)]
mm, hwpoison: make __page_handle_poison returns int
__page_handle_poison() returns bool that shows whether
take_page_off_buddy() has passed or not now. But we will want to
distinguish another case of "dissolve has passed but taking off failed" by
its return value. So change the type of the return value. No functional
change.
Link: https://lkml.kernel.org/r/20220602050631.771414-4-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 2 Jun 2022 05:06:28 +0000 (14:06 +0900)]
mm,hwpoison: set PG_hwpoison for busy hugetlb pages
If memory_failure() fails to grab page refcount on a hugetlb page because
it's busy, it returns without setting PG_hwpoison on it. This not only
loses a chance of error containment, but breaks the rule that
action_result() should be called only when memory_failure() do any of
handling work (even if that's just setting PG_hwpoison). This
inconsistency could harm code maintainability.
So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.
Link: https://lkml.kernel.org/r/20220602050631.771414-3-naoya.horiguchi@linux.dev Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__get_huge_page_for_hwpoison() is not needed when CONFIG_HUGETLB_PAGE is
n, so extending "#ifdef CONFIG_HUGETLB_PAG" to cover
__get_huge_page_for_hwpoison() would be a simple resolution.
Reported-by: kernel test robot <lkp@intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 2 Jun 2022 05:06:27 +0000 (14:06 +0900)]
mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
This patchset enables memory error handling on 1GB hugepage.
"Save raw error page" patch (1/4 of patchset [1]) is necessary, so it's
included in this series (the remaining part of hotplug related things are
still in progress). Patch 2/5 solves issues in a corner case of hugepage
handling, which might not be the main target of this patchset, but
slightly related. It was posted separately [2] but depends on 1/5, so I
group them together.
Patches 3/5 to 5/5 are the main part of this series and fix a small issue
about handling 1GB hugepage, which I hope will be workable.
When handling memory error on a hugetlb page, the error handler tries to
dissolve and turn it into 4kB pages. If it's successfully dissolved,
PageHWPoison flag is moved to the raw error page, so that's all right.
However, dissolve sometimes fails, then the error page is left as
hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
healthy pages, but that's not possible now because the information about
where the raw error page is lost.
Use the private field of a tail page to keep that information. The code
path of shrinking hugepage pool used this info to try delayed dissolve.
This only keeps one hwpoison page for now, which might be OK because it's
simple and multiple hwpoison pages in a hugepage can be rare. But it can
be extended in the future.