mm/mempolicy: use vma iterator & maple state instead of vma linked list
Reworked the way mbind_range() finds the first VMA to reuse the maple
state and limit the number of tree walks needed.
Note, this drops the VM_BUG_ON(!vma) call, which would catch a start
address higher than the last VMA. The code was written in a way that
allowed no VMA updates to occur and still return success. There should be
no functional change to this scenario with the new code.
The linked list is slower than walking the VMAs using the maple tree. We
can't use the VMA iterator here because it doesn't support moving to an
earlier position.
The VMA iterator is faster than the linked llist, and it can be walked
even when VMAs are being removed from the address space, so there's no
need to keep track of 'next'.
Use the Maple Tree iterator instead. This is too complicated for the VMA
iterator to handle, so let's open-code it for now. If this turns out to
be a common pattern, we can migrate it to common code.
Use the VMA iterator instead. This requires a little restructuring of the
surrounding code to hoist the mm to the caller. That turns
cxl_prefault_one() into a trivial function, so call cxl_fault_segment()
directly.
Use the VMA iterator instead. Since VMA can no longer be NULL in the
loop, then deal with out-of-memory outside the loop. This means a
slightly longer run time in the failure case (-ENOMEM) - it will run to
the end of the VMAs before erroring instead of in the middle of the loop.
mm/mmap: change do_brk_munmap() to use do_mas_align_munmap()
do_brk_munmap() has already aligned the address and has a maple tree state
to be used. Use the new do_mas_align_munmap() to avoid unnecessary
alignment and error checks.
Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
do_mas_align_munmap().
do_munmap() is a wrapper to create a maple state for any callers that have
not been converted to the maple tree.
do_mas_munmap() takes a maple state to mumap a range. This is just a
small function which checks for error conditions and aligns the end of the
range.
do_mas_align_munmap() uses the aligned range to mumap a range.
do_mas_align_munmap() starts with the first VMA in the range, then finds
the last VMA in the range. Both start and end are split if necessary.
Then the VMAs are removed from the linked list and the mm mlock count is
updated at the same time. Followed by a single tree operation of
overwriting the area in with a NULL. Finally, the detached list is
unmapped and freed.
By reorganizing the munmap calls as outlined, it is now possible to avoid
extra work of aligning pre-aligned callers which are known to be safe,
avoid extra VMA lookups or tree walks for modifications.
detach_vmas_to_be_unmapped() is no longer used, so drop this code.
vm_brk_flags() can just call the do_mas_munmap() as it checks for
intersecting VMAs directly.
By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code. Remove the vmacache
to reduce the work in keeping it up to date and code complexity.
mm/mmap: use advanced maple tree API for mmap_region()
Changing mmap_region() to use the maple tree state and the advanced maple
tree interface allows for a lot less tree walking.
This change removes the last caller of munmap_vma_range(), so drop this
unused function.
Add vma_expand() to expand a VMA if possible by doing the necessary
hugepage check, uprobe_munmap of files, dcache flush, modifications then
undoing the detaches, etc.
mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()
Avoid allocating a new VMA when it a vma modification can occur. When a
brk() can expand or contract a VMA, then the single store operation will
only modify one index of the maple tree instead of causing a node to split
or coalesce. This avoids unnecessary allocations/frees of maple tree
nodes and VMAs.
Move some limit & flag verifications out of the do_brk_flags() function to
use only relevant checks in the code path of bkr() and vm_brk_flags().
Set the vma to check if it can expand in vm_brk_flags() if extra criteria
are met.
Drop userfaultfd from do_brk_flags() path and only use it in
vm_brk_flags() path since that is the only place a munmap will happen.
mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup()
vma_lookup() will walk the vma tree once and not continue to look for the
next vma. Since the exact vma is checked below, this is a more optimal
way of searching.
Use vma_lookup() to walk the tree to the start value requested. If the
vma at the start does not match, then the answer is NULL and there is no
need to look at the next vma the way that find_vma() would.
vma_lookup() walks the VMA tree for a specific value, find_vma() will
search the tree after walking to a specific value. It is more efficient
to only walk to the requested value since privcmd_ioctl_mmap() will exit
the loop if vm_start != msg->va.
mmap: change zeroing of maple tree in __vma_adjust()
Only write to the maple tree if we are not inserting or the insert isn't
going to overwrite the area to clear. This avoids spanning writes and
node coealescing when unnecessary.
The change requires a custom search for the linked list addition to find
the correct VMA for the prev link.
damon: Convert __damon_va_three_regions to use the VMA iterator
This rather specialised walk can use the VMA iterator. If this proves to
be too slow, we can write a custom routine to find the two largest gaps,
but it will be somewhat complicated, so let's see if we need it first.
Update the kunit test case to use the maple tree. This also fixes an
issue with the kunit testcase not adding the last VMA to the list.
Link: https://lkml.kernel.org/r/20220404143501.2016403-18-Liam.Howlett@oracle.com Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel/fork: use maple tree for dup_mmap() during forking
The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator and switch to the
maple tree advanced API to avoid multiple walks of the tree during insert
operations. Unexport the now-unused vma_store() function.
For performance reasons we bulk allocate the maple tree nodes. The node
calculations are done internally to the tree and use the VMA count and
assume the worst-case node requirements. The VM_DONT_COPY flag does not
allow for the most efficient copy method of the tree and so a bulk loading
algorithm is used.
mm/mmap: use maple tree for unmapped_area{_topdown}
The maple tree code was added to find the unmapped area in a previous
commit and was checked against what the rbtree returned, but the actual
result was never used. Start using the maple tree implementation and
remove the rbtree code.
Add kernel documentation comment for these functions.
mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
Use the maple tree's advanced API and a maple state to walk the tree for
the entry at the address of the next vma, then use the maple state to walk
back one entry to find the previous entry.
This thin layer of abstraction over the maple tree state is for iterating
over VMAs. You can go forwards, go backwards or ask where the iterator
is. Rename the existing vma_next() to __vma_next() -- it will be removed
by the end of this series.
Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree. Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.
The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in kernel/fork
for process forking, and used to find the unmapped_area and checked
against what the rbtree finds.
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. The first user that is covered in this patch
set is the vm_area_struct, where three data structures are replaced by the
maple tree: the augmented rbtree, the vma cache, and the linked list of
VMAs in the mm_struct. The long term goal is to reduce or remove the
mmap_sem contention.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
Link: https://lkml.kernel.org/r/20220404143501.2016403-8-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
radix tree test suite: add allocation counts and size to kmem_cache
Add functions to get the number of allocations, and total allocations from
a kmem_cache. Also add a function to get the allocated size and a way to
zero the total allocations.
radix tree test suite: add kmem_cache_set_non_kernel()
kmem_cache_set_non_kernel() is a mechanism to allow a certain number of
kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in
the flags. This functionality allows for testing different paths though
the code.
Revert "mm/memory-failure.c: fix race with changing page compound again"
Reverts commit 888af2701db7 ("mm/memory-failure.c: fix race with changing
page compound again") because now we fetch the page refcount under
hugetlb_lock in try_memory_failure_hugetlb() so that the race check is no
longer necessary.
Link: https://lkml.kernel.org/r/20220408135323.1559401-4-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hwpoison: put page in already hwpoisoned case with MF_COUNT_INCREASED
In already hwpoisoned case, memory_failure() is supposed to return with
releasing the page refcount taken for error handling. But currently the
refcount is not released when called with MF_COUNT_INCREASED, which makes
page refcount inconsistent. This should be rare and non-critical, but it
might be inconvenient in testing (unpoison doesn't work).
Link: https://lkml.kernel.org/r/20220408135323.1559401-3-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Thu, 14 Apr 2022 06:07:10 +0000 (23:07 -0700)]
mm: wrap __find_buddy_pfn() with a necessary buddy page validation
Whenever the buddy of a page is found from __find_buddy_pfn(),
page_is_buddy() should be used to check its validity. Add a helper
function find_buddy_page_pfn() to find the buddy page and do the check
together.
__GFP_ATOMIC serves little purpose. Its main effect is to set
ALLOC_HARDER which adds a few little boosts to increase the chance of an
allocation succeeding, one of which is to lower the water-mark at which it
will succeed.
It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark. It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.
__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
There is little point to this. We already get a might_sleep() warning if
__GFP_DIRECT_RECLAIM is set.
__GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is
probable that testing ALLOC_HARDER is a better fit here.
__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep. This should test __GFP_DIRECT_RECLAIM instead.
This patch:
- removes __GFP_ATOMIC
- causes __GFP_HIGH to set ALLOC_HARDER unless __GFP_NOMEMALLOC is set
(as well as ALLOC_HIGH).
- makes other adjustments as suggested by the above.
The net result is not change to GFP_ATOMIC allocations. Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges. This affects:
xen, dm, md, ntfs3
the vermillion frame buffer
hibernation
ksm
swap
all of which likely produce more benefit than cost if these selected
allocation are more likely to succeed quickly.
Link: https://lkml.kernel.org/r/163712397076.13692.4727608274002939094@noble.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Thu, 14 Apr 2022 06:07:09 +0000 (23:07 -0700)]
mm/page_alloc: adding same penalty is enough to get round-robin order
To make node order in round-robin in the same distance group, we add a
penalty to the first node we got in each round.
To get a round-robin order in the same distance group, we don't need to
decrease the penalty since:
* find_next_best_node() always iterates node in the same order
* distance matters more then penalty in find_next_best_node()
* in nodes with the same distance, the first one would be picked up
So it is fine to increase same penalty when we get the first node in the
same distance group. Since we just increase a constance of 1 to node
penalty, it is not necessary to multiply MAX_NODE_LOAD for preference.
Link: https://lkml.kernel.org/r/20220123013537.20491-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yixuan Cao [Thu, 14 Apr 2022 06:07:09 +0000 (23:07 -0700)]
mm/vmalloc: fix a comment
The sentence
"but the mempolcy want to alloc memory by interleaving"
should be rephrased with
"but the mempolicy wants to alloc memory by interleaving"
where "mempolicy" is a struct name.
This work is coauthored by
Yinan Zhang
Jiajian Ye
Shenghong Han
Chongxi Zhao
Yuhong Feng
Yongqiang Liu
Miaohe Lin [Thu, 14 Apr 2022 06:07:08 +0000 (23:07 -0700)]
mm/mremap: avoid unneeded do_munmap call
When old_len == new_len, do_munmap will return -EINVAL due to len == 0.
This errno will be simply ignored because of old_len != new_len check. So
it is unnecessary to call do_munmap when old_len == new_len because
nothing is actually done.
Nadav Amit [Thu, 14 Apr 2022 06:07:08 +0000 (23:07 -0700)]
mm: avoid unnecessary flush on change_huge_pmd()
Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.
The first TLB flush is only necessary to prevent the dirty bit (and with a
lesser importance the access bit) from changing while the PTE is modified.
However, this is not necessary as the x86 CPUs set the dirty-bit
atomically with an additional check that the PTE is (still) present. One
caveat is Intel's Knights Landing that has a bug and does not do so.
Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch specific pmdp_invalidate_ad()
that only invalidates the access and dirty bit from further changes.
Link: https://lkml.kernel.org/r/20220401180821.1986781-4-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nadav Amit [Thu, 14 Apr 2022 06:07:08 +0000 (23:07 -0700)]
mm/mprotect: do not flush when not required architecturally
Currently, using mprotect() to unprotect a memory region or uffd to
unprotect a memory region causes a TLB flush. However, in such cases the
PTE is often not modified (i.e., remain RO) and therefore not TLB flush is
needed.
Add an arch-specific pte_needs_flush() which tells whether a TLB flush is
needed based on the old PTE and the new one. Implement an x86
pte_needs_flush().
Always flush the TLB when it is architecturally needed even when skipping
a TLB flush might only result in a spurious page-faults by skipping the
flush.
Even with such conservative manner, we can in the future further refine
the checks to test whether a PTE is present by only considering the
architectural _PAGE_PRESENT flag instead of {pte|pmd}_preesnt(). For not
be careful and use the latter.
Link: https://lkml.kernel.org/r/20220401180821.1986781-3-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nadav Amit [Thu, 14 Apr 2022 06:07:08 +0000 (23:07 -0700)]
mm/mprotect: use mmu_gather
Patch series "mm/mprotect: avoid unnecessary TLB flushes", v6.
This patchset is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar and
further optimizations for MADV_COLD and userfaultfd would be possible.
Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and do
better/fewer flushes. This would also be handy for later userfaultfd
enhancements.
2. Avoid unnecessary TLB flushes. This optimization is the one that
provides most of the performance benefits. Unlike previous versions,
we now only avoid flushes that would not result in spurious
page-faults.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I
therefore ran a microbenchmark: a loop that does the following on
anonymous memory, just as a sanity check to see that time is saved by
avoiding TLB flushes. The loop goes:
mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable
The test was run in KVM guest with 1 or 2 threads (the second thread was
busy-looping). I measured the time (cycles) of each operation:
The exact numbers are really meaningless, but the benefit is clear. There
are 2 interesting results though.
(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved
(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.
This patch (of 3):
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates the
code, as developers need to be aware of different invalidation schemes,
and prevents opportunities to avoid TLB flushes or perform them in finer
granularity.
The use of mmu_gather for modified PTEs has benefits in various scenarios
even if pages are not released. For instance, if only a single page needs
to be flushed out of a range of many pages, only that page would be
flushed. If a THP page is flushed, on x86 a single TLB invlpg instruction
can be used instead of 512 instructions (or a full TLB flush, which would
Linux would actually use by default). mprotect() over multiple VMAs
requires a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released, only
record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
Link: https://lkml.kernel.org/r/20220401180821.1986781-1-namit@vmware.com Link: https://lkml.kernel.org/r/20220401180821.1986781-2-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Thu, 14 Apr 2022 06:07:07 +0000 (23:07 -0700)]
x86/mm: enable ARCH_HAS_VM_GET_PAGE_PROT
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT. This also unsubscribes from config
ARCH_HAS_FILTER_PGPROT, after dropping off arch_filter_pgprot() and
arch_vm_get_page_prot().
Link: https://lkml.kernel.org/r/20220407103251.1209606-6-anshuman.khandual@arm.com Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT. It localizes
arch_vm_get_page_prot() as sparc_vm_get_page_prot() and moves near
vm_get_page_prot().
Link: https://lkml.kernel.org/r/20220407103251.1209606-5-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT. It localizes
arch_vm_get_page_prot() and moves it near vm_get_page_prot().
Link: https://lkml.kernel.org/r/20220407103251.1209606-4-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT. While here, this also localizes
arch_vm_get_page_prot() as powerpc_vm_get_page_prot() and moves it near
vm_get_page_prot().
Link: https://lkml.kernel.org/r/20220407103251.1209606-3-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/mmap: Drop arch_vm_get_page_prot() and arch_filter_pgprot()", v4.
protection_map[] is an array based construct that translates given
vm_flags combination. This array contains page protection map, which is
populated by the platform via [__S000 .. __S111] and [__P000 .. __P111]
exported macros. Primary usage for protection_map[] is for
vm_get_page_prot(), which is used to determine page protection value for a
given vm_flags. vm_get_page_prot() implementation, could again call
platform overrides arch_vm_get_page_prot() and arch_filter_pgprot(). Some
platforms override protection_map[] that was originally built with
__SXXX/__PXXX with different runtime values.
Currently there are multiple layers of abstraction i.e __SXXX/__PXXX
macros, protection_map[], arch_vm_get_page_prot() and arch_filter_pgprot()
built between the platform and generic MM, finally defining
vm_get_page_prot().
Hence this series proposes to drop later two abstraction levels and
instead just move the responsibility of defining vm_get_page_prot() to the
platform (still utilizing generic protection_map[] array) itself making it
clean and simple.
This first introduces ARCH_HAS_VM_GET_PAGE_PROT which enables the
platforms to define custom vm_get_page_prot(). This starts converting
platforms that define the overrides arch_filter_pgprot() or
arch_vm_get_page_prot() which enables for those constructs to be dropped
off completely.
The series has been inspired from an earlier discuss with Christoph Hellwig
Add a new config ARCH_HAS_VM_GET_PAGE_PROT, which when subscribed enables
a given platform to define its own vm_get_page_prot() but still utilizing
the generic protection_map[] array.
Link: https://lkml.kernel.org/r/20220407103251.1209606-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20220407103251.1209606-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
protection_map[] maps vm_flags access combinations into page protection
value as defined by the platform via __PXXX and __SXXX macros. The array
indices in protection_map[], represents vm_flags access combinations but
it's not very intuitive to derive. This makes it clear and explicit.
Although protection_map[] contains the platform defined page protection
map for a given vm_flags combination, vm_get_page_prot() is the right
interface to use. This will also reduce dependency on protection_map[]
which is going to be dropped off completely later on.
Jianxing Wang [Thu, 14 Apr 2022 06:07:05 +0000 (23:07 -0700)]
mm/mmu_gather: limit free batch count and add schedule point in tlb_batch_pages_flush
free a large list of pages maybe cause rcu_sched starved on
non-preemptible kernels. howerver free_unref_page_list maybe can't
cond_resched as it maybe called in interrupt or atomic context, especially
can't detect atomic context in CONFIG_PREEMPTION=n.
The issue is detected in guest with kvm cpu 200% overcommit, however I
didn't see the warning in the host with the same application. I'm sure
that the patch is needed for guest kernel, but no sure for host.
To reproduce, set up two virtual machines in one host machine, per vm has
the same number cpu and half memory of host. the run ltpstress.sh in per
vm, then will see rcu stall warning.kernel is preempt disabled, append
kernel command 'preempt=none' if enable dynamic preempt . It could
detected in loongson machine(32 core, 128G mem) and ProLiant DL380
Gen9(x86 E5-2680, 28 core, 64G mem)
tlb flush batch count depends on PAGE_SIZE, it's too large if PAGE_SIZE >
4K, here limit free batch count with 512. And add schedule point in
tlb_batch_pages_flush.
Link: https://lkml.kernel.org/r/20220317072857.2635262-1-wangjianxing@loongson.cn Signed-off-by: Jianxing Wang <wangjianxing@loongson.cn> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Thu, 14 Apr 2022 06:07:05 +0000 (23:07 -0700)]
mm/memcg: non-hierarchical mode is deprecated
After commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical
mode"), we won't have a NULL parent except root_mem_cgroup. And this case
is handled when (memcg == root).
Link: https://lkml.kernel.org/r/20220403020833.26164-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Thu, 14 Apr 2022 06:07:05 +0000 (23:07 -0700)]
mm/memcg: move generation assignment and comparison together
For each round-trip, we assign generation on first invocation and compare
it on subsequent invocations.
Let's move them together to make it more self-explaining. Also this
reduce a check on prev.
[hannes@cmpxchg.org: better comment to explain reclaim model] Link: https://lkml.kernel.org/r/20220330234719.18340-4-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Thu, 14 Apr 2022 06:07:04 +0000 (23:07 -0700)]
mm/memcg: mz already removed from rb_tree if not NULL
When mz is not NULL, it means mz can either come from
mem_cgroup_largest_soft_limit_node or
__mem_cgroup_largest_soft_limit_node. And both of them have removed this
node by __mem_cgroup_remove_exceeded().
Not necessary to call __mem_cgroup_remove_exceeded() again.