Replace the in-function call to BUG_ON() with calls before the function
call MT_WARN_ON() to get more information on error conditions and to
avoid crashing - although the crash is imminent in this case.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Remove BUG_ON() from mas_topiary_range()
In the even of trying to remove data from a leaf node by use of
mas_topiary_range(), simply log the issue and return. This will be
quickly followed by a crash but logging and returning will increase the
probability of meaningful data.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Use MT_WARN_ON() instead of MT_BUG_ON() as to avoid crashing the kernel.
In the unlikely even of a tree height of > 31, try to increase the
probability of the error being logged.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Avoid BUG_ON() when setting a leaf node as a parent
Don't crash the kernel in the very unlikely case that the node passed as
the parent is actually a leaf. Instead, use MT_WARN_ON() to increase
the probability of the debug information being recorded.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 31 Aug 2022 19:32:48 +0000 (15:32 -0400)]
radix tree test suite: Fix damage by separation of gfp and gfp_types
When the gfp header was split into two for a much smaller header, it
broke the radix test suite. Fix the test suite by splitting the header
in the test suite as well.
Fixes: cb5a065b4ea9 (headers/deps: mm: Split <linux/gfp_types.h> out of <linux/gfp.h>) Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov <yury.norov@gmail.com>
mm: Skip retry when new limit is not below old one in page_counter_set_max
In page_counter_set_max, we want to make sure the new limit is not below
the concurrently-changing counter value. We read the counter and check
that the limit is not below the counter before the swap. After the swap,
we read the counter again and retry in case the counter is incremented as
this may violate the requirement. Even though the page_counter_try_charge
can see the old limit, it is guaranteed that the counter is not above the
old limit after the increment. So in case the new limit is not below the
old limit, the counter is guaranteed to be not above the new limit too.
We can skip the retry in this case to optimize a little bit.
mm: pagewalk: move variables to more local scope, tweak loops
Move some variables to more local scopes to make it obvious that they
don't carry state. Put the end additions into the for loop instructions
to make them easier to read.
mm: pagewalk: allow walk_page_range_novma() without mm
Since e47690d756a7 ("x86: mm: avoid allocating struct mm_struct on the
stack") a pgd can be passed to walk_page_range_novma(). In case it is set
no place in the pagewalk code use the walk.mm anymore, so permit to pass a
NULL mm instead. It is up to the caller to ensure proper locking on the
pgd in this case.
Patch series "Minor improvements for pagewalk code".
For some project I had to use the pagewalk API for certain things and
during this have read through the code multiple times. Our usage has
changed several times depending on our current state of research as well.
During all of this I have made some tweaks to the code to be able to follow it
better when hunting my own problems, and not call into some things that I
actually don't need. The patches are more or less independent of each other.
This patch (of 6):
The err variable only needs to be checked when it was assigned directly
before, it is not carried on to any later checks. Move the checks into
the same "if" conditions where they are assigned. Also just return the
error at the relevant places. While at it move these err variables to a
more local scope at some places.
ERROR: space required before the open parenthesis '('
#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+ while(getline(&line, &len, fp) != -1) {
ERROR: space required after that ',' (ctx:VxV)
#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+ char *first = strtok(line,"- ");
^
ERROR: space required after that ',' (ctx:VxV)
#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+ char *second = strtok(NULL,"- ");
^
WARNING: Missing a blank line after declarations
#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+ void *second_val = (void *) strtol(second, NULL, 16);
+ if (first_val == start && second_val == start + 3 * page_size) {
total: 3 errors, 3 warnings, 113 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Matěna [Fri, 3 Jun 2022 14:57:19 +0000 (16:57 +0200)]
mm: add merging after mremap resize
When mremap call results in expansion, it might be possible to merge the
VMA with the next VMA which might become adjacent. This patch adds
vma_merge call after the expansion is done to try and merge.
Link: https://lkml.kernel.org/r/20220603145719.1012094-3-matenajakub@gmail.com Signed-off-by: Jakub Matěna <matenajakub@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Matěna [Fri, 3 Jun 2022 14:57:18 +0000 (16:57 +0200)]
mm: refactor of vma_merge()
Patch series "Refactor of vma_merge and new merge call", v4.
I am currently working on my master's thesis trying to increase number of
merges of VMAs currently failing because of page offset incompatibility
and difference in their anon_vmas. The following refactor and added merge
call included in this series is just two smaller upgrades I created along
the way.
This patch (of 2):
Refactor vma_merge() to make it shorter and more understandable. Main
change is the elimination of code duplicity in the case of merge next
check. This is done by first doing checks and caching the results before
executing the merge itself. The variable 'area' is divided into 'mid' and
'res' as previously it was used for two purposes, as the middle VMA
between prev and next and also as the result of the merge itself. Exit
paths are also unified.
Link: https://lkml.kernel.org/r/20220603145719.1012094-1-matenajakub@gmail.com Link: https://lkml.kernel.org/r/20220603145719.1012094-2-matenajakub@gmail.com Signed-off-by: Jakub Matěna <matenajakub@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Rik van Riel <riel@surriel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Suren Baghdasaryan [Tue, 31 May 2022 22:30:59 +0000 (15:30 -0700)]
mm: drop oom code from exit_mmap
The primary reason to invoke the oom reaper from the exit_mmap path used
to be a prevention of an excessive oom killing if the oom victim exit
races with the oom reaper (see [1] for more details). The invocation has
moved around since then because of the interaction with the munlock logic
but the underlying reason has remained the same (see [2]).
Munlock code is no longer a problem since [3] and there shouldn't be any
blocking operation before the memory is unmapped by exit_mmap so the oom
reaper invocation can be dropped. The unmapping part can be done with the
non-exclusive mmap_sem and the exclusive one is only required when page
tables are freed.
Remove the oom_reaper from exit_mmap which will make the code easier to
read. This is really unlikely to make any observable difference although
some microbenchmarks could benefit from one less branch that needs to be
evaluated even though it almost never is true.
[1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:06:34 +0000 (15:06 +0000)]
mm/mmap.c: pass in mapping to __vma_link_file()
__vma_link_file() resolves the mapping from the file, if there is one.
Pass through the mapping and check the vm_file externally since most
places already have the required information and check of vm_file.
Link: https://lkml.kernel.org/r/20220822150128.1562046-71-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:06:33 +0000 (15:06 +0000)]
mm: remove the vma linked list
Replace any vm_next use with vma_find().
Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
maple tree.
Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap(). At
the same time, alter the loop to be more compact.
Now that free_pgtables() and unmap_vmas() take a maple tree as an
argument, rearrange do_mas_align_munmap() to use the new tree to hold the
vmas to remove.
Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
used to update the linked list.
Drop linked list update from __insert_vm_struct().
Rework validation of tree as it was depending on the linked list.
Link: https://lkml.kernel.org/r/20220822150128.1562046-69-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:06:30 +0000 (15:06 +0000)]
mm/mempolicy: use vma iterator & maple state instead of vma linked list
Reworked the way mbind_range() finds the first VMA to reuse the maple
state and limit the number of tree walks needed.
Note, this drops the VM_BUG_ON(!vma) call, which would catch a start
address higher than the last VMA. The code was written in a way that
allowed no VMA updates to occur and still return success. There should be
no functional change to this scenario with the new code.
Link: https://lkml.kernel.org/r/20220822150128.1562046-57-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:27 +0000 (15:06 +0000)]
sched: use maple tree iterator to walk VMAs
The linked list is slower than walking the VMAs using the maple tree. We
can't use the VMA iterator here because it doesn't support moving to an
earlier position.
Link: https://lkml.kernel.org/r/20220822150128.1562046-49-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:06:26 +0000 (15:06 +0000)]
ipc/shm: use VMA iterator instead of linked list
The VMA iterator is faster than the linked llist, and it can be walked
even when VMAs are being removed from the address space, so there's no
need to keep track of 'next'.
Link: https://lkml.kernel.org/r/20220822150128.1562046-46-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:25 +0000 (15:06 +0000)]
coredump: remove vma linked list walk
Use the Maple Tree iterator instead. This is too complicated for the VMA
iterator to handle, so let's open-code it for now. If this turns out to
be a common pattern, we can migrate it to common code.
Link: https://lkml.kernel.org/r/20220822150128.1562046-41-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:24 +0000 (15:06 +0000)]
cxl: remove vma linked list walk
Use the VMA iterator instead. This requires a little restructuring of the
surrounding code to hoist the mm to the caller. That turns
cxl_prefault_one() into a trivial function, so call cxl_fault_segment()
directly.
Link: https://lkml.kernel.org/r/20220822150128.1562046-38-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:19 +0000 (15:06 +0000)]
xtensa: remove vma linked list walks
Use the VMA iterator instead. Since VMA can no longer be NULL in the
loop, then deal with out-of-memory outside the loop. This means a
slightly longer run time in the failure case (-ENOMEM) - it will run to
the end of the VMAs before erroring instead of in the middle of the loop.
Link: https://lkml.kernel.org/r/20220822150128.1562046-37-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:05:19 +0000 (15:05 +0000)]
mm/mmap: change do_brk_munmap() to use do_mas_align_munmap()
do_brk_munmap() has already aligned the address and has a maple tree state
to be used. Use the new do_mas_align_munmap() to avoid unnecessary
alignment and error checks.
Link: https://lkml.kernel.org/r/20220822150128.1562046-30-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:05:17 +0000 (15:05 +0000)]
mm/mmap: reorganize munmap to use maple states
Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
do_mas_align_munmap().
do_munmap() is a wrapper to create a maple state for any callers that have
not been converted to the maple tree.
do_mas_munmap() takes a maple state to mumap a range. This is just a
small function which checks for error conditions and aligns the end of the
range.
do_mas_align_munmap() uses the aligned range to mumap a range.
do_mas_align_munmap() starts with the first VMA in the range, then finds
the last VMA in the range. Both start and end are split if necessary.
Then the VMAs are removed from the linked list and the mm mlock count is
updated at the same time. Followed by a single tree operation of
overwriting the area in with a NULL. Finally, the detached list is
unmapped and freed.
By reorganizing the munmap calls as outlined, it is now possible to avoid
extra work of aligning pre-aligned callers which are known to be safe,
avoid extra VMA lookups or tree walks for modifications.
detach_vmas_to_be_unmapped() is no longer used, so drop this code.
vm_brk_flags() can just call the do_mas_munmap() as it checks for
intersecting VMAs directly.
Link: https://lkml.kernel.org/r/20220822150128.1562046-29-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:05:11 +0000 (15:05 +0000)]
mm: remove vmacache
By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code. Remove the vmacache
to reduce the work in keeping it up to date and code complexity.
Link: https://lkml.kernel.org/r/20220822150128.1562046-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:05:09 +0000 (15:05 +0000)]
mm/mmap: use advanced maple tree API for mmap_region()
Changing mmap_region() to use the maple tree state and the advanced maple
tree interface allows for a lot less tree walking.
This change removes the last caller of munmap_vma_range(), so drop this
unused function.
Add vma_expand() to expand a VMA if possible by doing the necessary
hugepage check, uprobe_munmap of files, dcache flush, modifications then
undoing the detaches, etc.
Link: https://lkml.kernel.org/r/20220822150128.1562046-25-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:05:02 +0000 (15:05 +0000)]
mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()
Avoid allocating a new VMA when it a vma modification can occur. When a
brk() can expand or contract a VMA, then the single store operation will
only modify one index of the maple tree instead of causing a node to split
or coalesce. This avoids unnecessary allocations/frees of maple tree
nodes and VMAs.
Move some limit & flag verifications out of the do_brk_flags() function to
use only relevant checks in the code path of bkr() and vm_brk_flags().
Set the vma to check if it can expand in vm_brk_flags() if extra criteria
are met.
Drop userfaultfd from do_brk_flags() path and only use it in
vm_brk_flags() path since that is the only place a munmap will happen.
Add a wraper for munmap for the brk case called do_brk_munmap().
Link: https://lkml.kernel.org/r/20220822150128.1562046-23-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:04:35 +0000 (15:04 +0000)]
mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup()
vma_lookup() will walk the vma tree once and not continue to look for the
next vma. Since the exact vma is checked below, this is a more optimal
way of searching.
Link: https://lkml.kernel.org/r/20220822150128.1562046-22-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:04:32 +0000 (15:04 +0000)]
mm: optimize find_exact_vma() to use vma_lookup()
Use vma_lookup() to walk the tree to the start value requested. If the
vma at the start does not match, then the answer is NULL and there is no
need to look at the next vma the way that find_vma() would.
Link: https://lkml.kernel.org/r/20220822150128.1562046-21-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:04:30 +0000 (15:04 +0000)]
xen: use vma_lookup() in privcmd_ioctl_mmap()
vma_lookup() walks the VMA tree for a specific value, find_vma() will
search the tree after walking to a specific value. It is more efficient
to only walk to the requested value since privcmd_ioctl_mmap() will exit
the loop if vm_start != msg->va.
Link: https://lkml.kernel.org/r/20220822150128.1562046-20-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:04:26 +0000 (15:04 +0000)]
mmap: change zeroing of maple tree in __vma_adjust()
Only write to the maple tree if we are not inserting or the insert isn't
going to overwrite the area to clear. This avoids spanning writes and
node coealescing when unnecessary.
The change requires a custom search for the linked list addition to find
the correct VMA for the prev link.
Link: https://lkml.kernel.org/r/20220822150128.1562046-19-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:03:16 +0000 (15:03 +0000)]
damon: convert __damon_va_three_regions to use the VMA iterator
This rather specialised walk can use the VMA iterator. If this proves to
be too slow, we can write a custom routine to find the two largest gaps,
but it will be somewhat complicated, so let's see if we need it first.
Update the kunit test case to use the maple tree. This also fixes an
issue with the kunit testcase not adding the last VMA to the list.
Link: https://lkml.kernel.org/r/20220822150128.1562046-16-Liam.Howlett@oracle.com Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests) Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Howells <dhowells@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:03:11 +0000 (15:03 +0000)]
kernel/fork: use maple tree for dup_mmap() during forking
The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator and switch to the
maple tree advanced API to avoid multiple walks of the tree during insert
operations. Unexport the now-unused vma_store() function.
For performance reasons we bulk allocate the maple tree nodes. The node
calculations are done internally to the tree and use the VMA count and
assume the worst-case node requirements. The VM_DONT_COPY flag does not
allow for the most efficient copy method of the tree and so a bulk loading
algorithm is used.
Link: https://lkml.kernel.org/r/20220822150128.1562046-15-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:03:08 +0000 (15:03 +0000)]
mm/mmap: use maple tree for unmapped_area{_topdown}
The maple tree code was added to find the unmapped area in a previous
commit and was checked against what the rbtree returned, but the actual
result was never used. Start using the maple tree implementation and
remove the rbtree code.
Add kernel documentation comment for these functions.
Link: https://lkml.kernel.org/r/20220822150128.1562046-14-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:03:05 +0000 (15:03 +0000)]
mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
Use the maple tree's advanced API and a maple state to walk the tree for
the entry at the address of the next vma, then use the maple state to walk
back one entry to find the previous entry.
Add kernel documentation comments for this API.
Link: https://lkml.kernel.org/r/20220822150128.1562046-13-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:02:54 +0000 (15:02 +0000)]
mm: add VMA iterator
This thin layer of abstraction over the maple tree state is for iterating
over VMAs. You can go forwards, go backwards or ask where the iterator
is. Rename the existing vma_next() to __vma_next() -- it will be removed
by the end of this series.
Link: https://lkml.kernel.org/r/20220822150128.1562046-10-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:02:47 +0000 (15:02 +0000)]
mm: start tracking VMAs with maple tree
Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree. Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.
The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in kernel/fork
for process forking, and used to find the unmapped_area and checked
against what the rbtree finds.
This also moves the mmap_lock() in exit_mmap() since the oom reaper call
does walk the VMAs. Otherwise lockdep will be unhappy if oom happens.
When splitting a vma fails due to allocations of the maple tree nodes, the
error path in __split_vma() calls new->vm_ops->close(new). The page
accounting for hugetlb is actually in the close() operation, so it
accounts for the removal of 1/2 of the VMA which was not adjusted. This
results in a negative exit value. To avoid the negative charge, set
vm_start = vm_end and vm_pgoff = 0.
There is also a potential accounting issue in special mappings from
insert_vm_struct() failing to allocate, so reverse the charge there in the
failure scenario.
Link: https://lkml.kernel.org/r/20220822150128.1562046-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:02:25 +0000 (15:02 +0000)]
radix tree test suite: add allocation counts and size to kmem_cache
Add functions to get the number of allocations, and total allocations from
a kmem_cache. Also add a function to get the allocated size and a way to
zero the total allocations.
Link: https://lkml.kernel.org/r/20220822150128.1562046-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:02:21 +0000 (15:02 +0000)]
radix tree test suite: add kmem_cache_set_non_kernel()
kmem_cache_set_non_kernel() is a mechanism to allow a certain number of
kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in
the flags. This functionality allows for testing different paths though
the code.
Link: https://lkml.kernel.org/r/20220822150128.1562046-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:02:08 +0000 (15:02 +0000)]
Maple Tree: add new data structure
Patch series "Introducing the Maple Tree"
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Davidlor said
: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage. Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit. This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead. Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.
A similar work has been discovered in the academic press
Sheer coincidence. We designed our tree with the intention of solving the
hardest problem first. Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article. So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.
This patch (of 70):
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Link: https://lkml.kernel.org/r/20220822150128.1562046-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220822150128.1562046-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Tue, 16 Aug 2022 18:58:01 +0000 (11:58 -0700)]
mm: memcg: export workingset refault stats for cgroup v1
Workingset refault stats are important and useful metrics to measure how
well reclaimer and swapping work and how healthy the services are, but
they are just available for cgroup v2. There are still plenty users with
cgroup v1, export the stats for cgroup v1.
Link: https://lkml.kernel.org/r/20220816185801.651091-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Thu, 18 Aug 2022 07:37:44 +0000 (15:37 +0800)]
mm/damon: replace pmd_huge() with pmd_trans_huge() for THP
pmd_huge() is usually used to indicate a pmd level hugetlb. However a pmd
mapped huge page can only be THP in damon_mkold_pmd_entry() or
damon_young_pmd_entry(), so replace pmd_huge() with pmd_trans_huge() in
this case to make the code more readable according to the discussion [1].