Thomas Gleixner [Thu, 25 May 2023 12:57:05 +0000 (14:57 +0200)]
mm/vmalloc: prevent flushing dirty space over and over
vmap blocks which have active mappings cannot be purged. Allocations
which have been freed are accounted for in vmap_block::dirty_min/max, so
that they can be detected in _vm_unmap_aliases() as potentially stale
TLBs.
If there are several invocations of _vm_unmap_aliases() then each of them
will flush the dirty range. That's pointless and just increases the
probability of full TLB flushes.
Avoid that by resetting the flush range after accounting for it. That's
safe versus other invocations of _vm_unmap_aliases() because this is all
serialized with vmap_purge_lock.
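A minimal sketch of the change, using the vmap_block field names from mm/vmalloc.c (illustrative, not the verbatim patch):

  /*
   * While accounting a block's dirty space for the pending flush,
   * reset the range so a later _vm_unmap_aliases() does not flush
   * the same space again.  Serialized by vmap_purge_lock.
   */
  if (vb->dirty_max) {    /* block has unflushed, freed space */
      unsigned long s = vb->va->va_start + (vb->dirty_min << PAGE_SHIFT);
      unsigned long e = vb->va->va_start + (vb->dirty_max << PAGE_SHIFT);

      start = min(s, start);
      end = max(e, end);

      /* Mark the range clean again. */
      vb->dirty_min = VMAP_BBMAP_BITS;
      vb->dirty_max = 0;
  }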
Link: https://lkml.kernel.org/r/20230525124504.692056496@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thomas Gleixner [Thu, 25 May 2023 12:57:03 +0000 (14:57 +0200)]
mm/vmalloc: prevent stale TLBs in fully utilized blocks
Patch series "mm/vmalloc: Assorted fixes and improvements", v2.
This series addresses the following issues:
1) Prevent the stale TLB problem related to fully utilized vmap blocks
2) Avoid the double per CPU list walk in _vm_unmap_aliases()
3) Avoid flushing dirty space over and over
4) Add a lockless quickcheck in vb_alloc() and add missing
READ/WRITE_ONCE() annotations
5) Prevent overeager purging of usable vmap_blocks if
not under memory/address space pressure.
This patch (of 6):
_vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
page are left in the system. This is required due to the lazy TLB flush
mechanism in vmalloc.
This is attempted by walking the per CPU free lists, but those do not
contain fully utilized vmap blocks, because a block is removed from the
free list once its free space becomes zero. And when the block is not
fully unmapped, it is not on the purge list either.
So neither the per CPU list iteration nor the purge list walk find the
block and if the page was mapped via such a block and the TLB has not yet
been flushed, the guarantee of _vm_unmap_aliases() that there are no stale
TLBs after returning is broken:
x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
vb_free(x) // Unmaps page and marks in dirty_min/max range
// Block has still mappings and is not put on purge list
// Page is reused
vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
So instead of walking the per CPU free lists, walk the per CPU xarrays
which hold pointers to _all_ active blocks in the system including those
removed from the free lists.
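A hedged sketch of the resulting walk (per-CPU structure and field names as this series shapes them in mm/vmalloc.c):

  int cpu;

  for_each_possible_cpu(cpu) {
      struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
      struct vmap_block *vb;
      unsigned long idx;

      rcu_read_lock();
      /* Unlike vbq->free, the xarray also holds fully utilized blocks. */
      xa_for_each(&vbq->vmap_blocks, idx, vb) {
          /* ... account vb->dirty_min/max for flushing ... */
      }
      rcu_read_unlock();
  }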
Jim Cromie [Thu, 25 May 2023 17:43:56 +0000 (11:43 -0600)]
kmemleak-test: drop __init to get better backtrace
Drop the __init on kmemleak_test_init(). With __init, the function's
storage is reclaimed after boot, but then the symbol isn't available for
"%pS" rendering, and the backtrace shows a bare pointer where the actual
leak happened.
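The change itself is a one-line annotation; shown here as before/after declarations to illustrate the trade-off (keeping the test code resident buys symbolized backtraces):

  /* Before: the init text is freed after boot, so a later kmemleak
   * report can only show a raw address for this function. */
  static int __init kmemleak_test_init(void);

  /* After: the symbol stays resident and "%pS" resolves it by name. */
  static int kmemleak_test_init(void);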
T.J. Alumbaugh [Mon, 22 May 2023 11:20:57 +0000 (11:20 +0000)]
mm: multi-gen LRU: add helpers in page table walks
Add helpers to page table walking code:
- Clarifies intent via name "should_walk_mmu" and "should_clear_pmd_young"
- Avoids repeating same logic in two places
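Roughly, the helpers bundle the capability checks that were previously open-coded in both walk sites (a sketch; the exact conditions live in mm/vmscan.c):

  static bool should_walk_mmu(void)
  {
      return arch_has_hw_pte_young() && get_cap(LRU_GEN_MM_WALK);
  }

  static bool should_clear_pmd_young(void)
  {
      return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
  }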
Link: https://lkml.kernel.org/r/20230522112058.2965866-3-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Reviewed-by: Yuanchu Xie <yuanchu@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haifeng Xu [Mon, 22 May 2023 09:52:33 +0000 (09:52 +0000)]
selftests: cgroup: fix unexpected failure on test_memcg_low
Since commit f079a020ba95 ("selftests: memcg: factor out common parts of
memory.{low,min} tests"), the value used in the second alloc_anon has
changed from 148M to 170M. Because memory.low allows reclaiming page
cache in child cgroups, memory.current ends up close to 30M instead of
50M. Therefore, adjust the expected value of the parent cgroup.
Link: https://lkml.kernel.org/r/20230522095233.4246-2-haifeng.xu@shopee.com Fixes: f079a020ba95 ("selftests: memcg: factor out common parts of memory.{low,min} tests") Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Mon, 22 May 2023 20:52:10 +0000 (13:52 -0700)]
mm/mlock: rename mlock_future_check() to mlock_future_ok()
It is felt that the name mlock_future_check() is vague - it doesn't
particularly convey the function's operation. mlock_future_ok() is a
clearer name for a predicate function.
Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 22 May 2023 08:24:12 +0000 (09:24 +0100)]
mm/mmap: refactor mlock_future_check()
In all but one instance, mlock_future_check() is treated as a boolean
function despite returning an error code. In one instance, this error
code is ignored and replaced with -ENOMEM.
This is confusing, and the inversion of true -> failure, false -> success
is not warranted. Convert the function to a bool, lightly refactor and
return true if the check passes, false if not.
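After this refactor (and the rename in the entry above), the predicate has roughly this shape (a sketch, not the verbatim diff):

  static bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
                              unsigned long bytes)
  {
      unsigned long locked_pages, limit_pages;

      if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK))
          return true;

      locked_pages = bytes >> PAGE_SHIFT;
      locked_pages += mm->locked_vm;

      limit_pages = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

      return locked_pages <= limit_pages;
  }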
Link: https://lkml.kernel.org/r/20230522082412.56685-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Fri, 19 May 2023 10:27:23 +0000 (12:27 +0200)]
selftests/mm: gup_longterm: add liburing tests
Similar to the COW selftests, also use io_uring fixed buffers to test if
long-term page pinning works as expected.
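The liburing side of such a probe is small. A hedged sketch (registering a fixed buffer long-term pins the backing pages, so a failure indicates the pin was rejected):

  #include <liburing.h>

  /* Returns 0 if @mem of @size bytes could be registered (and thus
   * long-term pinned), or a negative errno from liburing. */
  static int try_register_fixed_buffer(void *mem, size_t size)
  {
      struct iovec iov = { .iov_base = mem, .iov_len = size };
      struct io_uring ring;
      int ret;

      ret = io_uring_queue_init(1, &ring, 0);
      if (ret < 0)
          return ret;

      ret = io_uring_register_buffers(&ring, &iov, 1);
      if (!ret)
          io_uring_unregister_buffers(&ring);
      io_uring_queue_exit(&ring);
      return ret;
  }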
Link: https://lkml.kernel.org/r/20230519102723.185721-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Fri, 19 May 2023 10:27:22 +0000 (12:27 +0200)]
selftests/mm: gup_longterm: new functional test for FOLL_LONGTERM
Let's add a new test for checking whether GUP long-term page pinning works
as expected (R/O vs. R/W, MAP_PRIVATE vs. MAP_SHARED, GUP vs.
GUP-fast). Note that COW handling with long-term R/O pinning in private
mappings, and pinning of anonymous memory in general, is tested by the COW
selftest. This test, therefore, focuses on page pinning in file mappings.
The most interesting case is probably the "local tmpfile" case, as that
will likely end up on a "real" filesystem such as ext4 or xfs, not on a
virtual one like tmpfs or hugetlb where any long-term page pinning is
always expected to succeed.
For now, only add tests that use the "/sys/kernel/debug/gup_test"
interface. We'll add tests based on liburing separately next.
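A hedged sketch of driving that interface; the struct and ioctl names below are assumptions based on what this series adds to mm/gup_test.h, so treat them as illustrative only:

  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include "../../../../mm/gup_test.h"    /* assumed UAPI location */

  /* Try a long-term pin of [addr, addr + size); returns 0 on success. */
  static int try_longterm_pin(void *addr, size_t size, bool rw)
  {
      struct pin_longterm_test args = {
          .addr = (unsigned long)addr,
          .size = size,
          .flags = rw ? PIN_LONGTERM_TEST_FLAG_USE_WRITE : 0,
      };
      int ret, fd = open("/sys/kernel/debug/gup_test", O_RDWR);

      if (fd < 0)
          return -1;
      ret = ioctl(fd, PIN_LONGTERM_TEST_START, &args);
      if (!ret)
          ioctl(fd, PIN_LONGTERM_TEST_STOP);
      close(fd);
      return ret;
  }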
[akpm@linux-foundation.org: update .gitignore for gup_longterm, per Peter] Link: https://lkml.kernel.org/r/20230519102723.185721-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Fri, 19 May 2023 10:27:21 +0000 (12:27 +0200)]
selftests/mm: factor out detection of hugetlb page sizes into vm_util
Patch series "selftests/mm: new test for FOLL_LONGTERM on file mappings".
Let's add some selftests to make sure that:
* R/O long-term pinning always works on file mappings
* R/W long-term pinning always works in MAP_PRIVATE file mappings
* R/W long-term pinning only works in MAP_SHARED mappings with special
filesystems (shmem, hugetlb) and fails with other filesystems (ext4, btrfs,
xfs).
The tests make use of the gup_test kernel module to trigger ordinary GUP
and GUP-fast, and liburing (similar to our COW selftests). Test with
memfd, memfd hugetlb, tmpfile() and mkstemp(). The latter usually gives
us a "real" filesystem (ext4, btrfs, xfs) where long-term pinning is
expected to fail.
Note that these selftests don't contain any actual reproducers for data
corruptions in case R/W long-term pinning on problematic filesystems
"would" work.
Maybe we can later come up with a racy !FOLL_LONGTERM reproducer that can
reuse an existing interface to trigger short-term pinning (I'll look into
that next).
On current mm/mm-unstable:
# ./gup_longterm
# [INFO] detected hugetlb page size: 2048 KiB
# [INFO] detected hugetlb page size: 1048576 KiB
TAP version 13
1..50
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 1 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 2 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 3 Should have failed
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 4 Should have worked
# [RUN] R/W longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 5 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
ok 6 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
ok 7 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
ok 8 Should have failed
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 9 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 10 Should have worked
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd
ok 11 Should have worked
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with tmpfile
ok 12 Should have worked
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with local tmpfile
ok 13 Should have worked
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 14 Should have worked
# [RUN] R/O longterm GUP pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 15 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd
ok 16 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with tmpfile
ok 17 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with local tmpfile
ok 18 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 19 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 20 Should have worked
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
ok 21 Should have worked
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
ok 22 Should have worked
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 23 Should have worked
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 24 Should have worked
# [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 25 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
ok 26 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
ok 27 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 28 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 29 Should have worked
# [RUN] R/W longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 30 Should have worked
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd
ok 31 Should have worked
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with tmpfile
ok 32 Should have worked
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 33 Should have worked
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 34 Should have worked
# [RUN] R/O longterm GUP pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 35 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd
ok 36 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with tmpfile
ok 37 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with local tmpfile
ok 38 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 39 Should have worked
# [RUN] R/O longterm GUP-fast pin in MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 40 Should have worked
# [RUN] io_uring fixed buffer with MAP_SHARED file mapping ... with memfd
ok 41 Should have worked
# [RUN] io_uring fixed buffer with MAP_SHARED file mapping ... with tmpfile
ok 42 Should have worked
# [RUN] io_uring fixed buffer with MAP_SHARED file mapping ... with local tmpfile
ok 43 Should have failed
# [RUN] io_uring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (2048 kB)
ok 44 Should have worked
# [RUN] io_uring fixed buffer with MAP_SHARED file mapping ... with memfd hugetlb (1048576 kB)
ok 45 Should have worked
# [RUN] io_uring fixed buffer with MAP_PRIVATE file mapping ... with memfd
ok 46 Should have worked
# [RUN] io_uring fixed buffer with MAP_PRIVATE file mapping ... with tmpfile
ok 47 Should have worked
# [RUN] io_uring fixed buffer with MAP_PRIVATE file mapping ... with local tmpfile
ok 48 Should have worked
# [RUN] io_uring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (2048 kB)
ok 49 Should have worked
# [RUN] io_uring fixed buffer with MAP_PRIVATE file mapping ... with memfd hugetlb (1048576 kB)
ok 50 Should have worked
# Totals: pass:50 fail:0 xfail:0 xpass:0 skip:0 error:0
This patch (of 3):
Let's factor detection out into vm_util, to be reused by a new test.
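A sketch of what the factored-out helper amounts to: scan /sys/kernel/mm/hugepages/ for "hugepages-<N>kB" directories (function name and signature here are illustrative, not necessarily the vm_util API):

  #include <dirent.h>
  #include <stdio.h>

  static int detect_hugetlb_page_sizes(unsigned long sizes_kb[], int max)
  {
      DIR *dir = opendir("/sys/kernel/mm/hugepages/");
      struct dirent *entry;
      int count = 0;

      if (!dir)
          return 0;
      while (count < max && (entry = readdir(dir))) {
          unsigned long kb;

          if (sscanf(entry->d_name, "hugepages-%lukB", &kb) == 1)
              sizes_kb[count++] = kb;
      }
      closedir(dir);
      return count;
  }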
Johannes Weiner [Fri, 19 May 2023 11:13:59 +0000 (13:13 +0200)]
mm: compaction: avoid GFP_NOFS ABBA deadlock
During stress testing with higher-order allocations, a deadlock scenario
was observed in compaction: One GFP_NOFS allocation was sleeping on
mm/compaction.c::too_many_isolated(), while all CPUs in the system were
busy with compactors spinning on buffer locks held by the sleeping
GFP_NOFS allocation.
Reclaim is susceptible to this same deadlock; we fixed it by granting
GFP_NOFS allocations additional LRU isolation headroom, to ensure it makes
forward progress while holding fs locks that other reclaimers might
acquire. Do the same here.
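A hedged sketch of the shape of the fix in mm/compaction.c's too_many_isolated(), mirroring the reclaim-side precedent; the exact thresholds are illustrative:

  static bool too_many_isolated(struct compact_control *cc)
  {
      pg_data_t *pgdat = cc->zone->zone_pgdat;
      unsigned long inactive, isolated;

      inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
                 node_page_state(pgdat, NR_INACTIVE_ANON);
      isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
                 node_page_state(pgdat, NR_ISOLATED_ANON);

      /*
       * FS-capable callers hit the limit sooner; GFP_NOFS callers
       * keep the larger headroom so they make progress while holding
       * fs locks that other compactors may be waiting on.
       */
      if (cc->gfp_mask & __GFP_FS)
          inactive >>= 3;

      return isolated > inactive;
  }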
This code has been like this since compaction was initially merged, and I
only managed to trigger this with out-of-tree patches that dramatically
increase the contexts that do GFP_NOFS compaction. While the issue is
real, it seems theoretical in nature given existing allocation sites.
Worth fixing now, but no Fixes tag or stable CC.
Link: https://lkml.kernel.org/r/20230519111359.40475-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 19 May 2023 12:39:59 +0000 (14:39 +0200)]
mm: compaction: drop redundant watermark check in compaction_zonelist_suitable()
The watermark check in compaction_zonelist_suitable(), called from
should_compact_retry(), is sandwiched between two watermark checks
already: before, there are freelist attempts as part of direct reclaim and
direct compaction; after, there is a last-minute freelist attempt in
__alloc_pages_may_oom().
The check in compaction_zonelist_suitable() isn't necessary. Kill it.
Link: https://lkml.kernel.org/r/20230519123959.77335-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 19 May 2023 12:39:57 +0000 (14:39 +0200)]
mm: compaction: refactor __compaction_suitable()
__compaction_suitable() is supposed to check for available migration
targets. However, it also checks whether the operation was requested via
/proc/sys/vm/compact_memory, and whether the original allocation request
can already succeed. These don't apply to all callsites.
Move the checks out to the callers, so that later patches can deal with
them one by one. No functional change intended.
Johannes Weiner [Fri, 19 May 2023 12:39:56 +0000 (14:39 +0200)]
mm: compaction: simplify should_compact_retry()
The different branches for retry are unnecessarily complicated. There are
really only three outcomes: progress (retry n times), skipped (retry if
reclaim can help), failed (retry with higher priority).
Rearrange the branches and the retry counter to make it simpler.
The compaction result helpers encode quirks that are specific to the
allocator's retry logic. E.g. COMPACT_SUCCESS and COMPACT_COMPLETE
actually represent failures that should be retried upon, and so on. I
frequently found myself pulling up the helper implementation in order to
understand and work on the retry logic. They're not quite clean
abstractions; rather they split the retry logic into two locations.
Remove the helpers and inline the checks. Then comment on the result
interpretations directly where the decision making happens.
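A hedged sketch of the simplified flow (names follow the allocator; the branch bodies are illustrative, not the verbatim result):

  static bool should_compact_retry(struct alloc_context *ac, int order,
                                   int alloc_flags, enum compact_result result,
                                   enum compact_priority *priority, int *retries)
  {
      /* Skipped: compaction lacked free base pages to work with.
       * Only retry if reclaim can plausibly provide them. */
      if (result == COMPACT_SKIPPED)
          return compaction_zonelist_suitable(ac, order, alloc_flags);

      /* Progress was made: retry a bounded number of times. */
      if (++(*retries) <= MAX_COMPACT_RETRIES)
          return true;

      /* Out of retries: escalate priority before giving up. */
      if (*priority > MIN_COMPACT_PRIORITY) {
          (*priority)--;
          *retries = 0;
          return true;
      }
      return false;
  }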
Liam R. Howlett [Thu, 18 May 2023 14:55:43 +0000 (10:55 -0400)]
mm: add vma_iter_{next,prev}_range() to vma iterator
Add functionality to the VMA iterator to advance and retreat one offset
within the maple tree, regardless of the value contained. This can lead
to less re-walking to find an area of interest, especially when there is
nothing in that offset.
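A hedged sketch of the new helpers (thin wrappers over the maple tree range interfaces added earlier in this series):

  static inline
  struct vm_area_struct *vma_iter_next_range(struct vma_iterator *vmi)
  {
      return mas_next_range(&vmi->mas, ULONG_MAX);
  }

  static inline
  struct vm_area_struct *vma_iter_prev_range(struct vma_iterator *vmi)
  {
      return mas_prev_range(&vmi->mas, 0);
  }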
Link: https://lkml.kernel.org/r/20230518145544.1722059-35-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:41 +0000 (10:55 -0400)]
maple_tree: clear up index and last setting in single entry tree
When there is a single entry tree (range of 0-0 pointing to an entry),
then ensure the limit is either 0-0 or 1-oo, depending on where the user
walks. Ensure the correct node setting as well; either MAS_ROOT or
MAS_NONE.
Link: https://lkml.kernel.org/r/20230518145544.1722059-33-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:40 +0000 (10:55 -0400)]
maple_tree: add mas_prev_range() and mas_find_range_rev() interfaces
Some users of the maple tree may want to move to the previous range
regardless of the value stored there. Add this interface as well as the
'find' variant to support walking to the first value, then iterating over
the previous ranges.
Link: https://lkml.kernel.org/r/20230518145544.1722059-32-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:39 +0000 (10:55 -0400)]
maple_tree: introduce mas_prev_slot() interface
Sometimes the user needs to revert to the previous slot, regardless of
whether it is empty. Add an interface to go to the previous slot.
Since there can't be two consecutive NULLs in the tree, the mas_prev()
function can be implemented by calling mas_prev_slot() a maximum of 2
times. Change the underlying interface to use mas_prev_slot() to align
the code.
Link: https://lkml.kernel.org/r/20230518145544.1722059-31-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:37 +0000 (10:55 -0400)]
maple_tree: add mas_next_range() and mas_find_range() interfaces
Some users of the maple tree may want to move to the next range in the
tree, even if it stores a NULL. This family of functions provides that
functionality by advancing one slot at a time and returning the result,
while mas_contiguous() will iterate over the range and stop on
encountering the first NULL.
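A hedged usage sketch, walking every range of a tree including empty ones (termination and empty-tree handling are simplified; mt is an existing struct maple_tree *):

  MA_STATE(mas, mt, 0, 0);
  void *entry;

  rcu_read_lock();
  entry = mas_find_range(&mas, ULONG_MAX);    /* first range at/after 0 */
  for (;;) {
      pr_info("[%lu, %lu] -> %p\n", mas.index, mas.last, entry);
      if (mas.last == ULONG_MAX)
          break;
      entry = mas_next_range(&mas, ULONG_MAX); /* next range, NULL or not */
  }
  rcu_read_unlock();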
Link: https://lkml.kernel.org/r/20230518145544.1722059-29-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:36 +0000 (10:55 -0400)]
maple_tree: introduce mas_next_slot() interface
Sometimes, during a tree walk, the user needs the next slot regardless of
whether it is empty. Add an interface to get the next slot.
Since there are no consecutive NULLs allowed in the tree, the mas_next()
function can only advance two slots at most. So use the new
mas_next_slot() interface to align both implementations. Use this method
for mas_find() as well.
Link: https://lkml.kernel.org/r/20230518145544.1722059-28-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:34 +0000 (10:55 -0400)]
maple_tree: revise limit checks in mas_empty_area{_rev}()
Since the maple tree is inclusive in range, ensure that a range of 1 (min
= max) works for searching for a gap in either direction, and make sure
the size is at least 1 but not larger than the delta between min and max.
This commit also updates the testing. Unfortunately there isn't a way to
safely update the tests and code without a test failure.
Link: https://lkml.kernel.org/r/20230518145544.1722059-26-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:33 +0000 (10:55 -0400)]
maple_tree: try harder to keep active node with mas_prev()
Keep a reference to the node when possible with mas_prev(). This will
avoid re-walking the tree. In keeping a reference to the node, keep the
last/index accurate to the range being referenced. This means the limit
may be within the range, but the range may extend outside of the limit.
Also fix the single entry tree to respect the range (of 0), or set the
node to MAS_NONE in the case of shifting beyond 0.
Link: https://lkml.kernel.org/r/20230518145544.1722059-25-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:31 +0000 (10:55 -0400)]
mm/mmap: change do_vmi_align_munmap() for maple tree iterator changes
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave. Update do_vmi_align_munmap()
to the new expected behaviour now, since the change also works with the
current iterator.
Link: https://lkml.kernel.org/r/20230518145544.1722059-23-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:29 +0000 (10:55 -0400)]
maple_tree: remove unnecessary check from mas_destroy()
mas_destroy() currently checks if mas->node is MAS_START prior to calling
mas_start(), but this is unnecessary as mas_start() will do nothing if the
node is anything but MAS_START.
Link: https://lkml.kernel.org/r/20230518145544.1722059-21-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:28 +0000 (10:55 -0400)]
maple_tree: add __init and __exit to test module
The test functions are not needed after the module is removed, so mark
them as such. Add __exit to the module removal function. Some other
variables have been marked static const as well.
Link: https://lkml.kernel.org/r/20230518145544.1722059-20-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:27 +0000 (10:55 -0400)]
mm: update vma_iter_store() to use MAS_WARN_ON()
MAS_WARN_ON() will provide more information on the maple state and can be
more useful for debugging. Use this version of WARN_ON() in the debugging
code when storing to the tree.
Update the printk to a pr_warn(), though it will only be printed when
maple tree debug is enabled anyway.
Combining the print statements into one keeps them together on a busy
terminal.
Link: https://lkml.kernel.org/r/20230518145544.1722059-19-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:25 +0000 (10:55 -0400)]
maple_tree: make test code work without debug enabled
The test code is less useful without debug, but can still do general
validations. Define mt_dump(), mas_dump() and mas_wr_dump() as no-ops if
debug is not enabled, and document in the test module information that
more detail can be obtained with another kernel config option.
MT_BUG_ON() will still report failures, but without tree dumps the output
is less useful.
Link: https://lkml.kernel.org/r/20230518145544.1722059-17-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:24 +0000 (10:55 -0400)]
maple_tree: return error on mte_pivots() out of range
Rename mte_pivots() to mas_pivots() and pass through the ma_state to set
the error code to -EIO when the offset is out of range for the node type.
Change the WARN_ON() to MAS_WARN_ON() to log the maple state.
Link: https://lkml.kernel.org/r/20230518145544.1722059-16-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:22 +0000 (10:55 -0400)]
maple_tree: use MAS_WR_BUG_ON() in mas_store_prealloc()
mas_store_prealloc() should never fail, but if it does due to internal
tree issues then get as much debug information as possible prior to
crashing the kernel.
Link: https://lkml.kernel.org/r/20230518145544.1722059-14-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:20 +0000 (10:55 -0400)]
maple_tree: use MAS_BUG_ON() in mas_set_height()
Use MAS_BUG_ON() instead of MT_BUG_ON() to get the maple state
information. In the unlikely event of a tree height of > 31, try to
increase the probability of useful information being logged.
Link: https://lkml.kernel.org/r/20230518145544.1722059-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:18 +0000 (10:55 -0400)]
maple_tree: convert debug code to use MT_WARN_ON() and MAS_WARN_ON()
Using MT_WARN_ON() allows for the removal of if statements before logging.
Using MAS_WARN_ON() will provide more information when issues are
encountered.
Link: https://lkml.kernel.org/r/20230518145544.1722059-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:13 +0000 (10:55 -0400)]
maple_tree: clean up mas_dfs_postorder()
Convert the loop type to ensure all variables are set, to make the
compiler happy, and use the mas_is_none() function instead of explicitly
checking the node in the maple state.
Link: https://lkml.kernel.org/r/20230518145544.1722059-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:12 +0000 (10:55 -0400)]
maple_tree: avoid unnecessary ascending
The maple tree node limits are implied by the parent. When walking up the
tree, the limit may not be known until a slot that does not have implied
limits is encountered. However, if the node is the left-most or
right-most node, walking up to find that limit can be skipped.
This commit also fixes the debug/testing code that was not setting the
limit on walking down the tree as that optimization is not compatible with
this change.
Link: https://lkml.kernel.org/r/20230518145544.1722059-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:11 +0000 (10:55 -0400)]
maple_tree: clean up mas_parent_enum() and rename to mas_parent_type()
mas_parent_enum() is a simple wrapper for mte_parent_enum() which is only
called from that wrapper. Remove the wrapper and inline mte_parent_enum()
into mas_parent_enum().
At the same time, clean up the bit masking of the root pointer since it
cannot be set by the time the bit masking occurs. Change the check on the
root bit to a WARN_ON(), and fix the verification code to not trigger the
WARN_ON() before checking if the node is root.
Align the name to mas_parent_type() since mas_node_type() exists already.
Link: https://lkml.kernel.org/r/20230518145544.1722059-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Thu, 18 May 2023 14:55:10 +0000 (10:55 -0400)]
maple_tree: fix static analyser cppcheck issue
Patch series "Maple tree mas_{next,prev}_range() and cleanup", v4.
This patchset contains a number of clean-ups to make the code more usable
(next/prev range), adds debug output formatting, and prints the maple
state information in the WARN_ON/BUG_ON code.
There is also work done here to keep nodes active during iterations to
reduce the necessity of re-walking the tree.
Finally, there is a new interface added to move to the next or previous
range in the tree, even if it is empty.
The organisation of the patches is as follows:
0001-0004 - Small clean ups
0005-0018 - Additional debug options and WARN_ON/BUG_ON changes
0019 - Test module __init and __exit addition
0020-0021 - More functional clean ups
0022-0026 - Changes to keep nodes active
0027-0034 - Add new mas_{prev,next}_range()
0035 - Use new mas_{prev,next}_range() in mmap_region()
This patch (of 35):
A static analyser of the maple tree code noticed that the split variable
is used to index into an array before the variable itself is checked.
Fix this issue by reordering the statement to check the variable first.
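An illustration of the class of fix, with hypothetical variable names (the real statement is in the maple tree split code):

  /* before: pivots[split] is read before split is range-checked */
  if (pivots[split] == mas->last && split < max_slots - 1)
      split++;

  /* after: check the index first */
  if (split < max_slots - 1 && pivots[split] == mas->last)
      split++;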
Matthew Wilcox (Oracle) [Sat, 13 May 2023 00:11:01 +0000 (01:11 +0100)]
mm: convert migrate_pages() to work on folios
Almost all of the callers & implementors of migrate_pages() were already
converted to use folios. compaction_alloc() & compaction_free() are
trivial to convert as part of this patch and not worth splitting out.
Lorenzo Stoakes [Wed, 17 May 2023 19:25:48 +0000 (20:25 +0100)]
mm/gup: remove vmas array from internal GUP functions
Now that we have eliminated all callers of the GUP APIs which used the
vmas parameter, eliminate it altogether.
This removes a class of bugs where vmas might have been kept around
longer than the mmap_lock; we need no longer be concerned about the lock
being dropped during the operation and leaving behind dangling pointers.
This simplifies the GUP API and makes it considerably clearer as to its
purpose - follow flags are applied and if pinning, an array of pages is
returned.
Link: https://lkml.kernel.org/r/6811b4b2b4b3baf3dd07f422bb18853bb2cd09fb.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 17 May 2023 19:25:45 +0000 (20:25 +0100)]
mm/gup: remove vmas parameter from pin_user_pages()
We are now in a position where no caller of pin_user_pages() requires the
vmas parameter at all, so eliminate this parameter from the function and
all callers.
This clears the way to removing the vmas parameter from GUP altogether.
Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> [qib] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> [drivers/media] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 17 May 2023 19:25:42 +0000 (20:25 +0100)]
io_uring: rsrc: delegate VMA file-backed check to GUP
Now that GUP explicitly checks FOLL_LONGTERM pin_user_pages() for broken
file-backed mappings, per "mm/gup: disallow FOLL_LONGTERM GUP-nonfast
writing to file-backed mappings", there is no need to explicitly check
VMAs for this condition, so simply remove this logic from io_uring
altogether.
Link: https://lkml.kernel.org/r/e4a4efbda9cd12df71e0ed81796dc630231a1ef2.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 17 May 2023 19:25:39 +0000 (20:25 +0100)]
mm/gup: remove vmas parameter from get_user_pages_remote()
The only invocations of get_user_pages_remote() which used the vmas
parameter were for a single page, and those can instead simply look up
the VMA directly. In particular:-
- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
remove it.
- __access_remote_vm() was already using vma_lookup() when the original
lookup failed so by doing the lookup directly this also de-duplicates the
code.
We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().
As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.
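A hedged sketch of the helper's shape (close to, but not necessarily verbatim, the include/linux/mm.h addition):

  static inline struct page *get_user_page_vma_remote(struct mm_struct *mm,
                                                      unsigned long addr,
                                                      int gup_flags,
                                                      struct vm_area_struct **vmap)
  {
      struct page *page;
      struct vm_area_struct *vma;
      int got = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL);

      if (got < 0)
          return ERR_PTR(got);
      if (got == 0)
          return NULL;

      vma = vma_lookup(mm, addr);
      if (!vma) {
          /* Lookup failed: drop the reference taken on the page. */
          put_page(page);
          return ERR_PTR(-EINVAL);
      }

      *vmap = vma;
      return page;
  }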
This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.
[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR] Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64) Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390) Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 17 May 2023 19:25:36 +0000 (20:25 +0100)]
mm/gup: remove unused vmas parameter from pin_user_pages_remote()
No invocation of pin_user_pages_remote() uses the vmas parameter, so
remove it. This forms part of a larger patch set eliminating the use of
the vmas parameters altogether.
Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Wed, 17 May 2023 19:25:33 +0000 (20:25 +0100)]
mm/gup: remove unused vmas parameter from get_user_pages()
Patch series "remove the vmas parameter from GUP APIs", v6.
(pin_/get)_user_pages[_remote]() each provide an optional output parameter
for an array of VMA objects associated with each page in the input range.
These provide the means for VMAs to be returned, as long as mm->mmap_lock
is never released during the GUP operation (i.e. the internal flag
FOLL_UNLOCKABLE is not specified).
In addition, these VMAs can only be accessed with the mmap_lock held and
become invalidated the moment it is released.
The vast majority of invocations do not use this functionality and of
those that do, all but one case retrieve a single VMA to perform checks
upon.
It is not egregious in the single VMA cases to simply replace the
operation with a vma_lookup(). In these cases we duplicate the (fast)
lookup on a slow path already under the mmap_lock, abstracted to a new
get_user_page_vma_remote() inline helper function which also performs
error checking and reference count maintenance.
The special case is io_uring, where io_pin_pages() specifically needs to
assert that the VMAs underlying the range do not result in broken
long-term GUP file-backed mappings.
As GUP now internally asserts that FOLL_LONGTERM mappings are not
file-backed in a broken fashion (i.e. requiring dirty tracking) - as
implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to
file-backed mappings" - this logic is no longer required and so we can
simply remove it altogether from io_uring.
Eliminating the vmas parameter eliminates an entire class of dangling
pointer errors that might have occurred should the lock have been
incorrectly released.
In addition, the API is simplified and now clearly expresses what it is
intended for - applying the specified GUP flags and (if pinning) returning
pinned pages.
This change additionally opens the door to further potential improvements
in GUP and the possible marrying of disparate code paths.
I have run this series against gup_test with no issues.
Thanks to Matthew Wilcox for suggesting this refactoring!
This patch (of 6):
No invocation of get_user_pages() uses the vmas parameter, so remove it.
The GUP API is confusing and caveated. Recent changes have done much to
improve that, however there is more we can do. Exporting vmas is a prime
target as the caller has to be extremely careful to preclude their use
after the mmap_lock has expired or otherwise be left with dangling
pointers.
Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.
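For get_user_pages() itself, the change is just the dropped trailing parameter; before and after (declarations per include/linux/mm.h):

  /* before: optional, rarely used output array of VMAs */
  long get_user_pages(unsigned long start, unsigned long nr_pages,
                      unsigned int gup_flags, struct page **pages,
                      struct vm_area_struct **vmas);

  /* after this patch */
  long get_user_pages(unsigned long start, unsigned long nr_pages,
                      unsigned int gup_flags, struct page **pages);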
This is part of a patch series aiming to remove the vmas parameter
altogether.
Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts) Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sean Christopherson <seanjc@google.com> (KVM) Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:21 +0000 (14:38 +0800)]
mm: page_alloc: move is_check_pages_enabled() into page_alloc.c
is_check_pages_enabled() is only used in page_alloc.c, so move it into
page_alloc.c; also use it in free_tail_page_prepare().
Link: https://lkml.kernel.org/r/20230516063821.121844-14-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:20 +0000 (14:38 +0800)]
mm: page_alloc: move sysctls into their own file
This moves all page-alloc-related sysctls to their own file, as part of
the kernel/sysctl.c spring cleaning, and also moves some function
declarations from mm.h into internal.h.
Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:19 +0000 (14:38 +0800)]
mm: vmscan: use gfp_has_io_fs()
Use gfp_has_io_fs() instead of open-coding the check.
Link: https://lkml.kernel.org/r/20230516063821.121844-12-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:18 +0000 (14:38 +0800)]
mm: page_alloc: move pm_* function into power
pm_restrict_gfp_mask()/pm_restore_gfp_mask() are only used in the power
code, so let's move them out of page_alloc.c.
Add a general gfp_has_io_fs() helper which returns true if the gfp mask
has both the __GFP_IO and __GFP_FS flags set, then use it inside
pm_suspended_storage(); pm_suspended_storage() itself is moved into
suspend.h.
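The helper is a one-liner; per the description above it should look roughly like:

  static inline bool gfp_has_io_fs(gfp_t gfp)
  {
      return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS);
  }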
Link: https://lkml.kernel.org/r/20230516063821.121844-11-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:17 +0000 (14:38 +0800)]
mm: page_alloc: move mark_free_page() into snapshot.c
mark_free_page() is only used in kernel/power/snapshot.c, so move it out
to shrink page_alloc.c a bit.
Link: https://lkml.kernel.org/r/20230516063821.121844-10-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:16 +0000 (14:38 +0800)]
mm: page_alloc: split out DEBUG_PAGEALLOC
Move DEBUG_PAGEALLOC related functions into a single file to reduce a bit
of page_alloc.c.
Link: https://lkml.kernel.org/r/20230516063821.121844-9-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:15 +0000 (14:38 +0800)]
mm: page_alloc: split out FAIL_PAGE_ALLOC
... to a single file to reduce a bit of page_alloc.c.
Link: https://lkml.kernel.org/r/20230516063821.121844-8-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DEFINE_DYNAMIC_DEBUG_METADATA and DYNAMIC_DEBUG_BRANCH already have stub
definitions when the dynamic debug feature is disabled, so remove the
unnecessary alloc_contig_dump_pages() stub.
Link: https://lkml.kernel.org/r/20230516063821.121844-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:13 +0000 (14:38 +0800)]
mm: page_alloc: squash page_is_consistent()
Squash page_is_consistent() into bad_range(), as there is only one
caller.
Link: https://lkml.kernel.org/r/20230516063821.121844-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:12 +0000 (14:38 +0800)]
mm: page_alloc: collect mem statistic into show_mem.c
Let's move show_mem.c from lib to mm, as it belongs to the memory
subsystem; also split some memory-statistics-related functions out of
page_alloc.c into show_mem.c, and clean up some unneeded includes.
There is no functional change.
Link: https://lkml.kernel.org/r/20230516063821.121844-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:11 +0000 (14:38 +0800)]
mm: page_alloc: move set_zone_contiguous() into mm_init.c
set_zone_contiguous() is only used in mm init/hotplug, and
clear_zone_contiguous() is only used in hotplug; move them from
page_alloc.c to the more appropriate file.
Link: https://lkml.kernel.org/r/20230516063821.121844-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:10 +0000 (14:38 +0800)]
mm: page_alloc: move init_on_alloc/free() into mm_init.c
Since commit f2fc4b44ec2b ("mm: move init_mem_debugging_and_hardening() to
mm/mm_init.c"), the init_on_alloc() and init_on_free() definitions are
better moved there too.
Link: https://lkml.kernel.org/r/20230516063821.121844-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Tue, 16 May 2023 06:38:09 +0000 (14:38 +0800)]
mm: page_alloc: move mirrored_kernelcore into mm_init.c
Patch series "mm: page_alloc: misc cleanup and refactor", v2.
This aims to shrink page_alloc.c further and do some cleanup; no
functional changes are intended.
This patch (of 13):
Since commit 9420f89db2dd ("mm: move most of core MM initialization to
mm/mm_init.c"), mirrored_kernelcore should be moved into mm_init.c, as
most of the related code is already there.
Link: https://lkml.kernel.org/r/20230516063821.121844-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230516063821.121844-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Mon, 15 May 2023 08:34:00 +0000 (11:34 +0300)]
mm/secretmem: make it on by default
Following the discussion about direct map fragmentation at LSF/MM [1], it
appears that direct map fragmentation has a negligible effect on kernel
data accesses. Since the only reason that warranted disabling secretmem
by default was concern about a performance regression caused by direct
map fragmentation, it makes perfect sense to lift this restriction and
enable secretmem.
secretmem obeys RLIMIT_MEMLOCK and as such it is not expected to cause
large fragmentation of the direct map or a meaningful increase in the
page tables allocated during splits of the large mappings in the direct map.
The secretmem.enable parameter is retained to allow system administrators
to disable secretmem at boot.
Switch the default setting of secretmem.enable parameter to 1.
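With the default flipped, a minimal userspace consumer looks roughly like the sketch below. memfd_secret(2) is the real syscall; SYS_memfd_secret assumes recent kernel headers, and the call fails with ENOSYS when booted with secretmem.enable=0:
  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void)
  {
          /* memfd_secret(2): memory removed from the kernel direct map */
          int fd = syscall(SYS_memfd_secret, 0);

          if (fd < 0) {
                  perror("memfd_secret"); /* ENOSYS when disabled at boot */
                  return 1;
          }
          if (ftruncate(fd, 4096) < 0)
                  return 1;
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;
          p[0] = 's';     /* contents invisible via the direct map */
          munmap(p, 4096);
          close(fd);
          return 0;
  }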
Mel Gorman [Mon, 15 May 2023 11:33:44 +0000 (12:33 +0100)]
Revert "Revert "mm/compaction: fix set skip in fast_find_migrateblock""
This reverts commit 95e7a450b819 ("Revert "mm/compaction: fix set skip in
fast_find_migrateblock"").
Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") was reverted due to bug reports about khugepaged
consuming large amounts of CPU without making progress. The underlying
bug was partially fixed by commit cfccd2e63e7e ("mm, compaction: finish
pageblocks on complete migration failure"), but it only mitigated the
problem, and Vlastimil Babka pointed out that the same issue could
theoretically happen to kcompactd.
As pageblocks containing pages that fail to migrate should now be forcibly
rescanned to set the skip hint if skip hints are used,
fast_find_migrateblock() should no longer loop on a small subset of
pageblocks for prolonged periods of time. Revert the revert so
fast_find_migrateblock() is effective again.
Using the mmtests config workload-usemem-stress-numa-compact, the number
of unique ranges scanned was analysed for both kcompactd and !kcompactd
activity.
Note that the vanilla kernel and the full series had some duplication of
ranges scanned but it was not severe and would be in line with compaction
resets when the skip hints are cleared. Just a revert of commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
showed excessive rescans of the same ranges so the series should not
reintroduce bug 1206848.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848 Link: https://lkml.kernel.org/r/20230515113344.6869-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 15 May 2023 11:33:43 +0000 (12:33 +0100)]
mm: compaction: update pageblock skip when first migration candidate is not at the start
isolate_migratepages_block should mark a pageblock as skip if scanning
started on an aligned pageblock boundary but it only updates the skip flag
if the first migration candidate is also aligned. Tracing during a
compaction stress load (mmtests: workload-usemem-stress-numa-compact) showed that
many pageblocks are not marked skip, causing excessive scanning of blocks
that had been recently checked. Update pageblock skip based on
"valid_page" which is set if scanning started on a pageblock boundary.
[mgorman@techsingularity.net: fix handling of skip bit] Link: https://lkml.kernel.org/r/20230602111622.swtxhn6lu2qwgrwq@techsingularity.net Link: https://lkml.kernel.org/r/20230515113344.6869-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 15 May 2023 11:33:42 +0000 (12:33 +0100)]
mm: compaction: only force pageblock scan completion when skip hints are obeyed
fast_find_migrateblock relies on skip hints to avoid rescanning a recently
selected pageblock but compact_zone() only forces the pageblock scan
completion to set the skip hint if in direct compaction. While this
prevents direct compaction repeatedly scanning a subset of blocks due to
fast_find_migrateblock(), it does not prevent proactive compaction, node
compaction and kcompactd encountering the same problem described in commit cfccd2e63e7e ("mm, compaction: finish pageblocks on complete migration
failure").
Force the scan completion of a pageblock to set the skip hint if skip
hints are obeyed to prevent fast_find_migrateblock() repeatedly selecting
a subset of pageblocks.
Link: https://lkml.kernel.org/r/20230515113344.6869-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 15 May 2023 11:33:41 +0000 (12:33 +0100)]
mm: compaction: ensure rescanning only happens on partially scanned pageblocks
Patch series "Follow-up "Fix excessive CPU usage during compaction"".
The series "Fix excessive CPU usage during compaction" [1] attempted to
fix a bug [2] but Vlastimil noted that the fix was incomplete [3]. While
the series was merged, fast_find_migrateblock was still disabled. This
series should fix the corner cases and allow 95e7a450b819 ("Revert
"mm/compaction: fix set skip in fast_find_migrateblock"") to be safely
reverted. Details on how many pageblocks are rescanned are in the
changelog of the last patch.
"Raghavendra K T" tested this and reported "decent improvement from perf
perspective as well as compaction related data [4]
This patch (of 4):
compact_zone() intends to rescan pageblocks if there is a failure to
migrate "within the current order-aligned block". However, the pageblock
scan may already be complete and have moved on to the next block, causing the next
pageblock to be "rescanned". Ensure only the most recent pageblock is
rescanned.
Link: https://lkml.kernel.org/r/20230515113344.6869-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20230515113344.6869-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuanchu Xie [Mon, 15 May 2023 17:26:08 +0000 (01:26 +0800)]
mm: pagemap: restrict pagewalk to the requested range
The pagewalk in pagemap_read reads one PTE past the end of the requested
range, and stops when the buffer runs out of space. While it produces
the right result, the extra read is unnecessary and less performant.
I timed the following command before and after this patch:
dd count=100000 if=/proc/self/pagemap of=/dev/null
The results are consistently within 0.001s across 5 runs.
Before:
100000+0 records in
100000+0 records out 51200000 bytes (51 MB) copied, 0.0763159 s, 671 MB/s
real 0m0.078s
user 0m0.012s
sys 0m0.065s
After:
100000+0 records in
100000+0 records out 51200000 bytes (51 MB) copied, 0.0487928 s, 1.0 GB/s
real 0m0.050s
user 0m0.011s
sys 0m0.039s
Link: https://lkml.kernel.org/r/20230515172608.3558391-1-yuanchu@google.com Signed-off-by: Yuanchu Xie <yuanchu@google.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haifeng Xu [Mon, 8 May 2023 07:35:38 +0000 (07:35 +0000)]
mm, oom: do not check 0 mask in out_of_memory()
Since commit 60e2793d440a ("mm, oom: do not trigger out_of_memory from the
#PF"), no user sets gfp_mask to 0. Remove the 0 mask check and update the
comments.
Link: https://lkml.kernel.org/r/20230508073538.1168-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pasha Tatashin [Mon, 8 May 2023 23:40:59 +0000 (23:40 +0000)]
mm: hugetlb_vmemmap: provide stronger vmemmap allocation guarantees
HugeTLB pages have a struct page optimization where the struct pages for
tail pages are freed. However, when HugeTLB pages are destroyed, the
memory for the struct pages (vmemmap) needs to be allocated again.
Currently, the __GFP_NORETRY flag is used to allocate the memory for
vmemmap, but given that this flag makes very little effort to actually
reclaim memory, returning huge pages back to the system can be a problem.
Let's use __GFP_RETRY_MAYFAIL instead. This flag also performs graceful
reclaim without causing OOMs, but it may perform a few retries, and will
fail only when the amount of unused memory in the system is genuinely
small.
Freeing a 1G page requires 16M of free memory. A machine might need to be
reconfigured from one task to another, releasing a large number of 1G
pages back to the system; if allocating 16M fails, the release won't work.
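A sketch of the flag change (the call site is the vmemmap page allocation in mm/hugetlb_vmemmap.c; surrounding details may differ):
  /* before: gives up almost immediately under memory pressure */
  const gfp_t gfp_mask = GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE;

  /* after: __GFP_RETRY_MAYFAIL retries reclaim a few times and fails,
   * without invoking the OOM killer, only when free memory is genuinely
   * scarce */
  const gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_THISNODE;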
Link: https://lkml.kernel.org/r/20230508234059.2529638-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Suggested-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arnd Bergmann [Tue, 9 May 2023 14:57:21 +0000 (16:57 +0200)]
kasan: use internal prototypes matching gcc-13 builtins
gcc-13 warns about function definitions for builtin interfaces that have a
different prototype, e.g.:
In file included from kasan_test.c:31:
kasan.h:574:6: error: conflicting types for built-in function '__asan_register_globals'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
574 | void __asan_register_globals(struct kasan_global *globals, size_t size);
kasan.h:577:6: error: conflicting types for built-in function '__asan_alloca_poison'; expected 'void(void *, long int)' [-Werror=builtin-declaration-mismatch]
577 | void __asan_alloca_poison(unsigned long addr, size_t size);
kasan.h:580:6: error: conflicting types for built-in function '__asan_load1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
580 | void __asan_load1(unsigned long addr);
kasan.h:581:6: error: conflicting types for built-in function '__asan_store1'; expected 'void(void *)' [-Werror=builtin-declaration-mismatch]
581 | void __asan_store1(unsigned long addr);
kasan.h:643:6: error: conflicting types for built-in function '__hwasan_tag_memory'; expected 'void(void *, unsigned char, long int)' [-Werror=builtin-declaration-mismatch]
643 | void __hwasan_tag_memory(unsigned long addr, u8 tag, unsigned long size);
The two problems are:
- Addresses are passed as 'unsigned long' in the kernel, but gcc-13
expects a 'void *'.
- Sizes are meant to use a signed ssize_t rather than size_t.
Change all the prototypes to match these. Using 'void *' consistently for
addresses gets rid of a couple of type casts, so push that down to the
leaf functions where possible.
This now passes all randconfig builds on arm, arm64 and x86, but I have
not tested it on the other architectures that support kasan, since they
tend to fail randconfig builds in other ways. This might fail if any of
the 32-bit architectures expect a 'long' instead of 'int' for the size
argument.
The __asan_allocas_unpoison() function prototype is somewhat weird, since
it uses a pointer for 'stack_top' and a size_t for 'stack_bottom'. This
looks like it is meant to be 'addr' and 'size' like the others, but the
implementation clearly treats them as 'top' and 'bottom'.
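For illustration, a hedged sketch of the new shape (not the full diff), mirroring the builtin declarations gcc-13 checks against:
  /* before: rejected by gcc-13's builtin declaration checks */
  void __asan_load1(unsigned long addr);
  void __asan_register_globals(struct kasan_global *globals, size_t size);

  /* after: 'void *' for addresses, signed ssize_t for sizes */
  void __asan_load1(void *addr);
  void __asan_register_globals(void *globals, ssize_t size);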
Arnd Bergmann [Tue, 9 May 2023 14:57:20 +0000 (16:57 +0200)]
kasan: add kasan_tag_mismatch prototype
The kasan sw-tags implementation contains one function that is only called
from assembler and has no prototype in a header. This causes a W=1
warning:
mm/kasan/sw_tags.c:171:6: warning: no previous prototype for 'kasan_tag_mismatch' [-Wmissing-prototypes]
171 | void kasan_tag_mismatch(unsigned long addr, unsigned long access_info,
Add a prototype in the local header to get a clean build.
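The added prototype presumably looks like this, reconstructed from the definition the warning points at:
  /* called only from assembly; declared in mm/kasan/kasan.h to silence
   * -Wmissing-prototypes */
  void kasan_tag_mismatch(unsigned long addr, unsigned long access_info,
                          unsigned long ret_ip);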
Huang Ying [Wed, 10 May 2023 03:18:29 +0000 (11:18 +0800)]
migrate_pages_batch: simplify retrying and failure counting of large folios
After recent changes to the retrying and failure counting in
migrate_pages_batch(), it was found to be unnecessary to count retries and
failures for normal, large, and THP folios separately, because the retry
and failure numbers of large folios are not used directly. So, in this
patch, we simplify the retry and failure counting of large folios by
counting retries and failures of normal and large folios together. This
also reduces the line count.
Previously, migrate_pages_batch() needed to track whether the source folio
was large/THP before splitting, so is_large was used to cache the
folio_test_large() result. Now that retries and failures of large folios
are no longer counted separately (only those of THP folios are), that
variable is unnecessary. So, in this patch, is_large is removed to
simplify the code.
This is just a code cleanup; no functional changes are expected.
Link: https://lkml.kernel.org/r/20230510031829.11513-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rick Wertenbroek [Wed, 10 May 2023 09:07:57 +0000 (11:07 +0200)]
mm: memory_hotplug: fix format string in warnings
The format string in __add_pages() and __remove_pages() has a typo and
prints, e.g., "Misaligned __add_pages start: 0xfc605 end: #fc609" instead
of "Misaligned __add_pages start: 0xfc605 end: 0xfc609". Fix "#%lx" =>
"%#lx".
Link: https://lkml.kernel.org/r/20230510090758.3537242-1-rick.wertenbroek@gmail.com Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peng Zhang [Sat, 6 May 2023 02:47:52 +0000 (10:47 +0800)]
maple_tree: fix potential out-of-bounds access in mas_wr_end_piv()
Check the write offset end bounds before using it as the offset into the
pivot array. This avoids a possible out-of-bounds access on the pivot
array if the write extends to the last slot in the node, in which case the
node maximum should be used as the end pivot.
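Schematically (a sketch of the bound handling with simplified names, not the verbatim fix), the end pivot is only read from the array while the offset stays in bounds:
  /* walk pivots only while offset is a valid pivot index; once the
   * write reaches the last slot, the node maximum is the end pivot */
  while ((offset < mt_pivots[type]) && (mas->last > pivots[offset]))
          offset++;

  wr_mas->end_piv = (offset < mt_pivots[type]) ? pivots[offset] : mas->max;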
akpm: this doesn't affect any current callers, but new users of maple tree
may encounter this problem if the fix is backported into earlier kernels,
so let's fix it in -stable kernels just in case.
Link: https://lkml.kernel.org/r/20230506024752.2550-1-zhangpeng.00@bytedance.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Sat, 6 May 2023 14:05:25 +0000 (15:05 +0100)]
mm/gup: add missing gup_must_unshare() check to gup_huge_pgd()
All other instances of gup_huge_pXd() perform the unshare check, so update
the PGD-specific function to do so as well.
While checking pgd_write() might seem unusual, this function already
performs such a check via pgd_access_permitted() so this is in line with
the existing implementation.
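The added check plausibly mirrors the other gup_huge_pXd() variants, along these lines (a sketch, using the locals of gup_huge_pgd()):
  /* bail out of GUP-fast if we must not take a R/O pin on a possibly
   * shared anonymous page (same pattern as the pud/pmd variants) */
  if (!pgd_write(orig) && gup_must_unshare(NULL, flags, &folio->page)) {
          gup_put_folio(folio, refs, flags);
          return 0;
  }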
David said:
: This change makes unshare handling across all GUP-fast variants
: consistent, which is desirable as GUP-fast is complicated enough
: already even when consistent.
:
: This function was the only one I seemed to have missed (or left out and
: forgot why -- maybe because it's really dead code for now). The COW
: selftest would identify the problem, so far there was no report.
: Either the selftest wasn't run on corresponding architectures with that
: hugetlb size, or that code is still dead code and unused by
: architectures.
:
: the original commit(s) that added unsharing explain why we care about
: these checks:
:
: a7f226604170acd6 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
: 84209e87c6963f92 ("mm/gup: reliable R/O long-term pinning in COW mappings")
Nhat Pham [Wed, 3 May 2023 01:36:08 +0000 (18:36 -0700)]
selftests: add selftests for cachestat
Test cachestat on a newly created file, /dev/ files, /proc/ files and a
directory. Also test on a shmem file (which can also be tested with
huge pages since tmpfs supports huge pages).
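For reference, a minimal userspace call looks roughly like this; the struct layouts and the len == 0 "to end of file" convention are assumptions from the cachestat(2) proposal, and the syscall number is the x86-64 one:
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #ifndef __NR_cachestat
  #define __NR_cachestat 451      /* x86-64; assumption for older headers */
  #endif

  struct cachestat_range {
          unsigned long long off;
          unsigned long long len; /* 0 == from off to end of file */
  };

  struct cachestat {
          unsigned long long nr_cache;
          unsigned long long nr_dirty;
          unsigned long long nr_writeback;
          unsigned long long nr_evicted;
          unsigned long long nr_recently_evicted;
  };

  int main(void)
  {
          struct cachestat_range range = { 0, 0 };
          struct cachestat cs;
          int fd = open("/etc/hostname", O_RDONLY);

          if (fd < 0 || syscall(__NR_cachestat, fd, &range, &cs, 0))
                  return 1;
          printf("cached: %llu dirty: %llu writeback: %llu\n",
                 cs.nr_cache, cs.nr_dirty, cs.nr_writeback);
          close(fd);
          return 0;
  }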
[colin.i.king@gmail.com: fix spelling mistake "trucate" -> "truncate"] Link: https://lkml.kernel.org/r/20230505110855.2493457-1-colin.i.king@gmail.com
[mpe@ellerman.id.au: avoid excessive stack allocation] Link: https://lkml.kernel.org/r/877ctfa6yv.fsf@mail.lhotse Link: https://lkml.kernel.org/r/20230503013608.2431726-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Colin Ian King <colin.i.king@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>