www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
2 years agomm/mmap: Combine multiple if statements in vma_merge() maple_v6.0-rc1_extras
Liam R. Howlett [Wed, 24 Aug 2022 19:37:56 +0000 (15:37 -0400)]
mm/mmap: Combine multiple if statements in vma_merge()

Currently vma_merge() searches from 0 upwards if there is no prev vma,
or from prev->vm_end if there is a prev vma.  The check for merging with
prev also checks if prev exists.  The ordering is not important at this
stage, so move the merging check above finding the next vma so that prev
can be checked only once.

Also start searching from vma_start as opposed to 0.  If there is no
previous vma then there won't be a vma before vma_start.  The code reads
more clearly this way.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Remove vma_mas_szero() helper
Liam R. Howlett [Fri, 19 Aug 2022 17:40:21 +0000 (13:40 -0400)]
mm/mmap: Remove vma_mas_szero() helper

Remove the helper to zero out a portion of a VMA.  It is only called
from one function and putting the logic in that function allows for a
WARN_ON() to be added to check for a logic error.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/nommu: Add static inline to vma_mas_remove()
Liam R. Howlett [Fri, 19 Aug 2022 17:38:03 +0000 (13:38 -0400)]
mm/nommu: Add static inline to vma_mas_remove()

Only used in nommu, so make it static inline.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Remove unused vma_mas_remove() function
Liam R. Howlett [Fri, 19 Aug 2022 17:37:31 +0000 (13:37 -0400)]
mm/mmap: Remove unused vma_mas_remove() function

This function is no longer used.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Remove __vma_adjust()
Liam R. Howlett [Thu, 11 Aug 2022 20:19:43 +0000 (16:19 -0400)]
mm/mmap: Remove __vma_adjust()

Inline the work of __vma_adjust() into vma_merge().  This reduces the
code size and has the added benefit of locating the comments for each
case with the code that handles it.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Convert do_brk_flags() to use lock_vma()
Liam R. Howlett [Thu, 11 Aug 2022 15:54:23 +0000 (11:54 -0400)]
mm/mmap: Convert do_brk_flags() to use lock_vma()

Use the abstracted vma locking for do_brk_flags()

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Introduce dup_vma_anon() helper
Liam R. Howlett [Wed, 10 Aug 2022 22:11:48 +0000 (18:11 -0400)]
mm/mmap: Introduce dup_vma_anon() helper

Create a helper for duplicating the anon vma when adjusting the vma.
This simplifies the logic of __vma_adjust().
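
A minimal sketch of such a helper, assuming a dst/src pair of VMAs; the
exact arguments and body in the patch may differ:

	static inline int dup_vma_anon(struct vm_area_struct *dst,
				       struct vm_area_struct *src)
	{
		/* Take over src's anon_vma and clone its chain onto dst. */
		if (src->anon_vma && !dst->anon_vma) {
			dst->anon_vma = src->anon_vma;
			return anon_vma_clone(dst, src);
		}
		return 0;
	}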

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Don't use __vma_adjust() in shift_arg_pages()
Liam R. Howlett [Wed, 10 Aug 2022 21:24:05 +0000 (17:24 -0400)]
mm/mmap: Don't use __vma_adjust() in shift_arg_pages()

Introduce shrink_vma() which uses the lock_vma() and unlock_vma()
functions to reduce the vma coverage.

Convert shift_arg_pages() to use expand_vma() and the new shrink_vma()
function.  Remove shrink_vma() support from __vma_adjust() since
shift_arg_pages() is the only user that shrinks a VMA in this way.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm: Don't use __vma_adjust() in __split_vma()
Liam R. Howlett [Wed, 10 Aug 2022 20:09:15 +0000 (16:09 -0400)]
mm: Don't use __vma_adjust() in __split_vma()

Use the abstracted locking and maple tree operations.  Since
__split_vma() is the only user of the __vma_adjust() function to use the
insert argument, drop that argument.  Remove the NULL passed through
from fs/exec's shift_arg_pages() at the same time.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Introduce init_vma_lock()
Liam R. Howlett [Tue, 9 Aug 2022 20:25:02 +0000 (16:25 -0400)]
mm/mmap: Introduce init_vma_lock()

Add init_vma_lock() to set up the struct vma_locking.  This is to
abstract the locking when adjusting VMAs.

Also remove the use of remove_next int in favour of a pointer to the VMA
to remove.  Rename next_next to remove2 since this better reflects its
use.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Use vma_lock() and vma_unlock() in vma_expand()
Liam R. Howlett [Fri, 5 Aug 2022 20:15:12 +0000 (16:15 -0400)]
mm/mmap: Use vma_lock() and vma_unlock() in vma_expand()

Use the new locking functions for vma_expand.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: Refactor locking out of __vma_adjust()
Liam R. Howlett [Fri, 5 Aug 2022 20:07:25 +0000 (16:07 -0400)]
mm/mmap: Refactor locking out of __vma_adjust()

Move the locking into lock_vma() and unlock_vma() for use elsewhere

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm/mmap: move anon_vma setting in __vma_adjust()
Liam R. Howlett [Fri, 5 Aug 2022 20:04:16 +0000 (16:04 -0400)]
mm/mmap: move anon_vma setting in __vma_adjust()

Move the anon_vma setting & WARN_ON() up the function.  This is done to
clear up the locking later.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomm: Change munmap splitting order and move_vma()
Liam R. Howlett [Wed, 30 Mar 2022 17:35:49 +0000 (13:35 -0400)]
mm: Change munmap splitting order and move_vma()

Splitting can be more efficient when done in the reverse order to
minimize VMA walking.  Change do_mas_align_munmap() to reduce walking of
the tree during split operations.

move_vma() must also be altered to remove the dependency of keeping the
original VMA as the active part of the split.  Look up the new VMA or
two if necessary.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomaple_tree: Reduce loops in mt_validate()
Liam R. Howlett [Fri, 5 Aug 2022 18:02:20 +0000 (14:02 -0400)]
maple_tree: Reduce loops in mt_validate()

mt_validate() is taking too long when kasan and memory poisoning are
enabled, which delays the rcu free operation and causes issues when
running the LTP test msgstress03.  Change the validation to only loop
over a node once as opposed to many times for each test.  Also only walk
the tree once by keeping track of whether the last leaf had a null at a
higher level.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomaple_tree: change mss_fill_bnode() logic a bit
Liam R. Howlett [Tue, 22 Mar 2022 16:38:04 +0000 (12:38 -0400)]
maple_tree: change mss_fill_bnode() logic a bit

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agomaple_tree: Reduce mas_split() memory usage
Liam R. Howlett [Tue, 22 Mar 2022 16:37:31 +0000 (12:37 -0400)]
maple_tree: Reduce mas_split() memory usage

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agotest_maple_tree: 32 bit testing support
Liam R. Howlett [Wed, 3 Aug 2022 17:22:51 +0000 (13:22 -0400)]
test_maple_tree: 32 bit testing support

Add support for the maple tree tests to run in a 32 bit environment.
This disables a number of tests that store values above the 32 bit
limit, but still exercises a lot of the functionality.

Adds detection of the 32/64 bit environment in the makefile and the
generated headers.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
2 years agopage_ext-introduce-boot-parameter-early_page_ext-fix
Andrew Morton [Fri, 26 Aug 2022 04:17:22 +0000 (21:17 -0700)]
page_ext-introduce-boot-parameter-early_page_ext-fix

fix section issue by removing __meminitdata

Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agopage_ext: introduce boot parameter 'early_page_ext'
Li Zhe [Thu, 25 Aug 2022 10:27:14 +0000 (18:27 +0800)]
page_ext: introduce boot parameter 'early_page_ext'

In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid some panic problem.  It seems that we cannot track early page
allocations in the current kernel even if the page structures have been
initialized early.

This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem.  If we pass it to the kernel, page_ext_init() will be moved
up and the feature 'deferred initialization of struct pages' will be
disabled to initialize the page allocator early and prevent the panic
problem above.  It can help us to catch early page allocations.  This is
especially useful when we find that the free memory value is not the same
right after different kernel boots.

Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agos390/hugetlb: switch to generic version of follow_huge_pud()
Gerald Schaefer [Fri, 26 Aug 2022 05:03:31 +0000 (22:03 -0700)]
s390/hugetlb: switch to generic version of follow_huge_pud()

When pud-sized hugepages were introduced for s390, the generic version of
follow_huge_pud() was using pte_page() instead of pud_page().  This would
be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use
pmd_page() in follow_huge_pmd()").  Therefore, and probably because not
all archs were supporting pud_page() at that time, a private version of
follow_huge_pud() was added for s390, correctly using pud_page().

Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry"), the generic version of
follow_huge_pud() is now also using pud_page(), and in general behaves
similar to follow_huge_pmd().

Therefore we can now switch to the generic version and get rid of the
s390-specific follow_huge_pud().

Link: https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Haiyue Wang <haiyue.wang@intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/zswap: skip confusing print info
Liu Shixin [Thu, 25 Aug 2022 14:20:37 +0000 (22:20 +0800)]
mm/zswap: skip confusing print info

It's confusing to print info when disabling zswap while zswap init has
failed or there is no pool.  If no change is required, just return directly.

Link: https://lkml.kernel.org/r/20220825142037.3214152-4-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm-zswap-delay-the-initializaton-of-zswap-until-the-first-enablement-fix
Andrew Morton [Fri, 26 Aug 2022 01:59:19 +0000 (18:59 -0700)]
mm-zswap-delay-the-initializaton-of-zswap-until-the-first-enablement-fix

s/init_zswap/zswap_init/

Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/zswap: delay the initializaton of zswap until the first enablement
Liu Shixin [Thu, 25 Aug 2022 14:20:36 +0000 (22:20 +0800)]
mm/zswap: delay the initializaton of zswap until the first enablement

In the initialization of zswap, about 18MB memory will be allocated for
zswap_pool in my machine.  Since not all users use zswap, the memory may
be wasted.  Save the memory for these users by delaying the initialization
of zswap to first enablement.

Link: https://lkml.kernel.org/r/20220825142037.3214152-3-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/zswap: replace zswap_init_{started/failed} with zswap_init_state
Liu Shixin [Thu, 25 Aug 2022 14:20:35 +0000 (22:20 +0800)]
mm/zswap: replace zswap_init_{started/failed} with zswap_init_state

Patch series "Delay the initialization of zswap".

In the initialization of zswap, about 18MB memory will be allocated for
zswap_pool.  Since not all users use zswap, the memory may be wasted.
Save the memory for these users by delaying the initialization of zswap to
first enablement.

This patch (of 3):

zswap_init_started indicates that the initialization has started, and
zswap_init_failed indicates that the initialization has failed.  As we will
support initializing zswap after system startup, it's necessary to add a
state to indicate that the initialization has completed successfully, to
avoid concurrency issues.  We don't care about the difference between init
started and init completed, so we only need three states: uninitialized,
init failed, and init succeeded.
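
A minimal sketch of the three-state idea; the identifier names below are
illustrative, not necessarily the ones used in mm/zswap.c:

	enum zswap_init_type {
		ZSWAP_UNINIT,		/* initialization not attempted yet */
		ZSWAP_INIT_SUCCEED,	/* init ran and the pool is usable */
		ZSWAP_INIT_FAILED,	/* init ran and failed */
	};

	static enum zswap_init_type zswap_init_state = ZSWAP_UNINIT;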

Link: https://lkml.kernel.org/r/20220825142037.3214152-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20220825142037.3214152-2-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm-reduce-noise-in-show_mem-for-lowmem-allocations-fix
Andrew Morton [Fri, 26 Aug 2022 03:39:47 +0000 (20:39 -0700)]
mm-reduce-noise-in-show_mem-for-lowmem-allocations-fix

fix it for mapletree

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: reduce noise in show_mem for lowmem allocations
Michal Hocko [Tue, 23 Aug 2022 09:22:30 +0000 (11:22 +0200)]
mm: reduce noise in show_mem for lowmem allocations

While discussing early DMA pool pre-allocation failure with Christoph [1]
I have realized that the allocation failure warning is rather noisy for
constrained allocations like GFP_DMA{32}.  Those zones are often not
populated on all nodes, as their memory ranges are constrained.

This is an attempt to reduce the ballast that doesn't provide any relevant
information for the investigation of those allocation failures.  Please note
that
I have only compile tested it (in my default config setup) and I am
throwing it mostly to see what people think about it.

[1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de

Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: fixup documentation regarding pte_numa() and PROT_NUMA
David Hildenbrand [Thu, 25 Aug 2022 16:46:59 +0000 (18:46 +0200)]
mm: fixup documentation regarding pte_numa() and PROT_NUMA

pte_numa() no longer exists -- replaced by pte_protnone() -- and PROT_NUMA
probably never existed: MM_CP_PROT_NUMA also ends up using PROT_NONE.

Let's fixup the doc.

Link: https://lkml.kernel.org/r/20220825164659.89824-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/gup: use gup_can_follow_protnone() also in GUP-fast
David Hildenbrand [Thu, 25 Aug 2022 16:46:58 +0000 (18:46 +0200)]
mm/gup: use gup_can_follow_protnone() also in GUP-fast

There seems to be no reason why FOLL_FORCE during GUP-fast would have to
fallback to the slow path when stumbling over a PROT_NONE mapped page.  We
only have to trigger hinting faults in case FOLL_FORCE is not set, and any
kind of fault handling naturally happens from the slow path -- where NUMA
hinting accounting/handling would be performed.

Note that the comment regarding THP migration is outdated: commit
2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP
migration") described that this was required for THP due to lack of PMD
migration entries.  Nowadays, we do have proper PMD migration entries in
place -- see set_pmd_migration_entry(), which does a proper
pmdp_invalidate() when placing the migration entry.

So let's just reuse gup_can_follow_protnone() here to make it consistent
and drop the somewhat outdated comments.
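
A minimal sketch of the resulting check on the GUP-fast path, assuming the
pte_unmap bail-out label and locals of gup_pte_range(); illustrative, not
the literal mm/gup.c hunk:

	if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
		goto pte_unmap;	/* fall back to the slow path */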

Link: https://lkml.kernel.org/r/20220825164659.89824-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/gup: replace FOLL_NUMA by gup_can_follow_protnone()
David Hildenbrand [Thu, 25 Aug 2022 16:46:57 +0000 (18:46 +0200)]
mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()

Patch series "mm: minor cleanups around NUMA hinting".

Working on some GUP cleanups (e.g., getting rid of some FOLL_ flags) and
preparing for other GUP changes (getting rid of FOLL_FORCE|FOLL_WRITE for
for taking a R/O longterm pin), this is something I can easily send out
independently.

Get rid of FOLL_NUMA, allow FOLL_FORCE access to PROT_NONE mapped pages in
GUP-fast, and fixup some documentation around NUMA hinting.

This patch (of 3):

No need for a special flag that is not even properly documented to be
internal-only.

Let's just factor this check out and get rid of this flag.  The separate
function has the nice benefit that we can centralize comments.
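
A hedged sketch of the factored-out helper; the final in-tree form may
carry a longer comment, but the check reduces to whether FOLL_FORCE is set:

	static inline bool gup_can_follow_protnone(unsigned int flags)
	{
		/*
		 * FOLL_FORCE has to be able to make progress even on
		 * PROT_NONE-mapped (NUMA hinting) pages.
		 */
		return flags & FOLL_FORCE;
	}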

Link: https://lkml.kernel.org/r/20220825164659.89824-2-david@redhat.com
Link: https://lkml.kernel.org/r/20220825164659.89824-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomemcg: increase MEMCG_CHARGE_BATCH to 64
Shakeel Butt [Thu, 25 Aug 2022 00:05:06 +0000 (00:05 +0000)]
memcg: increase MEMCG_CHARGE_BATCH to 64

For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger
machines and network intensive workloads requiring throughput in Gbps,
32 is too small and makes the memcg charging path a bottleneck.  For now,
increase it to 64 for easy acceptance into 6.0.  We will need to revisit
this in the future for the ever increasing demand of higher performance.

Please note that the memcg charge path drains the per-cpu memcg charge
stock, so there should not be any oom behavior change.  Though it does
have an impact on rstat flushing and high limit reclaim backoff.
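
The change itself is a one-line bump of the constant; a sketch, assuming it
stays defined in include/linux/memcontrol.h:

	/* Bigger batch to reduce contention on the memcg charge path. */
	#define MEMCG_CHARGE_BATCH 64U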

To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)       10482.7 Mbps
With patch              17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: page_counter: rearrange struct page_counter fields
Shakeel Butt [Thu, 25 Aug 2022 00:05:05 +0000 (00:05 +0000)]
mm: page_counter: rearrange struct page_counter fields

With memcg v2 enabled, memcg->memory.usage is a very hot member for
workloads doing memcg charging on multiple CPUs concurrently, particularly
the network intensive workloads.  In addition, there is false cache
sharing between memory.usage and memory.high on the charge path.  This
patch moves the usage into a separate cacheline and moves all the
read-mostly fields into a separate cacheline.

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran
the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is the increase in the size of struct
mem_cgroup.  For example, with this patch on a 64 bit build, the size of
struct mem_cgroup increased from 4032 bytes to 4416 bytes.  However, the
additional size is worth it for the performance improvement.  In addition,
there are opportunities to reduce the size of struct mem_cgroup, like
deprecating the kmem and tcpmem page counters and better packing.
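
A hedged sketch of the layout idea only, not the actual struct page_counter
definition; the field set and names here are illustrative:

	struct counter_layout_example {
		/* hot: written by every concurrent charge/uncharge */
		atomic_long_t usage ____cacheline_internodealigned_in_smp;

		/* read-mostly limits, kept on a separate cacheline */
		unsigned long high ____cacheline_internodealigned_in_smp;
		unsigned long max;
		struct counter_layout_example *parent;
	};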

Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: page_counter: remove unneeded atomic ops for low/min
Shakeel Butt [Thu, 25 Aug 2022 00:05:04 +0000 (00:05 +0000)]
mm: page_counter: remove unneeded atomic ops for low/min

Patch series "memcg: optimize charge codepath", v2.

Recently the Linux networking stack has moved from a very old per-socket
pre-charge caching to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs.  One impact of this change is that for network
traffic workloads, the memcg charging codepath can become a bottleneck.
The kernel test robot has also reported this regression[1].  This patch
series tries to improve the memcg charging for such workloads.

This patch series implements three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
    between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72 CPUs machine, we
ran the following workload in root memcg and then compared with scenario
where the workload is run in a three level of cgroup hierarchy with top
level having min and low setup appropriately.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
1. root memcg 21694.8 Mbps
2. 6.0-rc1 10482.7 Mbps (-51.6%)
3. 6.0-rc1 + (A) 14542.5 Mbps (-32.9%)
4. 6.0-rc1 + (B) 12413.7 Mbps (-42.7%)
5. 6.0-rc1 + (C) 17063.7 Mbps (-21.3%)
6. 6.0-rc1 + (A+B+C) 20120.3 Mbps (-7.2%)

With all three optimizations, the memcg overhead of this workload has
been reduced from 51.6% to just 7.2%.

[1] https://lore.kernel.org/linux-mm/20220619150456.GB34471@xsang-OptiPlex-9020/

This patch (of 3):

For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally.  We can optimize out this atomic operation for one
specific scenario where the workload is using the protection (i.e.  min >
0) and the usage is above the protection (i.e.  usage > min).

This scenario is actually very common, where the users want a part of their
workload to be protected against external reclaim.  Though this
optimization does introduce a race when the usage is around the protection
and concurrent charges and uncharges trip it over or under the protection.
In such cases, we might see lower effective protection, but the subsequent
charge/uncharge will correct it.
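
A hedged sketch of the optimization in propagate_protected_usage(): read the
currently-tracked protected usage first and only fall back to the atomic
xchg() when it actually changes.  The variable names follow the existing
code, but the hunk below is illustrative:

	protected = min(usage, READ_ONCE(c->min));
	old_protected = atomic_long_read(&c->min_usage);
	if (protected != old_protected) {
		old_protected = atomic_long_xchg(&c->min_usage, protected);
		delta = protected - old_protected;
		if (delta)
			atomic_long_add(delta, &c->parent->children_min_usage);
	}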

To evaluate the impact of this optimization, on a 72 CPUs machine, we ran
the following workload in a three level of cgroup hierarchy with top level
having min and low setup appropriately to see if this optimization is
effective for the mentioned case.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1) 10482.7 Mbps
With patch 14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%

Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oliver Sang <oliver.sang@intel.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: use new vma_lock for pmd sharing synchronization
Mike Kravetz [Wed, 24 Aug 2022 17:57:57 +0000 (10:57 -0700)]
hugetlb: use new vma_lock for pmd sharing synchronization

The new hugetlb vma lock (rw semaphore) is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing, the lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.
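
A minimal sketch of that usage pattern, with illustrative lock helper names
(the final API of the series may differ); vma, mm, address, h and ptep
stand for the usual fault-path variables:

	/* fault path: read mode around page table setup and use of ptep */
	hugetlb_vma_lock_read(vma);
	ptep = huge_pte_alloc(mm, vma, address, huge_page_size(h));
	/* ... handle the fault using ptep ... */
	hugetlb_vma_unlock_read(vma);

	/* unmap path: write mode is required around huge_pmd_unshare() */
	hugetlb_vma_lock_write(vma);
	huge_pmd_unshare(mm, vma, address, ptep);
	hugetlb_vma_unlock_write(vma);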

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_one for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
and hole punch call remove_inode_hugepages.  remove_inode_hugepages
checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
  hugetlb_unmap_file_page is designed to drop locks and reacquire in the
  correct order to guarantee unmap success.

Link: https://lkml.kernel.org/r/20220824175757.20590-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: create hugetlb_unmap_file_folio to unmap single file folio
Mike Kravetz [Wed, 24 Aug 2022 17:57:56 +0000 (10:57 -0700)]
hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.

Link: https://lkml.kernel.org/r/20220824175757.20590-8-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: add vma based lock for pmd sharing
Mike Kravetz [Wed, 24 Aug 2022 17:57:55 +0000 (10:57 -0700)]
hugetlb: add vma based lock for pmd sharing

Allocate a rw semaphore and hang off vm_private_data for synchronization
use by vmas that could be involved in pmd sharing.  Only add
infrastructure for the new lock here.  Actual use will be added in
subsequent patch.
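
A hedged sketch of the infrastructure, assuming a small structure hung off
vm_private_data; the struct and field names are illustrative, and vma_lock
is assumed to have been allocated by the setup code:

	struct hugetlb_vma_lock {
		struct rw_semaphore rw_sema;
		struct vm_area_struct *vma;
	};

	/* attached when a hugetlb vma that could share pmds is set up */
	init_rwsem(&vma_lock->rw_sema);
	vma_lock->vma = vma;
	vma->vm_private_data = vma_lock;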

Link: https://lkml.kernel.org/r/20220824175757.20590-7-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: rename vma_shareable() and refactor code
Mike Kravetz [Wed, 24 Aug 2022 17:57:54 +0000 (10:57 -0700)]
hugetlb: rename vma_shareable() and refactor code

Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma.  Refactor code to check if an
aligned range is shareable as this will be needed in a subsequent patch.

Link: https://lkml.kernel.org/r/20220824175757.20590-6-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: fix/remove uninitialized variable in remove_inode_hugepages
Mike Kravetz [Fri, 26 Aug 2022 05:03:28 +0000 (22:03 -0700)]
hugetlb: fix/remove uninitialized variable in remove_inode_hugepages

Code introduced for the routine remove_inode_hugepages by patch "hugetlb:
handle truncate racing with page faults", incorrectly uses a variable
m_index.  This is a remnant from a previous version of the code when under
development.  Use the correct variable 'index' and remove 'm_index' from
the routine.

Link: https://lkml.kernel.org/r/Ywepr7C2X20ZvLdn@monkey
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: handle truncate racing with page faults
Mike Kravetz [Wed, 24 Aug 2022 17:57:53 +0000 (10:57 -0700)]
hugetlb: handle truncate racing with page faults

When page fault code needs to allocate and instantiate a new hugetlb page
(hugetlb_no_page), it checks early to determine if the fault is beyond
i_size.  When discovered early, it is easy to abort the fault and return
an error.  However, it becomes much more difficult to handle when
discovered later after allocating the page and consuming reservations and
adding to the page cache.  Backing out changes in such instances becomes
difficult and error prone.

Instead of trying to catch and backout all such races, use the hugetlb
fault mutex to handle truncate racing with page faults.  The most
significant change is modification of the routine remove_inode_hugepages
such that it will take the fault mutex for EVERY index in the truncated
range (or hole in the case of hole punch).  Since remove_inode_hugepages
is called in the truncate path after updating i_size, we can experience
races as follows:

- truncate code updates i_size and takes fault mutex before a racing
  fault.  After fault code takes mutex, it will notice fault beyond
  i_size and abort early.
- fault code obtains mutex, and truncate updates i_size after early
  checks in fault code.  fault code will add page beyond i_size.
  When truncate code takes mutex for page/index, it will remove the
  page.
- truncate updates i_size, but fault code obtains mutex first.  If
  fault code sees updated i_size it will abort early.  If fault code
  does not see updated i_size, it will add page beyond i_size and
  truncate code will remove page when it obtains fault mutex.

Note, for performance reasons remove_inode_hugepages will still use
filemap_get_folios for bulk folio lookups.  For indices not returned in
the bulk lookup, it will need to lookup individual folios to check for
races with page fault.
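
A minimal sketch of the per-index serialization in remove_inode_hugepages(),
reusing the existing hugetlb fault mutex table; illustrative rather than the
literal fs/hugetlbfs/inode.c hunk, with mapping, index and page standing for
the routine's locals:

	hash = hugetlb_fault_mutex_hash(mapping, index);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	/*
	 * A page added beyond i_size by a racing fault is found and
	 * removed here; a fault running after us sees the updated i_size
	 * under the same mutex and aborts early.
	 */
	hugetlb_delete_from_page_cache(page);
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);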

Link: https://lkml.kernel.org/r/20220824175757.20590-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
Mike Kravetz [Wed, 24 Aug 2022 17:57:52 +0000 (10:57 -0700)]
hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache

remove_huge_page removes a hugetlb page from the page cache.  Change to
hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

Link: https://lkml.kernel.org/r/20220824175757.20590-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
Mike Kravetz [Wed, 24 Aug 2022 17:57:51 +0000 (10:57 -0700)]
hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user.  It was
discovered via code inspection when initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Link: https://lkml.kernel.org/r/20220824175757.20590-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agohugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
Mike Kravetz [Wed, 24 Aug 2022 17:57:50 +0000 (10:57 -0700)]
hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race

Patch series "hugetlb: Use new vma mutex for huge pmd sharing synchronization".

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and benefit of this patch series is not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

48 sample Avg
-----------------------------------------------------------------------------
next-20220822 unmodified:                     494229 KB/s
next-20220822 revert i_mmap_sema locking:     495375 KB/s
next-20220822 vma sema locking, this series:  495573 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  For comparison, a single instance of the program
was run.  Then, multiple instances were run in parallel to introduce
lock contention.  Changing the locking scheme results in a significant
performance benefit.

test             instances  unmodified  revert    vma
--------------------------------------------------------------------------
faults per sec        1        397068    403411   394935
faults per sec       24         68322     83023    82436
forks per sec         1          2717      2862     2816
forks per sec        24           404       465      499
Combined faults      24          1528     69090    59544
Combined forks       24           337        66      140

Combined test is when running both faulting program and forking program
simultaneously.

Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5 which
depends on c0d0381ade79.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade79 reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39bb5 exposes races in page fault/file truncation.
Patches 3 and 4 of this series address those races.  This requires
using the hugetlb fault mutexes for more coordination between the fault
code and file page removal.

Patches 5 - 7 add infrastructure for a new vma based rw semaphore that
will be used for pmd sharing synchronization.  The idea is that this
semaphore will be held in read mode for the duration of fault processing,
and held in write mode for unmap operations which may call huge_pmd_unshare.
Acquiring i_mmap_rwsem is also still required to synchronize huge pmd
sharing.  However it is only required in the fault path when setting up
sharing, and will be acquired in huge_pmd_share().

Patch 8 makes use of this new vma lock.  Unfortunately, the fault code
and truncate/hole punch code would naturally take locks in the opposite
order which could lead to deadlock.  Since the performance of page faults
is more important, the truncation/hole punch code is modified to back
out and take locks in the correct order if necessary.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

This patch (of 8):

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Link: https://lkml.kernel.org/r/20220824175757.20590-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220824175757.20590-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: fix the handling Non-LRU pages returned by follow_page
Haiyue Wang [Tue, 23 Aug 2022 13:58:41 +0000 (21:58 +0800)]
mm: fix the handling Non-LRU pages returned by follow_page

The handling of non-LRU pages returned by follow_page() jumps directly,
without calling put_page() to drop the reference count, even though the
'FOLL_GET' flag for follow_page() means get_page() has been called.  Fix
the zone device page check by handling the page reference count correctly
before returning.

And as David reviewed, "device pages are never PageKsm pages".  Drop this
zone device page check for break_ksm().

Since a zone device page can't be a transparent huge page, drop the
redundant zone device page check for split_huge_pages_pid().  (by Miaohe)

Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com
Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: remove EXPERIMENTAL flag for zswap
David Heidelberg [Tue, 23 Aug 2022 15:20:33 +0000 (17:20 +0200)]
mm: remove EXPERIMENTAL flag for zswap

zswap has been with us since 2013, and it's widely used in many products.

Link: https://lkml.kernel.org/r/20220823152033.66682-1-david@ixit.cz
Signed-off-by: David Heidelberg <david@ixit.cz>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agodrivers/block/zram/zram_drv.c: do not keep dangling zcomp pointer after zram reset
Sergey Senozhatsky [Wed, 24 Aug 2022 03:51:00 +0000 (12:51 +0900)]
drivers/block/zram/zram_drv.c: do not keep dangling zcomp pointer after zram reset

We do all reset operations under write lock, so we don't need to save
->disksize and ->comp to stack variables.  Another issue is that ->comp is
freed during zram reset, but the comp pointer is not NULLed, so zram keeps
a dangling pointer to the freed object.
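
A minimal sketch of the fix, assuming it lands in zram_reset_device():

	zram->disksize = 0;
	zcomp_destroy(zram->comp);
	/* don't keep a dangling pointer to the freed compressor */
	zram->comp = NULL;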

Link: https://lkml.kernel.org/r/20220824035100.971816-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm-gupc-refactor-check_and_migrate_movable_pages-fix
Andrew Morton [Wed, 24 Aug 2022 21:10:22 +0000 (14:10 -0700)]
mm-gupc-refactor-check_and_migrate_movable_pages-fix

mm/gup.c: In function 'check_and_migrate_movable_pages':
mm/gup.c:2045:13: error: unused variable 'ret' [-Werror=unused-variable]
 2045 |         int ret;
      |             ^~~

Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/gup.c: refactor check_and_migrate_movable_pages()
Alistair Popple [Wed, 24 Aug 2022 05:09:52 +0000 (15:09 +1000)]
mm/gup.c: refactor check_and_migrate_movable_pages()

When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.

When migration succeeds all pages will have been unpinned so pinning needs
to be retried.  Migration can also fail, in which case the pages will also
have been unpinned but the operation should not be retried.  If all pages
are in the correct zone nothing will be unpinned and no retry is required.

The logic in check_and_migrate_movable_pages() tracks unnecessary state
and the return codes for each case are difficult to follow.  Refactor the
code to clean this up.  No behaviour change is intended.

Link: https://lkml.kernel.org/r/19583d1df07fdcb99cfa05c265588a3fa58d1902.1661317396.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages()
Alistair Popple [Wed, 24 Aug 2022 05:09:51 +0000 (15:09 +1000)]
mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages()

gup_flags is passed to check_and_migrate_movable_pages() so that it can
call either put_page() or unpin_user_page() to drop the page reference.
However check_and_migrate_movable_pages() is only called for
FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass
gup_flags.

Link: https://lkml.kernel.org/r/d611c65a9008ff55887307df457c6c2220ad6163.1661317396.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: Skip retry when new limit is not below old one in page_counter_set_max
Bui Quang Minh [Sun, 21 Aug 2022 15:40:55 +0000 (22:40 +0700)]
mm: Skip retry when new limit is not below old one in page_counter_set_max

In page_counter_set_max, we want to make sure the new limit is not below
the concurrently-changing counter value.  We read the counter and check
that the limit is not below the counter before the swap.  After the swap,
we read the counter again and retry in case the counter was incremented, as
this may violate the requirement.  Even though page_counter_try_charge()
can see the old limit, it is guaranteed that the counter is not above the
old limit after the increment.  So in case the new limit is not below the
old limit, the counter is guaranteed to be not above the new limit too.
We can skip the retry in this case to optimize a little bit.
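
A hedged sketch of the resulting check in page_counter_set_max(); the exact
form may differ, but the addition is the "new limit not below the old one"
short-circuit.  counter, usage, old and nr_pages are the function's existing
locals and arguments:

	old = xchg(&counter->max, nr_pages);

	if (page_counter_read(counter) <= usage || nr_pages >= old)
		return 0;	/* no retry needed */

	counter->max = old;	/* otherwise restore the old limit and retry */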

Link: https://lkml.kernel.org/r/20220821154055.109635-1-minhquangbui99@gmail.com
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: add back missing variable initializations
Rolf Eike Beer [Wed, 24 Aug 2022 11:00:11 +0000 (13:00 +0200)]
mm: pagewalk: add back missing variable initializations

These initializations accidentally got lost during refactoring.

The first one can't actually be used without initialization, because
walk_p4d_range() is only called when one of the 4 callbacks is set, but relying
on this seems fragile.

Link: https://lkml.kernel.org/r/2123960.ggj6I0NvhH@mobilepool36.emlix.com
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: move variables to more local scope, tweak loops
Rolf Eike Beer [Mon, 22 Aug 2022 13:04:12 +0000 (15:04 +0200)]
mm: pagewalk: move variables to more local scope, tweak loops

Move some variables to more local scopes to make it obvious that they
don't carry state.  Put the end additions into the for loop instructions
to make them easier to read.

Link: https://lkml.kernel.org/r/8155217.NyiUUSuA9g@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: allow walk_page_range_novma() without mm
Rolf Eike Beer [Mon, 22 Aug 2022 13:03:29 +0000 (15:03 +0200)]
mm: pagewalk: allow walk_page_range_novma() without mm

Since e47690d756a7 ("x86: mm: avoid allocating struct mm_struct on the
stack") a pgd can be passed to walk_page_range_novma().  In case it is set
no place in the pagewalk code use the walk.mm anymore, so permit to pass a
NULL mm instead.  It is up to the caller to ensure proper locking on the
pgd in this case.
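
A minimal usage sketch under that rule (start, end, ops and pgd are
placeholders supplied by the caller):

	/* with an explicit pgd, no mm is needed; the caller locks the pgd */
	err = walk_page_range_novma(NULL, start, end, &ops, pgd, NULL);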

Link: https://lkml.kernel.org/r/5760214.MhkbZ0Pkbq@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: add api documentation for walk_page_range_novma()
Rolf Eike Beer [Mon, 22 Aug 2022 13:02:36 +0000 (15:02 +0200)]
mm: pagewalk: add api documentation for walk_page_range_novma()

Link: https://lkml.kernel.org/r/8991525.CDJkKcVGEf@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: fix documentation of PTE hole handling
Rolf Eike Beer [Mon, 22 Aug 2022 13:01:32 +0000 (15:01 +0200)]
mm: pagewalk: fix documentation of PTE hole handling

Empty PTEs are passed to the pte_entry callback, not to pte_hole.

Link: https://lkml.kernel.org/r/3695521.kQq0lBPeGt@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: don't check vma in walk_page_range_novma()
Rolf Eike Beer [Mon, 22 Aug 2022 13:00:47 +0000 (15:00 +0200)]
mm: pagewalk: don't check vma in walk_page_range_novma()

Directly call walk_pgd_range() as that is everything that will actually
happen in __walk_page_range() besides checking if the vma is set.

Link: https://lkml.kernel.org/r/21585888.EfDdHjke4D@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: pagewalk: make error checks more obvious
Rolf Eike Beer [Mon, 22 Aug 2022 13:00:05 +0000 (15:00 +0200)]
mm: pagewalk: make error checks more obvious

Patch series "Minor improvements for pagewalk code".

For a project I had to use the pagewalk API for certain things, and in the
process I have read through the code multiple times.  Our usage has also
changed several times depending on our current state of research.

During all of this I have made some tweaks to the code to be able to
follow it better when hunting down my own problems, and to avoid calling
into things that I don't actually need.  The patches are more or less
independent of each other.

This patch (of 6):

The err variable only needs to be checked when it was assigned directly
before; it is not carried on to any later checks.  Move the checks into
the same "if" conditions where the assignments happen, and simply return
the error at the relevant places.  While at it, move these err variables
to a more local scope in some places.
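
For illustration, the shape of the change (a simplified sketch, not the
actual pagewalk code):

    /* Before: err is assigned in one branch and checked afterwards. */
    err = 0;
    if (ops->pmd_entry)
        err = ops->pmd_entry(pmd, addr, next, walk);
    if (err)
        break;

    /* After: the check lives in the same condition as the assignment,
     * with err in the most local scope that needs it. */
    if (ops->pmd_entry) {
        int err = ops->pmd_entry(pmd, addr, next, walk);

        if (err)
            return err;
    }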

Link: https://lkml.kernel.org/r/3200642.44csPzL39Z@devpool047
Link: https://lkml.kernel.org/r/2203731.iZASKD2KPV@devpool047
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm-add-merging-after-mremap-resize-checkpatch-fixes
Andrew Morton [Fri, 3 Jun 2022 17:41:30 +0000 (10:41 -0700)]
mm-add-merging-after-mremap-resize-checkpatch-fixes

WARNING: line length of 108 exceeds 100 columns
#97: FILE: tools/testing/selftests/vm/mremap_test.c:136:
+ char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

WARNING: Missing a blank line after declarations
#98: FILE: tools/testing/selftests/vm/mremap_test.c:137:
+ char *start = mmap(NULL, 3 * page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ munmap(start + page_size, page_size);

ERROR: space required before the open parenthesis '('
#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+ while(getline(&line, &len, fp) != -1) {

ERROR: space required after that ',' (ctx:VxV)
#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+ char *first = strtok(line,"- ");
                           ^

ERROR: space required after that ',' (ctx:VxV)
#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+ char *second = strtok(NULL,"- ");
                            ^

WARNING: Missing a blank line after declarations
#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+ void *second_val = (void *) strtol(second, NULL, 16);
+ if (first_val == start && second_val == start + 3 * page_size) {

total: 3 errors, 3 warnings, 113 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
      mechanically convert to the typical style using --fix or --fix-inplace.

./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
      them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Jakub Matěna <matenajakub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: add merging after mremap resize
Jakub Matěna [Fri, 3 Jun 2022 14:57:19 +0000 (16:57 +0200)]
mm: add merging after mremap resize

When an mremap call results in expansion, it might be possible to merge
the VMA with the next VMA, which might become adjacent.  This patch adds a
vma_merge() call after the expansion is done to try and merge.
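
For illustration, a small userspace program exercising the scenario (the
rw-p filter and output are only for the example; error handling omitted):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        char *start = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char line[256];
        FILE *fp;

        munmap(start + page, page);        /* punch a one-page hole         */
        mremap(start, page, 2 * page, 0);  /* expand in place into the hole */

        /* With merging after mremap, one VMA should now span all three
         * pages; without it, two adjacent anonymous VMAs remain. */
        fp = fopen("/proc/self/maps", "r");
        while (fgets(line, sizeof(line), fp))
            if (strstr(line, "rw-p"))
                fputs(line, stdout);
        fclose(fp);
        return 0;
    }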

Link: https://lkml.kernel.org/r/20220603145719.1012094-3-matenajakub@gmail.com
Signed-off-by: Jakub Matěna <matenajakub@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: refactor of vma_merge()
Jakub Matěna [Fri, 3 Jun 2022 14:57:18 +0000 (16:57 +0200)]
mm: refactor of vma_merge()

Patch series "Refactor of vma_merge and new merge call", v4.

I am currently working on my master's thesis, trying to increase the
number of VMA merges that currently fail because of page offset
incompatibility and differences in their anon_vmas.  The refactor and the
added merge call in this series are just two smaller upgrades I created
along the way.

This patch (of 2):

Refactor vma_merge() to make it shorter and more understandable.  The
main change is the elimination of code duplication in the merge-next
check.  This is done by first doing the checks and caching the results
before executing the merge itself.  The variable 'area' is split into
'mid' and 'res', as it was previously used for two purposes: as the middle
VMA between prev and next, and as the result of the merge itself.  Exit
paths are also unified.

Link: https://lkml.kernel.org/r/20220603145719.1012094-1-matenajakub@gmail.com
Link: https://lkml.kernel.org/r/20220603145719.1012094-2-matenajakub@gmail.com
Signed-off-by: Jakub Matěna <matenajakub@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Rik van Riel <riel@surriel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm-delete-unused-mmf_oom_victim-flag-fix
Andrew Morton [Mon, 22 Aug 2022 22:11:53 +0000 (15:11 -0700)]
mm-delete-unused-mmf_oom_victim-flag-fix

remove MGLRU's mm_is_oom_victim() calls from vmscan.c

Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: delete unused MMF_OOM_VICTIM flag
Suren Baghdasaryan [Tue, 31 May 2022 22:31:00 +0000 (15:31 -0700)]
mm: delete unused MMF_OOM_VICTIM flag

With the last usage of MMF_OOM_VICTIM in exit_mmap gone, this flag is now
unused and can be removed.

Link: https://lkml.kernel.org/r/20220531223100.510392-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm-drop-oom-code-from-exit_mmap-fix-fix
Andrew Morton [Wed, 1 Jun 2022 22:17:52 +0000 (15:17 -0700)]
mm-drop-oom-code-from-exit_mmap-fix-fix

restore Suren's mmap_read_lock() optimization

Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: drop oom code from exit_mmap
Suren Baghdasaryan [Tue, 31 May 2022 22:30:59 +0000 (15:30 -0700)]
mm: drop oom code from exit_mmap

The primary reason to invoke the oom reaper from the exit_mmap path used
to be a prevention of an excessive oom killing if the oom victim exit
races with the oom reaper (see [1] for more details).  The invocation has
moved around since then because of the interaction with the munlock logic
but the underlying reason has remained the same (see [2]).

Munlock code is no longer a problem since [3] and there shouldn't be any
blocking operation before the memory is unmapped by exit_mmap so the oom
reaper invocation can be dropped.  The unmapping part can be done with the
non-exclusive mmap_sem and the exclusive one is only required when page
tables are freed.

Remove the oom_reaper from exit_mmap, which will make the code easier to
read.  This is really unlikely to make any observable difference, although
some microbenchmarks could benefit from one less branch that needs to be
evaluated even though it is almost never true.

[1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")

Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mlock: drop dead code in count_mm_mlocked_page_nr()
Liam Howlett [Wed, 15 Jun 2022 17:40:58 +0000 (17:40 +0000)]
mm/mlock: drop dead code in count_mm_mlocked_page_nr()

The check for mm being null has never been needed since the only caller
has always passed in current->mm.  Remove the check from
count_mm_mlocked_page_nr().

Link: https://lkml.kernel.org/r/20220615174050.738523-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mmap.c: pass in mapping to __vma_link_file()
Liam R. Howlett [Mon, 22 Aug 2022 15:06:34 +0000 (15:06 +0000)]
mm/mmap.c: pass in mapping to __vma_link_file()

__vma_link_file() resolves the mapping from the file, if there is one.
Pass the mapping through and check vm_file externally, since most callers
already have the required information and already check vm_file.

Link: https://lkml.kernel.org/r/20220822150128.1562046-71-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mmap: drop range_has_overlap() function
Liam R. Howlett [Mon, 22 Aug 2022 15:06:34 +0000 (15:06 +0000)]
mm/mmap: drop range_has_overlap() function

Since there is no longer a linked list, the range_has_overlap() function
is identical to the find_vma_intersection() function.

Link: https://lkml.kernel.org/r/20220822150128.1562046-70-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: fix one kernel-doc comment
Yang Li [Wed, 24 Aug 2022 02:19:18 +0000 (10:19 +0800)]
mm: fix one kernel-doc comment

Add the description for @mt.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1949
Link: https://lkml.kernel.org/r/20220824021918.94116-1-yang.lee@linux.alibaba.com
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm: remove the vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:33 +0000 (15:06 +0000)]
mm: remove the vma linked list

Replace any vm_next use with vma_find().

Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
maple tree.

Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
the same time, alter the loop to be more compact.

Now that free_pgtables() and unmap_vmas() take a maple tree as an
argument, rearrange do_mas_align_munmap() to use the new tree to hold the
vmas to remove.

Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
used to update the linked list.

Drop linked list update from __insert_vm_struct().

Rework validation of tree as it was depending on the linked list.
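
For reference, the iteration pattern that replaces the vm_next walks
throughout the series looks roughly like this (a sketch using the VMA
iterator introduced earlier in the series; locking omitted):

    struct vm_area_struct *vma;
    VMA_ITERATOR(vmi, mm, 0);

    for_each_vma(vmi, vma) {
        /* operate on vma; there is no vma->vm_next to follow */
    }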

Link: https://lkml.kernel.org/r/20220822150128.1562046-69-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/vmscan: Use vma iterator instead of vm_next
Liam Howlett [Mon, 22 Aug 2022 15:06:33 +0000 (15:06 +0000)]
mm/vmscan: Use vma iterator instead of vm_next

Use the vma iterator in get_next_vma() instead of the linked list.

Link: https://lkml.kernel.org/r/20220822150128.1562046-68-Liam.Howlett@oracle.com
Suggested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agoriscv: use vma iterator for vdso
Liam R. Howlett [Mon, 22 Aug 2022 15:06:33 +0000 (15:06 +0000)]
riscv: use vma iterator for vdso

Remove the linked list use in favour of the vma iterator.

Link: https://lkml.kernel.org/r/20220822150128.1562046-67-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/nommu: fix error handling in split_vma()
Yang Yingliang [Wed, 24 Aug 2022 04:24:24 +0000 (12:24 +0800)]
mm/nommu: fix error handling in split_vma()

The memory allocated before calling mas_preallocate() is leaked if it
fails.  'mas' won't be modified until mas_preallocate() is called, so move
it up and add an error label to free the memory.

Link: https://lkml.kernel.org/r/20220824042424.2031508-1-yangyingliang@huawei.com
Fixes: 8aff7dbeaeb1 ("nommu: remove uses of VMA linked list")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agonommu: remove uses of VMA linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:32 +0000 (15:06 +0000)]
nommu: remove uses of VMA linked list

Use the maple tree or VMA iterator instead.  This is faster and will allow
us to shrink the VMA.

Link: https://lkml.kernel.org/r/20220822150128.1562046-66-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agoi915: use the VMA iterator
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:32 +0000 (15:06 +0000)]
i915: use the VMA iterator

Replace the linked list in probe_range() with the VMA iterator.

Link: https://lkml.kernel.org/r/20220822150128.1562046-65-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/swapfile: use vma iterator instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:32 +0000 (15:06 +0000)]
mm/swapfile: use vma iterator instead of vma linked list

unuse_mm() no longer needs to reference the linked list.

Link: https://lkml.kernel.org/r/20220822150128.1562046-64-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/pagewalk: use vma_find() instead of vma linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:31 +0000 (15:06 +0000)]
mm/pagewalk: use vma_find() instead of vma linked list

walk_page_range() no longer uses the one vma linked list reference.

Link: https://lkml.kernel.org/r/20220822150128.1562046-63-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/oom_kill: use maple tree iterators instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:31 +0000 (15:06 +0000)]
mm/oom_kill: use maple tree iterators instead of vma linked list

Link: https://lkml.kernel.org/r/20220822150128.1562046-62-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/msync: use vma_find() instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:31 +0000 (15:06 +0000)]
mm/msync: use vma_find() instead of vma linked list

Link: https://lkml.kernel.org/r/20220822150128.1562046-61-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mremap: use vma_find_intersection() instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:31 +0000 (15:06 +0000)]
mm/mremap: use vma_find_intersection() instead of vma linked list

Link: https://lkml.kernel.org/r/20220822150128.1562046-60-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mprotect: fix maple tree start address in do_mprotect_pkey()
Liam Howlett [Thu, 25 Aug 2022 20:30:24 +0000 (20:30 +0000)]
mm/mprotect: fix maple tree start address in do_mprotect_pkey()

Prior to my change to use the maple tree, the start address was changed
before calling find_vma() with the untagged_addr() version of start.  My
first change recorded the tagged address and searched on the incorrect
start location - which would have found the incorrect VMA.  This fix will
use the untagged_addr() as the start of the search as it was before I
changed the code at all.

Any penalty from calling untagged_addr() was incurred regardless of the
version that was used.  The maple tree search would also have occurred in
both versions - just at the wrong location before this fix.  I expect the
execution time to be equal, as the search on the tagged address would have
returned either a VMA at start or the VMA in the next slot of the maple
tree node - probably immeasurably slower, since the data is very likely
already in the CPU cache, but I don't have hard data to say either way.  I
can look into a benchmark to measure the difference between the two
working versions, but I don't have a native arm64 target so it will be
emulated.

Use the untagged_addr() instead of the address passed into the function.
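
In other words, the lookup is keyed on the untagged address again, roughly
like the sketch below.  Whether the lookup in do_mprotect_pkey() is
find_vma() or a maple-state walk is glossed over here; the point is only
the order of operations.

    start = untagged_addr(start);          /* strip any tag bits first     */
    vma   = find_vma(current->mm, start);  /* key the tree search on start */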

Link: https://lkml.kernel.org/r/20220825202939.3041660-1-Liam.Howlett@oracle.com
Fixes: 3338b715d25d ("mm/mprotect: use maple tree navigation instead of vma linked list")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mprotect: use maple tree navigation instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:30 +0000 (15:06 +0000)]
mm/mprotect: use maple tree navigation instead of vma linked list

Link: https://lkml.kernel.org/r/20220822150128.1562046-59-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mlock: use vma iterator and maple state instead of vma linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:30 +0000 (15:06 +0000)]
mm/mlock: use vma iterator and maple state instead of vma linked list

Handle overflow checking in count_mm_mlocked_page_nr() differently.

Link: https://lkml.kernel.org/r/20220822150128.1562046-58-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/mempolicy: use vma iterator & maple state instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:30 +0000 (15:06 +0000)]
mm/mempolicy: use vma iterator & maple state instead of vma linked list

Reworked the way mbind_range() finds the first VMA to reuse the maple
state and limit the number of tree walks needed.

Note, this drops the VM_BUG_ON(!vma) call, which would catch a start
address higher than the last VMA.  The code was written in a way that
allowed no VMA updates to occur and still return success.  There should be
no functional change to this scenario with the new code.

Link: https://lkml.kernel.org/r/20220822150128.1562046-57-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/memcontrol: stop using mm->highest_vm_end
Liam R. Howlett [Mon, 22 Aug 2022 15:06:29 +0000 (15:06 +0000)]
mm/memcontrol: stop using mm->highest_vm_end

Pass through ULONG_MAX instead.

Link: https://lkml.kernel.org/r/20220822150128.1562046-56-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/madvise: use vma_find() instead of vma linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:29 +0000 (15:06 +0000)]
mm/madvise: use vma_find() instead of vma linked list

madvise_walk_vmas() no longer uses linked list.

Link: https://lkml.kernel.org/r/20220822150128.1562046-55-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/ksm: use vma iterators instead of vma linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:29 +0000 (15:06 +0000)]
mm/ksm: use vma iterators instead of vma linked list

Remove the use of the linked list for eventual removal.

Link: https://lkml.kernel.org/r/20220822150128.1562046-54-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/khugepaged: stop using vma linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:28 +0000 (15:06 +0000)]
mm/khugepaged: stop using vma linked list

Use vma iterator & find_vma() instead of vma linked list.

Link: https://lkml.kernel.org/r/20220822150128.1562046-53-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agomm/gup: use maple tree navigation instead of linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:28 +0000 (15:06 +0000)]
mm/gup: use maple tree navigation instead of linked list

Use find_vma_intersection() to locate the VMAs in __mm_populate() instead
of using find_vma() and the linked list.
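
Roughly, the population loop becomes (a simplified sketch of the pattern,
not the exact __mm_populate() code):

    unsigned long nstart, nend;
    struct vm_area_struct *vma;

    for (nstart = start; nstart < end; nstart = nend) {
        /* First VMA overlapping [nstart, end), or NULL if none. */
        vma = find_vma_intersection(mm, nstart, end);
        if (!vma)
            break;
        nend = min(end, vma->vm_end);
        /* populate/mlock the range up to nend here */
    }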

Link: https://lkml.kernel.org/r/20220822150128.1562046-52-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agobpf: remove VMA linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:28 +0000 (15:06 +0000)]
bpf: remove VMA linked list

Use vma_next() and remove the reference to the start of the linked list.

Link: https://lkml.kernel.org/r/20220822150128.1562046-51-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agofork: use VMA iterator
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:27 +0000 (15:06 +0000)]
fork: use VMA iterator

The VMA iterator is faster than the linked list and removing the linked
list will shrink the vm_area_struct.

Link: https://lkml.kernel.org/r/20220822150128.1562046-50-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agosched: use maple tree iterator to walk VMAs
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:27 +0000 (15:06 +0000)]
sched: use maple tree iterator to walk VMAs

The linked list is slower than walking the VMAs using the maple tree.  We
can't use the VMA iterator here because it doesn't support moving to an
earlier position.

Link: https://lkml.kernel.org/r/20220822150128.1562046-49-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agoperf: use VMA iterator
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:27 +0000 (15:06 +0000)]
perf: use VMA iterator

The VMA iterator is faster than the linked list and removing the linked
list will shrink the vm_area_struct.

Link: https://lkml.kernel.org/r/20220822150128.1562046-48-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agoacct: use VMA iterator instead of linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:26 +0000 (15:06 +0000)]
acct: use VMA iterator instead of linked list

The VMA iterator is faster than the linked list.

Link: https://lkml.kernel.org/r/20220822150128.1562046-47-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agoipc/shm: use VMA iterator instead of linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:26 +0000 (15:06 +0000)]
ipc/shm: use VMA iterator instead of linked list

The VMA iterator is faster than the linked list, and it can be walked
even when VMAs are being removed from the address space, so there's no
need to keep track of 'next'.

Link: https://lkml.kernel.org/r/20220822150128.1562046-46-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agouserfaultfd: use maple tree iterator to iterate VMAs
Liam R. Howlett [Mon, 22 Aug 2022 15:06:26 +0000 (15:06 +0000)]
userfaultfd: use maple tree iterator to iterate VMAs

Don't use the mm_struct linked list or vma->vm_next, in preparation for
their removal.

Link: https://lkml.kernel.org/r/20220822150128.1562046-45-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agofs/proc/task_mmu: stop using linked list and highest_vm_end
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:25 +0000 (15:06 +0000)]
fs/proc/task_mmu: stop using linked list and highest_vm_end

Remove references to the mm_struct linked list and highest_vm_end in
preparation for their removal.

Link: https://lkml.kernel.org/r/20220822150128.1562046-44-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agofs/proc/base: use maple tree iterators in place of linked list
Liam R. Howlett [Mon, 22 Aug 2022 15:06:25 +0000 (15:06 +0000)]
fs/proc/base: use maple tree iterators in place of linked list

Use mas_for_each to iterate the list of VMAs instead of a for loop
across the linked list.
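
The mas_for_each() pattern used here looks roughly like the sketch below
(locking and the per-VMA work are omitted):

    struct vm_area_struct *vma;
    MA_STATE(mas, &mm->mm_mt, 0, 0);

    mas_for_each(&mas, vma, ULONG_MAX) {
        /* visit each vma in address order */
    }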

Link: https://lkml.kernel.org/r/20220822150128.1562046-43-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agoexec: use VMA iterator instead of linked list
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:25 +0000 (15:06 +0000)]
exec: use VMA iterator instead of linked list

Remove a use of the vm_next list by doing the initial lookup with the VMA
iterator and then using it to find the next entry.

Link: https://lkml.kernel.org/r/20220822150128.1562046-42-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2 years agocoredump: remove vma linked list walk
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:25 +0000 (15:06 +0000)]
coredump: remove vma linked list walk

Use the Maple Tree iterator instead.  This is too complicated for the VMA
iterator to handle, so let's open-code it for now.  If this turns out to
be a common pattern, we can migrate it to common code.

Link: https://lkml.kernel.org/r/20220822150128.1562046-41-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>