Liam R. Howlett [Wed, 24 Aug 2022 19:37:56 +0000 (15:37 -0400)]
mm/mmap: Combine multiple if statements in vma_merge()
Currently vma_merge() searches from 0 upwards if there is no prev vma,
or from prev->vm_end if there is a prev vma. The check for merging with
prev also checks if prev exists. The ordering is not important at this
stage, so move the merging check above finding the next vma so that prev
can be checked only once.
Also start searching from vma_start as opposed to 0. If there is no
previous vma, then there won't be a vma before vma_start. The code reads
more clearly this way.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Fri, 19 Aug 2022 17:40:21 +0000 (13:40 -0400)]
mm/mmap: Remove vma_mas_szero() helper
Remove the helper to zero out a portion of a VMA. It is only called
from one function and putting the logic in that function allows for a
WARN_ON() to be added to check for a logic error.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Thu, 11 Aug 2022 20:19:43 +0000 (16:19 -0400)]
mm/mmap: Remove __vma_adjust()
Inline the work of __vma_adjust() into vma_merge(). This reduces code
size and has the added benefits of the comments for the cases being
located with the code.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 10 Aug 2022 21:24:05 +0000 (17:24 -0400)]
mm/mmap: Don't use __vma_adjust() in shift_arg_pages()
Introduce shrink_vma() which uses the lock_vma() and unlock_vma()
functions to reduce the vma coverage.
Convert shift_arg_pages() to use expand_vma() and the new shrink_vma()
function. Remove shrink_vma() support from __vma_adjust() since
shift_arg_pages() is the only user that shrinks a VMA in this way.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 10 Aug 2022 20:09:15 +0000 (16:09 -0400)]
mm: Don't use __vma_adjust() in __split_vma()
Use the abstracted locking and maple tree operations. Since
__split_vma() is the only user of the __vma_adjust() function to use the
insert argument, drop that argument. Remove the NULL passed through
from fs/exec's shift_arg_pages() at the same time.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 30 Mar 2022 17:35:49 +0000 (13:35 -0400)]
mm: Change munmap splitting order and move_vma()
Splitting can be more efficient when done in the reverse order to
minimize VMA walking. Change do_mas_align_munmap() to reduce walking of
the tree during split operations.
move_vma() must also be altered to remove the dependency of keeping the
original VMA as the active part of the split. Look up the new VMA or
two if necessary.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Fri, 5 Aug 2022 18:02:20 +0000 (14:02 -0400)]
maple_tree: Reduce loops in mt_validate()
mt_validate() takes too long when KASAN and memory poisoning are enabled.
This delays the RCU free operation, which causes issues when running the
LTP test msgstress03. Change the validation to loop over each node only
once, as opposed to many times for each test. Also walk the tree only once
by keeping track of whether the last leaf had a NULL at a higher level.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 3 Aug 2022 17:22:51 +0000 (13:22 -0400)]
test_maple_tree: 32 bit testing support
Add support for running the maple tree tests in a 32-bit environment.
This disables a number of tests that store values above the 32-bit
limit, but still tests a lot of the functionality.
Add detection of the 32/64-bit environment to the makefile and the
generated headers.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
In commit 2f1ee0913ce5 ("Revert "mm: use early_pfn_to_nid in
page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
avoid a panic. It seems that we cannot track early page allocations in the
current kernel even if the page structures have been initialized early.
This patch introduces a new boot parameter 'early_page_ext' to resolve
this problem. If it is passed to the kernel, page_ext_init() is moved up
and the 'deferred initialization of struct pages' feature is disabled, so
that the page allocator is initialized early and the panic above is
avoided. This helps us catch early page allocations, which is especially
useful when we find that the free memory value differs from one kernel
boot to another.
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gerald Schaefer [Fri, 26 Aug 2022 05:03:31 +0000 (22:03 -0700)]
s390/hugetlb: switch to generic version of follow_huge_pud()
When pud-sized hugepages were introduced for s390, the generic version of
follow_huge_pud() was using pte_page() instead of pud_page(). This would
be wrong for s390, see also commit 97534127012f ("mm/hugetlb: use
pmd_page() in follow_huge_pmd()"). Therefore, and probably because not
all archs were supporting pud_page() at that time, a private version of
follow_huge_pud() was added for s390, correctly using pud_page().
Since commit 3a194f3f8ad01 ("mm/hugetlb: make pud_huge() and
follow_huge_pud() aware of non-present pud entry"), the generic version of
follow_huge_pud() is now also using pud_page(), and in general behaves
similar to follow_huge_pmd().
Therefore we can now switch to the generic version and get rid of the
s390-specific follow_huge_pud().
Link: https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Haiyue Wang <haiyue.wang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liu Shixin [Thu, 25 Aug 2022 14:20:36 +0000 (22:20 +0800)]
mm/zswap: delay the initializaton of zswap until the first enablement
During the initialization of zswap, about 18MB of memory is allocated for
zswap_pool on my machine. Since not all users use zswap, this memory may
be wasted. Save the memory for these users by delaying the initialization
of zswap until first enablement.
Link: https://lkml.kernel.org/r/20220825142037.3214152-3-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liu Shixin [Thu, 25 Aug 2022 14:20:35 +0000 (22:20 +0800)]
mm/zswap: replace zswap_init_{started/failed} with zswap_init_state
Patch series "Delay the initializaton of zswap".
During the initialization of zswap, about 18MB of memory is allocated for
zswap_pool. Since not all users use zswap, this memory may be wasted.
Save the memory for these users by delaying the initialization of zswap
until first enablement.
This patch (of 3):
zswap_init_started indicates that initialization has started, and
zswap_init_failed indicates that initialization has failed. As we will
support initializing zswap after system startup, it is necessary to add a
state that indicates initialization has completed successfully, to avoid
concurrency issues. Since we don't care about the difference between
initialization having started and having completed, only three states are
needed: uninitialized, initialization failed, and initialization succeeded.
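The three-state scheme described above can be sketched as a small userspace
model. This is not the zswap code: the class and attribute names are
hypothetical, the lock stands in for whatever serialization the kernel uses,
and the choice not to retry after a failed setup is an assumption made for
the sketch.

```python
import threading
from enum import Enum


class InitState(Enum):
    UNINITIALIZED = 0
    FAILED = 1
    SUCCEEDED = 2


class LazySubsystem:
    """Toy model of a subsystem that is only set up on first enablement."""

    def __init__(self, setup):
        self._setup = setup            # hypothetical costly setup callable
        self._lock = threading.Lock()  # serializes concurrent enable() calls
        self.state = InitState.UNINITIALIZED
        self.enabled = False

    def enable(self):
        with self._lock:
            # Run the setup at most once; later callers only see its result.
            if self.state is InitState.UNINITIALIZED:
                try:
                    self._setup()
                    self.state = InitState.SUCCEEDED
                except Exception:
                    self.state = InitState.FAILED
            # Assumption of this sketch: a failed setup is not retried.
            self.enabled = self.state is InitState.SUCCEEDED
            return self.enabled
```

Because a single state variable is read and written under one lock, a second
caller can never observe a "started but not finished" setup, which is why a
separate "started" state is unnecessary in this model.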
Michal Hocko [Tue, 23 Aug 2022 09:22:30 +0000 (11:22 +0200)]
mm: reduce noise in show_mem for lowmem allocations
While discussing early DMA pool pre-allocation failure with Christoph [1]
I realized that the allocation failure warning is rather noisy for
constrained allocations like GFP_DMA{32}. Those zones are usually not
populated on all nodes, as their memory ranges are constrained.
This is an attempt to reduce the ballast that doesn't provide any relevant
information when investigating those allocation failures. Please note that
I have only compile tested it (in my default config setup) and I am
sending it out mostly to see what people think about it.
David Hildenbrand [Thu, 25 Aug 2022 16:46:58 +0000 (18:46 +0200)]
mm/gup: use gup_can_follow_protnone() also in GUP-fast
There seems to be no reason why FOLL_FORCE during GUP-fast would have to
fall back to the slow path when stumbling over a PROT_NONE mapped page. We
only have to trigger hinting faults in case FOLL_FORCE is not set, and any
kind of fault handling naturally happens from the slow path -- where NUMA
hinting accounting/handling would be performed.
Note that the comment regarding THP migration is outdated: commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP
migration") described that this was required for THP due to lack of PMD
migration entries. Nowadays, we do have proper PMD migration entries in
place -- see set_pmd_migration_entry(), which does a proper
pmdp_invalidate() when placing the migration entry.
So let's just reuse gup_can_follow_protnone() here to make it consistent
and drop the somewhat outdated comments.
Link: https://lkml.kernel.org/r/20220825164659.89824-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Thu, 25 Aug 2022 16:46:57 +0000 (18:46 +0200)]
mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()
Patch series "mm: minor cleanups around NUMA hinting".
Working on some GUP cleanups (e.g., getting rid of some FOLL_ flags) and
preparing for other GUP changes (getting rid of FOLL_FORCE|FOLL_WRITE for
taking a R/O longterm pin), this is something I can easily send out
independently.
Get rid of FOLL_NUMA, allow FOLL_FORCE access to PROT_NONE mapped pages in
GUP-fast, and fix up some documentation around NUMA hinting.
This patch (of 3):
No need for a special flag that is not even properly documented to be
internal-only.
Let's just factor this check out and get rid of this flag. The separate
function has the nice benefit that we can centralize comments.
Shakeel Butt [Thu, 25 Aug 2022 00:05:06 +0000 (00:05 +0000)]
memcg: increase MEMCG_CHARGE_BATCH to 64
For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger
machines and network intensive workloads requiring throughput in Gbps,
32 is too small and makes the memcg charging path a bottleneck. For now,
increase it to 64 for easy acceptance to 6.0. We will need to revisit
this in the future to meet the ever-increasing demand for higher
performance.
Please note that the memcg charge path drains the per-cpu memcg charge
stock, so there should not be any OOM behavior change. It does, however,
have an impact on rstat flushing and high limit reclaim backoff.
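To illustrate why the batch size matters, here is a toy model of charging
through a per-cpu stock. It is a sketch only: SharedCounter, CpuStock and
shared_updates() are hypothetical names, there is one CPU and no draining,
and the model simply counts how often the contended counter is touched.

```python
class SharedCounter:
    """Stand-in for the contended, hierarchy-wide page counter."""

    def __init__(self):
        self.usage = 0
        self.updates = 0  # how many times the shared counter was touched

    def charge(self, nr_pages):
        self.usage += nr_pages
        self.updates += 1


class CpuStock:
    """Stand-in for a per-cpu cache of pre-charged pages."""

    def __init__(self, counter, batch):
        self.counter = counter
        self.batch = batch
        self.cached = 0

    def charge(self, nr_pages=1):
        # Fast path: serve the charge from the local stock.
        if self.cached >= nr_pages:
            self.cached -= nr_pages
            return
        # Slow path: charge a whole batch and keep the surplus locally.
        amount = max(self.batch, nr_pages)
        self.counter.charge(amount)
        self.cached += amount - nr_pages


def shared_updates(total_pages, batch):
    """Count shared-counter updates for total_pages single-page charges."""
    counter = SharedCounter()
    stock = CpuStock(counter, batch)
    for _ in range(total_pages):
        stock.charge(1)
    return counter.updates
```

In this model, shared_updates(6400, 32) touches the shared counter 200 times
while shared_updates(6400, 64) touches it 100 times, so doubling the batch
halves the traffic on the contended counter.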
To evaluate the impact of this optimization, on a 72-CPU machine, we
ran the following workload in a three-level cgroup hierarchy.
    $ netserver -6
    # 36 instances of netperf with following params
    $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
    Without (6.0-rc1)    10482.7 Mbps
    With patch           17064.7 Mbps (62.7% improvement)
With the patch, the throughput improved by 62.7%.
Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With memcg v2 enabled, memcg->memory.usage is a very hot member for
workloads doing memcg charging on multiple CPUs concurrently,
particularly network intensive workloads. In addition, there is false
sharing between memory.usage and memory.high on the charge path. This
patch moves usage into a separate cacheline and moves all the read-mostly
fields into a separate cacheline.
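The effect of such a layout change can be illustrated with ctypes. The two
structures below are hypothetical and much simpler than struct page_counter;
the 64-byte cache line and the assumption that each structure starts on a
cache line boundary are also assumptions of this sketch.

```python
import ctypes

CACHELINE = 64  # assumed cache line size in bytes


class PackedCounter(ctypes.Structure):
    # Hypothetical layout: the hot, frequently written 'usage' field shares
    # a cache line with read-mostly limits such as 'high'.
    _fields_ = [
        ("usage", ctypes.c_int64),
        ("min", ctypes.c_int64),
        ("low", ctypes.c_int64),
        ("high", ctypes.c_int64),
        ("max", ctypes.c_int64),
    ]


class PaddedCounter(ctypes.Structure):
    # Hypothetical layout: padding pushes the read-mostly fields onto the
    # next cache line, so writes to 'usage' no longer invalidate them.
    _fields_ = [
        ("usage", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHELINE - ctypes.sizeof(ctypes.c_int64))),
        ("min", ctypes.c_int64),
        ("low", ctypes.c_int64),
        ("high", ctypes.c_int64),
        ("max", ctypes.c_int64),
    ]


def cacheline_of(struct, field):
    """Index of the cache line holding 'field', for a line-aligned struct."""
    return getattr(struct, field).offset // CACHELINE
```

With these definitions, usage and high fall on the same line in PackedCounter
and on different lines in PaddedCounter, at the cost of a larger structure.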
To evaluate the impact of this optimization, on a 72-CPU machine, we ran
the following workload in a three-level cgroup hierarchy.
    $ netserver -6
    # 36 instances of netperf with following params
    $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
    Without (6.0-rc1)    10482.7 Mbps
    With patch           12413.7 Mbps (18.4% improvement)
With the patch, the throughput improved by 18.4%.
One side effect of this patch is an increase in the size of struct
mem_cgroup. For example, with this patch on a 64-bit build, the size of
struct mem_cgroup increased from 4032 bytes to 4416 bytes. However, the
additional size is worth it for the performance improvement. In
addition, there are opportunities to reduce the size of struct mem_cgroup,
such as deprecating the kmem and tcpmem page counters and better packing.
Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 25 Aug 2022 00:05:04 +0000 (00:05 +0000)]
mm: page_counter: remove unneeded atomic ops for low/min
Patch series "memcg: optimize charge codepath", v2.
Recently the Linux networking stack moved from a very old per-socket
pre-charge caching to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that for network
traffic workloads, the memcg charging codepath can become a bottleneck. The
kernel test robot has also reported this regression[1]. This patch series
tries to improve memcg charging for such workloads.
This patch series implements three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
between usage and high.
(C) Increase the memcg charge batch to 64.
To evaluate the impact of these optimizations, on a 72-CPU machine, we
ran the following workload in the root memcg and then compared it with the
scenario where the workload is run in a three-level cgroup hierarchy with
the top level having min and low set up appropriately.
    $ netserver -6
    # 36 instances of netperf with following params
    $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally. We can optimize out this atomic operation for one
specific scenario where the workload is using the protection (i.e. min >
0) and the usage is above the protection (i.e. usage > min).
This scenario is actually very common where users want a part of their
workload to be protected against external reclaim. This optimization
does, however, introduce a race when the usage is around the protection
and concurrent charges and uncharges trip it over or under the protection.
In such cases, we might see lower effective protection, but the subsequent
charge/uncharge will correct it.
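The idea can be sketched with a single-threaded model that counts how often
the atomic exchange would run. The class, its fields and count_exchanges()
are hypothetical stand-ins rather than the kernel implementation, and the
race mentioned above is not modelled.

```python
class ProtectedUsage:
    """Toy model of publishing min(usage, min) to a parent counter."""

    def __init__(self, minimum):
        self.min = minimum
        self.min_usage = 0   # last protected usage published to the parent
        self.xchg_count = 0  # number of modelled atomic exchanges

    def _exchange(self, protected):
        # Models the atomic xchg plus computing the delta for the parent.
        self.xchg_count += 1
        delta = protected - self.min_usage
        self.min_usage = protected
        return delta

    def propagate_always(self, usage):
        # Old behaviour in this model: exchange on every update.
        return self._exchange(min(usage, self.min))

    def propagate_if_changed(self, usage):
        # New behaviour in this model: a plain read first, and the exchange
        # only when the protected value actually differs.
        protected = min(usage, self.min)
        if protected == self.min_usage:
            return 0
        return self._exchange(protected)


def count_exchanges(method_name, minimum, usages):
    """Replay a usage sequence; return (exchanges, published value)."""
    counter = ProtectedUsage(minimum)
    method = getattr(counter, method_name)
    for usage in usages:
        method(usage)
    return counter.xchg_count, counter.min_usage
```

For a workload whose usage stays above min, for example min = 100 and usage
moving between 150 and 249, the old variant performs one exchange per update
while the new variant performs a single exchange in total, and both publish
the same protected value.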
To evaluate the impact of this optimization, on a 72-CPU machine, we ran
the following workload in a three-level cgroup hierarchy with the top level
having min and low set up appropriately to see if this optimization is
effective for the mentioned case.
    $ netserver -6
    # 36 instances of netperf with following params
    $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Results (average throughput of netperf):
    Without (6.0-rc1)    10482.7 Mbps
    With patch           14542.5 Mbps (38.7% improvement)
With the patch, the throughput improved by 38.7%
Link: https://lkml.kernel.org/r/20220825000506.239406-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20220825000506.239406-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oliver Sang <oliver.sang@intel.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:57 +0000 (10:57 -0700)]
hugetlb: use new vma_lock for pmd sharing synchronization
The new hugetlb vma lock (rw semaphore) is used to address this race:
Faulting thread                                 Unsharing thread
...                                             ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
The vma_lock is used as follows:
- During fault processing, the lock is acquired in read mode before
doing a page table lock and allocation (huge_pte_alloc). The lock is
held until the code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
called.
Lock ordering issues come into play when unmapping a page from all
vmas mapping the page. The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare. This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
error handling. In these routines we 'try' to obtain the vma lock and
fail to unmap if unsuccessful. Calling routines already deal with the
failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch. This routine
also tries to acquire the vma lock. If it fails, it skips the
unmapping. However, we cannot have file truncation or hole punch
fail because of contention. After hugetlb_vmdelete_list, truncation
and hole punch call remove_inode_hugepages. remove_inode_hugepages
checks for mapped pages and calls hugetlb_unmap_file_page to unmap them.
hugetlb_unmap_file_page is designed to drop locks and reacquire them in
the correct order to guarantee unmap success.
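The "try" pattern used by the unmap paths above can be sketched in userspace
as follows. This is a simplification: plain mutexes stand in for the two rw
semaphores, and try_unmap() and its callback are hypothetical names.

```python
import threading

# Stand-ins for the per-file i_mmap_rwsem and the per-vma lock. The real
# locks are rw semaphores; plain mutexes keep the sketch short.
i_mmap_rwsem = threading.Lock()
vma_lock = threading.Lock()


def try_unmap(unmap):
    """Unmap path: holds i_mmap_rwsem to find the vma, then only *tries*
    the vma lock, because blocking here would invert the lock order used
    by the fault path (vma lock first)."""
    with i_mmap_rwsem:
        if not vma_lock.acquire(blocking=False):
            return False  # contended: the caller deals with the failure
        try:
            unmap()
            return True
        finally:
            vma_lock.release()
```

A caller that cannot tolerate failure, such as truncation, would instead
drop i_mmap_rwsem and retake both locks in the fault path's order, which is
the role the text above assigns to hugetlb_unmap_file_page.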
Link: https://lkml.kernel.org/r/20220824175757.20590-9-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:56 +0000 (10:57 -0700)]
hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio. This is refactored code from hugetlb_vmdelete_list. It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.
This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.
Link: https://lkml.kernel.org/r/20220824175757.20590-8-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:55 +0000 (10:57 -0700)]
hugetlb: add vma based lock for pmd sharing
Allocate a rw semaphore and hang it off vm_private_data for synchronization
use by vmas that could be involved in pmd sharing. Only add
infrastructure for the new lock here. Actual use will be added in a
subsequent patch.
Link: https://lkml.kernel.org/r/20220824175757.20590-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:54 +0000 (10:57 -0700)]
hugetlb: rename vma_shareable() and refactor code
Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma. Refactor code to check if an
aligned range is shareable as this will be needed in a subsequent patch.
Link: https://lkml.kernel.org/r/20220824175757.20590-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Fri, 26 Aug 2022 05:03:28 +0000 (22:03 -0700)]
hugetlb: fix/remove uninitialized variable in remove_inode_hugepages
Code introduced for the routine remove_inode_hugepages by patch "hugetlb:
handle truncate racing with page faults", incorrectly uses a variable
m_index. This is a remnant from a previous version of the code when under
development. Use the correct variable 'index' and remove 'm_index' from
the routine.
Link: https://lkml.kernel.org/r/Ywepr7C2X20ZvLdn@monkey Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:53 +0000 (10:57 -0700)]
hugetlb: handle truncate racing with page faults
When page fault code needs to allocate and instantiate a new hugetlb page
(hugetlb_no_page), it checks early to determine if the fault is beyond
i_size. When discovered early, it is easy to abort the fault and return
an error. However, it becomes much more difficult to handle when
discovered later, after allocating the page, consuming reservations and
adding to the page cache. Backing out changes in such instances becomes
difficult and error prone.
Instead of trying to catch and back out all such races, use the hugetlb
fault mutex to handle truncate racing with page faults. The most
significant change is modification of the routine remove_inode_hugepages
such that it will take the fault mutex for EVERY index in the truncated
range (or hole in the case of hole punch). Since remove_inode_hugepages
is called in the truncate path after updating i_size, we can experience
races as follows:
- truncate code updates i_size and takes fault mutex before a racing
fault. After fault code takes mutex, it will notice fault beyond
i_size and abort early.
- fault code obtains mutex, and truncate updates i_size after early
checks in fault code. fault code will add page beyond i_size.
When truncate code takes mutex for page/index, it will remove the
page.
- truncate updates i_size, but fault code obtains mutex first. If
fault code sees updated i_size it will abort early. If fault code
does not see updated i_size, it will add page beyond i_size and
truncate code will remove page when it obtains fault mutex.
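The three interleavings above can be exercised with a small threaded model.
It is a sketch under simplifying assumptions: HugeFile is a hypothetical
class, sizes are counted in pages, and the fault mutex is selected by
index modulo the table size rather than by the kernel's hash.

```python
import threading


class HugeFile:
    """Toy model of a file whose faults and truncation share fault mutexes."""

    def __init__(self, size, nr_mutexes=8):
        self.i_size = size       # file size in pages
        self.page_cache = set()  # indices that currently have a page
        self.fault_mutexes = [threading.Lock() for _ in range(nr_mutexes)]

    def _mutex(self, index):
        # Assumption: a trivial mapping from index to mutex.
        return self.fault_mutexes[index % len(self.fault_mutexes)]

    def fault(self, index):
        with self._mutex(index):
            # Early check; it may still see a stale (larger) i_size.
            if index >= self.i_size:
                return False
            self.page_cache.add(index)
            return True

    def truncate(self, new_size):
        old_size = self.i_size
        self.i_size = new_size   # update i_size first
        # Then take the fault mutex for EVERY index in the truncated range,
        # removing any page a racing fault managed to add.
        for index in range(new_size, old_size):
            with self._mutex(index):
                self.page_cache.discard(index)
```

Whichever side wins a given mutex, no page beyond the new i_size survives:
a fault that runs after truncate has passed its index sees the updated
i_size and aborts, and a fault that ran before has its page removed when
truncate takes that mutex.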
Note, for performance reasons remove_inode_hugepages will still use
filemap_get_folios for bulk folio lookups. For indices not returned in
the bulk lookup, it will need to look up individual folios to check for
races with page fault.
Link: https://lkml.kernel.org/r/20220824175757.20590-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:52 +0000 (10:57 -0700)]
hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
remove_huge_page removes a hugetlb page from the page cache. Change to
hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.
Link: https://lkml.kernel.org/r/20220824175757.20590-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:51 +0000 (10:57 -0700)]
hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. However, this has been shown to cause
performance/scaling issues. Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.
Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.
NOTE: Reverting this code does expose the following race condition.
Faulting thread                                 Unsharing thread
...                                             ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
It is unknown if the above race was ever experienced by a user. It was
discovered via code inspection when initially addressed.
In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.
Link: https://lkml.kernel.org/r/20220824175757.20590-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Wed, 24 Aug 2022 17:57:50 +0000 (10:57 -0700)]
hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
Patch series "hugetlb: Use new vma mutex for huge pmd sharing synchronization".
hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7. At that time, a proposal to
address the regression was suggested [3] but went nowhere.
The regression and benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"
The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings. To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory. The shared
mapping size was 250GB. For comparison, a single instance of the program
was run. Then, multiple instances were run in parallel to introduce
lock contention. Changing the locking scheme results in a significant
performance benefit.
test             instances  unmodified  revert   vma
--------------------------------------------------------------------------
faults per sec   1          397068      403411   394935
faults per sec   24         68322       83023    82436
forks per sec    1          2717        2862     2816
forks per sec    24         404         465      499
Combined faults  24         1528        69090    59544
Combined forks   24         337         66       140
Combined test is when running both faulting program and forking program
simultaneously.
Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5 which
depends on c0d0381ade79. Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share. With c0d0381ade79 reverted, this race is exposed:
Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)
Reverting 87bf91d39bb5 exposes races in page fault/file truncation.
Patches 3 and 4 of this series address those races. This requires
using the hugetlb fault mutexes for more coordination between the fault
code and file page removal.
Patches 5 - 7 add infrastructure for a new vma based rw semaphore that
will be used for pmd sharing synchronization. The idea is that this
semaphore will be held in read mode for the duration of fault processing,
and held in write mode for unmap operations which may call huge_pmd_unshare.
Acquiring i_mmap_rwsem is also still required to synchronize huge pmd
sharing. However it is only required in the fault path when setting up
sharing, and will be acquired in huge_pmd_share().
Patch 8 makes use of this new vma lock. Unfortunately, the fault code
and truncate/hole punch code would naturally take locks in the opposite
order which could lead to deadlock. Since the performance of page faults
is more important, the truncation/hole punch code is modified to back
out and take locks in the correct order if necessary.
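The back-out described above can be sketched in userspace as follows. This is
only an illustration, not the hugetlb code: toy_lock, toy_trylock() and
lock_first_while_holding_second() are hypothetical names, and "first" and
"second" simply stand for the lock the fault path takes first and the lock it
takes second.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock; stands in for whichever kernel locks are involved. */
struct toy_lock {
	atomic_flag held;
};

static bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_lock_acquire(struct toy_lock *l)
{
	while (!toy_trylock(l))
		;	/* spin; a real lock would sleep */
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Called by the slower path (truncate/hole punch in the text), which already
 * holds 'second' but needs 'first'.  Returns true if it had to back out.
 * On return both locks are held, as if taken in first-then-second order.
 */
static bool lock_first_while_holding_second(struct toy_lock *first,
					     struct toy_lock *second)
{
	if (toy_trylock(first))
		return false;		/* no contention, nothing to undo */

	toy_unlock(second);		/* back out */
	toy_lock_acquire(first);	/* retake in the agreed order */
	toy_lock_acquire(second);
	return true;
}
```

When the function reports that it backed out, the caller has to recheck
anything it looked up under the second lock, since that state may have changed
while the lock was dropped.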
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. The use of i_mmap_rwsem to prevent
fault/truncate races depends on this. However, this has been shown to
cause performance/scaling issues. As a result, that code will be
reverted. Since the use of i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.
In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.
Link: https://lkml.kernel.org/r/20220824175757.20590-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220824175757.20590-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Haiyue Wang [Tue, 23 Aug 2022 13:58:41 +0000 (21:58 +0800)]
mm: fix the handling Non-LRU pages returned by follow_page
The code handling Non-LRU pages returned by follow_page() jumps out directly
and does not call put_page() to drop the reference count, even though the
'FOLL_GET' flag makes follow_page() call get_page(). Fix the zone device page
check by handling the page reference count correctly before returning.
As David noted in review, "device pages are never PageKsm pages", so drop the
zone device page check for break_ksm().
A zone device page can't be a transparent huge page either, so drop the
redundant zone device page check for split_huge_pages_pid(). (by Miaohe)
Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com Fixes: 3218f8712d6b ("mm: handling Non-LRU pages returned by vm_normal_pages") Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Heidelberg [Tue, 23 Aug 2022 15:20:33 +0000 (17:20 +0200)]
mm: remove EXPERIMENTAL flag for zswap
zswap has been with us since 2013, and it's widely used in many products.
Link: https://lkml.kernel.org/r/20220823152033.66682-1-david@ixit.cz Signed-off-by: David Heidelberg <david@ixit.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Wed, 24 Aug 2022 03:51:00 +0000 (12:51 +0900)]
drivers/block/zram/zram_drv.c: do not keep dangling zcomp pointer after zram reset
We do all reset operations under write lock, so we don't need to save
->disksize and ->comp to stack variables. Another thing is that ->comp is
freed during zram reset, but comp pointer is not NULL-ed, so zram keeps
the freed pointer value.
When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
called to migrate pages out of zones which should not contain any longterm
pinned pages.
When migration succeeds all pages will have been unpinned so pinning needs
to be retried. Migration can also fail, in which case the pages will also
have been unpinned but the operation should not be retried. If all pages
are in the correct zone nothing will be unpinned and no retry is required.
The logic in check_and_migrate_movable_pages() tracks unnecessary state
and the return codes for each case are difficult to follow. Refactor the
code to clean this up. No behaviour change is intended.
Link: https://lkml.kernel.org/r/19583d1df07fdcb99cfa05c265588a3fa58d1902.1661317396.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
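The three outcomes described for check_and_migrate_movable_pages() above map
onto a simple caller loop. The sketch below is a userspace toy, not the gup.c
code: the enum, toy_check_and_migrate() and toy_pin_longterm() are
hypothetical, the outcomes are replayed from a script, and -1 is a placeholder
error value.

```c
enum toy_outcome {
	TOY_NOTHING_TO_DO,	/* all pages already in an allowed zone */
	TOY_MIGRATED,		/* pages migrated and unpinned: pin again */
	TOY_FAILED,		/* pages unpinned, migration failed: give up */
};

struct toy_gup {
	const enum toy_outcome *script;	/* scripted results, one per call */
	int calls;			/* how many checks were performed */
};

/* Stand-in for the real check; just replays the next scripted outcome. */
static enum toy_outcome toy_check_and_migrate(struct toy_gup *g)
{
	return g->script[g->calls++];
}

static int toy_pin_longterm(struct toy_gup *g)
{
	for (;;) {
		/* a real caller would (re)pin the pages here */
		switch (toy_check_and_migrate(g)) {
		case TOY_NOTHING_TO_DO:
			return 0;	/* pins are still held, done */
		case TOY_MIGRATED:
			continue;	/* everything was unpinned, retry */
		case TOY_FAILED:
			return -1;	/* everything was unpinned, no retry */
		}
	}
}
```

Only one of the three outcomes loops, which is the property the refactoring
tries to make obvious.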
Alistair Popple [Wed, 24 Aug 2022 05:09:51 +0000 (15:09 +1000)]
mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages()
gup_flags is passed to check_and_migrate_movable_pages() so that it can
call either put_page() or unpin_user_page() to drop the page reference.
However check_and_migrate_movable_pages() is only called for
FOLL_LONGTERM, which implies FOLL_PIN so there is no need to pass
gup_flags.
Link: https://lkml.kernel.org/r/d611c65a9008ff55887307df457c6c2220ad6163.1661317396.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: Skip retry when new limit is not below old one in page_counter_set_max
In page_counter_set_max, we want to make sure the new limit is not below
the concurrently-changing counter value. We read the counter and check
that the limit is not below the counter before the swap. After the swap,
we read the counter again and retry in case the counter is incremented as
this may violate the requirement. Even though the page_counter_try_charge
can see the old limit, it is guaranteed that the counter is not above the
old limit after the increment. So in case the new limit is not below the
old limit, the counter is guaranteed to be not above the new limit too.
We can skip the retry in this case to optimize a little bit.
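The reasoning above can be sketched with C11 atomics in userspace. This is an
illustration under assumptions, not the kernel's page_counter code: struct
toy_counter and toy_set_max() are hypothetical names, and -EBUSY is used here
only as an illustrative error for "the new limit is below current usage".

```c
#include <errno.h>
#include <stdatomic.h>

struct toy_counter {
	atomic_long usage;	/* concurrently changing counter value */
	atomic_long max;	/* current limit */
};

static int toy_set_max(struct toy_counter *c, long nr_pages)
{
	for (;;) {
		long usage = atomic_load(&c->usage);
		long old;

		/* the new limit must not be below the current usage */
		if (usage > nr_pages)
			return -EBUSY;

		old = atomic_exchange(&c->max, nr_pages);

		/*
		 * Done if usage did not grow during the swap.  Also done if
		 * the limit was not lowered: a charge that raced with us was
		 * bounded by 'old', and old <= nr_pages.
		 */
		if (atomic_load(&c->usage) <= usage || nr_pages >= old)
			return 0;

		/* limit was lowered and usage grew: restore and retry */
		atomic_store(&c->max, old);
	}
}
```

The second half of the condition is the shortcut the commit describes; without
it, raising the limit could still trigger a needless retry.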
mm: pagewalk: add back missing variable initializations
These initializations accidentally got lost during refactoring.
The first one can't actually be used without initialization, because
walk_p4d_range() is only called when one of the 4 callbacks is set, but relying
on this seems fragile.
Link: https://lkml.kernel.org/r/2123960.ggj6I0NvhH@mobilepool36.emlix.com Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Rolf Eike Beer <eb@emlix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: pagewalk: move variables to more local scope, tweak loops
Move some variables to more local scopes to make it obvious that they
don't carry state. Put the end additions into the for loop instructions
to make them easier to read.
mm: pagewalk: allow walk_page_range_novma() without mm
Since e47690d756a7 ("x86: mm: avoid allocating struct mm_struct on the
stack") a pgd can be passed to walk_page_range_novma(). When it is set,
nothing in the pagewalk code uses walk.mm anymore, so permit passing a
NULL mm instead. It is up to the caller to ensure proper locking on the
pgd in this case.
Patch series "Minor improvements for pagewalk code".
For some project I had to use the pagewalk API for certain things and
during this have read through the code multiple times. Our usage has
changed several times depending on our current state of research as well.
During all of this I have made some tweaks to the code to be able to follow it
better when hunting my own problems, and not call into some things that I
actually don't need. The patches are more or less independent of each other.
This patch (of 6):
The err variable only needs to be checked when it was assigned directly
before, it is not carried on to any later checks. Move the checks into
the same "if" conditions where they are assigned. Also just return the
error at the relevant places. While at it move these err variables to a
more local scope at some places.
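The change in shape described above can be illustrated with a small
before/after pair. The code is a toy and not taken from mm/pagewalk.c:
struct toy_ops, its optional entry callback and both walk functions are
hypothetical.

```c
struct toy_ops {
	int (*entry)(int idx);	/* optional callback, may be NULL */
};

/* Before: a function-wide 'err', tested apart from where it is assigned. */
static int toy_walk_before(const struct toy_ops *ops, int n)
{
	int err = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (ops->entry)
			err = ops->entry(i);
		if (err)
			break;
	}
	return err;
}

/* After: 'err' lives only where it is assigned and is returned at once. */
static int toy_walk_after(const struct toy_ops *ops, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (ops->entry) {
			int err = ops->entry(i);

			if (err)
				return err;
		}
	}
	return 0;
}

/* Example callback: fails with an arbitrary error on the third entry. */
static int toy_fail_at_two(int idx)
{
	return idx == 2 ? -5 : 0;
}
```

Both versions return the same values; the second one just makes it visible
that no error state is carried from one iteration to the next.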
ERROR: space required before the open parenthesis '('
#107: FILE: tools/testing/selftests/vm/mremap_test.c:146:
+ while(getline(&line, &len, fp) != -1) {
ERROR: space required after that ',' (ctx:VxV)
#108: FILE: tools/testing/selftests/vm/mremap_test.c:147:
+ char *first = strtok(line,"- ");
^
ERROR: space required after that ',' (ctx:VxV)
#110: FILE: tools/testing/selftests/vm/mremap_test.c:149:
+ char *second = strtok(NULL,"- ");
^
WARNING: Missing a blank line after declarations
#112: FILE: tools/testing/selftests/vm/mremap_test.c:151:
+ void *second_val = (void *) strtol(second, NULL, 16);
+ if (first_val == start && second_val == start + 3 * page_size) {
total: 3 errors, 3 warnings, 113 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-add-merging-after-mremap-resize.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jakub Matěna <matenajakub@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Matěna [Fri, 3 Jun 2022 14:57:19 +0000 (16:57 +0200)]
mm: add merging after mremap resize
When mremap call results in expansion, it might be possible to merge the
VMA with the next VMA which might become adjacent. This patch adds
vma_merge call after the expansion is done to try and merge.
Link: https://lkml.kernel.org/r/20220603145719.1012094-3-matenajakub@gmail.com Signed-off-by: Jakub Matěna <matenajakub@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Matěna [Fri, 3 Jun 2022 14:57:18 +0000 (16:57 +0200)]
mm: refactor of vma_merge()
Patch series "Refactor of vma_merge and new merge call", v4.
I am currently working on my master's thesis trying to increase the number
of merges of VMAs currently failing because of page offset incompatibility
and difference in their anon_vmas. The following refactor and added merge
call included in this series are just two smaller upgrades I created along
the way.
This patch (of 2):
Refactor vma_merge() to make it shorter and more understandable. The main
change is the elimination of code duplication in the merge-next
check. This is done by first doing checks and caching the results before
executing the merge itself. The variable 'area' is divided into 'mid' and
'res' as previously it was used for two purposes, as the middle VMA
between prev and next and also as the result of the merge itself. Exit
paths are also unified.
Link: https://lkml.kernel.org/r/20220603145719.1012094-1-matenajakub@gmail.com Link: https://lkml.kernel.org/r/20220603145719.1012094-2-matenajakub@gmail.com Signed-off-by: Jakub Matěna <matenajakub@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Rik van Riel <riel@surriel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
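The "check first, cache the results, then merge" structure mentioned in the
refactor can be sketched as below. This is a heavily simplified toy, not
vma_merge(): struct toy_vma and toy_merge() are hypothetical, compatibility is
reduced to equal flags, and none of the real function's cases involving
partially overlapping areas are covered.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_vma {
	unsigned long start, end;	/* covers [start, end) */
	unsigned long flags;
};

/* Returns the area that results from placing [start, end) between neighbours. */
static struct toy_vma toy_merge(const struct toy_vma *prev,
				const struct toy_vma *next,
				unsigned long start, unsigned long end,
				unsigned long flags)
{
	struct toy_vma res = { start, end, flags };

	/* Do all checks up front and cache the answers ... */
	bool merge_prev = prev && prev->end == start && prev->flags == flags;
	bool merge_next = next && next->start == end && next->flags == flags;

	/* ... then act on them, so each check is written only once. */
	if (merge_prev)
		res.start = prev->start;
	if (merge_next)
		res.end = next->end;
	return res;
}
```

With the decisions cached in two booleans, the "merge with both", "merge with
prev only" and "merge with next only" cases no longer need their own copies of
the compatibility checks.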
Suren Baghdasaryan [Tue, 31 May 2022 22:30:59 +0000 (15:30 -0700)]
mm: drop oom code from exit_mmap
The primary reason to invoke the oom reaper from the exit_mmap path used
to be a prevention of an excessive oom killing if the oom victim exit
races with the oom reaper (see [1] for more details). The invocation has
moved around since then because of the interaction with the munlock logic
but the underlying reason has remained the same (see [2]).
Munlock code is no longer a problem since [3] and there shouldn't be any
blocking operation before the memory is unmapped by exit_mmap so the oom
reaper invocation can be dropped. The unmapping part can be done with the
non-exclusive mmap_sem and the exclusive one is only required when page
tables are freed.
Remove the oom_reaper from exit_mmap which will make the code easier to
read. This is really unlikely to make any observable difference although
some microbenchmarks could benefit from one less branch that needs to be
evaluated even though it almost never is true.
[1] 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
[2] 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3")
[3] a213e5cf71cb ("mm/munlock: delete munlock_vma_pages_all(), allow oomreap")
Link: https://lkml.kernel.org/r/20220531223100.510392-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Wed, 15 Jun 2022 17:40:58 +0000 (17:40 +0000)]
mm/mlock: drop dead code in count_mm_mlocked_page_nr()
The check for mm being null has never been needed since the only caller
has always passed in current->mm. Remove the check from
count_mm_mlocked_page_nr().
Liam R. Howlett [Mon, 22 Aug 2022 15:06:34 +0000 (15:06 +0000)]
mm/mmap.c: pass in mapping to __vma_link_file()
__vma_link_file() resolves the mapping from the file, if there is one.
Pass through the mapping and check the vm_file externally since most
places already have the required information and check of vm_file.
Link: https://lkml.kernel.org/r/20220822150128.1562046-71-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:06:33 +0000 (15:06 +0000)]
mm: remove the vma linked list
Replace any vm_next use with vma_find().
Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
maple tree.
Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap(). At
the same time, alter the loop to be more compact.
Now that free_pgtables() and unmap_vmas() take a maple tree as an
argument, rearrange do_mas_align_munmap() to use the new tree to hold the
vmas to remove.
Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
used to update the linked list.
Drop linked list update from __insert_vm_struct().
Rework validation of tree as it was depending on the linked list.
Link: https://lkml.kernel.org/r/20220822150128.1562046-69-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Yingliang [Wed, 24 Aug 2022 04:24:24 +0000 (12:24 +0800)]
mm/nommu: fix error handling in split_vma()
The memory allocated before calling mas_preallocate() is leaked if it
fails. 'mas' is not modified until mas_preallocate() is called, so move it
up and add an error label to free the memory.
Link: https://lkml.kernel.org/r/20220824042424.2031508-1-yangyingliang@huawei.com Fixes: 8aff7dbeaeb1 ("nommu: remove uses of VMA linked list") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
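The error-label pattern the fix relies on looks roughly like this in a
userspace toy. Nothing here is the nommu code: toy_alloc(), toy_free(),
toy_split() and the prealloc_fails switch are hypothetical stand-ins, and the
allocation counter exists only so the absence of a leak can be checked.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

static int toy_live_allocs;	/* number of objects not yet freed */

static void *toy_alloc(void)
{
	void *p = malloc(16);

	if (p)
		toy_live_allocs++;
	return p;
}

static void toy_free(void *p)
{
	if (p) {
		free(p);
		toy_live_allocs--;
	}
}

/* 'prealloc_fails' simulates the preallocation step failing. */
static int toy_split(bool prealloc_fails)
{
	void *region, *new_vma;

	region = toy_alloc();
	if (!region)
		return -ENOMEM;

	new_vma = toy_alloc();
	if (!new_vma)
		goto err_region;

	if (prealloc_fails)
		goto err_vma;	/* unwind everything allocated so far */

	/* Success: the toy has no tree to own the objects, so release them. */
	toy_free(new_vma);
	toy_free(region);
	return 0;

err_vma:
	toy_free(new_vma);
err_region:
	toy_free(region);
	return -ENOMEM;
}
```

Each label frees exactly what was allocated before the jump, so a failure at
any step leaves nothing behind.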
Liam Howlett [Thu, 25 Aug 2022 20:30:24 +0000 (20:30 +0000)]
mm/mprotect: fix maple tree start address in do_mprotect_pkey()
Prior to my change to use the maple tree, the start address was changed
before calling find_vma() with the untagged_addr() version of start. My
first change recorded the tagged address and searched on the incorrect
start location - which would have found the incorrect VMA. This fix will
use the untagged_addr() as the start of the search as it was before I
changed the code at all.
Any penalty of calling untagged_addr() occurred regardless of the version
that was used. The search of the maple tree would have also occurred in
both versions - just at the wrong location before this fix. I expect that
the execution time would be equal as the search on the tagged address
would have either returned a VMA at start, or the VMA in the next slot in
the maple tree node - probably immeasurably slower since the data is very
likely already in the CPU cache, but I don't have hard data to say either
way. I can look into a benchmark to measure the difference between both
working versions, but I don't have an arm64 native target so it will be
emulated.
Use the untagged_addr() instead of the address passed into the function.
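Why the search has to start from the untagged address can be shown with a toy
lookup. The sketch assumes the tag sits in the top byte of a 64-bit user
address; toy_untagged_addr() and toy_find_vma() are hypothetical helpers, and
a linear scan stands in for the maple tree.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TAG_SHIFT	56
#define TOY_TAG_MASK	((uint64_t)0xff << TOY_TAG_SHIFT)

/* Assumption: clearing the top byte is enough to strip the tag. */
static uint64_t toy_untagged_addr(uint64_t addr)
{
	return addr & ~TOY_TAG_MASK;
}

struct toy_vma {
	uint64_t start, end;	/* covers [start, end) */
};

/* First area that ends above addr, or NULL if there is none. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas, int n,
					   uint64_t addr)
{
	int i;

	for (i = 0; i < n; i++)
		if (addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}
```

In this toy, a lookup with the tag still set compares a very large number
against the area ends and finds nothing, while the untagged address lands in
the intended area.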
Liam R. Howlett [Mon, 22 Aug 2022 15:06:30 +0000 (15:06 +0000)]
mm/mempolicy: use vma iterator & maple state instead of vma linked list
Reworked the way mbind_range() finds the first VMA to reuse the maple
state and limit the number of tree walks needed.
Note, this drops the VM_BUG_ON(!vma) call, which would catch a start
address higher than the last VMA. The code was written in a way that
allowed no VMA updates to occur and still return success. There should be
no functional change to this scenario with the new code.
Link: https://lkml.kernel.org/r/20220822150128.1562046-57-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:27 +0000 (15:06 +0000)]
sched: use maple tree iterator to walk VMAs
The linked list is slower than walking the VMAs using the maple tree. We
can't use the VMA iterator here because it doesn't support moving to an
earlier position.
Link: https://lkml.kernel.org/r/20220822150128.1562046-49-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 22 Aug 2022 15:06:26 +0000 (15:06 +0000)]
ipc/shm: use VMA iterator instead of linked list
The VMA iterator is faster than the linked list, and it can be walked
even when VMAs are being removed from the address space, so there's no
need to keep track of 'next'.
Link: https://lkml.kernel.org/r/20220822150128.1562046-46-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 22 Aug 2022 15:06:25 +0000 (15:06 +0000)]
coredump: remove vma linked list walk
Use the Maple Tree iterator instead. This is too complicated for the VMA
iterator to handle, so let's open-code it for now. If this turns out to
be a common pattern, we can migrate it to common code.
Link: https://lkml.kernel.org/r/20220822150128.1562046-41-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>