Mike Rapoport (IBM) [Tue, 21 Mar 2023 17:05:02 +0000 (19:05 +0200)]
mm: move most of core MM initialization to mm/mm_init.c
The bulk of memory management initialization code is spread all over
mm/page_alloc.c and makes navigating through page allocator functionality
difficult.
Move most of the functions marked __init and __meminit to mm/mm_init.c to
make it better localized and allow some more spare room before
mm/page_alloc.c reaches 10k lines.
No functional changes.
Link: https://lkml.kernel.org/r/20230321170513.2401534-4-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Tue, 21 Mar 2023 17:05:00 +0000 (19:05 +0200)]
mips: fix comment about pgtable_init()
Patch series "mm: move core MM initialization to mm/mm_init.c", v2.
This set moves most of the core MM initialization to mm/mm_init.c.
This largely includes free_area_init() and its helpers, functions used at
boot time, mm_init() from init/main.c and some of the functions it calls.
Aside from gaining some more space before mm/page_alloc.c hits 10k lines,
this makes mm/page_alloc.c to be mostly about buddy allocator and moves
the init code out of the way, which IMO improves maintainability.
Besides, this allows to move a couple of declarations out of include/linux
and make them private to mm/.
And as an added bonus there a slight decrease in vmlinux size. For
tinyconfig and defconfig on x86 I've got
tinyconfig:
text data bss dec hex filename
853206 289376 12001282342710 23bf36 a/vmlinux
853198 289344 12001282342670 23bf0e b/vmlinux
Lorenzo Stoakes [Tue, 21 Mar 2023 18:09:55 +0000 (18:09 +0000)]
MAINTAINERS: add Lorenzo as vmalloc reviewer
I have recently been involved in both reviewing and submitting patches to
the vmalloc code in mm and would be willing and happy to help out with
review going forward if it would be helpful!
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:45 +0000 (15:03 -0300)]
vmstat: add pcp remote node draining via cpu_vm_stats_fold
Large NUMA systems might have significant portions of system memory to be
trapped in pcp queues. The number of pcp is determined by the number of
processors and nodes in a system. A system with 4 processors and 2 nodes
has 8 pcps which is okay. But a system with 1024 processors and 512 nodes
has 512k pcps with a high potential for large amount of memory being
caught in them.
Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case, where
vmstat_shepherd will perform the aging and draining via cpu_vm_stats_fold.
Link: https://lkml.kernel.org/r/20230320180745.858515310@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:43 +0000 (15:03 -0300)]
mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
With a task that busy loops on a given CPU, the kworker interruption to
execute vmstat_update is undesired and may exceed latency thresholds for
certain applications.
switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.
To fix this, now that the counters are modified via cmpxchg both CPU
locally (via the account functions), and remotely (via cpu_vm_stats_fold),
its possible to switch vmstat_shepherd to perform the per-CPU vmstats
folding remotely.
Link: https://lkml.kernel.org/r/20230320180745.807656081@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:40 +0000 (15:03 -0300)]
mm/vmstat: switch counter modification to cmpxchg
In preparation for switching vmstat shepherd to flush per-CPU counters
remotely, switch the __{mod,inc,dec} functions that modify the counters to
use cmpxchg.
To facilitate reviewing, functions are ordered in the text file, as:
To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
For the single_page_alloc_free test, which does
/** Loop to measure **/
for (i = 0; i < rec->loops; i++) {
my_page = alloc_page(gfp_mask);
if (unlikely(my_page == NULL))
return 0;
__free_page(my_page);
}
Unit is cycles.
Vanilla Patched Diff
115.25 117 1.4%
Link: https://lkml.kernel.org/r/20230320180745.733575720@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:37 +0000 (15:03 -0300)]
this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
Goal is to have vmstat_shepherd to transfer from per-CPU counters to
global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.
Following the kernel convention for cmpxchg/cmpxchg_local, change x86's
this_cpu_cmpxchg_ helpers to be atomic. and add this_cpu_cmpxchg_local_
helpers which are not atomic.
Link: https://lkml.kernel.org/r/20230320180745.658574087@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:34 +0000 (15:03 -0300)]
this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
Goal is to have vmstat_shepherd to transfer from per-CPU counters to
global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.
Following the kernel convention for cmpxchg/cmpxchg_local, change ARM's
this_cpu_cmpxchg_ helpers to be atomic, and add this_cpu_cmpxchg_local_
helpers which are not atomic.
Link: https://lkml.kernel.org/r/20230320180745.582248645@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marcelo Tosatti [Mon, 20 Mar 2023 18:03:33 +0000 (15:03 -0300)]
vmstat: allow_direct_reclaim should use zone_page_state_snapshot
Patch series "fold per-CPU vmstats remotely", v7.
This patch series addresses the following two problems:
1. A customer provided evidence indicating that a process
was stalled in direct reclaim:
- The process was trapped in throttle_direct_reclaim().
The function wait_event_killable() was called to wait condition
allow_direct_reclaim(pgdat) for current node to be true.
The allow_direct_reclaim(pgdat) examined the number of free pages
on the node by zone_page_state() which just returns value in
zone->vm_stat[NR_FREE_PAGES].
- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
However, the freelist on this node was not empty.
- This inconsistent of vmstat value was caused by percpu vmstat on
nohz_full cpus. Every increment/decrement of vmstat is performed
on percpu vmstat counter at first, then pooled diffs are cumulated
to the zone's vmstat counter in timely manner. However, on nohz_full
cpus (in case of this customer's system, 48 of 52 cpus) these pooled
diffs were not cumulated once the cpu had no event on it so that
the cpu started sleeping infinitely.
I checked percpu vmstat and found there were total 69 counts not
cumulated to the zone's vmstat counter yet.
- In this situation, kswapd did not help the trapped process.
In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
of free pages on the node by zone_page_state_snapshot() which
checks pending counts on percpu vmstat.
Therefore kswapd could know there were 69 free pages correctly.
Since zone->_watermark = {8, 20, 32}, kswapd did not work because
69 was greater than 32 as high watermark.
2. With a task that busy loops on a given CPU,
the kworker interruption to execute vmstat_update
is undesired and may exceed latency thresholds
for certain applications.
By having vmstat_shepherd flush the per-CPU counters to the
global counters from remote CPUs.
This is done using cmpxchg to manipulate the counters,
both CPU locally (via the account functions),
and remotely (via cpu_vm_stats_fold).
Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.
switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.
This patch (of 13):
A customer provided evidence indicating that a process
was stalled in direct reclaim:
- The process was trapped in throttle_direct_reclaim().
The function wait_event_killable() was called to wait condition
allow_direct_reclaim(pgdat) for current node to be true.
The allow_direct_reclaim(pgdat) examined the number of free pages
on the node by zone_page_state() which just returns value in
zone->vm_stat[NR_FREE_PAGES].
- On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
However, the freelist on this node was not empty.
- This inconsistent of vmstat value was caused by percpu vmstat on
nohz_full cpus. Every increment/decrement of vmstat is performed
on percpu vmstat counter at first, then pooled diffs are cumulated
to the zone's vmstat counter in timely manner. However, on nohz_full
cpus (in case of this customer's system, 48 of 52 cpus) these pooled
diffs were not cumulated once the cpu had no event on it so that
the cpu started sleeping infinitely.
I checked percpu vmstat and found there were total 69 counts not
cumulated to the zone's vmstat counter yet.
- In this situation, kswapd did not help the trapped process.
In pgdat_balanced(), zone_wakermark_ok_safe() examined the number
of free pages on the node by zone_page_state_snapshot() which
checks pending counts on percpu vmstat.
Therefore kswapd could know there were 69 free pages correctly.
Since zone->_watermark = {8, 20, 32}, kswapd did not work because
69 was greater than 32 as high watermark.
Change allow_direct_reclaim to use zone_page_state_snapshot, which
allows a more precise version of the vmstat counters to be used.
allow_direct_reclaim will only be called from try_to_free_pages,
which is not a hot path.
Link: https://lkml.kernel.org/r/20230320180332.102837832@redhat.com Link: https://lkml.kernel.org/r/20230320180745.556821285@redhat.com Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Russell King (Oracle)" <linux@armlinux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Rapoport (IBM) [Sun, 19 Mar 2023 11:42:14 +0000 (13:42 +0200)]
mm: move get_page_from_free_area() to mm/page_alloc.c
The get_page_from_free_area() helper is only used in mm/page_alloc.c so
move it there to reduce noise in include/linux/mmzone.h
Link: https://lkml.kernel.org/r/20230319114214.2133332-1-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 17 Mar 2023 21:58:25 +0000 (21:58 +0000)]
mm: refactor do_fault_around()
Patch series "Refactor do_fault_around()"
Refactor do_fault_around() to avoid bitwise tricks and rather difficult to
follow logic. Additionally, prefer fault_around_pages to
fault_around_bytes as the operations are performed at a base page
granularity.
This patch (of 2):
The existing logic is confusing and fails to abstract a number of bitwise
tricks.
Use ALIGN_DOWN() to perform alignment, pte_index() to obtain a PTE index
and represent the address range using PTE offsets, which naturally make it
clear that the operation is intended to occur within only a single PTE and
prevent spanning of more than one page table.
We rely on the fact that fault_around_bytes will always be page-aligned,
at least one page in size, a power of two and that it will not exceed
PAGE_SIZE * PTRS_PER_PTE in size (i.e. the address space mapped by a
PTE). These are all guaranteed by fault_around_bytes_set().
Michal Hocko [Fri, 17 Mar 2023 13:44:48 +0000 (14:44 +0100)]
memcg: do not drain charge pcp caches on remote isolated cpus
Leonardo Bras has noticed that pcp charge cache draining might be
disruptive on workloads relying on 'isolated cpus', a feature commonly
used on workloads that are sensitive to interruption and context switching
such as vRAN and Industrial Control Systems.
There are essentially two ways how to approach the issue. We can either
allow the pcp cache to be drained on a different rather than a local cpu
or avoid remote flushing on isolated cpus.
The current pcp charge cache is really optimized for high performance and
it always relies to stick with its cpu. That means it only requires
local_lock (preempt_disable on !RT) and draining is handed over to pcp WQ
to drain locally again.
The former solution (remote draining) would require to add an additional
locking to prevent local charges from racing with the draining. This adds
an atomic operation to otherwise simple arithmetic fast path in the
try_charge path. Another concern is that the remote draining can cause a
lock contention for the isolated workloads and therefore interfere with it
indirectly via user space interfaces.
Another option is to avoid draining scheduling on isolated cpus
altogether. That means that those remote cpus would keep their charges
even after drain_all_stock returns. This is certainly not optimal either
but it shouldn't really cause any major problems. In the worst case (many
isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH i.e 64
page) the memory consumption of a memcg would be artificially higher than
can be immediately used from other cpus.
Theoretically a memcg OOM killer could be triggered pre-maturely.
Currently it is not really clear whether this is a practical problem
though. Tight memcg limit would be really counter productive to cpu
isolated workloads pretty much by definition because any memory reclaimed
induced by memcg limit could break user space timing expectations as those
usually expect execution in the userspace most of the time.
Also charges could be left behind on memcg removal. Any future charge on
those isolated cpus will drain that pcp cache so this won't be a permanent
leak.
Considering cons and pros of both approaches this patch is implementing
the second option and simply do not schedule remote draining if the target
cpu is isolated. This solution is much more simpler. It doesn't add any
new locking and it is more more predictable from the user space POV.
Should the pre-mature memcg OOM become a real life problem, we can revisit
this decision.
Link: https://lkml.kernel.org/r/20230317134448.11082-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: Leonardo Bras <leobras@redhat.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frederic Weisbecker [Fri, 17 Mar 2023 13:44:47 +0000 (14:44 +0100)]
sched/isolation: add cpu_is_isolated() API
Patch series "memcg, cpuisol: do not interfere pcp cache charges draining
with cpuisol workloads".
Leonardo has reported [1] that pcp memcg charge draining can interfere
with cpu isolated workloads. The said draining is done from a WQ context
with a pcp worker scheduled on each CPU which holds any cached charges for
a specific memcg hierarchy. Operation is not really a common operation
[2]. It can be triggered from the userspace though so some care is
definitely due.
Leonardo has tried to address the issue by allowing remote charge draining
[3]. This approach requires an additional locking to synchronize pcp
caches sync from a remote cpu from local pcp consumers. Even though the
proposed lock was per-cpu there is still potential for contention and less
predictable behavior.
This patchset addresses the issue from a different angle. Rather than
dealing with a potential synchronization, cpus which are isolated are
simply never scheduled to be drained. This means that a small amount of
charges could be laying around and waiting for a later use or they are
flushed when a different memcg is charged from the same cpu. More details
are in patch 2. The first patch from Frederic is implementing an
abstraction to tell whether a specific cpu has been isolated and therefore
require a special treatment.
This patch (of 2):
Provide this new API to check if a CPU has been isolated either through
isolcpus= or nohz_full= kernel parameter.
It aims at avoiding kernel load deemed to be safely spared on CPUs running
sensitive workload that can't bear any disturbance, such as pcp cache
draining.
Link: https://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20230317134448.11082-2-mhocko@kernel.org Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Leonardo Bras <leobras@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Thu, 16 Mar 2023 11:06:47 +0000 (19:06 +0800)]
mm: compaction: fix the possible deadlock when isolating hugetlb pages
When trying to isolate a migratable pageblock, it can contain several
normal pages or several hugetlb pages (e.g. CONT-PTE 64K hugetlb on arm64)
in a pageblock. That means we may hold the lru lock of a normal page to
continue to isolate the next hugetlb page by isolate_or_dissolve_huge_page()
in the same migratable pageblock.
However in the isolate_or_dissolve_huge_page(), it may allocate a new hugetlb
page and dissolve the old one by alloc_and_dissolve_hugetlb_folio() if the
hugetlb's refcount is zero. That means we can still enter the direct compaction
path to allocate a new hugetlb page under the current lru lock, which
may cause possible deadlock.
To avoid this possible deadlock, we should release the lru lock when
trying to isolate a hugetbl page. Moreover it does not make sense to take
the lru lock to isolate a hugetlb, which is not in the lru list.
Link: https://lkml.kernel.org/r/7ab3bffebe59fb419234a68dec1e4572a2518563.1678962352.git.baolin.wang@linux.alibaba.com Fixes: 369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Lam <william.lam@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Thu, 16 Mar 2023 11:06:46 +0000 (19:06 +0800)]
mm: compaction: consider the number of scanning compound pages in isolate fail path
commit b717d6b93b54 ("mm: compaction: include compound page count for
scanning in pageblock isolation") added compound page statistics for
scanning in pageblock isolation, to make sure the number of scanned pages
is always larger than the number of isolated pages when isolating
mirgratable or free pageblock.
However, when failing to isolate the pages when scanning the migratable or
free pageblocks, the isolation failure path did not consider the scanning
statistics of the compound pages, which result in showing the incorrect
number of scanned pages in tracepoints or in vmstats which will confuse
people about the page scanning pressure in memory compaction.
Thus we should take into account the number of scanning pages when failing
to isolate the compound pages to make the statistics accurate.
Link: https://lkml.kernel.org/r/73d6250a90707649cc010731aedc27f946d722ed.1678962352.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Lam <william.lam@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:58 +0000 (12:12 +0100)]
mm/mremap: simplify vma expansion again
This effectively reverts d014cd7c1c35 ("mm, mremap: fix mremap() expanding
for vma's with vm_ops->close()"). After the recent changes, vma_merge()
is able to handle the expansion properly even when the vma being expanded
has a vm_ops->close operation, so we don't need to special case it
anymore.
Link: https://lkml.kernel.org/r/20230309111258.24079-11-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:57 +0000 (12:12 +0100)]
mm/mmap: start distinguishing if vma can be removed in mergeability test
Since pre-git times, is_mergeable_vma() returns false for a vma with
vm_ops->close, so that no owner assumptions are violated in case the vma
is removed as part of the merge.
This check is currently very conservative and can prevent merging even
situations where vma can't be removed, such as simple expansion of
previous vma, as evidenced by commit d014cd7c1c35 ("mm, mremap: fix
mremap() expanding for vma's with vm_ops->close()")
In order to allow more merging when appropriate and simplify the code that
was made more complex by commit d014cd7c1c35, start distinguishing cases
where the vma can be really removed, and allow merging with vm_ops->close
otherwise.
As a first step, add a may_remove_vma parameter to is_mergeable_vma().
can_vma_merge_before() sets it to true, because when called from
vma_merge(), a removal of the vma is possible.
In can_vma_merge_after(), pass the parameter as false, because no
removal can occur in each of its callers:
- vma_merge() calls it on the 'prev' vma, which is never removed
- mmap_region() and do_brk_flags() call it to determine if it can expand
a vma, which is not removed
As a result, vma's with vm_ops->close may now merge with compatible ranges
in more situations than previously. We can also revert commit d014cd7c1c35 as the next step to simplify mremap code again.
Link: https://lkml.kernel.org/r/20230309111258.24079-10-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:55 +0000 (12:12 +0100)]
mm/mmap/vma_merge: rename adj_next to adj_start
The variable 'adj_next' holds the value by which we adjust vm_start of a
vma in variable 'adjust', that's either 'next' or 'mid', so the current
name is inaccurate. Rename it to 'adj_start'.
Link: https://lkml.kernel.org/r/20230309111258.24079-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:54 +0000 (12:12 +0100)]
mm/mmap/vma_merge: set mid to NULL if not applicable
There are several places where we test if 'mid' is really the area NNNN in
the diagram and the tests have two variants and are non-obvious to follow.
Instead, set 'mid' to NULL up-front if it's not the NNNN area, and
simplify the tests.
Also update the description in comment accordingly.
Link: https://lkml.kernel.org/r/20230309111258.24079-7-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:52 +0000 (12:12 +0100)]
mm/mmap/vma_merge: use the proper vma pointer in case 4
Almost all cases now use the 'next' pointer for the vma following the
merged area, and the cases diagram shows it as XXXX. Case 4 is different
as it uses 'mid' and NNNN, so change it for consistency. No functional
change.
Link: https://lkml.kernel.org/r/20230309111258.24079-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:51 +0000 (12:12 +0100)]
mm/mmap/vma_merge: use the proper vma pointers in cases 1 and 6
Case 1 is now shown in the comment as next vma being merged with prev, so
use 'next' instead of 'mid'. In case 1 they both point to the same vma.
As a consequence, in case 6, the dup_anon_vma() is now tried first on
'next' and then on 'mid', before it was the opposite order. This is not a
functional change, as those two vma's cannnot have a different anon_vma,
as that would have prevented the merging in the first place.
Link: https://lkml.kernel.org/r/20230309111258.24079-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:50 +0000 (12:12 +0100)]
mm/mmap/vma_merge: use the proper vma pointer in case 3
In case 3 we we use 'next' for everything but vma_pgoff. So use 'next'
for that as well, instead of 'mid', for consistency. Then in case 8 we
have to use 'mid' explicitly, which should also make the intent more
obvious.
Adjust the diagram for cases 1-3 in the comment to match the code - we are
using 'next' for case 3 so mark the range with XXXX instead of NNNN. For
case 2 that's a no-op as the code doesn't touch 'next' or 'mid'. For case
1 it's now wrong but that will be fixed next.
No functional change.
Link: https://lkml.kernel.org/r/20230309111258.24079-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 9 Mar 2023 11:12:49 +0000 (12:12 +0100)]
mm/mmap/vma_merge: use only primary pointers for preparing merge
Patch series "cleanup vma_merge() and improve mergeability tests".
My initial goal here was to try making the check for vm_ops->close in
is_mergeable_vma() only be applied for vma's that would be truly removed
as part of the merge (see Patch 9). This would then allow reverting the
quick fix d014cd7c1c35 ("mm, mremap: fix mremap() expanding for vma's with
vm_ops->close()"). This was successful enough to allow the revert (Patch
10). Checks using can_vma_merge_before() are still pessimistic about
possible vma removal, and making them precise would probably complicate
the vma_merge() code too much.
Liam's 6.3-rc1 simplification of vma_merge() and removal of __vma_adjust()
was very much helpful in understanding the vma_merge() implementation and
especially when vma removals can happen, which is now very obvious. While
studing the code, I've found ways to make it hopefully even more easy to
follow, so that's the patches 1-8. That made me also notice a bug that's
now already fixed in 6.3-rc1.
This patch (of 10):
In the merging preparation part of vma_merge(), some vma pointer variables
are assigned for later execution of the merge, but also read from in the
block itself. The code is easier follow and check against the cases
diagram in the comment if the code reads only from the "primary" vma
variables prev, mid, next instead. No functional change.
Axel Rasmussen [Tue, 14 Mar 2023 22:12:50 +0000 (15:12 -0700)]
mm: userfaultfd: add UFFDIO_CONTINUE_MODE_WP to install WP PTEs
UFFDIO_COPY already has UFFDIO_COPY_MODE_WP, so when installing a new PTE
to resolve a missing fault, one can install a write-protected one. This
is useful when using UFFDIO_REGISTER_MODE_{MISSING,WP} in combination.
This was motivated by testing HugeTLB HGM [1], and in particular its
interaction with userfaultfd features. Existing userfaultfd code supports
using WP and MINOR modes together (i.e. you can register an area with
both enabled), but without this CONTINUE flag the combination is in
practice unusable.
So, add an analogous UFFDIO_CONTINUE_MODE_WP, which does the same thing as
UFFDIO_COPY_MODE_WP, but for *minor* faults.
Update the selftest to do some very basic exercising of the new flag.
Update Documentation/ to describe how these flags are used (neither the
COPY nor the new CONTINUE versions of this mode flag were described there
before).
Link: https://lkml.kernel.org/r/20230314221250.682452-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Tue, 14 Mar 2023 22:12:49 +0000 (15:12 -0700)]
mm: userfaultfd: combine 'mode' and 'wp_copy' arguments
Many userfaultfd ioctl functions take both a 'mode' and a 'wp_copy'
argument. In future commits we plan to plumb the flags through to more
places, so we'd be proliferating the very long argument list even further.
Let's take the time to simplify the argument list. Combine the two
arguments into one - and generalize, so when we add more flags in the
future, it doesn't imply more function arguments.
Since the modes (copy, zeropage, continue) are mutually exclusive, store
them as an integer value (0, 1, 2) in the low bits. Place combine-able
flag bits in the high bits.
This is quite similar to an earlier patch proposed by Nadav Amit
("userfaultfd: introduce uffd_flags" [1]). The main difference is that
patch only handled flags, whereas this patch *also* combines the "mode"
argument into the same type to shorten the argument list.
Link: https://lkml.kernel.org/r/20230314221250.682452-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: James Houghton <jthoughton@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Tue, 14 Mar 2023 22:12:48 +0000 (15:12 -0700)]
mm: userfaultfd: don't pass around both mm and vma
Quite a few userfaultfd functions took both mm and vma pointers as
arguments. Since the mm is trivially accessible via vma->vm_mm, there's
no reason to pass both; it just needlessly extends the already long
argument list.
Get rid of the mm pointer, where possible, to shorten the argument list.
Link: https://lkml.kernel.org/r/20230314221250.682452-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Tue, 14 Mar 2023 22:12:47 +0000 (15:12 -0700)]
mm: userfaultfd: rename functions for clarity + consistency
Patch series "mm: userfaultfd: refactor and add UFFDIO_CONTINUE_MODE_WP",
v5.
- Commits 1-3 refactor userfaultfd ioctl code without behavior changes, with the
main goal of improving consistency and reducing the number of function args.
- Commit 4 adds UFFDIO_CONTINUE_MODE_WP.
This patch (of 4):
The basic problem is, over time we've added new userfaultfd ioctls, and
we've refactored the code so functions which used to handle only one case
are now re-used to deal with several cases. While this happened, we
didn't bother to rename the functions.
Similarly, as we added new functions, we cargo-culted pieces of the
now-inconsistent naming scheme, so those functions too ended up with names
that don't make a lot of sense.
A key point here is, "copy" in most userfaultfd code refers specifically
to UFFDIO_COPY, where we allocate a new page and copy its contents from
userspace. There are many functions with "copy" in the name that don't
actually do this (at least in some cases).
So, rename things into a consistent scheme. The high level idea is that
the call stack for userfaultfd ioctls becomes:
userfaultfd_ioctl
-> userfaultfd_(particular ioctl)
-> mfill_atomic_(particular kind of fill operation)
-> mfill_atomic /* loops over pages in range */
-> mfill_atomic_pte /* deals with single pages */
-> mfill_atomic_pte_(particular kind of fill operation)
-> mfill_atomic_install_pte
There are of course some special cases (shmem, hugetlb), but this is the
general structure which all function names now adhere to.
Link: https://lkml.kernel.org/r/20230314221250.682452-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230314221250.682452-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 15 Mar 2023 20:54:45 +0000 (13:54 -0700)]
mm-treewide-redefine-max_order-sanely-fix-2
fix another min_t warning
Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: kernel test robot <lkp@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill A. Shutemov [Wed, 15 Mar 2023 11:31:33 +0000 (14:31 +0300)]
mm, treewide: redefine MAX_ORDER sanely
MAX_ORDER currently defined as number of orders page allocator supports:
user can ask buddy allocator for page order between 0 and MAX_ORDER-1.
This definition is counter-intuitive and lead to number of bugs all over
the kernel.
Change the definition of MAX_ORDER to be inclusive: the range of orders
user can ask from buddy allocator is 0..MAX_ORDER now.
Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill A. Shutemov [Wed, 15 Mar 2023 11:31:32 +0000 (14:31 +0300)]
iommu: fix MAX_ORDER usage in __iommu_dma_alloc_pages()
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator
can deliver is MAX_ORDER-1.
Fix MAX_ORDER usage in __iommu_dma_alloc_pages().
Also use GENMASK() instead of hard to read "(2U << order) - 1" magic.
Link: https://lkml.kernel.org/r/20230315113133.11326-10-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill A. Shutemov [Wed, 15 Mar 2023 11:31:29 +0000 (14:31 +0300)]
perf/core: fix MAX_ORDER usage in rb_alloc_aux_page()
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator
can deliver is MAX_ORDER-1.
Fix MAX_ORDER usage in rb_alloc_aux_page().
Link: https://lkml.kernel.org/r/20230315113133.11326-7-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill A. Shutemov [Wed, 15 Mar 2023 11:31:25 +0000 (14:31 +0300)]
um: fix MAX_ORDER usage in linux_main()
MAX_ORDER is not inclusive: the maximum allocation order buddy allocator
can deliver is MAX_ORDER-1.
Fix MAX_ORDER usage in linux_main().
Link: https://lkml.kernel.org/r/20230315113133.11326-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Thu, 9 Mar 2023 22:37:11 +0000 (17:37 -0500)]
selftests/mm: smoke test UFFD_FEATURE_WP_UNPOPULATED
Enable it by default on the stress test, and add some smoke tests for the
pte markers on anonymous.
Link: https://lkml.kernel.org/r/20230309223711.823547-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 21 Mar 2023 20:09:26 +0000 (16:09 -0400)]
mm-uffd-uffd_feature_wp_unpopulated-fix
Two comment changes suggested by David, and also a oneliner fix to
khugepaged (to bail out anon thp collapsing when seeing pte markers).
Link: https://lkml.kernel.org/r/ZB2/8jPhD3fpx5U8@x1n Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Thu, 9 Mar 2023 22:37:10 +0000 (17:37 -0500)]
mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
Patch series "mm/uffd: Add feature bit UFFD_FEATURE_WP_UNPOPULATED", v4.
The new feature bit makes anonymous memory acts the same as file memory on
userfaultfd-wp in that it'll also wr-protect none ptes.
It can be useful in two cases:
(1) Uffd-wp app that needs to wr-protect none ptes like QEMU snapshot,
so pre-fault can be replaced by enabling this flag and speed up
protections
(2) It helps to implement async uffd-wp mode that Muhammad is working on [1]
It's debatable whether this is the most ideal solution because with the
new feature bit set, wr-protect none pte needs to pre-populate the
pgtables to the last level (PAGE_SIZE). But it seems fine so far to
service either purpose above, so we can leave optimizations for later.
The series brings pte markers to anonymous memory too. There's some
change in the common mm code path in the 1st patch, great to have some eye
looking at it, but hopefully they're still relatively straightforward.
This patch (of 2):
This is a new feature that controls how uffd-wp handles none ptes. When
it's set, the kernel will handle anonymous memory the same way as file
memory, by allowing the user to wr-protect unpopulated ptes.
File memories handles none ptes consistently by allowing wr-protecting of
none ptes because of the unawareness of page cache being exist or not.
For anonymous it was not as persistent because we used to assume that we
don't need protections on none ptes or known zero pages.
One use case of such a feature bit was VM live snapshot, where if without
wr-protecting empty ptes the snapshot can contain random rubbish in the
holes of the anonymous memory, which can cause misbehave of the guest when
the guest OS assumes the pages should be all zeros.
QEMU worked it around by pre-populate the section with reads to fill in
zero page entries before starting the whole snapshot process [1].
Recently there's another need raised on using userfaultfd wr-protect for
detecting dirty pages (to replace soft-dirty in some cases) [2]. In that
case if without being able to wr-protect none ptes by default, the dirty
info can get lost, since we cannot treat every none pte to be dirty (the
current design is identify a page dirty based on uffd-wp bit being
cleared).
In general, we want to be able to wr-protect empty ptes too even for
anonymous.
This patch implements UFFD_FEATURE_WP_UNPOPULATED so that it'll make
uffd-wp handling on none ptes being consistent no matter what the memory
type is underneath. It doesn't have any impact on file memories so far
because we already have pte markers taking care of that. So it only
affects anonymous.
The feature bit is by default off, so the old behavior will be maintained.
Sometimes it may be wanted because the wr-protect of none ptes will
contain overheads not only during UFFDIO_WRITEPROTECT (by applying pte
markers to anonymous), but also on creating the pgtables to store the pte
markers. So there's potentially less chance of using thp on the first
fault for a none pmd or larger than a pmd.
The major implementation part is teaching the whole kernel to understand
pte markers even for anonymously mapped ranges, meanwhile allowing the
UFFDIO_WRITEPROTECT ioctl to apply pte markers for anonymous too when the
new feature bit is set.
Note that even if the patch subject starts with mm/uffd, there're a few
small refactors to major mm path of handling anonymous page faults. But
they should be straightforward.
With WP_UNPOPUATED, application like QEMU can avoid pre-read faults all
the memory before wr-protect during taking a live snapshot. Quotting from
Muhammad's test result here [3] based on a simple program [4]:
(1) With huge page disabled
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
./uffd_wp_perf
Test DEFAULT: 4
Test PRE-READ: 1111453 (pre-fault 1101011)
Test MADVISE: 278276 (pre-fault 266378)
Test WP-UNPOPULATE: 11712
(2) With Huge page enabled
echo always > /sys/kernel/mm/transparent_hugepage/enabled
./uffd_wp_perf
Test DEFAULT: 4
Test PRE-READ: 22521 (pre-fault 22348)
Test MADVISE: 4909 (pre-fault 4743)
Test WP-UNPOPULATE: 14448
There'll be a great perf boost for no-thp case, while for thp enabled with
extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE,
but that's low possibility in reality, also the overhead was not reduced
but postponed until a follow up write on any huge zero thp, so potentially
it is faster by making the follow up writes slower.
Link: https://lkml.kernel.org/r/20230309223711.823547-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20230309223711.823547-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stefan Roesch [Tue, 14 Mar 2023 20:45:57 +0000 (13:45 -0700)]
docs/mm: extend ksm doc
This adds a description of the new prctl interface for KSM and also adds a
general section on security concerns.
Link: https://lkml.kernel.org/r/20230314204557.3863923-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stefan Roesch [Fri, 10 Mar 2023 18:28:50 +0000 (10:28 -0800)]
mm: add new KSM process and sysfs knobs
This adds the general_profit KSM sysfs knob and the process profit metric
and process merge type knobs to ksm_stat.
1) split off pages_volatile function
This splits off the pages_volatile function. The next patch will
use this function.
2) expose general_profit metric
The documentation mentions a general profit metric, however this
metric is not calculated. In addition the formula depends on the size
of internal structures, which makes it more difficult for an
administrator to make the calculation. Adding the metric for a better
user experience.
3) document general_profit sysfs knob
4) calculate ksm process profit metric
The ksm documentation mentions the process profit metric and how to
calculate it. This adds the calculation of the metric.
5) add ksm_merge_type() function
This adds the ksm_merge_type function. The function returns the
merge type for the process. For madvise it returns "madvise", for
prctl it returns "process" and otherwise it returns "none".
6) mm: expose ksm process profit metric and merge type in ksm_stat
This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
The name of the value is ksm_merge_type. The documentation mentions
the formula for the ksm process profit metric, however it does not
calculate it. In addition the formula depends on the size of internal
structures. So it makes sense to expose it.
Stefan Roesch [Fri, 10 Mar 2023 18:28:49 +0000 (10:28 -0800)]
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v4.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example for this are programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these type of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload by workload basis.
Use case 3:
With the madvise call sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportuniites may exist across multiple workloads or jobs.
Only a higler level entity like a job scheduler or container can know
for certain if its running one or more instances of a job. That job
scheduler however doesn't have the necessary internal worklaod knowledge
to make targeted madvise calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have the knowledge about
what else is running on a machine. Therefore it has to be very
conservative in what memory areas can be shared or not. However, if the
system is dedicated to running multiple jobs within the same security
domain, its the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around
20%.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one allows to enable KSM at the process level and the second
one to query the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMA's and enable KSM for the eligible VMA's.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
In addition when KSM is disabled for a process, KSM will be disabled
for the VMA's where KSM has been enabled.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit and ksm type metric to
/proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests
This adds an option to specify the merge type to the ksm_tests.
This allows to test madvise and prctl KSM. It also adds a new option
to query if prctl KSM has been enabled. It adds a fork test to verify
that the KSM process setting is inherited by client processes.
An update to the prctl(2) manpage has been proposed at [1].
This patch (of 3):
This adds a new prctl to API to enable and disable KSM on a per process
basis instead of only at the VMA basis (with madvise).
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (ksm) gets enabled for all vma's of a
process.
2) add flag to __ksm_enter
This change adds the flag parameter to __ksm_enter. This allows to
distinguish if ksm was called by prctl or madvise.
3) add flag to __ksm_exit call
This adds the flag parameter to the __ksm_exit() call. This allows
to distinguish if this call is for an prctl or madvise invocation.
4) invoke madvise for all vmas in scan_get_next_rmap_item
If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
over all the vmas and enable ksm if possible. For the vmas that can be
ksm enabled this is only done once.
5) support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been
enabled for the process.
6) add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call
- enable ksm for all vmas of a process (if the vmas support it).
- query if ksm has been enabled for a process.
Andrey Konovalov [Fri, 10 Mar 2023 23:43:33 +0000 (00:43 +0100)]
kasan: suppress recursive reports for HW_TAGS
KASAN suppresses reports for bad accesses done by the KASAN reporting
code. The reporting code might access poisoned memory for reporting
purposes.
Software KASAN modes do this by suppressing reports during reporting via
current->kasan_depth, the same way they suppress reports during accesses
to poisoned slab metadata.
Hardware Tag-Based KASAN does not use current->kasan_depth, and instead
resets pointer tags for accesses to poisoned memory done by the reporting
code.
Despite that, a recursive report can still happen:
1. On hardware with faulty MTE support. This was observed by Weizhao
Ouyang on a faulty hardware that caused memory tags to randomly change
from time to time.
2. Theoretically, due to a previous MTE-undetected memory corruption.
A recursive report can happen via:
1. Accessing a pointer with a non-reset tag in the reporting code, e.g.
slab->slab_cache, which is what Weizhao Ouyang observed.
2. Theoretically, via external non-annotated routines, e.g. stackdepot.
To resolve this issue, resetting tags for all of the pointers in the
reporting code and all the used external routines would be impractical.
Instead, disable tag checking done by the CPU for the duration of KASAN
reporting for Hardware Tag-Based KASAN.
Without this fix, Hardware Tag-Based KASAN reporting code might deadlock.
Andrey Konovalov [Fri, 10 Mar 2023 23:43:29 +0000 (00:43 +0100)]
kasan: drop empty tagging-related defines
mm/kasan/kasan.h provides a number of empty defines for a few
arch-specific tagging-related routines, in case the architecture code
didn't define them.
The original idea was to simplify integration in case another architecture
starts supporting memory tagging. However, right now, if any of those
routines are not provided by an architecture, Hardware Tag-Based KASAN
won't work.
Drop the empty defines, as it would be better to get compiler errors
rather than runtime crashes when adding support for a new architecture.
Also drop empty hw_enable_tagging_sync/async/asymm defines for
!CONFIG_KASAN_HW_TAGS case, as those are only used in mm/kasan/hw_tags.c.
Christoph Hellwig [Tue, 7 Mar 2023 14:34:08 +0000 (15:34 +0100)]
shmem: open code the page cache lookup in shmem_get_folio_gfp
Use the very low level filemap_get_entry helper to look up the entry in
the xarray, and then:
- don't bother locking the folio if only doing a userfault notification
- open code locking the page and checking for truncation in a related
code block
This will allow to eventually remove the FGP_ENTRY flag.
Link: https://lkml.kernel.org/r/20230307143410.28031-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Mon, 20 Mar 2023 05:19:21 +0000 (22:19 -0700)]
shmem: shmem_get_partial_folio use filemap_get_entry
To avoid use of the FGP_ENTRY flag, adapt shmem_get_partial_folio() to use
filemap_get_entry() and folio_lock() instead of __filemap_get_folio().
Update "page" in the comments there to "folio".
Link: https://lkml.kernel.org/r/9d1aaa4-1337-fb81-6f37-74ebc96f9ef@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Tue, 7 Mar 2023 14:34:06 +0000 (15:34 +0100)]
mm: use filemap_get_entry in filemap_get_incore_folio
filemap_get_incore_folio wants to look at the details of xa_is_value
entries, but doesn't need any of the other logic in filemap_get_folio.
Switch it to use the lower-level filemap_get_entry interface.
Link: https://lkml.kernel.org/r/20230307143410.28031-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Tue, 7 Mar 2023 14:34:04 +0000 (15:34 +0100)]
mm: don't look at xarray value entries in split_huge_pages_in_file
Patch series "return an ERR_PTR from __filemap_get_folio", v3.
__filemap_get_folio and its wrappers can return NULL for three different
conditions, which in some cases requires the caller to reverse engineer
the decision making. This is fixed by returning an ERR_PTR instead of
NULL and thus transporting the reason for the failure. But to make
that work we first need to ensure that no xa_value special case is
returned and thus return the FGP_ENTRY flag. It turns out that flag
is barely used and can usually be deal with in a better way.
This patch (of 7):
split_huge_pages_in_file never wants to do anything with the special value
enties. Switch to using filemap_get_folio to not even see them.
Raghavendra K T [Wed, 1 Mar 2023 12:19:02 +0000 (17:49 +0530)]
sched/numa: implement access PID reset logic
This helps to ensure that only recently accessed PIDs scan the VMAs.
Current implementation: (idea supported by PeterZ)
1. Accessing PID information is maintained in two windows.
access_pids[1] being newest.
2. Reset old access PID info i.e. access_pid[0] every (4 *
sysctl_numa_balancing_scan_delay) interval after initial scan delay
period expires.
The above interval seemed to be experimentally optimum since it avoids
frequent reset of access info as well as helps clearing the old access
info regularly. The reset logic is implemented in scan path.
Link: https://lkml.kernel.org/r/f7a675f66d1442d048b4216b2baf94515012c405.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Bharata B Rao <bharata@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Disha Talreja <dishaa.talreja@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Raghavendra K T [Wed, 1 Mar 2023 12:19:01 +0000 (17:49 +0530)]
sched/numa: enhance vma scanning logic
During Numa scanning make sure only relevant vmas of the tasks are
scanned.
Before:
All the tasks of a process participate in scanning the vma even if they
do not access vma in it's lifespan.
Now:
Except cases of first few unconditional scans, if a process do
not touch vma (exluding false positive cases of PID collisions)
tasks no longer scan all vma
Logic used:
1) 6 bits of PID used to mark active bit in vma numab status during
fault to remember PIDs accessing vma. (Thanks Mel)
2) Subsequently in scan path, vma scanning is skipped if current PID
had not accessed vma.
3) First two times we do allow unconditional scan to preserve earlier
behaviour of scanning.
Acknowledgement to Bharata B Rao <bharata@amd.com> for initial patch to
store pid information and Peter Zijlstra <peterz@infradead.org> (Usage of
test and set bit)
Link: https://lkml.kernel.org/r/092f03105c7c1d3450f4636b1ea350407f07640e.1677672277.git.raghavendra.kt@amd.com Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Disha Talreja <dishaa.talreja@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 1 Mar 2023 12:19:00 +0000 (17:49 +0530)]
sched/numa: apply the scan delay to every new vma
Pach series "sched/numa: Enhance vma scanning", v3.
The patchset proposes one of the enhancements to numa vma scanning
suggested by Mel. This is continuation of [3].
Reposting the rebased patchset to akpm mm-unstable tree (March 1)
Existing mechanism of scan period involves, scan period derived from
per-thread stats. Process Adaptive autoNUMA [1] proposed to gather NUMA
fault stats at per-process level to capture aplication behaviour better.
During that course of discussion, Mel proposed several ideas to enhance
current numa balancing. One of the suggestion was below
Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what threads
access a VMA. Skip VMAs that did not trap a fault. This would be
approximate because of PID collisions but would reduce scanning of areas
the thread is not interested in. The above suggestion intends not to
penalize threads that has no interest in the vma, thus reduce scanning
overhead.
V3 changes are mostly based on PeterZ comments (details below in changes)
Summary of patchset:
Current patchset implements:
1. Delay the vma scanning logic for newly created VMA's so that
additional overhead of scanning is not incurred for short lived tasks
(implementation by Mel)
2. Store the information of tasks accessing VMA in 2 windows. It is
regularly cleared in (4*sysctl_numa_balancing_scan_delay) interval.
The above time is derived from experimenting (Suggested by PeterZ) to
balance between frequent clearing vs obsolete access data
3. hash_32 used to encode task index accessing VMA information
4. VMA's acess information is used to skip scanning for the tasks
which had not accessed VMA
Changes since V2:
patch1:
- Renaming of structure, macro to function,
- Add explanation to heuristics
- Adding more details from result (PeterZ)
Patch2:
- Usage of test and set bit (PeterZ)
- Move storing access PID info to numa_migrate_prep()
- Add a note on fainess among tasks allowed to scan
(PeterZ)
Patch3:
- Maintain two windows of access PID information
(PeterZ supported implementation and Gave idea to extend
to N if needed)
Patch4:
- Apply hash_32 function to track VMA accessing PIDs (PeterZ)
Changes since RFC V1:
- Include Mel's vma scan delay patch
- Change the accessing pid store logic (Thanks Mel)
- Fencing structure / code to NUMA_BALANCING (David, Mel)
- Adding clearing access PID logic (Mel)
- Descriptive change log ( Mike Rapoport)
Things to ponder over:
==========================================
- Improvement to clearing accessing PIDs logic (discussed in-detail in
patch3 itself (Done in this patchset by implementing 2 window history)
- Current scan period is not changed in the patchset, so we do see
frequent tries to scan. Relaxing scan period dynamically could improve
results further.
Results:
Summary: Huge autonuma cost reduction seen in mmtest. Kernbench improvement
is more than 5% and huge system time (80%+) improvement from mmtest autonuma.
(dbench had huge std deviation to post)
Duration User 66017.43 67959.84
Duration System 30503.15 24657.03
Duration Elapsed 504.61 493.12
6.2.0-mmunstable-base 6.2.0-mmunstable-patched
Ops NUMA alloc hit 1738835089.00 1738780310.00
Ops NUMA alloc local 1738834448.00 1738779711.00
Ops NUMA base-page range updates 477310.00 392566.00
Ops NUMA PTE updates 477310.00 392566.00
Ops NUMA hint faults 96817.00 87555.00
Ops NUMA hint local faults % 10150.00 2192.00
Ops NUMA hint local percent 10.48 2.50
Ops NUMA pages migrated 86660.00 85363.00
Ops AutoNUMA cost 489.07 442.14
Duration User 396433.47 324835.96
Duration System 2808.70 376.66
Duration Elapsed 2258.61 2258.12
6.2.0-mmunstable-base 6.2.0-mmunstable-patched
Ops NUMA alloc hit 59921806.00 49623489.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 59920880.00 49622594.00
Ops NUMA base-page range updates 152259275.00 50075.00
Ops NUMA PTE updates 152259275.00 50075.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 154660352.00 39014.00
Ops NUMA hint local faults % 138550501.00 23139.00
Ops NUMA hint local percent 89.58 59.31
Ops NUMA pages migrated 8179067.00 14147.00
Ops AutoNUMA cost 774522.98 195.69
This patch (of 4):
Currently whenever a new task is created we wait for
sysctl_numa_balancing_scan_delay to avoid unnessary scanning overhead.
Extend the same logic to new or very short-lived VMAs.
Haifeng Xu [Tue, 28 Feb 2023 08:35:37 +0000 (08:35 +0000)]
cpuset: clean up cpuset_node_allowed
Commit 002f290627c2 ("cpuset: use static key better and convert to new
API") used __cpuset_node_allowed() instead of cpuset_node_allowed() to
check whether we can allocate on a memory node. Now this function isn't
used by anyone, so we can do the follow things to clean it up.
1. remove unused codes
2. rename __cpuset_node_allowed() to cpuset_node_allowed()
3. update comments in mm/page_alloc.c
Link: https://lkml.kernel.org/r/20230228083537.102665-1-haifeng.xu@shopee.com Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:32 +0000 (09:36 -0800)]
mm: separate vma->lock from vm_area_struct
vma->lock being part of the vm_area_struct causes performance regression
during page faults because during contention its count and owner fields
are constantly updated and having other parts of vm_area_struct used
during page fault handling next to them causes constant cache line
bouncing. Fix that by moving the lock outside of the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate cache
line still produce performance regression especially on NUMA machines.
Smallest regression was achieved when lock is placed in the fourth cache
line but that bloats vm_area_struct to 256 bytes.
Considering performance and memory impact, separate lock looks like the
best option. It increases memory footprint of each VMA but that can be
optimized later if the new size causes issues. Note that after this
change vma_init() does not allocate or initialize vma->lock anymore. A
number of drivers allocate a pseudo VMA on the stack but they never use
the VMA's lock, therefore it does not need to be allocated. The future
drivers which might need the VMA lock should use
vm_area_alloc()/vm_area_free() to allocate the VMA.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:31 +0000 (09:36 -0800)]
mm/mmap: free vm_area_struct without call_rcu in exit_mmap
call_rcu() can take a long time when callback offloading is enabled. Its
use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed.
Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it. Any other
possible user like oom-reaper or process_mrelease are already synchronized
using mmap_lock. Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().
Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.
Laurent Dufour [Mon, 6 Mar 2023 15:42:44 +0000 (16:42 +0100)]
powerpc/mm: fix mmap_lock bad unlock
When page fault is tried holding the per VMA lock, bad_access_pkey() and
bad_access() should not be called because it is assuming the mmap_lock is
held. In the case a bad access is detected, fall back to the default
path, grabbing the mmap_lock to handle the fault and report the error.
Laurent Dufour [Mon, 27 Feb 2023 17:36:30 +0000 (09:36 -0800)]
powerc/mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails. Copied from "x86/mm: try
VMA lock-based page fault handling first"
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:26 +0000 (09:36 -0800)]
mm: prevent userfaults to be handled under per-vma lock
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:25 +0000 (09:36 -0800)]
mm: prevent do_swap_page from handling page faults under VMA lock
Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:23 +0000 (09:36 -0800)]
mm: fall back to mmap_lock if vma->anon_vma is not yet set
When vma->anon_vma is not set, page fault handler will set it by either
reusing anon_vma of an adjacent VMA if VMAs are compatible or by
allocating a new one. find_mergeable_anon_vma() walks VMA tree to find a
compatible adjacent VMA and that requires not only the faulting VMA to be
stable but also the tree structure and other VMAs inside that tree.
Therefore locking just the faulting VMA is not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:22 +0000 (09:36 -0800)]
mm: introduce lock_vma_under_rcu to be used from arch-specific code
Introduce lock_vma_under_rcu function to lookup and lock a VMA during page
fault handling. When VMA is not found, can't be locked or changes after
being locked, the function returns NULL. The lookup is performed under
RCU protection to prevent the found VMA from being destroyed before the
VMA lock is acquired. VMA lock statistics are updated according to the
results. For now only anonymous VMAs can be searched this way. In other
cases the function returns NULL.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:21 +0000 (09:36 -0800)]
mm: introduce vma detached flag
Per-vma locking mechanism will search for VMA under RCU protection and
then after locking it, has to ensure it was not removed from the VMA tree
after we found it. To make this check efficient, introduce a
vma->detached flag to mark VMAs which were removed from the VMA tree.