Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tony Luck [Wed, 2 Jun 2021 03:52:41 +0000 (13:52 +1000)]
mm/memory-failure: use a mutex to avoid memory_failure() races
Patch series "mm,hwpoison: fix sending SIGBUS for Action Required MCE", v5.
I wrote this patchset to materialize what I think is the current allowable
solution mentioned by the previous discussion [1]. I simply borrowed
Tony's mutex patch and Aili's return code patch, then I queued another one
to find error virtual address in the best effort manner. I know that this
is not a perfect solution, but should work for some typical case.
There can be races when multiple CPUs consume poison from the same page.
The first into memory_failure() atomically sets the HWPoison page flag and
begins hunting for tasks that map this page. Eventually it invalidates
those mappings and may send a SIGBUS to the affected tasks.
But while all that work is going on, other CPUs see a "success" return
code from memory_failure() and so they believe the error has been handled
and continue executing.
Fix by wrapping most of the internal parts of memory_failure() in a mutex.
Link: https://lkml.kernel.org/r/20210521030156.2612074-1-nao.horiguchi@gmail.com Link: https://lkml.kernel.org/r/20210521030156.2612074-2-nao.horiguchi@gmail.com Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Aili Yao <yaoaili@kingsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jue Wang <juew@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liu Shixin [Wed, 2 Jun 2021 03:52:41 +0000 (13:52 +1000)]
mm/page_alloc: fix counting of managed_pages
commit f63661566fad ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if
the zone is empty") clears out zone->lowmem_reserve[] if zone is empty.
But when zone is not empty and sysctl_lowmem_reserve_ratio[i] is set to
zero, zone_managed_pages(zone) is not counted in the managed_pages either.
This is inconsistent with the description of lowmem_reserve, so fix it.
Link: https://lkml.kernel.org/r/20210527125707.3760259-1-liushixin2@huawei.com Fixes: f63661566fad ("mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reported-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dong Aisheng [Wed, 2 Jun 2021 03:52:40 +0000 (13:52 +1000)]
mm: drop SECTION_SHIFT in code comments
Actually SECTIONS_SHIFT is used in the kernel code, so the code comments
is strictly incorrect. And since commit bbeae5b05ef6 ("mm: move page
flags layout to separate header"), SECTIONS_SHIFT definition has been
moved to include/linux/page-flags-layout.h, since code itself looks quite
straighforward, instead of moving the code comment into the new place as
well, we just simply remove it.
This also fixed a checkpatch complain derived from the original code:
WARNING: please, no space before tabs
+ * SECTIONS_SHIFT ^I^I#bits space required to store a section #$
This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction. The old sysctl increased
both pcp->batch and pcp->high with the higher pcp->high potentially
reducing zone->lock contention. However, the higher pcp->batch value also
potentially increased allocation latency while the PCP was refilled. This
sysctl only adjusts pcp->high so that zone->lock contention is potentially
reduced but allocation latency during a PCP refill remains the same.
Mel Gorman [Wed, 2 Jun 2021 03:52:39 +0000 (13:52 +1000)]
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
When kswapd is active then direct reclaim is potentially active. In
either case, it is possible that a zone would be balanced if pages were
not trapped on PCP lists. Instead of draining remote pages, simply limit
the size of the PCP lists while kswapd is active.
Link: https://lkml.kernel.org/r/20210525080119.5455-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:39 +0000 (13:52 +1000)]
mm/page_alloc: scale the number of pages that are batch freed
When a task is freeing a large number of order-0 pages, it may acquire the
zone->lock multiple times freeing pages in batches. This may
unnecessarily contend on the zone lock when freeing very large number of
pages. This patch adapts the size of the batch based on the recent
pattern to scale the batch size for subsequent frees.
As the machines I used were not large enough to test this are not large
enough to illustrate a problem, a debugging patch shows patterns like the
following (slightly editted for clarity)
Baseline vanilla kernel
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
With patches
time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
Link: https://lkml.kernel.org/r/20210525080119.5455-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:39 +0000 (13:52 +1000)]
mm/page_alloc: adjust pcp->high after CPU hotplug events
The PCP high watermark is based on the number of online CPUs so the
watermarks must be adjusted during CPU hotplug. At the time of
hot-remove, the number of online CPUs is already adjusted but during
hot-add, a delta needs to be applied to update PCP to the correct value.
After this patch is applied, the high watermarks are adjusted correctly.
Mel Gorman [Wed, 2 Jun 2021 03:52:39 +0000 (13:52 +1000)]
mm/page_alloc: disassociate the pcp->high from pcp->batch -fix
Vlastimil Babka noted that __setup_per_zone_wmarks updating pcp->high did
not protect watermark-related sysctl handlers from a parallel memory
hotplug operations. This patch moves the PCP update to
setup_per_zone_wmarks and updates the PCP high value while protected by
the pcp_batch_high_lock mutex. As a side-effect, the zone_pcp_update
calls during memory hotplug operations becomes redundant and can be
removed.
This is a fix to the mmotm patch
mm-page_alloc-disassociate-the-pcp-high-from-pcp-batch.patch. It'll cause
a conflict with
mm-page_alloc-adjust-pcp-high-after-cpu-hotplug-events.patch but the
resolution is simple as the zone_pcp_update callers in
setup_per_zone_wmarks no longer exist.
Link: https://lkml.kernel.org/r/20210528105925.GN30378@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:39 +0000 (13:52 +1000)]
mm/page_alloc: disassociate the pcp->high from pcp->batch
The pcp high watermark is based on the batch size but there is no
relationship between them other than it is convenient to use early in
boot.
This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone while the batch
size remains the same to avoid increasing allocation latencies. The
intent behind the default pcp->high is "set the number of PCP pages such
that if they are all full that background reclaim is not started
prematurely".
Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events which is handled later in the
series.
Mel Gorman [Wed, 2 Jun 2021 03:52:38 +0000 (13:52 +1000)]
mm/page_alloc: delete vm.percpu_pagelist_fraction
Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.
The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and neither takes the
zone size into account or the number of CPUs local to a zone. With larger
zones and more CPUs per node, the contention is getting worse.
Furthermore, the fact that vm.percpu_pagelist_fraction adjusts both batch
and high values means that the sysctl can reduce zone lock contention but
also increase allocation latencies.
This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static. It
also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.
The motivation is partially to adjust to larger memory sizes but is also
driven by the fact that large batches of page freeing via release_pages()
often shows zone contention as a major part of the problem. Another is a
bug report based on an older kernel where a multi-terabyte process can
takes several minutes to exit. A workaround was to use
vm.percpu_pagelist_fraction to increase the pcp->high value but testing
indicated that a production workload could not use the same values because
of an increase in allocation latencies. Unfortunately, I cannot reproduce
this test case myself as the multi-terabyte machines are in active use but
it should alleviate the problem.
The series aims to address both and partially acts as a pre-requisite.
pcp only works with order-0 which is useless for SLUB (when using high
orders) and THP (unconditionally). To store high-order pages on PCP, the
pcp->high values need to be increased first.
This patch (of 6):
The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it
has a problem. While it can decrease contention, it can also increase
latency on the allocation side due to unreasonably large batch sizes.
This leads to games where an administrator adjusts
percpu_pagelist_fraction on the fly to work around contention and
allocation latency problems.
This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of
review, it's easier to remove this sysctl now and reintroduce a similar
sysctl later in the series that deals only with pcp->high.
Minchan Kim [Wed, 2 Jun 2021 03:52:38 +0000 (13:52 +1000)]
mm: page_alloc: dump migrate-failed pages only at -EBUSY
alloc_contig_dump_pages() aims for helping debugging page migration
failure by elevated page refcount compared to expected_count. (for the
detail, please look at migrate_page_move_mapping)
However, -ENOMEM is just the case that system is under memory pressure
state, not relevant with page refcount at all. Thus, the dumping page
list is not helpful for the debugging point of view.
Link: https://lkml.kernel.org/r/YKa2Wyo9xqIErpfa@google.com Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: John Dias <joaodias@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:38 +0000 (13:52 +1000)]
mm/page_alloc: avoid conflating IRQs disabled with zone->lock
Historically when freeing pages, free_one_page() assumed that callers had
IRQs disabled and the zone->lock could be acquired with spin_lock(). This
confuses the scope of what local_lock_irq is protecting and what
zone->lock is protecting in free_unref_page_list in particular.
This patch uses spin_lock_irqsave() for the zone->lock in free_one_page()
instead of relying on callers to have disabled IRQs.
free_unref_page_commit() is changed to only deal with PCP pages protected
by the local lock. free_unref_page_list() then first frees isolated pages
to the buddy lists with free_one_page() and frees the rest of the pages to
the PCP via free_unref_page_commit(). The end result is that
free_one_page() is no longer depending on side-effects of local_lock to be
correct.
Note that this may incur a performance penalty while memory hot-remove is
running but that is not a common operation.
[lkp@intel.com: Ensure CMA pages get addded to correct pcp list] Link: https://lkml.kernel.org/r/20210512095458.30632-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:38 +0000 (13:52 +1000)]
mm/page_alloc: explicitly acquire the zone lock in __free_pages_ok
__free_pages_ok() disables IRQs before calling a common helper
free_one_page() that acquires the zone lock. This is not safe according
to Documentation/locking/locktypes.rst and in this context, IRQ disabling
is not protecting a per_cpu_pages structure either or a local_lock would
be used.
This patch explicitly acquires the lock with spin_lock_irqsave instead of
relying on a helper. This removes the last instance of local_irq_save()
in page_alloc.c.
Link: https://lkml.kernel.org/r/20210512095458.30632-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:37 +0000 (13:52 +1000)]
mm/page_alloc: reduce duration that IRQs are disabled for VM counters
IRQs are left disabled for the zone and node VM event counters. This is
unnecessary as the affected counters are allowed to race for preemmption
and IRQs.
This patch reduces the scope of IRQs being disabled via
local_[lock|unlock]_irq on !PREEMPT_RT kernels. One
__mod_zone_freepage_state is still called with IRQs disabled. While this
could be moved out, it's not free on all architectures as some require
IRQs to be disabled for mod_zone_page_state on !PREEMPT_RT kernels.
Link: https://lkml.kernel.org/r/20210512095458.30632-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:37 +0000 (13:52 +1000)]
mm/page_alloc: batch the accounting updates in the bulk allocator
Now that the zone_statistics are simple counters that do not require
special protection, the bulk allocator accounting updates can be batch
updated without adding too much complexity with protected RMW updates or
using xchg.
Link: https://lkml.kernel.org/r/20210512095458.30632-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:37 +0000 (13:52 +1000)]
mm/vmstat: convert NUMA statistics to basic NUMA counters
NUMA statistics are maintained on the zone level for hits, misses, foreign
etc but nothing relies on them being perfectly accurate for functional
correctness. The counters are used by userspace to get a general overview
of a workloads NUMA behaviour but the page allocator incurs a high cost to
maintain perfect accuracy similar to what is required for a vmstat like
NR_FREE_PAGES. There even is a sysctl vm.numa_stat to allow userspace to
turn off the collection of NUMA statistics like NUMA_HIT.
This patch converts NUMA_HIT and friends to be NUMA events with similar
accuracy to VM events. There is a possibility that slight errors will be
introduced but the overall trend as seen by userspace will be similar.
The counters are no longer updated from vmstat_refresh context as it is
unnecessary overhead for counters that may never be read by userspace.
Note that counters could be maintained at the node level to save space but
it would have a user-visible impact due to /proc/zoneinfo.
[lkp@intel.com: Fix misplaced closing brace for !CONFIG_NUMA] Link: https://lkml.kernel.org/r/20210512095458.30632-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: Non-standard signature: Debugged-by:
#77: Debugged-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
WARNING: please, no spaces at the start of a line
#117: FILE: mm/page_alloc.c:128:
+ !defined(CONFIG_DEBUG_LOCK_ALLOC) &&^I^I\$
WARNING: please, no spaces at the start of a line
#118: FILE: mm/page_alloc.c:129:
+ !defined(CONFIG_PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT)$
total: 0 errors, 3 warnings, 26 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-page_alloc-convert-per-cpu-list-protection-to-local_lock-fix.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrii Nakryiko bisected the problem to the commit "mm/page_alloc: convert
per-cpu list protection to local_lock" currently staged in mmotm. In his
own words
The immediate problem is two different definitions of numa_node per-cpu
variable. They both are at the same offset within .data..percpu ELF
section, they both have the same name, but one of them is marked as
static and another as global. And one is int variable, while another
is struct pagesets. I'll look some more tomorrow, but adding Jiri and
Arnaldo for visibility.
The patch in question introduces a zero-sized per-cpu struct and while
this is not wrong, versions of pahole prior to 1.22 (unreleased) get
confused during BTF generation with two separate variables occupying the
same address.
This patch checks for older versions of pahole and forces struct pagesets
to be non-zero sized as a workaround when CONFIG_DEBUG_INFO_BTF is set. A
warning is emitted so that distributions can update pahole when 1.22 is
released.
Link: https://lkml.kernel.org/r/20210526080741.GW30378@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Michal Suchanek <msuchanek@suse.de> Reported-by: Hritik Vijay <hritikxx8@gmail.com> Debugged-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:36 +0000 (13:52 +1000)]
mm/page_alloc: convert per-cpu list protection to local_lock
There is a lack of clarity of what exactly
local_irq_save/local_irq_restore protects in page_alloc.c . It conflates
the protection of per-cpu page allocation structures with per-cpu vmstat
deltas.
This patch protects the PCP structure using local_lock which for most
configurations is identical to IRQ enabling/disabling. The scope of the
lock is still wider than it should be but this is decreased later.
It is possible for the local_lock to be embedded safely within struct
per_cpu_pages but it adds complexity to free_unref_page_list.
[lkp@intel.com: Make pagesets static] Link: https://lkml.kernel.org/r/20210512095458.30632-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:36 +0000 (13:52 +1000)]
mm/page_alloc: split per cpu page lists and zone stats -fix
mm/ is not W=1 clean for allnoconfig but the patch "mm/page_alloc: Split
per cpu page lists and zone stats" makes it worse with the following
warning
mm/vmstat.c: In function `zoneinfo_show_print':
mm/vmstat.c:1698:28: warning: variable `pzstats' set but not used [-Wunused-but-set-variable]
struct per_cpu_zonestat *pzstats;
^~~~~~~
This is a fix to the mmotm patch
mm-page_alloc-split-per-cpu-page-lists-and-zone-stats.patch.
Link: https://lkml.kernel.org/r/20210514144622.GA3735@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:52:36 +0000 (13:52 +1000)]
mm/page_alloc: split per cpu page lists and zone stats
The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock which is inconvenient and
causes some issues. For example, the PCP list and vmstat share the same
per-cpu space meaning that it's possible that vmstat updates dirty cache
lines holding per-cpu lists across CPUs unless padding is used. Second,
PREEMPT_RT does not want to disable IRQs for too long in the page
allocator.
This series splits the locking requirements and uses locks types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on
!PREEMPT_RT kernels.
Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst
local_irq_disable();
spin_lock(&lock);
The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and reenable them again immediately. To prevent a migration and the
per-cpu pointer going stale, migrate_disable is also needed. That is a
custom lock that is similar, but worse, than local_lock. Furthermore, on
PREEMPT_RT, it's undesirable to leave IRQs disabled for too long. By
converting to local_lock which disables migration on PREEMPT_RT, the
locking requirements can be separated and start moving the protections for
PCP, stats and the zone lock to PREEMPT_RT-safe equivalent locking. As a
bonus, local_lock also means that PROVE_LOCKING does something useful.
After that, it's obvious that zone_statistics incurs too much overhead and
leaves IRQs disabled for longer than necessary on !PREEMPT_RT kernels.
zone_statistics uses perfectly accurate counters requiring IRQs be
disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on
!PREEMPT_RT.
The bulk page allocator can then do stat updates in bulk with IRQs enabled
which should improve the efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work and the order
simply reflects the timing of when different series were implemented.
Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save so the
locking scope is a bit clearer. The one exception is that modifying
NR_FREE_PAGES still happens in places where it's known the IRQs are
disabled as it's harmless for PREEMPT_RT and would be expensive to split
the locking there.
No performance data is included because despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on a E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.
While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.
This patch (of 9):
The per-cpu page allocator lists and the per-cpu vmstat deltas are stored
in the same struct per_cpu_pages even though vmstats have no direct impact
on the per-cpu page lists. This is inconsistent because the vmstats for a
node are stored on a dedicated structure. The bigger issue is that the
per_cpu_pages structure is not cache-aligned and stat updates either cache
conflict with adjacent per-cpu lists incurring a runtime cost or padding
is required incurring a memory cost.
This patch splits the per-cpu pagelists and the vmstat deltas into
separate structures. It's mostly a mechanical conversion but some
variable renaming is done to clearly distinguish the per-cpu pages
structure (pcp) from the vmstats (pzstats).
Superficially, this appears to increase the size of the per_cpu_pages
structure but the movement of expire fills a structure hole so there is no
impact overall.
[lkp@intel.com: Check struct per_cpu_zonestat has a non-zero size]
[vbabka@suse.cz: Init zone->per_cpu_zonestats properly] Link: https://lkml.kernel.org/r/20210512095458.30632-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210512095458.30632-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrii Nakryiko [Wed, 2 Jun 2021 03:52:35 +0000 (13:52 +1000)]
kbuild: skip per-CPU BTF generation for pahole v1.18-v1.21
Commit "mm/page_alloc: convert per-cpu list protection to local_lock" will
introduce a zero-sized per-CPU variable, which causes pahole to generate
invalid BTF. Only pahole versions 1.18 through 1.21 are impacted, as
before 1.18 pahole doesn't know anything about per-CPU variables, and 1.22
contains the proper fix for the issue.
Luckily, pahole 1.18 got --skip_encoding_btf_vars option disabling BTF
generation for per-CPU variables in anticipation of some unanticipated
problems. So use this escape hatch to disable per-CPU var BTF info on
those problematic pahole versions. Users relying on availability of
per-CPU var BTFs would need to upgrade to pahole 1.22+, but everyone won't
notice any regressions.
Link: https://lkml.kernel.org/r/20210530002536.3193829-1-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Hao Luo <haoluo@google.com> Cc: Michal Suchanek <msuchanek@suse.de> Cc: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Heiner Kallweit [Wed, 2 Jun 2021 03:52:35 +0000 (13:52 +1000)]
mm/page_alloc: switch to pr_debug
Having such debug messages in the dmesg log may confuse users. Therefore
restrict debug output to cases where DEBUG is defined or dynamic debugging
is enabled for the respective code piece.
Matthew Wilcox (Oracle) [Wed, 2 Jun 2021 03:52:35 +0000 (13:52 +1000)]
mm: optimise nth_page for contiguous memmap
If the memmap is virtually contiguous (either because we're using a
virtually mapped memmap or because we don't support a discontig memmap at
all), then we can implement nth_page() by simple addition. Contrary to
popular belief, the compiler is not able to optimise this itself for a
vmemmap configuration. This reduces one example user (sg.c) by four
instructions:
Link: https://lkml.kernel.org/r/20210413194625.1472345-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Douglas Gilbert <dougg@torque.net> Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Matthew Wilcox (Oracle) [Wed, 2 Jun 2021 03:52:34 +0000 (13:52 +1000)]
mm: make compound_head const-preserving
If you pass a const pointer to compound_head(), you get a const pointer
back; if you pass a mutable pointer, you get a mutable pointer back. Also
remove an unnecessary forward definition of struct page; we're about to
dereference page->compound_head, so it must already have been defined.
Link: https://lkml.kernel.org/r/20210416231531.2521383-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Matthew Wilcox (Oracle) [Wed, 2 Jun 2021 03:52:34 +0000 (13:52 +1000)]
mm/debug: factor PagePoisoned out of __dump_page
Move the PagePoisoned test into dump_page(). Skip the hex print for
poisoned pages -- we know they're full of ffffffff. Move the reason
printing from __dump_page() to dump_page().
Link: https://lkml.kernel.org/r/20210416231531.2521383-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Aaron Tomlin [Wed, 2 Jun 2021 03:52:34 +0000 (13:52 +1000)]
mm/page_alloc: bail out on fatal signal during reclaim/compaction retry attempt
A customer experienced a low-memory situation and decided to issue a
SIGKILL (i.e. a fatal signal). Instead of promptly terminating as one
would expect, the aforementioned task remained unresponsive.
Further investigation indicated that the task was "stuck" in the
reclaim/compaction retry loop. Now, it does not make sense to retry
compaction when a fatal signal is pending.
In the context of try_to_compact_pages(), indeed COMPACT_SKIPPED can be
returned; albeit, not every zone, on the zone list, would be considered in
the case a fatal signal is found to be pending. Yet, in
should_compact_retry(), given the last known compaction result, each zone,
on the zone list, can be considered/or checked (see
compaction_zonelist_suitable()). For example, if a zone was found to
succeed, then reclaim/compaction would be tried again (notwithstanding the
above).
This patch ensures that compaction is not needlessly retried irrespective
of the last known compaction result e.g. if it was skipped, in the
unlikely case a fatal signal is found pending. So, OOM is at least
attempted.
Link: https://lkml.kernel.org/r/20210520142901.3371299-1-atomlin@redhat.com Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Matthew Wilcox (Oracle) [Wed, 2 Jun 2021 03:52:34 +0000 (13:52 +1000)]
mm: make __dump_page static
Patch series "Constify struct page arguments".
While working on various solutions to the 32-bit struct page size
regression, one of the problems I found was the networking stack expects
to be able to pass const struct page pointers around, and the mm doesn't
provide a lot of const-friendly functions to call. The root tangle of
problems is that a lot of functions call VM_BUG_ON_PAGE(), which calls
dump_page(), which calls a lot of functions which don't take a const
struct page (but could be const).
This patch (of 6):
The only caller of __dump_page() now opencodes dump_page(), so remove it
as an externally visible symbol.
Mike Rapoport [Wed, 2 Jun 2021 03:52:33 +0000 (13:52 +1000)]
mm/mmzone.h: simplify is_highmem_idx()
There is a lot of historical ifdefery in is_highmem_idx() and its helper
zone_movable_is_highmem() that was required because of two different paths
for nodes and zones initialization that were selected at compile time.
Until commit 3f08a302f533 ("mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP
option") the movable_zone variable was only available for configurations
that had CONFIG_HAVE_MEMBLOCK_NODE_MAP enabled so the test in
zone_movable_is_highmem() used that variable only for such configurations.
For other configurations the test checked if the index of ZONE_MOVABLE
was greater by 1 than the index of ZONE_HIGMEM and then movable zone was
considered a highmem zone. Needless to say, ZONE_MOVABLE - 1 equals
ZONE_HIGHMEM by definition when CONFIG_HIGHMEM=y.
Commit 3f08a302f533 ("mm: remove CONFIG_HAVE_MEMBLOCK_NODE_MAP option")
made movable_zone variable always available. Since this variable is set
to ZONE_HIGHMEM if CONFIG_HIGHMEM is enabled and highmem zone is
populated, it is enough to check whether
Rasmus Villemoes [Wed, 2 Jun 2021 03:52:33 +0000 (13:52 +1000)]
mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array
In the event that somebody would call this with an already fully populated
page_array, the last loop iteration would do an access beyond the end of
page_array.
It's of course extremely unlikely that would ever be done, but this
triggers my internal static analyzer. Also, if it really is not supposed
to be invoked this way (i.e., with no NULL entries in page_array), the
nr_populated<nr_pages check could simply be removed instead.
Link: https://lkml.kernel.org/r/20210507064504.1712559-1-linux@rasmusvillemoes.dk Fixes: 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk page allocator") Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Wed, 2 Jun 2021 03:52:33 +0000 (13:52 +1000)]
fix for "printk: introduce dump_stack_lvl()"
Add missing dump_stack_lvl() stub if CONFIG_PRINTK=n.
Link: https://lkml.kernel.org/r/YJ0KAM0hQev1AmWe@elver.google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Potapenko [Wed, 2 Jun 2021 03:52:33 +0000 (13:52 +1000)]
printk: introduce dump_stack_lvl()
dump_stack() is used for many different cases, which may require a log
level consistent with other kernel messages surrounding the dump_stack()
call. Without that, certain systems that are configured to ignore the
default level messages will miss stack traces in critical error reports.
This patch introduces dump_stack_lvl() that behaves similarly to
dump_stack(), but accepts a custom log level. The old dump_stack()
becomes equal to dump_stack_lvl(KERN_DEFAULT).
A somewhat similar patch has been proposed in 2012:
https://lore.kernel.org/lkml/1332493269.2359.9.camel@hebo/ , but wasn't
merged.
Link: https://lkml.kernel.org/r/20210506105405.3535023-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: he, bo <bo.he@intel.com> Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Prasad Sodagudi <psodagud@quicinc.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Uladzislau Rezki [Wed, 2 Jun 2021 03:52:32 +0000 (13:52 +1000)]
mm/vmalloc: Fallback to a single page allocator
Currently for order-0 pages we use a bulk-page allocator to get set of
pages. From the other hand not allocating all pages is something that
might occur. In that case we should fallbak to the single-page allocator
trying to get missing pages, because it is more permissive(direct reclaim,
etc).
Introduce a vm_area_alloc_pages() function where the described logic is
implemented.
Link: https://lkml.kernel.org/r/20210521130718.GA17882@pc638.lan Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Mel Gorman <mgorman@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
mm/vmalloc: remove quoted strings split across lines
A checkpatch.pl script complains on splitting a text across lines. It is
because if a user wants to find an entire string he or she will not
succeeded.
<snip>
WARNING: quoted string split across lines
+ "vmalloc size %lu allocation failure: "
+ "page order %u allocation failed",
mm/vmalloc: print a warning message first on failure
When a memory allocation for array of pages are not succeed emit a warning
message as a first step and then perform the further cleanup.
The reason it should be done in a right order is the clean up function
which is free_vm_area() can potentially also follow its error paths what
can lead to confusion what was broken first.
Link: https://lkml.kernel.org/r/20210516202056.2120-4-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()
Recently there has been introduced a page bulk allocator for users which
need to get number of pages per one call request.
For order-0 pages switch to an alloc_pages_bulk_array_node() instead of
alloc_pages_node(), the reason is the former is not capable of allocating
set of pages, thus a one call is per one page.
Second, according to my tests the bulk allocator uses less cycles even for
scenarios when only one page is requested. Running the "perf" on same
test case shows below difference:
Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dong Aisheng <aisheng.dong@nxp.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dong Aisheng <aisheng.dong@nxp.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dong Aisheng [Wed, 2 Jun 2021 03:52:31 +0000 (13:52 +1000)]
mm: rename the global section array to mem_sections
In order to distinguish the struct mem_section for a better code
readability and align with kernel doc [1] name below, change the global
mem section name to 'mem_sections' from 'mem_section'.
[1] Documentation/vm/memory-model.rst
"The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`."
Link: https://lkml.kernel.org/r/20210531091908.1738465-5-aisheng.dong@nxp.com Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Move TLB flush outside page table lock so that kernel does less with page
table lock held. Releasing the ptl with old TLB contents still valid will
behave such that such access happened before the level3 or level2 entry
update.
Link: https://lkml.kernel.org/r/20210422054323.150993-8-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
mm/mremap: use range flush that does TLB and page walk cache flush
Some architectures do have the concept of page walk cache which need to be
flush when updating higher levels of page tables. A fast mremap that
involves moving page table pages instead of copying pte entries should
flush page walk cache since the old translation cache is no more valid.
Add new helper flush_pte_tlb_pwc_range() which invalidates both TLB and
page walk cache where TLB entries are mapped with page size PAGE_SIZE.
Link: https://lkml.kernel.org/r/20210422054323.150993-7-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Update _tlbiel_pid() such that we can avoid build errors like below when
using this function in other places.
arch/powerpc/mm/book3s64/radix_tlb.c: In function `__radix__flush_tlb_range_psize':
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: warning: `asm' operand 3 probably does not match constraints
114 | asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
| ^~~
arch/powerpc/mm/book3s64/radix_tlb.c:114:2: error: impossible constraint in `asm'
make[4]: *** [scripts/Makefile.build:271: arch/powerpc/mm/book3s64/radix_tlb.o] Error 1
m
With this fix, we can also drop the __always_inline in
__radix_flush_tlb_range_psize which was added by commit e12d6d7d46a6
("powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as
__always_inline").
Link: https://lkml.kernel.org/r/20210422054323.150993-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
mm/mremap: use pmd/pud_poplulate to update page table entries
pmd/pud_populate is the right interface to be used to set the respective
page table entries. Some architectures like ppc64 do assume that
set_pmd/= pud_at can only be used to set a hugepage PTE. Since we are not
setting up a hug= epage PTE here, use the pmd/pud_populate interface.
Link: https://lkml.kernel.org/r/20210422054323.150993-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
selftest/mremap_test: avoid crash with static build
With a large mmap map size, we can overlap with the text area and using
MAP_FIXED results in unmapping that area. Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.
Link: https://lkml.kernel.org/r/20210422054323.150993-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
selftest/mremap_test: update the test to handle pagesize other than 4K
Patch series "Speedup mremap on ppc64:, v5.
This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
the platform to support updating higher-level page tables without updating
page table entries. This also needs to invalidate the Page Walk Cache on
architectures supporting the same.
Peter Collingbourne [Wed, 2 Jun 2021 03:52:28 +0000 (13:52 +1000)]
mm: improve mprotect(R|W) efficiency on pages referenced once
In the Scudo memory allocator [1] we would like to be able to detect
use-after-free vulnerabilities involving large allocations by issuing
mprotect(PROT_NONE) on the memory region used for the allocation when it
is deallocated. Later on, after the memory region has been "quarantined"
for a sufficient period of time we would like to be able to use it for
another allocation by issuing mprotect(PROT_READ|PROT_WRITE).
Before this patch, after removing the write protection, any writes to the
memory region would result in page faults and entering the copy-on-write
code path, even in the usual case where the pages are only referenced by a
single PTE, harming performance unnecessarily. Make it so that any pages
in anonymous mappings that are only referenced by a single PTE are
immediately made writable during the mprotect so that we can avoid the
page faults.
This program shows the critical syscall sequence that we intend to use in
the allocator:
#include <string.h>
#include <sys/mman.h>
enum { kSize = 131072 };
int main(int argc, char **argv) {
char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
for (int i = 0; i != 100000; ++i) {
memset(addr, i, kSize);
mprotect((void *)addr, kSize, PROT_NONE);
mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
}
}
The effect of this patch on the above program was measured on a
DragonBoard 845c by taking the median real time execution time of 10 runs.
Before: 2.94s
After: 0.66s
The effect was also measured using one of the microbenchmarks that we
normally use to benchmark the allocator [2], after modifying it to make
the appropriate mprotect calls [3]. With an allocation size of 131072
bytes to trigger the allocator's "large allocation" code path the
per-iteration time was measured as follows:
Before: 27450ns
After: 6010ns
This patch means that we do more work during the mprotect call itself in
exchange for less work when the pages are accessed. In the worst case,
the pages are not accessed at all. The effect of this patch in such cases
was measured using the following program:
#include <string.h>
#include <sys/mman.h>
enum { kSize = 131072 };
int main(int argc, char **argv) {
char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
memset(addr, 1, kSize);
for (int i = 0; i != 100000; ++i) {
#ifdef PAGE_FAULT
memset(addr + (i * 4096) % kSize, i, 4096);
#endif
mprotect((void *)addr, kSize, PROT_NONE);
mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
}
}
With PAGE_FAULT undefined (0 pages touched after removing write
protection) the median real time execution time of 100 runs was measured
as follows:
Before: 0.330260s
After: 0.338836s
With PAGE_FAULT defined (1 page touched) the measurements were
as follows:
Before: 0.438048s
After: 0.355661s
So it seems that even with a single page fault the new approach is faster.
I saw similar results if I adjusted the programs to use a larger mapping
size. With kSize = 1048576 I get these numbers with PAGE_FAULT undefined:
Before: 1.428988s
After: 1.512016s
i.e. around 5.5%.
And these with PAGE_FAULT defined:
Before: 1.518559s
After: 1.524417s
i.e. about the same.
What I think we may conclude from these results is that for smaller
mappings the advantage of the previous approach, although measurable, is
wiped out by a single page fault. I think we may expect that there should
be at least one access resulting in a page fault (under the previous
approach) after making the pages writable, since the program presumably
made the pages writable for a reason.
For larger mappings we may guesstimate that the new approach wins if the
density of future page faults is > 0.4%. But for the mappings that are
large enough for density to matter (not just the absolute number of page
faults) it doesn't seem like the increase in mprotect latency would be
very large relative to the total mprotect execution time.
Link: https://lkml.kernel.org/r/20210527190453.1259020-1-pcc@google.com Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
Link: [1] https://source.android.com/devices/tech/debug/scudo
Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2 Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Kostya Kortchinsky <kostyak@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:27 +0000 (13:52 +1000)]
nouveau/svm: implement atomic SVM access
Some NVIDIA GPUs do not support direct atomic access to system memory via
PCIe. Instead this must be emulated by granting the GPU exclusive access
to the memory. This is achieved by replacing CPU page table entries with
special swap entries that fault on userspace access.
The driver then grants the GPU permission to update the page undergoing
atomic access via the GPU page tables. When CPU access to the page is
required a CPU fault is raised which calls into the device driver via MMU
notifiers to revoke the atomic access. The original page table entries
are then restored allowing CPU access to proceed.
Link: https://lkml.kernel.org/r/20210524132725.12697-11-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ben Skeggs <bskeggs@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:27 +0000 (13:52 +1000)]
nouveau/svm: refactor nouveau_range_fault
Call mmu_interval_notifier_insert() as part of nouveau_range_fault().
This doesn't introduce any functional change but makes it easier for a
subsequent patch to alter the behaviour of nouveau_range_fault() to
support GPU atomic operations.
Link: https://lkml.kernel.org/r/20210524132725.12697-10-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ben Skeggs <bskeggs@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Colin Ian King [Wed, 2 Jun 2021 03:52:27 +0000 (13:52 +1000)]
mm: selftests: fix potential integer overflow on shift of a int
The left shift of the int mapped is evaluated using 32 bit arithmetic and
then assigned to an unsigned long. In the case where mapped is 0x80000
when PAGE_SHIFT is 12 will lead to the upper bits being sign extended in
the unsigned long. Larger values can lead to an int overflow. Avoid this
by making mapped an unsigned long.
Addresses-Coverity: ("Uninitentional integer overflow") Link: https://lkml.kernel.org/r/20210526170530.3766167-1-colin.king@canonical.com Fixes: 8b2a105c3794 ("mm: selftests for exclusive device memory") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:26 +0000 (13:52 +1000)]
mm: device exclusive memory access
Some devices require exclusive write access to shared virtual memory (SVM)
ranges to perform atomic operations on that memory. This requires CPU
page tables to be updated to deny access whilst atomic operations are
occurring.
In order to do this introduce a new swap entry type
(SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive
access by a device all page table mappings for the particular range are
replaced with device exclusive swap entries. This causes any CPU access
to the page to result in a fault.
Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive access
to the region.
Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.
Link: https://lkml.kernel.org/r/20210524132725.12697-8-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:26 +0000 (13:52 +1000)]
mm/memory.c: allow different return codes for copy_nonpresent_pte()
Currently if copy_nonpresent_pte() returns a non-zero value it is assumed
to be a swap entry which requires further processing outside the loop in
copy_pte_range() after dropping locks. This prevents other values being
returned to signal conditions such as failure which a subsequent change
requires.
Instead make copy_nonpresent_pte() return an error code if further
processing is required and read the value for the swap entry in the main
loop under the ptl.
Link: https://lkml.kernel.org/r/20210524132725.12697-7-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:25 +0000 (13:52 +1000)]
mm: rename migrate_pgmap_owner
MMU notifier ranges have a migrate_pgmap_owner field which is used by
drivers to store a pointer. This is subsequently used by the driver
callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
field to 'owner' and create a new notifier initialisation function to
initialise this field.
Link: https://lkml.kernel.org/r/20210524132725.12697-6-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:25 +0000 (13:52 +1000)]
mm/rmap: split migration into its own function
Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.
However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a separate
function reduces the complexity of try_to_unmap_one() making it more
readable.
Several simplifications can also be made in try_to_migrate_one() based on
the following observations:
- All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
- No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
- No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
for PageAnon or try_to_unmap().
Link: https://lkml.kernel.org/r/20210524132725.12697-5-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:25 +0000 (13:52 +1000)]
mm/rmap: split try_to_munlock from try_to_unmap
The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used in
different combinations.
TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
it's own function and remove the flag.
Link: https://lkml.kernel.org/r/20210524132725.12697-4-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:24 +0000 (13:52 +1000)]
mm/swapops: rework swap entry manipulation code
Both migration and device private pages use special swap entries that are
manipluated by a range of inline functions. The arguments to these are
somewhat inconsitent so rework them to remove flag type arguments and to
make the arguments similar for both read and write entry creation.
Link: https://lkml.kernel.org/r/20210524132725.12697-3-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alistair Popple [Wed, 2 Jun 2021 03:52:24 +0000 (13:52 +1000)]
mm: remove special swap entry functions
Patch series "Add support for SVM atomics in Nouveau", v9.
Introduction
============
Some devices have features such as atomic PTE bits that can be used to
implement atomic access to system memory. To support atomic operations to
a shared virtual memory page such a device needs access to that page which
is exclusive of the CPU. This series introduces a mechanism to
temporarily unmap pages granting exclusive access to a device.
These changes are required to support OpenCL atomic operations in Nouveau
to shared virtual memory (SVM) regions allocated with the
CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
OpenCL SVM feature is available at
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
OpenCL_API.html#_shared_virtual_memory .
Implementation
==============
Exclusive device access is implemented by adding a new swap entry type
(SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
difference is that on fault the original entry is immediately restored by
the fault handler instead of waiting.
Restoring the entry triggers calls to MMU notifers which allows a device
driver to revoke the atomic access permission from the GPU prior to the
CPU finalising the entry.
Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
functionality into separate functions - try_to_migrate_one() and
try_to_munlock_one().
Patch 5 renames some existing code but does not introduce functionality.
Patch 6 is a small clean-up to swap entry handling in copy_pte_range().
Patch 7 contains the bulk of the implementation for device exclusive
memory.
Patch 8 contains some additions to the HMM selftests to ensure everything
works as expected.
Patch 9 is a cleanup for the Nouveau SVM implementation.
Patch 10 contains the implementation of atomic access for the Nouveau
driver.
Testing
=======
This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
which checks that GPU atomic accesses to system memory are atomic.
Without this series the test fails as there is no way of write-protecting
the page mapping which results in the device clobbering CPU writes. For
reference the test is available at
https://ozlabs.org/~apopple/opencl_svm_atomics/
Further testing has been performed by adding support for testing exclusive
access to the hmm-tests kselftests.
This patch (of 10):
Remove multiple similar inline functions for dealing with different types
of special swap entries.
Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct page
for each swap entry type use a common function pfn_swap_entry_to_page().
Also open-code the various entry_to_pfn() functions as this results is
shorter code that is easier to understand.
Link: https://lkml.kernel.org/r/20210524132725.12697-1-apopple@nvidia.com Link: https://lkml.kernel.org/r/20210524132725.12697-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:23 +0000 (13:52 +1000)]
mm/memory.c: use vma_lookup() in __access_remote_vm()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-22-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:23 +0000 (13:52 +1000)]
mm/mremap: use vma_lookup() in vma_to_resize()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-21-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:23 +0000 (13:52 +1000)]
mm/migrate: use vma_lookup() in do_pages_stat_array()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-20-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:22 +0000 (13:52 +1000)]
mm/ksm: use vma_lookup() in find_mergeable_vma()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-19-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:22 +0000 (13:52 +1000)]
lib/test_hmm: use vma_lookup() in dmirror_migrate()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-18-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:22 +0000 (13:52 +1000)]
kernel/events/uprobes: use vma_lookup() in find_active_uprobe()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-17-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:21 +0000 (13:52 +1000)]
misc/sgi-gru/grufault: use vma_lookup() in gru_find_vma()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-16-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:21 +0000 (13:52 +1000)]
drm/amdgpu: use vma_lookup() in amdgpu_ttm_tt_get_user_pages()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-14-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:21 +0000 (13:52 +1000)]
net/ipv5/tcp: use vma_lookup() in tcp_zerocopy_receive()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-13-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:20 +0000 (13:52 +1000)]
x86/sgx: use vma_lookup() in sgx_encl_find()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-10-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:20 +0000 (13:52 +1000)]
arch/m68k/kernel/sys_m68k: use vma_lookup() in sys_cacheflush()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-9-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:19 +0000 (13:52 +1000)]
arch/mips/kernel/traps: use vma_lookup() instead of find_vma()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-8-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:19 +0000 (13:52 +1000)]
arch/arc/kernel/troubleshoot: use vma_lookup() instead of find_vma()
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
Link: https://lkml.kernel.org/r/20210521174745.2219620-4-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:18 +0000 (13:52 +1000)]
drm/i915/selftests: use vma_lookup() in __igt_mmap()
vma_lookup() will look up the vma at a specific address. find_vma() will
start the search for a specific address and continue upwards. This fixes
an issue with the selftest as the returned vma may not be the newly
created vma, but simply the vma at a higher address.
Link: https://lkml.kernel.org/r/20210521174745.2219620-3-Liam.Howlett@Oracle.com Fixes: 6fedafacae1b (drm/i915/selftests: Wrap vm_mmap() around GEM
objects Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Many places in the kernel use find_vma() to get a vma and then check the
start address of the vma to ensure the next vma was not returned.
Other places use the find_vma_intersection() call with add, addr + 1 as
the range; looking for just the vma at a specific address.
The third use of find_vma() is by developers who do not know that the
function starts searching at the provided address upwards for the next
vma. This results in a bug that is often overlooked for a long time.
Adding the new vma_lookup() function will allow for cleaner code by
removing the find_vma() calls which check limits, making
find_vma_intersection() calls of a single address to be shorter, and
potentially reduce the incorrect uses of find_vma().
This patch (of 22):
Many places in the kernel use find_vma() to get a vma and then check the
start address of the vma to ensure the next vma was not returned.
Other places use the find_vma_intersection() call with add, addr + 1 as
the range; looking for just the vma at a specific address.
The third use of find_vma() is by developers who do not know that the
function starts searching at the provided address upwards for the next
vma. This results in a bug that is often overlooked for a long time.
Adding the new vma_lookup() function will allow for cleaner code by
removing the find_vma() calls which check limits, making
find_vma_intersection() calls of a single address to be shorter, and
potentially reduce the incorrect uses of find_vma().
Also change find_vma_intersection() comments and declaration to be of the
correct length and add kernel documentation style comment.
Liu Xiang [Wed, 2 Jun 2021 03:52:18 +0000 (13:52 +1000)]
mm/memory.c: fix comment of finish_mkwrite_fault()
Fix the return value in comment of finish_mkwrite_fault().
Link: https://lkml.kernel.org/r/20210513093931.15234-1-liu.xiang@zlingsmart.com Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Liam Howlett [Wed, 2 Jun 2021 03:52:18 +0000 (13:52 +1000)]
mm/mmap: use find_vma_intersection() in do_mmap() for overlap
Using find_vma_intersection() avoids the need for a temporary variable and
makes the code cleaner.
Link: https://lkml.kernel.org/r/20210511014328.2902782-1-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>