Miaohe Lin [Sat, 25 Jun 2022 09:28:13 +0000 (17:28 +0800)]
mm/khugepaged: minor cleanup for collapse_file
nr_none is always 0 for non-shmem case because the page can be read from
the backend store. So when nr_none ! = 0, it must be in is_shmem case.
Also only adjust the nrpages and uncharge shmem when nr_none != 0 to save
cpu cycles.
Link: https://lkml.kernel.org/r/20220625092816.4856-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Sat, 25 Jun 2022 09:28:11 +0000 (17:28 +0800)]
mm/khugepaged: stop swapping in page when VM_FAULT_RETRY occurs
When do_swap_page returns VM_FAULT_RETRY, we do not retry here and thus
swap entry will remain in pagetable. This will result in later failure.
So stop swapping in pages in this case to save cpu cycles. As A further
optimization, mmap_lock is released when __collapse_huge_page_swapin()
fails to avoid relocking mmap_lock. And "swapped_in++" is moved after
error handling to make it more accurate.
Link: https://lkml.kernel.org/r/20220625092816.4856-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "A few cleanup patches for khugepaged", v2.
This series contains a few cleaup patches to remove unneeded return value,
use helper macro, fix typos and so on. More details can be found in the
respective changelogs.
This patch (of 7):
If we reach here, khugepaged_scan_mm_slot() has already made sure that
hugepage is enabled for shmem, via its call to hugepage_vma_check().
Remove this duplicated check.
Link: https://lkml.kernel.org/r/20220625092816.4856-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220625092816.4856-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Sun, 26 Jun 2022 14:57:17 +0000 (22:57 +0800)]
mm: hugetlb: kill set_huge_swap_pte_at()
Commit e5251fd43007 ("mm/hugetlb: introduce set_huge_swap_pte_at()
helper") add set_huge_swap_pte_at() to handle swap entries on
architectures that support hugepages consisting of contiguous ptes. And
currently the set_huge_swap_pte_at() is only overridden by arm64.
set_huge_swap_pte_at() provide a sz parameter to help determine the number
of entries to be updated. But in fact, all hugetlb swap entries contain
pfn information, so we can find the corresponding folio through the pfn
recorded in the swap entry, then the folio_size() is the number of entries
that need to be updated.
And considering that users will easily cause bugs by ignoring the
difference between set_huge_swap_pte_at() and set_huge_pte_at(). Let's
handle swap entries in set_huge_pte_at() and remove the
set_huge_swap_pte_at(), then we can call set_huge_pte_at() anywhere, which
simplifies our coding.
Link: https://lkml.kernel.org/r/20220626145717.53572-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yosry Ahmed [Thu, 23 Jun 2022 00:44:52 +0000 (00:44 +0000)]
mm: vmpressure: don't count userspace-induced reclaim as memory pressure
Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim as
memory pressure") made sure that memory reclaim that is induced by
userspace (limit-setting, proactive reclaim, ..) is not counted as memory
pressure for the purposes of psi.
Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
from try_charge() and reclaim_high() wrap the call to
try_to_free_mem_cgroup_pages() with psi handlers.
However, vmpressure is still counted in these cases where reclaim is
directly induced by userspace. This patch makes sure vmpressure is not
counted in those operations, in the same way as psi. Since vmpressure
calls need to happen deeper within the reclaim path, the same approach
could not be followed. Hence, a new "controlled" flag is added to struct
scan_control to flag a reclaim operation that is controlled by userspace.
This flag is set by limit-setting and proactive reclaim operations, and is
used to count vmpressure correctly.
To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
is effectively reverted and the same flag is used to control psi as well.
Link: https://lkml.kernel.org/r/20220623004452.1217326-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Yang [Thu, 23 Jun 2022 02:08:34 +0000 (02:08 +0000)]
mm/page_alloc: make the annotations of available memory more accurate
Not all systems use swap, so estimating available memory would help to
prevent swapping or OOM of system that not use swap.
And we need to reserve some page cache to prevent swapping or thrashing.
If somebody is accessing the pages in pagecache, and if too much would be
freed, most accesses might mean reading data from disk, i.e. thrashing.
Kalesh Singh [Thu, 23 Jun 2022 22:06:07 +0000 (15:06 -0700)]
procfs: add 'path' to /proc/<pid>/fdinfo/
In order to identify the type of memory a process has pinned through its
open fds, add the file path to fdinfo output. This allows identifying
memory types based on common prefixes: e.g. "/memfd...", "/dmabuf...",
"/dev/ashmem...".
To be cautious, only expose the paths for anonymous inodes, and this also
avoids printing path names with strange characters.
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS the
same as /proc/<pid>/maps which also exposes the file path of mappings; so
the security permissions for accessing path is consistent with that of
/proc/<pid>/maps.
Link: https://lkml.kernel.org/r/20220623220613.3014268-3-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: Christoph Hellwig <hch@infradead.org> Cc: Colin Cross <ccross@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Ioannis Ilkos <ilkos@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kalesh Singh [Thu, 23 Jun 2022 22:06:06 +0000 (15:06 -0700)]
procfs: add 'size' to /proc/<pid>/fdinfo/
Patch series "procfs: Add file path and size to /proc/<pid>/fdinfo", v2.
Processes can pin shared memory by keeping a handle to it through a
file descriptor; for instance dmabufs, memfd, and ashmem (in Android).
In the case of a memory leak, to identify the process pinning the
memory, userspace needs to:
- Iterate the /proc/<pid>/fd/* for each process
- Do a readlink on each entry to identify the type of memory from
the file path.
- stat() each entry to get the size of the memory.
The file permissions on /proc/<pid>/fd/* only allows for the owner
or root to perform the operations above; and so is not suitable for
capturing the system-wide state in a production environment.
This issue was addressed for dmabufs by making /proc/*/fdinfo/*
accessible to a process with PTRACE_MODE_READ_FSCREDS credentials[1]
To allow the same kind of tracking for other types of shared memory,
add the following fields to /proc/<pid>/fdinfo/<fd>:
path - This allows identifying the type of memory based on common
prefixes: e.g. "/memfd...", "/dmabuf...", "/dev/ashmem..."
This was not an issued when dmabuf tracking was introduced
because the exp_name field of dmabuf fdinfo could be used
to distinguish dmabuf fds from other types.
size - To track the amount of memory that is being pinned.
dmabufs expose size as an additional field in fdinfo. Remove
this and make it a common field for all fds.
Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS
-- the same as for /proc/<pid>/maps which also exposes the path and
size for mapped memory regions.
This allows for a system process with PTRACE_MODE_READ_FSCREDS to
account the pinned per-process memory via fdinfo.
This patch (of 2):
To be able to account the amount of memory a process is keeping pinned by
open file descriptors add a 'size' field to fdinfo output.
dmabufs fds already expose a 'size' field for this reason, remove this and
make it a common field for all fds. This allows tracking of other types
of memory (e.g. memfd and ashmem in Android).
Link: https://lkml.kernel.org/r/20220623220613.3014268-1-kaleshsingh@google.com Link: https://lkml.kernel.org/r/20220623220613.3014268-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Christian König <christian.koenig@amd.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Ioannis Ilkos <ilkos@google.com> Cc: T.J. Mercier <tjmercier@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: Colin Cross <ccross@google.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 24 Jun 2022 12:54:23 +0000 (13:54 +0100)]
mm/page_alloc: replace local_lock with normal spinlock
struct per_cpu_pages is no longer strictly local as PCP lists can be
drained remotely using a lock for protection. While the use of local_lock
works, it goes against the intent of local_lock which is for "pure CPU
local concurrency control mechanisms and not suited for inter-CPU
concurrency control" (Documentation/locking/locktypes.rst)
local_lock protects against migration between when the percpu pointer is
accessed and the pcp->lock acquired. The lock acquisition is a preemption
point so in the worst case, a task could migrate to another NUMA node and
accidentally allocate remote memory. The main requirement is to pin the
task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT.
Replace local_lock with helpers that pin a task to a CPU, lookup the
per-cpu structure and acquire the embedded lock. It's similar to
local_lock without breaking the intent behind the API. It is not a
complete API as only the parts needed for PCP-alloc are implemented but in
theory, the generic helpers could be promoted to a general API if there
was demand for an embedded lock within a per-cpu struct with a guarantee
that the per-cpu structure locked matches the running CPU and cannot use
get_cpu_var due to RT concerns. PCP requires these semantics to avoid
accidentally allocating remote memory.
Link: https://lkml.kernel.org/r/20220624125423.6126-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nicolas Saenz Julienne [Fri, 24 Jun 2022 12:54:22 +0000 (13:54 +0100)]
mm/page_alloc: remotely drain per-cpu lists
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism to
remotely drain the per-cpu lists. It is made possible by remotely locking
'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new
scheme is that drain operations are now migration safe.
There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel to triggering the
__drain_all_pages(NULL, true) code path around ~100 times per second. The
new scheme performs a bit better (~5%), although the important point here
is there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.
Minchan Kim tested an earlier version and reported;
My workload is not NOHZ CPUs but run apps under heavy memory
pressure so they goes to direct reclaim and be stuck on
drain_all_pages until work on workqueue run.
From traces, system encountered the drain_all_pages 1283 times and
worst case was 166ms and avg was 487us.
The other problem was alloc_contig_range in CMA. The PCP draining
takes several hundred millisecond sometimes though there is no
memory pressure or a few of pages to be migrated out but CPU were
fully booked.
Your patch perfectly removed those wasted time.
Link: https://lkml.kernel.org/r/20220624125423.6126-7-mgorman@techsingularity.net Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 24 Jun 2022 12:54:21 +0000 (13:54 +0100)]
mm/page_alloc: protect PCP lists with a spinlock
Currently the PCP lists are protected by using local_lock_irqsave to
prevent migration and IRQ reentrancy but this is inconvenient. Remote
draining of the lists is impossible and a workqueue is required and every
task allocation/free must disable then enable interrupts which is
expensive.
As preparation for dealing with both of those problems, protect the lists
with a spinlock. The IRQ-unsafe version of the lock is used because IRQs
are already disabled by local_lock_irqsave. spin_trylock is used in
preparation for a time when local_lock could be used instead of
lock_lock_irqsave.
The per_cpu_pages still fits within the same number of cache lines after
this patch relative to before the series.
struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int batch; /* 12 4 */
short int free_factor; /* 16 2 */
short int expire; /* 18 2 */
There is overhead in the fast path due to acquiring the spinlock even
though the spinlock is per-cpu and uncontended in the common case. Page
Fault Test (PFT) running on a 1-socket reported the following results on a
1 socket machine.
Mel Gorman [Fri, 24 Jun 2022 12:54:20 +0000 (13:54 +0100)]
mm/page_alloc: remove mistaken page == NULL check in rmqueue
If a page allocation fails, the ZONE_BOOSTER_WATERMARK should be tested,
cleared and kswapd woken whether the allocation attempt was via the PCP or
directly via the buddy list.
Remove the page == NULL so the ZONE_BOOSTED_WATERMARK bit is checked
unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set,
mark the branch accordingly.
Link: https://lkml.kernel.org/r/20220624125423.6126-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 24 Jun 2022 12:54:18 +0000 (13:54 +0100)]
mm/page_alloc: use only one PCP list for THP-sized allocations
The per_cpu_pages is cache-aligned on a standard x86-64 distribution
configuration but a later patch will add a new field which would push the
structure into the next cache line. Use only one list to store THP-sized
pages on the per-cpu list. This assumes that the vast majority of
THP-sized allocations are GFP_MOVABLE but even if it was another type, it
would not contribute to serious fragmentation that potentially causes a
later THP allocation failure. Align per_cpu_pages on the cacheline
boundary to ensure there is no false cache sharing.
After this patch, the structure sizing is;
struct per_cpu_pages {
int count; /* 0 4 */
int high; /* 4 4 */
int batch; /* 8 4 */
short int free_factor; /* 12 2 */
short int expire; /* 14 2 */
struct list_head lists[13]; /* 16 208 */
Mel Gorman [Fri, 24 Jun 2022 12:54:17 +0000 (13:54 +0100)]
mm/page_alloc: add page->buddy_list and page->pcp_list
Patch series "Drain remote per-cpu directly", v5.
Some setups, notably NOHZ_FULL CPUs, may be running realtime or
latency-sensitive applications that cannot tolerate interference due to
per-cpu drain work queued by __drain_all_pages(). Introduce a new
mechanism to remotely drain the per-cpu lists. It is made possible by
remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has
two advantages, the time to drain is more predictable and other unrelated
tasks are not interrupted.
This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority task
due to a workqueue item draining per-cpu page lists. While many workloads
can tolerate a brief interruption, it may cause a real-time task running
on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
non-deterministic.
Currently an IRQ-safe local_lock protects the page allocator per-cpu
lists. The local_lock on its own prevents migration and the IRQ disabling
protects from corruption due to an interrupt arriving while a page
allocation is in progress.
This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq is
ultimately replaced by just the spinlock in the final patch. This allows
a remote CPU to safely. Follow-on work should allow the spin_lock_irqsave
to be converted to spin_lock to avoid IRQs being disabled/enabled in most
cases. The follow-on patch will be one kernel release later as it is
relatively high risk and it'll make bisections more clear if there are any
problems.
Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
and when it is storing per-cpu pages.
Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
this is not necessary but it avoids per_cpu_pages consuming another
cache line.
Patch 3 is a preparation patch to avoid code duplication.
Patch 4 is a minor correction.
Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
relying on local_lock to prevent migration, stabilise the pcp
lookup and prevent IRQ reentrancy.
Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
Patch 7 uses a normal spinlock instead of local_lock for remote draining
This patch (of 7):
The page allocator uses page->lru for storing pages on either buddy or PCP
lists. Create page->buddy_list and page->pcp_list as a union with
page->lru. This is simply to clarify what type of list a page is on in
the page allocator.
No functional change intended.
[minchan@kernel.org: fix page lru fields in macros] Link: https://lkml.kernel.org/r/20220624125423.6126-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Wed, 22 Jun 2022 02:35:01 +0000 (11:35 +0900)]
zram: do not lookup algorithm in backends table
Always use crypto_has_comp() so that crypto can lookup module, call
usermodhelper to load the modules, wait for usermodhelper to finish and so
on. Otherwise crypto will do all of these steps under CPU hot-plug lock
and this looks like too much stuff to handle under the CPU hot-plug lock.
Besides this can end up in a deadlock when usermodhelper triggers a code
path that attempts to lock the CPU hot-plug lock, that zram already holds.
An example of such deadlock:
- path A. zram grabs CPU hot-plug lock, execs /sbin/modprobe from crypto
and waits for modprobe to finish
- path B. async work kthread that brings in scsi device. It wants to
register CPUHP states at some point, and it needs the CPU hot-plug
lock for that, which is owned by zram.
Mike Kravetz [Tue, 21 Jun 2022 23:56:20 +0000 (16:56 -0700)]
hugetlb: lazy page table copies in fork()
Lazy page table copying at fork time was introduced with commit d992895ba2b2 ("[PATCH] Lazy page table copies in fork()"). At the time,
hugetlb was very new and did not support page faulting. As a result, it
was excluded. When full page fault support was added for hugetlb, the
exclusion was not removed.
Simply remove the check that prevents lazy copying of hugetlb page tables
at fork. Of course, like other mappings this only applies to shared
mappings.
Lazy page table copying at fork will be less advantageous for hugetlb
mappings because:
- There are fewer page table entries with hugetlb
- hugetlb pmds can be shared instead of copied
In any case, completely eliminating the copy at fork time should speed
things up.
Link: https://lkml.kernel.org/r/20220621235620.291305-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Kravetz [Tue, 21 Jun 2022 23:56:19 +0000 (16:56 -0700)]
hugetlb: do not update address in huge_pmd_unshare
As an optimization for loops sequentially processing hugetlb address
ranges, huge_pmd_unshare would update a passed address if it unshared a
pmd. Updating a loop control variable outside the loop like this is
generally a bad idea. These loops are now using hugetlb_mask_last_page to
optimize scanning when non-present ptes are discovered. The same can be
done when huge_pmd_unshare returns 1 indicating a pmd was unshared.
Remove address update from huge_pmd_unshare. Change the passed argument
type and update all callers. In loops sequentially processing addresses
use hugetlb_mask_last_page to update address if pmd is unshared.
Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 21 Jun 2022 23:56:18 +0000 (16:56 -0700)]
arm64/hugetlb: implement arm64 specific hugetlb_mask_last_page
The HugeTLB address ranges are linearly scanned during fork, unmap and
remap operations, and the linear scan can skip to the end of range mapped
by the page table page if hitting a non-present entry, which can help to
speed linear scanning of the HugeTLB address ranges.
So hugetlb_mask_last_page() is introduced to help to update the address in
the loop of HugeTLB linear scanning with getting the last huge page mapped
by the associated page table page[1], when a non-present entry is
encountered.
Considering ARM64 specific cont-pte/pmd size HugeTLB, this patch
implemented an ARM64 specific hugetlb_mask_last_page() to help this case.
Mike Kravetz [Tue, 21 Jun 2022 23:56:17 +0000 (16:56 -0700)]
hugetlb: skip to end of PT page mapping when pte not present
Patch series "hugetlb: speed up linear address scanning", v2.
At unmap, fork and remap time hugetlb address ranges are linearly scanned.
We can optimize these scans if the ranges are sparsely populated.
Also, enable page table "Lazy copy" for hugetlb at fork.
NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to
add an arch specific version hugetlb_mask_last_page() to take advantage of
sparse address scanning improvements. Baolin Wang added the routine for
arm64. Other architectures which could be optimized are: ia64, mips,
parisc, powerpc, s390, sh and sparc.
This patch (of 4):
HugeTLB address ranges are linearly scanned during fork, unmap and remap
operations. If a non-present entry is encountered, the code currently
continues to the next huge page aligned address. However, a non-present
entry implies that the page table page for that entry is not present.
Therefore, the linear scan can skip to the end of range mapped by the page
table page. This can speed operations on large sparsely populated hugetlb
mappings.
Create a new routine hugetlb_mask_last_page() that will return an address
mask. When the mask is ORed with an address, the result will be the
address of the last huge page mapped by the associated page table page.
Use this mask to update addresses in routines which linearly scan hugetlb
address ranges when a non-present pte is encountered.
hugetlb_mask_last_page is related to the implementation of huge_pte_offset
as hugetlb_mask_last_page is called when huge_pte_offset returns NULL.
This patch only provides a complete hugetlb_mask_last_page implementation
when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined. Architectures which
provide their own versions of huge_pte_offset can also provide their own
version of hugetlb_mask_last_page.
Link: https://lkml.kernel.org/r/20220621235620.291305-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220621235620.291305-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miaohe Lin [Sat, 18 Jun 2022 08:20:27 +0000 (16:20 +0800)]
mm/mmap.c: fix missing call to vm_unacct_memory in mmap_region
Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory
twice because vm_unacct_memory will be called by above unmap_region. But
since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from
the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory
anymore. So charged shouldn't be set to 0 now otherwise the calling to
paired vm_unacct_memory will be missed and leads to imbalanced account.
Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yun-Ze Li [Mon, 20 Jun 2022 07:15:16 +0000 (07:15 +0000)]
mm, docs: fix comments that mention mem_hotplug_end()
Comments that mention mem_hotplug_end() are confusing as there is no
function called mem_hotplug_end(). Fix them by replacing all the
occurences of mem_hotplug_end() in the comments with mem_hotplug_done().
The Shared*Proportional fields are not present in smaps, so it is not
always possible to determine how much of the Pss is from dirty pages and
how much is from clean pages. This information can be useful for
measuring memory usage for the purpose of optimisation, since clean pages
can usually be discarded by the kernel immediately while dirty pages
cannot.
The smaps routines in the kernel already have access to this data, so add
a Pss_Dirty to show it to userspace. Pss_Clean is not added since it can
be calculated from Pss and Pss_Dirty.
Baolin Wang [Mon, 20 Jun 2022 11:47:15 +0000 (19:47 +0800)]
mm: rmap: simplify the hugetlb handling when unmapping or migration
According to previous discussion [1], there are so many levels of
indenting to handle the hugetlb case when unmapping or migration. We can
combine folio_test_anon() and huge_pmd_unshare() to save one level of
indenting, by adding a local variable and moving the VM_BUG_ON() a little
forward.
Muchun Song [Tue, 21 Jun 2022 12:56:57 +0000 (20:56 +0800)]
mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance function
We need to make sure that the page is deleted from or added to the correct
lruvec list. So add a VM_WARN_ON_ONCE_FOLIO() to catch invalid users.
Then the VM_BUG_ON_PAGE() in move_pages_to_lru() could be removed since
add_page_to_lru_list() will check that.
Link: https://lkml.kernel.org/r/20220621125658.64935-11-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 21 Jun 2022 12:56:56 +0000 (20:56 +0800)]
mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
We will reuse the obj_cgroup APIs to charge the LRU pages. Finally,
page->memcg_data will have 2 different meanings.
- For the slab pages, page->memcg_data points to an object cgroups
vector.
- For the kmem pages (exclude the slab pages) and the LRU pages,
page->memcg_data points to an object cgroup.
In this patch, we reuse obj_cgroup APIs to charge LRU pages. In the end,
The page cache cannot prevent long-living objects from pinning the
original memory cgroup in the memory.
At the same time we also changed the rules of page and objcg or memcg
binding stability. The new rules are as follows.
For a page any of the following ensures page and objcg binding stability:
Based on the stable binding of page and objcg, for a page any of the
following ensures page and memcg binding stability:
- objcg_lock
- cgroup_mutex
- the lruvec lock
- the split queue lock (only THP page)
If the caller only want to ensure that the page counters of memcg are
updated correctly, ensure that the binding stability of page and objcg is
sufficient.
Link: https://lkml.kernel.org/r/20220621125658.64935-10-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Apart from the page lruvec lock, the deferred split queue lock (THP only)
also needs to do something similar. So we extract the necessary three
steps in the memcg_reparent_objcgs().
Now there are two different locks (e.g. lruvec lock and deferred split
queue lock) need to use this infrastructure. In the next patch, we will
use those APIs to make those locks safe when the LRU pages reparented.
Link: https://lkml.kernel.org/r/20220621125658.64935-9-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 21 Jun 2022 12:56:54 +0000 (20:56 +0800)]
mm: memcontrol: make all the callers of {folio,page}_memcg() safe
When we use objcg APIs to charge the LRU pages, the page will not hold a
reference to the memcg associated with the page. So the caller of the
{folio,page}_memcg() should hold an rcu read lock or obtain a reference to
the memcg associated with the page to protect memcg from being released.
So introduce get_mem_cgroup_from_{page,folio}() to obtain a reference to
the memory cgroup associated with the page.
In this patch, make all the callers hold an rcu read lock or obtain a
reference to the memcg to protect memcg from being released when the LRU
pages reparented.
We do not need to adjust the callers of {folio,page}_memcg() during the
whole process of mem_cgroup_move_task(). Because the cgroup migration and
memory cgroup offlining are serialized by @cgroup_mutex. In this routine,
the LRU pages cannot be reparented to its parent memory cgroup. So
{folio,page}_memcg() is stable and cannot be released.
This is a preparation for reparenting the LRU pages.
Link: https://lkml.kernel.org/r/20220621125658.64935-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 21 Jun 2022 12:56:52 +0000 (20:56 +0800)]
mm: vmscan: rework move_pages_to_lru()
In a later patch, we will reparent the LRU pages. The pages moved to
appropriate LRU list can be reparented during the process of the
move_pages_to_lru(). So holding a lruvec lock by the caller is wrong, we
should use the more general interface of folio_lruvec_relock_irq() to
acquire the correct lruvec lock.
Link: https://lkml.kernel.org/r/20220621125658.64935-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
// The folio is reparented at this time.
spin_lock(&lruvec->lru_lock);
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio)))
// Acquired the wrong lruvec lock and need to retry.
// Because this folio is on the parent memcg lruvec list.
spin_unlock(&lruvec->lru_lock);
goto retry;
// If we reach here, it means that folio_memcg(folio) is stable.
memcg_reparent_objcgs(memcg)
// lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
// Move all the pages from the lruvec list to the parent lruvec list.
After we acquire the lruvec lock, we need to check whether the folio is
reparented. If so, we need to reacquire the new lruvec lock. On the
routine of the LRU pages reparenting, we will also acquire the lruvec lock
(will be implemented in the later patch). So folio_memcg() cannot be
changed when we hold the lruvec lock.
Since lruvec_memcg(lruvec) is always equal to folio_memcg(folio) after we
hold the lruvec lock, lruvec_memcg_debug() check is pointless. So remove
it.
This is a preparation for reparenting the LRU pages.
Link: https://lkml.kernel.org/r/20220621125658.64935-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 21 Jun 2022 12:56:50 +0000 (20:56 +0800)]
mm: memcontrol: prepare objcg API for non-kmem usage
Pagecache pages are charged at the allocation time and holding a reference
to the original memory cgroup until being reclaimed. Depending on the
memory pressure, specific patterns of the page sharing between different
cgroups and the cgroup creation and destruction rates, a large number of
dying memory cgroups can be pinned by pagecache pages. It makes the page
reclaim less efficient and wastes memory.
We can convert LRU pages and most other raw memcg pins to the objcg
direction to fix this problem, and then the page->memcg will always point
to an object cgroup pointer.
Therefore, the infrastructure of objcg no longer only serves
CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the objcg
out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages can reuse
it to charge pages.
We know that the LRU pages are not accounted at the root level. But the
page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
of the LRU pages always points to a valid pointer. But the
root_mem_cgroup dose not have an object cgroup. If we use obj_cgroup APIs
to charge the LRU pages, we should set the page->memcg_data to a root
object cgroup. So we also allocate an object cgroup for the
root_mem_cgroup.
Link: https://lkml.kernel.org/r/20220621125658.64935-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 21 Jun 2022 12:56:49 +0000 (20:56 +0800)]
mm: rename unlock_page_lruvec{_irq, _irqrestore} to lruvec_unlock{_irq, _irqrestore}
It is weird to use folio_lruvec_lock() variants and unlock_page_lruvec()
variants together, e.g. locking folio and unlocking page. So rename
unlock_page_lruvec{_irq, _irqrestore} to lruvec_unlock{_irq, _irqrestore}.
Link: https://lkml.kernel.org/r/20220621125658.64935-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Waiman Long <longman@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 21 Jun 2022 12:56:48 +0000 (20:56 +0800)]
mm: memcontrol: remove dead code and comments
Patch series "Use obj_cgroup APIs to charge the LRU pages", v6.
With the following patchsets applied, all the kernel memory is charged with
the new APIs of obj_cgroup:
commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")
commit b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages")
But user memory allocations (LRU pages) pinning memcgs for a long time -
it exists at a larger scale and is causing recurring problems in the real
world: page cache doesn't get reclaimed for a long time, or is used by the
second, third, fourth, ... instance of the same job that was restarted
into a new cgroup every time. Unreclaimable dying cgroups pile up, waste
memory, and make page reclaim very inefficient.
We can convert LRU pages and most other raw memcg pins to the objcg
direction to fix this problem, and then the LRU pages will not pin the
memcgs.
This patchset aims to make the LRU pages to drop the reference to memory
cgroup by using the APIs of obj_cgroup. Finally, we can see that the
number of the dying cgroups will not increase if we run the following test
script.
for i in {0..2000}
do
mkdir /sys/fs/cgroup/memory/test$i
echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
cat temp >> log
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
rmdir /sys/fs/cgroup/memory/test$i
done
cat /proc/cgroups | grep memory
rm -f temp log
This patch (of 11):
Since no-hierarchy mode is deprecated after
commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical mode")
so parent_mem_cgroup() cannot return a NULL except root memcg, however,
root memcg cannot be offline, so it is safe to drop the check of returned
value of parent_mem_cgroup(). Remove those dead code.
The comments in memcg_offline_kmem() above memcg_reparent_list_lrus() are
out of date since
commit 5abc1e37afa0 ("mm: list_lru: allocate list_lru_one only when needed")
There is no ordering requirement between memcg_reparent_list_lrus() and
memcg_reparent_objcgs(), so remove those outdated comments.
Miaohe Lin [Sat, 18 Jun 2022 09:05:27 +0000 (17:05 +0800)]
mm/madvise: minor cleanup for swapin_walk_pmd_entry()
Passing index to pte_offset_map_lock() directly so the below calculation
can be avoided. Rename orig_pte to ptep as it's not changed. Also use
helper is_swap_pte() to improve the readability. No functional change
intended.
Muchun Song [Thu, 16 Jun 2022 03:38:46 +0000 (11:38 +0800)]
mm: hugetlb: remove minimum_order variable
commit 641844f5616d ("mm/hugetlb: introduce minimum hugepage order") fixed
a static checker warning and introduced a global variable minimum_order to
fix the warning. However, the local variable in
dissolve_free_huge_pages() can be initialized to
huge_page_order(&default_hstate) to fix the warning.
Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Fri, 17 Jun 2022 13:56:50 +0000 (21:56 +0800)]
mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory
For now, the feature of hugetlb_free_vmemmap is not compatible with the
feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes
precedence over memory_hotplug.memmap_on_memory. However, someone wants
to make memory_hotplug.memmap_on_memory takes precedence over
hugetlb_free_vmemmap since memmap_on_memory makes it more likely to
succeed memory hotplug in close-to-OOM situations. So the decision of
making hugetlb_free_vmemmap take precedence is not wise and elegant.
The proper approach is to have hugetlb_vmemmap.c do the check whether the
section which the HugeTLB pages belong to can be optimized. If the
section's vmemmap pages are allocated from the added memory block itself,
hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do
the optimization. Then both kernel parameters are compatible. So this
patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap
pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page
can be optimized.
Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20220620110616.12056-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Fri, 17 Jun 2022 13:56:49 +0000 (21:56 +0800)]
mm: memory_hotplug: enumerate all supported section flags
Patch series "make hugetlb_optimize_vmemmap compatible with
memmap_on_memory", v3.
This series makes hugetlb_optimize_vmemmap compatible with
memmap_on_memory.
This patch (of 2):
We are almost running out of section flags, only one bit is available in
the worst case (powerpc with 256k pages). However, there are still some
free bits (in ->section_mem_map) on other architectures (e.g. x86_64 has
10 bits available, arm64 has 8 bits available with worst case of 64K
pages). We have hard coded those numbers in code, it is inconvenient to
use those bits on other architectures except powerpc. So transfer those
section flags to enumeration to make it easy to add new section flags in
the future. Also, move SECTION_TAINT_ZONE_DEVICE into the scope of
CONFIG_ZONE_DEVICE to save a bit on non-zone-device case.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:15 +0000 (18:50 +0100)]
mm/swap: convert __put_compound_page() to __folio_put_large()
All the callers now have a folio, so pass it in. This doesn't
save any text, but it does save a call to compound_head() as
folio_test_hugetlb() does not contain a call like PageHuge() does.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:12 +0000 (18:50 +0100)]
mm/swap: convert put_pages_list to use folios
Pages linked through the LRU list cannot be tail pages as ->compound_head
is in a union with one of the words of the list_head, and they cannot
be ZONE_DEVICE pages as ->pgmap is in a union with the same word.
Saves 60 bytes of text by removing a call to page_is_fake_head().
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:11 +0000 (18:50 +0100)]
mm/swap: convert release_pages to use a folio internally
This function was already calling compound_head(), but now it can
cache the result of calling compound_head() and avoid calling it again.
Saves 299 bytes of text by avoiding various calls to compound_page()
and avoiding checks of PageTail.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:08 +0000 (18:50 +0100)]
mm/swap: pull the CPU conditional out of __lru_add_drain_all()
The function is too long, so pull this complicated conditional out into
cpu_needs_drain(). This ends up shrinking the text by 14 bytes,
by allowing GCC to cache the result of calling per_cpu() instead of
relocating each lookup individually.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:06 +0000 (18:50 +0100)]
mm/swap: convert activate_page to a folio_batch
Rename it to just 'activate', saving 696 bytes of text from removals
of compound_page() and the pagevec_lru_move_fn() infrastructure.
Inline need_activate_page_drain() into its only caller.
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:50:02 +0000 (18:50 +0100)]
mm/swap: convert lru_add to a folio_batch
When adding folios to the LRU for the first time, the LRU flag will
already be clear, so skip the test-and-clear part of moving from one
LRU to another.
Removes 285 bytes from kernel text, mostly due to removing
__pagevec_lru_add().
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 17:49:59 +0000 (18:49 +0100)]
mm: add folios_put()
Patch series "Convert the swap code to be more folio-based".
There's still more to do with the swap code, but this reaps a lot of the
folio benefit. More than 4kB of kernel text saved (with the UEK7 kernel
config). I don't know how much that's going to translate into CPU
savings, but some of those compound_head() calls are on every page free,
so it should be noticable. It might even be noticable just from an
I-cache consumption perspective.
This patch (of 22):
This is just a wrapper around release_pages() for now. Place the
prototype in mm.h along with folio_put() and folio_put_refs().
Matthew Wilcox (Oracle) [Fri, 17 Jun 2022 15:42:44 +0000 (16:42 +0100)]
mm/vmscan: convert reclaim_clean_pages_from_list() to folios
Patch series "nvert much of vmscan to folios"
vmscan always operates on folios since it puts the pages on the LRU list.
Switching all of these functions from pages to folios saves 1483 bytes of
text from removing all the baggage around calling compound_page() and
similar functions.
This patch (of 5):
This is a straightforward conversion which removes several hidden calls
to compound_head, saving 330 bytes of kernel text.
Kuan-Ying Lee [Wed, 15 Jun 2022 06:22:18 +0000 (14:22 +0800)]
kasan: separate double free case from invalid free
Currently, KASAN describes all invalid-free/double-free bugs as
"double-free or invalid-free". This is ambiguous.
KASAN should report "double-free" when a double-free is a more likely
cause (the address points to the start of an object) and report
"invalid-free" otherwise [1].
Yang Shi [Thu, 16 Jun 2022 17:48:40 +0000 (10:48 -0700)]
doc: proc: fix the description to THPeligible
The THPeligible bit shows 1 if and only if the VMA is eligible for
allocating THP and the THP is also PMD mappable. Some misaligned file
VMAs may be eligible for allocating THP but the THP can't be mapped by
PMD. Make this more explicitly to avoid ambiguity.
Link: https://lkml.kernel.org/r/20220616174840.1202070-8-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:39 +0000 (10:48 -0700)]
mm: khugepaged: reorg some khugepaged helpers
The khugepaged_{enabled|always|req_madv} are not khugepaged only anymore,
move them to huge_mm.h and rename to hugepage_flags_xxx, and remove
khugepaged_req_madv due to no users.
Also move khugepaged_defrag to khugepaged.c since its only caller is in
that file, it doesn't have to be in a header file.
Link: https://lkml.kernel.org/r/20220616174840.1202070-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:38 +0000 (10:48 -0700)]
mm: thp: kill __transhuge_page_enabled()
The page fault path checks THP eligibility with __transhuge_page_enabled()
which does the similar thing as hugepage_vma_check(), so use
hugepage_vma_check() instead.
However page fault allows DAX and !anon_vma cases, so added a new flag,
in_pf, to hugepage_vma_check() to make page fault work correctly.
The in_pf flag is also used to skip shmem and file THP for page fault
since shmem handles THP in its own shmem_fault() and file THP allocation
on fault is not supported yet.
Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only
caller now, it is not necessary to have a helper function.
Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Thu, 23 Jun 2022 00:02:45 +0000 (17:02 -0700)]
mm-thp-kill-transparent_hugepage_active-fix-fix
add comment to vdso check
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 22 Jun 2022 00:51:42 +0000 (17:51 -0700)]
mm-thp-kill-transparent_hugepage_active-fix
check vma->vm_mm, per Zach
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:37 +0000 (10:48 -0700)]
mm: thp: kill transparent_hugepage_active()
The transparent_hugepage_active() was introduced to show THP eligibility
bit in smaps in proc, smaps is the only user. But it actually does the
similar check as hugepage_vma_check() which is used by khugepaged. We
definitely don't have to maintain two similar checks, so kill
transparent_hugepage_active().
This patch also fixed the wrong behavior for VM_NO_KHUGEPAGED vmas.
Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it
is not only for khugepaged anymore.
Link: https://lkml.kernel.org/r/20220616174840.1202070-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:36 +0000 (10:48 -0700)]
mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidate
The hugepage_vma_revalidate() needs to check if the vma is still anonymous
vma or not since the address may be unmapped then remapped to file before
khugepaged reaquired the mmap_lock.
The old comment is not quite helpful, elaborate this with better comment.
Link: https://lkml.kernel.org/r/20220616174840.1202070-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:35 +0000 (10:48 -0700)]
mm: thp: consolidate vma size check to transhuge_vma_suitable
There are couple of places that check whether the vma size is ok for THP
or whether address fits, they are open coded and duplicate, use
transhuge_vma_suitable() to do the job by passing in (vma->end -
HPAGE_PMD_SIZE).
Move vma size check into hugepage_vma_check(). This will make
khugepaged_enter() is as same as khugepaged_enter_vma(). There is just
one caller for khugepaged_enter(), replace it to khugepaged_enter_vma()
and remove khugepaged_enter().
Link: https://lkml.kernel.org/r/20220616174840.1202070-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Thu, 16 Jun 2022 17:48:34 +0000 (10:48 -0700)]
mm: khugepaged: check THP flag in hugepage_vma_check()
Patch series "Cleanup transhuge_xxx helpers", v5.
This series is the follow-up of the discussion about cleaning up
transhuge_xxx helpers at
https://lore.kernel.org/linux-mm/627a71f8-e879-69a5-ceb3-fc8d29d2f7f1@suse.cz/.
THP has a bunch of helpers that do VMA sanity check for different paths,
they do the similar checks for the most callsites and have a lot duplicate
codes. And it is confusing what helpers should be used at what
conditions.
This series reorganized and cleaned up the code so that we could
consolidate all the checks into hugepage_vma_check().
The transhuge_vma_enabled(), transparent_hugepage_active() and
__transparent_hugepage_enabled() are killed by this series.
This patch (of 7):
Currently the THP flag check in hugepage_vma_check() will fallthrough if
the flag is NEVER and VM_HUGEPAGE is set. This is not a problem for now
since all the callers have the flag checked before or can't be invoked if
the flag is NEVER.
However, the following patch will call hugepage_vma_check() in more
places, for example, page fault, so this flag must be checked in
hugepge_vma_check().
Liam Howlett [Wed, 15 Jun 2022 17:40:58 +0000 (17:40 +0000)]
mm/mlock: drop dead code in count_mm_mlocked_page_nr()
The check for mm being null has never been needed since the only caller
has always passed in current->mm. Remove the check from
count_mm_mlocked_page_nr().
David Hildenbrand [Tue, 14 Jun 2022 09:36:29 +0000 (11:36 +0200)]
mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection
Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
can try mapping anonymous pages in a private writable mapping writable if
they are exclusive, the PTE is already dirty, and no special handling
applies. Mapping the anonymous page writable is essentially the same
thing the write fault handler would do in this case.
Special handling is required for uffd-wp and softdirty tracking, so take
care of that properly. Also, leave PROT_NONE handling alone for now; in
the future, we could similarly extend the logic in do_numa_page() or use
pte_mk_savedwrite() here.
While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
performance, it should also be a valuable optimization for uffd-wp, when
un-protecting.
This has been previously suggested by Peter Collingbourne in [1], relevant
in the context of the Scudo memory allocator, before we had
PageAnonExclusive.
This commit doesn't add the same handling for PMDs (i.e., anonymous THP,
anonymous hugetlb); benchmark results from Andrea indicate that there are
minor performance gains, so it's might still be valuable to streamline
that logic for all anonymous pages in the future.
As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename it
to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually
happening.
Link: https://lkml.kernel.org/r/20220614093629.76309-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Peter Collingbourne <pcc@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Edward Liaw [Mon, 13 Jun 2022 23:33:21 +0000 (23:33 +0000)]
userfaultfd: selftests: infinite loop in faulting_process
On Android this test is getting stuck in an infinite loop due to
indeterminate behavior:
The local variables steps and signalled were being reset to 1 and 0
respectively after every jump back to sigsetjmp by siglongjmp in the
signal handler. The test was incrementing them and expecting them to
retain their incremented values. The documentation for siglongjmp says:
All accessible objects have values as of the time sigsetjmp() was called,
except that the values of objects of automatic storage duration which are
local to the function containing the invocation of the corresponding
sigsetjmp() which do not have volatile-qualified type and which are
changed between the sigsetjmp() invocation and siglongjmp() call are
indeterminate.
Tagging steps and signalled with volatile enabled the test to pass.
Link: https://lkml.kernel.org/r/20220613233321.431282-1-edliaw@google.com Signed-off-by: Edward Liaw <edliaw@google.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Wed, 1 Jun 2022 21:09:51 +0000 (14:09 -0700)]
selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
This new mode was recently added to the userfaultfd selftest. We want to
exercise both userfaultfd(2) as well as /dev/userfaultfd, so add both test
cases to the script.
Link: https://lkml.kernel.org/r/20220601210951.3916598-7-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Wed, 1 Jun 2022 21:09:50 +0000 (14:09 -0700)]
userfaultfd: selftests: make /dev/userfaultfd testing configurable
Instead of always testing both userfaultfd(2) and /dev/userfaultfd, let
the user choose which to test.
As with other test features, change the behavior based on a new command
line flag. Introduce the idea of "test mods", which are generic (not
specific to a test type) modifications to the behavior of the test. This
is sort of borrowed from this RFC patch series [1], but simplified a bit.
The benefit is, in "typical" configurations this test is somewhat slow
(say, 30sec or something). Testing both clearly doubles it, so it may not
always be desirable, as users are likely to use one or the other, but
never both, in the "real world".
Axel Rasmussen [Wed, 1 Jun 2022 21:09:47 +0000 (14:09 -0700)]
userfaultfd: add /dev/userfaultfd for fine grained access control
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount of
time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or this sysctl must
be configured so that any unprivileged user can do it.
In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:
- Toggling the sysctl increases attack surface by allowing any
unprivileged user to do it.
- Granting the live migration process CAP_SYS_PTRACE gives it this
ability, but *also* the ability to "observe and control the
execution of another process [...], and examine and change [its]
memory and registers" (from ptrace(2)). This isn't something we need
or want to be able to do, so granting this permission violates the
"principle of least privilege".
This is all a long winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional permissions
at the same time.
To achieve this, add a /dev/userfaultfd misc device. This device provides
an alternative to the userfaultfd(2) syscall for the creation of new
userfaultfds. The idea is, any userfaultfds created this way will be able
to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.
Link: https://lkml.kernel.org/r/20220601210951.3916598-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Rasmussen [Wed, 1 Jun 2022 21:09:46 +0000 (14:09 -0700)]
selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
Patch series "userfaultfd: add /dev/userfaultfd for fine grained access
control", v3.
This patch (of 6):
This not being included was just a simple oversight. There are certain
features (like minor fault support) which are only enabled on shared
mappings, so without including hugetlb_shared we actually lose a
significant amount of test coverage.
Link: https://lkml.kernel.org/r/20220601210951.3916598-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20220601210951.3916598-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Charan Teja Kalla <charante@codeaurora.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry V. Levin <ldv@altlinux.org> Cc: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 13 Jun 2022 19:23:00 +0000 (19:23 +0000)]
mm/damon: introduce DAMON-based LRU-lists Sorting
Users can do data access-aware LRU-lists sorting using 'LRU_PRIO' and
'LRU_DEPRIO' DAMOS actions. However, finding best parameters including
the hotness/coldness thresholds, CPU quota, and watermarks could be
challenging for some users. To make the scheme easy to be used without
complex tuning for common situations, this commit implements a static
kernel module called 'DAMON_LRU_SORT' using the 'LRU_PRIO' and
'LRU_DEPRIO' DAMOS actions.
It proactively sorts LRU-lists using DAMON with conservatively chosen
default values of the parameters. That is, the module under its default
parameters will make no harm for common situations but provide some level
of efficiency improvements for systems having clear hot/cold access
pattern under a level of memory pressure while consuming only a limited
small portion of CPU time.
SeongJae Park [Mon, 13 Jun 2022 19:22:58 +0000 (19:22 +0000)]
mm/damon/schemes: add 'LRU_DEPRIO' action
This commit adds a new DAMON-based operation scheme action called
'LRU_DEPRIO' for physical address space. The action deprioritizes pages
in the memory area of the target access pattern on their LRU lists. This
is hence supposed to be used for rarely accessed (cold) memory regions so
that cold pages could be more likely reclaimed first under memory
pressure. Internally, it simply calls 'lru_deactivate()'.
Using this with 'LRU_PRIO' action for hot pages, users can proactively
sort LRU lists based on the access pattern. That is, it can make the LRU
lists somewhat more trustworthy source of access temperature. As a
result, efficiency of LRU-lists based mechanisms including the reclamation
target selection could be improved.