Lorenzo Stoakes [Thu, 2 Jan 2025 12:10:51 +0000 (12:10 +0000)]
mips: vdso: prefer do_mmap() to mmap_region()
Patch series "mm: update mips to use do_mmap(), make mmap_region()
internal".
Currently the only user of mmap_region() outside of the memory management
code is the MIPS VDSO implementation.
This uses mmap_region() to map a 'delay slot emulation page' at the top of
the stack which is read-only and executable.
This mapping requires that an already-acquired mmap write lock is utilised
and that uffd and populate logic is ignored. This rules out vm_mmap();
however, do_mmap() fits the bill.
Adapt this code to use do_mmap() and then once done, make mmap_region()
internal and userland testable, and avoid any other uses of mmap_region(),
which is absolutely and strictly an internal mm function which bypasses a
great number of checks and logic.
This patch (of 2):
mmap_region() is an internal memory management implementation detail that
is not intended to be used outside of the memory management subsystem.
Map the delay slot emulation page using do_mmap() which makes use of the
already-held mmap write lock and bypasses unneeded populate and
userfaultfd logic.
This should have precisely the same behaviour as the existing logic.
Kairui Song [Mon, 13 Jan 2025 17:57:32 +0000 (01:57 +0800)]
mm, swap_slots: remove slot cache for freeing path
The slot cache for freeing path is mostly for reducing the overhead of
si->lock. As we have basically eliminated the si->lock usage for freeing
path, it can be removed.
This helps simplify the code, and prevents swap entries from being held in
the cache upon freeing. The delayed freeing of entries has been causing
trouble for further zswap optimizations [1], and in theory also causes
more fragmentation and extra overhead.
Testing with a Linux kernel build showed that both performance and
fragmentation are better without the cache:
time make -j96 / 768M memcg, 4K pages, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 36047.78, Real time: 472.43
After: (-7.6% sys time, -7.3% real time)
Sys time: 33314.76, Real time: 437.67
time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 46859.04, Real time: 562.63
hugepages-64kB/stats/swpout: 1783392
hugepages-64kB/stats/swpout_fallback: 240875
After: (-23.3% sys time, -21.3% real time)
Sys time: 35958.87, Real time: 442.69
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330
Sequential SWAP should also be slightly faster; tests didn't show a
measurable difference, but at least there is no regression:
Link: https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20250113175732.48099-14-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:31 +0000 (01:57 +0800)]
mm, swap: use a global swap cluster for non-rotation devices
Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the
goal of the SWAP allocator is to avoid contention for clusters. It uses a
per-CPU cluster design, and each CPU will use a different cluster as much
as possible.
However, HDDs are very sensitive to fragmentation, and contention is
trivial in comparison. Therefore, we use one global cluster instead. This
ensures that each order will be written to the same cluster as much as
possible, which helps make the I/O more continuous.
This ensures that the performance of the cluster allocator is as good as
that of the old allocator. Results after this commit, compared to those
before this series, using 'make -j32' with tinyconfig, a 1G memcg limit,
and HDD swap:
Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults
After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults
[ryncsn@gmail.com: check kmalloc() return in setup_clusters]
Link: https://lkml.kernel.org/r/CAMgjq7Au+o04ckHyT=iU-wVx9az=t0B-ZiC5E0bDqNrAtNOP-g@mail.gmail.com
Link: https://lkml.kernel.org/r/20250113175732.48099-13-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:30 +0000 (01:57 +0800)]
mm, swap: introduce a helper for retrieving cluster from offset
Retrieving the cluster info from an offset is a common operation, so
introduce a helper for it.
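A minimal userspace sketch of what such a helper can look like, assuming
a fixed number of slots per cluster. The struct layouts, the
SLOTS_PER_CLUSTER constant and the helper name below are illustrative
assumptions, not necessarily what mm/swapfile.c uses:

```c
#include <assert.h>

/* Assumed value for illustration: 512 slots per cluster. */
#define SLOTS_PER_CLUSTER 512UL

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct cluster_info {
    unsigned int count;                 /* slots in use in this cluster */
};

struct swap_device {
    unsigned long max;                  /* number of slots on the device */
    struct cluster_info *cluster_info;  /* one entry per cluster */
};

/* Map a slot offset to the cluster that contains it. */
static inline struct cluster_info *
offset_to_cluster(struct swap_device *si, unsigned long offset)
{
    assert(offset < si->max);
    return &si->cluster_info[offset / SLOTS_PER_CLUSTER];
}
```

Centralising the division keeps the bounds check and the cluster size in
one place instead of repeating them at every call site.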
Link: https://lkml.kernel.org/r/20250113175732.48099-12-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:29 +0000 (01:57 +0800)]
mm, swap: simplify percpu cluster updating
Instead of using a returning argument, we can simply store the next
cluster offset in the fixed percpu location, which reduces stack usage
and simplifies the function:
Object size:
./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
Function                                     old     new   delta
get_swap_pages                              2847    2733    -114
alloc_swap_scan_cluster                      894     737    -157
Total: Before=30833, After=30562, chg -0.88%
Link: https://lkml.kernel.org/r/20250113175732.48099-11-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:28 +0000 (01:57 +0800)]
mm, swap: reduce contention on device lock
Currently, swap locking is mainly composed of two locks: the cluster lock
(ci->lock) and the device lock (si->lock).
The cluster lock is much more fine-grained, so it is best to use ci->lock
instead of si->lock as much as possible.
We have cleaned up other hard dependencies on si->lock. Following the new
cluster allocator design, most operations don't need to touch si->lock at
all. In practice, we only need to take si->lock when moving clusters
between lists.
To achieve this, this commit reworks the locking pattern of all si->lock
and ci->lock users, eliminates all usage of ci->lock inside si->lock, and
introduces a new design to avoid touching si->lock unless needed.
For minimal contention and easier understanding of the system, two ideas
are introduced with the corresponding helpers: isolation and relocation.
- Clusters will be `isolated` from the list when iterating the list
to search for an allocatable cluster.
This ensures other CPUs won't walk into the same cluster easily,
and it releases si->lock after acquiring ci->lock, providing the
only place that handles the inversion of two locks, and avoids
contention.
Iterating the cluster list almost always moves the cluster
(free -> nonfull, nonfull -> frag, frag -> frag tail), but it
doesn't know where the cluster should be moved to until scanning
is done. So keeping the cluster off-list is a good option with
low overhead.
The off-list time window of a cluster is also minimal. In the worst
case, one CPU will return the cluster after scanning the 512 entries
on it, which we used to busy wait with a spin lock.
This is done with the new helper `isolate_lock_cluster`.
- Clusters will be `relocated` after allocation or freeing, according
to their usage count and status.
Allocations no longer hold si->lock now, and may drop ci->lock for
reclaim, so the cluster could be moved to any location while no lock
is held. Besides, isolation clears all flags when it takes the
cluster off the list (the flags must be in sync with the list status,
so cluster users don't need to touch si->lock for checking its list
status). So the cluster has to be relocated to the right list
according to its usage after allocation or freeing.
Relocation is optional: if the cluster flags indicate it's already
on the right list, it will skip touching the list or si->lock.
This is done with `relocate_cluster` after allocation or with
`[partial_]free_cluster` after freeing.
This handles usage of all kinds of clusters in a clean way.
Scanning and allocation by iterating the cluster list is handled by
"isolate - <scan / allocate> - relocate".
Scanning and allocation of per-CPU clusters will only involve
"<scan / allocate> - relocate", as it knows which cluster to lock
and use.
Freeing will only involve "relocate".
Each CPU will keep using its per-CPU cluster until the 512 entries
are all consumed. Freeing also has to free 512 entries to trigger
cluster movement in the best case, so si->lock is rarely touched.
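The lock ordering behind "isolate - <scan / allocate> - relocate" can be
sketched in userspace as follows. This is only an illustration under
stated assumptions: the spinlock is a toy built on C11 atomics, the
cluster list is singly linked, and the struct layouts are hypothetical.
Only the helper names and the si->lock / ci->lock roles come from the
description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy test-and-set spinlock standing in for the kernel's spinlocks. */
typedef struct { atomic_bool held; } spinlock_t;
static void spin_init(spinlock_t *l)    { atomic_init(&l->held, false); }
static bool spin_trylock(spinlock_t *l) { return !atomic_exchange(&l->held, true); }
static void spin_lock(spinlock_t *l)    { while (!spin_trylock(l)) { } }
static void spin_unlock(spinlock_t *l)  { atomic_store(&l->held, false); }

struct cluster {
    spinlock_t lock;        /* plays the role of ci->lock */
    struct cluster *next;   /* singly linked list, for brevity */
    bool on_list;
};

struct device {
    spinlock_t lock;        /* plays the role of si->lock */
    struct cluster *head;
};

/*
 * Take the first cluster whose lock can be acquired without waiting,
 * unlink it and return it locked. The device lock is dropped before
 * returning, so the caller scans while holding only the cluster lock.
 */
static struct cluster *isolate_lock_cluster(struct device *si)
{
    struct cluster *found = NULL;

    spin_lock(&si->lock);
    for (struct cluster **pp = &si->head; *pp; pp = &(*pp)->next) {
        struct cluster *ci = *pp;

        if (!spin_trylock(&ci->lock))
            continue;       /* busy elsewhere: skip it, never wait */
        *pp = ci->next;     /* unlink from the list */
        ci->next = NULL;
        ci->on_list = false;
        found = ci;
        break;
    }
    spin_unlock(&si->lock);
    return found;
}

/* Put a cluster back on the list; the caller still holds ci->lock. */
static void relocate_cluster(struct device *si, struct cluster *ci)
{
    assert(!ci->on_list);
    spin_lock(&si->lock);   /* ci->lock first, then si->lock */
    ci->next = si->head;
    si->head = ci;
    ci->on_list = true;
    spin_unlock(&si->lock);
}
```

The trylock in the isolation step is the only place where the cluster
lock is taken while the device lock is held; everywhere else the order
is cluster lock first, so the two orders cannot deadlock in this sketch.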
Testing by building the Linux kernel with defconfig showed a huge
improvement:
time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before:
Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66
time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test runs)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333
After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384
time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before:
Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time)
Sys time: 6265.02, Real time: 349.83
The allocation success rate also improved slightly, as we sanitized the
usage of clusters with the newly defined helpers; previously, dropping
si->lock or ci->lock during a scan would shuffle the cluster order.
Link: https://lkml.kernel.org/r/20250113175732.48099-10-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:27 +0000 (01:57 +0800)]
mm, swap: use an enum to define all cluster flags and wrap flags changes
Currently, we are only using flags to indicate which list the cluster is
on. Using one bit for each list type might be a waste: as the number of
list types grows, we will consume too many bits. Additionally, the current
mixed usage of '&' and '==' is a bit confusing.
Make it clean by using an enum to define all possible cluster statuses.
Only an off-list cluster will have the NONE (0) flag. Also use a wrapper
to annotate and sanitize all flag settings and list movements.
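As an illustration of the idea, here is a small self-contained sketch in
which the enum value doubles as the index of the list the cluster sits
on, and one wrapper changes the flag and the list membership together.
The status names other than NONE, the list implementation and the struct
layouts are assumptions made for this example:

```c
#include <assert.h>

/* Hypothetical status names; the text only fixes NONE (0) as off-list. */
enum cluster_flags {
    CLUSTER_FLAG_NONE = 0,
    CLUSTER_FLAG_FREE,
    CLUSTER_FLAG_NONFULL,
    CLUSTER_FLAG_FRAG,
    CLUSTER_FLAG_FULL,
    CLUSTER_FLAG_MAX,
};

/* Minimal circular doubly linked list. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }

static void list_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

struct cluster { struct list_node node; enum cluster_flags flags; };

/* One list head per on-list status; index 0 (NONE) stays unused. */
struct device { struct list_node lists[CLUSTER_FLAG_MAX]; };

static void device_init(struct device *dev)
{
    for (int i = 0; i < CLUSTER_FLAG_MAX; i++)
        list_init(&dev->lists[i]);
}

static void cluster_init(struct cluster *ci)
{
    list_init(&ci->node);
    ci->flags = CLUSTER_FLAG_NONE;
}

/* The single place where flag and list membership change together. */
static void move_cluster(struct device *dev, struct cluster *ci,
                         enum cluster_flags new_flags)
{
    assert(new_flags > CLUSTER_FLAG_NONE && new_flags < CLUSTER_FLAG_MAX);
    if (ci->flags == new_flags)
        return;                     /* already on the right list */
    list_del_init(&ci->node);       /* no-op for an off-list cluster */
    list_add_tail(&ci->node, &dev->lists[new_flags]);
    ci->flags = new_flags;
}

/* Take a cluster off its list; only then may it carry NONE. */
static void isolate_cluster(struct cluster *ci)
{
    assert(ci->flags != CLUSTER_FLAG_NONE);
    list_del_init(&ci->node);
    ci->flags = CLUSTER_FLAG_NONE;
}
```

Because the status is a single value rather than a bit mask, every check
becomes a plain '==' comparison, and the asserts in the wrappers catch a
flag that drifts out of sync with the list.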
Link: https://lkml.kernel.org/r/20250113175732.48099-9-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:26 +0000 (01:57 +0800)]
mm, swap: hold a reference during scan and cleanup flag usage
The flag SWP_SCANNING was used as an indicator of whether a device is
being scanned for allocation, and prevents swapoff. Combined with
SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
1. Swapoff clears SWP_WRITEOK, allocation requests will see
~SWP_WRITEOK and abort as it's serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
allocations will stop, preventing UAF.
4. Now swapoff can free everything safely.
This gives the allocation path a hard dependency on si->lock: allocation
always has to acquire si->lock first to set SWP_SCANNING and check
SWP_WRITEOK.
This commit removes this flag, and just uses the existing per-CPU refcount
instead to prevent UAF in step 3, which serves well for such usage without
dependency on si->lock, and scales very well too. Just hold a reference
during the whole scan and allocation process. Swapoff will kill and wait
for the counter.
To prevent any allocation from happening after step 1, so that the unuse
in step 2 can ensure all slots are free, swapoff acquires the ci->lock
of each cluster one by one to ensure all allocations see ~SWP_WRITEOK and
abort.
This way, these dependencies on si->lock are gone. It is worth noting that
we can't kill the refcount as the first step of swapoff, as the unuse
process has to acquire the refcount.
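The "hold a reference during scan, kill and wait in swapoff" pattern can
be sketched with a single atomic counter. This is a simplified stand-in
for the kernel's per-CPU refcount, not its implementation; every name
below is hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified, non-per-CPU stand-in for the device refcount. */
struct dev_ref {
    atomic_long users;   /* scans/allocations currently in flight */
    atomic_bool dead;    /* set once swapoff has killed the ref */
};

static void dev_ref_init(struct dev_ref *r)
{
    atomic_init(&r->users, 0);
    atomic_init(&r->dead, false);
}

/* Allocation path: take a reference for the whole scan, or give up. */
static bool dev_tryget(struct dev_ref *r)
{
    if (atomic_load(&r->dead))
        return false;
    atomic_fetch_add(&r->users, 1);
    /* Re-check: the kill may have raced with the increment above. */
    if (atomic_load(&r->dead)) {
        atomic_fetch_sub(&r->users, 1);
        return false;
    }
    return true;
}

static void dev_put(struct dev_ref *r)
{
    long old = atomic_fetch_sub(&r->users, 1);

    assert(old > 0);
    (void)old;
}

/* Swapoff: refuse new users, then wait until dev_drained() is true. */
static void dev_kill(struct dev_ref *r)
{
    atomic_store(&r->dead, true);
}

static bool dev_drained(struct dev_ref *r)
{
    return atomic_load(&r->users) == 0;
}
```

A scan that obtained its reference before the kill keeps the device
alive until it calls dev_put(); swapoff frees nothing until
dev_drained() holds, which is the use-after-free protection that step 3
used to get from SWP_SCANNING.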
Link: https://lkml.kernel.org/r/20250113175732.48099-8-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:25 +0000 (01:57 +0800)]
mm, swap: clean up plist removal and adding
When the swap device is full (inuse_pages == pages), it should be removed
from the allocation available plist. If any slot is freed, the swap
device should be added back to the plist. Additionally, during swapon or
swapoff, the swap device is forcefully added or removed.
Currently, the condition (inuse_pages == pages) is checked after every
counter update, and the device is then removed or added accordingly. This
is serialized by si->lock.
This commit decouples it from the protection of si->lock and reworks
plist removal and adding, making it possible to get rid of the hard
dependency on si->lock in the allocation path in later commits.
To achieve this, simply using another lock is not an optimal approach, as
the overhead is observable for a hot counter, and may cause complex
locking issues. Thus, this commit manages to make it a lock-free atomic
operation, by embedding the plist state into the second highest bit of the
atomic counter.
Simply making the counter atomic will not work: if the update and the
plist status check are not performed atomically, we may miss an addition
or removal. With the embedded info, we can update the counter and check
the plist status with a single atomic operation, and avoid extra overhead:
If the counter is full (inuse_pages == pages) and the off-list bit is
unset, we attempt to remove it from the plist. If the counter is not full
(inuse_pages != pages) and the off-list bit is set, we attempt to add it
to the plist. Removing, adding, and the bit update are serialized with a
lock, which is a cold path. Ordinary counter updates are lock-free.
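A userspace sketch of the embedded-bit counter, using C11 atomics. The
names, the struct layout and the boolean standing in for plist
membership are assumptions for illustration, and the cold-path lock
mentioned above is omitted to keep the example short:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Second highest bit of the counter word marks "off the plist". */
#define OFFLIST_BIT (1UL << (sizeof(unsigned long) * 8 - 2))
#define COUNT_MASK  (OFFLIST_BIT - 1)

struct device {
    atomic_ulong inuse;    /* in-use page count, plus OFFLIST_BIT */
    unsigned long pages;   /* capacity of the device */
    bool on_plist;         /* stand-in for real plist membership */
};

/* Cold path; the real code would hold a lock around this. */
static void del_from_avail(struct device *si)
{
    unsigned long full = si->pages;

    /* Set the bit only if the word still reads "full and on-list". */
    if (atomic_compare_exchange_strong(&si->inuse, &full,
                                       si->pages | OFFLIST_BIT))
        si->on_plist = false;
}

/* Cold path; the real code would hold a lock around this. */
static void add_to_avail(struct device *si)
{
    unsigned long v = atomic_load(&si->inuse);

    /* Clear the bit only while the device is off-list and not full. */
    while ((v & OFFLIST_BIT) && (v & COUNT_MASK) != si->pages) {
        if (atomic_compare_exchange_weak(&si->inuse, &v,
                                         v & ~OFFLIST_BIT)) {
            si->on_plist = true;
            break;
        }
    }
}

/* Hot paths: one atomic operation updates the count and reads the bit. */
static void inuse_add(struct device *si, unsigned long n)
{
    unsigned long v = atomic_fetch_add(&si->inuse, n) + n;

    assert((v & COUNT_MASK) <= si->pages);
    if (v == si->pages)         /* full, and OFFLIST_BIT still clear */
        del_from_avail(si);
}

static void inuse_sub(struct device *si, unsigned long n)
{
    unsigned long v = atomic_fetch_sub(&si->inuse, n) - n;

    if (v & OFFLIST_BIT)        /* no longer full, but still off-list */
        add_to_avail(si);
}
```

Since the returned value carries both the count and the bit, no update
can observe one without the other, which is what a plain atomic counter
plus a separate status field could not guarantee.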
Link: https://lkml.kernel.org/r/20250113175732.48099-7-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:24 +0000 (01:57 +0800)]
mm, swap: clean up device availability check
Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.
Link: https://lkml.kernel.org/r/20250113175732.48099-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:23 +0000 (01:57 +0800)]
mm, swap: use cluster lock for HDD
The cluster lock (ci->lock) was introduced to reduce contention for
certain operations. Using the cluster lock for HDDs is not helpful, as
HDDs have poor performance, so locking isn't the bottleneck. But having a
different set of locks for HDD / non-HDD prevents further rework of the
device lock (si->lock).
This commit just changes all lock_cluster_or_swap_info calls to
lock_cluster, which is a safe and straightforward conversion since cluster
info is always allocated now, and also removes all cluster_info related
checks.
Link: https://lkml.kernel.org/r/20250113175732.48099-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:22 +0000 (01:57 +0800)]
mm, swap: remove old allocation path for HDD
We are currently using different swap allocation algorithms for HDD and
non-HDD. This leads to the existence of a different set of locks, and the
code path is heavily bloated, causing difficulties for further
optimization and maintenance.
This commit removes all HDD swap allocation and related dead code, and
uses the cluster allocation algorithm instead.
The performance may drop temporarily, but this should be negligible: the
main advantage of the legacy HDD allocation algorithm is that it tends to
use continuous slots, but the swap device gets fragmented quickly anyway,
and the attempt to use continuous slots will fail easily.
This commit also enables mTHP swap on HDD, which is expected to be
beneficial, and following commits will adapt and optimize the cluster
allocator for HDD.
Link: https://lkml.kernel.org/r/20250113175732.48099-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Suggested-by: "Huang, Ying" <ying.huang@linux.alibaba.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:21 +0000 (01:57 +0800)]
mm, swap: fold swap_info_get_cont in the only caller
The name of the function is confusing, and the code is much easier to
follow after folding. Also rename the confusingly named "p" to the more
meaningful "si".
Link: https://lkml.kernel.org/r/20250113175732.48099-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 13 Jan 2025 17:57:20 +0000 (01:57 +0800)]
mm, swap: minor clean up for swap entry allocation
Patch series "mm, swap: rework of swap allocator locks", v4.
This series greatly improves swap performance by reworking the locking
design and simplifying many code paths. Tests showed an up to 400%
vm-scalability improvement with pmem as SWAP, and an up to 37% reduction
in kernel compile real time with ZRAM as SWAP (up to 60% improvement in
system time).
This is part of the new swap allocator discussed during the "Swap
Abstraction" discussion at LSF/MM 2024, and "mTHP and swap allocator"
discussion at LPC 2024.
This is a follow up of previous swap cluster allocator series:
https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
This also enables further optimizations which will come later.
The previous series introduced a fully cluster-based allocator; this
series completely gets rid of the old allocator and makes the new
allocator avoid touching si->lock unless needed. This brings a huge
performance gain and gets rid of the slot cache for the freeing path.
Currently, swap locking is mainly composed of two locks: the cluster lock
(ci->lock) and the device lock (si->lock). The device lock is widely used
to protect many things, causing it to be the main bottleneck for SWAP.
The cluster lock is much more fine-grained, so it is best to use ci->lock
instead of si->lock as much as possible.
`perf lock` indicates this issue clearly. Doing a Linux kernel build using
tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k pages),
the result of "perf lock contention -ab sleep 3" shows:
contended   total wait     max wait     avg wait       type   caller
    34948      53.63 s      7.11 ms      1.53 ms   spinlock   free_swap_and_cache_nr+0x350
    16569      40.05 s      6.45 ms      2.42 ms   spinlock   get_swap_pages+0x231
    11191      28.41 s      7.03 ms      2.54 ms   spinlock   swapcache_free_entries+0x59
     4147      22.78 s    122.66 ms      5.49 ms   spinlock   page_vma_mapped_walk+0x6f3
     4595       7.17 s      6.79 ms      1.56 ms   spinlock   swapcache_free_entries+0x59
   406027       2.74 s      2.59 ms      6.74 us   spinlock   list_lru_add+0x39
...snip...
The top 5 callers are all users of si->lock, and the total wait time sums
to several minutes in the 3-second time window.
Following the new allocator design, many operations don't need to touch
si->lock at all. We only need to take si->lock when doing operations
across multiple clusters (changing the cluster list). So ideally, the
allocator should always take ci->lock first, then take si->lock only if
needed. But for historical reasons, ci->lock is used inside the si->lock
critical section, causing lock inversion if we simply try to acquire
si->lock after acquiring ci->lock.
This series audits all si->lock usage, cleans up legacy code, and
eliminates usage of si->lock as much as possible by introducing new
designs based on the new cluster allocator.
The old HDD allocation code is removed, and the cluster allocator is
adapted with small changes for HDD usage; testing looks OK.
This also removes the slot cache for the freeing path. The performance is
even better without it now, and this enables other cleanups and
optimizations as discussed before:
After this series, lock contention on si->lock is nearly unobservable
with `perf lock` with the same test above:
contended   total wait     max wait     avg wait       type   caller
... snip ...
       52    127.12 us      3.82 us      2.44 us   spinlock   move_cluster+0x2c
       56    120.77 us     12.41 us      2.16 us   spinlock   move_cluster+0x2c
... snip ...
       10     21.96 us      2.78 us      2.20 us   spinlock   isolate_lock_cluster+0x20
... snip ...
        9     19.27 us      2.70 us      2.14 us   spinlock   move_cluster+0x2c
... snip ...
        5     11.07 us      2.70 us      2.21 us   spinlock   isolate_lock_cluster+0x20
`move_cluster` and `isolate_lock_cluster` (two newly introduced helpers)
are basically the only users of si->lock now; the performance gain is
huge, and LOC is reduced.
Test Results:
vm-scalability
==============
Running `usemem --init-time -O -y -x -R -31 1G` from vm-scalability in a
12G memory cgroup using simulated pmem as SWAP backend (32G pmem, 32
CPUs).
4K folios are used by default; 64k mTHP and sequential access (!-R)
results are also provided. 6 test runs for each case, total throughput:
Test               Before (KB/s)   (stdev)        After (KB/s)   (stdev)        Delta
-------------------------------------------------------------------------------------
Random (4K):       69937.11        (16449.77)     369816.17      (24476.68)     +428.78%
Random (64k):      123442.83       (13207.51)     216379.00      (25024.83)     +75.28%
Sequential (4K):   6313909.83      (148856.12)    6419860.66     (183563.38)    +1.7%
Sequential access will cause lower stress for the allocator so the gain is
limited, but with random access (which is much closer to real workloads)
the performance gain is huge.
Build kernel with defconfig on tmpfs with ZRAM
==============================================
The results below show a test matrix using different memory cgroup limits
and job numbers, scaled up progressively for an intuitive result. Done on
a 48c96t system.
6 test runs for each case. It can be seen clearly that as the concurrent
job number goes higher the performance gain is higher, but even -j6 shows
a slight improvement.
Fragmentation is reduced too:
With: make -j96 / 1152M memcg, 64K mTHP:
(avg of 4 test runs)
Before:
hugepages-64kB/stats/swpout: 1696184
hugepages-64kB/stats/swpout_fallback: 414318
After: (-63.2% mTHP swapout failure)
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330
There is an up to 65.1% improvement in sys time for the kernel build test,
and a lower fragmentation rate.
Build kernel with tinyconfig on tmpfs with HDD as swap:
=======================================================
This test is similar to the one above, but the HDD test is very noisy and
slow and the deviation is huge, so just use tinyconfig instead and take
the median result of 3 test runs, which looks OK:
Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU
2901232inputs+0outputs (238877major+4227640minor)pagefaults
After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU
2548728inputs+0outputs (235471major+4238110minor)pagefaults
Single thread SWAP:
===================
Sequential SWAP should also be slightly faster as we removed a lot of
unnecessary parts. Tested using a micro benchmark for swapout/in of 4G of
zero memory using ZRAM, 10 test runs:
Suren Baghdasaryan [Thu, 26 Dec 2024 21:16:38 +0000 (13:16 -0800)]
alloc_tag: avoid current->alloc_tag manipulations when profiling is disabled
When memory allocation profiling is disabled there is no need to update
current->alloc_tag and these manipulations add unnecessary overhead. Fix
the overhead by skipping these extra updates.
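A minimal sketch of the idea, with a plain boolean standing in for the
kernel's profiling switch and a local struct standing in for current.
The function and field names are hypothetical, chosen for illustration
only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct alloc_tag { const char *site; };

/* Models the per-task tag pointer (current->alloc_tag in the text). */
struct task { struct alloc_tag *alloc_tag; };

/* Plain flag for the sketch; stands in for the profiling on/off state. */
static bool profiling_enabled;

/* Install `tag` for the duration of an allocation; return the old tag. */
static struct alloc_tag *tag_save(struct task *cur, struct alloc_tag *tag)
{
    struct alloc_tag *old;

    assert(cur != NULL);
    if (!profiling_enabled)
        return NULL;            /* skip the store entirely */
    old = cur->alloc_tag;
    cur->alloc_tag = tag;
    return old;
}

/* Undo tag_save(); again a no-op while profiling is disabled. */
static void tag_restore(struct task *cur, struct alloc_tag *old)
{
    assert(cur != NULL);
    if (!profiling_enabled)
        return;
    cur->alloc_tag = old;
}
```

With the early returns, the disabled case costs one predictable branch
per call instead of a load and two stores to the task structure.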
I ran comprehensive testing on Pixel 6 on Big, Medium and Little cores:
           Overhead before fixes        Overhead after fixes
           slab alloc    page alloc     slab alloc    page alloc
Big        6.21%         5.32%          3.31%         4.93%
Medium     4.51%         5.05%          3.79%         4.39%
Little     7.62%         1.82%          6.68%         1.02%
This is an allocation microbenchmark doing allocations in a tight loop.
It is not a very realistic scenario and is useful only for performance
comparisons.
Link: https://lkml.kernel.org/r/20241226211639.1357704-1-surenb@google.com
Fixes: b951aaff5035 ("mm: enable page allocation tagging")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chen Ridong [Tue, 24 Dec 2024 02:52:38 +0000 (02:52 +0000)]
memcg: fix soft lockup in the OOM process
A soft lockup issue was found in a product with about 56,000 tasks in the
OOM cgroup; the kernel was traversing them when the soft lockup was
triggered.
This is because thousands of processes are in the OOM cgroup, and it takes
a long time to traverse all of them. As a result, this leads to a soft
lockup in the OOM process.
To fix this issue, call 'cond_resched' in the 'mem_cgroup_scan_tasks'
function every 1000 iterations. For global OOM, call
'touch_softlockup_watchdog' every 1000 iterations to avoid this issue.
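The shape of the fix can be sketched in userspace as a long loop that
periodically reaches a yield point. Everything here is an assumption
made for illustration: yield_point() merely counts calls where the
kernel would call 'cond_resched' or 'touch_softlockup_watchdog', and the
scan function is a hypothetical stand-in for the task iteration.

```c
#include <assert.h>
#include <stddef.h>

#define RESCHED_INTERVAL 1000

/* Instrumentation for the sketch only: counts yield points reached. */
static unsigned long yield_calls;

/* Stand-in for the kernel's rescheduling or watchdog call. */
static void yield_point(void)
{
    yield_calls++;
}

/* Example callback: counts the tasks it is shown, never stops early. */
static int count_task(size_t idx, void *arg)
{
    (void)idx;
    (*(size_t *)arg)++;
    return 0;
}

/* Visit n tasks with fn; stop early when fn returns nonzero. */
static int scan_tasks(size_t n, int (*fn)(size_t idx, void *arg), void *arg)
{
    int ret = 0;

    assert(fn != NULL);
    for (size_t i = 0; i < n && !ret; i++) {
        /* Once per RESCHED_INTERVAL tasks, let other work run. */
        if (i && i % RESCHED_INTERVAL == 0)
            yield_point();
        ret = fn(i, arg);
    }
    return ret;
}
```

Checking a counter costs far less than yielding on every iteration,
while still bounding how long the loop can run without a break.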
Link: https://lkml.kernel.org/r/20241224025238.3768787-1-chenridong@huaweicloud.com
Fixes: 9cbb78bb3143 ("mm, memcg: introduce own oom handler to iterate only over its own threads")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Mon, 16 Dec 2024 15:04:42 +0000 (00:04 +0900)]
mm/zsmalloc: convert reset_page to reset_zpdesc
zpdesc.zspage matches with page.private, zpdesc.next matches with
page.index. They will be reset in reset_page() which is called prior to
freeing the base pages of a zspage.
Since the fields that need to be initialized are independent of the order
in struct zpdesc, keep reset_zpdesc() using struct page fields to ensure
robustness against potential rearrangements of struct zpdesc fields in
the future.
[42.hyeyoo@gmail.com: reset zpdesc fields in reset_zpdesc()]
Link: https://lkml.kernel.org/r/Z4Uw136VdG7vlKCL@localhost.localdomain
[42.hyeyoo@gmail.com: keep reset_zpdesc() to use struct page fields]
Link: https://lkml.kernel.org/r/20241216150450.1228021-12-42.hyeyoo@gmail.com
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hyeonggon Yoo [Mon, 16 Dec 2024 15:04:41 +0000 (00:04 +0900)]
mm/zsmalloc: add two helpers for zs_page_migrate() and make it use zpdesc
To convert page to zpdesc in zs_page_migrate(), we added
zpdesc_is_isolated()/zpdesc_zone() helpers. No functional change.
Link: https://lkml.kernel.org/r/20241216150450.1228021-11-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hyeonggon Yoo [Mon, 16 Dec 2024 15:04:38 +0000 (00:04 +0900)]
mm/zsmalloc: convert obj_allocated() and related helpers to use zpdesc
Convert obj_allocated() and related helpers to take zpdesc. Also make
their callers cast (struct page *) to (struct zpdesc *) when calling
them. The users will be converted gradually as there are many.
Link: https://lkml.kernel.org/r/20241216150450.1228021-8-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Alex Shi <alexs@kernel.org> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Mon, 16 Dec 2024 15:04:33 +0000 (00:04 +0900)]
mm/zsmalloc: use zpdesc in trylock_zspage()/lock_zspage()
Convert trylock_zspage() and lock_zspage() to use zpdesc. To achieve
that, introduce a couple of helper functions:
- zpdesc_lock()
- zpdesc_unlock()
- zpdesc_trylock()
- zpdesc_wait_locked()
- zpdesc_get()
- zpdesc_put()
Here we use the folio version of the functions for two reasons. First,
zswap.zpool currently only uses order-0 pages, and using folios could save
some compound_head checks. Second, folio_put could bypass devmap checking
that we don't need.
Thanks to Intel LKP for finding a build warning in the patch.
Originally-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lkml.kernel.org/r/20241216150450.1228021-3-42.hyeyoo@gmail.com Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Shi [Mon, 16 Dec 2024 15:04:32 +0000 (00:04 +0900)]
mm/zsmalloc: add zpdesc memory descriptor for zswap.zpool
Patch series "Add zpdesc memory descriptor for zswap.zpool", v9.
This patch series introduces a new memory descriptor for zswap.zpool that
currently overlaps with struct page. This is part of the effort
to reduce the size of struct page and to enable dynamic allocation of
memory descriptors [1].
This series does not bloat anything for zsmalloc and no functional change
is intended (except for using zpdesc and folios).
In the near future, the removal of page->index from struct page [2] will
be addressed and the project also depends on this patch series.
Thanks to everyone who got involved in this series, especially Alex, who
has been pushing it forward this year.
The 1st patch introduces new memory descriptor zpdesc and renames
zspage.first_page to zspage.first_zpdesc, with no functional change.
We removed the comment about PG_owner_priv_1 since it is no longer used
after commit a41ec880aa7b ("zsmalloc: move huge compressed obj from page
to zspage").
[rdunlap@infradead.org: fix function parameter kernel-doc notation] Link: https://lkml.kernel.org/r/20250111063305.911010-1-rdunlap@infradead.org
[42.hyeyoo@gmail.com: rework comments a little bit] Link: https://lkml.kernel.org/r/20241216150450.1228021-1-42.hyeyoo@gmail.com Link: https://lkml.kernel.org/r/20241216150450.1228021-2-42.hyeyoo@gmail.com Originally-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Thu, 9 Jan 2025 17:51:25 +0000 (09:51 -0800)]
Docs/admin-guide/mm/damon/usage: omit DAMOS filter details in favor of design doc
The DAMON usage document describes some details about DAMOS filters that
are also documented in the design doc. Deduplicate the details in favor
of the design doc.
SeongJae Park [Thu, 9 Jan 2025 17:51:22 +0000 (09:51 -0800)]
mm/damon/sysfs-schemes: add a file for setting damos_filter->allow
Only kernel-space DAMON API users can use inclusive DAMOS filters. Add a
sysfs file named 'allow' under DAMOS filter directory of DAMON sysfs
interface, to let the user-space users use inclusive DAMOS filters.
SeongJae Park [Thu, 9 Jan 2025 17:51:21 +0000 (09:51 -0800)]
mm/damon: add 'allow' argument to damos_new_filter()
DAMON API users must set damos_filter->allow manually to use a DAMOS
allow-filter, since damos_new_filter() always unsets the field. This is
cumbersome and error-prone. Add an argument for setting the field to
damos_new_filter().
SeongJae Park [Thu, 9 Jan 2025 17:51:20 +0000 (09:51 -0800)]
mm/damon/paddr: support damos_filter->allow
Respect damos_filter->allow from 'paddr', which is a DAMON operations set
implementation for the physical address space and supports a few types of
region-internal DAMOS filters (anon, memcg and young). The change is
similar to that of the previous commit for core layer update.
SeongJae Park [Thu, 9 Jan 2025 17:51:19 +0000 (09:51 -0800)]
mm/damon/core: support damos_filter->allow
DAMOS filters support allowing behavior, but the core layer's DAMOS
filter handling logic still assumes only rejecting (filtering-out)
behavior. Update the logic to be aware of and respect the behavioral
decision by reading damos_filter->allow when deciding whether to
exclude a region.
SeongJae Park [Thu, 9 Jan 2025 17:51:18 +0000 (09:51 -0800)]
mm/damon/core: add damos_filter->allow field
DAMOS filters work only as exclusive (reject) filters. This is easy to
misunderstand, and restrictive when combining multiple filters to cover
various types of memory.
Add a field named 'allow' to damos_filter. The field will be used to
indicate whether the filter should work for inclusion or exclusion. To
keep the old behavior, set it as 'false' (work as exclusive filter) by
default, from damos_new_filter().
The following two commits will make the core and operations set layers,
which handle damos_filter objects, respect the field.
SeongJae Park [Thu, 9 Jan 2025 17:51:17 +0000 (09:51 -0800)]
mm/damon: fixup damos_filter kernel-doc
Patch series "mm/damon: extend DAMOS filters for inclusion", v2.
DAMOS filters are exclusive filters. They only exclude memory matching the
given criteria from the DAMOS action targets. This has the following
limitations.
First, the name does not explicitly describe the behavior. This actually
resulted in users' confusion[1]. Second, combined use of multiple
filters provides only restricted coverage. For example, building a DAMOS
scheme that applies the action to memory that belongs to cgroup A "or"
cgroup B is impossible. A workaround would be using two schemes that
filter out memory that does not belong to cgroup A and cgroup B,
respectively. That is cumbersome, and makes it difficult to control
quota-like per-scheme features in an orchestrated way. Monitoring of
filters-passed memory statistics will also be complicated.
Extend DAMOS filters to support not only exclusion (rejecting), but also
inclusion (allowing) behavior. For this, add a new damos_filter struct
field called 'allow' for DAMON kernel API users. The filter works as an
inclusion or exclusion filter when it is set or unset, respectively. For
DAMON user-space ABI users, add a DAMON sysfs file of same name under
DAMOS filter sysfs directory. To prevent exposing a behavioral change to
old users, set rejecting as the default behavior.
Note that allow-filters work only for inclusion, not for exclusion of
memory that does not satisfy the criteria. And the default behavior of
DAMOS for memory that no filter has been involved with is that the action
can be applied to that memory. Also, filters-passed memory statistics are
for any memory that passed through the DAMOS filters check stage. These
imply that installing allow-filters at the end of the filter list is
useless. Refer to the design doc change of this series for more details.
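The semantics described above can be sketched as a small userspace model.
This is not the kernel's struct damos_filter or its handling code:
filter_model, action_allowed() and the integer cgroup ids are hypothetical,
and the rule that the first applicable filter decides is an assumption made
for illustration (it is consistent with the remark that a trailing
allow-filter is useless):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a DAMOS filter. */
struct filter_model {
	int cgroup_id;	/* criterion: memory charged to this cgroup */
	bool matching;	/* apply to memory that meets (true) or misses (false) it */
	bool allow;	/* true: inclusion (allow), false: exclusion (reject) */
};

/*
 * Returns true if the scheme's action may be applied to memory that
 * belongs to region_cgroup.
 */
bool action_allowed(const struct filter_model *filters, size_t nr_filters,
		    int region_cgroup)
{
	for (size_t i = 0; i < nr_filters; i++) {
		bool meets = (region_cgroup == filters[i].cgroup_id);

		/* Assumption: the first applicable filter decides. */
		if (meets == filters[i].matching)
			return filters[i].allow;
	}
	/* No filter was involved: the action can be applied. */
	return true;
}
```

Under this model, the "cgroup A or cgroup B" case needs a single scheme with
two filters: allow memory of cgroup A, then reject memory that is not of
cgroup B. Memory of A is allowed by the first filter, memory of B passes by
default, and everything else is rejected by the second.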
The comment is slightly wrong. DAMOS filters are not only for pages, but
for general bytes of memory. Also the description of 'matching' is a bit
confusing, since DAMOS filters only do filtering out. Update the comments
to be less confusing.
Luiz Capitulino [Mon, 23 Dec 2024 22:00:37 +0000 (17:00 -0500)]
mm: alloc_pages_bulk_noprof: drop page_list argument
Patch series "mm: alloc_pages_bulk: small API refactor", v2.
Today, alloc_pages_bulk_noprof() supports two arguments to return
allocated pages: a linked list and an array. There are also higher level
APIs for both.
However, the linked list API has apparently never been used. So, this
series removes it along with the list API and also refactors the remaining
API naming for consistency.
This patch (of 2):
commit 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator") added
__alloc_pages_bulk() along with the page_list argument. The next commit 0f87d9d30f21 ("mm/page_alloc: add an array-based interface to the bulk
page allocator") added the array-based argument. As it turns out, the
page_list argument has no users in the current tree (if it ever had any).
Dropping it allows for a slight simplification and eliminates some
unnecessary checks, now that page_array is required.
Also, note that the removal of the page_list argument was proposed before
in the thread below, where Matthew Wilcox mentions that:
"""
Iterating a linked list is _expensive_. It is about 10x quicker to
iterate an array than a linked list.
"""
(https://lore.kernel.org/linux-mm/20231025093254.xvomlctwhcuerzky@techsingularity.net)
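A minimal userspace sketch of an array-based bulk interface follows. The
function name bulk_alloc_array_model() is hypothetical, malloc() stands in
for the page allocator, and the behaviour of keeping already-populated
slots is an assumption made for illustration rather than a statement about
alloc_pages_bulk_noprof() itself:

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical model of an array-based bulk allocator.
 *
 * Fills every NULL slot among the first nr_pages entries of page_array
 * and leaves populated slots alone. Returns the number of leading slots
 * that are populated on return, which is less than nr_pages only if an
 * allocation failed.
 */
size_t bulk_alloc_array_model(size_t nr_pages, size_t page_size,
			      void **page_array)
{
	size_t populated = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		if (!page_array[i])
			page_array[i] = malloc(page_size);
		if (!page_array[i])
			break;		/* allocation failed, stop here */
		populated++;
	}
	return populated;
}
```

With an array, the caller can retry by passing the same array again, and
walking the result is a plain indexed loop rather than a pointer chase
through list nodes.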
Ryan Roberts [Tue, 7 Jan 2025 14:47:53 +0000 (14:47 +0000)]
selftests/mm: introduce uffd-wp-mremap regression test
Introduce a test that registers a range of memory for
UFFDIO_WRITEPROTECT_MODE_WP without UFFD_FEATURE_EVENT_REMAP. First check
that the uffd-wp bit is set for every PTE in the range. Then mremap() the
range to a new location and check that the uffd-wp bit is clear for every
PTE in the range.
Run the test for small folios, all supported THP sizes and all supported
hugetlb sizes, and for swapped out memory, shared and private.
There was previously a bug in the kernel where the uffd-wp bits remained
set in all PTEs for this case. After fixing the kernel, the tests all
pass.
Link: https://lkml.kernel.org/r/20250107144755.1871363-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 7 Jan 2025 20:40:02 +0000 (15:40 -0500)]
mm/hugetlb: unify restore reserve accounting for new allocations
Hugetlb pages, whether dequeued from the hstate or newly allocated from
buddy, require restore-reserve accounting to be managed properly. Merge
the two paths on it. Add a small comment to make it slightly nicer.
Link: https://lkml.kernel.org/r/20250107204002.2683356-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 7 Jan 2025 20:40:01 +0000 (15:40 -0500)]
mm/hugetlb: drop vma_has_reserves()
After the previous cleanup, vma_has_reserves() is mostly an empty helper,
except that its "use reserve count" result has the inverted meaning of
"needs a global reserve count", which is still true.
To avoid confusion from having two inverted ways to ask the same question,
always use gbl_chg everywhere, and drop the function.
While at it, rename "chg" to "gbl_chg" in dequeue_hugetlb_folio_vma(). It
might help readers see that the "chg" here is the global
reserve count, not the vma resv count.
Link: https://lkml.kernel.org/r/20250107204002.2683356-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 7 Jan 2025 20:40:00 +0000 (15:40 -0500)]
mm/hugetlb: simplify vma_has_reserves()
vma_has_reserves() is a helper "trying" to know whether the vma should
consume one reservation when allocating the hugetlb folio.
However, it's not clear why we need such complexity, as that information
is already represented in the "chg" variable.
From alloc_hugetlb_folio() context, "chg" (or in the function's context,
"gbl_chg") is defined as:
- If gbl_chg=1, the allocation cannot reuse an existing reservation
- If gbl_chg=0, the allocation should reuse an existing reservation
Firstly, map_chg is defined as follows, to cover all cases of hugetlb
reservation scenarios (mostly via vma_needs_reservation(), but
cow_from_owner is an outlier):
  CONDITION                                          HAS RESERVATION?
  =========                                          ================
  - SHARED: always check against per-inode resv_map
    (ignore NONRESERVE)
    - If resv exists                                 ==> YES [1]
    - If not                                         ==> NO  [2]
  - PRIVATE: complicated...
    - Request came from a CoW from owner resv map    ==> NO  [3]
      (when cow_from_owner==true)
    - If does not own a resv_map at all..            ==> NO  [4]
      (examples: VM_NORESERVE, private fork())
    - If owns a resv_map, but resv doesn't exist     ==> NO  [5]
    - If owns a resv_map, and resv exists            ==> YES [6]
Further on, gbl_chg also takes the spool setup into account, so it is a
decision based on all the context.
If we look at vma_has_reserves(), it mostly repeats checks that have
already been processed by the map_chg accounting (I marked each return
value with the case above):
  static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
  {
          if (vma->vm_flags & VM_NORESERVE) {
                  if (vma->vm_flags & VM_MAYSHARE && chg == 0)
                          return true;   ==> [1]
                  else
                          return false;  ==> [2] or [4]
          }
          if (vma->vm_flags & VM_MAYSHARE) {
                  if (chg)
                          return false;  ==> [2]
                  else
                          return true;   ==> [1]
          }
          if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
                  if (chg)
                          return false;  ==> [5]
                  else
                          return true;   ==> [6]
          }
          return false;                  ==> [4]
  }
It didn't check [3], but the [3] case is actually already covered now by
the "chg" / "gbl_chg" / "map_chg" calculations.
In short, vma_has_reserves() doesn't provide anything more than returning
"!chg", so just simplify all the things.
There are a lot of comments describing truncation races; IIUC there should
be no race as long as map_chg is properly done.
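The "!chg" claim can be checked with a small userspace model. Everything
here is a sketch: vma_model and the function names are hypothetical
stand-ins, chg is derived from the "HAS RESERVATION?" table (with the CoW
case [3] left out), and the model assumes that a private VM_NORESERVE
mapping never owns a resv_map, as the example for case [4] suggests:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the VMA state used by the helper. */
struct vma_model {
	bool may_share;		/* models VM_MAYSHARE */
	bool noreserve;		/* models VM_NORESERVE */
	bool resv_owner;	/* models is_vma_resv_set(vma, HPAGE_RESV_OWNER) */
};

/* Same decision tree as the vma_has_reserves() listing above. */
bool model_vma_has_reserves(const struct vma_model *vma, long chg)
{
	if (vma->noreserve)
		return vma->may_share && chg == 0;
	if (vma->may_share)
		return chg == 0;
	if (vma->resv_owner)
		return chg == 0;
	return false;
}

/* "HAS RESERVATION?" column of the table, without the CoW case [3]. */
bool model_has_reservation(const struct vma_model *vma, bool resv_exists)
{
	if (vma->may_share)
		return resv_exists;			/* [1] or [2] */
	return vma->resv_owner && resv_exists;		/* [4], [5] or [6] */
}

/* Counts states where the helper disagrees with a plain "!chg". */
int count_has_reserves_mismatches(void)
{
	int mismatches = 0;

	for (int bits = 0; bits < 16; bits++) {
		struct vma_model vma = {
			.may_share = !!(bits & 1),
			.noreserve = !!(bits & 2),
			.resv_owner = !!(bits & 4),
		};
		bool resv_exists = !!(bits & 8);
		long chg;

		/* Assumption: private VM_NORESERVE never owns a resv_map. */
		if (!vma.may_share && vma.noreserve && vma.resv_owner)
			continue;

		chg = model_has_reservation(&vma, resv_exists) ? 0 : 1;
		if (model_vma_has_reserves(&vma, chg) != (chg == 0))
			mismatches++;
	}
	return mismatches;
}
```

The enumeration only covers states in which chg was computed according to
the table, which is the precondition the message relies on; for arbitrary
combinations of flags and chg the helper and "!chg" can differ.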
Link: https://lkml.kernel.org/r/20250107204002.2683356-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 7 Jan 2025 20:39:59 +0000 (15:39 -0500)]
mm/hugetlb: clean up map/global resv accounting when allocate
alloc_hugetlb_folio() isn't an easy function to read, especially regarding
reservation accounting for either the VMA or globally (mostly, spool only).
The 1st complexity lies in the special private CoW path, i.e. the
cow_from_owner=true case.
The 2nd complexity may be the confusing updates of gbl_chg after it's set
once, which look like they can change anytime on the fly.
Logically, cow_from_owner is only about vma reservation. We could
decouple the flag and consolidate it into the map charge flag very early.
Then we don't need to keep checking the CoW special flag every time.
This patch does it by making map_chg a tri-state flag. Needing a tri-state
is unfortunate, and it's because vma_needs_reservation() currently has a
side effect internally, so that it must be followed by either an end() or
a commit().
We keep the same semantics as before on one thing: "if (map_chg)" means we
need a separate per-vma resv count. This keeps most of the old code
untouched with the new enum.
After this patch, we take these steps to decide these variables, hopefully
slightly easier to follow:
- First, decide map_chg. This will take cow_from_owner into account,
  once and for all. It's about whether we could take a resv count from
  the vma, no matter whether it's shared, private, etc.
- Then, decide gbl_chg. The only difference here, compared to map_chg,
  is the spool.
Now each flag is only updated once and for all, instead of having any of
them keep flipping, which can be very hard to follow.
With cow_from_owner merged into map_chg, we could remove quite a few such
checks all over. A side benefit is that we can get rid of one more
confusing flag, which is deferred_reserve.
Clean up the comments a bit too. E.g., MAP_NORESERVE may not need to check
against the spool limit, AFAIU, if it's on a shared mapping, and if the page
cache folio has its inode's resv map available (in which case map_chg
would have been set to zero, hence the code should be correct, not the
comment).
There's one trivial detail that needs attention that this patch touched,
which is this check right after vma_commit_reservation():
if (map_chg > map_commit)
It changes to:
if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0))
It should behave the same as before, because previously the only way to
make "map_chg > map_commit" happen is map_chg=1 && map_commit=0. That's
exactly the rewritten line. Meanwhile, either commit() or end() will need
to be skipped if ENFORCE, to keep the old behavior.
Even though it looks like a lot has changed, no functional change is
expected.
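The equivalence of the two checks quoted above can be verified by
enumeration. In this sketch only MAP_CHG_NEEDED is a name taken from the
text; the other enumerator names, all numeric values, and the function
names are assumptions, as is the restriction of the old map_chg and
map_commit to the values 0 and 1:

```c
#include <stdbool.h>

/* Tri-state model; only MAP_CHG_NEEDED is named in the message. */
enum map_chg_state {
	MAP_CHG_REUSE = 0,	/* assumed: reuse an existing reservation */
	MAP_CHG_NEEDED = 1,	/* a new per-vma reservation is needed */
	MAP_CHG_ENFORCED = 2,	/* assumed: forced, skip commit()/end() */
};

/* Check as it was written before the patch. */
bool old_check(long map_chg, long map_commit)
{
	return map_chg > map_commit;
}

/* Check as it is written after the patch. */
bool new_check(enum map_chg_state map_chg, long retval)
{
	return map_chg == MAP_CHG_NEEDED && retval == 0;
}

/* Counts (map_chg, map_commit) pairs in {0,1}x{0,1} where they differ. */
int count_check_mismatches(void)
{
	int mismatches = 0;

	for (long chg = 0; chg <= 1; chg++)
		for (long commit = 0; commit <= 1; commit++)
			if (old_check(chg, commit) !=
			    new_check((enum map_chg_state)chg, commit))
				mismatches++;
	return mismatches;
}
```

In this model the third state never satisfies the new check, which matches
the statement that commit() and end() are skipped in the ENFORCE case.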
Link: https://lkml.kernel.org/r/20250107204002.2683356-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 7 Jan 2025 20:39:58 +0000 (15:39 -0500)]
mm/hugetlb: rename avoid_reserve to cow_from_owner
The old name "avoid_reserve" is too generic and can be used wrongly in
new call sites that want to allocate a hugetlb folio.
It's confusing in two respects: (1) whether one can opt in to avoid global
reservation, and (2) whether it should take more than one count.
In reality, this flag is only used in an extremely hacky path, in an
extremely hacky way, in the hugetlb CoW path only, and is always used with
1, meaning "skip global reservation". Rename the flag to avoid future abuse
of it, and make it a boolean to reflect that it is not a counter. To make
it even harder to abuse, add a comment above the function to explain it.
Link: https://lkml.kernel.org/r/20250107204002.2683356-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 7 Jan 2025 20:39:57 +0000 (15:39 -0500)]
mm/hugetlb: stop using avoid_reserve flag in fork()
When fork() stumbles upon a dma-pinned hugetlb private page, CoW
must happen during fork() to guarantee dma coherency.
In this specific path, hugetlb pages need to be allocated for the child
process. Stop using avoid_reserve=1 flag here: it's not required to be
used here, as dest_vma (which is destined to be a MAP_PRIVATE hugetlb vma)
will have no private vma resv map, and that will make sure it won't be
able to use a vma reservation later.
No functional change is intended with this change. That said, it's still
worth doing, so as to reduce the usage of avoid_reserve to its only
user, which is also why this flag was introduced initially in commit
04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that
called mmap(MAP_PRIVATE) on hugetlbfs will succeed"). I don't see who
else should set it at all.
A further patch will clean up resv accounting based on this.
Link: https://lkml.kernel.org/r/20250107204002.2683356-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The goal of this series is to clean up hugetlb resv accounting, especially
during folio allocation, to decouple a few things:
- Hugetlb folios vs. hugetlbfs: IOW, the hope is that in the future hugetlb
  folios can be allocated completely without hugetlbfs.
- VMA vs. hugetlb folio allocations: allocating a hugetlb folio
  should not always require a hugetlbfs VMA. For example, either it got
  allocated from the inode level (see hugetlbfs_fallocate(), where it used
  a pseudo VMA for allocation), or it can be allocated by other kernel
  subsystems.
It paves the way for other users to allocate hugetlb folios out of either
system reservations, or subpools (instead of hugetlbfs, as a file system).
In the longer term, this prepares hugetlb as a separate concept from
hugetlbfs, so that hugetlb folios can be allocated not only by hugetlbfs
but also by other users.
Tests I've done:
- I had a reproducer in patch 1 for the bug I found; it will start to
  work after patch 1 or the whole set is applied.
- Hugetlb regression tests (on x86_64 2MBs), including:
  - All vmtests on hugetlbfs
  - libhugetlbfs test suite (which may fail some tests, but no new failures
    are introduced by this series; all such failures happen before
    this series, so they shouldn't be relevant).
This patch (of 7):
In commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
avoid_reserve was introduced for a special case of CoW on hugetlb private
mappings, and only if the owner VMA is trying to allocate yet another
hugetlb folio that is not reserved within the private vma reserved map.
Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle
areas hole punched by fallocate"), alloc_huge_page() was made to not
consume any global reservation as long as avoid_reserve=true. This
operation doesn't look correct, because even though it prevents the
allocation from using the global reservation at all, it will still try to
take one reservation from the spool (if the subpool exists). Then, since
the spool's reserved pages are taken from the global reservation, it'll
also take one reservation globally.
Logically, this can cause the global reservation to go wrong.
I wrote a reproducer below that triggers this special path; every run of
the program will cause the global reservation count to increment by one,
until it hits the number of free pages:
Fix it by taking the reservation from spool if available. In general,
avoid_reserve is IMHO more about "avoid vma resv map", not spool's.
I copied stable, however I have no intention for backporting if it's not a
clean cherry-pick, because private hugetlb mapping, and then fork() on top
is too rare to hit.
Link: https://lkml.kernel.org/r/20250107204002.2683356-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20250107204002.2683356-2-peterx@redhat.com Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Wed, 8 Jan 2025 02:16:49 +0000 (10:16 +0800)]
mm: shmem: skip swapcache for swapin of synchronous swap device
With fast swap devices (such as zram), swapin latency is crucial to
applications. For shmem swapin, similar to anonymous memory swapin, we
can skip the swapcache operation to improve swapin latency. Testing 1G
shmem sequential swapin without THP enabled, I observed approximately a 6%
performance improvement: (Note: I repeated 5 times and took the mean data
for each test)
  w/o patch   w/ patch   changes
  534.8ms     501ms      +6.3%
In addition, we currently always split the large swap entry stored in the
shmem mapping during shmem large folio swapin, which is not ideal,
especially with a fast swap device. If the swap device is a synchronous
device, we should swap in the whole large folio instead of splitting the
precious large folios, to take advantage of the large folios and improve
the swapin latency, similar to anonymous memory mTHP swapin.
Testing 1G shmem sequential swapin with 64K mTHP and 2M mTHP, I observed
obvious performance improvement:
Note that skipping swapcache requires attention to concurrent swapin
scenarios. Fortunately the swapcache_prepare() and
shmem_add_to_page_cache() can help identify concurrent swapin and large
swap entry split scenarios, and return -EEXIST for retry.
[akpm@linux-foundation.org: use IS_ENABLED(), tweak comment grammar] Link: https://lkml.kernel.org/r/3d9f3bd3bc6ec953054baff5134f66feeaae7c1e.1736301701.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guo Weikang [Mon, 6 Jan 2025 02:11:25 +0000 (10:11 +0800)]
mm/memmap: prevent double scanning of memmap by kmemleak
kmemleak explicitly scans the mem_map through the valid struct page
objects. However, memmap_alloc() was also adding this memory to the gray
object list, causing it to be scanned twice. Remove memmap_alloc() from
the scan list and add a comment to clarify the behavior.
Bruno Faccini [Mon, 6 Jan 2025 12:06:59 +0000 (04:06 -0800)]
mm/fake-numa: allow later numa node hotplug
The current fake-numa implementation prevents new NUMA nodes from being
hot-plugged later by drivers. A common symptom of this limitation is the
"node <X> was absent from the node_possible_map" message and the
associated warning in mm/memory_hotplug.c: add_memory_resource().
This comes from the lack of remapping in both the pxm_to_node_map[] and
node_to_pxm_map[] tables to take fake-numa nodes into account, which
triggers collisions with the original, physical-nodes-only mapping that
had been determined from BIOS tables.
This patch fixes this by doing the necessary node-id translation in both
the pxm_to_node_map[] and node_to_pxm_map[] tables. The node_distance[]
table has also been fixed accordingly.
Details:
When trying to use the fake-numa feature on our system, where new NUMA
nodes are "hot-plugged" upon driver load, this fails with the following
type of message and warning with stack:
  node 8 was absent from the node_possible_map
  WARNING: CPU: 61 PID: 4259 at mm/memory_hotplug.c:1506 add_memory_resource+0x3dc/0x418
This issue prevents the use of the fake-NUMA debug feature with the
system's full configuration, when it has proven to be sometimes extremely
useful for performance testing of multi-tasked, memory-bound applications,
as it enables better isolation of processes/ranks compared to fat NUMA
nodes.
Link: https://lkml.kernel.org/r/20250106120659.359610-2-bfaccini@nvidia.com Signed-off-by: Bruno Faccini <bfaccini@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 6 Jan 2025 19:19:41 +0000 (11:19 -0800)]
mm/damon: remove DAMON debugfs interface
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
All documents and related tests are also removed. Finally remove the
interface.
Link: https://lkml.kernel.org/r/20250106191941.107070-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Remove kunit tests for the interface, to prevent unnecessary test
failures.
Link: https://lkml.kernel.org/r/20250106191941.107070-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 6 Jan 2025 19:19:39 +0000 (11:19 -0800)]
kunit: configs: remove configs for DAMON debugfs interface tests
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Remove kernel configs for running DAMON debugfs interface kunit tests from
the kunit all_tests configuration, to prevent unnecessary noise from the
tests.
Link: https://lkml.kernel.org/r/20250106191941.107070-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 6 Jan 2025 19:19:38 +0000 (11:19 -0800)]
selftests/damon: remove tests for DAMON debugfs interface
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Remove selftests for the interface, to prevent unnecessary test
failures.
Link: https://lkml.kernel.org/r/20250106191941.107070-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 6 Jan 2025 19:19:37 +0000 (11:19 -0800)]
selftests/damon/config: remove configs for DAMON debugfs interface selftests
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Remove configs for its selftests from the DAMON selftests config file, to
prevent unnecessary noise from the tests.
Link: https://lkml.kernel.org/r/20250106191941.107070-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Mon, 6 Jan 2025 19:19:36 +0000 (11:19 -0800)]
Docs/mm/damon/design: update for removal of DAMON debugfs interface
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Update the DAMON design documentation to stop mentioning the interface, to
avoid unnecessary confusion.
Link: https://lkml.kernel.org/r/20250106191941.107070-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Remove the DAMON debugfs interface usage documentation, to avoid confusing
users with documents for an already removed thing.
Link: https://lkml.kernel.org/r/20250106191941.107070-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: remove DAMON debugfs interface".
The DAMON debugfs interface was the only user interface of DAMON at the
beginning[1]. However, it turned out that the interface would not be good
enough for long-term flexibility and stability.
In Feb 2022[2], we therefore introduced the DAMON sysfs interface as an
alternative user interface that aims for long-term flexibility and
stability. With its introduction, the DAMON debugfs interface was
announced to be deprecated in the near future.
In Feb 2023[3], we announced the official deprecation of the DAMON debugfs
interface. In Jan 2024[4], we further made the deprecation difficult to
ignore.
In Oct 2024[5], we posted an RFC version of this patch series as the last
notice.
As of this writing, no problems or concerns about the removal plan have
been reported. Apparently users have already moved to the alternative, or
have made good plans for the change.
Remove the DAMON debugfs interface code from the tree. Given the past
timeline and the absence of reported problems or concerns, it is safe
enough to do so.
It's time to remove the DAMON debugfs interface, which was deprecated long
ago, in February 2023. Read the cover letter of this patch series for
more details.
Remove DAMON debugfs interface usage documentation and references to it
from translations, to avoid confusing users with documents for already
removed things.
Link: https://lkml.kernel.org/r/20250106191941.107070-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250106191941.107070-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The size of memory that passed the operations set-handled DAMOS filters
is provided per region, but only to DAMON core API users. Expose it to
user space as well, by adding a new DAMON sysfs interface file under each
scheme tried region directory.
SeongJae Park [Mon, 6 Jan 2025 19:33:57 +0000 (11:33 -0800)]
mm/damon/core: pass per-region filter-passed bytes to damos_walk_control->walk_fn()
The total size of memory that passed the DAMON operations set
layer-handled DAMOS filters is provided per scheme to DAMON core API and
ABI (sysfs interface) users. Providing it per region, in a
non-accumulated way, gives finer granularity. Provide it to damos_walk()
core API users by passing the data to damos_walk_control->walk_fn().
SeongJae Park [Mon, 6 Jan 2025 19:33:53 +0000 (11:33 -0800)]
mm/damon/sysfs-schemes: implement per-scheme filter-passed bytes stat
Add a new DAMON sysfs interface file under the scheme stat directory,
namely 'sz_ops_filter_passed'. It represents the total bytes that passed
the region-internal DAMOS filters of the scheme that are handled by the
DAMON operations set layer.
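As an illustration of how user space might consume the new file, here is a minimal Python sketch. It assumes the usual DAMON sysfs layout (`/sys/kernel/mm/damon/admin/kdamonds/<N>/contexts/<N>/schemes/<N>/stats/`) and that writing `update_schemes_stats` to the kdamond `state` file refreshes the stats files, as the DAMON usage document describes; the helper names and the `root` parameter are hypothetical and exist only so the code can be tried against a mock directory.

```python
import os

# Assumed default location of the DAMON sysfs interface.
DAMON_SYSFS_ROOT = "/sys/kernel/mm/damon/admin"


def scheme_stats_dir(root, kdamond=0, context=0, scheme=0):
    """Build the path of one scheme's stats directory (assumed layout)."""
    return os.path.join(root, "kdamonds", str(kdamond), "contexts",
                        str(context), "schemes", str(scheme), "stats")


def read_sz_ops_filter_passed(root=DAMON_SYSFS_ROOT, kdamond=0, context=0,
                              scheme=0, refresh=True):
    """Return the per-scheme filter-passed bytes as an int."""
    if refresh:
        # Ask the kdamond to update the stats files before reading them.
        state = os.path.join(root, "kdamonds", str(kdamond), "state")
        with open(state, "w") as f:
            f.write("update_schemes_stats")
    path = os.path.join(scheme_stats_dir(root, kdamond, context, scheme),
                        "sz_ops_filter_passed")
    with open(path) as f:
        return int(f.read().strip())
```

On a real system this needs root privileges and a running kdamond with at least one scheme installed.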
SeongJae Park [Mon, 6 Jan 2025 19:33:52 +0000 (11:33 -0800)]
mm/damon/core: implement per-scheme ops-handled filter-passed bytes stat
Implement a new per-DAMOS scheme statistic field, namely
sz_ops_filter_passed, using the changed damon_operations->apply_scheme()
interface. It counts the total bytes of memory that the given DAMOS
action was tried on and that passed the operations layer-handled
region-internal filters of the scheme. DAMON API users can access it
using DAMON-internal safe access features such as damon_call() and/or
damos_walk().
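The accounting described above can be sketched as a small userspace model in Python. This is not the kernel code: the class and function names are illustrative, and the fields other than sz_ops_filter_passed are assumed to mirror the existing tried/applied counters of the per-scheme stat.

```python
from dataclasses import dataclass


@dataclass
class DamosStat:
    """Toy per-scheme statistics (field set is an assumption)."""
    nr_tried: int = 0
    sz_tried: int = 0
    nr_applied: int = 0
    sz_applied: int = 0
    sz_ops_filter_passed: int = 0


def update_stat(stat, sz_tried, sz_applied, sz_ops_filter_passed):
    """Account one region that the scheme's action was tried on."""
    stat.nr_tried += 1
    stat.sz_tried += sz_tried
    if sz_applied:
        stat.nr_applied += 1
    stat.sz_applied += sz_applied
    # Bytes that passed the operations-layer filters, whether or not
    # the action then succeeded on them.
    stat.sz_ops_filter_passed += sz_ops_filter_passed
```

Keeping the filter-passed counter separate from sz_applied is what lets a reader tell "filtered out" apart from "action failed".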
SeongJae Park [Mon, 6 Jan 2025 19:33:51 +0000 (11:33 -0800)]
mm/damon/paddr: report filter-passed bytes back for DAMOS_STAT action
The DAMOS_STAT action handling of the paddr DAMON operations set
implementation simply ignores the region-internal DAMOS filters, and
therefore does not report back the filter-passed bytes. Apply the filters
and report back the information.
Before this change, DAMOS_STAT did nothing for DAMOS filters. Hence users
might see some performance regressions. For use cases where no
region-internal DAMOS filter is added to the scheme, the regression will
be negligible, since this change skips the filtering work if no such
filter is installed.
For existing users who use DAMOS_STAT with such filters, the regression
could be visible, depending on the size of the region and the overhead of
the installed DAMOS filters. But because the filters were completely
ignored in that use case before, it is unlikely that any real user
depends on such a pointless setup.
SeongJae Park [Mon, 6 Jan 2025 19:33:50 +0000 (11:33 -0800)]
mm/damon/paddr: report filter-passed bytes back for normal actions
damon_operations->apply_scheme() implementations are requested to report
back how many bytes of the given region have passed the DAMOS filters.
The 'paddr' operations set implementation supports some region-internal
DAMOS filter handling for normal DAMOS actions, that is, all actions
except DAMOS_STAT. But those are not respecting the request. Report the
region-internal DAMOS filter-passed bytes back for the actions.
SeongJae Park [Mon, 6 Jan 2025 19:33:49 +0000 (11:33 -0800)]
mm/damon: ask apply_scheme() to report filter-passed region-internal bytes
Some DAMOS filter types, including those for young pages, anon pages, and
belonging memcg, are handled by the underlying DAMON operations set
implementation, via the damon_operations->apply_scheme() interface. How
many bytes of the region have passed the filters can be useful for DAMOS
scheme tuning and access pattern monitoring. Modify the interface to let
the callback implementation report back the number if possible.
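To make the changed contract concrete, the following Python sketch models an apply_scheme()-like callback that returns both the applied bytes and the filter-passed bytes for one region. It is a simplified userspace model under stated assumptions: `Page`, `Filter`, the fixed `PAGE_SIZE`, and the rule that a page is filtered out when its property equals the filter's `matching` flag are all illustrative, not the kernel's data structures.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

PAGE_SIZE = 4096  # assumed page size for the sketch


@dataclass
class Page:
    anon: bool
    young: bool


@dataclass
class Filter:
    prop: str        # name of a Page attribute, e.g. "anon" or "young"
    matching: bool   # pages whose property equals this are filtered out


def filtered_out(page: Page, filters: List[Filter]) -> bool:
    """True if any installed filter excludes the page."""
    return any(getattr(page, f.prop) == f.matching for f in filters)


def apply_scheme(pages: List[Page], filters: List[Filter],
                 action: Callable[[Page], bool]) -> Tuple[int, int]:
    """Return (applied bytes, filter-passed bytes) for one region."""
    applied = passed = 0
    for page in pages:
        if filtered_out(page, filters):
            continue
        passed += PAGE_SIZE
        if action(page):          # the action itself may still fail
            applied += PAGE_SIZE
    return applied, passed
```

For example, with one filter that excludes non-anonymous pages and an action that fails on young pages, the two returned numbers differ, which is exactly the distinction that the applied-bytes counter alone cannot express.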
SeongJae Park [Mon, 6 Jan 2025 19:33:48 +0000 (11:33 -0800)]
Docs/admin-guide/mm/damon/usage: link damos stat design doc
The DAMON sysfs usage document focuses on usage, rather than the details
of the stat metrics themselves. Add a link to the design document in the
DAMOS stat usage section.
SeongJae Park [Mon, 6 Jan 2025 19:33:47 +0000 (11:33 -0800)]
Docs/mm/damon/design: add 'statistics' section
DAMOS stats are an important feature for tuning DAMOS-based access-aware
system operations and for efficient access pattern monitoring. But they
are not well documented in the design document. Add a section to the
document.
SeongJae Park [Mon, 6 Jan 2025 19:33:46 +0000 (11:33 -0800)]
mm/damon: clarify trying vs applying on damos_stat kernel-doc comment
Patch series "mm/damon: enable page level properties based monitoring".
TL; DR
======
This patch series enables access monitoring based on page level properties
including their anonymousness, belonging cgroups and young-ness, by
extending DAMOS stats and regions walk features with region-internal DAMOS
filters.
Background
==========
DAMOS was initially developed only for access-aware system operations.
But efficient querying of access monitoring results is yet another major
usage of today's DAMOS. DAMOS stats and regions walk, which expose
accumulated counts and per-region monitoring results that are filtered by
DAMOS parameters including target access pattern, quotas and DAMOS
filters, are the key features for that usage. For tuning and
investigations, it can be more useful if the information can be exposed
without making any real system operational change. A special DAMOS
action, DAMOS_STAT, was introduced for the purpose.
DAMOS fundamentally works with only access pattern information in region
granularity. For some use cases, fixed and fine granularity information
based on non-access-pattern properties can be useful, though. For
example, on systems having swap devices that are much faster than storage
devices for files, DAMOS-based proactive reclaim needs to be applied
differently for anonymous pages and file-backed pages.
DAMOS filters are a feature that makes this possible. They support
non-access-pattern information including page level properties such as
anonymousness, belonging cgroups, and young-ness (whether the page has
been accessed since its last access check). The information can be useful
for tuning and investigations. DAMOS stat exposes some of it via
{nr,sz}_applied, but it is mixed with operation failures. Also, exposing
the information without making system operation changes is impossible,
since DAMOS_STAT simply ignores the page level properties based DAMOS
filters.
Design
======
Expose the exact information for every DAMOS action including DAMOS_STAT
by implementing the changes below.
Extend the interface for the DAMON operations set layer, which contains
the implementation of the page level filters, to report back the amount
of memory that passed the region-internal DAMOS filters to the core layer.
On the core layer, account the operations set layer-reported stat with
DAMOS stat for per-scheme monitoring. Also, pass the information to
regions walk for per-region monitoring. In this way, DAMON API users can
efficiently get the fine-grained information.
For user space, make the DAMON sysfs interface collect the information
using the updated DAMON core API, and expose it via a new per-scheme
stats file and a new per-DAMOS-tried region properties file.
Practical Usages
================
With this patch series, DAMON users can query how many bytes of regions
of a specific access temperature are backed by pages of a specific type.
The type can be any that DAMOS filters support, including anonymousness,
belonging cgroups, and young-ness. For example, users can visualize an
access hotness-based page granularity histogram for different cgroups,
backing content types, or youngness. In the future, it could be extended
to more types such as whether the page is a THP, its position on LRU
lists, etc. This can be useful for estimating the benefits of new or
existing access-aware system optimizations without really committing the
changes.
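The histogram use case can be sketched in a few lines of Python. The input format is an assumption made for illustration: each region is a pair of its access frequency (nr_accesses) and the number of bytes in it that passed the installed filter, as a user might collect from the per-region information this series exposes. The function name and the bucket width are hypothetical.

```python
from collections import defaultdict


def hotness_histogram(regions, bucket_width=5):
    """Sum filter-passed bytes per access-frequency bucket.

    regions: iterable of (nr_accesses, sz_filter_passed) pairs.
    Returns a dict mapping each bucket's lower bound to total bytes.
    """
    hist = defaultdict(int)
    for nr_accesses, sz_filter_passed in regions:
        bucket = (nr_accesses // bucket_width) * bucket_width
        hist[bucket] += sz_filter_passed
    return dict(sorted(hist.items()))
```

Running it once per filter setting (for example, once for anonymous pages and once for file-backed pages) would give one histogram per page type.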
Patches Sequence
================
The patches are constructed in four sub-sequences.
The first three patches (patches 1-3) update documents to add missing
background knowledge and a better structure for easily introducing the
followup changes.
The following three patches (patches 4-6) change the operations set layer
interface to report back the region-internal filter-passed memory size,
and make the operations set implementations support the changed semantics.
The following five patches (patches 7-11) implement the per-scheme
accumulated stat for region-internal filter-passed memory size on the
core API (damos_stat) and the DAMON sysfs interface. The first two of
those are for code changes, and the following three are for
documentation.
Finally, five patches (patches 12-16) implementing the per-region
region-internal filter-passed memory size follow. Similar to those for
the per-scheme stat, the first two patches implement the core API and
sysfs interface changes. Then three patches for documentation updates
follow.
This patch (of 16):
The DAMOS stat kernel-doc documentation uses terms that are a bit
ambiguous. Without reading the code, understanding it correctly is not
that easy. Add clarification to the kernel-doc comment.
SeongJae Park [Fri, 3 Jan 2025 17:44:00 +0000 (09:44 -0800)]
mm/damon/sysfs: remove unused code for schemes tried regions update
DAMON sysfs interface was using damon_callback with its own complicated
synchronization logic to update DAMOS scheme applied regions directories
and files. But it has been converted to use damos_walk(), and the
additional synchronization logic is no longer used. Remove it.
SeongJae Park [Fri, 3 Jan 2025 17:43:59 +0000 (09:43 -0800)]
mm/damon/sysfs: use damos_walk() for update_schemes_tried_{bytes,regions}
DAMON sysfs interface uses damon_callback with its own complicated
synchronization facility to handle update_schemes_tried_bytes and
update_schemes_tried_regions commands. But damos_walk() can support the
use case without the additional synchronizations. Convert the code to use
damos_walk() instead.
SeongJae Park [Fri, 3 Jan 2025 17:43:58 +0000 (09:43 -0800)]
Docs/mm/damon/design: document DAMOS regions walking
DAMOS' regions walking is a feature for efficiently retrieving monitoring
results or DAMOS-internal behavior. It can be useful for multiple
purposes including investigations and tuning. Add a section for it on the
design document.
SeongJae Park [Fri, 3 Jan 2025 17:43:57 +0000 (09:43 -0800)]
mm/damon/core: implement damos_walk()
Introduce a new core layer interface, damos_walk(). It aims to replace
some damon_callback usages that access DAMOS scheme-applied regions of an
ongoing kdamond with additional synchronizations. It receives a function
pointer and asks the kdamond to invoke it, for every scheme of the
kdamond, on each region that the kdamond tried to apply the scheme's
DAMOS action to within one scheme apply interval. The function then waits
until the kdamond finishes the invocations for every scheme, or cancels
the request, and returns.
The kdamond invokes the function as requested within the main loop. If it
is deactivated by DAMOS watermarks or is going out of the main loop, it
marks the request as canceled, so that damos_walk() can wake up and
return.
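The walking behaviour can be modelled in userspace as below. This Python sketch is single-threaded and therefore leaves out the part where the caller blocks until the kdamond is done; it only shows which regions the walk function sees and when the request becomes complete or canceled. All class and function names are illustrative assumptions, and a scheme's target access pattern is reduced to an nr_accesses range.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Region:
    start: int
    end: int
    nr_accesses: int


@dataclass
class Scheme:
    name: str
    min_nr_accesses: int
    max_nr_accesses: int

    def targets(self, r: Region) -> bool:
        return self.min_nr_accesses <= r.nr_accesses <= self.max_nr_accesses


@dataclass
class WalkControl:
    walk_fn: Callable[[Scheme, Region], None]
    walked: set = field(default_factory=set)  # names of schemes walked
    completed: bool = False
    canceled: bool = False


def apply_schemes_once(schemes, regions, walk=None):
    """One scheme-apply pass of the modelled kdamond."""
    for s in schemes:
        for r in regions:
            if not s.targets(r):
                continue
            # (the scheme's DAMOS action would be tried on r here)
            if walk and not walk.completed:
                walk.walk_fn(s, r)
        if walk:
            walk.walked.add(s.name)
    if walk and len(walk.walked) == len(schemes):
        walk.completed = True   # a waiting caller could return now


def cancel_walk(walk):
    """Kdamond is deactivated or leaves its main loop before finishing."""
    if not walk.completed:
        walk.canceled = True
```

A caller would pass a walk function that, for example, records each region's size per scheme, and then check whether the request ended as completed or canceled.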
SeongJae Park [Fri, 3 Jan 2025 17:43:56 +0000 (09:43 -0800)]
mm/damon/sysfs: use damon_call() for update_schemes_effective_quotas
DAMON sysfs interface uses damon_callback with its own synchronization
facility to handle update_schemes_effective_quotas command. But
damon_call() can support the use case without the additional
synchronizations. Convert the code to use damon_call() instead.
SeongJae Park [Fri, 3 Jan 2025 17:43:55 +0000 (09:43 -0800)]
mm/damon/sysfs: use damon_call() for commit_schemes_quota_goals
DAMON sysfs interface uses damon_callback with its own synchronization
facility to handle commit_schemes_quota_goals command. But damon_call()
can support the use case without the additional synchronizations. Convert
the code to use damon_call() instead.
SeongJae Park [Fri, 3 Jan 2025 17:43:54 +0000 (09:43 -0800)]
mm/damon/sysfs: use damon_call() for update_schemes_stats
DAMON sysfs interface uses damon_callback with its own synchronization
facility to handle update_schemes_stats kdamond command. But damon_call()
can support the use case without the additional synchronizations. Convert
the code to use damon_call() instead.
SeongJae Park [Fri, 3 Jan 2025 17:43:53 +0000 (09:43 -0800)]
mm/damon/core: introduce damon_call()
Introduce a new DAMON core API function, damon_call(). It aims to replace
some damon_callback usages that access damon_ctx of an ongoing kdamond
with additional synchronizations. It receives a function pointer, lets
the parallel kdamond invoke the function, and returns after the
invocation is finished, or canceled due to some races.
The kdamond invokes the function inside the main loop after sampling is
done. If it is deactivated by DAMOS watermarks or is already out of the
main loop, it marks the request as canceled so that damon_call() can wake
up and return.
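The request/serve/cancel handshake described here can be illustrated with a userspace Python model built on threads. It is only a sketch of the idea, not the kernel implementation: `CallRequest`, `Kdamond`, the dictionary standing in for damon_ctx, and the one-pending-request limit are all assumptions made for the example.

```python
import threading
import time


class CallRequest:
    """One damon_call()-style request (illustrative name)."""

    def __init__(self, fn):
        self.fn = fn
        self.return_code = None
        self.canceled = False
        self.done = threading.Event()


class Kdamond:
    """Toy worker thread standing in for a kdamond."""

    def __init__(self):
        self.ctx = {"nr_samples": 0}   # stands in for damon_ctx internals
        self.lock = threading.Lock()
        self.pending = None
        self.alive = False
        self.stop_requested = False
        self.thread = threading.Thread(target=self._main_loop)

    def start(self):
        self.alive = True
        self.thread.start()

    def stop(self):
        self.stop_requested = True
        self.thread.join()

    def _main_loop(self):
        while not self.stop_requested:
            self.ctx["nr_samples"] += 1            # "sampling"
            with self.lock:
                req, self.pending = self.pending, None
            if req is not None:                    # serve after sampling
                req.return_code = req.fn(self.ctx)
                req.done.set()
            time.sleep(0.001)
        with self.lock:                            # leaving the main loop
            self.alive = False
            req, self.pending = self.pending, None
        if req is not None:                        # never served: cancel it
            req.canceled = True
            req.done.set()


def damon_call(kdamond, fn):
    """Run fn(ctx) in the kdamond thread and wait, or get canceled."""
    req = CallRequest(fn)
    with kdamond.lock:
        if not kdamond.alive:
            req.canceled = True
            return req
        if kdamond.pending is not None:
            raise RuntimeError("another request is already pending")
        kdamond.pending = req
    req.done.wait()
    return req
```

Because the function runs in the worker thread between sampling steps, it can read or update the context without extra locking on the caller's side, which is the point of moving the synchronization into the core layer.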
SeongJae Park [Fri, 3 Jan 2025 17:43:52 +0000 (09:43 -0800)]
mm/damon/sysfs: handle clear_schemes_tried_regions from DAMON sysfs context
DAMON sysfs interface handles clear_schemes_tried_regions request from the
DAMON callback context (damon_sysfs_cmd_request_callback()), which is
designed to be used for safe access to the related DAMON context internal
data. But no DAMON context internal data is accessed for the work.
Directly handle it from DAMON sysfs interface context, namely
damon_sysfs_handle_cmd().
SeongJae Park [Fri, 3 Jan 2025 17:43:51 +0000 (09:43 -0800)]
mm/damon/sysfs-schemes: remove unnecessary schemes existence check in damon_sysfs_schemes_clear_regions()
Patch series "mm/damon: replace most damon_callback usages in sysfs with
new core functions".
DAMON provides damon_callback API that notifies monitoring events and
allows safe access to damon_ctx internal data. The usage is simple.
Users register and deregister callback functions for different monitoring
events in damon_ctx. Then the DAMON worker thread (kdamond) of the
damon_ctx calls back the registered functions on the events.
It is designed in such a simple way because it was sufficient for the
usages of DAMON in the early days. We also wanted to make it flexible so
that API user code can implement any required additional features on top
of damon_callback on demand.
As expected, more sophisticated usages have been invented. Online updates
of DAMON parameters and DAMOS auto-tuning inputs, and online retrieval of
DAMOS statistics and tried regions information are such usages. Because
damon_callback doesn't provide any explicit synchronization mechanism, the
user ABIs for exposing such functionalities are implemented in
asynchronous ways (DAMON_RECLAIM and DAMON_LRU_SORT), or synchronous ways
(DAMON_SYSFS) with additional synchronization mechanisms that are built
inside the ABI implementation, on top of damon_callback.
So damon_callback is working as expected. However, the additional
mechanisms built inside the ABI on top of damon_callback are becoming
somewhat too big and not easy to maintain. The additional mechanisms can
be smaller and easier to maintain when implemented inside the core logic
layer.
Introduce two new DAMON core API functions, namely 'damon_call()' and
'damos_walk()'. The two functions support synchronous access to
- damon_ctx internal data including DAMON parameters and monitoring
results, and
- DAMOS-specific data such as regions that each DAMOS action is applied to,
respectively.
And replace most of the damon_callback usages in the DAMON sysfs interface
with the new core API functions. The damon_callback usage for online
DAMON parameters tuning is not replaced in this series, since it has
specific callback timing assumptions that require more work.
Patch sequence
==============
The first two patches are fixups for simplifying the following changes.
They remove an unnecessary condition check and a synchronization,
respectively.
The third patch implements one of the new DAMON core APIs, namely
damon_call(). Three patches replacing damon_callback usages in the DAMON
sysfs interface with damon_call() follow.
Then, the seventh and eighth patches introduce the other new DAMON API,
damos_walk(), and document it in the design doc. The ninth patch replaces
two damon_callback usages in the DAMON sysfs interface with damos_walk().
The tenth patch finally cleans up code that is no longer used.
This patch (of 10):
damon_sysfs_schemes_clear_regions() skips removing the scheme tried region
directories only if the matching scheme is still ongoing. It is an
unnecessary check, since what users want is just removing the entire
region directories. Remove the unnecessary check.
Kevin Brodsky [Fri, 3 Jan 2025 18:44:15 +0000 (18:44 +0000)]
mm: introduce ctor/dtor at PGD level
Following on from the introduction of P4D-level ctor/dtor, let's finish
the job and introduce ctor/dtor at PGD level. The incurred improvement in
page accounting is minimal - the main motivation is to create a single,
generic place where construction/destruction hooks can be added for all
page table pages.
This patch should cover all architectures and all configurations where
PGDs are one or more regular pages. This excludes any configuration where
PGDs are allocated from a kmem_cache object.
Link: https://lkml.kernel.org/r/20250103184415.2744423-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Fri, 3 Jan 2025 18:44:14 +0000 (18:44 +0000)]
asm-generic: pgalloc: provide generic __pgd_{alloc,free}
We already have a generic implementation of alloc/free up to P4D level, as
well as pgd_free(). Let's finish the work and add a generic PGD-level
alloc helper as well.
Unlike at lower levels, almost all architectures need some specific magic
at PGD level (typically initialising PGD entries), so introducing a
generic pgd_alloc() isn't worth it. Instead we introduce two new helpers,
__pgd_alloc() and __pgd_free(), and make use of them in the arch-specific
pgd_alloc() and pgd_free() wherever possible. To accommodate as many
architectures as possible, __pgd_alloc() takes a page allocation order.
Because pagetable_alloc() allocates zeroed pages, explicit zeroing in
pgd_alloc() becomes redundant and we can get rid of it. Some trivial
implementations of pgd_free() also become unnecessary once __pgd_alloc()
is used; remove them.
Another small improvement is consistent accounting of PGD pages by using
GFP_PGTABLE_{USER,KERNEL} as appropriate.
Not all PGD allocations can be handled by the generic helpers. In
particular, multiple architectures allocate PGDs from a kmem_cache, and
those PGDs may not be page-sized.
Link: https://lkml.kernel.org/r/20250103184415.2744423-6-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Fri, 3 Jan 2025 18:44:13 +0000 (18:44 +0000)]
ARM: mm: rename PGD helpers
Generic implementations of __pgd_alloc and __pgd_free are about to be
introduced. Rename the macros in arch/arm/mm/pgd.c to avoid clashes.
While we're at it, also pass down the mm as argument to those helpers, as
it will be needed to call the generic __pgd_{alloc,free}.
Link: https://lkml.kernel.org/r/20250103184415.2744423-5-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Fri, 3 Jan 2025 18:44:12 +0000 (18:44 +0000)]
m68k: mm: add calls to pagetable_pmd_[cd]tor
get_pointer_table() and free_pointer_table() already special-case
TABLE_PTE to call pagetable_pte_[cd]tor. Let's do the same at PMD level
to improve accounting further. TABLE_PGD and TABLE_PMD are currently
defined to the same value, so we first need to separate them. That also
implies separating ptable_list for PMD/PGD levels.
Link: https://lkml.kernel.org/r/20250103184415.2744423-4-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Fri, 3 Jan 2025 18:44:11 +0000 (18:44 +0000)]
parisc: mm: ensure pagetable_pmd_[cd]tor are called
The implementation of pmd_{alloc_one,free} on parisc requires a non-zero
allocation order, but is completely standard aside from that. Let's reuse
the generic implementation of pmd_alloc_one(). Explicit zeroing is not
needed as GFP_PGTABLE_KERNEL includes __GFP_ZERO. The generic pmd_free()
can handle higher allocation orders so we don't need to define our own.
These changes ensure that pagetable_pmd_[cd]tor are called, improving the
accounting of page table pages.
Link: https://lkml.kernel.org/r/20250103184415.2744423-3-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kevin Brodsky [Fri, 3 Jan 2025 18:44:10 +0000 (18:44 +0000)]
mm: move common part of pagetable_*_ctor to helper
Patch series "Account page tables at all levels".
This series should be considered in conjunction with Qi's series [1].
Together, they ensure that page table ctor/dtor are called at all levels
(PTE to PGD) and all architectures, where page tables are regular pages.
Besides the improvement in accounting and general cleanup, this also
creates a single place where construction/destruction hooks can be called
for all page tables, namely the now-generic pagetable_dtor() introduced
by Qi, and __pagetable_ctor() introduced in this series.
Lorenzo Stoakes [Fri, 3 Jan 2025 19:35:36 +0000 (19:35 +0000)]
mm/debug: prefer VM_WARN_ON_VMG() to report VMG debug warnings
Now we have VM_WARN_ON_VMG() to provide us with considerably more debug
output when a debug assert fails, utilise it everywhere we can.
This allows us to have considerably more information to go on when things
go wrong, especially when a non-repro issue occurs as reported by
syzkaller or the like.
Link: https://lkml.kernel.org/r/986e45e9549e71284ac7a7fa878688568a94d58b.1735932169.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 3 Jan 2025 19:35:35 +0000 (19:35 +0000)]
mm/debug: introduce VM_WARN_ON_VMG() to dump VMA merge state
Patch series "mm/debug: introduce and use VM_WARN_ON_VMG()".
We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set,
during VMA merge operations to ensure state is as expected.
However, when syzkaller or the like encounters these asserts, often the
information provided by the report is insufficient to narrow down what the
problem is.
We noticed this recently in [0], where a non-repro issue resisted
debugging due to simply not having sufficient information to go on.
This series improves the situation by providing VM_WARN_ON_VMG() which
acts like VM_WARN_ON() (i.e. only actually being invoked if
CONFIG_DEBUG_VM is set), while dumping significant information about the
VMA merge state, the mm_struct describing the virtual address space, all
associated VMAs and, if CONFIG_DEBUG_VM_MAPLE_TREE is set, the associated
maple tree.
We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set,
during VMA merge operations to ensure state is as expected.
However, when syzkaller or the like encounters these asserts, often the
information provided by the report is insufficient to narrow down what the
problem is.
This might not be so much of an issue if the reported problem is
reproducible, but if it is a rarely encountered race or some other case
which precludes a repro, it is a very big problem (see [0] for the
motivating case).
It is therefore sensible to provide a means by which we can easily and
conveniently dump a lot more information in these circumstances.
The aggregation of merge state into a single struct threaded through the
operation makes this trivial - we can simply introduce a variant on
VM_WARN_ON() which takes the VMA merge state object (vmg) and use that to
dump information.
This patch therefore introduces VM_WARN_ON_VMG() which provides this
functionality.
It additionally dumps full mm state, VMA state for each of the three VMAs
the vmg contains (prev, next, vma) and if CONFIG_DEBUG_VM_MAPLE_TREE is
enabled, dumps the maple tree from the provided VMA iterator if non-NULL.
This patch has no functional impact if CONFIG_DEBUG_VM is not set.
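The idea of an assert variant that takes the merge state and dumps it can be sketched in userspace Python as follows. This is a loose model, not the kernel macro: `MergeState` holds only a few address ranges, `DEBUG_VM` stands in for CONFIG_DEBUG_VM, and the dump format is invented for the example.

```python
import warnings
from dataclasses import dataclass
from typing import Optional, Tuple

DEBUG_VM = True  # stands in for CONFIG_DEBUG_VM

Range = Optional[Tuple[int, int]]


@dataclass
class MergeState:
    """Loosely modelled on the VMA merge state object (vmg)."""
    start: int
    end: int
    prev: Range = None
    vma: Range = None
    next: Range = None


def _fmt(r: Range) -> str:
    return "(none)" if r is None else f"[{r[0]:#x}, {r[1]:#x})"


def dump_vmg(vmg: MergeState, reason: str) -> str:
    """Render everything the merge state knows in one report."""
    return "\n".join([
        f"vmg dump: {reason}",
        f"  range [{vmg.start:#x}, {vmg.end:#x})",
        f"  prev {_fmt(vmg.prev)}",
        f"  vma  {_fmt(vmg.vma)}",
        f"  next {_fmt(vmg.next)}",
    ])


def vm_warn_on_vmg(cond: bool, vmg: MergeState, reason: str = "") -> bool:
    """Warn with a full state dump if cond holds; no-op without DEBUG_VM."""
    if not DEBUG_VM:
        return False
    if cond:
        warnings.warn(dump_vmg(vmg, reason))
    return cond
```

Because the whole merge state travels in one object, the call site only needs to pass the condition and that object to get a complete report.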
Maninder Singh [Mon, 30 Dec 2024 10:10:43 +0000 (15:40 +0530)]
lib/list_debug.c: add object information in case of invalid object
Currently, on linked list corruption, the report prints the culprit
address and its wrong value, but sometimes that is not enough to catch the
actual issue point.
If it also prints the allocation and free paths of the corrupted node, it
will be a lot easier to find and fix the issues.
Add that information when a data mismatch is found in the linked list
debug data:
[ 14.243055] slab kmalloc-32 start ffff0000cda19320 data offset 32 pointer offset 8 size 32 allocated at add_to_list+0x28/0xb0
[ 14.245259] __kmalloc_cache_noprof+0x1c4/0x358
[ 14.245572] add_to_list+0x28/0xb0
...
[ 14.248632] do_el0_svc_compat+0x1c/0x34
[ 14.249018] el0_svc_compat+0x2c/0x80
[ 14.249244] Free path:
[ 14.249410] kfree+0x24c/0x2f0
[ 14.249724] do_force_corruption+0xbc/0x100
...
[ 14.252266] el0_svc_common.constprop.0+0x40/0xe0
[ 14.252540] do_el0_svc_compat+0x1c/0x34
[ 14.252763] el0_svc_compat+0x2c/0x80
[ 14.253071] ------------[ cut here ]------------
[ 14.253303] list_del corruption. next->prev should be ffff0000cda192a8, but was 6b6b6b6b6b6b6b6b. (next=ffff0000cda19348)
[ 14.254255] WARNING: CPU: 3 PID: 84 at lib/list_debug.c:65 __list_del_entry_valid_or_report+0x158/0x164
Moved prototype of mem_dump_obj() to bug.h, as mm.h can not be included in
bug.h.
Link: https://lkml.kernel.org/r/20241230101043.53773-1-maninder1.s@samsung.com Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
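The effect of the list_debug change above can be illustrated with a small userspace Python model. It is not the code in lib/list_debug.c: nodes record their allocation and free sites as plain strings, `dump_obj()` is a toy stand-in for mem_dump_obj(), and all other names are hypothetical. The scenario mirrors the log above, where a node is freed while still linked.

```python
POISON = 0x6b6b6b6b6b6b6b6b  # free-poison pattern, as seen in the log above


class Node:
    def __init__(self, name, alloc_site):
        self.name = name
        self.alloc_site = alloc_site  # recorded "allocation path"
        self.free_site = None         # recorded "free path", once freed
        self.prev = self.next = self


def list_add(new, head):
    """Insert new right after head."""
    new.prev, new.next = head, head.next
    head.next.prev = new
    head.next = new


def free_without_unlinking(node, site):
    """Simulate the bug: the node is freed while still on the list."""
    node.free_site = site
    node.prev = node.next = POISON


def dump_obj(node):
    """Toy counterpart of mem_dump_obj(): where did this object come from?"""
    text = f"{node.name} allocated at {node.alloc_site}"
    if node.free_site:
        text += f", freed at {node.free_site}"
    return text


def _show(ptr):
    return hex(ptr) if isinstance(ptr, int) else ptr.name


def list_del_report(entry):
    """Return report lines if deleting entry would hit corruption."""
    checks = (("prev->next", entry.prev, "next"),
              ("next->prev", entry.next, "prev"))
    for label, neighbour, attr in checks:
        seen = getattr(neighbour, attr)
        if seen is not entry:
            return [dump_obj(neighbour),
                    f"list_del corruption. {label} should be {entry.name}, "
                    f"but was {_show(seen)}"]
    return []
```

The first report line is the added value: it names where the offending neighbour was allocated and freed, which the bare "should be X, but was Y" line cannot tell.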