Liam R. Howlett [Mon, 12 Aug 2024 21:59:25 +0000 (17:59 -0400)]
mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding
vmas
Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.
Since remove_mt() is only called in one location and the contents will
be reduced to almost nothing. The remains of the function can be added
to vms_complete_munmap_vmas().
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Liam R. Howlett [Fri, 9 Aug 2024 19:56:40 +0000 (15:56 -0400)]
mm/vma: Extract the gathering of vmas from do_vmi_align_munmap()
Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later. Part of the gathering is the
splitting of vmas that span the boundary.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Liam R. Howlett [Fri, 9 Aug 2024 19:09:32 +0000 (15:09 -0400)]
mm/vma: Introduce vmi_complete_munmap_vmas()
Extract all necessary operations that need to be completed after the vma
maple tree is updated from a munmap() operation. Extracting this makes
the later patch in the series easier to understand.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Liam R. Howlett [Fri, 9 Aug 2024 19:04:00 +0000 (15:04 -0400)]
mm/vma: Introduce abort_munmap_vmas()
Extract clean up of failed munmap() operations from
do_vmi_align_munmap(). This simplifies later patches in the series.
It is worth noting that the mas_for_each() loop now has a different
upper limit. This should not change the number of vmas visited for
reattaching to the main vma tree (mm_mt), as all vmas are reattached in
both scenarios.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Liam R. Howlett [Fri, 9 Aug 2024 18:19:19 +0000 (14:19 -0400)]
mm/vma: Correctly position vma_iterator in __split_vma()
The vma iterator may be left pointing to the newly created vma. This
happens when inserting the new vma at the end of the old vma
(!new_below).
The incorrect position in the vma iterator is not exposed currently
since the vma iterator is repositioned in the munmap path and is not
reused in any of the other paths.
This has limited impact in the current code, but is required for future
changes.
Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Pedro Falcato [Sat, 17 Aug 2024 00:18:34 +0000 (01:18 +0100)]
selftests/mm: add more mseal traversal tests
Add more mseal traversal tests across VMAs, where we could possibly
screw up sealing checks. These test more across-vma traversal for
mprotect, munmap and madvise. Particularly, we test for the case where a
regular VMA is followed by a sealed VMA.
Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
Pedro Falcato [Sat, 17 Aug 2024 00:18:32 +0000 (01:18 +0100)]
mseal: Replace can_modify_mm_madv with a vma variant
Replace can_modify_mm_madv() with a single vma variant, and associated
checks in madvise.
While we're at it, also invert the order of checks in:
if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma))
Checking if we can modify the vma itself (through vm_flags) is
certainly cheaper than is_ro_anon() due to arch_vma_access_permitted()
looking at e.g pkeys registers (with extra branches) in some
architectures.
This patch allows for partial madvise success when finding a sealed VMA,
which historically has been allowed in Linux.
Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
Pedro Falcato [Sat, 17 Aug 2024 00:18:31 +0000 (01:18 +0100)]
mm/mremap: Replace can_modify_mm with can_modify_vma
Delegate all can_modify checks to the proper places. Unmap checks are
done in do_unmap (et al). The source VMA check is done purposefully
before unmapping, to keep the original mseal semantics.
Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com>
Zhaoyang Huang [Thu, 11 May 2023 05:22:30 +0000 (13:22 +0800)]
mm: optimization on page allocation when CMA enabled
According to current CMA utilization policy, an alloc_pages(GFP_USER)
could 'steal' UNMOVABLE & RECLAIMABLE page blocks via the help of CMA(pass
zone_watermark_ok by counting CMA in but use U&R in rmqueue), which could
lead to following alloc_pages(GFP_KERNEL) fail. Solving this by
introducing second watermark checking for GFP_MOVABLE, which could have
the allocation use CMA when proper.
The root cause is that unpoison_memory() tries to check the PG_HWPoison
flags of an uninitialized page. So VM_BUG_ON_PAGE(PagePoisoned(page)) is
triggered. This can be reproduced by below steps:
1.Offline memory block:
echo offline > /sys/devices/system/memory/memory12/state
2.Get offlined memory pfn:
page-types -b n -rlN
3.Write pfn to unpoison-pfn
echo <pfn> > /sys/kernel/debug/hwpoison/unpoison-pfn
Link: https://lkml.kernel.org/r/20240712064249.3882707-1-linmiaohe@huawei.com Fixes: f165b378bbdf ("mm: uninitialized struct page poisoning sanity checking") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb_cgroup: introduce peak and rsvd.peak to v2
Introduce peak and rsvd.peak to v2 to show the historical maximum usage of
resources, as in some scenarios it is necessary to configure the value of
max/rsvd.max based on the peak usage of resources.
Since HugeTLB doesn't support page reclaim, enforcing the limit at page
fault time implies that, the application will get SIGBUS signal if it
tries to fault in HugeTLB pages beyond its limit. Therefore the
application needs to know exactly how many HugeTLB pages it uses before
hand, and the sysadmin needs to make sure that there are enough available
on the machine for all the users to avoid processes getting SIGBUS.
When running some open-source software, it may not be possible to know the
exact amount of hugetlb it consumes, so cannot correctly configure the max
value. If there is a peak metric, we can run the open-source software
first and then configure the max based on the peak value. In cgroup v1,
the hugetlb controller provides the max_usage_in_bytes and
rsvd.max_usage_in_bytes interface to display the historical maximum usage,
so introduce peak and rsvd.peak to v2 to address this issue.
Link: https://lkml.kernel.org/r/20240702125728.2743143-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gaoxu [Fri, 16 Aug 2024 07:48:01 +0000 (07:48 +0000)]
mm: add lazyfree folio to lru tail
Replace lruvec_add_folio with lruvec_add_folio_tail in the lru_lazyfree_fn:
1. The lazy-free folio is added to the LRU_INACTIVE_FILE list. If
it's moved to the LRU tail, it allows for faster release lazy-free
folio and reduces the impact on file refault.
2. When mglru is enabled, the lazy-free folio is reclaimabled and
should be added using lru_gen_add_folio(lruvec, folio, true) instead of
lru_gen_add_folio(lruvec, folio, false) for adding to gen.
With the change in place, workingset_refault_file is reduced by 33% in the
continuous startup testing of the applications in the Android system.
Link: https://lkml.kernel.org/r/f29f64e29c08427b95e3df30a5770056@honor.com Signed-off-by: gao xu <gaoxu2@hihonor.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Shaohua Li <shli@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 16 Aug 2024 09:04:34 +0000 (17:04 +0800)]
mm: migrate: add isolate_folio_to_list()
Add isolate_folio_to_list() helper to try to isolate HugeTLB, no-LRU
movable and LRU folios to a list, which will be reused by
do_migrate_range() from memory hotplug soon, also drop the
mf_isolate_folio() since we could directly use new helper in the
soft_offline_in_use_page().
Link: https://lkml.kernel.org/r/20240816090435.888946-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 16 Aug 2024 09:04:33 +0000 (17:04 +0800)]
mm: memory_hotplug: check hwpoisoned page firstly in do_migrate_range()
The commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages
to be offlined") don't handle the hugetlb pages, the endless loop still
occur if offline a hwpoison hugetlb, luckly, with the commit e591ef7d96d6
("mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with
hwpoisoned hugepage") section with hwpoisoned hugepage"), the
HPageMigratable of hugetlb page will be clear, and the hwpoison hugetlb
page will be skipped in scan_movable_pages(), so the endless loop issue is
fixed.
However if the HPageMigratable() check passed(without reference and lock),
the hugetlb page may be hwpoisoned, it won't cause issue since the
hwpoisoned page will be handled correctly in the next movable pages scan
loop, and it will be isolated in do_migrate_range() but fails to migrate.
In order to avoid the unnecessary isolation and unify all hwpoisoned page
handling, let's unconditionally check hwpoison firstly, and if it is a
hwpoisoned hugetlb page, try to unmap it as the catch all safety net like
normal page does.
Link: https://lkml.kernel.org/r/20240816090435.888946-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Fri, 16 Aug 2024 09:04:32 +0000 (17:04 +0800)]
mm: memory-failure: add unmap_posioned_folio()
Add unmap_posioned_folio() helper which will be reused by
do_migrate_range() from memory hotplug soon.
Link: https://lkml.kernel.org/r/20240816090435.888946-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Fri, 16 Aug 2024 10:32:46 +0000 (12:32 +0200)]
mm/rmap: use folio->_mapcount for small folios
We have some cases left whereby we operate on small folios and still refer
to page->_mapcount. Let's just use folio->_mapcount instead, which
currently still overlays page->_mapcount, so no change.
This change will make it easier to later spot any remaining users of
page->_mapcount that target tail pages.
Mike Yuan [Fri, 16 Aug 2024 14:44:07 +0000 (14:44 +0000)]
mm/memcontrol: respect zswap.writeback setting from parent cg too
Currently, the behavior of zswap.writeback wrt. the cgroup hierarchy
seems a bit odd. Unlike zswap.max, it doesn't honor the value from parent
cgroups. This surfaced when people tried to globally disable zswap
writeback, i.e. reserve physical swap space only for hibernation [1] -
disabling zswap.writeback only for the root cgroup results in subcgroups
with zswap.writeback=3D1 still performing writeback.
The inconsistency became more noticeable after I introduced the
MemoryZSwapWriteback=3D systemd unit setting [2] for controlling the knob.
The patch assumed that the kernel would enforce the value of parent
cgroups. It could probably be workarounded from systemd's side, by going
up the slice unit tree and inheriting the value. Yet I think it's more
sensible to make it behave consistently with zswap.max and friends.
Shakeel Butt [Wed, 14 Aug 2024 22:00:21 +0000 (15:00 -0700)]
memcg: initiate deprecation of pressure_level
The pressure_level in memcg v1 provides memory pressure notifications to
the user space. At the moment it provides notifications for three levels
of memory pressure i.e. low, medium and critical, which are defined based
on internal memory reclaim implementation details. More specifically the
ratio of scanned and reclaimed pages during a memory reclaim. However
this is not robust as there are workloads with mostly unreclaimable user
memory or kernel memory.
For v2, the users can use PSI for memory pressure status of the system or
the cgroup. Let's start the deprecation process for pressure_level and
add warnings to gather the info on how the current users are using this
interface and how they can be used to PSI.
Link: https://lkml.kernel.org/r/20240814220021.3208384-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Wed, 14 Aug 2024 22:00:20 +0000 (15:00 -0700)]
memcg: initiate deprecation of oom_control
The oom_control provides functionality to disable memcg oom-killer,
notifications on oom-kill and reading the stats regarding oom-kills. This
interface was mainly introduced to provide functionality for userspace
oom-killers. However it is not robust enough and only supports OOM
handling in the page fault path.
For v2, the users can use the combination of memory.events notifications,
memory.high and PSI to provide userspace OOM-killing functionality.
Actually LMKD in Android and OOMd in systemd and Meta infrastructure
already use PSI in combination with other stats to implement userspace
OOM-killing.
Let's start the deprecation process for v1 and gather the info on how the
current users are using this interface and work on providing a more robust
functionality in v2.
Link: https://lkml.kernel.org/r/20240814220021.3208384-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Wed, 14 Aug 2024 22:00:19 +0000 (15:00 -0700)]
memcg: initiate deprecation of v1 soft limit
Memcg v1 provides soft limit functionality for the best effort memory
sharing between multiple workloads on a system. It is usually triggered
through kswapd and at the moment does not reclaim kernel memory.
Memcg v2 provides more straightforward best effort (memory.low) and hard
protection (memory.min) functionalities. Let's initiate the deprecation
of soft limit from v1 and gather if v2 needs something more to move the
existing v1 users to v2 regarding soft limit.
Link: https://lkml.kernel.org/r/20240814220021.3208384-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Wed, 14 Aug 2024 22:00:18 +0000 (15:00 -0700)]
memcg: initiate deprecation of v1 tcp accounting
Patch series "memcg: initiate deprecation of v1 features", v2.
Start the deprecation process of the memcg v1 features which we discussed
during LSFMMBPF 2024 [1]. For now add the warnings to collect the
information on how the current users are using these features. Next we
will work on providing better alternatives in v2 (if needed) and fully
deprecate these features.
Memcg v1 provides opt-in TCP memory accounting feature. However it is
mostly unused due to its performance impact on the network traffic. In
v2, the TCP memory is accounted in the regular memory usage and is
transparent to the users but they can observe the TCP memory usage through
memcg stats.
Let's initiate the deprecation process of memcg v1's tcp accounting
functionality and add warnings to gather if there are any users and if
there are, collect how they are using it and plan to provide them better
alternative in v2.
Shakeel Butt [Thu, 15 Aug 2024 05:04:53 +0000 (22:04 -0700)]
memcg: make PGPGIN and PGPGOUT v1 only
Currently PGPGIN and PGPGOUT are used and exposed in the memcg v1 only
code. So, let's put them under CONFIG_MEMCG_V1.
Link: https://lkml.kernel.org/r/20240815050453.1298138-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 15 Aug 2024 05:04:52 +0000 (22:04 -0700)]
memcg: allocate v1 event percpu only on v1 deployment
Currently memcg->events_percpu gets allocated on v2 deployments. Let's
move the allocation to v1 only codebase. This is not needed in v2.
Link: https://lkml.kernel.org/r/20240815050453.1298138-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 15 Aug 2024 05:04:51 +0000 (22:04 -0700)]
memcg: make v1 only functions static
The functions memcg1_charge_statistics() and memcg1_check_events() are
never used outside of v1 source file. So, make them static.
Link: https://lkml.kernel.org/r/20240815050453.1298138-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 15 Aug 2024 05:04:50 +0000 (22:04 -0700)]
memcg: move v1 events and statistics code to v1 file
Currently the common code path for charge commit, swapout and batched
uncharge are executing v1 only code which is completely useless for the v2
deployments where CONFIG_MEMCG_V1 is disabled. In addition, it is mucking
with IRQs which might be slow on some architectures. Let's move all of
this code to v1 only code and remove them from v2 only deployments.
Link: https://lkml.kernel.org/r/20240815050453.1298138-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 15 Aug 2024 05:04:49 +0000 (22:04 -0700)]
memcg: move mem_cgroup_charge_statistics to v1 code
There are no callers of mem_cgroup_charge_statistics() in the v2 code
base, so move it to the v1 only code and rename it to
memcg1_charge_statistics().
Link: https://lkml.kernel.org/r/20240815050453.1298138-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 15 Aug 2024 05:04:48 +0000 (22:04 -0700)]
memcg: move mem_cgroup_event_ratelimit to v1 code
There are no callers of mem_cgroup_event_ratelimit() in the v2 code. Move
it to v1 only code and rename it to memcg1_event_ratelimit().
Link: https://lkml.kernel.org/r/20240815050453.1298138-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 15 Aug 2024 05:04:47 +0000 (22:04 -0700)]
memcg: move v1 only percpu stats in separate struct
Patch series "memcg: further decouple v1 code from v2".
Some of the v1 code is still in v2 code base due to v1 fields in the
struct memcg_vmstats_percpu. This field decouples those fileds from v2
struct and move all the related code into v1 only code base.
This patch (of 7):
At the moment struct memcg_vmstats_percpu contains two v1 only fields
which consumes memory even when CONFIG_MEMCG_V1 is not enabled. In
addition there are v1 only functions accessing them and are in the main
memcontrol source file and can not be moved to v1 only source file due to
these fields. Let's move these fields into their own struct. Later
patches will move the functions accessing them to v1 source file and only
allocate these fields when CONFIG_MEMCG_V1 is enabled.
Barry Song [Wed, 14 Aug 2024 22:34:16 +0000 (10:34 +1200)]
mm: use get_oder() and check size is is_power_of_2
Using get_order() is more robust according to Baolin. It is also better
to filter illegal size such as 3KB, 16KB according to David.
Link: https://lkml.kernel.org/r/20240814224635.43272-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Roberts [Wed, 14 Aug 2024 02:02:47 +0000 (14:02 +1200)]
mm: override mTHP "enabled" defaults at kernel cmdline
Add thp_anon= cmdline parameter to allow specifying the default enablement
of each supported anon THP size. The parameter accepts the following
format and can be provided multiple times to configure each size:
See Documentation/admin-guide/mm/transhuge.rst for more details.
Configuring the defaults at boot time is useful to allow early user
space to take advantage of mTHP before its been configured through
sysfs.
Link: https://lkml.kernel.org/r/20240814020247.67297-1-21cnbao@gmail.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 14 Aug 2024 03:54:51 +0000 (21:54 -0600)]
mm/hugetlb: use __GFP_COMP for gigantic folios
Use __GFP_COMP for gigantic folios to greatly reduce not only the amount
of code but also the allocation and free time.
LOC (approximately): +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before After
Alloc ~13s ~10s
Free ~15s <1s
The above magnitude generally holds for multiple x86 and arm64 CPU models.
Link: https://lkml.kernel.org/r/20240814035451.773331-4-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Frank van der Linden <fvdl@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 14 Aug 2024 03:54:50 +0000 (21:54 -0600)]
mm/cma: add cma_{alloc,free}_folio()
With alloc_contig_range() and free_contig_range() supporting large folios,
CMA can allocate and free large folios too, by cma_alloc_folio() and
cma_free_folio().
Link: https://lkml.kernel.org/r/20240814035451.773331-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 14 Aug 2024 03:54:49 +0000 (21:54 -0600)]
mm/contig_alloc: support __GFP_COMP
Patch series "mm/hugetlb: alloc/free gigantic folios", v2.
Use __GFP_COMP for gigantic folios can greatly reduce not only the amount
of code but also the allocation and free time.
Approximate LOC to mm/hugetlb.c: +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before After
Alloc ~13s ~10s
Free ~15s <1s
The above magnitude generally holds for multiple x86 and arm64 CPU
models.
Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
success the function returns a large folio prepared by prep_new_page(),
rather than a range of order-0 pages prepared by split_free_pages() (which
is renamed from split_map_pages()).
alloc_contig_range() can be used to allocate folios larger than
MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
free_one_page() needs to handle that by split_large_buddy().
Kaiyang Zhao [Wed, 14 Aug 2024 17:42:27 +0000 (17:42 +0000)]
mm,memcg: provide per-cgroup counters for NUMA balancing operations
The ability to observe the demotion and promotion decisions made by the
kernel on a per-cgroup basis is important for monitoring and tuning
containerized workloads on machines equipped with tiered memory.
Different containers in the system may experience drastically different
memory tiering actions that cannot be distinguished from the global
counters alone.
For example, a container running a workload that has a much hotter memory
accesses will likely see more promotions and fewer demotions, potentially
depriving a colocated container of top tier memory to such an extent that
its performance degrades unacceptably.
For another example, some containers may exhibit longer periods between
data reuse, causing much more numa_hint_faults than numa_pages_migrated.
In this case, tuning hot_threshold_ms may be appropriate, but the signal
can easily be lost if only global counters are available.
In the long term, we hope to introduce per-cgroup control of promotion and
demotion actions to implement memory placement policies in tiering.
This patch set adds seven counters to memory.stat in a cgroup:
numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd,
pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_*
and pgpromote_success are also available in memory.numa_stat.
count_memcg_events_mm() is added to count multiple event occurrences at
once, and get_mem_cgroup_from_folio() is added because we need to get a
reference to the memcg of a folio before it's migrated to track
numa_pages_migrated. The accounting of PGDEMOTE_* is moved to
shrink_inactive_list() before being changed to per-cgroup.
Link: https://lkml.kernel.org/r/20240814174227.30639-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:43 +0000 (12:19 -0400)]
maple_tree: remove unneeded mas_wr_walk() in mas_store_prealloc()
Users of mas_store_prealloc() enter this function with nodes already
preallocated. This means the store type must be already set. We can then
remove the call to mas_wr_store_type() and initialize the write state to
continue the partial walk that was done when determining the store type.
Link: https://lkml.kernel.org/r/20240814161944.55347-17-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:42 +0000 (12:19 -0400)]
maple_tree: remove repeated sanity checks from write helper functions
These sanity checks are now redundant as they are already checked in
mas_wr_store_type(). We can remove them from mas_wr_append() and
mas_wr_node_store().
Link: https://lkml.kernel.org/r/20240814161944.55347-16-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:41 +0000 (12:19 -0400)]
maple_tree: remove node allocations from various write helper functions
These write helper functions are all called from store paths which
preallocate enough nodes that will be needed for the write. There is no
more need to allocate within the functions themselves.
Link: https://lkml.kernel.org/r/20240814161944.55347-15-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:40 +0000 (12:19 -0400)]
maple_tree: have mas_store() allocate nodes if needed
Not all users of mas_store() enter with nodes already preallocated.
Check for the MA_STATE_PREALLOC flag to decide whether to preallocate nodes
within mas_store() rather than relying on future write helper functions
to perform the allocations. This allows the write helper functions to be
simplified as they do not have to do checks to make sure there are
enough allocated nodes to perform the write.
Link: https://lkml.kernel.org/r/20240814161944.55347-14-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:38 +0000 (12:19 -0400)]
maple_tree: simplify mas_commit_b_node()
The only callers of mas_commit_b_node() are those with store type of
wr_rebalance and wr_split_store. Use mas->store_type to dispatch to the
correct helper function. This allows the removal of mas_reuse_node() as
it is no longer used.
Link: https://lkml.kernel.org/r/20240814161944.55347-12-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:37 +0000 (12:19 -0400)]
maple_tree: convert mas_insert() to preallocate nodes
By setting the store type in mas_insert(), we no longer need to use
mas_wr_modify() to determine the correct store function to use. Instead,
set the store type and call mas_wr_store_entry(). Also, pass in the
requested gfp flags to mas_insert() so they can be passed to the call to
mas_wr_preallocate().
Link: https://lkml.kernel.org/r/20240814161944.55347-11-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:36 +0000 (12:19 -0400)]
maple_tree: use store type in mas_wr_store_entry()
When storing an entry, we can read the store type that was set from a
previous partial walk of the tree. Now that the type of store is known,
select the correct write helper function to use to complete the store.
Also noinline mas_wr_spanning_store() to limit stack frame usage in
mas_wr_store_entry() as it allocates a maple_big_node on the stack.
Link: https://lkml.kernel.org/r/20240814161944.55347-10-sidhartha.kumar@oracle.com Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:33 +0000 (12:19 -0400)]
maple_tree: preallocate nodes in mas_erase()
Use mas_wr_preallocate() in mas_erase() to preallocate enough nodes to
complete the erase. Add error handling by skipping the store if the
preallocation lead to some error besides no memory.
Link: https://lkml.kernel.org/r/20240814161944.55347-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:32 +0000 (12:19 -0400)]
maple_tree: remove mas_destroy() from mas_nomem()
Separate call to mas_destroy() from mas_nomem() so we can check for no
memory errors without destroying the current maple state in
mas_store_gfp(). We then add calls to mas_destroy() to callers of
mas_nomem().
Link: https://lkml.kernel.org/r/20240814161944.55347-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 14 Aug 2024 16:19:31 +0000 (12:19 -0400)]
maple_tree: introduce mas_wr_store_type()
Introduce mas_wr_store_type() which will set the correct store type based
on a walk of the tree. In mas_wr_node_store() the <= min_slots condition
is changed to < as if new_end is = to mt_min_slots then there is not
enough room.
mas_prealloc_calc() is also introduced to abstract the calculation used to
determine the number of nodes needed for a store operation.
In this change a call to mas_reset() is removed in the error case of
mas_prealloc(). This is only needed in the MA_STATE_REBALANCE case of
mas_destroy(). We can move the call to mas_reset() directly to
mas_destroy().
Also, add a test case to validate the order that we check the store type
in is correct. This test models a vma expanding and then shrinking which
is part of the boot process.
Link: https://lkml.kernel.org/r/20240814161944.55347-5-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series implements two work items[3]: "aligning mas_store_gfp() with
mas_preallocate()" and "enum for store type".
mas_store_gfp() is modified to preallocate nodes. This simplies many of
the write helper functions by allowing them to use mas_store_gfp() rather
than open coding node allocation and error handling.
In the current maple tree code, a walk down the tree is done in
mas_preallocate() to determine the number of nodes needed for this write.
After node allocation, mas_wr_store_entry() will perform another walk to
determine which write helper function to use to complete the write.
Rather than performing the second walk, we can store the type of write in
the maple write state during node allocation and read this field to
complete the write.
Patches 1-16 implement this store type feature.
Patch 17 is a cleanup patch to change functions that have unused return
types to be void.
Add a store_type enum that is stored in ma_state. This will be used to
keep track of partial walks of the tree so that subsequent walks can pick
up where a previous walk left off.
Muchun Song [Wed, 14 Aug 2024 09:34:15 +0000 (17:34 +0800)]
mm: kmem: add lockdep assertion to obj_cgroup_memcg
obj_cgroup_memcg() is supposed to safe to prevent the returned memory
cgroup from being freed only when the caller is holding the rcu read lock
or objcg_lock or cgroup_mutex. It is very easy to ignore thoes conditions
when users call some upper APIs which call obj_cgroup_memcg() internally
like mem_cgroup_from_slab_obj() (See the link below). So it is better to
add lockdep assertion to obj_cgroup_memcg() to find those issues ASAP.
Because there is no user of obj_cgroup_memcg() holding objcg_lock to make
the returned memory cgroup safe, do not add objcg_lock assertion (We
should export objcg_lock if we really want to do). Additionally, this is
some internal implementation detail of memcg and should not be accessible
outside memcg code.
Some users like __mem_cgroup_uncharge() do not care the lifetime of the
returned memory cgroup, which just want to know if the folio is charged to
a memory cgroup, therefore, they do not need to hold the needed locks. In
which case, introduce a new helper folio_memcg_charged() to do this.
Compare it to folio_memcg(), it could eliminate a memory access of
objcg->memcg for kmem, actually, a really small gain.
Andrey Konovalov [Tue, 13 Aug 2024 22:40:27 +0000 (00:40 +0200)]
kasan: simplify and clarify Makefile
When KASAN support was being added to the Linux kernel, GCC did not yet
support all of the KASAN-related compiler options. Thus, the KASAN
Makefile had to probe the compiler for supported options.
Nowadays, the Linux kernel GCC version requirement is 5.1+, and thus we
don't need the probing of the -fasan-shadow-offset parameter: it exists in
all 5.1+ GCCs.
Simplify the KASAN Makefile to drop CFLAGS_KASAN_MINIMAL.
Also add a few more comments and unify the indentation.
Link: https://lkml.kernel.org/r/20240813224027.84503-1-andrey.konovalov@linux.dev Signed-off-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Matthew Maurer <mmaurer@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Tue, 13 Aug 2024 21:53:58 +0000 (14:53 -0700)]
memcg: use ratelimited stats flush in the reclaim
The Meta prod is seeing large amount of stalls in memcg stats flush from
the memcg reclaim code path. At the moment, this specific callsite is
doing a synchronous memcg stats flush. The rstat flush is an expensive
and time consuming operation, so concurrent relaimers will busywait on the
lock potentially for a long time. Actually this issue is not unique to
Meta and has been observed by Cloudflare [1] as well. For the Cloudflare
case, the stalls were due to contention between kswapd threads running on
their 8 numa node machines which does not make sense as rstat flush is
global and flush from one kswapd thread should be sufficient for all.
Simply replace the synchronous flush with the ratelimited one.
One may raise a concern on potentially using 2 sec stale (at worst) stats
for heuristics like desirable inactive:active ratio and preferring
inactive file pages over anon pages but these specific heuristics do not
require very precise stats and also are ignored under severe memory
pressure.
More specifically for this code path, the stats are needed for two
specific heuristics:
1. Deactivate LRUs
2. Cache trim mode
The deactivate LRUs heuristic is to maintain a desirable inactive:active
ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE* and
the hierarchical LRU size. The WORKINGSET_ACTIVATE* is needed to check if
there is a refault since last snapshot and the LRU size are needed for the
desirable ratio between inactive and active LRUs. See the table below on
how the desirable ratio is calculated.
The desirable ratio only changes at the boundary of 1 GiB, 10 GiB, 100
GiB, 1 TiB and 10 TiB. There is no need for the precise and accurate LRU
size information to calculate this ratio. In addition, if deactivation is
skipped for some LRU, the kernel will force deactive on the severe memory
pressure situation.
For the cache trim mode, inactive file LRU size is read and the kernel
scales it down based on the reclaim iteration (file >> sc->priority) and
only checks if it is zero or not. Again precise information is not
needed.
This patch has been running on Meta fleet for several months and we have
not observed any issues. Please note that MGLRU is not impacted by this
issue at all as it avoids rstat flushing completely.
Baolin Wang [Mon, 12 Aug 2024 07:42:10 +0000 (15:42 +0800)]
mm: shmem: support large folio swap out
Shmem will support large folio allocation [1] [2] to get a better
performance, however, the memory reclaim still splits the precious large
folios when trying to swap out shmem, which may lead to the memory
fragmentation issue and can not take advantage of the large folio for
shmeme.
Moreover, the swap code already supports for swapping out large folio
without split, hence this patch set supports the large folio swap out for
shmem.
Note the i915_gem_shmem driver still need to be split when swapping, thus
add a new flag 'split_large_folio' for writeback_control to indicate
spliting the large folio.
[1] https://lore.kernel.org/all/cover.1717495894.git.baolin.wang@linux.alibaba.com/
[2] https://lore.kernel.org/all/20240515055719.32577-1-da.gomez@samsung.com/ Link: https://lkml.kernel.org/r/d80c21abd20e1b0f5ca66b330f074060fb2f082d.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:09 +0000 (15:42 +0800)]
mm: shmem: split large entry if the swapin folio is not large
Now the swap device can only swap-in order 0 folio, even though a large
folio is swapped out. This requires us to split the large entry
previously saved in the shmem pagecache to support the swap in of small
folios.
Link: https://lkml.kernel.org/r/4a0f12f27c54a62eb4d9ca1265fed3a62531a63e.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:08 +0000 (15:42 +0800)]
mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from shmem pagecache after adding into
swap cache.
Link: https://lkml.kernel.org/r/b371eadb27f42fc51261c51008fbb9a334985b4c.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:07 +0000 (15:42 +0800)]
mm: shmem: support large folio allocation for shmem_replace_folio()
To support large folio swapin for shmem in the following patches, add
large folio allocation for the new replacement folio in
shmem_replace_folio(). Moreover large folios occupy N consecutive entries
in the swap cache instead of using multi-index entries like the page
cache, therefore we should replace each consecutive entries in the swap
cache instead of using the shmem_replace_entry().
As well as updating statistics and folio reference count using the number
of pages in the folio.
Link: https://lkml.kernel.org/r/a41138ecc857ef13e7c5ffa0174321e9e2c9970a.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:06 +0000 (15:42 +0800)]
mm: shmem: use swap_free_nr() to free shmem swap entries
As a preparation for supporting shmem large folio swapout, use
swap_free_nr() to free some continuous swap entries of the shmem large
folio when the large folio was swapped in from the swap cache. In
addition, the index should also be round down to the number of pages when
adding the swapin folio into the pagecache.
Link: https://lkml.kernel.org/r/342207fa679fc88a447dac2e101ad79e6050fe79.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:05 +0000 (15:42 +0800)]
mm: filemap: use xa_get_order() to get the swap entry order
In the following patches, shmem will support the swap out of large folios,
which means the shmem mappings may contain large order swap entries, so
using xa_get_order() to get the folio order of the shmem swap entry to
update the '*start' correctly.
Link: https://lkml.kernel.org/r/6876d55145c1cc80e79df7884aa3a62e397b101d.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Daniel Gomez [Mon, 12 Aug 2024 07:42:04 +0000 (15:42 +0800)]
mm: shmem: return number of pages beeing freed in shmem_free_swap
Both shmem_free_swap callers expect the number of pages being freed. In
the large folios context, this needs to support larger values other than 0
(used as 1 page being freed) and -ENOENT (used as 0 pages being freed).
In preparation for large folios adoption, make shmem_free_swap routine
return the number of pages being freed. So, returning 0 in this context,
means 0 pages being freed.
While we are at it, changing to use free_swap_and_cache_nr() to free large
order swap entry by Baolin Wang.
Link: https://lkml.kernel.org/r/9623e863c83d749d5ab407f6fdf0a8e5a3bdf052.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:03 +0000 (15:42 +0800)]
mm: shmem: extend shmem_partial_swap_usage() to support large folio swap
To support shmem large folio swapout in the following patches, using
xa_get_order() to get the order of the swap entry to calculate the swap
usage of shmem.
Link: https://lkml.kernel.org/r/60b130b9fc3e422bb91293a172c2113c85e9233a.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:02 +0000 (15:42 +0800)]
mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
Patch series "support large folio swap-out and swap-in for shmem", v5.
Shmem will support large folio allocation [1] [2] to get a better
performance, however, the memory reclaim still splits the precious large
folios when trying to swap-out shmem, which may lead to the memory
fragmentation issue and can not take advantage of the large folio for
shmeme.
Moreover, the swap code already supports for swapping out large folio
without split, and large folio swap-in[3] series is queued into
mm-unstable branch. Hence this patch set also supports the large folio
swap-out and swap-in for shmem.
This patch (of 9):
To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem
swap entries.
While we are at it, using folio_nr_pages() to get the number of pages of
the folio as a preparation.
Link: https://lkml.kernel.org/r/cover.1723434324.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Ellerman [Mon, 12 Aug 2024 08:26:05 +0000 (18:26 +1000)]
powerpc/vdso: refactor error handling
Linus noticed that the error handling in __arch_setup_additional_pages()
fails to clear the mm VDSO pointer if _install_special_mapping() fails.
In practice there should be no actual bug, because if there's an error the
VDSO pointer is cleared later in arch_setup_additional_pages().
However it's no longer necessary to set the pointer before installing the
mapping. Commit c1bab64360e6 ("powerpc/vdso: Move to
_install_special_mapping() and remove arch_vma_name()") reworked the code
so that the VMA name comes from the vm_special_mapping.name, rather than
relying on arch_vma_name().
So rework the code to only set the VDSO pointer once the mappings have
been installed correctly, and remove the stale comment.
Link: https://lkml.kernel.org/r/20240812082605.743814-4-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jeff Xu <jeffxu@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Ellerman [Mon, 12 Aug 2024 08:26:04 +0000 (18:26 +1000)]
mm: remove arch_unmap()
Now that powerpc no longer uses arch_unmap() to handle VDSO unmapping,
there are no meaningful implementions left. Drop support for it entirely,
and update comments which refer to it.
Link: https://lkml.kernel.org/r/20240812082605.743814-3-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jeff Xu <jeffxu@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Ellerman [Mon, 12 Aug 2024 08:26:03 +0000 (18:26 +1000)]
powerpc/mm: handle VDSO unmapping via close() rather than arch_unmap()
Add a close() callback to the VDSO special mapping to handle unmapping of
the VDSO. That will make it possible to remove the arch_unmap() hook
entirely in a subsequent patch.
Link: https://lkml.kernel.org/r/20240812082605.743814-2-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jeff Xu <jeffxu@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michael Ellerman [Mon, 12 Aug 2024 08:26:02 +0000 (18:26 +1000)]
mm: add optional close() to struct vm_special_mapping
Add an optional close() callback to struct vm_special_mapping. It will be
used, by powerpc at least, to handle unmapping of the VDSO.
Although support for unmapping the VDSO was initially added for CRIU[1],
it is not desirable to guard that support behind
CONFIG_CHECKPOINT_RESTORE.
There are other known users of unmapping the VDSO which are not related to
CRIU, eg. Valgrind [2] and void-ship [3].
The powerpc arch_unmap() hook has been in place for ~9 years, with no
ifdef, so there may be other unknown users that have come to rely on
unmapping the VDSO. Even if the code was behind an ifdef, major distros
enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO
depends on that configuration option.
It's also undesirable to have such core mm behaviour behind a relatively
obscure CONFIG option.
Longer term the unmap behaviour should be standardised across
architectures, however that is complicated by the fact the VDSO pointer is
stored differently across architectures. There was a previous attempt to
unify that handling [4], which could be revived.
Tianchen Ding [Mon, 12 Aug 2024 09:55:17 +0000 (17:55 +0800)]
kfence: save freeing stack trace at calling time instead of freeing time
For kmem_cache with SLAB_TYPESAFE_BY_RCU, the freeing trace stack at
calling kmem_cache_free() is more useful. While the following stack is
meaningless and provides no help:
freed by task 46 on cpu 0 at 656.840729s:
rcu_do_batch+0x1ab/0x540
nocb_cb_wait+0x8f/0x260
rcu_nocb_cb_kthread+0x25/0x80
kthread+0xd2/0x100
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
Link: https://lkml.kernel.org/r/20240812095517.2357-1-dtcccc@linux.alibaba.com Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Mon, 12 Aug 2024 15:09:25 +0000 (15:09 +0000)]
maple_tree: fix comment typo with corresponding maple_status
In comment of function mas_start(), we list the return value of different
cases. According to the comment context, tell the maple_status here is
more consistent with others.
Let's correct it with ma_active in the case it's a tree.
Wei Yang [Mon, 12 Aug 2024 15:09:24 +0000 (15:09 +0000)]
maple_tree: fix comment typo of ma_root
In comment of mas_start(), we lists the return value for different cases.
In case of a single entry, we set mas->status to ma_root, while the
comment uses mas_root, which is not a maple_status.
Sidhartha Kumar [Mon, 12 Aug 2024 19:05:43 +0000 (15:05 -0400)]
maple_tree: add test to replicate low memory race conditions
Add new callback fields to the userspace implementation of struct
kmem_cache. This allows for executing callback functions in order to
further test low memory scenarios where node allocation is retried.
This callback can help test race conditions by calling a function when a
low memory event is tested.
CPU 1 CPU 2
--------- ---------
mas_set_range(a,b)
mas_erase()
-> range is expanded (a,c) because of null expansion
mas_nomem()
mas_unlock()
mas_store_range(b,c,0xC)
The node now looks like:
a<------->b<----------->c<--------->d
0xA 0xC 0xB
mas_lock()
mas_erase() <------ range of erase is still (a,c)
The node is now NULL from (a,c) but the write from CPU 2 should have been
retained and range (b,c) should still have 0xC as its value. We can fix
this by re-intializing to the original index and last. This does not need
a cc: Stable as there are no users of the maple tree which use internal
locking and this condition is only possible with internal locking.
Yu Zhao [Mon, 12 Aug 2024 22:48:23 +0000 (16:48 -0600)]
mm/hugetlb_vmemmap: batch HVO work when demoting
Batch the HVO work, including de-HVO of the source and HVO of the
destination hugeTLB folios, to speed up demotion.
After commit bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative
PFN walkers"), each request of HVO or de-HVO, batched or not, invokes
synchronize_rcu() once. For example, when not batched, demoting one 1GB
hugeTLB folio to 512 2MB hugeTLB folios invokes synchronize_rcu() 513
times (1 de-HVO plus 512 HVO requests), whereas when batched, only twice
(1 de-HVO plus 1 HVO request). And the performance difference between the
two cases is significant, e.g.,
echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
time echo 100 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote
Before this patch:
real 8m58.158s
user 0m0.009s
sys 0m5.900s
After this patch:
real 0m0.900s
user 0m0.000s
sys 0m0.851s
Note that this patch changes the behavior of the `demote` interface when
de-HVO fails. Before, the interface aborts immediately upon failure; now,
it tries to finish an entire batch, meaning it can make extra progress if
the rest of the batch contains folios that do not need to de-HVO.
Link: https://lkml.kernel.org/r/20240812224823.3914837-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
yangge [Tue, 13 Aug 2024 09:52:23 +0000 (17:52 +0800)]
mm/swap: take folio refcount after testing the LRU flag
Whoever passes a folio to __folio_batch_add_and_move() must hold a
reference, otherwise something else would already be messed up. If the
folio is referenced, it will not be freed elsewhere, so we can safely
clear the folio's lru flag. As discussed with David in [1], we should
take the reference after testing the LRU flag, not before.
Takaya Saeki [Tue, 13 Aug 2024 10:03:12 +0000 (10:03 +0000)]
filemap: add trace events for get_pages, map_pages, and fault
To allow precise tracking of page caches accessed, add new tracepoints
that trigger when a process actually accesses them.
The ureadahead program used by ChromeOS traces the disk access of programs
as they start up at boot up. It uses mincore(2) or the
'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores
this information in a "pack" file and on subsequent boots, it will read
the pack file and call readahead(2) on the information so that disk
storage can be loaded into RAM before the applications actually need it.
A problem we see is that due to the kernel's readahead algorithm that can
aggressively pull in more data than needed (to try and accomplish the same
goal) and this data is also recorded. The end result is that the pack
file contains a lot of pages on disk that are never actually used.
Calling readahead(2) on these unused pages can slow down the system boot
up times.
To solve this, add 3 new trace events, get_pages, map_pages, and fault.
These will be used to trace the pages are not only pulled in from disk,
but are actually used by the application. Only those pages will be stored
in the pack file, and this helps out the performance of boot up.
With the combination of these 3 new trace events and
mm_filemap_add_to_page_cache, we observed a reduction in the pack file by
7.3% - 20% on ChromeOS varying by device.
Peter Xu [Mon, 12 Aug 2024 18:12:25 +0000 (14:12 -0400)]
mm/mprotect: fix dax pud handlings
This is only relevant to the two archs that support PUD dax, aka, x86_64
and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
count in this case.
DAX have had PUD mappings for years, but change protection path never
worked. When the path is triggered in any form (a simple test program
would be: call mprotect() on a 1G dev_dax mapping), the kernel will report
"bad pud". This patch should fix that.
The new change_huge_pud() tries to keep everything simple. For example,
it doesn't optimize write bit as that will need even more PUD helpers.
It's not too bad anyway to have one more write fault in the worst case
once for 1G range; may be a bigger thing for each PAGE_SIZE, though.
Neither does it support userfault-wp bits, as there isn't such PUD
mappings that is supported; file mappings always need a split there.
The same to TLB shootdown: the pmd path (which was for x86 only) has the
trick of using _ad() version of pmdp_invalidate*() which can avoid one
redundant TLB, but let's also leave that for later. Again, the larger the
mapping, the smaller of such effect.
There's some difference on handling "retry" for change_huge_pud() (where
it can return 0): it isn't like change_huge_pmd(), as the pmd version is
safe with all conditions handled in change_pte_range() later, thanks to
Hugh's new pte_offset_map_lock(). In short, change_pte_range() is simply
smarter. For that, change_pud_range() will need proper retry if it races
with something else when a huge PUD changed from under us.
The last thing to mention is currently the PUD path ignores the huge pte
numa counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not
applicable to NUMA, but also that it's ambiguous on its own to decide how
to account pud in this case. In one earlier version of this patchset I
proposed to remove the counter as it doesn't even look right to do the
accounting as of now [1], but then a further discussion suggests we can
leave that for later, as that doesn't block this series if we choose to
ignore that counter. That's what this patch does, by ignoring it.
When at it, touch up the comment in pgtable_split_needed() to make it
generic to either pmd or pud file THPs.
Peter Xu [Mon, 12 Aug 2024 18:12:24 +0000 (14:12 -0400)]
mm/x86: add missing pud helpers
Some new helpers will be needed for pud entry updates soon. Introduce
these helpers by referencing the pmd ones. Namely:
- pudp_invalidate(): this helper invalidates a huge pud before a
split happens, so that the invalidated pud entry will make sure no
race will happen (either with software, like a concurrent zap, or
hardware, like a/d bit lost).
- pud_modify(): this helper applies a new pgprot to an existing huge
pud mapping.
For more information on why we need these two helpers, please refer to the
corresponding pmd helpers in the mprotect() code path.
When at it, simplify the pud_modify()/pmd_modify() comments on shadow
stack pgtable entries to reference pte_modify() to avoid duplicating the
whole paragraph three times.
Link: https://lkml.kernel.org/r/20240812181225.1360970-7-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 12 Aug 2024 18:12:23 +0000 (14:12 -0400)]
mm/x86: implement arch_check_zapped_pud()
Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD
zaps. It has the same logic as the PMD helper.
One thing to mention is, it might be a good idea to use page_table_check
in the future for trapping wrong setups of shadow stack pgtable entries
[1]. That is left for the future as a separate effort.
Peter Xu [Mon, 12 Aug 2024 18:12:22 +0000 (14:12 -0400)]
mm/x86: make pud_leaf() only care about PSE bit
When working on mprotect() on 1G dax entries, I hit an zap bad pud error
when zapping a huge pud that is with PROT_NONE permission.
Here the problem is x86's pud_leaf() requires both PRESENT and PSE bits
set to report a pud entry as a leaf, but that doesn't look right, as it's
not following the pXd_leaf() definition that we stick with so far, where
PROT_NONE entries should be reported as leaves.
To fix it, change x86's pud_leaf() implementation to only check against
PSE bit to report a leaf, irrelevant of whether PRESENT bit is set.
Link: https://lkml.kernel.org/r/20240812181225.1360970-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 12 Aug 2024 18:12:21 +0000 (14:12 -0400)]
mm/powerpc: add missing pud helpers
Some new helpers will be needed for pud entry updates soon. Introduce
these helpers by referencing the pmd ones. Namely:
- pudp_invalidate(): this helper invalidates a huge pud before a split
happens, so that the invalidated pud entry will make sure no race will
happen (either with software, like a concurrent zap, or hardware, like
a/d bit lost).
- pud_modify(): this helper applies a new pgprot to an existing huge pud
mapping.
For more information on why we need these two helpers, please refer to the
corresponding pmd helpers in the mprotect() code path.
Link: https://lkml.kernel.org/r/20240812181225.1360970-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 12 Aug 2024 18:12:20 +0000 (14:12 -0400)]
mm/mprotect: push mmu notifier to PUDs
mprotect() does mmu notifiers in PMD levels. It's there since 2014 of
commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to
change_pmd_range").
At that time, the issue was that NUMA balancing can be applied on a huge
range of VM memory, even if nothing was populated. The notification can
be avoided in this case if no valid pmd detected, which includes either
THP or a PTE pgtable page.
Now to pave way for PUD handling, this isn't enough. We need to generate
mmu notifications even on PUD entries properly. mprotect() is currently
broken on PUD (e.g., one can easily trigger kernel error with dax 1G
mappings already), this is the start to fix it.
To fix that, this patch proposes to push such notifications to the PUD
layers.
There is risk on regressing the problem Rik wanted to resolve before, but I
think it shouldn't really happen, and I still chose this solution because
of a few reasons:
1) Consider a large VM that should definitely contain more than GBs of
memory, it's highly likely that PUDs are also none. In this case there
will have no regression.
2) KVM has evolved a lot over the years to get rid of rmap walks, which
might be the major cause of the previous soft-lockup. At least TDP MMU
already got rid of rmap as long as not nested (which should be the major
use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM
pgtable (e.g. EPT on x86), the invalidation of a full empty region in
most cases could be pretty fast now, comparing to 2014.
3) KVM has explicit code paths now to even give way for mmu notifiers
just like this one, e.g. in commit d02c357e5bfa ("KVM: x86/mmu: Retry
fault before acquiring mmu_lock if mapping is changing"). It'll also
avoid contentions that may also contribute to a soft-lockup.
4) Stick with PMD layer simply don't work when PUD is there... We need
one way or another to fix PUD mappings on mprotect().
Pushing it to PUD should be the safest approach as of now, e.g. there's yet
no sign of huge P4D coming on any known archs.
Link: https://lkml.kernel.org/r/20240812181225.1360970-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Mon, 12 Aug 2024 18:12:19 +0000 (14:12 -0400)]
mm/dax: dump start address in fault handler
Patch series "mm/mprotect: Fix dax puds", v5.
Dax supports pud pages for a while, but mprotect on puds was missing since
the start. This series tries to fix that by providing pud handling in
mprotect(). The goal is to add more types of pud mappings like hugetlb or
pfnmaps. This series paves way for it by fixing known pud entries.
Considering nobody reported this until when I looked at those other types
of pud mappings, I am thinking maybe it doesn't need to be a fix for
stable and this may not need to be backported. I would guess whoever
cares about mprotect() won't care 1G dax puds yet, vice versa. I hope
fixing that in new kernels would be fine, but I'm open to suggestions.
There're a few small things changed to teach mprotect work on PUDs. E.g.
it will need to start with dropping NUMA_HUGE_PTE_UPDATES which may stop
making sense when there can be more than one type of huge pte. OTOH,
we'll also need to push the mmu notifiers from pmd to pud layers, which
might need some attention but so far I think it's safe. For such details,
please refer to each patch's commit message.
The mprotect() pud process should be straightforward, as I kept it as
simple as possible. There's no NUMA handled as dax simply doesn't support
that. There's also no userfault involvements as file memory (even if work
with userfault-wp async mode) will need to split a pud, so pud entry
doesn't need to yet know userfault's existance (but hugetlb entries will;
that's also for later).
This patch (of 7):
Currently the dax fault handler dumps the vma range when dynamic debugging
enabled. That's mostly not useful. Dump the (aligned) address instead
with the order info.
Link: https://lkml.kernel.org/r/20240812181225.1360970-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240812181225.1360970-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuanchu Xie [Tue, 13 Aug 2024 16:37:59 +0000 (09:37 -0700)]
mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
When non-leaf pmd accessed bits are available, MGLRU page table walks can
clear the non-leaf pmd accessed bit and ignore the accessed bit on the pte
if it's on a different node, skipping a generation update as well. If
another scan occurs on the same node as said skipped pte.
The non-leaf pmd accessed bit might remain cleared and the pte accessed
bits won't be checked. While this is sufficient for reclaim-driven aging,
where the goal is to select a reasonably cold page, the access can be
missed when aging proactively for workingset estimation of a node/memcg.
In more detail, get_pfn_folio returns NULL if the folio's nid != node
under scanning, so the page table walk skips processing of said pte. Now
the pmd_young flag on this pmd is cleared, and if none of the pte's are
accessed before another scan occurs on the folio's node, the pmd_young
check fails and the pte accessed bit is skipped.
Since force_scan disables various other optimizations, we check force_scan
to ignore the non-leaf pmd accessed bit.
Miao Wang [Tue, 13 Aug 2024 17:12:13 +0000 (01:12 +0800)]
mm: vmalloc: add optimization hint on page existence check
In commit 21e516b913c1 ("mm: vmalloc: dump page owner info if page is
already mapped"), a BUG_ON macro was changed into an if statement, where
the compiler optimization hint introduced in the BUG_ON macro was removed
along with this change. This patch adds back the hint.
Link: https://lkml.kernel.org/r/20240814-fix_vmap_unlikely-v1-1-cd7954775f12@gmail.com Fixes: 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped") Signed-off-by: Miao Wang <shankerwangmiao@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hariom Panthi <hariom1.p@samsung.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmstat: defer the refresh_zone_stat_thresholds after all CPUs bringup
refresh_zone_stat_thresholds function has two loops which is expensive for
higher number of CPUs and NUMA nodes.
Below is the rough estimation of total iterations done by these loops
based on number of NUMA and CPUs.
Total number of iterations: nCPU * 2 * Numa * mCPU
Where:
nCPU = total number of CPUs
Numa = total number of NUMA nodes
mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
For the system under test with 16 NUMA nodes and 1024 CPUs, this results
in a substantial increase in the number of loop iterations during boot-up
when NUMA is enabled:
No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds
takes around 224 ms total for all the CPUs in the system under test.
16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds
takes around 4.5 seconds total for all the CPUs in the system under test.
Calling this for each CPU is expensive when there are large number of CPUs
along with multiple NUMAs. Fix this by deferring
refresh_zone_stat_thresholds to be called later at once when all the
secondary CPUs are up. Also, register the DYN hooks to keep the existing
hotplug functionality intact.
Without this patch, refresh_zone_stat_threshold was being called 1024
times. After applying the patch, it is called only once, which is same
as the last iteration of the earlier 1024 calls. Further testing with
this patch, I observed a 4.5-second improvement in the overall boot
timing due to this fix, which is same as the total time taken by
refresh_zone_stat_thresholds without thie patch, leading me to
reasonably conclude that refresh_zone_stat_threshold now takes a
negligible amount of time (likely just a few milliseconds).
Page isolation machinery doesn't know anything about unaccepted memory and
considers it non-free. It leads to alloc_contig_pages() failure.
Treat unaccepted memory as free and accept memory on pageblock isolation.
Once memory is accepted it becomes PageBuddy() and page isolation knows
how to deal with them.
Link: https://lkml.kernel.org/r/20240809114854.3745464-8-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>