Matthew Wilcox [Fri, 4 Oct 2024 17:43:52 +0000 (18:43 +0100)]
mm-add-pageanonnotksm-fix
fix assertions
Link: https://lkml.kernel.org/r/ZwApWPER7caIA_N3@casper.infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: Alex Shi <alexs@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 2 Oct 2024 15:25:30 +0000 (16:25 +0100)]
mm: add PageAnonNotKsm()
Check that this anonymous page is really anonymous, not anonymous-or-KSM.
This optimises the debug check, but its real purpose is to remove the last
two users of PageKsm().
Gaosheng Cui [Thu, 24 Oct 2024 03:23:00 +0000 (11:23 +0800)]
mm/ksm: add missing IS_ERR_OR_NULL check for stable_tree_search()
The stable_tree_search() maybe return -EBUSY if the stable node's page is
being migrated or nullptr, we need to check kfolio with IS_ERR_OR_NULL()
before dereference it.
To mitigate this, add IS_ERR_OR_NULL check for stable_tree_search().
Link: https://lkml.kernel.org/r/20241024032300.2501949-1-cuigaosheng1@huawei.com Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Cc: Alex Shi <alexs@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 2 Oct 2024 15:25:28 +0000 (16:25 +0100)]
ksm: convert cmp_and_merge_page() to use a folio
By making try_to_merge_two_pages() and stable_tree_search() return a
folio, we can replace kpage with kfolio. This replaces 7 calls to
compound_head() with one.
Link: https://lkml.kernel.org/r/20241002152533.1350629-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 2 Oct 2024 15:25:27 +0000 (16:25 +0100)]
ksm: use a folio in try_to_merge_one_page()
Patch series "Remove PageKsm()".
The KSM flag is almost always tested on the folio rather than on the page.
This series removes the final users of PageKsm() and makes the flag only
This patch (of 5):
It is safe to use a folio here because all callers took a refcount on this
page. The one wrinkle is that we have to recalculate the value of folio
after splitting the page, since it has probably changed. Replaces nine
calls to compound_head() with one.
Kanchana P Sridhar [Wed, 2 Oct 2024 17:33:29 +0000 (10:33 -0700)]
mm: zswap: zswap_store_page() will initialize entry after adding to xarray.
This incorporates Yosry's suggestions in [1] for further simplifying
zswap_store_page(). If the page is successfully compressed and added to
the xarray, we get the pool/objcg refs, and initialize all the entry's
members. Only after this, we add it to the zswap LRU.
In the time between the entry's addition to the xarray and it's member
initialization, we are protected against concurrent stores/loads/swapoff
through the folio lock, and are protected against writeback because the
entry is not on the LRU yet.
This way, we don't have to drop the pool/objcg refs, now that the entry
initialization is centralized to the successful page store code path.
zswap_compress() is modified to take a zswap_pool parameter in keeping
with this simplification (as against obtaining this from entry->pool).
Kanchana P Sridhar [Tue, 1 Oct 2024 05:32:22 +0000 (22:32 -0700)]
mm: swap: Count successful large folio zswap stores in hugepage zswpout stats
Added a new MTHP_STAT_ZSWPOUT entry to the sysfs transparent_hugepage
stats so that successful large folio zswap stores can be accounted under
the per-order sysfs "zswpout" stats:
Also, added documentation for the newly added sysfs per-order hugepage
"zswpout" stats. The documentation clarifies that only non-zswap swapouts
will be accounted in the existing "swpout" stats.
Link: https://lkml.kernel.org/r/20241001053222.6944-8-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kanchana P Sridhar [Tue, 1 Oct 2024 05:32:21 +0000 (22:32 -0700)]
mm: zswap: support large folios in zswap_store()
This series enables zswap_store() to accept and store large folios. The
most significant contribution in this series is from the earlier RFC
submitted by Ryan Roberts [1]. Ryan's original RFC has been migrated to
mm-unstable as of 9-30-2024 in patch 6 of this series, and adapted based
on code review comments received for the current patch-series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
The first few patches do the prep work for supporting large folios in
zswap_store. Patch 6 provides the main functionality to swap-out large
folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout"
counters that get incremented upon successful zswap_store of large folios,
and also updates the documentation for this:
This series is a pre-requisite for zswap compress batching of large folio
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent patch-series, with performance improvement data.
Thanks to Ying Huang for pre-posting review feedback and suggestions!
Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama, Ying and
Matthew for their helpful feedback, code/data reviews and suggestions!
I would like to thank Ryan Roberts for his original RFC [1].
System setup for testing:
=========================
Testing of this series was done with mm-unstable as of 9-27-2024, commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000 (without this patch-series) and
mm-unstable 9-30-2024 commit c121617e3606be6575cdacfdb63cc8d67b46a568
(with this patch-series). Data was gathered on an Intel Sapphire Rapids
server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and 525G SSD disk partition swap. Core frequency was fixed at
2500MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high was
fixed at 150G. The is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and
sleeping for 10 sec before exiting:
In the experiments where "deflate-iaa" is used as the zswap compressor,
IAA "compression verification" is enabled by default (cat
/sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA compression
will be decompressed internally by the "iaa_crypto" driver, the crc-s
returned by the hardware will be compared and errors reported in case of
mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors, and the experimental data listed
below is with verify_compress set to "1".
Metrics reporting methodology:
==============================
Total and average throughput are derived from the individual 30 processes'
throughputs reported by usemem. elapsed/sys times are measured with perf.
All percentage changes are "new" vs. "old"; hence a positive value
denotes an increase in the metric, whether it is throughput or latency,
and a negative value denotes a reduction in the metric. Positive
throughput change percentages and negative latency change percentages
denote improvements.
The vm stats and sysfs hugepages stats included with the performance data
provide details on the swapout activity to zswap/swap device.
Testing labels used in data summaries:
======================================
The data refers to these test configurations and the before/after
comparisons that they do:
In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios to be split
into 4K folios that get processed by zswap.
before-case2:
-------------
mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios)
In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large
folios, which will then be stored by the SSD swap device.
after:
------
v10 of this patch-series, CONFIG_THP_SWAP=Y
The "after" is CONFIG_THP_SWAP=Y and v10 of this patch-series, that results
in 64K/2M folios to not be split, and to be processed by zswap_store.
Regression Testing:
===================
I ran vm-scalability usemem without large folios, i.e., only 4K folios with
mm-unstable and this patch-series. The main goal was to make sure that
there is no functional or performance regression wrt the earlier zswap
behavior for 4K folios, now that 4K folios will be processed by the new
zswap_store() code.
The data indicates there is no significant regression.
And finally, this is a comparison of deflate-iaa vs. zstd with v10 of this
patch-series:
---------------------------------------------
zswap_store large folios v10
Impr w/ deflate-iaa vs. zstd
64K folios 2M folios
---------------------------------------------
Throughput (KB/s) 17% 9%
elapsed time (sec) -20% -19%
sys time (sec) -27% -23%
---------------------------------------------
Conclusions based on the performance results:
=============================================
v10 wrt before-case1:
---------------------
We see significant improvements in throughput, elapsed and sys time for
zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after
(THP_SWAP=Y) with zswap_store large folios.
v10 wrt before-case2:
---------------------
We see even more significant improvements in throughput and elapsed time
for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD)
vs. after (large-folio-zswap). The sys time increases with
large-folio-zswap as expected, due to the CPU compression time
vs. asynchronous disk write times, as pointed out by Ying and Yosry.
In before-case2, when zswap does not store large folios, only allocations
and cgroup charging due to 4K folio zswap stores count towards the cgroup
memory limit. However, in the after scenario, with the introduction of
zswap_store() of large folios, there is an added component of the zswap
compressed pool usage from large folio stores from potentially all 30
processes, that gets counted towards the memory limit. As a result, we see
higher swapout activity in the "after" data.
Summary:
========
The v10 data presented above shows that zswap_store of large folios
demonstrates good throughput/performance improvements compared to
conventional SSD swap of large folios with a sufficiently large 525G SSD
swap device. Hence, it seems reasonable for zswap_store to support large
folios, so that further performance improvements can be implemented.
In the experimental setup used in this patchset, we have enabled IAA
compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. We see good
throughput/latency improvements with deflate-iaa vs. zstd with zswap_store
of large folios.
Some of the ideas for further reducing latency that have shown promise in
our experiments, are:
1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.
The tests run for this patchset are using only 1 IAA device per core, that
avails of 2 compress engines on the device. In our experiments with IAA
batching, we distribute compress jobs from all cores to the 8 compress
engines available per socket. We further compress the pages in each folio
in parallel in the accelerator. As a result, we improve compress latency
and reclaim throughput.
In decompress batching, we use swapin_readahead to generate a prefetch
batch of 4K folios that we decompress in parallel in IAA.
------------------------------------------------------------------------------
IAA compress/decompress batching
Further improvements wrt v10 zswap_store Sequential
subpage store using "deflate-iaa":
With zswap IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings in server
scalability experiments in highly contended system scenarios under
significant memory pressure; as compared to software compressors. We hope
to submit this work in subsequent patch series. The current patch-series
is a prequisite for these future submissions.
This patch (of 7):
zswap_store() will store large folios by compressing them page by page.
This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.
zswap_store() calls the newly added zswap_store_page() function for each
page in the folio. zswap_store_page() handles compressing and storing
each page.
We check the global and per-cgroup limits once at the beginning of
zswap_store(), and only check that the limit is not reached yet. This is
racy and inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page(). A new
function zswap_pool_get() is added to facilitate this.
If these one-time checks pass, we compress the pages of the folio, while
maintaining a running count of compressed bytes for all the folio's pages.
If all pages are successfully compressed and stored, we do the cgroup
zswap charging with the total compressed bytes, and batch update the
zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
before returning from zswap_store().
If an error is encountered during the store of any page in the folio, all
pages in that folio currently stored in zswap will be invalidated. Thus,
a folio is either entirely stored in zswap, or entirely not stored in
zswap.
The most important value provided by this patch is it enables swapping out
large folios to zswap without splitting them. Furthermore, it batches
some operations while doing so (cgroup charging, stats updates).
This patch also forms the basis for building compress batching of pages in
a large folio in zswap_store() by compressing up to say, 8 pages of the
folio in parallel in hardware using the Intel In-Memory Analytics
Accelerator (Intel IAA).
This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:
"[RFC,v1] mm: zswap: Store large folios without splitting"
Kanchana P Sridhar [Tue, 1 Oct 2024 05:32:20 +0000 (22:32 -0700)]
mm: zswap: modify zswap_stored_pages to be atomic_long_t
For zswap_store() to support large folios, we need to be able to do a
batch update of zswap_stored_pages upon successful store of all pages in
the folio. For this, we need to add folio_nr_pages(), which returns a
long, to zswap_stored_pages.
Link: https://lkml.kernel.org/r/20241001053222.6944-6-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kanchana P Sridhar [Tue, 1 Oct 2024 05:32:19 +0000 (22:32 -0700)]
mm: change count_objcg_event() to count_objcg_events() for batch event updates
With the introduction of zswap_store() swapping out large folios, we need
to efficiently update the objcg's memcg events once per successfully
stored folio. For instance, the 'ZSWPOUT' event needs to be incremented
by folio_nr_pages().
To facilitate this, the existing count_objcg_event() API is modified to be
count_objcg_events() that additionally accepts a count parameter. The
only existing calls to count_objcg_event() are in zswap.c - these have
been modified to call count_objcg_events() with a count of 1.
Link: https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kanchana P Sridhar [Tue, 1 Oct 2024 05:32:18 +0000 (22:32 -0700)]
mm: zswap: rename zswap_pool_get() to zswap_pool_tryget()
Modify the name of the existing zswap_pool_get() to zswap_pool_tryget() to
be representative of the call it makes to percpu_ref_tryget(). A
subsequent patch will introduce a new zswap_pool_get() that calls
percpu_ref_get().
The intent behind this change is for higher level zswap API such as
zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
refcount is "0" (which means it could be getting destroyed) and to handle
this as an error condition. zswap_store() would proceed only if
zswap_pool_tryget() returns success, and any additional pool refcounts
that need to be obtained for compressing sub-pages in a large folio could
simply call zswap_pool_get().
Link: https://lkml.kernel.org/r/20241001053222.6944-4-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kanchana P Sridhar [Tue, 1 Oct 2024 05:32:17 +0000 (22:32 -0700)]
mm: zswap: modify zswap_compress() to accept a page instead of a folio
For zswap_store() to be able to store a large folio by compressing it one
page at a time, zswap_compress() needs to accept a page as input. This
will allow us to iterate through each page in the folio in zswap_store(),
compress it and store it in the zpool.
Link: https://lkml.kernel.org/r/20241001053222.6944-3-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Cc: "Zou, Nanhai" <nanhai.zou@intel.com> Cc: Barry Song <21cnbao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pintu Kumar [Tue, 1 Oct 2024 17:53:58 +0000 (23:23 +0530)]
zsmalloc: replace kmap_atomic with kmap_local_page
The use of kmap_atomic/kunmap_atomic is deprecated. Replace it will
kmap_local_page/kunmap_local all over the place. Also fix SPDX missing
license header.
WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead
+ vaddr = kmap_atomic(page);
Link: https://lkml.kernel.org/r/20241001175358.12970-1-quic_pintu@quicinc.com Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com> Cc: Joe Perches <joe@perches.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pintu Agarwal <pintu.ping@gmail.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache tree
for each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.
To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple
fadvise(DONTNEED) operation.
# time xfs_io -c 'fadvise -d 0 ${file_size}' file
time (sec)
Without 5.12 +- 0.061
With-patch 4.19 +- 0.086 (18.16% decrease)
Link: https://lkml.kernel.org/r/20240925224716.2904498-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chris Mason <clm@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: optimize shadow entries removal", v2.
Some of our production workloads which processes a large amount of data
spends considerable amount of CPUs on truncation and invalidation of large
sized files (100s of GiBs of size). Tracing the operations showed that
most of the time is in shadow entries removal. This patch series
optimizes the truncation and invalidation operations.
This patch (of 2):
The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
batch, it traverses the page cache tree and collects the entries (folio
and shadow entries) in the struct folio_batch. For the shadow entries
present in the folio_batch, it has to traverse the page cache tree for
each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.
On large machines in our production which run workloads manipulating large
amount of data, we have observed that a large amount of CPUs are spent on
truncation of very large files (100s of GiBs file sizes). More
specifically most of time was spent on shadow entries cleanup, so
optimizing the shadow entries cleanup, even a little bit, has good impact.
To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple truncation
operation.
# time truncate -s 0 file
time (sec)
Without 5.164 +- 0.059
With-patch 4.21 +- 0.066 (18.47% decrease)
mm: migrate LRU_REFS_MASK bits in folio_migrate_flags
Bits of LRU_REFS_MASK are not inherited during migration which lead to new
folio start from tier0 when MGLRU enabled. Try to bring as much bits of
folio->flags as possible since compaction and alloc_contig_range which
introduce migration do happen at times.
Link: https://lkml.kernel.org/r/20240926050647.5653-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to using pte_offset_map_rw_nolock(). At this time, the
pte_same() check is not performed after the ptl held, so we should get
pmdval and do pmd_same() check to ensure the stability of pmd entry.
Link: https://lkml.kernel.org/r/7e9c194a5efacc9609cfd31abb9c7df88b53b530.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will use pte_same() to detect the change of the pte entry, there is no
need to get pmdval, so just pass a dummy variable to it.
Link: https://lkml.kernel.org/r/1530e8fdbfc72eacf3b095babe139ce3d715600a.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
In the caller of map_pte(), we may modify the pvmw->pte after acquiring
the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At this
time, the pte_same() check is not performed after the pvmw->ptl held, so
we should get pmdval and do pmd_same() check to ensure the stability of
pvmw->pmd.
Link: https://lkml.kernel.org/r/2620a48f34c9f19864ab0169cdbf253d31a8fcaa.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_rw_nolock(). Now new_pte is none, so
hpage_collapse_scan_file() path can not find this by traversing
file->f_mapping, so there is no concurrency with retract_page_tables().
In addition, we already hold the exclusive mmap_lock, so this new_pte page
is stable, so there is no need to get pmdval and do pmd_same() check.
Link: https://lkml.kernel.org/r/9d582a09dbcf12e562ac5fe0ba05e9248a58f5e0.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: copy_pte_range() use pte_offset_map_rw_nolock()
In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we
already hold the exclusive mmap_lock, and the copy_pte_range() and
retract_page_tables() are using vma->anon_vma to be exclusive, so the PTE
page is stable, there is no need to get pmdval and do pmd_same() check.
Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the PTL held. So
we should get pgt_pmd and do pmd_same() check after the ptl held.
Link: https://lkml.kernel.org/r/055e42db68da00ac8ecab94bd2633c7cd965eb1c.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: handle_pte_fault() use pte_offset_map_rw_nolock()
In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will do the pte_same() check, so there is no need to get pmdval to do
pmd_same() check, just pass a dummy variable to it.
Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In do_adjust_pte(), we may modify the pte entry. The corresponding pmd
entry may have been modified concurrently. Therefore, in order to ensure
the stability if pmd entry, use pte_offset_map_rw_nolock() to replace
pte_offset_map_nolock(), and do pmd_same() check after holding the PTL.
All callers of update_mmu_cache_range() hold the vmf->ptl, so we can
determined whether split PTE locks is being used by doing the following,
just as we do elsewhere in the kernel.
ptl != vmf->ptl
And then we can delete the do_pte_lock() and do_pte_unlock().
Link: https://lkml.kernel.org/r/0eaf6b69aeb2fe35092a633fed12537efe645303.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
in do_swap_page(). In other places, we directly use
pte_offset_map_lock(), so convert it to using pte_offset_map_ro_nolock().
Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE
page after the PTL is held. The RCU lock in pte_offset_map_nolock() will
ensure that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified. Therefore
pte_offset_map_ro_nolock() is just a renamed version of
pte_offset_map_nolock().
pte_offset_map_rw_nolock() is used for may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to
ensure that the pmd entry has not been modified concurrently. So in
addition to the name change, it also outputs the pmdval when successful.
The users should make sure the page table is stable like checking
pte_same() or checking pmd_same() by using the output pmdval before
performing the write operations.
This series will convert all pte_offset_map_nolock() into the above two
helper functions one by one, and finally completely delete it.
This also a preparation for reclaiming the empty user PTE page table
pages.
This patch (of 13):
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Nanyong Sun [Thu, 26 Sep 2024 07:49:22 +0000 (15:49 +0800)]
mm: move mm flags to mm_types.h
The types of mm flags are now far beyond the core dump related features.
This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h.
The linux/sched/coredump.h has include the mm_types.h, so the C files
related to coredump does not need to change head file inclusion. In
addition, the inclusion of sched/coredump.h now can be deleted from the C
files that irrelevant to core dump.
Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 26 Sep 2024 15:10:19 +0000 (16:10 +0100)]
mm/madvise: unrestrict process_madvise() for current process
The process_madvise() call was introduced in commit ecb8ac8b1f14
("mm/madvise: introduce process_madvise() syscall: an external memory
hinting API") as a means of performing madvise() operations on another
process.
However, as it provides the means by which to perform multiple madvise()
operations in a batch via an iovec, it is useful to utilise the same
interface for performing operations on the current process rather than a
remote one.
Commit 22af8caff7d1 ("mm/madvise: process_madvise() drop capability check
if same mm") removed the need for a caller invoking process_madvise() on
its own pidfd to possess the CAP_SYS_NICE capability, however this leaves
the restrictions on operation in place.
Resolve this by only applying the restriction on operations when accessing
a remote process.
Moving forward we plan to implement a simpler means of specifying this
condition other than needing to establish a self pidfd, perhaps in the
form of a sentinel pidfd.
Also take the opportunity to refactor the system call implementation
abstracting the vectorised operation.
Link: https://lkml.kernel.org/r/20240926151019.82902-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Thu, 26 Sep 2024 15:20:44 +0000 (17:20 +0200)]
selftests/mm: hugetlb_fault_after_madv: improve test output
Let's improve the test output. For example, print the proper test result.
Install a SIGBUS handler to catch any SIGBUS instead of crashing the test
on failure.
With unsuitable hugetlb page count:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
# [INFO] detected default hugetlb page size: 2048 KiB
ok 2 # SKIP This test needs one and only one page to execute. Got 0
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0
On a failure:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
not ok 1 SIGBUS behavior
Bail out! 1 out of 1 tests failed
On success:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
# [INFO] detected default hugetlb page size: 2048 KiB
ok 1 SIGBUS behavior
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lkml.kernel.org/r/20240926152044.2205129-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Tested-by: Mario Casquero <mcasquer@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Thu, 26 Sep 2024 15:20:43 +0000 (17:20 +0200)]
selftests/mm: hugetlb_fault_after_madv: use default hugetlb page size
Patch series "selftests/mm: hugetlb_fault_after_madv improvements".
Mario brought to my attention that the hugetlb_fault_after_madv test is
currently always skipped on s390x. Let's adjust the test to be
independent of the default hugetlb page size and while at it, also improve
the test output.
This patch (of 2):
We currently assume that the hugetlb page size is 2 MiB, which is why we
mmap() a 2 MiB range.
Is the default hugetlb size is larger, mmap() will fail because the range
is not suitable. If the default hugetlb size is smaller (e.g., s390x),
mmap() will fail because we would need more than one hugetlb page, but
just asserted that we have exactly one.
So let's simply use the default hugetlb page size instead of hard-coded 2
MiB, so the test isn't unconditionally skipped on architectures like
s390x.
Before this patch on s390x:
$ ./hugetlb_fault_after_madv
1..0 # SKIP Failed to allocated huge page
With this change on s390x:
$ ./hugetlb_fault_after_madv
Zhiguo Jiang [Fri, 12 Jan 2024 01:23:52 +0000 (09:23 +0800)]
mm: fix shrink nr.unqueued_dirty counter issue
It is needed to ensure sc->nr.unqueued_dirty > 0, which can avoid setting
PGDAT_DIRTY flag when sc->nr.unqueued_dirty and sc->nr.file_taken are both
zero.
Kairui Song [Wed, 25 Sep 2024 17:10:20 +0000 (01:10 +0800)]
mm/list_lru: simplify the list_lru walk callback function
Now isolation no longer takes the list_lru global node lock, only use the
per-cgroup lock instead. And this lock is inside the list_lru_one being
walked, no longer needed to pass the lock explicitly.
Link: https://lkml.kernel.org/r/20240925171020.32142-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Wed, 25 Sep 2024 17:10:19 +0000 (01:10 +0800)]
mm/list_lru: split the lock to per-cgroup scope
Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. This lock contention is heavy
when multiple cgroups modify the same list_lru.
This lock can be split into per-cgroup scope to reduce contention.
To achieve this, we need a stable list_lru_one for every cgroup. This
commit adds a lock to each list_lru_one and introduced a helper function
lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg.
Then reworked the reparenting process.
Reparenting will switch the list_lru_one instances one by one. By locking
each instance and marking it dead using the nr_items counter, reparenting
ensures that all items in the corresponding cgroup (on-list or not,
because items have a stable cgroup, see below) will see the list_lru_one
switch synchronously.
Objcg reparent is also moved after list_lru reparent so items will have a
stable mem cgroup until all list_lru_one instances are drained.
The only caller that doesn't work the *_obj interfaces are direct calls to
list_lru_{add,del}. But it's only used by zswap and that's also based on
objcg, so it's fine.
This also changes the bahaviour of the isolation function when LRU_RETRY
or LRU_REMOVED_RETRY is returned, because now releasing the lock could
unblock reparenting and free the list_lru_one, isolation function will
have to return withoug re-lock the lru.
prepare() {
mkdir /tmp/test-fs
modprobe brd rd_nr=1 rd_size=33554432
mkfs.xfs -f /dev/ram0
mount -t xfs /dev/ram0 /tmp/test-fs
for i in $(seq 1 512); do
mkdir "/tmp/test-fs/$i"
for j in $(seq 1 10240); do
echo TEST-CONTENT > "/tmp/test-fs/$i/$j"
done &
done; wait
}
do_test() {
read_worker() {
sleep 1
tar -cv "$1" &>/dev/null
}
read_in_all() {
cd "/tmp/test-fs" && ls
for i in $(seq 1 512); do
(exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs"
read_worker "$i" &
done; wait
}
for i in $(seq 1 512); do
mkdir -p "/sys/fs/cgroup/benchmark/$i"
done
echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control
echo 512M > /sys/fs/cgroup/benchmark/memory.max
echo 3 > /proc/sys/vm/drop_caches
time read_in_all
}
Above script simulates compression of small files in multiple cgroups
with memory pressure. Run prepare() then do_test for 6 times:
Before:
real 0m7.762s user 0m11.340s sys 3m11.224s
real 0m8.123s user 0m11.548s sys 3m2.549s
real 0m7.736s user 0m11.515s sys 3m11.171s
real 0m8.539s user 0m11.508s sys 3m7.618s
real 0m7.928s user 0m11.349s sys 3m13.063s
real 0m8.105s user 0m11.128s sys 3m14.313s
After this commit (about ~15% faster):
real 0m6.953s user 0m11.327s sys 2m42.912s
real 0m7.453s user 0m11.343s sys 2m51.942s
real 0m6.916s user 0m11.269s sys 2m43.957s
real 0m6.894s user 0m11.528s sys 2m45.346s
real 0m6.911s user 0m11.095s sys 2m43.168s
real 0m6.773s user 0m11.518s sys 2m40.774s
Link: https://lkml.kernel.org/r/20240925171020.32142-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Wed, 25 Sep 2024 17:10:18 +0000 (01:10 +0800)]
mm/list_lru: simplify reparenting and initial allocation
Currently, there is a lot of code for detecting reparent racing using
kmemcg_id as the synchronization flag. And an intermediate table is
required to record and compare the kmemcg_id.
We can simplify this by just checking the cgroup css status, skip if
cgroup is being offlined. On the reparenting side, ensure no more
allocation is on going and no further allocation will occur by using the
XArray lock as barrier.
Combined with a O(n^2) top-down walk for the allocation, we get rid of the
intermediate table allocation completely. Despite being O(n^2), it should
be actually faster because it's not practical to have a very deep cgroup
level, and in most cases the parent cgroup should have been allocated
already.
This also avoided changing kmemcg_id before reparenting, making cgroups
have a stable index for list_lru_memcg. After this change it's possible
that a dying cgroup will see a NULL value in XArray corresponding to the
kmemcg_id, because the kmemcg_id will point to an empty slot. In such
case, just fallback to use its parent.
As a result the code is simpler, following test also showed a very slight
performance gain (12 test runs):
prepare() {
mkdir /tmp/test-fs
modprobe brd rd_nr=1 rd_size=16777216
mkfs.xfs -f /dev/ram0
mount -t xfs /dev/ram0 /tmp/test-fs
for i in $(seq 10000); do
seq 8000 > "/tmp/test-fs/$i"
done
mkdir -p /sys/fs/cgroup/system.slice/bench/test/1
echo +memory > /sys/fs/cgroup/system.slice/bench/cgroup.subtree_control
echo +memory > /sys/fs/cgroup/system.slice/bench/test/cgroup.subtree_control
echo +memory > /sys/fs/cgroup/system.slice/bench/test/1/cgroup.subtree_control
echo 768M > /sys/fs/cgroup/system.slice/bench/memory.max
}
do_test() {
read_worker() {
mkdir -p "/sys/fs/cgroup/system.slice/bench/test/1/$1"
echo $BASHPID > "/sys/fs/cgroup/system.slice/bench/test/1/$1/cgroup.procs"
read -r __TMP < "/tmp/test-fs/$1";
}
read_in_all() {
for i in $(seq 10000); do
read_worker "$i" &
done; wait
}
echo 3 > /proc/sys/vm/drop_caches
time read_in_all
for i in $(seq 1 10000); do
rmdir "/sys/fs/cgroup/system.slice/bench/test/1/$i" &>/dev/null
done
}
Before:
real 0m3.498s user 0m11.037s sys 0m35.872s
real 1m33.860s user 0m11.593s sys 3m1.169s
real 1m31.883s user 0m11.265s sys 2m59.198s
real 1m32.394s user 0m11.294s sys 3m1.616s
real 1m31.017s user 0m11.379s sys 3m1.349s
real 1m31.931s user 0m11.295s sys 2m59.863s
real 1m32.758s user 0m11.254s sys 2m59.538s
real 1m35.198s user 0m11.145s sys 3m1.123s
real 1m30.531s user 0m11.393s sys 2m58.089s
real 1m31.142s user 0m11.333s sys 3m0.549s
After:
real 0m3.489s user 0m10.943s sys 0m36.036s
real 1m10.893s user 0m11.495s sys 2m38.545s
real 1m29.129s user 0m11.382s sys 3m1.601s
real 1m29.944s user 0m11.494s sys 3m1.575s
real 1m31.208s user 0m11.451s sys 2m59.693s
real 1m25.944s user 0m11.327s sys 2m56.394s
real 1m28.599s user 0m11.312s sys 3m0.162s
real 1m26.746s user 0m11.538s sys 2m55.462s
real 1m30.668s user 0m11.475s sys 3m2.075s
real 1m29.258s user 0m11.292s sys 3m0.780s
Which is slightly faster in real time.
Link: https://lkml.kernel.org/r/20240925171020.32142-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Split list_lru lock into per-cgroup scope", v2.
Currently, every list_lru has a per-node lock that protects adding,
deletion, isolation, and reparenting of all list_lru_one instances
belonging to this list_lru on this node. This lock contention is heavy
when multiple cgroups modify the same list_lru.
This can be alleviated by splitting the lock into per-cgroup scope.
To achieve this, this series reworked and optimized the reparenting
process step by step, making it possible to have a stable list_lru_one,
and making it possible to pin the list_lru_one. Then split the lock into
per-cgroup scope.
The result is ~15% performance gain for simple multi-cgroup tar test of
small files, and reduced LOC. See PATCH 5/6 for test details.
This patch (of 6):
When LOCKDEP is not enabled, lock_class_key is an empty struct that is
never used. But the list_lru initialization function still takes a
placeholder pointer as parameter, and the compiler cannot optimize it
because the function is not static and exported.
Remove this parameter and move it inside the list_lru struct. Only use it
when LOCKDEP is enabled. Kernel builds with LOCKDEP will be slightly
larger, while !LOCKDEP builds without it will be slightly smaller (the
common case).
Link: https://lkml.kernel.org/r/20240925171020.32142-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20240925171020.32142-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Sun, 8 Sep 2024 14:05:54 +0000 (14:05 +0000)]
maple_tree: memset maple_big_node as a whole
In mast_fill_bnode(), we first clear some fields of maple_big_node and set
the 'type' unconditionally before return. This means we won't leverage
any information in maple_big_node and it is safe to clear the whole
structure.
In maple_big_node, we define slot and padding/gap in a union. And based
on current definition of MAPLE_BIG_NODE_SLOTS/GAPS, padding is always less
than slot and part of the gap is overlapped by slot.
For example on 64bit system:
MAPLE_BIG_NODE_SLOT is 34
MAPLE_BIG_NODE_GAP is 21
With this knowledge, current code may clear some space by twice. And
this could be avoid by clearing the structure as a whole.
Wei Yang [Wed, 11 Sep 2024 14:27:59 +0000 (14:27 +0000)]
maple_tree: goto complete directly on a pivot of 0
When we break the loop after assigning a pivot, the index i/j is not
changed. Then the following code assign pivot, which means we do the
assignment with same i/j by mas_safe_pivot.
Since the loop condition is (i < piv_end), from which we can get i is less
than mt_pivots[mt]. It implies mas_safe_pivot() return pivot[i] which is
the same value we get in loop.
Now we can conclude it does a redundant assignment on a pivot of 0. Let's
just go to complete to avoid it.
We can see in all these four cases, i is always less than or equal to
mas_end after finish the loop:
Case 1: After assign pivot 0, i is set to 1, which is bigger than
mas_end 0. So it jumps to complete and skip the check.
Case 2: After assign pivot 0, i is set to 1.
∵ (mas_start < mas_end) && (mas_start == 0)
==> (1 <= mas_end)
∵ (i == 1) && (1 <= mas_end)
==> (i <= mas_end)
∴ Before loop, we have (i <= mas_end). And we still hold this
if it skips the loop. For example, (i == mas_end).
Now let's see what happens in the loop:
∵ piv_end = min(mas_end, mt_pivots[mt])
==> (piv_end <= mas_end)
∵ loop condition is (i < piv_end)
==> (i <= piv_end) on finish the loop both normally or break
∵ (i <= piv_end) && (piv_end <= mas_end)
==> (i <= mas_end)
∴ After loop, we still get (i <= mas_end) in this case
Case 3: This case would skip both if clause and loop. So when it comes
to the check, i is still mas_start which equals to mas_end.
Case 4: This case would skip the if clause.
∵ (mas_start < mas_end) && (i == mas_start)
==> (i < mas_end)
∴ Before loop, we have (i < mas_end).
The loop process is similar with Case 2, so we get the same
result.
Now we can conclude in all cases, we get (i <= mas_end) when doing
check. Then it is not necessary to do the check.
Lorenzo Stoakes [Wed, 25 Sep 2024 07:48:27 +0000 (08:48 +0100)]
procfs: prefer neater pointer error comparison
We can compare a pointer to a known error code via PTR_ERR(ptr) == -EINVAL
or via ptr == ERR_PTR(-EINVAL) - the latter is neater and collects the
macro and constant in one, so refactor to use this form in
proc_mem_open().
Lorenzo Stoakes [Tue, 24 Sep 2024 20:10:23 +0000 (21:10 +0100)]
mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled
the same as an error in all callers, with some translating this into an
-ESRCH error.
Only proc_mem_open() returns NULL if no mm is found, however in this case
it is clearer and makes more sense to explicitly handle the error.
Additionally we take the opportunity to refactor the function to eliminate
unnecessary nesting.
Simplify things by simply returning -ESRCH if no mm is found - this both
eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies
callers which would return -ESRCH by returning this error directly.
We now have only one active post-processing at any time, so we don't have
same race conditions that we had before. If slot selected for
post-processing gets freed or freed and reallocated it loses its PP_SLOT
flag and there is no way for such a slot to gain PP_SLOT flag again until
current post-processing terminates.
ZRAM_SAME slots cannot be post-processed (writeback or recompress) so do
not mark them ZRAM_IDLE. Same with ZRAM_WB slots, they cannot be
ZRAM_IDLE because they are not in zsmalloc pool anymore.
Writeback suffers from the same problem as recompression did before -
target slot selection for writeback is just a simple iteration over
zram->table entries (stored pages) which selects suboptimal targets for
writeback. This is especially problematic for writeback, because we
uncompress objects before writeback so each of them takes 4K out of
limited writeback storage. For example, when we take a 48 bytes slot and
store it as a 4K object to writeback device we only save 48 bytes of
memory (release from zsmalloc pool). We naturally want to pick the
largest objects for writeback, because then each writeback will release
the largest amount of memory.
This patch applies the same solution and strategy as for recompression
target selection: pp control (post-process) with 16 buckets of candidate
pp slots. Slots are assigned to pp buckets based on sizes - the larger
the slot the higher the group index. This gives us sorted by size lists
of candidate slots (in linear time), so that among post-processing
candidate slots we always select the largest ones first and maximize the
memory saving.
TEST
====
A very simple demonstration: zram is configured with a writeback device.
A limited writeback (wb_limit 2500 pages) is performed then, with a log of
sizes of slots that were written back. You can see that patched zram
selects slots for recompression in significantly different manner, which
leads to higher memory savings (see column #2 of mm_stat output).
Target slot selection for recompression is just a simple iteration over
zram->table entries (stored pages) from slot 0 to max slot. Given that
zram->table slots are written in random order and are not sorted by size,
a simple iteration over slots selects suboptimal targets for
recompression. This is not a problem if we recompress every single
zram->table slot, but we never do that in reality. In reality we limit
the number of slots we can recompress (via max_pages parameter) and hence
proper slot selection becomes very important. The strategy is quite
simple, suppose we have two candidate slots for recompression, one of size
48 bytes and one of size 2800 bytes, and we can recompress only one, then
it certainly makes more sense to pick 2800 entry for recompression.
Because even if we manage to compress 48 bytes objects even further the
savings are going to be very small. Potential savings after good
re-compression of 2800 bytes objects are much higher.
This patch reworks slot selection and introduces the strategy described
above: among candidate slots always select the biggest ones first.
For that the patch introduces zram_pp_ctl (post-processing) structure
which holds NUM_PP_BUCKETS pp buckets of slots. Slots are assigned to a
particular group based on their sizes - the larger the size of the slot
the higher the group index. This, basically, sorts slots by size in liner
time (we still perform just one iteration over zram->table slots). When
we select slot for recompression we always first lookup in higher pp
buckets (those that hold the largest slots). Which achieves the desired
behavior.
TEST
====
A very simple demonstration: zram is configured with zstd, and zstd with
dict as a recompression stream. A limited (max 4096 pages) recompression
is performed then, with a log of sizes of slots that were recompressed.
You can see that patched zram selects slots for recompression in
significantly different manner, which leads to higher memory savings (see
column #2 of mm_stat output).
zram: permit only one post-processing operation at a time
Both recompress and writeback soon will unlock slots during processing,
which makes things too complex wrt possible race-conditions. We still
want to clear PP_SLOT in slot_free, because this is how we figure out that
slot that was selected for post-processing has been released under us and
when we start post-processing we check if slot still has PP_SLOT set. At
the same time, theoretically, we can have something like this:
CPU0 CPU1
recompress
scan slots
set PP_SLOT
unlock slot
slot_free
clear PP_SLOT
allocate PP_SLOT
writeback
scan slots
set PP_SLOT
unlock slot
select PP-slot
test PP_SLOT
So recompress will not detect that slot has been re-used and re-selected
for concurrent writeback post-processing.
Make sure that we only permit on post-processing operation at a time. So
now recompress and writeback post-processing don't race against each
other, we only need to handle slot re-use (slot_free and write), which is
handled individually by each pp operation.
Having recompress and writeback competing for the same slots is not
exactly good anyway (can't imagine anyone doing that).
Patch series "zram: optimal post-processing target selection", v5.
Problem:
--------
Both recompression and writeback perform a very simple linear scan of all
zram slots in search for post-processing (writeback or recompress)
candidate slots. This often means that we pick the worst candidate for pp
(post-processing), e.g. a 48 bytes object for writeback, which is nearly
useless, because it only releases 48 bytes from zsmalloc pool, but
consumes an entire 4K slot in the backing device. Similarly,
recompression of an 48 bytes objects is unlikely to save more memory that
recompression of a 3000 bytes object. Both recompression and writeback
consume constrained resources (CPU time, batter, backing device storage
space) and quite often have a (daily) limit on the number of items they
post-process, so we should utilize those constrained resources in the most
optimal way.
Solution:
---------
This patch reworks the way we select pp targets. We, quite clearly, want
to sort all the candidates and always pick the largest, be it
recompression or writeback. Especially for writeback, because the larger
object we writeback the more memory we release. This series introduces
concept of pp buckets and pp scan/selection.
The scan step is a simple iteration over all zram->table entries, just
like what we currently do, but we don't post-process a candidate slot
immediately. Instead we assign it to a PP (post-processing) bucket. PP
bucket is, basically, a list which holds pp candidate slots that belong to
the same size class. PP buckets are 64 bytes apart, slots are not
strictly sorted within a bucket there is a 64 bytes variance.
The select step simply iterates over pp buckets from highest to lowest and
picks all candidate slots a particular buckets contains. So this gives us
sorted candidates (in linear time) and allows us to select most optimal
(largest) candidates for post-processing first.
This patch (of 7):
This flag indicates that the slot was selected as a candidate slot for
post-processing (pp) and was assigned to a pp bucket. It does not
necessarily mean that the slot is currently under post-processing, but may
mean so. The slot can loose its PP_SLOT flag, while still being in the
pp-bucket, if it's accessed or slot_free-ed.
Adrian Huang [Fri, 26 Jul 2024 16:52:46 +0000 (00:52 +0800)]
mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation
When compiling kernel source 'make -j $(nproc)' with the up-and-running
KASAN-enabled kernel on a 256-core machine, the following soft lockup is
shown:
1. The following ftrace log shows that the lockup CPU spends too much
time iterating vmap_nodes and flushing TLB when purging vm_area
structures. (Some info is trimmed).
The drain_vmap_area_work() spends over 23 seconds.
There are 2805 flush_tlb_kernel_range() calls in the ftrace log.
* One is called in __purge_vmap_area_lazy().
* Others are called by purge_vmap_node->kasan_release_vmalloc.
purge_vmap_node() iteratively releases kasan vmalloc
allocations and flushes TLB for each vmap_area.
- [Rough calculation] Each flush_tlb_kernel_range() runs
about 7.5ms.
-- 2804 * 7.5ms = 21.03 seconds.
-- That's why a soft lock is triggered.
2. Extending the soft lockup time can work around the issue (For example,
# echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the
above-mentioned speculation: drain_vmap_area_work() spends too much
time.
If we combine all TLB flush operations of the KASAN shadow virtual
address into one operation in the call path
'purge_vmap_node()->kasan_release_vmalloc()', the running time of
drain_vmap_area_work() can be saved greatly. The idea is from the
flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the
soft lockup won't be triggered.
Here is the test result based on 6.10:
[6.10 wo/ the patch]
1. ftrace latency profiling (record a trace if the latency > 20s).
echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh
echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
2. Run `make -j $(nproc)` to compile the kernel source
3. Once the soft lockup is reproduced, check the ftrace log:
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
76) $ 50412985 us | } /* __purge_vmap_area_lazy */
76) $ 50412997 us | } /* drain_vmap_area_work */
76) $ 29165911 us | } /* __purge_vmap_area_lazy */
76) $ 29165926 us | } /* drain_vmap_area_work */
91) $ 53629423 us | } /* __purge_vmap_area_lazy */
91) $ 53629434 us | } /* drain_vmap_area_work */
91) $ 28121014 us | } /* __purge_vmap_area_lazy */
91) $ 28121026 us | } /* drain_vmap_area_work */
[6.10 w/ the patch]
1. Repeat step 1-2 in "[6.10 wo/ the patch]"
2. The soft lockup is not triggered and ftrace log is empty.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace
log.
4. Setting 'tracing_thresh' to 1 second gets ftrace log.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
23) $ 1074942 us | } /* __purge_vmap_area_lazy */
23) $ 1074950 us | } /* drain_vmap_area_work */
The worst execution time of drain_vmap_area_work() is about 1 second.
Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/ Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In proactive memory reclamation scenarios, it is necessary to estimate the
pswpin and pswpout metrics of the cgroup to determine whether to continue
reclaiming anonymous pages in the current batch. This patch will collect
these metrics and expose them.
Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Sun, 22 Sep 2024 04:32:13 +0000 (12:32 +0800)]
mm: shmem: fix khugepaged activation policy for shmem
Shmem has a separate interface (different from anonymous pages) to control
huge page allocation, that means shmem THP can be enabled while anonymous
THP is disabled. However, in this case, khugepaged will not start to
collapse shmem THP, which is unreasonable.
To fix this issue, we should call start_stop_khugepaged() to activate or
deactivate the khugepaged thread when setting shmem mTHP interfaces.
Moreover, add a new helper shmem_hpage_pmd_enabled() to help to check
whether shmem THP is enabled, which will determine if khugepaged should be
activated.
Link: https://lkml.kernel.org/r/9b9c6cbc4499bf44c6455367fd9e0f6036525680.1726978977.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 24 Sep 2024 18:59:11 +0000 (19:59 +0100)]
selftests/mm: add pkey_sighandler_xx, hugetlb_dio to .gitignore
Commit 6998a73efbb8 ("selftests/mm: Add new testcases for pkeys") and
commit 3a103b5315b7 ("selftest: mm: Test if hugepage does not get leaked
during __bio_release_pages()") generate test binaries hugetlb_dio,
pkey_sighandler_tests_32 and pkey_sighandler_tests_64 but did not add
these to .gitignore. Correct this.
Link: https://lkml.kernel.org/r/20240924185911.117937-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Thu, 31 Oct 2024 16:12:03 +0000 (09:12 -0700)]
mm/damon/core: avoid overflow in damon_feed_loop_next_input()
damon_feed_loop_next_input() is inefficient and fragile to overflows.
Specifically, 'score_goal_diff_bp' calculation can overflow when 'score'
is high. The calculation is actually unnecessary at all because 'goal' is
a constant of value 10,000. Calculation of 'compensation' is again
fragile to overflow. Final calculation of return value for under-achiving
case is again fragile to overflow when the current score is
under-achieving the target.
Add two corner cases handling at the beginning of the function to make the
body easier to read, and rewrite the body of the function to avoid
overflows and the unnecessary bp value calcuation.
SeongJae Park [Thu, 31 Oct 2024 18:37:57 +0000 (11:37 -0700)]
mm/damon/core: handle zero schemes apply interval
DAMON's logics to determine if this is the time to apply damos schemes
assumes next_apply_sis is always set larger than current
passed_sample_intervals. And therefore assume continuously incrementing
passed_sample_intervals will make it reaches to the next_apply_sis in
future. The logic hence does apply the scheme and update next_apply_sis
only if passed_sample_intervals is same to next_apply_sis.
If Schemes apply interval is set as zero, however, next_apply_sis is set
same to current passed_sample_intervals, respectively. And
passed_sample_intervals is incremented before doing the next_apply_sis
check. Hence, next_apply_sis becomes larger than next_apply_sis, and the
logic says it is not the time to apply schemes and update next_apply_sis.
In other words, DAMON stops applying schemes until passed_sample_intervals
overflows.
Based on the documents and the common sense, a reasonable behavior for
such inputs would be applying the schemes for every sampling interval.
Handle the case by removing the assumption.
Link: https://lkml.kernel.org/r/20241031183757.49610-3-sj@kernel.org Fixes: 42f994b71404 ("mm/damon/core: implement scheme-specific apply interval") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.7.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Thu, 31 Oct 2024 18:37:56 +0000 (11:37 -0700)]
mm/damon/core: handle zero {aggregation,ops_update} intervals
Patch series "mm/damon/core: fix handling of zero non-sampling intervals".
DAMON's internal intervals accounting logic is not correctly handling
non-sampling intervals of zero values for a wrong assumption. This could
cause unexpected monitoring behavior, and even result in infinite hang of
DAMON sysfs interface user threads in case of zero aggregation interval.
Fix those by updating the intervals accounting logic. For details of the
root case and solutions, please refer to commit messages of fixes.
This patch (of 2):
DAMON's logics to determine if this is the time to do aggregation and ops
update assumes next_{aggregation,ops_update}_sis are always set larger
than current passed_sample_intervals. And therefore it further assumes
continuously incrementing passed_sample_intervals every sampling interval
will make it reaches to the next_{aggregation,ops_update}_sis in future.
The logic therefore make the action and update
next_{aggregation,ops_updaste}_sis only if passed_sample_intervals is same
to the counts, respectively.
If Aggregation interval or Ops update interval are zero, however,
next_aggregation_sis or next_ops_update_sis are set same to current
passed_sample_intervals, respectively. And passed_sample_intervals is
incremented before doing the next_{aggregation,ops_update}_sis check.
Hence, passed_sample_intervals becomes larger than
next_{aggregation,ops_update}_sis, and the logic says it is not the time
to do the action and update next_{aggregation,ops_update}_sis forever,
until an overflow happens. In other words, DAMON stops doing aggregations
or ops updates effectively forever, and users cannot get monitoring
results.
Based on the documents and the common sense, a reasonable behavior for
such inputs is doing an aggregation and an ops update for every sampling
interval. Handle the case by removing the assumption.
Note that this could incur particular real issue for DAMON sysfs interface
users, in case of zero Aggregation interval. When user starts DAMON with
zero Aggregation interval and asks online DAMON parameter tuning via DAMON
sysfs interface, the request is handled by the aggregation callback.
Until the callback finishes the work, the user who requested the online
tuning just waits. Hence, the user will be stuck until the
passed_sample_intervals overflows.
Wei Yang [Sun, 27 Oct 2024 12:33:21 +0000 (12:33 +0000)]
mm/mlock: set the correct prev on failure
After commit 94d7d9233951 ("mm: abstract the vma_merge()/split_vma()
pattern for mprotect() et al."), if vma_modify_flags() return error, the
vma is set to an error code. This will lead to an invalid prev be
returned.
Generally this shouldn't matter as the caller should treat an error as
indicating state is now invalidated, however unfortunately
apply_mlockall_flags() does not check for errors and assumes that
mlock_fixup() correctly maintains prev even if an error were to occur.
This patch fixes that assumption.
[lorenzo.stoakes@oracle.com: provide a better fix and rephrase the log] Link: https://lkml.kernel.org/r/20241027123321.19511-1-richard.weiyang@gmail.com Fixes: 94d7d9233951 ("mm: abstract the vma_merge()/split_vma() pattern for mprotect() et al.") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
objpool: fix to make percpu slot allocation more robust
Since gfp & GFP_ATOMIC == GFP_ATOMIC is true for GFP_KERNEL | GFP_HIGH, it
will use kmalloc if user specifies that combination. Here the reason why
combining the __vmalloc_node() and kmalloc_node() is that the vmalloc does
not support all GFP flag, especially GFP_ATOMIC. So we should check if
gfp & (GFP_ATOMIC | GFP_KERNEL) != GFP_ATOMIC for vmalloc first. This
ensures caller can sleep. And for the robustness, even if vmalloc fails,
it should retry with kmalloc to allocate it.
Link: https://lkml.kernel.org/r/173008598713.1262174.2959179484209897252.stgit@mhiramat.roam.corp.google.com Fixes: aff1871bfc81 ("objpool: fix choosing allocation for percpu slots") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Closes: https://lore.kernel.org/all/CAHk-=whO+vSH+XVRio8byJU8idAWES0SPGVZ7KAVdc4qrV0VUA@mail.gmail.com/ Cc: Leo Yan <leo.yan@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Wu <wuqiang.matt@bytedance.com> Cc: Mikel Rychliski <mikel@mikelr.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Mark Brown <broonie@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:47 +0000 (18:11 +0000)]
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().
The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().
Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.
It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.
We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.
This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.
This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.
So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().
Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Larsson <andreas@gaisler.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:46 +0000 (18:11 +0000)]
mm: refactor map_deny_write_exec()
Refactor the map_deny_write_exec() to not unnecessarily require a VMA
parameter but rather to accept VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.
While we're here, we refactor the function to be more readable and add
some additional documentation.
Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:45 +0000 (18:11 +0000)]
mm: unconditionally close VMAs on error
Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.
With regards to the important vm_ops->close() callback We have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:44 +0000 (18:11 +0000)]
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
Patch series "fix error handling in mmap_region() and refactor
(hotfixes)", v4.
mmap_region() is somewhat terrifying, with spaghetti-like control flow and
numerous means by which issues can arise and incomplete state, memory
leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the
VMA is in an inconsistent state.
The patches in this series comprise the minimal changes required to
resolve existing issues in mmap_region() error handling, in order that
they can be hotfixed and backported. There is additionally a follow up
series which goes further, separated out from the v1 series and sent and
updated separately.
This patch (of 5):
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced,
meaning that we need complicated handling to ensure we do not incorrectly
call these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment
that the file->f_ops->mmap() function reports an error by replacing
whatever VMA operations were installed with a dummy empty set of VMA
operations.
We do so through a new helper function internal to mm - mmap_file() -
which is both more logically named than the existing call_mmap() function
and correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
Hugh Dickins [Sun, 27 Oct 2024 20:02:13 +0000 (13:02 -0700)]
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache. That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included. There
is not a strong case for backports, but they can fix cornercases.
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Sun, 27 Oct 2024 19:59:34 +0000 (12:59 -0700)]
mm/thp: fix deferred split queue not partially_mapped
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
The new unlocked list_del_init() in deferred_split_scan() is buggy. I
gave bad advice, it looks plausible since that's a local on-stack list,
but the fact is that it can race with a third party freeing or migrating
the preceding folio (properly unqueueing it with refcount 0 while holding
split_queue_lock), thereby corrupting the list linkage.
The obvious answer would be to take split_queue_lock there: but it has a
long history of contention, so I'm reluctant to add to that. Instead,
make sure that there is always one safe (raised refcount) folio before, by
delaying its folio_put(). (And of course I was wrong to suggest updating
split_queue_len without the lock: leave that until the splice.)
And remove two over-eager partially_mapped checks, restoring those tests
to how they were before: if uncharge_folio() or free_tail_page_prepare()
finds _deferred_list non-empty, it's in trouble whether or not that folio
is partially_mapped (and the flag was already cleared in the latter case).
Link: https://lkml.kernel.org/r/81e34a8b-113a-0701-740e-2135c97eb1d7@google.com Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 30 Oct 2024 00:37:30 +0000 (17:37 -0700)]
mm-page_alloc-keep-track-of-free-highatomic-fix
s/if/else if/, per Johannes, stealth whitespace tweak
Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Link Lin <linkl@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The second line above shows that the OOM kill was due to the following
condition:
free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
And the third line shows there were no free pages in any
MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type 'H'.
Therefore __zone_watermark_unusable_free() underestimated the usable free
memory by over 1GB, which resulted in the unnecessary OOM kill above.
The comments in __zone_watermark_unusable_free() warns about the potential
risk, i.e.,
If the caller does not have rights to reserves below the min
watermark then subtract the high-atomic reserves. This will
over-estimate the size of the atomic reserve but it avoids a search.
However, it is possible to keep track of free pages in reserved highatomic
pageblocks with a new per-zone counter nr_free_highatomic protected by the
zone lock, to avoid a search when calculating the usable free memory. And
the cost would be minimal, i.e., simple arithmetics in the highatomic
alloc/free/move paths.
Note that since nr_free_highatomic can be relatively small, using a
per-cpu counter might cause too much drift and defeat its purpose, in
addition to the extra memory overhead.
Yu Zhao [Sat, 19 Oct 2024 01:29:39 +0000 (01:29 +0000)]
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
When the MM_WALK capability is enabled, memory that is mostly accessed by
a VM appears younger than it really is, therefore this memory will be less
likely to be evicted. Therefore, the presence of a running VM can
significantly increase swap-outs for non-VM memory, regressing the
performance for the rest of the system.
Fix this regression by always calling {ptep,pmdp}_clear_young_notify()
whenever we clear the young bits on PMDs/PTEs.
Link: https://lkml.kernel.org/r/20241019012940.3656292-3-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reported-by: David Stevens <stevensd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Sat, 19 Oct 2024 01:29:38 +0000 (01:29 +0000)]
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
Patch series "mm: multi-gen LRU: Have secondary MMUs participate in
MM_WALK".
Today, the MM_WALK capability causes MGLRU to clear the young bit from
PMDs and PTEs during the page table walk before eviction, but MGLRU does
not call the clear_young() MMU notifier in this case. By not calling this
notifier, the MM walk takes less time/CPU, but it causes pages that are
accessed mostly through KVM / secondary MMUs to appear younger than they
should be.
We do call the clear_young() notifier today, but only when attempting to
evict the page, so we end up clearing young/accessed information less
frequently for secondary MMUs than for mm PTEs, and therefore they appear
younger and are less likely to be evicted. Therefore, memory that is
*not* being accessed mostly by KVM will be evicted *more* frequently,
worsening performance.
ChromeOS observed a tab-open latency regression when enabling MGLRU with a
setup that involved running a VM:
Tab-open latency histogram (ms)
Version p50 mean p95 p99 max
base 1315 1198 2347 3454 10319
mglru 2559 1311 7399 12060 43758
fix 1119 926 2470 4211 6947
This series replaces the final non-selftest patchs from this series[1],
which introduced a similar change (and a new MMU notifier) with KVM
optimizations. I'll send a separate series (to Sean and Paolo) for the
KVM optimizations.
This series also makes proactive reclaim with MGLRU possible for KVM
memory. I have verified that this functions correctly with the selftest
from [1], but given that that test is a KVM selftest, I'll send it with
the rest of the KVM optimizations later. Andrew, let me know if you'd
like to take the test now anyway.
The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very helpful
and become more complicated to properly compute when adding
test/clear_young() notifiers in MGLRU's mm walk.
Link: https://lkml.kernel.org/r/20241019012940.3656292-1-jthoughton@google.com Link: https://lkml.kernel.org/r/20241019012940.3656292-2-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: David Stevens <stevensd@google.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Thu, 24 Oct 2024 15:12:29 +0000 (17:12 +0200)]
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address hint
and of at least PMD_SIZE will be aligned to PMD so that it can benefit
from a THP backing page.
However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have merged
to a large THP-backed area before commit efa7df3e3bb5 and now they are
fragmented to multiple areas each aligned to PMD boundary with gaps
between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing due
to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable [2]
[3] and early testing suggests this patch fixes the regression there as
well.
To fix the regression but still try to benefit from THP-friendly anonymous
mapping alignment, add a condition that the size of the mapping must be a
multiple of PMD size instead of at least PMD size. In case of many
odd-sized mapping like the cactusBSSN creates, those will stop being
aligned and with gaps between, and instead naturally merge again.
Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michael Matz <matz@suse.de> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1] Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2] Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3] Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Cc: Rik van Riel <riel@surriel.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Petr Tesarik <ptesarik@suse.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gregory Price [Fri, 25 Oct 2024 14:17:24 +0000 (10:17 -0400)]
vmscan,migrate: fix page count imbalance on node stats when demoting pages
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. migrate_pages will decrement the
the node's isolated page count, leading to an imbalanced count when
invoked from (MG)LRU code.
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic
is the only source of MR_DEMOTION.
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Sun, 20 Oct 2024 04:22:12 +0000 (22:22 -0600)]
mm: allow set/clear page_type again
Some page flags (page->flags) were converted to page types
(page->page_types). A recent example is PG_hugetlb.
From the exclusive writer's perspective, e.g., a thread doing
__folio_set_hugetlb(), there is a difference between the page flag and
type APIs: the former allows the same non-atomic operation to be repeated
whereas the latter does not. For example, calling __folio_set_hugetlb()
twice triggers VM_BUG_ON_FOLIO(), since the second call expects the type
(PG_hugetlb) not to be set previously.
Using add_hugetlb_folio() as an example, it calls __folio_set_hugetlb() in
the following error-handling path. And when that happens, it triggers the
aforementioned VM_BUG_ON_FOLIO().
if (folio_test_hugetlb(folio)) {
rc = hugetlb_vmemmap_restore_folio(h, folio);
if (rc) {
spin_lock_irq(&hugetlb_lock);
add_hugetlb_folio(h, folio, false);
...
It is possible to make hugeTLB comply with the new requirements from the
page type API. However, a straightforward fix would be to just allow the
same page type to be set or cleared again inside the API, to avoid any
changes to its callers.
Link: https://lkml.kernel.org/r/20241020042212.296781-1-yuzhao@google.com Fixes: d99e3140a4d3 ("mm: turn folio_test_hugetlb into a PageType") Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryusuke Konishi [Sun, 20 Oct 2024 04:51:28 +0000 (13:51 +0900)]
nilfs2: fix potential deadlock with newly created symlinks
Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers
memory reclamation involving the filesystem layer, which can result in
circular lock dependencies among the reader/writer semaphore
nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the
fs_reclaim pseudo lock.
This is because after commit 21fc61c73c39 ("don't put symlink bodies in
pagecache into highmem"), the gfp flags of the page cache for symbolic
links are overwritten to GFP_KERNEL via inode_nohighmem().
This is not a problem for symlinks read from the backing device, because
the __GFP_FS flag is dropped after inode_nohighmem() is called. However,
when a new symlink is created with nilfs_symlink(), the gfp flags remain
overwritten to GFP_KERNEL. Then, memory allocation called from
page_symlink() etc. triggers memory reclamation including the FS layer,
which may call nilfs_evict_inode() or nilfs_dirty_inode(). And these can
cause a deadlock if they are called while nilfs->ns_segctor_sem is held:
Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags
of newly created symlinks in the same way that nilfs_new_inode() and
__nilfs_read_inode() do, as a workaround until we adopt nofs allocation
scope consistently or improve the locking constraints.
Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com Fixes: 21fc61c73c39 ("don't put symlink bodies in pagecache into highmem") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256 Tested-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Phillip Lougher [Sun, 20 Oct 2024 23:22:00 +0000 (00:22 +0100)]
Squashfs: fix variable overflow in squashfs_readpage_block
Syzbot reports a slab out of bounds access in squashfs_readpage_block().
This is caused by an attempt to read page index 0x2000000000. This value
(start_index) is stored in an integer loop variable which overflows
producing a value of 0. This causes a loop which iterates over pages
start_index -> end_index to iterate over 0 -> end_index, which ultimately
causes an out of bounds page array access.
Fix by changing variable to a loff_t, and rename to index to make it
clearer it is a page index, and not a loop count.
Andrey Konovalov [Tue, 22 Oct 2024 16:07:06 +0000 (18:07 +0200)]
kasan: remove vmalloc_percpu test
Commit 1a2473f0cbc0 ("kasan: improve vmalloc tests") added the
vmalloc_percpu KASAN test with the assumption that __alloc_percpu always
uses vmalloc internally, which is tagged by KASAN.
However, __alloc_percpu might allocate memory from the first per-CPU
chunk, which is not allocated via vmalloc(). As a result, the test might
fail.
Remove the test until proper KASAN annotation for the per-CPU allocated
are added; tracked in https://bugzilla.kernel.org/show_bug.cgi?id=215019.
Wladislav Wiebe [Tue, 22 Oct 2024 17:21:13 +0000 (19:21 +0200)]
tools/mm: -Werror fixes in page-types/slabinfo
Commit e6d2c436ff693 ("tools/mm: allow users to provide additional
cflags/ldflags") passes now CFLAGS to Makefile. With this, build systems
with default -Werror enabled found:
slabinfo.c:1300:25: error: ignoring return value of 'chdir'
declared with attribute 'warn_unused_result' [-Werror=unused-result]
chdir("..");
^~~~~~~~~~~
page-types.c:397:35: error: format '%lu' expects argument of type
'long unsigned int', but argument 2 has type 'uint64_t'
{aka 'long long unsigned int'} [-Werror=format=]
printf("%lu\t", mapcnt0);
~~^ ~~~~~~~
..
Fix page-types by using PRIu64 for uint64_t prints and check in slabinfo
for return code on chdir("..").
Link: https://lkml.kernel.org/r/c1ceb507-94bc-461c-934d-c19b77edd825@gmail.com Fixes: e6d2c436ff69 ("tools/mm: allow users to provide additional cflags/ldflags") Signed-off-by: Wladislav Wiebe <wladislav.kw@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Herton R. Krzesinski <herton@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Tue, 22 Oct 2024 17:55:12 +0000 (01:55 +0800)]
mm, swap: avoid over reclaim of full clusters
When running low on usable slots, cluster allocator will try to reclaim
the full clusters aggressively to reclaim HAS_CACHE slots. This
guarantees that as long as there are any usable slots, HAS_CACHE or not,
the swap device will be usable and workload won't go OOM early.
Before the cluster allocator, swap allocator fails easily if device is
filled up with reclaimable HAS_CACHE slots. Which can be easily
reproduced with following simple program:
Setup an 8G non ramdisk swap, the first run of the program will swapout 8G
ram successfully. But run same program again after the first run paused,
the second run can't swapout all 8G memory as now half of the swap device
is pinned by HAS_CACHE. There was a random scan in the old allocator that
may reclaim part of the HAS_CACHE by luck, but it's unreliable.
The new allocator's added reclaim of full clusters when device is low on
usable slots. But when multiple CPUs are seeing the device is low on
usable slots at the same time, they ran into a thundering herd problem.
This is an observable problem on large machine with mass parallel
workload, as full cluster reclaim is slower on large swap device and
higher number of CPUs will also make things worse.
Testing using a 128G ZRAM on a 48c96t system. When the swap device is
very close to full (eg. 124G / 128G), running build linux kernel with
make -j96 in a 1G memory cgroup will hung (not a softlockup though)
spinning in full cluster reclaim for about ~5min before go OOM.
To solve this, split the full reclaim into two parts:
- Instead of do a synchronous aggressively reclaim when device is low,
do only one aggressively reclaim when device is strictly full with a
kworker. This still ensures in worst case the device won't be unusable
because of HAS_CACHE slots.
- To avoid allocation (especially higher order) suffer from HAS_CACHE
filling up clusters and kworker not responsive enough, do one synchronous
scan every time the free list is drained, and only scan one cluster. This
is kind of similar to the random reclaim before, keeps the full clusters
rotated and has a minimal latency. This should provide a fair reclaim
strategy suitable for most workloads.
Link: https://lkml.kernel.org/r/20241022175512.10398-1-ryncsn@gmail.com Fixes: 2cacbdfdee65 ("mm: swap: add a adaptive full cluster cache reclaim") Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Wed, 23 Oct 2024 21:02:01 +0000 (10:02 +1300)]
mm: fix PSWPIN counter for large folios swap-in
Similar to PSWPOUT, we should count the number of base pages instead of
large folios.
Link: https://lkml.kernel.org/r/20241023210201.2798-1-21cnbao@gmail.com Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zi Yan [Wed, 23 Oct 2024 17:12:36 +0000 (13:12 -0400)]
mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
An anonymous large folio can be split into non order-0 folios,
try_to_map_unused_to_zeropage() should not VM_BUG_ON compound pages but
just return false. This fixes the crash when splitting anonymous large
folios to non order-0 folios.
Link: https://lkml.kernel.org/r/20241023171236.1122535-1-ziy@nvidia.com Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hao Ge [Sun, 20 Oct 2024 07:08:19 +0000 (15:08 +0800)]
mm/codetag: fix null pointer check logic for ref and tag
When we compile and load lib/slub_kunit.c,it will cause a panic.
The root cause is that __kmalloc_cache_noprof was directly called instead
of kmem_cache_alloc,which resulted in no alloc_tag being allocated.This
caused current->alloc_tag to be null,leading to a null pointer dereference
in alloc_tag_ref_set.
Despite the fact that my colleague Pei Xiao will later fix the code in
slub_kunit.c,we still need fix null pointer check logic for ref and tag to
avoid panic caused by a null pointer dereference.