Christoph Hellwig [Wed, 25 Jan 2023 13:34:30 +0000 (14:34 +0100)]
mpage: stop using bdev_{read,write}_page
Patch series "remove ->rw_page".
This series removes the ->rw_page block_device_operation, which is an old
and clumsy attempt at a simple read/write fast path for the block layer.
It isn't actually used by the fastest block layer operations that we
support (polled I/O through io_uring), but only used by the mpage buffered
I/O helpers which are some of the slowest I/O we have and do not make any
difference there at all, and zram which is a block device abused to
duplicate the zram functionality.
Given that zram is heavily used we need to make sure there is a good
replacement for synchronous I/O, so this series adds a new flag for
drivers that complete I/O synchronously and uses that flag to use on-stack
bios and synchronous submission for them in the swap code.
This patch (of 7):
These are micro-optimizations for synchronous I/O, which do not matter
compared to all the other inefficiencies in the legacy buffer_head based
mpage code.
Link: https://lkml.kernel.org/r/20230125133436.447864-1-hch@lst.de Link: https://lkml.kernel.org/r/20230125133436.447864-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Sat, 21 Jan 2023 07:10:47 +0000 (08:10 +0100)]
mm: move __remove_vm_area out of va_remove_mappings
__remove_vm_area is the only part of va_remove_mappings that requires a
vmap_area. Move the call out to the caller and only pass the vm_struct to
va_remove_mappings.
Link: https://lkml.kernel.org/r/20230121071051.1143058-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Sat, 21 Jan 2023 07:10:44 +0000 (08:10 +0100)]
mm: remove __vfree_deferred
Fold __vfree_deferred into vfree_atomic, and call vfree_atomic early on
from vfree if called from interrupt context so that the extra low-level
helper can be avoided.
Link: https://lkml.kernel.org/r/20230121071051.1143058-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Sat, 21 Jan 2023 07:10:43 +0000 (08:10 +0100)]
mm: remove __vfree
__vfree is a subset of vfree that just skips a few checks, and which is
only used by vfree and an error cleanup path. Fold __vfree into vfree and
switch the only other caller to call vfree() instead.
Link: https://lkml.kernel.org/r/20230121071051.1143058-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 25 Jan 2023 13:44:34 +0000 (13:44 +0000)]
mm, compaction: finish pageblocks on complete migration failure
Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") address an issue where a pageblock selected by
fast_find_migrateblock() was ignored. Unfortunately, the same fix
resulted in numerous reports of khugepaged or kcompactd stalling for long
periods of time or consuming 100% of CPU.
Tracing showed that there was a lot of rescanning between a small subset
of pageblocks because the conditions for marking the block skip are not
met. The scan is not reaching the end of the pageblock because enough
pages were isolated but none were migrated successfully. Eventually it
circles back to the same block.
Pageblock skip tracking tries to minimise both latency and excessive
scanning but tracking exactly when a block is fully scanned requires an
excessive amount of state. This patch forcibly rescans a pageblock when
all isolated pages fail to migrate even though it could be for transient
reasons such as page writeback or page dirty. This will sometimes migrate
too many pages but pageblocks will be marked skip and forward progress
will be made.
"Usemen" from the mmtests configuration
workload-usemem-stress-numa-compact was used to stress compaction. The
compaction trace events were recorded using a 6.2-rc5 kernel that includes
commit 7efc3b726103 and count of unique ranges were measured. The top 5
ranges were
While this workload is very different than what the bugs reported, the
pattern of the same subset of blocks being repeatedly scanned is observed.
At one point, *only* the range range=(0x11b600 ~ 0x11b800) was scanned
for 2 seconds. 14 seconds passed between the first migration-related
event and the last.
With the series applied including this patch, the top 5 ranges were
Mel Gorman [Wed, 25 Jan 2023 13:44:33 +0000 (13:44 +0000)]
mm, compaction: finish scanning the current pageblock if requested
cc->finish_pageblock is set when the current pageblock should be rescanned
but fast_find_migrateblock can select an alternative block. Disable
fast_find_migrateblock when the current pageblock scan should be
completed.
Link: https://lkml.kernel.org/r/20230125134434.18017-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 25 Jan 2023 13:44:31 +0000 (13:44 +0000)]
mm, compaction: rename compact_control->rescan to finish_pageblock
Patch series "Fix excessive CPU usage during compaction".
Commit 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
fixed a problem where pageblocks found by fast_find_migrateblock() were
ignored. Unfortunately there were numerous bug reports complaining about high
CPU usage and massive stalls once 6.1 was released. Due to the severity,
the patch was reverted by Vlastimil as a short-term fix[1] to -stable.
The underlying problem for each of the bugs is suspected to be the
repeated scanning of the same pageblocks. This series should guarantee
forward progress even with commit 7efc3b726103. More information is in
the changelog for patch 4.
The rescan field was not well named albeit accurate at the time. Rename
the field to finish_pageblock to indicate that the remainder of the
pageblock should be scanned regardless of COMPACT_CLUSTER_MAX. The intent
is that pageblocks with transient failures get marked for skipping to
avoid revisiting the same pageblock.
Link: https://lkml.kernel.org/r/20230125134434.18017-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Chuyi Zhou <zhouchuyi@bytedance.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Tue, 24 Jan 2023 20:35:26 +0000 (21:35 +0100)]
kasan: reset page tags properly with sampling
The implementation of page_alloc poisoning sampling assumed that
tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations.
However, this is no longer the case since commit 70c248aca9e7 ("mm: kasan:
Skip unpoisoning of user pages").
This leads to kernel crashes when MTE-enabled userspace mappings are used
with Hardware Tag-Based KASAN enabled.
Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook().
Hyeonggon Yoo [Sat, 21 Jan 2023 16:50:54 +0000 (01:50 +0900)]
mm/page_owner: record single timestamp value for high order allocations
When allocating a high-order page, separate allocation timestamp is
recorded for each sub-page resulting in different timestamp values between
them.
This behavior is not consistent with the behavior when recording free
timestamp and caused confusion when analyzing memory dumps. Record single
timestamp for the entire allocation, aligning with the behavior for free
timestamps.
Link: https://lkml.kernel.org/r/20230121165054.520507-1-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiaqi Yan [Fri, 20 Jan 2023 03:46:22 +0000 (03:46 +0000)]
mm: memory-failure: document memory failure stats
Add documentation for memory_failure's per NUMA node sysfs entries
Link: https://lkml.kernel.org/r/20230120034622.2698268-4-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiaqi Yan [Fri, 20 Jan 2023 03:46:21 +0000 (03:46 +0000)]
mm: memory-failure: bump memory failure stats to pglist_data
Right before memory_failure finishes its handling, accumulate poisoned
page's resolution counters to pglist_data's memory_failure_stats, so as to
update the corresponding sysfs entries.
Tested:
1) Start an application to allocate memory buffer chunks
2) Convert random memory buffer addresses to physical addresses
3) Inject memory errors using EINJ at chosen physical addresses
4) Access poisoned memory buffer and recover from SIGBUS
5) Check counter values under
/sys/devices/system/node/node*/memory_failure/*
Link: https://lkml.kernel.org/r/20230120034622.2698268-3-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiaqi Yan [Fri, 20 Jan 2023 03:46:20 +0000 (03:46 +0000)]
mm: memory-failure: add memory failure stats to sysfs
Patch series "Introduce per NUMA node memory error statistics", v2.
Background
==========
In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over hardware patrol scrubber is the ability to
make statistics visible to system administrators. The statistics include
2 categories:
* Memory error statistics, for example, how many memory error are
encountered, how many of them are recovered by the kernel. Note these
memory errors are non-fatal to kernel: during the machine check
exception (MCE) handling kernel already classified MCE's severity to be
unnecessary to panic (but either action required or optional).
* Scanner statistics, for example how many times the scanner have fully
scanned a NUMA node, how many errors are first detected by the scanner.
The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.
Motivation
==========
Memory error stats are important to userspace but insufficient in kernel
today. Datacenter administrators can better monitor a machine's memory
health with the visible stats. For example, while memory errors are
inevitable on servers with 10+ TB memory, starting server maintenance when
there are only 1~2 recovered memory errors could be overreacting; in cloud
production environment maintenance usually means live migrate all the
workload running on the server and this usually causes nontrivial
disruption to the customer. Providing insight into the scope of memory
errors on a system helps to determine the appropriate follow-up action.
In addition, the kernel's existing memory error stats need to be
standardized so that userspace can reliably count on their usefulness.
Today kernel provides following memory error info to userspace, but they
are not sufficient or have disadvantages:
* HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
not per NUMA node stats though
* ras:memory_failure_event: only available after explicitly enabled
* /dev/mcelog provides many useful info about the MCEs, but doesn't
capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text
Exposing memory error stats is also a good start for the in-kernel memory
error detector. Today the data source of memory error stats are either
direct memory error consumption, or hardware patrol scrubber detection
(either signaled as UCNA or SRAO). Once in-kernel memory scanner is
implemented, it will be the main source as it is usually configured to
scan memory DIMMs constantly and faster than hardware patrol scrubber.
How Implemented
===============
As Naoya pointed out [2], exposing memory error statistics to userspace is
useful independent of software or hardware scanner. Therefore we
implement the memory error statistics independent of the in-kernel memory
error detector. It exposes the following per NUMA node memory error
counters:
These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively. This approach can be
easier to extend for future use cases than /proc/meminfo, trace event, and
log. The following math holds for the statistics:
* total = recovered + ignored + failed + delayed
These memory error stats are reset during machine boot.
The 1st commit introduces these sysfs entries. The 2nd commit populates
memory error stats every time memory_failure attempts memory error
recovery. The 3rd commit adds documentations for introduced stats.
These counters describe how many raw pages are poisoned and after the
attempted recoveries by the kernel, their resolutions: how many are
recovered, ignored, failed, or delayed respectively. The following math
holds for the statistics:
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:27 +0000 (00:18 +0000)]
mm: multi-gen LRU: simplify lru_gen_look_around()
Update the folio generation in place with or without
current->reclaim_state->mm_walk. The LRU lock is held for longer, if
mm_walk is NULL and the number of folios to update is more than
PAGEVEC_SIZE.
This causes a measurable regression from the LRU lock contention during a
microbencmark. But a tiny regression is not worth the complexity.
T.J. Alumbaugh [Wed, 18 Jan 2023 00:18:23 +0000 (00:18 +0000)]
mm: multi-gen LRU: section for Bloom filters
Move Bloom filters code into a dedicated section. Improve the design doc
to explain Bloom filter usage and connection between aging and eviction in
their use.
SeongJae Park [Thu, 19 Jan 2023 01:38:30 +0000 (01:38 +0000)]
mm/damon/core: update monitoring results for new monitoring attributes
region->nr_accesses is the number of sampling intervals in the last
aggregation interval that access to the region has found, and region->age
is the number of aggregation intervals that its access pattern has
maintained. Hence, the real meaning of the two fields' values is
depending on current sampling and aggregation intervals.
This means the values need to be updated for every sampling and/or
aggregation intervals updates. As DAMON core doesn't, it is a duty of
in-kernel DAMON framework applications like DAMON sysfs interface, or the
userspace users.
Handling it in userspace or in-kernel DAMON application is complicated,
inefficient, and repetitive compared to doing the update in DAMON core.
Do the update in DAMON core.
SeongJae Park [Thu, 19 Jan 2023 01:38:29 +0000 (01:38 +0000)]
mm/damon: update comments in damon.h for damon_attrs
Patch series "mm/damon: misc fixes".
This patchset contains three miscellaneous simple fixes for DAMON online
tuning.
This patch (of 3):
Commit cbeaa77b0449 ("mm/damon/core: use a dedicated struct for monitoring
attributes") moved monitoring intervals from damon_ctx to a new struct,
damon_attrs, but a comment in the header file has not updated for the
change. Update it.
Waiman Long [Thu, 19 Jan 2023 04:01:11 +0000 (23:01 -0500)]
mm/kmemleak: fix UAF bug in kmemleak_scan()
Commit 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object
iteration loop of kmemleak_scan()") fixes soft lockup problem in
kmemleak_scan() by periodically doing a cond_resched(). It does take a
reference of the current object before doing it. Unfortunately, if the
object has been deleted from the object_list, the next object pointed to
by its next pointer may no longer be valid after coming back from
cond_resched(). This can result in use-after-free and other nasty
problem.
Fix this problem by adding a del_state flag into kmemleak_object structure
to synchronize the object deletion process between kmemleak_cond_resched()
and __remove_object() to make sure that the object remained in the
object_list in the duration of the cond_resched() call.
Link: https://lkml.kernel.org/r/20230119040111.350923-3-longman@redhat.com Fixes: 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF", v2.
It was found that a KASAN use-after-free error was reported in the
kmemleak_scan() function. After further examination, it is believe that
even though a reference is taken from the current object, it does not
prevent the object pointed to by the next pointer from going away after a
cond_resched().
To fix that, additional flags are added to make sure that the current
object won't be removed from the object_list during the duration of the
cond_resched() to ensure the validity of the next pointer.
While making the change, I also simplify the current usage of
kmemleak_cond_resched() to make it easier to understand.
This patch (of 2):
The presence of a pinned argument and the 64k loop count make
kmemleak_cond_resched() a bit more complex to read. The pinned argument
is used only by first kmemleak_scan() loop.
Simplify the usage of kmemleak_cond_resched() by removing the pinned
argument and always do a get_object()/put_object() sequence. In addition,
the 64k loop is removed by using need_resched() to decide if
kmemleak_cond_resched() should be called.
Joey Gouly [Thu, 19 Jan 2023 16:03:43 +0000 (16:03 +0000)]
mm: implement memory-deny-write-execute as a prctl
Patch series "mm: In-kernel support for memory-deny-write-execute (MDWE)",
v2.
The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [2], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but subsequently
changed to read-only. Therefore the filter simply rejects any
mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI support
(Branch Target Identification), the dynamic loader cannot change an ELF
section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect(). For
libraries, it can resort to unmapping and re-mapping but for the main
executable it does not have a file descriptor. The original bug report in
the Red Hat bugzilla - [3] - and subsequent glibc workaround for libraries
- [4].
This series adds in-kernel support for this feature as a prctl
PR_SET_MDWE, that is inherited on fork(). The prctl denies PROT_WRITE |
PROT_EXEC mappings. Like the systemd BPF filter it also denies adding
PROT_EXEC to mappings. However unlike the BPF filter it only denies it if
the mapping didn't previous have PROT_EXEC. This allows to PROT_EXEC ->
PROT_EXEC | PROT_BTI with mprotect(), which is a problem with the BPF
filter.
This patch (of 2):
The aim of such policy is to prevent a user task from creating an
executable mapping that is also writeable.
An example of mmap() returning -EACCESS if the policy is enabled:
The BPF filter that systemd MDWE uses is stateless, and disallows
mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
be enabled if it was already PROT_EXEC, which allows the following case:
Herton R. Krzesinski [Mon, 16 Jan 2023 22:49:21 +0000 (19:49 -0300)]
tools/mm: allow users to provide additional cflags/ldflags
Right now there is no way to provide additional cflags/ldflags when
building tools/vm binaries. And using eg. make CFLAGS=<options> will
override the CFLAGS being set in the Makefile, making the build fail since
it requires the include of the ../lib dir (for libapi).
This change then allows you to specify:
CFLAGS=<options> LDFLAGS=<options> make V=1 -C tools/vm
And the options will be correctly appended as can be seen from the
make output.
Link: https://lkml.kernel.org/r/20230116224921.4106324-1-herton@redhat.com Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Justin Forbes <jforbes@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Scott Weaver <scweaver@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Deming Wang [Wed, 18 Jan 2023 02:54:03 +0000 (21:54 -0500)]
Documentation: mm: use `s/higmem/highmem/` fix typo for highmem
We should use highmem replace higmem.
Link: https://lkml.kernel.org/r/20230118025403.1531-1-wangdeming@inspur.com Signed-off-by: Deming Wang <wangdeming@inspur.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Levi Yun [Wed, 18 Jan 2023 08:05:23 +0000 (17:05 +0900)]
mm/cma: fix potential memory loss on cma_declare_contiguous_nid
Suppose memblock_alloc_range_nid() with highmem_start succeeds when
cma_declare_contiguous_nid is called with !fixed on a 32-bit system with
PHYS_ADDR_T_64BIT enabled with memblock.bottom_up == false.
But the next trial to memblock_alloc_range_nid() to allocate in [SIZE_4G,
limits) nullifies former successfully allocated addr and it retries
memblock_alloc_ragne_nid().
In this situation, the first successfully allocated address area is lost.
Change the order of allocation (SIZE_4G, high_memory and base) and check
whether the allocated succeeded to prevent potential memory loss.
Link: https://lkml.kernel.org/r/20230118080523.44522-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Yang [Wed, 18 Jan 2023 12:13:03 +0000 (20:13 +0800)]
swap_state: update shadow_nodes for anonymous page
Shadow_nodes is for shadow nodes reclaiming of workingset handling, it is
updated when page cache add or delete since long time ago workingset only
supported page cache. But when workingset supports anonymous page
detection, we missied updating shadow nodes for it. This caused that
shadow nodes of anonymous page will never be reclaimd by
scan_shadow_nodes() even they use much memory and system memory is tense.
So update shadow_nodes of anonymous page when swap cache is add or delete
by calling xas_set_update(..workingset_update_node).
Link: https://lkml.kernel.org/r/202301182013032211005@zte.com.cn Fixes: aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU") Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Wed, 18 Jan 2023 00:52:10 +0000 (09:52 +0900)]
zsmalloc: set default zspage chain size to 8
This changes key characteristics (pages per-zspage and objects per-zspage)
of a number of size classes which in results in different pool
configuration. With zspage chain size of 8 we have more size clases
clusters (123) and higher huge size class watermark (3632 bytes).
Please read zsmalloc documentation for more details.
Sergey Senozhatsky [Wed, 18 Jan 2023 00:52:09 +0000 (09:52 +0900)]
zsmalloc: make zspage chain size configurable
Remove hard coded limit on the maximum number of physical pages
per-zspage.
This will allow tuning of zsmalloc pool as zspage chain size changes
`pages per-zspage` and `objects per-zspage` characteristics of size
classes which also affects size classes clustering (the way size classes
are merged).
Sergey Senozhatsky [Wed, 18 Jan 2023 00:52:07 +0000 (09:52 +0900)]
zsmalloc: rework zspage chain size selection
Patch series "zsmalloc: make zspage chain size configurable".
Computers are bad at division. We currently decide the best zspage chain
size (max number of physical pages per-zspage) by looking at a `used
percentage` value. This is not enough as we lose precision during usage
percentage calculations For example, let's look at size class 208:
Anshuman Khandual [Thu, 5 Jan 2023 08:25:06 +0000 (13:55 +0530)]
mm/page_alloc: use deferred_pages_enabled() wherever applicable
Instead of directly accessing static deferred_pages, replace such
instances with the helper deferred_pages_enabled(). No functional change
is intended.
Link: https://lkml.kernel.org/r/20230105082506.241529-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pasha Tatashin [Tue, 17 Jan 2023 20:46:17 +0000 (20:46 +0000)]
mm/page_ext: init page_ext early if there are no deferred struct pages
page_ext must be initialized after all struct pages are initialized.
Therefore, page_ext is initialized after page_alloc_init_late(), and can
optionally be initialized earlier via early_page_ext kernel parameter
which as a side effect also disables deferred struct pages.
Allow to automatically init page_ext early when there are no deferred
struct pages in order to be able to use page_ext during kernel boot and
track for example page allocations early.
Colin Ian King [Mon, 16 Jan 2023 16:43:32 +0000 (16:43 +0000)]
mm/secretmem: remove redundant initiialization of pointer file
The pointer file is being initialized with a value that is never read, it
is being re-assigned later on. Clean up code by removing the redundant
initialization.
Link: https://lkml.kernel.org/r/20230116164332.79500-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Mon, 16 Jan 2023 19:39:39 +0000 (19:39 +0000)]
filemap: convert filemap_map_pmd() to take a folio
Patch series "Some more filemap folio conversions".
Three more places which could easily be converted to folios. The third
one fixes a minor bug in readahead_expand(), but it's only a performance
bug and there are few users of readahead_expand(), so I don't think it's
worth backporting.
This patch (of 3):
Save a few calls to compound_head(). We specify exactly which page from
the folio to use by passing in start_pgoff, which means this will work for
a folio which is larger than PMD size. The rest of the VM isn't prepared
for that yet, but now this function is.
Matthew Wilcox (Oracle) [Mon, 16 Jan 2023 19:29:59 +0000 (19:29 +0000)]
rmap: add folio parameter to __page_set_anon_rmap()
Avoid the compound_head() call in PageAnon() by passing in the folio that
all callers have. Also save me from wondering whether page->mapping can
ever be overwritten on a tail page (I don't think it can, but I'm not 100%
sure).
Matthew Wilcox (Oracle) [Mon, 16 Jan 2023 19:18:13 +0000 (19:18 +0000)]
mm: use a folio in copy_present_pte()
We still have to keep the page around because we need to know which page
in the folio we're copying, but we can replace five implict calls to
compound_head() with one.
Matthew Wilcox (Oracle) [Mon, 16 Jan 2023 19:18:11 +0000 (19:18 +0000)]
mm: convert wp_page_copy() to use folios
Use new_folio instead of new_page throughout, because we allocated it
and know it's an order-0 folio. Most old_page uses become old_folio,
but use vmf->page where we need the precise page.
Matthew Wilcox (Oracle) [Mon, 16 Jan 2023 19:18:09 +0000 (19:18 +0000)]
mm: add vma_alloc_zeroed_movable_folio()
Replace alloc_zeroed_user_highpage_movable(). The main difference is
returning a folio containing a single page instead of returning the page,
but take the opportunity to rename the function to match other allocation
functions a little better and rewrite the documentation to place more
emphasis on the zeroing rather than the highmem aspect.
nilfs2: convert nilfs_clear_dirty_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). This change removes 2 calls to
compound_head().
nilfs2: convert nilfs_copy_dirty_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). This change removes 8 calls to
compound_head().
nilfs2: convert nilfs_btree_lookup_dirty_buffers() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). This change removes 1 call to
compound_head().
nilfs2: convert nilfs_lookup_dirty_node_buffers() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). This change removes 1 call to
compound_head().
nilfs2: convert nilfs_lookup_dirty_data_buffers() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for
the removal of find_get_pages_range_tag(). This change removes 4 calls
to compound_head().
gfs2: convert gfs2_write_cache_jdata() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pgaes_range_tag(). This change removes 8 calls to
compound_head().
Also had to modify and rename gfs2_write_jdata_pagevec() to take in and
utilize folio_batch rather than pagevec and use folios rather than pages.
gfs2_write_jdata_batch() now supports large folios.
f2fs: convert f2fs_sync_meta_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). This change removes 5 calls to
compound_head().
Initially the function was checking if the previous page index is truly
the previous page i.e. 1 index behind the current page. To convert to
folios and maintain this check we need to make the check folio->index !=
prev + folio_nr_pages(previous folio) since we don't know how many pages
are in a folio.
At index i == 0 the check is guaranteed to succeed, so to workaround
indexing bounds we can simply ignore the check for that specific index.
This makes the initial assignment of prev trivial, so I removed that as
well.
Also modify a comment in commit_checkpoint for consistency.
f2fs: convert f2fs_write_cache_pages() to use filemap_get_folios_tag()
Convert the function to use a folio_batch instead of pagevec. This is in
preparation for the removal of find_get_pages_range_tag().
Also modified f2fs_all_cluster_page_ready to take in a folio_batch instead
of pagevec. This does NOT support large folios. The function currently
only utilizes folios of size 1 so this shouldn't cause any issues right
now.
This version of the patch limits the number of pages fetched to
F2FS_ONSTACK_PAGES. If that ever happens, update the start index here
since filemap_get_folios_tag() updates the index to be after the last
found folio, not necessarily the last used page.
ext4: convert mpage_prepare_extent_to_map() to use filemap_get_folios_tag()
Convert the function to use folios throughout. This is in preparation for
the removal of find_get_pages_range_tag(). Now supports large folios.
This change removes 11 calls to compound_head().
cifs: convert wdata_alloc_and_fillpages() to use filemap_get_folios_tag()
This is in preparation for the removal of find_get_pages_range_tag(). Now
also supports the use of large folios.
Since tofind might be larger than the max number of folios in a
folio_batch (15), we loop through filling in wdata->pages pulling more
batches until we either reach tofind pages or run out of folios.
This function may not return all pages in the last found folio before
tofind pages are reached.
page-writeback: convert write_cache_pages() to use filemap_get_folios_tag()
Convert function to use folios throughout. This is in preparation for the
removal of find_get_pages_range_tag(). This change removes 8 calls to
compound_head(), and the function now supports large folios.
This is the equivalent of find_get_pages_range_tag(), except for folios
instead of pages.
One noteable difference is filemap_get_folios_tag() does not take in a
maximum pages argument. It instead tries to fill a folio batch and stops
either once full (15 folios) or reaching the end of the search range.
The new function supports large folios, the initial function did not since
all callers don't use large folios.
Patch series "Convert to filemap_get_folios_tag()", v5.
This patch series replaces find_get_pages_range_tag() with
filemap_get_folios_tag(). This also allows the removal of multiple calls
to compound_head() throughout.
It also makes a good chunk of the straightforward conversions to folios,
and takes the opportunity to introduce a function that grabs a folio from
the pagecache.
This patch (of 23):
Add function filemap_grab_folio() to grab a folio from the page cache.
This function is meant to serve as a folio replacement for
grab_cache_page, and is used to facilitate the removal of
find_get_pages_range_tag().
David Stevens [Fri, 13 Jan 2023 02:30:11 +0000 (11:30 +0900)]
mm: fix khugepaged with shmem_enabled=advise
Pass vm_flags as a parameter to shmem_is_huge, rather than reading the
flags from the vm_area_struct in question. This allows the updated flags
from hugepage_madvise to be passed to the check, which is necessary
because madvise does not update the vm_area_struct's flags until after
hugepage_madvise returns.
This fixes an issue when shmem_enabled=madvise, where MADV_HUGEPAGE on
shmem was not able to register the mm_struct with khugepaged. Prior to cd89fb065099, the mm_struct was registered by MADV_HUGEPAGE regardless of
the value of shmem_enabled (which was only checked when scanning vmas).
Link: https://lkml.kernel.org/r/20230113023011.1784015-1-stevensd@google.com Fixes: cd89fb065099 ("mm,thp,shmem: make khugepaged obey tmpfs mount flags") Signed-off-by: David Stevens <stevensd@chromium.org> Cc: David Stevens <stevensd@chromium.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
NeilBrown [Fri, 13 Jan 2023 11:12:17 +0000 (11:12 +0000)]
mm: discard __GFP_ATOMIC
__GFP_ATOMIC serves little purpose. Its main effect is to set
ALLOC_HARDER which adds a few little boosts to increase the chance of an
allocation succeeding, one of which is to lower the water-mark at which it
will succeed.
It is *always* paired with __GFP_HIGH which sets ALLOC_HIGH which also
adjusts this watermark. It is probable that other users of __GFP_HIGH
should benefit from the other little bonuses that __GFP_ATOMIC gets.
__GFP_ATOMIC also gives a warning if used with __GFP_DIRECT_RECLAIM.
There is little point to this. We already get a might_sleep() warning if
__GFP_DIRECT_RECLAIM is set.
__GFP_ATOMIC allows the "watermark_boost" to be side-stepped. It is
probable that testing ALLOC_HARDER is a better fit here.
__GFP_ATOMIC is used by tegra-smmu.c to check if the allocation might
sleep. This should test __GFP_DIRECT_RECLAIM instead.
This patch:
- removes __GFP_ATOMIC
- allows __GFP_HIGH allocations to ignore watermark boosting as well
as GFP_ATOMIC requests.
- makes other adjustments as suggested by the above.
The net result is not change to GFP_ATOMIC allocations. Other
allocations that use __GFP_HIGH will benefit from a few different extra
privileges. This affects:
xen, dm, md, ntfs3
the vermillion frame buffer
hibernation
ksm
swap
all of which likely produce more benefit than cost if these selected
allocation are more likely to succeed quickly.
Mel Gorman [Fri, 13 Jan 2023 11:12:16 +0000 (11:12 +0000)]
mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves
GFP_ATOMIC allocations get flagged ALLOC_HARDER which is a vague
description. In preparation for the removal of GFP_ATOMIC redefine
__GFP_ATOMIC to simply mean non-blocking and renaming ALLOC_HARDER to
ALLOC_NON_BLOCK accordingly. __GFP_HIGH is required for access to
reserves but non-blocking is granted more access. For example, GFP_NOWAIT
is non-blocking but has no special access to reserves. A __GFP_NOFAIL
blocking allocation is granted access similar to __GFP_HIGH if the only
alternative is an OOM kill.
Link: https://lkml.kernel.org/r/20230113111217.14134-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 13 Jan 2023 11:12:14 +0000 (11:12 +0000)]
mm/page_alloc: explicitly record high-order atomic allocations in alloc_flags
A high-order ALLOC_HARDER allocation is assumed to be atomic. While that
is accurate, it changes later in the series. In preparation, explicitly
record high-order atomic allocations in gfp_to_alloc_flags().
Link: https://lkml.kernel.org/r/20230113111217.14134-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 13 Jan 2023 11:12:13 +0000 (11:12 +0000)]
mm/page_alloc: treat RT tasks similar to __GFP_HIGH
RT tasks are allowed to dip below the min reserve but ALLOC_HARDER is
typically combined with ALLOC_MIN_RESERVE so RT tasks are a little
unusual. While there is some justification for allowing RT tasks access
to memory reserves, there is a strong chance that a RT task that is also
under memory pressure is at risk of missing deadlines anyway. Relax how
much reserves an RT task can access by treating it the same as __GFP_HIGH
allocations.
Note that in a future kernel release that the RT special casing will be
removed. Hard realtime tasks should be locking down resources in advance
and ensuring enough memory is available. Even a soft-realtime task like
audio or video live decoding which cannot jitter should be allocating both
memory and any disk space required up-front before the recording starts
instead of relying on reserves. At best, reserve access will only delay
the problem by a very short interval.
Link: https://lkml.kernel.org/r/20230113111217.14134-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 13 Jan 2023 11:12:12 +0000 (11:12 +0000)]
mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVE
Patch series "Discard __GFP_ATOMIC", v3.
Neil's patch has been residing in mm-unstable as commit 2fafb4fe8f7a ("mm:
discard __GFP_ATOMIC") for a long time and recently brought up again.
Most recently, I was worried that __GFP_HIGH allocations could use
high-order atomic reserves which is unintentional but there was no
response so lets revisit -- this series reworks how min reserves are used,
protects highorder reserves and then finishes with Neil's patch with very
minor modifications so it fits on top.
There was a review discussion on renaming __GFP_DIRECT_RECLAIM to
__GFP_ALLOW_BLOCKING but I didn't think it was that big an issue and is
orthogonal to the removal of __GFP_ATOMIC.
There were some concerns about how the gfp flags affect the min reserves
but it never reached a solid conclusion so I made my own attempt.
The series tries to iron out some of the details on how reserves are used.
ALLOC_HIGH becomes ALLOC_MIN_RESERVE and ALLOC_HARDER becomes
ALLOC_NON_BLOCK and documents how the reserves are affected. For example,
ALLOC_NON_BLOCK (no direct reclaim) on its own allows 25% of the min
reserve. ALLOC_MIN_RESERVE (__GFP_HIGH) allows 50% and both combined
allows deeper access again. ALLOC_OOM allows access to 75%.
High-order atomic allocations are explicitly handled with the caveat that
no __GFP_ATOMIC flag means that any high-order allocation that specifies
GFP_HIGH and cannot enter direct reclaim will be treated as if it was
GFP_ATOMIC.
This patch (of 6):
__GFP_HIGH aliases to ALLOC_HIGH but the name does not really hint what it
means. As ALLOC_HIGH is internal to the allocator, rename it to
ALLOC_MIN_RESERVE to document that the min reserves can be depleted.
Pasha Tatashin [Fri, 13 Jan 2023 15:42:53 +0000 (15:42 +0000)]
mm/page_ext: do not allocate space for page_ext->flags if not needed
There is 8 byte page_ext->flags field allocated per page whenever
CONFIG_PAGE_EXTENSION is enabled. However, not every user of page_ext
uses flags. Therefore, check whether flags is needed at least by one user
and if so allocate space for it.
For example when page_table_check is enabled, on a machine with 128G
of memory before the fix:
[ 2.244288] allocated 536870912 bytes of page_ext
after the fix:
[ 2.160154] allocated 268435456 bytes of page_ext
Also, add a kernel-doc comment before page_ext_operations that describes
the fields, and remove check if need() is set, as that is now a required
field.
[pasha.tatashin@soleen.com: address comments from Mike Rapoport] Link: https://lkml.kernel.org/r/20230117202103.1412449-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20230113154253.92480-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Fri, 13 Jan 2023 17:10:25 +0000 (18:10 +0100)]
xtensa/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by using bit 1. This bit
should be safe to use for our usecase.
Most importantly, we can still distinguish swap PTEs from PAGE_NONE PTEs
(see pte_present()) and don't use one of the two reserved attribute masks
(1101 and 1111). Attribute mask 1100 and 1110 now identify swap PTEs.
While at it, remove SWP_TYPE_BITS (not really helpful as it's not used in
the actual swap macros) and mask the type in __swp_entry().
David Hildenbrand [Fri, 13 Jan 2023 17:10:24 +0000 (18:10 +0100)]
x86/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE also on 32bit
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE just like we already do on
x86-64. After deciphering the PTE layout it becomes clear that there are
still unused bits for 2-level and 3-level page tables that we should be
able to use. Reusing a bit avoids stealing one bit from the swap offset.
While at it, mask the type in __swp_entry(); use some helper definitions
to make the macros easier to grasp.
Link: https://lkml.kernel.org/r/20230113171026.582290-25-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Fri, 13 Jan 2023 17:10:23 +0000 (18:10 +0100)]
um/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE by using bit 10, which is yet
unused for swap PTEs.
The pte_mkuptodate() is a bit weird in __pte_to_swp_entry() for a swap PTE
... but it only messes with bit 1 and 2 and there is a comment in
set_pte(), so leave these bits alone.
While at it, mask the type in __swp_entry().
Link: https://lkml.kernel.org/r/20230113171026.582290-24-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>