Vlastimil Babka [Mon, 7 Aug 2023 21:00:36 +0000 (23:00 +0200)]
maple_tree: replace preallocation with slub percpu array prefill
With the percpu array we can try not doing the preallocations in the maple
tree, and instead make sure the percpu array is prefilled, using GFP_ATOMIC
in the places that relied on the preallocation (in case we miss the array
or fail its trylock), i.e. mas_store_prealloc(). For now, simply add
__GFP_NOFAIL there as well.
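A rough sketch of the pattern (the prefill call is the one added by the slub
percpu array patch below; the node count shown is illustrative):

  /* instead of mas_preallocate(), make sure the per-cpu array has nodes */
  kmem_cache_prefill_percpu_array(maple_node_cache,
                                  1 + mas_mt_height(mas) * 3, GFP_KERNEL);

  /* later, in mas_store_prealloc(), allocations that used to consume the
   * preallocated nodes fall back to GFP_ATOMIC | __GFP_NOFAIL if the
   * per-cpu array misses or its trylock fails */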
Liam R. Howlett [Tue, 8 Aug 2023 18:54:27 +0000 (14:54 -0400)]
maple_tree: Remove MA_STATE_PREALLOC
MA_STATE_PREALLOC was added to catch any writes that try to allocate when
the maple state is being used in preallocation mode. This can safely be
removed in favour of the percpu array of nodes.
Note that mas_expected_entries() still expects no allocations during
operation and so MA_STATE_BULK can be used in place of preallocations
for this case, which is primarily used for forking.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka [Wed, 15 Nov 2023 10:38:15 +0000 (11:38 +0100)]
mm/slub: add opt-in slub_percpu_array
kmem_cache_setup_percpu_array() will allocate a per-cpu array for
caching alloc/free objects of given size for the cache. The cache
has to be created with the SLAB_NO_MERGE flag.
When empty, half of the array is filled by an internal bulk alloc
operation. When full, half of the array is flushed by an internal bulk
free operation.
The bulk operations exposed to slab users also try to utilize the array
when possible, but leave the array empty or full and use the bulk
alloc/free only for the operation itself. If kmemcg is enabled and
active, bulk freeing skips the array as it would be less efficient.
The locking is copied from the page allocator's pcplists, based on
embedded spin locks. Interrupts are not disabled, only preemption (cpu
migration on RT). A trylock is attempted to avoid deadlock due to an
interrupt; a trylock failure means the array is bypassed.
Sysfs stat counters alloc_cpu_cache and free_cpu_cache count operations
that used the percpu array.
kmem_cache_prefill_percpu_array() can be called to fill the array on the
current cpu to at least the given number of objects. However this is
only opportunistic as there's no cpu pinning between the prefill and
usage, and trylocks may fail when the usage is in an irq handler.
Therefore allocations cannot rely on the array for success even after
the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
allocations should be acceptable after the refill.
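A minimal usage sketch (the signatures are assumptions based on the
description above; the cache name, object type and array size are
illustrative):

  /* the cache must not be mergeable */
  s = kmem_cache_create("my_nodes", sizeof(struct my_node), 0,
                        SLAB_NO_MERGE, NULL);

  /* opt in to a per-cpu array caching up to 32 objects */
  kmem_cache_setup_percpu_array(s, 32);

  /* before a non-sleeping section, try to have at least @count objects
   * cached on this cpu; this is opportunistic, misses are still possible */
  kmem_cache_prefill_percpu_array(s, count, GFP_KERNEL);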
Mark SLAB_DEPRECATED as BROKEN so the new prefill call doesn't need to be
reimplemented there and the bots don't complain. SLAB has percpu arrays by
design, but their sizes are determined internally and there is no prefill
operation.
More TODO/FIXMEs:
- NUMA awareness - preferred node currently ignored, __GFP_THISNODE not
honored
- slub_debug - will not work for allocations from the array. Normally in
the SLUB implementation, enabling slub_debug for a cache effectively
disables all the fast paths and makes every operation work with the shared
list, but that could lead to depleting the reserves if we ignore the
prefill and use GFP_ATOMIC. Needs more thought.
Vlastimil Babka [Tue, 14 Nov 2023 21:12:47 +0000 (22:12 +0100)]
mm/slub: free KFENCE objects in slab_free_hook()
When freeing an object that was allocated from KFENCE, we do that in the
slowpath __slab_free(), relying on the fact that a KFENCE "slab" cannot be
the cpu slab, so the fastpath has to fall back to the slowpath.
This optimization doesn't help much though, because is_kfence_address()
is checked earlier anyway during the free hook processing or detached
freelist building. Thus we can simplify the code by making the
slab_free_hook() free the KFENCE object immediately, similarly to KASAN
quarantine.
In slab_free_hook() we can place kfence_free() above init processing, as
callers have been making sure to set init to false for KFENCE objects.
This simplifies slab_free(). It also places kfence_free() above
kasan_slab_free(), which is OK as that skips KFENCE objects anyway.
While at it also determine the init value in slab_free_freelist_hook()
outside of the loop.
This change will also make introducing per cpu array caches easier.
Vlastimil Babka [Fri, 3 Nov 2023 19:24:51 +0000 (20:24 +0100)]
mm/slub: handle bulk and single object freeing separately
Until now we have had a single function, slab_free(), handling both single
object freeing and bulk freeing with the necessary hooks, the latter case
requiring slab_free_freelist_hook(). It is, however, better to distinguish
the two scenarios for the following reasons:
- code simpler to follow for the single object case
- better code generation - although inlining should eliminate the
slab_free_freelist_hook() in case no debugging options are enabled, it
seems it's not perfect. When e.g. KASAN is enabled, we're imposing
additional unnecessary overhead for single object freeing.
- preparation to add percpu array caches in later patches
Therefore, simplify slab_free() for the single object case by dropping
unnecessary parameters and calling only slab_free_hook() instead of
slab_free_freelist_hook(). Rename the bulk variant to slab_free_bulk()
and adjust callers accordingly.
While at it, flip (and document) slab_free_hook() return value so that
it returns true when the freeing can proceed, which matches the logic of
slab_free_freelist_hook() and is not confusingly the opposite.
Additionally, we can simplify a bit by changing the tail parameter of
do_slab_free() when freeing a single object - instead of NULL we can set it
equal to head.
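Roughly, the single-object path becomes (simplified sketch; details may
differ from the actual patch):

  static __always_inline void slab_free(struct kmem_cache *s,
                                        struct slab *slab, void *object,
                                        unsigned long addr)
  {
          /* slab_free_hook() now returns true when the free may proceed */
          if (likely(slab_free_hook(s, object, slab_want_init_on_free(s))))
                  do_slab_free(s, slab, object, object, 1, addr);
  }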
bloat-o-meter shows small code reduction with a .config that has KASAN
etc disabled:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-118 (-118)
Function old new delta
kmem_cache_alloc_bulk 1203 1196 -7
kmem_cache_free 861 835 -26
__kmem_cache_free 741 704 -37
kmem_cache_free_bulk 911 863 -48
Vlastimil Babka [Thu, 2 Nov 2023 15:34:39 +0000 (16:34 +0100)]
mm/slub: introduce __kmem_cache_free_bulk() without free hooks
Currently, when __kmem_cache_alloc_bulk() fails, it frees back the
objects that were allocated before the failure, using
kmem_cache_free_bulk(). Because kmem_cache_free_bulk() calls the free
hooks (kasan etc.) and those expect objects processed by the post alloc
hooks, slab_post_alloc_hook() is called before kmem_cache_free_bulk().
This is wasteful, although not a big concern in practice for the very
rare error path. But in order to efficiently handle percpu array batch
refill and free in the following patch, we will also need a variant of
kmem_cache_free_bulk() that avoids the free hooks. So introduce it first
and use it in the error path too.
As a consequence, __kmem_cache_alloc_bulk() no longer needs the objcg
parameter, remove it.
Zhongkun He [Tue, 24 Oct 2023 14:27:06 +0000 (22:27 +0800)]
mm: zswap: fix the lack of page lru flag in zswap_writeback_entry
zswap_writeback_entry() adds a page to the swap cache, decompresses the
entry data into the page, and issues a bio write to write the page back to
the swap device. It moves the page to the tail of the LRU list via
SetPageReclaim(page) and folio_rotate_reclaimable().
Currently, about half of the pages fail to move to the tail of the LRU list
because the LRU flag is not set on them: the page is not on an LRU list yet
but still sitting in the cpu_fbatches. Fix that.
Link: https://lkml.kernel.org/r/20231024142706.195517-1-hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liu Shixin [Wed, 30 Aug 2023 03:56:00 +0000 (11:56 +0800)]
mm: vmscan: try to reclaim swapcache pages if no swap space
When the swap devices' space is exhausted, only file pages can be
reclaimed. But there are still some swapcache pages on the anon LRU list.
This can lead to a premature out-of-memory.
The problem can be reproduced as follows: first, set up a 9MB disk swap
space, then create a cgroup with a 10MB memory limit, then run a program
that allocates about 15MB of memory. The problem occurs only occasionally
and may need about 100 attempts to trigger [1].
Fix it by checking the number of swapcache pages in
can_reclaim_anon_pages(). If the number is not zero, return true and set
swapcache_only to 1. When scanning the anon LRU list in swapcache_only
mode, non-swapcache pages are skipped during isolation to improve reclaim
efficiency.
However, in swapcache_only mode, the scan count is still incremented when
scanning non-swapcache pages, because non-swapcache pages vastly outnumber
swapcache pages in this mode; if the skipped non-swapcache pages were not
counted, the page scan in isolate_lru_folios() could eventually lead to a
hung task, just as Sachin reported [2].
By the way, since there are enough rounds of memory reclaim before OOM,
there is no need to isolate too many swapcache pages in one pass.
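A rough sketch of the added check (memcg-aware details omitted; the
swapcache_only field is the one described above):

  /* in can_reclaim_anon_pages(), once the plain swap-space check fails */
  if (get_nr_swap_pages() > 0)
          return true;
  if (total_swapcache_pages() > 0) {
          /* no free swap left, but swapcache folios can still be dropped */
          sc->swapcache_only = 1;
          return true;
  }
  return false;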
Zhaoyang Huang [Thu, 11 May 2023 05:22:30 +0000 (13:22 +0800)]
mm: optimization on page allocation when CMA enabled
According to the current CMA utilization policy, an alloc_pages(GFP_USER)
call could 'steal' UNMOVABLE & RECLAIMABLE page blocks with the help of CMA
(it passes zone_watermark_ok() by counting CMA pages in, but then uses
UNMOVABLE & RECLAIMABLE blocks in rmqueue()), which could lead to a
subsequent alloc_pages(GFP_KERNEL) failure. Solve this by introducing a
second watermark check for GFP_MOVABLE allocations, which makes the
allocation use CMA when appropriate.
SeongJae Park [Sun, 19 Nov 2023 17:15:29 +0000 (17:15 +0000)]
mm/damon/core-test: test damon_split_region_at()'s access rate copying
damon_split_region_at() should set the access rate related fields of the
resulting regions to the same values. That may be forgotten, and there
actually was such a mistake before. Test it with the unit test case for
the function.
Link: https://lkml.kernel.org/r/20231119171529.66863-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Juntong Deng [Sun, 19 Nov 2023 20:46:29 +0000 (04:46 +0800)]
kasan: improve free meta storage in Generic KASAN
Currently, free meta can only be stored in the object if the object is not
smaller than the free meta.
After the improvement, even when the object is smaller than the free meta,
it is still possible to store part of the free meta in the object, reducing
the increase in redzone size.
Peng Zhang [Mon, 20 Nov 2023 07:09:37 +0000 (15:09 +0800)]
maple_tree: simplify mas_leaf_set_meta()
Now it seems that the incoming 'end' is already pointing to the last item,
so we can simplify this function, considering only whether the last slot
is being used. This has passed the maple tree test suite.
Peng Zhang [Mon, 20 Nov 2023 07:09:34 +0000 (15:09 +0800)]
maple_tree: avoid ascending when mas->min is also the parent's minimum
When the child node is the first child of its parent node, mas->min does
not need to be updated. This can reduce the number of ascending times
in some cases.
Peng Zhang [Mon, 20 Nov 2023 07:09:33 +0000 (15:09 +0800)]
maple_tree: move the check forward to avoid static check warning
Patch series "Some cleanups of maple tree", v2.
These are some small cleanups of maple tree.
This patch (of 5):
Put the check for gap before its reference to avoid Smatch static check
warnings. This is not a bug, it's just a validation program. Even with
this change, Smatch may still generate warnings because MT_BUG_ON()
doesn't necessarily stop the program. It may require fixing Smatch itself
to avoid these warnings.
Fabio M. De Francesco [Mon, 20 Nov 2023 14:28:23 +0000 (15:28 +0100)]
mm/page_poison: replace kmap_atomic() with kmap_local_page()
kmap_atomic() has been deprecated in favor of kmap_local_page().
Therefore, replace kmap_atomic() with kmap_local_page().
kmap_atomic() is implemented like a kmap_local_page() which also disables
page-faults and preemption (the latter only in !PREEMPT_RT kernels). The
kernel virtual addresses returned by these two APIs are only valid in the
context of the callers (i.e., they cannot be handed to other threads).
With kmap_local_page() the mappings are per thread and CPU local like in
kmap_atomic(); however, they can handle page-faults and can be called from
any context (including interrupts). The tasks that call kmap_local_page()
can be preempted and, when they are scheduled to run again, the kernel
virtual addresses are restored and are still valid.
The code blocks between the mappings and un-mappings do not rely on the
above-mentioned side effects of kmap_atomic(), so that a mere replacement
of the old API with the new one is all that they require (i.e., there is
no need to explicitly call pagefault_disable() and/or preempt_disable()).
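The conversion itself is mechanical; roughly (illustrative, not the exact
hunk):

  /* before */
  addr = kmap_atomic(page);
  memset(addr, PAGE_POISON, PAGE_SIZE);
  kunmap_atomic(addr);

  /* after */
  addr = kmap_local_page(page);
  memset(addr, PAGE_POISON, PAGE_SIZE);
  kunmap_local(addr);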
Fabio M. De Francesco [Mon, 20 Nov 2023 14:26:31 +0000 (15:26 +0100)]
mm/mempool: replace kmap_atomic() with kmap_local_page()
kmap_atomic() has been deprecated in favor of kmap_local_page().
Therefore, replace kmap_atomic() with kmap_local_page().
kmap_atomic() is implemented like a kmap_local_page() which also disables
page-faults and preemption (the latter only in !PREEMPT_RT kernels). The
kernel virtual addresses returned by these two APIs are only valid in the
context of the callers (i.e., they cannot be handed to other threads).
With kmap_local_page() the mappings are per thread and CPU local like in
kmap_atomic(); however, they can handle page-faults and can be called from
any context (including interrupts). The tasks that call kmap_local_page()
can be preempted and, when they are scheduled to run again, the kernel
virtual addresses are restored and are still valid.
The code blocks between the mappings and un-mappings don't rely on the
above-mentioned side effects of kmap_atomic(), so that a mere replacement
of the old API with the new one is all that they require (i.e., there is
no need to explicitly call pagefault_disable() and/or preempt_disable()).
Fabio M. De Francesco [Mon, 20 Nov 2023 14:24:05 +0000 (15:24 +0100)]
mm/memory: use kmap_local_page() in __wp_page_copy_user()
kmap_atomic() has been deprecated in favor of kmap_local_{folio,page}().
Therefore, replace kmap_atomic() with kmap_local_page() in
__wp_page_copy_user().
kmap_atomic() disables preemption in !PREEMPT_RT kernels and
unconditionally disables also page-faults. My limited knowledge of the
implementation of __wp_page_copy_user() makes me think that the latter
side effect is still needed here, but kmap_local_page() is implemented not
to disable page-faults.
So, in addition to the conversion to local mapping, add explicit
pagefault_disable() / pagefault_enable() between mapping and un-mapping.
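Roughly, the converted mapping section then looks like this (illustrative
sketch, not the exact hunk):

  kaddr = kmap_local_page(dst);
  pagefault_disable();
  /* ... the existing __copy_from_user_inatomic() based copy ... */
  pagefault_enable();
  kunmap_local(kaddr);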
Fabio M. De Francesco [Mon, 20 Nov 2023 14:18:44 +0000 (15:18 +0100)]
mm/ksm: use kmap_local_page() in calc_checksum()
kmap_atomic() has been deprecated in favor of kmap_local_page().
Therefore, replace kmap_atomic() with kmap_local_page() in
calc_checksum().
kmap_atomic() is implemented like a kmap_local_page() which also disables
page-faults and preemption (the latter only in !PREEMPT_RT kernels). The
kernel virtual addresses returned by these two APIs are only valid in the
context of the callers (i.e., they cannot be handed to other threads).
With kmap_local_page() the mappings are per thread and CPU local like in
kmap_atomic(); however, they can handle page-faults and can be called from
any context (including interrupts). The tasks that call kmap_local_page()
can be preempted and, when they are scheduled to run again, the kernel
virtual addresses are restored and are still valid.
In calc_checksum(), the block of code between the mapping and un-mapping
does not depend on the above-mentioned side effects of kmap_atomic(), so
that a mere replacement of the old API with the new one is all that is
required (i.e., there is no need to explicitly call pagefault_disable()
and/or preempt_disable()).
Fabio De Francesco [Mon, 20 Nov 2023 14:15:27 +0000 (15:15 +0100)]
mm/util: use kmap_local_page() in memcmp_pages()
kmap_atomic() has been deprecated in favor of kmap_local_page().
Therefore, replace kmap_atomic() with kmap_local_page() in memcmp_pages().
kmap_atomic() is implemented like a kmap_local_page() which also disables
page-faults and preemption (the latter only in !PREEMPT_RT kernels). The
kernel virtual addresses returned by these two APIs are only valid in the
context of the callers (i.e., they cannot be handed to other threads).
With kmap_local_page() the mappings are per thread and CPU local like in
kmap_atomic(); however, they can handle page-faults and can be called from
any context (including interrupts). The tasks that call kmap_local_page()
can be preempted and, when they are scheduled to run again, the kernel
virtual addresses are restored and are still valid.
In memcmp_pages(), the block of code between the mapping and un-mapping
does not depend on the above-mentioned side effects of kmap_atomic(), so
that a mere replacement of the old API with the new one is all that is
required (i.e., there is no need to explicitly call pagefault_disable()
and/or preempt_disable()).
Sumanth Korikkar [Mon, 20 Nov 2023 14:53:54 +0000 (15:53 +0100)]
mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
vmem_altmap_free() and vmem_altmap_offset() could be utilized without
CONFIG_ZONE_DEVICE enabled. For example,
mm/memory_hotplug.c:__add_pages() relies on that. The altmap is no longer
restricted to ZONE_DEVICE handling, but instead depends on
CONFIG_SPARSEMEM_VMEMMAP.
When CONFIG_SPARSEMEM_VMEMMAP is disabled, these functions are defined as
inline stubs, ensuring compatibility with configurations that do not use
sparsemem vmemmap. Without it, lkp reported the following:
ld: arch/x86/mm/init_64.o: in function `remove_pagetable':
init_64.c:(.meminit.text+0xfc7): undefined reference to
`vmem_altmap_free'
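The stubs look roughly like this (sketch of the described change):

  #ifdef CONFIG_SPARSEMEM_VMEMMAP
  void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
  #else
  static inline void vmem_altmap_free(struct vmem_altmap *altmap,
                                      unsigned long nr_pfns)
  {
  }
  static inline unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
  {
          return 0;
  }
  #endif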
Link: https://lkml.kernel.org/r/20231120145354.308999-4-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311180545.VeyRXEDq-lkp@intel.com/ Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:20 +0000 (18:47 +0100)]
lib/stackdepot: adjust DEPOT_POOLS_CAP for KMSAN
KMSAN is frequently used in fuzzing scenarios and thus saves a lot of
stack traces. As KMSAN does not support evicting stack traces from the
stack depot, the stack depot capacity might be reached quickly with large
stack records.
Adjust the maximum number of stack depot pools for this case.
The average size of a stack trace saved into the stack depot is ~16
frames. Thus, adjust the maximum number of pools accordingly to keep the
maximum number of stack traces that can be saved into the stack depot
similar to the one that was allowed before the stack trace eviction
changes.
Link: https://lkml.kernel.org/r/301a115cf7ce8ddb42ef6de9151c2bb76ba728fc.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:19 +0000 (18:47 +0100)]
kasan: use stack_depot_put for Generic mode
Evict alloc/free stack traces from the stack depot for Generic KASAN once
they are evicted from the quarantine.
For auxiliary stack traces, evict the oldest stack trace once a new one is
saved (KASAN only keeps references to the last two).
Also evict all saved stack traces on krealloc.
To avoid double-evicting and mis-evicting stack traces (in case KASAN's
metadata was corrupted), reset KASAN's per-object metadata that stores
stack depot handles when the object is initialized and when it's evicted
from the quarantine.
Note that stack_depot_put is a no-op if the handle is 0.
Link: https://lkml.kernel.org/r/5cef104d9b842899489b4054fe8d1339a71acee0.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:18 +0000 (18:47 +0100)]
kasan: use stack_depot_put for tag-based modes
Make tag-based KASAN modes evict stack traces from the stack depot once
they are evicted from the stack ring.
Internally, pass STACK_DEPOT_FLAG_GET to stack_depot_save_flags (via
kasan_save_stack) to increment the refcount when saving a new entry to the
stack ring, and call stack_depot_put when removing an entry from the stack
ring.
Link: https://lkml.kernel.org/r/b4773e5c1b0b9df6826ec0b65c1923feadfa78e5.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:17 +0000 (18:47 +0100)]
kasan: check object_size in kasan_complete_mode_report_info
Check the object size when looking up entries in the stack ring.
If the size of the object for which a report is being printed does not
match the size of the object for which a stack trace has been saved in the
stack ring, the saved stack trace is irrelevant.
Link: https://lkml.kernel.org/r/68c6948175aadd7e7e7deea61725103d64a4528f.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:16 +0000 (18:47 +0100)]
kasan: remove atomic accesses to stack ring entries
Remove the atomic accesses to entry fields in save_stack_info and
kasan_complete_mode_report_info for tag-based KASAN modes.
These atomics are not required, as the read/write lock prevents the
entries from being read (in kasan_complete_mode_report_info) while being
written (in save_stack_info) and the try_cmpxchg prevents the same entry
from being rewritten (in save_stack_info) in the unlikely case of wrapping
during writing.
Link: https://lkml.kernel.org/r/29f59126d9845c5257b6c29cd7ad113b16f19f47.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:15 +0000 (18:47 +0100)]
lib/stackdepot: allow users to evict stack traces
Add stack_depot_put, a function that decrements the reference counter on a
stack record and removes it from the stack depot once the counter reaches
0.
Internally, when removing a stack record, the function unlinks it from the
hash table bucket and returns it to the freelist.
With this change, the users of stack depot can call stack_depot_put when
keeping a stack trace in the stack depot is not needed anymore. This
allows avoiding polluting the stack depot with irrelevant stack traces and
thus leaves more space to store the relevant ones before the stack depot
reaches its capacity.
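A minimal usage sketch of the new API pair (the flag names are the ones
introduced by this series; the surrounding code is illustrative):

  /* save the trace and take a reference on the stack record */
  handle = stack_depot_save_flags(entries, nr_entries, GFP_NOWAIT,
                                  STACK_DEPOT_FLAG_CAN_ALLOC |
                                  STACK_DEPOT_FLAG_GET);

  /* ... later, when the trace is no longer needed, drop the reference;
   * the record goes back to the freelist once the counter reaches 0 */
  stack_depot_put(handle);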
Link: https://lkml.kernel.org/r/1d1ad5692ee43d4fc2b3fd9d221331d30b36123f.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:10 +0000 (18:47 +0100)]
lib/stackdepot: use read/write lock
Currently, stack depot uses the following locking scheme:
1. Lock-free accesses when looking up a stack record, which allows
multiple users to look up records in parallel;
2. Spinlock for protecting the stack depot pools and the hash table
when adding a new record.
For implementing the eviction of stack traces from stack depot, the
lock-free approach is not going to work anymore, as we will need to be
able to also remove records from the hash table.
Convert the spinlock into a read/write lock, and drop the atomic
accesses, as they are no longer required.
Looking up stack traces is now protected by the read lock and adding new
records - by the write lock. One of the following patches will add a
new function for evicting stack records, which will be protected by the
write lock as well.
With this change, multiple users can still look up records in parallel.
This is a preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/9f81ffcc4bb422ebb6326a65a770bf1918634cbb.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:09 +0000 (18:47 +0100)]
lib/stackdepot: store free stack records in a freelist
Instead of using the global pool_offset variable to find a free slot when
storing a new stack record, maintain a freelist of free slots within the
allocated stack pools.
A global next_stack variable is used as the head of the freelist, and the
next field in the stack_record struct is reused as freelist link (when the
record is not in the freelist, this field is used as a link in the hash
table).
This is a preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/b9e4c79955c2121b69301778643b203d3fb09ccc.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:08 +0000 (18:47 +0100)]
lib/stackdepot: store next pool pointer in new_pool
Instead of using the last pointer in stack_pools for storing the pointer
to a new pool (which does not yet store any stack records), use a new
new_pool variable.
This is a purely code readability change: it seems more logical to store the
pointer to a pool with a special meaning in a dedicated variable.
Link: https://lkml.kernel.org/r/448bc18296c16bef95cb3167697be6583dcc8ce3.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:07 +0000 (18:47 +0100)]
lib/stackdepot: rename next_pool_required to new_pool_required
Rename next_pool_required to new_pool_required.
This is a purely code readability change: the following patch will change
stack depot to store the pointer to the new pool in a separate variable,
and "new" seems like a more logical name.
Link: https://lkml.kernel.org/r/fd7cd6c6eb250c13ec5d2009d75bb4ddd1470db9.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:06 +0000 (18:47 +0100)]
lib/stackdepot: rework helpers for depot_alloc_stack
Split code in depot_alloc_stack and depot_init_pool into 3 functions:
1. depot_keep_next_pool that keeps preallocated memory for the next pool
if required.
2. depot_update_pools that moves on to the next pool if there's no space
left in the current pool, uses preallocated memory for the new current
pool if required, and calls depot_keep_next_pool otherwise.
3. depot_alloc_stack that calls depot_update_pools and then allocates
a stack record as before.
This makes it somewhat easier to follow the logic of depot_alloc_stack and
also serves as a preparation for implementing the eviction of stack
records from the stack depot.
Link: https://lkml.kernel.org/r/71fb144d42b701fcb46708d7f4be6801a4a8270e.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:05 +0000 (18:47 +0100)]
lib/stackdepot: fix and clean-up atomic annotations
Drop smp_load_acquire from next_pool_required in depot_init_pool, as both
depot_init_pool and all the smp_store_release()s to this variable are
executed under the stack depot lock.
Also simplify and clean up comments accompanying the use of atomic
accesses in the stack depot code.
Link: https://lkml.kernel.org/r/c118ef044d8db80248d9e1f14592c72e8429e9d9.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:47:04 +0000 (18:47 +0100)]
lib/stackdepot: use fixed-sized slots for stack records
Instead of storing stack records in stack depot pools one right after
another, use fixed-sized slots.
Add a new Kconfig option STACKDEPOT_MAX_FRAMES that allows selecting the
size of the slot in frames. Use 64 as the default value, which is the
maximum stack trace size both KASAN and KMSAN use right now.
Also add descriptions for other stack depot Kconfig options.
This is a preparatory patch for implementing the eviction of stack records
from the stack depot.
Link: https://lkml.kernel.org/r/dce7d030a99ff61022509665187fac45b0827298.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Mon, 20 Nov 2023 17:46:59 +0000 (18:46 +0100)]
lib/stackdepot: print disabled message only if truly disabled
Patch series "stackdepot: allow evicting stack traces", v4.
Currently, the stack depot grows indefinitely until it reaches its
capacity. Once that happens, the stack depot stops saving new stack
traces.
This creates a problem for using the stack depot for in-field testing and
in production.
For such uses, an ideal stack trace storage should:
1. Allow saving fresh stack traces on systems with a large uptime while
limiting the amount of memory used to store the traces;
2. Have a low performance impact.
Implementing #1 in the stack depot is impossible with the current
keep-forever approach. This series aims to address that. Issue #2 is
left to be addressed in a future series.
This series changes the stack depot implementation to allow evicting
unneeded stack traces from the stack depot. The users of the stack depot
can do that via new stack_depot_save_flags(STACK_DEPOT_FLAG_GET) and
stack_depot_put APIs.
Internal changes to the stack depot code include:
1. Storing stack traces in fixed-frame-sized slots (vs precisely-sized
slots in the current implementation); the slot size is controlled via
CONFIG_STACKDEPOT_MAX_FRAMES (default: 64 frames);
2. Keeping available slots in a freelist (vs keeping an offset to the next
free slot);
3. Using a read/write lock for synchronization (vs a lock-free approach
combined with a spinlock).
This series also integrates the eviction functionality into KASAN: the
tag-based modes evict stack traces when the corresponding entry leaves the
stack ring, and Generic KASAN evicts stack traces for objects once those
leave the quarantine.
With KASAN, despite wasting some space on rounding up the size of each
stack record, the total memory consumed by stack depot gets saturated due
to the eviction of irrelevant stack traces from the stack depot.
With the tag-based KASAN modes, the average total amount of memory used
for stack traces becomes ~0.5 MB (with the current default stack ring size
of 32k entries and the default CONFIG_STACKDEPOT_MAX_FRAMES of 64). With
Generic KASAN, the stack traces take up ~1 MB per 1 GB of RAM (as the
quarantine's size depends on the amount of RAM).
However, with KMSAN, the stack depot ends up using ~4x more memory per
stack trace than before. Thus, for KMSAN, the stack depot capacity is
increased accordingly. KMSAN uses a lot of RAM for shadow memory anyway,
so the increased stack depot memory usage will not make a significant
difference.
Other users of the stack depot do not save stack traces as often as KASAN
and KMSAN. Thus, the increased memory usage is taken as an acceptable
trade-off. In the future, these other users can take advantage of the
eviction API to limit the memory waste.
There is no measurable boot time performance impact of these changes for
KASAN on x86-64. I haven't done any tests for arm64 modes (the stack
depot without performance optimizations is not suitable for intended use
of those anyway), but I expect a similar result. Obtaining and copying
stack trace frames when saving them into stack depot is what takes the
most time.
This series does not yet provide a way to configure the maximum size of
the stack depot externally (e.g. via a command-line parameter). This
will be added in a separate series, possibly together with the performance
improvement changes.
This patch (of 22):
Currently, if stack_depot_disable=off is passed to the kernel command-line
after stack_depot_disable=on, stack depot prints a message that it is
disabled, while it is actually enabled.
Fix this by moving printing the disabled message to
stack_depot_early_init. Place it before the
__stack_depot_early_init_requested check, so that the message is printed
even if early stack depot init has not been requested.
Also drop the stack_table = NULL assignment from disable_stack_depot, as
stack_table is NULL by default.
Jim Cromie [Thu, 16 Nov 2023 22:43:18 +0000 (15:43 -0700)]
kmemleak: add checksum to backtrace report
Change /sys/kernel/debug/kmemleak report format slightly, adding
"(extra info)" to the backtrace header:
from: " backtrace:"
to: " backtrace (crc <cksum>):"
The <cksum> allows a user to see recurring backtraces without
detailed/careful reading of multiline stacks. So after cycling
kmemleak-test a few times, I know some leaks are repeating.
Jim Cromie [Thu, 16 Nov 2023 22:43:17 +0000 (15:43 -0700)]
kmemleak: drop (age <increasing>) from leak record
Patch series "tweak kmemleak report format".
These 2 patches make minor changes to the report:
1st strips "age <increasing>" from output. This makes the output
idempotent; unchanging until a new leak is reported.
2nd adds the backtrace.checksum to the "backtrace:" line. This lets a
user see repeats without actually reading the whole backtrace. So now
the backtrace line looks like this: " backtrace (crc <cksum>):".
Syzkaller parses kmemleak in executor/common_linux.h:
static void check_leaks(char** frames, int nframes)
It just counts occurrences of "unreferenced object", specifically it
does not look for "age", nor would it choke on "crc" being added.
github has 3 repos with "kmemleak" mentioned, all are moribund.
gitlab has 0 hits on "kmemleak".
This patch (of 2):
Displaying age is pretty, but counter-productive; it changes with
current-time, so it surrenders idempotency of the output, which breaks
simple hash-based cataloging of the records by the user.
The trouble: sequential reads, w/o new leaks, get new results:
:#> sum /sys/kernel/debug/kmemleak
53439 74 /sys/kernel/debug/kmemleak
:#> sum /sys/kernel/debug/kmemleak
59066 74 /sys/kernel/debug/kmemleak
and age is why (nothing else changes):
:#> grep -v age /sys/kernel/debug/kmemleak | sum
58894 67
:#> grep -v age /sys/kernel/debug/kmemleak | sum
58894 67
Since jiffies is already printed in the "comm" line, age adds nothing.
Notably, syzkaller reads kmemleak only for "unreferenced object", and
won't care about this reform of age-ism. A few moribund github repos
mention it, but don't compile.
Matthew Wilcox (Oracle) [Fri, 17 Nov 2023 16:14:47 +0000 (16:14 +0000)]
fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.
Matthew Wilcox (Oracle) [Fri, 17 Nov 2023 16:14:46 +0000 (16:14 +0000)]
memory-failure: convert truncate_error_page to truncate_error_folio
Both callers now have a folio, so pass it in. Nothing downstream was
expecting a tail page; that's asserted in generic_error_remove_page(), for
example.
Matthew Wilcox (Oracle) [Fri, 17 Nov 2023 16:14:45 +0000 (16:14 +0000)]
memory-failure: use a folio in me_huge_page()
This function was already explicitly calling compound_head();
unfortunately the compiler can't know that and elide the redundant calls
to compound_head() buried in page_mapping(), unlock_page(), etc. Switch
to using a folio, which does let us elide these calls.
The latter one is the reason why memory-tracking depends on DEBUG_FS,
while the former one is used far beyond debugging these days. Namely
ac-time is used for fine grained writeback of idle entries (pages).
Move ac-time tracking under its own config option so that it can be
enabled (along with writeback) on systems without DEBUG_FS.
Barry Song [Tue, 14 Nov 2023 03:42:02 +0000 (16:42 +1300)]
mm/page_owner: record and dump free_pid and free_tgid
While investigating some complex memory allocation and free bugs,
especially in multi-process and multi-thread cases, I feel from time to
time that the free stack isn't sufficient, as a page can be freed by
processes or threads other than the one allocating it. And the other
processes and threads which free the page often have exactly the same free
stack as the one allocating the page. We can't tell who freed the page
from the free stack alone, though the current page_owner does tell us the
pid and tgid of the one allocating the page. This often makes bug
investigation hard.
So this patch adds free pid and tgid in page_owner, so that we can easily
figure out if the freeing is crossing processes or threads.
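The recording side is small; roughly (sketch, with the field names as
described above, placed in the page free path of page_owner):

  /* in the free path, next to the existing free timestamp/handle */
  page_owner->free_pid = current->pid;
  page_owner->free_tgid = current->tgid;

  /* the dump side then prints them alongside the allocating pid/tgid */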
Link: https://lkml.kernel.org/r/20231114034202.73098-1-v-songbaohua@oppo.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Cc: Audra Mitchell <audra@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kassey Li <quic_yingangl@quicinc.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
York Jasper Niebuhr [Sat, 11 Nov 2023 18:48:59 +0000 (19:48 +0100)]
mm: fix process_vm_rw page counts
1. There is a "-1" missing in the page number calculation in
process_vm_rw_core. While this can't break anything, it can cause
unnecessary allocations in certain cases:
Consider handling an iovec ranging over PVM_MAX_PP_ARRAY_COUNT pages
that is also aligned to a page boundary. While pp_stack could hold
references to such an amount of pinned pages, nr_pages yields
(PVM_MAX_PP_ARRAY + 1) in process_vm_rw_core. Consequently, a larger
buffer is allocated with kmalloc for no reason.
For any page boundary aligned iovec that is a multiple of PAGE_SIZE
and larger than PVM_MAX_PP_ARRAY_COUNT pages, nr_pages will be too big
by 1 and thus kmalloc allocates excess space for one more pointer (see
the sketch after this list).
2. max_pages_per_loop is constant and there is no reason to have it as
a variable. A macro does the job just fine and saves memory.
3. Replaced "sizeof(struct pages *)" with "sizeof(struct page *)" to
have matching types for allocation and prevent confusion.
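For item 1, the corrected count follows the usual "last page minus first
page plus one" form (sketch; variable names illustrative):

  /* pages spanned by [addr, addr + len): include the page of the first
   * byte and the page of the last byte (addr + len - 1) */
  nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;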
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:15:07 +0000 (21:15 +0000)]
gfp: include __GFP_NOWARN in GFP_NOWAIT
GFP_NOWAIT callers are always prepared for their allocations to fail
because they fail so frequently. Forcing the callers to remember to add
__GFP_NOWARN is just annoying and leads to an endless stream of patches
for the places where we forgot to add it.
We can now remove __GFP_NOWARN from all the callers which specify
GFP_NOWAIT, but I'd rather wait a cycle and send patches to each
maintainer instead of creating a big pile of merge conflicts.
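The change itself is a one-liner in the gfp definitions, roughly:

  #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM | __GFP_NOWARN)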
Brendan Jackman [Wed, 8 Nov 2023 16:49:20 +0000 (16:49 +0000)]
mm/page_alloc: dedupe some memcg uncharging logic
The duplication makes it seem like some work is required before uncharging
in the !PageHWPoison case. But it isn't, so we can simplify the code a
little.
Note the PageMemcgKmem check is redundant, but I've left it in as it
avoids an unnecessary function call.
Link: https://lkml.kernel.org/r/20231108164920.3401565-1-jackmanb@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:06:08 +0000 (21:06 +0000)]
buffer: fix more functions for block size > PAGE_SIZE
Both __block_write_full_folio() and block_read_full_folio() assumed that
block size <= PAGE_SIZE. Replace the shift with a divide, which is
probably cheaper than first calculating the shift. That lets us remove
block_size_bits() as these were the last callers.
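The shape of the replacement is roughly (illustrative, not the exact hunk):

  /* before: breaks when i_blkbits > PAGE_SHIFT (negative shift count) */
  block = (sector_t)folio->index << (PAGE_SHIFT - bbits);

  /* after: a divide handles any block size */
  block = div_u64(folio_pos(folio), blocksize);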
Link: https://lkml.kernel.org/r/20231109210608.2252323-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:06:07 +0000 (21:06 +0000)]
buffer: handle large folios in __block_write_begin_int()
When __block_write_begin_int() was converted to support folios, we did not
expect large folios to be passed to it. With the current work to support
large block size storage devices, this will no longer be true so change
the checks on 'from' and 'to' to be related to the size of the folio
instead of PAGE_SIZE. Also remove an assumption that the block size is
smaller than PAGE_SIZE.
Link: https://lkml.kernel.org/r/20231109210608.2252323-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:06:06 +0000 (21:06 +0000)]
buffer: fix various functions for block size > PAGE_SIZE
If i_blkbits is larger than PAGE_SHIFT, we shift by a negative number,
which is undefined. It is safe to shift the block left as a block device
must be smaller than MAX_LFS_FILESIZE, which is guaranteed to fit in
loff_t.
Link: https://lkml.kernel.org/r/20231109210608.2252323-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:06:04 +0000 (21:06 +0000)]
buffer: fix grow_buffers() for block size > PAGE_SIZE
We must not shift by a negative number so work in terms of a byte offset
to avoid the awkward shift left-or-right-depending-on-sign option. This
means we need to use check_mul_overflow() to ensure that a large block
number does not result in a wrap.
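Working in byte offsets with an explicit overflow check looks roughly like
this (sketch; variable names illustrative):

  sector_t pos;

  /* reject block numbers whose byte offset would wrap instead of
   * shifting by a possibly-negative count */
  if (check_mul_overflow(block, (sector_t)size, &pos))
          return false;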
Link: https://lkml.kernel.org/r/20231109210608.2252323-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:06:03 +0000 (21:06 +0000)]
buffer: calculate block number inside folio_init_buffers()
The calculation of block from index doesn't work for devices with a block
size larger than PAGE_SIZE as we end up shifting by a negative number.
Instead, calculate the number of the first block from the folio's position
in the block device. We no longer need to pass sizebits to
grow_dev_folio().
Link: https://lkml.kernel.org/r/20231109210608.2252323-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Thu, 9 Nov 2023 21:06:02 +0000 (21:06 +0000)]
buffer: return bool from grow_dev_folio()
Patch series "More buffer_head cleanups", v2.
The first patch is a left-over from last cycle. The rest fix "obvious"
block size > PAGE_SIZE problems. I haven't tested with a large block size
setup (but I have done an ext4 xfstests run).
This patch (of 7):
Rename grow_dev_page() to grow_dev_folio() and make it return a bool.
Document what that bool means; it's more subtle than it first appears.
Also rename the 'failed' label to 'unlock' because it's not exactly
'failed'. It just hasn't succeeded.
Link: https://lkml.kernel.org/r/20231109210608.2252323-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 8 Nov 2023 18:28:08 +0000 (18:28 +0000)]
mm: convert isolate_page() to mf_isolate_folio()
The only caller now has a folio, so pass it in and operate on it. Saves
many page->folio conversions and introduces only one folio->page
conversion when calling isolate_movable_page().
Matthew Wilcox (Oracle) [Wed, 8 Nov 2023 18:28:05 +0000 (18:28 +0000)]
mm: convert __do_fault() to use a folio
Convert vmf->page to a folio as soon as we're going to use it. This fixes
a bug if the fault handler returns a tail page with hardware poison; tail
pages have an invalid page->index, so we would fail to unmap the page from
the page tables. We actually have to unmap the entire folio (or
mapping_evict_folio() will fail), so use unmap_mapping_folio() instead.
This also saves various calls to compound_head() hidden in lock_page(),
put_page(), etc.
Link: https://lkml.kernel.org/r/20231108182809.602073-3-willy@infradead.org Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 8 Nov 2023 18:28:04 +0000 (18:28 +0000)]
mm: make mapping_evict_folio() the preferred way to evict clean folios
Patch series "Fix fault handler's handling of poisoned tail pages".
Since introducing the ability to have large folios in the page cache, it's
been possible to have a hwpoisoned tail page returned from the fault
handler. We handle this situation poorly; failing to remove the affected
page from use.
This isn't a minimal patch to fix it, it's a full conversion of all the
code surrounding it.
This patch (of 6):
invalidate_inode_page() does very little beyond calling
mapping_evict_folio(). Move the check for mapping being NULL into
mapping_evict_folio() and make it available to the rest of the MM for use
in the next few patches.
Matthew Wilcox (Oracle) [Wed, 8 Nov 2023 20:46:05 +0000 (20:46 +0000)]
mm: return void from folio_start_writeback() and related functions
Nobody now checks the return value from any of these functions, so
add an assertion at the beginning of the function and return void.
Link: https://lkml.kernel.org/r/20231108204605.745109-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Cc: David Howells <dhowells@redhat.com> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 8 Nov 2023 20:46:02 +0000 (20:46 +0000)]
mm: remove test_set_page_writeback()
Patch series "Make folio_start_writeback return void".
Most of the folio flag-setting functions return void.
folio_start_writeback is gratuitously different; the only two filesystems
that do anything with the return value emit debug messages if it's already
set, and we can (and should) do that internally without bothering the
filesystem to do it.
Matthew Wilcox (Oracle) [Tue, 7 Nov 2023 21:26:42 +0000 (21:26 +0000)]
gfs2: convert stuffed_readpage() to stuffed_read_folio()
Use folio_fill_tail() to implement the unstuffing and folio_end_read() to
simultaneously mark the folio uptodate and unlock it. Unifies a couple of
code paths.
Link: https://lkml.kernel.org/r/20231107212643.3490372-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Tue, 7 Nov 2023 21:26:41 +0000 (21:26 +0000)]
mm: add folio_fill_tail() and use it in iomap
The iomap code was limited to PAGE_SIZE bytes; generalise it to cover
an arbitrary-sized folio, and move it to be a common helper.
Link: https://lkml.kernel.org/r/20231107212643.3490372-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Tue, 7 Nov 2023 21:26:40 +0000 (21:26 +0000)]
mm: add folio_zero_tail() and use it in ext4
Patch series "Add folio_zero_tail() and folio_fill_tail()".
I'm trying to make it easier for filesystems with tailpacking / stuffing /
inline data to use folios. The primary function here is
folio_fill_tail(). You give it a pointer to memory where the data
currently is, and it takes care of copying it into the folio at that
offset. That works for gfs2 & iomap. Then there's ext4. Rather than gin
up some kind of specialist "here are two pointers to two blocks of memory"
routine, just let it do its current thing, and let it call
folio_zero_tail(), which is also called by folio_fill_tail().
Other filesystems can be converted later; these ones seemed like good
examples as they're already partly or completely converted to folios.
This patch (of 3):
Instead of unmapping the folio after copying the data to it, then mapping
it again to zero the tail, provide folio_zero_tail() to zero the tail of
an already-mapped folio.
Andrei Vagin [Fri, 17 Nov 2023 18:11:27 +0000 (10:11 -0800)]
selftests/mm: don't fail if pagemap_scan isn't supported
This change allows running the tests on old kernels.
Link: https://lkml.kernel.org/r/20231117181127.2574897-1-avagin@google.com Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/lkml/696a0a99-eb42-4e13-be14-58a88c9c33f7@arm.com/ Signed-off-by: Andrei Vagin <avagin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrei Vagin [Mon, 6 Nov 2023 22:09:59 +0000 (14:09 -0800)]
selftests/mm: check that PAGEMAP_SCAN returns correct categories
Right now, tests read page flags from /proc/pid/pagemap files. With this
change, tests will check that PAGEMAP_SCAN return correct information too.
Link: https://lkml.kernel.org/r/20231106220959.296568-2-avagin@google.com Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20231107164139.576046-1-avagin@google.com Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrei Vagin [Mon, 6 Nov 2023 22:09:58 +0000 (14:09 -0800)]
fs/proc/task_mmu: report SOFT_DIRTY bits through the PAGEMAP_SCAN ioctl
The PAGEMAP_SCAN ioctl returns information regarding page table entries.
It is more efficient compared to reading pagemap files. CRIU can start to
utilize this ioctl, but it needs info about soft-dirty bits to track
memory changes.
We are aware of a new method for tracking memory changes implemented in
the PAGEMAP_SCAN ioctl. For CRIU, the primary advantage of this method is
its usability by unprivileged users. However, it is not feasible to
transparently replace the soft-dirty tracker with the new one. The main
problem here is userfault descriptors that have to be preserved between
pre-dump iterations. This means CRIU continues to support the soft-dirty
method to avoid breaking current users. The new method will be
implemented as a separate feature.
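From userspace, the soft-dirty category is then requested like any other
page category (illustrative sketch; see the uapi header for the exact
structure layout):

  struct page_region regions[32];
  struct pm_scan_arg arg = {
          .size = sizeof(arg),
          .start = start,
          .end = end,
          .vec = (unsigned long)regions,
          .vec_len = sizeof(regions) / sizeof(regions[0]),
          .category_mask = PAGE_IS_SOFT_DIRTY,
          .return_mask = PAGE_IS_SOFT_DIRTY,
  };

  ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);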
Link: https://lkml.kernel.org/r/20231106220959.296568-1-avagin@google.com Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vishal Verma [Tue, 7 Nov 2023 07:22:43 +0000 (00:22 -0700)]
dax/kmem: allow kmem to add memory with memmap_on_memory
Large amounts of memory managed by the kmem driver may come in via CXL,
and it is often desirable to have the memmap for this memory on the new
memory itself.
Enroll kmem-managed memory for memmap_on_memory semantics if the dax
region originates via CXL. For non-CXL dax regions, retain the existing
default behavior of hot adding without memmap_on_memory semantics.
Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-3-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Fan Ni <fan.ni@samsung.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vishal Verma [Tue, 7 Nov 2023 07:22:42 +0000 (00:22 -0700)]
mm/memory_hotplug: split memmap_on_memory requests across memblocks
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
'memblock_size' chunks of memory being added. Adding a larger span of
memory precludes memmap_on_memory semantics.
For users of hotplug such as kmem, large amounts of memory might get added
from the CXL subsystem. In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added.
In this case, it is useful to have a way to place the memmap on the memory
being added, even if it means splitting the addition into memblock-sized
chunks.
Change add_memory_resource() to loop over memblock-sized chunks of memory
if the caller requested memmap_on_memory, and if the other conditions for
it are met. Teach try_remove_memory() to also expect that a memory range
being removed might have been split up into memblock-sized chunks, and to
loop through those as needed.
This does preclude being able to use PUD mappings in the direct map; a
proposal for how this could be optimized in the future is laid out in [1].
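Simplified, the splitting described above can be pictured as the loop
below. memory_block_size_bytes(), struct vmem_altmap and
memory_block_memmap_on_memory_pages() are existing hotplug helpers;
add_memory_block_chunk() is a hypothetical stand-in for the per-chunk
arch_add_memory() plus create_memory_block_devices() sequence, so treat
this as a sketch rather than the literal add_memory_resource() code.

/* Walk the requested range in memblock-sized pieces; each piece gets its
 * own memmap (altmap) carved out of the piece itself. */
u64 memblock_size = memory_block_size_bytes();
u64 cur;
int ret = 0;

for (cur = start; cur < start + size; cur += memblock_size) {
	struct vmem_altmap altmap = {
		.base_pfn = PHYS_PFN(cur),
		.end_pfn  = PHYS_PFN(cur + memblock_size),
		.free	  = memory_block_memmap_on_memory_pages(),
	};

	/* Hypothetical helper, see the note above. */
	ret = add_memory_block_chunk(nid, cur, memblock_size, &altmap);
	if (ret)
		break;	/* the real code unwinds the chunks added so far */
}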
Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-2-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Fan Ni <fan.ni@samsung.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vishal Verma [Tue, 7 Nov 2023 07:22:41 +0000 (00:22 -0700)]
mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
Patch series "mm: use memmap_on_memory semantics for dax/kmem", v10.
The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem-managed memory to
place its memmap on the newly added memory itself.
The main hurdle for accomplishing this for kmem is that memmap_on_memory
can only be done if the memory being added is equal to the size of one
memblock. To overcome this, allow the hotplug code to split an
add_memory() request into memblock-sized chunks, and try_remove_memory()
to also expect and handle such a scenario.
Patch 1 replaces an open-coded kmemdup()
Patch 2 teaches the memory_hotplug code to allow splitting add_memory()
and remove_memory() requests over memblock-sized chunks.
Patch 3 allows the dax region drivers to request memmap_on_memory
semantics. CXL dax regions default this to 'on', all others default to
off to keep existing behavior unchanged.
This patch (of 3):
A review of the memmap_on_memory modifications to add_memory_resource()
revealed an instance of an open-coded kmemdup(). Replace it with
kmemdup().
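The cleanup itself is the familiar pattern sketched below (illustrative
shapes, not the exact hunk from add_memory_resource()):

/* Before: open-coded duplication of a small object. */
dst = kmalloc(sizeof(*src), GFP_KERNEL);
if (!dst)
	return -ENOMEM;
memcpy(dst, src, sizeof(*src));

/* After: kmemdup() allocates and copies in one step. */
dst = kmemdup(src, sizeof(*src), GFP_KERNEL);
if (!dst)
	return -ENOMEM;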
Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-0-1253ec050ed0@intel.com Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-1-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Ni [Thu, 26 Oct 2023 02:03:29 +0000 (10:03 +0800)]
NUMA: optimize detection of memory with no node id assigned by firmware
The sanity check that makes sure the nodes cover all memory loops over
numa_meminfo to count the pages that have a node id assigned by the
firmware, then loops again over memblock.memory to find the total amount
of memory, and finally checks that the difference between the total memory
and the memory covered by nodes is less than some threshold.
Worse, the loop over numa_meminfo calls __absent_pages_in_range() that
also partially traverses memblock.memory.
It is much simpler and more efficient to have a single traversal of
memblock.memory that verifies that the amount of memory not covered by
nodes is less than a threshold.
Introduce memblock_validate_numa_coverage() that does exactly that and use
it instead of numa_meminfo_cover_memory().
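A sketch of what such a single-pass helper can look like is below; it
follows the description above, but the exact prototype, section annotation
and error message of the real memblock_validate_numa_coverage() in
mm/memblock.c may differ.

/* Sketch: one walk over memblock.memory, counting pages with no node id. */
bool __init memblock_validate_numa_coverage(unsigned long threshold_bytes)
{
	unsigned long nr_pages = 0;
	unsigned long start_pfn, end_pfn, mem_size_mb;
	int nid, i;

	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
		if (nid == NUMA_NO_NODE)
			nr_pages += end_pfn - start_pfn;
	}

	/* Reject the NUMA configuration if too much memory lacks a node. */
	if ((nr_pages << PAGE_SHIFT) >= threshold_bytes) {
		mem_size_mb = memblock_phys_mem_size() >> 20;
		pr_err("NUMA: no nodes coverage for %luMB of %luMB RAM\n",
		       (nr_pages << PAGE_SHIFT) >> 20, mem_size_mb);
		return false;
	}

	return true;
}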
Link: https://lkml.kernel.org/r/20231026020329.327329-1-zhiguangni01@gmail.com Signed-off-by: Liam Ni <zhiguangni01@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Bibo Mao <maobibo@loongson.cn> Cc: Binbin Zhou <zhoubinbin@loongson.cn> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feiyang Chen <chenfeiyang@loongson.cn> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The reason is that split_huge_page_to_list() builds migration entries for
each subpage of a pte-mapped anonymous THP via try_to_migrate(), or unmaps
a file-backed THP, and it clears and flushes the TLB for each subpage's
pte. Moreover, split_huge_page_to_list() sets the TTU_SPLIT_HUGE_PMD flag
to ensure the THP is already a pte-mapped THP before splitting it into
normal pages.
Actually, there is no need to flush the TLB for each subpage immediately;
instead, the TLB flush for the pte-mapped THP can be batched to improve
performance.
After this patch, the batched TLB flush noticeably improves latency when
running thpscale.
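A hedged sketch of where the batching happens is below. TTU_BATCH_FLUSH and
try_to_unmap_flush() are the existing deferred-flush machinery from
mm/rmap.c already used by reclaim; the exact shape of unmap_folio() in the
real patch may differ slightly.

/* Sketch: defer the per-pte TLB flushes while unmapping the pte-mapped
 * THP, then issue a single flush for the whole folio. */
static void unmap_folio(struct folio *folio)
{
	enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
				   TTU_SYNC | TTU_BATCH_FLUSH;

	if (folio_test_anon(folio))
		try_to_migrate(folio, ttu_flags);
	else
		try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);

	/* One flush for the whole THP instead of one per subpage pte. */
	try_to_unmap_flush();
}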
Liam R. Howlett [Wed, 1 Nov 2023 17:16:29 +0000 (13:16 -0400)]
maple_tree: mtree_range_walk() clean up
mtree_range_walk() needed to be updated to avoid checking if there was a
pivot value. On closer examination, the code could avoid setting min or
max in certain scenarios. The commit removes the extra check for
pivot[offset] before setting max and only sets max when necessary. It
also sets min only when necessary, by checking offset 0 prior to the loop
(as it has always done).
The commit also drops a dead node check since the end of the node will
return the array size when the last slot is occupied (by a potential reuse
in a dead node). The data will be discarded later if the node is marked
dead.
Benchmarking these changes shows a 5.45% performance increase using the
BENCH_WALK benchmark in the maple tree test code.