Liam R. Howlett [Thu, 2 Feb 2023 16:16:24 +0000 (11:16 -0500)]
maple_tree: Add __init and __exit to test module
The test functions are not needed after the module is removed, so mark
them as such. Add __exit to the module removal function. Some other
variables have been marked as const static as well.
Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Thu, 1 Dec 2022 16:48:12 +0000 (11:48 -0500)]
maple_tree: Make test code work without debug enabled
The test code is less useful without debug, but can still do general
validations. Define mt_dump(), mas_dump() and mas_wr_dump() as a noop
if debug is not enabled and document it in the test module information
that more information can be obtained with another kernel config option.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Return error on mte_pivots() out of range
Rename mte_pivots() to mas_pivots() and pass through the ma_state to set
the error code to -EIO when the offset is out of range for the node
type. Change the WARN_ON() to MAS_WARN_ON() to log the maple state.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: Use MAS_WR_BUG_ON() in mas_store_prealloc()
mas_store_prealloc() should never fail, but if it does due to internal
tree issues then get as much debug information as possible prior to
crashing the kernel.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Use MAS_BUG_ON() instead of MT_BUG_ON() to get the maple state
information. In the unlikely even of a tree height of > 31, try to increase
the probability of useful information being logged.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Wed, 31 Aug 2022 19:55:15 +0000 (15:55 -0400)]
maple_tree: Convert debug code to use MT_WARN_ON() and MAS_WARN_ON()
Using MT_WARN_ON() allows for the removal of if statements before
logging. Using MAS_WARN_ON() will provide more information when issues
are encountered.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Tue, 15 Nov 2022 14:29:33 +0000 (09:29 -0500)]
maple_tree: Clean up mas_parent_enum()
mas_parent_enum() is a simple wrapper for mte_parent_enum() which is
only called from that wrapper. Remove the wrapper and inline
mte_parent_enum() into mas_parent_enum().
At the same time, clean up the bit masking of the root pointer since it
cannot be set by the time the bit masking occurs. Change the check on
the root bit to a WARN_ON(), and fix the verification code to not
trigger the WARN_ON() before checking if the node is root.
Reported-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Andrey Konovalov [Fri, 10 Feb 2023 21:16:04 +0000 (22:16 +0100)]
lib/stackdepot: annotate racy pool_index accesses
Accesses to pool_index are protected by pool_lock everywhere except
in a sanity check in stack_depot_fetch. The read access there can race
with the write access in depot_alloc_stack.
Use WRITE/READ_ONCE() to annotate the racy accesses.
As the sanity check is only used to print a warning in case of a
violation of the stack depot interface usage, it does not make a lot
of sense to use proper synchronization.
The current implementation of the extra_bits interface is confusing:
passing extra_bits to __stack_depot_save makes it seem that the extra
bits are somehow stored in stack depot. In reality, they are only
embedded into a stack depot handle and are not used within stack depot.
Drop the extra_bits argument from __stack_depot_save and instead provide
a new stack_depot_set_extra_bits function (similar to the exsiting
stack_depot_get_extra_bits) that saves extra bits into a stack depot
handle.
Update the callers of __stack_depot_save to use the new interace.
This change also fixes a minor issue in the old code: __stack_depot_save
does not return NULL if saving stack trace fails and extra_bits is used.
Andrey Konovalov [Fri, 10 Feb 2023 21:16:02 +0000 (22:16 +0100)]
lib/stackdepot: rename next_pool_inited to next_pool_required
Stack depot uses next_pool_inited to mark that either the next pool is
initialized or the limit on the number of pools is reached. However,
the flag name only reflects the former part of its purpose, which is
confusing.
Rename next_pool_inited to next_pool_required and invert its value.
Also annotate usages of next_pool_required with comments.
Andrey Konovalov [Fri, 10 Feb 2023 21:16:00 +0000 (22:16 +0100)]
lib/stacktrace: drop impossible WARN_ON for depot_init_pool
depot_init_pool has two call sites:
1. In depot_alloc_stack with a potentially NULL prealloc.
2. In __stack_depot_save with a non-NULL prealloc.
At the same time depot_init_pool can only return false when prealloc is
NULL.
As the second call site makes sure that prealloc is not NULL, the WARN_ON
there can never trigger. Thus, drop the WARN_ON and also move the prealloc
check from depot_init_pool to its first call site.
Also change the return type of depot_init_pool to void as it now always
returns true.
Andrey Konovalov [Fri, 10 Feb 2023 21:15:57 +0000 (22:15 +0100)]
lib/stackdepot: rename slab to pool
Use "pool" instead of "slab" for naming memory regions stack depot
uses to store stack traces. Using "slab" is confusing, as stack depot
pools have nothing to do with the slab allocator.
Also give better names to pool-related global variables: change
"depot_" prefix to "pool_" to point out that these variables are
related to stack depot pools.
Also rename the slabindex (poolindex) field in handle_parts to pool_index
to align its name with the pool_index global variable.
Rename stack_depot_want_early_init to stack_depot_request_early_init.
The old name is confusing, as it hints at returning some kind of intention
of stack depot. The new name reflects that this function requests an
action from stack depot instead.
Andrey Konovalov [Fri, 10 Feb 2023 21:15:49 +0000 (22:15 +0100)]
lib/stackdepot: put functions in logical order
Patch series "lib/stackdepot: fixes and clean-ups", v2.
A set of fixes, comments, and clean-ups I came up with while reading
the stack depot code.
This patch (of 18):
Put stack depot functions' declarations and definitions in a more logical
order:
1. Functions that save stack traces into stack depot.
2. Functions that fetch and print stack traces.
3. stack_depot_get_extra_bits that operates on stack depot handles
and does not interact with the stack depot storage.
DAMON debugfs interface has announced to be deprecated after >v5.15 LTS
kernel is released. And, v6.1.y has announced to be an LTS[1].
Though the announcement was there for a while, some people might not
noticed that so far. Also, some users could depend on it and have
problems at movng to the alternative (DAMON sysfs interface).
For such cases, warn DAMON debugfs interface deprecation with contacts
to ask helps when any DAMON debugfs interface file is opened.
DAMON debugfs interface has announced to be deprecated after >v5.15 LTS
kernel is released. And, v6.1.y has announced to be an LTS[1].
Though the announcement was there for a while, some people might not
noticed that so far. Also, some users could depend on it and have
problems at movng to the alternative (DAMON sysfs interface).
For such cases, note DAMON debugfs interface as deprecated, and contacts
to ask helps on the Kconfig.
Patch series "mm/damon: deprecate DAMON debugfs interface".
DAMON debugfs interface has announced to be deprecated after >v5.15 LTS
kernel is released. And v6.1.y has been announced to be an LTS[1].
Though the announcement was there for a while, some people might not have
noticed that so far. Also, some users could depend on it and have
problems at movng to the alternative (DAMON sysfs interface).
For such cases, keep the code and documents with warning messages and
contacts to ask helps for the deprecation.
DAMON debugfs interface has announced to be deprecated after >v5.15 LTS
kernel is released. And, v6.1.y has announced to be an LTS[1].
Though the announcement was there for a while, some people might not
noticed that so far. Also, some users could depend on it and have
problems at movng to the alternative (DAMON sysfs interface).
For such cases, note DAMON debugfs interface as deprecated, and contacts
to ask helps on the document.
Alexander Halbuer [Wed, 1 Feb 2023 16:25:49 +0000 (17:25 +0100)]
mm: reduce lock contention of pcp buffer refill
rmqueue_bulk() batches the allocation of multiple elements to refill the
per-CPU buffers into a single hold of the zone lock. Each element is
allocated and checked using check_pcp_refill(). The check touches every
related struct page which is especially expensive for higher order
allocations (huge pages).
This patch reduces the time holding the lock by moving the check out of
the critical section similar to rmqueue_buddy() which allocates a single
element.
Measurements of parallel allocation-heavy workloads show a reduction of
the average huge page allocation latency of 50 percent for two cores and
nearly 90 percent for 24 cores.
folio_movable_ops() does the same as page_movable_ops() except uses folios
instead of pages. This function will help make folio conversions in
migrate.c more readable.
Patch series "Convert a couple migrate functions to use folios", v2.
This patchset introduces folio_movable_ops() and converts 3 functions in
mm/migrate.c to use folios. It also introduces folio_get_nontail_page()
for folio conversions which may want to distinguish between head and tail
pages.
This patch (of 4):
folio_get_nontail_page() returns the folio associated with a head page.
This is necessary for folio conversions where the behavior of that
function differs between head pages and tail pages.
mm/mempolicy: convert migrate_page_add() to migrate_folio_add()
Replace migrate_page_add() with migrate_folio_add(). migrate_folio_add()
does the same a migrate_page_add() but takes in a folio instead of a page.
This removes a couple of calls to compound_head().
Link: https://lkml.kernel.org/r/20230130201833.27042-7-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/mempolicy: convert queue_pages_required() to queue_folio_required()
Replace queue_pages_required() with queue_folio_required().
queue_folio_required() does the same as queue_pages_required(), except
takes in a folio instead of a page.
Link: https://lkml.kernel.org/r/20230130201833.27042-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: "Yin, Fengwei" <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Convert various mempolicy.c functions to use folios", v4.
This patch series converts migrate_page_add() and queue_pages_required()
to migrate_folio_add() and queue_page_required(). It also converts the
callers of the functions to use folios as well, and introduces a helper
function to estimate the number of sharers of a folio.
This patch (of 6):
folio_estimated_sharers() takes in a folio and returns the precise number
of times the first subpage of the folio is mapped.
This function aims to provide an estimate for the number of sharers of a
folio. This is necessary for folio conversions where we care about the
number of processes that share a folio, but don't necessarily want to
check every single page within that folio.
This is in contrast to folio_mapcount() which calculates the total number
of the times a folio and all its subpages are mapped.
Keith Busch [Thu, 26 Jan 2023 21:51:25 +0000 (13:51 -0800)]
dmapool: create/destroy cleanup
Set the 'empty' bool directly from the result of the function that
determines its value instead of adding additional logic.
Link: https://lkml.kernel.org/r/20230126215125.4069751-13-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:24 +0000 (13:51 -0800)]
dmapool: link blocks across pages
The allocated dmapool pages are never freed for the lifetime of the pool.
There is no need for the two level list+stack lookup for finding a free
block since nothing is ever removed from the list. Just use a simple
stack, reducing time complexity to constant.
The implementation inserts the stack linking elements and the dma handle
of the block within itself when freed. This means the smallest possible
dmapool block is increased to at most 16 bytes to accommodate these
fields, but there are no exisiting users requesting a dma pool smaller
than that anyway.
Removing the list has a significant change in performance. Using the
kernel's micro-benchmarking self test:
The module test allocates quite a few blocks that may not accurately
represent how these pools are used in real life. For a more marco level
benchmark, running fio high-depth + high-batched on nvme, this patch shows
submission and completion latency reduced by ~100usec each, 1% IOPs
improvement, and perf record's time spent in dma_pool_alloc/free were
reduced by half.
Link: https://lkml.kernel.org/r/20230126215125.4069751-12-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:23 +0000 (13:51 -0800)]
dmapool: don't memset on free twice
If debug is enabled, dmapool will poison the range, so no need to clear it
to 0 immediately before writing over it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-11-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:22 +0000 (13:51 -0800)]
dmapool: simplify freeing
The actions for busy and not busy are mostly the same, so combine these
and remove the unnecessary function. Also, the pool is about to be freed
so there's no need to poison the page data since we only check for poison
on alloc, which can't be done on a freed pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-10-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:21 +0000 (13:51 -0800)]
dmapool: consolidate page initialization
Various fields of the dma pool are set in different places. Move it all
to one function.
Link: https://lkml.kernel.org/r/20230126215125.4069751-9-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:20 +0000 (13:51 -0800)]
dmapool: rearrange page alloc failure handling
Handle the error in a condition so the good path can be in the normal
flow.
Link: https://lkml.kernel.org/r/20230126215125.4069751-8-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:19 +0000 (13:51 -0800)]
dmapool: move debug code to own functions
Clean up the normal path by moving the debug code outside it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-7-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:18 +0000 (13:51 -0800)]
dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
Avoid double-memset of the same allocated memory in dma_pool_alloc() when
both DMAPOOL_DEBUG is enabled and init_on_alloc=1.
Link: https://lkml.kernel.org/r/20230126215125.4069751-6-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:17 +0000 (13:51 -0800)]
dmapool: cleanup integer types
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places. Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-5-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:16 +0000 (13:51 -0800)]
dmapool: use sysfs_emit() instead of scnprintf()
Use sysfs_emit instead of scnprintf, snprintf or sprintf.
Link: https://lkml.kernel.org/r/20230126215125.4069751-4-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:15 +0000 (13:51 -0800)]
dmapool: remove checks for dev == NULL
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device. But nobody
ended up using dma pools without a device, and trying to do so will result
in an oops. So remove the checks for pool->dev == NULL since they are
unneeded bloat.
[kbusch@kernel.org: add check for null dev on create] Link: https://lkml.kernel.org/r/20230126215125.4069751-3-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:14 +0000 (13:51 -0800)]
dmapool: add alloc/free performance test
Patch series "dmapool enhancements", v4.
Time spent in dma_pool alloc/free increases linearly with the number of
pages backing the pool. We can reduce this to constant time with minor
changes to how free pages are tracked.
This patch (of 12):
Provide a module that allocates and frees many blocks of various sizes and
report how long it takes. This is intended to provide a consistent way to
measure how changes to the dma_pool_alloc/free routines affect timing.
v5->v6:
According to David's suggestions, the following changes are made:
1) Rename check_ksm_zero_pages_count() -> ksm_get_zero_pages(), and do the
comparison outside.
2) Open all global fd from main() rather than the test case.
3) Remove COW-related test codes and focus on explicit unmerging here.
4) Add some coments to explain why wait_two_full_scans is required.
5) Clean up some unneed changes.
Link: https://lkml.kernel.org/r/202302100921574141612@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xu xin [Fri, 30 Dec 2022 01:18:47 +0000 (09:18 +0800)]
selftest: add testing unsharing and counting ksm zero page
Add a function test_unmerge_zero_page() to test the functionality on
unsharing and counting ksm-placed zero pages and counting of this patch
series.
test_unmerge_zero_page() actually contains three subjct test objects:
1) whether the count of ksm zero page can react correctly to cow
(copy on write);
2) whether the count of ksm zero page can react correctly to unmerge;
3) whether ksm zero pages are really unmerged.
Link: https://lkml.kernel.org/r/202212300918477352037@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xu xin [Fri, 30 Dec 2022 01:17:28 +0000 (09:17 +0800)]
ksm: add zero_pages_sharing documentation
When enabling use_zero_pages, pages_sharing cannot represent how much
memory saved indeed. zero_pages_sharing + pages_sharing does. add the
description of zero_pages_sharing.
Link: https://lkml.kernel.org/r/202212300917284911971@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Cc: Xiaokai Ran <ran.xiaokai@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xu xin [Fri, 30 Dec 2022 01:16:29 +0000 (09:16 +0800)]
ksm: count zero pages for each process
As the number of ksm zero pages is not included in ksm_merging_pages per
process when enabling use_zero_pages, it's unclear of how many actual
pages are merged by KSM. To let users accurately estimate their memory
demands when unsharing KSM zero-pages, it's necessary to show KSM zero-
pages per process.
Since unsharing zero pages placed by KSM accurately is achieved, then
tracking empty pages merging and unmerging is not a difficult thing any
longer.
Since we already have /proc/<pid>/ksm_stat, just add the information of
zero_pages_sharing in it.
Link: https://lkml.kernel.org/r/202212300916292181912@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Cc: Xiaokai Ran <ran.xiaokai@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xu xin [Fri, 30 Dec 2022 01:15:14 +0000 (09:15 +0800)]
ksm: count all zero pages placed by KSM
As pages_sharing and pages_shared don't include the number of zero pages
merged by KSM, we cannot know how many pages are zero pages placed by KSM
when enabling use_zero_pages, which leads to KSM not being transparent
with all actual merged pages by KSM. In the early days of use_zero_pages,
zero-pages was unable to get unshared by the ways like MADV_UNMERGEABLE so
it's hard to count how many times one of those zeropages was then
unmerged.
But now, unsharing KSM-placed zero page accurately has been achieved, so
we can easily count both how many times a page full of zeroes was merged
with zero-page and how many times one of those pages was then unmerged.
and so, it helps to estimate memory demands when each and every shared
page could get unshared.
So we add zero_pages_sharing under /sys/kernel/mm/ksm/ to show the number
of all zero pages placed by KSM.
Link: https://lkml.kernel.org/r/202212300915147801864@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xu xin [Fri, 30 Dec 2022 01:13:57 +0000 (09:13 +0800)]
ksm: support unsharing zero pages placed by KSM
use_zero_pages may be very useful, not just because of cache colouring as
described in doc, but also because use_zero_pages can accelerate merging
empty pages when there are plenty of empty pages (full of zeros) as the
time of page-by-page comparisons (unstable_tree_search_insert) is saved.
But when enabling use_zero_pages, madvise(addr, len, MADV_UNMERGEABLE) and
other ways (like write 2 to /sys/kernel/mm/ksm/run) to trigger unsharing
will *not* actually unshare the shared zeropage as placed by KSM (which is
against the MADV_UNMERGEABLE documentation). As these KSM-placed zero
pages are out of the control of KSM, the related counts of ksm pages don't
expose how many zero pages are placed by KSM (these special zero pages are
different from those initially mapped zero pages, because the zero pages
mapped to MADV_UNMERGEABLE areas are expected to be a complete and
unshared page)
To not blindly unshare all shared zero_pages in applicable VMAs, the patch
introduces a dedicated flag ZERO_PAGE_FLAG to mark the rmap_items of those
shared zero_pages. and guarantee that these rmap_items will be not freed
during the time of zero_pages not being writing, so we can only unshare
the *KSM-placed* zero_pages.
The patch will not degrade the performance of use_zero_pages as it doesn't
change the way of merging empty pages in use_zero_pages's feature.
Link: https://lkml.kernel.org/r/202212300913573751808@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn> Reviewed-by: Yang Yang <yang.yang29@zte.com.cn> Reported-by: David Hildenbrand <david@redhat.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Xuexin Jiang <jiang.xuexin@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xu xin [Fri, 30 Dec 2022 01:12:44 +0000 (09:12 +0800)]
ksm: abstract the function try_to_get_old_rmap_item
Patch series "ksm: support tracking KSM-placed zero-pages", v6.
The core idea of this patch set is to enable users to perceive the number
of any pages merged by KSM, regardless of whether use_zero_page switch has
been turned on, so that users can know how much free memory increase is
really due to their madvise(MERGEABLE) actions. But the problem is, when
enabling use_zero_pages, all empty pages will be merged with kernel zero
pages instead of with each other as use_zero_pages is disabled, and then
these zero-pages are no longer monitored by KSM.
The motivations for me to do this contains three points:
1) MADV_UNMERGEABLE and other ways to trigger unsharing will *not*
unshare the shared zeropage as placed by KSM (which is against the
MADV_UNMERGEABLE documentation at least); see the link:
https://lore.kernel.org/lkml/4a3daba6-18f9-d252-697c-197f65578c44@redhat.com/
2) We cannot know how many pages are zero pages placed by KSM when
enabling use_zero_pages, which hides the critical information about
how much actual memory are really saved by KSM. Knowing how many
ksm-placed zero pages are helpful for user to use the policy of madvise
(MERGEABLE) better because they can see the actual profit brought by KSM.
3) The zero pages placed-by KSM are different from those initial empty page
(filled with zeros) which are never touched by applications. The former
is active-merged by KSM while the later have never consume actual memory.
use_zero_pages is useful, not only because of cache colouring as described
in doc, but also because use_zero_pages can accelerate merging empty pages
when there are plenty of empty pages (full of zeros) as the time of
page-by-page comparisons (unstable_tree_search_insert) is saved. So we
hope to implement the support for ksm zero page tracking without affecting
the feature of use_zero_pages.
Zero pages may be the most common merged pages in actual environment(not
only VM but also including other application like containers). Enabling
use_zero_pages in the environment with plenty of empty pages(full of
zeros) will be very useful. Users and app developer can also benefit from
knowing the proportion of zero pages in all merged pages to optimize
applications.
With the patch series, we can both unshare zero-pages(KSM-placed)
accurately and count ksm zero pages with enabling use_zero_pages.
This patch (of 6):
A new function try_to_get_old_rmap_item is abstracted from
get_next_rmap_item. This function will be reused by the subsequent
patches about counting ksm_zero_pages.
The patch improves the readability and reusability of KSM code.
Jiaqi Yan [Mon, 5 Dec 2022 23:40:59 +0000 (15:40 -0800)]
mm/khugepaged: recover from poisoned file-backed memory
Make collapse_file roll back when copying pages failed. More concretely:
- extract copying operations into a separate loop
- postpone the updates for nr_none until both scanning and copying
succeeded
- postpone joining small xarray entries until both scanning and copying
succeeded
- postpone the update operations to NR_XXX_THPS until both scanning and
copying succeeded
- for non-SHMEM file, roll back filemap_nr_thps_inc if scan succeeded but
copying failed
Tested manually:
0. Enable khugepaged on system under test. Mount tmpfs at /mnt/ramdisk.
1. Start a two-thread application. Each thread allocates a chunk of
non-huge memory buffer from /mnt/ramdisk.
2. Pick 4 random buffer address (2 in each thread) and inject
uncorrectable memory errors at physical addresses.
3. Signal both threads to make their memory buffer collapsible, i.e.
calling madvise(MADV_HUGEPAGE).
4. Wait and then check kernel log: khugepaged is able to recover from
poisoned pages by skipping them.
5. Signal both threads to inspect their buffer contents and make sure no
data corruption.
Link: https://lkml.kernel.org/r/20221205234059.42971-3-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <tongtiangen@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiaqi Yan [Mon, 5 Dec 2022 23:40:58 +0000 (15:40 -0800)]
mm/khugepaged: recover from poisoned anonymous memory
Patch series "Memory poison recovery in khugepaged collapsing", v9.
Problem
=======
Memory DIMMs are subject to multi-bit flips, i.e. memory errors. As
memory size and density increase, the chances of and number of memory
errors increase. The increasing size and density of server RAM in the
data center and cloud have shown increased uncorrectable memory errors.
There are already mechanisms in the kernel to recover from uncorrectable
memory errors. This series of patches provides the recovery mechanism for
the particular kernel agent khugepaged when it collapses memory pages.
Impact
======
The main reason we chose to make khugepaged collapsing tolerant of memory
failures was its high possibility of accessing poisoned memory while
performing functionally optional compaction actions. Standard
applications typically don't have strict requirements on the size of their
pages. So they are given 4K pages by the kernel. The kernel is able to
improve application performance by either
1) giving applications 2M pages to begin with, or
2) collapsing 4K pages into 2M pages when possible.
This collapsing operation is done by khugepaged, a kernel agent that is
constantly scanning memory. When collapsing 4K pages into a 2M page, it
must copy the data from the 4K pages into a physically contiguous 2M page.
Therefore, as long as there exists one poisoned cache line in collapsible
4K pages, khugepaged will eventually access it. The current impact to
users is a machine check exception triggered kernel panic. However,
khugepaged’s compaction operations are not functionally required kernel
actions. Therefore making khugepaged tolerant to poisoned memory will
greatly improve user experience.
This patch series is for cases where khugepaged is the first guy that
detects the memory errors on the poisoned pages. IOW, the pages are not
known to have memory errors when khugepaged collapsing gets to them. In
our observation, this happens frequently when the huge page ratio of the
system is relatively low, which is fairly common in virtual machines
running on cloud.
Solution
========
As stated before, it is less desirable to crash the system only because
khugepaged accesses poisoned pages while it is collapsing 4K pages. The
high level idea of this patch series is to skip the group of pages
(usually 512 4K-size pages) once khugepaged finds one of them is poisoned,
as these pages have become ineligible to be collapsed.
We are also careful to unwind operations khuagepaged has performed before
it detects memory failures. For example, before copying and collapsing a
group of anonymous pages into a huge page, the source pages will be
isolated and their page table is unlinked from their PMD. These
operations need to be undone in order to ensure these pages are not
changed/lost from the perspective of other threads (both user and kernel
space). As for file backed memory pages, there already exists a rollback
case. This patch just extends it so that khugepaged also correctly rolls
back when it fails to copy poisoned 4K pages.
This patch (of 2):
Make __collapse_huge_page_copy return whether copying anonymous pages
succeeded, and make collapse_huge_page handle the return status.
Break existing PTE scan loop into two for-loops. The first loop copies
source pages into target huge page, and can fail gracefully when running
into memory errors in source pages. If copying all pages succeeds, the
second loop releases and clears up these normal pages. Otherwise, the
second loop rolls back the page table and page states by: -
re-establishing the original PTEs-to-PMD connection. - releasing source
pages back to their LRU list.
Tested manually:
0. Enable khugepaged on system under test.
1. Start a two-thread application. Each thread allocates a chunk of
non-huge anonymous memory buffer.
2. Pick 4 random buffer locations (2 in each thread) and inject
uncorrectable memory errors at corresponding physical addresses.
3. Signal both threads to make their memory buffer collapsible, i.e.
calling madvise(MADV_HUGEPAGE).
4. Wait and check kernel log: khugepaged is able to recover from poisoned
pages and skips collapsing them.
5. Signal both threads to inspect their buffer contents and make sure no
data corruption.
Link: https://lkml.kernel.org/r/20221205234059.42971-1-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20221205234059.42971-2-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <tongtiangen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 25 Jan 2023 17:05:35 +0000 (09:05 -0800)]
mm/hugetlb: convert hugetlb_add_to_page_cache to take in a folio
Every caller of hugetlb_add_to_page_cache() is now passing in
&folio->page, change the function to take in a folio directly and clean up
the call sites.
Link: https://lkml.kernel.org/r/20230125170537.96973-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 25 Jan 2023 17:05:34 +0000 (09:05 -0800)]
mm/hugetlb: convert restore_reserve_on_error to take in a folio
Every caller of restore_reserve_on_error() is now passing in &folio->page,
change the function to take in a folio directly and clean up the call
sites.
Link: https://lkml.kernel.org/r/20230125170537.96973-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 25 Jan 2023 17:05:33 +0000 (09:05 -0800)]
mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()
Change alloc_huge_page() to alloc_hugetlb_folio() by changing all callers
to handle the now folio return type of the function. In this conversion,
alloc_huge_page_vma() is also changed to alloc_hugetlb_folio_vma() and
hugepage_add_new_anon_rmap() is changed to take in a folio directly. Many
additions of '&folio->page' are cleaned up in subsequent patches.
hugetlbfs_fallocate() is also refactored to use the RCU +
page_cache_next_miss() API.
Link: https://lkml.kernel.org/r/20230125170537.96973-5-sidhartha.kumar@oracle.com Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 25 Jan 2023 17:05:32 +0000 (09:05 -0800)]
mm/hugetlb: convert putback_active_hugepage to take in a folio
Convert putback_active_hugepage() to folio_putback_active_hugetlb(), this
removes one user of the Huge Page macros which take in a page. The
callers in migrate.c are also cleaned up by being able to directly use the
src and dst folio variables.
Link: https://lkml.kernel.org/r/20230125170537.96973-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Wed, 25 Jan 2023 17:05:30 +0000 (09:05 -0800)]
mm/hugetlb: convert hugetlb_install_page to folios
Patch series "convert hugetlb fault functions to folios", v2.
This series converts the hugetlb page faulting functions to operate on
folios. These include hugetlb_no_page(), hugetlb_wp(),
copy_hugetlb_page_range(), and hugetlb_mcopy_atomic_pte().
This patch (of 8):
Change hugetlb_install_page() to hugetlb_install_folio(). This reduces
one user of the Huge Page flag macros which take in a page.
Sidhartha Kumar [Fri, 13 Jan 2023 22:30:55 +0000 (16:30 -0600)]
mm/hugetlb: convert alloc_migrate_huge_page to folios
Change alloc_huge_page_nodemask() to alloc_hugetlb_folio_nodemask() and
alloc_migrate_huge_page() to alloc_migrate_hugetlb_folio(). Both
functions now return a folio rather than a page.
Link: https://lkml.kernel.org/r/20230113223057.173292-7-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Fri, 13 Jan 2023 22:30:54 +0000 (16:30 -0600)]
mm/hugetlb: increase use of folios in alloc_huge_page()
Change hugetlb_cgroup_commit_charge{,_rsvd}(), dequeue_huge_page_vma() and
alloc_buddy_huge_page_with_mpol() to use folios so alloc_huge_page() is
cleaned by operating on folios until its return.
Link: https://lkml.kernel.org/r/20230113223057.173292-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Fri, 13 Jan 2023 22:30:52 +0000 (16:30 -0600)]
mm/hugetlb: convert dequeue_hugetlb_page functions to folios
dequeue_huge_page_node_exact() is changed to dequeue_hugetlb_folio_node_
exact() and dequeue_huge_page_nodemask() is changed to dequeue_hugetlb_
folio_nodemask(). Update their callers to pass in a folio.
Link: https://lkml.kernel.org/r/20230113223057.173292-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Fri, 13 Jan 2023 22:30:50 +0000 (16:30 -0600)]
mm/hugetlb: convert isolate_hugetlb to folios
Patch series "continue hugetlb folio conversion", v3.
This series continues the conversion of core hugetlb functions to use
folios. This series converts many helper funtions in the hugetlb fault
path. This is in preparation for another series to convert the hugetlb
fault code paths to operate on folios.
This patch (of 8):
Convert isolate_hugetlb() to take in a folio and convert its callers to
pass a folio. Use page_folio() to convert the callers to use a folio is
safe as isolate_hugetlb() operates on a head page.
David Chen [Thu, 9 Feb 2023 17:48:28 +0000 (17:48 +0000)]
mm/page_alloc.c: fix page corruption caused by racy check in __free_pages
When we upgraded our kernel, we started seeing some page corruption like
the following consistently:
BUG: Bad page state in process ganesha.nfsd pfn:1304ca
page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
flags: 0x17ffffc0000000()
raw: 0017ffffc0000000ffff8a513ffd4c98ffffeee24b35ec080000000000000000
raw: 0000000000000000000000000000000100000000ffffff7f0000000000000000
page dumped because: nonzero mapcount
CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
Call Trace:
dump_stack+0x74/0x96
bad_page.cold+0x63/0x94
check_new_page_bad+0x6d/0x80
rmqueue+0x46e/0x970
get_page_from_freelist+0xcb/0x3f0
? _cond_resched+0x19/0x40
__alloc_pages_nodemask+0x164/0x300
alloc_pages_current+0x87/0xf0
skb_page_frag_refill+0x84/0x110
...
Sometimes, it would also show up as corruption in the free list pointer and
cause crashes.
After bisecting the issue, we found the issue started from e320d3012d25:
if (put_page_testzero(page))
free_the_page(page, order);
else if (!PageHead(page))
while (order-- > 0)
free_the_page(page + (1 << order), order);
So the problem is the check PageHead is racy because at this point we
already dropped our reference to the page. So even if we came in with
compound page, the page can already be freed and PageHead can return false
and we will end up freeing all the tail pages causing double free.
Zach O'Keefe [Wed, 25 Jan 2023 01:57:37 +0000 (17:57 -0800)]
mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount
During collapse, in a few places we check to see if a given small page has
any unaccounted references. If the refcount on the page doesn't match our
expectations, it must be there is an unknown user concurrently interested
in the page, and so it's not safe to move the contents elsewhere.
However, the unaccounted pins are likely an ephemeral state.
In this situation, MADV_COLLAPSE returns -EINVAL when it should return
-EAGAIN. This could cause userspace to conclude that the syscall
failed, when it in fact could succeed by retrying.
Qian Yingjin [Wed, 8 Feb 2023 02:24:00 +0000 (10:24 +0800)]
mm/filemap: fix page end in filemap_get_read_batch
I was running traces of the read code against an RAID storage system to
understand why read requests were being misaligned against the underlying
RAID strips. I found that the page end offset calculation in
filemap_get_read_batch() was off by one.
When a read is submitted with end offset 1048575, then it calculates the
end page for read of 256 when it should be 255. "last_index" is the index
of the page beyond the end of the read and it should be skipped when get a
batch of pages for read in @filemap_get_read_batch().
The below simple patch fixes the problem. This code was introduced in
kernel 5.12.
Link: https://lkml.kernel.org/r/20230208022400.28962-1-coolqyj@163.com Fixes: cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read") Signed-off-by: Qian Yingjin <qian@ddn.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jason Gunthorpe [Tue, 24 Jan 2023 20:34:34 +0000 (16:34 -0400)]
mm/gup: move private gup FOLL_ flags to internal.h
Move the flags that should not/are not used outside gup.c and related into
mm/internal.h to discourage driver abuse.
To make this more maintainable going forward compact the two FOLL ranges
with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
explicit.
Switch to an enum so the whole thing is easier to read.
Link: https://lkml.kernel.org/r/13-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>