Lorenzo Stoakes [Fri, 25 Oct 2024 12:26:27 +0000 (13:26 +0100)]
mm: defer second attempt at merge on mmap()
Rather than trying to merge again when ostensibly allocating a new VMA,
instead defer until the VMA is added and attempt to merge the existing
range.
This way we have no complicated unwinding logic midway through the process
of mapping the VMA.
In addition this removes limitations on the VMA not being able to be the
first in the virtual memory address space which was previously implicitly
required.
In theory, for this very same reason, we should unconditionally attempt a
merge here. However, this is likely to have a performance impact, so it is
better to avoid it given the unlikely outcome of a merge.
Lorenzo Stoakes [Fri, 25 Oct 2024 12:26:26 +0000 (13:26 +0100)]
mm: remove unnecessary reset state logic on merge new VMA
The only place where this was used was in mmap_region(), which we have now
adjusted to not require this to be performed (we reset ourselves in
effect).
It also created a dangerous assumption that VMG state could be safely
reused after a merge, at which point it may have been mutated in
unexpected ways, leading to subtle bugs.
Note that Wei Yang discovered an additional error in this code - we were
comparing vmg->vma with prev after setting it to NULL. This had no impact,
as we previously reset the VMA iterator state before attempting the merge
again, but it was wasted effort.
In any case, this patch removes all of the logic so also eliminates this
wasted effort.
Link: https://lkml.kernel.org/r/5d9a59eee6498ae017cc87d89aa723de7179f75d.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 25 Oct 2024 12:26:25 +0000 (13:26 +0100)]
mm: refactor __mmap_region()
We have seen bugs and resource leaks arise from the complexity of the
__mmap_region() function. This, together with its deeply fragile error
handling logic and a level of complexity that makes the function difficult
to understand, makes it highly desirable to refactor it into something readable.
Achieve this by separating the function into smaller logical parts which
are easier to understand and follow, and which importantly very
significantly simplify the error handling.
Note that we now call vms_abort_munmap_vmas() in more error paths than we
used to, however in cases where no abort need occur, vms->nr_pages will be
equal to zero and we simply exit this function without doing more than we
would have done previously.
Importantly, the invocation of the driver mmap hook via mmap_file() now
has very simple and obvious handling (this was previously the most
problematic part of the mmap() operation).
Use a generalised stack-based 'mmap state' to thread through values and
also retrieve state as needed.
Also avoid ever relying on vma merge (vmg) state after a merge is
attempted, instead maintain meaningful state in the mmap state and
establish vmg state as and when required.
This avoids any subtle bugs arising from merge logic mutating this state
and mmap_region() logic later relying upon it.
Link: https://lkml.kernel.org/r/25bd2edc3275450f448cbfe0756ce2a7cd06810f.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 25 Oct 2024 12:26:24 +0000 (13:26 +0100)]
mm: isolate mmap internal logic to mm/vma.c
In previous commits we effected improvements to the mmap() logic in
mmap_region() and its newly introduced internal implementation function
__mmap_region().
However as these changes are intended to be backported, we kept the delta
as small as is possible and made as few changes as possible to the newly
introduced mm/vma.* files.
Take the opportunity to move this logic to mm/vma.c which not only
isolates it, but also makes it available for later userland testing which
can help us catch such logic errors far earlier.
Link: https://lkml.kernel.org/r/93fc2c3aa37dd30590b7e4ee067dfd832007bf7e.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fix error handling in mmap_region() and refactor", v3.
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous ways in which issues can arise, leaving incomplete
state, memory leaks and other unpleasantness behind.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the
VMA is in an inconsistent state.
This series builds on the previously submitted hotfix patches (see link to
v2 below) which addresses the most critical issues around mmap_region(),
and further works to improve mmap_region() complexity, stability, and
testability.
This series moves the code to mm/vma.c to render it userland testable, and
refactors and simplifies it into smaller functions that are significantly
more readable.
It additionally avoids performing an attempt at a second merge mid-way
through allocating a new VMA, a dubious proposition at best and one that
is highly subject to subtle bugs.
Rather than do this, we simply note that we ought to retry the merge and
do this as a final step.
This patch (of 3):
Add some additional vma_internal.h stubs in preparation for
__mmap_region() being moved to mm/vma.c. Without these the move would
result in the tests no longer compiling.
Shakeel Butt [Fri, 25 Oct 2024 01:23:03 +0000 (18:23 -0700)]
memcg-v1: remove memcg move locking code
The memcg v1 charge move feature has been deprecated. All the places
using the memcg move lock have stopped using it, as they no longer need the
protection. Let's proceed to remove all the locking code related
to charge moving.
Link: https://lkml.kernel.org/r/20241025012304.2473312-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Fri, 25 Oct 2024 01:23:02 +0000 (18:23 -0700)]
memcg-v1: no need for memcg locking for MGLRU
While updating the generation of the folios, MGLRU requires that the
folio's memcg association remains stable. With the charge migration
deprecated, there is no need for MGLRU to acquire locks to keep the folio
and memcg association stable.
Shakeel Butt [Fri, 25 Oct 2024 01:23:01 +0000 (18:23 -0700)]
memcg-v1: no need for memcg locking for writeback tracking
During the era of memcg charge migration, the kernel had to make
sure that writeback stat updates did not race with the charge
migration. Otherwise it might update the writeback stats of the wrong
memcg. Now that memcg charge migration is gone, there is no longer any race
for writeback stat updates and the previous locking can be removed.
Link: https://lkml.kernel.org/r/20241025012304.2473312-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Fri, 25 Oct 2024 01:23:00 +0000 (18:23 -0700)]
memcg-v1: no need for memcg locking for dirty tracking
During the era of memcg charge migration, the kernel had to make
sure that dirty stat updates did not race with the charge migration.
Otherwise it might update the dirty stats of the wrong memcg. Now
that memcg charge migration is gone, there is no longer any race for dirty
stat updates and the previous locking can be removed.
Link: https://lkml.kernel.org/r/20241025012304.2473312-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memcg-v1: fully deprecate charge moving".
The memcg v1 charge moving feature has been deprecated for almost 2
years and the kernel warns if someone tries to use it. This warning has
been backported to all stable kernels and there have not been any reports of
the warning or requests to keep supporting this feature. Let's proceed
to fully deprecate this feature.
This patch (of 6):
Proceed with the complete deprecation of memcg v1's charge moving feature.
The deprecation warning has been in the kernel for almost two years and
has been backported to all stable kernels since. Now is the time to fully
deprecate this feature.
Baolin Wang [Sat, 26 Oct 2024 13:51:52 +0000 (21:51 +0800)]
mm: shmem: fallback to page size splice if large folio has poisoned pages
tmpfs already supports PMD-sized large folios, but splice() cannot read any
pages if the large folio contains a poisoned page, which is not good, as
Matthew pointed out in a previous email[1]:
"so if we have hwpoison set on one page in a folio, we now can't read
bytes from any page in the folio? That seems like we've made a bad
situation worse."
Thus add a fallback to PAGE_SIZE splice(), which still allows reading normal
pages if the large folio has hwpoisoned pages.
Zheng Yejian [Tue, 22 Oct 2024 08:39:27 +0000 (16:39 +0800)]
mm/damon/vaddr: add 'nr_piece == 1' check in damon_va_evenly_split_region()
As discussed in [1], damon_va_evenly_split_region() is called to
split a region evenly by size into 'nr_pieces' small regions;
when nr_pieces == 1, no actual split is required. Check for that case
for better code readability and add a simple kunit testcase.
Zheng Yejian [Tue, 22 Oct 2024 08:39:26 +0000 (16:39 +0800)]
mm/damon/vaddr: fix issue in damon_va_evenly_split_region()
Patch series "mm/damon/vaddr: Fix issue in
damon_va_evenly_split_region()". v2.
According to the logic of damon_va_evenly_split_region(), the following
split case currently does not meet the expectation:
Suppose DAMON_MIN_REGION=0x1000,
Case: Split [0x0, 0x3000) into 2 pieces, then the result would actually be
3 regions:
[0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000)
but NOT the expected 2 regions:
[0x0, 0x1000), [0x1000, 0x3000) !!!
The root cause is that, when calculating the size of each split piece in
damon_va_evenly_split_region(), both the division and the ALIGN_DOWN may
cause a loss of precision, so repeatedly splitting off one piece of size
'sz_piece' from the original 'start' towards 'end' produces more pieces
than expected.
To fix it, count each split piece and make sure no more than 'nr_pieces'
are produced. In addition, add the above case to damon_test_split_evenly().
Also add an 'nr_piece == 1' check in damon_va_evenly_split_region() for
better code readability, with a corresponding kunit testcase.
This patch (of 2):
According to the logic of damon_va_evenly_split_region(), the following
split case currently does not meet the expectation:
Suppose DAMON_MIN_REGION=0x1000,
Case: Split [0x0, 0x3000) into 2 pieces, then the result would actually be
3 regions:
[0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000)
but NOT the expected 2 regions:
[0x0, 0x1000), [0x1000, 0x3000) !!!
The root cause is that, when calculating the size of each split piece in
damon_va_evenly_split_region(), both the division and the ALIGN_DOWN may
cause a loss of precision, so repeatedly splitting off one piece of size
'sz_piece' from the original 'start' towards 'end' produces more pieces
than expected.
To fix it, count each split piece and make sure no more than 'nr_pieces'
are produced. In addition, add the above case to damon_test_split_evenly().
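A stand-alone C illustration of the failure mode and the counted fix
described above (not the DAMON code itself); DAMON_MIN_REGION and the region
values are taken from the example:

#include <stdio.h>

#define DAMON_MIN_REGION 0x1000UL
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

int main(void)
{
        unsigned long start = 0x0, end = 0x3000, nr_pieces = 2;
        /* sz_piece = ALIGN_DOWN(0x1800, 0x1000) = 0x1000: precision is lost */
        unsigned long sz_piece = ALIGN_DOWN((end - start) / nr_pieces, DAMON_MIN_REGION);
        unsigned long r_start = start;
        unsigned long i;

        /* Counted scheme: stop after nr_pieces - 1 splits, last piece takes the rest. */
        for (i = 0; i + 1 < nr_pieces && r_start + sz_piece < end; i++) {
                printf("[%#lx, %#lx)\n", r_start, r_start + sz_piece);
                r_start += sz_piece;
        }
        printf("[%#lx, %#lx)\n", r_start, end); /* -> [0x1000, 0x3000), 2 pieces total */
        return 0;
}

Without the piece counter, a loop bounded only by the end address would emit
[0x0, 0x1000), [0x1000, 0x2000) and [0x2000, 0x3000), i.e. the three regions
shown above.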
Ryan Roberts [Mon, 21 Oct 2024 13:00:26 +0000 (14:00 +0100)]
mm/memcontrol: fix seq_buf size to save memory when PAGE_SIZE is large
Previously the seq_buf used for accumulating the memory.stat output was
sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE;
if 4K is enough on a 4K page system, then it should also be enough on a
64K page system, so we can save 60K on the static buffer used in
mem_cgroup_print_oom_meminfo(). Let's make it so.
This also has the beneficial side effect of removing a place in the code
that assumed PAGE_SIZE is a compile-time constant. So this helps our
quest towards supporting boot-time page size selection.
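As a rough sketch of the change, assuming the buffer is simply given a fixed
4K size; the define name and the surrounding code in mm/memcontrol.c may
differ, and memory_stat_format() is assumed to be the existing stat
formatter:

#define MEMCG_STAT_BUF_SIZE SZ_4K       /* was PAGE_SIZE, i.e. 64K with 64K pages */

static void print_memcg_stats_sketch(struct mem_cgroup *memcg)
{
        /* static buffer: 4K here instead of 64K when sized by a 64K PAGE_SIZE */
        static char buf[MEMCG_STAT_BUF_SIZE];
        struct seq_buf s;

        seq_buf_init(&s, buf, sizeof(buf));
        memory_stat_format(memcg, &s);  /* assumed helper in mm/memcontrol.c */
        pr_info("%s", buf);
}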
Link: https://lkml.kernel.org/r/20241021130027.3615969-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
James Houghton [Mon, 21 Oct 2024 16:02:12 +0000 (16:02 +0000)]
mm: add missing mmu_notifier_clear_young for !MMU_NOTIFIER
Remove the now unnecessary ifdef in mm/damon/vaddr.c as well.
Link: https://lkml.kernel.org/r/20241021160212.9935-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jim Zhao [Wed, 23 Oct 2024 10:00:32 +0000 (18:00 +0800)]
mm/page-writeback: raise wb_thresh to prevent write blocking with strictlimit
With the strictlimit flag, wb_thresh acts as a hard limit in
balance_dirty_pages() and wb_position_ratio(). When device write
operations are inactive, wb_thresh can drop to 0, causing writes to be
blocked. The issue occasionally occurs in fuse fs, particularly with
network backends, where the write thread is frequently blocked for a period
of time.
To address it, this patch raises the minimum wb_thresh to a controllable
level, similar to the non-strictlimit case.
Baolin Wang [Fri, 18 Oct 2024 03:00:28 +0000 (11:00 +0800)]
mm: shmem: improve the tmpfs large folio read performance
tmpfs already supports PMD-sized large folios, but the tmpfs read
operation still performs copying at PAGE_SIZE granularity, which is
unreasonable. This patch changes tmpfs to copy data at folio granularity,
which can improve the read performance, and switches to using folio-related
functions.
Moreover, if a large folio has a subpage that is hwpoisoned, it will
still fall back to page granularity copying.
Use 'fio bs=64k' to read a 1G tmpfs file populated with 2M THPs, and I can
see about 20% performance improvement, and no regression with bs=4k.
Before the patch:
READ: bw=10.0GiB/s
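A minimal sketch of folio-granularity copying for the read path, assuming
the usual filemap-style variables; the hwpoison fallback, error handling and
ki_pos updates of the real shmem read path are omitted, and nr is assumed to
fit within the folio:

static size_t shmem_copy_folio_sketch(struct folio *folio, loff_t pos,
                                      size_t nr, struct iov_iter *to)
{
        size_t offset = offset_in_folio(folio, pos);

        /* One copy per folio instead of one copy_page_to_iter() per subpage. */
        return copy_folio_to_iter(folio, offset, nr, to);
}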
Baolin Wang [Fri, 18 Oct 2024 03:00:27 +0000 (11:00 +0800)]
mm: shmem: update iocb->ki_pos directly to simplify tmpfs read logic
Patch series "Improve the tmpfs large folio read performance", v2.
tmpfs already supports PMD-sized large folios, but the tmpfs read
operation still performs copying at PAGE_SIZE granularity, which is not
perfect. This patchset changes tmpfs to copy data at the folio
granularity, which can improve the read performance.
Use 'fio bs=64k' to read a 1G tmpfs file populated with 2M THPs, and I can
see about 20% performance improvement, and no regression with bs=4k. I
also did some functional testing with the xfstests suite, and I did not
find any regressions with the following xfstests config:
Using iocb->ki_pos to check whether the bytes read exceed the file size and
to calculate the bytes to be read can help simplify the code logic.
Meanwhile, this is also a preparation for improving tmpfs large folios
read performance in the following patch.
Liam R. Howlett [Fri, 18 Oct 2024 17:41:13 +0000 (13:41 -0400)]
mm/mremap: cleanup vma_to_resize()
Patch series "mm/mremap: Remove extra vma tree walk", v2.
An extra vma tree walk was discovered in some mremap call paths during the
discussion on mseal() changes. This patch set removes the extra vma tree
walk and further cleans up mremap_to().
This patch (of 2):
vma_to_resize() is used in two locations to find and validate the vma for
the mremap location. One of the two locations already has the vma, which
is then re-found to validate the same vma.
This code can be simplified by moving the vma_lookup() from
vma_to_resize() to mremap_to() and changing the return type to an int
error.
Since the function now just validates the vma, the function is renamed to
resize_is_valid() to better reflect what it is doing.
This commit also adds documentation about the function.
Pankaj Raghav [Thu, 17 Oct 2024 06:23:42 +0000 (08:23 +0200)]
mm: don't set readahead flag on a folio when lookahead_size > nr_to_read
The readahead flag is set on a folio based on the lookahead_size and
nr_to_read. For example, when the readahead happens from index to index +
nr_to_read, then the readahead `mark` offset from index is set at
nr_to_read - lookahead_size.
There are some scenarios where the lookahead_size > nr_to_read. For
example, readahead window was created, but the file was truncated before
the readahead starts. do_page_cache_ra() will clamp the nr_to_read if the
readahead window extends beyond EOF after truncation. If this happens,
readahead flag should not be set on any folio on the current readahead
window.
The current calculation for `mark` with mapping_min_order > 0 gives
incorrect results when lookahead_size > nr_to_read due to rounding up
operation:
ra_folio_index = round_up(128 + 16 - 28, 16) = 128;
mark = 128 - 128 = 0; # offset from index to set RA flag
In the above example, the lookahead_size is actually lying outside the
current readahead window. Without this patch, the RA flag will be set
incorrectly on the folio at index 128. This can lead to marking the
readahead flag on the wrong folio and, therefore, triggering a readahead
when it is not necessary.
Explicitly initialize `mark` to be ULONG_MAX and only calculate it when
lookahead_size is within the readahead window.
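A stand-alone C illustration of the calculation above and the guard being
added (values from the example; this is not the readahead code itself):

#include <stdio.h>
#include <limits.h>

#define round_up(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
        unsigned long index = 128, nr_to_read = 16, lookahead_size = 28;
        unsigned long min_nrpages = 16;         /* mapping_min_order folio size */
        unsigned long mark = ULONG_MAX;         /* "no folio gets the RA flag" */

        /* Only compute a mark when the lookahead lies inside this window. */
        if (lookahead_size <= nr_to_read) {
                unsigned long ra_folio_index =
                        round_up(index + nr_to_read - lookahead_size, min_nrpages);
                mark = ra_folio_index - index;
        }
        /* Unguarded, the round_up() above would yield mark = 0 as shown earlier. */
        printf("mark = %lu\n", mark);           /* ULONG_MAX: flag set nowhere */
        return 0;
}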
Link: https://lkml.kernel.org/r/20241017062342.478973-1-kernel@pankajraghav.com Fixes: 26cfdb395eef ("readahead: allocate folios with mapping_min_order in readahead") Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Thu, 17 Oct 2024 14:14:57 +0000 (22:14 +0800)]
mm: shmem: remove __shmem_huge_global_enabled()
Remove __shmem_huge_global_enabled() since it has only one caller, and
remove the repeated check of VM_NOHUGEPAGE/MMF_DISABLE_THP as they are
already checked in shmem_allowable_huge_orders(); also remove the
unnecessary vma parameter.
Link: https://lkml.kernel.org/r/20241017141457.1169092-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Thu, 17 Oct 2024 14:14:56 +0000 (22:14 +0800)]
mm: huge_memory: move file_thp_enabled() into huge_memory.c
file_thp_enabled() is only used in __thp_vma_allowable_orders(), so move
it into huge_memory.c; also check READ_ONLY_THP_FOR_FS up front to avoid
unnecessary code when the config is disabled.
Link: https://lkml.kernel.org/r/20241017141457.1169092-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Thu, 17 Oct 2024 14:17:42 +0000 (22:17 +0800)]
tmpfs: don't enable large folios if not supported
tmpfs can support large folios, but there are some configurable options
(mount options and runtime deny/force) to enable/disable large folio
allocation, so there is a performance issue when performing writes without
large folios. The issue is similar to commit 4e527d5841e2 ("iomap: fault
in smaller chunks for non-large folio mappings").
Since 'deny' is for emergencies and 'force' is for testing, performance
issues should not be a problem in real production environments, so don't
call mapping_set_large_folios() in __shmem_get_inode() when large folios are
disabled with the mount huge=never option (the default policy).
Link: https://lkml.kernel.org/r/20241017141742.1169404-1-wangkefeng.wang@huawei.com Fixes: 9aac777aaf94 ("filemap: Convert generic_perform_write() to support large folios") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 17 Oct 2024 16:56:38 +0000 (17:56 +0100)]
tools: testing: fix phys_addr_t size on 64-bit systems
The phys_addr_t size is predicated on whether CONFIG_PHYS_ADDR_T_64BIT is
set or not.
In the VMA tests, virt_to_phys() from tools/include/linux casts a volatile
void * pointer to phys_addr_t; if CONFIG_PHYS_ADDR_T_64BIT is not set,
this will be 32-bit and trigger a warning.
Obviously this might also lead to truncation, which we would rather avoid.
Fix this by adjusting the generation of generated/bit-length.h to generate
a CONFIG_PHYS_ADDR_T{bits}BIT define.
This does result in the generation of the useless CONFIG_PHYS_ADDR_T_32BIT
define for 32-bit systems, but this should have no effect, and makes
implementation of this easier.
Wei Xu [Thu, 17 Oct 2024 18:15:28 +0000 (18:15 +0000)]
mm/mglru: reset page lru tier bits when activating
When a folio is activated, lru_gen_add_folio() moves the folio to the
youngest generation. But unlike folio_update_gen()/folio_inc_gen(),
lru_gen_add_folio() doesn't reset the folio lru tier bits (LRU_REFS_MASK |
LRU_REFS_FLAGS). This inconsistency can affect how pages are aged via
folio_mark_accessed() (e.g. fd accesses), though no user visible impact
related to this has been detected yet.
Note that lru_gen_add_folio() cannot clear PG_workingset if the activation
is due to workingset refault, otherwise PSI accounting will be skipped.
So fix lru_gen_add_folio() to clear the lru tier bits other than
PG_workingset when activating a folio, and also clear all the lru tier
bits when a folio is activated via folio_activate() in
lru_gen_look_around().
Link: https://lkml.kernel.org/r/20241017181528.3358821-1-weixugc@google.com Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") Signed-off-by: Wei Xu <weixugc@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens <heftig@archlinux.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Wed, 16 Oct 2024 18:23:52 +0000 (21:23 +0300)]
percpu: add a test case for the specific 64-bit value addition
Adding UINT_MAX as a 64-bit unsigned value to a percpu variable is a corner
case, as it is not the same as -1 (ULONG_LONG_MAX). Add a test case for
that.
Link: https://lkml.kernel.org/r/20241016182635.1156168-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Wed, 16 Oct 2024 18:23:51 +0000 (21:23 +0300)]
x86/percpu: fix clang warning when dealing with unsigned types
Patch series "percpu: Add a test case and fix for clang", v2.
Add a test case to percpu to check a corner case with the specific 64-bit
unsigned value. This test case shows why the first patch is done in the
way it's done.
The before and after have been tested with a binary comparison of the
percpu_test module and by running it on a real Intel system.
This patch (of 2):
When percpu_add_op() is used with an unsigned argument, it prevents kernel
builds with clang, `make W=1` and CONFIG_WERROR=y:
net/ipv4/tcp_output.c:187:3: error: result of comparison of constant -1 with expression of type 'u8' (aka 'unsigned char') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
187 | NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPACKCOMPRESSED,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
188 | tp->compressed_ack);
| ~~~~~~~~~~~~~~~~~~~
...
arch/x86/include/asm/percpu.h:238:31: note: expanded from macro 'percpu_add_op'
238 | ((val) == 1 || (val) == -1)) ? \
| ~~~~~ ^ ~~
Fix this by casting -1 to the type of the parameter and then compare.
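A stand-alone C demonstration of why the cast matters (the percpu_add_op()
macro itself is not reproduced here):

#include <stdio.h>

int main(void)
{
        unsigned char val = 0xff;       /* i.e. (u8)-1 */

        /* Integer promotion: 255 == -1 is always false, hence the clang warning. */
        printf("val == -1              : %d\n", val == -1);
        /* Casting -1 to the parameter's type compares 255 == 255, as intended. */
        printf("val == (typeof(val))-1 : %d\n", val == (typeof(val))-1);
        return 0;
}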
Link: https://lkml.kernel.org/r/20241016182635.1156168-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20241016182635.1156168-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sabyrzhan Tasbolatov [Fri, 11 Oct 2024 03:53:10 +0000 (08:53 +0500)]
mm, kasan, kmsan: instrument copy_from/to_kernel_nofault
Instrument copy_from_kernel_nofault() with KMSAN for uninitialized kernel
memory check and copy_to_kernel_nofault() with KASAN, KCSAN to detect the
memory corruption.
syzbot reported that bpf_probe_read_kernel() kernel helper triggered KASAN
report via kasan_check_range() which is not the expected behaviour as
copy_from_kernel_nofault() is meant to be a non-faulting helper.
The solution, suggested by Marco Elver, is to replace the KASAN/KCSAN check
in copy_from_kernel_nofault() with KMSAN detection of copying uninitialized
kernel memory. In copy_to_kernel_nofault() we can retain
instrument_write() explicitly for the memory corruption instrumentation.
copy_to_kernel_nofault() is tested on x86_64 and arm64 with
CONFIG_KASAN_SW_TAGS. On arm64 with CONFIG_KASAN_HW_TAGS, the kunit test
currently fails and needs further clarification.
Jiazi Li [Wed, 26 Jun 2024 16:06:31 +0000 (12:06 -0400)]
maple_tree: add some alloc node test case
Add some maple_tree alloc node test cases.
Link: https://lkml.kernel.org/r/20240626160631.3636515-2-Liam.Howlett@oracle.com Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is because there may be a full maple_alloc node in the current maple
state. Using a full maple_alloc node will make max_req equal to 0, which
leads to mt_alloc_bulk() returning 0. As a result, mas_node_count() sets
mas.node to MA_ERROR(-ENOMEM).
Find a non-full maple_alloc node, and if necessary, use this non-full node
in the next while loop.
Link: https://lkml.kernel.org/r/20240626160631.3636515-1-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Saurabh Sengar [Mon, 12 Aug 2024 06:13:40 +0000 (23:13 -0700)]
mm/vmstat: defer the refresh_zone_stat_thresholds after all CPUs bringup
The refresh_zone_stat_thresholds() function has two loops, which is
expensive for higher numbers of CPUs and NUMA nodes.
Below is a rough estimate of the total iterations done by these loops, based
on the number of NUMA nodes and CPUs:
Total number of iterations: nCPU * 2 * Numa * mCPU
Where:
nCPU = total number of CPUs
Numa = total number of NUMA nodes
mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs)
For the system under test with 16 NUMA nodes and 1024 CPUs, this results
in a substantial increase in the number of loop iterations during boot-up
when NUMA is enabled:
No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds
takes around 224 ms total for all the CPUs in the system under test.
16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds
takes around 4.5 seconds total for all the CPUs in the system under test.
Calling this for each CPU is expensive when there are a large number of CPUs
along with multiple NUMA nodes. Fix this by deferring
refresh_zone_stat_thresholds() so that it is called once after all the
secondary CPUs are up. Also, register the DYN hooks to keep the existing
hotplug functionality intact.
Link: https://lkml.kernel.org/r/1723443220-20623-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu> Cc: Saurabh Singh Sengar <ssengar@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jaewon Kim [Fri, 11 Oct 2024 12:49:28 +0000 (21:49 +0900)]
vmscan: add a vmscan event for reclaim_pages
reclaim_folio_list() uses a dummy reclaim_stat that is not used anywhere.
To expose the memory stats, add a new trace event. This is useful for
knowing how many pages are not reclaimed, and why.
Currently reclaim_folio_list is only called by reclaim_pages, and
reclaim_pages is used by damon and madvise. In the latest Android,
reclaim_pages is also used by shmem to reclaim all pages in an
address_space.
Zi Yan [Fri, 11 Oct 2024 15:03:04 +0000 (11:03 -0400)]
mm: avoid zeroing user movable page twice with init_on_alloc=1
Commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and
init_on_free=1 boot options") forces allocated pages to be zeroed in
post_alloc_hook() when init_on_alloc=1.
For order-0 folios, if the arch does not define
vma_alloc_zeroed_movable_folio(), the default implementation again zeros
the page returned from the buddy allocator. So the page is zeroed twice.
Fix it by passing __GFP_ZERO instead to avoid double page zeroing. At the
moment, s390,arm64,x86,alpha,m68k are not impacted since they define their
own vma_alloc_zeroed_movable_folio().
For >0 order folios (mTHP and PMD THP), folio_zero_user() is called to
zero the folio again. Fix it by calling folio_zero_user() only if
init_on_alloc is set. All arches are impacted.
Add alloc_zeroed() helper to encapsulate the init_on_alloc check.
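A sketch of such a helper, expressed here via the existing
want_init_on_alloc() predicate; the actual alloc_zeroed() implementation in
the patch may differ in detail:

static inline bool alloc_zeroed(void)
{
        /*
         * True when init_on_alloc (or its Kconfig default) already zeroes
         * freshly allocated pages; GFP_KERNEL carries no __GFP_ZERO, so this
         * reduces to the init_on_alloc check.
         */
        return want_init_on_alloc(GFP_KERNEL);
}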
[ziy@nvidia.com: comment fixes, per David] Link: https://lkml.kernel.org/r/97DB52E1-C594-49B5-9736-89AC302FAB01@nvidia.com Link: https://lkml.kernel.org/r/20241011150304.709590-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Fri, 11 Oct 2024 17:19:50 +0000 (01:19 +0800)]
mm/zswap: avoid touching XArray for unnecessary invalidation
zswap invalidation simply calls xa_erase(), which acquires the XArray lock
first and then does a lookup. This has a higher overhead even if zswap is
not used or the tree is empty.
So instead, do a very lightweight xa_empty() check first; if there is
nothing to erase, don't touch the lock or the tree.
Using xa_empty() rather than zswap_never_enabled() is more helpful as it
covers both the case where zswap was never used and the case where the
particular range doesn't have any zswap entry. And it's safe, as the swap
slot should currently be pinned by the caller with HAS_CACHE.
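A sketch of the lightweight check, modelled on mm/zswap.c's
zswap_invalidate(); the exact surrounding code may differ:

void zswap_invalidate(swp_entry_t swp)
{
        pgoff_t offset = swp_offset(swp);
        struct xarray *tree = swap_zswap_tree(swp);
        struct zswap_entry *entry;

        /* Lock-free peek: if the tree holds nothing, there is nothing to erase. */
        if (xa_empty(tree))
                return;

        entry = xa_erase(tree, offset); /* takes the XArray lock internally */
        if (entry)
                zswap_entry_free(entry);
}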
Sequential swap in/out tests with zswap disabled showed a minor
performance gain, and swap-in of zero pages with zswap enabled also showed a
performance gain (swap-out is basically unchanged, so only one case was tested):
And the performance is very slightly better or unchanged for the
kernel build test with zswap enabled or disabled.
Build Linux Kernel with defconfig and -j32 in 1G memory cgroup,
using brd SWAP, zswap disabled (sys time in seconds, 6 testrun, -0.1%):
Before: 1648.83 1653.52 1666.34 1665.95 1663.06 1656.67
After: 1651.36 1661.89 1645.70 1657.45 1662.07 1652.83
Build Linux Kernel with defconfig and -j32 in 2G memory cgroup,
using brd SWAP zswap enabled (sys time in seconds, 6 testrun, -0.3%):
Before: 1240.25 1254.06 1246.77 1265.92 1244.23 1227.74
After: 1226.41 1218.21 1249.12 1249.13 1244.39 1233.01
Link: https://lkml.kernel.org/r/20241011171950.62684-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sidhartha Kumar [Fri, 11 Oct 2024 21:44:51 +0000 (17:44 -0400)]
maple_tree: refactor mas_wr_store_type()
In mas_wr_store_type(), we check if new_end < mt_slots[wr_mas->type]. If
this check fails, we know that, after this point, new_end is >= mt_min_slots.
Checking this again when we detect a wr_node_store later in the function
is redundant. Because this check is part of an OR statement, the
statement will always evaluate to true, so we can just get rid of it.
We also refactor mas_wr_store_type() to return the store type rather than
set it directly, as this greatly cleans up the function.
Link: https://lkml.kernel.org/r/20241011214451.7286-2-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha <sidhartha.kumar@oracle.com> Suggested-by: Liam Howlett <liam.howlett@oracle.com> Suggested-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
suhua [Sat, 12 Oct 2024 07:08:02 +0000 (15:08 +0800)]
mm/hugetlb: perform vmemmap optimization batchly for specific node allocation
When HVO is enabled and huge page memory allocs are made, the freed memory
can be aggregated into higher order memory in the following paths, which
facilitates further allocs for higher order memory.
Shakeel Butt [Thu, 10 Oct 2024 00:35:50 +0000 (17:35 -0700)]
memcg: add tracing for memcg stat updates
The memcg stats are maintained in the rstat infrastructure, which provides a
very fast update side and a reasonable read side. However memcg added a
plethora of stats and made the read side, which is the cgroup rstat flush,
very slow. To solve that, a threshold was added on the memcg stats read
side, i.e. there is no need to flush the stats if updates are within the
threshold.
This threshold-based improvement worked for some time, but more stats were
added to memcg and the read codepath started getting triggered in
performance-sensitive paths, which made threshold-based ratelimiting
ineffective. We need more visibility into the hot and cold stats, i.e.
stats with a lot of updates. Let's add tracing to get that visibility.
[shakeel.butt@linux.dev: use unsigned long type for memcg_rstat_events, per Yosry] Link: https://lkml.kernel.org/r/20241015213721.3804209-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20241010003550.3695245-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: JP Kobryn <inwardvessel@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kefeng Wang [Thu, 10 Oct 2024 06:15:56 +0000 (14:15 +0800)]
mm: remove unused hugepage for vma_alloc_folio()
The hugepage parameter has been deprecated since commit ddc1a5cbc05d
("mempolicy: alloc_pages_mpol() for NUMA policy without vma"); for
PMD-sized THP, vma_alloc_folio() still tries only the preferred node if
possible by checking the order of the folio allocation.
Link: https://lkml.kernel.org/r/20241010061556.1846751-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MengEn Sun [Thu, 10 Oct 2024 12:09:36 +0000 (20:09 +0800)]
mm: add pcp high_min high_max to proc zoneinfo
When percpu_pagelist_high_fraction is not set, the kernel will compute
the pcp high_min/max by itself, which makes it hard to determine the
current high_min/max values.
So output the pcp high_min/max values to /proc/zoneinfo.
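A sketch of the kind of output addition meant here, assuming the
zone->pageset_high_min and zone->pageset_high_max fields from the pcp
auto-tuning work; the exact field names, labels and placement in
zoneinfo_show_print() may differ:

static void zoneinfo_show_pcp_limits(struct seq_file *m, struct zone *zone)
{
        /* report the auto-tuned per-cpu pageset high watermark range */
        seq_printf(m,
                   "\n  high_min: %i"
                   "\n  high_max: %i",
                   zone->pageset_high_min,
                   zone->pageset_high_max);
}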
John Hubbard [Wed, 9 Oct 2024 02:50:24 +0000 (19:50 -0700)]
kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end
For clarity. It's increasingly hard to reason about the code when KASLR
is moving around the boundaries. In this case, where KASLR is randomizing
the location of the kernel image within physical memory, the maximum
number of address bits for physical memory has not changed.
What has changed is the ending address of memory that is allowed to be
directly mapped by the kernel.
Let's name the variable, and the associated macro accordingly.
Also, enhance the comment above the direct_map_physmem_end definition,
to further clarify how this all works.
Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jordan Niethe <jniethe@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Tue, 8 Oct 2024 06:17:46 +0000 (11:47 +0530)]
mm: allocate THP on hugezeropage wp-fault
Introduce do_huge_zero_wp_pmd() to handle wp-fault on a hugezeropage and
replace it with a PMD-mapped THP. Remember to flush TLB entry
corresponding to the hugezeropage. In case of failure, fallback to
splitting the PMD.
Link: https://lkml.kernel.org/r/20241008061746.285961-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Tue, 8 Oct 2024 06:17:45 +0000 (11:47 +0530)]
mm: abstract THP allocation
Patch series "Do not shatter hugezeropage on wp-fault", v7.
It was observed at [1] and [2] that the current kernel behaviour of
shattering a hugezeropage is inconsistent and suboptimal. For a VMA with
a THP allowable order, when we write-fault on it, the kernel installs a
PMD-mapped THP. On the other hand, if we first get a read fault, we get a
PMD pointing to the hugezeropage; subsequent write will trigger a
write-protection fault, shattering the hugezeropage into one writable
page, and all the other PTEs write-protected. The conclusion being, as
compared to the case of a single write-fault, applications have to suffer
512 extra page faults if they were to use the VMA as such, plus we get the
overhead of khugepaged trying to replace that area with a THP anyway.
Instead, replace the hugezeropage with a THP on wp-fault.
In preparation for the second patch, abstract away the THP allocation
logic present in the create_huge_pmd() path, which corresponds to the
faulting case when no page is present.
There should be no functional change as a result of applying this patch,
except that, as David notes at [1], a PMD-aligned address should be passed
to update_mmu_cache_pmd().
Link: https://lkml.kernel.org/r/20241008061746.285961-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20241008061746.285961-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dennis Zhou [Tue, 8 Oct 2024 00:19:42 +0000 (17:19 -0700)]
percpu: fix data race with pcpu_nr_empty_pop_pages
Fixes the data race by moving the read to be behind the pcpu_lock. This
is okay because the code (initially) above it will not increase the
empty populated page count because it is populating backing pages that
already have allocations served out of them.
Link: https://lkml.kernel.org/r/20241008001942.8114-1-dennis@kernel.org Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407191651.f24e499d-oliver.sang@intel.com Signed-off-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:37 +0000 (09:50 +0200)]
mm: consolidate common checks in hugetlb_get_unmapped_area
prepare_hugepage_range() performs almost the same checks for all
architectures that define it, with the exception of mips and loongarch
that also check for overflows.
The rest checks for the addr and len to be properly aligned, so we can
move that to hugetlb_get_unmapped_area() and get rid of a fair amount of
duplicated code.
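The kind of common check being consolidated, sketched with existing hugetlb
helpers; the exact code hoisted into hugetlb_get_unmapped_area() may differ:

static int hugetlb_align_check_sketch(struct file *file, unsigned long addr,
                                      unsigned long len)
{
        struct hstate *h = hstate_file(file);

        if (len & ~huge_page_mask(h))   /* length must be a multiple of the huge page size */
                return -EINVAL;
        if (addr & ~huge_page_mask(h))  /* address must be huge-page aligned */
                return -EINVAL;
        return 0;
}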
[akpm@linux-foundation.org: remove now-unused local] Link: https://lore.kernel.org/oe-kbuild-all/202410081210.uNLbf3Jk-lkp@intel.com/ Link: https://lkml.kernel.org/r/20241007075037.267650-10-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:36 +0000 (09:50 +0200)]
arch/s390: clean up hugetlb definitions
s390 redefines functions that are already defined (and the same) in
include/asm-generic/hugetlb.h.
Do as the other architectures:
1) include include/asm-generic/hugetlb.h
2) drop the already defined functions in the generic hugetlb.h and
3) use the __HAVE_ARCH_HUGE_* macros to define our own.
This gets rid of quite some code.
Link: https://lkml.kernel.org/r/20241007075037.267650-9-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:35 +0000 (09:50 +0200)]
mm: drop hugetlb_get_unmapped_area{_*} functions
Hugetlb mappings are now handled through normal channels just like any
other mapping, so we no longer need hugetlb_get_unmapped_area* specific
functions.
Link: https://lkml.kernel.org/r/20241007075037.267650-8-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:34 +0000 (09:50 +0200)]
mm: make hugetlb mappings go through mm_get_unmapped_area_vmflags
Hugetlb mappings will no longer be special cased but rather go through the
generic mm_get_unmapped_area_vmflags function. For that to happen, let us
remove the .get_unmapped_area from hugetlbfs_file_operations struct, and
hint __get_unmapped_area that it should not send hugetlb mappings through
thp_get_unmapped_area_vmflags but through mm_get_unmapped_area_vmflags.
Also create a function called hugetlb_mmap_check_and_align(), in which a
couple of safety checks are done and the addr is aligned to the huge
page size. Otherwise we will have to do this in every single function,
which duplicates quite a lot of code.
Link: https://lkml.kernel.org/r/20241007075037.267650-7-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:33 +0000 (09:50 +0200)]
arch/powerpc: teach book3s64 arch_get_unmapped_area{_topdown} to handle hugetlb mappings
We want to stop special casing hugetlb mappings and make them go through
generic channels, so teach arch_get_unmapped_area{_topdown} to handle
those.
Reshuffle file_to_psize() definition so arch_get_unmapped_area{_topdown}
can make use of it.
Link: https://lkml.kernel.org/r/20241007075037.267650-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:32 +0000 (09:50 +0200)]
arch/sparc: teach arch_get_unmapped_area{_topdown} to handle hugetlb mappings
We want to stop special casing hugetlb mappings and make them go through
generic channels, so teach arch_get_unmapped_area{_topdown} to handle
those.
The sparc-specific hugetlb function does not set info.align_offset, and does
not care about adjusting the align_mask for MAP_SHARED cases, so do the same
here for compatibility.
Link: https://lkml.kernel.org/r/20241007075037.267650-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:31 +0000 (09:50 +0200)]
arch/x86: teach arch_get_unmapped_area_vmflags to handle hugetlb mappings
We want to stop special casing hugetlb mappings and make them go through
generic channels, so teach arch_get_unmapped_area_{topdown_}vmflags to
handle those.
The x86-specific hugetlb function does not set either info.start_gap or
info.align_offset, so do the same here for compatibility.
Link: https://lkml.kernel.org/r/20241007075037.267650-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:30 +0000 (09:50 +0200)]
arch/s390: teach arch_get_unmapped_area{_topdown} to handle hugetlb mappings
We want to stop special casing hugetlb mappings and make them go through
generic channels, so teach arch_get_unmapped_area{_topdown} to handle
those.
The s390-specific hugetlb function does not set info.align_offset, so do the
same here for compatibility.
Link: https://lkml.kernel.org/r/20241007075037.267650-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oscar Salvador [Mon, 7 Oct 2024 07:50:29 +0000 (09:50 +0200)]
mm/mmap: teach generic_get_unmapped_area{_topdown} to handle hugetlb mappings
Patch series "Unify hugetlb into arch_get_unmapped_area functions", v4.
This is an attempt to get rid of a fair amount of duplicated code wrt.
hugetlb and *get_unmapped_area* functions.
HugeTLB registers a .get_unmapped_area function which gets called from
__get_unmapped_area().
hugetlb_get_unmapped_area() is defined by a bunch of architectures and
it also has a generic definition for those that do not define it.
The long and short of it is that there is a ton of duplicated code between
the hugetlb-specific *_get_unmapped_area_* functions and the mm-core
functions, so we can do better by teaching the arch_get_unmapped_area*
functions how to deal with hugetlb mappings.
Note that not a lot of things need to be taught, though.
hugetlb_get_unmapped_area(), which gets called for hugetlb mappings, runs
some sanity checks prior to calling mm_get_unmapped_area_vmflags(), so we
do not need to repeat them down the road in the respective
{generic,arch}_get_unmapped_area* functions.
More information can be found in the respective patches.
LTP mmapstress hugetlb selftests were run successfully on:
This patch (of 9):
We want to stop special casing hugetlb mappings and make them go through
generic channels, so teach generic_get_unmapped_area{_topdown} to handle
those. The main difference is that we set info.align_mask for huge
mappings.
Breno Leitao [Fri, 4 Oct 2024 16:48:31 +0000 (09:48 -0700)]
mm: remove misleading 'unlikely' hint in vms_gather_munmap_vmas()
Performance analysis using branch annotation on a fleet of 200 hosts
running web servers revealed that the 'unlikely' hint in
vms_gather_munmap_vmas() was 100% consistently incorrect. In all observed
cases, the branch behavior contradicted the hint.
Remove the 'unlikely' qualifier from the condition checking 'vms->uf'. By
doing so, we allow the compiler to make optimization decisions based on
its own heuristics and profiling data, rather than relying on a static
hint that has proven to be inaccurate in real-world scenarios.
Link: https://lkml.kernel.org/r/20241004164832.218681-1-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Mon, 7 Oct 2024 11:53:35 +0000 (12:53 +0100)]
maple_tree: do not hash pointers on dump in debug mode
Many of the maple tree values output when mt_validate() or an equivalent
hits an issue utilise tagged pointers, most notably parent nodes. Also,
some pivots/slots contain meaningful values that are output as pointers,
such as the index of the last entry with data, for example.
All pointer values such as this are destroyed by kernel pointer hashing
rendering the debug output obtained from CONFIG_DEBUG_VM_MAPLE_TREE
considerably less usable.
Update this code to output the raw pointers using %px rather than %p when
CONFIG_DEBUG_VM_MAPLE_TREE is defined. This is justified, as the use of
this configuration flag indicates that this is a test environment.
Userland does not understand %px, so use %p there.
In an abundance of caution, if CONFIG_DEBUG_VM_MAPLE_TREE is not set, also
use %p to avoid exposing raw kernel pointers except when we are positive a
testing mode is enabled.
This was inspired by the investigation performed in recent debugging
efforts around a maple tree regression [0] where kernel pointer tagging had
to be disabled in order to obtain truly meaningful and useful data.
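Illustrative only: a format selection along these lines, not necessarily the
macro used in lib/maple_tree.c:

#if defined(CONFIG_DEBUG_VM_MAPLE_TREE) && !defined(__KERNEL__)
#define MT_PTR_FMT "%p"         /* userland test build: %px is not understood */
#elif defined(CONFIG_DEBUG_VM_MAPLE_TREE)
#define MT_PTR_FMT "%px"        /* kernel debug build: raw pointers, tags visible */
#else
#define MT_PTR_FMT "%p"         /* production: keep hashed pointers */
#endif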
Shakeel Butt [Wed, 2 Oct 2024 22:51:50 +0000 (15:51 -0700)]
mm/truncate: reset xa_has_values flag on each iteration
Currently mapping_try_invalidate() and invalidate_inode_pages2_range()
traverse the xarray in batches and then, for each batch, maintain and set
the flag named xa_has_values if the batch has a shadow entry, in order to
clear those entries at the end of the iteration.
However they forget to reset the flag at the end of each iteration, which
causes them to always try to clear shadow entries in subsequent iterations
even when there might not be any shadow entries.
Fix this inefficiency.
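The shape of the fix, sketched rather than quoted from mm/truncate.c
(surrounding declarations of mapping, index, end, fbatch and indices are
assumed): the flag is re-initialised for every folio_batch, so a shadow
entry seen in one batch no longer forces a tree walk for every later batch.

while (find_lock_entries(mapping, &index, end, &fbatch, indices)) {
        bool xa_has_values = false;     /* reset on each iteration */
        int i, nr = folio_batch_count(&fbatch);

        for (i = 0; i < nr; i++) {
                if (xa_is_value(fbatch.folios[i]))
                        xa_has_values = true;
                /* ... handle real folios ... */
        }

        if (xa_has_values)
                clear_shadow_entries(mapping, &fbatch, indices);
        /* ... release the batch, cond_resched() ... */
}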
Link: https://lkml.kernel.org/r/20241002225150.2334504-1-shakeel.butt@linux.dev Fixes: 61c663e020d2 ("mm/truncate: batch-clear shadow entries") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kanchana P Sridhar [Wed, 2 Oct 2024 22:58:22 +0000 (15:58 -0700)]
mm: swap: make some count_mthp_stat() call-sites be THP-agnostic.
In commit 246d3aa3e531 ("mm: cleanup count_mthp_stat() definition"), Ryan Roberts pointed out the merits of allowing mm code that does not require THP to be compiled without THP ifdefs. As a step in that direction, he moved count_mthp_stat() so that it is always defined, resolving to a no-op if THP is not defined.
Barry Song referred me to Ryan's commit when I was working on the "mm:
zswap swap-out of large folios" patch-series [1].
This patch propagates the benefits of the above change to page_io.c and vmscan.c. As a result, there is one less reason to have THP ifdefs in these code sections.
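For example, a swap-out accounting site can now be written without the guard (the MTHP_STAT_SWPOUT site is used here purely for illustration):

  	/* before: only compiled when THP is enabled */
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
  	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);
  #endif

  	/* after: count_mthp_stat() is always defined and becomes a
  	 * no-op when CONFIG_TRANSPARENT_HUGEPAGE is not set */
  	count_mthp_stat(folio_order(folio), MTHP_STAT_SWPOUT);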
Anshuman Khandual [Thu, 3 Oct 2024 04:48:42 +0000 (10:18 +0530)]
mm: move set_pxd_safe() helpers from generic to platform
The set_pxd_safe() helpers serve a specific purpose only on the x86 and riscv platforms and do not need to be in the common memory code, where they just make the common API unnecessarily more complicated. Move the helpers from common code to the platforms instead.
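For reference, these helpers are thin wrappers of roughly the following shape (sketched from the generic definitions); they merely warn if a present entry is being changed to a different value:

  #define set_pmd_safe(pmdp, pmd) \
  ({ \
  	WARN_ON_ONCE(pmd_present(*pmdp) && !pmd_same(*pmdp, pmd)); \
  	set_pmd(pmdp, pmd); \
  })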
Link: https://lkml.kernel.org/r/20241003044842.246016-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 2 Oct 2024 15:25:30 +0000 (16:25 +0100)]
mm: add PageAnonNotKsm()
Check that this anonymous page is really anonymous, not anonymous-or-KSM.
This optimises the debug check, but its real purpose is to remove the last
two users of PageKsm().
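A sketch of such a helper (the exact implementation in the patch may differ): anon and KSM folios are distinguished by the low bits of the mapping pointer.

  static __always_inline bool PageAnonNotKsm(const struct page *page)
  {
  	unsigned long mapping = (unsigned long)page_folio(page)->mapping;

  	/* anon folios set PAGE_MAPPING_ANON; KSM folios additionally set
  	 * the movable bit, so require exactly the anon bit */
  	return (mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON;
  }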
Matthew Wilcox (Oracle) [Wed, 2 Oct 2024 15:25:28 +0000 (16:25 +0100)]
ksm: convert cmp_and_merge_page() to use a folio
By making try_to_merge_two_pages() and stable_tree_search() return a
folio, we can replace kpage with kfolio. This replaces 7 calls to
compound_head() with one.
[cuigaosheng1@huawei.com: add IS_ERR_OR_NULL check for stable_tree_search()] Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Link: https://lkml.kernel.org/r/20241002152533.1350629-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Shi <alexs@kernel.org> Cc: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 2 Oct 2024 15:25:27 +0000 (16:25 +0100)]
ksm: use a folio in try_to_merge_one_page()
Patch series "Remove PageKsm()".
The KSM flag is almost always tested on the folio rather than on the page.
This series removes the final users of PageKsm() and makes the flag testable only on the folio.
This patch (of 5):
It is safe to use a folio here because all callers took a refcount on this
page. The one wrinkle is that we have to recalculate the value of folio
after splitting the page, since it has probably changed. Replaces nine
calls to compound_head() with one.
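The recalculation after a split is the usual one-liner (sketch only; surrounding context omitted):

  	if (split_huge_page(page))
  		goto out;
  	/* the split may have changed which folio this page belongs to */
  	folio = page_folio(page);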
The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache tree
for each individual entry to remove them. This patch optimizes this by removing them in a single tree traversal.
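A simplified sketch of the single-traversal idea (not the exact helper added by the patch; shadow accounting and workingset bookkeeping are omitted): walk the affected range once under the xarray lock and clear every value (shadow) entry, instead of looking up each index separately.

  static void clear_shadow_entries_sketch(struct address_space *mapping,
  					pgoff_t start, pgoff_t end)
  {
  	XA_STATE(xas, &mapping->i_pages, start);
  	void *entry;

  	xas_lock_irq(&xas);
  	xas_for_each(&xas, entry, end) {
  		if (xa_is_value(entry))
  			xas_store(&xas, NULL);
  	}
  	xas_unlock_irq(&xas);
  }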
To evaluate the changes, we created a 200GiB file on a fuse fs and in a memcg. We created the shadow entries by triggering reclaim through memory.reclaim in that specific memcg and measured the simple fadvise(DONTNEED) operation.
  # time xfs_io -c 'fadvise -d 0 ${file_size}' file

               time (sec)
  Without      5.12 +- 0.061
  With-patch   4.19 +- 0.086 (18.16% decrease)
Link: https://lkml.kernel.org/r/20240925224716.2904498-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chris Mason <clm@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: optimize shadow entries removal", v2.
Some of our production workloads that process a large amount of data spend a considerable amount of CPU time on truncation and invalidation of large files (100s of GiB in size). Tracing the operations showed that most of the time is spent in shadow entry removal. This patch series optimizes the truncation and invalidation operations.
This patch (of 2):
The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
batch, it traverses the page cache tree and collects the entries (folio
and shadow entries) in the struct folio_batch. For the shadow entries
present in the folio_batch, it has to traverse the page cache tree for
each individual entry to remove them. This patch optimizes this by removing them in a single tree traversal.
On large machines in our production environment that run workloads manipulating large amounts of data, we have observed that a large amount of CPU time is spent on truncation of very large files (100s of GiB in size). More specifically, most of the time is spent on shadow entry cleanup, so optimizing the shadow entry cleanup, even a little bit, has a good impact.
To evaluate the changes, we created a 200GiB file on a fuse fs and in a memcg. We created the shadow entries by triggering reclaim through memory.reclaim in that specific memcg and measured the simple truncation operation.
  # time truncate -s 0 file

               time (sec)
  Without      5.164 +- 0.059
  With-patch   4.21  +- 0.066 (18.47% decrease)
mm: migrate LRU_REFS_MASK bits in folio_migrate_flags
Bits of LRU_REFS_MASK are not inherited during migration, which leads the new folio to start from tier 0 when MGLRU is enabled. Try to bring over as many bits of folio->flags as possible, since the migrations introduced by compaction and alloc_contig_range() do happen at times.
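A sketch of the idea (not necessarily the exact hunk in the patch): copy the LRU_REFS_MASK bits from the source folio's flags into the new folio in folio_migrate_flags().

  	/* carry over the MGLRU access history so the new folio does not
  	 * restart from tier 0 after migration */
  	set_mask_bits(&newfolio->flags, LRU_REFS_MASK,
  		      folio->flags & LRU_REFS_MASK);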
Link: https://lkml.kernel.org/r/20240926050647.5653-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
In walk_pte_range(), we may modify the pte entry after holding the ptl, so convert it to use pte_offset_map_rw_nolock(). Since the pte_same() check is not performed after the ptl is held here, we should get pmdval and do a pmd_same() check to ensure the stability of the pmd entry.
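The conversions in this and the following patches share the same shape; a generic sketch (assuming the pte_offset_map_rw_nolock() signature introduced earlier in this series):

  static bool modify_ptes_sketch(struct mm_struct *mm, pmd_t *pmd,
  			       unsigned long addr)
  {
  	pmd_t pmdval;
  	spinlock_t *ptl;
  	pte_t *pte;

  	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
  	if (!pte)
  		return false;

  	spin_lock(ptl);
  	/* no pte_same() check follows, so revalidate the pmd under the ptl */
  	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
  		pte_unmap_unlock(pte, ptl);
  		return false;
  	}

  	/* ... it is now safe to read and modify the pte entries ... */

  	pte_unmap_unlock(pte, ptl);
  	return true;
  }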
Link: https://lkml.kernel.org/r/7e9c194a5efacc9609cfd31abb9c7df88b53b530.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring the ptl, so convert it to use pte_offset_map_rw_nolock(). But since we will use pte_same() to detect changes to the pte entry, there is no need to get pmdval; just pass a dummy variable for it.
Link: https://lkml.kernel.org/r/1530e8fdbfc72eacf3b095babe139ce3d715600a.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
In the caller of map_pte(), we may modify the pvmw->pte after acquiring the pvmw->ptl, so convert it to use pte_offset_map_rw_nolock(). Since the pte_same() check is not performed after the pvmw->ptl is held here, we should get pmdval and do a pmd_same() check to ensure the stability of pvmw->pmd.
Link: https://lkml.kernel.org/r/2620a48f34c9f19864ab0169cdbf253d31a8fcaa.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so convert it to use pte_offset_map_rw_nolock(). At this point the new_pte entries are none, so the hpage_collapse_scan_file() path cannot find them by traversing file->f_mapping, and hence there is no concurrency with retract_page_tables(). In addition, we already hold the exclusive mmap_lock, so this new_pte page is stable and there is no need to get pmdval and do a pmd_same() check.
Link: https://lkml.kernel.org/r/9d582a09dbcf12e562ac5fe0ba05e9248a58f5e0.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: copy_pte_range() use pte_offset_map_rw_nolock()
In copy_pte_range(), we may modify the src_pte entry after holding the src_ptl, so convert it to use pte_offset_map_rw_nolock(). Since we already hold the exclusive mmap_lock, and copy_pte_range() and retract_page_tables() are made mutually exclusive via vma->anon_vma, the PTE page is stable, so there is no need to get pmdval and do a pmd_same() check.
Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
In collapse_pte_mapped_thp(), we may modify the pte and pmd entries after acquiring the ptl, so convert it to use pte_offset_map_rw_nolock(). Since the pte_same() check is not performed after the ptl is held here, we should get pgt_pmd and do a pmd_same() check after the ptl is taken.
Link: https://lkml.kernel.org/r/055e42db68da00ac8ecab94bd2633c7cd965eb1c.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: handle_pte_fault() use pte_offset_map_rw_nolock()
In handle_pte_fault(), we may modify the vmf->pte after acquiring the vmf->ptl, so convert it to use pte_offset_map_rw_nolock(). But since we will do a pte_same() check, there is no need to get pmdval for a pmd_same() check; just pass a dummy variable for it.
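A sketch of that call (field names as in struct vm_fault; the dummy variable exists only to satisfy the interface):

  		pmd_t dummy_pmdval;

  		/* a later pte_same() check against vmf->orig_pte detects any
  		 * concurrent change, so the returned pmd value is unused */
  		vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
  						    vmf->address, &dummy_pmdval,
  						    &vmf->ptl);
  		if (unlikely(!vmf->pte))
  			return 0;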
Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In do_adjust_pte(), we may modify the pte entry, and the corresponding pmd entry may have been modified concurrently. Therefore, in order to ensure the stability of the pmd entry, use pte_offset_map_rw_nolock() to replace pte_offset_map_nolock(), and do a pmd_same() check after holding the PTL.
All callers of update_mmu_cache_range() hold the vmf->ptl, so we can determine whether split PTE locks are being used with the following check, just as we do elsewhere in the kernel:
	ptl != vmf->ptl
We can then delete do_pte_lock() and do_pte_unlock().
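A sketch of that check in use (locking shape only; the surrounding pte_offset_map_rw_nolock() conversion is as described above):

  	ptl = pte_lockptr(vma->vm_mm, pmd);
  	if (ptl != vmf->ptl)
  		/* split PTE locks: take the other PTE page's lock as well */
  		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);

  	/* ... adjust the pte while the relevant lock(s) are held ... */

  	if (ptl != vmf->ptl)
  		spin_unlock(ptl);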
Link: https://lkml.kernel.org/r/0eaf6b69aeb2fe35092a633fed12537efe645303.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
In __collapse_huge_page_swapin(), the ptl is only used for the pte_same() check in do_swap_page(); in other places we use pte_offset_map_lock() directly. So convert it to use pte_offset_map_ro_nolock().
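A sketch of the read-only variant's use (assuming the signature introduced by this series); no pmd value is needed because the pte entries are not modified here:

  	spinlock_t *ptl;
  	pte_t *pte;

  	pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
  	if (!pte)
  		return SCAN_PMD_NULL;
  	/* the ptl is passed along only so that do_swap_page() can perform
  	 * its pte_same() check */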
Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>