SeongJae Park [Tue, 27 Aug 2024 03:03:33 +0000 (20:03 -0700)]
mm/damon/dbgfs-test: skip dbgfs_set_targets() test if PADDR is not registered
The test depends on registration of DAMON_OPS_PADDR, which is registered
only when CONFIG_DAMON_PADDR is set. DAMON core kunit tests do fake ops
registration for such cases. However, the functions for such fake ops
registration are not available to the DAMON debugfs interface. Just skip
the test in that case.
Link: https://lkml.kernel.org/r/20240827030336.7930-7-sj@kernel.org Fixes: 999b9467974f ("mm/damon/dbgfs-test: fix is_target_id() change") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 27 Aug 2024 03:03:32 +0000 (20:03 -0700)]
mm/damon/core-test: fix damon_test_ops_registration() for DAMON_VADDR unset case
DAMON core kunit tests can be executed without CONFIG_DAMON_VADDR. In
that case, the vaddr DAMON ops is not registered. Meanwhile, the ops
registration kunit test assumes the vaddr ops is registered. Check and
handle the case by registering fake vaddr ops inside the test code.
Link: https://lkml.kernel.org/r/20240827030336.7930-6-sj@kernel.org Fixes: 4f540f5ab4f2 ("mm/damon/core-test: add a kunit test case for ops registration") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 27 Aug 2024 03:03:31 +0000 (20:03 -0700)]
mm/damon/core-test: test only vaddr case on ops registration test
The DAMON ops registration kunit test exercises both the vaddr and paddr
use cases in parts of the whole test. Testing only one ops use case is
enough, so do the test with only the vaddr use case.
Link: https://lkml.kernel.org/r/20240827030336.7930-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 27 Aug 2024 03:03:29 +0000 (20:03 -0700)]
selftests/damon: cleanup __pycache__/ with 'make clean'
Python-based tests create a __pycache__/ directory. Remove it with 'make
clean' by defining it as EXTRA_CLEAN.
Link: https://lkml.kernel.org/r/20240827030336.7930-3-sj@kernel.org Fixes: b5906f5f7359 ("selftests/damon: add a test for update_schemes_tried_regions sysfs command") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 27 Aug 2024 03:03:28 +0000 (20:03 -0700)]
selftests/damon: add access_memory_even to .gitignore
Patch series "misc fixups for DAMON {self,kunit} tests".
This patchset is for minor fixups of DAMON selftests and kunit tests.
The first three patches make DAMON selftests more cleanly maintained
(patches 1 and 2) without unnecessary warnings (patch 3). The following
six patches remove an unnecessary test case (patch 4), handle config
combinations that can make tests fail (patches 5-7), reorganize the test
files following the new guideline (patch 8), and add a reference
kunitconfig for DAMON kunit tests (patch 9).
This patch (of 9):
DAMON selftests build access_memory_even, but it's not on the .gitignore
list. Add it to make 'git status' output cleaner.
Yujie Liu [Tue, 27 Aug 2024 11:29:58 +0000 (19:29 +0800)]
sched/numa: Fix the vma scan starving issue
Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
NUMA vma scan overhead has been reduced a lot. Meanwhile, reducing the
vma scanning might produce less NUMA page fault information, and the
insufficient information makes it harder for the NUMA balancer to make
decisions. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
partial VMAs regardless of PID activity") and commit 84db47ca7146d7
("sched/numa: Fix mm numa_scan_seq based unconditional scan") were found to
bring back part of the performance.
Recently, when running SPECcpu omnetpp_r on a 320-CPU/2-socket system, a
long duration of remote NUMA node reads was observed via PMU events: a few
cores had ~500MB/s of remote memory access for ~20 seconds, causing high
core-to-core variance and a performance penalty. Investigation found that
many vmas are skipped due to the active PID check. According to the trace
events, in most cases vma_is_accessed() returns false because the access
history stored in the pids_active array has been cleared.
Proposal:
The main idea is to adjust vma_is_accessed() so that it returns true more
easily. To that end, compare the difference between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq. If the difference exceeds a threshold,
scan the vma.
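A minimal sketch of the proposed check, using only the fields named above;
the threshold constant and the surrounding structure of vma_is_accessed()
are assumptions for illustration:

	/*
	 * Illustrative sketch only; the threshold name/value and the rest of
	 * vma_is_accessed() are assumed, not taken from the actual patch.
	 */
	static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
	{
		/* ... existing checks against the pids_active history ... */

		/*
		 * If this vma has not been scanned for several full passes of
		 * mm->numa_scan_seq, scan it anyway, even though the per-PID
		 * access history has already been cleared.
		 */
		if (mm->numa_scan_seq - vma->numab_state->prev_scan_seq >=
		    VMA_RESCAN_THRESHOLD /* assumed constant */)
			return true;

		return false;
	}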
This patch especially helps the cases where there are a small number of
threads, like the process-based SPECcpu. Without this patch, if the
SPECcpu process accesses the vma at the beginning and then sleeps for a
long time, the pids_active array will be cleared. As a result, if this
process is woken up again, it never has a chance to set prot_none anymore,
because only the first 2 rounds of access are granted a vma scan:
(current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2. To
make things worse, no other threads within the task can help set
prot_none. This causes information loss.
Raghavendra helped test the current patch and got positive results
on the AMD platform:
Yanfei Xu [Tue, 27 Aug 2024 11:36:14 +0000 (19:36 +0800)]
memory tier: fix deadlock warning while onlining pages
Commit 823430c8e9d9 ("memory tier: consolidate the initialization of
memory tiers") introduced a locking change that uses guard(mutex)
instead of mutex_lock/unlock() for memory_tier_lock. It unexpectedly
expanded the locked region to include hotplug_memory_notifier(); as a
result, the locking dependency detector reports an ABBA deadlock.
Exclude hotplug_memory_notifier() from the locked region to fix it.
The deadlock scenario is that when a memory online event occurs, the
execution of the memory notifier takes the read lock of
memory_chain.rwsem, while the registration of the memory notifier in
memory_tier_init() acquires the write lock of memory_chain.rwsem while
holding memory_tier_lock. The memory online event then continues to
invoke the memory hotplug callback registered by memory_tier_init().
Since this callback tries to acquire memory_tier_lock, a deadlock
occurs.
In fact, this deadlock can't happen because memory_tier_init() always
executes before memory online events happen, as subsys_initcall()
has a higher priority than module_init().
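A rough sketch of the shape of the fix, assuming the guard(mutex)-based
locking described above; the callback and priority names follow
mm/memory-tiers.c but should be treated as illustrative:

	static int __init memory_tier_init(void)
	{
		/*
		 * Keep the memory_tier_lock critical section limited to the
		 * tier setup itself by scoping the guard to an inner block.
		 */
		{
			guard(mutex)(&memory_tier_lock);
			/* ... set up the default memory tier under the lock ... */
		}

		/*
		 * Register the hotplug notifier outside the locked region, so the
		 * notifier chain's rwsem is never taken while memory_tier_lock is
		 * held, breaking the ABBA dependency described above.
		 */
		hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
		return 0;
	}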
mm: vmalloc: refactor vm_area_alloc_pages() function
The aim is to simplify vm_area_alloc_pages() and make it less confusing,
as it has become cluttered over time:
- eliminate the "bulk_gfp" variable and do not overwrite the gfp
flags for the bulk allocator;
- drop the __GFP_NOFAIL flag for high-order-page requests in the upper
layer, so it becomes less spread between levels when it comes to
__GFP_NOFAIL allocations;
- add a comment about the fallback path taken when a high-order attempt
is unsuccessful, because __GFP_NOFAIL is dropped for such cases
(sketched below);
- fix a typo in a commit message.
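A hedged sketch of the fallback behaviour described above (not the actual
mm/vmalloc.c code; the bulk-allocator path and the bookkeeping of split
high-order pages are elided):

	/*
	 * Sketch only: illustrates how a failed high-order attempt falls back
	 * to order-0 pages.
	 */
	static struct page *alloc_one_chunk(gfp_t gfp, unsigned int *order)
	{
		struct page *page;

		if (*order) {
			/*
			 * __GFP_NOFAIL has already been dropped by the upper
			 * layer for high-order requests, so this attempt may
			 * legitimately fail.
			 */
			page = alloc_pages(gfp, *order);
			if (page)
				return page;
			/* Fall back to order-0 for the rest of the allocation. */
			*order = 0;
		}

		return alloc_pages(gfp, 0);
	}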
Link: https://lkml.kernel.org/r/20240827190916.34242-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:22 +0000 (19:10 +0100)]
mm: rework vm_ops->close() handling on VMA merge
In commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be
removed in mergeability test") we relaxed the VMA merge rules for VMAs
possessing a vm_ops->close() hook, permitting this operation in instances
where we wouldn't delete the VMA as part of the merge operation.
This was later corrected in commit fc0c8f9089c2 ("mm, mmap: fix
vma_merge() case 7 with vma_ops->close") to account for a subtle case that
the previous commit had not taken into account.
In both instances, we first rely on is_mergeable_vma() to determine
whether we might be dealing with a VMA that might be removed, taking
advantage of the fact that a 'previous' VMA will never be deleted, only
VMAs that follow it.
The second patch corrects the instance where a merge of the previous VMA
into a subsequent one did not correctly check whether the subsequent VMA
had a vm_ops->close() handler.
Both changes prevent merge cases that are actually permissible (for
instance a merge of a VMA into a following VMA with a vm_ops->close(), but
with no previous VMA, which would result in the next VMA being extended,
not deleted).
In addition, both changes fail to consider the case where a VMA that would
otherwise be merged with the previous and next VMA might have
vm_ops->close(), on the assumption that for this to be the case, all three
would have to have the same vma->vm_file to be mergeable and thus the same
vm_ops.
And in addition both changes operate at 50,000 feet, trying to guess
whether a VMA will be deleted.
As we have majorly refactored the VMA merge operation and de-duplicated
code to the point where we know precisely where deletions will occur, this
patch removes the aforementioned checks altogether and instead explicitly
checks whether a VMA will be deleted.
In cases where a reduced merge is still possible (where we merge both
previous and next VMA but the next VMA has a vm_ops->close hook, meaning
we could just merge the previous and current VMA), we do so, otherwise the
merge is not permitted.
We take advantage of our userland testing to assert that this functions
correctly - replacing the previous limited vm_ops->close() tests with
tests for every single case where we delete a VMA.
We also update all testing for both new and modified VMAs to set
vma->vm_ops->close() in every single instance where this would not prevent
the merge, to assert that we never do so.
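A hedged sketch of the resulting policy; the helper and variable names
here are illustrative, not the actual mm/vma.c code:

	/* A VMA with a vm_ops->close() hook must not be deleted by a merge. */
	static bool can_remove_vma(struct vm_area_struct *vma)
	{
		return !vma->vm_ops || !vma->vm_ops->close;
	}

	/*
	 * In the "merge both prev and next" case only next would be deleted.
	 * If it cannot be removed, fall back to the reduced merge of prev with
	 * the middle VMA rather than rejecting the merge outright.
	 */
	if (merge_both && !can_remove_vma(next))
		merge_both = false;	/* reduced merge: prev + middle only */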
Link: https://lkml.kernel.org/r/9f96b8cfeef3d14afabddac3d6144afdfbef2e22.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:21 +0000 (19:10 +0100)]
mm: refactor vma_merge() into modify-only vma_merge_existing_range()
The existing vma_merge() function is no longer required to handle what
were previously referred to as cases 1-3 (i.e. the merging of a new VMA),
as this is now handled by vma_merge_new_vma().
Additionally, simplify the convoluted control flow of the original: the
logic is identical, only expressed more clearly, doing away with the
complicated set of numbered cases and instead logically examining each
possible outcome - merging of both the previous and subsequent VMA,
merging of the previous VMA alone, and merging of the subsequent VMA alone.
We now utilise the previously implemented commit_merge() function to share
logic with vma_expand() de-duplicating code and providing less surface
area for bugs and confusion. In order to do so, we adjust this function
to accept parameters specific to merging existing ranges.
Link: https://lkml.kernel.org/r/2cf6016b7bfcc4965fc3cde10827560c42e4f12c.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:20 +0000 (19:10 +0100)]
mm: introduce commit_merge(), abstracting final commit of merge
Pull the part of vma_expand() which actually commits the merge operation,
that is inserts it into the maple tree and sets the VMA's vma->vm_start
and vma->vm_end parameters, into its own function.
We implement only the parts needed for vma_expand() which now as a result
of previous work is also the means by which new VMA ranges are merged.
The next commit in the series will implement merging of existing ranges
which will extend commit_merge() to accommodate this case and result in
all merges using this common code.
Link: https://lkml.kernel.org/r/7b985a20dfa549e3c370cd274d732b64c44f6dbd.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:19 +0000 (19:10 +0100)]
mm: make vma_prepare() and friends static and internal to vma.c
Now that we have abstracted merge behaviour for new VMA ranges, we are
able to render vma_prepare(), init_vma_prep(), vma_complete(),
can_vma_merge_before() and can_vma_merge_after() static and internal to
vma.c.
These are internal implementation details of kernel VMA manipulation and
merging mechanisms and thus should not be exposed. This also renders the
functions userland testable.
Link: https://lkml.kernel.org/r/7f7f1c34ce10405a6aab2714c505af3cf41b7851.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:18 +0000 (19:10 +0100)]
mm: avoid using vma_merge() for new VMAs
Abstract vma_merge_new_vma() to use vma_merge_struct and rename the
resultant function vma_merge_new_range() to be clear what the purpose of
this function is - a new VMA is desired in the specified range, and we
wish to see if it is possible to 'merge' surrounding VMAs into this range
rather than having to allocate a new VMA.
Note that this function uses vma_extend() exclusively, so adopts its
requirement that the iterator point at or before the gap. We add an
assert to this effect.
This is as opposed to vma_merge_existing_range(), which will be introduced
in a subsequent commit, and provide the same functionality for cases in
which we are modifying an existing VMA.
In mmap_region() and do_brk_flags() we open-code scenarios where we prefer
to use vma_expand() rather than invoke a full vma_merge() operation.
Abstract this logic, eliminate all of the open-coding, and use the same
logic for all cases where we add new VMAs so that, rather than ultimately
using vma_merge(), we use vma_expand().
Doing so removes duplication and simplifies VMA merging in all such cases,
laying the ground for us to eliminate the merging of new VMAs in
vma_merge() altogether.
Also add the ability for the vmg to track state and to report errors,
allowing us to differentiate a failed merge from an inability to allocate
memory in callers.
This makes it far easier to understand what is happening in these cases,
avoiding confusion and bugs and allowing for future optimisation.
Also introduce vma_iter_next_rewind() to allow for retrieval of the next,
and (optionally) the prev VMA, rewinding to the start of the previous gap.
Introduce are_anon_vmas_compatible() to abstract the individual VMA
anon_vma comparison for the case of merging on both sides, where the
anon_vma of the VMA being merged may be compatible with prev and next, but
prev's and next's anon_vmas may not be compatible with each other.
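A sketch of what such a compatibility check amounts to (the actual helper
in mm/vma.c may differ in detail):

	static bool are_anon_vmas_compatible(struct vm_area_struct *a,
					     struct vm_area_struct *b)
	{
		/* Compatible if either side has no anon_vma, or both share one. */
		return !a->anon_vma || !b->anon_vma || a->anon_vma == b->anon_vma;
	}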
Finally also introduce can_vma_merge_left() / can_vma_merge_right() to
check adjacent VMA compatibility and that they are indeed adjacent.
Link: https://lkml.kernel.org/r/49d37c0769b6b9dc03b27fe4d059173832556392.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Mark Brown <broonie@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:17 +0000 (19:10 +0100)]
mm: abstract vma_expand() to use vma_merge_struct
The purpose of the vmg is to thread merge state through functions and
avoid egregious parameter lists. We expand this to vma_expand(), which is
used for a number of merge cases.
Accordingly, adjust its callers, mmap_region() and relocate_vma_down(), to
use a vmg.
An added purpose of this change is the ability in a future commit to
perform all new VMA range merging using vma_expand().
Link: https://lkml.kernel.org/r/4bc8c9dbc9ca52452ef8e587b28fe555854ceb38.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:16 +0000 (19:10 +0100)]
mm: remove duplicated open-coded VMA policy check
Since both can_vma_merge_before() and can_vma_merge_after() are invoked
after checking for a compatible VMA NUMA policy, we can simply move this
check into is_mergeable_vma() and abstract it altogether.
In mmap_region() we set vmg->policy to NULL, so the policy comparisons
checked in can_vma_merge_before() and can_vma_merge_after() are exactly
equivalent to !vma_policy(vmg.next) and !vma_policy(vmg.prev).
Equally, in do_brk_flags(), vmg->policy is NULL, so the
can_vma_merge_after() check is equivalent to !vma_policy(vma), as we set
vmg.prev to vma.
In vma_merge(), we compare prev and next policies with vmg->policy before
checking can_vma_merge_after() and can_vma_merge_before() respectively,
which this patch causes to be checked in precisely the same way.
This therefore maintains precisely the same logic as before, only now
abstracted into is_mergeable_vma().
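A hedged sketch of the abstraction (parameter list and surrounding checks
simplified; not the exact mm/vma.c code):

	static bool is_mergeable_vma(struct vma_merge_struct *vmg,
				     struct vm_area_struct *vma)
	{
		/* The NUMA policy comparison now lives here rather than in callers. */
		if (!mpol_equal(vmg->policy, vma_policy(vma)))
			return false;

		/* ... existing flags / file / anon name compatibility checks ... */
		return true;
	}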
Link: https://lkml.kernel.org/r/0dbff286d9c4988333bc6f4ff3734cb95dd5410a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:15 +0000 (19:10 +0100)]
mm: introduce vma_merge_struct and abstract vma_merge(),vma_modify()
Rather than passing around huge numbers of parameters to numerous helper
functions, abstract them into a single struct that we thread through the
operation, the vma_merge_struct ('vmg').
Adjust vma_merge() and vma_modify() to accept this parameter, as well as
predicate functions can_vma_merge_before(), can_vma_merge_after(), and the
vma_modify_...() helper functions.
Also introduce VMG_STATE() and VMG_VMA_STATE() helper macros to allow for
easy vmg declaration.
We additionally remove the requirement that vma_merge() is passed a VMA
object representing the candidate new VMA. Previously it used this to
obtain the mm_struct, file and anon_vma properties of the proposed range
(a rather confusing state of affairs), which are now provided by the vmg
directly.
We also remove the pgoff calculation previously performed in vma_modify(),
and instead calculate this in VMG_VMA_STATE() via the vma_pgoff_offset()
helper.
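An illustrative sketch of the vmg and its declaration macro; the real
structure in mm/vma.h carries more state and differs in detail:

	struct vma_merge_struct {
		struct mm_struct *mm;
		struct vma_iterator *vmi;
		pgoff_t pgoff;
		struct vm_area_struct *prev;
		struct vm_area_struct *next;
		struct vm_area_struct *vma;	/* existing VMA being modified, if any */
		unsigned long start, end;
		unsigned long flags;
		struct file *file;
		struct anon_vma *anon_vma;
		struct mempolicy *policy;
	};

	#define VMG_STATE(name, mm_, vmi_, start_, end_, flags_, pgoff_)	\
		struct vma_merge_struct name = {				\
			.mm = (mm_), .vmi = (vmi_),				\
			.start = (start_), .end = (end_),			\
			.flags = (flags_), .pgoff = (pgoff_),			\
		}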
Link: https://lkml.kernel.org/r/a955aad09d81329f6fbeb636b2dd10cde7b73dab.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:14 +0000 (19:10 +0100)]
tools: add VMA merge tests
Add a variety of VMA merge unit tests to assert that the behaviour of VMA
merge is correct at an abstract level and VMAs are merged or not merged as
expected.
These are intentionally added _before_ we start refactoring vma_merge() in
order that we can continually assert correctness throughout the rest of
the series.
In order to reduce churn going forward, we backport the vma_merge_struct
data type (which we introduce and use in a future commit) to the test
code, and add wrappers around the merge-new-VMA and merge-existing-VMA
cases.
Link: https://lkml.kernel.org/r/1c7a0b43cfad2c511a6b1b52f3507696478ff51a.1725040657.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Fri, 30 Aug 2024 18:10:13 +0000 (19:10 +0100)]
tools: improve vma test Makefile
Patch series "mm: remove vma_merge()", v3.
The infamous vma_merge() function has been the cause of a great deal of
pain, bugs and confusion for a very long time.
It is subtle, contains many corner cases, tries to do far too much and is
as a result very fragile.
The fact that the function requires there to be a numbering system to
cover each possible eventuality with references to each in the many
branches of its implementation as to which case you are looking at speaks
to all this.
Some of this complexity is inherent - unfortunately there is no getting
away from the need to figure out precisely how to execute the merge,
whether we need to remove VMAs, whether it is safe to do so, what
constitutes a mergeable VMA and so on.
However, a lot of the complexity is not inherent but instead a product of
the function's 'organic' development.
Liam has gone to great lengths to improve the situation as a part of his
maple tree implementation, greatly improving the readability of the code,
and Vlastimil and myself have additionally gone to lengths to try to
improve things further.
However, with the availability of userland VMA testing, it now becomes
possible to perform a rather more significant refactoring while
maintaining confidence in its correct operation.
An attempt was previously made by Vlastimil [0] to eliminate vma_merge(),
however it was rather - brutal - and an astute reader might refer to the
date of that patch for insight as to its intent.
This series instead divides merge operations into two natural kinds -
merges which occur when a NEW vma is being added to the address space, and
merges which occur when a vma is being MODIFIED.
Happily, the vma_expand() function introduced by Liam, which has the
capacity for also deleting a subsequent VMA, covers each of the NEW vma
cases.
By abstracting the actual final commit of changes to a VMA to its own
function, commit_merge() and writing a wrapper around vma_expand() for new
VMA cases vma_merge_new_range(), we can avoid having to use vma_merge()
for these instances altogether.
By doing so we are also able to then de-duplicate all existing merge logic
in mmap_region() and do_brk_flags() and have everything invoke this new
function, so we universally take the same approach to merging new VMAs.
Having done so, we can then completely rework vma_merge() into
vma_merge_existing_range() and use this for the instances where a merge is
proposed for a region of an existing VMA.
This eliminates vma_merge() and its numbered cases and instead divides
things into logical cases - merge both, merge left, merge right (the
latter 2 being either partial or full merges).
The code is heavily annotated with ASCII diagrams and greatly simplified
in comparison to the existing vma_merge() function.
Having made this change, we take the opportunity to address an issue with
merging VMAs possessing a vm_ops->close() hook - commit 714965ca8252
("mm/mmap: start distinguishing if vma can be removed in mergeability
test") and commit fc0c8f9089c2 ("mm, mmap: fix vma_merge() case 7 with
vma_ops->close") make efforts to relax how we handle these, making
assumptions about which VMAs might end up deleted (and thus, if possessing
a vm_ops->close() hook, cannot be).
This refactor means we do not need to guess, so instead explicitly only
disallow merge in instances where a VMA with a vm_ops->close() hook would
be deleted (and try a smaller merge in cases where this is possible).
In addition to these changes, we introduce a new vma_merge_struct
abstraction to allow VMA merge state to be threaded through the operation
neatly.
There is heavy unit testing provided for all merge functionality, added
prior to the refactoring, allowing for before/after testing.
The vm_ops->close() change also introduces exhaustive testing to
demonstrate that this functions as expected, and in addition to this the
reproduction code from commit fc0c8f9089c2 ("mm, mmap: fix vma_merge()
case 7 with vma_ops->close") was tested and confirmed passing.
Liam R. Howlett [Fri, 30 Aug 2024 04:01:01 +0000 (00:01 -0400)]
mm/vma.h: optimise vma_munmap_struct
The vma_munmap_struct has a hole of 4 bytes and pushes the struct to three
cachelines. Relocating the three booleans upwards allows for the struct
to only use two cachelines (as reported by pahole on amd64).
long unsigned int nr_pages; /* 80 8 */
long unsigned int locked_vm; /* 88 8 */
long unsigned int nr_accounted; /* 96 8 */
long unsigned int exec_vm; /* 104 8 */
long unsigned int stack_vm; /* 112 8 */
long unsigned int data_vm; /* 120 8 */
/* size: 128, cachelines: 2, members: 19 */
/* sum members: 127, holes: 1, sum holes: 1 */
};
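An illustration of the general technique with toy structs (not the actual
vma_munmap_struct layout): moving the booleans up lets them occupy what
was previously alignment padding.

	/* Before: the bools sit at the end, leaving a 4-byte hole after the int. */
	struct before {
		int vma_count;
		/* 4-byte hole */
		unsigned long nr_pages;
		bool unlock, clear_ptes, closed_vm_ops;
	};

	/* After: the bools fill the hole and the struct shrinks. */
	struct after {
		int vma_count;
		bool unlock, clear_ptes, closed_vm_ops;
		unsigned long nr_pages;
	};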
Link: https://lkml.kernel.org/r/20240830040101.822209-22-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:01:00 +0000 (00:01 -0400)]
mm/vma: drop incorrect comment from vms_gather_munmap_vmas()
The comment has been outdated since 6b73cff239e52 ("mm: change munmap
splitting order and move_vma()"). The move_vma() was altered to fix the
fragile state of the accounting since then.
Link: https://lkml.kernel.org/r/20240830040101.822209-21-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:59 +0000 (00:00 -0400)]
mm: move may_expand_vm() check in mmap_region()
The may_expand_vm() check requires the count of the pages within the
munmap range. Since this is needed for accounting and obtained later,
reordering may_expand_vm() to later in the call stack, after the vma
munmap struct (vms) is initialised and the gather stage is potentially
run, allows for a single loop over the vmas. The gather stage does not
commit any work, so everything can be undone in the case of a failure.
The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.
Link: https://lkml.kernel.org/r/20240830040101.822209-20-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:58 +0000 (00:00 -0400)]
ipc/shm, mm: drop do_vma_munmap()
The do_vma_munmap() wrapper existed for callers that didn't have a vma
iterator and needed to check the vma mseal status prior to calling the
underlying munmap(). All callers now use a vma iterator and since the
mseal check has been moved to do_vmi_align_munmap() and the vmas are
aligned, this function can just be called instead.
do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
it is exported via the mm.h header.
Link: https://lkml.kernel.org/r/20240830040101.822209-19-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:55 +0000 (00:00 -0400)]
mm: change failure of MAP_FIXED to restoring the gap on failure
Prior to call_mmap(), the vmas that will be replaced need to clear the way
for what may happen in the call_mmap(). This clean up work includes
clearing the ptes and calling the close() vm_ops. Some users do more
setup than can be restored by calling the vm_ops open() function. It is
safer to store the gap in the vma tree in these cases.
That is to say that the failure scenario that existed before the MAP_FIXED
gap exposure is restored as it is safer than trying to undo a partial
mapping.
Since abort_munmap_vmas() is only reattaching vmas with this change, the
function is renamed to reattach_vmas().
There is also a secondary failure that may occur if there is not enough
memory to store the gap. In this case, the vmas are reattached and
resources freed. If the system cannot complete the call_mmap() and fails
to allocate with GFP_KERNEL, then the system will print a warning about
the failure.
[lorenzo.stoakes@oracle.com: fix off-by-one error in vms_abort_munmap_vmas()] Link: https://lkml.kernel.org/r/52ee7eb3-955c-4ade-b5f0-28fed8ba3d0b@lucifer.local Link: https://lkml.kernel.org/r/20240830040101.822209-16-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:54 +0000 (00:00 -0400)]
mm/mmap: avoid zeroing vma tree in mmap_region()
Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().
To ensure locking is downgraded correctly, the mm is set regardless of
MAP_FIXED or not (NULL vma).
If a driver is mapping over an existing vma, then clear the ptes before
the call_mmap() invocation. This is done using the vms_clean_up_area()
helper. If there is a close vm_ops, that must also be called to ensure
any cleanup is done before mapping over the area. This also means that
calling open has been added to the abort of an unmap operation, for now.
Since vm_ops->open() and vm_ops->close() do not always undo each other
(state cleanup may exist in ->close() that is lost forever), the code
cannot be left in this way, but that change has been isolated to another
commit to make this point very obvious for traceability.
Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.
This also drops the validate_mm() call in the vma_expand() function. It
is necessary to drop the validate as it would fail since the mm map_count
would be incorrect during a vma expansion, prior to the cleanup from
vms_complete_munmap_vmas().
Clean up the error handling of vms_gather_munmap_vmas() by calling the
verification within the function.
Link: https://lkml.kernel.org/r/20240830040101.822209-15-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:52 +0000 (00:00 -0400)]
mm/vma: track start and end for munmap in vma_munmap_struct
Set the start and end address for munmap when the prev and next are
gathered. This is needed to avoid incorrect addresses being used during
the vms_complete_munmap_vmas() function if the prev/next vma are expanded.
Add a new helper vms_complete_pte_clear(), which is needed later and will
avoid growing the argument list to unmap_region() beyond the 9 it already
has.
Link: https://lkml.kernel.org/r/20240830040101.822209-13-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:51 +0000 (00:00 -0400)]
mm/mmap: reposition vma iterator in mmap_region()
Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location. Pointing the vma iterator
at the insert location allows for a cleaner walk of the vma tree for
MAP_FIXED and the no expansion cases.
The vma_prev() call in the case of merging the previous vma is equivalent
to vma_iter_prev_range(), since the vma iterator will be pointing to the
location just before the previous vma.
This change needs to export abort_munmap_vmas() from mm/vma.
Link: https://lkml.kernel.org/r/20240830040101.822209-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:50 +0000 (00:00 -0400)]
mm/vma: support vma == NULL in init_vma_munmap()
Adding support for a NULL vma means init_vma_munmap() can always be used
to initialise the structure, making for a less error-prone process when
calling vms_complete_munmap_vmas() later on.
Link: https://lkml.kernel.org/r/20240830040101.822209-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:47 +0000 (00:00 -0400)]
mm/vma: extract validate_mm() from vma_complete()
vma_complete() will need to be called during an unsafe time to call
validate_mm(). Extract the call in all places now so that only one
location can be modified in the next change.
Link: https://lkml.kernel.org/r/20240830040101.822209-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:46 +0000 (00:00 -0400)]
mm/vma: change munmap to use vma_munmap_struct() for accounting and surrounding vmas
Clean up the code by changing the munmap operation to use a structure for
the accounting and munmap variables.
Since remove_mt() is only called in one location and its contents will be
reduced to almost nothing, the remains of the function can be folded into
vms_complete_munmap_vmas().
Link: https://lkml.kernel.org/r/20240830040101.822209-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:44 +0000 (00:00 -0400)]
mm/vma: extract the gathering of vmas from do_vmi_align_munmap()
Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later. Part of the gathering is the
splitting of vmas that span the boundary.
Link: https://lkml.kernel.org/r/20240830040101.822209-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:43 +0000 (00:00 -0400)]
mm/vma: introduce vmi_complete_munmap_vmas()
Extract all necessary operations that need to be completed after the vma
maple tree is updated from a munmap() operation. Extracting this makes
the later patch in the series easier to understand.
Link: https://lkml.kernel.org/r/20240830040101.822209-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:42 +0000 (00:00 -0400)]
mm/vma: introduce abort_munmap_vmas()
Extract clean up of failed munmap() operations from do_vmi_align_munmap().
This simplifies later patches in the series.
It is worth noting that the mas_for_each() loop now has a different upper
limit. This should not change the number of vmas visited for reattaching
to the main vma tree (mm_mt), as all vmas are reattached in both
scenarios.
Link: https://lkml.kernel.org/r/20240830040101.822209-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Fri, 30 Aug 2024 04:00:41 +0000 (00:00 -0400)]
mm/vma: correctly position vma_iterator in __split_vma()
Patch series "Avoid MAP_FIXED gap exposure", v8.
It is now possible to walk the vma tree using the rcu read lock, and it is
beneficial to do so to reduce lock contention. Doing so while a MAP_FIXED
mapping is executing means that a reader may see a gap in the vma tree
that should never logically exist - and does not when using the mmap lock
in read mode. The temporal gap exists because mmap_region() calls
munmap() prior to installing the new mapping.
This patch set stops rcu readers from seeing the temporal gap by splitting
up the munmap() function into two parts. The first part prepares the vma
tree for modifications by doing the necessary splits and tracks the vmas
marked for removal in a side tree. The second part completes the
munmapping of the vmas after the vma tree has been overwritten (either by
a MAP_FIXED replacement vma or by a NULL in the munmap() case).
Please note that rcu walkers will still be able to see a temporary state
of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed. vma_start_write() is called on both
parts of the split vma, so this state is detectable.
If existing vmas have a vm_ops->close(), then they will be called prior to
mapping the new vmas (and ptes are cleared out). Without calling
->close(), hugetlbfs tests fail (hugemmap06 specifically) due to resources
still being marked as 'busy'. Unfortunately, calling the corresponding
->open() may not restore the state of the vmas, so it is safer to keep the
existing failure scenario where a gap is inserted and never replaced. The
failure scenario is in its own patch (0015) for traceability.
This patch (of 21):
The vma iterator may be left pointing to the newly created vma. This
happens when inserting the new vma at the end of the old vma (!new_below).
The incorrect position in the vma iterator is not exposed currently since
the vma iterator is repositioned in the munmap path and is not reused in
any of the other paths.
This has limited impact in the current code, but is required for future
changes.
Link: https://lkml.kernel.org/r/20240830040101.822209-2-Liam.Howlett@oracle.com Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Yuan [Fri, 23 Aug 2024 16:27:11 +0000 (16:27 +0000)]
Documentation/cgroup-v2: clarify that zswap.writeback is ignored if zswap is disabled
As discussed in [1], zswap-related settings naturally lose their effect
when zswap is disabled, specifically zswap.writeback here. Be explicit
about this behavior.
Mike Yuan [Fri, 23 Aug 2024 16:27:09 +0000 (16:27 +0000)]
selftests: test_zswap: add test for hierarchical zswap.writeback
Ensure that the zswap.writeback check goes up the cgroup tree, i.e. is
hierarchical. Create a subcgroup which has zswap.writeback set to 1, and
check that the upper hierarchy's restrictions still apply.
Link: https://lkml.kernel.org/r/20240823162506.12117-2-me@yhndnzj.com Signed-off-by: Mike Yuan <me@yhndnzj.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Usama Arif [Fri, 23 Aug 2024 19:04:40 +0000 (20:04 +0100)]
mm: remove code to handle same filled pages
With an earlier commit to handle zero-filled pages in swap directly, and
with only 1% of the same-filled pages being non-zero, zswap no longer
needs to handle same-filled pages and can just work on compressed pages.
Link: https://lkml.kernel.org/r/20240823190545.979059-3-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Usama Arif [Fri, 23 Aug 2024 19:04:39 +0000 (20:04 +0100)]
mm: store zero pages to be swapped out in a bitmap
Patch series "mm: store zero pages to be swapped out in a bitmap", v8.
As shown in the patch series that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet. By using VM counters in
swap_writepage (not included in this patchseries) it was found that less
than 1% of the same-filled pages to be swapped out are non-zero pages.
For conventional swap setup (without zswap), rather than reading/writing
these pages to flash resulting in increased I/O and flash wear, a bitmap
can be used to mark these pages as zero at write time, and the pages can
be filled at read time if the bit corresponding to the page is set.
When using zswap with swap, this also means that a zswap_entry does not
need to be allocated for zero filled pages resulting in memory savings
which would offset the memory used for the bitmap.
A similar attempt was made earlier in [2] where zswap would only track
zero-filled pages instead of same-filled. This patchseries adds
zero-filled pages optimization to swap (hence it can be used even if zswap
is disabled) and removes the same-filled code from zswap (as only 1% of
the same-filled pages are non-zero), simplifying code.
Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
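A hedged sketch of the two halves of that mechanism; the helper and field
names (e.g. a per-swapfile zeromap bitmap) are assumptions rather than the
exact implementation:

	/* Swap-out: a fully zero folio only needs its bit set, no I/O. */
	if (is_folio_zero_filled(folio)) {		/* assumed helper */
		set_bit(swp_offset(entry), sis->zeromap);
		return 0;				/* skip the device write */
	}
	clear_bit(swp_offset(entry), sis->zeromap);

	/* Swap-in: a set bit means the folio can simply be zero-filled. */
	if (test_bit(swp_offset(entry), sis->zeromap)) {
		folio_zero_range(folio, 0, folio_size(folio));
		folio_mark_uptodate(folio);
		return 0;				/* skip the device read */
	}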
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).
Zhongkun He [Thu, 22 Aug 2024 09:26:12 +0000 (17:26 +0800)]
mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()
should_reclaim_retry() is not ALLOC_CPUSET aware, which means that it
considers the reclaimability of NUMA nodes that are outside of the cpuset.
If other nodes have a lot of reclaimable memory then should_reclaim_retry
would instruct the page allocator to retry even though there is no
reclaimable memory on the cpuset's nodemask. This is not really a huge
problem because the number of retries without any reclaim progress is
bounded, but it can certainly be improved. This is a cold path, so this
shouldn't really have a measurable impact on performance for most
workloads.
With this patch, we check the right nodes and get fewer retries in
__alloc_pages_slowpath() because there is nothing to do.
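A hedged sketch of the idea; the exact condition and placement in
__alloc_pages_slowpath() may differ from the actual patch:

	/*
	 * If the caller supplied no nodemask but the allocation is
	 * cpuset-constrained, point ac->nodemask at the cpuset's allowed nodes
	 * so the retry logic only considers nodes we can actually allocate from.
	 */
	if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) && !ac->nodemask)
		ac->nodemask = &cpuset_current_mems_allowed;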
Link: https://lkml.kernel.org/r/20240822092612.3209286-1-hezhongkun.hzk@bytedance.com Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 22 Aug 2024 11:24:58 +0000 (13:24 +0200)]
mm: swapfile: fix SSD detection with swapfile on btrfs
We've been noticing a trend of significant lock contention in the swap
subsystem as core counts have been increasing in our fleet. It turns out
that our swapfiles on btrfs on flash were in fact using the old swap code
for rotational storage.
This turns out to be a detection issue in the swapon sequence: btrfs sets
si->bdev during swap activation, which currently happens *after* swapon's
SSD detection and cluster setup. Thus, none of the SSD optimizations and
cluster lock splitting are enabled for btrfs swap.
Rearrange the swapon sequence so that filesystem activation happens
*before* determining swap behavior based on the backing device.
Afterwards, the nonrotational drive is detected correctly:
- Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k
+ Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k SS
David Hildenbrand [Wed, 21 Aug 2024 12:31:15 +0000 (14:31 +0200)]
selftests/mm: fix charge_reserved_hugetlb.sh test
Currently, running the charge_reserved_hugetlb.sh selftest we can
sometimes observe something like:
$ ./charge_reserved_hugetlb.sh -cgroup-v2
...
write_result is 0
After write:
hugetlb_usage=0
reserved_usage=10485760
killing write_to_hugetlbfs
Received 2.
Deleting the memory
Detach failure: Invalid argument
umount: /mnt/huge: target is busy.
Both cases are issues in the test.
While the unmount error seems to be racy, it will make the test fail:
$ ./run_vmtests.sh -t hugetlb
...
# [FAIL]
not ok 10 charge_reserved_hugetlb.sh -cgroup-v2 # exit=32
The issue is that we are not waiting for the write_to_hugetlbfs process to
quit. So it might still have a hugetlbfs file open, about which umount is
not happy. Fix that by making "killall" wait for the process to quit.
The other error ("Detach failure: Invalid argument") does not seem to
result in a test error, but is misleading. It turns out write_to_hugetlbfs.c
unconditionally tries to clean up using shmdt(), even when we only
mmap()'ed a hugetlb file. Even worse, shmaddr is never even set for the
SHM case. Fix that as well.
With this change it seems to work as expected.
Link: https://lkml.kernel.org/r/20240821123115.2068812-1-david@redhat.com Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 21 Aug 2024 19:34:42 +0000 (20:34 +0100)]
mm: rename PG_mappedtodisk to PG_owner_2
This flag has similar constraints to PG_owner_priv_1 -- it is ignored by
core code, and is entirely for the use of the code which allocated the
folio. Since the pagecache does not use it, individual filesystems can
use it. The bufferhead code does use it, so filesystems which use the
buffer cache must not use it for another purpose.
Matthew Wilcox (Oracle) [Wed, 21 Aug 2024 19:34:34 +0000 (20:34 +0100)]
mm: remove PageActive
Patch series "Simplify the page flags a little".
In the course of our folio conversions, we have made many page flags only
used on folios, so we can now remove the page-based accessors. This
should cut down compile time a little, and prevent new users from cropping
up.
There is more that could be done in this area, but it would produce merge
conflicts, so I'll sit on those patches until next merge window. We now
have line of sight to removing PG_private_2 and PG_private.
This patch (of 10):
This flag is now only used on folios, so we can remove all the page
accessors.
Matthew Wilcox (Oracle) [Wed, 21 Aug 2024 17:39:12 +0000 (18:39 +0100)]
zsmalloc: use all available 24 bits of page_type
Now that we have an extra 8 bits, we don't need to limit ourselves to
supporting a 64KiB page size. I'm sure both Hexagon users are grateful,
but it does reduce complexity a little. We can also remove
reset_first_obj_offset() as calling __ClearPageZsmalloc() will now reset
all 32 bits of page_type.
Link: https://lkml.kernel.org/r/20240821173914.2270383-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox (Oracle) [Wed, 21 Aug 2024 17:39:11 +0000 (18:39 +0100)]
mm: support only one page_type per page
By using a few values in the top byte, users of page_type can store up to
24 bits of additional data in page_type. It also reduces code size: after
replacing READ_ONCE() with data_race(), the kernel can check just a single
byte.
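To illustrate the resulting layout (the macro and helper names below are
made up for the example, they are not the kernel's): the top byte selects
the page type and the remaining 24 bits are free for per-type data.

    #define PT_TYPE_SHIFT   24
    #define PT_DATA_MASK    ((1u << PT_TYPE_SHIFT) - 1)

    /* pack a type value into the top byte, payload into the low 24 bits */
    static inline unsigned int pt_encode(unsigned int type, unsigned int data)
    {
            return (type << PT_TYPE_SHIFT) | (data & PT_DATA_MASK);
    }

    static inline unsigned int pt_data(unsigned int page_type)
    {
            return page_type & PT_DATA_MASK;
    }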
Matthew Wilcox (Oracle) [Wed, 21 Aug 2024 17:39:09 +0000 (18:39 +0100)]
printf: remove %pGt support
Patch series "Increase the number of bits available in page_type".
Kent wants more than 16 bits in page_type, so I resurrected this old patch
and expanded it a bit. It's a bit more efficient than our current scheme
(1 4-byte insn vs 3 insns of 13 bytes total) to test a single page type.
This patch (of 4):
An upcoming patch will convert page type from being a bitfield to a
single byte, so we will not be able to use %pG to print the page type
any more. The printing of the symbolic name will be restored in that
patch.
Pedro Falcato [Sat, 17 Aug 2024 00:18:34 +0000 (01:18 +0100)]
selftests/mm: add more mseal traversal tests
Add more mseal traversal tests across VMAs, where we could possibly screw
up sealing checks. These exercise across-VMA traversal for mprotect,
munmap and madvise. In particular, we test the case where a regular VMA is
followed by a sealed VMA.
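A minimal, self-contained illustration of one such case (a regular VMA
directly followed by a sealed VMA); this is a sketch in the spirit of the
tests, not one of the actual selftests, and it assumes headers that define
__NR_mseal.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* seal only the second page: mseal() splits off a sealed VMA */
            if (syscall(__NR_mseal, p + psz, psz, 0))
                    return 1;
            /* an mprotect() spanning both VMAs must fail with EPERM */
            if (mprotect(p, 2 * psz, PROT_READ) == 0 || errno != EPERM)
                    printf("FAIL: mprotect crossed a sealed VMA\n");
            else
                    printf("OK: mprotect across the sealed VMA was rejected\n");
            return 0;
    }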
[akpm@linux-foundation.org: remove incorrect comment, per review]
[akpm@linux-foundation.org: remove the correct comment, per Pedro]
[pedro.falcato@gmail.com: fix mseal's length] Link: https://lkml.kernel.org/r/vc4czyuemmu3kylqb4ctaga6y5yvondlyabimx6jvljlw2fkea@djawlllf45xa Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-7-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pedro Falcato [Sat, 17 Aug 2024 00:18:32 +0000 (01:18 +0100)]
mseal: replace can_modify_mm_madv with a vma variant
Replace can_modify_mm_madv() with a single vma variant, and associated
checks in madvise.
While we're at it, also invert the order of checks in:
if (unlikely(is_ro_anon(vma) && !can_modify_vma(vma)))
Checking whether we can modify the vma itself (through vm_flags) is
certainly cheaper than is_ro_anon(), because arch_vma_access_permitted()
looks at e.g. pkeys registers (with extra branches) on some architectures.
This patch allows for partial madvise success when finding a sealed VMA,
which historically has been allowed in Linux.
Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-5-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pedro Falcato [Sat, 17 Aug 2024 00:18:31 +0000 (01:18 +0100)]
mm/mremap: replace can_modify_mm with can_modify_vma
Delegate all can_modify checks to the proper places. Unmap checks are
done in do_unmap (et al). The source VMA check is done purposefully
before unmapping, to keep the original mseal semantics.
Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-4-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pedro Falcato [Sat, 17 Aug 2024 00:18:28 +0000 (01:18 +0100)]
mm: move can_modify_vma to mm/vma.h
Patch series "mm: Optimize mseal checks", v3.
Optimize mseal checks by removing the separate can_modify_mm() step, and
just doing checks on the individual vmas, when various operations are
themselves iterating through the tree. This provides a nice speedup and
restores performance parity with pre-mseal[3].
This was a Completely Unscientific test but seems to show there were
around 5-10% gains on ops per second.
Oliver performed his own tests and showed[3] a similar ~5% gain in them.
[1]: mmap1_process does mmap and munmap in a loop. I didn't bother testing multithreading cases.
[2]: https://lore.kernel.org/all/20240807124103.85644-1-mpe@ellerman.id.au/
[3]: https://lore.kernel.org/all/ZrMMJfe9aXSWxJz6@xsang-OptiPlex-9020/ Link: https://lore.kernel.org/all/202408041602.caa0372-oliver.sang@intel.com/
This patch (of 7):
Move can_modify_vma to vma.h so it can be inlined properly (with the
intent to remove can_modify_mm callsites).
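For reference, the helper is small enough that inlining it is clearly
worthwhile; a sketch of its shape (assuming, as in the mseal series, that
sealed VMAs are marked with a VM_SEALED vm_flag):

    static inline bool can_modify_vma(struct vm_area_struct *vma)
    {
            if (unlikely(vma->vm_flags & VM_SEALED))
                    return false;
            return true;
    }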
Link: https://lkml.kernel.org/r/20240817-mseal-depessimize-v3-1-d8d2e037df30@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yafang Shao [Tue, 20 Aug 2024 02:26:39 +0000 (10:26 +0800)]
mm: allow read-ahead with IOCB_NOWAIT set
Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9
("mm: allow read-ahead with IOCB_NOWAIT set"). However, this
implementation broke the semantics of IOCB_NOWAIT by potentially causing
it to wait on I/O during memory reclamation. This behavior was later
modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").
To resolve the blocking issue during memory reclamation, we can use
memalloc_noio_{save,restore} to ensure non-blocking behavior. This change
restores the original functionality, allowing preadv2(IOCB_NOWAIT) to
trigger readahead if the file content is not present in the page cache.
While this process may trigger direct memory reclamation, the
__GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't
block.
A use case for this change is when we want to trigger readahead in the
preadv2(2) syscall if the file cache is absent, but without waiting for
certain filesystem locks, like xfs_ilock. A simple example is as follows:
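As an illustrative sketch (not the exact example from the series): a
non-blocking read that, if the data is absent from the page cache, fails
with EAGAIN but, with this change, also queues readahead so a later retry
can succeed without blocking.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>

    int main(int argc, char **argv)
    {
            char buf[4096];
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
            int fd = open(argc > 1 ? argv[1] : "datafile", O_RDONLY);

            if (fd < 0)
                    return 1;
            if (preadv2(fd, &iov, 1, 0, RWF_NOWAIT) < 0 && errno == EAGAIN)
                    printf("not cached; readahead kicked off, retry later\n");
            else
                    printf("served from the page cache\n");
            return 0;
    }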
Kefeng Wang [Tue, 20 Aug 2024 03:26:30 +0000 (11:26 +0800)]
mm: remove migration for HugePage in isolate_single_pageblock()
The gigantic page size may be larger than the memory block size, so memory
offlining always fails in this case after commit b2c9e2fbba32 ("mm: make
alloc_contig_range work at pageblock granularity"):
[ 15.815756] memory offlining [mem 0x3c0000000-0x3c7ffffff] failed due to failure to isolate range
Gigantic PageHuge is bigger than a pageblock, but since it is freed as
order-0 pages, its pageblocks after being freed will get to the right
free list. There is no need to have special handling code for them in
start_isolate_page_range(). For both alloc_contig_range() and memory
offline cases, the migration code after start_isolate_page_range() will
be able to migrate gigantic PageHuge when possible. Let's clean up
start_isolate_page_range() and fix the aforementioned memory offline
failure issue all together.
Link: https://lkml.kernel.org/r/20240820032630.1894770-1-wangkefeng.wang@huawei.com Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 20 Aug 2024 09:49:17 +0000 (17:49 +0800)]
selftests: mm: support shmem mTHP collapse testing
Add shmem mTHP collapse testing. As with anonymous pages, users can use
the '-s' parameter to specify the shmem mTHP size for testing.
Link: https://lkml.kernel.org/r/fa44bfa20ca5b9fd6f9163a048f3d3c1e53cd0a8.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 20 Aug 2024 09:49:16 +0000 (17:49 +0800)]
mm: khugepaged: support shmem mTHP collapse
Shmem already supports the allocation of mTHP, but khugepaged does not yet
support collapsing mTHP folios. Now khugepaged is ready to support mTHP,
and this patch enables the collapse of shmem mTHP.
Link: https://lkml.kernel.org/r/b9da76aab4276eb6e5d12c479af2b5eea5b4575d.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 20 Aug 2024 09:49:15 +0000 (17:49 +0800)]
mm: khugepaged: support shmem mTHP copy
Iterate over each subpage in the large folio when copying, as preparation
for supporting shmem mTHP collapse.
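Illustratively (simplified; 'src_pages' here is a hypothetical array of the
source pages being collapsed), the copy becomes a per-subpage loop over the
new large folio:

    for (i = 0; i < folio_nr_pages(new_folio); i++)
            copy_highpage(folio_page(new_folio, i), src_pages[i]);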
Link: https://lkml.kernel.org/r/222d615b7c837eabb47a238126c5fdeff8aa5283.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Tue, 20 Aug 2024 09:49:14 +0000 (17:49 +0800)]
mm: khugepaged: use the number of pages in the folio to check the reference count
Use the number of pages in the folio to check the reference count as
preparation for supporting shmem mTHP collapse.
Link: https://lkml.kernel.org/r/9ea49262308de28957596cc6e8edc2d3a4f54659.1724140601.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Tue, 20 Aug 2024 12:22:10 +0000 (14:22 +0200)]
mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y
We already force-inline page_fixed_fake_head(), page_is_fake_head() and
PageTail(); however, the compiler might decide that _compound_head() is not
worth inlining because of page_fixed_fake_head().
The result is that, for example, PageAnonExclusive() now might involve a
function call when checking PageHuge(), which performs a
page_folio()->_compound_head() call. This can lead to a slight regression
of the stress-ng.clone benchmark.
This is not super-urgent to fix, but always inlining _compound_head()
seems like the obvious thing to do for this primitive, similar to the
other ones.
This change resolves the slight regression, and a compilation with
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y shows no relevant bloat [2].
Uros Bizjak [Sun, 18 Aug 2024 21:01:52 +0000 (23:01 +0200)]
mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space
Use IS_ERR_PCPU() instead of IS_ERR() for pointers in the percpu address
space. The patch also fixes the following sparse warnings:
kmemleak.c:1063:39: warning: cast removes address space '__percpu' of expression
kmemleak.c:1138:37: warning: cast removes address space '__percpu' of expression
Jinjiang Tu [Mon, 19 Aug 2024 13:06:09 +0000 (21:06 +0800)]
selftests/mm: remove unnecessary ia64 code and comment
IA64 is gone since commit cf8e8658100d ("arch: Remove Itanium (IA-64)
architecture"), so remove the unnecessary ia64-specific mm code and
comments from the selftests too.
Link: https://lkml.kernel.org/r/20240819130609.3386195-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Danilo Krummrich [Mon, 12 Aug 2024 22:34:35 +0000 (00:34 +0200)]
mm: krealloc: clarify valid usage of __GFP_ZERO
Properly document that if __GFP_ZERO logic is requested, callers must
ensure that, starting with the initial memory allocation, every subsequent
call to this API for the same memory allocation is flagged with
__GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully
honored by this API.
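A minimal usage sketch (not taken from the patch) of what consistent
flagging means in practice:

    /*
     * Only valid if the very first allocation was zeroed (e.g. kzalloc())
     * and every resize keeps passing __GFP_ZERO.
     */
    static void *resize_zeroed(void *buf, size_t new_len)
    {
            return krealloc(buf, new_len, GFP_KERNEL | __GFP_ZERO);
    }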
Link: https://lkml.kernel.org/r/20240812223707.32049-2-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Danilo Krummrich [Mon, 12 Aug 2024 22:34:34 +0000 (00:34 +0200)]
mm: krealloc: consider spare memory for __GFP_ZERO
As long as krealloc() is called with __GFP_ZERO consistently, starting
with the initial memory allocation, __GFP_ZERO should be fully honored.
However, if for an existing allocation krealloc() is called with a
decreased size, it is not ensured that the spare portion of the allocation
is zeroed. Thus, if krealloc() is subsequently called with a larger size
again, __GFP_ZERO can't be fully honored, since we don't know the previous
size, but only the bucket size.
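A scenario sketch of the problem (illustrative sizes; the same slab bucket
is reused across the calls):

    p = kzalloc(64, GFP_KERNEL);                    /* zeroed */
    memset(p, 0xff, 64);                            /* dirty the buffer */
    p = krealloc(p, 8, GFP_KERNEL | __GFP_ZERO);    /* shrink within the bucket */
    p = krealloc(p, 64, GFP_KERNEL | __GFP_ZERO);   /* grow again: without this fix,
                                                       bytes 8..63 may still be 0xff */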
David Hildenbrand [Fri, 16 Aug 2024 10:32:46 +0000 (12:32 +0200)]
mm/rmap: use folio->_mapcount for small folios
We have some cases left where we operate on small folios and still refer
to page->_mapcount. Let's just use folio->_mapcount instead, which
currently still overlays page->_mapcount, so there is no functional change.
This change will make it easier to later spot any remaining users of
page->_mapcount that target tail pages.
Yu Zhao [Wed, 14 Aug 2024 03:54:51 +0000 (21:54 -0600)]
mm/hugetlb: use __GFP_COMP for gigantic folios
Use __GFP_COMP for gigantic folios to greatly reduce not only the amount
of code but also the allocation and free time.
LOC (approximately): +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before After
Alloc ~13s ~10s
Free ~15s <1s
The above magnitude generally holds for multiple x86 and arm64 CPU models.
Link: https://lkml.kernel.org/r/20240814035451.773331-4-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Frank van der Linden <fvdl@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Wed, 14 Aug 2024 03:54:50 +0000 (21:54 -0600)]
mm/cma: add cma_{alloc,free}_folio()
With alloc_contig_range() and free_contig_range() supporting large folios,
CMA can allocate and free large folios too, by cma_alloc_folio() and
cma_free_folio().
Yu Zhao [Wed, 14 Aug 2024 03:54:49 +0000 (21:54 -0600)]
mm/contig_alloc: support __GFP_COMP
Patch series "mm/hugetlb: alloc/free gigantic folios", v2.
Using __GFP_COMP for gigantic folios can greatly reduce not only the amount
of code but also the allocation and free time.
Approximate LOC to mm/hugetlb.c: +60, -240
Allocate and free 500 1GB hugeTLB memory without HVO by:
time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
Before After
Alloc ~13s ~10s
Free ~15s <1s
The above magnitude generally holds for multiple x86 and arm64 CPU
models.
Support __GFP_COMP in alloc_contig_range(). When the flag is set, upon
success the function returns a large folio prepared by prep_new_page(),
rather than a range of order-0 pages prepared by split_free_pages() (which
is renamed from split_map_pages()).
alloc_contig_range() can be used to allocate folios larger than
MAX_PAGE_ORDER, e.g., gigantic hugeTLB folios. So on the free path,
free_one_page() needs to handle that by split_large_buddy().
Kaiyang Zhao [Wed, 14 Aug 2024 17:42:27 +0000 (17:42 +0000)]
mm,memcg: provide per-cgroup counters for NUMA balancing operations
The ability to observe the demotion and promotion decisions made by the
kernel on a per-cgroup basis is important for monitoring and tuning
containerized workloads on machines equipped with tiered memory.
Different containers in the system may experience drastically different
memory tiering actions that cannot be distinguished from the global
counters alone.
For example, a container running a workload with much hotter memory
accesses will likely see more promotions and fewer demotions, potentially
depriving a colocated container of top-tier memory to such an extent that
its performance degrades unacceptably.
For another example, some containers may exhibit longer periods between
data reuse, causing much more numa_hint_faults than numa_pages_migrated.
In this case, tuning hot_threshold_ms may be appropriate, but the signal
can easily be lost if only global counters are available.
In the long term, we hope to introduce per-cgroup control of promotion and
demotion actions to implement memory placement policies in tiering.
This patch set adds seven counters to memory.stat in a cgroup:
numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd,
pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_*
and pgpromote_success are also available in memory.numa_stat.
count_memcg_events_mm() is added to count multiple event occurrences at
once, and get_mem_cgroup_from_folio() is added because we need to get a
reference to the memcg of a folio before it's migrated to track
numa_pages_migrated. The accounting of PGDEMOTE_* is moved to
shrink_inactive_list() before being changed to per-cgroup.
[kaiyang2@cs.cmu.edu: add documentation of the memcg counters in cgroup-v2.rst] Link: https://lkml.kernel.org/r/20240814235122.252309-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240814174227.30639-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Konovalov [Tue, 13 Aug 2024 22:40:27 +0000 (00:40 +0200)]
kasan: simplify and clarify Makefile
When KASAN support was being added to the Linux kernel, GCC did not yet
support all of the KASAN-related compiler options. Thus, the KASAN
Makefile had to probe the compiler for supported options.
Nowadays, the Linux kernel GCC version requirement is 5.1+, and thus we
don't need the probing of the -fasan-shadow-offset parameter: it exists in
all 5.1+ GCCs.
Simplify the KASAN Makefile to drop CFLAGS_KASAN_MINIMAL.
Also add a few more comments and unify the indentation.
Baolin Wang [Mon, 12 Aug 2024 07:42:10 +0000 (15:42 +0800)]
mm: shmem: support large folio swap out
Shmem will support large folio allocation [1] [2] to get better
performance; however, memory reclaim still splits the precious large folios
when trying to swap out shmem, which may lead to memory fragmentation and
cannot take advantage of the large folio for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting, hence this patch set supports large folio swap out for shmem.
Note that the i915_gem_shmem driver still needs its folios to be split when
swapping, thus add a new flag 'split_large_folio' to writeback_control to
indicate splitting the large folio.
Baolin Wang [Mon, 12 Aug 2024 07:42:09 +0000 (15:42 +0800)]
mm: shmem: split large entry if the swapin folio is not large
Currently the swap device can only swap in order-0 folios, even though a
large folio may have been swapped out. This requires us to split the large
entry previously saved in the shmem pagecache in order to support swapping
in small folios.
Baolin Wang [Mon, 12 Aug 2024 07:42:08 +0000 (15:42 +0800)]
mm: shmem: drop folio reference count using 'nr_pages' in shmem_delete_from_page_cache()
To support large folio swapin/swapout for shmem in the following patches,
drop the folio's reference count by the number of pages contained in the
folio when a shmem folio is deleted from the shmem pagecache after being
added to the swap cache.
Link: https://lkml.kernel.org/r/b371eadb27f42fc51261c51008fbb9a334985b4c.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:07 +0000 (15:42 +0800)]
mm: shmem: support large folio allocation for shmem_replace_folio()
To support large folio swapin for shmem in the following patches, add
large folio allocation for the new replacement folio in
shmem_replace_folio(). Moreover, large folios occupy N consecutive entries
in the swap cache instead of using multi-index entries like the page cache;
therefore we should replace each of the consecutive entries in the swap
cache instead of using shmem_replace_entry(). Also update the statistics
and the folio reference count using the number of pages in the folio.
Baolin Wang [Mon, 12 Aug 2024 07:42:06 +0000 (15:42 +0800)]
mm: shmem: use swap_free_nr() to free shmem swap entries
As preparation for supporting shmem large folio swapout, use
swap_free_nr() to free the contiguous swap entries of a shmem large folio
when the large folio is swapped in from the swap cache. In addition, the
index should also be aligned down to a multiple of the number of pages when
adding the swapin folio into the pagecache.
Link: https://lkml.kernel.org/r/342207fa679fc88a447dac2e101ad79e6050fe79.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:05 +0000 (15:42 +0800)]
mm: filemap: use xa_get_order() to get the swap entry order
In the following patches, shmem will support swapping out large folios,
which means the shmem mappings may contain large-order swap entries, so use
xa_get_order() to get the folio order of the shmem swap entry and update
'*start' correctly.
[hughd@google.com: use xa_get_order() to get the swap entry order] Link: https://lkml.kernel.org/r/c336e6e4-da7f-b714-c0f1-12df715f2611@google.com Link: https://lkml.kernel.org/r/6876d55145c1cc80e79df7884aa3a62e397b101d.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Daniel Gomez [Mon, 12 Aug 2024 07:42:04 +0000 (15:42 +0800)]
mm: shmem: return number of pages being freed in shmem_free_swap
Both shmem_free_swap callers expect the number of pages being freed. In
the large folio context, this needs to support values other than 0 (which
used to mean one page freed) and -ENOENT (which used to mean no pages
freed).
In preparation for large folio adoption, make the shmem_free_swap routine
return the number of pages being freed, so that returning 0 in this context
means 0 pages freed.
While at it, switch to free_swap_and_cache_nr() to free large-order swap
entries (change by Baolin Wang).
Link: https://lkml.kernel.org/r/9623e863c83d749d5ab407f6fdf0a8e5a3bdf052.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:03 +0000 (15:42 +0800)]
mm: shmem: extend shmem_partial_swap_usage() to support large folio swap
To support shmem large folio swapout in the following patches, use
xa_get_order() to get the order of the swap entry when calculating the swap
usage of shmem.
Link: https://lkml.kernel.org/r/60b130b9fc3e422bb91293a172c2113c85e9233a.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Mon, 12 Aug 2024 07:42:02 +0000 (15:42 +0800)]
mm: swap: extend swap_shmem_alloc() to support batch SWAP_MAP_SHMEM flag setting
Patch series "support large folio swap-out and swap-in for shmem", v5.
Shmem will support large folio allocation [1] [2] to get better
performance; however, memory reclaim still splits the precious large folios
when trying to swap out shmem, which may lead to memory fragmentation and
cannot take advantage of the large folio for shmem.
Moreover, the swap code already supports swapping out large folios without
splitting, and the large folio swap-in [3] series is queued in the
mm-unstable branch. Hence this patch set also supports large folio
swap-out and swap-in for shmem.
This patch (of 9):
To support shmem large folio swap operations, add a new parameter to
swap_shmem_alloc() that allows batch SWAP_MAP_SHMEM flag setting for shmem
swap entries.
While at it, use folio_nr_pages() to get the number of pages in the folio,
as a preparation.
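Illustratively, the shmem swapout path can then mark all of a large folio's
swap entries in one call (a sketch based on the description above):

    /* batch-set SWAP_MAP_SHMEM for every entry backing the folio */
    swap_shmem_alloc(folio->swap, folio_nr_pages(folio));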
Link: https://lkml.kernel.org/r/cover.1723434324.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/99f64115d04b285e009580eb177352c57119ffd0.1723434324.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Wed, 7 Aug 2024 21:58:58 +0000 (09:58 +1200)]
mm: rename instances of swap_info_struct to meaningful 'si'
Patch series "mm: batch free swaps for zap_pte_range()", v3.
Batch free swap slots for zap_pte_range(), making munmap three times
faster when the page table entries are filled with swap entries to
be freed. This is likely another advantage of using mTHP.
This patch (of 3):
"p" means "pointer to something", rename it to a more meaningful
identifier - "si". We also have a case with the name "sis", rename it to
"si" as well.
Link: https://lkml.kernel.org/r/20240807215859.57491-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240807215859.57491-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zhiguo Jiang <justinjiang@vivo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>