Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:15 +0000 (09:36 -0800)]
mm/mmap: write-lock VMAs in vma_prepare before modifying them
Write-lock all VMAs which might be affected by a merge, split, expand or
shrink operations. All these operations use vma_prepare() before making
the modifications, therefore it provides a centralized place to perform
VMA locking.
retract_page_tables takes i_mmap_rwsem before exclusive mmap_lock, which
is inverse to normal order. Deadlock is avoided by try-locking mmap_lock
and skipping on failure to obtain it. Locking the VMA should use the same
locking pattern to avoid this lock inversion.
Link: https://lkml.kernel.org/r/20230303213250.3555716-1-surenb@google.com Fixes: 44a83f2083bd ("mm/khugepaged: write-lock VMA while collapsing a huge page") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: <syzbot+8955a9646d1a48b8be92@syzkaller.appspotmail.com> Cc: Arjun Roy <arjunroy@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@android.com Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:14 +0000 (09:36 -0800)]
mm/khugepaged: write-lock VMA while collapsing a huge page
Protect VMA from concurrent page fault handler while collapsing a huge
page. Page fault handler needs a stable PMD to use PTL and relies on
per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
not be detected by a page fault handler without proper locking.
Before this patch, page tables can be walked under any one of the
mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either are
locked or don't exist. This patch adds a fourth lock under which page
tables can be traversed, and so khugepaged must also lock out that one.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:13 +0000 (09:36 -0800)]
mm/mmap: move vma_prepare before vma_adjust_trans_huge
vma_prepare() acquires all locks required before VMA modifications. Move
vma_prepare() before vma_adjust_trans_huge() so that VMA is locked before
any modification.
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:11 +0000 (09:36 -0800)]
mm: add per-VMA lock and helper functions to control it
Introduce per-VMA locking. The lock implementation relies on a per-vma
and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_start_read) requires the vma
(vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
If they match then there must be a vma exclusive lock held somewhere.
- read unlock - (implemented by vma_end_read) is a trivial vma->lock
unlock.
- write lock - (vma_start_write) requires the mmap_lock to be held
exclusively and the current mm counter is assigned to the vma counter.
This will allow multiple vmas to be locked under a single mmap_lock
write lock (e.g. during vma merging). The vma counter is modified
under exclusive vma lock.
- write unlock - (vma_end_write_all) is a batch release of all vma
locks held. It doesn't pair with a specific vma_start_write! It is
done before exclusive mmap_lock is released by incrementing mm
sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read lock, all
vma write locks are released as well (effectivelly same as write
unlock).
Suren Baghdasaryan [Mon, 27 Feb 2023 17:36:08 +0000 (09:36 -0800)]
mm: introduce CONFIG_PER_VMA_LOCK
Patch series "Per-VMA locks", v4.
LWN article describing the feature: https://lwn.net/Articles/906852/
Per-vma locks idea that was discussed during SPF [1] discussion at LSF/MM
last year [2], which concluded with suggestion that “a reader/writer
semaphore could be put into the VMA itself; that would have the effect of
using the VMA as a sort of range lock. There would still be contention at
the VMA level, but it would be an improvement.” This patchset implements
this suggested approach.
When handling page faults we lookup the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.
One notable way the implementation deviates from the proposal is the way
VMAs are read-locked. During some of mm updates, multiple VMAs need to be
locked until the end of the update (e.g. vma_merge, split_vma, etc).
Tracking all the locked VMAs, avoiding recursive locks, figuring out when
it's safe to unlock previously locked VMAs would make the code more
complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.
We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is absolutely necessary but it results in a big reduction of
code complexity.
Read-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. VMA is considered read-locked
when these sequence numbers are equal. To read-lock a VMA we set the
sequence number in vm_area_struct to be equal to the sequence number in
mm_struct. To unlock all VMAs we increment mm_struct's seq number. This
allows for an efficient way to track locked VMAs and to drop the locks on
all VMAs at the end of the update.
The patchset implements per-VMA locking only for anonymous pages which are
not in swap and avoids userfaultfs as their implementation is more
complex. Additional support for file-back page faults, swapped and user
pages can be added incrementally.
Performance benchmarks show similar although slightly smaller benefits as
with SPF patchset (~75% of SPF benefits). Still, with lower complexity
this approach might be more desirable.
Since RFC was posted in September 2022, two separate Google teams outside
of Android evaluated the patchset and confirmed positive results. Here
are the known usecases when per-VMA locks show benefits:
Android:
Apps with high number of threads (~100) launch times improve by up to 20%.
Each thread mmaps several areas upon startup (Stack and Thread-local
storage (TLS), thread signal stack, indirect ref table), which requires
taking mmap_lock in write mode. Page faults take mmap_lock in read mode.
During app launch, both thread creation and page faults establishing the
active workinget are happening in parallel and that causes lock contention
between mm writers and readers even if updates and page faults are
happening in different VMAs. Per-vma locks prevent this contention by
providing more granular lock.
Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrive, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section on its TLS, then register itself with the kernel as a UMCG worker
thread, and only after that it can be considered by the in-process
UMCG/Fiber scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.
At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a similar
pattern to Android app startup. Per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation is no longer a bottleneck.
TCP zerocopy receive:
From the point of view of TCP zerocopy receive, the per-vma lock patch is
massively beneficial.
In today's implementation, a process with N threads where N - 1 are
performing zerocopy receive and 1 thread is performing madvise() with the
write lock taken (e.g. needs to change vm_flags) will result in all N -1
receive threads blocking until the madvise is done. Conversely, on a busy
process receiving a lot of data, an madvise operation that does need to
take the mmap lock in write mode will need to wait for all of the receives
to be done - a lose:lose proposition. Per-VMA locking _removes_ by
definition this source of contention entirely.
There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock. On an RPC style synthetic workload with 4KB RPCs:
1a) The find+lock+unlock VMA path in the base case, without the
per-vma lock patchset, is about 0.7% of cycles as measured by perf.
1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
cycles overall - most of this is within the TCP read hotpath (a small
fraction is 'other' usage in the system).
2a) The find+lock+unlock VMA path, with the per-vma patchset and a
trivial patch written to take advantage of it in TCP, is about 0.4% of
cycles (down from 0.7% above)
2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is <
0.1% cycles and is out of the TCP read hotpath entirely (down from
0.5% before, the remaining usage is the 'other' usage in the system).
So, in addition to entirely removing an onerous source of contention,
it also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
(compared to overall cycles in perf) for the 'small' RPC scenario.
In https://lkml.kernel.org/r/87fsaqouyd.fsf_-_@stealth, Punit
demonstrated throughput improvements of as much as 188% from this
patchset.
This patch (of 25):
This configuration variable will be used to build the support for VMA
locking during page fault handling.
This is enabled on supported architectures with SMP and MMU set.
The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.
Charan Teja Kalla [Tue, 14 Feb 2023 12:51:50 +0000 (18:21 +0530)]
mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem
Currently fadvise(2) is supported only for the files that doesn't
associated with noop_backing_dev_info thus for the files, like shmem,
fadvise results into NOP. But then there is file_operations->fadvise()
that lets the file systems to implement their own fadvise implementation.
Use this support to implement some of the POSIX_FADV_XXX functionality for
shmem files.
This patch aims to implement POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
advices to shmem files which can be helpful for the clients who may want
to manage the shmem pages of the files that are created through
shmem_file_setup[_with_mnt](). One usecase is implemented on the
Snapdragon SoC's running Android where the graphics client is allocating
lot of shmem pages per process and pinning them. When this process is put
to background, the instantaneous reclaim is performed on those shmem pages
using the logic implemented downstream[3][4]. With this patch, the client
can now issue the fadvise calls on the shmem files that does the
instantaneous reclaim which can aid the use cases like mentioned above.
This usecase lead to ~2% reduction in average launch latencies of the apps
and 10% in total number of kills by the low memory killer running on
Android.
Some questions asked while reviewing this patch:
Q) Can the same thing be achieved with FD mapped to user and use madvise?
A) All drivers are not mapping all the shmem fd's to user space and
want to manage them with in the kernel. Ex: shmem memory can be mapped
to the other subsystems and they fill in the data and then give it to
other subsystem for further processing, where, the user mapping is not
at all required. A simple example, memory that is given for gpu
subsystem which can be filled directly and give to display subsystem.
And the respective drivers know well about when to keep that memory in
ram or swap based on may be a user activity.
Q) Should we add the documentation section in Manual pages?
A) The man[1] pages for the fadvise() whatever says is also applicable
for shmem files. so couldn't feel it correct to add specific to shmem
files separately.
Q) The proposed semantics of POSIX_FADV_DONTNEED is actually similar to
MADV_PAGEOUT and different from MADV_DONTNEED. This is a user facing
API and this difference will cause confusion?
A) man pages [2] says that "POSIX_FADV_DONTNEED attempts to free cached
pages associated with the specified region." This means on issuing this
FADV, it is expected to free the file cache pages. And it is
implementation defined If the dirty pages may be attempted to
writeback. And the unwritten dirty pages will not be freed. So,
FADV_DONTNEED also covers the semantics of MADV_PAGEOUT for file pages
and there is no purpose of PAGEOUT for file pages.
Charan Teja Kalla [Tue, 14 Feb 2023 12:51:49 +0000 (18:21 +0530)]
mm: fadvise: move 'endbyte' calculations to helper function
Patch series "mm: shmem: support POSIX_FADV_[WILL|DONT]NEED for shmem
files", v7.
This patchset aims to implement POSIX_FADV_WILLNEED and
POSIX_FADV_DONTNEED advices to shmem files which can be helpful for the
drivers who may want to manage the pages of shmem files on their own,
like, that are created through shmem_file_setup[_with_mnt]().
This patch (of 2):
Move the 'endbyte' calculations that determines last byte that fadvise can
to a helper function. This is a preparatory change made for
shmem_fadvise() functionality in the next patch. No functional changes in
this patch.
Keith Busch [Thu, 26 Jan 2023 21:51:25 +0000 (13:51 -0800)]
dmapool: create/destroy cleanup
Set the 'empty' bool directly from the result of the function that
determines its value instead of adding additional logic.
Link: https://lkml.kernel.org/r/20230126215125.4069751-13-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:24 +0000 (13:51 -0800)]
dmapool: link blocks across pages
The allocated dmapool pages are never freed for the lifetime of the pool.
There is no need for the two level list+stack lookup for finding a free
block since nothing is ever removed from the list. Just use a simple
stack, reducing time complexity to constant.
The implementation inserts the stack linking elements and the dma handle
of the block within itself when freed. This means the smallest possible
dmapool block is increased to at most 16 bytes to accommodate these
fields, but there are no exisiting users requesting a dma pool smaller
than that anyway.
Removing the list has a significant change in performance. Using the
kernel's micro-benchmarking self test:
The module test allocates quite a few blocks that may not accurately
represent how these pools are used in real life. For a more marco level
benchmark, running fio high-depth + high-batched on nvme, this patch shows
submission and completion latency reduced by ~100usec each, 1% IOPs
improvement, and perf record's time spent in dma_pool_alloc/free were
reduced by half.
Keith Busch [Thu, 26 Jan 2023 21:51:23 +0000 (13:51 -0800)]
dmapool: don't memset on free twice
If debug is enabled, dmapool will poison the range, so no need to clear it
to 0 immediately before writing over it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-11-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:22 +0000 (13:51 -0800)]
dmapool: simplify freeing
The actions for busy and not busy are mostly the same, so combine these
and remove the unnecessary function. Also, the pool is about to be freed
so there's no need to poison the page data since we only check for poison
on alloc, which can't be done on a freed pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-10-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:21 +0000 (13:51 -0800)]
dmapool: consolidate page initialization
Various fields of the dma pool are set in different places. Move it all
to one function.
Link: https://lkml.kernel.org/r/20230126215125.4069751-9-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:20 +0000 (13:51 -0800)]
dmapool: rearrange page alloc failure handling
Handle the error in a condition so the good path can be in the normal
flow.
Link: https://lkml.kernel.org/r/20230126215125.4069751-8-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:19 +0000 (13:51 -0800)]
dmapool: move debug code to own functions
Clean up the normal path by moving the debug code outside it.
Link: https://lkml.kernel.org/r/20230126215125.4069751-7-kbusch@meta.com Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:18 +0000 (13:51 -0800)]
dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
Avoid double-memset of the same allocated memory in dma_pool_alloc() when
both DMAPOOL_DEBUG is enabled and init_on_alloc=1.
Link: https://lkml.kernel.org/r/20230126215125.4069751-6-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:17 +0000 (13:51 -0800)]
dmapool: cleanup integer types
To represent the size of a single allocation, dmapool currently uses
'unsigned int' in some places and 'size_t' in other places. Standardize
on 'unsigned int' to reduce overhead, but use 'size_t' when counting all
the blocks in the entire pool.
Link: https://lkml.kernel.org/r/20230126215125.4069751-5-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:16 +0000 (13:51 -0800)]
dmapool: use sysfs_emit() instead of scnprintf()
Use sysfs_emit instead of scnprintf, snprintf or sprintf.
Link: https://lkml.kernel.org/r/20230126215125.4069751-4-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tony Battersby [Thu, 26 Jan 2023 21:51:15 +0000 (13:51 -0800)]
dmapool: remove checks for dev == NULL
dmapool originally tried to support pools without a device because
dma_alloc_coherent() supports allocations without a device. But nobody
ended up using dma pools without a device, and trying to do so will result
in an oops. So remove the checks for pool->dev == NULL since they are
unneeded bloat.
[kbusch@kernel.org: add check for null dev on create] Link: https://lkml.kernel.org/r/20230126215125.4069751-3-kbusch@meta.com Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Keith Busch [Thu, 26 Jan 2023 21:51:14 +0000 (13:51 -0800)]
dmapool: add alloc/free performance test
Patch series "dmapool enhancements", v4.
Time spent in dma_pool alloc/free increases linearly with the number of
pages backing the pool. We can reduce this to constant time with minor
changes to how free pages are tracked.
This patch (of 12):
Provide a module that allocates and frees many blocks of various sizes and
report how long it takes. This is intended to provide a consistent way to
measure how changes to the dma_pool_alloc/free routines affect timing.
Peter Xu [Wed, 5 Apr 2023 15:51:20 +0000 (11:51 -0400)]
mm/khugepaged: check again on anon uffd-wp during isolation
Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd
round done in __collapse_huge_page_isolate() after
hpage_collapse_scan_pmd(), during which all the locks will be released
temporarily. It means the pgtable can change during this phase before 2nd
round starts.
It's logically possible some ptes got wr-protected during this phase, and
we can errornously collapse a thp without noticing some ptes are
wr-protected by userfault. e1e267c7928f wanted to avoid it but it only
did that for the 1st phase, not the 2nd phase.
Since __collapse_huge_page_isolate() happens after a round of small page
swapins, we don't need to worry on any !present ptes - if it existed
khugepaged will already bail out. So we only need to check present ptes
with uffd-wp bit set there.
This is something I found only but never had a reproducer, I thought it
was one caused a bug in Muhammad's recent pagemap new ioctl work, but it
turns out it's not the cause of that but an userspace bug. However this
seems to still be a real bug even with a very small race window, still
worth to have it fixed and copy stable.
Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Wed, 5 Apr 2023 16:02:35 +0000 (18:02 +0200)]
mm/userfaultfd: fix uffd-wp handling for THP migration entries
Looks like what we fixed for hugetlb in commit 44f86392bdd1 ("mm/hugetlb:
fix uffd-wp handling for migration entries in
hugetlb_change_protection()") similarly applies to THP.
Setting/clearing uffd-wp on THP migration entries is not implemented
properly. Further, while removing migration PMDs considers the uffd-wp
bit, inserting migration PMDs does not consider the uffd-wp bit.
We have to set/clear independently of the migration entry type in
change_huge_pmd() and properly copy the uffd-wp bit in
set_pmd_migration_entry().
Verified using a simple reproducer that triggers migration of a THP, that
the set_pmd_migration_entry() no longer loses the uffd-wp bit.
Link: https://lkml.kernel.org/r/20230405160236.587705-2-david@redhat.com Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: swap: fix performance regression on sparsetruncate-tiny
The ->percpu_pvec_drained was originally introduced by commit d9ed0d08b6c6
("mm: only drain per-cpu pagevecs once per pagevec usage") to drain
per-cpu pagevecs only once per pagevec usage. But after converting the
swap code to be more folio-based, the commit c2bc16817aa0 ("mm/swap: add
folio_batch_move_lru()") breaks this logic, which would cause
->percpu_pvec_drained to be reset to false, that means per-cpu pagevecs
will be drained multiple times per pagevec usage.
In theory, there should be no functional changes when converting code to
be more folio-based. We should call folio_batch_reinit() in
folio_batch_move_lru() instead of folio_batch_init(). And to verify that
we still need ->percpu_pvec_drained, I ran mmtests/sparsetruncate-tiny and
got the following data:
baseline with
baseline/ patch/
Min Time 326.00 ( 0.00%) 328.00 ( -0.61%)
1st-qrtle Time 334.00 ( 0.00%) 336.00 ( -0.60%)
2nd-qrtle Time 338.00 ( 0.00%) 341.00 ( -0.89%)
3rd-qrtle Time 343.00 ( 0.00%) 347.00 ( -1.17%)
Max-1 Time 326.00 ( 0.00%) 328.00 ( -0.61%)
Max-5 Time 327.00 ( 0.00%) 330.00 ( -0.92%)
Max-10 Time 328.00 ( 0.00%) 331.00 ( -0.91%)
Max-90 Time 350.00 ( 0.00%) 357.00 ( -2.00%)
Max-95 Time 395.00 ( 0.00%) 390.00 ( 1.27%)
Max-99 Time 508.00 ( 0.00%) 434.00 ( 14.57%)
Max Time 547.00 ( 0.00%) 476.00 ( 12.98%)
Amean Time 344.61 ( 0.00%) 345.56 * -0.28%*
Stddev Time 30.34 ( 0.00%) 19.51 ( 35.69%)
CoeffVar Time 8.81 ( 0.00%) 5.65 ( 35.87%)
BAmean-99 Time 342.38 ( 0.00%) 344.27 ( -0.55%)
BAmean-95 Time 338.58 ( 0.00%) 341.87 ( -0.97%)
BAmean-90 Time 336.89 ( 0.00%) 340.26 ( -1.00%)
BAmean-75 Time 335.18 ( 0.00%) 338.40 ( -0.96%)
BAmean-50 Time 332.54 ( 0.00%) 335.42 ( -0.87%)
BAmean-25 Time 329.30 ( 0.00%) 332.00 ( -0.82%)
>From the above it can be seen that we get similar data to when
->percpu_pvec_drained was introduced, so we still need it. Let's call
folio_batch_reinit() in folio_batch_move_lru() to restore the original
logic.
When CPU1 is about to execute the instruction pointed to by IP, the
ma_data_end() executed by CPU2 may return the wrong end position, which
will cause the value loaded by mtree_load() to be wrong.
An example of triggering the bug:
Add mdelay(100) between rcu_assign_pointer() and ma_set_meta() in
mas_root_expand().
static DEFINE_MTREE(tree);
int work(void *p) {
unsigned long val;
for (int i = 0 ; i< 30; ++i) {
val = (unsigned long)mtree_load(&tree, 8);
mdelay(5);
pr_info("%lu",val);
}
return 0;
}
In RCU mode, mtree_load() should always return the value before or after
the data structure is modified, and in this example mtree_load(&tree, 8)
may return 56789 which is not expected, it should always return NULL. Fix
it by put ma_set_meta() before rcu_assign_pointer().
Link: https://lkml.kernel.org/r/20230314124203.91572-4-zhangpeng.00@bytedance.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peng Zhang [Tue, 14 Mar 2023 12:42:01 +0000 (20:42 +0800)]
maple_tree: fix get wrong data_end in mtree_lookup_walk()
if (likely(offset > end))
max = pivots[offset];
The above code should be changed to if (likely(offset < end)), which is
correct. This affects the correctness of ma_data_end(). Now it seems
that the final result will not be wrong, but it is best to change it.
This patch does not change the code as above, because it simplifies the
code by the way.
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Petr Mladek <pmladek@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Patrick Daly <quic_pdaly@quicinc.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rongwei Wang [Tue, 4 Apr 2023 15:47:16 +0000 (23:47 +0800)]
mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
The si->lock must be held when deleting the si from the available list.
Otherwise, another thread can re-add the si to the available list, which
can lead to memory corruption. The only place we have found where this
happens is in the swapoff path. This case can be described as below:
core 0 core 1
swapoff
del_from_avail_list(si) waiting
try lock si->lock acquire swap_avail_lock
and re-add si into
swap_avail_head
acquire si->lock but missing si already being added again, and continuing
to clear SWP_WRITEOK, etc.
It can be easily found that a massive warning messages can be triggered
inside get_swap_pages() by some special cases, for example, we call
madvise(MADV_PAGEOUT) on blocks of touched memory concurrently, meanwhile,
run much swapon-swapoff operations (e.g. stress-ng-swap).
However, in the worst case, panic can be caused by the above scene. In
swapoff(), the memory used by si could be kept in swap_info[] after
turning off a swap. This means memory corruption will not be caused
immediately until allocated and reset for a new swap in the swapon path.
A panic message caused: (with CONFIG_PLIST_DEBUG enabled)
Now, si->lock locked before calling 'del_from_avail_list()' to make sure
other thread see the si had been deleted and SWP_WRITEOK cleared together,
will not reinsert again.
This problem exists in versions after stable 5.10.y.
Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com Fixes: a2468cc9bfdff ("swap: choose swap device according to numa node") Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Aaron Lu <aaron.lu@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryusuke Konishi [Thu, 30 Mar 2023 20:55:15 +0000 (05:55 +0900)]
nilfs2: fix sysfs interface lifetime
The current nilfs2 sysfs support has issues with the timing of creation
and deletion of sysfs entries, potentially leading to null pointer
dereferences, use-after-free, and lockdep warnings.
Some of the sysfs attributes for nilfs2 per-filesystem instance refer to
metadata file "cpfile", "sufile", or "dat", but
nilfs_sysfs_create_device_group that creates those attributes is executed
before the inodes for these metadata files are loaded, and
nilfs_sysfs_delete_device_group which deletes these sysfs entries is
called after releasing their metadata file inodes.
Therefore, access to some of these sysfs attributes may occur outside of
the lifetime of these metadata files, resulting in inode NULL pointer
dereferences or use-after-free.
In addition, the call to nilfs_sysfs_create_device_group() is made during
the locking period of the semaphore "ns_sem" of nilfs object, so the
shrinker call caused by the memory allocation for the sysfs entries, may
derive lock dependencies "ns_sem" -> (shrinker) -> "locks acquired in
nilfs_evict_inode()".
Since nilfs2 may acquire "ns_sem" deep in the call stack holding other
locks via its error handler __nilfs_error(), this causes lockdep to report
circular locking. This is a false positive and no circular locking
actually occurs as no inodes exist yet when
nilfs_sysfs_create_device_group() is called. Fortunately, the lockdep
warnings can be resolved by simply moving the call to
nilfs_sysfs_create_device_group() out of "ns_sem".
This fixes these sysfs issues by revising where the device's sysfs
interface is created/deleted and keeping its lifetime within the lifetime
of the metadata files above.
Alistair Popple [Thu, 30 Mar 2023 01:25:19 +0000 (12:25 +1100)]
mm: take a page reference when removing device exclusive entries
Device exclusive page table entries are used to prevent CPU access to a
page whilst it is being accessed from a device. Typically this is used to
implement atomic operations when the underlying bus does not support
atomic access. When a CPU thread encounters a device exclusive entry it
locks the page and restores the original entry after calling mmu notifiers
to signal drivers that exclusive access is no longer available.
The device exclusive entry holds a reference to the page making it safe to
access the struct page whilst the entry is present. However the fault
handling code does not hold the PTL when taking the page lock. This means
if there are multiple threads faulting concurrently on the device
exclusive entry one will remove the entry whilst others will wait on the
page lock without holding a reference.
This can lead to threads locking or waiting on a folio with a zero
refcount. Whilst mmap_lock prevents the pages getting freed via munmap()
they may still be freed by a migration. This leads to warnings such as
PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount
drops to zero.
Fix this by trying to take a reference on the folio before locking it.
The code already checks the PTE under the PTL and aborts if the entry is
no longer there. It is also possible the folio has been unmapped, freed
and re-allocated allowing a reference to be taken on an unrelated folio.
This case is also detected by the PTE check and the folio is unlocked
without further changes.
Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mathieu Desnoyers [Thu, 30 Mar 2023 13:38:22 +0000 (09:38 -0400)]
mm: fix memory leak on mm_init error handling
commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
introduces a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.
Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths. Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.
Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Himadri Pandya <himadrispandya@gmail.com> Cc: Ivan Orlov <ivan.orlov0322@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The xas_store() call during page cache scanning can potentially translate
'xas' into the error state (with the reproducer provided by the syzkaller
the error code is -ENOMEM). However, there are no further checks after
the 'xas_store', and the next call of 'xas_next' at the start of the
scanning cycle doesn't increase the xa_index, and the issue occurs.
This patch will add the xarray state error checking after the xas_store()
and the corresponding result error code.
Tested via syzbot.
Link: https://lkml.kernel.org/r/20230329145330.23191-1-ivan.orlov0322@gmail.com Link: https://syzkaller.appspot.com/bug?id=7d6bb3760e026ece7524500fe44fb024a0e959fc Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Reported-by: syzbot+9578faa5475acb35fa50@syzkaller.appspotmail.com Tested-by: Zach O'Keefe <zokeefe@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Himadri Pandya <himadrispandya@gmail.com> Cc: Ivan Orlov <ivan.orlov0322@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Song Liu <songliubraving@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This failure really confuse me as there're still lots of available pages.
Finally I figured out it was caused by a fatal signal. When a process is
allocating memory via vm_area_alloc_pages(), it will break directly even
if it hasn't allocated the requested pages when it receives a fatal
signal. In that case, we shouldn't show this warn_alloc, as it is
useless. We only need to show this warning when there're really no enough
pages.
Link: https://lkml.kernel.org/r/20230330162625.13604-1-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Sun, 26 Mar 2023 15:21:46 +0000 (00:21 +0900)]
nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field
nilfs_btree_assign_p() and nilfs_direct_assign_p() are not initializing
"struct nilfs_binfo_dat"->bi_pad field, causing uninit-value reports when
being passed to CRC function.
Ryusuke Konishi [Mon, 27 Mar 2023 17:53:18 +0000 (02:53 +0900)]
nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
The finalization of nilfs_segctor_thread() can race with
nilfs_segctor_kill_thread() which terminates that thread, potentially
causing a use-after-free BUG as KASAN detected.
At the end of nilfs_segctor_thread(), it assigns NULL to "sc_task" member
of "struct nilfs_sc_info" to indicate the thread has finished, and then
notifies nilfs_segctor_kill_thread() of this using waitqueue
"sc_wait_task" on the struct nilfs_sc_info.
However, here, immediately after the NULL assignment to "sc_task", it is
possible that nilfs_segctor_kill_thread() will detect it and return to
continue the deallocation, freeing the nilfs_sc_info structure before the
thread does the notification.
This fixes the issue by protecting the NULL assignment to "sc_task" and
its notification, with spinlock "sc_state_lock" of the struct
nilfs_sc_info. Since nilfs_segctor_kill_thread() does a final check to
see if "sc_task" is NULL with "sc_state_lock" locked, this can eliminate
the race.
Link: https://lkml.kernel.org/r/1679653680-2-1-git-send-email-ruansy.fnst@fujitsu.com Fixes: f80e1668888f3 ("fsdax: invalidate pages when CoW") Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Tue, 21 Mar 2023 19:18:40 +0000 (15:18 -0400)]
mm/hugetlb: fix uffd wr-protection for CoW optimization path
This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
writable even with uffd-wp bit set. It only happens with hugetlb private
mappings, when someone firstly wr-protects a missing pte (which will
install a pte marker), then a write to the same page without any prior
access to the page.
Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
reaching hugetlb_wp() to avoid taking more locks that userfault won't
need. However there's one CoW optimization path that can trigger
hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap.
This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit
is detected. The new path will only trigger in the CoW optimization path
because generic hugetlb_fault() (e.g. when a present pte was
wr-protected) will resolve the uffd-wp bit already. Also make sure
anonymous UNSHARE won't be affected and can still be resolved, IOW only
skip CoW not CoR.
This patch will be needed for v5.19+ hence copy stable.
Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com Fixes: 166f3ecc0daf ("mm/hugetlb: hook page faults for uffd write protection") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 27 Feb 2023 17:36:07 +0000 (09:36 -0800)]
mm: enable maple tree RCU mode by default
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space. Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
Liam R. Howlett [Mon, 27 Feb 2023 17:36:06 +0000 (09:36 -0800)]
maple_tree: add RCU lock checking to rcu callback functions
Dereferencing RCU objects within the RCU callback without the RCU check
has caused lockdep to complain. Fix the RCU dereferencing by using the
RCU callback lock to ensure the operation is safe.
Also stop creating a new lock to use for dereferencing during destruction
of the tree or subtree. Instead, pass through a pointer to the tree that
has the lock that is held for RCU dereferencing checking. It also does
not make sense to use the maple state in the freeing scenario as the tree
walk is a special case where the tree no longer has the normal encodings
and parent pointers.
Link: https://lkml.kernel.org/r/20230227173632.3292573-8-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 27 Feb 2023 17:36:05 +0000 (09:36 -0800)]
maple_tree: add smp_rmb() to dead node detection
Add an smp_rmb() before reading the parent pointer to ensure that anything
read from the node prior to the parent pointer hasn't been reordered ahead
of this check.
The is necessary for RCU mode.
Link: https://lkml.kernel.org/r/20230227173632.3292573-7-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Mon, 27 Feb 2023 17:36:04 +0000 (09:36 -0800)]
maple_tree: fix write memory barrier of nodes once dead for RCU mode
During the development of the maple tree, the strategy of freeing multiple
nodes changed and, in the process, the pivots were reused to store
pointers to dead nodes. To ensure the readers see accurate pivots, the
writers need to mark the nodes as dead and call smp_wmb() to ensure any
readers can identify the node as dead before using the pivot values.
There were two places where the old method of marking the node as dead
without smp_wmb() were being used, which resulted in RCU readers seeing
the wrong pivot value before seeing the node was dead. Fix this race
condition by using mte_set_node_dead() which has the smp_wmb() call to
ensure the race is closed.
Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being freed
are marked as dead to ensure there are no other call paths besides the two
updated paths.
This is necessary for the RCU mode of the maple tree.
Link: https://lkml.kernel.org/r/20230227173632.3292573-6-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Mon, 27 Feb 2023 17:36:03 +0000 (09:36 -0800)]
maple_tree: remove extra smp_wmb() from mas_dead_leaves()
The call to mte_set_dead_node() before the smp_wmb() already calls
smp_wmb() so this is not needed. This is an optimization for the RCU mode
of the maple tree.
Link: https://lkml.kernel.org/r/20230227173632.3292573-5-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Mon, 27 Feb 2023 17:36:02 +0000 (09:36 -0800)]
maple_tree: fix freeing of nodes in rcu mode
The walk to destroy the nodes was not always setting the node type and
would result in a destroy method potentially using the values as nodes.
Avoid this by setting the correct node types. This is necessary for the
RCU mode of the maple tree.
Link: https://lkml.kernel.org/r/20230227173632.3292573-4-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Mon, 27 Feb 2023 17:36:01 +0000 (09:36 -0800)]
maple_tree: detect dead nodes in mas_start()
When initially starting a search, the root node may already be in the
process of being replaced in RCU mode. Detect and restart the walk if
this is the case. This is necessary for RCU mode of the maple tree.
Link: https://lkml.kernel.org/r/20230227173632.3292573-3-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam Howlett [Mon, 27 Feb 2023 17:36:00 +0000 (09:36 -0800)]
maple_tree: be more cautious about dead nodes
Patch series "Fix VMA tree modification under mmap read lock".
Syzbot reported a BUG_ON in mm/mmap.c which was found to be caused by an
inconsistency between threads walking the VMA maple tree. The
inconsistency is caused by the page fault handler modifying the maple tree
while holding the mmap_lock for read.
This only happens for stack VMAs. We had thought this was safe as it only
modifies a single pivot in the tree. Unfortunately, syzbot constructed a
test case where the stack had no guard page and grew the stack to abut the
next VMA. This causes us to delete the NULL entry between the two VMAs
and rewrite the node.
We considered several options for fixing this, including dropping the
mmap_lock, then reacquiring it for write; and relaxing the definition of
the tree to permit a zero-length NULL entry in the node. We decided the
best option was to backport some of the RCU patches from -next, which
solve the problem by allocating a new node and RCU-freeing the old node.
Since the problem exists in 6.1, we preferred a solution which is similar
to the one we intended to merge next merge window.
These patches have been in -next since next-20230301, and have received
intensive testing in Android as part of the RCU page fault patchset. They
were also sent as part of the "Per-VMA locks" v4 patch series. Patches 1
to 7 are bug fixes for RCU mode of the tree and patch 8 enables RCU mode
for the tree.
Performance v6.3-rc3 vs patched v6.3-rc3: Running these changes through
mmtests showed there was a 15-20% performance decrease in
will-it-scale/brk1-processes. This tests creating and inserting a single
VMA repeatedly through the brk interface and isn't representative of any
real world applications.
This patch (of 8):
ma_pivots() and ma_data_end() may be called with a dead node. Ensure to
that the node isn't dead before using the returned values.
This is necessary for RCU mode of the maple tree.
Link: https://lkml.kernel.org/r/20230327185532.2354250-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230227173632.3292573-1-surenb@google.com Link: https://lkml.kernel.org/r/20230227173632.3292573-2-surenb@google.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chriscli@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: freak07 <michalechner92@googlemail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ondrej Mosnacek [Fri, 17 Feb 2023 16:21:54 +0000 (17:21 +0100)]
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Xu [Wed, 15 Mar 2023 17:16:42 +0000 (13:16 -0400)]
mm/thp: rename TRANSPARENT_HUGEPAGE_NEVER_DAX to _UNSUPPORTED
TRANSPARENT_HUGEPAGE_NEVER_DAX has nothing to do with DAX. It's set when
has_transparent_hugepage() returns false, checked in hugepage_vma_check()
and will disable THP completely if false. Rename it to
TRANSPARENT_HUGEPAGE_UNSUPPORTED to reflect its real purpose.
Kefeng Wang [Mon, 13 Mar 2023 05:39:29 +0000 (13:39 +0800)]
mm: memory-failure: directly use IS_ENABLED(CONFIG_HWPOISON_INJECT)
It's more clear and simple to just use IS_ENABLED(CONFIG_HWPOISON_INJECT)
to check whether or not to enable HWPoison injector module instead of
CONFIG_HWPOISON_INJECT/CONFIG_HWPOISON_INJECT_MODULE.
Link: https://lkml.kernel.org/r/20230313053929.84607-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Mon, 13 Mar 2023 11:28:18 +0000 (19:28 +0800)]
mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()
Currently, the synchronize_shrinkers() is only used by TTM pool. It only
requires that no shrinkers run in parallel, and doesn't care about
registering and unregistering of shrinkers.
Since slab shrink is protected by SRCU, synchronize_srcu() is sufficient
to ensure that no shrinker is running in parallel. So the shrinker_rwsem
in synchronize_shrinkers() is no longer needed, just remove it.
Link: https://lkml.kernel.org/r/20230313112819.38938-8-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <tkhai@ya.ru> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christian König <christian.koenig@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sultan Alsawaf <sultan@kerneltoast.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Mon, 13 Mar 2023 11:28:17 +0000 (19:28 +0800)]
mm: vmscan: hold write lock to reparent shrinker nr_deferred
For now, reparent_shrinker_deferred() is the only holder of read lock of
shrinker_rwsem. And it already holds the global cgroup_mutex, so it will
not be called in parallel.
Therefore, in order to convert shrinker_rwsem to shrinker_mutex later,
here we change to hold the write lock of shrinker_rwsem to reparent.
Link: https://lkml.kernel.org/r/20230313112819.38938-7-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <tkhai@ya.ru> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christian König <christian.koenig@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sultan Alsawaf <sultan@kerneltoast.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kirill Tkhai [Mon, 13 Mar 2023 11:28:15 +0000 (19:28 +0800)]
mm: vmscan: add shrinker_srcu_generation
After we make slab shrink lockless with SRCU, the longest sleep
unregister_shrinker() will be a sleep waiting for all do_shrink_slab()
calls.
To avoid long unbreakable action in the unregister_shrinker(), add
shrinker_srcu_generation to restore a check similar to the
rwsem_is_contendent() check that we had before.
And for memcg slab shrink, we unlock SRCU and continue iterations from the
next shrinker id.
Link: https://lkml.kernel.org/r/20230313112819.38938-5-zhengqi.arch@bytedance.com Signed-off-by: Kirill Tkhai <tkhai@ya.ru> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christian König <christian.koenig@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sultan Alsawaf <sultan@kerneltoast.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Qi Zheng [Mon, 13 Mar 2023 11:28:13 +0000 (19:28 +0800)]
mm: vmscan: make global slab shrink lockless
The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
which protects most operations such as slab shrink, registration and
unregistration of shrinkers, etc. This can easily cause problems in the
following cases.
1) When the memory pressure is high and there are many
filesystems mounted or unmounted at the same time,
slab shrink will be affected (down_read_trylock()
failed).
Such as the real workload mentioned by Kirill Tkhai:
```
One of the real workloads from my experience is start
of an overcommitted node containing many starting
containers after node crash (or many resuming containers
after reboot for kernel update). In these cases memory
pressure is huge, and the node goes round in long reclaim.
```
2) If a shrinker is blocked (such as the case mentioned
in [1]) and a writer comes in (such as mount a fs),
then this writer will be blocked and cause all
subsequent shrinker-related operations to be blocked.
Even if there is no competitor when shrinking slab, there may still be a
problem. If we have a long shrinker list and we do not reclaim enough
memory with each shrinker, then the down_read_trylock() may be called with
high frequency. Because of the poor multicore scalability of atomic
operations, this can lead to a significant drop in IPC (instructions per
cycle).
So many times in history ([2],[3],[4],[5]), some people wanted to replace
shrinker_rwsem trylock with SRCU in the slab shrink, but all these patches
were abandoned because SRCU was not unconditionally enabled.
But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"), the SRCU
is unconditionally enabled. So it's time to use SRCU to protect readers
who previously held shrinker_rwsem.
This commit uses SRCU to make global slab shrink lockless,
the memcg slab shrink is handled in the subsequent patch.
Link: https://lkml.kernel.org/r/20230313112819.38938-3-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <tkhai@ya.ru> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christian König <christian.koenig@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We can see that down_read_trylock() of shrinker_rwsem is being called with
high frequency at that time. Because of the poor multicore scalability of
atomic operations, this can lead to a significant drop in IPC
(instructions per cycle).
And more, the shrinker_rwsem is a global read-write lock in shrinkers
subsystem, which protects most operations such as slab shrink,
registration and unregistration of shrinkers, etc. This can easily cause
problems in the following cases.
1) When the memory pressure is high and there are many filesystems
mounted or unmounted at the same time, slab shrink will be affected
(down_read_trylock() failed).
Such as the real workload mentioned by Kirill Tkhai:
```
One of the real workloads from my experience is start of an
overcommitted node containing many starting containers after node crash
(or many resuming containers after reboot for kernel update). In these
cases memory pressure is huge, and the node goes round in long reclaim.
```
2) If a shrinker is blocked (such as the case mentioned in [1]) and a
writer comes in (such as mount a fs), then this writer will be blocked
and cause all subsequent shrinker-related operations to be blocked.
All the above cases can be solved by replacing the shrinker_rwsem trylocks
with SRCU.
2. Survey
=========
Before doing the code implementation, I found that there were many similar
submissions in the community:
a. Davidlohr Bueso submitted a patch in 2015.
Subject: [PATCH -next v2] mm: srcu-ify shrinkers Link: https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
Result: It was finally merged into the linux-next branch,
but failed on arm allnoconfig (without CONFIG_SRCU)
b. Tetsuo Handa submitted a patchset in 2017.
Subject: [PATCH 1/2] mm,vmscan: Kill global shrinker lock. Link: https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
Result: Finally chose to use the current simple way (break
when rwsem_is_contended()). And Christoph Hellwig suggested to
using SRCU, but SRCU was not unconditionally enabled at the
time.
c. Kirill Tkhai submitted a patchset in 2018.
Subject: [PATCH RFC 00/10] Introduce lockless shrink_slab() Link: https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
Result: At that time, SRCU was not unconditionally enabled,
and there were some objections to enabling SRCU. Later,
because Kirill's focus was moved to other things, this patchset
was not continued to be updated.
We can find that almost all these historical commits were abandoned
because SRCU was not unconditionally enabled. But now SRCU has been
unconditionally enable by Paul E. McKenney in 2023 [2], so it's time to
replace shrinker_rwsem trylocks with SRCU.
Lorenzo Stoakes [Mon, 13 Mar 2023 12:27:14 +0000 (12:27 +0000)]
mm: prefer xxx_page() alloc/free functions for order-0 pages
Update instances of alloc_pages(..., 0), __get_free_pages(..., 0) and
__free_pages(..., 0) to use alloc_page(), __get_free_page() and
__free_page() respectively in core code.
Link: https://lkml.kernel.org/r/50c48ca4789f1da2a65795f2346f5ae3eff7d665.1678710232.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Collingbourne [Fri, 10 Mar 2023 04:29:14 +0000 (20:29 -0800)]
kasan: remove PG_skip_kasan_poison flag
Code inspection reveals that PG_skip_kasan_poison is redundant with
kasantag, because the former is intended to be set iff the latter is the
match-all tag. It can also be observed that it's basically pointless to
poison pages which have kasantag=0, because any pages with this tag would
have been pointed to by pointers with match-all tags, so poisoning the
pages would have little to no effect in terms of bug detection.
Therefore, change the condition in should_skip_kasan_poison() to check
kasantag instead, and remove PG_skip_kasan_poison and associated flags.
Sebastian Andrzej Siewior [Fri, 10 Mar 2023 16:29:05 +0000 (17:29 +0100)]
io-mapping: don't disable preempt on RT in io_mapping_map_atomic_wc().
io_mapping_map_atomic_wc() disables preemption and pagefaults for
historical reasons. The conversion to io_mapping_map_local_wc(), which
only disables migration, cannot be done wholesale because quite some call
sites need to be updated to accommodate with the changed semantics.
On PREEMPT_RT enabled kernels the io_mapping_map_atomic_wc() semantics are
problematic due to the implicit disabling of preemption which makes it
impossible to acquire 'sleeping' spinlocks within the mapped atomic
sections.
PREEMPT_RT replaces the preempt_disable() with a migrate_disable() for
more than a decade. It could be argued that this is a justification to do
this unconditionally, but PREEMPT_RT covers only a limited number of
architectures and it disables some functionality which limits the coverage
further.
Limit the replacement to PREEMPT_RT for now. This is also done
kmap_atomic().
Luis Chamberlain [Thu, 9 Mar 2023 23:05:45 +0000 (15:05 -0800)]
shmem: add support to ignore swap
In doing experimentations with shmem having the option to avoid swap
becomes a useful mechanism. One of the *raves* about brd over shmem is
you can avoid swap, but that's not really a good reason to use brd if we
can instead use shmem. Using brd has its own good reasons to exist, but
just because "tmpfs" doesn't let you do that is not a great reason to
avoid it if we can easily add support for it.
I don't add support for reconfiguring incompatible options, but if we
really wanted to we can add support for that.
To avoid swap we use mapping_set_unevictable() upon inode creation, and
put a WARN_ON_ONCE() stop-gap on writepages() for reclaim.
Link: https://lkml.kernel.org/r/20230309230545.2930737-7-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Luis Chamberlain [Thu, 9 Mar 2023 23:05:43 +0000 (15:05 -0800)]
shmem: skip page split if we're not reclaiming
In theory when info->flags & VM_LOCKED we should not be getting
shem_writepage() called so we should be verifying this with a
WARN_ON_ONCE(). Since we should not be swapping then best to ensure we
also don't do the folio split earlier too. So just move the check early
to avoid folio splits in case its a dubious call.
We also have a similar early bail when !total_swap_pages so just move that
earlier to avoid the possible folio split in the same situation.
Link: https://lkml.kernel.org/r/20230309230545.2930737-5-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Luis Chamberlain [Thu, 9 Mar 2023 23:05:42 +0000 (15:05 -0800)]
shmem: move reclaim check early on writepages()
i915_gem requires huge folios to be split when swapping. However we have
check for usage of writepages() to ensure it used only for swap purposes
later. Avoid the splits if we're not being called for reclaim, even if
they should in theory not happen.
This makes the conditions easier to follow on shem_writepage().
Link: https://lkml.kernel.org/r/20230309230545.2930737-4-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Luis Chamberlain [Thu, 9 Mar 2023 23:05:41 +0000 (15:05 -0800)]
shmem: set shmem_writepage() variables early
shmem_writepage() sets up variables typically used *after* a possible huge
page split. However even if that does happen the address space mapping
should not change, and the inode does not change either. So it should be
safe to set that from the very beginning.
This commit makes no functional changes.
Link: https://lkml.kernel.org/r/20230309230545.2930737-3-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Luis Chamberlain [Thu, 9 Mar 2023 23:05:40 +0000 (15:05 -0800)]
shmem: remove check for folio lock on writepage()
Patch series "tmpfs: add the option to disable swap", v2.
I'm doing this work as part of future experimentation with tmpfs and the
page cache, but given a common complaint found about tmpfs is the
innability to work without the page cache I figured this might be useful
to others. It turns out it is -- at least Christian Brauner indicates
systemd uses ramfs for a few use-cases because they don't want to use swap
and so having this option would let them move over to using tmpfs for
those small use cases, see systemd-creds(1).
To see if you hit swap:
mkswap /dev/nvme2n1
swapon /dev/nvme2n1
free -h
With swap - what we see today
=============================
mount -t tmpfs -o size=5G tmpfs /data-tmpfs/
dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
free -h
total used free shared buff/cache available
Mem: 3.7Gi 2.6Gi 1.2Gi 2.2Gi 2.2Gi 1.2Gi
Swap: 99Gi 2.8Gi 97Gi
Without swap
=============
free -h
total used free shared buff/cache available
Mem: 3.7Gi 387Mi 3.4Gi 2.1Mi 57Mi 3.3Gi
Swap: 99Gi 0B 99Gi
mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/
dd if=/dev/urandom of=/data-tmpfs/5g-rand2 bs=1G count=5
free -h
total used free shared buff/cache available
Mem: 3.7Gi 2.6Gi 1.2Gi 2.3Gi 2.3Gi 1.1Gi
Swap: 99Gi 21Mi 99Gi
The mix and match remount testing
=================================
# Cannot disable swap after it was first enabled:
mount -t tmpfs -o size=5G tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
mount: /data-tmpfs: mount point not mounted or bad option.
dmesg(1) may have more information after failed mount system call.
dmesg -c
tmpfs: Cannot disable swap on remount
# Remount with the same noswap option is OK:
mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G -o noswap tmpfs /data-tmpfs/
dmesg -c
# Trying to enable swap with a remount after it first disabled:
mount -t tmpfs -o size=5G -o noswap tmpfs /data-tmpfs/
mount -t tmpfs -o remount -o size=5G tmpfs /data-tmpfs/
mount: /data-tmpfs: mount point not mounted or bad option.
dmesg(1) may have more information after failed mount system call.
dmesg -c
tmpfs: Cannot enable swap on remount if it was disabled on first mount
This patch (of 6):
Matthew notes we should not need to check the folio lock on the
writepage() callback so remove it. This sanity check has been lingering
since linux-history days. We remove this as we tidy up the writepage()
callback to make things a bit clearer.
Link: https://lkml.kernel.org/r/20230309230545.2930737-1-mcgrof@kernel.org Link: https://lkml.kernel.org/r/20230309230545.2930737-2-mcgrof@kernel.org Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Danilo Krummrich [Thu, 2 Mar 2023 01:10:35 +0000 (02:10 +0100)]
maple_tree: export symbol mas_preallocate()
Fix missing EXPORT_SYMBOL_GPL() statement for mas_preallocate().
It isn't actually used by anything yet, but mas_preallocate() is part of
the maple tree's 'Advanced API'. All other functions of this API are
exported already.
Christoph Hellwig [Tue, 7 Mar 2023 14:31:25 +0000 (15:31 +0100)]
mm,jfs: move write_one_page/folio_write_one to jfs
The last remaining user of folio_write_one through the write_one_page
wrapper is jfs, so move the functionality there and hard code the call to
metapage_writepage.
Note that the use of the pagecache by the JFS 'metapage' buffer cache is a
bit odd, and we could probably do without VM-level dirty tracking at all,
but that's a change for another time.
Link: https://lkml.kernel.org/r/20230307143125.27778-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Gang He <ghe@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Tue, 7 Mar 2023 14:31:24 +0000 (15:31 +0100)]
ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page
Use filemap_write_and_wait_range to write back the range of the dirty page
instead of write_one_page in preparation of removing write_one_page and
eventually ->writepage.
Link: https://lkml.kernel.org/r/20230307143125.27778-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Gang He <ghe@suse.com> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Hellwig [Tue, 7 Mar 2023 14:31:23 +0000 (15:31 +0100)]
ufs: don't flush page immediately for DIRSYNC directories
Patch series "remove most callers of write_one_page", v4.
This series removes most users of the write_one_page API. These helpers
internally call ->writepage which we are gradually removing from the
kernel.
This patch (of 3):
We do not need to writeout modified directory blocks immediately when
modifying them while the page is locked. It is enough to do the flush
somewhat later which has the added benefit that inode times can be flushed
as well. It also allows us to stop depending on write_one_page()
function.
Ported from an ext2 patch by Jan Kara.
Link: https://lkml.kernel.org/r/20230307143125.27778-1-hch@lst.de Link: https://lkml.kernel.org/r/20230307143125.27778-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Jan Kara <jack@suse.cz> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexander Potapenko [Mon, 6 Mar 2023 11:13:21 +0000 (12:13 +0100)]
lib/stackdepot: kmsan: mark API outputs as initialized
KMSAN does not instrument stackdepot and may treat memory allocated by it
as uninitialized. This is not a problem for KMSAN itself, because its
functions calling stackdepot API are also not instrumented. But other
kernel features (e.g. netdev tracker) may access stack depot from
instrumented code, which will lead to false positives, unless we
explicitly mark stackdepot outputs as initialized.
Link: https://lkml.kernel.org/r/20230306111322.205724-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yue Zhao [Mon, 6 Mar 2023 15:41:38 +0000 (23:41 +0800)]
mm, memcg: Prevent memory.soft_limit_in_bytes load/store tearing
The knob for cgroup v1 memory controller: memory.soft_limit_in_bytes is
not protected by any locking so it can be modified while it is used. This
is not an actual problem because races are unlikely. But it is better to
use [READ|WRITE]_ONCE to prevent compiler from doing anything funky.
The access of memcg->soft_limit is lockless, so it can be concurrently set
at the same time as we are trying to read it. All occurrences of
memcg->soft_limit are updated with [READ|WRITE]_ONCE.
[findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-5-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-5-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yue Zhao [Mon, 6 Mar 2023 15:41:37 +0000 (23:41 +0800)]
mm, memcg: Prevent memory.oom_control load/store tearing
The knob for cgroup v1 memory controller: memory.oom_control is not
protected by any locking so it can be modified while it is used. This is
not an actual problem because races are unlikely. But it is better to use
[READ|WRITE]_ONCE to prevent compiler from doing anything funky.
The access of memcg->oom_kill_disable is lockless, so it can be
concurrently set at the same time as we are trying to read it. All
occurrences of memcg->oom_kill_disable are updated with [READ|WRITE]_ONCE.
[findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-4-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.377-4-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yue Zhao [Mon, 6 Mar 2023 15:41:36 +0000 (23:41 +0800)]
mm, memcg: Prevent memory.swappiness load/store tearing
The knob for cgroup v1 memory controller: memory.swappiness is not
protected by any locking so it can be modified while it is used. This is
not an actual problem because races are unlikely. But it is better to use
[READ|WRITE]_ONCE to prevent compiler from doing anything funky.
The access of memcg->swappiness and vm_swappiness is lockless, so both of
them can be concurrently set at the same time as we are trying to read
them. All occurrences of memcg->swappiness and vm_swappiness are updated
with [READ|WRITE]_ONCE.
[findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-3-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-3-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yue Zhao [Mon, 6 Mar 2023 15:41:35 +0000 (23:41 +0800)]
mm, memcg: Prevent memory.oom.group load/store tearing
Patch series "mm, memcg: cgroup v1 and v2 tunable load/store tearing
fixes", v2.
This patch series helps to prevent load/store tearing in
several cgroup knobs.
As kindly pointed out by Michal Hocko and Roman Gushchin
, the changelog has been rephrased.
Besides, more knobs were checked, according to kind suggestions
from Shakeel Butt and Muchun Song.
This patch (of 4):
The knob for cgroup v2 memory controller: memory.oom.group
is not protected by any locking so it can be modified while it is used.
This is not an actual problem because races are unlikely (the knob is
usually configured long before any workloads hits actual memcg oom)
but it is better to use READ_ONCE/WRITE_ONCE to prevent compiler from
doing anything funky.
The access of memcg->oom_group is lockless, so it can be
concurrently set at the same time as we are trying to read it.
Link: https://lkml.kernel.org/r/20230306154138.3775-1-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-2-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gerald Schaefer [Mon, 6 Mar 2023 16:15:48 +0000 (17:15 +0100)]
mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()
s390 can do more fine-grained handling of spurious TLB protection faults,
when there also is the PTE pointer available.
Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an
additional parameter.
This will add no functional change to other architectures, but those with
private flush_tlb_fix_spurious_fault() implementations need to be made
aware of the new parameter.
Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sergey Senozhatsky [Sat, 4 Mar 2023 03:48:35 +0000 (12:48 +0900)]
zsmalloc: show per fullness group class stats
We keep the old fullness (3/4 threshold) reporting in
zs_stats_size_show(). Switch from allmost full/empty stats to
fine-grained per inuse ratio (fullness group) reporting, which gives
signicantly more data on classes fragmentation.
Sergey Senozhatsky [Sat, 4 Mar 2023 03:48:34 +0000 (12:48 +0900)]
zsmalloc: rework compaction algorithm
The zsmalloc compaction algorithm has the potential to waste some CPU
cycles, particularly when compacting pages within the same fullness group.
This is due to the way it selects the head page of the fullness list for
source and destination pages, and how it reinserts those pages during each
iteration. The algorithm may first use a page as a migration destination
and then as a migration source, leading to an unnecessary back-and-forth
movement of objects.
Consider the following fullness list:
PageA PageB PageC PageD PageE
During the first iteration, the compaction algorithm will select PageA as
the source and PageB as the destination. All of PageA's objects will be
moved to PageB, and then PageA will be released while PageB is reinserted
into the fullness list.
PageB PageC PageD PageE
During the next iteration, the compaction algorithm will again select the
head of the list as the source and destination, meaning that PageB will
now serve as the source and PageC as the destination. This will result in
the objects being moved away from PageB, the same objects that were just
moved to PageB in the previous iteration.
To prevent this avalanche effect, the compaction algorithm should not
reinsert the destination page between iterations. By doing so, the most
optimal page will continue to be used and its usage ratio will increase,
reducing internal fragmentation. The destination page should only be
reinserted into the fullness list if:
- It becomes full
- No source page is available.
TEST
====
It's very challenging to reliably test this series. I ended up developing
my own synthetic test that has 100% reproducibility. The test generates
significan fragmentation (for each size class) and then performs
compaction for each class individually and tracks the number of memcpy()
in zs_object_copy(), so that we can compare the amount work compaction
does on per-class basis.
Total amount of work (zram mm_stat objs_moved)
----------------------------------------------
Old fullness grouping, old compaction algorithm:
323977 memcpy() in zs_object_copy().
Old fullness grouping, new compaction algorithm:
262944 memcpy() in zs_object_copy().
New fullness grouping, new compaction algorithm:
213978 memcpy() in zs_object_copy().
x Old fullness grouping, old compaction algorithm
+ Old fullness grouping, new compaction algorithm
N Min Max Median Avg Stddev
x 140 349 3513 2461 2314.1214 806.03271
+ 140 289 2778 2006 1878.1714 641.02073
Difference at 95.0% confidence
-435.95 +/- 170.595
-18.8387% +/- 7.37193%
(Student's t, pooled s = 728.216)
x Old fullness grouping, old compaction algorithm
+ New fullness grouping, new compaction algorithm
N Min Max Median Avg Stddev
x 140 349 3513 2461 2314.1214 806.03271
+ 140 226 2279 1644 1528.4143 524.85268
Difference at 95.0% confidence
-785.707 +/- 159.331
-33.9527% +/- 6.88516%
(Student's t, pooled s = 680.132)
Sergey Senozhatsky [Sat, 4 Mar 2023 03:48:33 +0000 (12:48 +0900)]
zsmalloc: fine-grained inuse ratio based fullness grouping
Each zspage maintains ->inuse counter which keeps track of the number of
objects stored in the zspage. The ->inuse counter also determines the
zspage's "fullness group" which is calculated as the ratio of the "inuse"
objects to the total number of objects the zspage can hold
(objs_per_zspage). The closer the ->inuse counter is to objs_per_zspage,
the better.
Each size class maintains several fullness lists, that keep track of
zspages of particular "fullness". Pages within each fullness list are
stored in random order with regard to the ->inuse counter. This is
because sorting the zspages by ->inuse counter each time obj_malloc() or
obj_free() is called would be too expensive. However, the ->inuse counter
is still a crucial factor in many situations.
For the two major zsmalloc operations, zs_malloc() and zs_compact(), we
typically select the head zspage from the corresponding fullness list as
the best candidate zspage. However, this assumption is not always
accurate.
For the zs_malloc() operation, the optimal candidate zspage should have
the highest ->inuse counter. This is because the goal is to maximize the
number of ZS_FULL zspages and make full use of all allocated memory.
For the zs_compact() operation, the optimal source zspage should have the
lowest ->inuse counter. This is because compaction needs to move objects
in use to another page before it can release the zspage and return its
physical pages to the buddy allocator. The fewer objects in use, the
quicker compaction can release the zspage. Additionally, compaction is
measured by the number of pages it releases.
This patch reworks the fullness grouping mechanism. Instead of having two
groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage
ration above 3/4) - that result in too many zspages being included in the
ALMOST_EMPTY group for specific classes, size classes maintain a larger
number of fullness lists that give strict guarantees on the minimum and
maximum ->inuse values within each group. Each group represents a 10%
change in the ->inuse ratio compared to neighboring groups. In essence,
there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up
to 100%.
This enhances the selection of candidate zspages for both zs_malloc() and
zs_compact(). A printout of the ->inuse counters of the first 7 zspages
per (random) class fullness group:
The zs_malloc() function searches through the groups of pages starting
with the one having the highest usage ratio. This means that it always
selects a zspage from the group with the least internal fragmentation
(highest usage ratio) and makes it even less fragmented by increasing its
usage ratio.
The zs_compact() function, on the other hand, begins by scanning the group
with the highest fragmentation (lowest usage ratio) to locate the source
page. The first available zspage is selected, and then the function moves
downward to find a destination zspage in the group with the lowest
internal fragmentation (highest usage ratio).
Patch series "zsmalloc: fine-grained fullness and new compaction
algorithm", v4.
Existing zsmalloc page fullness grouping leads to suboptimal page
selection for both zs_malloc() and zs_compact(). This patchset reworks
zsmalloc fullness grouping/classification.
Additinally it also implements new compaction algorithm that is expected
to use less CPU-cycles (as it potentially does fewer memcpy-s in
zs_object_copy()).
Test (synthetic) results can be seen in patch 0003.
This patch (of 4):
This optimization has no effect. It only ensures that when a zspage was
added to its corresponding fullness list, its "inuse" counter was higher
or lower than the "inuse" counter of the zspage at the head of the list.
The intention was to keep busy zspages at the head, so they could be
filled up and moved to the ZS_FULL fullness group more quickly. However,
this doesn't work as the "inuse" counter of a zspage can be modified by
obj_free() but the zspage may still belong to the same fullness list. So,
fix_fullness_group() won't change the zspage's position in relation to the
head's "inuse" counter, leading to a largely random order of zspages
within the fullness list.
For instance, consider a printout of the "inuse" counters of the first 10
zspages in a class that holds 93 objects per zspage:
ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52
As we can see the zspage with the lowest "inuse" counter
is actually the head of the fullness list.
Jaewon Kim [Fri, 3 Mar 2023 05:03:32 +0000 (14:03 +0900)]
dma-buf: system_heap: avoid reclaim for order 4
Using order 4 pages would be helpful for IOMMUs mapping, but trying to get
order 4 pages could spend quite much time in the page allocation. From
the perspective of responsiveness, the deterministic memory allocation
speed, I think, is quite important.
The order 4 allocation with __GFP_RECLAIM may spend much time in reclaim
and compation logic. __GFP_NORETRY also may affect. These cause
unpredictable delay.
To get reasonable allocation speed from dma-buf system heap, use
HIGH_ORDER_GFP for order 4 to avoid reclaim. And let me remove
meaningless __GFP_COMP for order 0.
According to my tests, order 4 with MID_ORDER_GFP could get more number
of order 4 pages but the elapsed times could be very slow.
Link: https://lkml.kernel.org/r/20230303050332.10138-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reviewed-by: John Stultz <jstultz@google.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexander Potapenko [Fri, 3 Mar 2023 14:14:32 +0000 (15:14 +0100)]
x86: kmsan: use C versions of memset16/memset32/memset64
KMSAN must see as many memory accesses as possible to prevent false
positive reports. Fall back to versions of
memset16()/memset32()/memset64() implemented in lib/string.c instead of
those written in assembly.
Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexander Potapenko [Fri, 3 Mar 2023 14:14:31 +0000 (15:14 +0100)]
kmsan: another take at fixing memcpy tests
commit 5478afc55a21 ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR()
to hide the uninitialized var from the compiler optimizations.
However OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of @uninit,
so memcpy tests did not actually check the behavior of memcpy(), because
they always contained a KMSAN report.
Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers
the memory with a barrier(), and add a test case for memcpy() that does
not expect an error report.
Also reflow kmsan_test.c with clang-format.
Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>