mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_rw_nolock(). Now new_pte is none, so
hpage_collapse_scan_file() path can not find this by traversing
file->f_mapping, so there is no concurrency with retract_page_tables().
In addition, we already hold the exclusive mmap_lock, so this new_pte page
is stable, so there is no need to get pmdval and do pmd_same() check.
Link: https://lkml.kernel.org/r/9d582a09dbcf12e562ac5fe0ba05e9248a58f5e0.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: copy_pte_range() use pte_offset_map_rw_nolock()
In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we
already hold the exclusive mmap_lock, and the copy_pte_range() and
retract_page_tables() are using vma->anon_vma to be exclusive, so the PTE
page is stable, there is no need to get pmdval and do pmd_same() check.
Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the PTL held. So
we should get pgt_pmd and do pmd_same() check after the ptl held.
Link: https://lkml.kernel.org/r/055e42db68da00ac8ecab94bd2633c7cd965eb1c.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: handle_pte_fault() use pte_offset_map_rw_nolock()
In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will do the pte_same() check, so there is no need to get pmdval to do
pmd_same() check, just pass a dummy variable to it.
Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In do_adjust_pte(), we may modify the pte entry. The corresponding pmd
entry may have been modified concurrently. Therefore, in order to ensure
the stability if pmd entry, use pte_offset_map_rw_nolock() to replace
pte_offset_map_nolock(), and do pmd_same() check after holding the PTL.
All callers of update_mmu_cache_range() hold the vmf->ptl, so we can
determined whether split PTE locks is being used by doing the following,
just as we do elsewhere in the kernel.
ptl != vmf->ptl
And then we can delete the do_pte_lock() and do_pte_unlock().
Link: https://lkml.kernel.org/r/0eaf6b69aeb2fe35092a633fed12537efe645303.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
in do_swap_page(). In other places, we directly use
pte_offset_map_lock(), so convert it to using pte_offset_map_ro_nolock().
Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE
page after the PTL is held. The RCU lock in pte_offset_map_nolock() will
ensure that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified. Therefore
pte_offset_map_ro_nolock() is just a renamed version of
pte_offset_map_nolock().
pte_offset_map_rw_nolock() is used for may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to
ensure that the pmd entry has not been modified concurrently. So in
addition to the name change, it also outputs the pmdval when successful.
The users should make sure the page table is stable like checking
pte_same() or checking pmd_same() by using the output pmdval before
performing the write operations.
This series will convert all pte_offset_map_nolock() into the above two
helper functions one by one, and finally completely delete it.
This also a preparation for reclaiming the empty user PTE page table
pages.
This patch (of 13):
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Nanyong Sun [Thu, 26 Sep 2024 07:49:22 +0000 (15:49 +0800)]
mm: move mm flags to mm_types.h
The types of mm flags are now far beyond the core dump related features.
This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h.
The linux/sched/coredump.h has include the mm_types.h, so the C files
related to coredump does not need to change head file inclusion. In
addition, the inclusion of sched/coredump.h now can be deleted from the C
files that irrelevant to core dump.
Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com Signed-off-by: Nanyong Sun <sunnanyong@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 26 Sep 2024 15:10:19 +0000 (16:10 +0100)]
mm/madvise: unrestrict process_madvise() for current process
The process_madvise() call was introduced in commit ecb8ac8b1f14
("mm/madvise: introduce process_madvise() syscall: an external memory
hinting API") as a means of performing madvise() operations on another
process.
However, as it provides the means by which to perform multiple madvise()
operations in a batch via an iovec, it is useful to utilise the same
interface for performing operations on the current process rather than a
remote one.
Commit 22af8caff7d1 ("mm/madvise: process_madvise() drop capability check
if same mm") removed the need for a caller invoking process_madvise() on
its own pidfd to possess the CAP_SYS_NICE capability, however this leaves
the restrictions on operation in place.
Resolve this by only applying the restriction on operations when accessing
a remote process.
Moving forward we plan to implement a simpler means of specifying this
condition other than needing to establish a self pidfd, perhaps in the
form of a sentinel pidfd.
Also take the opportunity to refactor the system call implementation
abstracting the vectorised operation.
Link: https://lkml.kernel.org/r/20240926151019.82902-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <brauner@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Thu, 26 Sep 2024 15:20:44 +0000 (17:20 +0200)]
selftests/mm: hugetlb_fault_after_madv: improve test output
Let's improve the test output. For example, print the proper test result.
Install a SIGBUS handler to catch any SIGBUS instead of crashing the test
on failure.
With unsuitable hugetlb page count:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
# [INFO] detected default hugetlb page size: 2048 KiB
ok 2 # SKIP This test needs one and only one page to execute. Got 0
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0
On a failure:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
not ok 1 SIGBUS behavior
Bail out! 1 out of 1 tests failed
On success:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
# [INFO] detected default hugetlb page size: 2048 KiB
ok 1 SIGBUS behavior
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lkml.kernel.org/r/20240926152044.2205129-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Tested-by: Mario Casquero <mcasquer@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Hildenbrand [Thu, 26 Sep 2024 15:20:43 +0000 (17:20 +0200)]
selftests/mm: hugetlb_fault_after_madv: use default hugetlb page size
Patch series "selftests/mm: hugetlb_fault_after_madv improvements".
Mario brought to my attention that the hugetlb_fault_after_madv test is
currently always skipped on s390x. Let's adjust the test to be
independent of the default hugetlb page size and while at it, also improve
the test output.
This patch (of 2):
We currently assume that the hugetlb page size is 2 MiB, which is why we
mmap() a 2 MiB range.
Is the default hugetlb size is larger, mmap() will fail because the range
is not suitable. If the default hugetlb size is smaller (e.g., s390x),
mmap() will fail because we would need more than one hugetlb page, but
just asserted that we have exactly one.
So let's simply use the default hugetlb page size instead of hard-coded 2
MiB, so the test isn't unconditionally skipped on architectures like
s390x.
Before this patch on s390x:
$ ./hugetlb_fault_after_madv
1..0 # SKIP Failed to allocated huge page
With this change on s390x:
$ ./hugetlb_fault_after_madv
Zhiguo Jiang [Fri, 12 Jan 2024 01:23:52 +0000 (09:23 +0800)]
mm: fix shrink nr.unqueued_dirty counter issue
It is needed to ensure sc->nr.unqueued_dirty > 0, which can avoid setting
PGDAT_DIRTY flag when sc->nr.unqueued_dirty and sc->nr.file_taken are both
zero.
Wei Yang [Sun, 8 Sep 2024 14:05:54 +0000 (14:05 +0000)]
maple_tree: memset maple_big_node as a whole
In mast_fill_bnode(), we first clear some fields of maple_big_node and set
the 'type' unconditionally before return. This means we won't leverage
any information in maple_big_node and it is safe to clear the whole
structure.
In maple_big_node, we define slot and padding/gap in a union. And based
on current definition of MAPLE_BIG_NODE_SLOTS/GAPS, padding is always less
than slot and part of the gap is overlapped by slot.
For example on 64bit system:
MAPLE_BIG_NODE_SLOT is 34
MAPLE_BIG_NODE_GAP is 21
With this knowledge, current code may clear some space by twice. And
this could be avoid by clearing the structure as a whole.
Wei Yang [Wed, 11 Sep 2024 14:27:59 +0000 (14:27 +0000)]
maple_tree: goto complete directly on a pivot of 0
When we break the loop after assigning a pivot, the index i/j is not
changed. Then the following code assign pivot, which means we do the
assignment with same i/j by mas_safe_pivot.
Since the loop condition is (i < piv_end), from which we can get i is less
than mt_pivots[mt]. It implies mas_safe_pivot() return pivot[i] which is
the same value we get in loop.
Now we can conclude it does a redundant assignment on a pivot of 0. Let's
just go to complete to avoid it.
We can see in all these four cases, i is always less than or equal to
mas_end after finish the loop:
Case 1: After assign pivot 0, i is set to 1, which is bigger than
mas_end 0. So it jumps to complete and skip the check.
Case 2: After assign pivot 0, i is set to 1.
∵ (mas_start < mas_end) && (mas_start == 0)
==> (1 <= mas_end)
∵ (i == 1) && (1 <= mas_end)
==> (i <= mas_end)
∴ Before loop, we have (i <= mas_end). And we still hold this
if it skips the loop. For example, (i == mas_end).
Now let's see what happens in the loop:
∵ piv_end = min(mas_end, mt_pivots[mt])
==> (piv_end <= mas_end)
∵ loop condition is (i < piv_end)
==> (i <= piv_end) on finish the loop both normally or break
∵ (i <= piv_end) && (piv_end <= mas_end)
==> (i <= mas_end)
∴ After loop, we still get (i <= mas_end) in this case
Case 3: This case would skip both if clause and loop. So when it comes
to the check, i is still mas_start which equals to mas_end.
Case 4: This case would skip the if clause.
∵ (mas_start < mas_end) && (i == mas_start)
==> (i < mas_end)
∴ Before loop, we have (i < mas_end).
The loop process is similar with Case 2, so we get the same
result.
Now we can conclude in all cases, we get (i <= mas_end) when doing
check. Then it is not necessary to do the check.
Lorenzo Stoakes [Tue, 24 Sep 2024 20:10:23 +0000 (21:10 +0100)]
mm: refactor mm_access() to not return NULL
mm_access() can return NULL if the mm is not found, but this is handled
the same as an error in all callers, with some translating this into an
-ESRCH error.
Only proc_mem_open() returns NULL if no mm is found, however in this case
it is clearer and makes more sense to explicitly handle the error.
Additionally we take the opportunity to refactor the function to eliminate
unnecessary nesting.
Simplify things by simply returning -ESRCH if no mm is found - this both
eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies
callers which would return -ESRCH by returning this error directly.
We now have only one active post-processing at any time, so we don't have
same race conditions that we had before. If slot selected for
post-processing gets freed or freed and reallocated it loses its PP_SLOT
flag and there is no way for such a slot to gain PP_SLOT flag again until
current post-processing terminates.
ZRAM_SAME slots cannot be post-processed (writeback or recompress) so do
not mark them ZRAM_IDLE. Same with ZRAM_WB slots, they cannot be
ZRAM_IDLE because they are not in zsmalloc pool anymore.
Writeback suffers from the same problem as recompression did before -
target slot selection for writeback is just a simple iteration over
zram->table entries (stored pages) which selects suboptimal targets for
writeback. This is especially problematic for writeback, because we
uncompress objects before writeback so each of them takes 4K out of
limited writeback storage. For example, when we take a 48 bytes slot and
store it as a 4K object to writeback device we only save 48 bytes of
memory (release from zsmalloc pool). We naturally want to pick the
largest objects for writeback, because then each writeback will release
the largest amount of memory.
This patch applies the same solution and strategy as for recompression
target selection: pp control (post-process) with 16 buckets of candidate
pp slots. Slots are assigned to pp buckets based on sizes - the larger
the slot the higher the group index. This gives us sorted by size lists
of candidate slots (in linear time), so that among post-processing
candidate slots we always select the largest ones first and maximize the
memory saving.
TEST
====
A very simple demonstration: zram is configured with a writeback device.
A limited writeback (wb_limit 2500 pages) is performed then, with a log of
sizes of slots that were written back. You can see that patched zram
selects slots for recompression in significantly different manner, which
leads to higher memory savings (see column #2 of mm_stat output).
Target slot selection for recompression is just a simple iteration over
zram->table entries (stored pages) from slot 0 to max slot. Given that
zram->table slots are written in random order and are not sorted by size,
a simple iteration over slots selects suboptimal targets for
recompression. This is not a problem if we recompress every single
zram->table slot, but we never do that in reality. In reality we limit
the number of slots we can recompress (via max_pages parameter) and hence
proper slot selection becomes very important. The strategy is quite
simple, suppose we have two candidate slots for recompression, one of size
48 bytes and one of size 2800 bytes, and we can recompress only one, then
it certainly makes more sense to pick 2800 entry for recompression.
Because even if we manage to compress 48 bytes objects even further the
savings are going to be very small. Potential savings after good
re-compression of 2800 bytes objects are much higher.
This patch reworks slot selection and introduces the strategy described
above: among candidate slots always select the biggest ones first.
For that the patch introduces zram_pp_ctl (post-processing) structure
which holds NUM_PP_BUCKETS pp buckets of slots. Slots are assigned to a
particular group based on their sizes - the larger the size of the slot
the higher the group index. This, basically, sorts slots by size in liner
time (we still perform just one iteration over zram->table slots). When
we select slot for recompression we always first lookup in higher pp
buckets (those that hold the largest slots). Which achieves the desired
behavior.
TEST
====
A very simple demonstration: zram is configured with zstd, and zstd with
dict as a recompression stream. A limited (max 4096 pages) recompression
is performed then, with a log of sizes of slots that were recompressed.
You can see that patched zram selects slots for recompression in
significantly different manner, which leads to higher memory savings (see
column #2 of mm_stat output).
zram: permit only one post-processing operation at a time
Both recompress and writeback soon will unlock slots during processing,
which makes things too complex wrt possible race-conditions. We still
want to clear PP_SLOT in slot_free, because this is how we figure out that
slot that was selected for post-processing has been released under us and
when we start post-processing we check if slot still has PP_SLOT set. At
the same time, theoretically, we can have something like this:
CPU0 CPU1
recompress
scan slots
set PP_SLOT
unlock slot
slot_free
clear PP_SLOT
allocate PP_SLOT
writeback
scan slots
set PP_SLOT
unlock slot
select PP-slot
test PP_SLOT
So recompress will not detect that slot has been re-used and re-selected
for concurrent writeback post-processing.
Make sure that we only permit on post-processing operation at a time. So
now recompress and writeback post-processing don't race against each
other, we only need to handle slot re-use (slot_free and write), which is
handled individually by each pp operation.
Having recompress and writeback competing for the same slots is not
exactly good anyway (can't imagine anyone doing that).
Patch series "zram: optimal post-processing target selection", v5.
Problem:
--------
Both recompression and writeback perform a very simple linear scan of all
zram slots in search for post-processing (writeback or recompress)
candidate slots. This often means that we pick the worst candidate for pp
(post-processing), e.g. a 48 bytes object for writeback, which is nearly
useless, because it only releases 48 bytes from zsmalloc pool, but
consumes an entire 4K slot in the backing device. Similarly,
recompression of an 48 bytes objects is unlikely to save more memory that
recompression of a 3000 bytes object. Both recompression and writeback
consume constrained resources (CPU time, batter, backing device storage
space) and quite often have a (daily) limit on the number of items they
post-process, so we should utilize those constrained resources in the most
optimal way.
Solution:
---------
This patch reworks the way we select pp targets. We, quite clearly, want
to sort all the candidates and always pick the largest, be it
recompression or writeback. Especially for writeback, because the larger
object we writeback the more memory we release. This series introduces
concept of pp buckets and pp scan/selection.
The scan step is a simple iteration over all zram->table entries, just
like what we currently do, but we don't post-process a candidate slot
immediately. Instead we assign it to a PP (post-processing) bucket. PP
bucket is, basically, a list which holds pp candidate slots that belong to
the same size class. PP buckets are 64 bytes apart, slots are not
strictly sorted within a bucket there is a 64 bytes variance.
The select step simply iterates over pp buckets from highest to lowest and
picks all candidate slots a particular buckets contains. So this gives us
sorted candidates (in linear time) and allows us to select most optimal
(largest) candidates for post-processing first.
This patch (of 7):
This flag indicates that the slot was selected as a candidate slot for
post-processing (pp) and was assigned to a pp bucket. It does not
necessarily mean that the slot is currently under post-processing, but may
mean so. The slot can loose its PP_SLOT flag, while still being in the
pp-bucket, if it's accessed or slot_free-ed.
Adrian Huang [Fri, 26 Jul 2024 16:52:46 +0000 (00:52 +0800)]
mm/vmalloc: combine all TLB flush operations of KASAN shadow virtual address into one operation
When compiling kernel source 'make -j $(nproc)' with the up-and-running
KASAN-enabled kernel on a 256-core machine, the following soft lockup is
shown:
1. The following ftrace log shows that the lockup CPU spends too much
time iterating vmap_nodes and flushing TLB when purging vm_area
structures. (Some info is trimmed).
The drain_vmap_area_work() spends over 23 seconds.
There are 2805 flush_tlb_kernel_range() calls in the ftrace log.
* One is called in __purge_vmap_area_lazy().
* Others are called by purge_vmap_node->kasan_release_vmalloc.
purge_vmap_node() iteratively releases kasan vmalloc
allocations and flushes TLB for each vmap_area.
- [Rough calculation] Each flush_tlb_kernel_range() runs
about 7.5ms.
-- 2804 * 7.5ms = 21.03 seconds.
-- That's why a soft lock is triggered.
2. Extending the soft lockup time can work around the issue (For example,
# echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the
above-mentioned speculation: drain_vmap_area_work() spends too much
time.
If we combine all TLB flush operations of the KASAN shadow virtual
address into one operation in the call path
'purge_vmap_node()->kasan_release_vmalloc()', the running time of
drain_vmap_area_work() can be saved greatly. The idea is from the
flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the
soft lockup won't be triggered.
Here is the test result based on 6.10:
[6.10 wo/ the patch]
1. ftrace latency profiling (record a trace if the latency > 20s).
echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh
echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
2. Run `make -j $(nproc)` to compile the kernel source
3. Once the soft lockup is reproduced, check the ftrace log:
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
76) $ 50412985 us | } /* __purge_vmap_area_lazy */
76) $ 50412997 us | } /* drain_vmap_area_work */
76) $ 29165911 us | } /* __purge_vmap_area_lazy */
76) $ 29165926 us | } /* drain_vmap_area_work */
91) $ 53629423 us | } /* __purge_vmap_area_lazy */
91) $ 53629434 us | } /* drain_vmap_area_work */
91) $ 28121014 us | } /* __purge_vmap_area_lazy */
91) $ 28121026 us | } /* drain_vmap_area_work */
[6.10 w/ the patch]
1. Repeat step 1-2 in "[6.10 wo/ the patch]"
2. The soft lockup is not triggered and ftrace log is empty.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace
log.
4. Setting 'tracing_thresh' to 1 second gets ftrace log.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
23) $ 1074942 us | } /* __purge_vmap_area_lazy */
23) $ 1074950 us | } /* drain_vmap_area_work */
The worst execution time of drain_vmap_area_work() is about 1 second.
Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/ Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Tested-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In proactive memory reclamation scenarios, it is necessary to estimate the
pswpin and pswpout metrics of the cgroup to determine whether to continue
reclaiming anonymous pages in the current batch. This patch will collect
these metrics and expose them.
Baolin Wang [Sun, 22 Sep 2024 04:32:13 +0000 (12:32 +0800)]
mm: shmem: fix khugepaged activation policy for shmem
Shmem has a separate interface (different from anonymous pages) to control
huge page allocation, that means shmem THP can be enabled while anonymous
THP is disabled. However, in this case, khugepaged will not start to
collapse shmem THP, which is unreasonable.
To fix this issue, we should call start_stop_khugepaged() to activate or
deactivate the khugepaged thread when setting shmem mTHP interfaces.
Moreover, add a new helper shmem_hpage_pmd_enabled() to help to check
whether shmem THP is enabled, which will determine if khugepaged should be
activated.
Link: https://lkml.kernel.org/r/9b9c6cbc4499bf44c6455367fd9e0f6036525680.1726978977.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 24 Sep 2024 18:59:11 +0000 (19:59 +0100)]
selftests/mm: add pkey_sighandler_xx, hugetlb_dio to .gitignore
Commit 6998a73efbb8 ("selftests/mm: Add new testcases for pkeys") and
commit 3a103b5315b7 ("selftest: mm: Test if hugepage does not get leaked
during __bio_release_pages()") generate test binaries hugetlb_dio,
pkey_sighandler_tests_32 and pkey_sighandler_tests_64 but did not add
these to .gitignore. Correct this.
Link: https://lkml.kernel.org/r/20240924185911.117937-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Keith Lucas <keith.lucas@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 6 Nov 2024 00:53:31 +0000 (16:53 -0800)]
Merge branch 'mm-hotfixes-stable' into mm-stable.
Pick these into mm-stable:
5de195060b2e mm: resolve faulty mmap_region() error path behaviour 5baf8b037deb mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling 0fb4a7ad270b mm: refactor map_deny_write_exec() 4080ef1579b2 mm: unconditionally close VMAs on error 3dd6ed34ce1f mm: avoid unsafe VMA hook invocation when error arises on mmap hook f8f931bba0f9 mm/thp: fix deferred split unqueue naming and locking e66f3185fa04 mm/thp: fix deferred split queue not partially_mapped
to get a clean merge of these from mm-unstable into mm-stable:
Subject: memcg-v1: fully deprecate move_charge_at_immigrate
Subject: memcg-v1: remove charge move code
Subject: memcg-v1: no need for memcg locking for dirty tracking
Subject: memcg-v1: no need for memcg locking for writeback tracking
Subject: memcg-v1: no need for memcg locking for MGLRU
Subject: memcg-v1: remove memcg move locking code
Subject: tools: testing: add additional vma_internal.h stubs
Subject: mm: isolate mmap internal logic to mm/vma.c
Subject: mm: refactor __mmap_region()
Subject: mm: remove unnecessary reset state logic on merge new VMA
Subject: mm: defer second attempt at merge on mmap()
Subject: mm/vma: the pgoff is correct if can_merge_right
Subject: memcg: workingset: remove folio_memcg_rcu usage
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Mark Brown <broonie@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:47 +0000 (18:11 +0000)]
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().
The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().
Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.
It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.
We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.
This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.
This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.
So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().
[akpm@linux-foundation.org: fix whitespace, per Catalin] Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andreas Larsson <andreas@gaisler.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:46 +0000 (18:11 +0000)]
mm: refactor map_deny_write_exec()
Refactor the map_deny_write_exec() to not unnecessarily require a VMA
parameter but rather to accept VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.
While we're here, we refactor the function to be more readable and add
some additional documentation.
Link: https://lkml.kernel.org/r/6be8bb59cd7c68006ebb006eb9d8dc27104b1f70.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:45 +0000 (18:11 +0000)]
mm: unconditionally close VMAs on error
Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.
With regards to the important vm_ops->close() callback We have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Tue, 29 Oct 2024 18:11:44 +0000 (18:11 +0000)]
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
Patch series "fix error handling in mmap_region() and refactor
(hotfixes)", v4.
mmap_region() is somewhat terrifying, with spaghetti-like control flow and
numerous means by which issues can arise and incomplete state, memory
leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the
VMA is in an inconsistent state.
The patches in this series comprise the minimal changes required to
resolve existing issues in mmap_region() error handling, in order that
they can be hotfixed and backported. There is additionally a follow up
series which goes further, separated out from the v1 series and sent and
updated separately.
This patch (of 5):
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced,
meaning that we need complicated handling to ensure we do not incorrectly
call these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment
that the file->f_ops->mmap() function reports an error by replacing
whatever VMA operations were installed with a dummy empty set of VMA
operations.
We do so through a new helper function internal to mm - mmap_file() -
which is both more logically named than the existing call_mmap() function
and correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
Hugh Dickins [Sun, 27 Oct 2024 20:02:13 +0000 (13:02 -0700)]
mm/thp: fix deferred split unqueue naming and locking
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
Before fixing locking: rename misleading folio_undo_large_rmappable(),
which does not undo large_rmappable, to folio_unqueue_deferred_split(),
which is what it does. But that and its out-of-line __callee are mm
internals of very limited usability: add comment and WARN_ON_ONCEs to
check usage; and return a bool to say if a deferred split was unqueued,
which can then be used in WARN_ON_ONCEs around safety checks (sparing
callers the arcane conditionals in __folio_unqueue_deferred_split()).
Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all
of whose callers now call it beforehand (and if any forget then bad_page()
will tell) - except for its caller put_pages_list(), which itself no
longer has any callers (and will be deleted separately).
Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0
without checking and unqueueing a THP folio from deferred split list;
which is unfortunate, since the split_queue_lock depends on the memcg
(when memcg is enabled); so swapout has been unqueueing such THPs later,
when freeing the folio, using the pgdat's lock instead: potentially
corrupting the memcg's list. __remove_mapping() has frozen refcount to 0
here, so no problem with calling folio_unqueue_deferred_split() before
resetting memcg_data.
That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split
shrinker memcg aware"): which included a check on swapcache before adding
to deferred queue, but no check on deferred queue before adding THP to
swapcache. That worked fine with the usual sequence of events in reclaim
(though there were a couple of rare ways in which a THP on deferred queue
could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split
underused THPs") avoids splitting underused THPs in reclaim, which makes
swapcache THPs on deferred queue commonplace.
Keep the check on swapcache before adding to deferred queue? Yes: it is
no longer essential, but preserves the existing behaviour, and is likely
to be a worthwhile optimization (vmstat showed much more traffic on the
queue under swapping load if the check was removed); update its comment.
Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing
folio->memcg_data without checking and unqueueing a THP folio from the
deferred list, sometimes corrupting "from" memcg's list, like swapout.
Refcount is non-zero here, so folio_unqueue_deferred_split() can only be
used in a WARN_ON_ONCE to validate the fix, which must be done earlier:
mem_cgroup_move_charge_pte_range() first try to split the THP (splitting
of course unqueues), or skip it if that fails. Not ideal, but moving
charge has been requested, and khugepaged should repair the THP later:
nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb3faa5 commit did have the code to move from one deferred list
to another (but was not conscious of its unsafety while refcount non-0);
but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care
deferred split queue in memcg charge move path"), which argued that the
existence of a PMD mapping guarantees that the THP cannot be on a deferred
list. As above, false in rare cases, and now commonly false.
Backport to 6.11 should be straightforward. Earlier backports must take
care that other _deferred_list fixes and dependencies are included. There
is not a strong case for backports, but they can fix cornercases.
Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Sun, 27 Oct 2024 19:59:34 +0000 (12:59 -0700)]
mm/thp: fix deferred split queue not partially_mapped
Recent changes are putting more pressure on THP deferred split queues:
under load revealing long-standing races, causing list_del corruptions,
"Bad page state"s and worse (I keep BUGs in both of those, so usually
don't get to see how badly they end up without). The relevant recent
changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin,
improved swap allocation, and underused THP splitting.
The new unlocked list_del_init() in deferred_split_scan() is buggy. I
gave bad advice, it looks plausible since that's a local on-stack list,
but the fact is that it can race with a third party freeing or migrating
the preceding folio (properly unqueueing it with refcount 0 while holding
split_queue_lock), thereby corrupting the list linkage.
The obvious answer would be to take split_queue_lock there: but it has a
long history of contention, so I'm reluctant to add to that. Instead,
make sure that there is always one safe (raised refcount) folio before, by
delaying its folio_put(). (And of course I was wrong to suggest updating
split_queue_len without the lock: leave that until the splice.)
And remove two over-eager partially_mapped checks, restoring those tests
to how they were before: if uncharge_folio() or free_tail_page_prepare()
finds _deferred_list non-empty, it's in trouble whether or not that folio
is partially_mapped (and the flag was already cleared in the latter case).
Link: https://lkml.kernel.org/r/81e34a8b-113a-0701-740e-2135c97eb1d7@google.com Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sun, 3 Nov 2024 20:25:05 +0000 (10:25 -1000)]
Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM.
The usual collection of singletons - please see the changelogs"
* tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
mm: shrinker: avoid memleak in alloc_shrinker_info
.mailmap: update e-mail address for Eugen Hristev
vmscan,migrate: fix page count imbalance on node stats when demoting pages
mailmap: update Jarkko's email addresses
mm: allow set/clear page_type again
nilfs2: fix potential deadlock with newly created symlinks
Squashfs: fix variable overflow in squashfs_readpage_block
kasan: remove vmalloc_percpu test
tools/mm: -Werror fixes in page-types/slabinfo
mm, swap: avoid over reclaim of full clusters
mm: fix PSWPIN counter for large folios swap-in
mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
mm/codetag: fix null pointer check logic for ref and tag
mm/gup: stop leaking pinned pages in low memory conditions
Linus Torvalds [Sun, 3 Nov 2024 20:15:50 +0000 (10:15 -1000)]
Merge tag 'dmaengine-fix-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine fixes from Vinod Koul:
- TI driver fix to set EOP for cyclic BCDMA transfers
- sh rz-dmac driver fix for handling config with zero address
* tag 'dmaengine-fix-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
dmaengine: ti: k3-udma: Set EOP for all TRs in cyclic BCDMA transfer
dmaengine: sh: rz-dmac: handle configs where one address is zero
Linus Torvalds [Sun, 3 Nov 2024 18:51:53 +0000 (08:51 -1000)]
Merge tag 'driver-core-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core revert from Greg KH:
"Here is a single driver core revert for 6.12-rc6. It reverts a change
that came in -rc1 that was supposed to resolve a reported problem, but
caused another one, so revert it for now so that we can get this all
worked out properly in 6.13.
The revert has been in linux-next all week with no reported issues"
* tag 'driver-core-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Revert "driver core: Fix uevent_show() vs driver detach race"
Linus Torvalds [Sun, 3 Nov 2024 18:48:11 +0000 (08:48 -1000)]
Merge tag 'usb-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt fixes from Greg KH:
"Here are some small USB and Thunderbolt driver fixes for 6.12-rc6 that
have been sitting in my tree this week. Included in here are the
following:
- thunderbolt driver fixes for reported issues
- USB typec driver fixes
- xhci driver fixes for reported problems
- dwc2 driver revert for a broken change
- usb phy driver fix
- usbip tool fix
All of these have been in linux-next this week with no reported
issues"
* tag 'usb-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: tcpm: restrict SNK_WAIT_CAPABILITIES_TIMEOUT transitions to non self-powered devices
usb: phy: Fix API devm_usb_put_phy() can not release the phy
usb: typec: use cleanup facility for 'altmodes_node'
usb: typec: fix unreleased fwnode_handle in typec_port_register_altmodes()
usb: typec: qcom-pmic-typec: fix missing fwnode removal in error path
usb: typec: qcom-pmic-typec: use fwnode_handle_put() to release fwnodes
usb: acpi: fix boot hang due to early incorrect 'tunneled' USB3 device links
Revert "usb: dwc2: Skip clock gating on Broadcom SoCs"
xhci: Fix Link TRB DMA in command ring stopped completion event
xhci: Use pm_runtime_get to prevent RPM on unsupported systems
usbip: tools: Fix detach_port() invalid port error path
thunderbolt: Honor TMU requirements in the domain when setting TMU mode
thunderbolt: Fix KASAN reported stack out-of-bounds read in tb_retimer_scan()
Yu Zhao [Sat, 19 Oct 2024 01:29:39 +0000 (01:29 +0000)]
mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
When the MM_WALK capability is enabled, memory that is mostly accessed by
a VM appears younger than it really is, therefore this memory will be less
likely to be evicted. Therefore, the presence of a running VM can
significantly increase swap-outs for non-VM memory, regressing the
performance for the rest of the system.
Fix this regression by always calling {ptep,pmdp}_clear_young_notify()
whenever we clear the young bits on PMDs/PTEs.
[jthoughton@google.com: fix link-time error] Link: https://lkml.kernel.org/r/20241019012940.3656292-3-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Reported-by: David Stevens <stevensd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yu Zhao [Sat, 19 Oct 2024 01:29:38 +0000 (01:29 +0000)]
mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
Patch series "mm: multi-gen LRU: Have secondary MMUs participate in
MM_WALK".
Today, the MM_WALK capability causes MGLRU to clear the young bit from
PMDs and PTEs during the page table walk before eviction, but MGLRU does
not call the clear_young() MMU notifier in this case. By not calling this
notifier, the MM walk takes less time/CPU, but it causes pages that are
accessed mostly through KVM / secondary MMUs to appear younger than they
should be.
We do call the clear_young() notifier today, but only when attempting to
evict the page, so we end up clearing young/accessed information less
frequently for secondary MMUs than for mm PTEs, and therefore they appear
younger and are less likely to be evicted. Therefore, memory that is
*not* being accessed mostly by KVM will be evicted *more* frequently,
worsening performance.
ChromeOS observed a tab-open latency regression when enabling MGLRU with a
setup that involved running a VM:
Tab-open latency histogram (ms)
Version p50 mean p95 p99 max
base 1315 1198 2347 3454 10319
mglru 2559 1311 7399 12060 43758
fix 1119 926 2470 4211 6947
This series replaces the final non-selftest patchs from this series[1],
which introduced a similar change (and a new MMU notifier) with KVM
optimizations. I'll send a separate series (to Sean and Paolo) for the
KVM optimizations.
This series also makes proactive reclaim with MGLRU possible for KVM
memory. I have verified that this functions correctly with the selftest
from [1], but given that that test is a KVM selftest, I'll send it with
the rest of the KVM optimizations later. Andrew, let me know if you'd
like to take the test now anyway.
The removed stats, MM_LEAF_OLD and MM_NONLEAF_TOTAL, are not very helpful
and become more complicated to properly compute when adding
test/clear_young() notifiers in MGLRU's mm walk.
Link: https://lkml.kernel.org/r/20241019012940.3656292-1-jthoughton@google.com Link: https://lkml.kernel.org/r/20241019012940.3656292-2-jthoughton@google.com Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: James Houghton <jthoughton@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Matlack <dmatlack@google.com> Cc: David Rientjes <rientjes@google.com> Cc: David Stevens <stevensd@google.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Sun, 3 Nov 2024 18:45:03 +0000 (08:45 -1000)]
Merge tag 'char-misc-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull misc driver fixes from Greg KH:
"Here are some small char/misc/iio fixes for 6.12-rc6 that resolve
some reported issues. Included in here are the following:
- small IIO driver fixes for many reported issues
- mei driver fix for a suddenly much reported issue for an "old"
issue.
- MAINTAINERS update for a developer who has moved companies and
forgot to update their old entry.
All of these have been in linux-next this week with no reported
issues"
* tag 'char-misc-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mei: use kvmalloc for read buffer
MAINTAINERS: add netup_unidvb maintainer
iio: dac: Kconfig: Fix build error for ltc2664
iio: adc: ad7124: fix division by zero in ad7124_set_channel_odr()
staging: iio: frequency: ad9832: fix division by zero in ad9832_calc_freqreg()
docs: iio: ad7380: fix supply for ad7380-4
iio: adc: ad7380: fix supplies for ad7380-4
iio: adc: ad7380: add missing supplies
iio: adc: ad7380: use devm_regulator_get_enable_read_voltage()
dt-bindings: iio: adc: ad7380: fix ad7380-4 reference supply
iio: light: veml6030: fix microlux value calculation
iio: gts-helper: Fix memory leaks for the error path of iio_gts_build_avail_scale_table()
iio: gts-helper: Fix memory leaks in iio_gts_build_avail_scale_table()
Linus Torvalds [Sun, 3 Nov 2024 18:35:29 +0000 (08:35 -1000)]
Merge tag 'input-for-v6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- a fix for regression in input core introduced in 6.11 preventing
re-registering input handlers
- a fix for adp5588-keys driver tyring to disable interrupt 0 at
suspend when devices is used without interrupt
- a fix for edt-ft5x06 to stop leaking regmap structure when probing
fails and to make sure it is not released too early on removal.
* tag 'input-for-v6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: fix regression when re-registering input handlers
Input: adp5588-keys - do not try to disable interrupt 0
Input: edt-ft5x06 - fix regmap leak when probe fails
Linus Torvalds [Sun, 3 Nov 2024 18:29:02 +0000 (08:29 -1000)]
Merge tag 'kbuild-fixes-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix a memory leak in modpost
- Resolve build issues when cross-compiling RPM and Debian packages
- Fix another regression in Kconfig
- Fix incorrect MODULE_ALIAS() output in modpost
* tag 'kbuild-fixes-v6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
modpost: fix input MODULE_DEVICE_TABLE() built for 64-bit on 32-bit host
modpost: fix acpi MODULE_DEVICE_TABLE built with mismatched endianness
kconfig: show sub-menu entries even if the prompt is hidden
kbuild: deb-pkg: add pkg.linux-upstream.nokerneldbg build profile
kbuild: deb-pkg: add pkg.linux-upstream.nokernelheaders build profile
kbuild: rpm-pkg: disable kernel-devel package when cross-compiling
sumversion: Fix a memory leak in get_src_version()
Linus Torvalds [Sun, 3 Nov 2024 18:22:21 +0000 (08:22 -1000)]
Merge tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix for posix CPU timers.
When a thread is cloned, the posix CPU timers are not inherited.
If the parent has a CPU timer armed the corresponding tick dependency
in the tasks tick_dep_mask is set and copied to the new thread, which
means the new thread and all decendants will prevent the system to go
into full NOHZ operation.
Clear the tick dependency mask in copy_process() to fix this"
* tag 'timers-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
Linus Torvalds [Sun, 3 Nov 2024 18:18:28 +0000 (08:18 -1000)]
Merge tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- Plug a race between pick_next_task_fair() and try_to_wake_up() where
both try to write to the same task, even though both paths hold a
runqueue lock, but obviously from different runqueues.
The problem is that the store to task::on_rq in __block_task() is
visible to try_to_wake_up() which assumes that the task is not
queued. Both sides then operate on the same task.
Cure it by rearranging __block_task() so the the store to task::on_rq
is the last operation on the task.
- Prevent a potential NULL pointer dereference in task_numa_work()
task_numa_work() iterates the VMAs of a process. A concurrent unmap
of the address space can result in a NULL pointer return from
vma_next() which is unchecked.
Add the missing NULL pointer check to prevent this.
- Operate on the correct scheduler policy in task_should_scx()
task_should_scx() returns true when a task should be handled by sched
EXT. It checks the tasks scheduling policy.
This fails when the check is done before a policy has been set.
Cure it by handing the policy into task_should_scx() so it operates
on the requested value.
- Add the missing handling of sched EXT in the delayed dequeue
mechanism. This was simply forgotten.
* tag 'sched-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/ext: Fix scx vs sched_delayed
sched: Pass correct scheduling policy to __setscheduler_class
sched/numa: Fix the potential null pointer dereference in task_numa_work()
sched: Fix pick_next_task_fair() vs try_to_wake_up() race
Linus Torvalds [Sun, 3 Nov 2024 18:13:52 +0000 (08:13 -1000)]
Merge tag 'perf-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"perf_event_clear_cpumask() uses list_for_each_entry_rcu() without
being in a RCU read side critical section, which triggers a
'suspicious RCU usage' warning.
It turns out that the list walk does not be RCU protected because the
write side lock is held in this context.
Change it to a regular list walk"
* tag 'perf-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix missing RCU reader protection in perf_event_clear_cpumask()
Linus Torvalds [Sun, 3 Nov 2024 18:09:25 +0000 (08:09 -1000)]
Merge tag 'irq-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
- Fix an off-by-one error in the failure path of msi_domain_alloc(),
which causes the cleanup loop to terminate early and leaking the
first allocated interrupt.
- Handle a corner case in GIC-V4 versus a lazily mapped Virtual
Processing Element (VPE). If the VPE has not been mapped because the
guest has not yet emitted a mapping command, then the set_affinity()
callback returns an error code, which causes the vCPU management to
fail.
Return success in this case without touching the hardware. This will
be done later when the guest issues the mapping command.
* tag 'irq-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v4: Correctly deal with set_affinity on lazily-mapped VPEs
genirq/msi: Fix off-by-one error in msi_domain_alloc()
When building a 64-bit kernel, BITS_PER_LONG is defined as 64. However,
on a 32-bit build machine, the constant 1L is a signed 32-bit value.
Left-shifting it beyond 32 bits causes wraparound, and shifting by 31
or 63 bits makes it a negative value.
The fix in commit e0e92632715f ("[PATCH] PATCH: 1 line 2.6.18 bugfix:
modpost-64bit-fix.patch") is incorrect; it only addresses cases where
a 64-bit kernel is built on a 64-bit build machine, overlooking cases
on a 32-bit build machine.
Using 1ULL ensures a 64-bit width on both 32-bit and 64-bit machines,
avoiding the wraparound issue.
Dmitry Torokhov [Mon, 28 Oct 2024 05:31:15 +0000 (22:31 -0700)]
Input: fix regression when re-registering input handlers
Commit d469647bafd9 ("Input: simplify event handling logic") introduced
code that would set handler->events() method to either
input_handler_events_filter() or input_handler_events_default() or
input_handler_events_null(), depending on the kind of input handler
(a filter or a regular one) we are dealing with. Unfortunately this
breaks cases when we try to re-register the same filter (as is the case
with sysrq handler): after initial registration the handler will have 2
event handling methods defined, and will run afoul of the check in
input_handler_check_methods():
input: input_handler_check_methods: only one event processing method can be defined (sysrq)
sysrq: Failed to register input handler, error -22
Fix this by adding handle_events() method to input_handle structure and
setting it up when registering a new input handle according to event
handling methods defined in associated input_handler structure, thus
avoiding modifying the input_handler structure.
Reported-by: "Ned T. Crigler" <crigler@gmail.com> Reported-by: Christian Heusel <christian@heusel.eu> Tested-by: "Ned T. Crigler" <crigler@gmail.com> Tested-by: Peter Seiderer <ps.report@gmx.net> Fixes: d469647bafd9 ("Input: simplify event handling logic") Link: https://lore.kernel.org/r/Zx2iQp6csn42PJA7@xavtug Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Linus Torvalds [Sat, 2 Nov 2024 19:27:11 +0000 (09:27 -1000)]
Merge tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix two async COPY bugs found during NFS bake-a-thon
- Fix an svcrdma memory leak
* tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
rpcrdma: Always release the rpcrdma_device's xa_array
NFSD: Never decrement pending_async_copies on error
NFSD: Initialize struct nfsd4_copy earlier
Linus Torvalds [Sat, 2 Nov 2024 19:22:16 +0000 (09:22 -1000)]
Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- fix a sysbot reported crash on filestreams
- Reduce cpu time spent searching for extents in a very fragmented FS
- Check for delayed allocations before setting extsize
* tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: streamline xfs_filestream_pick_ag
xfs: fix finding a last resort AG in xfs_filestream_pick_ag
xfs: Reduce unnecessary searches when searching for the best extents
xfs: Check for delayed allocations before setting extsize
- fix idmap_mount_tree_invalid test failure due to incorrect argument
- fix watchdog-test run leaving the watchdog timer enabled causing
system reboot. With this fix, the test disables the watchdog timer
when it gets terminated with SIGTERM, SIGKILL, and SIGQUIT in
addition to SIGINT
* tag 'linux_kselftest-fixes-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/watchdog-test: Fix system accidentally reset after watchdog-test
selftests/intel_pstate: check if cpupower is installed
selftests/intel_pstate: fix operand expected error
selftests/mount_setattr: fix idmap_mount_tree_invalid failed to run
Linus Torvalds [Sat, 2 Nov 2024 01:59:46 +0000 (15:59 -1000)]
Merge tag 'rust-fixes-6.12-3' of https://github.com/Rust-for-Linux/linux
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Avoid build errors with old 'rustc's without LLVM patch version
(important since it impacts people that do not even enable Rust)
- Update LLVM version for 'HAVE_CFI_ICALL_NORMALIZE_INTEGERS' in
'depends on' condition (the fix was eventually backported rather
than land in LLVM 19)"
* tag 'rust-fixes-6.12-3' of https://github.com/Rust-for-Linux/linux:
cfi: tweak llvm version for HAVE_CFI_ICALL_NORMALIZE_INTEGERS
kbuild: rust: avoid errors with old `rustc`s without LLVM patch version
Linus Torvalds [Sat, 2 Nov 2024 01:44:23 +0000 (15:44 -1000)]
Merge tag 'pci-v6.12-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fix from Bjorn Helgaas:
- Enable device-specific ACS-like functionality even if the device
doesn't advertise an ACS capability, which got broken when adding
fancy ACS kernel parameter (Jason Gunthorpe)
* tag 'pci-v6.12-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Fix pci_enable_acs() support for the ACS quirks
Linus Torvalds [Sat, 2 Nov 2024 01:37:09 +0000 (15:37 -1000)]
Merge tag 'drm-fixes-2024-11-02' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Regular fixes pull, nothing too out of the ordinary, the mediatek
fixes came in a batch that I might have preferred a bit earlier but
all seem fine, otherwise regular xe/amdgpu and a few misc ones.
xe:
- Fix missing HPD interrupt enabling, bringing one PM refactor with it
- Workaround LNL GGTT invalidation not being visible to GuC
- Avoid getting jobs stuck without a protecting timeout
ivpu:
- Fix firewall IRQ handling
panthor:
- Fix firmware initialization wrt page sizes
- Fix handling and reporting of dead job groups
sched:
- Guarantee forward progress via WC_MEM_RECLAIM
tests:
- Fix memory leak in drm_display_mode_from_cea_vic()
mediatek:
- Fix degradation problem of alpha blending
- Fix color format MACROs in OVL
- Fix get efuse issue for MT8188 DPTX
- Fix potential NULL dereference in mtk_crtc_destroy()
- Correct dpi power-domains property
- Add split subschema property constraints"
* tag 'drm-fixes-2024-11-02' of https://gitlab.freedesktop.org/drm/kernel: (27 commits)
drm/xe: Don't short circuit TDR on jobs not started
drm/xe: Add mmio read before GGTT invalidate
drm/tests: hdmi: Fix memory leaks in drm_display_mode_from_cea_vic()
drm/connector: hdmi: Fix memory leak in drm_display_mode_from_cea_vic()
drm/tests: helpers: Add helper for drm_display_mode_from_cea_vic()
drm/panthor: Report group as timedout when we fail to properly suspend
drm/panthor: Fail job creation when the group is dead
drm/panthor: Fix firmware initialization on systems with a page size > 4k
accel/ivpu: Fix NOC firewall interrupt handling
drm/xe/display: Add missing HPD interrupt enabling during non-d3cold RPM resume
drm/xe/display: Separate the d3cold and non-d3cold runtime PM handling
drm/xe: Remove runtime argument from display s/r functions
drm/amdgpu/smu13: fix profile reporting
drm/amd/pm: Vangogh: Fix kernel memory out of bounds write
Revert "drm/amd/display: update DML2 policy EnhancedPrefetchScheduleAccelerationFinal DCN35"
drm/sched: Mark scheduler work queues with WQ_MEM_RECLAIM
drm/tegra: Fix NULL vs IS_ERR() check in probe()
dt-bindings: display: mediatek: split: add subschema property constraints
dt-bindings: display: mediatek: dpi: correct power-domains property
drm/mediatek: Fix potential NULL dereference in mtk_crtc_destroy()
...
Linus Torvalds [Sat, 2 Nov 2024 01:22:57 +0000 (15:22 -1000)]
Merge tag 'cxl-fixes-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Ira Weiny:
"The bulk of these fixes center around an initialization order bug
reported by Gregory Price and some additional fall out from the
debugging effort.
In summary, cxl_acpi and cxl_mem race and previously worked because of
a bus_rescan_devices() while testing without modules built in.
Unfortunately with modules built in the rescan would fail due to the
cxl_port driver being registered late via the build order. Furthermore
it was found bus_rescan_devices() did not guarantee a probe barrier
which CXL was expecting. Additional fixes to cxl-test and decoder
allocation came along as they were found in this debugging effort.
The other fixes are pretty minor but one affects trace point data seen
by user space.
Summary:
- Fix crashes when running with cxl-test code
- Fix Trace DRAM Event Record field decodes
- Fix module/built in initialization order errors
- Fix use after free on decoder shutdowns
- Fix out of order decoder allocations
- Improve cxl-test to better reflect real world systems"
* tag 'cxl-fixes-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/test: Improve init-order fidelity relative to real-world systems
cxl/port: Prevent out-of-order decoder allocation
cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
cxl/port: Fix CXL port initialization order when the subsystem is built-in
cxl/events: Fix Trace DRAM Event Record
cxl/core: Return error when cxl_endpoint_gather_bandwidth() handles a non-PCI device
Linus Torvalds [Fri, 1 Nov 2024 23:41:55 +0000 (13:41 -1000)]
Merge tag 'block-6.12-20241101' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fixup for a recent blk_rq_map_user_bvec() patch
- NVMe pull request via Keith:
- Spec compliant identification fix (Keith)
- Module parameter to enable backward compatibility on unusual
namespace formats (Keith)
- Target double free fix when using keys (Vitaliy)
- Passthrough command error handling fix (Keith)
* tag 'block-6.12-20241101' of git://git.kernel.dk/linux:
nvme: re-fix error-handling for io_uring nvme-passthrough
nvmet-auth: assign dh_key to NULL after kfree_sensitive
nvme: module parameter to disable pi with offsets
block: fix queue limits checks in blk_rq_map_user_bvec for real
nvme: enhance cns version checking
Linus Torvalds [Fri, 1 Nov 2024 23:38:01 +0000 (13:38 -1000)]
Merge tag 'io_uring-6.12-20241101' of git://git.kernel.dk/linux
Pull io_uring fix from Jens Axboe:
- Fix not honoring IOCB_NOWAIT for starting buffered writes in terms of
calling sb_start_write(), leading to a deadlock if someone is
attempting to freeze the file system with writes in progress, as each
side will end up waiting for the other to make progress.
* tag 'io_uring-6.12-20241101' of git://git.kernel.dk/linux:
io_uring/rw: fix missing NOWAIT check for O_DIRECT start write
Linus Torvalds [Fri, 1 Nov 2024 19:04:23 +0000 (09:04 -1000)]
Merge tag 'acpi-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Make the ACPI CPPC library use a raw spinlock for operations carried
out in scheduler context via the schedutil governor and the ACPI CPPC
cpufreq driver (Pierre Gondois)"
* tag 'acpi-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: CPPC: Make rmw_lock a raw_spin_lock
Dave Airlie [Fri, 1 Nov 2024 18:44:02 +0000 (04:44 +1000)]
Merge tag 'drm-xe-fixes-2024-10-31' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Fix missing HPD interrupt enabling, bringing one PM refactor with it
(Imre / Maarten)
- Workaround LNL GGTT invalidation not being visible to GuC
(Matthew Brost)
- Avoid getting jobs stuck without a protecting timeout (Matthew Brost)
Linus Torvalds [Fri, 1 Nov 2024 18:26:38 +0000 (08:26 -1000)]
Merge tag 'riscv-for-linus-6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- Avoid accessing the early boot ACPI tables via unsafe memory
attributes, which can result in incorrect ACPI table data appearing.
This can cause all sorts of bad behavior.
- Avoid compiler-inserted library calls in the VDSO.
- GCC+Rust builds have been disabled, to avoid issues related to ISA
string mismatched between the GCC and LLVM Rust implementations.
- The NX flag is now set in the EFI PE/COFF headers, which is necessary
for some distro GRUB versions to boot images.
- A fix to avoid leaking DT node reference counts on ACPI systems
during cache info parsing.
- CPU numbers are now printed as unsigned values during hotplug.
- A pair of build fixes for usused macros, which can trigger warnings
on some configurations.
* tag 'riscv-for-linus-6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Remove duplicated GET_RM
riscv: Remove unused GENERATING_ASM_OFFSETS
riscv: Use '%u' to format the output of 'cpu'
riscv: Prevent a bad reference count on CPU nodes
riscv: efi: Set NX compat flag in PE/COFF header
RISC-V: disallow gcc + rust builds
riscv: Do not use fortify in early code
RISC-V: ACPI: fix early_ioremap to early_memremap
riscv: vdso: Prevent the compiler from inserting calls to memset()
Linus Torvalds [Fri, 1 Nov 2024 17:54:11 +0000 (07:54 -1000)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The important one is a change to the way in which we handle protection
keys around signal delivery so that we're more closely aligned with
the x86 behaviour, however there is also a revert of the previous fix
to disable software tag-based KASAN with GCC, since a workaround
materialised shortly afterwards.
I'd love to say we're done with 6.12, but we're aware of some
longstanding fpsimd register corruption issues that we're almost at
the bottom of resolving.
Summary:
- Fix handling of POR_EL0 during signal delivery so that pushing the
signal context doesn't fail based on the pkey configuration of the
interrupted context and align our user-visible behaviour with that
of x86.
- Fix a bogus pointer being passed to the CPU hotplug code from the
Arm SDEI driver.
- Re-enable software tag-based KASAN with GCC by using an alternative
implementation of '__no_sanitize_address'"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: signal: Improve POR_EL0 handling to avoid uaccess failures
firmware: arm_sdei: Fix the input parameter of cpuhp_remove_state()
Revert "kasan: Disable Software Tag-Based KASAN with GCC"
kasan: Fix Software Tag-Based KASAN with GCC
Linus Torvalds [Fri, 1 Nov 2024 17:45:00 +0000 (07:45 -1000)]
Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull iomap fixes from Christian Brauner:
"Fixes for iomap to prevent data corruption bugs in the fallocate
unshare range implementation of fsdax and a small cleanup to turn
iomap_want_unshare_iter() into an inline function"
* tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iomap: turn iomap_want_unshare_iter into an inline function
fsdax: dax_unshare_iter needs to copy entire blocks
fsdax: remove zeroing code from dax_unshare_iter
iomap: share iomap_unshare_iter predicate code with fsdax
xfs: don't allocate COW extents when unsharing a hole
Linus Torvalds [Fri, 1 Nov 2024 17:37:10 +0000 (07:37 -1000)]
Merge tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull filesystem fixes from Christian Brauner:
"VFS:
- Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set
- Add a get_tree_bdev_flags() helper that allows to modify e.g.,
whether errors are logged into the filesystem context during
superblock creation. This is used by erofs to fix a userspace
regression where an error is currently logged when its used on a
regular file which is an new allowed mode in erofs.
netfs:
- Fix the sysfs debug path in the documentation.
- Fix iov_iter_get_pages*() for folio queues by skipping the page
extracation if we're at the end of a folio.
afs:
- Fix moving subdirectories to different parent directory.
autofs:
- Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in
validate_dev_ioctl(). The actual ioctl number, not the ioctl
command needs to be checked for autofs"
* tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP
autofs: fix thinko in validate_dev_ioctl()
iov_iter: Fix iov_iter_get_pages*() for folio_queue
afs: Fix missing subdir edit when renamed between parent dirs
doc: correcting the debug path for cachefiles
erofs: use get_tree_bdev_flags() to avoid misleading messages
fs/super.c: introduce get_tree_bdev_flags()
Linus Torvalds [Fri, 1 Nov 2024 17:31:47 +0000 (07:31 -1000)]
Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more stability fixes. There's one patch adding export of MIPS
cmpxchg helper, used in the error propagation fix.
- fix error propagation from split bios to the original btrfs bio
- fix merging of adjacent extents (normal operation, defragmentation)
- fix potential use after free after freeing btrfs device structures"
* tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix defrag not merging contiguous extents due to merged extent maps
btrfs: fix extent map merging not happening for adjacent extents
btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()
btrfs: fix error propagation of split bios
MIPS: export __cmpxchg_small()
Linus Torvalds [Fri, 1 Nov 2024 17:21:03 +0000 (07:21 -1000)]
Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Various syzbot fixes, and the more notable ones:
- Fix for pointers in an extent overflowing the max (16) on a
filesystem with many devices: we were creating too many cached
copies when moving data around. Now, we only create at most one
cached copy if there's a promote target set.
Caching will be a bit broken for reflinked data until 6.13: I have
larger series queued up which significantly improves the plumbing
for data options down into the extent (bch_extent_rebalance) to fix
this.
- Fix for deadlock on -ENOSPC on tiny filesystems
Allocation from the partial open_bucket list wasn't correctly
accounting partial open_buckets as free: this fixes the main cause
of tests timing out in the automated tests"
* tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek
bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
bcachefs: Don't filter partial list buckets in open_buckets_to_text()
bcachefs: Don't keep tons of cached pointers around
bcachefs: init freespace inited bits to 0 in bch2_fs_initialize
bcachefs: Fix unhandled transaction restart in fallocate
bcachefs: Fix UAF in bch2_reconstruct_alloc()
bcachefs: fix null-ptr-deref in have_stripes()
bcachefs: fix shift oob in alloc_lru_idx_fragmentation
bcachefs: Fix invalid shift in validate_sb_layout()
Vlastimil Babka [Thu, 24 Oct 2024 15:12:29 +0000 (17:12 +0200)]
mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address hint
and of at least PMD_SIZE will be aligned to PMD so that it can benefit
from a THP backing page.
However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have merged
to a large THP-backed area before commit efa7df3e3bb5 and now they are
fragmented to multiple areas each aligned to PMD boundary with gaps
between. The regression then seems to be caused mainly due to the
benchmark's memory access pattern suffering from TLB or cache aliasing due
to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable [2]
[3] and early testing suggests this patch fixes the regression there as
well.
To fix the regression but still try to benefit from THP-friendly anonymous
mapping alignment, add a condition that the size of the mapping must be a
multiple of PMD size instead of at least PMD size. In case of many
odd-sized mapping like the cactusBSSN creates, those will stop being
aligned and with gaps between, and instead naturally merge again.
Link: https://lkml.kernel.org/r/20241024151228.101841-2-vbabka@suse.cz Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michael Matz <matz@suse.de> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1] Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2] Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3] Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Cc: Rik van Riel <riel@surriel.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Petr Tesarik <ptesarik@suse.com> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gregory Price [Fri, 25 Oct 2024 14:17:24 +0000 (10:17 -0400)]
vmscan,migrate: fix page count imbalance on node stats when demoting pages
When numa balancing is enabled with demotion, vmscan will call
migrate_pages when shrinking LRUs. migrate_pages will decrement the
the node's isolated page count, leading to an imbalanced count when
invoked from (MG)LRU code.
This path happens for SUCCESSFUL migrations, not failures. Typically
callers to migrate_pages are required to handle putback/accounting for
failures, but this is already handled in the shrink code.
When accounting for migrations, instead do not decrement the count when
the migration reason is MR_DEMOTION. As of v6.11, this demotion logic
is the only source of MR_DEMOTION.
Link: https://lkml.kernel.org/r/20241025141724.17927-1-gourry@gourry.net Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim") Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Xu <weixugc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Fri, 1 Nov 2024 02:49:23 +0000 (16:49 -1000)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
- Put the QP netlink dump back in cxgb4, fixes a user visible
regression
- Don't change the rounding style in mlx5 for user provided rd_atomic
values
- Resolve a race in bnxt_re around the qp-handle table array
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/bnxt_re: synchronize the qp-handle table array
RDMA/bnxt_re: Fix the usage of control path spin locks
RDMA/mlx5: Round max_rd_atomic/max_dest_rd_atomic up instead of down
RDMA/cxgb4: Dump vendor specific QP details
Linus Torvalds [Fri, 1 Nov 2024 00:56:19 +0000 (14:56 -1000)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Daniel Borkmann:
- Fix BPF verifier to force a checkpoint when the program's jump
history becomes too long (Eduard Zingerman)
- Add several fixes to the BPF bits iterator addressing issues like
memory leaks and overflow problems (Hou Tao)
- Fix an out-of-bounds write in trie_get_next_key (Byeonguk Jeong)
- Fix BPF test infra's LIVE_FRAME frame update after a page has been
recycled (Toke Høiland-Jørgensen)
- Fix BPF verifier and undo the 40-bytes extra stack space for
bpf_fastcall patterns due to various bugs (Eduard Zingerman)
- Fix a BPF sockmap race condition which could trigger a NULL pointer
dereference in sock_map_link_update_prog (Cong Wang)
- Fix tcp_bpf_recvmsg_parser to retrieve seq_copied from tcp_sk under
the socket lock (Jiayuan Chen)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled
selftests/bpf: Add three test cases for bits_iter
bpf: Use __u64 to save the bits in bits iterator
bpf: Check the validity of nr_words in bpf_iter_bits_new()
bpf: Add bpf_mem_alloc_check_size() helper
bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()
bpf: disallow 40-bytes extra stack for bpf_fastcall patterns
selftests/bpf: Add test for trie_get_next_key()
bpf: Fix out-of-bounds write in trie_get_next_key()
selftests/bpf: Test with a very short loop
bpf: Force checkpoint when jmp history is too long
bpf: fix filed access without lock
sock_map: fix a NULL pointer dereference in sock_map_link_update_prog()
Linus Torvalds [Thu, 31 Oct 2024 22:39:58 +0000 (12:39 -1000)]
Merge tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi, bluetooth and netfilter.
No known new regressions outstanding.
Current release - regressions:
- wifi: mt76: do not increase mcu skb refcount if retry is not
supported
Current release - new code bugs:
- wifi:
- rtw88: fix the RX aggregation in USB 3 mode
- mac80211: fix memory corruption bug in struct ieee80211_chanctx
Previous releases - regressions:
- sched:
- stop qdisc_tree_reduce_backlog on TC_H_ROOT
- sch_api: fix xa_insert() error path in tcf_block_get_ext()
- wifi:
- revert "wifi: iwlwifi: remove retry loops in start"
- cfg80211: clear wdev->cqm_config pointer on free
- netfilter: fix potential crash in nf_send_reset6()
- ip_tunnel: fix suspicious RCU usage warning in ip_tunnel_find()
- bluetooth: fix null-ptr-deref in hci_read_supported_codecs
- eth: mlxsw: add missing verification before pushing Tx header
- eth: hns3: fixed hclge_fetch_pf_reg accesses bar space out of
bounds issue
Previous releases - always broken:
- wifi: mac80211: do not pass a stopped vif to the driver in
.get_txpower
- netfilter: sanitize offset and length before calling skb_checksum()
- core:
- fix crash when config small gso_max_size/gso_ipv4_max_size
- skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
- mptcp: protect sched with rcu_read_lock
- eth: ice: fix crash on probe for DPLL enabled E810 LOM
- eth: macsec: fix use-after-free while sending the offloading packet
- eth: stmmac: fix unbalanced DMA map/unmap for non-paged SKB data
- eth: hns3: fix kernel crash when 1588 is sent on HIP08 devices
- eth: mtk_wed: fix path of MT7988 WO firmware"
* tag 'net-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
net: hns3: fix kernel crash when 1588 is sent on HIP08 devices
net: hns3: fixed hclge_fetch_pf_reg accesses bar space out of bounds issue
net: hns3: initialize reset_timer before hclgevf_misc_irq_init()
net: hns3: don't auto enable misc vector
net: hns3: Resolved the issue that the debugfs query result is inconsistent.
net: hns3: fix missing features due to dev->features configuration too early
net: hns3: fixed reset failure issues caused by the incorrect reset type
net: hns3: add sync command to sync io-pgtable
net: hns3: default enable tx bounce buffer when smmu enabled
netfilter: nft_payload: sanitize offset and length before calling skb_checksum()
net: ethernet: mtk_wed: fix path of MT7988 WO firmware
selftests: forwarding: Add IPv6 GRE remote change tests
mlxsw: spectrum_ipip: Fix memory leak when changing remote IPv6 address
mlxsw: pci: Sync Rx buffers for device
mlxsw: pci: Sync Rx buffers for CPU
mlxsw: spectrum_ptp: Add missing verification before pushing Tx header
net: skip offload for NETIF_F_IPV6_CSUM if ipv6 header contains extension
Bluetooth: hci: fix null-ptr-deref in hci_read_supported_codecs
netfilter: nf_reject_ipv6: fix potential crash in nf_send_reset6()
netfilter: Fix use-after-free in get_info()
...
1. Fix degradation problem of alpha blending
2. Fix color format MACROs in OVL
3. Fix get efuse issue for MT8188 DPTX
4. Fix potential NULL dereference in mtk_crtc_destroy()
5. Correct dpi power-domains property
6. Add split subschema property constraints
Linus Torvalds [Thu, 31 Oct 2024 18:15:40 +0000 (08:15 -1000)]
Merge tag 'sound-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Here we see slightly more commits than wished, but basically all are
small and mostly trivial fixes.
The only core change is the workaround for __counted_by() usage in
ASoC DAPM code, while the rest are device-specific fixes for Intel
Baytrail devices, Cirrus and wcd937x codecs, and HD-audio / USB-audio
devices"
* tag 'sound-6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek: Fix headset mic on TUXEDO Stellaris 16 Gen6 mb1
ALSA: hda/realtek: Fix headset mic on TUXEDO Gemini 17 Gen3
ALSA: usb-audio: Add quirks for Dell WD19 dock
ASoC: codecs: wcd937x: relax the AUX PDM watchdog
ASoC: codecs: wcd937x: add missing LO Switch control
ASoC: dt-bindings: rockchip,rk3308-codec: add port property
ALSA: hda/realtek: Add subwoofer quirk for Infinix ZERO BOOK 13
ASoC: dapm: fix bounds checker error in dapm_widget_list_create
ASoC: Intel: sst: Fix used of uninitialized ctx to log an error
ASoC: cs42l51: Fix some error handling paths in cs42l51_probe()
ASoC: Intel: sst: Support LPE0F28 ACPI HID
ALSA: hda/realtek: Limit internal Mic boost on Dell platform
ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet
ASoC: Intel: bytcr_rt5640: Add support for non ACPI instantiated codec
ASoC: codecs: rt5640: Always disable IRQs from rt5640_cancel_work()
Johan Hovold [Mon, 28 Oct 2024 12:49:58 +0000 (13:49 +0100)]
gpiolib: fix debugfs newline separators
The gpiolib debugfs interface exports a list of all gpio chips in a
system and the state of their pins.
The gpio chip sections are supposed to be separated by a newline
character, but a long-standing bug prevents the separator from
being included when output is generated in multiple sessions, making the
output inconsistent and hard to read.
Make sure to only suppress the newline separator at the beginning of the
file as intended.
Filipe Manana [Tue, 29 Oct 2024 15:18:45 +0000 (15:18 +0000)]
btrfs: fix defrag not merging contiguous extents due to merged extent maps
When running defrag (manual defrag) against a file that has extents that
are contiguous and we already have the respective extent maps loaded and
merged, we end up not defragging the range covered by those contiguous
extents. This happens when we have an extent map that was the result of
merging multiple extent maps for contiguous extents and the length of the
merged extent map is greater than or equals to the defrag threshold
length.
$ ./test.sh
Initial number of file extent items: 4
Number of file extent items after defrag with 128K threshold: 4
Number of file extent items after defrag with 256K threshold: 4
The 4 extents don't get merged because we have an extent map with a size
of 256K that is the result of merging the individual extent maps for each
of the four 64K extents and at defrag_lookup_extent() we have a value of
zero for the generation threshold ('newer_than' argument) since this is a
manual defrag. As a consequence we don't call defrag_get_extent() to get
an extent map representing a single file extent item in the inode's
subvolume tree, so we end up using the merged extent map at
defrag_collect_targets() and decide not to defrag.
Fix this by updating defrag_lookup_extent() to always discard extent maps
that were merged and call defrag_get_extent() regardless of the minimum
generation threshold ('newer_than' argument).
A test case for fstests will be sent along soon.
CC: stable@vger.kernel.org # 6.1+ Fixes: 199257a78bb0 ("btrfs: defrag: don't use merged extent map for their generation check") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 28 Oct 2024 16:23:00 +0000 (16:23 +0000)]
btrfs: fix extent map merging not happening for adjacent extents
If we have 3 or more adjacent extents in a file, that is, consecutive file
extent items pointing to adjacent extents, within a contiguous file range
and compatible flags, we end up not merging all the extents into a single
extent map.
For example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
After all the ordered extents complete we unpin the extent maps and try
to merge them, but instead of getting a single extent map we get two
because:
1) When the first ordered extent completes (file range [0, 64K)) we
unpin its extent map and attempt to merge it with the extent map for
the range [64K, 128K), but we can't because that extent map is still
pinned;
2) When the second ordered extent completes (file range [64K, 128K)), we
unpin its extent map and merge it with the previous extent map, for
file range [0, 64K), but we can't merge with the next extent map, for
the file range [128K, 192K), because this one is still pinned.
The merged extent map for the file range [0, 128K) gets the flag
EXTENT_MAP_MERGED set;
3) When the third ordered extent completes (file range [128K, 192K)), we
unpin its extent map and attempt to merge it with the previous extent
map, for file range [0, 128K), but we can't because that extent map
has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false
due to different flags) while the extent map for the range [128K, 192K)
doesn't have that flag set.
We also can't merge it with the next extent map, for file range
[192K, 256K), because that one is still pinned.
At this moment we have 3 extent maps:
One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
One for file range [128K, 192K).
One for file range [192K, 256K) which is still pinned;
4) When the fourth and final extent completes (file range [192K, 256K)),
we unpin its extent map and attempt to merge it with the previous
extent map, for file range [128K, 192K), which succeeds since none
of these extent maps have the EXTENT_MAP_MERGED flag set.
So we end up with 2 extent maps:
One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set.
Since after merging extent maps we don't attempt to merge again, that
is, merge the resulting extent map with the one that is now preceding
it (and the one following it), we end up with those two extent maps,
when we could have had a single extent map to represent the whole file.
Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag.
While this doesn't present any functional issue, it prevents the merging
of extent maps which allows to save memory, and can make defrag not
merging extents too (that will be addressed in the next patch).
Fixes: 199257a78bb0 ("btrfs: defrag: don't use merged extent map for their generation check") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Toke Høiland-Jørgensen [Wed, 30 Oct 2024 10:48:26 +0000 (11:48 +0100)]
bpf, test_run: Fix LIVE_FRAME frame update after a page has been recycled
The test_run code detects whether a page has been modified and
re-initialises the xdp_frame structure if it has, using
xdp_update_frame_from_buff(). However, xdp_update_frame_from_buff()
doesn't touch frame->mem, so that wasn't correctly re-initialised, which
led to the pages from page_pool not being returned correctly. Syzbot
noticed this as a memory leak.
Fix this by also copying the frame->mem structure when re-initialising
the frame, like we do on initialisation of a new page from page_pool.
Fixes: e5995bc7e2ba ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption") Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: syzbot+d121e098da06af416d23@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: syzbot+d121e098da06af416d23@syzkaller.appspotmail.com Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/bpf/20241030-test-run-mem-fix-v1-1-41e88e8cae43@redhat.com
Jens Axboe [Thu, 31 Oct 2024 15:10:07 +0000 (09:10 -0600)]
Merge tag 'nvme-6.12-2024-10-31' of git://git.infradead.org/nvme into block-6.12
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.12
- Spec compliant identification fix (Keith)
- Module parameter to enable backward compatibility on unusual
namespace formats (Keith)
- Target double free fix when using keys (Vitaliy)
- Passthrough command error handling fix (Keith)"
* tag 'nvme-6.12-2024-10-31' of git://git.infradead.org/nvme:
nvme: re-fix error-handling for io_uring nvme-passthrough
nvmet-auth: assign dh_key to NULL after kfree_sensitive
nvme: module parameter to disable pi with offsets
nvme: enhance cns version checking
Jens Axboe [Thu, 31 Oct 2024 14:05:44 +0000 (08:05 -0600)]
io_uring/rw: fix missing NOWAIT check for O_DIRECT start write
When io_uring starts a write, it'll call kiocb_start_write() to bump the
super block rwsem, preventing any freezes from happening while that
write is in-flight. The freeze side will grab that rwsem for writing,
excluding any new writers from happening and waiting for existing writes
to finish. But io_uring unconditionally uses kiocb_start_write(), which
will block if someone is currently attempting to freeze the mount point.
This causes a deadlock where freeze is waiting for previous writes to
complete, but the previous writes cannot complete, as the task that is
supposed to complete them is blocked waiting on starting a new write.
This results in the following stuck trace showing that dependency with
the write blocked starting a new write:
task:fio state:D stack:0 pid:886 tgid:886 ppid:876
Call trace:
__switch_to+0x1d8/0x348
__schedule+0x8e8/0x2248
schedule+0x110/0x3f0
percpu_rwsem_wait+0x1e8/0x3f8
__percpu_down_read+0xe8/0x500
io_write+0xbb8/0xff8
io_issue_sqe+0x10c/0x1020
io_submit_sqes+0x614/0x2110
__arm64_sys_io_uring_enter+0x524/0x1038
invoke_syscall+0x74/0x268
el0_svc_common.constprop.0+0x160/0x238
do_el0_svc+0x44/0x60
el0_svc+0x44/0xb0
el0t_64_sync_handler+0x118/0x128
el0t_64_sync+0x168/0x170
INFO: task fsfreeze:7364 blocked for more than 15 seconds.
Not tainted 6.12.0-rc5-00063-g76aaf945701c #7963
with the attempting freezer stuck trying to grab the rwsem:
Fix this by having the io_uring side honor IOCB_NOWAIT, and only attempt a
blocking grab of the super block rwsem if it isn't set. For normal issue
where IOCB_NOWAIT would always be set, this returns -EAGAIN which will
have io_uring core issue a blocking attempt of the write. That will in
turn also get completions run, ensuring forward progress.
Since freezing requires CAP_SYS_ADMIN in the first place, this isn't
something that can be triggered by a regular user.
Matthew Brost [Fri, 25 Oct 2024 21:43:29 +0000 (14:43 -0700)]
drm/xe: Don't short circuit TDR on jobs not started
Short circuiting TDR on jobs not started is an optimization which is not
required. On LNL we are facing an issue where jobs do not get scheduled
by the GuC if it misses a GGTT page update. When this occurs let the TDR
fire, toggle the scheduling which may get the job unstuck, and print a
warning message. If the TDR fires twice on job that hasn't started,
timeout the job.
v2:
- Add warning message (Paulo)
- Add fixes tag (Paulo)
- Timeout job which hasn't started after TDR firing twice
v3:
- Include local change
v4:
- Short circuit check_timeout on job not started
- use warn level rather than notice (Paulo)
Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out") Cc: stable@vger.kernel.org Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241025214330.2010521-2-matthew.brost@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 35d25a4a0012e690ef0cc4c5440231176db595cc) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Matthew Brost [Wed, 23 Oct 2024 22:12:00 +0000 (15:12 -0700)]
drm/xe: Add mmio read before GGTT invalidate
On LNL without a mmio read before a GGTT invalidate the GuC can
incorrectly read the GGTT scratch page upon next access leading to jobs
not getting scheduled. A mmio read before a GGTT invalidate seems to fix
this. Since a GGTT invalidate is not a hot code path, blindly do a mmio
read before each GGTT invalidate.
Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: stable@vger.kernel.org Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reported-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3164 Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241023221200.1797832-1-matthew.brost@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 5a710196883e0ac019ac6df2a6d79c16ad3c32fa)
[ Fix conflict with mmio vs gt argument ] Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Andy Shevchenko [Wed, 30 Oct 2024 17:36:52 +0000 (19:36 +0200)]
gpio: sloppy-logic-analyzer: Check for error code from devm_mutex_init() call
Even if it's not critical, the avoidance of checking the error code
from devm_mutex_init() call today diminishes the point of using devm
variant of it. Tomorrow it may even leak something. Add the missed
check.
Fixes: 7828b7bbbf20 ("gpio: add sloppy logic analyzer using polling") Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20241030174132.2113286-3-andriy.shevchenko@linux.intel.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Masahiro Yamada [Sat, 26 Oct 2024 17:55:50 +0000 (02:55 +0900)]
kconfig: show sub-menu entries even if the prompt is hidden
Since commit f79dc03fe68c ("kconfig: refactor choice value
calculation"), when EXPERT is disabled, nothing within the "if INPUT"
... "endif" block in drivers/input/Kconfig is displayed. This issue
affects all command-line interfaces and GUI frontends.
The prompt for INPUT is hidden when EXPERT is disabled. Previously,
menu_is_visible() returned true in this case; however, it now returns
false, resulting in all sub-menu entries being skipped.
Here is a simplified test case illustrating the issue:
config A
bool "A" if X
default y
config B
bool "B"
depends on A
When X is disabled, A becomes unconfigurable and is forced to y.
B should be displayed, as its dependency is met.
This commit restores the necessary code, so menu_is_visible() functions
as it did previously.
Fixes: f79dc03fe68c ("kconfig: refactor choice value calculation") Reported-by: Edmund Raile <edmund.raile@proton.me> Closes: https://lore.kernel.org/all/5fd0dfc7ff171aa74352e638c276069a5f2e888d.camel@proton.me/ Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>