Liam R. Howlett [Tue, 1 Dec 2020 02:30:04 +0000 (21:30 -0500)]
mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
do_brk_munmap() has already aligned the address and has a maple tree
state to be used. Use the new do_mas_align_munmap() to avoid
unnecessary alignment and error checks.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Liam R. Howlett [Thu, 19 Nov 2020 17:57:23 +0000 (12:57 -0500)]
mm/mmap: Reorganize munmap to use maple states
Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
do_mas_align_munmap().
do_munmap() is a wrapper to create a maple state for any callers that
have not been converted to the maple tree.
do_mas_munmap() takes a maple state to mumap a range. This is just a
small function which checks for error conditions and aligns the end of
the range.
do_mas_align_munmap() uses the aligned range to mumap a range.
do_mas_align_munmap() starts with the first VMA in the range, then finds
the last VMA in the range. Both start and end are split if necessary.
Then the VMAs are unlocked and removed from the linked list at the same
time. Followed by a single tree operation of overwriting the area in
with a NULL. Finally, the detached list is unmapped and freed.
By reorganizing the munmap calls as outlined, it is now possible to
avoid extra work of aligning pre-aligned callers which are known to be
safe, avoid extra VMA lookups or tree walks for modifications.
detach_vmas_to_be_unmapped() is no longer used, so drop this code.
vm_brk_flags() can just call the do_mas_munmap() as it checks for
intersecting VMAs directly.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Liam R. Howlett [Mon, 16 Nov 2020 19:50:20 +0000 (14:50 -0500)]
mm: Remove vmacache
By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code. Remove the vmacache
to reduce the work in keeping it up to date and code complexity.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Liam R. Howlett [Tue, 10 Nov 2020 18:37:40 +0000 (13:37 -0500)]
mm/mmap: Use advanced maple tree API for mmap_region()
Changing mmap_region() to use the maple tree state and the advanced
maple tree interface allows for a lot less tree walking.
This change removes the last caller of munmap_vma_range(), so drop this
unused function.
Add vma_expand() to expand a VMA if possible by doing the necessary
hugepage check, uprobe_munmap of files, dcache flush, modifications then
undoing the detaches, etc.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
mm/mmap: Change do_brk_flags() to expand existing VMA and add
do_brk_munmap()
Avoid allocating a new VMA when it a vma modification can occur. When a
brk() can expand or contract a VMA, then the single store operation will
only modify one index of the maple tree instead of causing a node to
split or coalesce. This avoids unnecessary allocations/frees of maple
tree nodes and VMAs.
Move some limit & flag verifications out of the do_brk_flags() function
to use only relevant checks in the code path of bkr() and
vm_brk_flags().
Set the vma to check if it can expand in vm_brk_flags() if extra
criteria are met.
Drop userfaultfd from do_brk_flags() path and only use it in
vm_brk_flags() path since that is the only place a munmap will happen.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
mm/khugepaged: Optimize collapse_pte_mapped_thp() by using vma_lookup()
vma_lookup() will walk the vma tree once and not continue to look for
the next vma. Since the exact vma is checked below, this is a more
optimal way of searching.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Use vma_lookup() to walk the tree to the start value requested. If
the vma at the start does not match, then the answer is NULL and there
is no need to look at the next vma the way that find_vma() would.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
vma_lookup() walks the VMA tree for a specific value, find_vma() will
search the tree after walking to a specific value. It is more efficient
to only walk to the requested value since privcmd_ioctl_mmap() will exit
the loop if vm_start != msg->va.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
mmap: Change zeroing of maple tree in __vma_adjust()
Only write to the maple tree if we are not inserting or the insert isn't
going to overwrite the area to clear. This avoids spanning writes and
node coealescing when unnecessary.
The change requires a custom search for the linked list addition to find
the correct VMA for the prev link.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Matthew Wilcox (Oracle) [Thu, 28 Oct 2021 18:39:28 +0000 (14:39 -0400)]
damon: Convert __damon_va_three_regions to use the VMA iterator
This rather specialised walk can use the VMA iterator. If this proves
to be too slow, we can write a custom routine to find the two largest
gaps, but it will be somewhat complicated, so let's see if we need it
first.
Update the kunit test case to use the maple tree. This also fixes an
issue with the kunit testcase not adding the last VMA to the list.
Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Liam R. Howlett [Fri, 22 Oct 2021 17:53:01 +0000 (13:53 -0400)]
kernel/fork: Use maple tree for dup_mmap() during forking
The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator and switch to
the maple tree advanced API to avoid multiple walks of the tree during
insert operations. Unexport the now-unused vma_store() function.
For performance reasons we bulk allocate the maple tree nodes. The node
calculations are done internally to the tree and use the VMA count and
assume the worst-case node requirements. The VM_DONT_COPY flag does
not allow for the most efficient copy method of the tree and so a bulk
loading algorithm is used.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
mm/mmap: Use maple tree for unmapped_area{_topdown}
The maple tree code was added to find the unmapped area in a previous
commit and was checked against what the rbtree returned, but the actual
result was never used. Start using the maple tree implementation and
remove the rbtree code.
Add kernel documentation comment for these functions.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
Use the maple tree's advanced API and a maple state to walk the tree
for the entry at the address of the next vma, then use the maple state
to walk back one entry to find the previous entry.
Add kernel documentation comments for this API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Matthew Wilcox (Oracle) [Fri, 22 Oct 2021 14:52:21 +0000 (10:52 -0400)]
mm: Add VMA iterator
This thin layer of abstraction over the maple tree state is for
iterating over VMAs. You can go forwards, go backwards or ask where
the iterator is. Rename the existing vma_next() to __vma_next() --
it will be removed by the end of this series.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Start tracking the VMAs with the new maple tree structure in parallel
with the rb_tree. Add debug and trace events for maple tree operations
and duplicate the rb_tree that is created on forks into the maple tree.
The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in
kernel/fork for process forking, and used to find the unmapped_area and
checked against what the rbtree finds.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct. The long term goal is to reduce
or remove the mmap_sem contention.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses. The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Liam R. Howlett [Tue, 1 Jun 2021 14:35:43 +0000 (10:35 -0400)]
radix tree test suite: Add allocation counts and size to kmem_cache
Add functions to get the number of allocations, and total allocations
from a kmem_cache. Also add a function to get the allocated size and a
way to zero the total allocations.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Liam R. Howlett [Thu, 27 Feb 2020 20:13:20 +0000 (15:13 -0500)]
radix tree test suite: Add kmem_cache_set_non_kernel()
kmem_cache_set_non_kernel() is a mechanism to allow a certain number of
kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in
the flags. This functionality allows for testing different paths though
the code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Liam R. Howlett [Thu, 3 Feb 2022 14:59:53 +0000 (09:59 -0500)]
xarray: Fix bitmap breakage
bitmap header changes broke the testing code for the xarray. Fix the
issue by directly including the header into the actual xarray header.
This should at least make the error more pronounced in the future.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Matthew Wilcox (Oracle) [Mon, 31 Jan 2022 15:37:40 +0000 (15:37 +0000)]
binfmt_elf: Take the mmap lock when walking the VMA list
I'm not sure if the VMA list can change under us, but dump_vma_snapshot()
is very careful to take the mmap_lock in write mode. We only need to
take it in read mode here as we do not care if the size of the stack
VMA changes underneath us.
If it can be changed underneath us, this is a potential use-after-free
for a multithreaded process which is dumping core.
Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Linus Torvalds [Sun, 30 Jan 2022 13:12:02 +0000 (15:12 +0200)]
Merge tag 'irq_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Drop an unused private data field in the AIC driver
- Various fixes to the realtek-rtl driver
- Make the GICv3 ITS driver compile again in !SMP configurations
- Force reset of the GICv3 ITSs at probe time to avoid issues during kexec
- Yet another kfree/bitmap_free conversion
- Various DT updates (Renesas, SiFive)
* tag 'irq_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
dt-bindings: interrupt-controller: sifive,plic: Group interrupt tuples
dt-bindings: interrupt-controller: sifive,plic: Fix number of interrupts
dt-bindings: irqchip: renesas-irqc: Add R-Car V3U support
irqchip/gic-v3-its: Reset each ITS's BASERn register before probe
irqchip/gic-v3-its: Fix build for !SMP
irqchip/loongson-pch-ms: Use bitmap_free() to free bitmap
irqchip/realtek-rtl: Service all pending interrupts
irqchip/realtek-rtl: Fix off-by-one in routing
irqchip/realtek-rtl: Map control data to virq
irqchip/apple-aic: Drop unused ipi_hwirq field
Linus Torvalds [Sun, 30 Jan 2022 13:02:32 +0000 (15:02 +0200)]
Merge tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Prevent accesses to the per-CPU cgroup context list from another CPU
except the one it belongs to, to avoid list corruption
- Make sure parent events are always woken up to avoid indefinite hangs
in the traced workload
* tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix cgroup event list management
perf: Always wake the parent event
Linus Torvalds [Sun, 30 Jan 2022 11:09:00 +0000 (13:09 +0200)]
Merge tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"Make sure the membarrier-rseq fence commands are part of the reported
set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY"
* tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
Linus Torvalds [Sun, 30 Jan 2022 10:55:06 +0000 (12:55 +0200)]
Merge tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add another Intel CPU model to the list of CPUs supporting the
processor inventory unique number
- Allow writing to MCE thresholding sysfs files again - a previous
change had accidentally disabled it and no one noticed. Goes to show
how much is this stuff used
* tag 'x86_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPIN
x86/MCE/AMD: Allow thresholding interface updates after init
Linus Torvalds [Sun, 30 Jan 2022 09:21:50 +0000 (11:21 +0200)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"12 patches.
Subsystems affected by this patch series: sysctl, binfmt, ia64, mm
(memory-failure, folios, kasan, and psi), selftests, and ocfs2"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
ocfs2: fix a deadlock when commit trans
jbd2: export jbd2_journal_[grab|put]_journal_head
psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n
psi: fix "no previous prototype" warnings when CONFIG_CGROUPS=n
mm, kasan: use compare-exchange operation to set KASAN page tag
kasan: test: fix compatibility with FORTIFY_SOURCE
tools/testing/scatterlist: add missing defines
mm: page->mapping folio->mapping should have the same offset
memory-failure: fetch compound_head after pgmap_pfn_valid()
ia64: make IA64_MCA_RECOVERY bool instead of tristate
binfmt_misc: fix crash when load/unload module
include/linux/sysctl.h: fix register_sysctl_mount_point() return type
Joseph Qi [Sat, 29 Jan 2022 21:41:23 +0000 (13:41 -0800)]
jbd2: export jbd2_journal_[grab|put]_journal_head
Patch series "ocfs2: fix a deadlock case".
This fixes a deadlock case in ocfs2. We firstly export jbd2 symbols
jbd2_journal_[grab|put]_journal_head as preparation and later use them
in ocfs2 insread of jbd_[lock|unlock]_bh_journal_head to fix the
deadlock.
This patch (of 2):
This exports symbols jbd2_journal_[grab|put]_journal_head, which will be
used outside modules, e.g. ocfs2.
Link: https://lkml.kernel.org/r/20220121071205.100648-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Cc: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suren Baghdasaryan [Sat, 29 Jan 2022 21:41:20 +0000 (13:41 -0800)]
psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n
When CONFIG_PROC_FS is disabled psi code generates the following
warnings:
kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
1364 | static const struct proc_ops psi_cpu_proc_ops = {
| ^~~~~~~~~~~~~~~~
kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
1355 | static const struct proc_ops psi_memory_proc_ops = {
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
1346 | static const struct proc_ops psi_io_proc_ops = {
| ^~~~~~~~~~~~~~~
Make definitions of these structures and related functions conditional
on CONFIG_PROC_FS config.
Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com Fixes: 0e94682b73bf ("psi: introduce psi monitor") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: kernel test robot <lkp@intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Collingbourne [Sat, 29 Jan 2022 21:41:14 +0000 (13:41 -0800)]
mm, kasan: use compare-exchange operation to set KASAN page tag
It has been reported that the tag setting operation on newly-allocated
pages can cause the page flags to be corrupted when performed
concurrently with other flag updates as a result of the use of
non-atomic operations.
Fix the problem by using a compare-exchange loop to update the tag.
Marco Elver [Sat, 29 Jan 2022 21:41:11 +0000 (13:41 -0800)]
kasan: test: fix compatibility with FORTIFY_SOURCE
With CONFIG_FORTIFY_SOURCE enabled, string functions will also perform
dynamic checks using __builtin_object_size(ptr), which when failed will
panic the kernel.
Because the KASAN test deliberately performs out-of-bounds operations,
the kernel panics with FORTIFY_SOURCE, for example:
Fix it by also hiding `ptr` from the optimizer, which will ensure that
__builtin_object_size() does not return a valid size, preventing
fortified string functions from panicking.
Maor Gottlieb [Sat, 29 Jan 2022 21:41:07 +0000 (13:41 -0800)]
tools/testing/scatterlist: add missing defines
The cited commits replaced preemptible with pagefault_disabled and
flush_kernel_dcache_page with flush_dcache_page respectively, hence need
to update the corresponding defines in the test.
scatterlist.c: In function ‘sg_miter_stop’:
scatterlist.c:919:4: warning: implicit declaration of function ‘flush_dcache_page’ [-Wimplicit-function-declaration]
flush_dcache_page(miter->page);
^~~~~~~~~~~~~~~~~
In file included from linux/scatterlist.h:8:0,
from scatterlist.c:9:
scatterlist.c:922:18: warning: implicit declaration of function ‘pagefault_disabled’ [-Wimplicit-function-declaration]
WARN_ON_ONCE(!pagefault_disabled());
^
linux/mm.h:23:25: note: in definition of macro ‘WARN_ON_ONCE’
int __ret_warn_on = !!(condition); \
^~~~~~~~~
Link: https://lkml.kernel.org/r/20220118082105.1737320-1-maorg@nvidia.com Fixes: 723aca208516 ("mm/scatterlist: replace the !preemptible warning in sg_miter_stop()") Fixes: 0e84f5dbf8d6 ("scatterlist: replace flush_kernel_dcache_page with flush_dcache_page") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yang [Sat, 29 Jan 2022 21:41:04 +0000 (13:41 -0800)]
mm: page->mapping folio->mapping should have the same offset
As with the other members of folio, the offset of page->mapping and
folio->mapping must be the same. The compile-time check was
inadvertently removed during development. Add it back.
[willy@infradead.org: changelog redo]
Link: https://lkml.kernel.org/r/20220104011734.21714-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joao Martins [Sat, 29 Jan 2022 21:41:01 +0000 (13:41 -0800)]
memory-failure: fetch compound_head after pgmap_pfn_valid()
memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()). For devmap with compound pages fetch the
compound_head in case a tail page memory failure is being handled.
Currently this is a nop, but in the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
Without this fix memory-failure handling (i.e. MCEs on pmem) with
device-dax configured namespaces will regress (and crash).
Link: https://lkml.kernel.org/r/20211202204422.26777-2-joao.m.martins@oracle.com Reported-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Sat, 29 Jan 2022 21:40:58 +0000 (13:40 -0800)]
ia64: make IA64_MCA_RECOVERY bool instead of tristate
In linux-next, IA64_MCA_RECOVERY uses the (new) function
make_task_dead(), which is not exported for use by modules. Instead of
exporting it for one user, convert IA64_MCA_RECOVERY to be a bool
Kconfig symbol.
In a config file from "kernel test robot <lkp@intel.com>" for a
different problem, this linker error was exposed when
CONFIG_IA64_MCA_RECOVERY=m.
Link: https://lkml.kernel.org/r/20220124213129.29306-1-rdunlap@infradead.org Fixes: 0e25498f8cd4 ("exit: Add and use make_task_dead.") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tong Zhang [Sat, 29 Jan 2022 21:40:55 +0000 (13:40 -0800)]
binfmt_misc: fix crash when load/unload module
We should unregister the table upon module unload otherwise something
horrible will happen when we load binfmt_misc module again. Also note
that we should keep value returned by register_sysctl_mount_point() and
release it later, otherwise it will leak.
Also, per Christian's comment, to fully restore the old behavior that
won't break userspace the check(binfmt_misc_header) should be
eliminated.
binfmt_misc: Failed to create fs/binfmt_misc sysctl mount point
binfmt_misc: Failed to create fs/binfmt_misc sysctl mount point
BUG: unable to handle page fault for address: fffffbfff8004802
Call Trace:
init_misc_binfmt+0x2d/0x1000 [binfmt_misc]
Link: https://lkml.kernel.org/r/20220124181812.1869535-2-ztong0001@gmail.com Fixes: 3ba442d5331f ("fs: move binfmt_misc sysctl to its own file") Signed-off-by: Tong Zhang <ztong0001@gmail.com> Co-developed-by: Christian Brauner<brauner@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 29 Jan 2022 17:05:47 +0000 (19:05 +0200)]
Merge tag 'pci-v5.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci fixes from Bjorn Helgaas:
- Fix compilation warnings in new mt7621 driver (Sergio Paracuellos)
- Restore the sysfs "rom" file for VGA shadow ROMs, which was broken
when converting "rom" to be a static attribute (Bjorn Helgaas)
* tag 'pci-v5.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/sysfs: Find shadow ROM before static attribute initialization
PCI: mt7621: Remove unused function pcie_rmw()
PCI: mt7621: Drop of_match_ptr() to avoid unused variable
Linus Torvalds [Sat, 29 Jan 2022 13:45:33 +0000 (15:45 +0200)]
Merge tag 'gpio-fixes-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"Two fixes for the gpio-simulator:
- fix a bug with hogs not being set-up in gpio-sim when user-space
sets the chip label to an empty string
- include the gpio-sim documentation in the index"
* tag 'gpio-fixes-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: sim: add doc file to index file
gpio: sim: check the label length when setting up device properties
Linus Torvalds [Sat, 29 Jan 2022 13:34:04 +0000 (15:34 +0200)]
Merge tag 'char-misc-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are two small char/misc driver fixes for 5.17-rc2 that fix some
reported issues. They are:
- fix up a merge issue in the at25.c driver that ended up dropping
some lines in the driver. The removed lines ended being needed, so
this restores it and the driver works again.
- counter core fix where the wrong error was being returned, NULL
should be the correct error for when memory is gone here, like the
kmalloc() core does.
Both of these have been in linux-next this week with no reported
issues"
* tag 'char-misc-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
counter: fix an IS_ERR() vs NULL bug
eeprom: at25: Restore missing allocation
Linus Torvalds [Sat, 29 Jan 2022 13:23:13 +0000 (15:23 +0200)]
Merge tag 'tty-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are some small bug fixes and reverts for reported problems with
the tty core and drivers. They include:
- revert the fifo use for the 8250 console mode. It caused too many
regressions and problems, and had a bug in it as well. This is
being reworked and should show up in a later -rc1 release, but it's
not ready for 5.17
- rpmsg tty race fix
- restore the cyclades.h uapi header file. Turns out a compiler test
suite used it for some unknown reason. Bring it back just for the
parts that are used by the builder test so they continue to build.
No functionality is restored as no one actually has this hardware
anymore, nor is it really tested.
- stm32 driver fixes
- n_gsm flow control fixes
- pl011 driver fix
- rs485 initialization fix
All of these have been in linux-next this week with no reported
problems"
* tag 'tty-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
kbuild: remove include/linux/cyclades.h from header file check
serial: core: Initialize rs485 RTS polarity already on probe
serial: pl011: Fix incorrect rs485 RTS polarity on set_mctrl
serial: stm32: fix software flow control transfer
serial: stm32: prevent TDR register overwrite when sending x_char
tty: n_gsm: fix SW flow control encoding/handling
serial: 8250: of: Fix mapped region size when using reg-offset property
tty: rpmsg: Fix race condition releasing tty port
tty: Partially revert the removal of the Cyclades public API
tty: Add support for Brainboxes UC cards.
Revert "tty: serial: Use fifo in 8250 console driver"
Linus Torvalds [Sat, 29 Jan 2022 13:17:20 +0000 (15:17 +0200)]
Merge tag 'usb-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB driver fixes from Greg KH:
"Here are some small USB driver fixes for 5.17-rc2 that resolve a
number of reported problems. These include:
- typec driver fixes
- xhci platform driver fixes for suspending
- ulpi core fix
- role.h build fix
- new device ids
- syzbot-reported bugfixes
- gadget driver fixes
- dwc3 driver fixes
- other small fixes
All of these have been in linux-next this week with no reported
issues"
* tag 'usb-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: cdnsp: Fix segmentation fault in cdns_lost_power function
usb: dwc2: gadget: don't try to disable ep0 in dwc2_hsotg_suspend
usb: gadget: at91_udc: fix incorrect print type
usb: dwc3: xilinx: Fix error handling when getting USB3 PHY
usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode
usb: xhci-plat: fix crash when suspend if remote wake enable
usb: common: ulpi: Fix crash in ulpi_match()
usb: gadget: f_sourcesink: Fix isoc transfer for USB_SPEED_SUPER_PLUS
ucsi_ccg: Check DEV_INT bit only when starting CCG4
USB: core: Fix hang in usb_kill_urb by adding memory barriers
usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
usb: typec: tcpm: Do not disconnect when receiving VSAFE0V
usb: typec: tcpm: Do not disconnect while receiving VBUS off
usb: typec: Don't try to register component master without components
usb: typec: Only attempt to link USB ports if there is fwnode
usb: typec: tcpci: don't touch CC line if it's Vconn source
usb: roles: fix include/linux/usb/role.h compile issue
Linus Torvalds [Sat, 29 Jan 2022 13:01:08 +0000 (15:01 +0200)]
Merge tag 'block-5.17-2022-01-28' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request
- add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs (Wu
Zheng)
- remove the unneeded ret variable in nvmf_dev_show (Changcheng
Deng)
- Fix for a hang regression introduced with a patch in the merge
window, where low queue depth devices would not always get woken
correctly (Laibin)
- Small series fixing an IO accounting issue with bio backed dm devices
(Mike, Yu)
* tag 'block-5.17-2022-01-28' of git://git.kernel.dk/linux-block:
dm: properly fix redundant bio-based IO accounting
dm: revert partial fix for redundant bio-based IO accounting
block: add bio_start_io_acct_time() to control start_time
blk-mq: Fix wrong wakeup batch configuration which will cause hang
nvme-fabrics: remove the unneeded ret variable in nvmf_dev_show
nvme-pci: add the IGNORE_DEV_SUBNQN quirk for Intel P4500/P4600 SSDs
blk-mq: fix missing blk_account_io_done() in error path
block: fix memory leak in disk_register_independent_access_ranges
Linus Torvalds [Sat, 29 Jan 2022 12:53:07 +0000 (14:53 +0200)]
Merge tag 'io_uring-5.17-2022-01-28' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Just two small fixes this time:
- Fix a bug that can lead to node registration taking 1 second, when
it should finish much quicker (Dylan)
- Remove an unused argument from a function (Usama)"
* tag 'io_uring-5.17-2022-01-28' of git://git.kernel.dk/linux-block:
io_uring: remove unused argument from io_rsrc_node_alloc
io_uring: fix bug in slow unregistering of nodes
Linus Torvalds [Sat, 29 Jan 2022 12:46:19 +0000 (14:46 +0200)]
Merge tag 'powerpc-5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix VM debug warnings on boot triggered via __set_fixmap().
- Fix a debug warning in the 64-bit Book3S PMU handling code.
- Fix nested guest HFSCR handling with multiple vCPUs on Power9 or
later.
- Fix decrementer storm caused by a recent change, seen with some
configs.
Thanks to Alexey Kardashevskiy, Athira Rajeev, Christophe Leroy,
Fabiano Rosas, Maxime Bizon, Nicholas Piggin, and Sachin Sant.
* tag 'powerpc-5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s/interrupt: Fix decrementer storm
KVM: PPC: Book3S HV Nested: Fix nested HFSCR being clobbered with multiple vCPUs
powerpc/perf: Fix power_pmu_disable to call clear_pmi_irq_pending only if PMI is pending
powerpc/fixmap: Fix VM debug warning on unmap
Linus Torvalds [Sat, 29 Jan 2022 06:57:22 +0000 (08:57 +0200)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Errata workarounds for Cortex-A510: broken hardware dirty bit
management, detection code for the TRBE (tracing) bugs with the
actual fixes going in via the CoreSight tree.
- Cortex-X2 errata handling for TRBE (inheriting the workarounds from
Cortex-A710).
- Fix ex_handler_load_unaligned_zeropad() to use the correct struct
members.
- A couple of kselftest fixes for FPSIMD.
- Silence the vdso "no previous prototype" warning.
- Mark start_backtrace() notrace and NOKPROBE_SYMBOL.
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: cpufeature: List early Cortex-A510 parts as having broken dbm
kselftest/arm64: Correct logging of FPSIMD register read via ptrace
kselftest/arm64: Skip VL_INHERIT tests for unsupported vector types
arm64: errata: Add detection for TRBE trace data corruption
arm64: errata: Add detection for TRBE invalid prohibited states
arm64: errata: Add detection for TRBE ignored system register writes
arm64: Add Cortex-A510 CPU part definition
arm64: extable: fix load_unaligned_zeropad() reg indices
arm64: Mark start_backtrace() notrace and NOKPROBE_SYMBOL
arm64: errata: Update ARM64_ERRATUM_[2119858|2224489] with Cortex-X2 ranges
arm64: Add Cortex-X2 CPU part definition
arm64: vdso: Fix "no previous prototype" warning
Linus Torvalds [Sat, 29 Jan 2022 06:52:27 +0000 (08:52 +0200)]
Merge tag 'fixes-v5.17-lsm-ceph-null' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security sybsystem fix from James Morris:
"Fix NULL pointer crash in LSM via Ceph, from Vivek Goyal"
* tag 'fixes-v5.17-lsm-ceph-null' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security, lsm: dentry_init_security() Handle multi LSM registration
Linus Torvalds [Sat, 29 Jan 2022 06:27:28 +0000 (08:27 +0200)]
Merge tag 'docs-5.17-3' of git://git.lwn.net/linux
Pull documentation fixes from Jonathan Corbet:
"A few documentation fixes for 5.17"
* tag 'docs-5.17-3' of git://git.lwn.net/linux:
docs/vm: Fix typo in *harden*
Documentation: arm: marvell: Extend Avanta list
docs: fix typo in Documentation/kernel-hacking/locking.rst
docs: Hook the RTLA documents into the kernel docs build
Mike Snitzer [Fri, 28 Jan 2022 15:58:40 +0000 (10:58 -0500)]
dm: revert partial fix for redundant bio-based IO accounting
Reverts a1e1cb72d9649 ("dm: fix redundant IO accounting for bios that
need splitting") because it was too narrow in scope (only addressed
redundant 'sectors[]' accounting and not ios, nsecs[], etc).
Mike Snitzer [Fri, 28 Jan 2022 15:58:39 +0000 (10:58 -0500)]
block: add bio_start_io_acct_time() to control start_time
bio_start_io_acct_time() interface is like bio_start_io_acct() that
allows start_time to be passed in. This gives drivers the ability to
defer starting accounting until after IO is issued (but possibily not
entirely due to bio splitting).
Linus Torvalds [Fri, 28 Jan 2022 19:17:58 +0000 (21:17 +0200)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Sixteen patches, mostly minor fixes and updates; however there are
substantive driver bug fixes in pm8001, bnx2fc, zfcp, myrs and qedf"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: myrs: Fix crash in error case
scsi: 53c700: Remove redundant assignment to pointer SCp
scsi: ufs: Treat link loss as fatal error
scsi: ufs: Use generic error code in ufshcd_set_dev_pwr_mode()
scsi: bfa: Remove useless DMA-32 fallback configuration
scsi: hisi_sas: Remove useless DMA-32 fallback configuration
scsi: 3w-sas: Remove useless DMA-32 fallback configuration
scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put()
scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices
scsi: pm8001: Fix bogus FW crash for maxcpus=1
scsi: qedf: Change context reset messages to ratelimited
scsi: qedf: Fix refcount issue when LOGO is received during TMF
scsi: qedf: Add stag_work to all the vports
scsi: ufs: ufshcd-pltfrm: Check the return value of devm_kstrdup()
scsi: target: iscsi: Make sure the np under each tpg is unique
scsi: elx: efct: Don't use GFP_KERNEL under spin lock
Linus Torvalds [Fri, 28 Jan 2022 19:12:07 +0000 (21:12 +0200)]
Merge tag 'efi-urgent-for-v5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- avoid UEFI v2.00+ runtime services on Apple Mac systems, as they have
been reported to cause crashes, and most Macs claim to be EFI v1.10
anyway
- avoid a spurious boot time warning on arm64 systems with 64k pages
* tag 'efi-urgent-for-v5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: runtime: avoid EFIv2 runtime services on Apple x86 machines
efi/libstub: arm64: Fix image check alignment at entry
Core of the problem is that ceph checks for return code from
security_dentry_init_security() and if return code is 0, it assumes
everything is fine and continues to call strlen(name), which crashes.
Typically SELinux LSM returns 0 and sets name to "security.selinux" and
it is not a problem. Or if selinux is not compiled in or disabled, it
returns -EOPNOTSUP and ceph deals with it.
But somehow in this configuration, 0 is being returned and "name" is
not being initialized and that's creating the problem.
Our suspicion is that BPF LSM is registering a hook for
dentry_init_security() and returns hook default of 0.
I have not been able to reproduce it just by doing CONFIG_BPF_LSM=y.
Stephen has tested the patch though and confirms it solves the problem
for him.
dentry_init_security() is written in such a way that it expects only one
LSM to register the hook. Atleast that's the expectation with current code.
If another LSM returns a hook and returns default, it will simply return
0 as of now and that will break ceph.
Hence, suggestion is that change semantics of this hook a bit. If there
are no LSMs or no LSM is taking ownership and initializing security context,
then return -EOPNOTSUP. Also allow at max one LSM to initialize security
context. This hook can't deal with multiple LSMs trying to init security
context. This patch implements this new behavior.
Reported-by: Stephen Muth <smuth4@gmail.com> Tested-by: Stephen Muth <smuth4@gmail.com> Suggested-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: <stable@vger.kernel.org> # 5.16.0 Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>
Linus Torvalds [Fri, 28 Jan 2022 18:44:07 +0000 (20:44 +0200)]
Merge tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These make the buffer handling in pm_show_wakelocks() more robust and
drop an unused hibernation-related function.
Specifics:
- Make the buffer handling in pm_show_wakelocks() more robust by
using sysfs_emit_at() in it to generate output (Greg
Kroah-Hartman).
- Drop register_nosave_region_late() which is not used (Amadeusz
Sławiński)"
* tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Remove register_nosave_region_late()
PM: wakeup: simplify the output logic of pm_show_wakelocks()
Linus Torvalds [Fri, 28 Jan 2022 17:30:35 +0000 (19:30 +0200)]
Merge tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pulltracing fixes from Steven Rostedt:
- Limit mcount build time sorting to only those archs that we know it
works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
* tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't inc err_log entry count if entry allocation fails
tracing: Propagate is_signed to expression
tracing: Fix smatch warning for do while check in event_hist_trigger_parse()
tracing: Fix smatch warning for null glob in event_hist_trigger_parse()
tools/tracing: Update Makefile to build rtla
rtla: Make doc build optional
tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro
tracing: Avoid -Warray-bounds warning for __rel_loc macro
tracing/histogram: Fix a potential memory leak for kstrdup()
ftrace: Have architectures opt-in for mcount build time sorting
Geert Uytterhoeven [Fri, 28 Jan 2022 09:03:57 +0000 (10:03 +0100)]
dt-bindings: interrupt-controller: sifive,plic: Fix number of interrupts
The number of interrupts lacks an upper bound, thus assuming one,
causing properly grouped "interrupts-extended" properties to be flagged
as an error by "make dtbs_check".
Fix this by adding the missing "maxItems", using the architectural
maximum of 15872 interrupts.
Linus Torvalds [Fri, 28 Jan 2022 17:25:24 +0000 (19:25 +0200)]
Merge branch 'ucount-rlimit-fixes-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucount rlimit fix from Eric Biederman.
Make sure the ucounts have a reference to the user namespace it refers
to, so that users that themselves don't carry such a reference around
can safely use the ucount functions.
* 'ucount-rlimit-fixes-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucount: Make get_ucount a safe get_user replacement
Linus Torvalds [Fri, 28 Jan 2022 17:19:22 +0000 (19:19 +0200)]
Merge tag 'rcu-urgent.2022.01.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This fixes a brown-paper-bag bug in RCU tasks that causes things like
BPF and ftrace to fail miserably on systems with non-power-of-two
numbers of CPUs.
It fixes a math error added in 7a30871b6a27 ("rcu-tasks: Introduce
->percpu_enqueue_shift for dynamic queue selection') during the v5.17
merge window. This commit works correctly only on systems with a
power-of-two number of CPUs, which just so happens to be the kind that
rcutorture always uses by default.
This pull request fixes the math so that things also work on systems
that don't happen to have a power-of-two number of CPUs"
* tag 'rcu-urgent.2022.01.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu-tasks: Fix computation of CPU-to-list shift counts
Linus Torvalds [Fri, 28 Jan 2022 17:06:11 +0000 (19:06 +0200)]
Merge tag 'hyperv-fixes-signed-20220128' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Fix screen resolution for hyperv framebuffer (Michael Kelley)
- Fix packet header accounting for balloon driver (Yanming Liu)
* tag 'hyperv-fixes-signed-20220128' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
video: hyperv_fb: Fix validation of screen resolution
Drivers: hv: balloon: account for vmbus packet header in max_pkt_size
Linus Torvalds [Fri, 28 Jan 2022 17:00:26 +0000 (19:00 +0200)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Two larger x86 series:
- Redo incorrect fix for SEV/SMAP erratum
- Windows 11 Hyper-V workaround
Other x86 changes:
- Various x86 cleanups
- Re-enable access_tracking_perf_test
- Fix for #GP handling on SVM
- Fix for CPUID leaf 0Dh in KVM_GET_SUPPORTED_CPUID
- Fix for ICEBP in interrupt shadow
- Avoid false-positive RCU splat
- Enable Enlightened MSR-Bitmap support for real
ARM:
- Correctly update the shadow register on exception injection when
running in nVHE mode
- Correctly use the mm_ops indirection when performing cache
invalidation from the page-table walker
- Restrict the vgic-v3 workaround for SEIS to the two known broken
implementations
Generic code changes:
- Dead code cleanup"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
KVM: eventfd: Fix false positive RCU usage warning
KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use
KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread()
KVM: nVMX: Rename vmcs_to_field_offset{,_table}
KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS
selftests: kvm: check dynamic bits against KVM_X86_XCOMP_GUEST_SUPP
KVM: x86: add system attribute to retrieve full set of supported xsave states
KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr
selftests: kvm: move vm_xsave_req_perm call to amx_test
KVM: x86: Sync the states size with the XCR0/IA32_XSS at, any time
KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS
KVM: x86: Keep MSR_IA32_XSS unchanged for INIT
KVM: x86: Free kvm_cpuid_entry2 array on post-KVM_RUN KVM_SET_CPUID{,2}
KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02
KVM: selftests: Don't skip L2's VMCALL in SMM test for SVM guest
KVM: x86: Check .flags in kvm_cpuid_check_equal() too
KVM: x86: Forcibly leave nested virt when SMM state is toggled
KVM: SVM: drop unnecessary code in svm_hv_vmcb_dirty_nested_enlightenments()
KVM: SVM: hyper-v: Enable Enlightened MSR-Bitmap support for real
...
Linus Torvalds [Fri, 28 Jan 2022 16:50:05 +0000 (18:50 +0200)]
Merge tag 's390-5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix loading of modules with lots of relocations and add a regression
test for it.
- Fix machine check handling for vector validity and guarded storage
validity failures in KVM guests.
- Fix hypervisor performance data to include z/VM guests with access
control group set.
- Fix z900 build problem in uaccess code.
- Update defconfigs.
* tag 's390-5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/hypfs: include z/VM guests with access control group set
s390: update defconfigs
s390/module: test loading modules with a lot of relocations
s390/module: fix loading modules with a lot of relocations
s390/uaccess: fix compile error
s390/nmi: handle vector validity failures for KVM guests
s390/nmi: handle guarded storage validity failures for KVM guests
Linus Torvalds [Fri, 28 Jan 2022 16:36:42 +0000 (18:36 +0200)]
Merge tag 'ceph-for-5.17-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A ZERO_SIZE_PTR dereference fix from Xiubo and two fixes for async
creates interacting with pool namespace-constrained OSD permissions
from Jeff (marked for stable)"
* tag 'ceph-for-5.17-rc2' of git://github.com/ceph/ceph-client:
ceph: set pool_ns in new inode layout for async creates
ceph: properly put ceph_string reference after async create attempt
ceph: put the requests/sessions when it fails to alloc memory
Linus Torvalds [Fri, 28 Jan 2022 08:00:29 +0000 (10:00 +0200)]
ocfs2: fix subdirectory registration with register_sysctl()
The kernel test robot reports that commit c42ff46f97c1 ("ocfs2: simplify
subdirectory registration with register_sysctl()") is broken, and
results in kernel warning messages like
sysctl table check failed: fs/ocfs2/nm Not a file
sysctl table check failed: fs/ocfs2/nm No proc_handler
sysctl table check failed: fs/ocfs2/nm bogus .mode 0555
and in fact this was already reported back in linux-next, but nobody
seems to have reacted to that report. Possibly that original report
only ever made it to the lkp list.
The problem seems to be that the simplification didn't actually go far
enough, and should have converted the whole directory path to the final
sysctl file, rather than just the two first components.
Catalin Marinas [Fri, 28 Jan 2022 16:14:06 +0000 (16:14 +0000)]
Merge tag 'trbe-cortex-a510-errata' of gitolite.kernel.org:pub/scm/linux/kernel/git/coresight/linux into for-next/fixes
coresight: trbe: Workaround Cortex-A510 erratas
This pull request is providing arm64 definitions to support
TRBE Cortex-A510 erratas.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
* tag 'trbe-cortex-a510-errata' of gitolite.kernel.org:pub/scm/linux/kernel/git/coresight/linux:
arm64: errata: Add detection for TRBE trace data corruption
arm64: errata: Add detection for TRBE invalid prohibited states
arm64: errata: Add detection for TRBE ignored system register writes
arm64: Add Cortex-A510 CPU part definition
Linus Torvalds [Fri, 28 Jan 2022 15:51:31 +0000 (17:51 +0200)]
Merge tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"Fixes for userspace breakage caused by fsnotify changes ~3 years ago
and one fanotify cleanup"
* tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: fix fsnotify hooks in pseudo filesystems
fsnotify: invalidate dcache before IN_DELETE event
fanotify: remove variable set but not used
Linus Torvalds [Fri, 28 Jan 2022 15:19:49 +0000 (17:19 +0200)]
Merge tag 'fs_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull udf and quota fixes from Jan Kara:
"Fixes for crashes in UDF when inode expansion fails and one quota
cleanup"
* tag 'fs_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
quota: cleanup double word in comment
udf: Restore i_lenAlloc when inode expansion fails
udf: Fix NULL ptr deref when converting from inline format
Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
a false positive warning. So use hlist_for_each_entry_srcu() instead of
hlist_for_each_entry_rcu().
Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Wed, 12 Jan 2022 17:01:34 +0000 (18:01 +0100)]
KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use
Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
Enlightened VMCS interface is in use:
"Any VMREAD or VMWRITE instructions while an enlightened VMCS is
active is unsupported and can result in unexpected behavior.""
Windows 11 + WSL2 seems to ignore this, attempts to VMREAD VMCS field
0x4404 ("VM-exit interruption information") are observed. Failing
these attempts with nested_vmx_failInvalid() makes such guests
unbootable.
Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
eventually but for the time being we need a workaround. (Temporary) allow
VMREAD to get data from the currently loaded Enlightened VMCS.
Note: VMWRITE instructions remain forbidden, it is not clear how to
handle them properly and hopefully won't ever be needed.
Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Wed, 12 Jan 2022 17:01:33 +0000 (18:01 +0100)]
KVM: nVMX: Implement evmcs_field_offset() suitable for handle_vmread()
In preparation to allowing reads from Enlightened VMCS from
handle_vmread(), implement evmcs_field_offset() to get the correct
read offset. get_evmcs_offset(), which is being used by KVM-on-Hyper-V,
is almost what's needed but a few things need to be adjusted. First,
WARN_ON() is unacceptable for handle_vmread() as any field can (in
theory) be supplied by the guest and not all fields are defined in
eVMCS v1. Second, we need to handle 'holes' in eVMCS (missing fields).
It also sounds like a good idea to WARN_ON() if such fields are ever
accessed by KVM-on-Hyper-V.
Implement dedicated evmcs_field_offset() helper.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Wed, 12 Jan 2022 17:01:32 +0000 (18:01 +0100)]
KVM: nVMX: Rename vmcs_to_field_offset{,_table}
vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
blob which is not supposed to be accessed directly. In fact,
vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.
Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.
No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Wed, 12 Jan 2022 17:01:31 +0000 (18:01 +0100)]
KVM: nVMX: eVMCS: Filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER
Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field,
PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes
sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too.
Note, none of the currently existing Windows/Hyper-V versions are known
to enable 'save VMX-preemption timer value' when eVMCS is in use, the
change is aimed at making the filtering future proof.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Wed, 12 Jan 2022 17:01:30 +0000 (18:01 +0100)]
KVM: nVMX: Also filter MSR_IA32_VMX_TRUE_PINBASED_CTLS when eVMCS
Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS,
MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair,
MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way
MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely
on 'true' MSR data.
Note, none of the currently existing Windows/Hyper-V versions are known
to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change
is aimed at making the filtering future proof.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 26 Jan 2022 12:49:45 +0000 (07:49 -0500)]
KVM: x86: add system attribute to retrieve full set of supported xsave states
Because KVM_GET_SUPPORTED_CPUID is meant to be passed (by simple-minded
VMMs) to KVM_SET_CPUID2, it cannot include any dynamic xsave states that
have not been enabled. Probing those, for example so that they can be
passed to ARCH_REQ_XCOMP_GUEST_PERM, requires a new ioctl or arch_prctl.
The latter is in fact worse, even though that is what the rest of the
API uses, because it would require supported_xcr0 to be moved from the
KVM module to the kernel just for this use. In addition, the value
would be nonsensical (or an error would have to be returned) until
the KVM module is loaded in.
Therefore, to limit the growth of system ioctls, add a /dev/kvm
variant of KVM_{GET,HAS}_DEVICE_ATTR, and implement it in x86
with just one group (0) and attribute (KVM_X86_XCOMP_GUEST_SUPP).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Thu, 27 Jan 2022 15:31:53 +0000 (07:31 -0800)]
KVM: x86: Add a helper to retrieve userspace address from kvm_device_attr
Add a helper to handle converting the u64 userspace address embedded in
struct kvm_device_attr into a userspace pointer, it's all too easy to
forget the intermediate "unsigned long" cast as well as the truncation
check.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark Brown [Mon, 24 Jan 2022 17:55:27 +0000 (17:55 +0000)]
kselftest/arm64: Correct logging of FPSIMD register read via ptrace
There's a cut'n'paste error in the logging for our test for reading register
state back via ptrace, correctly say that we did a read instead of a write.
Mark Brown [Mon, 24 Jan 2022 17:55:26 +0000 (17:55 +0000)]
kselftest/arm64: Skip VL_INHERIT tests for unsupported vector types
Currently we unconditionally test the ability to set the vector length
inheritance flag via ptrace meaning that we generate false failures on
systems that don't support SVE when we attempt to set the vector length
there. Check the hwcap and mark the tests as skipped when it's not present.
Fixes: 0ba1ce1e8605 ("selftests: arm64: Add coverage of ptrace flags for SVE VL inheritance") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Link: https://lore.kernel.org/r/20220124175527.3260234-2-broonie@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Torvalds [Fri, 28 Jan 2022 09:47:05 +0000 (11:47 +0200)]
Merge tag 'ata-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA fix from Damien Le Moal:
"A single fix for 5.17-rc2, adding a missing resource allocation error
check in the pata_platform driver, from Zhou"
* tag 'ata-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: pata_platform: Fix a NULL pointer dereference in __pata_platform_probe()
Linus Torvalds [Fri, 28 Jan 2022 07:48:20 +0000 (09:48 +0200)]
Merge tag 'hwmon-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Fix crash in nct6775 driver
- Prevent divide by zero in adt7470 driver
- Fix conditional compile warning in pmbus/ir38064 driver
- Various minor fixes in lm90 driver
* tag 'hwmon-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (nct6775) Fix crash in clear_caseopen
hwmon: (adt7470) Prevent divide by zero in adt7470_fan_write()
hwmon: (pmbus/ir38064) Mark ir38064_of_match as __maybe_unused
hwmon: (lm90) Fix sysfs and udev notifications
hwmon: (lm90) Mark alert as broken for MAX6646/6647/6649
hwmon: (lm90) Mark alert as broken for MAX6680
hwmon: (lm90) Mark alert as broken for MAX6654
hwmon: (lm90) Re-enable interrupts after alert clears
hwmon: (lm90) Reduce maximum conversion rate for G781
* tag 'drm-fixes-2022-01-28' of git://anongit.freedesktop.org/drm/drm: (25 commits)
drm/privacy-screen: honor acpi=off in detect_thinkpad_privacy_screen
Revert "drm/ast: Support 1600x900 with 108MHz PCLK"
drm/amdgpu/display: Remove t_srx_delay_us.
drm/amd/display: Wrap dcn301_calculate_wm_and_dlg for FPU.
drm/amd/display: Fix FP start/end for dcn30_internal_validate_bw.
drm/amd/display/dc/calcs/dce_calcs: Fix a memleak in calculate_bandwidth()
drm/amdgpu/display: use msleep rather than udelay for long delays
drm/amdgpu/display: adjust msleep limit in dp_wait_for_training_aux_rd_interval
drm/amdgpu: filter out radeon secondary ids as well
drm/amd/display: change FIFO reset condition to embedded display only
drm/amd/display: Correct MPC split policy for DCN301
drm/amd/display: Fix for otg synchronization logic
drm/etnaviv: relax submit size limits
drm/msm/gpu: Cancel idle/boost work on suspend
drm/msm/gpu: Wait for idle before suspending
drm/atomic: Add the crtc to affected crtc only if uapi.enable = true
drm/msm/dsi: invalid parameter check in msm_dsi_phy_enable
drm/msm/a6xx: Add missing suspend_count increment
drm/msm: Fix wrong size calculation
drm/msm/dpu: invalid parameter check in dpu_setup_dspp_pcc
...