www.infradead.org Git - users/dwmw2/linux.git/log

]> www.infradead.org Git - users/dwmw2/linux.git/log

projects / users / dwmw2 / linux.git / log

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:26 +0000 (11:36 +0100)]

KVM: x86: Make Hyper-V emulation optional

Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.

Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.

Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:25 +0000 (11:36 +0100)]

KVM: nVMX: Move guest_cpuid_has_evmcs() to hyperv.h

In preparation for making Hyper-V emulation optional, move Hyper-V specific
guest_cpuid_has_evmcs() to hyperv.h.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-12-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:24 +0000 (11:36 +0100)]

KVM: selftests: Fix vmxon_pa == vmcs12_pa == -1ull nVMX testcase for !eVMCS

The "vmxon_pa == vmcs12_pa == -1ull" test happens to work by accident: as
Enlightened VMCS is always supported, set_default_vmx_state() adds
'KVM_STATE_NESTED_EVMCS' to 'flags' and the following branch of
vmx_set_nested_state() is executed:

        if ((kvm_state->flags & KVM_STATE_NESTED_EVMCS) &&
            (!guest_can_use(vcpu, X86_FEATURE_VMX) ||
             !vmx->nested.enlightened_vmcs_enabled))
                        return -EINVAL;

as 'enlightened_vmcs_enabled' is false. In fact, "vmxon_pa == vmcs12_pa ==
-1ull" is a valid state when not tainted by wrong flags so the test should
aim for this branch:

        if (kvm_state->hdr.vmx.vmxon_pa == INVALID_GPA)
                return 0;

Test all this properly:
- Without KVM_STATE_NESTED_EVMCS in the flags, the expected return value is
'0'.
- With KVM_STATE_NESTED_EVMCS flag (when supported) set, the expected
return value is '-EINVAL' prior to enabling eVMCS and '0' after.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-11-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:23 +0000 (11:36 +0100)]

KVM: selftests: Make Hyper-V tests explicitly require KVM Hyper-V support

In preparation for conditional Hyper-V emulation enablement in KVM, make
Hyper-V specific tests skip gracefully instead of failing when KVM support
for emulating Hyper-V is not there.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-10-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:22 +0000 (11:36 +0100)]

KVM: nVMX: Split off helper for emulating VMCLEAR on Hyper-V eVMCS

To avoid overloading handle_vmclear() with Hyper-V specific details and to
prepare the code to making Hyper-V emulation optional, create a dedicated
nested_evmcs_handle_vmclear() helper.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-9-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:21 +0000 (11:36 +0100)]

KVM: x86: Introduce helper to handle Hyper-V paravirt TLB flush requests

As a preparation to making Hyper-V emulation optional, introduce a helper
to handle pending KVM_REQ_HV_TLB_FLUSH requests.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-8-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:20 +0000 (11:36 +0100)]

KVM: VMX: Split off hyperv_evmcs.{ch}

Some Enlightened VMCS related code is needed both by Hyper-V on KVM and
KVM on Hyper-V. As a preparation to making Hyper-V emulation optional,
create dedicated 'hyperv_evmcs.{ch}' files which are used by both.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-7-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:19 +0000 (11:36 +0100)]

KVM: x86: Introduce helper to check if vector is set in Hyper-V SynIC

As a preparation to making Hyper-V emulation optional, create a dedicated
kvm_hv_synic_has_vector() helper to avoid extra ifdefs in lapic.c.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-6-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:18 +0000 (11:36 +0100)]

KVM: x86: Introduce helper to check if auto-EOI is set in Hyper-V SynIC

As a preparation to making Hyper-V emulation optional, create a dedicated
kvm_hv_synic_auto_eoi_set() helper to avoid extra ifdefs in lapic.c

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-5-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:17 +0000 (11:36 +0100)]

KVM: VMX: Split off vmx_onhyperv.{ch} from hyperv.{ch}

hyperv.{ch} is currently a mix of stuff which is needed by both Hyper-V on
KVM and KVM on Hyper-V. As a preparation to making Hyper-V emulation
optional, put KVM-on-Hyper-V specific code into dedicated files.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-4-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:16 +0000 (11:36 +0100)]

KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context

Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.

While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Vitaly Kuznetsov [Tue, 5 Dec 2023 10:36:15 +0000 (11:36 +0100)]

KVM: x86/xen: Remove unneeded xen context from kvm_arch when !CONFIG_KVM_XEN

Saving a few bytes of memory per KVM VM is certainly great but what's more
important is the ability to see where the code accesses Xen emulation
context while CONFIG_KVM_XEN is not enabled. Currently, kvm_cpu_get_extint()
is the only such place and it is harmless: kvm_xen_has_interrupt() always
returns '0' when !CONFIG_KVM_XEN.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Sean Christopherson [Wed, 18 Oct 2023 19:23:25 +0000 (12:23 -0700)]

KVM: x86/mmu: Declare flush_remote_tlbs{_range}() hooks iff HYPERV!=n

Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when
running under Hyper-V if and only if CONFIG_HYPERV!=n. Wrapping yet more
code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional
branches, and makes it super obvious why the hooks *might* be valid.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

commit | commitdiff | tree

Paolo Bonzini [Tue, 21 Nov 2023 16:24:08 +0000 (11:24 -0500)]

selftests/kvm: fix compilation on non-x86_64 platforms

MEM_REGION_SLOT and MEM_REGION_GPA are not really needed in
test_invalid_memory_region_flags; the VM never runs and there are no
other slots, so it is okay to use slot 0 and place it at address
zero. This fixes compilation on architectures that do not
define them.

Fixes: 5d74316466f4 ("KVM: selftests: Add a memory region subtest to validate invalid flags")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Paolo Bonzini [Mon, 13 Nov 2023 10:58:30 +0000 (05:58 -0500)]

Merge branch 'kvm-guestmemfd' into HEAD

Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.

The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).

commit | commitdiff | tree

Sean Christopherson [Tue, 31 Oct 2023 00:20:49 +0000 (17:20 -0700)]

KVM: selftests: Add a memory region subtest to validate invalid flags

Add a subtest to set_memory_region_test to verify that KVM rejects invalid
flags and combinations with -EINVAL. KVM might or might not fail with
EINVAL anyways, but we can at least try.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231031002049.3915752-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Ackerley Tng [Fri, 27 Oct 2023 18:22:17 +0000 (11:22 -0700)]

KVM: selftests: Test KVM exit behavior for private memory/access

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out the similarities with set_memory_region_test]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-36-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:22:16 +0000 (11:22 -0700)]

KVM: selftests: Add basic selftest for guest_memfd()

Add a selftest to verify the basic functionality of guest_memfd():

+ file descriptor created with the guest_memfd() ioctl does not allow
  read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate on the fd checks that offset/length on
  fallocate(FALLOC_FL_PUNCH_HOLE) should be page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected
+ file size and inode are unique (the innocuous-sounding
  anon_inode_getfile() backs all files with a single inode...)

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-35-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:22:15 +0000 (11:22 -0700)]

KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

- Non-guest_memfd() file descriptor for private memory
- guest_memfd() from different VM
- Overlapping bindings
- Unaligned bindings

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-34-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:22:14 +0000 (11:22 -0700)]

KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate of features that are unique to "version 2" of "set user
memory region", e.g. do negative testing on gmem_fd and gmem_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code need for basic usage.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-33-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Vishal Annapurve [Fri, 27 Oct 2023 18:22:13 +0000 (11:22 -0700)]

KVM: selftests: Add x86-only selftest for private memory conversions

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

- Shared memory is visible to host userspace
- Private memory is not visible to host userspace
- Host userspace and guest can communicate over shared memory
- Data in shared backing is preserved across conversions (test's
   host userspace doesn't free the data)
- Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well.  But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.

In addition to the more obvious tests, verify that PUNCH_HOLE actually
frees memory.  Directly verifying that KVM frees memory is impractical, if
it's even possible, so instead indirectly verify memory is freed by
asserting that the guest reads zeroes after a PUNCH_HOLE.  E.g. if KVM
zaps SPTEs but doesn't actually punch a hole in the inode, the subsequent
read will still see the previous value.  And obviously punching a hole
shouldn't cause explosions.

Let the user specify the number of memslots in the private mem conversion
test, i.e. don't require the number of memslots to be '1' or "nr_vcpus".
Creating more memslots than vCPUs is particularly interesting, e.g. it can
result in a single KVM_SET_MEMORY_ATTRIBUTES spanning multiple memslots.
To keep the math reasonable, align each vCPU's chunk to at least 2MiB (the
size is 2MiB+4KiB), and require the total size to be cleanly divisible by
the number of memslots.  The goal is to be able to validate that KVM plays
nice with multiple memslots, being able to create a truly arbitrary number
of memslots doesn't add meaningful value, i.e. isn't worth the cost.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:12 +0000 (11:22 -0700)]

KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-31-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:11 +0000 (11:22 -0700)]

KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM. "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD
without needing an entirely separate set of helpers. At this time,
guest_memfd is effectively usable only by confidential VM types in the
form of guest private memory, and it's expected that x86 will double down
and require unique VM types for TDX and SNP guests.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Vishal Annapurve [Fri, 27 Oct 2023 18:22:10 +0000 (11:22 -0700)]

KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-29-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Vishal Annapurve [Fri, 27 Oct 2023 18:22:09 +0000 (11:22 -0700)]

KVM: selftests: Add helpers to convert guest memory b/w private and shared

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate(). Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of truth.
Provide allocate() helpers so that tests can mimic a userspace that frees
private memory on conversion, e.g. to prioritize memory usage over
performance.

Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-28-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:08 +0000 (11:22 -0700)]

KVM: selftests: Add support for creating private memslots

Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2. Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require update all tests that add memory regions.

Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file. If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) so that
the caller can safely close its copy of the fd without having to first
destroy memslots.

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-27-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:07 +0000 (11:22 -0700)]

KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
support for guest private memory can be added without needing an entirely
separate set of helpers.

Note, this obviously makes selftests backwards-incompatible with older KVM
versions from this point forward.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-26-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:06 +0000 (11:22 -0700)]

KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
(probably why it's unused). If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-25-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:05 +0000 (11:22 -0700)]

KVM: x86: Add support for "protected VMs" that can utilize private memory

Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:04 +0000 (11:22 -0700)]

KVM: Allow arch code to track number of memslot address spaces per VM

Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidentials VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:03 +0000 (11:22 -0700)]

KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:22:02 +0000 (11:22 -0700)]

KVM: x86/mmu: Handle page fault for private memory

Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory.  For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.

For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes.  To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace.  Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.

Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits.  In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.

Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.

Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:22:01 +0000 (11:22 -0700)]

KVM: x86: Disallow hugepages when memory attributes are mixed

Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Tracking whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:22:00 +0000 (11:22 -0700)]

KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure fields in conjunction with a
a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns  -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-19-seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Mon, 13 Nov 2023 10:42:34 +0000 (05:42 -0500)]

KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 to was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to it demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrashing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Paolo Bonzini [Fri, 3 Nov 2023 10:47:51 +0000 (06:47 -0400)]

fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()

The call to the inode_init_security_anon() LSM hook is not the sole
reason to use anon_inode_getfile_secure() or anon_inode_getfd_secure().
For example, the functions also allow one to create a file with non-zero
size, without needing a full-blown filesystem.  In this case, you don't
need a "secure" version, just unique inodes; the current name of the
functions is confusing and does not explain well the difference with
the more "standard" anon_inode_getfile() and anon_inode_getfd().

Of course, there is another side of the coin; neither io_uring nor
userfaultfd strictly speaking need distinct inodes, and it is not
that clear anymore that anon_inode_create_get{file,fd}() allow the LSM
to intercept and block the inode's creation.  If one was so inclined,
anon_inode_getfile_secure() and anon_inode_getfd_secure() could be kept,
using the shared inode or a new one depending on CONFIG_SECURITY.
However, this is probably overkill, and potentially a cause of bugs in
different configurations.  Therefore, just add a comment to io_uring
and userfaultfd explaining the choice of the function.

While at it, remove the export for what is now anon_inode_create_getfd().
There is no in-tree module that uses it, and the old name is gone anyway.
If anybody actually needs the symbol, they can ask or they can just use
anon_inode_create_getfile(), which will be exported very soon for use
in KVM.

Suggested-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:56 +0000 (11:21 -0700)]

mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance. KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running. To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox <willy@infradead.org>
Co-developed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:21:55 +0000 (11:21 -0700)]

KVM: Introduce per-page memory attributes

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by
KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory
attributes to a guest memory range.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation. For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Fuad Tabba <tabba@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Cc: Mickaël Salaün <mic@digikod.net>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:53 +0000 (11:21 -0700)]

KVM: Drop .on_unlock() mmu_notifier hook

Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
notifying arch code that memory has been reclaimed. Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:52 +0000 (11:21 -0700)]

KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found. Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:21:51 +0000 (11:21 -0700)]

KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
    to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
    that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-10-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:50 +0000 (11:21 -0700)]

KVM: Introduce KVM_SET_USER_MEMORY_REGION2

Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail. The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.

Cc: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-9-seanjc@google.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:49 +0000 (11:21 -0700)]

KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:48 +0000 (11:21 -0700)]

KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU

Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs). PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:47 +0000 (11:21 -0700)]

KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER

Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:46 +0000 (11:21 -0700)]

KVM: WARN if there are dangling MMU invalidations at VM destruction

Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as the
bug that *almost* escaped initial guest_memfd development.

Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Chao Peng [Fri, 27 Oct 2023 18:21:45 +0000 (11:21 -0700)]

KVM: Use gfn instead of hva for mmu_notifier_retry

Currently in mmu_notifier invalidate path, hva range is recorded and then
checked against by mmu_invalidate_retry_hva() in the page fault handling
path. However, for the soon-to-be-introduced private memory, a page fault
may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Message-Id: <20231027182217.3615211-4-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:44 +0000 (11:21 -0700)]

KVM: Assert that mmu_invalidate_in_progress *never* goes negative

Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-3-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Sean Christopherson [Fri, 27 Oct 2023 18:21:43 +0000 (11:21 -0700)]

KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges

Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address. Rename the
handler typedef too (arguably it should always have been gfn_handler_t).

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-2-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

commit | commitdiff | tree

Linus Torvalds [Mon, 13 Nov 2023 00:19:07 +0000 (16:19 -0800)]

Linux 6.7-rc1

commit | commitdiff | tree

Miri Korenblit [Sun, 12 Nov 2023 14:36:20 +0000 (16:36 +0200)]

wifi: iwlwifi: fix system commands group ordering

The commands should be sorted inside the group definition.
Fix the ordering so we won't get following warning:
WARN_ON(iwl_cmd_groups_verify_sorted(trans_cfg))

Link: https://lore.kernel.org/regressions/2fa930bb-54dd-4942-a88d-05a47c8e9731@gmail.com/
Link: https://lore.kernel.org/linux-wireless/CAHk-=wix6kqQ5vHZXjOPpZBfM7mMm9bBZxi2Jh7XnaKCqVf94w@mail.gmail.com/
Fixes: b6e3d1ba4fcf ("wifi: iwlwifi: mvm: implement new firmware API for statistics")
Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Tested-by: Damian Tometzki <damian@riscv-rocks.de>
Acked-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Linus Torvalds [Sun, 12 Nov 2023 19:05:31 +0000 (11:05 -0800)]

Merge tag 'parisc-for-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc architecture fixes from Helge Deller:

- Include the upper 5 address bits when inserting TLB entries on a
   64-bit kernel.

   On physical machines those are ignored, but in qemu it's nice to have
   them included and to be correct.

- Stop the 64-bit kernel and show a warning if someone tries to boot on
   a machine with a 32-bit CPU

- Fix a "no previous prototype" warning in parport-gsc

* tag 'parisc-for-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Prevent booting 64-bit kernels on PA1.x machines
  parport: gsc: mark init function static
  parisc/pgtable: Do not drop upper 5 address bits of physical address

commit | commitdiff | tree

Linus Torvalds [Sun, 12 Nov 2023 18:58:08 +0000 (10:58 -0800)]

Merge tag 'loongarch-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch updates from Huacai Chen:

- support PREEMPT_DYNAMIC with static keys

- relax memory ordering for atomic operations

- support BPF CPU v4 instructions for LoongArch

- some build and runtime warning fixes

* tag 'loongarch-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  selftests/bpf: Enable cpu v4 tests for LoongArch
  LoongArch: BPF: Support signed mod instructions
  LoongArch: BPF: Support signed div instructions
  LoongArch: BPF: Support 32-bit offset jmp instructions
  LoongArch: BPF: Support unconditional bswap instructions
  LoongArch: BPF: Support sign-extension mov instructions
  LoongArch: BPF: Support sign-extension load instructions
  LoongArch: Add more instruction opcodes and emit_* helpers
  LoongArch/smp: Call rcutree_report_cpu_starting() earlier
  LoongArch: Relax memory ordering for atomic operations
  LoongArch: Mark __percpu functions as always inline
  LoongArch: Disable module from accessing external data directly
  LoongArch: Support PREEMPT_DYNAMIC with static keys

commit | commitdiff | tree

Linus Torvalds [Sun, 12 Nov 2023 18:50:38 +0000 (10:50 -0800)]

Merge tag 'powerpc-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

- Finish a refactor of pgprot_framebuffer() which dependend
   on some changes that were merged via the drm tree

- Fix some kernel-doc warnings to quieten the bots

Thanks to Nathan Lynch and Thomas Zimmermann.

* tag 'powerpc-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/rtas: Fix ppc_rtas_rmo_buf_show() kernel-doc
  powerpc/pseries/rtas-work-area: Fix rtas_work_area_reserve_arena() kernel-doc
  powerpc/fb: Call internal __phys_mem_access_prot() in fbdev code
  powerpc: Remove file parameter from phys_mem_access_prot()
  powerpc/machdep: Remove trailing whitespaces

commit | commitdiff | tree

Linus Torvalds [Sun, 12 Nov 2023 01:17:22 +0000 (17:17 -0800)]

Merge tag '6.7-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- ctime caching fix (for setxattr)

- encryption fix

- DNS resolver mount fix

- debugging improvements

- multichannel fixes including cases where server stops or starts
   supporting multichannel after mount

- reconnect fix

- minor cleanups

* tag '6.7-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko
  cifs: handle when server stops supporting multichannel
  cifs: handle when server starts supporting multichannel
  Missing field not being returned in ioctl CIFS_IOC_GET_MNT_INFO
  smb3: allow dumping session and tcon id to improve stats analysis and debugging
  smb: client: fix mount when dns_resolver key is not available
  smb3: fix caching of ctime on setxattr
  smb3: minor cleanup of session handling code
  cifs: reconnect work should have reference on server struct
  cifs: do not pass cifs_sb when trying to add channels
  cifs: account for primary channel in the interface list
  cifs: distribute channels across interfaces based on speed
  cifs: handle cases where a channel is closed
  smb3: more minor cleanups for session handling routines
  smb3: minor RDMA cleanup
  cifs: Fix encryption of cleared, but unset rq_iter data buffers

commit | commitdiff | tree

Linus Torvalds [Sat, 11 Nov 2023 00:35:04 +0000 (16:35 -0800)]

Merge tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

- Documentation update: Add a note about argument and return value
   fetching is the best effort because it depends on the type.

- objpool: Fix to make internal global variables static in
   test_objpool.c.

- kprobes: Unify kprobes_exceptions_nofify() prototypes. There are the
   same prototypes in asm/kprobes.h for some architectures, but some of
   them are missing the prototype and it causes a warning. So move the
   prototype into linux/kprobes.h.

- tracing: Fix to check the tracepoint event and return event at
   parsing stage. The tracepoint event doesn't support %return but if
   $retval exists, it will be converted to %return silently. This finds
   that case and rejects it.

- tracing: Fix the order of the descriptions about the parameters of
   __kprobe_event_gen_cmd_start() to be consistent with the argument
   list of the function.

* tag 'probes-fixes-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/kprobes: Fix the order of argument descriptions
  tracing: fprobe-event: Fix to check tracepoint event and return
  kprobes: unify kprobes_exceptions_nofify() prototypes
  lib: test_objpool: make global variables static
  Documentation: tracing: Add a note about argument and retval access

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 23:07:01 +0000 (15:07 -0800)]

Merge tag 'fbdev-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes and cleanups from Helge Deller:

- fix double free and resource leaks in imsttfb

- lots of remove callback cleanups and section mismatch fixes in
   omapfb, amifb and atmel_lcdfb

- error code fix and memparse simplification in omapfb

* tag 'fbdev-for-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: (31 commits)
  fbdev: fsl-diu-fb: mark wr_reg_wa() static
  fbdev: amifb: Convert to platform remove callback returning void
  fbdev: amifb: Mark driver struct with __refdata to prevent section mismatch warning
  fbdev: hyperv_fb: fix uninitialized local variable use
  fbdev: omapfb/tpd12s015: Convert to platform remove callback returning void
  fbdev: omapfb/tfp410: Convert to platform remove callback returning void
  fbdev: omapfb/sharp-ls037v7dw01: Convert to platform remove callback returning void
  fbdev: omapfb/opa362: Convert to platform remove callback returning void
  fbdev: omapfb/hdmi: Convert to platform remove callback returning void
  fbdev: omapfb/dvi: Convert to platform remove callback returning void
  fbdev: omapfb/dsi-cm: Convert to platform remove callback returning void
  fbdev: omapfb/dpi: Convert to platform remove callback returning void
  fbdev: omapfb/analog-tv: Convert to platform remove callback returning void
  fbdev: atmel_lcdfb: Convert to platform remove callback returning void
  fbdev: omapfb/tpd12s015: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/tfp410: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/sharp-ls037v7dw01: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/opa362: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/hdmi: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  fbdev: omapfb/dvi: Don't put .remove() in .exit.text and drop suppress_bind_attrs
  ...

commit | commitdiff | tree

Yujie Liu [Tue, 31 Oct 2023 04:13:05 +0000 (12:13 +0800)]

tracing/kprobes: Fix the order of argument descriptions

The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.

int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
const char *name, const char *loc, ...)

Link: https://lore.kernel.org/all/20231031041305.3363712-1-yujie.liu@intel.com/
Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Yujie Liu <yujie.liu@intel.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 22:59:30 +0000 (14:59 -0800)]

Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:
"Dave's VPN to the big machine died, so it's on me to do fixes pr this
  and next week while everyone else is at plumbers.

   - big pile of amd fixes, but mostly for hw support newly added in 6.7

   - i915 fixes, mostly minor things

   - qxl memory leak fix

   - vc4 uaf fix in mock helpers

   - syncobj fix for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE"

* tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm: (78 commits)
  drm/amdgpu: fix error handling in amdgpu_vm_init
  drm/amdgpu: Fix possible null pointer dereference
  drm/amdgpu: move UVD and VCE sched entity init after sched init
  drm/amdgpu: move kfd_resume before the ip late init
  drm/amd: Explicitly check for GFXOFF to be enabled for s0ix
  drm/amdgpu: Change WREG32_RLC to WREG32_SOC15_RLC where inst != 0 (v2)
  drm/amdgpu: Use correct KIQ MEC engine for gfx9.4.3 (v5)
  drm/amdgpu: add smu v13.0.6 pcs xgmi ras error query support
  drm/amdgpu: fix software pci_unplug on some chips
  drm/amd/display: remove duplicated argument
  drm/amdgpu: correct mca debugfs dump reg list
  drm/amdgpu: correct acclerator check architecutre dump
  drm/amdgpu: add pcs xgmi v6.4.0 ras support
  drm/amdgpu: Change extended-scope MTYPE on GC 9.4.3
  drm/amdgpu: disable smu v13.0.6 mca debug mode by default
  drm/amdgpu: Support multiple error query modes
  drm/amdgpu: refine smu v13.0.6 mca dump driver
  drm/amdgpu: Do not program PF-only regs in hdp_v4_0.c under SRIOV (v2)
  drm/amdgpu: Skip PCTL0_MMHUB_DEEPSLEEP_IB write in jpegv4.0.3 under SRIOV
  drm: amd: Resolve Sphinx unexpected indentation warning
  ...

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 20:22:14 +0000 (12:22 -0800)]

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:
"Mostly PMU fixes and a reworking of the pseudo-NMI disabling on broken
  MediaTek firmware:

   - Move the MediaTek GIC quirk handling from irqchip to core. Before
     the merging window commit 44bd78dd2b88 ("irqchip/gic-v3: Disable
     pseudo NMIs on MediaTek devices w/ firmware issues") temporarily
     addressed this issue. Fixed now at a deeper level in the arch code

   - Reject events meant for other PMUs in the CoreSight PMU driver,
     otherwise some of the core PMU events would disappear

   - Fix the Armv8 PMUv3 driver driver to not truncate 64-bit registers,
     causing some events to be invisible

   - Remove duplicate declaration of __arm64_sys##name following the
     patch to avoid prototype warning for syscalls

   - Typos in the elf_hwcap documentation"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64/syscall: Remove duplicate declaration
  Revert "arm64: smp: avoid NMI IPIs with broken MediaTek FW"
  arm64: Move MediaTek GIC quirk handling from irqchip to core
  arm64/arm: arm_pmuv3: perf: Don't truncate 64-bit registers
  perf: arm_cspmu: Reject events meant for other PMUs
  Documentation/arm64: Fix typos in elf_hwcaps

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:57:51 +0000 (11:57 -0800)]

Merge tag 'sound-fix-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of fixes for rc1.

  The majority of changes are various ASoC driver-specific small fixes
  and usual HD-audio quirks, while there are a couple of core changes: a
  fix in ALSA core procfs code to avoid deadlocks at disconnection and
  an ASoC core fix for DAPM clock widgets"

* tag 'sound-fix-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  OSS: dmasound/paula: Convert to platform remove callback returning void
  ALSA: hda: ASUS UM5302LA: Added quirks for cs35L41/10431A83 on i2c bus
  ALSA: info: Fix potential deadlock at disconnection
  ASoC: nau8540: Add self recovery to improve capture quility
  ALSA: hda/realtek: Add support dual speaker for Dell
  ALSA: hda: Add ASRock X670E Taichi to denylist
  ALSA: hda/realtek: Add quirk for ASUS UX7602ZM
  ASoC: SOF: sof-client: trivial: fix comment typo
  ASoC: dapm: fix clock get name
  ASoC: hdmi-codec: register hpd callback on component probe
  ASoC: mediatek: mt8186_mt6366_rt1019_rt5682s: trivial: fix error messages
  ASoC: da7219: Improve system suspend and resume handling
  ASoC: codecs: Modify macro value error
  ASoC: codecs: Modify the wrong judgment of re value
  ASoC: codecs: Modify the maximum value of calib
  ASoC: amd: acp: fix for i2s mode register field update
  ASoC: codecs: aw88399: Fix -Wuninitialized in aw_dev_set_vcalb()
  ASoC: rt712-sdca: fix speaker route missing issue
  ASoC: rockchip: Fix unused rockchip_i2s_tdm_match warning for !CONFIG_OF
  ASoC: ti: omap-mcbsp: Fix runtime PM underflow warnings

commit | commitdiff | tree

Daniel Vetter [Fri, 10 Nov 2023 19:51:37 +0000 (20:51 +0100)]

Merge tag 'amd-drm-next-6.7-2023-11-10' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.7-2023-11-10:

amdgpu:
- SR-IOV fixes
- DMCUB fixes
- DCN3.5 fixes
- DP2 fixes
- SubVP fixes
- SMU14 fixes
- SDMA4.x fixes
- Suspend/resume fixes
- AGP regression fix
- UAF fixes for some error cases
- SMU 13.0.6 fixes
- Documentation fixes
- RAS fixes
- Hotplug fixes
- Scheduling entity ordering fix
- GPUVM fixes

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231110190703.4741-1-alexander.deucher@amd.com

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:44:38 +0000 (11:44 -0800)]

Merge tag 'spi-fix-v6.7-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A couple of fixes that came in during the merge window: one Kconfig
  dependency fix and another fix for a long standing issue where a sync
  transfer races with system suspend"

* tag 'spi-fix-v6.7-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: Fix null dereference on suspend
  spi: spi-zynq-qspi: add spi-mem to driver kconfig dependencies

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:40:38 +0000 (11:40 -0800)]

Merge tag 'mmc-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"MMC core:
   - Fix broken cache-flush support for Micron eMMCs
   - Revert 'mmc: core: Capture correct oemid-bits for eMMC cards'

  MMC host:
   - sdhci_am654: Fix TAP value parsing for legacy speed mode
   - sdhci-pci-gli: Fix support for ASPM mode for GL9755/GL9750
   - vub300: Fix an error path in probe"

* tag 'mmc-v6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: sdhci-pci-gli: GL9750: Mask the replay timer timeout of AER
  mmc: sdhci-pci-gli: GL9755: Mask the replay timer timeout of AER
  Revert "mmc: core: Capture correct oemid-bits for eMMC cards"
  mmc: vub300: fix an error code
  mmc: Add quirk MMC_QUIRK_BROKEN_CACHE_FLUSH for Micron eMMC Q2J54A
  mmc: sdhci_am654: fix start loop index for TAP value parsing

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:34:16 +0000 (11:34 -0800)]

Merge tag 'pwm/for-6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm

Pull pwm fixes from Thierry Reding:
"This contains two very small fixes that I failed to include in the
  main pull request"

* tag 'pwm/for-6.7-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
  pwm: Fix double shift bug
  pwm: samsung: Fix a bit test in pwm_samsung_resume()

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:25:58 +0000 (11:25 -0800)]

Merge tag 'io_uring-6.7-2023-11-10' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:
"Mostly just a few fixes and cleanups caused by the read multishot
  support.

  Outside of that, a stable fix for how a connect retry is done"

* tag 'io_uring-6.7-2023-11-10' of git://git.kernel.dk/linux:
  io_uring: do not clamp read length for multishot read
  io_uring: do not allow multishot read to set addr or len
  io_uring: indicate if io_kbuf_recycle did recycle anything
  io_uring/rw: add separate prep handler for fixed read/write
  io_uring/rw: add separate prep handler for readv/writev
  io_uring/net: ensure socket is marked connected on connect retry
  io_uring/rw: don't attempt to allocate async data if opcode doesn't need it

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:20:33 +0000 (11:20 -0800)]

Merge tag 'block-6.7-2023-11-10' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- NVMe pull request via Keith:
      - nvme keyring config compile fixes (Hannes and Arnd)
      - fabrics keep alive fixes (Hannes)
      - tcp authentication fixes (Mark)
      - io_uring_cmd error handling fix (Anuj)
      - stale firmware attribute fix (Daniel)
      - tcp memory leak (Christophe)
      - crypto library usage simplification (Eric)

- nbd use-after-free fix. May need a followup, but at least it's better
   than what it was before (Li)

- Rate limit write on read-only device warnings (Yu)

* tag 'block-6.7-2023-11-10' of git://git.kernel.dk/linux:
  nvme: keyring: fix conditional compilation
  nvme: common: make keyring and auth separate modules
  blk-core: use pr_warn_ratelimited() in bio_check_ro()
  nbd: fix uaf in nbd_open
  nvme: start keep-alive after admin queue setup
  nvme-loop: always quiesce and cancel commands before destroying admin q
  nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()
  nvme-auth: always set valid seq_num in dhchap reply
  nvme-auth: add flag for bi-directional auth
  nvme-auth: auth success1 msg always includes resp
  nvme: fix error-handling for io_uring nvme-passthrough
  nvme: update firmware version after commit
  nvme-tcp: Fix a memory leak
  nvme-auth: use crypto_shash_tfm_digest()

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:15:34 +0000 (11:15 -0800)]

Merge tag 'ata-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata

Pull ata fixes from Damien Le Moal:

- Revert a change in ata_pci_shutdown_one() to suspend disks on
   shutdown as this is now done using the manage_shutdown scsi device
   flag (me)

- Change the pata_falcon and pata_gayle drivers to stop using
   module_platform_driver_probe(). This makes these drivers more inline
   with all other drivers (allowing bind/unbind) and suppress a
   compilation warning (Uwe)

- Convert the pata_falcon and pata_gayle drivers to the new
   .remove_new() void-return callback. These 2 drivers are the last ones
   needing this change (Uwe)

* tag 'ata-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
  ata: pata_gayle: Convert to platform remove callback returning void
  ata: pata_falcon: Convert to platform remove callback returning void
  ata: pata_gayle: Stop using module_platform_driver_probe()
  ata: pata_falcon: Stop using module_platform_driver_probe()
  ata: libata-core: Fix ata_pci_shutdown_one()

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 19:09:07 +0000 (11:09 -0800)]

Merge tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

- don't leave pages decrypted for DMA in encrypted memory setups linger
   around on failure (Petr Tesarik)

- fix an out of bounds access in the new dynamic swiotlb code (Petr
   Tesarik)

- fix dma_addressing_limited for systems with weird physical memory
   layouts (Jia He)

* tag 'dma-mapping-6.7-2023-11-10' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC
  dma-mapping: fix dma_addressing_limited() if dma_range_map can't cover all system RAM
  dma-mapping: move dma_addressing_limited() out of line
  swiotlb: do not free decrypted pages if dynamic

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 18:58:49 +0000 (10:58 -0800)]

Merge tag 'lsm-pr-20231109' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull lsm updates from Paul Moore:
"We've got two small patches to correct the default return
  value of two LSM hooks: security_vm_enough_memory_mm() and
  security_inode_getsecctx()"

* tag 'lsm-pr-20231109' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lsm: fix default return value for inode_getsecctx
  lsm: fix default return value for vm_enough_memory

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 18:23:53 +0000 (10:23 -0800)]

Merge tag '6.7-rc-smb3-server-part2' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

- slab out of bounds fix in ACL handling

- fix malformed request oops

- minor doc fix

* tag '6.7-rc-smb3-server-part2' of git://git.samba.org/ksmbd:
  ksmbd: handle malformed smb1 message
  ksmbd: fix kernel-doc comment of ksmbd_vfs_kern_path_locked()
  ksmbd: fix slab out of bounds write in smb_inherit_dacl()

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 17:52:56 +0000 (09:52 -0800)]

Merge tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

- support for idmapped mounts in CephFS (Christian Brauner, Alexander
   Mikhalitsyn).

   The series was originally developed by Christian and later picked up
   and brought over the finish line by Alexander, who also contributed
   an enabler on the MDS side (separate owner_{u,g}id fields on the
   wire).

   The required exports for mnt_idmap_{get,put}() in VFS have been acked
   by Christian and received no objection from Christoph.

- a churny change in CephFS logging to include cluster and client
   identifiers in log and debug messages (Xiubo Li).

   This would help in scenarios with dozens of CephFS mounts on the same
   node which are getting increasingly common, especially in the
   Kubernetes world.

* tag 'ceph-for-6.7-rc1' of https://github.com/ceph/ceph-client:
  ceph: allow idmapped mounts
  ceph: allow idmapped atomic_open inode op
  ceph: allow idmapped set_acl inode op
  ceph: allow idmapped setattr inode op
  ceph: pass idmap to __ceph_setattr
  ceph: allow idmapped permission inode op
  ceph: allow idmapped getattr inode op
  ceph: pass an idmapping to mknod/symlink/mkdir
  ceph: add enable_unsafe_idmap module parameter
  ceph: handle idmapped mounts in create_request_message()
  ceph: stash idmapping in mdsc request
  fs: export mnt_idmap_get/mnt_idmap_put
  libceph, ceph: move mdsmap.h to fs/ceph
  ceph: print cluster fsid and client global_id in all debug logs
  ceph: rename _to_client() to _to_fs_client()
  ceph: pass the mdsc to several helpers
  libceph: add doutc and *_client debug macros support

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 17:23:17 +0000 (09:23 -0800)]

Merge tag 'riscv-for-linus-6.7-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull more RISC-V updates from Palmer Dabbelt:

- Support for handling misaligned accesses in S-mode

- Probing for misaligned access support is now properly cached and
   handled in parallel

- PTDUMP now reflects the SW reserved bits, as well as the PBMT and
   NAPOT extensions

- Performance improvements for TLB flushing

- Support for many new relocations in the module loader

- Various bug fixes and cleanups

* tag 'riscv-for-linus-6.7-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits)
  riscv: Optimize bitops with Zbb extension
  riscv: Rearrange hwcap.h and cpufeature.h
  drivers: perf: Do not broadcast to other cpus when starting a counter
  drivers: perf: Check find_first_bit() return value
  of: property: Add fw_devlink support for msi-parent
  RISC-V: Don't fail in riscv_of_parent_hartid() for disabled HARTs
  riscv: Fix set_memory_XX() and set_direct_map_XX() by splitting huge linear mappings
  riscv: Don't use PGD entries for the linear mapping
  RISC-V: Probe misaligned access speed in parallel
  RISC-V: Remove __init on unaligned_emulation_finish()
  RISC-V: Show accurate per-hart isa in /proc/cpuinfo
  RISC-V: Don't rely on positional structure initialization
  riscv: Add tests for riscv module loading
  riscv: Add remaining module relocations
  riscv: Avoid unaligned access when relocating modules
  riscv: split cache ops out of dma-noncoherent.c
  riscv: Improve flush_tlb_kernel_range()
  riscv: Make __flush_tlb_range() loop over pte instead of flushing the whole tlb
  riscv: Improve flush_tlb_range() for hugetlb pages
  riscv: Improve tlb_flush()
  ...

commit | commitdiff | tree

Linus Torvalds [Fri, 10 Nov 2023 17:19:46 +0000 (09:19 -0800)]

Merge tag 'mips_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS updates from Thomas Bogendoerfer:

- removed AR7 platform support

- cleanups and fixes

* tag 'mips_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: AR7: remove platform
  watchdog: ar7_wdt: remove driver to prepare for platform removal
  vlynq: remove bus driver
  mtd: parsers: ar7: remove support
  serial: 8250: remove AR7 support
  arch: mips: remove ReiserFS from defconfig
  MIPS: lantiq: Remove unnecessary include of <linux/of_irq.h>
  MIPS: lantiq: Fix pcibios_plat_dev_init() "no previous prototype" warning
  MIPS: KVM: Fix a build warning about variable set but not used
  MIPS: Remove dead code in relocate_new_kernel
  mips: dts: ralink: mt7621: rename to GnuBee GB-PC1 and GnuBee GB-PC2
  mips: dts: ralink: mt7621: define each reset as an item
  mips: dts: ingenic: Remove unneeded probe-type properties
  MIPS: loongson32: Remove dma.h and nand.h

commit | commitdiff | tree

Christian König [Tue, 31 Oct 2023 14:35:27 +0000 (15:35 +0100)]

drm/amdgpu: fix error handling in amdgpu_vm_init

When clearing the root PD fails we need to properly release it again.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

commit | commitdiff | tree

Felix Kuehling [Tue, 31 Oct 2023 17:30:00 +0000 (13:30 -0400)]

drm/amdgpu: Fix possible null pointer dereference

mem = bo->tbo.resource may be NULL in amdgpu_vm_bo_update.

Fixes: 180253782038 ("drm/ttm: stop allocating dummy resources during BO creation")
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

commit | commitdiff | tree

Alex Deucher [Wed, 8 Nov 2023 14:40:44 +0000 (09:40 -0500)]

drm/amdgpu: move UVD and VCE sched entity init after sched init

We need kernel scheduling entities to deal with handle clean up
if apps are not cleaned up properly. With commit 56e449603f0ac5
("drm/sched: Convert the GPU scheduler to variable number of run-queues")
the scheduler entities have to be created after scheduler init, so
change the ordering to fix this.

v2: Leave logic in UVD and VCE code

Fixes: 56e449603f0a ("drm/sched: Convert the GPU scheduler to variable number of run-queues")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Luben Tuikov <ltuikov89@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: ltuikov89@gmail.com

commit | commitdiff | tree

Tim Huang [Thu, 19 Oct 2023 07:50:43 +0000 (15:50 +0800)]

drm/amdgpu: move kfd_resume before the ip late init

The kfd_resume needs to touch GC registers to enable the interrupts,
it needs to be done before GFXOFF is enabled to ensure that the GFX is
not off and GC registers can be touched. So move kfd_resume before the
amdgpu_device_ip_late_init which enables the CGPG/GFXOFF.

Signed-off-by: Tim Huang <Tim.Huang@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

commit | commitdiff | tree

Mario Limonciello [Thu, 9 Nov 2023 17:23:46 +0000 (11:23 -0600)]

drm/amd: Explicitly check for GFXOFF to be enabled for s0ix

If a user has disabled GFXOFF this may cause problems for the suspend
sequence. Ensure that it is enabled in amdgpu_acpi_is_s0ix_active().

The system won't reach the deepest state but it also won't hang.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

commit | commitdiff | tree

Daniel Vetter [Fri, 10 Nov 2023 15:54:41 +0000 (16:54 +0100)]

Merge tag 'drm-misc-fixes-2023-11-08' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-fixes for v6.7-rc1:

qxl:
- qxl memory leak fix.
syncobj:
- Fix waiting for DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE
vc4:
- Fix UAF in mock helpers

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
[sima: Stitch together both changelogs from Maarten. Also because of
branch history this contains a few more bugfixes which are already in
v6.6, but I didn't feel like this justifies some backmerge since there
wasn't any real conflict.]
Link: https://patchwork.freedesktop.org/patch/msgid/bc8598ee-d427-4616-8ebd-64107ab9a2d8@linux.intel.com

commit | commitdiff | tree

Daniel Vetter [Fri, 10 Nov 2023 15:43:44 +0000 (16:43 +0100)]

Merge tag 'drm-intel-next-fixes-2023-11-08' of git://anongit.freedesktop.org/drm/drm-intel into drm-next

drm/i915 fixes for v6.7-rc1:
- Fix null dereference when perf interface is not available
- Fix a -Wstringop-overflow warning
- Fix a -Wformat-truncation warning in intel_tc_port_init
- Flush WC GGTT only on required platforms
- Fix MTL HBR3 rate support on C10 phy and eDP
- Fix MTL notify_guc for multi-GT
- Bump GLK CDCLK frequency when driving multiple pipes
- Fix potential spectre vulnerability

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/878r78xrxd.fsf@intel.com

commit | commitdiff | tree

Steve French [Thu, 20 Jul 2023 13:30:32 +0000 (08:30 -0500)]

cifs: update internal module version number for cifs.ko

From 2.45 to 2.46

Signed-off-by: Steve French <stfrench@microsoft.com>

commit | commitdiff | tree

Shyam Prasad N [Fri, 13 Oct 2023 11:40:09 +0000 (11:40 +0000)]

cifs: handle when server stops supporting multichannel

When a server stops supporting multichannel, we will
keep attempting reconnects to the secondary channels today.
Avoid this by freeing extra channels when negotiate
returns no multichannel support.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

commit | commitdiff | tree

Shyam Prasad N [Fri, 13 Oct 2023 11:33:21 +0000 (11:33 +0000)]

cifs: handle when server starts supporting multichannel

When the user mounts with multichannel option, but the
server does not support it, there can be a time in future
where it can be supported.

With this change, such a case is handled.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>

commit | commitdiff | tree

Steve French [Fri, 10 Nov 2023 07:24:16 +0000 (01:24 -0600)]

Missing field not being returned in ioctl CIFS_IOC_GET_MNT_INFO

The tcon_flags field was always being set to zero in the information
about the mount returned by the ioctl CIFS_IOC_GET_MNT_INFO instead
of being set to the value of the Flags field in the tree connection
structure as intended.

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

commit | commitdiff | tree

Helge Deller [Fri, 10 Nov 2023 15:13:15 +0000 (16:13 +0100)]

parisc: Prevent booting 64-bit kernels on PA1.x machines

Bail out early with error message when trying to boot a 64-bit kernel on
32-bit machines. This fixes the previous commit to include the check for
true 64-bit kernels as well.

Signed-off-by: Helge Deller <deller@gmx.de>
Fixes: 591d2108f3abc ("parisc: Add runtime check to prevent PA2.0 kernels on PA1.x machines")
Cc: <stable@vger.kernel.org> # v6.0+

commit | commitdiff | tree

Mark Hasemeyer [Tue, 7 Nov 2023 21:47:43 +0000 (14:47 -0700)]

spi: Fix null dereference on suspend

A race condition exists where a synchronous (noqueue) transfer can be
active during a system suspend. This can cause a null pointer
dereference exception to occur when the system resumes.

Example order of events leading to the exception:
1. spi_sync() calls __spi_transfer_message_noqueue() which sets
   ctlr->cur_msg
2. Spi transfer begins via spi_transfer_one_message()
3. System is suspended interrupting the transfer context
4. System is resumed
6. spi_controller_resume() calls spi_start_queue() which resets cur_msg
   to NULL
7. Spi transfer context resumes and spi_finalize_current_message() is
   called which dereferences cur_msg (which is now NULL)

Wait for synchronous transfers to complete before suspending by
acquiring the bus mutex and setting/checking a suspend flag.

Signed-off-by: Mark Hasemeyer <markhas@chromium.org>
Link: https://lore.kernel.org/r/20231107144743.v1.1.I7987f05f61901f567f7661763646cb7d7919b528@changeid
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@kernel.org

commit | commitdiff | tree

Masami Hiramatsu (Google) [Wed, 8 Nov 2023 12:12:39 +0000 (21:12 +0900)]

tracing: fprobe-event: Fix to check tracepoint event and return

Fix to check the tracepoint event is not valid with $retval.
The commit 08c9306fc2e3 ("tracing/fprobe-event: Assume fprobe is
a return event by $retval") introduced automatic return probe
conversion with $retval. But since tracepoint event does not
support return probe, $retval is not acceptable.

Without this fix, ftracetest, tprobe_syntax_errors.tc fails;

[22] Tracepoint probe event parser error log check [FAIL]
----
# tail 22-tprobe_syntax_errors.tc-log.mRKroL
+ ftrace_errlog_check trace_fprobe t kfree ^$retval dynamic_events
+ printf %s t kfree
+ wc -c
+ pos=8
+ printf %s t kfree ^$retval
+ tr -d ^
+ command=t kfree $retval
+ echo Test command: t kfree $retval
Test command: t kfree $retval
+ echo
----

So 't kfree $retval' should fail (tracepoint doesn't support
return probe) but passed it.

Link: https://lore.kernel.org/all/169944555933.45057.12831706585287704173.stgit@devnote2/
Fixes: 08c9306fc2e3 ("tracing/fprobe-event: Assume fprobe is a return event by $retval")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

commit | commitdiff | tree

Arnd Bergmann [Fri, 10 Nov 2023 10:59:05 +0000 (19:59 +0900)]

kprobes: unify kprobes_exceptions_nofify() prototypes

Most architectures that support kprobes declare this function in their
own asm/kprobes.h header and provide an override, but some are missing
the prototype, which causes a warning for the __weak stub implementation:

kernel/kprobes.c:1865:12: error: no previous prototype for 'kprobe_exceptions_notify' [-Werror=missing-prototypes]
1865 | int __weak kprobe_exceptions_notify(struct notifier_block *self,

Move the prototype into linux/kprobes.h so it is visible to all
the definitions.

Link: https://lore.kernel.org/all/20231108125843.3806765-4-arnd@kernel.org/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

commit | commitdiff | tree

wuqiang.matt [Fri, 10 Nov 2023 10:59:04 +0000 (19:59 +0900)]

lib: test_objpool: make global variables static

Kernel test robot reported build warnings that structures g_ot_sync_ops,
g_ot_async_ops and g_testcases should be static. These definitions are
only used in test_objpool.c, so make them static

Link: https://lore.kernel.org/all/20231108012248.313574-1-wuqiang.matt@bytedance.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202311071229.WGrWUjM1-lkp@intel.com/
Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

commit | commitdiff | tree

Masami Hiramatsu (Google) [Fri, 10 Nov 2023 10:59:03 +0000 (19:59 +0900)]

Documentation: tracing: Add a note about argument and retval access

Add a note about the argument and return value accecss will be best
effort. Depending on the type, it will be passed via stack or a
pair of the registers, but $argN and $retval only support the
single register access.

Link: https://lore.kernel.org/all/169556269377.146934.14829235476649685954.stgit@devnote2/
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

commit | commitdiff | tree

Dan Carpenter [Wed, 25 Oct 2023 11:58:18 +0000 (14:58 +0300)]

pwm: Fix double shift bug

These enums are passed to set/test_bit().  The set/test_bit() functions
take a bit number instead of a shifted value.  Passing a shifted value
is a double shift bug like doing BIT(BIT(1)).  The double shift bug
doesn't cause a problem here because we are only checking 0 and 1 but
if the value was 5 or above then it can lead to a buffer overflow.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Thierry Reding <thierry.reding@gmail.com>

commit | commitdiff | tree

Dan Carpenter [Wed, 25 Oct 2023 11:57:34 +0000 (14:57 +0300)]

pwm: samsung: Fix a bit test in pwm_samsung_resume()

The PWMF_REQUESTED enum is supposed to be used with test_bit() and not
used as in a bitwise AND. In this specific code the flag will never be
set so the function is effectively a no-op.

Fixes: e3fe982b2e4e ("pwm: samsung: Put per-channel data into driver data")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: Thierry Reding <thierry.reding@gmail.com>

commit | commitdiff | tree

Arnd Bergmann [Wed, 8 Nov 2023 12:58:42 +0000 (13:58 +0100)]

fbdev: fsl-diu-fb: mark wr_reg_wa() static

wr_reg_wa() is not an appropriate name for a global function, and doesn't need
to be global anyway, so mark it static and avoid the warning:

drivers/video/fbdev/fsl-diu-fb.c:493:6: error: no previous prototype for 'wr_reg_wa' [-Werror=missing-prototypes]

Fixes: 0d9dab39fbbe ("powerpc/5121: fsl-diu-fb: fix issue with re-enabling DIU area descriptor")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Helge Deller <deller@gmx.de>

commit | commitdiff | tree

Uwe Kleine-König [Thu, 9 Nov 2023 22:01:54 +0000 (23:01 +0100)]

fbdev: amifb: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Helge Deller <deller@gmx.de>

commit | commitdiff | tree

Uwe Kleine-König [Thu, 9 Nov 2023 22:01:53 +0000 (23:01 +0100)]

fbdev: amifb: Mark driver struct with __refdata to prevent section mismatch warning

As described in the added code comment, a reference to .exit.text is ok
for drivers registered via module_platform_driver_probe(). Make this
explicit to prevent a section mismatch warning.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Helge Deller <deller@gmx.de>

commit | commitdiff | tree

Steve French [Thu, 9 Nov 2023 21:28:12 +0000 (15:28 -0600)]

smb3: allow dumping session and tcon id to improve stats analysis and debugging

When multiple mounts are to the same share from the same client it was not
possible to determine which section of /proc/fs/cifs/Stats (and DebugData)
correspond to that mount. In some recent examples this turned out to be
a significant problem when trying to analyze performance data - since
there are many cases where unless we know the tree id and session id we
can't figure out which stats (e.g. number of SMB3.1.1 requests by type,
the total time they take, which is slowest, how many fail etc.) apply to
which mount. The only existing loosely related ioctl CIFS_IOC_GET_MNT_INFO
does not return the information needed to uniquely identify which tcon
is which mount although it does return various flags and device info.

Add a cifs.ko ioctl CIFS_IOC_GET_TCON_INFO (0x800ccf0c) to return tid,
session id, tree connect count.

Cc: stable@vger.kernel.org
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

commit | commitdiff | tree

Arnd Bergmann [Wed, 8 Nov 2023 12:58:26 +0000 (13:58 +0100)]

parport: gsc: mark init function static

This is only used locally, so mark it static to avoid a warning:

drivers/parport/parport_gsc.c:395:5: error: no previous prototype for 'parport_gsc_init' [-Werror=missing-prototypes]

Acked-by: Helge Deller <deller@gmx.de>
Acked-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Helge Deller <deller@gmx.de>

commit | commitdiff | tree

Arnd Bergmann [Wed, 8 Nov 2023 14:58:13 +0000 (15:58 +0100)]

fbdev: hyperv_fb: fix uninitialized local variable use

When CONFIG_SYSFB is disabled, the hyperv_fb driver can now run into
undefined behavior on a gen2 VM, as indicated by this smatch warning:

drivers/video/fbdev/hyperv_fb.c:1077 hvfb_getmem() error: uninitialized symbol 'base'.
drivers/video/fbdev/hyperv_fb.c:1077 hvfb_getmem() error: uninitialized symbol 'size'.

Since there is no way to know the actual framebuffer in this configuration,
just return an allocation failure here, which should avoid the build
warning and the undefined behavior.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202311070802.YCpvehaz-lkp@intel.com/
Fixes: a07b50d80ab6 ("hyperv: avoid dependency on screen_info")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Helge Deller <deller@gmx.de>

commit | commitdiff | tree

Uwe Kleine-König [Tue, 7 Nov 2023 09:18:03 +0000 (10:18 +0100)]

fbdev: omapfb/tpd12s015: Convert to platform remove callback returning void

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Helge Deller <deller@gmx.de>

Unnamed repository; edit this file 'description' to name the repository.

RSS Atom