www.infradead.org Git - users/dwmw2/linux.git/log

KVM: x86/xen: Inject vCPU upcall vector when local APIC is enabled

Linux guests since commit b1c3497e604d ("x86/xen: Add support for
HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU
upcall vector when it's advertised in the Xen CPUID leaves.

This upcall is injected through the guest's local APIC as an MSI, unlike
the older system vector which was merely injected by the hypervisor any
time the CPU was able to receive an interrupt and the upcall_pending
flags is set in its vcpu_info.

Effectively, that makes the per-CPU upcall edge triggered instead of
level triggered, which results in the upcall being lost if the MSI is
delivered when the local APIC is *disabled*.

Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC
for a vCPU is software enabled (in fact, on any write to the SPIV
register which doesn't disable the APIC). Do the same in KVM since KVM
doesn't provide a way for userspace to intervene and trap accesses to
the SPIV register of a local APIC emulated by KVM.

Astute reviewers may note that kvm_xen_inject_vcpu_vector() function has
a WARN_ON_ONCE() in the case where kvm_irq_delivery_to_apic_fast() fails
and returns false. In the case where the MSI is not delivered due to the
local APIC being disabled, kvm_irq_delivery_to_apic_fast() still returns
true but the value in *r is zero. So the WARN_ON_ONCE() remains correct,
as that case should still never happen.

Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Cc: stable@vger.kernel.org

KVM: x86/xen: improve accuracy of Xen timers

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>

KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues

This function can race with kvm_gpc_deactivate(), which does not take
the ->refresh_lock. This means kvm_gpc_deactivate() can wipe the ->pfn
and ->khva fields, and unmap the latter, while hva_to_pfn_retry() has
temporarily dropped its write lock on gpc->lock.

Then if hva_to_pfn_retry() determines that the PFN hasn't changed and
that the original pfn and khva can be reused, they get assigned back to
gpc->pfn and gpc->khva even though the khva was already unmapped by
kvm_gpc_deactivate(). This leaves the cache in an apparently valid state
but with ->khva pointing to an address which has been unmapped. Which in
turn leads to oopses in e.g. __kvm_xen_has_interrupt() and
set_shinfo_evtchn_pending() when they dereference said khva.

It may be possible to fix this just by making kvm_gpc_deactivate() take
the ->refresh_lock, but that still leaves ->refresh_lock being basically
redundant with the write lock on ->lock, which frankly makes my skin
itch, with the way that pfn_to_hva_retry() operates on fields in the gpc
without holding a write lock on ->lock.

Instead, fix it by cleaning up the semantics of hva_to_pfn_retry(). It
now no longer does locking gymnastics because it no longer operates on
the gpc object at all. I's called with a uhva and simply returns the
corresponding pfn (pinned), and a mapped khva for it.

Its caller __kvm_gpc_refresh() now sets gpc->uhva and clears gpc->valid
before dropping ->lock, calling hva_to_pfn_retry() and retaking ->lock
for write.

If hva_to_pfn_retry() fails, *or* if the ->uhva or ->active fields in
the gpc changed while the lock was dropped, the new mapping is discarded
and the gpc is not modified. On success with an unchanged gpc, the new
mapping is installed and the current ->pfn and ->uhva are taken into the
local old_pfn and old_khva variables to be unmapped once the locks are
all released.

This simplification means that ->refresh_lock is no longer needed for
correctness, but it does still provide a minor optimisation because it
will prevent two concurrent __kvm_gpc_refresh() calls from mapping a
given PFN, only for one of them to lose the race and discard its
mapping.

The optimisation in hva_to_pfn_retry() where it attempts to use the old
mapping if the pfn doesn't change is dropped, since it makes the pinning
more complex. It's a pointless optimisation anyway, since the odds of
the pfn ending up the same when the uhva has changed (i.e. the odds of
the two userspace addresses both pointing to the same underlying
physical page) are negligible,

The 'hva_changed' local variable in __kvm_gpc_refresh() is also removed,
since it's simpler just to clear gpc->valid if the uhva changed.
Likewise the unmap_old variable is dropped because it's just as easy to
check the old_pfn variable for KVM_PFN_ERR_FAULT.

I remain slightly confused because although this is clearly a race in
the gfn_to_pfn_cache code, I don't quite know how the Xen support code
actually managed to trigger it. We've seen oopses from dereferencing a
valid-looking ->khva in both __kvm_xen_has_interrupt() (the vcpu_info)
and in set_shinfo_evtchn_pending() (the shared_info). But surely the
race shouldn't happen for the vcpu_info gpc because all calls to both
refresh and deactivate hold the vcpu mutex, and it shouldn't happen
for the shared_info gpc because all calls to both will hold the
kvm->arch.xen.xen_lock mutex.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
v12:
- New in this version.

KVM: xen: allow vcpu_info content to be 'safely' copied

If the guest sets an explicit vcpu_info GPA then, for any of the first 32
vCPUs, the content of the default vcpu_info in the shared_info page must be
copied into the new location. Because this copy may race with event
delivery (which updates the 'evtchn_pending_sel' field in vcpu_info) there
needs to be a way to defer that until the copy is complete.
Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen
that is used in atomic context if the vcpu_info PFN cache has been
invalidated so that the update of vcpu_info can be deferred until the
cache can be refreshed (on vCPU thread's the way back into guest context).

Also use this shadow if the vcpu_info cache has been *deactivated*, so that
the VMM can safely copy the vcpu_info content and then re-activate the
cache with the new GPA. To do this, stop considering an inactive vcpu_info
cache as a hard error in kvm_xen_set_evtchn_fast().

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
v8:
- Update commit comment.

v6:
- New in this version.

KVM: pfncache: check the need for invalidation under read lock first

Taking a write lock on a pfncache will be disruptive if the cache is
heavily used (which only requires a read lock). Hence, in the MMU notifier
callback, take read locks on caches to check for a match; only taking a
write lock to actually perform an invalidation (after a another check).

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v10:
- New in this version.

KVM: xen: don't block on pfncache locks in kvm_xen_set_evtchn_fast()

As described in [1] compiling with CONFIG_PROVE_RAW_LOCK_NESTING shows that
kvm_xen_set_evtchn_fast() is blocking on pfncache locks in IRQ context.
There is only actually blocking with PREEMPT_RT because the locks will
turned into mutexes. There is no 'raw' version of rwlock_t that can be used
to avoid that, so use read_trylock() and treat failure to lock the same as
an invalid cache.

[1] https://lore.kernel.org/lkml/99771ef3a4966a01fefd3adbb2ba9c3a75f97cf2.camel@infradead.org/T/#mbd06e5a04534ce9c0ee94bd8f1e8d942b2d45bd6

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v11:
- Amended the commit comment.

v10:
- New in this version.

KVM: xen: split up kvm_xen_set_evtchn_fast()

The implementation of kvm_xen_set_evtchn_fast() is a rather lengthy piece
of code that performs two operations: updating of the shared_info
evtchn_pending mask, and updating of the vcpu_info evtchn_pending_sel
mask. Introduce a separate function to perform each of those operations and
re-work kvm_xen_set_evtchn_fast() to use them.

No functional change intended.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v11:
- Fixed /64 vs /32 switcheroo and changed type of port_word_bit back to
int.

v10:
- Updated in this version. Dropped David'd R-b since the updates are
non-trivial.

v8:
- New in this version.

KVM: xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability

Now that all relevant kernel changes and selftests are in place, enable the
new capability.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v2:
- New in this version.

KVM: selftests / xen: re-map vcpu_info using HVA rather than GPA

If the relevant capability (KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA) is present
then re-map vcpu_info using the HVA part way through the tests to make sure
then there is no functional change.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v5:
- New in this version.

KVM: selftests / xen: map shared_info using HVA rather than GFN

Using the HVA of the shared_info page is more efficient, so if the
capability (KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA) is present use that method
to do the mapping.

NOTE: Have the juggle_shinfo_state() thread map and unmap using both
GFN and HVA, to make sure the older mechanism is not broken.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v3:
- Re-work the juggle_shinfo_state() thread.

v2:
- New in this version.

KVM: xen: allow vcpu_info to be mapped by fixed HVA

If the guest does not explicitly set the GPA of vcpu_info structure in
memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded
in the shared_info page may be used. As described in a previous commit,
the shared_info page is an overlay at a fixed HVA within the VMM, so in
this case it also more optimal to activate the vcpu_info cache with a
fixed HVA to avoid unnecessary invalidation if the guest memory layout
is modified.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
v8:
- Re-base.

v5:
- New in this version.

KVM: xen: allow shared_info to be mapped by fixed HVA

The shared_info page is not guest memory as such. It is a dedicated page
allocated by the VMM and overlaid onto guest memory in a GFN chosen by the
guest and specified in the XENMEM_add_to_physmap hypercall. The guest may
even request that shared_info be moved from one GFN to another by
re-issuing that hypercall, but the HVA is never going to change.

Because the shared_info page is an overlay the memory slots need to be
updated in response to the hypercall. However, memory slot adjustment is
not atomic and, whilst all vCPUs are paused, there is still the possibility
that events may be delivered (which requires the shared_info page to be
updated) whilst the shared_info GPA is absent. The HVA is never absent
though, so it makes much more sense to use that as the basis for the
kernel's mapping.

Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this
purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its
availability. Don't actually advertize it yet though. That will be done in
a subsequent patch, which will also add tests for the new attribute type.

Also update the KVM API documentation with the new attribute and also fix
it up to consistently refer to 'shared_info' (with the underscore).

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
v8:
- Re-base.

v2:
- Define the new attribute and capability but don't advertize the
capability yet.
- Add API documentation.

KVM: xen: re-initialize shared_info if guest (32/64-bit) mode is set

If the shared_info PFN cache has already been initialized then the content
of the shared_info page needs to be re-initialized whenever the guest
mode is (re)set.
Setting the guest mode is either done explicitly by the VMM via the
KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes
the MSR to set up the hypercall page.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v12:
- Fix missing update of return value if mode is not actually changed.

v11:
- Drop the hunk removing the call to kvm_xen_shared_info_init() when
KVM_XEN_ATTR_TYPE_SHARED_INFO is set; it was a mistake and causes self-
test failures.

v10:
- New in this version.

KVM: xen: separate initialization of shared_info cache and content

A subsequent patch will allow shared_info to be initialized using either a
GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate
the initialization of the shared_info content from the activation of the
pfncache.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v11:
- Fix accidental regression from commit 5d6d6a7d7e66a ("KVM: x86: Refine
calculation of guest wall clock to use a single TSC read").

v10:
- New in this version.

KVM: pfncache: allow a cache to be activated with a fixed (userspace) HVA

Some pfncache pages may actually be overlays on guest memory that have a
fixed HVA within the VMM. It's pointless to invalidate such cached
mappings if the overlay is moved so allow a cache to be activated directly
with the HVA to cater for such cases. A subsequent patch will make use
of this facility.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v11:
- Fixed kvm_gpc_check() to ignore memslot generation if the cache is not
activated with a GPA. (This breakage occured during the re-work for v8).

v9:
- Pass both GPA and HVA into __kvm_gpc_refresh() rather than overloading
the address paraneter and using a bool flag to indicated what it is.

v8:
- Re-worked to avoid messing with struct gfn_to_pfn_cache.

KVM: pfncache: include page offset in uhva and use it consistently

Currently the pfncache page offset is sometimes determined using the gpa
and sometimes the khva, whilst the uhva is always page-aligned. After a
subsequent patch is applied the gpa will not always be valid so adjust
the code to include the page offset in the uhva and use it consistently
as the source of truth.

Also, where a page-aligned address is required, use PAGE_ALIGN_DOWN()
for clarity.

No functional change intended.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v8:
- New in this version.

KVM: pfncache: stop open-coding offset_in_page()

Some code in pfncache uses offset_in_page() but in other places it is open-
coded. Use offset_in_page() consistently everywhere.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v8:
- New in this version.

KVM: pfncache: remove KVM_GUEST_USES_PFN usage

As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any
callers of kvm_gpc_init(), which also makes the 'vcpu' argument redundant.
Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage
check in hva_to_pfn_retry() and hence the 'usage' argument to
kvm_gpc_init() are also redundant.
Remove the pfn_cache_usage enumeration and remove the redundant arguments,
fields of struct gfn_to_hva_cache, and all the related code.

[1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com/

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v8:
- New in this version.

KVM: pfncache: add a mark-dirty helper

At the moment pages are marked dirty by open-coded calls to
mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot
from the cache. After a subsequent patch these may not always be set
so add a helper now so that caller will protected from the need to know
about this detail.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
v8:
- Make the helper a static inline.

KVM: xen: mark guest pages dirty with the pfncache lock held

Sampling gpa and memslot from an unlocked pfncache may yield inconsistent
values so, since there is no problem with calling mark_page_dirty_in_slot()
with the pfncache lock held, relocate the calls in
kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events()
accordingly.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v8:
- New in this version.

KVM: pfncache: remove unnecessary exports

There is no need for the existing kvm_gpc_XXX() functions to be exported.
Clean up now before additional functions are added in subsequent patches.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
v8:
- New in this version.

KVM: pfncache: Add a map helper function

There is a pfncache unmap helper but mapping is open-coded. Arguably this
is fine because mapping is done in only one place, hva_to_pfn_retry(), but
adding the helper does make that function more readable.

No functional change intended.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
v8:
- Re-work commit comment.
- Fix CONFIG_HAS_IOMEM=n build.

x86/kvm: Do not try to disable kvmclock if it was not enabled

kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
present in the VM. It leads to write to a MSR that doesn't exist on some
configurations, namely in TDX guest:

unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)

kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
features.

Do not disable kvmclock if it was not enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Message-Id: <20231205004510.27164-6-kirill.shutemov@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'kvm-x86-mmu-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 MMU changes for 6.8:

- Fix a relatively benign off-by-one error when splitting huge pages during
   CLEAR_DIRTY_LOG.

- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
   TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.

- Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read
   versus write so that KVM doesn't pass "bool shared" all over the place just
   to have precise assertions in paths that don't actually care about whether
   the caller is a reader or a writer.

Merge tag 'kvm-x86-xen-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM Xen change for 6.8:

To workaround Xen guests that don't expect Xen PV clocks to be marked as being
based on a stable TSC, add a Xen config knob to allow userspace to opt out of
KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit
was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't
entirely blameless for the buggy guest behavior.

Merge tag 'kvm-x86-svm-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.8:

- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.

- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
   flushes on nested transitions, i.e. always satisfies flush requests.  This
   allows running bleeding edge versions of VMware Workstation on top of KVM.

- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.

- Fix a benign NMI virtualization bug where KVM would unnecessarily intercept
   IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects
   a second, "simultaneous" NMI.

Merge tag 'kvm-x86-lam-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 support for virtualizing Linear Address Masking (LAM)

Add KVM support for Linear Address Masking (LAM).  LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit.  This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.

LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking

- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.

For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively.  Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:

- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers

For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:

- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers

The modified LAM canonicality checks:
- LAM_S48                : [ 1 ][ metadata ][ 1 ]
                              63               47
- LAM_U48                : [ 0 ][ metadata ][ 0 ]
                              63               47
- LAM_S57                : [ 1 ][ metadata ][ 1 ]
                              63               56
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
                              63               56
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
                              63               56..47

The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks.  The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48.  The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows

Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.

Merge tag 'kvm-x86-pmu-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 PMU changes for 6.8:

- Fix a variety of bugs where KVM fail to stop/reset counters and other state
   prior to refreshing the vPMU model.

- Fix a double-overflow PMU bug by tracking emulated counter events using a
   dedicated field instead of snapshotting the "previous" counter.  If the
   hardware PMC count triggers overflow that is recognized in the same VM-Exit
   that KVM manually bumps an event count, KVM would pend PMIs for both the
   hardware-triggered overflow and for KVM-triggered overflow.

Merge tag 'kvm-x86-misc-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.8:

- Turn off KVM_WERROR by default for all configs so that it's not
   inadvertantly enabled by non-KVM developers, which can be problematic for
   subsystems that require no regressions for W=1 builds.

- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
   "features".

- Don't force a masterclock update when a vCPU synchronizes to the current TSC
   generation, as updating the masterclock can cause kvmclock's time to "jump"
   unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.

- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
   partly as a super minor optimization, but mostly to make KVM play nice with
   position independent executable builds.

Merge tag 'kvm-x86-hyperv-6.8' of https://github.com/kvm-x86/linux into HEAD

KVM x86 Hyper-V changes for 6.8:

- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.

- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.

Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEAD

Common KVM changes for 6.8:

- Use memdup_array_user() to harden against overflow.

- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.

Merge tag 'kvmarm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for Linux 6.8

- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
  base granule sizes. Branch shared with the arm64 tree.

- Large Fine-Grained Trap rework, bringing some sanity to the
  feature, although there is more to come. This comes with
  a prefix branch shared with the arm64 tree.

- Some additional Nested Virtualization groundwork, mostly
  introducing the NV2 VNCR support and retargetting the NV
  support to that version of the architecture.

- A small set of vgic fixes and associated cleanups.

KVM: x86: add missing "depends on KVM"

Support for KVM software-protected VMs should not be configurable,
if KVM is not available at all.

Fixes: 89ea60c2c7b5 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: fix direction of dependency on MMU notifiers

KVM_GENERIC_MEMORY_ATTRIBUTES requires the generic MMU notifier code, because
it uses kvm_mmu_invalidate_begin/end. However, it would not work with a bespoke
implementation of MMU notifiers that does not use KVM_GENERIC_MMU_NOTIFIER,
because most likely it would not synchronize correctly on invalidation. So
the right thing to do is to note the problematic configuration if the
architecture does not select itself KVM_GENERIC_MMU_NOTIFIER; not to
enable it blindly.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: introduce CONFIG_KVM_COMMON

CONFIG_HAVE_KVM is currently used by some architectures to either
enabled the KVM config proper, or to enable host-side code that is
not part of the KVM module. However, CONFIG_KVM's "select" statement
in virt/kvm/Kconfig corresponds to a third meaning, namely to
enable common Kconfigs required by all architectures that support
KVM.

These three meanings can be replaced respectively by an
architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by
a new Kconfig symbol that is in turn selected by the
architecture-specific "config KVM".

Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON.
Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by
architecture code, and it brings in all dependencies of common
KVM code. In particular, INTERVAL_TREE was missing in loongarch
and riscv, so that is another thing that is fixed.

Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08)
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd

In commit f320bc742bc23 ("KVM: arm64: Prepare the creation of s1
mappings at EL2"), pKVM switches from a temporary host-provided
page-table to its own page-table at EL2. Since there is only a single
TTBR for the nVHE hypervisor, this involves disabling and re-enabling
the MMU in __pkvm_init_switch_pgd().

Unfortunately, the memory barriers here are not quite correct.
Specifically:

  - A DSB is required to complete the TLB invalidation executed while
    the MMU is disabled.

  - An ISB is required to make the new TTBR value visible to the
    page-table walker before the MMU is enabled in the SCTLR.

An earlier version of the patch actually got this correct:

  https://lore.kernel.org/lkml/20210304184717.GB21795@willie-the-truck/

but thanks to some badly worded review comments from yours truly, these
were dropped for the version that was eventually merged.

Bring back the barriers and fix the potential issue (but note that this
was found by code inspection).

Cc: Quentin Perret <qperret@google.com>
Fixes: f320bc742bc23 ("KVM: arm64: Prepare the creation of s1 mappings at EL2")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240104164220.7968-1-will@kernel.org

Merge branch kvm-arm64/vgic-6.8 into kvmarm-master/next

* kvm-arm64/vgic-6.8:
  : .
  : Fix for the GICv4.1 vSGI pending state being set/cleared from
  : userspace, and some cleanup to the MMIO and userspace accessors
  : for the pending state.
  :
  : Also a fix for a potential UAF in the ITS translation cache.
  : .
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  KVM: arm64: vgic-v3: Reinterpret user ISPENDR writes as I{C,S}PENDR
  KVM: arm64: vgic: Use common accessor for writes to ICPENDR
  KVM: arm64: vgic: Use common accessor for writes to ISPENDR
  KVM: arm64: vgic-v4: Restore pending state on host userspace write

Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache

There is a potential UAF scenario in the case of an LPI translation
cache hit racing with an operation that invalidates the cache, such
as a DISCARD ITS command. The root of the problem is that
vgic_its_check_cache() does not elevate the refcount on the vgic_irq
before dropping the lock that serializes refcount changes.

Have vgic_its_check_cache() raise the refcount on the returned vgic_irq
and add the corresponding decrement after queueing the interrupt.

Cc: stable@vger.kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240104183233.3560639-1-oliver.upton@linux.dev

Merge tag 'kvm-riscv-6.8-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.8 part #1

- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Steal time account support along with selftest

Merge tag 'kvm-s390-next-6.8-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

- uvdevice fixed additional data return length
- stfle (feature indication) vsie fixes and minor cleanup

Merge tag 'loongarch-kvm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.8

1. Optimization for memslot hugepage checking.
2. Cleanup and fix some HW/SW timer issues.
3. Add LSX/LASX (128bit/256bit SIMD) support.

RISC-V: KVM: selftests: Add get-reg-list test for STA registers

Add SBI STA and its two registers to the get-reg-list test.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: selftests: Add steal_time test support

With the introduction of steal-time accounting support for
RISC-V KVM we can add RISC-V support to the steal_time test.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: selftests: Add guest_sbi_probe_extension

Add guest_sbi_probe_extension(), allowing guest code to probe for
SBI extensions. As guest_sbi_probe_extension() needs
SBI_ERR_NOT_SUPPORTED, take the opportunity to bring in all SBI
error codes. We don't bring in all current extension IDs or base
extension function IDs though, even though we need one of each,
because we'd prefer to bring those in as necessary.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: selftests: Move sbi_ecall to processor.c

sbi_ecall() isn't ucall specific and its prototype is already in
processor.h. Move its implementation to processor.c.

Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Implement SBI STA extension

Add a select SCHED_INFO to the KVM config in order to get run_delay
info. Then implement SBI STA's set-steal-time-shmem function and
kvm_riscv_vcpu_record_steal_time() to provide the steal-time info
to guests.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Add support for SBI STA registers

KVM userspace needs to be able to save and restore the steal-time
shared memory address. Provide the address through the get/set-one-reg
interface with two ulong-sized SBI STA extension registers (lo and hi).
64-bit KVM userspace must not set the hi register to anything other
than zero and is allowed to completely neglect saving/restoring it.

Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Add support for SBI extension registers

Some SBI extensions have state that needs to be saved / restored
when migrating the VM. Provide a get/set-one-reg register type
for SBI extension registers. Each SBI extension that uses this type
will have its own subtype. There are currently no subtypes defined.
The next patch introduces the first one.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Add SBI STA info to vcpu_arch

KVM's implementation of SBI STA needs to track the address of each
VCPU's steal-time shared memory region as well as the amount of
stolen time. Add a structure to vcpu_arch to contain this state
and make sure that the address is always set to INVALID_GPA on
vcpu reset. And, of course, ensure KVM won't try to update steal-
time when the shared memory address is invalid.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Add steal-update vcpu request

Add a new vcpu request to inform a vcpu that it should record its
steal-time information. The request is made each time it has been
detected that the vcpu task was not assigned a cpu for some time,
which is easy to do by making the request from vcpu-load. The record
function is just a stub for now and will be filled in with the rest
of the steal-time support functions in following patches.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Add SBI STA extension skeleton

Add the files and functions needed to support the SBI STA
(steal-time accounting) extension. In the next patches we'll
complete the functions to fully enable SBI STA support.

Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: paravirt: Implement steal-time support

When the SBI STA extension exists we can use it to implement
paravirt steal-time support. Fill in the empty pv-time functions
with an SBI STA implementation and add the Kconfig knobs allowing
it to be enabled.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: Add SBI STA extension definitions

The SBI STA extension enables steal-time accounting. Add the
definitions it specifies.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: paravirt: Add skeleton for pv-time support

Add the files and functions needed to support paravirt time on
RISC-V. Also include the common code needed for the first
application of pv-time, which is steal-time. In the next
patches we'll complete the functions to fully enable steal-time
support.

Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()

The indentation of "break" in kvm_riscv_vcpu_set_reg_csr() is
inconsistent hence let us fix it.

Fixes: c04913f2b54e ("RISCV: KVM: Add sstateen0 to ONE_REG")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202312190719.kBuYl6oJ-lkp@intel.com/
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: add vector registers and CSRs in KVM_GET_REG_LIST

Add all vector registers and CSRs (vstart, vl, vtype, vcsr, vlenb) in
get-reg-list.

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: add 'vlenb' Vector CSR

Userspace requires 'vlenb' to be able to encode it in reg ID. Otherwise
it is not possible to retrieve any vector reg since we're returning
EINVAL if reg_size isn't vlenb (see kvm_riscv_vcpu_vreg_addr()).

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: set 'vlenb' in kvm_riscv_vcpu_alloc_vector_context()

'vlenb', added to riscv_v_ext_state by commit c35f3aa34509 ("RISC-V:
vector: export VLENB csr in __sc_riscv_v_state"), isn't being
initialized in guest_context. If we export 'vlenb' as a KVM CSR,
something we want to do in the next patch, it'll always return 0.

Set 'vlenb' to riscv_v_size/32.

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: selftests: Treat SBI ext regs like ISA ext regs

SBI extension registers may not be present and indeed when
running on a platform without sscofpmf the PMU SBI extension
is not. Move the SBI extension registers from the base set of
registers to the filter list. Individual configs should test
for any that may or may not be present separately. Since
the PMU extension may disappear and the DBCN extension is only
present in later kernels, separate them from the rest into
their own configs. The rest are lumped together into the same
config.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: selftests: Use register subtypes

Always use register subtypes in the get-reg-list test when registers
have them. The only registers neglecting to do so were ISA extension
registers. While we don't really need to use KVM_REG_RISCV_ISA_SINGLE
(since it's zero), the main purpose is to avoid confusion and to
self-document the tests. Also add print support for the multi
registers like SBI extensions have, even though they're only used for
debugging.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Haibo Xu <haibo1.xu@intel.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: selftests: Add RISCV_SBI_EXT_REG

While adding RISCV_SBI_EXT_REG(), acknowledge that some registers
have subtypes and extend __kvm_reg_id() to take a subtype field.
Then, update all macros to set the new field appropriately. The
general CSR macro gets renamed to include "GENERAL", but the other
macros, like the new RISCV_SBI_EXT_REG, just use the SINGLE subtype.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Make SBI uapi consistent with ISA uapi

When an SBI extension cannot be enabled, that's a distinct state vs.
enabled and disabled. Modify enum kvm_riscv_sbi_ext_status to
accommodate it, which allows KVM userspace to tell the difference
in state too, as the SBI extension register will disappear when it
cannot be enabled, i.e. accesses to it return ENOENT. get-reg-list is
updated as well to only add SBI extension registers to the list which
may be enabled. Returning ENOENT for SBI extension registers which
cannot be enabled makes them consistent with ISA extension registers.
Any SBI extensions which were enabled by default are still enabled by
default, if they can be enabled at all.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: selftests: Drop SBI multi registers

These registers are no longer getting added to get-reg-list.
We keep sbi_ext_multi_id_to_str() for printing, even though
we don't expect it to normally be used, because it may be
useful for debug.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: Don't add SBI multi regs in get-reg-list

The multi regs are derived from the single registers. Only list the
single registers in get-reg-list. This also makes the SBI extension
register listing consistent with the ISA extension register listing.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Haibo Xu <haibo1.xu@intel.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Anup Patel <anup@brainfault.org>

KVM: riscv: selftests: Generate ISA extension reg_list using macros

Various ISA extension reg_list have common pattern so let us generate
these using macros.

We define two macros for the above purpose:
1) KVM_ISA_EXT_SIMPLE_CONFIG - Macro to generate reg_list for
ISA extension without any additional ONE_REG registers
2) KVM_ISA_EXT_SUBLIST_CONFIG - Macro to generate reg_list for
ISA extension with additional ONE_REG registers

This patch also adds the missing config for svnapot.

Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

RISC-V: KVM: remove a redundant condition in kvm_arch_vcpu_ioctl_run()

The latest ret value is updated by kvm_riscv_vcpu_aia_update(),
the loop will continue if the ret is less than or equal to zero.
So the later condition will never hit. Thus remove it.

Signed-off-by: Chao Du <duchao@eswincomputing.com>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

riscv: kvm: use ".L" local labels in assembly when applicable

For the sake of coherency, use local labels in assembly when
applicable. This also avoid kprobes being confused when applying a
kprobe since the size of function is computed by checking where the
next visible symbol is located. This might end up in computing some
function size to be way shorter than expected and thus failing to apply
kprobes to the specified offset.

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

riscv: kvm: Use SYM_*() assembly macros instead of deprecated ones

ENTRY()/END()/WEAK() macros are deprecated and we should make use of the
new SYM_*() macros [1] for better annotation of symbols. Replace the
deprecated ones with the new ones and fix wrong usage of END()/ENDPROC()
to correctly describe the symbols.

[1] https://docs.kernel.org/core-api/asm-annotations.html

Signed-off-by: Clément Léger <cleger@rivosinc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Anup Patel <anup@brainfault.org>

Linux 6.7-rc7

Merge tag 'x86-urgent-2023-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

- Fix a secondary CPUs enumeration regression caused by creative MADT
   APIC table entries on certain systems.

- Fix a race in the NOP-patcher that can spuriously trigger crashes on
   bootup.

- Fix a bootup failure regression caused by the parallel bringup code,
   caused by firmware inconsistency between the APIC initialization
   states of the boot and secondary CPUs, on certain systems.

* tag 'x86-urgent-2023-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/acpi: Handle bogus MADT APIC tables gracefully
  x86/alternatives: Disable interrupts and sync when optimizing NOPs in place
  x86/alternatives: Sync core before enabling interrupts
  x86/smpboot/64: Handle X2APIC BIOS inconsistency gracefully

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Four small fixes, three in drivers with the core one adding a batch
  indicator (for drivers which use it) to the error handler"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ufs: core: Let the sq_lock protect sq_tail_slot access
  scsi: ufs: qcom: Return ufs_qcom_clk_scale_*() errors in ufs_qcom_clk_scale_notify()
  scsi: core: Always send batch on reset or error handling command
  scsi: bnx2fc: Fix skb double free in bnx2fc_rcv()

Merge tag 'usb-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB / Thunderbolt fixes from Greg KH:
"Here are some small bugfixes and new device ids for USB and
  Thunderbolt drivers for 6.7-rc7. Included in here are:

   - new usb-serial device ids

   - thunderbolt driver fixes

   - typec driver fix

   - usb-storage driver quirk added

   - fotg210 driver fix

  All of these have been in linux-next with no reported issues"

* tag 'usb-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: serial: option: add Quectel EG912Y module support
  USB: serial: ftdi_sio: update Actisense PIDs constant names
  usb: fotg210-hcd: delete an incorrect bounds test
  usb-storage: Add quirk for incorrect WP on Kingston DT Ultimate 3.0 G3
  usb: typec: ucsi: fix gpio-based orientation detection
  net: usb: ax88179_178a: avoid failed operations when device is disconnected
  USB: serial: option: add Quectel RM500Q R13 firmware support
  USB: serial: option: add Foxconn T99W265 with new baseline
  thunderbolt: Fix minimum allocated USB 3.x and PCIe bandwidth
  thunderbolt: Fix memory leak in margining_port_remove()

Merge tag 'char-misc-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char / misc driver fixes from Greg KH:
"Here are a small number of various driver fixes for 6.7-rc7 that
  normally come through the char-misc tree, and one debugfs fix as well.

  Included in here are:

   - iio and hid sensor driver fixes for a number of small things

   - interconnect driver fixes

   - brcm_nvmem driver fixes

   - debugfs fix for previous fix

   - guard() definition in device.h so that many subsystems can start
     using it for 6.8-rc1 (requested by Dan Williams to make future
     merges easier)

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
  debugfs: initialize cancellations earlier
  Revert "iio: hid-sensor-als: Add light color temperature support"
  Revert "iio: hid-sensor-als: Add light chromaticity support"
  nvmem: brcm_nvram: store a copy of NVRAM content
  dt-bindings: nvmem: mxs-ocotp: Document fsl,ocotp
  driver core: Add a guard() definition for the device_lock()
  interconnect: qcom: icc-rpm: Fix peak rate calculation
  iio: adc: MCP3564: fix hardware identification logic
  iio: adc: MCP3564: fix calib_bias and calib_scale range checks
  iio: adc: meson: add separate config for axg SoC family
  iio: adc: imx93: add four channels for imx93 adc
  iio: adc: ti_am335x_adc: Fix return value check of tiadc_request_dma()
  interconnect: qcom: sm8250: Enable sync_state
  iio: triggered-buffer: prevent possible freeing of wrong buffer
  iio: imu: inv_mpu6050: fix an error code problem in inv_mpu6050_read_raw
  iio: imu: adis16475: use bit numbers in assign_bit()
  iio: imu: adis16475: add spi_device_id table
  iio: tmag5273: fix temperature offset
  interconnect: Treat xlate() returning NULL node as an error
  iio: common: ms_sensors: ms_sensors_i2c: fix humidity conversion time table
  ...

Merge tag 'input-for-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull input updates from Dmitry Torokhov:

- a quirk to AT keyboard driver to skip issuing "GET ID" command when
   8042 is in translated mode and the device is a laptop/portable,
   because the "GET ID" command makes a bunch of recent laptops unhappy

- a quirk to i8042 to disable multiplexed mode on Acer P459-G2-M which
   causes issues on resume

- psmouse will activate native RMI4 protocol support for touchpad on
   ThinkPad L14 G1

- addition of Razer Wolverine V2 ID to xpad gamepad driver

- mapping for airplane mode button in soc_button_array driver for
   TUXEDO laptops

- improved error handling in ipaq-micro-keys driver

- amimouse being prepared for platform remove callback returning void

* tag 'input-for-v6.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: soc_button_array - add mapping for airplane mode button
  Input: xpad - add Razer Wolverine V2 support
  Input: ipaq-micro-keys - add error handling for devm_kmemdup
  Input: amimouse - convert to platform remove callback returning void
  Input: i8042 - add nomux quirk for Acer P459-G2-M
  Input: atkbd - skip ATKBD_CMD_GETID in translated mode
  Input: psmouse - enable Synaptics InterTouch for ThinkPad L14 G1

KVM: s390: cpu model: Use proper define for facility mask size

Use the previously unused S390_ARCH_FAC_MASK_SIZE_U64 instead of
S390_ARCH_FAC_LIST_SIZE_U64 for defining the fac_mask array.
Note that both values are the same, there is no functional change.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-4-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-4-nsg@linux.ibm.com>

KVM: s390: vsie: Fix length of facility list shadowed

The length of the facility list accessed when interpretively executing
STFLE is the same as the hosts facility list (in case of format-0)
The memory following the facility list doesn't need to be accessible.
The current VSIE implementation accesses a fixed length that exceeds the
guest/host facility list length and can therefore wrongly inject a
validity intercept.
Instead, find out the host facility list length by running STFLE and
copy only as much as necessary when shadowing.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-3-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-3-nsg@linux.ibm.com>

KVM: s390: vsie: Fix STFLE interpretive execution identification

STFLE can be interpretively executed.
This occurs when the facility list designation is unequal to zero.
Perform the check before applying the address mask instead of after.

Fixes: 66b630d5b7f2 ("KVM: s390: vsie: support STFLE interpretation")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-2-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-2-nsg@linux.ibm.com>

Input: soc_button_array - add mapping for airplane mode button

This add a mapping for the airplane mode button on the TUXEDO Pulse Gen3.

While it is physically a key it behaves more like a switch, sending a key
down on first press and a key up on 2nd press. Therefor the switch event
is used here. Besides this behaviour it uses the HID usage-id 0xc6
(Wireless Radio Button) and not 0xc8 (Wireless Radio Slider Switch), but
since neither 0xc6 nor 0xc8 are currently implemented at all in
soc_button_array this not to standard behaviour is not put behind a quirk
for the moment.

Signed-off-by: Christoffer Sandberg <cs@tuxedo.de>
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Link: https://lore.kernel.org/r/20231215171718.80229-1-wse@tuxedocomputers.com
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'block-6.7-2023-12-22' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
"Just an NVMe pull request this time, with a fix for bad sleeping
  context, and a revert of a patch that caused some trouble"

* tag 'block-6.7-2023-12-22' of git://git.kernel.dk/linux:
  nvme-pci: fix sleeping function called from interrupt context
  Revert "nvme-fc: fix race between error recovery and creating association"

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"RISC-V:

   - Fix a race condition in updating external interrupt for
     trap-n-emulated IMSIC swfile

   - Fix print_reg defaults in get-reg-list selftest

  ARM:

   - Ensure a vCPU's redistributor is unregistered from the MMIO bus if
     vCPU creation fails

   - Fix building KVM selftests for arm64 from the top-level Makefile

  x86:

   - Fix breakage for SEV-ES guests that use XSAVES

  Selftests:

   - Fix bad use of strcat(), by not using strcat() at all"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SEV: Do not intercept accesses to MSR_IA32_XSS for SEV-ES guests
  KVM: selftests: Fix dynamic generation of configuration names
  RISCV: KVM: update external interrupt atomically for IMSIC swfile
  KVM: riscv: selftests: Fix get-reg-list print_reg defaults
  KVM: selftests: Ensure sysreg-defs.h is generated at the expected path
  KVM: Convert comment into an assertion in kvm_io_bus_register_dev()
  KVM: arm64: vgic: Ensure that slots_lock is held in vgic_register_all_redist_iodevs()
  KVM: arm64: vgic: Force vcpu vgic teardown on vcpu destroy
  KVM: arm64: vgic: Add a non-locking primitive for kvm_vgic_vcpu_destroy()
  KVM: arm64: vgic: Simplify kvm_vgic_destroy()

Merge tag 'kvm-riscv-fixes-6.7-1' of https://github.com/kvm-riscv/linux into kvm-master

KVM/riscv fixes for 6.7, take #1

- Fix a race condition in updating external interrupt for
trap-n-emulated IMSIC swfile
- Fix print_reg defaults in get-reg-list selftest

Merge tag 'kvmarm-fixes-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm64 fixes for 6.7, part #2

- Ensure a vCPU's redistributor is unregistered from the MMIO bus
if vCPU creation fails

- Fix building KVM selftests for arm64 from the top-level Makefile

Merge tag 'printk-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:

- Prevent refcount warning from code releasing a fwnode

* tag 'printk-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
lib/vsprintf: Fix %pfwf when current node refcount == 0

Merge tag 'sound-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Apparently there were so many kids wishing bug fixes that made Santa
  busy; here we have lots of fixes although it's a bit late. But all
  changes are device-specific, hence it should be relatively safe to
  apply.

  Most of changes are for Cirrus codecs (for both ASoC and HD-audio),
  while the remaining are fixes for TI codecs, HD-audio and USB-audio
  quirks"

* tag 'sound-6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
  ALSA: hda: cs35l41: Only add SPI CS GPIO if SPI is enabled in kernel
  ALSA: hda: cs35l41: Do not allow uninitialised variables to be freed
  ASoC: fsl_sai: Fix channel swap issue on i.MX8MP
  ASoC: hdmi-codec: fix missing report for jack initial status
  ALSA: hda/realtek: Add quirks for ASUS Zenbook 2023 Models
  ALSA: hda: cs35l41: Support additional ASUS Zenbook 2023 Models
  ALSA: hda/realtek: Add quirks for ASUS Zenbook 2022 Models
  ALSA: hda: cs35l41: Support additional ASUS Zenbook 2022 Models
  ALSA: hda/realtek: Add quirks for ASUS ROG 2023 models
  ALSA: hda: cs35l41: Support additional ASUS ROG 2023 models
  ALSA: hda: cs35l41: Add config table to support many laptops without _DSD
  ASoC: Intel: bytcr_rt5640: Add new swapped-speakers quirk
  ASoC: Intel: bytcr_rt5640: Add quirk for the Medion Lifetab S10346
  kselftest: alsa: fixed a print formatting warning
  ALSA: usb-audio: Increase delay in MOTU M quirk
  ASoC: tas2781: check the validity of prm_no/cfg_no
  ALSA: hda/tas2781: select program 0, conf 0 by default
  ALSA: hda/realtek: Add quirk for ASUS ROG GV302XA
  ASoC: cs42l43: Don't enable bias sense during type detect
  ASoC: Intel: soc-acpi-intel-mtl-match: Change CS35L56 prefixes to AMPn
  ...

Merge tag 'i2c-for-6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

- error path fixes (qcom-geni)

- polling mode fix (rk3x)

- target mode state machine fix (aspeed)

* tag 'i2c-for-6.7-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: aspeed: Handle the coalesced stop conditions with the start conditions.
  i2c: rk3x: fix potential spinlock recursion on poll
  i2c: qcom-geni: fix missing clk_disable_unprepare() and geni_se_resources_off()

Merge tag 'gpio-fixes-for-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:
"Here's another round of fixes from the GPIO subsystem for this release
  cycle.

  There's one commit adding synchronization to an ioctl() we overlooked
  previously and another synchronization changeset for one of the
  drivers:

   - add protection against GPIO device removal to an overlooked ioctl()

   - synchronize the interrupt mask register manually in gpio-dwapb"

* tag 'gpio-fixes-for-v6.7-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: dwapb: mask/unmask IRQ when disable/enale it
  gpiolib: cdev: add gpio_device locking wrapper around gpio_ioctl()

Merge tag 'for-linus-6.7a-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fix from Juergen Gross:
"A single patch fixing a build issue for x86 32-bit configurations with
CONFIG_XEN, which was introduced in the 6.7 development cycle"

* tag 'for-linus-6.7a-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: add CPU dependencies for 32-bit build

Merge tag 'drm-fixes-2023-12-22' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Pretty quiet for this week, just i915 and amdgpu fixes,

  I think the misc tree got lost this week, but didn't seem to have too
  much in it, so it can wait. I've also got a bunch of nouveau GSP fixes
  sailing around that'll probably land next time as well.

  amdgpu:
   - DCN 3.5 fixes
   - DCN 3.2 SubVP fix
   - GPUVM fix

  amdkfd:
   - SVM fix for APUs

  i915:
   - Fix state readout and check for DSC and bigjoiner combo
   - Fix a potential integer overflow
   - Reject async flips with bigjoiner
   - Fix MTL HDMI/DP PLL clock selection
   - Fix various issues by disabling pipe DMC events"

* tag 'drm-fixes-2023-12-22' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: re-create idle bo's PTE during VM state machine reset
  drm/amd/display: dereference variable before checking for zero
  drm/amd/display: get dprefclk ss info from integration info table
  drm/amd/display: Add case for dcn35 to support usb4 dmub hpd event
  drm/amd/display: disable FPO and SubVP for older DMUB versions on DCN32x
  drm/amdkfd: svm range always mapped flag not working on APU
  drm/amd/display: Revert " drm/amd/display: Use channel_width = 2 for vram table 3.0"
  drm/i915/dmc: Don't enable any pipe DMC events
  drm/i915/mtl: Fix HDMI/DP PLL clock selection
  drm/i915: Reject async flips with bigjoiner
  drm/i915/hwmon: Fix static analysis tool reported issues
  drm/i915/display: Get bigjoiner config before dsc config during readout

Merge tag '9p-for-6.7-rc7' of https://github.com/martinetd/linux

Pull 9p fixes from Dominique Martinet:
"Two small fixes scheduled for stable trees:

  A tracepoint fix that's been reading past the end of messages forever,
  but semi-recently also went over the end of the buffer. And a
  potential incorrectly freeing garbage in pdu parsing error path"

* tag '9p-for-6.7-rc7' of https://github.com/martinetd/linux:
  net: 9p: avoid freeing uninit memory in p9pdu_vreadf
  9p: prevent read overrun in protocol dump tracepoint

KVM: arm64: vgic-v3: Reinterpret user ISPENDR writes as I{C,S}PENDR

User writes to ISPENDR for GICv3 are treated specially, as zeroes
actually clear the pending state for interrupts (unlike HW). Reimplement
it using the ISPENDR and ICPENDR user accessors.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231219065855.1019608-4-oliver.upton@linux.dev

KVM: arm64: vgic: Use common accessor for writes to ICPENDR

Fold MMIO and user accessors into a common helper while maintaining the
distinction between the two.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231219065855.1019608-3-oliver.upton@linux.dev

KVM: arm64: vgic: Use common accessor for writes to ISPENDR

Perhaps unsurprisingly, there is a considerable amount of duplicate
code between the MMIO and user accessors for ISPENDR. At the same
time there are some important differences between user and guest
MMIO, like how SGIs can only be made pending from userspace.

Fold user and MMIO accessors into a common helper, maintaining the
distinction between the two.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231219065855.1019608-2-oliver.upton@linux.dev

KVM: arm64: vgic-v4: Restore pending state on host userspace write

When the VMM writes to ISPENDR0 to set the state pending state of
an SGI, we fail to convey this to the HW if this SGI is already
backed by a GICv4.1 vSGI.

This is a bit of a corner case, as this would only occur if the
vgic state is changed on an already running VM, but this can
apparently happen across a guest reset driven by the VMM.

Fix this by always writing out the pending_latch value to the
HW, and reseting it to false.

Reported-by: Kunkun Jiang <jiangkunkun@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Cc: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/r/7e7f2c0c-448b-10a9-8929-4b8f4f6e2a32@huawei.com

Merge tag 'usb-serial-6.7-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus

Johan writes:

USB-serial device ids for 6.7-rc6

Here are some new modem device ids and a rename of a few ftdi product id
defines.

All have been in linux-next with no reported issues.

* tag 'usb-serial-6.7-rc6' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
  USB: serial: option: add Quectel EG912Y module support
  USB: serial: ftdi_sio: update Actisense PIDs constant names
  USB: serial: option: add Quectel RM500Q R13 firmware support
  USB: serial: option: add Foxconn T99W265 with new baseline

debugfs: initialize cancellations earlier

Tetsuo Handa pointed out that in the (now reverted)
lockdep commit I initialized the data too late. The
same is true for the cancellation data, it must be
initialized before the cmpxchg(), otherwise it may
be done twice and possibly even overwriting data in
there already when there's a race. Fix that, which
also requires destroying the mutex in case we lost
the race.

Fixes: 8c88a474357e ("debugfs: add API to allow debugfs operations cancellation")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20231221150444.1e47a0377f80.If7e8ba721ba2956f12c6e8405e7d61e154aa7ae7@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'drm-intel-fixes-2023-12-21' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

drm/i915 fixes for v6.7-rc7:
- Fix state readout and check for DSC and bigjoiner combo
- Fix a potential integer overflow
- Reject async flips with bigjoiner
- Fix MTL HDMI/DP PLL clock selection
- Fix various issues by disabling pipe DMC events

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87plyzsnxi.fsf@intel.com

Merge tag 'amd-drm-fixes-6.7-2023-12-20' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.7-2023-12-20:

amdgpu:
- DCN 3.5 fixes
- DCN 3.2 SubVP fix
- GPUVM fix

amdkfd:
- SVM fix for APUs

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20231220164845.4975-1-alexander.deucher@amd.com

Merge tag 'pinctrl-v6.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
"Some driver fixes for v6.7, all are in drivers, the most interesting
  one is probably the AMD laptop suspend bug which really needs fixing.
  Freedestop org has the bug description:

    https://gitlab.freedesktop.org/drm/amd/-/issues/2812

  Summary:

   - Ignore disabled device tree nodes in the Starfive 7100 and 7100
     drivers.

   - Mask non-wake source pins with interrupt enabled at suspend in the
     AMD driver, this blocks unnecessary wakeups from misc interrupts.
     This can be power consuming because in many cases the system
     doesn't really suspend, it just wakes right back up.

   - Fix a typo breaking compilation of the cy8c95x0 driver, and fix up
     bugs in the get/set config callbacks.

   - Use a dedicated lock class for the PIO4 drivers IRQ. This fixes a
     crash on suspend"

* tag 'pinctrl-v6.7-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: at91-pio4: use dedicated lock class for IRQ
  pinctrl: cy8c95x0: Fix get_pincfg
  pinctrl: cy8c95x0: Fix regression
  pinctrl: cy8c95x0: Fix typo
  pinctrl: amd: Mask non-wake source pins with interrupt enabled at suspend
  pinctrl: starfive: jh7100: ignore disabled device tree nodes
  pinctrl: starfive: jh7110: ignore disabled device tree nodes

Merge tag 'nvme-6.7-2023-12-21' of git://git.infradead.org/nvme into block-6.7

Pull NVMe fixes from Keith:

"nvme fixes for Linux 6.7

- Revert a commit with improper sleep context (Keith)
- Fix async event handling sleep context (Maurizio)"

* tag 'nvme-6.7-2023-12-21' of git://git.infradead.org/nvme:
nvme-pci: fix sleeping function called from interrupt context
Revert "nvme-fc: fix race between error recovery and creating association"

afs: Fix use-after-free due to get/remove race in volume tree

When an afs_volume struct is put, its refcount is reduced to 0 before
the cell->volume_lock is taken and the volume removed from the
cell->volumes tree.

Unfortunately, this means that the lookup code can race and see a volume
with a zero ref in the tree, resulting in a use-after-free:

    refcount_t: addition on 0; use-after-free.
    WARNING: CPU: 3 PID: 130782 at lib/refcount.c:25 refcount_warn_saturate+0x7a/0xda
    ...
    RIP: 0010:refcount_warn_saturate+0x7a/0xda
    ...
    Call Trace:
     afs_get_volume+0x3d/0x55
     afs_create_volume+0x126/0x1de
     afs_validate_fc+0xfe/0x130
     afs_get_tree+0x20/0x2e5
     vfs_get_tree+0x1d/0xc9
     do_new_mount+0x13b/0x22e
     do_mount+0x5d/0x8a
     __do_sys_mount+0x100/0x12a
     do_syscall_64+0x3a/0x94
     entry_SYSCALL_64_after_hwframe+0x62/0x6a

Fix this by:

(1) When putting, use a flag to indicate if the volume has been removed
     from the tree and skip the rb_erase if it has.

(2) When looking up, use a conditional ref increment and if it fails
     because the refcount is 0, replace the node in the tree and set the
     removal flag.

Fixes: 20325960f875 ("afs: Reorganise volume and server trees to be rooted on the cell")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>