David Woodhouse [Wed, 15 May 2024 14:08:22 +0000 (15:08 +0100)]
sched/cputime: Cope with steal time going backwards or negative
In steal_account_process_time(), a delta is calculated between the value
returned by paravirt_steal_clock(), and this_rq()->prev_steal_time which
is assumed to be the *previous* value returned by paravirt_steal_clock().
However, instead of just assigning the newly-read value directly into
->prev_steal_time for use in the next iteration, ->prev_steal_time is
*incremented* by the calculated delta.
This used to be roughly the same, modulo conversion to jiffies and back,
until commit 807e5b80687c0 ("sched/cputime: Add steal time support to
full dynticks CPU time accounting") started clamping that delta to a
maximum of the actual time elapsed. So now, if the value returned by
paravirt_steal_clock() jumps by a large amount, instead of a *single*
period of reporting 100% steal time, the system will report 100% steal
time for as long as it takes to "catch up" with the reported value.
Which is up to 584 years.
But there is a benefit to advancing ->prev_steal_time only by the time
which was *accounted* as having been stolen. It means that any extra
time truncated by the clamping will be accounted in the next sample
period rather than lost. Given the stochastic nature of the sampling,
that is more accurate overall.
So, continue to advance ->prev_steal_time by the accounted value as
long as the delta isn't egregiously large (for which, use maxtime * 2).
If the delta is more than that, just set ->prev_steal_time directly to
the value returned by paravirt_steal_clock().
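In sketch form, the resulting logic looks roughly like this (the
CONFIG_PARAVIRT guards and static-key check of the real
steal_account_process_time() are omitted):

    static u64 steal_account_process_time(u64 maxtime)
    {
            u64 abs_steal, steal;

            abs_steal = paravirt_steal_clock(smp_processor_id());
            steal = abs_steal - this_rq()->prev_steal_time;

            if (steal > maxtime * 2) {
                    /*
                     * Egregious jump (or an unsigned wrap from steal time
                     * going backwards): don't try to catch up, just resync
                     * ->prev_steal_time to the reported value.
                     */
                    this_rq()->prev_steal_time = abs_steal;
                    steal = maxtime;
            } else {
                    steal = min(steal, maxtime);
                    /*
                     * Advance only by the *accounted* time, so anything
                     * truncated by the clamp is accounted next period.
                     */
                    this_rq()->prev_steal_time += steal;
            }

            account_steal_time(steal);
            return steal;
    }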
Fixes: 807e5b80687c0 ("sched/cputime: Add steal time support to full dynticks CPU time accounting")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Wed, 15 May 2024 11:54:57 +0000 (12:54 +0100)]
KVM: x86/xen: Prevent runstate times from becoming negative
When kvm_xen_update_runstate() is invoked to set a vCPU's runstate, the
time spent in the previous runstate is accounted. This is based on the
delta between the current KVM clock time, and the previous value stored
in vcpu->arch.xen.runstate_entry_time.
If the KVM clock goes backwards, that delta will be negative. Or, since
it's an unsigned 64-bit integer, very *large*. Linux guests deal with
that particularly badly, reporting 100% steal time forevermore (well,
for *centuries* at least, until the delta has been consumed).
So when a negative delta is detected, just refrain from updating the
runstates until the KVM clock catches up with runstate_entry_time again.
The userspace APIs for setting the runstate times do not allow them to
be set past the current KVM clock, but userspace can still adjust the
KVM clock *after* setting the runstate times, which would cause this
situation to occur.
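In sketch form, with the accounting details elided
(kvm_xen_update_runstate() and get_kvmclock_ns() are the existing
functions in arch/x86/kvm/xen.c and x86.c):

    void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
    {
            struct kvm_vcpu_xen *vx = &v->arch.xen;
            u64 now = get_kvmclock_ns(v->kvm);

            /*
             * If the KVM clock went backwards, the unsigned delta below
             * would wrap to centuries' worth of runstate time. Skip the
             * update until the clock catches up again.
             */
            if (now < vx->runstate_entry_time)
                    return;

            /* ... account (now - vx->runstate_entry_time) to the old
             * state, record the new state and entry time as before ... */
    }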
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
When the KVM clock is in master clock mode, updating the KVM clock is
pointless. Let the periodic work 'expire', and start it running again
from kvm_end_pvclock_update() if the master clock mode is ever turned
off again.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Sun, 28 Apr 2024 12:59:49 +0000 (13:59 +0100)]
KVM: x86: Avoid gratuitous global clock reload in kvm_arch_vcpu_load()
Commit d98d07ca7e034 ("KVM: x86: update pvclock area conditionally, on
cpu migration") turned an unconditional KVM_REQ_CLOCK_UPDATE into a
conditional one, if either the master clock isn't enabled *or* the vCPU
was not previously scheduled (vcpu->cpu == -1). The commit message doesn't
explain the latter condition, which is specifically for the master clock
case.
Commit 0061d53daf26f ("KVM: x86: limit difference between kvmclock
updates") later turned that into a KVM_REQ_GLOBAL_CLOCK_UPDATE to avoid
skew between vCPUs.
In master clock mode there is no need for any of that, regardless of
whether/where this vCPU was previously scheduled.
Do it only if (!kvm->arch.use_master_clock).
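i.e., roughly this change in kvm_arch_vcpu_load() (sketch):

    /* was: */
    if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
            kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);

    /* now: in master clock mode there is nothing to recalculate */
    if (!vcpu->kvm->arch.use_master_clock)
            kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);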
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Sun, 28 Apr 2024 12:27:17 +0000 (13:27 +0100)]
KVM: x86: Avoid global clock update on setting KVM clock MSR
Commit 0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
introduced a KVM_REQ_GLOBAL_CLOCK_UPDATE when one vCPU set up its clock.
This was a workaround for the ever-drifting clocks which were based on the
host's CLOCK_MONOTONIC and thus subject to NTP skew. On booting or resuming
a guest, it just leads to running kvm_guest_time_update() twice for each
vCPU for no good reason.
Just use KVM_REQ_CLOCK_UPDATE on the vCPU itself, and only in the case
where the KVM clock is being set up, not turned off.
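In kvm_write_system_time(), that amounts to something like this sketch
(bit 0 of the MSR value is the enable bit):

    /* was: kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu); */
    if (system_time & 1)
            kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);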
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Sat, 27 Apr 2024 10:42:53 +0000 (11:42 +0100)]
KVM: x86: Factor out kvm_use_master_clock()
Both kvm_track_tsc_matching() and pvclock_update_vm_gtod_copy() make a
decision about whether the KVM clock should be in master clock mode.
They use *different* criteria for the decision though. This isn't really
a problem; it only has the potential to cause unnecessary invocations of
KVM_REQ_MASTERCLOCK_UPDATE if the masterclock was disabled due to TSC
going backwards, or the guest using the old MSR. But it isn't pretty.
Factor the decision out to a single function. And document the historical
reason why it's disabled for guests that use the old MSR_KVM_SYSTEM_TIME.
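A hedged sketch of the helper (all_vcpus_matched_tsc comes from the
previous patch in this series; the other flags are existing kvm->arch
fields):

    static bool kvm_use_master_clock(struct kvm *kvm)
    {
            struct kvm_arch *ka = &kvm->arch;

            return ka->all_vcpus_matched_tsc &&
                   !ka->backwards_tsc_observed &&
                   /* Historical: legacy MSR_KVM_SYSTEM_TIME guests
                    * don't get master clock treatment. */
                   !ka->boot_vcpu_runs_old_kvmclock;
    }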
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Sat, 27 Apr 2024 07:37:09 +0000 (08:37 +0100)]
KVM: x86: Allow KVM master clock mode when TSCs are offset from each other
There is no reason why the KVM clock cannot be in masterclock mode when
the TSCs are not in sync, as long as they are at the same *frequency*.
Running at a different frequency would lead to a systemic skew between
the clock(s) as observed by different vCPUs due to arithmetic precision
in the scaling. So that should indeed force the clock to be based on the
host's CLOCK_MONOTONIC_RAW instead of being in masterclock mode where it
is defined by the (or 'a') guest TSC.
But when the vCPUs merely have a different TSC *offset*, that's not a
problem. The offset is applied to that vCPU's kvmclock->tsc_timestamp
field, and it all comes out in the wash.
So, remove ka->nr_vcpus_matched_tsc and replace it with a new field
ka->all_vcpus_matched_tsc which is not only changed to a boolean, but
also now tracks that the *frequency* matches, not the precise offset.
Using a *count* was always racy because a new vCPU could be being
created *while* kvm_track_tsc_matching() was running and comparing with
kvm->online_vcpus. That variable is only atomic with respect to itself.
In particular, kvm_arch_vcpu_create() runs before kvm->online_vcpus is
incremented for the new vCPU, and kvm_arch_vcpu_postcreate() runs later.
Repurpose kvm_track_tsc_matching() to be called from kvm_set_tsc_khz(),
and kill the cur_tsc_generation/last_tsc_generation fields which tracked
the precise TSC matching.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Fri, 26 Apr 2024 17:24:10 +0000 (18:24 +0100)]
KVM: x86: Kill cur_tsc_{nsec,offset,write} fields
These pointlessly duplicate the last_tsc_{nsec,offset,write} values.
The only place they were used was where the TSC is stable and a new vCPU
is being synchronized to the previous setting, in which case the 'last_'
value is definitely identical.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Fri, 26 Apr 2024 16:41:36 +0000 (17:41 +0100)]
KVM: x86: Improve synchronization in kvm_synchronize_tsc()
When synchronizing to an existing TSC (either by explicitly writing zero,
or the legacy hack where the TSC is written within one second's worth of
the previously written TSC), the last_tsc_write and last_tsc_nsec values
were being misrecorded by __kvm_synchronize_tsc(). The *unsynchronized*
value of the TSC (perhaps even zero) was being recorded, along with the
current time at which kvm_synchronize_tsc() was called. This could cause
*subsequent* writes to fail to synchronize correctly.
Fix that by resetting {data, ns} to the previous values before passing
them to __kvm_synchronize_tsc() when synchronization is detected. Except
in the case where the TSC is unstable and *has* to be synthesised from
the host clock, in which case attempt to create a nsec/tsc pair which is
on the correct line.
Furthermore, there were *three* different TSC reads used for calculating
the "current" time, all slightly different from each other. Fix that by
using kvm_get_time_and_clockread() where possible and using the same
host_tsc value in all cases.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Fri, 26 Apr 2024 14:57:20 +0000 (15:57 +0100)]
KVM: x86: Remove implicit rdtsc() from kvm_compute_l1_tsc_offset()
Let the callers pass the host TSC value in as an explicit parameter.
This leaves some fairly obviously stupid code, which uses this function
to compare the guest TSC at some *other* time with the newly-minted TSC
value from rdtsc(). Unless it's being used to measure *elapsed* time,
that isn't very sensible.
In this case, "obviously stupid" is an improvement over being non-obviously
so.
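The resulting shape of the function, as a sketch (the rdtsc() moves to
the call sites):

    static u64 kvm_compute_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 host_tsc,
                                         u64 target_tsc)
    {
            u64 tsc;

            /* Scale the caller-provided host TSC to L1's TSC frequency. */
            tsc = kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);

            return target_tsc - tsc;
    }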
No functional change intended.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Fri, 26 Apr 2024 12:29:12 +0000 (13:29 +0100)]
KVM: x86: Simplify and comment kvm_get_time_scale()
Commit 3ae13faac400 ("KVM: x86: pass kvm_get_time_scale arguments in hertz")
made this function take 64-bit values in Hz rather than 32-bit kHz. Thus
making it entirely pointless to shadow its arguments into local 64-bit
variables. Just use scaled_hz and base_hz directly.
Also rename the 'tps32' variable to 'base32', having utterly failed to
think of any reason why it might have been called that in the first place.
This could probably have been eliminated too, but it helps to make the
code clearer and *might* just help a naïve 32-bit compiler realise that it
doesn't need to do full 64-bit shifts.
Having taken the time to reverse-engineer the function, add some comments
explaining it.
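For reference, the contract is: find a 32-bit multiplier and a shift
such that ticks at base_hz scale to scaled_hz using only a multiply and
shifts. A standalone illustration of the idea (not the kernel
implementation, which must also avoid 64-bit overflow without __int128):

    #include <stdint.h>
    #include <stdio.h>

    /* Illustration of the (mul, shift) contract served by
     * kvm_get_time_scale(); __int128 is a GCC/Clang extension. */
    static void get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
                               int *pshift, uint32_t *pmul)
    {
            int shift = 0;

            /* Normalise so scaled_hz/base_hz lands in [0.5, 1); the
             * 32-bit fraction below then keeps maximum precision. */
            while (scaled_hz >= base_hz) {
                    base_hz <<= 1;          /* demo assumes no overflow */
                    shift++;
            }
            while (scaled_hz < base_hz / 2) {
                    base_hz >>= 1;
                    shift--;
            }
            *pshift = shift;
            *pmul = (uint32_t)(((unsigned __int128)scaled_hz << 32) / base_hz);
    }

    int main(void)
    {
            uint32_t mul;
            int shift;
            uint64_t ticks = 3000000000ULL;     /* 1s of a 3GHz counter */

            /* Scale a 3GHz TSC to 1GHz nanoseconds. */
            get_time_scale(1000000000ULL, 3000000000ULL, &shift, &mul);

            /* Guest-side conversion: shift first, then 32.32 multiply. */
            uint64_t t = shift >= 0 ? ticks << shift : ticks >> -shift;
            uint64_t ns = (uint64_t)(((unsigned __int128)t * mul) >> 32);

            printf("shift=%d mul=%#x -> %llu ns\n", shift, mul,
                   (unsigned long long)ns);
            return 0;
    }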
No functional change intended.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Fri, 19 Apr 2024 12:36:57 +0000 (13:36 +0100)]
KVM: x86: Fix software TSC upscaling in kvm_guest_time_update()
There was some confusion in kvm_guest_time_update() when software needs
to advance the guest TSC.
In master clock mode, there are two points of time which need to be taken
into account. First there is the master clock reference point, stored in
kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
Secondly, there is the time *now*, at the point kvm_guest_time_update()
is being called.
With software TSC upscaling, the guest TSC is getting further and further
ahead of the host TSC as time elapses. So at time "now", the guest TSC
should be further ahead of the host, than it was at master_kernel_ns.
The adjustment in kvm_guest_time_update() was not taking that into
account, and was only advancing the guest TSC by the appropriate amount
for master_kernel_ns, *not* the current time.
Fix it to calculate them both correctly.
Since the KVM clock reference point in master_kernel_ns might actually
be *earlier* than the reference point used for the guest TSC
(vcpu->last_tsc_nsec), this might lead to a negative delta. Fix the
compute_guest_tsc() function to cope with negative numbers, which
then means there is no need to force a master clock update when the
guest TSC is written.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Thu, 18 Apr 2024 15:00:22 +0000 (16:00 +0100)]
KVM: x86: Fix KVM clock precision in __get_kvmclock()
When in 'master clock mode' (i.e. when host and guest TSCs are behaving
sanely and in sync), the KVM clock is defined in terms of the guest TSC.
When TSC scaling is used, calculating the KVM clock directly from *host*
TSC cycles leads to a systemic drift from the values calculated by the
guest from its TSC.
Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
had a simple workaround for the specific case of Xen timers, as it had an
actual vCPU to hand and could use its scaling information. That commit
noted that it was broken for the general case of get_kvmclock_ns(), and
said "I'll come back to that".
Since __get_kvmclock() is invoked without a specific CPU, it needs to
be able to find or generate the scaling values required to perform the
correct calculation.
Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
so it isn't as complex as it might have been.
In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
kvm->arch.last_tsc_scaling_ratio. That is only protected by the
tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
shift factors to convert to nanoseconds for the corresponding KVM clock,
just as kvm_guest_time_update() would.
In __get_kvmclock(), which runs within a seqcount retry loop, use those
values to convert host to guest TSC and then to nanoseconds. Only fall
back to using get_kvmclock_base_ns() when not in master clock mode.
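A hedged sketch of the new master-clock path in __get_kvmclock()
(master_tsc_scaling_ratio follows the description above;
master_tsc_mul/shift are illustrative names for the precomputed
factors):

    struct kvm_arch *ka = &kvm->arch;
    u64 host_tsc_delta, guest_tsc_delta;

    /* Host TSC -> guest TSC using the common scaling ratio... */
    host_tsc_delta = rdtsc() - ka->master_cycle_now;
    guest_tsc_delta = kvm_scale_tsc(host_tsc_delta,
                                    ka->master_tsc_scaling_ratio);

    /* ...then guest TSC -> ns with the same factors a guest would use. */
    data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
                  pvclock_scale_delta(guest_tsc_delta,
                                      ka->master_tsc_mul,
                                      ka->master_tsc_shift);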
There was previously a code path in __get_kvmclock() which looked like
it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
even on 32-bit hosts. In practice that could never happen as the
ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
hosts it would never be set when the system clock isn't TSC-based. So
that code path is now removed.
The kvm_get_wall_clock_epoch() function had the same problem; make it
just call get_kvmclock() and subtract kvmclock from wallclock, with
the same fallback as before.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Thu, 18 Apr 2024 12:13:04 +0000 (13:13 +0100)]
KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") did so only for 64-bit hosts, by capturing the boot offset from
within the existing clocksource notifier update_pvclock_gtod().
That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
clocksource changes") but only on x86_64, because its original purpose
was just to disable the "master clock" mode which is only supported on
x86_64.
Now that the notifier is used for more than disabling master clock mode
(well, OK, more than a decade later, but clocks are hard), enable it for
the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
NTP sync on 32-bit too.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Wed, 13 Sep 2023 14:08:22 +0000 (16:08 +0200)]
KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).
Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the process that userspace needs to follow to
preserve the TSC across migration.
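Illustrative userspace usage (KVM_VCPU_TSC_SCALE is from this patch;
the struct layout shown is an assumption based on the 'scale ratio and
frac_bits' description):

    struct kvm_vcpu_tsc_scale scale;  /* assumed: { tsc_ratio; tsc_frac_bits; } */
    struct kvm_device_attr attr = {
            .group = KVM_VCPU_TSC_CTRL,
            .attr  = KVM_VCPU_TSC_SCALE,
            .addr  = (uint64_t)(uintptr_t)&scale,
    };

    if (ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr) == 0) {
            /* guest_tsc = ((host_tsc * tsc_ratio) >> tsc_frac_bits) + offset */
    }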
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Thu, 18 Apr 2024 08:28:24 +0000 (09:28 +0100)]
KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
KVM does make an attempt to cope with non-constant TSC, and has notifiers
to handle host TSC frequency changes. However, it *only* adjusts the KVM
clock, and doesn't adjust TSC frequency scaling when the host changes.
This is presumably because non-constant TSCs were fixed in hardware long
before TSC scaling was implemented, so there should never be real CPUs
which have TSC scaling but *not* CONSTANT_TSC.
Such a combination could potentially happen in some odd L1 nesting
environment, but it isn't worth trying to support it. Just make the
dependency explicit.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
Jack Allister [Wed, 10 Apr 2024 09:52:44 +0000 (09:52 +0000)]
KVM: selftests: Add KVM/PV clock selftest to prove timer correction
A VM's KVM/PV clock has an inherent relationship to its TSC (guest). When
either the host system live-updates or the VM is live-migrated this pairing
of the two clock sources should stay the same.
In reality this is not the case without some correction taking place. Two
new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
perform a correction on the PVTI (PV time information) structure held by
KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
in either a live-update/migration scenario.
This test proves that without the necessary fixup there is a perceived
change in the guest TSC & KVM/PV clock relationship before and after a LU/
LM takes place.
The following steps are made to verify there is a delta in the relationship
and that it can be corrected:
1. PVTI is sampled by guest at boot (let's call this PVTI0).
2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
3. PVTI is sampled by guest after change (PVTI1).
4. Correction is requested by usermode to KVM using PVTI0.
5. PVTI is sampled by guest after correction (PVTI2).
The guest then records a single TSC reference point in time and uses it to
calculate 3 KVM clock values from the 3 PVTI samples recorded above. Let's
call each clock value CLK[0-2].
In a perfect world CLK[0-2] should all be the same value if the KVM clock
& TSC relationship is preserved across the LU/LM (or faked in this test),
however it is not.
A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
(and the inaccuracies associated with that). A delta of ~3500ns can be
observed if guest TSC scaling to half host TSC frequency is also enabled,
whereas without scaling this is observed at ~180ns.
With the correction it should be possible to achieve a delta of ±1ns.
An option to enable guest TSC scaling is available via invoking the tester
with -s/--scale-tsc.
Example of the test output below:
* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1
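Each CLK value is computed from its PVTI sample with the standard
pvclock conversion; in sketch form (field names as in struct
pvclock_vcpu_time_info):

    static uint64_t pvclock_clock(const struct pvclock_vcpu_time_info *pvti,
                                  uint64_t tsc)
    {
            uint64_t delta = tsc - pvti->tsc_timestamp;

            if (pvti->tsc_shift >= 0)
                    delta <<= pvti->tsc_shift;
            else
                    delta >>= -pvti->tsc_shift;

            /* 32.32 fixed-point multiply, as the pvclock ABI specifies. */
            return pvti->system_time +
                   (uint64_t)(((unsigned __int128)delta *
                               pvti->tsc_to_system_mul) >> 32);
    }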
Signed-off-by: Jack Allister <jalliste@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> CC: Dongli Zhang <dongli.zhang@oracle.com>
Jack Allister [Sun, 21 Apr 2024 15:21:16 +0000 (15:21 +0000)]
UAPI: x86: Move pvclock-abi to UAPI for x86 platforms
KVM provides a new interface for performing a fixup/correction of the KVM
clock against the reference TSC. The KVM_[GS]ET_CLOCK_GUEST API requires a
pvclock_vcpu_time_info, so the caller must know about this definition.
Move the definition to the UAPI folder so that it is exported to usermode
and also change the type definitions to use the standard for UAPI exports.
Signed-off-by: Jack Allister <jalliste@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Jack Allister [Wed, 10 Apr 2024 09:52:43 +0000 (09:52 +0000)]
KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based on
a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.
The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which is
necessarily imprecise.
So on live update, the guest TSC can remain cycle accurate at precisely the
same offset from the host TSC, but there is no way for userspace to restore
the KVM clock accurately.
Even on live migration to a new host, where the accuracy of the guest time-
keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest TSC,
and not suffer an *additional* discontinuity.
To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
save and restore the actual PV clock info in pvclock_vcpu_time_info.
The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed through
the user-provided pvclock structure to generate the *intended* KVM clock
value at that point in time, and through the *actual* KVM clock calculation.
Then kvm->arch.kvmclock_offset is adjusted to eliminate the difference.
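A hedged sketch of that fixup (local names are illustrative; the
helpers are existing x86.c/pvclock code):

    u64 kernel_ns, host_tsc, guest_tsc;
    s64 intended_ns, actual_ns;

    kvm_get_time_and_clockread(&kernel_ns, &host_tsc);
    guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);

    /* The KVM clock per the user-supplied (pre-migration) PVTI... */
    intended_ns = __pvclock_read_cycles(&user_pvti, guest_tsc);
    /* ...and per the PVTI that KVM would generate right now. */
    actual_ns = __pvclock_read_cycles(&kvm_pvti, guest_tsc);

    kvm->arch.kvmclock_offset += intended_ns - actual_ns;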
Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.
On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
*both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
migration data (unless the intent is to refuse to resume on a host with bad
TSC).
(It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
non-masterclock mode, as that mode is necessarily imprecise anyway. The
explicit fallback allows userspace to deliberately fail migration to a host
with misbehaving TSC where master clock mode wouldn't be active.)
Co-developed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Jack Allister <jalliste@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> CC: Dongli Zhang <dongli.zhang@oracle.com>
David Woodhouse [Wed, 17 Apr 2024 18:49:01 +0000 (19:49 +0100)]
KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
The kvm_guest_time_update() function scales the host TSC frequency to
the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
scaling ratio previously calculated for that vCPU. It then calculates the
scaling factors for the KVM clock itself based on that guest TSC
frequency.
However, it uses kHz as the unit when scaling, and then multiplies by
1000 only at the end.
With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
the KVM clock advertised to the guest is based on a frequency of
2,499,999,000 Hz.
By using Hz as the unit from the beginning, the KVM clock would be based
on a more accurate frequency of 2,499,999,999 Hz in this example.
Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Sun, 7 Apr 2024 12:26:28 +0000 (13:26 +0100)]
KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.
Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.
But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.
In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.
The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.
One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.
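To make the two live definitions concrete (sketch only;
scale_to_guest_ns() stands in for the mul/shift conversion described
above):

    /* Definition A: host CLOCK_MONOTONIC_RAW plus a fixed offset. */
    clock = get_kvmclock_base_ns() + ka->kvmclock_offset;

    /* Definition B: a frozen reference point, advanced at the rate of
     * the *guest* TSC via the advertised pvclock mul/shift. */
    clock = ka->master_kernel_ns + ka->kvmclock_offset +
            scale_to_guest_ns(guest_tsc_now - guest_tsc_at_reference);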
Eventually, definition C should just be eliminated. Commit 451a707813ae
("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
the specific case of Xen timers, which are defined in terms of the KVM
clock and suffered from a continually increasing error in timer expiry
times. That commit notes that get_kvmclock_ns() is non-trivial to fix
and says "I'll come back to that", which remains true.
Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.
In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.
When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
should probably *adjust* kvm->arch.kvmclock_offset to account for the
drift, instead of yanking the clock back to definition A. But in the
meantime there are a bunch of places where it just doesn't need to be
invoked at all.
To start with: there is no need to do such an update when a Xen guest
populates the shared_info page. This seems to have been a hangover from
the very first implementation of shared_info which automatically
populated the vcpu_info structures at their default locations, but even
then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
expected to explicitly set the vcpu_info even in its default locations,
there's not even any need for that either.
Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Sean Christopherson [Mon, 22 Jan 2024 23:54:00 +0000 (15:54 -0800)]
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
TDX uses a different ABI to get information about VM exits. Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.
When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS. The eventual code will be
VMX:
- get exit reason, intr_info, exit_qualification, etc. from the VMCS
- call NMI/INTR handlers (common code)
TDX:
- get exit reason, intr_info, exit_qualification, etc. from guest registers
- call NMI/INTR handlers (common code)
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 Mar 2024 15:39:16 +0000 (11:39 -0400)]
KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX
KVM accesses the Virtual Machine Control Structure (VMCS) with VMX
instructions to operate on a VM. TDX doesn't allow the VMM to operate on
the VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures. This means we must have a TDX
version of kvm_x86_ops.
The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX. Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.
The eventually converted code will look like this:
Paolo Bonzini [Thu, 4 Apr 2024 13:17:09 +0000 (09:17 -0400)]
Merge branch 'kvm-sev-init2' into HEAD
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. The first source of variability
that was encountered is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
This series adds all the APIs that are needed to customize the features,
with room for future enhancements:
- a new /dev/kvm device attribute to retrieve the set of supported
features (right now, only debug swap)
- a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT
It then puts the new op to work by including the VMSA features as a field
of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full
set of supported VMSA features for backwards compatibility; but I am considering
also making them use zero as the feature mask, and will gladly adjust the
patches if so requested.
In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP
to reuse the KVM_SEV_INIT2 ioctl.
And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all,
SEV-ES VMs, when created with the new VM type instead of KVM_SEV_ES_INIT,
reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
once the VMSA has been encrypted... which is how the API should have
always behaved. Second, they also synchronize the FPU and AVX state.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Apr 2024 17:06:24 +0000 (13:06 -0400)]
Merge branch 'mm-delete-change-gpte' into HEAD
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair. It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.
It may seem a miracle that it has never caused any hard to trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence. Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.
Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:15 +0000 (07:58 -0400)]
mm: replace set_pte_at_notify() with just set_pte_at()
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify(). It is a synonym of
set_pte_at() and can be replaced with it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:14 +0000 (07:58 -0400)]
mmu_notifier: remove the .change_pte() callback
The scope of set_pte_at_notify() has reduced more and more through the
years. Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}(). However,
that has not been so for over ten years. During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.
Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed. For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:13 +0000 (07:58 -0400)]
KVM: remove unused argument of kvm_handle_hva_range()
The only user was kvm_mmu_notifier_change_pte(), which is now gone.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:12 +0000 (07:58 -0400)]
KVM: delete .change_pte MMU notifier callback
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:27 +0000 (08:13 -0400)]
selftests: kvm: add test for transferring FPU state into VMSA
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-18-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:26 +0000 (08:13 -0400)]
selftests: kvm: split "launch" phase of SEV VM creation
Allow the caller to set the initial state of the VM. Doing this
before sev_vm_launch() matters for SEV-ES, since that is the
place where the VMSA is updated and after which the guest state
becomes sealed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-17-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:25 +0000 (08:13 -0400)]
selftests: kvm: switch to using KVM_X86_*_VM
This removes the concept of "subtypes", instead letting the tests use proper
VM types that were recently added. While sev_init_vm() and sev_es_init_vm()
are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT
ioctls, this is limited to VMs that are created manually with
vm_create_barebones().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-16-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:24 +0000 (08:13 -0400)]
selftests: kvm: add tests for KVM_SEV_INIT2
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-15-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:23 +0000 (08:13 -0400)]
KVM: SEV: allow SEV-ES DebugSwap again
The DebugSwap feature of SEV-ES provides a way for confidential guests
to use data breakpoints. Its status is recorded in the VMSA, and therefore
attestation signatures depend on whether it is enabled or not. In order
to avoid invalidating the signatures depending on the host machine, it
was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES
DebugSwap by default", 2024-03-09).
However, we now have a new API to create SEV VMs that allows enabling
DebugSwap based on what the user tells KVM to do, and we also changed the
legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore
possible to re-enable the feature without breaking compatibility with
kernels that pre-date the introduction of DebugSwap, so go ahead.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:22 +0000 (08:13 -0400)]
KVM: SEV: introduce KVM_SEV_INIT2 operation
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired. Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.
The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.
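i.e., something like this sketch (the merged UAPI layout may differ in
padding):

    struct kvm_sev_init {
            __u64 vmsa_features;
            __u32 flags;            /* must be zero for now */
            __u32 pad[9];           /* room for future extension */
    };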
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:21 +0000 (08:13 -0400)]
KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.
Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr(). However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:20 +0000 (08:13 -0400)]
KVM: SEV: define VM types for SEV and SEV-ES
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:19 +0000 (08:13 -0400)]
KVM: SEV: introduce to_kvm_sev_info
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:18 +0000 (08:13 -0400)]
KVM: x86: Add supported_vm_types to kvm_caps
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES),
and also allows the vendor module to specify which VM types are supported.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:17 +0000 (08:13 -0400)]
KVM: x86: add fields to struct kvm_arch for CoCo features
Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.
We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected. For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.
For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that, when the new VM types are used, this becomes an error.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:16 +0000 (08:13 -0400)]
KVM: SEV: store VMSA features in kvm_sev_info
Right now, the set of features that are stored in the VMSA upon
initialization is fixed and depends on the module parameters for
kvm-amd.ko. However, the hypervisor cannot really change it at will
because the feature word has to match between the hypervisor and whatever
computes a measurement of the VMSA for attestation purposes.
Add a field to kvm_sev_info that holds the set of features to be stored
in the VMSA; and query it instead of referring to the module parameters.
Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this
does not yet introduce any functional change, but it paves the way for
an API that allows customization of the features per-VM.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-6-pbonzini@redhat.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:15 +0000 (08:13 -0400)]
KVM: SEV: publish supported VMSA features
Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.
The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:14 +0000 (08:13 -0400)]
KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR
Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function. You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.
Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:13 +0000 (08:13 -0400)]
KVM: x86: use u64_to_user_ptr()
There is no danger to the kernel if 32-bit userspace provides a 64-bit
value that has the high bits set, but for whatever reason happens to
resolve to an address that has something mapped there. KVM uses the
checked version of get_user() and put_user(), so any faults are caught
properly.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:12 +0000 (08:13 -0400)]
KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y
Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs
in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers
is quite confusing.
To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks
and the #VMGEXIT handler. Stubs are also restricted to functions that
check sev_enabled and to the destruction functions sev_free_cpu() and
sev_vm_destroy(), where the style of their callers is to leave checks
to the callers. Most call sites instead rely on dead code elimination
to take care of functions that are guarded with sev_guest() or
sev_es_guest().
Signed-off-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Thu, 4 Apr 2024 12:13:11 +0000 (08:13 -0400)]
KVM: SVM: Invert handling of SEV and SEV_ES feature flags
Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them
in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled. Aside
from the fact that sev_set_cpu_caps() is wildly misleading when it *clears*
capabilities, this will allow compiling out sev.c without falsely
advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID.
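Roughly (sketch):

    static void __init sev_set_cpu_caps(void)
    {
            /* Start from nothing and advertise only what is usable. */
            if (sev_enabled)
                    kvm_cpu_cap_set(X86_FEATURE_SEV);
            if (sev_es_enabled)
                    kvm_cpu_cap_set(X86_FEATURE_SEV_ES);
    }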
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Thu, 21 Mar 2024 17:37:06 +0000 (17:37 +0000)]
KVM: arm64: Rationalise KVM banner output
We are not very consistent when it comes to displaying which mode
we're in (VHE, {n,h}VHE, protected or not). For example, booting
in protected mode with hVHE results in:
Marc Zyngier [Thu, 21 Mar 2024 11:54:14 +0000 (11:54 +0000)]
arm64: Fix early handling of FEAT_E2H0 not being implemented
Commit 3944382fa6f2 introduced checks for the FEAT_E2H0 not being
implemented. However, the check is absolutely wrong and makes a
point of testing a bit that is guaranteed to be zero.
On top of that, the detection happens way too late, after the
init_el2_state has done its job.
This went undetected because the HW this was tested on has E2H being
RAO/WI, and not RES1. However, the bug shows up when run as a nested
guest, where HCR_EL2.E2H is not necessarily set to 1. As a result,
booting the kernel in hVHE mode fails with timer accesses being
caught in a trap loop (which was fun to debug).
Fix the check for ID_AA64MMFR4_EL1.E2H0, and set the HCR_EL2.E2H bit
early so that it can be checked by the rest of the init sequence.
With this, hVHE works again in a NV environment that doesn't have
FEAT_E2H0.
Fixes: 3944382fa6f2 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240321115414.3169115-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Will Deacon [Wed, 27 Mar 2024 12:48:53 +0000 (12:48 +0000)]
KVM: arm64: Ensure target address is granule-aligned for range TLBI
When zapping a table entry in stage2_try_break_pte(), we issue range
TLB invalidation for the region that was mapped by the table. However,
we neglect to align the base address down to the granule size and so
if we ended up reaching the table entry via a misaligned address then
we will accidentally skip invalidation for some prefix of the affected
address range.
Align 'ctx->addr' down to the granule size when performing TLB
invalidation for an unmapped table in stage2_try_break_pte().
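i.e., something like this sketch (kvm_granule_size() is the existing
page-table helper):

    kvm_tlb_flush_vmid_range(mmu,
                             ALIGN_DOWN(ctx->addr, kvm_granule_size(ctx->level)),
                             kvm_granule_size(ctx->level));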
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Fixes: defc8cc7abf0 ("KVM: arm64: Invalidate the table entries upon a range")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-5-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Will Deacon [Wed, 27 Mar 2024 12:48:52 +0000 (12:48 +0000)]
KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
Commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
FEAT_LPA2") updated the __tlbi_level() macro to take the target level
as an argument, with TLBI_TTL_UNKNOWN (rather than 0) indicating that
the caller cannot provide level information. Unfortunately, the two
implementations of __kvm_tlb_flush_vmid_range() were not updated and so
now ask for a level 0 invalidation if FEAT_LPA2 is implemented.
Fix the problem by passing TLBI_TTL_UNKNOWN instead of 0 as the level
argument to __flush_s2_tlb_range_op() in __kvm_tlb_flush_vmid_range().
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-4-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Will Deacon [Wed, 27 Mar 2024 12:48:50 +0000 (12:48 +0000)]
KVM: arm64: Don't defer TLB invalidation when zapping table entries
Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
unmap") introduced deferred TLB invalidation for the stage-2 page-table
so that range-based invalidation can be used for the accumulated
addresses. This works fine if the structure of the page-tables remains
unchanged, but if entire tables are zapped and subsequently freed then
we transiently leave the hardware page-table walker with a reference
to freed memory thanks to the translation walk caches. For example,
stage2_unmap_walker() will free page-table pages:
    if (childp)
            mm_ops->put_page(childp);
and issue the TLB invalidation later in kvm_pgtable_stage2_unmap():
    if (stage2_unmap_defer_tlb_flush(pgt))
            /* Perform the deferred TLB invalidations */
            kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);
For now, take the conservative approach and invalidate the TLB eagerly
when we clear a table entry. Note, however, that the existing level
hint passed to __kvm_tlb_flush_vmid_ipa() is incorrect and will be
fixed in a subsequent patch.
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-2-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Wujie Duan [Mon, 18 Mar 2024 09:47:35 +0000 (17:47 +0800)]
KVM: arm64: Fix out-of-IPA space translation fault handling
Commit 11e5ea5242e3 ("KVM: arm64: Use helpers to classify exception
types reported via ESR") tried to abstract the translation fault
check when handling an out-of-IPA-space condition, but incorrectly
replaced it with a permission fault check.
Restore the previous translation fault check.
Fixes: 11e5ea5242e3 ("KVM: arm64: Use helpers to classify exception types reported via ESR")
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wujie Duan <wjduan@linx-info.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/kvmarm/864jd3269g.wl-maz@kernel.org/
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Oliver Upton [Tue, 5 Mar 2024 18:48:39 +0000 (18:48 +0000)]
KVM: arm64: Fix host-programmed guest events in nVHE
Programming PMU events in the host that count during guest execution is
a feature supported by perf, e.g.
perf stat -e cpu_cycles:G ./lkvm run
While this works for VHE, the guest/host event bitmaps are not carried
through to the hypervisor in the nVHE configuration. Make
kvm_pmu_update_vcpu_events() conditional on whether or not _hardware_
supports PMUv3 rather than whether the vCPU has a vPMU enabled.
Anup Patel [Thu, 21 Mar 2024 08:50:41 +0000 (14:20 +0530)]
RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
Reads of the APLIC in_clrip[x] registers return the rectified input
values of the interrupt sources.
A rectified input value of an interrupt source is defined by the section
"4.5.2 Source configurations (sourcecfg[1]–sourcecfg[1023])" of the
RISC-V AIA specification as:
rectified input value = (incoming wire value) XOR (source is inverted)
Update the riscv_aplic_input() implementation to match the above.
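In sketch form:

    /* Rectified input per AIA section 4.5.2. */
    static bool aplic_rectified_input(bool wire, bool source_inverted)
    {
            return wire ^ source_inverted;
    }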
The writes to setipnum_le/be register for APLIC in MSI-mode have special
consideration for level-triggered interrupts as-per the section "4.9.2
Special consideration for level-sensitive interrupt sources" of the RISC-V
AIA specification.
Particularly, the below text from the RISC-V AIA specification defines
the behaviour of writes to setipnum_le/be register for level-triggered
interrupts:
"A second option is for the interrupt service routine to write the
APLIC’s source identity number for the interrupt to the domain’s
setipnum register just before exiting. This will cause the interrupt’s
pending bit to be set to one again if the source is still asserting
an interrupt, but not if the source is not asserting an interrupt."
Fix setipnum_le/be write emulation for in-kernel APLIC by implementing
the above behaviour in the aplic_write_pending() function.
Linus Torvalds [Sun, 24 Mar 2024 20:54:06 +0000 (13:54 -0700)]
Merge tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Fix logic that is supposed to prevent placement of the kernel image
below LOAD_PHYSICAL_ADDR
- Use the firmware stack in the EFI stub when running in mixed mode
- Clear BSS only once when using mixed mode
- Check efi.get_variable() function pointer for NULL before trying to
call it
* tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: fix panic in kdump kernel
x86/efistub: Don't clear BSS twice in mixed mode
x86/efistub: Call mixed mode boot services on the firmware's stack
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
Linus Torvalds [Sun, 24 Mar 2024 18:13:56 +0000 (11:13 -0700)]
Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Ensure that the encryption mask at boot is properly propagated on
5-level page tables, otherwise the PGD entry is incorrectly set to
non-encrypted, which causes system crashes during boot.
- Undo the deferred 5-level page table setup as it cannot work with
memory encryption enabled.
- Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
to the default value but the cached variable is not, so subsequent
comparisons might yield the wrong result and, as a consequence,
prevent the MSR from being updated.
- Register the local APIC address only once in the MPPARSE enumeration
to prevent triggering the related WARN_ONs() in the APIC and topology
code.
- Handle the case where no APIC is found gracefully by registering a
fake APIC in the topology code. That makes all related topology
functions work correctly and does not affect the actual APIC driver
code at all.
- Don't evaluate logical IDs during early boot as the local APIC IDs
are not yet enumerated and the invoked function returns an error
code. Nothing requires the logical IDs before the final CPUID
evaluation, which takes place after the APIC IDs have been enumerated.
- Cure the fallout of the per CPU rework on UP, which misplaced the
copying of boot_cpu_data to per CPU data, so that the final update to
boot_cpu_data was lost. This caused inconsistent state and boot
crashes.
- Use copy_from_kernel_nofault() in the kprobes setup as there is no
guarantee that the address can be safely accessed.
- Reorder struct members in struct saved_context to work around another
kmemleak false positive
- Remove the buggy code which tries to update the E820 kexec table for
setup_data as that is never passed to the kexec kernel.
- Update the resource control documentation to use the proper units.
- Fix a Kconfig warning observed with tinyconfig
* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/64: Move 5-level paging global variable assignments back
x86/boot/64: Apply encryption mask to 5-level pagetable update
x86/cpu: Add model number for another Intel Arrow Lake mobile processor
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Documentation/x86: Document that resctrl bandwidth control units are MiB
x86/mpparse: Register APIC address only once
x86/topology: Handle the !APIC case gracefully
x86/topology: Don't evaluate logical IDs during early boot
x86/cpu: Ensure that CPU info updates are propagated on UP
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
x86/pm: Work around false positive kmemleak report in msr_build_context()
x86/kexec: Do not update E820 kexec table for setup_data
x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
Linus Torvalds [Sun, 24 Mar 2024 18:11:05 +0000 (11:11 -0700)]
Merge tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler doc clarification from Thomas Gleixner:
"A single update for the documentation of the base_slice_ns tunable to
clarify that any value which is less than the tick slice has no effect
because the scheduler tick is not guaranteed to happen within the set
time slice"
* tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
Linus Torvalds [Sun, 24 Mar 2024 17:45:31 +0000 (10:45 -0700)]
Merge tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"This has a set of swiotlb alignment fixes for sometimes very long
standing bugs from Will. We've been discussion them for a while and
they should be solid now"
* tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
swiotlb: Fix alignment checks when both allocation and DMA masks are present
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Fix double-allocation of slots due to broken alignment handling
Oleksandr Tymoshenko [Sat, 23 Mar 2024 06:33:33 +0000 (06:33 +0000)]
efi: fix panic in kdump kernel
Check if get_next_variable() is actually a valid pointer before
calling it. In the kdump kernel this method is set to NULL, which
causes a panic during boot of the kexec-ed kernel.
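A sketch of the guard (the surrounding efivar call site is assumed):

    /* Bail out gracefully instead of calling a NULL pointer */
    if (!efi.get_next_variable)
            return EFI_UNSUPPORTED;

    status = efi.get_next_variable(&name_size, name, &vendor_guid);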
Tested with QEMU and OVMF firmware.
Fixes: bad267f9e18f ("efi: verify that variable services are supported") Signed-off-by: Oleksandr Tymoshenko <ovt@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Fri, 22 Mar 2024 16:01:45 +0000 (17:01 +0100)]
x86/efistub: Don't clear BSS twice in mixed mode
Clearing BSS should only be done once, at the very beginning.
efi_pe_entry() is the entrypoint from the firmware, which may not clear
BSS and so it is done explicitly. However, efi_pe_entry() is also used
as an entrypoint by the mixed mode startup code, in which case BSS will
already have been cleared, and doing it again at this point will corrupt
global variables holding the firmware's GDT/IDT and segment selectors.
So make the memset() conditional on whether the EFI stub is running in
native mode.
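Sketched with the stub's native-mode check (the exact placement in
efi_pe_entry() is assumed):

    /* Clear BSS only when entered directly from the firmware */
    if (efi_is_native())
            memset(_bss, 0, _ebss - _bss);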
Ard Biesheuvel [Fri, 22 Mar 2024 15:03:58 +0000 (17:03 +0200)]
x86/efistub: Call mixed mode boot services on the firmware's stack
Normally, the EFI stub calls into the EFI boot services using the stack
that was live when the stub was entered. According to the UEFI spec,
this stack needs to be at least 128k in size - this might seem large but
all asynchronous processing and event handling in EFI runs from the same
stack and so quite a lot of space may be used in practice.
In mixed mode, the situation is a bit different: the bootloader calls
the 32-bit EFI stub entry point, which calls the decompressor's 32-bit
entry point, where the boot stack is set up, using a fixed allocation
of 16k. This stack is still in use when the EFI stub is started in
64-bit mode, and so all calls back into the EFI firmware will be using
the decompressor's limited boot stack.
Due to the placement of the boot stack right after the boot heap, any
stack overruns have gone unnoticed. However, commit
5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code")
moved the definition of the boot heap into C code, and now the boot
stack is placed right at the base of BSS, where any overruns will
corrupt the end of the .data section.
While it would be possible to work around this by increasing the size of
the boot stack, doing so would affect all x86 systems, and mixed mode
systems are a tiny (and shrinking) fraction of the x86 installed base.
So instead, record the firmware stack pointer value when entering from
the 32-bit firmware, and switch to this stack every time an EFI boot
service call is made.
Tom Lendacky [Fri, 22 Mar 2024 15:41:07 +0000 (10:41 -0500)]
x86/boot/64: Move 5-level paging global variable assignments back
Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.
While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.
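A sketch of the early assignments (variable names as used in the x86
boot code; the values shown are for the 5-level case):

    /* In check_la57_support(): assign via RIP_REL_REF() so the
     * pre-relocation code does not use absolute addresses */
    RIP_REL_REF(__pgtable_l5_enabled) = 1;
    RIP_REL_REF(pgdir_shift) = 48;
    RIP_REL_REF(ptrs_per_p4d) = 512;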
Tom Lendacky [Fri, 22 Mar 2024 15:41:06 +0000 (10:41 -0500)]
x86/boot/64: Apply encryption mask to 5-level pagetable update
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.
Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.
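The change is effectively a one-flag swap (a sketch; the exact
expression in the boot code may differ):

    /* _PAGE_TABLE includes _PAGE_ENC, _PAGE_TABLE_NOENC does not */
    pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d | _PAGE_TABLE;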
Adamos Ttofari [Fri, 22 Mar 2024 23:04:39 +0000 (16:04 -0700)]
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Commit 672365477ae8 ("x86/fpu: Update XFD state where required") and
commit 8bf26758ca96 ("x86/fpu: Add XFD state to fpstate") introduced a
per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
order to avoid unnecessary writes to the MSR.
On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
wipes out any stale state. But the per CPU cached xfd value is not
reset, which brings them out of sync.
As a consequence a subsequent xfd_update_state() might fail to update
the MSR which in turn can result in XRSTOR raising a #NM in kernel
space, which crashes the kernel.
To fix this, introduce xfd_set_state() to write xfd_state together
with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.
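A sketch of the helper as described (the exact code may differ):

    static void xfd_set_state(u64 xfd)
    {
            /* Keep the MSR and the per CPU cache in lockstep */
            wrmsrl(MSR_IA32_XFD, xfd);
            __this_cpu_write(xfd_state, xfd);
    }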
Fixes: 672365477ae8 ("x86/fpu: Update XFD state where required") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
Tony Luck [Fri, 22 Mar 2024 18:20:15 +0000 (11:20 -0700)]
Documentation/x86: Document that resctrl bandwidth control units are MiB
The memory bandwidth software controller uses 2^20 units rather than
10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
Linux define for 0x00100000.
Update the documentation to use MiB when describing this feature.
It's too late to fix the mount option "mba_MBps" as that is now an
established user interface.
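For reference, the define in question is a power of two (from
include/linux/sizes.h):

    #define SZ_1M 0x00100000    /* 2^20 = 1048576 bytes = 1 MiB, not 10^6 */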
Linus Torvalds [Sat, 23 Mar 2024 21:49:25 +0000 (14:49 -0700)]
Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two regression fixes for the timer and timer migration code:
- Prevent endless timer requeuing which is caused by two CPUs racing
out of idle. This happens when the last CPU goes idle and therefore
has to ensure that the pending global timers are expired, while some
other CPU comes out of idle at the same time, wins the race and
expires the global queue. This causes the last CPU to chase ghost
timers forever, reprogramming its clockevent device endlessly.
Cure this by re-evaluating the wakeup time unconditionally.
- The split into local (pinned) and global timers in the timer wheel
caused a regression for NOHZ full as it broke the idle tracking of
global timers. On NOHZ full this prevents a self-IPI from being sent,
which in turn causes the timer not to be programmed and not to be
expired on time.
Restore the idle tracking for the global timer base so that the
self IPI condition for NOHZ full is working correctly again"
* tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix removed self-IPI on global timer's enqueue in nohz_full
timers/migration: Fix endless timer requeue after idle interrupts
Linus Torvalds [Sat, 23 Mar 2024 21:42:45 +0000 (14:42 -0700)]
Merge tag 'timers-core-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more clocksource updates from Thomas Gleixner:
"A set of updates for clocksource and clockevent drivers:
- A fix for the prescaler of the ARM global timer where the prescaler
mask define only covered 4 bits while it is actually 8 bits wide.
This obviously restricted the possible range of prescaler
adjustments
- A fix for the RISC-V timer which prevents a timer interrupt being
raised while the timer is initialized
- A set of device tree updates to support new system on chips in
various drivers
- Kernel-doc and other cleanups all over the place"
* tag 'timers-core-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-riscv: Clear timer interrupt on timer initialization
dt-bindings: timer: Add support for cadence TTC PWM
clocksource/drivers/arm_global_timer: Simplify prescaler register access
clocksource/drivers/arm_global_timer: Guard against division by zero
clocksource/drivers/arm_global_timer: Make gt_target_rate unsigned long
dt-bindings: timer: add Ralink SoCs system tick counter
clocksource: arm_global_timer: fix non-kernel-doc comment
clocksource/drivers/arm_global_timer: Remove stray tab
clocksource/drivers/arm_global_timer: Fix maximum prescaler value
clocksource/drivers/imx-sysctr: Add i.MX95 support
clocksource/drivers/imx-sysctr: Drop use global variables
dt-bindings: timer: nxp,sysctr-timer: support i.MX95
dt-bindings: timer: renesas: ostm: Document RZ/Five SoC
dt-bindings: timer: renesas,tmu: Document input capture interrupt
clocksource/drivers/ti-32K: Fix misuse of "/**" comment
clocksource/drivers/stm32: Fix all kernel-doc warnings
dt-bindings: timer: exynos4210-mct: Add google,gs101-mct compatible
clocksource/drivers/imx: Fix -Wunused-but-set-variable warning
Linus Torvalds [Sat, 23 Mar 2024 21:30:38 +0000 (14:30 -0700)]
Merge tag 'irq-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A series of fixes for the Renesas RZG21 interrupt chip driver to
prevent spurious and misrouted interrupts.
- Ensure that posted writes are flushed in the eoi() callback
- Ensure that interrupts are masked at the chip level when the
trigger type is changed
- Clear the interrupt status register when setting up edge type
trigger modes
- Ensure that the trigger type and routing information is set before
the interrupt is enabled"
* tag 'irq-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/renesas-rzg2l: Do not set TIEN and TINT source at the same time
irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type
irqchip/renesas-rzg2l: Rename rzg2l_irq_eoi()
irqchip/renesas-rzg2l: Rename rzg2l_tint_eoi()
irqchip/renesas-rzg2l: Flush posted write in irq_eoi()
Linus Torvalds [Sat, 23 Mar 2024 21:17:37 +0000 (14:17 -0700)]
Merge tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry fix from Thomas Gleixner:
"A single fix for the generic entry code:
The trace_sys_enter() tracepoint can modify the syscall number in
pt_regs via kprobes or BPF, but that requires that the syscall number
is re-evaluated from pt_regs after the tracepoint.
A seccomp fix in that area removed the re-evaluation so the change
does not take effect as the code just uses the locally cached number.
Restore the original behaviour by re-evaluating the syscall number
after the tracepoint (see the sketch after this entry)"
* tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Respect changes to system call number by trace_sys_enter()
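A sketch of the restored ordering (helper names as in the generic
entry code; surrounding context omitted):

    trace_sys_enter(regs, syscall);
    /* The tracepoint may have changed the syscall number in pt_regs */
    syscall = syscall_get_nr(current, regs);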
Linus Torvalds [Sat, 23 Mar 2024 16:21:26 +0000 (09:21 -0700)]
Merge tag 'powerpc-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
- Handle errors in mark_rodata_ro() and mark_initmem_nx()
- Make struct crash_mem available without CONFIG_CRASH_DUMP
Thanks to Christophe Leroy and Hari Bathini.
* tag 'powerpc-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kdump: Split KEXEC_CORE and CRASH_DUMP dependency
powerpc/kexec: split CONFIG_KEXEC_FILE and CONFIG_CRASH_DUMP
kexec/kdump: make struct crash_mem available without CONFIG_CRASH_DUMP
powerpc: Handle error in mark_rodata_ro() and mark_initmem_nx()
Linus Torvalds [Sat, 23 Mar 2024 16:17:03 +0000 (09:17 -0700)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:
- remove a misuse of kernel-doc comment
- use "Call trace:" for backtraces like other architectures
- implement copy_from_kernel_nofault_allowed() to fix a LKDTM test
- add a "cut here" line for prefetch aborts
- remove unnecessary Kconfig entry for FRAME_POINTER
- remove iwmmxt support for PJ4/PJ4B cores
- use bitfield helpers in ptrace to improve readability
- check if folio is reserved before flushing
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9359/1: flush: check if the folio is reserved for no-mapping addresses
ARM: 9354/1: ptrace: Use bitfield helpers
ARM: 9352/1: iwmmxt: Remove support for PJ4/PJ4B cores
ARM: 9353/1: remove unneeded entry for CONFIG_FRAME_POINTER
ARM: 9351/1: fault: Add "cut here" line for prefetch aborts
ARM: 9350/1: fault: Implement copy_from_kernel_nofault_allowed()
ARM: 9349/1: unwind: Add missing "Call trace:" line
ARM: 9334/1: mm: init: remove misuse of kernel-doc comment
Linus Torvalds [Sat, 23 Mar 2024 15:43:21 +0000 (08:43 -0700)]
Merge tag 'hardening-v6.9-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull more hardening updates from Kees Cook:
- CONFIG_MEMCPY_SLOW_KUNIT_TEST is no longer needed (Guenter Roeck)
- Fix needless UTF-8 character in arch/Kconfig (Liu Song)
- Improve __counted_by warning message in LKDTM (Nathan Chancellor)
- Refactor DEFINE_FLEX() for default use of __counted_by
- Disable signed integer overflow sanitizer on GCC < 8
* tag 'hardening-v6.9-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lkdtm/bugs: Improve warning message for compilers without counted_by support
overflow: Change DEFINE_FLEX to take __counted_by member
Revert "kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST"
arch/Kconfig: eliminate needless UTF-8 character in Kconfig help
ubsan: Disable signed integer overflow sanitizer on GCC < 8
Thomas Gleixner [Fri, 22 Mar 2024 18:56:39 +0000 (19:56 +0100)]
x86/mpparse: Register APIC address only once
The APIC address is registered twice. First during the early detection and
afterwards when actually scanning the table for APIC IDs. The APIC and
topology core warn about the second attempt.
Thomas Gleixner [Fri, 22 Mar 2024 18:56:38 +0000 (19:56 +0100)]
x86/topology: Handle the !APIC case gracefully
If there is no local APIC enumerated and registered then the topology
bitmaps are empty. Therefore, topology_init_possible_cpus() will die with
a division by zero exception.
Prevent this by registering a fake APIC id to populate the topology
bitmap. This also allows all topology query interfaces to be used
unconditionally. It does not affect the actual APIC code because either
the local APIC address was not registered or no local APIC could be
detected.
Fixes: f1f758a80516 ("x86/topology: Add a mechanism to track topology via APIC IDs") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
Thomas Gleixner [Fri, 22 Mar 2024 18:56:35 +0000 (19:56 +0100)]
x86/cpu: Ensure that CPU info updates are propagated on UP
The boot sequence evaluates CPUID information twice:
1) During early boot
2) When finalizing the early setup right before
mitigations are selected and alternatives are patched.
In both cases the evaluation is stored in boot_cpu_data, but on UP the
copying of boot_cpu_data to the per CPU info of the boot CPU happens
between #1 and #2. So any update which happens in #2 is never propagated to
the per CPU info instance.
Consolidate the whole logic and copy boot_cpu_data right before applying
alternatives as that's the point where boot_cpu_data is in its final
state and not supposed to change anymore.
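A hedged sketch of the consolidated flow (the exact call site and copy
mechanism are assumptions):

    /* Propagate the final boot_cpu_data into the per CPU instance
     * right before alternatives are applied */
    memcpy(this_cpu_ptr(&cpu_info), &boot_cpu_data, sizeof(boot_cpu_data));
    alternative_instructions();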
This also removes the voodoo mb() from smp_prepare_cpus_common() which
had absolutely no purpose.
Fixes: 71eb4893cfaf ("x86/percpu: Cure per CPU madness on UP") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
Nathan Chancellor [Thu, 21 Mar 2024 20:18:17 +0000 (13:18 -0700)]
lkdtm/bugs: Improve warning message for compilers without counted_by support
The current message for telling the user that their compiler does not
support the counted_by attribute in the FAM_BOUNDS test does not make
much sense either grammatically or semantically. Fix it to make it
correct in both aspects.
Kees Cook [Wed, 6 Mar 2024 23:51:36 +0000 (15:51 -0800)]
overflow: Change DEFINE_FLEX to take __counted_by member
The norm should be flexible array structures with __counted_by
annotations, so DEFINE_FLEX() is updated to expect that. Rename
the non-annotated version to DEFINE_RAW_FLEX(), and update the
few existing users. Additionally add selftests for the macros.
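A hedged usage sketch (the struct is illustrative): the macro now takes
the flex-array member, its __counted_by counter, and the count, and
initializes the counter itself:

    struct ints {
            int count;
            int values[] __counted_by(count);
    };

    /* On-stack object with room for 8 values; obj->count == 8 */
    DEFINE_FLEX(struct ints, obj, values, count, 8);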
Linus Torvalds [Fri, 22 Mar 2024 20:31:07 +0000 (13:31 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"The vfs has long had a write lifetime hint mechanism that gives the
expected longevity on storage of the data being written. f2fs was the
original consumer of this and used the hint for flash data placement
(mostly to avoid write amplification by placing objects with similar
lifetimes in the same erase block).
More recently the SCSI based UFS (Universal Flash Storage) drivers
have wanted to take advantage of this as well, for the same reasons as
f2fs, necessitating plumbing the write hints through the block layer
and then adding it to the SCSI core.
The vfs write_hints plumbing, already merged, carries this as far as
the block layer; this completes the SCSI core enablement, based on a
recently agreed reuse of the old write command group number. The
additions to the scsi_debug
driver are for emulating this property so we can run tests on it in
the absence of an actual UFS device"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: scsi_debug: Maintain write statistics per group number
scsi: scsi_debug: Implement GET STREAM STATUS
scsi: scsi_debug: Implement the IO Advice Hints Grouping mode page
scsi: scsi_debug: Allocate the MODE SENSE response from the heap
scsi: scsi_debug: Rework subpage code error handling
scsi: scsi_debug: Rework page code error handling
scsi: scsi_debug: Support the block limits extension VPD page
scsi: scsi_debug: Reduce code duplication
scsi: sd: Translate data lifetime information
scsi: scsi_proto: Add structures and constants related to I/O groups and streams
scsi: core: Query the Block Limits Extension VPD page
Linus Torvalds [Fri, 22 Mar 2024 19:46:07 +0000 (12:46 -0700)]
Merge tag 'block-6.9-20240322' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Make an informative message less ominous (Keith)
- Enhanced trace decoding (Guixin)
- TCP updates (Hannes, Li)
- Fabrics connect deadlock fix (Chunguang)
- Platform API migration update (Uwe)
- A new device quirk (Jiawei)
- Remove dead assignment in fd (Yufeng)
* tag 'block-6.9-20240322' of git://git.kernel.dk/linux:
nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
nvme: remove redundant BUILD_BUG_ON check
floppy: remove duplicated code in redo_fd_request()
nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
nvme-tcp: Export the nvme_tcp_wq to sysfs
drivers/nvme: Add quirks for device 126f:2262
nvme: parse format command's lbafu when tracing
nvme: add tracing of reservation commands
nvme: parse zns command's zsa and zrasf to string
nvme: use nvme_disk_is_ns_head helper
nvme: fix reconnection fail due to reserved tag allocation
nvmet: add tracing of zns commands
nvmet: add tracing of authentication commands
nvme-apple: Convert to platform remove callback returning void
nvmet-tcp: do not continue for invalid icreq
nvme: change shutdown timeout setting message
Linus Torvalds [Fri, 22 Mar 2024 19:42:55 +0000 (12:42 -0700)]
Merge tag 'io_uring-6.9-20240322' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"One patch just missed the initial pull, the rest are either fixes or
small cleanups that make our life easier for the next kernel:
- Fix a potential leak in error handling of pinned pages, and clean
it up (Gabriel, Pavel)
- Fix an issue with how read multishot returns retry (me)
- Fix a problem with waitid/futex removals, if we hit the case of
needing to remove all of them at exit time (me)
- Fix for a regression introduced in this merge window, where we
don't always have sr->done_io initialized if the ->prep_async()
path is used (me)
- Fix for SQPOLL setup error handling (me)
- Fix for a poll removal request being delayed (Pavel)
- Rename of a struct member which had a confusing name (Pavel)"
* tag 'io_uring-6.9-20240322' of git://git.kernel.dk/linux:
io_uring/sqpoll: early exit thread if task_context wasn't allocated
io_uring: clear opcode specific data for an early failure
io_uring/net: ensure async prep handlers always initialize ->done_io
io_uring/waitid: always remove waitid entry for cancel all
io_uring/futex: always remove futex entry for cancel all
io_uring: fix poll_remove stalled req completion
io_uring: Fix release of pinned pages when __io_uaddr_map fails
io_uring/kbuf: rename is_mapped
io_uring: simplify io_pages_free
io_uring: clean rings on NO_MMAP alloc fail
io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry
io_uring: don't save/restore iowait state
Linus Torvalds [Fri, 22 Mar 2024 19:34:26 +0000 (12:34 -0700)]
Merge tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix a memory leak in DM integrity recheck code that was added during
the 6.9 merge. Also fix the recheck code to ensure it issues bios
with proper alignment.
- Fix DM snapshot's dm_exception_table_exit() to schedule while
handling a large exception table during snapshot device shutdown.
* tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-integrity: align the outgoing bio in integrity_recheck
dm snapshot: fix lockup in dm_exception_table_exit
dm-integrity: fix a memory leak when rechecking the data
Linus Torvalds [Fri, 22 Mar 2024 18:15:45 +0000 (11:15 -0700)]
Merge tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A patch to minimize blockage when processing very large batches of
dirty caps and two fixes to better handle EOF in the face of multiple
clients performing reads and size-extending writes at the same time"
* tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client:
ceph: set correct cap mask for getattr request for read
ceph: stop copying to iter at EOF on sync reads
ceph: remove SLAB_MEM_SPREAD flag usage
ceph: break the check delayed cap loop every 5s
Linus Torvalds [Fri, 22 Mar 2024 18:12:21 +0000 (11:12 -0700)]
Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix invalid pointer dereference by initializing xmbuf before
tracepoint function is invoked
- Use memalloc_nofs_save() when inserting into quota radix tree
* tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: quota radix tree allocations need to be NOFS on insert
xfs: fix dev_t usage in xmbuf tracepoints
Linus Torvalds [Fri, 22 Mar 2024 17:22:45 +0000 (10:22 -0700)]
Merge tag 'loongarch-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch updates from Huacai Chen:
- Add objtool support for LoongArch
- Add ORC stack unwinder support for LoongArch
- Add kernel livepatching support for LoongArch
- Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig
- Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig
- Some bug fixes and other small changes
* tag 'loongarch-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch/crypto: Clean up useless assignment operations
LoongArch: Define the __io_aw() hook as mmiowb()
LoongArch: Remove superfluous flush_dcache_page() definition
LoongArch: Move {dmw,tlb}_virt_to_page() definition to page.h
LoongArch: Change __my_cpu_offset definition to avoid mis-optimization
LoongArch: Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig
LoongArch: Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig
LoongArch: Add kernel livepatching support
LoongArch: Add ORC stack unwinder support
objtool: Check local label in read_unwind_hints()
objtool: Check local label in add_dead_ends()
objtool/LoongArch: Enable orc to be built
objtool/x86: Separate arch-specific and generic parts
objtool/LoongArch: Implement instruction decoder
objtool/LoongArch: Enable objtool to be built
Linus Torvalds [Fri, 22 Mar 2024 17:09:08 +0000 (10:09 -0700)]
Merge tag 'fbdev-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev updates from Helge Deller:
- Allow console fonts up to 64x128 pixels (Samuel Thibault)
- Prevent division-by-zero in fb monitor code (Roman Smirnov)
- Drop Renesas ARM platforms from Mobile LCDC framebuffer driver (Geert
Uytterhoeven)
- Various code cleanups in viafb, uvesafb and mb862xxfb drivers by
Aleksandr Burakov, Li Zhijian and Michael Ellerman
* tag 'fbdev-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: panel-tpo-td043mtea1: Convert sprintf() to sysfs_emit()
fbmon: prevent division by zero in fb_videomode_from_videomode()
fbcon: Increase maximum font width x height to 64 x 128
fbdev: viafb: fix typo in hw_bitblt_1 and hw_bitblt_2
fbdev: mb862xxfb: Fix defined but not used error
fbdev: uvesafb: Convert sprintf/snprintf to sysfs_emit
fbdev: Restrict FB_SH_MOBILE_LCDC to SuperH
Linus Torvalds [Fri, 22 Mar 2024 16:57:00 +0000 (09:57 -0700)]
Merge tag 'spi-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in since the merge window. Most
of it is relatively minor driver specific fixes, there's also fixes
for error handling with SPI flash devices and a fix restoring delay
control functionality for non-GPIO chip selects managed by the core"
* tag 'spi-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-mt65xx: Fix NULL pointer access in interrupt handler
spi: docs: spidev: fix echo command format
spi: spi-imx: fix off-by-one in mx51 CPU mode burst length
spi: lm70llp: fix links in doc and comments
spi: Fix error code checking in spi_mem_exec_op()
spi: Restore delays for non-GPIO chip select
spi: lpspi: Avoid potential use-after-free in probe()
Linus Torvalds [Fri, 22 Mar 2024 16:52:37 +0000 (09:52 -0700)]
Merge tag 'regulator-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One fix that came in during the merge window, fixing a problem with
bootstrapping the state of exclusive regulators which have a parent
regulator"
* tag 'regulator-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Propagate the regulator state in case of exclusive get