David Woodhouse [Thu, 18 Apr 2024 11:31:54 +0000 (12:31 +0100)]
KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE
This was introduced in commit 0061d53daf26 ("KVM: x86: limit difference
between kvmclock updates") to reduce cross-vCPU differences which arose
because the KVM clock was based on CLOCK_MONOTONIC and thus subject to
NTP frequency corrections.
However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
for the KVM clock, avoiding the NTP frequency skew altogether.
So remove KVM_REQ_GLOBAL_CLOCK_UPDATE. In kvm_write_system_time(), all
that's needed is a single KVM_REQ_CLOCK_UPDATE for the vCPU whose KVM
clock is being configured.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Thu, 18 Apr 2024 18:36:57 +0000 (19:36 +0100)]
KVM: x86: Remove periodic global clock updates
This effectively reverts commit 332967a3eac0 ("x86: kvm: introduce
periodic global clock updates"). The periodic update was introduced to
propagate NTP corrections to the guest KVM clock, when the KVM clock was
based on CLOCK_MONOTONIC.
However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
for the KVM clock, avoiding the NTP frequency skew altogether.
So the periodic update serves no purpose. Remove it.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Fri, 26 Apr 2024 12:29:12 +0000 (13:29 +0100)]
KVM: x86: Simplify and comment kvm_get_time_scale()
Commit 3ae13faac400 ("KVM: x86: pass kvm_get_time_scale arguments in hertz")
made this function take 64-bit values in Hz rather than 32-bit kHz. Thus
making it entrely pointless to shadow its arguments into local 64-bit
variables. Just use scaled_hz and base_hz directly.
Also rename the 'tps32' variable to 'base32', having utterly failed to
think of any reason why it might have been called that in the first place.
This could probably have been eliminated too, but it helps to make the
code clearer and *might* just help a naïve 32-bit compiler realise that it
doesn't need to do full 64-bit shifts.
Having taken the time to reverse-engineer the function, add some comments
explaining it.
No functional change intended.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Fri, 19 Apr 2024 12:36:57 +0000 (13:36 +0100)]
KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
There was some confusion in kvm_update_guest_time() when software needs
to advance the guest TSC.
In master clock mode, there are two points of time which need to be taken
into account. First there is the master clock reference point, stored in
kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
Secondly, there is the time *now*, at the point kvm_update_guest_time()
is being called.
With software TSC upscaling, the guest TSC is getting further and further
ahead of the host TSC as time elapses. So at time "now", the guest TSC
should be further ahead of the host, than it was at master_kernel_ns.
The adjustment in kvm_update_guest_time() was not taking that into
account, and was only advancing the guest TSC by the appropriate amount
for master_kernel_ns, *not* the current time.
Fix it to calculate them both correctly.
Since the KVM clock reference point in master_kernel_ns might actually
be *earlier* than the reference point used for the guest TSC
(vcpu->last_tsc_nsec), this might lead to a negative delta. Fix the
compute_guest_tsc() function to cope with negative numbers, which
then means there is no need to force a master clock update when the
guest TSC is written.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Thu, 18 Apr 2024 15:00:22 +0000 (16:00 +0100)]
KVM: x86: Fix KVM clock precision in __get_kvmclock()
When in 'master clock mode' (i.e. when host and guest TSCs are behaving
sanely and in sync), the KVM clock is defined in terms of the guest TSC.
When TSC scaling is used, calculating the KVM clock directly from *host*
TSC cycles leads to a systemic drift from the values calculated by the
guest from its TSC.
Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
had a simple workaround for the specific case of Xen timers, as it had an
actual vCPU to hand and could use its scaling information. That commit
noted that it was broken for the general case of get_kvmclock_ns(), and
said "I'll come back to that".
Since __get_kvmclock() is invoked without a specific CPU, it needs to
be able to find or generate the scaling values required to perform the
correct calculation.
Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
so it isn't as complex as it might have been.
In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
kvm->arch.last_tsc_scaling_ratio. That is only protected by the
tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
shift factors to convert to nanoseconds for the corresponding KVM clock,
just as kvm_guest_time_update() would.
In __get_kvmclock(), which runs within a seqcount retry loop, use those
values to convert host to guest TSC and then to nanoseconds. Only fall
back to using get_kvmclock_base_ns() when not in master clock mode.
There was previously a code path in __get_kvmclock() which looked like
it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
even on 32-bit hosts. In practice that could never happen as the
ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
hosts it would never be set when the system clock isn't TSC-based. So
that code path is now removed.
The kvm_get_wall_clock_epoch() function had the same problem; make it
just call get_kvmclock() and subtract kvmclock from wallclock, with
the same fallback as before.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
David Woodhouse [Thu, 18 Apr 2024 12:13:04 +0000 (13:13 +0100)]
KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") did so only for 64-bit hosts, by capturing the boot offset from
within the existing clocksource notifier update_pvclock_gtod().
That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
clocksource changes") but only on x86_64, because its original purpose
was just to disable the "master clock" mode which is only supported on
x86_64.
Now that the notifier is used for more than disabling master clock mode,
(well, OK, more than a decade later but clocks are hard), enable it for
the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
NTP sync on 32-bit too.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Wed, 13 Sep 2023 14:08:22 +0000 (16:08 +0200)]
KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).
Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the process that userspace needs to follow to
preserve the TSC across migration.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Thu, 18 Apr 2024 08:28:24 +0000 (09:28 +0100)]
KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
KVM does make an attempt to cope with non-constant TSC, and has notifiers
to handle host TSC frequency changes. However, it *only* adjusts the KVM
clock, and doesn't adjust TSC frequency scaling when the host changes.
This is presumably because non-constant TSCs were fixed in hardware long
before TSC scaling was implemented, so there should never be real CPUs
which have TSC scaling but *not* CONSTANT_TSC.
Such a combination could potentially happen in some odd L1 nesting
environment, but it isn't worth trying to support it. Just make the
dependency explicit.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
Jack Allister [Wed, 10 Apr 2024 09:52:44 +0000 (09:52 +0000)]
KVM: selftests: Add KVM/PV clock selftest to prove timer correction
A VM's KVM/PV clock has an inherent relationship to its TSC (guest). When
either the host system live-updates or the VM is live-migrated this pairing
of the two clock sources should stay the same.
In reality this is not the case without some correction taking place. Two
new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
perform a correction on the PVTI (PV time information) structure held by
KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
in either a live-update/migration scenario.
This test proves that without the necessary fixup there is a perceived
change in the guest TSC & KVM/PV clock relationship before and after a LU/
LM takes place.
The following steps are made to verify there is a delta in the relationship
and that it can be corrected:
1. PVTI is sampled by guest at boot (let's call this PVTI0).
2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
3. PVTI is sampled by guest after change (PVTI1).
4. Correction is requested by usermode to KVM using PVTI0.
5. PVTI is sampled by guest after correction (PVTI2).
The guest the records a singular TSC reference point in time and uses it to
calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's
call each clock value CLK[0-2].
In a perfect world CLK[0-2] should all be the same value if the KVM clock
& TSC relationship is preserved across the LU/LM (or faked in this test),
however it is not.
A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
(and the inaccuracies associated with that). A delta of ~3500ns can be
observed if guest TSC scaling to half host TSC frequency is also enabled,
where as without scaling this is observed at ~180ns.
With the correction it should be possible to achieve a delta of ±1ns.
An option to enable guest TSC scaling is available via invoking the tester
with -s/--scale-tsc.
Example of the test output below:
* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1
Signed-off-by: Jack Allister <jalliste@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> CC: Dongli Zhang <dongli.zhang@oracle.com>
Jack Allister [Wed, 10 Apr 2024 09:52:43 +0000 (09:52 +0000)]
KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based on
a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.
The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which is
necessarily imprecise.
So on live update, the guest TSC can remain cycle accurate at precisely the
same offset from the host TSC, but there is no way for userspace to restore
the KVM clock accurately.
Even on live migration to a new host, where the accuracy of the guest time-
keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest TSC,
and not suffer an *additional* discontinuity.
To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
save and restore the actual PV clock info in pvclock_vcpu_time_info.
The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed through
the user-provided pvclock structure to generate the *intended* KVM clock
value at that point in time, and through the *actual* KVM clock calculation.
Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.
Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.
On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
*both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
migration data (unless the intent is to refuse to resume on a host with bad
TSC).
(It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
non-masterclock mode, as that mode is necessarily imprecise anyway. The
explicit fallback allows userspace to deliberately fail migration to a host
with misbehaving TSC where master clock mode wouldn't be active.)
Co-developed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Jack Allister <jalliste@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> CC: Dongli Zhang <dongli.zhang@oracle.com>
David Woodhouse [Wed, 17 Apr 2024 18:49:01 +0000 (19:49 +0100)]
KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
The kvm_guest_time_update() function scales the host TSC frequency to
the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
scaling ratio previously calculated for that vCPU. Then calcuates the
scaling factors for the KVM clock itself based on that guest TSC
frequency.
However, it uses kHz as the unit when scaling, and then multiplies by
1000 only at the end.
With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
the KVM clock advertised to the guest is based on a frequency of
2,499,999,000 Hz.
By using Hz as the unit from the beginning, the KVM clock would be based
on a more accurate frequency of 2,499,999,999 Hz in this example.
Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
David Woodhouse [Sun, 7 Apr 2024 12:26:28 +0000 (13:26 +0100)]
KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.
Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.
But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.
In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.
The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.
One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.
Eventually, definition C should just be eliminated. Commit 451a707813ae
("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
the specific case of Xen timers, which are defined in terms of the KVM
clock and suffered from a continually increasing error in timer expiry
times. That commit notes that get_kvmclock_ns() is non-trivial to fix
and says "I'll come back to that", which remains true.
Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.
In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.
When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
should probably *adjust* kvm->arch.kvmclock_offset to account for the
drift, instead of yanking the clock back to defintion A. But in the
meantime there are a bunch of places where it just doesn't need to be
invoked at all.
To start with: there is no need to do such an update when a Xen guest
populates the shared_info page. This seems to have been a hangover from
the very first implementation of shared_info which automatically
populated the vcpu_info structures at their default locations, but even
then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
expected to explicitly set the vcpu_info even in its default locations,
there's not even any need for that either.
Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
Sean Christopherson [Mon, 22 Jan 2024 23:54:00 +0000 (15:54 -0800)]
KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
TDX uses different ABI to get information about VM exit. Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.
When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS. The eventual code will be
VMX:
- get exit reason, intr_info, exit_qualification, and etc from VMCS
- call NMI/INTR handlers (common code)
TDX:
- get exit reason, intr_info, exit_qualification, and etc from guest
registers
- call NMI/INTR handlers (common code)
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 Mar 2024 15:39:16 +0000 (11:39 -0400)]
KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM. TDX doesn't allow VMM to operate VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures. This means we must have a TDX
version of kvm_x86_ops.
The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX. Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.
The eventually converted code will look like this:
Paolo Bonzini [Thu, 4 Apr 2024 13:17:09 +0000 (09:17 -0400)]
Merge branch 'kvm-sev-init2' into HEAD
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. The first source of variability
that was encountered is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
This series adds all the APIs that are needed to customize the features,
with room for future enhancements:
- a new /dev/kvm device attribute to retrieve the set of supported
features (right now, only debug swap)
- a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT
It then puts the new op to work by including the VMSA features as a field
of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility; but I am considering
also making them use zero as the feature mask, and will gladly adjust the
patches if so requested.
In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP
to reuse the KVM_SEV_INIT2 ioctl.
And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all,
SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT,
reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
once the VMSA has been encrypted... which is how the API should have
always behaved. Second, they also synchronize the FPU and AVX state.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Apr 2024 17:06:24 +0000 (13:06 -0400)]
Merge branch 'mm-delete-change-gpte' into HEAD
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair. It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.
It may seem a miracle that it has never caused any hard to trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence. Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.
Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:15 +0000 (07:58 -0400)]
mm: replace set_pte_at_notify() with just set_pte_at()
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify(). It is a synonym of
set_pte_at() and can be replaced with it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:14 +0000 (07:58 -0400)]
mmu_notifier: remove the .change_pte() callback
The scope of set_pte_at_notify() has reduced more and more through the
years. Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}(). However,
that has not been so for over ten years. During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.
Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed. For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:13 +0000 (07:58 -0400)]
KVM: remove unused argument of kvm_handle_hva_range()
The only user was kvm_mmu_notifier_change_pte(), which is now gone.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 5 Apr 2024 11:58:12 +0000 (07:58 -0400)]
KVM: delete .change_pte MMU notifier callback
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:27 +0000 (08:13 -0400)]
selftests: kvm: add test for transferring FPU state into VMSA
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-18-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:26 +0000 (08:13 -0400)]
selftests: kvm: split "launch" phase of SEV VM creation
Allow the caller to set the initial state of the VM. Doing this
before sev_vm_launch() matters for SEV-ES, since that is the
place where the VMSA is updated and after which the guest state
becomes sealed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-17-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:25 +0000 (08:13 -0400)]
selftests: kvm: switch to using KVM_X86_*_VM
This removes the concept of "subtypes", instead letting the tests use proper
VM types that were recently added. While the sev_init_vm() and sev_es_init_vm()
are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT
ioctls, this is limited to VMs that are created manually with
vm_create_barebones().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-16-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:24 +0000 (08:13 -0400)]
selftests: kvm: add tests for KVM_SEV_INIT2
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-15-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:23 +0000 (08:13 -0400)]
KVM: SEV: allow SEV-ES DebugSwap again
The DebugSwap feature of SEV-ES provides a way for confidential guests
to use data breakpoints. Its status is record in VMSA, and therefore
attestation signatures depend on whether it is enabled or not. In order
to avoid invalidating the signatures depending on the host machine, it
was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES
DebugSwap by default", 2024-03-09).
However, we now have a new API to create SEV VMs that allows enabling
DebugSwap based on what the user tells KVM to do, and we also changed the
legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore
possible to re-enable the feature without breaking compatibility with
kernels that pre-date the introduction of DebugSwap, so go ahead.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:22 +0000 (08:13 -0400)]
KVM: SEV: introduce KVM_SEV_INIT2 operation
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired. Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.
The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:21 +0000 (08:13 -0400)]
KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.
Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr(). However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:20 +0000 (08:13 -0400)]
KVM: SEV: define VM types for SEV and SEV-ES
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:19 +0000 (08:13 -0400)]
KVM: SEV: introduce to_kvm_sev_info
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:18 +0000 (08:13 -0400)]
KVM: x86: Add supported_vm_types to kvm_caps
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES),
and also allows the vendor module to specify which VM types are supported.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:17 +0000 (08:13 -0400)]
KVM: x86: add fields to struct kvm_arch for CoCo features
Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.
We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected. For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.
For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that when using VM types that will become an error.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:16 +0000 (08:13 -0400)]
KVM: SEV: store VMSA features in kvm_sev_info
Right now, the set of features that are stored in the VMSA upon
initialization is fixed and depends on the module parameters for
kvm-amd.ko. However, the hypervisor cannot really change it at will
because the feature word has to match between the hypervisor and whatever
computes a measurement of the VMSA for attestation purposes.
Add a field to kvm_sev_info that holds the set of features to be stored
in the VMSA; and query it instead of referring to the module parameters.
Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this
does not yet introduce any functional change, but it paves the way for
an API that allows customization of the features per-VM.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20240209183743.22030-6-pbonzini@redhat.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:15 +0000 (08:13 -0400)]
KVM: SEV: publish supported VMSA features
Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.
The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:14 +0000 (08:13 -0400)]
KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR
Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function. You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.
Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:13 +0000 (08:13 -0400)]
KVM: x86: use u64_to_user_ptr()
There is no danger to the kernel if 32-bit userspace provides a 64-bit
value that has the high bits set, but for whatever reason happens to
resolve to an address that has something mapped there. KVM uses the
checked version of get_user() and put_user(), so any faults are caught
properly.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Apr 2024 12:13:12 +0000 (08:13 -0400)]
KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y
Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs
in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers
is quite confusing.
To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks
and the #VMGEXIT handler. Stubs are also restricted to functions that
check sev_enabled and to the destruction functions sev_free_cpu() and
sev_vm_destroy(), where the style of their callers is to leave checks
to the callers. Most call sites instead rely on dead code elimination
to take care of functions that are guarded with sev_guest() or
sev_es_guest().
Signed-off-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Thu, 4 Apr 2024 12:13:11 +0000 (08:13 -0400)]
KVM: SVM: Invert handling of SEV and SEV_ES feature flags
Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them
in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled. Aside
from the fact that sev_set_cpu_caps() is wildly misleading when it *clears*
capabilities, this will allow compiling out sev.c without falsely
advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20240404121327.3107131-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Thu, 21 Mar 2024 17:37:06 +0000 (17:37 +0000)]
KVM: arm64: Rationalise KVM banner output
We are not very consistent when it comes to displaying which mode
we're in (VHE, {n,h}VHE, protected or not). For example, booting
in protected mode with hVHE results in:
Marc Zyngier [Thu, 21 Mar 2024 11:54:14 +0000 (11:54 +0000)]
arm64: Fix early handling of FEAT_E2H0 not being implemented
Commit 3944382fa6f2 introduced checks for the FEAT_E2H0 not being
implemented. However, the check is absolutely wrong and makes a
point it testing a bit that is guaranteed to be zero.
On top of that, the detection happens way too late, after the
init_el2_state has done its job.
This went undetected because the HW this was tested on has E2H being
RAO/WI, and not RES1. However, the bug shows up when run as a nested
guest, where HCR_EL2.E2H is not necessarily set to 1. As a result,
booting the kernel in hVHE mode fails with timer accesses being
cought in a trap loop (which was fun to debug).
Fix the check for ID_AA64MMFR4_EL1.E2H0, and set the HCR_EL2.E2H bit
early so that it can be checked by the rest of the init sequence.
With this, hVHE works again in a NV environment that doesn't have
FEAT_E2H0.
Fixes: 3944382fa6f2 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative") Signed-off-by: Marc Zyngier <maz@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20240321115414.3169115-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Will Deacon [Wed, 27 Mar 2024 12:48:53 +0000 (12:48 +0000)]
KVM: arm64: Ensure target address is granule-aligned for range TLBI
When zapping a table entry in stage2_try_break_pte(), we issue range
TLB invalidation for the region that was mapped by the table. However,
we neglect to align the base address down to the granule size and so
if we ended up reaching the table entry via a misaligned address then
we will accidentally skip invalidation for some prefix of the affected
address range.
Align 'ctx->addr' down to the granule size when performing TLB
invalidation for an unmapped table in stage2_try_break_pte().
Cc: Raghavendra Rao Ananta <rananta@google.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Quentin Perret <qperret@google.com> Fixes: defc8cc7abf0 ("KVM: arm64: Invalidate the table entries upon a range") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240327124853.11206-5-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Will Deacon [Wed, 27 Mar 2024 12:48:52 +0000 (12:48 +0000)]
KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
Commit c910f2b65518 ("arm64/mm: Update tlb invalidation routines for
FEAT_LPA2") updated the __tlbi_level() macro to take the target level
as an argument, with TLBI_TTL_UNKNOWN (rather than 0) indicating that
the caller cannot provide level information. Unfortunately, the two
implementations of __kvm_tlb_flush_vmid_range() were not updated and so
now ask for an level 0 invalidation if FEAT_LPA2 is implemented.
Fix the problem by passing TLBI_TTL_UNKNOWN instead of 0 as the level
argument to __flush_s2_tlb_range_op() in __kvm_tlb_flush_vmid_range().
Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Marc Zyngier <maz@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Fixes: c910f2b65518 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240327124853.11206-4-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Will Deacon [Wed, 27 Mar 2024 12:48:50 +0000 (12:48 +0000)]
KVM: arm64: Don't defer TLB invalidation when zapping table entries
Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
unmap") introduced deferred TLB invalidation for the stage-2 page-table
so that range-based invalidation can be used for the accumulated
addresses. This works fine if the structure of the page-tables remains
unchanged, but if entire tables are zapped and subsequently freed then
we transiently leave the hardware page-table walker with a reference
to freed memory thanks to the translation walk caches. For example,
stage2_unmap_walker() will free page-table pages:
if (childp)
mm_ops->put_page(childp);
and issue the TLB invalidation later in kvm_pgtable_stage2_unmap():
if (stage2_unmap_defer_tlb_flush(pgt))
/* Perform the deferred TLB invalidations */
kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);
For now, take the conservative approach and invalidate the TLB eagerly
when we clear a table entry. Note, however, that the existing level
hint passed to __kvm_tlb_flush_vmid_ipa() is incorrect and will be
fixed in a subsequent patch.
Cc: Raghavendra Rao Ananta <rananta@google.com> Cc: Shaoqin Huang <shahuang@redhat.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240327124853.11206-2-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Wujie Duan [Mon, 18 Mar 2024 09:47:35 +0000 (17:47 +0800)]
KVM: arm64: Fix out-of-IPA space translation fault handling
Commit 11e5ea5242e3 ("KVM: arm64: Use helpers to classify exception
types reported via ESR") tried to abstract the translation fault
check when handling an out-of IPA space condition, but incorrectly
replaced it with a permission fault check.
Restore the previous translation fault check.
Fixes: 11e5ea5242e3 ("KVM: arm64: Use helpers to classify exception types reported via ESR") Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Wujie Duan <wjduan@linx-info.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/kvmarm/864jd3269g.wl-maz@kernel.org/ Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Oliver Upton [Tue, 5 Mar 2024 18:48:39 +0000 (18:48 +0000)]
KVM: arm64: Fix host-programmed guest events in nVHE
Programming PMU events in the host that count during guest execution is
a feature supported by perf, e.g.
perf stat -e cpu_cycles:G ./lkvm run
While this works for VHE, the guest/host event bitmaps are not carried
through to the hypervisor in the nVHE configuration. Make
kvm_pmu_update_vcpu_events() conditional on whether or not _hardware_
supports PMUv3 rather than if the vCPU as vPMU enabled.
Anup Patel [Thu, 21 Mar 2024 08:50:41 +0000 (14:20 +0530)]
RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
The reads to APLIC in_clrip[x] registers returns rectified input values
of the interrupt sources.
A rectified input value of an interrupt source is defined by the section
"4.5.2 Source configurations (sourcecfg[1]–sourcecfg[1023])" of the
RISC-V AIA specification as:
rectified input value = (incoming wire value) XOR (source is inverted)
Update the riscv_aplic_input() implementation to match the above.
The writes to setipnum_le/be register for APLIC in MSI-mode have special
consideration for level-triggered interrupts as-per the section "4.9.2
Special consideration for level-sensitive interrupt sources" of the RISC-V
AIA specification.
Particularly, the below text from the RISC-V AIA specification defines
the behaviour of writes to setipnum_le/be register for level-triggered
interrupts:
"A second option is for the interrupt service routine to write the
APLIC’s source identity number for the interrupt to the domain’s
setipnum register just before exiting. This will cause the interrupt’s
pending bit to be set to one again if the source is still asserting
an interrupt, but not if the source is not asserting an interrupt."
Fix setipnum_le/be write emulation for in-kernel APLIC by implementing
the above behaviour in aplic_write_pending() function.
Linus Torvalds [Sun, 24 Mar 2024 20:54:06 +0000 (13:54 -0700)]
Merge tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Fix logic that is supposed to prevent placement of the kernel image
below LOAD_PHYSICAL_ADDR
- Use the firmware stack in the EFI stub when running in mixed mode
- Clear BSS only once when using mixed mode
- Check efi.get_variable() function pointer for NULL before trying to
call it
* tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: fix panic in kdump kernel
x86/efistub: Don't clear BSS twice in mixed mode
x86/efistub: Call mixed mode boot services on the firmware's stack
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
Linus Torvalds [Sun, 24 Mar 2024 18:13:56 +0000 (11:13 -0700)]
Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Ensure that the encryption mask at boot is properly propagated on
5-level page tables, otherwise the PGD entry is incorrectly set to
non-encrypted, which causes system crashes during boot.
- Undo the deferred 5-level page table setup as it cannot work with
memory encryption enabled.
- Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
to the default value but the cached variable is not, so subsequent
comparisons might yield the wrong result and as a consequence the
result prevents updating the MSR.
- Register the local APIC address only once in the MPPARSE enumeration
to prevent triggering the related WARN_ONs() in the APIC and topology
code.
- Handle the case where no APIC is found gracefully by registering a
fake APIC in the topology code. That makes all related topology
functions work correctly and does not affect the actual APIC driver
code at all.
- Don't evaluate logical IDs during early boot as the local APIC IDs
are not yet enumerated and the invoked function returns an error
code. Nothing requires the logical IDs before the final CPUID
enumeration takes place, which happens after the enumeration.
- Cure the fallout of the per CPU rework on UP which misplaced the
copying of boot_cpu_data to per CPU data so that the final update to
boot_cpu_data got lost which caused inconsistent state and boot
crashes.
- Use copy_from_kernel_nofault() in the kprobes setup as there is no
guarantee that the address can be safely accessed.
- Reorder struct members in struct saved_context to work around another
kmemleak false positive
- Remove the buggy code which tries to update the E820 kexec table for
setup_data as that is never passed to the kexec kernel.
- Update the resource control documentation to use the proper units.
- Fix a Kconfig warning observed with tinyconfig
* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/64: Move 5-level paging global variable assignments back
x86/boot/64: Apply encryption mask to 5-level pagetable update
x86/cpu: Add model number for another Intel Arrow Lake mobile processor
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Documentation/x86: Document that resctrl bandwidth control units are MiB
x86/mpparse: Register APIC address only once
x86/topology: Handle the !APIC case gracefully
x86/topology: Don't evaluate logical IDs during early boot
x86/cpu: Ensure that CPU info updates are propagated on UP
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
x86/pm: Work around false positive kmemleak report in msr_build_context()
x86/kexec: Do not update E820 kexec table for setup_data
x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
Linus Torvalds [Sun, 24 Mar 2024 18:11:05 +0000 (11:11 -0700)]
Merge tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler doc clarification from Thomas Gleixner:
"A single update for the documentation of the base_slice_ns tunable to
clarify that any value which is less than the tick slice has no effect
because the scheduler tick is not guaranteed to happen within the set
time slice"
* tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
Linus Torvalds [Sun, 24 Mar 2024 17:45:31 +0000 (10:45 -0700)]
Merge tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"This has a set of swiotlb alignment fixes for sometimes very long
standing bugs from Will. We've been discussion them for a while and
they should be solid now"
* tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
swiotlb: Fix alignment checks when both allocation and DMA masks are present
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Fix double-allocation of slots due to broken alignment handling
Oleksandr Tymoshenko [Sat, 23 Mar 2024 06:33:33 +0000 (06:33 +0000)]
efi: fix panic in kdump kernel
Check if get_next_variable() is actually valid pointer before
calling it. In kdump kernel this method is set to NULL that causes
panic during the kexec-ed kernel boot.
Tested with QEMU and OVMF firmware.
Fixes: bad267f9e18f ("efi: verify that variable services are supported") Signed-off-by: Oleksandr Tymoshenko <ovt@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Fri, 22 Mar 2024 16:01:45 +0000 (17:01 +0100)]
x86/efistub: Don't clear BSS twice in mixed mode
Clearing BSS should only be done once, at the very beginning.
efi_pe_entry() is the entrypoint from the firmware, which may not clear
BSS and so it is done explicitly. However, efi_pe_entry() is also used
as an entrypoint by the mixed mode startup code, in which case BSS will
already have been cleared, and doing it again at this point will corrupt
global variables holding the firmware's GDT/IDT and segment selectors.
So make the memset() conditional on whether the EFI stub is running in
native mode.
Ard Biesheuvel [Fri, 22 Mar 2024 15:03:58 +0000 (17:03 +0200)]
x86/efistub: Call mixed mode boot services on the firmware's stack
Normally, the EFI stub calls into the EFI boot services using the stack
that was live when the stub was entered. According to the UEFI spec,
this stack needs to be at least 128k in size - this might seem large but
all asynchronous processing and event handling in EFI runs from the same
stack and so quite a lot of space may be used in practice.
In mixed mode, the situation is a bit different: the bootloader calls
the 32-bit EFI stub entry point, which calls the decompressor's 32-bit
entry point, where the boot stack is set up, using a fixed allocation
of 16k. This stack is still in use when the EFI stub is started in
64-bit mode, and so all calls back into the EFI firmware will be using
the decompressor's limited boot stack.
Due to the placement of the boot stack right after the boot heap, any
stack overruns have gone unnoticed. However, commit
5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code")
moved the definition of the boot heap into C code, and now the boot
stack is placed right at the base of BSS, where any overruns will
corrupt the end of the .data section.
While it would be possible to work around this by increasing the size of
the boot stack, doing so would affect all x86 systems, and mixed mode
systems are a tiny (and shrinking) fraction of the x86 installed base.
So instead, record the firmware stack pointer value when entering from
the 32-bit firmware, and switch to this stack every time a EFI boot
service call is made.
Tom Lendacky [Fri, 22 Mar 2024 15:41:07 +0000 (10:41 -0500)]
x86/boot/64: Move 5-level paging global variable assignments back
Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.
While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.
Tom Lendacky [Fri, 22 Mar 2024 15:41:06 +0000 (10:41 -0500)]
x86/boot/64: Apply encryption mask to 5-level pagetable update
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.
Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.
Adamos Ttofari [Fri, 22 Mar 2024 23:04:39 +0000 (16:04 -0700)]
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Commit 672365477ae8 ("x86/fpu: Update XFD state where required") and
commit 8bf26758ca96 ("x86/fpu: Add XFD state to fpstate") introduced a
per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
order to avoid unnecessary writes to the MSR.
On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
wipes out any stale state. But the per CPU cached xfd value is not
reset, which brings them out of sync.
As a consequence a subsequent xfd_update_state() might fail to update
the MSR which in turn can result in XRSTOR raising a #NM in kernel
space, which crashes the kernel.
To fix this, introduce xfd_set_state() to write xfd_state together
with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.
Fixes: 672365477ae8 ("x86/fpu: Update XFD state where required") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
Tony Luck [Fri, 22 Mar 2024 18:20:15 +0000 (11:20 -0700)]
Documentation/x86: Document that resctrl bandwidth control units are MiB
The memory bandwidth software controller uses 2^20 units rather than
10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
Linux define for 0x00100000.
Update the documentation to use MiB when describing this feature.
It's too late to fix the mount option "mba_MBps" as that is now an
established user interface.
Linus Torvalds [Sat, 23 Mar 2024 21:49:25 +0000 (14:49 -0700)]
Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two regression fixes for the timer and timer migration code:
- Prevent endless timer requeuing which is caused by two CPUs racing
out of idle. This happens when the last CPU goes idle and therefore
has to ensure to expire the pending global timers and some other
CPU come out of idle at the same time and the other CPU wins the
race and expires the global queue. This causes the last CPU to
chase ghost timers forever and reprogramming it's clockevent device
endlessly.
Cure this by re-evaluating the wakeup time unconditionally.
- The split into local (pinned) and global timers in the timer wheel
caused a regression for NOHZ full as it broke the idle tracking of
global timers. On NOHZ full this prevents an self IPI being sent
which in turn causes the timer to be not programmed and not being
expired on time.
Restore the idle tracking for the global timer base so that the
self IPI condition for NOHZ full is working correctly again"
* tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix removed self-IPI on global timer's enqueue in nohz_full
timers/migration: Fix endless timer requeue after idle interrupts
Linus Torvalds [Sat, 23 Mar 2024 21:42:45 +0000 (14:42 -0700)]
Merge tag 'timers-core-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more clocksource updates from Thomas Gleixner:
"A set of updates for clocksource and clockevent drivers:
- A fix for the prescaler of the ARM global timer where the prescaler
mask define only covered 4 bits while it is actully 8 bits wide.
This obviously restricted the possible range of prescaler
adjustments
- A fix for the RISC-V timer which prevents a timer interrupt being
raised while the timer is initialized
- A set of device tree updates to support new system on chips in
various drivers
- Kernel-doc and other cleanups all over the place"
* tag 'timers-core-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-riscv: Clear timer interrupt on timer initialization
dt-bindings: timer: Add support for cadence TTC PWM
clocksource/drivers/arm_global_timer: Simplify prescaler register access
clocksource/drivers/arm_global_timer: Guard against division by zero
clocksource/drivers/arm_global_timer: Make gt_target_rate unsigned long
dt-bindings: timer: add Ralink SoCs system tick counter
clocksource: arm_global_timer: fix non-kernel-doc comment
clocksource/drivers/arm_global_timer: Remove stray tab
clocksource/drivers/arm_global_timer: Fix maximum prescaler value
clocksource/drivers/imx-sysctr: Add i.MX95 support
clocksource/drivers/imx-sysctr: Drop use global variables
dt-bindings: timer: nxp,sysctr-timer: support i.MX95
dt-bindings: timer: renesas: ostm: Document RZ/Five SoC
dt-bindings: timer: renesas,tmu: Document input capture interrupt
clocksource/drivers/ti-32K: Fix misuse of "/**" comment
clocksource/drivers/stm32: Fix all kernel-doc warnings
dt-bindings: timer: exynos4210-mct: Add google,gs101-mct compatible
clocksource/drivers/imx: Fix -Wunused-but-set-variable warning
Linus Torvalds [Sat, 23 Mar 2024 21:30:38 +0000 (14:30 -0700)]
Merge tag 'irq-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A series of fixes for the Renesas RZG21 interrupt chip driver to
prevent spurious and misrouted interrupts.
- Ensure that posted writes are flushed in the eoi() callback
- Ensure that interrupts are masked at the chip level when the
trigger type is changed
- Clear the interrupt status register when setting up edge type
trigger modes
- Ensure that the trigger type and routing information is set before
the interrupt is enabled"
* tag 'irq-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/renesas-rzg2l: Do not set TIEN and TINT source at the same time
irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type
irqchip/renesas-rzg2l: Rename rzg2l_irq_eoi()
irqchip/renesas-rzg2l: Rename rzg2l_tint_eoi()
irqchip/renesas-rzg2l: Flush posted write in irq_eoi()
Linus Torvalds [Sat, 23 Mar 2024 21:17:37 +0000 (14:17 -0700)]
Merge tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry fix from Thomas Gleixner:
"A single fix for the generic entry code:
The trace_sys_enter() tracepoint can modify the syscall number via
kprobes or BPF in pt_regs, but that requires that the syscall number
is re-evaluted from pt_regs after the tracepoint.
A seccomp fix in that area removed the re-evaluation so the change
does not take effect as the code just uses the locally cached number.
Restore the original behaviour by re-evaluating the syscall number
after the tracepoint"
* tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Respect changes to system call number by trace_sys_enter()
Linus Torvalds [Sat, 23 Mar 2024 16:21:26 +0000 (09:21 -0700)]
Merge tag 'powerpc-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
- Handle errors in mark_rodata_ro() and mark_initmem_nx()
- Make struct crash_mem available without CONFIG_CRASH_DUMP
Thanks to Christophe Leroy and Hari Bathini.
* tag 'powerpc-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kdump: Split KEXEC_CORE and CRASH_DUMP dependency
powerpc/kexec: split CONFIG_KEXEC_FILE and CONFIG_CRASH_DUMP
kexec/kdump: make struct crash_mem available without CONFIG_CRASH_DUMP
powerpc: Handle error in mark_rodata_ro() and mark_initmem_nx()
Linus Torvalds [Sat, 23 Mar 2024 16:17:03 +0000 (09:17 -0700)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:
- remove a misuse of kernel-doc comment
- use "Call trace:" for backtraces like other architectures
- implement copy_from_kernel_nofault_allowed() to fix a LKDTM test
- add a "cut here" line for prefetch aborts
- remove unnecessary Kconfing entry for FRAME_POINTER
- remove iwmmxy support for PJ4/PJ4B cores
- use bitfield helpers in ptrace to improve readabililty
- check if folio is reserved before flushing
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9359/1: flush: check if the folio is reserved for no-mapping addresses
ARM: 9354/1: ptrace: Use bitfield helpers
ARM: 9352/1: iwmmxt: Remove support for PJ4/PJ4B cores
ARM: 9353/1: remove unneeded entry for CONFIG_FRAME_POINTER
ARM: 9351/1: fault: Add "cut here" line for prefetch aborts
ARM: 9350/1: fault: Implement copy_from_kernel_nofault_allowed()
ARM: 9349/1: unwind: Add missing "Call trace:" line
ARM: 9334/1: mm: init: remove misuse of kernel-doc comment
Linus Torvalds [Sat, 23 Mar 2024 15:43:21 +0000 (08:43 -0700)]
Merge tag 'hardening-v6.9-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull more hardening updates from Kees Cook:
- CONFIG_MEMCPY_SLOW_KUNIT_TEST is no longer needed (Guenter Roeck)
- Fix needless UTF-8 character in arch/Kconfig (Liu Song)
- Improve __counted_by warning message in LKDTM (Nathan Chancellor)
- Refactor DEFINE_FLEX() for default use of __counted_by
- Disable signed integer overflow sanitizer on GCC < 8
* tag 'hardening-v6.9-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lkdtm/bugs: Improve warning message for compilers without counted_by support
overflow: Change DEFINE_FLEX to take __counted_by member
Revert "kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST"
arch/Kconfig: eliminate needless UTF-8 character in Kconfig help
ubsan: Disable signed integer overflow sanitizer on GCC < 8
Thomas Gleixner [Fri, 22 Mar 2024 18:56:39 +0000 (19:56 +0100)]
x86/mpparse: Register APIC address only once
The APIC address is registered twice. First during the early detection and
afterwards when actually scanning the table for APIC IDs. The APIC and
topology core warn about the second attempt.
Thomas Gleixner [Fri, 22 Mar 2024 18:56:38 +0000 (19:56 +0100)]
x86/topology: Handle the !APIC case gracefully
If there is no local APIC enumerated and registered then the topology
bitmaps are empty. Therefore, topology_init_possible_cpus() will die with
a division by zero exception.
Prevent this by registering a fake APIC id to populate the topology
bitmap. This also allows to use all topology query interfaces
unconditionally. It does not affect the actual APIC code because either
the local APIC address was not registered or no local APIC could be
detected.
Fixes: f1f758a80516 ("x86/topology: Add a mechanism to track topology via APIC IDs") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
Thomas Gleixner [Fri, 22 Mar 2024 18:56:35 +0000 (19:56 +0100)]
x86/cpu: Ensure that CPU info updates are propagated on UP
The boot sequence evaluates CPUID information twice:
1) During early boot
2) When finalizing the early setup right before
mitigations are selected and alternatives are patched.
In both cases the evaluation is stored in boot_cpu_data, but on UP the
copying of boot_cpu_data to the per CPU info of the boot CPU happens
between #1 and #2. So any update which happens in #2 is never propagated to
the per CPU info instance.
Consolidate the whole logic and copy boot_cpu_data right before applying
alternatives as that's the point where boot_cpu_data is in it's final
state and not supposed to change anymore.
This also removes the voodoo mb() from smp_prepare_cpus_common() which
had absolutely no purpose.
Fixes: 71eb4893cfaf ("x86/percpu: Cure per CPU madness on UP") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
Nathan Chancellor [Thu, 21 Mar 2024 20:18:17 +0000 (13:18 -0700)]
lkdtm/bugs: Improve warning message for compilers without counted_by support
The current message for telling the user that their compiler does not
support the counted_by attribute in the FAM_BOUNDS test does not make
much sense either grammatically or semantically. Fix it to make it
correct in both aspects.
Kees Cook [Wed, 6 Mar 2024 23:51:36 +0000 (15:51 -0800)]
overflow: Change DEFINE_FLEX to take __counted_by member
The norm should be flexible array structures with __counted_by
annotations, so DEFINE_FLEX() is updated to expect that. Rename
the non-annotated version to DEFINE_RAW_FLEX(), and update the
few existing users. Additionally add selftests for the macros.
Linus Torvalds [Fri, 22 Mar 2024 20:31:07 +0000 (13:31 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"The vfs has long had a write lifetime hint mechanism that gives the
expected longevity on storage of the data being written. f2fs was the
original consumer of this and used the hint for flash data placement
(mostly to avoid write amplification by placing objects with similar
lifetimes in the same erase block).
More recently the SCSI based UFS (Universal Flash Storage) drivers
have wanted to take advantage of this as well, for the same reasons as
f2fs, necessitating plumbing the write hints through the block layer
and then adding it to the SCSI core.
The vfs write_hints already taken plumbs this as far as block and this
completes the SCSI core enabling based on a recently agreed reuse of
the old write command group number. The additions to the scsi_debug
driver are for emulating this property so we can run tests on it in
the absence of an actual UFS device"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: scsi_debug: Maintain write statistics per group number
scsi: scsi_debug: Implement GET STREAM STATUS
scsi: scsi_debug: Implement the IO Advice Hints Grouping mode page
scsi: scsi_debug: Allocate the MODE SENSE response from the heap
scsi: scsi_debug: Rework subpage code error handling
scsi: scsi_debug: Rework page code error handling
scsi: scsi_debug: Support the block limits extension VPD page
scsi: scsi_debug: Reduce code duplication
scsi: sd: Translate data lifetime information
scsi: scsi_proto: Add structures and constants related to I/O groups and streams
scsi: core: Query the Block Limits Extension VPD page
Linus Torvalds [Fri, 22 Mar 2024 19:46:07 +0000 (12:46 -0700)]
Merge tag 'block-6.9-20240322' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Make an informative message less ominous (Keith)
- Enhanced trace decoding (Guixin)
- TCP updates (Hannes, Li)
- Fabrics connect deadlock fix (Chunguang)
- Platform API migration update (Uwe)
- A new device quirk (Jiawei)
- Remove dead assignment in fd (Yufeng)
* tag 'block-6.9-20240322' of git://git.kernel.dk/linux:
nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
nvme: remove redundant BUILD_BUG_ON check
floppy: remove duplicated code in redo_fd_request()
nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
nvme-tcp: Export the nvme_tcp_wq to sysfs
drivers/nvme: Add quirks for device 126f:2262
nvme: parse format command's lbafu when tracing
nvme: add tracing of reservation commands
nvme: parse zns command's zsa and zrasf to string
nvme: use nvme_disk_is_ns_head helper
nvme: fix reconnection fail due to reserved tag allocation
nvmet: add tracing of zns commands
nvmet: add tracing of authentication commands
nvme-apple: Convert to platform remove callback returning void
nvmet-tcp: do not continue for invalid icreq
nvme: change shutdown timeout setting message
Linus Torvalds [Fri, 22 Mar 2024 19:42:55 +0000 (12:42 -0700)]
Merge tag 'io_uring-6.9-20240322' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"One patch just missed the initial pull, the rest are either fixes or
small cleanups that make our life easier for the next kernel:
- Fix a potential leak in error handling of pinned pages, and clean
it up (Gabriel, Pavel)
- Fix an issue with how read multishot returns retry (me)
- Fix a problem with waitid/futex removals, if we hit the case of
needing to remove all of them at exit time (me)
- Fix for a regression introduced in this merge window, where we
don't always have sr->done_io initialized if the ->prep_async()
path is used (me)
- Fix for SQPOLL setup error handling (me)
- Fix for a poll removal request being delayed (Pavel)
- Rename of a struct member which had a confusing name (Pavel)"
* tag 'io_uring-6.9-20240322' of git://git.kernel.dk/linux:
io_uring/sqpoll: early exit thread if task_context wasn't allocated
io_uring: clear opcode specific data for an early failure
io_uring/net: ensure async prep handlers always initialize ->done_io
io_uring/waitid: always remove waitid entry for cancel all
io_uring/futex: always remove futex entry for cancel all
io_uring: fix poll_remove stalled req completion
io_uring: Fix release of pinned pages when __io_uaddr_map fails
io_uring/kbuf: rename is_mapped
io_uring: simplify io_pages_free
io_uring: clean rings on NO_MMAP alloc fail
io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry
io_uring: don't save/restore iowait state
Linus Torvalds [Fri, 22 Mar 2024 19:34:26 +0000 (12:34 -0700)]
Merge tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix a memory leak in DM integrity recheck code that was added during
the 6.9 merge. Also fix the recheck code to ensure it issues bios
with proper alignment.
- Fix DM snapshot's dm_exception_table_exit() to schedule while
handling an large exception table during snapshot device shutdown.
* tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-integrity: align the outgoing bio in integrity_recheck
dm snapshot: fix lockup in dm_exception_table_exit
dm-integrity: fix a memory leak when rechecking the data
Linus Torvalds [Fri, 22 Mar 2024 18:15:45 +0000 (11:15 -0700)]
Merge tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A patch to minimize blockage when processing very large batches of
dirty caps and two fixes to better handle EOF in the face of multiple
clients performing reads and size-extending writes at the same time"
* tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client:
ceph: set correct cap mask for getattr request for read
ceph: stop copying to iter at EOF on sync reads
ceph: remove SLAB_MEM_SPREAD flag usage
ceph: break the check delayed cap loop every 5s
Linus Torvalds [Fri, 22 Mar 2024 18:12:21 +0000 (11:12 -0700)]
Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix invalid pointer dereference by initializing xmbuf before
tracepoint function is invoked
- Use memalloc_nofs_save() when inserting into quota radix tree
* tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: quota radix tree allocations need to be NOFS on insert
xfs: fix dev_t usage in xmbuf tracepoints
Linus Torvalds [Fri, 22 Mar 2024 17:22:45 +0000 (10:22 -0700)]
Merge tag 'loongarch-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch updates from Huacai Chen:
- Add objtool support for LoongArch
- Add ORC stack unwinder support for LoongArch
- Add kernel livepatching support for LoongArch
- Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig
- Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig
- Some bug fixes and other small changes
* tag 'loongarch-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch/crypto: Clean up useless assignment operations
LoongArch: Define the __io_aw() hook as mmiowb()
LoongArch: Remove superfluous flush_dcache_page() definition
LoongArch: Move {dmw,tlb}_virt_to_page() definition to page.h
LoongArch: Change __my_cpu_offset definition to avoid mis-optimization
LoongArch: Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig
LoongArch: Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig
LoongArch: Add kernel livepatching support
LoongArch: Add ORC stack unwinder support
objtool: Check local label in read_unwind_hints()
objtool: Check local label in add_dead_ends()
objtool/LoongArch: Enable orc to be built
objtool/x86: Separate arch-specific and generic parts
objtool/LoongArch: Implement instruction decoder
objtool/LoongArch: Enable objtool to be built
Linus Torvalds [Fri, 22 Mar 2024 17:09:08 +0000 (10:09 -0700)]
Merge tag 'fbdev-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev updates from Helge Deller:
- Allow console fonts up to 64x128 pixels (Samuel Thibault)
- Prevent division-by-zero in fb monitor code (Roman Smirnov)
- Drop Renesas ARM platforms from Mobile LCDC framebuffer driver (Geert
Uytterhoeven)
- Various code cleanups in viafb, uveafb and mb862xxfb drivers by
Aleksandr Burakov, Li Zhijian and Michael Ellerman
* tag 'fbdev-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: panel-tpo-td043mtea1: Convert sprintf() to sysfs_emit()
fbmon: prevent division by zero in fb_videomode_from_videomode()
fbcon: Increase maximum font width x height to 64 x 128
fbdev: viafb: fix typo in hw_bitblt_1 and hw_bitblt_2
fbdev: mb862xxfb: Fix defined but not used error
fbdev: uvesafb: Convert sprintf/snprintf to sysfs_emit
fbdev: Restrict FB_SH_MOBILE_LCDC to SuperH
Linus Torvalds [Fri, 22 Mar 2024 16:57:00 +0000 (09:57 -0700)]
Merge tag 'spi-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in since the merge window. Most
of it is relatively minor driver specific fixes, there's also fixes
for error handling with SPI flash devices and a fix restoring delay
control functionality for non-GPIO chip selects managed by the core"
* tag 'spi-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-mt65xx: Fix NULL pointer access in interrupt handler
spi: docs: spidev: fix echo command format
spi: spi-imx: fix off-by-one in mx51 CPU mode burst length
spi: lm70llp: fix links in doc and comments
spi: Fix error code checking in spi_mem_exec_op()
spi: Restore delays for non-GPIO chip select
spi: lpspi: Avoid potential use-after-free in probe()
Linus Torvalds [Fri, 22 Mar 2024 16:52:37 +0000 (09:52 -0700)]
Merge tag 'regulator-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One fix that came in during the merge window, fixing a problem with
bootstrapping the state of exclusive regulators which have a parent
regulator"
* tag 'regulator-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Propagate the regulator state in case of exclusive get
Linus Torvalds [Fri, 22 Mar 2024 16:44:19 +0000 (09:44 -0700)]
Merge tag 'sound-fix2-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull more sound fixes from Takashi Iwai:
"The remaining fixes for 6.9-rc1 that have been gathered in this week.
More about ASoC at this time (one long-standing fix for compress
offload, SOF, AMD ACP, Rockchip, Cirrus and tlv320 stuff) while
another regression fix in ALSA core and a couple of HD-audio quirks as
usual are included"
* tag 'sound-fix2-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: control: Fix unannotated kfree() cleanup
ALSA: hda/realtek: Add quirks for some Clevo laptops
ALSA: hda/realtek: Add quirk for HP Spectre x360 14 eu0000
ALSA: hda/realtek: fix the hp playback volume issue for LG machines
ASoC: soc-compress: Fix and add DPCM locking
ASoC: SOF: amd: Skip IRAM/DRAM size modification for Steam Deck OLED
ASoC: SOF: amd: Move signed_fw_image to struct acp_quirk_entry
ASoC: amd: yc: Revert "add new YC platform variant (0x63) support"
ASoC: amd: yc: Revert "Fix non-functional mic on Lenovo 21J2"
ASoC: soc-core.c: Skip dummy codec when adding platforms
ASoC: rockchip: i2s-tdm: Fix inaccurate sampling rates
ASoC: dt-bindings: cirrus,cs42l43: Fix 'gpio-ranges' schema
ASoC: amd: yc: Fix non-functional mic on ASUS M7600RE
ASoC: tlv320adc3xxx: Don't strip remove function when driver is builtin
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
Following warning is sometimes observed while booting my servers:
[ 3.594838] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
[ 3.602918] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-1
...
[ 3.851862] DMA: preallocated 1024 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
If 'nokaslr' boot option is set, the warning always happens.
On x86, ZONE_DMA is small zone at the first 16MB of physical address
space. When this problem happens, most of that space seems to be used by
decompressed kernel. Thereby, there is not enough space at DMA_ZONE to
meet the request of DMA pool allocation.
The commit 2f77465b05b1 ("x86/efistub: Avoid placing the kernel below
LOAD_PHYSICAL_ADDR") tried to fix this problem by introducing lower
bound of allocation.
But the fix is not complete.
efi_random_alloc() allocates pages by following steps.
1. Count total available slots ('total_slots')
2. Select a slot ('target_slot') to allocate randomly
3. Calculate a starting address ('target') to be included target_slot
4. Allocate pages, which starting address is 'target'
In step 1, 'alloc_min' is used to offset the starting address of memory
chunk. But in step 3 'alloc_min' is not considered at all. As the
result, 'target' can be miscalculated and become lower than 'alloc_min'.
When KASLR is disabled, 'target_slot' is always 0 and the problem
happens everytime if the EFI memory map of the system meets the
condition.
Fix this problem by calculating 'target' considering 'alloc_min'.
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
Read from an unsafe address with copy_from_kernel_nofault() in
arch_adjust_kprobe_addr() because this function is used before checking
the address is in text or not. Syzcaller bot found a bug and reported
the case if user specifies inaccessible data area,
arch_adjust_kprobe_addr() will cause a kernel panic.
Anton Altaparmakov [Thu, 14 Mar 2024 14:26:56 +0000 (14:26 +0000)]
x86/pm: Work around false positive kmemleak report in msr_build_context()
Since:
7ee18d677989 ("x86/power: Make restore_processor_context() sane")
kmemleak reports this issue:
unreferenced object 0xf68241e0 (size 32):
comm "swapper/0", pid 1, jiffies 4294668610 (age 68.432s)
hex dump (first 32 bytes):
00 cc cc cc 29 10 01 c0 00 00 00 00 00 00 00 00 ....)...........
00 42 82 f6 cc cc cc cc cc cc cc cc cc cc cc cc .B..............
backtrace:
[<461c1d50>] __kmem_cache_alloc_node+0x106/0x260
[<ea65e13b>] __kmalloc+0x54/0x160
[<c3858cd2>] msr_build_context.constprop.0+0x35/0x100
[<46635aff>] pm_check_save_msr+0x63/0x80
[<6b6bb938>] do_one_initcall+0x41/0x1f0
[<3f3add60>] kernel_init_freeable+0x199/0x1e8
[<3b538fde>] kernel_init+0x1a/0x110
[<938ae2b2>] ret_from_fork+0x1c/0x28
Which is a false positive.
Reproducer:
- Run rsync of whole kernel tree (multiple times if needed).
- start a kmemleak scan
- Note this is just an example: a lot of our internal tests hit these.
The root cause is similar to the fix in:
b0b592cf0836 x86/pm: Fix false positive kmemleak report in msr_build_context()
ie. the alignment within the packed struct saved_context
which has everything unaligned as there is only "u16 gs;" at start of
struct where in the past there were four u16 there thus aligning
everything afterwards. The issue is with the fact that Kmemleak only
searches for pointers that are aligned (see how pointers are scanned in
kmemleak.c) so when the struct members are not aligned it doesn't see
them.
Testing:
We run a lot of tests with our CI, and after applying this fix we do not
see any kmemleak issues any more whilst without it we see hundreds of
the above report. From a single, simple test run consisting of 416 individual test
cases on kernel 5.10 x86 with kmemleak enabled we got 20 failures due to this,
which is quite a lot. With this fix applied we get zero kmemleak related failures.
Fixes: 7ee18d677989 ("x86/power: Make restore_processor_context() sane") Signed-off-by: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Cc: stable@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240314142656.17699-1-anton@tuxera.com
Dave Young [Fri, 22 Mar 2024 05:15:08 +0000 (13:15 +0800)]
x86/kexec: Do not update E820 kexec table for setup_data
crashkernel reservation failed on a Thinkpad t440s laptop recently.
Actually the memblock reservation succeeded, but later insert_resource()
failed.
Test steps:
kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
kexec reboot ->
dmesg|grep "crashkernel reserved";
crashkernel memory range like below reserved successfully:
0x00000000d0000000 - 0x00000000da000000
But no such "Crash kernel" region in /proc/iomem
The background story:
Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.
Before the kexec kernel reboots nobody passes the old setup_data, and
kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed.
Thus the old setup data memory is not used at all.
Due to old kernel updates the kexec e820 table as well so kexec kernel
sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
regions are inserted into resources list in the kexec kernel by
e820__reserve_resources().
Note, due to no setup_data is passed in for those old regions they are not
early reserved (by function early_reserve_memory), and the crashkernel
memblock reservation will just treat them as usable memory and it could
reserve the crashkernel region which overlaps with the old setup_data
regions. And just like the bug I noticed here, kdump insert_resource
failed because e820__reserve_resources has added the overlapped chunks
in /proc/iomem already.
Finally, looking at the code, the old setup_data regions are not used
at all as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI etc could be needed, kexec should pass
the info as new setup_data so that kexec kernel can take care of them.
This should be taken care of in other separate patches if needed.
Thus drop the useless buggy code here.
Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Eric DeVolder <eric.devolder@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/Zf0T3HCG-790K-pZ@darkstar.users.ipa.redhat.com
Linus Torvalds [Fri, 22 Mar 2024 02:14:28 +0000 (19:14 -0700)]
Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Various get_inode_info_fixes
- Fix for querying xattrs of cached dirs
- Four minor cleanup fixes (including adding some header corrections
and a missing flag)
- Performance improvement for deferred close
- Two query interface fixes
* tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb311: additional compression flag defined in updated protocol spec
smb311: correct incorrect offset field in compression header
cifs: Move some extern decls from .c files to .h
cifs: remove redundant variable assignment
cifs: fixes for get_inode_info
cifs: open_cached_dir(): add FILE_READ_EA to desired access
cifs: reduce warning log level for server not advertising interfaces
cifs: make sure server interfaces are requested only for SMB3+
cifs: defer close file handles having RH lease
Linus Torvalds [Fri, 22 Mar 2024 02:04:31 +0000 (19:04 -0700)]
Merge tag 'drm-next-2024-03-22' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Fixes from the last week (or 3 weeks in amdgpu case), after amdgpu,
it's xe and nouveau then a few scattered core fixes.
core:
- fix rounding in drm_fixp2int_round()
bridge:
- fix documentation for DRM_BRIDGE_OP_EDID
sun4i:
- fix 64-bit division on 32-bit architectures
tests:
- fix dependency on DRM_KMS_HELPER
probe-helper:
- never return negative values from .get_modes() plus driver fixes
xe:
- invalidate userptr vma on page pin fault
- fail early on sysfs file creation error
- skip VMA pinning on xe_exec if no batches
nouveau:
- clear bo resource bus after eviction
- documentation fixes
- don't check devinit disable on GSP
amdgpu:
- Freesync fixes
- UAF IOCTL fixes
- Fix mmhub client ID mapping
- IH 7.0 fix
- DML2 fixes
- VCN 4.0.6 fix
- GART bind fix
- GPU reset fix
- SR-IOV fix
- OD table handling fixes
- Fix TA handling on boards without display hardware
- DML1 fix
- ABM fix
- eDP panel fix
- DPPCLK fix
- HDCP fix
- Revert incorrect error case handling in ioremap
- VPE fix
- HDMI fixes
- SDMA 4.4.2 fix
- Other misc fixes
amdkfd:
- Fix duplicate BO handling in process restore"
* tag 'drm-next-2024-03-22' of https://gitlab.freedesktop.org/drm/kernel: (50 commits)
drm/amdgpu/pm: Don't use OD table on Arcturus
drm/amdgpu: drop setting buffer funcs in sdma442
drm/amd/display: Fix noise issue on HDMI AV mute
drm/amd/display: Revert Remove pixle rate limit for subvp
Revert "drm/amdgpu/vpe: don't emit cond exec command under collaborate mode"
Revert "drm/amd/amdgpu: Fix potential ioremap() memory leaks in amdgpu_device_init()"
drm/amd/display: Add a dc_state NULL check in dc_state_release
drm/amd/display: Return the correct HDCP error code
drm/amd/display: Implement wait_for_odm_update_pending_complete
drm/amd/display: Lock all enabled otg pipes even with no planes
drm/amd/display: Amend coasting vtotal for replay low hz
drm/amd/display: Fix idle check for shared firmware state
drm/amd/display: Update odm when ODM combine is changed on an otg master pipe with no plane
drm/amd/display: Init DPPCLK from SMU on dcn32
drm/amd/display: Add monitor patch for specific eDP
drm/amd/display: Allow dirty rects to be sent to dmub when abm is active
drm/amd/display: Override min required DCFCLK in dml1_validate
drm/amdgpu: Bypass display ta if display hw is not available
drm/amdgpu: correct the KGQ fallback message
drm/amdgpu/pm: Check the validity of overdiver power limit
...