Ashish Kalra [Mon, 10 Feb 2025 22:54:18 +0000 (22:54 +0000)]
x86/sev: Fix broken SNP support with KVM module built-in
Fix issues with enabling SNP host support and effectively SNP support
which is broken with respect to the KVM module being built-in.
SNP host support is enabled in snp_rmptable_init() which is invoked as
device_initcall(). SNP check on IOMMU is done during IOMMU PCI init
(IOMMU_PCI_INIT stage). And for that reason snp_rmptable_init() is
currently invoked via device_initcall() and cannot be invoked via
subsys_initcall() as core IOMMU subsystem gets initialized via
subsys_initcall().
Now, if kvm_amd module is built-in, it gets initialized before SNP host
support is enabled in snp_rmptable_init() :
And then svm_x86_ops->enable_virtualization_cpu()
(svm_enable_virtualization_cpu) programs MSR_VM_HSAVE_PA as following:
wrmsrl(MSR_VM_HSAVE_PA, sd->save_area_pa);
So VM_HSAVE_PA is non-zero before SNP support is enabled on all CPUs.
snp_rmptable_init() gets invoked after svm_enable_virtualization_cpu()
as following :
...
[ 11.256138] kvm_amd: in svm_enable_virtualization_cpu
...
[ 11.264918] SEV-SNP: in snp_rmptable_init
This triggers a #GP exception in snp_rmptable_init() when snp_enable()
is invoked to set SNP_EN in SYSCFG MSR:
This #GP exception is getting triggered due to the following errata for
AMD family 19h Models 10h-1Fh Processors:
Processor may generate spurious #GP(0) Exception on WRMSR instruction:
Description:
The Processor will generate a spurious #GP(0) Exception on a WRMSR
instruction if the following conditions are all met:
- the target of the WRMSR is a SYSCFG register.
- the write changes the value of SYSCFG.SNPEn from 0 to 1.
- One of the threads that share the physical core has a non-zero
value in the VM_HSAVE_PA MSR.
The document being referred to above:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/revision-guides/57095-PUB_1_01.pdf
To summarize, with kvm_amd module being built-in, KVM/SVM initialization
happens before host SNP is enabled and this SVM initialization
sets VM_HSAVE_PA to non-zero, which then triggers a #GP when
SYSCFG.SNPEn is being set and this will subsequently cause
SNP_INIT(_EX) to fail with INVALID_CONFIG error as SYSCFG[SnpEn] is not
set on all CPUs.
Essentially SNP host enabling code should be invoked before KVM
initialization, which is currently not the case when KVM is built-in.
Add fix to call snp_rmptable_init() early from iommu_snp_enable()
directly and not invoked via device_initcall() which enables SNP host
support before KVM initialization with kvm_amd module built-in.
Add additional handling for `iommu=off` or `amd_iommu=off` options.
Note that IOMMUs need to be enabled for SNP initialization, therefore,
if host SNP support is enabled but late IOMMU initialization fails
then that will cause PSP driver's SNP_INIT to fail as IOMMU SNP sanity
checks in SNP firmware will fail with invalid configuration error as
below:
Sean Christopherson [Mon, 10 Feb 2025 22:54:02 +0000 (22:54 +0000)]
KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading. Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sean Christopherson [Mon, 10 Feb 2025 22:53:47 +0000 (22:53 +0000)]
crypto: ccp: Add external API interface for PSP module initialization
KVM is dependent on the PSP SEV driver and PSP SEV driver needs to be
loaded before KVM module. In case of module loading any dependent
modules are automatically loaded but in case of built-in modules there
is no inherent mechanism available to specify dependencies between
modules and ensure that any dependent modules are loaded implicitly.
Add a new external API interface for PSP module initialization which
allows PSP SEV driver to be loaded explicitly if KVM is built-in.
Signed-off-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-ID: <15279ca0cad56a07cf12834ec544310f85ff5edc.1739226950.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 14 Feb 2025 23:32:47 +0000 (18:32 -0500)]
Merge tag 'kvmarm-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.14, take #2
- Large set of fixes for vector handling, specially in the interactions
between host and guest state. This fixes a number of bugs affecting
actual deployments, and greatly simplifies the FP/SIMD/SVE handling.
Thanks to Mark Rutland for dealing with this thankless task.
- Fix an ugly race between vcpu and vgic creation/init, resulting in
unexpected behaviours.
- Fix use of kernel VAs at EL2 when emulating timers with nVHE.
Marc Zyngier [Wed, 12 Feb 2025 18:25:58 +0000 (18:25 +0000)]
KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()
If userspace creates vcpus, then a vgic, we end-up in a situation
where irqchip_in_kernel() will return true, but no private interrupt
has been allocated for these vcpus. This situation will continue
until userspace initialises the vgic, at which point we fix the
early vcpus. Should a vcpu run or be initialised in the interval,
bad things may happen.
An obvious solution is to move this fix-up phase to the point where
the vgic is created. This ensures that from that point onwards,
all vcpus have their private interrupts, as new vcpus will directly
allocate them.
With that, we have the invariant that when irqchip_in_kernel() is
true, all vcpus have their private interrupts.
Marc Zyngier [Wed, 12 Feb 2025 18:25:57 +0000 (18:25 +0000)]
KVM: arm64: timer: Drop warning on failed interrupt signalling
We currently spit out a warning if making a timer interrupt pending
fails. But not only this is loud and easy to trigger from userspace,
we also fail to do anything useful with that information.
Dropping the warning is the easiest thing to do for now. We can
always add error reporting if we really want in the future.
Quentin Perret [Thu, 13 Feb 2025 15:36:14 +0000 (15:36 +0000)]
KVM: arm64: Fix alignment of kvm_hyp_memcache allocations
When allocating guest stage-2 page-table pages at EL2, pKVM can consume
pages from the host-provided kvm_hyp_memcache. As pgtable.c expects
zeroed pages, guest_s2_zalloc_page() actively implements this zeroing
with a PAGE_SIZE memset. Unfortunately, we don't check the page
alignment of the host-provided address before doing so, which could
lead to the memset overrunning the page if the host was malicious.
Fix this by simply force-aligning all kvm_hyp_memcache allocations to
page boundaries.
Fixes: 60dfe093ec13 ("KVM: arm64: Instantiate guest stage-2 page-tables at EL2") Reported-by: Ben Simner <ben.simner@cl.cam.ac.uk> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250213153615.3642515-1-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Marc Zyngier [Wed, 12 Feb 2025 17:34:54 +0000 (17:34 +0000)]
KVM: arm64: Convert timer offset VA when accessed in HYP code
Now that EL2 has gained some early timer emulation, it accesses
the offsets pointed to by the timer structure, both of which
live in the KVM structure.
Of course, these are *kernel* pointers, so the dereferencing
of these pointers in non-kernel code must be itself be offset.
Given switch.h its own version of timer_get_offset() and use that
instead.
Fixes: b86fc215dc26d ("KVM: arm64: Handle counter access early in non-HYP context") Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Tested-by: Anders Roxell <anders.roxell@linaro.org> Link: https://lore.kernel.org/r/20250212173454.2864462-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Thu, 13 Feb 2025 09:15:31 +0000 (09:15 +0000)]
KVM: arm64: Simplify warning in kvm_arch_vcpu_load_fp()
At the end of kvm_arch_vcpu_load_fp() we check that no bits are set in
SVCR. We only check this for protected mode despite this mattering
equally for non-protected mode, and the comment above this is confusing.
Remove the comment and simplify the check, moving from WARN_ON() to
WARN_ON_ONCE() to avoid spamming the log.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:26 +0000 (19:52 +0000)]
KVM: arm64: Eagerly switch ZCR_EL{1,2}
In non-protected KVM modes, while the guest FPSIMD/SVE/SME state is live on the
CPU, the host's active SVE VL may differ from the guest's maximum SVE VL:
* For VHE hosts, when a VM uses NV, ZCR_EL2 contains a value constrained
by the guest hypervisor, which may be less than or equal to that
guest's maximum VL.
Note: in this case the value of ZCR_EL1 is immaterial due to E2H.
* For nVHE/hVHE hosts, ZCR_EL1 contains a value written by the guest,
which may be less than or greater than the guest's maximum VL.
Note: in this case hyp code traps host SVE usage and lazily restores
ZCR_EL2 to the host's maximum VL, which may be greater than the
guest's maximum VL.
This can be the case between exiting a guest and kvm_arch_vcpu_put_fp().
If a softirq is taken during this period and the softirq handler tries
to use kernel-mode NEON, then the kernel will fail to save the guest's
FPSIMD/SVE state, and will pend a SIGKILL for the current thread.
This happens because kvm_arch_vcpu_ctxsync_fp() binds the guest's live
FPSIMD/SVE state with the guest's maximum SVE VL, and
fpsimd_save_user_state() verifies that the live SVE VL is as expected
before attempting to save the register state:
Fix this and make this a bit easier to reason about by always eagerly
switching ZCR_EL{1,2} at hyp during guest<->host transitions. With this
happening, there's no need to trap host SVE usage, and the nVHE/nVHE
__deactivate_cptr_traps() logic can be simplified to enable host access
to all present FPSIMD/SVE/SME features.
In protected nVHE/hVHE modes, the host's state is always saved/restored
by hyp, and the guest's state is saved prior to exit to the host, so
from the host's PoV the guest never has live FPSIMD/SVE/SME state, and
the host's ZCR_EL1 is never clobbered by hyp.
Fixes: 8c8010d69c132273 ("KVM: arm64: Save/restore SVE state for nVHE") Fixes: 2e3cf82063a00ea0 ("KVM: arm64: nv: Ensure correct VL is loaded before saving SVE state") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-9-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:25 +0000 (19:52 +0000)]
KVM: arm64: Mark some header functions as inline
The shared hyp switch header has a number of static functions which
might not be used by all files that include the header, and when unused
they will provoke compiler warnings, e.g.
| In file included from arch/arm64/kvm/hyp/nvhe/hyp-main.c:8:
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:703:13: warning: 'kvm_hyp_handle_dabt_low' defined but not used [-Wunused-function]
| 703 | static bool kvm_hyp_handle_dabt_low(struct kvm_vcpu *vcpu, u64 *exit_code)
| | ^~~~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:682:13: warning: 'kvm_hyp_handle_cp15_32' defined but not used [-Wunused-function]
| 682 | static bool kvm_hyp_handle_cp15_32(struct kvm_vcpu *vcpu, u64 *exit_code)
| | ^~~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:662:13: warning: 'kvm_hyp_handle_sysreg' defined but not used [-Wunused-function]
| 662 | static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code)
| | ^~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:458:13: warning: 'kvm_hyp_handle_fpsimd' defined but not used [-Wunused-function]
| 458 | static bool kvm_hyp_handle_fpsimd(struct kvm_vcpu *vcpu, u64 *exit_code)
| | ^~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:329:13: warning: 'kvm_hyp_handle_mops' defined but not used [-Wunused-function]
| 329 | static bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
| | ^~~~~~~~~~~~~~~~~~~
Mark these functions as 'inline' to suppress this warning. This
shouldn't result in any functional change.
At the same time, avoid the use of __alias() in the header and alias
kvm_hyp_handle_iabt_low() and kvm_hyp_handle_watchpt_low() to
kvm_hyp_handle_memory_fault() using CPP, matching the style in the rest
of the kernel. For consistency, kvm_hyp_handle_memory_fault() is also
marked as 'inline'.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-8-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:24 +0000 (19:52 +0000)]
KVM: arm64: Refactor exit handlers
The hyp exit handling logic is largely shared between VHE and nVHE/hVHE,
with common logic in arch/arm64/kvm/hyp/include/hyp/switch.h. The code
in the header depends on function definitions provided by
arch/arm64/kvm/hyp/vhe/switch.c and arch/arm64/kvm/hyp/nvhe/switch.c
when they include the header.
This is an unusual header dependency, and prevents the use of
arch/arm64/kvm/hyp/include/hyp/switch.h in other files as this would
result in compiler warnings regarding missing definitions, e.g.
| In file included from arch/arm64/kvm/hyp/nvhe/hyp-main.c:8:
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:733:31: warning: 'kvm_get_exit_handler_array' used but never defined
| 733 | static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu);
| | ^~~~~~~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:735:13: warning: 'early_exit_filter' used but never defined
| 735 | static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code);
| | ^~~~~~~~~~~~~~~~~
Refactor the logic such that the header doesn't depend on anything from
the C files. There should be no functional change as a result of this
patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-7-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:23 +0000 (19:52 +0000)]
KVM: arm64: Refactor CPTR trap deactivation
For historical reasons, the VHE and nVHE/hVHE implementations of
__activate_cptr_traps() pair with a common implementation of
__kvm_reset_cptr_el2(), which ideally would be named
__deactivate_cptr_traps().
Rename __kvm_reset_cptr_el2() to __deactivate_cptr_traps(), and split it
into separate VHE and nVHE/hVHE variants so that each can be paired with
its corresponding implementation of __activate_cptr_traps().
At the same time, fold kvm_write_cptr_el2() into its callers. This
makes it clear in-context whether a write is made to the CPACR_EL1
encoding or the CPTR_EL2 encoding, and removes the possibility of
confusion as to whether kvm_write_cptr_el2() reformats the sysreg fields
as cpacr_clear_set() does.
In the nVHE/hVHE implementation of __activate_cptr_traps(), placing the
sysreg writes within the if-else blocks requires that the call to
__activate_traps_fpsimd32() is moved earlier, but as this was always
called before writing to CPTR_EL2/CPACR_EL1, this should not result in a
functional change.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-6-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:22 +0000 (19:52 +0000)]
KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.SMEN (i.e. CPTR_EL2.SMEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic
has historically been broken, and is currently redundant.
This logic was originally introduced in commit:
861262ab86270206 ("KVM: arm64: Handle SME host state when running guests")
At the time, the VHE hyp code would reset CPTR_EL2.SMEN to 0b00 when
returning to the host, trapping host access to SME state. Unfortunately,
this was unsafe as the host could take a softirq before calling
kvm_arch_vcpu_put_fp(), and if a softirq handler were to use kernel mode
NEON the resulting attempt to save the live FPSIMD/SVE/SME state would
result in a fatal trap.
That issue was limited to VHE mode. For nVHE/hVHE modes, KVM always
saved/restored the host kernel's CPACR_EL1 value, and configured
CPTR_EL2.TSM to 0b0, ensuring that host usage of SME would not be
trapped.
The issue above was incidentally fixed by commit:
375110ab51dec5dc ("KVM: arm64: Fix resetting SME trap values on reset for (h)VHE")
That commit changed the VHE hyp code to configure CPTR_EL2.SMEN to 0b01
when returning to the host, permitting host kernel usage of SME,
avoiding the issue described above. At the time, this was not identified
as a fix for commit 861262ab86270206.
Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SME trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.
Remove the redundant logic.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-5-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:21 +0000 (19:52 +0000)]
KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.ZEN (i.e. CPTR_EL2.ZEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic is
currently redundant.
The VHE hyp code unconditionally configures CPTR_EL2.ZEN to 0b01 when
returning to the host, permitting host kernel usage of SVE.
Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SVE trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.
Remove the redundant logic.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-4-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:20 +0000 (19:52 +0000)]
KVM: arm64: Remove host FPSIMD saving for non-protected KVM
Now that the host eagerly saves its own FPSIMD/SVE/SME state,
non-protected KVM never needs to save the host FPSIMD/SVE/SME state,
and the code to do this is never used. Protected KVM still needs to
save/restore the host FPSIMD/SVE state to avoid leaking guest state to
the host (and to avoid revealing to the host whether the guest used
FPSIMD/SVE/SME), and that code needs to be retained.
Remove the unused code and data structures.
To avoid the need for a stub copy of kvm_hyp_save_fpsimd_host() in the
VHE hyp code, the nVHE/hVHE version is moved into the shared switch
header, where it is only invoked when KVM is in protected mode.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-3-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Mark Rutland [Mon, 10 Feb 2025 19:52:19 +0000 (19:52 +0000)]
KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
There are several problems with the way hyp code lazily saves the host's
FPSIMD/SVE state, including:
* Host SVE being discarded unexpectedly due to inconsistent
configuration of TIF_SVE and CPACR_ELx.ZEN. This has been seen to
result in QEMU crashes where SVE is used by memmove(), as reported by
Eric Auger:
https://issues.redhat.com/browse/RHEL-68997
* Host SVE state is discarded *after* modification by ptrace, which was an
unintentional ptrace ABI change introduced with lazy discarding of SVE state.
* The host FPMR value can be discarded when running a non-protected VM,
where FPMR support is not exposed to a VM, and that VM uses
FPSIMD/SVE. In these cases the hyp code does not save the host's FPMR
before unbinding the host's FPSIMD/SVE/SME state, leaving a stale
value in memory.
Avoid these by eagerly saving and "flushing" the host's FPSIMD/SVE/SME
state when loading a vCPU such that KVM does not need to save any of the
host's FPSIMD/SVE/SME state. For clarity, fpsimd_kvm_prepare() is
removed and the necessary call to fpsimd_save_and_flush_cpu_state() is
placed in kvm_arch_vcpu_load_fp(). As 'fpsimd_state' and 'fpmr_ptr'
should not be used, they are set to NULL; all uses of these will be
removed in subsequent patches.
Historical problems go back at least as far as v5.17, e.g. erroneous
assumptions about TIF_SVE being clear in commit:
8383741ab2e773a9 ("KVM: arm64: Get rid of host SVE tracking/saving")
... and so this eager save+flush probably needs to be backported to ALL
stable trees.
Fixes: 93ae6b01bafee8fa ("KVM: arm64: Discard any SVE state when entering KVM guests") Fixes: 8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Fixes: ef3be86021c3bdf3 ("KVM: arm64: Add save/restore support for FPMR") Reported-by: Eric Auger <eauger@redhat.com> Reported-by: Wilco Dijkstra <wilco.dijkstra@arm.com> Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Eric Auger <eric.auger@redhat.com> Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Fuad Tabba <tabba@google.com> Cc: Jeremy Linton <jeremy.linton@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250210195226.1215254-2-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Sean Christopherson [Sat, 25 Jan 2025 01:18:33 +0000 (17:18 -0800)]
KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
Move the conditional loading of hardware DR6 with the guest's DR6 value
out of the core .vcpu_run() loop to fix a bug where KVM can load hardware
with a stale vcpu->arch.dr6.
When the guest accesses a DR and host userspace isn't debugging the guest,
KVM disables DR interception and loads the guest's values into hardware on
VM-Enter and saves them on VM-Exit. This allows the guest to access DRs
at will, e.g. so that a sequence of DR accesses to configure a breakpoint
only generates one VM-Exit.
For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also
identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest)
and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading
DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop.
But for DR6, the guest's value doesn't need to be loaded into hardware for
KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas
VMX requires software to manually load the guest value, and so loading the
guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done
_inside_ the core run loop.
Unfortunately, saving the guest values on VM-Exit is initiated by common
x86, again outside of the core run loop. If the guest modifies DR6 (in
hardware, when DR interception is disabled), and then the next VM-Exit is
a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and
clobber the guest's actual value.
The bug shows up primarily with nested VMX because KVM handles the VMX
preemption timer in the fastpath, and the window between hardware DR6
being modified (in guest context) and DR6 being read by guest software is
orders of magnitude larger in a nested setup. E.g. in non-nested, the
VMX preemption timer would need to fire precisely between #DB injection
and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the
window where hardware DR6 is "dirty" extends all the way from L1 writing
DR6 to VMRESUME (in L1).
L1's view:
==========
<L1 disables DR interception>
CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0
A: L1 Writes DR6
CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1
Reported-by: John Stultz <jstultz@google.com> Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT") Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Sean Christopherson [Thu, 30 Jan 2025 01:08:25 +0000 (17:08 -0800)]
KVM: nSVM: Enter guest mode before initializing nested NPT MMU
When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest
mode prior to initializing the MMU for nested NPT so that guest_mode is
set in the MMU's role. KVM's model is that all L2 MMUs are tagged with
guest_mode, as the behavior of hypervisor MMUs tends to be significantly
different than kernel MMUs.
Practically speaking, the bug is relatively benign, as KVM only directly
queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and
kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths
that are optimizations (mmu_page_zap_pte() and
shadow_mmu_try_split_huge_pages()).
And while the role is incorprated into shadow page usage, because nested
NPT requires KVM to be using NPT for L1, reusing shadow pages across L1
and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs
will have direct=0.
Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning
of the flow instead of forcing guest_mode in the MMU, as nothing in
nested_vmcb02_prepare_control() between the old and new locations touches
TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present
inconsistent vCPU state to the MMU.
Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control") Cc: stable@vger.kernel.org Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Sean Christopherson [Sat, 18 Jan 2025 00:34:54 +0000 (16:34 -0800)]
KVM: selftests: Add CPUID tests for Hyper-V features that need in-kernel APIC
Add testcases to x86's Hyper-V CPUID test to verify that KVM advertises
support for features that require an in-kernel local APIC appropriately,
i.e. that KVM hides support from the vCPU-scoped ioctl if the VM doesn't
have an in-kernel local APIC.
Allocate, get, and free the CPUID array in the Hyper-V CPUID test in the
test's core helper, instead of copy+pasting code at each call site. In
addition to deduplicating a small amount of code, restricting visibility
of the array to a single invocation of the core test prevents "leaking" an
array across test cases. Passing in @vcpu to the helper will also allow
pivoting on VM-scoped information without needing to pass more booleans,
e.g. to conditionally assert on features that require an in-kernel APIC.
To avoid use-after-free bugs due to overzealous and careless developers,
opportunstically add a comment to explain that the system-scoped helper
caches the Hyper-V CPUID entries, i.e. that the caller is not responsible
for freeing the memory.
Sean Christopherson [Sat, 18 Jan 2025 00:34:51 +0000 (16:34 -0800)]
KVM: x86: Reject Hyper-V's SEND_IPI hypercalls if local APIC isn't in-kernel
Advertise support for Hyper-V's SEND_IPI and SEND_IPI_EX hypercalls if and
only if the local API is emulated/virtualized by KVM, and explicitly reject
said hypercalls if the local APIC is emulated in userspace, i.e. don't rely
on userspace to opt-in to KVM_CAP_HYPERV_ENFORCE_CPUID.
Rejecting SEND_IPI and SEND_IPI_EX fixes a NULL-pointer dereference if
Hyper-V enlightenments are exposed to the guest without an in-kernel local
APIC:
Note, checking the sending vCPU is sufficient, as the per-VM irqchip_mode
can't be modified after vCPUs are created, i.e. if one vCPU has an
in-kernel local APIC, then all vCPUs have an in-kernel local APIC.
Linus Torvalds [Sun, 9 Feb 2025 18:05:32 +0000 (10:05 -0800)]
Merge tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Suppress false-positive -Wformat-{overflow,truncation}-non-kprintf
warnings regardless of the W= option
- Avoid CONFIG_TRIM_UNUSED_KSYMS dropping symbols passed to symbol_get()
- Fix a build regression of the Debian linux-headers package
* tag 'kbuild-fixes-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: install-extmod-build: add missing quotation marks for CC variable
kbuild: fix misspelling in scripts/Makefile.lib
kbuild: keep symbols for symbol_get() even with CONFIG_TRIM_UNUSED_KSYMS
scripts/Makefile.extrawarn: Do not show clang's non-kprintf warnings at W=1
Linus Torvalds [Sun, 9 Feb 2025 17:47:06 +0000 (09:47 -0800)]
Merge tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix a recently introduced kernel crash due to a NULL pointer
dereference during system-wide suspend (Rafael Wysocki)"
* tag 'pm-6.14-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: core: Restrict power.set_active propagation
Linus Torvalds [Sun, 9 Feb 2025 17:41:38 +0000 (09:41 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Correctly clean the BSS to the PoC before allowing EL2 to access it
on nVHE/hVHE/protected configurations
- Propagate ownership of debug registers in protected mode after the
rework that landed in 6.14-rc1
- Stop pretending that we can run the protected mode without a GICv3
being present on the host
- Fix a use-after-free situation that can occur if a vcpu fails to
initialise the NV shadow S2 MMU contexts
- Always evaluate the need to arm a background timer for fully
emulated guest timers
- Fix the emulation of EL1 timers in the absence of FEAT_ECV
- Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0
s390:
- move some of the guest page table (gmap) logic into KVM itself,
inching towards the final goal of completely removing gmap from the
non-kvm memory management code.
As an initial set of cleanups, move some code from mm/gmap into kvm
and start using __kvm_faultin_pfn() to fault-in pages as needed;
but especially stop abusing page->index and page->lru to aid in the
pgdesc conversion.
x86:
- Add missing check in the fix to defer starting the huge page
recovery vhost_task
- SRSO_USER_KERNEL_NO does not need SYNTHESIZED_F"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (31 commits)
KVM: x86/mmu: Ensure NX huge page recovery thread is alive before waking
KVM: remove kvm_arch_post_init_vm
KVM: selftests: Fix spelling mistake "initally" -> "initially"
kvm: x86: SRSO_USER_KERNEL_NO is not synthesized
KVM: arm64: timer: Don't adjust the EL2 virtual timer offset
KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECV
KVM: arm64: timer: Always evaluate the need for a soft timer
KVM: arm64: Fix nested S2 MMU structures reallocation
KVM: arm64: Fail protected mode init if no vgic hardware is present
KVM: arm64: Flush/sync debug state in protected mode
KVM: s390: selftests: Streamline uc_skey test to issue iske after sske
KVM: s390: remove the last user of page->index
KVM: s390: move PGSTE softbits
KVM: s390: remove useless page->index usage
KVM: s390: move gmap_shadow_pgt_lookup() into kvm
KVM: s390: stop using lists to keep track of used dat tables
KVM: s390: stop using page->index for non-shadow gmaps
KVM: s390: move some gmap shadowing functions away from mm/gmap.c
KVM: s390: get rid of gmap_translate()
KVM: s390: get rid of gmap_fault()
...
Commit 3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of
parents and children") exposed an issue related to simple_pm_bus_pm_ops
that uses pm_runtime_force_suspend() and pm_runtime_force_resume() as
bus type PM callbacks for the noirq phases of system-wide suspend and
resume.
The problem is that pm_runtime_force_suspend() does not distinguish
runtime-suspended devices from devices for which runtime PM has never
been enabled, so if it sees a device with runtime PM status set to
RPM_ACTIVE, it will assume that runtime PM is enabled for that device
and so it will attempt to suspend it with the help of its runtime PM
callbacks which may not be ready for that. As it turns out, this
causes simple_pm_bus_runtime_suspend() to crash due to a NULL pointer
dereference.
Another problem related to the above commit and simple_pm_bus_pm_ops is
that setting runtime PM status of a device handled by the latter to
RPM_ACTIVE will actually prevent it from being resumed because
pm_runtime_force_resume() only resumes devices with runtime PM status
set to RPM_SUSPENDED.
To mitigate these issues, do not allow power.set_active to propagate
beyond the parent of the device with DPM_FLAG_SMART_SUSPEND set that
will need to be resumed, which should be a sufficient stop-gap for the
time being, but they will need to be properly addressed in the future
because in general during system-wide resume it is necessary to resume
all devices in a dependency chain in which at least one device is going
to be resumed.
Fixes: 3775fc538f53 ("PM: sleep: core: Synchronize runtime PM status of parents and children") Closes: https://lore.kernel.org/linux-pm/1c2433d4-7e0f-4395-b841-b8eac7c25651@nvidia.com/ Reported-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/6137505.lOV4Wx5bFT@rjwysocki.net
Quentin Perret [Fri, 7 Feb 2025 14:54:38 +0000 (14:54 +0000)]
KVM: arm64: Simplify np-guest hypercalls
When the handling of a guest stage-2 permission fault races with an MMU
notifier, the faulting page might be gone from the guest's stage-2 by
the point we attempt to call (p)kvm_pgtable_stage2_relax_perms(). In the
normal KVM case, this leads to returning -EAGAIN which user_mem_abort()
handles correctly by simply re-entering the guest. However, the pKVM
hypercall implementation has additional logic to check the page state
using __check_host_shared_guest() which gets confused with absence of a
page mapped at the requested IPA and returns -ENOENT, hence breaking
user_mem_abort() and hilarity ensues.
Luckily, several of the hypercalls for managing the stage-2 page-table
of NP guests have no effect on the pKVM ownership tracking (wrprotect,
test_clear_young, mkyoung, and crucially relax_perms), so the extra
state checking logic is in fact not strictly necessary. So, to fix the
discrepancy between standard KVM and pKVM, let's just drop the
superfluous __check_host_shared_guest() logic from those hypercalls and
make the extra state checking a debug assertion dependent on
CONFIG_NVHE_EL2_DEBUG as we already do for other transitions.
Quentin Perret [Fri, 7 Feb 2025 14:54:37 +0000 (14:54 +0000)]
KVM: arm64: Improve error handling from check_host_shared_guest()
The check_host_shared_guest() path expects to find a last-level valid
PTE in the guest's stage-2 page-table. However, it checks the PTE's
level before its validity, which makes it hard for callers to figure out
what went wrong.
To make error handling simpler, check the PTE's validity first.
Linus Torvalds [Sat, 8 Feb 2025 22:12:17 +0000 (14:12 -0800)]
Merge tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
"Address a KUnit stack initialization regression that got tickled on
m68k, and solve a Clang(v14 and earlier) bug found by 0day:
- Fix stackinit KUnit regression on m68k
- Use ARRAY_SIZE() for memtostr*()/strtomem*()"
* tag 'hardening-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
string.h: Use ARRAY_SIZE() for memtostr*()/strtomem*()
compiler.h: Introduce __must_be_byte_array()
compiler.h: Move C string helpers into C-only kernel section
stackinit: Fix comment for test_small_end
stackinit: Keep selftest union size small on m68k
Linus Torvalds [Sat, 8 Feb 2025 22:04:21 +0000 (14:04 -0800)]
Merge tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fix from Kees Cook:
"This is really a work-around for x86_64 having grown a syscall to
implement uretprobe, which has caused problems since v6.11.
This may change in the future, but for now, this fixes the unintended
seccomp filtering when uretprobe switched away from traps, and does so
with something that should be easy to backport.
- Allow uretprobe on x86_64 to avoid behavioral complications (Eyal
Birger)"
* tag 'seccomp-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: validate uretprobe syscall passes through seccomp
seccomp: passthrough uretprobe systemcall without filtering
Linus Torvalds [Sat, 8 Feb 2025 21:59:24 +0000 (13:59 -0800)]
Merge tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fix from Kees Cook:
"This is an alpha-specific fix, but since it touched ELF I was asked to
carry it.
- alpha/elf: Fix misc/setarch test of util-linux by removing 32bit
support (Eric W. Biederman)"
* tag 'execve-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
alpha/elf: Fix misc/setarch test of util-linux by removing 32bit support
Linus Torvalds [Sat, 8 Feb 2025 21:45:34 +0000 (13:45 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"A number of fairly small fixes, mostly in drivers but two in the core
to change a retry for depopulation (a trendy new hdd thing that
reorganizes blocks away from failing elements) and one to fix a GFP_
annotation to avoid a lock dependency (the third core patch is all in
testing)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla1280: Fix kernel oops when debug level > 2
scsi: ufs: core: Fix error return with query response
scsi: storvsc: Set correct data length for sending SCSI command without payload
scsi: ufs: core: Fix use-after free in init error and remove paths
scsi: core: Do not retry I/Os during depopulation
scsi: core: Use GFP_NOIO to avoid circular locking dependency
scsi: ufs: Fix toggling of clk_gating.state when clock gating is not allowed
scsi: ufs: core: Ensure clk_gating.lock is used only after initialization
scsi: ufs: core: Simplify temperature exception event handling
scsi: target: core: Add line break to status show
scsi: ufs: core: Fix the HIGH/LOW_TEMP Bit Definitions
scsi: core: Add passthrough tests for success and no failure definitions
Linus Torvalds [Sat, 8 Feb 2025 20:22:21 +0000 (12:22 -0800)]
Merge tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux
Pull rust fixes from Miguel Ojeda:
- Do not export KASAN ODR symbols to avoid gendwarfksyms warnings
- Fix future Rust 1.86.0 (to be released 2025-04-03) x86_64 builds
- Clean future Rust 1.86.0 (to be released 2025-04-03) warning
- Fix future GCC 15 (to be released in a few months) builds
- Fix `rusttest` target in macOS
* tag 'rust-fixes-6.14' of https://github.com/Rust-for-Linux/linux:
x86: rust: set rustc-abi=x86-softfloat on rustc>=1.86.0
rust: kbuild: do not export generated KASAN ODR symbols
rust: kbuild: add -fzero-init-padding-bits to bindgen_skip_cflags
rust: init: use explicit ABI to clean warning in future compilers
rust: kbuild: use host dylib naming in rusttestlib-kernel
Linus Torvalds [Sat, 8 Feb 2025 20:18:02 +0000 (12:18 -0800)]
Merge tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ftrace fix from Steven Rostedt:
"Function graph fix of notrace functions.
When the function graph tracer was restructured to use the global
section of the meta data in the shadow stack, the bit logic was
changed. There's a TRACE_GRAPH_NOTRACE_BIT that is the bit number in
the mask that tells if the function graph tracer is currently in the
"notrace" mode. The TRACE_GRAPH_NOTRACE is the mask with that bit set.
But when the code we restructured, the TRACE_GRAPH_NOTRACE_BIT was
used when it should have been the TRACE_GRAPH_NOTRACE mask. This made
notrace not work properly"
* tag 'ftrace-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
Linus Torvalds [Sat, 8 Feb 2025 20:04:00 +0000 (12:04 -0800)]
Merge tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix a build regression on GCC 15 builds, caused by GCC changing the
default C version that is overriden in the main Makefile but not in
the x86 boot code Makefile"
* tag 'x86-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Use '-std=gnu11' to fix build with GCC 15
Linus Torvalds [Sat, 8 Feb 2025 19:55:03 +0000 (11:55 -0800)]
Merge tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Fix a PREEMPT_RT bug in the clocksource verification code that caused
false positive warnings.
Also fix a timer migration setup bug when new CPUs are added"
* tag 'timers-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix off-by-one root mis-connection
clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context
Linus Torvalds [Sat, 8 Feb 2025 19:16:22 +0000 (11:16 -0800)]
Merge tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a cfs_rq->h_nr_runnable accounting bug that trips up a defensive
SCHED_WARN_ON() on certain workloads. The bug is believed to be
(accidentally) self-correcting, hence no behavioral side effects are
expected.
Also print se.slice in debug output, since this value can now be set
via the syscall ABI and can be useful to track"
* tag 'sched-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/debug: Provide slice length for fair tasks
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
Linus Torvalds [Sat, 8 Feb 2025 19:05:54 +0000 (11:05 -0800)]
Merge tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Another followup fix for the procps genirq output formatting
regression caused by an optimization"
* tag 'irq-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove leading space from irq_chip::irq_print_chip() callbacks
Linus Torvalds [Sat, 8 Feb 2025 18:54:11 +0000 (10:54 -0800)]
Merge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix a dangling pointer bug in the futex code used by the uring code.
It isn't causing problems at the moment due to uring ABI limitations
leaving it essentially unused in current usages, but is a good idea to
fix nevertheless"
* tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Pass in task to futex_queue()
Steven Rostedt [Sat, 8 Feb 2025 05:15:11 +0000 (00:15 -0500)]
fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BIT
The code was restructured where the function graph notrace code, that
would not trace a function and all its children is done by setting a
NOTRACE flag when the function that is not to be traced is hit.
There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a
TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the
restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used
TRACE_GRAPH_NOTRACE.
Nathan Chancellor [Thu, 17 Oct 2024 17:09:22 +0000 (10:09 -0700)]
kbuild: Move -Wenum-enum-conversion to W=2
-Wenum-enum-conversion was strengthened in clang-19 to warn for C, which
caused the kernel to move it to W=1 in commit 75b5ab134bb5 ("kbuild:
Move -Wenum-{compare-conditional,enum-conversion} into W=1") because
there were numerous instances that would break builds with -Werror.
Unfortunately, this is not a full solution, as more and more developers,
subsystems, and distributors are building with W=1 as well, so they
continue to see the numerous instances of this warning.
Since the move to W=1, there have not been many new instances that have
appeared through various build reports and the ones that have appeared
seem to be following similar existing patterns, suggesting that most
instances of this warning will not be real issues. The only alternatives
for silencing this warning are adding casts (which is generally seen as
an ugly practice) or refactoring the enums to macro defines or a unified
enum (which may be undesirable because of type safety in other parts of
the code).
Move the warning to W=2, where warnings that occur frequently but may be
relevant should reside.
Linus Torvalds [Sat, 8 Feb 2025 03:23:06 +0000 (19:23 -0800)]
Merge tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Three DFS fixes: DFS mount fix, fix for noisy log msg and one to
remove some unused code
- SMB3 Lease fix
* tag 'v6.14rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: change lease epoch type from unsigned int to __u16
smb: client: get rid of kstrdup() in get_ses_refpath()
smb: client: fix noisy when tree connecting to DFS interlink targets
smb: client: don't trust DFSREF_STORAGE_SERVER bit
Linus Torvalds [Fri, 7 Feb 2025 20:21:54 +0000 (12:21 -0800)]
Merge tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Just regular drm fixes, amdgpu, xe and i915 mostly, but a few
scattered fixes. I think one of the i915 fixes fixes some build combos
that Guenter was seeing.
amdgpu:
- Add new tiling flag for DCC write compress disable
- Add BO metadata flag for DCC
- Fix potential out of bounds access in display
- Seamless boot fix
- CONFIG_FRAME_WARN fix
- PSR1 fix
xe:
- OA uAPI related fixes
- Fix SRIOV migration initialization
- Restore devcoredump to a sane state
i915:
- Fix the build error with clamp after WARN_ON on gcc 13.x+
- HDCP related fixes
- PMU fix zero delta busyness issue
- Fix page cleanup on DMA remap failure
- Drop 64bpp YUV formats from ICL+ SDR planes
- GuC log related fix
- DisplayPort related fixes
ivpu:
- Fix error handling
komeda:
- add return check
zynqmp:
- fix locking in DP code
ast:
- fix AST DP timeout
cec:
- fix broken CEC adapter check"
* tag 'drm-fixes-2025-02-08' of https://gitlab.freedesktop.org/drm/kernel: (29 commits)
drm/i915/dp: Fix potential infinite loop in 128b/132b SST
Revert "drm/amd/display: Use HW lock mgr for PSR1"
drm/amd/display: Respect user's CONFIG_FRAME_WARN more for dml files
accel/amdxdna: Add MODULE_FIRMWARE() declarations
drm/i915/dp: Iterate DSC BPP from high to low on all platforms
drm/xe: Fix and re-enable xe_print_blob_ascii85()
drm/xe/devcoredump: Move exec queue snapshot to Contexts section
drm/xe/oa: Set stream->pollin in xe_oa_buffer_check_unlocked
drm/xe/pf: Fix migration initialization
drm/xe/oa: Preserve oa_ctrl unused bits
drm/amd/display: Fix seamless boot sequence
drm/amd/display: Fix out-of-bound accesses
drm/amdgpu: add a BO metadata flag to disable write compression for Vulkan
drm/i915/backlight: Return immediately when scale() finds invalid parameters
drm/i915/dp: Return min bpc supported by source instead of 0
drm/i915/dp: fix the Adaptive sync Operation mode for SDP
drm/i915/guc: Debug print LRC state entries only if the context is pinned
drm/i915: Drop 64bpp YUV formats from ICL+ SDR planes
drm/i915: Fix page cleanup on DMA remap failure
drm/i915/pmu: Fix zero delta busyness issue
...
Linus Torvalds [Fri, 7 Feb 2025 19:00:33 +0000 (11:00 -0800)]
Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- MD pull request via Song:
- fix an error handling path for md-linear
- NVMe pull request via Keith:
- Connection fixes for fibre channel transport (Daniel)
- Endian fixes (Keith, Christoph)
- Cleanup fix for host memory buffer (Francis)
- Platform specific power quirks (Georg)
- Target memory leak (Sagi)
- Use appropriate controller state accessor (Daniel)
- Fixup for a regression introduced last week, where sunvdc wasn't
updated for an API change, causing compilation failures on sparc64.
* tag 'block-6.14-20250207' of git://git.kernel.dk/linux:
drivers/block/sunvdc.c: update the correct AIP call
md: Fix linear_set_limits()
nvme-fc: use ctrl state getter
nvme: make nvme_tls_attrs_group static
nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
nvmet: the result field in nvmet_alloc_ctrl_args is little endian
nvmet: fix a memory leak in controller identify
nvme-fc: do not ignore connectivity loss during connecting
nvme: handle connectivity loss in nvme_set_queue_count
nvme-fc: go straight to connecting state when initializing
nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
nvme-pci: remove redundant dma frees in hmb
nvmet: fix rw control endian access
WangYuli [Fri, 7 Feb 2025 07:08:55 +0000 (15:08 +0800)]
kbuild: install-extmod-build: add missing quotation marks for CC variable
While attempting to build a Debian packages with CC="ccache gcc", I
saw the following error as builddeb builds linux-headers-$KERNELVERSION:
make HOSTCC=ccache gcc VPATH= srcroot=. -f ./scripts/Makefile.build obj=debian/linux-headers-6.14.0-rc1/usr/src/linux-headers-6.14.0-rc1/scripts
make[6]: *** No rule to make target 'gcc'. Stop.
Upon investigation, it seems that one instance of $(CC) variable reference
in ./scripts/package/install-extmod-build was missing quotation marks,
causing the above error.
Add the missing quotation marks around $(CC) to fix build.
Linus Torvalds [Fri, 7 Feb 2025 18:34:50 +0000 (10:34 -0800)]
Merge tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a handful of issues in the amd-pstate driver, the airoha
cpufreq driver build, a (recently added) possible NULL pointer
dereference in the cpufreq code and a possible memory leak in the
power capping subsystem:
- Fix cpufreq_policy reference counting and prevent max_perf from
going above the current limit in amd-pstate, and drop a redundant
goto label from it (Dhananjay Ugwekar)
- Prevent the per-policy boost_enabled flag in amd-pstate from
getting out of sync with the actual state after boot failures
(Lifeng Zheng)
- Fix a recently added possible NULL pointer dereference in the
cpufreq core (Aboorva Devarajan)
- Fix a build issue related to CONFIG_OF and COMPILE_TEST
dependencies in the airoha cpufreq driver (Arnd Bergmann)
- Fix a possible memory leak in the power capping subsystem (Joe
Hattori)"
* tag 'pm-6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Fix cpufreq_policy ref counting
cpufreq: prevent NULL dereference in cpufreq_online()
cpufreq: airoha: modify CONFIG_OF dependency
cpufreq/amd-pstate: Fix max_perf updation with schedutil
cpufreq/amd-pstate: Remove the goto label in amd_pstate_update_limits
cpufreq/amd-pstate: Fix per-policy boost flag incorrect when fail
powercap: call put_device() on an error path in powercap_register_control_type()
Linus Torvalds [Fri, 7 Feb 2025 17:50:33 +0000 (09:50 -0800)]
Merge tag 'gpio-fixes-for-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix interrupt support in gpio-pca953x
- fix configfs attribute locking in gpio-sim
- limit the visibility of the GPIO_GRGPIO Kconfig symbol to OF systems
only
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Use my kernel.org address for ACPI GPIO work
gpio: GPIO_GRGPIO should depend on OF
gpio: sim: lock hog configfs items if present
gpio: pca953x: Improve interrupt support
Linus Torvalds [Fri, 7 Feb 2025 17:22:31 +0000 (09:22 -0800)]
Merge tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix fsnotify FMODE_NONOTIFY* handling.
This also disables fsnotify on all pseudo files by default apart from
very select exceptions. This carries a regression risk so we need to
watch out and adapt accordingly. However, it is overall a significant
improvement over the current status quo where every rando file can
get fsnotify enabled.
- Cleanup and simplify lockref_init() after recent lockref changes.
- Fix vboxfs build with gcc-15.
- Add an assert into inode_set_cached_link() to catch corrupt links.
- Allow users to also use an empty string check to detect whether a
given mount option string was empty or not.
- Fix how security options were appended to statmount()'s ->mnt_opt
field.
- Fix statmount() selftests to always check the returned mask.
- Fix uninitialized value in vfs_statx_path().
- Fix pidfs_ioctl() sanity checks to guard against ioctl() overloading
and preserve extensibility.
* tag 'vfs-6.14-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: sanity check the length passed to inode_set_cached_link()
pidfs: improve ioctl handling
fsnotify: disable pre-content and permission events by default
selftests: always check mask returned by statmount(2)
fsnotify: disable notification by default for all pseudo files
fs: fix adding security options to statmount.mnt_opt
fsnotify: use accessor to set FMODE_NONOTIFY_*
lockref: remove count argument of lockref_init
gfs2: switch to lockref_init(..., 1)
gfs2: use lockref_init for gl_lockref
statmount: let unset strings be empty
vboxsf: fix building with GCC 15
fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()
Linus Torvalds [Fri, 7 Feb 2025 17:16:07 +0000 (09:16 -0800)]
Merge tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Nothing major, things continue to be fairly quiet over here.
- add a SubmittingPatches to clarify that patches submitted for
bcachefs do, in fact, need to be tested
- discard path now correctly issues journal flushes when needed, this
fixes performance issues when the filesystem is nearly full and
we're bottlenecked on copygc
- fix a bug that could cause the pending rebalance work accounting to
be off when devices are being onlined/offlined; users should report
if they are still seeing this
- and a few more trivial ones"
* tag 'bcachefs-2025-02-06.2' of git://evilpiepirate.org/bcachefs:
bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
bcachefs: Fix rcu imbalance in bch2_fs_btree_key_cache_exit()
bcachefs: Fix discard path journal flushing
bcachefs: fix deadlock in journal_entry_open()
bcachefs: fix incorrect pointer check in __bch2_subvolume_delete()
bcachefs docs: SubmittingPatches.rst
Hector Martin [Thu, 6 Feb 2025 18:21:46 +0000 (03:21 +0900)]
MAINTAINERS: Remove myself
I no longer have any faith left in the kernel development process or
community management approach.
Apple/ARM platform development will continue downstream. If I feel like
sending some patches upstream in the future myself for whatever subtree
I may, or I may not. Anyone who feels like fighting the upstreaming
fight themselves is welcome to do so.
Signed-off-by: Hector Martin <marcan@marcan.st> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael J. Wysocki [Fri, 7 Feb 2025 12:06:31 +0000 (13:06 +0100)]
Merge branches 'acpi-property' and 'acpi-resource'
Merge a new ACPI IRQ override quirk for Eluktronics MECH-17 (Gannon
Kolding) and an acpi_data_prop_read() fix making it reflect the OF
counterpart behavior in error cases (Andy Shevchenko).
* acpi-property:
ACPI: property: Fix return value for nval == 0 in acpi_data_prop_read()
* acpi-resource:
ACPI: resource: IRQ override for Eluktronics MECH-17
Mateusz Guzik [Tue, 4 Feb 2025 21:32:07 +0000 (22:32 +0100)]
vfs: sanity check the length passed to inode_set_cached_link()
This costs a strlen() call when instatianating a symlink.
Preferably it would be hidden behind VFS_WARN_ON (or compatible), but
there is no such facility at the moment. With the facility in place the
call can be patched out in production kernels.
In the meantime, since the cost is being paid unconditionally, use the
result to a fixup the bad caller.
This is not expected to persist in the long run (tm).
Sample splat:
bad length passed for symlink [/tmp/syz-imagegen43743633/file0/file0] (got 131109, expected 37)
[rest of WARN blurp goes here]
Christian Brauner [Tue, 4 Feb 2025 13:51:20 +0000 (14:51 +0100)]
pidfs: improve ioctl handling
Pidfs supports extensible and non-extensible ioctls. The extensible
ioctls need to check for the ioctl number itself not just the ioctl
command otherwise both backward- and forward compatibility are broken.
The pidfs ioctl handler also needs to look at the type of the ioctl
command to guard against cases where "[...] a daemon receives some
random file descriptor from a (potentially less privileged) client and
expects the FD to be of some specific type, it might call ioctl() on
this FD with some type-specific command and expect the call to fail if
the FD is of the wrong type; but due to the missing type check, the
kernel instead performs some action that userspace didn't expect."
(cf. [1]]
Christian Brauner [Tue, 4 Feb 2025 10:25:45 +0000 (11:25 +0100)]
Merge patch series "Fix for huge faults regression"
Amir Goldstein <amir73il@gmail.com> says:
The two Fix patches have been tested by Alex together and each one
independently.
I also verified that they pass the LTP inoityf/fanotify tests.
* patches from https://lore.kernel.org/r/20250203223205.861346-1-amir73il@gmail.com:
fsnotify: disable pre-content and permission events by default
fsnotify: disable notification by default for all pseudo files
fsnotify: use accessor to set FMODE_NONOTIFY_*
Amir Goldstein [Mon, 3 Feb 2025 22:32:05 +0000 (23:32 +0100)]
fsnotify: disable pre-content and permission events by default
After introducing pre-content events, we had a regression related to
disabling huge faults on files that should never have pre-content events
enabled.
This happened because the default f_mode of allocated files (0) does
not disable pre-content events.
Pre-content events are disabled in file_set_fsnotify_mode_by_watchers()
but internal files may not get to call this helper.
Initialize f_mode to disable permission and pre-content events for all
files and if needed they will be enabled for the callers of
file_set_fsnotify_mode_by_watchers().
Fixes: 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches") Reported-by: Alex Williamson <alex.williamson@redhat.com> Closes: https://lore.kernel.org/linux-fsdevel/20250131121703.1e4d00a7.alex.williamson@redhat.com/ Tested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250203223205.861346-4-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
Miklos Szeredi [Wed, 29 Jan 2025 16:06:41 +0000 (17:06 +0100)]
selftests: always check mask returned by statmount(2)
STATMOUNT_MNT_OPTS can actually be missing if there are no options. This
is a change of behavior since 75ead69a7173 ("fs: don't let statmount return
empty strings").
The other checks shouldn't actually trigger, but add them for correctness
and for easier debugging if the test fails.
Amir Goldstein [Mon, 3 Feb 2025 22:32:04 +0000 (23:32 +0100)]
fsnotify: disable notification by default for all pseudo files
Most pseudo files are not applicable for fsnotify events at all,
let alone to the new pre-content events.
Disable notifications to all files allocated with alloc_file_pseudo()
and enable legacy inotify events for the specific cases of pipe and
socket, which have known users of inotify events.
Pre-content events are also kept disabled for sockets and pipes.
Amir Goldstein [Mon, 3 Feb 2025 22:32:03 +0000 (23:32 +0100)]
fsnotify: use accessor to set FMODE_NONOTIFY_*
The FMODE_NONOTIFY_* bits are a 2-bits mode. Open coding manipulation
of those bits is risky. Use an accessor file_set_fsnotify_mode() to
set the mode.
Rename file_set_fsnotify_mode() => file_set_fsnotify_mode_from_watchers()
to make way for the simple accessor name.
Christian Brauner [Thu, 30 Jan 2025 15:04:42 +0000 (16:04 +0100)]
Merge patch series "further lockref cleanups"
Andreas Gruenbacher <agruenba@redhat.com> says:
Here's an updated version with an additional comment saying that
lockref_init() initializes count to 1.
* patches from https://lore.kernel.org/r/20250130135624.1899988-1-agruenba@redhat.com:
lockref: remove count argument of lockref_init
gfs2: switch to lockref_init(..., 1)
gfs2: use lockref_init for gl_lockref
Andreas Gruenbacher [Thu, 30 Jan 2025 13:56:22 +0000 (14:56 +0100)]
gfs2: switch to lockref_init(..., 1)
In qd_alloc(), initialize the lockref count to 1 to cover the common
case. Compensate for that in gfs2_quota_init() by adjusting the count
back down to 0; this only occurs when mounting the filesystem rw.
Miklos Szeredi [Thu, 30 Jan 2025 12:15:00 +0000 (13:15 +0100)]
statmount: let unset strings be empty
Just like it's normal for unset values to be zero, unset strings should be
empty instead of containing random values.
It seems to be a typical mistake that the mask returned by statmount is not
checked, which can result in various bugs.
With this fix, these bugs are prevented, since it is highly likely that
userspace would just want to turn the missing mask case into an empty
string anyway (most of the recently found cases are of this type).
Brahmajit Das [Tue, 21 Jan 2025 16:26:48 +0000 (21:56 +0530)]
vboxsf: fix building with GCC 15
Building with GCC 15 results in build error
fs/vboxsf/super.c:24:54: error: initializer-string for array of ‘unsigned char’ is too long [-Werror=unterminated-string-initialization]
24 | static const unsigned char VBSF_MOUNT_SIGNATURE[4] = "\000\377\376\375";
| ^~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Due to GCC having enabled -Werror=unterminated-string-initialization[0]
by default. Separately initializing each array element of
VBSF_MOUNT_SIGNATURE to ensure NUL termination, thus satisfying GCC 15
and fixing the build error.
Su Hui [Sun, 19 Jan 2025 02:59:47 +0000 (10:59 +0800)]
fs/stat.c: avoid harmless garbage value problem in vfs_statx_path()
Clang static checker(scan-build) warning:
fs/stat.c:287:21: warning: The left expression of the compound assignment is
an uninitialized value. The computed value will also be garbage.
287 | stat->result_mask |= STATX_MNT_ID_UNIQUE;
| ~~~~~~~~~~~~~~~~~ ^
fs/stat.c:290:21: warning: The left expression of the compound assignment is
an uninitialized value. The computed value will also be garbage.
290 | stat->result_mask |= STATX_MNT_ID;
When vfs_getattr() failed because of security_inode_getattr(), 'stat' is
uninitialized. In this case, there is a harmless garbage problem in
vfs_statx_path(). It's better to return error directly when
vfs_getattr() failed, avoiding garbage value and more clearly.
Before attaching a new root to the old root, the children counter of the
new root is checked to verify that only the upcoming CPU's top group have
been connected to it. However since the recently added commit b729cc1ec21a
("timers/migration: Fix another race between hotplug and idle entry/exit")
this check is not valid anymore because the old root is pre-accounted
as a child to the new root. Therefore after connecting the upcoming
CPU's top group to the new root, the children count to be expected must
be 2 and not 1 anymore.
This omission results in the old root to not be connected to the new
root. Then eventually the system may run with more than one top level,
which defeats the purpose of a single idle migrator.
Also the old root is pre-accounted but not connected upon the new root
creation. But it can be connected to the new root later on. Therefore
the old root may be accounted twice to the new root. The propagation of
such overcommit can end up creating a double final top-level root with a
groupmask incorrectly initialized. Although harmless given that the final
top level roots will never have a parent to walk up to, this oddity
opportunistically reported the core issue:
Geert Uytterhoeven [Wed, 5 Feb 2025 14:22:56 +0000 (15:22 +0100)]
genirq: Remove leading space from irq_chip::irq_print_chip() callbacks
The space separator was factored out from the multiple chip name prints,
but several irq_chip::irq_print_chip() callbacks still print a leading
space. Remove the superfluous double spaces.
Dave Airlie [Fri, 7 Feb 2025 04:47:11 +0000 (14:47 +1000)]
Merge tag 'drm-misc-fixes-2025-02-06' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
A couple of fixes for ivpu to error handling, komeda for format
handling, AST DP timeout fix when enabling the output, locking fix for
zynqmp DP support, tiled format handling in drm/client, and refcounting
fix for bochs
Kent Overstreet [Sun, 26 Jan 2025 02:29:45 +0000 (21:29 -0500)]
bcachefs: bch2_bkey_sectors_need_rebalance() now only depends on bch_extent_rebalance
Previously, bch2_bkey_sectors_need_rebalance() called
bch2_target_accepts_data(), checking whether the target is writable.
However, this means that adding or removing devices from a target would
change the value of bch2_bkey_sectors_need_rebalance() for an existing
extent; this needs to be invariant so that the extent trigger can
correctly maintain rebalance_work accounting.
Instead, check target_accepts_data() in io_opts_to_rebalance_opts(),
before creating the bch_extent_rebalance entry.
This fixes (one?) cause of rebalance_work accounting being off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Mon, 27 Jan 2025 06:21:44 +0000 (01:21 -0500)]
bcachefs: Fix discard path journal flushing
The discard path is supposed to issue journal flushes when there's too
many buckets empty buckets that need a journal commit before they can be
written to again, but at some point this code seems to have been lost.
Bring it back with a new optimization to make sure we don't issue too
many journal flushes: the journal now tracks the sequence number of the
most recent flush in progress, which the discard path uses when deciding
which buckets need a journal flush.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Jeongjun Park [Sun, 2 Feb 2025 06:13:51 +0000 (15:13 +0900)]
bcachefs: fix deadlock in journal_entry_open()
In the previous commit b3d82c2f2761, code was added to prevent journal sequence
overflow. Among them, the code added to journal_entry_open() uses the
bch2_fs_fatal_err_on() function to handle errors.
However, __journal_res_get() , which calls journal_entry_open() , calls
journal_entry_open() while holding journal->lock , but bch2_fs_fatal_err_on()
internally tries to acquire journal->lock , which results in a deadlock.
So we need to add a locked helper to handle fatal errors even when the
journal->lock is held.
Fixes: b3d82c2f2761 ("bcachefs: Guard against journal seq overflow") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Jeongjun Park [Fri, 31 Jan 2025 16:20:31 +0000 (01:20 +0900)]
bcachefs: fix incorrect pointer check in __bch2_subvolume_delete()
For some unknown reason, checks on struct bkey_s_c_snapshot and struct
bkey_s_c_snapshot_tree pointers are missing.
Therefore, I think it would be appropriate to fix the incorrect pointer checking
through this patch.
Fixes: 4bd06f07bcb5 ("bcachefs: Fixes for snapshot_tree.master_subvol") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kees Cook [Wed, 5 Feb 2025 21:45:26 +0000 (13:45 -0800)]
string.h: Use ARRAY_SIZE() for memtostr*()/strtomem*()
The destination argument of memtostr*() and strtomem*() must be a
fixed-size char array at compile time, so there is no need to use
__builtin_object_size() (which is useful for when an argument is
either a pointer or unknown). Instead use ARRAY_SIZE(), which has the
benefit of working around a bug in Clang (fixed[1] in 15+) that got
__builtin_object_size() wrong sometimes.
Kees Cook [Wed, 5 Feb 2025 20:48:07 +0000 (12:48 -0800)]
compiler.h: Introduce __must_be_byte_array()
In preparation for adding stricter type checking to the str/mem*()
helpers, provide a way to check that a variable is a byte array
via __must_be_byte_array().
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kees Cook <kees@kernel.org>
Kees Cook [Wed, 5 Feb 2025 20:32:49 +0000 (12:32 -0800)]
compiler.h: Move C string helpers into C-only kernel section
The C kernel helpers for evaluating C Strings were positioned where they
were visible to assembly inclusion, which was not intended. Move them
into the kernel and C-only area of the header so future changes won't
confuse the assembler.
Fixes: d7a516c6eeae ("compiler.h: Fix undefined BUILD_BUG_ON_ZERO()") Fixes: 559048d156ff ("string: Check for "nonstring" attribute on strscpy() arguments") Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org>
Alice Ryhl [Mon, 3 Feb 2025 08:40:57 +0000 (08:40 +0000)]
x86: rust: set rustc-abi=x86-softfloat on rustc>=1.86.0
When using Rust on the x86 architecture, we are currently using the
unstable target.json feature to specify the compilation target. Rustc is
going to change how softfloat is specified in the target.json file on
x86, thus update generate_rust_target.rs to specify softfloat using the
new option.
Note that if you enable this parameter with a compiler that does not
recognize it, then that triggers a warning but it does not break the
build.
[ For future reference, this solves the following error:
RUSTC L rust/core.o
error: Error loading target specification: target feature
`soft-float` is incompatible with the ABI but gets enabled in
target spec. Run `rustc --print target-list` for a list of
built-in targets
Eyal Birger [Sun, 2 Feb 2025 16:29:21 +0000 (08:29 -0800)]
selftests/seccomp: validate uretprobe syscall passes through seccomp
The uretprobe syscall is implemented as a performance enhancement on
x86_64 by having the kernel inject a call to it on function exit; User
programs cannot call this system call explicitly.
As such, this syscall is considered a kernel implementation detail and
should not be filtered by seccomp.
Enhance the seccomp bpf test suite to check that uretprobes can be
attached to processes without the killing the process regardless of
seccomp policy.
Eyal Birger [Sun, 2 Feb 2025 16:29:20 +0000 (08:29 -0800)]
seccomp: passthrough uretprobe systemcall without filtering
When attaching uretprobes to processes running inside docker, the attached
process is segfaulted when encountering the retprobe.
The reason is that now that uretprobe is a system call the default seccomp
filters in docker block it as they only allow a specific set of known
syscalls. This is true for other userspace applications which use seccomp
to control their syscall surface.
Since uretprobe is a "kernel implementation detail" system call which is
not used by userspace application code directly, it is impractical and
there's very little point in forcing all userspace applications to
explicitly allow it in order to avoid crashing tracked processes.
Pass this systemcall through seccomp without depending on configuration.
Note: uretprobe is currently only x86_64 and isn't expected to ever be
supported in i386.
Linus Torvalds [Thu, 6 Feb 2025 20:32:03 +0000 (12:32 -0800)]
Merge tag 'pci-v6.14-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- When saving a device's state, always save the upstream bridge's PM L1
Substates configuration as well because the bridge never saves its
own state, and restoring a device needs the state for both ends; this
was a regression that caused link and power management errors after
suspend/resume (Ilpo Järvinen)
- Correct TPH Control Register write, where we wrote the ST Mode where
the THP Requester Enable value was intended (Robin Murphy)
* tag 'pci-v6.14-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/TPH: Restore TPH Requester Enable correctly
PCI/ASPM: Fix L1SS saving
Linus Torvalds [Thu, 6 Feb 2025 20:25:35 +0000 (12:25 -0800)]
Merge tag 'for-linus-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Three fixes for xen_hypercall_hvm() that was introduced in the 6.13
cycle"
* tag 'for-linus-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: remove unneeded dummy push from xen_hypercall_hvm()
x86/xen: add FRAME_END to xen_hypercall_hvm()
x86/xen: fix xen_hypercall_hvm() to not clobber %rbx
Rafael J. Wysocki [Thu, 6 Feb 2025 19:39:43 +0000 (20:39 +0100)]
Merge tag 'amd-pstate-v6.14-2025-02-06' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge amd-pstate driver fixes for 6.14-rc2 from Mario Limonciello:
"* Fix some error cleanup paths with mutex use and boost
* Fix a ref counting issue
* Fix a schedutil issue"
* tag 'amd-pstate-v6.14-2025-02-06' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
cpufreq/amd-pstate: Fix cpufreq_policy ref counting
cpufreq/amd-pstate: Fix max_perf updation with schedutil
cpufreq/amd-pstate: Remove the goto label in amd_pstate_update_limits
cpufreq/amd-pstate: Fix per-policy boost flag incorrect when fail
Kees Cook [Tue, 4 Feb 2025 17:45:13 +0000 (09:45 -0800)]
stackinit: Keep selftest union size small on m68k
The stack frame on m68k is very sensitive to the size of what needs to
be stored. Like done for long string testing, reduce the size of the
large trailing struct in the union initialization testing.