Mark Rutland [Thu, 8 May 2025 13:26:39 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace: Gracefully handle errors
Within sve_set_common() we do not handle error conditions correctly:
* When writing to NT_ARM_SSVE, if sme_alloc() fails, the task will be
left with task->thread.sme_state==NULL, but TIF_SME will be set and
task->thread.fp_type==FP_STATE_SVE. This will result in a subsequent
null pointer dereference when the task's state is loaded or otherwise
manipulated.
* When writing to NT_ARM_SSVE, if sve_alloc() fails, the task will be
left with task->thread.sve_state==NULL, but TIF_SME will be set,
PSTATE.SM will be set, and task->thread.fp_type==FP_STATE_FPSIMD.
This is not a legitimate state, and can result in various problems,
including a subsequent null pointer dereference and/or the task
inheriting stale streaming mode register state the next time its state
is loaded into hardware.
* When writing to NT_ARM_SSVE, if the VL is changed but the resulting VL
differs from that in the header, the task will be left with TIF_SME
set, PSTATE.SM set, but task->thread.fp_type==FP_STATE_FPSIMD. This is
not a legitimate state, and can result in various problems as
described above.
Avoid these problems by allocating memory earlier, and by changing the
task's saved fp_type to FP_STATE_SVE before skipping register writes due
to a change of VL.
To make early returns simpler, I've moved the call to
fpsimd_flush_task_state() earlier. As the tracee's state has already
been saved, and the tracee is known to be blocked for the duration of
sve_set_common(), it doesn't matter whether this is called at the start
or the end.
For consistency I've moved the setting of TIF_SVE earlier. This will be
cleared when loading FPSIMD-only state, and so moving this has no
resulting functional change.
Note that we only allocate the memory for SVE state when SVE register
contents are provided, avoiding unnecessary memory allocations for tasks
which only use FPSIMD.
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") Fixes: 5d0a8d2fba50 ("arm64/ptrace: Ensure that SME is set up for target when writing SSVE state") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-20-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:38 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace: Mandate SVE payload for streaming-mode state
When a task has PSTATE.SM==1, reads of NT_ARM_SSVE are required to
always present a header with SVE_PT_REGS_SVE, and register data in SVE
format. Reads of NT_ARM_SSVE must never present register data in FPSIMD
format. Within the kernel, we always expect streaming SVE data to be
stored in SVE format.
Currently a user can write to NT_ARM_SSVE with a header presenting
SVE_PT_REGS_FPSIMD rather than SVE_PT_REGS_SVE, placing the task's
FPSIMD/SVE data into an invalid state.
To fix this we can either:
(a) Forbid such writes.
(b) Accept such writes, and immediately convert data into SVE format.
Take the simple option and forbid such writes.
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-19-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:37 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace: Do not present register data for inactive mode
The SME ptrace ABI is written around the incorrect assumption that
SVE_PT_REGS_FPSIMD and SVE_PT_REGS_SVE are independent bit flags, where
it is possible for both to be clear. In reality they are different
values for bit 0 of the header flags, where SVE_PT_REGS_FPSIMD is 0 and
SVE_PT_REGS_SVE is 1. In cases where code was written expecting that
neither bit flag would be set, the value is equivalent to
SVE_PT_REGS_FPSIMD.
One consequence of this is that reads of the NT_ARM_SVE or NT_ARM_SSVE
will erroneously present data from the other mode:
* When PSTATE.SM==1, reads of NT_ARM_SVE will present a header with
SVE_PT_REGS_FPSIMD, and FPSIMD-formatted data from streaming mode.
* When PSTATE.SM==0, reads of NT_ARM_SSVE will present a header with
SVE_PT_REGS_FPSIMD, and FPSIMD-formatted data from non-streaming mode.
The original intent was that no register data would be provided in these
cases, as described in commit:
e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
Luckily, debuggers do not consume the bogus register data. Both GDB and
LLDB read the NT_ARM_SSVE regset before the NT_ARM_SVE regset, and
assume that when the NT_ARM_SSVE header presents SVE_PT_REGS_FPSIMD, it
is necessary to read register contents from the NT_ARM_SVE regset,
regardless of whether the NT_ARM_SSVE regset provided bogus register
data.
Fix the code to stop presenting register data from the inactive mode.
At the same time, make the manipulation of the flag clearer, and remove
the bogus comment from sve_set_common(). I've given this a quick spin
with GDB and LLDB, and both seem happy.
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-18-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:36 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace: Save task state before generating SVE header
As sve_init_header_from_task() consumes the saved value of PSTATE.SM and
the saved fp_type, both must be saved before the header is generated.
When generating a coredump for the current task, sve_get_common() calls
sve_init_header_from_task() before saving the task's state. Consequently
the header may be bogus, and the contents of the regset may be
misleading.
Fix this by saving the task's state before generting the header.
Fixes: e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers") Fixes: b017a0cea627 ("arm64/ptrace: Use saved floating point state type to determine SVE layout") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-17-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:35 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace/prctl: Ensure VL changes leave task in a valid state
Currently, vec_set_vector_length() can manipulate a task into an invalid
state as a result of a prctl/ptrace syscall which changes the SVE/SME
vector length, resulting in several problems:
(1) When changing the SVE vector length, if the task initially has
PSTATE.ZA==1, and sve_alloc() fails to allocate memory, the task
will be left with PSTATE.ZA==1 and sve_state==NULL. This is not a
legitimate state, and could result in a subsequent null pointer
dereference.
(2) When changing the SVE vector length, if the task initially has
PSTATE.SM==1, the task will be left with PSTATE.SM==1 and
fp_type==FP_STATE_FPSIMD. Streaming mode state always needs to be
saved in SVE format, so this is not a legitimate state.
Attempting to restore this state may cause a task to erroneously
inherit stale streaming mode predicate registers and FFR contents,
behaving non-deterministically and potentially leaving information
from another task.
While in this state, reads of the NT_ARM_SSVE regset will indicate
that the registers are not stored in SVE format. For the NT_ARM_SSVE
regset specifically, debuggers interpret this as meaning that
PSTATE.SM==0.
(3) When changing the SME vector length, if the task initially has
PSTATE.SM==1, the lower 128 bits of task's streaming mode vector
state will be migrated to non-streaming mode, rather than these bits
being zeroed as is usually the case for changes to PSTATE.SM.
To fix the first issue, we can eagerly allocate the new sve_state and
sme_state before modifying the task. This makes it possible to handle
memory allocation failure without modifying the task state at all, and
removes the need to clear TIF_SVE and TIF_SME.
To fix the second issue, we either need to clear PSTATE.SM or not change
the saved fp_type. Given we're going to eagerly allocate sve_state and
sme_state, the simplest option is to preserve PSTATE.SM and the saves
fp_type, and consistently truncate the SVE state. This ensures that the
task always stays in a valid state, and by virtue of not exiting
streaming mode, this also sidesteps the third issue.
I believe these changes should not be problematic for realistic usage:
* When the SVE/SME vector length is changed via prctl(), syscall entry
will have cleared PSTATE.SM. Unless the task's state has been
manipulated via ptrace after entry, the task will have PSTATE.SM==0.
* When the SVE/SME vector length is changed via a write to the
NT_ARM_SVE or NT_ARM_SSVE regsets, PSTATE.SM will be forced
immediately after the length change, and new vector state will be
copied from userspace.
* When the SME vector length is changed via a write to the NT_ARM_ZA
regset, the (S)SVE state is clobbered today, so anyone who cares about
the specific state would need to install this after writing to the
NT_ARM_ZA regset.
As we need to free the old SVE state while TIF_SVE may still be set, we
cannot use sve_free(), and using kfree() directly makes it clear that
the free pairs with the subsequent assignment. As this leaves sve_free()
unused, I've removed the existing sve_free() and renamed __sve_free() to
mirror sme_free().
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-16-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:34 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace/prctl: Ensure VL changes do not resurrect stale data
The SVE/SME vector lengths can be changed via prctl/ptrace syscalls.
Changes to the SVE/SME vector lengths are documented as preserving the
lower 128 bits of the Z registers (i.e. the bits shared with the FPSIMD
V registers). To ensure this, vec_set_vector_length() explicitly copies
register values from a task's saved SVE state to its saved FPSIMD state
when dropping the task to FPSIMD-only.
The logic for this was not updated when when FPSIMD/SVE state tracking
was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 (arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Since the last commit above, a task's FPSIMD/SVE state may be stored in
FPSIMD format while TIF_SVE is set, and the stored SVE state is stale.
When vec_set_vector_length() encounters this case, it will erroneously
clobber the live FPSIMD state with stale SVE state by using
sve_to_fpsimd().
Fix this by using fpsimd_sync_from_effective_state() instead.
Related issues with streaming mode state will be addressed in subsequent
patches.
Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-15-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:33 +0000 (14:26 +0100)]
arm64/fpsimd: Make clone() compatible with ZA lazy saving
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.
Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS,
in which case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:
(a) TPIDR2_EL0 is part of the calling convention, and changes as control
is passed between functions. It is *NOT* used for thread local
storage, despite superficial similarity to TPIDR_EL0, which is is
used as the TLS register.
(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
standard, and some combinations of states are illegal. In general,
manipulating the two independently is not guaranteed to be safe.
In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:
* If the implementation of pthread_create() passes CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
procedure call standard this is not a legitimate state for most
functions. This can cause data corruption (e.g. as code may rely on
PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
zero the ZA tile contents), and may result in other undefined
behaviour.
* If the implementation of pthread_create() does not pass CLONE_SETTLS, the
new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
TPIDR2 block on the parent thread's stack. This can result in a
variety of problems, e.g.
- The child may write back to the parent's za_save_buffer, corrupting
its contents.
- The child may read from the TPIDR2 block after the parent has reused
this memory for something else, and consequently the child may abort
or clobber arbitrary memory.
Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().
Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:
* For fork(), CLONE_VM will not be set, and it is safe to inherit both
PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
address space, and cannot clobber its parent's stack.
* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
assumptions in userspace.
Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.
Mark Rutland [Thu, 8 May 2025 13:26:32 +0000 (14:26 +0100)]
arm64/fpsimd: Clear PSTATE.SM during clone()
Currently arch_dup_task_struct() doesn't handle cases where the parent
task has PSTATE.SM==1. Since syscall entry exits streaming mode, the
parent will usually have PSTATE.SM==0, but this can be change by ptrace
after syscall entry. When this happens, arch_dup_task_struct() will
initialise the new task into an invalid state. The new task inherits the
parent's configuration of PSTATE.SM, but fp_type is set to
FP_STATE_FPSIMD, TIF_SVE and SME may be cleared, and both sve_state and
sme_state may be set to NULL.
This can result in a variety of problems whenever the new task's state
is manipulated, including kernel NULL pointer dereferences and leaking
of streaming mode state between tasks.
When ptrace is not involved, the parent will have PSTATE.SM==0 as a
result of syscall entry, and the documentation in
Documentation/arch/arm64/sme.rst says:
| On process creation (eg, clone()) the newly created process will have
| PSTATE.SM cleared.
... so make this true by using task_smstop_sm() to exit streaming mode
in the child task, avoiding the problems above.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-13-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:31 +0000 (14:26 +0100)]
arm64/fpsimd: Consistently preserve FPSIMD state during clone()
In arch_dup_task_struct() we try to ensure that the child task inherits
the FPSIMD state of its parent, but this depends on the parent task's
saved state being in FPSIMD format, which is not always the case.
Consequently the child task may inherit stale FPSIMD state in some
cases.
This can happen when the parent's state has been modified by ptrace
since syscall entry, as writes to the NT_ARM_SVE regset may save state
in SVE format. This has been possible since commit:
More recently it has been possible for a task's FPSIMD/SVE state to be
saved before lazy discarding was guaranteed to occur, in which case
preemption could cause the effective FPSIMD state to be saved in SVE
format non-deterministically. This has been possible since commit:
f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Fix this by saving the parent task's effective FPSIMD state into FPSIMD
format before copying the task_struct. As this requires modifying the
parent's fpsimd_state, we must save+flush the state to avoid racing with
concurrent manipulation.
Similar issues exist when the parent has streaming mode state, and will
be addressed by subsequent patches.
Fixes: bc0ee4760364 ("arm64/sve: Core task context handling") Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-12-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:30 +0000 (14:26 +0100)]
arm64/fpsimd: Remove redundant task->mm check
For historical reasons, arch_dup_task_struct() only calls
fpsimd_preserve_current_state() when current->mm is non-NULL, but this
is no longer necessary.
Historically TIF_FOREIGN_FPSTATE was only managed for user threads, and
was never set for kernel threads. At the time, various functions
attempted to avoid saving/restoring state for kernel threads by checking
task_struct::mm to try to distinguish user threads from kernel threads.
We added the current->mm check to arch_dup_task_struct() in commit:
6eb6c80187c5 ("arm64: kernel thread don't need to save fpsimd context.")
... where the intent was to avoid pointlessly saving state for kernel
threads, which never had live state (and the saved state would never be
consumed).
Subsequently we began setting TIF_FOREIGN_FPSTATE for kernel threads,
and removed most of the task_struct::mm checks in commit:
... but we missed the check in arch_dup_task_struct(), which is similarly
redundant.
Remove the redundant check from arch_dup_task_struct().
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-11-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:29 +0000 (14:26 +0100)]
arm64/fpsimd: signal: Use SMSTOP behaviour in setup_return()
Historically the behaviour of setup_return() was nondeterministic,
depending on whether the task's FSIMD/SVE/SME state happened to be live.
We fixed most of that in commit:
929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
... but we didn't decide on how clearing PSTATE.SM should behave, and left a
TODO comment to that effect.
Use the new task_smstop_sm() helper to make this behave as if an SMSTOP
instruction was used to exit streaming mode. This would have been the
most common behaviour prior to the commit above.
Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-10-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:28 +0000 (14:26 +0100)]
arm64/fpsimd: Add task_smstop_sm()
In a few places we want to transition a task from streaming mode to
non-streaming mode, e.g. signal delivery where we historically tried to
use an SMSTOP SM instruction.
Add a new helper to manipulate a task's state in the same way as an
SMSTOP SM instruction. I have not added a corresponding helper to
simulate the effects of SMSTART SM. Only ptrace transitions a task into
streaming mode, and ptrace has distinct semantics for such transitions.
Per ARM DDI 0487 L.a, section B1.4.6:
| RRSWFQ
| When the Effective value of PSTATE.SM is changed by any method from 0
| to 1, an entry to Streaming SVE mode is performed, and all implemented
| bits of Streaming SVE register state are set to zero.
| RKFRQZ
| When the Effective value of PSTATE.SM is changed by any method from 1
| to 0, an exit from Streaming SVE mode is performed, and in the
| newly-entered mode, all implemented bits of the SVE scalable vector
| registers, SVE predicate registers, and FFR, are set to zero.
Per ARM DDI 0487 L.a, section C5.2.9:
| On entry to or exit from Streaming SVE mode, FPMR is set to 0
Per ARM DDI 0487 L.a, section C5.2.10:
| On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC,
| IXC, IDC, QC} are set to 1 and the remaining bits are set to 0.
This means bits 0, 1, 2, 3, 4, 7, and 27 respectively, i.e. 0x0800009f
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-9-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:27 +0000 (14:26 +0100)]
arm64/fpsimd: Factor out {sve,sme}_state_size() helpers
In subsequent patches we'll need to determine the SVE/SME state size for
a given SVE VL and SME VL regardless of whether a task is currently
configured with those VLs. Split the sizing logic out of
sve_state_size() and sme_state_size() so that we don't need to open-code
this logic elsewhere.
At the same time, apply minor cleanups:
* Move sve_state_size() into fpsimd.h, matching the placement of
sme_state_size().
* Remove the feature checks from sve_state_size(). We only call
sve_state_size() when at least one of SVE and SME are supported, and
when either of the two is not supported, the task's corresponding
SVE/SME vector length will be zero.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-8-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:26 +0000 (14:26 +0100)]
arm64/fpsimd: Clarify sve_sync_*() functions
The sve_sync_{to,from}_fpsimd*() functions are intended to
extract/insert the currently effective FPSIMD state of a task regardless
of whether the task's state is saved in FPSIMD format or SVE format.
Historically they were only used by ptrace, but sve_sync_to_fpsimd() is
now used more widely, and sve_sync_from_fpsimd_zeropad() may be used
more widely in future.
When FPSIMD/SVE state tracking was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 (arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
... sve_sync_to_fpsimd() was updated to consider task->thread.fp_type
rather than the task's TIF_SVE and PSTATE.SM, but (apparently due to an
oversight) sve_sync_from_fpsimd_zeropad() was left as-is, leaving the
two inconsistent.
Due to this, sve_sync_from_fpsimd_zeropad() may copy state from
task->thread.uw.fpsimd_state into task->thread.sve_state when
task->thread.fp_type == FP_STATE_FPSIMD. This is redundant (but benign)
as task->thread.uw.fpsimd_state is the effective state that will be
restored, and task->thread.sve_state will not be consumed. For
consistency, and to avoid the redundant work, it better for
sve_sync_from_fpsimd_zeropad() to consider task->thread.fp_type alone,
matching sve_sync_to_fpsimd().
The naming of both functions is somehat unfortunate, as it is unclear
when and why they copy state. It would be better to describe them in
terms of the effective state.
Considering all of the above, clean this up:
* Adjust sve_sync_from_fpsimd_zeropad() to consider
task->thread.fp_type.
* Update comments to clarify the intended semantics/usage. I've removed
the description that task->thread.sve_state must have been allocated,
as this is only necessary when task->thread.fp_type == FP_STATE_SVE,
which itself implies that task->thread.sve_state must have been
allocated.
* Rename the functions to more clearly indicate when/why they copy
state:
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-7-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:25 +0000 (14:26 +0100)]
arm64/fpsimd: ptrace: Consistently handle partial writes to NT_ARM_(S)SVE
Partial writes to the NT_ARM_SVE and NT_ARM_SSVE regsets using an
payload are handled inconsistently and non-deterministically. A comment
within sve_set_common() indicates that we intended that a partial write
would preserve any effective FPSIMD/SVE state which was not overwritten,
but this has never worked consistently, and during syscalls the FPSIMD
vector state may be non-deterministically preserved and may be
erroneously migrated between streaming and non-streaming SVE modes.
The simplest fix is to handle a partial write by consistently zeroing
the remaining state. As detailed below I do not believe this will
adversely affect any real usage.
Neither GDB nor LLDB attempt partial writes to these regsets, and the
documentation (in Documentation/arch/arm64/sve.rst) has always indicated
that state preservation was not guaranteed, as is says:
| The effect of writing a partial, incomplete payload is unspecified.
When the logic was originally introduced in commit:
43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support")
... there were two potential behaviours, depending on TIF_SVE:
* When TIF_SVE was clear, all SVE state would be zeroed, excluding the
low 128 bits of vectors shared with FPSIMD, FPSR, and FPCR.
* When TIF_SVE was set, all SVE state would be zeroed, including the
low 128 bits of vectors shared with FPSIMD, but excluding FPSR and
FPCR.
Note that as writing to NT_ARM_SVE would set TIF_SVE, partial writes to
NT_ARM_SVE would not be idempotent, and if a first write preserved the
low 128 bits, a subsequent (potentially identical) partial write would
discard the low 128 bits.
When support for the NT_ARM_SSVE regset was added in commit:
e12310a0d30f ("arm64/sme: Implement ptrace support for streaming mode SVE registers")
... the above behaviour was retained for writes to the NT_ARM_SVE
regset, though writes to the NT_ARM_SSVE would always zero the SVE
registers and would not inherit FPSIMD register state. This happened as
fpsimd_sync_to_sve() only copied the FPSIMD regs when TIF_SVE was clear
and PSTATE.SM==0.
Subsequently, when FPSIMD/SVE state tracking was changed across commits:
baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") a0136be443d5 (arm64/fpsimd: Load FP state based on recorded data type") bbc6172eefdb ("arm64/fpsimd: SME no longer requires SVE register state") 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
... there was no corresponding update to the ptrace code, nor to
fpsimd_sync_to_sve(), which stil considers TIF_SVE and PSTATE.SM rather
than the saved fp_type. The saved state can be in the FPSIMD format
regardless of whether TIF_SVE is set or clear, and the saved type can
change non-deterministically during syscalls. Consequently a subsequent
partial write to the NT_ARM_SVE or NT_ARM_SSVE regsets may
non-deterministically preserve the FPSIMD state, and may migrate this
state between streaming and non-streaming modes.
Clean this up by never attempting to preserve ANY state when writing an
SVE payload to the NT_ARM_SVE/NT_ARM_SSVE regsets, zeroing all relevant
state including FPSR and FPCR. This simplifies the code, makes the
behaviour deterministic, and avoids migrating state between streaming
and non-streaming modes. As above, I do not believe this should
adversely affect existing userspace applications.
At the same time, remove fpsimd_sync_to_sve(). It is no longer used,
doesn't do what its documentation implies, and gets in the way of other
cleanups and fixes.
Fixes: 43d4da2c45b2 ("arm64/sve: ptrace and ELF coredump support") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-6-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
For historical reasons, restore_sve_fpsimd_context() has an open-coded
copy of the logic from read_fpsimd_context(), which is used to either
restore an FPSIMD-only context, or to merge FPSIMD state into an
SVE state when restoring an SVE+FPSIMD context. The logic is *almost*
identical.
Refactor the logic to avoid duplication and make this clearer.
This comes with two functional changes that I do not believe will be
problematic in practice:
* The user_fpsimd_state::size field will be checked in all restore paths
that consume it user_fpsimd_state. The kernel always populates this
field when delivering a signal, and so this should contain the
expected value unless it has been corrupted.
* If a read of user_fpsimd_state fails, we will return early without
modifying TIF_SVE, the saved SVCR, or the save fp_type. This will
leave the task in a consistent state, without potentially resurrecting
stale FPSIMD state. A read of user_fpsimd_state should never fail
unless the structure has been corrupted or the stack has been
unmapped.
Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-5-mark.rutland@arm.com
[will: Ensure read_fpsimd_context() returns negative error code or zero] Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:23 +0000 (14:26 +0100)]
arm64/fpsimd: signal: Mandate SVE payload for streaming-mode state
Non-streaming SVE state may be preserved without an SVE payload, in
which case the SVE context only has a header with VL==0, and all state
can be restored from the FPSIMD context. Streaming SVE state is always
preserved with an SVE payload, where the SVE context header has VL!=0,
and the SVE_SIG_FLAG_SM flag is set.
The kernel never preserves an SVE context where SVE_SIG_FLAG_SM is set
without an SVE payload. However, restore_sve_fpsimd_context() doesn't
forbid restoring such a context, and will handle this case by clearing
PSTATE.SM and restoring the FPSIMD context into non-streaming mode,
which isn't consistent with the SVE_SIG_FLAG_SM flag.
Forbid this case, and mandate an SVE payload when the SVE_SIG_FLAG_SM
flag is set. This avoids an awkward ABI quirk and reduces the risk that
later rework to this code permits configuring a task with PSTATE.SM==1
and fp_type==FP_STATE_FPSIMD.
I've marked this as a fix given that we never intended to support this
case, and we don't want anyone to start relying upon the old behaviour
once we re-enable SME.
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-4-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:22 +0000 (14:26 +0100)]
arm64/fpsimd: signal: Clear PSTATE.SM when restoring FPSIMD frame only
On systems with SVE and/or SME, the kernel will always create SVE and
FPSIMD signal frames when delivering a signal, but a user can manipulate
signal frames such that a signal return only observes an FPSIMD signal
frame. When this happens, restore_fpsimd_context() will restore state
such that fp_type==FP_STATE_FPSIMD, but will leave PSTATE.SM as-is.
It is possible for a user to set PSTATE.SM between syscall entry and
execution of the sigreturn logic (e.g. via ptrace), and consequently the
sigreturn may result in the task having PSTATE.SM==1 and
fp_type==FP_STATE_FPSIMD.
For various reasons it is not legitimate for a task to be in a state
where PSTATE.SM==1 and fp_type==FP_STATE_FPSIMD. Portions of the user
ABI are written with the requirement that streaming SVE state is always
presented in SVE format rather than FPSIMD format, and as there is no
mechanism to permit access to only the FPSIMD subset of streaming SVE
state, streaming SVE state must always be saved and restored in SVE
format.
Fix restore_fpsimd_context() to clear PSTATE.SM when restoring an FPSIMD
signal frame without an SVE signal frame. This matches the current
behaviour when an SVE signal frame is present, but the SVE signal frame
has no register payload (e.g. as is the case on SME-only systems which
lack SVE).
This change should have no effect for applications which do not alter
signal frames (i.e. almost all applications). I do not expect
non-{malicious,buggy} applications to hide the SVE signal frame, but
I've chosen to clear PSTATE.SM rather than mandating the presence of an
SVE signal frame in case there is some legacy (non-SME) usage that I am
not currently aware of.
For context, the SME handling was originally introduced in commit:
85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling")
... and subsequently updated/fixed to handle SME-only systems in commits:
7dde62f0687c ("arm64/signal: Always accept SVE signal frames on SME only systems") f26cd7372160 ("arm64/signal: Always allocate SVE signal frames on SME only systems")
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-3-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Thu, 8 May 2025 13:26:21 +0000 (14:26 +0100)]
arm64/fpsimd: Do not discard modified SVE state
Historically SVE state was discarded deterministically early in the
syscall entry path, before ptrace is notified of syscall entry. This
permitted ptrace to modify SVE state before and after the "real" syscall
logic was executed, with the modified state being retained.
This behaviour was changed by commit:
8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
That commit was intended to speed up workloads that used SVE by
opportunistically leaving SVE enabled when returning from a syscall.
The syscall entry logic was modified to truncate the SVE state without
disabling userspace access to SVE, and fpsimd_save_user_state() was
modified to discard userspace SVE state whenever
in_syscall(current_pt_regs()) is true, i.e. when
current_pt_regs()->syscallno != NO_SYSCALL.
Leaving SVE enabled opportunistically resulted in a couple of changes to
userspace visible behaviour which weren't described at the time, but are
logical consequences of opportunistically leaving SVE enabled:
* Signal handlers can observe the type of saved state in the signal's
sve_context record. When the kernel only tracks FPSIMD state, the 'vq'
field is 0 and there is no space allocated for register contents. When
the kernel tracks SVE state, the 'vq' field is non-zero and the
register contents are saved into the record.
As a result of the above commit, 'vq' (and the presence of SVE
register state) is non-deterministically zero or non-zero for a period
of time after a syscall. The effective register state is still
deterministic.
Hopefully no-one relies on this being deterministic. In general,
handlers for asynchronous events cannot expect a deterministic state.
* Similarly to signal handlers, ptrace requests can observe the type of
saved state in the NT_ARM_SVE and NT_ARM_SSVE regsets, as this is
exposed in the header flags. As a result of the above commit, this is
now in a non-deterministic state after a syscall. The effective
register state is still deterministic.
Hopefully no-one relies on this being deterministic. In general,
debuggers would have to handle this changing at arbitrary points
during program flow.
Discarding the SVE state within fpsimd_save_user_state() resulted in
other changes to userspace visible behaviour which are not desirable:
* A ptrace tracer can modify (or create) a tracee's SVE state at syscall
entry or syscall exit. As a result of the above commit, the tracee's
SVE state can be discarded non-deterministically after modification,
rather than being retained as it previously was.
Note that for co-operative tracer/tracee pairs, the tracer may
(re)initialise the tracee's state arbitrarily after the tracee sends
itself an initial SIGSTOP via a syscall, so this affects realistic
design patterns.
* The current_pt_regs()->syscallno field can be modified via ptrace, and
can be altered even when the tracee is not really in a syscall,
causing non-deterministic discarding to occur in situations where this
was not previously possible.
Further, using current_pt_regs()->syscallno in this way is unsound:
* There are data races between readers and writers of the
current_pt_regs()->syscallno field.
The current_pt_regs()->syscallno field is written in interruptible
task context using plain C accesses, and is read in irq/softirq
context using plain C accesses. These accesses are subject to data
races, with the usual concerns with tearing, etc.
* Writes to current_pt_regs()->syscallno are subject to compiler
reordering.
As current_pt_regs()->syscallno is written with plain C accesses,
the compiler is free to move those writes arbitrarily relative to
anything which doesn't access the same memory location.
In theory this could break signal return, where prior to restoring the
SVE state, restore_sigframe() calls forget_syscall(). If the write
were hoisted after restore of some SVE state, that state could be
discarded unexpectedly.
In practice that reordering cannot happen in the absence of LTO (as
cross compilation-unit function calls happen prevent this reordering),
and that reordering appears to be unlikely in the presence of LTO.
Additionally, since commit:
f130ac0ae4412dbe ("arm64: syscall: unmask DAIF earlier for SVCs")
... DAIF is unmasked before el0_svc_common() sets regs->syscallno to the
real syscall number. Consequently state may be saved in SVE format prior
to this point.
Considering all of the above, current_pt_regs()->syscallno should not be
used to infer whether the SVE state can be discarded. Luckily we can
instead use cpu_fp_state::to_save to track when it is safe to discard
the SVE state:
* At syscall entry, after the live SVE register state is truncated, set
cpu_fp_state::to_save to FP_STATE_FPSIMD to indicate that only the
FPSIMD portion is live and needs to be saved.
* At syscall exit, once the task's state is guaranteed to be live, set
cpu_fp_state::to_save to FP_STATE_CURRENT to indicate that TIF_SVE
must be considered to determine which state needs to be saved.
* Whenever state is modified, it must be saved+flushed prior to
manipulation. The state will be truncated if necessary when it is
saved, and reloading the state will set fp_state::to_save to
FP_STATE_CURRENT, preventing subsequent discarding.
This permits SVE state to be discarded *only* when it is known to have
been truncated (and the non-FPSIMD portions must be zero), and ensures
that SVE state is retained after it is explicitly modified.
For backporting, note that this fix depends on the following commits:
* b2482807fbd4 ("arm64/sme: Optimise SME exit on syscall entry")
* f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
* 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch") Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-2-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
Mark Rutland [Wed, 30 Apr 2025 17:32:40 +0000 (18:32 +0100)]
arm64/fpsimd: Avoid warning when sve_to_fpsimd() is unused
Historically fpsimd_to_sve() and sve_to_fpsimd() were (conditionally)
called by functions which were defined regardless of CONFIG_ARM64_SVE.
Hence it was necessary that both fpsimd_to_sve() and sve_to_fpsimd()
were always defined and not guarded by ifdeffery.
As a result of the removal of fpsimd_signal_preserve_current_state() in
commit:
929fa99b1215966f ("arm64/fpsimd: signal: Always save+flush state early")
... sve_to_fpsimd() has no callers when CONFIG_ARM64_SVE=n, resulting in
a build-time warnign that it is unused:
In contrast, fpsimd_to_sve() still has callers which are defined when
CONFIG_ARM64_SVE=n, and it would be awkward to hide this behind
ifdeffery and/or to use stub functions.
For now, suppress the warning by marking both fpsimd_to_sve() and
sve_to_fpsimd() as 'static inline', as we usually do for stub functions.
The compiler will no longer warn if either function is unused.
Aside from suppressing the warning, there should be no functional change
as a result of this patch.
Mark Rutland [Thu, 17 Apr 2025 19:01:13 +0000 (20:01 +0100)]
arm64/fpsimd: signal: Clear TPIDR2 when delivering signals
Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc.
AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with signal handling, and the behaviour that AAPCS64 currently
recommends for (sig)setjmp() and (sig)longjmp() does not always compose
safely with signal handling, as explained below.
When Linux delivers a signal, it creates signal frames which contain the
original values of PSTATE.ZA, the ZA tile, and TPIDR_EL2. Between saving
the original state and entering the signal handler, Linux clears
PSTATE.ZA, but leaves TPIDR2_EL0 unchanged. Consequently a signal
handler can be entered with PSTATE.ZA=0 (meaning accesses to ZA will
trap), while TPIDR_EL0 is non-null (which may indicate that ZA needs to
be lazily saved, depending on the contents of the TPIDR2 block). While
in this state, libc and/or compiler runtime code, such as longjmp(), may
attempt to save ZA. As PSTATE.ZA=0, these accesses will trap, causing
the kernel to inject a SIGILL. Note that by virtue of lazy saving
occurring in libc and/or C runtime code, this can be triggered by
application/library code which is unaware of SME.
To avoid the problem above, the kernel must ensure that signal handlers
are entered with PSTATE.ZA and TPIDR2_EL0 configured in a manner which
complies with the ZA lazy saving scheme. Practically speaking, the only
choice is to enter signal handlers with PSTATE.ZA=0 and TPIDR2_EL0=NULL.
This change should not impact SME code which does not follow the ZA lazy
saving scheme (and hence does not use TPIDR2_EL0).
An alternative approach that was considered is to have the signal
handler inherit the original values of both PSTATE.ZA and TPIDR2_EL0,
relying on lazy save/restore sequences being idempotent and capable of
racing safely. This is not safe as signal handlers must be assumed to
have a "private ZA" interface, and therefore cannot be entered with
PSTATE.ZA=1 and TPIDR2_EL0=NULL, but it is legitimate for signals to be
taken from this state.
With the kernel fixed to clear TPIDR2_EL0, there are a couple of
remaining issues (largely masked by the first issue) that must be fixed
in userspace:
(1) When a (sig)setjmp() + (sig)longjmp() pair cross a signal boundary,
ZA state may be discarded when it needs to be preserved.
Currently, the ZA lazy saving scheme recommends that setjmp() does
not save ZA, and recommends that longjmp() is responsible for saving
ZA. A call to longjmp() in a signal handler will not have visibility
of ZA state that existed prior to entry to the signal, and when a
longjmp() is used to bypass a usual signal return, unsaved ZA state
will be discarded erroneously.
To fix this, it is necessary for setjmp() to eagerly save ZA state,
and for longjmp() to configure PSTATE.ZA=0 and TPIDR2_EL0=NULL. This
works regardless of whether a signal boundary is crossed.
(2) When a C++ exception is thrown and crosses a signal boundary before
it is caught, ZA state may be discarded when it needs to be
preserved.
AAPCS64 requires that exception handlers are entered with
PSTATE.{SM,ZA}={0,0} and TPIDR2_EL0=NULL, with exception unwind code
expected to perform any necessary save of ZA state.
Where it is necessary to perform an exception unwind across an
exception boundary, the unwind code must recover any necessary ZA
state (along with TPIDR2) from signal frames.
Fix the kernel as described above, with setup_return() clearing
TPIDR2_EL0 when delivering a signal. Folk on CC are working on fixes for
the remaining userspace issues, including updates/fixes to the AAPCS64
specification and glibc.
Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling") Fixes: 39e54499280f ("arm64/signal: Include TPIDR2 in the signal context") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Daniel Kiss <daniel.kiss@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Richard Sandiford <richard.sandiford@arm.com> Cc: Sander De Smalen <sander.desmalen@arm.com> Cc: Tamas Petz <tamas.petz@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yury Khrustalev <yury.khrustalev@arm.com> Link: https://lore.kernel.org/r/20250417190113.3778111-1-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
During a context-switch, tls_thread_switch() reads and writes a task's
thread_struct::tpidr2_el0 field. Other code shouldn't access this field
for an active task, as such accesses would form a data-race with a
concurrent context-switch.
The usage in preserve_tpidr2_context() is suspicious, but benign as any
race with a context switch will write the same value back to
current->thread.tpidr2_el0.
Make this clearer and match restore_tpidr2_context() by using a
temporary variable instead, avoiding the (benign) data-race.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-14-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:40:09 +0000 (17:40 +0100)]
arm64/fpsimd: signal: Always save+flush state early
There are several issues with the way the native signal handling code
manipulates FPSIMD/SVE/SME state, described in detail below. These
issues largely result from races with preemption and inconsistent
handling of live state vs saved state.
Known issues with native FPSIMD/SVE/SME state management include:
* On systems with FPMR, the code to save/restore the FPMR accesses the
register while it is not owned by the current task. Consequently, this
may corrupt the FPMR of the current task and/or may corrupt the FPMR
of an unrelated task. The FPMR save/restore has been broken since it
was introduced in commit:
* On systems with SME, setup_return() modifies both the live register
state and the saved state register state regardless of whether the
task's state is live, and without holding the cpu fpsimd context.
Consequently:
- This may corrupt the state an unrelated task which has PSTATE.SM set
and/or PSTATE.ZA set.
- The task may enter the signal handler in streaming mode, and or with
ZA storage enabled unexpectedly.
- The task may enter the signal handler in non-streaming SVE mode with
stale SVE register state, which may have been inherited from
streaming SVE mode unexpectedly. Where the streaming and
non-streaming vector lengths differ, this may be packed into
registers arbitrarily.
This logic has been broken since it was introduced in commit:
40a8e87bb32855b3 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Further incorrect manipulation of state was added in commits:
ea64baacbc36a0d5 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode") baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
* Several restoration functions use fpsimd_flush_task_state() to discard
the live FPSIMD/SVE/SME while the in-memory copy is stale.
When a subset of the FPSIMD/SVE/SME state is restored, the remainder
may be non-deterministically reset to a stale snapshot from some
arbitrary point in the past.
This non-deterministic discarding was introduced in commit:
* On systems with SME/SME2, the entire FPSIMD/SVE/SME state may be
loaded onto the CPU redundantly. Either restore_fpsimd_context() or
restore_sve_fpsimd_context() will load the entire FPSIMD/SVE/SME state
via fpsimd_update_current_state() before restore_za_context() and
restore_zt_context() each discard the state via
fpsimd_flush_task_state().
This is purely redundant work, and not a functional bug.
To fix these issues, rework the native signal handling code to always
save+flush the current task's FPSIMD/SVE/SME state before manipulating
that state. This avoids races with preemption and ensures that state is
manipulated consistently regardless of whether it happened to be live
prior to manipulation. This largely involes:
* Using fpsimd_save_and_flush_current_state() to save+flush the state
for both signal delivery and signal return, before the state is
manipulated in any way.
* Removing fpsimd_signal_preserve_current_state() and updating
preserve_fpsimd_context() to explicitly ensure that the FPSIMD state
is up-to-date, as preserve_fpsimd_context() is the only consumer of
the FPSIMD state during signal delivery.
* Modifying fpsimd_update_current_state() to not reload the FPSIMD state
onto the CPU. Ideally we'd remove fpsimd_update_current_state()
entirely, but I've left that for subsequent patches as there are a
number of of other problems with the FPSIMD<->SVE conversion helpers
that should be addressed at the same time. For now I've removed the
misleading comment.
For setup_return(), we need to decide (for ABI reasons) whether signal
delivery should have all the side-effects of an SMSTOP. For now I've
left a TODO comment, as there are other questions in this area that I'll
address with subsequent patches.
Fixes: 8c46def44409 ("arm64/signal: Add FPMR signal handling") Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals") Fixes: ea64baacbc36 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode") Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE") Fixes: 8cd969d28fd2 ("arm64/sve: Signal handling support") Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling") Fixes: ee072cf70804 ("arm64/sme: Implement signal handling for ZT") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-13-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:40:08 +0000 (17:40 +0100)]
arm64/fpsimd: signal32: Always save+flush state early
There are several issues with the way the native signal handling code
manipulates FPSIMD/SVE/SME state. To fix those issues, subsequent
patches will rework the native signal handling code to always save+flush
the current task's FPSIMD/SVE/SME state before manipulating that state.
In preparation for those changes, rework the compat signal handling code
to save+flush the current task's FPSIMD state before manipulating it.
Subsequent patches will remove fpsimd_signal_preserve_current_state()
and fpsimd_update_current_state(). Compat tasks can only have FPSIMD
state, and cannot have any SVE or SME state. Thus, the SVE state
manipulation present in fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() is not necessary, and it is safe to
directly manipulate current->thread.uw.fpsimd_state once it has been
saved+flushed.
Use fpsimd_save_and_flush_current_state() to save+flush the state for
both signal delivery and signal return, before the state is manipulated
in any way. While it would be safe for compat_restore_vfp_context() to
use fpsimd_flush_task_state(current), there are several extant issues in
the native signal code resulting from incorrect use of
fpsimd_flush_task_state(), and for consistency it is preferable to use
fpsimd_save_and_flush_current_state().
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-12-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
When the current task's FPSIMD/SVE/SME state may be live on *any* CPU in
the system, special care must be taken when manipulating that state, as
this manipulation can race with preemption and/or asynchronous usage of
FPSIMD/SVE/SME (e.g. kernel-mode NEON in softirq handlers).
Even when manipulation is is protected with get_cpu_fpsimd_context() and
get_cpu_fpsimd_context(), the logic necessary when the state is live on
the current CPU can be wildly different from the logic necessary when
the state is not live on the current CPU. A number of historical and
extant issues result from failing to handle these cases consistetntly
and/or correctly.
To make it easier to get such manipulation correct, add a new
fpsimd_save_and_flush_current_state() helper function, which ensures
that the current task's state has been saved to memory and any stale
state on any CPU has been "flushed" such that is not live on any CPU in
the system. This will allow code to safely manipulate the saved state
without risk of races.
Subsequent patches will use the new function.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-11-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:40:06 +0000 (17:40 +0100)]
arm64/fpsimd: Fix merging of FPSIMD state during signal return
For backwards compatibility reasons, when a signal return occurs which
restores SVE state, the effective lower 128 bits of each of the SVE
vector registers are restored from the corresponding FPSIMD vector
register in the FPSIMD signal frame, overriding the values in the SVE
signal frame. This is intended to be the case regardless of streaming
mode.
To make this happen, restore_sve_fpsimd_context() uses
fpsimd_update_current_state() to merge the lower 128 bits from the
FPSIMD signal frame into the SVE register state. Unfortunately,
fpsimd_update_current_state() performs this merging dependent upon
TIF_SVE, which is not always correct for streaming SVE register state:
* When restoring non-streaming SVE register state there is no observable
problem, as the signal return code configures TIF_SVE and the saved
fp_type to match before calling fpsimd_update_current_state(), which
observes either:
- TIF_SVE set AND fp_type == FP_STATE_SVE
- TIF_SVE clear AND fp_type == FP_STATE_FPSIMD
* On systems which have SME but not SVE, TIF_SVE cannot be set. Thus the
merging will never happen for the streaming SVE register state.
* On systems which have SVE and SME, TIF_SVE can be set and cleared
independently of PSTATE.SM. Thus the merging may or may not happen for
streaming SVE register state.
As TIF_SVE can be cleared non-deterministically during syscalls
(including at the start of sigreturn()), the merging may occur
non-deterministically from the perspective of userspace.
This logic has been broken since its introduction in commit:
85ed24dad2904f7c ("arm64/sme: Implement streaming SVE signal handling")
... at which point both fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() only checked TIF SVE. When PSTATE.SM==1
and TIF_SVE was clear, signal delivery would place stale FPSIMD state
into the FPSIMD signal frame, and signal return would not merge this
into the restored register state.
Subsequently, signal delivery was fixed as part of commit:
61da7c8e2a602f66 ("arm64/signal: Don't assume that TIF_SVE means we saved SVE state")
... but signal restore was not given a corresponding fix, and when
TIF_SVE was clear, signal restore would still fail to merge the FPSIMD
state into the restored SVE register state. The 'Fixes' tag did not
indicate that this had been broken since its introduction.
Fix this by merging the FPSIMD state dependent upon the saved fp_type,
matching what we (currently) do during signal delivery.
As described above, when backporting this commit, it will also be
necessary to backport commit:
61da7c8e2a602f66 ("arm64/signal: Don't assume that TIF_SVE means we saved SVE state")
... and prior to commit:
baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
... it will be necessary for fpsimd_signal_preserve_current_state() and
fpsimd_update_current_state() to consider both TIF_SVE and
thread_sm_enabled(¤t->thread), in place of the saved fp_type.
Fixes: 85ed24dad290 ("arm64/sme: Implement streaming SVE signal handling") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-10-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:40:05 +0000 (17:40 +0100)]
arm64/fpsimd: Reset FPMR upon exec()
An exec() is expected to reset all FPSIMD/SVE/SME state, and barring
special handling of the vector lengths, the state is expected to reset
to zero. This reset is handled in fpsimd_flush_thread(), which the core
exec() code calls via flush_thread().
When support was added for FPMR, no logic was added to
fpsimd_flush_thread() to reset the FPMR value, and thus it is
erroneously inherited across an exec().
Add the missing reset of FPMR.
Fixes: 203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-9-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:40:04 +0000 (17:40 +0100)]
arm64/fpsimd: Avoid clobbering kernel FPSIMD state with SMSTOP
On system with SME, a thread's kernel FPSIMD state may be erroneously
clobbered during a context switch immediately after that state is
restored. Systems without SME are unaffected.
If the CPU happens to be in streaming SVE mode before a context switch
to a thread with kernel FPSIMD state, fpsimd_thread_switch() will
restore the kernel FPSIMD state using fpsimd_load_kernel_state() while
the CPU is still in streaming SVE mode. When fpsimd_thread_switch()
subsequently calls fpsimd_flush_cpu_state(), this will execute an
SMSTOP, causing an exit from streaming SVE mode. The exit from
streaming SVE mode will cause the hardware to reset a number of
FPSIMD/SVE/SME registers, clobbering the FPSIMD state.
Fix this by calling fpsimd_flush_cpu_state() before restoring the kernel
FPSIMD state.
Fixes: e92bee9f861b ("arm64/fpsimd: Avoid erroneous elide of user state reload") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-8-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown [Wed, 9 Apr 2025 16:40:03 +0000 (17:40 +0100)]
arm64/fpsimd: Don't corrupt FPMR when streaming mode changes
When the effective value of PSTATE.SM is changed from 0 to 1 or from 1
to 0 by any method, an entry or exit to/from streaming SVE mode is
performed, and hardware automatically resets a number of registers. As
of ARM DDI 0487 L.a, this means:
* All implemented bits of the SVE vector registers are set to zero.
* All implemented bits of the SVE predicate registers are set to zero.
* All implemented bits of FFR are set to zero, if FFR is implemented in
the new mode.
* FPSR is set to 0x0000_0000_0800_009f.
* FPMR is set to 0, if FPMR is implemented.
Currently task_fpsimd_load() restores FPMR before restoring SVCR (which
is an accessor for PSTATE.{SM,ZA}), and so the restored value of FPMR
may be clobbered if the restored value of PSTATE.SM happens to differ
from the initial value of PSTATE.SM.
Mark Brown [Wed, 9 Apr 2025 16:40:02 +0000 (17:40 +0100)]
arm64/fpsimd: Discard stale CPU state when handling SME traps
The logic for handling SME traps manipulates saved FPSIMD/SVE/SME state
incorrectly, and a race with preemption can result in a task having
TIF_SME set and TIF_FOREIGN_FPSTATE clear even though the live CPU state
is stale (e.g. with SME traps enabled). This can result in warnings from
do_sme_acc() where SME traps are not expected while TIF_SME is set:
| /* With TIF_SME userspace shouldn't generate any traps */
| if (test_and_set_thread_flag(TIF_SME))
| WARN_ON(1);
This is very similar to the SVE issue we fixed in commit:
751ecf6afd6568ad ("arm64/sve: Discard stale CPU state when handling SVE traps")
The race can occur when the SME trap handler is preempted before and
after manipulating the saved FPSIMD/SVE/SME state, starting and ending on
the same CPU, e.g.
| void do_sme_acc(unsigned long esr, struct pt_regs *regs)
| {
| // Trap on CPU 0 with TIF_SME clear, SME traps enabled
| // task->fpsimd_cpu is 0.
| // per_cpu_ptr(&fpsimd_last_state, 0) is task.
|
| ...
|
| // Preempted; migrated from CPU 0 to CPU 1.
| // TIF_FOREIGN_FPSTATE is set.
|
| get_cpu_fpsimd_context();
|
| /* With TIF_SME userspace shouldn't generate any traps */
| if (test_and_set_thread_flag(TIF_SME))
| WARN_ON(1);
|
| if (!test_thread_flag(TIF_FOREIGN_FPSTATE)) {
| unsigned long vq_minus_one =
| sve_vq_from_vl(task_get_sme_vl(current)) - 1;
| sme_set_vq(vq_minus_one);
|
| fpsimd_bind_task_to_cpu();
| }
|
| put_cpu_fpsimd_context();
|
| // Preempted; migrated from CPU 1 to CPU 0.
| // task->fpsimd_cpu is still 0
| // If per_cpu_ptr(&fpsimd_last_state, 0) is still task then:
| // - Stale HW state is reused (with SME traps enabled)
| // - TIF_FOREIGN_FPSTATE is cleared
| // - A return to userspace skips HW state restore
| }
Fix the case where the state is not live and TIF_FOREIGN_FPSTATE is set
by calling fpsimd_flush_task_state() to detach from the saved CPU
state. This ensures that a subsequent context switch will not reuse the
stale CPU state, and will instead set TIF_FOREIGN_FPSTATE, forcing the
new state to be reloaded from memory prior to a return to userspace.
Mark Rutland [Wed, 9 Apr 2025 16:40:01 +0000 (17:40 +0100)]
arm64/fpsimd: Remove opportunistic freeing of SME state
When a task's SVE vector length (NSVL) is changed, and the task happens
to have SVCR.{SM,ZA}=={0,0}, vec_set_vector_length() opportunistically
frees the task's sme_state and clears TIF_SME.
The opportunistic freeing was added with no rationale in commit:
d4d5be94a8787242 ("arm64/fpsimd: Ensure SME storage is allocated after SVE VL changes")
That commit fixed an unrelated problem where the task's sve_state was
freed while it could be used to store streaming mode register state,
where the fix was to re-allocate the task's sve_state.
There is no need to free and/or reallocate the task's sme_state when the
SVE vector length changes, and there is no need to clear TIF_SME. Given
the SME vector length (SVL) doesn't change, the task's sme_state remains
correctly sized.
Remove the unnecessary opportunistic freeing of the task's sme_state
when the task's SVE vector length is changed. The task's sme_state is
still freed when the SME vector length is changed.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-5-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:40:00 +0000 (17:40 +0100)]
arm64/fpsimd: Remove redundant SVE trap manipulation
When task_fpsimd_load() loads the saved FPSIMD/SVE/SME state, it
configures EL0 SVE traps by calling sve_user_{enable,disable}(). This is
unnecessary, and this is suspicious/confusing as task_fpsimd_load() does
not configure EL0 SME traps.
All calls to task_fpsimd_load() are followed by a call to
fpsimd_bind_task_to_cpu(), where the latter configures traps for SVE and
SME dependent upon the current values of TIF_SVE and TIF_SME, overriding
any trap configuration performed by task_fpsimd_load().
The calls to sve_user_{enable,disable}() calls in task_fpsimd_load()
have been redundant (though benign) since they were introduced in
commit:
a0136be443d51803 ("arm64/fpsimd: Load FP state based on recorded data type")
Remove the unnecessary and confusing SVE trap manipulation.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-4-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
There have been no users of fpsimd_force_sync_to_sve() since commit:
bbc6172eefdb276b ("arm64/fpsimd: SME no longer requires SVE register state")
Remove fpsimd_force_sync_to_sve().
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-3-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Wed, 9 Apr 2025 16:39:58 +0000 (17:39 +0100)]
arm64/fpsimd: Avoid RES0 bits in the SME trap handler
The SME trap handler consumes RES0 bits from the ESR when determining
the reason for the trap, and depends upon those bits reading as zero.
This may break in future when those RES0 bits are allocated a meaning
and stop reading as zero.
For SME traps taken with ESR_ELx.EC == 0b011101, the specific reason for
the trap is indicated by ESR_ELx.ISS.SMTC ("SME Trap Code"). This field
occupies bits [2:0] of ESR_ELx.ISS, and as of ARM DDI 0487 L.a, bits
[24:3] of ESR_ELx.ISS are RES0. ESR_ELx.ISS itself occupies bits [24:0]
of ESR_ELx.
Extract the SMTC field specifically, matching the way we handle ESR_ELx
fields elsewhere, and ensuring that the handler is future-proof.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20250409164010.3480271-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Thomas Weißschuh [Wed, 2 Apr 2025 20:21:57 +0000 (21:21 +0100)]
tools/include: make uapi/linux/types.h usable from assembly
The "real" linux/types.h UAPI header gracefully degrades to a NOOP when
included from assembly code.
Mirror this behaviour in the tools/ variant.
Test for __ASSEMBLER__ over __ASSEMBLY__ as the former is provided by the
toolchain automatically.
Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/lkml/af553c62-ca2f-4956-932c-dd6e3a126f58@sirena.org.uk/ Fixes: c9fbaa879508 ("selftests: vDSO: parse_vdso: Use UAPI headers instead of libc headers") Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20250321-uapi-consistency-v1-1-439070118dc0@linutronix.de Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'soundwire-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire
Pull soundwire fix from Vinod Koul:
- add missing config symbol CONFIG_SND_HDA_EXT_CORE required for asoc
driver CONFIG_SND_SOF_SOF_HDA_SDW_BPT
* tag 'soundwire-6.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire:
ASoC: SOF: Intel: Let SND_SOF_SOF_HDA_SDW_BPT select SND_HDA_EXT_CORE
Len Brown [Sun, 6 Apr 2025 18:49:20 +0000 (14:49 -0400)]
tools/power turbostat: v2025.05.06
Support up to 8192 processors
Add cpuidle governor debug telemetry, disabled by default
Update default output to exclude cpuidle invocation counts
Bug fixes
Merge tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Ingo Molnar:
"Fix a perf events time accounting bug"
* tag 'perf-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix child_total_time_enabled accounting bug at task exit
Merge tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix a nonsensical Kconfig combination
- Remove an unnecessary rseq-notification
* tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Eliminate useless task_work on execve
sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP
... and don't error out so hard on missing module descriptions.
Before commit 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()")
we used to warn about missing module descriptions, but only when
building with extra warnigns (ie 'W=1').
After that commit the warning became an unconditional hard error.
And it turns out not all modules have been converted despite the claims
to the contrary. As reported by Damian Tometzki, the slub KUnit test
didn't have a module description, and apparently nobody ever really
noticed.
The reason nobody noticed seems to be that the slub KUnit tests get
disabled by SLUB_TINY, which also ends up disabling a lot of other code,
both in tests and in slub itself. And so anybody doing full build tests
didn't actually see this failre.
So let's disable SLUB_TINY for build-only tests, since it clearly ends
up limiting build coverage. Also turn the missing module descriptions
error back into a warning, but let's keep it around for non-'W=1'
builds.
Len Brown [Sun, 6 Apr 2025 15:18:39 +0000 (11:18 -0400)]
tools/power turbostat: report CoreThr per measurement interval
The CoreThr column displays total thermal throttling events
since boot time.
Change it to report events during the measurement interval.
This is more useful for showing a user the current conditions.
Total events since boot time are still available to the user via
/sys/devices/system/cpu/cpu*/thermal_throttle/*
Document CoreThr on turbostat.8
Fixes: eae97e053fe30 ("turbostat: Support thermal throttle count print") Reported-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com> Cc: Chen Yu <yu.c.chen@intel.com>
Justin Ernst [Wed, 19 Mar 2025 20:27:31 +0000 (15:27 -0500)]
tools/power turbostat: Increase CPU_SUBSET_MAXCPUS to 8192
On systems with >= 1024 cpus (in my case 1152), turbostat fails with the error output:
"turbostat: /sys/fs/cgroup/cpuset.cpus.effective: cpu str malformat 0-1151"
A similar error appears with the use of turbostat --cpu when the inputted cpu
range contains a cpu number >= 1024:
# turbostat -c 1100-1151
"--cpu 1100-1151" malformed
...
Both errors are caused by parse_cpu_str() reaching its limit of CPU_SUBSET_MAXCPUS.
It's a good idea to limit the maximum cpu number being parsed, but 1024 is too low.
For a small increase in compute and allocated memory, increasing CPU_SUBSET_MAXCPUS
brings support for parsing cpu numbers >= 1024.
Increase CPU_SUBSET_MAXCPUS to 8192, a common setting for CONFIG_NR_CPUS on x86_64.
Signed-off-by: Justin Ernst <justin.ernst@hpe.com> Signed-off-by: Len Brown <len.brown@intel.com>
Merge tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A set of final cleanups for the timer subsystem:
- Convert all del_timer[_sync]() instances over to the new
timer_delete[_sync]() API and remove the legacy wrappers.
Conversion was done with coccinelle plus some manual fixups as
coccinelle chokes on scoped_guard().
- The final cleanup of the hrtimer_init() to hrtimer_setup()
conversion.
This has been delayed to the end of the merge window, so that all
patches which have been merged through other trees are in mainline
and all new users are catched.
Doing this right before rc1 ensures that new code which is merged post
rc1 is not introducing new instances of the original functionality"
* tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/timers: Rename the hrtimer_init event to hrtimer_setup
hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
hrtimers: Rename debug_init() to debug_setup()
hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
hrtimers: Make callback function pointer private
hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
hrtimers: Switch to use __htimer_setup()
hrtimers: Delete hrtimer_init()
treewide: Convert new and leftover hrtimer_init() users
treewide: Switch/rename to timer_delete[_sync]()
Merge tag 'irq-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more irq updates from Thomas Gleixner:
"A set of updates for the interrupt subsystem:
- A treewide cleanup for the irq_domain code, which makes the naming
consistent and gets rid of the original oddity of naming domains
'host'.
This is a trivial mechanical change and is done late to ensure that
all instances have been catched and new code merged post rc1 wont
reintroduce new instances.
- A trivial consistency fix in the migration code
The recent introduction of irq_force_complete_move() in the core
code, causes a problem for the nostalgia crowd who maintains ia64
out of tree.
The code assumes that hierarchical interrupt domains are enabled
and dereferences irq_data::parent_data unconditionally. That works
in mainline because both architectures which enable that code have
hierarchical domains enabled. Though it breaks the ia64 build,
which enables the functionality, but does not have hierarchical
domains.
While it's not really a problem for mainline today, this
unconditional dereference is inconsistent and trivially fixable by
using the existing helper function irqd_get_parent_data(), which
has the appropriate #ifdeffery in place"
* tag 'irq-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/migration: Use irqd_get_parent_data() in irq_force_complete_move()
irqdomain: Stop using 'host' for domain
irqdomain: Rename irq_get_default_host() to irq_get_default_domain()
irqdomain: Rename irq_set_default_host() to irq_set_default_domain()
Merge tag 'timers-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A revert to fix a adjtimex() regression:
The recent change to prevent that time goes backwards for the coarse
time getters due to immediate multiplier adjustments via adjtimex(),
changed the way how the timekeeping core treats that.
That change result in a regression on the adjtimex() side, which is
user space visible:
1) The forwarding of the base time moves the update out of the
original period and establishes a new one. That's changing the
behaviour of the [PF]LL control, which user space expects to be
applied periodically.
2) The clearing of the accumulated NTP error due to #1, changes the
behaviour as well.
An attempt to delay the multiplier/frequency update to the next tick
did not solve the problem as userspace expects that the multiplier or
frequency updates are in effect, when the syscall returns.
There is a different solution for the coarse time problem available,
so revert the offending commit to restore the existing adjtimex()
behaviour"
* tag 'timers-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "timekeeping: Fix possible inconsistencies in _COARSE clockids"
Merge tag 'sh-for-v6.15-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux
Pull sh updates from John Paul Adrian Glaubitz:
"One important fix and one small configuration update.
The first patch by Artur Rojek fixes an issue with the J2 firmware
loader not being able to find the location of the device tree blob due
to insufficient alignment of the .bss section which rendered J2 boards
unbootable.
The second patch by Johan Korsnes updates the defconfigs on sh to drop
the CONFIG_NET_CLS_TCINDEX configuration option which became obsolete
after 8c710f75256b ("net/sched: Retire tcindex classifier").
Summary:
- sh: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
- sh: Align .bss section padding to 8-byte boundary"
* tag 'sh-for-v6.15-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
sh: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
sh: Align .bss section padding to 8-byte boundary
Merge tag 'kbuild-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Improve performance in gendwarfksyms
- Remove deprecated EXTRA_*FLAGS and KBUILD_ENABLE_EXTRA_GCC_CHECKS
- Support CONFIG_HEADERS_INSTALL for ARCH=um
- Use more relative paths to sources files for better reproducibility
- Support the loong64 Debian architecture
- Add Kbuild bash completion
- Introduce intermediate vmlinux.unstripped for architectures that need
static relocations to be stripped from the final vmlinux
- Fix versioning in Debian packages for -rc releases
- Treat missing MODULE_DESCRIPTION() as an error
- Convert Nios2 Makefiles to use the generic rule for built-in DTB
- Add debuginfo support to the RPM package
* tag 'kbuild-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (40 commits)
kbuild: rpm-pkg: build a debuginfo RPM
kconfig: merge_config: use an empty file as initfile
nios2: migrate to the generic rule for built-in DTB
rust: kbuild: skip `--remap-path-prefix` for `rustdoc`
kbuild: pacman-pkg: hardcode module installation path
kbuild: deb-pkg: don't set KBUILD_BUILD_VERSION unconditionally
modpost: require a MODULE_DESCRIPTION()
kbuild: make all file references relative to source root
x86: drop unnecessary prefix map configuration
kbuild: deb-pkg: add comment about future removal of KDEB_COMPRESS
kbuild: Add a help message for "headers"
kbuild: deb-pkg: remove "version" variable in mkdebian
kbuild: deb-pkg: fix versioning for -rc releases
Documentation/kbuild: Fix indentation in modules.rst example
x86: Get rid of Makefile.postlink
kbuild: Create intermediate vmlinux build with relocations preserved
kbuild: Introduce Kconfig symbol for linking vmlinux with relocations
kbuild: link-vmlinux.sh: Make output file name configurable
kbuild: do not generate .tmp_vmlinux*.map when CONFIG_VMLINUX_MAP=y
Revert "kheaders: Ignore silly-rename files"
...
Merge tag 'drm-next-2025-04-05' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly fixes, mostly from the end of last week, this week was very
quiet, maybe you scared everyone away. It's mostly amdgpu, and xe,
with some i915, adp and bridge bits, since I think this is overly
quiet I'd expect rc2 to be a bit more lively.
bridge:
- tda998x: Select CONFIG_DRM_KMS_HELPER
amdgpu:
- Guard against potential division by 0 in fan code
- Zero RPM support for SMU 14.0.2
- Properly handle SI and CIK support being disabled
- PSR fixes
- DML2 fixes
- DP Link training fix
- Vblank fixes
- RAS fixes
- Partitioning fix
- SDMA fix
- SMU 13.0.x fixes
- Rom fetching fix
- MES fixes
- Queue reset fix
xe:
- Fix NULL pointer dereference on error path
- Add missing HW workaround for BMG
- Fix survivability mode not triggering
- Fix build warning when DRM_FBDEV_EMULATION is not set
i915:
- Bounds check for scalers in DSC prefill latency computation
- Fix build by adding a missing include
adp:
- Fix error handling in plane setup"
# -----BEGIN PGP SIGNATURE-----
* tag 'drm-next-2025-04-05' of https://gitlab.freedesktop.org/drm/kernel: (34 commits)
drm/i2c: tda998x: select CONFIG_DRM_KMS_HELPER
drm/amdgpu/gfx12: fix num_mec
drm/amdgpu/gfx11: fix num_mec
drm/amd/pm: Add gpu_metrics_v1_8
drm/amdgpu: Prefer shadow rom when available
drm/amd/pm: Update smu metrics table for smu_v13_0_6
drm/amd/pm: Remove host limit metrics support
Remove unnecessary firmware version check for gc v9_4_2
drm/amdgpu: stop unmapping MQD for kernel queues v3
Revert "drm/amdgpu/sdma_v4_4_2: update VM flush implementation for SDMA"
drm/amdgpu: Parse all deferred errors with UMC aca handle
drm/amdgpu: Update ta ras block
drm/amdgpu: Add NPS2 to DPX compatible mode
drm/amdgpu: Use correct gfx deferred error count
drm/amd/display: Actually do immediate vblank disable
drm/amd/display: prevent hang on link training fail
Revert "drm/amd/display: dml2 soc dscclk use DPM table clk setting"
drm/amd/display: Increase vblank offdelay for PSR panels
drm/amd: Handle being compiled without SI or CIK support better
drm/amd/pm: Add zero RPM enabled OD setting support for SMU14.0.2
...
Uday Shankar [Mon, 31 Mar 2025 22:46:32 +0000 (16:46 -0600)]
kbuild: rpm-pkg: build a debuginfo RPM
The rpm-pkg make target currently suffers from a few issues related to
debuginfo:
1. debuginfo for things built into the kernel (vmlinux) is not available
in any RPM produced by make rpm-pkg. This makes using tools like
systemtap against a make rpm-pkg kernel impossible.
2. debug source for the kernel is not available. This means that
commands like 'disas /s' in gdb, which display source intermixed with
assembly, can only print file names/line numbers which then must be
painstakingly resolved to actual source in a separate editor.
3. debuginfo for modules is available, but it remains bundled with the
.ko files that contain module code, in the main kernel RPM. This is a
waste of space for users who do not need to debug the kernel (i.e.
most users).
Address all of these issues by additionally building a debuginfo RPM
when the kernel configuration allows for it, in line with standard
patterns followed by RPM distributors. With these changes:
1. systemtap now works (when these changes are backported to 6.11, since
systemtap lags a bit behind in compatibility), as verified by the
following simple test script:
129 op_str = blk_op_name[op];
130
131 return op_str;
132 }
0xffffffff814c8768 <+40>: jmp 0xffffffff81d01360 <__x86_return_thunk>
End of assembler dump.
3. The size of the main kernel package goes down substantially,
especially if many modules are built (quite typical). Here is a
comparison of installed size of the kernel package (configured with
allmodconfig, dwarf4 debuginfo, and module compression turned off)
before and after this patch:
# rpm -qi kernel-6.13* | grep -E '^(Version|Size)'
Version : 6.13.0postpatch+
Size : 1382874089
Version : 6.13.0prepatch+
Size : 17870795887
This is a ~92% size reduction.
Note that a debuginfo package can only be produced if the following
configs are set:
- CONFIG_DEBUG_INFO=y
- CONFIG_MODULE_COMPRESS=n
- CONFIG_DEBUG_INFO_SPLIT=n
The first of these is obvious - we can't produce debuginfo if the build
does not generate it. The second two requirements can in principle be
removed, but doing so is difficult with the current approach, which uses
a generic rpmbuild script find-debuginfo.sh that processes all packaged
executables. If we want to remove those requirements the best path
forward is likely to add some debuginfo extraction/installation logic to
the modules_install target (controllable by flags). That way, it's
easier to operate on modules before they're compressed, and the logic
can be reused by all packaging targets.
Daniel Gomez [Fri, 28 Mar 2025 14:28:37 +0000 (14:28 +0000)]
kconfig: merge_config: use an empty file as initfile
The scripts/kconfig/merge_config.sh script requires an existing
$INITFILE (or the $1 argument) as a base file for merging Kconfig
fragments. However, an empty $INITFILE can serve as an initial starting
point, later referenced by the KCONFIG_ALLCONFIG Makefile variable
if -m is not used. This variable can point to any configuration file
containing preset config symbols (the merged output) as stated in
Documentation/kbuild/kconfig.rst. When -m is used $INITFILE will
contain just the merge output requiring the user to run make (i.e.
KCONFIG_ALLCONFIG=<$INITFILE> make <allnoconfig/alldefconfig> or make
olddefconfig).
Instead of failing when `$INITFILE` is missing, create an empty file and
use it as the starting point for merges.
Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Masahiro Yamada [Sun, 22 Dec 2024 00:30:53 +0000 (09:30 +0900)]
nios2: migrate to the generic rule for built-in DTB
Commit 654102df2ac2 ("kbuild: add generic support for built-in boot
DTBs") introduced generic support for built-in DTBs.
Select GENERIC_BUILTIN_DTB when built-in DTB support is enabled.
To keep consistency across architectures, this commit also renames
CONFIG_NIOS2_DTB_SOURCE_BOOL to CONFIG_BUILTIN_DTB, and
CONFIG_NIOS2_DTB_SOURCE to CONFIG_BUILTIN_DTB_NAME.
Johan Korsnes [Sun, 23 Mar 2025 19:13:30 +0000 (20:13 +0100)]
sh: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
This option was removed from Kconfig in 8c710f75256b ("net/sched:
Retire tcindex classifier") but from the defconfigs.
Fixes: 8c710f75256b ("net/sched: Retire tcindex classifier") Signed-off-by: Johan Korsnes <johan.korsnes@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Artur Rojek [Sun, 16 Feb 2025 17:55:44 +0000 (18:55 +0100)]
sh: Align .bss section padding to 8-byte boundary
J2-based devices expect to find a device tree blob at the end of the
.bss section. As of a77725a9a3c5 ("scripts/dtc: Update to upstream
version v1.6.1-19-g0a3a9d3449c8"), libfdt enforces 8-byte alignment
for the DTB, causing J2 devices to fail early in sh_fdt_init().
As the J2 loader firmware calculates the DTB location based on the kernel
image .bss section size rather than the __bss_stop symbol offset, the
required alignment can't be enforced with BSS_SECTION(0, PAGE_SIZE, 8).
To fix this, inline a modified version of the above macro which grows
.bss by the required size. While this change affects all existing SH
boards, it should be benign on platforms which don't need this alignment.
Signed-off-by: Artur Rojek <contact@artur-rojek.eu> Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Tested-by: Rob Landley <rob@landley.net> Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Merge tag 'input-for-v6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- a brand new driver for touchpads and touchbars in newer Apple devices
- support for Berlin-A series in goodix-berlin touchscreen driver
- improvements to matrix_keypad driver to better handle GPIOs toggling
- assorted small cleanups in other input drivers
* tag 'input-for-v6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: goodix_berlin - add support for Berlin-A series
dt-bindings: input: goodix,gt9916: Document gt9897 compatible
dt-bindings: input: matrix_keypad - add wakeup-source property
dt-bindings: input: matrix_keypad - add missing property
Input: pm8941-pwrkey - fix dev_dbg() output in pm8941_pwrkey_irq()
Input: synaptics - hide unused smbus_pnp_ids[] array
Input: apple_z2 - fix potential confusion in Kconfig
Input: matrix_keypad - use fsleep for delays after activating columns
Input: matrix_keypad - add settle time after enabling all columns
dt-bindings: input: matrix_keypad: add settle time after enabling all columns
dt-bindings: input: matrix_keypad: convert to YAML
dt-bindings: input: Correct indentation and style in DTS example
MAINTAINERS: Add entries for Apple Z2 touchscreen driver
Input: apple_z2 - add a driver for Apple Z2 touchscreens
dt-bindings: input: touchscreen: Add Z2 controller
Input: Switch to use hrtimer_setup()
Input: drop vb2_ops_wait_prepare/finish
Nam Cao [Wed, 5 Feb 2025 10:55:20 +0000 (11:55 +0100)]
hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init_on_stack() to debug_setup_on_stack() as well, to keep the
names consistent.
Nam Cao [Wed, 5 Feb 2025 10:55:18 +0000 (11:55 +0100)]
hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper() as well, to
keep the names consistent.
Nam Cao [Wed, 5 Feb 2025 10:55:17 +0000 (11:55 +0100)]
hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
The struct hrtimer::function field can only be changed using
hrtimer_setup*() or hrtimer_update_function(), and both already null-check
'function'. Therefore, null-checking 'function' in hrtimer_start_range_ns()
is not necessary.
Nam Cao [Wed, 5 Feb 2025 10:55:16 +0000 (11:55 +0100)]
hrtimers: Make callback function pointer private
Make the struct hrtimer::function field private, to prevent users from
changing this field in an unsafe way. hrtimer_update_function() should be
used if the callback function needs to be changed.
Nam Cao [Wed, 5 Feb 2025 10:55:11 +0000 (11:55 +0100)]
hrtimers: Switch to use __htimer_setup()
__hrtimer_init_sleeper() calls __hrtimer_init() and also sets up the
callback function. But there is already __hrtimer_setup() which does both
actions.
Switch to use __hrtimer_setup() to simplify the code.
Merge tag 'v6.15-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"This fixes a race condition in the newly added eip93 driver"
* tag 'v6.15-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: inside-secure/eip93 - acquire lock on eip93_put_descriptor hash
Merge tag 's390-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Fix machine check handler _CIF_MCCK_GUEST bit setting by adding the
missing base register for relocated lowcore address
- Fix build failure on older linkers by conditionally adding the
-no-pie linker option only when it is supported
- Fix inaccurate kernel messages in vfio-ap by providing descriptive
error notifications for AP queue sharing violations
- Fix PCI isolation logic by ensuring non-VF devices correctly return
false in zpci_bus_is_isolated_vf()
- Fix PCI DMA range map setup by using dma_direct_set_offset() to add a
proper sentinel element, preventing potential overruns and
translation errors
- Cleanup header dependency problems with asm-offsets.c
- Add fault info for unexpected low-address protection faults in user
mode
- Add support for HOTPLUG_SMT, replacing the arch-specific "nosmt"
handling with common code handling
- Use bitop functions to implement CPU flag helper functions to ensure
that bits cannot get lost if modified in different contexts on a CPU
- Remove unused machine_flags for the lowcore
* tag 's390-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/vfio-ap: Fix no AP queue sharing allowed message written to kernel log
s390/pci: Fix dev.dma_range_map missing sentinel element
s390/mm: Dump fault info in case of low address protection fault
s390/smp: Add support for HOTPLUG_SMT
s390: Fix linker error when -no-pie option is unavailable
s390/processor: Use bitop functions for cpu flag helper functions
s390/asm-offsets: Remove ASM_OFFSETS_C
s390/asm-offsets: Include ftrace_regs.h instead of ftrace.h
s390/kvm: Split kvm_host header file
s390/pci: Fix zpci_bus_is_isolated_vf() for non-VFs
s390/lowcore: Remove unused machine_flags
s390/entry: Fix setting _CIF_MCCK_GUEST with lowcore relocation
Merge tag '6.15-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- reconnect fixes: three for updating rsize/wsize and an SMB1 reconnect
fix
- RFC1001 fixes: fixing connections to nonstandard ports, and negprot
retries
- fix mfsymlinks to old servers
- make mapping of open flags for SMB1 more accurate
- permission fixes: adding retry on open for write, and one for stat to
workaround unexpected access denied
- add two new xattrs, one for retrieving SACL and one for retrieving
owner (without having to retrieve the whole ACL)
- fix mount parm validation for echo_interval
- minor cleanup (including removing now unneeded cifs_truncate_page)
* tag '6.15-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
cifs: Implement is_network_name_deleted for SMB1
cifs: Remove cifs_truncate_page() as it should be superfluous
cifs: Do not add FILE_READ_ATTRIBUTES when using GENERIC_READ/EXECUTE/ALL
cifs: Improve SMB2+ stat() to work also without FILE_READ_ATTRIBUTES
cifs: Add fallback for SMB2 CREATE without FILE_READ_ATTRIBUTES
cifs: Fix querying and creating MF symlinks over SMB1
cifs: Fix access_flags_to_smbopen_mode
cifs: Fix negotiate retry functionality
cifs: Improve handling of NetBIOS packets
cifs: Allow to disable or force initialization of NetBIOS session
cifs: Add a new xattr system.smb3_ntsd_owner for getting or setting owner
cifs: Add a new xattr system.smb3_ntsd_sacl for getting or setting SACLs
smb: client: Update IO sizes after reconnection
smb: client: Store original IO parameters and prevent zero IO sizes
smb:client: smb: client: Add reverse mapping from tcon to superblocks
cifs: remove unreachable code in cifs_get_tcp_session()
cifs: fix integer overflow in match_server()
Merge tag 'ntb-6.15' of https://github.com/jonmason/ntb
Pull ntb fixes from Jon Mason:
"Bug fixes for NTB Switchtec driver mw negative shift, Intel NTB link
status db, ntb_perf double unmap (in error case), and MSI 64bit
arithmetic.
Also, add new AMD NTB PCI IDs, update AMD NTB maintainer, and pull in
patch to reduce the stack usage in IDT driver"
* tag 'ntb-6.15' of https://github.com/jonmason/ntb:
ntb_hw_amd: Add NTB PCI ID for new gen CPU
ntb: reduce stack usage in idt_scan_mws
ntb: use 64-bit arithmetic for the MSI doorbell mask
MAINTAINERS: Update AMD NTB maintainers
ntb_perf: Delete duplicate dmaengine_unmap_put() call in perf_copy_chunk()
ntb: intel: Fix using link status DB's
ntb_hw_switchtec: Fix shift-out-of-bounds in switchtec_ntb_mw_set_trans
Miroslav reported that the changes for handling the inconsistencies in the
coarse time getters result in a regression on the adjtimex() side.
There are two issues:
1) The forwarding of the base time moves the update out of the original
period and establishes a new one.
2) The clearing of the accumulated NTP error is changing the behaviour as
well.
Userspace expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.
Revert the change, so that the established expectations of user space
implementations (ntpd, chronyd) are restored. The re-introduced
inconsistency of the coarse time getters will be addressed in a subsequent
fix.
Fixes: 757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids") Reported-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/Z-qsg6iDGlcIJulJ@localhost
Merge tag 'riscv-for-linus-6.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- The sub-architecture selection Kconfig system has been cleaned up,
the documentation has been improved, and various detections have been
fixed
- The vector-related extensions dependencies are now validated when
parsing from device tree and in the DT bindings
- Misaligned access probing can be overridden via a kernel command-line
parameter, along with various fixes to misalign access handling
- Support for relocatable !MMU kernels builds
- Support for hpge pfnmaps, which should improve TLB utilization
- Support for runtime constants, which improves the d_hash()
performance
- Support for bfloat16, Zicbom, Zaamo, Zalrsc, Zicntr, Zihpm
- Various fixes, including:
- We were missing a secondary mmu notifier call when flushing the
tlb which is required for IOMMU
- Fix ftrace panics by saving the registers as expected by ftrace
- Fix a couple of stimecmp usage related to cpu hotplug
- purgatory_start is now aligned as per the STVEC requirements
- A fix for hugetlb when calculating the size of non-present PTEs
* tag 'riscv-for-linus-6.15-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (65 commits)
riscv: Add norvc after .option arch in runtime const
riscv: Make sure toolchain supports zba before using zba instructions
riscv/purgatory: 4B align purgatory_start
riscv/kexec_file: Handle R_RISCV_64 in purgatory relocator
selftests: riscv: fix v_exec_initval_nolibc.c
riscv: Fix hugetlb retrieval of number of ptes in case of !present pte
riscv: print hartid on bringup
riscv: Add norvc after .option arch in runtime const
riscv: Remove CONFIG_PAGE_OFFSET
riscv: Support CONFIG_RELOCATABLE on riscv32
asm-generic: Always define Elf_Rel and Elf_Rela
riscv: Support CONFIG_RELOCATABLE on NOMMU
riscv: Allow NOMMU kernels to access all of RAM
riscv: Remove duplicate CONFIG_PAGE_OFFSET definition
RISC-V: errata: Use medany for relocatable builds
dt-bindings: riscv: document vector crypto requirements
dt-bindings: riscv: add vector sub-extension dependencies
dt-bindings: riscv: d requires f
RISC-V: add f & d extension validation checks
RISC-V: add vector crypto extension validation checks
...
Merge tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter.
Current release - regressions:
- four fixes for the netdev per-instance locking
Current release - new code bugs:
- consolidate more code between existing Rx zero-copy and uring so
that the latter doesn't miss / have to duplicate the safety checks
Previous releases - regressions:
- ipv6: fix omitted Netlink attributes when using SKIP_STATS
Previous releases - always broken:
- net: fix geneve_opt length integer overflow
- udp: fix multiple wrap arounds of sk->sk_rmem_alloc when it
approaches INT_MAX
- dsa: mvpp2: add a lock to avoid corruption of the shared TCAM
- dsa: airoha: fix issues with traffic QoS configuration / offload,
and flow table offload
Misc:
- touch up the Netlink YAML specs of old families to make them usable
for user space C codegen"
* tag 'net-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits)
selftests: net: amt: indicate progress in the stress test
netlink: specs: rt_route: pull the ifa- prefix out of the names
netlink: specs: rt_addr: pull the ifa- prefix out of the names
netlink: specs: rt_addr: fix get multi command name
netlink: specs: rt_addr: fix the spec format / schema failures
net: avoid false positive warnings in __net_mp_close_rxq()
net: move mp dev config validation to __net_mp_open_rxq()
net: ibmveth: make veth_pool_store stop hanging
arcnet: Add NULL check in com20020pci_probe()
ipv6: Do not consider link down nexthops in path selection
ipv6: Start path selection from the first nexthop
usbnet:fix NPE during rx_complete
net: octeontx2: Handle XDP_ABORTED and XDP invalid as XDP_DROP
net: fix geneve_opt length integer overflow
io_uring/zcrx: fix selftests w/ updated netdev Python helpers
selftests: net: use netdevsim in netns test
docs: net: document netdev notifier expectations
net: dummy: request ops lock
netdevsim: add dummy device notifiers
net: rename rtnl_net_debug to lock_debug
...
Merge tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in during the merge window,
everything is driver specific with nothing standing out particularly"
* tag 'spi-fix-v6.15-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: bcm2835: Restore native CS probing when pinctrl-bcm2835 is absent
spi: bcm2835: Do not call gpiod_put() on invalid descriptor
spi: cadence-qspi: revert "Improve spi memory performance"
spi: cadence: Fix out-of-bounds array access in cdns_mrvl_xspi_setup_clock()
spi: fsl-qspi: use devm function instead of driver remove
spi: SPI_QPIC_SNAND should be tristate and depend on MTD
spi-rockchip: Fix register out of bounds access
Merge tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull more SoC driver updates from Arnd Bergmann:
"This is the promised follow-up to the soc drivers branch, adding minor
updates to omap and freescale drivers.
Most notably, Ioana Ciornei takes over maintenance of the DPAA bus
driver used in some NXP (originally Freescale) chips"
* tag 'soc-drivers-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
bus: fsl-mc: Remove deadcode
MAINTAINERS: add the linuppc-dev list to the fsl-mc bus entry
MAINTAINERS: fix nonexistent dtbinding file name
MAINTAINERS: add myself as maintainer for the fsl-mc bus
irqdomain: soc: Switch to irq_find_mapping()
Input: tsc2007 - accept standard properties
Merge tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- thinkpad_acpi:
- Fix NULL pointer dereferences while probing
- Disable ACPI fan access for T495* and E560
- ISST: Correct command storage data length
* tag 'platform-drivers-x86-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
MAINTAINERS: consistently use my dedicated email address
platform/x86: ISST: Correct command storage data length
platform/x86: thinkpad_acpi: disable ACPI fan access for T495* and E560
platform/x86: thinkpad_acpi: Fix NULL pointer dereferences while probing
Thomas Gleixner [Fri, 4 Apr 2025 14:51:19 +0000 (16:51 +0200)]
genirq/migration: Use irqd_get_parent_data() in irq_force_complete_move()
Frank reported, that the common irq_force_complete_move() breaks the out of
tree build of ia64. The reason is that ia64 uses the migration code, but
does not have hierarchical interrupt domains enabled.
This went unnoticed in mainline as both x86 and RISC-V have hierarchical
domains enabled. Not that it matters for mainline, but it's still
inconsistent.
Use irqd_get_parent_data() instead of accessing the parent_data field
directly. The helper returns NULL when hierarchical domains are disabled
otherwise it accesses the parent_data field of the domain.
No functional change.
Fixes: 751dc837dabd ("genirq: Introduce common irq_force_complete_move() implementation") Reported-by: Frank Scheiner <frank.scheiner@web.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Frank Scheiner <frank.scheiner@web.de> Link: https://lore.kernel.org/all/87h634ugig.ffs@tglx
Jakub Kicinski [Thu, 3 Apr 2025 14:56:36 +0000 (07:56 -0700)]
selftests: net: amt: indicate progress in the stress test
Our CI expects output from the test at least once every 10 minutes.
The AMT test when running on debug kernel is just on the edge
of that time for the stress test. Improve the output:
- print the name of the test first, before starting it,
- output a dot every 10% of the way.
Output after:
TEST: amt discovery [ OK ]
TEST: IPv4 amt multicast forwarding [ OK ]
TEST: IPv6 amt multicast forwarding [ OK ]
TEST: IPv4 amt traffic forwarding torture .......... [ OK ]
TEST: IPv6 amt traffic forwarding torture .......... [ OK ]
It is confusing to see 'host' and 'domain' to be used as 'domain'. Given
this header is all about domains, switch the remaining 'host' uses to
'domain'.
Jakub Kicinski [Thu, 3 Apr 2025 01:37:06 +0000 (18:37 -0700)]
netlink: specs: rt_route: pull the ifa- prefix out of the names
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in route-attrs and metrics and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
Jakub Kicinski [Thu, 3 Apr 2025 01:37:05 +0000 (18:37 -0700)]
netlink: specs: rt_addr: pull the ifa- prefix out of the names
YAML specs don't normally include the C prefix name in the name
of the YAML attr. Remove the ifa- prefix from all attributes
in addr-attrs and specify name-prefix instead.
This is a bit risky, hopefully there aren't many users out there.
====================
net: make memory provider install / close paths more common
We seem to be fixing bugs in config path for devmem which also exist
in the io_uring ZC path. Let's try to make the two paths more common,
otherwise this is bound to keep happening.
Jakub Kicinski [Thu, 3 Apr 2025 01:34:05 +0000 (18:34 -0700)]
net: avoid false positive warnings in __net_mp_close_rxq()
Commit under Fixes solved the problem of spurious warnings when we
uninstall an MP from a device while its down. The __net_mp_close_rxq()
which is used by io_uring was not fixed. Move the fix over and reuse
__net_mp_close_rxq() in the devmem path.
Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250403013405.2827250-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 3 Apr 2025 01:34:04 +0000 (18:34 -0700)]
net: move mp dev config validation to __net_mp_open_rxq()
devmem code performs a number of safety checks to avoid having
to reimplement all of them in the drivers. Move those to
__net_mp_open_rxq() and reuse that function for binding to make
sure that io_uring ZC also benefits from them.
While at it rename the queue ID variable to rxq_idx in
__net_mp_open_rxq(), we touch most of the relevant lines.
The XArray insertion is reordered after the netdev_rx_queue_restart()
call, otherwise we'd need to duplicate the queue index check
or risk inserting an invalid pointer. The XArray allocation
failures should be extremely rare.
Reviewed-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 6e18ed929d3b ("net: add helpers for setting a memory provider on an rx queue") Link: https://patch.msgid.link/20250403013405.2827250-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dave Marquardt [Wed, 2 Apr 2025 15:44:03 +0000 (10:44 -0500)]
net: ibmveth: make veth_pool_store stop hanging
v2:
- Created a single error handling unlock and exit in veth_pool_store
- Greatly expanded commit message with previous explanatory-only text
Summary: Use rtnl_mutex to synchronize veth_pool_store with itself,
ibmveth_close and ibmveth_open, preventing multiple calls in a row to
napi_disable.
Background: Two (or more) threads could call veth_pool_store through
writing to /sys/devices/vio/30000002/pool*/*. You can do this easily
with a little shell script. This causes a hang.
I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new
kernel. I ran this test again and saw:
The control APIs are not idempotent. Control API calls are safe
against concurrent use of datapath APIs but an incorrect sequence of
control API calls may result in crashes, deadlocks, or race
conditions. For example, calling napi_disable() multiple times in a
row will deadlock.
In the normal open and close paths, rtnl_mutex is acquired to prevent
other callers. This is missing from veth_pool_store. Use rtnl_mutex in
veth_pool_store fixes these hangs.
Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com> Fixes: 860f242eb534 ("[PATCH] ibmveth change buffer pools dynamically") Reviewed-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Henry Martin [Wed, 2 Apr 2025 13:50:36 +0000 (21:50 +0800)]
arcnet: Add NULL check in com20020pci_probe()
devm_kasprintf() returns NULL when memory allocation fails. Currently,
com20020pci_probe() does not check for this case, which results in a
NULL pointer dereference.
Add NULL check after devm_kasprintf() to prevent this issue and ensure
no resources are left allocated.
ipv6: Do not consider link down nexthops in path selection
Nexthops whose link is down are not supposed to be considered during
path selection when the "ignore_routes_with_linkdown" sysctl is set.
This is done by assigning them a negative region boundary.
However, when comparing the computed hash (unsigned) with the region
boundary (signed), the negative region boundary is treated as unsigned,
resulting in incorrect nexthop selection.
Fix by treating the computed hash as signed. Note that the computed hash
is always in range of [0, 2^31 - 1].
Fixes: 3d709f69a3e7 ("ipv6: Use hash-threshold instead of modulo-N") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250402114224.293392-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>