]> www.infradead.org Git - users/willy/linux.git/log
users/willy/linux.git
4 months agoKVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing
Sean Christopherson [Wed, 11 Jun 2025 22:45:10 +0000 (15:45 -0700)]
KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routing

Delete the IRTE link from the previous vCPU irrespective of the new
routing state, i.e. even if the IRTE won't be configured to post IRQs to a
vCPU.  Whether or not the new route is postable as no bearing on the *old*
route.  Failure to delete the link can result in KVM incorrectly updating
the IRTE, e.g. if the "old" vCPU is scheduled in/out.

Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoiommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields
Sean Christopherson [Wed, 11 Jun 2025 22:45:09 +0000 (15:45 -0700)]
iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fields

Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.

Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE
Sean Christopherson [Wed, 11 Jun 2025 22:45:08 +0000 (15:45 -0700)]
KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTE

Delete the previous per-vCPU IRTE link prior to modifying the IRTE.  If
forcing the IRTE back to remapped mode fails, the IRQ is already broken;
keeping stale metadata won't change that, and the IOMMU should be
sufficiently paranoid to sanitize the IRTE when the IRQ is freed and
reallocated.

This will allow hoisting the vCPU tracking to common x86, which in turn
will allow most of the IRTE update code to be deduplicated.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure
Sean Christopherson [Wed, 11 Jun 2025 22:45:07 +0000 (15:45 -0700)]
KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structure

Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
structure and GSI routing instead of dynamically allocating a separate
data structure.  In addition to eliminating an atomic allocation, this
will allow hoisting much of the IRTE update logic to common x86.

Cc: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: Pass new routing entries and irqfd when updating IRTEs
Sean Christopherson [Wed, 11 Jun 2025 22:45:06 +0000 (15:45 -0700)]
KVM: Pass new routing entries and irqfd when updating IRTEs

When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.

Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.

Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).

Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Fold irq_comm.c into irq.c
Sean Christopherson [Wed, 11 Jun 2025 21:35:57 +0000 (14:35 -0700)]
KVM: x86: Fold irq_comm.c into irq.c

Drop irq_comm.c, a.k.a. common IRQ APIs, as there has been no non-x86 user
since commit 003f7de62589 ("KVM: ia64: remove") (at the time, irq_comm.c
lived in virt/kvm, not arch/x86/kvm).

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Move IRQ mask notifier infrastructure to I/O APIC emulation
Sean Christopherson [Wed, 11 Jun 2025 21:35:56 +0000 (14:35 -0700)]
KVM: x86: Move IRQ mask notifier infrastructure to I/O APIC emulation

Move the IRQ mask logic to ioapic.c as KVM's only user is its in-kernel
I/O APIC emulation.  In addition to encapsulating more I/O APIC specific
code, trimming down irq_comm.c helps pave the way for removing it entirely.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: selftests: Fall back to split IRQ chip if full in-kernel chip is unsupported
Sean Christopherson [Wed, 11 Jun 2025 21:35:55 +0000 (14:35 -0700)]
KVM: selftests: Fall back to split IRQ chip if full in-kernel chip is unsupported

Now that KVM x86 allows compiling out support for in-kernel I/O APIC (and
PIC and PIT) emulation, i.e. allows disabling KVM_CREATE_IRQCHIP for all
intents and purposes, fall back to a split IRQ chip for x86 if creating
the full in-kernel version fails with ENOTTY.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: Squash two CONFIG_HAVE_KVM_IRQCHIP #ifdefs into one
Sean Christopherson [Wed, 11 Jun 2025 21:35:54 +0000 (14:35 -0700)]
KVM: Squash two CONFIG_HAVE_KVM_IRQCHIP #ifdefs into one

Squash two #idef CONFIG_HAVE_KVM_IRQCHIP regions in KVM's trace events, as
the only code outside of the #idefs depends on CONFIG_KVM_IOAPIC, and that
Kconfig only exists for x86, which unconditionally selects HAVE_KVM_IRQCHIP.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APIC
Sean Christopherson [Wed, 11 Jun 2025 21:35:53 +0000 (14:35 -0700)]
KVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APIC

Add a Kconfig to allow building KVM without support for emulating a I/O
APIC, PIC, and PIT, which is desirable for deployments that effectively
don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to
create an in-kernel I/O APIC.  E.g. compiling out support eliminates a few
thousand lines of guest-facing code and gives security folks warm fuzzies.

As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes
it much easier for readers to understand which bits and pieces exist
specifically for fully in-kernel IRQ chips.

Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to
CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub
for kvm_arch_post_irq_routing_update().

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: Move x86-only tracepoints to x86's trace.h
Sean Christopherson [Wed, 11 Jun 2025 21:35:52 +0000 (14:35 -0700)]
KVM: Move x86-only tracepoints to x86's trace.h

Move the I/O APIC tracepoints and trace_kvm_msi_set_irq() to x86, as
__KVM_HAVE_IOAPIC is just code for "x86", and trace_kvm_msi_set_irq()
isn't unique to I/O APIC emulation.

Opportunistically clean up the absurdly messy #includes in ioapic.c.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Explicitly check for in-kernel PIC when getting ExtINT
Sean Christopherson [Wed, 11 Jun 2025 21:35:51 +0000 (14:35 -0700)]
KVM: x86: Explicitly check for in-kernel PIC when getting ExtINT

Explicitly check for an in-kernel PIC when checking for a pending ExtINT
in the PIC.  Effectively swapping the split vs. full irqchip logic will
allow guarding the in-kernel I/O APIC (and PIC) emulation with a Kconfig,
and also makes it more obvious that kvm_pic_read_irq() won't result in a
NULL pointer dereference.

Opportunistically add WARNs in the fallthrough path, mostly to document
that the userspace ExtINT logic is only relevant to split IRQ chips.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Don't clear PIT's IRQ line status when destroying PIT
Sean Christopherson [Wed, 11 Jun 2025 21:35:50 +0000 (14:35 -0700)]
KVM: x86: Don't clear PIT's IRQ line status when destroying PIT

Don't bother clearing the PIT's IRQ line status when destroying the PIT,
as userspace can't possibly rely on KVM to lower the IRQ line in any sane
use case, and it's not at all obvious that clearing the PIT's IRQ line is
correct/desirable in kvm_create_pit()'s error path.

When called from kvm_arch_pre_destroy_vm(), the entire VM is being torn
down and thus {kvm_pic,kvm_ioapic}.irq_states are unreachable.

As for the error path in kvm_create_pit(), the only way the PIT's bit in
irq_states can be set is if userspace raises the associated IRQ before
KVM_CREATE_PIT{2} completes.  Forcefully clearing the bit would clobber
userspace's input, nonsensical though that input may be.  Not to mention
that no known VMM will continue on if PIT creation fails.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Hardcode the PIT IRQ source ID to '2'
Sean Christopherson [Wed, 11 Jun 2025 21:35:49 +0000 (14:35 -0700)]
KVM: x86: Hardcode the PIT IRQ source ID to '2'

Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2
is always the first available bit in irq_sources_bitmap.  Bits 0 and 1 are
set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can
be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's
impossible for the PIT to find a different free bit (there are no other
users of kvm_request_irq_source_id().

Delete the now-defunct irq_sources_bitmap and all its associated code.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Move kvm_{request,free}_irq_source_id() to i8254.c (PIT)
Sean Christopherson [Wed, 11 Jun 2025 21:35:48 +0000 (14:35 -0700)]
KVM: x86: Move kvm_{request,free}_irq_source_id() to i8254.c (PIT)

Move kvm_{request,free}_irq_source_id() to i8254.c, i.e. the dedicated PIT
emulation file, in anticipation of removing them entirely in favor of
hardcoding the PIT's "requested" source ID (the source ID can only ever be
'2', and the request can never fail).

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Move kvm_setup_default_irq_routing() into irq.c
Sean Christopherson [Wed, 11 Jun 2025 21:35:47 +0000 (14:35 -0700)]
KVM: x86: Move kvm_setup_default_irq_routing() into irq.c

Move the default IRQ routing table used for in-kernel I/O APIC and PIC
routing to irq.c, and tweak the name to make it explicitly clear what
routing is being initialized.

In addition to making it more obvious that the so called "default" routing
only applies to an in-kernel I/O APIC, getting it out of irq_comm.c will
allow removing irq_comm.c entirely.  And placing the function alongside
other I/O APIC and PIC code will allow for guarding KVM's in-kernel I/O
APIC and PIC emulation with a Kconfig with minimal #ifdefs.

No functional change intended.

Cc: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Rename irqchip_kernel() to irqchip_full()
Sean Christopherson [Wed, 11 Jun 2025 21:35:46 +0000 (14:35 -0700)]
KVM: x86: Rename irqchip_kernel() to irqchip_full()

Rename irqchip_kernel() to irqchip_full(), as "kernel" is very ambiguous
due to the existence of split IRQ chip support, where only some of the
"irqchip" is in emulated by the kernel/KVM.  E.g. irqchip_kernel() often
gets confused with irqchip_in_kernel().

Opportunistically hoist the definition up in irq.h so that it's
co-located with other "full" irqchip code in anticipation of wrapping it
all with a Kconfig/#ifdef.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Move KVM_{GET,SET}_IRQCHIP ioctl helpers to irq.c
Sean Christopherson [Wed, 11 Jun 2025 21:35:45 +0000 (14:35 -0700)]
KVM: x86: Move KVM_{GET,SET}_IRQCHIP ioctl helpers to irq.c

Move the ioctl helpers for getting/setting fully in-kernel IRQ chip state
to irq.c, partly to trim down x86.c, but mostly in preparation for adding
a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT
emulation.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Move PIT ioctl helpers to i8254.c
Sean Christopherson [Wed, 11 Jun 2025 21:35:44 +0000 (14:35 -0700)]
KVM: x86: Move PIT ioctl helpers to i8254.c

Move the PIT ioctl helpers to i8254.c, i.e. to the file that implements
PIT emulation.  Eliminating PIT code in x86.c will allow adding a Kconfig
to control support for in-kernel I/O APIC, PIC, and PIT emulation with
minimal #ifdefs.

Opportunistically make kvm_pit_set_reinject() and kvm_pit_load_count()
local to i8254.c as they were only publicly visible to make them available
to the ioctl helpers.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Drop superfluous kvm_hv_set_sint() => kvm_hv_synic_set_irq() wrapper
Sean Christopherson [Wed, 11 Jun 2025 21:35:43 +0000 (14:35 -0700)]
KVM: x86: Drop superfluous kvm_hv_set_sint() => kvm_hv_synic_set_irq() wrapper

Drop the superfluous kvm_hv_set_sint() and instead wire up ->set() directly
to its final destination, kvm_hv_synic_set_irq().  Keep hv_synic_set_irq()
instead of kvm_hv_set_sint() to provide some amount of consistency in the
->set() helpers, e.g. to match kvm_pic_set_irq() and kvm_ioapic_set_irq().

kvm_set_msi() is arguably the oddball, e.g. kvm_set_msi_irq() should be
something like kvm_msi_to_lapic_irq() so that kvm_set_msi() can instead be
kvm_set_msi_irq(), but that's a future problem to solve.

No functional change intended.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Drop superfluous kvm_set_ioapic_irq() => kvm_ioapic_set_irq() wrapper
Sean Christopherson [Wed, 11 Jun 2025 21:35:42 +0000 (14:35 -0700)]
KVM: x86: Drop superfluous kvm_set_ioapic_irq() => kvm_ioapic_set_irq() wrapper

Drop the superfluous and confusing kvm_set_ioapic_irq() and instead wire
up ->set() directly to its final destination.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Drop superfluous kvm_set_pic_irq() => kvm_pic_set_irq() wrapper
Sean Christopherson [Wed, 11 Jun 2025 21:35:41 +0000 (14:35 -0700)]
KVM: x86: Drop superfluous kvm_set_pic_irq() => kvm_pic_set_irq() wrapper

Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq()
wrapper, and instead wire up ->set() directly to its final destination.

Opportunistically move the declaration kvm_pic_set_irq() to irq.h to
start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()
Sean Christopherson [Wed, 11 Jun 2025 21:35:40 +0000 (14:35 -0700)]
KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()

Trigger the I/O APIC route rescan that's performed for a split IRQ chip
after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e.
before dropping kvm->irq_lock.  Calling kvm_make_all_cpus_request() under
a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair
in __kvm_make_request()+kvm_check_request() ensures the new routing is
visible to vCPUs prior to the request being visible to vCPUs.

In all likelihood, commit b053b2aef25d ("KVM: x86: Add EOI exit bitmap
inference") somewhat arbitrarily made the request outside of irq_lock to
avoid holding irq_lock any longer than is strictly necessary.  And then
commit abdb080f7ac8 ("kvm/irqchip: kvm_arch_irq_routing_update renaming
split") took the easy route of adding another arch hook instead of risking
a functional change.

Note, the call to synchronize_srcu_expedited() does NOT provide ordering
guarantees with respect to vCPUs scanning the new routing; as above, the
request infrastructure provides the necessary ordering.  I.e. there's no
need to wait for kvm_scan_ioapic_routes() to complete if it's actively
running, because regardless of whether it grabs the old or new table, the
vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again
and see the new mappings.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Require producers to pass in Linux IRQ number during registration
Sean Christopherson [Fri, 16 May 2025 23:07:34 +0000 (16:07 -0700)]
irqbypass: Require producers to pass in Linux IRQ number during registration

Pass in the Linux IRQ associated with an IRQ bypass producer instead of
relying on the caller to set the field prior to registration, as there's
no benefit to relying on callers to do the right thing.

Take care to set producer->irq before __connect(), as KVM expects the IRQ
to be valid as soon as a connection is possible.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Use xarray to track producers and consumers
Sean Christopherson [Fri, 16 May 2025 23:07:33 +0000 (16:07 -0700)]
irqbypass: Use xarray to track producers and consumers

Track IRQ bypass producers and consumers using an xarray to avoid the O(2n)
insertion time associated with walking a list to check for duplicate
entries, and to search for an partner.

At low (tens or few hundreds) total producer/consumer counts, using a list
is faster due to the need to allocate backing storage for xarray.  But as
count creeps into the thousands, xarray wins easily, and can provide
several orders of magnitude better latency at high counts.  E.g. hundreds
of nanoseconds vs. hundreds of milliseconds.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: David Matlack <dmatlack@google.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379
Link: https://lore.kernel.org/all/20230801115646.33990-1-likexu@tencent.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Use guard(mutex) in lieu of manual lock+unlock
Sean Christopherson [Fri, 16 May 2025 23:07:32 +0000 (16:07 -0700)]
irqbypass: Use guard(mutex) in lieu of manual lock+unlock

Use guard(mutex) to clean up irqbypass's error handling.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Use paired consumer/producer to disconnect during unregister
Sean Christopherson [Fri, 16 May 2025 23:07:31 +0000 (16:07 -0700)]
irqbypass: Use paired consumer/producer to disconnect during unregister

Use the paired consumer/producer information to disconnect IRQ bypass
producers/consumers in O(1) time (ignoring the cost of __disconnect()).

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Explicitly track producer and consumer bindings
Sean Christopherson [Fri, 16 May 2025 23:07:30 +0000 (16:07 -0700)]
irqbypass: Explicitly track producer and consumer bindings

Explicitly track IRQ bypass producer:consumer bindings.  This will allow
making removal an O(1) operation; searching through the list to find
information that is trivially tracked (and useful for debug) is wasteful.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Take ownership of producer/consumer token tracking
Sean Christopherson [Fri, 16 May 2025 23:07:29 +0000 (16:07 -0700)]
irqbypass: Take ownership of producer/consumer token tracking

Move ownership of IRQ bypass token tracking into irqbypass.ko, and
explicitly require callers to pass an eventfd_ctx structure instead of a
completely opaque token.  Relying on producers and consumers to set the
token appropriately is error prone, and hiding the fact that the token must
be an eventfd_ctx pointer (for all intents and purposes) unnecessarily
obfuscates the code and makes it more brittle.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Drop superfluous might_sleep() annotations
Sean Christopherson [Fri, 16 May 2025 23:07:28 +0000 (16:07 -0700)]
irqbypass: Drop superfluous might_sleep() annotations

Drop superfluous might_sleep() annotations from irqbypass, mutex_lock()
provides all of the necessary tracking.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoirqbypass: Drop pointless and misleading THIS_MODULE get/put
Sean Christopherson [Fri, 16 May 2025 23:07:27 +0000 (16:07 -0700)]
irqbypass: Drop pointless and misleading THIS_MODULE get/put

Drop irqbypass.ko's superfluous and misleading get/put calls on
THIS_MODULE.  A module taking a reference to itself is useless; no amount
of checks will prevent doom and destruction if the caller hasn't already
guaranteed the liveliness of the module (this goes for any module).  E.g.
if try_module_get() fails because irqbypass.ko is being unloaded, then the
kernel has already hit a use-after-free by virtue of executing code whose
lifecycle is tied to irqbypass.ko.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: arm64: WARN if unmapping a vLPI fails in any path
Sean Christopherson [Thu, 12 Jun 2025 23:51:47 +0000 (16:51 -0700)]
KVM: arm64: WARN if unmapping a vLPI fails in any path

When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if
failure occurs when freeing an ITE.  If undoing vCPU affinity fails, then
odds are very good that vLPI state tracking has has gotten out of whack,
i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI.  At best,
inconsistent state means there is a lurking bug/flaw somewhere.  At worst,
the inconsistency could eventually be fatal to the host, e.g. if an ITS
command fails because KVM's view of things doesn't match reality/hardware.

Note, only the call from kvm_arch_irq_bypass_del_producer() by way of
kvm_vgic_v4_unset_forwarding() doesn't already WARN.  Common KVM's
kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails.
For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(),
the only possible causes are that the GIC doesn't have a v4 ITS (from
its_irq_set_vcpu_affinity()):

        /* Need a v4 ITS */
        if (!is_v4(its_dev->its))
                return -EINVAL;

        guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);

        /* Unmap request? */
        if (!info)
                return its_vlpi_unmap(d);

or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):

        if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
                return -EINVAL;

All of the above failure scenarios are warnable offences, as they should
never occur absent a kernel/KVM bug.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/all/aFWY2LTVIxz5rfhh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
4 months agoKVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
Paolo Bonzini [Fri, 20 Jun 2025 18:20:20 +0000 (14:20 -0400)]
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities

Allow userspace to advertise TDG.VP.VMCALL subfunctions that the
kernel also supports.  For each output register of GetTdVmCallInfo's
leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported
TDVMCALLs (userspace can set those blindly) and one for user-supported
TDVMCALLs (userspace can set those if it knows how to handle them).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
Paolo Bonzini [Fri, 20 Jun 2025 17:28:08 +0000 (13:28 -0400)]
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: TDX: Exit to userspace for GetTdVmCallInfo
Binbin Wu [Tue, 10 Jun 2025 02:14:21 +0000 (10:14 +0800)]
KVM: TDX: Exit to userspace for GetTdVmCallInfo

Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX,
to allow userspace to provide information about the support of
TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API.

GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>,
<ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>,
<Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and
<Instruction.WRMSR>. They must be supported by VMM to support TDX guests.

For GetTdVmCallInfo
- When leaf (r12) to enumerate TDVMCALL functionality is set to 0,
  successful execution indicates all GHCI base TDVMCALLs listed above are
  supported.

  Update the KVM TDX document with the set of the GHCI base APIs.

- When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it
  indicates the TDX guest is querying the supported TDVMCALLs beyond
  the GHCI base TDVMCALLs.
  Exit to userspace to let userspace set the TDVMCALL sub-function bit(s)
  accordingly to the leaf outputs.  KVM could set the TDVMCALL bit(s)
  supported by itself when the TDVMCALLs don't need support from userspace
  after returning from userspace and before entering guest. Currently, no
  such TDVMCALLs implemented, KVM just sets the values returned from
  userspace.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
Binbin Wu [Tue, 10 Jun 2025 02:14:20 +0000 (10:14 +0800)]
KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>

Handle TDVMCALL for GetQuote to generate a TD-Quote.

GetQuote is a doorbell-like interface used by TDX guests to request VMM
to generate a TD-Quote signed by a service hosting TD-Quoting Enclave
operating on the host.  A TDX guest passes a TD Report (TDREPORT_STRUCT) in
a shared-memory area as parameter.  Host VMM can access it and queue the
operation for a service hosting TD-Quoting enclave.  When completed, the
Quote is returned via the same shared-memory area.

KVM only checks the GPA from the TDX guest has the shared-bit set and drops
the shared-bit before exiting to userspace to avoid bleeding the shared-bit
into KVM's exit ABI.  KVM forwards the request to userspace VMM (e.g. QEMU)
and userspace VMM queues the operation asynchronously.  KVM sets the return
code according to the 'ret' field set by userspace to notify the TDX guest
whether the request has been queued successfully or not.  When the request
has been queued successfully, the TDX guest can poll the status field in
the shared-memory area to check whether the Quote generation is completed
or not.  When completed, the generated Quote is returned via the same
buffer.

Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is
required to handle the KVM exit reason as the initial support for TDX,
by reentering KVM to ensure that the TDVMCALL is complete.  While at it,
add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
[Adjust userspace API. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoKVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
Binbin Wu [Tue, 10 Jun 2025 02:14:19 +0000 (10:14 +0800)]
KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs

Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and
return it for unimplemented TDVMCALL subfunctions.

Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not
implemented is vague because TDX guests can't tell the error is due to
the subfunction is not supported or an invalid input of the subfunction.
New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the
ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND.

Before the change, for common guest implementations, when a TDX guest
receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases:
1. Some operand is invalid. It could change the operand to another value
   retry.
2. The subfunction is not supported.

For case 1, an invalid operand usually means the guest implementation bug.
Since the TDX guest can't tell which case is, the best practice for
handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf,
treating the failure as fatal if the TDVMCALL is essential or ignoring
it if the TDVMCALL is optional.

With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to
old TDX guest that do not know about it, but it is expected that the
guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND.
Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND
specifically; for example Linux just checks for success.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
[Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 months agoMerge tag 'kvm-riscv-fixes-6.16-1' of https://github.com/kvm-riscv/linux into HEAD
Paolo Bonzini [Fri, 20 Jun 2025 17:07:24 +0000 (13:07 -0400)]
Merge tag 'kvm-riscv-fixes-6.16-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 6.16, take #1

- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs

4 months agoMerge tag 'kvmarm-fixes-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Bonzini [Fri, 20 Jun 2025 17:07:10 +0000 (13:07 -0400)]
Merge tag 'kvmarm-fixes-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.16, take #3

- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some
  missing synchronisation

- A small fix for the irqbypass hook fixes, tightening the check and
  ensuring that we only deal with MSI for both the old and the new
  route entry

- Rework the way the shadow LRs are addressed in a nesting
  configuration, plugging an embarrassing bug as well as simplifying
  the whole process

- Add yet another fix for the dreaded arch_timer_edge_cases selftest

4 months agoKVM: arm64: VHE: Centralize ISBs when returning to host
Mark Rutland [Tue, 17 Jun 2025 13:37:18 +0000 (14:37 +0100)]
KVM: arm64: VHE: Centralize ISBs when returning to host

The VHE hyp code has recently gained a few ISBs. Simplify this to one
unconditional ISB in __kvm_vcpu_run_vhe(), and remove the unnecessary
ISB from the kvm_call_hyp_ret() macro.

While kvm_call_hyp_ret() is also used to invoke
__vgic_v3_get_gic_config(), but no ISB is necessary in that case either.

For the moment, an ISB is left in kvm_call_hyp(), as there are many more
users, and removing the ISB would require a more thorough audit.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250617133718.4014181-8-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: Remove cpacr_clear_set()
Mark Rutland [Tue, 17 Jun 2025 13:37:17 +0000 (14:37 +0100)]
KVM: arm64: Remove cpacr_clear_set()

We no longer use cpacr_clear_set().

Remove cpacr_clear_set() and its helper functions.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250617133718.4014181-7-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()
Mark Rutland [Tue, 17 Jun 2025 13:37:16 +0000 (14:37 +0100)]
KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()

The hyp code FPSIMD/SVE/SME trap handling logic has some rather messy
open-coded manipulation of CPTR/CPACR. This is benign for non-nested
guests, but broken for nested guests, as the guest hypervisor's CPTR
configuration is not taken into account.

Consider the case where L0 provides FPSIMD+SVE to an L1 guest
hypervisor, and the L1 guest hypervisor only provides FPSIMD to an L2
guest (with L1 configuring CPTR/CPACR to trap SVE usage from L2). If the
L2 guest triggers an FPSIMD trap to the L0 hypervisor,
kvm_hyp_handle_fpsimd() will see that the vCPU supports FPSIMD+SVE, and
will configure CPTR/CPACR to NOT trap FPSIMD+SVE before returning to the
L2 guest. Consequently the L2 guest would be able to manipulate SVE
state even though the L1 hypervisor had configured CPTR/CPACR to forbid
this.

Clean this up, and fix the nested virt issue by always using
__deactivate_cptr_traps() and __activate_cptr_traps() to manage the CPTR
traps. This removes the need for the ad-hoc fixup in
kvm_hyp_save_fpsimd_host(), and ensures that any guest hypervisor
configuration of CPTR/CPACR is taken into account.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250617133718.4014181-6-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()
Mark Rutland [Tue, 17 Jun 2025 13:37:15 +0000 (14:37 +0100)]
KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()

There's no need for fpsimd_sve_sync() to write to CPTR/CPACR. All
relevant traps are always disabled earlier within __kvm_vcpu_run(), when
__deactivate_cptr_traps() configures CPTR/CPACR.

With irrelevant details elided, the flow is:

handle___kvm_vcpu_run(...)
{
flush_hyp_vcpu(...) {
fpsimd_sve_flush(...);
}

__kvm_vcpu_run(...) {
__activate_traps(...) {
__activate_cptr_traps(...);
}

do {
__guest_enter(...);
} while (...);

__deactivate_traps(....) {
__deactivate_cptr_traps(...);
}
}

sync_hyp_vcpu(...) {
fpsimd_sve_sync(...);
}
}

Remove the unnecessary write to CPTR/CPACR. An ISB is still necessary,
so a comment is added to describe this requirement.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250617133718.4014181-5-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: Reorganise CPTR trap manipulation
Mark Rutland [Tue, 17 Jun 2025 13:37:14 +0000 (14:37 +0100)]
KVM: arm64: Reorganise CPTR trap manipulation

The NVHE/HVHE and VHE modes have separate implementations of
__activate_cptr_traps() and __deactivate_cptr_traps() in their
respective switch.c files. There's some duplication of logic, and it's
not currently possible to reuse this logic elsewhere.

Move the logic into the common switch.h header so that it can be reused,
and de-duplicate the common logic.

This rework changes the way SVE traps are deactivated in VHE mode,
aligning it with NVHE/HVHE modes:

* Before this patch, VHE's __deactivate_cptr_traps() would
  unconditionally enable SVE for host EL2 (but not EL0), regardless of
  whether the ARM64_SVE cpucap was set.

* After this patch, VHE's __deactivate_cptr_traps() will take the
  ARM64_SVE cpucap into account. When ARM64_SVE is not set, SVE will be
  trapped from EL2 and below.

The old and new behaviour are both benign:

* When ARM64_SVE is not set, the host will not touch SVE state, and will
  not reconfigure SVE traps. Host EL0 access to SVE will be trapped as
  expected.

* When ARM64_SVE is set, the host will configure EL0 SVE traps before
  returning to EL0 as part of reloading the EL0 FPSIMD/SVE/SME state.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250617133718.4014181-4-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: VHE: Synchronize CPTR trap deactivation
Mark Rutland [Tue, 17 Jun 2025 13:37:13 +0000 (14:37 +0100)]
KVM: arm64: VHE: Synchronize CPTR trap deactivation

Currently there is no ISB between __deactivate_cptr_traps() disabling
traps that affect EL2 and fpsimd_lazy_switch_to_host() manipulating
registers potentially affected by CPTR traps.

When NV is not in use, this is safe because the relevant registers are
only accessed when guest_owns_fp_regs() && vcpu_has_sve(vcpu), and this
also implies that SVE traps affecting EL2 have been deactivated prior to
__guest_entry().

When NV is in use, a guest hypervisor may have configured SVE traps for
a nested context, and so it is necessary to have an ISB between
__deactivate_cptr_traps() and fpsimd_lazy_switch_to_host().

Due to the current lack of an ISB, when a guest hypervisor enables SVE
traps in CPTR, the host can take an unexpected SVE trap from within
fpsimd_lazy_switch_to_host(), e.g.

| Unhandled 64-bit el1h sync exception on CPU1, ESR 0x0000000066000000 -- SVE
| CPU: 1 UID: 0 PID: 164 Comm: kvm-vcpu-0 Not tainted 6.15.0-rc4-00138-ga05e0f012c05 #3 PREEMPT
| Hardware name: FVP Base RevC (DT)
| pstate: 604023c9 (nZCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : __kvm_vcpu_run+0x6f4/0x844
| lr : __kvm_vcpu_run+0x150/0x844
| sp : ffff800083903a60
| x29: ffff800083903a90 x28: ffff000801f4a300 x27: 0000000000000000
| x26: 0000000000000000 x25: ffff000801f90000 x24: ffff000801f900f0
| x23: ffff800081ff7720 x22: 0002433c807d623f x21: ffff000801f90000
| x20: ffff00087f730730 x19: 0000000000000000 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff000801f90d70
| x5 : 0000000000001000 x4 : ffff8007fd739000 x3 : ffff000801f90000
| x2 : 0000000000000000 x1 : 00000000000003cc x0 : ffff800082f9d000
| Kernel panic - not syncing: Unhandled exception
| CPU: 1 UID: 0 PID: 164 Comm: kvm-vcpu-0 Not tainted 6.15.0-rc4-00138-ga05e0f012c05 #3 PREEMPT
| Hardware name: FVP Base RevC (DT)
| Call trace:
|  show_stack+0x18/0x24 (C)
|  dump_stack_lvl+0x60/0x80
|  dump_stack+0x18/0x24
|  panic+0x168/0x360
|  __panic_unhandled+0x68/0x74
|  el1h_64_irq_handler+0x0/0x24
|  el1h_64_sync+0x6c/0x70
|  __kvm_vcpu_run+0x6f4/0x844 (P)
|  kvm_arm_vcpu_enter_exit+0x64/0xa0
|  kvm_arch_vcpu_ioctl_run+0x21c/0x870
|  kvm_vcpu_ioctl+0x1a8/0x9d0
|  __arm64_sys_ioctl+0xb4/0xf4
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| SMP: stopping secondary CPUs
| Kernel Offset: disabled
| CPU features: 0x0000,000002c0,02df4fb9,97ee773f
| Memory Limit: none
| ---[ end Kernel panic - not syncing: Unhandled exception ]---

Fix this by adding an ISB between __deactivate_traps() and
fpsimd_lazy_switch_to_host().

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250617133718.4014181-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: VHE: Synchronize restore of host debug registers
Mark Rutland [Tue, 17 Jun 2025 13:37:12 +0000 (14:37 +0100)]
KVM: arm64: VHE: Synchronize restore of host debug registers

When KVM runs in non-protected VHE mode, there's no context
synchronization event between __debug_switch_to_host() restoring the
host debug registers and __kvm_vcpu_run() unmasking debug exceptions.
Due to this, it's theoretically possible for the host to take an
unexpected debug exception due to the stale guest configuration.

This cannot happen in NVHE/HVHE mode as debug exceptions are masked in
the hyp code, and the exception return to the host will provide the
necessary context synchronization before debug exceptions can be taken.

For now, avoid the problem by adding an ISB after VHE hyp code restores
the host debug registers.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250617133718.4014181-2-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases
Zenghui Yu [Sun, 8 Jun 2025 09:54:02 +0000 (17:54 +0800)]
KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases

Close the GIC FD to free the reference it holds to the VM so that we can
correctly clean up the VM. This also gets rid of the

"KVM: debugfs: duplicate directory 395722-4"

warning when running arch_timer_edge_cases.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20250608095402.1131-1-yuzenghui@huawei.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: Explicitly treat routing entry type changes as changes
Sean Christopherson [Wed, 11 Jun 2025 22:45:04 +0000 (15:45 -0700)]
KVM: arm64: Explicitly treat routing entry type changes as changes

Explicitly treat type differences as GSI routing changes, as comparing MSI
data between two entries could get a false negative, e.g. if userspace
changed the type but left the type-specific data as-

Note, the same bug was fixed in x86 by commit bcda70c56f3e ("KVM: x86:
Explicitly treat routing entry type changes as changes").

Fixes: 4bf3693d36af ("KVM: arm64: Unmap vLPIs affected by changes to GSI routing information")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250611224604.313496-3-seanjc@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
4 months agoKVM: arm64: nv: Fix tracking of shadow list registers
Marc Zyngier [Sun, 15 Jun 2025 15:11:38 +0000 (16:11 +0100)]
KVM: arm64: nv: Fix tracking of shadow list registers

Wei-Lin reports that the tracking of shadow list registers is
majorly broken when resync'ing the L2 state after a run, as
we confuse the guest's LR index with the host's, potentially
losing the interrupt state.

While this could be fixed by adding yet another side index to
track it (Wei-Lin's fix), it may be better to refactor this
code to avoid having a side index altogether, limiting the
risk to introduce this class of bugs.

A key observation is that the shadow index is always the number
of bits in the lr_map bitmap. With that, the parallel indexing
scheme can be completely dropped.

While doing this, introduce a couple of helpers that abstract
the index conversion and some of the LR repainting, making the
whole exercise much simpler.

Reported-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw>
Reviewed-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250614145721.2504524-1-r09922117@csie.ntu.edu.tw
Link: https://lore.kernel.org/r/86qzzkc5xa.wl-maz@kernel.org
4 months agoRISC-V: KVM: Don't treat SBI HFENCE calls as NOPs
Anup Patel [Thu, 5 Jun 2025 06:14:47 +0000 (11:44 +0530)]
RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs

The SBI specification clearly states that SBI HFENCE calls should
return SBI_ERR_NOT_SUPPORTED when one of the target hart doesn’t
support hypervisor extension (aka nested virtualization in-case
of KVM RISC-V).

Fixes: c7fa3c48de86 ("RISC-V: KVM: Treat SBI HFENCE calls as NOPs")
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20250605061458.196003-3-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
4 months agoRISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
Anup Patel [Thu, 5 Jun 2025 06:14:46 +0000 (11:44 +0530)]
RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls

As-per the SBI specification, an SBI remote fence operation applies
to the entire address space if either:
1) start_addr and size are both 0
2) size is equal to 2^XLEN-1

>From the above, only #1 is checked by SBI SFENCE calls so fix the
size parameter check in SBI SFENCE calls to cover #2 as well.

Fixes: 13acfec2dbcc ("RISC-V: KVM: Add remote HFENCE functions based on VCPU requests")
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Link: https://lore.kernel.org/r/20250605061458.196003-2-apatel@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
4 months agoLinux 6.16-rc2
Linus Torvalds [Sun, 15 Jun 2025 20:49:41 +0000 (13:49 -0700)]
Linux 6.16-rc2

4 months agoMerge tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masah...
Linus Torvalds [Sun, 15 Jun 2025 16:14:27 +0000 (09:14 -0700)]
Merge tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - Move warnings about linux/export.h from W=1 to W=2

 - Fix structure type overrides in gendwarfksyms

* tag 'kbuild-fixes-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  gendwarfksyms: Fix structure type overrides
  kbuild: move warnings about linux/export.h from W=1 to W=2

4 months agogendwarfksyms: Fix structure type overrides
Sami Tolvanen [Sat, 14 Jun 2025 00:55:33 +0000 (00:55 +0000)]
gendwarfksyms: Fix structure type overrides

As we always iterate through the entire die_map when expanding
type strings, recursively processing referenced types in
type_expand_child() is not actually necessary. Furthermore,
the type_string kABI rule added in commit c9083467f7b9
("gendwarfksyms: Add a kABI rule to override type strings") can
fail to override type strings for structures due to a missing
kabi_get_type_string() check in this function.

Fix the issue by dropping the unnecessary recursion and moving
the override check to type_expand(). Note that symbol versions
are otherwise unchanged with this patch.

Fixes: c9083467f7b9 ("gendwarfksyms: Add a kABI rule to override type strings")
Reported-by: Giuliano Procida <gprocida@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
4 months agokbuild: move warnings about linux/export.h from W=1 to W=2
Masahiro Yamada [Thu, 12 Jun 2025 16:08:48 +0000 (01:08 +0900)]
kbuild: move warnings about linux/export.h from W=1 to W=2

This hides excessive warnings, as nobody builds with W=2.

Fixes: a934a57a42f6 ("scripts/misc-check: check missing #include <linux/export.h> when W=1")
Fixes: 7d95680d64ac ("scripts/misc-check: check unnecessary #include <linux/export.h> when W=1")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
4 months agoMerge tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 14 Jun 2025 17:13:32 +0000 (10:13 -0700)]
Merge tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - SMB3.1.1 POSIX extensions fix for char remapping

 - Fix for repeated directory listings when directory leases enabled

 - deferred close handle reuse fix

* tag 'v6.16-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: improve directory cache reuse for readdir operations
  smb: client: fix perf regression with deferred closes
  smb: client: disable path remapping with POSIX extensions

4 months agoMerge tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 14 Jun 2025 17:01:47 +0000 (10:01 -0700)]
Merge tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux

Pull iommu fix from Joerg Roedel:

 - Fix PTE size calculation for NVidia Tegra

* tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
  iommu/tegra: Fix incorrect size calculation

4 months agoMerge tag 'block-6.16-20250614' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 14 Jun 2025 16:25:22 +0000 (09:25 -0700)]
Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix for a deadlock on queue freeze with zoned writes

 - Fix for zoned append emulation

 - Two bio folio fixes, for sparsemem and for very large folios

 - Fix for a performance regression introduced in 6.13 when plug
   insertion was changed

 - Fix for NVMe passthrough handling for polled IO

 - Document the ublk auto registration feature

 - loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze

4 months agoMerge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 14 Jun 2025 15:44:54 +0000 (08:44 -0700)]
Merge tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for a race between SQPOLL exit and fdinfo reading.

   It's slim and I was only able to reproduce this with an artificial
   delay in the kernel. Followup sparse fix as well to unify the access
   to ->thread.

 - Fix for multiple buffer peeking, avoiding truncation if possible.

 - Run local task_work for IOPOLL reaping when the ring is exiting.

   This currently isn't done due to an assumption that polled IO will
   never need task_work, but a fix on the block side is going to change
   that.

* tag 'io_uring-6.16-20250614' of git://git.kernel.dk/linux:
  io_uring: run local task_work from ring exit IOPOLL reaping
  io_uring/kbuf: don't truncate end buffer for multiple buffer peeks
  io_uring: consistently use rcu semantics with sqpoll thread
  io_uring: fix use-after-free of sq->thread in __io_uring_show_fdinfo()

4 months agoMerge tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda...
Linus Torvalds [Sat, 14 Jun 2025 15:38:34 +0000 (08:38 -0700)]
Merge tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull Rust fix from Miguel Ojeda:

  - 'hrtimer': fix future compile error when the 'impl_has_hr_timer!'
    macro starts to get called

* tag 'rust-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  rust: time: Fix compile error in impl_has_hr_timer macro

4 months agoMerge tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sat, 14 Jun 2025 15:18:09 +0000 (08:18 -0700)]
Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues
  or aren't considered necessary for -stable kernels. Only 4 are for MM"

* tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: add mmap_prepare() compatibility layer for nested file systems
  init: fix build warnings about export.h
  MAINTAINERS: add Barry as a THP reviewer
  drivers/rapidio/rio_cm.c: prevent possible heap overwrite
  mm: close theoretical race where stale TLB entries could linger
  mm/vma: reset VMA iterator on commit_merge() OOM failure
  docs: proc: update VmFlags documentation in smaps
  scatterlist: fix extraneous '@'-sign kernel-doc notation
  selftests/mm: skip failed memfd setups in gup_longterm

4 months agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Fri, 13 Jun 2025 23:49:39 +0000 (16:49 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "All fixes for drivers.

  The core change in the error handler is simply to translate an ALUA
  specific sense code into a retry the ALUA components can handle and
  won't impact any other devices"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: error: alua: I/O errors for ALUA state transitions
  scsi: storvsc: Increase the timeouts to storvsc_timeout
  scsi: s390: zfcp: Ensure synchronous unit_add
  scsi: iscsi: Fix incorrect error path labels for flashnode operations
  scsi: mvsas: Fix typos in per-phy comments and SAS cmd port registers
  scsi: core: ufs: Fix a hang in the error handler

4 months agoMerge tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Fri, 13 Jun 2025 23:27:27 +0000 (16:27 -0700)]
Merge tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Quiet week, only two pull requests came my way, xe has a couple of
  fixes and then a bunch of fixes across the board, vc4 probably fixes
  the biggest problem:

  vc4:
   - Fix infinite EPROBE_DEFER loop in vc4 probing

  amdxdna:
   - Fix amdxdna firmware size

  meson:
   - modesetting fixes

  sitronix:
   - Kconfig fix for st7171-i2c

  dma-buf:
   - Fix -EBUSY WARN_ON_ONCE in dma-buf

  udmabuf:
   - Use dma_sync_sgtable_for_cpu in udmabuf

  xe:
   - Fix regression disallowing 64K SVM migration
   - Use a bounce buffer for WA BB"

* tag 'drm-fixes-2025-06-14' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe/lrc: Use a temporary buffer for WA BB
  udmabuf: use sgtable-based scatterlist wrappers
  dma-buf: fix compare in WARN_ON_ONCE
  drm/sitronix: st7571-i2c: Select VIDEOMODE_HELPERS
  drm/meson: fix more rounding issues with 59.94Hz modes
  drm/meson: use vclk_freq instead of pixel_freq in debug print
  drm/meson: fix debug log statement when setting the HDMI clocks
  drm/vc4: fix infinite EPROBE_DEFER loop
  drm/xe/svm: Fix regression disallowing 64K SVM migration
  accel/amdxdna: Fix incorrect PSP firmware size

4 months agoio_uring: run local task_work from ring exit IOPOLL reaping
Jens Axboe [Fri, 13 Jun 2025 21:24:53 +0000 (15:24 -0600)]
io_uring: run local task_work from ring exit IOPOLL reaping

In preparation for needing to shift NVMe passthrough to always use
task_work for polled IO completions, ensure that those are suitably
run at exit time. See commit:

9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work")

for details on why that is necessary.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 months agonvme: always punt polled uring_cmd end_io work to task_work
Jens Axboe [Fri, 13 Jun 2025 19:37:41 +0000 (13:37 -0600)]
nvme: always punt polled uring_cmd end_io work to task_work

Currently NVMe uring_cmd completions will complete locally, if they are
polled. This is done because those completions are always invoked from
task context. And while that is true, there's no guarantee that it's
invoked under the right ring context, or even task. If someone does
NVMe passthrough via multiple threads and with a limited number of
poll queues, then ringA may find completions from ringB. For that case,
completing the request may not be sound.

Always just punt the passthrough completions via task_work, which will
redirect the completion, if needed.

Cc: stable@vger.kernel.org
Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 months agoMerge tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 13 Jun 2025 20:39:15 +0000 (13:39 -0700)]
Merge tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "These fix an ACPI APEI error injection driver failure that started to
  occur after switching it over to using a faux device, address an EC
  driver issue related to invalid ECDT tables, clean up the usage of
  mwait_idle_with_hints() in the ACPI PAD driver, add a new IRQ override
  quirk, and fix a NULL pointer dereference related to nosmp:

   - Update the faux device handling code in the driver core and address
     an ACPI APEI error injection driver failure that started to occur
     after switching it over to using a faux device on top of that (Dan
     Williams)

   - Update data types of variables passed as arguments to
     mwait_idle_with_hints() in the ACPI PAD (processor aggregator
     device) driver to match the function definition after recent
     changes (Uros Bizjak)

   - Fix a NULL pointer dereference in the ACPI CPPC library that occurs
     when nosmp is passed to the kernel in the command line (Yunhui Cui)

   - Ignore ECDT tables with an invalid ID string to prevent using an
     incorrect GPE for signaling events on some systems (Armin Wolf)

   - Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan)"

* tag 'acpi-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: resource: Use IRQ override on MACHENIKE 16P
  ACPI: EC: Ignore ECDT tables with an invalid ID string
  ACPI: CPPC: Fix NULL pointer dereference when nosmp is used
  ACPI: PAD: Update arguments of mwait_idle_with_hints()
  ACPI: APEI: EINJ: Do not fail einj_init() on faux_device_create() failure
  driver core: faux: Quiet probe failures
  driver core: faux: Suppress bind attributes

4 months agoMerge tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Fri, 13 Jun 2025 20:27:41 +0000 (13:27 -0700)]
Merge tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix the cpupower utility installation, fix up the recently added
  Rust abstractions for cpufreq and OPP, restore the x86 update
  eliminating mwait_play_dead_cpuid_hint() that has been reverted during
  the 6.16 merge window along with preventing the failure caused by it
  from happening, and clean up mwait_idle_with_hints() usage in
  intel_idle:

   - Implement CpuId Rust abstraction and use it to fix doctest failure
     related to the recently introduced cpumask abstraction (Viresh
     Kumar)

   - Do minor cleanups in the `# Safety` sections for cpufreq
     abstractions added recently (Viresh Kumar)

   - Unbreak cpupower systemd service units installation on some systems
     by adding a unitdir variable for specifying the location to install
     them (Francesco Poli)

   - Eliminate mwait_play_dead_cpuid_hint() again after reverting its
     elimination during the 6.16 merge window due to a problem with
     handling "dead" SMT siblings, but this time prevent leaving them in
     C1 after initialization by taking them online and back offline when
     a proper cpuidle driver for the platform has been registered
     (Rafael Wysocki)

   - Update data types of variables passed as arguments to
     mwait_idle_with_hints() to match the function definition after
     recent changes (Uros Bizjak)"

* tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  rust: cpu: Add CpuId::current() to retrieve current CPU ID
  rust: Use CpuId in place of raw CPU numbers
  rust: cpu: Introduce CpuId abstraction
  intel_idle: Update arguments of mwait_idle_with_hints()
  cpufreq: Convert `/// SAFETY` lines to `# Safety` sections
  cpupower: split unitdir from libdir in Makefile
  Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
  ACPI: processor: Rescan "dead" SMT siblings during initialization
  intel_idle: Rescan "dead" SMT siblings during initialization
  x86/smp: PM/hibernate: Split arch_resume_nosmt()
  intel_idle: Use subsys_initcall_sync() for initialization

4 months agoMerge branches 'acpi-pad', 'acpi-cppc', 'acpi-ec' and 'acpi-resource'
Rafael J. Wysocki [Fri, 13 Jun 2025 19:55:30 +0000 (21:55 +0200)]
Merge branches 'acpi-pad', 'acpi-cppc', 'acpi-ec' and 'acpi-resource'

Merge assorted ACPI updates for 6.16-rc2:

 - Update data types of variables passed as arguments to
   mwait_idle_with_hints() in the ACPI PAD (processor aggregator device)
   driver to match the function definition after recent changes (Uros
   Bizjak).

 - Fix a NULL pointer dereference in the ACPI CPPC library that occurs
   when nosmp is passed to the kernel in the command line (Yunhui Cui).

 - Ignore ECDT tables with an invalid ID string to prevent using an
   incorrect GPE for signaling events on some systems (Armin Wolf).

 - Add a new IRQ override quirk for MACHENIKE 16P (Wentao Guan).

* acpi-pad:
  ACPI: PAD: Update arguments of mwait_idle_with_hints()

* acpi-cppc:
  ACPI: CPPC: Fix NULL pointer dereference when nosmp is used

* acpi-ec:
  ACPI: EC: Ignore ECDT tables with an invalid ID string

* acpi-resource:
  ACPI: resource: Use IRQ override on MACHENIKE 16P

4 months agoMerge branch 'pm-cpuidle'
Rafael J. Wysocki [Fri, 13 Jun 2025 19:28:07 +0000 (21:28 +0200)]
Merge branch 'pm-cpuidle'

Merge cpuidle updates for 6.16-rc2:

 - Update data types of variables passed as arguments to
   mwait_idle_with_hints() to match the function definition
   after recent changes (Uros Bizjak).

 - Eliminate mwait_play_dead_cpuid_hint() again after reverting its
   elimination during the merge window due to a problem with handling
   "dead" SMT siblings, but this time prevent leaving them in C1 after
   initialization by taking them online and back offline when a proper
   cpuidle driver for the platform has been registered (Rafael Wysocki).

* pm-cpuidle:
  intel_idle: Update arguments of mwait_idle_with_hints()
  Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()"
  ACPI: processor: Rescan "dead" SMT siblings during initialization
  intel_idle: Rescan "dead" SMT siblings during initialization
  x86/smp: PM/hibernate: Split arch_resume_nosmt()
  intel_idle: Use subsys_initcall_sync() for initialization

4 months agoMerge branch 'pm-tools'
Rafael J. Wysocki [Fri, 13 Jun 2025 19:25:38 +0000 (21:25 +0200)]
Merge branch 'pm-tools'

Merge a cpupower utility fix for 6.16-rc2 that unbreaks systemd service
units installation on some sysems (Francesco Poli).

* pm-tools:
  cpupower: split unitdir from libdir in Makefile

4 months agoMerge tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brooni...
Linus Torvalds [Fri, 13 Jun 2025 18:01:44 +0000 (11:01 -0700)]
Merge tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A collection of driver specific fixes, most minor apart from the OMAP
  ones which disable some recent performance optimisations in some
  non-standard cases where we could start driving the bus incorrectly.

  The change to the stm32-ospi driver to use the newer reset APIs is a
  fix for interactions with other IP sharing the same reset line in some
  SoCs"

* tag 'spi-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine
  spi: stm32-ospi: clean up on error in probe()
  spi: stm32-ospi: Make usage of reset_control_acquire/release() API
  spi: offload: check offload ops existence before disabling the trigger
  spi: spi-pci1xxxx: Fix error code in probe
  spi: loongson: Fix build warnings about export.h
  spi: omap2-mcspi: Disable multi-mode when the previous message kept CS asserted
  spi: omap2-mcspi: Disable multi mode when CS should be kept asserted after message

4 months agoMerge tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 13 Jun 2025 18:00:19 +0000 (11:00 -0700)]
Merge tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "One minor fix for a leak in the DT parsing code in the max20086 driver"

* tag 'regulator-fix-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: max20086: Fix refcount leak in max20086_parse_regulators_dt()

4 months agoposix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del()
Oleg Nesterov [Fri, 13 Jun 2025 17:26:50 +0000 (19:26 +0200)]
posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del()

If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().

If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.

Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.

This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.

Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 months agoMerge tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Fri, 13 Jun 2025 17:51:11 +0000 (10:51 -0700)]
Merge tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:

 - Do not free "head" variable in filter_free_subsystem_filters()

   The first error path jumps to "free_now" label but first frees the
   newly allocated "head" variable. But the "free_now" code checks this
   variable, and if it is not NULL, it will iterate the list. As this
   list variable was already initialized, the "free_now" code will not
   do anything as it is empty. But freeing it will cause a UAF bug.

   The error path should simply jump to the "free_now" label and leave
   the "head" variable alone.

* tag 'trace-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Do not free "head" on error path of filter_free_subsystem_filters()

4 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Fri, 13 Jun 2025 17:05:31 +0000 (10:05 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Rework of system register accessors for system registers that are
     directly writen to memory, so that sanitisation of the in-memory
     value happens at the correct time (after the read, or before the
     write). For convenience, RMW-style accessors are also provided.

   - Multiple fixes for the so-called "arch-timer-edge-cases' selftest,
     which was always broken.

  x86:

   - Make KVM_PRE_FAULT_MEMORY stricter for TDX, allowing userspace to
     pass only the "untouched" addresses and flipping the shared/private
     bit in the implementation.

   - Disable SEV-SNP support on initialization failure

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY
  KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY
  KVM: SEV: Disable SEV-SNP support on initialization failure
  KVM: arm64: selftests: Determine effective counter width in arch_timer_edge_cases
  KVM: arm64: selftests: Fix xVAL init in arch_timer_edge_cases
  KVM: arm64: selftests: Fix thread migration in arch_timer_edge_cases
  KVM: arm64: selftests: Fix help text for arch_timer_edge_cases
  KVM: arm64: Make __vcpu_sys_reg() a pure rvalue operand
  KVM: arm64: Don't use __vcpu_sys_reg() to get the address of a sysreg
  KVM: arm64: Add RMW specific sysreg accessor
  KVM: arm64: Add assignment-specific sysreg accessor

4 months agoio_uring/kbuf: don't truncate end buffer for multiple buffer peeks
Jens Axboe [Fri, 13 Jun 2025 17:01:49 +0000 (11:01 -0600)]
io_uring/kbuf: don't truncate end buffer for multiple buffer peeks

If peeking a bunch of buffers, normally io_ring_buffers_peek() will
truncate the end buffer. This isn't optimal as presumably more data will
be arriving later, and hence it's better to stop with the last full
buffer rather than truncate the end buffer.

Cc: stable@vger.kernel.org
Fixes: 35c8711c8fc4 ("io_uring/kbuf: add helpers for getting/peeking multiple buffers")
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 months agoMerge tag 'v6.16-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 13 Jun 2025 16:59:29 +0000 (09:59 -0700)]
Merge tag 'v6.16-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
 "Fix a broken self-test in hkdf (new regression)"

* tag 'v6.16-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: hkdf - move to late_initcall

4 months agoMerge tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Fri, 13 Jun 2025 16:49:07 +0000 (09:49 -0700)]
Merge tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "As usual, highlighting the ones users have been noticing:

   - Fix a small issue with has_case_insensitive not being propagated on
     snapshot creation; this led to fsck errors, which we're harmless
     because we're not using this flag yet (it's for overlayfs +
     casefolding).

   - Log the error being corrected in the journal when we're doing fsck
     repair: this was one of the "lessons learned" from the i_nlink 0 ->
     subvolume deletion bug, where reconstructing what had happened by
     analyzing the journal was a bit more difficult than it needed to
     be.

   - Don't schedule btree node scan to run in the superblock: this fixes
     a regression from the 6.16 recovery passes rework, and let to it
     running unnecessarily.

     The real issue here is that we don't have online, "self healing"
     style topology repair yet: topology repair currently has to run
     before we go RW, which means that we may schedule it unnecessarily
     after a transient error. This will be fixed in the future.

   - We now track, in btree node flags, the reason it was scheduled to
     be rewritten. We discovered a deadlock in recovery when many btree
     nodes need to be rewritten because they're degraded: fully fixing
     this will take some work but it's now easier to see what's going
     on.

     For the bug report where this came up, a device had been kicked RO
     due to transient errors: manually setting it back to RW was
     sufficient to allow recovery to succeed.

   - Mark a few more fsck errors as autofix: as a reminder to users,
     please do keep reporting cases where something needs to be repaired
     and is not repaired automatically (i.e. cases where -o fix_errors
     or fsck -y is required).

   - rcu_pending.c now works with PREEMPT_RT

   - 'bcachefs device add', then umount, then remount wasn't working -
     we now emit a uevent so that the new device's new superblock is
     correctly picked up

   - Assorted repair fixes: btree node scan will no longer incorrectly
     update sb->version_min,

   - Assorted syzbot fixes"

* tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs: (23 commits)
  bcachefs: Don't trace should_be_locked unless changing
  bcachefs: Ensure that snapshot creation propagates has_case_insensitive
  bcachefs: Print devices we're mounting on multi device filesystems
  bcachefs: Don't trust sb->nr_devices in members_to_text()
  bcachefs: Fix version checks in validate_bset()
  bcachefs: ioctl: avoid stack overflow warning
  bcachefs: Don't pass trans to fsck_err() in gc_accounting_done
  bcachefs: Fix leak in bch2_fs_recovery() error path
  bcachefs: Fix rcu_pending for PREEMPT_RT
  bcachefs: Fix downgrade_table_extra()
  bcachefs: Don't put rhashtable on stack
  bcachefs: Make sure opts.read_only gets propagated back to VFS
  bcachefs: Fix possible console lock involved deadlock
  bcachefs: mark more errors autofix
  bcachefs: Don't persistently run scan_for_btree_nodes
  bcachefs: Read error message now prints if self healing
  bcachefs: Only run 'increase_depth' for keys from btree node csan
  bcachefs: Mark need_discard_freespace_key_bad autofix
  bcachefs: Update /dev/disk/by-uuid on device add
  bcachefs: Add more flags to btree nodes for rewrite reason
  ...

4 months agoDocumentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
Bagas Sanjaya [Fri, 13 Jun 2025 02:38:57 +0000 (09:38 +0700)]
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists

Stephen Rothwell reports htmldocs warning on ublk docs:

Documentation/block/ublk.rst:414: ERROR: Unexpected indentation. [docutils]

Fix the warning by separating sublists of auto buffer registration
fallback behavior from their appropriate parent list item.

Fixes: ff20c516485e ("ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250612132638.193de386@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20250613023857.15971-1-bagasdotme@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 months agoiommu/tegra: Fix incorrect size calculation
Jason Gunthorpe [Tue, 3 Jun 2025 19:14:45 +0000 (16:14 -0300)]
iommu/tegra: Fix incorrect size calculation

This driver uses a mixture of ways to get the size of a PTE,
tegra_smmu_set_pde() did it as sizeof(*pd) which became wrong when pd
switched to a struct tegra_pd.

Switch pd back to a u32* in tegra_smmu_set_pde() so the sizeof(*pd)
returns 4.

Fixes: 50568f87d1e2 ("iommu/terga: Do not use struct page as the handle for as->pd memory")
Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Closes: https://lore.kernel.org/all/62e7f7fe-6200-4e4f-ad42-d58ad272baa6@tecnico.ulisboa.pt/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Tested-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Link: https://lore.kernel.org/r/0-v1-da7b8b3d57eb+ce-iommu_terga_sizeof_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
4 months agoblock: Fix bvec_set_folio() for very large folios
Matthew Wilcox (Oracle) [Thu, 12 Jun 2025 14:42:53 +0000 (15:42 +0100)]
block: Fix bvec_set_folio() for very large folios

Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if
we attempt to add a folio that is larger than 4GB, we'll silently
truncate the offset and len.  Widen the parameters to size_t, assert
that the length is less than 4GB and set the first page that contains
the interesting data rather than the first page of the folio.

Fixes: 26db5ee15851 (block: add a bvec_set_folio helper)
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 months agobio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
Matthew Wilcox (Oracle) [Thu, 12 Jun 2025 14:41:25 +0000 (15:41 +0100)]
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP

It is possible for physically contiguous folios to have discontiguous
struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not.
This is correctly handled by folio_page_idx(), so remove this open-coded
implementation.

Fixes: 640d1930bef4 (block: Add bio_for_each_folio_all())
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 months agospi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine
Thangaraj Samynathan [Thu, 12 Jun 2025 02:30:59 +0000 (08:00 +0530)]
spi: spi-pci1xxxx: Drop MSI-X usage as unsupported by DMA engine

Removes MSI-X from the interrupt request path, as the DMA engine used by
the SPI controller does not support MSI-X interrupts.

Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Link: https://patch.msgid.link/20250612023059.71726-1-thangaraj.s@microchip.com
Signed-off-by: Mark Brown <broonie@kernel.org>
4 months agoMerge tag 'drm-misc-fixes-2025-06-12' of https://gitlab.freedesktop.org/drm/misc...
Dave Airlie [Fri, 13 Jun 2025 04:57:09 +0000 (14:57 +1000)]
Merge tag 'drm-misc-fixes-2025-06-12' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

drm-misc-fixes for v6.16-rc2:
- Fix infinite EPROBE_DEFER loop in vc4 probing.
- Fix amdxdna firmware size.
- mode fixes for meson.
- Kconfig fix for st7171-i2c.
- Fix -EBUSY WARN_ON_ONCE in dma-buf
- Use dma_sync_sgtable_for_cpu in udmabuf.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/62c06195-8bc1-4dae-8777-e86d94e4d9d9@linux.intel.com
4 months agomm: add mmap_prepare() compatibility layer for nested file systems
Lorenzo Stoakes [Mon, 9 Jun 2025 16:57:49 +0000 (17:57 +0100)]
mm: add mmap_prepare() compatibility layer for nested file systems

Nested file systems, that is those which invoke call_mmap() within their
own f_op->mmap() handlers, may encounter underlying file systems which
provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83
("mm: introduce new .mmap_prepare() file callback").

We have a chicken-and-egg scenario here - until all file systems are
converted to using .mmap_prepare(), we cannot convert these nested
handlers, as we can't call f_op->mmap from an .mmap_prepare() hook.

So we have to do it the other way round - invoke the .mmap_prepare() hook
from an .mmap() one.

in order to do so, we need to convert VMA state into a struct vm_area_desc
descriptor, invoking the underlying file system's f_op->mmap_prepare()
callback passing a pointer to this, and then setting VMA state accordingly
and safely.

This patch achieves this via the compat_vma_mmap_prepare() function, which
we invoke from call_mmap() if f_op->mmap_prepare() is specified in the
passed in file pointer.

We place the fundamental logic into mm/vma.h where VMA manipulation
belongs.  We also update the VMA userland tests to accommodate the
changes.

The compat_vma_mmap_prepare() function and its associated machinery is
temporary, and will be removed once the conversion of file systems is
complete.

We carefully place this code so it can be used with CONFIG_MMU and also
with cutting edge nommu silicon.

[akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build]
[lorenzo.stoakes@oracle.com: remove unused declarations]
Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local
Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com
Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback").
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 months agoMerge tag 'drm-xe-fixes-2025-06-12' of https://gitlab.freedesktop.org/drm/xe/kernel...
Dave Airlie [Fri, 13 Jun 2025 01:05:23 +0000 (11:05 +1000)]
Merge tag 'drm-xe-fixes-2025-06-12' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

Driver Changes:
- Fix regression disallowing 64K SVM migration (Maarten)
- Use a bounce buffer for WA BB (Lucas)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/aEsBQoh5Si3ouPgE@fedora
4 months agoMerge tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux
Linus Torvalds [Thu, 12 Jun 2025 19:32:09 +0000 (12:32 -0700)]
Merge tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux

Pull bitmap fix from Yury Norov:
 "Fix for __GENMASK() and __GENMASK_ULL() in UAPI"

* tag 'bitmap-for-6.16-rc2' of https://github.com/norov/linux:
  uapi: bitops: use UAPI-safe variant of BITS_PER_LONG again

4 months agosmb: improve directory cache reuse for readdir operations
Bharath SM [Wed, 11 Jun 2025 11:29:02 +0000 (16:59 +0530)]
smb: improve directory cache reuse for readdir operations

Currently, cached directory contents were not reused across subsequent
'ls' operations because the cache validity check relied on comparing
the ctx pointer, which changes with each readdir invocation. As a
result, the cached dir entries was not marked as valid and the cache was
not utilized for subsequent 'ls' operations.

This change uses the file pointer, which remains consistent across all
readdir calls for a given directory instance, to associate and validate
the cache. As a result, cached directory contents can now be
correctly reused, improving performance for repeated directory listings.

Performance gains with local windows SMB server:

Without the patch and default actimeo=1:
 1000 directory enumeration operations on dir with 10k files took 135.0s

With this patch and actimeo=0:
 1000 directory enumeration operations on dir with 10k files took just 5.1s

Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
4 months agosmb: client: fix perf regression with deferred closes
Paulo Alcantara [Thu, 12 Jun 2025 15:45:04 +0000 (12:45 -0300)]
smb: client: fix perf regression with deferred closes

Customer reported that one of their applications started failing to
open files with STATUS_INSUFFICIENT_RESOURCES due to NetApp server
hitting the maximum number of opens to same file that it would allow
for a single client connection.

It turned out the client was failing to reuse open handles with
deferred closes because matching ->f_flags directly without masking
off O_CREAT|O_EXCL|O_TRUNC bits first broke the comparision and then
client ended up with thousands of deferred closes to same file.  Those
bits are already satisfied on the original open, so no need to check
them against existing open handles.

Reproducer:

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <pthread.h>

 #define NR_THREADS      4
 #define NR_ITERATIONS   2500
 #define TEST_FILE       "/mnt/1/test/dir/foo"

 static char buf[64];

 static void *worker(void *arg)
 {
         int i, j;
         int fd;

         for (i = 0; i < NR_ITERATIONS; i++) {
                 fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_APPEND, 0666);
                 for (j = 0; j < 16; j++)
                         write(fd, buf, sizeof(buf));
                 close(fd);
         }
 }

 int main(int argc, char *argv[])
 {
         pthread_t t[NR_THREADS];
         int fd;
         int i;

         fd = open(TEST_FILE, O_WRONLY|O_CREAT|O_TRUNC, 0666);
         close(fd);
         memset(buf, 'a', sizeof(buf));
         for (i = 0; i < NR_THREADS; i++)
                 pthread_create(&t[i], NULL, worker, NULL);
         for (i = 0; i < NR_THREADS; i++)
                 pthread_join(t[i], NULL);
         return 0;
 }

Before patch:

$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1391

After patch:

$ mount.cifs //srv/share /mnt/1 -o ...
$ mkdir -p /mnt/1/test/dir
$ gcc repro.c && ./a.out
...
number of opens: 1

Cc: linux-cifs@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Pierguido Lambri <plambri@redhat.com>
Fixes: b8ea3b1ff544 ("smb: enable reuse of deferred file handles for write operations")
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
4 months agoMerge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 12 Jun 2025 16:50:36 +0000 (09:50 -0700)]
Merge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bluetooth and wireless.

  Current release - regressions:

   - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD

  Current release - new code bugs:

   - eth: airoha: correct enable mask for RX queues 16-31

   - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
     disappears under traffic

   - ipv6: move fib6_config_validate() to ip6_route_add(), prevent
     invalid routes

  Previous releases - regressions:

   - phy: phy_caps: don't skip better duplex match on non-exact match

   - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0

   - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused
     transient packet loss, exact reason not fully understood, yet

  Previous releases - always broken:

   - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)

   - sched: sfq: fix a potential crash on gso_skb handling

   - Bluetooth: intel: improve rx buffer posting to avoid causing issues
     in the firmware

   - eth: intel: i40e: make reset handling robust against multiple
     requests

   - eth: mlx5: ensure FW pages are always allocated on the local NUMA
     node, even when device is configure to 'serve' another node

   - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
     prevent kernel crashes

   - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
     for 3 sec if fw_stats_done is not set"

* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
  net: ethtool: Don't check if RSS context exists in case of context 0
  af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
  ipv6: Move fib6_config_validate() to ip6_route_add().
  net: drv: netdevsim: don't napi_complete() from netpoll
  net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
  veth: prevent NULL pointer dereference in veth_xdp_rcv
  net_sched: remove qdisc_tree_flush_backlog()
  net_sched: ets: fix a race in ets_qdisc_change()
  net_sched: tbf: fix a race in tbf_change()
  net_sched: red: fix a race in __red_change()
  net_sched: prio: fix a race in prio_tune()
  net_sched: sch_sfq: reject invalid perturb period
  net: phy: phy_caps: Don't skip better duplex macth on non-exact match
  MAINTAINERS: Update Kuniyuki Iwashima's email address.
  selftests: net: add test case for NAT46 looping back dst
  net: clear the dst when changing skb protocol
  net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
  net/mlx5e: Fix leak of Geneve TLV option object
  net/mlx5: HWS, make sure the uplink is the last destination
  ...

4 months agodrm/xe/lrc: Use a temporary buffer for WA BB
Lucas De Marchi [Wed, 4 Jun 2025 15:03:05 +0000 (08:03 -0700)]
drm/xe/lrc: Use a temporary buffer for WA BB

In case the BO is in iomem, we can't simply take the vaddr and write to
it. Instead, prepare a separate buffer that is later copied into io
memory. Right now it's just a few words that could be using
xe_map_write32(), but the intention is to grow the WA BB for other
uses.

Fixes: 617d824c5323 ("drm/xe: Add WA BB to capture active context utilization")
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://lore.kernel.org/r/20250604-wa-bb-fix-v1-1-0dfc5dafcef0@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit ef48715b2d3df17c060e23b9aa636af3d95652f8)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
4 months agoMerge tag 'pinctrl-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Thu, 12 Jun 2025 15:21:13 +0000 (08:21 -0700)]
Merge tag 'pinctrl-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:

 - Add some missing pins on the Qualcomm QCM2290, along with a managed
   resources patch that make it clean and nice

 - Drop an unused function in the ST Micro driver

 - Drop bouncing MAINTAINER entry

 - Drop of_match_ptr() macro to rid compile warnings in the TB10x
   driver

 - Fix up calculation of pin numbers from base in the Sunxi driver

* tag 'pinctrl-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: sunxi: dt: Consider pin base when calculating bank number from pin
  pinctrl: tb10x: Drop of_match_ptr for ID table
  pinctrl: MAINTAINERS: Drop bouncing Jianlong Huang
  pinctrl: st: Drop unused st_gpio_bank() function
  pinctrl: qcom: pinctrl-qcm2290: Add missing pins
  pinctrl: qcom: switch to devm_gpiochip_add_data()

4 months agoMerge tag 'arc-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Linus Torvalds [Thu, 12 Jun 2025 15:17:56 +0000 (08:17 -0700)]
Merge tag 'arc-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc

Pull ARC fixes from Vineet Gupta:

 - arch_atomic64_cmpxchg relaxed variant [Jason]

 - use of inbuilt swap in stack unwinder  [Yu-Chun Lin]

 - use of __ASSEMBLER__ in kernel headers [Thomas Huth]

* tag 'arc-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
  ARC: Replace __ASSEMBLY__ with __ASSEMBLER__ in the non-uapi headers
  ARC: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers
  ARC: unwind: Use built-in sort swap to reduce code size and improve performance
  ARC: atomics: Implement arch_atomic64_cmpxchg using _relaxed

4 months agoMerge tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Thu, 12 Jun 2025 15:16:47 +0000 (08:16 -0700)]
Merge tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Another quick round of updates:

 - revert mwifiex HT40 that was causing issues
 - many ath10k/ath11k/ath12k fixes
 - re-add some iwlwifi code I lost in a merge
 - use kfree_sensitive() on an error path in cfg80211

* tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: cfg80211: use kfree_sensitive() for connkeys cleanup
  wifi: iwlwifi: fix merge damage related to iwl_pci_resume
  Revert "wifi: mwifiex: Fix HT40 bandwidth issue."
  wifi: ath12k: fix uaf in ath12k_core_init()
  wifi: ath12k: Fix hal_reo_cmd_status kernel-doc
  wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850
  wifi: ath11k: validate ath11k_crypto_mode on top of ath11k_core_qmi_firmware_ready
  wifi: ath11k: consistently use ath11k_mac_get_fw_stats()
  wifi: ath11k: move locking outside of ath11k_mac_get_fw_stats()
  wifi: ath11k: adjust unlock sequence in ath11k_update_stats_event()
  wifi: ath11k: move some firmware stats related functions outside of debugfs
  wifi: ath11k: don't wait when there is no vdev started
  wifi: ath11k: don't use static variables in ath11k_debugfs_fw_stats_process()
  wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
  wil6210: fix support for sparrow chipsets
  wifi: ath10k: Avoid vdev delete timeout when firmware is already down
  ath10k: snoc: fix unbalanced IRQ enable in crash recovery
====================

Link: https://patch.msgid.link/20250612082519.11447-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge branch 'fix-ntuple-rules-targeting-default-rss'
Jakub Kicinski [Thu, 12 Jun 2025 15:15:37 +0000 (08:15 -0700)]
Merge branch 'fix-ntuple-rules-targeting-default-rss'

Gal Pressman says:

====================
Fix ntuple rules targeting default RSS

This series addresses a regression in ethtool flow steering where rules
targeting the default RSS context (context 0) were incorrectly rejected.

The default RSS context always exists but is not stored in the rss_ctx
xarray like additional contexts. The current validation logic was
checking for the existence of context 0 in this array, causing valid
flow steering rules to be rejected.

This prevented configurations such as:
- High priority rules directing specific traffic to the default context
- Low priority catch-all rules directing remaining traffic to additional
  contexts

Patch 1 fixes the validation logic to skip the existence check for
context 0.

Patch 2 adds a selftest that verifies this behavior.

v3: https://lore.kernel.org/20250609120250.1630125-1-gal@nvidia.com
v2: https://lore.kernel.org/20250225071348.509432-1-gal@nvidia.com
====================

Link: https://patch.msgid.link/20250612071958.1696361-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoselftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
Gal Pressman [Thu, 12 Jun 2025 07:19:58 +0000 (10:19 +0300)]
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context

Add test_rss_default_context_rule() to verify that ntuple rules can
correctly direct traffic to the default RSS context (context 0).

The test creates two ntuple rules with explicit location priorities:
- A high-priority rule (loc 0) directing specific port traffic to
  context 0.
- A low-priority rule (loc 1) directing all other TCP traffic to context
  1.

This validates that:
1. Rules targeting the default context function properly.
2. Traffic steering works as expected when mixing default and
   additional RSS contexts.

The test was written by AI, and reviewed by humans.

Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20250612071958.1696361-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agonet: ethtool: Don't check if RSS context exists in case of context 0
Gal Pressman [Thu, 12 Jun 2025 07:19:57 +0000 (10:19 +0300)]
net: ethtool: Don't check if RSS context exists in case of context 0

Context 0 (default context) always exists, there is no need to check
whether it exists or not when adding a flow steering rule.

The existing check fails when creating a flow steering rule for context
0 as it is not stored in the rss_ctx xarray.

For example:
$ ethtool --config-ntuple eth2 flow-type tcp4 dst-ip 194.237.147.23 dst-port 19983 context 0 loc 618
rmgr: Cannot insert RX class rule: Invalid argument
Cannot insert classification rule

An example usecase for this could be:
- A high-priority rule (loc 0) directing specific port traffic to
  context 0.
- A low-priority rule (loc 1) directing all other TCP traffic to context
  1.

This is a user-visible regression that was caught in our testing
environment, it was not reported by a user yet.

Fixes: de7f7582dff2 ("net: ethtool: prevent flow steering to RSS contexts which don't exist")
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250612071958.1696361-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoMerge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluet...
Jakub Kicinski [Thu, 12 Jun 2025 15:13:47 +0000 (08:13 -0700)]
Merge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - eir: Fix NULL pointer deference on eir_get_service_data
 - eir: Fix possible crashes on eir_create_adv_data
 - hci_sync: Fix broadcast/PA when using an existing instance
 - ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
 - ISO: Fix not using bc_sid as advertisement SID
 - MGMT: Fix sparse errors

* tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: MGMT: Fix sparse errors
  Bluetooth: ISO: Fix not using bc_sid as advertisement SID
  Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
  Bluetooth: eir: Fix possible crashes on eir_create_adv_data
  Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
  Bluetooth: Fix NULL pointer deference on eir_get_service_data
====================

Link: https://patch.msgid.link/20250611204944.1559356-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoaf_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
Kuniyuki Iwashima [Wed, 11 Jun 2025 20:27:35 +0000 (13:27 -0700)]
af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.

Before the cited commit, the kernel unconditionally embedded SCM
credentials to skb for embryo sockets even when both the sender
and listener disabled SO_PASSCRED and SO_PASSPIDFD.

Now, the credentials are added to skb only when configured by the
sender or the listener.

However, as reported in the link below, it caused a regression for
some programs that assume credentials are included in every skb,
but sometimes not now.

The only problematic scenario would be that a socket starts listening
before setting the option.  Then, there will be 2 types of non-small
race window, where a client can send skb without credentials, which
the peer receives as an "invalid" message (and aborts the connection
it seems ?):

  Client                    Server
  ------                    ------
                            s1.listen()  <-- No SO_PASS{CRED,PIDFD}
  s2.connect()
  s2.send()  <-- w/o cred
                            s1.setsockopt(SO_PASS{CRED,PIDFD})
  s2.send()  <-- w/  cred

or

  Client                    Server
  ------                    ------
                            s1.listen()  <-- No SO_PASS{CRED,PIDFD}
  s2.connect()
  s2.send()  <-- w/o cred
                            s3, _ = s1.accept()  <-- Inherit cred options
  s2.send()  <-- w/o cred                            but not set yet

                            s3.setsockopt(SO_PASS{CRED,PIDFD})
  s2.send()  <-- w/  cred

It's unfortunate that buggy programs depend on the behaviour,
but let's restore the previous behaviour.

Fixes: 3f84d577b79d ("af_unix: Inherit sk_flags at connect().")
Reported-by: Jacek Łuczak <difrost.kernel@gmail.com>
Closes: https://lore.kernel.org/all/68d38b0b-1666-4974-85d4-15575789c8d4@gmail.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Christian Heusel <christian@heusel.eu>
Tested-by: André Almeida <andrealmeid@igalia.com>
Tested-by: Jacek Łuczak <difrost.kernel@gmail.com>
Link: https://patch.msgid.link/20250611202758.3075858-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 months agoipv6: Move fib6_config_validate() to ip6_route_add().
Kuniyuki Iwashima [Wed, 11 Jun 2025 19:35:02 +0000 (12:35 -0700)]
ipv6: Move fib6_config_validate() to ip6_route_add().

syzkaller created an IPv6 route from a malformed packet, which has
a prefix len > 128, triggering the splat below. [0]

This is a similar issue fixed by commit 586ceac9acb7 ("ipv6: Restore
fib6_config validation for SIOCADDRT.").

The cited commit removed fib6_config validation from some callers
of ip6_add_route().

Let's move the validation back to ip6_route_add() and
ip6_route_multipath_add().

[0]:
UBSAN: array-index-out-of-bounds in ./include/net/ipv6.h:616:34
index 20 is out of range for type '__u8 [16]'
CPU: 1 UID: 0 PID: 7444 Comm: syz.0.708 Not tainted 6.16.0-rc1-syzkaller-g19272b37aa4f #0 PREEMPT
Hardware name: riscv-virtio,qemu (DT)
Call Trace:
[<ffffffff80078a80>] dump_backtrace+0x2e/0x3c arch/riscv/kernel/stacktrace.c:132
[<ffffffff8000327a>] show_stack+0x30/0x3c arch/riscv/kernel/stacktrace.c:138
[<ffffffff80061012>] __dump_stack lib/dump_stack.c:94 [inline]
[<ffffffff80061012>] dump_stack_lvl+0x12e/0x1a6 lib/dump_stack.c:120
[<ffffffff800610a6>] dump_stack+0x1c/0x24 lib/dump_stack.c:129
[<ffffffff8001c0ea>] ubsan_epilogue+0x14/0x46 lib/ubsan.c:233
[<ffffffff819ba290>] __ubsan_handle_out_of_bounds+0xf6/0xf8 lib/ubsan.c:455
[<ffffffff85b363a4>] ipv6_addr_prefix include/net/ipv6.h:616 [inline]
[<ffffffff85b363a4>] ip6_route_info_create+0x8f8/0x96e net/ipv6/route.c:3793
[<ffffffff85b635da>] ip6_route_add+0x2a/0x1aa net/ipv6/route.c:3889
[<ffffffff85b02e08>] addrconf_prefix_route+0x2c4/0x4e8 net/ipv6/addrconf.c:2487
[<ffffffff85b23bb2>] addrconf_prefix_rcv+0x1720/0x1e62 net/ipv6/addrconf.c:2878
[<ffffffff85b92664>] ndisc_router_discovery+0x1a06/0x3504 net/ipv6/ndisc.c:1570
[<ffffffff85b99038>] ndisc_rcv+0x500/0x600 net/ipv6/ndisc.c:1874
[<ffffffff85bc2c18>] icmpv6_rcv+0x145e/0x1e0a net/ipv6/icmp.c:988
[<ffffffff85af6798>] ip6_protocol_deliver_rcu+0x18a/0x1976 net/ipv6/ip6_input.c:436
[<ffffffff85af8078>] ip6_input_finish+0xf4/0x174 net/ipv6/ip6_input.c:480
[<ffffffff85af8262>] NF_HOOK include/linux/netfilter.h:317 [inline]
[<ffffffff85af8262>] NF_HOOK include/linux/netfilter.h:311 [inline]
[<ffffffff85af8262>] ip6_input+0x16a/0x70c net/ipv6/ip6_input.c:491
[<ffffffff85af8dcc>] ip6_mc_input+0x5c8/0x1268 net/ipv6/ip6_input.c:588
[<ffffffff85af6112>] dst_input include/net/dst.h:469 [inline]
[<ffffffff85af6112>] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline]
[<ffffffff85af6112>] NF_HOOK include/linux/netfilter.h:317 [inline]
[<ffffffff85af6112>] NF_HOOK include/linux/netfilter.h:311 [inline]
[<ffffffff85af6112>] ipv6_rcv+0x5ae/0x6e0 net/ipv6/ip6_input.c:309
[<ffffffff85087e84>] __netif_receive_skb_one_core+0x106/0x16e net/core/dev.c:5977
[<ffffffff85088104>] __netif_receive_skb+0x2c/0x144 net/core/dev.c:6090
[<ffffffff850883c6>] netif_receive_skb_internal net/core/dev.c:6176 [inline]
[<ffffffff850883c6>] netif_receive_skb+0x1aa/0xbf2 net/core/dev.c:6235
[<ffffffff8328656e>] tun_rx_batched.isra.0+0x430/0x686 drivers/net/tun.c:1485
[<ffffffff8329ed3a>] tun_get_user+0x2952/0x3d6c drivers/net/tun.c:1938
[<ffffffff832a21e0>] tun_chr_write_iter+0xc4/0x21c drivers/net/tun.c:1984
[<ffffffff80b9b9ae>] new_sync_write fs/read_write.c:593 [inline]
[<ffffffff80b9b9ae>] vfs_write+0x56c/0xa9a fs/read_write.c:686
[<ffffffff80b9c2be>] ksys_write+0x126/0x228 fs/read_write.c:738
[<ffffffff80b9c42e>] __do_sys_write fs/read_write.c:749 [inline]
[<ffffffff80b9c42e>] __se_sys_write fs/read_write.c:746 [inline]
[<ffffffff80b9c42e>] __riscv_sys_write+0x6e/0x94 fs/read_write.c:746
[<ffffffff80076912>] syscall_handler+0x94/0x118 arch/riscv/include/asm/syscall.h:112
[<ffffffff8637e31e>] do_trap_ecall_u+0x396/0x530 arch/riscv/kernel/traps.c:341
[<ffffffff863a69e2>] handle_exception+0x146/0x152 arch/riscv/kernel/entry.S:197

Fixes: fa76c1674f2e ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().")
Reported-by: syzbot+4c2358694722d304c44e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6849b8c3.a00a0220.1eb5f5.00f0.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250611193551.2999991-1-kuni1840@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>