Russell King (Oracle) [Thu, 11 Sep 2025 10:46:43 +0000 (11:46 +0100)]
net: dsa: mv88e6xxx: remove unused support for PPS event capture
mv88e6352_config_eventcap() is documented as handling both EXTTS and
PPS capture modes, but nothing ever calls it for PPS capture. Remove
the unused PPS capture mode support, and the now unused
MV88E6XXX_TAI_EVENT_STATUS_CAP_TRIG definition.
Russell King (Oracle) [Thu, 11 Sep 2025 10:46:38 +0000 (11:46 +0100)]
net: dsa: mv88e6xxx: remove chip->evcap_config
evcap_config is only read and written in mv88e6352_config_eventcap(),
so it makes little sense to store it in the global chip struct. Make
it a local variable instead.
Russell King (Oracle) [Thu, 11 Sep 2025 10:46:28 +0000 (11:46 +0100)]
net: dsa: mv88e6xxx: remove mv88e6250_ptp_ops
mv88e6250_ptp_ops and mv88e6352_ptp_ops are identical since commit 7e3c18097a70 ("net: dsa: mv88e6xxx: read cycle counter period from
hardware"). Remove the unnecessary duplication.
net/cls_cgroup: Fix task_get_classid() during qdisc run
During recent testing with the netem qdisc to inject delays into TCP
traffic, we observed that our CLS BPF program failed to function correctly
due to incorrect classid retrieval from task_get_classid(). The issue
manifests in the following call stack:
The problem occurs when multiple tasks share a single qdisc. In such cases,
__qdisc_run() may transmit skbs created by different tasks. Consequently,
task_get_classid() retrieves an incorrect classid since it references the
current task's context rather than the skb's originating task.
Given that dev_queue_xmit() always executes with bh disabled, we can use
softirq_count() instead to obtain the correct classid.
The simple steps to reproduce this issue:
1. Add network delay to the network interface:
such as: tc qdisc add dev bond0 root netem delay 1.5ms
2. Build two distinct net_cls cgroups, each with a network-intensive task
3. Initiate parallel TCP streams from both tasks to external servers.
Under this specific condition, the issue reliably occurs. The kernel
eventually dequeues an SKB that originated from Task-A while executing in
the context of Task-B.
It is worth noting that it will change the established behavior for a
slightly different scenario:
<sock S is created by task A>
<class ID for task A is changed>
<skb is created by sock S xmit and classified>
prior to this patch the skb will be classified with the 'new' task A
classid, now with the old/original one. The bpf_get_cgroup_classid_curr()
function is a more appropriate choice for this case.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250902062933.30087-1-laoar.shao@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: mana: Reduce waiting time if HWC not responding
If HW Channel (HWC) is not responding, reduce the waiting time, so further
steps will fail quickly.
This will prevent getting stuck for a long time (30 minutes or more), for
example, during unloading while HWC is not responding.
Eric Dumazet [Tue, 9 Sep 2025 12:19:42 +0000 (12:19 +0000)]
net: use NUMA drop counters for softnet_data.dropped
Hosts under DOS attack can suffer from false sharing
in enqueue_to_backlog() : atomic_inc(&sd->dropped).
This is because sd->dropped can be touched from many cpus,
possibly residing on different NUMA nodes.
Generalize the sk_drop_counters infrastucture
added in commit c51613fa276f ("net: add sk->sk_drop_counters")
and use it to replace softnet_data.dropped
with NUMA friendly softnet_data.drop_counters.
This adds 64 bytes per cpu, maybe more in the future
if we increase the number of counters (currently 2)
per 'struct numa_drop_counters'.
====================
Add GMAC support for Renesas RZ/{T2H, N2H} SoCs
This series adds support for the Ethernet MAC (GMAC) IP present on
the Renesas RZ/T2H and RZ/N2H SoCs.
While these SoCs use the same Synopsys DesignWare MAC IP (version 5.20) as
the existing RZ/V2H(P), the hardware is synthesized with different options
that require driver and binding updates:
- 8 RX/TX queue pairs instead of 4 (requiring 19 interrupts vs 11)
- Different clock requirements (3 clocks vs 7)
- Different reset handling (2 named resets vs 1 unnamed)
- Split header feature enabled
- GMAC connected through a MIIC PCS on RZ/T2H
The series first updates the generic dwmac binding to accommodate the
higher interrupt count, then extends the Renesas-specific binding with
a to document both SoCs.
The driver changes prepare for multi-SoC support by introducing OF match
data for per-SoC configuration, then add RZ/T2H support including PCS
integration through the existing RZN1 MIIC driver.
Note this patch series is dependent on the PCS driver [0]
(not a build dependency).
[0] https://lore.kernel.org/all/20250904114204.4148520-1-prabhakar.mahadev-lad.rj@bp.renesas.com/
====================
net: stmmac: dwmac-renesas-gbeth: Add support for RZ/T2H SoC
Extend the Renesas GBETH stmmac glue driver to support the RZ/T2H SoC,
where the GMAC is connected through a MIIC PCS. Introduce a new
`has_pcs` flag in `struct renesas_gbeth_of_data` to indicate when PCS
handling is required.
When enabled, the driver parses the `pcs-handle` phandle, creates a PCS
instance with `miic_create()`, and wires it into phylink. Proper cleanup
is done with `miic_destroy()`. New init/exit/select hooks are added to
`plat_stmmacenet_data` for PCS integration.
Update Kconfig to select `PCS_RZN1_MIIC` when building the Renesas GBETH
driver so the PCS support is always available.
net: stmmac: dwmac-renesas-gbeth: Use OF data for configuration
Prepare for adding RZ/T2H SoC support by making the driver configuration
selectable via OF match data. While the RZ/V2H(P) and RZ/T2H use the same
version of the Synopsys DesignWare MAC (version 5.20), the hardware is
synthesized with different options. To accommodate these differences,
introduce a struct holding per-SoC configuration such as clock list,
number of clocks, TX clock rate control, and STMMAC flags, and retrieve
it from the device tree match entry during probe.
dt-bindings: net: renesas,rzv2h-gbeth: Document Renesas RZ/T2H and RZ/N2H SoCs
Add device tree binding support for the Gigabit Ethernet MAC (GMAC) IP
on Renesas RZ/T2H and RZ/N2H SoCs. While these SoCs use the same
Synopsys DesignWare MAC version 5.20 as RZ/V2H, they are synthesized
with different hardware configurations.
Add new compatible strings "renesas,r9a09g077-gbeth" for RZ/T2H and
"renesas,r9a09g087-gbeth" for RZ/N2H, with the latter using RZ/T2H as
fallback since they share identical GMAC IP.
Update the schema to handle hardware differences between SoC variants.
RZ/T2H requires only 3 clocks compared to 7 on RZ/V2H, supports 8 RX/TX
queue pairs instead of 4, and needs 2 reset controls with reset-names
property versus a single unnamed reset. RZ/T2H also has the split header
feature enabled which is disabled on RZ/V2H.
Add support for an optional pcs-handle property to connect the GMAC to
the MIIC PCS converter on RZ/T2H. Use conditional schema validation to
enforce the correct clock, reset, and interrupt configurations per SoC
variant.
Extend the base snps,dwmac.yaml schema to accommodate the increased
interrupt count, supporting up to 19 interrupts and extending the
rx-queue and tx-queue interrupt name patterns to cover queues 0-7.
Jonas Rebmann [Thu, 11 Sep 2025 08:29:03 +0000 (10:29 +0200)]
net: phy: micrel: Update Kconfig help text
This driver by now supports 17 different Microchip (formerly known as
Micrel) chips: KSZ9021, KSZ9031, KSZ9131, KSZ8001, KS8737, KSZ8021,
KSZ8031, KSZ8041, KSZ8051, KSZ8061, KSZ8081, KSZ8873MLL, KSZ886X,
KSZ9477, LAN8814, LAN8804 and LAN8841.
Support for the VSC8201 was removed in commit 51f932c4870f ("micrel phy
driver - updated(1)")
Update the help text to reflect that, list families instead of models to
ease future maintenance.
Jakub Kicinski [Sat, 13 Sep 2025 00:06:25 +0000 (17:06 -0700)]
Merge tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
1) Don't respond to ICMP_UNREACH errors with another ICMP_UNREACH
error.
2) Support fetching the current bridge ethernet address.
This allows a more flexible approach to packet redirection
on bridges without need to use hardcoded addresses. From
Fernando Fernandez Mancera.
3) Zap a few no-longer needed conditionals from ipvs packet path
and convert to READ/WRITE_ONCE to avoid KCSAN warnings.
From Zhang Tengfei.
4) Remove a no-longer-used macro argument in ipset, from Zhen Ni.
* tag 'nf-next-25-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_reject: don't reply to icmp error messages
ipvs: Use READ_ONCE/WRITE_ONCE for ipvs->enable
netfilter: nft_meta_bridge: introduce NFT_META_BRI_IIFHWADDR support
netfilter: ipset: Remove unused htable_bits in macro ahash_region
selftest:net: fixed spelling mistakes
====================
Russell King [Wed, 10 Sep 2025 12:50:46 +0000 (13:50 +0100)]
net: mvneta: add support for hardware timestamps
Add support for hardware timestamps in (e.g.) the PHY by calling
skb_tx_timestamp() as close as reasonably possible to the point that
the hardware is instructed to send the queued packets.
udp_tunnel: use netdev_warn() instead of netdev_WARN()
netdev_WARN() uses WARN/WARN_ON to print a backtrace along with
file and line information. In this case, udp_tunnel_nic_register()
returning an error is just a failed operation, not a kernel bug.
udp_tunnel_nic_register() can fail due to a memory allocation
failure (kzalloc() or udp_tunnel_nic_alloc()).
This is a normal runtime error and not a kernel bug.
Replace netdev_WARN() with netdev_warn() accordingly.
====================
tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()
On one side a minor/cosmetic issue, especially nowadays when
TCP-AO/TCP-MD5 signature verification failures aren't logged to dmesg.
Yet, I think worth addressing for two reasons:
- unsigned RST gets ignored by the peer and the connection is alive for
longer (keep-alive interval)
- netstat counters increase and trace events report that trusted BGP peer
is sending unsigned/incorrectly signed segments, which can ring alarm
on monitoring.
====================
Now that the destruction of info/keys is delayed until the socket
destructor, it's safe to use kfree() without an RCU callback.
The socket is in TCP_CLOSE state either because it never left it,
or it's already closed and the refcounter is zero. In any way,
no one can discover it anymore, it's safe to release memory
straight away.
tcp: Destroy TCP-AO, TCP-MD5 keys in .sk_destruct()
Currently there are a couple of minor issues with destroying the keys
tcp_v4_destroy_sock():
1. The socket is yet in TCP bind buckets, making it reachable for
incoming segments [on another CPU core], potentially available to send
late FIN/ACK/RST replies.
2. There is at least one code path, where tcp_done() is called before
sending RST [kudos to Bob for investigation]. This is a case of
a server, that finished sending its data and just called close().
The socket is in TCP_FIN_WAIT2 and has RCV_SHUTDOWN (set by
__tcp_close())
Note the signed RSTs later in the dump - those are sent by the server
when the fin-wait socket gets removed from hash buckets, by
the listener socket.
Instead of destroying AO/MD5 info and their keys in inet_csk_destroy_sock(),
slightly delay it until the actual socket .sk_destruct(). As shutdown'ed
socket can yet send non-data replies, they should be signed in order for
the peer to process them. Now it also matches how AO/MD5 gets destructed
for TIME-WAIT sockets (in tcp_twsk_destructor()).
This seems optimal for TCP-MD5, while for TCP-AO it seems to have an
open problem: once RST get sent and socket gets actually destructed,
there is no information on the initial sequence numbers. So, in case
this last RST gets lost in the network, the server's listener socket
won't be able to properly sign another RST. Nothing in RFC 1122
prescribes keeping any local state after non-graceful reset.
Luckily, BGP are known to use keep alive(s).
While the issue is quite minor/cosmetic, these days monitoring network
counters is a common practice and getting invalid signed segments from
a trusted BGP peer can get customers worried.
====================
bridge: Allow keeping local FDB entries only on VLAN 0
The bridge FDB contains one local entry per port per VLAN, for the MAC of
the port in question, and likewise for the bridge itself. This allows
bridge to locally receive and punt "up" any packets whose destination MAC
address matches that of one of the bridge interfaces or of the bridge
itself.
The number of these local "service" FDB entries grows linearly with number
of bridge-global VLAN memberships, but that in turn will tend to grow
quadratically with number of ports and per-port VLAN memberships. While
that does not cause issues during forwarding lookups, it does make dumps
impractically slow.
As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB
that just contains these 400K local entries, takes 6.5s. That's _without_
considering iproute2 formatting overhead, this is just how long it takes to
walk the FDB (repeatedly), serialize it into netlink messages, and parse
the messages back in userspace.
This is to illustrate that with growing number of ports and VLANs, the time
required to dump this repetitive information blows up. Arguably 4K VLANs
per interface is not a very realistic configuration, but then modern
switches can instead have several hundred interfaces, and we have fielded
requests for >1K VLAN memberships per port among customers.
FDB entries are currently all kept on a single linked list, and then
dumping uses this linked list to walk all entries and dump them in order.
When the message buffer is full, the iteration is cut short, and later
restarted. Of course, to restart the iteration, it's first necessary to
walk the already-dumped front part of the list before starting dumping
again. So one possibility is to organize the FDB entries in different
structure more amenable to walk restarts.
One option is to walk directly the hash table. The advantage is that no
auxiliary structure needs to be introduced. With a rough sketch of this
approach, the above scenario gets dumped in not quite 3 s, saving over 50 %
of time. However hash table iteration requires maintaining an active cursor
that must be collected when the dump is aborted. It looks like that would
require changes in the NDO protocol to allow to run this cleanup. Moreover,
on hash table resize the iteration is simply restarted. FDB dumps are
currently not guaranteed to correspond to any one particular state: entries
can be missed, or be duplicated. But with hash table iteration we would get
that plus the much less graceful resize behavior, where swaths of FDB are
duplicated.
Another option is to maintain the FDB entries in a red-black tree. We have
a PoC of this approach on hand, and the above scenario is dumped in about
2.5 s. Still not as snappy as we'd like it, but better than the hash table.
However the savings come at the expense of a more expensive insertion, and
require locking during dumps, which blocks insertion.
The upside of these approaches is that they provide benefits whatever the
FDB contents. But it does not seem like either of these is workable.
However we intend to clean up the RB tree PoC and present it for
consideration later on in case the trade-offs are considered acceptable.
Yet another option might be to use in-kernel FDB filtering, and to filter
the local entries when dumping. Unfortunately, this does not help all that
much either, because the linked-list walk still needs to happen. Also, with
the obvious filtering interface built around ndm_flags / ndm_state
filtering, one can't just exclude pure local entries in one query. One
needs to dump all non-local entries first, and then to get permanent
entries in another run filter local & added_by_user. I.e. one needs to pay
the iteration overhead twice, and then integrate the result in userspace.
To get significant savings, one would need a very specific knob like "dump,
but skip/only include local entries". But if we are adding a local-specific
knobs, maybe let's have an option to just not duplicate them in the first
place.
All this FDB duplication is there merely to make things snappy during
forwarding. But high-radix switches with thousands of VLANs typically do
not process much traffic in the SW datapath at all, but rather offload vast
majority of it. So we could exchange some of the runtime performance for a
neater FDB.
To that end, in this patchset, introduce a new bridge option,
BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries
installed only on VLAN 0, instead of duplicating them across all VLANs.
Then to maintain the local termination behavior, on FDB miss, the bridge
does a second lookup on VLAN 0.
Enabling this option changes the bridge behavior in expected ways. Since
the entries are only kept on VLAN 0, FDB get, flush and dump will not
perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects
forwarding on all VLANs.
This patchset is loosely based on a privately circulated patch by Nikolay
Aleksandrov.
The patchset progresses as follows:
- Patch #1 introduces a bridge option to enable the above feature. Then
patches #2 to #5 gradually patch the bridge to do the right thing when
the option is enabled. Finally patch #6 adds the UAPI knob and the code
for when the feature is enabled or disabled.
- Patches #7, #8 and #9 contain fixes and improvements to selftest
libraries
- Patch #10 contains a new selftest
====================
Usually the autodefer helpers in lib.sh are expected to be run in context
where success is the expected outcome. However when using them for feature
detection, failure can legitimately occur. But the failed command still
schedules a cleanup, which will likely fail again.
Instead, only schedule deferred cleanup when the positive command succeeds.
This way of organizing the cleanup has the added benefit that now the
return code from these functions reflects whether the command passed.
Petr Machata [Thu, 4 Sep 2025 17:07:25 +0000 (19:07 +0200)]
selftests: defer: Introduce DEFER_PAUSE_ON_FAIL
The fact that all cleanup (ideally) goes through the defer framework makes
debugging of these commands a bit tricky. However, this also gives us a
nice point to place a hook along the lines of PAUSE_ON_FAIL. When the
environment variable DEFER_PAUSE_ON_FAIL is set, and a cleanup command
results in non-zero exit status, show a bit of debuginfo and give the user
an opportunity to interrupt the execution altogether.
Petr Machata [Thu, 4 Sep 2025 17:07:24 +0000 (19:07 +0200)]
selftests: defer: Allow spaces in arguments of deferred commands
Currently the way deferred commands are stored and invoked causes any
whitespace to act as an argument separator when the command is executed.
To make it possible to use spaces in deferred commands, store the commands
quoted, and then eval the string prior to execution.
Petr Machata [Thu, 4 Sep 2025 17:07:23 +0000 (19:07 +0200)]
net: bridge: Introduce UAPI for BR_BOOLOPT_FDB_LOCAL_VLAN_0
The previous patches introduced a new option, BR_BOOLOPT_FDB_LOCAL_VLAN_0.
When enabled, it has local FDB entries installed only on VLAN 0, instead of
duplicating them across all VLANs.
In this patch, add the corresponding UAPI toggle, and the code for turning
the feature on and off.
Petr Machata [Thu, 4 Sep 2025 17:07:22 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: Skip local FDBs on VLAN creation
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
member ports as well as the bridge itself should not be created per-VLAN,
but instead only on VLAN 0.
Thus when a VLAN is added for a port or the bridge itself, a local FDB
entry with the corresponding address should not be added when in the VLAN-0
mode.
Petr Machata [Thu, 4 Sep 2025 17:07:21 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: On bridge changeaddr, skip per-VLAN FDBs
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
bridge itself should not be created per-VLAN, but instead only on VLAN 0.
When the bridge address changes, the local FDB entries need to be updated,
which is done in br_fdb_change_mac_address().
Bail out early when in VLAN-0 mode, so that the per-VLAN FDB entries are
not created. The per-VLAN walk is only done afterwards.
Petr Machata [Thu, 4 Sep 2025 17:07:20 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: On port changeaddr, skip per-VLAN FDBs
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for member
ports should not be created per-VLAN, but instead only on VLAN 0. When the
member port address changes, the local FDB entries need to be updated,
which is done in br_fdb_changeaddr().
Under the VLAN-0 mode, only one local FDB entry will ever be added for a
port's address, and that on VLAN 0. Thus bail out of the delete loop early.
For the same reason, also skip adding the per-VLAN entries.
Petr Machata [Thu, 4 Sep 2025 17:07:19 +0000 (19:07 +0200)]
net: bridge: BROPT_FDB_LOCAL_VLAN_0: Look up FDB on VLAN 0 on miss
When BROPT_FDB_LOCAL_VLAN_0 is enabled, the local FDB entries for the
member ports as well as the bridge itself should not be created per-VLAN,
but instead only on VLAN 0.
That means that br_handle_frame_finish() needs to make two lookups: the
primary lookup on an appropriate VLAN, and when that misses, a lookup on
VLAN 0.
Have the second lookup only accept local MAC addresses. Turning this into a
generic second-lookup feature is not the goal.
Petr Machata [Thu, 4 Sep 2025 17:07:18 +0000 (19:07 +0200)]
net: bridge: Introduce BROPT_FDB_LOCAL_VLAN_0
The following patches will gradually introduce the ability of the bridge
to look up local FDB entries on VLAN 0 instead of using the VLAN indicated
by a packet.
In this patch, just introduce the option itself, with which the feature
will be linked.
Stanislav Fomichev [Wed, 10 Sep 2025 16:24:29 +0000 (09:24 -0700)]
net: devmem: expose tcp_recvmsg_locked errors
tcp_recvmsg_dmabuf can export the following errors:
- EFAULT when linear copy fails
- ETOOSMALL when cmsg put fails
- ENODEV if one of the frags is readable
- ENOMEM on xarray failures
But they are all ignored and replaced by EFAULT in the caller
(tcp_recvmsg_locked). Expose real error to the userspace to
add more transparency on what specifically fails.
In non-devmem case (skb_copy_datagram_msg) doing `if (!copied)
copied=-EFAULT` is ok because skb_copy_datagram_msg can return only EFAULT.
Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250910162429.4127997-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Hildenbrand [Wed, 10 Sep 2025 01:36:43 +0000 (03:36 +0200)]
wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
It's no longer user-selectable (and the default was already "y"), so
let's just drop it.
It was never really relevant to the wireguard selftests either way.
Cc: Shuah Khan <shuah@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://patch.msgid.link/20250910013644.4153708-4-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard: queueing: always return valid online CPU in wg_cpumask_choose_online()
The function gets number of online CPUS, and uses it to search for
Nth cpu in cpu_online_mask.
If id == num_online_cpus() - 1, and one CPU gets offlined between
calling num_online_cpus() -> cpumask_nth(), there's a chance for
cpumask_nth() to find nothing and return >= nr_cpu_ids.
The caller code in __queue_work() tries to avoid that by checking the
returned CPU against WORK_CPU_UNBOUND, which is NR_CPUS. It's not the
same as '>= nr_cpu_ids'. On a typical Ubuntu desktop, NR_CPUS is 8192,
while nr_cpu_ids is the actual number of possible CPUs, say 8.
The non-existing cpu may later be passed to rcu_dereference() and
corrupt the logic. Fix it by switching from 'if' to 'while'.
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://patch.msgid.link/20250910013644.4153708-3-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wg_cpumask_choose_online() opencodes cpumask_nth(). Use it and make the
function significantly simpler. While there, fix opencoded cpu_online()
too.
Signed-off-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://patch.msgid.link/20250910013644.4153708-2-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Move the conflicting declaration to the end of the corresponding
structure. Notice that `struct ip_tunnel_info` is a flexible
structure, this is a structure that contains a flexible-array
member.
Fix the following warning:
drivers/net/geneve.c:56:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/aMBK78xT2fUnpwE5@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a system with high real-time requirements, the timeout mechanism of
ordinary timers with jiffies granularity is insufficient to meet the
demands for real-time performance. Meanwhile, the optimization of CPU
usage with af_packet is quite significant. Use hrtimer instead of timer
to help compensate for the shortcomings in real-time performance.
In HZ=100 or HZ=250 system, the update of TP_STATUS_USER is not real-time
enough, with fluctuations reaching over 8ms (on a system with HZ=250).
This is unacceptable in some high real-time systems that require timely
processing of network packets. By replacing it with hrtimer, if a timeout
of 2ms is set, the update of TP_STATUS_USER can be stabilized to within
3 ms.
====================
net: af_packet: Use hrtimer to do the retire operation
In a system with high real-time requirements, the timeout mechanism of
ordinary timers with jiffies granularity is insufficient to meet the
demands for real-time performance. Meanwhile, the optimization of CPU
usage with af_packet is quite significant. Use hrtimer instead of timer
to help compensate for the shortcomings in real-time performance.
In HZ=100 or HZ=250 system, the update of TP_STATUS_USER is not real-time
enough, with fluctuations reaching over 8ms (on a system with HZ=250).
This is unacceptable in some high real-time systems that require timely
processing of network packets. By replacing it with hrtimer, if a timeout
of 2ms is set, the update of TP_STATUS_USER can be stabilized to within
3 ms.
Delete delete_blk_timer field, because hrtimer_cancel will check and wait
until the timer callback return and ensure never enter callback again.
Simplify the logic related to setting timeout, only update the hrtimer
expire time within the hrtimer callback, no longer update the expire time
in prb_open_block which is called by tpacket_rcv or timer callback.
Reasons why NOT update hrtimer in prb_open_block:
1) It will increase complexity to distinguish the two caller scenario.
2) hrtimer_cancel and hrtimer_start need to be called if you want to update
TMO of an already enqueued hrtimer, leading to complex shutdown logic.
One side effect of NOT update hrtimer when called by tpacket_rcv is that
a newly opened block triggered by tpacket_rcv may be retired earlier than
expected. On the other hand, if timeout is updated in prb_open_block, the
frequent reception of network packets that leads to prb_open_block being
called may cause hrtimer to be removed and enqueued repeatedly.
The retire hrtimer expiration is unconditional and periodic. If there are
numerous packet sockets on the system, please set an appropriate timeout
to avoid frequent enqueueing of hrtimers.
kactive_blk_num (K) is only incremented on block close.
In timer callback prb_retire_rx_blk_timer_expired, except delete_blk_timer
is true, last_kactive_blk_num (L) is set to match kactive_blk_num (K) in
all cases. L is also set to match K in prb_open_block.
The only case K not equal to L is when scheduled by tpacket_rcv
and K is just incremented on block close but no new block could be opened,
so that it does not call prb_open_block in prb_dispatch_next_block.
This patch modifies the prb_retire_rx_blk_timer_expired function by simply
removing the check for L == K. This patch just provides another checkpoint
to thaw the might-be-frozen block in any case. It doesn't have any effect
because __packet_lookup_frame_in_block() has the same logic and does it
again without this patch when detecting the ring is frozen. The patch only
advances checking the status of the ring.
The daughter driver rcar_gen4_ptp used by both rswitch and rtsn where
upstreamed with support for possible different memory layouts on
different users. With all Gen4 boards upstream no such setup is
documented.
There are other issues related to how the rcar_gen4_ptp driver is shared
between multiple useres that needs to be cleaned up. But that will be a
larger work. So before that get some simple fixes done.
Patch 1/3 and 2/3 removes the support to allow different register
layouts on different SoCs by looking up offsets at runtime with a much
simpler interface. The new interface computes the offsets at compile
time.
While patch 3/3 is a drive-by patch taking a spurs comment and making a
lockdep check of it.
There is no intentional functional change in this series just cleaning
up in preparation of larger works to follow.
====================
With the support for multiple register layout removed all support
structures can be removed from the header file. Covert to a simpler
structure using defines for the register offsets.
There is no functional change, only switching from looking up offsets at
runtime to compile time.
net: ethernet: renesas: rcar_gen4_ptp: Remove different memory layout
When upstreaming the Gen4 PTP support for R-Car S4 the possibility for
different memory layouts on other Gen4 SoCs was build in. It turns out
this is not needed and instead needlessly makes the driver harder to
read, remove the support code that would have allowed different memory
layouts.
This change only deals with the public functions used by other drivers,
follow up work will clean up the rcar_gen4_ptp internals.
Daniel Palmer [Sun, 7 Sep 2025 06:43:49 +0000 (15:43 +0900)]
eth: 8139too: Make 8139TOO_PIO depend on !NO_IOPORT_MAP
When 8139too is probing and 8139TOO_PIO=y it will call pci_iomap_range()
and from there __pci_ioport_map() for the PCI IO space.
If HAS_IOPORT_MAP=n and NO_GENERIC_PCI_IOPORT_MAP=n, like it is on my
m68k config, __pci_ioport_map() becomes NULL, pci_iomap_range() will
always fail and the driver will complain it couldn't map the PIO space
and return an error.
NO_IOPORT_MAP seems to cover the case where what 8139too is trying
to do cannot ever work so make 8139TOO_PIO depend on being it false
and avoid creating an unusable driver.
Jakub Kicinski [Fri, 12 Sep 2025 00:50:46 +0000 (17:50 -0700)]
Merge tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Plenty of things going on, notably:
- iwlwifi: major cleanups/rework
- brcmfmac: gets AP isolation support
- mac80211: gets more S1G support
* tag 'wireless-next-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (94 commits)
wifi: mwifiex: fix endianness handling in mwifiex_send_rgpower_table
wifi: cfg80211: Remove the redundant wiphy_dev
wifi: mac80211: fix incorrect comment
wifi: cfg80211: update the time stamps in hidden ssid
wifi: mac80211: Fix HE capabilities element check
wifi: mac80211: add tx_handlers_drop statistics to ethtool
wifi: mac80211: fix reporting of all valid links in sta_set_sinfo()
wifi: iwlwifi: mld: CHANNEL_SURVEY_NOTIF is always supported
wifi: iwlwifi: mld: remove support of iwl_esr_mode_notif version 1
wifi: iwlwifi: mld: remove support from of sta cmd version 1
wifi: iwlwifi: mld: remove support of roc cmd version 5
wifi: iwlwifi: mld: remove support of mac cmd ver 2
wifi: iwlwifi: mld: don't consider phy cmd version 5
wifi: iwlwifi: implement wowlan status notification API update
wifi: iwlwifi: fw: Add ASUS to PPAG and TAS list
wifi: iwlwifi: add kunit tests for nvm parse
wifi: iwlwifi: api: add a flag to iwl_link_ctx_modify_flags
wifi: iwlwifi: pcie: move ltr_enabled to the specific transport
wifi: iwlwifi: pcie: move pm_support to the specific transport
wifi: iwlwifi: rename iwl_finish_nic_init
...
====================
Remove stubs for fixed_phy_set_link_update() and
fixed_phy_change_carrier() because all callers
(actually just one per function) select config
symbol FIXED_PHY.
Merge tag 'net-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN, netfilter and wireless.
We have an IPv6 routing regression with the relevant fix still a WiP.
This includes a last-minute revert to avoid more problems.
Current release - new code bugs:
- wifi: nl80211: completely disable per-link stats for now
Previous releases - regressions:
- dev_ioctl: take ops lock in hwtstamp lower paths
- netfilter:
- fix spurious set lookup failures
- fix lockdep splat due to missing annotation
- genetlink: fix genl_bind() invoking bind() after -EPERM
- phy: transfer phy_config_inband() locking responsibility to phylink
- can: xilinx_can: fix use-after-free of transmitted SKB
- hsr: fix lock warnings
- eth:
- igb: fix NULL pointer dereference in ethtool loopback test
- i40e: fix Jumbo Frame support after iPXE boot
- macsec: sync features on RTM_NEWLINK
Previous releases - always broken:
- tunnels: reset the GSO metadata before reusing the skb
- mptcp: make sync_socket_options propagate SOCK_KEEPOPEN
* tag 'net-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
Revert "net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups"
hsr: hold rcu and dev lock for hsr_get_port_ndev
hsr: use hsr_for_each_port_rtnl in hsr_port_get_hsr
hsr: use rtnl lock when iterating over ports
wifi: nl80211: completely disable per-link stats for now
net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups
net: ethtool: fix wrong type used in struct kernel_ethtool_ts_info
MAINTAINERS: add Phil as netfilter reviewer
netfilter: nf_tables: restart set lookup on base_seq change
netfilter: nf_tables: make nft_set_do_lookup available unconditionally
netfilter: nf_tables: place base_seq in struct net
netfilter: nft_set_rbtree: continue traversal if element is inactive
netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation
can: rcar_can: rcar_can_resume(): fix s2ram with PSCI
can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB
can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails
can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed
can: j1939: implement NETDEV_UNREGISTER notification handler
selftests: can: enable CONFIG_CAN_VCAN as a module
...
Merge tag 's390-6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- ptep_modify_prot_start() may be called in a loop, which might lead to
the preempt_count overflow due to the unnecessary preemption
disabling. Do not disable preemption to prevent the overflow
- Events of type PERF_TYPE_HARDWARE are not tested for sampling and
return -EOPNOTSUPP eventually.
Instead, deny all sampling events by CPUMF counter facility and
return -ENOENT to allow other PMUs to be tried
- The PAI PMU driver returns -EINVAL if an event out of its range. That
aborts a search for an alternative PMU driver.
Instead, return -ENOENT to allow other PMUs to be tried
* tag 's390-6.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpum_cf: Deny all sampling events by counter PMU
s390/pai: Deny all events not handled by this PMU
s390/mm: Prevent possible preempt_count overflow
Merge tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a nasty hibernation regression introduced during the 6.16
cycle, an issue related to energy model management occurring on Intel
hybrid systems where some CPUs are offline to start with, and two
regressions in the amd-pstate driver:
- Restore a pm_restrict_gfp_mask() call in hibernation_snapshot()
that was removed incorrectly during the 6.16 development cycle
(Rafael Wysocki)
- Introduce a function for registering a perf domain without
triggering a system-wide CPU capacity update and make the
intel_pstate driver use it to avoid reocurring unsuccessful
attempts to update capacities of all CPUs in the system (Rafael
Wysocki)
- Fix setting of CPPC.min_perf in the active mode with performance
governor in the amd-pstate driver to restore its expected behavior
changed recently (Gautham Shenoy)
- Avoid mistakenly setting EPP to 0 in the amd-pstate driver after
system resume as a result of recent code changes (Mario
Limonciello)"
* tag 'pm-6.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Restrict GFP mask in hibernation_snapshot()
PM: EM: Add function for registering a PD without capacity update
cpufreq/amd-pstate: Fix a regression leading to EPP 0 after resume
cpufreq/amd-pstate: Fix setting of CPPC.min_perf in active mode for performance governor
Merge tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix delayed inode tracking in xarray, eviction can race with
insertion and leave behind a disconnected inode
- on systems with large page (64K) and small block size (4K) fix
compression read that can return partially filled folio
- slightly relax compression option format for backward compatibility,
allow to specify level for LZO although there's only one
- fix simple quota accounting of compressed extents
- validate minimum device size in 'device add'
- update maintainers' entry
* tag 'for-6.17-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't allow adding block device of less than 1 MB
MAINTAINERS: update btrfs entry
btrfs: fix subvolume deletion lockup caused by inodes xarray race
btrfs: fix corruption reading compressed range when block size is smaller than page size
btrfs: accept and ignore compression level for lzo
btrfs: fix squota compressed stats leak
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
"A number of fixes accumulated due to summer vacations
- Fix out-of-bounds dynptr write in bpf_crypto_crypt() kfunc which
was misidentified as a security issue (Daniel Borkmann)
- Update the list of BPF selftests maintainers (Eduard Zingerman)
- Fix selftests warnings with icecc compiler (Ilya Leoshkevich)
- Disable XDP/cpumap direct return optimization (Jesper Dangaard
Brouer)
- Fix unexpected get_helper_proto() result in unusual configuration
BPF_SYSCALL=y and BPF_EVENTS=n (Jiri Olsa)
- Allow fallback to interpreter when JIT support is limited (KaFai
Wan)
- Fix rqspinlock and choose trylock fallback for NMI waiters. Pick
the simplest fix. More involved fix is targeted bpf-next (Kumar
Kartikeya Dwivedi)
- Fix cleanup when tcp_bpf_send_verdict() fails to allocate
psock->cork (Kuniyuki Iwashima)
- Disallow bpf_timer in PREEMPT_RT for now. Proper solution is being
discussed for bpf-next. (Leon Hwang)
- Fix XSK cq descriptor production (Maciej Fijalkowski)
- Tell memcg to use allow_spinning=false path in bpf_timer_init() to
avoid lockup in cgroup_file_notify() (Peilin Ye)
- Fix bpf_strnstr() to handle suffix match cases (Rong Tao)"
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Skip timer cases when bpf_timer is not supported
bpf: Reject bpf_timer for PREEMPT_RT
tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork.
bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()
bpf: Allow fall back to interpreter for programs with stack size <= 512
rqspinlock: Choose trylock fallback for NMI waiters
xsk: Fix immature cq descriptor production
bpf: Update the list of BPF selftests maintainers
selftests/bpf: Add tests for bpf_strnstr
selftests/bpf: Fix "expression result unused" warnings with icecc
bpf: Fix bpf_strnstr() to handle suffix match cases better
selftests/bpf: Extend crypto_sanity selftest with invalid dst buffer
bpf: Fix out-of-bounds dynptr write in bpf_crypto_crypt
bpf: Check the helper function is valid in get_helper_proto
bpf, cpumap: Disable page_pool direct xdp_return need larger scope
Paolo Abeni [Thu, 11 Sep 2025 14:33:31 +0000 (16:33 +0200)]
Revert "net: usb: asix: ax88772: drop phylink use in PM to avoid MDIO runtime PM wakeups"
This reverts commit 5537a4679403 ("net: usb: asix: ax88772: drop
phylink use in PM to avoid MDIO runtime PM wakeups"), it breaks
operation of asix ethernet usb dongle after system suspend-resume
cycle.
Florian Westphal [Fri, 29 Aug 2025 15:01:02 +0000 (17:01 +0200)]
netfilter: nf_reject: don't reply to icmp error messages
tcp reject code won't reply to a tcp reset.
But the icmp reject 'netdev' family versions will reply to icmp
dst-unreach errors, unlike icmp_send() and icmp6_send() which are used
by the inet family implementation (and internally by the REJECT target).
Check for the icmp(6) type and do not respond if its an unreachable error.
Without this, something like 'ip protocol icmp reject', when used
in a netdev chain attached to 'lo', cause a packet loop.
Same for two hosts that both use such a rule: each error packet
will be replied to.
Such situation persist until the (bogus) rule is amended to ratelimit or
checks the icmp type before the reject statement.
As the inet versions don't do this make the netdev ones follow along.
KCSAN reported a data-race on the `ipvs->enable` flag, which is
written in the control path and read concurrently from many other
contexts.
Following a suggestion by Julian, this patch fixes the race by
converting all accesses to use `WRITE_ONCE()/READ_ONCE()`.
This lightweight approach ensures atomic access and acts as a
compiler barrier, preventing unsafe optimizations where the flag
is checked in loops (e.g., in ip_vs_est.c).
Additionally, the `enable` checks in the fast-path hooks
(`ip_vs_in_hook`, `ip_vs_out_hook`, `ip_vs_forward_icmp`) are
removed. These are unnecessary since commit 857ca89711de
("ipvs: register hooks only with services"). The `enable=0`
condition they check for can only occur in two rare and non-fatal
scenarios: 1) after hooks are registered but before the flag is set,
and 2) after hooks are unregistered on cleanup_net. In the worst
case, a single packet might be mishandled (e.g., dropped), which
does not lead to a system crash or data corruption. Adding a check
in the performance-critical fast-path to handle this harmless
condition is not a worthwhile trade-off.
Paolo Abeni [Thu, 11 Sep 2025 10:49:52 +0000 (12:49 +0200)]
Merge tag 'wireless-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Some more fixes:
- iwlwifi: fix 130/1030 devices
- ath12k: fix alignment, power save
- virt_wifi: fix crash
- cfg80211: disable per-link stats due
to buffer size issues
* tag 'wireless-2025-09-11' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: nl80211: completely disable per-link stats for now
wifi: virt_wifi: Fix page fault on connect
wifi: cfg80211: Fix "no buffer space available" error in nl80211_get_station() for MLO
wifi: iwlwifi: fix 130/1030 configs
wifi: ath12k: fix WMI TLV header misalignment
wifi: ath12k: Fix missing station power save configuration
====================
====================
ipv4: icmp: Fix source IP derivation in presence of VRFs
Align IPv4 with IPv6 and in the presence of VRFs generate ICMP error
messages with a source IP that is derived from the receiving interface
and not from its VRF master. This is especially important when the error
messages are "Time Exceeded" messages as it means that utilities like
traceroute will show an incorrect packet path.
Patches #1-#2 are preparations.
Patch #3 is the actual change.
Patches #4-#7 make small improvements in the existing traceroute test.
Patch #8 extends the traceroute test with VRF test cases for both IPv4
and IPv6.
Create versions of the existing test cases where the routers generating
the ICMP error messages are using VRFs. Check that the source IPs of
these messages do not change in the presence of VRFs.
IPv6 always behaved correctly, but IPv4 fails when reverting "ipv4:
icmp: Fix source IP derivation in presence of VRFs".
Without IPv4 change:
# ./traceroute.sh
TEST: IPv6 traceroute [ OK ]
TEST: IPv6 traceroute with VRF [ OK ]
TEST: IPv4 traceroute [ OK ]
TEST: IPv4 traceroute with VRF [FAIL]
traceroute did not return 1.0.3.1
$ echo $?
1
The test fails because the ICMP error message is sent with the VRF
device's IP (1.0.4.1):
# traceroute -n -s 1.0.1.3 1.0.2.4
traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
1 1.0.4.1 0.165 ms 0.110 ms 0.103 ms
2 1.0.2.4 0.098 ms 0.085 ms 0.078 ms
# traceroute -n -s 1.0.3.3 1.0.2.4
traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
1 1.0.4.1 0.201 ms 0.138 ms 0.129 ms
2 1.0.2.4 0.123 ms 0.105 ms 0.098 ms
With IPv4 change:
# ./traceroute.sh
TEST: IPv6 traceroute [ OK ]
TEST: IPv6 traceroute with VRF [ OK ]
TEST: IPv4 traceroute [ OK ]
TEST: IPv4 traceroute with VRF [ OK ]
$ echo $?
0
selftests: traceroute: Test traceroute with different source IPs
When generating ICMP error messages, the kernel will prefer a source IP
that is on the same subnet as the destination IP (see
inet_select_addr()). Test this behavior by invoking traceroute with
different source IPs and checking that the ICMP error message is
generated with a source IP in the same subnet.
Both of the addresses are configured as primary addresses, but the
kernel is expected to choose 10.0.1.1/24 as the source IP of the ICMP
error message since it is on the same subnet as the destination IP of
the message (10.0.1.3/24). Reword the comment to reflect that.
selftests: traceroute: Return correct value on failure
The test always returns success even if some tests were modified to
fail. Fix by converting the test to use the appropriate library
functions instead of using its own functions.
ipv4: icmp: Fix source IP derivation in presence of VRFs
When the "icmp_errors_use_inbound_ifaddr" sysctl is enabled, the source
IP of ICMP error messages should be the "primary address of the
interface that received the packet that caused the icmp error".
The IPv4 ICMP code determines this interface using inet_iif() which in
the input path translates to skb->skb_iif. If the interface that
received the packet is a VRF port, skb->skb_iif will contain the ifindex
of the VRF device and not that of the receiving interface. This is
because in the input path the VRF driver overrides skb->skb_iif with the
ifindex of the VRF device itself (see vrf_ip_rcv()).
As such, the source IP that will be chosen for the ICMP error message is
either an address assigned to the VRF device itself (if present) or an
address assigned to some VRF port, not necessarily the input or output
interface.
This behavior is especially problematic when the error messages are
"Time Exceeded" messages as it means that utilities like traceroute will
show an incorrect packet path.
Solve this by determining the input interface based on the iif field in
the control block, if present. This field is set in the input path to
skb->skb_iif and is not later overridden by the VRF driver, unlike
skb->skb_iif.
This behavior is consistent with the IPv6 counterpart that already uses
the iif from the control block.
Reported-by: Andy Roulin <aroulin@nvidia.com> Reported-by: Rajkumar Srinivasan <rajsrinivasa@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250908073238.119240-4-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv4: icmp: Pass IPv4 control block structure as an argument to __icmp_send()
__icmp_send() is used to generate ICMP error messages in response to
various situations such as MTU errors (i.e., "Fragmentation Required")
and too many hops (i.e., "Time Exceeded").
The skb that generated the error does not necessarily come from the IPv4
layer and does not always have a valid IPv4 control block in skb->cb.
Therefore, commit 9ef6b42ad6fd ("net: Add __icmp_send helper.") changed
the function to take the IP options structure as argument instead of
deriving it from the skb's control block. Some callers of this function
such as icmp_send() pass the IP options structure from the skb's control
block as in these call paths the control block is known to be valid, but
other callers simply pass a zeroed structure.
A subsequent patch will need __icmp_send() to access more information
from the IPv4 control block (specifically, the ifindex of the input
interface). As a preparation for this change, change the function to
take the IPv4 control block structure as an argument instead of the IP
options structure. This makes the function similar to its IPv6
counterpart that already takes the IPv6 control block structure as an
argument.
No functional changes intended.
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv4: cipso: Simplify IP options handling in cipso_v4_error()
When __ip_options_compile() is called with an skb, the IP options are
parsed from the skb data into the provided IP option argument. This is
in contrast to the case where the skb argument is NULL and the options
are parsed from opt->__data.
Given that cipso_v4_error() always passes an skb to
__ip_options_compile(), there is no need to allocate an extra 40 bytes
(maximum IP options size).
Therefore, simplify the function by removing these extra bytes and make
the function similar to ipv4_send_dest_unreach() which also calls both
__ip_options_compile() and __icmp_send().
This is a preparation for changing the arguments being passed to
__icmp_send().
No functional changes intended.
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-2-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net: xdp: handle frags with unreadable memory
Make XDP helpers compatible with unreadable memory. This is very
similar to how we handle pfmemalloc frags today. Record the info
in xdp_buf flags as frags get added and then update the skb once
allocated.
This series adds the unreadable memory metadata tracking to drivers
using xdp_build_skb_from*() with no changes on the driver side - hence
the only driver changes here are refactoring. Obviously, unreadable memory
is incompatible with XDP today, but thanks to xdp_build_skb_from_buf()
increasing number of drivers have a unified datapath, whether XDP is
enabled or not.
Jakub Kicinski [Fri, 5 Sep 2025 22:15:39 +0000 (15:15 -0700)]
net: xdp: handle frags with unreadable memory
We don't expect frags with unreadable memory to be presented
to XDP programs today, but the XDP helpers are designed to be
usable whether XDP is enabled or not. Support handling frags
with unreadable memory.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-3-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Fri, 5 Sep 2025 22:15:38 +0000 (15:15 -0700)]
net: xdp: pass full flags to xdp_update_skb_shared_info()
xdp_update_skb_shared_info() needs to update skb state which
was maintained in xdp_buff / frame. Pass full flags into it,
instead of breaking it out bit by bit. We will need to add
a bit for unreadable frags (even tho XDP doesn't support
those the driver paths may be common), at which point almost
all call sites would become:
Keep a helper for accessing the flags, in case we need to
transform them somehow in the future (e.g. to cover up xdp_buff
vs xdp_frame differences).
While we are touching call callers - rename the helper to
xdp_update_skb_frags_info(), previous name may have implied that
it's shinfo that's updated. We are updating flags in struct sk_buff
based on frags that got attched.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adding rcu_read_lock() for all hsr_for_each_port() looks confusing.
Introduce a new helper, hsr_for_each_port_rtnl(), that assumes the
RTNL lock is held. This allows callers in suitable contexts to iterate
ports safely without explicit RCU locking.
Other code paths that rely on RCU protection continue to use
hsr_for_each_port() with rcu_read_lock().
====================
Hangbin Liu [Fri, 5 Sep 2025 09:15:33 +0000 (09:15 +0000)]
hsr: hold rcu and dev lock for hsr_get_port_ndev
hsr_get_port_ndev calls hsr_for_each_port, which need to hold rcu lock.
On the other hand, before return the port device, we need to hold the
device reference to avoid UaF in the caller function.
Suggested-by: Paolo Abeni <pabeni@redhat.com> Fixes: 9c10dd8eed74 ("net: hsr: Create and export hsr_get_port_ndev()") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250905091533.377443-4-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu [Fri, 5 Sep 2025 09:15:32 +0000 (09:15 +0000)]
hsr: use hsr_for_each_port_rtnl in hsr_port_get_hsr
hsr_port_get_hsr() iterates over ports using hsr_for_each_port(),
but many of its callers do not hold the required RCU lock.
Switch to hsr_for_each_port_rtnl(), since most callers already hold
the rtnl lock. After review, all callers are covered by either the rtnl
lock or the RCU lock, except hsr_dev_xmit(). Fix this by adding an
RCU read lock there.
Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250905091533.377443-3-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hangbin Liu [Fri, 5 Sep 2025 09:15:31 +0000 (09:15 +0000)]
hsr: use rtnl lock when iterating over ports
hsr_for_each_port is called in many places without holding the RCU read
lock, this may trigger warnings on debug kernels. Most of the callers
are actually hold rtnl lock. So add a new helper hsr_for_each_port_rtnl
to allow callers in suitable contexts to iterate ports safely without
explicit RCU locking.
This patch only fixed the callers that is hold rtnl lock. Other caller
issues will be fixed in later patches.
Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250905091533.377443-2-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Marc Harvey [Fri, 5 Sep 2025 04:04:41 +0000 (04:04 +0000)]
selftests: net: Add tests to verify team driver option set and get.
There are currently no kernel tests that verify setting and getting
options of the team driver.
In the future, options may be added that implicitly change other
options, which will make it useful to have tests like these that show
nothing breaks. There will be a follow up patch to this that adds new
"rx_enabled" and "tx_enabled" options, which will implicitly affect the
"enabled" option value and vice versa.
The tests use teamnl to first set options to specific values and then
gets them to compare to the set values.
Johannes Berg [Wed, 10 Sep 2025 13:11:21 +0000 (15:11 +0200)]
wifi: nl80211: completely disable per-link stats for now
After commit 8cc71fc3b82b ("wifi: cfg80211: Fix "no buffer
space available" error in nl80211_get_station() for MLO"),
the per-link data is only included in station dumps, where
the size limit is somewhat less of an issue. However, it's
still an issue, depending on how many links a station has
and how much per-link data there is. Thus, for now, disable
per-link statistics entirely.
A complete fix will need to take this into account, make it
opt-in by userspace, and change the dump format to be able
to split a single station's data across multiple netlink
dump messages, which all together is too much development
for a fix.
Fixes: 82d7f841d9bd ("wifi: cfg80211: extend to embed link level statistics in NL message") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Merge tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 15 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 14 of these
fixes are for MM.
This includes
- kexec fixes from Breno for a recently introduced
use-uninitialized bug
- DAMON fixes from Quanmin Yan to avoid div-by-zero crashes
which can occur if the operator uses poorly-chosen insmod
parameters
and misc singleton fixes"
* tag 'mm-hotfixes-stable-2025-09-10-20-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add tree entry to numa memblocks and emulation block
mm/damon/sysfs: fix use-after-free in state_show()
proc: fix type confusion in pde_set_flags()
compiler-clang.h: define __SANITIZE_*__ macros only when undefined
mm/vmalloc, mm/kasan: respect gfp mask in kasan_populate_vmalloc()
ocfs2: fix recursive semaphore deadlock in fiemap call
mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory
mm/mremap: fix regression in vrm->new_addr check
percpu: fix race on alloc failed warning limit
mm/memory-failure: fix redundant updates for already poisoned pages
s390: kexec: initialize kexec_buf struct
riscv: kexec: initialize kexec_buf struct
arm64: kexec: initialize kexec_buf struct in load_other_segments()
mm/damon/reclaim: avoid divide-by-zero in damon_reclaim_apply_parameters()
mm/damon/lru_sort: avoid divide-by-zero in damon_lru_sort_apply_parameters()
mm/damon/core: set quota->charged_from to jiffies at first charge window
mm/hugetlb: add missing hugetlb_lock in __unmap_hugepage_range()
init/main.c: fix boot time tracing crash
mm/memory_hotplug: fix hwpoisoned large folio handling in do_migrate_range()
mm/khugepaged: fix the address passed to notifier on testing young
Merge tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vmescape mitigation fixes from Dave Hansen:
"Mitigate vmscape issue with indirect branch predictor flushes.
vmscape is a vulnerability that essentially takes Spectre-v2 and
attacks host userspace from a guest. It particularly affects
hypervisors like QEMU.
Even if a hypervisor may not have any sensitive data like disk
encryption keys, guest-userspace may be able to attack the
guest-kernel using the hypervisor as a confused deputy.
There are many ways to mitigate vmscape using the existing Spectre-v2
defenses like IBRS variants or the IBPB flushes. This series focuses
solely on IBPB because it works universally across vendors and all
vulnerable processors. Further work doing vendor and model-specific
optimizations can build on top of this if needed / wanted.
Do the normal issue mitigation dance:
- Add the CPU bug boilerplate
- Add a list of vulnerable CPUs
- Use IBPB to flush the branch predictors after running guests"
* tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmscape: Add old Intel CPUs to affected list
x86/vmscape: Warn when STIBP is disabled with SMT
x86/bugs: Move cpu_bugs_smt_update() down
x86/vmscape: Enable the mitigation
x86/vmscape: Add conditional IBPB mitigation
x86/vmscape: Enumerate VMSCAPE bug
Documentation/hw-vuln: Add VMSCAPE documentation
First patch adds a lockdep annotation for a false-positive splat.
Last patch adds formal reviewer tag for Phil Sutter to MAINTAINERS.
Rest of the patches resolve spurious false negative results during set
lookups while another CPU is processing a transaction.
This has been broken at least since v4.18 when an unconditional
synchronize_rcu call was removed from the commit phase of nf_tables.
Quoting from Stefan Hanreichs original report:
It seems like we've found an issue with atomicity when reloading
nftables rulesets. Sometimes there is a small window where rules
containing sets do not seem to apply to incoming traffic, due to the set
apparently being empty for a short amount of time when flushing / adding
elements.
Exanple ruleset:
table ip filter {
set match {
type ipv4_addr
flags interval
elements = { 0.0.0.0-192.168.2.19, 192.168.2.21-255.255.255.255 }
}
chain pre {
type filter hook prerouting priority filter; policy accept;
ip saddr @match accept
counter comment "must never match"
}
}
Reproducer transaction:
while true:
nft -f -<<EOF
flush set ip filter match
create element ip filter match { \
0.0.0.0-192.168.2.19, 192.168.2.21-255.255.255.255 }
EOF
done
Then create traffic. to/from e.g. 192.168.2.1 to 192.168.3.10.
Once in a while the counter will increment even though the
'ip saddr @match' rule should have accepted the packet.
See individual patches for details.
Thanks to Stefan Hanreich for an initial description and reproducer for
this bug and to Pablo Neira Ayuso for reviewing earlier iterations of
the patchset.
* tag 'nf-25-09-10-v2' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
MAINTAINERS: add Phil as netfilter reviewer
netfilter: nf_tables: restart set lookup on base_seq change
netfilter: nf_tables: make nft_set_do_lookup available unconditionally
netfilter: nf_tables: place base_seq in struct net
netfilter: nft_set_rbtree: continue traversal if element is inactive
netfilter: nft_set_pipapo: don't check genbit from packetpath lookups
netfilter: nft_set_bitmap: fix lockdep splat due to missing annotation
====================
Jakub Kicinski [Thu, 11 Sep 2025 02:21:11 +0000 (19:21 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-09-09 (igb, i40e)
For igb:
Tianyu Xu removes passing of, no longer needed, NAPI id to avoid NULL
pointer dereference on ethtool loopback testing.
Kohei Enju corrects reporting/testing of link state when interface is
down.
For i40e:
Michal Schmidt corrects value being passed to free_irq().
Jake sets hardware maximum frame size on probe to ensure
expected/consistent state.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
i40e: fix Jumbo Frame support after iPXE boot
i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path
igb: fix link test skipping when interface is admin down
igb: Fix NULL pointer dereference in ethtool loopback test
====================
Jakub Kicinski [Tue, 9 Sep 2025 22:38:37 +0000 (15:38 -0700)]
selftests: net: replace sleeps in fcnal-test with waits
fcnal-test.sh already includes lib.sh, use relevant helpers
instead of sleeping. Replace sleep after starting nettest
as a server with wait_local_port_listen.
====================
tools: ynl: fix errors reported by Ruff
When looking at the YNL code to add a new feature, my text editor
automatically executed 'ruff check', and found out at least one
interesting error: one variable was used while not being defined.
I then decided to fix this error, and all the other ones reported by
Ruff. After this series, 'ruff check' reports no more errors with
version 0.12.12.
====================
tools: ynl: remove f-string without any placeholders
'f-strings' without any placeholders don't need to be marked as such
according to Ruff. This 'f' can be safely removed.
This is linked to Ruff error F541 [1]:
f-strings are a convenient way to format strings, but they are not
necessary if there are no placeholder expressions to format. In this
case, a regular string should be used instead, as an f-string without
placeholders can be confusing for readers, who may expect such a
placeholder to be present.