Create versions of the existing test cases where the routers generating
the ICMP error messages are using VRFs. Check that the source IPs of
these messages do not change in the presence of VRFs.
IPv6 always behaved correctly, but IPv4 fails when reverting "ipv4:
icmp: Fix source IP derivation in presence of VRFs".
Without IPv4 change:
# ./traceroute.sh
TEST: IPv6 traceroute [ OK ]
TEST: IPv6 traceroute with VRF [ OK ]
TEST: IPv4 traceroute [ OK ]
TEST: IPv4 traceroute with VRF [FAIL]
traceroute did not return 1.0.3.1
$ echo $?
1
The test fails because the ICMP error message is sent with the VRF
device's IP (1.0.4.1):
# traceroute -n -s 1.0.1.3 1.0.2.4
traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
1 1.0.4.1 0.165 ms 0.110 ms 0.103 ms
2 1.0.2.4 0.098 ms 0.085 ms 0.078 ms
# traceroute -n -s 1.0.3.3 1.0.2.4
traceroute to 1.0.2.4 (1.0.2.4), 30 hops max, 60 byte packets
1 1.0.4.1 0.201 ms 0.138 ms 0.129 ms
2 1.0.2.4 0.123 ms 0.105 ms 0.098 ms
With IPv4 change:
# ./traceroute.sh
TEST: IPv6 traceroute [ OK ]
TEST: IPv6 traceroute with VRF [ OK ]
TEST: IPv4 traceroute [ OK ]
TEST: IPv4 traceroute with VRF [ OK ]
$ echo $?
0
selftests: traceroute: Test traceroute with different source IPs
When generating ICMP error messages, the kernel will prefer a source IP
that is on the same subnet as the destination IP (see
inet_select_addr()). Test this behavior by invoking traceroute with
different source IPs and checking that the ICMP error message is
generated with a source IP in the same subnet.
Both of the addresses are configured as primary addresses, but the
kernel is expected to choose 10.0.1.1/24 as the source IP of the ICMP
error message since it is on the same subnet as the destination IP of
the message (10.0.1.3/24). Reword the comment to reflect that.
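For context, the helper the test exercises, with a simplified description
of its behaviour (the prototype is the existing in-tree one; the example in
the comment reuses the addresses above):

  /* net/ipv4/devinet.c: returns a primary address configured on @dev,
   * preferring one whose subnet contains @dst. */
  __be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope);

  /* e.g. with 10.0.1.1/24 (and another primary address) on the device,
   * an ICMP error destined to 10.0.1.3 is expected to be sourced from
   * 10.0.1.1. */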
selftests: traceroute: Return correct value on failure
The test always returns success, even if some test cases are modified to
fail. Fix this by converting the test to use the appropriate library
functions instead of its own functions.
ipv4: icmp: Fix source IP derivation in presence of VRFs
When the "icmp_errors_use_inbound_ifaddr" sysctl is enabled, the source
IP of ICMP error messages should be the "primary address of the
interface that received the packet that caused the icmp error".
The IPv4 ICMP code determines this interface using inet_iif() which in
the input path translates to skb->skb_iif. If the interface that
received the packet is a VRF port, skb->skb_iif will contain the ifindex
of the VRF device and not that of the receiving interface. This is
because in the input path the VRF driver overrides skb->skb_iif with the
ifindex of the VRF device itself (see vrf_ip_rcv()).
As such, the source IP that will be chosen for the ICMP error message is
either an address assigned to the VRF device itself (if present) or an
address assigned to some VRF port, not necessarily the input or output
interface.
This behavior is especially problematic when the error messages are
"Time Exceeded" messages as it means that utilities like traceroute will
show an incorrect packet path.
Solve this by determining the input interface based on the iif field in
the control block, if present. This field is set in the input path to
skb->skb_iif and is not later overridden by the VRF driver, unlike
skb->skb_iif.
This behavior is consistent with the IPv6 counterpart that already uses
the iif from the control block.
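A minimal sketch of that direction (an assumption for illustration, not
the exact upstream hunk):

  /* Derive the ingress ifindex from the IPv4 control block, which still
   * points at the real receiving port, and only fall back to
   * inet_iif()/skb->skb_iif when it is unset. */
  static int icmp_input_ifindex(const struct sk_buff *skb,
                                const struct inet_skb_parm *parm)
  {
          return parm->iif ? parm->iif : inet_iif(skb);
  }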
Reported-by: Andy Roulin <aroulin@nvidia.com> Reported-by: Rajkumar Srinivasan <rajsrinivasa@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250908073238.119240-4-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv4: icmp: Pass IPv4 control block structure as an argument to __icmp_send()
__icmp_send() is used to generate ICMP error messages in response to
various situations such as MTU errors (i.e., "Fragmentation Required")
and too many hops (i.e., "Time Exceeded").
The skb that generated the error does not necessarily come from the IPv4
layer and does not always have a valid IPv4 control block in skb->cb.
Therefore, commit 9ef6b42ad6fd ("net: Add __icmp_send helper.") changed
the function to take the IP options structure as argument instead of
deriving it from the skb's control block. Some callers of this function
such as icmp_send() pass the IP options structure from the skb's control
block as in these call paths the control block is known to be valid, but
other callers simply pass a zeroed structure.
A subsequent patch will need __icmp_send() to access more information
from the IPv4 control block (specifically, the ifindex of the input
interface). As a preparation for this change, change the function to
take the IPv4 control block structure as an argument instead of the IP
options structure. This makes the function similar to its IPv6
counterpart that already takes the IPv6 control block structure as an
argument.
No functional changes intended.
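Sketch of the prototype change being described (the "before" signature is
the existing one; the new parameter type and name are assumptions):

  /* before */
  void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
                   const struct ip_options *opt);

  /* after: pass the whole IPv4 control block instead */
  void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
                   const struct inet_skb_parm *parm);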
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-3-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ipv4: cipso: Simplify IP options handling in cipso_v4_error()
When __ip_options_compile() is called with an skb, the IP options are
parsed from the skb data into the provided IP option argument. This is
in contrast to the case where the skb argument is NULL and the options
are parsed from opt->__data.
Given that cipso_v4_error() always passes an skb to
__ip_options_compile(), there is no need to allocate an extra 40 bytes
(maximum IP options size).
Therefore, simplify the function by removing these extra bytes and make
the function similar to ipv4_send_dest_unreach() which also calls both
__ip_options_compile() and __icmp_send().
This is a preparation for changing the arguments being passed to
__icmp_send().
No functional changes intended.
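For reference, a hedged sketch of the resulting pattern, modelled on
ipv4_send_dest_unreach() as mentioned above (not the exact cipso_v4_error()
hunk; the ICMP type/code are illustrative):

  struct ip_options opt;

  memset(&opt, 0, sizeof(opt));
  if (ip_hdr(skb)->ihl > 5) {
          opt.optlen = ip_hdr(skb)->ihl * 4 - sizeof(struct iphdr);
          /* options are parsed from the skb data, so no extra 40-byte
           * buffer behind struct ip_options is needed */
          if (__ip_options_compile(dev_net(skb->dev), &opt, skb, NULL))
                  return;
  }
  __icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PKT_FILTERED, 0, &opt);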
Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250908073238.119240-2-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net: xdp: handle frags with unreadable memory
Make XDP helpers compatible with unreadable memory. This is very
similar to how we handle pfmemalloc frags today. Record the info
in the xdp_buff flags as frags get added and then update the skb once
allocated.
This series adds the unreadable memory metadata tracking to drivers
using xdp_build_skb_from*() with no changes on the driver side - hence
the only driver changes here are refactoring. Obviously, unreadable memory
is incompatible with XDP today, but thanks to xdp_build_skb_from_buf()
an increasing number of drivers have a unified datapath, whether XDP is
enabled or not.
Jakub Kicinski [Fri, 5 Sep 2025 22:15:39 +0000 (15:15 -0700)]
net: xdp: handle frags with unreadable memory
We don't expect frags with unreadable memory to be presented
to XDP programs today, but the XDP helpers are designed to be
usable whether XDP is enabled or not. Support handling frags
with unreadable memory.
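For comparison, the existing pfmemalloc handling in condensed form
(xdp_buff_set_frag_pfmemalloc() and xdp_buff_is_frag_pfmemalloc() are
existing helpers; the unreadable-memory flag follows the same shape):

  /* while attaching frags to the xdp_buff */
  if (page_is_pfmemalloc(page))
          xdp_buff_set_frag_pfmemalloc(xdp);

  /* once the skb has been built from the xdp_buff */
  if (xdp_buff_is_frag_pfmemalloc(xdp))
          skb->pfmemalloc = true;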
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-3-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Fri, 5 Sep 2025 22:15:38 +0000 (15:15 -0700)]
net: xdp: pass full flags to xdp_update_skb_shared_info()
xdp_update_skb_shared_info() needs to update skb state which
was maintained in xdp_buff / frame. Pass full flags into it,
instead of breaking it out bit by bit. We will need to add
a bit for unreadable frags (even though XDP doesn't support
those, the driver paths may be common), at which point almost
all call sites would end up taking the same shape.
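A hedged sketch of that shape (argument names are illustrative):

  xdp_update_skb_shared_info(skb, sinfo->nr_frags, sinfo->xdp_frags_size,
                             truesize, flags /* full xdp_buff/frame flags */);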
Keep a helper for accessing the flags, in case we need to
transform them somehow in the future (e.g. to cover up xdp_buff
vs xdp_frame differences).
While we are touching all the callers, rename the helper to
xdp_update_skb_frags_info(); the previous name may have implied that
it's the shinfo that's updated. We are updating flags in struct sk_buff
based on the frags that got attached.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20250905221539.2930285-2-kuba@kernel.org Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Marc Harvey [Fri, 5 Sep 2025 04:04:41 +0000 (04:04 +0000)]
selftests: net: Add tests to verify team driver option set and get.
There are currently no kernel tests that verify setting and getting
options of the team driver.
In the future, options may be added that implicitly change other
options, which will make it useful to have tests like these that show
nothing breaks. There will be a follow up patch to this that adds new
"rx_enabled" and "tx_enabled" options, which will implicitly affect the
"enabled" option value and vice versa.
The tests use teamnl to first set options to specific values and then
get them back to compare against the values that were set.
Jakub Kicinski [Tue, 9 Sep 2025 22:38:37 +0000 (15:38 -0700)]
selftests: net: replace sleeps in fcnal-test with waits
fcnal-test.sh already includes lib.sh, so use the relevant helpers
instead of sleeping. Replace the sleep after starting nettest as a
server with wait_local_port_listen.
====================
tools: ynl: fix errors reported by Ruff
When looking at the YNL code to add a new feature, my text editor
automatically executed 'ruff check' and found at least one interesting
error: one variable was used without being defined.
I then decided to fix this error, and all the other ones reported by
Ruff. After this series, 'ruff check' reports no more errors with
version 0.12.12.
====================
tools: ynl: remove f-string without any placeholders
'f-strings' without any placeholders don't need to be marked as such
according to Ruff. This 'f' can be safely removed.
This is linked to Ruff error F541 [1]:
f-strings are a convenient way to format strings, but they are not
necessary if there are no placeholder expressions to format. In this
case, a regular string should be used instead, as an f-string without
placeholders can be confusing for readers, who may expect such a
placeholder to be present.
This 'except' was used without specifying the exception class according
to Ruff. Here, only the ValueError class is expected and handled.
This is linked to Ruff error E722 [1]:
A bare except catches BaseException which includes KeyboardInterrupt,
SystemExit, Exception, and others. Catching BaseException can make it
hard to interrupt the program (e.g., with Ctrl-C) and can disguise
other problems.
This variable used in the error path was not defined according to Ruff.
msg_format.attr_set is used instead, presumably the one that was
supposed to be used originally.
This is linked to Ruff error F821 [1]:
An undefined name is likely to raise NameError at runtime.
NET_SHAPER is always selected for the MANA driver. When NET_SHAPER is
enabled, netdev_lock_ops_to_full() effectively reduces to an assert that
the lock is held, and that lock is always held in this path.
Remove the redundant netdev_lock_ops_to_full() call.
The 88e1510 PHY has an erratum where the PHY downshift counter is not
cleared after the PHY is suspended (BMCR_PDOWN set) and later resumed
(BMCR_PDOWN cleared). This can cause the gigabit link to intermittently
downshift to a lower speed.
Disabling and re-enabling the downshift feature clears the counter,
allowing the PHY to retry gigabit link negotiation up to the programmed
retry count before downshifting. This behavior has been observed on
copper links.
====================
ptp: add pulse signal loopback support for debugging
Some PTP devices support looping back the periodic pulse signal for
debugging. For example, the PTP device of the QorIQ platform and the NETC v4
Timer have the ability to loop back the pulse signal and record the extts
events for the loopback signal. So we can make sure that the pulse
intervals and their phase alignment are correct from the perspective of
the emitting PHC's time base. In addition, we can use this loopback
feature as a built-in extts event generator when we have no external
equipment which does that. Therefore, add the generic debugfs interfaces
to the ptp_clock driver. The first two patches are separated from the
previous patch set [1]. The third patch is newly added.
[1]: https://lore.kernel.org/imx/20250827063332.1217664-1-wei.fang@nxp.com/ #patch 3 and 9
====================
ptp: qoriq: convert to use generic interfaces to set loopback mode
Since the generic debugfs interfaces for setting the periodic pulse
signal loopback have been added to the ptp_clock driver, convert
the vendor-defined debugfs interfaces to the generic ones.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250905030711.1509648-4-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ptp: netc: add the periodic output signal loopback support
The NETC Timer supports looping back the output pulse signal of Fiper-n
into Trigger-n input, so that users can leverage this feature to validate
some other features without external hardware support. For example, users
can use it to test external trigger stamps (EXTTS), and can combine
EXTTS with loopback mode to check whether the generation time of PPS is
aligned with an integral second of the PHC, or whether the periodic
output signal (PTP_CLK_REQ_PEROUT) is generated at the specified time.
Since ptp_clock_info::perout_loopback() has been added to the ptp_clock
driver as a generic interface to enable or disable the periodic output
signal loopback, netc_timer_perout_loopback() is added as a
callback of ptp_clock_info::perout_loopback().
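A sketch of how a driver wires such a callback (the exact prototype of
ptp_clock_info::perout_loopback() is an assumption here; only the callback
name comes from the text above):

  static int foo_timer_perout_loopback(struct ptp_clock_info *ptp,
                                       unsigned int index, int on)
  {
          /* hypothetical driver hook: route the Fiper-@index output back
           * to the Trigger-@index input when @on is set */
          return 0;
  }

  static const struct ptp_clock_info foo_ptp_caps = {
          /* ... */
          .perout_loopback = foo_timer_perout_loopback,
  };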
Test the generation time of PPS event:
$ echo 0 1 > /sys/kernel/debug/ptp0/perout_loopback
$ echo 1 > /sys/class/ptp/ptp0/pps_enable
$ testptp -d /dev/ptp0 -e 3
external time stamp request okay
event index 0 at 63.000000017
event index 0 at 64.000000017
event index 0 at 65.000000017
Test the generation time of the periodic output signal:
$ echo 0 1 > /sys/kernel/debug/ptp0/perout_loopback
$ echo 0 150 0 1 500000000 > /sys/class/ptp/ptp0/period
$ testptp -d /dev/ptp0 -e 3
external time stamp request okay
event index 0 at 150.000000014
event index 0 at 151.500000015
event index 0 at 153.000000014
ptp: add debugfs interfaces to loop back the periodic output signal
For some PTP devices, they have the capability to loop back the periodic
output signal for debugging, such as the ptp_qoriq device. So add the
generic interfaces to set the periodic output signal loopback, rather
than each vendor having a different implementation.
Show how many channels support the periodic output signal loopback:
$ cat /sys/kernel/debug/ptp<N>/n_perout_loopback
Enable the loopback of the periodic output signal of channel X:
$ echo <X> 1 > /sys/kernel/debug/ptp<N>/perout_loopback
Disable the loopback of the periodic output signal of channel X:
$ echo <X> 0 > /sys/kernel/debug/ptp<N>/perout_loopback
This small series by Dragos covers gaps requested in the initial PCIe
congestion series [1]:
- Make PCIe congestion thresholds configurable via devlink.
- Add a counter for stale PCIe congestion events.
net/mlx5e: Make PCIe congestion event thresholds configurable
Add devlink driverinit parameters for configuring the thresholds for
PCIe congestion events. These parameters are registered only when the
firmware supports this feature.
Update the mlx5 devlink docs as well on these new params.
====================
devlink, mlx5: Add new parameters for link management and SRIOV/eSwitch configurations [part]
This patch series introduces several devlink parameters improving device
configuration capabilities, link management, and SRIOV/eSwitch, by adding
NV config boot time parameters.
Implement the following parameters:
a) total_vfs Parameter:
-----------------------
Adds support for managing the number of VFs (total_vfs) and enabling
SR-IOV (enable_sriov for mlx5) through devlink. These additions enhance
user control over virtualization features directly from standard kernel
interfaces without relying on additional external tools. total_vfs
functionality is critical for environments that require flexible VF
count configuration.
b) CQE Compression Type:
------------------------
Introduces a new devlink parameter, cqe_compress_type, to configure the
rate of CQE compression based on PCIe bus conditions. This setting
provides a balance between compression efficiency and overall NIC
performance under different traffic loads.
====================
Some devices support both symmetric (same value for all PFs) and
asymmetric, while others only support symmetric configuration. This
implementation prefers asymmetric, since it is closer to the devlink
model (per function settings), but falls back to symmetric when needed.
Example usage:
devlink dev param set pci/0000:01:00.0 name total_vfs value <u16> cmode permanent
devlink dev reload pci/0000:01:00.0 action fw_activate
echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove
echo 1 >/sys/bus/pci/rescan
cat /sys/bus/pci/devices/0000:01:00.0/sriov_totalvfs
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Example usage:
devlink dev param set pci/0000:01:00.0 name enable_sriov value {true, false} cmode permanent
devlink dev reload pci/0000:01:00.0 action fw_activate
echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove
echo 1 >/sys/bus/pci/rescan
grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_*
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-4-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5: Implement cqe_compress_type via devlink params
Selects which algorithm should be used by the NIC in order to decide the
rate of CQE compression depending on PCIe bus conditions.
Supported values:
1) balanced: merges fewer CQEs, resulting in a moderate compression ratio
but maintaining a balance between bandwidth savings and performance.
2) aggressive: merges more CQEs into a single entry, achieving a higher
compression rate and maximizing performance, particularly under high
traffic loads.
NICs are typically configured with total_vfs=0, forcing users to rely
on external tools to enable SR-IOV (a widely used and essential feature).
Add a total_vfs parameter to devlink for SR-IOV max VF configurability.
This enables standard kernel tools to manage SR-IOV, addressing the need
for flexible VF configuration.
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
mptcp: make ADD_ADDR retransmission timeout adaptive
Currently, the MPTCP ADD_ADDR notifications are retransmitted after a
fixed timeout controlled by the net.mptcp.add_addr_timeout sysctl knob,
if the corresponding "echo" packets are not received before. This can be
too slow (or too quick), especially with a too cautious default value
set to 2 minutes.
- Patch 1: make ADD_ADDR retransmission timeout adaptive, using the
TCP's retransmission timeout. The corresponding sysctl knob is now
used as a maximum value.
- Patch 2: now that these ADD_ADDR retransmissions can happen faster,
all MPTCP Join subtests checking ADD_ADDR counters accept more
ADD_ADDR than expected (if any). This is aligned with the previous
behaviour, when the ADD_ADDR RTO was lowered to 1 second.
- Patch 3: Some CIs have reported that some MPTCP Join signalling tests
were unstable. It seems that it is due to the time it can take in slow
environments to send a bunch of ADD_ADDR notifications and wait each
time for their echo reply. Use a longer transfer to avoid such errors.
selftests: mptcp: join: allow more time to send ADD_ADDR
When many ADD_ADDR need to be sent, it can take some time to send each
of them, and create new subflows. Some CIs seem to occasionally have
issues with these tests, especially with "debug" kernels.
Two subtests will now run for a slightly longer time: the last two where
3 or more ADD_ADDR are sent during the test.
ADD_ADDR can be retransmitted, and with the parent commit, these
retransmissions can be sent quicker: from 2 minutes to less than one
second.
To avoid false positives where retransmitted ADD_ADDR causes higher
counters than expected, it is required to be more tolerant. Errors are
now only reported when fewer ADD_ADDRs have been sent/received, except
if no ADD_ADDR are expected.
Before the parent commit, the tolerance was present for each test where
the ADD_ADDR could be retransmitted in a reasonable time (1 sec). Now
that all tests can see retransmitted ADD_ADDR, it is normal to apply
the same tolerance to all tests.
An alternative could be to disable the ADD_ADDR retransmissions by
default, but that's changing the default kernel behaviour. Plus,
ADD_ADDR retransmissions can be required for some tests. To avoid adding
exceptions to many tests, it seems better to increase the tolerance.
Later, we could add a new MIB counter to identify the ADD_ADDR
retransmissions, and remove the tolerance when this counter is
available.
mptcp: make ADD_ADDR retransmission timeout adaptive
Currently the ADD_ADDR option is retransmitted with a fixed timeout. This
patch makes the retransmission timeout adaptive by using the maximum RTO
among all the subflows, while still capping it at the configured maximum
value (add_addr_timeout_max). This improves responsiveness when
establishing new subflows.
Specifically:
1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive
timeout.
2. Uses maximum subflow RTO (icsk_rto) when available.
3. Applies exponential backoff based on retransmission count.
4. Maintains fallback to configured max timeout when no RTO data exists.
This slightly changes the behaviour of the MPTCP "add_addr_timeout"
sysctl knob to be used as a maximum instead of a fixed value. But this
is seen as an improvement: the ADD_ADDR might be sent quicker than
before to improve the overall MPTCP connection. Also, the default
value is set to 2 min, which was already way too long, and caused the
ADD_ADDR not to be retransmitted for connections shorter than 2 minutes.
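A minimal sketch of the computation in points 1-4 above (an assumption,
not the upstream code; the iteration helpers and icsk_rto are existing
kernel symbols):

  static unsigned long add_addr_rto(struct mptcp_sock *msk, u8 retransmits,
                                    unsigned long sysctl_max)
  {
          struct mptcp_subflow_context *subflow;
          unsigned long rto = 0;

          mptcp_for_each_subflow(msk, subflow) {
                  struct sock *ssk = mptcp_subflow_tcp_sock(subflow);

                  rto = max(rto, (unsigned long)inet_csk(ssk)->icsk_rto);
          }

          if (!rto)
                  return sysctl_max;      /* no RTO data: use the configured max */

          /* exponential backoff per retransmission, capped at the sysctl max */
          return min(rto << retransmits, sysctl_max);
  }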
Jakub Kicinski [Wed, 10 Sep 2025 01:44:07 +0000 (18:44 -0700)]
Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
idpf: add XDP support
Alexander Lobakin says:
Add XDP support (w/o XSk for now) to the idpf driver using the libeth_xdp
sublib. All possible verdicts, .ndo_xdp_xmit(), multi-buffer etc. are here.
In general, nothing outstanding compared to ice, except performance --
let's say, up to 2x for .ndo_xdp_xmit() on certain platforms and
scenarios.
idpf doesn't support VLAN Rx offload, so only the hash hint is
available for now.
Patches 1-7 are prereqs, without which XDP would either not work at all
or work slower/worse/...
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
idpf: add XDP RSS hash hint
idpf: add support for .ndo_xdp_xmit()
idpf: add support for XDP on Rx
idpf: use generic functions to build xdp_buff and skb
idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
idpf: prepare structures to support XDP
idpf: add support for nointerrupt queues
idpf: remove SW marker handling from NAPI
idpf: add 4-byte completion descriptor definition
idpf: link NAPIs to queues
idpf: use a saner limit for default number of queues to allocate
idpf: fix Rx descriptor ready check barrier in splitq
xdp, libeth: make the xdp_init_buff() micro-optimization generic
====================
vxlan: Make vxlan_fdb_find_uc() more robust against NPDs
first_remote_rcu() can return NULL if the FDB entry points to an FDB
nexthop group instead of a remote destination. However, unlike other
users of first_remote_rcu(), NPD cannot currently happen in
vxlan_fdb_find_uc() as it is only invoked by one driver which vetoes the
creation of FDB nexthops.
Make the function more robust by making sure the remote destination is
only dereferenced if it is not NULL.
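A minimal sketch of the defensive pattern (illustrative only, not the
exact upstream hunk):

  struct vxlan_rdst *rdst = first_remote_rcu(f);

  if (!rdst)              /* the FDB entry points to an FDB nexthop group */
          return -EOPNOTSUPP;
  /* only now is it safe to dereference rdst */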
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250908075141.125087-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is a development artifact of commit a76f26f7a81e ("net: phy:
aquantia: support phy-mode = "10g-qxgmii" on NXP SPF-30841 (AQR412C)").
This function name isn't used. Instead we have aqr_build_fingerprint()
in aquantia_main.c.
Jakub Kicinski [Sat, 6 Sep 2025 21:45:35 +0000 (14:45 -0700)]
selftests: net: speed up pmtu.sh by avoiding unnecessary cleanup
The pmtu test takes nearly an hour when run on a debug kernel
(10min on a normal kernel, so the debug slow down is quite significant).
NIPA tries to ensure all results are delivered by a certain deadline
so this prevents it from retrying the test in case of a flake.
Looks like one of the slowest operations in the test is calling out
to ./openvswitch/ovs-dpctl.py to remove potential leftover OvS interfaces.
Check in sysfs whether the interfaces exist in the first place;
since this can be done directly in bash, it is very fast.
This should save us around 20-30% of the test runtime.
Brett A C Sheffield [Wed, 3 Sep 2025 15:46:01 +0000 (15:46 +0000)]
selftests: net: add test for ipv6 fragmentation
Add selftest for the IPv6 fragmentation regression which affected
several stable kernels.
Commit a18dfa9925b9 ("ipv6: save dontfrag in cork") was backported to
stable without some prerequisite commits. This caused a regression when
sending IPv6 UDP packets by preventing fragmentation and instead
returning -1 (EMSGSIZE).
Add selftest to check for this issue by attempting to send a packet
larger than the interface MTU. The packet will be fragmented on a
working kernel, with sendmsg(2) correctly returning the expected number
of bytes sent. When the regression is present, sendmsg returns -1 and
sets errno to EMSGSIZE.
Hangbin Liu [Tue, 2 Sep 2025 06:55:58 +0000 (06:55 +0000)]
hsr: use netdev_master_upper_dev_link() when linking lower ports
Unlike VLAN devices, HSR changes the lower device’s rx_handler, which
prevents the lower device from being attached to another master.
Switch to using netdev_master_upper_dev_link() when setting up the lower
device.
This also improves the user experience, since ip link will now display
the HSR device as the master for its ports.
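Roughly the shape of the change, as a sketch (variable names are
illustrative; both functions are existing kernel API):

  /* before: plain upper/lower relationship */
  res = netdev_upper_dev_link(port_dev, hsr_dev, extack);

  /* after: mark the HSR device as the master of its port */
  res = netdev_master_upper_dev_link(port_dev, hsr_dev, NULL, NULL, extack);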
Hangbin Liu [Tue, 2 Sep 2025 06:45:01 +0000 (06:45 +0000)]
selftests: bonding: add test for LACP actor port priority
Add comprehensive selftest to verify:
- Per-port actor priority setting via ad_actor_port_prio
- Aggregator selection behavior with port_priority ad_select policy
Also move cmd_jq helper from forwarding/lib.sh to net/lib.sh for
broader reusability across network selftests.
Here is the resulting output:
# ./bond_lacp_prio.sh
TEST: bond 802.3ad (ad_actor_port_prio setting) [ OK ]
TEST: bond 802.3ad (ad_actor_port_prio select) [ OK ]
TEST: bond 802.3ad (ad_actor_port_prio switch) [ OK ]
Hangbin Liu [Tue, 2 Sep 2025 06:45:00 +0000 (06:45 +0000)]
bonding: support aggregator selection based on port priority
Add a new ad_select policy 'port_priority' that uses the per-port
actor priority values (set via ad_actor_port_prio) to determine
aggregator selection.
This allows administrators to influence which ports are preferred
for aggregation by assigning different priority values, providing
more flexible load balancing control in LACP configurations.
Hangbin Liu [Tue, 2 Sep 2025 06:44:59 +0000 (06:44 +0000)]
bonding: add support for per-port LACP actor priority
Introduce a new netlink attribute 'actor_port_prio' to allow setting
the LACP actor port priority on a per-slave basis. This extends the
existing bonding infrastructure to support more granular control over
LACP negotiations.
The priority value is embedded in LACPDU packets and will be used by
subsequent patches to influence aggregator selection policies.
Define the Ports Phy Histogram Configuration Register (PPHCR) to expose
RS-FEC histogram bin ranges, and expose a new counter group in the Ports
Performance Counters Register (PPCNT) to report the corresponding
histogram values.
====================
Support exposing raw cycle counters in PTP and mlx5
This series by Carolina adds support in ptp and usage in mlx5 for
exposing the raw free-running cycle counter of PTP hardware clocks.
This is V2. Find previous one here:
https://lore.kernel.org/all/1752556533-39218-1-git-send-email-tariqt@nvidia.com/
Find detailed description by Carolina below [1].
[1]
This patch series introduces support for exposing the raw free-running
cycle counter of PTP hardware clocks. When the device is in free-running
mode, it emits timestamps as raw cycle values instead of nanoseconds.
These values may be passed directly to user space through:
- fwctl: exposes internal device event records that include raw
cycle-based timestamps.
- DPDK: retrieves CQEs that contain raw cycle counters, which are passed
to user space unmodified.
To address this, the series introduces two new ioctl commands that allow
userspace to query the device's raw cycle counter together with host
time:
- PTP_SYS_OFFSET_PRECISE_CYCLES
- PTP_SYS_OFFSET_EXTENDED_CYCLES
These commands work like their existing counterparts but return the
device timestamp in cycle units instead of real-time nanoseconds. This
allows user space to collect (cycle, time) pairs and build a mapping
between the device’s free-running clock and host time.
This can also be useful in the XDP fast path: if a driver inserts the
raw cycle value into metadata instead of a real-time timestamp, it can
avoid the overhead of converting cycles to time in the kernel. Then
userspace can resolve the cycle-to-time mapping using this ioctl when
needed.
The ioctl enables user space to correlate those with host time, without
requiring the PHC to be synchronized, so long as the drift remains
stable during collection.
Adds the new PTP ioctls and integrates support in ptp_ioctl():
- ptp: Add ioctl commands to expose raw cycle counter values
Support for exposing raw cycles in mlx5:
- net/mlx5: Extract MTCTR register read logic into helper function
- net/mlx5: Support getcyclesx and getcrosscycles
====================
Carolina Jubran [Tue, 12 Aug 2025 14:17:06 +0000 (17:17 +0300)]
ptp: Add ioctl commands to expose raw cycle counter values
Introduce two new ioctl commands, PTP_SYS_OFFSET_PRECISE_CYCLES and
PTP_SYS_OFFSET_EXTENDED_CYCLES, to allow user space to access the
raw free-running cycle counter from PTP devices.
These ioctls are variants of the existing PRECISE and EXTENDED
offset queries, but instead of returning device time in realtime,
they return the raw cycle counter value.
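A userspace sketch of how such a query could look, assuming the _CYCLES
variant reuses the existing struct ptp_sys_offset_precise layout with the
'device' field carrying raw cycles (an assumption, not confirmed by the
text above):

  #include <linux/ptp_clock.h>
  #include <sys/ioctl.h>
  #include <stdio.h>

  static void read_cycles_pair(int fd)
  {
          struct ptp_sys_offset_precise off = { 0 };

          if (ioctl(fd, PTP_SYS_OFFSET_PRECISE_CYCLES, &off) == 0)
                  printf("cycles=%lld.%u sys=%lld.%09u\n",
                         (long long)off.device.sec, off.device.nsec,
                         (long long)off.sys_realtime.sec,
                         off.sys_realtime.nsec);
  }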
In the old days, RDS used FMR (Fast Memory Registration) to register
IB MRs to be used by RDMA. A newer and better verbs-based
registration/de-registration method called FRWR (Fast Registration
Work Request) was added to RDS by commit 1659185fb4d0 ("RDS: IB:
Support Fastreg MR (FRMR) memory registration mode") in 2016.
Detection and enablement of FRWR was done in commit 2cb2912d6563
("RDS: IB: add Fastreg MR (FRMR) detection support"). But said commit
added an extern bool prefer_frmr, which was not used by said commit -
nor used by later commits. Hence, remove it.
Jakub Kicinski [Tue, 9 Sep 2025 01:12:10 +0000 (18:12 -0700)]
Merge branch 'net-stmmac-mdio-cleanups'
Russell King says:
====================
net: stmmac: mdio cleanups
Clean up the stmmac MDIO code:
- provide an address register formatter to avoid repeated code
- provide a common function to wait for the busy bit to clear
- pre-compute the CR field (mdio clock divider)
- move address formatter into read/write functions
- combine the read/write functions into a common accessor function
- move runtime PM handling into common accessor function
- rename register constants to better reflect manufacturer names
- move stmmac_clk_csr_set() into stmmac_mdio
- make stmmac_clk_csr_set() return the CR field value and remove
priv->clk_csr
- clean up if() range tests in stmmac_clk_csr_set()
- use STMMAC_CSR_xxx definitions in initialisers
For Qualcomm QCS9100 Ride R3 board with the AQR115C PHY:
Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com>
====================
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:51 +0000 (13:11 +0100)]
net: stmmac: use STMMAC_CSR_xxx definitions in platform glue
Use the STMMAC_CSR_xxx definitions to initialise plat->clk_csr in the
platform glue drivers to make the integer values meaningful.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oh-00000001vpT-0vk2@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oc-00000001vpN-0S1A@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:40 +0000 (13:11 +0100)]
net: stmmac: mdio: return clk_csr value from stmmac_clk_csr_set()
Return the clk_csr value from stmmac_clk_csr_set() rather than
using priv->clk_csr, as this struct member now serves very little
purpose. This allows us to remove priv->clk_csr.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oW-00000001vpH-46zf@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:35 +0000 (13:11 +0100)]
net: stmmac: mdio: move initialisation of priv->clk_csr to stmmac_mdio
The only user of priv->clk_csr is the MDIO code, so move its
initialisation to stmmac_mdio.c.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oR-00000001vpB-3fbY@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:30 +0000 (13:11 +0100)]
net: stmmac: mdio: improve mdio register field definitions
Include the register name in the definitions, and use a name which
more closely resembles that used in documentation, while still being
descriptive.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oM-00000001vp4-3DC5@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:25 +0000 (13:11 +0100)]
net: stmmac: mdio: move runtime PM into stmmac_mdio_access()
Move the runtime PM handling into the common stmmac_mdio_access()
function, rather than having it in the four top-level bus access
functions.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oH-00000001voy-2jfU@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:20 +0000 (13:11 +0100)]
net: stmmac: mdio: merge stmmac_mdio_read() and stmmac_mdio_write()
stmmac_mdio_read() and stmmac_mdio_write() are virtually identical
except for the final read in stmmac_mdio_read(). Handle this as
a flag.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8oC-00000001vos-2JnA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:15 +0000 (13:11 +0100)]
net: stmmac: mdio: move stmmac_mdio_format_addr() into read/write
Move stmmac_mdio_format_addr() into stmmac_mdio_read() and
stmmac_mdio_write().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8o7-00000001vom-1pN8@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:10 +0000 (13:11 +0100)]
net: stmmac: mdio: provide priv->gmii_address_bus_config
Provide a pre-formatted value for the MDIO address register fields
which remain constant across the various different transactions
rather than recreating the register value from scratch every time.
Currently, we only do this for the CR (clock range) field.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8o2-00000001vog-1LyK@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:05 +0000 (13:11 +0100)]
net: stmmac: mdio: provide stmmac_mdio_wait()
All the readl_poll_timeout()s follow the same pattern - test a register
for a bit being clear every 100us, and timeout after 10ms returning
-EBUSY. Wrap this up into a function to avoid duplicating this.
This slightly changes the return value for stmmac_mdio_write() if the
second readl_poll_timeout() fails - rather than returning -ETIMEDOUT
we return -EBUSY matching the stmmac_mdio_read() behaviour.
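A sketch of what such a helper might look like (the name comes from the
patch subject; the argument list is an assumption):

  static int stmmac_mdio_wait(void __iomem *reg, u32 busy_mask)
  {
          u32 v;

          /* poll every 100us, time out after 10ms */
          if (readl_poll_timeout(reg, v, !(v & busy_mask), 100, 10000))
                  return -EBUSY;

          return 0;
  }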
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8nx-00000001voa-0tJ0@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Thu, 4 Sep 2025 12:11:00 +0000 (13:11 +0100)]
net: stmmac: mdio: provide address register formatter
Rather than duplicating the logic for filling the PA (MDIO address),
GR (MDIO register/devad), CR (clock range) and GB (busy) fields of the
address register in four locations, provide a helper to do this.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Mohd Ayaan Anwar <quic_mohdayaa@quicinc.com> Link: https://patch.msgid.link/E1uu8ns-00000001voU-0S7b@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ipv6: snmp: avoid performance issue with RATELIMITHOST
Addition of ICMP6_MIB_RATELIMITHOST in commit d0941130c9351
("icmp: Add counters for rate limits") introduced a performance drop
in case of DoS (like receiving UDP packets to closed ports).
Per-netns ICMP6_MIB_RATELIMITHOST tracking uses per-cpu storage and
is enough; we do not need per-device and slow tracking for this metric.
In v2 of this series, I completed the removal of SNMP_MIB_SENTINEL
in all the kernel for consistency.
Jakub Kicinski [Sat, 6 Sep 2025 21:13:51 +0000 (14:13 -0700)]
selftests: net: move netlink-dumps back to progs
Commit 9bb88c659673 ("selftests: net: test extacks in netlink dumps")
moved netlink-dumps from TEST_GEN_PROGS to YNL_GEN_FILES.
But _FILES are not for tests, rather for utilities / helpers.
Create YNL_GEN_PROGS and include netlink-dumps there.
This makes netlink-dumps part of executed tests, again.
Jakub Kicinski [Sat, 6 Sep 2025 21:13:50 +0000 (14:13 -0700)]
selftests: net: make the dump test less sensitive to mem accounting
Recent changes to netlink socket memory accounting must
have broken the implicit assumption of the netlink-dump test
that we can fit exactly 64 dumps into the socket. Handle the
failure mode properly, and increase the dump count to 80
to make sure we still run into the error condition if
the default buffer size increases in the future.
Alexander Lobakin [Tue, 26 Aug 2025 15:55:07 +0000 (17:55 +0200)]
idpf: add XDP RSS hash hint
Add &xdp_metadata_ops with a callback to get RSS hash hint from the
descriptor. Declare the splitq 32-byte descriptor as 4 u64s to parse
them more efficiently when possible.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alexander Lobakin [Tue, 26 Aug 2025 15:55:06 +0000 (17:55 +0200)]
idpf: add support for .ndo_xdp_xmit()
Use libeth XDP infra to implement .ndo_xdp_xmit() in idpf.
The Tx callbacks are reused from XDP_TX code. XDP redirect target
feature is set/cleared depending on the XDP prog presence, as for now
we still don't allocate XDP Tx queues when there's no program.
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alexander Lobakin [Tue, 26 Aug 2025 15:55:05 +0000 (17:55 +0200)]
idpf: add support for XDP on Rx
Use libeth XDP infra to support running XDP program on Rx polling.
This includes all of the possible verdicts/actions.
XDP Tx queues are cleaned only in "lazy" mode when there are fewer than
1/4 of the descriptors left free on the ring. The libeth helper macros
used to define driver-specific XDP functions make sure the compiler can
uninline them when needed.
Use __LIBETH_WORD_ACCESS to parse descriptors more efficiently when
applicable. It really gives some good boosts and code size reduction
on x86_64, with the most demanding workloads like XSk xmit differing
by up to 5-8%.
Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alexander Lobakin [Tue, 26 Aug 2025 15:55:04 +0000 (17:55 +0200)]
idpf: use generic functions to build xdp_buff and skb
In preparation for XDP support, move from having an skb as the main
frame container during Rx polling to &xdp_buff.
This allows using generic and libeth helpers for building an XDP
buffer and changes the logic: now we try to allocate an skb only
when we have processed all the descriptors related to the frame.
Store &libeth_xdp_stash instead of the skb pointer on the Rx queue.
It's only 8 bytes wider, but contains everything we may need.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Michal Kubiak [Tue, 26 Aug 2025 15:55:03 +0000 (17:55 +0200)]
idpf: implement XDP_SETUP_PROG in ndo_bpf for splitq
Implement loading/removing an XDP program using the .ndo_bpf callback
in split queue mode. Reconfigure and restart the queues if needed
(!!old_prog != !!new_prog), otherwise, just update the pointers.
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Michal Kubiak [Tue, 26 Aug 2025 15:55:02 +0000 (17:55 +0200)]
idpf: prepare structures to support XDP
Extend basic structures of the driver (e.g. 'idpf_vport', 'idpf_*_queue',
'idpf_vport_user_config_data') by adding members necessary to support XDP.
Add extra XDP Tx queues needed to support XDP_TX and XDP_REDIRECT actions
without interfering with regular Tx traffic.
Also add functions dedicated to support XDP initialization for Rx and
Tx queues and call those functions from the existing algorithms of
queues configuration.
Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Co-developed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Tested-by: Ramu R <ramu.r@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>