The race window can be easily closed by checking inet6_dev->dead
under inet6_dev->mc_lock in __ipv6_dev_mc_inc() as addrconf_ifdown()
will acquire it after marking inet6_dev dead.
Let's check inet6_dev->dead under mc_lock in __ipv6_dev_mc_inc().
Note that now __ipv6_dev_mc_inc() no longer depends on RTNL and
we can remove ASSERT_RTNL() there and the RTNL comment above
addrconf_join_solict().
Jason Xing [Thu, 3 Jul 2025 14:17:12 +0000 (22:17 +0800)]
selftests/bpf: add a new test to check the consumer update case
The subtest sends 33 packets at one time on purpose to see if xsk
exitting __xsk_generic_xmit() updates the global consumer of tx queue
when reaching the max loop (max_tx_budget, 32 by default). The number 33
can avoid xskq_cons_peek_desc() updates the consumer when it's about to
quit sending, to accurately check if the issue that the first patch
resolves remains. The new case will not check this issue in zero copy
mode.
Jason Xing [Thu, 3 Jul 2025 14:17:11 +0000 (22:17 +0800)]
net: xsk: update tx queue consumer immediately after transmission
For afxdp, the return value of sendto() syscall doesn't reflect how many
descs handled in the kernel. One of use cases is that when user-space
application tries to know the number of transmitted skbs and then decides
if it continues to send, say, is it stopped due to max tx budget?
The following formular can be used after sending to learn how many
skbs/descs the kernel takes care of:
Prior to the current patch, in non-zc mode, the consumer of tx queue is
not immediately updated at the end of each sendto syscall when error
occurs, which leads to the consumer value out-of-dated from the perspective
of user space. So this patch requires store operation to pass the cached
value to the shared value to handle the problem.
More than those explicit errors appearing in the while() loop in
__xsk_generic_xmit(), there are a few possible error cases that might
be neglected in the following call trace:
__xsk_generic_xmit()
xskq_cons_peek_desc()
xskq_cons_read_desc()
xskq_cons_is_valid_desc()
It will also cause the premature exit in the while() loop even if not
all the descs are consumed.
Based on the above analysis, using @sent_frame could cover all the possible
cases where it might lead to out-of-dated global state of consumer after
finishing __xsk_generic_xmit().
The patch also adds a common helper __xsk_tx_release() to keep align
with the zc mode usage in xsk_tx_release().
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).
The application uses TCP_INQ for TCP, but AF_UNIX doesn't have it
and requires an extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO)
as an alternative.
Also, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM is more expensive because
it needs to iterate all skb in the receive queue.
This series adds a cached field for SIOCINQ to speed it up and introduce
SO_INQ, the generic version of TCP_INQ to get the queue length as cmsg in
each recvmsg().
====================
Let's add a simple test to check the basic functionality of SO_INQ.
The test does the following:
1. Create socketpair in self->fd[]
2. Enable SO_INQ
3. Send data via self->fd[0]
4. Receive data from self->fd[1]
5. Compare the SCM_INQ cmsg with ioctl(SIOCINQ)
We have an application that uses almost the same code for TCP and
AF_UNIX (SOCK_STREAM).
TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an
extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an
alternative.
Let's introduce the generic version of TCP_INQ.
If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that
contains the exact value of ioctl(SIOCINQ). The cmsg is also
included when msg->msg_get_inq is non-zero to make sockets
io_uring-friendly.
Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to
override setsockopt() for SOL_SOCKET.
By having the flag in struct unix_sock, instead of struct sock, we
can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq.
Note also that supporting custom getsockopt() for SOL_SOCKET will need
preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP).
af_unix: Use cached value for SOCK_STREAM in unix_inq_len().
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more
expensive, as unix_inq_len() requires iterating through the receive queue
and accumulating skb->len.
Let's cache the value for SOCK_STREAM to a new field during sendmsg()
and recvmsg().
The field is protected by the receive queue lock.
Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first
skb in the queue.
SOCK_SEQPACKET still requires iterating through the queue because we do
not touch functions shared with unix_dgram_ops. But, if really needed,
we can support it by switching __skb_try_recv_datagram() to a custom
version.
af_unix: Don't use skb_recv_datagram() in unix_stream_read_skb().
unix_stream_read_skb() calls skb_recv_datagram() with MSG_DONTWAIT,
which is mostly equivalent to sock_error(sk) + skb_dequeue().
In the following patch, we will add a new field to cache the number
of bytes in the receive queue. Then, we want to avoid introducing
atomic ops in the fast path, so we will reuse the receive queue lock.
As a preparation for the change, let's not use skb_recv_datagram()
in unix_stream_read_skb().
Note that sock_error() is now moved out of the u->iolock mutex as
the mutex does not synchronise the peer's close() at all.
====================
eth: fbnic: Add firmware logging support
Firmware running on fbnic generates device logs. These logs contain useful
information about the device which may or may not be related to the host.
Logs are stored in a ring buffer and accessible through DebugFS.
====================
Lee Trager [Wed, 2 Jul 2025 19:12:11 +0000 (12:12 -0700)]
eth: fbnic: Enable firmware logging
The firmware log buffer is enabled during probe and freed during remove.
Early versions of firmware do not support sending logs. Once the mailbox is
up driver will enable logging when supported firmware versions are detected.
Logging is disabled before the mailbox is freed.
Lee Trager [Wed, 2 Jul 2025 19:12:10 +0000 (12:12 -0700)]
eth: fbnic: Add mailbox support for firmware logs
By default firmware will not send logs to the host. This must be explicitly
enabled by the driver. The mailbox has the concept of a flag which is a u32
used as a boolean. Lack of flag defaults to a value of false. When enabling
logging historical logs may be optionally requested. These are log messages
generated by the NIC before the driver was loaded. The driver also sends a
log version to support changing the logging format in the future.
[SEND_LOGS_REQ] = {
[SEND_LOGS] /* flag to request log reporting */
[SEND_LOGS_HISTORY] /* flag to request historical logs */
[SEND_LOGS_VERSION] /* u32 indicating the log format version */
}
Logs may be sent to the user either one at a time, or when historical logs
are requested in bulk. Firmware may not send more than 14 messages in bulk
to prevent flooding the mailbox.
[LOG_MSG] = {
[LOG_INDEX] /* entry 0 - u64 index of log */
[LOG_MSEC] /* entry 0 - u32 timestamp of log */
[LOG_MSG] /* entry 0 - char log message up to 256 */
[LOG_LENGTH] /* u32 of remaining log items in arrays */
[LOG_INDEX_ARRAY] = {
[LOG_INDEX] /* entry 1 - u64 index of log */
[LOG_INDEX] /* entry 2 - u64 index of log */
...
}
[LOG_MSEC_ARRAY] = {
[LOG_MSEC] /* entry 1 - u32 timestamp of log */
[LOG_MSEC] /* entry 2 - u32 timestamp of log */
...
}
[LOG_MSG_ARRAY] = {
[LOG_MSG] /* entry 1 - char log message up to 256 */
[LOG_MSG] /* entry 2 - char log message up to 256 */
...
}
}
Lee Trager [Wed, 2 Jul 2025 19:12:09 +0000 (12:12 -0700)]
eth: fbnic: Create ring buffer for firmware logs
When enabled, firmware may send logs messages which are specific to the
device and not the host. Create a ring buffer to store these messages
which are read by a user through DebugFS. Buffer access is protected by
a spinlock.
Lee Trager [Wed, 2 Jul 2025 19:12:08 +0000 (12:12 -0700)]
eth: fbnic: Use FIELD_PREP to generate minimum firmware version
Create a new macro based on FIELD_PREP to generate easily readable minimum
firmware version ints. This macro will prevent the mistake from the
previous patch from happening again.
Lee Trager [Wed, 2 Jul 2025 19:12:07 +0000 (12:12 -0700)]
eth: fbnic: Fix incorrect minimum firmware version
The full minimum version is 0.10.6-0. The six is now correctly defined as
patch and shifted appropriately. 0.10.6-0 is a preproduction version of
firmware which was released over a year and a half ago. All production
devices meet this requirement.
====================
net: migrate remaining drivers to dedicated _rxfh_context ops
Around a year ago Ed added dedicated ops for managing RSS contexts.
This significantly improved the clarity of the driver facing API.
Migrate the remaining 3 drivers and remove the old way of muxing
the RSS context operations via .set_rxfh().
Jakub Kicinski [Mon, 7 Jul 2025 18:41:13 +0000 (11:41 -0700)]
eth: mlx5: migrate to the *_rxfh_context ops
Convert mlx5 to dedicated RXFH ops. This is a fairly shallow
conversion, TBH, most of the driver code stays as is, but we
let the core allocate the context ID for the driver.
mlx5e_rx_res_rss_get_rxfh() and friends are made void, since
core only calls the driver for context 0. The second call
is right after context creation so it must exist (tm).
Jakub Kicinski [Mon, 7 Jul 2025 18:41:12 +0000 (11:41 -0700)]
eth: ice: drop the dead code related to rss_contexts
ICE appears to have some odd form of rss_context use plumbed
in for .get_rxfh. The .set_rxfh side does not support creating
contexts, however, so this must be dead code. For at least a year
now (since commit 7964e7884643 ("net: ethtool: use the tracking
array for get_rxfh on custom RSS contexts")) we have not been
calling .get_rxfh with a non-zero rss_context. We just get
the info from the RSS XArray under dev->ethtool.
Remove what must be dead code in the driver, clear the support flags.
Jakub Kicinski [Mon, 7 Jul 2025 18:41:11 +0000 (11:41 -0700)]
eth: otx2: migrate to the *_rxfh_context ops
otx2 only supports additional indirection tables (no separate keys
etc.) so the conversion to dedicated callbacks and core-allocated
context is mostly removing the code which stores the extra tables
in the driver. Core already stores the indirection tables for
additional contexts, and doesn't call .get for them.
One subtle change here is that we'll now start with the table
covering all queues, not directing all traffic to queue 0.
This is what core expects if the user doesn't pass the initial
indir table explicitly (there's a WARN_ON() in the core trying
to make sure driver authors don't forget to populate ctx to
defaults).
Drivers implementing .create_rxfh_context don't have to set
cap_rss_ctx_supported, so remove it.
Fengyuan Gong [Wed, 2 Jul 2025 16:07:41 +0000 (16:07 +0000)]
net: account for encap headers in qdisc pkt len
Refine qdisc_pkt_len_init to include headers up through
the inner transport header when computing header size
for encapsulations. Also refine net/sched/sch_cake.c
borrowed from qdisc_pkt_len_init().
Signed-off-by: Fengyuan Gong <gfengyuan@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250702160741.1204919-1-gfengyuan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the driver uses phylib to operate PHY by default.
On some boards, the PHY device is separated from the MAC device.
As a result, the hibmcge driver cannot operate the PHY device.
In this patch, the driver determines whether a PHY is available
based on register configuration. If no PHY is available,
the driver will use fixed_phy to register fake phydev.
====================
net: Remove unused function parameters in skbuff.c
Couple of cleanup patches to get rid of unused function parameters around
skbuff.c, plus little things spotted along the way.
Offshoot of my question in [1], but way more contained. Found by adding
"-Wunused-parameter -Wno-error" to KBUILD_CFLAGS and grepping for specific
skbuff.c warnings.
Michal Luczaj [Wed, 2 Jul 2025 13:38:12 +0000 (15:38 +0200)]
net: skbuff: Drop unused @skb
Since its introduction in commit 6fa01ccd8830 ("skbuff: Add pskb_extract()
helper function"), pskb_carve_frag_list() never used the argument @skb.
Drop it and adapt the only caller.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Luczaj [Wed, 2 Jul 2025 13:38:11 +0000 (15:38 +0200)]
net: skbuff: Drop unused @skb
Since its introduction in commit ce098da1497c ("skbuff: Introduce
slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove
it and adapt both callers.
No functional change intended.
Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michal Luczaj [Wed, 2 Jul 2025 13:38:08 +0000 (15:38 +0200)]
net: splice: Drop unused @gfp
Since its introduction in commit 2e910b95329c ("net: Add a function to
splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter()
never used the @gfp argument. Remove it and adapt callers.
net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"
Use the newly added of_reserved_mem_region_to_resource{_byname}()
functions to handle "memory-region" properties.
The error handling is a bit different for mtk_wed_mcu_load_firmware().
A failed match of the "memory-region-names" would skip the entry, but
then other errors in the lookup and retrieval of the address would not
skip the entry. However, that distinction is not really important.
Either the region is available and usable or it is not. So now, errors
from of_reserved_mem_region_to_resource() are ignored so the region is
simply skipped.
Hannes Reinecke [Tue, 1 Jul 2025 14:46:57 +0000 (16:46 +0200)]
net/handshake: Add new parameter 'HANDSHAKE_A_ACCEPT_KEYRING'
Add a new netlink parameter 'HANDSHAKE_A_ACCEPT_KEYRING' to provide
the serial number of the keyring to use.
Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250701144657.104401-1-hare@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet [Wed, 2 Jul 2025 07:12:30 +0000 (07:12 +0000)]
net/sched: acp_api: no longer acquire RTNL in tc_action_net_exit()
tc_action_net_exit() got an rtnl exclusion in commit a159d3c4b829 ("net_sched: acquire RTNL in tc_action_net_exit()")
Since then, commit 16af6067392c ("net: sched: implement reference
counted action release") made this RTNL exclusion obsolete for
most cases.
Only tcf_action_offload_del() might still require it.
Move the rtnl locking into tcf_idrinfo_destroy() when
an offload action is found.
Most netns do not have actions, yet deleting them is adding a lot
of pressure on RTNL, which is for many the most contended mutex
in the kernel.
We are moving to a per-netns 'rtnl', so tc_action_net_exit()
will not be able to grab 'rtnl' a single time for a batch of netns.
Before the patch:
perf probe -a rtnl_lock
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (25 samples) ]
After the patch:
perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.304 MB perf.data (9 samples) ]
====================
net: mctp: Add support for gateway routing
This series adds a gateway route type for the MCTP core, allowing
non-local EIDs as the match for a route.
Example setup using the mctp tools:
mctp route add 9 via mctpi2c0
mctp neigh add 9 dev mctpi2c0 lladdr 0x1d
mctp route add 10 gw 9
- will route packets to eid 10 through mctpi2c0, using a dest lladdr
of 0x1d (ie, that of the directly-attached eid 9).
The core change to support this is the introduction of a struct
mctp_dst, which represents the result of a route lookup. Since this
involves a bit of surgery through the routing code, we add a few tests
along the way.
We're introducing an ABI change in the new RTM_{NEW,GET,DEL}ROUTE
netlink formats, with the support for a RTA_GATEWAY attribute. Because
we need a network ID specified to fully-qualify a gateway EID, the
RTA_GATEWAY attribute carries the (net, eid) tuple in full:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
Of course, any questions, comments etc are most welcome.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
====================
Jeremy Kerr [Wed, 2 Jul 2025 06:20:14 +0000 (14:20 +0800)]
net: mctp: test: Add tests for gateway routes
Add a few kunit tests for the gateway routing. Because we have multiple
route types now (direct and gateway), rename mctp_test_create_route to
mctp_test_create_route_direct, and add a _gateway variant too.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:13 +0000 (14:20 +0800)]
net: mctp: add gateway routing support
This change allows for gateway routing, where a route table entry
may reference a routable endpoint (by network and EID), instead of
routing directly to a netdevice.
We add support for a RTM_GATEWAY attribute for netlink route updates,
with an attribute format of:
struct mctp_fq_addr {
unsigned int net;
mctp_eid_t eid;
}
- we need the net here to uniquely identify the target EID, as we no
longer have the device reference directly (which would provide the net
id in the case of direct routes).
This makes route lookups recursive, as a route lookup that returns a
gateway route must be resolved into a direct route (ie, to a device)
eventually. We provide a limit to the route lookups, to prevent infinite
loop routing.
The route lookup populates a new 'nexthop' field in the dst structure,
which now specifies the key for the neighbour table lookup on device
output, rather than using the packet destination address directly.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:12 +0000 (14:20 +0800)]
net: mctp: allow NL parsing directly into a struct mctp_route
The netlink route parsing functions end up setting a bunch of output
variables from the rt attributes. This will get messy when the routes
become more complex.
So, split the rt parsing into two types: a lookup (returning route
target data suitable for a route lookup, like when deleting a route) and
a populate (setting fields of a struct mctp_route).
In doing this, we need to separate the route allocation from
mctp_route_add, so add some comments on the lifetime semantics for the
latter.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:11 +0000 (14:20 +0800)]
net: mctp: remove routes by netid, not by device
In upcoming changes, a route may not have a device associated. Since the
route is matched on the (network, eid) tuple, pass the netid itself into
mctp_route_remove.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:09 +0000 (14:20 +0800)]
net: mctp: test: Add initial socket tests
Recent changes have modified the extaddr path a little, so add a couple
of kunit tests to af-mctp.c. These check that we're correctly passing
lladdr data between sendmsg/recvmsg and the routing layer.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:05 +0000 (14:20 +0800)]
net: mctp: test: Add an addressed device constructor
Upcoming tests will check semantics of hardware addressing, which
require a dev with ->addr_len != 0. Add a constructor to create a
MCTP interface using a physically-addressed bus type.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:04 +0000 (14:20 +0800)]
net: mctp: separate cb from direct-addressing routing
Now that we have the dst->haddr populated by sendmsg (when extended
addressing is in use), we no longer need to stash the link-layer address
in the skb->cb.
Instead, only use skb->cb for incoming lladdr data.
While we're at it: remove cb->src, as was never used.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:03 +0000 (14:20 +0800)]
net: mctp: separate routing database from routing operations
This change adds a struct mctp_dst, representing the result of a routing
lookup. This decouples the struct mctp_route from the actual
implementation of a routing operation.
This will allow for future routing changes which may require more
involved lookup logic, such as gateway routing - which may require
multiple traversals of the routing table.
Since we only use the struct mctp_route at lookup time, we no longer
hold routes over a routing operation, as we only need it to populate the
dst. However, we do hold the dev while the dst is active.
This requires some changes to the route test infrastructure, as we no
longer have a mock route to handle the route output operation, and
transient dsts are created by the routing code, so we can't override
them as easily.
Instead, we use kunit->priv to stash a packet queue, and a custom
dst_output function queues into that packet queue, which we can use for
later expectations.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:02 +0000 (14:20 +0800)]
net: mctp: test: make cloned_frag buffers more appropriately-sized
In our input_cloned_frag test, we currently allocate our test buffers
arbitrarily-sized at 100 bytes.
We only expect to receive a max of 15 bytes from the socket, so reduce
to a more appropriate size. There are some upcoming changes to the
routing code which hit a frame-size limit on s390, so reduce the usage
before that lands.
Jeremy Kerr [Wed, 2 Jul 2025 06:20:01 +0000 (14:20 +0800)]
net: mctp: don't use source cb data when forwarding, ensure pkt_type is set
In the output path, only check the skb->cb data when we know it's from
a local socket; input packets will have source address information there
instead.
In order to detect when we're forwarding, set skb->pkt_type on
input/output.
====================
add broadcast_neighbor for no-stacking networking arch
For no-stacking networking arch, and enable the bond mode 4(lacp) in
datacenter, the switch require arp/nd packets as session synchronization.
More details please see patch.
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Cc: Zengbing Tu <tuzengbing@didiglobal.com>
====================
Tonghao Zhang [Fri, 27 Jun 2025 13:49:30 +0000 (21:49 +0800)]
net: bonding: send peer notify when failure recovery
In LACP mode with broadcast_neighbor enabled, after LACP protocol
recovery, the port can transmit packets. However, if the bond port
doesn't send gratuitous ARP/ND packets to the switch, the switch
won't return packets through the current interface. This causes
traffic imbalance. To resolve this issue, when LACP protocol recovers,
send ARP/ND packets if broadcast_neighbor is enabled.
Cc: Jay Vosburgh <jv@jvosburgh.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Signed-off-by: Zengbing Tu <tuzengbing@didiglobal.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/3993652dc093fffa9504ce1c2448fb9dea31d2d2.1751031306.git.tonghao@bamaicloud.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tonghao Zhang [Fri, 27 Jun 2025 13:49:28 +0000 (21:49 +0800)]
net: bonding: add broadcast_neighbor option for 802.3ad
Stacking technology is a type of technology used to expand ports on
Ethernet switches. It is widely used as a common access method in
large-scale Internet data center architectures. Years of practice
have proved that stacking technology has advantages and disadvantages
in high-reliability network architecture scenarios. For instance,
in stacking networking arch, conventional switch system upgrades
require multiple stacked devices to restart at the same time.
Therefore, it is inevitable that the business will be interrupted
for a while. It is for this reason that "no-stacking" in data centers
has become a trend. Additionally, when the stacking link connecting
the switches fails or is abnormal, the stack will split. Although it is
not common, it still happens in actual operation. The problem is that
after the split, it is equivalent to two switches with the same
configuration appearing in the network, causing network configuration
conflicts and ultimately interrupting the services carried by the
stacking system.
To improve network stability, "non-stacking" solutions have been
increasingly adopted, particularly by public cloud providers and
tech companies like Alibaba, Tencent, and Didi. "non-stacking" is
a method of mimicing switch stacking that convinces a LACP peer,
bonding in this case, connected to a set of "non-stacked" switches
that all of its ports are connected to a single switch
(i.e., LACP aggregator), as if those switches were stacked. This
enables the LACP peer's ports to aggregate together, and requires
(a) special switch configuration, described in the linked article,
and (b) modifications to the bonding 802.3ad (LACP) mode to send
all ARP/ND packets across all ports of the active aggregator.
Note that, with multiple aggregators, the current broadcast mode
logic will send only packets to the selected aggregator(s).
This series optimizes ICM usage for unidirectional rules and
empty matchers and with the last patch we make hardware steering
the default FDB steering provider for NICs that don't support software
steering.
Hardware steering (HWS) uses a type of rule table container (RTC) that
is unidirectional, so matchers consist of two RTCs to accommodate
bidirectional rules.
This small series enables resizing the two RTCs independently by
tracking the number of rules separately. For extreme cases where all
rules are unidirectional, this results in saving close to half the
memory footprint.
Results for inserting 1M unidirectional rules using a simple module:
Pages Memory
Before this patch: 300k 1.5GiB
After this patch: 160k 900MiB
The 'Pages' column measures the number of 4KiB pages the device requests
for itself (the ICM).
The 'Memory' column is the difference between peak usage and baseline
usage (before starting the test) as reported by `free -h`.
In addition, second to last patch of the series handles a case where all
the matcher's rules were deleted: the large RTCs of the matcher are no
longer required, and we can save some more ICM by shrinking the matcher
to its initial size.
Finally the last patch makes hardware steering the default mode
when in swichdev for NICs that don't have software steering support.
====================
Add HW Steering (HWS) as a secondary option for device steering mode. If
the device does not support SW Steering (SWS), HW Steering will be used
as the default, provided it is supported. FW Steering will now be
selected as the default only if both HWS and SWS are unavailable.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-11-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matcher size is dynamic: it starts at initial size, and then it grows
through rehash as more and more rules are added to this matcher.
When rules are deleted, matcher's size is not decreased. Rehash
approach is greedy. The idea is: if the matcher got to a certain size
at some point, chances are - it will get to this size again, so it is
better to avoid costly rehash operations whenever possible.
However, when all the rules of the matcher are deleted, this should
be viewed as special case. If the matcher actually got to the point
where it has zero rules, it might be an indication that some usecase
from the past is no longer happening. This is where some ICM can be
freed.
This patch handles this case: when a number of rules in a matcher
goes down to zero, the matcher's tables are shrunk to the initial
size.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5: HWS, Rearrange to prevent forward declaration
As a preparation for the following patch that will add support
for shrinking empty matchers, rearrange the code to prevent
forward declaration of functions.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Track and grow matcher sizes individually for RX and TX RTCs. This
allows RX-only or TX-only use cases to effectively halve the device
resources they use.
For testing we used a simple module that inserts 1M RX-only rules and
measured the number of pages the device requests, and memory usage as
reported by `free -h`.
Pages Memory
Before this patch: 300k 1.5GiB
After this patch: 160k 900MiB
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kernel HWS only uses FDB tables and, as such, creates two lower level
containers (RTCs) for each matcher: one for RX and one for TX. Allow
these RTCs to differ in size by converting the size part of the matcher
attribute to a two element array.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matchers were using the pool abstraction solely as a convenience
to allocate two STE ranges. The pool's core functionality, that
of allocating individual items from the range, was unused.
Matchers rely either on the hardware to hash rules into a table,
or on a user-provided index.
Remove the STE pool from the matcher and allocate the STE ranges
manually instead.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 8 Jul 2025 02:06:13 +0000 (19:06 -0700)]
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-07-03
Vladimir Oltean converts Intel drivers (ice, igc, igb, ixgbe, i40e) to
utilize new timestamping API (ndo_hwtstamp_get() and ndo_hwtstamp_set()).
For ixgbe:
Paul, Don, Slawomir, and Radoslaw add Malicious Driver Detection (MDD)
support for X550 and E610 devices to detect, report, and handle
potentially malicious VFs.
Simon Horman corrects spelling mistakes.
For igbvf:
Kohei Enju removes a couple of unreported counters and adds reporting
of Tx timeouts.
* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
igbvf: add tx_timeout_count to ethtool statistics
igbvf: remove unused interrupt counter fields from struct igbvf_adapter
ixgbe: spelling corrections
ixgbe: turn off MDD while modifying SRRCTL
ixgbe: add Tx hang detection unhandled MDD
ixgbe: check for MDD events
ixgbe: add MDD support
i40e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
ixgbe: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igb: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
igc: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
ice: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
====================
====================
net: phylink: support !autoneg configuration for SFPs
This series comes from discussion during a patch series that was posted
at the beginning of April, but these patches were never posted (I was
too busy!)
We restrict ->sfp_interfaces to those that the host system supports,
and ensure that ->sfp_interfaces is cleared when a SFP is removed. We
then add phylink_sfp_select_interface_speed() which will select an
appropriate interface from ->sfp_interfaces for the speed, and use that
in our phylink_ethtool_ksettings_set() when a SFP bus is present on a
directly connected host (not with a PHY.)
====================
Add phylink_sfp_select_interface_speed() which attempts to select the
SFP interface based on the ethtool speed when autoneg is turned off.
This allows users to turn off autoneg for SFPs that support multiple
interface modes, and have an appropriate interface mode selected.
Russell King (Oracle) [Wed, 2 Jul 2025 09:44:29 +0000 (10:44 +0100)]
net: phylink: clear SFP interfaces when not in use
Clear the SFP interfaces bitmap when we're not using it - in other
words, when a module is unplugged, or we're using a PHY on the
module.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1uWu0z-005KXi-EM@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King (Oracle) [Wed, 2 Jul 2025 09:44:24 +0000 (10:44 +0100)]
net: phylink: restrict SFP interfaces to those that are supported
When configuring an optical SFP interface, restrict the bitmap of SFP
interfaces (pl->sfp_interfaces) to those that are supported by the
host, rather than calculating this in a local variable.
This will allow us to avoid recomputing this in the
phylink_ethtool_ksettings_set() path.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1uWu0u-005KXc-A4@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch series introduces the Ethernet driver for Broadcom’s
BCM5770X chip family, which supports 50/100/200/400/800 Gbps
link speeds. The driver is built as the bng_en.ko kernel module.
To keep the series within a reviewable size (~5K lines of code), this initial
submission focuses on the core infrastructure and initialization, including:
1) PCIe support (device IDs, probe/remove)
2) Devlink support
3) Firmware communication mechanism
4) Creation of network device
5) PF Resource management (rings, IRQs, etc. for netdev & aux dev)
Support for Tx/Rx datapaths, link management, ethtool/devlink operations
and additional features will be introduced in the subsequent patch series.
The bng_en driver shares the bnxt_hsi.h file with the bnxt_en driver,
as the bng_en driver leverages the hardware communication protocol
used by the bnxt_en driver.
====================
Add a network device with netdev features enabled.
Some features are enabled based on the capabilities
advertised by the firmware. Add the skeleton of minimal
netdev operations. Additionally, initialize the parameters
for rings (TX/RX/Completion).
Query resources from the firmware and, based on the
availability of resources, initialize the default
settings. The default settings include:
1. Rings and other resource reservations with the
firmware. This ensures that resources are reserved
before network and auxiliary devices are created.
2. Mapping the BAR, which helps access doorbells since
its size is known after querying the firmware.
3. Retrieving the TCs and hardware CoS queue mappings.
Get the resources and capabilities from the firmware.
Add functions to manage the resources with the firmware.
These functions will help netdev reserve the resources
with the firmware before registering the device in future
patches. The resources and their information, such as
the maximum available and reserved, are part of the members
present in the bnge_hw_resc struct.
The bnge_reserve_rings() function also populates
the RSS table entries once the RX rings are reserved with
the firmware.
Backing store or context memory on the host helps the
device to manage rings, stats and other resources.
Context memory is allocated with the help of ring
alloc/free functions.
Add ring allocation/free mechanism which help
to allocate rings (TX/RX/Completion) and backing
stores memory on the host for the device.
Future patches will use these functions.
Query firmware with the help of basic firmware commands and
cache the capabilities. With the help of basic commands
start the initialization process of the driver with the
firmware.
Since basic information is available from the firmware,
register with devlink.
Add support to communicate with the firmware.
Future patches will use these functions to send the
messages to the firmware.
Functions support allocating request/response buffers
to send a particular command. Each command has certain
timeout value to which the driver waits for response from
the firmware. In error case, commands may be either timed
out waiting on response from the firmware or may return
a specific error code.
Allocate a base device and devlink interface with minimal
devlink ops.
Add dsn and board related information.
Map PCIe BAR (bar0), which helps to communicate with the
firmware.
====================
netpoll: Factor out functions from netpoll_send_udp() and add ipv6 selftest
Refactors the netpoll UDP transmit path to improve code clarity,
maintainability, and protocol-layer encapsulation.
Function netpoll_send_udp() has more than 100 LoC, which is hard to
understand and review. After this patchset, it has only 32 LoC, which is
more manageable.
The series systematically moves the construction of protocol headers
(UDP, IPv4, IPv6, Ethernet) out of the core `netpoll_send_udp()`
function into dedicated static helpers:
- `push_udp()` for UDP header setup
- `push_ipv4()` and `push_ipv6()` for IP header setup
- `push_eth()` for Ethernet header setup
This results in a clean, layered abstraction that mirrors the protocol
stack, reduces code duplication, and improves readability.
Also, to make sure this is not breaking anything, add IPv6 selftest to
netconsole tests, which will exercise this code. This test would also pick
problems similiar to the one fixed by f599020702698 ("net: netpoll:
Initialize UDP checksum field before checksumming"), which was
embarrassin we didn't have a selftest catch it.
Anyway, there are **no functional changes** intended in this patchset.
selftests: net: Add IPv6 support to netconsole basic tests
Add IPv6 support to the netconsole basic functionality tests by:
- Introducing separate IPv4 and IPv6 address variables (SRCIP4/SRCIP6,
DSTIP4/DSTIP6) to replace the single SRCIP/DSTIP variables
- Adding select_ipv4_or_ipv6() function to choose protocol version
- Updating socat configuration to use UDP6-LISTEN for IPv6 tests
- Adding wait_for_port() wrapper to handle protocol-specific port waiting
- Expanding test matrix to run both basic and extended formats against
both IPv4 and IPv6 protocols
- Improving cleanup to kill any remaining socat processes
- Adding sleep delays for better IPv6 packet handling reliability
The test now validates netconsole functionality across both IP versions,
improving test coverage for dual-stack network environments.
This test would avoid the regression fixed by commit f59902070269 ("net:
netpoll: Initialize UDP checksum field before checksumming")
netpoll: factor out UDP header setup into push_udp() helper
Move UDP header construction from netpoll_send_udp() into a new
static helper function push_udp(). This completes the protocol
layer refactoring by:
1. Creating a dedicated helper for UDP header assembly
2. Removing UDP-specific logic from the main send function
3. Establishing a consistent pattern with existing IPv4/IPv6 helpers:
- push_udp()
- push_ipv4()
- push_ipv6()
The change improves code organization and maintains the encapsulation
pattern established in previous refactorings.
netpoll: factor out IPv4 header setup into push_ipv4() helper
Move IPv4 header construction from netpoll_send_udp() into a new
static helper function push_ipv4(). This completes the refactoring
started with IPv6 header handling, creating symmetric helper functions
for both IP versions.
Changes include:
1. Extracting IPv4 header setup logic into push_ipv4()
2. Replacing inline IPv4 code with helper call
3. Moving eth assignment after helper calls for consistency
The refactoring reduces code duplication and improves maintainability
by isolating IP version-specific logic.
netpoll: factor out IPv6 header setup into push_ipv6() helper
Move IPv6 header construction from netpoll_send_udp() into a new
static helper function, push_ipv6(). This refactoring reduces code
duplication and improves readability in netpoll_send_udp().